Introduction
vLLM is an open-source inference engine for large language models (LLMs), designed for exceptional speed and efficiency. Launched in 2023, it uses PagedAttention to optimize GPU memory, delivering up to 4x more throughput than alternatives like Hugging Face Transformers. In 2026, with the rise of AI agents and real-time apps, vLLM is essential for developers deploying models locally without sacrificing performance.
Why use it? Imagine serving a model like Llama 3 at 100 requests/second on a single GPU, versus 20-30 with standard tools. This beginner tutorial takes you from installation to deployment with complete, working code. By the end, you'll launch an OpenAI-compatible server and query LLMs locally. Ideal for RAG prototypes, chatbots, or scalable AI APIs.
Prerequisites
- Python 3.9+ installed
- NVIDIA GPU with CUDA 11.8+ (or CPU for testing, but slow)
- 16 GB RAM minimum, 8 GB VRAM for small models
- Free Hugging Face account to download models
- pip and git installed
- Basic familiarity with the command line and Python
Installing vLLM
pip install vllm==0.6.1.post1
# Verification
python -c "from vllm import LLM; print('vLLM installed successfully!')"
# Optional: for Ampere+ GPUs (A100, RTX 40xx)
pip install vllm[flashinfer]
# Install torch with CUDA if not already done
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
This command installs vLLM and its dependencies. The pinned version avoids breaking changes between releases. The flashinfer extra boosts performance on recent GPUs. Run the verification right away to confirm CUDA is detected; otherwise vLLM falls back to slow CPU inference.
Downloading a Test Model
Start with a small model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 (1.1B parameters, fast on an 8 GB GPU). vLLM supports most popular Hugging Face architectures (Llama, Mistral, Qwen, and many more). Gated models like Llama require a token; run huggingface-cli login to authenticate first.
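Once downloaded, the model lands in the Hugging Face cache on disk. A small sketch of the standard huggingface_hub cache layout (the "models--{org}--{name}" convention; the default HF_HOME below is an assumption, it can be overridden by environment variables):

```python
# Sketch: where a Hugging Face Hub download is cached locally.
# The "models--{org}--{name}" folder name is the standard
# huggingface_hub convention; the default cache root below can be
# overridden via HF_HOME or HF_HUB_CACHE.
from pathlib import Path

def hf_cache_dir(repo_id: str, hf_home: str = "~/.cache/huggingface") -> Path:
    """Return the local cache directory used for a Hub repo."""
    folder = "models--" + repo_id.replace("/", "--")
    return Path(hf_home).expanduser() / "hub" / folder

print(hf_cache_dir("TinyLlama/TinyLlama-1.1B-Chat-v1.0"))
```

Handy when you need to check how much disk space a model occupies or to clear old downloads.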
Launching the OpenAI-Compatible Server
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--dtype auto \
--trust-remote-code
# The server is ready! Access http://localhost:8000/docs for the Swagger UI
This command starts an OpenAI-compatible HTTP server on port 8000. --host 0.0.0.0 enables remote access. --tensor-parallel-size 1 is for a single GPU. The server auto-loads the model and exposes /v1/chat/completions.
Testing the Server with curl
vLLM emulates the OpenAI API: endpoints like /v1/models, /v1/completions, /v1/chat/completions. Use curl or Swagger to validate. Think of it like the ChatGPT API, but local and free.
curl Request to the Server
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain vLLM in 3 words."}
],
"temperature": 0.7,
"max_tokens": 128
}'
This curl command sends a chat prompt and receives a JSON response. max_tokens limits generation length. Copy-paste it directly; tweak temperature for creativity (0 = deterministic, higher = more random).
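The JSON reply follows the OpenAI chat-completion schema. A minimal parsing sketch, assuming a reply trimmed to the fields shown below (real responses also carry id, created, and usage statistics):

```python
import json

# Example response body, trimmed to the fields used below; the
# content string is illustrative, not a real model output.
raw = '''
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Fast LLM serving."},
     "finish_reason": "stop"}
  ]
}
'''
data = json.loads(raw)
# The generated text lives in choices[0].message.content.
answer = data["choices"][0]["message"]["content"]
print(answer)  # Fast LLM serving.
```

Check finish_reason too: "length" means the reply was cut off by max_tokens, "stop" means the model finished naturally.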
Python Client with openai Library
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="token-fake" # vLLM ignores the key
)
response = client.chat.completions.create(
model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
messages=[
{"role": "system", "content": "You are a Python expert."},
{"role": "user", "content": "Write a fizzbuzz function."}
],
temperature=0.1,
max_tokens=200
)
print(response.choices[0].message.content)
This script uses the official openai library to query vLLM with no code changes. base_url points to the local server, which makes migrating from the OpenAI API to a local deployment trivial. Run pip install openai first.
Standalone Usage Without a Server
For batch scripts or notebooks, use vLLM's LLM class directly. No server needed—synchronous or asynchronous inference. Great for data processing or fine-tuning evaluation.
Standalone Inference with LLM Class
from vllm import LLM, SamplingParams
llm = LLM(
model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
tensor_parallel_size=1,
dtype="auto"
)
prompts = [
"<|system|>\nYou are a helpful assistant.</s>\n<|user|>\nHello! How are you?</s>\n<|assistant|>\n",
"<|system|>\nYou are a helpful assistant.</s>\n<|user|>\n2+2=?</s>\n<|assistant|>\n"
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated: {generated_text!r}")
Loads the model into memory and generates completions for multiple prompts in one batch. SamplingParams controls decoding (top_p is nucleus sampling: it trims low-probability tokens). The prompts use TinyLlama's Zephyr-style chat template. Efficient for thousands of inferences.
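Hand-writing special tokens is error-prone. A sketch of a helper that renders OpenAI-style messages into the Zephyr-style layout TinyLlama-Chat uses; in real code, prefer tokenizer.apply_chat_template from the transformers library, which reads the template shipped with the model:

```python
# Sketch: render OpenAI-style messages into the Zephyr-style template
# used by TinyLlama-1.1B-Chat. Hand-rolled only to make the token
# layout explicit; prefer tokenizer.apply_chat_template in practice.
def to_tinyllama_prompt(messages: list[dict]) -> str:
    parts = []
    for m in messages:
        # Each turn: <|role|> header, content, </s> end-of-turn token.
        parts.append(f"<|{m['role']}|>\n{m['content']}</s>\n")
    parts.append("<|assistant|>\n")  # cue the model to answer
    return "".join(parts)

prompt = to_tinyllama_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "2+2=?"},
])
print(prompt)
```

The returned string can be passed straight into llm.generate() above.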
Advanced Configuration with Quantization
vllm serve TheBloke/TinyLlama-1.1B-Chat-v1.0-AWQ \
--quantization awq \
--host 0.0.0.0 \
--port 8001 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--trust-remote-code
# Serve under a friendlier alias for API clients
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --served-model-name tinyllama-chat
AWQ (4-bit) quantization cuts weight memory roughly 4x versus fp16 with minimal quality loss, but --quantization awq requires a checkpoint that was already quantized with AWQ (e.g., a Hub model whose name ends in -AWQ). --max-model-len caps the context length. --gpu-memory-utilization 0.9 lets vLLM claim 90% of GPU memory for weights and KV cache. --served-model-name sets the model name clients use in API calls.
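The memory savings from quantization are easy to estimate. A back-of-envelope sketch of weight memory at different precisions (real usage adds KV cache, activations, and runtime overhead, so treat these as lower bounds):

```python
# Back-of-envelope weight-memory estimate: params * bits / 8 bytes,
# converted to GiB. Real VRAM usage adds KV cache, activations, and
# runtime overhead, so these are lower bounds.
def weight_gib(n_params: float, bits_per_param: int) -> float:
    return n_params * bits_per_param / 8 / 2**30

tinyllama = 1.1e9  # parameters
print(f"fp16: {weight_gib(tinyllama, 16):.2f} GiB")  # ~2.05 GiB
print(f"awq4: {weight_gib(tinyllama, 4):.2f} GiB")   # ~0.51 GiB
```

The same arithmetic explains why a 7B model at fp16 (~13 GiB of weights alone) is tight on a 16 GB card, while its 4-bit AWQ variant fits comfortably.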
Docker Deployment
FROM vllm/vllm-openai:latest
WORKDIR /app
COPY entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]
# Build: docker build -t vllm-server .
# Run: docker run --gpus all -p 8000:8000 --env HF_TOKEN=your_token vllm-server \
#        vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host 0.0.0.0
Official vLLM image for production. entrypoint.sh lets you pass dynamic arguments at run time. --gpus all exposes the host GPUs to Docker (requires the NVIDIA Container Toolkit). Scales on Kubernetes.
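The Dockerfile above copies an entrypoint.sh it never shows. A minimal hypothetical sketch, which simply executes whatever command the container was started with (such as the vllm serve invocation above):

```shell
#!/bin/sh
# entrypoint.sh - minimal passthrough entrypoint (illustrative).
# exec replaces the shell with the given command so signals
# (SIGTERM on docker stop) reach the server process directly.
set -e
exec "$@"
```

Extend it with environment checks or default arguments as your deployment needs grow.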
Best Practices
- Batch requests: Send 10-100 prompts in parallel for maximum throughput.
- Always quantize: Use AWQ/GPTQ for >7B models on <24 GB VRAM.
- Monitor GPU: Run nvidia-smi -l 1 during serving; tweak --gpu-memory-utilization accordingly.
- Prompt engineering: Use each model's native chat format (e.g., [INST] for Llama 2/Mistral).
- KV cache: Enable --enable-prefix-caching to reuse shared prompt prefixes in RAG.
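The batching advice can be sketched as a small helper that splits a prompt list into fixed-size chunks before sending them (the helper name and batch size are illustrative, not a vLLM API):

```python
# Sketch: split prompts into fixed-size batches. Submitting each
# batch as one call lets vLLM's continuous batching keep the GPU
# full. Helper name and batch size are illustrative, not a vLLM API.
def chunk(prompts: list[str], batch_size: int = 32) -> list[list[str]]:
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

batches = chunk([f"Question {i}" for i in range(100)], batch_size=32)
print([len(b) for b in batches])  # [32, 32, 32, 4]
```

Each inner list can then be passed to llm.generate() or fanned out as concurrent API requests.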
Common Errors to Avoid
- GPU OOM: Reduce --max-model-len or quantize; don't load a 70B model without A100-class hardware.
- Gated model: Run huggingface-cli login before serving, or the download fails (sometimes surfacing as a misleading OOM).
- Port in use: Kill the occupying process with lsof -ti:8000 | xargs kill -9.
- CUDA mismatch: Check torch.cuda.is_available(); if it returns False, reinstall torch from the cu121 index.
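For the "port in use" error, a quick stdlib check before launching can save a confusing startup failure (illustrative helper, not part of vLLM):

```python
import socket

# Sketch: check whether a TCP port on this host is already bound.
# Illustrative helper, not part of vLLM.
def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 when something accepted the connection.
        return s.connect_ex((host, port)) == 0

if port_in_use(8000):
    print("Port 8000 busy - pick another port or kill the old server.")
```

Run it before vllm serve, or wire it into a launcher script that picks the next free port automatically.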
Next Steps
- Official docs: vLLM GitHub
- Benchmarks: Compare with TensorRT-LLM on HuggingFace Spaces
- Advanced: Integrate Ray Serve for multi-GPU.
- Training: Check our AI training courses at Learni to master LLM ops in production.