Introduction
vLLM is an open-source inference engine for large language models (LLMs), designed for exceptional speed and efficiency. Launched in 2023, it uses PagedAttention to optimize GPU memory, delivering up to 4x more throughput than alternatives like Hugging Face Transformers. In 2026, with the rise of AI agents and real-time apps, vLLM is essential for developers deploying models locally without sacrificing performance.
Why use it? Imagine serving a model like Llama 3 at 100 requests/second on a single GPU, versus 20-30 with standard tools. This beginner tutorial takes you from installation to deployment with complete, working code. By the end, you'll launch an OpenAI-compatible server and query LLMs locally. Ideal for RAG prototypes, chatbots, or scalable AI APIs.
Prerequisites
- Python 3.9+ installed
- NVIDIA GPU with CUDA 11.8+ (or CPU for testing, but slow)
- 16 GB RAM minimum, 8 GB VRAM for small models
- Free Hugging Face account to download models
- pip and git installed
- Basic familiarity with the command line and Python
Installing vLLM
pip install vllm==0.6.1.post1
# Verification
python -c "from vllm import LLM; print('vLLM installed successfully!')"
# Optional: for Ampere+ GPUs (A100, RTX 40xx)
pip install vllm[flashinfer]
# Install torch with CUDA if not already done
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
This command installs vLLM and its dependencies. The pinned version avoids breaking changes between releases. The flashinfer extra boosts performance on recent GPUs. Run the verification right away to confirm CUDA is detected; otherwise vLLM falls back to slow CPU inference.
Downloading a Test Model
Start with a small model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 (1.1B parameters, fast on an 8 GB GPU). vLLM supports most popular Hugging Face architectures (Llama, Mistral, Qwen, and many more). Gated models like Llama require a token; run huggingface-cli login to authenticate first.
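Once downloaded, the model lands in the Hugging Face cache on disk. A small sketch of the standard huggingface_hub cache layout (the "models--{org}--{name}" convention; the default HF_HOME below is an assumption, it can be overridden by environment variables):

```python
# Sketch: where a Hugging Face Hub download is cached locally.
# The "models--{org}--{name}" folder name is the standard
# huggingface_hub convention; the default cache root below can be
# overridden via HF_HOME or HF_HUB_CACHE.
from pathlib import Path

def hf_cache_dir(repo_id: str, hf_home: str = "~/.cache/huggingface") -> Path:
    """Return the local cache directory used for a Hub repo."""
    folder = "models--" + repo_id.replace("/", "--")
    return Path(hf_home).expanduser() / "hub" / folder

print(hf_cache_dir("TinyLlama/TinyLlama-1.1B-Chat-v1.0"))
```

Handy when you need to check how much disk space a model occupies or to clear old downloads.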
Launching the OpenAI-Compatible Server
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--dtype auto \
--trust-remote-code
# The server is ready! Access http://localhost:8000/docs for the Swagger UI
This command starts an OpenAI-compatible HTTP server on port 8000. --host 0.0.0.0 enables remote access. --tensor-parallel-size 1 is for a single GPU. The server auto-loads the model and exposes /v1/chat/completions.
Testing the Server with curl
vLLM emulates the OpenAI API: endpoints like /v1/models, /v1/completions, /v1/chat/completions. Use curl or Swagger to validate. Think of it like the ChatGPT API, but local and free.
curl Request to the Server
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain vLLM in 3 words."}
],
"temperature": 0.7,
"max_tokens": 128
}'
This curl command sends a chat prompt and receives a JSON response. max_tokens limits generation length. Copy-paste it directly; tweak temperature for creativity (0 = deterministic, higher = more random).
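The JSON reply follows the OpenAI chat-completion schema. A minimal parsing sketch, assuming a reply trimmed to the fields shown below (real responses also carry id, created, and usage statistics):

```python
import json

# Example response body, trimmed to the fields used below; the
# content string is illustrative, not a real model output.
raw = '''
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Fast LLM serving."},
     "finish_reason": "stop"}
  ]
}
'''
data = json.loads(raw)
# The generated text lives in choices[0].message.content.
answer = data["choices"][0]["message"]["content"]
print(answer)  # Fast LLM serving.
```

Check finish_reason too: "length" means the reply was cut off by max_tokens, "stop" means the model finished naturally.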
Python Client with openai Library
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="token-fake" # vLLM ignores the key
)
response = client.chat.completions.create(
model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
messages=[
{"role": "system", "content": "You are a Python expert."},
{"role": "user", "content": "Write a fizzbuzz function."}
],
temperature=0.1,
max_tokens=200
)
print(response.choices[0].message.content)
This script uses the official openai library to query vLLM with no code changes. base_url points to the local server, which makes migrating from the OpenAI API to a local deployment trivial. Run pip install openai first.
Standalone Usage Without a Server
For batch scripts or notebooks, use vLLM's LLM class directly. No server needed—synchronous or asynchronous inference. Great for data processing or fine-tuning evaluation.
Standalone Inference with LLM Class
from vllm import LLM, SamplingParams
llm = LLM(
model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
tensor_parallel_size=1,
dtype="auto"
)
prompts = [
"<|system|>\nYou are a helpful assistant.</s>\n<|user|>\nHello! How are you?</s>\n<|assistant|>\n",
"<|system|>\nYou are a helpful assistant.</s>\n<|user|>\n2+2=?</s>\n<|assistant|>\n"
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated: {generated_text!r}")
Loads the model into memory and generates completions for multiple prompts in one batch. SamplingParams controls decoding (top_p is nucleus sampling: it trims low-probability tokens). The prompts use TinyLlama's Zephyr-style chat template. Efficient for thousands of inferences.
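Hand-writing special tokens is error-prone. A sketch of a helper that renders OpenAI-style messages into the Zephyr-style layout TinyLlama-Chat uses; in real code, prefer tokenizer.apply_chat_template from the transformers library, which reads the template shipped with the model:

```python
# Sketch: render OpenAI-style messages into the Zephyr-style template
# used by TinyLlama-1.1B-Chat. Hand-rolled only to make the token
# layout explicit; prefer tokenizer.apply_chat_template in practice.
def to_tinyllama_prompt(messages: list[dict]) -> str:
    parts = []
    for m in messages:
        # Each turn: <|role|> header, content, </s> end-of-turn token.
        parts.append(f"<|{m['role']}|>\n{m['content']}</s>\n")
    parts.append("<|assistant|>\n")  # cue the model to answer
    return "".join(parts)

prompt = to_tinyllama_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "2+2=?"},
])
print(prompt)
```

The returned string can be passed straight into llm.generate() above.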
Advanced Configuration with Quantization
vllm serve TheBloke/TinyLlama-1.1B-Chat-v1.0-AWQ \
--quantization awq \
--host 0.0.0.0 \
--port 8001 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--trust-remote-code
# Serve under a friendlier alias for API clients
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --served-model-name tinyllama-chat
AWQ (4-bit) quantization cuts weight memory roughly 4x versus fp16 with minimal quality loss, but --quantization awq requires a checkpoint that was already quantized with AWQ (e.g., a Hub model whose name ends in -AWQ). --max-model-len caps the context length. --gpu-memory-utilization 0.9 lets vLLM claim 90% of GPU memory for weights and KV cache. --served-model-name sets the model name clients use in API calls.
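The memory savings from quantization are easy to estimate. A back-of-envelope sketch of weight memory at different precisions (real usage adds KV cache, activations, and runtime overhead, so treat these as lower bounds):

```python
# Back-of-envelope weight-memory estimate: params * bits / 8 bytes,
# converted to GiB. Real VRAM usage adds KV cache, activations, and
# runtime overhead, so these are lower bounds.
def weight_gib(n_params: float, bits_per_param: int) -> float:
    return n_params * bits_per_param / 8 / 2**30

tinyllama = 1.1e9  # parameters
print(f"fp16: {weight_gib(tinyllama, 16):.2f} GiB")  # ~2.05 GiB
print(f"awq4: {weight_gib(tinyllama, 4):.2f} GiB")   # ~0.51 GiB
```

The same arithmetic explains why a 7B model at fp16 (~13 GiB of weights alone) is tight on a 16 GB card, while its 4-bit AWQ variant fits comfortably.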
Docker Deployment
FROM vllm/vllm-openai:latest
WORKDIR /app
COPY entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]
# Build: docker build -t vllm-server .
# Run: docker run --gpus all -p 8000:8000 --env HF_TOKEN=your_token vllm-server \
#        vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host 0.0.0.0
Official vLLM image for production. entrypoint.sh lets you pass dynamic arguments at run time. --gpus all exposes the host GPUs to Docker (requires the NVIDIA Container Toolkit). Scales on Kubernetes.
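The Dockerfile above copies an entrypoint.sh it never shows. A minimal hypothetical sketch, which simply executes whatever command the container was started with (such as the vllm serve invocation above):

```shell
#!/bin/sh
# entrypoint.sh - minimal passthrough entrypoint (illustrative).
# exec replaces the shell with the given command so signals
# (SIGTERM on docker stop) reach the server process directly.
set -e
exec "$@"
```

Extend it with environment checks or default arguments as your deployment needs grow.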
Best Practices
- Batch requests: Send 10-100 prompts in parallel for maximum throughput.
- Always quantize: Use AWQ/GPTQ for >7B models on <24 GB VRAM.
- Monitor GPU: Run nvidia-smi -l 1 during serving; tweak --gpu-memory-utilization accordingly.
- Prompt engineering: Use each model's native chat format (e.g., [INST] for Llama 2/Mistral).
- KV cache: Enable --enable-prefix-caching to reuse shared prompt prefixes in RAG.
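The batching advice can be sketched as a small helper that splits a prompt list into fixed-size chunks before sending them (the helper name and batch size are illustrative, not a vLLM API):

```python
# Sketch: split prompts into fixed-size batches. Submitting each
# batch as one call lets vLLM's continuous batching keep the GPU
# full. Helper name and batch size are illustrative, not a vLLM API.
def chunk(prompts: list[str], batch_size: int = 32) -> list[list[str]]:
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

batches = chunk([f"Question {i}" for i in range(100)], batch_size=32)
print([len(b) for b in batches])  # [32, 32, 32, 4]
```

Each inner list can then be passed to llm.generate() or fanned out as concurrent API requests.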
Common Errors to Avoid
- GPU OOM: Reduce --max-model-len or quantize; don't load a 70B model without A100-class hardware.
- Gated model: Run huggingface-cli login before serving, or the download fails (sometimes surfacing as a misleading OOM).
- Port in use: Kill the occupying process with lsof -ti:8000 | xargs kill -9.
- CUDA mismatch: Check torch.cuda.is_available(); if it returns False, reinstall torch from the cu121 index.
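For the "port in use" error, a quick stdlib check before launching can save a confusing startup failure (illustrative helper, not part of vLLM):

```python
import socket

# Sketch: check whether a TCP port on this host is already bound.
# Illustrative helper, not part of vLLM.
def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 when something accepted the connection.
        return s.connect_ex((host, port)) == 0

if port_in_use(8000):
    print("Port 8000 busy - pick another port or kill the old server.")
```

Run it before vllm serve, or wire it into a launcher script that picks the next free port automatically.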
Next Steps
- Official docs: vLLM GitHub
- Benchmarks: Compare with TensorRT-LLM on HuggingFace Spaces
- Advanced: Integrate Ray Serve for multi-GPU.
- Training: Check our AI training courses at Learni to master LLM ops in production.