L'Atelier — Réseau francophone IA (produit, tech, business)

Comparatif des frameworks d'inférence LLM

Quatre frameworks dominent l'inférence de modèles open source. Chacun a ses forces et cas d'usage optimaux.

Les contenders

vLLM

Créateur : UC Berkeley
Focus : Débit maximum en production
Innovation clé : PagedAttention
Langage : Python + CUDA

TGI (Text Generation Inference)

Créateur : Hugging Face
Focus : Intégration écosystème HF
Innovation clé : Flash Attention, watermarking
Langage : Rust + Python

llama.cpp

Créateur : Georgi Gerganov (communauté)
Focus : Portabilité, CPU inference
Innovation clé : Format GGUF, quantization avancée
Langage : C/C++

Ollama

Créateur : Ollama Inc.
Focus : Simplicité d'utilisation
Innovation clé : UX développeur, Modelfiles
Langage : Go (wraps llama.cpp)

Comparaison détaillée

Performance (tokens/seconde)

Sur Llama 3.3 70B, 1x A100 80 Go : - vLLM : ~150 t/s (batch), ~40 t/s (single) - TGI : ~130 t/s (batch), ~35 t/s (single) - llama.cpp : ~30 t/s (GPU), ~10 t/s (CPU) - Ollama : ~25 t/s (GPU, basé sur llama.cpp)

Facilité d'installation

Ollama : 1 commande, zéro config
vLLM : pip install + quelques paramètres
TGI : Docker recommandé
llama.cpp : Compilation nécessaire (cmake)

Support matériel

llama.cpp / Ollama : CPU, NVIDIA, AMD, Apple Metal, Vulkan
vLLM : NVIDIA (principal), AMD (expérimental)
TGI : NVIDIA principalement

API

vLLM : OpenAI-compatible
TGI : API custom + OpenAI-compatible
Ollama : API custom + OpenAI-compatible
llama.cpp : server mode avec API custom

Quantization

llama.cpp / Ollama : GGUF (Q2 à Q8, K-quants)
vLLM : AWQ, GPTQ, FP8
TGI : GPTQ, AWQ, EETQ

Guide de choix

Choisir vLLM si

Production avec beaucoup de requêtes simultanées
Débit maximum requis
GPU NVIDIA disponibles
API OpenAI-compatible nécessaire

Choisir TGI si

Déjà dans l'écosystème Hugging Face
Besoin de watermarking des sorties
Déploiement via Inference Endpoints HF

Choisir llama.cpp si

Inférence sur CPU ou hardware exotique
Contrôle total et personnalisation
Intégration C/C++ native
Quantization maximale requise

Choisir Ollama si

Développement local et prototypage
Équipe non-ML qui veut tester des LLMs
Besoin d'une solution qui marche en 2 minutes
Interface avec Open WebUI

Combinaisons courantes

Dev : Ollama local + Open WebUI
Staging : vLLM sur GPU cloud
Production : vLLM + load balancer + monitoring

Comparatif des frameworks d'inférence