vLLM

High-performance production inference engine: PagedAttention memory management, 2-4× throughput over prior serving systems

An LLM serving engine built for production (73K+ ⭐ on GitHub). Its PagedAttention technique manages the GPU KV cache in fixed-size blocks, reducing memory fragmentation and achieving 2-4× the throughput of FasterTransformer and Orca at comparable latency. Supports continuous batching and tensor parallelism.
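A minimal offline-inference sketch using vLLM's Python API; the model name and sampling settings here are illustrative, not recommendations:

```python
from vllm import LLM, SamplingParams

# Load a model; vLLM allocates the KV cache in paged blocks (PagedAttention).
llm = LLM(model="facebook/opt-125m")  # illustrative small model

# Sampling settings chosen for the example only.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Continuous batching schedules prompts dynamically across decode steps;
# generate() returns one RequestOutput per prompt.
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server (`vllm serve <model>`) for online serving with the same engine underneath.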

v7.0