vLLM

High-performance production inference engine: PagedAttention memory management, 2-4× throughput over prior serving systems

An LLM serving engine built for production (73K+ ⭐ on GitHub). Its PagedAttention technique manages the GPU KV cache in fixed-size blocks, reducing memory fragmentation and achieving 2-4× the throughput of FasterTransformer and Orca at comparable latency. Supports continuous batching and tensor parallelism.
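A minimal offline-inference sketch using vLLM's Python API; the model name and sampling settings here are illustrative, not recommendations:

```python
from vllm import LLM, SamplingParams

# Load a model; vLLM allocates the KV cache in paged blocks (PagedAttention).
llm = LLM(model="facebook/opt-125m")  # illustrative small model

# Sampling settings chosen for the example only.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Continuous batching schedules prompts dynamically across decode steps;
# generate() returns one RequestOutput per prompt.
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server (`vllm serve <model>`) for online serving with the same engine underneath.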

v7.0