Unlocking Scalable Long-Context LLM Inference: Introducing Medha for 30x Lower Latency & 5x Higher Throughput 🚀

Excited to share our work on Medha, a new system designed to efficiently serve Large Language Model (LLM) inference requests with multi-million-token contexts, without compromising performance on shorter, interactive queries.

As LLMs tackle increasingly vast contexts (think book summaries, code analysis, long conversations), serving these requests efficiently becomes a critical challenge. A major pain point in existing systems is Head-of-Line (HOL) blocking: long-context requests can monopolize resources for minutes, causing unacceptable delays for subsequent short requests, even in systems that use context parallelism.

Medha addresses this head-on with several key innovations rooted in system design and parallelism:

- Adaptive Chunking & Slack-Aware Scheduling: We challenge the conventional wisdom that chunking long prefills is inherently inefficient. By analyzing arithmetic intensity, we show that even small chunks remain efficient at long context lengths. Medha combines adaptive chunking with a novel LRS++ scheduler to enable fine-grained preemption and time/space sharing, effectively eliminating HOL blocking while maximizing GPU utilization.

- Sequence Pipeline Parallelism (SPP): To cut Time-To-First-Token (TTFT) latency for massive prefills, we introduce SPP, a new pipelining strategy. Unlike standard pipeline parallelism, SPP densely pipelines consecutive chunks of the same long request across stages, achieving near-linear TTFT reduction as we scale GPUs (a toy timing sketch is at the end of this post).

- KV Cache Parallelism (KVP): To minimize Time-Per-Output-Token (TPOT) during the memory-bandwidth-bound decode phase, KVP shards the enormous KV cache across devices. This parallelizes the critical KV cache reads, keeping token generation fast even with millions of context tokens (a toy sketch of the shard-and-merge idea is also at the end).

- Novel 3D Parallelism: Medha integrates Tensor Parallelism (TP), SPP, and KVP into a cohesive 3D parallelism strategy, enabling unprecedented scaling and efficient resource use across hundreds of GPUs.

Our evaluations show that Medha:

- Reduces median TTFT latency by up to 30x compared to state-of-the-art systems on mixed workloads.
- Improves effective throughput by upwards of 5x.

Medha provides a practical path towards deploying extremely long-context LLMs in production environments, ensuring both responsiveness for interactive tasks and high throughput for large analytical jobs. We believe this work significantly advances the state of LLM inference systems.

Read the full details in our paper: https://lnkd.in/ddCzhQnF

Work done with Haoran Qiu, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, and Esha Choukse at Microsoft and Georgia Institute of Technology.
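For those curious about the mechanics, here are two toy sketches in Python. These are my own simplified illustrations under stated assumptions, not code from Medha or the paper; all function names and numbers are made up for the example.

First, a back-of-the-envelope timing model for SPP. It assumes a single request split into C prefill chunks flowing through S pipeline stages with a fixed per-chunk, per-stage time t; the point is only to show why pipelining consecutive chunks of the same request yields a near-linear TTFT reduction as stages are added.

```python
# Toy timing model for Sequence Pipeline Parallelism (SPP).
# Hypothetical numbers; not measurements from the paper.

def ttft_serial(chunks: int, stages: int, t: float = 1.0) -> float:
    # Baseline: the next chunk is injected only after the previous
    # chunk has cleared every pipeline stage, so nothing overlaps.
    return chunks * stages * t

def ttft_spp(chunks: int, stages: int, t: float = 1.0) -> float:
    # SPP-style dense pipelining: chunk i+1 enters stage 1 as soon as
    # chunk i advances to stage 2, which is legal because stage 1
    # already holds chunk i's KV cache for that request.
    return (chunks + stages - 1) * t

C, S = 64, 8
print(ttft_serial(C, S), ttft_spp(C, S))  # 512.0 vs 71.0, roughly 7.2x on 8 stages
```

Second, a sketch of the core idea behind KV Cache Parallelism during decode: each rank holds a shard of the KV cache, computes partial attention over its shard, and the partials are merged exactly with the standard log-sum-exp trick (the same math used in split-KV / flash-style decoding). The shard count and sizes below are arbitrary toy values.

```python
# Toy shard-and-merge decode attention, illustrating the KVP idea.
import numpy as np

def partial_attention(q, k_shard, v_shard):
    # Per-shard work: unnormalized softmax numerator/denominator plus the
    # shard-local max, so shards can be merged without losing precision.
    scores = k_shard @ q / np.sqrt(q.shape[0])
    m = scores.max()
    w = np.exp(scores - m)
    return w @ v_shard, w.sum(), m

def merge_partials(partials):
    # Exact combination of shard results via log-sum-exp rescaling.
    m_global = max(m for _, _, m in partials)
    num = sum(n * np.exp(m - m_global) for n, _, m in partials)
    den = sum(d * np.exp(m - m_global) for _, d, m in partials)
    return num / den

d, n_ctx, ranks = 128, 32_768, 4          # arbitrary toy sizes
rng = np.random.default_rng(0)
q = rng.standard_normal(d)
k = rng.standard_normal((n_ctx, d))
v = rng.standard_normal((n_ctx, d))

shards = np.array_split(np.arange(n_ctx), ranks)
out = merge_partials([partial_attention(q, k[i], v[i]) for i in shards])

# Sanity check: identical to attention over the unsharded cache.
ref = merge_partials([partial_attention(q, k, v)])
assert np.allclose(out, ref)
```

In the real system the shards live on different GPUs and the merge is a small collective, so the expensive KV cache reads happen in parallel rather than serially on one device.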