Senthilkumar Gopal

Musings of a machine learning researcher, engineer and leader

Paged Attention and Chunked Prefill for LLM Inference

This post explains how Paged Attention and Chunked Prefill optimize memory and computation in vLLM by organizing key-value caches into fixed-size blocks allocated on demand and by processing input sequences in manageable chunks. It walks through a simple example with tensor shapes and code to show how the two techniques work together during LLM inference.
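To make the two ideas concrete before the full walkthrough, here is a minimal, hypothetical Python sketch (not vLLM's actual implementation): a toy paged KV cache that maps logical token positions to physical blocks via a block table, plus a chunked-prefill loop that writes a prompt's keys and values into the cache a few tokens at a time. All names, sizes, and the `PagedKVCache` class are illustrative assumptions.

```python
import numpy as np

BLOCK_SIZE = 4   # tokens per KV-cache block (toy value; vLLM defaults to 16)
NUM_BLOCKS = 8   # physical blocks in the pool
HEAD_DIM = 2     # tiny head dimension for illustration

class PagedKVCache:
    """Toy paged KV cache: physical block pool plus a per-sequence block table."""
    def __init__(self):
        # Physical storage: one K and one V slot per token per block.
        self.k = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM))
        self.v = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM))
        self.free = list(range(NUM_BLOCKS))
        self.block_table = []   # logical block index -> physical block id
        self.length = 0         # tokens cached so far

    def append(self, k_tok, v_tok):
        """Write one token's K/V, allocating a new physical block on demand."""
        if self.length % BLOCK_SIZE == 0:
            self.block_table.append(self.free.pop(0))
        block = self.block_table[self.length // BLOCK_SIZE]
        slot = self.length % BLOCK_SIZE
        self.k[block, slot] = k_tok
        self.v[block, slot] = v_tok
        self.length += 1

    def gather(self):
        """Reassemble the logical K/V sequence from the scattered blocks."""
        ks = self.k[self.block_table].reshape(-1, HEAD_DIM)[: self.length]
        vs = self.v[self.block_table].reshape(-1, HEAD_DIM)[: self.length]
        return ks, vs

def chunked_prefill(cache, prompt_k, prompt_v, chunk_size=3):
    """Feed the prompt's K/V into the cache one chunk at a time."""
    n = len(prompt_k)
    for start in range(0, n, chunk_size):
        for i in range(start, min(start + chunk_size, n)):
            cache.append(prompt_k[i], prompt_v[i])

# A 10-token prompt prefilled in chunks of 3: only 3 blocks are allocated,
# and gathering the cache reproduces the original K/V tensors exactly.
rng = np.random.default_rng(0)
pk = rng.normal(size=(10, HEAD_DIM))
pv = rng.normal(size=(10, HEAD_DIM))
cache = PagedKVCache()
chunked_prefill(cache, pk, pv)
ks, vs = cache.gather()
print(cache.block_table)
print(np.allclose(ks, pk) and np.allclose(vs, pv))
```

The point of the sketch is the separation of concerns: attention only ever sees the gathered logical sequence, while memory is allocated block by block as tokens arrive, so neither a long prompt nor a growing decode needs one large contiguous buffer.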