- Sat 28 December 2024
- Large Language Models
Paged Attention and Chunked Prefill for LLM Inference
This post explains how Paged Attention and Chunked Prefill optimize memory and computation in vLLM: Paged Attention organizes the key-value cache into fixed-size blocks allocated on demand, while Chunked Prefill processes long prompts in manageable chunks. It includes a simple walkthrough with tensor shapes and code showing how the two techniques work together during LLM inference.
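Before diving into the details, here is a minimal sketch of the two ideas together (not vLLM's actual implementation; `BLOCK_SIZE`, `PagedKVCache`, and `chunked_prefill` are illustrative names). A paged KV cache allocates fixed-size physical blocks on demand and records them in a block table, and a chunked-prefill loop feeds the prompt into the cache a slice at a time:

```python
BLOCK_SIZE = 4  # tokens per physical block (vLLM typically uses 16; small here for clarity)

class PagedKVCache:
    """Toy paged KV cache: logical sequence -> block table -> physical blocks."""

    def __init__(self):
        self.blocks = []        # physical blocks; each holds up to BLOCK_SIZE (key, value) pairs
        self.block_table = []   # logical block index -> physical block index

    def append(self, key, value):
        # Allocate a new physical block when none exists or the last one is full.
        if not self.block_table or len(self.blocks[self.block_table[-1]]) == BLOCK_SIZE:
            self.blocks.append([])
            self.block_table.append(len(self.blocks) - 1)
        self.blocks[self.block_table[-1]].append((key, value))

    def gather(self):
        # Reassemble the logical sequence by walking the block table in order.
        return [kv for b in self.block_table for kv in self.blocks[b]]

def chunked_prefill(cache, prompt_tokens, chunk_size):
    # Process the prompt in chunks; each chunk's keys/values land in the paged cache.
    for start in range(0, len(prompt_tokens), chunk_size):
        for tok in prompt_tokens[start:start + chunk_size]:
            # In a real model, key/value come from the attention projections of tok.
            cache.append(("k", tok), ("v", tok))

cache = PagedKVCache()
chunked_prefill(cache, list(range(10)), chunk_size=3)
print(len(cache.block_table))               # 3 blocks for 10 tokens at BLOCK_SIZE=4
print([kv[0][1] for kv in cache.gather()])  # tokens come back in original order
```

The key point: memory grows one block at a time as tokens arrive, instead of reserving a contiguous region for the maximum sequence length up front, and the chunk size bounds how much prefill work is done per scheduling step.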