Senthilkumar Gopal

Musings of a machine learning researcher, engineer and leader


Paged Attention and Chunked Prefill for LLM Inference

This post explains how Paged Attention and Chunked Prefill optimize memory and computation in vLLM by organizing key-value caches into dynamic blocks and processing input sequences in manageable chunks. It includes a simple walkthrough, with tensor shapes and code, showing how the two techniques work together during LLM inference.
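
The core mechanics are easy to sketch in a toy form. Below is a minimal Python illustration, not vLLM's actual code: the block sizes, pool size, and every function name here are invented for the example. It shows the two ideas side by side: a per-sequence block table mapping logical KV-cache positions to physical blocks that are allocated on demand, and a prefill loop that feeds the prompt through in fixed-size chunks.

```python
import numpy as np

BLOCK_SIZE = 16   # tokens per KV-cache block (illustrative)
NUM_BLOCKS = 64   # physical blocks in a toy cache pool
HEAD_DIM = 8      # illustrative head dimension

# Physical KV cache pool: one slab of blocks shared by all sequences.
k_cache = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM))
v_cache = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM))
free_blocks = list(range(NUM_BLOCKS))

# Block table: per-sequence list of physical block IDs, grown lazily.
block_table: list[int] = []

def append_kv(pos: int, k: np.ndarray, v: np.ndarray) -> None:
    """Write one token's K/V into the paged cache, allocating blocks on demand."""
    if pos % BLOCK_SIZE == 0:            # crossed into a new logical block
        block_table.append(free_blocks.pop())
    block = block_table[pos // BLOCK_SIZE]
    k_cache[block, pos % BLOCK_SIZE] = k
    v_cache[block, pos % BLOCK_SIZE] = v

def chunked_prefill(prompt_kv, chunk_size: int = 4) -> None:
    """Prefill the prompt in fixed-size chunks instead of one large pass."""
    for start in range(0, len(prompt_kv), chunk_size):
        for pos, (k, v) in enumerate(prompt_kv[start:start + chunk_size], start):
            append_kv(pos, k, v)
        # In a real engine, attention over this chunk runs here, and decode
        # steps from other sequences can be batched into the same iteration.

prompt = [(np.full(HEAD_DIM, i), np.full(HEAD_DIM, -i)) for i in range(10)]
chunked_prefill(prompt)
print("block table:", block_table)  # 10 tokens fit in a single physical block
```

Because blocks are allocated only when a sequence actually crosses a block boundary, no memory is reserved up front for the maximum sequence length, which is the source of Paged Attention's memory savings.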

AI Compilers - A Study Guide

A growing study guide to AI compilers, ranging from foundational concepts like graph lowering and systolic arrays to practical tools like TorchDynamo and Glow.

Aliasing on XLA

This post explores aliasing in XLA: why it matters, how it is implemented, and future directions for extending aliasing optimizations.
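
For a concrete taste of the topic, XLA's input/output buffer aliasing is reachable from user code through JAX's buffer donation. The sketch below assumes a working JAX install; the function `step` is invented for illustration, while `donate_argnums` is real JAX API.

```python
import jax
import jax.numpy as jnp

def step(x):
    # Output has the same shape and dtype as the input, so XLA may alias them.
    return x * 0.9 + 1.0

# Donating argument 0 tells XLA the caller no longer needs x's buffer,
# allowing the compiled computation to write the output into it in place.
step_inplace = jax.jit(step, donate_argnums=0)

x = jnp.ones((1024, 1024))
x = step_inplace(x)  # old buffer is consumed; reusing the old x would error
```

The rebinding of `x` is the idiomatic pattern: once a buffer is donated, touching the original array raises an error, since XLA may already have overwritten it.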