Senthilkumar Gopal

Musings of a machine learning researcher, engineer, and leader


Paged Attention and Chunked Prefill for LLM Inference

This post explains how Paged Attention and Chunked Prefill optimize memory and computation in vLLM: Paged Attention organizes key-value caches into dynamically allocated blocks, while Chunked Prefill processes input sequences in manageable chunks. It includes a simple walkthrough, with tensor shapes and code, showing how the two techniques combine during LLM inference.
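The core idea behind Paged Attention, organizing the KV cache into fixed-size blocks indexed by a per-sequence block table, can be sketched in a few lines. This is a minimal illustration with hypothetical names (`BlockTable`, `BLOCK_SIZE`), not vLLM's actual implementation:

```python
BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative; vLLM's default differs)

class BlockTable:
    """Maps a sequence's logical token positions to physical KV-cache blocks."""

    def __init__(self, num_physical_blocks=16):
        self.free_blocks = list(range(num_physical_blocks))  # small pool for illustration
        self.blocks = []       # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one fills up,
        # so memory grows on demand rather than being reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop(0))
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Translate a token position into (physical block id, offset within block).
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

table = BlockTable()
for _ in range(6):  # cache KV entries for 6 tokens
    table.append_token()

print(table.blocks)           # two physical blocks allocated for 6 tokens
print(table.physical_slot(5)) # last token lives at offset 1 of the second block
```

Because blocks are allocated lazily and addressed indirectly, sequences of different lengths can share one physical pool without pre-reserving contiguous memory, which is what the post walks through in detail.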




AI Compilers - A Study Guide

A growing study guide of AI compilers, spanning foundational concepts like graph lowering and systolic arrays as well as practical tools like TorchDynamo and Glow.

What is the Neuron SDK?

This post introduces the AWS Neuron SDK, which streamlines deep learning and generative AI workloads on AWS Inferentia and Trainium by integrating with frameworks like PyTorch and JAX.

Neuron Glossary

This post serves as a running glossary of Neuron- and HPC-related terms and technologies.