Senthilkumar Gopal

Musings of a machine learning researcher, engineer and leader

Paged Attention and Chunked Prefill for LLM Inference

This post explains how Paged Attention and Chunked Prefill optimize memory and computation in vLLM by organizing key-value caches into fixed-size blocks allocated on demand and by processing input sequences in manageable chunks. It walks through a simple example with tensor shapes and code to show how the two techniques work together during LLM inference.
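To make the two ideas concrete before the full walkthrough, here is a minimal, hypothetical Python sketch (not vLLM's actual implementation): a toy paged KV cache that maps logical token positions to physical blocks via a block table, plus a chunked-prefill loop that writes a prompt's keys and values into the cache a few tokens at a time. All names, sizes, and the `PagedKVCache` class are illustrative assumptions.

```python
import numpy as np

BLOCK_SIZE = 4   # tokens per KV-cache block (toy value; vLLM defaults to 16)
NUM_BLOCKS = 8   # physical blocks in the pool
HEAD_DIM = 2     # tiny head dimension for illustration

class PagedKVCache:
    """Toy paged KV cache: physical block pool plus a per-sequence block table."""
    def __init__(self):
        # Physical storage: one K and one V slot per token per block.
        self.k = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM))
        self.v = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM))
        self.free = list(range(NUM_BLOCKS))
        self.block_table = []   # logical block index -> physical block id
        self.length = 0         # tokens cached so far

    def append(self, k_tok, v_tok):
        """Write one token's K/V, allocating a new physical block on demand."""
        if self.length % BLOCK_SIZE == 0:
            self.block_table.append(self.free.pop(0))
        block = self.block_table[self.length // BLOCK_SIZE]
        slot = self.length % BLOCK_SIZE
        self.k[block, slot] = k_tok
        self.v[block, slot] = v_tok
        self.length += 1

    def gather(self):
        """Reassemble the logical K/V sequence from the scattered blocks."""
        ks = self.k[self.block_table].reshape(-1, HEAD_DIM)[: self.length]
        vs = self.v[self.block_table].reshape(-1, HEAD_DIM)[: self.length]
        return ks, vs

def chunked_prefill(cache, prompt_k, prompt_v, chunk_size=3):
    """Feed the prompt's K/V into the cache one chunk at a time."""
    n = len(prompt_k)
    for start in range(0, n, chunk_size):
        for i in range(start, min(start + chunk_size, n)):
            cache.append(prompt_k[i], prompt_v[i])

# A 10-token prompt prefilled in chunks of 3: only 3 blocks are allocated,
# and gathering the cache reproduces the original K/V tensors exactly.
rng = np.random.default_rng(0)
pk = rng.normal(size=(10, HEAD_DIM))
pv = rng.normal(size=(10, HEAD_DIM))
cache = PagedKVCache()
chunked_prefill(cache, pk, pv)
ks, vs = cache.gather()
print(cache.block_table)
print(np.allclose(ks, pk) and np.allclose(vs, pv))
```

The point of the sketch is the separation of concerns: attention only ever sees the gathered logical sequence, while memory is allocated block by block as tokens arrive, so neither a long prompt nor a growing decode needs one large contiguous buffer.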