- Sat 18 May 2024
- Large Language Models
- #ml-code, #llm
LLM Inference Systems
While exploring various inference systems, I found that Triton, TensorRT, and TensorRT-LLM are frequently confused with one another. All three relate to optimizing and deploying machine learning models, but they serve different purposes and use cases, particularly in the inputs and outputs they support.
Triton
Triton is an open-source inference server from NVIDIA designed to deploy machine learning models for inference workloads. It supports multiple deep learning frameworks, including TensorFlow, PyTorch, and TensorRT, and provides features such as dynamic batching and concurrent model execution to improve throughput and efficiency.
- Inputs: Triton can accept inputs in various formats, such as images, text, or numerical data, depending on the model being deployed.
- Outputs: The outputs depend on the specific model and its task, such as classification results, object detection bounding boxes, or generated text.
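To make the Triton description above concrete, here is a minimal sketch of a Python HTTP client sending a request to a running Triton server. The model name (resnet50) and tensor names (INPUT__0, OUTPUT__0) are illustrative placeholders and must match whatever is declared in the deployed model's configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server assumed to be running locally on the default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical model and tensor names -- these must match the model's config.pbtxt.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)
requested_output = httpclient.InferRequestedOutput("OUTPUT__0")

# Triton handles batching, scheduling, and framework dispatch server-side.
result = client.infer(model_name="resnet50",
                      inputs=[infer_input],
                      outputs=[requested_output])
print(result.as_numpy("OUTPUT__0").shape)
```

The same client pattern works regardless of whether the model behind the endpoint is a TensorFlow, PyTorch, or TensorRT backend; only the tensor names and shapes change.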
TensorRT
TensorRT is a high-performance deep learning inference optimizer and runtime library from NVIDIA that accelerates model inference on NVIDIA GPUs. TensorRT can optimize models exported from various deep learning frameworks, including TensorFlow and PyTorch, as well as models in the ONNX format.
- Inputs: TensorRT accepts deep learning models in various formats, such as TensorFlow frozen graphs, PyTorch traced models, or ONNX models.
- Outputs: TensorRT optimizes the model for faster inference on NVIDIA GPUs but does not change its output format; the outputs remain the same as the original model's, such as classification probabilities, object detection bounding boxes, or segmentation masks.
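As a rough sketch of the TensorRT workflow described above, the snippet below parses an ONNX file and builds a serialized engine using the TensorRT Python API. The file names are placeholders, and the exact builder flags vary across TensorRT releases.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# The ONNX parser requires an explicit-batch network (TensorRT 8.x flag;
# newer releases make explicit batch the default).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # hypothetical ONNX export of the model
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable FP16 if the GPU supports it

# Build and save the optimized engine; inputs/outputs keep the original model's semantics.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```

The resulting model.plan engine can then be loaded by the TensorRT runtime directly or served through Triton's TensorRT backend.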
TensorRT-LLM
TensorRT-LLM is an open-source library built on top of TensorRT, designed specifically for optimizing and deploying large language models (LLMs) on NVIDIA GPUs. It provides optimizations and techniques tailored to efficient LLM inference, which can be computationally expensive due to the models' large size and complex architectures.
- Inputs: TensorRT-LLM accepts pre-trained LLM checkpoints (typically Hugging Face or PyTorch weights), which it converts and compiles into optimized TensorRT engines.
- Outputs: The outputs remain the same as those of the original LLM, typically generated text conditioned on the input prompt or context.
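As an illustration of the TensorRT-LLM flow, the sketch below uses the high-level LLM API provided in recent TensorRT-LLM releases; the import path and argument names have shifted across versions, and the TinyLlama checkpoint is just an example. Under the hood it converts the checkpoint and builds a TensorRT engine before generating text.

```python
from tensorrt_llm import LLM, SamplingParams

# Example Hugging Face checkpoint; any supported LLM checkpoint can be used.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Explain the difference between Triton and TensorRT in one sentence."]
sampling = SamplingParams(temperature=0.8, top_p=0.95)

# The output format mirrors the original model: generated text per prompt.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

For production serving, the compiled engines are commonly deployed behind Triton, which ties the three pieces together.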
TL;DR - Triton is an inference server that serves models from different frameworks, including TensorRT-optimized engines. TensorRT is a library for optimizing and accelerating deep learning models for inference on NVIDIA GPUs. TensorRT-LLM is a specialized library built on TensorRT, focused on optimizing and deploying large language models efficiently on NVIDIA GPUs.
If you found this useful, please cite this post using
Senthilkumar Gopal. (May 2024). LLM Inference Systems. sengopal.me. https://sengopal.me/posts/llm-inference-systems
or
@article{gopal2024llminferencesystems,
  title   = {LLM Inference Systems},
  author  = {Senthilkumar Gopal},
  journal = {sengopal.me},
  year    = {2024},
  month   = {May},
  url     = {https://sengopal.me/posts/llm-inference-systems}
}