Neuron Glossary
Last Updated on Mar 12, 2024
The following is a running list of new terms encountered while working on the Neuron stack. It acts as a jumping-off point for deeper dives into these terms and the context behind them.
FPGA (Field-Programmable Gate Array) Unlike GPUs with a fixed design, FPGAs are essentially blank slates. They contain a fabric of programmable logic blocks that can be configured to perform specific tasks. This flexibility allows FPGAs to be customized for a wide range of applications, including cryptography, financial modeling, and high-frequency trading.
ASIC (Application-Specific Integrated Circuit) An ASIC is a chip designed for a specific purpose. It offers high performance and efficiency for that particular task because the hardware is optimized for it. This aligns exactly with how Intel describes Gaudi as a deep learning accelerator.
Arithmetic Intensity A metric that quantifies the ratio of computational operations (measured in floating-point operations, or FLOPs) to data movement (measured in bytes) during a computation. It helps determine whether a particular operation is compute-bound or memory-bound. For example, applying the ReLU activation function to a tensor of 16-bit values involves reading 2 bytes, performing 1 comparison operation, and writing 2 bytes per element, resulting in an arithmetic intensity of 1 FLOP per 4 bytes accessed. This low ratio indicates that such operations are typically memory-bound, meaning the time spent on memory accesses exceeds the time spent on computation. (ref)
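A quick back-of-the-envelope sketch of the ReLU example above, assuming 16-bit (2-byte) elements; the tensor size is arbitrary:

```python
# Arithmetic intensity of element-wise ReLU on a tensor of 16-bit values.
# Per element: read 2 bytes, 1 comparison (counted as 1 FLOP), write 2 bytes.
num_elements = 1024 * 1024
bytes_per_element = 2                                 # fp16 / bf16
flops = num_elements * 1                              # one max(x, 0) per element
bytes_moved = num_elements * (2 * bytes_per_element)  # one read + one write

arithmetic_intensity = flops / bytes_moved
print(arithmetic_intensity)  # 0.25 FLOPs/byte, i.e. 1 FLOP per 4 bytes -> memory-bound
```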
Gradient checkpointing / activation checkpointing A technique to reduce memory usage by discarding the activations of certain layers and recomputing them during the backward pass. Effectively, this trades extra computation time for reduced memory usage. If a module is checkpointed, only the inputs to and outputs from the module stay in memory at the end of the forward pass. (ref)
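A minimal PyTorch sketch of activation checkpointing via torch.utils.checkpoint; the layer sizes and batch shape are arbitrary:

```python
import torch
from torch.utils.checkpoint import checkpoint

# A block whose intermediate activations we do not want to keep in memory.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
)

x = torch.randn(32, 1024, requires_grad=True)

# Only the inputs to and outputs from `block` are kept; the activations
# inside it are recomputed during the backward pass.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```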
Strength Reduction
This optimization replaces computationally expensive operations with
equivalent but less costly ones. For example, replacing multiplication
by a constant with a shift operation.
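A source-level sketch of the multiply-to-shift example; compilers normally apply this on the generated instructions rather than the source:

```python
def scale_by_eight(x: int) -> int:
    return x * 8          # multiply by a power-of-two constant

def scale_by_eight_reduced(x: int) -> int:
    return x << 3         # equivalent, cheaper shift a compiler may emit instead

assert scale_by_eight(13) == scale_by_eight_reduced(13) == 104
```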
Constant Folding
Evaluates constant expressions at compile time, replacing them with
their computed values to reduce runtime computation.
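A before/after sketch of constant folding; the constant expression is purely illustrative:

```python
# Before folding: the expression is written out for readability.
def seconds_in_a_week():
    return 60 * 60 * 24 * 7

# After folding: the compiler evaluates the constant expression once,
# so the emitted code simply returns the literal value.
def seconds_in_a_week_folded():
    return 604800

assert seconds_in_a_week() == seconds_in_a_week_folded()
```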
Common Subexpression Elimination
Identifies and eliminates duplicate calculations by reusing previously
computed values, enhancing efficiency.
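A before/after sketch of common subexpression elimination on a made-up expression:

```python
# Before: (a + b) is computed twice.
def combine(a, b, c):
    return (a + b) * c + (a + b)

# After: the repeated subexpression is computed once and reused.
def combine_cse(a, b, c):
    t = a + b
    return t * c + t

assert combine(2, 3, 4) == combine_cse(2, 3, 4) == 25
```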
Dead-Code Elimination
Removes code that does not affect the program’s outcome, such as
computations whose results are never used, thereby streamlining the
codebase.
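A before/after sketch of dead-code elimination; the unused computation is contrived for illustration:

```python
# Before: `unused` is computed but never contributes to the result.
def distance_squared(x, y):
    unused = x * x + 2 * x * y + y * y   # dead computation
    return x * x + y * y

# After: the dead computation is removed.
def distance_squared_dce(x, y):
    return x * x + y * y

assert distance_squared(3, 4) == distance_squared_dce(3, 4) == 25
```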
Scalar Replacement
Replaces array references with scalar variables when possible, reducing
memory access overhead and improving performance.
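A before/after sketch of scalar replacement on a made-up accumulation loop:

```python
# Before: grid[i][j] is re-read and re-written on every iteration.
def accumulate(grid, i, j, updates):
    for u in updates:
        grid[i][j] = grid[i][j] + u
    return grid[i][j]

# After: the array element is kept in a scalar and written back once.
def accumulate_scalar(grid, i, j, updates):
    acc = grid[i][j]
    for u in updates:
        acc = acc + u
    grid[i][j] = acc
    return acc

g = [[0, 0], [0, 0]]
assert accumulate([[0, 0], [0, 0]], 1, 1, [1, 2, 3]) == accumulate_scalar(g, 1, 1, [1, 2, 3]) == 6
```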
If-Conversion
Transforms conditional branches into conditional instructions,
minimizing branch penalties and enhancing instruction-level
parallelism.
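A source-level sketch of if-conversion: the branch becomes a select, analogous to the conditional-move or predicated instruction a compiler would emit:

```python
# Before: a data-dependent branch the hardware may mispredict.
def clamp_negative(x):
    if x < 0:
        y = 0
    else:
        y = x
    return y

# After: the branch becomes a select expression.
def clamp_negative_select(x):
    return 0 if x < 0 else x

assert clamp_negative(-3) == clamp_negative_select(-3) == 0
```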
Function Inlining
Substitutes the body of a called function directly into the calling
code, eliminating call overhead and enabling further optimizations.
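A before/after sketch of function inlining on a toy helper:

```python
def square(v):
    return v * v

# Before: each call pays function-call overhead.
def sum_of_squares(a, b):
    return square(a) + square(b)

# After inlining: the body of `square` is substituted at the call sites,
# which also exposes further optimizations to the compiler.
def sum_of_squares_inlined(a, b):
    return a * a + b * b

assert sum_of_squares(3, 4) == sum_of_squares_inlined(3, 4) == 25
```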
Call Specialization
Tailors function calls based on known arguments, creating specialized
versions of functions to improve performance.
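A sketch of call specialization, assuming a call site where the exponent is known to be 2:

```python
# Generic routine: handles any non-negative exponent.
def power(x, n):
    result = 1
    for _ in range(n):
        result *= x
    return result

# Specialized version generated for the common call site power(x, 2):
# the loop disappears entirely.
def power_2(x):
    return x * x

assert power(7, 2) == power_2(7) == 49
```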
Peephole Optimizations
Examines a small window of consecutive instructions to identify and
replace inefficient sequences with more efficient ones, enhancing code
quality.
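A toy peephole pass over made-up pseudo-instructions; the instruction names and patterns are purely illustrative:

```python
# Drops no-op sequences such as adding zero or moving a register onto itself.
def peephole(instructions):
    out = []
    for op, *args in instructions:
        if op == "add" and args[1] == 0:
            continue                      # add r, 0  -> no-op
        if op == "mov" and args[0] == args[1]:
            continue                      # mov r, r  -> no-op
        out.append((op, *args))
    return out

code = [("mov", "r1", "r2"), ("add", "r1", 0), ("mov", "r3", "r3"), ("sub", "r1", 4)]
assert peephole(code) == [("mov", "r1", "r2"), ("sub", "r1", 4)]
```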
Canonicalizing Loops
Transforms loops to start at index 0 and stride by 1, which simplifies
the loop structure and may enable more optimizations.
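A sketch of loop canonicalization on a contrived loop; the rewritten loop starts at 0 and strides by 1, with the original index derived from the new counter:

```python
# Before: a loop that starts at 5 and strides by 5.
def total_before(values):
    s = 0
    for i in range(5, 50, 5):
        s += values[i]
    return s

# After canonicalization: the counter runs 0, 1, 2, ... and the original
# index is reconstructed from it.
def total_canonical(values):
    s = 0
    for k in range(9):
        i = 5 + 5 * k
        s += values[i]
    return s

data = list(range(60))
assert total_before(data) == total_canonical(data)
```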
Generating Constants
Replaces index calculations with pre-computed constants, reducing the
need for computation during program execution.
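A sketch of constant generation, assuming the row, column, and stride of an access are all known at compile time:

```python
# Before: the flat offset into a row-major 2-D buffer is recomputed,
# even though every operand is a known constant.
STRIDE, ROW, COL = 8, 3, 5

def read_fixed_cell(buffer):
    return buffer[ROW * STRIDE + COL]

# After: the index calculation is replaced with the pre-computed constant 29.
def read_fixed_cell_const(buffer):
    return buffer[29]

buf = list(range(64))
assert read_fixed_cell(buf) == read_fixed_cell_const(buf) == 29
```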
Eliminating Load/Store Pairs
Removes redundant load and store operations by combining them or
eliminating unnecessary memory accesses.
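A source-level analogue of a redundant store/load pair; real compilers perform this on machine-level loads and stores:

```python
# Before: values are written to memory and immediately read back,
# even though they are already held in registers.
def swap_via_buffer(a, b, scratch):
    scratch[0] = a                   # store
    scratch[1] = b                   # store
    return scratch[1], scratch[0]    # redundant loads

# After the redundant loads are eliminated, the values are reused directly.
def swap_direct(a, b):
    return b, a

assert swap_via_buffer(1, 2, [0, 0]) == swap_direct(1, 2) == (2, 1)
```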
XLA (Accelerated Linear Algebra)
XLA is a domain-specific compiler for linear algebra that optimizes
computations for machine learning models. Originally developed for TensorFlow, it is
also the compiler backend for JAX and PyTorch/XLA, and it targets hardware
accelerators like GPUs and TPUs.
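A minimal JAX sketch, since JAX uses XLA as its compiler backend; the model and shapes are arbitrary:

```python
import jax
import jax.numpy as jnp

def mlp(x, w1, w2):
    return jnp.maximum(x @ w1, 0.0) @ w2

# jax.jit traces the function and hands it to XLA, which can fuse the
# matmuls and the element-wise ReLU into optimized kernels for the target device.
mlp_xla = jax.jit(mlp)

key = jax.random.PRNGKey(0)
x  = jax.random.normal(key, (8, 128))
w1 = jax.random.normal(key, (128, 256))
w2 = jax.random.normal(key, (256, 64))
print(mlp_xla(x, w1, w2).shape)   # (8, 64)
```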
TVM
TVM is an open-source deep learning compiler stack designed to optimize
the performance of deep learning models across different hardware
platforms. It provides a flexible and efficient way to deploy models on
a wide range of devices, including CPUs, GPUs, and specialized
accelerators.
MLIR (Multi-Level Intermediate Representation)
MLIR is an intermediate representation used to define and optimize
programs at multiple levels of abstraction. It is designed to facilitate
cross-compiler optimizations and improve the portability and efficiency
of code across different hardware architectures.
LLVM / GCC LLVM is a modular and flexible compiler infrastructure consisting of reusable components that provide fine-grained control over code generation and optimization. Its intermediate representation (IR) is portable and efficient, allowing it to target various hardware architectures. In contrast, GCC is a more monolithic compiler system, tightly coupled with the rest of the GNU toolchain (assembler, linker, debugger), and follows a more traditional approach to code generation and optimization with a long history of use in production environments. While LLVM’s modularity allows for greater extensibility and customization, GCC tends to be less modular, requiring deeper integration to add features. LLVM excels at modern hardware optimizations, especially for GPUs and specialized accelerators, whereas GCC is more focused on traditional CPU-based platforms and embedded systems.
PCIe (Peripheral Component Interconnect Express) A high-speed interface standard used to connect peripheral devices, such as graphics cards, network cards, and storage devices, to the CPU and memory. It provides fast, low-latency data transfer with a scalable architecture, offering multiple lanes for simultaneous data communication. PCIe operates with a point-to-point architecture, where each device communicates directly with the CPU through a dedicated link, ensuring high performance and low overhead. Its speed and flexibility make it ideal for modern hardware accelerators, like GPUs and NVMe storage devices.
Memory-Mapped I/O (MMIO) A technique where I/O devices are mapped to specific memory addresses, allowing the CPU to communicate with them as if they were part of the system’s memory. This method provides a simple and efficient way to read from and write to I/O devices using standard memory access instructions, without the need for special I/O instructions. MMIO allows devices such as graphics cards, network interfaces, and other peripherals to interact directly with the processor and memory, enabling faster and more efficient data transfers, especially when working with high-performance devices like GPUs or custom accelerators.
Hypervisors Hypervisors are software, firmware, or hardware components that enable virtualization by managing the creation and execution of virtual machines. Type 1 (bare-metal) hypervisors run directly on the hardware, providing high performance and strong isolation for VMs. Examples include VMware ESXi, Microsoft Hyper-V, and Xen. Type 2 (hosted) hypervisors run on top of a host operating system, with the hypervisor providing virtualized hardware resources to guest operating systems. Examples include VMware Workstation and Oracle VirtualBox.
SR-IOV (Single Root I/O Virtualization) allows a single physical network interface card (NIC) or other I/O device to appear as multiple separate virtual devices to virtual machines. It improves performance by allowing VMs to access I/O resources directly without extensive virtualization overhead. SR-IOV enables better scalability and efficiency by allowing multiple VMs to share a single physical device while maintaining near-native performance. In SR-IOV-enabled systems, the hypervisor configures the physical device (e.g., NIC, GPU) to expose virtual interfaces, which are then directly assigned to virtual machines.
If you found this useful, please cite this post using
Senthilkumar Gopal. (Dec 2023). Neuron Glossary. sengopal.me. https://sengopal.me/posts/neuron-glossary
or
@article{gopal2023neuronglossary,
  title   = {Neuron Glossary},
  author  = {Senthilkumar Gopal},
  journal = {sengopal.me},
  year    = {2023},
  month   = {Dec},
  url     = {https://sengopal.me/posts/neuron-glossary}
}