Machine Learning Systems, Part 3 — Motivating GPU Hardware
Given our understanding of dense computation and tensors and how they relate to machine learning, we'll watch how the GPU architecture naturally emerges as a good candidate for performing the computational tasks of deep learning.
To get there, rather than describing the GPU architecture, let's see how we can design a piece of hardware from first principles, and notice how it resembles modern GPUs.
Our Goal: Parallel Operations
We have a good sense of the building-block ingredients for dense computation in deep learning. To start mapping things to hardware, let's remind ourselves of the matrix multiplication operation: for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the product $C = AB \in \mathbb{R}^{m \times p}$ has entries

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.$$
What are the constituent arithmetic operations here? We can start by breaking down matrix multiplication into sets of dot products: each entry of $C$ is the dot product of a row of $A$ with a column of $B$,

$$C_{ij} = A_{i,:} \cdot B_{:,j}.$$
Clearly, each element of the output matrix is a dot product (or inner product). Without loss of generality, let's pick one of those, say $a = A_{i,:}$ and $b = B_{:,j}$, both of length $n$, and explicitly enumerate the multiplication and addition operations therein:

$$a \cdot b = \sum_{k=1}^{n} a_k b_k = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n.$$
Rewriting in a particular way:

$$a \cdot b = \underbrace{(a_1 b_1)}_{p_1} + \underbrace{(a_2 b_2)}_{p_2} + \cdots + \underbrace{(a_n b_n)}_{p_n},$$

where each product $p_k$ depends only on a single pair of inputs.
From this rewrite, one thing is clear: we can do the $n$ multiplication operations in each constituent inner product in parallel! If we had $n$ circuits, we could do all $n$ multiplications in a single step, then sum all of the results together in $n - 1$ steps (although we can do better than this).
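To make this concrete, here's a minimal CUDA sketch (the kernel name `pairwise_products` and launch shape are illustrative, not from the original text) in which each thread plays the role of one of the $n$ circuits and computes a single independent product; summing the products is the reduction step we return to once we have hardware that supports it:

```cpp
#include <cuda_runtime.h>

// One thread per product: thread k computes only a_k * b_k.
// Across n threads, all n multiplications happen in a single parallel step.
__global__ void pairwise_products(const float* a, const float* b,
                                  float* products, int n) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n) {
        products[k] = a[k] * b[k];
    }
}
```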
As we've seen, there are many other operations in deep learning computation that have numerous parallel arithmetic operations. We can also consider the Hadamard product ($\odot$), which involves element-wise multiplication of two matrices of the same shape:

$$(A \odot B)_{ij} = A_{ij} B_{ij}.$$
Here, it's clear that the element-wise multiplication operations for each element of the output matrix can be performed in parallel; there's no dependence between individual multiplication operations.
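As a sketch (again with illustrative names, treating the matrices as flat row-major arrays), the Hadamard product maps directly onto one independent thread per output element:

```cpp
#include <cuda_runtime.h>

// Element-wise (Hadamard) product of two same-shape matrices stored row-major.
// Each thread owns exactly one output element; no thread depends on another.
__global__ void hadamard(const float* A, const float* B, float* C,
                         int num_elements) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_elements) {
        C[idx] = A[idx] * B[idx];
    }
}
```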
Designing Hardware for Parallel Arithmetic
Our desiderata are clear: we want hardware that can perform numerous arithmetic operations in parallel, with the ability to combine values with operations like reductions. Let's begin with these goals.
We'll start with a von Neumann-style architecture: a main memory unit, registers that can read from and write to that memory, and an arithmetic logic unit arranged in a loop that writes its results back into the registers.
In this loop, data is loaded from main memory into registers, then flows into an arithmetic logic unit (ALU) capable of performing arithmetic and logical operations (e.g. addition, multiplication, logical AND). The result is written back to registers and can be stored to main memory on subsequent cycles.
With a single loop, each "cycle" denotes a roundtrip from registers (or from main memory) through the ALU and back. Given our original formulation of the inner product above, we have $n$ multiplication operations to perform. With the above circuit, this requires $n$ cycles, with $n - 1$ subsequent cycles required to add (i.e. reduce) the resulting products together.
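For reference, this sequential behavior is exactly what a single host-side loop gives us (a sketch; the function name is illustrative): one multiplication and one accumulation per iteration, roughly $2n - 1$ arithmetic steps in total.

```cpp
// Sequential baseline: one multiplication and one accumulation per iteration,
// with no opportunity for parallelism across iterations of the loop.
float dot_product_sequential(const float* a, const float* b, int n) {
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) {
        sum += a[k] * b[k];
    }
    return sum;
}
```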
What's our bottleneck? If we were able to perform multiple multiplication operations simultaneously, we could perform all $n$ at once! We're operating with a blank canvas; let's create more loops!
If we create $n$ loops, we can do all $n$ multiplications in a single step, provided each loop has the operands it needs in memory! There are two problems here:
- If we have the operand tensors stored contiguously in memory, how do we get each pair of input elements into the memory of each loop?
- Once we have the results of each multiplication, how do we combine them if we need to access all of the intermediate results at once?
To solve this, we'll introduce the notion of "shared memory" — that is, memory that's shared between loops! Revising our circuit diagram, we have the following:
In this setup, we can load and store data from anywhere in shared memory into the registers of each loop, perform operations, and write back, all at once. Now that the ALUs can operate on shared data, we can start with the tensors contiguously in memory, assign each loop its share of the parallel arithmetic, execute those operations, and write the results back to a shared location.
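In CUDA terms, our "loops" become threads and our shared memory becomes `__shared__` storage within a block. The sketch below (illustrative names; it assumes $n \le 256$ and a single block of 256 threads) computes a dot product by doing all multiplications in one step and then reducing the partial products through shared memory in $\log_2 256 = 8$ steps:

```cpp
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // number of "loops" (threads) working in parallel

// Dot product of two vectors of length n <= BLOCK, launched with one block of
// BLOCK threads. Each thread computes one product into shared memory, then the
// block reduces the partial products in log2(BLOCK) steps.
__global__ void dot_product_shared(const float* a, const float* b,
                                   float* result, int n) {
    __shared__ float partial[BLOCK];  // memory shared between all "loops"

    int k = threadIdx.x;
    partial[k] = (k < n) ? a[k] * b[k] : 0.0f;  // all multiplications in one step
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (k < stride) {
            partial[k] += partial[k + stride];
        }
        __syncthreads();
    }

    if (k == 0) {
        *result = partial[0];
    }
}
```

A launch like `dot_product_shared<<<1, BLOCK>>>(d_a, d_b, d_result, n)` covers the single-block case; longer vectors repeat the same pattern per block and then reduce the per-block results.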
Putting it all Together
There are a few basic components we'll add to fill in the gaps between our toy model and a real GPU[1]:
- An instruction control unit which controls sets of the memory-register-ALU loops
- Caches which sit in between shared memory and sets of registers.
Reassigning proper terminology to the abstractions we've already created:
- Streaming processor (SP) — the register-ALU abstraction that, along with other proximal SPs, shares a control unit and instruction cache.[2]
- Streaming multiprocessor (SM) — a collection of SPs connected to common shared memory and caches. SMs are the hardware building blocks on top of which parallel computing abstractions are based.
Modern GPUs typically feature hundreds of streaming processors per streaming multiprocessor.[3] Given the above, a full streaming multiprocessor looks like the following:
Modern datacenter GPUs typically contain dozens[4] of SMs. The shared memory in each SM is connected to an intermediate cache (usually labeled level 2, or L2), which in turn is connected to device memory (also called VRAM, or "Video Random-Access Memory", on GPUs). Combining the above, our revised picture of a GPU is as follows:
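Concretely, the CUDA runtime reports these per-device quantities through `cudaGetDeviceProperties`; a minimal host-side query (printing only the fields relevant to the hierarchy described here) might look like:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, /*device=*/0);

    std::printf("SM count:              %d\n", prop.multiProcessorCount);
    std::printf("Shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    std::printf("L2 cache size:         %d bytes\n", prop.l2CacheSize);
    std::printf("Device memory (VRAM):  %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```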
Take note of a few features of how buses (high-bandwidth connections between components) are connected:
- Direct VRAM to shared memory loads — new instructions enable direct I/O from GPU VRAM into shared memory in an SM, bypassing caches and control units (see the sketch after this list).
- SMs can write to an L2 cache — SMs can cache arbitrary data, and reading or writing the L2 cache is faster than going all the way to VRAM.
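As an illustration of the first point, recent CUDA hardware (roughly Ampere and later) exposes asynchronous global-to-shared copies through `cooperative_groups::memcpy_async`; the kernel below is a sketch of staging a tile of VRAM-resident data directly into shared memory (the tile size and names are illustrative, and exactly which caches the copy bypasses depends on the architecture and copy variant):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // elements staged per block; illustrative choice

// Stage TILE floats from device memory (VRAM) into shared memory with an
// asynchronous, block-wide copy, then operate on the shared-memory tile.
__global__ void stage_tile(const float* global_in, float* global_out) {
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    // Cooperative global -> shared copy issued on behalf of the whole block.
    cg::memcpy_async(block, tile, global_in + blockIdx.x * TILE,
                     sizeof(float) * TILE);
    cg::wait(block);  // wait until the staged data has landed in shared memory

    int i = threadIdx.x;
    if (i < TILE) {
        global_out[blockIdx.x * TILE + i] = 2.0f * tile[i];  // arbitrary use of the tile
    }
}
```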
This diagram is an approximation of how a real GPU is laid out. In the sections that follow, we'll revisit this schematic in the context of hierarchical data transfer and storage as they relate to bottlenecks and performance, as well as in the context of distributed computation in deep learning.
Citation
@misc{kahn2024mlsystems,
title = {Machine Learning Systems},
author = {Jacob Kahn},
year = {2024},
url = {https://jacobkahn.me/writing/tag/machine_learning_systems/},
}
Footnotes
1. There are other components of a real SP (including other small intermediate caches, in some cases), but most of these additional components are not programmable.
2. GPUs have instruction caches, as do many other microarchitectures; ARM's documentation gives a good explanation.
3. These numbers typically must be inferred from vendor-provided benchmarks. As an example, an NVIDIA H100 does 256 fp16 multiplications per SM clock cycle. Each SM has 4 warp schedulers, which means 128 (4 × 32) instructions can be issued per clock cycle, corresponding to 256 fp16 multiplications assuming each fp32 unit handles two fp16 operations per cycle.
4. An NVIDIA H100 GPU has 132 SMs.