Machine Learning Systems, Part 1 — Motivating Dense Models
This is the first piece of writing in a series about machine learning systems, geared specifically towards deep neural models. While this series will be technical, it will also introduce some concepts from first principles, where possible.
To start, we'll draw connections between the information humans consume and the mathematical structures and operations we can use to operate on that information.
Describing the World in Data
We frequently hear the word "dense" thrown around when describing deep neural networks and related computation. What does it mean, and how did we get here?
If our goal is to create intelligent systems that can reason in the same ways as humans do, we need to ingest data which is human-interpretable: that is, data that humans can read (or write). Humans receive sensory inputs across a variety of modalities, including visual, auditory, tactile, and others. If we were to encode data streams from these inputs in such a way that a computer could understand them, what would they look like?
Composition and Locality
The world we live in is continuous1 in multiple ways. Not only do we use continuous values (real scalars) to describe physical quantities (the strength of an audio signal reaching our eardrum at a moment in time, or the brightness of a patch of our visual field), but there are also relationships between the scalar quantities representing axes of our physical world.2 What are some of these axes?
- Spatial relationships — some data is expressed via changing quantities across space; objects have physical proximity and the components that make them up (e.g. pixels of color in an image we perceive) are physically adjacent to one another so as to compose the whole.
- Temporal relationships — other data is expressed via changing quantities over time. Sound (and therefore speech) has a temporal dimension over which a signal varies.
- "Syntactic" proximity — while we do not have sensory organs to perceive text itself, text is ordered so as to express relationships between syntactic units. Ordering is but one of numerous syntactic concepts.
We've already developed standards to store and manipulate data representing these modalities in computers; these span formats for images, audio, and text that allow us to store and perform computation on that data.
Representing Dense Modalities
With the simplest possible representation (simpler than the standard encoding formats mentioned above), how might we encode data of these types?
Let's first consider audio. Audio signals are 1D signals; for each point in time, a scalar value corresponds to the signal strength (for example, the air pressure on a microphone or a membrane such as an eardrum). Given this, our representation of an audio signal would be a number at each point in time3, or a vector. Concretely, vectors are written as follows:

$$\mathbf{v} = (v_1, v_2, \ldots, v_n)$$

Vectors are typically expressed in a column representation, which follows the corresponding convention in linear algebra:

$$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$
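As a simple illustration (not part of any standard audio format), an audio signal can be stored as a one-dimensional NumPy array of samples; the 16 kHz sample rate and 440 Hz tone below are arbitrary choices:

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second (an illustrative choice)

# One second of a 440 Hz sine wave: one signal-strength scalar per point in time.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
audio = np.sin(2 * np.pi * 440.0 * t)

print(audio.shape)  # (16000,) -- a single axis: time
```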
In our next example, consider a black-and-white image — one in which we can describe a pixel with a simple scalar representing its grayscale shade. This "image" can be represented with a 2D collection of scalars, i.e. a matrix:

$$I = \begin{bmatrix} i_{1,1} & i_{1,2} & \cdots & i_{1,n} \\ i_{2,1} & i_{2,2} & \cdots & i_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ i_{m,1} & i_{m,2} & \cdots & i_{m,n} \end{bmatrix}$$
Grayscale is representable with a single scalar, but image color is typically represented via RGB: three scalar values per pixel. Given that an image has two spatial dimensions and a third dimension across the RGB values, a three-dimensional array (difficult to illustrate on the page) can be used to represent image data per pixel.
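A small sketch of this layout in NumPy (the image dimensions here are arbitrary):

```python
import numpy as np

# A hypothetical 4x6 RGB image: two spatial axes (height, width) plus a
# third axis of length 3 holding the red, green, and blue intensities.
image = np.zeros((4, 6, 3), dtype=np.uint8)

# Set the pixel at row 1, column 2 to pure red.
image[1, 2] = [255, 0, 0]

print(image.shape)  # (4, 6, 3)
```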
Representing text is similar. Historically common in machine learning settings (despite its limitations) is the one-hot encoding used to represent categorical data, in which each character (or word) is represented by a vector containing all zeros except for a one in the position corresponding to that particular character or word.
Concretely, consider a language with four characters $\{a, b, c, d\}$, where each character corresponds to a vector with a one in that character's position (first position for $a$, second for $b$, etc.). A sequence of characters would look like the following:

$$a = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad b = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad d = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$$

Which might be stacked into a matrix:

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
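A minimal Python sketch of this encoding, using the same toy four-character alphabet (the characters and the `one_hot` helper are illustrative, not from any particular library):

```python
import numpy as np

# A toy four-character alphabet; the characters themselves are illustrative.
alphabet = ["a", "b", "c", "d"]
char_to_index = {ch: i for i, ch in enumerate(alphabet)}

def one_hot(ch: str) -> np.ndarray:
    """Return a vector of zeros with a one at the character's position."""
    vec = np.zeros(len(alphabet))
    vec[char_to_index[ch]] = 1.0
    return vec

# Encode a sequence and stack the resulting vectors into a matrix,
# one column per character in the sequence.
sequence = "abd"
encoded = np.stack([one_hot(ch) for ch in sequence], axis=1)
print(encoded)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 0.]
#  [0. 0. 1.]]
```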
Tensors
Our colloquial notion of "dimension" has a more precise counterpart describing the underlying mathematical structure storing the data: the order. A vector has order one and a matrix has order two.
We can generalize this idea to describe structures of any order. These multi-dimensional arrays are one representation of a tensor, a more general mathematical object.
Tensors are natural candidates with which to represent dense human-modal data: we can express spatial and temporal relationships across arbitrarily many axes, and a tensor's constituent values are all scalars taken from sets that can represent continuous quantities, such as the real numbers $\mathbb{R}$. We'll follow up on tensors in more detail in what follows.
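In array libraries, the order of a tensor corresponds to its number of axes. A quick sketch (the shapes are chosen arbitrarily):

```python
import numpy as np

audio = np.zeros(16_000)          # order 1: a vector (a single time axis)
grayscale = np.zeros((4, 6))      # order 2: a matrix (height, width)
rgb_image = np.zeros((4, 6, 3))   # order 3: height, width, color channel
video = np.zeros((10, 4, 6, 3))   # order 4: RGB frames over time

for tensor in (audio, grayscale, rgb_image, video):
    print(tensor.ndim, tensor.shape)
```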
Dense Models for Dense Modalities
What makes the tensors and human modalities we have considered thus far dense? Dense is the opposite of sparse, and both terms refer to the number of non-zero elements in a tensor: a sparse tensor has few non-zero elements, while a dense tensor is one in which most elements are non-zero.
While the universe is very sparsely populated with matter and energy, on earth, things abound. Spatially, most things have context — gravity binds things together to have vertical proximity relative to the earth's surface, and we tend to move and manipulate objects horizontally. Temporally, fluids abound on earth and are continuously perturbed — sound being perturbations of the fluid that is air.
Given this, most tensors that encode data from human-interpretable modalities are dense. There are few perfectly black things (which would be represented as zeros in an RGB matrix-based representation of an image), just as there are few completely silent places (microphones typically pick up noise in the air around us). That said, we can also choose to represent data sparsely; for example, we might represent samples in an audio signal that fall below a certain intensity as zeros, or represent pixels that are sufficiently close to the color black (zero in RGB) as zero, even when the match isn't exact.
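One way such sparsification might look in practice (the threshold and the sample values are made up):

```python
import numpy as np

# A noisy, mostly-quiet audio signal.
audio = np.array([0.001, -0.002, 0.4, 0.35, -0.001, 0.002, 0.0005, -0.5])

# Zero out samples whose magnitude falls below a chosen threshold,
# trading a little fidelity for a sparser representation.
THRESHOLD = 0.01
sparse_audio = np.where(np.abs(audio) < THRESHOLD, 0.0, audio)

print(sparse_audio)  # [ 0.    0.    0.4   0.35  0.    0.    0.   -0.5 ]
```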
While not discussed in depth here, it is more efficient to operate on sparse data in many cases, remembering that $0 \cdot x = 0$ (zero multiplied by any number is zero); multiplying sparse tensors can be more efficient since not all multiplications need to be explicitly computed if a zero is involved. A dot product of two sparse vectors $\mathbf{a}$ and $\mathbf{b}$ looks like the following:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$
We can simplify such an operation by noting that whenever either factor in a pairwise multiplication is zero, we can discard that product from the sum; only positions where both vectors are non-zero (a logical AND over their non-zero positions) contribute, as sketched below.
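Here is a minimal Python sketch of that idea (the `sparse_dot` helper and the dictionary-of-indices storage are illustrative, not taken from any particular library):

```python
import numpy as np

def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors stored as {index: value} maps.

    Only indices present in *both* vectors (a logical AND over non-zero
    positions) contribute to the sum; all other products are zero and skipped.
    """
    # Iterate over the smaller vector for efficiency.
    if len(b) < len(a):
        a, b = b, a
    return sum(value * b[idx] for idx, value in a.items() if idx in b)

# Two mostly-zero vectors of length 8, stored both densely and sparsely.
a_dense = np.array([0.0, 2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
b_dense = np.array([0.0, 4.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0])

a_sparse = {i: v for i, v in enumerate(a_dense) if v != 0.0}
b_sparse = {i: v for i, v in enumerate(b_dense) if v != 0.0}

assert sparse_dot(a_sparse, b_sparse) == np.dot(a_dense, b_dense)  # both 8.0
```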
Taking Inventory: GPUs in the Right Place at the Right Time
We can now begin to see the utility of GPUs in deep learning, and the serendipity with which their utility emerged. GPUs were built for computer graphics — performing operations on images (real or simulated) in order to display an output image or scene. GPUs excel at operations on dense data, such as matrix multiplication.
Images are dense, as we've seen above; the types of models we might want would be those that can perform dense operations on those images, efficiently utilizing GPU hardware. The convolution4 is a mathematical operation that can be formulated to operate on images, and has applications in computer graphics. It is dense in the sense that it performs numerous arithmetic operations on contiguous (spatially local) sets of values from an input image.
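To give a sense of the dense arithmetic involved, here is a naive sliding-window sketch of a 2D convolution (for illustration only; this is not how GPUs actually organize the computation):

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'valid' 2D convolution: slide the kernel over the image and take a
    weighted sum of each spatially local patch.

    Note: without flipping the kernel this is strictly a cross-correlation; for
    a symmetric kernel like the box blur below the two coincide.
    """
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

# A 3x3 box-blur kernel applied to a small grayscale image.
image = np.arange(36, dtype=float).reshape(6, 6)
blur = np.full((3, 3), 1.0 / 9.0)
print(convolve2d(image, blur).shape)  # (4, 4)
```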
AlexNet (Krizhevsky et al. 2012) employed neural networks that used numerous convolutions, an approach that was previously very computationally expensive with conventional CPU-based hardware. Employing GPUs to accelerate convolutions, which were already abundant in computer graphics5, allowed the authors to train significantly larger (deeper) and thus more effective models for image classification. Neural models employing other dense operations proliferated thereafter across vision, speech, and later, language and other task areas.
In what follows, we'll discuss tensors, then design hardware from first principles which can efficiently perform dense operations on those tensors, noticing that such hardware resembles a GPU!
Citation
@misc{kahn2024mlsystems,
title = {Machine Learning Systems},
author = {Jacob Kahn},
year = {2024},
url = {https://jacobkahn.me/writing/tag/machine_learning_systems/},
}
Footnotes
1. The world appears continuous at the macro scale; the quantization of energy and space governs the micro-scale interactions that underlie our universe's physical laws as we understand them today.
2. Our promising physical models of the universe start from those things which we're able to interpret as humans: for example, mass or spacetime.
3. The density of scalars used to represent audio is related to the sample rate.
4. Convolutions are far more general than those that operate on images; they are used in signal processing, probability and statistics, and other areas of engineering and physics.
5. Numerous algorithms in computer vision employ dense operations on GPUs: convolutions are used in blurring and edge detection; matrix multiplication is used to apply transformations to images such as rotation, translation, or scaling, and can transform 3D scenes by moving objects around in space.