Machine Learning Systems, Part 2 — Tensors

What is a tensor?

A tensor is a general mathematical object that relates objects in vector spaces. There are many equivalent definitions of tensors, and they appear throughout mathematics. For the purposes of deep learning, a tensor is a mathematical object that represents a multi-dimensional array of values.

Tensors contain sequences of numbers (these might be real numbers, integers, boolean values, or other types) arranged in a structure. This structure includes axes — that is, given an element (as specified by an index), we might move to other elements along one of the axes, as long as we stay within the bounds of the tensor.

Tensors have various properties which define them and the operations that can be performed on them. These properties include:

  • Order (sometimes referred to as rank or dimensionality) — this can be thought of as the "number of axes." A vector has order one, a matrix has order two, and a single scalar (e.g. 2.718) has order zero, by convention.
  • Shape (sometimes referred to as "dimension," confusingly) — refers to the size, or number of elements, of each axis of a tensor, typically expressed as a tuple [1], e.g. $(3, 6, 2)$. A vector with 8 elements has shape $(8)$, a $10 \times 12$ matrix has shape $(10, 12)$, and the scalar 8 is typically described as having an empty shape $()$ [2]. Axis sizes are ordered; a tensor with shape $(2, 7, 4, 9)$ and a tensor of shape $(2, 4, 7, 9)$ are structurally different. The number of values in a tensor's shape is equal to its order/rank.
    • When describing the shape of matrices, the convention is to include the number of rows before the number of columns; an $m \times n$ matrix has $m$ rows ($m$ elements along its first axis) and $n$ columns ($n$ elements along its second axis).
  • Type — mathematically, the values of which a tensor is composed are typically real numbers. In deep learning, given software and hardware constraints, tensors typically contain floating-point values, integers, or other datatypes.

A small schematic showing axes of a matrix.

How do we refer to elements in, or index into, tensors? While there are many ways to notate particular elements of tensors [3], in deep learning, indices are typically written as subscripts, and a tuple of indices, one along each axis, corresponds to a particular element.

Consider a vector $v$, for example: $v = \begin{bmatrix} 2 & 9 & 8 & 3 & 1 \end{bmatrix}$. To refer to the first element in $v$, we would write $v_0 = 2$ (with zero-indexing [4]), and to refer to the third element, $v_2 = 8$. In tensors of order greater than one, such as matrices, each index corresponds to indexing along a particular axis. Consider a matrix $m$ with shape $(2, 3)$:

$$m = \begin{bmatrix} 0.21 & 0.84 & 0.33 \\ 0.04 & 0.91 & 0.13 \end{bmatrix}$$

As is the convention with labeling the shapes of matrices, we index by row, then column: $m_{1, 1} = 0.21$ (first row, first column), $m_{2, 2} = 0.91$ (second row, second column), and $m_{1, 3} = 0.33$ (first row, third column).
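
As a concrete illustration, here is a minimal numpy sketch of these properties and of element access (the particular values are arbitrary). Note that numpy arrays are zero-indexed along every axis, whereas the matrix example above uses the one-based convention common in mathematical notation.

```python
import numpy as np

# An order-1 tensor (vector) with 5 elements.
v = np.array([2, 9, 8, 3, 1])
print(v.ndim)       # 1 -- order/rank: the number of axes
print(v.shape)      # (5,) -- a single axis of size 5
print(v[0], v[2])   # 2 8 -- zero-indexed element access

# An order-2 tensor (matrix) with shape (2, 3).
m = np.array([[0.21, 0.84, 0.33],
              [0.04, 0.91, 0.13]])
print(m.ndim)    # 2
print(m.shape)   # (2, 3) -- rows, then columns
print(m.dtype)   # float64 -- the element type
print(m[0, 0])   # 0.21 -- first row, first column
print(m[1, 1])   # 0.91
print(m[0, 2])   # 0.33
```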

Why think in tensors, and what do we need them to do?

Tensors are good for many reasons:

  1. They provide a mathematically convenient way to represent dense, multi-dimensional quantities, which map well onto the data modalities we care about (images, audio, text, and so on).
  2. We have hardware (e.g. GPUs) that can operate on tensors efficiently.

As we build neural networks, we've coalesced around a common schema of operations on tensors; deep neural models ultimately rely on a surprisingly small set of operations.

Some Tensor Operations in Deep Learning

Elementwise Operations

Elementwise operations are exactly that: for a given arithmetic operator (e.g. addition), we apply that operation to each pair of corresponding elements across the tensor operands.

Consider two tensors $A$ and $B$, both with shape $m \times n$:

$$A = \begin{bmatrix} a_{1, 1} & \dots & a_{1, n} \\ \vdots & \ddots & \vdots \\ a_{m, 1} & \dots & a_{m, n} \end{bmatrix} \qquad B = \begin{bmatrix} b_{1, 1} & \dots & b_{1, n} \\ \vdots & \ddots & \vdots \\ b_{m, 1} & \dots & b_{m, n} \end{bmatrix}$$

An elementwise binary function $f$, if applied to $A$ and $B$, yields the following output matrix:

$$f(A, B) = \begin{bmatrix} f(a_{1, 1}, b_{1, 1}) & \dots & f(a_{1, n}, b_{1, n}) \\ \vdots & \ddots & \vdots \\ f(a_{m, 1}, b_{m, 1}) & \dots & f(a_{m, n}, b_{m, n}) \end{bmatrix}$$

In the case of addition, $f(a, b) = a + b$, so the output matrix becomes:

$$C = \text{ElementwiseAdd}(A, B) = \begin{bmatrix} a_{1, 1} + b_{1, 1} & \dots & a_{1, n} + b_{1, n} \\ \vdots & \ddots & \vdots \\ a_{m, 1} + b_{m, 1} & \dots & a_{m, n} + b_{m, n} \end{bmatrix}$$

which is also an $m \times n$ matrix. Elementwise multiplication is also known as the Hadamard product ($\odot$). Elementwise addition between two matrices is frequently written as $C = A + B$; elementwise multiplication is sometimes written as $C = A \odot B$ (to avoid confusion with matrix multiplication, discussed below).

How do elementwise operations affect tensor shape and order? Elementwise operations combine corresponding elements from two or more tensors of the same shape: the resulting tensor thus has the same shape as the input tensors, and order is unaffected.

We'll also discuss broadcasting below, which relaxes the requirement that input tensors in elementwise operations be the same shape and allows us to apply elementwise operations across axes in more elegant ways.
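
Here's a minimal numpy sketch of these semantics (the values and shapes are arbitrary, chosen only for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
B = np.array([[10.0, 20.0, 30.0],
              [40.0, 50.0, 60.0]])

C_add = A + B   # elementwise addition: each c[i, j] = a[i, j] + b[i, j]
C_mul = A * B   # elementwise (Hadamard) product, written A ⊙ B above

print(C_add.shape, C_mul.shape)   # (2, 3) (2, 3) -- shape and order are unchanged
```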

Reductions

A reduction is a tensor operation that combines values along a particular axis of a tensor. As part of this operation, the axis in question is "reduced": its size becomes 1 if the axis is explicitly kept, and otherwise the axis is removed from the shape entirely.

Consider an order-3 tensor $T \in \mathbb{R}^{m \times n \times k}$ of shape $(m, n, k)$. If we were to reduce over the second axis of this tensor, which has size $n$, the shape of the resulting tensor would be $(m, k)$ (or $(m, 1, k)$ if we chose to keep the reduced dimension), as all $n$ elements along the second axis have been combined into a single value for each combination of indices along the other axes.

The reduction function $f$ used to combine elements of a tensor along a certain axis can be single or multi-valued. Some examples of reduction functions include:

  • arithmetic operations ($+, -, \times, \div$) — for example, a sum-based reduction ($+$) adds up all values along an axis for each combination of indices along the other axes;
  • min/max — which summarize all values along an axis using their minimum or maximum values;
  • argmin/argmax — which return the indices of the minimum or maximum values along an axis;
  • mean/median or other summary statistics;
  • logical operations — e.g. and or or applied along an axis.

A reduction may be performed over arbitrarily many axes — in these cases, all of the values along each axis are combined, and each axis over which the reduction was performed is reduced away in the resulting tensor's shape.

Reductions over a matrix along both single and multiple axes, with the resulting changes in shape.

How do reductions affect tensor shape and order? As we've seen above, reductions squash the axes along which they are performed, removing them from the tensor's shape, and keeping the size of all other axes invariant. As such, reductions reduce the order of a tensor by the number of axes over which the reduction is performed.
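
A short numpy sketch of reductions and their effect on shape (the shapes are chosen arbitrarily):

```python
import numpy as np

T = np.arange(24).reshape(2, 3, 4)   # an order-3 tensor of shape (2, 3, 4)

# Reduce (sum) over the second axis (axis=1, zero-indexed), which has size 3.
print(T.sum(axis=1).shape)                 # (2, 4) -- the reduced axis is removed
print(T.sum(axis=1, keepdims=True).shape)  # (2, 1, 4) -- or kept with size 1

# Other reduction functions, and reductions over multiple axes at once.
print(T.max(axis=2).shape)         # (2, 3)
print(T.argmax(axis=0).shape)      # (3, 4) -- indices of maxima along axis 0
print(T.mean(axis=(0, 2)).shape)   # (3,) -- two axes reduced away
```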

Contractions

Tensor contractions refer to a general class of operations that combine tensors, or distinct axes of a single tensor. In particular, a contraction sums over products of elements whose indices are paired.

Tensor contractions allow us to define a different kind of multiplication than elementwise multiplication [5], and are a generalization of matrix multiplication. Consider two vectors $u = \begin{bmatrix} u_1, \dots, u_k \end{bmatrix}$ and $v = \begin{bmatrix} v_1, \dots, v_k \end{bmatrix}$, each with $k$ elements. Performing a tensor contraction on these two vectors involves identifying a common pair of indices across the tensors; in this case, both tensors have only one axis (they're of order one) and therefore one index each, as in $u_i$ and $v_i$. The paired index is thus the first and only index.

The contraction of $u$ and $v$ is written as $u \cdot v$ — it is, in the case of vectors, the dot product! The products of the pairs would be written as:

$$u \cdot v = \begin{bmatrix} u_1 \\ \vdots \\ u_k \end{bmatrix} \cdot \begin{bmatrix} v_1 & \dots & v_k \end{bmatrix} = \sum_{i = 1}^k u_i v_i = u_1 v_1 + \dots + u_k v_k$$

We write the vector $u$ vertically for conventional and notational reasons [3]. We have effectively iterated over $u_i$ and $v_i$; the index $i$ is our contraction index. We have "zipped together" the corresponding terms for those indices via a product, then summed the resulting products together.

We can extend tensor contractions to matrix multiplication as well. In this case, the pairs of indices would involve the 2-tuples $(i, j)$ and $(j, k)$ that describe elements of each matrix; the contraction index is $j$. Making this concrete, consider two matrices $A$ and $B$ of shapes $(m, n)$ and $(n, p)$, respectively.

$$\begin{array}{ccc} A ~ (m \times n) & & B ~ (n \times p) \\ & & \\ \begin{bmatrix} a_{1, 1} & \dots & a_{1, n} \\ \vdots & \ddots & \vdots \\ a_{m, 1} & \dots & a_{m, n} \end{bmatrix} & \times & \begin{bmatrix} b_{1, 1} & \dots & b_{1, p} \\ \vdots & \ddots & \vdots \\ b_{n, 1} & \dots & b_{n, p} \end{bmatrix} \end{array}$$

In this contraction, we'll select our contraction indices to be $(i, j)$ for matrix $A$ and $(j, k)$ for matrix $B$. That is, we will iterate over $i$, $j$, and $k$, and the pairs of indices will be those that match on the value of $j$. An example of paired indices for matrices $A$ and $B$, respectively, is $i = 1, j = 4, k = 7$, giving $(1, 4)$ and $(4, 7)$.

Demonstration of how tensor contraction keeps the second (middle) index invariant and loops over the outer indices in matrix multiplication.

Indeed, this tensor contraction between two order-2 tensors is just matrix multiplication! That said, tensor contractions give us ways to think about higher dimensional generalizations of matrix multiplication to tensors of orders beyond two.

Tensor contractions may occur within a tensor (combining two axes with paired indices by summing the products of their elements), or between two tensors (as we've seen in the case of the dot product and the matrix product). [6]

In deep learning, tensor contractions are usually tensor products between two distinct tensors; one index from each tensor is used to construct the contracted pair.

How do contractions affect tensor shape and order? For two input tensors of orders $n$ and $m$ over which a tensor contraction is performed, the order of the resulting tensor is $n + m - 2$. The contraction removes one "unit" of order by "zipping" indices via an elementwise product, and another unit by summing over those zipped indices [7]. We can illustrate this formula with the two canonical examples of contractions over two tensors which we've seen here, then with new higher-order examples (a short code sketch follows this list):

  • Vector dot product — two input vectors each have order one. Accordingly, the order of the resulting tensor is $n + m - 2 = 1 + 1 - 2 = 0$, which is indeed the case; the dot product of two vectors is a scalar, a zero-order tensor!
  • Matrix multiplication — two input matrices each have order two. The resulting matrix has order $n + m - 2 = 2 + 2 - 2 = 2$, which is the same order as the original matrices; indeed, matrix multiplication produces an output matrix of order two. In each case, the middle axes (that is, the second axis of the left-hand tensor and the first axis of the right-hand tensor), which must be equal in size, are removed. An $(n \times m)$ matrix multiplied by an $(m \times p)$ matrix produces an $(n \times p)$ matrix.
  • Higher-order tensor products — for tensors of higher orders, $n + m - 2$ still applies. Multiplying two order-3 ("3-dimensional") tensors, the resulting tensor will have order $3 + 3 - 2 = 4$, with the contracted axis vanishing from the final tensor's shape.
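
Here's a minimal numpy sketch of these three cases (shapes and values are arbitrary); np.einsum and np.tensordot make the contraction indices explicit:

```python
import numpy as np

# Vector dot product: orders 1 + 1 - 2 = 0 (a scalar).
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
print(np.dot(u, v))               # 32.0
print(np.einsum("i,i->", u, v))   # same contraction over the shared index i

# Matrix multiplication: orders 2 + 2 - 2 = 2; (m, n) x (n, p) -> (m, p).
A = np.random.rand(4, 5)
B = np.random.rand(5, 6)
print((A @ B).shape)                        # (4, 6)
print(np.einsum("ij,jk->ik", A, B).shape)   # (4, 6) -- j is the contraction index

# Higher-order contraction: orders 3 + 3 - 2 = 4.
X = np.random.rand(2, 3, 4)
Y = np.random.rand(4, 5, 6)
Z = np.tensordot(X, Y, axes=([2], [0]))   # contract X's size-4 axis with Y's
print(Z.shape)                            # (2, 3, 5, 6)
```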

Other Operations Affecting Shape and Ordering

Besides the tensor operations we've seen, other operations can affect the shape of tensors and the ordering of the data that composes them. While these operations might not change the underlying data itself, they affect the layout and organization of tensors, including the relative sizes of axes. Still other operations add new data to tensors as they change their shapes.

Our initial exploration of these operations will assume that tensors are represented contiguously in memory. In practice, this means that the elements of a tensor are sequentially stored, with no gaps. We'll revisit this assumption in the future.

Reshape

A reshape operation takes an input tensor $A$ and a target shape to which to change the tensor's dimensions; it preserves the ordering of the tensor's elements. The number of elements implied by the output shape must equal the number of elements in the input tensor, which implicitly restricts the set of possible output shapes.

Concretely, we can describe the space of valid reshape operations for a matrix $A$ with shape $(m, n)$. To start, consider a target shape of $(2m, \frac{n}{2})$, which has twice as many rows and half as many columns. The original matrix has $m \times n$ elements, and the resulting matrix also has $2m \times \frac{n}{2} = m \times n$ elements. More generally, the set of valid target shapes for a reshape to another order-2 tensor is described by $(km, \frac{n}{k})$ or $(\frac{m}{k}, kn)$ for some positive integer $k$. These two reshape operations are illustrated below, with the output matrices defined in terms of indices from the input matrix. The reshape to $(\frac{m}{k}, kn)$ can be thought of as "flattening" $k$ rows of $A$ into one, while the reshape to $(km, \frac{n}{k})$ can be thought of as splitting each row into $k$ parts, then stacking them.

$$\begin{array}{ccc} A ~ (m \times n) & \text{reshape to } (\frac{m}{k}, kn) & B ~ (\frac{m}{k}, kn) \\ & & \\ \begin{bmatrix} a_{1, 1} & \dots & a_{1, n} \\ \vdots & \ddots & \vdots \\ a_{m, 1} & \dots & a_{m, n} \end{bmatrix} & \rightarrow & \begin{bmatrix} a_{1, 1} & \dots & a_{1, n} & a_{2, 1} & \dots & a_{2, n} & a_{k, 1} & \dots & a_{k, n} \\ \vdots & & & & \ddots & & & & \vdots \\ a_{m - k + 1, 1} & \dots & a_{m - k + 1, n} & a_{m - 1, 1} & \dots & a_{m - 1, n} & a_{m, 1} & \dots & a_{m, n} \end{bmatrix} \end{array}$$

$$\begin{array}{ccc} A ~ (m \times n) & \text{reshape to } (km, \frac{n}{k}) & B ~ (km, \frac{n}{k}) \\ & & \\ \begin{bmatrix} a_{1, 1} & \dots & a_{1, n} \\ \vdots & \ddots & \vdots \\ a_{m, 1} & \dots & a_{m, n} \end{bmatrix} & \rightarrow & \begin{bmatrix} a_{1, 1} & \dots & a_{1, \frac{n}{k}} \\ a_{1, \frac{n}{k} + 1} & \dots & a_{1, \frac{2n}{k}} \\ \vdots & \ddots & \vdots \\ a_{1, n - \frac{n}{k} + 1} & \dots & a_{1, n} \\ a_{2, 1} & \dots & a_{2, \frac{n}{k}} \\ \vdots & \ddots & \vdots \\ a_{m, n - \frac{n}{k} + 1} & \dots & a_{m, n} \end{bmatrix} \end{array}$$

Notice that in the operations above, the relative ordering of the elements is kept invariant — the delineation of axes changes, affecting the number of elements along a particular axis.

Reshape operations can also result in tensors of different orders. We consider a few illustrative cases (a code sketch follows the list):

  • Flattening — an order-$n$ tensor with $N$ elements can be flattened into an order-1 tensor of shape $(N)$. This operation turns the tensor into a one-dimensional list of elements.
  • Increasing order — given an order-2 tensor of shape $(m, n)$, we can transform such a tensor into, for example, an order-3 tensor of shape $(\frac{m}{k}, k, n)$, which would break the tensor's first dimension (of size $m$) into $\frac{m}{k}$ groups of $k$-sized chunks along a new second dimension. We can both increase and decrease tensor order with reshape operations, arbitrarily.
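
A brief numpy sketch of reshapes (shapes arbitrary), showing that the element ordering is preserved while the axis delineation changes:

```python
import numpy as np

A = np.arange(12).reshape(6, 2)   # shape (6, 2), 12 elements in row-major order

# Valid reshapes preserve the total number of elements.
print(A.reshape(3, 4).shape)      # (3, 4)  -- "flatten" pairs of rows into one
print(A.reshape(12).shape)        # (12,)   -- flatten to an order-1 tensor
print(A.reshape(3, 2, 2).shape)   # (3, 2, 2) -- increase order

# The underlying ordering of elements is unchanged.
print(A.ravel())                  # [ 0  1  2 ... 11]
print(A.reshape(3, 4).ravel())    # identical ordering
```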

Indexing

Indexing refers to accessing particular subsets of tensor elements. We might access a particular element, but we might also want to access a subset of the original tensor's values along one or more axes (itself a lower-order tensor).

The indexing notation shown below has become relatively standard across many tensor library implementations, with minor differences. We index into a tensor using square brackets, as if we're indexing into a multi-dimensional array [8]. Each position in the bracketed index specification corresponds to an axis, which is typically zero-indexed, given that we're treating tensors as arrays when we represent them computationally. We use specific index values to index along a particular axis.

A diagram demonstrating the semantics for indexing into a 2D tensor, complete with the span operator.

Slicing refers to indexing along an axis using a range. That is, along one or more axes, a slice includes a contiguous range of indices. For example, given a tensor with shape $(m, n, k)$, we might consider an indexing slice $[:, 0:5, :]$, which takes a "slice" of size five (indices 0 through 4, as the end of the range is exclusive) along the second axis of the tensor. The other two axes are "spanned," meaning we want to access all values along those axes.

In mathematics, we often use Einstein notation for reasoning about axes and indices of tensors [9]. We'll use some of the conventional syntax from numpy here, as it's most easily applicable to implementations.
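
Here's a small numpy sketch of indexing, slicing, and spanning (the shape is arbitrary):

```python
import numpy as np

T = np.arange(2 * 4 * 6).reshape(2, 4, 6)   # shape (2, 4, 6)

# Index a single element with one (zero-indexed) index per axis.
print(T[1, 2, 3])

# Slice a contiguous range along the second axis; span the others with ':'.
# The end of the range is exclusive, so 0:3 selects indices 0, 1, and 2.
print(T[:, 0:3, :].shape)   # (2, 3, 6)

# Indexing an axis with a single integer removes that axis, yielding a
# lower-order tensor.
print(T[0].shape)      # (4, 6)
print(T[:, 1].shape)   # (2, 6)
```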

Operations that Modify Shape or Data

There are also operations which combine or expand tensors, or reorganize their data. These either combine multiple distinct tensors, expand a tensor by copying it in some fashion, or rearrange a tensor's axes. In what follows, we'll look at the transposition, concatenation, and tiling operations.

Transposition, Reordering, and Permutation

The reordering operation swaps one or more axes of a tensor. Reordering changes the order in which the elements of the tensor are stored in memory, which is something we'll revisit in lectures that follow.

Given a tensor with $n$ dimensions, the reordering operation takes a permutation of the dimensions and rearranges the tensor's axes according to that permutation. Typically, axes are referred to in a zero-indexed fashion.

As an example, let's consider an order-4 tensor $T$ whose axes we want to reorder. Axes in this case would be notated and indexed as $(0, 1, 2, 3)$. If we wanted to reorder the tensor with a full reversal, we would reorder with the axis argument $(3, 2, 1, 0)$, which implies that what was axis 3 becomes axis 0, what was axis 2 becomes axis 1, and so on.

If we wanted to reorder only the middle two axes of $T$, we might consider a reorder as $(0, 2, 1, 3)$. Here, we make the second axis of the new tensor the third axis (index 2) of the old tensor, and the third axis of the new tensor the second axis of the old tensor (index 1).

We can also consider a more basic case: transposition. In most settings, transposing a tensor refers to a reorder operation on an order-2 tensor in which the axes are flipped. This is equivalent to a reorder operation with the axis argument $(1, 0)$, which swaps the axes of the input matrix. A transpose operation is illustrated below.

Notice that the ordering of the indices is also swapped as it relates to mapping elements of the new tensor to that of the old. An element $a_{i, j}$ in the old tensor becomes $a_{j, i}$ in the new tensor.

In general, when we transpose axes of a tensor, the indices that refer to elements of the old tensor are also swapped. Consider our reorder operation $(0, 2, 1, 3)$ of our order-4 tensor $T$ above. The input tensor $T$ is described by elements like $t_{i, j, k, l}$, with four index subscripts. Elements in the new tensor would correspond to elements in the old tensor according to $t_{i, k, j, l}$; the indices are permuted in the same way as the axes.

Similarly, with the alternative indexing notation discussed previously, reordering swaps which axis each position in the bracketed index specification refers to.

Permutation is another way of describing reordering. With reordering we're permuting the axes of an input tensor.
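
A minimal numpy sketch of transposition and general axis permutation (shapes arbitrary):

```python
import numpy as np

# Transposition of an order-2 tensor: swap its two axes.
M = np.arange(6).reshape(2, 3)
print(M.T.shape)              # (3, 2)
print(M[0, 2] == M.T[2, 0])   # True -- indices swap along with the axes

# General reordering/permutation of the axes of an order-4 tensor.
T = np.random.rand(2, 3, 4, 5)
print(np.transpose(T, (3, 2, 1, 0)).shape)   # (5, 4, 3, 2) -- full reversal
print(np.transpose(T, (0, 2, 1, 3)).shape)   # (2, 4, 3, 5) -- swap middle axes
```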

Concatenation

Tensor concatenation refers to combining multiple tensors together along a certain axis. We might concatenate only two tensors, or arbitrarily many; we concatenate in a certain direction, as described by that axis.

Consider the example below, in which we have two tensors, $A$ and $B$. If we concatenate along the first axis, we simply stack the respective elements of each tensor along it. If we concatenate along the second axis, we concatenate in that direction.

Note that concatenated tensors must have the same shape except along the concatenation axis, since we're appending values along that axis and the resulting tensor must remain fully filled along every other axis.

Concatenation can be thought of as an opposite operation to slicing a tensor or splitting a tensor into chunks.
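
A short numpy sketch of concatenation along different axes, and of splitting as its inverse (shapes arbitrary):

```python
import numpy as np

A = np.ones((2, 3))
B = np.zeros((2, 3))

# Concatenate along the first axis: shapes must match along all other axes.
print(np.concatenate([A, B], axis=0).shape)   # (4, 3)

# Concatenate along the second axis.
print(np.concatenate([A, B], axis=1).shape)   # (2, 6)

# Splitting is the inverse: break a tensor back into chunks along an axis.
chunks = np.split(np.concatenate([A, B], axis=0), 2, axis=0)
print([c.shape for c in chunks])              # [(2, 3), (2, 3)]
```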

Tiling

The tiling operation repeats a tensor along one or more axes. That is, it concatenates copies of a tensor a specified number of times in a particular direction, resulting in a tensor whose shape is a constant multiple of the input shape along the axes over which tiling was performed.

In this example, we tile a matrix over one axis, then over two axes, by repeating its data. We can see that the shape of the resulting tensor grows only along the tiled axes, by a multiple equal to the number of times we've tiled. First, we tile a tensor with shape $(m, n)$ $b$ times over its second axis, resulting in a tensor of shape $(m, nb)$.

A diagram showing a matrix tiled over one axis.

Second, to demonstrate tiling a tensor over multiple axes simultaneously, we tile a tensor with shape $(m, n)$ $b$ times over each of its two axes, resulting in a tensor of shape $(mb, nb)$.

A diagram showing a matrix tiled over two axes.
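
A brief numpy sketch of both cases (with an arbitrary $b = 4$):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # shape (m, n) = (2, 3)

# Tile b = 4 times along the second axis only: (m, n) -> (m, n * b).
print(np.tile(A, (1, 4)).shape)   # (2, 12)

# Tile b = 4 times along both axes: (m, n) -> (m * b, n * b).
print(np.tile(A, (4, 4)).shape)   # (8, 12)
```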

Broadcasting

As an addendum to the previous sections discussing elementwise operations and contractions, there are many cases in which we may want to perform an operation on one or more tensors such that we apply that operation equally across many elements of a tensor. In these cases, we can broadcast an input tensor, implicitly expanding its shape so that it is compatible with the larger shape of another tensor. Broadcasting is often used for elementwise operations.

Broadcasting semantics are somewhat standardized; those in numpy are most often used, and we'll cover these below.

Elementwise operations on scalars — a simple, useful example of broadcasting is with scalars in elementwise operations. Consider a matrix $A$ of shape $(m, n)$. We often want to perform an operation with $A$ and a scalar: for example, adding 1 to each element of a matrix, $B = A + 1$. The scalar value 1 has order 0 and empty shape $()$, but we broadcast it to a tensor of shape $(m, n)$ filled with ones; we create two new dimensions from $()$ with sizes $m$ and $n$.

$$\begin{array}{ccc} A ~ (m \times n) & & \text{Broadcast 1 to } (m, n) \\ & & \\ \begin{bmatrix} a_{1, 1} & \dots & a_{1, n} \\ \vdots & \ddots & \vdots \\ a_{m, 1} & \dots & a_{m, n} \end{bmatrix} + 1 & \rightarrow & \begin{bmatrix} a_{1, 1} & \dots & a_{1, n} \\ \vdots & \ddots & \vdots \\ a_{m, 1} & \dots & a_{m, n} \end{bmatrix} + \begin{bmatrix} 1 & \dots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \dots & 1 \end{bmatrix} \end{array}$$

Tensors can also be broadcast up to shapes that add entire axes. This operation is similar to tiling but is implicit; we conceptually expand a tensor across one or more axes. We apply broadcasting with the following procedure:

  • We right-align the shapes of the two input tensors (which may have different orders); where one tensor has no dimension in a position, it is treated as having size 1.
  • Moving from right to left, we compare the aligned dimensions pair by pair. A pair of dimensions is eligible to be broadcast if:
    • the sizes of both dimensions are equal, or
    • the size of one of the dimensions is 1.
  • If we find two compatible and aligned dimensions, we continue leftwards to assess the remaining dimensions.

Diagram showing right-to-left dimensional comparison rules for broadcasting.

Concretely, consider a tensor $A$ with shape $(m, n, k)$ and another tensor $B$ with shape $(k)$. The operation $A + B$ can be performed based on the broadcasting rules; the rightmost dimensions are equal. We would add the $(k)$ elements of $B$ to each $k$-sized slice of $A$ (there are $m \times n$ of them, per the shape $(m, n)$ of the remaining axes). We would conceptually broadcast $B$ along the first two dimensions of $A$.

Diagram showing an example of tensor broadcasting with a 3D and 1D non-scalar tensor.

A classic example involves operating on tensors representing images, which we've seen have 3 values per pixel (RGB). An image $A$ with spatial dimensions $224 \times 224$ would thus be represented by a tensor of shape $(224, 224, 3)$. We may wish to apply an identical transformation $B$ to each pixel: that is, to the last dimension (of size 3) for each of the $224 \times 224$ indices of the outer axes. That is, we have a small operand of shape $(3)$.

As a final example, note that we can broadcast inner dimensions of size 1 that sit between full-sized dimensions. For example, a tensor of shape $(m, 1, k)$ can be broadcast up to a tensor of shape $(m, n, k)$ by (conceptually) tiling the tensor $n$ times along its second axis.
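
A short numpy sketch of these broadcasting cases (shapes arbitrary):

```python
import numpy as np

A = np.random.rand(2, 4, 3)   # shape (m, n, k)
b = np.random.rand(3)         # shape (k,)

# Right-aligned, the trailing dimensions match (3 vs 3), so b is conceptually
# expanded across the first two axes of A.
print((A + b).shape)          # (2, 4, 3)

# A scalar has empty shape and broadcasts against everything.
print((A + 1).shape)          # (2, 4, 3)

# An inner dimension of size 1 broadcasts up to match: (2, 1, 3) -> (2, 4, 3).
C = np.random.rand(2, 1, 3)
print((A * C).shape)          # (2, 4, 3)

# Incompatible shapes raise an error: (2, 4, 3) vs (4,) fails because the
# right-aligned trailing dimensions (3 and 4) differ and neither is 1.
```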

Broadcasting for tensor contractions — broadcasting is often used for multiplication of tensors that otherwise might not have compatible dimensions for multiplication. Consider two tensors $A$ and $B$, where we're interested in the matrix multiplication $C = AB$. We only care about the middle (contracted) dimensions of $A$ and $B$; broadcasting occurs behind the scenes with the other dimensions.

Let $A$ have shape $(s_1, \ldots, s_n)$ and $B$ have shape $(t_1, t_2)$, so that $A$ and $B$ have orders $n$ and 2, respectively. We have the shape multiplication $(s_1, \ldots, s_n) \times (t_1, t_2)$, so as long as $s_n = t_1$, we have proper shape compatibility. What about the other dimensions $(s_1, \ldots, s_{n - 2})$? We can see $A$ as many $(s_{n - 1}, s_n)$-shaped matrices (exactly $n_A = \prod_{i = 1}^{n - 2} s_i$ of them!). We thus perform $n_A$ multiplications of $(s_{n - 1}, s_n) \times (t_1, t_2)$, and the results, stacked along the other axes of $A$ ($(s_1, \ldots, s_{n - 2})$), form our result.

What if the tensor $B$ were also higher-dimensional? Let $B$ have order $m$ with shape $(t_1, \ldots, t_m)$. In this case, we do what we did with $A$ and view $B$ as many $(t_{m - 1}, t_m)$-shaped matrices, stacked $n_B$ times according to the other axes of $B$, $(t_1, \ldots, t_{m - 2})$ (where $n_B = \prod_{i = 1}^{m - 2} t_i$). But $(s_1, \ldots, s_{n - 2})$ and $(t_1, \ldots, t_{m - 2})$ can also be broadcast to match the sizes of the stacks of matrices! Thus, our matrix multiplication is between stacks (often called batches) of matrices of shapes $(s_{n - 1}, s_n)$ and $(t_{m - 1}, t_m)$.

For the full case of arbitrarily ordered tensors $A$ and $B$, the result is the general operation known as batched matrix multiplication (or BMM [10]).

$$\begin{array}{c} A \\ (s_1, \ldots, s_{n-2}, s_{n-1}, s_n) \end{array} \quad \times \quad \begin{array}{c} B \\ (t_1, \ldots, t_{m-2}, t_{m-1}, t_m) \end{array} \quad = \quad \begin{array}{c} C \\ \texttt{broadcast}(s_1, \ldots, s_{n-2}, t_1, \ldots, t_{m-2}) \times (s_{n-1}, t_m) \end{array}$$

Writing down the batch (stack of matrices) and matrix dimensions explicitly, we can generate the following table:

$$\begin{array}{l|c|c} & A & B \\ \text{Batch dims:} & (s_1, \ldots, s_{n-2}) & (t_1, \ldots, t_{m-2}) \\ \text{Matrix dims:} & (s_{n-1}, s_n) & (t_{m-1}, t_m) \\ \text{Constraint:} & s_n = t_{m-1} & \\ \text{Result:} & \texttt{broadcast}(s_1, \ldots, s_{n-2}, t_1, \ldots, t_{m-2}) \times (s_{n-1}, t_m) & \end{array}$$

This approach applies to contractions that combine multiple axes of a tensor; we simply view the dimensions being contracted as tiled or stacked along the other axes of their input tensors.
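
Here's a minimal numpy sketch of batched matrix multiplication with broadcast batch dimensions (shapes chosen arbitrarily); np.matmul (the @ operator) treats the last two axes as matrix dimensions and broadcasts the rest:

```python
import numpy as np

A = np.random.rand(7, 2, 4, 5)   # batch dims (7, 2), matrix dims (4, 5)
B = np.random.rand(2, 5, 6)      # batch dims (2,),  matrix dims (5, 6)

# Batch dims (7, 2) and (2,) broadcast to (7, 2); (4, 5) x (5, 6) -> (4, 6).
C = A @ B
print(C.shape)   # (7, 2, 4, 6)

# The same batched contraction written explicitly with einsum.
C2 = np.einsum("abij,bjk->abik", A, B)
print(np.allclose(C, C2))   # True
```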

Citation

@misc{kahn2024mlsystems,
  title        = {Machine Learning Systems},
  author       = {Jacob Kahn},
  year         = {2024},
  url          = {https://jacobkahn.me/writing/tag/machine_learning_systems/},
}

Footnotes

  1. Tuples can themselves be thought of as tensors; that is, the shape of a tensor could be represented as a 1-dimensional tensor. In some software implementations of tensors, asking for the shape of a tensor returns another tensor whose entries are the shape rather than a tuple.

  2. As scalars are order/rank zero tensors, their shapes contain zero elements.

  3. Tensor indexing is part of a larger set of indexing machinery that differs between covariant and contravariant tensors. Most tensors referred to in deep neural models are reasoned about in a covariant fashion; lower indices are thus conventionally used.

  4. Tensors are typically zero-indexed in mathematical notation.

  5. The tensor contraction, with the inner or dot product as its most basic form, can be thought of as a bilinear form that contracts dimensions, allowing values to intermix across dimensions.

  6. Matrix multiplication can also be viewed as a tensor product followed by a contraction. A tensor product between two matrices $A$ and $B$ with shapes $(m \times n)$ and $(n \times p)$, respectively, produces a tensor of shape $(m \times n \times n \times p)$ of order $2 + 2 = 4$; tensor products enumerate every combination of elements of the two input tensors. The tensor contraction over the two middle dimensions, eliminating them, results in an $(m, p)$ tensor.

  7. In tensor analysis, for two input tensors of orders $n$ and $m$, the order of the resulting tensor is technically $n + m - k$, where $k$ is the number of paired indices over which the contraction is performed (two per contracted pair). These operations are less common in deep learning and are thus omitted here.

  8. Syntactically, there are many different conventions for indexing tensors. The implementation in numpy is today often used outside of computational settings; we'll use it here as it translates best to computational implementations of these ideas.

  9. Einstein notation for tensors arose from the high-dimensional geometry with which the Theory of General Relativity was developed.

  10. BMM is implemented in commonly-used deep learning frameworks, for example, in PyTorch.