Notation
In this book we employ notation from a variety of mathematical disciplines,
summarized below. The reader is encouraged to consult this list as a reference while
reading the various chapters of the book. We have tried to keep the notation's usage
as consistent as possible across chapters.
As a first example, for a natural number \(n\) the set \(\{1, 2, \dots , n\}\) is denoted \([n]\).
Linear Algebra Notation
- Scalars (either deterministic or random) are non-bold, e.g., \(x\), \(X\).
- Vectors (either deterministic or random) are lower-case bold, e.g., \(\vx \), \(\vpi \).
Entries of vectors are denoted \(x_{i}\) or \(\pi _{i}\), alternatively \((\vx )_{i}\). Generic terms (which can
be scalars, vectors, matrices, or higher-order tensors) are also lower-case
bold.
- Matrices (either deterministic or random) are capital bold, e.g., \(\vX , \vPi \). Entries
of matrices are denoted \(X_{ij}\) or \(\Pi _{ij}\), alternatively \((\vX )_{ij}\). Columns of matrices are
lower-case bold vectors, e.g., \(\vx _{i}\) is the \(i\th \) column of \(\vX \), unless otherwise defined.
- The transpose of a matrix is denoted with a superscript \(\top \), e.g., \(\vA ^{\top }\). The adjoint
(conjugate transpose) with \(\adj \), e.g., \(\vA \adj \). The Moore-Penrose pseudo-inverse with \(\dagger \), e.g., \(\vA ^{\dagger }\).
- The rank of a matrix is \(\rank (\vA )\), the trace is \(\tr (\vA )\), the determinant is \(\det (\vA )\), and the
log-determinant is \(\logdet (\vA )\).
- The element-wise multiplication of matrices is \(\vA \hada \vB \). Element-wise squaring
is \(\vA ^{\hada 2}\), etc. Similarly, the Kronecker product of matrices is \(\vA \kron \vB \). The iterated
Kronecker product is \(\vA ^{\kron 2}\), etc.
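As a concrete illustration of these products, a short NumPy sketch (the matrices here are arbitrary examples, not from the book):

```python
import numpy as np

# Arbitrary example matrices.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

hadamard = A * B         # element-wise (Hadamard) product
hadamard_sq = A ** 2     # element-wise square
kron = np.kron(A, B)     # Kronecker product, shape (4, 4)
kron_sq = np.kron(A, A)  # iterated Kronecker product

assert hadamard.shape == (2, 2)
assert kron.shape == (4, 4)
```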
- For a function \(f \colon \R \to \R \), its element-wise application to a matrix \(\vX \) is \(f[\vX ]\), i.e., with
square brackets. For a function \(f \colon \R ^{d} \to \R ^{k}\) and a matrix \(\vX \in \R ^{d \times n}\), we may broadcast \(f\) to apply
to each column without special notation, i.e., \(f(\vX ) = [f(\vx _{1}), \dots , f(\vx _{n})] \in \R ^{k \times n}\).
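The distinction between element-wise application \(f[\vX ]\) and column-wise broadcast \(f(\vX )\) can be illustrated with a small NumPy sketch (the particular functions chosen here are arbitrary):

```python
import numpy as np

X = np.array([[1.0, 4.0], [9.0, 16.0]])  # columns are x_1, x_2

# f[X]: element-wise application of a scalar function f : R -> R.
f_elementwise = np.sqrt(X)

# f(X): column-wise broadcast of f : R^d -> R^k (here f sums a column, k = 1).
f = lambda x: np.array([x.sum()])
f_columnwise = np.stack([f(X[:, i]) for i in range(X.shape[1])], axis=1)

assert f_elementwise.shape == X.shape  # same shape as X
assert f_columnwise.shape == (1, 2)    # k x n
```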
- The (Euclidean) unit sphere is \(\Sphere ^{n - 1} \subseteq \R ^{n}\). The (Euclidean) unit ball is \(\Ball ^{n} \subseteq \R ^{n}\).
- The set of matrices with orthonormal columns is \(\O (m, n) \subseteq \R ^{m \times n}\), with \(\O (n) = \O (n, n)\) the square orthogonal matrices.
- Symmetric matrices are \(\Sym (n)\), symmetric PSD matrices are \(\PSD (n)\), symmetric PD
matrices are \(\PD (n)\).
- The eigenvalues of a symmetric matrix \(\vS \in \Sym (n)\) are all real by the spectral theorem;
they are denoted \(\lambda _{i}(\vS )\) and ordered such that \(\lambda _{1}(\vS ) \geq \cdots \geq \lambda _{n}(\vS )\).
- The singular values of \(\vA \in \R ^{m \times n}\) with \(\rank (\vA ) = r\) are denoted \(\sigma _{i}(\vA )\) and ordered such that \(\sigma _{1}(\vA ) \geq \cdots \geq \sigma _{r}(\vA ) > 0\). By
convention we set \(\sigma _{r + 1}(\vA ) = \sigma _{r + 2}(\vA ) = \cdots = 0\).
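These ordering conventions match what standard numerical libraries return; a NumPy sketch with an arbitrary example matrix:

```python
import numpy as np

A = np.array([[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # a rank-2, 2 x 3 matrix

s = np.linalg.svd(A, compute_uv=False)  # singular values, descending order
r = np.linalg.matrix_rank(A)

assert np.all(s[:-1] >= s[1:])  # sigma_1 >= sigma_2 >= ...
assert r == 2
# NumPy returns min(m, n) singular values; by the book's convention any
# beyond index r are set to zero (and are numerically zero here as well).
```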
- The projection onto a set \(\cK \) is denoted \(\proj _{\cK }\).
- The \(\ell ^{p}\) norms of vectors are denoted by \(\norm {\vx }_{p}\), \(p \in [0, \infty ]\) (for \(p \in [0, 1)\) these are not true norms, but we keep the same notation).
- The Euclidean operator norm on matrices is \(\norm {\vA } = \sigma _{1}(\vA )\).
- The Frobenius norm on matrices is \(\norm {\vA }_{F} = \sqrt {\tr (\vA ^{\top }\vA )}\).
- The Euclidean inner product of vectors is \(\ip {\vx }{\vy } = \vy ^{\top } \vx \).
- The Frobenius inner product on matrices is \(\ip {\vX }{\vY } = \tr (\vY ^{\top }\vX )\).
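The norm and inner-product identities above can be checked numerically; a NumPy sketch with arbitrary example inputs (note that the Frobenius norm is \(\sqrt {\tr (\vA ^{\top }\vA )}\), the square root of the Frobenius inner product of \(\vA \) with itself):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

op_norm = np.linalg.norm(A, 2)         # Euclidean operator norm = sigma_1(A)
fro_norm = np.sqrt(np.trace(A.T @ A))  # Frobenius norm
inner_xy = y.T @ x                     # Euclidean inner product <x, y>
inner_AA = np.trace(A.T @ A)           # Frobenius inner product <A, A>

assert np.isclose(op_norm, np.linalg.svd(A, compute_uv=False)[0])
assert np.isclose(fro_norm, np.linalg.norm(A, 'fro'))
assert np.isclose(inner_AA, fro_norm ** 2)
```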
- The all-ones vector/matrix is denoted \(\vone \) (the shape should be obvious from
context). Similarly, the all-zeros vector/matrix is denoted \(\vzero \). The identity
matrix is denoted \(\vI \).
Probability Notation
- The base probability measure is \(\Pr \), i.e., the probability of an event \(A\) occurring
is \(\Pr [A]\). We may specify the distribution of a random variable using a subscript,
i.e., \(\Pr _{\vx \sim \mu }[\vx \in S]\) describes the probabilities associated to a random variable \(\vx \) with
distribution (measure) \(\mu \). If two random variables \(\vx \) and \(\vx '\) have the same
distribution, we write \(\vx \equid \vx '\).
- The expected value operator is \(\Ex \), with the same subscript caveat.
- The covariance operator is \(\Cov \), with the same subscript caveat.
- The correlation operator is \(\Corr \), with the same subscript caveat.
- In certain probabilistic modeling contexts where a density is assumed
(especially in Chapters 3 and 7), we also use the notation \(p_{\vx }\) for the density
of a random variable \(\vx \). This extends to conditional densities, e.g., \(p_{\vx \mid \vy }\) for the
density of \(\vx \) conditioned on \(\vy \).
- We generally use Greek letters to denote realizations of random variables
(in contrast to the random variables themselves). For example, \(p_{\vx }(\vxi )\), \(p_{\vx \mid \vy }(\vxi \mid \vnu )\), \(\Ex [\vx \mid \vy =\vnu ]\), etc.
- The set of probability distributions on a finite set \(\cX \) is denoted \(\Delta (\cX )\). (For \(\cX = [n]\), this
is the probability simplex, which can be embedded in \(\R ^{n}\).)
- The Gaussian distribution with mean \(\vmu \) and covariance \(\vSigma \) is denoted \(\dNorm (\vmu , \vSigma )\).
- The uniform distribution over a compact set \(\cX \) is denoted \(\dUnif (\cX )\).
Machine Learning Notation
- Encoders and denoisers are usually denoted by \(f\), with parameters \(\theta \in \Theta \). We
usually write the features as \(\vz = f_{\theta }(\vx )\).
- Decoders are usually denoted by \(g\), with parameters \(\eta \). We usually write the
auto-encoding of \(\vx \) as \(\hat {\vx } = g_{\eta }(\vz )\).
- When \(f\) and \(g\) are implemented as neural networks, they will have the same
number of layers without loss of generality; we write them as \(f = f^{L} \circ f^{L - 1} \circ \cdots \circ f^{2} \circ f^{1} \circ f^{\pre }\) and \(g = g^{\post } \circ g^{L} \circ g^{ L - 1} \circ \cdots \circ g^{2} \circ g^{1}\).
- We write the features at the input to layer \(\ell \) as \(\vz ^{\ell }\), so that \(\vz ^{\ell + 1} = f^{\ell }(\vz ^{\ell })\) with \(\vz ^{1} = f^{\pre }(\vx )\) and \(\vz = \vz ^{L + 1}\).
Similarly, the autoencoding features are \(\hat {\vx }^{\ell }\), with \(\hat {\vx }^{\ell + 1} = g^{\ell }(\hat {\vx }^{\ell })\), \(\hat {\vx }^{1} = \vz \), and \(\hat {\vx } = g^{\post }(\hat {\vx }^{L + 1})\).
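The layer-indexing convention for the encoder can be sketched in code (a minimal illustration only; the layer maps below are placeholder functions, not actual network layers):

```python
# Placeholder layer maps standing in for f^pre, f^1, ..., f^L (here L = 2).
f_pre = lambda x: x           # f^pre: preprocessing
layers = [lambda z: z + 1.0,  # f^1
          lambda z: 2.0 * z]  # f^2

z = f_pre(3.0)                # z^1 = f^pre(x)
zs = [z]
for f_ell in layers:          # z^{l+1} = f^l(z^l)
    z = f_ell(z)
    zs.append(z)
# After the loop, z = z^{L+1} holds the features.
```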
- For denoising, optimization, and other processes with a continuous time index,
the time is almost always a subscript, e.g., \(\vx _{t}\). For discrete sequences, the
index may appear as either a superscript or a subscript.
Modeling and Optimization Notation
- As in the Probability and Machine Learning Notation sections above, when
we have a model, say for the realizations of a data distribution supported
on \(\R ^D\), we always denote it with “plain” (unaccented) variables, say \(\vx \).
- In cases where we assume our data \(\vx \) is generated by a model from a
restricted class of parametric models, say with a parameter \(\vmu \), we also denote
the true parameter with plain (unaccented) notation.
- Decision variables (or learnable parameters) of a model fit to observed
data are denoted with “tilde” accenting, e.g., \(\tilde {\vx }\), in cases where there is an
underlying model \(\vx \), or with plain notation in cases where no confusion is
possible (e.g., if an empirically-fit model for nonparametric data \(\vx \) involves
“mean” parameters \(\vmu \)).
- Optimal solutions to optimization problems are denoted with “star”
accenting, either as a superscript or a subscript (say \(\vx _\star \) or \(\vx ^\star \)).
- Estimators for data that correspond to a specific statistical model,
especially for the mean of that statistical model, are denoted with “bar”
accenting, say \(\bar {\vx }\) (especially for minimum mean-squared error denoising in
Chapter 3).
- Approximations associated to computational procedures for modeling data
are denoted with “hat” accenting, say \(\hat {\vx }\) for the output of an autoencoder or
a generative model that approximates data \(\vx \).