Notation
In this book we employ notation from a variety of mathematical disciplines,
summarized below. The reader is encouraged to consult this list as a reference while
reading the various chapters of the book. We have tried to keep the notation's usage
as consistent as possible across chapters.
As a first example, for a natural number \(n\) the set \(\{1, 2, \dots , n\}\) is denoted \([n]\).
Linear Algebra Notation
- Scalars (either deterministic or random) are non-bold, e.g., \(x\), \(X\).
- Vectors (either deterministic or random) are lower-case bold, e.g., \(\vx \), \(\vpi \).
Entries of vectors are denoted \(x_{i}\) or \(\pi _{i}\), alternatively \((\vx )_{i}\). Generic terms (which can
be scalars, vectors, matrices, or higher-order tensors) are also lower-case
bold.
- Matrices (either deterministic or random) are capital bold, e.g., \(\vX , \vPi \). Entries
of matrices are denoted \(X_{ij}\) or \(\Pi _{ij}\), alternatively \((\vX )_{ij}\). Columns of matrices are
lower-case bold vectors, e.g., \(\vx _{i}\) is the \(i\th \) column of \(\vX \), unless otherwise defined.
- The transpose of a matrix is denoted with a superscript \(\top \), e.g., \(\vA ^{\top }\). The adjoint
(conjugate transpose) with \(\adj \), e.g., \(\vA \adj \). The Moore-Penrose pseudo-inverse with \(\dagger \), e.g., \(\vA ^{\dagger }\).
- The rank of a matrix is \(\rank (\vA )\), the trace is \(\tr (\vA )\), the determinant is \(\det (\vA )\), and the
log-determinant is \(\logdet (\vA )\).
- The element-wise multiplication of matrices is \(\vA \hada \vB \). Element-wise squaring
is \(\vA ^{\hada 2}\), etc. Similarly, the Kronecker product of matrices is \(\vA \kron \vB \). The iterated
Kronecker product is \(\vA ^{\kron 2}\), etc.
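As a concrete illustration of these products, a short NumPy sketch (the matrices here are arbitrary examples, not from the book):

```python
import numpy as np

# Arbitrary example matrices.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

hadamard = A * B         # element-wise (Hadamard) product
hadamard_sq = A ** 2     # element-wise square
kron = np.kron(A, B)     # Kronecker product, shape (4, 4)
kron_sq = np.kron(A, A)  # iterated Kronecker product

assert hadamard.shape == (2, 2)
assert kron.shape == (4, 4)
```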
- For a function \(f \colon \R \to \R \), its element-wise application to a matrix \(\vX \) is \(f[\vX ]\), i.e., with
square brackets. For a function \(f \colon \R ^{d} \to \R ^{k}\) and a matrix \(\vX \in \R ^{d \times n}\), we may broadcast \(f\) to apply
to each column without special notation, i.e., \(f(\vX ) = [f(\vx _{1}), \dots , f(\vx _{n})] \in \R ^{k \times n}\).
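The distinction between element-wise application \(f[\vX ]\) and column-wise broadcast \(f(\vX )\) can be illustrated with a small NumPy sketch (the particular functions chosen here are arbitrary):

```python
import numpy as np

X = np.array([[1.0, 4.0], [9.0, 16.0]])  # columns are x_1, x_2

# f[X]: element-wise application of a scalar function f : R -> R.
f_elementwise = np.sqrt(X)

# f(X): column-wise broadcast of f : R^d -> R^k (here f sums a column, k = 1).
f = lambda x: np.array([x.sum()])
f_columnwise = np.stack([f(X[:, i]) for i in range(X.shape[1])], axis=1)

assert f_elementwise.shape == X.shape  # same shape as X
assert f_columnwise.shape == (1, 2)    # k x n
```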
- The (Euclidean) unit sphere is \(\Sphere ^{n - 1} \subseteq \R ^{n}\). The (Euclidean) unit ball is \(\Ball ^{n} \subseteq \R ^{n}\).
- The set of matrices with orthonormal columns is \(\O (m, n) \subseteq \R ^{m \times n}\), with \(\O (n) = \O (n, n)\) the square orthogonal matrices.
- Symmetric matrices are \(\Sym (n)\), symmetric PSD matrices are \(\PSD (n)\), symmetric PD
matrices are \(\PD (n)\).
- The eigenvalues of a symmetric matrix \(\vS \in \Sym (n)\) are all real by the spectral theorem;
they are denoted \(\lambda _{i}(\vS )\) and ordered such that \(\lambda _{1}(\vS ) \geq \cdots \geq \lambda _{n}(\vS )\).
- The singular values of \(\vA \in \R ^{m \times n}\) with \(\rank (\vA ) = r\) are denoted \(\sigma _{i}(\vA )\) and ordered such that \(\sigma _{1}(\vA ) \geq \cdots \geq \sigma _{r}(\vA ) > 0\). By
convention we set \(\sigma _{r + 1}(\vA ) = \sigma _{r + 2}(\vA ) = \cdots = 0\).
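These ordering conventions match what standard numerical libraries return; a NumPy sketch with an arbitrary example matrix:

```python
import numpy as np

A = np.array([[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # a rank-2, 2 x 3 matrix

s = np.linalg.svd(A, compute_uv=False)  # singular values, descending order
r = np.linalg.matrix_rank(A)

assert np.all(s[:-1] >= s[1:])  # sigma_1 >= sigma_2 >= ...
assert r == 2
# NumPy returns min(m, n) singular values; by the book's convention any
# beyond index r are set to zero (and are numerically zero here as well).
```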
- The projection onto a set \(\cK \) is denoted \(\proj _{\cK }\).
- The \(\ell ^{p}\) norms of vectors are denoted by \(\norm {\vx }_{p}\), \(p \in [0, \infty ]\) (for \(p \in [0, 1)\) these are not true norms, but we keep the same notation).
- The Euclidean operator norm on matrices is \(\norm {\vA } = \sigma _{1}(\vA )\).
- The Frobenius norm on matrices is \(\norm {\vA }_{F} = \sqrt {\tr (\vA ^{\top }\vA )}\).
- The Euclidean inner product of vectors is \(\ip {\vx }{\vy } = \vy ^{\top } \vx \).
- The Frobenius inner product on matrices is \(\ip {\vX }{\vY } = \tr (\vY ^{\top }\vX )\).
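The norm and inner-product identities above can be checked numerically; a NumPy sketch with arbitrary example inputs (note that the Frobenius norm is \(\sqrt {\tr (\vA ^{\top }\vA )}\), the square root of the Frobenius inner product of \(\vA \) with itself):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

op_norm = np.linalg.norm(A, 2)         # Euclidean operator norm = sigma_1(A)
fro_norm = np.sqrt(np.trace(A.T @ A))  # Frobenius norm
inner_xy = y.T @ x                     # Euclidean inner product <x, y>
inner_AA = np.trace(A.T @ A)           # Frobenius inner product <A, A>

assert np.isclose(op_norm, np.linalg.svd(A, compute_uv=False)[0])
assert np.isclose(fro_norm, np.linalg.norm(A, 'fro'))
assert np.isclose(inner_AA, fro_norm ** 2)
```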
- The all-ones vector/matrix is denoted \(\vone \) (the shape should be obvious from
context). Similarly, the all-zeros vector/matrix is denoted \(\vzero \). The identity
matrix is denoted \(\vI \).
Probability Notation
- The base probability measure is \(\Pr \), i.e., the probability of an event \(A\) occurring
is \(\Pr [A]\). We may specify the distribution of a random variable using a subscript,
i.e., \(\Pr _{\vx \sim \mu }[\vx \in S]\) describes the probabilities associated to a random variable \(\vx \) with
distribution (measure) \(\mu \). If two random variables \(\vx \) and \(\vx '\) have the same
distribution, we write \(\vx \equid \vx '\).
- The expected value operator is \(\Ex \), with the same subscript caveat.
- The covariance operator is \(\Cov \), with the same subscript caveat.
- The correlation operator is \(\Corr \), with the same subscript caveat.
- In certain probabilistic modeling contexts where a density is assumed
(especially in Chapters 3 and 7), we also use the notation \(p_{\vx }\) for the density
of a random variable \(\vx \). This extends to conditional densities, e.g., \(p_{\vx \mid \vy }\) for the
density of \(\vx \) conditioned on \(\vy \).
- We generally use Greek letters to denote realizations of random variables
(in contrast to the random variables themselves). For example, \(p_{\vx }(\vxi )\), \(p_{\vx \mid \vy }(\vxi \mid \vnu )\), \(\Ex [\vx \mid \vy =\vnu ]\), etc.
- The set of probability distributions on a finite set \(\cX \) is denoted \(\Delta (\cX )\). (For \(\cX = [n]\), this
is the probability simplex, which can be embedded in \(\R ^{n}\).)
- The Gaussian distribution with mean \(\vmu \) and covariance \(\vSigma \) is denoted \(\dNorm (\vmu , \vSigma )\).
- The uniform distribution over a compact set \(\cX \) is denoted \(\dUnif (\cX )\).
Machine Learning Notation
- Encoders and denoisers are usually denoted by \(f\), with parameters \(\theta \in \Theta \). We
usually write the features as \(\vz = f_{\theta }(\vx )\).
- Decoders are usually denoted by \(g\), with parameters \(\eta \). We usually write the
auto-encoding of \(\vx \) as \(\hat {\vx } = g_{\eta }(\vz )\).
- When \(f\) and \(g\) are implemented as neural networks, they will have the same
number of layers without loss of generality; we write them as \(f = f^{L} \circ f^{L - 1} \circ \cdots \circ f^{2} \circ f^{1} \circ f^{\pre }\) and \(g = g^{\post } \circ g^{L} \circ g^{ L - 1} \circ \cdots \circ g^{2} \circ g^{1}\).
- We write the features at the input to layer \(\ell \) as \(\vz ^{\ell }\), so that \(\vz ^{\ell + 1} = f^{\ell }(\vz ^{\ell })\) with \(\vz ^{1} = f^{\pre }(\vx )\) and \(\vz = \vz ^{L + 1}\).
Similarly, the autoencoding features are \(\hat {\vx }^{\ell }\), with \(\hat {\vx }^{\ell + 1} = g^{\ell }(\hat {\vx }^{\ell })\), \(\hat {\vx }^{1} = \vz \), and \(\hat {\vx } = g^{\post }(\hat {\vx }^{L + 1})\).
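The layer-indexing convention for the encoder can be sketched in code (a minimal illustration only; the layer maps below are placeholder functions, not actual network layers):

```python
# Placeholder layer maps standing in for f^pre, f^1, ..., f^L (here L = 2).
f_pre = lambda x: x           # f^pre: preprocessing
layers = [lambda z: z + 1.0,  # f^1
          lambda z: 2.0 * z]  # f^2

z = f_pre(3.0)                # z^1 = f^pre(x)
zs = [z]
for f_ell in layers:          # z^{l+1} = f^l(z^l)
    z = f_ell(z)
    zs.append(z)
# After the loop, z = z^{L+1} holds the features.
```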
- For denoising, optimization, and other processes with a continuous time index,
the time is almost always a subscript, e.g., \(\vx _{t}\). For discrete sequences, the
index may appear as either a superscript or a subscript.
Modeling and Optimization Notation
- As in the Probability and Machine Learning Notation sections above, when
we have a model, say for the realizations of a data distribution supported
on \(\R ^D\), we always denote it with “plain” (unaccented) variables, say \(\vx \).
- In cases where we assume our data \(\vx \) is generated by a model from a
restricted class of parametric models, say with a parameter \(\vmu \), we also denote
the true parameter with plain (unaccented) notation.
- Decision variables (or learnable parameters) of a model fit to observed
data are denoted with “tilde” accenting, e.g., \(\tilde {\vx }\), in cases where there is an
underlying model \(\vx \), or with plain notation in cases where no confusion is
possible (e.g., if an empirically-fit model for nonparametric data \(\vx \) involves
“mean” parameters \(\vmu \)).
- Optimal solutions to optimization problems are denoted with “star”
accenting, either as a superscript or a subscript (say \(\vx _\star \) or \(\vx ^\star \)).
- Estimators for data that correspond to a specific statistical model,
especially for the mean of that statistical model, are denoted with “bar”
accenting, say \(\bar {\vx }\) (especially for minimum mean-squared error denoising in
Chapter 3).
- Approximations associated to computational procedures for modeling data
are denoted with “hat” accenting, say \(\hat {\vx }\) for the output of an autoencoder or
a generative model that approximates data \(\vx \).