This document contains review notes ($\text{Full Version}$) for the Tsinghua University course "Pattern Recognition and Machine Learning" (模式识别与机器学习).
Evaluation Metric
k-NN
Nearest Neighbor
For a new instance $x'$, its class $\omega'$ can be predicted by:
k-Nearest Neighbor
For a new instance $x$, define $g_i(x)$ as: the number of $x$’s k-nearest instances belonging to the class $\omega_i$.
Then the new instance's class $\omega'$ can be predicted as:
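A minimal NumPy sketch of this majority-vote rule (the names `X_train`, `y_train`, `x`, and `k` are illustrative, not from the notes):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Predict the class of x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)        # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest instances
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # omega' = argmax_i g_i(x)
```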
k-NN Improvements
Branch and Bound Algorithm
Use a tree structure to reduce the amount of distance computation.
Edited Nearest Neighbor
Delete instances that may be misleading from the training instance set.
Condensed Nearest Neighbor
Delete instances that are far away from the decision boundary.
The Curse of Dimensionality
Problem
- Many irrelevant attributes
- In high-dimensional spaces, most points are equally far from each other.
Solution
- Dimensionality reduction techniques
- manifold learning
- Feature selection
- Use prior knowledge
Linear Regression (Multivariate ver.)
For multivariate linear regression, the model becomes $y_i = \mathbf{w}^{\rm T}\mathbf{x}_i$, where:
We adjust the values of $\mathbf{w}$ to find the best-fitting function $f(\mathbf x) = \mathbf{w}^{\rm T}\mathbf{x}$.
We find the best $\mathbf{w}^*$ using the Mean Squared Loss:
Setting the gradient to zero, $\mathbf{w}^{\star}$ must satisfy $\mathbf {X}^{\rm T} \mathbf {X}\mathbf w^{\star} = \mathbf X^{\rm T}\mathbf y$, so we get $\mathbf{w}^{\star} = (\mathbf {X}^{\rm T}\mathbf X)^{-1}\mathbf X^{\rm T}\mathbf y$, or $\mathbf{w}^{\star} = (\mathbf {X}^{\rm T}\mathbf X + \lambda \mathbf I)^{-1}\mathbf X^{\rm T}\mathbf y$ with an $\ell_2$ penalty (Ridge Regression).
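A NumPy sketch of this closed-form solution; the design matrix `X` is assumed to already contain a bias column, and `ridge_lambda` selects between plain least squares and ridge regression:

```python
import numpy as np

def fit_linear_regression(X, y, ridge_lambda=0.0):
    """Solve (X^T X + lambda I) w = X^T y; ridge_lambda=0 gives ordinary least squares."""
    d = X.shape[1]
    A = X.T @ X + ridge_lambda * np.eye(d)
    # Solving the linear system is numerically preferable to forming an explicit inverse.
    return np.linalg.solve(A, X.T @ y)
```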
Linear Discriminant Analysis
Project the input vector $\mathbf x \in \mathbb{R}^{d+1}$ down to a 1-dimensional subspace with the projection vector $\mathbf w$.
The problem is how to find a good projection vector. Fisher's criterion is to maximize a function of the difference between the class means, normalized by a measure of the within-class scatter.
We have between-class scatter $\tilde{S}_b = (\tilde{m}_1 - \tilde{m}_2)^2$, where $\tilde{m}_i$ is the mean for the i-th class. Also we have within-class scatter $\tilde{S}_i=\sum_{y_j \in \mathcal{Y}_{i}} (y_j - \tilde{m}_i)^2$, and the total within-class scatter $\tilde{S}_w = \tilde{S}_1+ \tilde{S}_2$. Combining the two expressions, the objective function will be $J_F(\mathbf w) = \frac{\tilde{S}_b}{\tilde{S}_w}$
We have $\tilde{S}_b = (\tilde{m}_1 - \tilde{m}_2)^2 = (\mathbf w^{\rm T} \mathbf m_1 - \mathbf w^{\rm T} \mathbf m_2)^2 = \mathbf w^{\rm T} (\mathbf m_1 - \mathbf m_2)(\mathbf m_1 - \mathbf m_2)^{\rm T} \mathbf w = \mathbf w^{\rm T} \mathbf S_b \mathbf w$, also $\tilde{S}_w = \mathbf w^{\rm T} \mathbf S_w \mathbf w$, so now optimize objective function $J_F$ w.r.t $\mathbf w$:
Using the Lagrange multiplier method we obtain $\lambda \mathbf w^{\star} = \mathbf{S}_w^{-1} (\mathbf m_1 - \mathbf m_2)(\mathbf m_1 - \mathbf m_2)^{\rm T}\mathbf w^{\star}$; since we only care about the direction of $\mathbf w^{\star}$ and $(\mathbf m_1 - \mathbf m_2)^{\rm T}\mathbf w^{\star}$ is a scalar, we obtain $\mathbf w^{\star} = \mathbf{S}_w^{-1} (\mathbf m_1 - \mathbf m_2)$
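A sketch of this two-class Fisher direction; `X1` and `X2` are assumed to hold the samples of the two classes, one sample per row:

```python
import numpy as np

def fisher_direction(X1, X2):
    """w* = S_w^{-1} (m_1 - m_2), the direction maximizing Fisher's criterion."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two classes' scatter matrices
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S_w, m1 - m2)
```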
Logistic Regression
Logistic regression is a statistical method used for binary classification, which means it is used to predict the probability of one of two possible outcomes. Unlike linear regression, which predicts a continuous output, logistic regression predicts a discrete outcome (0 or 1, yes or no, true or false, etc.).
Key Concepts
Odds and Log-Odds:
- Odds: The odds of an event are the ratio of the probability that the event will occur to the probability that it will not occur.
- Log-Odds (Logit): The natural logarithm of the odds.
Logistic Function (Sigmoid Function):
- The logistic function maps any real-valued number into the range (0, 1), making it suitable for probability predictions.
- In logistic regression, $ z $ is a linear combination of the input features.
Model Equation:
- The probability of the positive class (e.g., $ y=1 $) is given by the logistic function applied to the linear combination of the features.
- The probability of the negative class (e.g., $ y=0 $) is:
Decision Boundary:
- To make a binary decision, we typically use a threshold (commonly 0.5). If $ P(y=1|x) $ is greater than 0.5, we predict the positive class; otherwise, we predict the negative class.
Training the Model
We use MLE (Maximum Likelihood Estimation) for logistic regression:
Taking the negative logarithm of the likelihood function, we obtain the negative log-likelihood loss for logistic regression:
Substituting $y_i \in \{0, +1\}$ with $\tilde y_i \in \{-1, +1\}$, and noting that $\theta(-s) + \theta(s) = 1$, we can simplify the previous expression:
This is called the Cross Entropy Loss.
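A minimal gradient-descent sketch for minimizing this cross-entropy loss with labels $y_i \in \{0, 1\}$; the learning rate and iteration count are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Minimize the mean cross-entropy loss by full-batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)            # P(y=1 | x) for every sample
        grad = X.T @ (p - y) / n      # gradient of the mean cross-entropy loss
        w -= lr * grad
    return w
```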
Generalization to K-classes
The generalized version of logistic regression is called Softmax Regression.
The probability of an input $x$ being class $k$ is denoted as:
In multiclass, the likelihood function can be written as:
We can minimize the negative log-likelihood:
Perceptron
We predict based on the sign of $y$: $y = \text{sign}(f_{\mathbf w}(x)) = \text{sign}(\mathbf w^{\rm T}\mathbf x)$
For Perceptron the objective loss function is defined as:
where $\mathcal{X}^k$ is the misclassified sample set at step $k$.
We can use gradient descent to solve for $\mathbf w^*$:
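A sketch of the resulting perceptron update, cycling over misclassified samples with labels $y_i \in \{-1, +1\}$ (learning rate and epoch cap are illustrative):

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, n_epochs=100):
    """Stochastic updates: w <- w + lr * y_i * x_i for each misclassified sample."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:    # misclassified (or exactly on the boundary)
                w += lr * yi * xi
                mistakes += 1
        if mistakes == 0:             # converged: every sample classified correctly
            break
    return w
```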
Support Vector Machine
We want the optimal linear separator, i.e., the classifier most robust to noisy data, which is the one with the largest margin to the training data.
Modeling (For Linearly Separable Problems)
We want the margin to be as large as possible: $\max\limits_{\mathbf w, b}\rho(\mathbf w, b)$, while all data points are classified correctly, that is $y_i \cdot (\mathbf w^{\rm T}\mathbf x_i + b) \geq 1$.
The distance between two parallel hyperplanes $\mathbf w^{\rm T}\mathbf x + b_1 = 0$ and $\mathbf w^{\rm T}\mathbf x + b_2 = 0$ is $|b_1 - b_2| / \|\mathbf w\|$, and the distance between a point $\mathbf x_0$ and a hyperplane $(\mathbf w, b)$ is $|\mathbf w^{\rm T} \mathbf x_0 + b| / \|\mathbf w\|$.
Choose the points that are closest to the classifier; they satisfy $|\mathbf w^{\rm T} \mathbf x_0 + b| = 1$, so the margin is $\rho = |\mathbf w^{\rm T} \mathbf x_1 + b| / \|\mathbf w\| + |\mathbf w^{\rm T} \mathbf x_2 + b| / \|\mathbf w\| = 2 / \|\mathbf w\|$.
Thus we obtain the Hard-margin Support Vector Machine:
s.t. $y_i \cdot (\mathbf w^{\rm T}\mathbf x_i + b) \geq 1, 1 \leq i \leq n$
For computational convenience, we convert it into
s.t. $y_i \cdot (\mathbf w^{\rm T}\mathbf x_i + b) \geq 1, 1 \leq i \leq n$
Modeling (For Linearly Non-Separable Problems)
We add slack variables that allow points to be classified on the wrong side of the decision boundary, together with a penalty term. This gives the Soft-margin SVM:
s.t. $y_i \cdot (\mathbf w^{\rm T}\mathbf x_i + b) \geq 1 - \xi_i, 1 \leq i \leq n$
Using hinge-loss $\ell_{\text{hinge}}(t) = \max(1-t, 0)$, we have the final version of Soft-margin SVM:
Optimization For Training
Lagrangian Function & KKT Condition
Consider a constrained optimization problem
The Lagrangian function $L(x, \mu)$ is defined as:
We have the KKT conditions (necessary conditions): for $1 \leq j \leq J$
- Primal feasibility: $g_j(x^*) \leq 0$
- Dual feasibility: $\mu_j \geq 0$
- Complementary slackness: $\mu_j g_j(x^*) = 0$
- Lagrangian optimality: $\nabla_x L(x^*, \mu) = 0$
Dual Problem For Soft-margin SVM
For Soft-margin Support Vector Machine:
s.t. $y_i \cdot (\mathbf w^{\rm T}\mathbf x_i + b) \geq 1 - \xi_i, \xi_i \geq 0, 1 \leq i \leq n$
We have the Lagrangian function (with $2n$ inequality constraints):
s.t. $\alpha_i \geq 0, \mu_i \geq 0, \, i = 1, \ldots, n$.
Take the partial derivatives of the Lagrangian w.r.t. $\mathbf w, b, \xi_i$ and set them to zero.
We obtain:
So we have the Dual Problem of Soft-SVM:
s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C, \, i = 1, \ldots, n.$
After solving for $\alpha$, we can recover $\mathbf{w} = \sum_{j=1}^n\alpha_j y_j \mathbf x_j$, and obtain $b$ from any support vector with $0 < \alpha_i < C$.
Kernel Method for SVM
A linear SVM cannot handle linearly non-separable data. So we map the original feature space to a higher-dimensional feature space where the training set is separable.
Basically we could map $x \to \phi(x)$, but computing the inner products $\phi(x_i)^{\rm T} \phi(x_j)$ explicitly is expensive, so we use the kernel trick: find a function $k(x_i, x_j) = \phi(x_i)^{\rm T} \phi(x_j)$.
Some commonly used kernels:
Linear Kernel:
Polynomial Kernel:
Radial Basis Function Kernel (a.k.a. RBF kernel, Gaussian kernel):
Sigmoid Kernel:
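Sketches of these kernel functions; the hyperparameters `degree`, `gamma`, and `coef0` are illustrative defaults:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    return (x @ z + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma=0.1, coef0=0.0):
    # Note: the sigmoid kernel is not positive semi-definite for all parameter choices.
    return np.tanh(gamma * (x @ z) + coef0)
```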
Kernel tricks can also be applied to more algorithms, such as k-NN, LDA, etc.
Decision Tree
We use a tree-like structure to deal with categorical features.
For each node, we find the most useful feature, i.e., the feature that best divides the data at that node.
ID3 Algorithm
We use entropy as criterion:
A good split gives minimal weighted average entropy of child nodes:
For any split, the entropy of the parent node is constant. Minimizing the weighted entropy of the child nodes is therefore equivalent to maximizing the information gain (IG):
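A sketch of the entropy and information-gain computation used by ID3; labels and feature values are assumed to be NumPy arrays of categorical values:

```python
import numpy as np

def entropy(labels):
    """H(D) = - sum_c p_c log2 p_c over the class proportions at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    """IG = H(parent) - weighted average entropy of the children induced by the feature."""
    n = len(labels)
    child_entropy = 0.0
    for v in np.unique(feature_values):
        mask = feature_values == v
        child_entropy += mask.sum() / n * entropy(labels[mask])
    return entropy(labels) - child_entropy
```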
C4.5 Algorithm
Information Gain is highly biased to multivalued features. So we use Information Gain Ratio (GR) to choose optimal feature:
Intrinsic Value (IV) is to punish multivalued features. For a selected feature $f$, its Intrinsic Value is:
where $V$ is the set of all possible values of the feature $f$, and $F_k$ is the subset of $D$ where the value of the feature $f$ is $k$. Features with many possible values tend to have a large Intrinsic Value.
Classification and Regression Tree (CART)
The CART tree must be a binary tree.
Regression Tree
How to divide the regions $R = \{R_1, \dots, R_m\}$ and decide the values $V = \{v_1, \dots, v_m\}$?
We use the minimum mean-squared error over all examples $x_i$ with labels $y_i$:
Assume that $R$ has been determined and first find the optimal $V$. For a given region $R_j$, the value $v_j$ that minimizes the loss is the average of the labels of all samples belonging to region $R_j$:
Now for each feature $A$ and split threshold $a$, the parent node $R$ is split by $(A, a)$ into $R_1$ and $R_2$. We choose $(A, a)$ over all possible values to minimize:
where $v_1(A, a)$ and $v_2(A, a)$ are described above.
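A brute-force sketch of this split search for one node, minimizing the summed squared error of the two children (feature/threshold names are illustrative):

```python
import numpy as np

def sse(y):
    """Sum of squared errors when a region predicts the mean label v_j."""
    return np.sum((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(X, y):
    """Return the (feature, threshold) pair minimizing SSE(R_1) + SSE(R_2)."""
    best = (None, None, np.inf)
    for A in range(X.shape[1]):
        for a in np.unique(X[:, A]):
            left, right = y[X[:, A] <= a], y[X[:, A] > a]
            loss = sse(left) + sse(right)
            if loss < best[2]:
                best = (A, a, loss)
    return best[:2]
```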
Classification Tree
The split criterion is now the Gini Index:
We choose the feature $A$ and the threshold $a$ over all possible values with the
maximal gain
Ensemble Learning
Reduce the randomness (variance) by combining multiple learners.
Bagging (Bootstrap Aggregating)
- Create $M$ bootstrap datasets
- Train a learner on each dataset
- Ensemble $M$ learners
Uniformly sample from the original data $D$ with replacement. The bootstrap dataset has the same size as the original data $D$; the probability of a given sample not showing up in it is $(1 - \frac{1}{N})^N \approx e^{-1} \approx 36.8\%$ for large $N$.
We use the elements that appear in $D$ but not in the bootstrap dataset as the validation set (the out-of-bag dataset).
Random Forest
Ensemble decision trees (Training data with $d$ features)
- Create bootstrap datasets
- During tree construction, randomly sample $K (K<d)$ features as candidates for each split. (Usually choose $K = \sqrt d$)
Use feature selection to make trees mutually independent and diverse.
Boosting
Boosting: Sequentially train learners. Current weak learners focus more on the examples that previous weak learners misclassified.
Weak classifiers $h_1, \cdots, h_m$ are built sequentially. $h_m$ outputs '$+1$' for one class and '$-1$' for the other class.
Classify by $g(x) = \text{sgn}(\sum \alpha_m h_m(x))$
AdaBoost
Core idea: give higher weights to the misclassified examples so that, after re-weighting, the misclassified examples carry half of the total sample weight. (re-weighting)
Mathematical Formulation:
Weighted Error:
Alpha Calculation:
Weight Update:
Final Hypothesis:
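A sketch of the full AdaBoost loop consistent with the steps above, using depth-1 scikit-learn trees as the weak learners; labels are assumed to be in $\{-1, +1\}$, and the exponential re-weighting shown is the commonly used form:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    """Sequentially fit M stumps; re-weight samples so mistakes get more attention."""
    n = len(y)
    w = np.full(n, 1.0 / n)                                  # initial sample weights
    learners, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)            # weighted error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))      # learner weight
        w = w * np.exp(-alpha * y * pred)                    # up-weight misclassified samples
        w = w / w.sum()                                      # normalize (re-weighting)
        learners.append(stump)
        alphas.append(alpha)

    def g(X_new):
        scores = sum(a * h.predict(X_new) for a, h in zip(alphas, learners))
        return np.sign(scores)                               # g(x) = sgn(sum_m alpha_m h_m(x))
    return g
```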
Gradient Boosting
View boosting as an optimization problem. The criterion is to minimize the empirical loss:
Loss function $l$ depends on the task:
- Cross entropy for multi-classification
- $\text{L2}$ loss for regression
We use sequential training: optimize a single model at a time, that is freeze $h_1, \cdots, h_{t-1}$ and optimize $h_t$. (Let $f_{t-1}(x) = \sum_{s=1}^{t-1} \alpha_s h_s(x)$, denoting the ensemble of $t-1$ learners.)
Now let’s see how to choose the $\alpha_t$ and $h_t$, we define:
Consider function $F(u) = \sum_{i=1}^n l(y_i, u_i)$, then the original objective is equivalent to find a direction $\Delta u$ and step size $\alpha$ at the point $u$
to minimize:
Following gradient descent, we can take $\Delta u = -\nabla_u F(u)$, thus
Then how do we decide $\alpha_t$? Use a one-dimensional (line) search, with $y_i, x_i, f_{t-1}, h_t$ fixed:
For simplicity, the search for the optimal multiplier can be replaced by setting it to a constant.
In conclusion, Gradient Boosting = Gradient Descent + Boosting.
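A sketch of gradient boosting for regression with the L2 loss, where the negative gradient is simply the current residual; the constant step size `alpha` replaces the line search as mentioned above, and scikit-learn regression trees are an illustrative choice of weak learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_l2(X, y, T=100, alpha=0.1, max_depth=3):
    """Each h_t fits the negative gradient of the L2 loss, i.e. the current residuals."""
    f = np.full(len(y), y.mean())         # initial constant model
    trees = []
    for _ in range(T):
        residual = y - f                  # -dL/df for the L2 loss (up to a constant factor)
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        f += alpha * h.predict(X)         # constant multiplier instead of a line search
        trees.append(h)

    def predict(X_new):
        return y.mean() + alpha * sum(h.predict(X_new) for h in trees)
    return predict
```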
Learning Theory
Empirical Risk Minimization (ERM)
Empirical Risk: The average loss of the model $f$ on the training set $\mathcal D = \{x_i, y_i\}^N_{i=1}$
Empirical Risk Minimization (ERM): The learning algorithm selects the model that minimizes the empirical risk on the training dataset.
The Consistency of Learning Process
We say a learning process is consistent if, in the infinite-data limit, the minimizer of the empirical risk converges to the minimizer of the expected risk.
Overfitting and Bias-Variance Trade-off
Define the Population Loss (also called Expected Risk) as
Therefore define the Generalization Gap as: $R(f) - \hat R(f)$
There are two important concepts for a predictive model:
- Bias: the assumptions about the target model; it represents the extent to which the average prediction over all datasets differs from the desired function.
- Variance: the extent to which the model changes when the training data changes (can be understood as "stability" with respect to dataset changes).
Bias-Variance Trade-off: There is an intrinsic trade-off between bias and variance; the model's test error contains the sum of both.
Bias-Variance Decomposition :
Suppose the ground truth function is $f^*$, the data distribution is $\mu$, the algorithm $\mathcal{A}$ learns from hypothesis space $\mathcal{H}$. We use $y(x; \mathcal{D}) = \mathcal{A}(\mathcal{D}, \mathcal{H})(x)$ to denote the output of ERM model $\hat{f} = \mathcal{A}(\mathcal{D}, \mathcal{H})$ on input $x$.
We are interested in the learned model’s prediction error on any $x$, namely
Taking expectation over all possible datasets $\mathcal{D}$, the last term is zero.
Regularization refers to techniques used to calibrate machine learning models in order to prevent overfitting: it picks a small subset of solutions that are more regular (penalizing parameters that behave abnormally) so as to reduce the variance.
Generalization Error and Regularization
VC dimension
VC dimension is a measure of complexity for a certain hypothesis class:
The largest integer $d$ for a binary classification hypothesis class $\mathcal H$ such that there exist $d$ points in the input space $\mathcal X$ that can be perfectly classified by some function $h \in \mathcal H$, no matter how you assign labels to these $d$ points.
VC dimension characterizes the model class’s capacity for fitting random labels.
Generalization Error Bound
If a hypothesis class $\mathcal{H}$ has VC dimension $d_{vc}$, we have a theorem that states that with probability $1 - \delta$ and $m$ samples, we can bound the generalization gap for any model $h \in \mathcal{H}$ as
Bayesian Decision
Bayesian Decision: Find an optimal classifier according to the prior probability and class-conditional probability density of the feature
The a priori or prior probability reflects how likely we expect a certain state of nature to be before we actually observe it.
The class-conditional probability density function is the probability
density function $P(x|\omega)$ for our feature $x$, given that the state/class is $\omega$
Posterior Probability is the probability of a certain state/class given
our observable feature $x$: $P(\omega | x)$
Minimum Prediction Error Principle. The optimal classifier $f(\cdot)$ should minimize the expected prediction error, defined as
So, for each $x$, we want
Therefore, the classifier just needs to pick the class with largest posterior probability.
We could use a decision threshold $\theta$ for deciding. We can also avoid making decisions on difficult cases in anticipation of a high error rate on those examples.
Density estimation
We need a method to estimate the distribution of each feature; this is called density estimation.
Parametric Density Estimation Method
We can assume that the density function follows some form, for example:
The unknowns $\theta_i = (\mu_i, \sigma_i)$ are called the parameters.
Maximum Likelihood Estimation (MLE)
Likelihood Function: $p(x|\theta)$ measures the likelihood of a parametrized distribution to generate a sample $x$.
Max Likelihood Estimation (MLE): Choose the parameter $\theta$ that maximizes the likelihood function over all the samples.
For example, if we use Gaussian to estimate $X = \{x_i\}_{i=1}^N$, MLE gives the result as
For the sake of simplicity, denote $H(\theta) = \ln p(X|\theta) = \sum_{i=1}^{N} \ln p(x_i|\theta)$
Non-parametric Density Estimation Method
Non-parametric method makes few assumptions about the form of the distribution and does not involve any parameter about the density function’s form.
Suppose we sample $N$ data points in total, of which $K$ fall within the region $R$. Each data point is sampled independently and identically. For each sample, whether it belongs to $R$ follows a Bernoulli distribution with parameter $P_R$. We have $p(x) \approx \frac{P_R}{V} \approx \frac{K}{NV}$, where $V$ is the volume of $R$.
We could apply kernel methods to it.
Hidden Markov Models (HMMs)
Understanding Bayes’ Rule:
- Prior $P(H)$ : How probable was our hypothesis before observing the evidence?
- Likelihood $p(E|H)$ : How probable is the evidence given that our hypothesis is true?
- Marginal $P(E)$: How probable is the new evidence?
Notation | Explanation |
---|---|
$Q = \{q_1, \ldots, q_n\}$ | The set of $n$ hidden states. |
$V = \{v_1, \ldots, v_V\}$ | The set of all possible observed values. |
$A = [a_{ij}]_{n \times n}$ | Transition matrix. $a_{ij}$ is the probability of transitioning from state $i$ to state $j$. $\sum_{j=1}^n a_{ij} = 1 \, \forall i$. |
$O = o_1 o_2 \cdots o_L$ | Observed sequence. $o_t \in V$. |
$x = x_1 x_2 \cdots x_L$ | Hidden state sequence. $x_t \in Q$. |
$E = [e_{ij}]_{n \times v}$ | Emission probability matrix. $e_{ij} = P(o = v_j \mid x = q_i)$ is the probability of observing $v_j$ at state $q_i$. $\sum_{j=1}^V e_{ij} = 1 \, \forall i$. |
$\pi = [\pi_1, \pi_2, \ldots, \pi_n]$ | Start probability distribution. $\pi_i$ is the probability of Markov chain starting from $i$. $\sum_{i=1}^n \pi_i = 1$. |
Question #1 – Evaluation
The evaluation problem in HMM: Given a model $M$ and an observed sequence $O$, calculate the probability of the observed sequence $P(O|M)$ .
Forward Algorithm
Denote $\alpha_t(j)$ as the probability of observing $o_1 o_2 \ldots o_t$ and the hidden state at $t$ being $q_j$:
Obviously, $\alpha_t(j)$ can be rewritten as:
Define Initial Values:
Iterative solving:
Obtaining results:
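A sketch of the forward algorithm using the notation from the table above; `A` is the transition matrix, `E` the emission matrix, `pi` the start distribution, and `obs` is assumed to be a sequence of observation indices:

```python
import numpy as np

def forward(A, E, pi, obs):
    """Return P(O | M) by dynamic programming over alpha_t(j)."""
    n, L = A.shape[0], len(obs)
    alpha = np.zeros((L, n))
    alpha[0] = pi * E[:, obs[0]]                    # initialization: alpha_1(j) = pi_j * e_j(o_1)
    for t in range(1, L):
        # alpha_t(j) = (sum_i alpha_{t-1}(i) * a_{ij}) * e_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * E[:, obs[t]]
    return alpha[-1].sum()                          # P(O | M) = sum_j alpha_L(j)
```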
Backward Algorithm
Denote $\beta_t(j)$ as the probability of observing $o_{t+1} o_{t+2} \ldots o_L$ given that the hidden state at $t$ is $q_j$:
Obviously, $\beta_t(j)$ can be rewritten as:
Define Initial Values:
Iterative solving:
Obtaining results:
Question #2 – Decoding
The decoding problem in HMM: Given a model $M$ and an observed sequence $O$, calculate the most probable hidden state sequence $\mathbf{x} = \arg\max_{\mathbf{x}} p(\mathbf{x}, O | M)$.
Define:
According to the recurrence relation, rewrite the above as:
Therefore, the most probable hidden state sequence is:
Viterbi Algorithm
Define Initial Values:
Iterative solving:
Obtaining results:
Computational Complexity: $O(n^2 L)$
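A corresponding Viterbi sketch (same notation as the forward algorithm above); it returns the most probable hidden-state index sequence:

```python
import numpy as np

def viterbi(A, E, pi, obs):
    """Dynamic programming over delta_t(j) with backtracking pointers."""
    n, L = A.shape[0], len(obs)
    delta = np.zeros((L, n))
    back = np.zeros((L, n), dtype=int)
    delta[0] = pi * E[:, obs[0]]
    for t in range(1, L):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_{t-1}(i) * a_{ij}
        back[t] = scores.argmax(axis=0)              # best previous state for each current state
        delta[t] = scores.max(axis=0) * E[:, obs[t]]
    path = [delta[-1].argmax()]                      # backtrack the best path
    for t in range(L - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```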
Question #3 – Learning
The learning problem in HMM: Given an observed sequence $O$, estimate the parameters of model: $M = \arg \max \limits_{M}P(M|O)$
For simplicity, in the following steps we only present the learning process of transition matrix $A$. (The other parameters can be learned in a similar manner.)
Baum-Welch Algorithm (a special case of EM algorithm)
- Expectation Step (E-step): Using the observed available data of the dataset, we estimate (guess) the values of the missing data with the current parameters $\theta_{\text{old}}$.
- Maximization Step (M-step): Using complete data generated after the E-step, we update the parameters of the model.
E-step
(#$T_{ij}$ denotes the number of times the hidden state transitions from $q_i$ to $q_j$)
Generate the guesses of #$T_{ij}$, i.e., the expected counts:
Can be estimated with Forward Algorithm and Backward Algorithm.
M-step
Generate new estimations with the expected counts:
Estimation when hidden state is unknown.
Iterative Solving: Recalculate the expected counts with newly estimated parameters (E-step). Then generate newer estimations of $\theta$ with (M-step). Repeat until convergence.
Bayesian Networks
Naive Bayes
Naïve Bayes Assumption: Features $X_i$ are independent given class $Y$:
Inference: the label can be easily predicted with Bayes’ rule
$Y^*$ is the value that maximizes Likelihood $\times$ Prior.
The parameters of Naïve Bayes can be learned by counting. When the number of samples is small, we are likely to encounter cases where $\text{Count}(Y = y) = 0$ or $\text{Count}(X_i = x, Y = y) = 0$, so we use Laplace Smoothing:
- Prior:
- Observation Distribution
Here, $C$ is the number of classes, $S$ is the number of possible values that $X_i$ can take.
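A counting sketch of these Laplace-smoothed estimates; features are assumed to be categorical and integer-encoded, `C` and `S` are as defined above, and the pseudo-count `alpha` is an illustrative choice (alpha = 1 gives standard Laplace smoothing):

```python
import numpy as np

def naive_bayes_counts(X, y, C, S, alpha=1.0):
    """Laplace-smoothed prior P(Y=c) and observation distribution P(X_i = s | Y = c)."""
    n, d = X.shape
    prior = np.zeros(C)
    cond = np.zeros((d, S, C))
    for c in range(C):
        mask = (y == c)
        prior[c] = (mask.sum() + alpha) / (n + C * alpha)
        for i in range(d):
            for s in range(S):
                cond[i, s, c] = ((X[mask, i] == s).sum() + alpha) / (mask.sum() + S * alpha)
    return prior, cond
```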
Learning & Decision on BN
Bayesian Network
BN$(G, \Theta)$: a Bayesian network
- $G$ is a DAG with nodes and directed edges.
- Each node represents a random variable. Each edge represents a causal relationship/dependency.
- $\Theta$ is the network parameters that constitute conditional probabilities.
- For a node $t$, its parameters are represented as $p(x_t \mid x_{\text{pa}(t)})$.
Joint probability of BN:
where $\text{pa}(t)$ is the set of all parent nodes of node $t$.
Learning on Bayesian Network
Notation: Suppose BN has $n$ nodes, we use $\text{pa}(t)$ to denote the parent nodes of $t$ $(t = 1, \ldots, n)$
By the conditional independence of BN, we have
Thus, the posterior becomes:
Learning BN with Categorical Distribution
Consider the case where each probability distribution in the BN is categorical. In this case, we can model the conditional distribution of node $t$ as follows (we use a scalar value $c$ to represent the parent nodes' states for simplicity):
and the conditional probability of node $t$ can be denoted as:
Categorical Distribution:
E.g., toss a coin $(d = 2)$, roll a die $(d = 6)$
Count the training samples where $x_t = k, x_{\text{pa}(t)} = c$:
According to the property of categorical distribution, we can represent the likelihood function as:
Thus the posterior can be further factorized:
Notation:
- $D_{tc}$ are the sample set where the value of $x_{\text{pa}(t)}$ is $c$
- $q_t$ is the number of possible values of $x_{\text{pa}(t)}$
- $K_t$ is the number of possible values of $x_t$
How to choose the probability distribution function for the prior $p(\theta_{tc})$? It would be highly convenient if the posterior shares the same form as the prior.
Conjugate Prior: A prior distribution is called a conjugate prior for a likelihood function if the posterior distribution is in the same probability distribution family as the prior.
The conjugate prior for the categorical distribution is the Dirichlet distribution:
Choosing the prior as conjugate prior — Dirichlet distribution:
$\alpha_{tck}$ are integers and are the hyperparameters of the BN model.
In this case, the posterior can be easily derived as:
We can then derive an estimate of $\theta_{tck}$ by calculating the expectation:
K-Means Algorithm
- Initalize cluster centers $\mu_1, \cdots, \mu_k$ randomly.
Repeat until there is no change in the cluster assignments:
- Assignment step: Assign data points to closest cluster center
- Update Step: Change the cluster center to the average of its assigned points
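A compact sketch of these two alternating steps; the Forgy-style initialization and the fixed iteration cap are illustrative choices:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]          # Forgy-style initialization
    for _ in range(n_iters):
        # Assignment step: each point goes to the closest cluster center
        assign = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # Update step: each center becomes the mean of its assigned points
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                   # no change -> stop
            break
        centers = new_centers
    return centers, assign
```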
Optimization View of K-Means
Optimization Objective: within-cluster sum of squares (WCSS)
Step 1: Fix $\mu$, optimize $r$
Step 2: Fix $r$, optimize $\mu$
Rules of Thumb for Initializing k-Means
- Random Initialization: Randomly generate $k$ points in the space.
- Random Partition Initialization: Randomly group the data into $k$ clusters and use their cluster centers to initialize the algorithm.
- Forgy Initialization: Randomly select $k$ samples from the data.
- K-Means++: Iteratively choose new centroids that are farthest from the existing centroids.
How to tell the right number of clusters?
We find the elbow point of the $J_e$ curve.
EM Algorithm for Gaussian Mixture Model (GMM)
Multivariate Gaussian Distribution
$d$-dimensional Multivariate Gaussian:
- $\mu \in \mathbb{R}^d$ the mean vector
- $\Sigma \in \mathbb{R}^{d \times d}$ the covariance matrix
MLE of Gaussian Distribution
The likelihood function of a given dataset $X = \{x_1, x_2, \ldots, x_N\}$:
The maximum likelihood estimation (MLE) of the parameters is defined by:
The optimization problem of maximum likelihood estimation (MLE):
Solve the optimization by taking the gradient:
Gaussian Mixture Model (GMM)
A Gaussian Mixture Model (GMM) is the weighted sum of a family of Gaussians whose density function has the form:
- Each Gaussian $N(\mu_k, \Sigma_k)$ is called a component of GMM.
- Scalars $\{\pi_k\}_{k=1}^{K}$ are referred to as mixing coefficients, which satisfy
This condition ensures $p(x \mid \pi, \mu, \Sigma)$ is indeed a density function.
Soft Clustering with Mixture Model
By Bayes Rule, the posterior probability of $z$ given $x$ is:
We call $\gamma_k$ the responsibility of the $k$-th component on the data $x$.
Probabilistic Clustering: each data point is assigned a probability distribution over the clusters.
“$x$ belongs to the $k$-th cluster with probability $\gamma_k$”
MLE for Gaussian Mixture Model
Log-likelihood function of GMM
Maximum Likelihood Estimation
subject to:
Optimality Condition for $\mu$
Take partial derivative with respect to $\mu_k$,
Notice that the posterior of $z_n$ (also known as responsibility $\gamma_{n,k}$) can be written as
Thus
Optimality Condition for $\Sigma$
Similarly, take derivative with respect to $\Sigma_k$, which yields
Responsibility-reweighted Sample Covariance
Optimality Condition for $\pi$
Constraints of mixing coefficients $\pi$: $\sum_{k=1}^{K} \pi_k = 1$
Introduce Lagrange multiplier:
Take derivative with respect to $\pi_k$, which gives
By the constraints, we have $1 = \sum_{k=1}^{K} \pi_k = \frac{-1}{\lambda} \sum_{k=1}^{K} N_k$,
Also notice that
Therefore,
Expectation-Maximization (EM) Algorithm
- Initialize $\pi_k, \mu_k, \Sigma_k, \quad k = 1, 2, \ldots, K$
- E-Step: Evaluate the responsibilities using the current parameter values
- M-Step: Re-estimate the parameters using the current responsibilities
where $N_k = \sum_{n=1}^{N} \gamma_{n,k}$
- Return to step 2 if the convergence criterion is not satisfied.
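A sketch of one E-step/M-step pair, using `scipy.stats.multivariate_normal` for the Gaussian densities; initialization and the convergence check are omitted for brevity:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a GMM: evaluate responsibilities, then re-estimate parameters."""
    N, K = X.shape[0], len(pi)
    # E-step: gamma[n, k] proportional to pi_k * N(x_n | mu_k, Sigma_k)
    gamma = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means, covariances and mixing coefficients
    Nk = gamma.sum(axis=0)
    mu_new = (gamma.T @ X) / Nk[:, None]
    Sigma_new = [((gamma[:, k, None] * (X - mu_new[k])).T @ (X - mu_new[k])) / Nk[k] for k in range(K)]
    return Nk / N, mu_new, np.array(Sigma_new)
```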
Hierarchical Clustering
Distance Function: The distance function affects which pairs of clusters are merged/split and in what order.
- Single Linkage:
- Complete Linkage:
- Average Linkage:
Two Types of Hierarchical Clustering
Bottom-Up (Agglomerative)
- Start with each item in its own cluster, find the best pair to merge into a new cluster.
- Repeat until all clusters are fused together.
Top-Down (Divisive)
- Start with one all-inclusive cluster, consider every possible way to divide the cluster in two.
- Choose the best division and recursively operate on both sides.
Agglomerative (Bottom-up) Clustering
- Input: cluster distance measure $d$, dataset $X = \{x_n\}_{n=1}^{N}$, number of clusters $k$
- Initialize $\mathcal{C} = \{C_i = \{x_n\} \mid x_n \in X\}$ // Each point in separate cluster
- Repeat:
- Find the closest pair of clusters $C_i, C_j \in \mathcal{C}$ based on distance metric $d$
- $C_{ij} = C_i \cup C_j$ // Merge the selected clusters
- $\mathcal{C} = (\mathcal{C} \setminus \{C_i, C_j\}) \cup \{C_{ij}\}$ // Update the clustering
- Until $|\mathcal{C}| = k$
A naïve implementation takes space complexity $O(N^2)$, time complexity $O(N^3)$.
LASSO Regression
LASSO (Least Absolute Shrinkage and Selection Operator): Simply linear regression with an $\ell_1$ penalty for sparsity
sparse solution $\leftrightarrow$ feature selection
Principal Component Analysis (PCA)
Computing PCA: Eigenvalue Decomposition
Objective: Maximize variance of projected data
subject to $\mathbf{u}_j^T \mathbf{u}_j = 1$, $\mathbf{u}_j^T \mathbf{u}_k = 0$, $k < j$
Observation: PC $j$ is the direction of the eigenvector with the $j$-th largest eigenvalue of $\frac{1}{n} \mathbf{X}^T \mathbf{X}$
Eigenvalue Decomposition:
are eigenvectors of $\frac{1}{n} \mathbf{X}^T \mathbf{X}$
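A sketch of PCA via eigendecomposition of the (centered) covariance matrix; `X` is assumed to hold one sample per row and `m` is the number of principal components to keep:

```python
import numpy as np

def pca(X, m):
    """Project centered data onto the top-m eigenvectors of (1/n) X^T X."""
    Xc = X - X.mean(axis=0)                      # center the data
    cov = Xc.T @ Xc / len(Xc)                    # (1/n) X^T X
    _, eigvecs = np.linalg.eigh(cov)             # eigh returns ascending eigenvalues of a symmetric matrix
    U = eigvecs[:, ::-1][:, :m]                  # top-m principal directions
    return Xc @ U, U                             # projected data and the projection matrix
```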
Manifold Learning
Geodesic distance: lines of shortest length between points on a manifold
Classical Neural Networks
Forward propagation
Backpropagation: Gradient Computation
Apply the chain rule to compute gradients.
Summary of backpropagation:
(No $\delta^{(1)}$)
- Based on $\delta^{(l)}$, $\frac{\partial J(\Theta)}{\partial \Theta^{(l)}}$ can be computed as:
For example, the activation function $g(x)$ is sigmoid, i.e., $g(x) = \frac{1}{1+e^{-x}}$ and $g'(x) = g(x)(1 - g(x))$.
For example, $J(\Theta)$ is the cross-entropy loss for binary classification, i.e.,
and
Optimization of Deep Networks
Vanilla Gradient Descent
Core: Compute the gradient of the loss function $g_t = \nabla \mathcal L$ on all training samples
Stochastic Gradient Descent
Core: Select a sample $(\mathbf x_i, y_i)$ from the training set and compute the gradient of the loss function $g_t = \nabla \mathcal L$ on the selected sample.
Mini-batch Gradient Descent
Core: Randomly select $b$ samples $\{(\mathbf x_i, y_i)\}_{i \in [1, n]}$ and compute the gradient of the loss function $g_t = \nabla \mathcal L$ on the selected samples (the mini-batch).
Gradient Descent with Momentum
where
The momentum term ($\mathbf{v}_t$) accumulates the gradients from the past several steps.
Adaptive Gradient (AdaGrad)
Particularly, it tends to assign higher learning rates to infrequent features, which ensures that the parameter updates rely less on frequency and more on relevance.
In AdaGrad, the parameters are updated as:
where
$\epsilon$ is a small number to ensure numerical stability.
Here, $.*$ is the element-wise product, and $\mathbf{g}_t = \nabla \mathcal{L}(\Theta_t)$.
Root Mean Square Propagation (RMSProp)
RMSProp changes the gradient accumulation in AdaGrad into an exponentially weighted moving average.
This method uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.
- The update rule is denoted as:
where
$r$ is the moving average of squared gradients, $\beta$ is the decay rate.
Adaptive Moment Estimation (Adam)
Adam extends the RMSProp method by making use of the first moments of the gradients, in addition to the second moments used in RMSProp.
- Adam can be seen as a combination of RMSProp and Momentum, with a few distinctions:
First-order Moment:
Second-order Moment:
Considering the first-order moment $s_t = \beta_1 s_{t-1} + (1 - \beta_1) g_t$, we start by initializing $s_0 = 0$, then:
Note that since we initialized $s_0 = 0$, there is initially a significant bias towards smaller values. We can use the fact that $\sum_{i=0}^{t-1} \beta_1^i = \frac{1 - \beta_1^t}{1 - \beta_1}$ to re-normalize the terms and get:
The same method can be performed in the second-order moments, we get:
Finally, Adam combines the bias-corrected first and second-order moments and updates the parameters as:
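A sketch of one Adam update with the bias-corrected moments derived above; the hyperparameter defaults shown are the commonly used values, and the variable names are illustrative:

```python
import numpy as np

def adam_step(theta, grad, s, r, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. s, r are the running first/second moments; t counts from 1."""
    s = beta1 * s + (1 - beta1) * grad             # first-order moment
    r = beta2 * r + (1 - beta2) * grad * grad      # second-order moment (element-wise square)
    s_hat = s / (1 - beta1 ** t)                   # bias correction of the first moment
    r_hat = r / (1 - beta2 ** t)                   # bias correction of the second moment
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + eps)
    return theta, s, r
```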
Convolutional Neural Networks
Convolution Layer
The convolution operator preserves the spatial structure of image. Different filters extract different features from the original image.
Pooling Layer
The pooling layer is a downsampling operation, typically applied after a convolution layer, which provides some spatial invariance.
Commonly used pooling operations: Max Pooling and Average Pooling
Batch Normalization Layer
Batch Normalization alleviates the problem of vanishing gradients.
Fully Connected Layer (FC) or Dense Layer
The fully connected layer operates on a flattened input where each input is
connected to all neurons.
If present, FC layers are usually found towards the end of CNN architectures and
can be used to optimize objectives such as class scores.
Word Embedding
Build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts, measuring similarity by the vector dot product. Such a representation is called a word embedding or word vector.
Word2Vec
Idea: We have a large corpus (“body”) of text: a long list of words. Every word in a fixed vocabulary is represented by a vector. Go through each position t in the text, which has a center word $c$ and context (“outside”) words $o$. Use the similarity of the word vectors for $c$ and $o$ to calculate the probability of $o$ given $c$. Keep adjusting the word vectors to maximize this probability.
Word2vec: Objective Function
- For each position $t = 1, \ldots, T$, predict context words within a window of fixed size $m$, given center word $w_t$.
- In our case, each word is represented as a parameter vector $\theta_i$.
- $\theta = [\theta_1, \ldots, \theta_V]$ represents all the parameters of $V$-many words.
- The objective function $J(\theta)$ is the (average) negative log likelihood:
Word2vec: Prediction Function
- How to calculate $P(w_{t+j} | w_t; \theta)$?
- Softmax function
- We will use two vectors per word $w$
- $v_w$ when $w$ is a center word; $u_w$ when $w$ is a context word.
- $V$: the set of all possible words
- Dot product compares similarity of $o$ and $c$.
- Exponentiation makes anything positive.
- Normalize over entire vocabulary to give probability distribution.
Language Modeling
A language model is a probability distribution over sequences of words, e.g., it predicts what word comes next. A system that does this is called a Language Model.
Recurrent Neural Network (RNN)
Need a neural network that can process any length input? Apply the same weights repeatedly.
Training RNN
Get a big corpus of text, i.e., a long sequence of words.
Feed a (batch of) sequence(s) of length $T$ into the RNN-LM; compute the output distribution $\hat{y}^{(t)}$ for every step $t$, i.e., predict the probability distribution of every word given the words so far.
Average this to get overall loss for a sentence (actually, a batch of sentences):
Backpropagation for RNNs: backpropagation through time
Apply the multivariable chain rule:
The gradient w.r.t. a repeated weight is the sum of the gradient w.r.t. each time it appears.
Long Short-Term Memory RNN
The key to LSTMs is the cell state: the hidden state stores short-term information, while the cell stores long-term information.
The cell runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.
LSTM – Forget Gate
Gating mechanisms: Control which information is erased/written/read from the cell
- On each timestep, each element of the gates can be open (1), closed (0), or somewhere in-between (Sigmoid function).
- The gates are dynamic; their value is computed based on the current context.
Forget gate:
Controls what is forgotten from the previous cell state
LSTM – Input Gate
New cell content:
The content that will be added to the cell state
Input gate:
Controls what parts of the new cell content are written to the cell
LSTM – Update Cell State
Cell state: Forget some content from the last cell state; input some new content to the cell state
LSTM – Output Gate
Output gate: Controls what parts of cell content are output to hidden state
Hidden state: Output some content from the cell as hidden state
Transformers
Attention
- Attention: directly model relationships between any two positions in the input sequence, regardless of their distance.
- Let the sequence be $w_{1:n}$. For each word $w_i$, let $x_i$ be its word embedding.
- Each word vector (embedding) is transformed into three vectors query, key, value.
The projection matrices that produce $Q$, $K$, $V$ are learnable parameters.
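A sketch of single-head scaled dot-product self-attention as described above; `X` holds one embedding per row, and `W_q`, `W_k`, `W_v` stand for the learnable projection matrices (illustrative names):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Each position attends to every position: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # pairwise similarity of queries and keys
    return softmax(scores, axis=-1) @ V          # weighted combination of the values
```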
Transformer Encoder
Position encoding
Since self-attention does not model order information, we need to encode the order of the sentence into the word embeddings.
Consider representing each sequence index as $p_i \in \mathbb{R}^d$
$x_i$ is the word embedding and $\tilde{x}_i$ is the word embedding with position information added. $p_i$ can be a sinusoidal function or learnable parameters.
Multi-head attention
Define multiple attention “heads” through multiple $Q$, $K$, $V$ matrices.
Each attention head performs attention independently.
Then the outputs of all the heads are combined.
Residual connection
A trick from ResNet to help models train better.
Layer normalization
A trick to stabilize the training.
$\mu, \sigma$ are the mean and standard deviation of $x \in \mathbb{R}^d$.
Feed-forward network
Self-attention is just the linear combination of values.
Feed-forward network introduces nonlinearity.
Transformer Decoder
Masked attention: Mask out attention to future words by setting attention scores to $-\infty$
For any current word $i$ and future word $j$ ($i < j$), we have the attention weight
Cross-attention: The queries are drawn from the decoder, the keys and values are drawn from the encoder. Establish a connection between the source (input) and target (output) sequences.
Variational Autoencoder (VAE)
Let’s turn the autoencoder into a probabilistic model.
The encoder encodes the input data into a distribution over the latent space instead of a single point in the latent space.
The decoder maps any latent code to a meaningful data distribution.
VAE: Generative Process
We assume each data point is generated by the following two steps:
- Sample latent variable $z$ from its prior distribution $p(z)$
- Generate $x$ by the conditional model $p_{\theta}(x \mid z)$
The prior distribution $p(z)$ is usually simple, say $p(z) = N(0, I)$.
The marginal likelihood $p_{\theta}(x)$ is intractable. However, with the help of an encoder $q_{\phi}(z \mid x)$, we can obtain a tractable lower bound on the likelihood.
The model can be trained by maximizing this lower bound.
The Evidence Lower Bound (ELBO)
The formula for the Kullback-Leibler (KL) divergence:
The KL divergence from distribution $Q$ to distribution $P$ is defined as:
For discrete probability distributions, it is defined as:
In the context of variational autoencoders, where $q_{\phi}(z|x)$ is the approximate posterior and $p_{\theta}(z|x)$ is the true posterior, the KL divergence is given by:
Notice that
The gap between log-likelihood and ELBO = distance between true posterior and $q_{\phi}$
Encoder and decoder are jointly trained to maximize the evidence lower bound.
This type of DGM is called Variational Autoencoder (VAE).
Further Analysis for ELBO
ELBO is tractable and differentiable.
Reconstruction Loss:
The reconstruction error of sending $x$ through the encoder and decoder.
Prior Regularization:
Make approximate posterior close to prior.
Prior Regularization
Make approximate posterior distribution closer to prior.
- It prevents the “overfitting” of the autoencoder.
Suppose a standard Gaussian prior and a Gaussian encoder (approximate posterior):
The KL-divergence between Gaussians has closed-form solution:
Reconstruction Loss
Suppose the encoder and decoder are both Gaussian
(Decoder has fixed variance for simplicity)
The likelihood function:
The reconstruction loss can be approximated via Monte Carlo methods:
where $C_1 = \frac{1}{2 \sigma^2 K}$
$K$ is the number of MC samples.
VAE: Putting together
Training
- Forward Encoder, compute regularization term
- Sample $z \sim q_{\phi}(z | x)$ with reparameterization trick.
- Forward decoder, compute reconstruction term
- Maximize ELBO with gradient ascent
Sampling
- Discard the encoder
- Sample latent from prior
- Forward decoder
Generative Adversarial Network
A generative adversarial network (GAN) consists of:
- A discriminator $D(x)$
- A generator $G(z)$
$D$ is a binary classifier that tries to discriminate between a sample from the data distribution and a sample from the generator $G$.
$G$ tries to “trick” $D$ by generating samples that are hard for $D$ to distinguish from data.
Min-max objective:
For a fixed generator $G$, the optimal discriminator $D^*$ is
Now consider the min-max objective:
Where $\text{JSD}$ is Jensen-Shannon Divergence:
Thus we have the unique global minimum:
For training, we use gradient ascent on the generator with the objective:
Diffusion Probabilistic Models
Forward Diffusion Process
Given a data point sampled from the real distribution $x_0 \sim q(x_0)$, define a forward diffusion process in which we add a small amount of Gaussian noise to the sample in $T$ steps, producing noisy samples $x_1, \ldots, x_T$. The step sizes are controlled by $\beta_t$.
Let $\alpha_t = 1 - \beta_t, \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. Then we have
Backward Diffusion Process
Core idea: Learn to map noise to data by reversing the time.
To reverse the diffusion process, we need to estimate the reverse conditional probabilities $q(x_{t-1} | x_t)$. Note that if $\beta_t$ is small enough, $q(x_{t-1} | x_t)$ will also be Gaussian. Learn a model $p_{\theta}(x_{t-1} | x_t)$ to approximate the reverse process $q(x_{t-1} | x_t)$.
DPM: Putting Together
Training Algorithm
repeat
- $x_0 \sim q(x_0)$
- $t \sim \text{Uniform}(\{1, \ldots, T\})$
- $\epsilon \sim \mathcal{N}(0, I)$
- Take gradient descent step on
until converged
Sampling Algorithm
$x_T \sim \mathcal{N}(0, I)$
for $t = T, \ldots, 1$ do
- $z \sim \mathcal{N}(0, I)$ if $t > 1$ else $z = 0$
end for
return $x_0$
Contrastive Representation Learning
We want a feature extractor $f$ and a score function $S$, such that
Here, $x$ is the reference sample, $x^+$ is a positive sample, and $x^-$ is a negative sample.
Given a chosen score function $S(\cdot)$, we aim to learn an encoder function $f$ that yields high score for positive pairs $(x, x^+)$ and low scores for negative pairs $(x, x^-)$.
Commonly known as InfoNCE loss
A lower bound on the mutual information between $f(x)$ and $f(x^+)$
Therefore, the larger the number of samples $N$ and the lower the loss $L$, the larger this lower bound on the mutual information, i.e., the more correlated $f(x)$ and $f(x^+)$ are.
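A sketch of the InfoNCE loss for one batch, where row `i` of `z` and of `z_pos` are embeddings of a positive pair and all other rows serve as negatives; cosine similarity with a temperature `tau` is an illustrative choice of score function $S$:

```python
import numpy as np

def info_nce_loss(z, z_pos, tau=0.1):
    """Cross-entropy of picking the positive among N candidates for every reference sample."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    logits = z @ z_pos.T / tau                                  # logits[i, j] = S(f(x_i), f(x_j^+)) / tau
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                          # the positive of x_i sits on the diagonal
```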