These are review notes for the course "Pattern Recognition and Machine Learning" at Tsinghua University.
Evaluation Metric
k-NN
Nearest Neighbor
For a new instance $x'$, its class $\omega'$ can be predicted as the class of its nearest training instance: $\omega' = \omega_j$, where $j = \arg\min_i \|x' - x_i\|$.
k-Nearest Neighbor
For a new instance $x$, define $g_i(x)$ as the number of $x$'s $k$ nearest instances belonging to class $\omega_i$.
Then the new instance's class $\omega'$ can be predicted as: $\omega' = \arg\max_{\omega_i} g_i(x)$
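A minimal NumPy sketch of this voting rule (all names here are illustrative, not from the course):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to all samples
    nearest = np.argsort(dists)[:k]                   # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # omega' = argmax_i g_i(x)

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0])))       # -> 1
```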
k-NN Improvements
Branch-and-Bound Algorithm
Use tree structure to reduce calculation.
Edited Nearest Neighbor
Delete instances that may be misleading (e.g., mislabeled or noisy) from the training set.
Condensed Nearest Neighbor
Delete instances that are far away from the decision boundaries, keeping only those needed to define them.
The Curse of Dimensionality
Problem
- Many irrelevant attributes
- In high-dimensional spaces, most points are equally far from each other.
Solution
- Dimensionality reduction techniques
- manifold learning
- Feature selection
- Use prior knowledge
Linear Regression (Multivariate ver.)
For multivariate linear regression, the model becomes $y_i = \mathbf{w}^{\rm T}\mathbf{x}_i$, where $\mathbf{x}_i = (1, x_i^1, \cdots, x_i^d)^{\rm T}\in \mathbb{R}^{d+1}$ and $\mathbf{w} = (w_0, w_1, \cdots, w_d)^{\rm T} \in \mathbb{R}^{d+1}$. We adjust $\mathbf{w}$ to find the best-fitting function $f(\mathbf x) = \mathbf{w}^{\rm T}\mathbf{x}$.
We find the best $\mathbf{w}^{\star}$ by minimizing the mean squared error: $\mathbf w^{\star} = \arg\min\limits_{\mathbf w} \frac{1}{N} \sum_{i = 1}^N (f(\mathbf x_i) - y_i)^2 = \arg\min \limits_{\mathbf w} \frac{1}{N}(\mathbf {Xw-y})^{\rm T}(\mathbf {Xw-y})$
Setting the gradient to zero, $\mathbf{w}^{\star}$ must satisfy the normal equations $\mathbf {X^{\rm T}} \mathbf {Xw^{\star}} = \mathbf X^{\rm T}\mathbf y$, so $\mathbf{w^{\star}} = (\mathbf {X^{\rm T}X})^{-1}\mathbf X^{\rm T}\mathbf y$, or $\mathbf{w^{\star}} = (\mathbf {X^{\rm T}X} + \lambda \mathbf I)^{-1}\mathbf X^{\rm T}\mathbf y$ when an $\ell_2$ penalty $\lambda\|\mathbf w\|^2$ is added (Ridge Regression).
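A small NumPy sketch of the closed-form solution above (function and variable names are my own):

```python
import numpy as np

def fit_linear(X, y, lam=0.0):
    """Solve (X^T X + lam*I) w = X^T y; lam=0 gives OLS, lam>0 gives ridge."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias column of ones
    d = X.shape[1]
    # np.linalg.solve is numerically safer than forming the explicit inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=100)
print(fit_linear(X, y))            # approximately [3, 1, -2, 0.5]
print(fit_linear(X, y, lam=1.0))   # ridge shrinks the weights slightly
```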
Linear Discriminant Analysis
LDA projects the input vector $\mathbf x \in \mathbb{R}^{d+1}$ down to a one-dimensional subspace with a projection vector $\mathbf w$: $y = \mathbf w^{\rm T}\mathbf x$.
How do we find a good projection vector? Fisher's criterion: maximize the difference between class means, normalized by a measure of the within-class scatter.
The between-class scatter is $\tilde{S}_b = (\tilde{m}_1 - \tilde{m}_2)^2$, where $\tilde{m}_i$ is the mean of the projected samples of the $i$-th class. The within-class scatter of class $i$ is $\tilde{S}_i=\sum_{y_j \in \mathcal{Y}_{i}} (y_j - \tilde{m}_i)^2$, and the total within-class scatter is $\tilde{S}_w = \tilde{S}_1+ \tilde{S}_2$. Combining the two, the objective function is $J_F(\mathbf w) = \frac{\tilde{S}_b}{\tilde{S}_w}$
We have $\tilde{S}_b = (\tilde{m}_1 - \tilde{m}_2)^2 = (\mathbf w^{\rm T} \mathbf m_1 - \mathbf w^{\rm T} \mathbf m_2)^2 = \mathbf w^{\rm T} (\mathbf m_1 - \mathbf m_2)(\mathbf m_1 - \mathbf m_2)^{\rm T} \mathbf w = \mathbf w^{\rm T} \mathbf S_b \mathbf w$, also $\tilde{S}_w = \mathbf w^{\rm T} \mathbf S_w \mathbf w$, so now optimize objective function $J_F$ w.r.t $\mathbf w$:
Using the Lagrange multiplier method we obtain $\lambda \mathbf w^{\star} = \mathbf{S}_w^{-1} (\mathbf m_1 - \mathbf m_2)(\mathbf m_1 - \mathbf m_2)^{\rm T}\mathbf w^{\star}$. Since we only care about the direction of $\mathbf w^{\star}$ and $(\mathbf m_1 - \mathbf m_2)^{\rm T}\mathbf w^{\star}$ is a scalar, we obtain $\mathbf w^{\star} = \mathbf{S}_w^{-1} (\mathbf m_1 - \mathbf m_2)$
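A sketch of the two-class Fisher solution in NumPy (names are illustrative):

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher's discriminant direction w* ∝ S_w^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Total within-class scatter: sum of outer products of centered samples
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_w, m1 - m2)
    return w / np.linalg.norm(w)   # only the direction matters
```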
Logistic Regression
Logistic regression is a statistical method for binary classification: it predicts the probability of one of two possible outcomes. Unlike linear regression, which predicts a continuous output, logistic regression produces a probability that is thresholded into a discrete outcome (0 or 1, yes or no, true or false, etc.).
Key Concepts
Odds and Log-Odds:
- Odds: The odds of an event are the ratio of the probability that the event occurs to the probability that it does not: $\text{odds} = \frac{p}{1-p}$.
- Log-Odds (Logit): The natural logarithm of the odds: $\ln\frac{p}{1-p}$.
Logistic Function (Sigmoid Function):
- The logistic function $\sigma(z) = \frac{1}{1+e^{-z}}$ maps any real-valued number into the range $(0, 1)$, making it suitable for probability predictions.
- In logistic regression, $z$ is a linear combination of the input features: $z = \mathbf w^{\rm T}\mathbf x$.
Model Equation:
- The probability of the positive class (e.g., $y=1$) is the logistic function applied to the linear combination of the features: $P(y=1|x) = \sigma(\mathbf w^{\rm T}\mathbf x)$.
- The probability of the negative class (e.g., $y=0$) is: $P(y=0|x) = 1 - \sigma(\mathbf w^{\rm T}\mathbf x)$.
Decision Boundary:
- To make a binary decision, we typically use a threshold (commonly 0.5). If $ P(y=1|x) $ is greater than 0.5, we predict the positive class; otherwise, we predict the negative class.
Training the Model
We use MLE (Maximum Likelihood Estimation) for logistic regression: $\max\limits_{\mathbf w} \prod_{i=1}^N p_i^{y_i} (1 - p_i)^{1 - y_i}$, where $p_i = \sigma(\mathbf w^{\rm T}\mathbf x_i)$
Taking the negative logarithm of the likelihood function, we obtain the loss for logistic regression: $-\sum_{i=1}^N \big[ y_i \ln p_i + (1 - y_i)\ln(1 - p_i) \big]$
Substituting $y_i \in \{0, +1\}$ with $\tilde y_i \in \{-1, +1\}$, and noting that $\sigma(-s) + \sigma(s) = 1$, we can simplify the previous expression to $\sum_{i=1}^N \ln\big(1 + e^{-\tilde y_i \mathbf w^{\rm T}\mathbf x_i}\big)$
This is called the Cross Entropy Loss.
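A compact gradient-descent sketch for minimizing this loss (the learning rate and epoch count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent on the average cross-entropy; y takes values in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)                 # P(y = 1 | x) for every sample
        w -= lr * X.T @ (p - y) / len(y)   # gradient of the cross-entropy loss
    return w
```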
Generalization to K-classes
The generalized version of logistic regression is called Softmax Regression.
The probability of an input $x$ being class $k$ is: $P(y = k \mid x) = \dfrac{\exp(\mathbf w_k^{\rm T}\mathbf x)}{\sum_{j=1}^K \exp(\mathbf w_j^{\rm T}\mathbf x)}$
In the multiclass case, the likelihood function can be written as: $\prod_{i=1}^N P(y = y_i \mid x_i)$
We minimize the negative log-likelihood: $-\sum_{i=1}^N \ln P(y = y_i \mid x_i)$
Perceptron
We predict the label based on the sign of the linear function: $y = \text{sign}(f_{\mathbf w}(x)) = \text{sign}(\mathbf w^{\rm T}\mathbf x)$
For the perceptron, the objective loss function is defined as: $L(\mathbf w) = -\sum_{\mathbf x_i \in \mathcal{X}^k} y_i \mathbf w^{\rm T}\mathbf x_i$
where $\mathcal{X}^k$ is the misclassified sample set at step $k$.
We can use gradient descent to solve for $\mathbf w^{\star}$: $\mathbf w^{k+1} = \mathbf w^{k} + \eta \sum_{\mathbf x_i \in \mathcal{X}^k} y_i \mathbf x_i$
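A sketch of this update rule, assuming labels in $\{-1, +1\}$ and a bias column already appended to $X$:

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=100):
    """Batch perceptron: w <- w + lr * sum of y_i x_i over misclassified samples."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mis = y * (X @ w) <= 0        # the misclassified set X^k
        if not mis.any():
            break                     # all samples separated: converged
        w += lr * (y[mis] @ X[mis])   # gradient step on the perceptron loss
    return w
```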
Support Vector Machine
We want the optimal linear separator, i.e., the classifier most robust to noisy data: the one with the largest margin to the training data. So we want to find the classifier with the largest margin.
Modeling (For the Linearly Separable Problem)
We want the largest margin, $\max\limits_{\mathbf w, b}\rho(\mathbf w, b)$, with all data points classified correctly, i.e., $y_i \cdot (\mathbf w^{\rm T}\mathbf x_i + b) \geq 1$.
The distance between two parallel hyperplanes $\mathbf w^{\rm T}\mathbf x + b_1 = 0$ and $\mathbf w^{\rm T}\mathbf x + b_2 = 0$ is $|b_1 - b_2| / \|\mathbf w\|$, and the distance between a point $\mathbf x_0$ and a hyperplane $(\mathbf w, b)$ is $|\mathbf w^{\rm T} \mathbf x_0 + b| / \|\mathbf w\|$.
The points closest to the classifier satisfy $|\mathbf w^{\rm T} \mathbf x_0 + b| = 1$, so the margin is $\rho = |\mathbf w^{\rm T} \mathbf x_1 + b| / \|\mathbf w\| + |\mathbf w^{\rm T} \mathbf x_2 + b| / \|\mathbf w\| = 2 / \|\mathbf w\|$.
Thus we get the Hard-margin Support Vector Machine: $\max\limits_{\mathbf w, b} \dfrac{2}{\|\mathbf w\|}$
s.t. $y_i \cdot (\mathbf w^{\rm T}\mathbf x_i + b) \geq 1, 1 \leq i \leq n$
For computational convenience, we convert it into: $\min\limits_{\mathbf w, b} \dfrac{1}{2}\|\mathbf w\|^2$
s.t. $y_i \cdot (\mathbf w^{\rm T}\mathbf x_i + b) \geq 1, 1 \leq i \leq n$
Modeling (For the Linearly Non-Separable Problem)
We add slack variables $\xi_i$ that allow points to be classified on the wrong side of the margin, along with a penalty. This gives the Soft-margin SVM: $\min\limits_{\mathbf w, b, \boldsymbol\xi} \dfrac{1}{2}\|\mathbf w\|^2 + C\sum_{i=1}^n \xi_i$
s.t. $y_i \cdot (\mathbf w^{\rm T}\mathbf x_i + b) \geq 1 - \xi_i, \; \xi_i \geq 0, \; 1 \leq i \leq n$
Using the hinge loss $\ell_{\text{hinge}}(t) = \max(1-t, 0)$, we have the final (unconstrained) version of the Soft-margin SVM: $\min\limits_{\mathbf w, b} \dfrac{1}{2}\|\mathbf w\|^2 + C\sum_{i=1}^n \max\big(0,\, 1 - y_i(\mathbf w^{\rm T}\mathbf x_i + b)\big)$
Optimization For Training
Lagrangian Function & KKT Condition
Consider a constrained optimization problem: $\min\limits_{x} f(x)$ s.t. $g_j(x) \leq 0, \; 1 \leq j \leq J$
The Lagrangian function $L(x, \mu)$ is defined as: $L(x, \mu) = f(x) + \sum_{j=1}^J \mu_j g_j(x)$
We have the KKT conditions (necessary conditions): for $1 \leq j \leq J$
- Primal feasibility: $g_j(x^{\star}) \leq 0$
- Dual feasibility: $\mu_j \geq 0$
- Complementary slackness: $\mu_j g_j(x^{\star}) = 0$
- Lagrangian optimality: $\nabla_x L(x^{\star}, \mu) = 0$
Dual Problem For Soft-margin SVM
For the Soft-margin Support Vector Machine: $\min\limits_{\mathbf w, b, \boldsymbol\xi} \dfrac{1}{2}\|\mathbf w\|^2 + C\sum_{i=1}^n \xi_i$
s.t. $y_i \cdot (\mathbf w^{\rm T}\mathbf x_i + b) \geq 1 - \xi_i, \xi_i \geq 0, 1 \leq i \leq n$
We have the Lagrangian function (with $2n$ inequality constraints): $L(\mathbf w, b, \boldsymbol\xi, \boldsymbol\alpha, \boldsymbol\mu) = \dfrac{1}{2}\|\mathbf w\|^2 + C\sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i\big(1 - \xi_i - y_i(\mathbf w^{\rm T}\mathbf x_i + b)\big) - \sum_{i=1}^n \mu_i \xi_i$
s.t. $\alpha_i \geq 0, \mu_i \geq 0, \, i = 1, \ldots, n$.
Take the partial derivatives of the Lagrangian w.r.t. $\mathbf w, b, \xi_i$ and set them to zero.
This gives: $\mathbf w = \sum_{i=1}^n \alpha_i y_i \mathbf x_i, \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad C = \alpha_i + \mu_i$
So we have the dual problem of the Soft-margin SVM: $\max\limits_{\boldsymbol\alpha} \sum_{i=1}^n \alpha_i - \dfrac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf x_i^{\rm T}\mathbf x_j$
s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C, \, i = 1, \ldots, n.$
After solving for $\boldsymbol\alpha$, we get $\mathbf{w} = \sum_{j=1}^n\alpha_j y_j \mathbf x_j$, and $b = y_i - \mathbf w^{\rm T}\mathbf x_i$ for any support vector $\mathbf x_i$ with $0 < \alpha_i < C$ (by complementary slackness).
Kernel Method for SVM
A linear SVM cannot handle linearly non-separable data, so we map the original feature space to a higher-dimensional feature space where the training set is separable.
Basically we could set $x \to \phi(x)$, but computing the inner products $\phi(x_i)^{\rm T}\phi(x_j)$ explicitly is expensive, so we use the kernel trick: find a function $k(x_i, x_j) = \phi(x_i)^{\rm T}\phi(x_j)$ that evaluates the inner product without ever computing $\phi$.
Some commonly used kernels:
Linear Kernel: $k(x_i, x_j) = x_i^{\rm T} x_j$
Polynomial Kernel: $k(x_i, x_j) = (x_i^{\rm T} x_j + c)^p$
Radial Basis Function Kernel (a.k.a. RBF kernel, Gaussian kernel): $k(x_i, x_j) = \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
Sigmoid Kernel: $k(x_i, x_j) = \tanh(\beta\, x_i^{\rm T} x_j + c)$
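A quick numerical check of the kernel trick: for the homogeneous degree-2 polynomial kernel in 2-D, the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ reproduces the kernel value exactly (a toy illustration, not course code):

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x^T z)^2."""
    return (x @ z) ** 2

def phi(x):
    """Explicit feature map with phi(x)^T phi(z) = (x^T z)^2 for 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly2_kernel(x, z), phi(x) @ phi(z))   # both print 1.0
```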
Kernel tricks can also be applied to more algorithms, such as k-NN, LDA, etc.
Decision Tree
We use a tree-like structure to deal with categorical features.
At each node, we pick the most useful feature, i.e., the feature that best splits the data at that node.
ID3 Algorithm
We use entropy as the criterion: $H(D) = -\sum_{k} p_k \log_2 p_k$, where $p_k$ is the fraction of samples in $D$ belonging to class $k$.
A good split gives minimal weighted average entropy of the child nodes: $\sum_{v} \frac{|D_v|}{|D|} H(D_v)$, where $D_v$ is the subset of $D$ taking value $v$ on the chosen feature.
For any split, the entropy of the parent node is constant, so minimizing the weighted entropy of the child nodes is equivalent to maximizing the information gain (IG): $IG(D, f) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v)$
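A small sketch of entropy and information gain for a categorical feature (illustrative names):

```python
import numpy as np

def entropy(labels):
    """H(D) = -sum_k p_k log2 p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """IG = H(parent) - weighted average entropy of the children."""
    children = sum((feature == v).mean() * entropy(labels[feature == v])
                   for v in np.unique(feature))
    return entropy(labels) - children

f = np.array(["a", "a", "b", "b"])
y = np.array([1, 1, 0, 0])
print(information_gain(f, y))   # 1.0: this feature splits the labels perfectly
```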
C4.5 Algorithm
Information Gain is highly biased toward multivalued features, so C4.5 uses the Information Gain Ratio (GR) to choose the optimal feature: $GR(D, f) = \dfrac{IG(D, f)}{IV(f)}$
The Intrinsic Value (IV) penalizes multivalued features. For a selected feature $f$, its Intrinsic Value is: $IV(f) = -\sum_{k \in V} \frac{|F_k|}{|D|} \log_2 \frac{|F_k|}{|D|}$
where $V$ is the set of all possible values of the feature $f$, and $F_k$ is the subset of $D$ where the value of the feature $f$ is $k$. Features with many possible values tend to have a large Intrinsic Value.
Classification and Regression Tree (CART)
The CART tree must be a binary tree.
Regression Tree
How to divide the regions $R = \{R_1, \dots, R_m\}$ and decide the values $V = \{v_1, \dots, v_m\}$?
We use the minimum mean squared error over all examples $x_i$ with labels $y_i$: $\min\limits_{R, V} \sum_{j=1}^m \sum_{x_i \in R_j} (y_i - v_j)^2$
Assume that $R$ has been determined and first find the optimal $V$. For a given region $R_j$, the value $v_j$ that minimizes the loss is the average label of all samples belonging to region $R_j$: $v_j = \operatorname{ave}(y_i \mid x_i \in R_j)$
Now for each feature $A$ and split threshold $a$, the parent node $R$ is split by $(A, a)$ into $R_1$ and $R_2$. We choose $(A, a)$ over all possible values to minimize: $\sum_{x_i \in R_1} (y_i - v_1(A, a))^2 + \sum_{x_i \in R_2} (y_i - v_2(A, a))^2$
where $v_1(A, a)$ and $v_2(A, a)$ are the region averages described above.
Classification Tree
The split criterion is now the Gini Index: $\text{Gini}(D) = 1 - \sum_k p_k^2$
We choose the feature $A$ and the threshold $a$ over all possible values with the maximal gain: $\text{Gini}(D) - \frac{|D_1|}{|D|}\text{Gini}(D_1) - \frac{|D_2|}{|D|}\text{Gini}(D_2)$
Ensemble Learning
Reduce the randomness (variance) by combining multiple learners.
Bagging(Bootstrap Aggregating)
- Create $M$ bootstrap datasets
- Train a learner on each dataset
- Ensemble $M$ learners
Uniformly sample from the original data $D$ with replacement. The bootstrap dataset has the same size as $D$; the probability of a given sample not showing up is $\left(1 - \frac{1}{N}\right)^N \to \frac{1}{e} \approx 36.8\%$ as $N \to \infty$.
We use the elements that appear in $D$ but not in the bootstrap dataset as the validation set (the out-of-bag dataset).
Random Forest
Ensemble decision trees (Training data with $d$ features)
- Create bootstrap datasets
- During tree construction, randomly sample $K (K<d)$ features as candidates for each split. (Usually choose $K = \sqrt d$)
Use feature sampling to make the trees mutually independent and diverse.
Boosting
Boosting: sequentially train learners. The current weak learner focuses more on the examples that previous weak learners misclassified.
Weak classifiers $h_1, \cdots, h_m$ are built sequentially. $h_m$ outputs '$+1$' for one class and '$-1$' for the other class.
Classify by $g(x) = \text{sgn}(\sum \alpha_m h_m(x))$
AdaBoost
Core idea: give higher weights to the misclassified examples so that half of the total sample weight comes from incorrectly classified examples (re-weighting).
Mathematical Formulation:
Weighted Error: $\epsilon_m = \dfrac{\sum_{i=1}^n w_i^{(m)} \mathbb{1}[h_m(x_i) \neq y_i]}{\sum_{i=1}^n w_i^{(m)}}$
Alpha Calculation: $\alpha_m = \dfrac{1}{2} \ln \dfrac{1 - \epsilon_m}{\epsilon_m}$
Weight Update: $w_i^{(m+1)} = \dfrac{w_i^{(m)} \exp(-\alpha_m y_i h_m(x_i))}{Z_m}$, where $Z_m$ is a normalization constant
Final Hypothesis: $g(x) = \text{sgn}\left(\sum_{m} \alpha_m h_m(x)\right)$
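A sketch of AdaBoost with one-dimensional threshold stumps as weak learners, assuming labels in $\{-1, +1\}$ (the stump search is brute force for clarity):

```python
import numpy as np

def stump_predict(X, j, thr, s):
    """Threshold stump: predict s if X[:, j] > thr, else -s."""
    return s * np.where(X[:, j] > thr, 1.0, -1.0)

def adaboost(X, y, T=10):
    n = len(y)
    w = np.ones(n) / n                                   # uniform initial weights
    ensemble = []
    for _ in range(T):
        # Find the stump (feature, threshold, polarity) with minimal weighted error
        err, j, thr, s = min(
            ((w[stump_predict(X, j, thr, s) != y].sum(), j, thr, s)
             for j in range(X.shape[1]) for thr in X[:, j] for s in (+1, -1)),
            key=lambda t: t[0])
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)            # alpha_m
        w *= np.exp(-alpha * y * stump_predict(X, j, thr, s))  # re-weighting
        w /= w.sum()
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in ensemble))
```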
Gradient Boosting
View boosting as an optimization problem. The criterion is to minimize the empirical loss: $\min\limits_{\{\alpha_m, h_m\}} \sum_{i=1}^n l\big(y_i, \sum_m \alpha_m h_m(x_i)\big)$
Loss function $l$ depends on the task:
- Cross entropy for multi-classification
- $\text{L2}$ loss for regression
We use sequential training: optimize a single model at a time, that is freeze $h_1, \cdots, h_{t-1}$ and optimize $h_t$. (Let $f_{t-1}(x) = \sum_{s=1}^{t-1} \alpha_s h_s(x)$, denoting the ensemble of $t-1$ learners.)
Now let's see how to choose $\alpha_t$ and $h_t$. Define $u_i = f_{t-1}(x_i)$ and $u = (u_1, \ldots, u_n)$.
Consider the function $F(u) = \sum_{i=1}^n l(y_i, u_i)$; the original objective is then equivalent to finding a direction $\Delta u$ and step size $\alpha$ at the point $u$ to minimize $F(u + \alpha \Delta u)$.
Following gradient descent, we let $\Delta u = -\nabla_u F(u)$; thus $h_t$ is trained to fit the negative gradient, $h_t(x_i) \approx -\left.\dfrac{\partial l(y_i, u)}{\partial u}\right|_{u = f_{t-1}(x_i)}$ (for L2 loss this is simply the residual $y_i - f_{t-1}(x_i)$).
Then how to decide $\alpha_t$? Use a one-dimensional line search ($y_i, x_i, f_{t-1}, h_t$ are fixed): $\alpha_t = \arg\min\limits_{\alpha} \sum_{i=1}^n l\big(y_i, f_{t-1}(x_i) + \alpha h_t(x_i)\big)$
For simplicity, the search for the optimal multiplier can be replaced by a constant step size (learning rate).
In conclusion, Gradient Boosting = Gradient Descent + Boosting.
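A sketch of gradient boosting for regression with L2 loss, where the negative gradient is just the residual; regression stumps serve as weak learners and a constant learning rate replaces the line search (these are assumptions of this sketch, not course code):

```python
import numpy as np

def fit_stump(X, r):
    """Regression stump minimizing squared error against the residuals r."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:   # last value would leave one side empty
            left = X[:, j] <= thr
            vl, vr = r[left].mean(), r[~left].mean()
            sse = ((r[left] - vl) ** 2).sum() + ((r[~left] - vr) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, thr, vl, vr)
    _, j, thr, vl, vr = best
    return lambda Z: np.where(Z[:, j] <= thr, vl, vr)

def gradient_boost(X, y, T=50, lr=0.1):
    pred = np.full(len(y), y.mean())          # initial constant model
    learners = []
    for _ in range(T):
        h = fit_stump(X, y - pred)            # fit the negative gradient (residuals)
        learners.append(h)
        pred += lr * h(X)                     # constant step size instead of line search
    return y.mean(), learners
```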
Learning Theory
Empirical Risk Minimization (ERM)
Empirical Risk: the average loss of the model $f$ on the training set $\mathcal D = \{(x_i, y_i)\}^N_{i=1}$: $\hat R(f) = \frac{1}{N}\sum_{i=1}^N \ell(f(x_i), y_i)$
Empirical Risk Minimization(ERM): The learning algorithm selects the model that minimizes the empirical risk on the training dataset.
The Consistency of Learning Process
We say a learning process is consistent if the minimizer of the empirical risk converges, in the infinite-data limit, to the minimizer of the expected risk.
Overfitting and Bias-Variance Trade-off
Define the Population Loss (also called Expected Risk) as $R(f) = \mathbb{E}_{(x, y) \sim P}[\ell(f(x), y)]$
Then define the Generalization Gap as: $R(f) - \hat R(f)$
There are two important properties of a predictive model:
- Bias: the assumptions of the target model; represents the extent to which the average prediction over all datasets differs from the desired function.
- Variance: the extent to which the model changes when the training data changes (can be understood as "stability" to dataset change).
Bias-Variance Trade-off: there is an intrinsic tension between bias and variance; the model's test error contains the sum of both.
Bias-Variance Decomposition :
Suppose the ground truth function is $f^*$, the data distribution is $\mu$, the algorithm $\mathcal{A}$ learns from hypothesis space $\mathcal{H}$. We use $y(x; \mathcal{D}) = \mathcal{A}(\mathcal{D}, \mathcal{H})(x)$ to denote the output of ERM model $\hat{f} = \mathcal{A}(\mathcal{D}, \mathcal{H})$ on input $x$.
We are interested in the learned model's prediction error at any $x$, namely $\mathbb{E}_{\mathcal{D}}\big[(y(x; \mathcal{D}) - f^*(x))^2\big]$.
Inserting $\pm\, \mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})]$ and expanding, then taking the expectation over all possible datasets $\mathcal{D}$, the cross term is zero, leaving $\underbrace{\big(\mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})] - f^*(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\big[(y(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})])^2\big]}_{\text{variance}}$
Regularization refers to techniques used to calibrate machine learning models in order to prevent overfitting: it picks a small subset of solutions that are more regular (penalizing parameters for behaving abnormally) to reduce the variance.
Generalization Error and Regularization
VC dimension
The VC dimension is a measure of complexity for a hypothesis class: the largest integer $d$ for a binary classification hypothesis class $\mathcal H$ such that there exist $d$ points in the input space $\mathcal X$ that can be perfectly classified by some function $h \in \mathcal H$ no matter how you assign labels to these $d$ points.
VC dimension characterizes the model class’s capacity for fitting random labels.
Generalization Error Bound
If a hypothesis class $\mathcal{H}$ has VC dimension $d_{vc}$, we have a theorem that states that with probability $1 - \delta$ and $m$ samples, we can bound the generalization gap for any model $h \in \mathcal{H}$ as
Bayesian Decision
Bayesian Decision: Find an optimal classifier according to the prior probability and class-conditional probability density of the feature
The a priori or prior probability reflects our knowledge of how likely a certain state of nature is before we actually observe it.
The class-conditional probability density function is the probability density function $P(x|\omega)$ of our feature $x$, given that the state/class is $\omega$.
The posterior probability is the probability of a certain state/class given our observed feature $x$: $P(\omega | x)$.
Minimum Prediction Error Principle: the optimal classifier $f(\cdot)$ should minimize the expected prediction error, defined as $\mathbb{E}[P(\text{error})] = \int P(\text{error} \mid x)\, p(x)\, dx$
So, for each $x$, we want to minimize $P(\text{error} \mid x)$, i.e., choose $f(x) = \arg\max_i P(\omega_i \mid x)$.
Therefore, the classifier just needs to pick the class with largest posterior probability.
We could use a decision threshold $\theta$ for deciding. Also, we can avoid making decisions on difficult cases in anticipation of a high error rate on those examples.
Density estimation
We need a method to estimate the distribution of each feature, this is called density estimation.
Parametric Density Estimation Method
We can assume that the density function follows some parametric form, for example a Gaussian: $p(x \mid \omega_i) = \mathcal{N}(\mu_i, \sigma_i^2)$
The unknowns $\theta_i = (\mu_i, \sigma_i)$ are called the parameters.
Maximum Likelihood Estimation (MLE)
Likelihood Function: $p(x|\theta)$ measures the likelihood of a parametrized distribution to generate a sample $x$.
Maximum Likelihood Estimation (MLE): choose the parameter $\theta$ that maximizes the likelihood function over all the samples.
For example, if we use a Gaussian to estimate $X = \{x_i\}_{i=1}^N$, MLE gives $\hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)^2$
For the sake of simplicity, denote $H(\theta) = \ln p(X|\theta) = \sum_{i=1}^{N} \ln p(x_i|\theta)$
Non-parametric Density Estimation Method
Non-parametric method makes few assumptions about the form of the distribution and does not involve any parameter about the density function’s form.
Suppose we sample $N$ data points in total, of which $K$ fall within a region $R$ of volume $V$. Each point is sampled i.i.d., and whether a sample falls in $R$ follows a Bernoulli distribution with parameter $P_R$. We have $p(x) \approx \frac{P_R}{V} \approx \frac{K}{NV}$
We can smooth this estimate by applying a kernel (window) function to the count, which gives Parzen-window density estimation, as sketched below.
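A one-dimensional Parzen-window sketch of $p(x) \approx K/(NV)$, replacing the hard in/out-of-region count with a Gaussian window (the bandwidth $h$ is an arbitrary choice):

```python
import numpy as np

def parzen_density(x, data, h=0.5):
    """Kernel density estimate: each sample contributes a smooth 'count' around x."""
    k = np.exp(-((x - data) ** 2) / (2 * h ** 2)) / np.sqrt(2 * np.pi * h ** 2)
    return k.sum() / len(data)

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
print(parzen_density(0.0, data))   # close to the true N(0, 1) density, about 0.399
```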
Hidden Markov Models (HMMs)
Understanding Bayes’ Rule:
- Prior $P(H)$ : How probable was our hypothesis before observing the evidence?
- Likelihood $p(E|H)$ : How probable is the evidence given that our hypothesis is true?
- Marginal $P(E)$: How probable is the new evidence?
Notation | Explanation |
---|---|
$Q = \{q_1, \ldots, q_n\}$ | The set of $n$ hidden states. |
$V = \{v_1, \ldots, v_v\}$ | The set of all possible observed values. |
$A = [a_{ij}]_{n \times n}$ | Transition matrix. $a_{ij}$ is the probability of transitioning from state $i$ to state $j$. $\sum_{j=1}^n a_{ij} = 1 \, \forall i$. |
$O = o_1 o_2 \cdots o_L$ | Observed sequence. $o_t \in V$. |
$x = x_1 x_2 \cdots x_L$ | Hidden state sequence. $x_t \in Q$. |
$E = [e_{ij}]_{n \times v}$ | Emission probability matrix. $e_{ij} = P(o = v_j \mid x = q_i)$ is the probability of observing $v_j$ at state $q_i$. $\sum_{j=1}^{v} e_{ij} = 1 \, \forall i$. |
$\pi = [\pi_1, \pi_2, \ldots, \pi_n]$ | Start probability distribution. $\pi_i$ is the probability of Markov chain starting from $i$. $\sum_{i=1}^n \pi_i = 1$. |
Question #1 – Evaluation
The evaluation problem in HMM: Given a model $M$ and an observed sequence $O$, calculate the probability of the observed sequence $P(O|M)$ .
Forward Algorithm
Denote $\alpha_t(j)$ as the probability of observing $o_1 o_2 \ldots o_t$ with the hidden state at $t$ being $q_j$: $\alpha_t(j) = P(o_1 o_2 \ldots o_t, x_t = q_j)$
By conditioning on the previous hidden state, $\alpha_t(j)$ can be rewritten recursively as: $\alpha_t(j) = \left[\sum_{i=1}^n \alpha_{t-1}(i)\, a_{ij}\right] e_j(o_t)$
Define initial values: $\alpha_1(j) = \pi_j\, e_j(o_1)$
Iterative solving: for $t = 1, \ldots, L-1$, $\alpha_{t+1}(j) = \left[\sum_{i=1}^n \alpha_t(i)\, a_{ij}\right] e_j(o_{t+1})$
Obtaining results: $P(O \mid M) = \sum_{j=1}^n \alpha_L(j)$
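A vectorized sketch of the forward algorithm, with the HMM parameters as NumPy arrays following the notation table above (the toy numbers are made up for illustration):

```python
import numpy as np

def forward(pi, A, E, obs):
    """alpha[t, j] = P(o_1 .. o_t, x_t = q_j); returns P(O | M)."""
    L, n = len(obs), len(pi)
    alpha = np.zeros((L, n))
    alpha[0] = pi * E[:, obs[0]]                      # initialization
    for t in range(1, L):
        alpha[t] = (alpha[t - 1] @ A) * E[:, obs[t]]  # recurrence
    return alpha[-1].sum()                            # sum_j alpha_L(j)

pi = np.array([0.6, 0.4])                 # made-up 2-state example
A = np.array([[0.7, 0.3], [0.4, 0.6]])
E = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, E, obs=[0, 1, 0]))
```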
Backward Algorithm
Denote $\beta_t(j)$ as the probability of observing $o_{t+1} o_{t+2} \ldots o_L$ given that the hidden state at $t$ is $q_j$: $\beta_t(j) = P(o_{t+1} o_{t+2} \ldots o_L \mid x_t = q_j)$
By conditioning on the next hidden state, $\beta_t(j)$ can be rewritten recursively as: $\beta_t(j) = \sum_{i=1}^n a_{ji}\, e_i(o_{t+1})\, \beta_{t+1}(i)$
Define initial values: $\beta_L(j) = 1$
Iterative solving: for $t = L-1, \ldots, 1$, $\beta_t(j) = \sum_{i=1}^n a_{ji}\, e_i(o_{t+1})\, \beta_{t+1}(i)$
Obtaining results: $P(O \mid M) = \sum_{j=1}^n \pi_j\, e_j(o_1)\, \beta_1(j)$
Question #2 – Decoding
The decoding problem in HMM: Given a model $M$ and an observed sequence $O$, calculate the most probable hidden state sequence $\mathbf{x} = \arg\max_{\mathbf{x}} p(\mathbf{x}, O | M)$.
Define: $\delta_t(j) = \max\limits_{x_1, \ldots, x_{t-1}} P(x_1 \cdots x_{t-1}, x_t = q_j, o_1 \cdots o_t)$
According to the recurrence relation, rewrite the above as: $\delta_{t+1}(j) = \left[\max\limits_{i} \delta_t(i)\, a_{ij}\right] e_j(o_{t+1})$
Therefore, the most probable hidden state sequence ends at $x_L^* = \arg\max_j \delta_L(j)$ and is recovered by backtracking through the maximizing predecessors.
Viterbi Algorithm
Define initial values: $\delta_1(j) = \pi_j\, e_j(o_1), \quad \psi_1(j) = 0$
Iterative solving: $\delta_{t+1}(j) = \left[\max\limits_{i} \delta_t(i)\, a_{ij}\right] e_j(o_{t+1}), \quad \psi_{t+1}(j) = \arg\max\limits_{i} \delta_t(i)\, a_{ij}$
Obtaining results: $x_L^* = \arg\max\limits_j \delta_L(j)$, then backtrack $x_t^* = \psi_{t+1}(x_{t+1}^*)$ for $t = L-1, \ldots, 1$
Computational Complexity: $O(n^2 L)$
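A matching NumPy sketch of the Viterbi algorithm (same array conventions as the forward-algorithm sketch above):

```python
import numpy as np

def viterbi(pi, A, E, obs):
    """Return the most probable hidden state sequence via dynamic programming."""
    L, n = len(obs), len(pi)
    delta = np.zeros((L, n))               # best path probability ending in each state
    psi = np.zeros((L, n), dtype=int)      # back-pointers
    delta[0] = pi * E[:, obs[0]]
    for t in range(1, L):
        trans = delta[t - 1][:, None] * A            # delta_{t-1}(i) * a_{ij}
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * E[:, obs[t]]
    path = [int(delta[-1].argmax())]                 # best final state
    for t in range(L - 1, 0, -1):                    # backtrack
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```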
Question #3 – Learning
The learning problem in HMM: Given an observed sequence $O$, estimate the parameters of model: $M = \arg \max \limits_{M}P(M|O)$
For simplicity, in the following steps we only present the learning process of transition matrix $A$. (The other parameters can be learned in a similar manner.)
Baum-Welch Algorithm (a special case of EM algorithm)
- Expectation Step (E-step): Using the observed available data of the dataset, we estimate (guess) the values of the missing data with the current parameters $\theta_{\text{old}}$.
- Maximization Step (M-step): Using complete data generated after the E-step, we update the parameters of the model.
E-step
(#$T_{ij}$ denotes the number of times the hidden state transitions from $q_i$ to $q_j$.)
Generate the guesses of #$T_{ij}$, i.e., the expected counts: $\mathbb{E}[\#T_{ij}] = \sum_{t=1}^{L-1} P(x_t = q_i, x_{t+1} = q_j \mid O, \theta_{\text{old}})$
Each term can be estimated with the Forward and Backward Algorithms: $P(x_t = q_i, x_{t+1} = q_j \mid O, \theta_{\text{old}}) = \dfrac{\alpha_t(i)\, a_{ij}\, e_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \theta_{\text{old}})}$
M-step
Generate new estimates from the expected counts: $\hat a_{ij} = \dfrac{\mathbb{E}[\#T_{ij}]}{\sum_{j'=1}^{n} \mathbb{E}[\#T_{ij'}]}$
This is the estimation when the hidden states are unknown.
Iterative Solving: Recalculate the expected counts with newly estimated parameters (E-step). Then generate newer estimations of $\theta$ with (M-step). Repeat until convergence.
Bayesian Networks
Naive Bayes
Naïve Bayes Assumption: features $X_i$ are conditionally independent given the class $Y$: $P(X_1, \ldots, X_d \mid Y) = \prod_{i=1}^d P(X_i \mid Y)$
Inference: the label can be easily predicted with Bayes' rule: $Y^* = \arg\max\limits_{y} P(Y = y) \prod_{i=1}^d P(X_i = x_i \mid Y = y)$
$Y^*$ is the value that maximizes Likelihood $\times$ Prior.
The parameters of Naïve Bayes can be learned by counting. When the number of samples is small, it is likely to encounter cases where $\text{Count}(Y = y) = 0$ or $\text{Count}(X_i = x, Y = y) = 0$, so we use Laplace Smoothing:
- Prior: $P(Y = y) = \dfrac{\text{Count}(Y = y) + 1}{N + C}$
- Observation Distribution: $P(X_i = x \mid Y = y) = \dfrac{\text{Count}(X_i = x, Y = y) + 1}{\text{Count}(Y = y) + S}$
Here, $C$ is the number of classes and $S$ is the number of possible values that $X_i$ can take.
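A counting sketch of Naive Bayes with Laplace smoothing, assuming integer-coded features in $\{0, \ldots, S-1\}$ and labels in $\{0, \ldots, C-1\}$:

```python
import numpy as np

def nb_fit(X, y, C, S):
    """Estimate smoothed prior and per-feature conditional tables by counting."""
    N, d = X.shape
    prior = np.array([(np.sum(y == c) + 1) / (N + C) for c in range(C)])
    cond = np.zeros((C, d, S))
    for c in range(C):
        Xc = X[y == c]
        for i in range(d):
            for s in range(S):
                # (Count(X_i = s, Y = c) + 1) / (Count(Y = c) + S)
                cond[c, i, s] = (np.sum(Xc[:, i] == s) + 1) / (len(Xc) + S)
    return prior, cond

def nb_predict(x, prior, cond):
    """argmax_c log P(Y=c) + sum_i log P(X_i = x_i | Y=c)."""
    scores = np.log(prior) + np.log(cond[:, np.arange(len(x)), x]).sum(axis=1)
    return int(np.argmax(scores))
```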
Learning & Decision on BN
Bayesian Network
BN$(G, \Theta)$: a Bayesian network
- $G$ is a DAG with nodes and directed edges.
- Each node represents a random variable. Each edge represents a causal relationship/dependency.
- $\Theta$ is the network parameters that constitute conditional probabilities.
- For a node $t$, its parameters are represented as $p(x_t \mid x_{\text{pa}(t)})$.
Joint probability of a BN: $p(x_1, \ldots, x_n) = \prod_{t=1}^n p(x_t \mid x_{\text{pa}(t)})$
where $\text{pa}(t)$ is the set of all parent nodes of node $t$.
Learning on Bayesian Network
Notation: Suppose BN has $n$ nodes, we use $\text{pa}(t)$ to denote the parent nodes of $t$ $(t = 1, \ldots, n)$
By the conditional independence of BN, the likelihood of a sample factorizes as $p(x_1, \ldots, x_n \mid \Theta) = \prod_{t=1}^n p(x_t \mid x_{\text{pa}(t)}, \theta_t)$
Thus, the posterior becomes: $p(\Theta \mid D) \propto p(\Theta)\, p(D \mid \Theta) = p(\Theta) \prod_{i=1}^N \prod_{t=1}^n p\big(x_t^{(i)} \mid x_{\text{pa}(t)}^{(i)}, \theta_t\big)$
Learning BN with Categorical Distribution
Consider the case where each probability distribution in the BN is categorical. In this case, we can model the conditional distribution of node $t$ as categorical (we use a scalar value $c$ to represent the parent nodes' states for simplicity),
and the conditional probability of node $t$ can be denoted as: $p(x_t = k \mid x_{\text{pa}(t)} = c) = \theta_{tck}, \quad \sum_{k} \theta_{tck} = 1$
Categorical Distribution: a discrete distribution over $d$ possible outcomes with $p(x = k) = \theta_k$ and $\sum_{k=1}^d \theta_k = 1$.
E.g., toss a coin ($d = 2$), roll a die ($d = 6$).
Count the training samples where $x_t = k$ and $x_{\text{pa}(t)} = c$: $N_{tck} = \#\{i : x_t^{(i)} = k,\; x_{\text{pa}(t)}^{(i)} = c\}$
According to the property of the categorical distribution, we can represent the likelihood function as: $p(D \mid \Theta) = \prod_{t=1}^n \prod_{c=1}^{q_t} \prod_{k=1}^{K_t} \theta_{tck}^{N_{tck}}$
Thus the posterior can be further factorized: $p(\Theta \mid D) \propto \prod_{t=1}^n \prod_{c=1}^{q_t} \left[ p(\theta_{tc}) \prod_{k=1}^{K_t} \theta_{tck}^{N_{tck}} \right]$
Notation:
- $D_{tc}$ are the sample set where the value of $x_{\text{pa}(t)}$ is $c$
- $q_t$ is the number of possible values of $x_{\text{pa}(t)}$
- $K_t$ is the number of possible values of $x_t$
How to choose the probability distribution function for the prior $p(\theta_{tc})$? It would be highly convenient if the posterior shares the same form as the prior.
Conjugate Prior: A prior distribution is called a conjugate prior for a likelihood function if the posterior distribution is in the same probability distribution family as the prior.
The conjugate prior for the categorical distribution is the Dirichlet distribution: $\text{Dir}(\theta \mid \alpha) = \dfrac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$
Choosing the prior as the conjugate prior, a Dirichlet distribution: $p(\theta_{tc}) = \text{Dir}(\theta_{tc} \mid \alpha_{tc1}, \ldots, \alpha_{tcK_t})$
$\alpha_{tck}$ are integers and are the hyperparameters of the BN model.
In this case, the posterior can be easily derived as: $p(\theta_{tc} \mid D) = \text{Dir}(\theta_{tc} \mid \alpha_{tc1} + N_{tc1}, \ldots, \alpha_{tcK_t} + N_{tcK_t})$
We can then derive an estimate of $\theta_{tck}$ by calculating the expectation: $\hat\theta_{tck} = \dfrac{\alpha_{tck} + N_{tck}}{\sum_{k'=1}^{K_t} (\alpha_{tck'} + N_{tck'})}$
K-Means Algorithm
- Initialize cluster centers $\mu_1, \cdots, \mu_k$ randomly.
Repeat until no change of cluster assignment
- Assignment step: Assign data points to closest cluster center
- Update Step: Change the cluster center to the average of its assigned points
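A compact K-means sketch with Forgy initialization (it assumes no cluster goes empty during the updates):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # Forgy initialization
    for _ in range(iters):
        # Assignment step: index of the nearest center for every point
        r = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1).argmin(1)
        new_mu = np.array([X[r == j].mean(0) for j in range(k)])  # update step
        if np.allclose(new_mu, mu):
            break                                        # assignments stabilized
        mu = new_mu
    return mu, r
```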
Optimization View of K-Means
Optimization Objective: within-cluster sum of squares (WCSS): $J_e = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|^2$, where $r_{nk} \in \{0, 1\}$ indicates whether $x_n$ is assigned to cluster $k$
Step 1: Fix $\mu$, optimize $r$: assign each point to its closest center, $r_{nk} = 1$ iff $k = \arg\min_j \|x_n - \mu_j\|^2$
Step 2: Fix $r$, optimize $\mu$: $\mu_k = \dfrac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$
Rules of Thumb for Initializing K-Means
- Random Initialization: randomly generate $k$ points in the space.
- Random Partition Initialization: randomly group the data into $k$ clusters and use their cluster centers to initialize the algorithm.
- Forgy Initialization: randomly select $k$ samples from the data.
- K-Means++: iteratively choose new centroids that are farthest from the existing centroids.
How to tell the right number of clusters?
We find the elbow point of the curve of $J_e$ versus the number of clusters $k$.
EM Algorithm for Gaussian Mixture Model (GMM)
Multivariate Gaussian Distribution
$d$-dimensional Multivariate Gaussian: $\mathcal N(x \mid \mu, \Sigma) = \dfrac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\dfrac{1}{2}(x - \mu)^{\rm T} \Sigma^{-1} (x - \mu)\right)$
- $\mu \in \mathbb{R}^d$ the mean vector
- $\Sigma \in \mathbb{R}^{d \times d}$ the covariance matrix
MLE of Gaussian Distribution
The likelihood function of a given dataset $X = \{x_1, x_2, \ldots, x_N\}$: $p(X \mid \mu, \Sigma) = \prod_{n=1}^N \mathcal N(x_n \mid \mu, \Sigma)$
The maximum likelihood estimation (MLE) of the parameters is defined by: $(\hat\mu, \hat\Sigma) = \arg\max\limits_{\mu, \Sigma} \ln p(X \mid \mu, \Sigma)$
Solve the optimization by setting the gradient to zero: $\hat\mu = \frac{1}{N}\sum_{n=1}^N x_n, \quad \hat\Sigma = \frac{1}{N}\sum_{n=1}^N (x_n - \hat\mu)(x_n - \hat\mu)^{\rm T}$
Gaussian Mixture Model (GMM)
A Gaussian Mixture Model (GMM) is the weighted sum of a family of Gaussians, whose density function has the form: $p(x \mid \pi, \mu, \Sigma) = \sum_{k=1}^K \pi_k\, \mathcal N(x \mid \mu_k, \Sigma_k)$
- Each Gaussian $\mathcal N(\mu_k, \Sigma_k)$ is called a component of the GMM.
- The scalars $\{\pi_k\}_{k=1}^{K}$ are referred to as mixing coefficients, which satisfy $0 \leq \pi_k \leq 1, \; \sum_{k=1}^K \pi_k = 1$
This condition ensures $p(x \mid \pi, \mu, \Sigma)$ is indeed a density function.
Soft Clustering with Mixture Model
By Bayes' rule, the posterior probability of $z$ given $x$ is: $\gamma_k = p(z = k \mid x) = \dfrac{\pi_k\, \mathcal N(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\, \mathcal N(x \mid \mu_j, \Sigma_j)}$
We call $\gamma_k$ the responsibility of the $k$-th component for the data point $x$.
Probabilistic Clustering: each data point is assigned a probability distribution over the clusters.
“$x$ belongs to the $k$-th cluster with probability $\gamma_k$”
MLE for Gaussian Mixture Model
Log-likelihood function of the GMM: $\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^N \ln \left( \sum_{k=1}^K \pi_k\, \mathcal N(x_n \mid \mu_k, \Sigma_k) \right)$
Maximum Likelihood Estimation: $\max\limits_{\pi, \mu, \Sigma} \ln p(X \mid \pi, \mu, \Sigma)$
subject to: $\pi_k \geq 0, \; \sum_{k=1}^K \pi_k = 1$
Optimality Condition for $\mu$
Take the partial derivative with respect to $\mu_k$ and set it to zero: $\sum_{n=1}^N \dfrac{\pi_k\, \mathcal N(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j\, \mathcal N(x_n \mid \mu_j, \Sigma_j)}\, \Sigma_k^{-1}(x_n - \mu_k) = 0$
Notice that the posterior of $z_n$ (also known as the responsibility $\gamma_{n,k}$) appears inside the sum: $\gamma_{n,k} = \dfrac{\pi_k\, \mathcal N(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j\, \mathcal N(x_n \mid \mu_j, \Sigma_j)}$
Thus $\mu_k = \dfrac{1}{N_k} \sum_{n=1}^N \gamma_{n,k}\, x_n, \quad N_k = \sum_{n=1}^N \gamma_{n,k}$
Optimality Condition for $\Sigma$
Similarly, take the derivative with respect to $\Sigma_k$, which yields $\Sigma_k = \dfrac{1}{N_k} \sum_{n=1}^N \gamma_{n,k} (x_n - \mu_k)(x_n - \mu_k)^{\rm T}$
This is the responsibility-reweighted sample covariance.
Optimality Condition for $\pi$
Constraints of mixing coefficients $\pi$: $\sum_{k=1}^{K} \pi_k = 1$
Introduce a Lagrange multiplier: $\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^K \pi_k - 1 \right)$
Take the derivative with respect to $\pi_k$, which gives $\sum_{n=1}^N \dfrac{\mathcal N(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j\, \mathcal N(x_n \mid \mu_j, \Sigma_j)} + \lambda = 0$, i.e., $N_k + \lambda \pi_k = 0$
By the constraints, we have $1 = \sum_{k=1}^{K} \pi_k = \frac{-1}{\lambda} \sum_{k=1}^{K} N_k$
Also notice that $\sum_{k=1}^K N_k = \sum_{n=1}^N \sum_{k=1}^K \gamma_{n,k} = N$, so $\lambda = -N$
Therefore, $\pi_k = \dfrac{N_k}{N}$
Expectation-Maximization (EM) Algorithm
- Initialize $\pi_k, \mu_k, \Sigma_k, \quad k = 1, 2, \ldots, K$
- E-Step: Evaluate the responsibilities using the current parameter values: $\gamma_{n,k} = \dfrac{\pi_k\, \mathcal N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\, \mathcal N(x_n \mid \mu_j, \Sigma_j)}$
- M-Step: Re-estimate the parameters using the current responsibilities: $\mu_k^{\text{new}} = \dfrac{1}{N_k} \sum_{n=1}^N \gamma_{n,k}\, x_n, \quad \Sigma_k^{\text{new}} = \dfrac{1}{N_k} \sum_{n=1}^N \gamma_{n,k} (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{\rm T}, \quad \pi_k^{\text{new}} = \dfrac{N_k}{N}$
where $N_k = \sum_{n=1}^{N} \gamma_{n,k}$
- Return to step 2 if the convergence criterion is not satisfied.
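A direct transcription of the E-step and M-step into NumPy/SciPy (illustrative; no convergence check beyond a fixed iteration count):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=100, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(K)])
    for _ in range(iters):
        # E-step: responsibilities gamma[n, k]
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted re-estimates
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
        pi = Nk / N
    return pi, mu, Sigma
```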
Hierarchical Clustering
Distance Function: The distance function affects which pairs of clusters are merged/split and in what order.
- Single Linkage: $d(C_i, C_j) = \min\limits_{x \in C_i,\, y \in C_j} d(x, y)$
- Complete Linkage: $d(C_i, C_j) = \max\limits_{x \in C_i,\, y \in C_j} d(x, y)$
- Average Linkage: $d(C_i, C_j) = \dfrac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$
Two Types of Hierarchical Clustering
Bottom-Up (Agglomerative)
- Start with each item in its own cluster, find the best pair to merge into a new cluster.
- Repeat until all clusters are fused together.
Top-Down (Divisive)
- Start with one all-inclusive cluster, consider every possible way to divide the cluster in two.
- Choose the best division and recursively operate on both sides.
Agglomerative (Bottom-up) Clustering
- Input: cluster distance measure $d$, dataset $X = \{x_n\}_{n=1}^{N}$, number of clusters $k$
- Initialize $\mathcal{C} = \{C_i = \{x_n\} \mid x_n \in X\}$ // Each point in separate cluster
- Repeat:
- Find the closest pair of clusters $C_i, C_j \in \mathcal{C}$ based on distance metric $d$
- $C_{ij} = C_i \cup C_j$ // Merge the selected clusters
- $\mathcal{C} = (\mathcal{C} \setminus \{C_i, C_j\}) \cup \{C_{ij}\}$ // Update the clustering
- Until $|\mathcal{C}| = k$
A naïve implementation takes space complexity $O(N^2)$, time complexity $O(N^3)$.
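In practice the merge tree can be built with SciPy; a short usage sketch (the data and cluster count are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),      # two well-separated blobs
               rng.normal(3, 0.3, (20, 2))])

Z = linkage(X, method="single")                  # also "complete" or "average"
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into k = 2 clusters
print(labels)
```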
LASSO Regression
LASSO (Least Absolute Shrinkage and Selection Operator): simply linear regression with an $\ell_1$ penalty for sparsity: $\min\limits_{\mathbf w} \frac{1}{N}\|\mathbf{Xw} - \mathbf y\|_2^2 + \lambda \|\mathbf w\|_1$
sparse solution $\leftrightarrow$ feature selection
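A quick demonstration of the sparsity effect with scikit-learn's Lasso (the data and $\alpha$ are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)  # only 2 relevant features

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # most coefficients are driven exactly to zero
```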
Principal Component Analysis (PCA)
Computing PCA: Eigenvalue Decomposition
Objective: maximize the variance of the projected data: $\max\limits_{\mathbf u_j} \frac{1}{n} \sum_{i=1}^n (\mathbf u_j^{\rm T} \mathbf x_i)^2 = \mathbf u_j^{\rm T} \left( \frac{1}{n} \mathbf X^{\rm T} \mathbf X \right) \mathbf u_j$ (with $\mathbf X$ centered)
subject to $\mathbf{u}_j^{\rm T} \mathbf{u}_j = 1$ and $\mathbf{u}_j^{\rm T} \mathbf{u}_k = 0$ for $k < j$
Observation: PC $j$ is the direction of the eigenvector of $\frac{1}{n} \mathbf{X}^{\rm T} \mathbf{X}$ with the $j$-th largest eigenvalue.
Eigenvalue Decomposition: $\frac{1}{n} \mathbf{X}^{\rm T} \mathbf{X} = \mathbf U \boldsymbol\Lambda \mathbf U^{\rm T}$, where the columns of $\mathbf U = [\mathbf u_1, \ldots, \mathbf u_d]$ are the eigenvectors of $\frac{1}{n} \mathbf{X}^{\rm T} \mathbf{X}$ and $\boldsymbol\Lambda$ is the diagonal matrix of eigenvalues.
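A sketch of PCA via eigendecomposition of the centered covariance matrix:

```python
import numpy as np

def pca(X, m):
    """Project X onto its top-m principal components."""
    Xc = X - X.mean(axis=0)                  # center the data first
    cov = Xc.T @ Xc / len(Xc)                # (1/n) X^T X
    vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    U = vecs[:, np.argsort(vals)[::-1][:m]]  # keep the m largest
    return Xc @ U, U                         # projected data and the components
```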
Manifold Learning
Geodesic distance: the length of the shortest path between points along the manifold.