# Principal Component Analysis (PCA)


## Covariance Matrix

Suppose $$X$$ is an $$n \times p$$ matrix, where $$n$$ is the number of objects and $$p$$ is the number of features, and suppose $$X$$ has zero mean (each column has been centered). Then $$X^T X = nV$$, where $$V$$ is the covariance matrix of $$X$$, and it is easy to see that $$V$$ is a symmetric and positive semi-definite matrix.
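A quick numerical sanity check of these claims, sketched in NumPy (the data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Random data, then center each column (feature) to zero mean.
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)

# Covariance matrix V = X^T X / n.
V = X.T @ X / n

# V is symmetric ...
assert np.allclose(V, V.T)
# ... and positive semi-definite: all eigenvalues are non-negative.
assert np.all(np.linalg.eigvalsh(V) >= -1e-10)
```

Note that `X.T @ X / n` is the population covariance (divide by $$n$$); `np.cov` divides by $$n-1$$ by default, so the two differ by a factor of $$n/(n-1)$$.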

## Minimize Residuals

In PCA, we want to project the original data into a lower-dimensional space. Suppose we want to project the $$p$$-dimensional data onto a line whose unit vector is $$\vec{w}$$. Then for $$\vec{x}_i$$ the projection is $$\vec{x}_i \cdot \vec{w}$$, and the corresponding point in $$p$$-dimensional space is $$(\vec{x}_i \cdot \vec{w}) \vec{w}$$. The residual is:

$$\begin{eqnarray*} \|\vec{x}_i - (\vec{x}_i \cdot \vec{w}) \vec{w}\|^2 &=& \|\vec{x}_i\|^2-2(\vec{x}_i\cdot\vec{w})^2 + \|(\vec{x}_i \cdot \vec{w})\vec{w}\|^2 \\ &=& \|\vec{x}_i\|^2 - (\vec{x}_i \cdot \vec{w})^2 \end{eqnarray*}$$
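This identity (which uses $$\|\vec{w}\| = 1$$) can be verified numerically; a small sketch with arbitrary random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)

# A random unit vector w.
w = rng.normal(size=4)
w /= np.linalg.norm(w)

# Squared residual of projecting x onto the line spanned by w ...
lhs = np.linalg.norm(x - (x @ w) * w) ** 2
# ... equals ||x||^2 - (x . w)^2.
rhs = np.linalg.norm(x) ** 2 - (x @ w) ** 2
assert np.isclose(lhs, rhs)
```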

Summing over all vectors, the total residual sum of squares is:

$$\begin{eqnarray*} RSS(\vec{w}) &=& \sum_{i=1}^n\|\vec{x}_i\|^2 - \sum_{i=1}^n(\vec{x}_i \cdot \vec{w})^2 \\ \end{eqnarray*}$$

Because $$\| \vec{x}_i \|^2$$ does not depend on $$\vec{w}$$, to minimize RSS we need to maximize $$\sum_{i=1}^n(\vec{x}_i \cdot \vec{w})^2$$. Because $$X$$ has zero mean,

$$\begin{eqnarray*} E(X\vec{w}) &=& \frac{1}{n}\sum_{i=1}^n \vec{x}_i \cdot \vec{w} \\ &=& \left(\frac{1}{n}\sum_{i=1}^n \vec{x}_i\right) \cdot \vec{w} \\ &=& E(X) \cdot \vec{w} \\ &=& 0 \end{eqnarray*}$$

and therefore

$$\begin{eqnarray*} \sum_{i=1}^n(\vec{x}_i \cdot \vec{w})^2 &=& \sum_{i=1}^n(\vec{x}_i \cdot \vec{w} - 0)^2 \\ &=& \sum_{i=1}^n(\vec{x}_i\cdot\vec{w} - E(X\vec{w}))^2 \\ &=& n\frac{1}{n}\sum_{i=1}^n(\vec{x}_i\cdot\vec{w} - E(X\vec{w}))^2 \\ &=& n Var(X\vec{w}) \end{eqnarray*}$$
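Both facts are easy to check numerically. A minimal sketch, assuming centered random data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)           # zero-mean columns

w = rng.normal(size=p)
w /= np.linalg.norm(w)        # unit vector

proj = X @ w                  # the projections x_i . w

# E(Xw) = 0 because X has zero mean.
assert np.isclose(proj.mean(), 0)
# Sum of squared projections = n * Var(Xw).
assert np.isclose(np.sum(proj**2), n * proj.var())
```

`np.var` uses the population variance (divide by $$n$$), matching the $$\frac{1}{n}$$ in the derivation above.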

Now we can transform RSS into:

$$\begin{eqnarray*} RSS(\vec{w}) &=& \sum_{i=1}^n \|\vec{x}_i\|^2 - \sum_{i=1}^n(\vec{x}_i \cdot \vec{w})^2 \\ &=& \sum_{i=1}^n \|\vec{x}_i\|^2 - n Var(X\vec{w}) \end{eqnarray*}$$

So, in order to minimize RSS, we just need to maximize the variance $$Var(X\vec{w})$$:

$$\begin{eqnarray*} \sigma^2 &=& \frac{1}{n}\sum_{i=1}^n(\vec{x}_i\cdot\vec{w})^2 \\ &=& \frac{1}{n}(X\vec{w})^T(X\vec{w}) \\ &=& \frac{1}{n}\vec{w}^TX^TX\vec{w} \\ &=& \vec{w}^T \frac{X^TX}{n} \vec{w} \\ &=& \vec{w}^T V \vec{w} \end{eqnarray*}$$

What we want is a unit vector $$\vec{w}$$ that maximizes $$\sigma^2$$. Let $$f(\vec{w}) = \vec{w}^TV\vec{w}$$ and $$g(\vec{w}) = \vec{w}^T\vec{w}$$. We want to maximize $$f(\vec{w})$$ subject to the constraint $$g(\vec{w}) = 1$$. To solve this problem, we can introduce a Lagrange multiplier $$\lambda$$ and construct a new function.
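The key identity $$\sigma^2 = \vec{w}^T V \vec{w}$$ is straightforward to verify; a sketch on random centered data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
V = X.T @ X / n               # covariance matrix

w = rng.normal(size=p)
w /= np.linalg.norm(w)        # unit vector

# Variance of the projected data equals the quadratic form w^T V w.
sigma2 = (X @ w).var()
assert np.isclose(sigma2, w @ V @ w)
```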

$$L(\vec{w}, \lambda) = f(\vec{w}) - \lambda(g(\vec{w}) - 1)$$

This is our new objective function, so we differentiate with respect to both arguments and set the derivatives equal to zero:

$$\begin{eqnarray*} \frac{\partial L}{\partial \vec{w}} &=& \frac{\partial f(\vec{w})}{\partial \vec{w}} - \lambda \frac{\partial g(\vec{w})}{\partial \vec{w}}= 0 \\ \frac{\partial L}{\partial \lambda} &=& -g(\vec{w}) + 1 = 0 \end{eqnarray*}$$

Maximizing with respect to $$\lambda$$ gives us back our constraint equation. If $$\lambda$$ were zero, the objective function would be unconstrained; the bigger $$\lambda$$ is, the more the constraint "bites". Because we have to satisfy the constraint, we need $$\lambda \neq 0$$, so that the maximized value of $$L$$ equals the value of the original objective function $$f$$ under the constraint $$g = 1$$. Because $$V$$ is a symmetric matrix, we have $$\frac{\partial \vec{w}^TV\vec{w}}{\partial \vec{w}} = 2V\vec{w}$$.

$$\begin{eqnarray*} \frac{\partial f}{\partial \vec{w}} - \lambda\frac{\partial g}{\partial \vec{w}} &=& 0 \\ \frac{\partial \vec{w}^TV\vec{w}}{\partial \vec{w}} - \lambda\frac{\partial \vec{w}^T\vec{w}}{\partial \vec{w}} &=& 0 \\ 2V\vec{w} - 2\lambda\vec{w} &=& 0 \\ V\vec{w} &=& \lambda\vec{w} \end{eqnarray*}$$

It is obvious that $$\vec{w}$$ is an eigenvector of $$V$$ and $$\lambda$$ is an eigenvalue of $$V$$. This is good news, because finding eigenvectors is something that can be done comparatively rapidly, and eigenvectors have many nice mathematical properties.

## Eigenvector

For any square matrix $$A$$ (say $$n \times n$$) and any nonzero vector $$\vec{w}$$, if $$A\vec{w} = \lambda \vec{w}$$, then $$\vec{w}$$ is an eigenvector of $$A$$ and $$\lambda$$ is an eigenvalue of $$A$$. Suppose $$I$$ is the identity matrix; then we have $$(A - \lambda I)\vec{w} = 0$$. Because $$\vec{w}$$ is nonzero, $$(A-\lambda I)$$ must be singular, that is, $$|A - \lambda I| = 0$$, a polynomial equation of degree $$n$$ in $$\lambda$$. By the properties of such an equation, we will get $$n$$ values of $$\lambda$$; some of them may be repeated, or even complex numbers. There are two common properties of the eigenvalues:

- $$\sum_{i=1}^n \lambda_i = trace(A)$$
- $$\prod_{i=1}^n \lambda_i = |A|$$
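These two properties, and the earlier conclusion that the variance-maximizing direction is an eigenvector of $$V$$, can be checked numerically. A sketch on random centered data (using `numpy.linalg.eigh`, which is designed for symmetric matrices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 3
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
V = X.T @ X / n

eigvals, eigvecs = np.linalg.eigh(V)   # eigenvalues in ascending order

# Sum of eigenvalues = trace(V); product of eigenvalues = |V|.
assert np.isclose(eigvals.sum(), np.trace(V))
assert np.isclose(eigvals.prod(), np.linalg.det(V))

# The eigenvector with the largest eigenvalue maximizes w^T V w over
# unit vectors, so it is the first principal component direction.
w1 = eigvecs[:, -1]
for _ in range(100):
    w = rng.normal(size=p)
    w /= np.linalg.norm(w)             # random unit vector
    assert w @ V @ w <= w1 @ V @ w1 + 1e-10
```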

If the matrix A is symmetric, then the eigenvector has another nice property: