Principal Component Analysis (PCA)

Tags: Machine Learning, Math

Covariance Matrix

Suppose $$$X$$$ is an $$$ n \times p $$$ matrix, where n is the number of objects and p is the number of features, and $$$X$$$ has zero mean (each column has been centered).

$$ X^T X = nV $$ $$$ V $$$ is the covariance matrix of $$$X$$$, and it is easy to see that $$$V$$$ is a symmetric, positive semi-definite matrix.
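As a quick sanity check, here is a minimal NumPy sketch (the data and variable names are my own, not part of the post) that centers a data matrix and forms $$$V = \frac{X^TX}{n}$$$:

```python
import numpy as np

# Minimal sketch: build a zero-mean n x p data matrix X, then form V = X^T X / n.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                  # center each feature so X has zero mean

V = X.T @ X / n                         # p x p covariance matrix

# np.cov with bias=True uses the same 1/n normalization
assert np.allclose(V, np.cov(X, rowvar=False, bias=True))
assert np.allclose(V, V.T)              # V is symmetric
```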

Minimize Residuals

In PCA, we want to project the original data onto a lower-dimensional space. Suppose we want to project the p-dimensional data onto a line whose unit vector is $$$ \vec{w} $$$; then for $$$ \vec{x}_i $$$ the projection length is $$$ \vec{x}_i \cdot \vec{w} $$$, and its coordinate in p-dimensional space is $$$ (\vec{x}_i \cdot \vec{w}) \vec{w} $$$. The residual is:

$$ \begin{eqnarray*} \|\vec{x}_i - (\vec{x}_i \cdot \vec{w}) \vec{w}\|^2 &=& \|\vec{x}_i\|^2-2(\vec{x}_i\cdot\vec{w})^2 + \|(\vec{x}_i \cdot \vec{w})\vec{w}\|^2 \\ &=& \|\vec{x}_i\|^2 - (\vec{x}_i \cdot \vec{w})^2 \end{eqnarray*} $$
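This identity is easy to verify numerically; a tiny sketch (the random $$$\vec{x}$$$ and unit $$$\vec{w}$$$ below are my own choices):

```python
import numpy as np

# Check: for a unit vector w, ||x - (x.w) w||^2 == ||x||^2 - (x.w)^2
rng = np.random.default_rng(1)
x = rng.normal(size=3)
w = rng.normal(size=3)
w /= np.linalg.norm(w)                  # make w a unit vector

lhs = np.linalg.norm(x - (x @ w) * w) ** 2
rhs = np.linalg.norm(x) ** 2 - (x @ w) ** 2
assert np.isclose(lhs, rhs)
```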

Summing over all data points, the total residual sum of squares is:

$$ \begin{eqnarray*} RSS(\vec{w}) &=& \sum_{i=1}^n\|\vec{x}_i\|^2 - \sum_{i=1}^n(\vec{x}_i \cdot \vec{w})^2 \\ \end{eqnarray*} $$

Because $$$\| \vec{x}_i \|^2$$$ does not depend on $$$\vec{w}$$$, to minimize RSS we need to maximize $$$ \sum_{i=1}^n(\vec{x}_i \cdot \vec{w})^2 $$$. Because $$$X$$$ has zero mean, we have

$$ \begin{eqnarray*} E(X\vec{w}) &=& \frac{1}{n}\sum_{i=1}^n \vec{x}_i\vec{w} \\ &=& (\frac{1}{n}\sum_{i=1}^n \vec{x}_i) \vec{w} \\ &=& E(X)\vec{w} \\ &=& 0 \end{eqnarray*} $$ $$ \begin{eqnarray*} \sum_{i=1}^n(\vec{x}_i \cdot \vec{w})^2 &=& \sum_{i=1}^n(\vec{x}_i \cdot \vec{w} - 0)^2 \\ &=& \sum_{i=1}^n(\vec{x}_i\cdot\vec{w} - E(X\vec{w}))^2 \\ &=& n\frac{1}{n}\sum_{i=1}^n(\vec{x}_i\cdot\vec{w} - E(X\vec{w}))^2 \\ &=& n Var(X\vec{w}) \end{eqnarray*} $$

Now we can rewrite RSS as:

$$ \begin{eqnarray*} RSS(\vec{w}) &=& \sum_{i=1}^n \|\vec{x}_i\|^2 - \sum_{i=1}^n(\vec{x}_i \cdot \vec{w})^2 \\ &=& \sum_{i=1}^n \|\vec{x}_i\|^2 - n Var(X\vec{w}) \end{eqnarray*} $$

So, in order to minimize RSS, we just need to maximize variance $$$Var(X\vec{w})$$$.

$$ \begin{eqnarray*} \sigma^2 &=& \frac{1}{n}\sum_{i=1}^n(\vec{x}_i\cdot\vec{w})^2 \\ &=& \frac{1}{n}(X\vec{w})^T(X\vec{w}) \\ &=& \frac{1}{n}\vec{w}^TX^TX\vec{w} \\ &=& \vec{w}^T \frac{X^TX}{n} \vec{w} \\ &=& \vec{w}^T V \vec{w} \end{eqnarray*} $$
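This identity can be checked directly; a small sketch with my own random data and unit vector:

```python
import numpy as np

# Check: the variance of the projections X w equals w^T V w.
rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
V = X.T @ X / n

w = rng.normal(size=p)
w /= np.linalg.norm(w)                  # unit vector

proj = X @ w                            # projection lengths x_i . w
assert np.isclose(np.mean(proj ** 2), w @ V @ w)
```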

What we want is to find a unit vector $$$\vec{w}$$$ such that $$$\sigma^2$$$ is maximized. Let $$$f(\vec{w}) = \vec{w}^TV\vec{w}$$$ and $$$ g(\vec{w}) = \vec{w}^T\vec{w}$$$. We want to maximize $$$f(\vec{w})$$$ subject to the constraint $$$ g(\vec{w}) = 1 $$$. To solve this problem, we introduce a Lagrange multiplier $$$ \lambda $$$ and construct a new function.

$$ L(\vec{w}, \lambda) = f(\vec{w}) - \lambda(g(\vec{w}) - 1) $$

This is our new objective function, so we differentiate with respect to both arguments and set the derivatives equal to zero:

$$ \begin{eqnarray*} \frac{\partial L}{\partial \vec{w}} &=& \frac{\partial f(\vec{w})}{\partial \vec{w}} - \lambda \frac{\partial g(\vec{w})}{\partial \vec{w}}= 0 \\ \frac{\partial L}{\partial \lambda} &=& -g(\vec{w}) + 1 = 0 \end{eqnarray*} $$

Maximizing with respect to $$$\lambda$$$ gives us back our constraint equation. If $$$\lambda$$$ were zero, the objective function would be unconstrained; the bigger $$$\lambda$$$ is, the more the constraint "bites". Because we have to satisfy the constraint, we need $$$ \lambda \neq 0 $$$, so that the maximized value of L equals the original objective function f under the constraint g=1. Since V is a symmetric matrix, we have $$$ \frac{\partial \vec{w}^TV\vec{w}}{\partial \vec{w}} = 2V\vec{w}$$$.

$$ \begin{eqnarray*} \frac{\partial f}{\partial \vec{w}} - \lambda\frac{\partial g}{\partial \vec{w}} &=& 0 \\ \frac{\partial \vec{w}^TV\vec{w}}{\partial \vec{w}} - \lambda\frac{\partial \vec{w}^T\vec{w}}{\partial \vec{w}} &=& 0 \\ 2V\vec{w} - 2\lambda\vec{w} &=& 0 \\ V\vec{w} &=& \lambda\vec{w} \end{eqnarray*} $$

It follows that $$$\vec{w}$$$ is an eigenvector of V and $$$\lambda$$$ is an eigenvalue of V. Moreover, $$$\sigma^2 = \vec{w}^TV\vec{w} = \lambda\vec{w}^T\vec{w} = \lambda$$$, so the variance is maximized by picking the eigenvector whose eigenvalue is largest. This is good news, because finding eigenvectors is something that can be done comparatively rapidly, and eigenvectors have many nice mathematical properties.
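As a sketch of this conclusion (the toy data and variable names below are my own illustration), the top eigenvector of $$$V$$$ indeed gives a larger projected variance than random unit directions:

```python
import numpy as np

# The eigenvector of V with the largest eigenvalue maximizes the projected variance.
rng = np.random.default_rng(3)
n, p = 500, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated features
X = X - X.mean(axis=0)
V = X.T @ X / n

eigvals, eigvecs = np.linalg.eigh(V)    # eigh: for symmetric matrices, ascending order
w_best = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
var_best = w_best @ V @ w_best          # equals the largest eigenvalue

for _ in range(1000):
    w = rng.normal(size=p)
    w /= np.linalg.norm(w)
    assert w @ V @ w <= var_best + 1e-12
```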

Eigenvector

For any $$$n \times n$$$ square matrix $$$A$$$ and any nonzero vector $$$\vec{w}$$$, if $$$A\vec{w} = \lambda \vec{w}$$$, then $$$\vec{w}$$$ is an eigenvector of A and $$$\lambda$$$ is the corresponding eigenvalue. Suppose $$$I$$$ is the identity matrix; then we have $$$ (A - \lambda I)\vec{w} = 0 $$$. Because $$$\vec{w}$$$ is nonzero, $$$ (A-\lambda I) $$$ must be singular, that is $$$ |A - \lambda I| = 0 $$$, a polynomial equation of degree n in $$$\lambda$$$. Such an equation has n roots $$$\lambda$$$, some of which may be repeated or even complex. Two common properties of the eigenvalues (checked numerically in the sketch after this list) are:

  • $$$ \sum_{i=1}^n \lambda_i = trace(A) $$$
  • $$$ \prod_{i=1}^n \lambda_i = |A| $$$
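Both properties are easy to check numerically; a small sketch with a random matrix of my own choosing:

```python
import numpy as np

# The eigenvalues of A sum to trace(A) and multiply to det(A).
rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))             # a general square matrix

eigvals = np.linalg.eigvals(A)          # may be complex for a non-symmetric A
assert np.isclose(eigvals.sum(), np.trace(A))
assert np.isclose(eigvals.prod(), np.linalg.det(A))
```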

If the matrix A is symmetric, then its eigenvectors have another nice property:

  • $$$\vec{w}_i \cdot \vec{w}_j = 0 $$$ when $$$\lambda_i \neq \lambda_j$$$; that is, the eigenvectors of a symmetric matrix are orthogonal.

Now, let's prove this property. First, for any real matrix A and any vectors $$$\vec{x}$$$ and $$$\vec{y}$$$, we have

$$ \begin{eqnarray*} \langle A\vec{x}, \vec{y} \rangle &=& (A\vec{x})^T \vec{y} = \vec{x}^T A^T \vec{y} \\ \langle \vec{x}, A^T\vec{y} \rangle &=& \vec{x}^T A^T \vec{y} \\ \langle A\vec{x}, \vec{y} \rangle &=& \langle \vec{x}, A^T \vec{y} \rangle \end{eqnarray*} $$

Suppose A is a symmetric matrix, that is $$$ A^T = A$$$, with two distinct eigenvalues $$$\lambda_i$$$ and $$$\lambda_j$$$ and corresponding eigenvectors $$$\vec{w}_i , \vec{w}_j$$$. From the above we have:

$$ \begin{eqnarray*} \lambda_i \langle \vec{w}_i, \vec{w}_j \rangle &=& \langle \lambda_i\vec{w}_i, \vec{w}_j \rangle \\ &=& \langle A\vec{w}_i, \vec{w}_j \rangle \\ &=& \langle \vec{w}_i, A^T\vec{w}_j \rangle \\ &=& \langle \vec{w}_i, A\vec{w}_j \rangle \\ &=& \langle \vec{w}_i, \lambda_j \vec{w}_j \rangle \\ &=& \lambda_j \langle \vec{w}_i, \vec{w}_j \rangle \end{eqnarray*} $$

Because $$$\lambda_i \neq \lambda_j $$$, we must have $$$\langle \vec{w}_i, \vec{w}_j \rangle = 0$$$, which means that the eigenvectors of a symmetric matrix are orthogonal.
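A quick numerical illustration of this orthogonality (the symmetric matrix below is my own random example; `np.linalg.eigh` returns orthonormal eigenvectors for symmetric input):

```python
import numpy as np

# Eigenvectors of a symmetric matrix are orthogonal (eigh returns them orthonormal).
rng = np.random.default_rng(5)
B = rng.normal(size=(5, 5))
A = (B + B.T) / 2                       # symmetrize

_, W = np.linalg.eigh(A)                # columns of W are eigenvectors
assert np.allclose(W.T @ W, np.eye(5))  # pairwise dot products 0, unit norms
```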

Transformation Matrix

Now let's go back to PCA. Because the covariance matrix $$$V$$$ is positive semi-definite, $$$\vec{v}^TV\vec{v} \geq 0 $$$ for any vector $$$\vec{v}$$$, so the eigenvalues of V are non-negative, and the eigenvectors of V are the principal components of the data. Usually we pick the q largest positive eigenvalues and use their eigenvectors to construct a q-dimensional basis; because these eigenvectors are orthogonal, they span the whole q-dimensional subspace. These q eigenvectors form the columns of a $$$ p \times q $$$ transformation matrix $$$W$$$, and this is the matrix we use to reduce the p-dimensional data to q-dimensional data.
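To tie the pieces together, here is a minimal end-to-end sketch under the notation of this post (an $$$n \times p$$$ data matrix and a $$$p \times q$$$ matrix $$$W$$$); the function name and data are my own illustration, not part of the post:

```python
import numpy as np

def pca_transform(X, q):
    """Project an n x p data matrix onto its top-q principal components."""
    X = X - X.mean(axis=0)                  # zero-mean the data
    V = X.T @ X / X.shape[0]                # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(V)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # indices sorted descending
    W = eigvecs[:, order[:q]]               # p x q matrix of top-q eigenvectors
    return X @ W, W                         # n x q projected data and the transform

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
Y, W = pca_transform(X, q=2)
print(Y.shape, W.shape)                     # (200, 2) (6, 2)
```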

