In statistics, principal component analysis (PCA) is a technique for simplifying a dataset; more formally, it is a linear transform used to reduce the dimensionality of a dataset while retaining those characteristics of the dataset that contribute most to its variance. Depending on the application, these characteristics may or may not be the 'most important' ones.
PCA is also called the Karhunen-Loève transform or the Hotelling transform. Among linear transforms, PCA is optimal for retaining the subspace with the largest variance. This optimality comes at the price of greater computational cost than, for example, the discrete cosine transform. Unlike other linear transforms, PCA has no fixed set of basis vectors: its basis vectors depend on the dataset.
The first principal component w1 of a dataset x (assumed to have zero mean, i.e. E(x) = 0) can be defined as the unit vector that maximizes the variance of the projection onto it:

w1 = arg max_{||w|| = 1} E[(w^T x)^2]
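This definition can be sketched numerically. A standard result is that the maximizing unit vector is the eigenvector of the covariance matrix of x with the largest eigenvalue; the sketch below (assuming NumPy and a small synthetic, zero-mean dataset) computes w1 that way and checks that it beats a random direction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic dataset: 500 samples of a 2-D variable with correlated components.
x = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 1.0]])
x -= x.mean(axis=0)  # enforce the zero-mean assumption E(x) = 0

# w1 maximizes the variance of the projection w^T x over unit vectors w;
# it is the top eigenvector of the sample covariance matrix.
cov = x.T @ x / len(x)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
w1 = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue

# Sanity check: projecting onto w1 captures at least as much variance
# as projecting onto an arbitrary unit direction.
w_rand = rng.normal(size=2)
w_rand /= np.linalg.norm(w_rand)
assert np.var(x @ w1) >= np.var(x @ w_rand)
```

The variance of the projection onto w1 equals the largest eigenvalue of the covariance matrix, which is why the eigendecomposition solves the maximization directly.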
Closely related (in many fields essentially the same technique) is the analysis of empirical orthogonal functions (EOF).
Another method of dimension reduction is a self-organizing map.
When PCA is used in pattern recognition, linear discriminant analysis is often a useful alternative, since it takes class separability into account, which PCA does not.
See also: