principal component analysis
PCA is a data analysis technique that:
- input
  - a set of data points with arbitrarily many features
- output
  - a series of principal components (unit vectors)
- meaning
  - the first few principal components account for most of the variation in the data [1]
- how
  - center the data points around the origin by subtracting the mean of each feature (e.g. [1 50] [2 10] [3 0] has mean [2 20], so add [-2 -20] to every point)
  - find a line through the origin with the maximum sum of squared projected distances \(\sum_{p \in P}{d_p^2}\), where \(d_p\) is the distance from the origin to the projection of point \(p\) onto the line; a unit vector along it is the first principal component
  - find another line that is perpendicular to the previous line and has the maximum sum of squared projected distances; the unit vector along it is the second p.c.
  - find a third line that is perpendicular to the previous two and has the maximum sum of squared projected distances; the unit vector along it is the third p.c.
  - and so on, until either each feature or each data point has its own dimension (\(\#p.c. \le \min(\#features, \#datapoints)\))
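The steps above can be sketched in a few lines of NumPy. This is only an illustration, not a reference implementation: instead of searching for the lines one at a time, it takes the singular value decomposition of the centered data, which yields the same directions (up to sign). The function name `principal_components` is made up for this example, and the data are the three points from the centering step.

```python
import numpy as np

def principal_components(points):
    """Return (components, sum_sq) for an (n_points, n_features) array.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(points, dtype=float)
    centered = X - X.mean(axis=0)  # move the mean of every feature to the origin

    # SVD of the centered data: the rows of vt are mutually perpendicular
    # unit vectors, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

    # The squared singular value equals the sum of squared projected
    # distances of the points along the corresponding direction.
    sum_sq = singular_values ** 2
    return vt, sum_sq

points = [[1, 50], [2, 10], [3, 0]]
components, sum_sq = principal_components(points)

print(components)  # one unit vector per row, first p.c. first
print(sum_sq)      # sum of squared projected distances per p.c.
```

With `full_matrices=False` the number of rows returned is min(#datapoints, #features), which matches the bound on the number of principal components.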
Footnotes:
1
This is most pronounced when the data are unevenly spread across directions. For example, if there are 4 features but the points all lie on a 2-D plane, the 3rd and 4th principal components account for no variation at all, while the first 2 account for 100%.
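The footnote's case can be checked numerically. The sketch below is an illustration under assumptions: the plane (`basis`), the number of points, and the random seed are all arbitrary choices made for this example. It builds 4-feature points that lie on a 2-D plane and computes the fraction of total variation carried by each principal component.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, arbitrary choice
plane_coords = rng.normal(size=(50, 2))  # 50 points with 2 hidden coordinates

# Hypothetical plane: two 4-D vectors that span it.
basis = np.array([[1.0, 0.0, 2.0, -1.0],
                  [0.0, 1.0, 1.0, 3.0]])
points = plane_coords @ basis            # shape (50, 4): 4 features, 2-D data

centered = points - points.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

# Fraction of the total sum of squared distances carried by each p.c.
explained = singular_values ** 2 / np.sum(singular_values ** 2)
print(np.round(explained, 6))
```

The first two entries should add up to 1 and the last two should be zero up to floating-point rounding.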