Principal component analysis

PCA is a data analysis technique that:

input
takes a set of data points with arbitrarily many features
output
a series of principal components (unit vectors)
meaning
the first few principal components account for most of the variation in the data [1]
how
  1. center the data points around the origin by subtracting the mean of each feature (e.g. for [1 50] [2 10] [3 0] the mean is [2 20], so add [-2 -20] to every point)
  2. find a line through the origin that maximizes the sum of squared distances from the origin to the points' projections onto that line, \(\sum_{p \in P}{d_p^2}\)

    a unit vector along that line is the first principal component

  3. find another line that is perpendicular to the previous line and has the maximum sum of squared projected distances; the unit vector along it is the second p.c.
  4. find a third line that is perpendicular to the previous two and has the maximum sum of squared projected distances; the unit vector along it is the third p.c.
  5. and so on, until either each feature or each data point has its own dimension (\(\#\text{p.c.} \le \min(\#\text{features}, \#\text{data points})\))
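
The steps above can be sketched in NumPy. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `principal_components` is made up for illustration. Instead of literally searching over lines, it relies on the fact that the unit vector maximizing the sum of squared projections is the top eigenvector of the scatter matrix (the centered data transposed times itself). The remaining eigenvectors are mutually perpendicular and give the later components.

```python
import numpy as np

def principal_components(points):
    """Return (components, sum_sq) for data of shape (n_points, n_features).

    components[i] is the i-th principal component (a unit vector);
    sum_sq[i] is the sum of squared projected distances along it.
    """
    X = np.asarray(points, dtype=float)
    # Step 1: center every feature on its mean.
    centered = X - X.mean(axis=0)
    # Steps 2-5: eigenvectors of the scatter matrix, largest eigenvalue first.
    scatter = centered.T @ centered
    eigvals, eigvecs = np.linalg.eigh(scatter)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    # Keep at most min(#features, #data points) components.
    k = min(X.shape)
    components = eigvecs[:, order].T[:k]
    # Clip tiny negative values caused by floating-point rounding.
    sum_sq = np.clip(eigvals[order][:k], 0.0, None)
    return components, sum_sq

# The three example points from step 1.
components, sum_sq = principal_components([[1, 50], [2, 10], [3, 0]])
print(components)
print(sum_sq / sum_sq.sum())  # share of the variation per component
```

The sign of each returned vector is arbitrary, since a unit vector and its negation lie along the same line.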

Footnotes:

1

If the data really is unevenly spread across dimensions, e.g. there are 4 features but the points actually all lie on a 2-d plane, then the 3rd and 4th principal components account for no variation at all, while the first 2 account for 100%.
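
The 4-feature case in this footnote can be checked numerically. The sketch below uses made-up data: the random coordinates, the two `basis` directions, and the seed are all assumptions chosen for illustration. It uses the singular values of the centered data, whose squares equal the sums of squared projected distances along each principal component.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 hypothetical points that live on a 2-d plane inside a 4-feature space,
# built as combinations of two fixed directions.
coords = rng.normal(size=(100, 2))
basis = np.array([[1.0, 0.0, 2.0, -1.0],
                  [0.0, 1.0, 1.0, 3.0]])
points = coords @ basis

centered = points - points.mean(axis=0)
# Singular values come back in descending order, one per principal component.
singular_values = np.linalg.svd(centered, compute_uv=False)
share = singular_values**2 / np.sum(singular_values**2)
print(np.round(share, 6))  # the last two shares are zero up to rounding
```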

Author: Linfeng He

Created: 2024-04-03 Wed 20:59