by Alessia Saggio

Hello everybody,

As promised in my previous post, I’m going to spend a few words on a statistical technique called Principal Component Analysis (PCA), which finds application in many fields such as Particle Physics, Neuroscience and Computer Vision.

Basically, Principal Component Analysis is a way to identify patterns in data and to capture similarities and differences between different classes of data. In other words, we can use PCA to see whether we can combine some variables to derive new components that provide a simpler description of our system.

One important aspect of PCA is that it allows us to reduce the number of dimensions in the data without losing much information: a quite relevant feature in high energy physics, I would say!

Suppose we have 10 observations and that for each of them we have measured 4 properties (or variables). Practically speaking, this means that, for instance, you consider 10 people and for each of them you know their age, the average number of hours per week they spend doing sports, their state of health and the average amount of healthy food they eat per week.

This data table can be seen as a 10 x 4 matrix, i.e. we associate a row with each person and a column with each variable.
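If you like to see things in code, a minimal sketch of how such a table could be stored as a 10 x 4 NumPy array might look like this (the numbers are completely made up, just to fix the shape of the problem):

```python
import numpy as np

# Hypothetical 10 x 4 data matrix: one row per person, one column per variable
# (age, hours of sport per week, a health score, healthy meals per week).
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.integers(20, 60, size=10),    # age in years
    rng.uniform(0, 10, size=10),      # hours of sport per week
    rng.uniform(0, 1, size=10),       # state of health, encoded as a score
    rng.integers(0, 14, size=10),     # healthy meals per week
])
print(data.shape)  # (10, 4): 10 observations, 4 variables
```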

We now wonder whether some similarities between these variables exist, and whether they’re related to each other… for this purpose we can use PCA!

The best-known relationship between variables is the linear one. Principal Component Analysis focuses on this type of relationship (more complex links also exist, such as quadratic, logarithmic or exponential ones, but they are not used in PCA).

For a better understanding of the procedure, let’s consider only two sets of data, x and y (that is, a certain number of observed values of one variable is collected in the set x and the same number of observed values of another variable in the set y), and let’s try to establish, by means of PCA, whether they’re somehow related to each other.

The scatter plot of these two observables is shown in fig. 1.

Fig. 1: Scatter plot of the two data sets x and y.
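Just to fix ideas, here’s a little sketch of how one could generate and plot two toy variables with a roughly linear relationship, using NumPy and matplotlib (the numbers are invented and are not the data behind fig. 1):

```python
import numpy as np
import matplotlib.pyplot as plt

# Two toy variables: y follows x up to some random noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=10)
y = 0.8 * x + rng.normal(0.0, 0.3, size=10)

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```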

The first step is to calculate the mean of each dataset, x and y, and to subtract it from every value of the corresponding set. In this way, you center the data around 0.
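Continuing with the toy x and y from the sketch above, the centering step is just two lines:

```python
# Subtract from each value the mean of its own dataset, so that both
# variables are centered around 0.
x_centered = x - x.mean()
y_centered = y - y.mean()
```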

Then, calculate the covariance matrix (a matrix that represents the variation of each variable with respect to the others, including itself) and its eigenvalues and eigenvectors. In this case, since we have the two variables x and y, the covariance matrix will be of dimension 2×2.
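With NumPy, the covariance matrix and its eigendecomposition can be obtained like this (continuing from the centered toy data above; np.linalg.eigh is used because the covariance matrix is symmetric):

```python
import numpy as np

# 2 x 2 covariance matrix of the two centered variables.
cov = np.cov(x_centered, y_centered)

# Eigenvalues (variances along the eigenvector directions) and
# eigenvectors (one per column), returned in ascending eigenvalue order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(cov)
print(eigenvalues)
print(eigenvectors)
```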

We now plot the original data centered at 0 and the two eigenvectors of the covariance matrix (fig. 2).

Fig. 2: Scatter plot of the data with the mean subtracted and the eigenvectors superimposed.
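A possible way to produce a plot like fig. 2 from the toy data above is to overlay the two eigenvector directions on the centered points, scaling each one by the square root of its eigenvalue so that its length reflects the spread along that direction:

```python
import numpy as np
import matplotlib.pyplot as plt

plt.scatter(x_centered, y_centered)
for val, vec in zip(eigenvalues, eigenvectors.T):
    scale = np.sqrt(val)
    # Draw a segment through the origin along the eigenvector direction.
    plt.plot([-scale * vec[0], scale * vec[0]],
             [-scale * vec[1], scale * vec[1]])
plt.axis("equal")
plt.show()
```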

What can we infer from this plot?

For sure, that the data show a very strong pattern. One eigenvector follows this pattern much better than the other one, as if it were drawing a line of best fit. It tells us how the two datasets x and y are related to each other along that line. The second eigenvector still carries some information about this correlation, but it is less important: it tells us that the points don’t follow the line drawn by the first eigenvector exactly, but show a small offset to the right and to the left of it.

As I pointed out before, one of the main properties of PCA is that it allows us to reduce the dimensionality of a system.

To see how to do that, we first need to sort the eigenvectors in descending order of their associated eigenvalues. In other words, the first principal component is the eigenvector corresponding to the largest eigenvalue, the second principal component is the eigenvector corresponding to the second largest eigenvalue (and so on, for systems with dimensionality greater than 2).

This ordering gives you the principal components in order of importance. In our case, we can reduce our system of dimensionality 2 to one of dimensionality 1, i.e. made up of the first component only. Note that in our case the first component is the eigenvector that approximates the data very well (and of course this is not a coincidence: the largest eigenvalue corresponds to the component with the largest variance, which means it accounts for as much of the variability in the data as possible).
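In code, continuing with the eigenvalues and eigenvectors computed above, the ordering could be done like this:

```python
import numpy as np

# Sort in descending order of the eigenvalues: the first column of
# `components` is then the first principal component.
order = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[order]
components = eigenvectors[:, order]
```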

To summarize a bit, we can therefore choose to represent our data in terms of the principal components instead of the original x and y. This means that if we decide to keep just the first principal component, the final plot will have only one dimension.
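Keeping only the first principal component then amounts to projecting each centered (x, y) pair onto that single direction; a sketch, continuing with the toy arrays defined above:

```python
import numpy as np

data_2d = np.column_stack([x_centered, y_centered])  # shape (n, 2)
first_pc = components[:, 0]                          # shape (2,)

# One number per observation: the coordinate along the first
# principal component. This is the one-dimensional representation.
scores = data_2d @ first_pc                          # shape (n,)
```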

Getting back to the original data, in fig. 3 you can see the plot obtained by keeping only the eigenvector with the largest eigenvalue. Of course, you won’t see exactly the same configuration as in the original data, and this is expected, since we didn’t use both eigenvectors. But we’re still retaining the main pattern!

Fig. 3: Scatter plot of the original data derived using only one eigenvector.
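To get a plot like fig. 3 from the toy example, one can map the one-dimensional scores back into the (x, y) plane using only the first eigenvector and then add the means back; the reconstructed points all lie on a single line:

```python
import numpy as np

# Each score multiplies the first eigenvector, giving a point on the
# line it defines; adding the means moves us back to the original axes.
reconstructed = np.outer(scores, first_pc)   # shape (n, 2)
x_approx = reconstructed[:, 0] + x.mean()
y_approx = reconstructed[:, 1] + y.mean()
```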

So what we’ve done, basically, is transform our data so that they are expressed in terms of the patterns between them, where the patterns are the lines that best describe the relationships in the data. In the simple case we’re treating, the variation along the first principal component has been kept, while the variation along the other one has been neglected.

Of course, this is a very simple case that will never occur in Particle Physics. In the “real world”, where one has to deal with many datasets and many variables, PCA can simplify things a lot…

Actually, an extension of PCA exists, called Kernel PCA. This is a nonlinear form of PCA (when the kernel is nonlinear; with a linear kernel it coincides with standard PCA), which better exploits the complicated spatial structure of high-dimensional features.
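As a pointer, here’s one possible way to try Kernel PCA in practice, using scikit-learn’s KernelPCA on some toy data (the choice of kernel and gamma here is purely illustrative):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Toy high-dimensional data, just to show the call.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))

# RBF (Gaussian) kernel; with kernel="linear" this reduces to ordinary PCA.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
X_reduced = kpca.fit_transform(X)   # shape (100, 2)
```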

I apologize if this post was very long. I hope everything’s clear and, above all, that you didn’t get too bored!

À plus!

All images are taken from Lindsay I. Smith, “A tutorial on Principal Components Analysis”, http://www.cs.otago.ac.nz/.