Hello everybody,

As promised in my previous post, I’m gonna spend a few words on a statistical technique called **Principal Component Analysis** (PCA), which finds application in many fields such as Particle Physics, Neuroscience, and Computer Vision.

Basically, the Principal Component Analysis is a way to identify patterns in data and to catch similarities and differences between different classes of data. In other words, we can use PCA to see if we can combine some variables to derive new components which could provide a simpler description of our system.

One important aspect of PCA is that it allows us to reduce the number of dimensions in the data without losing much information: a quite relevant feature in high energy physics, I would say!

Suppose we have 10 observations and that for each of them we have measured 4 properties (or variables). Practically speaking, this means that, for instance, you consider 10 people and for each of them you know the age, the average number of hours/week spent doing sports, the state of health and the average quantity of healthy food they eat/week.

This data table can be seen as a 10 x 4 matrix, i.e. we associate a row to each person and a column to each corresponding variable.

We now wonder if some similarities between these variables exist, and if they’re related to each other… for this purpose we can use the PCA!

The best-known relationship between variables is the linear one. The Principal Component Analysis focuses on this type of relationship (more complex links also exist, such as quadratic, logarithmic, exponential functions etc., but they are not used in PCA).

For a better understanding of the procedure, let’s consider only two sets of data, *x* and *y* (this means that a certain number of observed values of one variable make up the set *x* and the same number of observed values of another variable make up the set *y*) and let’s try to establish if they’re somehow related to each other by means of the PCA.

The scatter plot of these two observables is shown in fig. 1.

The first step is to calculate the mean for the datasets *x* and *y* and to subtract them from each value of the corresponding set. In this way, you center the data around 0.
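Here is a minimal sketch of this centering step with NumPy. The numbers are illustrative values I am assuming for the example; they are not claimed to be the data behind the figures.

```python
import numpy as np

# Illustrative values only (an assumption for this sketch).
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Subtract each set's own mean from every value of that set.
x_centered = x - x.mean()
y_centered = y - y.mean()

# After centering, both means are zero up to floating-point rounding.
print(x_centered.mean(), y_centered.mean())
```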

Then, calculate the covariance matrix (a matrix that represents the variation of each variable with respect to the others, including itself) and its eigenvalues and eigenvectors. In this case, since we have the two variables *x* and *y*, the covariance matrix will be of dimension 2×2.
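As a rough sketch, the covariance matrix and its eigen-decomposition can be computed with NumPy as follows. The data values are assumptions chosen for illustration, and the use of the sample covariance (dividing by n - 1, which is the default of `np.cov`) is also my choice here.

```python
import numpy as np

# Illustrative values only (an assumption for this sketch).
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Rows are observations, columns are the variables x and y.
data = np.column_stack([x, y])
centered = data - data.mean(axis=0)

# Sample covariance matrix (np.cov divides by n - 1 by default).
# rowvar=False tells NumPy that each column is a variable.
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices such as a covariance matrix.
# It returns the eigenvalues in ascending order and the matching
# unit-length eigenvectors as the columns of the second array.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

print(cov.shape)      # (2, 2)
print(eigenvalues)
print(eigenvectors)
```

The diagonal of `cov` holds the variances of *x* and *y*, while the off-diagonal entries hold their covariance.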

We now plot the original data centered at 0 and the two eigenvectors of the covariance matrix (fig. 2).

What can we infer from this plot?

For sure, the data show a very strong pattern. One eigenvector follows this pattern much better than the other, as if it were drawing a line of best fit: it tells us how the two datasets *x* and *y* are related to each other along that line. The second eigenvector still carries some information on this correlation, though less important: it tells us that the points don’t follow the line drawn by the first eigenvector exactly, but are scattered by a small offset on either side of it.

As I pointed out before, one of the main properties of the PCA is that it allows us to reduce the dimensionality of a system.

To see how to do that, we first need to sort the eigenvectors in descending order of the eigenvalues they’re associated with. In other words, the first principal component is the eigenvector with the largest eigenvalue, the second principal component is the eigenvector with the second largest eigenvalue (and so on, for systems with dimensionality greater than 2).
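The ordering can be sketched like this. To show the "and so on" case, I am assuming a made-up 3×3 covariance matrix here; it is purely hypothetical and does not come from the data discussed in this post.

```python
import numpy as np

# Hypothetical symmetric 3x3 covariance matrix, made up for this sketch.
cov = np.array([[2.0, 0.5, 0.0],
                [0.5, 1.0, 0.2],
                [0.0, 0.2, 3.0]])

# Eigenvalues come back in ascending order, eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Indices that sort the eigenvalues from largest to smallest.
order = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[order]

# Reorder the columns the same way so that each eigenvector
# stays paired with its own eigenvalue.
principal_components = eigenvectors[:, order]

first_pc = principal_components[:, 0]   # largest eigenvalue
second_pc = principal_components[:, 1]  # second largest eigenvalue
print(sorted_eigenvalues)
```

Note that the eigenvectors must be reordered by column, not by row, otherwise the pairing with the eigenvalues is lost.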

This ordering gives you the principal components in order of importance. In our case, we can reduce our system with dimensionality 2 to another system with dimensionality 1, i.e. made up of the first component only. Note that in our case, the first component is the eigenvector that approximates the data very well (and of course this is not a coincidence: the largest eigenvalue corresponds to the component with the highest variance, which means it accounts for as much of the variability in the data as possible).
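One common way to put a number on "importance" is the fraction of the total variance carried by each component, i.e. each eigenvalue divided by the sum of all eigenvalues. The sketch below assumes illustrative data values, not the ones behind the figures.

```python
import numpy as np

# Illustrative values only (an assumption for this sketch).
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

cov = np.cov(np.column_stack([x, y]), rowvar=False)

# eigvalsh returns only the eigenvalues, in ascending order;
# reverse them so the largest comes first.
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Fraction of the total variance carried by each component.
explained = eigenvalues / eigenvalues.sum()
print(explained)
```

For these particular numbers the first component carries well over 90% of the total variance, which is why dropping the second one costs so little.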

We started from two variables, *x* and *y*. If we decide to keep just the first principal component, the final plot will have only one dimension.
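A minimal sketch of this reduction from two dimensions to one: each centered point is projected onto the first principal component, which leaves a single number per observation. The data values are again illustrative assumptions.

```python
import numpy as np

# Illustrative values only (an assumption for this sketch).
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

data = np.column_stack([x, y])
centered = data - data.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))

# eigh sorts in ascending order, so the last column belongs
# to the largest eigenvalue: the first principal component.
first_pc = eigenvectors[:, -1]

# One coordinate per observation: its position along first_pc.
scores = centered @ first_pc
print(scores.shape)  # (10,)
```

The sign of `first_pc` is arbitrary, so the scores may come out mirrored; their spread is the same either way.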

Getting back to the original data, in fig. 3 you can see the plot derived by taking only the eigenvector with the largest eigenvalue. Of course, you won’t see exactly the same configuration as the original data, and this is to be expected, since we didn’t use both eigenvectors. But we’re still maintaining the main pattern!
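Going back to the original coordinates from the single kept component can be sketched as follows: multiply each score by the first eigenvector and add the means back. The data values are illustrative assumptions, so the numbers will not match fig. 3.

```python
import numpy as np

# Illustrative values only (an assumption for this sketch).
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

data = np.column_stack([x, y])
mean = data.mean(axis=0)
centered = data - mean

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
first_pc = eigenvectors[:, -1]          # eigenvector of the largest eigenvalue

# Reduce to one dimension, then map back to the (x, y) plane.
scores = centered @ first_pc
reconstructed = np.outer(scores, first_pc) + mean

# What was lost is exactly the variation along the discarded eigenvector.
lost = ((data - reconstructed) ** 2).sum()
print(reconstructed.shape, lost)
```

All reconstructed points lie on one straight line through the mean, along the direction of the first eigenvector.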

So what we’ve done, basically, is to transform our data so that they are expressed in terms of the *patterns* between them, where the patterns are the lines that best describe the relationship between the data. So in the simple case we’re treating, the variation along the first principal component has been kept, while the variation along the other one has been neglected.

Of course, this is a very simple case that will never happen in Particle Physics. In the “real world”, where one has to deal with many datasets and many variables, PCA would simplify things a lot…

Actually, an extension of PCA exists, named Kernel PCA. This is the nonlinear form of PCA (if the kernel is not linear, otherwise it coincides with the PCA), which better exploits the complicated spatial structure of high-dimensional features.
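To give a flavour of the idea, here is a small sketch of Kernel PCA written with plain NumPy. It is not a reference implementation: the function names `kernel_pca`, `linear_kernel` and `rbf_kernel`, the choice of an RBF kernel and the value of `gamma` are all assumptions made for this example, and the data values are illustrative.

```python
import numpy as np

def linear_kernel(a, b):
    """Plain dot products: with this kernel we fall back to ordinary PCA."""
    return a @ b.T

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is a hypothetical tuning parameter."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def kernel_pca(data, kernel, n_components):
    """Return the coordinates of each row of `data` along the
    leading kernel principal components."""
    n = data.shape[0]
    K = kernel(data, data)                       # (n, n) kernel matrix

    # Center the kernel matrix, the analogue of subtracting the mean.
    ones = np.full((n, n), 1.0 / n)
    K_centered = K - ones @ K - K @ ones + ones @ K @ ones

    # Eigen-decomposition, then keep the largest eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues = np.clip(eigenvalues[order], 0.0, None)  # guard against tiny negatives
    eigenvectors = eigenvectors[:, order]

    # Coordinates of the input points along each component.
    return eigenvectors * np.sqrt(eigenvalues)

# Illustrative values only (an assumption for this sketch).
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack([x, y])

linear_scores = kernel_pca(data, linear_kernel, n_components=1)
rbf_scores = kernel_pca(data, rbf_kernel, n_components=1)
print(linear_scores.shape, rbf_scores.shape)
```

With `linear_kernel` the coordinates agree with those of ordinary PCA (up to an overall sign), while a nonlinear kernel such as `rbf_kernel` gives a different, nonlinear description of the same points.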

I apologize if this post was very long. I hope everything’s clear and, above all, that you didn’t get too bored!

À plus!

*All images are taken from http://www.cs.otago.ac.nz/: Lindsay I Smith,* *A tutorial on Principal Components Analysis*

24 June 2016 at 16:38

Hi Alessia,

A clear and very nice post, thank you! 🙂 Just to share a few further bits and bobs along the same lines as your post, Jeremy Kun also has a nice (but less general-audience-friendly) post:

https://jeremykun.com/2016/05/16/singular-value-decomposition-part-2-theorem-proof-algorithm/

And Alex Williams discusses some generalisations of PCA here:

http://alexhwilliams.info/itsneuronalblog/2016/03/27/pca/

Personally, one bit I found intriguing about PCA (and which seems to be relatively unknown, amongst the non-data-analyst crowd at least), and which your readers might be interested in, is that beneath its appearance of pure linear algebra it hides the assumption that the noise in the data is Gaussian! Extensions to noise drawn from other distributions have been worked on, though:

http://papers.nips.cc/paper/2078-a-generalization-of-principal-components-analysis-to-the-exponential-family.pdf

Lastly, on the topic of non-linear dimensionality reduction techniques you touch upon, I would be remiss if I didn’t somehow bring the all-fashionable topic of neural networks into it, as arbitrary non-linear function approximation is their forte: Some nice work has been done on using autoencoders precisely for this. This is something I have experimented a bit with, and expect to see some more applications of it crop up in the future!

https://www.cs.toronto.edu/~hinton/science.pdf

Keep up the nice posts! 🙂

Kind regards,

Ilan

25 June 2016 at 16:16

Hi Ilan,

I’m very happy you enjoyed the post. It’s always nice to share our knowledge and to comment together on such intriguing topics.

Thank you for this feedback, I found the papers very interesting and I also learned a lot! I found the post by Alex Williams very well written and detailed (the procedure to determine how many principal components to keep is simply awesome 🙂 ).

The fact that the data points can be seen actually as “noise-corrupted versions of some true points” is new to me and sounds like a valid, different interpretation of the “philosophy” hidden behind this technique. This whole article is a source of knowledge about the extensions of the PCA and I’m sure I’ll refer to it in the future…

Finally, thank you for mentioning the neural networks. From what I’ve learned in the last weeks (I’m still pretty new to these topics), the machine learning world is so big and complex that we would need millions of posts on this blog to cover it entirely. The article you suggested is really cool: we all expect these techniques to spread widely among scientists in the near future. Perhaps I will write a post about that…

Thank you again for commenting and good work with your neural networks 😉

Best,

Alessia

26 June 2016 at 16:53

Hi Alessia,

Glad to hear you enjoyed those posts/papers as much as I did!

As you said, there is still so much to read, learn and write about any of these topics that it should keep us busy for some time and, one can hope, lead to some nice applications/papers as well!

I look forward to reading more posts and all the best in the meantime 🙂

Kind regards,

Ilan

9 April 2017 at 9:13

Very nice post. I just stumbled upon your blog and wanted to say that I’ve truly enjoyed surfing around your blog posts. In any case I’ll be subscribing to your feed and I hope you write again soon! https://www.evangelistbisola.com/groups/three-incredible-monster-truck-games-examples/

10 April 2017 at 10:41

Dear Muriel,

Thank you so much, I’m glad to read this! I will write again soon, I hope you will enjoy the future posts as much as you did with the old ones.

Have a nice day!

Alessia
