The article describing the clustering technique that my group designed to study physics models living in multi-dimensional spaces was finally published today in JHEP, a high-impact-factor journal. You can find it here (it’s Open Access!)

In statistical analysis, clustering is the name given to a class of problems where the elements of a data set have to be grouped according to some criterion, typically a user-defined similarity measure. Clustering methods identify groups (clusters) of elements such that, inside each cluster, the elements are “more similar” to one another than the elements of the original set are as a whole.

If you think it over for a minute, you’re likely to find many day-to-day problems where cluster analysis can be meaningfully applied. For instance, take the dataset of galaxies of the Galaxy Zoo: like a novel Hubble, you can define variables that describe the morphological features of each galaxy from its picture in the database, and then run a clustering algorithm to find how the galaxies group. You will then rediscover the existence of elliptical galaxies, spiral galaxies, barred spirals, et cetera. Fascinating, isn’t it?

Or, for a more down-to-Earth example, imagine you are trying to characterize a set of buyers whose past shopping habits you know (from their credit card usage, say). A cluster analysis of those data can help you predict their next purchase, creating opportunities for targeted advertising. Unfortunately, that’s what people do out there, much to my frustration (I do not like to be treated as a predictable sheep, but alas, we all are to some extent).

Our use of cluster analysis for the paper cited above was a bit more noble, I’d like to say. The issue we were facing is that Higgs boson pairs may be produced in LHC proton-proton collisions through a variety of mechanisms, if one allows for the possibility of physics beyond the Standard Model. In the most general case, one has five unknown parameters that determine the detailed physics of di-Higgs production. The number of possibilities is humongous!

In order to consider only a limited number of possibilities for the kind of events we have to search for, we did a cluster analysis on the features displayed by the final state of the production mechanism, where you have two Higgs bosons flying out of the interaction point, before any decay or radiation mechanism has taken place. This allowed us to define 12 “benchmark” points in the complex five-dimensional space. Studying those 12 benchmarks will allow the ATLAS and CMS analyses to have the maximal impact in terms of reach for new physics models.

I guess I have to stop this explanation here as it is getting too technical. Maybe what I can do is just show you how we defined our clustering procedure. We have a set of physics models, which can be pictured as points on a plane. A suitable “test statistic” measures how similar any two of those points are (it is labeled “TS” in the graph below). The clustering iteratively merges points into clusters, or smaller clusters into larger ones, choosing at each step the merger for which ALL elements of the prospective merged cluster are more similar to one another than in any of the other possible mergers.


On the left you see an intermediate situation in the clustering procedure. Seven elements in the plane have been grouped into three clusters. Now the calculation of the test statistic allows one to decide that the most favourable merger is the one between cluster 1 and cluster 3, because the elements of the resulting cluster are more similar to one another than the elements of either of the other two possible merged clusters would be.
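
For the programmers among you, here is a minimal Python sketch of that merging rule. It is not the code from the paper: the `similarity` function is a toy stand-in for the test statistic (minus the distance between two points on the plane, so that larger means more similar), and all function names are made up for illustration. I am also assuming that “ALL elements similar to one another” can be scored by the least similar pair inside the prospective merged cluster.

```python
from itertools import combinations


def similarity(a, b):
    """Toy stand-in for the test statistic (assumption): larger = more similar."""
    return -((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def worst_pair_similarity(cluster):
    """Similarity of the least similar pair of elements in a cluster."""
    return min(similarity(a, b) for a, b in combinations(cluster, 2))


def merge_once(clusters):
    """Merge the two clusters whose union has the best worst-pair similarity."""
    i, j = max(
        combinations(range(len(clusters)), 2),
        key=lambda ij: worst_pair_similarity(clusters[ij[0]] + clusters[ij[1]]),
    )
    merged = clusters[i] + clusters[j]
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [merged]


def cluster_points(points, n_clusters):
    """Start from one cluster per point and merge until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        clusters = merge_once(clusters)
    return clusters


# Seven points on the plane, grouped into three clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
for c in cluster_points(points, 3):
    print(sorted(c))
```

Scoring a merger by its worst pair is what makes the rule demand that every element of the new cluster be similar to every other one, rather than just the two closest ones.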

Ah, and another detail: the hatched elements indicate the “benchmarks” in each cluster: these are defined as the elements “most similar” to all the others within each cluster. The procedure we have designed determines them uniquely.
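
Picking a benchmark can be sketched in a few lines of Python too. Again, this is only an illustration under assumptions of mine: `similarity` is a toy stand-in for the test statistic, the function names are hypothetical, and I read “most similar to all others” as “the element whose least similar cluster-mate is still as similar as possible”. Other readings, such as the highest average similarity, would be just as easy to code.

```python
def similarity(a, b):
    """Toy stand-in for the test statistic (assumption): larger = more similar."""
    return -((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def benchmark(cluster):
    """Return the element whose worst match within the cluster is the best."""
    if len(cluster) == 1:
        return cluster[0]

    def worst_match(i):
        # Similarity between element i and its least similar cluster-mate.
        return min(
            similarity(cluster[i], cluster[j])
            for j in range(len(cluster))
            if j != i
        )

    best_index = max(range(len(cluster)), key=worst_match)
    return cluster[best_index]


# The middle point of three aligned points is the natural benchmark.
print(benchmark([(0, 0), (1, 0), (2, 0)]))
```

In this toy version ties are simply broken in favour of the first element found, which is a shortcut of the sketch and not a statement about how the real procedure resolves them.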

I think the above study is quite cool and I had a lot of fun programming the algorithms. I wish I could face clustering problems more often in HEP, but it’s not such a frequent situation. In fact, besides the clustering of hits from subdetector components (to extract the trajectory of a particle, say) or of energy deposits (to obtain hadronic jets), I had never needed to mess with clustering techniques in the past.