While working with LHC data we encounter the problem of a high-dimensional observed space, as we deal with about 40 variables. Of course, some of them are correlated, dependent, or even derived from others. On the one hand, such an abundance of data gives us a lot of information, but it also has some disadvantages, which I want to discuss in this article.
Many approaches to dimension reduction have been proposed that transform the data into a space of smaller dimension. One motivation is to condense the data so that classifiers become more stable. It is also believed that reducing the dimension removes noise from the data.
In the following I will focus on one reason for dimension reduction: observation sparsity in multidimensional data.
Suppose we have 101 observations uniformly distributed in the d-dimensional cube [0,1]^d, and let's pick the corner (0, …, 0). The question is: what is the probability that the median Euclidean distance of the observations from the picked corner is greater than 1? Let's first illustrate the problem in 2 dimensions using 11 observations.
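The 2-dimensional setup can be reproduced with a short simulation; here is a minimal standard-library sketch (the seed is an arbitrary choice for reproducibility):

```python
import math
import random

random.seed(0)  # arbitrary seed, just for reproducibility

# Draw 11 observations uniformly from the unit square [0,1]^2.
points = [(random.random(), random.random()) for _ in range(11)]

# Euclidean distance of each observation from the corner (0, 0).
distances = sorted(math.hypot(x, y) for x, y in points)

# The median of 11 values is the 6th smallest (index 5).
median_distance = distances[5]
print(median_distance)
```

With 11 points in the unit square, the printed median is almost always below 1, matching the figure.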
In the figure above we see only one point (marked in red) with a distance greater than 1 from the chosen corner, so the median distance is surely smaller than 1. In fact, we can calculate the probability that the median distance is smaller than 1, i.e. that at least half of the observations lie inside the blue circle.
For a single observation, the probability of being “close” to the chosen corner is the ratio of the volume of the d-dimensional hypersphere with radius R = 1 to that of the d-dimensional box with edge 2 (= 2R). In 2 dimensions this is just π over 4 (≈78.5%). Using the rules of order statistics, for 11 observations in 2 dimensions the median distance to the corner exceeds 1 with a probability of only about 1.7%, and for 101 observations this probability drops to the order of 10^-10, i.e. essentially zero.
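The order-statistics argument reduces to a binomial tail: for an odd number n of observations the median is the (n//2 + 1)-th order statistic, so the median distance exceeds 1 exactly when at most n//2 observations are “close”. A standard-library sketch of the exact computation, with p = π/4 for the 2-dimensional case:

```python
import math

# Probability that a uniform point in [0,1]^2 lies within distance 1 of a corner.
P_CLOSE_2D = math.pi / 4

def prob_median_exceeds_one(n, p=P_CLOSE_2D):
    """P(median distance > 1) for n points: at most n//2 of them are 'close'."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1))

print(prob_median_exceeds_one(11))   # ≈ 0.0165
print(prob_median_exceeds_one(101))  # of order 1e-10
```

The same function works in any dimension by passing the corresponding per-point probability p.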
As expected, this probability is very small, especially as the number of observations increases. To our surprise, however, in high dimensions we lose this intuition. For d=10 the probability of a single observation being close to the chosen corner is only 0.25%, and for d=20 this probability is 2.5•10^-8! How come?
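These per-point probabilities follow directly from the volume of the d-ball, V_d(R) = π^(d/2) R^d / Γ(d/2 + 1), divided by the volume (2R)^d of the box; a quick check:

```python
import math

def close_fraction(d, r=1.0):
    """Volume of the d-ball of radius r over the volume of the d-cube with
    edge 2r -- the probability that a uniform point in [0,1]^d lies within
    distance r = 1 of a chosen corner."""
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d
    return ball / (2 * r) ** d

for d in (2, 3, 10, 20):
    print(d, close_fraction(d))
# d=2  -> ~0.785 (π/4)
# d=10 -> ~0.0025
# d=20 -> ~2.5e-8
```

The fraction shrinks super-exponentially: almost all of the cube's volume sits outside the inscribed ball in high dimensions.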
Take a point in 10-dimensional space with all coordinates equal to 1/3. Its distance from the corner is √(10/9) ≈ 1.05, already greater than 1. Simply speaking, each additional dimension “consumes” some of the distance budget, leaving less and less for the remaining dimensions.
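The distance of this all-1/3 point from the corner grows as √d/3, so it crosses 1 at exactly d = 10; a tiny sketch:

```python
import math

def corner_distance(d, coord=1/3):
    """Distance from the corner (0, ..., 0) to the d-dimensional point
    (coord, ..., coord); equals sqrt(d)/3 for coord = 1/3."""
    return math.sqrt(d * coord ** 2)

print(corner_distance(2))   # ~0.471 -- well inside the unit circle
print(corner_distance(10))  # ~1.054 -- already outside the unit ball
```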
For me this is enough evidence that high-dimensional data is simply sparse, and distances between observations become large. That is also why classifiers can become unstable: there are many degrees of freedom for hyperplanes dividing the data.
Therefore, when applying sophisticated algorithms such as Boosted Decision Trees or Neural Networks to high-dimensional data, it is really worth the effort to reduce the dimension of the space; otherwise the results can be meaningless. If I wanted to infer information from the observations lying no further than 1 from the chosen corner, in a high-dimensional space I would have almost no data to rely on.
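The practical consequence can be checked directly: even with a large sample, almost nothing lands within distance 1 of the corner once d grows. A Monte Carlo sketch using only the standard library (sample size and seed are arbitrary choices):

```python
import math
import random

random.seed(42)  # arbitrary seed, for reproducibility
N = 100_000      # sample size (arbitrary)

counts = {}
for d in (2, 10, 20):
    # Count uniform points in [0,1]^d within distance 1 of the corner (0,...,0).
    counts[d] = sum(
        1
        for _ in range(N)
        if math.sqrt(sum(random.random() ** 2 for _ in range(d))) <= 1.0
    )
    print(f"d={d:2d}: {counts[d]} of {N} points within distance 1 of the corner")
```

For d=2 roughly 78% of the sample qualifies, for d=10 only a few hundred points remain, and for d=20 typically none at all: any analysis restricted to that neighbourhood is starved of data.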