by Greg Kotkowski

Science never ceases to astonish me. Lately, I've found out that two seemingly different approaches led two independent groups of scientists to almost the same result – a so-called multiple discovery. Certainly I was aware of many historical multiple discoveries, like the invention of calculus by Newton and Leibniz or the development of the Higgs boson idea by at least three groups of physicists working simultaneously. However, I thought that nowadays, in the time of easy and quick access to information, multiple discoveries should no longer occur.

I was given the following problem: let X_1, ..., X_m and Y_1, ..., Y_n be two samples of independent random d-vectors from the unknown distributions F and G, respectively. Generally speaking, the goal is to test whether the observations X_i and Y_j are realizations of one common, unknown distribution. This kind of test is called a two-sample test.

In statistical terms we consider the null hypothesis

\displaystyle H_0:\quad \forall x\in\textbf{R}^d\quad F(x)=G(x),

against the alternative

\displaystyle H_1:\quad \exists x\in\textbf{R}^d\quad F(x)\neq G(x) .

In the univariate case the best-known two-sample tests are the Kolmogorov-Smirnov and \chi^2 tests. However, it is not trivial to generalize these tests to the multivariate setting, due, for example, to data sparsity and the curse of dimensionality (I mentioned this problem in this article). For this reason there are not many multivariate two-sample tests available on the market.
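For a univariate baseline, base R already provides the Kolmogorov-Smirnov test out of the box; a minimal example on made-up normal samples:

```r
# Two-sample Kolmogorov-Smirnov test from base R (univariate only)
set.seed(1)
x <- rnorm(100)             # sample from N(0, 1)
y <- rnorm(100, mean = 0.3) # sample with a shifted mean
ks.test(x, y)               # compares the two empirical CDFs
```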

In the following text I want to describe two approaches to constructing such a test. The first approach is the work of L. Baringhaus and C. Franz (2004). At the beginning of their paper the authors mention an interesting mathematical relationship proven by Morgenstern (2001):

  • For equal numbers of black and white points in Euclidean space, the sum of the pairwise distances between points of equal color is less than or equal to the sum of the pairwise distances between points of different color.
  • Equality holds only in the case that the black and white points coincide.

Based on this lemma, and using probability theory and functional analysis, they construct a statistic that enables a two-sample test. It has the following form:

\displaystyle T_{nm}=\frac{mn}{m+n} \left( \frac1{mn} \sum_{i,j} \phi(||X_i-Y_j||^2) - \frac1{2m^2} \sum_{i,j} \phi(||X_i-X_j||^2) - \frac1{2n^2} \sum_{i,j} \phi(||Y_i-Y_j||^2) \right),

where \phi(\cdot) is a kernel function defined on the positive real line with value 0 at 0 and a non-constant monotone first derivative.

In general, the more the distributions F and G differ, the larger the statistic T_{nm} is expected to be. Unfortunately, the distribution of T_{nm} is not known a priori and depends on both F and G, hence the bootstrap is applied to obtain the critical values of the test.
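To make the recipe concrete, here is a minimal R sketch of the statistic and its bootstrap p-value, assuming the admissible kernel \phi(z)=\sqrt{z} and an ordinary bootstrap from the pooled sample; the function names below are my own, not from any package:

```r
# Baringhaus-Franz-type statistic with phi(z) = sqrt(z)
T_stat <- function(X, Y, phi = sqrt) {
  m <- nrow(X); n <- nrow(Y)
  # all pairwise squared Euclidean distances between the rows of A and B
  sqdist <- function(A, B)
    pmax(outer(rowSums(A^2), rowSums(B^2), "+") - 2 * tcrossprod(A, B), 0)
  between  <- mean(phi(sqdist(X, Y)))  # (1/mn)  sum_{i,j} phi(||X_i - Y_j||^2)
  within_X <- mean(phi(sqdist(X, X)))  # (1/m^2) sum_{i,j} phi(||X_i - X_j||^2)
  within_Y <- mean(phi(sqdist(Y, Y)))
  m * n / (m + n) * (between - within_X / 2 - within_Y / 2)
}

# Bootstrap p-value: resample both samples from the pooled data, where
# H0 holds by construction, and compare the observed T with its copies.
T_pvalue <- function(X, Y, B = 1000) {
  m <- nrow(X); Z <- rbind(X, Y); N <- nrow(Z)
  T_obs  <- T_stat(X, Y)
  T_boot <- replicate(B, {
    idx <- sample.int(N, N, replace = TRUE)
    T_stat(Z[idx[1:m], , drop = FALSE], Z[idx[-(1:m)], , drop = FALSE])
  })
  mean(T_boot >= T_obs)
}
```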

The second test I want to describe was developed by Aslan and Zech (2003) using an analogy to electrostatics. They write:

the electrostatic energy of a superposition of a positive and a negative charge distribution is a minimum if the two distributions coincide.

Therefore, each observation X_i is assigned the weight \frac1m and, analogously, each observation Y_j the weight -\frac1n . The observations are treated as electric charges in Euclidean space, and the potential energy of this charge configuration is calculated. This leads to the following test statistic:

\displaystyle \Psi_{nm}= \frac1{2m^2} \sum_{1\le i<j\le m} R(||X_i-X_j||^2) + \frac1{2n^2} \sum_{1\le i<j\le n} R(||Y_i-Y_j||^2) -\frac1{mn} \sum_{i,j} R(||X_i-Y_j||^2),

where the distance function R(\cdot) is positive definite. The critical values of \Psi_{nm} are likewise obtained via bootstrapping.
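For completeness, a sketch of \Psi_{nm} in the same spirit, assuming R(r) = -\ln(r) (a logarithmic potential of the kind Aslan and Zech discuss); again the names are mine, and a small floor on the distances guards against taking the logarithm of zero:

```r
# Aslan-Zech-type energy statistic with R(r) = -log(r)
psi_stat <- function(X, Y, R = function(r) -log(pmax(r, .Machine$double.eps))) {
  m <- nrow(X); n <- nrow(Y)
  sqdist <- function(A, B)
    pmax(outer(rowSums(A^2), rowSums(B^2), "+") - 2 * tcrossprod(A, B), 0)
  pair_sum <- function(D) sum(R(D[upper.tri(D)]))  # sum over pairs i < j
  pair_sum(sqdist(X, X)) / (2 * m^2) +             # within-sample "self-energy"
    pair_sum(sqdist(Y, Y)) / (2 * n^2) -
    sum(R(sqdist(X, Y))) / (m * n)                 # opposite-charge interaction
}
```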

What is interesting for me is the fact that, up to a multiplicative constant and the choice of kernel/distance function, the two test statistics are exactly the same. The bibliographies of the two papers have only one reference in common – the famous An Introduction to the Bootstrap by B. Efron and R. Tibshirani. Starting from totally different intuitions, the authors arrived at the same test statistic. (I don't even know whether they are aware of this multiple discovery.)

Baringhaus and Franz put a lot of effort into studying the performance of the test. In particular, they studied the use of different kernel functions \phi(\cdot) . The test is available in R in the package "cramer".
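Using the package is straightforward; something along these lines should work (the exact defaults, such as the number of bootstrap replicates, may vary between package versions):

```r
# install.packages("cramer")
library(cramer)

set.seed(42)
X <- matrix(rnorm(100 * 3), ncol = 3)             # 100 points from N(0, I) in R^3
Y <- matrix(rnorm(80 * 3, mean = 0.5), ncol = 3)  # 80 points with a shifted mean
cramer.test(X, Y)  # bootstraps the critical value and reports the decision
```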

From my experience, the described two-sample test is very powerful and beats another well-known multivariate two-sample test based on kernel density estimation. The only pitfall of the described test is its computational complexity, which is due not only to the bootstrapping but also to the calculation of all pairwise distances between the points, which costs O((m+n)^2) operations.

Physicists and statisticians often have trouble cooperating, due, for example, to the different language and notation they use. However, thanks to the finding described in this blog post, I believe that there is a lot of room to work together and enrich both fields of science. For this reason I'm happy to be a statistician who works on physics applications!