For the last three weeks I’ve been experiencing the real adventure of being a researcher. Before that, I was fairly confident in my data analysis skills. I kept developing my knowledge in this area, but that meant extending methods I already knew rather than making unexpected discoveries.
But then, three weeks ago, my understanding of the field was shaken to its foundations.
During the last semester I attended an interesting course on Theory and Methods of Inference as part of my first-year PhD studies. At the end of the course, we had to prepare a project. Each of us was given a scientific paper. We had to write a review of this paper and 3-4 other papers dealing with a similar subject, in order to get a good overview of the particular field. I got the following publication: Kabaila, Welsh and Abeysekera (2016), “Model-Averaged Confidence Intervals”.
This homework seemed difficult to me at the beginning. The first reading was like hitting a wall with my head; I could hardly understand the main idea. However, by digging deeper into the subject and tracking down other papers from the references, I finally reached, and understood, the core of the issue. I then retraced my steps through the referenced articles back to the starting paper. The way back made (almost) everything clear.
Even once I understood it, the paper left me rather puzzled. The articles suggested that a large part of my data analyses might have been incorrect! Of course, this started a war inside me. I had to choose between staying in my old, comfortable, well-known zone of statistics (which could lead to incorrect inference) and crossing over to the other side and admitting that my former approach was incorrect. It was hard, but I had to choose the second option.
Let me end this overly long, philosophical introduction and focus on the details. The issue is the following:
The main goal of most data analysis problems is to find an appropriate model that explains the observed values in a given data set. The structure of this model allows researchers to understand the data. Choosing the model is a complex task, known as model selection.
The most common practice in applied statistics is to select a data-driven model based on preliminary hypothesis tests or by minimizing an information criterion (the Akaike Information Criterion (AIC) is the most commonly used). The AIC is a numerical value used to rank competing models in terms of the information lost in approximating the unknowable truth. This approach lets the user draw inferences from one specific model.
The AIC is calculated as
$$\mathrm{AIC} = 2k - 2\ln(L)$$
where L is the maximized value of the likelihood function for a model and k is the number of fitted parameters. The model with the lowest AIC value should be chosen as the best approximating model.
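To make the AIC comparison concrete, here is a minimal sketch in Python. It is my own toy example, not taken from any of the papers: it assumes Gaussian errors, simulates data from a straight line, and compares a constant-only model with a straight-line model. The helper names (gaussian_aic, residuals_mean, residuals_line) are made up for this illustration.

```python
import math
import random

def gaussian_aic(residuals, n_coefficients):
    """AIC = 2k - 2 ln(L) for a least-squares fit with Gaussian errors."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # ML estimate of the error variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)  # maximized log-likelihood
    k = n_coefficients + 1  # regression coefficients plus the error variance
    return 2 * k - 2 * log_lik

def residuals_mean(y):
    """Residuals of the constant-only model y = a."""
    mean_y = sum(y) / len(y)
    return [v - mean_y for v in y]

def residuals_line(x, y):
    """Residuals of the straight-line model y = a + b*x (closed-form least squares)."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((u - mean_x) ** 2 for u in x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [v - (intercept + slope * u) for u, v in zip(x, y)]

# Simulated data (an assumption of this sketch): y really depends on x.
random.seed(1)
x = [random.gauss(0, 1) for _ in range(100)]
y = [1.0 + 2.0 * u + random.gauss(0, 1) for u in x]

aic = {
    "constant": gaussian_aic(residuals_mean(y), n_coefficients=1),
    "line": gaussian_aic(residuals_line(x, y), n_coefficients=2),
}
best = min(aic, key=aic.get)  # lowest AIC wins
for name, value in aic.items():
    print(f"{name}: AIC = {value:.1f}")
print("selected model:", best)
```

Note that only differences in AIC between models fitted to the same data are meaningful; the absolute values carry no interpretation on their own.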
Other model selection methods, based on cross-validation or bootstrapping, exist, but they are computationally intensive. In all cases, however, a single model is selected. The selected model is then used for inference or for constructing confidence intervals, as if it had been given to us a priori as the true model.
Breiman called this the “quiet scandal of statistics”. If the selected model is used to construct a confidence interval for a given parameter, the inference can be incorrect. The minimum coverage probability of the interval obtained by this naive method can be far below the nominal coverage probability, as shown by Kabaila and Giri.
This naive method is taught in most data analysis courses. But a close look shows that it uses the data twice: once for model selection and again for building the confidence intervals. In addition, the variability of the estimated parameters should include some extra variance that accounts for model selection uncertainty, which the naive method ignores.
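The loss of coverage can be seen in a small simulation. The sketch below is my own toy setup, not the one analysed by Kabaila and Giri: two strongly correlated regressors, no intercept, an error standard deviation assumed known and equal to 1, and arbitrarily chosen true coefficients. In each replication the model is chosen by AIC between "x1 only" and "x1 + x2", and the usual 95% interval for the coefficient of x1 is then built from whichever model was selected.

```python
import math
import random

random.seed(42)
n, rho = 50, 0.9
beta1, beta2 = 1.0, 0.5  # true coefficients (assumed); error s.d. is known to be 1
z95 = 1.96               # normal quantile for a nominal 95% interval

# Fixed design: two strongly correlated regressors (no intercept, for brevity).
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [rho * a + math.sqrt(1 - rho**2) * random.gauss(0, 1) for a in x1]
s11 = sum(a * a for a in x1)
s22 = sum(b * b for b in x2)
s12 = sum(a * b for a, b in zip(x1, x2))
det = s11 * s22 - s12**2

def one_replication():
    """Return (naive interval covers beta1, full-model interval covers beta1)."""
    y = [beta1 * a + beta2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]
    s1y = sum(a * v for a, v in zip(x1, y))
    s2y = sum(b * v for b, v in zip(x2, y))

    # Full model y ~ x1 + x2: least squares via the 2x2 normal equations.
    b1_full = (s22 * s1y - s12 * s2y) / det
    b2_full = (s11 * s2y - s12 * s1y) / det
    se1_full = math.sqrt(s22 / det)
    se2_full = math.sqrt(s11 / det)

    # Reduced model y ~ x1.
    b1_red = s1y / s11
    se1_red = math.sqrt(1 / s11)

    # With known variance, AIC(full) < AIC(reduced) exactly when the squared
    # z-statistic of beta2 exceeds 2 (the penalty for one extra parameter).
    pick_full = (b2_full / se2_full) ** 2 > 2
    est, se = (b1_full, se1_full) if pick_full else (b1_red, se1_red)

    naive = abs(est - beta1) <= z95 * se
    full = abs(b1_full - beta1) <= z95 * se1_full
    return naive, full

reps = 4000
results = [one_replication() for _ in range(reps)]
naive_coverage = sum(r[0] for r in results) / reps
full_coverage = sum(r[1] for r in results) / reps
print(f"naive post-selection coverage: {naive_coverage:.3f}")
print(f"always-full-model coverage:    {full_coverage:.3f}")
```

In this setup the interval from the always-fitted full model keeps roughly its nominal coverage, while the naive post-selection interval falls clearly short of 95%, because the reduced model is often selected even though its estimate is biased by the omitted regressor.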
These arguments convinced me that the naive approach could lead to incorrect inference.
But if not the naive approach, then what?
There is quite a lot of literature on multi-model inference, which sidesteps the need to select a single model. I’m not going to introduce it in this brief article, but I think it is an important topic to learn. For further reading I recommend the article by Symonds and Moussalli, which is well written, free of difficult mathematical terminology, and understandable for non-statisticians too.
Feature image taken from http://www.planetminecraft.com/