One of the main principles of the scientific method is reproducibility, which can be defined as the ability to duplicate an entire experiment or study independently at a later time.
For those doing scientific data analyses, like the members of this network, the same principle applies: all the data, methods, and tools should be provided and documented in enough detail to allow other researchers to obtain exactly the same results from the same datasets, or to redo the analysis with new data. Do you think this is an unrealistic expectation, or the way to go?
Lately, I have been reading up on the EU’s new General Data Protection Regulation (GDPR). In essence, this regulation will require that, in less than two years, all organizations that use automated analyses to make decisions affecting users be able to fully explain to the citizens concerned the data and analysis process that led to each particular decision.
For example, suppose that a company has an automated system to approve insurance policies, and after providing your data you are denied a policy because its algorithm says so. According to the new regulation, the company should be able to provide information that fully explains the output of the automated system. This will definitely be tricky for analyses that use “black-boxish” machine learning algorithms (e.g. neural networks).
In summary, the regulation states that all analyses and models that affect individual users must be accountable for their output, or fines on the order of millions will follow.
Why did I just tell you about the GDPR? Because, in a way, reproducibility in scientific analyses can be thought of in a similar way: the authors must ensure accountability for their output and conclusions by allowing future replication of their results.
However, while companies will be motivated by the avoidance of exorbitant monetary sanctions, the practical rewards of reproducibility practices in current scientific research are not so clear. Full reproducibility is hard, especially without the right tools and procedures. Indeed, it might be so cumbersome as to be practically unachievable.
In addition, competition between research groups and funding or conference deadlines may further tempt researchers to neglect end-to-end reproducibility. According to this recent Nature News article, based on a survey of 1,500 scientific researchers, there are also a few more factors, as shown in the figure below.
In my opinion, none of these excuses is good enough to justify non-reproducible publicly funded research in a knowledge-based economy. Only if others can carry out exactly the same study you did (and we both know that the analysis description given in your paper will usually not suffice) can they verify the veracity of your results and claims.
In addition, full reproducibility implies sharing properly documented data, results, and methods, so that other people can reuse them in their own research. Reproducibility therefore leads to faster global scientific progress and lower research costs by promoting reusability.
In the Nature survey I mentioned before, one of the questions put to the researchers was “How much published work in your field is reproducible?”. I am really curious to see how the members of this network and the readers of this blog would answer this question. I have prepared a fully anonymous and short (it took me 1 minute to answer) survey for you:
Thanks for your feedback.
If I collect enough answers to make an analysis, I will publish a short post summarizing the survey results, and I will make the analysis itself reproducible, as an example. For the time being, and hoping it will not bias your responses, I can share my views on reproducibility in experimental High Energy Physics in general and in the experiment I work for in particular. I think reproducibility is almost never ensured, for the following reasons, not necessarily listed in order of importance:
- Data policies of LHC experiments are too restrictive.
- Management of large data quantities is costly.
- Complex collaborative analysis workflows are hard to preserve.
- Research tools are unnecessarily complicated and badly designed.
- Know-how on reproducibility practices is lacking.
- Reproducibility requirements are not imposed within the collaboration.
- There is pressure to have analyses done for conference deadlines.
These are all fixable, and some progress is currently under way. It is all about rewarding quality and reproducibility over throughput and quantity when reviewing research. I will stop my wordy reflection here, before this post becomes too long and boring for anyone to read, but I plan to complement it in the future with a few recommendations for ensuring reproducibility in your data analyses.
Feel free to contribute or criticize my views in the comment section below.