

Statistical sins and scientific salvation

How pre-registration protects research

26 June 2025

Maria Brackin


“There are three kinds of lies: lies, damned lies, and statistics.” – Mark Twain (crediting Benjamin Disraeli)

One of my favourite webcomics is titled “significant”. The comic tells the story of two scientists investigating the claim that jelly beans cause acne. After finding no link, they are asked to try again, in case only a certain colour is at fault. What follows is a series of 20 nearly identical panels with statistical tests, one for each colour. Only one test (green jelly beans) finds a significant result. The final panel is a newspaper headline screaming “green jelly beans linked to acne!”

This comic is a funny illustration of “p-hacking”, the practice of running many statistical tests until a “significant” relationship is uncovered. Repeated testing of even the most ridiculous claim may lead to apparent associations, cherry-picked and presented as strong scientific evidence.

Read on to learn about common statistical crimes and a powerful tool to combat them: pre-registration. We also discuss the challenges of working with existing data, and how good research practices have been applied to our analysis of the long-term impacts of business support programmes.


What is pre-registration?

Imagine a scientist is about to conduct an experiment. Pre-registration is the act of publicly declaring their research plan – including their hypotheses, experimental design, data collection, and planned analysis – before starting the experiment or looking at any results. This process is akin to writing down the rules of a game, including every detail someone else might need to know to play it.

A pre-registered experiment can be replicated by other researchers, who can run and analyse the same study with a different set of participants. After data is collected, scientists can also upload anonymised datasets and code, enabling anyone to reproduce their results by running the same analyses on the existing data.

Crime 1: p-hacking

What is it?

In conventional statistical analysis (null hypothesis testing), we compare an alternative hypothesis (“there is an association between Y and X”) to the null hypothesis (“there is no association between Y and X”), rejecting or failing to reject the null hypothesis based on this comparison. Because the data analysed is a sample, there is uncertainty in our conclusions. The p-value quantifies this uncertainty: it is the probability of observing an association at least as strong as the one found, assuming no true association exists. In the social sciences, a p-value of 5% is often used as the threshold for “statistical significance”. If we only reject the null hypothesis when the p-value falls below this threshold, we will incorrectly reject a true null hypothesis no more than 5% of the time.
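To make this concrete, here is a small illustrative example in Python (the data are simulated for this post, not taken from any real study): two groups with the same true mean are compared with a two-sample t-test from SciPy. Because the null hypothesis really is true here, any p-value below 0.05 would be a false positive.

```python
# Simulated example: a two-sample t-test where the null hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
jelly_bean_group = rng.normal(loc=50, scale=10, size=100)  # simulated acne scores
control_group = rng.normal(loc=50, scale=10, size=100)     # same true mean, so no real effect

t_stat, p_value = stats.ttest_ind(jelly_bean_group, control_group)
print(f"p-value: {p_value:.3f}")
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```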

p-hacking occurs when researchers run many statistical tests until one meets this threshold, defying the p-value’s intended purpose. In the webcomic, with a 5% significance threshold, we would expect roughly one of the 20 tests to come out significant purely by chance. The scientists in the comic find exactly one significant result out of twenty, suggesting it is most likely a false positive.
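The comic’s scenario is easy to reproduce. The sketch below (again with simulated data) runs twenty tests on data in which jelly beans have no effect at all; on average, about one test will come out “significant” at the 5% level.

```python
# Simulated version of the webcomic: 20 colours, no true effect anywhere.
# With a 5% threshold, we expect roughly one false positive among the 20 tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
false_positives = []
for colour in range(20):
    acne_eating = rng.normal(50, 10, 100)      # acne scores for people eating this colour
    acne_not_eating = rng.normal(50, 10, 100)  # acne scores for everyone else
    _, p = stats.ttest_ind(acne_eating, acne_not_eating)
    if p < 0.05:
        false_positives.append((colour, round(p, 4)))

print(f"{len(false_positives)} of 20 tests 'significant' at 5%:", false_positives)
```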

Most credible research now includes corrections for multiple comparisons, modifying the significance threshold to reflect the number of tests. Various methods are used, from a simple Bonferroni correction (dividing the significance threshold by the number of hypotheses tested) to more complex approaches.
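As a minimal sketch, the Bonferroni correction for the comic’s twenty tests looks like this (the p-values below are invented for illustration):

```python
# Bonferroni correction: compare each p-value to alpha / number_of_tests,
# rather than to alpha itself.
alpha = 0.05
n_tests = 20
corrected_threshold = alpha / n_tests  # 0.0025 for the comic's 20 colours

p_values = [0.62, 0.03, 0.41, 0.001, 0.20]  # invented p-values for illustration
for p in p_values:
    verdict = "significant" if p < corrected_threshold else "not significant"
    print(f"p = {p:.3f}: {verdict} at the corrected threshold of {corrected_threshold}")
```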

Even with such corrections, the core problem of p-hacking remains: we don’t know about all the tests researchers have run, only the results they present (which often emphasise the most dramatic findings).

How does pre-registration help?

Pre-registration reduces p-hacking concerns because it ensures researchers specify exactly which (and how many) tests they intend to run before data collection. If a study is pre-registered correctly, it is not possible to opportunistically test data and present only the most significant results. Although researchers can run as many statistical tests as they like, any tests not pre-registered should be presented as “exploratory” when documenting and discussing the results.

Pre-registration also makes it much easier for experts such as the Data Colada team to compare the results presented in a published paper with the intended analysis, allowing them to identify potential p-hacking.

Crime 2: HARKing

What is it?

HARKing is Hypothesising After Results are Known. It happens when a researcher tests for statistical relationships in their data before specifying what they believe these relationships will be (based on empirical or theoretical precedent). For example, having found the relationship between green jelly beans and acne, the scientists in the webcomic might hypothesise that green dye increases oil production in the skin, causing acne. The key feature of HARKing is that it happens after analysis. This misleads readers if the final report or paper is then written as if the hypothesis had been developed before the analysis was carried out.

How does pre-registration help?

Pre-registration requires provision of information about outcome measures and predictors – you must explain what factors you think will influence which outcomes, and in what way. This forces researchers to hypothesise before running statistical tests, and holds them accountable if their later explanations deviate from anticipated relationships.

Crime 3: Spinning yarns

What is it?

There isn’t a specific term for this last statistical crime, but many researchers will recognise it. “Spinning yarns” involves elaborate storytelling centred around the most prominent, dramatic, and unexpected results. For instance, a researcher might build their story around a single striking subgroup analysis, as with the green jelly beans in the webcomic.

Spinning yarns does not mean the statistical analysis was incorrect, nor does it necessarily detract from research quality (unless linked with p-hacking or HARKing). It is understandable that a surprising secondary finding would be highlighted, as this contributes to knowledge. However, a problem emerges when secondary findings are presented as though they are primary, with a corresponding change to the “story” of the research. Perverse incentives abound, encouraging researchers to spin these yarns: journal articles with significant results are published at a much higher rate than those with null results, a phenomenon known as “publication bias” or the “file drawer problem”. 

Two additional problems arise: First, when research is broadcast to the public, secondary or unexpected findings are (often) made the focal point of news coverage, leading to dramatic (and potentially misleading) headlines. Second, non-experts ranging from casual readers to aspiring researchers are led to believe that scientific endeavour should produce clear, dramatic findings – which is almost never true. Science is hard, and results should be presented as honestly as possible.

How does pre-registration help?

As with p-hacking, when research is pre-registered, other researchers (and the public) can check which hypotheses were listed as primary and secondary, cross-referencing with their presentation in the published paper and any news coverage. 

There are other, practical benefits to pre-registration. Planning intended data analysis carefully in advance enables a researcher to carry out the analysis rapidly once the outcome data becomes available, so that the key findings from the evaluation can be made available with little delay. This process also forces researchers to be much more specific about the types of data that will be required for their analysis, often improving the data collected and retained during implementation.

Limitations of pre-registration

Pre-registration is a powerful tool, but it is only as powerful as the information provided. With any statistical analysis, the devil is in the details – even small decisions can hugely influence results. Ideally, pre-registration makes these decisions clear, but in practice, more effort may be necessary within the research community to ensure that sufficient detail is always included, such that any trained researcher can understand the analysis.

Another challenge is that pre-registrations are primarily used for experiments not yet run, but researchers in many fields use existing datasets. Some platforms (like OSF) allow pre-registration of analyses that use data that has already been collected. Analysis of existing data makes p-hacking, HARKing, and spinning yarns much more tempting: there is no way to confirm whether statistical tests have already been run, how comparison groups are constructed, and dozens of other important details. Scientists are increasingly calling for consistent pre-registration of analyses of existing data to lessen these concerns.

Finally, pre-registration cannot protect against outright fraud. If the data supposedly collected for an experiment is in fact fabricated or falsified (in whole or in part), the pre-specified analysis can be run and presented as though credible. While pre-registration does not directly uncover fraud, it establishes a transparent record that makes fraud less likely and can be used to identify suspicious deviations, in conjunction with other mechanisms such as data audits or replications. 

Using pre-registration in the legacy evaluations project

Our evaluation covers the long-term impacts of three business support programmes, including two RCTs. Both RCTs have already been pre-registered on the AEA trials database: the Growth Impact Pilot and the Innovation Vouchers Programme. These pre-registrations are sparse by current standards and provide only limited detail – for instance, on how outcome measures were to be defined. However, they still provide good detail about the structure of the experiments themselves. By contrast, the third programme (GrowthAccelerator) was not an RCT, and next to nothing had previously been written about plans for the quasi-experimental analysis.

We are looking at long-term outcomes in the Longitudinal Business Database (LBD), an existing dataset. However, we had an unusual opportunity: due to the sensitive nature of the data, we had to use the Secure Research Service (SRS), which created a bottleneck to our data access. Therefore, we had not yet observed the outcomes when we pre-registered our intended analysis.

In line with good scientific practice, we included a very detailed analysis plan with each of our pre-registrations, specifying choices ranging from our definitions of outcome measures and exclusion criteria for outliers to our methodology for evaluating the quality of matches in propensity score matching (PSM).
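To give a flavour of what evaluating match quality can involve (our actual code and data are not shown here, and the numbers below are invented), a common diagnostic is the standardised mean difference of a covariate between treated firms and their matched controls:

```python
# Illustrative sketch: standardised mean difference (SMD) of one covariate,
# a common balance diagnostic for propensity score matching.
# Values below roughly 0.1 are often read as adequate balance.
import numpy as np

def standardised_mean_difference(treated, control):
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

rng = np.random.default_rng(1)
treated = rng.normal(6.0, 1.0, 500)            # e.g. log turnover of supported firms
all_controls = rng.normal(5.5, 1.0, 5000)      # unsupported firms before matching
matched_controls = rng.normal(5.95, 1.0, 500)  # controls retained after matching

print("SMD before matching:", round(standardised_mean_difference(treated, all_controls), 3))
print("SMD after matching: ", round(standardised_mean_difference(treated, matched_controls), 3))
```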

We believe that research such as ours, with both implications for government policy and the potential to be broadcast to a wide audience, creates a moral obligation to share our methods and results transparently and accurately.