Confirmation Bias

The Original Error

Lesson 1:

Our biased brains

1.1 Defining confirmation bias

Confirmation bias is often referred to as the “original error” in human cognition. In simple terms, it is the mind’s tendency to seek out information that supports the views we already hold by selectively filtering data and distorting analyses (Casad and Luebering, 2025), even when evidence to the contrary is available (Nickerson, 1998). This bias is not a minor quirk. Instead, it is a cognitive phenomenon that influences every phase of scientific work, from designing experiments and formulating hypotheses to interpreting data.

This is an innate, unconscious proclivity of the human mind that persists even when stakes are trivial and tasks are simple, and its impact is felt outside of science as well: from its role in popular performances like mind reading to the idea of a “lucky” sweater, confirmation bias pervades our day-to-day lives.

In this unit, the term “confirmation bias” also covers two related cognitive biases: expectation bias and observer bias. Expectation bias (also called experimenter's bias) is the tendency for researcher expectations to influence outcomes by altering the behavior of participants. Similarly, observer bias is the tendency for researcher expectations to influence outcomes by altering how researchers perceive and score what they observe.

Tip: Not all biases are the same.
It’s worth noting that although expectation and observer bias can be considered sub-types of confirmation bias because they involve expectation-driven distortions, the same is not true of all biases. Fundamental attribution error, conformity bias, and the availability heuristic are all examples of cognitive biases that are distinct from confirmation bias.

Let’s look at both of these related biases more closely!

"Confirmation Bias . . . is the mind’s tendency to seek out information that supports the views we already hold by selectively filtering data and distorting analyses."
Casad and Luebering, 2025

Expectations can produce changes. In the 1960s, Robert Rosenthal and Kermit Fode asked students to train 2 breeds of rats ("maze-bright" and "maze-dull") to solve mazes.

The rats were actually genetically identical. Student expectations, however, affected training and created performance differences by the end of the study. In other words, the expectations of the student experimenters produced an actual, measurable difference in performance between the two groups of rats.

This is an example of expectation bias affecting outcomes by altering subject performance. Note that the students had no real active interest in one group outperforming another and no personal stakes in a particular outcome – they had simply been primed to expect such an outcome.

Expectations can also affect data collection. In a similar study, Tuyttens et al. (2014) asked students to rate the sociability of two breeds of pigs ("normal" and "high social breeding value" [SBV+]) from video recordings. The videos were actually identical, yet students rated the SBV+ pigs as more social. 

Unlike the Rosenthal and Fode study, the students did not interact with the pigs, so differences in ratings could only result from different human perceptions of the same video data. In other words, the expectations of the data collectors caused them to perceive differences that were not actually there: a clear instance of observer bias, again in a situation where bias presented itself even though the students had no personal stake in the outcome.

Deep dive: How the brain filters belief-inconsistent evidence
It may be tempting to dismiss confirmation bias as the habit of an undisciplined or ideologically compromised mind, yet that is an oversimplification that not only blinds us to the endemic risk of confirmation bias in all human thought, but also grossly mis-characterizes the role this bias plays in human cognition.

Most modern models of decision-making posit that the brain operates on a case-building basis, whereby it collects many small, noisy bits of evidence which, in time, accumulate into confidence in a particular choice. In computational and cognitive circles, these are referred to as drift diffusion models, and they have proven both efficient and biologically plausible, with potential artifacts of their underlying computations having been identified in monkeys. With this perspective in mind, confirmation bias essentially presents a case in which prior information — that being both the aforementioned bits and the cumulative commitment into which these grow — modulates the way later evidence is integrated into the broader decision-making calculus.

The student experiments described earlier in this lesson already suggest that this occurs even in low-level perceptual tasks, but Talluri et al. (2018) investigated it formally. They designed separate perceptual and numerical tasks, each with discrimination and estimation judgments, and found that the choices subjects made during a task led later evidence consistent with those choices to be weighted more heavily. The effect strongly resembled selective attention: basically, the commitments we make tell our brain what kind of future evidence to amplify.
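The idea of a commitment amplifying later evidence can be sketched in a few lines of Python. This is a toy accumulator written for illustration, not the model used by Talluri et al.: the function names, the sample counts, and the `gain_consistent` parameter are all assumptions. Evidence samples are summed, an interim choice is read off halfway through, and later samples that agree with that choice are multiplied by `gain_consistent`.

```python
import random

def run_trial(rng, drift=0.0, noise=1.0, n_early=10, n_late=10, gain_consistent=1.0):
    """Accumulate noisy evidence; after an interim choice, reweight later samples."""
    early = sum(rng.gauss(drift, noise) for _ in range(n_early))
    interim_choice = 1 if early >= 0 else -1
    total = early
    for _ in range(n_late):
        sample = rng.gauss(drift, noise)
        # Samples whose sign agrees with the interim choice are amplified.
        weight = gain_consistent if sample * interim_choice > 0 else 1.0
        total += weight * sample
    final_choice = 1 if total >= 0 else -1
    return interim_choice, final_choice

def consistency_rate(gain_consistent, n_trials=5000, seed=0):
    """Fraction of trials in which the final choice repeats the interim choice."""
    rng = random.Random(seed)
    same = 0
    for _ in range(n_trials):
        interim, final = run_trial(rng, gain_consistent=gain_consistent)
        same += interim == final
    return same / n_trials

unbiased = consistency_rate(1.0)  # all samples weighted equally
biased = consistency_rate(2.0)    # choice-consistent samples count double
print(f"final matches interim: unbiased={unbiased:.2f}, biased={biased:.2f}")
```

Because `drift` is zero here, there is no true signal at all, so any extra agreement between the interim and final choice in the biased run comes purely from the reweighting.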

The key question, however, is when and how this happens: are prior and later bits of evidence encoded the same and merely used in different ways during the decision-making process, or are they encoded differently to begin? The distinction between these two options determines whether the information our brains downplay as a result of confirmation bias is lost from the get-go, or whether this is a distortion for which we could theoretically correct.

In order to answer this question, Park et al. (2025) had subjects undergo magnetoencephalography while they performed tasks similar to those used by Talluri et al., and then leveraged information theory to extract three key measures:
  • The amount of information the brain registered when perceiving stimuli.
  • The amount of information the brain used when evaluating those stimuli.
  • The overlap between those two, i.e. how much the former informs the latter.

Comparing these three measures between confirmation bias trials and control trials, Park et al. found that the brain encoded stimuli well in both cases, but contributed less of that encoded information to the decision-making process in confirmation bias trials. This suggests that the brain registers belief-inconsistent evidence just fine; it simply doesn’t use it. This is essential because it tells us that, although confirmation bias is an inherent quirk of the way we make decisions, the knowledge that challenges our beliefs does exist in our brains and merely needs to be leveraged correctly.

Fortunately for us, we can sometimes improve on this by judiciously interrogating our beliefs, explicitly enumerating facts, and practicing rigorous vigilance, particularly when we feel confident in our beliefs, since research has shown that confidence compounds the selective use of evidence we just described.

As the existence of the seven remaining lessons in this unit may suggest, however, these methods are not foolproof: they are impossible to practice at the perceptual level, and do not nullify bias even in the best of non-perceptual cases.

Activity: Deduce the Number Rule

Before delving further into definitions, try out a revamped version of a task first developed by Peter Cathcart Wason in 1960. Your goal is to try and deduce a secret rule that matches some sequences of 3 numbers.

But there is a catch—you can’t guess the rule directly!

Form a hypothesis by interpreting results from your guesses, then submit your final hypothesis to see if you were correct!

Post-activity questions:

  • How did that go? Did you falsify your hypotheses?
  • When testing sequences, did you match the hypothesis you had in mind or try to falsify that hypothesis? Why?
  • Were there moments when you realized you needed a new strategy?
  • How might collaboration with others help avoid some of the pitfalls demonstrated in the task?

1.2 Real-world studies of confirmation bias

As evidenced by this task, humans are inclined to confirm rather than falsify hypotheses. Many quickly latch onto an initial pattern (say, “increasing by 2”). Once an early hypothesis is formed, subsequent searches for evidence tend to focus on confirming that rule, with little effort to seek out counterexamples (Wason, 1960). This results in unconsciously filtering out disconfirming evidence.

It’s not difficult to imagine that if such bias emerges even under the simple constraints of the Wason task and the low stakes of the student experiments mentioned earlier, it may well present all the more prominently in higher stakes scientific settings.

Much like throws aimed at a bullseye in a game of darts, scientific investigation can be either precise or imprecise, accurate or inaccurate, in its pursuit of truth, with precision being a measure of consistency and accuracy being a measure of correctness. In that sense, bias is a failure of accuracy: it refers to a distortion of our thinking which skews how scientific hypotheses are conceived, tested, and evaluated (Kahneman & Tversky, 1996).
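The darts analogy can be made concrete with a small simulation. The sketch below is illustrative only: `TRUE_VALUE`, the bias and noise settings, and the scenario names are made-up assumptions. Each simulated measurement is the true value plus a systematic offset (bias) plus random noise.

```python
import random
import statistics

TRUE_VALUE = 10.0  # hypothetical quantity being measured

def measure(rng, n, bias, noise_sd):
    """Return n simulated measurements with a systematic offset and random noise."""
    return [TRUE_VALUE + bias + rng.gauss(0, noise_sd) for _ in range(n)]

rng = random.Random(1)
scenarios = {
    "accurate, precise": measure(rng, 1000, bias=0.0, noise_sd=0.1),
    "accurate, imprecise": measure(rng, 1000, bias=0.0, noise_sd=2.0),
    "inaccurate, precise": measure(rng, 1000, bias=1.5, noise_sd=0.1),
}

summary = {}
for name, values in scenarios.items():
    mean_error = statistics.mean(values) - TRUE_VALUE  # accuracy: distance from the truth
    spread = statistics.stdev(values)                  # precision: consistency of repeats
    summary[name] = (mean_error, spread)
    print(f"{name:20s} mean error={mean_error:+.2f}  spread={spread:.2f}")
```

In the "inaccurate, precise" scenario the measurements agree closely with one another and are still wrong, and collecting more of them would not move the average any closer to the truth.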

Tip: You could say...
In some ways, one might well argue that science itself is meant to be a means of combating the biases that govern non-scientific insight — after all, the systematic rigor of the scientific method helps mitigate many of the deficiencies of unscientific observation and reasoning. In that sense, confirmation bias is one such deficiency that has proven particularly pervasive and requires additional caution.

Crucially, this kind of error will mislead us even if we make the most precise measurements using the most state-of-the-art equipment (Born, 2024), because it shapes and presents in every stage of scientific investigation:

Early on, bias can predispose you to make certain observations and favor certain hypotheses. Later in the process, it can influence your data collection and your data analysis alike, motivating, consciously or unconsciously, questionable practices like p-hacking and circular analysis (Simmons et al., 2011; Kriegeskorte et al., 2009; Button, 2019). It thwarts the goal of scientific research to build objective experiments and achieve an impartial interpretation of experimental results.
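To see why a practice like p-hacking matters, consider one common form of it: measuring several outcomes and reporting whichever one reaches significance. The following is a rough simulation under simplifying assumptions (independent outcomes, normally distributed data with a known standard deviation of 1, and a z-test). The function names and study sizes are hypothetical. There is no true effect in any simulated study, so every "significant" result is a false positive.

```python
import random
from statistics import NormalDist, mean

def p_value_two_sample(a, b, sd=1.0):
    """Two-sided z-test p-value for a difference in means, assuming a known sd."""
    se = sd * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_outcomes, n_studies=2000, n_per_group=20, seed=0):
    """Fraction of null studies with p < .05 when the best of n_outcomes is reported."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: the null is true.
            group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(p_value_two_sample(group_a, group_b))
        hits += min(p_values) < 0.05
    return hits / n_studies

honest = false_positive_rate(n_outcomes=1)  # one pre-specified outcome
hacked = false_positive_rate(n_outcomes=5)  # report the best of five outcomes
print(f"one outcome: {honest:.3f}; best of five: {hacked:.3f}")
```

With a single pre-specified outcome the false positive rate should sit near the nominal 5%. Picking the best of five independent outcomes pushes it toward 1 - 0.95^5, or roughly 23%.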

Deep dive: The long lineage of bias in science
As Born notes in his paper, academics have been aware of the insidiousness of these biases for a long time. Quoting an excerpt from Francis Bacon’s 1620 Novum Organum:

 “The human understanding when it has once adopted an opinion (either as being the received opinion or as being agreeable to itself) draws all things else to support and agree with it. And though there be a greater number and weight of instances to be found on the other side, yet these it either neglects and despises, or else by some distinction sets aside and rejects; in order that by this great and pernicious predetermination the authority of its former conclusions may remain inviolate.”

Itself a foundational work in philosophy of science, Novum Organum outlines four “Idols of the Mind”, or cognitive errors to which humans have proven prone:

The Idols of the Tribe
The eponymous tribe Bacon references here is all of humankind, and thus the errors associated with their idols are those which he considers innate to human nature itself. This is so broad a brush that most biases could well be argued to fit the label, so to quote the man himself: “[…] it is a false assertion that the sense of man is the measure of things […] human understanding is like a false mirror, which […] distorts and discolors the nature of things by mingling its own nature with it.”

The Idols of the Cave
Perhaps the cleanest and most direct analogue of what we now call bias, the Idols of the Cave were Bacon’s conceptualization of errors tied to propensities and predispositions rooted in our individual backgrounds — in personal experiences, such as those of our childhood and our education.

The Idols of the Marketplace
These are the errors tied to language and the undisciplined use of poorly curated vernacular — in neuroscience, you might think of words like “intelligence”, or “consciousness” as examples of how vague, ill-, or under-defined terminology could risk false or presumptive agreement and disagreement alike. The latter in particular has been the subject of much debate.

The Idols of the Theater
Here, “the Theater” describes the broader scientific canon within which we operate: the theories, models, and established wisdoms which frame our perspective. From phlogiston to phrenology, the history of science is filled to the brim with big ideas that misled scientists, but the amyloid hypothesis or the downplaying of astrocytes are particularly recent examples of times when canonical belief may have biased us towards incorrect conclusions.

Though the Baconian Idols of the Mind represent a notable milestone in formalizing the threat certain innate tendencies of human cognition may present to scientific enterprise, such ideas are as old as science itself. Bacon himself quotes the ancient philosopher Heraclitus, whose lacking faith in the human ability to shed subjective perspective defined several of the surviving fragments of his intellectual work:

“Although the logos is common, the many live as if they had a private understanding.”
    (From Fragment DK B2, translated by John Burnet and kindly hosted by Randy Hoyt.)

Here, Heraclitus notes that, though truth is (in his view) universal and objective, most people are blinded by their own experience and the subjective perspective it shapes. His prescription differs significantly from Bacon’s — Heraclitus famously composed his teachings in the form of riddles so as to increase the odds of students arriving at true, objective insight — yet his is nonetheless a message echoed by Bacon and this unit alike.

Beyond the base merits of Bacon’s Novum Organum and the Heraclitean fragments, these works provide us with a consistent through-line of thinkers who identified the apparent fallibility of human thought and the ease with which it blinds and self-deludes. In other words, the content of this unit is part of an ancient human struggle, and the bias-fighting tactics awaiting in future lessons are stratagems millennia in the making.

Arbeck, CC BY 4.0, via Wikimedia Commons

This inclination toward confirmation bias can be managed, however. In the activity inspired by the Wason task, for instance, deliberately looking for counterexamples that might refute a guess is a conscientious counter to the implicit pull toward confirming beliefs. 

In other scientific research, confirmation bias can be mitigated by:

  • Building habits to recognize bias in thinking.
  • Placing checks to pause and reflect on choices.
  • Reducing error via principles of rigorous experimental design.
  • Making clear distinctions between exploratory and testing research.

Numerous studies have illustrated how confirmation bias can subtly yet powerfully affect behavior and decision-making. For example, a study by Dror and colleagues (2006) showed how contextual cues can undermine forensic expertise. In a within‐subject design, five fingerprint examiners (averaging 17 years’ experience) reviewed prints they had previously matched. 

When given misleading context that the FBI had misidentified the prints in a high-profile case, four changed their opinions: three ruled the prints non-matches and one deemed the evidence inconclusive, despite clear instructions to ignore the extra information. Only one examiner stuck with the original match. The study highlights the impact of cognitive bias on professional judgment and supports safeguards like blind examinations in forensic science.

In another published example of this tendency to discount information that undermines prior judgments, participants were placed in a betting scenario to monitor their own confidence in their wagers. The scenario showed that when participants had betting partners who agreed with their predictions, the participants greatly increased their bets. When their partners disagreed with their predictions, they only slightly decreased their wagers, but held to their initial predictions nonetheless. This imbalance—where confirmation has a stronger effect than refutation—demonstrates that preexisting beliefs can exert a powerful influence on behavior (Kappes et al., 2020).

Confirmation bias affects real-world decision-making, from the way we interpret experimental results to how we form our opinions on scientific theories. Even in high-stakes scenarios, such as clinical trials or policy-making, the tendency to overvalue confirmatory evidence can lead to inflated expectations and even the misinterpretation of data. For instance, early decisions in research might lead to a selective search in the literature or a tendency to report only positive findings. This, in turn, can skew the overall picture of a scientific field.

This is not just theoretical. In fact, later in this unit we will discuss reviews that have found that studies that fail to implement countermeasures to address confirmation bias consistently report larger effect sizes than studies that do. This is another example of the real-world impact unaddressed bias is having right now.

1.3 Why this matters to neuroscientists

For neuroscientists, the stakes of confirmation bias are particularly high. Neuroscience research often grapples with complex systems and subtle signals. When an initial hypothesis is formed, there is a risk of overlooking alternative mechanisms.

Consider a scenario where a hypothesis posits that a particular neurotransmitter is central to a specific behavior. If a researcher unconsciously prioritizes data that supports this hypothesis, it increases the likelihood of overlooking critical findings that suggest a role for other neurotransmitters or neural circuits. This oversight can lead to incomplete conclusions and foster overconfidence in a flawed model of brain function.

The design and interpretation of experiments have far-reaching implications for understanding the brain. To counteract these pitfalls, it is essential to cultivate a mindset that actively challenges initial assumptions. Developing habits like deliberately searching for disconfirming evidence, discussing alternative explanations with colleagues, and employing robust statistical methods can help guard against the trap of confirmation bias. By integrating these practices, neuroscientists can design experiments that are more objective and reliable.

Takeaways:

  • Confirmation Bias is one of many naturally-occurring cognitive biases that move thinking and observations to prioritize evidence that confirms existing beliefs.
  • Neural reward systems reinforce this bias, making it difficult to recognize or challenge preconceptions.
  • Being aware of confirmation bias is the first step toward designing experiments that minimize its impact.
  • To counteract this tendency, individuals can intentionally seek out examples that violate hypotheses or other expectations.

Reflection:

  • Can you recall a time in your lab work where an unexpected result made you question your assumptions? 
  • When was the last time you formed a hypothesis and then found yourself ignoring an alternative explanation?
  • Being mindful of that feeling is an important first step toward taming confirmation bias in the lab.

Lesson 2:

Formulating Rigorous Hypotheses

2.1 The Pitfalls of a Single, “Favored” Hypothesis

Confirmation bias is often revealed when scientists begin with an initial idea about the link between a cause and an outcome, like "when A happens, then that leads to B," and then design experiments to confirm that hypothesis. This kind of inquiry is an essential part of scientific discovery and experimentation, but it can lead to a project that focuses on demonstrating that a link between A and B exists, rather than conducting a more rigorous exploration of how A and B interact within a biological system.

Consider the Deduce the Number Rule activity: you may have come up with an idea for the number rule and then tested a number sequence that matched that rule. Confirmation bias creeps in when you interpret a matching result as confirmation of your rule. In actuality, a matching number sequence is merely evidence that is consistent with your rule being true—it’s possible other rules that could be true would have produced the same set of results!
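The point that matching results fit more than one rule can be checked mechanically. In the sketch below, the candidate rules are hypothetical examples chosen for illustration, and none of them is claimed to be the activity's actual rule. The code tests every candidate against three sequences that were all picked to fit "each number increases by 2".

```python
# Hypothetical candidate rules for a sequence (a, b, c); the real rule may differ.
CANDIDATE_RULES = {
    "each number increases by 2": lambda a, b, c: b - a == 2 and c - b == 2,
    "equal steps between numbers": lambda a, b, c: b - a == c - b,
    "all even numbers": lambda a, b, c: a % 2 == 0 and b % 2 == 0 and c % 2 == 0,
    "strictly increasing": lambda a, b, c: a < b < c,
}

def rules_consistent_with(matching_sequences):
    """Return the candidate rules that accept every sequence reported as a match."""
    return [
        name
        for name, rule in CANDIDATE_RULES.items()
        if all(rule(*seq) for seq in matching_sequences)
    ]

# Every test sequence was chosen because it fits the favored rule.
confirming_tests = [(2, 4, 6), (10, 12, 14), (20, 22, 24)]
still_possible = rules_consistent_with(confirming_tests)
print(still_possible)
```

All four candidate rules survive these tests, so three "matches" in a row have not told us which rule is the real one.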

Let's look more closely at the kinds of stumbling blocks encountered in scientific studies.

Case 1

If you start with a vague hypothesis, it is easy to interpret many kinds of results as supporting your hypothesis. For instance, if you suspect that "when A happens, then that leads to B," you may end up designing an experiment where the results show that A and B happen at the same time. However, just as in the number rule activity, this set of results is also consistent with other explanations:

  • The cause and effect are actually reversed, and B is causing A.
  • The experiment did not implement proper controls to identify what happens when A is NOT present, or to identify the conditions where B is NOT present (for more, see our forthcoming unit on controls).
  • Some other effect results in both A and B, and so they are always observed together.
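The last alternative, a shared cause, is easy to demonstrate with simulated data. The sketch below is an assumption-laden toy: a hidden variable `c` drives both `a` and `b`, which never influence each other, and the noise levels are arbitrary. The adjustment step works only because the simulation knows `c` exactly, which a real study would not.

```python
import random
import statistics

def correlation(xs, ys):
    """Pearson correlation computed from first principles."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
n = 2000
c = [rng.gauss(0, 1) for _ in range(n)]   # hidden common cause
a = [ci + rng.gauss(0, 0.5) for ci in c]  # A depends only on C
b = [ci + rng.gauss(0, 0.5) for ci in c]  # B depends only on C

r_raw = correlation(a, b)
# Subtract C's contribution from both variables, then correlate what is left.
r_adjusted = correlation(
    [ai - ci for ai, ci in zip(a, c)],
    [bi - ci for bi, ci in zip(b, c)],
)
print(f"raw correlation={r_raw:.2f}, after removing C={r_adjusted:.2f}")
```

A and B are strongly correlated even though neither causes the other, and the association vanishes once the common cause is accounted for.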

To put this in context, suppose that our hypothesis is “exercise improves mood”. We test it by randomly assigning people to an exercise group, which runs on treadmills for 30 minutes, and a control group, which reads for the same amount of time. We then measure the change in happiness on a 1-10 scale, assessed for each person before and after exercising or reading.

The results might show that, on average, happiness:

  • increased more in the exercise group than in the control group
  • changed by about the same amount in both groups
  • increased less in the exercise group than in the control group

One could also imagine any number of explanations for these results (with the possibility of multiple explanations happening at the same time):

  • Some forms of exercise improve mood, but not running.
  • Exercise improves mood for some people but not others.
  • Running for 30 min decreases mood for people who dislike running.
  • Exercise causes endorphin release, which improves mood, but on a longer timescale than is measured by the study.
  • Exercise is more effective at mitigating sadness, but does not directly improve happiness.

Since our hypothesis never specified how much improvement we expect, in which populations we expect to observe the improvement, or under what specific conditions we expect to observe it, many different potential outcomes can be interpreted as supporting our idea. Thus, confirmation bias can lead us towards an interpretation favorable to our hypothesis, and not much actually ends up being learned.

A specific hypothesis (one that delineates which variables, which populations, and what type of change are expected) constrains what evidence can be interpreted as supporting it, thereby making it harder for confirmation bias to have an influence.
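One way to picture the difference is to write the specific hypothesis down as a check that can fail. The sketch below is a deliberate caricature: the `Prediction` class, the population label, the 1-point threshold, and the outcome numbers are all invented for illustration. The "vague reading" function mimics post hoc reasoning by finding a supportive story for every result.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prediction:
    """A pre-specified, checkable claim about the treadmill study (hypothetical values)."""
    population: str
    min_advantage: float  # exercise-minus-control change on the 1-10 scale

    def supported_by(self, exercise_change, control_change):
        return (exercise_change - control_change) >= self.min_advantage

def vague_reading(exercise_change, control_change):
    """Caricature of post hoc reasoning: every outcome gets a rescuing story."""
    diff = exercise_change - control_change
    if diff > 0:
        return "supports: exercise improved mood"
    if diff == 0:
        return "supports: the effect probably appears on a longer timescale"
    return "supports: running is just the wrong kind of exercise"

prediction = Prediction(population="adults who run regularly", min_advantage=1.0)
outcomes = [(2.0, 0.5), (1.0, 1.0), (0.2, 1.1)]  # made-up (exercise, control) mean changes

for exercise_change, control_change in outcomes:
    print(
        vague_reading(exercise_change, control_change),
        "| specific prediction supported:",
        prediction.supported_by(exercise_change, control_change),
    )
```

The vague reading claims support in all three cases, while the specific prediction is supported only by the first outcome, which is what makes it informative.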

Case 2

If you design a study to test a hypothesis, but the study design does not perform a strict or systematic test, confirmation bias may lead to erroneous interpretations of the results as supporting the hypothesis.

This can be seen in the common framework of Null Hypothesis Significance Testing (NHST), where we would start by designating the hypotheses:

  • Null hypothesis (H0)—there is no relationship between A and B.
  • Alternative hypothesis (Ha)—there is a relationship between A and B.

We might approach the study by collecting data about A and B, and analyzing it in the hopes that the p-value is smaller than our pre-defined significance level (usually 0.05). If the p-value is less than the significance level, the results are deemed statistically significant, meaning that data at least as extreme as ours would be unlikely to occur if the null hypothesis were true. We've successfully rejected the null hypothesis!
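For readers who want to see where a p-value comes from, here is a minimal sketch using a permutation test, which is one of several ways to run NHST and is chosen here only because it needs no statistical tables. The scores are made up, and the function name and parameters are illustrative assumptions. The test repeatedly shuffles the group labels and counts how often a shuffled difference in means is at least as large as the observed one.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under H0 the group labels are interchangeable
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        at_least_as_extreme += diff >= observed
    # The add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Made-up measurements of B when A is present versus absent.
with_a = [6.1, 7.3, 6.8, 7.9, 6.5, 7.2, 7.0, 6.9]
without_a = [5.2, 6.0, 5.8, 6.4, 5.5, 6.1, 5.9, 6.3]

alpha = 0.05
p = permutation_p_value(with_a, without_a)
print(f"p = {p:.4f}; reject H0 at alpha = {alpha}: {p < alpha}")
```

Note what the output does and does not contain: it says the two groups are unlikely to differ this much by chance, but nothing about the size, shape, or boundary conditions of the relationship between A and B.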

But what does rejecting the null hypothesis actually tell us? The null hypothesis states that there is no relationship between A and B—but was this a meaningful hypothesis to begin with? In most situations, an experiment is motivated by the expectation of a specific kind of relationship between A and B, which we hoped to elucidate through research.

Tip:
This highlights a common misunderstanding of Null Hypothesis Significance Testing as equivalent to, or indicative of, confirmatory research. NHST does not automatically render your hypothesis well-defined or rigorously tested. Rather, it is a method of statistical inference which, properly applied, estimates how probable results at least as extreme as those observed would be if the null hypothesis were true.

What we should have done is first determine what aspects of the hypothesized relationship between A and B are of scientific interest. For example, this could be a specific quantitative relationship (e.g. exponential growth, linear and decreasing, saturating response) or that the relationship only holds under certain conditions or for certain members of the study population. From this, we can design a suitable test for the relationship or the boundary conditions of the phenomenon.

Returning to the example of the Deduce the Number Rule activity, if our hypothesis is that the rule is "successive numbers increase by 2", we would expect certain number sequences to fail to match:

  • number sequences increasing successively by 1
  • number sequences increasing successively by 4
  • number sequences where the first increase is by 2, and the second increase is by a different amount (or not at all)
  • number sequences where the first increase is by a different amount (or not at all), and the second increase is by 2

Testing these alternative sequences and finding that they all fail to match the rule is thus stronger evidence that the rule is "successive numbers increase by 2" (and only by 2). By designing the study to demonstrate falsifiability of the hypothesis, we prevent over-interpretation of the evidence, and thus reduce the influence of confirmation bias.
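The same logic can be written as a small check. In this sketch, the probe sequences are examples of the four categories listed above, and the two "secret rules" are hypothetical stand-ins used only to show both possible outcomes. The activity's real rule is not assumed here.

```python
def hypothesis(a, b, c):
    """Our working hypothesis: successive numbers increase by 2."""
    return b - a == 2 and c - b == 2

def disagreements(secret_rule, probes):
    """Return the probes where the hypothesis and the secret rule give different answers."""
    return [p for p in probes if hypothesis(*p) != secret_rule(*p)]

# Sequences the hypothesis predicts should NOT match, one per category above.
probes = [(1, 2, 3), (1, 5, 9), (1, 3, 10), (1, 8, 10)]

# Two hypothetical secret rules, for illustration only.
def exactly_plus_two(a, b, c):
    return b - a == 2 and c - b == 2

def any_increasing(a, b, c):
    return a < b < c

survives = disagreements(exactly_plus_two, probes)
falsified = disagreements(any_increasing, probes)
print("secret rule is +2:", survives)
print("secret rule is any increasing sequence:", falsified)
```

If the secret rule really is "increase by 2", every probe fails to match as predicted and the hypothesis survives a test it could have failed. If the rule is broader, the very first probe exposes the mismatch, something no number of confirming sequences could have done.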

Deep dive: Statistics is not scientific reasoning
Science is not the easiest discipline to penetrate, and it is not uncommon for researchers early in their career to feel like impostors stumbling blindly across a complex, esoteric canon that stands between them and the privileged praxis of scientific discovery. For this and other reasons, we are often more concerned with using the correct tool than we are with understanding what makes that tool the right one for a given job — we learn that p < .05 is the standard of statistical significance, and proceed to profess and ritualize it like the definitive difference between noise and eureka, and in time, we often lose sight of what that ever really meant to begin with. A similar sentiment is echoed by John R. Platt in his “Strong Inference” article:

“Equipment, calculations, lectures become ends in themselves. How many of us write down our alternatives and crucial experiments every day, focusing on the exclusion of a hypothesis? We may write our scientific papers so that it looks as if we had steps 1, 2, and 3 in mind all along. But in between, we do busywork. We become ‘method-oriented’ rather than ‘problem-oriented.’”

Questioning things is a key part of what makes us not just scientists, but good scientists, and that should extend to our statistical methodologies and epistemological groundings as well. Failing to do so risks settling into a procedural routine that may allow us to persist within our systems and institutions, but does so at the cost of meaningful inferential progress. This is one of many problems at fault for the relative stagnation of many specialties within neuroscience, and a partial driver of the “data-rich, theory-poor” characterization the field often suffers.

This kind of trap exists in the case of many statistical measures commonly used in science, including NHST. The framework is valid and its reasoning is sound, but only if it is leveraged mindfully and does not substitute intelligent scientific thinking — that is, NHST answers the narrow statistical question of “Is there evidence against no effect?”, not the broader scientific question of “What specific biological claim does my experiment eliminate?”

That distinction matters because statistical measures are narrower in scope than is suggested by the authority with which scientific institutions often endow them. All a p-value really tells you is the probability of observing data at least as extreme as observed, assuming the null hypothesis is true. It is neither the probability of the null hypothesis itself being true, nor the probability of an alternative hypothesis being correct. Misuse of p-values is so pervasive a problem in the sciences that in 2016, the American Statistical Association issued a public statement on the improper use of p-values and statistical significance measures:

“Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.”

In an associated press release, ASA president Jessica Utts stated the problem outright:

“Over time it appears the p-value has become a gatekeeper for whether work is publishable […] It also leads to practices […] that emphasize the search for small p-values over other statistical and scientific reasoning.”

This is not a new problem either. In fact, the ASA statement echoes sentiments voiced more than 20 years earlier by Jacob Cohen (of Cohen’s d fame) in his 1994 essay “The Earth Is Round (p < .05)”, whose abstract itself notes the age of the issue:

“After 4 decades of severe criticism, the ritual of null hypothesis significance testing—mechanical dichotomous decisions around a sacred .05 criterion—still persists. This article reviews the problems with this practice, including its near-universal misinterpretation of p as the probability that Ho is false, the misinterpretation that its complement is the probability of successful replication, and the mistaken assumption that if one rejects Ho one thereby affirms the theory that led to the test.”
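To make the definition of a p-value given earlier concrete, here is a minimal sketch of an exact one-sided binomial test. The scenario (15 successes in 20 trials against a chance rate of 0.5) and the function name are hypothetical, chosen only to illustrate the calculation.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, null_prob: float = 0.5) -> float:
    """One-sided p-value: probability of a result at least as extreme as
    `successes` out of `trials`, ASSUMING the null hypothesis is true."""
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical data: 15 successes in 20 trials, null hypothesis = chance (0.5).
p = binomial_p_value(15, 20)
print(f"{p:.4f}")  # 0.0207
```

Note what the calculation conditions on: the null hypothesis is taken as true throughout. The result is therefore a statement about the data given the null, not about the probability that the null (or any alternative) is correct.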

Neuroscience is tragically able to lay claim to several particularly memorable instances in which statistical measures supplanted scientific thinking to disastrous effect:

In 2009, Vul et al. and Kriegeskorte et al. highlighted that widespread circular analysis had resulted in inflated correlations in fMRI studies. That same year, Bennett et al. (2009) published an infamous tongue-in-cheek experiment demonstrating that failing to correct for multiple comparisons could yield statistically significant “brain activity” in a dead salmon. Four years later, a review by Button et al. (2013) indicted the field for its chronically underpowered studies, estimating a median statistical power between ~8% and ~31% and noting inflated false-discovery rates.
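The multiple-comparisons problem behind the salmon result can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of any of the cited analyses: each "test" is a simple z-test on pure noise (true mean 0, known standard deviation 1), and the number of tests, the sample size, and the function names are all arbitrary choices for the example.

```python
import math
import random

def two_sided_p(sample):
    """Two-sided z-test p-value against a true mean of 0 with known SD of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))

def count_false_positives(n_tests=2000, n_per_test=20, alpha=0.05, seed=1):
    """Run many tests on pure noise and count how many come out 'significant'."""
    rng = random.Random(seed)
    p_values = [
        two_sided_p([rng.gauss(0, 1) for _ in range(n_per_test)])
        for _ in range(n_tests)
    ]
    uncorrected = sum(p < alpha for p in p_values)
    # Bonferroni correction: divide the threshold by the number of tests.
    bonferroni = sum(p < alpha / n_tests for p in p_values)
    return uncorrected, bonferroni

uncorrected, bonferroni = count_false_positives()
print(uncorrected, bonferroni)
```

Because every null hypothesis here is true by construction, every "significant" result is a false positive. Without correction we expect roughly alpha × n_tests (about 100) of them; with the Bonferroni threshold we expect almost none. Bonferroni is only one of several correction methods and is used here because it is the simplest to show.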

The exact mechanism by which these scientific errors occur may vary, but at their heart lies the deeper problem of procedural gospel guiding research unconstrained by robust inferential investigation. The point is not that NHST and p-values are useless concepts, but that as scientists, we often forget that statistical measures exist downstream of the reasoning that grounds scientific claims.

Case 3

If you design an experiment around two hypotheses (H1 and H2) and they are NOT mutually exclusive, your experiment may not have a clear objective or interpretable set of results. While it is great to start with two hypotheses, if they are not mutually exclusive, remember that they could both be true at the same time (or alternatively, neither could be true)! Without providing definitive evidence on what the actual relationship is between A and B, future studies may not have much to build on.

Let’s imagine that we want to study the impact of caffeine on test performance and devise two hypotheses:

  • H1: Caffeine improves alertness.
  • H2: Caffeine improves motivation.

These hypotheses are not mutually exclusive: caffeine might improve both alertness and motivation. If we find that the coffee drinkers among our subjects outperform their counterparts, we have no way of knowing which hypothesis (if either) accounts for the difference. We could, however, revise these:

  • H1: Caffeine improves performance even when motivation remains constant.
  • H2: Caffeine improves performance only by increasing motivation.

Now we are able to design an experiment that sets up both contexts (by explicitly measuring and controlling for motivation), and in so doing discern the role of motivation in the underlying mechanism.
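A small simulation can show why the revised hypotheses are distinguishable where the original pair was not. Everything below is an assumption made for illustration: the effect sizes, the treatment of motivation as a simple high/low measure, and the function names are hypothetical, and the data are simulated rather than real.

```python
import random
from statistics import mean

def simulate(world, n=4000, seed=0):
    """Simulate (caffeine, motivated, score) rows under one hypothetical world."""
    if world not in ("H1", "H2"):
        raise ValueError("world must be 'H1' or 'H2'")
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        caffeine = rng.random() < 0.5
        if world == "H1":
            # Caffeine acts directly on performance; motivation is unrelated to it.
            motivated = rng.random() < 0.5
            direct = 4.0 if caffeine else 0.0
        else:
            # Caffeine acts ONLY by making high motivation more likely.
            motivated = rng.random() < (0.7 if caffeine else 0.3)
            direct = 0.0
        score = 60 + direct + (10.0 if motivated else 0.0) + rng.gauss(0, 5)
        rows.append((caffeine, motivated, score))
    return rows

def raw_effect(rows):
    """Caffeine vs. no caffeine, ignoring motivation."""
    return mean(s for c, m, s in rows if c) - mean(s for c, m, s in rows if not c)

def effect_holding_motivation_constant(rows):
    """Average caffeine effect within each motivation level."""
    diffs = []
    for level in (False, True):
        stratum = [(c, s) for c, m, s in rows if m == level]
        diffs.append(
            mean(s for c, s in stratum if c) - mean(s for c, s in stratum if not c)
        )
    return mean(diffs)

for world in ("H1", "H2"):
    rows = simulate(world)
    print(world, round(raw_effect(rows), 1),
          round(effect_holding_motivation_constant(rows), 1))
```

In both simulated worlds the raw comparison shows caffeine users scoring about 4 points higher, so that comparison alone cannot separate the hypotheses. Once motivation is held constant, the effect persists under H1 and shrinks toward zero under H2, which is exactly the contrast the revised hypotheses predict.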

There is one case where it could be acceptable to have two hypotheses that can be true at the same time. This applies when the two hypotheses are known to occur independently (e.g. the relationship between A and B occurs through multiple pathways), and the goal of a study is to quantify the frequency or strength of the two hypothesized pathways. In such a study, it might still be valuable to implement a mutually exclusive control to establish the validity of the study design.

The Solution

The common thread through these three cases is that vague hypotheses and open-ended study design enable confirmation bias to take root. Conversely, we can support rigorous science by planning ahead with our hypothesis and study design. We can do this by ensuring three key qualities for rigorous hypotheses:

  1. Specific: What are the independent and dependent variables? How will they be measured as part of the study? Is the goal to test a general explanation of a phenomenon or a specific instance?
  2. Falsifiable: Under what conditions will the hypothesized effect occur? When will it fail to occur?
  3. Contextual: What are the populations, experimental factors, and pathways that influence the hypothesized effect? If a study is investigating these moderators, are there proper controls?

We would be remiss not to mention the strategy of "Strong Inference" (Platt 1964). In this paper, Platt recommends that experiments distinguish between mutually exclusive hypotheses:

“It seems to me that the method of most rapid progress in such complex areas, the most effective way of using our brains, is going to be to set down explicitly at each step just what the question is, and what all the alternatives are, and then to set up crucial experiments to try to disprove some.”
Platt, J. R. (1964). Strong inference. Science, 146(3642), 347–353.

Deep dive: The beautiful experiment
In John R. Platt’s landmark 1964 article “Strong Inference”, the biophysics professor opens by pointing out that some fields of science advance significantly faster than others despite being just as theoretically dense, technically complex, and intellectually intractable.

One of these faster-progressing fields Platt referenced was molecular biology, which, in 1958, had just seen the design and publication of the Meselson-Stahl experiment — often cited as the most beautiful experiment in all of biology, most famously so by biochemist and MacArthur fellow John Cairns. In order to understand what makes the experiment so beautiful, we first need to understand the question Meselson and Stahl were trying to answer: how does DNA replicate?

In the wake of discovery of the double-helix structure, three primary candidate models of replication had been hypothesized:

  • Conservative replication, wherein the parental double-helix remains intact and a new double-helix is constructed alongside it.
  • Semiconservative replication, wherein the two strands of the double-helix separate and each act as templates for new strands.
  • Dispersive replication, wherein the parental double-helix is broken into segments that are distributed among the daughter molecules, each of which is then completed with newly synthesized material.

Prior work rendered the semiconservative option attractive due to the fact that each strand theoretically carried all of the information required to construct its complement — but how could one distinguish between these three options experimentally, let alone in 1958?

Meselson and Stahl’s crucial insight was that the three hypotheses each predict a different distribution of old and new material after replication:

  • Conservative replication creates an entirely new double-helix, meaning that a single round of replication would yield one wholly new and one wholly old molecule.
  • Semiconservative replication utilizes one strand per new molecule, yielding two half-old, half-new molecules.
  • Dispersive replication breaks the parental structure up altogether, yielding two molecules made of various bits of old and new material.

What this meant is that if Meselson and Stahl could somehow mark the parental DNA structure by some means that allowed them to review its post-replication distribution, they could meaningfully differentiate between the three hypothesized processes.

That is precisely what they did. They grew bacteria in a medium containing a heavy nitrogen isotope, which the bacteria incorporated into their DNA. They then moved these bacteria to a medium containing lighter nitrogen, so that any newly synthesized DNA would be lighter. By allowing the bacteria to replicate and then centrifuging their DNA in a density gradient, Meselson and Stahl could detect heavy, light, and hybrid (half-heavy, half-light) DNA, for which each hypothesis carried distinct predictions.

After one generation, the pair found a single band of intermediate density. This meant that parental and new material had combined within each molecule, eliminating the conservative replication hypothesis. After a second generation, they found two distinct bands: one of intermediate density and one of light density. This is incompatible with the patchwork process of the dispersive hypothesis, which predicts a single band that grows progressively lighter. With both alternatives eliminated, what remained is what we now understand to be the true mechanism of DNA replication: the semiconservative model.
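The contest of predictions described above can be written out as a short sketch. It is an idealized model, not Meselson and Stahl's analysis: each molecule is represented by the heavy-isotope fraction of its two strands, the dispersive model is assumed to split parental material perfectly evenly, and the "observed" bands are the idealized results described in the text. All names are hypothetical.

```python
from fractions import Fraction

HEAVY, LIGHT = Fraction(1), Fraction(0)

def conservative(molecule):
    """Parent stays intact; a wholly new (light) molecule appears alongside it."""
    return [molecule, (LIGHT, LIGHT)]

def semiconservative(molecule):
    """Each old strand pairs with one newly synthesized (light) strand."""
    old_a, old_b = molecule
    return [(old_a, LIGHT), (old_b, LIGHT)]

def dispersive(molecule):
    """Old material is spread evenly (an assumption) across both daughters."""
    share = sum(molecule) / 4
    return [(share, share), (share, share)]

MODELS = {
    "conservative": conservative,
    "semiconservative": semiconservative,
    "dispersive": dispersive,
}

def predicted_bands(model, generations):
    """Set of molecule densities (heavy fraction) expected after n generations."""
    population = [(HEAVY, HEAVY)]  # start with fully heavy parental DNA
    for _ in range(generations):
        population = [d for m in population for d in model(m)]
    return {sum(m) / 2 for m in population}

# Idealized observations from the text: generation -> bands seen in the gradient.
OBSERVED = {
    1: {Fraction(1, 2)},               # one intermediate band
    2: {Fraction(1, 2), Fraction(0)},  # one intermediate band, one light band
}

surviving = [
    name for name, model in MODELS.items()
    if all(predicted_bands(model, g) == bands for g, bands in OBSERVED.items())
]
print(surviving)  # ['semiconservative']
```

The first generation alone rules out only the conservative model, since the semiconservative and dispersive models both predict a single intermediate band. It takes the second generation to separate those two, which is why the experiment followed the bacteria across more than one round of replication.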

The key to Meselson and Stahl’s insight lay in reasoning from abstract theory to measurable physical reality, which the pair then leveraged to design an experiment in which each model of replication made its own specific prediction. It was in this inferential elegance that John Cairns (and many others) saw the beauty of the experiment. This same approach, translating theory into measurable quantities and then arranging a contest between the predictions of the candidate models, exemplifies the kind of rigorous science and powerful inductive inference that Platt posited as the primary driver of accelerated scientific progress.

We agree with Platt’s general sentiment that each study or experiment should provide useful and interpretable evidence. However, we do not require that every study pit two (or more) competing hypotheses against each other. In many cases, rigorous, important work involves articulating the boundaries of hypotheses and the contexts in which they operate. This is why we focus on hypotheses being falsifiable and contextual.

2.2 Activity: Strategies for Hypothesis Generation

In the next activity, you have the opportunity to practice developing specific and falsifiable hypotheses. Aim to move beyond broad explanations in order to identify the underlying mechanism and specify clear, testable predictions. A strong hypothesis also makes explicit what evidence would falsify it.

Post-activity questions:

  1. Think of a current hypothesis you are interested in testing; how could it benefit from any of the changes you made here?
  2. How can you develop a hypothesis that will then help you design a study with less space for ambiguity?

Developing a strong hypothesis is challenging, and part of strengthening a hypothesis is putting it up against contradicting ideas. Remember: it is necessary to be thoughtful about initial hypotheses and consider opposing ideas that could equally explain underlying mechanisms or observed effects.

2.3 Rigorous experiments start with good hypotheses

To recap, a rigorous experiment requires meaningfully challenging your hypotheses by:

  1. Developing a hypothesis that is specific, falsifiable, and contextual.
  2. Designing a study or experiment that challenges the hypothesis, and which can provide strong evidence regardless of how the results turn out.
  3. Iterating on the hypothesis and study design to make sure they are rigorous and well-aligned.

What are some specific actions that can be taken to support these steps?

  1. Examine your hypotheses: Look for a set of hypotheses that make predictions that are incompatible with each other. Ensure that these hypotheses are scientifically meaningful and biologically plausible.
  2. Exploratory pre-studies: Conduct pilot experiments or look at existing data to see if a hypothesis is indeed a plausible explanation for the observations.
  3. Seek opposing views: Talk to colleagues who are skeptical. They may quickly offer other competing ideas, helping you to refine how you approach your study design.
  4. Stay updated on literature: Carefully seek out literature about contradictory or inconclusive findings to ensure that you don't get tunnel vision about the publications that support your favored explanation.
  5. Define rejection criteria: Be explicit about the circumstances under which the hypotheses should be rejected.

Takeaways:

  • Confirmation bias can lead you to focus on a favored hypothesis, causing you to interpret supportive evidence as definitive proof, even when other plausible explanations exist.
  • If you design experiments with competing hypotheses that make incompatible predictions, then you are more likely to clearly differentiate between potential explanations. This leads to more conclusive and informative results.
  • Designing experiments that attempt to disprove hypotheses can often be both more efficient and more elucidative than designing experiments that attempt to prove them.
  • Writing specific, falsifiable hypotheses ensures that research design explicitly tests the causes and underlying mechanisms of a given phenomenon rather than just confirming what you already believe.

Reflection:

  • Think back to a recent project: where did you feel a strong urge to prove your favorite explanation instead of testing whether it could be wrong?
  • When you design an experiment, how do you make room for a rival idea that could oust your front-runner hypothesis?
  • Recall a time your results “fit” a vague prediction; what alternative stories about the data did you overlook?