In 1984, Hurlbert raised the alarm on a common statistical error in his paper “Pseudoreplication and the design of ecological field experiments”. His aim was to draw attention to how the assumption of independence underlying various hypothesis tests is often violated. He called the error pseudoreplication. The paper describes how this error invalidates a test’s nominal false positive rate and throws the conclusions of an experiment into question. Along the way, Hurlbert offers much statistical wisdom to experimenters.


“Pseudoreplication is defined as the use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent.”

Hurlbert gives an example of testing the effects of DDT (an insecticide) on plankton growth. Let’s imagine setting up an experiment. We buy eight identical aquarium tanks, fill them with water, add equal amounts of plankton, and put them in a row on a table. Then we measure the plankton content in each one and add DDT to the first four. After a few days, we come back and measure the plankton content again to compute the growth rates.

Use of a statistical test here is pseudoreplication. The tanks are not statistically independent because they are not randomly laid out. If the four tanks untreated with DDT happen to be closer to a window, then no statistical test can separate the difference in plankton growth due to the absence of DDT from the difference due to increased light.
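To make the confounding concrete, here is a small simulation of my own (not from the paper; the gradient slope and noise level are invented illustration numbers). DDT does nothing in this simulation, yet placing all treated tanks at one end of a light gradient builds a difference into the comparison that no test can distinguish from a treatment effect:

```python
import random
import statistics

def tank_growth(position, noise_sd=0.5):
    # Plankton growth in one tank: a light gradient along the table
    # (tanks nearer the window grow faster) plus random noise.
    # The slope 0.3 and noise_sd are made-up illustration values.
    return 0.3 * position + random.gauss(0, noise_sd)

random.seed(1)
gaps = []
for _ in range(2000):
    # Systematic layout: DDT tanks occupy positions 0-3, controls 4-7.
    treated = [tank_growth(p) for p in range(4)]
    control = [tank_growth(p) for p in range(4, 8)]
    gaps.append(statistics.mean(control) - statistics.mean(treated))

# With zero DDT effect, the layout alone produces an average
# group difference of about 4 * 0.3 = 1.2.
mean_gap = statistics.mean(gaps)
print(round(mean_gap, 2))
```

Randomizing tank positions would remove this systematic gap, leaving only noise.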

Keeping track of the experimental unit

An experimenter must always keep the experimental unit clear in mind. The experimental unit is the object on which we take measurements. For A/B tests in web products this is often the user. However, there are cases where we use other experimental units. For example, we might want to run an experiment on our landing page before a user account has been created. In that case, we could generate a unique cookie, attach it to the browser, and treat unique cookies as the experimental unit.

Consider an experiment trying to measure the sex ratios of field mice. We go into two fields, set up six plots per field, and lay out some traps. Later, we return to count the number of male and female mice caught in each plot. We ask whether there is a difference in the proportion of males between the fields. When running statistical tests, our degrees of freedom should be based on the number of plots and not the number of mice caught. We know this because we aren’t measuring a trait of each individual mouse. So a justified hypothesis test would be a two-sample t-test on the proportion of males in each plot, with 10 degrees of freedom.
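As a sketch of that plot-level analysis (the trap counts below are invented for illustration), each plot is reduced to a single number, the proportion of males, and the two fields are compared with a pooled two-sample t-test:

```python
import math
import statistics

# Hypothetical trap counts (males, females) for six plots in each field.
field_a = [(12, 9), (7, 11), (15, 12), (9, 9), (11, 14), (8, 6)]
field_b = [(5, 12), (9, 13), (6, 11), (10, 16), (4, 9), (7, 17)]

def male_proportions(plots):
    # One number per plot: the plot, not the mouse, is the experimental unit.
    return [m / (m + f) for m, f in plots]

def two_sample_t(x, y):
    # Pooled-variance two-sample t statistic; df = len(x) + len(y) - 2.
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x) +
           (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1/nx + 1/ny))
    return t, nx + ny - 2

t, df = two_sample_t(male_proportions(field_a), male_proportions(field_b))
print(df)  # 10: twelve plots, not the hundreds of mice caught
```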

Components of an experiment

There are five components to an experiment: hypothesis, experimental design, experimental execution, statistical analysis, and interpretation. Hurlbert emphasizes that the hypothesis is of “primary importance”, because without a good hypothesis, even a well-conducted experiment provides little value.

Mensurative versus manipulative experiments

The paper clearly distinguishes two types of experiments: mensurative and manipulative. A mensurative experiment is one which is purely descriptive. It tells the experimenter what is currently the case. Often, the only variables of interest are space and time. An example of a mensurative experiment would be one measuring the rate of decomposition of maple leaves at the bottom of a lake under 1m of water. To do this, we drop several porous bags at random locations on a lake bed at a depth of 1m. After some time, we return and measure the organic material left over.

The use of statistics is not necessarily the distinguishing feature of manipulative experiments. Imagine that we wanted to measure the difference in decomposition rates of maple leaves in 1m versus 10m depth of water. To do so, we would distribute a few bags, at random, at a depth of 1m and similarly at 10m. Afterwards, we run a Mann-Whitney U test to compare. In this example, we are still observing the system as it exists, rather than changing a component of the system.

Manipulative experiments always involve two or more treatments and are concerned with the comparisons between them. The defining feature of manipulative experiments is that different experimental units receive different treatments and that the assignment of treatments to experimental units is, or can be, randomized.

While clean execution is crucial, it cannot substitute for critical features of experimental design: controls, replication, randomization and interspersion.

By convention, a control is any treatment against which the others are to be compared. We require controls because systems, like biological ones, experience temporal change. If we were absolutely certain that a system is constant over time, then a separate control would be unnecessary: one could simply measure each experimental unit before and after treatment.

If we zoom out, we can use the word control to mean minimizing the chances of confusion from various sources. Randomization “controls” for bias in the assignment of experimental units. Replication controls for stochastic factors, such as between-replicate variability or noise introduced by the experimenter. Interspersion controls for spatial variation in the properties of the experimental units.

A third meaning of control is regulation of conditions under which the experiment is conducted. It may refer to homogeneity of experimental units. This is an unfortunate usage, because the validity of the experiment is not affected by such regulation.

Replication, randomization and independence

Replication increases the precision of our estimates and allows for statistical testing. Randomization eliminates bias from the experimenter and therefore increases the accuracy of our estimates.

With respect to testing, the “main purpose [of replication], which there is no alternative method of achieving, is to supply an estimate of error [i.e. variability], by which the significance of these comparisons is to be judged…[and] the purpose of randomization…is to guarantee the validity of the test of significance, this test being based on an estimate of error made possible by replication” [Fisher, 1971].

Randomization guarantees that, on average, “errors” are independently distributed, that “pairs of plots treated alike are not nearer together or further apart than, or in any other relevant way distinguishable from, pairs of plots treated differently”, except insofar as there is a treatment effect.

That is, there is no systematic way in which the treated and untreated plots are different.

A lack of independence of errors prohibits us from knowing alpha, the probability of a false positive. With the true alpha unknown, and possibly either higher or lower than the nominal level, our interpretation of statistical analysis becomes rather subjective.

In some ways, interspersion is the most critical distinction between manipulative and mensurative experiments. Randomization is simply how we achieve interspersion without experimenter bias, and it allows the accurate specification of the false positive rate.

Prelayout and layout-specific alpha

Prelayout alpha is the standard false positive rate. It’s the proportion of the time a statistical procedure reports a significant result when there is, in fact, none. We obtain it by averaging the probability of a false positive in each specific layout over all possible layouts. A layout-specific alpha is the one that results from a particular, fixed layout. The problem is that one can never know the layout-specific alpha.

When a gradient exists, a well-interspersed layout will have a more conservative (lower) false positive rate, whereas a segregated one will have a higher false positive rate. Intuitively, if the gradient is “aligned” with the treatment assignment, then the effect of the gradient gets lumped in with the estimated treatment effect.
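A simulation can illustrate this (my own sketch; the gradient strength, plot count, and noise level are invented). Eight plots sit along a gradient and there is no treatment effect. A fully segregated layout has a layout-specific alpha far above the nominal 0.05, while averaging over random layouts recovers roughly the prelayout alpha:

```python
import math
import random
import statistics

def t_stat(x, y):
    # Pooled-variance two-sample t statistic.
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x) +
           (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1/nx + 1/ny))

def false_positive_rate(layout, reps=3000, crit=2.447):
    # crit is the two-sided 5% t critical value for df = 4 + 4 - 2 = 6.
    hits = 0
    for _ in range(reps):
        treat_pos, ctrl_pos = layout()
        # No treatment effect: the response is just gradient plus noise.
        x = [0.5 * p + random.gauss(0, 1) for p in treat_pos]
        y = [0.5 * p + random.gauss(0, 1) for p in ctrl_pos]
        hits += abs(t_stat(x, y)) > crit
    return hits / reps

def segregated():   # all treated plots at one end of the gradient
    return list(range(4)), list(range(4, 8))

def randomized():   # treatment positions drawn at random
    pos = random.sample(range(8), 8)
    return pos[:4], pos[4:]

random.seed(0)
seg_rate = false_positive_rate(segregated)
rand_rate = false_positive_rate(randomized)
print(seg_rate, rand_rate)  # segregated rate is far above the nominal 0.05
```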

Misuse of the chi-squared test

A common use of the chi-squared test is comparing sex ratios between two different plots. A trap is set in each plot and each capture is considered independent. Since each animal is considered independent, the chi-squared test can be correctly applied. Note that the experimental unit is the plot.

Mistakes arise when treatments become involved. If there are only two plots, one treated and one untreated, then the chi-squared test can be applied. However, this experiment hopelessly mixes location differences with treatment differences, so no conclusion about the treatment can be drawn.

If we replicate the treatment and control plots, having two each, then we have difficulty applying the chi-squared test. If we apply it to all four plots, a significant result can only mean that some pair of plots differs. If we pool, then we are committing “sacrificial pseudoreplication”: we lose information about the variance within plots, and the observations are correlated within the subgroups of the treatment and control pools.

The core of this misunderstanding is a confusion of the experimental unit. If we were measuring individual animals for their resting metabolic rate under treatment and control, then a test treating each animal as an observation could be appropriately applied. There, the animal is the unit and each measurement is independent of the others. In our field experiment, the plots are the units. We are interested in a measurement (sex ratio) per plot. Thus, if we were to run a four-plot test, with two controls and two treatments, we would have 4 experimental units. That is, the appropriate test is a t-test with two-fold replication.

The chi-squared test requires separate conditions with independent observations within each. In the two-fold replication example, there are actually four conditions: control at location 1, control at location 2, treatment at location 3, and treatment at location 4. Only within each condition are the observations independent (an observation here is a 1 for female and a 0 for male). To deal with this, we “roll up” on location: each location becomes a single observation, with the proportion of males as the measure. Now we can carry out a t-test, since the plot-level observations are independent and there are two observations per condition.
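The roll-up can be sketched as follows (the capture data are invented; 1 = female, 0 = male). Each plot collapses to one proportion, and the t-test then runs on just four plot-level observations:

```python
import math
import statistics

# Hypothetical captures per plot (1 = female, 0 = male).
plots = {
    ("control", 1):   [1, 0, 1, 1, 0, 1, 0, 1],
    ("control", 2):   [1, 1, 0, 1, 1, 1, 1, 0],
    ("treatment", 3): [0, 0, 1, 0, 1, 0, 0, 1],
    ("treatment", 4): [0, 1, 0, 0, 1, 0, 0, 0],
}

# Roll up on location: each plot becomes a single observation,
# the proportion of females caught there.
control = [statistics.mean(v) for k, v in plots.items() if k[0] == "control"]
treatment = [statistics.mean(v) for k, v in plots.items() if k[0] == "treatment"]

# Pooled two-sample t-test on the plot-level proportions.
nx, ny = len(control), len(treatment)
sp2 = ((nx - 1) * statistics.variance(control) +
       (ny - 1) * statistics.variance(treatment)) / (nx + ny - 2)
t = (statistics.mean(control) - statistics.mean(treatment)) / math.sqrt(sp2 * (1/nx + 1/ny))
df = nx + ny - 2
print(df)  # 2: four plots, two per condition
```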

Errors in published ecological field experiments

Hurlbert conducted a review of 156 papers published in the ecology literature between 1977 and 1980. He categorized the papers by their subject of study: plankton, benthos, or small mammals. Within each category, he classified whether they use inferential statistics and whether they have proper replicates. Papers that use inferential statistics without proper replication are committing pseudoreplication.

# nr = no replicates, ns = no statistics
d <- matrix(c(14, 13, 1,
              5, 18, 12,
              15, 15, 2,
              14, 11, 9),
            byrow=F, nrow=3, ncol=4)
rownames(d) <- c("plankton", "benthos", "s_mammals")
colnames(d) <- c("NoRepNoStats", "NoRepStats", "RepsNoStats", "RepStats")
data <- as.table(d)

prop.table(data, margin=1)
##           NoRepNoStats NoRepStats RepsNoStats RepStats
## plankton        0.2917     0.1042      0.3125   0.2917
## benthos         0.2281     0.3158      0.2632   0.1930
## s_mammals       0.0417     0.5000      0.0833   0.3750
# Testing for equality of proportions between the different rows
chisq.test(data)
##  Pearson's Chi-squared test
## data:  data
## X-squared = 20, df = 6, p-value = 0.002

He comments that, “The distribution of studies among design and analysis categories varies significantly among the three specific subject matter areas.” In sub-fields where replication of treatments is expensive (e.g. small mammal experiments), pseudoreplication is committed in about half the published experiments.

Closing thought

I find these results encouraging. As a group, scientists are particularly concerned with the validity of their conclusions. If professional scientists are having difficulty, then properly applying statistics is hard for everyone.

Incorrectly using statistics is worse than not using statistics at all. Don’t underestimate the damage an (incorrect) p < 0.05 can do. Statistical methods all rely on assumptions, and breaking these assumptions can give very misleading results. An easy one to violate is independence of errors.

When we run an A/B test, users are randomized into either treatment or control. This ensures that the users who end up with the treatment are not systematically different from the control. We don’t like systematic differences because they get hopelessly confused with the effect of our treatment.

If the randomization system has been well engineered, then how do we lose independence? Easy. Imagine a web experiment where we randomize on an identifier that is kept in a cookie. Each time the user comes back, they are put in the same bucket. If we analyze by session, say clicks per session, our measurements are not independent because some users visit more frequently and have more sessions.
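A small A/A simulation makes the damage visible (my own sketch; the click rates and session counts are invented). Analyzing clicks per session treats a user's correlated sessions as independent and rejects far more often than the nominal 5%, while rolling up to one observation per user restores the false positive rate:

```python
import math
import random
import statistics

def aa_test_rejects(users=100, per_user=True):
    # One A/A comparison: no real effect. Each user has a sticky click
    # propensity and a variable number of sessions (heavy users dominate).
    arms = {"A": [], "B": []}
    for _ in range(users):
        arm = random.choice("AB")
        propensity = random.gauss(0.3, 0.15)            # per-user rate
        sessions = [propensity + random.gauss(0, 0.05)  # clicks per session
                    for _ in range(random.randint(1, 20))]
        if per_user:
            arms[arm].append(statistics.mean(sessions))  # one obs per user
        else:
            arms[arm].extend(sessions)                   # one obs per session
    a, b = arms["A"], arms["B"]
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    return abs(statistics.mean(a) - statistics.mean(b)) / se > 1.96

random.seed(2)
reps = 500
session_fpr = sum(aa_test_rejects(per_user=False) for _ in range(reps)) / reps
user_fpr = sum(aa_test_rejects(per_user=True) for _ in range(reps)) / reps
print(session_fpr, user_fpr)  # session-level analysis rejects far too often
```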

There are many ways this can happen. So many, in fact, that the ecologist Hurlbert wrote an entire paper on it in 1984, calling the error pseudoreplication. The paper contains many nuggets of statistical wisdom and I highly recommend it; the notes above are my highlights from the paper.