Example of collider bias

Let’s simulate some data. Each row represents an academic paper with two qualities, rigor and innovativeness. Both are normally distributed, centered at 0, and independent of each other.

Reviewers have a threshold for what they will accept for publication. As we all know, whether a given paper is accepted is due in part to luck. In this world, papers are judged on the sum of their rigor and innovativeness: if that sum exceeds a threshold drawn uniformly at random between 0 and 2 for each paper, the paper is published.

library(tidyverse)
library(modelsummary)

set.seed(42)  # for reproducibility
N <- 1000

df <- tibble(
  rigor = rnorm(N, 0, 1),
  innovativeness = rnorm(N, 0, 1),
  threshold = runif(N, 0, 2),  # each paper's luck-of-the-draw acceptance bar
  published = (rigor + innovativeness) > threshold
)
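As a quick sanity check (not part of the original analysis, but using the same `df`), the two traits should be essentially uncorrelated in the full sample, and since the sum of two standard normals has to clear a bar between 0 and 2, well under half of papers should be published:

```r
# Rigor and innovativeness are independent by construction,
# so their sample correlation should be near zero.
cor(df$rigor, df$innovativeness)

# Share of papers that clear their threshold -- roughly a quarter,
# since rigor + innovativeness must exceed a Uniform(0, 2) bar.
mean(df$published)
```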

Let’s visualize the relationship between rigor and innovativeness. As expected, there is no relationship between the two.

df |>
  ggplot(aes(x = rigor, y = innovativeness)) +
  geom_point(aes(color = published)) +
  geom_smooth(method = "lm") +
  theme_minimal()

If we ran a regression with just these two variables, we would find no relationship, which is correct.

modelplot(lm(innovativeness ~ rigor, data = df))

However, when we break the data into published and unpublished papers, we see a relationship between rigor and innovativeness within each group, even though none exists in the full data. This is collider bias: publication is a common effect of rigor and innovativeness, and conditioning on a common effect induces an association between its causes.
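To see the induced relationship numerically rather than only in the plot, we can fit the same regression separately within each group (a hypothetical check, using the same `df`). Among published papers, the rigor coefficient should come out negative: a published paper that is low on rigor must have been high on innovativeness to clear the bar.

```r
# Regression of innovativeness on rigor, within each publication group.
# Conditioning on the collider (published) induces a negative slope.
coef(lm(innovativeness ~ rigor, data = subset(df, published)))
coef(lm(innovativeness ~ rigor, data = subset(df, !published)))
```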

df |>
  ggplot(aes(x = rigor, y = innovativeness, color = published)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~published) +
  theme_minimal()

If we run the regression again, this time controlling for whether a paper is published, we will be misled into thinking that rigor predicts innovativeness when it does not: publication is a collider, and conditioning on it opens a non-causal path between rigor and innovativeness.

modelplot(lm(innovativeness ~ rigor + published, data = df))
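One way to compare the two models side by side is a `modelsummary()` table (a sketch; `gof_omit` trims the goodness-of-fit rows and `statistic = "conf.int"` reports confidence intervals). Adding `published` as a control should move the rigor coefficient from roughly zero to clearly negative:

```r
# Side-by-side comparison: the rigor coefficient is near zero without
# the collider in the model, and negative once published is "controlled for".
models <- list(
  "No control"        = lm(innovativeness ~ rigor, data = df),
  "Control published" = lm(innovativeness ~ rigor + published, data = df)
)
modelsummary(models, gof_omit = ".*", statistic = "conf.int")
```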