How categorizing variables can induce interactions where there are none

I saw a talk where the speaker found that a continuous biomarker was not related to the outcome on its own, but was once it was interacted with disease status, a binary variable created by cutting a continuous variable at a specific quantile. That reminded me of a paper by Thoresen, "Spurious interaction as a result of categorization".

So I tried to reproduce the speaker's results in a simulation where no true interaction exists. First I generated two continuous variables, biomarker and disease_cont, that are correlated through their common cause u. I binarized disease_cont at the 75th percentile. Then I created an outcome that is a function of continuous disease status only, not of the biomarker.

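n <- 1e5
# u is the common cause of biomarker and disease_cont
u <- rnorm(n)
biomarker <- u + rnorm(n)
disease_cont <- u + rnorm(n)
# binarize continuous disease status at its 75th percentile
disease_bin <- as.numeric(disease_cont > quantile(disease_cont, 0.75))
# the outcome depends on disease_cont only, not on biomarker
outcome <- disease_cont + rnorm(n)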

If we regress the outcome on the biomarker, continuous disease status, and an interaction between the two, we find that only disease status is related to the outcome and that there is no interaction:

(lm(outcome ~ biomarker*disease_cont) |> summary() )$coefficients |> round(2)
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -0.01          0   -1.74     0.08
## biomarker                  0.00          0    0.79     0.43
## disease_cont               1.00          0  386.96     0.00
## biomarker:disease_cont     0.00          0    1.08     0.28

But if we do the same with the binary disease variable we find entirely different results:

(lm(outcome ~ biomarker*disease_bin) |> summary() )$coefficients |> round(2)
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)              -0.51       0.00 -102.06        0
## biomarker                 0.30       0.00   82.29        0
## disease_bin               2.16       0.01  188.78        0
## biomarker:disease_bin    -0.15       0.01  -19.47        0

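Now, not only do we find strong evidence of an interaction, we also find strong evidence that the biomarker is related to the outcome, which we know it is not! The paper mentioned above explains why this comes about. In this simulation the mechanism is easy to see: the binary variable throws away most of the information in disease_cont, so within each category the biomarker, which is correlated with disease_cont through u, acts as a proxy for the variation that is left over. That leftover variation is larger in the bottom 75% than in the top 25%, so the biomarker's apparent slope differs between the two groups, and that difference shows up as an interaction.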
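One way to check this reading is to leave the outcome out entirely and look at how well the biomarker predicts disease_cont within each category of disease_bin. The sketch below reruns the simulation with a seed (the seed and the helper name slope_by_group are my additions, not part of the original analysis) and compares those within-group slopes to the coefficients from the binary model.

```r
# Minimal sketch; the seed and the object names below are assumptions
# added for illustration.
set.seed(1)
n <- 1e5
u <- rnorm(n)
biomarker <- u + rnorm(n)
disease_cont <- u + rnorm(n)
disease_bin <- as.numeric(disease_cont > quantile(disease_cont, 0.75))
outcome <- disease_cont + rnorm(n)

# Slope of disease_cont on biomarker within each disease_bin category
slope_by_group <- sapply(0:1, function(g) {
  keep <- disease_bin == g
  coef(lm(disease_cont[keep] ~ biomarker[keep]))[[2]]
})
names(slope_by_group) <- c("disease_bin = 0", "disease_bin = 1")
round(slope_by_group, 2)

# The same model fitted to the outcome and to disease_cont itself
fit_outcome <- coef(lm(outcome ~ biomarker * disease_bin))
fit_disease <- coef(lm(disease_cont ~ biomarker * disease_bin))
round(cbind(outcome = fit_outcome, disease_cont = fit_disease), 2)
```

If the explanation holds, the two columns should be nearly identical, and the within-group slopes should be close to the biomarker coefficient (group 0) and to the biomarker coefficient plus the interaction (group 1). In other words, the biomarker "effect" and the "interaction" are just the biomarker predicting what is left of disease_cont after dichotomizing.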

I have no idea whether this was what happened in the talk I saw but I do wonder how often this happens in the literature. Yet another reason you should (almost) never categorize a continuous variable.
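To get a feel for how much categorization costs, here is a rough sketch that cuts disease_cont into 2, 4 and 10 quantile groups and records the biomarker coefficient each time. The seed, the choice of group counts, and the names biomarker_coef and disease_cat are assumptions for illustration only.

```r
# Minimal sketch; seed, group counts and object names are assumptions.
set.seed(2)
n <- 1e5
u <- rnorm(n)
biomarker <- u + rnorm(n)
disease_cont <- u + rnorm(n)
outcome <- disease_cont + rnorm(n)

# Biomarker coefficient when disease_cont is replaced by k quantile groups
biomarker_coef <- sapply(c(2, 4, 10), function(k) {
  breaks <- quantile(disease_cont, probs = seq(0, 1, length.out = k + 1))
  disease_cat <- cut(disease_cont, breaks = breaks, include.lowest = TRUE)
  coef(lm(outcome ~ biomarker + disease_cat))[["biomarker"]]
})
names(biomarker_coef) <- paste(c(2, 4, 10), "groups")
round(biomarker_coef, 3)
```

The true coefficient is zero. In this setup the spurious biomarker coefficient should shrink as the number of groups grows, but with a sample this large it can remain clearly nonzero even with ten groups.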

Jeremy A. Labrecque
Assistant professor, Epidemiology and causal inference

My research is on how we know what we know.