Overview

This vignette explores the functional form sensitivity problem in difference-in-differences with binary outcomes, following Roth & Sant’Anna (2023).

The Core Issue

Suppose two groups have baseline outcome probabilities:

Control: P(Y = 1) = 0.30
Treated: P(Y = 1) = 0.25

Both groups experience the same additive increase in probability: +0.10.

On the probability scale: The change is 0.10 for both — parallel trends holds.

On the logit scale: - Control: logit(0.40) - logit(0.30) = 0.442 - Treated: logit(0.35) - logit(0.25) = 0.48

These are not equal! A researcher testing parallel trends on the log-odds scale would (correctly) reject — even though the underlying time trend is identical for both groups on the probability scale.

# Demonstrate scale sensitivity
p_ctrl_pre  <- 0.30; p_ctrl_post <- 0.40
p_treat_pre <- 0.25; p_treat_post <- 0.35

cat("=== Probability Scale ===\n")
#> === Probability Scale ===
cat("Control change:  ", round(p_ctrl_post  - p_ctrl_pre, 4),  "\n")
#> Control change:   0.1
cat("Treated change:  ", round(p_treat_post - p_treat_pre, 4), "\n")
#> Treated change:   0.1
cat("DiD (prob):      ", round((p_treat_post - p_treat_pre) - (p_ctrl_post - p_ctrl_pre), 4), "\n\n")
#> DiD (prob):       0

cat("=== Log-Odds (Logit) Scale ===\n")
#> === Log-Odds (Logit) Scale ===
cat("Control change:  ", round(qlogis(p_ctrl_post)  - qlogis(p_ctrl_pre), 4),  "\n")
#> Control change:   0.4418
cat("Treated change:  ", round(qlogis(p_treat_post) - qlogis(p_treat_pre), 4), "\n")
#> Treated change:   0.4796
cat("DiD (logit):     ", round((qlogis(p_treat_post) - qlogis(p_treat_pre)) -
                               (qlogis(p_ctrl_post) - qlogis(p_ctrl_pre)), 4), "\n\n")
#> DiD (logit):      0.0377

cat("=== Probit Scale ===\n")
#> === Probit Scale ===
cat("Control change:  ", round(qnorm(p_ctrl_post)  - qnorm(p_ctrl_pre), 4),  "\n")
#> Control change:   0.2711
cat("Treated change:  ", round(qnorm(p_treat_post) - qnorm(p_treat_pre), 4), "\n")
#> Treated change:   0.2892
cat("DiD (probit):    ", round((qnorm(p_treat_post) - qnorm(p_treat_pre)) -
                               (qnorm(p_ctrl_post) - qnorm(p_ctrl_pre)), 4), "\n")
#> DiD (probit):     0.0181

Key takeaway: The same underlying DGP yields different DiD estimates and different pre-trends test results depending on the scale chosen. There is no uniquely correct scale — the right scale is determined by the DGP.

When Does Scale Choice Matter?

Scale sensitivity is most severe when:

Baseline probabilities are far from 0.5 (where logit and probit curves are most nonlinear)
Groups have different baseline rates (asymmetric baseline)
Treatment effects are large (more curvature in the relevant region)

# Show severity across baseline probability values
baseline_probs <- seq(0.05, 0.45, by = 0.05)
delta_p        <- 0.10  # same additive change for both groups

severity_df <- do.call(rbind, lapply(baseline_probs, function(p0) {
  p1 <- p0 + delta_p
  # Parallel in prob => same change
  # Logit DiD if treated has different baseline (p0 - 0.05)
  p0_treat <- max(p0 - 0.05, 0.02)
  p1_treat <- p0_treat + delta_p

  logit_did <- (qlogis(p1_treat) - qlogis(p0_treat)) -
               (qlogis(p1)       - qlogis(p0))

  data.frame(
    baseline_ctrl  = p0,
    baseline_treat = p0_treat,
    logit_did      = logit_did
  )
}))

cat("Logit-scale DiD when true probability DiD = 0:\n")
#> Logit-scale DiD when true probability DiD = 0:
print(severity_df, digits = 3, row.names = FALSE)
#>  baseline_ctrl baseline_treat logit_did
#>           0.05           0.02   0.68955
#>           0.10           0.05   0.39891
#>           0.15           0.10   0.17494
#>           0.20           0.15   0.09699
#>           0.25           0.20   0.05942
#>           0.30           0.25   0.03774
#>           0.35           0.30   0.02346
#>           0.40           0.35   0.01290
#>           0.45           0.40   0.00412
cat("\nLarger deviations at low baseline probabilities.\n")
#> 
#> Larger deviations at low baseline probabilities.

Simulation: Linear vs. Logit DiD

We now simulate data where parallel trends holds on the logit scale (the correct scale for the DGP), and compare estimates from: 1. Linear DiD (standard CS2021) 2. Logit DiD (NonlinearDiD)

# DGP: parallel trends on logit scale
dat <- sim_binary_panel(
  n            = 1000,
  nperiods     = 8,
  prop_treated = 0.5,
  n_cohorts    = 3,
  true_att     = c(0.20, 0.35, 0.25),
  base_prob    = 0.20,   # low baseline: nonlinearity matters most
  unit_fe_sd   = 0.5,
  seed         = 42
)

cat("Baseline outcome rate (untreated, pre-period):",
    round(mean(dat$y[dat$D == 0 & dat$period == 1]), 3), "\n")
#> Baseline outcome rate (untreated, pre-period): 0.202
cat("True ATTs (avg):", round(mean(c(0.20, 0.35, 0.25)), 3), "\n\n")
#> True ATTs (avg): 0.267

# Logit DiD
res_logit <- nonlinear_attgt(
  dat, "y", "period", "id", "g",
  outcome_model = "logit",
  control_group = "nevertreated"
)

# Linear DiD
res_linear <- nonlinear_attgt(
  dat, "y", "period", "id", "g",
  outcome_model = "linear",
  control_group = "nevertreated"
)

# Aggregate
agg_logit  <- nonlinear_aggte(res_logit,  type = "dynamic")
agg_linear <- nonlinear_aggte(res_linear, type = "dynamic")

cat("=== Overall ATT ===\n")
#> === Overall ATT ===
cat("Linear DiD: ", round(agg_linear$overall_att, 4), "\n")
#> Linear DiD:  0.046
cat("Logit DiD:  ", round(agg_logit$overall_att,  4), "\n")
#> Logit DiD:   0.046
cat("True ATT:   ", round(mean(c(0.20, 0.35, 0.25)), 4), "\n")
#> True ATT:    0.2667

Pre-Trends Sensitivity

A critical practical insight: even if the true DGP has no pre-treatment differences (i.e., the identifying assumption holds on some scale), a pre-trends test on the wrong scale may falsely reject.

# Test on logit scale
pt_logit <- nonlinear_pretest(res_logit, plot = FALSE)
cat("Pre-trends test (logit scale):\n")
#> Pre-trends test (logit scale):
cat("  Joint p-value:", round(pt_logit$joint_pval, 4), "\n\n")
#>   Joint p-value: 0.2518

# Test on linear scale
pt_linear <- nonlinear_pretest(res_linear, plot = FALSE)
cat("Pre-trends test (linear scale):\n")
#> Pre-trends test (linear scale):
cat("  Joint p-value:", round(pt_linear$joint_pval, 4), "\n\n")
#>   Joint p-value: 0.2518

cat("Note: If true DGP is logit-scale parallel trends, the linear-scale\n")
#> Note: If true DGP is logit-scale parallel trends, the linear-scale
cat("pre-trends test may spuriously reject due to functional form.\n")
#> pre-trends test may spuriously reject due to functional form.

Recommendations

Based on Roth & Sant’Anna (2023) and the evidence above:

Think about your DGP first. If your outcome is binary with moderate baseline rates (15–85%), use a nonlinear model.
Report the scale of your parallel trends assumption. “We assume parallel trends in log-odds” is a substantively different claim from “parallel trends in probabilities.”
Use doubly-robust estimation (doubly_robust = TRUE) which is consistent under misspecification of either the outcome model or the propensity score model.
Consider the odds-ratio DiD for binary outcomes when you want an estimate that does not depend on the reference group or period.
Use nonlinear_bounds() to report the range of ATTs consistent with the data under minimal assumptions.

Functional Form Sensitivity in Nonlinear DiD

NonlinearDiD Package

2026-05-03