Do handheld device bans actually save lives?

A staggered difference-in-differences analysis estimating the causal effect of primary enforcement handheld device laws on U.S. traffic fatalities (2010–2022). Code on GitHub.

Motivation

I picked up Freakonomics recently and found myself stuck on one claim in particular: that the legalization of abortion through Roe v. Wade was the single largest driver of the dramatic drop in U.S. crime in the 1990s. The argument, from Donohue & Levitt (2001), rests on a compelling natural experiment. States that legalized abortion earlier, through pre-Roe state laws, saw crime fall earlier, with a lag roughly equal to the time between birth and peak criminal age. The book doesn't go deep on the methodology, but the underlying logic is pure difference-in-differences: exploit variation in the timing of a policy change to identify its causal effect.

That got me interested in trying something similar myself. I'd been working through Scott Cunningham's Causal Inference: The Mixtape, which covers staggered DiD in detail, and wanted to apply the methods to a real policy question. I landed on handheld device bans. The policy seems obviously effective on its face, has been adopted by states at different times over the past two decades, and has a surprisingly contested empirical literature.

Most U.S. states have banned handheld device use while driving, but the enforcement mechanism varies. A primary enforcement law lets officers stop drivers solely for device use; a secondary enforcement law only allows a citation after stopping for something else. Primary enforcement should deter more, but several studies (HLDI 2010; Bhargava & Pathania 2013) have found little to no effect on crash rates. Others (Abouk & Adams 2013) find significant reductions. The literature is mixed.

I use the staggered rollout of primary enforcement laws across states between 2011 and 2021 as a natural experiment, following the Mixtape's DiD chapters for the TWFE baseline and Goodman-Bacon decomposition, then turning to Callaway & Sant'Anna (2021) as the robust alternative for staggered adoption.

The methodological challenge

The standard approach, two-way fixed effects (TWFE), can be biased under staggered adoption when treatment effects are heterogeneous across cohorts or over time. Goodman-Bacon (2021) showed that the TWFE estimate is a weighted average of all possible 2×2 DiD comparisons, including "forbidden" ones that use already-treated units as controls. With 10 treatment cohorts spanning a decade, that's a real concern.

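To address this, I implement three layers of analysis: a TWFE baseline, a Goodman-Bacon decomposition of that baseline, and the Callaway & Sant'Anna estimator with an event study and pre-trends test.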

Data

The panel covers 51 jurisdictions (50 states plus DC) over 13 years (2010–2022), yielding 663 state-year observations.

Variable | Mean | Std. Dev. | Min | Max
Fatalities | 680 | 695 | 30 | 4,258
VMT (millions) | 61,800 | 62,300 | 4,594 | 340,600
Fatalities per 100M VMT | 1.21 | 0.32 | 0.51 | 2.27
Treated (post-adoption) | 0.14 | 0.35 | 0 | 1

Twenty states adopted primary enforcement during the study period across 10 cohorts (2011–2021). Five jurisdictions (NY, NJ, CA, CT, DC) adopted before 2010 and are treated throughout, leaving 26 never-treated states as the primary control group.

Adoption year | States | N
2011 | Delaware | 1
2012 | Nevada | 1
2013 | Hawaii, Maryland, West Virginia | 3
2014 | Illinois, Vermont | 2
2015 | New Hampshire | 1
2017 | Oregon, Washington | 2
2018 | Georgia, Rhode Island | 2
2019 | Maine, Minnesota, Tennessee | 3
2020 | Idaho, Indiana, Massachusetts | 3
2021 | Arizona, Virginia | 2
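To make the treatment coding concrete, here is a minimal sketch of how the adoption table could be turned into a treatment indicator and how the outcome is scaled. The adoption years and the always-treated list are copied from the text above. The function names are hypothetical, and switching treatment on in the adoption year itself (rather than the following year) is an assumption, not something the write-up pins down.

```python
# Adoption years transcribed from the adoption table.
ADOPTION_YEAR = {
    "Delaware": 2011, "Nevada": 2012,
    "Hawaii": 2013, "Maryland": 2013, "West Virginia": 2013,
    "Illinois": 2014, "Vermont": 2014,
    "New Hampshire": 2015,
    "Oregon": 2017, "Washington": 2017,
    "Georgia": 2018, "Rhode Island": 2018,
    "Maine": 2019, "Minnesota": 2019, "Tennessee": 2019,
    "Idaho": 2020, "Indiana": 2020, "Massachusetts": 2020,
    "Arizona": 2021, "Virginia": 2021,
}
# Jurisdictions that adopted before 2010 (treated in every panel year).
ALWAYS_TREATED = {"New York", "New Jersey", "California",
                  "Connecticut", "District of Columbia"}


def treated(state, year):
    """D_it: 1 from the adoption year onward (assumed convention), else 0."""
    if state in ALWAYS_TREATED:
        return 1
    g = ADOPTION_YEAR.get(state)  # None for never-treated states
    return int(g is not None and year >= g)


def fatality_rate(fatalities, vmt_millions):
    """Fatalities per 100 million vehicle miles traveled."""
    return fatalities / (vmt_millions / 100.0)


cohorts = sorted(set(ADOPTION_YEAR.values()))
print(len(ADOPTION_YEAR), "adopting states in", len(cohorts), "cohorts")
print(treated("Delaware", 2010), treated("Delaware", 2011))
```

Any state missing from both lookups is coded as never treated, which matches how the control group is described above.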

Results

TWFE baseline

The standard TWFE regression (state and year fixed effects, standard errors clustered by state) gives a small, insignificant point estimate:

Y_{it} = \alpha_i + \lambda_t + \beta \cdot D_{it} + \varepsilon_{it}
Model | Coef. | Std. Err. | t-stat | p-value | 95% CI | N
Basic TWFE | +0.019 | 0.035 | 0.54 | 0.590 | [−0.050, +0.088] | 663
TWFE (VMT-weighted) |  |  |  |  |  | 663
Callaway & Sant'Anna | +0.004 | 0.037 | 0.11 | 0.900 | [−0.067, +0.076] | 663

Outcome: fatalities per 100M VMT. State and year fixed effects. Standard errors clustered by state. The Callaway & Sant'Anna row is the preferred specification.
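For readers who want to see the mechanics of the TWFE specification, here is a minimal pure-Python sketch, not the code behind the table. It assumes a balanced panel with a single regressor, applies the two-way within transformation, and computes a cluster-robust standard error with no small-sample correction. The function name `twfe` and the toy panel are hypothetical.

```python
from collections import defaultdict


def twfe(rows):
    """rows: list of (unit, period, y, d) for a BALANCED panel.

    Returns (beta, se) for y = a_i + l_t + beta * d + e, with the
    standard error clustered by unit (no small-sample correction).
    """
    def demean(idx):
        # Two-way within transformation:
        # x - unit mean - period mean + grand mean.
        by_unit, by_time = defaultdict(list), defaultdict(list)
        for r in rows:
            by_unit[r[0]].append(r[idx])
            by_time[r[1]].append(r[idx])
        mu = {k: sum(v) / len(v) for k, v in by_unit.items()}
        mt = {k: sum(v) / len(v) for k, v in by_time.items()}
        grand = sum(r[idx] for r in rows) / len(rows)
        return [r[idx] - mu[r[0]] - mt[r[1]] + grand for r in rows]

    y, d = demean(2), demean(3)
    sxx = sum(x * x for x in d)
    beta = sum(x * v for x, v in zip(d, y)) / sxx

    # Sandwich variance: sum the scores within each cluster, then square.
    score = defaultdict(float)
    for r, x, v in zip(rows, d, y):
        score[r[0]] += x * (v - beta * x)
    se = sum(s * s for s in score.values()) ** 0.5 / sxx
    return beta, se


# Toy panel: two adopters, two never-treated units, constant effect of 0.5.
adopt = {"A": 2, "B": 4, "C": None, "D": None}
rows = []
for i, (u, g) in enumerate(adopt.items()):
    for t in range(6):
        d = int(g is not None and t >= g)
        rows.append((u, t, 1.0 * i + 0.1 * t + 0.5 * d, d))

beta, se = twfe(rows)
print(round(beta, 6))  # recovers the constant effect of 0.5
```

With a constant, homogeneous effect and no noise, TWFE recovers the true effect even under staggered timing. The trouble discussed in this write-up only appears once effects vary across cohorts or over time.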

Bottom line: Primary enforcement handheld device bans have no statistically significant effect on traffic fatality rates. Both estimators produce point estimates near zero with wide confidence intervals that include zero.
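As a quick sanity check on the reported inference, the t-statistic, p-value, and confidence interval for the basic TWFE row can be approximately reproduced from the rounded coefficient and standard error. This sketch assumes a normal approximation with a 1.96 critical value. The original software may use a t distribution with cluster-based degrees of freedom, so small differences are expected. The helper name `summarize` is hypothetical.

```python
import math


def summarize(coef, se, z_crit=1.96):
    """Return (t-stat, two-sided p-value, CI) under a normal approximation."""
    t = coef / se
    cdf = 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0)))  # standard normal CDF
    p = 2.0 * (1.0 - cdf)
    return t, p, (coef - z_crit * se, coef + z_crit * se)


# Rounded values from the Basic TWFE row.
t, p, (lo, hi) = summarize(0.019, 0.035)
print(f"t = {t:.2f}, p = {p:.2f}, CI = [{lo:+.3f}, {hi:+.3f}]")
```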

Figure 1: TWFE coefficient estimates across four model specifications (95% CI, clustered SEs). All estimates are small and statistically insignificant.

Goodman-Bacon decomposition

Before trusting the TWFE estimate, I decompose it into its constituent 2×2 comparisons. With 10 treatment cohorts, the TWFE estimate aggregates 190 separate DiD pairs, and the composition matters:

Comparison type Weight Avg. estimate N pairs
Treated vs. Never-Treated (cleanest) 21% +0.065 10
Earlier vs. Later Treated (valid) 18% +0.056 45
Later vs. Earlier Treated (problematic) 61% −0.021 135

61% of the TWFE estimate's weight comes from "Later vs. Earlier" comparisons that use already-treated states as controls. If treatment effects evolve over time (common in policy settings), this can bias the overall estimate. The problematic comparisons here point negative while the valid comparisons point positive, which is the kind of sign-reversal that can come from heterogeneous treatment effects.
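A stylized example shows why the "Later vs. Earlier" comparisons can mislead when effects evolve over time. This is a toy illustration with made-up numbers, not the study's data: two cohorts, no noise, and a treatment effect that grows by one unit per period after adoption. All names and values are hypothetical.

```python
def outcome(t, adopt):
    """Untreated outcome is 0; the effect grows by 1 per period after adoption."""
    return float(t - adopt + 1) if t >= adopt else 0.0


def mean(xs):
    return sum(xs) / len(xs)


def change(adopt, pre, post):
    """Post-period mean minus pre-period mean for one cohort."""
    return (mean([outcome(t, adopt) for t in post])
            - mean([outcome(t, adopt) for t in pre]))


def did_2x2(adopt_treated, adopt_control, pre, post):
    """Simple 2x2 difference-in-differences on cohort means."""
    return change(adopt_treated, pre, post) - change(adopt_control, pre, post)


EARLY, LATE = 2, 4

# Earlier vs. Later: early adopters are compared with not-yet-treated late adopters.
valid = did_2x2(EARLY, LATE, pre=[0, 1], post=[2, 3])

# Later vs. Earlier: late adopters are compared with ALREADY-treated early adopters,
# whose outcomes are still rising because their effect keeps growing.
forbidden = did_2x2(LATE, EARLY, pre=[2, 3], post=[4, 5])

# True average effect on late adopters over their post-periods.
true_att = mean([outcome(t, LATE) for t in [4, 5]])

print(valid, forbidden, true_att)  # 1.5 -0.5 1.5
```

The valid comparison recovers the true effect, while the forbidden one flips its sign because the control group's own growing treatment effect is subtracted off as if it were a common trend.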

Figure 2: Each point is one 2×2 comparison; size scales with weight. Red dots are the problematic "Later vs. Earlier" pairs that dominate the TWFE estimate.
Figure 3: Weight distribution (left) and contribution to TWFE (right) by comparison type. The problematic category carries the most weight.

Callaway & Sant'Anna estimator

As specified here, the CS estimator uses only never-treated states as controls. For each cohort g and time t it estimates:

ATT(g, t) = E[Y_t - Y_{g-1} \mid G = g] - E[Y_t - Y_{g-1} \mid C]

where C is the never-treated group. These cohort-time cells are then aggregated into an overall ATT, an event study, and cohort-specific effects. The result confirms the null finding: ATT = +0.004 (SE = 0.037, p = 0.90).
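The ATT(g, t) building block is simple enough to sketch directly. The version below is a minimal illustration under strong assumptions: unconditional parallel trends (no covariates), a balanced panel, and no inference or aggregation. It is not the estimator implementation used for the results. The function name `att_gt` and the toy data are hypothetical.

```python
def att_gt(y, cohort, g, t):
    """ATT(g, t) with never-treated controls.

    y:      dict mapping (unit, period) -> outcome
    cohort: dict mapping unit -> adoption period, or None if never treated
    """
    def mean_change(units):
        # Average change from the last pre-treatment period (g - 1) to t.
        return sum(y[(u, t)] - y[(u, g - 1)] for u in units) / len(units)

    treated = [u for u, c in cohort.items() if c == g]
    never = [u for u, c in cohort.items() if c is None]
    return mean_change(treated) - mean_change(never)


# Toy panel: common trend of 0.1 per period, effect of 0.3 after adoption.
cohort = {"A": 2, "B": 2, "C": None, "D": None}
y = {}
for i, (u, g) in enumerate(cohort.items()):
    for t in range(5):
        effect = 0.3 if (g is not None and t >= g) else 0.0
        y[(u, t)] = 1.0 + 0.2 * i + 0.1 * t + effect

post = att_gt(y, cohort, g=2, t=3)     # post-treatment cell
placebo = att_gt(y, cohort, g=2, t=0)  # pre-treatment cell
print(round(post, 6), round(placebo, 6))
```

Pre-treatment cells like the placebo above are what an event study uses to check for diverging trends before adoption.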

Figure 4: Callaway-Sant'Anna cohort-specific ATTs. No adoption cohort shows a statistically significant effect; all confidence intervals span zero.

Event study & pre-trends test

An event study plots treatment effects at each period relative to adoption. This lets us check whether pre-treatment trends are parallel and whether any effect builds up, fades, or just doesn't show up at all.

Figure 5: Event study from TWFE (left) and Callaway-Sant'Anna (right). Blue = pre-treatment periods, red = post-treatment. The dashed line marks adoption. Both estimators show coefficients hugging zero throughout.

A joint Wald test of all pre-treatment coefficients fails to reject the null of parallel trends (χ²(4) = 7.10, p = 0.13). Post-treatment effects are small and insignificant throughout: no immediate effect, no delayed effect, no fade-out.
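The reported p-value can be checked by hand from the test statistic. For a chi-square distribution with even degrees of freedom the upper-tail probability has a closed form, so no statistics library is needed. The function name below is hypothetical, and the sketch only covers the even-df case.

```python
import math


def chi2_sf_even(x, df):
    """Upper-tail probability P(X > x) for a chi-square with even df.

    For df = 2k: exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))


p = chi2_sf_even(7.10, 4)
print(round(p, 2))  # 0.13
```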

Why the null result?

The result is consistent with prior work (HLDI 2010; Bhargava & Pathania 2013), even if it's not the answer you'd expect. A few things probably explain it:

Limitations. This study uses total traffic fatalities rather than distraction-specific crashes, a broad outcome that would dilute any real effect. It also cannot separate law passage from actual enforcement intensity, and effects may exist for subgroups (e.g., young drivers, urban areas) but get lost in state-level aggregation.

What I found

Notes: The preferred CS specification uses never-treated states as controls. Treatment timing compiled from IIHS and GHSA records. VMT from FHWA VM-2; fatalities from NHTSA FARS. Standard errors are clustered by state throughout.

References:
Bhargava & Pathania (2013). AEJ: Economic Policy, 5(3), 92–125.
Callaway & Sant'Anna (2021). Journal of Econometrics, 225(2), 200–230.
Goodman-Bacon (2021). Journal of Econometrics, 225(2), 254–277.
Highway Loss Data Institute (2010). HLDI Bulletin, 27(11).
Strayer & Johnston (2001). Psychological Science, 12(6), 462–466.