Do handheld device bans actually save lives?

A staggered difference-in-differences analysis estimating the causal effect of primary enforcement handheld device laws on U.S. traffic fatalities (2010–2022). Code on GitHub.

Motivation

I picked up Freakonomics recently and found myself stuck on one claim in particular: that the legalization of abortion through Roe v. Wade was the single largest driver of the dramatic drop in U.S. crime in the 1990s. The argument, from Donohue & Levitt (2001), rests on a compelling natural experiment. States that legalized abortion earlier — through pre-Roe state laws — saw crime fall earlier, by roughly the same lag as the time between birth and peak criminal age. The book doesn't go deep on the methodology, but the underlying logic is pure difference-in-differences: exploit variation in the timing of a policy change to identify its causal effect.

That got me interested in trying something similar myself. I'd been working through Scott Cunningham's Causal Inference: The Mixtape, which covers staggered DiD in detail, and wanted to apply the methods to a real policy question. I landed on handheld device bans. The policy seems obviously effective on its face, has been adopted by states at different times over the past two decades, and has a surprisingly contested empirical literature.

Most U.S. states have banned handheld device use while driving, but the enforcement mechanism varies. A primary enforcement law lets officers stop drivers solely for device use; a secondary enforcement law only allows a citation after stopping for something else. Primary enforcement should deter more, but several studies (HLDI 2010; Bhargava & Pathania 2013) have found little to no effect on crash rates. Others (Abouk & Adams 2013) find significant reductions. The literature is mixed.

I use the staggered rollout of primary enforcement laws across states from 2011–2021 as a natural experiment, following the Mixtape's DiD chapters for the TWFE baseline and Goodman-Bacon decomposition, then turning to Callaway & Sant'Anna (2021) as the robust alternative for staggered adoption.

The methodological challenge

The standard approach, two-way fixed effects (TWFE), can be biased under staggered adoption when treatment effects vary across cohorts or over time. Goodman-Bacon (2021) showed that the TWFE estimate is a weighted average of all possible 2×2 DiD comparisons, including "forbidden" ones that use already-treated units as controls. With 10 treatment cohorts spanning a decade, that is a real concern here.

To address this, I implement three layers of analysis:

Baseline TWFE for reference and comparability with the existing literature
Goodman-Bacon decomposition to diagnose whether TWFE is likely biased here
Callaway & Sant'Anna (2021), which only compares each cohort against never-treated states and avoids the forbidden comparisons problem

Data

The panel covers 51 states (including DC) over 13 years (2010–2022), yielding 663 state-year observations.

Outcome: Fatalities per 100 million VMT, which controls for differences in driving exposure across states and time
Traffic fatalities: FARS (NHTSA), annual state-level counts
VMT: FHWA Highway Statistics (VM-2 table), downloaded via data.gov
Law dates: IIHS and GHSA, effective dates of primary enforcement bans
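
The outcome is a simple exposure-adjusted rate. A minimal sketch of the calculation, assuming VMT is recorded in millions of miles; the function name is hypothetical and the inputs are the sample means from the summary statistics, used only for illustration:

```python
def fatality_rate_per_100m_vmt(fatalities: float, vmt_millions: float) -> float:
    """Fatalities per 100 million vehicle miles traveled (VMT)."""
    if vmt_millions <= 0:
        raise ValueError("VMT must be positive")
    # 100 million miles equals 100 units of "millions of miles".
    return fatalities / (vmt_millions / 100.0)


# Panel size stated in the text: 51 states (including DC) over 13 years.
n_obs = 51 * 13

# Illustrative inputs only: sample means of fatalities and VMT (millions).
rate = fatality_rate_per_100m_vmt(680, 61_800)
print(n_obs, round(rate, 2))  # 663 1.1
```

The ratio of the two sample means (about 1.10) need not equal the mean of the state-year rates (1.21), because the latter averages rates rather than totals.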

Variable	Mean	Std. Dev.	Min	Max
Fatalities	680	695	30	4,258
VMT (millions)	61,800	62,300	4,594	340,600
Fatalities per 100M VMT	1.21	0.32	0.51	2.27
Treated (post-adoption)	0.14	0.35	0	1

Twenty states adopted primary enforcement during the study period across 10 cohorts (2011–2021). Five states (NY, NJ, CA, CT, DC) adopted before 2010 and are treated throughout, leaving 26 never-treated states as the primary control group.

Adoption year	States	N
2011	Delaware	1
2012	Nevada	1
2013	Hawaii, Maryland, West Virginia	3
2014	Illinois, Vermont	2
2015	New Hampshire	1
2017	Oregon, Washington	2
2018	Georgia, Rhode Island	2
2019	Maine, Minnesota, Tennessee	3
2020	Idaho, Indiana, Massachusetts	3
2021	Arizona, Virginia	2
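
The cohort table maps directly onto the treatment indicator used in the regressions. The sketch below is one way to encode it, assuming the adoption year itself counts as treated; the names `ADOPTION_YEAR` and `treated` are hypothetical, not taken from the project code:

```python
# Adoption years transcribed from the cohort table above.
ADOPTION_YEAR = {
    "Delaware": 2011, "Nevada": 2012,
    "Hawaii": 2013, "Maryland": 2013, "West Virginia": 2013,
    "Illinois": 2014, "Vermont": 2014,
    "New Hampshire": 2015,
    "Oregon": 2017, "Washington": 2017,
    "Georgia": 2018, "Rhode Island": 2018,
    "Maine": 2019, "Minnesota": 2019, "Tennessee": 2019,
    "Idaho": 2020, "Indiana": 2020, "Massachusetts": 2020,
    "Arizona": 2021, "Virginia": 2021,
}


def treated(state: str, year: int) -> int:
    """D_it: 1 from the adoption year onward, 0 otherwise.

    Assumption: the adoption year counts as treated. States missing from
    ADOPTION_YEAR are coded 0 in every year, which suits never-treated
    states but not the five pre-2010 adopters (they need separate handling).
    """
    g = ADOPTION_YEAR.get(state)
    return int(g is not None and year >= g)


cohorts = sorted(set(ADOPTION_YEAR.values()))
print(len(ADOPTION_YEAR), len(cohorts))  # 20 10
```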

Results

TWFE baseline

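The standard TWFE regression includes state and year fixed effects, with standard errors clustered by state:

Y_it = α_i + λ_t + β · D_it + ε_it

Here Y_it is fatalities per 100M VMT in state i and year t, α_i and λ_t are the state and year fixed effects, and D_it equals 1 in state-years with a primary enforcement ban in effect. The estimate of β is small and statistically insignificant: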

Model	Coef.	Std. Err.	t-stat	p-value	95% CI	N
Basic TWFE	+0.019	0.035	0.54	0.590	[−0.050, +0.088]	663
TWFE (VMT-weighted)	—	—	—	—	—	663
Callaway & Sant'Anna	+0.004	0.037	0.11	0.900	[−0.067, +0.076]	663

Outcome: fatalities per 100M VMT. State and year fixed effects. Standard errors clustered by state. The Callaway & Sant'Anna row is the preferred specification.
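
For readers who want to see what the TWFE coefficient is mechanically, here is a minimal sketch on toy data (not the study's panel). It assumes a balanced panel, where regressing on state and year fixed effects is equivalent to two-way demeaning; the function name `twfe_beta` is hypothetical, and clustered standard errors are left out:

```python
from statistics import mean


def twfe_beta(panel):
    """TWFE slope on a balanced panel via two-way demeaning.

    panel: dict mapping (unit, period) -> (y, d).
    """
    units = sorted({u for u, _ in panel})
    periods = sorted({t for _, t in panel})
    assert len(panel) == len(units) * len(periods), "panel must be balanced"

    def demean(k):  # k = 0 selects y, k = 1 selects d
        unit_mean = {u: mean(panel[u, t][k] for t in periods) for u in units}
        time_mean = {t: mean(panel[u, t][k] for u in units) for t in periods}
        grand = mean(v[k] for v in panel.values())
        return {(u, t): panel[u, t][k] - unit_mean[u] - time_mean[t] + grand
                for u in units for t in periods}

    y_dm, d_dm = demean(0), demean(1)
    num = sum(y_dm[key] * d_dm[key] for key in panel)
    den = sum(d_dm[key] ** 2 for key in panel)
    return num / den


# Toy data: 3 units, 4 periods, a constant true effect of -0.10, no noise.
adopt = {"A": 2, "B": 3, "C": None}  # C is never treated
unit_fe = {"A": 1.0, "B": 1.3, "C": 0.9}
time_fe = [0.00, 0.05, -0.02, 0.03]
panel = {}
for u in adopt:
    for t in range(4):
        d = int(adopt[u] is not None and t >= adopt[u])
        panel[u, t] = (unit_fe[u] + time_fe[t] - 0.10 * d, d)

print(round(twfe_beta(panel), 6))  # -0.1
```

With a constant effect, TWFE recovers it exactly. The concerns discussed in the next section arise when the effect varies across cohorts or over time.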

Bottom line: I find no statistically significant effect of primary enforcement handheld device bans on traffic fatality rates. Both estimators produce point estimates near zero, with 95% confidence intervals that comfortably include zero.
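
The reported interval for the basic TWFE row can be checked from its coefficient and standard error. This sketch assumes a normal approximation, which may differ slightly from the small-sample or bootstrap inference behind the table; `normal_ci` is a hypothetical helper:

```python
from statistics import NormalDist


def normal_ci(coef: float, se: float, level: float = 0.95):
    """Two-sided confidence interval, z-statistic, and p-value (normal approx.)."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    z_stat = coef / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return coef - z_crit * se, coef + z_crit * se, z_stat, p_value


lo, hi, z, p = normal_ci(0.019, 0.035)  # Basic TWFE row
print(f"[{lo:+.3f}, {hi:+.3f}]  z = {z:.2f}  p = {p:.2f}")
# [-0.050, +0.088]  z = 0.54  p = 0.59
```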

Figure 1: TWFE coefficient estimates across four model specifications (95% CI, clustered SEs). All estimates are small and statistically insignificant.

Goodman-Bacon decomposition

Before trusting the TWFE estimate, I decompose it into its constituent 2×2 comparisons. With 10 treatment cohorts, the TWFE estimate aggregates 190 separate DiD pairs, and the composition matters:

Comparison type	Weight	Avg. estimate	N pairs
Treated vs. Never-Treated (cleanest)	21%	+0.065	10
Earlier vs. Later Treated (valid)	18%	+0.056	45
Later vs. Earlier Treated (problematic)	61%	−0.021	135

61% of the TWFE estimate's weight comes from "Later vs. Earlier" comparisons that use already-treated states as controls. If treatment effects evolve over time (common in policy settings), this can bias the overall estimate. The problematic comparisons here point negative while the valid comparisons point positive, which is the kind of sign-reversal that can come from heterogeneous treatment effects.
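
To see how an already-treated control group can flip the sign of a 2×2 estimate, consider a toy example (purely illustrative, not the study's data) in which the treatment effect grows by one unit per period after adoption. The helper names are hypothetical:

```python
from statistics import mean


def did_2x2(treated, control, pre, post):
    """Classic 2x2 DiD: change in the treated group minus change in the control.

    treated, control: lists of outcomes indexed by period.
    pre, post: lists of period indices.
    """
    d_treated = mean(treated[t] for t in post) - mean(treated[t] for t in pre)
    d_control = mean(control[t] for t in post) - mean(control[t] for t in pre)
    return d_treated - d_control


def outcomes(adopt_period, n_periods=6):
    """Toy path: zero before adoption, then an effect of 1, 2, 3, ..."""
    return [max(0, t - adopt_period + 1) if adopt_period is not None else 0
            for t in range(n_periods)]


early, late, never = outcomes(2), outcomes(4), outcomes(None)
# early = [0, 0, 1, 2, 3, 4]; late = [0, 0, 0, 0, 1, 2]; never = [0] * 6

# Clean comparison: late adopters vs. never-treated units.
clean = did_2x2(late, never, pre=[2, 3], post=[4, 5])
# "Forbidden" comparison: late adopters vs. already-treated early adopters.
forbidden = did_2x2(late, early, pre=[2, 3], post=[4, 5])
print(clean, forbidden)  # 1.5 -0.5
```

The late cohort's true average effect in periods 4 and 5 is 1.5, which the clean comparison recovers. The forbidden comparison returns -0.5 because the early cohort's still-growing effect is differenced out as if it were a common trend.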

Figure 2: Goodman-Bacon decomposition scatter. Each point is one 2×2 comparison; size scales with weight. Red dots are the problematic "Later vs. Earlier" pairs that dominate the TWFE estimate.

Figure 3: Bacon weight distribution. Weight distribution (left) and contribution to TWFE (right) by comparison type. The problematic category carries the most weight.

Callaway & Sant'Anna estimator

The CS estimator only uses never-treated states as controls. For each cohort g and time t it estimates:

ATT(g, t) = E[Y_t − Y_{g−1} | G = g] − E[Y_t − Y_{g−1} | C]

where C is the never-treated group. These cohort-time cells are then aggregated into an overall ATT, an event-study, and cohort-specific effects. The result confirms the null finding: ATT = +0.004 (SE = 0.037, p = 0.90).
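
A single ATT(g, t) cell is just a difference of mean changes. The sketch below shows the unconditional version (no covariates) on made-up numbers; it is not the full Callaway & Sant'Anna procedure, which also covers covariate adjustment, aggregation, and inference. The function name and data are hypothetical:

```python
from statistics import mean


def att_gt(y, cohort, g, t):
    """ATT(g, t) with never-treated controls and base period g - 1.

    y: dict mapping (unit, period) -> outcome.
    cohort: dict mapping unit -> adoption period, or None if never treated.
    """
    def mean_change(units):
        return mean(y[u, t] - y[u, g - 1] for u in units)

    treated_units = [u for u, c in cohort.items() if c == g]
    never_treated = [u for u, c in cohort.items() if c is None]
    return mean_change(treated_units) - mean_change(never_treated)


# Made-up outcomes for two 2013 adopters and two never-treated units.
cohort = {"A": 2013, "B": 2013, "C": None, "D": None}
y = {
    ("A", 2012): 1.20, ("A", 2014): 1.10,  # change of -0.10
    ("B", 2012): 1.00, ("B", 2014): 0.96,  # change of -0.04
    ("C", 2012): 1.30, ("C", 2014): 1.28,  # change of -0.02
    ("D", 2012): 0.90, ("D", 2014): 0.86,  # change of -0.04
}
print(round(att_gt(y, cohort, g=2013, t=2014), 2))  # -0.04
```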

Figure 4: Callaway-Sant'Anna cohort-specific ATTs. No adoption cohort shows a statistically significant effect; all confidence intervals span zero.

Event study & pre-trends test

An event study plots treatment effects at each period relative to adoption. This lets us check whether pre-treatment trends are parallel and whether any effect builds up, fades, or just doesn't show up at all.

Figure 5: Event study from TWFE (left) and Callaway-Sant'Anna (right). Blue = pre-treatment periods, red = post-treatment. The dashed line marks adoption. Both estimators show coefficients hugging zero throughout.

A joint Wald test of all pre-treatment coefficients fails to reject the null of parallel trends (χ²(4) = 7.10, p = 0.13). Post-treatment effects are small and insignificant throughout: no immediate effect, no delayed effect, no fade-out.
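
The p-value of the pre-trends test follows directly from the reported statistic. A small check using only the standard library, relying on the closed-form chi-squared tail probability that holds for even degrees of freedom (the helper name is hypothetical):

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) of a chi-squared variable, even df only.

    Uses P(X > x) = exp(-x/2) * sum over k < df/2 of (x/2)^k / k!.
    """
    if df <= 0 or df % 2:
        raise ValueError("closed form requires an even, positive df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))


p = chi2_sf_even_df(7.10, 4)
print(round(p, 2))  # 0.13
```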

Why the null result?

The result is consistent with prior work (HLDI 2010; Bhargava & Pathania 2013), even if it's not the answer you'd expect. A few things probably explain it:

Outcome dilution. Distracted driving accounts for roughly 8–10% of fatal crashes (NHTSA, 2023), and phone use is only a subset of that. Even if every one of those crashes involved a phone, a 20% reduction in phone-related crashes would cut total fatalities by only about 1.6–2.0%, which is probably too small to detect with 51 states.
Low compliance and enforcement. Proving phone use requires direct observation; citation rates are low; drivers may simply lower the phone rather than stop using it.
Substitution to hands-free. Most bans permit hands-free use, but cognitive distraction from conversation may be just as dangerous as manual distraction (Strayer & Johnston, 2001).
Underreporting. FARS relies on officers correctly identifying distraction as a crash factor, which is hard to determine, especially when the driver is deceased.
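
The dilution argument can be put on the same scale as the estimates. This back-of-the-envelope sketch takes the 8–10% distraction share, a hypothetical 20% reduction, the sample mean rate of 1.21, and the Callaway-Sant'Anna standard error of 0.037 from the text; treating every distraction-related crash as phone-related makes it an upper bound:

```python
def implied_effect(share: float, reduction: float = 0.20,
                   mean_rate: float = 1.21):
    """Implied drop in total fatalities, as a fraction and per 100M VMT.

    share: fraction of fatal crashes attributed to distraction (upper bound
    on the phone-related share). reduction: hypothetical cut in those crashes.
    """
    pct_drop = share * reduction
    return pct_drop, pct_drop * mean_rate


CS_SE = 0.037  # standard error of the Callaway & Sant'Anna ATT

for share in (0.08, 0.10):
    pct_drop, rate_drop = implied_effect(share)
    print(f"{pct_drop:.1%} of fatalities = {rate_drop:.3f} per 100M VMT "
          f"= {rate_drop / CS_SE:.2f} standard errors")
```

Under these assumptions the implied effect is well under one standard error of the preferred estimate, so a true effect of that size would be hard to distinguish from zero in this design.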

Limitations. This study uses total traffic fatalities rather than distraction-specific crashes, which is a broad outcome that dilutes any real effect. It also cannot separate law passage from actual enforcement intensity, and effects may exist for subgroups (e.g., young drivers, urban areas) but get lost in state-level aggregation.

What I found

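No detectable effect on fatalities. Both TWFE (+0.019, p = 0.59) and Callaway-Sant'Anna (+0.004, p = 0.90) yield null results. Together, the 95% CIs rule out reductions larger than about 0.07 and increases larger than about 0.09 fatalities per 100M VMT (roughly 6% and 7% of the sample mean of 1.21).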
TWFE has a methodological problem. 61% of its weight comes from forbidden comparisons, a red flag in other contexts, though here it doesn't flip the conclusion.
Pre-trends look parallel. Pre-treatment event-study coefficients are jointly insignificant (p = 0.13), which is consistent with the parallel trends assumption behind the DiD design, though it cannot prove it.
No detectable cohort or event-time effects. No individual adoption cohort and no post-treatment event time shows a statistically significant effect.

Notes: The preferred CS specification uses never-treated states as controls. Treatment timing compiled from IIHS and GHSA records. VMT from FHWA VM-2; fatalities from NHTSA FARS. Standard errors are clustered by state throughout.

References:
Bhargava & Pathania (2013). AEJ: Economic Policy, 5(3), 92–125.
Callaway & Sant'Anna (2021). Journal of Econometrics, 225(2), 200–230.
Goodman-Bacon (2021). Journal of Econometrics, 225(2), 254–277.
Highway Loss Data Institute (2010). HLDI Bulletin, 27(11).
Strayer & Johnston (2001). Psychological Science, 12(6), 462–466.