A staggered difference-in-differences analysis estimating the causal effect of primary enforcement handheld device laws on U.S. traffic fatalities (2010–2022). Code on GitHub.
I picked up Freakonomics recently and found myself stuck on one claim in particular: that the legalization of abortion through Roe v. Wade was the single largest driver of the dramatic drop in U.S. crime in the 1990s. The argument, from Donohue & Levitt (2001), rests on a compelling natural experiment. States that legalized abortion earlier — through pre-Roe state laws — saw crime fall earlier, by roughly the same lag as the time between birth and peak criminal age. The book doesn't go deep on the methodology, but the underlying logic is pure difference-in-differences: exploit variation in the timing of a policy change to identify its causal effect.
That got me interested in trying something similar myself. I'd been working through Scott Cunningham's Causal Inference: The Mixtape, which covers staggered DiD in detail, and wanted to apply the methods to a real policy question. I landed on handheld device bans. The policy seems obviously effective on its face, has been adopted by states at different times over the past two decades, and has a surprisingly contested empirical literature.
Most U.S. states have banned handheld device use while driving, but the enforcement mechanism varies. A primary enforcement law lets officers stop drivers solely for device use; a secondary enforcement law only allows a citation after stopping for something else. Primary enforcement should deter more, but several studies (HLDI 2010; Bhargava & Pathania 2013) have found little to no effect on crash rates. Others (Abouk & Adams 2013) find significant reductions. The literature is mixed.
I use the staggered rollout of primary enforcement laws across states from 2011 to 2021 as a natural experiment. The TWFE baseline and the Goodman-Bacon decomposition follow the Mixtape's DiD chapters; Callaway & Sant'Anna (2021) then serves as the robust alternative for staggered adoption.
The standard approach, two-way fixed effects (TWFE), is biased when treatment effects are heterogeneous across cohorts or time. Goodman-Bacon (2021) showed that TWFE is a weighted average of all possible 2×2 DiD comparisons, including "forbidden" ones that use already-treated units as controls. With 10 treatment cohorts spanning a decade, that's a real concern.
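To address this, I implement three layers of analysis: a TWFE baseline, a Goodman-Bacon decomposition of that baseline, and the Callaway & Sant'Anna (2021) estimator.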
The panel covers 51 jurisdictions (the 50 states plus DC) over 13 years (2010–2022), yielding 51 × 13 = 663 state-year observations.
| Variable | Mean | Std. Dev. | Min | Max |
|---|---|---|---|---|
| Fatalities | 680 | 695 | 30 | 4,258 |
| VMT (millions) | 61,800 | 62,300 | 4,594 | 340,600 |
| Fatalities per 100M VMT | 1.21 | 0.32 | 0.51 | 2.27 |
| Treated (post-adoption) | 0.14 | 0.35 | 0 | 1 |
Twenty states adopted primary enforcement during the study period, in 10 cohorts (2011–2021). Five jurisdictions (NY, NJ, CA, CT, DC) adopted before 2010 and are treated throughout, leaving 26 never-treated states (51 − 20 − 5) as the primary control group.
| Adoption year | States | N |
|---|---|---|
| 2011 | Delaware | 1 |
| 2012 | Nevada | 1 |
| 2013 | Hawaii, Maryland, West Virginia | 3 |
| 2014 | Illinois, Vermont | 2 |
| 2015 | New Hampshire | 1 |
| 2017 | Oregon, Washington | 2 |
| 2018 | Georgia, Rhode Island | 2 |
| 2019 | Maine, Minnesota, Tennessee | 3 |
| 2020 | Idaho, Indiana, Massachusetts | 3 |
| 2021 | Arizona, Virginia | 2 |
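To make the cohort structure concrete, here is a minimal sketch of how the adoption table could be turned into a treatment indicator. It is illustrative only and not the code from the repository: the `treated` helper is hypothetical, and coding the adoption year itself as treated is an assumption (laws that take effect mid-year could reasonably be coded differently).

```python
# Adoption years copied from the table above. States missing from this mapping
# are either never-treated or adopted before 2010 (always treated).
ADOPTION = {
    2011: ["Delaware"],
    2012: ["Nevada"],
    2013: ["Hawaii", "Maryland", "West Virginia"],
    2014: ["Illinois", "Vermont"],
    2015: ["New Hampshire"],
    2017: ["Oregon", "Washington"],
    2018: ["Georgia", "Rhode Island"],
    2019: ["Maine", "Minnesota", "Tennessee"],
    2020: ["Idaho", "Indiana", "Massachusetts"],
    2021: ["Arizona", "Virginia"],
}
N_JURISDICTIONS = 51   # 50 states plus DC
N_ALWAYS_TREATED = 5   # NY, NJ, CA, CT, DC (adopted before 2010)

# Map each adopting state to its cohort, i.e. its first treated year.
cohort_of = {state: year for year, states in ADOPTION.items() for state in states}


def treated(state: str, year: int) -> int:
    """Hypothetical helper: 1 if `state` has adopted by `year`, else 0.

    Assumption: the adoption year itself counts as treated.
    """
    g = cohort_of.get(state)
    return int(g is not None and year >= g)


n_adopters = len(cohort_of)
n_cohorts = len(ADOPTION)
n_never_treated = N_JURISDICTIONS - n_adopters - N_ALWAYS_TREATED
```

The counts derived here (20 adopters, 10 cohorts, 26 never-treated states) match the figures quoted in the text.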
The standard TWFE regression (state and year fixed effects, standard errors clustered by state) gives a small, insignificant point estimate:
| Model | Coef. | Std. Err. | t-stat | p-value | 95% CI | N |
|---|---|---|---|---|---|---|
| Basic TWFE | +0.019 | 0.035 | 0.54 | 0.590 | [−0.050, +0.088] | 663 |
| TWFE (VMT-weighted) | — | — | — | — | — | 663 |
| Callaway & Sant'Anna | +0.004 | 0.037 | 0.11 | 0.900 | [−0.067, +0.076] | 663 |
Outcome: fatalities per 100M VMT. State and year fixed effects. Standard errors clustered by state. The Callaway & Sant'Anna row is the preferred specification.
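For readers who want to see what the TWFE point estimate is mechanically, here is a minimal sketch using two-way demeaning on a balanced panel. It is not the repository's code: the `twfe` function and the toy panel are hypothetical, the true effect of −0.1 is made up, and the sketch computes only the coefficient (no clustered standard errors).

```python
from statistics import mean


def twfe(panel):
    """TWFE coefficient on a binary treatment via two-way demeaning.

    panel: {(unit, year): (outcome, treated)} for a balanced panel.
    """
    units = sorted({u for u, _ in panel})
    years = sorted({t for _, t in panel})

    def demean(idx):
        # x_it - (unit mean) - (year mean) + (grand mean)
        x = {k: v[idx] for k, v in panel.items()}
        u_mean = {u: mean(x[u, t] for t in years) for u in units}
        t_mean = {t: mean(x[u, t] for u in units) for t in years}
        g_mean = mean(x.values())
        return {k: x[k] - u_mean[k[0]] - t_mean[k[1]] + g_mean for k in x}

    y_dm, d_dm = demean(0), demean(1)
    num = sum(y_dm[k] * d_dm[k] for k in panel)
    den = sum(d_dm[k] ** 2 for k in panel)
    return num / den


# Toy data (all values hypothetical): additive unit and year effects plus a
# constant treatment effect, with staggered adoption and one never-treated unit.
unit_fe = {"A": 1.0, "B": 1.4, "C": 0.9}
year_fe = {2010: 0.0, 2011: -0.1, 2012: 0.05, 2013: -0.2}
adopt = {"A": 2012, "B": None, "C": 2013}
true_effect = -0.1

panel = {}
for u, a in adopt.items():
    for t, yfe in year_fe.items():
        d = int(a is not None and t >= a)
        panel[u, t] = (unit_fe[u] + yfe + true_effect * d, d)

beta = twfe(panel)  # recovers true_effect because the effect is homogeneous
```

With a constant effect, TWFE recovers the true value even under staggered adoption. The trouble starts only when effects differ across cohorts or evolve over time.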
Bottom line: Primary enforcement handheld device bans have no statistically significant effect on traffic fatality rates. Both estimators produce point estimates near zero with wide confidence intervals that include zero.
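As a quick sanity check, the t-statistic and confidence interval in the Basic TWFE row can be reproduced from the coefficient and standard error alone. The sketch below assumes a standard normal approximation, so the p-value is only approximately the one in the table, which may come from a t-distribution.

```python
import math

coef, se = 0.019, 0.035  # Basic TWFE row from the table above

t_stat = coef / se
# Two-sided p-value under a standard normal approximation (assumption).
p_norm = math.erfc(abs(t_stat) / math.sqrt(2))
# 95% confidence interval using the normal critical value 1.96.
ci_low, ci_high = coef - 1.96 * se, coef + 1.96 * se

print(f"t = {t_stat:.2f}, p ~ {p_norm:.2f}, CI = [{ci_low:+.3f}, {ci_high:+.3f}]")
```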
Before trusting the TWFE estimate, I decompose it into its constituent 2×2 comparisons. With 10 treatment cohorts, the TWFE estimate aggregates 190 separate DiD pairs, and the composition matters:
| Comparison type | Weight | Avg. estimate | N pairs |
|---|---|---|---|
| Treated vs. Never-Treated (cleanest) | 21% | +0.065 | 10 |
| Earlier vs. Later Treated (valid) | 18% | +0.056 | 45 |
| Later vs. Earlier Treated (problematic) | 61% | −0.021 | 135 |
Most of the TWFE estimate's weight (61%) comes from "Later vs. Earlier" comparisons that use already-treated states as controls. If treatment effects evolve over time (common in policy settings), this can bias the overall estimate. The problematic comparisons here point negative while the valid comparisons point positive, which is the kind of sign reversal that can come from heterogeneous treatment effects.
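To see why already-treated controls are a problem, here is a minimal sketch of a single 2×2 comparison on made-up data. The `outcome` function is hypothetical: it assumes a flat baseline and an effect that grows by a fixed step each year after adoption, which is exactly the kind of dynamic effect that breaks the "Later vs. Earlier" comparison.

```python
def did_2x2(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Classic 2x2 DiD: change in the treated group minus change in controls."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)


def outcome(adopt_year, year, baseline=1.2, step=-0.05):
    """Hypothetical outcome: flat baseline plus an effect that grows by
    `step` for every year since adoption (no effect before adoption)."""
    if adopt_year is None or year < adopt_year:
        return baseline
    return baseline + step * (year - adopt_year + 1)


EARLY, LATE, NEVER = 2013, 2017, None  # made-up adoption years
PRE, POST = 2016, 2017                 # window around the late adopter

# Clean comparison: late adopter vs. a never-treated unit.
clean = did_2x2(outcome(LATE, PRE), outcome(LATE, POST),
                outcome(NEVER, PRE), outcome(NEVER, POST))

# Problematic comparison: late adopter vs. an already-treated early adopter,
# whose own effect is still growing over the same window.
forbidden = did_2x2(outcome(LATE, PRE), outcome(LATE, POST),
                    outcome(EARLY, PRE), outcome(EARLY, POST))
```

In this toy setup the clean comparison recovers the first-year effect of −0.05, while the problematic one returns roughly zero: the early adopter's still-growing effect is differenced out as if it were a common trend.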
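The CS estimator only uses never-treated states as controls. For each cohort g and time t it estimates a cohort-time average treatment effect. In its simplest form (no covariates), the estimand is:

$$ATT(g,t) = E[Y_t - Y_{g-1} \mid G_g = 1] - E[Y_t - Y_{g-1} \mid C = 1]$$

where $G_g = 1$ marks states first treated in year g, $g - 1$ is the last pre-treatment year for that cohort, and C is the never-treated group. These cohort-time cells are then aggregated into an overall ATT, an event study, and cohort-specific effects. The result confirms the null finding: ATT = +0.004 (SE = 0.037, p = 0.90).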
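The cohort-time estimate described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the repository's code or a full CS implementation: there are no covariates, the `att_gt` function and the toy data are hypothetical, and the final average is a plain unweighted mean rather than the weighted aggregation the estimator uses.

```python
from statistics import mean


def att_gt(y, cohort, g, t):
    """ATT(g, t) with never-treated controls and no covariates.

    y: {(state, year): outcome}; cohort: {state: adoption year or None}.
    """
    base = g - 1  # last pre-treatment year for cohort g
    treated = [s for s, c in cohort.items() if c == g]
    never = [s for s, c in cohort.items() if c is None]
    d_treat = mean(y[s, t] - y[s, base] for s in treated)
    d_ctrl = mean(y[s, t] - y[s, base] for s in never)
    return d_treat - d_ctrl


# Toy data (all values hypothetical): additive state and year effects plus a
# constant effect of -0.1 once a state is treated.
unit_fe = {"A": 1.0, "B": 1.3, "C": 0.9, "D": 1.1}
year_fe = {2010: 0.0, 2011: -0.05, 2012: 0.02, 2013: -0.1, 2014: -0.08}
cohort = {"A": 2012, "B": 2013, "C": None, "D": None}
effect = -0.1

y = {
    (s, t): unit_fe[s] + year_fe[t]
    + (effect if cohort[s] is not None and t >= cohort[s] else 0.0)
    for s in cohort
    for t in year_fe
}

post_cells = [(g, t) for g in (2012, 2013) for t in year_fe if t >= g]
overall_att = mean(att_gt(y, cohort, g, t) for g, t in post_cells)
placebo = att_gt(y, cohort, 2013, 2011)  # pre-period cell, should be ~0
```

Because every cell is compared only against never-treated states, no already-treated unit ever serves as a control.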
An event study plots treatment effects at each period relative to adoption. This lets us check whether pre-treatment trends are parallel and whether any effect builds up, fades, or just doesn't show up at all.
A joint Wald test of all pre-treatment coefficients fails to reject the null that they are jointly zero (χ²(4) = 7.10, p = 0.13), which is consistent with parallel pre-trends. Post-treatment effects are small and insignificant throughout: no immediate effect, no delayed effect, no fade-out.
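The p-value for the pre-trend test can be checked by hand, because the chi-square survival function has a closed form when the degrees of freedom are even. The helper below is a hypothetical name for that closed form, used here only to verify the reported numbers.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """P(X > x) for a chi-square variable with even df, using the closed form
    exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df <= 0 or df % 2:
        raise ValueError("closed form only holds for positive even df")
    half = x / 2
    return math.exp(-half) * sum(
        half**j / math.factorial(j) for j in range(df // 2)
    )


p_value = chi2_sf_even_df(7.10, 4)  # test statistic and df from the text
print(f"p = {p_value:.2f}")
```

For df = 4 this reduces to exp(−x/2)(1 + x/2), which gives p ≈ 0.13 at x = 7.10, matching the reported value.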
The result is consistent with prior work (HLDI 2010; Bhargava & Pathania 2013), even if it's not the answer you'd expect. A few things probably explain it:
Limitations. This study uses total traffic fatalities rather than distraction-specific crashes, a broad outcome that can dilute any real effect. It also cannot separate law passage from actual enforcement intensity, and effects may exist for subgroups (e.g., young drivers, urban areas) but get lost in state-level aggregation.
Notes: The preferred CS specification uses never-treated states as controls. Treatment timing compiled from IIHS and GHSA records. VMT from FHWA VM-2; fatalities from NHTSA FARS. Standard errors are clustered by state throughout.
References:
Bhargava & Pathania (2013). AEJ: Economic Policy, 5(3), 92–125.
Callaway & Sant'Anna (2021). Journal of Econometrics, 225(2), 200–230.
Goodman-Bacon (2021). Journal of Econometrics, 225(2), 254–277.
Highway Loss Data Institute (2010). HLDI Bulletin, 27(11).
Strayer & Johnston (2001). Psychological Science, 12(6), 462–466.