Introduction
In this fifth blog post, I am going to discuss two areas of election forecasting: turnout and demographics.
Then, I am going to prepare a baseline model to predict the 2024 election. Over the next four weeks until the November 5th election, I will fine-tune this model and ultimately use it to predict the next president of the United States of America.
The code used to produce these visualizations is publicly available in my GitHub repository and draws heavily from the section notes and sample code provided by the Gov 1347 Head Teaching Fellow, Matthew Dardet.
Analysis
The United States’ current priors and general beliefs about turnout and demographics are informed, in part, by longstanding academic literature on the subject.
Two of the most influential publications are Who Votes?, a 1980 book by Professors Wolfinger and Rosenstone, and Mobilization, Participation, and Democracy in America, a 1993 book by Professors Rosenstone and Hansen. Both of these publications popularized theories about the connection between demographics and voter turnout that would permeate US society for decades to come.
Wolfinger and Rosenstone ran OLS regressions on census data between 1972 and 1974 to determine that education was the key demographic variable influencing turnout; age, marital status, and the restrictiveness of voter registration laws were also strongly correlated with turnout. In 1993, Rosenstone and Hansen expanded upon Wolfinger and Rosenstone’s findings, determining that those most likely to vote tended to be white, wealthy, and educated. Using data from the American National Election Studies (ANES), however, they also found that turnout was highly affected by mobilization efforts within social networks. These two studies gave the strong impression that demographics were of significant relevance to turnout.
This prevailing narrative, however, has faced increased scrutiny in recent years.
Professors Shaw and Petrocik, for example, challenged The Turnout Myth in their 2020 book of the same title, finding no evidence in the past 50 years of presidential election data that higher rates of turnout benefit Democrats, as the conventional narrative would suggest. Instead, Shaw and Petrocik argue that “turnout does not consistently help either party” (Shaw & Petrocik 2020).
Another two professors, using logistic regressions and random forest models on demographic data from the American National Election Studies (ANES) between 1952 and 2020, observed results that similarly pour cold water on assumptions about demographics’ high predictiveness of turnout. Leveraging public opinion surveys, Professors Kim and Zilinsky determined that predictions using the demographic “variables of age, gender, race, education, and income” exhibited less than 64% accuracy out-of-sample, regardless of whether the predictions were made with a random forest or a logistic regression model. Including party identification, however, improves the accuracy by between 20 and 30 percentage points. The improvement possible by including even all of the additional covariates found in a voter file (marital status, homeownership, etc.), at this point, is fairly marginal.
For these reasons, in my first electoral college and national popular vote model, I am not going to explicitly include demographic variables. Instead, I will consider polling averages and fundamental economic conditions. In future weeks, I hope to include additional analysis from voter files and more explicitly model turnout and demographics at the state level, but, for this first week, I will start with something simpler.
My Model
Both Sabato’s Crystal Ball of the Center for Politics and the Cook Political Report list the same seven states as “toss-ups.” These include the following:
- Arizona
- Georgia
- Michigan
- Nevada
- North Carolina
- Pennsylvania
- Wisconsin
While it is not inconceivable that other states or districts could unexpectedly flip (Florida, Ohio, Nebraska’s 2nd district, Virginia, Texas, etc.), it is unlikely that one of them would ‘decide’ the election. If Florida were to go blue, for example, other more competitive states would likely have gone blue as well, clinching the election for Harris. While there exist scenarios in which Texas, Florida, or Ohio could be the tipping point of the presidential election, for the purposes of this week’s blog post, I will focus on the seven most commonly cited battleground states.
With this assumption in place, assuming other states and districts vote as they did in 2020, the base electoral map for 2024 looks as follows:
As we can see, this election cycle is incredibly competitive. 93 electoral votes reside in the seven toss-up states. Neither the Democrats nor the Republicans can claim a clear edge in the electoral college.
Preparing My Electoral College Model
Using state-level polling average data since 1980 from FiveThirtyEight and national economic data from the Federal Reserve Bank of St. Louis, I construct an elastic net model that uses the following fundamental and polling features:
- Latest polling average for the Republican candidate within a state
- Latest polling average for the Democratic candidate within a state
- Mean of the polling averages for the Republican candidate within a state
- Mean of the polling averages for the Democratic candidate within a state
- A lag of the previous election’s two-party vote share for the Democrats within a state
- A lag of the previous election’s two-party vote share for the Republicans within a state
- A two-election lag of the two-party vote share for the Democrats within a state (the election before the previous one)
- A two-election lag of the two-party vote share for the Republicans within a state (the election before the previous one)
- Whether a candidate was incumbent
- GDP growth in the second quarter of the election year
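The post does not show its fitting code, so the following is only a from-scratch sketch of what an elastic net does with features like these, not the implementation actually used. It assumes the common objective that mixes an L1 (lasso) and an L2 (ridge) penalty, and fits it by coordinate descent. The function names, the toy data, and the default values of `lam` and `alpha` are all hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 part of the update)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=200):
    """Coordinate descent for the objective
    (1/2n) * sum((y - b0 - X.b)^2) + lam * (alpha*|b|_1 + (1-alpha)/2 * |b|_2^2).
    X is a list of rows; features are assumed to be roughly standardized."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # mean square of feature j
            for row, target in zip(X, y):
                fitted_without_j = intercept + sum(
                    row[k] * beta[k] for k in range(p) if k != j
                )
                rho += row[j] * (target - fitted_without_j)
                z += row[j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1 - alpha))
        # The intercept is not penalized: set it to the mean residual.
        intercept = sum(
            target - sum(row[k] * beta[k] for k in range(p))
            for row, target in zip(X, y)
        ) / n
    return intercept, beta


# Toy data (made up): the outcome depends on the first feature only.
X_toy = [[-1, 1], [0, -2], [1, 1], [-1, -1], [0, 2], [1, -1]]
y_toy = [48.0, 50.0, 52.0, 48.0, 50.0, 52.0]

print(fit_elastic_net(X_toy, y_toy, lam=0.0))               # no penalty: plain least squares
print(fit_elastic_net(X_toy, y_toy, lam=10.0, alpha=1.0))   # heavy L1 penalty
```

With no penalty the fit reduces to ordinary least squares, while a large L1 penalty pushes coefficients to exactly zero. That shrinkage is the reason an elastic net suits a feature list in which the polling and lagged vote share variables are strongly correlated with one another.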
There are only 19 states for which we have polling averages for 2024. These 19 states include our 7 most competitive battleground states, a few other more competitive states, and a handful of non-competitive states (California, Montana, New York, Maryland, Missouri, etc.).
We will train a model using all of the state-level polling data that we have access to since 1980, and then apply this model to the 19 states for which we have 2024 polling data. We can then evaluate how sensible the predictions are given what we know about each state.
state | predicted_R_pv2p | predicted_D_pv2p | pred_winner |
---|---|---|---|
Arizona | 51.02970 | 48.97038 | R |
California | 35.78881 | 64.21084 | D |
Florida | 52.64682 | 47.35331 | R |
Georgia | 50.87362 | 49.12645 | R |
Maryland | 33.17850 | 66.82108 | D |
Michigan | 49.20047 | 50.79956 | D |
Minnesota | 46.97779 | 53.02218 | D |
Missouri | 57.51158 | 42.48869 | R |
Montana | 59.79231 | 40.20801 | R |
Nevada | 49.65888 | 50.34117 | D |
New Hampshire | 46.29306 | 53.70689 | D |
New Mexico | 45.49095 | 54.50897 | D |
North Carolina | 50.73015 | 49.26993 | R |
Ohio | 55.30958 | 44.69062 | R |
Pennsylvania | 49.93654 | 50.06351 | D |
Texas | 53.84344 | 46.15672 | R |
Virginia | 46.34994 | 53.65000 | D |
Wisconsin | 49.47596 | 50.52409 | D |
New York | 44.43216 | 55.56768 | D |
Here, we can see that, apart from Arizona and Georgia, all of the 19 states for which we have data are projected to vote for the same party they did in 2020. Pennsylvania is projected to stay Democratic, but by a margin of only about 0.13 percentage points. This should give us some confidence in the accuracy of our model as it is in line with the historical behavior of the states.
I will now use a simulation to estimate how confident we can be in these results. I will do this by sampling new state-level polling measurements for each of our 19 states 10,000 times, assuming a normal distribution around the current polling values with a standard deviation of two percentage points.
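A minimal sketch of this kind of simulation for a single state is shown here. Several pieces are assumptions rather than details from the post: the Democratic and Republican polls are perturbed independently, the helper `two_party_share` is a hypothetical stand-in for the fitted model, and the poll values in the usage line are made up.

```python
import random


def two_party_share(d_poll, r_poll):
    """Hypothetical stand-in for the fitted model: the Democratic
    share of the two-party poll total, in percent."""
    return 100.0 * d_poll / (d_poll + r_poll)


def simulate_d_win_pct(d_poll, r_poll, predict_d_share,
                       n_sims=10_000, sd=2.0, seed=1347):
    """Percent of simulations in which the Democrat is predicted to win.

    Each simulation draws new polling values from a normal distribution
    centered on the current polls with standard deviation `sd`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    wins = 0
    for _ in range(n_sims):
        d_draw = rng.gauss(d_poll, sd)
        r_draw = rng.gauss(r_poll, sd)
        if predict_d_share(d_draw, r_draw) > 50.0:
            wins += 1
    return 100.0 * wins / n_sims


# Made-up polling values for one hypothetical state.
print(simulate_d_win_pct(48.0, 47.0, two_party_share))
```

Swapping `two_party_share` for the real fitted model and repeating the call for each of the 19 states would give one win percentage per state.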
Doing so yields the following table.
State | D Win Percentage |
---|---|
Arizona | 29.74 |
California | 100.00 |
Florida | 6.94 |
Georgia | 33.20 |
Maryland | 100.00 |
Michigan | 74.75 |
Minnesota | 97.69 |
Missouri | 0.00 |
Montana | 0.00 |
Nevada | 63.17 |
New Hampshire | 99.37 |
New Mexico | 99.83 |
New York | 99.92 |
North Carolina | 41.66 |
Ohio | 0.05 |
Pennsylvania | 57.53 |
Texas | 1.27 |
Virginia | 99.01 |
Wisconsin | 72.99 |
As we can see, the seven battleground states exhibit much more uncertainty than the other states. California, for example, does not vote red in a single simulation, and even Florida votes blue less than 7% of the time in our simulations. I will use the Democratic win percentages for the battleground states to estimate whether they will vote blue or red in 2024.
Projections
Using this model, our ultimate electoral college would look as follows, with Vice President Kamala Harris narrowly squeaking out a win.
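The arithmetic behind this projection can be sketched as follows. The win percentages come from the simulation table above. The electoral vote counts are the 2024 allocations for the seven toss-up states. The base totals of 226 Democratic and 219 Republican electoral votes are my assumption of what "all other states and districts vote as they did in 2020" implies under the 2024 apportionment. The function name `tally` and the 50% cutoff are illustrative.

```python
# 2024 electoral votes for the seven toss-up states (93 in total).
ELECTORAL_VOTES = {
    "Arizona": 11, "Georgia": 16, "Michigan": 15, "Nevada": 6,
    "North Carolina": 16, "Pennsylvania": 19, "Wisconsin": 10,
}

# Democratic win percentages from the simulation table.
D_WIN_PCT = {
    "Arizona": 29.74, "Georgia": 33.20, "Michigan": 74.75, "Nevada": 63.17,
    "North Carolina": 41.66, "Pennsylvania": 57.53, "Wisconsin": 72.99,
}

# Assumption: every non-toss-up state and district votes as it did in 2020.
BASE_D, BASE_R = 226, 219


def tally(win_pct, electoral_votes, base_d, base_r):
    """Give each toss-up state to the party that wins it in more than
    half of the simulations, then add up the electoral votes."""
    d_total, r_total = base_d, base_r
    for state, pct in win_pct.items():
        if pct > 50.0:
            d_total += electoral_votes[state]
        else:
            r_total += electoral_votes[state]
    return d_total, r_total


d_ev, r_ev = tally(D_WIN_PCT, ELECTORAL_VOTES, BASE_D, BASE_R)
print(f"Democratic EV: {d_ev}, Republican EV: {r_ev}")
```

Under these assumptions, Michigan, Nevada, Pennsylvania, and Wisconsin go to Harris, for 276 electoral votes to 262. Flipping Pennsylvania alone (57.53% Democratic in the simulations) would reverse the outcome, which shows how narrow the projected win is.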
If we also wanted to model the national popular vote, we could use what we did in Week 3, using an elastic net on both fundamental and polling data, weighting such that the polls closer to November matter more. This was Nate Silver’s approach.
Doing so, we find that the Democrats are projected to have a narrow lead in the two-party popular vote nationally (after scaling so that the estimates sum to 100%).
## Democrat two-party vote share: 50.86 %
## Republican two-party vote share: 49.14 %
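The scaling step is simple enough to sketch. The raw inputs below are made-up values, not the model's actual unscaled predictions, and `scale_two_party` is a hypothetical helper name.

```python
def scale_two_party(dem_raw, rep_raw):
    """Rescale two separately predicted vote shares so they sum to 100%."""
    total = dem_raw + rep_raw
    return 100.0 * dem_raw / total, 100.0 * rep_raw / total


# Made-up raw predictions that do not sum to 100.
dem_share, rep_share = scale_two_party(51.9, 50.1)
print(f"Democrat two-party vote share: {dem_share:.2f} %")
print(f"Republican two-party vote share: {rep_share:.2f} %")
```

Rescaling is needed because the two parties' shares are predicted separately, so nothing forces the raw estimates to add up to exactly 100.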
Citations:
Kim, Seo-young Silvia, and Jan Zilinsky. 2021. “The Divided (But Not More Predictable) Electorate: A Machine Learning Analysis of Voting in American Presidential Elections.” APSA Preprints. doi: 10.33774/apsa-2021-45w3m-v2. This content is a preprint and has not been peer-reviewed.
Rosenstone, Steven J., and John Mark Hansen. Mobilization, Participation, and Democracy in America. Macmillan Pub. Co: Maxwell Macmillan Canada: Maxwell Macmillan International, 1993.
Shaw, Daron, and John Petrocik. The Turnout Myth: Voting Rates and Partisan Outcomes in American National Elections. 1st ed., Oxford University Press, 2020, https://doi.org/10.1093/oso/9780190089450.001.0001.
Wolfinger, Raymond E., and Steven J. Rosenstone. Who Votes? Yale University Press, 1980.
Data Sources:
Data are from the US presidential election popular vote results from 1948-2020, state-level polling data for 1980 onwards, and economic data from the St. Louis Fed.