Predicting Senate Races with Linear Regression (and R)

Matt Thoburn
Nov 11, 2020

Preface

I was recently inspired by this article by David Robinson to try my hand at some kind of blogging. It seemed as good a time as any, as it would afford me an opportunity to practice both my writing and some of the skills I’ve been learning in an online statistics class. So, welcome to the first installment in what will surely be a long and illustrious data-journalism career. I’m by no means a statistical expert and this is mostly about the learning process for me, so if you find something egregiously incorrect or you have general feedback, don’t hesitate to let me know in the comments. Furthermore, the results were compiled by hand, so it is entirely possible that I’ve miscalculated or mistyped something. If you see something that doesn’t look right, let me know so I can correct my numbers accordingly.

In the words of the great philosopher Jake the Dog: “Sucking at something is the first step to being sorta good at something”, so without further ado, let’s dive in and do some blogging.

Note: This article was originally written before the 2020 elections although it was not published until after, partially due to juggling other responsibilities and partially because I’m a master procrastinator. In a future installment I’ll examine how the models explored here held up in 2020.

Introduction

The dataset we’ll be using was put together by hand from data I pulled from Wikipedia. (You are free to draw your own conclusions about what it says about my social life that manually entering data into a spreadsheet is how I spend my free time, but in my defense this was done during a global pandemic.) It covers US Senate races from 2012 to 2018 and contains the following fields: State, the state name; PVI, the Cook Partisan Voting Index for the state, using the Wayback Machine to get values for previous cycles; Year; Cycle, with On being a presidential election year and Off being a midterm; Incumbency, with 1 indicating a Democratic incumbent, -1 a Republican incumbent, and 0 an open seat in which neither candidate is an incumbent; and Result, the difference between the percent vote shares of the Democratic and Republican candidates. PVI and Result are both signed: positive values represent Democratic-leaning states and Democratic wins, and negative values represent Republican-leaning states and Republican wins. This is a matter of semantics more than a value judgment of either party.

Some Result fields were left blank in unusual circumstances, such as California’s jungle primaries (in which two Democrats can advance to the general election), a strong third-party candidate (such as Angus King of Maine winning as an Independent against both a Democrat and a Republican), or a candidate running unopposed (Jeff Sessions of Alabama ran unopposed in 2014 and won with almost 100% of the vote).

Another feature worth noting is that this data does not cover special elections (such as Doug Jones’ upset victory in Alabama in 2017), as special elections are, by definition, special.

You can grab a copy of the data as well as the source code from my GitHub.

Reading in the Data

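library(ggplot2)  # plotting functions used below
library(GGally)
# Read the hand-compiled results
df <- read.table('voting_results.csv', header = TRUE, sep = ",")
# Treat the categorical columns as factors so lm() and ggplot() group by them
df$Cycle <- as.factor(df$Cycle)
df$Incumbency <- as.factor(df$Incumbency)
df$Year <- as.factor(df$Year)
head(df)
# Correlate the remaining numeric columns (PVI and Result), skipping rows with NAs
cor(df[sapply(df, is.numeric)], use = "complete.obs")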
##         State PVI Year Cycle Incumbency Result
## 1     Arizona  -5 2018   Off          0    2.4
## 2  California  12 2018   Off          1     NA
## 3 Connecticut   6 2018   Off          1   20.1
## 4    Delaware   6 2018   Off          1   22.2
## 5     Florida  -2 2018   Off          1    0.2
## 6      Hawaii  18 2018   Off          1   42.4
##              PVI    Result
## PVI    1.0000000 0.8499908
## Result 0.8499908 1.0000000

Reading in the data and doing a quick correlation check, we can see a strong correlation (about 0.85) between Result and PVI, but let’s have a look at our categorical variables before continuing with any regression analysis.
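For readers without the CSV to hand, here is a minimal sketch that rebuilds only the six rows printed by head(df) above and repeats the same preparation steps. The name sample_df is mine, and six rows are far too few to reproduce the correlation from the full dataset; the point is only to show which columns count as numeric and how use = "complete.obs" skips California’s missing Result.

```r
# Rebuild the six rows shown in the head(df) output (not the full dataset).
sample_df <- data.frame(
  State      = c("Arizona", "California", "Connecticut",
                 "Delaware", "Florida", "Hawaii"),
  PVI        = c(-5, 12, 6, 6, -2, 18),
  Year       = 2018,
  Cycle      = "Off",
  Incumbency = c(0, 1, 1, 1, 1, 1),
  Result     = c(2.4, NA, 20.1, 22.2, 0.2, 42.4)
)

# Same preparation as above: treat the categorical columns as factors.
sample_df$Cycle      <- as.factor(sample_df$Cycle)
sample_df$Incumbency <- as.factor(sample_df$Incumbency)
sample_df$Year       <- as.factor(sample_df$Year)

# Only PVI and Result remain numeric, so only they enter the correlation.
numeric_cols <- sapply(sample_df, is.numeric)
names(sample_df)[numeric_cols]

# complete.obs drops any row containing an NA (here, California) first.
cor(sample_df[numeric_cols], use = "complete.obs")
```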

Summarizing the Data

# Histograms of the numeric variables, then boxplots of Result by each factor
ggplot(df, aes(x=Result)) + geom_histogram(color="black", fill="gray", binwidth=10)
ggplot(df, aes(x=PVI)) + geom_histogram(color="black", fill="gray", binwidth=5)
ggplot(df, aes(x=Cycle, y=Result)) + geom_boxplot(fill="gray")
ggplot(df, aes(x=Incumbency, y=Result)) + geom_boxplot(fill="gray")
ggplot(df, aes(x=Year, y=Result)) + geom_boxplot(fill="gray")
Histogram of Results
Histogram of PVI
Comparing distribution of results by Cycle (on/off)
Comparing distribution of results by Incumbency (-1,0,1 -> GOP incumbent, No incumbent, Dem incumbent)
Comparing distribution of results by Year

Examining the boxplots, there doesn’t seem to be much difference in results between midterm and presidential cycles (at least for the data provided), which suggests there is no partisan advantage to running in an on- or off-cycle race. Unsurprisingly, however, incumbency does appear to have an effect on results. Incumbents tend to do better than non-incumbents (although the exact extent of the incumbent advantage is best left to actual pundits). Looking at results by year, we can see years in which the overall landscape seems to favor one party or the other (Obama was reelected in 2012 and the Democratic Party gained two seats in the Senate that year, whereas in 2014 they lost nine seats and their Senate majority). The extent to which Senate results correlate with presidential approval ratings or other broader partisan metrics is beyond the scope of this article, although it would make for an interesting line of further inquiry.

Plotting state results against PVI

# lm() coerces the string to a formula; rows with a missing Result are dropped
result.pvi <- lm("Result ~ PVI", data = df)
summary(result.pvi)
ggplot(df, aes(x=PVI, y=Result)) + geom_point() + geom_smooth(method = "lm", fill = NA)
## Call:
## lm(formula = "Result ~ PVI", data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -54.207  -6.395  -0.093   6.670  38.789
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.0461     1.1366    4.44 1.91e-05 ***
## PVI           2.0268     0.1102   18.40  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.69 on 130 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.7225, Adjusted R-squared: 0.7203
## F-statistic: 338.4 on 1 and 130 DF, p-value: < 2.2e-16
Result as a function of PVI, with regression line

Plotting a simple linear regression, we do see a relatively well-fit linear model with mostly normally distributed residuals and an R² value of 0.72. However, with a residual standard error of 12.7 points, the model is much too noisy to make what I would consider meaningful predictions about potential Senate elections. I imagine my career as a political strategist would be extremely short-lived if I told candidates that they could expect a 2-point victory, plus or minus 25 points (roughly twice the residual standard error, for an approximate 95% prediction interval). When close elections can come down to a few thousand votes and fractions of a percent, we would have to do much better to make any meaningful predictions.
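The “plus or minus 25” figure can be checked with a little arithmetic. This is a rough sketch that plugs in the coefficients and residual standard error printed in the summary above. It ignores the uncertainty in the coefficients themselves, so a proper interval from predict() on the fitted model with interval = "prediction" would be slightly wider. The PVI of -1.5 is a hypothetical input I picked because it yields a prediction of about 2 points.

```r
# Values copied from the summary(result.pvi) output above.
intercept <- 5.0461
slope     <- 2.0268
rse       <- 12.69   # residual standard error
df_resid  <- 130     # residual degrees of freedom

# Hypothetical state that leans Republican by 1.5 points.
pvi <- -1.5
predicted_margin <- intercept + slope * pvi

# Approximate 95% half-width: t critical value times the residual standard error.
half_width <- qt(0.975, df_resid) * rse

round(c(fit   = predicted_margin,
        lower = predicted_margin - half_width,
        upper = predicted_margin + half_width), 1)
```

The half-width comes out at roughly 25 points on either side of the fitted value, which is where the tongue-in-cheek forecast above comes from.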

We do appear to have some non-linearity at the extremities, particularly in heavily republican states, although given the overall noise of the data I’m content to leave the model as is and keep it strictly linear.

Do On/Off Cycles affect the strength of PVI?

Motivation

Suppose we wanted to look at whether On/Off years change how strongly PVI affects results. Perhaps partisan inclinations are weaker in midterms when there’s not a presidential candidate to straight-ticket vote with, or perhaps the difference in turnout between midterms and presidential cycles favors one party or another. We can include Cycle as an additional variable in our linear model, along with its interaction with PVI, and see how we fare.

# PVI*Cycle expands to PVI + Cycle + PVI:Cycle (main effects plus interaction)
result.pvi.cycle <- lm("Result ~ PVI*Cycle", data = df)
summary(result.pvi.cycle)
ggplot(df, aes(x=PVI, y=Result, col=Cycle)) + geom_point() + geom_smooth(method = "lm", fill = NA)
## Call:
## lm(formula = "Result ~ PVI*Cycle", data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -53.199  -6.852   0.623   6.154  37.802
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.2315     1.5891   3.292  0.00129 **
## PVI           1.8279     0.1462  12.503  < 2e-16 ***
## CycleOn      -0.6007     2.2495  -0.267  0.78985
## PVI:CycleOn   0.4637     0.2199   2.109  0.03689 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.54 on 128 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.733, Adjusted R-squared: 0.7267
## F-statistic: 117.1 on 3 and 128 DF, p-value: < 2.2e-16
Results as a function of PVI and Cycle, with regression lines

Results

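Sadly, the results are underwhelming. We don’t gain any noticeable advantage by including Cycle as a predictor in our regression: the Adjusted R² barely moves (0.7203 to 0.7267) and the Residual Standard Error drops only from 12.69 to 12.54. The p-value for the CycleOn term (0.79) is nowhere near statistical significance, and while the PVI:CycleOn interaction squeaks under the 0.05 threshold (p = 0.037), that is weak evidence on its own. Perhaps more data is needed across a longer time span, or perhaps there really isn’t much of a difference. I would defer to the experts on this one.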
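To read the interaction output, it helps to turn the coefficients into one fitted line per cycle. The sketch below uses only the estimates printed in the summary above; the variable names are mine.

```r
# Estimates copied from the summary(result.pvi.cycle) output above.
coefs <- c("(Intercept)" = 5.2315,
           "PVI"         = 1.8279,
           "CycleOn"     = -0.6007,
           "PVI:CycleOn" = 0.4637)

# "Off" is the baseline level, so its line uses the plain intercept and slope.
# The "On" line adds the CycleOn shift and the PVI:CycleOn slope adjustment.
lines_by_cycle <- data.frame(
  Cycle     = c("Off", "On"),
  Intercept = c(coefs[["(Intercept)"]],
                coefs[["(Intercept)"]] + coefs[["CycleOn"]]),
  Slope     = c(coefs[["PVI"]],
                coefs[["PVI"]] + coefs[["PVI:CycleOn"]])
)
lines_by_cycle
```

On these numbers, each point of PVI is worth roughly 1.8 points of margin in a midterm and roughly 2.3 in a presidential year, while the two intercepts are nearly identical.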

But what happens if we try our other variables?

How does the strength of PVI change by year?

Motivation

Perhaps we can refine our predictions if we account for the year of the election. It wouldn’t be particularly useful for making future predictions by itself (What kinds of predictions would we make about 2020? Your guess is as good as mine), but it might indicate some kind of national mood favoring one party or another that could potentially be researched and quantified by other metrics.

# Year is a factor, so each year gets its own intercept shift and slope adjustment
result.pvi.year <- lm("Result ~ PVI*Year", data = df)
summary(result.pvi.year)
ggplot(df, aes(x=PVI, y=Result, col=Year)) + geom_point() + geom_smooth(method = "lm", fill = NA)
## Call:
## lm(formula = "Result ~ PVI*Year", data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -48.609  -4.712   0.509   5.863  31.448
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)    9.6340     2.0053   4.804 4.40e-06 ***
## PVI            2.1228     0.2254   9.418 3.24e-16 ***
## Year2014      -9.5929     2.8937  -3.315 0.001202 **
## Year2016     -10.0529     2.8897  -3.479 0.000695 ***
## Year2018       1.1440     2.8334   0.404 0.687073
## PVI:Year2014  -0.1948     0.2955  -0.659 0.510955
## PVI:Year2016   0.1840     0.3010   0.611 0.542147
## PVI:Year2018  -0.5184     0.2916  -1.778 0.077920 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.32 on 124 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.7892, Adjusted R-squared: 0.7773
## F-statistic: 66.31 on 7 and 124 DF, p-value: < 2.2e-16
Results as a function of PVI and Year with regression lines

Results

Swapping Cycle for Year in our model, we do see an improvement in our Adjusted R² (0.72 to 0.78) as well as a slight reduction in our Residual Standard Error (12.69 to 11.32). The p-values for the 2014 and 2016 year terms are significant, although there’s no statistically significant difference between 2012 (the baseline, given how the lm was set up) and 2018, and none of the PVI:Year interaction terms are significant at the 0.05 level. However, the data is still too noisy to make any meaningful predictions.

It does however hint at the possibility of some kind of national mood that could potentially be researched and quantified with something like presidential approval ratings or generic partisan approval rating. It’s always fun to set out to answer one question and only end up with more questions.
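To keep the models so far straight, here is a small sketch that lines up the fit statistics reported in the three summaries above. The numbers are copied from the printed output rather than recomputed, and the column names are my own. With the fitted objects in hand, something like anova(result.pvi, result.pvi.year) would be a more formal way to compare the nested models.

```r
# Fit statistics copied from the three summary() outputs above.
fits <- data.frame(
  Model = c("Result ~ PVI", "Result ~ PVI*Cycle", "Result ~ PVI*Year"),
  AdjR2 = c(0.7203, 0.7267, 0.7773),
  RSE   = c(12.69, 12.54, 11.32)
)

# Change relative to the PVI-only baseline in the first row.
fits$AdjR2_gain <- fits$AdjR2 - fits$AdjR2[1]
fits$RSE_drop   <- fits$RSE[1] - fits$RSE

fits
```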

How does the strength of PVI change by incumbency?

Motivation

Maybe we can do better if we factor in incumbency. It’s possible that being a seated incumbent gives advantages whereas trying to unseat an existing incumbent is an uphill battle.

# Incumbency = -1 (Republican incumbent) is the baseline factor level
result.pvi.incumbency <- lm("Result ~ PVI*Incumbency", data = df)
summary(result.pvi.incumbency)
ggplot(df, aes(x=PVI, y=Result, col=Incumbency)) + geom_point() + geom_smooth(method = "lm", fill = NA)
## Call:
## lm(formula = "Result ~ PVI*Incumbency", data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -38.264  -5.231   0.820   5.902  23.872
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -8.1858     2.3454  -3.490 0.000666 ***
## PVI               1.4950     0.1898   7.876 1.35e-12 ***
## Incumbency0      11.3086     3.3896   3.336 0.001116 **
## Incumbency1      20.4877     2.7111   7.557 7.41e-12 ***
## PVI:Incumbency0   0.3209     0.3130   1.025 0.307210
## PVI:Incumbency1   0.0143     0.2430   0.059 0.953157
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 126 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.8199, Adjusted R-squared: 0.8127
## F-statistic: 114.7 on 5 and 126 DF, p-value: < 2.2e-16
Results as a function of PVI and Incumbency with regression lines

Results

Modeling against PVI and Incumbency, we get our best results yet, with an Adjusted R² of 0.81 and an RSE of 10.38. It does in fact seem that being an incumbent is associated with better results for your party (based on this data at least). For a given PVI, Democratic incumbents tend to win by a larger margin than Democrats running for open seats, and Democratic challengers running against Republican incumbents tend to do worse. The interaction terms are not significant, so incumbency mostly shifts the line up or down rather than changing its slope. However, as before, we still have too much unexplained variability to make any meaningful predictions.
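To put the incumbency effect in concrete terms, here is a sketch of a small helper that evaluates the fitted lines by hand. The function predict_margin is hypothetical (it is not part of the original analysis) and simply hard-codes the estimates printed above. With the real model object, predict() with a newdata data frame would do the same job.

```r
# Estimates copied from the summary(result.pvi.incumbency) output above.
# The baseline level is Incumbency = -1 (Republican incumbent).
predict_margin <- function(pvi, incumbency) {
  intercept <- -8.1858 +
    ifelse(incumbency == 0, 11.3086, 0) +
    ifelse(incumbency == 1, 20.4877, 0)
  slope <- 1.4950 +
    ifelse(incumbency == 0, 0.3209, 0) +
    ifelse(incumbency == 1, 0.0143, 0)
  intercept + slope * pvi
}

# Predicted margin in a hypothetical evenly split state (PVI = 0).
sapply(c(gop_incumbent = -1, open_seat = 0, dem_incumbent = 1),
       function(inc) predict_margin(pvi = 0, incumbency = inc))
```

In an evenly split state, the fitted lines put a Republican incumbent about 8 points ahead, a Democratic incumbent about 12 points ahead, and an open seat about 3 points in the Democrats’ favor.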

Conclusion

The initial motivation for this line of inquiry was to see how much variability in election results can be described with publicly available data rather than something like polling, which costs time and money to gather. Understanding voting trends in a rigorous quantitative sense could help party strategists gauge the effect of historical voting trends on future ones in order to allocate their efforts to races where they have the best chance of winning, particularly if money is tight or reliable polling data isn’t readily available in a given state. If you were the DSCC or the NRSC and you only had a finite amount of money, time, and personnel, how much should you invest defending ostensibly safe seats? How much should you invest trying to strike deep into the heart of enemy territory, unseating an incumbent in a traditionally safe state? How much should you invest in swing states?

For the time being it looks like I’ll have to put any ambitions I may have of being a political strategist or forecaster on hold and keep my day job. It doesn’t seem like one can make any particularly accurate predictions about voting results, at least in terms of trying to pinpoint a specific numeric value based on these metrics alone. At the end of the day, not every exploration will end in a satisfying conclusion, and that’s just part of the process. I’m certainly not going to spend all that time compiling and analyzing the data just to shelve it when it doesn’t make for a compelling conclusion. I’m going to try to score some internet points off it, prediction accuracy be damned.

If anything, we should take some comfort in knowing that PVI alone isn’t a silver bullet for predicting senate elections. If it were, it would imply that elections are set in stone and predetermined; it wouldn’t matter who was running or what kind of strategy they employed. Red states would be red states, and blue states would be blue states. In that regard we can breathe a sigh of relief knowing that voters aren’t completely mindless automatons who vote strictly along party lines (for better or for worse).

Future ideas

If I were to expand on this work in the future, I would be interested in exploring additional data that could potentially be informative in making these kinds of predictions, such as fundraising data, social media engagement, or Google search trends. Google search trends in particular seem like an especially interesting line of inquiry to me. How much of a non-presidential race comes down to name recognition? In the era of the fast-moving 24-hour news cycle, is it advantageous to get your name into the ether by any means necessary, or can low-profile, slow and steady campaigns win the race? (pun intended) The answers to these questions could lie in Google’s search traffic data and could have relevance for campaign strategy and election forecasting. However, the effort to aggregate this data for the hundred or so races I already have is non-trivial, seeing as Google Trends doesn’t have an API to query this kind of data programmatically and I don’t feel particularly inclined to spend more time entering data into a spreadsheet at the moment. (If any Google engineers are reading this, please publicly expose a Trends API. It would make me very happy.)

Another line of inquiry would be to take the existing data and approach the problem more probabilistically (i.e. given a set of initial conditions, what is the probability that a candidate will win the election?). Perhaps that would be more fruitful than trying to pin down precise values.
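As a very rough first pass at that idea, the PVI-only model can be turned into a win probability by asking how likely the predicted margin is to land above zero. The sketch below assumes normally distributed errors with the residual standard error reported earlier and ignores uncertainty in the coefficients, so treat the numbers as illustrative only. The function name win_probability is mine, and a logistic regression via glm() with family = binomial would be the more conventional route with the full dataset.

```r
# Values copied from the PVI-only model summary earlier in the article.
intercept <- 5.0461
slope     <- 2.0268
rse       <- 12.69

# Rough probability that the Democratic margin is above zero, assuming
# normally distributed errors and ignoring uncertainty in the coefficients.
win_probability <- function(pvi) {
  predicted_margin <- intercept + slope * pvi
  pnorm(predicted_margin / rse)
}

# Hypothetical states ranging from R+10 to D+10.
pvi_values <- c(-10, -5, 0, 5, 10)
data.frame(PVI = pvi_values,
           P_dem_win = round(win_probability(pvi_values), 2))
```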

Postface (the opposite of preface?)

If you’ve made it this far, congratulations and thank you for bearing with me. What did you think? Did you love it? Did you hate it? Do you want to hire me to run your next campaign for office? Let me know in the comments so I can get better at this.
