Does averaging models, markets, and odds produce more accurate predictions?

In judging state-level forecasts of the 2020 U.S. Presidential Election, we’ve primarily looked at prices from one prediction market (PredictIt), outputs from two forecasting models (FiveThirtyEight and The Economist), and the implied probabilities of gambling odds from three British bookmakers (Sky Bet, BoyleSports, and Betfred).  Because diversity in predictions can benefit ensembles, I was curious whether averaging predictions from different approaches (models, markets, and bookmakers) would produce a more accurate prediction.

Below are the Brier scores (lower is better) and the number of correct predictions (out of 102 total predictions) for each individual forecast and each two-forecast ensemble, using predictions from the morning of Election Day 2020.

Rank  Model(s)                       Brier score  Correct predictions
1     Economist & Sky Bet            0.0302       97 / 102
2     Economist & BoyleSports        0.0306       97 / 102
3     Economist & Betfred            0.0307       97 / 102
4     Economist & PredictIt          0.0314       96 / 102
5     FiveThirtyEight & Sky Bet      0.0322       97 / 102
6     FiveThirtyEight & BoyleSports  0.0325       97 / 102
7     FiveThirtyEight & Betfred      0.0327       97 / 102
8     Sky Bet                        0.0332       99 / 102
9     Sky Bet & BoyleSports          0.0335       99 / 102
10    PredictIt & FiveThirtyEight    0.0336       96 / 102
11    BoyleSports                    0.0338       99 / 102
12    Sky Bet & Betfred              0.0339       98 / 102
13    Economist                      0.0341       98 / 102
14    Betfred & BoyleSports          0.0342       98 / 102
15    Economist & FiveThirtyEight    0.0344       98 / 102
16    Betfred                        0.0347       98 / 102
17    PredictIt & Sky Bet            0.0351       99 / 102
18    PredictIt & BoyleSports        0.0353       98 / 102
19    PredictIt & Betfred            0.0358       99 / 102
20    FiveThirtyEight                0.0359       98 / 102
21    PredictIt                      0.0375       99 / 102

In general, averaging predictions from different approaches led to lower Brier scores than averaging predictions from the same approach.  The eight ensembles that performed better than both of their components each paired a forecasting model with a bookmaker or the market.  The seven ensembles that performed worse than at least one of their components all combined similar approaches:

  • Bookmakers (Sky Bet, BoyleSports, and Betfred) with each other and with PredictIt
  • Forecasting models (The Economist and FiveThirtyEight) with each other
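
A minimal sketch of how a comparison like this might be computed, assuming the Brier score is the mean squared difference between each forecast probability and the 0/1 outcome, and that an ensemble is the simple average of two forecasts.  The function names and the three tiny forecasts are hypothetical placeholders, not the data behind the table.

```python
from itertools import combinations

def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def average_forecasts(a, b):
    """Element-wise mean of two forecasts that cover the same predictions."""
    return [(x + y) / 2 for x, y in zip(a, b)]

# Hypothetical probabilities for three predictions; NOT the article's data.
outcomes = [1, 0, 1]
forecasts = {
    "model": [0.9, 0.3, 0.4],
    "market": [0.7, 0.1, 0.6],
    "bookmaker": [0.8, 0.2, 0.7],
}

# Score every individual forecast, then every pairwise ensemble.
scores = {name: brier_score(p, outcomes) for name, p in forecasts.items()}
for a, b in combinations(forecasts, 2):
    ensemble = brier_score(average_forecasts(forecasts[a], forecasts[b]), outcomes)
    beats_both = ensemble < min(scores[a], scores[b])
    print(f"{a} & {b}: {ensemble:.4f} (beats both: {beats_both})")
```

With six forecasts, combinations yields 15 pairs, which together with the six individual forecasts accounts for the 21 rows in the table.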

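The Brier scores are close together (all within 0.0073 of each other) and the numbers of correct predictions are also close (between 96 and 99), but there seems to be little correlation between a lower Brier score and more correct predictions.  The seven lowest Brier scores all came with 96 or 97 correct predictions, while PredictIt, with the highest Brier score, got 99 correct.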

The decrease in Brier scores from averaging forecasts is largely due to the average mitigating the errors from bad predictions by either member of the pair.  For example, the average of The Economist and PredictIt predictions had a lower Brier score than either individual forecast, but the ensemble also correctly predicted fewer state winners.  PredictIt gamblers were correct in Florida and wrong in Georgia, and The Economist was the reverse.  Averaging their predictions reduces the overall error in Florida and Georgia, but the ensemble now picks the wrong winner (and loser) in both states.
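
The Florida and Georgia effect can be reproduced with made-up numbers.  The sketch below is only an illustration: the probabilities are hypothetical values chosen so that each forecast is narrowly right in one state and badly wrong in the other, and they are not the actual PredictIt or Economist figures.  It assumes the mean-squared-error form of the Brier score and a 0.5 threshold for calling a winner.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def correct_calls(probs, outcomes, threshold=0.5):
    """Count predictions whose side of the threshold matches the outcome."""
    return sum((p > threshold) == bool(o) for p, o in zip(probs, outcomes))

# Hypothetical P(Trump wins) in [Florida, Georgia]; illustrative values only.
outcomes = [1, 0]        # Trump won Florida and lost Georgia
market = [0.55, 0.80]    # narrowly right in Florida, badly wrong in Georgia
model = [0.20, 0.45]     # badly wrong in Florida, narrowly right in Georgia
ensemble = [(a + b) / 2 for a, b in zip(market, model)]

for name, probs in [("market", market), ("model", model), ("ensemble", ensemble)]:
    score = brier_score(probs, outcomes)
    print(f"{name}: Brier {score:.4f}, correct {correct_calls(probs, outcomes)} / 2")
```

In this toy case the average pulls both large errors toward the middle, which lowers the squared error, but it also drags both narrow correct calls across the 0.5 line.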

A few caveats:

  • For an apples-to-apples comparison, I’m using all predictions from the morning of Election Day, November 3, 2020.  The Economist’s model outputs had not been updated when I accessed them, so I’m using the data that was available on their website at the time, which was from November 2.
  • The British bookmaker odds were taken from oddschecker.com and converted into implied probabilities.
  • I used a 0.5 threshold for classifying a forecast as having predicted a candidate to win or lose a state.  PredictIt and the British bookmakers have fees built into their prices and odds, so their implied probabilities total over 1.0 and it’s possible for both Trump and Biden to have implied probabilities above 0.5 in the same state.  To compare the forecast models’ probabilities with the gambling implied probabilities, I used predictions for both Trump and Biden in 50 states and D.C.  That comes out to 102 total predictions: 2 candidates × (50 states + D.C.) = 2 × 51.
  • PredictIt and FiveThirtyEight made forecasts for the five congressional districts (two in Maine and three in Nebraska) that each allocate an elector to the district’s winner.  I did not use these predictions because The Economist and the bookmakers did not make forecasts for these districts.  It’s worth noting that PredictIt arguably performs better than FiveThirtyEight when these districts are included.
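
To illustrate the fees caveat, here is a sketch of converting odds into implied probabilities.  It assumes the odds are quoted in the fractional format common among British bookmakers (the format actually used for the conversion is not stated above), and the 10/11 quotes are hypothetical, not real prices from oddschecker.com.

```python
from fractions import Fraction

def implied_probability(numerator, denominator):
    """Implied probability of fractional odds written as numerator/denominator."""
    return Fraction(denominator, numerator + denominator)

# Hypothetical odds of 10/11 on each candidate in one state; not real quotes.
trump = implied_probability(10, 11)
biden = implied_probability(10, 11)

overround = trump + biden - 1   # the bookmaker's built-in margin
both_favored = trump > Fraction(1, 2) and biden > Fraction(1, 2)
print(float(trump), float(biden), float(overround), both_favored)
```

Because the two implied probabilities sum to more than 1, both candidates can sit above the 0.5 threshold at once, which is why each candidate is scored as a separate prediction.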

Future analysis could use true probabilities for PredictIt and the British bookmakers, average more than two predictions together (e.g., a market, model, and bookmaker combination), and average FiveThirtyEight and PredictIt across all 56 contests they both forecast (50 states, D.C., and the five congressional districts).
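
As a rough sketch of what those extensions might look like: one simple way to approximate "true" probabilities is to rescale the implied probabilities in each contest so they sum to 1.  This proportional rescaling is an assumption on my part and only one of several ways to remove the built-in margin.  The helper names and all numbers below are hypothetical.

```python
def normalize(implied):
    """Rescale one contest's implied probabilities so they sum to 1."""
    total = sum(implied.values())
    return {name: p / total for name, p in implied.items()}

def average_many(*probs):
    """Mean of any number of probabilities for the same prediction."""
    return sum(probs) / len(probs)

# Hypothetical implied probabilities for one state; illustrative values only.
bookmaker = normalize({"Biden": 0.60, "Trump": 0.45})
market = normalize({"Biden": 0.62, "Trump": 0.41})
model_biden = 0.70   # a model's probability already sums to 1 across candidates

three_way = average_many(bookmaker["Biden"], market["Biden"], model_biden)
print(f"Three-way ensemble P(Biden): {three_way:.3f}")
```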