Does averaging models, markets, and odds produce more accurate predictions?

In judging state-level forecasts of the 2020 U.S. Presidential Election, we’ve primarily looked at prices from one prediction market (PredictIt), outputs from two forecasting models (FiveThirtyEight and The Economist), and the implied probabilities of gambling odds from three British bookmakers (Sky Bet, BoyleSports, and Betfred). Because diversity in predictions can benefit ensembles, I was curious whether averaging predictions from different approaches (models, markets, and bookmakers) would produce a more accurate prediction.

Below are the Brier scores (lower is better) and the number of correct predictions (out of 102 total predictions) for each individual forecast and each pairwise ensemble (the average of two forecasts) on the morning of Election Day 2020.

Rank	Forecast(s)	Brier score	Correct predictions
1	Economist & Sky Bet	0.0302	97 / 102
2	Economist & BoyleSports	0.0306	97 / 102
3	Economist & Betfred	0.0307	97 / 102
4	Economist & PredictIt	0.0314	96 / 102
5	FiveThirtyEight & Sky Bet	0.0322	97 / 102
6	FiveThirtyEight & BoyleSports	0.0325	97 / 102
7	FiveThirtyEight & Betfred	0.0327	97 / 102
8	Sky Bet	0.0332	99 / 102
9	Sky Bet & BoyleSports	0.0335	99 / 102
10	PredictIt & FiveThirtyEight	0.0336	96 / 102
11	BoyleSports	0.0338	99 / 102
12	Sky Bet & Betfred	0.0339	98 / 102
13	Economist	0.0341	98 / 102
14	Betfred & BoyleSports	0.0342	98 / 102
15	Economist & FiveThirtyEight	0.0344	98 / 102
16	Betfred	0.0347	98 / 102
17	PredictIt & Sky Bet	0.0351	99 / 102
18	PredictIt & BoyleSports	0.0353	98 / 102
19	PredictIt & Betfred	0.0358	99 / 102
20	FiveThirtyEight	0.0359	98 / 102
21	PredictIt	0.0375	99 / 102

In general, averaging predictions from different approaches led to lower Brier scores than averaging predictions from the same approach. The eight ensembles that performed better than both of their components each paired a forecasting model with a bookmaker or the prediction market. The seven ensembles that performed worse than at least one of their components all combined similar approaches:

Bookmakers (Sky Bet, BoyleSports, and Betfred) with each other and with PredictIt
Forecasting models (The Economist and FiveThirtyEight) with each other
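
The comparison above can be sketched roughly as follows. This is a minimal illustration, not the code behind the table: the function names and the tiny four-prediction data set are made up, the ensembles are assumed to be equal-weight averages, and the Brier score is taken to be the mean squared difference between a forecast probability and the 0/1 outcome.

```python
from itertools import combinations


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def average_forecasts(a, b):
    """Equal-weight ensemble of two forecasts (an assumption of this sketch)."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def rank_forecasts(forecasts, outcomes):
    """Score every single forecast and every pairwise average, best first."""
    scores = {(name,): brier_score(p, outcomes) for name, p in forecasts.items()}
    for a, b in combinations(forecasts, 2):
        blended = average_forecasts(forecasts[a], forecasts[b])
        scores[(a, b)] = brier_score(blended, outcomes)
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical probabilities and outcomes, NOT the real 2020 data.
outcomes = [1, 0, 1, 0]
forecasts = {
    "model": [0.9, 0.2, 0.4, 0.1],
    "market": [0.7, 0.3, 0.6, 0.4],
}

for names, score in rank_forecasts(forecasts, outcomes):
    print(" & ".join(names), round(score, 4))
```

With six input forecasts, this kind of ranking yields 6 single forecasts plus 15 pairs, which matches the 21 rows in the table.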

The Brier scores are close together (all within 0.0073) and the number of correct predictions is also close (between 96 and 99), but a lower Brier score does not seem to go hand in hand with more correct predictions. PredictIt on its own, for example, has the worst Brier score yet ties for the most correct predictions (99).

The decrease in Brier scores from averaging forecasts is largely due to the average mitigating the errors from bad predictions by either member of the pair. For example, averaging The Economist and PredictIt predictions produced a lower Brier score than either individual forecast, but the ensemble also correctly predicted fewer state winners. PredictIt gamblers were correct in Florida and wrong in Georgia, while The Economist was wrong in Florida and correct in Georgia. Averaging their predictions reduces the overall error across Florida and Georgia, but the ensemble now predicts the wrong winner and loser in both states.
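
A small worked example may make this trade-off easier to see. The probabilities below are invented for illustration and are not the actual PredictIt or Economist numbers. For simplicity the sketch tracks only one candidate's probability per state, and it assumes a forecast counts as predicting a win only when the probability is strictly above 0.5. Because squared error is convex, the squared error of an average is never larger than the average of the two components' squared errors, even when the average lands on the wrong side of 0.5.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def n_correct(probs, outcomes, threshold=0.5):
    """Count forecasts that fall on the right side of the threshold."""
    return sum((p > threshold) == bool(o) for p, o in zip(probs, outcomes))


# Outcome for Biden in [Florida, Georgia]: lost Florida, won Georgia.
outcomes = [0, 1]

# Hypothetical probabilities of a Biden win, chosen only to mimic the pattern
# described above (market right in Florida, model right in Georgia).
model = [0.75, 0.55]
market = [0.40, 0.35]
ensemble = [(a + b) / 2 for a, b in zip(model, market)]

for name, probs in [("model", model), ("market", market), ("ensemble", ensemble)]:
    print(name, round(brier_score(probs, outcomes), 4), n_correct(probs, outcomes))
```

In this toy case each component calls one of the two states correctly, while the ensemble calls neither, yet the ensemble's Brier score is still below the average of the two component scores.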

A few caveats:

For an apples-to-apples comparison, I’m using all predictions from the morning of Election Day, November 3, 2020. The Economist’s model outputs had not been updated when I accessed them, so I’m using the data that was available on their website at that moment, which was from November 2.

The British bookmaker odds were taken from oddschecker.com and converted into implied probabilities.

I used a 0.5 threshold for classifying a forecast as having predicted a candidate to win or lose a state. PredictIt and the British bookmakers have fees built into their prices and odds, so their implied probabilities sum to more than 1.0 and it’s possible for both Trump and Biden to have implied probabilities above 0.5 in the same state. To compare the forecast models’ probabilities with the gambling implied probabilities, I used predictions for both Trump and Biden in 50 states and D.C. That comes out to 102 total predictions: two candidates × (50 states + D.C.).

PredictIt and FiveThirtyEight made forecasts for the five congressional districts (in Maine and Nebraska) that each allocate an elector to the district’s winner. I did not use these predictions because The Economist and the bookmakers did not make forecasts for those districts. It’s worth noting that PredictIt arguably performs better than FiveThirtyEight when these districts are included.

Future analysis could use true probabilities for PredictIt and the British bookmakers, average more than two predictions together (e.g., a market, model, and bookmaker combination), and average all 56 predictions from FiveThirtyEight and PredictIt.
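
To make the caveats about implied probabilities concrete, here is a minimal sketch of converting fractional bookmaker odds into implied probabilities and then rescaling them so they sum to 1. The odds shown are hypothetical, the function names are my own, and proportional rescaling is only one simple way to approximate "true" probabilities. It is not necessarily how a future analysis would remove the fees.

```python
from fractions import Fraction


def implied_probability(fractional_odds):
    """Convert fractional odds such as '5/6' into an implied probability.

    Odds of a/b imply a probability of b / (a + b), i.e. 1 / (1 + a/b).
    """
    odds = Fraction(fractional_odds)
    return float(1 / (1 + odds))


def normalize(probs):
    """Rescale probabilities proportionally so they sum to 1 (an assumption)."""
    total = sum(probs)
    return [p / total for p in probs]


# Hypothetical odds for the two candidates in one state, not real 2020 quotes.
raw = [implied_probability("5/6"), implied_probability("5/6")]

print(raw)             # both implied probabilities are above 0.5
print(sum(raw))        # the sum exceeds 1.0 because of the built-in margin
print(normalize(raw))  # after rescaling, the two probabilities sum to 1
```

With a built-in margin, both candidates can clear the 0.5 threshold at once, which is why the analysis counts predictions for each candidate separately.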