Pollster Rating Wars!

With both sides surviving the 2020 Model Wars, the election forecasting blood feud continues through another election.

In April 2023, news leaked out that Nate Silver and several FiveThirtyEight staffers would be victims of Disney’s mass layoffs.  Silver’s last day working for “The Mouse” was June 30th and per his contract, ABC/Disney kept rights to the domain and the name “FiveThirtyEight” while Silver retained rights to his models.  He took his IP to Substack, ordinarily a graveyard for journalists, and started the “Silver Bulletin.” 

Seemingly to spite Silver, The Mouse hired his arch-nemesis Elliott Morris to take over the rebranded “538.”

The first shots of the “Pollster Ratings Wars” were fired at the end of June 2023, after Morris started at ABC but a few days before Silver officially walked out the door.

Morris emailed Rasmussen Reports the following questions/demands on June 29th:

My name is Elliott Morris, and I am the Editorial Director of Data Analytics at ABC News. I am responsible for our editorial analysis of polls, elections results and other data, including the output at ABC’s FiveThirtyEight.

I am emailing you to send a final notice that FiveThirtyEight is considering formally banning Rasmussen Reports from its coverage. Such a ban would result in being removed from listing on our main polls page and being excluded from all of our aggregation and election forecasting models. If banned, Rasmussen Reports would also be removed from our historical averages of polls and from our pollster ratings. Your surveys would no longer appear in reporting and we would write an article explaining our reasons for the ban.

A pollster at Rasmussen Reports (preferably you) needs to reply to this email with satisfactory comments in order to avoid the ban. To be sure, response alone is not guaranteed to end in the avoidance of a ban; our concerns run much deeper than simple failure to reply to methodological queries — which, for what it’s worth, is itself grounds for a ban, and on which we have already given Rasmussen substantial leeway.

First, Rasmussen must explain the nature of its relationship with several right-leaning blogs and online media outlets, which have given us reason to doubt the ethical operation of the polling firm. please tell us whether questions are ever suggested to Rasmussen from these outlets, including Fox News and “Steve Bannon’s War Room”, where Rasmussen’s head pollster regularly appears, with the promise of coverage in return for “public” fieldwork? Do Rasmussen’s pollsters work with anyone from these organizations on topics to consider polling, despite listing polls as un-sponsored or sponsored by other groups? Does the pollster have a close personal relationship with any of these figures that might cloud their judgement in the operation of a public poll?

Related to this, does Rasmussen Reports believe the results of the 2022 Arizona Governor election, as certified by the state’s department of elections, to be fraudulent based on the results of a 2023 survey conducted by Rasmussen reports and sponsored by College Republicans, as it stated for Mr Bannon on his programming in April of this year? Does Rasmussen Reports believe its polls can provide more precise estimates of election results than certified ballot counts by states’ secretaries of state? Does it believe the results of the 2020 election as certified are accurate? What does it view its role as in providing public opinion data on this topic?

Second, Rasmussen must answer the following questions about its methodology, which Mr Mitchell has so far failed to answer for FiveThirtyEight senior reporter Nathaniel Rakich three times:

  1. This survey seems to indicate that Rasmussen’s weighting targets or sampling strategies are not well-tuned, since the outcome of the poll does not match the observable election result. How are you addressing that methodological problem?
  2. This tweet seems to indicate that Rasmussen’s IVR polling is reaching the same people (or, at least, person) multiple times. Is the phone portion of the poll relying on a panel of some type? If not, why would the same citizen get routine calls from the same pollster?
  3. Perhaps related to #2: Your methodology states, “Calls are placed to randomly-selected phone numbers through a process that ensures appropriate geographic representation.” What is the process being applied? And what does “randomly-selected” mean here? If not RDD, where are you getting your call lists?
  4. The methodology mentions a “demographically diverse panel” for online respondents. Is this panel proprietary, or are you contracting it out? If the former, how do you recruit and ensure balanced representation on the panel? If the latter, to whom are you contracting out?
  5. The methodology mentions you weight by “age, race, gender, political party, and other factors.” What are the other factors?
  6. The methodology also states, “For political surveys, census bureau data provides a starting point and a series of screening questions are used to determine likely voters. The questions involve voting history, interest in the current campaign, and likely voting intentions.” Does this mean you are weighting first and screening second? If so, is there additional rebalancing for the LV sample? For example, women are less likely than men to say they’re definitely going to vote, but they usually make up at least half of the electorate anyway.

In addition, please tell us:

  1. Where do the benchmarks for your non-census weighting variables, such as political party, come from? Are they constant across surveys in a given year or month or do they change over time? If they change, what are the changes based on?
  2. Does Rasmussen Reports check its likely voter screens against any ongoing estimate of the proportion of the population belonging to a given party?
  3. and; How does Rasmussen Reports account for the biases introduced into its sample by not calling cell phones?

Failure to reply, or failure to notify us of an intent to speedily reply, by the end of the day on Friday, June 30th, 2023 will be taken as a final concession of our grounds for a ban. The ban would take effect imminently thereafter. 

Thank you for your time,

Elliott Morris

ABC News

[Emphasis Added]

Nate Silver immediately clapped back on Twitter.

And Silver went into a lengthy critique of the political litmus test on his first day of “free agency” on July 1st:

First, I strongly oppose subjecting pollsters to an ideological or political litmus test. Look, there might be good reasons to exclude Rasmussen based on their methodology, although I’d note that their track record of polling accuracy is average, not poor.

Second, even if you’re going to remove Rasmussen from the averages going forward, it’s inappropriate to write them out of the past, as Morris has threatened to do.

Third, I think it’s clear that the letter is an ad hoc exercise to exclude Rasmussen, not an effort to develop a consistent set of standards.

The thing about running a polling average is that you need a consistent and legible set of rules that can be applied to hundreds of pollsters you’ll encounter over the course of an election campaign. Going on a case-by-case basis is a) extremely time-consuming (don’t neglect how busy you’ll be in the middle of an election campaign) and b) highly likely to result in introducing your own biases, whether it’s the political outcome you’re rooting for or whatever you think will make your model look smart. That’s why, after 15 years of doing this, I’ve been a stickler for consistency, even if that means including some pollsters whom I subjectively don’t like, politically or methodologically.

Perhaps Morris’s questions were getting at some larger theme or more acute problem. But if so, he should have stated it more explicitly in his letter. Journalists, in most circumstances, shouldn’t act like Vincent D’Onofrio in Law & Order trying to sniff around for clues or throw a suspect off-kilter. Ask clear, concise questions that make your intentions clear.

Instead, this looks like a fishing expedition, with Morris hoping to catch Rasmussen in some sort of venial methodological sin that is probably fairly common within the industry. Or, because the questions are onerous, the tone of his email is hostile, and Carroll was only given a day-and-a-half to respond just before a four-day summer weekend, he was hoping that they wouldn’t be answered at all — so he could say “See! They refused to answer my questions!” Either way, this is the letter you get only once someone has already made their mind up.

[Emphasis in original]

Whatever else did or didn’t transpire between Morris and Rasmussen Reports, the polling firm is no longer included in 538’s polling averages today.  However, it’s still in the pollster ratings, ranked #69 out of 277 with 2.1 stars in the combined (accuracy plus transparency) ratings.  Morris has said that between 1.9 and 2.8 stars puts a pollster in “America’s core block of good pollsters.”  This is even more confusing since much lower-ranked pollsters are included in the averages.  (And Silver Bulletin’s grade for Rasmussen is a “B.”)

Below, we’re going to look at the 2024 pollster ratings for 538 (Morris) and Silver Bulletin (Silver).  We’re not trying to find the “best” system, even if that were possible, but are merely interested in differences in how each platform rates pollsters.  Because Silver merely took the pre-Morris FiveThirtyEight pollster rating methodology and updated the data, we’re basically comparing FiveThirtyEight in 2023 (Silver) versus 538 in 2024 (Morris).

Silver Bulletin (old FiveThirtyEight) pollster ratings

For the 2024 election cycle, Silver went back to FiveThirtyEight’s 2023 methodology:

These ratings restore the methodology that I used for pollster ratings when I left FiveThirtyEight in 2023. I’ve also updated them with polls and election results since the last update, namely:

  • The 2023 gubernatorial elections;
  • Special elections to Congress;
  • The 2024 Republican presidential primaries.

The 2023 FiveThirtyEight methodology is below.  (Please note: The original passage has been rearranged for brevity and clarity while preserving the original words and intent.)

Our pollster ratings are based on a metric called Predictive Plus-Minus. This metric is based on several key factors, including:

  • Simple error for polls (i.e., how far away the poll results are from the actual election margin).
  • How well other pollsters performed in the same races (i.e., whether this pollster is as good as, better than or worse than others).
  • Methodological quality (i.e., whether this pollster is conducting polls in accordance with professional standards).
  • Herding (i.e., whether this pollster appears to just be copying others’ results).

While our dataset includes several other metrics for understanding how well a pollster has historically performed, our letter grades are based entirely on Predictive Plus-Minus.

Step 1: Collect and classify polls

The polls represented in the pollster-ratings database meet our basic standards as well as three simple criteria:

  • They were conducted in 1998 or later.
  • They have a median field date within 21 days of the election date.
  • They were conducted for one of the following types of elections:
    • Presidential general elections
    • Presidential primaries or caucuses
    • U.S. Senate general elections
    • U.S. House general elections
    • Gubernatorial general elections

Step 2: Calculate simple average error

We compare the margin in each poll against the actual margin of the election and see how far apart they were. If the poll showed the Republican leading by 4 percentage points and they won by 9 instead, the poll’s simple error was 5 points…Simple error is calculated based on the margin separating the top two finishers in the election — not the top two candidates in the poll…We then calculate a simple average error for each pollster based on the average of the simple error of all its polls. This average is calculated using root-mean-square error.
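As a minimal sketch of that calculation (in Python, with made-up poll and election margins), here is how a pollster’s simple average error would be computed under this definition, using root-mean-square error across its polls:

```python
import math

def simple_error(poll_margin, actual_margin):
    # Absolute gap between the poll's margin and the certified margin,
    # both measured for the top two finishers in the election.
    return abs(poll_margin - actual_margin)

def simple_average_error(errors):
    # Per the methodology, the pollster-level average is a root-mean-square error.
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Hypothetical pollster with three polls: (poll margin, actual margin), in points.
polls = [(4, 9), (-2, -1), (7, 3)]
errors = [simple_error(p, a) for p, a in polls]
print(errors)                                   # [5, 1, 4]
print(round(simple_average_error(errors), 2))   # ~3.74
```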

Step 3: Calculate Simple Plus-Minus

We run a regression analysis that predicts polling error based on the type of election surveyed, a poll’s margin of sampling error and the number of days between the poll and the election.

We then calculate a Simple Plus-Minus score for each pollster by comparing its simple average error against the error one would expect from these factors. For instance, suppose a pollster has a simple average error of 4.6 points. By comparison, the average pollster, surveying the same types of races on the same dates and with the same sample sizes, would have an error of 5.3 points according to the regression. Our pollster therefore gets a Simple Plus-Minus score of -0.7. This is a good score: As in golf, negative scores indicate better-than-average performance. Specifically, it means this pollster’s polls have been 0.7 points more accurate than other polls under similar circumstances.

Error in polls [results] from three major components: sampling error, temporal error and pollster error (or “pollster-induced error”). These are related by a sum of squares formula:

Total Error = √(Sampling Error² + Temporal Error² + Pollster Error²)

Sampling error reflects the fact that a poll surveys only some portion of the electorate rather than everybody. This matters less than you might expect; theoretically, a poll of 1,000 voters will miss the final margin in the race by an average of only about 2.5 points because of sampling error alone — even in a state with 10 million voters.
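A quick back-of-the-envelope check of that 2.5-point figure, assuming a roughly 50/50 race and normally distributed sampling error:

```python
import math

n = 1000   # respondents
p = 0.5    # assume a roughly even race
# Standard deviation of the poll's margin (Dem% minus Rep%), in points
sd_margin = 2 * math.sqrt(p * (1 - p) / n) * 100
# Average absolute miss for normally distributed error: sigma * sqrt(2/pi)
avg_miss = sd_margin * math.sqrt(2 / math.pi)
print(round(sd_margin, 2), round(avg_miss, 2))  # ~3.16 and ~2.52 points
```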

Another concern is that polls are (almost) never conducted on Election Day itself. We refer to this property as temporal error. There have been elections when important news events occurred in the 48 to 72 hours that separated the final polls from the election, such as the New Hampshire Democratic presidential primary debate in 2008.

The final component is pollster error (what we’ve referred to in the past as “pollster-induced error”); it’s the residual error component that can’t be explained by sampling error or temporal error. Certain things (like projecting turnout or ensuring a representative sample of the population) are inherently pretty hard. Our research suggests that even if all polls were conducted on Election Day itself (i.e., no temporal error) and took an infinite sample size (i.e., no sampling error), the average poll would still miss the final margin in the race by about 2 points.
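Here is a tiny illustration of how the three components combine under the sum-of-squares formula above; the sampling and pollster values echo the figures cited, while the temporal value is just a placeholder:

```python
import math

def total_error(sampling, temporal, pollster):
    # Components combine as a sum of squares, per the formula above.
    return math.sqrt(sampling ** 2 + temporal ** 2 + pollster ** 2)

# Illustrative values: ~2.5 pts of sampling error (n = 1,000), a placeholder
# 1.5 pts of temporal error, and the ~2 pt pollster-error floor cited above.
print(round(total_error(2.5, 1.5, 2.0), 2))  # ~3.54 points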

Step 4: Calculate Advanced Plus-Minus

Simple Plus-Minus measures how well a pollster predicted the actual results.  Relative Plus-Minus measures how well a pollster did compared to other pollsters in the same election.

Advanced Plus-Minus is a combination of Relative Plus-Minus and Simple Plus-Minus, weighted by the number of other polling firms that surveyed the same race (let’s call this number n). Relative Plus-Minus gets the weight of n, and Simple Plus-Minus gets a weight of three. For example, if six other polling firms surveyed a certain race, Relative Plus-Minus would get two-thirds of the weight and Simple Plus-Minus would get one-third.

In other words, when there are a lot of polls in the field, Advanced Plus-Minus is mostly based on how well a poll did in comparison to the work of other pollsters that surveyed the same election. But when there is scant polling, it’s mostly based on Simple Plus-Minus.
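A minimal sketch of that weighting (it omits the recency and volatility adjustments described next):

```python
def advanced_plus_minus(simple_pm, relative_pm, n_other_firms):
    # Relative Plus-Minus gets weight n (other firms in the race);
    # Simple Plus-Minus gets a fixed weight of 3.
    n = n_other_firms
    return (relative_pm * n + simple_pm * 3) / (n + 3)

# With six other firms in the race, Relative Plus-Minus gets 6/9 of the weight.
print(round(advanced_plus_minus(simple_pm=-0.7, relative_pm=-1.2, n_other_firms=6), 2))  # ~-1.03
```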

Advanced Plus-Minus puts slightly more weight on more recent polls. It also contains a subtle adjustment to account for the higher volatility of certain election types, especially presidential primaries.

Step 5: Calculate Predictive Plus-Minus

Simple Plus-Minus and Advanced Plus-Minus are useful for retrospective analysis of pollsters.  For forward-looking predictions, FiveThirtyEight uses a measure called Predictive Plus-Minus.

What distinguishes Predictive Plus-Minus is that it also accounts for a polling firm’s methodological standards — albeit in a slightly roundabout way. A pollster gets a boost in Predictive Plus-Minus if it is a member of the American Association for Public Opinion Research’s Transparency Initiative or contributes polls to the Roper Center for Public Opinion Research’s archive. Participation in these organizations is a proxy variable for methodological quality.

One further complication is “herding,” or the tendency for polls to produce very similar results to other polls, especially toward the end of a campaign. A methodologically inferior pollster may be posting superficially good results by manipulating its polls to match those of the stronger polling firms.

The full formula for how to calculate Predictive Plus-Minus has evolved over the years. The formula [used in 2023] is as follows:

PPM = (max(−2, APM + herding_penalty) × disc_pollcount + prior × 18) / (18 + disc_pollcount)

disc_pollcount = the “discounted poll count”, in which older polls receive a lower weight than more recent polls.

prior = a value that changes every time pollster ratings are updated; [as of March 10, 2023] it was calculated as 0.66 – quality ⋅ 0.57 + min(18, disc_pollcount) ⋅ -0.03.

quality [used in prior] = 1 if the pollster meets the AAPOR/Roper transparency standard and 0 if it doesn’t.

ADPA = the “Average Distance from Polling Average”, how much the pollster’s average poll differs from the average of previous polls of that race — specifically, polls whose median field date was at least three days earlier.

herding_penalty = one-half of the difference between a pollster’s actual ADPA and its theoretical minimum ADPA based on sampling error (both of the pollster’s polls and the polling average they’re being compared with)

Basically, Predictive Plus-Minus is a version of Advanced Plus-Minus in which scores are reverted toward a mean, where the mean depends on both the methodological quality of the pollster and the recency of its polls. The fewer recent polls a firm has, the more its score is reverted toward this mean. So Predictive Plus-Minus is mostly about a poll’s methodological standards for firms with only a few recent surveys in the database, and mostly about its past results for those with many recent surveys.
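Translating the 2023 formula above into code makes that mean-reversion behavior easy to see. This sketch takes the herding penalty as a precomputed input, since calculating ADPA requires the underlying polls:

```python
def predictive_plus_minus(apm, herding_penalty, disc_pollcount, quality):
    # 2023 formula as quoted above.
    # quality: 1 if the pollster meets the AAPOR/Roper standard, else 0.
    # herding_penalty: assumed to be precomputed from the pollster's ADPA.
    prior = 0.66 - quality * 0.57 + min(18, disc_pollcount) * -0.03
    performance = max(-2, apm + herding_penalty)
    return (performance * disc_pollcount + prior * 18) / (18 + disc_pollcount)

# A transparent pollster with many recent polls leans on its own record;
# the same pollster with few recent polls is pulled toward the prior.
print(round(predictive_plus_minus(-0.7, 0.1, disc_pollcount=40, quality=1), 2))  # ~-0.55
print(round(predictive_plus_minus(-0.7, 0.1, disc_pollcount=3, quality=1), 2))   # ~-0.09
```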

Step 6: Convert Predictive Plus-Minus into a letter grade

We’ve translated each firm’s Predictive Plus-Minus rating into a letter grade, from A+ to D-. One purpose of this is to make clear that the vast majority of polling firms cluster somewhere in the middle of the spectrum; about 83% of polling firms receive grades in the B or C range.  Another, of course, is to make the ratings intuitive.

There are two exceptions:

  1. Pollsters that are banned by FiveThirtyEight automatically receive a grade of F. There is no Predictive Plus-Minus bad enough that it merits an F grade; if a pollster is rated “F,” that means it did something much worse than simply being bad at polling.
  2. Pollsters with a relatively small sample of polling get a provisional rating rather than a precise letter grade. An “A/B” provisional rating means that the pollster has shown strong initial results, a “B/C” rating means it has average initial results and a “C/D” rating means below-average initial results. It takes roughly 20 recent polls (or a larger number of older polls) for a pollster to get a precise pollster rating.

Again, this was edited for clarity and brevity from FiveThirtyEight’s 2023 Pollster Ratings article.  Please read the original article for more details and to find older rating methodologies.

538 (new) pollster ratings

Elliott Morris put his own mark on 538 by adding transparency scores for pollsters and a 3-star rating system on top of the combined (accuracy + transparency) scores.  [Block quotes below are edited for clarity and brevity.]

POLLSCORE

The accuracy portion is now called “POLLSCORE” (Predictive Optimization of Latent skill Level in Surveys, Considering Overall Record, Empirically). It includes error and bias, but not transparency.  Lower is better, indicating less error and bias.

Here is Morris describing POLLSCORE:

We quantify error by calculating how close a pollster’s surveys land to actual election results, adjusting for how difficult each contest is to poll. Bias is just error that accounts for whether a pollster systematically overestimates Republicans or Democrats. We average our final error and bias values together into one measure of overall accuracy…POLLSCORE tells us whether a pollster is more accurate than a theoretical replacement-level pollster that polled all the same contests. Negative POLLSCOREs are better and mean that a pollster has less error and bias than this theoretical alternative.

For example, the ABC News/The Washington Post poll had a POLLSCORE of -1.1 and Rasmussen Reports -0.5.

The first step is to calculate a Raw Error and Raw Bias metric for each poll. Raw Bias is the difference between the margin between the top two candidates in the poll and the margin between those candidates in the actual election result. Directionality matters here: A positive Raw Bias means the poll overestimated support for Democrats, and a negative Raw Bias means the poll overestimated Republicans. Raw Error is the absolute value of Raw Bias (in other words, it’s the same thing, except directionality doesn’t matter). For example, if a poll showed the Democratic candidate leading by 2 percentage points but she actually won by 5 points, that poll’s Raw Bias is -3 points (overestimating Republicans), and its Raw Error is 3 points.
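In code, Raw Bias and Raw Error for the example above would be:

```python
def raw_bias(poll_dem_margin, actual_dem_margin):
    # Margins are Democrat minus Republican, in percentage points.
    # Positive = poll overestimated Democrats; negative = overestimated Republicans.
    return poll_dem_margin - actual_dem_margin

def raw_error(poll_dem_margin, actual_dem_margin):
    return abs(raw_bias(poll_dem_margin, actual_dem_margin))

# The example above: the poll showed D+2, but the Democrat won by 5.
print(raw_bias(2, 5), raw_error(2, 5))  # -3 (overestimated Republicans), 3
```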

We calculate these values only for polls that identified the correct first- and second-place candidates. Further, we calculate Raw Bias only for polls with both Democratic and Republican candidates (we cannot detect the partisan bias of a survey if there are only two Democrats in the race).

Of course, we should expect that some surveys will have higher errors and biases than others. A poll with a sample size of 500 people, for example, has a larger margin of error than a poll of 5,000 people and should usually be less accurate. So the next thing we have to do is to calculate Excess Error and Bias.

For Excess Error, we run a multilevel regression model on every nonpartisan poll in our dataset to calculate how much error we would expect it to have based on the implied standard deviation from its sample size, the square root of the number of days between the median date of the poll and the election, plus variables for the cycle in which the poll was conducted and the type of election it sampled (e.g., presidential primary, presidential general election, gubernatorial general election, Senate general election, House general election or House generic ballot). Each poll’s Excess Error, then, is simply its Raw Error minus that expected error. Our regression weights polls by the square roots of their sample size (which we capped at 5,000 to avoid giving any one poll too much weight), the number of polls in each race and the number of pollsters surveying each race.
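538 describes a multilevel regression; as a rough, simplified stand-in (weighted least squares on hypothetical data, with the cycle and race-type variables omitted for brevity), the expected-error subtraction looks like this:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical poll-level data; column names are illustrative, not 538's.
polls = pd.DataFrame({
    "raw_error":  [3.1, 5.4, 2.2, 6.8, 4.0, 1.9, 5.1, 3.6],
    "implied_sd": [1.5, 2.2, 1.4, 3.0, 1.8, 1.2, 2.5, 1.7],   # from sample size
    "sqrt_days":  [1.0, 3.2, 1.4, 4.1, 2.0, 1.0, 3.6, 2.2],   # sqrt(days to election)
    "weight":     [22.4, 31.6, 20.0, 70.7, 25.0, 18.0, 40.0, 26.5],
})

# Simplified stand-in for 538's multilevel model: a weighted least-squares fit.
fit = smf.wls("raw_error ~ implied_sd + sqrt_days",
              data=polls, weights=polls["weight"]).fit()

# Excess Error = Raw Error minus the error the model expected for that poll.
polls["excess_error"] = polls["raw_error"] - fit.fittedvalues
print(polls[["raw_error", "excess_error"]].round(2))
```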

Finally, we calculate an Excess Error and Excess Bias score for each pollster by taking a weighted average of the Excess Error and Bias of each of its polls, with older polls given less weight. The precise amount of decay changes every time we add more polls to the database, but it’s currently about 14 percent a year. For example, a 1-year-old poll would be weighted 86 percent as much as a fresh poll; a 2-year-old poll would be weighted 74 percent (86 percent times 86 percent) as much; etc.
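A minimal sketch of that age-decay weighting, using the roughly 14 percent annual rate quoted above and made-up per-poll values:

```python
def age_weight(age_years, annual_decay=0.14):
    # Weight of a poll that is age_years old, decaying ~14 percent per year.
    return (1 - annual_decay) ** age_years

def pollster_average(values, ages):
    # Weighted average of per-poll values (e.g., Excess Error), older polls counting less.
    weights = [age_weight(a) for a in ages]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(round(age_weight(1), 2), round(age_weight(2), 2))              # 0.86, 0.74
print(round(pollster_average([1.2, -0.4, 0.3], ages=[0, 1, 4]), 2))  # ~0.42
```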

So the next step in our process is to adjust each poll’s Excess Error and Bias based on how difficult of a race it polled. First, for every poll in our database, we calculate the weighted average Excess Error and Bias for all other polls in the race, with polls with larger sample sizes given higher weight. Then, we subtract those values from the Excess Error and Bias of the poll. That gives us the Relative Excess Error and Bias of every poll.

Finally, we calculate statistics called Adjusted Error and Adjusted Bias, which are a weighted combination of each poll’s Excess Error/Bias and Relative Excess Error/Bias, where the Relative Excess statistics get more weight when more pollsters release more surveys in a given race. We make this adjustment to reflect the fact that we are more confident in a pollster’s relative performance in a race when the benchmark we’re comparing them against (all the other pollsters in a race) is based on a larger sample of data.

ABC News/The Washington Post had a Predictive Error of -1.0 and a Predictive Bias of -1.2 (average -1.1).  Rasmussen Reports received -0.4 and -0.6 (average -0.5), respectively.  Note that these averages make up each pollster’s POLLSCORE.

Transparency

We now also score firms based on their methodological transparency. To do this, we have quantified how much information each pollster released about every poll in our archive since 2016…Each poll gets 1 point for each of 10 criteria it meets, ranging from whether it published the actual question wording of its poll to whether it listed sample sizes for key subgroups. We give each pollster a Transparency Score based on the weighted average of the scores of its individual polls and whether it shares data with the Roper Center for Public Opinion Research at Cornell University or is a member of the American Association for Public Opinion Research’s Transparency Initiative.

  1. Did the pollster publish the exact trial-heat question wording used in this poll?
  2. Did the pollster publish the exact question wording and response options for every question mentioned in the poll release?
  3. Did the pollster release both weighted and unweighted sample sizes for any demographic groups, or acknowledge the existence of a design effect in their data?
  4. Did the pollster publish crosstabs for every subgroup mentioned in the poll release?
  5. Did the pollster disclose the sponsor of the poll (if there was a sponsor)?
  6. Did the poll specify how the sample was selected (e.g., via a probability-based or non-probability method)? If the sample was probability, was the sampling frame disclosed? If non-probability, did the pollster disclose what marketplace or online panels were used to recruit responses or its model for respondent selection?
  7. Did the pollster list at least three of the variables the poll is weighted on?
  8. Did the pollster disclose the source of its weighting targets (e.g., “the 2022 American Community Survey”)?
  9. Did the poll report a margin of error or sample size for a “critical mass” of subgroups? We do not mandate this be a complete count, but if it looks like groups are intentionally missing (e.g., they are referenced in the press release but are missing in the crosstab documents), we withhold the point.
  10. Did the poll methodology or release include a general statement acknowledging a source of non-sampling error, such as question wording bias, coverage error, etc., in addition to the normal margin of sampling error inherent to surveying?

We award each poll a 0, 0.5 or 1 on each question; a perfect score is 10, while the worst is 0…We then calculate a Transparency Score for each pollster by taking a weighted average of the Transparency Scores of all its polls, with the Transparency Scores of older polls getting less weight.

The final Transparency Score for a given pollster is a weighted average of all of the above, with 70 percent of the weight going to its directly measured transparency and 30 percent of the weight on its implied transparency — a value of 10 if it is a member of the AAPOR/Roper group, and 0 if it is not, with one exception: If a pollster’s directly measured transparency is 8 or higher and it has released more than 15 polls, we treat it as an honorary member of the AAPOR/Roper group.
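A sketch of that 70/30 blend, including the honorary-membership exception (the inputs are illustrative):

```python
def transparency_score(direct_score, is_aapor_roper, n_polls):
    # 70% directly measured transparency (0-10), 30% implied transparency,
    # with the "honorary membership" exception described above.
    honorary = direct_score >= 8 and n_polls > 15
    implied = 10 if (is_aapor_roper or honorary) else 0
    return 0.7 * direct_score + 0.3 * implied

print(round(transparency_score(9.0, is_aapor_roper=False, n_polls=40), 1))  # 9.3 (honorary)
print(round(transparency_score(6.0, is_aapor_roper=True, n_polls=10), 1))   # 7.2
print(round(transparency_score(6.0, is_aapor_roper=False, n_polls=10), 1))  # 4.2
```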

We developed this metric in collaboration with Mark Blumenthal, a pollster, past 538 contributor and co-founder of the (now sadly defunct) poll-aggregation website Pollster.com. Blumenthal has found that pollsters that released more information about their work tended to be more accurate during the 2022 election cycle.

ABC News/The Washington Post received a transparency score of 9.2 and Rasmussen received 3.3.

Combined 3-star score

This ranking procedure yields a single, combined score that summarizes each pollster’s performance along both dimensions. For interpretability, we convert this value to a star rating, between 0.5 and 3.

Only the best of the best will get 3.0 stars; these are pollsters who score in the 99th percentile or better for both accuracy and transparency. Pollsters scoring between 2.8 and 3.0 are still very good — just not the best of the best. Most pollsters score between a 1.9 and 2.8, representing what we see as America’s core block of good pollsters. Pollsters between 1.5 and 1.9 stars are decent, but they typically score poorly on either accuracy or transparency. Generally, we are very skeptical of pollsters that get less than 1 star, as they both have poor empirical records and share comparatively little about their methodology. A 0.5-star rating — the bare minimum — is reserved for pollsters that have records of severe error or bias or are disclosing only the bare minimum about their polls.

We use something called Pareto Optimality [or “Pareto Efficiency”]. In layman’s terms, the “best” pollster in America would theoretically be one with the lowest POLLSCORE value and a Transparency Score of 10 out of 10. But in practice, the pollster with the lowest POLLSCORE and the pollster with the best Transparency Score are different pollsters. So our algorithm determines the “best” pollster to be the one that is closest to ideal on both metrics, even if it is not the best on any one metric.
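538 doesn’t publish the exact distance calculation, so the sketch below simply normalizes both metrics and ranks pollsters by Euclidean distance to the ideal corner (the best POLLSCORE in the set and a Transparency Score of 10); the inputs are hypothetical:

```python
import math

def rank_pollsters(pollsters):
    # pollsters: dict of name -> (pollscore, transparency_score).
    # Rank by distance to the ideal corner: best POLLSCORE seen, transparency of 10.
    # The distance metric itself is an assumption, not 538's published method.
    best_ps = min(ps for ps, _ in pollsters.values())
    worst_ps = max(ps for ps, _ in pollsters.values())
    span = (worst_ps - best_ps) or 1.0

    def distance(item):
        ps, transparency = item[1]
        d_ps = (ps - best_ps) / span     # 0 = best POLLSCORE in the set
        d_tr = (10 - transparency) / 10  # 0 = perfect transparency
        return math.hypot(d_ps, d_tr)

    return [name for name, _ in sorted(pollsters.items(), key=distance)]

# Hypothetical inputs: B has the best POLLSCORE, but A is strong on both axes
# and ends up "closest to ideal."
print(rank_pollsters({"A": (-1.1, 9.2), "B": (-1.3, 4.0), "C": (-0.5, 3.3)}))
# ['A', 'B', 'C']
```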

What we really want from our ratings are not single point predictions of each pollster’s POLLSCORE, Transparency Score and final rank, but distributions of them. We need a way to calculate how much each pollster’s scores change if you ignore certain good (or bad) polls it has released. The method we turn to for this is called “bootstrapping.” To bootstrap our model essentially means re-running all the steps we’ve described so far 1,000 times. Each time, we grade pollsters based on a random sample of their polls in our database. As is standard in bootstrapping, we sample the polls with replacement, meaning individual polls can be included multiple times in the same simulation. We do this to keep the number of polls we have for each pollster constant across simulations. In the end this procedure yields 1,000 different plausible pollster scores for each organization.
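A stripped-down version of that bootstrap, where the full POLLSCORE pipeline is replaced by a stand-in scoring function:

```python
import random
import statistics

def bootstrap_scores(polls, score_fn, n_sims=1000, seed=538):
    # Re-score a pollster from resamples (with replacement) of its own polls,
    # keeping the number of polls constant across simulations.
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        resample = [rng.choice(polls) for _ in polls]
        sims.append(score_fn(resample))
    return sims

# Toy example: per-poll "adjusted error" values; the stand-in score is their mean
# (the real pipeline would rerun all of the steps described above).
polls = [1.2, -0.4, 0.3, 2.5, -1.1, 0.8]
sims = bootstrap_scores(polls, score_fn=statistics.mean)
print(round(statistics.median(sims), 2))  # median of 1,000 simulated scores
```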

Finally, we calculate the median of these simulated POLLSCORE and Transparency Scores. Pollsters are re-ranked according to the algorithm described [above]. Compared to our point predictions, the bootstrapped results change rather little for most pollsters, but they punish those who score well because they got lucky once or twice and reward pollsters that more reliably have lower bias and error than other firms. As a final hedge against modeling error here, we average the bootstrapped results and point predictions together.  These numbers are the final ratings you see on our dashboard.

ABC News/The Washington Post received a final rating of 3 stars (ranked #1) and Rasmussen received 2.1 stars (ranked #69).

Please check out What are the best pollsters in America? and How 538’s pollster ratings work for 538’s entire description of their 2024 ratings and methodology.

Comparison

Let’s look at the top 20 pollsters for 538 (combined star ratings) and Silver Bulletin.  The top pollsters mostly overlapped, with 13 of the top 20 shared between Silver and 538, albeit in different orders.  There are 12 colleges in 538’s top 20, as opposed to 8 in the Silver Bulletin’s.

| 538 Top 20 | Rank | Silver Bulletin Top 20 |
|---|---|---|
| The New York Times/Siena College | 1 | Selzer |
| ABC News/The Washington Post | 2 | The New York Times/Siena College |
| Marquette University Law School | 3 | ABC News/The Washington Post |
| YouGov | 4 | Research & Polling Inc. |
| Monmouth University Polling Institute | 5 | SurveyUSA |
| Marist College | 6 | Siena College |
| Suffolk University | 7 | Marquette University Law School |
| Data Orbital | 8 | AtlasIntel |
| Emerson College | 9 | Beacon Research/Shaw & Co. Research |
| University of Massachusetts Lowell Center for Public Opinion | 10 | Marist College |
| Muhlenberg College Institute of Public Opinion | 11 | Cygnal |
| Selzer & Co. | 12 | Monmouth University |
| University of North Florida Public Opinion Research Lab | 13 | Landmark Communications |
| SurveyUSA | 14 | Emerson College |
| Beacon Research/Shaw & Co. Research | 15 | MassINC Polling Group |
| Christopher Newport University Wason Center for Civic Leadership | 16 | University of Massachusetts Lowell |
| Ipsos | 17 | University of North Florida |
| MassINC Polling Group | 18 | TIPP Insights |
| Quinnipiac University | 19 | Public Policy Institute of California |
| Siena College | 20 | CBS News/The New York Times |

Spearman’s rank correlation coefficient is similar to Pearson correlation but applied to ranked variables.  The coefficient runs from -1 to 1, with 1 indicating identical rankings and -1 indicating perfectly reversed rankings.  The coefficient here is 0.573 (p = 2.7×10⁻²¹), indicating moderate correlation between Silver Bulletin and 538.  It should be noted that 538 now ranks only 277 pollsters while Silver Bulletin ranks 516; we only looked at the 228 pollsters that overlap, matched on ‘Pollster Rating ID’.  The missing pollsters shouldn’t be an issue because the coefficient is based on relative ranks within the overlapping set rather than on the absolute scores.
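For readers who want to reproduce this kind of comparison, the coefficient can be computed with scipy; the ranks below are illustrative, not the actual 228-pollster dataset:

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative ranks only; the 0.573 figure in the text comes from the
# 228 pollsters that overlap on Pollster Rating ID.
rank_538    = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
rank_silver = np.array([2, 1, 5, 3, 4, 8, 6, 10, 7, 9])

rho, p_value = spearmanr(rank_538, rank_silver)
print(round(rho, 3), p_value)
```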

The pollsters with the greatest disparity in rankings are:

| Pollster | 538 Rank | Silver Bulletin Rank |
|---|---|---|
| Roanoke College Institute for Policy and Opinion Research | 31 | 478 |
| Mitchell Research & Communications | 77 | 497 |
| SurveyMonkey | 94 | 464 |
| Targoz Market Research | 89 | 451 |
| Rutgers University Eagleton Center for Public Interest Polling | 58 | 418 |
| Baldwin Wallace University Community Research Institute | 134 | 482 |
| American Research Group | 59 | 406 |
| SoonerPoll.com | 118 | 456 |
| Marketing Resource Group (MRG) | 124 | 460 |
| Meredith College Department of History, Politics, and International Studies | 146 | 481 |

This is a little misleading because of the difference between the two in the number of pollsters ranked, so let’s look at quartiles.

Pollsters in the Upper Quartile for 538 but in the Bottom Quartile for Silver Bulletin:

| Pollster | 538 Rank | Silver Bulletin Rank |
|---|---|---|
| Roanoke College Institute for Policy and Opinion Research | 31 | 478 |
| Rutgers University Eagleton Center for Public Interest Polling | 58 | 418 |
| Mitchell Research & Communications | 77 | 497 |

Pollsters in the Upper Quartile for Silver Bulletin but in the Bottom Quartile for 538:

| Pollster | 538 Rank | Silver Bulletin Rank |
|---|---|---|
| North Star Opinion Research | 228 | 95 |
| GBAO | 230 | 117 |
| co/efficient | 237 | 56 |
| Garin-Hart-Yang Research Group | 244 | 57 |
| Harstad Strategic Research | 247 | 109 |
| Trafalgar Group | 273 | 67 |

Excluding 538 transparency scores

Rather than using 538’s combined score, if we rank 538’s pollsters by POLLSCORE alone, with no transparency score, the two rankings move even closer together.  The Spearman rank correlation coefficient is 0.641 (p = 5.13×10⁻⁵⁴).  Because of the large jump in the number of pollsters that 538 gives a POLLSCORE, from 277 to 539, we now have 455 overlapping pollsters between 538 and Silver Bulletin.

With only one decimal place for the POLLSCORE rating, it’s difficult to specify an exact rank, but the ABC News/The Washington Post poll would now be ranked by 538 somewhere between #5 and #10, while Rasmussen Reports would rise to somewhere between #36 and #46.

Conclusion

The 538 and Silver Bulletin ratings for 2024 are generally similar, even though Silver is staying the course with his methodology and Morris is clearly changing the direction of 538 now that he’s in charge.  We won’t know until after the election, at the earliest, who was right (or more right).

And both are wrong to call this a “pollster” rating since these are actually “polling firm” ratings.  As the business of politics continues to grow and consolidate, polls from large polling firms cease to be a unified work product.  Much more than Morris’s transparency questions, tracking the individual(s) responsible for a poll would increase the value of the rating system, and the integrity of the polling industry in general.  As long as it can land clients, a bad pollster will survive, if not prosper, in politics.

Additional Information

Silver Bulletin

Silver Bulletin model methodology
Silver Bulletin pollster ratings
FiveThirtyEight pollster rating methodology (2023)
Historic FiveThirtyEight pollster rating data
*Silver Bulletin pollster rating data is behind the paywall.

538

What are the best pollsters in America?
How 538’s pollster ratings work
Pollster Dashboard
538 pollster rating data (2024)

Silver v. Morris 2020

Who Won the 2020 model wars?