Buffalo Bills fans associate analytics and advanced metrics with Pro Football Focus. Bills fans associate Pro Football Focus with Josh Allen hatred. Therefore, by the transitive property, Bills fans associate analytics and advanced metrics with Josh Allen hatred.
For two years, a large subsection of Bills Mafia has railed against the usage of advanced metrics to measure divisive starting quarterback Josh Allen, explaining that such measurements don’t take into consideration the full benefit that Allen brings to the Western New York team. This year, as the advanced metrics that once concluded Allen was a mediocre or bad quarterback paint him in a different light, that narrative is beginning to change. But with the advent of so many different ways to measure a quarterback holistically, it can be difficult to ascertain which metric should be used in which case, apart from the classic “I’ll use whichever one supports my current social media argument” approach. Arguments even devolve into the least nuanced and contextual QB argument ever: the wins argument. Wins are not a quarterback metric and contain the absolute least amount of value humanly possible when discussing how the quarterback performed, aside from “starts.”
I think we can do better.
The cracks in the armor of any metric-based argument start to reveal themselves when we recognize that all quarterback metrics are inherently flawed. Raw metrics like passing yards and completion percentage lack context, but human beings are necessary to provide the context, so the introduction of that context we crave brings with it the possibility of bias and human error. We could use a raw ratio like yards per attempt that might make us feel better, but it still doesn’t provide the context that comes from things like drops and average depth of target. We might look further, then, and use mathematical formulas like passer rating or adjusted net yards per attempt. Those formulas were created by a human, with each underlying “sub metric” weighted in the way that the human creating the formula deemed necessary. There is no way to avoid either lack of context or human involvement.
But what if we use the strength of one metric to compensate for the flaws in another? Doing so requires the consumers to a.) set aside their ego and attachments to which QB metric is “best” and b.) recognize the strengths and weaknesses of the metrics utilized in the composite. In addition, some of these metrics cannot be calculated by an average user because their formulas are not public. While this adds a “cloak and dagger” feeling to the associated metric in the eyes of some fans, the presence of more transparent metrics alongside the opaque ones can help us to identify any outliers, further strengthening the case for using multiple metrics instead of one.
Before we get started, let’s define “holistic metric.” A holistic metric is one that is designed to be broad and encompass a large portion of the entirety of a quarterback’s play for a given game, season, or career. Such metrics typically take many factors into account and are intended by their creators to be wide-ranging.
For the purposes of this test, I have selected the following holistic quarterback metrics to use in creating a new holistic composite average:
Adjusted net yards per attempt (ANY/A) (obtained from Pro Football Reference)
This metric is a mathematical formula with a small amount of human intervention: (pass yards + 20*(pass TD) - 45*(interceptions thrown) - sack yards) / (passing attempts + sacks). It does what its name suggests: it calculates the adjusted passing yards gained by the team every time the QB drops back to pass. You’ll notice there is a weight added for touchdowns and interceptions (this is the human intervention in the metric) based on the research of Chase Stuart of Pro Football Reference. The strengths and weaknesses of the metric are as follows:
- Strength—minimal opportunity for human error, because the only human involvement is the weighting: a yardage bonus for touchdowns and a yardage penalty for interceptions.
- Weakness—like a lot of metrics, this one doesn’t entirely isolate the play of the QB at a micro level. If the QB delivers a good pass that gets dropped, it’s still a negative in this metric. The larger the sample size, the more this part washes out, but game-by-game it still needs to be recognized. Also, this metric does not account for quarterback contributions in the run game.
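The ANY/A formula above is simple enough to sketch in a few lines of Python. The function name, parameter names, and the stat line below are assumptions made up for illustration, not any real quarterback’s numbers.

```python
def any_per_attempt(pass_yards, pass_tds, interceptions, sack_yards, attempts, sacks):
    """ANY/A with the weights quoted above: +20 yards per TD, -45 yards per INT."""
    dropbacks = attempts + sacks
    if dropbacks == 0:
        raise ValueError("ANY/A is undefined with zero dropbacks")
    adjusted_yards = pass_yards + 20 * pass_tds - 45 * interceptions - sack_yards
    return adjusted_yards / dropbacks


# Hypothetical stat line: 300 yards, 3 TD, 1 INT, 2 sacks for 15 yards, 35 attempts.
print(round(any_per_attempt(300, 3, 1, 15, 35, 2), 2))  # 8.11
```

Note that a dropped pass still counts as an attempt with zero yards, which is exactly the weakness described above.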
Expected points added per play (EPA/play) (obtained from ESPN.com)
EPA/play is actually one of the easier metrics to explain. When a team has the ball on a specific yard line and a certain down and distance with a certain amount of clock remaining, there are thousands of examples of that exact same down/distance/time trifecta occurring. These thousands of examples, taken together, can give us an average of how many points it would be expected that the team would score on that drive. Once a play is completed, there will be a new down and distance and, as such, a new “expected points” total for that drive. The difference between the starting “expected points” number and the ending “expected points” number is “EPA.” Let’s discuss the strengths and weaknesses of the metric:
- Strength—does a better job of accounting for the “success” of a play than yardage. Gaining three yards on 2nd-and-10 is a markedly less successful play than gaining three yards on 3rd-and-2. In addition, this accounts for a quarterback who contributes running the ball because the metric does not know or care if the successful play was done through the air or on the ground. It also accounts for some quarterbacks being on the field for more plays than others by utilizing the “per play” denominator.
- Weakness—like ANY/A above it, this is a results-based metric and does not have the context needed to know if the QB contributed positively to a play and was let down by a failure outside of his control. The “per play” denominator can hurt the metric if a QB is not present for enough snaps to provide a reasonable sample size.
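The arithmetic behind EPA/play can be sketched as follows. The expected-points values here are invented for illustration; a real model would derive them from historical down, distance, field-position, and clock data, and would also handle scoring plays and changes of possession, which this sketch ignores.

```python
from statistics import mean

# Each tuple: (expected points before the snap, expected points after the play).
# All values are hypothetical.
plays = [
    (0.9, 1.6),  # e.g. a solid gain on first down
    (1.6, 1.1),  # e.g. an incompletion
    (1.1, 2.4),  # e.g. a third-down conversion
]


def epa(ep_before, ep_after):
    """Expected points added by a single play."""
    return ep_after - ep_before


def epa_per_play(plays):
    """Average EPA across a list of (before, after) expected-points pairs."""
    return mean(epa(before, after) for before, after in plays)


print(round(epa_per_play(plays), 2))  # 0.5
```

Nothing in the calculation knows whether a play was a pass or a run, which is why a running quarterback gets credit here.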
Total QBR (obtained from ESPN.com)
ESPN’s proprietary quarterback metric is best described as “EPA plus.” It takes the foundation of EPA/play described above and attempts to mitigate the weaknesses that have started to pop up in the metrics we’ve discussed thus far by assigning credit and blame across different position groups for a given play, based on factors such as how far the pass travels in the air, what percentage of yards were gained after the catch, and whether the quarterback was under pressure. As an example, if a quarterback throws a five-yard out route to a receiver who has two yards of separation from the nearest defender at the time of the catch, that play has happened many times before and QBR can expect a certain amount of yards after the catch from that receiver based on that prior data. If that receiver gets more than those expected yards, that benefit is not assigned to the quarterback, and if that receiver gets less than those expected yards, that blame is not assigned to the quarterback. Like all metrics, Total QBR has strengths and weaknesses:
- Strength—it is the first metric thus far utilized that accounts for “garbage time stats,” a common counter to any raw statistical argument. Total QBR judges win probability at the time of the snap and if chances of victory at that time are incredibly slim, the value of the play as assigned to the quarterback is lessened. Total QBR accounts for quarterbacks rushing and a timely third-down conversion by a running quarterback is better reflected in Total QBR than in any metric discussed up to this point.
- Weakness—Total QBR is an efficiency statistic, not a total value statistic. If someone is REALLY efficient and drops back to pass 12 times with two designed runs, they may have been incredibly efficient, but they didn’t have a large impact on the game specifically because they didn’t impact a large portion of the plays in a significant way.
Passer Rating (obtained from Pro Football Reference)
Passer rating was adopted by the NFL in 1973 and is still used today to measure quarterbacks. Much like ANY/A discussed above, it is a weighted formula that uses raw statistics applied to a human-created construct to grade a player. In this case, the scale is from 0.0 to 158.3. The formula is as follows:
ATT = Number of passing attempts
COMP = Number of completions
YDS = Passing yards
TD = Touchdown passes
INT = Interceptions
A = (COMP/ATT - 0.3) x 5
B = (YDS/ATT - 3) x 0.25
C = (TD/ATT) x 20
D = 2.375 - (INT/ATT x 25)
If the result of any calculation is greater than 2.375, it is set to 2.375. If the result is a negative number, it is set to zero.
Then, the above calculations are used to complete the passer rating:
Passer rating = ((A + B + C + D) / 6) x 100
Clear as can be, right? The important thing to take away is that the raw ratios used (in this case, completions per attempt, yards per attempt, touchdowns per attempt, and interceptions per attempt) are each assigned a weight based on the value perceived by the creators back in the 1970s (Don Smith, Seymour Siwoff and Don Weiss headed the task force created by then-NFL commissioner Pete Rozelle to develop a better system for measuring quarterbacks).
- Strength—like ANY/A, the only human involvement in passer rating is the human decision on the weighting of individual raw ratios. In addition, the test of time matters. This metric has been utilized longer than any other holistic quarterback descriptor to measure the play of the signal callers in the NFL. The list of all-time passer-rating leaders (Aaron Rodgers, Deshaun Watson, Russell Wilson, Drew Brees, Dak Prescott, Tony Romo, Tom Brady, Steve Young) mostly demonstrates how much more efficient passing has become since this metric was devised, but the year-over-year rankings provide a significant sample size showing it to be a flawed but digestible way of measuring quarterback passing efficiency.
- Weakness—there is enough sample size to show that passer rating weights completion percentage higher than a lot of modern analysts would think is appropriate, inflating the positions of historically checkdown-heavy and risk-averse quarterbacks like Derek Carr and Chad Pennington. The metric also fails to account for any quarterback running contribution, is not weighted for garbage time, and does not account for a great pass dropped by the receiver. It also does not recognize the quarterback’s role in taking sacks.
Defense-adjusted value over average (DVOA) (obtained from Football Outsiders)
The least transparent metric on this list but, oddly enough, the easiest to explain: DVOA measures a team’s efficiency by comparing success on every single play to a league average based on situation and opponent. I would encourage you to read the article on the Football Outsiders website that explains DVOA in detail, but I will attempt to summarize here. Imagine the offspring of a three-way union of EPA/play, success rate, and ANY/A and you will start to understand where DVOA fits among the metrics. “Success” is determined by whether or not a play on first down gains 45 percent of the needed yards, a play on second down gains 60 percent of the needed yards, and a play on third or fourth down gains the entire amount of needed yards to convert to either a new set of downs or a touchdown. This is the first intervention of human decision-making, based on a pre-determined definition of “success.” A successful play is assigned “success points” behind the scenes, with big plays assigned differing amounts of points (this is where the second part of human weighting comes in, much like the human weighting decisions with passer rating and ANY/A). Plays in the red zone are also given additional weight, and turnovers and penalties carry negative implications for the metric.
The above calculation creates a “success value” for each play, which is then compared to all other teams who ran a play in that same environment to determine whether the team being graded performed above or below expectations based on every other team. Then that success or failure relative to league average is weighted against the defense the team is facing and, specifically, how other offenses have performed against THAT DEFENSE in THAT EXACT SITUATION. This is where the “defense-adjusted” portion of the name comes into play.
The final step in DVOA is to normalize the ratings so that “average” shows as zero. Positive play for the offense will show up above zero, and positive play for the defense will show up below zero.
- Strengths—although there is a great deal of human involvement in DVOA, it all occurs prior to the metric being run, much like ANY/A or passer rating. As such, DVOA is not biased towards or against certain players or teams, despite what any fans may tell you. It also attempts to incorporate an extremely large number of variables and is more ambitious in its scope than any other metric at this point on the list. In addition, it is the only one that accounts for the defense being played against, which adds a valuable check to our rapidly expanding composite. This metric has weights built in for playing from behind or with a lead in the fourth quarter and also games played indoors versus outdoors, further adding to the breadth of its value.
- Weakness—any metric that incorporates “success rate” runs into situations in a game where “success” does not revolve around gaining the maximum number of yards, chief among them plays that are designed to kill clock. Because the metric counts defensive pass interference as a pass play, an ill-advised throw into good coverage that is bailed out by overly aggressive cornerback play or a flag-happy official doesn’t reflect on the quarterback as it should. DVOA attempts to rectify a lot of issues, but it is ultimately still a results-based metric and carries the weaknesses that come with that.
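DVOA itself is proprietary, so it can’t be reproduced here, but the “success” thresholds described above (45 percent on first down, 60 percent on second down, the full distance on third or fourth down) are easy to sketch. This covers only that one building block, not the success points, the league-average comparison, or the opponent adjustment, and the function name is my own.

```python
# Share of the needed yards a play must gain to count as a "success", by down.
SUCCESS_THRESHOLDS = {1: 0.45, 2: 0.60, 3: 1.0, 4: 1.0}


def is_successful_play(down, yards_to_go, yards_gained):
    """Return True if the play meets the success threshold for its down."""
    if down not in SUCCESS_THRESHOLDS:
        raise ValueError("down must be 1, 2, 3, or 4")
    return yards_gained >= SUCCESS_THRESHOLDS[down] * yards_to_go


# The same three-yard gain in two different situations.
print(is_successful_play(down=2, yards_to_go=10, yards_gained=3))  # False
print(is_successful_play(down=3, yards_to_go=2, yards_gained=3))   # True
```

A kneel-down to end the game fails this check on every down, which is the clock-killing weakness noted above.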
Pro Football Focus Grade (PFF grade) (obtained from Pro Football Focus)
I have mentioned several times so far that a well-placed throw from a quarterback that is dropped by the receiver is not accounted for in any results-based metric.
It is accounted for in the PFF grade.
As mentioned in the opening, human context added to metrics adds the possibility of human error and human bias into the equation. If we use only the PFF grade in a vacuum, we obtain the context that the other metrics lack, but we also bring human bias along with it.
PFF grading does not utilize any formula and is unlike any other metric on this list. It is a film grade based on a set of rules. It’s important to know what those rules are (see their explanation of their grading system on their website), but each player is graded on a scale from -2 to +2 on every play, with 0 being average. A grade of 60 is considered average at the end of the game and it’s the point from where all grading starts. Plays where the grader is very unclear about the intent of the player are given a grade of 0 so as not to influence the score positively or negatively. Season-level grades are not a composite of individual game grades, as extra weight is applied for consistency.
- Strength—As mentioned above, great throws that are dropped are accounted for in this metric, along with the inverse. If a quarterback makes a horrendous throw that is dropped by a defender, every metric utilized prior to the PFF grade as part of this composite will view it only as an incompletion, even though the throw is just as bad as it was if the defender had better hands. If a receiver has a ball placed right in his hands on a drag route, that play will be graded more positively than if the ball is high and inside, requiring him to make an acrobatic catch. This is the main value of utilizing the PFF grade as part of this composite—it does something none of the other metrics do by introducing film into the equation and helping add context.
- Weakness—PFF grades are not entirely devoid of results-based influences, lest the paragraph above confuse you. The context and gravity of the play still influence the grade. In PFF’s grading scale, the example used for a -2 play (the lowest possible grade on a given play) is: “2009 NFC Championship Game, tie game, FG range, Favre throws across his body for an INT.” If this throw occurs in the third quarter when Favre’s team is driving and up by seven, this play may get a -1.5 instead of a -2 because the gravity of the moment is affecting the grade, even though the effect on the game in totality may be similar. In addition, the subjective grading aspect adds a “what SHOULD the player have done” element to all quarterback plays, which is inherently volatile. An example would be throws that are altered by the quarterback due to receiver leverage or coverage, where the person grading the play may believe the ball is poorly placed when in reality the ball was placed as the coaching instructed, but a miscommunication between the receiver and quarterback makes it appear as an off-target pass.
Completion percentage over expectation (CPOE) (obtained from NFL NextGen Stats)
The discussion we had on EPA earlier in this article will help us understand CPOE here. When a quarterback throws a 15-yard dig route to a receiver with three yards of separation from the nearest defender and two yards of separation between the QB and the nearest pass rusher, that play has happened many times before in the NFL. We know whether each one of those plays ended in a completion or not and that mass quantity of data can help us determine an “expected completion percentage” based on that particular throw situation. (You can read more here at the following link.) Ten different in-play factors are utilized to determine this, and then the quarterback being measured can have his actual completion percentage cross-referenced against the completion percentage that would be expected based on the types of situations present at the time of his throws to generate CPOE.
- Strengths—CPOE accounts for depth of target, receiver separation, and pressure and the effects they have on a quarterback’s basic ability to complete passes. A high or low raw completion percentage can often be heavily influenced by depth of target and offensive system, so quoting it on its own strips away the context needed to interpret it. CPOE takes the main responsibility of a quarterback (complete passes) and considers the factors that impact his ability to do so when judging it.
- Weakness—Because CPOE is the least-holistic metric on this list, it does not account for a litany of items, such as quarterback running, sacks, bad passes that are caught, good passes that are dropped, and a lot of things that have been included in previous metrics.
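The final CPOE arithmetic is just actual completion percentage minus expected completion percentage. A minimal sketch, assuming each throw already comes with an expected completion probability (in reality that number comes from the NextGen Stats model, which is not reproduced here); the throws below are hypothetical.

```python
def cpoe(throws):
    """Completion percentage over expectation, in percentage points.

    throws: list of (completed, expected_completion_probability) pairs.
    """
    if not throws:
        raise ValueError("need at least one throw")
    actual = sum(1 for completed, _ in throws if completed) / len(throws)
    expected = sum(probability for _, probability in throws) / len(throws)
    return (actual - expected) * 100


# Four hypothetical throws with made-up expected completion probabilities.
throws = [(True, 0.80), (True, 0.55), (False, 0.70), (True, 0.35)]
print(round(cpoe(throws), 1))  # 15.0
```

In this example the passer completed 75 percent of throws that an average passer would be expected to complete 60 percent of the time, for a CPOE of +15.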
As you can see, each of these metrics carries with it strengths and weaknesses, and no one metric can perfectly evaluate the play of a quarterback in any given game, season, or career. However, recognizing the strengths and weaknesses of the metrics used in this composite allows us to feel good that any flaw in one will not only be balanced by the use of an average, but will also show up as an outlier in the event that the weakness is extreme enough to greatly influence the quarterback’s place relative to other metrics.
The final step necessary is to discard the raw metric assigned to each quarterback and utilize the RANKING of the metric instead, and then use an average of those ranks. This allows my final holistic composite average to weight the final metric relative to the peers of the QB. The final product for Josh Allen looks like this:
If we take Allen’s current ranks in each of these holistic metrics and average them, we get a holistic composite average of 7.14 (a reminder that lower is better). We have retained the individual ranks, so we can see that CPOE weights Allen the lowest of all the utilized metrics. We know from the discussion above that, as an example, CPOE does not account for quarterback rushing. Knowing that information, we now have context as to why the metrics have averaged out the way they have, and we know that the presence of EPA/play, DVOA, QBR, and the PFF grade (which do account for QB rushing) are helping to offset that flaw.
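The rank-then-average step can be sketched in Python as follows. Every number here is hypothetical and is not taken from the actual rankings; the function names are my own, and the ranking helper assumes a higher raw value is better, which holds for all seven metrics used in this composite.

```python
from statistics import mean


def to_ranks(raw_by_qb):
    """Convert {qb: raw value} into {qb: rank}, where 1 is best."""
    ordered = sorted(raw_by_qb.values(), reverse=True)
    return {qb: ordered.index(value) + 1 for qb, value in raw_by_qb.items()}


def composite_average(ranks):
    """Average one quarterback's ranks across metrics (lower is better)."""
    return mean(ranks.values())


def biggest_outlier(ranks):
    """Name the metric whose rank sits furthest from the composite average."""
    average = composite_average(ranks)
    return max(ranks, key=lambda metric: abs(ranks[metric] - average))


# Hypothetical raw passer ratings for three made-up quarterbacks.
print(to_ranks({"QB A": 107.2, "QB B": 98.4, "QB C": 112.9}))
# {'QB A': 2, 'QB B': 3, 'QB C': 1}

# Hypothetical ranks for one quarterback across the seven metrics.
ranks = {"ANY/A": 3, "EPA/play": 5, "QBR": 2, "Passer rating": 4,
         "DVOA": 6, "PFF grade": 3, "CPOE": 12}
print(composite_average(ranks))  # 5
print(biggest_outlier(ranks))    # CPOE
```

Keeping the individual ranks around, rather than only the average, is what makes it possible to spot the outlier and then ask which known weakness of that metric explains it.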
Let’s look at Houston Texans quarterback Deshaun Watson:
Deshaun Watson has an average of 9.14 so far this year. When we look at the metrics pulling the average down, we know that his QBR and EPA/play won’t be boosted by running as much as they will be for Allen, but his passer rating and ANY/A are higher, indicating more efficiency in results through the air. We also know that EPA/play is a metric heavily weighted toward team performance, but his PFF is markedly higher. This reinforces the narrative that Watson is a good quarterback who is currently being let down by his team. The fact that the shape of the radar graph for a quarterback with this phenomenon is a teardrop is 100 percent intentional on my part.
One more for good measure:
Aaron Rodgers is having an MVP-caliber season and this is what that looks like utilizing this holistic composite average. The only metric that isn’t top-two is CPOE, reinforcing the narrative of Aaron Rodgers as a player who is consistently making good throws but not improbable ones.
Let’s use the eye test to wrap this up.
If I told you Josh Allen was performing like a top-eight quarterback in the NFL this year, would you agree with me? If I said Aaron Rodgers was top three? Deshaun Watson top ten? It certainly tracks.
If we are willing to accept that every metric has flaws, take the time to know what those flaws are, and utilize the metrics in a composite to help us better evaluate quarterbacks, we can get a better understanding of how a player at the most scrutinized position in sports is actually performing.
...and that’s the way the cookie crumbles. I’m Bruce Nolan with Buffalo Rumblings. You can find me on Twitter and Instagram @BruceExclusive and look for new episodes of “The Bruce Exclusive” every Thursday and Friday on The Buffalo Rumblings podcast network!