Chat with John Dewan re Defensive Stats-Long
markj111
Chat with John Dewan
By John Dewan
Hello to my friends at Bill James Online. I wanted to share this conversation I had with readers over at the website called SonsOfSamHorn.net. Here are the questions and answers we posted in the “Red Sox Forum” over the last few weeks.
Thanks for doing this John.
Could you please explain the differences, and the similarities, between your system and UZR.
Thanks.
First, let me give you a little background. I developed Zone Ratings in the 1990’s during my days at STATS, Inc. The concept was based on the coding we were doing of the distance and location of batted balls. Each defensive position was assigned a zone where, based on the data, a majority of plays could be expected to be made. Zone Rating data was published annually in the Baseball Scoreboard beginning in 1990. The last edition of the Baseball Scoreboard was in 2001.
My last year at STATS was the year 2000 and in the Baseball Scoreboard 2000, I developed a new system. I called it Ultimate Zone Ratings, abbreviated UZR. This was essentially a forerunner of the system I developed in 2002 at Baseball Info Solutions which I named the Plus/Minus System. Ultimate Zone Ratings in the Scoreboard was also a system based on plusses and minuses. But with the new Plus/Minus System, I used more detailed data from Baseball Info Solutions and included adjustments in many areas that I hadn’t done in the first version of UZR.
Getting back to your question, what’s the difference between my system and UZR? While I don’t know for sure if the current version of UZR is an extension of my original UZR, or if it was independently developed, the bottom line is that they are based on the exact same concept. Both systems break the field into small areas and look at the probabilities of plays being made in those areas. The differences lie in the various adjustments that are made.
My research assistant, Ben Jedlovec, prepared the following:
Based on my understanding of both systems,
Similarities
Both use BIS Data. UZR started with STATS data, but the most commonly referenced version uses BIS data.
Both have the same idea: break down balls in play by type, location, and velocity.
Both are measured on an above/below average scale.
Both have runs saved systems with components for GDP, OF Arms, Range.
We use similar run value multipliers at each position.
Both are available online (Fangraphs or Bill James Online).
Technical Differences
UZR uses multi-year samples, while Plus/Minus adjusts for year-to-year league changes. As teams are increasingly recognizing the importance of a strong defense, the league as a whole will be stronger defensively. It is important to handle this trend appropriately.
Plus/Minus uses smaller, more precise zones, or “buckets” of plays.
UZR has several minute adjustments, such as batter hand, pitcher hand, base/out state, and pitcher groundball/flyball tendencies. We remain focused on the value contributed to the team in the player’s specific context.
Park adjustments are handled differently: I believe UZR applies a blanket adjustment across all buckets, while Plus/Minus builds its park factors in through more precise buckets. A ball hit 395 feet to Vector 190 that stays in the park is only compared to all other balls hit 395 feet to Vector 190 that stay in the park. If it leaves the park, it neither helps nor hurts the fielder. Also, we added the “Manny Adjustment”, which removes fly balls hit unreachably high off a wall. We named the adjustment after the Green Monster’s most notable victim, who went from being by far the worst left fielder in baseball before the adjustment to being only arguably the worst left fielder after the adjustment.
Plus/Minus accommodates plays where the first baseman holds the runner and middle infielders are covering second on hit-and-run plays. UZR adjusts for all base/out states.
The two systems apply the run values at different stages in the calculations. UZR applies runs right away, while we convert to Enhanced PM then apply the Run Factors.
Plus/Minus is a little more aggressive in awarding credit/penalty. An example: 100 balls in a ‘bucket’ (specified type, velocity, location), 30 fielded by the 2B, 20 by the 1B, 50 go through for singles. On a groundout to the second baseman, we give +50/(50+30) = 5/8 = +.625. UZR gives +50/100 = +.50. On a single through both fielders, Plus/Minus gives -30/80 = -.375 to the 2B, and -20/70 = -.29 to the 1B. UZR gives -30/100 = -.3 to the 2B, and -20/100 = -.2 to the 1B. You could make an argument for either method of accounting, but neither one is better than the other. The differences are the greatest at the middle infield positions, where overlap between fielders is the highest.
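The arithmetic in the 100-ball example above can be reproduced with a short sketch. The function names are illustrative and not part of either system; the only inputs are the counts given in the example (30 outs by the 2B, 20 by the 1B, 50 singles).

```python
def plus_minus_credit(fielder_outs, hits, made_play):
    """Plus/Minus-style accounting: outs made by other fielders are left out of the denominator."""
    denom = fielder_outs + hits
    return hits / denom if made_play else -fielder_outs / denom


def uzr_credit(fielder_outs, other_outs, hits, made_play):
    """UZR-style accounting as described in the example: every ball in the bucket is in the denominator."""
    total = fielder_outs + other_outs + hits
    return hits / total if made_play else -fielder_outs / total


# Bucket from the example: 30 fielded by the 2B, 20 by the 1B, 50 singles.
print(plus_minus_credit(30, 50, made_play=True))    # groundout to the 2B
print(uzr_credit(30, 20, 50, made_play=True))
print(plus_minus_credit(30, 50, made_play=False))   # single, charged to the 2B
print(plus_minus_credit(20, 50, made_play=False))   # single, charged to the 1B
print(uzr_credit(20, 30, 50, made_play=False))
```

Because the Plus/Minus denominator is smaller, both the credit and the penalty are larger in magnitude, which is the sense in which it is "more aggressive".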
Fundamental Differences
Runs Saved includes Bunt Runs Saved for corner infielders, pitcher fielding (Plus/Minus and holding runners), and catcher fielding (handling the pitching staff and the running game).
Runs Saved measures the extra impact of HR Saving Catches. Runs Saved will add other Defensive Misplay/Good Fielding Play runs in the future.
There are some large similarities, but the bottom line is we’re not measuring exactly the same pieces of the puzzle, and we’re accounting for them differently.
John -
What do you think is the appropriate sample size in order for +/- to have utility? Do you think volatility in player's defensive numbers is more attributable to performance variation, or the metrics themselves still being fairly new ground?
Over time, we have all developed a feel for what baseball data means. For example, flipping through my Bill James Handbook for a player with a long career, I picked Juan Pierre at random. In 2004 he hit .326 for the Marlins. One year later, with the same team, he hit exactly 50 points lower (.276). With the wisdom of hindsight, and even at the time, we knew his real ability was somewhere in between.
So it is, for the most part, with our plus/minus numbers. A player’s number can still vary from year to year, and his true ability generally lies between the fluctuations.
Another example: if a player has a plus/minus of +3 after five games, he has played well in those five games. It’s like going, say, 10-for-20 in those five games. There’s no question that he played well. But the sample size is small and, in that limited timeframe, provides only a minuscule amount of insight into the player’s true ability.
Like other numbers in baseball, a small sample size tells you what a player is doing, but the larger the sample size gets, the more you know about what he is really capable of doing.
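One rough way to put numbers on the sample-size point is the binomial standard error. This is a standard statistical approximation, not anything taken from the Plus/Minus System, and it assumes each at-bat is an independent trial with a fixed hit probability. The function name is illustrative.

```python
import math


def batting_average_std_error(true_average, at_bats):
    """Standard error of an observed batting average under a simple binomial model."""
    return math.sqrt(true_average * (1 - true_average) / at_bats)


# A hypothetical .300 hitter over a five-game stretch versus a full season.
for at_bats in (20, 600):
    se = batting_average_std_error(0.300, at_bats)
    print(f"{at_bats} at-bats: standard error of about {se:.3f}")
```

Under these assumptions the uncertainty over 20 at-bats is roughly 100 points of batting average, while over 600 at-bats it shrinks to under 20 points.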
When it comes down to it, I give our overall plus/minus numbers similar credibility as other baseball numbers, like batting average or on-base percentage. In my new book, The Fielding Bible—Volume II, we developed Runs Saved. I think of Runs Saved as the Runs Created of defense in that it encompasses a wide variety of methods. I give Runs Saved similar credibility to Runs Created.
What do you think is the next step in baseball evaluation?
On offense I believe we’re measuring 80-90 percent of the true ability of players. On defense, I believe we’re at about the 60 percent level. But we’re still at the tip of the iceberg in terms of precision and a ton more can be done, especially defensively. As new forms of data become available, we’ll be able to enhance our defensive systems. One example: BIS has now developed a batted ball timer, which we believe will greatly improve the accuracy of our system.
Thanks for taking the time to answer our questions, John, great work on The Fielding Bible II.
Would you please detail how the data used for your defensive evaluations is compiled. Also, what are the differences, if any, in the method used to compile defensive data between BIS and STATS, Inc.? Finally, what are your thoughts on the strengths and weaknesses of these methods of acquiring data, and what do you anticipate being done to improve them?
Let me refer you to www.fieldingbible.com to get a better overview of the Plus/Minus System and my other defensive systems. We added new techniques, including Defensive Runs Saved, when we published The Fielding Bible—Volume II this spring. Daily updates are being posted on Bill James Online (www.billjamesonline.com). It’s a subscription website, but it’s only $3 per month.
Having been involved with the set-up of both the STATS and BIS data tracking systems, I believe that both organizations do an excellent job overall. For defensive data, if STATS still does what they were doing back when I was there, they rely on a scorer in the pressbox marking the location of batted balls on a grid system that breaks the field into 26 vectors emanating from home plate on one axis and 10-foot increments on the other. BIS utilizes a video review by its Video Scouts to pinpoint the batted ball on a replica of the field on their video screen. In theory, BIS data allows for greater precision, as each pixel on the computer screen can represent a location.
As technology moves forward, we will be able to get better and better precision in our data. As I mentioned above, I’m very excited about looking at the batted ball timer data we’ve been starting to collect.
Thanks for the chat. I have a few questions, some of which might seem silly:
1. At what point will some of the same techniques that evaluate defense (type, velocity, and direction of the hit) be used to adjust our understanding of player offense?
2. mgl has argued that at some level our understanding of defense is better than that of offense -- it's just that offense has neater bins to categorize event results. What would a neater set of defensive result categories look like? If you had the power to completely change our terminology of defense, what if anything would you change?
3. How much longer until little gps units are in the cleats of every fielder, or some equivalent, so that assessments of reaction and range can be made better? Will any team do this, and who will do it first?
4. How do teams these days think about defense? Do they all subscribe to services like yours, or have their own, bigger ones?
While offense is a different beast, and has a different set of variables to consider, I believe there is a lot that can be done with the type of data that we are collecting that we haven’t done yet. We’re using data for analyzing defense that we don’t use much to analyze offense. For example, Bill James did a study that showed there is some consistency among hitters regarding how they hit grounders and line drives, but that hitting flyballs is a significant factor that separates hitters from one another.
I don’t think we understand defense better than offense. We’ve worked hard at it, but we’re still getting there. As far as categories, I’d like to see people start to talk about defense in terms of defensive runs saved, broken down into categories like OF Arm runs saved and double play runs saved, while pushing errors and fielding percentage more into the background.
Sportvision is teaming with Major League Baseball to attempt to do just that. There was a recent article in the New York Times on this topic. Measuring range and reaction time would be great, but the ultimate goal is to combine them and measure a player’s skill at turning a ball in play into an out, which current systems are already getting at.
Each team handles defense in its own way. Many subscribe to our system and/or refer to The Fielding Bible frequently. I think teams have been catching on to the importance of defense over the past few seasons, and the media really started catching on this offseason. The lower-than-expected contracts signed by free agent defensive liabilities this past offseason (Burrell, Abreu, and Dunn, to name a few) indicate that the league as a whole has made an adjustment.
Thank you, Mr. Dewan, for the chat. Your publications have become a fixture of my pre-season purchases.....
Thanks to Fenway, Red Sox fans may be unusually interested in how the various fielding systems try to correct (or ignore) park effects on fielding. In light not just of the Green Monster, but Fenway's jutout behind 3b (which may reduce the run-scoring impact of some shots down the third-base line), its huge right field, its smaller foul territory, the Triangle, etc., how can fans best account for park effects on the currently available fielding ratings/metrics? Is there a reasonable way to normalize defensive metrics, or should we look just at "away" splits over a longer period of seasons (in order to avoid small sample size error)?
Many thanks.
Our system handles the major park effects well.
First off, a 360-foot fly ball to right in Fenway is only being compared to identical fly balls in stadiums where the 360-foot fly stays in the park. The fact that the same fly ball might be out in other parks doesn’t affect how much credit or penalty is assigned for the play. Because our zones are small (i.e. more precise), a park adjustment is already built into the system.
Secondly, after the first Fielding Bible we added the “Manny Adjustment” for balls hit off the wall, which eliminates balls hit too high to handle off a wall.
Lastly, foul fly balls don’t impact a player’s plus/minus number. As for the Fenway “jutout”, Mike Lowell may get a slight benefit when a ball goes for a single at Fenway that might otherwise go for a double, but the effect on the total is probably minuscule.
Defensive positioning prior to the ball being put into play seems to be one major factor that frustrates evaluation of defense through scouting, as players with poor positioning can make routine plays look spectacular. Statistical analysis using putouts, assists, etc. compensates for this by counting the plays that a player makes whether they are routine or spectacular. From my understanding, UZR compensates for positioning by figuring the tendency of an average player to make a play in a particular zone.
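The built-in park adjustment described in this answer, where a fly ball is compared only with identical balls that stayed in the park, can be sketched as follows. The tuple layout, the function name, and the sample data are hypothetical and chosen only to illustrate the idea; the real buckets also depend on batted-ball type and velocity.

```python
from collections import defaultdict


def bucket_out_rates(balls):
    """Out rate per (distance, vector) bucket, counting only balls that stayed in the park.

    balls: iterable of (distance_ft, vector, stayed_in_park, caught).
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [chances, outs]
    for distance, vector, stayed_in_park, caught in balls:
        if not stayed_in_park:
            continue  # a ball that leaves the park neither helps nor hurts the fielder
        bucket = totals[(distance, vector)]
        bucket[0] += 1
        bucket[1] += int(caught)
    return {key: outs / chances for key, (chances, outs) in totals.items()}


# Invented sample: three balls hit 395 feet to Vector 190, one of which left the park.
sample = [
    (395, 190, True, True),
    (395, 190, True, False),
    (395, 190, False, False),
    (360, 200, True, True),
]
print(bucket_out_rates(sample))
```

The home run is dropped before any rate is computed, so the bucket's baseline reflects only parks where that ball was playable.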
How does +/- compensate for this problem and why does it compensate for this better than other defensive evaluations? Was this problem one that you particularly thought of while working on +/-?
Thanks a lot for doing this chat.
That’s exactly what Plus/Minus does. Both Plus/Minus and UZR factor in defensive positioning and give credit for it. Both systems account for both components of good defense – having good range and positioning well. In this way, both systems are complete. What still can be done is to break down each fielder’s performance into separate components for range and positioning.
We explain our system at great length in The Fielding Bible and at www.fieldingbible.com. In 2008, a hard grounder to Vector 197 (slightly to a normal shortstop’s right) was converted to an out by the shortstop 86% of the time. The average shortstop will make the play most of the time, but not always. If the shortstop makes this particular play, we award him +0.14 (1 - .86) plays above average. If he fails to get the out, we penalize him -.86 plays. We do this for every play and every position and add them up to get a player’s plus/minus score, which we later convert to Runs Saved.
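The per-play scoring just described can be written as a minimal sketch. The function names are illustrative, and the 86% figure is the one quoted above for a hard grounder to Vector 197; the conversion to Runs Saved is not shown.

```python
def play_credit(out_rate, made_play):
    """Plays above average for a single chance, given the bucket's league out rate."""
    return (1 - out_rate) if made_play else -out_rate


def plus_minus(plays):
    """Sum the per-play credits; plays is an iterable of (out_rate, made_play)."""
    return sum(play_credit(out_rate, made_play) for out_rate, made_play in plays)


print(play_credit(0.86, True))    # credit for making the play
print(play_credit(0.86, False))   # penalty for missing it

# A fielder who converts exactly 86 of 100 such chances nets out to about zero.
average_fielder = [(0.86, True)] * 86 + [(0.86, False)] * 14
print(round(plus_minus(average_fielder), 6))
```

The last line shows why the scale is centered on the average fielder: converting chances at exactly the league rate yields a plus/minus of zero.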
As mentioned in the first question, Plus/Minus uses more precise zones to determine the difficulty of a particular play, which in theory should give us more accurate results.
Here's a question from lurker RS DOrtiz. I'm paraphrasing a bit here.
I am interested in the weighting of throwing arms when one is considering defensive evaluations for outfielders, specifically in regards to Jacoby Ellsbury. Does his weak throwing arm set him apart that much negatively when considering votes for the Gold Glove? How are throwing values calculated? Also, how does Jacoby compare to someone like Ichiro considering arm and range? Observationally, it appears to me that Ellsbury makes more "spectacular" catches than Ichiro, and I'm wondering how well that relates to actual statistics and evaluations.
Thanks.
In The Fielding Bible—Volume II, we tackle the issue of combining a player’s throwing arm with his range in the outfield by converting the systems we use for throwing (OF Arms) and range (Plus/Minus) into Runs Saved. A baserunner kill at home plate is the most valuable defensive play. In terms of range, making a more difficult play will earn the fielder more Plus/Minus Runs than a routine play.
Ellsbury’s weaker arm has only cost the Red Sox about three runs in his career. He’s made up for it with six defensive runs saved with his range. Ichiro is in a completely different universe. Ichiro has saved 30 runs with his throwing arm since 2003 and 66 runs with his range. Ichiro has established himself as one of the best outfielders in baseball, while Ellsbury seems to be an all-around average centerfielder. In left and right, Ellsbury’s range would stand out, but in center he’s nothing special. In center field, that’s partially true for Ichiro as well; his range is average for a center fielder but his throwing arm more than makes up for it. Here’s the data for Ichiro and Ellsbury since 2003:
(through July 22)
LastName  Pos  Innings  Plus/Minus Runs Saved  OF Arms Runs Saved  HR Saving Catch Runs Saved  Total Runs Saved
Ellsbury  7    490.3    3                      -1                  0                           2
Ellsbury  8    1398.0   -1                     -2                  0                           -3
Suzuki    8    2279.0   -2                     9                   0                           7
Ellsbury  9    287.0    4                      0                   2                           6
Suzuki    9    6770.7   60                     21                  8                           89
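As a quick check on how the columns above relate, the sketch below sums the three components for each row and compares the result with the Total Runs Saved column. The numbers are copied from the table; the tuple layout and function names are illustrative.

```python
# (player, position, plus_minus_runs, of_arms_runs, hr_saving_runs, total_runs)
rows = [
    ("Ellsbury", 7, 3, -1, 0, 2),
    ("Ellsbury", 8, -1, -2, 0, -3),
    ("Suzuki", 8, -2, 9, 0, 7),
    ("Ellsbury", 9, 4, 0, 2, 6),
    ("Suzuki", 9, 60, 21, 8, 89),
]


def component_sum(row):
    """Plus/Minus + OF Arms + HR Saving Catch runs for one row."""
    _, _, plus_minus_runs, of_arms_runs, hr_saving_runs, _ = row
    return plus_minus_runs + of_arms_runs + hr_saving_runs


def arm_runs(player):
    """OF Arms Runs Saved for a player, summed across positions."""
    return sum(row[3] for row in rows if row[0] == player)


for row in rows:
    print(row[0], row[1], component_sum(row), row[5])
print("Suzuki arm:", arm_runs("Suzuki"), "Ellsbury arm:", arm_runs("Ellsbury"))
```

In every row the three components add up to the listed total, and the arm totals match the 30 runs and roughly three runs cited in the answer.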
Hi John, thanks for taking the time to answer some questions.
What do you feel is the best way to evaluate the defensive impact of catchers? What is the value of controlling the running game against blocking balls, and do you think it is even possible to quantify something like how a catcher handles the pitching staff?
We explain our method of rating catchers’ ability to control the running game in The Fielding Bible—Volume II. We have also taken a first stab at measuring a catcher’s handling of the pitching staff. We use an example comparing Brandon Inge, Ivan Rodriguez, and Jose Molina to illustrate how we calculate Catcher Earned Runs Saved. I have to refer you to the essay in the book, because to explain the system I’d have to copy the whole essay here.
You mentioned that our Catcher Runs Saved system seems to rely heavily on CERA (Catcher ERA) but that’s a huge oversimplification. In the system, we use the earned runs in the CERA, but only as it relates to catchers catching the same pitchers. If Pudge has a 4.40 CERA with Joe Smith pitching and Molina has only caught him for one inning with a 9.00 CERA, there is almost no effect. There might be credit for Pudge for one earned run saved, but once we use our credibility factor the Adjusted Earned Runs Saved is 0.
Also, we use our Enhanced Fielding System of Defensive Misplays and Good Fielding Plays to evaluate catchers. Jason Kendall blocked more pitches than anyone, while Bengie Molina allowed the most balls to get by him. This is more of a scouting-based approach and contains very valuable information that is otherwise unrecorded.
Regarding Jason Varitek, the numbers suggest that he has been better than other catchers at handling the pitching staff in the last couple of years. Or, more specifically, with Varitek calling the game for his pitchers, it has resulted in an improved ERA for those pitchers compared to all other catchers who have caught those pitchers. We give him credit for nine Adjusted Earned Runs Saved – three in 2008 and six so far in 2009.
There is still a lot to do in this area, but we’ve gotten a good start on evaluating catchers.
Many thanks for your time, John.
Given the advances in Pitch/FX analysis, and the forthcoming usage of Hit/FX, how might these trajectory-based technologies be better used in fielding evaluations?
In theory, PITCHfx and HITfx data are very useful tools for data collection and analysis, but both have their limitations. To their credit, MLBAM and Sportvision have invested a lot of time and money in both projects, and we’re starting to see the benefits of this type of information.
Neither system in its present form adds much to our current fielding analysis. I expect that utilizing the new batted ball timer data collected by Baseball Info Solutions will be a huge advance not only for fielding analysis but for pitching and hitting evaluation as well.
Thank you for taking questions, John. Over the last couple of years, Dave Cameron has been touting data like this (from Fangraphs):
CF Values 2009
Name                Batting  Fielding  Replacement  Positional  RAR   WAR  Dollars (Millions)
Matt Kemp           13.6     10.1      11.4         1.2         36.3  3.6  $16.00
Torii Hunter        20.6     -2.8      10.8         0.7         29.3  2.9  $13.20
Franklin Gutierrez  5.7      12.0      9.9          1.2         28.7  2.9  $13.00
Carlos Beltran      20       -3.8      9.3          0.6         26.1  2.6  $11.80
Mike Cameron        8.3      5.1       10.7         1.2         25.3  2.5  $11.40
Curtis Granderson   7.8      2.1       12.5         1.3         23.7  2.4  $10.70
Nyjer Morgan        -2.8     17.7      11.5         -2.7        23.7  2.4  $10.70
B.J. Upton          0.8      5.4       11.8         1.2         19.1  1.9  $8.60
Nate McLouth        10.5     -3.3      10.5         1.1         18.8  1.9  $8.00
Aaron Rowand        8.3      -1.1      10.6         1.2         19    1.9  $8.00
Adam Jones          11.4     -6.6      11.1         1.2         17.1  1.7  $7.70
Shane Victorino     10.2     -7.8      12.3         1.2         16    1.6  $7.20
Jacoby Ellsbury     5.3      -6.4      11.6         1.2         11.7  1.2  $5.30
Melky Cabrera       1.3      0.6       8.7          -1.6        9     0.9  $4.00
Grady Sizemore      1.7      -4.1      9.6          -0.5        6.6   0.7  $3.00
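For readers unfamiliar with the columns, the figures above are consistent with RAR being the sum of the four components and WAR being RAR divided by about 10 runs per win. That relationship is inferred from the numbers shown here, not stated in the chat, and the sketch below only checks it against a few rows copied from the table.

```python
# (name, batting, fielding, replacement, positional, rar, war)
rows = [
    ("Matt Kemp", 13.6, 10.1, 11.4, 1.2, 36.3, 3.6),
    ("Franklin Gutierrez", 5.7, 12.0, 9.9, 1.2, 28.7, 2.9),
    ("Nyjer Morgan", -2.8, 17.7, 11.5, -2.7, 23.7, 2.4),
    ("Jacoby Ellsbury", 5.3, -6.4, 11.6, 1.2, 11.7, 1.2),
]

RUNS_PER_WIN = 10.0  # assumption inferred from the RAR and WAR columns


def component_total(row):
    """Batting + Fielding + Replacement + Positional runs for one player."""
    return sum(row[1:5])


for row in rows:
    name, rar, war = row[0], row[5], row[6]
    print(name, round(component_total(row), 1), rar, round(rar / RUNS_PER_WIN, 1), war)
```

The component sums agree with the listed RAR to within 0.1 run, which is what rounding in the published columns would produce.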
Dave reaches the conclusion that Franklin Gutierrez's defense is so valuable that his overall value exceeds that of nearly all major league CFs (Beltran, Granderson, etc.). He has used similar data to support his claim that Adrian Beltre has been well worth his contract with the Mariners.
Are defensive metrics so evolved and reliable that we can credibly make such claims when the bulk of a player's value is wrapped up in defense? Take Nyjer Morgan as an extreme example. When calculated this way, his entire value is based on defense, and yet he rivals Beltran, Granderson, Upton, McLouth, and Rowand.
How do you respond to these conclusions? Shouldn't an adjustment be made for reliability of the data, where offensive metrics are weighted higher and defense indicators discounted, in order to render a credible judgement?
As mentioned earlier, we use caution in small samples of defensive data. However, in Gutierrez’s case, his defense value is far from a small-sample fluke. He led all right fielders in Plus/Minus Runs Saved in 2007 in only 579 innings, and he repeated the feat in 2008 in just over a half-season’s worth of innings. After last season, Fielding Bible Award voters were convinced of his ability and gave him the award over Nick Markakis, Ichiro Suzuki, and everyone else, despite the fact that Gutierrez has a sub-par arm in right field.
When new Seattle GM Jack Zduriencik brought in Gutierrez to play center and Endy Chavez to play left, we touted the Mariners as having the best defensive outfield in baseball. Sure enough, Seattle’s outfield ranks second in Defensive Runs Saved with 29 through July 28, behind only Oakland’s 30.
Gutierrez has handled the transition to center field well. While his Plus/Minus numbers are down at the tougher position as you would expect, he still leads the league in Plus/Minus Runs Saved, and his arm is less of an issue. On top of his defensive prowess, Gutierrez is having his best season at the plate. In The Fielding Bible--Volume II, we combined offense, defense, baserunning, and positional value into Total Runs. If Gutierrez keeps playing like this in the second half, he could find himself among the top players in baseball on our 2009 Total Runs leaderboard.
Nyjer Morgan is a different story because the sample size is smaller. He’s played as an above average outfielder, but he’s only logged 1300 innings across three seasons counting all three outfield positions. He rates above average with a total of 21 runs saved on defense in his limited time, but we don’t consider him in Gutierrez’ class (yet).
A vast majority of the outs recorded by MLB defenses are what would be considered "routine plays," or plays that any player at that level would be expected to make. What percentage of all plays made by the defense would you consider to be "routine," and does that number vary from position to position?
I guess it depends on how you define “routine plays”. Let’s define a “Routine Play” as one that at least 50% of players at that position made successfully, and we’ll see what percentage of those plays were converted to outs. (Note: this does not include popups and liners for infielders, which are also included in Plus/Minus.)
Based on this definition, every defensive position shows at least 75% of the plays as routine. Third base and pitcher are right at 75% while center field and right field top the list at 90%.
These routine plays are handled successfully about 90% of the time. There is some variation by position but nothing really significant.
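The two percentages in this answer come from two different ratios, which a small sketch can make explicit. It assumes "routine" means a bucket whose out rate is at least 50%, which is one reading of the definition above. The bucket counts are invented for illustration, and the function name is hypothetical.

```python
# Hypothetical buckets for one position: (chances, outs). Invented numbers.
buckets = [(100, 95), (50, 40), (30, 10), (20, 2)]


def routine_summary(buckets, threshold=0.5):
    """Return (share of chances that are routine, out rate on those routine chances)."""
    routine = [(chances, outs) for chances, outs in buckets if outs / chances >= threshold]
    total_chances = sum(chances for chances, _ in buckets)
    routine_chances = sum(chances for chances, _ in routine)
    routine_outs = sum(outs for _, outs in routine)
    return routine_chances / total_chances, routine_outs / routine_chances


share, conversion = routine_summary(buckets)
print(f"{share:.0%} of chances are routine; {conversion:.0%} of those become outs")
```

The first ratio is the share of all chances that fall in routine buckets; the second is how often those routine chances are actually converted.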
It seems that statistical analysis should be most useful in the middle of the distribution, where we can have real questions about someone's true value, rather than at the ends of the distribution, where a player's value seems relatively apparent. What I mean is, it doesn't seem like we really need advanced statistical analysis to tell us that A-Rod or Pujols are relatively good hitters, or that Alex Cora is relatively bad; regardless of what numbers you crunch, most statistical analyses will confirm this. But, if you were building a team, it might be nice to know which middle-of-the-road hitters are good hitters and which are bad hitters before you commit lots of money to them in the form of large contracts.
The point here is that one would think that most measurement systems should be relatively consistent on the tail ends of distributions. I would otherwise expect that your analysis and MGL's analysis could agree on, perhaps, the BEST defenses and the WORST defenses while having disagreements about the middle-of-the-road defenses. (Correct me if I'm significantly wrong at any point here.)
To that effect, I am reposting a thought I posed to MGL on The Book's Blog, and was wondering what your take on it was:
<QUOTE>
If you’ve got a Hardball Times Annual, open your 2008 edition. MGL has an article called “Signals and Noise”, in which he estimates, using his defensive system, the value of the Toronto Defense to be +12 runs. I admit that I didn’t pay any attention to this number the first time I read the article.
No problem, right, because who cares about Toronto’s defense?
Well, it turns out that at least one guy does: John Dewan, who has the very next article in this book. In this article, using his defensive system, he estimates the value of the Toronto defense to be +92 (best defense that year, iirc).
No, really. If you turn three pages, you get two different estimates that differ by eighty (something like 60ish runs?).
</QUOTE>
Isn't that an incredible amount of variance around what you consider to be the best team in the league? Is it a problem that defensive evaluations can't agree on the best and worst teams, let alone the middle-of-the-road teams?
I think you’re comparing apples to oranges. The number cited from THT 2008 (+92) is Enhanced Plus/Minus, essentially the number of bases (not runs) saved due to the seven primary fielders’ ranges. That number also doesn’t include any of the other components we included in developing Runs Saved during the 2008-2009 offseason.
Having said that, the number of Runs Saved by the defense for Toronto that we have now in Bill James Online (www.billjamesonline.com) is 70. Comparing that to the 12 runs shown in the “Signals and Noise” article, your point is well made. Why are the two systems that different in 2007?
As mentioned earlier, UZR does many things similar to Runs Saved. However, Runs Saved also measures pitcher fielding, the ability of pitchers and catchers to control the running game, a catcher’s ability to handle the pitching staff, corner infielders’ ability to handle bunts, and the extra impact of robbed home runs. (Just ask Mark Buehrle about robbed home runs.)
This season, both systems like the Pirates, Mariners, Rangers, and Tigers. Neither likes the Marlins, Red Sox, Royals, or Twins.
Runs Saved (from Bill James Online) and UZR both like the Rays’ defense, but UZR rates them several spots higher. The biggest reason is the Rays’ pitching staff, which has not fielded the position well at all this year (-8 Runs Saved). UZR does not account for pitcher defense, so comparing the two isn’t entirely fair.
The Toronto discrepancy is largely due to the accounting differences between the two systems. The difference is most conspicuous in the middle infield.
I have a question for Mr. Dewan. This year, the Red Sox defense has been pretty shaky at times, and sabermetric measures like Ultimate Zone Rating (UZR) rate them as one of the worst in baseball this year. How did this happen to this Boston squad, which was very good in 2007 and 2008, by all accounts? Are you seeing the same sort of decline in performance using your plus/minus method?
As mentioned in the previous question, the Red Sox rate as the worst defensive team in the majors based on the Team Runs Saved data available at Bill James Online. The biggest issue is the left side of the infield, which has been abysmal in 2009. Injuries have taken their toll on Mike Lowell, and you can see that the Red Sox have been 23 runs below average at third base this year. Just between Lowell and Julio Lugo, the Red Sox have lost about 30 runs (or roughly 3 wins) defensively. Fortunately for Boston, Jed Lowrie is back playing and should improve the defense.
Thanks for giving us some time, John.
What does data received from projects like Sportvision (reported in the New York Times) portend for the future of defensive statistics? To what extent are you worried that increased granularity of data leads to greater reliance on noisy data rather than greater clarity?
Relatedly, in the non-baseball world, many (for example, Nassim Taleb) have written about our increased reliance on metrics to provide an air of objectivity and clarity to otherwise subjective and messy data. When people look at +/- (or UZR or other metrics) they may rely on a final output (e.g., Lowell is "-20" etc.) for more than the data can reasonably provide. Do you have any thoughts on how much room for the subjective and observational should be left in a world of statistics with increasing granularity, complexity and authoritative acceptance?
Thanks again.
Sportvision is doing some great work. The trick is turning the video tracking into useful analysis, particularly from a fielding standpoint. We could subdivide the field into a million different regions and include a thousand different variables, but the analysis would be crippled by trivially small sample sizes.
There will always be some things we won’t be able to quantify, on offense and on defense. At the PITCHfx Summit a few weeks back, there was some discussion of tracking the catcher’s mitt as a way to measure how effectively a pitcher is hitting his spots. We’re never going to be able to track everything. The best we can do is to understand what we can and can’t quantify and make mental adjustments accordingly. Another good example of this is our Defensive Misplays and Good Fielding Plays system. If an outfielder misplays a double into a triple or misses the cutoff man, it won’t show up in the boxscore or in a player’s Plus/Minus or UZR score. We do count these at BIS, and we’re working on adding them to our evaluations.
As I mentioned earlier, I think our defensive metrics are, at best, at the 60% mark in terms of credibility. Subjective evaluation accounts for the remaining 40% or more.
Related to Joshv's and browndog's questions above - according to THT's Mike Fast, the Trackman System will be unveiled at this year's All Star home run derby. Do you have any thoughts on this promising new system? How does it compare to Sportvision, and what will it mean for the future of defensive assessments?
I don’t know that much about the Trackman system. The technology to track the ball in flight in real-time is exciting from a fan’s standpoint. It seemed there were still a few issues to work out based on what we saw in the Home Run Derby (the trail on the ball looked choppy at times, and the cameras occasionally lost track of the ball). Sportvision is using different technology (primarily cameras rather than Doppler radar) but going for the same general idea—tracking the ball’s flight digitally.
John, I see Fielding Bible numbers over at Bill James online. What's your connection to Bill?
I don’t know the man. Does he work with baseball information too? ;-)
Bill and I have been working together on many projects since the mid-1980’s. We worked together on Project Scoresheet, collecting detailed data and turning it into useful analysis. Bill was an investor in STATS, Inc. from day one and one of the most influential creative forces at that company as we grew it from a bedroom office in my home. Most recently, Bill and I collaborate on the annual Bill James Handbooks, which rely on data from my company (Baseball Info Solutions) and are published by ACTA Sports. Bill also made huge contributions to both editions of The Fielding Bible.
Bill and I are partners on Bill James Online.
Over the last few years we've seen teams value cost controlled prospects more highly than ever before. Do you feel that there is building momentum in the industry to invest in ball tracking technology in the minors to help teams better evaluate these highly valuable assets?
The interest in minor league information has grown over the last 20 years as quickly as the overall interest in major league sabermetrics. It’s just at a lower level. It will lag behind but continue to grow.
Related to Puffy's question: In your opinion, do most MLB teams currently over-value, under-value, or properly value, the worth of defense, in writing player contracts? Who does it best/worst?
Defense is probably undervalued by some teams and properly valued by others. As a whole, the league probably still underrates good defenders, but the gap is quickly closing. Earlier, I mentioned this past offseason’s free agent market and the adjustments the league has made. The Mariners and Rangers both made moves to drastically shore up their defenses this offseason, and they rate as two of the best in the league this year. Teams near the bottom of the Runs Saved leaderboard are probably lagging behind the curve, but there are exceptions. Everyone knows the Red Sox are one of the more analytical teams, yet their team defense has been abysmal. But when you score runs like they do and have a strong pitching staff, you can live with sub-par defense.
Tied in to some other questions here. In talking to a couple of scouts whose teams do use saber stats to help evaluate players on offense, as well as traditional scouting, I have been told that they do not use defensive metrics in evaluating minor league or MLB players because of the subjectivity of the stats. Do you see this changing anytime soon?
The Plus/Minus and Runs Saved systems, while not 100% objective, are certainly not “subjective”. That being said, we are always looking for ways to improve our data and analysis. I’ve already mentioned the batted ball timer data BIS is collecting this season. This information will greatly improve our analysis of fielding (and pitching and hitting, for that matter).
Judging by the number of teams who subscribe to our defensive products, it’s clear that teams are very interested in our Plus/Minus, Runs Saved, and Enhanced Fielding systems. As mentioned a few questions back, this offseason’s free agent market is confirmation that teams are embracing the various methods available to them. I don’t think any team uses a purely statistical approach to evaluating defense, nor should they, but it’s clear that the numbers are carrying a lot more weight than before. This trend will continue as new data becomes available and our ability to analyze the data improves.
How about Mike Lowell? We can see for ourselves that he's not been the same in the field since his hip surgery after last season. What do your fielding metrics tell you about him, and his defensive performance? Is he positioning himself differently, now that his mobility is impaired? Is anything affected beyond his range?
We mentioned Lowell a few questions back. This is a case where your observations directly match what our numbers are saying. Lowell has fallen off a cliff defensively this year, costing the Red Sox around two full wins with his defense by itself. One of the things we track is how well players go to their left and their right, and how well they handle balls hit more directly to them. In Lowell’s case, he has dropped off in each one of these areas. He hasn’t handled bunts very well this year either, but it’s a super small sample size. Fortunately, he’s still been productive at the plate, though he’s not being as patient as he used to.
Trying to get in before the deadline, so two questions:
1) does STATS compute things like inter-rater reliability for all of their personnel? Are unreliable people discarded or weighted appropriately? I wonder about the effectiveness of using human judgments for many of these measures.
2) should defensive stats report both an average value and also a measure of variability? So, for example, players that have fewer (or greater) plays would have smaller or larger variability scores? The point is: could defensive evals do a better job communicating the amount of variability in each player (or team's) measurement, so that fans could get a good sense of whether or not a difference is really a significant difference?
I can only assume STATS is using the same process they used when I left. At BIS, we rigorously train our scorers and minimize any potential biases. During the season, we review each scorer’s performance and make corrections as necessary. At the end of each season, we do a review of many plays to ensure that our data is recorded as accurately as possible.
The key with any statistic or number is the context. You can do your best to inform your readers of the process and thought behind each evaluation, and that’s all you can do. We could attach a reliability score to every number we publish, but what about when people start misinterpreting the reliability indicator? Then we just have another statistic to explain. There’s a fine line between educating readers and being too technical and losing their attention altogether.
What kinds of factors skew statistical analyses of defense?
I'll give a concrete example: Jacoby Ellsbury's defense has fallen off a cliff this year, according to most advanced metrics. I find it hard to believe that he was an elite defender in 2008 and is now a poor one. Can you make an educated guess whether one year's rating is more likely to be an aberration than the other? If so, what factors would you look for as signs that a particular player's rating is an aberration (or, inversely, is especially likely to be accurate)?
The biggest thing that skews statistical analysis is sample size.
I mentioned Ellsbury earlier. With barely a full season of innings in the outfield, it’s still early to draw strong conclusions. Based on what we’ve seen so far, Ellsbury has a below average arm with above average range, balancing out to an average centerfielder defensively, maybe above average at the corner positions. In 2008, he was tremendous in having very few Defensive Misplays relative to the Good Plays that we scouted and counted.
What kind of year-to-year correlation do we currently see in the best defensive metrics? And do you think it's fair to assume that if/when we have a way to perfectly measure defensive performance, it will show a similar year-to-year correlation as the most stable hitting/pitching metrics?
On the team level, we’re seeing year-to-year correlations in Runs Saved in the .3 to .4 range. We’re not quite to the level of hitting/pitching metrics yet, but we’re getting closer. There is no perfect way to measure defense, and the same goes for offense and pitching. We will keep improving our methods, and eventually our understanding of defense will catch up to our understanding of hitting and pitching.
This is the last question -- thank you to everyone for your very detailed questions. I am very happy to see the level of sophistication and understanding that you all have. Enjoy the rest of the season!
By John Dewan
Hello to my friends at Bill James Online. I wanted to share this conversation I had with readers over at the website called SonsOfSamHorn.net. Here are the questions and answers we posted in the “Red Sox Forum” over the last few weeks.
Thanks for doing this John.
Could you please explain the differences, and the similarities, between your system and UZR.
Thanks.
First, let me give you a little background. I developed Zone Ratings in the 1990’s during my days at STATS, Inc. The concept was based on the coding we were doing of the distance and location of batted balls. Each defensive position was assigned a zone where, based on the data, a majority of plays could be expected to be made. Zone Rating data was published annually in the Baseball Scoreboard beginning in 1990. The last edition of the Baseball Scoreboard was in 2001.
My last year at STATS was the year 2000 and in the Baseball Scoreboard 2000, I developed a new system. I called it Ultimate Zone Ratings, abbreviated UZR. This was essentially a forerunner of the system I developed in 2002 at Baseball Info Solutions which I named the Plus/Minus System. Ultimate Zone Ratings in the Scoreboard was also a system based on plusses and minuses. But with the new Plus/Minus System, I used more detailed data from Baseball Info Solutions and included adjustments in many areas that I hadn’t done in the first version of UZR.
Getting back to your question, what’s the difference between my system and UZR? While I don’t know for sure if the current version of UZR is an extension of my original UZR, or if it was independently developed, the bottom line is that they are based on the exact same concept. Both systems break the field into small areas and look at the probabilities of plays being made in those areas. The differences lie in the various adjustments that are made.
My research assistant, Ben Jedlovec, prepared the following:
Based on my understanding of both systems,
Similarities
Both use BIS Data. UZR started with STATS data, but the most commonly referenced version uses BIS data.
Both have the same idea- break down balls in play by type, location, velocity.
Both are measured on an above/below average scale.
Both have runs saved systems with components for GDP, OF Arms, Range.
We use similar run value multipliers at each position.
Both are available online (Fangraphs or Bill James Online).
Technical Differences
UZR uses multi-year samples, while Plus/Minus adjusts for year-to-year league changes. As teams are increasingly recognizing the importance of a strong defense, the league as a whole will be stronger defensively. It is important to handle this trend appropriately.
Plus/Minus uses smaller, more precise zones, or “buckets” of plays.
UZR has several minute adjustments, such as batter hand, pitcher hand, base/out state, and pitcher groundball/flyball tendencies. We remain focused on the value contributed to the team in the player’s specific context.
Park adjustments are handled differently- I believe UZR applies a blanket adjustment across all buckets, while Plus/Minus has park factors in the form of more precise buckets. A ball hit 395 feet to Vector 190 that stays in the park is only compared to all other balls hit 395 feet to Vector 190 that stay in the park. If it leaves the park, it neither helps nor hurts the fielder. Also, we added the “Manny Adjustment”, which removes fly balls hit unreachably high off a wall. We named the adjustment after the Green Monster’s most notable victim, who went from being by far the worst left fielder in baseball before the adjustment to being only arguably the worst left fielder after the adjustment.
Plus/Minus accommodates plays where the first baseman holds the runner and middle infielders are covering second on hit-and-run plays. UZR adjusts for all base/out states.
The two systems apply the run values at different stages in the calculations. UZR applies runs right away, while we convert to Enhanced PM then apply the Run Factors.
Plus/Minus is a little more aggressive in awarding credit/penalty. An example: 100 balls in a ‘bucket’ (specified type, velocity, location), 30 fielded by the 2B, 20 by the 1B, 50 go through for singles. On a groundout to the second baseman, we give +50/(50+30) = 5/8 = +.625. UZR gives +50/100 = +.50. On a single through both fielders, Plus/Minus gives -30/80 = -.375 to the 2B, and -20/70 = -.29 to the 1B. UZR gives -30/100 = -.3 to the 2B, and -20/100 = -.2 to the 1B. You could make an argument for either method of accounting, but neither one is better than the other. The differences are the greatest at the middle infield positions, where overlap between fielders is the highest.
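The credit arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of the two accounting rules as described here, not code from either system. The function names are made up, and the first baseman's credit for making a play (which the example does not spell out) is assumed to follow the same pattern as the second baseman's.

```python
from fractions import Fraction

# Example bucket from the text: 100 balls of one type, velocity and location.
plays_made = {"2B": 30, "1B": 20}   # outs recorded by each fielder
hits = 50                           # balls that went through for singles
total = hits + sum(plays_made.values())


def plus_minus_credit(fielder):
    """Credit for making the play; other fielders' outs are left out of the base."""
    made = plays_made[fielder]
    return Fraction(hits, hits + made)


def plus_minus_penalty(fielder):
    """Penalty on a hit, using the same reduced base."""
    made = plays_made[fielder]
    return -Fraction(made, hits + made)


def uzr_credit(fielder):
    """Credit for making the play, measured against every ball in the bucket."""
    return Fraction(hits, total)


def uzr_penalty(fielder):
    """Penalty on a hit, measured against every ball in the bucket."""
    return -Fraction(plays_made[fielder], total)


for f in plays_made:
    print(f,
          "Plus/Minus:", float(plus_minus_credit(f)), float(plus_minus_penalty(f)),
          "UZR:", float(uzr_credit(f)), float(uzr_penalty(f)))
```

Run as written, this reproduces the figures in the text: +.625 and -.375 for the 2B under Plus/Minus, about -.29 for the 1B, and +.50, -.3, and -.2 under UZR.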
Fundamental Differences
Runs Saved includes Bunt Runs Saved for corner infielders, pitcher fielding (Plus/Minus and holding runners), and catcher fielding (handling the pitching staff and the running game).
Runs Saved measures the extra impact of HR Saving Catches. Runs Saved will add other Defensive Misplay/Good Fielding Play runs in the future.
There are some large similarities, but the bottom line is we’re not measuring exactly the same pieces of the puzzle, and we’re accounting for them differently.
John -
What do you think is the appropriate sample size in order for +/- to have utility? Do you think volatility in players' defensive numbers is more attributable to performance variation, or to the metrics themselves still being fairly new ground?
Over time, we have all developed a feel for what baseball data means. For example, looking for a player with a long career, I flipped through my Bill James Handbook and randomly picked Juan Pierre. In 2004 he hit .326 for the Marlins. One year later with the same team, he hit exactly 50 points lower (.276). With the wisdom of hindsight, and even at the time, we know his real ability is somewhere in between.
So it is, for the most part, with our plus/minus numbers. But it can still vary from year to year and a player’s true ability generally lies between the fluctuations.
Another example: if a player has a plus/minus of +3 after five games, he has played well in those five games. It’s like going, say, 10-for-20 in those five games. There’s no question that he played well. But the sample size is small and, in that limited timeframe, provides only a minuscule amount of insight into the player’s true ability.
Like other numbers in baseball, a small sample size tells you what a player is doing, but the larger the sample size gets, the more you know about what he is really capable of doing.
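A rough sketch of the sample-size point, assuming each at-bat is an independent trial with a fixed success rate (a simplification that real hitting and fielding do not strictly obey). The 600 at-bat line is a hypothetical comparison at the same rate, not a real player's season, and the function name is made up.

```python
import math


def average_with_error(successes, chances):
    """Return (rate, standard error) under a simple binomial model."""
    rate = successes / chances
    std_err = math.sqrt(rate * (1 - rate) / chances)
    return rate, std_err


# The five-game hot streak from the text.
small = average_with_error(10, 20)
# Hypothetical: the same rate sustained over 600 at-bats.
large = average_with_error(300, 600)

print(f"10-for-20:   rate {small[0]:.3f}, standard error {small[1]:.3f}")
print(f"300-for-600: rate {large[0]:.3f}, standard error {large[1]:.3f}")
```

The standard error for 10-for-20 works out to roughly .112, against roughly .020 for the 600 at-bat sample. Under this model the same observed rate says far less about true ability when the sample is small.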
When it comes down to it, I give our overall plus/minus numbers similar credibility as other baseball numbers, like batting average or on-base percentage. In my new book, The Fielding Bible—Volume II, we developed Runs Saved. I think of Runs Saved as the Runs Created of defense in that it encompasses a wide variety of methods. I give Runs Saved similar credibility to Runs Created.
What do you think is the next step in baseball evaluation?
On offense I believe we’re measuring 80-90 percent of the true ability of players. On defense, I believe we’re at about the 60 percent level. But we’re still at the tip of the iceberg in terms of precision and a ton more can be done, especially defensively. As new forms of data become available, we’ll be able to enhance our defensive systems. One example: BIS has now developed a batted ball timer, which we believe will greatly improve the accuracy of our system.
Thanks for taking the time to answer our questions, John, great work on The Fielding Bible II.
Would you please detail how the data used for your defensive evaluations is compiled. Also, what are the differences, if any, in the method used to compile defensive data between BIS and STATS, Inc.? Finally, what are your thoughts on the strengths and weaknesses of these methods of acquiring data, and what do you anticipate being done to improve them?
Let me refer you to www.fieldingbible.com to get a better overview of the Plus/Minus System and my other defensive systems. We added new techniques, including Defensive Runs Saved, when we published The Fielding Bible—Volume II this spring. Daily updates are being posted on Bill James Online (www.billjamesonline.com). It’s a subscription website, but it’s only $3 per month.
Having been involved with the set-up of both the STATS and BIS data tracking systems, I believe that both organizations do an excellent job overall. For defensive data, if STATS still does what they were doing back when I was there, they rely on a scorer in the pressbox marking the location of batted balls onto a grid system that breaks the field into 26 vectors emanating from home plate on one axis and 10-foot increments on the other. BIS utilizes a video review by its Video Scouts to pinpoint the batted ball onto a replica of the field on their video screen. In theory BIS data allows for greater precision as each pixel on the computer screen can represent a location.
As technology moves forward, we will be able to get better and better precision in our data. As I mentioned above, I’m very excited about looking at the batted ball timer data we’ve been starting to collect.
Thanks for the chat. I have a few questions, some of which might seem silly:
1. At what point will some of the same techniques that evaluate defense (type, velocity, and direction of the hit) be used to adjust our understanding of player offense?
2. mgl has argued that at some level our understanding of defense is better than that of offense -- it's just that offense has neater bins to categorize event results. What would a neater set of defensive result categories look like? If you had the power to completely change our terminology of defense, what if anything would you change?
3. How much longer until little gps units are in the cleats of every fielder, or some equivalent, so that assessments of reaction and range can be made better? Will any team do this, and who will do it first?
4. How do teams these days think about defense? Do they all subscribe to services like yours, or have their own, bigger ones?
While offense is a different beast, and has a different set of variables to consider, I believe there is a lot that can be done with the type of data that we are collecting that we haven’t done yet. We’re using data for analyzing defense that we don’t use much to analyze offense. For example, Bill James did a study that showed there is some consistency among hitters regarding how they hit grounders and line drives, but that hitting flyballs is a significant factor that separates hitters from one another.
I don’t think we understand defense better than offense. We’ve worked hard at it, but we’re still getting there. As far as categories, I’d like to see people start to talk about defense in terms of defensive runs saved broken down into categories like OF Arm runs saved and double play runs saved while pushing errors and fielding percentage more into the background.
Sportvision is teaming with Major League Baseball to attempt to do just that. There was a recent article in the New York Times on this topic. Measuring range and reaction time would be great, but the ultimate goal is to combine them and measure a player’s skill at turning a ball in play into an out, which current systems are already getting at.
Each team handles defense in its own way. Many subscribe to our system and/or refer to The Fielding Bible frequently. I think teams have been catching on to the importance of defense over the past few seasons, and the media really started catching on this offseason. The lower-than-expected contracts signed by free agent defensive liabilities this past offseason (Burrell, Abreu, and Dunn, to name a few) indicate that the league as a whole has made an adjustment.
Thank you, Mr. Dewan, for the chat. Your publications have become a fixture of my pre-season purchases.....
Thanks to Fenway, Red Sox fans may be unusually interested in how the various fielding systems try to correct (or ignore) park effects on fielding. In light not just of the Green Monster, but Fenway's jutout behind 3b (which may reduce the run-scoring impact of some shots down the third-base line), its huge right field, its smaller foul territory, the Triangle, etc., how can fans best account for park effects on the currently available fielding ratings/metrics? Is there a reasonable way to normalize defensive metrics, or should we look just at "away" splits over a longer period of seasons (in order to avoid small sample size error)?
Many thanks.
Our system handles the major park effects well.
First off, a 360-foot fly ball to right in Fenway is only being compared to identical fly balls in stadiums where the 360-foot fly stays in the park. The fact that the same fly ball might be out in other parks doesn’t affect how much credit or penalty is assigned for the play. Because our zones are small (i.e. more precise), a park adjustment is already built into the system.
Secondly, after the first Fielding Bible we added the “Manny Adjustment” for balls hit off the wall, which eliminates balls hit too high to handle off a wall.
Lastly, foul fly balls don’t impact a player’s plus/minus number. As for the Fenway “jutout”, Mike Lowell may get a slight benefit when a ball goes for a single at Fenway that might otherwise go for a double, but the effect on the total is probably minuscule.
Defensive positioning prior to the ball being put into play seems to be one major factor that frustrates evaluation of defense through scouting, as players with poor positioning can make routine plays look spectacular. Statistical analysis using putouts, assists, etc. compensates for this by counting the plays that a player makes whether they are routine or spectacular. From my understanding, UZR compensates for positioning by figuring the tendency of an average player to make a play in a particular zone.
How does +/- compensate for this problem and why does it compensate for this better than other defensive evaluations? Was this problem one that you particularly thought of while working on +/-?
Thanks a lot for doing this chat.
That’s exactly what Plus/Minus does. Both Plus/Minus and UZR factor in defensive positioning and give credit for it. Both systems account for both components of good defense – having good range and positioning well. In this way, both systems are complete. What still can be done is to break down each fielder’s performance into separate components for range and positioning.
We explain our system at great length in The Fielding Bible and at www.fieldingbible.com. In 2008, a hard grounder to Vector 197 (slightly to a normal shortstop’s right) was converted to an out by the shortstop 86% of the time. The average shortstop will make the play most of the time, but not always. If the shortstop makes this particular play, we award him +0.14 (1 - .86) plays above average. If he fails to get the out, we penalize him -.86 plays. We do this for every play and every position and add them up to get a player’s plus/minus score, which we later convert to Runs Saved.
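The per-play scoring just described can be sketched as follows. This is a minimal illustration rather than BIS code: only the 86% figure comes from the text, while the other out rates, the list of plays, and the function name are made up for the example.

```python
def play_value(out_rate, made):
    """Plus/Minus value of one play, given how often the bucket is converted to an out."""
    return (1 - out_rate) if made else -out_rate


# Hypothetical plays for one shortstop: (out rate for the bucket, play made?)
plays = [
    (0.86, True),    # the hard grounder to Vector 197, converted
    (0.86, False),   # the same kind of ball, not converted
    (0.30, True),    # a difficult play made (invented out rate)
    (0.95, True),    # a routine play made (invented out rate)
]

plus_minus = sum(play_value(rate, made) for rate, made in plays)
print(round(plus_minus, 2))
```

Under this scheme an exactly average shortstop nets zero on the 86% play over time: 0.86 × (+0.14) + 0.14 × (-0.86) = 0.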
As mentioned in the first question, Plus/Minus uses more precise zones to determine the difficulty of a particular play, which in theory should give us more accurate results.
Here's a question from lurker RS DOrtiz. I'm paraphrasing a bit here.
I am interested in the weighting of throwing arms when one is considering defensive evaluations for outfielders, specifically in regards to Jacoby Ellsbury. Does his weak throwing arm set him apart that much negatively when considering votes for the Gold Glove? How are throwing values calculated? Also, how does Jacoby compare to someone like Ichiro considering arm and range? Observationally, it appears to me that Ellsbury makes more "spectacular" catches than Ichiro, and I'm wondering how well that relates to actual statistics and evaluations.
Thanks.
In The Fielding Bible—Volume II, we tackle the issue of combining a player’s throwing arm with his range in the outfield by converting the systems we use for throwing (OF Arms) and range (Plus/Minus) into Runs Saved. A baserunner kill at home plate is the most valuable defensive play. In terms of range, making a more difficult play will earn the fielder more Plus/Minus Runs than a routine play.
Ellsbury’s weaker arm has only cost the Red Sox about three runs in his career. He’s made up for it with six defensive runs saved with his range. Ichiro is in a completely different universe. Ichiro has saved 30 runs with his throwing arm since 2003 and 66 runs with his range. Ichiro has established himself as one of the best outfielders in baseball, while Ellsbury seems to be an all-around average centerfielder. In left and right, Ellsbury’s range would stand out, but in center he’s nothing special. In center field, that’s partially true for Ichiro as well; his range is average for a center fielder but his throwing arm more than makes up for it. Here’s the data for Ichiro and Ellsbury since 2003:
(through July 22)
LastName | Pos | Innings | Plus/Minus Runs Saved | OF Arms Runs Saved | HR Saving Catch Runs Saved | Total Runs Saved
Ellsbury | 7 | 490.3 | 3 | -1 | 0 | 2
Ellsbury | 8 | 1398.0 | -1 | -2 | 0 | -3
Suzuki | 8 | 2279.0 | -2 | 9 | 0 | 7
Ellsbury | 9 | 287.0 | 4 | 0 | 2 | 6
Suzuki | 9 | 6770.7 | 60 | 21 | 8 | 89
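The table's Total Runs Saved column is the sum of the three component columns, and the career figures quoted in the answer come from adding rows across positions. A short sketch of that bookkeeping, with the numbers copied from the table and the variable names invented for the example:

```python
# (player, position, plus/minus runs, OF arm runs, HR-saving-catch runs, total runs saved)
rows = [
    ("Ellsbury", 7, 3, -1, 0, 2),
    ("Ellsbury", 8, -1, -2, 0, -3),
    ("Suzuki", 8, -2, 9, 0, 7),
    ("Ellsbury", 9, 4, 0, 2, 6),
    ("Suzuki", 9, 60, 21, 8, 89),
]

totals = {}
for name, pos, plus_minus, arm, hr_catch, total in rows:
    # Each row's total is the sum of its three components.
    assert plus_minus + arm + hr_catch == total
    t = totals.setdefault(name, {"plus_minus": 0, "arm": 0, "hr_catch": 0, "total": 0})
    t["plus_minus"] += plus_minus
    t["arm"] += arm
    t["hr_catch"] += hr_catch
    t["total"] += total

for name, t in totals.items():
    print(name, t)
```

This gives Ellsbury -3 arm runs and +6 Plus/Minus runs, and Ichiro +30 arm runs, matching the answer. Ichiro's quoted 66 range runs appear to correspond to his Plus/Minus runs plus HR-saving catches (58 + 8).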
Hi John, thanks for taking the time to answer some questions.
What do you feel is the best way to evaluate the defensive impact of catchers? What is the value of controlling the running game against blocking balls, and do you think it is even possible to quantify something like how a catcher handles the pitching staff?
We explain our method of rating catchers’ ability to control the running game in The Fielding Bible—Volume II. We have also taken a first stab at measuring a catcher’s handling of the pitching staff. We use an example comparing Brandon Inge, Ivan Rodriguez, and Jose Molina to illustrate how we calculate Catcher Earned Runs Saved. I have to refer you to the essay in the book, because to explain the system I’d have to copy the whole essay here.
You mentioned that our Catcher Runs Saved system seems to rely heavily on CERA (Catcher ERA) but that’s a huge oversimplification. In the system, we use the earned runs in the CERA, but only as it relates to catchers catching the same pitchers. If Pudge has a 4.40 CERA with Joe Smith pitching and Molina has only caught him for one inning with a 9.00 CERA, there is almost no effect. There might be credit for Pudge for one earned run saved, but once we use our credibility factor the Adjusted Earned Runs Saved is 0.
Also, we use our Enhanced Fielding System of Defensive Misplays and Good Fielding Plays to evaluate catchers. Jason Kendall blocked more pitches than anyone, while Bengie Molina allowed the most balls to get by him. This is more of a scouting-based approach and contains very valuable information that is otherwise unrecorded.
Regarding Jason Varitek, the numbers suggest that he has been better than other catchers at handling the pitching staff in the last couple of years. Or, more specifically, with Varitek calling the game for his pitchers, it has resulted in an improved ERA for those pitchers compared to all other catchers who have caught those pitchers. We give him credit for nine Adjusted Earned Runs Saved – three in 2008 and six so far in 2009.
There is still a lot to do in this area, but we’ve gotten a good start on evaluating catchers.
Many thanks for your time, John.
Given the advances in Pitch/FX analysis, and the forthcoming usage of Hit/FX, how might these trajectory-based technologies be better used in fielding evaluations?
In theory, PITCHfx and HITfx data are very useful tools for data collection and analysis, but both have their limitations. To their credit, MLBAM and Sportvision have invested a lot of time and money in both projects, and we’re starting to see the benefits of this type of information.
Neither system in its present form adds much to our current fielding analysis. I expect that utilizing the new batted ball timer data collected by Baseball Info Solutions will be a huge advance not only for fielding analysis but for pitching and hitting evaluation as well.
Thank you for taking questions, John. Over the last couple of years, Dave Cameron has been touting data like this (from Fangraphs):
CF Values 2009
Name | Batting | Fielding | Replacement | Positional | RAR | WAR | Dollars (Millions)
Matt Kemp | 13.6 | 10.1 | 11.4 | 1.2 | 36.3 | 3.6 | $16.00
Torii Hunter | 20.6 | -2.8 | 10.8 | 0.7 | 29.3 | 2.9 | $13.20
Franklin Gutierrez | 5.7 | 12.0 | 9.9 | 1.2 | 28.7 | 2.9 | $13.00
Carlos Beltran | 20 | -3.8 | 9.3 | 0.6 | 26.1 | 2.6 | $11.80
Mike Cameron | 8.3 | 5.1 | 10.7 | 1.2 | 25.3 | 2.5 | $11.40
Curtis Granderson | 7.8 | 2.1 | 12.5 | 1.3 | 23.7 | 2.4 | $10.70
Nyjer Morgan | -2.8 | 17.7 | 11.5 | -2.7 | 23.7 | 2.4 | $10.70
B.J. Upton | 0.8 | 5.4 | 11.8 | 1.2 | 19.1 | 1.9 | $8.60
Nate McLouth | 10.5 | -3.3 | 10.5 | 1.1 | 18.8 | 1.9 | $8.00
Aaron Rowand | 8.3 | -1.1 | 10.6 | 1.2 | 19 | 1.9 | $8.00
Adam Jones | 11.4 | -6.6 | 11.1 | 1.2 | 17.1 | 1.7 | $7.70
Shane Victorino | 10.2 | -7.8 | 12.3 | 1.2 | 16 | 1.6 | $7.20
Jacoby Ellsbury | 5.3 | -6.4 | 11.6 | 1.2 | 11.7 | 1.2 | $5.30
Melky Cabrera | 1.3 | 0.6 | 8.7 | -1.6 | 9 | 0.9 | $4.00
Grady Sizemore | 1.7 | -4.1 | 9.6 | -0.5 | 6.6 | 0.7 | $3.00
Dave reaches the conclusion that Franklin Gutierrez's defense is so valuable that his overall value exceeds that of nearly all major league CFs (Beltran, Granderson, etc.). He has used similar data to support his claim that Adrian Beltre has been well worth his contract with the Mariners.
Are defensive metrics so evolved and reliable that we can credibly make such claims when the bulk of a player's value is wrapped up in defense? Take Nyjer Morgan as an extreme example. When calculated this way, his entire value is based on defense, and yet he rivals Beltran, Granderson, Upton, McLouth, and Rowand.
How do you respond to these conclusions? Shouldn't an adjustment be made for reliability of the data, where offensive metrics are weighted higher and defense indicators discounted, in order to render a credible judgement?
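The table's RAR column is, to within rounding, the sum of the four run components, which makes it easy to see how much a ranking leans on the fielding number. The sketch below rebuilds RAR for a few rows and shows what a discount on fielding would do. The `fielding_weight` parameter and the 0.5 value are purely hypothetical, chosen to illustrate the question rather than to suggest a setting anyone uses.

```python
# (name, batting, fielding, replacement, positional, listed RAR), copied from the table
players = [
    ("Matt Kemp", 13.6, 10.1, 11.4, 1.2, 36.3),
    ("Franklin Gutierrez", 5.7, 12.0, 9.9, 1.2, 28.7),
    ("Carlos Beltran", 20.0, -3.8, 9.3, 0.6, 26.1),
    ("Nyjer Morgan", -2.8, 17.7, 11.5, -2.7, 23.7),
]


def rebuilt_rar(bat, fld, rep, pos, fielding_weight=1.0):
    """Sum the run components, optionally scaling the fielding term.

    fielding_weight is a hypothetical knob for the discount the question
    proposes; 1.0 leaves the fielding runs at face value.
    """
    return bat + fielding_weight * fld + rep + pos


for name, bat, fld, rep, pos, rar in players:
    full = rebuilt_rar(bat, fld, rep, pos)
    half = rebuilt_rar(bat, fld, rep, pos, fielding_weight=0.5)
    print(f"{name:20s} listed {rar:5.1f}  rebuilt {full:5.1f}  "
          f"fielding at half weight {half:5.1f}")
```

With fielding at face value Morgan and Beltran sit within a few runs of each other. At the hypothetical half weight, Morgan drops to about 14.9 while Beltran rises to 28.0, which is the kind of reshuffling the question is asking about.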
As mentioned earlier, we use caution in small samples of defensive data. However, in Gutierrez’s case, his defensive value is far from a small-sample fluke. He led all right fielders in Plus/Minus Runs Saved in 2007 in only 579 innings, and he repeated the feat in 2008 in just over a half-season’s worth of innings. After last season, Fielding Bible Award voters were convinced of his ability and gave him the award over Nick Markakis, Ichiro Suzuki, and everyone else, despite the fact that Gutierrez has a sub-par arm in right field.
When new Seattle GM Jack Zduriencik brought in Gutierrez to play center and Endy Chavez to play left, we touted the Mariners as having the best defensive outfield in baseball. Sure enough, Seattle’s outfield has 29 Defensive Runs Saved through July 28, second only to Oakland’s 30.
Gutierrez has handled the transition to center field well. While his Plus/Minus numbers are down at the tougher position as you would expect, he still leads the league in Plus/Minus Runs Saved, and his arm is less of an issue. On top of his defensive prowess, Gutierrez is having his best season at the plate. In The Fielding Bible--Volume II, we combined offense, defense, baserunning, and positional value into Total Runs. If Gutierrez keeps playing like this in the second half, he could find himself among the top players in baseball on our 2009 Total Runs leaderboard.
Nyjer Morgan is a different story because the sample size is smaller. He’s played as an above average outfielder, but he’s only logged 1300 innings across three seasons counting all three outfield positions. He rates above average with a total of 21 runs saved on defense in his limited time, but we don’t consider him in Gutierrez’ class (yet).
A vast majority of the outs recorded by MLB defenses are what would be considered "routine plays," or plays that any player at that level would be expected to make. What percentage of all plays made by the defense would you consider to be "routine," and does that number vary from position to position?
I guess it depends on how you define “routine plays”. Let’s define a “Routine Play” as one that at least 50% of players at that position made successfully, and we’ll see what percentage of those plays were converted to outs. (Note: this does not include popups and liners for infielders, which are also included in Plus/Minus.)
Based on this definition, every defensive position shows at least 75% of the plays as routine. Third base and pitcher are right at 75% while center field and right field top the list at 90%.
These routine plays are handled successfully about 90% of the time. There is some variation by position but nothing really significant.
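To make the definition above concrete, here is a minimal Python sketch. It is not BIS's actual code. It assumes each play is tagged with a position, a hypothetical location-and-type "bucket," and whether an out was recorded. It treats a bucket as routine when at least 50% of its chances became outs, and it measures the routine share against all chances at the position. Both choices are assumptions about how the definition would be applied.

```python
from collections import defaultdict

def routine_summary(plays, threshold=0.5):
    """Summarize routine plays by position.

    plays: iterable of (position, bucket, out_made) tuples, where
    "bucket" is a hypothetical grouping of similar batted balls.
    Returns {position: (routine_share, routine_conversion_rate)}.
    """
    # Count chances and outs for every (position, bucket) pair.
    chances = defaultdict(int)
    outs = defaultdict(int)
    for position, bucket, out_made in plays:
        chances[(position, bucket)] += 1
        if out_made:
            outs[(position, bucket)] += 1

    # Per position: [all chances, routine chances, routine outs].
    totals = defaultdict(lambda: [0, 0, 0])
    for (position, bucket), n in chances.items():
        totals[position][0] += n
        # Assumed rule: a bucket is routine if its out rate meets the threshold.
        if outs[(position, bucket)] / n >= threshold:
            totals[position][1] += n
            totals[position][2] += outs[(position, bucket)]

    summary = {}
    for position, (all_n, routine_n, routine_outs) in totals.items():
        share = routine_n / all_n
        conversion = routine_outs / routine_n if routine_n else 0.0
        summary[position] = (share, conversion)
    return summary
```

Calling `routine_summary(plays)` on a season of play records would give, for each position, the fraction of chances that count as routine and how often those routine chances were turned into outs.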
It seems that statistical analysis should be most useful in the middle of the distribution, where we can have real questions about someone's true value, rather than at the ends of the distribution, where a player's value seems relatively apparent. What I mean is, it doesn't seem like we really need advanced statistical analysis to tell us that A-Rod or Pujols are relatively good hitters, or that Alex Cora is relatively bad; regardless of what numbers you crunch, most statistical analyses will confirm this. But, if you were building a team, it might be nice to know which middle-of-the-road hitters are good hitters and which are bad hitters before you commit lots of money to them in the form of large contracts.
The point here is that one would think that most measurement systems should be relatively consistent on the tail ends of distributions. I would otherwise expect that your analysis and MGL's analysis could agree on, perhaps, the BEST defenses and the WORST defenses while having disagreements about the middle-of-the-road defenses. (Correct me if I'm significantly wrong at any point here.)
To that effect, I am reposting a thought I posed to MGL on The Book's Blog, and was wondering what your take on it was:
<QUOTE>
If you’ve got a Hardball Times Annual, open your 2008 edition. MGL has an article called “Signals and Noise”, in which he estimates, using his defensive system, the value of the Toronto Defense to be +12 runs. I admit that I didn’t pay any attention to this number the first time I read the article.
No problem, right, because who cares about Toronto’s defense?
Well, it turns out that at least one guy does: John Dewan, who has the very next article in this book. In this article, using his defensive system, he estimates the value of the Toronto defense to be +92 (best defense that year, iirc).
No, really. If you turn three pages, you get two different estimates that differ by eighty (something like 60ish runs?).
</QUOTE>
Isn't that an incredible amount of variance around what you consider to be the best team in the league? Is it a problem that defensive evaluations can't agree on the best and worst teams, let alone the middle-of-the-road teams?
I think you’re comparing apples to oranges. The number cited from THT 2008 (+92) is Enhanced Plus/Minus, essentially the number of bases (not runs) saved due to the seven primary fielders’ ranges. That number also doesn’t include any of the other components we included in developing Runs Saved during the 2008-2009 offseason.
Having said that, the number of Runs Saved by the defense for Toronto that we have now in Bill James Online (www.billjamesonline.com) is 70. Comparing that to the 12 runs shown in the “Signals and Noise” article, your point is well made. Why are the two systems that different in 2007?
As mentioned earlier, UZR does many things similar to Runs Saved. However, Runs Saved also measures pitcher fielding, the ability of pitchers and catchers to control the running game, a catcher’s ability to handle the pitching staff, corner infielders’ ability to handle bunts, and the extra impact of robbed home runs. (Just ask Mark Buehrle about robbed home runs.)
This season, both systems like the Pirates, Mariners, Rangers, and Tigers. Neither likes the Marlins, Red Sox, Royals, or Twins.
Runs Saved (from Bill James Online) and UZR both like the Rays’ defense, but UZR rates them several spots higher. The biggest reason is the Rays’ pitching staff, which has not fielded the position well at all this year (-8 Runs Saved). UZR does not account for pitcher defense, so comparing the two isn’t entirely fair.
The Toronto discrepancy is largely due to the accounting differences between the two systems. The difference is most conspicuous in the middle infield.
I have a question for Mr. Dewan. This year, the Red Sox defense has been pretty shaky at times, and sabermetric measures like Ultimate Zone Rating (UZR) rate them as one of the worst in baseball this year. How did this happen to this Boston squad, which was very good in 2007 and 2008, by all accounts? Are you seeing the same sort of decline in performance using your plus/minus method?
As mentioned in the previous question, the Red Sox rate as the worst defensive team in the majors based on the Team Runs Saved data available at Bill James Online. The biggest issue is the left side of the infield, which has been abysmal in 2009. Injuries have taken their toll on Mike Lowell, and you can see that the Red Sox have been 23 runs below average at third base this year. Just between Lowell and Julio Lugo, the Red Sox have lost about 30 runs (or roughly 3 wins) defensively. Fortunately for Boston, Jed Lowrie is back playing and should improve the defense.
Thanks for giving us some time, John.
What does data received from projects like Sportvision (reported in the New York Times) portend for the future of defensive statistics? To what extent are you worried that increased granularity of data leads to greater reliance on noisy data rather than greater clarity?
Relatedly, in the non-baseball world, many (for example, Nassim Taleb) have written about our increased reliance on metrics to provide an air of objectivity and clarity to otherwise subjective and messy data. When people look at +/- (or UZR or other metrics) they may rely on a final output (e.g., Lowell is "-20" etc.) for more than the data can reasonably provide. Do you have any thoughts on how much room for the subjective and observational should be left in a world of statistics with increasing granularity, complexity and authoritative acceptance?
Thanks again.
Sportvision is doing some great work. The trick is turning the video tracking into useful analysis, particularly from a fielding standpoint. We could subdivide the field into a million different regions and include a thousand different variables, but the analysis would be crippled by trivially small sample sizes.
There will always be some things we won’t be able to quantify, on offense and on defense. At the PITCHfx Summit a few weeks back, there was some discussion of tracking the catcher’s mitt as a way to measure how effectively a pitcher is hitting his spots. We’re never going to be able to track everything. The best we can do is to understand what we can and can’t quantify and make mental adjustments accordingly. Another good example of this is our Defensive Misplays and Good Fielding Plays system. If an outfielder misplays a double into a triple or misses the cutoff man, it won’t show up in the boxscore or in a player’s Plus/Minus or UZR score. We do count these at BIS, and we’re working on adding them to our evaluations.
As I mentioned earlier, I think our defensive metrics are, at best, at the 60% mark in terms of credibility. Subjective evaluation accounts for the remaining 40% or more.
Related to Joshv's and browndog's questions above - according to THT's Mike Fast, the Trackman System will be unveiled at this year's All-Star Home Run Derby. Do you have any thoughts on this promising new system? How does it compare to Sportvision, and what will it mean for the future of defensive assessments?
I don’t know that much about the Trackman system. The technology to track the ball in flight in real-time is exciting from a fan’s standpoint. It seemed there were still a few issues to work out based on what we saw in the Home Run Derby (the trail on the ball looked choppy at times, and the cameras occasionally lost track of the ball). Sportvision is using different technology (primarily cameras rather than Doppler radar) but going for the same general idea—tracking the ball’s flight digitally.
John, I see Fielding Bible numbers over at Bill James online. What's your connection to Bill?
I don’t know the man. Does he work with baseball information too? ;-)
Bill and I have been working together on many projects since the mid-1980’s. We worked together on Project Scoresheet, collecting detailed data and turning it into useful analysis. Bill was an investor in STATS, Inc. from day one and one of the most influential creative forces at that company as we grew it from a bedroom office in my home. Most recently, Bill and I collaborate on the annual Bill James Handbooks, which rely on data from my company (Baseball Info Solutions) and are published by ACTA Sports. Bill also made huge contributions to both editions of The Fielding Bible.
Bill and I are partners on Bill James Online.
Over the last few years we've seen teams value cost-controlled prospects more highly than ever before. Do you feel that there is a building momentum in the industry to invest in ball tracking technology in the minors to help teams to better evaluate these highly valuable assets?
The interest in minor league information has grown over the last 20 years as quickly as the overall interest in major league sabermetrics. It’s just at a lower level. It will lag behind but continue to grow.
Related to Puffy's question: In your opinion, do most MLB teams currently over-value, under-value, or properly value, the worth of defense, in writing player contracts? Who does it best/worst?
Defense is probably undervalued by some teams and properly valued by others. As a whole, the league probably still underrates good defenders, but the gap is quickly closing. Earlier, I mentioned this past offseason’s free agent market and the adjustments the league has made. The Mariners and Rangers both made moves to drastically shore up their defenses this offseason, and they rate as two of the best in the league this year. Teams near the bottom of the Runs Saved leaderboard are probably lagging behind the curve, but there are exceptions. Everyone knows the Red Sox are one of the more analytical teams, yet their team defense has been abysmal. But when you score runs like they do and have a strong pitching staff, you can live with sub-par defense.
Tied in to some other questions here. In talking to a couple of scouts whose teams do use saber stats to help evaluate players on offense, as well as traditional scouting, I have been told that they do not use defensive metrics in evaluating minor league or MLB players because of the subjectivity of the stats. Do you see this changing anytime soon?
The Plus/Minus and Runs Saved systems, while not 100% objective, are certainly not “subjective”. That being said, we are always looking for ways to improve our data and analysis. I’ve already mentioned the batted ball timer data BIS is collecting this season. This information will greatly improve our analysis of fielding (and pitching and hitting, for that matter).
Judging by the number of teams who subscribe to our defensive products, it’s clear that teams are very interested in our Plus/Minus, Runs Saved, and Enhanced Fielding systems. As mentioned a few questions back, this offseason’s free agent market is confirmation that teams are embracing the various methods available to them. I don’t think any team uses a purely statistical approach to evaluating defense, nor should they, but it’s clear that the numbers are carrying a lot more weight than before. This trend will continue as new data becomes available and our ability to analyze the data improves.
How about Mike Lowell? We can see for ourselves that he's not been the same in the field since his hip surgery after last season. What do your fielding metrics tell you about him, and his defensive performance? Is he positioning himself differently, now that his mobility is impaired? Is anything affected beyond his range?
We mentioned Lowell a few questions back. This is a case where your observations directly match what our numbers are saying. Lowell has fallen off a cliff defensively this year, costing the Red Sox around two full wins with his defense by itself. One of the things we track is how well players go to their left and their right, and how well they handle balls hit more directly to them. In Lowell’s case, he has dropped off in each one of these areas. He hasn’t handled bunts very well this year either, but it’s a super small sample size. Fortunately, he’s still been productive at the plate, though he’s not being as patient as he used to.
Trying to get in before the deadline, so two questions:
1) Does STATS compute things like inter-rater reliability for all of their personnel? Are unreliable people discarded or weighted appropriately? I wonder about the effectiveness of using human judgments for many of these measures.
2) Should defensive stats report both an average value and also a measure of variability? So, for example, players that have fewer (or greater) plays would have smaller or larger variability scores? The point is: could defensive evals do a better job communicating the amount of variability in each player's (or team's) measurement, so that fans could get a good sense of whether or not a difference is really a significant difference?
I can only assume STATS is using the same process they used when I left. At BIS, we rigorously train our scorers and minimize any potential biases. During the season, we review each scorer’s performance and make corrections as necessary. At the end of each season, we do a review of many plays to ensure that our data is recorded as accurately as possible.
The key with any statistic or number is the context. You can do your best to inform your readers of the process and thought behind each evaluation, and that’s all you can do. We could attach a reliability score to every number we publish, but what about when people start misinterpreting the reliability indicator? Then we just have another statistic to explain. There’s a fine line between educating readers and being too technical and losing their attention altogether.
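For readers wondering what the variability measure raised in the question might look like, here is one possible sketch in Python. It is an illustration under stated assumptions, not something BIS publishes: each chance carries an assumed league-wide probability that the play is made, a fielder is credited (1 - p) for making it and debited p for missing it, and chances are treated as independent so the spread of the total is the square root of the summed p(1 - p) terms. The function name and inputs are hypothetical.

```python
import math

def plus_minus_with_spread(chances):
    """Return (plus_minus, spread) for a list of chances.

    chances: list of (p_out, made) pairs, where p_out is the assumed
    probability that an average fielder makes the play and made is
    True or False for this fielder.
    """
    # Credit (1 - p) for a play made, debit p for a play not made.
    plus_minus = sum((1.0 if made else 0.0) - p for p, made in chances)
    # Assuming independent chances, variances p * (1 - p) add up.
    spread = math.sqrt(sum(p * (1.0 - p) for p, _ in chances))
    return plus_minus, spread
```

Under these assumptions the spread grows only with the square root of the number of chances, so a total built on few chances is noisy relative to its size, which is the small-sample caution in numeric form.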
What kinds of factors skew statistical analyses of defense?
I'll give a concrete example: Jacoby Ellsbury's defense has fallen off a cliff this year, according to most advanced metrics. I find it hard to believe that he was an elite defender in 2008 and is now a poor one. Can you make an educated guess whether one year's rating is more likely to be an aberration than the other? If so, what factors would you look for as signs that a particular player's rating is an aberration (or, inversely, is especially likely to be accurate)?
The biggest thing that skews statistical analysis is sample size.
I mentioned Ellsbury earlier. With barely a full season of innings in the outfield, it’s still early to draw strong conclusions. Based on what we’ve seen so far, Ellsbury has a below average arm with above average range, balancing out to an average centerfielder defensively, maybe above average at the corner positions. In 2008, he was tremendous in having very few Defensive Misplays relative to the Good Plays that we scouted and counted.
What kind of year-to-year correlation do we currently see in the best defensive metrics? And do you think it's fair to assume that if/when we have a way to perfectly measure defensive performance, it will show a similar year-to-year correlation as the most stable hitting/pitching metrics?
On the team level, we’re seeing year-to-year correlations in Runs Saved in the .3 to .4 range. We’re not quite to the level of hitting/pitching metrics yet, but we’re getting closer. There is no perfect way to measure defense, and the same goes for offense and pitching. We will keep improving our methods, and eventually our understanding of defense will catch up to our understanding of hitting and pitching.
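As a rough illustration of how a year-to-year correlation like the one above is computed, the Python sketch below calculates a Pearson correlation between two seasons of team totals. The team numbers are made up for the example and are not real Runs Saved figures.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, each left unscaled (the scale cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up team totals for two consecutive seasons (illustration only).
season_one = [35, 20, 5, -10, -25, 0]
season_two = [10, -5, 15, 5, -20, -15]
year_to_year = pearson(season_one, season_two)
```

A value near 1 would mean teams keep nearly the same relative standing from one year to the next, while a value near 0 would mean one season says little about the next.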
This is the last question -- thank you to everyone for your very detailed questions. I am very happy to see the level of sophistication and understanding that you all have. Enjoy the rest of the season!