My writings about baseball, with a strong statistical & machine learning slant.

Monday, December 28, 2009

An obvious idea about IP distributions

Having finished "The Girl With the Dragon Tattoo," last week, and having no more pleasure reading left, I resorted to reading a short statistics text that I brought along for my trip. I studied various maths for many years, but my formal knowledge of statistics is limited to two semesters as an undergrad, as well as some physics labs, which I did not fully appreciate at the time. Nowadays, when working with baseball data (sometimes using Excel), I wish that I had kept my notes from experimental physics lab ten years ago!

In any case, early in the statistics book I stumbled on a simple comment about non-unimodal distributions: the author casually noted that most of these distributions are actually distributions of sums of multiple random variables. Well, pitcher IP (innings pitched) is not really a random variable with a (skewed) normal distribution, but it can be described well as the sum of two other variables: starting innings pitched and relief innings pitched. Of course, a pitcher's starting and relief innings are not unrelated. However, in today's game, they can reasonably be treated as separate variables. Except for playoff situations, pitchers are always either starters or relievers (or not on the active roster) during any particular stretch of the season. The Yankees' Joba Chamberlain & Phil Hughes drama from this season is a recent example of this property. So both IP_start and IP_relief may reasonably be predicted based on a pitcher's time spent in each role, as well as his success, durability, manager's preference, etc. Unfortunately, as of now, there is no way to see how many games a pitcher was available as a starter or as a reliever throughout the season. But conceptually, this idea is workable in today's game.

Each of those distributions (IP_start and IP_relief) is itself bimodal, with peaks at 0 IP, and at around 200 IP for IP_start and around 65 IP for IP_relief. Still, these distributions are simpler than the trimodal distribution for total innings pitched.

Now that I think of it, I'm not sure why I didn't try to predict IP_start and IP_relief separately before. From the beginning, I thought about pitcher roles/usage in predicting IP, but I didn't think categorizing pitchers into starter & reliever would be useful, beyond what is already evident from previous years' data. I still think that was the right idea, but predicting IP_start and IP_relief is another idea that makes more sense. I am not chopping the pitchers into starters and relievers, but rather I am computing two aspects of a pitcher's ability to contribute toward his own usage. Both Mariano Rivera and Josh Towers can be predicted to have few IP_start, but for different reasons. Similarly, we will expect both CC Sabathia and Kei Igawa to have few IP_relief, but again for different reasons.

I have some preliminary tests showing that even a linear combination of features can predict IP_start much more accurately (as measured by the correlation between predictions and actual values on a test set) than I am currently doing in the "one size fits all" IP model. Predictions for IP_relief are tougher, as would be expected (especially since I am including pitchers with very low IP in my model, where variance is very high). Still, predicting IP_start very accurately will be useful. Also, the features used for IP_start and IP_relief prediction are different, so the combined model (ie summing IP_start and IP_relief predictions) will lead to a straightforward non-linear model for total IP, which could be much more predictive.
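As a rough sketch (not the actual code behind these tests), combining the two predictions and scoring them against actual IP could look like the following. Summing the two predictions and scoring by correlation come from the paragraph above; clipping negative predictions to zero is an added assumption for illustration, and the function names are hypothetical.

```python
import math

def combine_ip(pred_start, pred_relief):
    """Total IP prediction as the sum of the two role-specific predictions.

    Assumption (not from the post): negative predictions are clipped to 0,
    since a pitcher cannot throw negative innings.
    """
    return [max(0.0, s) + max(0.0, r) for s, r in zip(pred_start, pred_relief)]

def pearson(xs, ys):
    """Pearson correlation between predictions and actual values."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def rmse(pred, actual):
    """Root mean squared error between predictions and actual values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))
```

Both `pearson` and `rmse` would be run on a held-out test set, once for the combined prediction and once for the single "one size fits all" model, to compare the two.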

I should have concrete results soon. Tomorrow, I am supposed to spend a day seeing the Great Ocean Road west of Melbourne. So maybe I'll have something to show the day after.

Sunday, December 27, 2009

Pitcher repertoire depth: a measure of "how many pitches" he throws.

In my K9 estimates, I use two statistics that I call "rep_depth" and "rep_offerings." Let me explain what this is all about.

Leading up to this year's NL Cy Young voting, I was reading some discussions about whether Tim Lincecum or Adam Wainwright has a deeper arsenal of pitches. The implied assumption seemed to be that throwing only two pitches regularly is a bad thing. Those taking part in the discussion threw around nuggets like "Lincecum throws 4 pitches at least 5% of the time" or other such ad hoc stats using arbitrary cutoffs. I thought I'd try to come up with something more systematic.

FWIW, Lincecum's and Wainwright's pitch frequency breakdowns can be seen on FanGraphs. It is clear to me that Wainwright has a deeper arsenal (although not by much). It's not clear how much this matters as far as effectiveness goes, but the point here is to apply numbers to a concept that people use, not to evaluate the utility of the concept itself.

Most pitchers throw anywhere from 1 to 6 pitches (with non-trivial frequency) according to FanGraphs pitch data. Someone who throws two pitches 50-50 is a two-pitch pitcher. Easy enough. However, I think that most people would say that someone who throws two pitches at a 90-10 ratio is a one-pitch pitcher. Similarly for an 80-10-10 pitcher. However, someone who throws three pitches in a 60-20-20 ratio is a three-pitch pitcher. Also, someone who throws two pitches at a 70-30 ratio is a two-pitch pitcher, but still has a less balanced repertoire than the hypothetical 50-50 pitcher.

This brings to mind something involving the harmonic mean. The harmonic mean of 50 and 50 is 50. The harmonic mean of 70 and 30 is 42. The harmonic mean of 90 and 10 is 18. Of the three means (arithmetic, geometric and harmonic), the harmonic mean is always the smallest, and thus seems like a good candidate for "rewarding" pitchers with a highly balanced repertoire.
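The three figures above are easy to check. A minimal sketch (the helper name is mine, for illustration only):

```python
def harmonic_mean(values):
    """Harmonic mean: n divided by the sum of reciprocals."""
    return len(values) / sum(1.0 / v for v in values)

# The three two-pitch splits discussed above, as percentages.
for split in [(50, 50), (70, 30), (90, 10)]:
    print(split, round(harmonic_mean(split), 1))
# Rounded results: 50.0, 42.0 and 18.0, matching the text.
```

The more lopsided the split, the further the harmonic mean drops below the arithmetic mean of 50.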

Conceptually, if a pitcher throws one pitch much more often than his other offerings, the hitter need only look for that one pitch. However, as a pitcher spreads his offerings more widely, a hitter has to look for several pitches, even those only thrown 10-15% of the time. Or at least, that's the idea.

Considering n pitches, the harmonic mean of their frequencies is expressed as:

$$H_n = \frac{n}{\sum_{m=1}^{n} \frac{1}{p_m}}$$

where p_m is the frequency of pitch number m (with pitches ordered from most to least frequent), expressed as a probability (between 0 and 1). Therefore, the largest possible value for p_2 would be 0.5, the largest possible value for p_3 would be 0.33, and so forth.

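In order to convert the harmonic mean back to a number between 1 and 6, we multiply the harmonic mean by n^2. The first time, we multiply by n in order to get a number that is at most 1.0 (the harmonic mean of n frequencies is at most 1/n, reached when all n pitches are thrown equally often). The second time, we multiply by n in order to get a number that is at most n. Thus, writing H_n for the harmonic mean of the n frequencies:

$$\text{rep\_depth}_n = n^2 H_n = \frac{n^3}{\sum_{m=1}^{n} \frac{1}{p_m}}$$

That's it. I compute this for each n, using the pitcher's n most frequent pitches, and keep the best value as his rep_depth. Also I remember this best value of n and call it the "repertoire offerings" (rep_offerings), as the measure of how many pitches are in the pitcher's repertoire (ie how many pitches a hitter has to consider), expressed as a whole number.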

For example, in 2009, Tim Lincecum ends up with a rep_depth of 2.34 off of 3 offerings. Adam Wainwright ends up with a rep_depth of 2.55 off of 3 offerings. Among well-known starting pitchers in 2009, Dan Haren had the highest rep_depth at 3.24 off of 4 offerings. As seen here, Haren threw four pitches at least 13% of the time each (fastball, cutter, curve and splitter). However, that kind of depth is unusual. (Actually I just noticed that James Shields has an even higher rep_depth at 3.41. But you can argue that he is less famous than Dan Haren?)

I have included a full list of pitcher seasons (2002-2009) and repertoire depths here. The cutoff is for 60IP+, and a small amount of data from 2002 is missing, but otherwise this is a complete list.

Before I go on, I must mention that I made a small modification to the formula shown above. If we want to compute the rep_depth at 1 (ie just consider one pitch), it makes no sense to derive an answer other than 1.0. Also, if a pitcher throws his pitches at the rate of 45-45-5-5, he should end up with a rep_depth of 1.0 at 1 offering, and with a rep_depth of 2.0 at 2 offerings. Therefore, I normalize any set of pitches (so that the frequencies under consideration sum to 1) before computing the rep_depth. Thus, rep_depth is capped at 1.0 at 1 offering on the low side for everyone (for example, Mariano Rivera of the last few years).

Also, there were some nasty cases where a pitcher might end up with 1.002 rep_depth off 5 offerings. I think my formula is flawed in such cases, so I revert to 1.0 off of 1 offering if my formula yields a rep_depth below 1.1. This is a hack, but it rarely comes into play, and I think it makes sense.
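Putting the pieces together, here is a rough sketch of the whole calculation as described above, not the actual code used for the published numbers. It assumes that "best" means the largest value over n, that the top n pitches are renormalized to sum to 1 for each n, and that ties go to the smaller n. The function name and signature are hypothetical.

```python
def rep_depth(freqs, floor=1.1):
    """Return (rep_depth, rep_offerings) for a list of pitch frequencies.

    freqs may be percentages or probabilities; they are renormalized anyway.
    """
    # Order pitches from most to least frequent, ignoring pitches never thrown.
    ps = sorted((f for f in freqs if f > 0), reverse=True)
    best_depth, best_n = 1.0, 1
    for n in range(1, len(ps) + 1):
        top = ps[:n]
        total = sum(top)
        norm = [p / total for p in top]          # normalize the top n pitches
        hmean = n / sum(1.0 / p for p in norm)   # harmonic mean, at most 1/n
        depth = n * n * hmean                    # rescaled to be at most n
        if depth > best_depth:                   # assumption: ties keep smaller n
            best_depth, best_n = depth, n
    # The hack from the text: anything below 1.1 reverts to 1.0 off 1 offering.
    if best_depth < floor:
        return 1.0, 1
    return best_depth, best_n
```

With this sketch, a 45-45-5-5 pitcher comes out at 2.0 off 2 offerings and a 90-10 pitcher at 1.0 off 1 offering, as the text says they should. A 60-20-20 pitcher comes out at about 2.31 off 3 offerings.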

It is not immediately clear whether having a high repertoire depth is always a good thing. There are some pretty good pitchers who have low repertoire depths, and some mediocre pitchers (Adam Eaton of 2008, anyone?) who threw a lot of different pitches, and yet didn't do too well with any of them.

If we list only pitchers with 20+ VORP, there is no obvious pattern among the top performers, in regard to repertoire depth. Then again, the point here is to summarize pitch data, rather than to draw immediate conclusions about predictable performance.

The average rep_depth is right around 2.0, with average rep_offerings right around 3.0. So we can confidently say that your typical pitcher throws three pitches, but not with an even distribution. There is no inherent advantage to throwing more pitches, although there might be an advantage to throwing those same three pitches with a more even distribution. Power pitchers with great fastballs can often get away with throwing only two pitches (the other usually being a breaking pitch). Pitchers with lesser fastballs usually need a third offering, be that a cutter, splitter or changeup.

Going back to the K9 projections I wrote about previously, my formula punishes (in terms of an expected strikeout rate) high values for both rep_depth and rep_offerings. However, the system awards points for throwing particular pitches a high percentage of the time, namely fastballs and breaking pitches for power pitchers, and slow changeups for pitchers with slower fastballs. Makes sense to me. Universally, the system expects pitchers who throw lots of different pitches to have low strikeout rates, all other things being equal. This is a bit surprising, but not illogical. A pitcher with a great fastball (or cutter, in the case of Mariano Rivera) need not throw much else. These kinds of pitchers can be highly effective, and they record high strikeout rates. Pitchers with great primary weapons don't need more than one secondary offering. Still, as the linked chart shows, many pitchers (especially older, more experienced pitchers) have had great seasons throwing a variety of pitches. So the tendency to have lower strikeout rates among high rep_depth pitchers might be a case of reverse causality. I'm not really sure.

Among pitcher seasons 2002-2009 (20IP cutoff), the rep_depth and rep_offerings can be bucketed as follows:

rep_depth:

3181 elements, 10 buckets --> 318 target average
bucket 1 [1.000000, 1.321785] for 318 elements (1.121944 average)
bucket 2 [1.323485, 1.542642] for 318 elements (1.449034 average)
bucket 3 [1.543158, 1.688595] for 318 elements (1.621933 average)
bucket 4 [1.688927, 1.799741] for 318 elements (1.745801 average)
bucket 5 [1.800337, 1.906526] for 318 elements (1.855347 average)
bucket 6 [1.906688, 2.001738] for 318 elements (1.956757 average)
bucket 7 [2.001881, 2.180508] for 318 elements (2.092596 average)
bucket 8 [2.180602, 2.370741] for 318 elements (2.274430 average)
bucket 9 [2.371444, 2.650319] for 318 elements (2.501306 average)
bucket 10 [2.651505, 4.090890] for 319 elements (2.952992 average)

Created 10 buckets
6226.893148 / 3181 = 1.957527

rep_offerings:

bucket 1 [1.000000, 2.000000] for 979 elements (1.848825 average)
bucket 2 [3.000000, 3.000000] for 1132 elements (3.000000 average)
bucket 3 [4.000000, 4.000000] for 888 elements (4.000000 average)
bucket 4 [5.000000, 6.000000] for 182 elements (5.065934 average)

Created 4 buckets
9680.000000 / 3181 = 3.043068

Now, at least, it is possible to estimate the depth of a pitcher's repertoire using two numbers, which can be simply computed.
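The rep_depth listing above looks like equal-count bucketing of the sorted values. A rough sketch, assuming that is how it was produced (the actual tool isn't shown here, and the function name is hypothetical):

```python
def equal_count_buckets(values, n_buckets):
    """Split sorted values into n_buckets of (nearly) equal size.

    Returns one (low, high, count, average) tuple per non-empty bucket.
    """
    ordered = sorted(values)
    total = len(ordered)
    buckets = []
    for i in range(n_buckets):
        lo = i * total // n_buckets
        hi = (i + 1) * total // n_buckets
        chunk = ordered[lo:hi]
        if chunk:
            buckets.append((chunk[0], chunk[-1], len(chunk), sum(chunk) / len(chunk)))
    return buckets
```

For 3181 values and 10 buckets this gives nine buckets of 318 and one of 319, the same sizes as in the rep_depth listing. It does not keep tied values together, which the rep_offerings listing evidently does, so that part would need extra handling.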

Notes on Rick Porcello & Edwin Jackson (a month later)

A month ago, I wrote about Chien-Ming Wang, Rick Porcello, and Edwin Jackson. These are all pitchers that have under-performed my K9 projections in the recent past. To put it less narcissistically, they under-performed the K9 rates projected by a systematic approach to estimating K9 rates from pitch data and biographical information.

I pointed out that Porcello does not actually underperform his expected K9 by much, although his low projection itself is a concern.

However Wang and Edwin Jackson project to be at least league-average K9 pitchers, but have both consistently struck out many fewer batters than expected. From the table linked below, let's pull a list of starting pitchers (100IP+) who projected to have at least 6K/9, but significantly underperformed in reality (in order of under-performance):
  • Chien-Ming Wang 2006
  • Danny Graves 2003
  • Carlos Silva 2005
  • David Wells 2003 (actually not a bad season: 15-7 4.18 ERA for the Bombers, 4.6 WARP)
  • Fausto Carmona 2008
  • Jason Johnson 2005
  • Jorge Sosa 2005
  • Kyle Lohse 2005
  • Bronson Arroyo 2005 (last season in Boston, 1.9 WARP)
  • Edwin Jackson 2008
  • Jon Lieber 2004
  • Jorge Sosa 2003
  • Chien-Ming Wang 2005
  • Sidney Ponson 2004
  • Mike Hampton 2002 (his horrible last season for the Rockies)
  • Zach Duke 2007
  • Jeremy Guthrie 2009
  • Horacio Ramirez 2005
  • Hiroki Kuroda 2008
Admittedly, it is still hard to separate out K9 under-performance from just plain low K9 totals. However, by setting the cutoff at 100IP and an expected 6K/9, we are at least skipping over many marginal starters, and the Kirk Rueters & Turk Wendells of the world.

Still, being listed here as one of the top unexpectedly low K9 seasons in the last 10 years is not really a positive projection for a pitcher's future.

I think the Yankees are doing the right thing by not committing to Wang, but this comment might have been more meaningful if it had been made two years ago. At this point, he is "injury prone," rather than a surprisingly effective low-strikeout pitcher. The predictive power of pitcher injuries is another topic that I have been looking into. Maybe I'll have something on that front soon.

Edwin Jackson is another matter. He is just coming off a career season. After a disappointing development prior to 2009, he put up a solid 3.9 WARP in 2009, despite some late-season decline. However, he still continues to strike out many fewer batters than one would expect given his stuff (ie pitches and bio information). In 2009, he posted 6.77 K/9, while my system would expect him to put up 8.49 K/9. That 1.72 K/9 underperformance is still in the bottom 10% of pitcher seasons between 2002 and 2009.

The D-backs recently traded for Jackson, giving up Max Scherzer as part of the deal. Scherzer projected for 7.0K/9 in his first full major league season, but in fact struck out 9.2 per 9 innings.

I'm not saying that my work here proves anything, but I'm not sure why the Diamondbacks are giving up a young pitcher with one solid season (and a solid K/9 overachievement) for a young pitcher with one slightly better season, and several years' worth of consistently recording fewer strikeouts than expected. Hopefully, they are not expecting Edwin Jackson to start striking out a lot more hitters next year, since he has a multiyear history of underperforming those expectations. On the other hand, there is a possibility that Scherzer can keep his strikeout totals above expectation (and well above the league average).

Of course strikeouts are not everything, but they are pretty important for predicting pitcher longevity and long-term success. Bill James wrote a long time ago about how almost no pitchers have long careers without early-career strikeout totals that are above league average. I have never seen anyone try to prove the contrary. I have not done any multi-year value studies of pitchers, so perhaps when I do, there will be something else to say.

In the meantime, I am close to publishing some results for my overall value and predictions. Once I have a system that I'm comfortable with, and one which does well in historical tests against PECOTA & CHONE, I will write about it. Also, I should probably publish 2010 projections at some point. If I wait until I make all the improvements that I can, I will never publish anything!

Updated data for K9 projections


I've been in Thailand & Australia these last couple weeks, so the update has taken longer than I expected.

I re-ran my scraping of FanGraphs to get pitch data (average velocity & percentage thrown for every pitch) for all pitchers active during 2002-2009 (before, I was missing many 2009 seasons, and was systematically missing some others). I also got updated biographical information (height & weight) that matches what can be found on MLB.com. So the data set is now complete (with a few very small exceptions).

Again, my idea was to see how well we can predict a pitcher's K9 (strikeouts per 9 innings) rate, given only what pitches he throws, and some biographical data like height, weight, age and handedness. I cheat a little by also using his IP in the model, although this is just to separate the starters from the relievers (we would expect starters and relievers, especially left-handed relievers, to have different K9 rates with the same pitch arsenal).

In short, running with the updated data led to almost no changes in my ability to project K9 rates from this non-performance data. The correlation between projected and actual rates is still right at 0.6.

The graph is above. The listing of pitcher seasons (with 60IP+), fastball data, and actual & projected strikeout rates is available here.

The new formula still breaks the pitchers into 3 categories, by average fastball speed, and whether they are starters or relievers. For those who care, here is the new formula:

Overall:

FB_fg_vel <= 91.55 :
| IP Start <= 5.5 : LM1 (757)
| IP Start > 5.5 : LM2 (1353)
FB_fg_vel > 91.55 : LM3 (1091)

Where LM1, LM2 and LM3 are linear models that break down as follows:

LM num: 1
SO9 =
0.6879 * THROWS=L
- 0.0009 * HEIGHT_OVER_6ft
+ 0.0091 * WEIGHT
- 0.0293 * AGE
- 0.0001 * IP Start
- 0.0238 * FB_fg_per
+ 0.023 * SL_fg_per
- 0.0002 * CT_fg_per
+ 0.0215 * CB_fg_per
+ 0.0181 * CH_fg_per
+ 0.049 * SF_fg_per
+ 0.3019 * FB_fg_vel
- 0.0001 * SL_fg_vel
- 0.1251 * CH_fg_vel
- 0.559 * rep_depth
- 0.2189 * rep_offerings
- 8.7262

LM num: 2
SO9 =
0.611 * THROWS=L
- 0.0006 * HEIGHT_OVER_6ft
+ 0.0001 * WEIGHT
+ 0.0002 * AGE
+ 0.0012 * IP Start
- 0.0488 * FB_fg_per
+ 0.0088 * SL_fg_per
- 0.0179 * CT_fg_per
+ 0.0159 * CB_fg_per
- 0.0189 * CH_fg_per
+ 0.0003 * SF_fg_per
+ 0.2786 * FB_fg_vel
- 0.0001 * SL_fg_vel
- 0.0144 * CB_fg_vel
- 0.0282 * CH_fg_vel
+ 0.009 * SF_fg_vel
- 0.2939 * rep_depth
- 0.1174 * rep_offerings
- 12.5946

LM num: 3
SO9 =
1.3079 * THROWS=L
- 0.0976 * HEIGHT_OVER_6ft
+ 0 * WEIGHT
+ 0.0289 * AGE
- 0.0013 * IP Start
- 0.0773 * FB_fg_per
- 0.0314 * CT_fg_per
- 0.0255 * CH_fg_per
+ 0.0164 * SF_fg_per
+ 0.5636 * FB_fg_vel
- 0.0002 * SL_fg_vel
+ 0.0091 * CB_fg_vel
- 0.0004 * CH_fg_vel
- 0.3258 * rep_depth
- 0.2241 * rep_offerings
- 39.442
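For anyone who wants to plug numbers into the listing above, here is a rough sketch of how the tree and its three linear models could be evaluated. The coefficients are transcribed from the listing. Everything else is an assumption for illustration: the function names, passing features as a dict keyed by the names in the listing, encoding THROWS=L as 1.0 or 0.0, and treating any missing feature as 0.0. The listing does not say what units the inputs use or how a pitch that a pitcher never throws was handled.

```python
# (intercept, coefficients) per linear model, transcribed from the listing.
# Features that a model does not list are absent here and contribute nothing.
MODELS = {
    "LM1": (-8.7262, {
        "THROWS=L": 0.6879, "HEIGHT_OVER_6ft": -0.0009, "WEIGHT": 0.0091,
        "AGE": -0.0293, "IP Start": -0.0001, "FB_fg_per": -0.0238,
        "SL_fg_per": 0.023, "CT_fg_per": -0.0002, "CB_fg_per": 0.0215,
        "CH_fg_per": 0.0181, "SF_fg_per": 0.049, "FB_fg_vel": 0.3019,
        "SL_fg_vel": -0.0001, "CH_fg_vel": -0.1251,
        "rep_depth": -0.559, "rep_offerings": -0.2189,
    }),
    "LM2": (-12.5946, {
        "THROWS=L": 0.611, "HEIGHT_OVER_6ft": -0.0006, "WEIGHT": 0.0001,
        "AGE": 0.0002, "IP Start": 0.0012, "FB_fg_per": -0.0488,
        "SL_fg_per": 0.0088, "CT_fg_per": -0.0179, "CB_fg_per": 0.0159,
        "CH_fg_per": -0.0189, "SF_fg_per": 0.0003, "FB_fg_vel": 0.2786,
        "SL_fg_vel": -0.0001, "CB_fg_vel": -0.0144, "CH_fg_vel": -0.0282,
        "SF_fg_vel": 0.009, "rep_depth": -0.2939, "rep_offerings": -0.1174,
    }),
    "LM3": (-39.442, {
        "THROWS=L": 1.3079, "HEIGHT_OVER_6ft": -0.0976, "WEIGHT": 0.0,
        "AGE": 0.0289, "IP Start": -0.0013, "FB_fg_per": -0.0773,
        "CT_fg_per": -0.0314, "CH_fg_per": -0.0255, "SF_fg_per": 0.0164,
        "FB_fg_vel": 0.5636, "SL_fg_vel": -0.0002, "CB_fg_vel": 0.0091,
        "CH_fg_vel": -0.0004, "rep_depth": -0.3258, "rep_offerings": -0.2241,
    }),
}

def choose_model(features):
    """Follow the tree: split on fastball velocity, then on IP Start."""
    if features.get("FB_fg_vel", 0.0) > 91.55:
        return "LM3"
    return "LM1" if features.get("IP Start", 0.0) <= 5.5 else "LM2"

def predict_so9(features):
    """Projected SO9 from the linear model the tree selects.

    Assumption: features missing from the dict count as 0.0.
    """
    intercept, coefs = MODELS[choose_model(features)]
    return intercept + sum(c * features.get(name, 0.0) for name, c in coefs.items())
```

Note that a result is only meaningful when every feature in the selected model is supplied; the missing-as-zero default is just there to keep the sketch short.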

The difference between the new and old models is trivial. Overall, I think I get something like a 2% decrease in root mean squared error, which is not significant.

I'm just including this information for completeness.