My writings about baseball, with a strong statistical & machine learning slant.

Monday, December 28, 2009

An obvious idea about IP distributions

Having finished "The Girl With the Dragon Tattoo," last week, and having no more pleasure reading left, I resorted to reading a short statistics text that I brought along for my trip. I studied various maths for many years, but my formal knowledge of statistics is limited to two semesters as an undergrad, as well as some physics labs, which I did not fully appreciate at the time. Nowadays, when working with baseball data (sometimes using Excel), I wish that I had kept my notes from experimental physics lab ten years ago!

In any case, early in the statistics book I stumbled on a simple comment in relation to non-unimodal distributions: the author casually noted that most of these distributions are actually distributions of sums of multiple random variables. Well, pitcher IP (innings pitched) is not really a random variable with a (skewed) normal distribution, but it can well be described as the sum of two other variables: starting innings pitched and relief innings pitched. Of course, a pitcher's starting innings and relief innings are not unrelated. However, in today's game, these can well be treated as separate variables. Except for playoff situations, pitchers are always either starters or relievers (or not on the active roster) at a particular stretch of the season. The Yankees' Joba Chamberlain & Phil Hughes drama from this season would be a recent example of this property. So both IP_start and IP_relief may reasonably be predicted based on a pitcher's time spent in each role, as well as his success, durability, manager's preference, etc. Unfortunately, as of now, there is no way to see how many games a pitcher was available as a starter or reliever throughout the season. But conceptually, this idea is workable in today's game.

Each of those distributions (IP_start and IP_relief) is itself bimodal, with peaks at 0 IP, and at around 200 IP for IP_start and around 65 IP for IP_relief. Still, these distributions are simpler than the trimodal distribution for total innings pitched.

Now that I think of it, I'm not sure why I didn't try to predict IP_start and IP_relief separately before. From the beginning, I thought about pitcher roles/usage in predicting IP, but I didn't think categorizing pitchers into starter & reliever would be useful, beyond what is already evident from previous years' data. I still think that was the right idea, but predicting IP_start and IP_relief is another idea that makes more sense. I am not chopping the pitchers into starters and relievers, but rather I am computing two aspects of a pitcher's ability to contribute toward his own usage. Both Mariano Rivera and Josh Towers can be predicted to have few IP_start, but for different reasons. Similarly, we will expect both CC Sabathia and Kei Igawa to have few IP_relief, but again for different reasons.

I have some preliminary tests showing that even a linear combination of features can predict IP_start with much better accuracy (ie correlation between predictions and actual values on a test set) than I am currently getting from the "one size fits all" IP model. Predictions for IP_relief are tougher, as would be expected (especially since I am including pitchers with a very low IP in my model, where variance is very high). Still, predicting IP_start very accurately will be useful. Also, the features used for IP_start and IP_relief prediction are different, so the combined model (ie summing IP_start and IP_relief predictions) will lead to a straightforward non-linear model for total IP, which could be much more predictive.
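A minimal sketch of the "sum of two models" idea, not the actual system: each role gets its own linear model over its own features, each prediction is floored at zero (an assumption, since innings cannot be negative, and also what makes the sum non-linear), and total IP is the sum. The feature choices and all weights below are invented purely for illustration.

```python
def predict_linear(features, weights, intercept):
    """Linear prediction, floored at 0 because innings cannot be negative."""
    raw = intercept + sum(f * w for f, w in zip(features, weights))
    return max(0.0, raw)

def predict_total_ip(start_features, relief_features, start_model, relief_model):
    """Predict IP_start and IP_relief separately, then sum them."""
    ip_start = predict_linear(start_features, *start_model)
    ip_relief = predict_linear(relief_features, *relief_model)
    return ip_start, ip_relief, ip_start + ip_relief

# Hypothetical models as (weights, intercept); the numbers are made up.
START_MODEL = ([0.6, 20.0], -10.0)   # [last year's IP_start, has-rotation-spot flag]
RELIEF_MODEL = ([0.5, 0.3], 5.0)     # [last year's IP_relief, last year's relief games]

# A pitcher who started all of last year and never relieved.
ip_start, ip_relief, total_ip = predict_total_ip(
    [180.0, 1.0], [0.0, 0.0], START_MODEL, RELIEF_MODEL
)
```

Because the two models see different features, a pitcher can be low on IP_start for reasons that have nothing to do with why he is low on IP_relief.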

I should have concrete results soon. Tomorrow, I am supposed to spend a day seeing the Great Ocean Road southwest of Melbourne. So maybe I'll have something to show the day after.

Sunday, December 27, 2009

Pitcher repertoire depth: a measure of "how many pitches" he throws.

In my K9 estimates, I use two statistics that I call "rep_depth" and "rep_offerings." Let me explain what this is all about.

Leading up to this year's NL Cy Young voting, I was reading some discussions about whether Tim Lincecum or Adam Wainwright has a deeper arsenal of pitches. The implied assumption seemed to be that throwing only two pitches regularly is a bad thing. Those taking part in the discussion threw around nuggets like "Lincecum throws 4 pitches at least 5% of the time" or other such ad hoc stats using arbitrary cutoffs. I thought I'd try to come up with something more systematic.

FWIW, Lincecum's and Wainwright's pitch frequency breakdowns can be seen on FanGraphs. It is clear to me that Wainwright has a deeper arsenal (although not by much). It's not clear how much this matters as far as effectiveness goes, but the point here is to apply numbers to a concept that people use, not to evaluate the utility of the concept itself.

Most pitchers throw anywhere from 1 to 6 pitches (with non-trivial frequency) according to FanGraphs pitch data. Someone who throws two pitches 50-50 is a two pitch pitcher. Easy enough. However, I think that most people would say that someone who throws two pitches at a 90-10 ratio is a one-pitch pitcher. Similarly for an 80-10-10 pitcher. However someone who throws three pitches in a 60-20-20 ratio is a three pitch pitcher. Also someone who throws two pitches at a 70-30 ratio is a two pitch pitcher, but still has a less balanced repertoire than the hypothetical 50-50 pitcher.

This brings to mind something involving the harmonic mean. The harmonic mean of 50 and 50 is 50. The harmonic mean of 70 and 30 is 42. The harmonic mean of 90 and 10 is 18. Of the three means (arithmetic, geometric and harmonic), the harmonic mean is always the smallest, and thus seems like a good candidate for "rewarding" pitchers with a highly balanced repertoire.
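The arithmetic above is easy to check with Python's standard `statistics` module. This is just a quick verification of the three examples, plus a comparison of the three means for the 70-30 case.

```python
from statistics import mean, geometric_mean, harmonic_mean

splits = [(50, 50), (70, 30), (90, 10)]

# Harmonic mean of each split: 50, 42 and 18 respectively.
harmonic = {split: round(harmonic_mean(split), 2) for split in splits}

# For 70-30: arithmetic mean 50, geometric mean about 45.8, harmonic mean 42.
three_means = (mean((70, 30)), geometric_mean((70, 30)), harmonic_mean((70, 30)))
```

The more lopsided the split, the further the harmonic mean drops below the arithmetic mean, which is the behavior wanted here.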

Conceptually, if a pitcher throws one pitch much more often than his other offerings, the hitter need only to look for that one pitch. However as a pitcher spreads his offerings more widely, a hitter has to look for several pitches, even those only thrown 10-15% of the time. Or at least, that's the idea.

Considering n pitches, the harmonic mean of their frequencies is expressed as:

$$H_n = \frac{n}{\sum_{m=1}^{n} 1/p_m}$$

where p_m is the frequency of pitch number m (with pitches ordered from most to least frequent), expressed as a probability (between 0 and 1). Therefore, the largest possible value for p_2 would be 0.5, the largest possible value for p_3 would be 0.33, and so forth.

In order to convert the harmonic mean back to a number between 1 and 6, we multiply the harmonic mean by n^2. The first time, we multiply by n in order to get a number that is at most 1.0. The second time, we multiply by n in order to get a number that is at most n. Thus, for a given n:

$$\text{rep\_depth}(n) = n^2 \cdot H_n = \frac{n^3}{\sum_{m=1}^{n} 1/p_m}$$

That's it. I compute this for each n and keep the largest value as the pitcher's rep_depth. I also remember this best value of n and call it the "repertoire offerings," as the measure of how many pitches are in the pitcher's repertoire (ie how many pitches a hitter has to consider), expressed as a whole number.

For example, in 2009, Tim Lincecum ends up with a rep_depth of 2.34 off of 3 offerings. Adam Wainwright ends up with a rep_depth of 2.55 off of 3 offerings. Among well-known starting pitchers in 2009, Dan Haren had the highest rep_depth at 3.24 off of 4 offerings. As seen here, Haren threw four pitches at least 13% of the time each (fastball, cutter, curve and splitter). However that kind of depth is unusual. (Actually I just noticed that James Shields has an even higher rep_depth at 3.41. But you can argue that he is less famous than Dan Haren?)

I have included a full list of pitcher seasons (2002-2009) and repertoire depths here. The cutoff is for 60IP+, and a small amount of data from 2002 is missing, but otherwise this is a complete list.

Before I go on, I must mention that I made a small modification to the formula shown above. If we want to compute the rep_depth at 1 (ie just consider one pitch), it makes no sense to derive an answer other than 1.0. Also, if a pitcher throws his pitches at the rate of 45-45-5-5, he should end up with a rep_depth of 1.0 at 1 offering, and with a rep_depth of 2.0 at 2 offerings. Therefore, I normalize any set of pitches before computing the rep_depth. Thus, rep_depth is capped at 1.0 at 1 offering on the low side for everyone (for example, Mariano Rivera of the last few years).

Also, there were some nasty cases where a pitcher might end up with 1.002 rep_depth off 5 offerings. I think my formula is flawed in such cases, so I revert to 1.0 off of 1 offering, if my formula yields a rep_depth below 1.1. This is a hack, but it rarely comes into play, and I think it makes sense.
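Putting the pieces together, here is a minimal sketch of the computation as described in this post. It is a reconstruction under stated assumptions, not the original code: for each n it takes the n most-thrown pitches, renormalizes them to sum to 1, multiplies their harmonic mean by n^2, and keeps the n with the largest value (ties go to the smaller n). The function names and the `floor` parameter are made up for this example, and it assumes at least one pitch has a non-zero frequency.

```python
def depth_at(freqs, n):
    """n^2 times the harmonic mean of the n most-used pitches, renormalized to sum to 1."""
    top = sorted(freqs, reverse=True)[:n]
    total = sum(top)
    probs = [f / total for f in top]
    h_mean = n / sum(1.0 / p for p in probs)
    return n * n * h_mean

def repertoire(freqs, floor=1.1):
    """Return (rep_depth, rep_offerings) for a list of pitch frequencies."""
    used = [f for f in freqs if f > 0]
    best_n = max(range(1, len(used) + 1), key=lambda n: depth_at(used, n))
    best = depth_at(used, best_n)
    if best < floor:
        # The "hack": anything barely above 1.0 is treated as a one-pitch pitcher.
        return 1.0, 1
    return best, best_n

balanced_pair = repertoire([45, 45, 5, 5])   # 2.0 off 2 offerings
one_pitch = repertoire([90, 10])             # 1.0 off 1 offering
three_pitch = repertoire([60, 20, 20])       # about 2.31 off 3 offerings
uneven_pair = repertoire([70, 30])           # 1.68 off 2 offerings
```

Frequencies can be given as percentages or probabilities, since each subset is renormalized before the harmonic mean is taken.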

It is not immediately clear whether having a high repertoire depth is always a good thing. There are some pretty good pitchers who have low repertoire depths, and some mediocre pitchers (Adam Eaton of 2008, anyone?) who threw a lot of different pitches, and yet didn't do too well with any of them.

If we list only pitchers with 20+ VORP, there is no obvious pattern among the top performers, in regard to repertoire depth. Then again, the point here is to summarize pitch data, rather than to draw immediate conclusions about predictable performance.

The average rep_depth is right around 2.0, with average rep_offerings right around 3.0. So we can confidently say that your typical pitcher throws three pitches, but not with an even distribution. There is no inherent advantage to throwing more pitches, although there might be an advantage to throwing those same three pitches with a more even distribution. Power pitchers with great fastballs can often get away with throwing only two pitches (the other usually being a breaking pitch). Pitchers with lesser fastballs usually need a third offering, be that a cutter, splitter or changeup.

Going back to the K9 projections I wrote about previously, my formula punishes (in terms of an expected strikeout rate) high values for both rep_depth and rep_offerings. However, the system awards points for throwing particular pitches a high percentage of the time, namely fastballs and breaking pitches for power pitchers, and slow changeups for pitchers with slower fastballs. Makes sense to me. Universally, the system expects pitchers who throw lots of different pitches to have low strikeout rates, all other things being equal. This is a bit surprising, but not illogical. A pitcher with a great fastball (or cutter, in the case of Mariano Rivera) need not throw much else. These kinds of pitchers can be highly effective, and they record high strikeout rates. Pitchers with great primary weapons don't need more than one secondary offering. Although as the linked chart shows, many pitchers (especially older, more experienced pitchers) have had great seasons throwing a variety of pitches. So the tendency to have lower strikeout rates among high rep_depth pitchers might be a case of reverse causality. I'm not really sure.

Among pitcher seasons 2002-2009 (20IP cutoff), the rep_depth and rep_offerings can be bucketed as follows:


3181 elements, 10 buckets --> 318 target average
bucket 1 [1.000000, 1.321785] for 318 elements (1.121944 average)
bucket 2 [1.323485, 1.542642] for 318 elements (1.449034 average)
bucket 3 [1.543158, 1.688595] for 318 elements (1.621933 average)
bucket 4 [1.688927, 1.799741] for 318 elements (1.745801 average)
bucket 5 [1.800337, 1.906526] for 318 elements (1.855347 average)
bucket 6 [1.906688, 2.001738] for 318 elements (1.956757 average)
bucket 7 [2.001881, 2.180508] for 318 elements (2.092596 average)
bucket 8 [2.180602, 2.370741] for 318 elements (2.274430 average)
bucket 9 [2.371444, 2.650319] for 318 elements (2.501306 average)
bucket 10 [2.651505, 4.090890] for 319 elements (2.952992 average)

Created 10 buckets
6226.893148 / 3181 = 1.957527


bucket 1 [1.000000, 2.000000] for 979 elements (1.848825 average)
bucket 2 [3.000000, 3.000000] for 1132 elements (3.000000 average)
bucket 3 [4.000000, 4.000000] for 888 elements (4.000000 average)
bucket 4 [5.000000, 6.000000] for 182 elements (5.065934 average)

Created 4 buckets
9680.000000 / 3181 = 3.043068
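The listing does not say how the buckets were formed. The counts in the first listing (nine buckets of 318 and one of 319) are consistent with simply sorting the values and slicing them into equal-count groups, so here is a minimal sketch under that assumption. The function name is hypothetical, and this does not reproduce the second listing, where tied whole-number values are kept together.

```python
def equal_count_buckets(values, k):
    """Sort values and split them into k buckets of (nearly) equal size.

    Each of the first k-1 buckets gets len(values) // k elements;
    the last bucket absorbs the remainder.
    """
    ordered = sorted(values)
    size = len(ordered) // k
    buckets = [ordered[i * size:(i + 1) * size] for i in range(k - 1)]
    buckets.append(ordered[(k - 1) * size:])
    return buckets

# Stand-in data with the same element count as the listing (3181).
buckets = equal_count_buckets(list(range(3181)), 10)
sizes = [len(b) for b in buckets]
summary = [(b[0], b[-1], sum(b) / len(b)) for b in buckets]  # (min, max, average)
```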

Now, at least, it is possible to estimate the depth of a pitcher's repertoire using two numbers, which can be simply computed.

Notes on Rick Porcello & Edwin Jackson (a month later)

A month ago, I wrote about Chien-Ming Wang, Rick Porcello, and Edwin Jackson. These are all pitchers that have under-performed my K9 projections in the recent past. To be less narcissistic about it, they under-performed the K9 rates projected by a systematic approach that estimates K9 rates from pitch data and biographical information.

I pointed out that Porcello does not actually under perform his expected K9 by much, although his low projection itself is a concern.

However Wang and Edwin Jackson project to be at least league-average K9 pitchers, but have both consistently struck out many fewer batters than expected. From the table linked below, let's pull a list of starting pitchers (100IP+) who projected to have at least 6K/9, but significantly underperformed in reality (in order of under-performance):
  • Chien-Ming Wang 2006
  • Danny Graves 2003
  • Carlos Silva 2005
  • David Wells 2003 (actually not a bad season: 15-7 4.18 ERA for the Bombers, 4.6 WARP)
  • Fausto Carmona 2008
  • Jason Johnson 2005
  • Jorge Sosa 2005
  • Kyle Lohse 2005
  • Bronson Arroyo 2005 (last season in Boston, 1.9 WARP)
  • Edwin Jackson 2008
  • Jon Lieber 2004
  • Jorge Sosa 2003
  • Chien-Ming Wang 2005
  • Sidney Ponson 2004
  • Mike Hampton 2002 (his horrible last season for the Rockies)
  • Zach Duke 2007
  • Jeremy Guthrie 2009
  • Horacio Ramirez 2005
  • Hiroki Kuroda 2008
Admittedly, it is still hard to separate out K9 under-performance from just plain low K9 totals. However, by setting the cutoff at 100IP and an expected 6K/9, we are at least skipping over many marginal starters, and the Kirk Rueters & Turk Wendells of the world.

Still, being listed here as one of the top unexpectedly low K9 seasons in the last 10 years is not really a positive projection for a pitcher's future.

I think the Yankees are doing the right thing by not committing to Wang, but this comment might have been more meaningful if it had been made two years ago. At this point, he is "injury prone," rather than a surprisingly effective low-strikeout pitcher. The predictive power of pitcher injuries is another topic that I have been looking into. Maybe I'll have something on that front soon.

Edwin Jackson is another matter. He is just coming off a career season. After a disappointing development prior to 2009, he put up a solid 3.9 WARP in 2009, despite some late-season decline. However, he still continues to strike out many fewer batters than one would expect given his stuff (ie pitches and bio information). In 2009, he posted 6.77 K/9, while my system would expect him to put up 8.49 K/9. That 1.72 K/9 under-performance is still in the bottom 10% of pitcher seasons from 2002-2009.

The D-backs recently traded for Jackson, giving up Max Scherzer as part of the deal. Scherzer projected for 7.0K/9 in his first full major league season, but in fact struck out 9.2 per 9 innings.

I'm not saying that my work here proves anything, but I'm not sure why the Diamondbacks are giving up a young pitcher with one solid season (and a solid K/9 over-achievement) for a young pitcher with one slightly better season, and several years' worth of consistently recording fewer strikeouts than expected. Hopefully, they are not expecting Edwin Jackson to start striking out a lot more hitters next year, since he has a multiyear history of under-performing those expectations. On the other hand, there is a possibility that Scherzer can keep his strikeout totals above expectation (and well above the league average).

Of course strikeouts are not everything, but they are pretty important for predicting pitcher longevity and long-term success. Bill James wrote a long time ago about how almost no pitchers have long careers without early-career strikeout totals that are above league average. I have never seen anyone try to prove the contrary. I have not done any multi-year value studies of pitchers, so perhaps when I do, there will be something else to say.

In the meantime, I am close to publishing some results for my overall value and predictions. Once I have a system that I'm comfortable with, and one which does well in historical tests against PECOTA & CHONE, I will write about it. Also, I should probably publish 2010 projections at some point. If I wait until I make all the improvements that I can, I will never publish anything!

Updated data for K9 projections

I've been in Thailand & Australia these last couple weeks, so the update has taken longer than I expected.

I re-ran my scraping of FanGraphs, to get pitch data (average velocity & percentage thrown for every pitch) for all pitchers, active during 2002-2009 (before, I was missing many 2009 seasons, and was systematically missing some others). I also got updated biographical information (height & weight) that matches what can be found on So the data set is now complete (with a few very small exceptions).

Again, my idea was to see how well we can predict a pitcher's K9 (strikeouts per 9 innings) rates, given only what pitches he throws, and some biographical data like height, weight, age and handedness. I cheat a little by also using his IP in the model, although this is just to separate the starters from the relievers (we would expect starters and relievers, especially left-handed relievers, to have different K9 rates with the same pitch arsenal).

In short, running with the updated data led to almost no changes in my ability to project K9 rates from this non-performance data. The correlation between projected and actual rates is still right at 0.6.

The graph is above. The listing of pitcher seasons (with 60IP+), fastball data, and actual & projected strikeout rates is available here.

The new formula still breaks the pitchers into 3 categories, by average fastball speed, and whether they are starters or relievers. For those who care, here is the new formula:


FB_fg_vel <= 91.55 :
| IP Start <= 5.5 : LM1 (757)
| IP Start > 5.5 : LM2 (1353)
FB_fg_vel > 91.55 : LM3 (1091)

Where LM1, LM2 and LM3 are linear models that break down as follows:

LM num: 1
SO9 =
0.6879 * THROWS=L
- 0.0009 * HEIGHT_OVER_6ft
+ 0.0091 * WEIGHT
- 0.0293 * AGE
- 0.0001 * IP Start
- 0.0238 * FB_fg_per
+ 0.023 * SL_fg_per
- 0.0002 * CT_fg_per
+ 0.0215 * CB_fg_per
+ 0.0181 * CH_fg_per
+ 0.049 * SF_fg_per
+ 0.3019 * FB_fg_vel
- 0.0001 * SL_fg_vel
- 0.1251 * CH_fg_vel
- 0.559 * rep_depth
- 0.2189 * rep_offerings
- 8.7262

LM num: 2
SO9 =
0.611 * THROWS=L
- 0.0006 * HEIGHT_OVER_6ft
+ 0.0001 * WEIGHT
+ 0.0002 * AGE
+ 0.0012 * IP Start
- 0.0488 * FB_fg_per
+ 0.0088 * SL_fg_per
- 0.0179 * CT_fg_per
+ 0.0159 * CB_fg_per
- 0.0189 * CH_fg_per
+ 0.0003 * SF_fg_per
+ 0.2786 * FB_fg_vel
- 0.0001 * SL_fg_vel
- 0.0144 * CB_fg_vel
- 0.0282 * CH_fg_vel
+ 0.009 * SF_fg_vel
- 0.2939 * rep_depth
- 0.1174 * rep_offerings
- 12.5946

LM num: 3
SO9 =
1.3079 * THROWS=L
- 0.0976 * HEIGHT_OVER_6ft
+ 0 * WEIGHT
+ 0.0289 * AGE
- 0.0013 * IP Start
- 0.0773 * FB_fg_per
- 0.0314 * CT_fg_per
- 0.0255 * CH_fg_per
+ 0.0164 * SF_fg_per
+ 0.5636 * FB_fg_vel
- 0.0002 * SL_fg_vel
+ 0.0091 * CB_fg_vel
- 0.0004 * CH_fg_vel
- 0.3258 * rep_depth
- 0.2241 * rep_offerings
- 39.442

The difference between the new and old models is trivial. Overall, I think I get something like a 2% decrease in root mean squared error, which is not significant.

I'm just including this information for completeness.
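To make the listing above concrete, here is a minimal sketch of how such a model tree can be evaluated: first pick a leaf from the two split points, then apply that leaf's linear model. The split points and LM2 coefficients are copied from the listing. Everything else is an assumption for illustration: the function names, the 0/1 encoding of THROWS=L, treating features that are not supplied as 0, and the example pitcher, who is invented.

```python
FB_VEL_SPLIT = 91.55   # mph, from the tree above
IP_START_SPLIT = 5.5   # innings as a starter, from the tree above

LM2_INTERCEPT = -12.5946
LM2_COEFS = {
    "THROWS=L": 0.611, "HEIGHT_OVER_6ft": -0.0006, "WEIGHT": 0.0001,
    "AGE": 0.0002, "IP Start": 0.0012,
    "FB_fg_per": -0.0488, "SL_fg_per": 0.0088, "CT_fg_per": -0.0179,
    "CB_fg_per": 0.0159, "CH_fg_per": -0.0189, "SF_fg_per": 0.0003,
    "FB_fg_vel": 0.2786, "SL_fg_vel": -0.0001, "CB_fg_vel": -0.0144,
    "CH_fg_vel": -0.0282, "SF_fg_vel": 0.009,
    "rep_depth": -0.2939, "rep_offerings": -0.1174,
}

def choose_model(fb_vel, ip_start):
    """Walk the two-level tree and return the name of the leaf model."""
    if fb_vel <= FB_VEL_SPLIT:
        return "LM1" if ip_start <= IP_START_SPLIT else "LM2"
    return "LM3"

def predict_so9(features, coefs, intercept):
    """Intercept plus coefficient * value; features not supplied count as 0."""
    return intercept + sum(c * features.get(name, 0.0) for name, c in coefs.items())

# Invented right-handed starter with a below-average fastball.
pitcher = {
    "THROWS=L": 0, "AGE": 27, "WEIGHT": 200, "IP Start": 180.0,
    "FB_fg_per": 60.0, "SL_fg_per": 15.0, "CB_fg_per": 10.0, "CH_fg_per": 15.0,
    "FB_fg_vel": 90.0, "SL_fg_vel": 83.0, "CB_fg_vel": 76.0, "CH_fg_vel": 82.0,
    "rep_depth": 2.3, "rep_offerings": 4,
}
leaf = choose_model(pitcher["FB_fg_vel"], pitcher["IP Start"])
projected_so9 = predict_so9(pitcher, LM2_COEFS, LM2_INTERCEPT)
```

How the original system encodes pitches that a pitcher never throws is not stated in the post, so the zero default here is only a placeholder.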

Saturday, November 28, 2009

Chien-Ming Wang, Edwin Jackson, Rick Porcello and projected K/9 rates

For all of the pitchers projected in the graph below (see previous post), I have summarized their data in a spreadsheet here. The data is sorted by "SO9_proj_error" which is just a measure of how much my projected K/9 value over (or under) estimated the actual strikeout rate.

The top rows of the chart show all the pitcher seasons for which my system had the largest under-estimates of strikeout rates. Two trends stand out:
  • Several pitchers appear on the top of this list multiple times (Eric Gagne, B.J. Ryan, Pedro Martinez, Ugueth Urbina, Rich Harden, Mark Prior)
  • Most of these pitchers were dominant pitchers (at least for a few seasons).
  • Octavio Dotel is an interesting exception.
This seems to suggest that over-achieving one's projected K/9 rate is a repeatable skill, and that this might be a valuable skill. It's not entirely clear how much this trend is separate from just the trend of high strikeout pitchers being very effective, however. Let's just leave it at that.

Now take a look at the bottom rows of the chart. Again, there is a lot of repetition. We see the following starting pitchers several times at 2+ strikeout rate underachievement:
  • Kirk Rueter, Jorge Sosa, Carlos Silva, Aaron Cook, and Chien-Ming Wang
  • Edwin Jackson's K/9 underachievements between 2007 and 2009 were 0.85, 2.79 and 1.97.
Also, there are not many top-end pitchers that show up on the bottom of this list (the 2003 version of CC Sabathia doesn't count). Some good pitcher seasons, but none by pitchers that were dominant for a significant stretch of time. Again, it is difficult to completely separate this effect from just the negative effect of having low strikeout totals. However, some pitchers seem to possess the "skill" of consistently underachieving their projected K/9 totals. This is not one of the habits of highly successful pitchers.

To me, this suggests (though certainly doesn't prove) that Chien-Ming Wang and Aaron Cook should not expect to start striking out more batters, nor should they expect to become top-end starters. Carlos Silva, Kirk Rueter and Jorge Sosa should be good "comps" for these pitchers. Similarly, if Edwin Jackson is not able to get 7-8 K/9 next year, I would bet that he never will.

Projecting from their fastball speed and selection of pitches, Wang and Jackson should both be at least league-average K/9 pitchers (7 K/9). Cook projects a little less so at 6 K/9. However none of them are near their projected number, after several full seasons in the majors.

Now, let's consider Rick Porcello, the Tigers rookie who has also come up in discussions about the future of low-strikeout pitchers. His K/9 average for 2009 was only 4.69, yet he is nowhere near the bottom of my chart. Based on his pitch data, he only projects to have averaged 5.54 K/9 in 2009. He underachieved his K/9 projection by 0.85. There are plenty of highly successful pitchers who have had similar figures (Mark Buehrle, CC Sabathia, etc).

Why is Porcello's projected K/9 rate so low? He throws a league-average fastball (90.7 mph) and yet he throws it 77.1% of the time. According to LM2 (the linear model for starting pitchers with league-average or below fastballs from my previous post), these are the dominant terms that contribute toward his low K/9 estimate. Let me reproduce the factors for LM2 below. I have skipped the terms that don't have a significant effect on the final value:

LM num: 2
SO9 =
- 0.0447 * FB_fg_per
+ 0.0083 * SL_fg_per
- 0.0159 * CT_fg_per
+ 0.0153 * CB_fg_per
- 0.0185 * CH_fg_per
+ 0.2423 * FB_fg_vel
- 0.2151 * rep_depth
- 0.1346 * rep_offerings

The "FB_fg_per" term corresponds to the percent of the time that Porcello throws fastballs. Multiplying 77.1 * -0.0447 = -3.45. At 55% fastballs, the same term would be 55 * -0.0447 = -2.46, so if Porcello decreases the share of fastballs that he throws to 55%, he will see this negative term shrink by about 1.0. In other words, the model suggests that by throwing a bit under a third fewer fastballs, Porcello can add about 1.0 K/9 to his projected total. The model favors curve balls and sliders for LM2 pitchers, so committing those "extra" 20% or so of pitches toward breaking balls can push Porcello's expected K/9 rate to roughly the league average of 7.
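The arithmetic can be spelled out in a few lines. This sketch uses the LM2 coefficients listed above and Porcello's 77.1% fastball share and 5.54 projection from this post. Sending all of the freed-up pitches to a single breaking ball is an assumption made for illustration, and the sketch ignores any knock-on change in rep_depth or rep_offerings.

```python
FB_COEF = -0.0447   # FB_fg_per coefficient in LM2
SL_COEF = 0.0083    # SL_fg_per coefficient in LM2
CB_COEF = 0.0153    # CB_fg_per coefficient in LM2

current_fb_pct = 77.1
target_fb_pct = 55.0
projected_k9 = 5.54

# Change in the fastball term alone: about +0.99 K/9.
fb_gain = FB_COEF * target_fb_pct - FB_COEF * current_fb_pct

# The 22.1 points of freed-up pitches, sent entirely to one breaking ball.
freed_pct = current_fb_pct - target_fb_pct
gain_if_all_sliders = fb_gain + SL_COEF * freed_pct
gain_if_all_curves = fb_gain + CB_COEF * freed_pct

new_k9_range = (projected_k9 + gain_if_all_sliders, projected_k9 + gain_if_all_curves)
```

Under these assumptions the projection moves from 5.54 to somewhere between about 6.7 and 6.9 K/9, which is close to the league average of 7.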

Therefore, it seems plausible that Porcello can become a league-average strikeout pitcher by learning to throw more breaking balls. He can expect to achieve solid strikeout rates without developing a harder fastball.

I am not an expert on pitcher development, but I see no reason why this should not happen. Porcello is only 20 years old. He already throws sliders, change ups and curve balls over 5% of the time each. I don't know how scouts rate his secondary offerings, but I don't see why at least one of these can't develop into a good secondary pitch that Porcello can throw 25% of the time.

Porcello did not post a high strikeout rate in the minors, either, but was able to find success. Therefore, he already has the skills to pitch well without striking out many men. If he improves his strikeout rate, and there is a plausible reason that he can, Porcello has a chance to become a dominant pitcher.

Contrast his career path with that of Chien-Ming Wang. He struck out lots of guys in the minors, and has the fastball to strike out guys in the majors. But for several years running now, he does not. That is unlikely to change.

Lastly, let's come back to Porcello's teammate, Edwin Jackson. Jackson has a 93.9 mph fastball. However, he only struck out about 7.7 per 9 innings in the minors, and has had even lower totals in the majors (about 6.3 K/9). His strikeout totals are higher than Porcello's now, but there is no reason to think that they will increase, after being lower than expected for several years. I'm not saying that this makes Porcello a more valuable pitcher than Edwin Jackson, but it is something for the Tigers to consider.

What 0.6 correlation looks like

Data points represent all* MLB pitcher seasons 2002-2009 with 60+IP.

* Actually I excluded a few cases as a result of pulling this data from the dataset of my projection system. So I'm missing all 2009 rookie seasons (except Rick Porcello, who I inserted manually). Also I'm missing 2002-2004 data for all pitchers who were not active during 2005-2009. I should have the full data at some point, but I don't think these omissions matter much, anyway.

Thursday, November 26, 2009

Predicting K/9 rates from pitch data

As some of you know, I've been working on a machine learning (ML) system for predicting pitcher IP and VORP (cumulative season value) for the last couple months. I'll be writing more about that later. In the process of building this system, I took a look at pitch data available on FanGraphs. I ended up pursuing some tangents that might be interesting on their own.

I came up with an idea to compute how many pitches a pitcher throws. Actually, I output a pair which I call "repertoire depth," and "repertoire offerings." I will write more about that later, also.

This all led me to thinking: can we estimate a pitcher's K/9 (strikeouts per 9 innings) rate from just his pitch data? From FanGraphs, we can learn how fast his average fastball, slider, etc registers on the gun, and how often he throws each pitch.

Why would you want to estimate a pitcher's K/9 from these values? We can get a better estimate by looking at previous years' K/9 rates, although I'm not sure how well these correlate year to year. However, what if we are looking at a minor league player, or another prospect? Everyone strikes out more guys in the minors than they do in the majors. In any case, let's take a look at the model that I came up with.

As input, I used pitch velocities and percentages thrown from FanGraphs for all pitches that FanGraphs tracks. Also, I added the pitcher's age, handedness, height, and weight as features. The data is for 2002-2009, so I also allowed the year to be a feature (in case there is a significant year-to-year rise in K/9 rates beyond pitchers maximizing other factors). Lastly, I used IP_start (innings pitched as a starter) as a feature. My goal was to stay away from all performance-based stats, but as you will see, IP_Start is used only to separate starters & relievers.

Here is the model overview:

FB_fg_vel <= 91.15 :
| IP Start <= 4.15 : LM1 (755)
| IP Start > 4.15 : LM2 (1278)
FB_fg_vel > 91.15 : LM3 (1284)

To put this in English, the algorithm that I chose to use (Weka's M5P tree algorithm) breaks the data into three sections, each described by a linear model.
  • LM1 (755 members) represents all pitchers whose average fastball is no faster than 91.15 mph, and who threw no more than 4.15 innings as starters. In other words, relievers who don't throw faster than average.
  • LM2 (1278 members) represents all pitchers who also throw average or below-average fastballs, but who threw more than 4.15 innings as starters (in practice, at least one start).
  • LM3 (1284 members) represents all pitchers whose average fastball is faster than 91.15 mph (starters and relievers alike).
Here are the models. I'm including all features for completeness, but I will show which features are really driving the model.

Key: FB = fastball, SL = slider, CT = cutter, CB = curve, CH = change, SF = splitter, _per = % thrown (on 0-100 scale), _vel = velocity in mph.

Also, rep_depth and rep_offerings are my measures of the depth of a pitcher's repertoire. I'll write about those in a subsequent post. Just think of them as values between 1 and 4, centered at 2, that estimate how many different pitch types a pitcher throws.

LM num: 1
SO9 =
0.6477 * THROWS=L
- 0.0002 * HEIGHT_OVER_6ft
+ 0.0063 * WEIGHT
- 0.0409 * AGE
- 0.0001 * IP Start
- 0.0013 * FB_fg_per
+ 0.0568 * SL_fg_per
+ 0.0278 * CT_fg_per
+ 0.0505 * CB_fg_per
+ 0.0642 * CH_fg_per
+ 0.0778 * SF_fg_per
+ 0.3189 * FB_fg_vel
- 0.0001 * SL_fg_vel
- 0.1237 * CH_fg_vel
- 0.9693 * rep_depth
- 0.0054 * rep_offerings
- 12.0036

LM num: 2
SO9 =
0.5443 * THROWS=L
- 0.0002 * HEIGHT_OVER_6ft
+ 0.0001 * AGE
+ 0.0016 * IP Start
- 0.0447 * FB_fg_per
+ 0.0083 * SL_fg_per
- 0.0159 * CT_fg_per
+ 0.0153 * CB_fg_per
- 0.0185 * CH_fg_per
+ 0 * SF_fg_per
+ 0.2423 * FB_fg_vel
- 0.0001 * SL_fg_vel
- 0.0097 * CB_fg_vel
- 0.0286 * CH_fg_vel
- 0.2151 * rep_depth
- 0.1346 * rep_offerings
- 9.4151

LM num: 3
SO9 =
1.2297 * THROWS=L
- 0.0839 * HEIGHT_OVER_6ft
+ 0.0294 * AGE
- 0 * IP Start
- 0.2967 * FB_fg_per
- 0.222 * SL_fg_per
- 0.2528 * CT_fg_per
- 0.2317 * CB_fg_per
- 0.249 * CH_fg_per
- 0.2105 * SF_fg_per
+ 0.5642 * FB_fg_vel
- 0.0001 * SL_fg_vel
+ 0.0098 * CB_fg_vel
- 0.0003 * CH_fg_vel
- 0.3879 * rep_depth
- 0.1807 * rep_offerings
- 17.5718

This may look like a lot of features, but really we can summarize what features are driving each of the three models pretty simply.

First of all, let's note which features are not significant factors for predicting K/9 rates. Height, weight, age, and IP_Start (apart from its role in splitting pitchers among the models) are not significant considerations in any of the three models. The year of the data does not show up at all, so it is also not significant.

What features are significant across the board? Fastball velocity, handedness (bonus for lefties), and repertoire depth are the most significant for each of the models. Apart from that, the three models give differing values for throwing relatively many sliders, curves, cutters, splitters and change ups. The models seem to suggest that some pitch types are better for guys with slower fastballs, and that some pitch types are better for guys who throw smoke. Don't read too much into what values are positive and which are negative for the _per values, since the percentages for the six pitches under consideration should sum to 100 (if we ignore knuckleballs, pitchouts and unrecognized pitches). The relative weights for those _per values are what matter.
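Here is a small demonstration of why only the relative _per weights matter. If the six percentages always sum to 100, adding the same constant c to every _per coefficient and subtracting 100 * c from the intercept gives exactly the same prediction. The sketch uses the _per coefficients and intercept from LM2 above and leaves out every other term. The pitch mix is invented, and the shift is chosen so that the fastball coefficient becomes zero.

```python
PER_COEFS = {"FB": -0.0447, "SL": 0.0083, "CT": -0.0159,
             "CB": 0.0153, "CH": -0.0185, "SF": 0.0}
INTERCEPT = -9.4151

def per_terms(mix, coefs, intercept):
    """Intercept plus the percentage-thrown terms only (other features omitted)."""
    return intercept + sum(coefs[pitch] * pct for pitch, pct in mix.items())

# Invented pitch mix that sums to 100.
mix = {"FB": 60.0, "SL": 15.0, "CT": 5.0, "CB": 10.0, "CH": 8.0, "SF": 2.0}

shift = 0.0447  # makes the fastball coefficient exactly zero
shifted_coefs = {pitch: c + shift for pitch, c in PER_COEFS.items()}
shifted_intercept = INTERCEPT - 100.0 * shift

original = per_terms(mix, PER_COEFS, INTERCEPT)
shifted = per_terms(mix, shifted_coefs, shifted_intercept)
```

In the shifted version every secondary pitch looks "positive" relative to the fastball, yet the predictions are identical. This is why the sign of an individual _per coefficient should not be over-interpreted.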

Before I lose everyone who has made it this far, let's think again about what all this means.

If you look at LM3 (pitchers whose average fastball is faster than 91.15 mph), the only significant factors contributing toward an increased strikeout rate are increased velocity, and being left handed. Increasing the number of different pitches thrown, or favoring certain pitches, will not suggest a higher strikeout rate, according to this model. Does this suggest that hard throwers (especially lefties) should not worry about developing secondary offerings? What does this have to say about the future of hard-throwing, low-strikeout guys like Chien-Ming Wang? I don't know, but these are interesting questions.

I'll be following up soon by listing some examples of real and expected K/9 rates for select pitchers, derived from this model.

For those of you who were wondering how accurately my model predicts K/9 rates...

The correlation between my predicted value and the real K/9 rate is 0.6017. If I measure the correlation between real K/9 and the outputs from a model trained with 10-fold cross-validation (data is split into 10 sections and each is evaluated by a model trained on the other 9), the correlation goes down to 0.5755. So the model I showed above is over-trained, but by very little. Note that a correlation of about 0.6 corresponds to an r² of about 0.36, so the method explains a bit more than a third of the variance between different pitchers' strikeout rates. It cannot explain the rest using just pitch data and bio data.
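For anyone who wants to repeat this kind of check, here is a minimal sketch of both measurements: the correlation between in-sample predictions and actual values, and the correlation between actual values and out-of-fold predictions from 10-fold cross-validation. It is not the Weka run described above. It uses a one-feature least-squares line and synthetic data, and every name and number in it is made up, except the 0.6017 figure from this post.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one feature."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def out_of_fold_predictions(xs, ys, k=10):
    """Predict each point with a model trained on the other k-1 folds."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), k):
            preds[i] = slope * xs[i] + intercept
    return preds

# Synthetic "fastball velocity vs K/9" data, invented for this example.
rng = random.Random(0)
xs = [rng.gauss(91.0, 2.5) for _ in range(300)]
ys = [6.5 + 0.4 * (x - 91.0) + rng.gauss(0.0, 1.5) for x in xs]

slope, intercept = fit_line(xs, ys)
in_sample_r = pearson([slope * x + intercept for x in xs], ys)
cv_preds = out_of_fold_predictions(xs, ys, k=10)
cv_r = pearson(cv_preds, ys)

# Share of variance associated with a correlation of 0.6017: about 0.36.
share_explained = 0.6017 ** 2
```

A small gap between the in-sample and cross-validated correlations is the sign of mild over-training.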

Lastly, if you look at LM2 (guys who have average to below-average fastballs and started at least one game), you can see that secondary factors (beyond fastball speed and handedness) take more of a precedence. Throwing more curve balls, and throwing slower changeups have significant, if not dramatic, positive effects on strikeout rates. Also the benefits of being a lefty are less pronounced, as are the benefits of a faster fastball (relative to other below-average guys).

The latter suggests to me that the benefits of being a lefty, as well as the benefits of throwing really hard, grow non-linearly with the speed of the average fastball. I would expect this system to under-estimate the strikeout rates of very hard throwing pitchers. Indeed for Brad Lidge's and Eric Gagne's 14 K/9 seasons, my system suggests that they should have had only about 11 K/9 each time.

The system fails miserably for knuckle ball pitchers. At one point, I trained a system that split the pitchers into 5 categories (rather than the 3 here), and that system gave the knuckle ball pitchers their own category, but my current system is simpler, and has a higher overall correlation to the K/9 data. Sorry Mr. Wakefield.

Does the world need another baseball blog?

What is not started today is never finished tomorrow. -Goethe

My freshman year roommate (who fancied himself a great writer) always started his papers with a quote. I always thought he was a pretentious jerk. Yet even he didn't write a blog about his baseball analysis.

I've hesitated for some time about writing up my work, and putting any of it on the web. But it seems like the best way to publish some of my findings, so here goes. Hopefully some of what I've come up with will be of use to others.