My writings about baseball, with a strong statistical & machine learning slant.

Thursday, November 26, 2009

Predicting K/9 rates from pitch data

As some of you know, I've been working on a machine learning (ML) system for predicting pitcher IP and VORP (cumulative season value) for the last couple months. I'll be writing more about that later. In the process of building this system, I took a look at pitch data available on FanGraphs. I ended up pursuing some tangents that might be interesting on their own.

I came up with an idea for computing how many different pitches a pitcher throws. Actually, I output a pair of values, which I call "repertoire depth" and "repertoire offerings." I will write more about that later, too.

This all led me to thinking: can we estimate a pitcher's K/9 (strikeouts per 9 innings) rate from just his pitch data? From FanGraphs, we can learn how fast his average fastball, slider, etc. registers on the gun, and how often he throws each pitch.

Why would you want to estimate a pitcher's K/9 from these values? For an established major leaguer, we can get a better estimate by looking at previous years' K/9 rates, although I'm not sure how well these correlate year to year. However, what if we are looking at a minor league player, or another prospect? Everyone strikes out more guys in the minors than they do in the majors. In any case, let's take a look at the model that I came up with.

As input, I used pitch velocities and percentages thrown from FanGraphs for all pitches that FanGraphs tracks. Also, I added the pitcher's age, handedness, height, and weight as features. The data is for 2002-2009, so I also allowed the year to be a feature (in case there is a significant year-to-year rise in K/9 rates beyond pitchers maximizing other factors). Lastly, I used IP_Start (innings pitched as a starter) as a feature. My goal was to stay away from all performance-based stats, but as you will see, IP_Start is used only to separate starters & relievers.

Here is the model overview:

FB_fg_vel <= 91.15 :
| IP Start <= 4.15 : LM1 (755)
| IP Start > 4.15 : LM2 (1278)
FB_fg_vel > 91.15 : LM3 (1284)

To put this in English, the algorithm that I chose to use (Weka's M5P tree algorithm) breaks the data into three sections, each described by a linear model.
  • LM1 (755 members) represents all pitchers whose average fastball is slower than 91.15 mph, and who did not throw more than 4 innings as starters. In other words, relievers who don't throw faster than average.
  • LM2 (1278 members) represents all pitchers whose average fastball is also slower than 91.15 mph, but who threw more than 4 innings as starters. In other words, guys with at least one start.
  • LM3 (1284 members) represents all pitchers whose average fastball rates at least 91.2 mph (starters and relievers alike).
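
The three-way split above can be written as a small function. This is only a sketch: the thresholds are copied from the tree printed earlier, while the function name and the example pitchers are made up for illustration.

```python
# Thresholds copied from the model overview above.
FB_VEL_SPLIT = 91.15   # mph, average fastball velocity
IP_START_SPLIT = 4.15  # innings pitched as a starter


def choose_linear_model(fb_vel: float, ip_start: float) -> str:
    """Return which leaf model (LM1, LM2 or LM3) the tree assigns a pitcher to."""
    if fb_vel > FB_VEL_SPLIT:
        return "LM3"  # faster fastball, starters and relievers alike
    if ip_start <= IP_START_SPLIT:
        return "LM1"  # slower fastball, essentially pure relievers
    return "LM2"      # slower fastball, some innings as a starter


# Hypothetical pitchers, not real data.
print(choose_linear_model(fb_vel=89.0, ip_start=0.0))    # LM1
print(choose_linear_model(fb_vel=89.0, ip_start=150.0))  # LM2
print(choose_linear_model(fb_vel=95.0, ip_start=0.0))    # LM3
```

Note that IP Start only matters on the slower-fastball branch; above 91.15 mph it is never consulted.
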
Here are the models. I'm including all features for completeness, but I will show which features are really driving the model.

Key: FB = fastball, SL = slider, CT = cutter, CB = curve, CH = change, SF = splitter, _per = % thrown (on 0-100 scale), _vel = velocity in mph.

Also, rep_depth and rep_offerings are my measures of the depth of a pitcher's repertoire. I'll write about those in a subsequent post. Just think of them as values between 1 and 4, centered at 2, that estimate how many different pitch types a pitcher throws.

LM num: 1
SO9 =
0.6477 * THROWS=L
- 0.0002 * HEIGHT_OVER_6ft
+ 0.0063 * WEIGHT
- 0.0409 * AGE
- 0.0001 * IP Start
- 0.0013 * FB_fg_per
+ 0.0568 * SL_fg_per
+ 0.0278 * CT_fg_per
+ 0.0505 * CB_fg_per
+ 0.0642 * CH_fg_per
+ 0.0778 * SF_fg_per
+ 0.3189 * FB_fg_vel
- 0.0001 * SL_fg_vel
- 0.1237 * CH_fg_vel
- 0.9693 * rep_depth
- 0.0054 * rep_offerings
- 12.0036

LM num: 2
SO9 =
0.5443 * THROWS=L
- 0.0002 * HEIGHT_OVER_6ft
+ 0.0001 * AGE
+ 0.0016 * IP Start
- 0.0447 * FB_fg_per
+ 0.0083 * SL_fg_per
- 0.0159 * CT_fg_per
+ 0.0153 * CB_fg_per
- 0.0185 * CH_fg_per
+ 0 * SF_fg_per
+ 0.2423 * FB_fg_vel
- 0.0001 * SL_fg_vel
- 0.0097 * CB_fg_vel
- 0.0286 * CH_fg_vel
- 0.2151 * rep_depth
- 0.1346 * rep_offerings
- 9.4151

LM num: 3
SO9 =
1.2297 * THROWS=L
- 0.0839 * HEIGHT_OVER_6ft
+ 0.0294 * AGE
- 0 * IP Start
- 0.2967 * FB_fg_per
- 0.222 * SL_fg_per
- 0.2528 * CT_fg_per
- 0.2317 * CB_fg_per
- 0.249 * CH_fg_per
- 0.2105 * SF_fg_per
+ 0.5642 * FB_fg_vel
- 0.0001 * SL_fg_vel
+ 0.0098 * CB_fg_vel
- 0.0003 * CH_fg_vel
- 0.3879 * rep_depth
- 0.1807 * rep_offerings
- 17.5718
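
To make the listings concrete, here is a minimal sketch of how one of them turns into a K/9 estimate, using the LM3 coefficients exactly as printed. Several things are assumptions rather than facts from the model output: THROWS=L is taken as 1 for a lefty and 0 otherwise, HEIGHT_OVER_6ft is taken to be in inches, features a pitcher lacks are treated as 0, and the example pitcher is invented.

```python
# Coefficients copied from "LM num: 3" above.
LM3_INTERCEPT = -17.5718
LM3_COEFFS = {
    "THROWS=L": 1.2297,
    "HEIGHT_OVER_6ft": -0.0839,
    "AGE": 0.0294,
    "IP Start": 0.0,
    "FB_fg_per": -0.2967,
    "SL_fg_per": -0.222,
    "CT_fg_per": -0.2528,
    "CB_fg_per": -0.2317,
    "CH_fg_per": -0.249,
    "SF_fg_per": -0.2105,
    "FB_fg_vel": 0.5642,
    "SL_fg_vel": -0.0001,
    "CB_fg_vel": 0.0098,
    "CH_fg_vel": -0.0003,
    "rep_depth": -0.3879,
    "rep_offerings": -0.1807,
}


def predict_so9(features: dict) -> float:
    """Intercept plus the sum of coefficient * feature value.

    Features missing from the dict are treated as 0 (an assumption).
    """
    total = LM3_INTERCEPT
    for name, coeff in LM3_COEFFS.items():
        total += coeff * features.get(name, 0.0)
    return total


# Hypothetical right-hander: 95 mph fastball, fastball/slider mix.
righty = {
    "THROWS=L": 0, "HEIGHT_OVER_6ft": 2, "AGE": 28,
    "FB_fg_per": 70, "SL_fg_per": 30,
    "FB_fg_vel": 95, "SL_fg_vel": 85,
    "rep_depth": 2, "rep_offerings": 2,
}
lefty = dict(righty, **{"THROWS=L": 1})       # same pitcher, left-handed
harder = dict(righty, **{"FB_fg_vel": 96})    # same pitcher, 1 mph faster

print(round(predict_so9(righty), 2))
print(round(predict_so9(lefty) - predict_so9(righty), 4))   # lefty bonus
print(round(predict_so9(harder) - predict_so9(righty), 4))  # value of 1 mph
```

The two differences printed at the end are just the THROWS=L and FB_fg_vel coefficients, which is a handy way to read any of the three listings: each coefficient is the change in predicted K/9 for a one-unit change in that feature, with everything else held fixed.
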

This may look like a lot of features, but really we can summarize what features are driving each of the three models pretty simply.

First of all, let's note which features are not significant factors for predicting K/9 rates. Height, weight, age, and IP_Start (apart from splitting the data between the models) are not significant considerations in any of the three models. The year of the data does not show up at all, so it is also not significant.

What features are significant across the board? Fastball velocity, handedness (bonus for lefties), and repertoire depth are the most significant for each of the models. Apart from that, the three models give differing values for throwing relatively many sliders, curves, cutters, splitters and change ups. The models seem to suggest that some pitch types are better for guys with slower fastballs, and that some pitch types are better for guys who throw smoke. Don't read too much into what values are positive and which are negative for the _per values, since the percentages for the six pitches under consideration should sum to 100 (if we ignore knuckleballs, pitchouts and unrecognized pitches). The relative weights for those _per values are what matter.
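
The point about relative _per weights can be checked with a little arithmetic. If the six percentages sum to exactly 100 (an idealization, as noted above), then subtracting the fastball coefficient from every _per coefficient and folding 100 times that coefficient into the intercept leaves every prediction unchanged. A minimal sketch using the LM3 numbers; the pitch mix is hypothetical, and only the _per terms and the intercept are included.

```python
# _per coefficients and intercept copied from LM3 above.
LM3_PER = {
    "FB": -0.2967, "SL": -0.222, "CT": -0.2528,
    "CB": -0.2317, "CH": -0.249, "SF": -0.2105,
}
LM3_INTERCEPT = -17.5718

# Re-express every weight relative to the fastball weight.
baseline = LM3_PER["FB"]
relative = {pitch: round(c - baseline, 4) for pitch, c in LM3_PER.items()}
shifted_intercept = LM3_INTERCEPT + 100 * baseline


def per_part(coeffs: dict, intercept: float, mix: dict) -> float:
    """Intercept plus the contribution of the pitch-percentage terms only."""
    return intercept + sum(coeffs[p] * mix.get(p, 0.0) for p in coeffs)


mix = {"FB": 60.0, "SL": 25.0, "CH": 15.0}  # hypothetical mix, sums to 100
original = per_part(LM3_PER, LM3_INTERCEPT, mix)
rewritten = per_part(relative, shifted_intercept, mix)

print(relative)
print(abs(original - rewritten) < 1e-9)  # the two forms agree
```

Written this way, the LM3 weights for the secondary pitches all come out slightly positive relative to the fastball, between roughly 0.04 and 0.09 K/9 per percentage point, even though every raw _per coefficient in the listing is negative.
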

Before I lose everyone who has made it this far, let's think again about what all this means.

If you look at LM3 (pitchers who throw 91.2 mph+ on average fastballs), the only significant factors contributing toward an increased strikeout rate are increased velocity and being left-handed. Throwing more different pitches, or favoring certain pitches, does not suggest a higher strikeout rate, according to this model. Does this suggest that hard throwers (especially lefties) should not worry about developing secondary offerings? What does this have to say about the future of hard-throwing, low-strikeout guys like Chien-Ming Wang? I don't know, but these are interesting questions.

I'll be following up soon by listing some examples of real and expected K/9 rates for select pitchers, derived from this model.

For those of you who were wondering how accurately my model predicts K/9 rates...

The correlation between my predicted value and the real K/9 rate is 0.6017. If I measure the correlation between real K/9 and the outputs from a model that trains with 10-fold cross-validation (data is split into 10 sections and each is evaluated by a model trained on the other 9), the correlation goes down to 0.5755. So the model I showed above is overtrained, but by very little. Since the share of variance explained is the square of the correlation, the method explains roughly a third (33% to 36%) of the variance between different pitchers' strikeout rates. It cannot explain the remaining two thirds using just pitch data and bio data.
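
For readers unfamiliar with the procedure, here is a rough sketch of how a cross-validated correlation is computed. It is not the Weka setup used for the numbers above: a one-feature least-squares line stands in for the M5P model tree, the data is synthetic (not FanGraphs data), and all function names are made up for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def cross_validated_correlation(xs, ys, folds=10, seed=0):
    """Correlation between actual values and out-of-fold predictions."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    predictions = [0.0] * len(xs)
    for k in range(folds):
        held_out = set(order[k::folds])            # one of the 10 sections
        train = [i for i in order if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in held_out:                         # predict unseen rows only
            predictions[i] = slope * xs[i] + intercept
    return pearson(ys, predictions)


# Synthetic stand-in data: fastball velocity (mph) and a noisy K/9.
rng = random.Random(42)
velocity = [rng.uniform(84.0, 98.0) for _ in range(500)]
so9 = [0.4 * v - 29.0 + rng.gauss(0.0, 1.8) for v in velocity]

slope, intercept = fit_line(velocity, so9)
in_sample = pearson(so9, [slope * v + intercept for v in velocity])
out_of_sample = cross_validated_correlation(velocity, so9)
print(round(in_sample, 4), round(out_of_sample, 4))
```

The gap between the two printed numbers plays the same role as the gap between 0.6017 and 0.5755: the first scores the model on data it was fit to, the second only on data it has not seen.
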

Lastly, if you look at LM2 (guys who have average to below-average fastballs and started at least one game), you can see that secondary factors (beyond fastball speed and handedness) take more precedence. Throwing more curveballs and throwing slower changeups have significant, if not dramatic, positive effects on strikeout rates. Also, the benefits of being a lefty are less pronounced, as are the benefits of a faster fastball (relative to other below-average guys).

The latter suggests to me that the benefits of being a lefty, as well as the benefits of throwing really hard, grow non-linearly with the speed of the average fastball. I would expect this system to underestimate the strikeout rates of very hard-throwing pitchers. Indeed, for Brad Lidge's and Eric Gagne's 14 K/9 seasons, my system suggests that they should have had only about 11 K/9 each time.

The system fails miserably for knuckleball pitchers. At one point, I trained a system that split the pitchers into 5 categories (rather than the 3 here), and that system gave the knuckleball pitchers their own category, but my current system is simpler and has a higher overall correlation to the K/9 data. Sorry, Mr. Wakefield.


  1. Very interesting. Did you think to consider horizontal and vertical break as determinants of K/9? If I recall correctly, FanGraphs has provided this annually since 2007.

    Interesting stuff though.

  2. Hi Roger,

    Thanks for your kind words. I did not consider the h and v break data, since I'm not sure what it means. My purpose here was to see how much physical characteristics (non-performance metrics) can explain performance. Do curveball pitchers get more strikeouts, or tend to be starters rather than in the bullpen? That sort of thing.

    I was surprised that the only factors that show up in the data were fastball speed, (left) handedness, and significant differences in strikeout rate between the leagues, even after other factors are considered.

    If the data had told me that lefty curveball pitchers get more strikeouts or tend to reside in the National League, that's something I could have understood. If it told me there was a correlation between h-break and walk rate, I'd have had no idea what that means. Perhaps others would have, though.