After much delay, I'm ready to show my best effort for predicting strikeout rate from numbers not related to a pitcher's performance.

Say you've got a pitcher. College guy, minor leaguer, major leaguer, etc. You would like to project what his future MLB strikeout rate should be (assuming he makes it that far). Can we predict this number from scouting the guy (i.e., looking at which pitches he already throws and projecting how his repertoire may evolve)? Yes, we can, with around a **0.6 correlation** to actual performance. I showed my original model for this almost 3 months ago.

Since then, I have looked into a number of additional factors and adjustments. Also I can answer a few obvious questions raised by my original work. Most importantly, I am going to break down the model, rather than just drop a formula on my blog.

## Starters & Relievers

I think it's important to know how well we can predict strikeout rate for starters, relievers, and for pitchers in general. So I will show three models.

I consider anyone a starter if he pitches 100+ innings. I consider anyone a reliever if he throws fewer than 6 innings as a starting pitcher. This simple reliever classification captures 86% of relief innings (with less than 1% of starter innings). The simple starter classification captures 82% of starter innings (with less than 10% of reliever innings). Also, we avoid (almost all) pitchers who spend substantial time both in the rotation and in the bullpen.
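The classification rule above can be sketched in a few lines of Python. This is only an illustration: the function name and arguments are hypothetical, and I am assuming that the 100-inning cutoff applies to total innings and that the reliever test is checked first. Neither detail is spelled out above.

```python
from typing import Optional

STARTER_MIN_IP = 100.0        # "100+ innings"
RELIEVER_MAX_START_IP = 6.0   # "fewer than 6 innings as a starting pitcher"


def classify_role(total_ip: float, ip_as_starter: float) -> Optional[str]:
    """Return 'starter', 'reliever', or None for one pitcher-season.

    Assumptions (not stated in the post): the starter cutoff uses total
    innings, and the reliever test takes priority when both could apply.
    """
    if ip_as_starter < RELIEVER_MAX_START_IP:
        return "reliever"
    if total_ip >= STARTER_MIN_IP:
        return "starter"
    # Swingmen and short-season starters fall through and are left out.
    return None
```

A pitcher with 80 innings, 40 of them as a starter, returns `None`, which matches the goal of avoiding pitchers who split time between the rotation and the bullpen.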

## M5 Rules in WEKA

I wish I could just show you a few multiple regression models side by side. However, the algorithm I use is slightly more complicated. Although only slightly.

I train my models with the "M5 Rules" algorithm in WEKA, the open-source machine learning platform. This algorithm is basically a souped-up multiple regression model. Given a bunch of features (30 or so in my case), the algorithm will build a linear model to predict a single value. With a few caveats:

- The algorithm can split the data set along a linear rule (example: FB velocity > 90 mph).
- The algorithm tries to keep the weights of features as low as possible. This also means that features that are not meaningful are eliminated entirely from the model.

If I give the algorithm 30 features, rather than getting back a single rule with 30 weights, I might get 2 or 3 linear rules, each with 5-10 weighted features.
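To make the shape of such a model concrete, here is a minimal Python sketch of a rule list: each rule has a condition and its own small linear model, and the first rule whose condition matches makes the prediction. The class, the feature names, and every number in the example rules are hypothetical placeholders. Only the "FB velocity > 90 mph" split comes from the example above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]


@dataclass
class Rule:
    applies: Callable[[Features], bool]  # split condition for this rule
    intercept: float
    weights: Dict[str, float]            # only the features the rule kept

    def predict(self, x: Features) -> float:
        return self.intercept + sum(w * x[name] for name, w in self.weights.items())


def predict_so9(rules: List[Rule], x: Features) -> float:
    """Use the first rule whose condition matches (a decision list)."""
    for rule in rules:
        if rule.applies(x):
            return rule.predict(x)
    raise ValueError("no rule matched")


# Hypothetical two-rule model; the weights are made up for illustration.
example_rules = [
    Rule(applies=lambda x: x["fb_velo"] > 90.0,
         intercept=-40.0, weights={"fb_velo": 0.5, "lefty": 1.0}),
    Rule(applies=lambda x: True,  # default rule for everyone else
         intercept=-16.0, weights={"fb_velo": 0.25}),
]
```

Note that the second rule simply has no weight for `lefty`. That is how a feature that is "eliminated entirely" shows up in a structure like this.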

In evaluating the accuracy of my model, I look at the correlation between the outputs of the model and the observed strikeout rates. The correlation values that I report come from 5-fold cross-validation: WEKA actually builds 5 models, each time leaving out 1/5 of the data to use for testing. The correlation figures are for the testing data of the 5 models, so I am never training and testing on the same data. However, when breaking down models, I always use a model trained on the entire data set.
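For readers who want to see the evaluation procedure spelled out, here is a rough Python sketch of a cross-validated correlation. It is not what WEKA runs internally: a one-feature least-squares line stands in for M5 Rules, all function names are mine, and I am assuming the correlation is computed once over the pooled held-out predictions rather than averaged per fold.

```python
import random
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """One-feature least squares; a stand-in for the real learner."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def cross_validated_correlation(xs: Sequence[float], ys: Sequence[float],
                                k: int = 5, seed: int = 0) -> float:
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test sets
    preds: List[float] = []
    truth: List[float] = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in fold:  # predict only rows the model never saw
            preds.append(slope * xs[i] + intercept)
            truth.append(ys[i])
    return pearson(preds, truth)
```

Every prediction that enters the final correlation comes from a model that did not see that row during training, which is the whole point of the procedure.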

## The Three Models

As I explained above, each of my models (for starters, for relievers, and for all pitchers) has 2-3 rules. Thankfully, the models split along the same couple of features (FB velocity and starter/reliever classification), so I will be able to compare rules across models fairly easily. First, let me summarize the models.

| model | # of rules | correlation with SO9 | average IP | % lefties |
| --- | --- | --- | --- | --- |
| overall | 3 | 0.5675 | 69.0 | 27.5 |
| starters | 2 | 0.6205 | 171.6 | 27.0 |
| relievers | 2 | 0.4456 | 35.9 | 28.1 |

You can see the models in a text document here. Also I have a grid of the features by model here. I have removed meaningless features in the spreadsheet. This should make it easier to see the significant features that are left over.

Let's quickly list the possible features:

- "bio data": handedness, height, weight, age
- for each pitch [FB=fastball, SL=slider, CT=cutter, CB=curve, CH=change, SF=splitter, KN=knuckler]:
  - % thrown
  - velocity
  - whether the pitch is part of **offerings**
  - whether the pitch is part of **depth**
- repertoire depth and repertoire offerings for the pitcher (an explanation with examples is here)
- league (LG)
- IP Start
  - only for the 'all pitchers' model

### Using IP in Strikeout Prediction Models

The concept of using IP in a model to predict strikeout rates seems contrary to my aim of predicting strikeout rates without using performance-based information. It certainly is. Playing time is the simplest measure of performance. Better pitchers throw more innings.

And yet, adding IP to the feature set above does not help us much in predicting strikeout rates. The 'all pitchers' model (which uses IP Start) has **0.5675** correlation with observed strikeout rates. If we don't allow it to use IP Start, correlation drops to **0.5491**. The model stays nearly as predictive. However, for correctness, I don't use any IP features in the starters and relievers models (beyond classification of pitchers into these models). I use IP for the 'all pitchers' model in order to have a rule separating starters and relievers. This makes later analysis a lot simpler.

The fact that IP does not help us predict strikeout rates (if we already have pitch type data) is notable. This is not the case for the other pitcher rates that we might care about!

Let's restrict ourselves to starting pitchers. Here is what happens if we train models for various pitcher rates with and without IP as an input:

| rate | correlation with IP | correlation without IP | drop off |
| --- | --- | --- | --- |
| SO9 | 0.618 | 0.620 | -0.002 |
| QERA | 0.589 | 0.474 | 0.115 |
| SO/BB | 0.517 | 0.400 | 0.117 |
| BBr | 0.413 | 0.307 | 0.106 |
| LD% | 0.383 | 0.377 | 0.006 |
| GB/FB | 0.361 | 0.280 | 0.061 |
| GB% | 0.328 | 0.301 | 0.027 |
| FB% | 0.313 | 0.287 | 0.026 |
| HRr | 0.332 | 0.257 | 0.075 |
| BABIP | 0.221 | 0.091 | 0.130 |

As you can see, having access to IP data does not matter for predicting starters' strikeout rates. Given the same pitch distribution, a 140 inning fourth starter will tend to have the same strikeout rate as a 220 inning first starter. However this is not true for many of the other rates.

Pitch distributions & bio data do explain a significant portion of many of these rates. However, the rates that we might care most about (QERA, BBr, SO/BB, GB/FB, BABIP) are dependent, to a large degree, on IP. I don't know whether this means that I am missing some method of getting more out of the non-performance related data, or that good pitchers have low walk rates and keep their BABIP down in ways that can't be measured using aggregated pitch data.

Once I am able to look at individual, rather than aggregated, pitch data from Pitch F/X, perhaps I'll have a better idea.

### Fastball Velocity

As I've written before, average fastball velocity is the most predictive single feature for strikeout rates. This is true for all groups of pitchers.

Fastball velocity does not have a linear relationship with strikeout rates. Rather, as you can see from the graph here, differences in velocity matter a lot more on the high end (93-96 mph) than they do on the low end (86-89 mph). Therefore all three models split the data along high end/low end fastball velocities, thus creating separate rules for "hard throwing" and for "soft throwing" pitchers. This piecewise linear approach is a much cleaner way of handling nonlinearity than trying to fit a polynomial to the relationship between fastball velocity and strikeout rates.

Here are the weights for fastball velocity in the SO9 models. Remember that SO9 is measured in strikeouts, while pitch velocities are measured in mph. So a weight of **+0.50** means that an increase in 1 mph on the fastball will result in a SO9 prediction that is 0.5 strikeouts higher.

| model | hard throwers | soft throwers |
| --- | --- | --- |
| starters | +0.519 | +0.316 |
| relievers | +0.586 | +0.351 |
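As a quick worked example of reading these weights, the Python snippet below multiplies a weight by a change in fastball velocity. The dictionary just restates the table; the function name is mine, and the calculation assumes that the pitcher stays inside the same rule (hard or soft) before and after the change and that nothing else about him changes.

```python
# Fastball velocity weights from the table above (SO9 per mph).
FB_VELOCITY_WEIGHT = {
    ("starter", "hard"): 0.519,
    ("starter", "soft"): 0.316,
    ("reliever", "hard"): 0.586,
    ("reliever", "soft"): 0.351,
}


def so9_change(role: str, group: str, delta_mph: float) -> float:
    """Predicted change in SO9 for a velocity change within one rule.

    Assumption: the pitcher does not cross the hard/soft split, so the
    same linear rule applies before and after.
    """
    return FB_VELOCITY_WEIGHT[(role, group)] * delta_mph


# A hard-throwing starter adding 2 mph gains about 1.04 SO9 ...
hard_gain = so9_change("starter", "hard", 2.0)
# ... while a soft-tossing starter adding 2 mph gains about 0.63 SO9.
soft_gain = so9_change("starter", "soft", 2.0)
```

The same 2 mph is worth roughly 0.4 SO9 more to the hard thrower, which is the nonlinearity the split is there to capture.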

The harder that a pitcher throws, the more he has to gain (strikeout-wise) by throwing even harder. I have never faced a 90 mph fastball, so I have no idea why hitters find it so much harder to make contact with a 92 mph fastball than with a 90 mph fastball. But they do.

Explaining the non-linearity might be easier. Pitchers on the low end of the scale (with average fastballs in the mid-80s) only stick around in the majors by doing lots of other things well. Those that don't *somehow* manage reasonable strikeout rates do not keep their jobs. Pitchers on the high end of the scale (those who flash 98 mph fastballs and who sit above 94 mph) are so rare that hitters won't be used to facing that kind of heat. So strikeout rates for those pitchers should be exceptionally high.

Remember Joba Chamberlain in 2007? His average fastball speed out of the pen was 97 mph (!) according to FanGraphs. In 2008, his average fastball speed fell to 95 mph, and yet he still maintained a strikeout rate of 10.6 K/9. However, in 2009, his average fastball speed dropped to 92.5 mph. Still well above average, but no longer elite. His strikeout rate fell to 7.6 K/9, which is also well above average, but no longer spectacular. Say what you want about Joba's other pitches. It's his declining fastball speed that drove down his strikeout rate.

### Lefties and the National League

Low strikeout rate has got you down? Want to increase your strikeout rate by up to 2.0 K/9 in two easy steps?

First, learn how to throw left-handed. Second, sign with the Nationals and make sure that you get into their starting rotation. Or join any other National League team, for that matter. A lefty starter in the National League with an above-average fastball will average about 2.0 more K/9 than a right-handed American League starter with the same stuff, according to my model:

| model | hard throwers | soft throwers |
| --- | --- | --- |
| starters: THROWS=L | +1.296 | +0.355 |
| relievers: THROWS=L | +0.939 | +0.650 |
| starters: LG | -0.424 | -0.213 |
| relievers: LG | -0 | -0.117 |

Since LG = [AL = +1; NL = -1], the -0.424 value means that switching from the AL to the NL will gain a hard-throwing starter about 0.85 K/9 (LG changes by 2, and 2 × 0.424 = 0.848). Since the league change effect is almost entirely absent for relievers, I think that the increase in strikeout rate is due to NL starters facing the opposing pitcher. It seems that a starter in the NL can pad his strikeout rate simply by having a good enough fastball to make it hard for the opposing pitcher to make contact. This should be worth noting for AL teams that sign hard-throwing NL starters.
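The arithmetic behind the league and handedness effects can be written out in a short Python sketch. The LG coding and the weights are the ones given above. The function names are mine, and I am assuming that THROWS=L is a 0/1 indicator, which is not stated anywhere in this post.

```python
LG_CODE = {"AL": 1, "NL": -1}  # coding given in the text

# Weights from the table above, keyed by (role, velocity group).
LG_WEIGHT = {
    ("starter", "hard"): -0.424, ("starter", "soft"): -0.213,
    ("reliever", "hard"): 0.0,   ("reliever", "soft"): -0.117,
}
THROWS_L_WEIGHT = {
    ("starter", "hard"): 1.296,  ("starter", "soft"): 0.355,
    ("reliever", "hard"): 0.939, ("reliever", "soft"): 0.650,
}


def league_move_effect(role: str, group: str, from_lg: str, to_lg: str) -> float:
    """Predicted SO9 change from switching leagues, all else equal."""
    return LG_WEIGHT[(role, group)] * (LG_CODE[to_lg] - LG_CODE[from_lg])


def lefty_nl_vs_righty_al(role: str, group: str) -> float:
    """SO9 gap between an NL lefty and an AL righty with the same stuff.

    Assumption: THROWS=L enters the model as a 0/1 indicator.
    """
    return THROWS_L_WEIGHT[(role, group)] + league_move_effect(role, group, "AL", "NL")
```

For a hard-throwing starter, `league_move_effect("starter", "hard", "AL", "NL")` comes to 0.848, and adding the lefty weight gives a gap of a little over 2 K/9. For a hard-throwing reliever the league term is zero, so only handedness matters.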

If you look at Randy Johnson, Kevin Brown and Curt Schilling, all of these guys had their SO9 rates dip by about 2.0 when they made late-career moves to the AL East from the NL West. Two years later, Randy Johnson went back to the NL West, and his strikeout rate jumped right back up, despite a small drop-off in fastball velocity. The other two pitchers retired.

I don't know offhand what the difference is between strikeouts in the AL and in the NL. But I'm sure that the difference is not 4.0 strikeouts per game. Whatever the difference is, it looks like it's being disproportionately made up for by the hard-throwing starting pitchers in the NL. Soft tossers like Barry Zito should not expect their strikeout rates to rise with a move to the Senior Circuit. However, Zito's left-handedness serves him well in both leagues.

(I am not implying that the three star pitchers' drop in SO9 rates proves that when hard-throwing pitchers move to the AL, their strikeout rates drop by 2.0 strikeouts per nine innings. My model suggests that the drop should be more like 0.8 strikeouts per nine innings. However, even a drop like that would probably not be accounted for by simple translations of average strikeout rates between the NL and AL. That's all I was saying. This sounds like a worthwhile study. And a fairly simple one, also.)

### Repertoire Depth

The models seem to suggest that having a deep repertoire is negatively correlated to strikeout rates. This is not entirely so.

Although all of the model weights associated with rep_offerings and rep_depth are negative or zero, these weights are assigned in a context where we can give positive weights to pitchers for individual pitches. To check what happens without that context, I also trained models using only the following features:

- FB velocity
- handedness
- league (AL vs NL)
- rep_depth and rep_offerings

Such models showed that (for both starters and relievers) **rep_depth** is *positively correlated* with strikeout rates, while **rep_offerings** is *negatively correlated* with strikeout rates. Also, the absolute value of the weight for **rep_depth** is higher in all cases.

In other words, having multiple "core" pitches is predictive of a high strikeout rate, but throwing lots of pitches, without throwing them very often, is not.

Furthermore, a model with the five features above performs almost as well as a model with the full set of features. I will explore this deeper in a future post.

As for the current models, which include features about individual pitches, why are the **rep_depth** and **rep_offerings** weights always negative? I can't say for sure. But we can't simply ignore the fact that many pitchers who have had very high strikeout rates (Gagne, Lidge, Papelbon, Clemens, Randy Johnson) were one-pitch or two-pitch pitchers. If the model sees many examples of pitchers with really high strikeout rates that use just two good pitches, that fact will be reflected in the feature weights.

### Other Pitch Stats

As I mentioned above, using the five most important stats can get us a strikeout rate model that is within a few percentage points of models trained with the full set of features. That is not to say that the other features are meaningless. They have some additional predictive value, but most of the information these features offer is redundant. However, they are interesting for descriptive purposes.

By allowing the model to consider information about a pitcher's curveball and change up, we should get better estimates about how significant his fastball speed really is. However it is difficult to tell anything conclusively about these "lesser" pitches' predictive power.

The model seems to suggest that hard-throwing relievers should use the change up as a core pitch. If they are going to throw sliders, they better be hard sliders. That sounds reasonable. However, I would not take these numbers too seriously. The relevant weights are here. If you see something interesting that I missed, please let me know!

### Odds & Ends

This post has already run long, so I'll just mention a few more observations, without comment:

- If we build a model for *left-handed* starters, we get a **0.75** correlation to the observed strikeout data.
- If we build a model for left-handed *relievers*, the correlation is **0.45**.
- Giving the model features for "**quality of opposition**" does not help predict strikeout rates. Although I'm not sure how sensitive these features are to a pitcher's individual opposition, rather than his team's opposition as a whole.
- The model suggests that **height** doesn't matter, except that it hurts lefty relievers. Higher **body weight** is good for relievers, but bad for lefty starters. Higher **age** is bad for lefty starters, but good for lefty relievers. I doubt if any of these values are significant.