I trained my model in pieces. Most importantly, I trained separate models for IP_Start and IP_Relief. Here is a simple graph of what I did:
My basic inputs were:
- "IP stats" (such as IP, IP_Start, IP_Relief and also DNP (did not play), ROOKIE, etc) for the previous four years
- "Value stats" (such as wins, losses, holds, VORP, SNWP (support neutral winning percentage for starters), WXR (wins added in relief), etc.)
To get the final model, I first trained basic models for IP_total, IP_Start and IP_Relief purely based on previous playing time. Then I trained the final IP_Relief model based on the basic projections and also the value stats. The final IP_Start model uses the output of this final IP_Relief model as an input, along with the basic projections and the value stats. This allows the final projection to be a simple sum of the IP_Relief and IP_Start models.
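The staging described above can be sketched in a few lines. This is only an illustration of the structure, not my actual training code: it assumes ordinary least squares as the learner, and the names fit_linear, predict_linear and train_staged are made up for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept (an assumed learner)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply coefficients from fit_linear to a feature matrix."""
    A = np.column_stack([X, np.ones(len(X))])
    return A @ coef

def train_staged(ip_feats, value_feats, ip_total, ip_start, ip_relief):
    """Hypothetical three-stage fit; returns in-sample projections."""
    # Stage 1: basic models that see playing-time features only.
    targets = (ip_total, ip_start, ip_relief)
    basic_proj = np.column_stack(
        [predict_linear(fit_linear(ip_feats, y), ip_feats) for y in targets]
    )

    # Stage 2: final relief model = basic projections + value stats.
    relief_X = np.column_stack([basic_proj, value_feats])
    relief_proj = predict_linear(fit_linear(relief_X, ip_relief), relief_X)

    # Stage 3: final start model = basic total and start projections,
    # the final relief projection from stage 2, and value stats.
    start_X = np.column_stack([basic_proj[:, :2], relief_proj, value_feats])
    start_proj = predict_linear(fit_linear(start_X, ip_start), start_X)

    # The final IP projection is the sum of the two final models.
    return {
        "relief": relief_proj,
        "start": start_proj,
        "total": relief_proj + start_proj,
    }
```

Feeding the stage 2 output into stage 3 is what lets the start model back off when a pitcher is projected to spend his time in the bullpen.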
Each of the five trained models can be seen in a long text file here. Most of the features are self-explanatory, although a few are not. I'd be happy to explain what they mean, if anyone wants to know.
Note that "_2008" features simply mean features for the previous year. I named all the features as though I were training on 2009 pitcher seasons, although I actually use 2005-2009 pitcher seasons for training. Also, "_average" features are averages for the three years before the previous one (i.e. "_2005", "_2006" and "_2007"). The "_ip_average" features (the few there are) are four-year averages, weighted by seasonal IP. So they are a sort of "how did he do when he actually played" feature. They are very optimistic estimates, in other words. Most of the features used are "_2008" features, which shows just how much projections depend on very recent performance.
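For anyone who wants the two averaging schemes spelled out, here is a minimal sketch. The function names are made up for the example, and treating a missing season as zero is an assumption of the sketch, not necessarily what my feature code does.

```python
def three_year_average(stat_by_year, last_year):
    """'_average': plain mean over the three seasons before last_year."""
    years = (last_year - 3, last_year - 2, last_year - 1)
    return sum(stat_by_year.get(y, 0.0) for y in years) / 3

def ip_weighted_average(stat_by_year, ip_by_year, last_year):
    """'_ip_average': four-year mean weighted by each season's IP.

    Seasons with zero IP contribute nothing, which is why this reads as
    "how did he do when he actually played".
    """
    years = range(last_year - 3, last_year + 1)
    total_ip = sum(ip_by_year.get(y, 0.0) for y in years)
    if total_ip == 0:
        return 0.0  # never pitched in the window (assumed fallback)
    weighted = sum(
        stat_by_year.get(y, 0.0) * ip_by_year.get(y, 0.0) for y in years
    )
    return weighted / total_ip
```

With last_year set to 2008, the first function covers 2005 through 2007 and the second covers 2005 through 2008, matching the feature names above.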
Again, I won't spend much time on these models. Each model reports a correlation with the data that it is trying to predict (based on 5-fold cross-validation). The correlation is the "r" in "r^2" that you statistically-minded people know and love. My most basic IP_total projection system gets a 0.71 correlation with the data (using previous year's IP only, and not included here), while the final system claims a 0.76 correlation. That doesn't sound like much, although it takes r^2 from about 0.50 to about 0.58, which is a relative improvement of roughly 15%.
More importantly, my final projections take account of a broader set of features than the more basic projections. Statistically speaking, there is not much gain from looking at value stats when projecting IP. We know that a pitcher who throws 200 IP and is very effective is more likely to throw 200 IP again than a guy with the same 200 IP who was at replacement level. However, the difference in ability is usually reflected in the previous usage anyway (replacement level pitchers don't often accumulate 200 IP), so the information gain is not large.
However, I would rather use this information, even if the gain is small. I want my system to take account of the fact that Mariano Rivera provides lots of value, and thus project him differently than Brad Lidge, who had a similar number of innings pitched in relief. Even if performance and playing time are highly correlated, I would rather use more factors, as long as their relationship makes sense to me, and as long as I am not using too many.
People in AI often say that feature selection is an art as well as a science. If there is no information gain in a feature, my system won't use it. However, most of the relationships I use in my models are not statistically significant. Which features I keep and which ones I don't keep is a matter of discretion. Generally, I want to use features that make intuitive sense, and which comprise both "_2008" information and information from previous years. I could train a model with almost the same accuracy using only "_2008" features, but it would consider Brandon Webb and Ben Sheets as old rookies going into 2010. We know that's not the case, so I'd rather have a model that at least tries to use all four years' worth of data.
As it turns out, older data is useful to predict starters' innings, but is almost useless for predicting future relief innings. Here is the final IP_Relief model:
0.4164 * L_2008
+ 0.3454 * BS_2008
+ 0.3444 * VORP_2008
+ 0.2825 * Hold_2008
- 4.9235 * SNVAR_2008
+ 1.9615 * WXR_2008
- 3.3587 * bin_change_LG_2008
+ 0.1924 * SNWP_R_2008
- 0.3574 * W_average
+ 0.4407 * L_average
+ 0.078 * VORP_average
- 0.9571 * SNVAR_average
+ 0.8097 * LG_average
+ 0.1252 * proj_IP_v26
- 0.1387 * proj_IP_Start_v26
+ 0.4968 * proj_IP_Relief_v26
You don't need to understand what all these features mean to see that next year's relief innings are heavily dependent on WHYDFML (what have you done for me lately). Total value added (with value added for starting pitching taken out) is responsible for half of the relief innings pitched prediction, while only half depends on the simple IP_Relief model that accounts for previous years' innings pitched. Therefore an effective reliever who pitched 40 innings in 2009 might be projected for 60 IP in 2010, while a reliever who throws 80 IP every year might be projected for less than 60 IP next year, if his last year wasn't so good. Having been an effective reliever in 2007 means nothing if your 2008 sucked.
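To make the formula concrete, here is a sketch that evaluates the IP_Relief model as a plain weighted sum. The coefficients are copied from the listing above; the function name and the intercept parameter are made up for the example (the listing shows no intercept, so it defaults to zero here), and any feature left out of the input is treated as zero.

```python
# Coefficients copied from the final IP_Relief model listing.
IP_RELIEF_COEFS = {
    "L_2008": 0.4164,
    "BS_2008": 0.3454,
    "VORP_2008": 0.3444,
    "Hold_2008": 0.2825,
    "SNVAR_2008": -4.9235,
    "WXR_2008": 1.9615,
    "bin_change_LG_2008": -3.3587,
    "SNWP_R_2008": 0.1924,
    "W_average": -0.3574,
    "L_average": 0.4407,
    "VORP_average": 0.078,
    "SNVAR_average": -0.9571,
    "LG_average": 0.8097,
    "proj_IP_v26": 0.1252,
    "proj_IP_Start_v26": -0.1387,
    "proj_IP_Relief_v26": 0.4968,
}

def project_ip_relief(features, intercept=0.0):
    """Weighted sum of the listed features; missing features count as 0."""
    return intercept + sum(
        coef * features.get(name, 0.0)
        for name, coef in IP_RELIEF_COEFS.items()
    )
```

Calling project_ip_relief with only proj_IP_Relief_v26 set shows how little of the basic projection survives on its own: the rest has to be earned through the value stats.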
I'm not sure why SV (saves) did not make the cut for features used to predict IP_Relief, while Holds, BS (blown saves) and WXR (wins added in relief) proved to be useful indicators. Maybe this goes to show that saves statistics get abused too much nowadays. A setup man with twice the holds and double the WXR of another setup man is most certainly a more valuable setup man, but a pitcher with 60 saves is not inherently more valuable than a pitcher with 30 saves.
The IP_Start model also has an interesting caveat:
0.8554 * W_2008
+ 0.877 * VORP_2008
- 16.6386 * SNVAR_2008
+ 2.8454 * SNWP_R_2008
- 0.606 * L_average
+ 0.5458 * VORP_average
- 2.2624 * SNVAR_average
+ 0.2685 * proj_IP_v26
+ 0.6921 * proj_IP_Start_v26
- 0.761 * proj_IP_Relief_v26_w_value
Here "SNWP_R" is a BP feature (Support Neutral Winning Percentage) that I turned into an "over replacement" stat. This is BP's stat for measuring starter quality on a 0.0-1.0 scale where about 0.44 is replacement level (and 0.6 is really good). I smoothed values for low-IP guys and count the points above 0.44 (multiplied by 100).
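A rough sketch of that transformation follows. The 0.44 replacement level and the multiplication by 100 are as described above; everything else is an assumption for illustration. In particular, the smoothing shown (shrinking toward replacement level with a made-up prior_ip weight) is just one common way to do it, and clipping below-replacement values to zero is my shorthand for "count the points above 0.44".

```python
REPLACEMENT_SNWP = 0.44  # replacement level on BP's 0.0-1.0 scale

def snwp_over_replacement(snwp, ip, prior_ip=50.0):
    """Points of SNWP above replacement level, times 100.

    prior_ip is a hypothetical smoothing weight: a pitcher is treated as
    if he had thrown prior_ip extra innings at replacement level, which
    pulls low-IP seasons toward 0.44.
    """
    if ip + prior_ip <= 0:
        return 0.0
    smoothed = (snwp * ip + REPLACEMENT_SNWP * prior_ip) / (ip + prior_ip)
    # Only count points above replacement (assumed clipping at zero).
    return max(0.0, (smoothed - REPLACEMENT_SNWP) * 100)
```

With prior_ip set to zero there is no smoothing, so a 0.60 SNWP comes out as 16 points over replacement.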
What's interesting is that the coefficient on SNVAR (wins added as a starter, also from BP) is negative. The positive coefficient on VORP and the negative one on SNVAR mostly neutralize each other for starters, while the SNWP_R contribution is always positive. Therefore a starter who is really good in 160 innings will get the same credit as a starter who is equally good in 220 innings (same SNWP but higher SNVAR).
The basic proj_IP_Start_v26 (IP & rookie stats only) IP_Start model is essentially a regression of last year's IP_Start (with extra credit for previous years' IP), so it's interesting that the bonus for a good 160 IP pitcher is the same as that for a good 220 IP pitcher (although the latter is hurt more by regression in the simple model). The model does not have a feature to say "this guy has pitched 220 innings every year for five years, so I should give him credit for that skill," but still, it seems to suggest that such pitchers are more lucky than skillful. As Brandon Webb showed last year, being a workhorse for years does not preclude a regression to the mean. It's easy to say that a guy is special in retrospect, but projecting into the future, there is no reason to give any of today's pitchers a 50th percentile projection of 220 IP.
I'd be happy to discuss more details about these models if anyone is interested. Otherwise, I'll proceed to 2010 projections and thoughts for further improvements, including a better understanding of how injuries affect IP prediction.
Also, I need to repeat these steps (and some other steps) for ERA prediction, and VORP prediction (although hopefully ERA and IP will interact well enough to produce VORP projections directly).