From the start of my work on training models to predict pitcher usage & performance, I was dogged by a simple problem:
Which pitcher seasons should I use as training data?
I'm not the only person facing this problem. Typically, statistical analyses (trying to establish the correlation between factors A, B, and C) use an arbitrary cutoff on IP, or some other broad indicator of a complete-enough season. I ended up using a similar approach, and rationalized that I was excluding only a small number of pitcher seasons by using a cutoff like 80 IP over the three previous seasons (later reduced to 50 IP). However, this never felt like a satisfactory solution.
Ideally, I would like to use all of the pitcher seasons as data. However, no one (including me) really cares about a system that predicts the performance of fringe major leaguers very accurately. Everyone wants to know how well the stars will do, or at least the regulars with major league contracts. And yet systematically ignoring weak contributors seems suboptimal for many reasons. If you count pitcher seasons, something like 20% of all pitcher seasons belong to pitchers who cannot be called major league regulars by any means. I'd left this issue on the back burner until it absolutely could not be ignored any longer.
In my recent look at predicting strikeout rate from pitch data, I got significantly better predictive results (and more reasonable-looking models) by restricting training data to pitchers with 20+ IP and 40+ IP. This is understandable, since the swings in strikeout rate (and also in breakdown by pitch type and other pitch data) would be very high for a guy throwing only the equivalent of a few complete games. So I ended up with a rather arbitrary cutoff, training only with pitcher seasons over something like 32 IP. Ugly. I would have preferred to simply decay the pitcher seasons' weights in training, in proportion to how free the data is of expected random fluctuation.
As it turns out, there is a statistical basis for this sort of weighting of training data. I was reading an old statistics text, and it mentions a similar problem in linear regression analysis. My system for predicting strikeout rate is not exactly a linear regression, but it's very similar. The most common method of minimizing root mean squared error for your best-fit line assumes that the variance of the dependent variable does not depend on the values of the independent variable(s). That assumption will not hold for just about any baseball study of seasonal totals.
If you know the relationship between the variance of the dependent variable and the values of the independent variable(s), this problem is easy to fix. You re-weight the instances in inverse proportion to the expected variance of the data points. In other words, you give *more* weight to values that you know you are observing more accurately. The book gives an example where the independent variable is the amount of a steroid given to a castrated rooster, and the dependent variable is some measure of growth in the bird. LOL!
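This inverse-variance trick is standard weighted least squares. A minimal sketch on made-up heteroscedastic data (all the numbers here are mine, not from the text), using the closed-form solution:

```python
import numpy as np

# Weighted least squares: minimize sum_i w_i * (y_i - X_i @ beta)^2,
# with w_i proportional to 1 / Var(y_i).  Toy data where the noise
# grows with x, like the rooster example with unequal variances.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 200)
noise_sd = 0.5 * x                       # variance depends on x
y = 2.0 + 3.0 * x + rng.normal(0.0, noise_sd)

X = np.column_stack([np.ones_like(x), x])
w = 1.0 / noise_sd**2                    # inverse-variance weights

# Closed-form WLS: beta = (X' W X)^-1 X' W y
W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta)  # should recover roughly the true intercept 2 and slope 3
```

The only change from ordinary least squares is the diagonal weight matrix, which is why this slots naturally into an existing training pipeline.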
Unfortunately, the baseball case is not so simple. For the Sammy Sosa roosters, there were several data points for each steroid amount, given to different birds. In the case of ballplayers, we can't run a player season twice and see how much variance there is in performance across the different trials.
Also, since I am essentially training a multiple regression model, there are many independent variables, and it's not entirely clear which one is most important in estimating expected variance in future usage or performance. So weighting pitcher seasons based on the value of past performance may not be ideal, either.
The most logical way to estimate the variance in performance for a pitcher season is to relate the variance to the *actual* performance. If we think of the actual performance as the result of a single trial, dependent on all of the pitcher's known past performance and other information, then we can get a rough estimate of the expected variance of his ideally projected performance.
That sounds confusing, but basically we are asking: given that player X pitched 60 IP in 2009, what was the expected variance in IP for him going into 2009, everything else being equal? Similarly, we get estimates for pitcher Y, who pitched 200 IP, and pitcher Z, who pitched only 5 IP. Knowing nothing else about the pitchers, I'm sure anybody would guess that pitcher Y (the 200 IP guy) would have the highest expected variance in IP, and that pitcher X (the 60 IP guy) would have the lowest of the three.
To estimate variance, I compute the error of the latest generation of my IP prediction model for different classes of actual IP. Assuming my system does not have a strong bias against any class of pitchers, this should be at least a decent approximation of expected variance in IP. Below, I graph the results (as the midpoints of the buckets (classes) for which I estimate variance using the method above).
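As a sketch of the bucketing procedure (on synthetic numbers, since I obviously can't reproduce the real model's predictions here; I build in an error that grows with IP for illustration):

```python
import numpy as np

# Estimate Var(actual IP | class of actual IP) from a model's errors:
# group pitcher seasons into buckets of actual IP, then take the spread
# of (actual - predicted) within each bucket.
rng = np.random.default_rng(1)
actual = rng.uniform(0, 220, 1000)                      # fake actual IP
predicted = actual + rng.normal(0, 10 + 0.2 * actual)   # error grows with IP

edges = np.arange(0, 241, 40)            # buckets: 0-40, 40-80, ..., 200-240
bucket = np.digitize(actual, edges) - 1
for b in range(len(edges) - 1):
    errs = (actual - predicted)[bucket == b]
    mid = (edges[b] + edges[b + 1]) / 2
    print(f"bucket midpoint {mid:5.0f} IP: error sd {errs.std():5.1f} IP")
```

Each printed standard deviation is the bucket's estimate of expected variance (as an sd), which is exactly what gets graphed against the bucket midpoints.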
For this quick look, I only used data for the 2009 season, and only for pitchers who met my previous cutoff criteria. So the lower end (near 0 IP) points are probably pretty dubious. Still, the trend makes sense. As one might expect, expected variance is high for guys who throw very few innings, or very many innings. The easiest IP to predict is for a reliever who gets 60-80 IP every year. Even there, we are looking at an annual standard deviation of 30 IP, but not nearly the 70 IP for a typical full time starter.
Going back to the statistics text: now that I have an estimate of the variance of my observed cases, I should weight them by the inverse of their variances for use in training. So the low-variance 60-80 IP reliever should get about a 5x training weight compared to the 200 IP starter. Also, he should get a 2-3x training weight compared to the fringe major leaguer getting close to 0 IP.
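The ~5x figure follows directly from the inverse-variance rule, using the standard deviations read off the graph (30 IP for the steady reliever, 70 IP for a typical full-time starter):

```python
# Training weight is proportional to 1 / variance = 1 / sd^2.
sd_reliever = 30.0  # annual sd for a steady 60-80 IP reliever (from the graph)
sd_starter = 70.0   # annual sd for a typical full-time starter (from the graph)

weight_ratio = sd_starter**2 / sd_reliever**2
print(weight_ratio)  # (70/30)^2 ~= 5.4: the reliever gets ~5x the starter's weight
```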
To make these figures more credible, I'll need to include all the pitcher seasons I have data for, including those for pitchers with no prior major league experience, and even for guys who missed an entire year due to injury or ineffectiveness. Unfortunately, updating my database to include more players is always a pain. Not because I don't have the data, but because matching data from several sources involves some manual work (like matching alternative spellings of names). Still, this should be ready soon.
I'm not sure this approach will buy me anything as far as an improved model. The idea of giving starting pitcher seasons less training weight seems unnatural. These are really the guys I care about predicting accurately, no? However, the idea of having a logical system for handling fringe, low-IP guys is very appealing to me. As is the idea of forcing my system to predict the performance of steady (IP-wise, anyway) relief pitchers more accurately. Missing the mark by 30 IP on CC Sabathia is one thing, but failing to predict that Mariano Rivera will throw very close to 70 IP is not so defensible.
Unfortunately, I always have more ideas than time to try them properly. But having a system by which I can train using all pitcher seasons that I have access to is very appealing, so this is going straight to the top of my TODO heap...
Monday, January 11, 2010
I am in Tokyo, and haven't been thinking much about baseball lately. However I did try breaking down the prediction of IP into IP_Start & IP_Relief as I wrote about earlier. Well, it didn't help much.
I wasn't surprised that predicting IP wasn't magically going to become easier by separately predicting innings pitched in start and in relief. However I was surprised that these separate projections didn't do much for predicting VORP (overall value in runs saved) either.
However I thought I'd explain a little better exactly why I think this idea made sense (and still might work, maybe with more data).
OK, so here is the distribution of IP for 2005-2009 pitchers who had at least 50 IP of experience in the three prior years. Don't ask about the cutoffs. I'm thinking about how to remove them without making things worse. The overlaid red line is the best-fit normal distribution (not that this data would be expected to be normally distributed):
As you can see, the distribution has two peaks, one at 61 IP and another at 181 IP. Actually, this is an Excel graph, and Excel lines up the x-axis for bar graphs strangely. Think of each label as being attached to the left edge of its bar on the x-axis. So the peaks are actually at 61-85 IP and at 181-205 IP. Both those ranges make sense. You would expect a lot of major league relievers to throw about 73 IP, and a lot of starters to throw just under 200 IP.
Hence the idea of breaking down IP into IP_Start & IP_Relief. Each pitcher's total innings pitched is just the sum of those two components. Maybe the separate distributions would be easier to predict (ie if they have simple shapes)?
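To illustrate why the total comes out two-peaked, here is a toy simulation (the 50/50 role split and the distribution parameters are invented) where each pitcher's total IP is the sum of his start and relief innings:

```python
import numpy as np

# Toy population with two roles: starters throw most innings in starts,
# relievers in relief.  Total IP = IP_Start + IP_Relief per pitcher.
rng = np.random.default_rng(2)
n = 2000
is_starter = rng.random(n) < 0.5
ip_start = np.where(is_starter, rng.normal(190, 25, n), rng.normal(5, 5, n))
ip_relief = np.where(is_starter, rng.normal(3, 3, n), rng.normal(70, 15, n))
ip_total = np.clip(ip_start, 0, None) + np.clip(ip_relief, 0, None)

# Histogram of totals in 20 IP bins: two clusters, near ~73 IP and ~193 IP.
hist, edges = np.histogram(ip_total, bins=np.arange(0, 260, 20))
print(hist)
```

The two peaks in the total come from the role mix, so splitting by role looked like a natural way to get simpler per-role shapes.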
However, here are the actual distributions:
Well these distributions are not really any easier to describe. Both the start and relief innings pitched distributions have at least two peaks. The relief inning distribution might even have a third peak around 50IP. Given that some pitchers have bullpen roles (ie LOOGY or similar) that don't call for more than 60IP per year, this might be a true third peak. Then again, since we are now dealing with less data, the shape of the distribution is even harder to trust.
Which brings me to answer the obvious question: since pitchers often have defined roles, why don't I train separate models for those roles? Unfortunately, I don't think this would be so simple. Even if the previous years' roles were easily known (they are not), pitcher roles change often for the non-stars. Also, training models on 2,000 pitcher seasons is hard enough, but training models on 100 pitcher seasons doesn't even make sense.
Without more training data, I just don't think it will be possible to build a better model for IP by breaking down the distribution into components. I could probably squeeze in a bit more training data. However older stats are less useful for predicting future pitcher seasons, and injury data is not as good for the older years (the number of DL listings has grown significantly in recent years).
Conceptually, I wonder if IP (or perhaps just IP_Relief) can be broken down into two or three components that can be measured or estimated? I'm thinking something like:
- Time (games) available
- Usage per game
This way, it would be possible to separate a pitcher who threw 40 IP in relief because he missed half the season to injuries, suspension, or minor league duty, from one who was available in the bullpen every day but seldom used by his manager. Also, this approach might improve my predictions for rookies and other pitchers with little major league experience. If the projections are too low (or too high), it will be easy to see why: availability estimates are too low (or too high).
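A trivial sketch of the decomposition (the function and parameter names are my own invention): two pitchers can arrive at the same relief IP total through very different availability and usage profiles, which a single IP number hides:

```python
# Hypothetical decomposition: IP_Relief ~= games available * IP per available game.
def project_relief_ip(games_available, ip_per_available_game):
    """Rough relief-innings projection from separately estimated components."""
    return games_available * ip_per_available_game

# Same ~40 IP total, very different stories:
print(project_relief_ip(162, 0.25))  # 40.5 IP: available all year, used lightly
print(project_relief_ip(81, 0.5))    # 40.5 IP: missed half the season, leaned on
```

With the components estimated separately, a bad projection points directly at the component that was off.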
In contrast to IP, VORP distribution is simpler to describe:
As the red curve shows, the distribution is not normal, but it has just one peak. This is not to say that predicting VORP is easier than IP. Rather, it is much harder! Predicting value with a very high correlation is probably impossible. There are just too many unknowable factors involved.
There is ultimately an absolute cap on how well any algorithm can predict future-season IP and VORP. Unfortunately, it is impossible to know what that cap might be. This is discouraging, since for all I know I could be striving for improvements that won't really matter. Currently, I can predict VORP with a little better than a 0.5 correlation against actual values. I don't know how much of the remaining gap is completely random. Then again, I just can't believe that my crude methods are the best that can be done without cheating. Not yet.