From the start of my work on training models to predict pitcher usage & performance, I was dogged by a simple problem:
Which pitcher seasons should I use as training data?
I'm not the only person facing this problem. Typically, statistical analyses (trying to establish the correlation between factors A, B and C) use an arbitrary cutoff for IP or another broad indicator of a complete-enough season. I ended up using a similar approach, and rationalized that I'm excluding only a small number of pitcher seasons by using a cutoff like 80 IP over the 3 previous seasons (later reduced to 50 IP). However, this does not feel like a satisfactory solution.
Ideally, I would like to use all of the pitcher seasons as data. However, no one (including me) really cares about a system that predicts the performance of fringe major leaguers very accurately. Everyone wants to know how well the stars will do, or at least the regulars with major league contracts. And yet to systematically ignore weak contributors seems suboptimal for many reasons. If you count pitcher seasons, something like 20% of them are by pitchers who cannot be called major league regulars by any means. I'd left this issue on the back burner until it absolutely could not be ignored any longer.
In my recent look into predicting strikeout rate from pitch data, I had significantly better predictive results (and more reasonable looking models) by restricting training data only to pitchers with 20+IP and 40+IP. This is understandable, since the swings for strikeout rates (and also for breakdown by pitch type and other pitch data) would be very high for a guy only throwing the equivalent of a few complete games. So I ended up coming up with a rather arbitrary cutoff, and only training with pitcher seasons over something like 32IP. Ugly. I would have preferred to simply decay the pitcher seasons' weights in training, in proportion to how much the data is free of expected random fluctuation.
As it turns out, there is a statistical basis for this sort of weighting of training data. I was reading an old statistics text, and it mentions a similar problem in doing linear regression analysis. My system for predicting strikeout rate is not exactly a linear regression, but it's very similar. The most common method of minimizing root mean squared error in your best fit line (ordinary least squares) assumes that the variance of the dependent variable does not depend on the value of the independent variable(s). This assumption is not going to be true for just about any baseball study of seasonal totals.
If you know the relationship between the variance of the dependent variable and the value of the independent variable(s), this problem is easy to fix. You re-weight the instances in inverse proportion to the expected variance of the data points. In other words, you give *more* weight to values that you know you are observing more accurately. The book gives an example, where the independent variable is the amount of a steroid given to a castrated rooster, and the dependent variable is some measure of growth in the bird. LOL!
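For readers who want to see the re-weighting concretely, here is a minimal sketch of inverse-variance weighted least squares. It is not my actual training code (which is not exactly a linear regression). The function name `weighted_least_squares` and all of the numbers are made up for illustration, and it assumes NumPy is available.

```python
import numpy as np

def weighted_least_squares(X, y, variances):
    """Fit y ~ X @ beta, weighting each row by 1 / variance (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    # Scaling each row by sqrt(weight) turns the weighted problem
    # into an ordinary least squares problem.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Made-up data: an exact line y = 2 + 3x, with the last point knocked off by 10.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column plus one predictor
y = 2.0 + 3.0 * x
y_noisy = y.copy()
y_noisy[-1] += 10.0

# Assumption: we somehow know the last observation is much noisier than the rest.
variances = np.array([1.0, 1.0, 1.0, 1.0, 100.0])

beta_ols = weighted_least_squares(X, y_noisy, np.ones(5))  # equal weights
beta_wls = weighted_least_squares(X, y_noisy, variances)   # inverse-variance weights

print("equal weights:           ", beta_ols)
print("inverse-variance weights:", beta_wls)
```

With equal weights, the one noisy point drags the slope well away from 3. With inverse-variance weights, the fit stays close to the underlying line, because the noisy point counts for only 1/100th as much as the others.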
Unfortunately, the baseball case is not so simple. For the Sammy Sosa roosters, there were several data points for each amount of steroid, given to different birds. In the case of ballplayers, we can't run a player season twice and see how much variance there is in performance across the different trials.
Also, since I am essentially training a multiple regression model, there are many independent variables and it's not entirely clear which one is most important in estimating expected variance in future usage or performance. So weighting pitcher seasons based on the value of past performance may not be ideal, either.
The most logical way to estimate the variance in performance for a pitcher season is to relate the variance to the *actual* performance. If we think of the actual performance as the result of a single trial, dependent on all of his known past performance & other information, then we can get a rough estimate of the expected variance of his ideally projected performance.
That sounds confusing, but basically we are asking: given that player X pitched 60 IP in 2009, what was the expected variance in IP for him going into 2009, everything else being equal? Similarly, we get estimates for pitcher Y who pitched 200 IP, and pitcher Z who pitched only 5 IP. Knowing nothing else about the pitchers, I'm sure that anybody would guess that pitcher Y (200 IP guy) would have the highest expected variance in IP, and that pitcher X (60 IP guy) would have the lowest of the three.
To estimate variance, I compute the error of the latest generation of my IP prediction model, for different classes of actual IP. Assuming my system does not have a strong bias against a class of pitchers, this should be at least a decent approximation of expected variance in IP. Below, I graph the results, plotting each bucket (class) of actual IP at its midpoint, with the variance estimated using the method above.
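The bucketing step described above could look roughly like the sketch below. This is an illustration and not my actual code: the function name `rmse_by_actual_ip`, the bucket edges, and every IP figure in it are invented, and it assumes NumPy.

```python
import numpy as np

def rmse_by_actual_ip(actual_ip, predicted_ip, bucket_edges):
    """Group seasons by actual IP and return (midpoint, RMS error, count) per bucket."""
    actual = np.asarray(actual_ip, dtype=float)
    predicted = np.asarray(predicted_ip, dtype=float)
    edges = np.asarray(bucket_edges, dtype=float)
    # Bucket i covers edges[i] <= actual < edges[i + 1];
    # seasons outside the edges fall into no bucket and are skipped.
    idx = np.digitize(actual, edges) - 1
    results = []
    for i in range(len(edges) - 1):
        mask = idx == i
        if not mask.any():
            continue
        sq_err = (predicted[mask] - actual[mask]) ** 2
        midpoint = (edges[i] + edges[i + 1]) / 2.0
        # RMS error stands in for the standard deviation of IP in this bucket.
        results.append((float(midpoint), float(np.sqrt(sq_err.mean())), int(mask.sum())))
    return results

# Invented numbers, only to show the shape of the computation.
actual = [5, 10, 65, 70, 75, 200, 210]
predicted = [35, 40, 60, 80, 75, 150, 250]
edges = [0, 40, 100, 250]

results = rmse_by_actual_ip(actual, predicted, edges)
for midpoint, rmse, n in results:
    print(f"midpoint {midpoint:6.1f} IP: RMS error {rmse:5.1f} IP over {n} seasons")
```

Squaring each bucket's RMS error gives the variance estimate. The estimate is only as good as the assumption that the model is not biased against any class of pitchers, since a bias would show up here as extra variance.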
For this quick look, I only used data for the 2009 season, and only for pitchers who met my previous cutoff criteria. So the points at the lower end (near 0 IP) are probably pretty dubious. Still, the trend makes sense. As one might expect, expected variance is high for guys who throw very few innings, or very many innings. The easiest IP to predict is for a reliever who gets 60-80 IP every year. Even there, we are looking at an annual standard deviation of 30 IP, but that is not nearly the 70 IP for a typical full-time starter.
Going back to the statistics text, now that I have an estimate of the variance of my observed cases, I should weight them by the inverse of their variances for use in training. So the low-variance 60-80 IP reliever should have a 5x training weight in comparison to the 200 IP starter. Also, he should have a 2-3x training weight in comparison to the fringe major leaguer getting close to 0 IP.
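The arithmetic behind those ratios is short enough to spell out. Weights go as the inverse of variance, and variance is the square of the standard deviation, so the relative weight is the squared ratio of the two standard deviations. The sketch below uses the 30 IP and 70 IP figures from above. The helper name `relative_weight` is made up, and the fringe pitcher's standard deviation is backed out from the 2-3x ratio, not read from the data.

```python
import math

def relative_weight(sd, reference_sd):
    """Inverse-variance weight of a case with std dev `sd`, relative to a reference case."""
    return (reference_sd / sd) ** 2

# Steady reliever (SD about 30 IP) versus full-time starter (SD about 70 IP).
reliever_vs_starter = relative_weight(sd=30.0, reference_sd=70.0)
print(round(reliever_vs_starter, 2))  # 5.44

# Working backwards: a 2-3x weight over the fringe guy implies his SD is
# roughly 30 * sqrt(2) to 30 * sqrt(3) IP.
fringe_sd_low = 30.0 * math.sqrt(2.0)
fringe_sd_high = 30.0 * math.sqrt(3.0)
print(round(fringe_sd_low, 1), round(fringe_sd_high, 1))  # 42.4 52.0
```

Only the ratios matter for training, so the weights can be rescaled to whatever range the training code finds convenient.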
To make these figures more credible, I'll need to include all pitcher seasons that I have data for, including those for pitchers having no prior major league experience, and even for those guys who missed an entire year, due to injury or ineffectiveness. I should have that soon. Unfortunately, updating my database to include more players is always a pain. Not because I don't have the data, but rather because matching data from several sources involves some manual work (like matching alternative spellings of names). Still, this should be ready soon.
I'm not sure this approach will buy me anything as far as an improved model. The idea of giving starting pitcher seasons less training weight seems unnatural. These are really the guys I care about predicting accurately, no? However, the idea of having a logical system for handling fringe, low-IP guys is very appealing to me. As is the idea of forcing my system to predict the performance of steady (IP-wise, anyway) relief pitchers more accurately. Missing the mark by 30 IP on CC Sabathia is one thing, but not predicting Mariano Rivera to throw very close to 70 IP is not so defensible.
Unfortunately, I always have more ideas than time to try them properly. But having a system by which I can train using all pitcher seasons that I have access to is very appealing, so this is going straight to the top of my TODO heap...