My writings about baseball, with a strong statistical & machine learning slant.

Monday, January 11, 2010

Breaking down IP: not so simple.

I am in Tokyo, and haven't been thinking much about baseball lately. However I did try breaking down the prediction of IP into IP_Start & IP_Relief as I wrote about earlier. Well, it didn't help much.

I wasn't surprised that predicting IP didn't magically become easier by separately predicting innings pitched as a starter and in relief. However, I was surprised that these separate projections didn't do much for predicting VORP (overall value in runs saved) either.

Still, I thought I'd explain a little better exactly why I think this idea made sense (and still might work, maybe with more data).

OK, so here is the distribution of IP for 2005-2009 pitchers who had at least 50 IP of experience in the three years prior. Don't ask about the cutoffs. I'm thinking about how to remove them without making things worse. The overlaid red line is the best-fit normal distribution (not that I would expect this data to be normally distributed):

As you can see, the distribution has two peaks, one at 61IP and another at 181IP. Actually, this is an Excel graph, and Excel lines up the x-axis for bar graphs strangely. Think of all those labels as being attached to the left-side hash mark on the x-axis. So the peaks are actually at 61IP-85IP and at 181IP-205IP. Both those numbers make sense. You would expect a lot of major league relievers to throw about 73IP, and a lot of starters to throw just under 200IP.
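To make the label issue concrete, here is a minimal sketch of binning IP totals so that each bar is named by its left edge. The bin width of 24 and the origin of 13 are my assumptions, chosen only so that 61 and 181 both land on bin edges, and the toy values are made up for illustration rather than taken from the real data.

```python
from collections import Counter

BIN_WIDTH = 24   # assumed from the 61IP-85IP range quoted above
BIN_ORIGIN = 13  # assumed lowest left edge, so that 61 and 181 are both edges


def bin_left_edge(ip):
    """Return the left edge of the half-open bin [edge, edge + BIN_WIDTH) holding ip."""
    return BIN_ORIGIN + BIN_WIDTH * int((ip - BIN_ORIGIN) // BIN_WIDTH)


def histogram(ip_values):
    """Count seasons per bin, keyed by the bin's left edge."""
    counts = Counter(bin_left_edge(ip) for ip in ip_values)
    return dict(sorted(counts.items()))


# Made-up IP totals (as decimals), not real pitcher seasons.
toy_ip = [62.0, 73.3, 84.0, 120.0, 190.0, 201.0]
toy_hist = histogram(toy_ip)
```

Read this way, a bar labelled 61 covers everything from 61 IP up to (but not including) 85 IP, which is why the midpoint of that bar sits around 73 IP.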

Hence the idea of breaking down IP into IP_Start & IP_Relief. A pitcher's total innings pitched are the sum of those two components. Maybe the separate distributions would be easier to predict (i.e. if they have simple shapes)?

However, here are the actual distributions:

Well, these distributions are not really any easier to describe. Both the start and relief innings pitched distributions have at least two peaks. The relief innings distribution might even have a third peak around 50IP. Given that some pitchers have bullpen roles (i.e. LOOGY or similar) that don't call for more than 60IP per year, this might be a true third peak. Then again, since we are now dealing with less data, the shape of the distribution is even harder to trust.
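One rough way to count peaks in a binned distribution is to look for local maxima in the bar heights. The sketch below is only that: the `min_count` threshold is a hypothetical knob for ignoring small bumps, and the bar heights are invented to show how a possible third peak appears or disappears depending on how much you trust a small bin.

```python
def find_peaks(counts, min_count=1):
    """Return indices of local maxima in a list of histogram bar heights.

    A bar is a peak if it is strictly higher than its left neighbour and at
    least as high as its right neighbour, so a flat top counts only once.
    Bars shorter than min_count are ignored.
    """
    peaks = []
    last = len(counts) - 1
    for i, height in enumerate(counts):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < last else -1
        if height >= min_count and height > left and height >= right:
            peaks.append(i)
    return peaks


# Invented bar heights with two tall bars and one small bump in between.
toy_counts = [2, 9, 4, 5, 4, 8, 1]
all_peaks = find_peaks(toy_counts)                 # includes the small bump
tall_peaks = find_peaks(toy_counts, min_count=6)   # keeps only the tall bars
```

With few seasons per bin, whether that middle bump counts as a peak depends entirely on the threshold, which is the same reason I don't trust the shape much.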

Which brings me to answer the obvious question: since pitchers often have defined roles, why don't I train separate models for those roles? Unfortunately, I don't think this would be so simple. Even if the previous years' roles were easily known (they are not), pitcher roles change often for the non-stars. Also, training models on 2,000 pitcher seasons is hard enough, but training models on 100 pitcher seasons doesn't even make sense.

Without more training data, I just don't think it will be possible to build a better model for IP by breaking down the distribution into components. I could probably squeeze in a bit more training data. However older stats are less useful for predicting future pitcher seasons, and injury data is not as good for the older years (the number of DL listings has grown significantly in recent years).

Conceptually, I wonder if IP (or perhaps just IP_Relief) can be broken down into two or three components that can be measured or estimated? I'm thinking something like:
  • Time (games) available
  • Usage per game
This way, it would be possible to distinguish a pitcher who threw 40IP in relief because he missed half the season to injuries, suspension, or minor league usage from one who was available in the bullpen every day but was seldom used by his manager when available. This approach might also improve my predictions for rookies and other pitchers who have little major league experience. If the projections are too low (or too high), it will be easy to see why: availability estimates are too low (or too high).
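Here is a minimal sketch of that two-component idea, treating relief IP as games available times usage per available game. The class name, the field names, and the 81-game and 162-game availability figures are all hypothetical, and "available" is assumed to mean on the active roster and healthy, which is exactly the part I don't yet know how to measure.

```python
from dataclasses import dataclass


@dataclass
class ReliefSeason:
    games_available: int  # team games the pitcher could have appeared in (assumed definition)
    ip: float             # relief innings pitched, as a decimal

    @property
    def ip_per_available_game(self):
        """Usage rate: innings thrown per game the pitcher was available."""
        return self.ip / self.games_available


def project_ip(games_available, ip_per_available_game):
    """Projected IP as the product of the two components."""
    return games_available * ip_per_available_game


# Two hypothetical 40IP relief seasons that look identical in the IP column.
missed_half = ReliefSeason(games_available=81, ip=40.0)
seldom_used = ReliefSeason(games_available=162, ip=40.0)

# Same usage rates projected over a full 162-game season of availability.
missed_half_full = project_ip(162, missed_half.ip_per_available_game)
seldom_used_full = project_ip(162, seldom_used.ip_per_available_game)
```

Under these made-up numbers, the pitcher who missed half the season projects to roughly twice the innings of the seldom-used one, even though both threw 40IP.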

In contrast to IP, the VORP distribution is simpler to describe:

As the red curve shows, the distribution is not normal, but it has just one peak. This is not to say that predicting VORP is easier than IP. Rather, it is much harder! Predicting value with a very high correlation is probably impossible. There are just too many unknowable factors involved.

There is ultimately an absolute cap on how well an algorithm can predict future season IP and VORP. Unfortunately it is impossible to know what that cap might be. This is discouraging, since I could be, for all I know, striving to make improvements that won't really matter. Currently, I can predict VORP with a little more than a 0.5 correlation, compared with real values. I don't know how much of the other 0.5 correlation is completely random. Then again, I just can't believe that my crude methods are the best that can be done without cheating. Not yet.
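For reference, here is a small sketch of how a correlation like this can be computed and read, assuming it is the ordinary Pearson correlation between predicted and actual values. The function names and the toy numbers are mine, purely for illustration. One caveat it makes visible: squaring the correlation gives the share of variance a linear fit accounts for, so a correlation of 0.5 corresponds to 0.25, not to half of the variation.

```python
import math


def pearson_r(predicted, actual):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


def variance_explained(r):
    """Share of variance accounted for by a linear fit with correlation r."""
    return r * r


# Toy numbers, not real VORP projections.
toy_predicted = [0.0, 1.0, 2.0, 3.0]
toy_actual = [0.0, 2.0, 1.0, 3.0]
toy_r = pearson_r(toy_predicted, toy_actual)

share_at_half = variance_explained(0.5)
```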
