My writings about baseball, with a strong statistical & machine learning slant.

Tuesday, March 30, 2010

IP projection adjustments (with injury data)

After several failed efforts over the past six months, I finally managed to substantially improve a prediction system by using rich injury data. With data from http://www.baseballinjurytool.com/, I built an additional layer on top of my IP projection system (described in earlier posts) that substantially improves results in retrospective analysis.

To demonstrate what I'm doing, I'll show how the injuries data changes my system's projections for some 2009 pitchers' IP (based on 2004-2008 data). But first, let me quickly explain what I'm doing.

The data from Corey Dawkins' DL tool includes information about all major league players' in-season DL stints, as well as "Camp" (ie spring training) injuries, "Offseason" injuries, and descriptive information about all surgeries. Unfortunately, the data does not include this same information for minor league players, or for players without teams (ie Ben Sheets in 2009). However, the data is otherwise very thorough, and complete in the vast majority of cases.

In short, the data is very good. Having said that, getting predictive value from injury data is very difficult. A few months ago, I built models projecting a player's likely DL days during the next season, as well as his likelihood of suffering an elbow injury, a shoulder injury, a surgery, etc. These models gave me predictions with non-trivial correlation to reality, but they were not helpful for improving my predictive models for IP, VORP, or ERA. Therefore, I have re-defined my earlier goals for injury data.

My goal is simply to improve my existing IP projection system using features derived from the DL information that I have available. Intuitively, we know that some injury histories should make us want to downgrade a pitcher's playing time projection. But if my ML system can't find an IP prediction improvement in using that feature, then that feature does not make its way into the model. I will get back to this idea when I talk about my Randy Johnson projection for 2009.

My System

Looking at the injuries listed, I came up with ~60 features that I thought might be useful in a predictive model. Features might be something like "did pitcher X have an elbow injury" or "did pitcher X have Tommy John surgery" or "did pitcher X spend over 60 days on the DL." Features are generated per pitcher per season, and they range from the general to the very specific.
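As a rough illustration of what "features per pitcher per season" could look like, here is a minimal Python sketch. The `Stint` record and its field names are my own assumptions for the example, not the actual layout of the DL tool's data, and the three features shown are just the ones quoted above.

```python
from dataclasses import dataclass

@dataclass
class Stint:
    """One injury listing. Field names are hypothetical, not the DL tool's schema."""
    pitcher: str
    season: int
    body_part: str   # e.g. "elbow", "shoulder"
    surgery: str     # "" if no surgery, else e.g. "Tommy John"
    dl_days: int

def season_features(stints, pitcher, season):
    """Build a few binary features for one pitcher in one season."""
    mine = [s for s in stints if s.pitcher == pitcher and s.season == season]
    return {
        "elbow_injury": int(any(s.body_part == "elbow" for s in mine)),
        "tommy_john": int(any(s.surgery == "Tommy John" for s in mine)),
        # Assumes "over 60 days" means total DL days across the season.
        "dl_over_60": int(sum(s.dl_days for s in mine) > 60),
    }

# Made-up listings for a made-up pitcher.
example_stints = [
    Stint("Pitcher X", 2008, "elbow", "Tommy John", 45),
    Stint("Pitcher X", 2008, "shoulder", "", 20),
]
example_features = season_features(example_stints, "Pitcher X", 2008)
```

A pitcher with no listings in a given season simply gets zeros for every feature.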

As with my previous IP prediction system, I build a model to predict a pitcher's stats using his last four years' data. So for 2009 predictions, I'm looking at 2005, 2006, 2007 and 2008 data, as well as some averages. To simplify injury-based IP projections, I only considered "2008" and "average" (ie 2005-2007 arithmetic average) features for injuries.

If I'm looking at "shoulder surgeries," I will only look at whether the pitcher had a shoulder surgery in the year before April 1, 2009, and at how many of the years 2005-2007 included at least one shoulder surgery. In other words, I am asking "did he just come off of shoulder surgery?" and "does he have a history of shoulder surgeries in the past?"
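The "recent" versus "history" split can be sketched in a few lines of Python. The function name and the dictionary input are hypothetical, and I return a count of history years here; dividing by three would give the "average" form of the feature.

```python
def recent_and_history(flags_by_season, target_season, history_span=3):
    """Split per-season injury flags into a 'recent' flag and a 'history' count.

    flags_by_season maps a season (int) to True if the pitcher had the injury
    that season. For a 2009 projection, 'recent' is 2008 and 'history' covers
    2005-2007.
    """
    last = target_season - 1
    recent = int(flags_by_season.get(last, False))
    history_years = range(last - history_span, last)  # 2005, 2006, 2007 for 2009
    history = sum(int(flags_by_season.get(y, False)) for y in history_years)
    return recent, history

# Hypothetical pitcher with shoulder surgeries in 2006 and 2007, none in 2008.
shoulder_flags = {2006: True, 2007: True}
recent, history = recent_and_history(shoulder_flags, 2009)
```

For this made-up pitcher the answer to "did he just come off of shoulder surgery?" is no, while the history count is two.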

The devil's in the details, but I think this approach makes logical sense, and it makes for models that are easy to understand.

One interesting aspect of looking at an injury type in the short run and in the longer run is that having a history of a particular injury could raise a pitcher's IP projection. A good example is Tommy John surgery. If pitcher X has TJ surgery in 2007, he can expect to miss most of the 2008 season (which will usually be listed as "recovery from TJ surgery" in the 2008 DL listings). Such a history should make us want to increase his 2009 IP projections from what they otherwise would be. Maybe he didn't throw many innings in 2008, but he had a good excuse, and as long as he does not undergo more surgery in 2008, he should be good to go in 2009.

Asked to evaluate the TJ-related features, my system gives a value for "how will an average pitcher's IP be affected by these injury features?" This is limiting, since the projected IP gain or loss should change depending on whether we are talking about AJ Burnett, Billy Wagner, or Eddie Guardado. But for now, my model ignores this issue. I am simply trying to find which features have power to predict IP (on top of a good system that doesn't use injury data).

I built separate mini-models for predicting IP changes due to:

  • elbow injuries
  • shoulder injuries
  • forearm & upper arm injuries
  • offseason injuries & surgeries
  • injuries sustained in camp (spring training)
  • other surgeries (not including features already covered above)
  • DL listings and # of days on DL
These models are interesting in their own right. I will list the features that they use and explain them in a future post.

Now I have seven possible causes to increase or decrease a pitcher's IP projection. I sum the values to get a single "injury adjustment feature."
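In code, the summing step might look something like the sketch below. The mini-model keys are labels I made up for the seven categories listed above, and I assume each mini-model outputs an adjustment in innings, with a missing entry meaning "no adjustment."

```python
# Hypothetical keys for the seven mini-models listed above.
MINI_MODELS = (
    "elbow",
    "shoulder",
    "forearm_upper_arm",
    "offseason",
    "camp",
    "other_surgery",
    "dl_days",
)

def injury_sum(adjustments):
    """Add up the per-category IP adjustments into one injury feature.

    adjustments maps a mini-model name to its projected IP change
    (positive or negative). Missing categories count as 0.0.
    """
    return sum(adjustments.get(name, 0.0) for name in MINI_MODELS)

# Made-up adjustments for illustration only.
example_sum = injury_sum({"elbow": -20.0, "camp": 3.0, "dl_days": -5.5})
```

A plain sum treats the seven adjustments as independent, which keeps things simple but ignores any interaction between injury types.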

Now I can train a final model with:
  • my non-injury based IP projection
  • my "injury adjustment feature"
  • whether the pitcher still qualifies as a rookie
The latter is necessary since I do not have injury data for pitchers while they were not on an MLB roster, nor do I use minor league stats in my projections. Therefore it is often useful to project rookies differently from veterans.

Using the three features above, I get the following simple model:

IF proj_IP <= 81.746
THEN
    IP_2009 = 8.1498 * ROOKIE_2009
            + 0.2684 * injury_sum
            + 0.9539 * proj_IP
            + 5.989
ELSE
    IP_2009 = 0.783 * injury_sum
            + 0.7377 * proj_IP
            + 41.7999


This method is imperfect, but at least my system is able to recognize that injury-based IP adjustments need to be much larger for starters than for relievers. Naturally, I don't allow IP projections below 0.0. Also, the model above only applies to pitchers with an injury history (ie at least one injury listing in the past four years). Pitchers with no injury history simply get the "proj_IP" value with no changes.
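Putting the model and the two rules above together, a Python version could look like this. The coefficients are the ones from the model; the function and argument names are mine, and I assume ROOKIE_2009 is a 0/1 flag.

```python
def adjusted_ip(proj_ip, injury_sum, rookie, has_injury_history):
    """Apply the two-branch model, with a floor at 0.0 IP.

    proj_ip: the non-injury IP projection
    injury_sum: the summed injury adjustment feature
    rookie: 1 if the pitcher still qualifies as a rookie, else 0 (assumed coding)
    has_injury_history: True if at least one injury listing in the past four years
    """
    if not has_injury_history:
        # No injury history: keep the original projection unchanged.
        return proj_ip
    if proj_ip <= 81.746:
        ip = 8.1498 * rookie + 0.2684 * injury_sum + 0.9539 * proj_ip + 5.989
    else:
        ip = 0.783 * injury_sum + 0.7377 * proj_ip + 41.7999
    return max(ip, 0.0)

# Hypothetical inputs, just to exercise both branches.
starter = adjusted_ip(150.0, -30.0, 0, True)
reliever = adjusted_ip(60.0, -30.0, 0, True)
```

With the same injury_sum, the high-IP branch moves the projection almost three times as far (0.783 per point versus 0.2684), which is the starter/reliever difference mentioned above.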

Ok, now that I've explained what I did, let me show the biggest hits and misses for 2009 IP predictions using injury data. The best and worst 20 changes can be seen here, and I will break down a few cases below.

I coulda told you that...

The injury-using system makes major gains by predicting much lower IP in 2009 for:

  • Jake Westbrook, Shaun Marcum, Dustin McGowan
  • Tim Hudson, Ben Sheets, Jeremy Bonderman, Jeff Francis
By the start of the 2009 season, all of these guys were coming off of major injuries and were DL'ed by their teams to open the season. My system was able to take that into account and dramatically slash their projected IP. Am I cheating? I'll discuss that in a second. Fact is, if you generate injury data saying that pitcher X is going to open the season on the DL with "recovery from Tommy John surgery," my system will take that into account. I'm not aware of other projection systems that take that into account automatically.

Making an expected recovery

The injury-based system beats the simple system by significantly boosting IP projections for:
  • Josh Johnson, Chad Gaudin, Jorge De La Rosa
  • Josh Beckett, Braden Looper, Francisco Liriano
These guys had severe injuries in the past, but were relatively healthy heading into 2009. Therefore, the system substantially increased their IP projections. In these cases, the increased IP projections look good, but that doesn't always work out.

Bad interaction

I've tried to keep the system simple, to make it (relatively) easy to understand, and also to reduce over-fitting (more on that below). However this means the system will occasionally output predictions that are clearly flawed. Here are a couple of cases.

My simple system projected Randy Johnson for 128.5 IP in 2009. However, the injury-based model boosted him up to a projected 208.0 IP, one of the highest projections in baseball for 2009. Johnson has a history of back injuries, including back surgeries in 2006 and 2007. However, he did not have any surgeries in 2008. My system does not use specific features for "back surgery" or "herniated disc surgery," but it does use general features for "surgeries." Therefore, Johnson's back surgery gets lumped in with all other surgeries, including stitches, appendix removals and scoped knees. Tommy John surgery and shoulder labrum surgery get their own specific features, but back surgeries do not. And yet, we know that back injuries tend to be chronic. Indeed, Randy Johnson only threw 96.0 IP in 2009, and spent 71 days on the DL with back problems. Should I have trained a mini-model for back injuries? Maybe I should have.

In a totally different kind of failure, my system used injury data to increase Sidney Ponson's IP projection for 2009 from 84.8 IP to 128.1 IP. He only threw 58.7 IP. He also happened to spend 49 days on the DL, but I doubt that 128.1 IP is a reasonable projection for a healthy Sidney Ponson. At 32, Sir Sidney is a journeyman pitcher. His projections should remain low, not only because of injury risk, but also because there is a good chance that he won't stick on a major league roster. My simple model looks at a pitcher's effectiveness (VORP, SNWP, etc) in making IP predictions, but the injury-based adjustments are made without looking at value. Sometimes, this causes unrealistic projections.

Am I cheating?

With my methods, some over-fitting is unavoidable. There is not enough recent injury data to have large separate training and testing sets. All I can do is to keep over-fitting to a minimum. I create features that make logical sense and I avoid lumping unrelated categories together ("surgery" or "elbow surgery" or "Tommy John surgery," rather than "complicated surgery (including TJ, shoulder labrum and herniated disc in the lower back)"). Also, my methods exclude features from training that don't provide a significant amount of information gain. I'll write more about what actual features I include in a future post, but my entire injury-based adjustment only uses about 25 features total (out of a possible 120 or so candidates).

As I alluded to earlier, I use injury listings from spring training to predict that year's injury-based adjustments. I consider everything listed before April 1, 2009 as "2008 data." If a player is recovering from Tommy John surgery, the team will typically place him on the DL before the season starts. I use that information to predict that he will likely not pitch that year.
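The April 1 cutoff is easy to express in code. This is a small sketch with a hypothetical function name, assuming each listing carries a calendar date and that the same April 1 boundary applies every year.

```python
from datetime import date

def data_season(listing_date):
    """Map an injury listing date to the season whose data it counts toward.

    Anything listed before April 1 of year Y counts as season Y-1 data,
    so a March 2009 DL listing is treated as "2008 data".
    """
    cutoff = date(listing_date.year, 4, 1)
    if listing_date < cutoff:
        return listing_date.year - 1
    return listing_date.year

# A spring training listing in March 2009 feeds the 2009 projection as 2008 data.
spring_listing_season = data_season(date(2009, 3, 20))
```

Under this rule, a listing dated April 1 or later would belong to the new season and would not be used for that year's projection.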

One can argue that my system will output inaccurate IP estimates before the last week of March, but that isn't quite so. The system will output good estimates in January, too, but they will change if more information is added. Also, if you are looking at the projection of a pitcher, and you know that he will be DL'ed to start the season, it's ok to just put that future expected DL listing into the input file. I think this is a better approach than manually editing projections for known major injury cases, or outputting incorrect projections for pitchers that we know will likely miss the season.

Is it worth it?

My injury-based adjustments do not affect the 2/3 of pitchers who do not have an injury history (in the past four years) at the MLB level. However, the remaining 1/3 of pitchers (including the vast majority of veterans) get significantly better IP projections if we take injury data into account.


                     IP_proj    IP_proj (with injuries)
Average error        33.4       30.1
RMS error            46.0       41.6
r^2 (with actual)    0.49       0.58
% better             50.1%      49.9%


Although only about 50% of cases are improved, the average improvement is greater than the average mistake. I am improving r^2 by roughly 0.09 (from 0.49 to 0.58) for a system that already compares very favorably with PECOTA and CHONE for IP projection.
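For anyone who wants to run the same kind of comparison on their own projections, here is a minimal sketch of the four measures. I am assuming "average error" means mean absolute error and that r^2 is the squared Pearson correlation between projected and actual IP; the function names are made up.

```python
import math

def evaluate(pred, actual):
    """Return (average absolute error, RMS error, r^2) for one projection set."""
    n = len(actual)
    errors = [abs(p - a) for p, a in zip(pred, actual)]
    avg_error = sum(errors) / n
    rms_error = math.sqrt(sum(e * e for e in errors) / n)
    mean_p, mean_a = sum(pred) / n, sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(pred, actual))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    r_squared = cov * cov / (var_p * var_a)  # squared Pearson correlation (assumed)
    return avg_error, rms_error, r_squared

def share_better(pred_a, pred_b, actual):
    """Fraction of pitchers for whom pred_a is strictly closer to actual than pred_b."""
    wins = sum(abs(a - y) < abs(b - y) for a, b, y in zip(pred_a, pred_b, actual))
    return wins / len(actual)

# Toy numbers, not real projections.
toy_metrics = evaluate([100.0, 50.0, 150.0], [110.0, 40.0, 150.0])
toy_share = share_better([100.0, 50.0], [90.0, 60.0], [100.0, 70.0])
```

A system can win only half of the head-to-head cases and still come out ahead on average error and RMS error, as long as its wins are bigger than its losses.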

2010 projections

I will have new 2010 IP projections up soon. Since these take account of the 2010 DL listings that take place during March, I want to make sure I have the most up to date DL listings first. But I will have something here very soon, in any case.

