My writings about baseball, with a strong statistical & machine learning slant.

Tuesday, March 30, 2010

Elbows, Shoulders & Surgeries (seven mini-models by injury type)

As I mentioned in my last post, I created seven mini-models to look at how different types of injuries affect IP predictions. I thought the individual mini-models were interesting, and also very readable. Let me show what I'm doing with an example.

(A full text of the seven mini-models is here.)

Elbow Problems

Suppose I want to see how elbow-related injuries should affect my IP projection. I train an M5Rules linear model in WEKA on my non-injury IP projection and all elbow-related features. I won't go over the M5Rules model again. However, it's important to note that it will not use all of the features that I provide. Instead, some feature selection takes place, so features that do not provide information gain for the model are excluded. Here's what's left:

IP_2009 = 
-22.0378 * inj_tj_surgery_2008 
- 34.7761 * inj_tj_recovery_camp_2008 
+ 39.0062 * inj_tj_surgery_average 
+ 31.5039 * inj_elbow_surg_not_tj_average 
+ 50.288 * inj_elbow_strain_average 
+ 0.9455 * proj_IP_v27_w_value 
+ 6.794

The projection consists of:
  • a linear constant
  • 0.9x * simple IP model
  • weighted sum of several elbow-related features

Hopefully most of the feature names are self-explanatory, though I will explain the interesting ones. A feature that ends in "_2008" relates to the year previous to the one we want to predict (so it would actually be 2007 stats for 2008 predictions, etc). A feature that ends in "_average" is a linear average of the three years prior to "_2008." So "_2008" features represent the near past, whereas the "_average" features represent the older background for the player.

To compute the "elbow related factor" used as part of the IP predictions, I just use the weighted sums of the elbow features above, ignoring the constant term and the "0.9x * proj_IP" term.

Therefore, the pitcher will increase or decrease his IP projection due to past elbow injuries depending on the sign and magnitude of this term:

- 22.0378 * inj_tj_surgery_2008
- 34.7761 * inj_tj_recovery_camp_2008
+ 39.0062 * inj_tj_surgery_average 
+ 31.5039 * inj_elbow_surg_not_tj_average 
+ 50.288 * inj_elbow_strain_average

In words, the player gets negative IP for:
  • Tommy John surgery in the past year
  • "Tommy John recovery" DL listing in spring training of this year

However, the player gets positive IP for:
  • Tommy John surgery in the past (but not last year)
  • non-Tommy John elbow surgery in the past
  • elbow strain DL stints in the past

To any fan who knows a little bit about pitchers' recovery from elbow surgery, none of these features should be surprising. So at least the elbow model passes the sniff test.
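To make the elbow factor concrete, here is a minimal Python sketch of the weighted sum, using the coefficients from the elbow mini-model. The function name and the two example pitchers are hypothetical, made up for illustration; this is not the code I actually run.

```python
# Coefficients copied from the elbow mini-model (constant and proj_IP terms dropped).
ELBOW_WEIGHTS = {
    "inj_tj_surgery_2008": -22.0378,
    "inj_tj_recovery_camp_2008": -34.7761,
    "inj_tj_surgery_average": 39.0062,
    "inj_elbow_surg_not_tj_average": 31.5039,
    "inj_elbow_strain_average": 50.288,
}


def injury_factor(weights, features):
    """Weighted sum of injury features; features not supplied count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


# Hypothetical pitcher: Tommy John surgery last year, opens this year on the
# DL with a "Tommy John recovery" listing.
recent_tj = {"inj_tj_surgery_2008": 1, "inj_tj_recovery_camp_2008": 1}

# Hypothetical pitcher: one elbow strain DL stint in the three older seasons.
old_strain = {"inj_elbow_strain_average": 1 / 3}

print(round(injury_factor(ELBOW_WEIGHTS, recent_tj), 2))
print(round(injury_factor(ELBOW_WEIGHTS, old_strain), 2))
```

The first pitcher's projection drops by roughly 57 IP, while the second gains roughly 17 IP.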

To those interested in the magnitude of the feature weights, I should point out that all yearly feature values are binary {0,1}, thus making all "_average" values {0, 1/3, 2/3, 1}. Since the incidence of a particular injury is rare, "_average" values can be thought of as binary values, but at 1/3 of the weight.
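As a rough illustration of how an "_average" value falls out of three yearly binary flags, here is a short Python sketch. The helper name and the example flags are hypothetical; only the 39.0062 weight comes from the elbow model.

```python
def average_feature(yearly_flags):
    """Mean of the 0/1 yearly flags for the three seasons before the "_2008" season."""
    if len(yearly_flags) != 3:
        raise ValueError("expected exactly three yearly flags")
    if any(flag not in (0, 1) for flag in yearly_flags):
        raise ValueError("yearly flags must be 0 or 1")
    return sum(yearly_flags) / 3


# Hypothetical pitcher: Tommy John surgery in exactly one of the three older seasons.
tj_flags = [0, 1, 0]
tj_average = average_feature(tj_flags)

# Weight for inj_tj_surgery_average taken from the elbow mini-model.
contribution = 39.0062 * tj_average
print(round(contribution, 2))
```

So a single past Tommy John surgery is worth about 13 IP here, one third of the listed weight.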

Creaking Shoulders

The features for shoulder injuries are a little different from the elbow features:

- 43.2967 * inj_labrum_surgery_2008 
- 8.357 * inj_shoulder_tendonitis_2008 
- 9.3244 * inj_shoulder_inflam_2008 
- 19.828 * inj_shoulder_strain_average 
- 25.3431 * inj_shoulder_inflam_average 
+ 12.2862 * inj_shoulder_average 
- 54.3987 * inj_labrum_surgery_recovery_average 

Histories of shoulder strains, shoulder inflammations and especially of shoulder labrum surgeries are predictors of downward IP projections well into the future. Whereas the model projects a strong rebound for Tommy John survivors, a history of shoulder trouble is not a good sign for a pitcher holding up long-term, even if he stays off the DL for a while. Once a pitcher has major shoulder problems, they tend to bother him for the rest of his career.

Stiff Forearms

I trained a separate model for arm injuries not involving the elbow or the shoulder:

- 16.8042 * inj_forearm_2008 
+ 25.4747 * inj_forearm_average 
+ 27.0505 * inj_upper_arm_average 

Although I remember reading somewhere that forearm soreness is often a precursor to elbow problems, this injury seems to follow a simple pattern: recent injury is bad, but a history with no re-injury means projections can be raised.

General Surgery

Here is a look at surgeries, but from a more general sense:

+ 30.7839 * inj_recovery_2008 
- 14.106 * inj_shoulder_surgery_2008 
- 72.0142 * inj_surgery_camp_2008 
- 49.5069 * inj_recovery_camp_2008 

If the player was DL'ed last year with "recovery from surgery," then he should expect to play more in the upcoming year. Unsurprisingly, having surgery in spring training is not a precursor of playing a lot. Nor is going on the DL in spring training as "recovery from surgery" a good sign.

Somewhat surprisingly, recent shoulder surgery garners only about 1/3 of the penalty for "shoulder labrum surgery" (-14.1 versus -43.3). I guess there are simpler shoulder procedures that pitchers undergo, from which the recovery time is much faster than for SLAP, or any of the other labrum surgeries. This is not to say that all non-labrum shoulder surgeries are no big deal, but there is a huge difference between Mariano Rivera's 2008 surgery to remove "calcification" in the shoulder, and Brandon Webb's surgery last year to repair a "fraying labrum."
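The "1/3" figure is just the ratio of the two weights. A quick check in Python, keeping in mind that the two numbers come from separately trained mini-models, so the comparison is only approximate:

```python
# Weights copied from the shoulder and general surgery mini-models.
labrum_surgery_2008 = -43.2967
shoulder_surgery_2008 = -14.106

ratio = shoulder_surgery_2008 / labrum_surgery_2008
print(round(ratio, 3))
```

The ratio comes out a little under one third.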

Having said that, neither I nor WEKA knows much about the anatomy of shoulder injuries. I don't really understand where the rotator cuff ends and where the labrum begins. I know what a labrum is and I've known several friends to tear their labrums playing football & rugby, but I don't really know how those get repaired for pitchers. If you are reading this and know more about labrums than I do, please let me know!

Missed Seasons

I'm not sure why I combined "offseason surgery," "missed season due to injury [over 149 days on the DL]," and "season ending DL [30+ days through October]" together, but here it is:

- 16.5312 * inj_season_ending_2008 
- 18.7897 * inj_offseason_2008 
+ 31.5928 * inj_offseason_average 
+ 18.9852 * inj_missed_season_average 

Ending a season on the DL or having offseason surgery are not good signs, but these are not as strongly negative as one might have expected. Past offseason surgery corresponds to gains in IP (I won't speculate why). Spending entire seasons on the DL in the past is positive for IP projection for obvious reasons.
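For readers wondering how the bracketed thresholds could turn into yearly flags, here is one possible reading in Python. Everything about the record layout is an assumption: the `DLStint` class, its fields, and the flag names are hypothetical, and I am guessing that "over 149 days" refers to total DL days in a season and that "through October" means the stint ran to the end of the season.

```python
from dataclasses import dataclass


@dataclass
class DLStint:
    """Hypothetical record for one DL stint within a single season."""
    days: int
    through_season_end: bool  # assumed meaning of "through October"


def season_flags(stints):
    """Derive 0/1 yearly flags from a season's DL stints (assumed definitions)."""
    total_days = sum(s.days for s in stints)
    return {
        # "missed season due to injury [over 149 days on the DL]"
        "inj_missed_season": int(total_days > 149),
        # "season ending DL [30+ days through October]"
        "inj_season_ending": int(
            any(s.days >= 30 and s.through_season_end for s in stints)
        ),
    }


print(season_flags([DLStint(days=160, through_season_end=True)]))
print(season_flags([DLStint(days=20, through_season_end=False)]))
```

Under these assumptions, a 160-day stint that runs to the end of the year sets both flags, while a 20-day stint in midseason sets neither.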

Spring Cheating

As I wrote in my last post, my system takes account of DL transactions from spring training. This is the time of year when teams often tip their hand on a pitcher's health. Getting placed on the DL to start the season is never a good thing:

+ 8.2829 * inj_camp_2008 
- 48.6768 * inj_dl_camp_2008 
+ 21.7274 * inj_camp_average 
+ 42.948 * inj_surgery_camp_average 
+ 54.6509 * inj_recovery_camp_average

According to the model, showing up on the injury report in camp, but not on the DL, is not a bad thing. However, opening the season on the DL is very bad for that year's playing time. Being at least a full season removed from surgery and/or subsequent recovery (but not currently back on the DL) is good.

Days and Weeks

Finally, let's look at the features derived simply from the occurrence of DL stints (and days on the DL), irrespective of the injury description or timing:

- 7.5115 * inj_15DL_2008 
+ 0.3146 * inj_days_DL_average 
- 33.9261 * inj_anyDL_average 
+ 24.378 * inj_something_average 
- 23.5081 * inj_long_DL_average

Being on the DL last year is not great, but not a large downward factor, either. The "_average" features are confusing, but they mainly cancel each other out.

Here "long_DL" means any DL stint of over 60 days. Most surprisingly, there is no negative feature for "long_DL" in the past year. The seven mini-models are trained independently, so this model does not take account of the downward adjustments that we make for specific major injuries, surgeries, recoveries, etc.
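To see how the "_average" features mostly cancel, here is a small Python sketch with two made-up DL histories. It rests on several assumptions that are not spelled out in the model itself: each season has at most one DL stint, inj_days_DL_average is the mean number of DL days per season, and inj_something_average fires whenever the pitcher had any DL time that season.

```python
# Coefficients copied from the "Days and Weeks" mini-model.
DL_WEIGHTS = {
    "inj_15DL_2008": -7.5115,
    "inj_days_DL_average": 0.3146,
    "inj_anyDL_average": -33.9261,
    "inj_something_average": 24.378,
    "inj_long_DL_average": -23.5081,
}


def dl_history_features(days_per_year):
    """Build "_average" features from DL days in each of the three older seasons.

    Assumes one stint per season, so "long" means more than 60 days that season.
    """
    n = len(days_per_year)
    any_dl = sum(d > 0 for d in days_per_year) / n
    return {
        "inj_days_DL_average": sum(days_per_year) / n,
        "inj_anyDL_average": any_dl,
        "inj_something_average": any_dl,  # assumed to coincide with any DL time
        "inj_long_DL_average": sum(d > 60 for d in days_per_year) / n,
    }


def dl_factor(features):
    """Weighted sum of the DL features; features not supplied count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in DL_WEIGHTS.items())


# Hypothetical histories: one 45-day stint, or one 90-day stint, two seasons back.
short_stint = dl_factor(dl_history_features([0, 45, 0]))
long_stint = dl_factor(dl_history_features([0, 90, 0]))
print(round(short_stint, 2), round(long_stint, 2))
```

Under these assumptions the two histories net out to about +1.5 and -1.6 IP, which is tiny next to the individual weights.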

If we know that a player missed lots of time in the past year due to injury, but we don't know what his injury was, whether he missed the whole season, or whether he had surgery, then we know very little about how to adjust his future IP projection.

I started out trying to predict future performance by looking at past DL time, but now I see why this would never work well. To predict anything meaningful from injuries, the type of injury is important, and the timing is important. Compared to those factors, time missed is not very important.

I also tried to train mini-models on several other injury types, but didn't have useful results. Most injuries are too rare to be selected by feature selection algorithms as being sufficiently meaningful (back surgery, for example, is not very common among pitchers). Other injuries are somewhat common, but don't seem to have much predictive value (hamstring strains, oblique strains, and the flu).

Hopefully you'll agree that my mini-models are readable, interesting, and pass the sniff test. I'll be happy to hear suggestions for further improvements. I'm glad this thing worked, but I'm sure it's possible to build a better model!
