My writings about baseball, with a strong statistical & machine learning slant.

Wednesday, March 31, 2010

Final(!) 2010 IP projections (with injury data)

Unhappy with my incomplete updates for guys starting the year on DL, I got a much more thorough list. From Baseball Prospectus, I got a list of (probable) opening day rosters with DL. I checked all injuries on & put in approximate surgery dates where appropriate:

New projections are down by 1.1 IP per pitcher, which is good! These should be my final projections. Most are improved, with the obvious DL guys updated, but no major pitchers are affected. The exception is Cliff Lee, who's down to a 150 IP projection, since he will likely start the season on the DL. The number sounds low, but starting the year on the DL is never a good thing. Here is the complete list, sorted by projected 2010 IP:

Hoping to have ERA/FIP/VORP projections here soon. Maybe not of the same detail as my IP stuff, but at least something decent before the season starts...

Can't wait for opening day!

Tuesday, March 30, 2010

2010 IP projections (with injury data)

I'm not very patient, so instead of making sure that I had this year's spring training DL listings up to date, I updated information for some prominent cases (ie Joe Nathan), and made sure I covered all the guys listed in these FanGraphs articles. Thank god that fantasy baseball fans care so much about DL possibilities for lots and lots of pitchers :-)

My updated 2010 IP projections are here, sorted by +- from my previous projections when I take injury histories into account. I call this adjustment "inj_gain" in the spreadsheet.

As the name indicates, more pitchers gain IP than lose IP through these adjustments. In fact, the average pitcher has his IP rise from 59.8 innings to 65.1 innings. This is not actually gonna happen, but is rather a reflection of the unusually high number of players coming off of injuries in 2009. The average pitcher who was on a major league roster in 2009 spent 21.9 days on the DL. The 2005-2009 average was just over 15 days, so the trend of more pitcher injuries is going strong. Since I trained my projections on 2005-2009 data, it's not surprising that 2010 projected gains are too high. Also I'm missing some guys who will open the season on the DL, as I mentioned already. I probably got all the big names right, but pitchers who are not on the fantasy radar might be missing camp-related DL deductions.

Looking at the top of my list, here are the pitchers who stand to gain the most from my injury-based adjustments:

2009 IP | 2010 proj | w/ injuries | DL days 2009
Carl Pavano, Tim Hudson, Chris Carpenter, Francisco Liriano, John Lackey, Brandon McCarthy

The projected returns of Carl Pavano, Tim Hudson and Francisco Liriano are all hard to argue with. However, the adjustments for Chris Carpenter and John Lackey seem a bit excessive. Increasing their IP projections makes sense, but their non-injury projections were already pretty high, and thus did not need to go up by 60 IP. Then again, my system does not have many other examples of pitchers who missed a month of the season, and still threw many innings and were very effective. As I wrote before, my system does not take account of value (ERA, VORP, etc) when computing the injury adjustments, so it has no way of knowing that these guys are already projected to be pretty good in 2010. My hunch is that those two should project around 200 IP each.

I hadn't heard of Brandon McCarthy until I ran these numbers, but there is a nice article about him on FanGraphs. He's a young starting pitcher with a history of injuries and of giving up home runs, but is now finally healthy (and still pretty young). He's not listed on the Rangers' official depth chart, so he might start the season in AAA, but there is a good chance that he'll end up in the Rangers' rotation sometime this season. Given the context, his new 143.1 IP projection is too high, but it would have been a reasonable projection if the Rangers had a hole in their starting staff.

Most of the large decreases in IP projection are for the pitchers mentioned in those FanGraphs articles above. Most of them will start their seasons on the DL, which leads to large reductions in projected IP. Unfortunately, Joe Nathan's projection goes down only to 33.7 IP, rather than to the 0 IP that will actually happen. On the other hand, the 0 IP projection for Chien-Ming Wang, the 28.3 IP projection for Brandon Webb, and the 7.9 IP projection for Brad Lidge are all too low, even though the algorithm has every reason to be bearish on these pitchers.

His injury history takes Roy Halladay down to 203.2 IP, a full 20 IP below John Lackey's new projection. That's a bit of over-compensation, but the algorithm is right to put those two in the same ballpark in expected IP. Other than that, there are no top-line pitchers taking large projected IP losses, except those who experienced recent injuries.

As you can see, the injury-based adjustments are rather crude and they often lead to estimates that over-shoot or under-shoot the mark. However I think that most of the estimates shoot in the right direction, and sometimes they are able to fix obvious problems with projections that don't use injury data. 

Elbows, Shoulders & Surgeries (seven mini-models by injury type)

As I mentioned in my last post, I created seven mini-models to look at how different types of injuries affect IP predictions. I thought the individual mini-models were interesting, and also very readable. Let me show what I'm doing with an example.

(A full text of the seven mini-models is here.)

Elbow Problems

Suppose I want to see how elbow-related injuries should affect my IP projection. I train an M5Rules linear model in WEKA with my non-injury IP projection, and all elbow-related features. I won't go over the M5Rules model again. However it's important to note that it will not use all of the features that I provide. Instead, there is some feature selection that takes place, so features that do not provide information gain for the model are excluded. Here's what's left:

IP_2009 = 
-22.0378 * inj_tj_surgery_2008 
- 34.7761 * inj_tj_recovery_camp_2008 
+ 39.0062 * inj_tj_surgery_average 
+ 31.5039 * inj_elbow_surg_not_tj_average 
+ 50.288 * inj_elbow_strain_average 
+ 0.9455 * proj_IP_v27_w_value 
+ 6.794

The projection consists of:
  • a linear constant
  • 0.9x * simple IP model
  • weighted sum of several elbow-related features
Hopefully most of the feature names are self-explanatory, though I will explain the interesting ones. A feature that ends in "_2008" relates to the year previous to the one we want to predict (so it would actually be 2007 stats for 2008 predictions, etc). A feature that ends in "_average" is a linear average of the three years prior to "_2008." So "_2008" features represent the near past, whereas the "_average" features represent the older background for the player.

To compute the "elbow related factor" used as part of the IP predictions, I just use the weighted sums of the elbow features above, ignoring the constant term and the "0.9x * proj_IP" term.

Therefore, the pitcher will increase or decrease his IP projection due to past elbow injuries depending on the sign and magnitude of this term:

- 22.0378 * inj_tj_surgery_2008
- 34.7761 * inj_tj_recovery_camp_2008
+ 39.0062 * inj_tj_surgery_average 
+ 31.5039 * inj_elbow_surg_not_tj_average 
+ 50.288 * inj_elbow_strain_average

In words, the player gets negative IP for:
  • Tommy John surgery in the past year
  • "Tommy John recovery" DL listing in spring training of this year
However the player gets positive IP for:
  • Tommy John surgery in the past (but not last year)
  • non-Tommy John elbow surgery in the past
  • elbow strain DL stints in the past
To any fan who knows a little bit about pitchers' recovery from elbow surgery, none of these features should be surprising. So at least the elbow model passes the sniff test.
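
As a minimal sketch, the elbow term above is just a weighted sum. The weights are copied from the model output; the function name and the two example pitchers are hypothetical and only illustrate the arithmetic.

```python
# Weights copied from the elbow mini-model above
# (the constant term and the proj_IP term are deliberately left out).
ELBOW_WEIGHTS = {
    "inj_tj_surgery_2008": -22.0378,
    "inj_tj_recovery_camp_2008": -34.7761,
    "inj_tj_surgery_average": 39.0062,
    "inj_elbow_surg_not_tj_average": 31.5039,
    "inj_elbow_strain_average": 50.288,
}


def elbow_adjustment(features):
    """Weighted sum of elbow features; features not supplied count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in ELBOW_WEIGHTS.items())


# Hypothetical pitcher: Tommy John surgery in the past year, and a
# "Tommy John recovery" DL listing in this year's camp.
recent_tj = {"inj_tj_surgery_2008": 1, "inj_tj_recovery_camp_2008": 1}

# Hypothetical pitcher: Tommy John surgery in one of the three earlier seasons.
old_tj = {"inj_tj_surgery_average": 1 / 3}

print(round(elbow_adjustment(recent_tj), 1))  # -56.8
print(round(elbow_adjustment(old_tj), 1))     # 13.0
```

The first pitcher loses about 57 IP from his projection, while the second gains about 13 IP.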

To those interested in the magnitude of the feature weights, I should point out that all yearly feature values are binary {0,1}, thus making all "_average" values {0, 1/3, 2/3, 1}. Since the incidence of a particular injury is rare, "_average" values can be thought of as binary values, but at 1/3 of the weight.
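
To make the 1/3 point concrete, here is a small sketch of how an "_average" value falls out of three yearly flags. The helper name is hypothetical; the 50.288 weight is the elbow strain weight from the model above.

```python
from fractions import Fraction


def average_feature(yearly_flags):
    """Arithmetic mean of binary yearly flags (1 = injury listed that season)."""
    if any(flag not in (0, 1) for flag in yearly_flags):
        raise ValueError("yearly flags must be 0 or 1")
    return Fraction(sum(yearly_flags), len(yearly_flags))


# One elbow strain somewhere in the three background seasons.
one_strain = average_feature([0, 1, 0])
print(one_strain)  # 1/3

# Effective contribution of that single past strain: one third of the weight.
print(round(float(one_strain) * 50.288, 1))  # 16.8
```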

Creaking Shoulders

The features for shoulder injuries are a little different from the elbow features:

- 43.2967 * inj_labrum_surgery_2008 
- 8.357 * inj_shoulder_tendonitis_2008 
- 9.3244 * inj_shoulder_inflam_2008 
- 19.828 * inj_shoulder_strain_average 
- 25.3431 * inj_shoulder_inflam_average 
+ 12.2862 * inj_shoulder_average 
- 54.3987 * inj_labrum_surgery_recovery_average 

Histories of shoulder strains, shoulder inflammations and especially of shoulder labrum surgeries are predictors of downward IP projections well into the future. Whereas the model projects a strong rebound for Tommy John survivors, a history of shoulder trouble is not a good sign for a pitcher holding up long-term, even if he stays off the DL for a while. Once a pitcher has major shoulder problems, they tend to bother him for the rest of his career.

Stiff Forearms

I trained a separate model for arm injuries not involving the elbow or the shoulder:

- 16.8042 * inj_forearm_2008 
+ 25.4747 * inj_forearm_average 
+ 27.0505 * inj_upper_arm_average 

Although I remember reading somewhere that forearm soreness is often a precursor to elbow problems, this injury seems to follow a simple pattern: recent injury is bad, but a history with no re-injury means projections can be raised.

General Surgery

Here is a look at surgeries, but from a more general sense:

+30.7839 * inj_recovery_2008 

- 14.106 * inj_shoulder_surgery_2008 
- 72.0142 * inj_surgery_camp_2008 
- 49.5069 * inj_recovery_camp_2008 

If the player was DL'ed last year with "recovery from surgery," then he should expect to play more in the upcoming year. Also it's not surprising that having surgery in spring training is not a precursor of playing a lot. Nor is going on the DL in spring training as "recovery from surgery" a good sign, either.

Somewhat surprisingly, recent shoulder surgery garners only 1/3 of the penalty for "shoulder labrum surgery". I guess there are simpler shoulder procedures that pitchers undergo, from which the recovery time is much faster than for SLAP, or any of the other labrum surgeries. This is not to say that all non-labrum shoulder surgeries are no big deal, but there is a huge difference between Mariano Rivera's 2008 surgery to remove "calcification" in the shoulder, and Brandon Webb's surgery last year to repair a "fraying labrum."

Having said that, neither I nor WEKA knows much about the anatomy of shoulder injuries. I don't really understand where the rotator cuff ends and where the labrum begins. I know what a labrum is and I've known several friends to tear their labrums playing football & rugby, but I don't really know how those get repaired for pitchers. If you are reading this and know more about labrums than I do, please let me know!

Missed Seasons

I'm not sure why I combined "offseason surgery," "missed season due to injury [over 149 days on the DL]," and "season ending DL [30+ days through October]" together, but here it is:

- 16.5312 * inj_season_ending_2008 
- 18.7897 * inj_offseason_2008 
+ 31.5928 * inj_offseason_average 
+ 18.9852 * inj_missed_season_average 

Ending a season on the DL or having offseason surgery are not good signs, but these are not as strongly negative as one might have expected. Past offseason surgery corresponds to gains in IP (I won't speculate why). Spending entire seasons on the DL in the past is positive for IP projection for obvious reasons.

Spring Cheating

As I wrote in my last post, my system takes account of DL transactions from spring training. This is the time of year when teams often tip their hand on a pitcher's health. Getting placed on the DL to start the season is never a good thing:

+ 8.2829 * inj_camp_2008 
- 48.6768 * inj_dl_camp_2008 
+ 21.7274 * inj_camp_average 
+ 42.948 * inj_surgery_camp_average 
+ 54.6509 * inj_recovery_camp_average

According to the model, showing up on the injury report in camp, but not on the DL, is not a bad thing. However opening the season on the DL is very bad for that year's playing time. Being at least a full season removed from surgery and/or subsequent recovery (but not currently back on the DL) is good.

Days and Weeks

Finally, let's look at the features derived simply from the occurrence of DL stints (and days on the DL), irrespective of the injury description or timing:

- 7.5115 * inj_15DL_2008 
+ 0.3146 * inj_days_DL_average 
- 33.9261 * inj_anyDL_average 
+ 24.378 * inj_something_average 
- 23.5081 * inj_long_DL_average

Being on the DL last year is not great, but not a large downward factor, either. The "_average" features are confusing, but they mainly cancel each other out.

Here "long_DL" means any DL stint of over 60 days. Most surprisingly, there is no negative feature for "long_DL" in the past year. The seven mini-models are trained independently, so this model does not take account of the downward adjustments that we make for specific major injuries, surgeries, recoveries, etc.

If we know that a player missed lots of time in the past year due to injury, but we don't know what his injury was, whether he missed the whole season, or whether he had surgery, then we know very little about how to adjust his future IP projection.

I started out trying to predict future performance by looking at past DL time, but now I see why this would never work well. To predict anything meaningful from injuries, the type of injury is important, and the timing is important. Compared to those factors, time missed is not very important.

I also tried to train mini-models on several other injury types, but didn't have useful results. Most injuries are too rare to be selected by feature selection algorithms as being sufficiently meaningful (back surgery, for example, is not very common among pitchers). Other injuries are somewhat common, but don't seem to have much predictive value (hamstring strains, oblique strains, and the flu).

Hopefully you'll agree that my mini-models are readable, interesting, and pass the sniff test. I'll be happy to hear suggestions for further improvements. I'm glad this thing worked, but I'm sure it's possible to build a better model!

IP projection adjustments (with injury data)

After several earlier failed efforts (over the past six months), I finally managed to substantially improve a prediction system by using rich injury data. Using that data, I built an additional layer on top of my IP projection system (described in earlier posts), and it substantially improves results in retrospective analysis.

To demonstrate what I'm doing, I'll show how the injuries data changes my system's projections for some 2009 pitchers' IP (based on 2004-2008 data). But first, let me quickly explain what I'm doing.

The data from Corey Dawkins' DL tool includes information about all major league players' in-season DL stints, as well as "Camp" (ie spring training) injuries, "Offseason" injuries, and also descriptive information about all surgeries. Unfortunately, the data does not include this same information for minor league players, or for players without teams (ie Ben Sheets in 2009). However the data is otherwise very thorough, and complete in the vast majority of cases.

In short, the data is very good. Having said that, getting predictive value from injury data is very difficult. A few months ago, I built models projecting a player's likely DL days during the next season, as well as his likelihood of suffering an elbow injury, a shoulder injury, a surgery, etc. These models gave me predictions with non-trivial correlation to reality, but they were not helpful for improving my predictive models for IP, VORP, or ERA. Therefore, I have re-defined my earlier goals for injury data.

My goal is simply to improve my existing IP projection system using features derived from the DL information that I have available. Intuitively, we know that some injury histories should make us want to downgrade a pitcher's playing time projection. But if my ML system can't find an IP prediction improvement in using that feature, then that feature does not make its way into the model. I will get back to this idea when I talk about my Randy Johnson projection for 2009.

My System

Looking at the injuries listed, I came up with ~60 features that I thought might be useful in a predictive model. Features might be something like "did pitcher X have an elbow injury" or "did pitcher X have Tommy John surgery" or "did pitcher X spend over 60 days on the DL." Features are generated per pitcher per season, and they range from the general to the very specific.

As with my previous IP prediction system, I build a model to predict a pitcher's stats using his last four years' data. So for 2009 predictions, I'm looking at 2005, 2006, 2007 and 2008 data, as well as some averages. To simplify injury-based IP projections, I only considered "2008" and "average" (ie 2005-2007 arithmetic average) features for injuries.

If I'm looking at "shoulder surgeries," I will only look at whether the pitcher had a shoulder surgery in the year previous to April 1 2009, and also at how many years during 2005-2007 he had at least one shoulder surgery. In other words, I am asking "did he just come off of shoulder surgery?" and "does he have a history of shoulder surgeries in the past?"
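
The two questions can be sketched as a pair of features per injury type. This is only an illustration: the function name and the input format (a set of seasons in which the injury was listed) are assumptions, not the actual pipeline. The "_2008" suffix follows the naming used in the models, where it always means "the season before the one being projected."

```python
def injury_features(injury_seasons, target_year, name):
    """Build the recent and background features for one injury type.

    injury_seasons: set of seasons in which the injury was listed
                    (hypothetical input format).
    target_year:    the season being projected, e.g. 2009.
    name:           injury label, e.g. "shoulder_surgery".
    """
    recent = target_year - 1                             # e.g. 2008
    background = [recent - 3, recent - 2, recent - 1]    # e.g. 2005, 2006, 2007
    years_hit = sum(1 for year in background if year in injury_seasons)
    return {
        f"inj_{name}_2008": int(recent in injury_seasons),
        f"inj_{name}_average": years_hit / len(background),
    }


# Hypothetical pitcher with shoulder surgeries listed in 2006 and 2008.
print(injury_features({2006, 2008}, 2009, "shoulder_surgery"))
```

For this pitcher the recent flag is 1 (he just came off surgery) and the background average is 1/3 (one of the three earlier seasons).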

The devil's in the details, but I think this approach makes logical sense, and it makes for models that are easy to understand.

One interesting aspect of looking at an injury type in the short run and in the longer run is that having a history of a particular injury could raise a pitcher's IP projection. A good example is Tommy John surgery. If pitcher X has TJ surgery in 2007, he can expect to miss most of the 2008 season (which will usually be listed as "recovery from TJ surgery" in the 2008 DL listings). Such a history should make us want to increase his 2009 IP projections from what they otherwise would be. Maybe he didn't throw many innings in 2008, but he had a good excuse, and as long as he does not undergo more surgery in 2008, he should be good to go in 2009.

Asked to evaluate the TJ-related features, my system gives a value for "how will an average pitcher's IP be affected by these injury features?" This is limiting, since the projected IP gain or loss should change depending on whether we are talking about AJ Burnett, Billy Wagner, or Eddie Guardado. But for now, I ignore this issue. I am simply trying to find which features have power to predict IP (on top of a good system that doesn't use injury data).

I built separate mini-models for predicting IP changes due to:

  • elbow injuries
  • shoulder injuries
  • forearm & upper arm injuries
  • offseason injuries & surgeries
  • injuries sustained in camp (spring training)
  • other surgeries (not including features already covered above)
  • DL listings and # of days on DL
These models are interesting in their own right. I will list the features that they use and explain them in a future post.

Now I have seven possible causes to increase or decrease a pitcher's IP projection. I sum the values to get a single "injury adjustment feature."

Now I can train a final model with:
  • my non-injury based IP projection
  • my "injury adjustment feature"
  • whether the pitcher still qualifies as a rookie
The latter is necessary since I do not have injury data for pitchers while they were not on an MLB roster, nor do I use minor league stats in my projections. Therefore it is often useful to project rookies differently from veterans.

Using the three features above, I get the following simple model:

proj_IP <= 81.746
IP_2009 = 
8.1498 * ROOKIE_2009 
+ 0.2684 * injury_sum 
+ 0.9539 * proj_IP
+ 5.989
IP_2009 = 
0.783 *injury_sum 
+ 0.7377 * proj_IP 
+ 41.7999

This method is imperfect, but at least my system is able to recognize that injury-based IP adjustments need to be much larger for starters than for relievers. Naturally, I don't allow IP projections below 0.0. Also, the model above only applies to pitchers with an injury history (ie at least one injury listing in the past four years). Pitchers with no injury history simply get the "proj_IP" value with no changes.
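
Here is a rough sketch of how the two rules and the caveats above fit together. The coefficients are copied from the model output. I am assuming that the second equation is the default rule for proj_IP above 81.746; the function name, argument names and example inputs are hypothetical.

```python
def final_ip_projection(proj_ip, injury_sum, is_rookie, has_injury_history):
    """Apply the final two-rule model, with the passthrough and the floor at 0."""
    if not has_injury_history:
        # No injury listing in the past four years: projection is unchanged.
        return proj_ip
    if proj_ip <= 81.746:
        # Low-IP rule (mostly relievers): small weight on the injury sum.
        ip = 8.1498 * is_rookie + 0.2684 * injury_sum + 0.9539 * proj_ip + 5.989
    else:
        # Assumed default rule (mostly starters): much larger injury weight.
        ip = 0.783 * injury_sum + 0.7377 * proj_ip + 41.7999
    return max(ip, 0.0)  # projections below 0.0 IP are not allowed


# Hypothetical inputs: the same -50 injury sum applied to a reliever and a starter.
print(round(final_ip_projection(60.0, -50.0, False, True), 1))   # 49.8
print(round(final_ip_projection(180.0, -50.0, False, True), 1))  # 135.4
print(final_ip_projection(10.0, -200.0, False, True))            # 0.0
```

With the same injury sum of -50, the reliever's projection moves by about 13 IP (0.2684 * 50) and the starter's by about 39 IP (0.783 * 50).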

Ok, now that I've explained what I did, let me show the biggest hits and misses for 2009 IP predictions using injury data. The best and worst 20 changes can be seen here, and I will break down a few cases below.

I coulda told you that...

The injury-using system makes major gains by predicting much lower IP in 2009 for:

  • Jake Westbrook, Shaun Marcum, Dustin McGowan
  • Tim Hudson, Ben Sheets, Jeremy Bonderman, Jeff Francis
By the start of the 2009 season, all of these guys were coming off of major injuries and were DL'ed by their teams to open the season. My system was able to take that into account and dramatically slash their projected IP. Am I cheating? I'll discuss that in a second. Fact is, if you generate injury data saying that pitcher X is going to open the season on the DL with "recovery from Tommy John surgery," my system will take that into account. I'm not aware of other projection systems that take that into account automatically.

Making an expected recovery

The injury-based system beats the simple system by significantly boosting IP projections for:
  • Josh Johnson, Chad Gaudin, Jorge De La Rosa
  • Josh Beckett, Braden Looper, Francisco Liriano
These guys had severe injuries in the past, but were relatively healthy heading into 2009. Therefore, the system substantially increased their IP projections. In these cases, the increased IP projections look good, but that doesn't always work out.

Bad interaction

I've tried to keep the system simple, to make it (relatively) easy to understand, and also to reduce over-fitting (more on that below). However this means the system will occasionally output predictions that are clearly flawed. Here are a couple of cases.

My simple system projected Randy Johnson for 128.5 IP in 2009. However, the injury-based model boosted him up to a projected 208.0 IP, one of the highest projections in baseball for 2009. Johnson has a history of back injuries, including back surgeries in 2006 and 2007. However, he did not have any surgeries in 2008. My system does not use specific features for "back surgery" or "herniated disc surgery," but it does use general features for "surgeries." Therefore, Johnson's back surgery gets lumped in with all other surgeries, including stitches, appendix removals and scoped knees. Tommy John surgery and shoulder labrum surgery get their own specific features, but back surgeries do not. And yet, we know that back injuries tend to be chronic. Indeed, Randy Johnson only threw 96.0 IP in 2009, and spent 71 days on the DL with back problems. Should I have trained a mini-model for back injuries? Maybe I should have.

In a totally different kind of failure, my system used injury data to increase Sidney Ponson's IP projection for 2009 from 84.8 IP to 128.1 IP. He only threw 58.7 IP. He also happened to spend 49 days on the DL, but I doubt that 128.1 IP is a reasonable projection for a healthy Sidney Ponson. At 32, Sir Sidney is a journeyman pitcher. His projections should remain low, not only because of injury risk, but also because there is a good chance that he won't stick on a major league roster. My simple model looks at a pitcher's effectiveness (VORP, SNWP, etc) in making IP predictions, but the injury-based adjustments are made without looking at value. Sometimes, this causes unrealistic projections.

Am I cheating?

With my methods, some over-fitting is unavoidable. There is not enough recent injury data to have large separate training and testing sets. All I can do is to keep over-fitting to a minimum. I create features that make logical sense and I avoid lumping unrelated categories together ("surgery" or "elbow surgery" or "Tommy John surgery," rather than "complicated surgery (including TJ, shoulder labrum and herniated disc in the lower back)"). Also my methods exclude features from training that don't provide a significant amount of information gain. I'll write more about what actual features I include in a future post, but my entire injury-based adjustment only uses about 25 features total (out of a possible 120 or so candidates).

As I alluded to earlier, I use injury listings from spring training to predict that year's injury-based adjustments. I consider everything listed before April 1, 2009 as "2008 data." If a player is recovering from Tommy John surgery, the team will typically place him on the DL before the season starts. I use that information to predict that he will likely not pitch that year.
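
The April 1 cutoff can be sketched as a small labeling function. The function name is hypothetical, and extending the "before April 1, 2009 counts as 2008" rule to every year is my reading of the rule rather than a statement of the actual code.

```python
from datetime import date


def data_year(listing_date):
    """Label a DL listing with the season it counts toward as input data.

    Listings before April 1 are treated as the previous season's data, so
    spring-training listings feed that same year's projection.
    """
    if listing_date < date(listing_date.year, 4, 1):
        return listing_date.year - 1
    return listing_date.year


print(data_year(date(2009, 3, 20)))  # 2008 (camp listing, used for the 2009 projection)
print(data_year(date(2009, 4, 1)))   # 2009
print(data_year(date(2008, 10, 5)))  # 2008
```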

One can argue that my system will output inaccurate IP estimates before the last week of March, but that isn't quite so. The system will output good estimates in January, too, but they will change if more information is added. Also, if you are looking at the projection of a pitcher, and you know that he will be DL'ed to start the season, it's ok to just put that future expected DL listing into the input file. I think this is a better approach than manually editing projections for known major injury cases, or outputting incorrect projections for pitchers that we know will likely miss the season.

Is it worth it?

My injury-based adjustments do not affect the 2/3 of pitchers who do not have an injury history (in the past four years) at the MLB level. However the remaining 1/3 of pitchers (including the vast majority of veterans) get significantly better IP projections if we take injury data into account.

IP_proj (with injuries) | Average error | RMS error | r^2 (with actual) | % better

Although only 50% of cases are improved, the average improvement is greater than the average mistake. I am improving r^2 by 10% for a system that already compares very favorably with PECOTA and CHONE for IP projection.
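
For readers who want to run the same kind of comparison, here is a sketch of the four measures. I am assuming that "average error" means mean absolute error, that r^2 is the squared Pearson correlation with actual IP, and that "% better" is the share of pitchers whose absolute error shrinks. The sample numbers are invented toy values, not my results.

```python
import math


def mean_abs_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rms_error(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Squared Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)


def share_improved(actual, baseline, adjusted):
    """Fraction of pitchers whose adjusted projection is closer to actual."""
    better = sum(
        1 for a, base, adj in zip(actual, baseline, adjusted)
        if abs(a - adj) < abs(a - base)
    )
    return better / len(actual)


# Invented toy data for four pitchers.
actual = [200.0, 60.0, 0.0, 150.0]
baseline = [180.0, 70.0, 90.0, 140.0]   # projection without injury data
adjusted = [190.0, 75.0, 20.0, 145.0]   # projection with injury data

print(mean_abs_error(actual, baseline), mean_abs_error(actual, adjusted))  # 32.5 12.5
print(share_improved(actual, baseline, adjusted))                          # 0.75
```

Note how one large fix (the third pitcher) dominates the average error even though one of the four cases gets slightly worse.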

2010 projections

I will have new 2010 IP projections up soon. Since these take account of the 2010 DL listings that take place during March, I want to make sure I have the most up to date DL listings first. But I will have something here very soon, in any case.

Wednesday, March 24, 2010

Rays' pitchers' IP projections

After breaking down the Yankees' pitchers' IP projections, I thought I'd look at another team, one that I do not know as well as I know the Bombers.

I do know that the Rays' staff is a young staff, with the Rays having a supposed glut of quality starting pitchers. After the 2008 season, they traded away Edwin Jackson, and last year they traded Scott Kazmir in mid-season. Both moves were made in part to make room for promising starters that they wanted to promote from the minors.

Here is a list of my projections, along with PECOTA and CHONE, for the Rays' top twelve pitchers, according to the official depth chart on the Rays website. Unlike the Yankees, the Rays project to have six guys primarily as starters, and six guys primarily as relievers:

  • Starters: James Shields, Matt Garza, Jeff Niemann, David Price, Wade Davis and Andy Sonnanstine
  • Relievers: JP Howell, Rafael Soriano, Lance Cormier, Dan Wheeler, Grant Balfour and Randy Choate
Also unlike the Yankees, the Rays don't have four experienced starting pitchers. James Shields has four years of full-time starter experience, and gets projected at 190+ IP by all three projection systems. Matt Garza has two full seasons of MLB starter experience, both good but not amazing. He projects at 160 IP by my system, and somewhat higher by PECOTA and CHONE. The other four Rays' starters don't have much of an MLB track record, or at least not a record of consistent MLB success.

Jeff Niemann and David Price both have one full season of MLB starter experience, with decent but not amazing results. Andy Sonnanstine has 2.5 years of starter experience, but none of it was very good. Wade Davis was solid in a 36 inning debut last year.

I have all four of these pitchers projected to throw between 81.0 (Sonnanstine) and 142.3 (Niemann) innings this year. PECOTA has them all between 160 and 180 innings each. CHONE has them all between 155 and 171 innings (except for Price at 123 innings).

In this case, I must say that my estimates look low, when taken together. I project the Rays' starting six to throw only 800 innings in 2010. A team typically has about 1000 innings thrown by starters (the Rays had 974 IP by starters in 2009). So my projections are about 200 IP short for starting innings. This suggests that the Rays are likely to use another starter (not in the 12 listed here) during the 2010 season, or multiple such starters. While that may happen, it's likely that the pitchers listed above will indeed pitch more than the 800 innings that I project for them.

Conversely, PECOTA projects the top six Rays' starters for 1082 IP, while CHONE projects them for 985 IP. The PECOTA figure is clearly high, while the CHONE figure looks about right. Then again, there is probably a chance that at least some starts will end up going to a Ray outside of the top six starters listed above, so the CHONE projection might be a little high, as well.
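
The comparison in the last two paragraphs is simple arithmetic, sketched below. The three totals and the 1000-inning yardstick are the figures quoted above; the function and variable names are hypothetical.

```python
TYPICAL_STARTER_INNINGS = 1000  # rough per-team figure quoted above

# Projected innings for the Rays' top six starters, as quoted in the text.
rays_top_six_totals = {"my system": 800, "PECOTA": 1082, "CHONE": 985}


def innings_gap(projected_total, team_total=TYPICAL_STARTER_INNINGS):
    """Positive = more innings projected than a team typically has; negative = shortfall."""
    return projected_total - team_total


for system, total in rays_top_six_totals.items():
    print(system, innings_gap(total))
# my system -200
# PECOTA 82
# CHONE -15
```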

I think it's interesting to note that my system projects Wade Davis at a higher IP than Andy Sonnanstine. The latter is more experienced, and threw a lot more innings last year than Davis. However, Sonnanstine's performance was not great in 2009, nor was it great in years prior. Davis was effective in 36 innings last year as a starter. My system projects him to throw 99.9 innings in 2010, which happens to be the highest projected increase from 2009 to 2010 among the MLB pitchers that I project.

Although my system does not use minor league stats, major league depth charts, scout projections or any other information besides MLB stats, it projects Davis as the #5 starter, and Sonnanstine as the #6. I thought that was interesting.

In summary, I think my system gives reasonable projections for the Rays' pitchers' IP, even if I would like to nudge them a bit higher collectively. CHONE also has reasonable projections (on average), although they assume that the Rays will not use any starters outside the six that they have on the MLB roster. PECOTA, as with the Yankees, significantly over-estimates the pitchers' playing time.

I'm curious to see how the season will unfold, and which of the set of predictions will come closest to what actually happens with the Rays' staff this year. The Rays have a young staff that many people think should be better in 2010 than it was in 2009. If so, my projections will probably end up being too low. The Rays are an exciting team, and I'd like to see them do well, so let's hope my models are wrong on this one.

Yankees' 25-man pitchers (everyone loves projections)

When I worked at Big Software Company X, I always told new engineers to include specific analysis with any data that they sent for others to look at. So here, I am following my own advice. I can't publish PECOTA and CHONE projections with mine side-by-side for all pitchers, since that gives away those guys' hard work for free, but I think it's OK to list a few projections, for the sake of comparison. Better yet, let's look at a complete team.

Here are the IP projections from my system, along with PECOTA and CHONE, for the New York Yankees' likely opening day 12-man pitching roster. For those that don't like spreadsheets, here are the pitchers I think will be on that 25-man roster:

  • Starters: CC Sabathia, AJ Burnett, Javier Vazquez, and Andy Pettitte
  • Starters/Relievers: Joba Chamberlain, Chad Gaudin, Phil Hughes and Alfredo Aceves
  • Relievers: Mariano Rivera, Chan Ho Park, Damaso Marte and David Robertson
You could argue for Sergio Mitre, Mark Melancon and others, but this is probably what the Yankees will start with. FWIW, here is their official depth chart.

None of the projected top 12 pitchers are rookies, so PECOTA and CHONE are not subject to the wildly optimistic rookie projection issue in this case. You can read more about rookie projections in my previous post. Even so, the projections from PECOTA are much too high on average (less so for CHONE).

In 2009, the 12 pitchers listed threw a total of 1532.2 innings. The Yankees' season total is roughly 1450 innings. Therefore, it is unlikely that the pitchers above will throw more than 1300 or so innings in 2010 (remember, at least some innings will be thrown by rookies and major league veterans currently in the minors). Indeed, my projection system gives the 12 guys 1236.8 innings this year. That may be low or that may be high, but it's a reasonable guess. However, PECOTA projects these same pitchers at 1563.6 total innings. That's over 100 IP more than the entire Yankees staff will throw in 2010. CHONE projects them at 1388 IP, which is still high, but is much less so than PECOTA, and can be mitigated by the fact that Javier Vazquez and Chan Ho Park got their innings for other teams last year. None of the projection systems take account of team innings balance.
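The comparison above is just subtraction, but it is easy to sketch in a few lines of Python. The projected totals are the ones quoted in this post; the 1450 team total and the 1300 ceiling for these 12 pitchers are rough figures, and the function and variable names are made up for illustration.

```python
# Team innings budget check, using the totals quoted in this post.
TEAM_SEASON_IP = 1450.0     # rough full-season team total (approximate)
TOP12_CEILING_IP = 1300.0   # rough ceiling for the 12 listed pitchers

# Summed 2010 IP projections for the 12 pitchers, by projection system.
projected_totals = {
    "mine": 1236.8,
    "CHONE": 1388.0,
    "PECOTA": 1563.6,
}

def surplus(total_ip, budget_ip):
    """Innings projected beyond the budget (negative means under budget)."""
    return round(total_ip - budget_ip, 1)

for system, total in projected_totals.items():
    print(f"{system:7s} total={total:7.1f}  "
          f"vs team total={surplus(total, TEAM_SEASON_IP):+7.1f}  "
          f"vs top-12 ceiling={surplus(total, TOP12_CEILING_IP):+7.1f}")
```

Only PECOTA's total lands above the whole team's innings; CHONE's is above the rough 1300 ceiling but below the team total.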

Then again, as I wrote previously, my system does not require much of this team-based balancing, since it gives much more realistic projections for rookie and low-end pitchers (at least on average). With reasonable roster construction, total team projections should come out reasonably without adjustment. My system will over-estimate a team's pitchers' innings if that team signs seven top-line starting pitchers, but teams never do that. My system could, however, massively under-estimate innings pitched for teams that have very few pitchers with major league success & experience. However, one could just proportionately increase that team's rookies' innings pitched for a more reasonable adjustment. In other words, we would estimate rookies' IP not from their own minor league stats, but from a team's need to fill innings and roster spots.
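A minimal sketch of that proportional adjustment, assuming we already have per-pitcher IP projections and a rough team total. The staff below is entirely made up, and `fill_with_rookies` is a hypothetical helper, not something my system currently does.

```python
def fill_with_rookies(veteran_ip, rookie_ip, team_ip=1450.0):
    """Scale rookies' projected IP so the staff total matches team_ip.

    veteran_ip, rookie_ip: dicts of pitcher name -> projected IP.
    Veterans are left untouched; rookies split the remaining innings
    in proportion to their original projections.
    """
    remaining = team_ip - sum(veteran_ip.values())
    rookie_total = sum(rookie_ip.values())
    if remaining <= 0 or rookie_total <= 0:
        # Nothing to hand out (or nobody to hand it to): leave rookies as-is.
        return dict(rookie_ip)
    factor = remaining / rookie_total
    return {name: ip * factor for name, ip in rookie_ip.items()}

# Made-up staff: established pitchers sum to 1000 IP, rookies to 300 IP.
veterans = {"Vet A": 200.0, "Vet B": 180.0, "Vet C": 170.0, "Vet D": 150.0,
            "Vet E": 120.0, "Vet F": 70.0, "Vet G": 60.0, "Vet H": 50.0}
rookies = {"Rookie X": 120.0, "Rookie Y": 100.0, "Rookie Z": 80.0}

adjusted = fill_with_rookies(veterans, rookies, team_ip=1450.0)
for name, ip in adjusted.items():
    print(f"{name}: {rookies[name]:.0f} -> {ip:.0f} IP")
```

Here the rookies' innings are driven by the 450 IP the team still has to fill, not by anything in their own track records.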

In the case of PECOTA (and CHONE, to a lesser degree), they massively over-estimate a team's established pitchers' IP. This calls for additional (and possibly skewed) adjustments in order to get numbers that add up. Even though the individual projections look reasonable, the team totals are not reasonable. To get totals that sum to 1300 IP, we'd need to reduce PECOTA's projections by roughly 17% (1300 / 1563.6 ≈ 0.83). It's not clear whether all, or just the low-end pitchers, need to have their projections lowered.
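The two options (cut everyone, or cut only the low-end guys) can be sketched as follows. This is an illustration under assumptions: the projections are invented, the 100 IP cutoff for "low-end" is arbitrary, and neither function is part of PECOTA, CHONE or my system.

```python
def scale_all(proj, target):
    """Shrink every projection by the same factor so the sum hits target."""
    factor = target / sum(proj.values())
    return {name: ip * factor for name, ip in proj.items()}

def scale_low_end(proj, target, cutoff=100.0):
    """Shrink only pitchers projected below `cutoff` IP; leave the rest alone."""
    high = {n: ip for n, ip in proj.items() if ip >= cutoff}
    low = {n: ip for n, ip in proj.items() if ip < cutoff}
    low_target = target - sum(high.values())
    if not low or low_target < 0:
        raise ValueError("low-end pitchers alone cannot absorb the cut")
    factor = low_target / sum(low.values())
    out = dict(high)
    out.update({n: ip * factor for n, ip in low.items()})
    return out

# Made-up staff projected at 760 IP that needs to come down to 700 IP.
proj = {"SP1": 210.0, "SP2": 190.0, "SP3": 180.0,
        "RP1": 70.0, "RP2": 60.0, "RP3": 50.0}

even_cut = scale_all(proj, 700.0)
low_cut = scale_low_end(proj, 700.0)
for name in proj:
    print(f"{name}: {proj[name]:6.1f}  all={even_cut[name]:6.1f}  "
          f"low-end only={low_cut[name]:6.1f}")
```

Both versions hit the same team total, but the second one takes a third of the relievers' innings away while leaving the starters untouched, which is the kind of skew I mean.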

Projecting IP matters for estimating just about any other pitching stat (for fantasy or otherwise). If you want to project strikeouts, walks, wins, saves or IP-weighted ERA (or WHIP), it is important to know how much playing time a pitcher is likely to be able to handle. My projection system does not give a high and low end projection (ie 75th and 25th percentile), but it does give single projections that add up on a league-average basis.
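To see why playing time matters for something like IP-weighted ERA, here is a small sketch. The pitchers and numbers are invented, and `ip_weighted_era` is just an illustrative helper.

```python
def ip_weighted_era(pitchers):
    """Combined ERA for a group of pitchers.

    pitchers: list of (projected_ip, projected_era) tuples.
    Each ERA is weighted by the innings it is expected to cover.
    """
    total_ip = sum(ip for ip, _ in pitchers)
    if total_ip == 0:
        raise ValueError("no innings to weight by")
    return sum(ip * era for ip, era in pitchers) / total_ip

# Same two ERAs, different playing time splits.
ace_heavy = [(200.0, 3.00), (100.0, 4.50)]
ace_light = [(100.0, 3.00), (200.0, 4.50)]

print(ip_weighted_era(ace_heavy))  # 3.5
print(ip_weighted_era(ace_light))  # 4.0
```

The rate projections are identical in both cases; only the IP split changes, and the combined ERA moves by half a run.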

Updated 2010 IP projections

I removed the known retired pitchers from training (all 15 of them), and so I have projections from my new model for 2010 IP. Pitchers are sorted by last year's IP.

I am using the same features and same process in the training of the models. Actually, I also looked at "games" and "games started" this time around, but that was a waste of time. No new information there. The correlations with actual IP stay the same, and even average IP rises by only a tiny amount.

What does happen, though, is that veteran starting pitchers are not regressed as heavily due to age. Also long-term value (ie value produced in the past four years) gets more weight in the model than in the previous one.

In a separate list, I have this model's projections, listed next to those of the last model. High-end older starting pitchers seem to gain IP, which I think makes sense.

Roy Halladay, Tim Lincecum, Bronson Arroyo, Javier Vazquez and Zach Duke are all projected at 10-12 IP more than in the previous model, probably because the new model gives them more credit for VORP and IP before 2009. Of those guys with high IP totals in 2009, only Josh Johnson, Jon Lester and Scott Baker lost more than 7 IP from the previous projection. Again, this is probably because those pitchers did not have high VORP or IP before 2008-2009.

Overall, the projected IP for the high-IP guys from 2009 is up (+2.5 IP for pitchers with 200+ IP in 2009). I think this is an improvement, even if my system is now lowballing some very good young pitchers.

Future Ideas

This should probably be it for my IP projections from season stats. However, I am skipping two important aspects of a player's playing time projection: consistency and momentum. Since regression is more likely than improvement for most pitchers, my model (like most other predictive models) tends to project a pitcher at his recent level of performance, regressed downward toward replacement level. However, if the player has shown great consistency in the past few years, then perhaps we should regress him by a smaller factor. Alternatively, if the player has made a large improvement over past performance, perhaps he is on an upward trajectory that is not yet complete. His additional upside should perhaps earn him some credit, too.
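A rough sketch of the consistency half of that idea (momentum is not covered here). Everything in it is an assumption made for illustration: the 40 IP replacement level, the 30% base regression, the use of the coefficient of variation as a consistency measure, and the function itself. It is not how my current model works.

```python
from statistics import mean, pstdev

def regress_to_replacement(past_ip, replacement_ip=40.0,
                           base_regression=0.30, max_discount=0.5):
    """Regress a pitcher's average IP toward a replacement level,
    regressing less when his past seasons were consistent.

    past_ip: list of season IP totals (most recent seasons).
    All default parameter values are placeholders, not fitted numbers.
    """
    avg = mean(past_ip)
    # Coefficient of variation: 0 means identical seasons; capped at 1.
    cv = min(pstdev(past_ip) / avg, 1.0) if avg > 0 else 1.0
    consistency = 1.0 - cv  # 1 = perfectly consistent, 0 = very erratic
    # A fully consistent pitcher gets his regression cut by max_discount.
    regression = base_regression * (1.0 - max_discount * consistency)
    return avg - regression * (avg - replacement_ip)

steady = regress_to_replacement([200, 200, 200])
erratic = regress_to_replacement([100, 200, 300])
print(f"steady: {steady:.1f} IP, erratic: {erratic:.1f} IP")
```

Both pitchers average 200 IP, but the steady one is pulled less far toward replacement level than the erratic one.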

As is, I am treating each player's future IP (and also his value, in other models) as a state that can be predicted from different sets of averages (seasonal or multi-season). I am not looking at his performance as a time series, but perhaps I should do so, as well. This sounds a bit like stock projection (or rather it can use the same methods). I've been told that stock analysts have methods to project a stock's established level, and also to estimate the probability that the stock is establishing a new level (improvement or decline), rather than fluctuating at the current level.
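I don't know which methods stock analysts actually use, so the sketch below swaps in a generic two-sided CUSUM detector as a stand-in for "is this a new level or just noise?". It flags a shift rather than estimating a probability, and the baseline, slack, threshold and per-period IP numbers are all made up.

```python
def cusum_level_shift(series, baseline, slack, threshold):
    """Two-sided CUSUM change detector.

    Accumulates deviations from `baseline` that exceed `slack`.
    Returns (index, direction) at the first point where the accumulated
    drift passes `threshold`, or None if the series never leaves its level.
    """
    up = down = 0.0
    for i, x in enumerate(series):
        up = max(0.0, up + (x - baseline - slack))
        down = max(0.0, down + (baseline - x - slack))
        if up > threshold:
            return i, "up"
        if down > threshold:
            return i, "down"
    return None

# Made-up IP per period for two pitchers with an established level of 50.
noisy_but_flat = [48, 53, 50, 47, 52, 49, 51, 50]
stepping_up = [48, 53, 50, 47, 60, 62, 65, 63]

print(cusum_level_shift(noisy_but_flat, baseline=50, slack=5, threshold=20))
# None
print(cusum_level_shift(stepping_up, baseline=50, slack=5, threshold=20))
# (6, 'up')
```

The slack keeps ordinary fluctuation from accumulating, so only a sustained move away from the established level gets flagged.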

However, I'm not sure whether this approach can work for projecting pitchers, so I will just file it away for now. Besides, the yearly stats time series is way too coarse a scale for this kind of analysis. I should at least split the pitcher's performance into quarters before I start mapping his performance time series. Definitely a project for after the season starts, so don't expect to see anything about that here in the near future.

What you should expect, however, is to see something for tweaking the IP projections using injury data. My previous attempts at using rich injury information did not work. However, I now have a better set of injury data, including better pre-season and post-season injury coverage. At the very least, I hope to be able to distinguish between players who miss a season due to injury and those who miss the season because they couldn't crack a major league roster. It's hard to imagine that information not impacting future projections.

Also, my basic value projections (VORP, ERA, etc) are overdue...