My writings about baseball, with a strong statistical & machine learning slant.

Wednesday, March 17, 2010

2010 IP projections


Based on the simple ML system that I wrote about in my previous post, here are my 2010 innings pitched projections. I offer predictions for IP (broken down into starter innings and reliever innings) for all pitchers active in 2009, as well as for those who missed 2009 but had over 120 IP in 2008 (e.g. Ben Sheets, Mike Mussina, etc). The pitchers are ordered by 2009 IP.

I also offer my predictions from an earlier system that does not account for pitchers’ value stats (wins, VORP, saves, etc). I include the difference between the value-driven and non-value-driven projections in the last row. These differences range from +30 IP to -30 IP. The IP_Start model rewards both 2009 value and previous value, while the IP_Relief model is more heavily weighted toward 2009 value. Roy Halladay and Johan Santana gain the most among high-impact starters, while Mariano Rivera and Joe Nathan gain the most among high-impact relievers. More on that later.

It would be in poor form for me to include PECOTA or CHONE predictions next to mine, so let me share a few averages with you instead. In this context, rookies are pitchers who had at least 1/3 IP in 2009, but fewer than 50 IP in their careers. Projections for rookies with no MLB experience are not included. My system does not use minor league stats or scouting reports, so all such rookies would project to about 20-30 IP, depending on age.


| | Actual 2009 | My system (2010) | PECOTA (2010) | CHONE (2010) |
|---|---|---|---|---|
| Mean IP | 65.1 IP | 58.2 IP | 99.9 IP | 91.2 IP |
| Mean IP, non-rookies | 78.2 IP | 66.4 IP | 103.6 IP | 95.1 IP |
| Mean IP, rookies | 14.8 IP | 26.4 IP | 86.0 IP | 76.4 IP |

My system expects an average pitcher to regress by 7 IP from 2009. Since total innings must add up, the roughly 500 * 7 = 3,500 missing IP will be made up by (as yet unknown) pitchers without any major league experience. That’s about 100 IP per team.
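Here's that back-of-envelope arithmetic as a tiny sketch. The pool size of 500 and the 7 IP drop are the rough figures from the paragraph above; the team count of 30 is the only thing I'm adding, and the variable names are just for illustration.

```python
# Rough check of the "missing innings" estimate from the text.
N_PITCHERS = 500     # approximate size of the projected pool (rough figure from the text)
MEAN_DROP_IP = 7     # average projected drop per pitcher, 2009 -> 2010
N_TEAMS = 30         # number of MLB teams in 2010

missing_ip = N_PITCHERS * MEAN_DROP_IP      # innings left for debuting pitchers
missing_ip_per_team = missing_ip / N_TEAMS  # same innings, spread across teams

print(f"{missing_ip} IP overall, {missing_ip_per_team:.0f} IP per team")
```

The per-team figure lands a little above 100 IP, so "about 100 IP per team" is a round number, not a precise one.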

In my system, veteran pitchers will regress by about 12 IP from 2009 on average (including pitchers who retire; more on that later). Rookies who’ve had a cup of coffee in the majors can expect to pitch almost 12 IP more in 2010 than in 2009. All of this is pretty consistent with the averages that I outlined in the previous article.

However, the CHONE and PECOTA predictions are not consistent with these averages. Going by the means above, the average pitcher will increase his innings pitched from 2009 by roughly 26 IP (CHONE) to 35 IP (PECOTA). The average rookie will increase his innings pitched by over 60 IP in either system! The average major league rookie is projected to throw more innings than Mariano Rivera.
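For anyone who wants to check the gaps, they can be read straight off the table of means. This is just subtraction on the published averages; the dictionary layout and the function name are mine, not part of any projection system.

```python
# Mean IP per pitcher, copied from the table of averages in this post.
MEAN_IP = {
    "all":         {"actual_2009": 65.1, "mine": 58.2, "PECOTA": 99.9,  "CHONE": 91.2},
    "non_rookies": {"actual_2009": 78.2, "mine": 66.4, "PECOTA": 103.6, "CHONE": 95.1},
    "rookies":     {"actual_2009": 14.8, "mine": 26.4, "PECOTA": 86.0,  "CHONE": 76.4},
}

def change_vs_2009(group, system):
    """Projected 2010 mean IP minus actual 2009 mean IP for one group."""
    row = MEAN_IP[group]
    return round(row[system] - row["actual_2009"], 1)

for group in MEAN_IP:
    for system in ("mine", "PECOTA", "CHONE"):
        print(f"{group:12s} {system:7s} {change_vs_2009(group, system):+.1f} IP")
```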

I don’t think that any of this is realistic. CHONE and PECOTA allocate over 100% of 2010 innings to veterans, then allocate a further 20% of 2010 innings to rookies, and none of this accounts for pitchers making their major league debuts this year (while my system leaves 100 IP per team for those pitchers).

Some may argue that PECOTA and CHONE make “if he makes the majors” projections. A lot of these innings will end up getting pitched in the minors, and major league innings can be adjusted on a team-by-team basis. I think BP has a manual process where pitchers are selected based on likely playing time, and PECOTA IP projections are adjusted accordingly.

However, I’m not sure that such a process is necessary, nor do I think it’s optimal for projecting team pitching totals. As I showed in my last posts, every year there are pitchers who come out of nowhere to pitch significant innings. Also, the top prospects, as a group, pitch fewer innings than CHONE or PECOTA would lead you to believe. Lastly, there was a nice article on BP recently (by Tommy Bennett, behind the paywall) showing that a team’s 5th and 6th starters pitch comparatively similar innings on most teams. Therefore picking which players will be in the rotation, or who will be on the 25-man roster, is both futile and counterproductive for projecting individual IP.

It would be better to compute an independent set of IP projections that respect the recent averages, and then (possibly) make some small team-based adjustments by hand or automatically. It’s more realistic to say that the Yankees will use 20 pitchers in 2010 whose collective IP is slightly less than the Yankees’ overall 2010 IP, rather than listing their top 10 or 13 pitchers, whose collective IP adds up to or exceeds the Yankees’ overall IP. (FWIW, the Yankees used 24 pitchers in 2009.) Now, using projected ERA or another value rate stat and realistic IP projections, it should be possible to approximate the Yankees’ overall projected pitching value in 2010 in a robust manner.
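To make that last step concrete, here's a minimal sketch of one way the roll-up could work. It is not something I've built yet. It weights each pitcher's projected ERA by his projected IP, then fills whatever innings remain with a placeholder ERA for the unknown debutants. The staff, the 1,450 IP team target, and the 5.00 fill ERA are all made-up numbers for illustration.

```python
def team_projection(staff, team_ip_target, fill_era):
    """Roll individual (name, IP, ERA) projections up to a team ERA.

    Innings not covered by the listed pitchers are assigned `fill_era`,
    a stand-in for pitchers with no major league record (an assumption).
    """
    listed_ip = sum(ip for _, ip, _ in staff)
    # ERA = 9 * ER / IP, so projected earned runs = IP * ERA / 9.
    earned_runs = sum(ip * era / 9.0 for _, ip, era in staff)
    unallocated_ip = max(team_ip_target - listed_ip, 0.0)
    earned_runs += unallocated_ip * fill_era / 9.0
    total_ip = listed_ip + unallocated_ip
    return listed_ip, unallocated_ip, 9.0 * earned_runs / total_ip

# Hypothetical staff; names and numbers are invented, not real projections.
staff = [
    ("Starter A", 200.0, 3.50),
    ("Starter B", 170.0, 4.10),
    ("Starter C", 150.0, 4.40),
    ("Swingman D", 90.0, 4.80),
    ("Closer E", 65.0, 2.90),
]

listed, unallocated, era = team_projection(staff, team_ip_target=1450.0, fill_era=5.00)
print(f"listed {listed:.0f} IP, unallocated {unallocated:.0f} IP, team ERA {era:.2f}")
```

The point of the design is that the listed pitchers never have to sum to the full team total; the leftover innings are handled explicitly instead of being forced onto the top of the roster.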

Hopefully I’ll have time to complete all (most?) of those steps before the season gets under way.

Although my system is much better than CHONE or PECOTA at generating IP estimates for rookies, and somewhat better at projecting veterans, it is still far from ideal. Most notably, I think my (non)handling of retired pitchers hurts my system’s ability to project veteran starters’ IP rationally.

I think it’s important to train a system with examples where veterans don’t come back strong after a bad or injury-plagued year, and also to include examples of sudden falls to 0 IP after decent seasons. Both of these things do happen to pitchers, and the model needs to take that into account. However, I think my system is overdoing the downward regression bit for starting pitchers. The only pitcher who threw at least 130 IP in 2009 and is expected to increase his total in 2010 is Johan Santana (who is expected to move from 166.7 IP to 174.0 IP). A typical starter who threw 180 IP in 2009 with solid results (Roy Oswalt or Joe Saunders) is expected to throw 30 fewer IP in 2010. This sounds a bit harsh to me.

The cases of voluntary retirement for guys throwing over 100 IP in the previous season are few, but they might be skewing the results for all starting pitchers in my model. Greg Maddux and Mike Mussina were projected for 140 IP and 180 IP respectively in 2009 by my system. Both retired and threw 0 IP. In a model designed to minimize root mean square error, a few such examples can make a significant impact. Both Mussina and Maddux retired well before the 2009 regular season. My model should have been privy to that information, and ascribed their severe downturn in IP to retirement, and not so much to other factors.
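A toy calculation shows how much weight two such misses can carry under squared error. The 140 IP and 180 IP errors are the Maddux and Mussina misses described above; the 98 "typical" starters with a flat 30 IP miss each are synthetic numbers I made up purely to have a baseline, not output from my model.

```python
import math

def rmse(errors):
    """Root mean square error of a list of (projected - actual) IP misses."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Synthetic baseline: 98 starters, each missed by 30 IP (invented for illustration).
typical_errors = [30.0] * 98

# Maddux and Mussina: projected 140 and 180 IP, actual 0 IP after retiring.
retiree_errors = [140.0, 180.0]

all_errors = typical_errors + retiree_errors
retiree_share = sum(e * e for e in retiree_errors) / sum(e * e for e in all_errors)

print(f"RMSE without retirees: {rmse(typical_errors):.1f} IP")
print(f"RMSE with retirees:    {rmse(all_errors):.1f} IP")
print(f"Share of squared error from the two retirees: {retiree_share:.0%}")
```

In this toy setup, 2 pitchers out of 100 account for more than a third of the total squared error, which is the kind of pull that could drag every starter's projection down.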

Ben Sheets and Jeff Francis were also projected to throw a bunch of innings in 2009, but missed all of last year due to injury. Their cases should absolutely be included in the model. However, before I can build a decent injury-based model for IP, I need to account properly for known preseason retirement. Now where can I get that damn retirement data…
