My writings about baseball, with a strong statistical & machine learning slant.

Sunday, December 27, 2009

Updated data for K9 projections


I've been in Thailand & Australia these last couple weeks, so the update has taken longer than I expected.

I re-ran my scrape of FanGraphs to get pitch data (average velocity & percentage thrown for every pitch) for all pitchers active during 2002-2009. Before, I was missing many 2009 seasons, and was systematically missing some others. I also got updated biographical information (height & weight) that matches what can be found on MLB.com. So the data set is now complete (with a few very small exceptions).

Again, my idea was to see how well we can predict a pitcher's K9 (strikeouts per 9 innings) rate, given only what pitches he throws and some biographical data like height, weight, age and handedness. I cheat a little by also using his IP in the model, although this is just to separate the starters from the relievers (we would expect starters and relievers, especially left-handed relievers, to have different K9 rates with the same pitch arsenal).
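For reference, K9 is just strikeouts scaled to nine innings. A minimal Python sketch of that arithmetic follows. The function names are illustrative and not part of the original analysis, and the helper assumes innings are written box-score style, where the digit after the decimal point counts outs (so 200.1 means 200 and 1/3 innings).

```python
def innings_to_outs(ip_text):
    """Convert box-score innings such as '200.1' to a count of outs."""
    whole, _, frac = ip_text.partition(".")
    extra_outs = int(frac) if frac else 0
    if extra_outs not in (0, 1, 2):
        raise ValueError(f"invalid innings notation: {ip_text!r}")
    return int(whole) * 3 + extra_outs


def k_per_9(strikeouts, ip_text):
    """Strikeouts per nine innings (nine innings are 27 outs)."""
    outs = innings_to_outs(ip_text)
    if outs == 0:
        raise ValueError("no innings pitched")
    return 27 * strikeouts / outs
```

Working in outs rather than decimal innings avoids treating 200.1 as 200.1 innings, which would slightly understate the rate.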

In short, running with the updated data led to almost no changes in my ability to project K9 rates from this non-performance data. The correlation between projected and actual rates is still right at 0.6.
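The correlation quoted here is presumably the ordinary Pearson correlation between projected and actual K9 (an assumption, since the exact statistic is not stated). A minimal sketch of that calculation in plain Python, with made-up numbers standing in for the real pitcher seasons:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only, not taken from the data set.
projected = [6.1, 7.4, 8.0, 5.5, 9.2]
actual = [5.8, 8.1, 7.2, 6.0, 10.1]
r = pearson_r(projected, actual)
```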

The graph is above. The listing of pitcher seasons (with 60IP+), fastball data, and actual & projected strikeout rates is available here.

The new formula still breaks the pitchers into 3 categories: first by average fastball speed, and then (for the slower throwers only) by whether they are starters or relievers. For those who care, here is the new formula:

Overall:

FB_fg_vel <= 91.55 :
| IP Start <= 5.5 : LM1 (757)
| IP Start > 5.5 : LM2 (1353)
FB_fg_vel > 91.55 : LM3 (1091)

Where LM1, LM2 and LM3 are linear models that break down as follows:

LM num: 1
SO9 =
0.6879 * THROWS=L
- 0.0009 * HEIGHT_OVER_6ft
+ 0.0091 * WEIGHT
- 0.0293 * AGE
- 0.0001 * IP Start
- 0.0238 * FB_fg_per
+ 0.023 * SL_fg_per
- 0.0002 * CT_fg_per
+ 0.0215 * CB_fg_per
+ 0.0181 * CH_fg_per
+ 0.049 * SF_fg_per
+ 0.3019 * FB_fg_vel
- 0.0001 * SL_fg_vel
- 0.1251 * CH_fg_vel
- 0.559 * rep_depth
- 0.2189 * rep_offerings
- 8.7262

LM num: 2
SO9 =
0.611 * THROWS=L
- 0.0006 * HEIGHT_OVER_6ft
+ 0.0001 * WEIGHT
+ 0.0002 * AGE
+ 0.0012 * IP Start
- 0.0488 * FB_fg_per
+ 0.0088 * SL_fg_per
- 0.0179 * CT_fg_per
+ 0.0159 * CB_fg_per
- 0.0189 * CH_fg_per
+ 0.0003 * SF_fg_per
+ 0.2786 * FB_fg_vel
- 0.0001 * SL_fg_vel
- 0.0144 * CB_fg_vel
- 0.0282 * CH_fg_vel
+ 0.009 * SF_fg_vel
- 0.2939 * rep_depth
- 0.1174 * rep_offerings
- 12.5946

LM num: 3
SO9 =
1.3079 * THROWS=L
- 0.0976 * HEIGHT_OVER_6ft
+ 0 * WEIGHT
+ 0.0289 * AGE
- 0.0013 * IP Start
- 0.0773 * FB_fg_per
- 0.0314 * CT_fg_per
- 0.0255 * CH_fg_per
+ 0.0164 * SF_fg_per
+ 0.5636 * FB_fg_vel
- 0.0002 * SL_fg_vel
+ 0.0091 * CB_fg_vel
- 0.0004 * CH_fg_vel
- 0.3258 * rep_depth
- 0.2241 * rep_offerings
- 39.442
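
To make the formula concrete, here is a minimal Python sketch of how the tree and one of the linear models could be applied to a single pitcher season. The thresholds and the LM3 coefficients are copied from the listing above. Everything else is an assumption for illustration: the function names, the dict-based input, reading THROWS=L as a 0/1 indicator, and counting any feature missing from the input as 0. The units and exact definitions of IP Start, HEIGHT_OVER_6ft, rep_depth and rep_offerings are not spelled out in this post, so the inputs must already be in whatever form the model was trained on.

```python
# Coefficients for LM3 (FB_fg_vel > 91.55), copied from the listing.
LM3_INTERCEPT = -39.442
LM3_COEFS = {
    "THROWS=L": 1.3079,  # assumed 1 for left-handers, 0 otherwise
    "HEIGHT_OVER_6ft": -0.0976,
    "WEIGHT": 0.0,
    "AGE": 0.0289,
    "IP Start": -0.0013,
    "FB_fg_per": -0.0773,
    "CT_fg_per": -0.0314,
    "CH_fg_per": -0.0255,
    "SF_fg_per": 0.0164,
    "FB_fg_vel": 0.5636,
    "SL_fg_vel": -0.0002,
    "CB_fg_vel": 0.0091,
    "CH_fg_vel": -0.0004,
    "rep_depth": -0.3258,
    "rep_offerings": -0.2241,
}


def select_model(features):
    """Walk the tree: split on fastball velocity, then on IP Start."""
    if features["FB_fg_vel"] > 91.55:
        return "LM3"
    return "LM1" if features["IP Start"] <= 5.5 else "LM2"


def apply_linear_model(coefs, intercept, features):
    """Intercept plus the sum of coefficient * feature value.

    Assumption: a feature absent from `features` contributes 0.
    """
    return intercept + sum(c * features.get(name, 0.0) for name, c in coefs.items())
```

For example, `select_model({"FB_fg_vel": 93.0, "IP Start": 6.0})` picks LM3, and `apply_linear_model(LM3_COEFS, LM3_INTERCEPT, features)` then gives the projected SO9. LM1 and LM2 would be handled the same way with their own coefficient tables.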

The difference between the new and old models is trivial. Overall, I think I get something like a 2% decrease in root mean squared error, which is not significant.
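
For readers who want to reproduce this kind of comparison, here is a minimal sketch of root mean squared error and the relative change between two models. The function names are hypothetical, and no numbers from the actual comparison are used.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predictions and observed values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def percent_decrease(old_error, new_error):
    """Relative drop from old_error to new_error, in percent."""
    return 100.0 * (old_error - new_error) / old_error
```

Calling `percent_decrease(rmse(old_projected, actual), rmse(new_projected, actual))` on the same set of pitcher seasons gives the kind of figure quoted above.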

I'm just including this information for completeness.
