My writings about baseball, with a strong statistical & machine learning slant.

Wednesday, April 7, 2010

Predicting ERA (the trials & tribulations thereof)

Unlike my IP predictions, what I did for ERA/FIP was very simple. In fact, my 2009 analysis showed that my ERA predictions (retroactively, and in sample) were no better than CHONE's ERA predictions (made before the 2009 season, so obviously out of sample). My 2009 retroactive IP predictions were also in-sample, but they were much better than those of PECOTA or CHONE, and my methods were clear and logical (even if the actual formula was somewhat complicated).


However, I think even my simple ERA projection system is worth explaining, if only to organize my thoughts about making future improvements.


Building an ERA prediction model involves two major choices:
  • How do we select and weight instances (i.e. pitcher seasons)?
  • Do we want to predict ERA, RA, FIP, or something else?
It isn't obvious at first why these choices might be important. Indeed, one could build a decent system (Tom Tango's MARCEL) by completely ignoring them. MARCEL predicts that future ERA should be a linear combination of previous years' ERAs. Then one might make adjustments by considering a pitcher's defense, park factor, age, and the natural regression toward the mean. However, if one is to build a regression model (which is what I'm doing here, mostly), then one must consider how to properly weight the training instances, and whether to predict ERA directly or to predict something else that translates into ERA easily. Let me explain why.
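To make the "linear combination of previous years' ERAs" idea concrete, here is a minimal sketch of a MARCEL-style weighted average. The function name and the 3/2/1 weights are placeholders for illustration, not necessarily the values MARCEL itself uses, and the sketch leaves out the adjustments for age, park, and regression toward the mean.

```python
def weighted_past_era(past_eras, weights=(3, 2, 1)):
    """Weighted average of past ERAs, most recent season first.

    The default weights are illustrative placeholders. If fewer seasons
    than weights are supplied, only the matching weights are used.
    """
    pairs = list(zip(past_eras, weights))
    if not pairs:
        raise ValueError("need at least one past season")
    total_weight = sum(w for _, w in pairs)
    return sum(era * w for era, w in pairs) / total_weight


# Hypothetical pitcher: 3.50 ERA last year, then 4.10, then 4.40.
projection = weighted_past_era([3.50, 4.10, 4.40])
print(round(projection, 2))
```

A regression model differs from this in that the weights are learned from data rather than fixed in advance, which is why the choice of training instances and target matters.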

Weighting instances by sample size

In 2009, Brandon Webb posted an ERA of 13.50. He wasn't terribly unlucky with defense or balls in play, since his FIP was 10.85. Ok, he was unlucky with home runs, so his xFIP was 6.05. How significant is all this? Were all the predictive systems wrong in predicting Webb for an ERA below 4.0? Of course not! Brandon Webb pitched only 4.0 IP in 2009.

When we evaluate (or train) predictive models, there are two ways of handling cases like Webb's. We can throw them out using an IP cutoff. Or we can reduce the significance of outliers. Or we can reduce their significance with a non-constant instance weighting model (i.e. we don't treat errors for all pitchers on the same scale). Ok, that's three methods, but the last two are versions of the same idea.

Most analyses that compare or train models just throw out the low-IP cases. However, where do we draw the cutoff line? Perhaps we don't care about predicting ERA for guys who throw fewer than 40 IP, since the variance for those guys is so high anyway. But what makes a 41 IP guy significant, and a 39 IP guy insignificant? I like using all of the data!

In training predictive models, as in training in the gym, specificity of training applies. One gets better at rock climbing (or running, or pitching) by climbing rocks (or running, or throwing a baseball). Similarly, if we want to predict ERA for guys who throw fewer than 40 IP in a season, we must include sub-40 IP cases in our ERA training.

I discussed all of this at length when I wrote about training strikeout rate (SO9) models. Here is a graph of my basic SO9 prediction model's error, bucketed by IP:

[Graph: SO9 prediction error (VAR and STDEV), bucketed by IP]

In the graph above, I refer to SO9 error as "VAR" and "STDEV" since modeling error by IP allows us to estimate SO9 variance and standard deviation by IP. Assuming that my model is not biased in predicting SO9 for pitchers with different IP, the difference in SO9 error must reflect the variance in observed SO9.
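The bucketing described above can be sketched roughly as follows. This is an illustration on synthetic data, not the original analysis: the 1/IP variance law, the constant 60.0, and the bucket edges are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for model errors: variance shrinks as 1/IP (assumed).
ip = rng.uniform(5, 230, size=20000)
errors = rng.normal(0.0, np.sqrt(60.0 / ip))

# Bucket the squared errors by IP and average them within each bucket.
edges = np.array([5, 10, 20, 40, 80, 160, 230])
n_buckets = len(edges) - 1
bucket = np.digitize(ip, edges[1:-1])  # bucket index 0..n_buckets-1
mean_ip = np.array([ip[bucket == b].mean() for b in range(n_buckets)])
var_by_bucket = np.array(
    [(errors[bucket == b] ** 2).mean() for b in range(n_buckets)]
)
stdev_by_bucket = np.sqrt(var_by_bucket)

# Fit VAR ~ a * IP**exponent with a straight line in log-log space.
exponent, log_a = np.polyfit(np.log(mean_ip), np.log(var_by_bucket), 1)
print(exponent)
```

If the model is unbiased across IP, the fitted curve gives an expected variance for any IP value, which is the quantity needed for weighting training instances.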

If we weight training instances by 1/(expected VAR), then it is possible to build a model that trains on all of the data, but assigns less importance to prediction errors for cases that have a known high variance in the dependent variable (the observed SO9). Since we can set the baseline arbitrarily, and only relative differences in weights are meaningful, I set the baseline to 100 IP --> 1.0 training weight. Here are some training weights for the strikeout model, based on 2009 pitcher seasons:

Pitcher Season        IP      Instance weight
Zack Greinke_09       229.3   1.72
Mariano Rivera_09     66.3    0.82
Matt Capps_09         54.3    0.73
Brandon Webb_09       4.0     0.15
Ivan Bezdomny_09      0.0     0.0

By using a system like this one, it's possible to include Brandon Webb's results in my training set, without giving them unfair weighting (and thus regressing too far toward the mean for guys with more meaningful sample sizes).
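As a rough illustration of the weighting scheme, the sketch below assumes the expected variance follows a power law in IP, so the weight is (IP/100) raised to some exponent. The exponent 0.6 is a placeholder that gives weights in the same ballpark as the table above but does not reproduce it exactly; the actual formula is not given here. The pitcher data in the fit is synthetic.

```python
import numpy as np


def instance_weight(ip, exponent=0.6, baseline_ip=100.0):
    """Weight proportional to 1/(expected variance), normalized so that
    baseline_ip maps to 1.0. The power-law form and exponent are assumptions."""
    return (np.asarray(ip, dtype=float) / baseline_ip) ** exponent


# Approximate weights for the IP values in the table above.
table_ip = np.array([229.3, 66.3, 54.3, 4.0, 0.0])
table_weights = instance_weight(table_ip)

# Synthetic pitcher seasons: observed SO9 is noisier at low IP.
rng = np.random.default_rng(1)
n = 500
ip = rng.uniform(1, 230, size=n)
prev_so9 = rng.normal(7.0, 1.5, size=n)
next_so9 = 7.0 + 0.6 * (prev_so9 - 7.0) + rng.normal(0.0, 3.0 / np.sqrt(ip))

# np.polyfit applies w to the unsquared residual, so pass sqrt(weight)
# in order to weight each squared error by the instance weight.
slope, intercept = np.polyfit(
    prev_so9, next_so9, 1, w=np.sqrt(instance_weight(ip))
)
print(table_weights.round(2), slope)
```

Every season stays in the training set, but a 4 IP season pulls on the fit far less than a 200 IP season does.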

As my SO9 article showed, this method (compared to simple IP cutoffs with equal weights assigned to all instances) had the same error on higher-IP pitchers, but had much lower error for low-IP guys. As the specificity of training principle would indicate, you can't predict instances well if you don't train on similar instances.

Since FIP suggests that ERA is linear with respect to SO9 rates, I simply used the SO9 formula to weight ERA training instances. This is lazy. I should compute a new formula for estimating ERA variance from IP, but I doubt it would change the results noticeably. As long as the ERA variance follows a similar curve to SO9 variance, with a similar exponent, the difference would be negligible.

Should we predict ERA directly?

Whether we weight pitcher seasons as described above, or exclude all low-IP pitcher seasons, it's very hard to predict ERA much better than by assigning the league average to all pitchers. A simple model predicting ERA from previous years' ERA, FIP and QERA gave me no better than a 0.22 correlation (r) with the actual data. In part, this is due to park and league factors. While the vast majority of pitchers should be predicted for an ERA between 4.0 and 5.0, league and park factors can affect ERA by ±0.3 runs or more. The average run context for the White Sox is about 0.7 R/G higher than for the Giants. About 93% of that difference shows up in ERA. Run factor differences and the variance due to defense and BABIP luck dwarf the skill differences for most pitchers.

Of course, we could even out some of those differences by removing the BABIP luck, the defense, and the park factors from a pitcher's performance. Or we can predict FIP, which does most of those things already. Within the DIPS (Defense Independent Pitching Statistics) family of stats, FIP may not be the most accurate ERA approximation, but it's a simple stat, and more complex stats are not provably more effective predictors of park-neutral ERA. The FIP formula predicts ERA simply from strikeout rate, walk rate and HR rate:

FIP = (HR*13 + BB*3 - K*2)/IP + 3.2

While FIP is not quite park or league neutral, it is less biased by park factors than ERA. Also, since FIP explicitly ignores the sequencing and interaction between home runs, walks and strikeouts, it is possible to adjust those rates independently for park and league context. Walk rates do not change much by league or park, but strikeout rates change significantly by league, and HR rates are different for each major league park. Best of all, one can predict FIP in a straightforward manner (from previous FIP and QERA) at a 0.32 correlation (r) with the real FIP.
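Here is a small sketch of the FIP formula above, with optional multipliers to show how each component could be adjusted independently. The `*_factor` parameters and the example stat line are hypothetical, and the 0.9 HR factor is only an illustration of scaling home runs down for a homer-friendly park, not a real park factor.

```python
def fip(hr, bb, k, ip, hr_factor=1.0, bb_factor=1.0, k_factor=1.0,
        constant=3.2):
    """FIP = (HR*13 + BB*3 - K*2)/IP + constant.

    The *_factor arguments are hypothetical park/league adjustments;
    1.0 leaves a component unchanged.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    adj_hr = hr * hr_factor
    adj_bb = bb * bb_factor
    adj_k = k * k_factor
    return (13 * adj_hr + 3 * adj_bb - 2 * adj_k) / ip + constant


# Made-up season: 20 HR, 50 BB, 200 K in 200 IP.
raw = fip(hr=20, bb=50, k=200, ip=200.0)
# Same season with home runs scaled down by a hypothetical 0.9 factor.
adjusted = fip(hr=20, bb=50, k=200, ip=200.0, hr_factor=0.9)
print(round(raw, 2), round(adjusted, 2))
```

Because the three terms enter the formula separately, adjusting one rate does not require touching the others, which is not true of ERA.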

Here is another way of thinking about this:

I build a model to predict ERA that takes past differences in ERA/FIP between pitchers into account. It's trying to do the right thing, but variance due to park factors and BABIP luck is so high that the system freaks out and sets everyone to around a 4.6 ERA.

Alternatively, I build a model to predict FIP. The differences between pitchers are roughly the same as before. But by eliminating BABIP luck and by reducing the variance due to park factors, the model is more confident in assigning FIP predictions further away from the mean.

I chose the blue pill, and I think it should be pretty clear why.
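The effect can be seen in a toy simulation. The sketch below gives every pitcher the same spread of true skill, then measures a "noisy" stat and a "less noisy" stat and fits a one-variable regression from past to future. All of the standard deviations are invented for illustration and are not estimates of real ERA or FIP noise.

```python
import numpy as np


def fitted_slope(noise_sd, skill_sd=0.5, n=5000, seed=0):
    """Slope of future-on-past regression when both seasons equal
    true skill plus independent noise with the given standard deviation."""
    rng = np.random.default_rng(seed)
    skill = rng.normal(0.0, skill_sd, size=n)
    past = skill + rng.normal(0.0, noise_sd, size=n)
    future = skill + rng.normal(0.0, noise_sd, size=n)
    slope, _intercept = np.polyfit(past, future, 1)
    return slope


era_like_slope = fitted_slope(noise_sd=1.0)  # lots of luck and context
fip_like_slope = fitted_slope(noise_sd=0.5)  # same skill, less noise
print(era_like_slope, fip_like_slope)
```

In theory the slope is skill variance divided by (skill variance + noise variance), so the noisier stat gets a slope near 0.2 and the cleaner one near 0.5. A smaller slope means every prediction is pulled closer to the league mean, even though the underlying skill differences are identical.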

Conclusions & Ongoing Work

By comparison, a non-regression system like MARCEL takes past performance as an indicator of future performance, without giving as much consideration to deviation from the mean. It is more likely to predict exceptional cases well (since it's not afraid to stray from the baseline prediction), but it will more often be fooled by unsustainable outlier performance.

Not surprisingly, my model is still heavily regressed toward the mean, even when I predict FIP and translate that into team-specific ERA. Very few starting pitchers get assigned an FIP prediction under 4.0. Tim Lincecum has by far the lowest FIP prediction for 2010 (among starters), with a projected 3.29 FIP. Even with a favorable league and park, I project his ERA at 3.04. The second-best starter projection for 2010 is Zack Greinke with a 3.47 FIP. His adjusted ERA projection comes out to 3.53. That means that Greinke is equally likely to post a 4.03 ERA and a 3.03 ERA.

I'm ok with these projections, since even great pitchers often revert toward the mean and have less room for further improvement. However, my system projects an FIP below 5.0 for some very marginal pitchers, and no 2010 pitcher is projected for an FIP above 5.79. Natural selection (i.e. the MLB front offices) tends to make sure that high-FIP pitchers don't stay around for long in the majors, but my system is still making a mistake by regressing everyone aggressively toward the mean, simply for being on a major league roster. Distribution of talent is not symmetrical in MLB, so regression toward the mean should not be either. I'm not sure how to properly account for this.

I am making major improvements to my FIP/ERA projection formula, so I will not discuss the formula I currently use in much more detail. Right now, my adjustments for park and league are not very good, and I'm not adjusting the HR% part of FIP for the HR rates of different parks. Worst of all, I am not adjusting the previous years' FIP features by sample size. So if a pitcher had an FIP of 2.48 in only 31.0 IP in 2009, I treat that the same for future predictions as a 2.48 FIP in 200+ IP. As a result, Neftali Feliz is my second-best FIP projection for 2010, behind only Jonathan Broxton. Actually, 31.0 IP isn't bad for a 1-year projection, and my system is used to handling those sample sizes. Unfortunately, that's also his 3-year total, and my system assigns more weight to 3-year averages than to last year's figures. I can only imagine the ridiculous projection it would have given to Joba Chamberlain going into 2008!
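One common way to adjust a rate stat by sample size is to blend it with the league average, giving the league average a fixed number of "phantom" innings. The sketch below shows that idea only; it is not the fix I have implemented, and both the 4.3 league FIP and the 100 IP prior are made-up values.

```python
def regress_fip(fip, ip, league_fip=4.3, prior_ip=100.0):
    """Shrink an observed FIP toward the league average.

    The observed FIP counts for `ip` innings and the league average for
    `prior_ip` innings. Both defaults are hypothetical placeholders.
    """
    if ip < 0:
        raise ValueError("ip must be non-negative")
    return (fip * ip + league_fip * prior_ip) / (ip + prior_ip)


# The same 2.48 FIP over two very different sample sizes.
short_sample = regress_fip(2.48, 31.0)
full_season = regress_fip(2.48, 200.0)
print(round(short_sample, 2), round(full_season, 2))
```

With these placeholder values the 31 IP season lands much closer to the league average than the 200 IP season, so the two no longer look identical as input features.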

I will fix the most pressing issues, and soon I'll have a very good projection for FIP/ERA. Stay tuned.
