My writings about baseball, with a strong statistical & machine learning slant.

Friday, April 23, 2010

Pitcher injury effects on projected value (Part I: Overview)


I often read about pitchers having high "injury risk," but what exactly does that mean? When projecting pitcher performance (for fantasy or otherwise), should I be concerned that a pitcher will miss time with injury, or that he will be ineffective when he pitches? I will try to answer this question.

Of course, injuries differ, and even the same injury may affect different pitchers differently. Just bear with me on this one. If modeling the effects of injuries on pitcher performance were easy, it would be less interesting.

Measuring Pitcher Value

If a pitcher's value is defined as runs saved, then his value can be expressed as:
Runs saved  = Innings pitched (IP) * Runs saved per inning (ERA - replacement ERA)
When we look at past performance, we may want to consider the pitcher's defense, park factor and run context. However, if we are projecting future value, these issues can safely be ignored. I am projecting ERA before park and defense adjustments; the other factors affecting ERA are so small compared to the variance in projections that they can be ignored.

A commonly used statistic to measure pitcher runs saved is VORP. Therefore in this study, I project VORP from my IP and ERA projections. A lot of very smart people (such as Tom Tango) have argued very convincingly that VORP sets the replacement ERA level way too low. They are right, but for this study, setting the replacement level for ERA is not very important, as long as it is consistent. FWIW, VORP sets the replacement level ERA to around 5.4 (although it does so via RA). If we set the replacement ERA level to something like 4.9 for relievers and 5.3 for starters (where I think it belongs), there would not be much of a difference to this study.

In any case, I am trying to see how injury data can help me predict VORP, both by changing IP projections, and by changing ERA projections.
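The arithmetic above is simple enough to sketch in code. This is a minimal illustration, not my actual projection pipeline; the function name is mine, and the 5.4 default is the VORP-style replacement ERA discussed above.

```python
# Minimal sketch: runs saved above replacement, from IP and ERA projections.
# The 5.4 default replacement ERA is the VORP-style level from the text.

def projected_vorp(ip: float, era: float, repl_era: float = 5.4) -> float:
    """Runs saved vs. a replacement-level pitcher over `ip` innings."""
    return ip * (repl_era - era) / 9.0

# A 180 IP, 3.60 ERA starter against a 5.4 replacement ERA:
# 180 * (5.4 - 3.6) / 9 = 36 runs saved.
```

Raising or lowering `repl_era` shifts every pitcher's value by the same per-inning amount, which is why the exact replacement level matters less here than its consistency.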

Data Sources & Results

I am using rich injury data from Corey Dawkins's injury tool. Using this data, I generate features like "did he have surgery in year X," "did he hit the DL in camp," and "how many days did he spend on the DL."

I have written about using this data to improve IP projection. Executive summary: there are several (7) categories of injuries that have large predictive effects on IP. 

More recently, I looked at how injury data can improve my FIP projections. Some features were useful, but not nearly as many as for the IP projections, nor by as much. I will write more about the details of this work later. I project ERA by translating my FIP projections, so I will write about the two interchangeably.

Now that I have IP and ERA projections, both with and without injury features, I can compute four different versions of the VORP (runs saved) projection, and see which one is best at predicting actual VORP. Since pitchers without an injury history in Corey's database will not have their IP or ERA projection affected at all, I only include the pitchers affected by injury history. (These are all pitcher seasons 2005-2009 for which the pitcher either pitched or pitched the previous season & didn't retire. I'm trying to avoid selection bias by not ignoring projections for 0 IP seasons, or for little-used pitchers.)

                           Correlation to (real) VORP    STDEV from baseline projection
Basic (no injuries)        0.587                         0.0
ERA with injuries          0.599                         1.2 VORP
IP with injuries           0.619                         2.8 VORP
ERA & IP with injuries     0.624                         3.2 VORP


Using injury data improves my ability to project both IP and ERA. However, the IP changes are both more useful in predicting VORP and larger in their average effect on VORP.

None of this is surprising. If we know that a pitcher had Tommy John surgery last year, we should expect his IP to drop (usually to 0 IP) the next season. Also, we can project his IP to recover the season afterward to a higher level than he would otherwise be projected for. However, how should we expect his ERA to change when he comes back? It's hard to say.

Why injuries don't affect ERA/FIP projections much

Even for injuries that do not typically lead to a missed season, the effects on ERA are harder to predict than the effects on IP. Whereas there were 35 individual injury features that affect the IP projections, only 5 injury features had any effect on FIP projection (that my model was able to pick up). Here are those features:
-0.4223 * inj_elbow_surg_not_tj_2008
+ 0.0671 * inj_anyDL_2008
+ 0.1826 * inj_dl_camp_2008
+ 0.1641 * inj_anyDL_average
+ 0.2736 * inj_surgery_ip_average
Without completely explaining my notation, FIP projections increase if a pitcher was DL'ed last year, DL'ed in camp before the current season, or DL'ed in the previous three years. Recent non-Tommy John elbow surgery lowers the FIP projection, although any recent surgery increases the FIP projection.

The features are pretty non-specific, and the changes are not large (very few pitchers have their FIP projections affected by more than ±0.3 FIP). Mostly, the injury features allow my model to reward pitchers who did not hit the DL or undergo surgeries in recent years. Contrast this with the extensive injury-based adjustments that I found for my IP model.
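As a sketch, the adjustment above amounts to a small linear model added on top of the injury-free FIP projection. The coefficients are the ones listed; the function and dictionary names are my own, and the feature values (0/1 flags or multi-year averages) follow the post's notation.

```python
# Linear injury adjustment on top of an injury-free FIP projection.
# Coefficients are the ones quoted in the post; names are illustrative.

INJURY_WEIGHTS = {
    "inj_elbow_surg_not_tj_2008": -0.4223,
    "inj_anyDL_2008":              0.0671,
    "inj_dl_camp_2008":            0.1826,
    "inj_anyDL_average":           0.1641,
    "inj_surgery_ip_average":      0.2736,
}

def fip_with_injuries(baseline_fip, features):
    """Add the weighted injury features to the baseline FIP projection."""
    adjustment = sum(weight * features.get(name, 0.0)
                     for name, weight in INJURY_WEIGHTS.items())
    return baseline_fip + adjustment
```

A pitcher with no injury history gets an empty feature dict and keeps his baseline projection unchanged, which is why only injury-affected pitchers enter the comparison above.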

Selection bias?

I was surprised not to find stronger effects on FIP from injury features. Maybe there is a selection bias in the way I look for effects on FIP? After all, if a pitcher can't pitch well, maybe he will throw fewer innings and thus not be included in the data set?

For my FIP training, I include all pitcher seasons. However, I weight instances by the inverse of expected FIP variance, estimated from real IP. I wrote about this in my previous post. The lower the pitcher's actual IP, the less weight I put on projecting his FIP correctly. However, the effect for IP > 30 is small, and no pitcher seasons are completely excluded.

Still, to check for bias, I trained an FIP model on pitchers with 40+ IP, giving all of them equal training weight. Then I looked at the effects of injuries on this model. Nothing much changed.

Another possible manifestation of selection bias: pitchers who are below average to begin with can't absorb an injury-related performance drop, so pitchers who over-perform after injuries are over-represented. To test this possibility, I excluded all pitchers with a projected FIP > 4.2 (without considering injuries) and trained an injury model on the rest. Again, no noticeable changes.

Conclusions? Effects on Fantasy?

I am not going to claim that injury history cannot help predict FIP/ERA. However, while it was fairly easy to find many injury features that have a significant effect on predicting IP, that was not true for predicting FIP/ERA. Moreover, the features that do affect the FIP/ERA projection tend to be fairly general, while some very specific features affect the IP projection. I'd love to see someone else research this issue and come up with better results, but this is what I got.

If my observations prove to be true, what does this mean for projecting "injury risk" for fantasy pitchers?

Beyond the obvious "he had Tommy John surgery, don't draft him" cases, there are many situations where a pitcher has significant risk of missing time due to injury. The clearest example is major shoulder injuries. Pitchers with shoulder-related DL stints are always at risk for another stint on the DL, whether or not they had surgery on the shoulder, and whether or not they were injured recently.

However, past shoulder injuries have no noticeable effect on FIP/ERA when projecting future performance from past results. Therefore, if he pitches, there is no statistical reason to expect his ERA to dip or rise based on a history of shoulder injuries. If you want a high injury-risk sleeper for your fantasy team, take a pitcher with a dodgy injury history but strong recent performance when healthy.

However, if a pitcher's recent performance wasn't good, don't assume it will improve after surgery. Surgery is meant to get a player back on the field, and those effects are consistent enough to measure. Surgery does not typically improve pitcher performance. I am trying to find examples where it does, but I have yet to find any.

I will follow up with some specific examples, and with updated projections for 2010. I know the season has already started, but I think it's still early enough to make some of those projections interesting. 

Sunday, April 11, 2010

How to weight FIP/ERA instances by IP

I like training on all of the available data, rather than only on data with a "necessarily high IP sample size." However, if I am trying to establish a relationship (between fastball speed and strikeout rate, between FIP and ERA, or between past and future ERA, for example), I cannot simply use samples based on 5 IP and 200 IP in the same way. When fitting a function to my data, I will invariably generate larger errors for data points based on smaller sample sizes. Thankfully, I have a 1960s-era statistics book to help me out. According to M.G. Bulmer in Principles of Statistics:
If the form of the relationship between the variance of Y [dependent variable] and x [independent variable] is known, for example if the variance is known to be proportional to x, more efficient estimators can be obtained by weighting the observations with weights inversely proportional to their variances. In general, however, small departures from normality or homoscedasticity will have little effect on inference about the regression line and may be ignored.
In other words, if we can approximate the variance of our dependent variable by some function, then we can weight all data points in reverse proportion to that variance. But if the function does not suggest large differences in variance, we need not bother with the weighting.

Therefore, while many baseball studies are based on data points pruned by IP cutoffs, my studies use all available data, with points weighted by the inverse of their variance (variance in whatever I am predicting), estimated from each data point's IP. This means that I have to compute a variance function every time I want to predict something new: strikeout rates, ERA, FIP, etc. all have different variance as a function of sample IP. For every new dependent variable, I have to compute observed variance data points, plot them on a graph, and fit a curve. However, this is no worse than picking (often arbitrarily, or worse, deliberately) an IP cutoff for one's baseball studies.

If you wanted to do a study predicting ERA or FIP from something (your independent variables are unimportant), you can take a shortcut by using my ERA and FIP variance estimates, based on single-season IP.

To approximate variance at each plotted data point (ie a group of 320 pitcher seasons), I computed the RMS (root mean squared) error between real ERA (or FIP) and my basic projection system for ERA (or FIP). I could just use the RMS deviation from the mean of each sample, but that would use incorrect baselines for the individual pitchers. My projection system is simply a statement of previous ERA/FIP, regressed toward the league average. Therefore, deviation from my simple estimate is a better measure of variance than deviation from the mean of each sample.

Here are the data points for the observed ERA and FIP variance, along with my fits for that variance. The ERA and FIP variance is plotted on a logarithmic scale.



I fit both ERA and FIP variance to functions of the following type (minimizing relative rather than absolute error at the data points, so as not to over-fit the function to the very low IP points):

RA_VAR = A * (IP)^(B) + C

Here, the C term represents the ERA/FIP variance one might expect from a very high IP sample, while the A * IP^B term captures the variance differences between IP sample sizes.

The variance functions for ERA/FIP are:

FIP_VAR = 296 * IP^(-1.63) + 0.32
ERA_VAR = 994 * (IP)^(-1.66) + 0.58

Since both functions have nearly the same B term, variance for FIP and ERA converges at about the same rate. Variance for ERA is always higher, which should surprise no one, but it's interesting that FIP does not converge faster than ERA, since it eliminates luck based on BABIP and other defense-related factors.

In any case, since the variance estimates converge at similar rates, M.G. Bulmer's book tells us that they can be used interchangeably without any effect on inference.

Therefore, if you are doing a study that looks at the effects of *something* on pitcher season ERA or FIP, you can use the FIP_VAR formula above to properly weight your data points, in order to compensate for variance caused by differences in IP sample size.

To get a feel for this function, here is a chart of suggested weights for various IP samples:

IP      1.0 / FIP_VAR
5       0.05
10      0.14
20      0.39
40      0.96
80      1.81
160     2.53
240     2.79
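The fits and weights above can be reproduced directly in code. A sketch (function names are mine): the weights are simply 1 / FIP_VAR, with no extra normalization.

```python
# The variance fits from the post, as functions of single-season IP.

def fip_var(ip):
    """Estimated single-season FIP variance for a sample of `ip` innings."""
    return 296.0 * ip ** -1.63 + 0.32

def era_var(ip):
    """Estimated single-season ERA variance for a sample of `ip` innings."""
    return 994.0 * ip ** -1.66 + 0.58

def training_weight(ip):
    """Inverse-variance training weight for a pitcher season (FIP studies)."""
    return 1.0 / fip_var(ip)

# training_weight(40) ≈ 0.96 and training_weight(240) ≈ 2.79,
# matching the weight table above.
```

The C terms (0.32 and 0.58) cap how much weight any season can earn: even an arbitrarily long season's weight approaches 1/0.32 ≈ 3.1 for FIP, so high-IP seasons never completely drown out the rest of the data.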

Saturday, April 10, 2010

FIP & ERA baselines from projected IP (an alternative take on replacement level)

In my last article, I complained that my FIP/ERA projection system tends to regress all pitchers to the same baseline (around 4.6 FIP/ERA). This is an appropriate MLB average, but fringe pitchers (especially starters) should probably be regressed to a lower-quality (higher ERA) baseline. So here, I show what such a baseline might look like. Incidentally, this can also be used to compute the "replacement level" FIP/ERA for starters and relievers.

The question I set out to answer was: given an IP projection (split by starter IP and reliever IP), what FIP/ERA should I expect from a pitcher? If I map actual IP to ERA, then I get a very nice graph with the properties that one would expect. But this graph is biased by the survivor effect. Better pitchers throw more innings, even if they start out with lower expectations.

Instead, what if we graph expected IP against actual FIP/ERA? Now we can answer questions like: "what FIP/ERA should a team expect from a fringe starter (projected for 30.0 IP as a starter, or about five starts)?" My IP projection system is trained on all pitcher seasons from 2005-2009, including low-IP, high-IP and 0 IP seasons, so it projects realistic IP for all pitchers, not just the good ones. It also gives separate estimates for starter IP and reliever IP.

Using actual performance for all pitcher seasons, I separated the pitchers into two groups:
  1. IP >= 1 and (starter IP) >= 40% * (total IP)
  2. IP >= 1 and (starter IP) < 40% * (total IP)
This is my categorization into "mostly starters" and "mostly relievers." The cutoff might seem arbitrary, but it separates starters and relievers quite well. I could have left out a batch of pitchers around 50%, but I don't like excluding examples from my training sets, and there are not many such pitchers in any case.
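The split above is a one-line predicate; here is a sketch (the function name is mine):

```python
# Classify a pitcher season as "mostly starter" or "mostly reliever"
# using the 40% starter-IP cutoff from the text.

def classify(ip_start, ip_relief):
    total = ip_start + ip_relief
    if total < 1.0:
        return None  # 0 IP (and sub-1 IP) seasons fall into neither group
    if ip_start >= 0.40 * total:
        return "mostly starter"
    return "mostly reliever"
```

For example, a swingman with 30 starter IP and 20 relief IP lands in the "mostly starter" group, since 30 is at least 40% of his 50 total innings.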

Now I rank "mostly starters" by projected starter IP, and I rank "mostly relievers" by projected reliever IP. Within each group of 320+ pitcher seasons, I find the median FIP and raw ERA. Thus I create the series that are mapped below:



If some of that is confusing, let me explain again with an example. Take the "mostly starter" IP series. The highest IP data point occurs at IP = 165.4. That is the median projected IP_Start for the top 320 pitcher seasons, ranked by projected IP_Start, provided that those pitchers threw at least 1 IP and that 40% of their IP came as starters. These pitcher seasons include:
  • Johan Santana (2009), projected to throw 214 IP (threw 166.2 IP)
  • Brandon Webb (2009), projected to throw 206 IP (although he only threw 4.0 IP)
  • Does not include Ben Sheets (2009), since he threw 0 IP.
I hope that makes things more clear.

Within a group of 320+ pitcher seasons (I use larger samples at the lower IP data points), I computed the median FIP and ERA, regardless of each instance's IP. So in the example above, Johan Santana's ERA (based on 166.2 IP in 2009) counts on the same scale as Brandon Webb's ERA (based on 4.0 IP in 2009). I purposely don't weight the instances by IP, since that would introduce survivor bias. Without biasing myself toward how many innings the pitchers actually ended up throwing, I want to know: given a projection of "X IP_Start" and "Y IP_Relief," what is a baseline for that pitcher's ERA and FIP?

Replacement Level

Incidentally, my graph also suggests possible replacement levels for starters and relievers. If we view replacement level as the level of performance that can be easily acquired from the waiver wire or from the minor leagues, then the low-end FIP/ERA projections from the graph should offer some guidance.

For relief pitching, the median FIP for low-end projections is around 4.5 (ERA 4.6-4.7). For starting pitching, the median FIP on the low-end is around 4.8 FIP, but the median ERA is around 5.3.

The low-end starter group might look like an outlier, but its median FIP/ERA are based on the 400 pitcher seasons with the lowest IP_Start projections, restricted to those who actually pitched mostly as starters. This group had an average actual IP of 58.9 (52.0 IP as starters); the median actual IP was 42.7 (34.5 as starters). The group is therefore representative of pitchers who one would not have expected to start many innings, but who were pressed into starter roles and typically started multiple games. I believe they offer a good estimate of the kind of production a team might get from a spot starter pulled from the bullpen, or from a starter called up from AAA.

Going forward, I will assume the following FIP and ERA (league-neutral and park-neutral) replacement levels to fill a team's "missing innings" when projecting overall team ERA and overall pitcher VORP:

            FIP    raw ERA
Starter     4.9    5.3
Reliever    4.5    4.6

This is not the only way to estimate replacement level for pitchers, but these are the values most consistent with my individual projections. If one were to use a different system to project IP, then one would get different results. However I don't know of another system that accurately projects IP_Start and IP_Relief for low-end pitchers. Compared to my system, PECOTA and CHONE massively over-estimate the IP for low-end pitchers, especially rookies.

FIP vs ERA disparity

Since FIP is meant to predict ERA (after removing the differences due to defense and BABIP luck), it may seem strange that replacement starter ERA is 0.4 runs higher than replacement starter FIP. However students of DIPS will know that FIP tends to under-estimate ERA for bad pitchers, and over-estimate ERA for good pitchers.

My graph seems to suggest that FIP tracks ERA nicely in the range (4.1, 4.7), but the relationship starts to break down beyond that range. This is (in part) because FIP assumes that:
  • pitcher skills are limited to strikeout rate, walk rate and home run rate
  • these skills are linearly related to ERA
Both of these relationships break down at the high end and the low end of pitcher performance. Elite pitchers tend to have lower BABIPs than average pitchers do (although luck and defense constitute most of the BABIP difference in individual cases). Elite pitchers also tend to be better than average at secondary skills like holding runners, situational pitching, and fielding their position. Conversely, low-end pitchers are worse than average at all of these skills. Also, since outs have a non-linear relationship with runs (the more outs a pitcher produces, the less valuable each extra out is), pitchers who get very few easy outs (strikeouts, popups or soft ground balls) tend to have even higher ERAs than can be linearly approximated from the components of FIP. Think of Adam Eaton in 2007-2009: his FIP and xFIP were bad, but his ERA was consistently even worse.

Effects on Team Pitching Projections

Armed with new replacement levels for starters and relievers, I should have better team pitching projections soon. Since there is a large separation between replacement-level starter ERA and reliever ERA, teams will suffer disproportionately depending on whether their "missing innings" (ie those innings not filled by IP projections for pitchers on their opening day roster) will need to be starter or reliever innings. The Nationals, with holes in their rotation, will have to fill those missing innings at a higher ERA than the Royals, who have a set rotation, but will need to fill some of their bullpen at replacement level.

Teams will get no credit for relievers projected to post an ERA above 4.9, but will get credit for any starters with projected ERA below 5.3 (before league and park adjustments). This will make my projections much more accurate, even if they are now being made a little too late to count as pre-season predictions.
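As a sketch, filling a team's missing innings at replacement level might look like the following. The replacement ERAs come from the table above; the target innings totals, the input format, and the function name are illustrative assumptions of mine, not my actual team model.

```python
# Project team ERA by summing projected runs for rostered pitchers and
# filling any "missing innings" at replacement level.
# Replacement ERAs are from the table above; the innings targets are
# illustrative assumptions.

REPL_ERA = {"starter": 5.3, "reliever": 4.6}

def team_era(projected, needed_start_ip=960.0, needed_relief_ip=490.0):
    """projected: list of (role, ip, era) tuples for rostered pitchers."""
    runs = 0.0
    ip_total = 0.0
    filled = {"starter": 0.0, "reliever": 0.0}
    for role, ip, era in projected:
        runs += ip * era / 9.0
        ip_total += ip
        filled[role] += ip
    for role, needed in (("starter", needed_start_ip),
                         ("reliever", needed_relief_ip)):
        missing = max(0.0, needed - filled[role])
        runs += missing * REPL_ERA[role] / 9.0  # fill at replacement level
        ip_total += missing
    return 9.0 * runs / ip_total
```

A rotation 60 innings short gets those innings filled at a 5.3 ERA, while the same shortfall in the bullpen is filled at 4.6, which is exactly why the Nationals' and Royals' holes are not interchangeable.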

Davis, Buehrle, Feliz and Mariano Rivera

Also, the baselines help me resolve a couple of specific problems I noticed for individual pitchers. I projected Wade Davis for a lower FIP and ERA than Mark Buehrle. Davis pitched well in 36 IP as a rookie in 2009, and his ERA, FIP and xFIP were all better than Buehrle's. However, there is no way one should project him to be better than Mark Buehrle in 2010. The baseline FIP/ERA for starters by projected IP lets me fix this problem. In the new FIP and ERA projections, I regress pitchers to their individual baselines, rather than to the MLB baseline of 4.6. This will help Mark Buehrle.

               IP Start (proj.)   IP Relief (proj.)   FIP baseline   ERA baseline
Wade Davis     90.5               9.4                 4.69           4.78
Mark Buehrle   178.0              0.9                 4.16           4.16

Similarly, I projected Neftali Feliz to post a lower FIP & ERA than Mariano Rivera in 2010. This is even more unreasonable, and new baselines should fix this:

                 IP Start (proj.)   IP Relief (proj.)   FIP baseline   ERA baseline
Neftali Feliz    11.2               28.7                4.57           4.73
Mariano Rivera   8.9                55.5                4.13           3.93

Once I iron out a few more kinks, I should have new FIP, ERA and VORP projections for both individuals and teams. I have not yet done much with park adjustments, other than to adjust the individual and "missing innings" ERA projections to the team's park factor from 2009. It would be nice to consider a pitcher's park factor in terms of specific effects on HR rate, but everything that I've read on this issue seems to suggest that park HR factors vary too much year to year to be of much use. With so many teams having changed stadiums in the past few years (or having changed major characteristics of the field, wind patterns or the ball itself), long-term park factors do not seem very useful for predicting future park factors. I'd rather use a cruder park factor that is more current.

Wednesday, April 7, 2010

Predicting ERA (the trials & tribulations thereof)

Unlike my IP predictions, what I did for ERA/FIP was very simple. In fact, my 2009 analysis showed that my ERA predictions (retroactively, and in sample) were no better than CHONE's ERA predictions (made before the 2009 season, so obviously out of sample). My 2009 retroactive IP predictions were also in-sample, but they were much better than those of PECOTA or CHONE, and my methods were clear and logical (even if the actual formula was somewhat complicated).


However, I think even my simple ERA projection system is worth explaining, if only to organize my thoughts for future improvements.


Building an ERA prediction model involves two major choices:
  • How do we select and weight instances (ie pitcher seasons)?
  • Do we want to predict ERA, RA, FIP, or something else?
It isn't obvious at first why these choices matter. Indeed, one could build a decent system (Tom Tango's MARCEL) by ignoring them entirely. MARCEL predicts future ERA as a linear combination of previous years' ERAs; one might then adjust for a pitcher's defense, park factor, age, and the natural regression toward the mean. However, if one is to build a regression model (which is mostly what I'm doing here), then one must consider how to properly weight the instances, and whether to predict ERA directly or to predict something else that translates easily into ERA. Let me explain why.

Weighing instances by sample size

In 2009, Brandon Webb posted an ERA of 13.50. He wasn't terribly unlucky with defense or balls in play, since his FIP was 10.85. Ok, he was unlucky with home runs, so his xFIP was 6.05. How significant is all this? Were all the predictive systems wrong in predicting Webb for an ERA below 4.0? Of course not! Brandon Webb pitched only 4.0 IP in 2009.

When we evaluate (or train) predictive models, there are two ways of handling cases like Webb's. We can throw them out using an IP cutoff. Or we can reduce the significance of outliers. Or we can reduce their significance using a non-constant instance weighting model (ie we don't treat errors for all pitchers on the same scale). Ok, that's three methods, but the last two are versions of the same idea.

Most analyses comparing or training models just throw out the low-IP cases. However, where do we draw the cutoff line? Perhaps we don't care about predicting ERA for guys who throw fewer than 40 IP, since the variance for those guys is so high anyway. But what makes a 41 IP guy significant and a 39 IP guy insignificant? I like using all of the data!

In training predictive models, as in training in the gym, specificity of training applies. One gets better at rock climbing (or running, or pitching) by climbing rocks (or running, or throwing a baseball). Similarly, if we want to predict ERA for guys who throw less than 40 IP in a season, we must include sub-40 IP cases in our ERA training.

I discussed all of this at length when I wrote about training strikeout rate (SO9) models. Here is a graph of my basic SO9 prediction model's error, bucketed by IP:

In the graph above, I refer to SO9 error as "VAR" and "STDEV" since modeling error by IP allows us to estimate SO9 variance and standard deviation by IP. Assuming that my model is not biased in predicting SO9 for pitchers with different IP, the difference in SO9 error must reflect the variance in observed SO9.

If we weight training instances by 1/(expected VAR), then it is possible to build a model that trains on all of the data, but assigns less meaning to prediction errors for cases that have a known high variance in the independent variable. Since we can set the baseline arbitrarily, and only relative differences in weights are meaningful, I set the baseline to 100 IP --> 1.0 training weight. Here are some training weights for the strikeout model, based on 2009 pitcher seasons:

Pitcher Season       IP      Instance weight
Zack Greinke_09      229.3   1.72
Mariano Rivera_09    66.3    0.82
Matt Capps_09        54.3    0.73
Brandon Webb_09      4.0     0.15
Ivan Bezdomny_09     0.0     0.0

By using a system like this one, it's possible to include Brandon Webb's results in my training set, without giving them unfair weighting (and thus regressing too far toward the mean for guys with more meaningful sample sizes).
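The weighting scheme can be sketched with a standard weighted least-squares fit. One caveat: numpy's `polyfit` expects weights proportional to 1/sigma, so we pass the square root of the inverse variance; libraries that take a `sample_weight` argument (e.g. scikit-learn) usually expect 1/variance directly. The variance constants and the toy data below are made-up placeholders with the same power-law form as my fits, not my real numbers.

```python
# Inverse-variance instance weighting in a least-squares fit of future
# SO9 on past SO9. Variance constants and data are illustrative only.
import numpy as np

def so9_var(ip):
    # hypothetical constants; same functional form as the FIP/ERA fits
    return 80.0 * ip ** -1.5 + 0.5

ip = np.array([5.0, 40.0, 100.0, 200.0])
past_so9 = np.array([10.2, 6.1, 7.3, 8.0])
future_so9 = np.array([7.5, 6.0, 7.1, 7.8])

# np.polyfit's `w` is 1/sigma, so take the sqrt of the inverse variance.
w = np.sqrt(1.0 / so9_var(ip))        # low-IP seasons get small weights
slope, intercept = np.polyfit(past_so9, future_so9, 1, w=w)
```

The 5 IP season (a Webb-like outlier) still participates in the fit, but its residual barely moves the line, which is the whole point of the scheme.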

As my SO9 article showed, this method (compared to simple IP cutoffs with equal weights assigned to all instances) had the same error on higher-IP pitchers, but had much lower error for low-IP guys. As the specificity of training principle would indicate, you can't predict instances well if you don't train on similar instances.

Since FIP suggests that ERA is linear with respect to SO9, I simply used the SO9 formula to weight ERA training instances. This is lazy: I should compute a new formula estimating ERA variance from IP, but I doubt it would make a noticeable difference. As long as ERA variance follows a similar curve to SO9 variance, with a similar exponent, the difference is negligible.

Should we predict ERA directly?

Whether we weight pitcher seasons as described above or exclude all low-IP pitcher seasons, it's very hard to predict ERA much better than assigning the league average to all pitchers. A simple model predicting ERA from previous years' ERA, FIP and QERA gave me no better than a 0.22 correlation (r) with the actual data. In part, this is due to park and league factors. While the vast majority of pitchers should be projected for an ERA between 4.0 and 5.0, league and park factors can affect ERA by ±0.3 runs or more. The average run context for the White Sox is about 0.7 R/G higher than for the Giants, and about 93% of that difference is attributable to ERA. Run-context differences and the variance due to defense and BABIP luck dwarf the skill differences for most pitchers.

Of course, we could even out some of those differences by removing BABIP luck, defense, and park factors from a pitcher's performance. Or we can predict FIP, which does most of that already. Within the DIPS (Defense Independent Pitching Statistics) family, FIP may not be the most accurate ERA approximation, but it is a simple stat, and more complex stats are not provably better predictors of park-neutral ERA. The FIP formula predicts ERA simply from strikeout rate, walk rate and HR rate:

FIP = (HR*13 + BB*3 - K*2)/IP + 3.2

While FIP is not quite park- or league-neutral, it is less biased by park factors than ERA. Also, since FIP explicitly ignores the sequencing and interaction of home runs, walks and strikeouts, those rates can be adjusted independently for park and league context. Walk rates do not change much by league or park, but strikeout rates change significantly by league, and HR rates differ for each major league park. Best of all, one can predict FIP in a straightforward manner (from previous FIP and QERA) at a 0.32 correlation (r) with real FIP.
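For reference, here is the formula from the previous section as a function, using the post's 3.2 league-calibration constant (the function name is mine):

```python
# FIP from season counting stats, per the formula in the text:
# FIP = (HR*13 + BB*3 - K*2)/IP + 3.2

def fip(hr, bb, k, ip):
    """Fielding Independent Pitching for a season's HR, BB, K and IP."""
    return (13 * hr + 3 * bb - 2 * k) / ip + 3.2

# e.g. 20 HR, 50 BB, 150 K in 200 IP:
# (260 + 150 - 300) / 200 + 3.2 = 3.75
```

Note that each component rate can be park- or league-adjusted before being plugged in, which is exactly the flexibility the paragraph above describes.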

Here is another way of thinking about this:

I build a model to predict ERA that takes differences in past ERA/FIP between pitchers into account. It's trying to do the right thing, but the variance due to park factors and BABIP luck is so high that the system freaks out and sets everyone to around a 4.6 ERA.

Alternatively, I build a model to predict FIP. The differences between pitchers are roughly the same as before, but by eliminating BABIP luck and reducing the variance due to park factors, the model is more confident in assigning FIP predictions further away from the mean.

I chose the blue pill, and I think it should be pretty clear why.

Conclusions & Ongoing Work

By comparison, a non-regression system like MARCEL takes past performance as an indicator of future performance without giving consideration to deviation from the mean. It is more likely to predict exceptional cases well (since it's not afraid to stray from the baseline prediction), but it will more often be fooled by unsustainable outlier performance.

Not surprisingly, my model is still heavily regressed toward the mean, even when I predict FIP and translate that into team-specific ERA. Very few starting pitchers get an FIP prediction under 4.0. Tim Lincecum has by far the lowest FIP projection for 2010 (among starters) at 3.29. Even with a favorable league and park, I project his ERA at 3.04. The second-best starter projection for 2010 is Zack Greinke at 3.47 FIP; his adjusted ERA projection comes out to 3.53. That means Greinke is equally likely to post a 4.03 ERA and a 3.03 ERA.

I'm ok with these projections, since even great pitchers often revert toward the mean and have less room for further improvement. However, my system projects FIP below 5.0 for some very marginal pitchers, and no 2010 pitcher is projected for an FIP above 5.79. Natural selection (ie the MLB front offices) tends to make sure that high-FIP pitchers don't stay in the majors for long, but my system is still making a mistake by aggressively regressing everyone toward the mean, simply for being on a major league roster. The distribution of talent in MLB is not symmetrical, so regression toward the mean should not be either. I'm not sure how to properly account for this.

I am making major improvements to my FIP/ERA projection formula, so I will not discuss the current formula in much more detail. Right now, my adjustments for park and league are not very good, and I'm not adjusting the HR% component of FIP for different parks. Worst of all, I am not adjusting the previous years' FIP features by sample size. So if a pitcher had an FIP of 2.48 in only 31.0 IP in 2009, I treat that the same for future prediction as a 2.48 FIP in 200+ IP. As a result, Neftali Feliz is my second-best FIP projection for 2010, behind only Jonathan Broxton. Actually, 31.0 IP isn't bad for a 1-year projection, and my system is used to handling those sample sizes. Unfortunately, that's also his 3-year total, and my system assigns more weight to 3-year averages than to last year's figures. I can only imagine the ridiculous projection it would have given to Joba Chamberlain going into 2008!

I will fix the most pressing issues, and soon I'll have a very good projection for FIP/ERA. Stay tuned.