My writings about baseball, with a strong statistical & machine learning slant.

Sunday, April 11, 2010

How to weight FIP/ERA instances by IP

I like training on all of the available data, rather than only on data from "necessarily high IP sample size". However, if I am trying to establish a relationship (between fastball speed and strikeout rate, between FIP and ERA, or between past and future ERA, for example), I cannot simply use samples based on 5 IP and 200 IP in the same way. When fitting a function to my data, I will invariably generate larger errors for those data points based on smaller sample sizes. Thankfully, I have a 60's era statistics book to help me out. According to M.G. Bulmer in Principles of Statistics:
If the form of the relationship between the variance of Y [dependent variable] and x [independent variable] is known, for example if the variance is known to be proportional to x, more efficient estimators can be obtained by weighting the observations with weights inversely proportional to their variances. In general, however, small departures from normality or homoscedasticity will have little effect on inference about the regression line and may be ignored.
In other words, if we can approximate the variance of our dependent variable by some function, then we can weight all data points in inverse proportion to that variance. But if the function does not suggest large differences in variance, we need not bother with the weighting.

Therefore, while many baseball studies are based on data points pruned by IP cutoffs, my studies use all available data, but with points weighted by the inverse of those points' variance (variance in whatever I am predicting), estimated based on the IP for the data point. This necessarily means that I have to compute a variance function every time I want to predict something new. Strikeout rates, ERA, FIP, etc. all have different variance as a function of the sample's IP. For every new dependent variable, I have to compute observed variance data points, plot them on a graph, and find a fit for this curve. However, this is no worse than picking (often arbitrarily, or worse, deliberately) an IP cutoff for one's baseball studies.
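As a minimal sketch of what this weighting looks like in practice, here is a weighted straight-line fit in plain Python. The function name, the sample numbers, and the choice of a simple linear model are all assumptions made for illustration; the only idea taken from the text above is that each point gets a weight of 1 / variance.

```python
def weighted_linear_fit(xs, ys, variances):
    """Fit y = slope * x + intercept, weighting each point by 1 / variance."""
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)
    # Weighted means of x and y.
    x_bar = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, ys)) / w_sum
    # Weighted covariance of (x, y) and weighted variance of x.
    s_xy = sum(w * (x - x_bar) * (y - y_bar)
               for w, x, y in zip(weights, xs, ys))
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up numbers: three well-measured seasons and one tiny-sample outlier.
fip = [3.0, 4.0, 5.0, 9.0]
era = [3.2, 4.1, 5.3, 2.0]
variances = [0.4, 0.4, 0.4, 25.0]  # the last point is a very low IP season

slope, intercept = weighted_linear_fit(fip, era, variances)
flat_slope, flat_intercept = weighted_linear_fit(fip, era, [1.0] * 4)
print("weighted slope:", slope)
print("unweighted slope:", flat_slope)
```

With equal weights the low IP outlier drags the slope negative, while the inverse-variance weights keep the fit close to the three reliable points, without having to throw the outlier away.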

If you wanted to do a study predicting ERA or FIP from something (your independent variables are unimportant), you can take a shortcut by using my ERA and FIP variance estimates, based on single-season IP.

To approximate variance at each plotted data point (i.e. group of 320 pitcher seasons), I computed the RMS (root mean squared) error between real ERA (or FIP) and my basic projection system for ERA (or FIP). I could just use RMS deviation from the mean of each sample, but that would use incorrect baselines for the individual pitchers. My projection system is simply a statement of previous ERA/FIP, regressed toward the league average. Therefore, deviation from my simple estimate is a better measure of variance than is deviation from the mean of each sample.
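The grouping step could be sketched roughly as follows. This assumes the pitcher seasons are sorted by IP and cut into consecutive groups, and that each season is a hypothetical (ip, actual, projected) tuple; neither detail is spelled out above, and the projection itself is taken as given.

```python
import math


def binned_rms_error(seasons, bin_size=320):
    """Group seasons by IP and return (mean_ip, rms_error) for each group.

    seasons: list of (ip, actual, projected) tuples (a hypothetical layout).
    The last group may hold fewer than bin_size seasons.
    """
    ordered = sorted(seasons, key=lambda s: s[0])
    results = []
    for start in range(0, len(ordered), bin_size):
        group = ordered[start:start + bin_size]
        mean_ip = sum(ip for ip, _, _ in group) / len(group)
        # Mean squared error between actual and projected values.
        mse = sum((actual - projected) ** 2
                  for _, actual, projected in group) / len(group)
        results.append((mean_ip, math.sqrt(mse)))
    return results


# Tiny made-up example with groups of 2 instead of 320.
demo = [(10, 5.0, 4.0), (20, 3.0, 4.0), (100, 4.5, 4.0), (200, 3.5, 4.0)]
print(binned_rms_error(demo, bin_size=2))
```

Squaring the returned RMS error gives the mean squared error, if a variance-scale number is wanted.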

Here are the data points for the observed ERA and FIP variance, along with my fits for that variance. The ERA and FIP variance is plotted on a logarithmic scale.

I fit both ERA and FIP variance to functions of the following type, minimizing relative (rather than absolute) error at the data points so as not to over-fit my function to the very low IP points:

RA_VAR = A * (IP)^(B) + C

Here, the C term represents the ERA/FIP variance that one might expect from a very high IP sample, while the first term represents the variance differences between different IP sample sizes.
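One way such a fit could be done without any libraries is sketched below. It is not necessarily the procedure used for the fits in this post: it assumes a grid search over B, and for each candidate B it solves for A and C in closed form by least squares on the relative error (A * IP^B + C - VAR) / VAR. The function name and the synthetic data are made up for illustration.

```python
def fit_power_plus_constant(ips, variances, b_grid):
    """Fit VAR = A * IP**B + C, minimizing squared relative error.

    For each B in b_grid the model is linear in A and C, so the best A and C
    come from a 2x2 system of normal equations. Returns (A, B, C).
    """
    best = None
    for b in b_grid:
        # Dividing the model by VAR turns relative error into A*u + C*w - 1.
        u = [ip ** b / v for ip, v in zip(ips, variances)]
        w = [1.0 / v for v in variances]
        suu = sum(x * x for x in u)
        sww = sum(x * x for x in w)
        suw = sum(x * y for x, y in zip(u, w))
        su, sw = sum(u), sum(w)
        det = suu * sww - suw * suw
        if det == 0:
            continue  # degenerate grid point, skip it
        a = (su * sww - sw * suw) / det
        c = (suu * sw - suw * su) / det
        err = sum((a * ui + c * wi - 1.0) ** 2 for ui, wi in zip(u, w))
        if best is None or err < best[0]:
            best = (err, a, b, c)
    return best[1], best[2], best[3]


# Synthetic check: data generated from known parameters should be recovered.
ips = [5, 10, 20, 50, 100, 200]
observed = [300.0 * ip ** -1.6 + 0.3 for ip in ips]
b_grid = [-i / 100 for i in range(10, 301)]  # B from -0.10 down to -3.00
a, b, c = fit_power_plus_constant(ips, observed, b_grid)
print(a, b, c)
```

Because the error is measured relative to each observed variance, the huge variances at very low IP do not dominate the fit the way they would under absolute error.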

The variance functions for ERA/FIP are:

FIP_VAR = 296 * (IP)^(-1.63) + 0.32
ERA_VAR = 994 * (IP)^(-1.66) + 0.58

Since both functions have nearly the same B term (-1.63 vs. -1.66), it looks like variance for FIP and ERA converges at about the same rate. Variance for ERA is always higher, which should surprise no one, but it's interesting that FIP does not converge faster than ERA, even though it eliminates luck based on BABIP and other defense-related factors.

In any case, since the variance estimates converge at similar rates, M.G. Bulmer's book suggests that they can be used interchangeably with little effect on inference.

Therefore, if you are doing a study that looks at the effects of *something* on pitcher season ERA or FIP, you can use the FIP_VAR formula above to properly weight your data points, in order to compensate for variance caused by differences in IP sample size.

To get a feel for this function, here is a chart of suggested weights for various IP samples:

1.0 / FIP_VAR
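For anyone who wants the numbers directly, here is a short sketch that evaluates the two variance formulas above and the 1.0 / FIP_VAR weight. The function names and the particular IP values are just illustrative choices.

```python
def fip_var(ip):
    """FIP variance estimate for a single-season IP total (formula above)."""
    return 296.0 * ip ** -1.63 + 0.32


def era_var(ip):
    """ERA variance estimate for a single-season IP total (formula above)."""
    return 994.0 * ip ** -1.66 + 0.58


def fip_weight(ip):
    """Suggested weight for a data point: the inverse of FIP_VAR."""
    return 1.0 / fip_var(ip)


for ip in (5, 20, 50, 100, 150, 200):
    print(f"{ip:>4} IP  FIP_VAR = {fip_var(ip):7.3f}  "
          f"ERA_VAR = {era_var(ip):7.3f}  weight = {fip_weight(ip):.3f}")
```

Only the ratios between weights matter for a weighted fit, so the weights can be rescaled (for example, so that a 200 IP season has weight 1) without changing the result.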
