If the form of the relationship between the variance of Y [dependent variable] and x [independent variable] is known, for example if the variance is known to be proportional to x, more efficient estimators can be obtained by weighting the observations with weights inversely proportional to their variances. In general, however, small departures from normality or homoscedasticity will have little effect on inference about the regression line and may be ignored.
In other words, if we can approximate the variance of our dependent variable by some function, then we can weight each data point in inverse proportion to that variance. But if the function does not suggest large differences in variance, we need not bother with the weighting.
Therefore, while many baseball studies use data points pruned by IP cutoffs, my studies use all available data, with each point weighted by the inverse of its variance (variance in whatever I am predicting), estimated from the point's IP. This means I have to compute a variance function every time I want to predict something new: strikeout rates, ERA, FIP, etc. all have different variances depending on the IP of the sample. For every new dependent variable, I have to compute observed variance at a series of data points, plot them on a graph, and find a curve that fits them. However, this is no worse than picking (often arbitrarily, or worse, deliberately) an IP cutoff for one's baseball studies.
If you want to do a study predicting ERA or FIP from something (the independent variables are unimportant here), you can take a shortcut by using my ERA and FIP variance estimates, based on single-season IP.
To approximate variance at each plotted data point (i.e., each group of 320 pitcher seasons), I computed the RMS (root-mean-square) error between actual ERA (or FIP) and my basic projection system for ERA (or FIP). I could just use the RMS deviation from the mean of each sample, but that would use incorrect baselines for the individual pitchers. My projection system is simply previous ERA/FIP regressed toward the league average, so deviation from this simple estimate is a better measure of variance than deviation from the mean of each sample.
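A sketch of this bucketing step, using synthetic data to stand in for real pitcher seasons (the league ERA, regression weight, noise model, and all numbers below are illustrative assumptions, not my actual data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pitcher seasons; every number here is illustrative
n = 3200
ip = rng.uniform(5, 250, n)                  # innings pitched per season
league_era = 4.20
talent = rng.normal(league_era, 0.5, n)      # "true" ERA talent
era = talent + rng.normal(0, 1.0, n) * np.sqrt(40.0 / ip)

# Simple projection: prior-season ERA regressed toward league average
prev_era = talent + rng.normal(0, 0.3, n)
projection = 0.6 * prev_era + 0.4 * league_era   # assumed regression weight

# Sort by IP, split into groups of ~320 seasons, and take the squared RMS
# error of the projection within each group as the observed variance
order = np.argsort(ip)
bucket_vars = []
for bucket in np.array_split(order, n // 320):
    rms = np.sqrt(np.mean((era[bucket] - projection[bucket]) ** 2))
    bucket_vars.append((ip[bucket].mean(), rms ** 2))

for mean_ip, v in bucket_vars:
    print(f"IP {mean_ip:6.1f}: observed ERA variance {v:.2f}")
```

The low-IP buckets show much larger observed variance than the high-IP buckets, which is exactly the pattern the variance function below is fit to.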
Here are the data points for the observed ERA and FIP variance, along with my fits for that variance. The ERA and FIP variance is plotted on a logarithmic scale.
I fit both ERA and FIP variance to functions of the following form. (I minimize relative, rather than absolute, error at the data points, so as not to overfit the function to the very-low-IP points.)
VAR = A * (IP)^B + C
Here, the C term represents the ERA/FIP variance one would expect from a very-high-IP sample, while the A * (IP)^B term captures the variance differences between different IP sample sizes.
The variance functions for ERA/FIP are:
FIP_VAR = 296 * (IP)^(-1.63) + 0.32
ERA_VAR = 994 * (IP)^(-1.66) + 0.58
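A sketch of how such a fit can be produced: grid-search the exponent B, and for each candidate B solve for A and C by weighted least squares, weighting by 1/variance so that relative (not absolute) error is minimized. The variance observations below are synthetic stand-ins generated from the FIP fit itself, not my actual data points:

```python
import numpy as np

# Stand-in variance observations; real values come from the RMS step above
ip = np.array([5.0, 10, 20, 40, 80, 160, 240])
obs_var = 296.0 * ip ** -1.63 + 0.32

best = None
for b in np.arange(-3.0, -0.5, 0.01):        # grid over the exponent B
    # For fixed B, the model A * IP^B + C is linear in (A, C)
    X = np.column_stack([ip ** b, np.ones_like(ip)])
    w = 1.0 / obs_var                        # weights for *relative* error
    coef, *_ = np.linalg.lstsq(X * w[:, None], obs_var * w, rcond=None)
    rel_err = np.sum(((X @ coef - obs_var) / obs_var) ** 2)
    if best is None or rel_err < best[0]:
        best = (rel_err, coef[0], b, coef[1])

_, A, B, C = best
print(f"VAR ~= {A:.0f} * IP^({B:.2f}) + {C:.2f}")
```

Since the model is linear in A and C once B is fixed, the grid-plus-least-squares approach avoids any dependence on a nonlinear optimizer's starting point.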
Since both functions have nearly the same B term, variance for FIP and ERA appears to converge at the same rate. Variance for ERA is always higher, which should surprise no one, but it is interesting that FIP does not converge faster than ERA, since FIP eliminates luck from BABIP and other defense-related factors.
In any case, since the two variance estimates converge at similar rates, the passage from M.G. Bulmer's book quoted above tells us that they can be used interchangeably without any effect on inference.
Therefore, if you are doing a study that looks at the effects of *something* on pitcher season ERA or FIP, you can use the FIP_VAR formula above to properly weight your data points, in order to compensate for variance caused by differences in IP sample size.
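As a concrete sketch of that weighting, here is a weighted linear fit with numpy's polyfit on synthetic data (the predictor x, the true slope, and the IP distribution are all made up for illustration):

```python
import numpy as np

def fip_var(ip):
    """Estimated single-season FIP variance as a function of IP (fit above)."""
    return 296.0 * ip ** -1.63 + 0.32

rng = np.random.default_rng(1)
n = 500
ip = rng.uniform(5, 250, n)
x = rng.normal(0.0, 1.0, n)                  # some pitcher metric
fip = 4.20 - 0.30 * x + rng.normal(0.0, np.sqrt(fip_var(ip)))

# np.polyfit multiplies each residual by its weight before squaring, so
# passing w = 1/sqrt(variance) gives the desired inverse-variance weighting
slope, intercept = np.polyfit(x, fip, 1, w=1.0 / np.sqrt(fip_var(ip)))
print(f"fitted FIP ~= {intercept:.2f} + {slope:.2f} * x")
```

Note the square root: polyfit weights residuals, not squared residuals, so the weight per point is 1/sqrt(FIP_VAR) rather than 1/FIP_VAR.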
To get a feel for this function, here is a chart of suggested weights for various IP samples:
| IP  | 1.0 / FIP_VAR |
|-----|---------------|
| 5   | 0.05          |
| 10  | 0.14          |
| 20  | 0.39          |
| 40  | 0.96          |
| 80  | 1.81          |
| 160 | 2.53          |
| 240 | 2.79          |
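Each weight above is just 1.0 / FIP_VAR evaluated at the given IP, so the chart can be reproduced in a couple of lines:

```python
# Reproducing the weight chart from the FIP_VAR fit above
weights = {ip: 1.0 / (296.0 * ip ** -1.63 + 0.32)
           for ip in (5, 10, 20, 40, 80, 160, 240)}
for ip, w in weights.items():
    print(f"{ip:3d} IP -> weight {w:.2f}")
```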
