My writings about baseball, with a strong statistical & machine learning slant.
Monday, February 1, 2010
Handling small sample sizes (K9 variance by IP)
First of all, thanks to Dave Allen for pointing out that I should be trying to predict SOr (strikeout rate as a percentage of AB) rather than SO9 (strikeouts per 9 innings). In practice, however, there is no difference. I can't give you a linear formula to translate SOr to SO9 offhand, but I can say that my machine learning system predicts them with the same accuracy (within 1%, which is not remotely significant), using the same features, weighted in equal proportions. That said, Dave is right, and I should be using SOr.
For the time being, I will stick with SO9, knowing that I can translate to SOr as needed. I understand strikeout rates per 9 innings much better than I understand strikeout rates per batter, and I imagine other people do as well. Seven K's per 9 innings is average. Anything around nine per nine is very good. Anything above that is exceptional. In any case, thanks Dave. You are 100% right.
In my last post, I wrote about an idea I had for handling small sample sizes. Or rather, I wrote about training models intelligently with lots of data of unequal significance. I now have a much better example of this idea at work.
I have a data set of all pitcher seasons from 2002-2009, along with pitch data from FanGraphs, and the pitchers' observed strikeout rates. Also I have innings pitched (IP) for these pitcher seasons.
I would like to build an (optimal) model to predict expected strikeout rate (SO9) from the FanGraphs pitch data. However, my data set includes data for guys who pitched 200 IP and for guys who pitched 20 IP. Intuitively, these data points are not equally significant. However, the 20 IP data points are not insignificant either! This is important, since almost 20% of the pitcher seasons from the past decade are for guys throwing less than 14 IP in a season. I would expect strikeout rates to be very noisy at such low usage figures. I suspect that most analysts just throw data like this out. But where do I draw the line? 20 IP? 40 IP? Maybe at whatever cutoff makes the model look most predictive? That isn't very good science.
It would be nice to scale each data point by an estimate of how confident we are in the accuracy of that observation. In other words, if we can get a relative measure of variance (in our dependent variable) based on the input data (independent variables), we can weight each point by the inverse of that variance in training. I have a more detailed explanation of this idea in my previous post.
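To make the idea concrete, here is a minimal sketch of inverse-variance weighting in a least-squares fit. Everything here is simulated and illustrative: the single feature, the noise model, and the 36/IP variance estimate are all stand-ins, not my actual features or fitted numbers.

```python
import numpy as np

# Simulated pitcher seasons: one hypothetical feature driving true SO9,
# with observation noise that shrinks as IP grows.
rng = np.random.default_rng(0)
n = 200
ip = rng.uniform(10, 220, n)              # innings pitched
x = rng.normal(0.10, 0.03, n)             # stand-in feature (e.g. a pitch-data rate)
noise_sd = 6.0 / np.sqrt(ip)              # fewer IP -> noisier observed SO9
so9 = 4.0 + 30.0 * x + rng.normal(0, noise_sd)

est_var = 36.0 / ip                       # assumed variance estimate (a * x^-b form)
w = 1.0 / est_var                         # inverse-variance weights

# Weighted least squares: scale each row of [1, x] and each y by sqrt(w),
# then solve the ordinary least-squares problem on the scaled system.
A = np.column_stack([np.ones(n), x]) * np.sqrt(w)[:, None]
b = so9 * np.sqrt(w)
coef, *_ = np.linalg.lstsq(A, b, rcond=None)
print("intercept, slope:", coef)
```

The weighting means a noisy 20 IP season still contributes, just proportionally less than a 200 IP season.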
As before, I isolate the independent variable as IP. My dependent variable is observed strikeout rate (SO9). I estimate the variance in SO9 by looking at the error of a model I built to predict SO9, where all pitcher seasons are trained with equal weight. The logic here may seem circular, but there is nothing inherently illogical about it. If I simply looked at the variance in observed SO9 within bands of pitcher seasons (by IP), I would be way over-estimating variance in the high IP cases. CC Sabathia, Randy Johnson, Jon Lester and Kirk Rueter will have large differences in observed SO9 rate, even over 200 IP. Comparing observed rates to reasonable, unbiased estimates is a better way of trying to understand which part of the difference can be attributed to small sample sizes, and which part is due to other factors, like talent, pitching style, etc.
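A quick simulated illustration of why residual variance beats raw variance here. Among 200 IP pitchers, the raw spread of observed SO9 mixes the talent spread with sampling noise; subtracting a reasonable prediction (in this toy setup, the prediction is the true talent itself) isolates the noise. The specific numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
talent = rng.normal(7.0, 1.5, n)         # true SO9 talent varies a lot by pitcher
noise = rng.normal(0, 0.4, n)            # sampling noise at ~200 IP is small
observed = talent + noise

raw_var = observed.var()                 # inflated by the talent spread
resid_var = (observed - talent).var()    # residuals vs. an unbiased prediction
print(f"raw variance: {raw_var:.2f}, residual variance: {resid_var:.2f}")
```

The raw variance stays large no matter how many innings these pitchers throw, while the residual variance reflects only the small-sample noise I actually want to measure.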
OK, so on to the graph! My data points are buckets of pitcher seasons, grouped by IP, and represented by the average IP of the bucket. There are 5,000 pitcher seasons in my sample. Each data point represents a bucket of 1,000 pitcher seasons, except for the outermost points, which represent only 500 pitcher seasons. How's that for sample size!
As you can see, the curve is modelled quite well by a function of the form f(x) = ax^(-b), where x = IP and f(x) estimates the variance in SO9 rate.
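Fitting a curve of this form is easy: taking logs turns f(x) = ax^(-b) into the line log f = log a - b log x, so an ordinary linear fit on the log-log points recovers a and b. The bucket values below are made-up placeholders standing in for the real bucketed variance estimates.

```python
import numpy as np

# Illustrative (bucket mean IP, estimated SO9 variance) pairs -- not my real data.
mean_ip = np.array([20.0, 55.0, 95.0, 140.0, 190.0])
var_so9 = np.array([1.80, 0.65, 0.38, 0.26, 0.19])

# Linear fit in log-log space: slope = -b, intercept = log(a).
slope, intercept = np.polyfit(np.log(mean_ip), np.log(var_so9), 1)
a, b = np.exp(intercept), -slope
print(f"f(x) = {a:.1f} * x^(-{b:.2f})")
```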
The next step is to re-train my model, weighting the data points in inverse proportion to this estimated variance. This should result in a higher correlation between real and predicted SO9 rate, using cross-validation or any other method where I leave out part of the data for testing.
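Here is a rough sketch of that held-out comparison on simulated data: fit once with equal weights and once with inverse-variance weights, then check correlation on seasons the model never saw. The data-generating process and the 1/36 weight scale are assumptions for the sketch, not my actual setup.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
ip = rng.uniform(10, 220, n)
x = rng.normal(0.10, 0.03, n)
so9 = 4.0 + 30.0 * x + rng.normal(0, 6.0 / np.sqrt(ip))

train, test = slice(0, 3000), slice(3000, n)   # simple holdout split
A = np.column_stack([np.ones(n), x])

def holdout_preds(w):
    # Weighted least squares on the training split, predictions on the holdout.
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A[train] * sw, so9[train] * sw[:, 0], rcond=None)
    return A[test] @ coef

pred_eq = holdout_preds(np.ones(3000))         # every season weighted equally
pred_iv = holdout_preds(ip[train] / 36.0)      # weight = 1 / estimated variance

for name, pred in [("equal", pred_eq), ("inv-var", pred_iv)]:
    r = np.corrcoef(pred, so9[test])[0, 1]
    print(f"{name:8s} held-out correlation: {r:.3f}")
```

Any single simulated run may show only a small gap between the two, since both fits are unbiased; the weighted fit's advantage is lower variance in its coefficient estimates.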
Also, I should expect to perform better on weighting-neutral tests. For example, I should more often be closer to the observed strikeout rate than my current model, which treats all pitcher seasons the same.
Phrased differently, I am now training a model that places more significance on nailing down strikeout rates that I think I should be able to predict more accurately. So being off by 2 K/9 on Roy Halladay will be less acceptable than being off by 2 K/9 on Jonathan Albaladejo. I think that makes a lot of sense!
Hopefully this approach will work. We all need better tools for tackling sample size issues. Regression to the mean can screw up good analysis if not properly accounted for.