Having finished "The Girl With the Dragon Tattoo" last week, and having no more pleasure reading left, I resorted to reading a short statistics text that I brought along for my trip. I studied various kinds of math for many years, but my formal knowledge of statistics is limited to two semesters as an undergrad, plus some physics labs that I did not fully appreciate at the time. Nowadays, when working with baseball data (sometimes using Excel), I wish that I had kept my notes from the experimental physics lab I took ten years ago!
In any case, early in the statistics book I stumbled on a simple comment about non-unimodal distributions: the author casually noted that most of these distributions are actually distributions of sums of multiple random variables. Well, pitcher IP (innings pitched) is not really a random variable with a (skewed) normal distribution, but it can be described well as the sum of two other variables: starting innings pitched and relief innings pitched. Of course, a pitcher's starting and relief innings are not unrelated. However, in today's game they can reasonably be treated as separate variables. Except in playoff situations, pitchers are always either starters or relievers (or not on the active roster) during any particular stretch of the season. The Yankees' Joba Chamberlain and Phil Hughes drama from this season is a recent example of this property. So both IP_start and IP_relief may reasonably be predicted based on a pitcher's time spent in each role, as well as his success, durability, manager's preference, etc. Unfortunately, as of now, there is no way to see how many games a pitcher was available as a starter or as a reliever throughout the season. But conceptually, the idea is workable in today's game.
Each of those distributions (IP_start and IP_relief) is itself bimodal, with peaks at 0 IP, and at around 200 IP for IP_start and around 65 IP for IP_relief. Still, these distributions are simpler than the trimodal distribution for total innings pitched.
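To get a feel for this, here is a minimal Python sketch with synthetic pitchers. Everything in it is made up for illustration: the role shares in `ROLE_WEIGHTS`, the normal spreads around 200 and 65 innings, and the simplification that each pitcher holds exactly one role all season (with the leftover group pitching no innings at all). None of it is fit to real data. It only shows how two simple two-peak variables can add up to a three-peak total.

```python
import random
from collections import Counter

# Hypothetical role shares, invented for this sketch (not estimated from data).
ROLE_WEIGHTS = {"none": 0.3, "reliever": 0.4, "starter": 0.3}

def simulate_pitchers(n, seed=0):
    """Return a list of (ip_start, ip_relief) pairs for n synthetic pitchers."""
    rng = random.Random(seed)
    roles = rng.choices(list(ROLE_WEIGHTS), weights=list(ROLE_WEIGHTS.values()), k=n)
    pitchers = []
    for role in roles:
        # Assumed spreads: starters around 200 IP, relievers around 65 IP.
        ip_start = max(0.0, rng.gauss(200, 20)) if role == "starter" else 0.0
        ip_relief = max(0.0, rng.gauss(65, 10)) if role == "reliever" else 0.0
        pitchers.append((ip_start, ip_relief))
    return pitchers

def histogram(values, width=20):
    """Count values per bin, keyed by the lower edge of each bin."""
    return Counter(int(v // width) * width for v in values)

pitchers = simulate_pitchers(3000)
total_hist = histogram([s + r for s, r in pitchers])

# Print a crude text histogram of total IP = IP_start + IP_relief.
for edge in sorted(total_hist):
    print(f"{edge:3d}-{edge + 20:3d} IP: {total_hist[edge]}")
```

Running `histogram` on the first or second element of each pair alone gives the two-peak picture for IP_start or IP_relief, while the summed values pile up near 0, 65, and 200.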
Now that I think of it, I'm not sure why I didn't try to predict IP_start and IP_relief separately before. From the beginning, I thought about pitcher roles and usage in predicting IP, but I didn't think categorizing pitchers into starters and relievers would be useful beyond what is already evident from previous years' data. I still think that was the right call, but predicting IP_start and IP_relief is a different idea, and one that makes more sense. I am not chopping the pitchers into starters and relievers; rather, I am computing two aspects of a pitcher's ability to contribute toward his own usage. Both Mariano Rivera and Josh Towers can be predicted to have few IP_start, but for different reasons. Similarly, we would expect both CC Sabathia and Kei Igawa to have few IP_relief, but again for different reasons.
I have some preliminary tests showing that even a linear combination of features can predict IP_start with much better accuracy (i.e., correlation between predictions and actual values on a test set) than I currently get from the "one size fits all" IP model. Predictions for IP_relief are tougher, as would be expected (especially since I am including pitchers with very low IP in my model, where variance is very high). Still, predicting IP_start very accurately will be useful. Also, the features used for IP_start and IP_relief prediction are different, so the combined model (i.e., summing the IP_start and IP_relief predictions) will lead to a straightforward non-linear model for total IP, which could be much more predictive.
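As a rough illustration of the combined model, here is a small Python sketch. It is not my actual model: the single feature per component (previous-year IP_start and previous-year IP_relief), the names `fit_line`, `predict`, and `pearson`, and every number in `TRAIN` and `TEST` are hypothetical and made up for the example. Note also that a plain sum of two linear fits is still linear in the union of the features; in this sketch the non-linearity comes only from flooring each component at zero, which is an assumption of the sketch.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict(model, x):
    """Apply a fitted line, floored at zero since innings cannot be negative."""
    intercept, slope = model
    return max(0.0, intercept + slope * x)

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up rows: (prev_ip_start, prev_ip_relief, ip_start, ip_relief).
TRAIN = [(210, 0, 195, 0), (180, 0, 200, 5), (0, 70, 0, 62),
         (0, 55, 10, 60), (100, 20, 150, 10), (0, 0, 0, 20)]
TEST = [(200, 0, 190, 0), (0, 60, 0, 68), (0, 10, 0, 15),
        (150, 0, 170, 0), (30, 40, 40, 45)]

# Fit each component on its own feature.
start_model = fit_line([r[0] for r in TRAIN], [r[2] for r in TRAIN])
relief_model = fit_line([r[1] for r in TRAIN], [r[3] for r in TRAIN])

# Combined prediction for total IP is the sum of the two component predictions.
predicted_total = [predict(start_model, r[0]) + predict(relief_model, r[1]) for r in TEST]
actual_total = [r[2] + r[3] for r in TEST]

corr = pearson(predicted_total, actual_total)
print(f"correlation on toy test set: {corr:.3f}")
```

The same `pearson` call can be run on each component separately, which is how I would compare IP_start and IP_relief predictions against the one-size-fits-all model.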
I should have concrete results soon. Tomorrow, I am supposed to spend a day seeing the Great Ocean Road west of Melbourne. So maybe I'll have something to show the day after.