My writings about baseball, with a strong statistical & machine learning slant.

Wednesday, February 10, 2010

Why are strikeout rates from pitch data interesting? (Part 1)

Before I go into depth about my findings, I think this is a question worth answering: why am I trying to predict strikeout rates from pitch data and biographical information?

As Dave Allen pointed out to me (thanks Dave!), one can predict strikeout rates very well with swinging strike rates. So if the point is to find a way to predict strikeout rates independent of direct measures of performance (e.g. wins, ERA, VORP), then the jig is up. However, this isn't the point.

The article from Jeff Sullivan (linked above) shows that there is a linear relationship between strikeout rate and swinging strike rate. His R^2 = 0.71 implies a correlation coefficient of 0.84 between the two. That is much better than anything I can get from pitch data, which I don't think is surprising. I compare these values in more detail below.
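For anyone who wants to check the arithmetic: in a simple linear fit with a single predictor, R^2 is the square of the correlation coefficient, so the conversion is just a square root. The snippet below is a minimal sketch of that conversion; squaring my own 0.59 for comparison is only a rough guide, since my models are not simple one-variable linear fits.

```python
import math

# R^2 reported in Jeff Sullivan's article for SO% vs swinging strike rate.
r_squared = 0.71

# For a one-predictor linear fit, r is the square root of R^2
# (positive root, since more swinging strikes means more strikeouts).
r = math.sqrt(r_squared)
print(round(r, 2))  # 0.84

# Going the other way: squaring a correlation gives a rough R^2-style figure.
# 0.59 is the correlation of my best non-linear model.
my_r = 0.59
print(round(my_r ** 2, 2))  # 0.35
```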

It's also notable that Jeff uses a 100 IP cutoff for his pitcher seasons, so he completely ignores relievers. I do not exclude any pitcher seasons, as I wrote about in several previous posts. However, as a point of comparison, the results I include here come from training only on 100+ IP pitcher seasons.

predicting:   correlation:   model type:   features:
SO%           0.84           linear        swinging strike rate
SO9           0.59           non-linear    pitch type, bio, league
SO9           0.48           linear        pitch type, bio, league
SO9           0.37           non-linear    without fastball velocity
SO9           0.52           non-linear    fastball velocity only

As I wrote before, SO% and SO9 are extremely correlated (you can estimate one from the other with a linear transformation with very high accuracy). My resulting models for SO% and SO9 are almost the same (after a linear transformation), with the same correlations. I will switch to SO% soon. I promise!
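To make the two statistics concrete, here is a minimal sketch of their standard definitions: SO9 is strikeouts per nine innings, and SO% is strikeouts per batter faced. The function names and the stat lines are hypothetical, made up for illustration, and innings are assumed to be plain decimals rather than the .1/.2 box-score notation.

```python
def so9(strikeouts, innings):
    """Strikeouts per nine innings pitched."""
    return 9.0 * strikeouts / innings


def so_pct(strikeouts, batters_faced):
    """Strikeouts as a fraction of batters faced."""
    return strikeouts / batters_faced


# Hypothetical stat lines (strikeouts, innings, batters faced); not real pitchers.
seasons = [(200, 220.0, 900), (150, 200.0, 850), (60, 70.0, 290)]

for k, ip, bf in seasons:
    # The ratio SO9 / SO% works out to 9 * (batters faced per inning).
    ratio = so9(k, ip) / so_pct(k, bf)
    print(round(so9(k, ip), 2), round(so_pct(k, bf), 3), round(ratio, 1))
```

The ratio between the two is nine times the number of batters faced per inning. As long as that number does not vary much from pitcher to pitcher, one statistic is close to a linear function of the other.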

A few things are clear from the table above:
  • Swinging strike rate is a much better predictor of SO% than anything available from pitch data (how hard does a pitcher throw, which pitches, and how often), bio data (height, weight, age, handedness) and the league in which he pitches.
  • I gain significant predictive power from using a non-linear model. If you want to see why, take a look at the graph of fastball velocity vs strikeout rate from my previous post.
  • Most of my predictive power comes from the non-linear use of the fastball velocity.
Although moving from a 0.52 correlation to a 0.59 correlation is significant, I must acknowledge that my model is not much more than a way to predict strikeout rates from fastball speeds, adjusted for handedness, league, age, and a few other factors. Then again, I need to reiterate that my full model also takes into account all pitcher seasons with fewer than 100 IP, while the models above are restricted to full-time starting pitchers.

In case you are wondering what it means for my 'other features' to predict strikeout rate at a 0.37 correlation, here is an amusing parallel. If I want to predict the year in which a certain pitcher season occurred, using the same data, I get a similar correlation:

predicting:   correlation:   model type:   features:
YEAR          0.32           non-linear    pitch type, bio, league

My data consists of pitcher seasons from 2002-2009, distributed equally among these years. The year is treated as a number, so the model gets credit for getting close to the right year.
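Here is a small sketch of what "getting credit for being close" means when the score is a correlation. I am assuming the Pearson correlation between predicted and actual year here; the `pearson` helper and the two sets of predictions are hypothetical, invented only to show the effect.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up years for four pitcher seasons, plus two made-up sets of predictions.
actual = [2002, 2004, 2006, 2008]
close_guesses = [2003, 2004, 2007, 2008]   # never off by more than a year
wild_guesses = [2008, 2003, 2009, 2002]    # badly scrambled

print(pearson(actual, close_guesses))
print(pearson(actual, wild_guesses))
```

Neither set of guesses gets every year exactly right, but the near misses still produce a high correlation while the scrambled guesses do not.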

So one can predict the year that a certain pitcher season took place using a few trends in pitch data and biographical composition of the pitchers. Again, note that all pitcher seasons are for 100+ IP. The most significant features (roughly in order of importance):
  • Cutter %: Pitchers throw more cutters than ever before. However, this may also reflect a bias over time in how BIS classifies pitches as cutters. Either way, it only affects pitchers who throw cutters, who are still in the minority.
  • Fastball velocity: Average fastball velocity among starters has steadily increased for the last decade.
  • Repertoire depth: This is a statistic that I invented to measure how many pitches a pitcher has in his arsenal. Over the last decade, starters have developed more balanced repertoires. In other words, they throw their second & third pitches at a percentage more in balance with the percentage of fastballs thrown.
  • THROWS = L: There are more lefty starters now than there used to be.
  • not SO9: I allowed the model to use SO9 as a feature, but it did not find it useful to predict the year, at least not after the above factors are considered.
All of these factors add up to a 0.32 correlation with the year of the pitcher season. I thought that was an interesting list, so I decided to share.

For those of you who care about details, here is what the model above (predicting pitcher year) looks like:

Rule: 1
IF
CT_fg_per > 0.05
CT_fg_per <= 9.15
THEN

YEAR_BIT =
0.0082 * THROWS=L
+ 0.0816 * HEIGHT_OVER_6ft
- 0.0004 * WEIGHT
+ 0.0359 * AGE
- 0.1365 * LG
- 0.0011 * FB_fg_per
- 0.0017 * SL_fg_per
+ 0.0657 * CT_fg_per
- 0.017 * CB_fg_per
- 0.019 * CH_fg_per
- 0.0028 * SF_fg_per
+ 0.074 * FB_fg_vel
- 0.0307 * SL_fg_vel
+ 0.0135 * CT_fg_vel
- 0.0024 * CB_fg_vel
+ 0.0012 * CH_fg_vel
+ 0.3186 * rep_depth
- 0.0047 * rep_offerings
+ 1999.0137 [314 instances]

Rule: 2

YEAR_BIT =
0.3793 * THROWS=L
+ 0.0581 * HEIGHT_OVER_6ft
- 0.0401 * FB_fg_per
- 0.0415 * SL_fg_per
+ 0.0217 * CT_fg_per
- 0.0919 * CB_fg_per
- 0.0638 * CH_fg_per
- 0.1114 * SF_fg_per
+ 0.2177 * FB_fg_vel
- 0.0356 * CB_fg_vel
+ 0.034 * CH_fg_vel
+ 1.5547 * rep_depth
- 0.2658 * rep_offerings
+ 1987.7999 [816 instances]
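For readers who want to plug in numbers, the two rules can be sketched as a small piecewise linear function. This is only an illustration under a few assumptions of mine: the rules are applied in order (Rule 1 if its condition holds, otherwise Rule 2), THROWS=L is a 0/1 indicator, and any feature missing from the input counts as zero. The units of each feature are not spelled out above, so the inputs in the usage lines are hypothetical.

```python
# Coefficients transcribed from the two rules above.
RULE_1 = {
    "intercept": 1999.0137,
    "coef": {
        "THROWS=L": 0.0082, "HEIGHT_OVER_6ft": 0.0816, "WEIGHT": -0.0004,
        "AGE": 0.0359, "LG": -0.1365,
        "FB_fg_per": -0.0011, "SL_fg_per": -0.0017, "CT_fg_per": 0.0657,
        "CB_fg_per": -0.017, "CH_fg_per": -0.019, "SF_fg_per": -0.0028,
        "FB_fg_vel": 0.074, "SL_fg_vel": -0.0307, "CT_fg_vel": 0.0135,
        "CB_fg_vel": -0.0024, "CH_fg_vel": 0.0012,
        "rep_depth": 0.3186, "rep_offerings": -0.0047,
    },
}
RULE_2 = {
    "intercept": 1987.7999,
    "coef": {
        "THROWS=L": 0.3793, "HEIGHT_OVER_6ft": 0.0581,
        "FB_fg_per": -0.0401, "SL_fg_per": -0.0415, "CT_fg_per": 0.0217,
        "CB_fg_per": -0.0919, "CH_fg_per": -0.0638, "SF_fg_per": -0.1114,
        "FB_fg_vel": 0.2177, "CB_fg_vel": -0.0356, "CH_fg_vel": 0.034,
        "rep_depth": 1.5547, "rep_offerings": -0.2658,
    },
}


def rule_1_applies(features):
    """Condition of Rule 1: cutter percentage in (0.05, 9.15]."""
    return 0.05 < features.get("CT_fg_per", 0.0) <= 9.15


def predict_year(features):
    """Evaluate the first matching rule; features not supplied count as 0."""
    rule = RULE_1 if rule_1_applies(features) else RULE_2
    linear_part = sum(c * features.get(name, 0.0) for name, c in rule["coef"].items())
    return rule["intercept"] + linear_part


# Hypothetical inputs, only to show which rule fires.
print(predict_year({"CT_fg_per": 5.0}))   # cutter % inside the range: Rule 1
print(predict_year({"CT_fg_per": 20.0}))  # cutter % outside the range: Rule 2
```

Read this way, Rule 1 covers pitchers who throw a small but non-zero share of cutters, and Rule 2 covers everyone else.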
