My writings about baseball, with a strong statistical & machine learning slant.

Friday, February 26, 2010

Why the Reds paid $30 million for Aroldis Chapman

This is not a post about money. I'm just going to demonstrate why a major league ready LHP with a huge fastball is automatically a valuable pitcher, regardless of his other qualities (except maybe health).

According to what I read, Chapman's fastball sits in the mid to low 90's and touches 98 mph. He keeps his velocity deep into games, so he projects as a starter. Let's say that he is able to work his way up to sit at 95 mph, with below-average control for a starting pitcher.

According to my league-based models from the last post, Chapman's strikeout rate projects at 10.3 K/9 in the NL, and at 9.2 K/9 in the AL. This is a baseline, so he might do better, he might do worse, but this is a 50th percentile estimate of sorts. Now let's give him a poor walk rate for a starter. Let's say he is A.J. Burnett. So we project Chapman at 4.22 BB/9, second-worst for 200 IP starters in 2009. While we're at it, let's give him Burnett's home run rate from last year as well (1.09 HR/9). Now we can approximate Chapman's ERA using FIP, a simple fielding-independent ERA predictor:

(HR*13 + (BB + HBP - IBB)*3 - K*2) / IP + (annual league constant)

For the league constant, let's use Bronson Arroyo's 2009 constant of +3.21. Putting it all together, we get:

(1.09*13 + (4.22)*3 - 10.3*2) / 9 + (3.21) = 3.90 FIP
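For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same calculation. It assumes the inputs are already per-nine-inning rates (so dividing by IP becomes dividing by 9) and that HBP and IBB are folded into the walk rate, as in the line above. The function name fip_from_rates is made up for illustration.

```python
def fip_from_rates(hr9, bb9, k9, league_constant):
    """FIP from per-9 rates: (13*HR + 3*BB - 2*K) / 9 + league constant.

    Assumption: bb9 already stands in for (BB + HBP - IBB) per 9 innings.
    """
    return (13 * hr9 + 3 * bb9 - 2 * k9) / 9 + league_constant


# Projected Chapman: Burnett's HR/9 and BB/9, the NL strikeout projection,
# and the +3.21 constant used in the post.
chapman_fip = fip_from_rates(hr9=1.09, bb9=4.22, k9=10.3, league_constant=3.21)
print(f"{chapman_fip:.2f}")
```

Swapping in the AL projection (9.2 K/9) only changes the k9 argument, which makes it easy to see how much of the value rides on the strikeout rate.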

Therefore, a Chapman who sticks in the starting rotation, puts up a 50th-percentile strikeout rate, and has A.J. Burnett's control, is a very valuable pitcher. According to Dave Cameron's calculations, replacement level for NL starters was 5.37 FIP in 2008. Therefore, the projected Chapman, over 200 IP, saves (5.37 - 3.90) * 200/9 = 32.7 runs, as compared to a replacement-level starter (i.e. Micah Owings).

If we use the general rule that 10 runs = 1 win, then Chapman is a +3.3 win pitcher, even with below-average control and assuming his 50th-percentile strikeout projection. That's a lot of projection, but if the Reds think that Chapman is healthy and almost ready for the majors, then their signing makes a lot of sense.
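The runs-to-wins step can be sketched the same way. This is only the back-of-the-envelope version used above: FIP is treated as runs allowed per nine innings, and 10 runs per win is the rule of thumb, not a fitted value. Both function names are hypothetical.

```python
def runs_above_replacement(fip, replacement_fip, innings):
    """Runs saved versus a replacement-level starter over the given innings."""
    return (replacement_fip - fip) * innings / 9


def wins_from_runs(runs, runs_per_win=10.0):
    """Convert runs saved to wins using the 10-runs-per-win rule of thumb."""
    return runs / runs_per_win


# Projected Chapman (3.90 FIP) against the 5.37 replacement level, over 200 IP.
chapman_runs = runs_above_replacement(fip=3.90, replacement_fip=5.37, innings=200)
chapman_wins = wins_from_runs(chapman_runs)
print(f"{chapman_runs:.1f} runs, {chapman_wins:.1f} wins")
```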

Thursday, February 25, 2010

League differences in strikeout rates (a more rigorous treatment)

In my last post, I implored Brian Cashman to stop signing high strikeout pitchers from the NL. Then I showed a couple of graphs to argue my point, with a big "small sample size" disclaimer. Now, I'm going to be a bit more rigorous.

Let me review the problem.
  1. NL starters strike out more batters than AL starters, but not by much
  2. NL starters who lead the league in strikeouts have higher totals than AL leaders, by a larger margin
  3. NL starters who move to the AL tend to have large drops in their strikeout rates
There is no paradox here, but the statements don't seem to mesh very well. Only the first of these facts is reflected in league-based adjustments to ERA approximations like FIP, xFIP and QERA. So by any of these measures, high strikeout starters in the NL are more valuable than high strikeout starters in the AL.

Fastballs and Strikeout Rates

When I wrote about predicting strikeout rates from pitch data, I showed that, all other pitch data being equal, switching leagues (AL to NL) is worth between +0.4 K/9 and +0.8 K/9, depending on a starting pitcher's fastball velocity. Hard-throwing starters benefit from a move to the NL more than soft tossers benefit.
For all starting pitcher seasons (100+ IP), let's map average fastball velocity to strikeout rates. Since the relationship is not linear, I fit both leagues' data to quadratic functions: 

As you can see, the projected NL strikeout rate is always higher than the projected AL strikeout rate for each pitcher, with the differences ranging from +0.4 K/9 to +0.8 K/9 or so.
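For the curious, a quadratic fit of this kind can be sketched in plain Python. This is not the code behind the graph; it is a generic least-squares fit, and the sample points below are a synthetic curve (not real pitcher seasons) used only to show that the function recovers a known quadratic. Velocities are centered on their mean before fitting, which keeps the numbers well behaved.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*u**2 + b*u + c, where u = x - mean(xs).

    Returns a function that predicts y for a new x.
    """
    n = len(xs)
    x0 = sum(xs) / n
    us = [x - x0 for x in xs]
    s1 = sum(us)
    s2 = sum(u ** 2 for u in us)
    s3 = sum(u ** 3 for u in us)
    s4 = sum(u ** 4 for u in us)
    t0 = sum(ys)
    t1 = sum(u * y for u, y in zip(us, ys))
    t2 = sum(u * u * y for u, y in zip(us, ys))
    # Augmented normal equations for the unknowns (a, b, c).
    m = [[s4, s3, s2, t2], [s3, s2, s1, t1], [s2, s1, float(n), t0]]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= factor * m[col][k]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        rest = sum(m[r][k] * coef[k] for k in range(r + 1, 3))
        coef[r] = (m[r][3] - rest) / m[r][r]
    a, b, c = coef

    def predict(x):
        u = x - x0
        return a * u * u + b * u + c

    return predict


# Synthetic (velocity, K/9) points lying exactly on a made-up curve.
velocities = list(range(86, 97))
rates = [0.05 * (v - 90) ** 2 + 0.4 * (v - 90) + 6.0 for v in velocities]
predict = fit_quadratic(velocities, rates)
print(f"{predict(95):.2f}")
```

Fitting one such curve per league and subtracting the two predictions at a given velocity gives the kind of NL-minus-AL gap described above.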

The differences in distributions for strikeout rates and fastball velocity can be summarized as such:

[Table: 25th and 75th percentile values of SO9 and fastball velocity (FB vel) for NL and AL starters. The values are missing from this copy.]

At this point, we could quit with the following explanation:
  • Strikeouts are harder to get in the AL:
    • therefore NL starters have higher strikeout rates
    • and therefore AL starters throw a little bit harder, on average
However, the simple analysis ignores another crucial factor in strikeout rates: being left-handed. I had a friend ask me: how could lefties have higher strikeout rates than righties? Don't most batters bat right-handed, and thus prefer to see lefties? I can't answer that question directly, but there is no question that lefty starters strike out more batters than righties, given the same fastball velocity. This is why about 30% of all starting pitchers are left-handed, even though the population is only about 10% left-handed.
To show how huge this advantage is, I trained separate piecewise-linear models for strikeout rates in the AL and NL, based only on FB_vel and handedness. You can see the full text representation of the models here. As in the more detailed models that I trained earlier, there are 2-3 linear rules, with the rules split by fastball velocity. The NL model has two rules, but the AL model has a third rule for pitchers with average fastballs under 88.5 mph. Let's ignore that low-end rule, and compare the other rules in a chart. 

Rule:        FB range:
NL rule 1:   [91.1, +∞)
NL rule 2:   (-∞, 91.0]
AL rule 1:   [90.6, +∞)
AL rule 2:   [88.5, 90.5]

[The chart's other columns, "FB_vel feature" and "base case (90.0 mph righty)", have no surviving values in this copy.]

As you can see, being a hard-throwing lefty increases strikeout rates in either league. But it's a little more valuable in the NL.

Now that we have predictions for strikeout rates from FB speed and handedness, we can do something more interesting: we can compare strikeout rates between leagues, and guess how a pitcher might have performed in the opposite league.

Comparing strikeout rates between leagues

My method is simple: take a starter's actual SO9, and subtract the projected SO9 for his appropriate league, using the above model. Now we have his over or under achievement for his strikeout rate. Call that his "skill factor," beyond just throwing hard or being a lefty. To project how he might have fared in the other league, we add that over (or under) achievement to his league-based projection for the other league.

For example, Randy Johnson struck out 10.62 K/9 in 2004 for the Diamondbacks. That translates to 9.43 K/9 in the AL. When he moved to the Yankees in 2005, Johnson posted 8.42 K/9. His 2005 translation for the NL: 9.62 K/9. So his K/9 dropped off year to year, but due as much to league change as to declining performance. His "skill factor" dropped from +0.85 to +0.29 from 2004 to 2005 and his fastball velocity dropped by 1.0 mph, but the league change had a major effect on his strikeout rate.
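The translation is just one subtraction and one addition, so it is easy to sketch. The function names below are hypothetical. The baseline projections in the example are not read from the model file; they are back-derived from the numbers quoted above (actual rate minus skill factor), so treat them as approximate.

```python
def skill_factor(actual_so9, projected_so9):
    """Over- or under-achievement relative to the league baseline projection."""
    return actual_so9 - projected_so9


def translate_so9(actual_so9, own_league_proj, other_league_proj):
    """Carry a pitcher's skill factor over to the other league's baseline."""
    return other_league_proj + skill_factor(actual_so9, own_league_proj)


# Randy Johnson, 2004 (NL): 10.62 actual, +0.85 skill factor.
# Back-derived baselines: NL 10.62 - 0.85 = 9.77, AL 9.43 - 0.85 = 8.58.
rj_2004_in_al = translate_so9(10.62, own_league_proj=9.77, other_league_proj=8.58)

# Randy Johnson, 2005 (AL): 8.42 actual, +0.29 skill factor.
# Back-derived baselines: AL 8.42 - 0.29 = 8.13, NL 9.62 - 0.29 = 9.33.
rj_2005_in_nl = translate_so9(8.42, own_league_proj=8.13, other_league_proj=9.33)

print(f"{rj_2004_in_al:.2f} {rj_2005_in_nl:.2f}")
```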

I have a spreadsheet of all starters here, along with strikeout rate projections for both leagues. Yes, it's a huge spreadsheet with lots of columns. Sorry. I'm going to mention a few trends that I found interesting. (I excluded Tim Wakefield & Steve Sparks. My projections really don't make sense for knuckleballers.)

I ranked the pitcher seasons by projected (or actual) strikeout rates by AL standards. So all AL starters are ranked by actual strikeout rates, while NL starters are ranked by their "skill factor" adjusted to the AL.

Top billings go to Erik Bedard (vintage 2007) and Pedro Martinez (vintage 2002). Both struck out almost 11 per nine innings in the AL. Unfortunately, the data goes back only to 2002, so I'm missing Randy Johnson's best seasons, as well as Pedro's ridiculous totals from 1999 and 2000. If we had included Pedro's 1999 season, he would have won the AL-rules contest by almost two full strikeouts per game. He was that good.

Also you might notice that the top strikeout pitchers tend to be disproportionally left-handed. About half of the highest-strikeout guys are lefties, although only 30% of all starters are lefties. This is not just an NL phenomenon. There is no clear preference for NL pitchers among the highest strikeout rates in MLB, even if we project those NL pitchers to AL baselines.

As you might also notice, the AL-based strikeout rates are much lower for the top guys than are the NL-based strikeout rates for those same guys. As you can see from the "NL - AL proj" column, the baseline projections (based on fastball speed and handedness) range from -0.46 K/9 to +1.22 K/9. To see the trends for these values, let's run a 25-point average, by ordering the pitcher seasons according to AL-based strikeout rates:

There is a lot of variance, but top-strikeout guys are projected to have a larger advantage in the NL than the average starting pitcher.
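A 25-point average like this one can be sketched in a few lines. The version below is a trailing average, which is an assumption about how the smoothing was done; the function names and the (AL-based SO9, NL minus AL projection) tuple layout are also made up for illustration.

```python
def trailing_average(values, window=25):
    """Mean of each run of `window` consecutive values (first full window onward)."""
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


def smooth_league_gap(seasons, window=25):
    """Order seasons by AL-based SO9, then smooth the NL-minus-AL projection gap.

    `seasons` is a list of (al_based_so9, nl_minus_al_proj) tuples.
    """
    ordered = sorted(seasons, key=lambda season: season[0])
    return trailing_average([gap for _, gap in ordered], window)
```

With the spreadsheet loaded as a list of tuples, smooth_league_gap(seasons) returns one smoothed gap per position in the AL-based ranking, starting from the 25th season.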

Even if you do not believe my models for projecting strikeout rates from FB_velocity and handedness, it's pretty clear that high-strikeout pitchers tend to be more dominant in the NL than a simple league-average based adjustment would predict. The average starter strikes out about +0.6 K/9 in the NL, compared to the AL. However, a high-strikeout starter strikes out almost +1.0 K/9 in the NL (or certainly +0.8 K/9). Since the "average strikeout pitcher" includes high-strikeout pitchers, the difference for mid-level strikeout pitchers is even more apparent.

It's also interesting, though perhaps less significant, that the league-based strikeout differentials increase a little bit for the really low-end strikeout pitchers. As I mentioned in the previous article, some low-strikeout pitchers benefit from a move to the NL. Typically what happens is that a hard-throwing starter in the AL will under-perform his strikeout projections, and then he might do a lot better with a move to the NL. That sounds like good news for Chien-Ming Wang, and also for Carlos Silva. Although I'm not sure if Silva throws hard enough any more to be effective in any league.


If you believe my methods and are not overly worried about training models on somewhat small samples (more discussion of sample sizes later), then you can use the models directly to predict the kind of adjustments that need to be made in order to compare AL & NL strikeout rates on the same scale. Alternatively, you can look at specific examples of pitcher seasons, and see what kind of adjustment had to be applied in those cases. Here are a few typical cases:

  • Hard-throwing lefty (Bedard '07, Santana '04, etc): +1.2 K/9
  • Hard-throwing righty (Schilling '03, Burnett '07, etc): +0.8 K/9
  • League-average lefty (Cliff Lee '04, Pettitte '06, etc): +0.5 K/9
  • League-average righty (Pedro '02, Mussina '03, etc): +0.4 K/9
If you do not buy the whole "lefties have disproportionately more success in the NL" theory, then you can simply use the RHP figures above, or refer to the trailing averages graph above.

If anything, I think I've shown that high-strikeout pitchers do tend to do much better in the NL than simple average-based projections might suggest. Therefore, those pitchers will have their FIP, xFIP, and QERA (and probably SIERA as well) suffer by moving to the American League (while league-average strikeout pitchers will not). As a result, their VORP, WAR, or any other measure of value, will also drop with a move to the AL.

Does this mean that we should change FIP? I don't think so. The point of FIP is to measure the "run-saving value" of each pitcher. Top strikeout pitchers really do provide more value to NL clubs by striking out more hitters. We should not penalize them because they would not have done as well in the AL. However, projection systems and GMs need to be aware of the fact that NL-based high-strikeout pitchers are not as valuable in the AL, and thus not overpay for production that they will not be getting.  

Sample sizes

Although the samples I am using here are not as small as those for the original article, I'm still basing this analysis on only 1,000 pitcher seasons, of which only 300 are for lefties. Worse yet, I am training two models (AL and NL), for a total of 5 rules (each with 3 variables). However, this is all the data that I've got. Even if I had reliable fastball data going back 20 years, I'd be looking at stats from a different game.

Rather than argue whether or not all of this is statistically significant, let's perform an experiment. I've repeatedly mentioned Randy Johnson. Then I projected his stats, based on adjustments that, among other seasons, included Randy Johnson seasons. Doesn't sound kosher. Let's remove Randy Johnson from the sample, and train a new NL model:

Rule:                FB range:
NL rule 1:           [91.1, +∞)
NL rule 2:           (-∞, 91.0]
NL rule 1 (no RJ):   [91.1, +∞)
NL rule 2 (no RJ):   (-∞, 91.0]

[The chart's other columns, "FB_vel feature" and "base case (90.0 mph righty)", have no surviving values in this copy.]

That's right, I just referred to the Big Unit as RJ.

As you can see, removing Johnson's stats is not insignificant for the NL strikeout model. Not surprisingly, the new model (for hard-throwing NL starters) gives less weight to fastball speed, and also less weight to being left-handed. If we use this new model, hard throwing lefties only get a +1.0 K/9 bonus for the NL, while hard-throwing righties stay at +0.8 K/9.

OK, so the model is sensitive. Then again, we just selectively removed the most dominant lefty NL pitcher from our data set! I'd rather have the complete data, and accept that all of my conclusions come with error bars.

Tuesday, February 23, 2010

Injuries and projecting performance.

With the release of Josh Hermsmeyer's injury database, lots of people are commenting on how predicting injuries (and also predicting injury-related performance changes) might be the new breakthrough in sabermetrics. Perhaps so. I've had access to descriptive DL data for the 2000's for some time now, so I've given it a shot. Let me tell you: predicting injuries is really hard; using injuries for performance prediction is even harder.

I haven't touched the injury data while I was busy with pitch data, but here is where I left off...

I used descriptive injury data to come up with features like "did the [pitcher] have a surgery?" "how many days did he spend on the DL?" and "did he have elbow-related injuries?" I spent a week coming up with what I thought might be interesting features for pitcher injuries. Then I ran "feature selection" in WEKA with that data, to see which of my injury features were most useful for predicting innings pitched, VORP, and the incidence of future injuries. I still have those notes somewhere: a full notebook on which injury features help to predict future DL time, future elbow problems, future shoulder problems, and future surgeries. Good times.

Of course, the correlations between my predictions (for DL days, probability of elbow-related DL stints, etc) and reality weren't very high. We all know that injuries are hard to predict, and that there is much variance involved. Still, my predictions weren't bad. The average pitcher projected for 15 (!) DL days, with some pitchers for considerably more time than that. All the extreme cases made some sense. However the DL projections were not of much use for predicting VORP, or even IP. Most strangely, the features for "projected_DL_X" often had a positive value in projecting IP.

How can a high injury risk player project to have more value than another player with the same stats and less injury risk?

The key is same stats. Suppose you have a pitcher with seasonal WAR (total value, in wins) of 5.0, 5.0, and 2.0. What is his established level of performance? It matters whether the 2.0 drop off was injury related. If so, we might say that he's a 5.0 WAR player, but perhaps project him a bit lower because of injury risk. If he wasn't injured, we might say he is now a 2.0 WAR player.

It's like applying to Yale Medical School with a 3.4 GPA. You'd better have some 4.0 GPA semesters, and an explanation for the 3.0 GPA semesters. If you got a 3.4 GPA every semester, that's your established level of performance.

Of course, determining to what extent a performance change is due to injuries is very difficult. Especially for a computer.

A projection system, in either case, will look at the 5.0, 5.0 and 2.0 seasons and probably give a guess between 2.0 and 5.0 for next season. Now you tell the computer that the player has above-average injury risk. Well, that means that his upside (an injury-free season) is also higher than expected, so the computer might upgrade his projection. Then again, if we know that the player will most likely miss half of the next season, then the computer should lower his projection instead. It's not as simple as downgrading the projected value of a player, if he has high injury risk.

As you can see, the projection systems, built to function without injury information, will not necessarily benefit from injury projections. A lot of the information about a player's injury risk is already embedded in the previous years' stats.

Modeling injury risk separately from usage and value projection is useful in itself. But if you see a pitcher projected at X, but with high injury risk, do not necessarily think this means that X needs to be revised downward. The only thing that you can be sure of is that variance in performance increases with injury risk. But you don't need a computer model to tell you that.

I'll be getting back to pitcher injuries soon. Good luck to everyone else looking at these fascinating problems!

Friday, February 19, 2010

Dear Mr. Cashman, no more NL starters please!

Before the 2005 season, the Yankees acquired Randy Johnson in a trade with the Diamondbacks. Although Johnson was already 41 at the time, he had just come off of a ridiculous six-year run with the Snakes. He'd collected four Cy Young awards, and he also finished second once. A big part of his success was due to his nearly 12 strikeouts per 9 innings average (SO9 or K/9) over those six seasons. Johnson had defied regression to the mean. He had posted at least 10.0 K/9 in each of the past 14 seasons. Knowing all this, the Yankees sent Javier Vazquez, a left-handed pitching prospect, Dioner Navarro, and stacks of cash to the Snakes for the best left-handed starting pitcher of his era. (All stats and facts can be found on FanGraphs and Baseball Reference)

In 2005, Johnson wasn't nearly as dominant. He finished 17-8 in 225 2/3 innings with a 3.79 ERA. He was the Yankees' best starting pitcher, but neither he nor the Yankees fans considered his season a success.

At the time, I thought that criticism of Johnson was a bit overblown. A commentator (I forget which one) pointed out that Johnson's numbers (17-8, 225 2/3 IP, 3.79 ERA) were not much worse than those of Bartolo Colon (21-8, 222 2/3 IP, 3.48 ERA), who won the AL Cy Young that year. I figured that if Randy Johnson had been one of the top two or three starting pitchers in the AL that year, then how can we consider his season a failure? Was his rise in ERA bad luck, a response to tougher opposition, or did he finally decline after years of proving statisticians wrong?

Advanced pitching metrics show that Randy Johnson did indeed regress quite a bit in 2005. His FIP (a simple fielding independent ERA approximation) jumped to 3.78 (from an FIP of 2.30 in 2004), after being consistently under 3.00 with the Diamondbacks. A big part of this increase was his loss of 2.20 K/9 from his 2004 season with the Snakes. FIP is computed as follows:

(HR*13 + (BB + HBP - IBB)*3 - K*2) / IP + (annual league constant)

If we ignore the change in league constant, then it's clear that Johnson's loss of 2.20 K/9 resulted in a rise of 0.49 FIP. In other words, FIP predicts that Johnson's ERA would rise by 0.49 as a result of his strikeout rate falling from 10.62 to 8.42 between 2004 and 2005. However FIP does not adjust for park factors, and also takes home run rates at face value. A more advanced version of FIP is xFIP, which takes better account of park factors and the luck involved in HR rates. From 2004 to 2005, Johnson's xFIP rose from 2.60 to 3.42.
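The 0.49 figure comes straight from the strikeout term of the formula: each strikeout per nine innings is worth 2/9 of a run of FIP. A small Python sketch, with a made-up function name, holding home runs, walks and the league constant fixed:

```python
def fip_change_from_k9(delta_k9):
    """Change in FIP implied by a change in K/9, everything else held equal."""
    return -2.0 * delta_k9 / 9.0


# Johnson's K/9 fell from 10.62 (2004) to 8.42 (2005).
johnson_fip_change = fip_change_from_k9(8.42 - 10.62)
print(f"{johnson_fip_change:+.2f}")
```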

Therefore, if we make allowances for the differences in moving from Arizona in the NL to New York in the AL, and we make allowances for Johnson's good luck with home runs in 2004 (and bad luck with home runs in 2005), he still regressed quite a bit. A large part of Johnson's decline can be attributed to his sudden loss of 2.20 K/9. Should this loss of strikeouts have been expected by the Yankees front office?

In my grand opus on strikeout rates, I mentioned that it seems like lots of high-strikeout pitchers who move from the NL to the AL lose about 2.0 K/9 in their first year after the move. Let's take a look at all starting pitchers who've recently switched leagues.

For this mini-study, I define starting pitchers as those pitchers who threw 100+ innings as starters in consecutive seasons. The data is from 2003-2009, so the samples are fairly small (and some samples could be significantly biased by transactional trends for certain teams, i.e. the Yankees). Therefore, I will not claim that this study proves anything. Nor do I suggest that you should take my numbers at face value.

Here are the before & after strikeout rates for starters who switched leagues, along with trend lines:

Wouldn't you know it? Starters with 10+ K/9 in the NL tend to lose a little over 2.0 K/9 in moving to the AL. Maybe the Yankees (or at least their fans) should have looked at this graph in 2006 before they gave Randy such a hard time.

Also, we've got us two parallel lines. So maybe strikeout changes in switching leagues are symmetrical? Let's look into this a little further.

First, let's note that for any "old SO9" rate, a starter moving to the NL will tend to have 1.5 more K/9 than a starter moving to the AL. This would suggest that there is some loss in moving to the AL and some gain in moving to the NL, with the total adding up to about 1.5 K/9. This is pretty consistent with my findings in regard to league adjustments for predicting strikeout rates from pitch data.

Second, we can also see that the slope of both trend lines is about 2/3, so pitchers lose 1/3 of their strikeouts over 6.0 K/9 (which happens to be roughly the average strikeout rate for starters in both leagues) when they switch leagues, in addition to league change adjustments. Does this have to do with league change also, or would the pitchers have undergone a 1/3 reduction in their marginal strikeout rates if they didn't switch leagues?
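Read literally, a slope of 2/3 around a 6.0 K/9 pivot means a pitcher keeps two thirds of whatever he had above (or below) 6.0. The sketch below shows only that slope effect. It leaves out the league shift, which comes from the trend line intercepts in the graph and is not reproduced here, so it will not match the plotted lines on its own. The function name and defaults are hypothetical.

```python
def shrink_toward_pivot(old_so9, slope=2 / 3, pivot=6.0):
    """Next-season SO9 implied by the trend-line slope alone (no league shift)."""
    return pivot + slope * (old_so9 - pivot)


# A 9.0 K/9 starter keeps two thirds of his 3.0 K/9 margin over 6.0.
print(f"{shrink_toward_pivot(9.0):.2f}")
```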

We can not know what kind of season Randy Johnson (or Curt Schilling, or Kevin Brown) would have had if they never moved from the NL West to the AL East. Instead, we'll look at two other groups of pitchers. One group switched teams, but stayed in the same league. The other group stayed with the same team. However, we only count pitchers staying with the same team if their age for the second season was 30+. Teams rarely trade (and never release) young pitchers who are full time starters for them, and young pitchers are not eligible for free agency. Therefore we need to compare pitchers switching teams to guys who are a bit older than average. I picked 30 arbitrarily, in order to avoid multiple-endpoints issues for such small sample sizes.

Here are all six classifications of starters below, with trend lines. Again the samples are small (some categories contain as few as 50 pitchers), so we should not place great emphasis on small changes in the slope of the trend lines.

Again, the slopes do differ a little, but we have a fairly clean stacking of the six categories of starters inside the 4.0 K/9 to 10.0 K/9 range where the vast majority of starters make their living.

Consider the several hypothetical fates of an NL starter:
  • If he stays with the same team (blue trend line), then he will experience a strikeout rate decline of 0.0 to just over 0.5 K/9, depending on how high his previous K/9 rate was.
  • If he moves teams within the NL (thin teal trend line), then he will have the exact same pattern of decline.
  • If he moves to the AL (purple trend line), then he will lose between 0.0 K/9 and 1.5 K/9 within the range of reasonable strikeout totals. An average starter (6.0 K/9) will lose just over 0.5 K/9, but a high strikeout starter may lose 1.5 K/9 due to the league change, in addition to the expected 0.5 K/9 natural decline.
Now consider a hypothetical AL starter:
  • If he stays with the same team (red trend line), then he will have a similar decline to the NL starter, except that his decline will be a tiny bit bigger on the high-strikeout end.
  • If he switches teams within the AL (orange trend line), then he will experience further decline of up to 0.5 K/9 on the high-strikeout end.
  • If he switches to the NL (green trend line), then he will gain between 1.0 and 1.5 K/9, relative to having switched teams within the AL.
The most notable points are that:
  • All starters lose strikeouts when they move to the AL.
  • All starters gain strikeouts when they move to the NL.
  • High-strikeout pitchers are particularly susceptible to the drops in strikeout rates when they move to the AL.
  • Low-strikeout pitchers have the most to gain by moving to the NL.
Looking at the graphs above, one might be tempted to assume that starters have much higher strikeout rates in the NL, on average, than they do in the AL. However, this is not the case. Consider average strikeout rates for the six categories above (I use the "new SO9" numbers in all cases):

                    NL (mean)   AL (mean)   NL (median)   AL (median)
same team:          6.04        5.92        5.64          5.56
new team:           5.82        5.64        5.36          5.35
switch to league:   6.61        5.79        6.29          5.66

How can this be? If average starters in the NL do not strike out more batters than average AL starters, then how come there is such a huge change in strikeout rates when starters switch leagues?

Does this imply that the AL has more talented pitchers? Maybe, but not necessarily.

Low-strikeout AL pitchers benefit significantly from a league change. Therefore they have a strong incentive to change leagues. Low strikeout NL pitchers lose a further 0.5 K/9 with a move to the AL, so they have little incentive to change league. This would suggest that low-strikeout pitchers (at least by standards of AL ability) will be concentrated in the NL.

Now consider high-strikeout pitchers. High-strikeout pitchers who move to the AL have large drops in strikeout rates (1.5 K/9 on the high end). This is a significant disincentive for them to make the move. However high-strikeout AL pitchers have only a 0.5 or so gain in strikeout rates when moving to the NL. So they have a small incentive to move to the NL, and a large incentive to stay in the NL. This would suggest that high-strikeout pitchers will also be concentrated in the NL!

The math for average and slightly above-average strikeout pitchers is a bit more symmetrical. If a pitcher has 7.0 K/9, he will, on average, gain or lose 0.75 K/9 by switching leagues. His numbers will look better in the NL, but a well-tuned stat that adjusts for average league differences will suggest that his performance has the same value in either league.

Therefore, we should think that the AL will have more slightly above average starters, while the NL will have more high-strikeout starters, but it will also have more low-strikeout starters. That would be the most logical equilibrium.

So far, we have not mentioned why pitchers should have different strikeout rates in the two leagues. Let's assume that the reason is a combination of rule differences, and of lineup differences that result from an adjustment to those rule differences. When the teams enter inter-league play, they will have to play some games by the other league's rules, and against lineups built for those same rules.

If the strikeout rate differences are due to different rules (and lineups designed to adjust for those rules), then NL starters should have a comparative disadvantage relative to their AL brethren. The high-strikeout NL starters will suddenly become much lower-strikeout pitchers. The low-strikeout NL starters will suffer a smaller loss in strikeout rate. However, even if the back of the rotation performs relatively better, that can not compensate for the top NL starters losing 1.0-2.0 K/9 overnight. On the other hand, AL teams, which should have relatively more slightly above average strikeout starters (and fewer high-strikeout starters) will have a more balanced effect on their strikeout rates when facing NL competition. They will all strike out 0.5-1.0 more batters per 9 innings. The run environment will change proportionately, so the AL guys will not become more valuable. However, AL teams will not be subject to the kind of dramatic loss in high-end value that NL teams' high-strikeout starters experience in AL ballparks.

I might be going too far with an argument that is hinged on a fairly small amount of recent data. However, I think it's an argument worth considering. The following facts are hard to dispute:
  • High-strikeout NL starters experience large drop offs in strikeout rate upon moving to the AL.
  • In recent years, the NL has had many more high-strikeout starting pitchers than the AL, despite the fact that there is little difference in average strikeout rate between the leagues.
  • The AL has whipped the NL in inter-league play (in some years by huge margins) ever since this experiment began.
In his book, Whitey Herzog suggested that NL teams had an unfair advantage against AL teams in inter-league play, because their pitchers should be much better hitters. In today's run environment, that may not matter much any more. However in today's high-strikeout environment, I suggest that high-strikeout starting pitchers from the NL are disproportionately hurt by inter-league play, thus giving the AL a significant advantage.

So if you are reading this, Mr. Cashman, please stop signing high-strikeout NL starting pitchers. Instead, keep concentrating on offense, defense, and the bullpen. Keep trying to acquire or develop starters like Andy Pettitte and David Wells. Guys with above average strikeout rates, who keep their walk rates down. I hope that Javier Vazquez bucks the trend and holds on to most of his 9.77 K/9 from last season. However, I would not hold my breath.

I should also reconsider some of my thoughts on this off-season's big transactions. A few months ago, I wrote about the Edwin Jackson trade. I said that Jackson had consistently under-achieved his potential SO9 rate (based on fastball speed and other pitch factors), and that pitchers like him tend to continue under-achieving. I said that despite one good season and a world of talent, Jackson was unlikely to ever achieve high strikeout rates with Arizona. I might have to change my mind about this now. Jackson struck out 6.77 per nine innings last season. His totals should improve in Arizona to about 7.8 K/9. If that happens, Jackson will be a valuable starter, whatever his other shortcomings may be.

In order to get Jackson, the Snakes sent Max Scherzer to the Tigers. He recorded 9.2 K/9 in his first full season as a starter. According to my graph, he should drop down to about 7.5 K/9 next year with the Tigers. This suggests that last season's strikeout rates for the two pitchers are not as different as they first seemed to me. Even so, Scherzer was clearly the better pitcher last year. Both Jackson and Scherzer are young, and if the Diamondbacks think that Jackson has a higher upside, then they are fully justified in making the trade. In any case, Scherzer is likely to regress (at least in nominal stats) with the Tigers, so the Snakes will look like they sold high on him.

Wednesday, February 17, 2010

Firemen are soooo predictable! (addendum to 'Fastballs for Dinner')

Silly me. I was comparing starters to all pitchers. We all know that there are three kinds of pitchers: starters, relievers and Joba Chamberlain.

I isolated the starters with a simple 100 IP cutoff. I can similarly isolate the relievers by requiring no more than 6 starting innings pitched. This captures 86% of the relief innings, with less than 1% contamination by starter innings.

Graphs for relievers are below:

There is not a whole lot of difference between the known relievers and the "all pitchers" figures. However I can't help but note just how similar all lefty relievers' repertoires really are. If this is evolution, I hope evolution doesn't happen to the general pitcher population.

Then again, the data I'm using does not take account of two-seam fastballs or of sinking fastballs. But still, all of the crafty lefties I can think of are starting pitchers. As I mentioned in an earlier post, there has been a steady increase of lefty starters in the past decade. So maybe clubs are doing a good job of getting their best lefties into the rotation. The lefties who throw smoke but can't learn a pitch other than a slider end up in the bullpen.

Tuesday, February 16, 2010

How do you know if he'll be unhittable?

After much delay, I'm ready to show my best effort for predicting strikeout rate from numbers not related to a pitcher's performance.

Say you've got a pitcher. College guy, minor leaguer, major leaguer, etc. You would like to project what his future MLB strikeout rate should be (granted that he makes it that far). Can we predict this number from scouting the guy (i.e. taking a look at what pitches he already throws, and projecting how his repertoire may evolve)? Yes, we can. With around a 0.6 correlation to actual performance. I showed my original model for this almost 3 months ago.

Since then, I have looked into a number of additional factors and adjustments. Also I can answer a few obvious questions raised by my original work. Most importantly, I am going to break down the model, rather than just drop a formula on my blog.

Starters & Relievers

I think it's important to know how well we can predict strikeout rate for starters, relievers, and for pitchers in general. So I will show three models.

I consider anyone a starter if he pitches 100+ innings. I consider anyone a reliever if he throws fewer than 6 innings as a starting pitcher. This simple reliever classification captures 86% of the relief innings (with less than 1% of starter innings). The simple starter classification captures 82% of starter innings (with less than 10% of the reliever innings). Also, we avoid (almost all) pitchers who spend substantial time both in the rotation and in the bullpen.
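The two cutoffs above can be written down as a small classifier. This is only a sketch: the `PitcherSeason` record and its field names are hypothetical, the starter test is checked first (an assumption about how overlaps are resolved), and the reliever test uses a strict "fewer than 6" comparison.

```python
from dataclasses import dataclass

STARTER_MIN_TOTAL_IP = 100.0   # "100+ innings" cutoff from the post
RELIEVER_MAX_START_IP = 6.0    # "fewer than 6 innings as a starting pitcher"


@dataclass
class PitcherSeason:
    """Hypothetical record; the field names are assumptions for illustration."""
    name: str
    total_ip: float   # all innings pitched in the season
    start_ip: float   # innings pitched in games the pitcher started


def classify(p: PitcherSeason) -> str:
    """Return 'starter', 'reliever', or 'unclassified' (swingmen, short call-ups)."""
    if p.total_ip >= STARTER_MIN_TOTAL_IP:
        return "starter"
    if p.start_ip < RELIEVER_MAX_START_IP:
        return "reliever"
    return "unclassified"
```

Anyone who lands in "unclassified" is left out of the starters and relievers models, which is how pitchers who split time between the rotation and the bullpen get avoided.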

M5 Rules in WEKA

I wish I could just show you a few multiple regression models side by side. However, the algorithm I use is slightly more complicated. Although only slightly.

I train my models with the "M5 Rules" algorithm in WEKA, the open source machine learning platform. This algorithm is basically a souped-up multiple regression model. Given a bunch of features (30 or so in my case), the algorithm will build a linear model to predict a single value. With a few caveats:
  1. The algorithm can split the data set along a linear rule (example: FB velocity > 90 mph).
  2. The algorithm tries to keep the weights of features as low as possible. This also means that features that are not meaningful are eliminated entirely from the model.
If I give the algorithm 30 features, rather than getting back a single rule with 30 weights, I might get 2 or 3 linear rules, each with 5-10 weighted features.
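To make the shape of such a model concrete, here is a minimal sketch of how a small rule list could be evaluated: each rule has a condition and its own linear model, and the first rule whose condition holds makes the prediction. The feature names, the split at 90 mph, and every intercept and weight below are made up for illustration. They are not the weights from my models, and this is not WEKA code.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]

# A rule is (condition, intercept, weights). Rules are tried in order.
Rule = Tuple[Callable[[Features], bool], float, Dict[str, float]]


def predict(rules: List[Rule], x: Features) -> float:
    """Apply the first rule whose condition matches and return its linear prediction."""
    for applies, intercept, weights in rules:
        if applies(x):
            return intercept + sum(w * x[name] for name, w in weights.items())
    raise ValueError("no rule matched")


# Toy numbers only: one rule for hard throwers, one catch-all rule for everyone else.
toy_rules: List[Rule] = [
    (lambda x: x["fb_velocity"] > 90.0, -38.5, {"fb_velocity": 0.5, "throws_left": 1.0}),
    (lambda x: True, -11.5, {"fb_velocity": 0.2}),
]
```

Note that the second toy rule simply has no entry for `throws_left`. That is the sketch's version of a feature being eliminated from a rule entirely.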

In evaluating the accuracy of my model, I look at the correlation between the outputs of the model and the observed strikeout rates. However, the correlation values that I get are from 5-fold cross validation. This means that WEKA actually builds 5 models, each time leaving out 1/5 of the data to use for testing. The correlation figures are for the testing data of the 5 models. Therefore, I am never training and testing on the same data. However, when breaking down models, I always use a model trained on the entire data set.
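The cross validation procedure can be sketched in a few lines. To keep the example self-contained, a one-feature least squares line stands in for the M5 Rules model, and the held-out predictions from all folds are pooled before computing the correlation. Both of those choices are assumptions for illustration, not a description of WEKA's internals.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x. Returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def cross_validated_correlation(xs: Sequence[float], ys: Sequence[float],
                                k: int = 5, seed: int = 0) -> float:
    """Correlation between observed values and predictions made on held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k roughly equal test sets
    predicted: List[float] = []
    observed: List[float] = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:                             # score only rows the model never saw
            predicted.append(intercept + slope * xs[i])
            observed.append(ys[i])
    return pearson(predicted, observed)
```

Each row is predicted exactly once, by a model that did not see it during training, so the resulting correlation is an out-of-sample figure.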

The Three Models

As I explained above, each of my models (for starters, for relievers, and for all pitchers) will have 2-3 rules each. Thankfully, the models split along the same couple of features (FB velocity and starter/reliever classification), so I will be able to compare rules across models fairly easily. First, let me summarize the models.

model       # of rules   correlation with SO9   average IP   % lefties
relievers   2            0.4456                 35.9         28.1

You can see the models in a text document here. Also I have a grid of the features by model here. I have removed meaningless features in the spreadsheet. This should make it easier to see the significant features that are left over.

Let's quickly list the possible features:
  1. "bio data:" handedness, height, weight, age
  2. for each pitch [FB=fastball, SL=slider, CT=cutter, CB=curve, CH=change, SF=splitter, KN=knuckler]:
    • % thrown
    • velocity
    • whether the pitch is part of offerings
    • whether the pitch is part of depth
  3. repertoire depth and repertoire offerings for the pitcher (an explanation with examples is here)
  4. league (LG)
    • AL = +1
    • NL = -1
  5. IP Start
    • only for the 'all pitchers' model

Using IP in Strikeout Prediction Models

The concept of using IP in a model to predict strikeout rates seems contrary to my aim of predicting strikeout rates without using performance-based information. It certainly is. Playing time is the simplest measure of performance. Better pitchers throw more innings.

And yet, adding IP to the feature set above does not help us much in predicting strikeout rates. The 'all pitchers' model (which uses IP Start) has 0.5675 correlation with observed strikeout rates. If we don't allow it to use IP Start, correlation drops to 0.5491. The model stays nearly as predictive. However for correctness, I don't use any IP features in the starters and relievers models (beyond classification of pitchers into these models). I use IP for the 'all pitchers' model in order to have a rule separating starters and relievers. This makes later analysis a lot simpler.

The fact that IP does not help us predict strikeout rates (if we already have pitch type data) is notable. This is not the case for the other pitcher rates that we might care about!

Let's restrict ourselves to starting pitchers. Here is what happens if we train models for various pitcher rates with and without IP as an input:

rate    correlation with IP   correlation without IP   drop off
SO/BB   0.517                 0.400                    0.117

As you can see, having access to IP data does not matter for predicting starters' strikeout rates. Given the same pitch distribution, a 140 inning fourth starter will tend to have the same strikeout rate as a 220 inning first starter. However this is not true for many of the other rates.

Pitch distributions & bio data do explain a significant portion of many of these rates. However the rates that we might care most about (QERA, BBr, SO/BB, GB/FB, BABIP) are dependent, to a large degree, on IP. I don't know whether this means that I am missing some method of getting more out of the non-performance related data, or that good pitchers have low walk rates and keep their BABIP down in ways that can't be measured using aggregated pitch data.

Once I am able to look at individual, rather than aggregated, pitch data from Pitch F/X, perhaps I'll have a better idea.

Fastball Velocity

As I've written before, average fastball velocity is the most predictive single feature for strikeout rates. This is true for all groups of pitchers.

Fastball velocity does not have a linear relationship with strikeout rates. Rather, as you can see from the graph here, differences in velocity matter a lot more on the high end (93-96 mph) than they do on the low end (86-89 mph). Therefore all three models split the data along high end/low end fastball velocities, thus creating separate rules for "hard throwing" and for "soft throwing" pitchers. This piecewise linear approach is a much cleaner way of handling nonlinearity than trying to fit a polynomial to the relationship between fastball velocity and strikeout rates.

Here are the weights for fastball velocity in the SO9 models. Remember that SO9 is measured in strikeouts per nine innings, while pitch velocities are measured in mph. So a weight of +0.50 means that an increase of 1 mph on the fastball will result in a SO9 prediction that is 0.5 strikeouts higher.

model:hard throwers:soft throwers:

The harder that a pitcher throws, the more he has to gain (strikeout-wise) by throwing even harder. I have never faced a 90 mph fastball, so I have no idea why hitters find it so much harder to make contact with a 92 mph fastball than with a 90 mph fastball. But they do.

Explaining the non-linearity might be easier. Pitchers on the low end of the scale (with average fastballs in the mid 80's) only stick around in the majors by doing lots of other things well. Those that don't somehow manage reasonable strikeout rates do not keep their jobs. Pitchers on the high end of the scale (those who flash 98 mph fastballs and who sit above 94 mph) are so rare that hitters won't be used to facing that kind of heat. So strikeout rates for those pitchers should be exceptionally high.

Remember Joba Chamberlain in 2007? His average fastball speed out of the pen was 97 mph (!) according to FanGraphs. In 2008, his average fastball speed fell to 95 mph, and yet he still maintained a strikeout rate of 10.6 K/9. However in 2009, his average fastball speed dropped to 92.5 mph. Still well above average, but no longer elite. His strikeout rate fell to 7.6 K/9, which is also well above average, but no longer spectacular. Say what you want about Joba's other pitches. It's his declining fastball speed that drove down his strikeout rate.

Lefties and the National League

Low strikeout rate has got you down? Want to increase your strikeout rate by up to 2.0 K/9 in two easy steps?

First, learn how to throw left handed. Second, sign with the Nationals and make sure that you get into their starting rotation. Or join any other National League team, for that matter. A lefty starter in the National League with an above-average fastball will average roughly 2.0 more K/9 than a right-handed American League starter with the same stuff, according to my model:

model: feature         hard throwers   soft throwers
starters: THROWS=L     +1.296          +0.355
relievers: THROWS=L    +0.939          +0.650
starters: LG           -0.424          -0.213
relievers: LG           0              -0.117

Since LG = [AL = +1; NL = -1], the -0.424 value means that switching from the AL to the NL will gain a hard throwing starter almost 0.8 K/9. Since the league change effect is almost entirely absent for relievers, I think that the increase in strikeout rate is due to NL starters facing the opposing pitcher. It seems that a starter in the NL can pad his strikeout rate simply by having a good enough fastball to make it hard for the opposing pitcher to make contact. This should be worth noting for AL teams that sign hard-throwing NL starters.
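The arithmetic behind those figures is worth spelling out, because the LG feature moves by two units (from +1 to -1) when a pitcher changes leagues. The sketch below uses the weights from the table above. The dictionary layout and function names are my own, and it assumes that the same rule applies to the pitcher before and after the move, so that the rule intercepts cancel.

```python
LG_AL, LG_NL = +1, -1   # league encoding used by the models

# Weights copied from the table above, keyed by model and then by rule.
LG_WEIGHT = {
    "starters": {"hard": -0.424, "soft": -0.213},
    "relievers": {"hard": 0.0, "soft": -0.117},
}
LEFTY_WEIGHT = {
    "starters": {"hard": 1.296, "soft": 0.355},
    "relievers": {"hard": 0.939, "soft": 0.650},
}


def league_switch_gain(role: str, group: str,
                       from_lg: int = LG_AL, to_lg: int = LG_NL) -> float:
    """Change in predicted K/9 from moving between leagues, all else equal."""
    return LG_WEIGHT[role][group] * (to_lg - from_lg)


def lefty_nl_vs_righty_al(role: str, group: str) -> float:
    """Predicted K/9 gap between an NL lefty and an AL righty with the same stuff."""
    return LEFTY_WEIGHT[role][group] + league_switch_gain(role, group)
```

For a hard throwing starter, `league_switch_gain("starters", "hard")` is -0.424 * (-1 - 1) = 0.848, and `lefty_nl_vs_righty_al("starters", "hard")` is 1.296 + 0.848 = 2.144. For soft throwing starters the same two numbers shrink to 0.426 and 0.781.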

If you look at Randy Johnson, Kevin Brown and Curt Schilling, all of these guys had their SO9 rates dip by about 2.0, when they made late-career moves to the AL East from the NL West. Two years later, Randy Johnson went back to the NL West, and his strikeout rate jumped right back up, despite a small drop off in fastball velocity. The other two pitchers retired.

I don't know offhand what the difference is between strikeouts in the AL and in the NL. But I'm sure that the difference is not 4.0 strikeouts per game. Whatever the difference is, it looks like it's being disproportionately made up for by the hard-throwing starting pitchers in the NL. Soft tossers like Barry Zito should not expect their strikeout rates to rise with a move to the Senior Circuit. However, Zito's left-handedness serves him well in both leagues.

(I am not implying that the three star pitchers' drop in SO9 rates proves that when hard-throwing pitchers move to the AL, their strikeout rates drop by 2.0 strikeouts per nine innings. My model suggests that the drop should be more like 0.8 strikeouts per nine innings. However even a drop like that would probably not be accounted for by simple translations of average strikeout rates between the NL and AL. That's all I was saying. This sounds like a worthwhile study. And a fairly simple one, also.)

Repertoire Depth

The models seem to suggest that having a deep repertoire is negatively correlated to strikeout rates. This is not entirely so.

Although all of the model weights associated with rep_offerings and rep_depth are negative or zero, these weights are assigned in a context where we can give positive weights to pitchers for individual pitches.

I also trained models using only the most important features (here is the model for starters):
  • FB velocity
  • handedness
  • league (AL vs NL)
  • rep_depth and rep_offerings
Such models showed that (for both starters and relievers) rep_depth is positively correlated with strikeout rates, while rep_offerings is negatively correlated with strikeout rates. Also, the absolute value of the weight for rep_depth is higher in all cases.

In other words, having multiple "core" pitches is predictive of a high strikeout rate, but throwing lots of pitches, without throwing them very often, is not.

Furthermore, a model with the five features above performs almost as well as a model with the full set of features. I will explore this deeper in a future post.

As for the current models, which include features about individual pitches, why are the rep_depth and rep_offerings weights always negative? I can't say for sure. But we can't simply ignore the fact that many pitchers who have had very high strikeout rates (Gagne, Lidge, Papelbon, Clemens, Randy Johnson) were all one-pitch or two-pitch pitchers. If the model sees many examples of pitchers with really high strikeout rates that use just two good pitches, that fact will be reflected in the feature weights.

Other Pitch Stats

As I mentioned above, using the most important five stats can get us a strikeout rate model that is within a few percentage points of models trained with the full set of features. That is not to say that the other features are meaningless. They have some additional predictive value, but most of the information these features offer is redundant. However they are interesting for descriptive purposes.

By allowing the model to consider information about a pitcher's curveball and change up, we should get better estimates about how significant his fastball speed really is. However it is difficult to tell anything conclusively about these "lesser" pitches' predictive power.

The model seems to suggest that hard-throwing relievers should use the change up as a core pitch. If they are going to throw sliders, they better be hard sliders. That sounds reasonable. However I would not take these numbers too seriously. The relevant weights are here. If you see something interesting that I missed, please let me know!

Odds & Ends

This post has already run long, so I'll just mention a few more observations, without comment:
  1. If we build a model for left-handed starters, we get a 0.75 correlation to the observed strikeout data.
  2. If we build a model for left-handed relievers, the correlation is 0.45.
  3. Giving the model features for "quality of opposition" does not help predict strikeout rates. Although I'm not sure how sensitive these features are to a pitcher's individual opposition, rather than his team's opposition as a whole.
  4. The model suggests that height doesn't matter, except that it hurts lefty relievers. Higher body weight is good for relievers, but bad for lefty starters. Higher age is bad for lefty starters, but good for lefty relievers. I doubt if any of these values are significant.