I decided to play around with pitching statistics from Pedro Martinez’s tenure with the Boston Red Sox. I went into this with no specific hypothesis, just an interest in analyzing the statistics of one of my favorite athletes (and human beings) and seeing if I could learn something new about his career with the Sox.
I imported individual game log data from Baseball-Reference (http://www.baseball-reference.com/players/m/martipe02.shtml) into GGobi and RStudio. My first impression was that this data was exhaustive. As a sports fan and a nerd in general, I am well aware of the extent to which statistics have penetrated player analysis in the past decade, but seeing the full breakdown in a spreadsheet was striking. There are 48 different variables recorded, from the obvious (Wins, ERA, Strikeouts) to the mundane (fly balls, ground balls, inning exited). At first I tried looking at everything to see if I could spot a trend, but I ended up keeping it simple and sticking with the basic counting statistics that everyone understands.
I began with RStudio, plotting the number of earned runs that Pedro gave up versus the number of pitches he threw in his previous start, as well as the number of strikeouts versus the number of days of rest he had had since his last start. I was interested to see if there was any correlation, as he had been plagued with durability concerns for most of his career. The plots were a little difficult to read: most of the data points were bunched up in a small area, and with these stats even small differences matter a great deal.
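For anyone who wants to try the same comparison in code, here is a rough sketch in Python. It is not the RStudio work described above. The field names (`pitches`, `earned_runs`) and the helper functions are hypothetical, and the handful of game lines are invented placeholders, not Pedro's real numbers. The idea is simply to pair each start's earned runs with the pitch count from the start before it, then compute a Pearson correlation on those pairs.

```python
from math import sqrt

# Placeholder game log in chronological order. These values are invented
# for illustration only; they are NOT taken from the Baseball-Reference data.
games = [
    {"pitches": 110, "earned_runs": 1},
    {"pitches": 95, "earned_runs": 3},
    {"pitches": 120, "earned_runs": 0},
    {"pitches": 101, "earned_runs": 4},
    {"pitches": 88, "earned_runs": 2},
]


def lagged_pairs(log, lag_field, outcome_field):
    """Pair a stat from the previous start with an outcome from the current one."""
    return [
        (prev[lag_field], cur[outcome_field])
        for prev, cur in zip(log, log[1:])
    ]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


pairs = lagged_pairs(games, "pitches", "earned_runs")
r = pearson(pairs)
print(f"{len(pairs)} paired starts, r = {r:.2f}")
```

The first start drops out because it has no previous outing to compare against. A single coefficient like this can back up (or contradict) what a crowded scatterplot seems to show, though with real data it would still be worth looking at the plot itself.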
I then imported the data into GGobi, as it’s a little bit easier to compare multiple sets of data at the same time. I started by graphing some of his stats (decision earned, earned runs, strikeouts, walks and innings pitched) on a game-by-game basis, but nothing other than his consistent excellence stood out. I then created a scatterplot comparing earned runs, strikeouts, walks and innings pitched to the number of pitches thrown in his previous start, to see whether the durability concerns were real. There didn’t seem to be much of a correlation either way, as the density of points was pretty consistent for all the statistics.
Finally, I compared the number of strikeouts he recorded to the decision (win, loss or no decision) that he earned. Again, it was difficult to take anything away from this scatterplot in terms of identifying a trend: with so few possible values on each axis, many games land on the same point, which makes it hard to tell how frequently each combination happened. However, it was interesting to note that looking at this particular scatterplot evoked strong memories. The outliers at the top of the chart were games that I could immediately identify, and they instantly reminded me of the feelings of excitement and frustration I had watching them. I realized that I was approaching this data set all wrong. Instead of trying to identify performance trends, I should have included some of my own personal data and tried to create a personal narrative, exploring the relationship between his performance and what was happening in my life.