Predicting Injuries in MLB Pitchers

I’ve made it halfway through bootcamp and finished my third and favorite project so far! Over the last few weeks we’ve been learning about SQL databases, classification models such as Logistic Regression and Support Vector Machines, and visualization tools such as Tableau, Bokeh, and Flask. I put these new skills to use over the past 2 weeks in my project to classify injured pitchers. This post will outline my process and analysis for this project. All of my code and project presentation slides can be found on my Github and my Flask app for this project can be found at mlb.kari.codes.

Challenge:

For this project, my challenge was to predict MLB pitcher injuries using binary classification. To do this, I gathered data from several sites, including Baseball-Reference.com and MLB.com for pitching stats by season, Spotrac.com for Disabled List data per season, and Kaggle for 2015–2018 pitch-by-pitch data. My goal was to use aggregated data from previous seasons to predict whether a pitcher would be injured in the following season. The requirements for this project were to store our data in a PostgreSQL database, to utilize classification models, and to visualize our data in a Flask app or create graphs in Tableau, Bokeh, or Plotly.
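To make the "previous seasons predict the following season" setup concrete, here is a minimal sketch of how the training examples could be assembled. The rows, the column choice (innings pitched), and the `build_examples` helper are all made up for illustration; the real project used many more features.

```python
from collections import defaultdict

# Hypothetical per-season rows: (pitcher, season, innings_pitched, injured_that_season)
season_rows = [
    ("pitcher_a", 2015, 180.0, False),
    ("pitcher_a", 2016, 200.0, False),
    ("pitcher_a", 2017, 90.0, True),
    ("pitcher_b", 2016, 60.0, False),
    ("pitcher_b", 2017, 65.0, False),
]


def build_examples(rows):
    """Pair stats aggregated over all earlier seasons with the injury flag of the next one."""
    by_pitcher = defaultdict(list)
    for pitcher, season, innings, injured in rows:
        by_pitcher[pitcher].append((season, innings, injured))

    examples = []
    for pitcher, seasons in by_pitcher.items():
        seasons.sort()  # chronological order
        # Start at index 1: a pitcher's first season has no history to aggregate.
        for i in range(1, len(seasons)):
            prior = seasons[:i]
            target_season, _, injured = seasons[i]
            examples.append({
                "pitcher": pitcher,
                "target_season": target_season,
                "prior_seasons": len(prior),
                "mean_prior_innings": sum(s[1] for s in prior) / len(prior),
                "injured_next": int(injured),  # binary label for the classifier
            })
    return examples


examples = build_examples(season_rows)
for row in examples:
    print(row)
```

Only seasons before the target year feed the features, so the label never leaks into its own inputs.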

Data Exploration:

I gathered data from the 2013–2018 seasons for over 1500 Major League Baseball pitchers. To get a feel for my data, I started by looking at features that were most intuitively predictive of injury and compared them in subsets of injured and healthy pitchers as follows:

I first looked at age, and while the mean age in both injured and healthy players was around 27, the data was skewed a bit differently in the two groups. The most common age in injured players was 29, while healthy players had a much lower mode at 25. Similarly, average pitching velocity in injured players was higher than in healthy players, as expected. The next feature I looked at was Tommy John surgery. This is a very common surgery in pitchers where a ligament in the arm gets torn and is replaced with a healthy tendon extracted from the arm or leg. I was assuming that pitchers with previous surgeries were more likely to get injured again, and the data supported this idea. A notable 30% of injured pitchers had a previous Tommy John surgery, while healthy pitchers were at about 17%.
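The group comparison described above boils down to a few summary statistics per subset. Here is a small sketch with made-up rows (the numbers are not from my dataset, and `summarize` is just an illustrative helper):

```python
from statistics import mean, mode

# Made-up rows: (age, avg_velocity_mph, had_tommy_john, injured)
pitchers = [
    (29, 93.5, True, True),
    (29, 94.1, False, True),
    (24, 92.8, False, True),
    (25, 91.0, False, False),
    (25, 92.2, True, False),
    (31, 90.5, False, False),
    (25, 91.7, False, False),
]


def summarize(rows):
    """Mean and mode of age, mean velocity, and share with a prior Tommy John surgery."""
    ages = [r[0] for r in rows]
    velocities = [r[1] for r in rows]
    return {
        "n": len(rows),
        "mean_age": mean(ages),
        "mode_age": mode(ages),
        "mean_velocity": mean(velocities),
        "tommy_john_share": sum(r[2] for r in rows) / len(rows),
    }


summary = {
    "injured": summarize([r for r in pitchers if r[3]]),
    "healthy": summarize([r for r in pitchers if not r[3]]),
}
for group, stats in summary.items():
    print(group, stats)
```

Looking at the mode alongside the mean is what surfaces the age difference here: two groups can share a similar mean while peaking at different ages.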

I then looked at average win-loss record in the two groups, which surprisingly was the feature with the highest correlation to injury in my dataset. The subset of injured pitchers was winning an average of 43% of games compared to 36% for healthy players. It makes sense that pitchers with more wins will get more playing time, which can lead to more injuries, as shown in the higher average innings pitched per game in injured players.
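For anyone curious how features can be ranked by their correlation to a yes/no injury label, here is a rough sketch using the Pearson coefficient written out by hand. The toy values and feature names are invented, and this is one reasonable way to do it rather than the exact code from my notebook.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up data: one entry per pitcher; 1 = injured, 0 = healthy.
injured = [1, 1, 1, 0, 0, 0]
features = {
    "win_pct": [0.50, 0.45, 0.40, 0.35, 0.30, 0.40],
    "age": [29, 25, 27, 26, 28, 27],
}

# Rank features by the absolute size of their correlation with the label.
ranked = sorted(
    ((name, pearson(values, injured)) for name, values in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: {r:+.3f}")
```

A high correlation only says the two move together in the data. As noted above, win percentage is probably standing in for playing time rather than causing injuries itself.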

The feature I was most interested in exploring for this project was a pitcher’s repertoire and whether certain pitches are more predictive of injury. Looking at feature correlations, I found that Sinker and Cutter pitches had the highest positive correlation to injury. I decided to explore these pitches in more depth and looked at the percentage of combined Sinker and Cutter pitches thrown by individual pitchers each year. I noticed a trend of injuries occurring in years where the sinker/cutter pitch percentages were at their highest. Below is a sample plot of 4 leading MLB pitchers with recent injuries. The red points on the plots represent years in which the players were injured. You can see that they often correspond with years in which the sinker/cutter percentages were at a peak for each of the pitchers.
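The sinker/cutter percentage itself is simple to compute from pitch counts. Here is a sketch for a single hypothetical pitcher; the counts are invented, and the pitch-type codes ("SI" for sinker, "FC" for cutter) are assumed labels that may differ from the ones in the pitch-by-pitch data.

```python
# Hypothetical pitch counts per season for one pitcher.
seasons = {
    2015: {"FF": 1500, "SI": 300, "FC": 200, "SL": 500},
    2016: {"FF": 1200, "SI": 600, "FC": 450, "SL": 250},
    2017: {"FF": 1400, "SI": 400, "FC": 200, "SL": 500},
}
injured_years = {2016}  # made-up injury history


def sinker_cutter_share(counts):
    """Fraction of all pitches in a season that were sinkers or cutters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts.get("SI", 0) + counts.get("FC", 0)) / total


shares = {year: sinker_cutter_share(counts) for year, counts in seasons.items()}
peak_year = max(shares, key=shares.get)

for year in sorted(shares):
    flag = " (injured)" if year in injured_years else ""
    print(f"{year}: {shares[year]:.0%}{flag}")
print("peak sinker/cutter year:", peak_year)
```

Plotting `shares` by year and coloring the years in `injured_years` red gives the kind of per-pitcher chart described above.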