Playing Numbers – Sports Analytics Content for those Interested in Playing the Numbers

Classifying MLB Hit Outcomes (Tue, 28 Jul 2020)

The post Classifying MLB Hit Outcomes appeared first on Playing Numbers.

In 2015, MLB introduced Statcast to all 30 stadiums. This system monitors player and ball movement and has provided a wealth of new information, in the process introducing many new terms to broadcasting parlance. Two specific terms, exit velocity and launch angle, have been used quite frequently since, with good reason – they’re very evocative of the action happening on the field.

Mike Trout Hitting Metrics
Statcast parameters used in a broadcast

Exit velocity is the speed of the ball off the bat; launch angle is the vertical angle off the bat (high values are popups, near-zero values are roughly horizontal, negative values are into the ground). When these started becoming more popular, I often found myself thinking, “how do I know if this is good or not?” Exit velocity is fairly easy to conceptualize, but launch angle is less transparent. This led me to plot these two variables using hit outcome as a figure of merit. The chart shown uses data from the 2018 season.

Hit outcomes by Launch Angle and Launch Speed
Hit outcomes for various Launch Angles and Exit Velocities
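As a sketch of how such a chart comes together (the rows here are hypothetical stand-ins; real Statcast data can be downloaded from Baseball Savant):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy rows standing in for 2018 Statcast batted-ball data.
hits = pd.DataFrame({
    "launch_speed": [65, 102, 90, 104, 70],   # exit velocity, mph
    "launch_angle": [45, -8, 12, 28, 60],     # degrees above horizontal
    "events": ["single", "single", "double", "home_run", "field_out"],
})

fig, ax = plt.subplots()
for outcome, grp in hits.groupby("events"):
    ax.scatter(grp["launch_speed"], grp["launch_angle"], label=outcome, s=12)
ax.set_xlabel("Exit velocity (mph)")
ax.set_ylabel("Launch angle (deg)")
ax.legend(title="Hit outcome")
fig.savefig("hit_outcomes.png")
```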

This plot held some interesting trends:

  • A “singles band” stretching from roughly 45 degrees at 65 mph to -10 degrees beyond 100 mph. The former represents a bloop single, the latter is a grounder that shoots past infielders, and this band as a whole encapsulates everything in between.
  • A considerable amount of stochastic singles at low launch speed. These can correspond to things like bunts against the shift, infield hits, etc.
  • A pocket for doubles with hard-hit balls (over ~85 mph), generally hit slightly above the horizontal, making it to the deep parts of the outfield.
  • A well-defined home run pocket for hard-hit balls between 12 and 50 degrees.

A couple years after making this plot, I was thinking about where I could employ clustering models, and it jumped back to mind. Over the last few months, I’ve worked extensively on modeling this data.

Model Selection

While I approached this problem with clustering in mind, it’s always good to investigate other possible models. For this, I considered a k-Nearest Neighbors (kNN) classifier, a Support Vector Classifier (SVC), and a Gradient Boosted Decision Tree (gBDT).

The previous plot looks at launch angle and exit velocity, but this neglects the third spatial dimension. To account for it, I also included spray angle in the model, which captures horizontal location on the field, then trained and evaluated each model to see which is most accurate:

Predicted hit outcomes for the various models evaluated with spray angle
Predictions and accuracy for various considered model types

Of those evaluated, the tree-based method performed the best out of the box (77% accuracy), so I chose the gBDT as the model of choice.
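A rough sketch of that comparison with scikit-learn, using synthetic data in place of the real Statcast features (model settings here are defaults, not the tuned ones from the post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for (exit velocity, launch angle, spray angle) -> hit outcome.
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "kNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "gBDT": GradientBoostingClassifier(random_state=0),
}
# Out-of-box test-set accuracy for each candidate model.
results = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(results)
```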

A more extensive discussion of my model selection process can be found here.

Extending this model

In developing this model, I wanted to approach it as I would if I were a team: use parameters that are known in advance, so I could use this model for player evaluation. The two most relevant parameters I looked at were player speed and differences in parks.

Player speeds were scraped from Baseball Savant. Since I’m evaluating on 2018 data, 2017 sprint speeds were assumed to be known and used in the model. For 2018 rookies, the mean sprint speed was imputed. By itself, adding sprint speed only marginally increased the accuracy of this model, but did provide some improvement for accurately discerning extra-base hits, where speed can be very important.
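The imputation step might look like this (player names and values are made up):

```python
import pandas as pd

# Hypothetical 2017 sprint speeds (ft/s) scraped from Baseball Savant;
# 2018 rookies have no prior-year value, so they get the league mean imputed.
speeds = pd.DataFrame({
    "player": ["veteran_a", "veteran_b", "rookie_c"],
    "sprint_speed_2017": [28.4, 26.1, None],
})
league_mean = speeds["sprint_speed_2017"].mean()
speeds["sprint_speed"] = speeds["sprint_speed_2017"].fillna(league_mean)
```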

To account for differences between parks, I used the park factor values Fangraphs maintains to parameterize which outcomes are more likely in each park. At the most granular level, they’re split by park type and handedness. When adding them to the model, I consider the handedness of the player I’m evaluating and add all of the associated values. I again use 2017 park factor values, assuming they were known in advance. These helped the model considerably, particularly in discriminating home runs and doubles, improving the accuracy of both.
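A sketch of attaching handedness-specific park factors as features (the park names, column names, and values here are hypothetical, not Fangraphs’ actual table layout):

```python
import pandas as pd

# Hypothetical handedness-split park factors (100 = league neutral).
park_factors = pd.DataFrame({
    "park": ["Coors Field", "Coors Field", "Oracle Park", "Oracle Park"],
    "stand": ["L", "R", "L", "R"],
    "pf_1b": [106, 105, 99, 98],
    "pf_hr": [112, 114, 88, 90],
})
batted_balls = pd.DataFrame({
    "park": ["Coors Field", "Oracle Park"],
    "stand": ["R", "L"],          # batter handedness
    "launch_speed": [101.0, 95.0],
})
# Join the factors matching each ball's park and batter handedness as features.
features = batted_balls.merge(park_factors, on=["park", "stand"], how="left")
```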

Last, inspired by a post about fly ball carry on Alan Nathan’s blog, I replaced the absolute spray angle with an adjusted spray angle, which simply flips the sign (from positive to negative, or vice versa) for left-handed batters. This shifts the information from a sheer horizontal coordinate to a push-vs-pull metric. Including it in the model rather than true spray angle helps quite a bit, especially with quantifying home runs.

Histograms for absolute and adjusted spray angles
Distribution of spray angle and adjusted spray angle for 2018 data
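Computing the adjusted spray angle is a one-liner; this sketch assumes the convention described above, where the sign is simply flipped for left-handed batters:

```python
import pandas as pd

# Toy batted balls: spray_angle is the raw horizontal coordinate,
# stand is batter handedness.
df = pd.DataFrame({
    "spray_angle": [-20.0, 15.0, -5.0],
    "stand": ["R", "L", "L"],
})
# Keep the value for righties; negate it for lefties.
df["adj_spray_angle"] = df["spray_angle"].where(df["stand"] == "R",
                                                -df["spray_angle"])
```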

Additionally, I did a hyperparameter sweep, adjusting parameters of the model itself. I found the optimal depth of the decision trees (think of a flowchart, splitting the data further as it gets deeper) and learning rate (how much the model adjusts to focus on previously misclassified data), which helped with singles and outs. Ultimately a depth of 5 and a learning rate of 0.3 were used.

Hyperparameter optimization for max depth (left) and learning rate (right)
Accuracy for models with various hyperparameters
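The sweep can be sketched with scikit-learn’s GridSearchCV (synthetic data, and GradientBoostingClassifier standing in for whichever gradient-boosting library the post actually used):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the batted-ball feature table.
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

# Cross-validated sweep over tree depth and learning rate.
grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 3, 5], "learning_rate": [0.1, 0.3]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```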

A full look at adding these parameters to the model and the hyperparameter optimization can be found in this blog post. The final model arrived at an accuracy of 79.14%.

Applying the Model

A common stat to evaluate offensive production is Weighted On Base Average (wOBA), which tries to encapsulate the idea that different offensive actions are worth different values, and weights them proportionally to the observed “run value” of that action. The weights are derived from data, using a method known as linear weights.

The problem with wOBA is that it is calculated based on outcomes, but there’s a level of uncertainty in those outcomes – variance due to things like weather or defense. As a result, wOBA provides a good description of things that have happened, but not underlying skill. By focusing on wOBA, we do what Annie Duke calls “resulting” in her book Thinking in Bets – fixating solely on the outcome, neglecting the quality of inputs.

This is the perfect opportunity to utilize the model – its output assigns a probability to each hit type for balls hit in play. When describing accuracy in previous sections, the most probable outcome was compared to the true outcome, but the model actually gives us the likelihood of every outcome. For example, it might say a line drive has a 30% chance of an out, a 40% chance of a single, a 20% chance of a double, a 10% chance of a triple, and, because it isn’t hit hard enough for a home run, a 0% chance of one.

We can put these likelihoods into the wOBA calculation to get a value based on the probability the model assigns possible outcomes, rather than only the result. To do so, the counts of each hit type in the wOBA calculation are replaced by the sum of probabilities for the respective hit type.
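A minimal sketch of that substitution for a single batted ball, using illustrative linear weights close to published 2018 values (the real calculation also includes walks, HBP, and a plate-appearance denominator, omitted here; treat the weights as assumptions, not the post’s exact coefficients):

```python
# Illustrative linear weights per hit type (approximate 2018-era values).
WEIGHTS = {"single": 0.88, "double": 1.25, "triple": 1.58, "home_run": 2.03}

def model_woba_numerator(outcome_probs):
    """Replace each hit-type count with the model's probability for that type."""
    return sum(WEIGHTS[h] * p for h, p in outcome_probs.items() if h in WEIGHTS)

# The line drive from the text: 30% out, 40% single, 20% double, 10% triple, 0% HR.
probs = {"out": 0.30, "single": 0.40, "double": 0.20,
         "triple": 0.10, "home_run": 0.0}
value = model_woba_numerator(probs)
```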

If you’re into baseball stats, this might sound very familiar – it’s similar to what xwOBA does. Expected Weighted On-base Average (xwOBA) performs this calculation but with a different approach. In MLB’s model, line drives and fly balls are modeled with a k-Nearest Neighbors model using only exit velocity and launch angle, while softly hit balls are modeled with Generalized Additive Models (GAMs) that also use sprint speed. To highlight some differences:

  • Their model at no point uses spray angle; my work showed spray angle helps accuracy significantly, and adjusted spray angle even more so.
  • My model uses sprint speed everywhere. Speed matters most for infield events, so directly encoding it the way MLB’s model does is almost certainly more helpful there, but I show it also helps with the accuracy of doubles, which fall almost entirely into the domain where MLB’s approach would not include speed.

In order to put my model-based wOBA on the same scale as true wOBA and xwOBA, I scaled the mean and standard deviation of the distribution to that of true wOBA:
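The rescaling itself is just a mean/standard-deviation match; a sketch with made-up distributions:

```python
import numpy as np

# Made-up distributions standing in for 2019 qualified hitters.
rng = np.random.default_rng(0)
true_woba = rng.normal(0.320, 0.030, size=200)
model_woba = rng.normal(0.300, 0.045, size=200)   # raw model-based values

# Standardize the model values, then give them true wOBA's mean and std.
scaled = (model_woba - model_woba.mean()) / model_woba.std()
scaled = scaled * true_woba.std() + true_woba.mean()
```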

Distribution of true wOBA, the wOBA using this model, and xwOBA

Now that they are comparable, I looked at which players are affected most, this time using 2019 qualified hitters. Looking at model-based wOBA vs. true wOBA:

Model-based wOBA calculation vs true wOBA

A full interactive version of this plot is available in this post. The grey line shown represents where wOBA predicted by the model matches true wOBA (y=x). Players above that line have a better model-based wOBA, players under that line have a better true wOBA.

For those players above the line, the model believes they have been unlucky: based on the hit kinematics, player speed, and park factors, their true outcomes are disproportionately worse with respect to wOBA. The 5 players with the biggest difference between model-based and true wOBA are shown in green. Left to right they are: Mallex Smith, Rougned Odor, Willy Adames, Dansby Swanson, and Dexter Fowler. The high point on the far right above the line is Mike Trout – even with a true wOBA of 0.436, the model believes he was unlucky!

By contrast, the players the model thinks have been the luckiest in their outcomes are shown in red. Left to right: Yuli Gurriel, Rafael Devers, Jeff McNeil, Ketel Marte, and Anthony Rendon.


In Scott Page’s The Model Thinker, he outlines seven possible uses for models under the acronym REDCAPE: Reason, Explain, Design, Communicate, Act, Predict, Explore. Machine learning models like this one are notoriously good at prediction at the cost of interpretability, and through those predictions the model can suggest actions. And while the opacity of machine learning hurts interpretability, developing the model has served to help explain as well.


This model can be used to predict hit outcomes using just hit kinematics, player speed, and factors of the park you expect to play in. Because of this, it can be used for prediction in several regimes. However, it’s important not to make out-of-sample predictions – using the model somewhere it wasn’t trained. This model was trained on MLB data, so its domain of applicability is only MLB-like scenarios. Due to differences in pitching skill, it won’t translate one-to-one to minor league (MiLB) data.

Where this model has clear predictive value is situations where MLB starters practice against batters. In that case, the level of the batters doesn’t matter – they could be MLB, MiLB, college, whatever – so long as the pitches they’re seeing are MLB-like. With no defense on the field, it might not be obvious how these practice hits would translate to real-world scenarios, but this model lets you predict how the practice would translate to a real game. The scenario that immediately comes to mind is spring training, where minor leaguers face MLB-caliber pitchers – this model would give clearer insight into what to expect from the minor leaguers in MLB game situations.


Elaborating on the previous section: being able to understand hit outcomes from players who aren’t necessarily playing in MLB games is a powerful tool. It lets you better evaluate your players and motivates actions such as when to promote a player.

Further, as shown in the model-based wOBA application above, if wOBA is being used to evaluate player value, this can provide a more granular version of it – one that removes “resulting” from the equation and factors in alternative outcome possibilities for the same hits. This would be very useful when evaluating trades: if you know a player has been particularly unlucky recently and the model predicts a higher wOBA than the true value, you might be able to get a cheaper trade.


In developing this model, there have been several interesting insights; I’ll wrap up with a few:

  • Triples are tough to predict. Like, really tough. – This is an obvious statement in passing, but it wasn’t until fighting with this model to get any reasonable triple accuracy that I saw just how dire the situation was. They require the perfect coalescing of correct stadium, quick batter, slow fielder, and a solid hit, which make them incredibly unpredictable.
  • Hit kinematics get you most of the way there in predicting the outcome of a hit – Beyond exit velocity, launch angle, and spray angle, further variables provided some improvements in accuracy, but nothing near the gain achieved by the initial three. This is important when considering things like the value of sprint speed – from an offensive perspective, it’s far secondary to a quality hit.
  • Focusing on true outcomes loses understanding of underlying skill by neglecting the quality of inputs – In a high-statistics regime, such as looking at all hits, there will be many cases where less likely outcomes ended up being the true result; even a 5% probable event sounds unlikely, but it still has a 1-in-20 chance of happening, which would occur a couple of times every game. Looking at possible outcomes rather than realized outcomes provides a better understanding of underlying talent.

I also encountered some model-building insights that aren’t necessarily insights into the game itself, but are still useful to keep in mind, especially for those building models:

  • Smart features are just as useful as smart models – Taking a step back and using informative features is a great way to make sure your model is doing the best it can – for example, switching from absolute to adjusted spray angle provided a good improvement in accuracy.
  • Simple questions can lead to useful projects – I made the launch angle vs launch speed plot quite some time ago, just because I was curious about how to interpret those parameters. That plot just sat in my head until I was thinking about projects I could use clustering on, which inspired this project.
  • Sometimes your gut model isn’t the right one – I approached this problem thinking it’d be a neat way to employ k-Nearest Neighbors. However, one of the first things I discovered is that tree-based methods do better than kNN – keep an open mind to alternative models, and test as many as you can.

Hopefully you enjoyed this deep dive into model building and application. For a deeper look as well as links to code, be sure to check out my corresponding blog posts:


Data Science in Sports (Talk at Northwestern University) (Wed, 22 Apr 2020)

This past weekend, I was honored to speak to almost 100 Kellogg MBA students about my work in sports analytics.

The post Data Science in Sports (Talk at Northwestern University) appeared first on Playing Numbers.



Jimmy Graham: A risk worth taking for the Chicago Bears? (Sun, 12 Apr 2020)

The post Jimmy Graham: A risk worth taking for the Chicago Bears? appeared first on Playing Numbers.

Bears fans got a lesson in regression to the mean last season. It may have been wishful thinking, but Bears fans were convinced that Matt Nagy’s team would continue on their upward trajectory. Barring any maladies at the kicker position, there was even talk of a shot at the Super Bowl. It didn’t turn out that way, as we know. Last off-season, there was one glaring problem; this off-season, there are many more. The Bears’ front office has gone about fixing them with a number of free-agency signings. Win-now mode has been activated.

So, with the Bears front office learning from their mistake of focusing too much on one position, and Bears fans wary of predicting anything into the future, we head into an uncertain offseason.

Mitch Trubisky continues to sway opinion. More people are leaning towards skepticism as his sample of games grows. The Bears’ decision-makers might also be leaning that way, and have added a weapon that might help him out when he drops back to throw. Jimmy Graham recently signed with the Bears on a two-year, $16 million deal, including $9 million guaranteed. The move has split opinion.

What do we know about Jimmy Graham?

Jimmy Graham is 33 and coming off a forgettable time in Green Bay. His production has been declining for some time, which is understandable for a player approaching his mid-30s. What we don’t know is how much of that is down to his fit in Green Bay and Seattle. Ryan Pace was part of the front office that helped the Saints draft him; that’s where Graham had his most productive years.

Graham says he wants to get back to himself in Chicago, which suggests all was not entirely well in Green Bay. He made two Pro Bowls in as many years with Seattle so the recent samples are inconclusive.

Jimmy Graham’s Yards via Pro-Football-Reference.

The Bears have committed to Mitch Trubisky. They say he is ‘their guy’, but actions speak louder than words. The move to bring in Nick Foles suggests there is uncertainty, and it will work one of two ways for the current starter: (1) he will respond to the competition, or (2) Foles will win the job. It’s a win-win for the Bears unless both of them are terrible, which is not outside the realm of possibility. Regardless of who is under center for the Bears next season, help is needed at tight end.

The Bears had five tight ends who played snaps last season, and they combined for less than 500 yards on 44 catches. Trey Burton might remain the number one guy, but his inability to remain healthy is an issue.

The year before Burton signed with the Bears was an outlier when it came to his health and availability. Jimmy Graham, for all the questions about his decline, stays on the field, even if his effectiveness once he is there is in question. Graham offers nothing as a blocker, but he is rarely injured, and that counts for something.

Jimmy Graham’s Air Yards and Yards After Catch

Graham’s production has waned, but one thing that remains constant is his ability to threaten defenses. Looking at his air yards and yards after catch, he is approaching elite territory with those figures.

The Bears are going about their business in an orderly and calm fashion. They are acting with purpose and a plan, or at least that is how it appears. The signing of Jimmy Graham at that price is better than faffing about in free agency later. They have filled a need at TE early instead of targeting unrealistic players before settling for a late-round, developmental draft pick.

Jimmy Graham is not what he once was when he scorched defenders in New Orleans, but he is a safe pair of hands and doesn’t tend to get injured. The Bears will bounce back next season to some degree after a forgettable 2019 and want to win now. Graham has one, maybe two years left in him, and he is thinking the same. This could be the perfect marriage.

The problem is that Graham was given a deal that brings question marks with it. The Bears have committed more cap room to tight ends than any other team, and it’s not even close.

Maybe that’s the way forward for the Bears in 2020. Mitch Trubisky did not have enough help at the position in 2019, and he struggled massively. Matt Nagy wants and needs him to get better at reading coverage, going through his progressions, and finding open receivers. Maybe having more mid-level threats to check down to will help that.

They have their running back situation locked down, their receiving corps looks to be in good shape and they have a competition at quarterback. Whoever wins the job will be left with no excuses that the offence wasn’t invested in. Whether it’s Trubisky or Foles who is starting under centre next season, the Bears have surrounded them with enough talent to make something happen. The defence should remain stout but the question is whether the passing offence can keep up.


Using ML to Understand Real Madrid’s Poor Last Decade in La Liga (Sun, 12 Apr 2020)

The post Using ML to Understand Real Madrid’s Poor Last decade in La Liga appeared first on Playing Numbers.

Using K-Means Clustering to analyze the types of teams Real Madrid and Barcelona drop points against in La Liga.

This is a shorter version of a longer paper, a link to the full paper is here. A link to the code and the datasets used is here. To see the final results of this paper, skip to the Conclusion.


Real Madrid’s success in the Champions League during the last decade is in stark contrast to their poor performance in La Liga. By using K-Means Clustering to categorize the teams the Real Madrid and their arch-rivals Barcelona have lost and drawn to in La Liga, we can get a clearer idea of where Madrid is making mistakes in this competition.

History of Real Madrid

Throughout the late 1990s and the 21st century, Real Madrid has been known as the club in world football that spends exorbitant amounts on the players it desires most, specifically attackers. Galácticos, as they became known globally, would often go on to be the offensive leaders of the team and the players Madrid’s attack flowed through. While there have certainly been legendary Galácticos in the past, perhaps none has been more impactful to the club in modern history than Cristiano Ronaldo. Signed in the summer of 2009 as part of the answer to arch-rival Barcelona’s dominance of Spain and Europe (they were the first team to win six trophies in one year), Ronaldo would stay at Madrid for almost a decade. The defining moments of the Ronaldo Era at Madrid would almost all come in the Champions League, as Madrid reached the semifinals of the competition 8 years in a row and won the entire thing 4 times in the span of 5 years – a run that can only be described as legendary.

Despite Madrid’s success in the Champions League, their struggles in domestic competition cannot be overlooked. In Ronaldo’s time at Madrid (nine seasons), Real Madrid was only able to win La Liga twice. In the same period, Barcelona won the league title six times and city rivals Atletico de Madrid won the title once. While the Ronaldo Era won’t be looked back on as unsuccessful, the question remains: how can a team so dominant in one competition be so lackluster in another?

I attempt to answer this question by using Machine Learning. I’ve created a dataset for Madrid and Barcelona that lists every single time they’ve lost or drawn against an opponent in La Liga from the 2009-10 season all the way until the 2017-18 season, which covers every season that Ronaldo was at Madrid. On each dataset, I’m going to use a Machine Learning technique called K-Means Clustering (K-Means or KMC for short) to find the average teams that Madrid and Barcelona are losing points to and see if there are any clear differences. K-Means Clustering will be done in a couple of different scenarios to get a better understanding of where exactly Madrid has been struggling and Barcelona has been succeeding (for example, dividing the dataset up by manager and doing KMC for each manager).


For each of the 2 teams, the dataset is made up of the following features:

  • Time of the season, which is represented by the matchday feature.
  • The amount of rest that each team has had since their last game, represented by the days_since_last_game features.
  • Whether the game was home or away, represented by the home_0_away_1 column.
  • The strength of the opposition, in the form of where they placed in the league that year (final_league_position), as well as the form of the opposition going into the match (elo_opp). While most sites list form as an important factor in determining which team is going to win, there hasn’t been much research into quantitatively determining a team’s form beyond looking at the last 5 or so results. The website is the only source I was able to find that does this: it lets users look at a team’s form (called elo, a point total a team accumulates from its results across decades), which gives us the information we need about the form of our opposition.
  • The form of Madrid and/or Barca, depending on which team we’re looking at (elo_madrid or elo_barca).
  • The difference in elo between the two teams (diff_elo).
  • And finally the betting odds of a Madrid/Barca win (odds_of_madrid_win or odds_of_barca_win).
  • The amount of points earned by the team in the game – this will be either a 0 for a loss (no points collected) or 1 for a draw (1 point collected).

Data collected after the 2014/15 season contains an extra feature called xg_diff, which is simply (the xG of Madrid/Barca in the specific match) minus (the xG of their opposition).

This dataset also only considers games where the league title hasn’t been won yet. The reasoning behind this is that it’s hard to gauge the team’s motivation since there’s nothing left to play for; Madrid and Barca tend to let their starters rest and be ready for The Champions League and Copa Del Rey when they can’t win the league, so this opens up the door for lineups with bench players, new formations, and a general lack of motivation to win.

K-Means Clustering

K-Means Clustering (KMC) is an unsupervised Machine Learning method that computes averages at multiple locations in numerical data. The k in KMC is the number of averages, or clusters, being computed; k is greater than 1 but less than the number of rows in your dataset. The basic steps of KMC are as follows:

  1. Pick k random points in your data to start the algorithm (for us, pick k rows of data). These points serve as our initial clusters C1, C2, C3, …, Ck.
  2. For each point Di in the data D, calculate the Euclidean distance between Di and each cluster C. The C with the lowest Euclidean distance is the cluster Di gets grouped into. For two-dimensional data, for example, we would calculate d = sqrt((x1 − x2)² + (y1 − y2)²), where (x1, y1) are the coordinates of one cluster and (x2, y2) are the coordinates of Di.
  3. After each Di has been grouped to a cluster, calculate the average of each cluster. If a cluster C1 has two points of two-dimensional data grouped to it, for example, then the coordinates of the new C1 would be ((x1 + x2)/2, (y1 + y2)/2).
  4. Repeat steps 2 and 3 until the Di in each C remain the same.

The benefit of using KMC as opposed to a normal average calculation is clear. For example, pretend we’re using KMC on the matchday feature for one season, and that season Madrid lost/drew against teams on matchdays 1, 2, 3, 4, 35, 36, 37, and 38. Calculating the average normally gives 19.5, which would lead us to believe Madrid dropped most of its points near the halfway point of the season – which we can obviously see is wrong. Using KMC with k = 2 gives a much better representation of the data: on average, Madrid lost points at the beginning of the season (one cluster would contain the points 1, 2, 3, 4) and the end of the season (the second cluster would contain the points 35, 36, 37, 38).
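The four steps above, applied to this matchday example, can be sketched as a minimal NumPy implementation (not the code used in the paper):

```python
import numpy as np

def kmeans(D, k, n_iter=100, seed=0):
    """Minimal K-Means following the four steps listed above."""
    rng = np.random.default_rng(seed)
    C = D[rng.choice(len(D), size=k, replace=False)]            # step 1: k random rows
    for _ in range(n_iter):
        # step 2: group each point with its nearest cluster (Euclidean distance)
        labels = np.argmin(np.linalg.norm(D[:, None] - C[None], axis=2), axis=1)
        # step 3: move each cluster to the mean of its grouped points
        new_C = np.array([D[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_C, C):                               # step 4: converged
            break
        C = new_C
    return C, labels

# The matchday example: losses/draws on matchdays 1-4 and 35-38.
matchdays = np.array([[1.], [2.], [3.], [4.], [35.], [36.], [37.], [38.]])
centers, labels = kmeans(matchdays, k=2)
```

With k = 2, the centers converge to 2.5 and 36.5 – the early-season and late-season averages, rather than the misleading overall mean of 19.5.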

For more information on KMC, take a look at this video from StatQuest with Josh Starmer.

One of the most important parts of KMC is deciding how big or small k should be. KMC is clearly a very powerful algorithm, but with the wrong value of k it becomes difficult to extract meaningful information from your data.

Fortunately, many methods have been developed to help determine the best k. For this analysis I’ll use Silhouette Analysis, which involves measuring 1) the distance between points within a given cluster (you want this value to be small, since it measures how similar the cluster’s points are to one another) and 2) the distance between clusters (you want this value to be large, since it measures how dissimilar each cluster is to the others).

More information on how these two values are calculated can be found here and I adapted the code from this tutorial to find the optimal k for my data.
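A sketch of that silhouette sweep with scikit-learn, on toy data standing in for the loss/draw feature tables (not the adapted tutorial code itself):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy stand-in: two well-separated blobs of 4-feature rows.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(6, 1, (30, 4))])

# Silhouette score for each candidate k in the 2-10 sweep.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
```

For data with two clean groups like this, the sweep picks k = 2; on the real tables the best k is whatever the silhouette peak says.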

The plan for the data is as follows:

  1. Input each data file that we’re examining, and run the silhouette analysis code on it for k = 2–10 to find the ideal k to use in KMC
  2. Run KMC to obtain the center of each cluster, and analyze these centers to learn more about what types of teams Madrid and Barcelona are losing to on average.

This is the data we’re looking at for each team:

  1. Losses and Draws from the 2009/10 season to the 2013/14 season (non-xG data)
  2. Losses and Draws from the 2014/15 season to the 2017/18 season (so that we can look at xG data as well)
  3. Parts 1 and 2 combined (the xg_diff column from 2 will be removed)
  4. Data organized by manager. For this category, the manager needs at least 2 seasons coaching either Real or Barca in this 9-season period, to make sure there are enough data points for meaningful information. For Real, such coaches are Mourinho, Ancelotti, and Zidane; for Barca, Guardiola and Enrique
  5. Title Winning Seasons for each team
  6. Title Losing Seasons for each team

To see my full breakdown of each category of data, check out the link here.


Real Madrid’s results in the Champions League in the Ronaldo Era certainly won’t be looked back on as a failure, but some questions will be asked about their sub-par league performance. Across the different managers who tried to change Madrid’s luck in the league, this analysis shows that Madrid typically fell into the same patterns over and over, which cost them the league title on more than one occasion. The biggest patterns included 1) dropping points to their direct competitors both home and away, 2) being unable to pull out results in away games against mid-table teams and lower, 3) creating enough offensive opportunities but not capitalizing on them (evidenced by their xg_diff being 1 in the first cluster of the 2014/15–2017/18 dataset), and 4) not being able to start their league campaigns with good form. Solving any one of these problems will benefit Madrid in the long run, as the smallest of margins tend to decide the title race. Securing even 3 or 4 more points a season can quite literally be the difference between 1st and 2nd, so in the future Real would do themselves a favor by fixing one of these 4 areas.

In addition to the data that I already collected, it would definitely have been beneficial to have access to more reliable data about injuries and the form of certain players, as this could’ve given us more information about why and how Madrid was losing certain games. I treated a majority of this analysis with the assumption that Madrid was playing a full-strength lineup, when in reality this wasn’t always the case; knowing which games were played at full strength and which weren’t would’ve definitely helped this analysis. In addition, having more information about how these games unfolded would’ve helped. Having more information about the lineup strength of the other team, how many passes they attempted in each third of the field, how much possession they had, how many key passes they made, etc. would give us a better idea of the  different styles of teams that have taken points from Madrid the most. Future work would definitely add this data, but only for the most recent seasons as this type of data is fairly new and wasn’t available even 10 years ago.

The post Using ML to Understand Real Madrid’s Poor Last decade in La Liga appeared first on Playing Numbers.

Using NCAA Stats to Predict NBA Draft Order Wed, 04 Mar 2020 15:08:38 +0000 Intro & Lit Review Predicting the NBA draft is always difficult. Should you draft a player on college statistics, NCAA tournament performance, combine results, potential, [...]

The post Using NCAA Stats to Predict NBA Draft Order appeared first on Playing Numbers.

Intro & Lit Review

Predicting the NBA draft is always difficult. Should you draft a player on college statistics, NCAA tournament performance, combine results, potential, or a combination of all of these? The goal of this project is to build a model to help the management of an NBA team decide how to draft players. The question of whom to draft has always been difficult to judge: some teams lean more on college performance and some more on potential. Level of competition also matters, as players at smaller colleges may post better stats simply because they face weaker opposition. Our dataset came from the official website of the NCAA (The Official Site of the NCAA, 2018). We used 2011-2014 as our training set and 2015 as our test set. A successful model would help management know whom to draft. The purpose of this study is to see whether the college stats of past draft picks can help our front office pick players for the upcoming draft.


We ran two different models. The first ignored all seasons except the one in which a player was drafted; for example, we would ignore the first three seasons of a college senior and focus solely on his stats for his final year, when he was drafted into the NBA. The assumption here is that only the final season of a player’s college career has any bearing on where he is drafted. We refer to this as “model A.” The second model combined the stats from each player’s whole college career and used his per-game statistics, testing whether a model based on his full body of work in college would better represent where he would be drafted. We refer to this as “model B.”

We created a tiered structure of NCAA basketball programs to give more draft stock to players at powerhouse schools such as Kentucky and Duke. Since our data covered player stats from 2011-2015, we reviewed all AP Top 25 polls over that time frame. In placing schools into tiers, we used the number of appearances in the top 25, average rank, highest rank, and number of appearances in the top 10 to identify the most successful basketball programs of that era. This resulted in 33 schools divided into four tiers; any school not specifically identified was placed into a fifth tier.

School Tiers
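The tier lookup described above can be sketched as a simple mapping with a default. Kentucky and Duke are the only placements named in the text; every other school and tier below is an illustrative stand-in, not our actual tier list.

```python
# Illustrative sketch of the five-tier lookup. Kentucky and Duke are named
# above as powerhouses; the other placements here are hypothetical examples.
SCHOOL_TIERS = {
    "Kentucky": 1, "Duke": 1,
    "Kansas": 2, "Arizona": 2,
    "Villanova": 3, "Gonzaga": 3,
    "Wichita State": 4,
}

def school_tier(school: str) -> int:
    """Return a program's tier; schools not identified fall into tier 5."""
    return SCHOOL_TIERS.get(school, 5)
```

The default of 5 implements the catch-all fifth tier for the schools not among the 33 identified.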

A series of box plots was created to determine whether the data needed to be further parsed into subsets. We started by examining box plots of statistics by position for each of the two models we were building (samples below).

It comes as no surprise that many statistics differed sharply by position: centers record more blocks and rebounds than guards, while guards record more assists and three-pointers than centers. These expected separations suggest that certain variables should be weighted more or less heavily depending on the position played, so the predictive models were built per position. The original data identified only the three general positions of Guard, Forward, and Center. The next step in exploring the data was to look at correlations of each statistic with draft position. In case those correlations were weak, we also looked at correlations with being a lottery pick (selected in the top 14), for which we created a new variable, and with the round in which a player was taken.

We built separate correlation matrices for Guards, Forwards, and Centers for each of the two models. Based on common basketball knowledge, we expected certain statistics to correlate more strongly with draft position for certain positions; the matrices, however, did not support that idea. We therefore split the players into the five specific positions, breaking guards into point guards and shooting guards and separating small forwards from power forwards. We assumed the shorter half of the guards were point guards and the taller half shooting guards, and used the same logic to separate small and power forwards. Correlation matrices were then created for each of the five positions, represented in the figure below.

We were then ready to start the variable selection process for modeling. We created histograms of each statistic to examine its distribution and found many to be skewed. For each variable we therefore created a log-transformed and a square-root-transformed version, and selected whichever of the three (raw, log, or square root) had the most normal distribution. Once the variables were chosen based on correlations and histograms, we built linear regression models with each variable as a predictor of draft position.
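A minimal sketch of that selection rule, judging “most normal” by absolute skewness (one reasonable proxy; the original analysis may have compared histograms instead). `log1p` is used so zero-valued stats do not break the log:

```python
import numpy as np
from scipy.stats import skew

def best_transform(x):
    """Pick whichever of the raw, log, or square-root versions of a
    non-negative statistic has the least-skewed (most normal) distribution."""
    x = np.asarray(x, dtype=float)
    candidates = {"raw": x, "log": np.log1p(x), "sqrt": np.sqrt(x)}
    return min(candidates, key=lambda k: abs(skew(candidates[k])))
```

For a heavily right-skewed counting stat, this typically selects the log version.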

VIFs were calculated to check for multicollinearity; we took a VIF below three to mean no problematic multicollinearity existed. RMSE was calculated for each model. We also required every variable in a model to be statistically significant with a p-value below 0.05. Often the initial models contained numerous variables with p-values above 0.05, subjecting them to removal. We then ran a stepwise automated selection process to decide which variables to keep and which to discard; this tunes the regression toward the combination of variables with the most statistical significance and the lowest AIC. If it threw out any variables we felt absolutely must be included, we built a second regression model from the stepwise variables and added those back in. RMSE was calculated for every model, and models with much higher RMSEs were discarded. Upon completing this process for each position, we cross-validated the models against average (null) models to make sure our RMSEs beat theirs. Finally, we applied the models to the test set and compared the predictions to the actual results.
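The VIF screen can be sketched with statsmodels; the threshold of three is the one stated above, and the helper below is an illustration rather than our actual code:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF for each predictor column; values under ~3 were taken to mean
    no problematic multicollinearity."""
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.Series(vifs, index=X.columns)
```

Two nearly collinear predictors will both show large VIFs, while an unrelated predictor stays near 1.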


The first model, based on the player’s final year of college, came out with good results for the training years. All five position models produced a much lower RMSE than the null model. However, when we ran these models on the test set, the results were not as good.

Let’s look at the top two and bottom two players model A expected to be drafted and when they were actually drafted. Chris McCullough was predicted to go first among them and was drafted 29th overall. The second-highest projected pick, Jordan Mickey, was not drafted until the second round. Delon Wright was projected last and was drafted 20th. Branden Dawson was predicted more accurately: he was projected to go near the end and was drafted 56th. The second model, based on career averages, also produced good results on the training set but weaker results on the test set. Cameron Payne had the best projection and was drafted 14th, a reasonable result. Dakari Johnson was second best and was not drafted until 48th. The two lowest projected picks were fairly accurate, as Branden Dawson and Sir’Dominic Pointer were picked late at 56th and 53rd respectively. In model B, 41 percent of the predictions were within 6 picks of the actual results, while only 26 percent met this mark for model A. However, 48 percent of the predictions in both models were within 10 picks of the actual draft results. Comparing the two models, no particular positions, schools, or draft classes were predicted better by one model than the other, which eliminated the potential to use a set of hybrid models.
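The within-N-picks accuracy quoted here is straightforward to compute; a small helper (hypothetical, not the code we used) might look like:

```python
def pct_within(predicted, actual, n):
    """Percent of draft-slot predictions landing within n picks of the
    actual slot (the metric behind the 41%/26% and 48% figures above)."""
    pairs = list(zip(predicted, actual))
    hits = sum(abs(p - a) <= n for p, a in pairs)
    return 100.0 * hits / len(pairs)
```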


In the end, both models produced good results on the training years while the test-year predictions were hit or miss. Part of the reason is that many prospects are drafted more on potential than on college production. Some freshmen post mediocre college numbers but get drafted high because of how much undeveloped talent NBA teams believe they have. Accordingly, we saw the models give much more weight to class/age: freshmen were predicted to be drafted highest, with each subsequent class predicted lower. Another reason for the mediocre test results may be that we tested on just one draft year; with more draft years to test on, the results could have been different, for better or worse. Other factors that could impact draft order for NCAA players include NBA combine results and performance in the NCAA tournament.


Beyond The Arch: A Closer Look at Post-Up Bigs Mon, 02 Mar 2020 22:53:21 +0000 Beyond The Arch is a series of articles where I use K Means Clustering to better understand how players are used on offense in the [...]

The post Beyond The Arch: A Closer Look at Post-Up Bigs appeared first on Playing Numbers.

Beyond The Arch is a series of articles where I use K Means Clustering to better understand how players are used on offense in the modern NBA. With six new offensive archetypes we explore many questions about how modern-day NBA offenses operate. You can find the very first article, with an in-depth explanation of the model, here.

In my previous article I took a deeper look at Spot-Up Bigs, the modern take on the big man in the NBA. In this article I am going to turn back the clock and take a closer look at some of the more vintage big men. It is a misconception that post-up bigs cannot succeed in today’s game: there were seven Post-Up Bigs with a weighted Plus-Minus Rating (wPMR) of 90% or better in 2018. That is only two fewer than Balanced Playmakers and dwarfs Spot-Up Bigs, who had just one such player. The issue is that if you are not one of the best, it gets ugly quickly. Let’s take a look at the archetype as a whole and what makes the more effective players successful.

Post-Up Bigs by the Numbers

Average play type distribution by archetype

Not surprisingly, Post-Up Bigs post up 27% of the time on average, more than all other archetypes combined. Most of the rest of their offense comes from PnR as the roll man, cuts (for bigs, usually out of the dunker spot), and put-backs, all typical “big man” things. It is also worth noting that Post-Up Bigs isolate 6% of the time, tied for second most and the highest non-playmaker mark. This will make sense as we look at the types of players in this archetype and how skilled they are compared to secondary offensive pieces.
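The archetypes themselves come from the clustering model introduced in the first article of the series. A bare-bones sketch of that kind of pipeline, with synthetic play-type shares standing in for the real possession data, could look like:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is one player's share of possessions by play type. The seven
# columns here are stand-ins (e.g. post-up, PnR roll man, cut, put-back,
# spot-up, iso, other); the dirichlet draw just makes each row sum to 1.
rng = np.random.default_rng(42)
shares = rng.dirichlet(np.ones(7), size=120)

model = KMeans(n_clusters=6, n_init=10, random_state=0).fit(shares)
archetypes = model.labels_         # cluster id per player
profiles = model.cluster_centers_  # average play-type mix per archetype
```

The cluster centers are exactly the “average play type distribution by archetype” charts shown in these articles.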

Post-Up Bigs on the Decline

Year over year trend for number of post-up bigs.

In the chart above we can see firsthand the direction the NBA is going: there were only 30 Post-Up Bigs in 2018, down from 44 in 2015. As the NBA goes smaller, there is very little room for these players. If you are not one of the elite options out of the post, there is a good chance you will be replaced in the rotation by a more versatile wing who can space the floor and accentuate the strengths of the offensive Playmaker.

Post-Up Bigs Player Profiles

Now that we understand the DNA of a Post-Up Big, we can take a closer look at three different players who fall into this group. As you will see, even within the archetypes players can differentiate themselves by how they choose to use their offensive possessions.

Nikola Jokic’s possessions from the 2018–2019 NBA season

Nikola Jokic may be one of the best offensive big men not just in the league today but ever. His passing is matched by very few players in the game regardless of size or position. At 28% of possessions, post-ups account for the plurality of his offense, and he is very effective there at 1.03 points per possession (PPP). What makes him unique is his ability to be used as a spot-up or off-screen threat in small servings as well. I would not be surprised to see him used more off screens, as his 1.14 PPP puts him in the 89th percentile; just the idea of a fellow big man having to chase him around the court off screens makes me tired. Possibly hinting at his lack of raw athleticism, Jokic ranks in only the 39th percentile as a roll man despite 16% of his possessions being used this way. These numbers may underestimate his impact in this part of the game, since they do not factor in his ability to find open shooters out of the PnR.

Anthony Davis’ possessions from the 2018–2019 NBA season

Some people may be surprised that AD is not a “playmaker,” but post-ups accounted for more of his offensive possessions than any other play type. His 0.97 PPP there, which puts him in the 62nd percentile, is strong but not the spectacular number you might expect from one of the five best players in the league. To paint the full picture, on offense at least, we need to dig a bit further. Davis had six play types that each used at least 11% of his possessions, and all but one came with efficiency in the 59th percentile or better. The lone play type below that mark was PnR Roll Man (39th percentile), which may say more about the ball handlers getting him the ball than about his own skill set. Take a player with AD’s physical tools, Defensive Player of the Year talent, and wide range of ways to attack offensively, and you can see why Davis is an unquestioned superstar in the league.

Blake Griffin’s possessions from the 2018–2019 NBA season

Blake Griffin spent 28% of his possessions on post-ups, albeit at league-average efficiency. What is unique about Griffin is that he spends the rest of his possessions doing non-big things. At 0.99 PPP, Blake is one of the most efficient PnR ball handlers, his second most frequent play type at 20%. He also isolates on 14% of his possessions with 73rd-percentile efficiency, further showing his ability as a playmaker. As if that were not well-rounded enough, his only other play type above 10% is spot-ups, where he clocks in with 84th-percentile efficiency. Blake Griffin is truly one of the most versatile big men in the league, and you can’t help but wonder whether injuries kept him from truly reaching his ceiling.

As you can see, Post-Up Bigs are not dead in this league, but to be successful you need other wrinkles in your game. Whether you are an elite passer and shooter like Jokic, a two-way superstar like AD, or a jack-of-all-trades like Blake, turning your back to the basket and powering through the opposing defense is not enough in the modern NBA. In my final introductory article of the Beyond The Arch series I will look at the Rim Runner archetype to better understand those players.


Did Cheating Really Help the Astros Win? Mon, 24 Feb 2020 21:11:22 +0000 At this point, no one is denying that the Astros cheated in the 2017 season. I wanted to find out how much it actually helped them win.

The post Did Cheating Really Help the Astros Win? appeared first on Playing Numbers.

If you aren’t familiar, the Houston Astros cheating scandal is lighting the baseball world aflame. They have been accused of stealing pitch calls and relaying them via trash-can bangs during the 2017 season, the same season they won the World Series. More recently, they have been accused of taping buzzers to their chests to relay the same information. At this point, no one is really denying that the cheating happened. With this in mind, I wanted to evaluate how much the cheating contributed to their winning in 2017.



The Data

Recently, I came across a site where Tony Adams painstakingly watched every home game from the 2017 season and tracked the number of trash-can bangs he heard. This is a great dataset, and I wanted to use it for an analysis.

For each home game, Tony tracks the number of bangs and the score. I also appended hits data and by-inning data to make this analysis more robust, writing a simple scraper to get the game box scores from Baseball Reference.

When I did this analysis, I was not aware that bangs were available by at-bat, so I used the per-game aggregates. I will do a part 2 of this analysis after I analyze the by-player / by-inning data.

Correlation analysis

First, I wanted to do a high-level check for a relationship between bangs and runs or hits. For both of these variables, the correlation was extremely low (~0.14). This was not exactly a promising start to the research.

As you can see in the scatterplot, there is virtually no relationship between the number of bangs and the number of runs.
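The correlation check is a one-liner in pandas; the sketch below uses made-up per-game aggregates in place of the real dataset (the column names are my assumption):

```python
import pandas as pd

# Stand-in for the real per-game data: bangs heard, runs scored, hits.
games = pd.DataFrame({
    "bangs": [0, 5, 12, 3, 20, 9, 15, 1],
    "runs":  [2, 7,  1, 4,  6, 3,  5, 8],
    "hits":  [6, 11, 5, 9, 10, 7,  9, 12],
})
r = games["bangs"].corr(games["runs"])  # Pearson correlation
print(r)
```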

Linear Regression Analysis

I still wanted to see whether bangs were a significant predictor of runs even though the correlation was negligible. A linear regression is the most practical tool to answer this question.

Not surprisingly, the regression results mirrored the correlation analysis. Bangs were not a significant predictor of runs, and bangs explained about 2% of the variance in runs (R-squared = 0.022).

Linear Regression Results

Logistic Regression Analysis

In theory, it is possible that sign stealing could help a team win without contributing directly to hits or runs. I ran a logistic regression to test the relationship between bangs and wins.

Again, bangs were not a significant predictor of wins.

Logistic Regression Results

A New Hypothesis

I was stumped. This had to go deeper than what my preliminary models were telling me. I decided to look into the number of bangs in wins and in losses. As it turns out, the Astros banged on average 22.2 times in losses and 16.8 times in wins.
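The wins-versus-losses split is a one-line groupby; the numbers below are illustrative, not the real 22.2 / 16.8:

```python
import pandas as pd

games = pd.DataFrame({
    "won":   [1, 1, 0, 0, 1, 0],
    "bangs": [10, 14, 25, 20, 12, 22],
})
avg_bangs = games.groupby("won")["bangs"].mean()  # index 0 = losses, 1 = wins
print(avg_bangs)
```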

This led me to a new hypothesis: maybe the Astros primarily resorted to cheating when they were behind.

Testing the “cheating when behind” theory

I tested this by looking at how many bangs there were when the Astros were behind early. The graph below shows a huge spike in the number of bangs when the Astros are losing in the early innings.

We see a large spike when the Astros are losing early

The Astros still banged on cans in games they went on to win, but far less often than in games they lost.

Next, I looked at how the Astros performed when coming from behind. If they were cheating in those circumstances, we would expect them to outperform the average team. Sure enough, the Astros’ win percentage when coming from behind was outrageously high.

Astros Win % when losing

It could be that they were just a generationally good team, so I compared their performance against league-wide aggregates in other situations to test this.

I also looked at how the team performed when they were ahead. Here you see the exact opposite trend.

Astros won fewer games than expected when ahead

You can see that the Astros won far fewer games than would be expected when they were leading throughout the game.

I find this to be quite an anomaly: they outperform the average team when coming from behind and banging more, but underperform the average team when ahead and banging less.

Final Thoughts

On the surface, it looks like cheating didn’t greatly impact the team’s ability to win. However, we can clearly see very strange trends when peeling back the layers. This article is the tip of the iceberg when it comes to analyzing the cheating scandal, and I hope this perspective added a layer to your understanding of the mechanisms at play.

In part 2 of this analysis, I will go through the by-inning data from the same dataset. I hope to quantify how much each bang contributed to hits, getting on base, and runs.


Beyond The Arch: A Closer Look at Spot-Up Bigs Fri, 21 Feb 2020 14:20:18 +0000 Beyond The Arch is a series of articles where I use K Means Clustering to better understand how players are used on offense in the [...]

The post Beyond The Arch: A Closer Look at Spot-Up Bigs appeared first on Playing Numbers.

Beyond The Arch is a series of articles where I use K Means Clustering to better understand how players are used on offense in the modern NBA. With six new offensive archetypes we explore many questions about how modern-day NBA offenses operate. You can find the very first article, with an in-depth explanation of the model, here.

The modern NBA is quickly moving away from the traditional post-bound big man. To survive in today’s game, an old-school big must adapt, and one way to do that is to expand his game out past the three-point line. In this article I will take a closer look at the Spot-Up Big archetype. Interestingly enough, in my last article I analyzed Brook Lopez’s transition from Post-Up Big in Brooklyn and LA to effective Spot-Up Big with the Milwaukee Bucks; you can find that article here.

Spot-Up Bigs by the Numbers

Average play type distribution by archetype

Spot-Up Bigs are the second least common archetype over the last four years, behind only Post-Up Bigs. On average these players use 31% of their possessions on spot-ups. While a healthy 17% of their possessions come as the pick-and-roll (PnR) big man, I would imagine many of these are of the “pick and pop” variety compared to their rim-running and post-up counterparts. Intuitively, they clean up teammates’ misses at the basket via the put-back less than other big men because of how much time they spend on the perimeter.

Slow and Steady Growth

Year over year trend for number of spot up bigs.

After a drastic drop from 2015 to 2016, the number of spot-up bigs has grown slowly over the last three years. As the league moves more and more toward playmakers and versatile wings, the minutes available to bigs have dwindled. Compounding the issue, very few bigs in the NBA talent pool can shoot efficiently and hold up in the post defensively. Because these players are rare, they are quite valuable to the teams lucky enough to have them.

Spot Up Bigs Player Profiles

Now that we understand the DNA of a Spot-Up Big, we can take a closer look at three different players who fall into this group. As you will see, even within the archetypes players can differentiate themselves by how they choose to use their offensive possessions.

John Collins’ possessions from the 2018–2019 NBA season

Collins was extremely effective for the Hawks last season, coming in with a weighted Plus-Minus Rating (wPMR) of 93%, tops among Spot-Up Bigs. He scored 1.25 points per PnR Roll Man possession, which was in the 80th percentile. Pair that with league-average spot-up shooting and Atlanta had an offensive piece they could use in a variety of ways. Collins had six different play types with more than 100 possessions, all with good efficiency, which means he should be able to contribute within several archetypes once he fully develops.

Kevin Love’s possessions from the 2018–2019 NBA season

Kevin Love is one of the first names I think of when I hear Spot-Up Big. He was a major part of the Cavs’ title team, largely thanks to his ability to knock down the outside shot. As you can see above, even in the post-LeBron era he was lethal as a spot-up threat: at 1.14 points per possession he ranked in the 84th percentile. Despite a strong wPMR of 83%, it appears Love could improve his offensive impact by lowering his post-up frequency, where he scored with only 28th-percentile efficiency yet spent over 20% of his possessions. Transition and off-screen shooting are the only other play types making up over 10% of his possessions, meaning the Cavs sparingly used him like a traditional big in the pick-and-roll or in the dunker spot (which more often than not shows up as a “Cut” possession). Kevin Love’s best basketball is almost certainly behind him, but he will forever be a poster boy for this archetype.

Pascal Siakam’s possessions from the 2018–2019 NBA season

Siakam graded out as a Spot-Up Big last year, but his numbers hinted at a much higher offensive ceiling. His most used play types were transition and spot-ups, where he ranked in the 80th and 86th percentile respectively in points per possession (PPP). Many spot-up players are effective in transition as well, thanks to their ability to trail the break and find an open three while the defense is scrambling. It is Pascal’s next two play types that hinted at his ability to fill in admirably for the departed Kawhi this season: he was very effective at 1.08 PPP on post-ups and 0.97 PPP in isolation, and he was elite as a PnR ball handler, albeit on a much smaller sample. This leads me to believe that if we ran this analysis for the 2019 season we would see Siakam as a playmaker for the new-look Raptors. Pascal is a great example of how the most elite players can shape their game to whatever their team needs to win. It also shows that spot-up specialists can evolve into true playmakers with enough skill and hard work.

Spot-Up Bigs are a fascinating group to analyze given the current state of the NBA and how rare it is to find a big man who can stretch the floor and do enough else well to provide positive value. The league is always evolving, and sometimes it takes time for players to catch on and adjust their skill-development focus at a young age. It will be interesting to see whether future generations bring more big men working on their outside game, inspired by guys like Kevin Love and Dirk. In my next article I will take a closer look at the true old-school giants of the NBA, the Post-Up Big.


A Firsthand Account Interviewing at MLB Front Offices Thu, 20 Feb 2020 22:21:18 +0000 Over the last few months, I’ve applied for several jobs in MLB front offices. I wanted to recount my experience so others could see what [...]

The post A Firsthand Account Interviewing at MLB Front Offices appeared first on Playing Numbers.

Over the last few months, I’ve applied for several jobs in MLB front offices. I wanted to recount my experience so others can see what this process looks like: the work required in the application process, the timeline, the people you interact with, and so on. For context, I’m a Ph.D. student in particle physics, currently writing my dissertation and looking for the next step afterward.

Most of my research is data analysis on large datasets, and I do baseball analysis in my spare time, often with methodology similar to my research, so applying for these jobs felt like a logical progression of my career, and one where I have experience. I’m purposefully omitting the specifics of questionnaire questions, as well as names; I hope this account is interesting nonetheless.

Tampa Bay Rays

Baseball R&D Analyst

Admittedly, I applied here well before I should have started applying for roles, but I saw this listing go up last summer and was interested. I submitted an application through a site called TeamWork on June 23, and the next day I received notice that I had passed through to the second round of their process. This was a timed questionnaire, with no hard cap on the time but a recommendation of about 90 minutes; it took me around 2 hours. It was about 4 pages of SAT-style questions, primarily based on pattern recognition, followed by a page of 5 longer-form questions: 4 were stats-based (one with a closed-form solution and 3 that I solved by simulating in Python), and the last was a straightforward physics projectile-motion problem.

I passed this round and continued to the third round of their process, a week-long data project involving a projection system based on inputs from 2 systems measuring the same parameters. It also incorporated “messy” data and how one would approach it. The suggested time was about 4 hours; I ended up spending close to three times that. On returning the project I was also asked when I expected to graduate (though it was listed on my resume, it may have been overlooked). After this step, they elected to continue with other candidates.

St. Louis Cardinals

Senior Data Scientist

I applied for the Senior Data Scientist role with the St. Louis Cardinals at the end of September. I sent in my resume and received a response about a week and a half later, moving on to the next round. They asked for responses to 5 questions, each limited to 300 words or less, mixing assessments of my data analysis and modeling background with my baseball knowledge. I was given a week to work on them and submitted them 4 days later, on October 14.

On October 24th I was notified that I had made the next round of the process, a phone interview. This was a 1-hour call on November 1 with their Director of Analytics, Senior Director of Baseball Development, and Project Director of Baseball Development. Much of this call I spent talking about physics: the experiment I work on, the data challenges I’ve faced, and so on. The next day I got a call from the person conducting the hiring process, who asked me to come in for an in-person interview, which I scheduled for the following week, November 8th.

I drove to St. Louis the night before and stayed with my parents, who live in the area. My interview started around 10 the next morning by touching base with the person conducting the hiring process, the Baseball Development Project Director. That took a half-hour, followed by an hour-and-a-half block with the Baseball Development department, including the same people from the phone interview. They put me in front of a whiteboard and talked through some baseball topics, asking me to draw how I might expect certain variables to look, or how I might model a distribution. There were also a few lines of whiteboard coding, explaining how I would make a plot in my language of choice.

After this, I met with some Assistant GMs, one primarily involved with international operations, and one who serves as a director of scouting. The latter gave me quite a hard time about wanting to pivot my career out of physics, and was surprised to find that many of my family and colleagues supported my interest in applying to baseball jobs. Afterward, I met with a pair of player development managers, and then a pair of baseball analysts. Both of these meetings were pretty calm and just more informal “get to know you” type interviews. Last, the day ended with another chat with their Director of Baseball Analytics. Interestingly, he also came from a physics background so this was a nice chat, to sort of understand his transition as well. Unfortunately, this was the week of the GM meetings, so both the President of Baseball Ops and GM were out.

On leaving, I was told I should hear back in a few days. I was called about a week and a half later, on November 18th with news that they went with another candidate.

Cleveland Indians

Data Scientist

I sent in an application for a Data Scientist position with the Cleveland Indians on October 31st, and I didn't hear anything back for quite some time… sort of. Several weeks later I was browsing my spam folder and noticed that they had sent a questionnaire, and it had a one-week deadline that had already passed. Fortunately, I wasn't the only person this happened to – I received a follow-up on November 17th that mentioned this was a widespread issue, and I could continue if I was interested. I completed it that day. It had 3 questions, focused on projects I'm interested in and experience I've gained working on past projects.

On the 22nd I received an email asking to move forward in the process with a phone interview, which I scheduled for later that day. The person I spoke to on this call had a cursory knowledge of the experiment I work on, and we talked about data challenges I've worked on and models I've built, as well as my background and my interest in baseball. It ran a half-hour. Following this, on December 3rd I was asked for a second phone interview, which we scheduled for the following day. This was with 3 people on their baseball R&D team. I don't remember all the details of what we discussed, but it wasn't all that dissimilar from the previous phone interviews, mostly getting to know my background and experience.

On December 9th, they followed up with a second, longer questionnaire. This one had 9 questions split into 3 categories: Baseball Valuation, Math and Research, and Analytical Questions. In each category, I was told to select 2 to answer, for a total of 6 responses, each "no more than a page." I ended up scraping data and putting together code for 2 of these problems, both in the Baseball Valuation section. One of the Math and Research questions involved reading an article and commenting on it. The remainder were more conceptual answers. Admittedly, I gave away many ideas that I would have preferred to keep for personal publication, but with this being the end of the hiring cycle for baseball, I figured there wasn't much to lose by doing so. I spent quite some time finalizing these responses and submitted a 6-page response on December 12.

On December 15th I spoke with the person coordinating the job search, who informed me they wanted to bring me in for an interview. Being so close to the holiday season, they wanted to bring me in that week, and elected to do a round trip all on December 18th. That day, I woke up at 3:30 am, was on a flight out of Midway by 6:00 am, and arrived at Progressive Field around 9 am.

I was met by the person who coordinated the job search for 30 minutes, outlining the plans for the day and what to expect. Then I met with a couple of assistant GMs for a half-hour, and afterward was brought into a conference room to meet with the R&D group. There were 4 people in the room, one of which was another Assistant GM, and about halfway through another R&D member video-conferenced in. We discussed some ideas I spoke about on my questionnaire and worked through challenges and thoughts about how you might build the models. Through this, they asked quite a few high-level stats and data analysis questions. We went to lunch after this (I had a very tasty chicken and brie sandwich), and then continued time with the R&D department after lunch.

By this time, I was quite exhausted after having been up since 3:30 and having a full stomach, and if I had to isolate my weakest portion of the interview, this would be it. On the way back from lunch, I was discussing with one of the R&D members recently published work on uncertainty for neural networks, and when we got back he decided to ask more questions on uncertainty measurements. I unfortunately spaced on bootstrapping methods when asked about data-driven uncertainty quantification. Once he mentioned this, I was back on track and ended on a decent note, but this was certainly where I looked the weakest.

The rest of the afternoon, I had three meetings, each about 45 minutes, with various members of their front office, including the Assistant Director of Baseball Operations, VP of Player Acquisitions, Assistant Director of International Scouting, Baseball Operations Assistant, Director of Learning and Development, and Assistant Director of Player Development. These were more traditional-style interviews, asking about my experience working with others, conflict resolution, how I can take my work and explain it to non-analytical people, and so on. All of these went well, and the people were all fantastic and fun to talk to. The day ended with another 30 minutes with the R&D team, asking some quick questions on basic linear models, and then sending me off. I was on a flight by 6 pm, landed by 6:10 (time zones are fun), and completely exhausted in bed by 9 pm.

At the end of the interview, the person coordinating it mentioned that it would be some time before I heard back, due to the holiday season, followed by him being on vacation. I sent a follow-up email the next day thanking him for the experience, and he mentioned that I should hear back mid-January. After a month, I sent an email on January 15th, just to touch base, and was told they would have an answer by the end of the following week. I received a call on January 23rd from the person coordinating the search and an Assistant GM informing me that they went with another candidate. Since the cycle was over and I'll be applying for more mainstream data science and academia roles, I asked if they had any feedback to better my interviewing in the future. They said they really enjoyed my interview and talking to me and didn't have anything that stood out that I needed to improve on, but that the candidate they selected had more of a classical statistics background.


Unfortunately, I don’t have a job lined up just yet. I’ve got a few other irons in the fire – one in the sports sector, a few within academia, and a few in other areas of data science. Neither of these interviews provided very substantive feedback on what I could do better moving forward, so I’ve tried to do some reflecting on my own. I took a look at the candidate the Cardinals ended up hiring, who has a decade of industry modeling experience, so a portion of it is probably experience. With my background being in physics, where our analyses are somewhat opaque, it may be hard to see clearly how my experience translates to the baseball domain. While I have some work in the community research portion of FanGraphs, I could do better at uploading my public models as I work on them, so my experience translates more transparently. Beyond that, it’s probably a good sign that I made it at least to an in-person interview with the teams I applied to, so hopefully it’s only a matter of time before something works out.

Otherwise, I wanted to close with some other broader takeaways:

  • These application processes absolutely monopolize time. All the questionnaires, phone calls, time interviewing, time traveling, and so on really add up. I will admit that I am likely more sensitive to this at this point since I’m simultaneously writing my dissertation and trying to finish my physics analysis as well.
  • MLB R&D departments are smaller than you might expect, at least in my sample size of 2. At both of the teams I interviewed with, it was just a handful of people. These people are very bright; along the way I met another physics Ph.D. and a statistics Ph.D. But from listening to a game broadcast, you’d get the impression these are huge teams of people, which is not the truth.
  • Almost everybody I encountered through these experiences was fantastic, a lot of fun to talk to, and passionate about the game. For both interviews, it was an incredible amount of fun to take a day away from work and talk at a deep level with so many people who think about the game.

This article originally ran on my site and got me connected to Playing Numbers. Check out the site for more of my content, and follow me on Twitter @TylerJBurch.

The post A Firsthand Account Interviewing at MLB Front Offices appeared first on Playing Numbers.

Can they Keep Up? A Look into the Pace Statistic in the NBA Fri, 07 Feb 2020 15:39:54 +0000

The post Can they Keep Up? A Look into the Pace Statistic in the NBA appeared first on Playing Numbers.

Analytics is taking over the game of basketball in a beautifully informative way. Every metric sheds new light on a team’s performance and provides a different way to rank and compare teams. This new data can be used to discover valuable insights that help front offices game-plan or help a sports bettor gain an edge. One measurement I’ve found useful for judging a team’s performance is pace. The pace factor estimates the number of possessions a team has in 48 minutes. I was intrigued by the recent rise in pace throughout the league and wanted to see what underlying factors led to this change. I came across “Why NBA Game Pace is Historically High” by Kelly Scaletta, which looks into the reasons pace has risen across the league in recent years. It made me wonder how pace of play changes based on a team’s opponent: do faster teams force their opponents to keep up with them, or do slower teams force their opponents to play a similar style?

The code for the scraper used to collect data and the notebook used for analysis can be found here.

Before I could begin to answer this question, I needed to gather the data. I built a web scraper to collect the season-average pace for each team as well as the pace of each game they played, for the 2018–2019 season. Using the mean and the standard deviation of average pace, I divided the teams into four groups, illustrated in the graph below.
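The grouping step can be sketched in a few lines. This is a minimal illustration, not the original notebook: the team abbreviations and pace values are made up for the example, and I'm assuming the four groups are cut at one standard deviation above and below the mean, which is the natural reading of "using the mean and the standard deviation."

```python
import statistics

# Illustrative season-average paces (possessions per 48 minutes) -- not real data
season_pace = {"ATL": 103.9, "MIL": 103.6, "CLE": 96.6, "MEM": 96.6, "SAS": 98.3}

mean = statistics.mean(season_pace.values())
std = statistics.stdev(season_pace.values())

def pace_group(pace):
    """Bucket a team by how far its season pace sits from the league mean."""
    if pace >= mean + std:
        return "Faster"
    if pace >= mean:
        return "Fast"
    if pace >= mean - std:
        return "Slow"
    return "Slower"

groups = {team: pace_group(p) for team, p in season_pace.items()}
```

On the full 30-team data the cutoffs would of course land elsewhere; the point is just that two summary statistics are enough to define all four buckets.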

The team with the highest pace in the 2018–2019 season is the Atlanta Hawks at 103.9 possessions/game, and the Cleveland Cavaliers and Memphis Grizzlies tie for the lowest at 96.6 possessions/game. It’s interesting to note the drop-off between the Faster and Fast groups compared to the more gradual transition between the Slow and Slower groups. This leads me to think that the faster teams will have the strongest impact across all their opponents, forcing them to play fast in order to keep the game close.

In order to determine the effect a team’s opponent had on their own pace, I created two metrics I call Game Pace Difference (GPD) and Team Pace Difference (TPD) for each game played in the regular season. These metrics are calculated as follows:

Formulas for Game Pace Difference (GPD) and Team Pace Difference (TPD)

GPD can be used to determine whether a team played above or below their season average, and TPD can be used to determine whether their opponent was faster or slower than themselves. Using these metrics together, games can be identified based on average pace and game pace, and the relationship between the two can be quantified. By plotting these metrics against each other, games can be categorized and the effect of an opponent’s pace can be visualized.
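Based on that description, the two metrics amount to simple differences against season averages. This is my own sketch (the function names and the example numbers are mine, not from the original notebook):

```python
def game_pace_difference(game_pace, team_avg_pace):
    """GPD: how far above (+) or below (-) its season average a team played."""
    return game_pace - team_avg_pace

def team_pace_difference(opp_avg_pace, team_avg_pace):
    """TPD: how much faster (+) or slower (-) the opponent's season pace is."""
    return opp_avg_pace - team_avg_pace

# Example: a 96.6-pace team meets a 103.9-pace opponent in a 101.2-pace game
gpd = game_pace_difference(101.2, 96.6)  # positive: played above its average
tpd = team_pace_difference(103.9, 96.6)  # positive: faced a faster opponent
```

A point like this one (positive TPD, positive GPD) is exactly the "faster opponent pulls you above your average" case the matrix below describes.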

The matrix on the left explains the relationship between GPD and TPD and where a point should lie based on the outcome of the game. A positive trend in the data would indicate that an opponent’s pace has no effect on the team they are playing, since it shows a team can still exceed its season average, whether faster or slower, against an opponent on the opposite side of the pace spectrum. A negative trend indicates that an opponent’s pace does affect the team they are playing: if the opponent has a faster pace, the team they are playing will play at a faster tempo, and vice versa.

A downward trend can be seen in the scatter plot, and the line of best fit further illustrates this. GPD and TPD have a correlation of -0.34, a moderately negative relationship. Following the logic from the matrix, this supports the idea that a team’s opponent affects their pace during the game. Rather than an opponent having to keep up with a team or being forced to play their style, teams are more likely to adapt to their opponent’s playing style. I was curious to see how this related to the groups of teams I created, and aimed to answer whether faster teams had a stronger impact on opponents’ pace than slower teams.
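The correlation and the line of best fit are each a one-liner with NumPy. The arrays here are synthetic stand-ins with a built-in negative relationship, just to show the mechanics; the real inputs would be the per-game GPD and TPD values from the scraped data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy per-game data: TPD drives GPD downward, plus noise
tpd = rng.normal(0.0, 3.0, 500)
gpd = -0.3 * tpd + rng.normal(0.0, 2.0, 500)

r = np.corrcoef(tpd, gpd)[0, 1]             # Pearson correlation coefficient
slope, intercept = np.polyfit(tpd, gpd, 1)  # least-squares line of best fit
```

With the real data, `r` is where the -0.34 figure comes from, and `slope` and `intercept` define the trend line drawn on the scatter plot.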

The table below shows the average Game Pace Difference for each group against each group.

The averages are consistent with the findings from the scatter plot: Faster opponents increase the pace of the team they are playing, and Slower opponents decrease it. Slow and Slower teams produce a larger difference in game pace than Fast and Faster teams, showing that slower teams have a stronger impact. I found this interesting because I originally thought Faster teams would have a stronger impact, given the large gap in pace between Faster teams and the rest of the league. However, this makes sense because of the natural flow of the game: a slower team is more likely to hold on to the ball and use the full possession, versus a faster team that is more likely to shoot quickly, leading to a change in possession. By playing a slowed-down, methodical offense, the slower team both reduces the number of possession changes and cuts into the faster team’s time of possession, preventing the faster team from taking more shots (and possibly from turning the ball over). I also found it interesting that Faster teams played slightly slower than average even against Fast teams. It shows just how much faster they played than everyone else: every other group caused them to play below their season average as they tried to match their opponent’s pace.
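The group-vs-group table above is a group-by over the per-game rows: average GPD keyed on (team's group, opponent's group). A stdlib-only sketch with toy rows (the tuples below are illustrative, not the real data):

```python
from collections import defaultdict

# Toy per-game rows: (team's pace group, opponent's pace group, GPD)
games = [
    ("Faster", "Slower", -2.1), ("Faster", "Slower", -1.5),
    ("Slow", "Faster", 2.0), ("Slow", "Faster", 1.6),
]

# Accumulate [sum, count] for each (team group, opponent group) cell
sums = defaultdict(lambda: [0.0, 0])
for team_g, opp_g, gpd in games:
    cell = sums[(team_g, opp_g)]
    cell[0] += gpd
    cell[1] += 1

avg_gpd = {key: total / n for key, (total, n) in sums.items()}
```

Each entry of `avg_gpd` corresponds to one cell of the table: for instance, a negative value at `("Faster", "Slower")` means Faster teams played below their season average against Slower opponents.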

There are many reasons a team’s pace varies from game to game, and I believe this analysis is a great starting place to further examine how teams change not only their tempo but other aspects of their playing style as well. I always figured a team would plan to completely overpower their opponent in every aspect of the game, but the fact that a team’s pace is affected by their opponent’s implies that teams prepare to handle their opponent’s strengths instead of trying to exploit every weakness. By following this idea and utilizing this method of analysis, new paths to victory can be found that will change the way teams plan to conquer their opponents.
