Finding Player Comps with Mahalnobis Distance
19 Jun 2021
§ Motivation
Baseball Reference has a feature on player pages called "Similarity Scores" that quantifies how similar a player's season or career is to another. For example, Mike Trout's career to date is most similar to that of Hack Wilson. Dan Szymborski's ZiPS projections uses Mahalanobis distances to find player comps and to predict future performance..
§ The Dataset
To find player comps, I will be using Mahalanobis distances like ZiPS does. The data set I will be using is the Baseball Databank. The data includes every player season from 1847 to 2020. For the vector space model, I will be using every basic stat as a dimension in the vector space. For batters, the dimensions are: games played, at bats, runs, hits, doubles, triples, home runs, RBI, SB, CS, strikeouts, IBB, sac hits, sac flies, and GIDP.
# These are the basic stats we are keeping track of.
cols = ['G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'SH', 'SF', 'GIDP']
The Batting.csv
data set only includes these basic stats and preferably it would have more advanced metrics like OPS+ or wOBA that measures hits more indirectly. This way, we can build a model that takes into account the player's value rather than just taking counting stats at face value. As all the data is available to me, I should be able to calculate these statistics on my own, but that's out of the scope for this exercise.
For this experiment, I only used post-integration statistics, i.e. post-1947; this is what I consider "modern baseball", and I want to find comparisions for "modern players". This is because how people played and the rules of pre-integration baseball were drastically different, not to mention the many capable and talented players that were barred from playing the game.
# Read all data
orig_data = pd.read_csv("baseballdatabank-master/core/Batting.csv")
# Keep only pre-integration stats
batting_data = orig_data[orig_data.yearID >= 1947]
# Copy over the statistically relevant columns
data = batting_data[cols].copy()
§ Mahalanobis Distance
Mahalanobis Distance
The Mahalanobis distance is given by
where:
- and are two vectors with the same distribution (i.e. the player statlines)
- is the inverse of the covariance matrix.
The Mahalanobis distance is a generalisation of the number of standard deviations from the mean into higher dimensions. In terms of player comparisons, we can think of each statistic as a different spatial dimension and all the dimensions compose a vector space model. Then, a player's performance can be generalised into a vector within this vector space, which can then be compared using different distance measures.
In the case of baseball statistics, some statistics correlate with one another such as OBP and OPS. Then the Mahalanobis distance is a measure that we would want to use, as it takes into account the correlations between data unlike the Euclidean distance.
§ Implementation
Using numpy
, the covariance matrix is easy to find. However, because some stats like IBB or CS were not kept until later in the game's history, some players may have an empty value; these empty values become NaN
. Luckily, numpy.ma
allows us to mask these invalid values.
First, numpy.ma.masked_invalid
masks the NaN
values, and numpy.ma.cov
allows us to compute the covariance matrix with masked values. Using the numpy.linalg
library, we can also compute the inverse of this matrix.
# Find covariant matrix of the data, and its inverse
cov = np.ma.cov(np.ma.masked_invalid(data), rowvar=False)
invcov = np.linalg.inv(cov)
Then to compute the Mahalanobis distance, we simply find a player's season by their player ID and the year and compare it to all other player seasons. For example Shohei Ohtani has a player ID of ohtansh01
, and we can pick his rookie 2018 season. Then, using the distance()
function we can find the most similar season to Ohtani's.
To select a player season, we can restrict the pandas
dataframe based on predicates:
orig_data[(orig_data.playerID == pid) & (orig_data.yearID == yr)][cols]
The above finds a player season where both the player and year match, then makes a vector from the statistically relevant columns defined earlier.
Then, we iterate through all the player seasons, and compute the Mahalanobis distance. There is a useful function called numpy.einsum
that allows us to use Einstein notation to define the matrix multiplication required to compute the distance.
dist = np.sqrt(np.einsum('nj,jk,nk->n', delta, invcov, delta))[0]
Using the above elements, I created a distance()
function, computing the Mahalanobis distance from a given player to every single player season post-war. In the end, it will print out the most similar season. Using Ohtani's rookie season, we call distance('ohtansh01', 2018)
and it yields Willson Contreras' 2019 with a distance of 3.095
.
Again, I only looked at batting data, one obvious improvement would be adding another dimension for age. This will allow me to do something like to Baseball Reference's similarity score and find most similar by age. Also, using age makes it possible to come up with past players with similar performances and use them as a model for how a current player might age, à la ZiPS.
The full implementation of the distance()
function can be found at this gist.