Motivation§

Baseball Reference has a feature on player pages called “Similarity Scores” that quantifies how similar a player’s season or career is to another. For example, Mike Trout’s career to date is most similar to that of Hack Wilson. Dan Szymborski’s ZiPS projections uses Mahalanobis distances to find player comps and to predict future performance..


The Dataset§

To find player comps, I will be using Mahalanobis distances like ZiPS does. The data set I will be using is the Baseball Databank. The data includes every player season from 1847 to 2020. For the vector space model, I will be using every basic stat as a dimension in the vector space. For batters, the dimensions are: games played, at bats, runs, hits, doubles, triples, home runs, RBI, SB, CS, strikeouts, IBB, sac hits, sac flies, and GIDP.

# These are the basic stats we are keeping track of.

cols = ['G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'SH', 'SF', 'GIDP']

The Batting.csv data set only includes these basic stats and preferably it would have more advanced metrics like OPS+ or wOBA that measures hits more indirectly. This way, we can build a model that takes into account the player’s value rather than just taking counting stats at face value. As all the data is available to me, I should be able to calculate these statistics on my own, but that’s out of the scope for this exercise.

I decided to discard any statistics from pre-integration (pre-1947), this is because the game’s rules were drastically different in its early days (1850-1890s), and the dead-ball era (1900-1920) will affect the covariance matrix. Wide spread integration did not occur until later, but 1947 is the first year that Robinson played.

# Read all data
orig_data = pd.read_csv("baseballdatabank-master/core/Batting.csv")

# Keep only pre-integration stats
batting_data = orig_data[orig_data.yearID >= 1947]

# Copy over the statistically relevant columns
data = batting_data[cols].copy()

Mahalanobis Distance§

Mahalanobis Distance

The Mahalanobis distance $d(\vec{x}, \vec{y})$ is given by

$$ \sqrt{\left(\vec{x} - \vec{y}\right)^T\pmb{S}^{-1}\left(\vec{x} - \vec{y}\right)}, $$

where:

  • $\vec{x}$ and $\vec{y}$ are two vectors with the same distribution (i.e. the player statlines)
  • $\pmb{S}^{-1}$ is the inverse of the covariance matrix.

The Mahalanobis distance is a generalisation of the number of standard deviations from the mean into higher dimensions. In terms of player comparisons, we can think of each statistic as a different spatial dimension and all the dimensions compose a vector space model. Then, a player’s performance can be generalised into a vector within this vector space, which can then be compared using different distance measures.

In the case of baseball statistics, some statistics correlate with one another such as OBP and OPS. Then the Mahalanobis distance is a measure that we would want to use, as it takes into account the correlations between data unlike the Euclidean distance.


Implementation§

Using numpy, the covariance matrix is easy to find. However, because some stats like IBB or CS were not kept until later in the game’s history, some players may have an empty value; these empty values become NaN. Luckily, numpy.ma allows us to mask these invalid values.

First, numpy.ma.masked_invalid masks the NaN values, and numpy.ma.cov allows us to compute the covariance matrix with masked values. Using the numpy.linalg library, we can also compute the inverse of this matrix.

# Find covariant matrix of the data, and its inverse
cov = np.ma.cov(np.ma.masked_invalid(data), rowvar=False)
invcov = np.linalg.inv(cov)

Then to compute the Mahalanobis distance, we simply find a player’s season by their player ID and the year and compare it to all other player seasons. For example Shohei Ohtani has a player ID of ohtansh01, and we can pick his rookie 2018 season. Then, using the distance() function we can find the most similar season to Ohtani’s.

To select a player season, we can restrict the pandas dataframe based on predicates:

orig_data[(orig_data.playerID == pid) & (orig_data.yearID == yr)][cols]

The above finds a player season where both the player and year match, then makes a vector from the statistically relevant columns defined earlier.

Then, we iterate through all the player seasons, and compute the Mahalanobis distance. There is a useful function called numpy.einsum that allows us to use Einstein notation to define the matrix multiplication required to compute the distance.

dist = np.sqrt(np.einsum('nj,jk,nk->n', delta, invcov, delta))[0]

Using the above elements, I created a distance() function, computing the Mahalanobis distance from a given player to every single player season post-war. In the end, it will print out the most similar season. Using Ohtani’s rookie season, we call distance('ohtansh01', 2018) and it yields Willson Contreras’ 2019 with a distance of 3.095.

Again, I only looked at batting data, one obvious improvement would be adding another dimension for age. This will allow me to do something like to Baseball Reference’s similarity score and find most similar by age. Also, using age makes it possible to come up with past players with similar performances and use them as a model for how a current player might age, à la ZiPS.

The full implementation of the distance() function can be found at this gist.