Came across a blog post earlier this week by David Robinson on applying a Bayesian method to baseball batting averages to get an expected batting average. Robinson's inquiry was sparked by knowing that a batter who goes 300/1000 is a better hitter than a batter who goes 4/10, despite the .300 average versus the .400 average. The question remains: how much better is he?
My immediate thought was, I could apply this to hockey (and get some R training out of it, to boot). So, I did. This post will be less about the methodology and more about the findings, as there is no chance I can explain what is going on here better than Robinson has on his own blog (seriously, click the link above).
(All stats provided by War-On-Ice.com. They encompass 2005-2006 through 2014-2015, 5v5, for forwards who recorded 50 or more shots on goal in a season.)
Step 1: See if we have a data set that can be analyzed in this method:
On a count basis, we have a nearly normal distribution (with more outliers to the right of the mean than to the left, not surprisingly).
When converting to density, we get a bit of a spike on the left side, but I'm not too worried about that since the raw data shows a near-normal distribution. The conversion to density is so we can find the alpha0 and beta0 parameters for our beta distribution.
Beta distribution, from Robinson:
"In short, the beta distribution can be understood as representing a probability distribution of probabilities- that is, it represents all the possible values of a probability when we don’t know what that probability is."
In theory, it is a continuous probability distribution, so we can calculate expected values over different sample sizes. If we had Player A with 10 shots on goal and Player B with 300 shots on goal, the beta distribution would allow us to effectively compare both players' expected outcomes; the hope being that Player A's expected shooting percentage would change a lot, while Player B's wouldn't move all that much.
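The update behind this comparison is the standard empirical-Bayes posterior mean that Robinson describes: add the prior parameters to a player's goal and shot counts. A minimal sketch in Python (the original analysis was done in R); the prior values are the alpha0 and beta0 reported below:

```python
# Empirical-Bayes shrinkage of shooting percentage toward the league prior.
# Prior parameters are the ones fitted later in the post.
ALPHA0 = 5.897168
BETA0 = 56.40995

def expected_sh_pct(goals, shots, alpha0=ALPHA0, beta0=BETA0):
    """Posterior mean of the beta-binomial model: the raw shooting
    percentage pulled toward the league-wide prior, with the pull
    weakening as the shot count grows."""
    return (goals + alpha0) / (shots + alpha0 + beta0)

# Player A: 2 goals on 10 shots (raw 20%); Player B: 30 on 300 (raw 10%)
a = expected_sh_pct(2, 10)    # dragged strongly toward the prior mean
b = expected_sh_pct(30, 300)  # barely moves from the raw 10%
```

With only 10 shots, Player A's estimate lands much closer to the league rate than to his raw 20%, while Player B's 300-shot sample keeps his estimate pinned near 10%.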
Using this data set, the following parameters were calculated and used (the R function to calculate beta parameters was found here):
alpha0 = 5.897168
beta0 = 56.40995
Our beta parameter is so much greater than our alpha parameter because the mean of the distribution, alpha0 / (alpha0 + beta0), sits well below one half; that imbalance also produces the longer right tail seen above.
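The post links an R helper for fitting these parameters. As a rough illustration of what such a fit does, here is a method-of-moments sketch in Python; the R helper likely used maximum likelihood, so exact values would differ, and the sample below is toy data, not the real shot data:

```python
from statistics import fmean, pvariance

def fit_beta_moments(samples):
    """Method-of-moments estimates for a beta distribution:
    match the sample mean and variance to the beta's mean and
    variance, then solve for alpha and beta."""
    m = fmean(samples)
    v = pvariance(samples)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common  # alpha0, beta0

# Toy check: values concentrated near 9.5% yield beta0 >> alpha0
alpha0, beta0 = fit_beta_moments([0.05, 0.08, 0.09, 0.10, 0.12, 0.13])
```

Because the method matches moments exactly, the fitted alpha0 / (alpha0 + beta0) reproduces the sample mean, which is why a low league-wide shooting percentage forces beta0 to dwarf alpha0.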
With these parameters, a player who has taken 0 shots on goal would have an expected shooting percentage of 9.47%.
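That zero-shot expectation is just the prior mean of the fitted beta distribution, which falls out of the parameters directly:

```python
# With no shots taken, the posterior mean reduces to the prior mean.
alpha0 = 5.897168
beta0 = 56.40995
prior_mean = alpha0 / (alpha0 + beta0)  # ~9.47%, as stated above
```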
After computing the expected shooting percentage of each player, it was time to compare it to each player's actual recorded shooting percentage:
X-axis = recorded shooting percentage
Y-axis = expected shooting percentage
The variables had a very strong relationship with an adjusted r-squared of .9818.
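For reference, the adjusted r-squared for a one-predictor fit like this can be computed straight from the Pearson correlation. A dependency-free Python sketch (the post's numbers presumably came from R's `lm` summary):

```python
def adjusted_r2(x, y):
    """Adjusted R^2 for a simple (one-predictor) linear regression."""
    n = len(x)
    # Pearson correlation, computed by hand to stay dependency-free
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    r2 = (cov / (sx * sy)) ** 2
    # Penalize for sample size with a single predictor (p = 1)
    return 1 - (1 - r2) * (n - 1) / (n - 2)
```

A perfectly linear relationship returns 1; noise pushes the value down, and the (n - 1)/(n - 2) factor penalizes small samples.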
Here are the top ten movers in each direction, positive and negative, which also shows how the players who shoot a ton barely move at all.
I decided to branch off and go a little further than Robinson for the rest of the post, applying these expected shooting percentages to calculate an expected goals total for each player. Below are each player's expected goals, using their expected shooting percentage, compared to their actual goal totals.
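The natural way to get that total, and presumably what was done here, is to multiply each player's shrunken shooting percentage by the shots he actually took. A hedged sketch with illustrative numbers:

```python
# Expected goals from the empirical-Bayes shooting percentage.
# The prior parameters are the ones fitted earlier in the post.
ALPHA0 = 5.897168
BETA0 = 56.40995

def expected_goals(goals, shots):
    """Shrunken shooting percentage times actual shot volume."""
    exp_sh = (goals + ALPHA0) / (shots + ALPHA0 + BETA0)
    return exp_sh * shots

# A 30-goal, 300-shot season barely changes; a hot 4-goals-on-20-shots
# stretch gets pulled down toward the league rate.
big_sample = expected_goals(30, 300)
small_sample = expected_goals(4, 20)
```

This is why the big shooters' totals hardly move: the prior's weight of roughly alpha0 + beta0 "phantom shots" is small next to a 300-shot season.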
Another very strong relationship, an adjusted r-squared of .9548.
Finally, I checked to see if these new expected goal totals were a better predictor for next season's goals than a player's raw goals were.
Yikes. Adjusted r-squared of .172.
How does this compare when looking at actual previous year's goals to actual next season's goals?
Adjusted r-squared of .1707, meaning the expected goal totals were only negligibly better as a predictor.
The quest to find a way to predict next season's goals continues.