Applying Sabermetrics to Hockey

As we dig deeper into hockey analytics, the wise move is to keep looking to baseball sabermetrics for inspiration. Obviously baseball and hockey are two completely different sports, especially when we try to measure them, but one thing holds for both: runs lead to wins in baseball, and goals lead to wins in hockey. It's the deeper digging that will separate the sports further (what leads to runs or preventing runs versus what leads to goals or preventing goals), but the root remains the same. You want to maximize runs for. You want to maximize goals for.

Building on this core idea, baseball sabermetricians have come up with ways to calculate expected winning percentage from run totals. These estimates have been shown to correlate very highly with actual winning percentage, which led to my curiosity: can we substitute goals for runs in these evaluation tools and see the same correlations?

Bill James:

Bill James' expected winning percentage formula is modeled on the Pythagorean theorem, and is widely recognized as one of the most accurate winning percentage estimators in baseball. In its original form, the formula was R^2 / (R^2 + RA^2), where R is runs scored and RA is runs allowed.
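To make the formula concrete, here is a minimal Python sketch of it with goals already swapped in for runs. The function name and the sample team (160 goals for, 140 against) are made up for illustration; the exponent is left as a parameter so other values can be tried.

```python
def pythagorean_wpct(goals_for: float, goals_against: float, exponent: float = 2.0) -> float:
    """Expected winning percentage: GF^x / (GF^x + GA^x)."""
    gf = goals_for ** exponent
    ga = goals_against ** exponent
    return gf / (gf + ga)


# Hypothetical team, not taken from the data set used in this post.
example = pythagorean_wpct(160, 140)
print(round(example, 3))
```

A team that scores exactly as many goals as it allows comes out at 0.500, which is a quick sanity check on any implementation.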

When applying it to hockey, we will replace runs and runs allowed with goals for and goals against, using 5v5 data for all teams from 05-06 (the first season after the 04-05 lockout, of course) through 14-15. Winning percentage keeps its core meaning for hockey, where a win is a win and an OTL is a loss (wins/82, amended to 48 for the lockout-shortened 12-13 season). To try to account for some of the OT noise, we will also compare expected W% to points percentage (points/164, amended to 96 for the lockout-shortened season).
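The two targets are simple enough to write down directly. This is a small sketch with hypothetical function names; the only inputs are wins, points, and games played, with games defaulting to 82 and set to 48 for the shortened season.

```python
def win_pct(wins: int, games: int = 82) -> float:
    """Wins divided by games played; an OTL counts as a loss."""
    return wins / games


def points_pct(points: int, games: int = 82) -> float:
    """Points earned divided by the maximum available (2 per game)."""
    return points / (2 * games)


# Hypothetical team: 45 wins and 98 points over a full 82-game season.
print(round(win_pct(45), 3), round(points_pct(98), 3))

# The same helpers in a 48-game season (96 points available).
print(round(win_pct(26, games=48), 3), round(points_pct(58, games=48), 3))
```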

Bill James ^2 to W%:

While we see a strong positive correlation (0.813), our R^2 comes in at just 0.6607, as some of our data points do not cluster as closely around our line of best fit as we'd like.

The Bill James calculation fits Points% better than raw W%, with an R^2 of 0.7163 and a correlation of 0.846.
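For anyone wanting to reproduce these comparisons, the correlation and R^2 are tied together: with a single predictor and a straight line of best fit, R^2 is just the correlation squared (0.813 squared is about 0.661). Below is a minimal sketch of the calculation in plain Python. The sample numbers are invented for illustration and are not the data behind the figures above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical expected vs. actual winning percentages for five teams.
expected = [0.566, 0.480, 0.510, 0.430, 0.545]
actual = [0.549, 0.500, 0.488, 0.415, 0.573]

r = pearson_r(expected, actual)
r_squared = r ** 2  # valid as R^2 for a one-predictor linear fit
print(round(r, 3), round(r_squared, 3))
```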

Over time, analysts found that an exponent of 1.83 in Bill James' formula was more accurate than a perfect square. There is no need to chart these out, as the R^2 values, correlations, and charts would look nearly the same. The difference shows up in the raw data.

On average, the perfect-square version of the Bill James formula overestimates actual winning percentage by 0.07%. The 1.83 exponent version comes in a little better, overestimating actual winning percentage by an average of 0.06%.
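The 'average miss' here is the mean of expected minus actual winning percentage across all team seasons. A rough sketch of that comparison for the two exponents follows; the helper names and the three team rows are hypothetical, so the printed values will not match the 0.07% and 0.06% figures.

```python
def pythagorean_wpct(goals_for, goals_against, exponent):
    """Expected winning percentage: GF^x / (GF^x + GA^x)."""
    gf = goals_for ** exponent
    ga = goals_against ** exponent
    return gf / (gf + ga)


def mean_signed_error(teams, exponent):
    """Average of (expected W% - actual W%); positive means overestimating."""
    errors = [pythagorean_wpct(gf, ga, exponent) - actual for gf, ga, actual in teams]
    return sum(errors) / len(errors)


# Hypothetical rows of (goals for, goals against, actual W%).
teams = [(160, 140, 0.549), (135, 150, 0.463), (148, 147, 0.500)]

for exponent in (2.0, 1.83):
    print(exponent, round(mean_signed_error(teams, exponent), 4))
```

A smaller exponent pulls every estimate toward 0.500, which is why the two versions differ in the raw numbers while ranking teams the same way.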

Over time, other sabermetricians joined in and began to develop their own expected winning percentage formulas. The next one we will try to apply to hockey is Bill Kross's 'conditional' winning percentage.

With Kross, each team's equation is conditional on whether or not it outscored the opposition. If you outscored your opponents, the equation is 1 - GA/(2*G). If your opponents outscored you, the equation is G/(2*GA).
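Here is a short sketch of the conditional estimator, written in the commonly published form of Kross's formula: G/(2*GA) for a team that was outscored, and 1 - GA/(2*G) otherwise. The function name and the sample goal totals are made up for illustration.

```python
def kross_wpct(goals_for: float, goals_against: float) -> float:
    """Kross's conditional expected winning percentage."""
    if goals_for < goals_against:
        # Outscored by opponents: estimate falls below 0.500.
        return goals_for / (2 * goals_against)
    # Outscored opponents (or broke even): estimate is 0.500 or above.
    return 1 - goals_against / (2 * goals_for)


# Hypothetical teams: one with a +20 goal differential, one with -20.
print(kross_wpct(160, 140))  # 0.5625
print(kross_wpct(140, 160))  # 0.4375
```

Both branches meet at 0.500 when goals for and goals against are equal, so the estimate is continuous across the break-even point.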

Kross to W%

At first glance, our correlation looks weaker than with Bill James, and it is, coming in at 0.808. R^2 is also a bit weaker with the Kross method, measuring just 0.6529. Negligibly weaker, but weaker.

Running Kross to P%, we again see a minor increase, as we did with the James method, with an R^2 of 0.7084 and a correlation of 0.842. Still weaker, though negligibly, than James. 

While most advanced metric analysis takes place at the 5v5 level, we may have to open this analysis up to all situations. Here is a chart of R^2 and correlations between the James model and the Kross model when taking into account all team situations:

Interestingly enough, both models appear to perform best when restricted to 5v5 situations. One possible reason is that a team that dominates goal differential at 5v5 is more likely to be a winning team than a team that relies more heavily on special teams play. It is another small example of why most analysis is done at 5v5.

Using goals for and goals against, can we do better for hockey than Bill James and Bill Kross by running a linear regression?

Going back to the 5v5 data only, using goals for and goals against as our independent variables, and W% as our dependent variable, we get the equation:

eW% = 0.5156 + (GF*0.0029) + (GA*-0.0030)

Doing the same for Points%, our equation becomes:

eP% = 0.5758 + (GF*0.0028) + (GA*-0.0030)
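The two fitted equations can be applied directly to a team's 5v5 goal totals. This sketch simply hard-codes the coefficients from the equations above; the function names and the sample team are hypothetical.

```python
def regression_wpct(goals_for: float, goals_against: float) -> float:
    """eW% = 0.5156 + 0.0029 * GF - 0.0030 * GA"""
    return 0.5156 + 0.0029 * goals_for - 0.0030 * goals_against


def regression_ppct(goals_for: float, goals_against: float) -> float:
    """eP% = 0.5758 + 0.0028 * GF - 0.0030 * GA"""
    return 0.5758 + 0.0028 * goals_for - 0.0030 * goals_against


# Hypothetical team: 160 goals for, 140 goals against at 5v5.
print(round(regression_wpct(160, 140), 4))
print(round(regression_ppct(160, 140), 4))
```

Unlike the James and Kross formulas, nothing in a linear fit keeps the estimate between 0 and 1, so extreme goal totals well outside the fitted data can give nonsensical values.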

How do these quick regressions stack up to James and Kross?

eW% to W%

R^2 comes in just under both James and Kross at 0.6474. Since these estimates come from a least-squares linear regression with an intercept, our 'average miss' will come back to 0.
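The zero average miss is a property of least squares itself, not of this data set. As a simplified illustration, the sketch below fits a one-variable regression of W% on goal differential (an assumption made to keep the example short; the post's regression uses goals for and goals against as two separate variables) on invented numbers and checks that the residuals average out to zero.

```python
def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical goal differentials and actual winning percentages.
goal_diff = [20, -15, 5, -30, 12]
actual_wpct = [0.56, 0.45, 0.50, 0.40, 0.55]

intercept, slope = fit_simple_ols(goal_diff, actual_wpct)
residuals = [y - (intercept + slope * x) for x, y in zip(goal_diff, actual_wpct)]
mean_residual = sum(residuals) / len(residuals)
print(abs(mean_residual) < 1e-12)  # zero up to floating-point rounding
```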

eP% to P%

Again, our regression comes in just below both James and Kross with an R^2 of 0.7036. 

In baseball, the James and Kross methods both hold up much better when their expected winning percentages are compared to actual winning percentage: James' formula comes in at a solid R^2 of 0.90177, and Kross at 0.8762.

We will need to keep digging on the hockey front to find an estimator of that strength, and we'll need to look into more information than just goals scored.