9 Offense-Defense Rating

library("tidyverse"); theme_set(theme_bw())

Warning: package 'purrr' was built under R version 4.4.1

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4.9000     ✔ readr     2.1.5     
✔ forcats   1.0.0          ✔ stringr   1.5.1     
✔ ggplot2   3.5.1          ✔ tibble    3.2.1     
✔ lubridate 1.9.3          ✔ tidyr     1.3.1     
✔ purrr     1.0.4          
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("DT")

A simple example of offense-defense data is this fictitious collection of 4 teams that have played 5 games.

d <- tribble(
  ~home, ~away, ~home_score, ~away_score, 
      1,     2,          21,           7,
      2,     3,          13,          14,
      4,     1,           3,          28,
      3,     4,          31,           0,
      1,     3,          42,          10
)

Previously, we calculated the margin of victory (home score minus away score) and constructed a model for the margin of victory. In those models, we estimated a team strength and used the difference in team strengths to calculate the expected margin of victory and game outcome probabilities.

In this chapter, we will build models for the both the home score and the away score. Using the scores allows us to estimate offense and defense team strengths.

9.1 Model

A model for the number of points scored by the home team is

\[S^H_g = \eta + \theta_{H[g]} - \delta_{A[g]} + \epsilon^H_g\]

where, for game $g$,

$S^H_g$ is the score for the home team,
$H[g]$ is the ID for the home team,
$A[g]$ is the ID for the away team,
$\epsilon^H_g$ is a random error.

This error arises because if two teams play each other repeatedly we would expect the home score would be random around the expected home score.

The parameters are

$\eta$ is the offensive home advantage,
$\theta_t$ is the offensive strength of team $t$, and
$\delta_t$ is the defense strength of team $t$.

The expected home score when team $H$ is at home playing against team $A$ is

\[E\left[S^H\right] = \eta + \theta_{H} - \delta_{A}.\] We can build a similar model for the away team score. A model for the number of points scored by the away team is

\[S^A_g = \theta_{A[g]} - \delta_{H[g]} + \epsilon^A_g\]

where, for game $g$,

$S^A_g$ is the score for the away team,
$\epsilon^A_g$ is a random error.

This error arises because if two teams play each other repeatedly we would expect the away score would be random around the expected away score.

The expected away score when team $H$ is at home playing against team $A$ is

\[E\left[S^A\right] = \theta_{A} - \delta_{H}.\]

We can utilize these models to calculate the expected margin of victory $M$ for home team $H$ playing away team $A$: \[\begin{array}{rl} E[M] &= E[S^H - S^A] \\ &= E[S^H] - E[S^A] \\ &= \eta + \theta_H - \delta_A - (\theta_A - \delta_H) \\ &= \eta + (\theta_H + \delta_H) - (\theta_A + \delta_A) \end{array} \] where

$\eta$ is the home advantage,
$\theta_H - \delta_H$ is the strength of the home team, and
$\theta_A - \delta_A$ is the strength of the away team.

9.1.1 Identifiability

In this model, when two teams play each other, the score is expected to be the home advantage $\eta$ plus the difference in strengths between the two teams $\theta - \delta$. Thus, only the difference between the offense and defense rating is identifiable. and we can arbitrarily add a constant to all of the $\theta$ and $\delta$ values.

9.1.2 Regression

If we assume $\epsilon_g^H, \epsilon_g^A \stackrel{ind}{\sim} N(0,\sigma^2)$, then this is a linear regression model.

Since we have both the offensive parameters ($\theta$s) and defensive parameters ($\delta$s), we will need to construct a home matrix $X^H$ and an away matrix $X^A$.

For $X^H$ we have \[X^H_{g,t} = \left\{ \begin{array}{rl} 1 & \mbox{if team $t$ is the {\bf home} team in game $g$} \\ 0 & \mbox{otherwise} \end{array} \right.\]

For $X^A$ we have \[X^A_{g,t} = \left\{ \begin{array}{rl} 1 & \mbox{if team $t$ is the {\bf away} team in game $g$} \\ 0 & \mbox{otherwise} \end{array} \right.\]

Compared to the model matrices we created earlier, we have no negative values (-1) in these matrices.

construct_matrix <- function(d, Col = "home") {
  n_games <- nrow(d)
  n_teams <- length(unique(unlist(d[, Col])))
  
  m <- matrix(0, 
              nrow = n_games, 
              ncol = n_teams)
  
  for (g in 1:n_games) {
    m[g, as.numeric(d[g, Col])] <-  1
  }
  
  return(m)
}

X_h <- construct_matrix(d, "home")
X_a <- construct_matrix(d, "away")

Check to make sure these matrices are appropriate

# n_games x n_teams
dim(X_h)

[1] 5 4

dim(X_a)

[1] 5 4

# all entries are 0 or 1
all(unique(X_h) %in% 0:1)

[1] TRUE

all(unique(X_a) %in% 0:1)

[1] TRUE

all(rowSums(X_h) == 1)

[1] TRUE

all(rowSums(X_a) == 1)

[1] TRUE

To see how this is a linear regression model, we can construct a regression model using matrix algebra \[ Y = \left[ \begin{array}{c} S^H_1 \\ \vdots \\ S^H_G \\ S^A_1 \\ \vdots \\ S^A_G \end{array} \right] = \left[ \begin{array}{cc} 1 & X_h & -X_a \\ 0 & X_a & -X_h \end{array} \right] \left[ \begin{array}{c} \eta \\ \theta_1 \\ \vdots \\ \theta_T \\ \delta_1 \\ \vdots \\ \delta_T \end{array} \right] + \left[ \begin{array}{c} \epsilon_1^H \\ \vdots \\ \epsilon_G^H \\ \epsilon_1^A \\ \vdots \\ \epsilon_G^A \end{array} \right] = X\beta + \epsilon \] where

$G$ is the number of games,
$T$ is the number of teams, and
$1$ ($0$) is shorthand for a $G \times 1$ vector of all 1s (0s), and
$\epsilon$ is a $G\times 2$ vector with $\epsilon \sim N(0,\sigma^2 \mathrm{I})$.

Thus, we can estimate the parameters in this model using linear regression.

# Construct score vector
Y <- c(d$home_score, d$away_score)

# Construct (complete) model matrix
X <- rbind(
  cbind(1, X_h, -X_a),
  cbind(0, X_a, -X_h)
)

# Fit model
m <- lm(Y ~ 0 + X) # 0 removes intercept (so does -1) 
summary(m)


Call:
lm(formula = Y ~ 0 + X)

Residuals:
     1      2      3      4      5      6      7      8      9     10 
-3.833 -4.083  1.194  1.444  5.278  4.083  3.833 -1.444 -1.194 -5.278 

Coefficients: (1 not defined because of singularities)
   Estimate Std. Error t value Pr(>|t|)  
X1   7.3889     5.2804   1.399   0.2967  
X2  29.4444     7.3500   4.006   0.0570 .
X3   9.8056    10.0536   0.975   0.4323  
X4  22.1667     6.8595   3.232   0.0839 .
X5   1.3056    10.0536   0.130   0.9086  
X6   6.8889     8.6565   0.796   0.5096  
X7  12.0000     7.9206   1.515   0.2690  
X8   0.1111     8.6565   0.013   0.9909  
X9       NA         NA      NA       NA  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.921 on 2 degrees of freedom
Multiple R-squared:  0.9719,    Adjusted R-squared:  0.8597 
F-statistic: 8.662 on 8 and 2 DF,  p-value: 0.1076

home_advantage <- coef(m)[1]

teams <- data.frame(
  team = rep(1:4, times = 2),
  type = rep(c("offense", "defense"), each = 4),
  rating = coef(m)[-1]
) |>
  mutate(
    rating = ifelse(is.na(rating), 0, rating),
    rating = rating - mean(rating)
  ) |>
  pivot_wider(
    names_from = type,
    values_from = rating
  ) |>
  mutate(
    strength = offense + defense
  )

teams |> 
  datatable(filter = "top", rownames = FALSE) |>
  formatRound(columns = c('offense', 'defense', 'strength'), 
              digits = 2)

9.1.3 Prediction

Let $S^H$ ($S^A$) be the score for the home (away) team in an upcoming game. The regression model assumes \[ S^H \sim N(\eta + \theta_H - \delta_A, \sigma^2) \quad \mbox{and (independently)} \quad S^A \sim N(\theta_A - \delta_H, \sigma^2). \] Thus, the margin of victory $M$ is \[ M = S^H - S^A \sim N(E[M], 2\sigma^2) \] where \[E[M] = \eta + (\theta_H + \delta_H) - (\theta_A + \delta_A).\]

To find the probability the home team wins we use \[P(M > 0) = P\left( T_v > \frac{-E[M]}{\sigma\sqrt{2}} \right) = P\left( T_v < \frac{ E[M]}{\sigma\sqrt{2}} \right) \] where $T_v$ is a T-distribution with $v$ is the degrees of freedom. The degrees of freedom can be calculated using $v = 2(G-T)$.

If $G >> T$ (which is certainly not true in our simple example), then $T_v \stackrel{d}{\approx} Z$ (where $\stackrel{d}{\approx}$ means approximately equal in distribution). Thus, we can calculate the probability using \[ P(M > 0) \approx P\left( Z < \frac{ E[M]}{\sigma\sqrt{2}} \right). \]

The standard deviation here is

(sd <- summary(m)$sigma)

[1] 7.920613

Suppose Team 1 plays against Team 2 at Team 1’s home.

# Margin of Victory: Team 1 (Home) - Team 2 (Away)
expected_margin <- as.numeric(coef(m)[1] + 
                                teams[1, "strength"] - 
                                teams[2, "strength"]) 

# Probability of Victory: Team 1 (Home) - Team 2 (Away)
pt(expected_margin / (sd * sqrt(2)), df = summary(m)$df[2])

[1] 0.9052297

9.2 Examples

9.2.1 2023 Intraconference Big12 Baseball

# Big 12 2023 Intraconference games
# I'm not sure why the file uses 2024
baseball <- read.csv("../../data/college_baseball_2024.csv") |>
  
  # Create home/away teams and scores
  # for neutral sites, home/away will be arbitrary
  mutate(
    Home_Field = gsub("@", "", Home_Field),
    home_is_Team_1 =                        # this is used repeatedly below
      Home_Field == "neutral" | 
        Home_Field == Team_1,  
    
    # Set the teams
    home = ifelse(home_is_Team_1, Team_1, Team_2),
    away = ifelse(home_is_Team_1, Team_2, Team_1),
    
    # Set the scores
    home_score = ifelse(home_is_Team_1, Score_1, Score_2),
    away_score = ifelse(home_is_Team_1, Score_2, Score_1),
    
    Date = as.Date(Dates, format = "%m/%d/%Y")
  ) |>
  filter(Home_Field != "neutral") |>
  select(Date, Home_Field, home, away, home_score, away_score)

9.2.1.1 Exploratory

baseball  |>
  datatable(filter = "top", rownames = FALSE)

Summary statistics

summary(baseball)

      Date             Home_Field            home               away          
 Min.   :2023-03-17   Length:109         Length:109         Length:109        
 1st Qu.:2023-04-01   Class :character   Class :character   Class :character  
 Median :2023-04-16   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2023-04-17                                                           
 3rd Qu.:2023-05-01                                                           
 Max.   :2023-05-20                                                           
   home_score       away_score    
 Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 4.000   1st Qu.: 3.000  
 Median : 7.000   Median : 6.000  
 Mean   : 7.394   Mean   : 5.963  
 3rd Qu.:10.000   3rd Qu.: 8.000  
 Max.   :20.000   Max.   :21.000

9.2.1.2 Modeling

# Construct factors for teams
teams <- data.frame(
  names = c(baseball$home, baseball$away) |>
    unique() |>
    sort() |>
    factor()
)

baseball <- baseball |>
  mutate(
    home = factor(home, levels = teams$names),
    away = factor(away, levels = teams$names)
  )

X_h <- construct_matrix(baseball, "home")
X_a <- construct_matrix(baseball, "away")

Fit the model

Y <- c(baseball$home_score, 
       baseball$away_score)

X <- rbind(
  cbind(1, X_h, -X_a),
  cbind(0, X_a, -X_h)
)
m <- lm(Y ~ 0 + X) # 0 removes intercept (so does -1) 
summary(m)


Call:
lm(formula = Y ~ 0 + X)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9329 -2.9293 -0.3731  2.2201 13.9119 

Coefficients: (1 not defined because of singularities)
    Estimate Std. Error t value Pr(>|t|)    
X1    1.3951     0.5557   2.510 0.012857 *  
X2    3.3262     1.1832   2.811 0.005429 ** 
X3    4.2627     1.1832   3.603 0.000398 ***
X4    3.9188     1.1832   3.312 0.001100 ** 
X5    3.3672     1.1727   2.871 0.004529 ** 
X6    6.7411     1.1772   5.726 3.72e-08 ***
X7    4.1092     1.1832   3.473 0.000631 ***
X8    5.6595     1.1832   4.783 3.35e-06 ***
X9    5.3685     1.1832   4.537 9.83e-06 ***
X10   5.2415     1.2562   4.172 4.49e-05 ***
X11  -3.0106     1.1933  -2.523 0.012421 *  
X12  -3.1693     1.1933  -2.656 0.008549 ** 
X13  -1.4180     1.1933  -1.188 0.236145    
X14  -1.4096     1.1814  -1.193 0.234227    
X15  -0.8453     1.1814  -0.715 0.475168    
X16   0.0582     1.1933   0.049 0.961150    
X17  -0.6773     1.1933  -0.568 0.570997    
X18  -1.3492     1.1933  -1.131 0.259574    
X19       NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.101 on 200 degrees of freedom
Multiple R-squared:  0.7552,    Adjusted R-squared:  0.7332 
F-statistic: 34.29 on 18 and 200 DF,  p-value: < 2.2e-16

9.2.1.3 Parameter estimates

(home_advantage <- coef(m)[1])

      X1 
1.395062

(sd             <- summary(m)$sigma)

[1] 4.101447

constant <- mean(baseball$away_score)

ratings <- data.frame(
  team = rep(teams$names, times = 2),
  type = rep(c("offense", "defense"), each = nrow(teams)),
  rating = coef(m)[-1]
) |>
  mutate(
    rating = ifelse(is.na(rating), 0, rating),
    rating = rating - mean(rating) + constant
  ) |>
  pivot_wider(
    names_from = type,
    values_from = rating
  ) |>
  mutate(
    strength = offense + defense
  ) |>
  arrange(desc(strength)) |>
  mutate(
    team = factor(team, levels = team)
  )

ratings |> 
  datatable(filter = "top", rownames = FALSE) |>
  formatRound(columns = c('offense', 'defense', 'strength'),
              digits = 2)

We can compare these ratings to the Big 12 standings from 2023.

Let’s plot these ratings

p_ratings <- ratings |>
  mutate(
    offense = offense   + constant,
    defense = defense   + constant,
    strength = offense + defense
  ) |>
  pivot_longer(
    cols = offense:strength,
    names_to = "type",
    values_to = "rating"
  ) |>
  mutate(
    team = factor(team, levels = ratings$team[order(ratings$strength)]),
    type = factor(type, levels = c("offense","defense","strength"))
  )

ggplot(p_ratings,
       aes(
         x = rating,
         y = team
       )) +
  geom_bar(stat="identity") + 
  facet_wrap(~type, nrow = 1, scales = "free_x")

ggplot(ratings,
       aes(
         ymin = defense,
         ymax = defense + offense,
         x = team
       )) +
  geom_linerange() +
  coord_flip()

9.2.1.4 Predictions

Let’s suppose we are interested in calculating the probability in a home-and-home series between Oklahoma and Oklahoma State.

ratings |> filter(team %in% c('Oklahoma','Oklahoma State'))

# A tibble: 2 × 4
  team           offense defense strength
  <fct>            <dbl>   <dbl>    <dbl>
1 Oklahoma State   11.0     3.44     14.5
2 Oklahoma          7.65    2.88     10.5

# @ Oklahoma
(expected_home <- home_advantage + 
    ratings$offense[ratings$team == "Oklahoma"] - 
    ratings$defense[ratings$team == "Oklahoma State"])

      X1 
5.607516

(expected_away <- 
    ratings$offense[ratings$team == "Oklahoma State"] - 
    ratings$defense[ratings$team == "Oklahoma"])

[1] 8.150726

(expected_margin <- expected_home - expected_away)

      X1 
-2.54321

(probability_Oklahoma_win_home <- 
    pt(expected_margin / (sd * sqrt(2)), 
       df = summary(m)$df[2]))

      X1 
0.330763

# @ Oklahoma State
(expected_home <- home_advantage + 
    ratings$offense[ratings$team == "Oklahoma State"] - 
    ratings$defense[ratings$team == "Oklahoma"])

      X1 
9.545788

(expected_away <- 
    ratings$offense[ratings$team == "Oklahoma"] - 
    ratings$defense[ratings$team == "Oklahoma State"])

[1] 4.212454

(expected_margin <- expected_home - expected_away)

      X1 
5.333333

(probability_Oklahoma_win_away <- 
    1 - pt(expected_margin / (sd * sqrt(2)), # note the 1 -
           df = summary(m)$df[2]))

       X1 
0.1794737

# Probability Oklahoma wins both (assuming independence)
probability_Oklahoma_win_home * probability_Oklahoma_win_away

        X1 
0.05936325

9.3 Rating Systems

9.3.1 Kenpom.com

Ken Pomeroy provides the types of ratings at kenpom.com for D-I college basketball:

ORtg: offense rating
DRtg: defense rating
NetRtg: strength rating

Note that NetRtg = ORtg - DRtg.

Rather than estimating the home court advantage, these ratings assume a 3.75 points per game home court advantage.