Warning: package 'purrr' was built under R version 4.4.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4.9000 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("ggResidpanel")library("emmeans")
Welcome to emmeans.
Caution: You lose important information if you filter this package's results.
See '? untidy'
library("lme4")
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
library("DT")options(width =120)
12.1 Models
Let \(Y_{a,e}\) be the rank for athlete \(a\) in event \(e\) with 1 indicating the fastest time and higher numbers indicating slower times. To build a model for the ranks, we will introduce a latent variable \(\zeta_{a,e}\) with the order statistics \(\zeta_{a,e,(1)}, \ldots, \zeta_{a,e,(N_e)}\) where \(N_e\) are the number of athletes in event \(e\). The order statistics are such that \[\zeta_{a,e,(1)} < \zeta_{a,e,(2)} < \cdots < \zeta_{a,e,(N_e)}.\] We then define the athletes rank in terms of these order statistics to be \[Y_{a,e} = \arg_r \{r: \zeta_{a,e} = \zeta_{a,e,(r)}\}.\] The latent variable \(\zeta_{a,e}\) is related to the time in the event in the sense that if we had the time in the event, we would just use it rather than using these \(\zeta_{a,e}s\).
Now, we need to build a model for the \(\zeta_{a,e}\) for each athlete. We assume \[\zeta_{a,e} \stackrel{ind}{\sim} N(\theta_a, 1).\] So the athlete performance at a given event is a draw from a normal distribution whose mean is the ability of the athlete. The variance is assumed to be \(1\) for identifiability reasons.
Similar to our models of player/team strengths, the \(\theta_a\) are unidentifiable because we can add a constant to all the \(\theta_a\) without affect the distribution of the ranks.
12.1.1 Bayesian approach
In contrast, to the approach we have used previously, we will introduce a Bayesian approach that provides weak identifiability. We introduce a prior for the \(\theta\)s, namely \[\theta_a \stackrel{ind}{\sim} N(0, 1).\] An appealing aspect of this prior is that the \(\theta_a\) will now be interpretable as how many standard deviations away from average (0) is an athlete.
12.1.2 Gibbs sampling
Many Bayesian techniques estimate parameters using a technique called Gibbs sampling. To perform Gibbs sampling, we alternate draws from the following full conditional distributions:
\(p(\theta|\zeta, y)\)
\(p(\zeta|\theta, y)\)
where
\(\theta = (\theta_1,\ldots,\theta_A)\) where \(A\) is the number of athletes,
Performing iterations of these draws will converge to a collection of draws from the joint posterior distribution\[p(\theta,\zeta|y).\] We can take the \(\theta= (\theta_1,\ldots,\theta_A)\) draws as our belief about that players ability.
Now, we need to determine what the two full conditional distributions are. Define \(\zeta_a = \{\zeta_{a,e}: \zeta_{a,e} \mbox{ exists}\}\) and \(n_a = |\zeta_a|\), which is the number of races athlete \(a\) is in. For the full conditional for \(\theta\), we have \[p(\theta|\zeta,y) = \prod_{a=1}^A p(\theta_a|\zeta_a) = \prod_{a=1}^A N\left(\theta_a; \bar{\zeta}_a, \frac{1}{1+n_a}\right)\] where \(\bar{\zeta}_a\) is the average of the \(\zeta\) values for athlete \(a\) across all the races they competed in.
Thus, the \(\theta_a\) are all independent draws from a normal distribution where the mean is the mean of the \(\zeta\) values for that athlete and the variance is the inverse of the number of races plus 1 for that athlete. Note that, conditional on the \(\zeta\)s, the \(\theta\)s are independent of \(y\).
Define \(A_r\) be the set of athletes in race \(r\). For the full conditional for \(\zeta\), we have \[p(\zeta_{\cdot,e}|\theta,y) = \left[\prod_{a \in A_r} N(\zeta_{a,e}; \theta_a, 1) \right] \mathrm{I}\left(y_{a,e,(1)} < \cdots < y_{a,e,(N_r)}\right).\] Thus, for each event, we draw the \(\zeta\)s from independent normals with a mean according to that athlete, but with the requirement that the \(\zeta\) maintain the order observed in the race.
12.1.3 Computing
#' Construct Gibbs sampler#' #' @param y is the observed R x A matrix of ranks#' @param n_iter integer, number of iterations to run#' #' @return A x (n_iter + 1) matrix of theta (player ability) drawsrun_gibbs <-function(y, n_iter =10, init =NULL) { R <-nrow(y) A <-ncol(y)# Storage theta <-matrix(NA, nrow = n_iter +1, ncol = A)colnames(theta) <-colnames(y)# Initial drawif (is.null(init)) { init =sort(rnorm(A)) # if athletes are in order, this helps } theta[1, ] <- init# Precalculate quantities sd_A <-sqrt(1/ (1+colSums(!is.na(y))))# Iterationsfor (i in1:n_iter) { zeta <-sample_zeta(theta[i, ], y) theta[i+1,] <-sample_theta(zeta, sd_A = sd_A) }return(theta)}#' Check to make sure the zeta are in the correct order according to the ranks#' in y#' #' @param zeta numeric vector of player abilities#' @param y_r integer vector of ranks#' #' @return logical indicating if zeta is in the correct ordercheck_zeta <-function(zeta, y_r) {all(rank(zeta[!is.na(y_r)]) == y_r[!is.na(y_r)])}# Sample latent performances for each athlete in each race#' #' @param theta numeric vector of athlete abilities#' @param y integer matrix with athlete ranks in each event, #' NAs indicate the athelete did not compete#' #' @return numeric R x A matrix of player performances in each event#' sample_zeta <-function(theta, y) {# Create zeta matrix zeta <- y*0 R <-nrow(zeta) A <-ncol(zeta)for (r in1:R) { max_rank <-max(y[r,], na.rm =TRUE)for (a in1:A) {if (!is.na(y[r,a])) { rnk <- y[r,a] zeta_below <-ifelse(y[r, a] ==1, -Inf, zeta[r, which(y[r,] == (rnk -1) )]) zeta_above <-ifelse(y[r, a] == max_rank, Inf, zeta[r, which(y[r,] == (rnk +1) )])# Sample zeta[r,a] ~ N(theta[a], 1) in (zeta_below, zeta_above) zeta[r, a] <-qnorm(runif(1, min =pnorm(zeta_below, mean = theta[a]), max =pnorm(zeta_above, mean = theta[a])), mean = theta[a]) } } }return(zeta)}#' Sample player abilities based on latent player performances#' #' @param zeta numeric R x A matrix of athlete performances, NAs indicate athlete did not perform#' @param sd_A numeric vector of pre-calculated standard deviations#' #' @return numeric vector of athlete abilities#' sample_theta <-function(zeta, sd_A) {return(rnorm(ncol(zeta), mean =colMeans(zeta, na.rm =TRUE), sd = sd_A))}
12.2 Examples
12.2.1 Iowa State University Women’s 2024 Cross-Country
Rows: 1330 Columns: 12
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): NAME, YEAR, TEAM, t_sec, race
dbl (5): PL, t_min, SCORE, minutes, distance
time (2): Avg. Mile, TIME
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(cc, aes(race, NAME, fill = rank)) +geom_tile()
cc |>group_by(NAME) |>summarize(n =n()) |>ggplot(aes(x = n)) +geom_histogram(binwidth =1) +labs(title ="Iowa State University Women's 2024 Cross-Country",x ="Number of Races" )