Chapter 8 Normal Hierarchical Models with No Predictors
We follow the material in Chapter 16 pretty closely. Let’s consider the spotify
data set in the bayesrules
package.
<- bayesrules::spotify
spotify spotify
## # A tibble: 350 × 23
## track_id title artist popularity album_id album_name album_release_d…
## <chr> <chr> <fct> <dbl> <chr> <chr> <chr>
## 1 7sYAS4C… On &… Alok 79 2a9AGiv… On & On 2019-11-22
## 2 1YSwQvw… All … Alok 56 2FdyKqN… All The L… 2019-07-19
## 3 39cmB3Z… Hear… Alok 75 6fpsA5a… Hear Me N… 2016
## 4 2Dnb6yP… The … Alok 65 0hH9tb1… The Wall 2019-06-28
## 5 1a5Yu5L… Hear… Alok 52 6apmeAA… Hear Me N… 2016-01-01
## 6 74RrMiu… Toda… Alok 45 3GQZqxm… Toda La N… 2018-05-25
## 7 7sYAS4C… On &… Alok 79 2a9AGiv… On & On 2019-11-22
## 8 0Qp1IH1… The … Alok 61 4w68G9o… The Wall 2019-06-28
## 9 36C0Gw2… Tell… Alok 61 1R2C7EG… Tell Me W… 2019-08-30
## 10 1UTXQrm… Meta… Alok 61 2ds371e… Metaphor 2019-02-01
## # … with 340 more rows, and 16 more variables: playlist_name <chr>,
## # playlist_id <chr>, genre <chr>, subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>,
## # speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## # liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>
summary(spotify)
## track_id title artist
## Length:350 Length:350 Frank Ocean : 40
## Class :character Class :character Camila Cabello: 38
## Mode :character Mode :character Kendrick Lamar: 31
## Beyoncé : 25
## Alok : 19
## TV Noise : 14
## (Other) :183
## popularity album_id album_name
## Min. : 7.00 Length:350 Length:350
## 1st Qu.:45.25 Class :character Class :character
## Median :62.50 Mode :character Mode :character
## Mean :58.40
## 3rd Qu.:73.00
## Max. :94.00
##
## album_release_date playlist_name playlist_id
## Length:350 Length:350 Length:350
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## genre subgenre danceability energy
## Length:350 Length:350 Min. :24.00 Min. : 9.24
## Class :character Class :character 1st Qu.:56.40 1st Qu.:50.30
## Mode :character Mode :character Median :68.20 Median :64.60
## Mean :66.04 Mean :63.53
## 3rd Qu.:75.90 3rd Qu.:77.60
## Max. :95.10 Max. :98.20
##
## key loudness mode speechiness
## Min. : 0.000 Min. :-20.700 Min. :0.0000 Min. : 2.54
## 1st Qu.: 1.000 1st Qu.: -8.492 1st Qu.:0.0000 1st Qu.: 4.31
## Median : 6.000 Median : -6.388 Median :1.0000 Median : 7.38
## Mean : 5.286 Mean : -6.987 Mean :0.5543 Mean :11.26
## 3rd Qu.: 9.000 3rd Qu.: -4.960 3rd Qu.:1.0000 3rd Qu.:12.68
## Max. :11.000 Max. : -0.247 Max. :1.0000 Max. :48.50
##
## acousticness instrumentalness liveness valence
## Min. : 0.00178 Min. : 0.00000 Min. : 2.640 Min. : 4.98
## 1st Qu.: 1.25750 1st Qu.: 0.00000 1st Qu.: 9.262 1st Qu.:32.02
## Median : 7.11500 Median : 0.00356 Median :12.900 Median :46.30
## Mean :20.60655 Mean : 7.27060 Mean :18.849 Mean :48.20
## 3rd Qu.:30.95000 3rd Qu.: 0.28200 3rd Qu.:25.475 3rd Qu.:65.20
## Max. :98.30000 Max. :94.10000 Max. :88.900 Max. :95.70
##
## tempo duration_ms
## Min. : 64.92 Min. : 66786
## 1st Qu.:102.00 1st Qu.:179594
## Median :122.96 Median :210292
## Mean :121.47 Mean :215335
## 3rd Qu.:135.18 3rd Qu.:239600
## Max. :207.97 Max. :425123
##
This is a collection of information on 350 songs which are on Spotify. In this section, we are going to see how to model the popularity of the songs with no predictors. However, we are going to consider the effect of artist. Let’s start by re-ordering the factor artist
to be in terms of decreasing mean popularity.
<- spotify %>%
spotify select(artist, title, popularity) %>%
mutate(artist = forcats::fct_reorder(.f = artist, -popularity, .fun = mean))
The sample of spotify songs consists of 44 artists; let’s do a barplot of the number of songs for the artists.
%>%
spotify count(artist) %>%
ggplot(aes(x = n)) +
geom_bar()
%>%
spotify group_by(artist) %>%
summarize(
mu = mean(popularity),
n = n()
)
## # A tibble: 44 × 3
## artist mu n
## <fct> <dbl> <int>
## 1 Camilo 81 9
## 2 Lil Skies 79.3 3
## 3 Camila Cabello 77.1 38
## 4 J. Cole 76.7 12
## 5 Sean Kingston 73 2
## 6 Frank Ocean 69.8 40
## 7 Beyoncé 69.7 25
## 8 Sufjan Stevens 65.7 3
## 9 León Larregui 65.3 3
## 10 Alok 64.8 19
## # … with 34 more rows
Each artist has at least 2 songs, 3 is the most common number of songs, and there are as many as 40 songs for an artist. We have three choices for how to model this, as was described in the previous chapter.
- Complete pooling. We could ignore all of the artist information and put all of the popularities in one bin.
- No pooling. We could separately analyze each artist, with only the common standard deviation of popularity for each artist.
- Partial pooling We could model the data via a hierarchical model. We assume there is a parent distribution of possible means of popularity that artists could have, and that the artists we have are sampled from that parent distribution. We assume that the distribution of popularities of songs for a given artist depends only on the mean of the artist and a standard deviation which is common to all of the artists.
As you can imagine, the first two options will turn out not to be great choices.
8.1 Complete pooled model
We start by setting some notation. Let \(n_j\) be the number of songs that the \(j\)th artist has in the list. Let \(Y_{ij}\) denote the popularity of the \(i\)th song for the \(j\)th artist. So, the list of all samples would look something like
\[ \left((Y_{11}, Y_{21}, \ldots, Y_{n_1 1}), (Y_{12}, \ldots, Y_{n_2 2}) \right) \]
We could look at a density plot of the pooled data, which would give us one model of the outcome.
ggplot(spotify, aes(x = popularity)) +
geom_density()
Formally, we could create the following Bayesian model, where the second and third lines are the priors which are obtained via prior_summary
in the R chunk below.
- \(Y|_{\mu, \sigma} \sim N(\mu, \sigma)\)
- \(\mu \sim N(50, \mathrm{sd} = 52)\)
- \(\sigma ~ \mathrm{Exp}(0.048)\).
library(rstanarm)
<- stan_glm(popularity ~ 1,
pooled_model data = spotify,
prior_intercept = normal(50, 2.5, autoscale = T),
prior_aux = exponential(1, autoscale = T),
refresh = FALSE
)prior_summary(pooled_model)
## Priors for model 'pooled_model'
## ------
## Intercept (after predictors centered)
## Specified prior:
## ~ normal(location = 50, scale = 2.5)
## Adjusted prior:
## ~ normal(location = 50, scale = 52)
##
## Auxiliary (sigma)
## Specified prior:
## ~ exponential(rate = 1)
## Adjusted prior:
## ~ exponential(rate = 0.048)
## ------
## See help('prior_summary.stanreg') for more details
posterior_vs_prior(pooled_model, group_by_parameter = T, facet_args = list(scales = "free_y"))
::tidy(pooled_model,
broom.mixedconf.int = T,
effects = c("fixed", "aux")
)
## # A tibble: 3 × 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 58.4 1.14 56.6 60.2
## 2 sigma 20.6 0.800 19.4 22.0
## 3 mean_PPD 58.4 1.58 55.8 61.0
Based on this model, the popularity of a song is normally distributed with mean about 58.4 and standard deviation about 20.7. Note that this would be true whichever artist we have a song from. This may have been reasonable if we didn’t know anything more about the songs, but is unreasonable given that we know the artist and much more information about the songs.
<- spotify %>%
artist_means group_by(artist) %>%
summarize(
popularity = mean(popularity),
n = n()
)
<- posterior_predict(pooled_model, newdata = artist_means)
predictions_complete
::ppc_intervals(artist_means$popularity, yrep = predictions_complete) +
bayesplotscale_x_continuous(
labels = artist_means$artist,
breaks = 1:nrow(artist_means)
+
) coord_flip()
8.2 No pooled model
In this model, we create a separate mean for each artist in our sample. Notationally, we let \(\mu_j\) be the mean popularity of artist \(j\). Here is a plot that sort of illustrates it, but we are not assuming in the below plot that we are estimating the density via a normal distribution with different means and common standard deviation.
ggplot(spotify, aes(x = popularity, color = artist)) +
geom_density() +
::gghighlight(artist == "Lil Skies" | artist == "Frank Ocean" | artist == "Mia X") gghighlight
Our model is
- \(Y|_{\mu_j, \sigma} \sim N(\mu_j, \sigma)\)
- \(\mu_j\) is the mean song popularity for artist \(j\)
- \(\sigma\) is the standard deviation in popularity from song to song, which does not depend on the artist.
A plot of densities which would follow our model would look more like this, if we assume that the standard deviation is 10 (which we would have to estimate, or better yet get draws from).
<- lapply(1:44, function(i) {
funcs geom_function(fun = function(x) dnorm(x, mean = artist_means$popularity[i], 10))
})
ggplot(artist_means, aes(x = popularity)) +
+
funcs scale_x_continuous(limits = c(0, 100))
Looks kind of redonkulonk, doesn’t it? That can’t be right. Probably not even useful. But, let’s see how we would have done it anyway.
<- stan_glm(popularity ~ artist - 1,
no_pooled_model data = spotify,
prior = normal(50, 2.5, autoscale = T),
prior_aux = exponential(1, autoscale = T), iter = 10000,
refresh = FALSE
)
prior_summary(no_pooled_model)
## Priors for model 'no_pooled_model'
## ------
##
## Coefficients
## Specified prior:
## ~ normal(location = [50,50,50,...], scale = [2.5,2.5,2.5,...])
## Adjusted prior:
## ~ normal(location = [50,50,50,...], scale = [325.55,558.97,165.63,...])
##
## Auxiliary (sigma)
## Specified prior:
## ~ exponential(rate = 1)
## Adjusted prior:
## ~ exponential(rate = 0.048)
## ------
## See help('prior_summary.stanreg') for more details
posterior_vs_prior(no_pooled_model)
Our model with prior specification is
- \(Y|_{\mu_j, \sigma} \sim N(\mu_j, \sigma)\)
- \(\mu_j\) is the mean song popularity for artist \(j\)
- \(\sigma\) is the standard deviation in popularity from song to song, which does not depend on the artist.
- \(\mu_j \sim N(50, s_j)\) we allow a different standard deviation for each prior, but have it fixed at 50. We can see the first few standard deviations for the prior are 325.55 (for Camilo), 558.97 (for Lil Skies), etc.
- \(\sigma \sim \mathrm{Exp}(.048)\)
We use the no intercept model (the artist - 1
) for ease of interpretation. Each of the means listed when we use tidy
is just the mean of the artist, for example.
::tidy(no_pooled_model, effects = c("fixed", "aux")) broom.mixed
## # A tibble: 46 × 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 artistCamilo 81.0 4.66
## 2 artistLil Skies 79.4 8.29
## 3 artistCamila Cabello 77.1 2.36
## 4 artistJ. Cole 76.7 3.99
## 5 artistSean Kingston 72.9 9.81
## 6 artistFrank Ocean 69.8 2.21
## 7 artistBeyoncé 69.7 2.82
## 8 artistSufjan Stevens 65.6 8.17
## 9 artistLeón Larregui 65.3 8.13
## 10 artistAlok 64.9 3.18
## # … with 36 more rows
We visualize the estimations from the model as follows (it would also be informative to plot all of the popularities instead of the mean popularities):
<- posterior_predict(
predictions_no
no_pooled_model,newdata = artist_means
)
::ppc_intervals(artist_means$popularity,
bayesplotyrep = predictions_no,
prob_outer = 0.80
+
) ::scale_x_continuous(
ggplot2labels = artist_means$artist,
breaks = 1:nrow(artist_means)
+
) coord_flip()
What do the intervals represent in the above plot? Remember, the outer band is the 80 percent posterior predictive interval for songs written by the artist in question. So, for Camilo, we would get:
::spread_draws(no_pooled_model, `[\\(a-z].*`, regex = T) %>%
tidybayes::clean_names() %>%
janitorselect(artist_camilo, sigma) %>%
mutate(val = rnorm(n(), artist_camilo, sigma)) %>%
summarize(HDInterval::hdi(val, credMass = 0.8))
## # A tibble: 2 × 1
## `HDInterval::hdi(val, credMass = 0.8)`
## <dbl>
## 1 61.4
## 2 99.1
8.3 Partially pooled (hierarchical) model
Recall that \(Y_{ij}\) is the popularity of the \(i\)th song written by artist \(j\). Suppose we observe data consisting of \(n_j\) values of popularity for artist \(j = 1, \ldots, J\). We imagine that the mean popularity of songs for artist \(j\) comes from a normal distribution with mean \(\mu\) and standard deviation \(\sigma_\mu\). We model the popularity of songs on the artist level as normal with mean \(\mu_j\) and common standard deviation \(\sigma_y\). As usual, we would want to find the probability of the parameters given the data \(y_{ij}\), but we can’t do that directly, so we apply Bayes’ rule:
\[ f(\mu, \sigma_\mu, \mu_j, \sigma_y| y_{ij}) \propto f(y_{ij}|\mu, \sigma_\mu, \mu_j, \sigma_y) f(\mu, \sigma_\mu, \mu_j, \sigma_y) \\ = \Pi \mathrm{dnorm}(y_{ij}, \mu_j, \sigma_y) f(\mu_j, \sigma_y|\mu\sigma_\mu) f(\mu, \sigma_\mu)\\ = \Pi \mathrm{dnorm}(y_{ij}, \mu_j, \sigma_y) \Pi \mathrm{dnorm}(\mu_j, \mu, \sigma_\mu) f_1(\mu) f_2(\sigma_\mu) f_3(\sigma_y) \]
where \(f_1\), \(f_2\) and \(f_3\) represent the prior distributions of \(\mu\), \(\sigma_\mu\) and \(\sigma_y\). Note that, as before, we are assuming that once we know \(\mu_j\) and \(\sigma_y\), the distribution of the data does not depend on \(\mu\) and \(\sigma_\mu\).
Let’s see how it looks written a different way.
- \(Y_{ij} \sim N(\mu_j, \sigma_y)\)
- \(\mu_j \sim N(\mu, sigma)\)
- \(\mu\) has a normal prior.
- \(\sigma_\mu\) and \(\sigma_y\) have exponential priors.
We can interpret the parameters of the model as follows:
- \(\mu_j\) is the mean popularity of artist \(j\)
- \(\sigma_y\) is the within artist variability of popularity, common among all artists.
- \(\mu\) is the mean song popularity across all artists; that is, “the mean rating for the most average artist,” as described in the book.
- \(\sigma_\mu\) is the between-group variability. This describes the variation of \(\mu_j\) from artist to artist.
8.3.1 Equivalent Model
We could also have modeled the artist means in terms of their deviations from the overall mean \(\mu\). Let \(b_j\) be the difference between the artist mean \(\mu_j\) and the overall mean \(\mu\), so that \(\mu + b_j = \mu_j\). We would then have the following model:
- \(Y_{ij} \sim N(\mu + b_j, \sigma_y)\)
- \(b_j \sim N(0, \sigma_\mu)\)
- \(\mu\) has a normal prior.
- \(\sigma_\mu\) and \(\sigma_y\) have exponential priors.
8.3.2 Within vs between group variability
The variance in \(Y\), the popularity of songs, breaks down into two components. The within group component, which we denoted as \(\sigma_y^2\) above, and the between groups component, which we denoted as \(\sigma_\mu^2\) above. It is not obviously true, but it is the case that \(\mathrm{Var}(Y) = \sigma_y^2 + \sigma_\mu^2\). That is, the total variance of popularity is the sum of the within group variance and the between group variance.
We can also consider the percentage of the total variance explained by within group variance: \(\frac{\sigma_y^2}{\sigma_y^2 + \sigma_\mu^2}\), as well as the percentage explained by the between group variance. These percentages are numbers between 0 and 1, and we will consider them further in the sequel.
8.4 Posterior Analysis
We start by seeing how to simulate draws from the posterior distribution of the parameters.
<- stan_glmer(popularity ~ (1 | artist),
spotify_hier data = spotify,
family = gaussian(),
prior_intercept = normal(50, 2.5, autoscale = T),
prior_aux = exponential(1, autoscale = T)
)
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.000123 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.23 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.731146 seconds (Warm-up)
## Chain 1: 0.604363 seconds (Sampling)
## Chain 1: 1.33551 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 2.5e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.728449 seconds (Warm-up)
## Chain 2: 0.414264 seconds (Sampling)
## Chain 2: 1.14271 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 2.1e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.21 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.729909 seconds (Warm-up)
## Chain 3: 0.579332 seconds (Sampling)
## Chain 3: 1.30924 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 2.2e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.661534 seconds (Warm-up)
## Chain 4: 0.528148 seconds (Sampling)
## Chain 4: 1.18968 seconds (Total)
## Chain 4:
We can see that we set the prior for \(\mu\) to be normal with mean 50, and the prior for \(\sigma_\mu\) to be exponential. We are using the second version of the model, so we don’t need to specify the means of the priors for the \(\mu_j\); the means of the priors of the \(b_j\) will just be 0. We did not set the prior for the within-group variation, choosing to let rstanarm
do that automatically for us.
prior_summary(spotify_hier)
## Priors for model 'spotify_hier'
## ------
## Intercept (after predictors centered)
## Specified prior:
## ~ normal(location = 50, scale = 2.5)
## Adjusted prior:
## ~ normal(location = 50, scale = 52)
##
## Auxiliary (sigma)
## Specified prior:
## ~ exponential(rate = 1)
## Adjusted prior:
## ~ exponential(rate = 0.048)
##
## Covariance
## ~ decov(reg. = 1, conc. = 1, shape = 1, scale = 1)
## ------
## See help('prior_summary.stanreg') for more details
And this is why; the specification for the within-group variation is a bit strange, and will be discussed later when it makes more sense. Currently, we just assume that this is a reasonable weakly informative prior.
Let’s check some convergence diagnostics for one of the \(b_j\) and
library(bayesplot)
::get_variables(spotify_hier) tidybayes
## [1] "(Intercept)"
## [2] "b[(Intercept) artist:Camilo]"
## [3] "b[(Intercept) artist:Lil_Skies]"
## [4] "b[(Intercept) artist:Camila_Cabello]"
## [5] "b[(Intercept) artist:J._Cole]"
## [6] "b[(Intercept) artist:Sean_Kingston]"
## [7] "b[(Intercept) artist:Frank_Ocean]"
## [8] "b[(Intercept) artist:Beyoncé]"
## [9] "b[(Intercept) artist:Sufjan_Stevens]"
## [10] "b[(Intercept) artist:León_Larregui]"
## [11] "b[(Intercept) artist:Alok]"
## [12] "b[(Intercept) artist:Vampire_Weekend]"
## [13] "b[(Intercept) artist:The_Blaze]"
## [14] "b[(Intercept) artist:Missy_Elliott]"
## [15] "b[(Intercept) artist:Au/Ra]"
## [16] "b[(Intercept) artist:Hinder]"
## [17] "b[(Intercept) artist:X_Ambassadors]"
## [18] "b[(Intercept) artist:Kendrick_Lamar]"
## [19] "b[(Intercept) artist:Mike_WiLL_Made-It]"
## [20] "b[(Intercept) artist:Placebo]"
## [21] "b[(Intercept) artist:The_xx]"
## [22] "b[(Intercept) artist:The_Wrecks]"
## [23] "b[(Intercept) artist:Michael_Kiwanuka]"
## [24] "b[(Intercept) artist:NODE]"
## [25] "b[(Intercept) artist:David_Lee_Roth]"
## [26] "b[(Intercept) artist:Zeds_Dead]"
## [27] "b[(Intercept) artist:C-Kan]"
## [28] "b[(Intercept) artist:Black_Stone_Cherry]"
## [29] "b[(Intercept) artist:House_Of_Pain]"
## [30] "b[(Intercept) artist:Elisa]"
## [31] "b[(Intercept) artist:Jazzinuf]"
## [32] "b[(Intercept) artist:Atlas_Genius]"
## [33] "b[(Intercept) artist:MTK]"
## [34] "b[(Intercept) artist:BUNT.]"
## [35] "b[(Intercept) artist:Kid_Frost]"
## [36] "b[(Intercept) artist:Tamar_Braxton]"
## [37] "b[(Intercept) artist:TV_Noise]"
## [38] "b[(Intercept) artist:Jean_Juan]"
## [39] "b[(Intercept) artist:DA_Image]"
## [40] "b[(Intercept) artist:Freestyle]"
## [41] "b[(Intercept) artist:Röyksopp]"
## [42] "b[(Intercept) artist:Honeywagon]"
## [43] "b[(Intercept) artist:Soul&Roll]"
## [44] "b[(Intercept) artist:Chris_Goldarg]"
## [45] "b[(Intercept) artist:Mia_X]"
## [46] "sigma"
## [47] "Sigma[artist:(Intercept),(Intercept)]"
## [48] "accept_stat__"
## [49] "stepsize__"
## [50] "treedepth__"
## [51] "n_leapfrog__"
## [52] "divergent__"
## [53] "energy__"
mcmc_trace(spotify_hier, pars = c(
"b[(Intercept) artist:Alok]",
"(Intercept)",
"sigma",
"Sigma[artist:(Intercept),(Intercept)]"
))
mcmc_acf(spotify_hier, pars = c(
"b[(Intercept) artist:Alok]",
"(Intercept)",
"sigma",
"Sigma[artist:(Intercept),(Intercept)]"
))
mcmc_dens_overlay(spotify_hier, pars = c(
"b[(Intercept) artist:Alok]",
"(Intercept)",
"sigma",
"Sigma[artist:(Intercept),(Intercept)]"
))
neff_ratio(spotify_hier, pars = c(
"b[(Intercept) artist:Alok]",
"(Intercept)",
"sigma",
"Sigma[artist:(Intercept),(Intercept)]"
))
## (Intercept)
## 0.18850
## b[(Intercept) artist:Alok]
## 0.34075
## sigma
## 1.01350
## Sigma[artist:(Intercept),(Intercept)]
## 0.21350
rhat(spotify_hier, pars = c(
"b[(Intercept) artist:Alok]",
"(Intercept)",
"sigma",
"Sigma[artist:(Intercept),(Intercept)]"
))
## (Intercept)
## 1.0018017
## b[(Intercept) artist:Alok]
## 1.0015923
## sigma
## 0.9995779
## Sigma[artist:(Intercept),(Intercept)]
## 1.0028074
Let’s check the posterior predictive density versus the observed density of the popularitoes. It’s not terrible; could be better, but it definitely follows more or less the right shape.
pp_check(spotify_hier)
Now, let’s return to the parameters that we saw from get_Variables
. That was quite the list!!! Let’s go through them to understand what they are refering to.
- \(\mu\) =
(Intercept)
- standard deviation within group = \(\sigma_y\) =
sigma
- variance between groups = \(\sigma_\mu^2\) =
Sigma[artist:(Intercept),(Intercept)]
We can get the estimates, standard errors and confindence intervals for these using tidy
:
::tidy(spotify_hier,
broom.mixedeffects = c("fixed", "ran_par"),
conf.int = TRUE,
conf.level = .8
)
## # A tibble: 3 × 6
## term estimate std.error conf.low conf.high group
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 52.4 2.51 49.3 55.5 <NA>
## 2 sd_(Intercept).artist 15.2 NA NA NA artist
## 3 sd_Observation.Residual 14.0 NA NA NA Residual
We see that our estimate for the mean of artist means is 52.5, and we think it is somewhere between 49.4 and 55.8. Note that this is similar to what t.test
gives when we find the 80 percent confidence interval for the mean of the artist means. It is not similar to what t.test
gives when we supply the popularity of all songs.
%>%
spotify group_by(artist) %>%
summarize(mu = mean(popularity)) %>%
with(t.test(mu, conf.level = .8))
##
## One Sample t-test
##
## data: mu
## t = 21.764, df = 43, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 80 percent confidence interval:
## 49.00667 55.24105
## sample estimates:
## mean of x
## 52.12386
t.test(spotify$popularity, conf.level = .8) # this isn't it
##
## One Sample t-test
##
## data: spotify$popularity
## t = 52.935, df = 349, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 80 percent confidence interval:
## 56.98624 59.81948
## sample estimates:
## mean of x
## 58.40286
Finally, looking at the standard deviation estimates, we see that \(\sigma_y \approx 14\), which means we expect the popularity of songs within the same artist to vary by about plus or minus 28. The standard deviation between groups is a little bit higher, 15.1. This means that \(\frac{15.1^2}{15.1^2 + 14^2} = 54\%\) of the variance in popularity is explained by the between groups variance, while 46 percent of the variance in popularity is explained by the within-artist variance.
8.4.1 Two examples of posterior prediction
Find an 80 percent posterior prediction interval for a song by Beyonce.
The uncertainty in the popularity of a new song by Beyonce comes from two sources. First, the uncertainty in our estimates for \(\mu_j\) and \(\sigma_y\), and second, from the range of popularities in the songs themselves.
Here’s how we can do it.
library(tidybayes)
spread_draws(spotify_hier, `[\\(a-zA-Z].*`, regex = T) %>%
::clean_names() %>%
janitorselect(matches("Sigm|sig|^int|beyonc")) %>%
mutate(pop = rnorm(n(), intercept + b_intercept_artist_beyonce, sigma)) %>%
summarize(HDInterval::hdi(pop, credMass = 0.8))
## # A tibble: 2 × 1
## `HDInterval::hdi(pop, credMass = 0.8)`
## <dbl>
## 1 51.9
## 2 88.8
Find an 80 percent posterior prediction interval for a song by Early B.
This time, there is uncertainty from three sources:
- The uncertainty in our estimates of the parameters.
- The variation of popularity within an artist.
- The variation of popularity of songs between artists.
spread_draws(spotify_hier, `[\\(a-zA-Z].*`, regex = T) %>%
::clean_names() %>%
janitorselect(matches("Sigm|sig|^int")) %>%
mutate(sigma_artist = sqrt(sigma_artist_intercept_intercept)) %>%
mutate(
artist_mean = rnorm(n(), intercept, sigma_artist),
pop = rnorm(n(), artist_mean, sigma)
%>%
) summarize(HDInterval::hdi(pop, credMass = 0.8))
## # A tibble: 2 × 1
## `HDInterval::hdi(pop, credMass = 0.8)`
## <dbl>
## 1 24.4
## 2 77.7
8.5 Shrinkage
Shrinkage describes the phenomenon in a hierarchical model, the parameter estimates tend to be closer to each other than they would be in a non-pooled model. To see that, let’s compare the distribution of artist means in the hierarchical model to those from the non-pooled model we did earlier.
get_variables(no_pooled_model)
## [1] "artistCamilo" "artistLil Skies"
## [3] "artistCamila Cabello" "artistJ. Cole"
## [5] "artistSean Kingston" "artistFrank Ocean"
## [7] "artistBeyoncé" "artistSufjan Stevens"
## [9] "artistLeón Larregui" "artistAlok"
## [11] "artistVampire Weekend" "artistThe Blaze"
## [13] "artistMissy Elliott" "artistAu/Ra"
## [15] "artistHinder" "artistX Ambassadors"
## [17] "artistKendrick Lamar" "artistMike WiLL Made-It"
## [19] "artistPlacebo" "artistThe xx"
## [21] "artistThe Wrecks" "artistMichael Kiwanuka"
## [23] "artistNODE" "artistDavid Lee Roth"
## [25] "artistZeds Dead" "artistC-Kan"
## [27] "artistBlack Stone Cherry" "artistHouse Of Pain"
## [29] "artistElisa" "artistJazzinuf"
## [31] "artistAtlas Genius" "artistMTK"
## [33] "artistBUNT." "artistKid Frost"
## [35] "artistTamar Braxton" "artistTV Noise"
## [37] "artistJean Juan" "artistDA Image"
## [39] "artistFreestyle" "artistRöyksopp"
## [41] "artistHoneywagon" "artistSoul&Roll"
## [43] "artistChris Goldarg" "artistMia X"
## [45] "sigma" "accept_stat__"
## [47] "stepsize__" "treedepth__"
## [49] "n_leapfrog__" "divergent__"
## [51] "energy__"
spread_draws(no_pooled_model, `^artist.*`, regex = T) %>%
summarize(across(matches("^artist"), mean)) %>%
t() %>%
sd()
## [1] 15.88599
get_variables(spotify_hier)
## [1] "(Intercept)"
## [2] "b[(Intercept) artist:Camilo]"
## [3] "b[(Intercept) artist:Lil_Skies]"
## [4] "b[(Intercept) artist:Camila_Cabello]"
## [5] "b[(Intercept) artist:J._Cole]"
## [6] "b[(Intercept) artist:Sean_Kingston]"
## [7] "b[(Intercept) artist:Frank_Ocean]"
## [8] "b[(Intercept) artist:Beyoncé]"
## [9] "b[(Intercept) artist:Sufjan_Stevens]"
## [10] "b[(Intercept) artist:León_Larregui]"
## [11] "b[(Intercept) artist:Alok]"
## [12] "b[(Intercept) artist:Vampire_Weekend]"
## [13] "b[(Intercept) artist:The_Blaze]"
## [14] "b[(Intercept) artist:Missy_Elliott]"
## [15] "b[(Intercept) artist:Au/Ra]"
## [16] "b[(Intercept) artist:Hinder]"
## [17] "b[(Intercept) artist:X_Ambassadors]"
## [18] "b[(Intercept) artist:Kendrick_Lamar]"
## [19] "b[(Intercept) artist:Mike_WiLL_Made-It]"
## [20] "b[(Intercept) artist:Placebo]"
## [21] "b[(Intercept) artist:The_xx]"
## [22] "b[(Intercept) artist:The_Wrecks]"
## [23] "b[(Intercept) artist:Michael_Kiwanuka]"
## [24] "b[(Intercept) artist:NODE]"
## [25] "b[(Intercept) artist:David_Lee_Roth]"
## [26] "b[(Intercept) artist:Zeds_Dead]"
## [27] "b[(Intercept) artist:C-Kan]"
## [28] "b[(Intercept) artist:Black_Stone_Cherry]"
## [29] "b[(Intercept) artist:House_Of_Pain]"
## [30] "b[(Intercept) artist:Elisa]"
## [31] "b[(Intercept) artist:Jazzinuf]"
## [32] "b[(Intercept) artist:Atlas_Genius]"
## [33] "b[(Intercept) artist:MTK]"
## [34] "b[(Intercept) artist:BUNT.]"
## [35] "b[(Intercept) artist:Kid_Frost]"
## [36] "b[(Intercept) artist:Tamar_Braxton]"
## [37] "b[(Intercept) artist:TV_Noise]"
## [38] "b[(Intercept) artist:Jean_Juan]"
## [39] "b[(Intercept) artist:DA_Image]"
## [40] "b[(Intercept) artist:Freestyle]"
## [41] "b[(Intercept) artist:Röyksopp]"
## [42] "b[(Intercept) artist:Honeywagon]"
## [43] "b[(Intercept) artist:Soul&Roll]"
## [44] "b[(Intercept) artist:Chris_Goldarg]"
## [45] "b[(Intercept) artist:Mia_X]"
## [46] "sigma"
## [47] "Sigma[artist:(Intercept),(Intercept)]"
## [48] "accept_stat__"
## [49] "stepsize__"
## [50] "treedepth__"
## [51] "n_leapfrog__"
## [52] "divergent__"
## [53] "energy__"
spread_draws(spotify_hier, `^b.*|^\\(.*`, regex = T) %>%
::clean_names() %>%
janitorsummarize(across(matches("^b"), function(x) mean(x + intercept))) %>%
t() %>%
sd()
## [1] 13.62124
We see that the estimated means for the hierachical model have lower standard deviation than the means for the no pooled model. This is because the artist means have been pulled toward the overall mean of artists. Here is another plot we can use to look at it, from the book. We see that the posterior predictive intervals are pulled closer to the group means, so that the artists with low means are pulled up, while those with high means are pulled down.
<- posterior_predict(spotify_hier,
predictions_hierarchical newdata = artist_means
)
ppc_intervals(artist_means$popularity, yrep = predictions_hierarchical) +
scale_x_continuous(
labels = artist_means$artist,
breaks = 1:nrow(artist_means)
+
) coord_flip()