7 Individual Differences in Cognitive Strategies: The Hierarchical Architecture

📍 Where we are in the Bayesian modeling workflow: Chs. 1–5 fitted models to a single agent and validated the inference engine. Now we scale up: real experiments involve many participants who differ from each other. This chapter builds hierarchical (multilevel) models that estimate both individual parameters and the population distribution they are drawn from — and validates the whole pipeline with the same six-phase battery from Ch. 6. The memory model fitted here uses the reversal learning dataset from Ch. 6, which was the first dataset that cleanly separated bias from beta. The fitted model and its LOO score are saved for Ch. 8.

7.1 Introduction

Our exploration of decision-making models has so far focused on single agents or averaged behavior. However, individuals differ systematically in how they approach tasks and process information. Some people may be more risk-averse, have better memory, learn faster, or employ entirely different strategies than others. Ignoring these individual differences can lead to misleading conclusions about cognitive processes. From a statistical perspective, if we assume a single set of parameters generates all data, our model is mathematically misspecified, which leads to pathological posterior geometries and misleading inferential results.

This chapter introduces hierarchical modeling (also called multilevel or mixed effects modeling) as a powerful framework for capturing these individual differences while still identifying population-level patterns.

7.1.1 The Epistemic Challenge: Pooling Information

Traditional statistical approaches to handling data from multiple individuals force a false dichotomy, imposing rigid, unjustified structural assumptions on the data:

Complete Pooling (Underfitting): Treats all participants as identical: it estimates a single set of parameters for the entire group, completely ignoring potentially meaningful individual variation. This assumes the population variance (the average error in predicting an individual from the population) is strictly zero ($\sigma \to 0$).
- The Pathology: Systematically fails to capture cognitive heterogeneity, describing an “average” participant that does not necessarily exist.
No Pooling (Overfitting): Analyzes each participant in complete isolation, estimating independent parameters ($\theta_j$) for each individual. No pooling corresponds to fitting $J$ separate models to $J$ individual participants. This assumes the population variance is infinite ($\sigma \to \infty$).
- The Pathology: Severely inefficient and vulnerable to noise. For individuals with sparse or noisy data, the posterior estimate will be excessively wide or completely dominated by the random fluctuations of their specific likelihood.

Why does this dichotomy matter in practice? With 120 trials per agent, complete pooling produces one parameter estimate that represents nobody in particular. No pooling produces 50 estimates, many of which are dominated by noise for agents with only 20–30 trials. Partial pooling is not a philosophical compromise — it produces estimates that are demonstrably closer to the truth for low-data individuals, as the shrinkage plot at the end of this chapter will show directly.

7.1.2 The Solution: Partial Pooling (Multilevel or Hierarchical Modeling)

Hierarchical modeling offers a principled, data-driven middle ground. Instead of forcing $\sigma$ to be $0$ or $\infty$, we treat the population variance as a free parameter estimated from the data. The model is structured hierarchically:

Individual Level: Each individual agent $j$ has a specific parameter vector (e.g., their intrinsic bias $\theta_j$ or memory decay $\beta_j$).
Population Level: These individual parameters are explicitly modeled as draws from an overarching continuous population distribution (e.g., a Gaussian characterized by a population mean $\mu$ and standard deviation $\sigma$).
The Bayesian Compromise: The estimation of an individual’s parameter ($\theta_j$) is a precision-weighted combination of their own empirical data (the likelihood) and the population distribution (the hierarchical prior).
Shrinkage: Estimates for individuals with sparse or noisy data are “shrunk” toward the population mean $\mu$. This is a mathematical mechanism that regularizes inference, preventing overfitting and yielding more robust out-of-sample estimates.

Historical Context: Shrinkage, Partial Pooling, and the Recovery of Individual Differences

The hierarchical model is now a standard tool, but the intellectual path to it was surprisingly contentious. Accepting partial pooling required overturning the intuition — deeply held by statisticians trained in classical estimation theory — that combining information from different units is always a form of cheating.

7.1.2.1 Mixed Models in the Fields

The mathematical foundations were laid not in cognitive science but in animal breeding. Henderson (1950) needed to estimate the genetic merit of individual bulls from data on their offspring, spread across farms that differed in management and feeding. The complete-pooling solution (assume all farms are identical) was agronomically wrong; the no-pooling solution (analyse each farm in isolation) wasted the information that herds from the same breed share a common distribution of genetic merit. Henderson’s Best Linear Unbiased Prediction (BLUP) formalised the middle path: individual estimates are shrunk toward the population mean by an amount that depends on how much data each individual contributes. The same equations appear, under different names, as the core of the hierarchical models in this chapter.

7.1.2.2 Stein’s Paradox and the Optimality of Shrinkage

The theoretical justification arrived from a different direction, and it was deeply counterintuitive. James and Stein (1961) proved that the ordinary (no-pooling) estimator of three or more means is inadmissible under squared-error loss: there always exists a shrinkage estimator — one that pulls all estimates toward a common value — with uniformly lower expected loss. This is Stein’s paradox. It implies that, regardless of whether the true means are related to each other, you are statistically better off treating them as exchangeable draws from a common distribution than estimating each independently. The result was controversial because it seemed to say that knowing the speed of light should improve your estimate of the boiling point of water — which is absurd — but the paradox dissolves once you recognise that the admissibility comparison is over the joint vector of estimates, not individual components.

7.1.2.3 Bayesian Formalisation

Lindley and Smith (1972) showed that the James-Stein estimator is a Bayesian posterior mean under the assumption that the true parameters are drawn from a common normal distribution — exactly the hierarchical prior structure used in this chapter. This connected the frequentist inadmissibility result to a principled Bayesian generative story: shrinkage is not a trick but the natural consequence of treating individuals as exchangeable members of a population. Gelman and Rubin (1992) and Gelman et al. (1995) developed the computational infrastructure (via MCMC) that made this formalism applicable to realistic psychological datasets.

7.1.2.4 Into Cognitive Science

The adoption of hierarchical models in cognitive science was slow. For much of the 1990s and early 2000s, individual-differences research either fitted separate models to each participant and then performed a second-stage group analysis — the “two-stage” procedure that discards uncertainty — or fitted a single population model that ignored individual variation entirely. Rouder et al. (2005) and Rouder and Lu (2008) demonstrated that hierarchical models substantially improve parameter recovery in reaction-time and signal-detection contexts: individual estimates derived from partial pooling are closer to the true generating values than either the complete-pooling or no-pooling alternatives, especially for participants who contribute few trials. The shrinkage plot at the end of this chapter makes this advantage visible directly: you can watch the model borrow strength from the population to regularise noisy individuals.

7.1.2.5 Non-Centred Parameterisation and Neal’s Funnel

The one purely computational contribution to this story came from Neal (2003) and was adopted as standard practice by the Stan community via Betancourt and Girolami (2013). When individual parameters are parameterised as $\theta_j = \mu + \sigma \cdot z_j$ with $z_j \sim \mathcal{N}(0,1)$ — the non-centred form — the geometry of the posterior becomes approximately Gaussian and Hamiltonian Monte Carlo can traverse it efficiently. In the centred form $\theta_j \sim \mathcal{N}(\mu, \sigma^2)$, the posterior develops a funnel geometry near $\sigma \approx 0$ that causes HMC to diverge. The non-centred parameterisation is the single most important practical technique in hierarchical Bayesian computation, and the reason the models in this chapter spend time on it.

7.2 Learning Objectives

After completing this chapter, you will be able to:

Explain Partial Pooling: Understand how multilevel modeling balances individual and group-level information through shrinkage.
Distinguish Pooling Approaches: Articulate the differences and trade-offs between complete pooling ($\sigma=0$), no pooling ($\sigma=\infty$), and partial pooling.
Implement Hierarchical Models: Write Stan code for hierarchical models with varying intercepts and/or slopes.
Reparameterize a model: Understand why and how to reparameterize a model and in particular how to use non-centered parameterization to improve sampling efficiency in hierarchical models.
Model Parameter Correlations: Implement and interpret models that estimate correlations between individual-level parameters (via Cholesky-factorized correlation matrices $\mathbf{L}_\Omega$).
Perform Comprehensive Model Checks: Apply a rigorous Bayesian workflow to hierarchical models (Predictive Checks, Prior-Posterior Update Checks, Sensitivity Analysis, Parameter Recovery, Simulation-Based Calibration Checks, etc).
Apply to Cognitive Questions: Use multilevel modeling to investigate individual differences in cognitive strategies (e.g., bias, memory, WSLS).

This chapter has two parts. The first applies the full Bayesian workflow to the Hierarchical Biased Agent — one parameter per person, one population distribution — and uses it to expose and fix the Centered Parameterization’s pathological geometry (Neal’s Funnel). The second applies the same workflow to the Hierarchical Memory Agent, which has two correlated parameters per person, requires a bivariate population distribution, and uses the reversal learning dataset from Ch. 6 that we validated there. The LOO score from the fitted memory model is saved for Ch. 8’s model comparison.

7.2.1 Biased Agent Model (Multilevel)

In this architecture, each agent $j \in \{1, \dots, J\}$ possesses an unobserved, individual-level bias parameter $\theta_j$. This individual parameter is generated by a latent population distribution, parameterized by a global mean $\mu_\theta$ and global standard deviation $\sigma_\theta$. Finally, the empirical choices $h_{ij}$ for trial $i$ are generated conditionally on the individual parameter $\theta_j$.

Mathematically, this translates to the following generative structure on the unconstrained (logit) space:\[h_{ij} \sim \text{Bernoulli}(\text{logit}^{-1}(\theta_{\text{logit}, j}))\]\[\theta_{\text{logit}, j} \sim \text{Normal}(\mu_{\theta}, \sigma_{\theta})\]\[\mu_{\theta} \sim \text{Normal}(0, 1.5)\]\[\sigma_{\theta} \sim \text{Exponential}(1)\]

By mapping this DAG, we have strictly defined our generative assumptions. We are now ready to translate this architecture into Stan code and subject it to Prior Predictive Checks.

Standard Plate Notation (Left) and Kruschke-Style Generative Notation (Right) for the Hierarchical Biased Agent.

Computational Note: The Geometry of Hierarchical Models The mathematical definition provided above—specifically $\theta_{\text{logit}, j} \sim \text{Normal}(\mu_{\theta}, \sigma_{\theta})$—is known as the Centered Parameterization. It is the most direct conceptual translation of our theory. However, Hamiltonian Monte Carlo (NUTS) often struggles with this exact form. As $\sigma_\theta$ approaches zero, the target geometry constrains into a region of extreme curvature known as “Neal’s Funnel.” In the next phase, we will purposefully implement this naive form to diagnose its pathological failures before introducing the coordinate-transform solution (Non-Centered Parameterization).

7.3 Phase 1: Generative Plausibility of the Hierarchical Biased Agent

Before we write the inference algorithm in Stan, we must construct the forward generative model in R. This code must strictly mirror the mathematical dependencies formalized in our DAG.

We model the parameters on the logit (log-odds) scale. As discussed in previous chapters, probabilities are bounded between $0$ and $1$. If we attempt to place a hierarchical Normal distribution directly on the probability space, the probability mass will inevitably spill over the bounds, creating mathematically impossible values (e.g., a probability of $1.2$). The logit transform $\text{logit}(p) = \log(\frac{p}{1-p})$ maps the bounded $[0, 1]$ interval to the unbounded real line $(-\infty, \infty)$. The HMC sampler traverses this unbounded space smoothly, and we use the inverse-logit function (plogis() in R, inv_logit() in Stan) to map the inferences back to probabilities at the likelihood level.

Let us build a unified generative function. In accordance with the rigorous Bayesian workflow, this function is designed to either:

Draw hyperparameters directly from our stated priors (for Prior Predictive Checks).
Accept fixed “ground-truth” hyperparameters (for Parameter Recovery in Phase 2).

Code

#' Simulate Hierarchical Biased Agents
#'
#' @param n_agents Integer, number of agents (J).
#' @param n_trials_min Integer, minimum number of trials per agent.
#' @param n_trials_max Integer, maximum number of trials per agent.
#' @param mu_theta Numeric, population mean on the logit scale. If NULL, draws from prior.
#' @param sigma_theta Numeric, population SD on the logit scale. If NULL, draws from prior.
#' @param seed Integer, for exact computational reproducibility.
#'
#' @return A tidy tibble containing both latent parameters and observed choices.
simulate_hierarchical_biased <- function(n_agents = 100, 
                                         n_trials_min = 20, 
                                         n_trials_max = 200, 
                                         mu_theta = NULL, 
                                         sigma_theta = NULL,
                                         seed = NULL) {
  
  if (!is.null(seed)) set.seed(seed)
  
  # 1. Sample Population Hyperparameters (if not provided)
  if (is.null(mu_theta)) mu_theta <- rnorm(1, mean = 0, sd = 1.5)
  if (is.null(sigma_theta)) sigma_theta <- rexp(1, rate = 1)
  
  # 2. Sample Individual Latent Parameters and Trial Counts
  # We purposefully vary N_j to demonstrate shrinkage later.
  agent_params <- tibble(
    agent_id = 1:n_agents,
    n_trials = sample(n_trials_min:n_trials_max, n_agents, replace = TRUE),
    theta_logit = rnorm(n_agents, mean = mu_theta, sd = sigma_theta),
    theta_prob = plogis(theta_logit) 
  )
  
  # 3. Generate Observed Data (Likelihood)
  # h_ij ~ Bernoulli(inv_logit(theta_logit_j))
  sim_data <- agent_params |> 
    mutate(
      trial_data = map2(theta_prob, n_trials, function(prob, n) {
        tibble(
          trial = 1:n,
          choice = rbinom(n, size = 1, prob = prob)
        )
      })
    ) |> 
    unnest(trial_data)
  
  attr(sim_data, "hyperparameters") <- list(
    mu_theta = mu_theta,
    sigma_theta = sigma_theta
  )
  
  return(sim_data)
}

7.3.1 Executing the Prior Predictive Check: The Logit-Normal Artifact

When placing priors on an unbounded scale (logit) that will be transformed to a bounded scale (probability), our intuition often fails. A seemingly “uninformative” prior like $\mu \sim \text{Normal}(0, 10)$ on the logit scale actually forces almost all probability mass to exactly $0$ or $1$ on the outcome scale, implying that all agents are entirely deterministic.

Let us generate 12 datasets drawn entirely from our chosen priors ($\mu_\theta \sim N(0, 1.5), \sigma_\theta \sim \text{Exp}(1)$) to verify they generate scientifically plausible, heterogeneous populations.

Code

# Generate 12 distinct "universes" from the priors
prior_sims <- map_dfr(1:12, function(sim_id) {
  # We use a fixed seed per universe for stable workflow mapping
  sim_data <- simulate_hierarchical_biased(
    n_agents = 100, 
    n_trials_min = 1, n_trials_max = 1, # Only need 1 trial to extract parameters
    seed = sim_id * 10 
  )
  
  sim_data |> 
    distinct(agent_id, theta_prob) |> 
    mutate(simulation_id = paste("Prior Universe", sim_id))
})

# Visualize the generative plausibility
ggplot(prior_sims, aes(x = theta_prob)) +
  geom_density(fill = "steelblue", alpha = 0.5, color = NA) +
  facet_wrap(~simulation_id, ncol = 4) +
  scale_x_continuous(limits = c(0, 1), breaks = c(0, 0.5, 1)) +
  labs(
    title = "Prior Predictive Check: Implied Population Distributions",
    subtitle = "Healthy = populations spread across (0,1), not piled at 0 or 1. Each panel is one prior universe.",
    x = "Agent Bias (Probability Scale)",
    y = "Density"
  )

7.4 Phase 2: Data Preparation and the Centered Stan Model

Before we unleash the Hamiltonian Monte Carlo (HMC) engine, we must format our simulated data structure to match the static typing of the Stan compiler. Because our agents have varying numbers of trials ($N_j$), we cannot pass a simple $J \times N$ matrix. Instead, we use a “long format” array indexing strategy, where a specific vector maps every trial $n \in \{1, \dots, N_{\text{total}}\}$ to its generating agent $j$.

Code

#' Prepare Data for Stan
#'
#' Maps a tidy tibble into the strictly typed list required by CmdStanR.
#' 
#' @param sim_data Tidy tibble generated by simulate_hierarchical_biased
#' @return A named list formatted for CmdStanR
prep_stan_data <- function(sim_data) {
  list(
    N_total                  = nrow(sim_data),
    N_subjects               = length(unique(sim_data$agent_id)),
    subj_id                  = sim_data$agent_id,
    y                        = sim_data$choice,
    prior_mu_theta_mu        = 0,
    prior_mu_theta_sigma     = 1.5,
    prior_sigma_theta_lambda = 1,
    run_diagnostics          = 1
  )
}

# Execute the preparation on a specifically generated universe for recovery
sim_data_centered <- simulate_hierarchical_biased(
  n_agents = 50, 
  n_trials_min = 30,  # Intentional sparsity to force shrinkage
  n_trials_max = 120,
  mu_theta = 1.38,# slightly below .8 in prob
  sigma_theta = 0.8,  
  seed = 1981
)
stan_data_centered <- prep_stan_data(sim_data_centered)

7.4.1 Stan Model: Centered Parameterization (CP)

We will now implement the Centered Parameterization (CP) of the hierarchical biased agent. In CP, we explicitly model the individual latent parameters ($\theta_{\text{logit}, j}$) - determining their probability of choosing “right” versus “left” - as being drawn directly from a population distribution with mean thetaM and standard deviation thetaSD.

Mathematically, this translates to the following hierarchical structure, which strictly mirrors our Data Generating Process (DGP) from Phase 1:

Population Level: \[\mu_\theta \sim \text{Normal}(0, 1.5)\] \[\sigma_\theta \sim \text{Exponential}(1)\]
Individual Level: \[\theta_{\text{logit}, j} \sim \text{Normal}(\mu_\theta, \sigma_\theta)\]
Data Level (Likelihood): \[h_{ij} \sim \text{Bernoulli}(\text{logit}^{-1}(\theta_{\text{logit}, j}))\]

This architecture mathematically instantiates partial pooling: the estimation of an individual’s bias balances their specific choice history ($h_{ij}$) with the population’s overarching distributional geometry ($\mu_\theta, \sigma_\theta$).

The Anatomy of a Potential Pathology: Look closely at the individual-level formulation: $\theta_{\text{logit}, j} \sim \text{Normal}(\mu_\theta, \sigma_\theta)$. The permissible geometric space for $\theta_{\text{logit}, j}$ is dynamically dictated by $\sigma_\theta$. If the data suggests the population is highly homogenous, the sampler will push $\sigma_\theta \to 0$. As it does, the parameter space for $\theta_{\text{logit}, j}$ is violently crushed toward $\mu_\theta$, creating a funnel of infinite curvature. We are fitting this model precisely to watch the NUTS sampler fail in this region.

Let us write the Stan code.

Code

stan_biased_ml_cp_code <- "
// Multilevel Biased Agent Model: Centered Parameterization (CP)
data {
  int<lower=1> N_total;                 // Total number of observations
  int<lower=1> N_subjects;              // Total number of agents
  array[N_total] int<lower=1, upper=N_subjects> subj_id; // Agent ID index
  array[N_total] int<lower=0, upper=1> y;     // Observed choices
  real prior_mu_theta_mu;
  real<lower=0> prior_mu_theta_sigma;
  real<lower=0> prior_sigma_theta_lambda;
  int<lower=0, upper=1> run_diagnostics;
}

parameters {
  // Population-level hyperparameters
  real mu_theta;                        // Population mean bias (logit scale)
  real<lower=0> sigma_theta;            // Population SD of bias (logit scale)

  // Individual-level parameters (CENTERED)
  vector[N_subjects] theta_logit;       // Agent-specific biases 
}

model {
  // 1. Population-level hyperpriors
  target += normal_lpdf(mu_theta | prior_mu_theta_mu, prior_mu_theta_sigma);
  target += exponential_lpdf(sigma_theta | prior_sigma_theta_lambda);

  // 2. Hierarchical prior (CENTERED)
  target += normal_lpdf(theta_logit | mu_theta, sigma_theta);

  // 3. Likelihood
  target += bernoulli_logit_lpmf(y | theta_logit[subj_id]); 
}

generated quantities {
  // Transform hyperparameters back to the probability (scale)
  real<lower=0, upper=1> mu_theta_prob = inv_logit(mu_theta);
  vector<lower=0, upper=1>[N_subjects] theta_prob = inv_logit(theta_logit);

  // Prior samples — cheap scalar draws; declared at top level so they appear in draws output
  real mu_theta_prior    = normal_rng(prior_mu_theta_mu, prior_mu_theta_sigma);
  real sigma_theta_prior = exponential_rng(prior_sigma_theta_lambda);

  // Initialize arrays for conditional diagnostics
  vector[N_total] log_lik;
  array[N_total] int y_post_rep;
  array[N_total] int y_prior_rep;

  // Accumulator for joint prior log-density
  real lprior;
  lprior = normal_lpdf(mu_theta | prior_mu_theta_mu, prior_mu_theta_sigma) +
           exponential_lpdf(sigma_theta | prior_sigma_theta_lambda) +
           normal_lpdf(theta_logit | mu_theta, sigma_theta);

  if (run_diagnostics) {
    for (i in 1:N_total) {
      log_lik[i] = bernoulli_logit_lpmf(y[i] | theta_logit[subj_id[i]]);
      y_post_rep[i] = bernoulli_logit_rng(theta_logit[subj_id[i]]);
      real theta_logit_prior_i = normal_rng(mu_theta_prior, sigma_theta_prior);
      y_prior_rep[i] = bernoulli_logit_rng(theta_logit_prior_i);
    }
  }
}
"

# Save
stan_file_biased_ml_cp <- here::here("stan", "ch07_multilevel_biased_cp.stan")
writeLines(stan_biased_ml_cp_code, stan_file_biased_ml_cp)

# Compile the CP model utilizing the CmdStan C++ backend
mod_biased_ml_cp <- cmdstan_model(stan_file_biased_ml_cp)

7.4.2 Fitting the model

This not really necessary yet, but a rigorous workflow demands an explicit initialization strategy, so we have all the pieces in place for more complex models. We will initialize the NUTS algorithm by drawing starting values directly from our generative priors. This ensures the chains begin in scientifically plausible regions of the parameter space, rather than Stan’s arbitrary default $U[-2, 2]$ on the unconstrained space.

Code

fit_cp_path <- here::here("simmodels", "ch07_fit_cp.rds")

init_cp_list <- lapply(1:4, function(i) {
  list(
    mu_theta    = rnorm(1, 0, 1.5),
    sigma_theta = rexp(1, 1),
    theta_logit = rnorm(stan_data_centered$N_subjects, 0, 0.5)
  )
})

if (regenerate_simulations || !file.exists(fit_cp_path)) {
  
  fit_cp <- mod_biased_ml_cp$sample(
    data            = stan_data_centered,
    seed            = 123,
    chains          = 4,
    parallel_chains = 4,
    iter_warmup     = 1000,
    iter_sampling   = 1000,
    init            = init_cp_list, 
    adapt_delta     = 0.85
  )
  
  fit_cp$save_object(fit_cp_path)
  
} else {
  fit_cp <- readRDS(fit_cp_path)
}

7.4.3 Assessing the fitting process

Even if the Stan compiler finishes without throwing explicit warning messages, our workflow demands that we systematically verify the geometry of the posterior and the efficiency of the sampler. Hamiltonian Monte Carlo (HMC) is a physical simulation—it explores the posterior by simulating a particle gliding over a frictionless manifold dictated by the joint log-probability density. We must prove the numerical integration of this physics engine remained stable.

Let us systematically unpack the computational diagnostics of our Centered Parameterization fit.

The Diagnostic Summary: Integration Stability First, we ask cmdstanr for a high-level summary of the NUTS sampler’s performance.

Code

#fit_cp$cmdstan_diagnose()
fit_cp$diagnostic_summary()

$num_divergent
[1] 0 0 0 0

$num_max_treedepth
[1] 0 0 0 0

$ebfmi
[1] 0.9499418 1.0854016 0.9723112 0.9770850

We are looking for strict zeros in num_divergent. A divergent transition occurs when the numerical integration process (the leapfrog steps) fails to accurately map the curvature of the posterior manifold, causing the simulated trajectory to fly off the energy level. Even one divergence compromises the mathematical guarantee of the posterior.

7.4.3.1 Chain Mixing and Stationarity ($\hat{R}$)

Did our independent Markov chains converge to the same stationary distribution? We visualize this using trace plots and quantify it using $\hat{R}$ (R-hat).

Code

# Extract the posterior array
draws_cp <- fit_cp$draws()

# Visual inspection of chain mixing
mcmc_trace(draws_cp, pars = c("mu_theta", "sigma_theta"), np = nuts_params(fit_cp)) +
  labs(
    title = "Trace Plots: Hyperparameters (CP Model)",
    subtitle = "Healthy = chains overlap and fluctuate around a stable band. Divergent red triangles mark sampler failures."
  )

Code

# Quantitative assessment of convergence and efficiency
cp_summary <- draws_cp |> 
  subset_draws(variable = c("mu_theta", "sigma_theta", "theta_logit")) |> 
  summarise_draws(default_convergence_measures())

# We require R-hat < 1.01 for all parameters
cp_summary |> 
  filter(variable %in% c("mu_theta", "sigma_theta", "theta_logit[1]", "theta_logit[2]")) |>
  knitr::kable(digits = 3)

variable	rhat	ess_bulk	ess_tail
mu_theta	1.000	4407.162	2910.797
sigma_theta	1.000	3472.594	3048.045
theta_logit[1]	1.001	4648.021	2779.811
theta_logit[2]	1.001	4760.858	3107.729

The potential scale reduction factor ($\hat{R}$) compares the variance within a single chain to the variance between all chains. If the chains are exploring the same typical set, this ratio converges to 1. Our strict threshold is $\hat{R} < 1.01$.

7.4.3.2 Effective Sample Size (ESS)

Markov chains are inherently autocorrelated. Stan might draw 4000 samples, but we need to know the Effective Sample Size—the equivalent number of completely independent draws.In your summary table, look at ess_bulk (efficiency of the posterior mean estimate) and ess_tail (efficiency of the posterior interval estimates). We strictly require ESS > 400 per parameter. If the geometry is difficult, ESS plummets, indicating the sampler is taking highly correlated, inefficient steps.

7.4.3.3 Energy and E-BFMI

Finally, we evaluate the Energy-Bayesian Fraction of Missing Information (E-BFMI). HMC relies on injecting kinetic energy (momentum) into the system to explore the tails of the target distribution.

Code

# Assess the momentum resampling efficiency
mcmc_nuts_energy(nuts_params(fit_cp)) +
  labs(
    title = "Energy Transitions",
    subtitle = "Marginal and transition energy distributions must align."
  )

If the model geometry is well-behaved, the energy transition distribution (the energy proposed by the momentum resampling) will neatly match the marginal energy distribution (the actual energy of the posterior). An E-BFMI below 0.2 indicates the sampler is taking tiny, localized steps and failing to explore the global geometry.

7.4.4 Prior-Posterior Update Checks: Quantifying Learning

Bayesian inference is the mathematical process of allocating credibility across parameter values. The posterior is simply the prior updated by the likelihood. If the posterior is identical to the prior, the data contained no information. If the posterior is violently pushed against the boundaries of the prior, our priors were likely misspecified.

Let us visualize how much the data narrowed our uncertainty about the population parameters. Because we explicitly generated prior draws in our Stan generated quantities block, this comparison is trivial.

Code

# Extract prior and posterior draws for the population mean and variance
draws_df <- as_draws_df(draws_cp)

# Reshape for ggplot
pp_update_data <- draws_df |> 
  dplyr::select(
    mu_posterior = mu_theta, mu_prior = mu_theta_prior,
    sigma_posterior = sigma_theta, sigma_prior = sigma_theta_prior
  ) |> 
  pivot_longer(everything(), names_to = c("parameter", "distribution"), names_sep = "_")

# Extract the true generative hyperparameters we saved as attributes
true_mu <- attr(sim_data_centered, "hyperparameters")$mu_theta
true_sigma <- attr(sim_data_centered, "hyperparameters")$sigma_theta

# Create a reference dataframe for the ground truth matching the facet names
true_pop_df <- tibble(
  parameter = c("mu", "sigma"),
  true_value = c(true_mu, true_sigma)
)

# Visualize the update
ggplot(pp_update_data, aes(x = value)) +
  geom_histogram(aes(fill = distribution), alpha = 0.6, color = NA) +
  # Add the true values as dashed lines
  geom_vline(data = true_pop_df, aes(xintercept = true_value, linetype = "True Value"), 
             color = "darkred", linewidth = 1) +
  facet_wrap(
    ~parameter, 
    scales = "free",
    # Clean up the facet labels for publication quality
    labeller = as_labeller(c("mu" = "Population Mean (\u03BC)", "sigma" = "Population SD (\u03C3)"))
  ) +
  scale_fill_manual(
    name = "Distribution",
    values = c("prior" = "gray70", "posterior" = "steelblue")
  ) +
  scale_linetype_manual(
    name = "Ground Truth", 
    values = c("True Value" = "dashed")
  ) +
  labs(
    title = "Prior-Posterior Update Checks: Population Hyperparameters",
    subtitle = "Healthy = posterior (blue) narrower than prior (grey) and centred on true value (dashed red).",
    x = "Parameter Value (Logit Scale)",
    y = "Density")

A healthy update shows a posterior that is significantly narrower than the prior (indicating learning) and comfortably contained within the prior’s mass (indicating the prior did not inappropriately censor the likelihood).

7.4.5 Posterior Parameter Correlations

Parameters in hierarchical models are structurally correlated. By plotting the joint posterior geometry of multiple parameters, we can detect pathological collinearity or explicitly visualize the boundary of possible Neal’s Funnel.

We will use a pairs plot to examine the joint distribution of the population mean ($\mu_\theta$), population standard deviation ($\sigma_\theta$), and a single agent’s latent bias ($\theta_{\text{logit}, 1}$).

Code

# Visualize joint posterior geometries and identify structural correlations
mcmc_pairs(
  draws_cp, 
  pars = c("mu_theta", "sigma_theta", "theta_logit[1]"),
  np = nuts_params(fit_cp), # Injects divergence data (red crosses) if any exist
  diag_fun = "dens",
  off_diag_fun = "hex"
)

7.4.6 Posterior Predictive Checks (PPC): Generative Adequacy

Can our fitted model actually generate the reality we observed? If we simulate data from the posterior distribution, does it look like the empirical data? If not, the model is structurally misspecified, regardless of how well the chains mixed.

We will compare the observed choice rate for each agent against the distribution of choice rates predicted by the posterior.

Code

# Safely extract vectors
y_obs <- as.numeric(stan_data_centered$y)
agent_idx <- as.numeric(stan_data_centered$subj_id)

# Use cmdstanr's native extraction to guarantee a strict S x N matrix
y_rep <- fit_cp$draws("y_post_rep", format = "matrix")

# Select a random subset of agents to avoid overcrowded, illegible plots
n_sample <- 12
sampled_agents <- sample(unique(agent_idx), n_sample)

# Create a logical mask mapping which trials belong to the sampled agents
keep_mask <- agent_idx %in% sampled_agents

# Subset the data exactly
y_obs_sub <- y_obs[keep_mask]
agent_idx_sub <- agent_idx[keep_mask]
# For y_rep, we subset the columns (trials)
y_rep_sub <- y_rep[ , keep_mask] 

# Define a custom statistic to explicitly state our cognitive metric 
# (and simultaneously silence the bayesplot warning)
prop_right <- function(x) mean(x)

# Execute the grouped Posterior Predictive Check
ppc_stat_grouped(
  y = y_obs_sub, 
  yrep = y_rep_sub, 
  group = agent_idx_sub, 
  stat = "prop_right"
) +
  labs(
    title = "Posterior Predictive Check: Agent-Level Choice Rates",
    subtitle = paste0("Healthy = dark line (observed) inside histogram (predicted). Showing ", n_sample, " agents."),
    x = "Proportion of 'Right' Choices"
  )

7.4.7 Calibration and Recovery

We have proven the model fits the data and generates plausible predictions. Now, we must prove that the parameters we extract are actually the true parameters that generated the data. Because this is a hierarchical model, we must assess recovery at both the population and individual levels.

7.4.7.1 Individual-Level Recovery

Having validated the top of the hierarchy (in the prior-posterior update check), we move to the individual agents. We plot the posterior estimates against the ground truth. We expect the estimates to align with the diagonal identity line, but crucially, we expect agents with fewer trials to deviate from the raw truth, demonstrating shrinkage toward the estimated population mean.

Code

# 1. Extract the full summary (which defaults to including median, q5, and q95)
recovery_summary <- draws_cp |> 
  posterior::subset_draws(variable = "theta_logit") |> 
  posterior::summarise_draws() |> 
  mutate(agent_id = as.integer(str_extract(variable, "(?<=\\[)\\d+(?=\\])")))

# 2. Extract the true generative parameters directly from the simulated dataset
true_agent_params <- sim_data_centered |> 
  distinct(agent_id, n_trials, theta_logit, theta_prob)

# 3. Join the posterior estimates with the ground truth
recovery_df <- true_agent_params |> 
  left_join(recovery_summary, by = "agent_id")

# 4. Plot Recovery with Shrinkage Dynamics
ggplot(recovery_df, aes(x = theta_logit, y = median)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") +
  geom_errorbar(aes(ymin = q5, ymax = q95), width = 0, alpha = 0.5, color = "steelblue") +
  geom_point(aes(size = n_trials), color = "midnightblue", alpha = 0.8) +
  labs(
    title = "Parameter Recovery: Hierarchical Biased Agent",
    subtitle = "Healthy = points near diagonal (y = x). Small dots shrink more — that is partial pooling working correctly.",
    x = "True Parameter Value (Logit Scale)",
    y = "Posterior Median Estimate (Logit Scale)",
    size = "Number of Trials (N_j)"
  )

7.4.8 Prior sensitivity

Code

# 1. Calculate the sensitivity diagnostics
# We target the hyperparameters, as they govern the hierarchical shrinkage
sens_results <- powerscale_sensitivity(
  fit_cp, # or fit_ncp, provided lprior is defined
  variable = c("mu_theta", "sigma_theta")
)

cat("=== PRIOR SENSITIVITY DIAGNOSTICS ===\n")

=== PRIOR SENSITIVITY DIAGNOSTICS ===

Code

print(sens_results)

Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data

    variable prior likelihood                                diagnosis
    mu_theta 0.165      0.050 potential strong prior / weak likelihood
 sigma_theta 0.354      0.091            potential prior-data conflict

Code

ps_seq <- priorsense::powerscale_sequence(
  fit_cp, 
  variable = c("mu_theta", "sigma_theta")
)

# Plot the density shifts
priorsense::powerscale_plot_dens(ps_seq) +
  theme_classic() +
  labs(
    title = "Prior Sensitivity: Density Perturbation",
    subtitle = "Does scaling the prior/likelihood drastically alter our geometric beliefs?"
  )

Code

# Plot the quantitative estimate shifts
priorsense::powerscale_plot_quantities(ps_seq) +
  theme_classic() +
  labs(
    title = "Prior Sensitivity: Parameter Estimates",
    subtitle = "Horizontal lines indicate the baseline posterior bounds. Deviations indicate sensitivity."
  )

The sensitivity analysis reveals that our population variance ($\sigma_\theta$) exhibits a prior-data conflict. This is a feature, not a bug, of partial pooling: our Exponential prior applies regularizing pressure toward zero variance (complete pooling), while the empirical data resists this pull to account for individual differences.

7.4.9 Simulation-Based Calibration Checks (SBC)

Parameter recovery on a single dataset is anecdotal. A single success might just mean we got lucky with a benign region of the parameter space. To mathematically guarantee that our NUTS sampler is unbiased and our posterior uncertainties are perfectly calibrated across the entire prior space, we must perform Simulation-Based Calibration Checks (SBC). The logic of SBC is structurally elegant: if we generate a ground-truth parameter $\theta_{\text{true}}$ from the prior, generate data from the data model conditional on $\theta_{\text{true}}$, and compute the posterior, then the posterior must not systematically over- or under-estimate $\theta_{\text{true}}$. Under a well-calibrated sampler, the rank of the true parameter value within the sorted posterior draws will be uniformly distributed:\[r \sim \text{Uniform}(0, S)\]where $S$ is the number of posterior samples. If the model is overconfident (the posterior is too narrow and misses the truth), the rank histogram will be U-shaped. If it is underconfident (the posterior is too wide), it will form an inverted U-shape. Asymmetries indicate directional bias.

7.4.10 Setting up the SBC Pipeline

The SBC package requires a strict data structure. It expects a generator function that returns a list of two components:

variables: A named list of the true parameter values.
generated: The formatted data list required by our Stan model.

Let us build a strict wrapper around our simulation and data preparation functions to satisfy this requirement.

Code

# Define the strict SBC generator wrapper
generate_sbc_dataset <- function(n_agents = 50, n_trials_min = 20, n_trials_max = 120) {
  
  # 1. Simulate the universe (draws hyperparams and agent params from priors)
  sim_data <- simulate_hierarchical_biased(
    n_agents = n_agents, 
    n_trials_min = n_trials_min, 
    n_trials_max = n_trials_max
  )
  
  # 2. Extract the true generative hyperparameters
  mu_true <- attr(sim_data, "hyperparameters")$mu_theta
  sigma_true <- attr(sim_data, "hyperparameters")$sigma_theta
  
  # 3. Extract the true individual latent parameters (must preserve agent order)
  theta_true <- sim_data |> 
    distinct(agent_id, theta_logit) |> 
    arrange(agent_id) |> 
    pull(theta_logit)
  
  # 4. Prepare the strictly typed list for CmdStanR
  stan_data <- prep_stan_data(sim_data)
  
  # 5. Return the exact structure required by the SBC package
  list(
    variables = list(
      mu_theta = mu_true,
      sigma_theta = sigma_true,
      theta_logit = theta_true
    ),
    generated = stan_data
  )
}

A production-grade SBC requires $N_{\text{sim}} \ge 1000$ to smooth the rank histograms. For computational feasibility in this chapter, we will run 200 simulations. Because we established our future parallel processing plan in the setup chunk, the SBC package will automatically distribute these models across your CPU cores.

Code

sbc_cp_path <- here::here("simmodels", "ch07_sbc_cp.rds")

generator_cp <- SBC_generator_function(
  generate_sbc_dataset,
  n_agents = 50, n_trials_min = 30, n_trials_max = 120
)

backend_cp <- SBC_backend_cmdstan_sample(
  mod_biased_ml_cp,
  iter_warmup = 1000, iter_sampling = 1000, chains = 1, refresh = 0
)

if (regenerate_simulations || !file.exists(sbc_cp_path)) {
  future::plan(future::multisession, workers = 8)
  sbc_datasets_cp <- generate_datasets(generator_cp, 200)
  sbc_results_cp  <- compute_SBC(datasets = sbc_datasets_cp, backend = backend_cp,
  keep_fits = FALSE)
  saveRDS(
    list(datasets = sbc_datasets_cp, results = sbc_results_cp),
    sbc_cp_path
  )
  
} else {
  sbc_obj        <- readRDS(sbc_cp_path)
  sbc_datasets_cp <- sbc_obj$datasets
  sbc_results_cp  <- sbc_obj$results
}

# Visualize
plot_rank_hist(sbc_results_cp, variables = c("mu_theta", "sigma_theta")) +
  labs(
    title    = "SBC Rank Histograms: Centered Parameterization",
    subtitle = "Healthy = bars flat within grey band. U-shape = overconfident. Here, U-shapes reveal the funnel pathology."
  )

Code

plot_ecdf_diff(sbc_results_cp, variables = c("mu_theta", "sigma_theta")) +
  labs(
    title    = "SBC ECDF Difference: Centered Parameterization",
    subtitle = "Healthy = blue line stays near zero. Excursions here correlate with the divergence failures shown below."
  )

Code

# 1. Extract the lightweight stats dataframe from the SBC object
sbc_stats_df <- sbc_results_cp$stats

# 2. Filter for the population-level parameters you care about
# (e.g., mu_alpha, mu_beta, rho, or even sigma[1])
recovery_df <- sbc_stats_df |>
  filter(variable %in% c("mu_theta"))

# 3. Create the Parameter Recovery Identity Plot
p_sbc_recovery <- ggplot(recovery_df, aes(x = simulated_value, y = median)) +
  # The Identity Line (Perfect Recovery)
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50", linewidth = 1) +
  
  # The Posterior Estimates with 90% Credible Intervals
  geom_errorbar(aes(ymin = q5, ymax = q95), width = 0, alpha = 0.3, color = "steelblue") +
  geom_point(color = "midnightblue", alpha = 0.7, size = 2) +
  
  # Facet by parameter
  facet_wrap(~variable, scales = "free") +
  
  labs(
    title = "Parameter Recovery (Extracted Directly from SBC)",
    subtitle = "Points represent the posterior median. Bars are 90% Credible Intervals.",
    x = "True Generative Value (Drawn from Prior)",
    y = "Posterior Median Estimate"
  ) +
  theme_classic() +
  theme(strip.background = element_rect(fill = "gray95", color = NA))

print(p_sbc_recovery)

Code

# 4. Quantify the Recovery (Correlation and RMSE)
recovery_metrics <- recovery_df |>
  group_by(variable) |>
  summarize(
    Pearson_r = cor(simulated_value, median),
    Spearman_rho = cor(simulated_value, median, method = "spearman"),
    RMSE = sqrt(mean((simulated_value - median)^2)),
    .groups = "drop"
  )

cat("=== PARAMETER RECOVERY QUANTIFICATION ===\n")

=== PARAMETER RECOVERY QUANTIFICATION ===

Code

print(knitr::kable(recovery_metrics, digits = 3))



|variable | Pearson_r| Spearman_rho|  RMSE|
|:--------|---------:|------------:|-----:|
|mu_theta |      0.99|        0.989| 0.212|

Let’s explore the warnings we got.

Code

# Extract diagnostics now so they are available for the targeted refit below
diags_cp <- sbc_results_cp$backend_diagnostics
cp_fail_rate <- mean(diags_cp$n_divergent > 0) * 100

# Divergence count distribution across all 200 universes
diags_cp |> 
  ggplot(aes(x = n_divergent)) +
  geom_histogram(binwidth = 1, fill = "darkred", alpha = 0.8) +
  labs(
    title = "CP Divergence Counts Across 200 SBC Universes",
    subtitle = sprintf("%.0f%% of universes produced at least one divergence. 
Even a single divergence invalidates the geometric guarantee of NUTS.", cp_fail_rate),
    x = "Number of Divergent Transitions",
    y = "Number of SBC Universes"
  )

Code

# Locate the divergence-prone region: do they concentrate at small sigma_theta?
divergence_profile <- posterior::as_draws_df(sbc_datasets_cp$variables) |> 
  as_tibble() |> 
  mutate(sim_id = row_number()) |> 
  left_join(
    diags_cp |> mutate(sim_id = row_number()) |> dplyr::select(sim_id, n_divergent),
    by = "sim_id"
  ) |> 
  mutate(had_divergence = n_divergent > 0)

ggplot(divergence_profile, aes(x = sigma_theta, y = n_divergent, color = had_divergence)) +
  geom_point(alpha = 0.5, size = 1.5) +
  scale_color_manual(
    values = c("FALSE" = "steelblue", "TRUE" = "darkred"),
    labels = c("FALSE" = "Clean", "TRUE" = "Divergent"),
    name = NULL
  ) +
  labs(
    title = "CP Divergences Concentrate at Small True sigma_theta",
    subtitle = "The funnel geometry only manifests when population variance is small.",
    x = expression("True " * sigma[theta]),
    y = "Number of Divergent Transitions"
  )

Code

ggplot(divergence_profile, aes(x = mu_theta, y = sigma_theta, color = had_divergence)) +
  geom_point(alpha = 0.7, size = 2) +
  scale_color_manual(
    name = "Sampler Status",
    values = c("FALSE" = "steelblue", "TRUE" = "darkred"),
    labels = c("Healthy", "Diverged")
  ) +
  labs(
    title = "The Anatomy of Failure: Where does CP break?",
    x = "True Population Mean (mu_theta)",
    y = "True Population Variance (sigma_theta)"
  ) +
  theme_classic()

Code

hard_universe_path <- here::here("simmodels", "ch07_fit_cp_hard.rds")

hard_universe_idx <- divergence_profile |> 
  filter(had_divergence) |> 
  slice_min(sigma_theta, n = 1) |> 
  pull(sim_id)

hard_stan_data   <- sbc_datasets_cp$generated[[hard_universe_idx]]
# Ensure compatibility with updated Stan model if cache is stale
if (!"prior_mu_theta_mu" %in% names(hard_stan_data)) {
  hard_stan_data$prior_mu_theta_mu        <- 0
  hard_stan_data$prior_mu_theta_sigma     <- 1.5
  hard_stan_data$prior_sigma_theta_lambda <- 1
  hard_stan_data$run_diagnostics          <- 1
}
hard_true_sigma  <- divergence_profile$sigma_theta[hard_universe_idx]

cat(sprintf(
  "Selected universe %d: true sigma_theta = %.3f\n",
  hard_universe_idx, hard_true_sigma
))

Selected universe 198: true sigma_theta = 0.016

Code

if (regenerate_simulations || !file.exists(hard_universe_path)) {
  fit_cp_hard <- mod_biased_ml_cp$sample(
    data            = hard_stan_data,
    seed            = 999,
    chains          = 4,
    parallel_chains = 4,
    iter_warmup     = 1000,
    iter_sampling   = 1000,
    init            = init_cp_list,
    adapt_delta     = 0.85, 
    refresh         = 0
  )
  fit_cp_hard$save_object(hard_universe_path)
} else {
  fit_cp_hard <- readRDS(hard_universe_path)
}

# 1. Generate the pairs plot 
p_funnel <- mcmc_pairs(
  fit_cp_hard$draws(),
  pars     = c("mu_theta", "sigma_theta", "theta_logit[1]"),
  np       = nuts_params(fit_cp_hard),
  np_style = pairs_style_np(div_color = "darkred", div_alpha = 0.8, div_size = 2),
  diag_fun     = "dens",
  off_diag_fun = "hex"
)

wrap_elements(p_funnel) + 
  plot_annotation(
    title = sprintf("Neal's Funnel: CP Geometry at True sigma_theta = %.3f", hard_true_sigma),
    subtitle = "Red points are divergent transitions crashing at the funnel neck."
  )

Here we see a funnel. As sigma goes to zero, theta_logit[1] can only have one value: the value of the population (mu_theta). The curvature becomes infinitely steep, and the NUTS sampler fails to explore this region, resulting in divergent transitions (red points). This is the geometric pathology of the Centered Parameterization.

7.4.11 Non-Centered Parameterization (NCP) as the Solution

Why does a coordinate transform fix a geometry problem? Neal’s Funnel is not a flaw in the model — it is a flaw in the space the sampler must navigate. In the CP, individual parameters $\theta_j$ live in a space whose width is controlled by $\sigma_\theta$: as $\sigma_\theta \to 0$ the space collapses. In the NCP, we sample standardized offsets $z_j \sim N(0,1)$ whose geometry is always the same — the funnel disappears because the sampler no longer navigates the narrowing space directly. The parameters $\theta_j$ are reconstructed after sampling, not during it.

We have reached a critical juncture in our Bayesian workflow. We showed computationally (via SBC and diagnostic pairs plots) that the Centered Parameterization (CP) creates a pathological geometry—Neal’s Funnel—that makes it difficult for the NUTS sampler to efficiently traverse the posterior space when data is sparse or population variance is small. When the model is misbehaving, sometimes we need to reconsider our theory and/or our data. Yet, in this case, we can do something to the math underlying the model to topologically transform the posterior space the model is sampling: a coordinate transformation known as the Non-Centered Parameterization (NCP).

7.4.11.1 The Geometry of the Transformation

In the Centered Parameterization, the hierarchical dependency is explicit and direct:\[\theta_{\text{logit}, j} \sim \text{Normal}(\mu_\theta, \sigma_\theta)\] This forces the sampler to navigate a space where the permissible bounds of $\theta$ are dynamically restricted by $\sigma$, creating the funnel. In the Non-Centered Parameterization, we decouple this dependency, in a way that is not dissimilar from rescaling your predictors - a common procedure you probably learned in your statistical course to help model convergence and interpretation.

We introduce an auxiliary, standardized latent variable $z_j$ for each agent. We force the sampler to explore this $z$-space, which is simply a perfectly behaved Normal distribution:\[z_j \sim \text{Normal}(0, 1)\] Then, we deterministically shift and scale $z_j$ to reconstruct our target cognitive parameter:\[\theta_{\text{logit}, j} = \mu_\theta + z_j \times \sigma_\theta\]

Mathematically, this is identical to the CP ($z \sim N(0,1) \implies \mu + z\sigma \sim N(\mu, \sigma)$). However, the computational geometry is entirely transformed and when the sampler traverses it, the problems it meets are likely different.

7.4.12 Stan Model: Multilevel Biased Agent (NCP)

Here is the NCP version of the model. Note the changes in the parameters, transformed parameters, and model blocks compared to the CP version.

Code

stan_biased_ml_ncp_code <- "
// Multilevel Biased Agent Model (Non-Centered Parameterization)
// Estimates individual biases drawn from a population distribution.

data {
  int<lower=1> N_total;                 // Total number of observations
  int<lower=1> N_subjects;              // Number of agents
  array[N_total] int<lower=1, upper=N_subjects> subj_id; // Agent ID
  array[N_total] int<lower=0, upper=1> y;     // Observed choices
  real prior_mu_theta_mu;
  real<lower=0> prior_mu_theta_sigma;
  real<lower=0> prior_sigma_theta_lambda;
  int<lower=0, upper=1> run_diagnostics;
}

parameters {
  // Population-level parameters
  real mu_theta;                     // Population mean bias (logit scale)
  real<lower=0> sigma_theta;         // Population SD of bias (logit scale)

  // Individual-level parameters (standardized deviations, z-scores of the agents)
  vector[N_subjects] z_theta;        // Non-centered individual effects
}

transformed parameters {
  // Deterministic reconstruction of the cognitive parameters
  vector[N_subjects] theta_logit = mu_theta + z_theta * sigma_theta;
}

model {
  // Priors for population-level parameters
  target += normal_lpdf(mu_theta | prior_mu_theta_mu, prior_mu_theta_sigma);
  target += exponential_lpdf(sigma_theta | prior_sigma_theta_lambda);
  
  /// Individual level prior (NON-CENTERED)
  target += std_normal_lpdf(z_theta);

  // Likelihood
  target += bernoulli_logit_lpmf(y | theta_logit[subj_id]);
}

generated quantities {
  // Transform hyperparameters back to the probability (scale)
  real<lower=0, upper=1> mu_theta_prob = inv_logit(mu_theta);
  vector<lower=0, upper=1>[N_subjects] theta_prob = inv_logit(theta_logit);
  
  // Initialize containers for trial-level metrics
  vector[N_total] log_lik;
  array[N_total] int y_post_rep;
  array[N_total] int y_prior_rep;

  // --- Prior Log-Density  ---
  real lprior;
  lprior = normal_lpdf(mu_theta | prior_mu_theta_mu, prior_mu_theta_sigma) + 
           exponential_lpdf(sigma_theta | prior_sigma_theta_lambda) + 
           std_normal_lpdf(z_theta);

  // --- Prior Predictive Checks: Generative Baseline ---
  real mu_theta_prior = normal_rng(prior_mu_theta_mu, prior_mu_theta_sigma);
  real sigma_theta_prior = exponential_rng(prior_sigma_theta_lambda);

  if (run_diagnostics) {
    // 2. Draw individual agent parameters from the non-centered prior
    vector[N_subjects] theta_logit_prior;
    for (j in 1:N_subjects) {
      real z_theta_prior = normal_rng(0, 1); 
      theta_logit_prior[j] = mu_theta_prior + z_theta_prior * sigma_theta_prior;
    }

    // --- Trial-Level Computations ---
    for (i in 1:N_total) {
      log_lik[i] = bernoulli_logit_lpmf(y[i] | theta_logit[subj_id[i]]);
      y_post_rep[i] = bernoulli_logit_rng(theta_logit[subj_id[i]]);
      y_prior_rep[i] = bernoulli_logit_rng(theta_logit_prior[subj_id[i]]);
    }
  }
}"

# Write Stan code to file
stan_file_ncp <- here::here("stan", "ch07_multilevel_biased_ncp.stan")
writeLines(stan_biased_ml_ncp_code, stan_file_ncp)

mod_biased_ml_ncp <- cmdstan_model(stan_file_ncp)

fit_ncp_path <- here::here("simmodels", "ch07_fit_ncp.rds")

# Extract N_subjects safely
  N_subj_val <- stan_data_centered$N_subjects
  
  # The dumb list (4 chains)
  init_ncp_list <- list(
    list(mu_theta = rnorm(1, 0, 1.5), sigma_theta = rexp(1, 1), z_theta = rnorm(N_subj_val, 0, 1)),
    list(mu_theta = rnorm(1, 0, 1.5), sigma_theta = rexp(1, 1), z_theta = rnorm(N_subj_val, 0, 1)),
    list(mu_theta = rnorm(1, 0, 1.5), sigma_theta = rexp(1, 1), z_theta = rnorm(N_subj_val, 0, 1)),
    list(mu_theta = rnorm(1, 0, 1.5), sigma_theta = rexp(1, 1), z_theta = rnorm(N_subj_val, 0, 1))
  )

  if (regenerate_simulations || !file.exists(fit_ncp_path)) {
    fit_ncp <- mod_biased_ml_ncp$sample(
      data            = stan_data_centered,
      seed            = 123,
      chains          = 4,
      parallel_chains = 4,
      iter_warmup     = 1000,
      iter_sampling   = 1000,
      init            = init_ncp_list,
      adapt_delta     = 0.85,
      refresh         = 0
    )
    fit_ncp$save_object(fit_ncp_path)

} else {
  fit_ncp <- readRDS(fit_ncp_path)
}

Why we skip the full diagnostic battery here: The NCP model is not our production model for this dataset — it is a pedagogical fix. We introduced the CP model specifically to watch it fail via SBC, and the NCP to demonstrate the geometric solution. Running the full six-phase battery (prior predictive, posterior predictive, prior-posterior update, sensitivity, recovery, SBC) on both parameterizations would be instructive but redundant for this chapter’s purpose. When you apply the NCP to your own data, run all six phases. Here we focus on the two diagnostics that directly reveal the CP’s failure and the NCP’s fix: sampling efficiency and the SBC rank histograms.

Stan Efficiency Rule 3 — Subject Indexing: subj_id vs. Contiguous Blocks

The models in this chapter index subject membership via an integer array: array[N_total] int subj_id. On each trial row, subj_id[i] says which subject that row belongs to. This scatter-index pattern is simple and works perfectly for models where trial order does not matter — like the biased-agent model here, where the likelihood is bernoulli_logit_lpmf(y | theta[subj_id]).

However, from Chapter 9 onward, all models are path-dependent: the predicted probability on trial $t$ is a function of everything that happened on trials $1, \ldots, t-1$ within the same subject. For those models, we need to process each subject’s trials in strict sequence, which means we must know exactly where each subject’s block of rows starts and ends in the data array.

The contiguous-block pattern achieves this:

# Sort data by subject, then by trial within subject
data_sorted <- data |> arrange(subject_id, trial)
bounds <- data_sorted |>
  mutate(row = row_number()) |>
  group_by(subject_id) |>
  summarise(subj_start = min(row), subj_end = max(row))

data {
  array[N_subjects] int subj_start;  // first row index for each subject
  array[N_subjects] int subj_end;    // last row index for each subject
}
model {
  for (j in 1:N_subjects) {
    for (i in subj_start[j]:subj_end[j]) {
      // process trials in order — history is correct
    }
  }
}

This layout also enables the reduce_sum parallelism introduced in Chapter 12: because each subject’s trials form a self-contained contiguous block, independent subject loops can be dispatched to separate CPU threads without any shared mutable state.

We keep subj_id here for simplicity. You will see subj_start/subj_end throughout Chapters 9–11.

Code

# Timing, ESS, and Pathfinder warm-start comparison for CP vs NCP

# ── 1. Wall-clock and ESS ─────────────────────────────────────────────────────
time_cp  <- fit_cp$time()$total
time_ncp <- fit_ncp$time()$total

ess_cp  <- summarise_draws(fit_cp$draws(c("mu_theta", "sigma_theta")),  "ess_bulk") |>
  rename(ess_cp  = ess_bulk)
ess_ncp <- summarise_draws(fit_ncp$draws(c("mu_theta", "sigma_theta")), "ess_bulk") |>
  rename(ess_ncp = ess_bulk)

efficiency_comparison <- ess_cp |>
  left_join(ess_ncp, by = "variable") |>
  mutate(
    ess_per_sec_cp  = ess_cp  / time_cp,
    ess_per_sec_ncp = ess_ncp / time_ncp
  )

cat("=== COMPUTATIONAL SPEED (WALL-CLOCK) ===\n")

=== COMPUTATIONAL SPEED (WALL-CLOCK) ===

Code

cat("CP  Total Time:", round(time_cp,  2), "seconds\n")

CP  Total Time: 4.22 seconds

Code

cat("NCP Total Time:", round(time_ncp, 2), "seconds\n")

NCP Total Time: 6.23 seconds

Code

cat("Speedup Factor:", round(time_cp / time_ncp, 2), "x\n\n")

Speedup Factor: 0.68 x

Code

cat("=== SAMPLING EFFICIENCY (ESS / SECOND) ===\n")

=== SAMPLING EFFICIENCY (ESS / SECOND) ===

Code

print(knitr::kable(efficiency_comparison, digits = 1))



|variable    | ess_cp| ess_ncp| ess_per_sec_cp| ess_per_sec_ncp|
|:-----------|------:|-------:|--------------:|---------------:|
|mu_theta    | 4407.2|   603.3|         1044.8|            96.8|
|sigma_theta | 3472.6|   600.7|          823.2|            96.4|

Code

# ── 2. Pathfinder warm-start timing for NCP ───────────────────────────────────
fit_ncp_pf_path <- here::here("simmodels", "ch07_fit_ncp_pathfinder_init.rds")

if (regenerate_simulations || !file.exists(fit_ncp_pf_path)) {
  pf_ncp <- tryCatch(
    mod_biased_ml_ncp$pathfinder(data = stan_data_centered, num_paths = 4,
                                  seed = 42, refresh = 0),
    error = function(e) NULL
  )
  init_for_ncp <- if (!is.null(pf_ncp)) pf_ncp else init_ncp_list

  t_pf_start <- proc.time()["elapsed"]
  fit_ncp_pf <- mod_biased_ml_ncp$sample(
    data            = stan_data_centered,
    seed            = 124,
    chains          = 4,
    parallel_chains = 4,
    iter_warmup     = 500,        # shorter warmup thanks to Pathfinder init
    iter_sampling   = 1000,
    init            = init_for_ncp,
    adapt_delta     = 0.85,
    refresh         = 0
  )
  time_pf_total <- (proc.time()["elapsed"] - t_pf_start) +
                   if (!is.null(pf_ncp)) pf_ncp$time()$total else 0
  fit_ncp_pf$save_object(fit_ncp_pf_path)
} else {
  fit_ncp_pf <- readRDS(fit_ncp_pf_path)
  time_pf_total <- fit_ncp_pf$time()$total   # approximate (Pathfinder time not stored)
}

timing_summary <- tibble::tribble(
  ~Method,                       ~`Time (s)`,  ~`Warmup iter`,
  "CP (standard)",                round(time_cp, 1),    1000L,
  "NCP (standard init)",          round(time_ncp, 1),   1000L,
  "NCP (Pathfinder warm-start)",  round(time_pf_total, 1), 500L
)
knitr::kable(timing_summary, caption = "Wall-clock time: CP vs NCP, with and without Pathfinder warm-start")

Wall-clock time: CP vs NCP, with and without Pathfinder warm-start
Method	Time (s)	Warmup iter
CP (standard)	4.2	1000
NCP (standard init)	6.2	1000
NCP (Pathfinder warm-start)	5.0	500

Here we see that the NCP model is sometimes slightly faster, sometimes slightly slower than CP in raw wall-clock time, but always less sampling-efficient (lower ESS/second). This is the trade-off of NCP: it eliminates the pathological geometry that causes SBC failures in CP, at the cost of slightly less efficient sampling in data-rich regimes. The Pathfinder warm-start partially recovers that efficiency by cutting warmup iterations.

Yet, NCP might still end up exploring the space better. Let’s see how it fares on the SBC assessment that was giving issues to CP.

We will generate 200 random universes from the priors, fit the NCP model to each, and plot the rank histograms.

Code

sbc_ncp_path <- here::here("simmodels", "ch07_sbc_ncp.rds")

backend_ncp <- SBC_backend_cmdstan_sample(
  mod_biased_ml_ncp,
  iter_warmup = 1000, iter_sampling = 1000, chains = 1, refresh = 0
)

if (regenerate_simulations || !file.exists(sbc_ncp_path)) {
  future::plan(future::multisession, workers = 4)
  
  # Add keep_fits = FALSE right here
  sbc_results_ncp <- compute_SBC(
    datasets = sbc_datasets_cp, 
    backend = backend_ncp, 
    keep_fits = FALSE
  )
  
  saveRDS(sbc_results_ncp, sbc_ncp_path)
} else {
  sbc_results_ncp <- readRDS(sbc_ncp_path)
}

plot_rank_hist(sbc_results_ncp, variables = c("mu_theta", "sigma_theta")) +
  labs(title = "SBC Rank Histograms: Non-Centered Parameterization")

Code

sbc_stats_df <- sbc_results_cp$stats

# 2. Filter for the population-level parameters you care about
# (e.g., mu_alpha, mu_beta, rho, or even sigma[1])
recovery_df <- sbc_stats_df |>
  filter(variable %in% c("mu_theta", "sigma_theta"))

# 3. Create the Parameter Recovery Identity Plot
p_sbc_recovery <- ggplot(recovery_df, aes(x = simulated_value, y = median)) +
  # The Identity Line (Perfect Recovery)
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50", linewidth = 1) +
  
  # The Posterior Estimates with 90% Credible Intervals
  geom_errorbar(aes(ymin = q5, ymax = q95), width = 0, alpha = 0.3, color = "steelblue") +
  geom_point(color = "midnightblue", alpha = 0.7, size = 2) +
  
  # Facet by parameter
  facet_wrap(~variable, scales = "free") +
  
  labs(
    title = "Parameter Recovery (Extracted Directly from SBC)",
    subtitle = "Points represent the posterior median. Bars are 90% Credible Intervals.",
    x = "True Generative Value (Drawn from Prior)",
    y = "Posterior Median Estimate"
  ) +
  theme_classic() +
  theme(strip.background = element_rect(fill = "gray95", color = NA))

print(p_sbc_recovery)

Code

# 4. Quantify the Recovery (Correlation and RMSE)
recovery_metrics <- recovery_df |>
  group_by(variable) |>
  summarize(
    Pearson_r = cor(simulated_value, median),
    Spearman_rho = cor(simulated_value, median, method = "spearman"),
    RMSE = sqrt(mean((simulated_value - median)^2)),
    .groups = "drop"
  )

cat("=== PARAMETER RECOVERY QUANTIFICATION ===\n")

=== PARAMETER RECOVERY QUANTIFICATION ===

Code

print(knitr::kable(recovery_metrics, digits = 3))



|variable    | Pearson_r| Spearman_rho|  RMSE|
|:-----------|---------:|------------:|-----:|
|mu_theta    |     0.990|        0.989| 0.212|
|sigma_theta |     0.986|        0.982| 0.191|

Code

# Extract diagnostics now so they are available for the targeted refit below
diags_ncp <- sbc_results_ncp$backend_diagnostics
ncp_fail_rate <- mean(diags_ncp$n_divergent > 0) * 100

# Divergence count distribution across all 200 universes
diags_ncp |> 
  ggplot(aes(x = n_divergent)) +
  geom_histogram(binwidth = 1, fill = "darkred", alpha = 0.8) +
  labs(
    title = "NCP Divergence Counts Across 200 SBC Universes",
    subtitle = sprintf("%.0f%% of universes produced at least one divergence. 
Even a single divergence invalidates the geometric guarantee of NUTS.", cp_fail_rate),
    x = "Number of Divergent Transitions",
    y = "Number of SBC Universes"
  )

Code

# Locate the divergence-prone region: do they concentrate at small sigma_theta?
divergence_profile <- posterior::as_draws_df(sbc_datasets_cp$variables) |> 
  as_tibble() |> 
  mutate(sim_id = row_number()) |> 
  left_join(
    diags_ncp |> mutate(sim_id = row_number()) |> dplyr::select(sim_id, n_divergent),
    by = "sim_id"
  ) |> 
  mutate(had_divergence = n_divergent > 0)

ggplot(divergence_profile, aes(x = sigma_theta, y = n_divergent, color = had_divergence)) +
  geom_point(alpha = 0.5, size = 1.5) +
  scale_color_manual(
    values = c("FALSE" = "steelblue", "TRUE" = "darkred"),
    labels = c("FALSE" = "Clean", "TRUE" = "Divergent"),
    name = NULL
  ) +
  labs(
    title = "NCP Divergences Concentrate at Small True sigma_theta",
    subtitle = "The funnel geometry only manifests when population variance is small.",
    x = expression("True " * sigma[theta]),
    y = "Number of Divergent Transitions"
  )

Code

ggplot(divergence_profile, aes(x = mu_theta, y = sigma_theta, color = had_divergence)) +
  geom_point(alpha = 0.7, size = 2) +
  scale_color_manual(
    name = "Sampler Status",
    values = c("FALSE" = "steelblue", "TRUE" = "darkred"),
    labels = c("Healthy", "Diverged")
  ) +
  labs(
    title = "The Anatomy of Failure: Where does NCP break?",
    x = "True Population Mean (mu_theta)",
    y = "True Population Variance (sigma_theta)"
  ) +
  theme_classic()

7.4.12.1 Posterior Simulation-Based Calibration Checks: Probing Local Geometry

Prior SBC tested our sampler across the entire prior space — it asked whether NUTS is unbiased when the true parameters could be anything the prior assigns non-negligible mass to. This is a global guarantee, and it is necessary. But it is not sufficient. A sampler can be globally unbiased yet struggle precisely in the regions that matter most: the posterior neighborhoods that the actual observed data force us into. When the likelihood is highly informative — as it typically is once we have a few hundred trials per agent — the posterior concentrates sharply in a small corner of the prior space. The sampler must now take large numbers of tiny, precise steps in a geometrically restricted region. Prior SBC never directly stresses this regime, because most of its simulated universes place the true parameter somewhere across the full prior, and only a fraction happen to fall in the data-rich posterior neighborhood.

Posterior SBC (PSBC) is a complementary local diagnostic that directly targets this regime. The logic inverts the generative direction. Instead of drawing true parameters from the prior, we draw them from the fitted posterior — a distribution that has already been conditioned on our actual data and therefore concentrates in precisely the region the sampler will need to navigate in practice. Concretely: we sample a single draw $\tilde{\theta}^{(s)}$ from the posterior ($\theta \mid y_{\text{obs}}$), treat it as a ground truth, generate a synthetic dataset $\tilde{y}^{(s)} \sim p(y \mid \tilde{\theta}^{(s)})$ from the data model, fit the model to $\tilde{y}^{(s)}$, and record where $\tilde{\theta}^{(s)}$ falls in the resulting posterior. If the sampler is locally well-calibrated, these ranks are again uniform — but now over the concentrated geometry of the posterior, not the diffuse geometry of the prior. A U-shaped rank histogram here means the sampler is overconfident specifically when the data are informative, which is exactly the failure mode we need to detect. Implementing this requires us to write a factory function: a function that closes over a fitted model object and a data structure, and returns a new generator function that the SBC package can call repeatedly to produce fresh $({\tilde{\theta}}, \tilde{y})$ pairs. If this pattern is unfamiliar, read it as a function that builds generators rather than being one — make_psbc_generator(fit_cp, stan_data_centered) does not itself generate a dataset; it returns a self-contained callable that will do so every time the SBC machinery invokes it.

Code

# 1. Define a robust factory function for the PSBC generator
make_psbc_generator <- function(fit, original_data) {
  # Extract the full posterior array as a flat data frame
  draws_df <- posterior::as_draws_df(fit$draws())
  
  # Pre-compute the exact column names to guarantee flawless extraction
  theta_cols <- paste0("theta_logit[", 1:original_data$N_subjects, "]")
  y_cols <- paste0("y_post_rep[", 1:original_data$N_total, "]")
  
  # The generator function required by the SBC package
  function() {
    # Step A: Randomly sample one specific "state of the world" from the posterior
    idx <- sample(seq_len(nrow(draws_df)), 1)
    
    # Step B: Package the true parameters for this specific simulation
    variables <- list(
      mu_theta = as.numeric(draws_df$mu_theta[idx]),
      sigma_theta = as.numeric(draws_df$sigma_theta[idx]),
      theta_logit = as.numeric(draws_df[idx, theta_cols])
    )
    
    # Step C: Package the dataset generated by this specific state of the world
    generated <- list(
      N_total                  = original_data$N_total,
      N_subjects               = original_data$N_subjects,
      subj_id                  = original_data$subj_id,
      y                        = as.numeric(draws_df[idx, y_cols]),
      prior_mu_theta_mu        = original_data$prior_mu_theta_mu,
      prior_mu_theta_sigma     = original_data$prior_mu_theta_sigma,
      prior_sigma_theta_lambda = original_data$prior_sigma_theta_lambda,
      run_diagnostics          = original_data$run_diagnostics
    )
    
    list(variables = variables, generated = generated)
  }
}

psbc_path <- here::here("simmodels", "ch07_psbc.rds")

if (regenerate_simulations || !file.exists(psbc_path)) {
  
  gen_psbc_cp  <- SBC_generator_function(make_psbc_generator(fit_cp,  stan_data_centered))
  gen_psbc_ncp <- SBC_generator_function(make_psbc_generator(fit_ncp, stan_data_centered))
  
  set.seed(8675309)
  psbc_data_cp  <- generate_datasets(gen_psbc_cp,  100)
  psbc_data_ncp <- generate_datasets(gen_psbc_ncp, 100)
  
  backend_cp_psbc  <- SBC_backend_cmdstan_sample(mod_biased_ml_cp,  iter_warmup = 1000, iter_sampling = 1000, chains = 1, refresh = 0)
  backend_ncp_psbc <- SBC_backend_cmdstan_sample(mod_biased_ml_ncp, iter_warmup = 1000, iter_sampling = 1000, chains = 1, refresh = 0)
  future::plan(future::multisession, workers = 4)
  psbc_res_cp  <- compute_SBC(psbc_data_cp,  backend_cp_psbc)
  psbc_res_ncp <- compute_SBC(psbc_data_ncp, backend_ncp_psbc)
  
  saveRDS(
    list(
      res_cp  = psbc_res_cp,
      res_ncp = psbc_res_ncp
    ),
    psbc_path
  )
  
} else {
  psbc_obj    <- readRDS(psbc_path)
  psbc_res_cp  <- psbc_obj$res_cp
  psbc_res_ncp <- psbc_obj$res_ncp
}

plot_rank_hist(psbc_res_cp,  variables = c("mu_theta", "sigma_theta")) +
  labs(title = "PSBC Rank Histograms: Centered Parameterization")

Code

plot_rank_hist(psbc_res_ncp, variables = c("mu_theta", "sigma_theta")) +
  labs(title = "PSBC Rank Histograms: Non-Centered Parameterization")

Code

# 1. Extract the lightweight stats dataframe from the SBC object
sbc_stats_df <- psbc_res_ncp$stats

# 2. Filter for the population-level parameters you care about
# (e.g., mu_alpha, mu_beta, rho, or even sigma[1])
recovery_df <- sbc_stats_df |>
  filter(variable %in% c("mu_theta", "sigma_theta"))

# 3. Create the Parameter Recovery Identity Plot
p_sbc_recovery <- ggplot(recovery_df, aes(x = simulated_value, y = median)) +
  # The Identity Line (Perfect Recovery)
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50", linewidth = 1) +
  
  # The Posterior Estimates with 90% Credible Intervals
  geom_errorbar(aes(ymin = q5, ymax = q95), width = 0, alpha = 0.3, color = "steelblue") +
  geom_point(color = "midnightblue", alpha = 0.7, size = 2) +
  
  # Facet by parameter
  facet_wrap(~variable, scales = "free") +
  
  labs(
    title = "Parameter Recovery (Extracted Directly from SBC)",
    subtitle = "Points represent the posterior median. Bars are 90% Credible Intervals.",
    x = "True Generative Value (Drawn from Prior)",
    y = "Posterior Median Estimate"
  ) +
  theme_classic() +
  theme(strip.background = element_rect(fill = "gray95", color = NA))

print(p_sbc_recovery)

Code

# 4. Quantify the Recovery (Correlation and RMSE)
recovery_metrics <- recovery_df |>
  group_by(variable) |>
  summarize(
    Pearson_r = cor(simulated_value, median),
    RMSE = sqrt(mean((simulated_value - median)^2)),
    .groups = "drop"
  )

cat("=== PARAMETER RECOVERY QUANTIFICATION ===\n")

=== PARAMETER RECOVERY QUANTIFICATION ===

Code

print(knitr::kable(recovery_metrics, digits = 3))



|variable    | Pearson_r|  RMSE|
|:-----------|---------:|-----:|
|mu_theta    |     0.257| 0.142|
|sigma_theta |     0.358| 0.107|

Code

diags_psbc_cp  <- psbc_res_cp$backend_diagnostics
diags_psbc_ncp <- psbc_res_ncp$backend_diagnostics

psbc_cp_fail_rate  <- mean(diags_psbc_cp$n_divergent  > 0) * 100
psbc_ncp_fail_rate <- mean(diags_psbc_ncp$n_divergent > 0) * 100

cat("=== POSTERIOR SBC (LOCAL) DIVERGENCE RATES ===\n")

=== POSTERIOR SBC (LOCAL) DIVERGENCE RATES ===

Code

cat("CP  Failure Rate:", psbc_cp_fail_rate,  "%\n")

CP  Failure Rate: 0 %

Code

cat("NCP Failure Rate:", psbc_ncp_fail_rate, "%\n")

NCP Failure Rate: 0 %

7.4.13 Comparing CP and NCP: Global Fragility vs. Local Efficiency

The CP did not seem to perform too bad, but it broke down when doing SBC. We therefore reparameterized the model by decoupling the sampling of individual and population level parameters (from CP to NCP). This fundamentally altered how the Hamiltonian Monte Carlo (NUTS) algorithm explores the posterior manifold.

We saw that NCP took slightly longer to run and was less efficient in sampling (< ESS). We then ran extensive SBC both sampling the simulations from the priors (global calibration) and sampling the simulations from the fitted posteriors (local calibration).

By comparing global calibration (Prior SBC) against local calibration (Posterior SBC), we have mathematically mapped the boundary conditions of these parameterizations.

Integration Stability and Global Fragility (Divergences): In CP, sparse data forces the sampler into the infinite curvature of Neal’s Funnel. As proven by our Prior SBC failure rates, CP is globally fragile; it regularly diverged across the broad generative prior space whenever the true population variance ($\sigma_\theta$) was small. NCP, conversely, decouples the parameters. The sampler explores a standard, isotropic Gaussian space ($z \sim \text{Normal}(0, 1)$) where the curvature is constant, completely immunizing the model against Neal’s Funnel and dropping global divergences to zero.
Sampling Efficiency and Local Vulnerability (ESS and Speed): Because the NCP geometry is smooth and well-behaved in data-sparse regimes, the Hamiltonian integrator can take massive, efficient step sizes, causing the Effective Sample Size (ESS) to skyrocket. However, as our Posterior SBC proved, NCP is locally vulnerable to the Data Funnel (or “Reverse Funnel”). In the specific neighborhood of our simulated data, the likelihood was highly informative and tightly constrained the posterior. CP navigated this sharp likelihood peak effortlessly. NCP, however, was forced to blindly explore a broad standard normal prior, only to find that the data violently rejected almost all of it, causing its ESS to plummet.
Stationarity ($\hat{R}$): While both models can achieve $\hat{R} < 1.01$ under ideal local conditions, NCP achieves it reliably across the entire prior space, ensuring all Markov chains mix seamlessly regardless of the underlying data-generating universe.
The Geometric Trade-off: When CP Wins Given the global safety of NCP, it is tempting to conclude that we should always use the Non-Centered Parameterization. This is mathematically incorrect. The choice between CP and NCP is a strict geometric trade-off governed by the density of your data.

** The Data-Poor Regime (Use NCP): When trial counts ($N_j$) are low or population variance ($\sigma_\theta$) is small, the prior dominates the likelihood. This creates Neal’s Funnel in the CP space. NCP thrives here because it decouples the parameters, allowing the sampler to easily and safely explore the prior geometry.

** The Data-Rich Regime (Use CP): Consider a scenario where every agent completes $10,000$ trials. The likelihood becomes massively informative, crushing the uncertainty around $\theta_{\text{logit}, j}$ into a tiny, tight peak.If we use CP, the sampler effortlessly navigates this tight, data-driven likelihood space. If we use NCP, the sampler attempts to explore the broad $z \sim \text{Normal}(0,1)$ prior, only to find that almost the entire space is immediately rejected by the highly restrictive likelihood. This makes NCP highly inefficient, resulting in massive autocorrelation, low ESS, and high computational time.

In other words, there is no universal panacea in cognitive modeling. We do not engage in blind “tinkering” to get our models to fit. Instead, we adhere to the Iterative Bayesian Workflow: we fit the model, deploy a strict diagnostic battery to identify the exact geometric failure (prior-dominated funnels vs. data-dominated funnels), and apply the mathematically appropriate parameterization.

Because typical cognitive tasks (like our Matching Pennies game) involve relatively sparse data ($N_j \approx 100$) and substantial population-level shrinkage, we will utilize the Non-Centered Parameterization as our default architecture as we proceed to construct the more complex Memory and WSLS agents.

7.5 The Epistemic Showdown: Visualizing Shrinkage

We have theorized about the dangers of ignoring population structure (Complete Pooling) and the dangers of ignoring noise in small samples (No Pooling). In a rigorous Bayesian workflow, we do not rely on intuition; we compute the alternatives and visualize the geometric differences.We will quickly define and fit the two extreme models using our strictly typed stan_data_centered. The complete and no-pooling models are included here solely as visual contrasts — they serve as the “wrong answer” endpoints that make the partial pooling solution legible. A full diagnostic battery on these two models would be redundant: complete pooling is trivially identifiable (one parameter, well-conditioned), and no-pooling is structurally identical to J independent Ch. 6 models we already validated. The NCP model, which actually matters, passed its full battery above.

7.5.1 The Complete Pooling Model (Underfitting)

This model assumes all agents are identical clones. The population variance is forced to exactly $0$. We estimate a single, universal bias parameter.

Code

stan_full_pooling_code <- "
data {
  int<lower=1> N_total;
  array[N_total] int<lower=0, upper=1> y;
}
parameters {
  real theta_logit; // A single parameter for the entire universe
}
model {
  target += normal_lpdf(theta_logit | 0, 1.5);
  target += bernoulli_logit_lpmf(y | theta_logit);
}
"
stan_file_full <- here::here("stan", "ch07_biased_full_pooling.stan")
writeLines(stan_full_pooling_code, stan_file_full)
mod_full <- cmdstan_model(stan_file_full)

fit_full_path <- here::here("simmodels", "ch07_fit_full_pool.rds")

if (regenerate_simulations || !file.exists(fit_full_path)) {
  fit_full <- mod_full$sample(
    data            = list(N_total = stan_data_centered$N_total, y = stan_data_centered$y),
    seed            = 123, chains = 4, parallel_chains = 4, refresh = 0
  )
  fit_full$save_object(fit_full_path)
} else {
  fit_full <- readRDS(fit_full_path)
}

full_pool_est <- posterior::summarise_draws(fit_full$draws("theta_logit"), "median")$median

7.5.2 The No Pooling Model (Overfitting)

This model assumes every agent is a completely isolated island. The population variance is implicitly assumed to be $\infty$. We estimate an independent parameter for each agent, equipped only with a static, weakly informative prior.

Code

stan_no_pooling_code <- "
data {
  int<lower=1> N_total;
  int<lower=1> N_subjects;
  array[N_total] int<lower=1, upper=N_subjects> subj_id;
  array[N_total] int<lower=0, upper=1> y;
}
parameters {
  vector[N_subjects] theta_logit; // N_subjects completely independent parameters
}
model {
  // Static prior: No hierarchical learning
  target += normal_lpdf(theta_logit | 0, 1.5);
  target += bernoulli_logit_lpmf(y | theta_logit[subj_id]);
}
"
stan_file_no <- here::here("stan", "ch07_biased_no_pooling.stan")
writeLines(stan_no_pooling_code, stan_file_no)
mod_no <- cmdstan_model(stan_file_no)

# Fit the No Pooling model
fit_no_path <- here::here("simmodels", "ch07_fit_no_pool.rds")

if (regenerate_simulations || !file.exists(fit_no_path)) {
  fit_no <- mod_no$sample(
    data            = stan_data_centered,
    seed            = 123, chains = 4, parallel_chains = 4, refresh = 0
  )
  fit_no$save_object(fit_no_path)
} else {
  fit_no <- readRDS(fit_no_path)
}

no_pool_summary <- posterior::summarise_draws(fit_no$draws("theta_logit"), "median") |> 
  mutate(agent_id = as.integer(str_extract(variable, "(?<=\\[)\\d+(?=\\])"))) |> 
  rename(theta_no_pool = median)

7.5.3 The Geometry of Shrinkage

We now have three distinct sets of inferences: * No Pooling: The raw, isolated estimates (highly vulnerable to the noise of low $N_j$). * Complete Pooling: The rigid, universal average (blind to genuine individual differences). * Partial Pooling: Our hierarchical NCP model estimates (the precision-weighted Bayesian compromise).

To truly evaluate the epistemological value of these models, we must plot them not just against each other, but against the True Generative Parameters (the hidden reality we simulated in R). We will also extract the naive arithmetic mean of the No Pooling estimates to contrast it with the precision-weighted Partial Pooling mean.

Code

# 1. Extract the Partial Pooling Population and Individual Estimates (from the NCP model)
mu_partial_pool_est <- posterior::summarise_draws(fit_ncp$draws("mu_theta"), "median")$median

partial_pool_summary <- posterior::summarise_draws(fit_ncp$draws("theta_logit"), "median") |> 
  mutate(agent_id = as.integer(str_extract(variable, "(?<=\\[)\\d+(?=\\])"))) |> 
  rename(theta_partial_pool = median)

# 2. Calculate the No Pooling Mean (simple arithmetic average of the isolated estimates)
no_pool_mean_est <- mean(no_pool_summary$theta_no_pool)

# 3. Calculate the True Generative Mean (The actual center of our simulated universe)
true_mean_est <- mean(true_agent_params$theta_logit)

# 4. Combine into a single dataframe, including the TRUE generative parameters
comparison_df <- no_pool_summary |> 
  dplyr::select(agent_id, theta_no_pool) |> 
  left_join(
    partial_pool_summary |> dplyr::select(agent_id, theta_partial_pool), 
    by = "agent_id"
  ) |> 
  # Bring in the ground truth
  left_join(
    true_agent_params |> dplyr::select(agent_id, n_trials, theta_true = theta_logit),
    by = "agent_id"
  ) |> 
  # Sort agents by their No Pooling estimate for a cleaner visual curve
  arrange(theta_no_pool) |> 
  mutate(rank = row_number())

# The Ultimate Epistemic Plot
ggplot(comparison_df) +
  # --- POPULATION AVERAGES (Mapped to linetype for a clean legend) ---
  geom_hline(aes(yintercept = no_pool_mean_est, linetype = "No Pooling Mean"), 
             color = "darkorange", linewidth = 0.8) +
  geom_hline(aes(yintercept = full_pool_est, linetype = "Complete Pooling Mean"), 
             color = "darkred", linewidth = 0.8) +
  geom_hline(aes(yintercept = mu_partial_pool_est, linetype = "Partial Pooling Mean (\U03BC)"), 
             color = "steelblue", linewidth = 1.2) +
  geom_hline(aes(yintercept = true_mean_est, linetype = "True Generative Mean"), 
             color = "black", linewidth = 0.8) +
  
  # --- INDIVIDUAL ESTIMATES ---
  # Shrinkage Vectors (from No Pool to Partial Pool)
  geom_segment(aes(x = rank, xend = rank, 
                   y = theta_no_pool, yend = theta_partial_pool), 
               color = "gray65", arrow = arrow(length = unit(0.1, "cm"))) +
  
  # True Generative Parameters (Black Crosses)
  geom_point(aes(x = rank, y = theta_true, color = "True Value"), 
             shape = 4, size = 2, stroke = 1) +
             
  # No Pooling Estimates (Open Circles)
  geom_point(aes(x = rank, y = theta_no_pool, color = "No Pooling", size = n_trials), 
             shape = 1, stroke = 1.2) +
             
  # Partial Pooling Estimates (Solid Points)
  geom_point(aes(x = rank, y = theta_partial_pool, color = "Partial Pooling", size = n_trials), 
             alpha = 0.8) +
  
  # --- SCALES & THEMES ---
  scale_linetype_manual(
    name = "Population Averages",
    values = c(
      "True Generative Mean" = "solid",
      "Partial Pooling Mean (\U03BC)" = "dotted",
      "Complete Pooling Mean" = "dashed",
      "No Pooling Mean" = "dotdash"
    )
  ) +
  scale_color_manual(
    name = "Individual Estimates",
    values = c(
      "True Value" = "black",
      "No Pooling" = "darkorange", 
      "Partial Pooling" = "steelblue"
    )
  ) +
  labs(
    title = "Bayesian Shrinkage: Partial Pooling vs. the Extremes",
    subtitle = "Arrows show shrinkage from No Pooling → Partial Pooling. Longest arrows = fewest trials. Partial pooling lands closest to truth (black solid line).",
    x = "Agents (Sorted by No-Pooling Estimate)",
    y = "Estimated Bias Parameter (Logit Scale)",
    size = "Trial Count (N_j)"
  ) +
  theme(
    axis.text.x = element_blank(), 
    axis.ticks.x = element_blank(),
    legend.position = "right"
  )

Study this plot carefully. It contains the entire epistemological justification for hierarchical modeling, visible across two dimensions: the horizontal population averages and the individual agent estimates.

7.5.4 The Horizontal Lines: The Quartet of Population Means

True Generative Mean (Solid Black): The actual center of our simulated universe. This is the structural reality we are trying to recover.
No Pooling Mean (Dot-dash Orange): The unweighted arithmetic average of the isolated estimates. Because it treats all agents equally, it is highly susceptible to being dragged away from the truth by extreme outliers with very little data.
Complete Pooling Mean (Dashed Red): The unweighted average of all trials. Because it ignores agent boundaries, it allows participants who completed massive numbers of trials to disproportionately pull the center of gravity toward themselves.
Partial Pooling Mean ($\mu_\theta$, Dotted Blue): The precision-weighted compromise. By mathematically down-weighting the noisy, low-$N_j$ agents, it reliably lands closest to the true generative mean.

7.5.5 The Points and Arrows: Information-Weighted Regularization

The Overfitting Extremes: Look at the tails of the graph. The orange open circles (No Pooling) are extreme. If we trusted No Pooling, we would conclude these agents have massive, almost deterministic biases.
The Power of the Prior: The solid blue dots (Partial Pooling) are systematically pulled inward. The hierarchical model “knows” that extremely high or low values are mathematically unlikely given the behavior of the rest of the cohort.
The Calibration of Skepticism: Look at the size of the dots, representing the trial count ($N_j$). The longest arrows (most aggressive shrinkage) almost exclusively belong to the smallest dots. When an agent has sparse data, the posterior relies heavily on the population prior. The largest dots barely move; when an agent has massive amounts of data, the likelihood overwhelms the prior, allowing them to remain safely extreme.

7.6 The Memory Agent: Two Correlated Individual Differences

7.6.1 Introduction

The Biased Agent established the full Bayesian workflow for a single individual-level parameter: simulate, fit, diagnose, recover, calibrate. It also taught us, the hard way, that the Centered Parameterization fails when population variance is small, and that the Non-Centered Parameterization resolves this by decoupling individual and population parameters geometrically. We proceed to the Memory Agent with both of those lessons in hand.

The Memory Agent is structurally richer. Recall from Chapter 3 its decision rule. On each trial $i$, agent $j$ maintains a running average of the opponent’s past choices $m_{ij}$ and uses it as a predictive signal:

\[h_{ij} \sim \text{Bernoulli}(\text{logit}^{-1}(\alpha_j + \beta_j \cdot \text{logit}(m_{ij})))\]

\[m_{ij} = \frac{1}{i-1}\sum_{k=1}^{i-1} o_{kj}\]

There are now two individual-level parameters per agent: $\alpha_j$, the baseline directional bias, and $\beta_j$, the sensitivity to opponent history. They need not vary independently across the population. An agent who has a strong directional tendency might also track the opponent more aggressively — or might not. Fitting two separate univariate hierarchical models forecloses this question before the data can answer it. We model the parameters jointly.

Experimental design: the reversal paradigm. In Ch. 6 we discovered that fitting the memory model to a constant-bias opponent produces a strong positive correlation between bias and beta in the posterior — both parameters explain the same rightward tendency and cannot be told apart. The reversal learning design (opponent plays 80% right for trials 1–120, then 20% right for trials 121–240) breaks this confound: in the second phase, a pure-bias agent continues right while a memory agent flips left, providing the discriminating contrast the two-parameter model needs. We therefore use the reversal schedule saved in Ch. 6 as the shared opponent sequence for all J agents here.

7.6.1.1 The Bivariate Population Distribution

\[\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \sim \mathcal{N}_2\!\left(\boldsymbol{\mu},\, \boldsymbol{\Sigma}\right), \quad \boldsymbol{\Sigma} = \operatorname{diag}(\boldsymbol{\sigma})\, \Omega\, \operatorname{diag}(\boldsymbol{\sigma}), \quad \Omega = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\]

This decomposition separates how much individuals differ on each cognitive dimension ($\sigma_\alpha$, $\sigma_\beta$) from how those dimensions covary ($\rho$). The scalar $\rho \in (-1, 1)$ is the key new inferential target.

7.6.1.2 The LKJ Prior

We need a prior on the correlation matrix $\Omega$. Stan provides the LKJ distribution (Lewandowski, Kurowicka, and Joe, 2009), with a single shape parameter $\eta > 0$:

\[\Omega \sim \text{LKJ}(\eta) \quad \propto \quad (\det \Omega)^{\eta - 1}\]

When $\eta = 1$ the distribution is uniform over valid correlation matrices. When $\eta > 1$ mass concentrates toward the identity, encoding the prior belief that parameters tend to vary independently unless the data argue otherwise. For a $2 \times 2$ matrix, the marginal on $\rho$ reduces to a rescaled Beta: $\rho \sim 2 \cdot \text{Beta}(\eta, \eta) - 1$. With our default $\eta = 2$, the prior is unimodal at zero and weakly regularizes against extreme correlations.

Stan samples the lower-triangular Cholesky factor $\mathbf{L}_\Omega$ (declared as cholesky_factor_corr[2]) under the lkj_corr_cholesky(eta) prior, recovering $\Omega = \mathbf{L}_\Omega \mathbf{L}_\Omega^\top$ in generated quantities. This parameterization is unconstrained on its natural manifold.

7.6.1.3 Non-Centered Parameterization

The scalar NCP from the biased agent ($\theta_j = \mu + z_j \cdot \sigma$) generalizes directly. We introduce a $2 \times J$ matrix of independent standard normals:

\[z_{kj} \sim \mathcal{N}(0, 1) \text{ i.i.d.}, \qquad \begin{pmatrix}\alpha_j \\ \beta_j\end{pmatrix} = \begin{pmatrix}\mu_\alpha \\ \mu_\beta\end{pmatrix} + \operatorname{diag}(\boldsymbol{\sigma})\,\mathbf{L}_\Omega\,\mathbf{z}_j\]

The sampler traverses the flat, isotropic space of $\mathbf{Z}$, whose geometry is entirely independent of $\boldsymbol{\sigma}$ and $\Omega$. Neal’s Funnel is structurally impossible.

Standard Plate Notation (left) and Kruschke-Style Notation (right) for the Hierarchical Memory Agent. The correlation node Omega and the bivariate MVN distinguish this model from the biased agent.

7.6.2 Phase 1: Generative Plausibility of the Memory Agent

Before writing any Stan code, we construct the forward generative model in R. This code must strictly mirror the mathematical hierarchy above.

One design decision deserves explicit justification. The memory signal $m_{ij}$ is a deterministic function of the opponent’s past choices. We could compute it inside Stan’s transformed data block, but this executes a nested loop on every leapfrog step of every chain. Instead, we pre-compute the vector opp_rate_prev in R once and pass it as data. This reduces the memory calculation to a single vectorized pass before fitting begins, at zero loss of model fidelity.

The simulation function must also mirror the NCP Stan model exactly. We draw $\mathbf{z}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_2)$ and apply the Cholesky transform, exactly as the Stan model will. For a $2 \times 2$ correlation matrix with off-diagonal $\rho$, the lower Cholesky factor is:

\[\mathbf{L}_\Omega = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1-\rho^2} \end{pmatrix}\]

We sample $\rho$ from the LKJ(2) marginal: $\rho \sim 2 \cdot \text{Beta}(2, 2) - 1$.

In accordance with the rigorous Bayesian workflow, the simulation function is designed to either:

Draw all hyperparameters directly from the stated priors (for Prior Predictive Checks).
Accept fixed ground-truth hyperparameters (for Parameter Recovery).

Code

#' Simulate Hierarchical Memory Agents
#'
#' @param n_agents Integer, number of agents (J).
#' @param n_trials_min Integer, minimum trials per agent.
#' @param n_trials_max Integer, maximum trials per agent.
#' @param mu_alpha Numeric, population mean baseline bias (logit scale). NULL draws from prior.
#' @param mu_beta Numeric, population mean memory sensitivity. NULL draws from prior.
#' @param sigma_alpha Numeric, population SD of alpha. NULL draws from prior.
#' @param sigma_beta Numeric, population SD of beta. NULL draws from prior.
#' @param rho Numeric, population correlation alpha-beta. NULL draws from LKJ(2).
#' @param opponent_choices Integer vector of length >= max(n_trials_max).
#'   If provided, every agent plays a consecutive prefix of this fixed
#'   sequence (realistic: all participants face the same opponent).
#'   If NULL, generates independent Bernoulli draws using opponent_rate.
#' @param opponent_rate Numeric in (0,1). Used only when opponent_choices
#'   is NULL. Default 0.7.
#' @param seed Integer, for exact computational reproducibility.
#'
#' @return A tidy tibble containing both latent parameters and observed choices.
simulate_hierarchical_memory <- function(
  n_agents      = 50,
  n_trials_min  = 20,
  n_trials_max  = 120,
  mu_alpha      = NULL,
  mu_beta       = NULL,
  sigma_alpha   = NULL,
  sigma_beta    = NULL,
  rho           = NULL,
  opponent_choices = NULL,
  opponent_rate    = 0.7,
  seed          = NULL
) {
  if (!is.null(seed)) set.seed(seed)

  # 1. Sample hyperparameters from priors if not provided
  if (is.null(mu_alpha))    mu_alpha    <- rnorm(1, 0, 1)
  if (is.null(mu_beta))     mu_beta     <- rnorm(1, 0, 1)
  if (is.null(sigma_alpha)) sigma_alpha <- rexp(1, 1)
  if (is.null(sigma_beta))  sigma_beta  <- rexp(1, 1)
  # LKJ(2) marginal on rho for 2x2 = 2*Beta(2,2) - 1
  if (is.null(rho))         rho         <- 2 * rbeta(1, 2, 2) - 1

  # 2. Construct the lower-triangular Cholesky factor L_Omega
  L_Omega <- matrix(c(1, rho, 0, sqrt(1 - rho^2)), nrow = 2, ncol = 2)

  # 3. Sample individual parameters via the NCP transform (mirrors Stan exactly)
  # Z is 2 x J; Scale = diag(sigma) %*% L_Omega
  Z            <- matrix(rnorm(2 * n_agents), nrow = 2)
  Scale        <- diag(c(sigma_alpha, sigma_beta)) %*% L_Omega
  # indiv_params is J x 2: column 1 = alpha_j, column 2 = beta_j
  indiv_params <- t(c(mu_alpha, mu_beta) + Scale %*% Z)

  # 4. Simulate trial-level data
  n_trials_vec <- sample(n_trials_min:n_trials_max, n_agents, replace = TRUE)

  sim_data <- map_dfr(seq_len(n_agents), function(j) {
    n_t     <- n_trials_vec[j]
    alpha_j <- indiv_params[j, 1]
    beta_j  <- indiv_params[j, 2]

    if (!is.null(opponent_choices)) {
      # Shared reversal schedule: agent plays the first n_t steps of the
      # fixed sequence. Agents who leave early see only the first phase.
      stopifnot(length(opponent_choices) >= n_t)
      opp_choices <- opponent_choices[seq_len(n_t)]
    } else {
      opp_choices <- rbinom(n_t, 1, opponent_rate)
    }

    # Pre-compute running opponent rate BEFORE each trial (strict t-1 history).
    # Trial 1 has no history: initialize to 0.5 (uninformative).
    # Clip to [0.01, 0.99] so logit(opp_rate_prev) is always finite.
    opp_rate_prev    <- numeric(n_t)
    opp_rate_prev[1] <- 0.5
    if (n_t > 1) {
      raw_rates         <- cumsum(opp_choices[-n_t]) / seq_len(n_t - 1)
      opp_rate_prev[-1] <- pmax(0.01, pmin(0.99, raw_rates))
    }

    logit_p       <- alpha_j + beta_j * qlogis(opp_rate_prev)
    agent_choices <- rbinom(n_t, 1, plogis(logit_p))

    tibble(
      agent_id      = j,
      trial         = seq_len(n_t),
      n_trials      = n_t,
      choice        = agent_choices,
      opp_rate_prev = opp_rate_prev,
      true_alpha    = alpha_j,
      true_beta     = beta_j
    )
  })

  attr(sim_data, "hyperparameters") <- list(
    mu_alpha    = mu_alpha,    mu_beta     = mu_beta,
    sigma_alpha = sigma_alpha, sigma_beta  = sigma_beta,
    rho         = rho
  )

  sim_data
}

7.6.2.1 Executing the Prior Predictive Check: The Joint Population Cloud

For a bivariate model, the most revealing prior predictive display is a scatter of individual $(\alpha_j, \beta_j)$ pairs across multiple prior universes. This simultaneously reveals how dispersed individual differences are ($\sigma_\alpha$, $\sigma_\beta$), how correlated the two cognitive dimensions tend to be ($\rho$), and whether any prior universe generates implausible populations (e.g., $|\beta_j| > 8$, meaning the memory signal completely overrides choice).

Code

# Generate 12 prior universes
prior_sims_mem <- map_dfr(1:12, function(sim_id) {
  sim <- simulate_hierarchical_memory(
    n_agents = 80, n_trials_min = 1, n_trials_max = 1,
    seed = sim_id * 13
  )
  hp <- attr(sim, "hyperparameters")
  sim |>
    distinct(agent_id, true_alpha, true_beta) |>
    mutate(simulation_id = paste("Prior Universe", sim_id))
})

ggplot(prior_sims_mem, aes(x = true_alpha, y = true_beta)) +
  geom_point(alpha = 0.4, size = 0.9, color = "steelblue") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray60", linewidth = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray60", linewidth = 0.4) +
  facet_wrap(~simulation_id, ncol = 4) +
  labs(
    title    = "Prior Predictive Checks: Implied Joint Population Distributions",
    subtitle = "Populations generated by mu ~ N(0,1), sigma ~ Exp(1), Omega ~ LKJ(2)",
    x        = expression("Baseline bias  " * alpha[j] * "  (logit scale)"),
    y        = expression("Memory sensitivity  " * beta[j])
  )

A healthy prior generates a range of cloud shapes — some circular (low $|\rho|$), some elongated and tilted (high $|\rho|$), some compact (low $\sigma$), some diffuse (high $\sigma$). No prior universe should collapse all agents to a point or place all agents at implausibly extreme values.

7.6.3 Phase 2: Data Preparation and the NCP Stan Model

Having established from the biased agent workflow that NCP is our default architecture for cognitive tasks with sparse per-agent data, we implement it directly. We skip the Centered Parameterization entirely.

Code

#' Prepare Memory Agent Data for Stan
#'
#' @param sim_data Tibble from simulate_hierarchical_memory().
#' @return A named list formatted for CmdStanR.
prep_stan_data_memory <- function(sim_data) {
  list(
    N_total                  = nrow(sim_data),
    N_subjects               = length(unique(sim_data$agent_id)),
    subj_id                  = sim_data$agent_id,
    y                        = sim_data$choice,
    opp_rate_prev            = sim_data$opp_rate_prev,
    prior_mu_alpha_mu        = 0,
    prior_mu_alpha_sigma     = 1.0,
    prior_mu_beta_mu         = 0,
    prior_mu_beta_sigma      = 1.0,
    prior_sigma_lambda       = 1.0,
    prior_Omega_shape        = 2.0,
    run_diagnostics          = 1
  )
}


# --- Load the reversal learning schedule from Ch. 6 ---
# This 240-trial sequence (80% right → reversal → 20% right) was validated
# in Ch. 6 to cleanly identify bias and beta. All agents play the same
# opponent schedule, with variable trial counts (80–240 trials).
reversal_obj      <- readRDS(here("simdata", "ch06_reversal_data.rds"))
opponent_reversal <- reversal_obj$opponent_choices  # 240-element integer vector

cat("Loaded reversal schedule from Ch. 6:",
    length(opponent_reversal), "trials |",
    sum(opponent_reversal[1:120]), "right in phase 1 |",
    sum(opponent_reversal[121:240]), "right in phase 2\n")

Loaded reversal schedule from Ch. 6: 240 trials | 93 right in phase 1 | 23 right in phase 2

Code

# --- Generate hierarchical memory data using the reversal schedule ---
sim_data_memory <- simulate_hierarchical_memory(
  n_agents         = 50,
  n_trials_min     = 80,    # Minimum: agents see at least the first phase
  n_trials_max     = 240,   # Maximum: agents complete both phases
  mu_alpha         =  0.0,  # No systematic directional baseline
  mu_beta          =  1.5,  # Moderate positive memory tracking
  sigma_alpha      =  0.3,
  sigma_beta       =  0.5,
  rho              =  0.3,  # Mild positive correlation
  opponent_choices = opponent_reversal,  # <-- reversal schedule from Ch. 6
  seed             = 1981
)

stan_data_memory <- prep_stan_data_memory(sim_data_memory)

cat(sprintf(
  "Hierarchical memory data: %d agents, %d total trials, %d–%d trials per agent\n",
  stan_data_memory$N_subjects,
  stan_data_memory$N_total,
  min(sim_data_memory$n_trials),
  max(sim_data_memory$n_trials)
))

Hierarchical memory data: 50 agents, 8058 total trials, 84–237 trials per agent

7.6.3.1 Stan Model: Hierarchical Memory Agent (NCP)

The only new structural element relative to the biased agent NCP is the bivariate extension. Instead of a scalar $z_j$, we have a $2 \times J$ matrix $\mathbf{Z}$. Instead of scaling by a single $\sigma$, we apply diag_pre_multiply(sigma, L_Omega) — which computes $\operatorname{diag}(\boldsymbol{\sigma})\mathbf{L}_\Omega$ and produces the full Cholesky scale matrix encoding both marginal SDs and the correlation structure.

Code

stan_memory_ncp_code <- "
// Hierarchical Memory Agent Model (Non-Centered Parameterization)
// alpha_j (baseline bias) and beta_j (memory sensitivity) are jointly
// drawn from a bivariate Normal. NCP decouples sampling geometry from
// the population parameters, preventing Neal's Funnel.

data {
  int<lower=1> N_total;                    // Total observations
  int<lower=1> N_subjects;                 // Number of agents
  array[N_total] int<lower=1, upper=N_subjects>  subj_id;   // Agent index
  array[N_total] int<lower=0, upper=1>  y;       // Observed choices
  vector<lower=0.01, upper=0.99>[N_total] opp_rate_prev;
  
  real prior_mu_alpha_mu;
  real<lower=0> prior_mu_alpha_sigma;
  real prior_mu_beta_mu;
  real<lower=0> prior_mu_beta_sigma;
  real<lower=0> prior_sigma_lambda;
  real<lower=0> prior_Omega_shape;
  int<lower=0, upper=1> run_diagnostics;
}

parameters {
  // Population-level parameters
  real mu_alpha;              // Population mean baseline bias (logit scale)
  real mu_beta;               // Population mean memory sensitivity
  vector<lower=0>[2] sigma;   // Marginal SDs

  // Cholesky factor of the correlation matrix
  cholesky_factor_corr[2] L_Omega;

  // Individual-level parameters (NCP)
  matrix[2, N_subjects] z;
}

transformed parameters {
  matrix[N_subjects, 2] indiv_params;
  {
    matrix[2, N_subjects] deviations = diag_pre_multiply(sigma, L_Omega) * z;
    for (j in 1:N_subjects) {
      indiv_params[j, 1] = mu_alpha + deviations[1, j];
      indiv_params[j, 2] = mu_beta  + deviations[2, j];
    }
  }
}

model {
  // Hyperpriors
  target += normal_lpdf(mu_alpha | prior_mu_alpha_mu, prior_mu_alpha_sigma);
  target += normal_lpdf(mu_beta  | prior_mu_beta_mu,  prior_mu_beta_sigma);
  target += exponential_lpdf(sigma | prior_sigma_lambda);
  target += lkj_corr_cholesky_lpdf(L_Omega | prior_Omega_shape);

  // Individual-level prior (NCP)
  target += std_normal_lpdf(to_vector(z));

  // Likelihood
  vector[N_total] logit_p;
  for (i in 1:N_total) {
    logit_p[i] = indiv_params[subj_id[i], 1]
               + indiv_params[subj_id[i], 2] * logit(opp_rate_prev[i]);
  }
  target += bernoulli_logit_lpmf(y | logit_p);
}

generated quantities {
  matrix[2, 2] Omega = multiply_lower_tri_self_transpose(L_Omega);
  real rho = Omega[1, 2];

  vector[N_subjects] alpha    = indiv_params[, 1];
  vector[N_subjects] beta_mem = indiv_params[, 2];

  // Prior log-density
  real lprior = normal_lpdf(mu_alpha | prior_mu_alpha_mu, prior_mu_alpha_sigma)
              + normal_lpdf(mu_beta  | prior_mu_beta_mu,  prior_mu_beta_sigma)
              + exponential_lpdf(sigma | prior_sigma_lambda)
              + lkj_corr_cholesky_lpdf(L_Omega | prior_Omega_shape)
              + std_normal_lpdf(to_vector(z));

  // Predictive checks and pointwise log-likelihood
  vector[N_total] log_lik;
  array[N_total] int y_post_rep;
  array[N_total] int y_prior_rep;

  // Prior draws for prior-posterior update plots
  real mu_alpha_prior    = normal_rng(prior_mu_alpha_mu, prior_mu_alpha_sigma);
  real mu_beta_prior     = normal_rng(prior_mu_beta_mu, prior_mu_beta_sigma);
  real sigma_alpha_prior = exponential_rng(prior_sigma_lambda);
  real sigma_beta_prior  = exponential_rng(prior_sigma_lambda);
  // LKJ(2) marginal on rho for 2x2: 2*Beta(2,2) - 1
  real rho_prior         = 2.0 * beta_rng(2.0, 2.0) - 1.0;

  if (run_diagnostics) {
    matrix[2, 2] L_Omega_prior = lkj_corr_cholesky_rng(2, prior_Omega_shape);
    vector[2] sigma_prior = [sigma_alpha_prior, sigma_beta_prior]';
    matrix[2, N_subjects] z_prior;
    for (j in 1:N_subjects) { 
      z_prior[1,j] = normal_rng(0,1); 
      z_prior[2,j] = normal_rng(0,1); 
    }
    matrix[N_subjects, 2] indiv_prior;
    {
      matrix[2, N_subjects] dev_prior = diag_pre_multiply(sigma_prior, L_Omega_prior) * z_prior;
      for (j in 1:N_subjects) {
        indiv_prior[j, 1] = mu_alpha_prior + dev_prior[1, j];
        indiv_prior[j, 2] = mu_beta_prior  + dev_prior[2, j];
      }
    }

    for (i in 1:N_total) {
      int j = subj_id[i];
      real lp_post  = indiv_params[j,1] + indiv_params[j,2] * logit(opp_rate_prev[i]);
      real lp_prior = indiv_prior[j,1]  + indiv_prior[j,2]  * logit(opp_rate_prev[i]);

      log_lik[i]     = bernoulli_logit_lpmf(y[i] | lp_post);
      y_post_rep[i]  = bernoulli_logit_rng(lp_post);
      y_prior_rep[i] = bernoulli_logit_rng(lp_prior);
    }
  }
}
"

stan_file_memory_ncp <- here::here("stan", "ch07_multilevel_memory_ncp.stan")
writeLines(stan_memory_ncp_code, stan_file_memory_ncp)
mod_memory_ncp <- cmdstan_model(stan_file_memory_ncp)

Two Stan details are worth noting. First, the local block { matrix[2,J] deviations = ...; } in transformed parameters confines the intermediate matrix to local scope, preventing it from appearing in the parameter output. Second, lkj_corr_cholesky_rng(2, 2.0) in generated quantities draws a full Cholesky factor from the LKJ(2) prior. This is the correct way to generate prior predictive data — unlike the Beta marginal trick, it generalizes to the $3 \times 3$ case we will need for the WSLS agent.

Stan Efficiency — Cholesky Factorization for Correlation Matrices

Whenever a model includes a correlation matrix $\Omega$ over individual-level parameters, use the Cholesky-factorized parameterization rather than sampling $\Omega$ directly.

Parameterization	Type	Why avoid
`corr_matrix[K] Omega` + `lkj_corr_lpdf`	Direct	Requires $O(K^3)$ matrix square root and positive-definiteness checks at every leapfrog step
`cholesky_factor_corr[K] L_Omega` + `lkj_corr_cholesky_lpdf`	Cholesky	$L_\Omega$ is already lower-triangular; lies directly on its natural constrained manifold; $O(K^2)$ operations

// Avoid
parameters { corr_matrix[2] Omega; }
model     { Omega ~ lkj_corr(2.0); }

// Prefer
parameters { cholesky_factor_corr[2] L_Omega; }
model      { L_Omega ~ lkj_corr_cholesky(2.0); }
transformed parameters {
  // Apply the Cholesky scale to NCP offsets in O(K^2) — no matrix inversion
  matrix[2, J] deviations = diag_pre_multiply(sigma, L_Omega) * z;
}
generated quantities {
  // Recover the full correlation matrix once per saved draw (not every leapfrog step)
  matrix[2, 2] Omega = multiply_lower_tri_self_transpose(L_Omega);
  real rho = Omega[1, 2];
}

Key point: the matrix multiply diag_pre_multiply(sigma, L_Omega) * z gives the individual-level deviations directly — no matrix square-root needed at run time. The correlation matrix $\Omega$ is only reconstructed in generated quantities, where it runs once per saved draw instead of once per leapfrog step.

For $K = 2$ parameters the saving is modest; for $K = 5$ or more (e.g., five per-subject RL parameters in a full model) it becomes substantial.

7.6.3.2 Fitting the model

Code

fit_memory_path <- here::here("simmodels", "ch07_fit_memory_ncp.rds")

# Initialize at prior means.
N_subj_val_mem <- stan_data_memory$N_subjects

init_memory_list <- lapply(1:4, function(i) {
  list(
    mu_alpha = rnorm(1, 0, 0.3), 
    mu_beta = rnorm(1, 0, 0.3), 
    sigma = abs(rnorm(2, 0, 0.3)) + 0.1, 
    L_Omega = matrix(c(1, 0, 0, 1), nrow=2), 
    z = matrix(0, nrow=2, ncol=N_subj_val_mem)
  )
})

if (regenerate_simulations || !file.exists(fit_memory_path)) {
  fit_memory <- mod_memory_ncp$sample(
    data            = stan_data_memory,
    seed            = 123,
    chains          = 4,
    parallel_chains = 4,
    iter_warmup     = 1000,
    iter_sampling   = 1000,
    init            = init_memory_list, 
    adapt_delta     = 0.90,   
    max_treedepth   = 12,
    refresh         = 0
  )
  fit_memory$save_object(fit_memory_path)
} else {
  fit_memory <- readRDS(fit_memory_path)
}

We raise adapt_delta to 0.90. The bivariate NCP is well-behaved globally, but the correlation parameter $\rho$ introduces a curved direction that can require longer leapfrog trajectories. The higher target acceptance rate forces the dual-averaging algorithm to find a smaller step size, reducing integration error along this direction. If max_treedepth warnings appear (distinct from divergences), raise max_treedepth further — do not raise adapt_delta in response to those.

7.6.3.3 Assessing the fitting process

Even if the Stan compiler finishes without explicit warnings, our workflow demands that we systematically verify the geometry of the posterior and the efficiency of the sampler.

Code

#fit_memory$cmdstan_diagnose()
fit_memory$diagnostic_summary()

$num_divergent
[1] 0 0 0 0

$num_max_treedepth
[1] 0 0 0 0

$ebfmi
[1] 0.7753663 0.6990873 0.7302008 0.7121430

We are looking for strict zeros in num_divergent. With the NCP, this should hold. Any divergences here are a signal that adapt_delta needs further adjustment, the priors are misspecified, or there is a structural model issue.

7.6.3.4 Chain Mixing and Stationarity ($\hat{R}$)

Code

draws_memory <- fit_memory$draws()

mcmc_trace(
  draws_memory,
  pars = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho"),
  np   = nuts_params(fit_memory)
) +
  labs(
    title    = "Trace Plots: Population Parameters (Memory Agent NCP)",
    subtitle = "All five population-level parameters, including rho. Stationary chains expected throughout."
  )

Code

draws_memory |>
  subset_draws(variable = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho",
                            "alpha[1]", "alpha[2]", "beta_mem[1]", "beta_mem[2]")) |>
  summarise_draws(default_convergence_measures()) |>
  knitr::kable(digits = 3)

variable	rhat	ess_bulk	ess_tail
mu_alpha	1.003	4536.130	3219.616
mu_beta	1.000	5530.353	3365.868
sigma[1]	1.001	3094.326	2956.688
sigma[2]	1.000	1998.533	2940.560
rho	1.000	1646.126	2270.844
alpha[1]	1.002	5728.964	3161.287
alpha[2]	1.002	6427.232	3574.001
beta_mem[1]	1.001	5303.794	2973.740
beta_mem[2]	1.000	5467.761	3269.354

The potential scale reduction factor ($\hat{R}$) must be below 1.01 for all parameters. The correlation parameter rho is informed only by covariation across agents, not within any individual agent’s trial sequence, so its ESS will be lower than the marginal parameters. ESS > 400 remains our threshold. An ESS below 200 for rho warrants investigation of the correlation manifold geometry.

7.6.3.4.1 Effective Sample Size (ESS)

Monitor ess_tail for sigma[1] and sigma[2]. Population SDs can develop mild funnel-like behavior even under NCP when the likelihood is highly informative about individual parameters, because the $z_j$ draws must simultaneously explain many agents. If ess_tail is low for the sigmas but ess_bulk is acceptable, raising adapt_delta further or running more warmup iterations is the correct response.

7.6.3.4.2 Energy (E-BFMI)

Code

mcmc_nuts_energy(nuts_params(fit_memory)) +
  labs(
    title    = "Energy Transitions: Memory Agent NCP",
    subtitle = "Marginal and transition energy distributions must align. E-BFMI > 0.2 required."
  )

7.6.3.5 Posterior Parameter Correlations

Code

mcmc_pairs(
  draws_memory,
  pars         = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho"),
  np           = nuts_params(fit_memory),
  diag_fun     = "dens",
  off_diag_fun = "hex"
)

The critical panels are $(\sigma_\alpha, \mu_\alpha)$ and $(\sigma_\beta, \mu_\beta)$. Under the CP biased agent, these showed triangular wedges — the Neal’s Funnel signature. Under NCP, they should be roughly elliptical. Any triangular shape here would be unexpected and would require investigation.

7.6.4 Prior-Posterior Update Checks: Quantifying Learning

Code

draws_df_memory <- as_draws_df(draws_memory)

# Extract true generative hyperparameters saved as attributes
true_hp <- attr(sim_data_memory, "hyperparameters")

true_pop_mem <- tibble(
  parameter  = c("mu_alpha", "mu_beta", "sigma_alpha", "sigma_beta", "rho"),
  true_value = c(true_hp$mu_alpha, true_hp$mu_beta,
                 true_hp$sigma_alpha, true_hp$sigma_beta, true_hp$rho)
)

# Reshape: split on the underscore immediately before "posterior" or "prior"
pp_update_mem <- draws_df_memory |>
  dplyr::select(
    mu_alpha_posterior  = mu_alpha,        mu_alpha_prior,
    mu_beta_posterior   = mu_beta,         mu_beta_prior,
    sigma_alpha_posterior = `sigma[1]`,    sigma_alpha_prior,
    sigma_beta_posterior  = `sigma[2]`,    sigma_beta_prior,
    rho_posterior       = rho,             rho_prior
  ) |>
  pivot_longer(
    everything(),
    names_to  = c("parameter", "distribution"),
    names_sep = "_(?=posterior|prior)"
  )

ggplot(pp_update_mem, aes(x = value)) +
  geom_histogram(aes(fill = distribution), alpha = 0.6, color = NA, bins = 60) +
  geom_vline(
    data = true_pop_mem,
    aes(xintercept = true_value, linetype = "True Value"),
    color = "darkred", linewidth = 1
  ) +
  facet_wrap(~parameter, scales = "free") +
  scale_fill_manual(
    name   = "Distribution",
    values = c("prior" = "gray70", "posterior" = "steelblue")
  ) +
  scale_linetype_manual(
    name   = "Ground Truth",
    values = c("True Value" = "dashed")
  ) +
  labs(
    title = "Prior-Posterior Update Checks: Memory Agent",
    x     = "Parameter Value",
    y     = "Count"
  )

A healthy update shows a posterior significantly narrower than the prior and comfortably contained within the prior’s mass. The rho panel carries the central new scientific information. If the posterior on $\rho$ is essentially indistinguishable from the LKJ(2) prior, the data do not contain sufficient information to identify the population-level correlation — this can happen with small $J$ or short trial sequences. In that case, report the parameter as unidentified, not the prior as the conclusion. A clearly narrowed posterior constitutes genuine evidence about the cognitive architecture of the population.

7.6.5 Posterior Predictive Checks (PPC): Generative Adequacy

Code

y_obs_mem     <- as.numeric(stan_data_memory$y)
agent_idx_mem <- as.numeric(stan_data_memory$subj_id)
y_rep_mem     <- fit_memory$draws("y_post_rep", format = "matrix")

n_sample_mem       <- 12
sampled_agents_mem <- sample(unique(agent_idx_mem), n_sample_mem)
keep_mask_mem      <- agent_idx_mem %in% sampled_agents_mem

prop_right <- function(x) mean(x)

ppc_stat_grouped(
  y     = y_obs_mem[keep_mask_mem],
  yrep  = y_rep_mem[, keep_mask_mem],
  group = agent_idx_mem[keep_mask_mem],
  stat  = "prop_right"
) +
  labs(
    title    = "Posterior Predictive Checks: Agent-Level Choice Rates (Subset)",
    subtitle = paste("Showing", n_sample_mem, "randomly sampled agents. Dark line: Observed. Histogram: Predicted."),
    x        = "Proportion of 'Right' Choices"
  )

7.6.5.1 Calibration and Recovery

We have proven the model fits the data and generates plausible predictions. Now we must prove the parameters we extract are actually the true parameters that generated the data. For a bivariate model, recovery must be assessed at both levels: all five population parameters, and the individual $(\alpha_j, \beta_j)$ pairs jointly.

7.6.5.1.1 Individual-Level Recovery

Code

recovery_alpha <- draws_memory |>
  subset_draws(variable = "alpha") |>
  summarise_draws() |>
  mutate(agent_id = as.integer(str_extract(variable, "(?<=\\[)\\d+(?=\\])")))

recovery_beta <- draws_memory |>
  subset_draws(variable = "beta_mem") |>
  summarise_draws() |>
  mutate(agent_id = as.integer(str_extract(variable, "(?<=\\[)\\d+(?=\\])")))

true_agent_params_mem <- sim_data_memory |>
  distinct(agent_id, n_trials, true_alpha, true_beta)

recovery_mem_df <- true_agent_params_mem |>
  left_join(recovery_alpha |> dplyr::select(agent_id, alpha_est = median, alpha_q5 = q5, alpha_q95 = q95), by = "agent_id") |>
  left_join(recovery_beta  |> dplyr::select(agent_id, beta_est  = median, beta_q5  = q5, beta_q95  = q95),  by = "agent_id")

p_rec_alpha <- ggplot(recovery_mem_df, aes(x = true_alpha, y = alpha_est)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") +
  geom_errorbar(aes(ymin = alpha_q5, ymax = alpha_q95), width = 0, alpha = 0.5, color = "steelblue") +
  geom_point(aes(size = n_trials), color = "midnightblue", alpha = 0.8) +
  labs(
    title = expression("Recovery: Baseline Bias " * alpha[j]),
    x = "True alpha (logit scale)", y = "Posterior Median", size = "N trials"
  )

p_rec_beta <- ggplot(recovery_mem_df, aes(x = true_beta, y = beta_est)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") +
  geom_errorbar(aes(ymin = beta_q5, ymax = beta_q95), width = 0, alpha = 0.5, color = "steelblue") +
  geom_point(aes(size = n_trials), color = "midnightblue", alpha = 0.8) +
  labs(
    title = expression("Recovery: Memory Sensitivity " * beta[j]),
    x = "True beta", y = "Posterior Median", size = "N trials"
  )

p_rec_alpha + p_rec_beta + plot_layout(guides = "collect")

Recovery for $\beta_j$ is typically noisier than for $\alpha_j$ at any given trial count. The memory signal $\text{logit}(m_{ij})$ starts at zero (the $\text{logit}(0.5) = 0$ initialization) and becomes informative only as the cumulative opponent history diverges from 0.5. Agents with few trials show stronger shrinkage toward $\mu_\beta = 1.5$ — not toward zero, but toward the estimated population mean, which is the appropriate regularization given what the rest of the cohort tells us about the population.

7.6.5.2 Prior sensitivity

Code

sens_results_mem <- powerscale_sensitivity(
  fit_memory,
  variable = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho")
)
cat("=== PRIOR SENSITIVITY DIAGNOSTICS: MEMORY AGENT ===\n")

=== PRIOR SENSITIVITY DIAGNOSTICS: MEMORY AGENT ===

Code

print(sens_results_mem)

Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data

 variable prior likelihood                     diagnosis
 mu_alpha 0.033      0.075                             -
  mu_beta 0.058      0.076 potential prior-data conflict
 sigma[1] 0.361      0.208 potential prior-data conflict
 sigma[2] 0.517      0.124 potential prior-data conflict
      rho 0.034      0.197                             -

Code

ps_seq_mem <- powerscale_sequence(
  fit_memory,
  variable = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho")
)

powerscale_plot_dens(ps_seq_mem) +
  labs(
    title    = "Prior Sensitivity: Density Perturbation (Memory Agent)",
    subtitle = "Does scaling the prior or likelihood drastically alter our beliefs about the population?"
  )

Code

powerscale_plot_quantities(ps_seq_mem) +
  labs(
    title    = "Prior Sensitivity: Parameter Estimates (Memory Agent)",
    subtitle = "Horizontal lines: baseline posterior bounds. Deviations indicate sensitivity."
  )

Pay particular attention to the rho panel. Because the correlation is identified only from across-agent covariation, it has fewer effective observations than the marginal population parameters. If rho shows strong sensitivity to prior scaling, the data contain limited information about the correlation structure and the LKJ prior is doing meaningful regularization.

7.6.5.3 Simulation-Based Calibration Checks (SBC)

Parameter recovery on a single dataset is anecdotal. We must prove that the sampler is globally unbiased across the entire prior space.

Code

sbc_memory_path <- here::here("simmodels", "ch07_sbc_memory.rds")

generate_sbc_dataset_memory <- function(n_agents = 50, n_trials_min = 80, n_trials_max = 240) {
  # Use the same reversal schedule as the main fit.
  # opponent_reversal is available in the chapter environment from Edit 6.16.
  sim  <- simulate_hierarchical_memory(
    n_agents         = n_agents,
    n_trials_min     = n_trials_min,
    n_trials_max     = n_trials_max,
    opponent_choices = opponent_reversal
  )
  hp   <- attr(sim, "hyperparameters")

  list(
    variables = list(
      mu_alpha   = hp$mu_alpha,
      mu_beta    = hp$mu_beta,
      `sigma[1]` = hp$sigma_alpha,
      `sigma[2]` = hp$sigma_beta,
      rho        = hp$rho,
      alpha      = sim |> distinct(agent_id, true_alpha) |> arrange(agent_id) |> pull(true_alpha),
      beta_mem   = sim |> distinct(agent_id, true_beta)  |> arrange(agent_id) |> pull(true_beta)
    ),
    generated = prep_stan_data_memory(sim)
  )
}

generator_memory <- SBC_generator_function(
  generate_sbc_dataset_memory,
  n_agents = 50, n_trials_min = 30, n_trials_max = 120
)

backend_memory <- SBC_backend_cmdstan_sample(
  mod_memory_ncp,
  iter_warmup = 1000, iter_sampling = 1000, chains = 1, refresh = 0
)

if (regenerate_simulations || !file.exists(sbc_memory_path)) {
  sbc_datasets_memory <- generate_datasets(generator_memory, 100)
  plan(sequential)
  #future::plan(future::multisession, workers = 4)
  sbc_results_memory  <- compute_SBC(datasets = sbc_datasets_memory, backend = backend_memory,
    keep_fits = FALSE)
  saveRDS(
    list(datasets = sbc_datasets_memory, results = sbc_results_memory),
    sbc_memory_path
  )
} else {
  sbc_mem_obj         <- readRDS(sbc_memory_path)
  sbc_datasets_memory <- sbc_mem_obj$datasets
  sbc_results_memory  <- sbc_mem_obj$results
}

plot_rank_hist(sbc_results_memory, variables = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho")) +
  labs(title = "SBC Rank Histograms: Memory Agent NCP")

Code

plot_ecdf_diff(sbc_results_memory, variables = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho")) +
  labs(
    title    = "SBC ECDF Difference: Memory Agent NCP",
    subtitle = "Deviation from the diagonal indicates miscalibration."
  )

Code

# 1. Extract the lightweight stats dataframe from the SBC object
sbc_stats_df <- sbc_results_memory$stats

# 2. Filter for the population-level parameters you care about
# (e.g., mu_alpha, mu_beta, rho, or even sigma[1])
recovery_df <- sbc_stats_df |>
  filter(variable %in% c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho"))

# 3. Create the Parameter Recovery Identity Plot
p_sbc_recovery <- ggplot(recovery_df, aes(x = simulated_value, y = median)) +
  # The Identity Line (Perfect Recovery)
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50", linewidth = 1) +
  
  # The Posterior Estimates with 90% Credible Intervals
  geom_errorbar(aes(ymin = q5, ymax = q95), width = 0, alpha = 0.3, color = "steelblue") +
  geom_point(color = "midnightblue", alpha = 0.7, size = 2) +
  
  # Facet by parameter
  facet_wrap(~variable, scales = "free") +
  
  labs(
    title = "Parameter Recovery (Extracted Directly from SBC)",
    subtitle = "Points represent the posterior median. Bars are 90% Credible Intervals.",
    x = "True Generative Value (Drawn from Prior)",
    y = "Posterior Median Estimate"
  ) +
  theme_classic() +
  theme(strip.background = element_rect(fill = "gray95", color = NA))

print(p_sbc_recovery)

Code

# 4. Quantify the Recovery (Correlation and RMSE)
recovery_metrics <- recovery_df |>
  group_by(variable) |>
  summarize(
    Pearson_r = cor(simulated_value, median),
    RMSE = sqrt(mean((simulated_value - median)^2)),
    .groups = "drop"
  )

cat("=== PARAMETER RECOVERY QUANTIFICATION ===\n")

=== PARAMETER RECOVERY QUANTIFICATION ===

Code

print(knitr::kable(recovery_metrics, digits = 3))



|variable | Pearson_r|  RMSE|
|:--------|---------:|-----:|
|mu_alpha |     0.988| 0.162|
|mu_beta  |     0.977| 0.220|
|rho      |     0.673| 0.334|
|sigma[1] |     0.988| 0.147|
|sigma[2] |     0.969| 0.237|

Code

diags_memory  <- sbc_results_memory$backend_diagnostics
mem_fail_rate <- mean(diags_memory$n_divergent > 0) * 100

diags_memory |>
  ggplot(aes(x = n_divergent)) +
  geom_histogram(binwidth = 1, fill = "darkred", alpha = 0.8) +
  labs(
    title    = "Memory NCP Divergence Counts Across 200 SBC Universes",
    subtitle = sprintf("%.0f%% of universes produced at least one divergence.", mem_fail_rate),
    x        = "Number of Divergent Transitions",
    y        = "Number of SBC Universes"
  )

Code

# Does any failure concentrate at small sigma values?
posterior::as_draws_df(sbc_datasets_memory$variables) |>
  as_tibble() |>
  mutate(sim_id = row_number()) |>
  left_join(
    diags_memory |> mutate(sim_id = row_number()) |> dplyr::select(sim_id, n_divergent),
    by = "sim_id"
  ) |>
  mutate(had_divergence = n_divergent > 0) |>
  ggplot(aes(x = `sigma[1]`, y = `sigma[2]`, color = had_divergence)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(
    values = c("FALSE" = "steelblue", "TRUE" = "darkred"),
    labels = c("FALSE" = "Clean", "TRUE" = "Divergent"),
    name   = NULL
  ) +
  labs(
    title    = "Memory NCP: Where Do Divergences Occur?",
    subtitle = "Any divergences should not concentrate at small sigma values (that would indicate residual funnel geometry).",
    x        = expression(sigma[alpha]), y = expression(sigma[beta])
  )

The rho rank histogram will typically have wider confidence bands than the marginal mean and SD histograms at $N = 200$. This is expected: $\rho$ is identified only from covariation across $J = 50$ agents, so each SBC universe contributes one effective draw of $\rho$ to the calibration check. Mild irregularity within the confidence envelope is statistical noise. Systematic U-shape or strong asymmetry would indicate a genuine calibration failure.

7.6.5.4 Cognitive Interpretation of $\hat{\rho}$

The posterior on $\rho$ is the inferential result that distinguishes this model from two independent univariate hierarchical models.

$\rho \approx 0$: Baseline tendency and memory-based opponent tracking are orthogonal individual differences. Studying one tells us nothing about the other.
$\rho > 0$: Agents with a stronger directional baseline also tend to weight memory more heavily — consistent with a general exploitation tendency, where both parameters index willingness to depart from 50/50 in response to a signal.
$\rho < 0$: Agents with a strong directional baseline tend to suppress memory use — a compensatory account in which prior commitment crowds out sensitivity to the opponent’s history.

None of these interpretations follow from the model alone. The model provides a calibrated posterior over $\rho$. Cognitive theory provides the framework for deciding what that value means.

7.6.5.5 LOO for Model Comparison

We retain the LOO-CV score here for the model comparison section at the end of the chapter.

Code

loo_memory <- loo(fit_memory$draws("log_lik", format = "matrix"))
print(loo_memory)


Computed from 4000 by 8058 log-likelihood matrix.

         Estimate   SE
elpd_loo  -4233.0 44.3
p_loo        55.3  1.3
looic      8465.9 88.6
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume independent draws (r_eff=1).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

Code

# Save everything Ch. 8 needs for model comparison.
# Ch. 8 will load this object to avoid refitting.
ch07_exports_file <- here("simmodels", "ch07_memory_exports_for_ch7.rds")

saveRDS(
  list(
    fit_memory       = fit_memory,        # Fitted CmdStanR object
    loo_memory       = loo_memory,        # Pre-computed LOO score
    stan_data_memory = stan_data_memory,  # Stan data list (reversal design)
    sim_data_memory  = sim_data_memory,   # Full trial-level tibble
    reversal_schedule = opponent_reversal # 240-trial opponent sequence
  ),
  ch07_exports_file
)

cat("Ch. 7 exports saved to:", ch07_exports_file, "\n")

Ch. 7 exports saved to: /Users/au209589/Dropbox/Teaching/AdvancedCognitiveModeling23_book/simmodels/ch07_memory_exports_for_ch7.rds

Code

cat("Load in Ch. 8 with: readRDS(here('simmodels', 'ch07_memory_exports_for_ch7.rds'))\n")

Load in Ch. 8 with: readRDS(here('simmodels', 'ch07_memory_exports_for_ch7.rds'))

7.7 Conclusion: From One Agent to a Population

This chapter extended the Bayesian modeling workflow to multiple participants. The core ideas are:

Partial pooling is not a compromise — it is the statistically correct way to combine information across individuals with unequal data, and it provably outperforms both extremes (as the shrinkage plot showed).
Centered vs. non-centered parameterization is a geometric choice, not a philosophical one. CP fails when population variance is small (Neal’s Funnel); NCP fixes the global geometry but can be locally inefficient when data are very informative. SBC revealed which regime we are in.
The reversal learning design is essential for the memory model. A constant-bias opponent makes bias and beta statistically inseparable; the reversal schedule provides the contrast that identifies them.

Coming up in Chapter 8: We have now validated two models — the hierarchical biased agent and the hierarchical memory agent — and computed LOO scores for both. Chapter 8 introduces model comparison: given that both models have passed validation, which one better predicts held-out data? The answer will use the LOO scores and the loo_compare() function from the loo package.

Betancourt, Michael, and Mark Girolami. 2013. “Hamiltonian Monte Carlo for Hierarchical Models.” Current Trends in Bayesian Methodology with Applications 79: 30.

Gelman, Andrew, John B Carlin, Hal S Stern, and Donald B Rubin. 1995. “Bayesian Data Analysis.”

Gelman, Andrew, and Donald B Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72.

Henderson, Charles R. 1950. “Estimation of Genetic Parameters.” Annals of Mathematical Statistics 21: 309–10.

James, William, and Charles Stein. 1961. “Estimation with Quadratic Loss” 1: 361–79.

Lindley, Dennis V, and Adrian FM Smith. 1972. “Bayes Estimates for the Linear Model.” Journal of the Royal Statistical Society: Series B (Methodological) 34 (1): 1–18.

Neal, Radford M. 2003. “Slice Sampling.” The Annals of Statistics 31 (3): 705–67.

Rouder, Jeffrey N, and Jun Lu. 2008. “Bayesian Hierarchical Models for Perceptual Processes.” Psychonomic Bulletin & Review 15 (6): 1153–67.

Rouder, Jeffrey N, Jun Lu, Paul Speckman, Dongchu Sun, and Yi Jiang. 2005. “An Introduction to Bayesian Hierarchical Models with an Application to the SET Generalization Task.” Psychonomic Bulletin & Review 12 (4): 573–604.

# Individual Differences in Cognitive Strategies: The Hierarchical Architecture {#sec-multilevel-models} ```{r ch07_setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, warning = FALSE, message = FALSE, fig.width = 8, fig.height = 5, fig.align = 'center', out.width = "80%", dpi = 300 ) regenerate_simulations <- FALSE # Construct the required directory infrastructure if (!dir.exists("stan")) dir.create("stan") if (!dir.exists("simdata")) dir.create("simdata") if (!dir.exists("simmodels")) dir.create("simmodels") if (!dir.exists("figures")) dir.create("figures") # Load the strict computational stack using pacman pacman::p_load( tidyverse, # Core suite for data manipulation and visualization here, # Robust path routing relative to project root posterior, # Analytical tools for posterior array manipulations cmdstanr, # Direct R interface to the CmdStan C++ compiler tidybayes, # Tidy extraction of draws and predictive distributions patchwork, # Composition of diagnostic ggplot matrices bayesplot, # Specialized geometric diagnostic plots (traces, pairs) furrr, # Parallel mapping for Recovery and SBC arrays future, # Asynchronous backend for parallelization loo, # PSIS-LOO out-of-sample predictive cross-validation priorsense, # Analytical prior sensitivity via power-scaling DiagrammeR, # Generative structural visualizations (DAGs) SBC, # Simulation-Based Calibration Checks marginaleffects # Conditional predictive interpretations on the outcome scale ) # Enforce a clean, data-to-ink maximized visual theme theme_set(theme_classic()) # Establish the parallel processing topology # We reserve 2 cores to prevent the operating system from hanging during intensive MCMC/SBC #cores_to_use <- max(1, parallel::detectCores() - 2) #plan(multisession, workers = cores_to_use) plan(sequential) # to ensure proper knitting. ``` > **📍 Where we are in the Bayesian modeling workflow:** > Chs. 1–5 fitted models to a single agent and validated the inference > engine. Now we scale up: real experiments involve many participants who > differ from each other. This chapter builds hierarchical (multilevel) > models that estimate both individual parameters *and* the population > distribution they are drawn from — and validates the whole pipeline > with the same six-phase battery from Ch. 6. > The memory model fitted here uses the **reversal learning dataset from > Ch. 6**, which was the first dataset that cleanly separated `bias` from > `beta`. The fitted model and its LOO score are saved for Ch. 8. ## Introduction Our exploration of decision-making models has so far focused on single agents or averaged behavior. However, individuals differ systematically in how they approach tasks and process information. Some people may be more risk-averse, have better memory, learn faster, or employ entirely different strategies than others. Ignoring these individual differences can lead to misleading conclusions about cognitive processes. From a statistical perspective, if we assume a single set of parameters generates all data, our model is mathematically misspecified, which leads to pathological posterior geometries and misleading inferential results. This chapter introduces hierarchical modeling (also called multilevel or mixed effects modeling) as a powerful framework for capturing these individual differences while still identifying population-level patterns. ### The Epistemic Challenge: Pooling Information Traditional statistical approaches to handling data from multiple individuals force a false dichotomy, imposing rigid, unjustified structural assumptions on the data: 1. **Complete Pooling (Underfitting)**: Treats all participants as identical: it estimates a single set of parameters for the entire group, completely ignoring potentially meaningful individual variation. This assumes the population variance (the average error in predicting an individual from the population) is strictly zero ($\sigma \to 0$). * *The Pathology*: Systematically fails to capture cognitive heterogeneity, describing an "average" participant that does not necessarily exist. 2. **No Pooling (Overfitting)**: Analyzes each participant in complete isolation, estimating independent parameters ($\theta_j$) for each individual. No pooling corresponds to fitting $J$ separate models to $J$ individual participants. This assumes the population variance is infinite ($\sigma \to \infty$). * *The Pathology*: Severely inefficient and vulnerable to noise. For individuals with sparse or noisy data, the posterior estimate will be excessively wide or completely dominated by the random fluctuations of their specific likelihood. > *Why does this dichotomy matter in practice?* With 120 trials per agent, > complete pooling produces one parameter estimate that represents nobody > in particular. No pooling produces 50 estimates, many of which are > dominated by noise for agents with only 20–30 trials. Partial pooling > is not a philosophical compromise — it produces estimates that are > demonstrably *closer to the truth* for low-data individuals, as the > shrinkage plot at the end of this chapter will show directly. ### The Solution: Partial Pooling (Multilevel or Hierarchical Modeling) Hierarchical modeling offers a principled, data-driven middle ground. Instead of forcing $\sigma$ to be $0$ or $\infty$, we treat the population variance as a free parameter estimated from the data. The model is structured hierarchically: * **Individual Level**: Each individual agent $j$ has a specific parameter vector (e.g., their intrinsic bias $\theta_j$ or memory decay $\beta_j$). * **Population Level**: These individual parameters are explicitly modeled as draws from an overarching continuous population distribution (e.g., a Gaussian characterized by a population mean $\mu$ and standard deviation $\sigma$). * **The Bayesian Compromise**: The estimation of an individual's parameter ($\theta_j$) is a precision-weighted combination of their own empirical data (the likelihood) and the population distribution (the hierarchical prior). * **Shrinkage**: Estimates for individuals with sparse or noisy data are "shrunk" toward the population mean $\mu$. This is a mathematical mechanism that regularizes inference, preventing overfitting and yielding more robust out-of-sample estimates. ::: {.callout-note} ### Historical Context: Shrinkage, Partial Pooling, and the Recovery of Individual Differences The hierarchical model is now a standard tool, but the intellectual path to it was surprisingly contentious. Accepting partial pooling required overturning the intuition — deeply held by statisticians trained in classical estimation theory — that combining information from different units is always a form of cheating. #### Mixed Models in the Fields The mathematical foundations were laid not in cognitive science but in animal breeding. @henderson1950estimation needed to estimate the genetic merit of individual bulls from data on their offspring, spread across farms that differed in management and feeding. The complete-pooling solution (assume all farms are identical) was agronomically wrong; the no-pooling solution (analyse each farm in isolation) wasted the information that herds from the same breed share a common distribution of genetic merit. Henderson's **Best Linear Unbiased Prediction** (BLUP) formalised the middle path: individual estimates are shrunk toward the population mean by an amount that depends on how much data each individual contributes. The same equations appear, under different names, as the core of the hierarchical models in this chapter. #### Stein's Paradox and the Optimality of Shrinkage The theoretical justification arrived from a different direction, and it was deeply counterintuitive. @james1961estimation proved that the ordinary (no-pooling) estimator of three or more means is *inadmissible* under squared-error loss: there always exists a shrinkage estimator — one that pulls all estimates toward a common value — with uniformly lower expected loss. This is Stein's paradox. It implies that, regardless of whether the true means are related to each other, you are statistically better off treating them as exchangeable draws from a common distribution than estimating each independently. The result was controversial because it seemed to say that knowing the speed of light should improve your estimate of the boiling point of water — which is absurd — but the paradox dissolves once you recognise that the admissibility comparison is over the *joint* vector of estimates, not individual components. #### Bayesian Formalisation @lindley1972bayes showed that the James-Stein estimator is a Bayesian posterior mean under the assumption that the true parameters are drawn from a common normal distribution — exactly the hierarchical prior structure used in this chapter. This connected the frequentist inadmissibility result to a principled Bayesian generative story: shrinkage is not a trick but the natural consequence of treating individuals as exchangeable members of a population. @gelman1992inference and @gelman1995bayesian developed the computational infrastructure (via MCMC) that made this formalism applicable to realistic psychological datasets. #### Into Cognitive Science The adoption of hierarchical models in cognitive science was slow. For much of the 1990s and early 2000s, individual-differences research either fitted separate models to each participant and then performed a second-stage group analysis — the "two-stage" procedure that discards uncertainty — or fitted a single population model that ignored individual variation entirely. @rouder2005introduction and @rouder2008bayesian demonstrated that hierarchical models substantially improve parameter recovery in reaction-time and signal-detection contexts: individual estimates derived from partial pooling are closer to the true generating values than either the complete-pooling or no-pooling alternatives, especially for participants who contribute few trials. The shrinkage plot at the end of this chapter makes this advantage visible directly: you can watch the model borrow strength from the population to regularise noisy individuals. #### Non-Centred Parameterisation and Neal's Funnel The one purely computational contribution to this story came from @neal2003slice and was adopted as standard practice by the Stan community via @betancourt2013hamiltonian. When individual parameters are parameterised as $\theta_j = \mu + \sigma \cdot z_j$ with $z_j \sim \mathcal{N}(0,1)$ — the non-centred form — the geometry of the posterior becomes approximately Gaussian and Hamiltonian Monte Carlo can traverse it efficiently. In the centred form $\theta_j \sim \mathcal{N}(\mu, \sigma^2)$, the posterior develops a funnel geometry near $\sigma \approx 0$ that causes HMC to diverge. The non-centred parameterisation is the single most important practical technique in hierarchical Bayesian computation, and the reason the models in this chapter spend time on it. ::: ## Learning Objectives After completing this chapter, you will be able to: * **Explain Partial Pooling**: Understand how multilevel modeling balances individual and group-level information through shrinkage. * **Distinguish Pooling Approaches**: Articulate the differences and trade-offs between complete pooling ($\sigma=0$), no pooling ($\sigma=\infty$), and partial pooling. * **Implement Hierarchical Models**: Write Stan code for hierarchical models with varying intercepts and/or slopes. * **Reparameterize a model**: Understand why and how to reparameterize a model and in particular how to use non-centered parameterization to improve sampling efficiency in hierarchical models. * **Model Parameter Correlations**: Implement and interpret models that estimate correlations between individual-level parameters (via Cholesky-factorized correlation matrices $\mathbf{L}_\Omega$). * **Perform Comprehensive Model Checks**: Apply a rigorous Bayesian workflow to hierarchical models (Predictive Checks, Prior-Posterior Update Checks, Sensitivity Analysis, Parameter Recovery, Simulation-Based Calibration Checks, etc). * **Apply to Cognitive Questions**: Use multilevel modeling to investigate individual differences in cognitive strategies (e.g., bias, memory, WSLS). This chapter has two parts. The first applies the full Bayesian workflow to the **Hierarchical Biased Agent** — one parameter per person, one population distribution — and uses it to expose and fix the Centered Parameterization's pathological geometry (Neal's Funnel). The second applies the same workflow to the **Hierarchical Memory Agent**, which has two correlated parameters per person, requires a bivariate population distribution, and uses the reversal learning dataset from Ch. 6 that we validated there. The LOO score from the fitted memory model is saved for Ch. 8's model comparison. ### Biased Agent Model (Multilevel) In this architecture, each agent $j \in \{1, \dots, J\}$ possesses an unobserved, individual-level bias parameter $\theta_j$. This individual parameter is generated by a latent population distribution, parameterized by a global mean $\mu_\theta$ and global standard deviation $\sigma_\theta$. Finally, the empirical choices $h_{ij}$ for trial $i$ are generated conditionally on the individual parameter $\theta_j$. Mathematically, this translates to the following generative structure on the unconstrained (logit) space:$$h_{ij} \sim \text{Bernoulli}(\text{logit}^{-1}(\theta_{\text{logit}, j}))$$$$\theta_{\text{logit}, j} \sim \text{Normal}(\mu_{\theta}, \sigma_{\theta})$$$$\mu_{\theta} \sim \text{Normal}(0, 1.5)$$$$\sigma_{\theta} \sim \text{Exponential}(1)$$ By mapping this DAG, we have strictly defined our generative assumptions. We are now ready to translate this architecture into Stan code and subject it to Prior Predictive Checks. ```{tikz hierarchical_biased_dags, echo=FALSE, fig.cap="Standard Plate Notation (Left) and Kruschke-Style Generative Notation (Right) for the Hierarchical Biased Agent.", fig.ext='png', fig.width=12, cache=TRUE} \usetikzlibrary{positioning, shapes.geometric, arrows.meta, fit, backgrounds, calc} \tikzset{ stochastic/.style={->, >={Latex[length=3mm]}, thick}, deterministic/.style={->, >={Latex[length=3mm]}, thick, dashed, draw=gray!70}, plate/.style={draw=black, thick, rounded corners, inner sep=10pt}, continuous/.style={circle, draw=black, thick, minimum size=1.2cm, inner sep=0pt}, discrete/.style={rectangle, draw=black, thick, minimum size=1.2cm}, observed/.style={fill=gray!30}, latent/.style={fill=white}, pics/norm_dist/.style={code={ \draw[thin, gray] (-1.2,0) -- (1.2,0); \draw[thick, blue] plot[domain=-1.2:1.2, samples=50] (\x, {0.8*exp(-3*\x*\x)}); \node[font=\footnotesize, below] at (0,0) {Normal}; }}, pics/exp_dist/.style={code={ \draw[thin, gray] (-0.2,0) -- (1.5,0); \draw[thick, blue] plot[domain=0:1.4, samples=50] (\x, {0.8*exp(-2.5*\x)}); \node[font=\footnotesize, below] at (0.6,0) {Exponential}; }}, pics/lkj_dist/.style={code={ \draw[thin, gray] (-1.2,0) -- (1.2,0); \draw[thick, blue] plot[domain=-0.95:0.95, samples=50] (\x, {0.8*(1-\x*\x)}); \node[font=\footnotesize, below] at (0,0) {LKJ}; }}, pics/mvn_dist/.style={code={ \draw[thin, gray, ->] (-0.8,0) -- (0.8,0); \draw[thin, gray, ->] (0,-0.8) -- (0,0.8); \draw[thick, blue, rotate=30] (0,0) ellipse (0.6 and 0.22); \draw[thick, blue, rotate=30] (0,0) ellipse (0.3 and 0.11); \node[font=\footnotesize, below right] at (0.1,-0.15) {MVN}; }}, pics/bern_dist/.style={code={ \draw[thin, gray] (-1,0) -- (1,0); \draw[blue, line width=4pt] (-0.4, 0) -- (-0.4, 0.3); \draw[blue, line width=4pt] (0.4, 0) -- (0.4, 0.8); \node[font=\footnotesize, below] at (0,0) {Bernoulli}; }} } \begin{tikzpicture} %% LEFT: STANDARD PLATE NOTATION \node[font=\large\bfseries] at (0, 2.2) {Standard Plate Notation}; \node[continuous, latent] (mu_a) at (-3, 0) {$\mu_\alpha$}; \node[continuous, latent] (sig_a) at (-1.5, 0) {$\sigma_\alpha$}; \node[continuous, latent, fill=gray!15] (omega) at (0, 0) {$\Omega$}; \node[continuous, latent] (sig_b) at (1.5, 0) {$\sigma_\beta$}; \node[continuous, latent] (mu_b) at (3, 0) {$\mu_\beta$}; \node[continuous, latent] (alpha) at (-1.5, -3.5) {$\alpha_j$}; \node[continuous, latent] (beta) at ( 1.5, -3.5) {$\beta_j$}; \node[discrete, observed] (m) at (-1.5, -7) {$m_{ij}$}; \node[discrete, observed] (h) at ( 1.5, -7) {$h_{ij}$}; \node[plate, fit=(m)(h), label={[anchor=south east, font=\small\bfseries]below right:$i=1\dots N$}] (pN) {}; \node[plate, inner sep=15pt, fit=(alpha)(beta)(pN), label={[anchor=south east, font=\small\bfseries]below right:$j=1\dots J$}] {}; \draw[stochastic] (mu_a) -- (alpha); \draw[stochastic] (sig_a) -- (alpha); \draw[stochastic] (omega) -- (alpha); \draw[stochastic] (omega) -- (beta); \draw[stochastic] (sig_b) -- (beta); \draw[stochastic] (mu_b) -- (beta); \draw[stochastic] (alpha) -- (h); \draw[stochastic] (beta) -- (h); \draw[deterministic] (m) -- (h); %% RIGHT: KRUSCHKE-STYLE NOTATION \node[font=\large\bfseries] at (11, 2.2) {Kruschke-Style Notation}; \coordinate (k_mu_a) at (7, 0); \draw (k_mu_a) pic {norm_dist}; \node[above=0.7cm of k_mu_a, font=\bfseries] {$\mu_\alpha$}; \coordinate (k_sig_a) at (9, 0); \draw (k_sig_a) pic {exp_dist}; \node[above=0.7cm of k_sig_a, font=\bfseries] {$\sigma_\alpha$}; \coordinate (k_omega) at (11, 0); \draw (k_omega) pic {lkj_dist}; \node[above=0.7cm of k_omega, font=\bfseries] {$\Omega$}; \coordinate (k_sig_b) at (13, 0); \draw (k_sig_b) pic {exp_dist}; \node[above=0.7cm of k_sig_b, font=\bfseries] {$\sigma_\beta$}; \coordinate (k_mu_b) at (15, 0); \draw (k_mu_b) pic {norm_dist}; \node[above=0.7cm of k_mu_b, font=\bfseries] {$\mu_\beta$}; \coordinate (k_mvn) at (11, -3.5); \draw (k_mvn) pic {mvn_dist}; \node[above=0.9cm of k_mvn, font=\bfseries, align=center] {Agent Params \\ $(\alpha_j,\,\beta_j)$}; \node[discrete, observed, minimum size=0.9cm] (k_m) at (9, -7) {$m_{ij}$}; \node[above=0.5cm of k_m, font=\bfseries] {Memory}; \coordinate (k_h) at (13, -7); \draw (k_h) pic {bern_dist}; \node[above=0.9cm of k_h, font=\bfseries] {Choice ($h_{ij}$)}; \draw[plate, dashed] (7.5, -5.2) rectangle (14.5, -2.2) node[below left, font=\small\bfseries] {$j=1\dots J$}; \draw[plate, dotted] (7.5, -8.2) rectangle (14.5, -5.8) node[below left, font=\small\bfseries] {$i=1\dots N$}; \draw[stochastic] (k_mu_a) -- (k_mvn); \draw[stochastic] (k_sig_a) -- (k_mvn); \draw[stochastic] (k_omega) -- (k_mvn) node[midway, right, font=\large] {$\sim$}; \draw[stochastic] (k_sig_b) -- (k_mvn); \draw[stochastic] (k_mu_b) -- (k_mvn); \draw[stochastic] (k_mvn) -- (k_h) node[midway, right, font=\large] {$\sim$}; \draw[deterministic] (k_m) -- (12.5, -7); \end{tikzpicture} ``` > **Computational Note: The Geometry of Hierarchical Models** > The mathematical definition provided above—specifically $\theta_{\text{logit}, j} \sim \text{Normal}(\mu_{\theta}, \sigma_{\theta})$—is known as the **Centered Parameterization**. It is the most direct conceptual translation of our theory. However, Hamiltonian Monte Carlo (NUTS) often struggles with this exact form. As $\sigma_\theta$ approaches zero, the target geometry constrains into a region of extreme curvature known as "Neal's Funnel." In the next phase, we will purposefully implement this naive form to diagnose its pathological failures before introducing the coordinate-transform solution (Non-Centered Parameterization). ## Phase 1: Generative Plausibility of the Hierarchical Biased Agent Before we write the inference algorithm in Stan, we must construct the forward generative model in R. This code must strictly mirror the mathematical dependencies formalized in our DAG. We model the parameters on the logit (log-odds) scale. As discussed in previous chapters, probabilities are bounded between $0$ and $1$. If we attempt to place a hierarchical Normal distribution directly on the probability space, the probability mass will inevitably spill over the bounds, creating mathematically impossible values (e.g., a probability of $1.2$). The logit transform $\text{logit}(p) = \log(\frac{p}{1-p})$ maps the bounded $[0, 1]$ interval to the unbounded real line $(-\infty, \infty)$. The HMC sampler traverses this unbounded space smoothly, and we use the inverse-logit function (plogis() in R, inv_logit() in Stan) to map the inferences back to probabilities at the likelihood level. Let us build a unified generative function. In accordance with the rigorous Bayesian workflow, this function is designed to either: * Draw hyperparameters directly from our stated priors (for Prior Predictive Checks). * Accept fixed "ground-truth" hyperparameters (for Parameter Recovery in Phase 2). ```{r ch07_simulate_hierarchical_biased} #' Simulate Hierarchical Biased Agents #' #' @param n_agents Integer, number of agents (J). #' @param n_trials_min Integer, minimum number of trials per agent. #' @param n_trials_max Integer, maximum number of trials per agent. #' @param mu_theta Numeric, population mean on the logit scale. If NULL, draws from prior. #' @param sigma_theta Numeric, population SD on the logit scale. If NULL, draws from prior. #' @param seed Integer, for exact computational reproducibility. #' #' @return A tidy tibble containing both latent parameters and observed choices. simulate_hierarchical_biased <- function(n_agents = 100, n_trials_min = 20, n_trials_max = 200, mu_theta = NULL, sigma_theta = NULL, seed = NULL) { if (!is.null(seed)) set.seed(seed) # 1. Sample Population Hyperparameters (if not provided) if (is.null(mu_theta)) mu_theta <- rnorm(1, mean = 0, sd = 1.5) if (is.null(sigma_theta)) sigma_theta <- rexp(1, rate = 1) # 2. Sample Individual Latent Parameters and Trial Counts # We purposefully vary N_j to demonstrate shrinkage later. agent_params <- tibble( agent_id = 1:n_agents, n_trials = sample(n_trials_min:n_trials_max, n_agents, replace = TRUE), theta_logit = rnorm(n_agents, mean = mu_theta, sd = sigma_theta), theta_prob = plogis(theta_logit) ) # 3. Generate Observed Data (Likelihood) # h_ij ~ Bernoulli(inv_logit(theta_logit_j)) sim_data <- agent_params |> mutate( trial_data = map2(theta_prob, n_trials, function(prob, n) { tibble( trial = 1:n, choice = rbinom(n, size = 1, prob = prob) ) }) ) |> unnest(trial_data) attr(sim_data, "hyperparameters") <- list( mu_theta = mu_theta, sigma_theta = sigma_theta ) return(sim_data) } ``` ### Executing the Prior Predictive Check: The Logit-Normal Artifact When placing priors on an unbounded scale (logit) that will be transformed to a bounded scale (probability), our intuition often fails. A seemingly "uninformative" prior like $\mu \sim \text{Normal}(0, 10)$ on the logit scale actually forces almost all probability mass to exactly $0$ or $1$ on the outcome scale, implying that all agents are entirely deterministic. Let us generate 12 datasets drawn entirely from our chosen priors ($\mu_\theta \sim N(0, 1.5), \sigma_\theta \sim \text{Exp}(1)$) to verify they generate scientifically plausible, heterogeneous populations. ```{r ch07_prior_predictive_check_viz} # Generate 12 distinct "universes" from the priors prior_sims <- map_dfr(1:12, function(sim_id) { # We use a fixed seed per universe for stable workflow mapping sim_data <- simulate_hierarchical_biased( n_agents = 100, n_trials_min = 1, n_trials_max = 1, # Only need 1 trial to extract parameters seed = sim_id * 10 ) sim_data |> distinct(agent_id, theta_prob) |> mutate(simulation_id = paste("Prior Universe", sim_id)) }) # Visualize the generative plausibility ggplot(prior_sims, aes(x = theta_prob)) + geom_density(fill = "steelblue", alpha = 0.5, color = NA) + facet_wrap(~simulation_id, ncol = 4) + scale_x_continuous(limits = c(0, 1), breaks = c(0, 0.5, 1)) + labs( title = "Prior Predictive Check: Implied Population Distributions", subtitle = "Healthy = populations spread across (0,1), not piled at 0 or 1. Each panel is one prior universe.", x = "Agent Bias (Probability Scale)", y = "Density" ) ``` ## Phase 2: Data Preparation and the Centered Stan Model Before we unleash the Hamiltonian Monte Carlo (HMC) engine, we must format our simulated data structure to match the static typing of the Stan compiler. Because our agents have varying numbers of trials ($N_j$), we cannot pass a simple $J \times N$ matrix. Instead, we use a "long format" array indexing strategy, where a specific vector maps every trial $n \in \{1, \dots, N_{\text{total}}\}$ to its generating agent $j$. ```{r ch07_prep_stan_data} #' Prepare Data for Stan #' #' Maps a tidy tibble into the strictly typed list required by CmdStanR. #' #' @param sim_data Tidy tibble generated by simulate_hierarchical_biased #' @return A named list formatted for CmdStanR prep_stan_data <- function(sim_data) { list( N_total = nrow(sim_data), N_subjects = length(unique(sim_data$agent_id)), subj_id = sim_data$agent_id, y = sim_data$choice, prior_mu_theta_mu = 0, prior_mu_theta_sigma = 1.5, prior_sigma_theta_lambda = 1, run_diagnostics = 1 ) } # Execute the preparation on a specifically generated universe for recovery sim_data_centered <- simulate_hierarchical_biased( n_agents = 50, n_trials_min = 30, # Intentional sparsity to force shrinkage n_trials_max = 120, mu_theta = 1.38,# slightly below .8 in prob sigma_theta = 0.8, seed = 1981 ) stan_data_centered <- prep_stan_data(sim_data_centered) ``` ### Stan Model: Centered Parameterization (CP) We will now implement the Centered Parameterization (CP) of the hierarchical biased agent. In CP, we explicitly model the individual latent parameters ($\theta_{\text{logit}, j}$) - determining their probability of choosing "right" versus "left" - as being drawn directly from a population distribution with mean `thetaM` and standard deviation `thetaSD`. Mathematically, this translates to the following hierarchical structure, which strictly mirrors our Data Generating Process (DGP) from Phase 1: * **Population Level**: $$\mu_\theta \sim \text{Normal}(0, 1.5)$$ $$\sigma_\theta \sim \text{Exponential}(1)$$ * **Individual Level**: $$\theta_{\text{logit}, j} \sim \text{Normal}(\mu_\theta, \sigma_\theta)$$ * **Data Level (Likelihood)**: $$h_{ij} \sim \text{Bernoulli}(\text{logit}^{-1}(\theta_{\text{logit}, j}))$$ This architecture mathematically instantiates partial pooling: the estimation of an individual's bias balances their specific choice history ($h_{ij}$) with the population's overarching distributional geometry ($\mu_\theta, \sigma_\theta$). > The Anatomy of a Potential Pathology: > Look closely at the individual-level formulation: $\theta_{\text{logit}, j} \sim \text{Normal}(\mu_\theta, \sigma_\theta)$. The permissible geometric space for $\theta_{\text{logit}, j}$ is dynamically dictated by $\sigma_\theta$. If the data suggests the population is highly homogenous, the sampler will push $\sigma_\theta \to 0$. As it does, the parameter space for $\theta_{\text{logit}, j}$ is violently crushed toward $\mu_\theta$, creating a funnel of infinite curvature. We are fitting this model precisely to watch the NUTS sampler fail in this region. Let us write the Stan code. ```{r ch07_stan_model_biased_cp} stan_biased_ml_cp_code <- " // Multilevel Biased Agent Model: Centered Parameterization (CP) data { int<lower=1> N_total; // Total number of observations int<lower=1> N_subjects; // Total number of agents array[N_total] int<lower=1, upper=N_subjects> subj_id; // Agent ID index array[N_total] int<lower=0, upper=1> y; // Observed choices real prior_mu_theta_mu; real<lower=0> prior_mu_theta_sigma; real<lower=0> prior_sigma_theta_lambda; int<lower=0, upper=1> run_diagnostics; } parameters { // Population-level hyperparameters real mu_theta; // Population mean bias (logit scale) real<lower=0> sigma_theta; // Population SD of bias (logit scale) // Individual-level parameters (CENTERED) vector[N_subjects] theta_logit; // Agent-specific biases } model { // 1. Population-level hyperpriors target += normal_lpdf(mu_theta | prior_mu_theta_mu, prior_mu_theta_sigma); target += exponential_lpdf(sigma_theta | prior_sigma_theta_lambda); // 2. Hierarchical prior (CENTERED) target += normal_lpdf(theta_logit | mu_theta, sigma_theta); // 3. Likelihood target += bernoulli_logit_lpmf(y | theta_logit[subj_id]); } generated quantities { // Transform hyperparameters back to the probability (scale) real<lower=0, upper=1> mu_theta_prob = inv_logit(mu_theta); vector<lower=0, upper=1>[N_subjects] theta_prob = inv_logit(theta_logit); // Prior samples — cheap scalar draws; declared at top level so they appear in draws output real mu_theta_prior = normal_rng(prior_mu_theta_mu, prior_mu_theta_sigma); real sigma_theta_prior = exponential_rng(prior_sigma_theta_lambda); // Initialize arrays for conditional diagnostics vector[N_total] log_lik; array[N_total] int y_post_rep; array[N_total] int y_prior_rep; // Accumulator for joint prior log-density real lprior; lprior = normal_lpdf(mu_theta | prior_mu_theta_mu, prior_mu_theta_sigma) + exponential_lpdf(sigma_theta | prior_sigma_theta_lambda) + normal_lpdf(theta_logit | mu_theta, sigma_theta); if (run_diagnostics) { for (i in 1:N_total) { log_lik[i] = bernoulli_logit_lpmf(y[i] | theta_logit[subj_id[i]]); y_post_rep[i] = bernoulli_logit_rng(theta_logit[subj_id[i]]); real theta_logit_prior_i = normal_rng(mu_theta_prior, sigma_theta_prior); y_prior_rep[i] = bernoulli_logit_rng(theta_logit_prior_i); } } } " # Save stan_file_biased_ml_cp <- here::here("stan", "ch07_multilevel_biased_cp.stan") writeLines(stan_biased_ml_cp_code, stan_file_biased_ml_cp) # Compile the CP model utilizing the CmdStan C++ backend mod_biased_ml_cp <- cmdstan_model(stan_file_biased_ml_cp) ``` ### Fitting the model This not really necessary yet, but a rigorous workflow demands an explicit initialization strategy, so we have all the pieces in place for more complex models. We will initialize the NUTS algorithm by drawing starting values directly from our generative priors. This ensures the chains begin in scientifically plausible regions of the parameter space, rather than Stan's arbitrary default $U[-2, 2]$ on the unconstrained space. ```{r ch07_Fit_CP_Model} fit_cp_path <- here::here("simmodels", "ch07_fit_cp.rds") init_cp_list <- lapply(1:4, function(i) { list( mu_theta = rnorm(1, 0, 1.5), sigma_theta = rexp(1, 1), theta_logit = rnorm(stan_data_centered$N_subjects, 0, 0.5) ) }) if (regenerate_simulations || !file.exists(fit_cp_path)) { fit_cp <- mod_biased_ml_cp$sample( data = stan_data_centered, seed = 123, chains = 4, parallel_chains = 4, iter_warmup = 1000, iter_sampling = 1000, init = init_cp_list, adapt_delta = 0.85 ) fit_cp$save_object(fit_cp_path) } else { fit_cp <- readRDS(fit_cp_path) } ``` ### Assessing the fitting process Even if the Stan compiler finishes without throwing explicit warning messages, our workflow demands that we systematically verify the geometry of the posterior and the efficiency of the sampler. Hamiltonian Monte Carlo (HMC) is a physical simulation—it explores the posterior by simulating a particle gliding over a frictionless manifold dictated by the joint log-probability density. We must prove the numerical integration of this physics engine remained stable. Let us systematically unpack the computational diagnostics of our Centered Parameterization fit. 1. The Diagnostic Summary: Integration Stability First, we ask cmdstanr for a high-level summary of the NUTS sampler's performance. ```{r ch07_cp_diagnostic_summary} #fit_cp$cmdstan_diagnose() fit_cp$diagnostic_summary() ``` We are looking for strict zeros in num_divergent. A divergent transition occurs when the numerical integration process (the leapfrog steps) fails to accurately map the curvature of the posterior manifold, causing the simulated trajectory to fly off the energy level. Even one divergence compromises the mathematical guarantee of the posterior. #### Chain Mixing and Stationarity ($\hat{R}$) Did our independent Markov chains converge to the same stationary distribution? We visualize this using trace plots and quantify it using $\hat{R}$ (R-hat). ```{r ch07_cp_diagnostics_trace} # Extract the posterior array draws_cp <- fit_cp$draws() # Visual inspection of chain mixing mcmc_trace(draws_cp, pars = c("mu_theta", "sigma_theta"), np = nuts_params(fit_cp)) + labs( title = "Trace Plots: Hyperparameters (CP Model)", subtitle = "Healthy = chains overlap and fluctuate around a stable band. Divergent red triangles mark sampler failures." ) # Quantitative assessment of convergence and efficiency cp_summary <- draws_cp |> subset_draws(variable = c("mu_theta", "sigma_theta", "theta_logit")) |> summarise_draws(default_convergence_measures()) # We require R-hat < 1.01 for all parameters cp_summary |> filter(variable %in% c("mu_theta", "sigma_theta", "theta_logit[1]", "theta_logit[2]")) |> knitr::kable(digits = 3) ``` The potential scale reduction factor ($\hat{R}$) compares the variance within a single chain to the variance between all chains. If the chains are exploring the same typical set, this ratio converges to 1. Our strict threshold is $\hat{R} < 1.01$. #### Effective Sample Size (ESS) Markov chains are inherently autocorrelated. Stan might draw 4000 samples, but we need to know the Effective Sample Size—the equivalent number of completely independent draws.In your summary table, look at ess_bulk (efficiency of the posterior mean estimate) and ess_tail (efficiency of the posterior interval estimates). We strictly require ESS > 400 per parameter. If the geometry is difficult, ESS plummets, indicating the sampler is taking highly correlated, inefficient steps. #### Energy and E-BFMI Finally, we evaluate the Energy-Bayesian Fraction of Missing Information (E-BFMI). HMC relies on injecting kinetic energy (momentum) into the system to explore the tails of the target distribution. ```{r ch07_cp_diagnostics_energy} # Assess the momentum resampling efficiency mcmc_nuts_energy(nuts_params(fit_cp)) + labs( title = "Energy Transitions", subtitle = "Marginal and transition energy distributions must align." ) ``` If the model geometry is well-behaved, the energy transition distribution (the energy proposed by the momentum resampling) will neatly match the marginal energy distribution (the actual energy of the posterior). An E-BFMI below 0.2 indicates the sampler is taking tiny, localized steps and failing to explore the global geometry. ### Prior-Posterior Update Checks: Quantifying Learning Bayesian inference is the mathematical process of allocating credibility across parameter values. The posterior is simply the prior updated by the likelihood. If the posterior is identical to the prior, the data contained no information. If the posterior is violently pushed against the boundaries of the prior, our priors were likely misspecified. Let us visualize how much the data narrowed our uncertainty about the population parameters. Because we explicitly generated prior draws in our Stan generated quantities block, this comparison is trivial. ```{r ch07_cp_pp_update_viz} # Extract prior and posterior draws for the population mean and variance draws_df <- as_draws_df(draws_cp) # Reshape for ggplot pp_update_data <- draws_df |> dplyr::select( mu_posterior = mu_theta, mu_prior = mu_theta_prior, sigma_posterior = sigma_theta, sigma_prior = sigma_theta_prior ) |> pivot_longer(everything(), names_to = c("parameter", "distribution"), names_sep = "_") # Extract the true generative hyperparameters we saved as attributes true_mu <- attr(sim_data_centered, "hyperparameters")$mu_theta true_sigma <- attr(sim_data_centered, "hyperparameters")$sigma_theta # Create a reference dataframe for the ground truth matching the facet names true_pop_df <- tibble( parameter = c("mu", "sigma"), true_value = c(true_mu, true_sigma) ) # Visualize the update ggplot(pp_update_data, aes(x = value)) + geom_histogram(aes(fill = distribution), alpha = 0.6, color = NA) + # Add the true values as dashed lines geom_vline(data = true_pop_df, aes(xintercept = true_value, linetype = "True Value"), color = "darkred", linewidth = 1) + facet_wrap( ~parameter, scales = "free", # Clean up the facet labels for publication quality labeller = as_labeller(c("mu" = "Population Mean (\u03BC)", "sigma" = "Population SD (\u03C3)")) ) + scale_fill_manual( name = "Distribution", values = c("prior" = "gray70", "posterior" = "steelblue") ) + scale_linetype_manual( name = "Ground Truth", values = c("True Value" = "dashed") ) + labs( title = "Prior-Posterior Update Checks: Population Hyperparameters", subtitle = "Healthy = posterior (blue) narrower than prior (grey) and centred on true value (dashed red).", x = "Parameter Value (Logit Scale)", y = "Density") ``` A healthy update shows a posterior that is significantly narrower than the prior (indicating learning) and comfortably contained within the prior's mass (indicating the prior did not inappropriately censor the likelihood). ### Posterior Parameter Correlations Parameters in hierarchical models are structurally correlated. By plotting the joint posterior geometry of multiple parameters, we can detect pathological collinearity or explicitly visualize the boundary of possible Neal's Funnel. We will use a pairs plot to examine the joint distribution of the population mean ($\mu_\theta$), population standard deviation ($\sigma_\theta$), and a single agent's latent bias ($\theta_{\text{logit}, 1}$). ```{r ch07_cp_pairs_plot} # Visualize joint posterior geometries and identify structural correlations mcmc_pairs( draws_cp, pars = c("mu_theta", "sigma_theta", "theta_logit[1]"), np = nuts_params(fit_cp), # Injects divergence data (red crosses) if any exist diag_fun = "dens", off_diag_fun = "hex" ) ``` ### Posterior Predictive Checks (PPC): Generative Adequacy Can our fitted model actually generate the reality we observed? If we simulate data from the posterior distribution, does it look like the empirical data? If not, the model is structurally misspecified, regardless of how well the chains mixed. We will compare the observed choice rate for each agent against the distribution of choice rates predicted by the posterior. ```{r ch07_cp_ppc_viz} # Safely extract vectors y_obs <- as.numeric(stan_data_centered$y) agent_idx <- as.numeric(stan_data_centered$subj_id) # Use cmdstanr's native extraction to guarantee a strict S x N matrix y_rep <- fit_cp$draws("y_post_rep", format = "matrix") # Select a random subset of agents to avoid overcrowded, illegible plots n_sample <- 12 sampled_agents <- sample(unique(agent_idx), n_sample) # Create a logical mask mapping which trials belong to the sampled agents keep_mask <- agent_idx %in% sampled_agents # Subset the data exactly y_obs_sub <- y_obs[keep_mask] agent_idx_sub <- agent_idx[keep_mask] # For y_rep, we subset the columns (trials) y_rep_sub <- y_rep[ , keep_mask] # Define a custom statistic to explicitly state our cognitive metric # (and simultaneously silence the bayesplot warning) prop_right <- function(x) mean(x) # Execute the grouped Posterior Predictive Check ppc_stat_grouped( y = y_obs_sub, yrep = y_rep_sub, group = agent_idx_sub, stat = "prop_right" ) + labs( title = "Posterior Predictive Check: Agent-Level Choice Rates", subtitle = paste0("Healthy = dark line (observed) inside histogram (predicted). Showing ", n_sample, " agents."), x = "Proportion of 'Right' Choices" ) ``` ### Calibration and Recovery We have proven the model fits the data and generates plausible predictions. Now, we must prove that the parameters we extract are actually the true parameters that generated the data. Because this is a hierarchical model, we must assess recovery at both the population and individual levels. #### Individual-Level Recovery Having validated the top of the hierarchy (in the prior-posterior update check), we move to the individual agents. We plot the posterior estimates against the ground truth. We expect the estimates to align with the diagonal identity line, but crucially, we expect agents with fewer trials to deviate from the raw truth, demonstrating shrinkage toward the estimated population mean. ```{r ch07_cp_recovery_shrinkage} # 1. Extract the full summary (which defaults to including median, q5, and q95) recovery_summary <- draws_cp |> posterior::subset_draws(variable = "theta_logit") |> posterior::summarise_draws() |> mutate(agent_id = as.integer(str_extract(variable, "(?<=\\[)\\d+(?=\\])"))) # 2. Extract the true generative parameters directly from the simulated dataset true_agent_params <- sim_data_centered |> distinct(agent_id, n_trials, theta_logit, theta_prob) # 3. Join the posterior estimates with the ground truth recovery_df <- true_agent_params |> left_join(recovery_summary, by = "agent_id") # 4. Plot Recovery with Shrinkage Dynamics ggplot(recovery_df, aes(x = theta_logit, y = median)) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") + geom_errorbar(aes(ymin = q5, ymax = q95), width = 0, alpha = 0.5, color = "steelblue") + geom_point(aes(size = n_trials), color = "midnightblue", alpha = 0.8) + labs( title = "Parameter Recovery: Hierarchical Biased Agent", subtitle = "Healthy = points near diagonal (y = x). Small dots shrink more — that is partial pooling working correctly.", x = "True Parameter Value (Logit Scale)", y = "Posterior Median Estimate (Logit Scale)", size = "Number of Trials (N_j)" ) ``` ### Prior sensitivity ```{r ch07_cp_prior_sensitivity} # 1. Calculate the sensitivity diagnostics # We target the hyperparameters, as they govern the hierarchical shrinkage sens_results <- powerscale_sensitivity( fit_cp, # or fit_ncp, provided lprior is defined variable = c("mu_theta", "sigma_theta") ) cat("=== PRIOR SENSITIVITY DIAGNOSTICS ===\n") print(sens_results) ps_seq <- priorsense::powerscale_sequence( fit_cp, variable = c("mu_theta", "sigma_theta") ) # Plot the density shifts priorsense::powerscale_plot_dens(ps_seq) + theme_classic() + labs( title = "Prior Sensitivity: Density Perturbation", subtitle = "Does scaling the prior/likelihood drastically alter our geometric beliefs?" ) # Plot the quantitative estimate shifts priorsense::powerscale_plot_quantities(ps_seq) + theme_classic() + labs( title = "Prior Sensitivity: Parameter Estimates", subtitle = "Horizontal lines indicate the baseline posterior bounds. Deviations indicate sensitivity." ) ``` The sensitivity analysis reveals that our population variance ($\sigma_\theta$) exhibits a prior-data conflict. This is a feature, not a bug, of partial pooling: our Exponential prior applies regularizing pressure toward zero variance (complete pooling), while the empirical data resists this pull to account for individual differences. ### Simulation-Based Calibration Checks (SBC) Parameter recovery on a single dataset is anecdotal. A single success might just mean we got lucky with a benign region of the parameter space. To mathematically guarantee that our NUTS sampler is unbiased and our posterior uncertainties are perfectly calibrated across the entire prior space, we must perform Simulation-Based Calibration Checks (SBC). The logic of SBC is structurally elegant: if we generate a ground-truth parameter $\theta_{\text{true}}$ from the prior, generate data from the data model conditional on $\theta_{\text{true}}$, and compute the posterior, then the posterior must not systematically over- or under-estimate $\theta_{\text{true}}$. Under a well-calibrated sampler, the rank of the true parameter value within the sorted posterior draws will be uniformly distributed:$$r \sim \text{Uniform}(0, S)$$where $S$ is the number of posterior samples. If the model is overconfident (the posterior is too narrow and misses the truth), the rank histogram will be U-shaped. If it is underconfident (the posterior is too wide), it will form an inverted U-shape. Asymmetries indicate directional bias. ### Setting up the SBC Pipeline The SBC package requires a strict data structure. It expects a generator function that returns a list of two components: * variables: A named list of the true parameter values. * generated: The formatted data list required by our Stan model. Let us build a strict wrapper around our simulation and data preparation functions to satisfy this requirement. ```{r ch07_sbc_generator} # Define the strict SBC generator wrapper generate_sbc_dataset <- function(n_agents = 50, n_trials_min = 20, n_trials_max = 120) { # 1. Simulate the universe (draws hyperparams and agent params from priors) sim_data <- simulate_hierarchical_biased( n_agents = n_agents, n_trials_min = n_trials_min, n_trials_max = n_trials_max ) # 2. Extract the true generative hyperparameters mu_true <- attr(sim_data, "hyperparameters")$mu_theta sigma_true <- attr(sim_data, "hyperparameters")$sigma_theta # 3. Extract the true individual latent parameters (must preserve agent order) theta_true <- sim_data |> distinct(agent_id, theta_logit) |> arrange(agent_id) |> pull(theta_logit) # 4. Prepare the strictly typed list for CmdStanR stan_data <- prep_stan_data(sim_data) # 5. Return the exact structure required by the SBC package list( variables = list( mu_theta = mu_true, sigma_theta = sigma_true, theta_logit = theta_true ), generated = stan_data ) } ``` A production-grade SBC requires $N_{\text{sim}} \ge 1000$ to smooth the rank histograms. For computational feasibility in this chapter, we will run 200 simulations. Because we established our future parallel processing plan in the setup chunk, the SBC package will automatically distribute these models across your CPU cores. ```{r ch07_sbc_cp_execution} sbc_cp_path <- here::here("simmodels", "ch07_sbc_cp.rds") generator_cp <- SBC_generator_function( generate_sbc_dataset, n_agents = 50, n_trials_min = 30, n_trials_max = 120 ) backend_cp <- SBC_backend_cmdstan_sample( mod_biased_ml_cp, iter_warmup = 1000, iter_sampling = 1000, chains = 1, refresh = 0 ) if (regenerate_simulations || !file.exists(sbc_cp_path)) { future::plan(future::multisession, workers = 8) sbc_datasets_cp <- generate_datasets(generator_cp, 200) sbc_results_cp <- compute_SBC(datasets = sbc_datasets_cp, backend = backend_cp, keep_fits = FALSE) saveRDS( list(datasets = sbc_datasets_cp, results = sbc_results_cp), sbc_cp_path ) } else { sbc_obj <- readRDS(sbc_cp_path) sbc_datasets_cp <- sbc_obj$datasets sbc_results_cp <- sbc_obj$results } # Visualize plot_rank_hist(sbc_results_cp, variables = c("mu_theta", "sigma_theta")) + labs( title = "SBC Rank Histograms: Centered Parameterization", subtitle = "Healthy = bars flat within grey band. U-shape = overconfident. Here, U-shapes reveal the funnel pathology." ) plot_ecdf_diff(sbc_results_cp, variables = c("mu_theta", "sigma_theta")) + labs( title = "SBC ECDF Difference: Centered Parameterization", subtitle = "Healthy = blue line stays near zero. Excursions here correlate with the divergence failures shown below." ) # 1. Extract the lightweight stats dataframe from the SBC object sbc_stats_df <- sbc_results_cp$stats # 2. Filter for the population-level parameters you care about # (e.g., mu_alpha, mu_beta, rho, or even sigma[1]) recovery_df <- sbc_stats_df |> filter(variable %in% c("mu_theta")) # 3. Create the Parameter Recovery Identity Plot p_sbc_recovery <- ggplot(recovery_df, aes(x = simulated_value, y = median)) + # The Identity Line (Perfect Recovery) geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50", linewidth = 1) + # The Posterior Estimates with 90% Credible Intervals geom_errorbar(aes(ymin = q5, ymax = q95), width = 0, alpha = 0.3, color = "steelblue") + geom_point(color = "midnightblue", alpha = 0.7, size = 2) + # Facet by parameter facet_wrap(~variable, scales = "free") + labs( title = "Parameter Recovery (Extracted Directly from SBC)", subtitle = "Points represent the posterior median. Bars are 90% Credible Intervals.", x = "True Generative Value (Drawn from Prior)", y = "Posterior Median Estimate" ) + theme_classic() + theme(strip.background = element_rect(fill = "gray95", color = NA)) print(p_sbc_recovery) # 4. Quantify the Recovery (Correlation and RMSE) recovery_metrics <- recovery_df |> group_by(variable) |> summarize( Pearson_r = cor(simulated_value, median), Spearman_rho = cor(simulated_value, median, method = "spearman"), RMSE = sqrt(mean((simulated_value - median)^2)), .groups = "drop" ) cat("=== PARAMETER RECOVERY QUANTIFICATION ===\n") print(knitr::kable(recovery_metrics, digits = 3)) ``` Let's explore the warnings we got. ```{r ch07_sbc_cp_divergence_viz} # Extract diagnostics now so they are available for the targeted refit below diags_cp <- sbc_results_cp$backend_diagnostics cp_fail_rate <- mean(diags_cp$n_divergent > 0) * 100 # Divergence count distribution across all 200 universes diags_cp |> ggplot(aes(x = n_divergent)) + geom_histogram(binwidth = 1, fill = "darkred", alpha = 0.8) + labs( title = "CP Divergence Counts Across 200 SBC Universes", subtitle = sprintf("%.0f%% of universes produced at least one divergence. Even a single divergence invalidates the geometric guarantee of NUTS.", cp_fail_rate), x = "Number of Divergent Transitions", y = "Number of SBC Universes" ) # Locate the divergence-prone region: do they concentrate at small sigma_theta? divergence_profile <- posterior::as_draws_df(sbc_datasets_cp$variables) |> as_tibble() |> mutate(sim_id = row_number()) |> left_join( diags_cp |> mutate(sim_id = row_number()) |> dplyr::select(sim_id, n_divergent), by = "sim_id" ) |> mutate(had_divergence = n_divergent > 0) ggplot(divergence_profile, aes(x = sigma_theta, y = n_divergent, color = had_divergence)) + geom_point(alpha = 0.5, size = 1.5) + scale_color_manual( values = c("FALSE" = "steelblue", "TRUE" = "darkred"), labels = c("FALSE" = "Clean", "TRUE" = "Divergent"), name = NULL ) + labs( title = "CP Divergences Concentrate at Small True sigma_theta", subtitle = "The funnel geometry only manifests when population variance is small.", x = expression("True " * sigma[theta]), y = "Number of Divergent Transitions" ) ggplot(divergence_profile, aes(x = mu_theta, y = sigma_theta, color = had_divergence)) + geom_point(alpha = 0.7, size = 2) + scale_color_manual( name = "Sampler Status", values = c("FALSE" = "steelblue", "TRUE" = "darkred"), labels = c("Healthy", "Diverged") ) + labs( title = "The Anatomy of Failure: Where does CP break?", x = "True Population Mean (mu_theta)", y = "True Population Variance (sigma_theta)" ) + theme_classic() ``` ```{r ch07_cp_hard_universe_fit} hard_universe_path <- here::here("simmodels", "ch07_fit_cp_hard.rds") hard_universe_idx <- divergence_profile |> filter(had_divergence) |> slice_min(sigma_theta, n = 1) |> pull(sim_id) hard_stan_data <- sbc_datasets_cp$generated[[hard_universe_idx]] # Ensure compatibility with updated Stan model if cache is stale if (!"prior_mu_theta_mu" %in% names(hard_stan_data)) { hard_stan_data$prior_mu_theta_mu <- 0 hard_stan_data$prior_mu_theta_sigma <- 1.5 hard_stan_data$prior_sigma_theta_lambda <- 1 hard_stan_data$run_diagnostics <- 1 } hard_true_sigma <- divergence_profile$sigma_theta[hard_universe_idx] cat(sprintf( "Selected universe %d: true sigma_theta = %.3f\n", hard_universe_idx, hard_true_sigma )) if (regenerate_simulations || !file.exists(hard_universe_path)) { fit_cp_hard <- mod_biased_ml_cp$sample( data = hard_stan_data, seed = 999, chains = 4, parallel_chains = 4, iter_warmup = 1000, iter_sampling = 1000, init = init_cp_list, adapt_delta = 0.85, refresh = 0 ) fit_cp_hard$save_object(hard_universe_path) } else { fit_cp_hard <- readRDS(hard_universe_path) } # 1. Generate the pairs plot p_funnel <- mcmc_pairs( fit_cp_hard$draws(), pars = c("mu_theta", "sigma_theta", "theta_logit[1]"), np = nuts_params(fit_cp_hard), np_style = pairs_style_np(div_color = "darkred", div_alpha = 0.8, div_size = 2), diag_fun = "dens", off_diag_fun = "hex" ) wrap_elements(p_funnel) + plot_annotation( title = sprintf("Neal's Funnel: CP Geometry at True sigma_theta = %.3f", hard_true_sigma), subtitle = "Red points are divergent transitions crashing at the funnel neck." ) ``` Here we see a funnel. As sigma goes to zero, theta_logit[1] can only have one value: the value of the population (mu_theta). The curvature becomes infinitely steep, and the NUTS sampler fails to explore this region, resulting in divergent transitions (red points). This is the geometric pathology of the Centered Parameterization. ### Non-Centered Parameterization (NCP) as the Solution > *Why does a coordinate transform fix a geometry problem?* Neal's Funnel > is not a flaw in the model — it is a flaw in the *space the sampler > must navigate*. In the CP, individual parameters $\theta_j$ live in a > space whose width is controlled by $\sigma_\theta$: as $\sigma_\theta > \to 0$ the space collapses. In the NCP, we sample standardized offsets > $z_j \sim N(0,1)$ whose geometry is always the same — the funnel > disappears because the sampler no longer navigates the narrowing space > directly. The parameters $\theta_j$ are reconstructed *after* sampling, > not during it. We have reached a critical juncture in our Bayesian workflow. We showed computationally (via SBC and diagnostic pairs plots) that the Centered Parameterization (CP) creates a pathological geometry—Neal’s Funnel—that makes it difficult for the NUTS sampler to efficiently traverse the posterior space when data is sparse or population variance is small. When the model is misbehaving, sometimes we need to reconsider our theory and/or our data. Yet, in this case, we can do something to the math underlying the model to topologically transform the posterior space the model is sampling: a coordinate transformation known as the Non-Centered Parameterization (NCP). #### The Geometry of the Transformation In the Centered Parameterization, the hierarchical dependency is explicit and direct:$$\theta_{\text{logit}, j} \sim \text{Normal}(\mu_\theta, \sigma_\theta)$$ This forces the sampler to navigate a space where the permissible bounds of $\theta$ are dynamically restricted by $\sigma$, creating the funnel. In the Non-Centered Parameterization, we decouple this dependency, in a way that is not dissimilar from rescaling your predictors - a common procedure you probably learned in your statistical course to help model convergence and interpretation. We introduce an auxiliary, standardized latent variable $z_j$ for each agent. We force the sampler to explore this $z$-space, which is simply a perfectly behaved Normal distribution:$$z_j \sim \text{Normal}(0, 1)$$ Then, we deterministically shift and scale $z_j$ to reconstruct our target cognitive parameter:$$\theta_{\text{logit}, j} = \mu_\theta + z_j \times \sigma_\theta$$ Mathematically, this is identical to the CP ($z \sim N(0,1) \implies \mu + z\sigma \sim N(\mu, \sigma)$). However, the computational geometry is entirely transformed and when the sampler traverses it, the problems it meets are likely different. ### Stan Model: Multilevel Biased Agent (NCP) Here is the NCP version of the model. Note the changes in the parameters, transformed parameters, and model blocks compared to the CP version. ```{r ch07_stan_model_biased_ncp} stan_biased_ml_ncp_code <- " // Multilevel Biased Agent Model (Non-Centered Parameterization) // Estimates individual biases drawn from a population distribution. data { int<lower=1> N_total; // Total number of observations int<lower=1> N_subjects; // Number of agents array[N_total] int<lower=1, upper=N_subjects> subj_id; // Agent ID array[N_total] int<lower=0, upper=1> y; // Observed choices real prior_mu_theta_mu; real<lower=0> prior_mu_theta_sigma; real<lower=0> prior_sigma_theta_lambda; int<lower=0, upper=1> run_diagnostics; } parameters { // Population-level parameters real mu_theta; // Population mean bias (logit scale) real<lower=0> sigma_theta; // Population SD of bias (logit scale) // Individual-level parameters (standardized deviations, z-scores of the agents) vector[N_subjects] z_theta; // Non-centered individual effects } transformed parameters { // Deterministic reconstruction of the cognitive parameters vector[N_subjects] theta_logit = mu_theta + z_theta * sigma_theta; } model { // Priors for population-level parameters target += normal_lpdf(mu_theta | prior_mu_theta_mu, prior_mu_theta_sigma); target += exponential_lpdf(sigma_theta | prior_sigma_theta_lambda); /// Individual level prior (NON-CENTERED) target += std_normal_lpdf(z_theta); // Likelihood target += bernoulli_logit_lpmf(y | theta_logit[subj_id]); } generated quantities { // Transform hyperparameters back to the probability (scale) real<lower=0, upper=1> mu_theta_prob = inv_logit(mu_theta); vector<lower=0, upper=1>[N_subjects] theta_prob = inv_logit(theta_logit); // Initialize containers for trial-level metrics vector[N_total] log_lik; array[N_total] int y_post_rep; array[N_total] int y_prior_rep; // --- Prior Log-Density --- real lprior; lprior = normal_lpdf(mu_theta | prior_mu_theta_mu, prior_mu_theta_sigma) + exponential_lpdf(sigma_theta | prior_sigma_theta_lambda) + std_normal_lpdf(z_theta); // --- Prior Predictive Checks: Generative Baseline --- real mu_theta_prior = normal_rng(prior_mu_theta_mu, prior_mu_theta_sigma); real sigma_theta_prior = exponential_rng(prior_sigma_theta_lambda); if (run_diagnostics) { // 2. Draw individual agent parameters from the non-centered prior vector[N_subjects] theta_logit_prior; for (j in 1:N_subjects) { real z_theta_prior = normal_rng(0, 1); theta_logit_prior[j] = mu_theta_prior + z_theta_prior * sigma_theta_prior; } // --- Trial-Level Computations --- for (i in 1:N_total) { log_lik[i] = bernoulli_logit_lpmf(y[i] | theta_logit[subj_id[i]]); y_post_rep[i] = bernoulli_logit_rng(theta_logit[subj_id[i]]); y_prior_rep[i] = bernoulli_logit_rng(theta_logit_prior[subj_id[i]]); } } }" # Write Stan code to file stan_file_ncp <- here::here("stan", "ch07_multilevel_biased_ncp.stan") writeLines(stan_biased_ml_ncp_code, stan_file_ncp) mod_biased_ml_ncp <- cmdstan_model(stan_file_ncp) fit_ncp_path <- here::here("simmodels", "ch07_fit_ncp.rds") # Extract N_subjects safely N_subj_val <- stan_data_centered$N_subjects # The dumb list (4 chains) init_ncp_list <- list( list(mu_theta = rnorm(1, 0, 1.5), sigma_theta = rexp(1, 1), z_theta = rnorm(N_subj_val, 0, 1)), list(mu_theta = rnorm(1, 0, 1.5), sigma_theta = rexp(1, 1), z_theta = rnorm(N_subj_val, 0, 1)), list(mu_theta = rnorm(1, 0, 1.5), sigma_theta = rexp(1, 1), z_theta = rnorm(N_subj_val, 0, 1)), list(mu_theta = rnorm(1, 0, 1.5), sigma_theta = rexp(1, 1), z_theta = rnorm(N_subj_val, 0, 1)) ) if (regenerate_simulations || !file.exists(fit_ncp_path)) { fit_ncp <- mod_biased_ml_ncp$sample( data = stan_data_centered, seed = 123, chains = 4, parallel_chains = 4, iter_warmup = 1000, iter_sampling = 1000, init = init_ncp_list, adapt_delta = 0.85, refresh = 0 ) fit_ncp$save_object(fit_ncp_path) } else { fit_ncp <- readRDS(fit_ncp_path) } ``` > **Why we skip the full diagnostic battery here:** The NCP model is not > our production model for this dataset — it is a *pedagogical fix*. We > introduced the CP model specifically to watch it fail via SBC, and the > NCP to demonstrate the geometric solution. Running the full six-phase > battery (prior predictive, posterior predictive, prior-posterior update, > sensitivity, recovery, SBC) on both parameterizations would be > instructive but redundant for this chapter's purpose. When you apply > the NCP to your *own* data, run all six phases. Here we focus on the > two diagnostics that directly reveal the CP's failure and the NCP's fix: > sampling efficiency and the SBC rank histograms. ::: {.callout-tip title="Stan Efficiency Rule 3 — Subject Indexing: `subj_id` vs. Contiguous Blocks"} The models in this chapter index subject membership via an integer array: `array[N_total] int subj_id`. On each trial row, `subj_id[i]` says which subject that row belongs to. This **scatter-index** pattern is simple and works perfectly for models where trial order does not matter — like the biased-agent model here, where the likelihood is `bernoulli_logit_lpmf(y | theta[subj_id])`. However, from Chapter 9 onward, all models are **path-dependent**: the predicted probability on trial $t$ is a function of everything that happened on trials $1, \ldots, t-1$ within the same subject. For those models, we need to process each subject's trials in strict sequence, which means we must know exactly where each subject's block of rows starts and ends in the data array. The **contiguous-block** pattern achieves this: ```r # Sort data by subject, then by trial within subject data_sorted <- data |> arrange(subject_id, trial) bounds <- data_sorted |> mutate(row = row_number()) |> group_by(subject_id) |> summarise(subj_start = min(row), subj_end = max(row)) ``` ```stan data { array[N_subjects] int subj_start; // first row index for each subject array[N_subjects] int subj_end; // last row index for each subject } model { for (j in 1:N_subjects) { for (i in subj_start[j]:subj_end[j]) { // process trials in order — history is correct } } } ``` This layout also enables the **`reduce_sum` parallelism** introduced in Chapter 12: because each subject's trials form a self-contained contiguous block, independent subject loops can be dispatched to separate CPU threads without any shared mutable state. We keep `subj_id` here for simplicity. You will see `subj_start/subj_end` throughout Chapters 9–11. ::: ```{r ch07_Time_NCP_Model} # Timing, ESS, and Pathfinder warm-start comparison for CP vs NCP # ── 1. Wall-clock and ESS ───────────────────────────────────────────────────── time_cp <- fit_cp$time()$total time_ncp <- fit_ncp$time()$total ess_cp <- summarise_draws(fit_cp$draws(c("mu_theta", "sigma_theta")), "ess_bulk") |> rename(ess_cp = ess_bulk) ess_ncp <- summarise_draws(fit_ncp$draws(c("mu_theta", "sigma_theta")), "ess_bulk") |> rename(ess_ncp = ess_bulk) efficiency_comparison <- ess_cp |> left_join(ess_ncp, by = "variable") |> mutate( ess_per_sec_cp = ess_cp / time_cp, ess_per_sec_ncp = ess_ncp / time_ncp ) cat("=== COMPUTATIONAL SPEED (WALL-CLOCK) ===\n") cat("CP Total Time:", round(time_cp, 2), "seconds\n") cat("NCP Total Time:", round(time_ncp, 2), "seconds\n") cat("Speedup Factor:", round(time_cp / time_ncp, 2), "x\n\n") cat("=== SAMPLING EFFICIENCY (ESS / SECOND) ===\n") print(knitr::kable(efficiency_comparison, digits = 1)) # ── 2. Pathfinder warm-start timing for NCP ─────────────────────────────────── fit_ncp_pf_path <- here::here("simmodels", "ch07_fit_ncp_pathfinder_init.rds") if (regenerate_simulations || !file.exists(fit_ncp_pf_path)) { pf_ncp <- tryCatch( mod_biased_ml_ncp$pathfinder(data = stan_data_centered, num_paths = 4, seed = 42, refresh = 0), error = function(e) NULL ) init_for_ncp <- if (!is.null(pf_ncp)) pf_ncp else init_ncp_list t_pf_start <- proc.time()["elapsed"] fit_ncp_pf <- mod_biased_ml_ncp$sample( data = stan_data_centered, seed = 124, chains = 4, parallel_chains = 4, iter_warmup = 500, # shorter warmup thanks to Pathfinder init iter_sampling = 1000, init = init_for_ncp, adapt_delta = 0.85, refresh = 0 ) time_pf_total <- (proc.time()["elapsed"] - t_pf_start) + if (!is.null(pf_ncp)) pf_ncp$time()$total else 0 fit_ncp_pf$save_object(fit_ncp_pf_path) } else { fit_ncp_pf <- readRDS(fit_ncp_pf_path) time_pf_total <- fit_ncp_pf$time()$total # approximate (Pathfinder time not stored) } timing_summary <- tibble::tribble( ~Method, ~`Time (s)`, ~`Warmup iter`, "CP (standard)", round(time_cp, 1), 1000L, "NCP (standard init)", round(time_ncp, 1), 1000L, "NCP (Pathfinder warm-start)", round(time_pf_total, 1), 500L ) knitr::kable(timing_summary, caption = "Wall-clock time: CP vs NCP, with and without Pathfinder warm-start") ``` Here we see that the NCP model is sometimes slightly faster, sometimes slightly slower than CP in raw wall-clock time, but always less *sampling-efficient* (lower ESS/second). This is the trade-off of NCP: it eliminates the pathological geometry that causes SBC failures in CP, at the cost of slightly less efficient sampling in data-rich regimes. The Pathfinder warm-start partially recovers that efficiency by cutting warmup iterations. Yet, NCP might still end up exploring the space better. Let's see how it fares on the SBC assessment that was giving issues to CP. We will generate 200 random universes from the priors, fit the NCP model to each, and plot the rank histograms. ```{r ch07_sbc_ncp_execution} sbc_ncp_path <- here::here("simmodels", "ch07_sbc_ncp.rds") backend_ncp <- SBC_backend_cmdstan_sample( mod_biased_ml_ncp, iter_warmup = 1000, iter_sampling = 1000, chains = 1, refresh = 0 ) if (regenerate_simulations || !file.exists(sbc_ncp_path)) { future::plan(future::multisession, workers = 4) # Add keep_fits = FALSE right here sbc_results_ncp <- compute_SBC( datasets = sbc_datasets_cp, backend = backend_ncp, keep_fits = FALSE ) saveRDS(sbc_results_ncp, sbc_ncp_path) } else { sbc_results_ncp <- readRDS(sbc_ncp_path) } plot_rank_hist(sbc_results_ncp, variables = c("mu_theta", "sigma_theta")) + labs(title = "SBC Rank Histograms: Non-Centered Parameterization") sbc_stats_df <- sbc_results_cp$stats # 2. Filter for the population-level parameters you care about # (e.g., mu_alpha, mu_beta, rho, or even sigma[1]) recovery_df <- sbc_stats_df |> filter(variable %in% c("mu_theta", "sigma_theta")) # 3. Create the Parameter Recovery Identity Plot p_sbc_recovery <- ggplot(recovery_df, aes(x = simulated_value, y = median)) + # The Identity Line (Perfect Recovery) geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50", linewidth = 1) + # The Posterior Estimates with 90% Credible Intervals geom_errorbar(aes(ymin = q5, ymax = q95), width = 0, alpha = 0.3, color = "steelblue") + geom_point(color = "midnightblue", alpha = 0.7, size = 2) + # Facet by parameter facet_wrap(~variable, scales = "free") + labs( title = "Parameter Recovery (Extracted Directly from SBC)", subtitle = "Points represent the posterior median. Bars are 90% Credible Intervals.", x = "True Generative Value (Drawn from Prior)", y = "Posterior Median Estimate" ) + theme_classic() + theme(strip.background = element_rect(fill = "gray95", color = NA)) print(p_sbc_recovery) # 4. Quantify the Recovery (Correlation and RMSE) recovery_metrics <- recovery_df |> group_by(variable) |> summarize( Pearson_r = cor(simulated_value, median), Spearman_rho = cor(simulated_value, median, method = "spearman"), RMSE = sqrt(mean((simulated_value - median)^2)), .groups = "drop" ) cat("=== PARAMETER RECOVERY QUANTIFICATION ===\n") print(knitr::kable(recovery_metrics, digits = 3)) # Extract diagnostics now so they are available for the targeted refit below diags_ncp <- sbc_results_ncp$backend_diagnostics ncp_fail_rate <- mean(diags_ncp$n_divergent > 0) * 100 # Divergence count distribution across all 200 universes diags_ncp |> ggplot(aes(x = n_divergent)) + geom_histogram(binwidth = 1, fill = "darkred", alpha = 0.8) + labs( title = "NCP Divergence Counts Across 200 SBC Universes", subtitle = sprintf("%.0f%% of universes produced at least one divergence. Even a single divergence invalidates the geometric guarantee of NUTS.", cp_fail_rate), x = "Number of Divergent Transitions", y = "Number of SBC Universes" ) # Locate the divergence-prone region: do they concentrate at small sigma_theta? divergence_profile <- posterior::as_draws_df(sbc_datasets_cp$variables) |> as_tibble() |> mutate(sim_id = row_number()) |> left_join( diags_ncp |> mutate(sim_id = row_number()) |> dplyr::select(sim_id, n_divergent), by = "sim_id" ) |> mutate(had_divergence = n_divergent > 0) ggplot(divergence_profile, aes(x = sigma_theta, y = n_divergent, color = had_divergence)) + geom_point(alpha = 0.5, size = 1.5) + scale_color_manual( values = c("FALSE" = "steelblue", "TRUE" = "darkred"), labels = c("FALSE" = "Clean", "TRUE" = "Divergent"), name = NULL ) + labs( title = "NCP Divergences Concentrate at Small True sigma_theta", subtitle = "The funnel geometry only manifests when population variance is small.", x = expression("True " * sigma[theta]), y = "Number of Divergent Transitions" ) ggplot(divergence_profile, aes(x = mu_theta, y = sigma_theta, color = had_divergence)) + geom_point(alpha = 0.7, size = 2) + scale_color_manual( name = "Sampler Status", values = c("FALSE" = "steelblue", "TRUE" = "darkred"), labels = c("Healthy", "Diverged") ) + labs( title = "The Anatomy of Failure: Where does NCP break?", x = "True Population Mean (mu_theta)", y = "True Population Variance (sigma_theta)" ) + theme_classic() ``` #### Posterior Simulation-Based Calibration Checks: Probing Local Geometry Prior SBC tested our sampler across the entire prior space — it asked whether NUTS is unbiased when the true parameters could be anything the prior assigns non-negligible mass to. This is a global guarantee, and it is necessary. But it is not sufficient. A sampler can be globally unbiased yet struggle precisely in the regions that matter most: the posterior neighborhoods that the actual observed data force us into. When the likelihood is highly informative — as it typically is once we have a few hundred trials per agent — the posterior concentrates sharply in a small corner of the prior space. The sampler must now take large numbers of tiny, precise steps in a geometrically restricted region. Prior SBC never directly stresses this regime, because most of its simulated universes place the true parameter somewhere across the full prior, and only a fraction happen to fall in the data-rich posterior neighborhood. Posterior SBC (PSBC) is a complementary local diagnostic that directly targets this regime. The logic inverts the generative direction. Instead of drawing true parameters from the prior, we draw them from the *fitted posterior* — a distribution that has already been conditioned on our actual data and therefore concentrates in precisely the region the sampler will need to navigate in practice. Concretely: we sample a single draw $\tilde{\theta}^{(s)}$ from the posterior ($\theta \mid y_{\text{obs}}$), treat it as a ground truth, generate a synthetic dataset $\tilde{y}^{(s)} \sim p(y \mid \tilde{\theta}^{(s)})$ from the data model, fit the model to $\tilde{y}^{(s)}$, and record where $\tilde{\theta}^{(s)}$ falls in the resulting posterior. If the sampler is locally well-calibrated, these ranks are again uniform — but now over the concentrated geometry of the posterior, not the diffuse geometry of the prior. A U-shaped rank histogram here means the sampler is overconfident *specifically when the data are informative*, which is exactly the failure mode we need to detect. Implementing this requires us to write a *factory function*: a function that closes over a fitted model object and a data structure, and returns a new generator function that the `SBC` package can call repeatedly to produce fresh $({\tilde{\theta}}, \tilde{y})$ pairs. If this pattern is unfamiliar, read it as a function that *builds* generators rather than *being* one — `make_psbc_generator(fit_cp, stan_data_centered)` does not itself generate a dataset; it returns a self-contained callable that will do so every time the SBC machinery invokes it. ```{r ch07_psbc_execution} # 1. Define a robust factory function for the PSBC generator make_psbc_generator <- function(fit, original_data) { # Extract the full posterior array as a flat data frame draws_df <- posterior::as_draws_df(fit$draws()) # Pre-compute the exact column names to guarantee flawless extraction theta_cols <- paste0("theta_logit[", 1:original_data$N_subjects, "]") y_cols <- paste0("y_post_rep[", 1:original_data$N_total, "]") # The generator function required by the SBC package function() { # Step A: Randomly sample one specific "state of the world" from the posterior idx <- sample(seq_len(nrow(draws_df)), 1) # Step B: Package the true parameters for this specific simulation variables <- list( mu_theta = as.numeric(draws_df$mu_theta[idx]), sigma_theta = as.numeric(draws_df$sigma_theta[idx]), theta_logit = as.numeric(draws_df[idx, theta_cols]) ) # Step C: Package the dataset generated by this specific state of the world generated <- list( N_total = original_data$N_total, N_subjects = original_data$N_subjects, subj_id = original_data$subj_id, y = as.numeric(draws_df[idx, y_cols]), prior_mu_theta_mu = original_data$prior_mu_theta_mu, prior_mu_theta_sigma = original_data$prior_mu_theta_sigma, prior_sigma_theta_lambda = original_data$prior_sigma_theta_lambda, run_diagnostics = original_data$run_diagnostics ) list(variables = variables, generated = generated) } } psbc_path <- here::here("simmodels", "ch07_psbc.rds") if (regenerate_simulations || !file.exists(psbc_path)) { gen_psbc_cp <- SBC_generator_function(make_psbc_generator(fit_cp, stan_data_centered)) gen_psbc_ncp <- SBC_generator_function(make_psbc_generator(fit_ncp, stan_data_centered)) set.seed(8675309) psbc_data_cp <- generate_datasets(gen_psbc_cp, 100) psbc_data_ncp <- generate_datasets(gen_psbc_ncp, 100) backend_cp_psbc <- SBC_backend_cmdstan_sample(mod_biased_ml_cp, iter_warmup = 1000, iter_sampling = 1000, chains = 1, refresh = 0) backend_ncp_psbc <- SBC_backend_cmdstan_sample(mod_biased_ml_ncp, iter_warmup = 1000, iter_sampling = 1000, chains = 1, refresh = 0) future::plan(future::multisession, workers = 4) psbc_res_cp <- compute_SBC(psbc_data_cp, backend_cp_psbc) psbc_res_ncp <- compute_SBC(psbc_data_ncp, backend_ncp_psbc) saveRDS( list( res_cp = psbc_res_cp, res_ncp = psbc_res_ncp ), psbc_path ) } else { psbc_obj <- readRDS(psbc_path) psbc_res_cp <- psbc_obj$res_cp psbc_res_ncp <- psbc_obj$res_ncp } plot_rank_hist(psbc_res_cp, variables = c("mu_theta", "sigma_theta")) + labs(title = "PSBC Rank Histograms: Centered Parameterization") plot_rank_hist(psbc_res_ncp, variables = c("mu_theta", "sigma_theta")) + labs(title = "PSBC Rank Histograms: Non-Centered Parameterization") # 1. Extract the lightweight stats dataframe from the SBC object sbc_stats_df <- psbc_res_ncp$stats # 2. Filter for the population-level parameters you care about # (e.g., mu_alpha, mu_beta, rho, or even sigma[1]) recovery_df <- sbc_stats_df |> filter(variable %in% c("mu_theta", "sigma_theta")) # 3. Create the Parameter Recovery Identity Plot p_sbc_recovery <- ggplot(recovery_df, aes(x = simulated_value, y = median)) + # The Identity Line (Perfect Recovery) geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50", linewidth = 1) + # The Posterior Estimates with 90% Credible Intervals geom_errorbar(aes(ymin = q5, ymax = q95), width = 0, alpha = 0.3, color = "steelblue") + geom_point(color = "midnightblue", alpha = 0.7, size = 2) + # Facet by parameter facet_wrap(~variable, scales = "free") + labs( title = "Parameter Recovery (Extracted Directly from SBC)", subtitle = "Points represent the posterior median. Bars are 90% Credible Intervals.", x = "True Generative Value (Drawn from Prior)", y = "Posterior Median Estimate" ) + theme_classic() + theme(strip.background = element_rect(fill = "gray95", color = NA)) print(p_sbc_recovery) # 4. Quantify the Recovery (Correlation and RMSE) recovery_metrics <- recovery_df |> group_by(variable) |> summarize( Pearson_r = cor(simulated_value, median), RMSE = sqrt(mean((simulated_value - median)^2)), .groups = "drop" ) cat("=== PARAMETER RECOVERY QUANTIFICATION ===\n") print(knitr::kable(recovery_metrics, digits = 3)) diags_psbc_cp <- psbc_res_cp$backend_diagnostics diags_psbc_ncp <- psbc_res_ncp$backend_diagnostics psbc_cp_fail_rate <- mean(diags_psbc_cp$n_divergent > 0) * 100 psbc_ncp_fail_rate <- mean(diags_psbc_ncp$n_divergent > 0) * 100 cat("=== POSTERIOR SBC (LOCAL) DIVERGENCE RATES ===\n") cat("CP Failure Rate:", psbc_cp_fail_rate, "%\n") cat("NCP Failure Rate:", psbc_ncp_fail_rate, "%\n") ``` ### Comparing CP and NCP: Global Fragility vs. Local Efficiency The CP did not seem to perform too bad, but it broke down when doing SBC. We therefore reparameterized the model by decoupling the sampling of individual and population level parameters (from CP to NCP). This fundamentally altered how the Hamiltonian Monte Carlo (NUTS) algorithm explores the posterior manifold. We saw that NCP took slightly longer to run and was less efficient in sampling (< ESS). We then ran extensive SBC both sampling the simulations from the priors (global calibration) and sampling the simulations from the fitted posteriors (local calibration). By comparing global calibration (Prior SBC) against local calibration (Posterior SBC), we have mathematically mapped the boundary conditions of these parameterizations. * Integration Stability and Global Fragility (Divergences): In CP, sparse data forces the sampler into the infinite curvature of Neal's Funnel. As proven by our Prior SBC failure rates, CP is globally fragile; it regularly diverged across the broad generative prior space whenever the true population variance ($\sigma_\theta$) was small. NCP, conversely, decouples the parameters. The sampler explores a standard, isotropic Gaussian space ($z \sim \text{Normal}(0, 1)$) where the curvature is constant, completely immunizing the model against Neal's Funnel and dropping global divergences to zero. * Sampling Efficiency and Local Vulnerability (ESS and Speed): Because the NCP geometry is smooth and well-behaved in data-sparse regimes, the Hamiltonian integrator can take massive, efficient step sizes, causing the Effective Sample Size (ESS) to skyrocket. However, as our Posterior SBC proved, NCP is locally vulnerable to the Data Funnel (or "Reverse Funnel"). In the specific neighborhood of our simulated data, the likelihood was highly informative and tightly constrained the posterior. CP navigated this sharp likelihood peak effortlessly. NCP, however, was forced to blindly explore a broad standard normal prior, only to find that the data violently rejected almost all of it, causing its ESS to plummet. * Stationarity ($\hat{R}$): While both models can achieve $\hat{R} < 1.01$ under ideal local conditions, NCP achieves it reliably across the entire prior space, ensuring all Markov chains mix seamlessly regardless of the underlying data-generating universe. * The Geometric Trade-off: When CP Wins Given the global safety of NCP, it is tempting to conclude that we should always use the Non-Centered Parameterization. This is mathematically incorrect. The choice between CP and NCP is a strict geometric trade-off governed by the density of your data. ** The Data-Poor Regime (Use NCP): When trial counts ($N_j$) are low or population variance ($\sigma_\theta$) is small, the prior dominates the likelihood. This creates Neal's Funnel in the CP space. NCP thrives here because it decouples the parameters, allowing the sampler to easily and safely explore the prior geometry. ** The Data-Rich Regime (Use CP): Consider a scenario where every agent completes $10,000$ trials. The likelihood becomes massively informative, crushing the uncertainty around $\theta_{\text{logit}, j}$ into a tiny, tight peak.If we use CP, the sampler effortlessly navigates this tight, data-driven likelihood space. If we use NCP, the sampler attempts to explore the broad $z \sim \text{Normal}(0,1)$ prior, only to find that almost the entire space is immediately rejected by the highly restrictive likelihood. This makes NCP highly inefficient, resulting in massive autocorrelation, low ESS, and high computational time. In other words, there is no universal panacea in cognitive modeling. We do not engage in blind "tinkering" to get our models to fit. Instead, we adhere to the Iterative Bayesian Workflow: we fit the model, deploy a strict diagnostic battery to identify the exact geometric failure (prior-dominated funnels vs. data-dominated funnels), and apply the mathematically appropriate parameterization. Because typical cognitive tasks (like our Matching Pennies game) involve relatively sparse data ($N_j \approx 100$) and substantial population-level shrinkage, we will utilize the Non-Centered Parameterization as our default architecture as we proceed to construct the more complex Memory and WSLS agents. ## The Epistemic Showdown: Visualizing Shrinkage We have theorized about the dangers of ignoring population structure (Complete Pooling) and the dangers of ignoring noise in small samples (No Pooling). In a rigorous Bayesian workflow, we do not rely on intuition; we compute the alternatives and visualize the geometric differences.We will quickly define and fit the two extreme models using our strictly typed stan_data_centered. The complete and no-pooling models are included here solely as visual contrasts — they serve as the "wrong answer" endpoints that make the partial pooling solution legible. A full diagnostic battery on these two models would be redundant: complete pooling is trivially identifiable (one parameter, well-conditioned), and no-pooling is structurally identical to J independent Ch. 6 models we already validated. The NCP model, which actually matters, passed its full battery above. ### The Complete Pooling Model (Underfitting) This model assumes all agents are identical clones. The population variance is forced to exactly $0$. We estimate a single, universal bias parameter. ```{r ch07_fit_complete_pooling} stan_full_pooling_code <- " data { int<lower=1> N_total; array[N_total] int<lower=0, upper=1> y; } parameters { real theta_logit; // A single parameter for the entire universe } model { target += normal_lpdf(theta_logit | 0, 1.5); target += bernoulli_logit_lpmf(y | theta_logit); } " stan_file_full <- here::here("stan", "ch07_biased_full_pooling.stan") writeLines(stan_full_pooling_code, stan_file_full) mod_full <- cmdstan_model(stan_file_full) fit_full_path <- here::here("simmodels", "ch07_fit_full_pool.rds") if (regenerate_simulations || !file.exists(fit_full_path)) { fit_full <- mod_full$sample( data = list(N_total = stan_data_centered$N_total, y = stan_data_centered$y), seed = 123, chains = 4, parallel_chains = 4, refresh = 0 ) fit_full$save_object(fit_full_path) } else { fit_full <- readRDS(fit_full_path) } full_pool_est <- posterior::summarise_draws(fit_full$draws("theta_logit"), "median")$median ``` ### The No Pooling Model (Overfitting) This model assumes every agent is a completely isolated island. The population variance is implicitly assumed to be $\infty$. We estimate an independent parameter for each agent, equipped only with a static, weakly informative prior. ```{r ch07_fit_no_pooling} stan_no_pooling_code <- " data { int<lower=1> N_total; int<lower=1> N_subjects; array[N_total] int<lower=1, upper=N_subjects> subj_id; array[N_total] int<lower=0, upper=1> y; } parameters { vector[N_subjects] theta_logit; // N_subjects completely independent parameters } model { // Static prior: No hierarchical learning target += normal_lpdf(theta_logit | 0, 1.5); target += bernoulli_logit_lpmf(y | theta_logit[subj_id]); } " stan_file_no <- here::here("stan", "ch07_biased_no_pooling.stan") writeLines(stan_no_pooling_code, stan_file_no) mod_no <- cmdstan_model(stan_file_no) # Fit the No Pooling model fit_no_path <- here::here("simmodels", "ch07_fit_no_pool.rds") if (regenerate_simulations || !file.exists(fit_no_path)) { fit_no <- mod_no$sample( data = stan_data_centered, seed = 123, chains = 4, parallel_chains = 4, refresh = 0 ) fit_no$save_object(fit_no_path) } else { fit_no <- readRDS(fit_no_path) } no_pool_summary <- posterior::summarise_draws(fit_no$draws("theta_logit"), "median") |> mutate(agent_id = as.integer(str_extract(variable, "(?<=\\[)\\d+(?=\\])"))) |> rename(theta_no_pool = median) ``` ### The Geometry of Shrinkage We now have three distinct sets of inferences: * No Pooling: The raw, isolated estimates (highly vulnerable to the noise of low $N_j$). * Complete Pooling: The rigid, universal average (blind to genuine individual differences). * Partial Pooling: Our hierarchical NCP model estimates (the precision-weighted Bayesian compromise). To truly evaluate the epistemological value of these models, we must plot them not just against each other, but against the True Generative Parameters (the hidden reality we simulated in R). We will also extract the naive arithmetic mean of the No Pooling estimates to contrast it with the precision-weighted Partial Pooling mean. ```{r ch07_shrinkage_viz} # 1. Extract the Partial Pooling Population and Individual Estimates (from the NCP model) mu_partial_pool_est <- posterior::summarise_draws(fit_ncp$draws("mu_theta"), "median")$median partial_pool_summary <- posterior::summarise_draws(fit_ncp$draws("theta_logit"), "median") |> mutate(agent_id = as.integer(str_extract(variable, "(?<=\\[)\\d+(?=\\])"))) |> rename(theta_partial_pool = median) # 2. Calculate the No Pooling Mean (simple arithmetic average of the isolated estimates) no_pool_mean_est <- mean(no_pool_summary$theta_no_pool) # 3. Calculate the True Generative Mean (The actual center of our simulated universe) true_mean_est <- mean(true_agent_params$theta_logit) # 4. Combine into a single dataframe, including the TRUE generative parameters comparison_df <- no_pool_summary |> dplyr::select(agent_id, theta_no_pool) |> left_join( partial_pool_summary |> dplyr::select(agent_id, theta_partial_pool), by = "agent_id" ) |> # Bring in the ground truth left_join( true_agent_params |> dplyr::select(agent_id, n_trials, theta_true = theta_logit), by = "agent_id" ) |> # Sort agents by their No Pooling estimate for a cleaner visual curve arrange(theta_no_pool) |> mutate(rank = row_number()) # The Ultimate Epistemic Plot ggplot(comparison_df) + # --- POPULATION AVERAGES (Mapped to linetype for a clean legend) --- geom_hline(aes(yintercept = no_pool_mean_est, linetype = "No Pooling Mean"), color = "darkorange", linewidth = 0.8) + geom_hline(aes(yintercept = full_pool_est, linetype = "Complete Pooling Mean"), color = "darkred", linewidth = 0.8) + geom_hline(aes(yintercept = mu_partial_pool_est, linetype = "Partial Pooling Mean (\U03BC)"), color = "steelblue", linewidth = 1.2) + geom_hline(aes(yintercept = true_mean_est, linetype = "True Generative Mean"), color = "black", linewidth = 0.8) + # --- INDIVIDUAL ESTIMATES --- # Shrinkage Vectors (from No Pool to Partial Pool) geom_segment(aes(x = rank, xend = rank, y = theta_no_pool, yend = theta_partial_pool), color = "gray65", arrow = arrow(length = unit(0.1, "cm"))) + # True Generative Parameters (Black Crosses) geom_point(aes(x = rank, y = theta_true, color = "True Value"), shape = 4, size = 2, stroke = 1) + # No Pooling Estimates (Open Circles) geom_point(aes(x = rank, y = theta_no_pool, color = "No Pooling", size = n_trials), shape = 1, stroke = 1.2) + # Partial Pooling Estimates (Solid Points) geom_point(aes(x = rank, y = theta_partial_pool, color = "Partial Pooling", size = n_trials), alpha = 0.8) + # --- SCALES & THEMES --- scale_linetype_manual( name = "Population Averages", values = c( "True Generative Mean" = "solid", "Partial Pooling Mean (\U03BC)" = "dotted", "Complete Pooling Mean" = "dashed", "No Pooling Mean" = "dotdash" ) ) + scale_color_manual( name = "Individual Estimates", values = c( "True Value" = "black", "No Pooling" = "darkorange", "Partial Pooling" = "steelblue" ) ) + labs( title = "Bayesian Shrinkage: Partial Pooling vs. the Extremes", subtitle = "Arrows show shrinkage from No Pooling → Partial Pooling. Longest arrows = fewest trials. Partial pooling lands closest to truth (black solid line).", x = "Agents (Sorted by No-Pooling Estimate)", y = "Estimated Bias Parameter (Logit Scale)", size = "Trial Count (N_j)" ) + theme( axis.text.x = element_blank(), axis.ticks.x = element_blank(), legend.position = "right" ) ``` Study this plot carefully. It contains the entire epistemological justification for hierarchical modeling, visible across two dimensions: the horizontal population averages and the individual agent estimates. ### The Horizontal Lines: The Quartet of Population Means * True Generative Mean (Solid Black): The actual center of our simulated universe. This is the structural reality we are trying to recover. * No Pooling Mean (Dot-dash Orange): The unweighted arithmetic average of the isolated estimates. Because it treats all agents equally, it is highly susceptible to being dragged away from the truth by extreme outliers with very little data. * Complete Pooling Mean (Dashed Red): The unweighted average of all trials. Because it ignores agent boundaries, it allows participants who completed massive numbers of trials to disproportionately pull the center of gravity toward themselves. * Partial Pooling Mean ($\mu_\theta$, Dotted Blue): The precision-weighted compromise. By mathematically down-weighting the noisy, low-$N_j$ agents, it reliably lands closest to the true generative mean. ### The Points and Arrows: Information-Weighted Regularization * The Overfitting Extremes: Look at the tails of the graph. The orange open circles (No Pooling) are extreme. If we trusted No Pooling, we would conclude these agents have massive, almost deterministic biases. * The Power of the Prior: The solid blue dots (Partial Pooling) are systematically pulled inward. The hierarchical model "knows" that extremely high or low values are mathematically unlikely given the behavior of the rest of the cohort. * The Calibration of Skepticism: Look at the size of the dots, representing the trial count ($N_j$). The longest arrows (most aggressive shrinkage) almost exclusively belong to the smallest dots. When an agent has sparse data, the posterior relies heavily on the population prior. The largest dots barely move; when an agent has massive amounts of data, the likelihood overwhelms the prior, allowing them to remain safely extreme. ## The Memory Agent: Two Correlated Individual Differences ### Introduction The Biased Agent established the full Bayesian workflow for a single individual-level parameter: simulate, fit, diagnose, recover, calibrate. It also taught us, the hard way, that the Centered Parameterization fails when population variance is small, and that the Non-Centered Parameterization resolves this by decoupling individual and population parameters geometrically. We proceed to the Memory Agent with both of those lessons in hand. The Memory Agent is structurally richer. Recall from Chapter 3 its decision rule. On each trial $i$, agent $j$ maintains a running average of the opponent's past choices $m_{ij}$ and uses it as a predictive signal: $$h_{ij} \sim \text{Bernoulli}(\text{logit}^{-1}(\alpha_j + \beta_j \cdot \text{logit}(m_{ij})))$$ $$m_{ij} = \frac{1}{i-1}\sum_{k=1}^{i-1} o_{kj}$$ There are now two individual-level parameters per agent: $\alpha_j$, the baseline directional bias, and $\beta_j$, the sensitivity to opponent history. They need not vary independently across the population. An agent who has a strong directional tendency might also track the opponent more aggressively — or might not. Fitting two separate univariate hierarchical models forecloses this question before the data can answer it. We model the parameters jointly. **Experimental design: the reversal paradigm.** In Ch. 6 we discovered that fitting the memory model to a *constant-bias* opponent produces a strong positive correlation between `bias` and `beta` in the posterior — both parameters explain the same rightward tendency and cannot be told apart. The reversal learning design (opponent plays 80% right for trials 1–120, then 20% right for trials 121–240) breaks this confound: in the second phase, a pure-bias agent continues right while a memory agent flips left, providing the discriminating contrast the two-parameter model needs. We therefore use the reversal schedule saved in Ch. 6 as the shared opponent sequence for all J agents here. #### The Bivariate Population Distribution $$\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \sim \mathcal{N}_2\!\left(\boldsymbol{\mu},\, \boldsymbol{\Sigma}\right), \quad \boldsymbol{\Sigma} = \operatorname{diag}(\boldsymbol{\sigma})\, \Omega\, \operatorname{diag}(\boldsymbol{\sigma}), \quad \Omega = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$ This decomposition separates how much individuals differ on each cognitive dimension ($\sigma_\alpha$, $\sigma_\beta$) from how those dimensions covary ($\rho$). The scalar $\rho \in (-1, 1)$ is the key new inferential target. #### The LKJ Prior We need a prior on the correlation matrix $\Omega$. Stan provides the LKJ distribution (Lewandowski, Kurowicka, and Joe, 2009), with a single shape parameter $\eta > 0$: $$\Omega \sim \text{LKJ}(\eta) \quad \propto \quad (\det \Omega)^{\eta - 1}$$ When $\eta = 1$ the distribution is uniform over valid correlation matrices. When $\eta > 1$ mass concentrates toward the identity, encoding the prior belief that parameters tend to vary independently unless the data argue otherwise. For a $2 \times 2$ matrix, the marginal on $\rho$ reduces to a rescaled Beta: $\rho \sim 2 \cdot \text{Beta}(\eta, \eta) - 1$. With our default $\eta = 2$, the prior is unimodal at zero and weakly regularizes against extreme correlations. Stan samples the lower-triangular **Cholesky factor** $\mathbf{L}_\Omega$ (declared as `cholesky_factor_corr[2]`) under the `lkj_corr_cholesky(eta)` prior, recovering $\Omega = \mathbf{L}_\Omega \mathbf{L}_\Omega^\top$ in `generated quantities`. This parameterization is unconstrained on its natural manifold. #### Non-Centered Parameterization The scalar NCP from the biased agent ($\theta_j = \mu + z_j \cdot \sigma$) generalizes directly. We introduce a $2 \times J$ matrix of independent standard normals: $$z_{kj} \sim \mathcal{N}(0, 1) \text{ i.i.d.}, \qquad \begin{pmatrix}\alpha_j \\ \beta_j\end{pmatrix} = \begin{pmatrix}\mu_\alpha \\ \mu_\beta\end{pmatrix} + \operatorname{diag}(\boldsymbol{\sigma})\,\mathbf{L}_\Omega\,\mathbf{z}_j$$ The sampler traverses the flat, isotropic space of $\mathbf{Z}$, whose geometry is entirely independent of $\boldsymbol{\sigma}$ and $\Omega$. Neal's Funnel is structurally impossible. ```{tikz hierarchical_memory_dag, echo=FALSE, fig.cap="Standard Plate Notation (left) and Kruschke-Style Notation (right) for the Hierarchical Memory Agent. The correlation node Omega and the bivariate MVN distinguish this model from the biased agent.", fig.ext='png', fig.width=12, cache=TRUE} \usetikzlibrary{positioning, shapes.geometric, arrows.meta, fit, backgrounds, calc} \begin{tikzpicture}[ stochastic/.style={->, >={Latex[length=3mm]}, thick}, deterministic/.style={->, >={Latex[length=3mm]}, thick, dashed, draw=gray!70}, plate/.style={draw=black, thick, rounded corners, inner sep=10pt}, continuous/.style={circle, draw=black, thick, minimum size=1.2cm, inner sep=0pt}, discrete/.style={rectangle, draw=black, thick, minimum size=1.2cm}, observed/.style={fill=gray!30}, latent/.style={fill=white}, pics/norm_dist/.style={code={ \draw[thin, gray] (-1.2,0) -- (1.2,0); \draw[thick, blue] plot[domain=-1.2:1.2, samples=50] (\x, {0.8*exp(-3*\x*\x)}); \node[font=\footnotesize, below] at (0,0) {Normal}; }}, pics/exp_dist/.style={code={ \draw[thin, gray] (-0.2,0) -- (1.5,0); \draw[thick, blue] plot[domain=0:1.4, samples=50] (\x, {0.8*exp(-2.5*\x)}); \node[font=\footnotesize, below] at (0.6,0) {Exponential}; }}, pics/lkj_dist/.style={code={ \draw[thin, gray] (-1.2,0) -- (1.2,0); \draw[thick, blue] plot[domain=-0.95:0.95, samples=50] (\x, {0.8*(1-\x*\x)}); \node[font=\footnotesize, below] at (0,0) {LKJ}; }}, pics/mvn_dist/.style={code={ \draw[thin, gray, ->] (-0.8,0) -- (0.8,0); \draw[thin, gray, ->] (0,-0.8) -- (0,0.8); \draw[thick, blue, rotate=30] (0,0) ellipse (0.6 and 0.22); \draw[thick, blue, rotate=30] (0,0) ellipse (0.3 and 0.11); \node[font=\footnotesize, below right] at (0.1,-0.15) {MVN}; }}, pics/bern_dist/.style={ code={ \draw[thin, gray] (-1,0) -- (1,0); % Replaced ycomb plot with explicit thick lines for compiler stability \draw[blue, line width=4pt] (-0.4, 0) -- (-0.4, 0.3); \draw[blue, line width=4pt] (0.4, 0) -- (0.4, 0.8); \node[font=\footnotesize, below] at (0,0) {Bernoulli}; } } ] %% LEFT: STANDARD PLATE NOTATION \node[font=\large\bfseries] at (0, 2.2) {Standard Plate Notation}; \node[continuous, latent] (mu_a) at (-3, 0) {$\mu_\alpha$}; \node[continuous, latent] (sig_a) at (-1.5, 0) {$\sigma_\alpha$}; \node[continuous, latent, fill=gray!15] (omega) at (0, 0) {$\Omega$}; \node[continuous, latent] (sig_b) at (1.5, 0) {$\sigma_\beta$}; \node[continuous, latent] (mu_b) at (3, 0) {$\mu_\beta$}; \node[continuous, latent] (alpha) at (-1.5, -3.5) {$\alpha_j$}; \node[continuous, latent] (beta) at ( 1.5, -3.5) {$\beta_j$}; \node[discrete, observed] (m) at (-1.5, -7) {$m_{ij}$}; \node[discrete, observed] (h) at ( 1.5, -7) {$h_{ij}$}; \node[plate, fit=(m)(h), label={[anchor=south east, font=\small\bfseries]below right:$i=1\dots N$}] (pN) {}; \node[plate, inner sep=15pt, fit=(alpha)(beta)(pN), label={[anchor=south east, font=\small\bfseries]below right:$j=1\dots J$}] {}; \draw[stochastic] (mu_a) -- (alpha); \draw[stochastic] (sig_a) -- (alpha); \draw[stochastic] (omega) -- (alpha); \draw[stochastic] (omega) -- (beta); \draw[stochastic] (sig_b) -- (beta); \draw[stochastic] (mu_b) -- (beta); \draw[stochastic] (alpha) -- (h); \draw[stochastic] (beta) -- (h); \draw[deterministic] (m) -- (h); %% RIGHT: KRUSCHKE-STYLE NOTATION \node[font=\large\bfseries] at (11, 2.2) {Kruschke-Style Notation}; \coordinate (k_mu_a) at (7, 0); \draw (k_mu_a) pic {norm_dist}; \node[above=0.7cm of k_mu_a, font=\bfseries] {$\mu_\alpha$}; \coordinate (k_sig_a) at (9, 0); \draw (k_sig_a) pic {exp_dist}; \node[above=0.7cm of k_sig_a, font=\bfseries] {$\sigma_\alpha$}; \coordinate (k_omega) at (11, 0); \draw (k_omega) pic {lkj_dist}; \node[above=0.7cm of k_omega, font=\bfseries] {$\Omega$}; \coordinate (k_sig_b) at (13, 0); \draw (k_sig_b) pic {exp_dist}; \node[above=0.7cm of k_sig_b, font=\bfseries] {$\sigma_\beta$}; \coordinate (k_mu_b) at (15, 0); \draw (k_mu_b) pic {norm_dist}; \node[above=0.7cm of k_mu_b, font=\bfseries] {$\mu_\beta$}; \coordinate (k_mvn) at (11, -3.5); \draw (k_mvn) pic {mvn_dist}; \node[above=0.9cm of k_mvn, font=\bfseries, align=center] {Agent Params \\ $(\alpha_j,\,\beta_j)$}; \node[discrete, observed, minimum size=0.9cm] (k_m) at (9, -7) {$m_{ij}$}; \node[above=0.5cm of k_m, font=\bfseries] {Memory}; \coordinate (k_h) at (13, -7); \draw (k_h) pic {bern_dist}; \node[above=0.9cm of k_h, font=\bfseries] {Choice ($h_{ij}$)}; \draw[plate, dashed] (7.5, -5.2) rectangle (14.5, -2.2) node[below left, font=\small\bfseries] {$j=1\dots J$}; \draw[plate, dotted] (7.5, -8.2) rectangle (14.5, -5.8) node[below left, font=\small\bfseries] {$i=1\dots N$}; \draw[stochastic] (k_mu_a) -- (k_mvn); \draw[stochastic] (k_sig_a) -- (k_mvn); \draw[stochastic] (k_omega) -- (k_mvn) node[midway, right, font=\large] {$\sim$}; \draw[stochastic] (k_sig_b) -- (k_mvn); \draw[stochastic] (k_mu_b) -- (k_mvn); \draw[stochastic] (k_mvn) -- (k_h) node[midway, right, font=\large] {$\sim$}; \draw[deterministic] (k_m) -- (12.5, -7); \end{tikzpicture} ``` ### Phase 1: Generative Plausibility of the Memory Agent Before writing any Stan code, we construct the forward generative model in R. This code must strictly mirror the mathematical hierarchy above. One design decision deserves explicit justification. The memory signal $m_{ij}$ is a deterministic function of the opponent's past choices. We could compute it inside Stan's `transformed data` block, but this executes a nested loop on every leapfrog step of every chain. Instead, we pre-compute the vector `opp_rate_prev` in R once and pass it as data. This reduces the memory calculation to a single vectorized pass before fitting begins, at zero loss of model fidelity. The simulation function must also mirror the NCP Stan model exactly. We draw $\mathbf{z}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_2)$ and apply the Cholesky transform, exactly as the Stan model will. For a $2 \times 2$ correlation matrix with off-diagonal $\rho$, the lower Cholesky factor is: $$\mathbf{L}_\Omega = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1-\rho^2} \end{pmatrix}$$ We sample $\rho$ from the LKJ(2) marginal: $\rho \sim 2 \cdot \text{Beta}(2, 2) - 1$. In accordance with the rigorous Bayesian workflow, the simulation function is designed to either: * Draw all hyperparameters directly from the stated priors (for Prior Predictive Checks). * Accept fixed ground-truth hyperparameters (for Parameter Recovery). ```{r ch07_simulate_hierarchical_memory} #' Simulate Hierarchical Memory Agents #' #' @param n_agents Integer, number of agents (J). #' @param n_trials_min Integer, minimum trials per agent. #' @param n_trials_max Integer, maximum trials per agent. #' @param mu_alpha Numeric, population mean baseline bias (logit scale). NULL draws from prior. #' @param mu_beta Numeric, population mean memory sensitivity. NULL draws from prior. #' @param sigma_alpha Numeric, population SD of alpha. NULL draws from prior. #' @param sigma_beta Numeric, population SD of beta. NULL draws from prior. #' @param rho Numeric, population correlation alpha-beta. NULL draws from LKJ(2). #' @param opponent_choices Integer vector of length >= max(n_trials_max). #' If provided, every agent plays a consecutive prefix of this fixed #' sequence (realistic: all participants face the same opponent). #' If NULL, generates independent Bernoulli draws using opponent_rate. #' @param opponent_rate Numeric in (0,1). Used only when opponent_choices #' is NULL. Default 0.7. #' @param seed Integer, for exact computational reproducibility. #' #' @return A tidy tibble containing both latent parameters and observed choices. simulate_hierarchical_memory <- function( n_agents = 50, n_trials_min = 20, n_trials_max = 120, mu_alpha = NULL, mu_beta = NULL, sigma_alpha = NULL, sigma_beta = NULL, rho = NULL, opponent_choices = NULL, opponent_rate = 0.7, seed = NULL ) { if (!is.null(seed)) set.seed(seed) # 1. Sample hyperparameters from priors if not provided if (is.null(mu_alpha)) mu_alpha <- rnorm(1, 0, 1) if (is.null(mu_beta)) mu_beta <- rnorm(1, 0, 1) if (is.null(sigma_alpha)) sigma_alpha <- rexp(1, 1) if (is.null(sigma_beta)) sigma_beta <- rexp(1, 1) # LKJ(2) marginal on rho for 2x2 = 2*Beta(2,2) - 1 if (is.null(rho)) rho <- 2 * rbeta(1, 2, 2) - 1 # 2. Construct the lower-triangular Cholesky factor L_Omega L_Omega <- matrix(c(1, rho, 0, sqrt(1 - rho^2)), nrow = 2, ncol = 2) # 3. Sample individual parameters via the NCP transform (mirrors Stan exactly) # Z is 2 x J; Scale = diag(sigma) %*% L_Omega Z <- matrix(rnorm(2 * n_agents), nrow = 2) Scale <- diag(c(sigma_alpha, sigma_beta)) %*% L_Omega # indiv_params is J x 2: column 1 = alpha_j, column 2 = beta_j indiv_params <- t(c(mu_alpha, mu_beta) + Scale %*% Z) # 4. Simulate trial-level data n_trials_vec <- sample(n_trials_min:n_trials_max, n_agents, replace = TRUE) sim_data <- map_dfr(seq_len(n_agents), function(j) { n_t <- n_trials_vec[j] alpha_j <- indiv_params[j, 1] beta_j <- indiv_params[j, 2] if (!is.null(opponent_choices)) { # Shared reversal schedule: agent plays the first n_t steps of the # fixed sequence. Agents who leave early see only the first phase. stopifnot(length(opponent_choices) >= n_t) opp_choices <- opponent_choices[seq_len(n_t)] } else { opp_choices <- rbinom(n_t, 1, opponent_rate) } # Pre-compute running opponent rate BEFORE each trial (strict t-1 history). # Trial 1 has no history: initialize to 0.5 (uninformative). # Clip to [0.01, 0.99] so logit(opp_rate_prev) is always finite. opp_rate_prev <- numeric(n_t) opp_rate_prev[1] <- 0.5 if (n_t > 1) { raw_rates <- cumsum(opp_choices[-n_t]) / seq_len(n_t - 1) opp_rate_prev[-1] <- pmax(0.01, pmin(0.99, raw_rates)) } logit_p <- alpha_j + beta_j * qlogis(opp_rate_prev) agent_choices <- rbinom(n_t, 1, plogis(logit_p)) tibble( agent_id = j, trial = seq_len(n_t), n_trials = n_t, choice = agent_choices, opp_rate_prev = opp_rate_prev, true_alpha = alpha_j, true_beta = beta_j ) }) attr(sim_data, "hyperparameters") <- list( mu_alpha = mu_alpha, mu_beta = mu_beta, sigma_alpha = sigma_alpha, sigma_beta = sigma_beta, rho = rho ) sim_data } ``` #### Executing the Prior Predictive Check: The Joint Population Cloud For a bivariate model, the most revealing prior predictive display is a scatter of individual $(\alpha_j, \beta_j)$ pairs across multiple prior universes. This simultaneously reveals how dispersed individual differences are ($\sigma_\alpha$, $\sigma_\beta$), how correlated the two cognitive dimensions tend to be ($\rho$), and whether any prior universe generates implausible populations (e.g., $|\beta_j| > 8$, meaning the memory signal completely overrides choice). ```{r ch07_memory_prior_predictive_scatter} # Generate 12 prior universes prior_sims_mem <- map_dfr(1:12, function(sim_id) { sim <- simulate_hierarchical_memory( n_agents = 80, n_trials_min = 1, n_trials_max = 1, seed = sim_id * 13 ) hp <- attr(sim, "hyperparameters") sim |> distinct(agent_id, true_alpha, true_beta) |> mutate(simulation_id = paste("Prior Universe", sim_id)) }) ggplot(prior_sims_mem, aes(x = true_alpha, y = true_beta)) + geom_point(alpha = 0.4, size = 0.9, color = "steelblue") + geom_vline(xintercept = 0, linetype = "dashed", color = "gray60", linewidth = 0.4) + geom_hline(yintercept = 0, linetype = "dashed", color = "gray60", linewidth = 0.4) + facet_wrap(~simulation_id, ncol = 4) + labs( title = "Prior Predictive Checks: Implied Joint Population Distributions", subtitle = "Populations generated by mu ~ N(0,1), sigma ~ Exp(1), Omega ~ LKJ(2)", x = expression("Baseline bias " * alpha[j] * " (logit scale)"), y = expression("Memory sensitivity " * beta[j]) ) ``` A healthy prior generates a range of cloud shapes — some circular (low $|\rho|$), some elongated and tilted (high $|\rho|$), some compact (low $\sigma$), some diffuse (high $\sigma$). No prior universe should collapse all agents to a point or place all agents at implausibly extreme values. ### Phase 2: Data Preparation and the NCP Stan Model Having established from the biased agent workflow that NCP is our default architecture for cognitive tasks with sparse per-agent data, we implement it directly. We skip the Centered Parameterization entirely. ```{r ch07_stan_memory_data_prep} #' Prepare Memory Agent Data for Stan #' #' @param sim_data Tibble from simulate_hierarchical_memory(). #' @return A named list formatted for CmdStanR. prep_stan_data_memory <- function(sim_data) { list( N_total = nrow(sim_data), N_subjects = length(unique(sim_data$agent_id)), subj_id = sim_data$agent_id, y = sim_data$choice, opp_rate_prev = sim_data$opp_rate_prev, prior_mu_alpha_mu = 0, prior_mu_alpha_sigma = 1.0, prior_mu_beta_mu = 0, prior_mu_beta_sigma = 1.0, prior_sigma_lambda = 1.0, prior_Omega_shape = 2.0, run_diagnostics = 1 ) } # --- Load the reversal learning schedule from Ch. 6 --- # This 240-trial sequence (80% right → reversal → 20% right) was validated # in Ch. 6 to cleanly identify bias and beta. All agents play the same # opponent schedule, with variable trial counts (80–240 trials). reversal_obj <- readRDS(here("simdata", "ch06_reversal_data.rds")) opponent_reversal <- reversal_obj$opponent_choices # 240-element integer vector cat("Loaded reversal schedule from Ch. 6:", length(opponent_reversal), "trials |", sum(opponent_reversal[1:120]), "right in phase 1 |", sum(opponent_reversal[121:240]), "right in phase 2\n") # --- Generate hierarchical memory data using the reversal schedule --- sim_data_memory <- simulate_hierarchical_memory( n_agents = 50, n_trials_min = 80, # Minimum: agents see at least the first phase n_trials_max = 240, # Maximum: agents complete both phases mu_alpha = 0.0, # No systematic directional baseline mu_beta = 1.5, # Moderate positive memory tracking sigma_alpha = 0.3, sigma_beta = 0.5, rho = 0.3, # Mild positive correlation opponent_choices = opponent_reversal, # <-- reversal schedule from Ch. 6 seed = 1981 ) stan_data_memory <- prep_stan_data_memory(sim_data_memory) cat(sprintf( "Hierarchical memory data: %d agents, %d total trials, %d–%d trials per agent\n", stan_data_memory$N_subjects, stan_data_memory$N_total, min(sim_data_memory$n_trials), max(sim_data_memory$n_trials) )) ``` #### Stan Model: Hierarchical Memory Agent (NCP) The only new structural element relative to the biased agent NCP is the bivariate extension. Instead of a scalar $z_j$, we have a $2 \times J$ matrix $\mathbf{Z}$. Instead of scaling by a single $\sigma$, we apply `diag_pre_multiply(sigma, L_Omega)` — which computes $\operatorname{diag}(\boldsymbol{\sigma})\mathbf{L}_\Omega$ and produces the full Cholesky scale matrix encoding both marginal SDs and the correlation structure. ```{r ch07_stan_memory_ncp_code} stan_memory_ncp_code <- " // Hierarchical Memory Agent Model (Non-Centered Parameterization) // alpha_j (baseline bias) and beta_j (memory sensitivity) are jointly // drawn from a bivariate Normal. NCP decouples sampling geometry from // the population parameters, preventing Neal's Funnel. data { int<lower=1> N_total; // Total observations int<lower=1> N_subjects; // Number of agents array[N_total] int<lower=1, upper=N_subjects> subj_id; // Agent index array[N_total] int<lower=0, upper=1> y; // Observed choices vector<lower=0.01, upper=0.99>[N_total] opp_rate_prev; real prior_mu_alpha_mu; real<lower=0> prior_mu_alpha_sigma; real prior_mu_beta_mu; real<lower=0> prior_mu_beta_sigma; real<lower=0> prior_sigma_lambda; real<lower=0> prior_Omega_shape; int<lower=0, upper=1> run_diagnostics; } parameters { // Population-level parameters real mu_alpha; // Population mean baseline bias (logit scale) real mu_beta; // Population mean memory sensitivity vector<lower=0>[2] sigma; // Marginal SDs // Cholesky factor of the correlation matrix cholesky_factor_corr[2] L_Omega; // Individual-level parameters (NCP) matrix[2, N_subjects] z; } transformed parameters { matrix[N_subjects, 2] indiv_params; { matrix[2, N_subjects] deviations = diag_pre_multiply(sigma, L_Omega) * z; for (j in 1:N_subjects) { indiv_params[j, 1] = mu_alpha + deviations[1, j]; indiv_params[j, 2] = mu_beta + deviations[2, j]; } } } model { // Hyperpriors target += normal_lpdf(mu_alpha | prior_mu_alpha_mu, prior_mu_alpha_sigma); target += normal_lpdf(mu_beta | prior_mu_beta_mu, prior_mu_beta_sigma); target += exponential_lpdf(sigma | prior_sigma_lambda); target += lkj_corr_cholesky_lpdf(L_Omega | prior_Omega_shape); // Individual-level prior (NCP) target += std_normal_lpdf(to_vector(z)); // Likelihood vector[N_total] logit_p; for (i in 1:N_total) { logit_p[i] = indiv_params[subj_id[i], 1] + indiv_params[subj_id[i], 2] * logit(opp_rate_prev[i]); } target += bernoulli_logit_lpmf(y | logit_p); } generated quantities { matrix[2, 2] Omega = multiply_lower_tri_self_transpose(L_Omega); real rho = Omega[1, 2]; vector[N_subjects] alpha = indiv_params[, 1]; vector[N_subjects] beta_mem = indiv_params[, 2]; // Prior log-density real lprior = normal_lpdf(mu_alpha | prior_mu_alpha_mu, prior_mu_alpha_sigma) + normal_lpdf(mu_beta | prior_mu_beta_mu, prior_mu_beta_sigma) + exponential_lpdf(sigma | prior_sigma_lambda) + lkj_corr_cholesky_lpdf(L_Omega | prior_Omega_shape) + std_normal_lpdf(to_vector(z)); // Predictive checks and pointwise log-likelihood vector[N_total] log_lik; array[N_total] int y_post_rep; array[N_total] int y_prior_rep; // Prior draws for prior-posterior update plots real mu_alpha_prior = normal_rng(prior_mu_alpha_mu, prior_mu_alpha_sigma); real mu_beta_prior = normal_rng(prior_mu_beta_mu, prior_mu_beta_sigma); real sigma_alpha_prior = exponential_rng(prior_sigma_lambda); real sigma_beta_prior = exponential_rng(prior_sigma_lambda); // LKJ(2) marginal on rho for 2x2: 2*Beta(2,2) - 1 real rho_prior = 2.0 * beta_rng(2.0, 2.0) - 1.0; if (run_diagnostics) { matrix[2, 2] L_Omega_prior = lkj_corr_cholesky_rng(2, prior_Omega_shape); vector[2] sigma_prior = [sigma_alpha_prior, sigma_beta_prior]'; matrix[2, N_subjects] z_prior; for (j in 1:N_subjects) { z_prior[1,j] = normal_rng(0,1); z_prior[2,j] = normal_rng(0,1); } matrix[N_subjects, 2] indiv_prior; { matrix[2, N_subjects] dev_prior = diag_pre_multiply(sigma_prior, L_Omega_prior) * z_prior; for (j in 1:N_subjects) { indiv_prior[j, 1] = mu_alpha_prior + dev_prior[1, j]; indiv_prior[j, 2] = mu_beta_prior + dev_prior[2, j]; } } for (i in 1:N_total) { int j = subj_id[i]; real lp_post = indiv_params[j,1] + indiv_params[j,2] * logit(opp_rate_prev[i]); real lp_prior = indiv_prior[j,1] + indiv_prior[j,2] * logit(opp_rate_prev[i]); log_lik[i] = bernoulli_logit_lpmf(y[i] | lp_post); y_post_rep[i] = bernoulli_logit_rng(lp_post); y_prior_rep[i] = bernoulli_logit_rng(lp_prior); } } } " stan_file_memory_ncp <- here::here("stan", "ch07_multilevel_memory_ncp.stan") writeLines(stan_memory_ncp_code, stan_file_memory_ncp) mod_memory_ncp <- cmdstan_model(stan_file_memory_ncp) ``` Two Stan details are worth noting. First, the local block `{ matrix[2,J] deviations = ...; }` in `transformed parameters` confines the intermediate matrix to local scope, preventing it from appearing in the parameter output. Second, `lkj_corr_cholesky_rng(2, 2.0)` in `generated quantities` draws a full Cholesky factor from the LKJ(2) prior. This is the correct way to generate prior predictive data — unlike the Beta marginal trick, it generalizes to the $3 \times 3$ case we will need for the WSLS agent. ::: {.callout-tip title="Stan Efficiency — Cholesky Factorization for Correlation Matrices"} Whenever a model includes a **correlation matrix** $\Omega$ over individual-level parameters, use the Cholesky-factorized parameterization rather than sampling $\Omega$ directly. | Parameterization | Type | Why avoid | |---|---|---| | `corr_matrix[K] Omega` + `lkj_corr_lpdf` | Direct | Requires $O(K^3)$ matrix square root and positive-definiteness checks at every leapfrog step | | `cholesky_factor_corr[K] L_Omega` + `lkj_corr_cholesky_lpdf` | Cholesky | $L_\Omega$ is already lower-triangular; lies directly on its natural constrained manifold; $O(K^2)$ operations | ```stan // Avoid parameters { corr_matrix[2] Omega; } model { Omega ~ lkj_corr(2.0); } // Prefer parameters { cholesky_factor_corr[2] L_Omega; } model { L_Omega ~ lkj_corr_cholesky(2.0); } transformed parameters { // Apply the Cholesky scale to NCP offsets in O(K^2) — no matrix inversion matrix[2, J] deviations = diag_pre_multiply(sigma, L_Omega) * z; } generated quantities { // Recover the full correlation matrix once per saved draw (not every leapfrog step) matrix[2, 2] Omega = multiply_lower_tri_self_transpose(L_Omega); real rho = Omega[1, 2]; } ``` **Key point:** the matrix multiply `diag_pre_multiply(sigma, L_Omega) * z` gives the individual-level deviations directly — no matrix square-root needed at run time. The correlation matrix $\Omega$ is only reconstructed in `generated quantities`, where it runs once per saved draw instead of once per leapfrog step. For $K = 2$ parameters the saving is modest; for $K = 5$ or more (e.g., five per-subject RL parameters in a full model) it becomes substantial. ::: #### Fitting the model ```{r ch07_Fit_Memory_NCP} fit_memory_path <- here::here("simmodels", "ch07_fit_memory_ncp.rds") # Initialize at prior means. N_subj_val_mem <- stan_data_memory$N_subjects init_memory_list <- lapply(1:4, function(i) { list( mu_alpha = rnorm(1, 0, 0.3), mu_beta = rnorm(1, 0, 0.3), sigma = abs(rnorm(2, 0, 0.3)) + 0.1, L_Omega = matrix(c(1, 0, 0, 1), nrow=2), z = matrix(0, nrow=2, ncol=N_subj_val_mem) ) }) if (regenerate_simulations || !file.exists(fit_memory_path)) { fit_memory <- mod_memory_ncp$sample( data = stan_data_memory, seed = 123, chains = 4, parallel_chains = 4, iter_warmup = 1000, iter_sampling = 1000, init = init_memory_list, adapt_delta = 0.90, max_treedepth = 12, refresh = 0 ) fit_memory$save_object(fit_memory_path) } else { fit_memory <- readRDS(fit_memory_path) } ``` We raise `adapt_delta` to 0.90. The bivariate NCP is well-behaved globally, but the correlation parameter $\rho$ introduces a curved direction that can require longer leapfrog trajectories. The higher target acceptance rate forces the dual-averaging algorithm to find a smaller step size, reducing integration error along this direction. If `max_treedepth` warnings appear (distinct from divergences), raise `max_treedepth` further — do not raise `adapt_delta` in response to those. #### Assessing the fitting process Even if the Stan compiler finishes without explicit warnings, our workflow demands that we systematically verify the geometry of the posterior and the efficiency of the sampler. ```{r ch07_memory_diagnostics} #fit_memory$cmdstan_diagnose() fit_memory$diagnostic_summary() ``` We are looking for strict zeros in `num_divergent`. With the NCP, this should hold. Any divergences here are a signal that `adapt_delta` needs further adjustment, the priors are misspecified, or there is a structural model issue. #### Chain Mixing and Stationarity ($\hat{R}$) ```{r ch07_memory_chain_mixing} draws_memory <- fit_memory$draws() mcmc_trace( draws_memory, pars = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho"), np = nuts_params(fit_memory) ) + labs( title = "Trace Plots: Population Parameters (Memory Agent NCP)", subtitle = "All five population-level parameters, including rho. Stationary chains expected throughout." ) draws_memory |> subset_draws(variable = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho", "alpha[1]", "alpha[2]", "beta_mem[1]", "beta_mem[2]")) |> summarise_draws(default_convergence_measures()) |> knitr::kable(digits = 3) ``` The potential scale reduction factor ($\hat{R}$) must be below 1.01 for all parameters. The correlation parameter `rho` is informed only by covariation *across* agents, not within any individual agent's trial sequence, so its ESS will be lower than the marginal parameters. ESS > 400 remains our threshold. An ESS below 200 for `rho` warrants investigation of the correlation manifold geometry. ##### Effective Sample Size (ESS) Monitor `ess_tail` for `sigma[1]` and `sigma[2]`. Population SDs can develop mild funnel-like behavior even under NCP when the likelihood is highly informative about individual parameters, because the $z_j$ draws must simultaneously explain many agents. If `ess_tail` is low for the sigmas but `ess_bulk` is acceptable, raising `adapt_delta` further or running more warmup iterations is the correct response. ##### Energy (E-BFMI) ```{r ch07_memory_energy} mcmc_nuts_energy(nuts_params(fit_memory)) + labs( title = "Energy Transitions: Memory Agent NCP", subtitle = "Marginal and transition energy distributions must align. E-BFMI > 0.2 required." ) ``` #### Posterior Parameter Correlations ```{r ch07_memory_posterior_correlations} mcmc_pairs( draws_memory, pars = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho"), np = nuts_params(fit_memory), diag_fun = "dens", off_diag_fun = "hex" ) ``` The critical panels are $(\sigma_\alpha, \mu_\alpha)$ and $(\sigma_\beta, \mu_\beta)$. Under the CP biased agent, these showed triangular wedges — the Neal's Funnel signature. Under NCP, they should be roughly elliptical. Any triangular shape here would be unexpected and would require investigation. ### Prior-Posterior Update Checks: Quantifying Learning ```{r ch07_memory_pp_update} draws_df_memory <- as_draws_df(draws_memory) # Extract true generative hyperparameters saved as attributes true_hp <- attr(sim_data_memory, "hyperparameters") true_pop_mem <- tibble( parameter = c("mu_alpha", "mu_beta", "sigma_alpha", "sigma_beta", "rho"), true_value = c(true_hp$mu_alpha, true_hp$mu_beta, true_hp$sigma_alpha, true_hp$sigma_beta, true_hp$rho) ) # Reshape: split on the underscore immediately before "posterior" or "prior" pp_update_mem <- draws_df_memory |> dplyr::select( mu_alpha_posterior = mu_alpha, mu_alpha_prior, mu_beta_posterior = mu_beta, mu_beta_prior, sigma_alpha_posterior = `sigma[1]`, sigma_alpha_prior, sigma_beta_posterior = `sigma[2]`, sigma_beta_prior, rho_posterior = rho, rho_prior ) |> pivot_longer( everything(), names_to = c("parameter", "distribution"), names_sep = "_(?=posterior|prior)" ) ggplot(pp_update_mem, aes(x = value)) + geom_histogram(aes(fill = distribution), alpha = 0.6, color = NA, bins = 60) + geom_vline( data = true_pop_mem, aes(xintercept = true_value, linetype = "True Value"), color = "darkred", linewidth = 1 ) + facet_wrap(~parameter, scales = "free") + scale_fill_manual( name = "Distribution", values = c("prior" = "gray70", "posterior" = "steelblue") ) + scale_linetype_manual( name = "Ground Truth", values = c("True Value" = "dashed") ) + labs( title = "Prior-Posterior Update Checks: Memory Agent", x = "Parameter Value", y = "Count" ) ``` A healthy update shows a posterior significantly narrower than the prior and comfortably contained within the prior's mass. The `rho` panel carries the central new scientific information. If the posterior on $\rho$ is essentially indistinguishable from the LKJ(2) prior, the data do not contain sufficient information to identify the population-level correlation — this can happen with small $J$ or short trial sequences. In that case, report the parameter as unidentified, not the prior as the conclusion. A clearly narrowed posterior constitutes genuine evidence about the cognitive architecture of the population. ### Posterior Predictive Checks (PPC): Generative Adequacy ```{r ch07_memory_ppc} y_obs_mem <- as.numeric(stan_data_memory$y) agent_idx_mem <- as.numeric(stan_data_memory$subj_id) y_rep_mem <- fit_memory$draws("y_post_rep", format = "matrix") n_sample_mem <- 12 sampled_agents_mem <- sample(unique(agent_idx_mem), n_sample_mem) keep_mask_mem <- agent_idx_mem %in% sampled_agents_mem prop_right <- function(x) mean(x) ppc_stat_grouped( y = y_obs_mem[keep_mask_mem], yrep = y_rep_mem[, keep_mask_mem], group = agent_idx_mem[keep_mask_mem], stat = "prop_right" ) + labs( title = "Posterior Predictive Checks: Agent-Level Choice Rates (Subset)", subtitle = paste("Showing", n_sample_mem, "randomly sampled agents. Dark line: Observed. Histogram: Predicted."), x = "Proportion of 'Right' Choices" ) ``` #### Calibration and Recovery We have proven the model fits the data and generates plausible predictions. Now we must prove the parameters we extract are actually the true parameters that generated the data. For a bivariate model, recovery must be assessed at both levels: all five population parameters, and the individual $(\alpha_j, \beta_j)$ pairs jointly. ##### Individual-Level Recovery ```{r ch07_memory_recovery} recovery_alpha <- draws_memory |> subset_draws(variable = "alpha") |> summarise_draws() |> mutate(agent_id = as.integer(str_extract(variable, "(?<=\\[)\\d+(?=\\])"))) recovery_beta <- draws_memory |> subset_draws(variable = "beta_mem") |> summarise_draws() |> mutate(agent_id = as.integer(str_extract(variable, "(?<=\\[)\\d+(?=\\])"))) true_agent_params_mem <- sim_data_memory |> distinct(agent_id, n_trials, true_alpha, true_beta) recovery_mem_df <- true_agent_params_mem |> left_join(recovery_alpha |> dplyr::select(agent_id, alpha_est = median, alpha_q5 = q5, alpha_q95 = q95), by = "agent_id") |> left_join(recovery_beta |> dplyr::select(agent_id, beta_est = median, beta_q5 = q5, beta_q95 = q95), by = "agent_id") p_rec_alpha <- ggplot(recovery_mem_df, aes(x = true_alpha, y = alpha_est)) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") + geom_errorbar(aes(ymin = alpha_q5, ymax = alpha_q95), width = 0, alpha = 0.5, color = "steelblue") + geom_point(aes(size = n_trials), color = "midnightblue", alpha = 0.8) + labs( title = expression("Recovery: Baseline Bias " * alpha[j]), x = "True alpha (logit scale)", y = "Posterior Median", size = "N trials" ) p_rec_beta <- ggplot(recovery_mem_df, aes(x = true_beta, y = beta_est)) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") + geom_errorbar(aes(ymin = beta_q5, ymax = beta_q95), width = 0, alpha = 0.5, color = "steelblue") + geom_point(aes(size = n_trials), color = "midnightblue", alpha = 0.8) + labs( title = expression("Recovery: Memory Sensitivity " * beta[j]), x = "True beta", y = "Posterior Median", size = "N trials" ) p_rec_alpha + p_rec_beta + plot_layout(guides = "collect") ``` Recovery for $\beta_j$ is typically noisier than for $\alpha_j$ at any given trial count. The memory signal $\text{logit}(m_{ij})$ starts at zero (the $\text{logit}(0.5) = 0$ initialization) and becomes informative only as the cumulative opponent history diverges from 0.5. Agents with few trials show stronger shrinkage toward $\mu_\beta = 1.5$ — not toward zero, but toward the estimated population mean, which is the appropriate regularization given what the rest of the cohort tells us about the population. #### Prior sensitivity ```{r ch07_memory_prior_sensitivity} sens_results_mem <- powerscale_sensitivity( fit_memory, variable = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho") ) cat("=== PRIOR SENSITIVITY DIAGNOSTICS: MEMORY AGENT ===\n") print(sens_results_mem) ps_seq_mem <- powerscale_sequence( fit_memory, variable = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho") ) powerscale_plot_dens(ps_seq_mem) + labs( title = "Prior Sensitivity: Density Perturbation (Memory Agent)", subtitle = "Does scaling the prior or likelihood drastically alter our beliefs about the population?" ) powerscale_plot_quantities(ps_seq_mem) + labs( title = "Prior Sensitivity: Parameter Estimates (Memory Agent)", subtitle = "Horizontal lines: baseline posterior bounds. Deviations indicate sensitivity." ) ``` Pay particular attention to the `rho` panel. Because the correlation is identified only from across-agent covariation, it has fewer effective observations than the marginal population parameters. If `rho` shows strong sensitivity to prior scaling, the data contain limited information about the correlation structure and the LKJ prior is doing meaningful regularization. #### Simulation-Based Calibration Checks (SBC) Parameter recovery on a single dataset is anecdotal. We must prove that the sampler is globally unbiased across the entire prior space. ```{r ch07_sbc_memory} sbc_memory_path <- here::here("simmodels", "ch07_sbc_memory.rds") generate_sbc_dataset_memory <- function(n_agents = 50, n_trials_min = 80, n_trials_max = 240) { # Use the same reversal schedule as the main fit. # opponent_reversal is available in the chapter environment from Edit 6.16. sim <- simulate_hierarchical_memory( n_agents = n_agents, n_trials_min = n_trials_min, n_trials_max = n_trials_max, opponent_choices = opponent_reversal ) hp <- attr(sim, "hyperparameters") list( variables = list( mu_alpha = hp$mu_alpha, mu_beta = hp$mu_beta, `sigma[1]` = hp$sigma_alpha, `sigma[2]` = hp$sigma_beta, rho = hp$rho, alpha = sim |> distinct(agent_id, true_alpha) |> arrange(agent_id) |> pull(true_alpha), beta_mem = sim |> distinct(agent_id, true_beta) |> arrange(agent_id) |> pull(true_beta) ), generated = prep_stan_data_memory(sim) ) } generator_memory <- SBC_generator_function( generate_sbc_dataset_memory, n_agents = 50, n_trials_min = 30, n_trials_max = 120 ) backend_memory <- SBC_backend_cmdstan_sample( mod_memory_ncp, iter_warmup = 1000, iter_sampling = 1000, chains = 1, refresh = 0 ) if (regenerate_simulations || !file.exists(sbc_memory_path)) { sbc_datasets_memory <- generate_datasets(generator_memory, 100) plan(sequential) #future::plan(future::multisession, workers = 4) sbc_results_memory <- compute_SBC(datasets = sbc_datasets_memory, backend = backend_memory, keep_fits = FALSE) saveRDS( list(datasets = sbc_datasets_memory, results = sbc_results_memory), sbc_memory_path ) } else { sbc_mem_obj <- readRDS(sbc_memory_path) sbc_datasets_memory <- sbc_mem_obj$datasets sbc_results_memory <- sbc_mem_obj$results } plot_rank_hist(sbc_results_memory, variables = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho")) + labs(title = "SBC Rank Histograms: Memory Agent NCP") plot_ecdf_diff(sbc_results_memory, variables = c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho")) + labs( title = "SBC ECDF Difference: Memory Agent NCP", subtitle = "Deviation from the diagonal indicates miscalibration." ) # 1. Extract the lightweight stats dataframe from the SBC object sbc_stats_df <- sbc_results_memory$stats # 2. Filter for the population-level parameters you care about # (e.g., mu_alpha, mu_beta, rho, or even sigma[1]) recovery_df <- sbc_stats_df |> filter(variable %in% c("mu_alpha", "mu_beta", "sigma[1]", "sigma[2]", "rho")) # 3. Create the Parameter Recovery Identity Plot p_sbc_recovery <- ggplot(recovery_df, aes(x = simulated_value, y = median)) + # The Identity Line (Perfect Recovery) geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50", linewidth = 1) + # The Posterior Estimates with 90% Credible Intervals geom_errorbar(aes(ymin = q5, ymax = q95), width = 0, alpha = 0.3, color = "steelblue") + geom_point(color = "midnightblue", alpha = 0.7, size = 2) + # Facet by parameter facet_wrap(~variable, scales = "free") + labs( title = "Parameter Recovery (Extracted Directly from SBC)", subtitle = "Points represent the posterior median. Bars are 90% Credible Intervals.", x = "True Generative Value (Drawn from Prior)", y = "Posterior Median Estimate" ) + theme_classic() + theme(strip.background = element_rect(fill = "gray95", color = NA)) print(p_sbc_recovery) # 4. Quantify the Recovery (Correlation and RMSE) recovery_metrics <- recovery_df |> group_by(variable) |> summarize( Pearson_r = cor(simulated_value, median), RMSE = sqrt(mean((simulated_value - median)^2)), .groups = "drop" ) cat("=== PARAMETER RECOVERY QUANTIFICATION ===\n") print(knitr::kable(recovery_metrics, digits = 3)) ``` ```{r ch07_sbc_memory_diagnostics} diags_memory <- sbc_results_memory$backend_diagnostics mem_fail_rate <- mean(diags_memory$n_divergent > 0) * 100 diags_memory |> ggplot(aes(x = n_divergent)) + geom_histogram(binwidth = 1, fill = "darkred", alpha = 0.8) + labs( title = "Memory NCP Divergence Counts Across 200 SBC Universes", subtitle = sprintf("%.0f%% of universes produced at least one divergence.", mem_fail_rate), x = "Number of Divergent Transitions", y = "Number of SBC Universes" ) # Does any failure concentrate at small sigma values? posterior::as_draws_df(sbc_datasets_memory$variables) |> as_tibble() |> mutate(sim_id = row_number()) |> left_join( diags_memory |> mutate(sim_id = row_number()) |> dplyr::select(sim_id, n_divergent), by = "sim_id" ) |> mutate(had_divergence = n_divergent > 0) |> ggplot(aes(x = `sigma[1]`, y = `sigma[2]`, color = had_divergence)) + geom_point(alpha = 0.6, size = 1.5) + scale_color_manual( values = c("FALSE" = "steelblue", "TRUE" = "darkred"), labels = c("FALSE" = "Clean", "TRUE" = "Divergent"), name = NULL ) + labs( title = "Memory NCP: Where Do Divergences Occur?", subtitle = "Any divergences should not concentrate at small sigma values (that would indicate residual funnel geometry).", x = expression(sigma[alpha]), y = expression(sigma[beta]) ) ``` The `rho` rank histogram will typically have wider confidence bands than the marginal mean and SD histograms at $N = 200$. This is expected: $\rho$ is identified only from covariation across $J = 50$ agents, so each SBC universe contributes one effective draw of $\rho$ to the calibration check. Mild irregularity within the confidence envelope is statistical noise. Systematic U-shape or strong asymmetry would indicate a genuine calibration failure. #### Cognitive Interpretation of $\hat{\rho}$ The posterior on $\rho$ is the inferential result that distinguishes this model from two independent univariate hierarchical models. * **$\rho \approx 0$**: Baseline tendency and memory-based opponent tracking are orthogonal individual differences. Studying one tells us nothing about the other. * **$\rho > 0$**: Agents with a stronger directional baseline also tend to weight memory more heavily — consistent with a general exploitation tendency, where both parameters index willingness to depart from 50/50 in response to a signal. * **$\rho < 0$**: Agents with a strong directional baseline tend to suppress memory use — a compensatory account in which prior commitment crowds out sensitivity to the opponent's history. None of these interpretations follow from the model alone. The model provides a calibrated posterior over $\rho$. Cognitive theory provides the framework for deciding what that value means. #### LOO for Model Comparison We retain the LOO-CV score here for the model comparison section at the end of the chapter. ```{r ch07_loo_memory} loo_memory <- loo(fit_memory$draws("log_lik", format = "matrix")) print(loo_memory) ``` ```{r ch07_save_exports} # Save everything Ch. 8 needs for model comparison. # Ch. 8 will load this object to avoid refitting. ch07_exports_file <- here("simmodels", "ch07_memory_exports_for_ch7.rds") saveRDS( list( fit_memory = fit_memory, # Fitted CmdStanR object loo_memory = loo_memory, # Pre-computed LOO score stan_data_memory = stan_data_memory, # Stan data list (reversal design) sim_data_memory = sim_data_memory, # Full trial-level tibble reversal_schedule = opponent_reversal # 240-trial opponent sequence ), ch07_exports_file ) cat("Ch. 7 exports saved to:", ch07_exports_file, "\n") cat("Load in Ch. 8 with: readRDS(here('simmodels', 'ch07_memory_exports_for_ch7.rds'))\n") ``` ## Conclusion: From One Agent to a Population This chapter extended the Bayesian modeling workflow to multiple participants. The core ideas are: - **Partial pooling** is not a compromise — it is the statistically correct way to combine information across individuals with unequal data, and it provably outperforms both extremes (as the shrinkage plot showed). - **Centered vs. non-centered parameterization** is a geometric choice, not a philosophical one. CP fails when population variance is small (Neal's Funnel); NCP fixes the global geometry but can be locally inefficient when data are very informative. SBC revealed which regime we are in. - **The reversal learning design** is essential for the memory model. A constant-bias opponent makes `bias` and `beta` statistically inseparable; the reversal schedule provides the contrast that identifies them. **Coming up in Chapter 8:** We have now validated two models — the hierarchical biased agent and the hierarchical memory agent — and computed LOO scores for both. Chapter 8 introduces model comparison: given that both models have passed validation, which one better predicts held-out data? The answer will use the LOO scores and the `loo_compare()` function from the `loo` package.

7.1 Introduction

7.1.1 The Epistemic Challenge: Pooling Information

7.1.2 The Solution: Partial Pooling (Multilevel or Hierarchical Modeling)

7.1.2.1 Mixed Models in the Fields

7.1.2.2 Stein’s Paradox and the Optimality of Shrinkage

7.1.2.3 Bayesian Formalisation

7.1.2.4 Into Cognitive Science

7.1.2.5 Non-Centred Parameterisation and Neal’s Funnel

7.2 Learning Objectives

7.2.1 Biased Agent Model (Multilevel)

7.3 Phase 1: Generative Plausibility of the Hierarchical Biased Agent

7.3.1 Executing the Prior Predictive Check: The Logit-Normal Artifact

7.4 Phase 2: Data Preparation and the Centered Stan Model

7.4.1 Stan Model: Centered Parameterization (CP)

7.4.2 Fitting the model

7.4.3 Assessing the fitting process

7.4.3.1 Chain Mixing and Stationarity (\(\hat{R}\))

7.4.3.2 Effective Sample Size (ESS)

7.4.3.3 Energy and E-BFMI

7.4.4 Prior-Posterior Update Checks: Quantifying Learning

7.4.5 Posterior Parameter Correlations

7.4.6 Posterior Predictive Checks (PPC): Generative Adequacy

7.4.7 Calibration and Recovery

7.4.7.1 Individual-Level Recovery

7.4.8 Prior sensitivity

7.4.9 Simulation-Based Calibration Checks (SBC)

7.4.10 Setting up the SBC Pipeline

7.4.11 Non-Centered Parameterization (NCP) as the Solution

7.4.11.1 The Geometry of the Transformation

7.4.12 Stan Model: Multilevel Biased Agent (NCP)

7.4.12.1 Posterior Simulation-Based Calibration Checks: Probing Local Geometry

7.4.13 Comparing CP and NCP: Global Fragility vs. Local Efficiency

7.5 The Epistemic Showdown: Visualizing Shrinkage

7.5.1 The Complete Pooling Model (Underfitting)

7.5.2 The No Pooling Model (Overfitting)

7.5.3 The Geometry of Shrinkage

7.5.4 The Horizontal Lines: The Quartet of Population Means

7.5.5 The Points and Arrows: Information-Weighted Regularization

7.6 The Memory Agent: Two Correlated Individual Differences

7.6.1 Introduction

7.6.1.1 The Bivariate Population Distribution

7.6.1.2 The LKJ Prior

7.6.1.3 Non-Centered Parameterization

7.6.2 Phase 1: Generative Plausibility of the Memory Agent

7.6.2.1 Executing the Prior Predictive Check: The Joint Population Cloud

7.6.3 Phase 2: Data Preparation and the NCP Stan Model

7.6.3.1 Stan Model: Hierarchical Memory Agent (NCP)

7.6.3.2 Fitting the model

7.6.3.3 Assessing the fitting process

7.6.3.4 Chain Mixing and Stationarity (\(\hat{R}\))

7.6.3.4.1 Effective Sample Size (ESS)

7.6.3.4.2 Energy (E-BFMI)

7.6.3.5 Posterior Parameter Correlations

7.6.4 Prior-Posterior Update Checks: Quantifying Learning

7.6.5 Posterior Predictive Checks (PPC): Generative Adequacy

7.6.5.1 Calibration and Recovery

7.6.5.1.1 Individual-Level Recovery

7.6.5.2 Prior sensitivity

7.6.5.3 Simulation-Based Calibration Checks (SBC)

7.6.5.4 Cognitive Interpretation of \(\hat{\rho}\)

7.6.5.5 LOO for Model Comparison

7.7 Conclusion: From One Agent to a Population