# From simulation to model fitting {#sec-inferring-rates}
```{r ch05_setup, include=FALSE}
knitr::opts_chunk$set(
warning = FALSE,
message = FALSE,
echo = TRUE,
fig.width = 8,
fig.height = 5,
fig.align = 'center',
out.width = "80%",
dpi = 300
)
pacman::p_load(
tidyverse,
here,
cmdstanr,
posterior,
bayesplot,
furrr,
future,
patchwork
)
# Parallel Setup: Leave 2 cores free for system stability
n_cores <- max(1, parallel::detectCores() - 2)
future::plan(multisession, workers = n_cores)
theme_set(theme_classic())
```
> **📍 Where we are in the Bayesian modeling workflow:**
> Ch. 3 built forward models — code that generates data *from* known parameters.
> This chapter reverses the arrow: given observed choices, we infer the parameters
> that most plausibly produced them. This is **Bayesian parameter estimation**, and
> it is the engine at the heart of all remaining chapters. We fit to *simulated*
> data here precisely because we know the ground truth — a sanity check before the
> engine is handed real human data in Ch. 6.
In Chapter 3, we formalized verbal models into computational agents and explored their behavior through simulation by *setting* their parameters (like bias `rate` or WSLS rules). However, when analyzing real experimental data, we don't know the true parameter values underlying a participant's behavior. The core task shifts from simulation to **parameter estimation**: inferring the plausible values of model parameters *given* the observed data.
## Learning Objectives
This chapter transitions from simulating predefined models (Chapter 3) to inferring model parameters from data using Bayesian methods. By the end of this chapter, you will be able to:
* **Model Cognitive Processes:** Estimate parameters for simple cognitive strategies, including biased choice and different types of memory models.
* **Master Reparameterization:** Rethink the mathematical form of models, and appreciate the utility of the log-odds transformation for mapping bounded probability parameters onto an unbounded scale.
* **Read MCMC Diagnostics:** Run the mandatory diagnostic battery for Bayesian inference, focusing on chain convergence ($\hat{R}$) and effective sample size (ESS).
In a rigorous Bayesian workflow, we *never* fit a model to empirical data without validating it first. Therefore, throughout this chapter, we will fit our models to a single **simulated dataset** where we know the ground truth. This allows us to learn the mechanics of Stan and verify our single-fit posterior. In the next chapter, we will scale this up into a full rigorous validation pipeline (Parameter Recovery and predictive checks).
## The Challenge: Inferring Latent Parameters
Why is this a challenge?
1. **Parameters are Unseen:** Cognitive parameters (like bias strength, memory decay, learning rate) are not directly observable; we must infer them from behavior (choices, RTs, etc.).
2. **Behavior is Noisy:** Human behavior is variable. Our models need to account for this noise while extracting the underlying parameter signal.
3. **Model Plausibility:** We need to determine not just *whether* a model fits, but *how well* it fits, what its strengths and limitations are, and whether the estimated parameter values are theoretically meaningful.
Consider the biased agent from Chapter 3. We simulated agents with a known `rate`. Now, imagine you only have the sequence of choices (the `h` data) from an unknown agent. How can you estimate the underlying `rate` that most likely generated those choices?
**Our Approach: Bayesian Inference with Stan**
This chapter introduces **Bayesian inference** as our primary tool for parameter estimation. The Bayesian approach allows us to combine prior knowledge about parameters with information from the observed data (via the likelihood function) to arrive at an updated understanding (posterior distribution) of the parameters. We will use **Stan**, a powerful probabilistic programming language, implemented via the `cmdstanr` R package, to:
1. **Specify Models:** Define our cognitive models formally, including parameters and their prior distributions.
2. **Fit Models:** Use Stan's algorithms (primarily Hamiltonian Monte Carlo, HMC, plus a faster but riskier approximation method called Pathfinder) to sample from the posterior distribution of parameters given the data.
3. **Evaluate Fits:** Examine the results to understand parameter estimates and their uncertainty.
As a crucial validation step, we will first apply these methods to **simulated data**, where we know the true parameters. This allows us to check if our models and fitting procedures work correctly – a process called **parameter recovery**.
> **Terminology note — "parameter recovery" used carefully:**
> Here and in the conclusion of this chapter, we use "parameter recovery" loosely
> to mean: *fitting a model to one simulated dataset and checking whether the
> posterior lands near the true value.* This is a useful first sanity check.
> In Ch. 6 we will use the term more precisely to mean a *systematic* study that
> repeats this across dozens of datasets at many different true parameter values
> and noise levels. That is a qualitatively different (and stronger) validation.
> The even stricter test — Simulation-Based Calibration (SBC) — checks whether
> posterior *intervals* are correctly calibrated across the entire prior space.
> These distinctions will matter a lot in Ch. 6, so don't worry about them too much for now; you'll hear more about them there.
## Simulating data
As usual we start with simulated data, where we know the underlying mechanisms and parameter values.
Simulated data are rarely enough (empirical data often pose unexpected challenges), but they are a great starting point to stress-test your model: does the model reconstruct the right parameter values? Does it reproduce the overall patterns in the data?
Here we build a new simulation of random agents with bias and noise. The code and visualization are nothing new compared to the last chapter.
```{r ch05_simulating_data}
# Flag to control whether to regenerate simulation/fitting results for the knitted files
# Set to TRUE to rerun everything (takes time!), FALSE to load saved results.
regenerate_simulations <- FALSE
# --- Simulate Data for Fitting ---
# Goal: Generate data from known models/parameters to test our fitting procedures.
trials <- 120 # Number of trials per simulated agent
# Define the agent function (copied from the previous chapter)
# This agent chooses '1' with probability 'rate', potentially adding noise.
RandomAgentNoise_f <- function(rate, noise) {
  # 'rate' arrives on the log-odds scale here (the caller passes qlogis(rate));
  # plogis() maps it back to a probability in (0, 1)
  rate_prob <- plogis(rate)
choice <- rbinom(1, 1, rate_prob) # Base choice
if (noise > 0 & rbinom(1, 1, noise) == 1) { # Check if noise occurs
choice <- rbinom(1, 1, 0.5) # Override with random 50/50 choice
}
return(choice)
}
# --- Generate Data Across Different Conditions ---
# Define paths for saving/loading simulation data (anchored at the project root via here())
sim_data_dir <- here("simdata")
sim_data_file <- file.path(sim_data_dir, "ch06_randomnoise.csv")
# Create directory if it doesn't exist
if (!dir.exists(sim_data_dir)) dir.create(sim_data_dir)
if (regenerate_simulations || !file.exists(sim_data_file)) {
cat("Generating new simulation data...\n")
# Use expand_grid for cleaner parameter combinations
param_grid <- expand_grid(
noise = seq(0, 0.5, 0.1), # Noise levels
rate = seq(0, 1, 0.1) # Bias rate levels (probability scale)
)
  # Use pmap_dfr to simulate each parameter combination and row-bind the results
d <- pmap_dfr(param_grid, function(noise, rate) {
# Simulate choices for one agent/condition
    choices <- map_int(1:trials, ~RandomAgentNoise_f(qlogis(rate), noise)) # qlogis() puts the probability-scale rate on the log-odds scale the agent function expects
# Create tibble for this condition
tibble(
trial = 1:trials,
choice = choices,
true_rate = rate, # Store the true rate used for generation
true_noise = noise # Store the true noise used
)
}) %>%
# Calculate cumulative rate for visualization
group_by(true_rate, true_noise) %>%
mutate(cumulative_rate = cumsum(choice) / row_number()) %>%
ungroup()
# Save the generated data
write_csv(d, sim_data_file)
cat("Generated new simulation data and saved to", sim_data_file, "\n")
} else {
# Load existing simulation data
d <- read_csv(sim_data_file)
cat("Loaded existing simulation data from", sim_data_file, "\n")
}
# --- Visualize Simulated Data ---
# Plot cumulative rate, faceted by noise level
p1 <- ggplot(d, aes(x = trial, y = cumulative_rate, group = true_rate, color = true_rate)) +
geom_line(alpha = 0.8) +
geom_hline(yintercept = 0.5, linetype = "dashed") +
scale_color_viridis_c() + # Use perceptually uniform color scale
facet_wrap(~paste("Noise =", true_noise)) + # Facet by noise level
labs(
title = "Simulated Choice Behavior (Cumulative Rate)",
subtitle = "Agents with different bias rates and noise levels",
x = "Trial Number",
y = "Cumulative Proportion Choosing 'Right'",
color = "True Bias Rate"
) +
theme_minimal() +
ylim(0, 1)
print(p1)
```
## How the Posterior Is Actually Computed
Before we hand this problem to Stan, it is worth understanding what Stan is
actually *doing* — because "posterior ∝ prior × likelihood" is easy to say
and hard to see. This section builds a posterior by hand using a **grid
approximation**: a brute-force method that works when there is only one
parameter and we are willing to evaluate every possible value of it.
**The setup.** Suppose an agent made 3 choices, picking "right" on trials 1
and 3 and "left" on trial 2 (`h = c(1, 0, 1)`). We want to know: what is
the most plausible value of θ — the agent's underlying bias toward "right"?
**Bayes' rule** tells us:
$$P(\theta \mid \text{data}) \propto P(\theta) \times P(\text{data} \mid \theta)$$
In words: the posterior is proportional to the prior (what we believed before
seeing the data) multiplied by the likelihood (how probable the data is under
each value of θ). We then re-normalize so the result sums to 1.
```{r 04-bayes-by-hand, fig.cap="Grid approximation with 3 observed choices. The posterior (blue) is the pointwise product of the prior (red) and the likelihood (green), re-normalized. Notice that the prior is flat — every value of θ was equally plausible beforehand — and the posterior peaks near θ = 0.7, which best explains seeing 2 'right' choices out of 3."}
# 1. Define a coarse grid of 10 candidate values for theta
theta_grid <- seq(0.1, 1.0, by = 0.1)
# 2. Prior: uniform — Beta(1,1) — every value equally plausible before seeing data
prior <- rep(1, length(theta_grid))
prior <- prior / sum(prior) # Normalize to sum to 1
# 3. Observed data
h <- c(1, 0, 1) # 2 "right" out of 3 trials
# 4. Likelihood: P(data | theta)
# Each choice is Bernoulli(theta), so the joint likelihood is the product
# across trials
likelihood <- sapply(theta_grid, function(th) {
prod(dbinom(h, size = 1, prob = th))
})
# 5. Unnormalized posterior: prior * likelihood (pointwise)
posterior_unnorm <- prior * likelihood
# 6. Normalize so it sums to 1
posterior <- posterior_unnorm / sum(posterior_unnorm)
# --- Display as a table ---
grid_table <- tibble(
`θ` = theta_grid,
`Prior` = round(prior, 3),
`Likelihood` = round(likelihood, 4),
`Posterior` = round(posterior, 3)
)
knitr::kable(grid_table,
caption = "Grid approximation: prior × likelihood → posterior (3 observations)")
# --- Plot all three distributions ---
grid_long <- grid_table %>%
rename(theta = `θ`, prior = Prior, likelihood = Likelihood, posterior = Posterior) %>%
pivot_longer(c(prior, likelihood, posterior),
names_to = "distribution", values_to = "mass") %>%
mutate(distribution = factor(distribution,
levels = c("prior", "likelihood", "posterior"),
labels = c("Prior", "Likelihood", "Posterior")))
ggplot(grid_long, aes(x = theta, y = mass, fill = distribution)) +
geom_col(position = "dodge", alpha = 0.8, width = 0.08) +
scale_fill_manual(values = c("Prior" = "red", "Likelihood" = "darkgreen",
"Posterior" = "blue")) +
scale_x_continuous(breaks = theta_grid) +
labs(
title = "Bayes by Hand: Prior × Likelihood → Posterior",
subtitle = "Data: h = c(1, 0, 1) — two 'right' choices out of three trials",
x = "θ (probability of choosing 'right')",
y = "Probability mass",
fill = NULL
) +
theme_classic() +
theme(legend.position = "top")
```
Three things to notice from the table and the plot:
1. **The prior is flat.** Before seeing any data, every value of θ is equally
plausible. This is what `Beta(1, 1)` means — maximal prior ignorance.
2. **The likelihood peaks near θ = 0.7.** The agent chose "right" 2 out of
3 times, so the maximum-likelihood value is 2/3 ≈ 0.67 and θ = 0.7 is the
closest grid point. Values near 0 or 1 make the observed sequence very improbable.
3. **The posterior is the pointwise product, re-normalized.** It is narrower
than the prior (the data ruled out extreme values) and centered on the
likelihood peak. This is exactly the "update": the data shifted our beliefs
from flat to concentrated.
**Why Stan instead of a grid?** This worked with 1 parameter and 10 grid
points. With 2 parameters and 100 grid points each, we need 100 × 100 = 10,000
evaluations. With 5 parameters: 100⁵ = **10 billion**. The grid grows
exponentially — this is the *curse of dimensionality*. Stan's Hamiltonian
Monte Carlo (HMC) sampler sidesteps this by *sampling* from regions of high
posterior probability rather than evaluating every possible combination. The
samples it returns are the blue histogram you will see in the next section —
same idea, just no grid required.
## Building our First Stan Model: Inferring Bias Rate
N.B. Refer to the video and slides for the step-by-step build-up of the Stan code.
Let's use Stan to estimate the underlying bias `rate` (which we'll call `theta` in the model) from just the sequence of choices (`h`) generated by one of our simulated agents. The model is represented graphically in the two diagrams below.
```{tikz, echo=FALSE, fig.cap="Binomial Model with Formulas", fig.ext = 'png', cache=TRUE}
\usetikzlibrary{positioning, shapes.geometric, arrows.meta}
\begin{tikzpicture}[
% GLOBAL STYLES
node distance=1.5cm,
continuous/.style={circle, draw=black, thick, minimum size=1.2cm},
discrete/.style={rectangle, draw=black, thick, minimum size=1.2cm},
observed/.style={fill=gray!30},
latent/.style={fill=white},
stochastic/.style={->, >={Latex[length=3mm]}, thick},
% New style for the formula text
formula/.style={right=1cm, align=left, font=\small}
]
% --- NODES ---
% 1. Rate (Theta) - Top
\node[continuous, latent] (theta) {$\theta$};
% 2. Right (k) - Middle
\node[discrete, observed, below=of theta] (k) {Right};
% 3. Trials (n) - Bottom
\node[discrete, observed, below=of k] (n) {$n$};
% --- EDGES ---
\draw[stochastic] (theta) -- (k);
\draw[stochastic] (n) -- (k);
% --- FORMULAS (The new part) ---
% Formula for Theta
% We use 'align=left' to allow multiple lines if you want to list options
\node[formula] at (theta.east) {
$\theta \sim \text{Beta}(1,1)$
};
% Formula for Right (Outcome)
% Note: If you have 'n' trials, the distribution is usually Binomial
\node[formula] at (k.east) {
$\text{Right} \sim \text{Binomial}(\theta, n)$
};
% (Optional) Label for N if it's fixed
% \node[formula] at (n.east) { $n = 100$ };
% --- LABELS ---
\node[above=0.1cm of theta] {\textbf{Rate}};
\node[below=0.1cm of n] {\textbf{Trials}};
\end{tikzpicture}
```
```{tikz, echo=FALSE, fig.cap="Kruschke-Style Binomial Model", fig.ext = 'png', cache=TRUE}
\usetikzlibrary{positioning, arrows.meta, calc, shapes.geometric}
\begin{tikzpicture}[
node distance=2.5cm,
% 1. DEFINE THE ICONS (PICS)
% Beta Distribution Icon (Smooth Curve)
pics/beta_dist/.style={
code={
% Axis
\draw[thin] (-1,0) -- (1,0);
% Curve (Beta-like hump)
\draw[thick, blue] plot[domain=-1:1, samples=50] (\x, {0.8*exp(-4*\x*\x)});
% Label inside or near
\node[font=\footnotesize, below] at (0,0) {Beta};
}
},
% Binomial Distribution Icon (Bar Chart)
pics/binom_dist/.style={
code={
% Axis
\draw[thin] (-1,0) -- (1,0);
% Bars
\draw[thick, blue, ycomb] plot coordinates {
(-0.6, 0.2) (-0.2, 0.5) (0.2, 0.7) (0.6, 0.3)
};
% Label
\node[font=\footnotesize, below] at (0,0) {Binomial};
}
},
% Arrow Style with Tilde
stochastic/.style={->, >={Latex[length=3mm]}, thick, font=\large},
label_text/.style={align=center, font=\small}
]
% --- TOP: PRIOR (RATE) ---
% We define a coordinate for the center of the Beta plot
\coordinate (prior_pos) at (0,0);
% Draw the picture
\draw (prior_pos) pic (beta) {beta_dist};
% Add Labels for Prior
\node[above=0.8cm of prior_pos, font=\bfseries] {Rate ($\theta$)};
\node[right=1.2cm of prior_pos, anchor=west, font=\small] {$\theta \sim \text{Beta}(1,1)$};
% --- BOTTOM: LIKELIHOOD (CHOICE) ---
% Position it below
\coordinate[below=3cm of prior_pos] (like_pos);
% Draw the picture
\draw (like_pos) pic (binom) {binom_dist};
% Add Labels for Likelihood
\node[below=0.5cm of like_pos, font=\bfseries] {Choice (Right)};
\node[right=1.2cm of like_pos, anchor=west, font=\small] {$\text{Right} \sim \text{Binomial}(\theta, n)$};
% --- SIDE: TRIALS (n) ---
% Just a text node feeding in
\node[left=1.5cm of like_pos, font=\bfseries] (n) {Trials ($n$)};
% --- EDGES ---
% 1. From Beta to Binomial (Stochastic)
% We start the arrow slightly below the top axis and end slightly above the bottom one
\draw[stochastic] (0, -0.2) -- (0, -2.2) node[midway, right] {$\sim$};
% 2. From Trials to Binomial (Deterministic/Parameter)
\draw[stochastic] (n) -- (-1.1, -3); % Connects to the side of the plot area
\end{tikzpicture}
```
We'll start with a simple case: an agent with `rate = 0.8` and `noise = 0`.
**Preparing Data for Stan**
Why a list? Well, dataframes (now tibbles) are amazing, but they have a big drawback: they require each variable to have the same length. Lists do not have that limitation; they are more flexible. So, lists. We'll have to learn how to live with them.
```{r 04-prep-stan-data}
# Subset data for one specific agent simulation (rate=0.8, noise=0)
d1 <- d %>% filter(true_rate == 0.8, true_noise == 0.0) # Make sure noise level matches
# Stan requires data in a list format
data_for_stan <- list(
n = trials, # Number of trials (integer)
h = d1$choice # Vector of choices (0s and 1s) for this agent
)
str(data_for_stan) # Show the structure of the list
```
### Model
We write the Stan code within the R code (so I can show it to you more easily), then we save it as a `.stan` file, which can be loaded at a later stage in order to compile it; compilation and fitting are covered just below.
Remember that the minimal Stan model requires three blocks: one specifying the data it will need as input; one specifying the parameters to be estimated; and one specifying the model within which the parameters appear, together with the priors for those parameters.
```{r ch05_defining_the_biased_model}
# This R chunk defines the Stan code as a string and writes it to a file,
# so the next chunk can compile and fit it.
# The Stan code itself is the important content here.
stan_model_code <- "
// Stan Model: Simple Bernoulli for Bias Estimation
// Goal: Estimate the underlying probability (theta) of choosing 1 ('right')
// given a sequence of binary choices.
// 1. Data Block: Declares the data Stan expects from R
data {
int<lower=1> n; // Number of trials (must be at least 1)
array[n] int<lower=0, upper=1> h; // Array 'h' of length 'n' containing choices (0 or 1)
}
// 2. Parameters Block: Declares the parameters the model will estimate
parameters {
real<lower=0, upper=1> theta; // The bias parameter (probability), constrained between 0 and 1
}
// 3. Model Block: Defines the priors and the data model (used as likelihood during inference)
model {
// Prior: Our belief about theta *before* seeing the data.
// We use a Beta(1, 1) prior, which is equivalent to a Uniform(0, 1) distribution.
// This represents maximal prior ignorance about the bias.
// 'target +=' adds the log-probability density to the overall model log-probability.
target += beta_lpdf(theta | 1, 1); // lpdf = log probability density function
// Likelihood: How the data 'h' depend on the parameter 'theta'.
// We model each choice 'h' as a Bernoulli trial with success probability 'theta'.
// The model assesses how likely the observed sequence 'h' is given a value of 'theta'.
target += bernoulli_lpmf(h | theta); // lpmf = log probability mass function (for discrete data)
}
// 4. Generated Quantities Block (Optional but useful)
// Code here is executed *after* sampling, using the estimated parameter values.
// Useful for calculating derived quantities or predictions.
generated quantities {
// let's generate the prior for visualization
// first we define it
real<lower = 0, upper = 1> prior_theta;
// then we generate it
prior_theta = beta_rng(1, 1); // Draw a random sample from the prior distribution for theta
}
"
# Define path for the Stan model file (anchored at the project root via here())
stan_model_dir <- here("stan")
stan_file_bernoulli <- file.path(stan_model_dir, "ch06_SimpleBernoulli.stan")
# Create directory if needed
if (!dir.exists(stan_model_dir)) dir.create(stan_model_dir)
# Write the Stan code to the file
writeLines(stan_model_code, stan_file_bernoulli)
cat("Stan model written to:", stan_file_bernoulli, "\n")
```
**Compiling and Fitting the Stan Model**
Now we use `cmdstanr` to compile this model (translating it into efficient C++ code) and then run the MCMC sampler to get posterior estimates for `theta`.
```{r 04-fit-the-biased-model}
## --- Compile and Fit ---
# Specify model file path
stan_file_bernoulli <- here("stan", "ch06_SimpleBernoulli.stan")
# Define path for saving fitted model object
stan_results_dir <- here("simmodels")
model_file_bernoulli <- file.path(stan_results_dir, "ch06_SimpleBernoulli.rds")
if (!dir.exists(stan_results_dir)) dir.create(stan_results_dir)
if (regenerate_simulations || !file.exists(model_file_bernoulli)) {
cat("Compiling and fitting Stan model (Bernoulli)...\n")
# Compile the Stan model (only needs to be done once unless code changes)
mod_bernoulli <- cmdstan_model(stan_file_bernoulli,
                                 cpp_options = list(stan_threads = FALSE), # No within-chain threading (not needed for this small model)
                                 stanc_options = list("O1")) # Basic compiler optimization
# Sample from the posterior distribution using MCMC
fit_bernoulli <- mod_bernoulli$sample(
data = data_for_stan, # The data list we prepared
seed = 123, # For reproducible MCMC sampling
chains = 4, # Number of parallel Markov chains (recommend 4)
parallel_chains = 4, # Run chains in parallel
#threads_per_chain = 1, # For within-chain parallelization (usually 1 is fine)
iter_warmup = 1000, # Number of warmup iterations per chain (discarded)
iter_sampling = 2000, # Number of sampling iterations per chain (kept)
refresh = 0, # How often to print progress updates. 0 means never since the model is so fast
max_treedepth = 10, # Controls complexity of MCMC steps (adjust if needed)
adapt_delta = 0.8 # Target acceptance rate (adjust if divergences occur)
)
# Save the fitted model object
fit_bernoulli$save_object(file = model_file_bernoulli)
cat("Model fit completed and saved to:", model_file_bernoulli, "\n")
} else {
# Load existing fitted model object
fit_bernoulli <- readRDS(model_file_bernoulli)
cat("Loaded existing model fit from:", model_file_bernoulli, "\n")
}
```
### Assessing MCMC Health (The Mandatory Diagnostics)
> *Why check diagnostics before looking at the estimates?* The posterior
> distribution Stan returns is only valid if the sampler explored the parameter
> space correctly. If it did not — due to divergent transitions, poor mixing, or
> insufficient samples — the numbers in the posterior are artifacts of the
> algorithm, not properties of the model. Looking at estimates from a broken
> sampler is like reading a map drawn by a robot that kept falling over: the map
> looks plausible but cannot be trusted. Diagnostics are the check that the robot
> walked properly before you navigate with its map.
Before we look at the parameter estimates, we MUST check whether the model was able to sample the posterior space: the health of the MCMC chains. Fitting a model is like sending a robotic explorer (the sampler) to map a landscape (the posterior). If the robot's tracks are broken, the map is mathematically invalid. We must check:
1. **Divergences:** Did the simulation crash? (Zero tolerance).
2. **$\hat{R}$ (R-hat):** Did our 4 independent chains map the exact same space? (Must be strictly $< 1.01$).
3. **Effective Sample Size (ESS):** Do we have enough *truly independent* draws? (Bulk and Tail ESS should be $> 400$).
4. **Visual Diagnostics:** Trace plots (should look like a "hairy caterpillar") and Rank Histograms (should be uniform across chains).
Let's run this mandatory diagnostic battery on our first fit:
```{r 04-biased-model-quality}
# 1. Check for Divergent Transitions
diagnostics <- fit_bernoulli$diagnostic_summary()
cat("Divergent transitions per chain:", paste(diagnostics$num_divergent, collapse = ", "), "\n")
# 2. Check R-hat and ESS metrics
fit_summary <- fit_bernoulli$summary(variables = "theta")
print(fit_summary %>% dplyr::select(variable, mean, rhat, ess_bulk, ess_tail))
# 3. Visual Checks (Trace and Rank Plots)
draws_df <- as_draws_df(fit_bernoulli$draws())
# A healthy trace plot is well-mixed with no wandering
p_trace <- mcmc_trace(draws_df, pars = "theta") +
theme_classic() +
ggtitle("Trace Plot",
subtitle = "Healthy = all chains overlap and fluctuate around the same stable band")
# A healthy rank histogram has uniform bars across all chains
p_rank <- mcmc_rank_overlay(draws_df, pars = "theta", ref_line = TRUE) +
  theme_classic() +
  ggtitle("Rank Plot", subtitle = "Healthy = chains interleave uniformly around the reference line — no chain dominates any rank band")
p_trace / p_rank
# 4. Look at the Posterior Estimate
ggplot(draws_df) +
geom_density(aes(theta, fill = "Posterior"), alpha = 0.6) +
geom_density(aes(prior_theta, fill = "Prior"), alpha = 0.6) +
geom_vline(xintercept = 0.8, linetype = "dashed", color = "black", linewidth = 1.2) +
scale_fill_manual(values = c("Posterior" = "blue", "Prior" = "red")) +
labs(
title = "Prior-Posterior Update: Bias Parameter (θ)",
subtitle = "Healthy = posterior (blue) is narrower than prior (red) and shifted toward the true value (dashed)",
x = "θ (Probability of Choosing Right)",
y = "Density",
fill = "Distribution"
) +
theme_classic()
```
The posterior (blue) is substantially narrower than the prior (red) and has shifted toward the true value of 0.8 (dashed line). The posterior mean is approximately 0.755, not exactly 0.8 — this is expected with only 120 trials and is not a sign that something is wrong. What matters is that (a) the posterior is much more informative than the prior, confirming the data were indeed informative, and (b) the true value 0.8 sits comfortably within the credible interval. If the posterior were identical to the prior, the data would be telling us nothing; if the posterior were centered far from the true value and excluded it, the model would be misspecified.
Now we build the same model, but using the log-odds scale for the theta parameter. This will become useful later when we condition theta on other variables and build multilevel models: we can add whatever linear effects we like on the log-odds scale, and the implied probability will always stay bounded between 0 and 1.
**Alternative Parameterization: Log-Odds Scale**
While estimating theta directly as a probability (0-1) is intuitive, it can sometimes be computationally advantageous or necessary for more complex models (like multilevel models in a later chapter) to estimate parameters on an unbounded scale.
The logit (log-odds) transformation converts a probability $p$ (from 0 to 1) to log-odds (from $-\infty$ to $+\infty$): $$\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$$
The inverse transformation (logistic function or inv_logit) converts log-odds back to probability: $$\text{inv\_logit}(x) = \frac{1}{1 + \exp(-x)}$$
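In R, these two transformations are available as `qlogis()` (logit) and `plogis()` (inverse logit), which we have already used in the simulation code:

```r
# Round trip: probability -> log-odds -> probability
p_demo <- c(0.1, 0.5, 0.8)
(lo_demo <- qlogis(p_demo))  # logit: -2.197, 0.000, 1.386
plogis(lo_demo)              # inverse logit: back to 0.1, 0.5, 0.8
```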
We can rewrite our Stan model to estimate theta on the log-odds scale. However, we must be careful with our priors, as the non-linear transformation distorts the probability mass (the short simulation after this list makes the distortion visible):
* Too Narrow: A standard normal prior $\lambda \sim \text{Normal}(0, 1)$ concentrates mass around $p=0.5$, biasing the model toward chance.
* Too Wide: A wide prior (e.g., $\lambda \sim \text{Normal}(0, 5)$) pushes mass to the extremes (0 and 1), implying we expect deterministic behavior.
* Just Right: To approximate a uniform distribution on the probability scale (like our Beta(1,1)), we typically use $\lambda \sim \text{Normal}(0, 1.5)$. This is the specific "sweet spot" that is essentially flat, avoiding both the center-bias of narrow priors and the extreme-bias of wide priors.
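To see the distortion directly, here is the quick pushforward simulation promised above (an illustrative sketch, not part of the fitting pipeline): sample each candidate prior on the log-odds scale, map the draws through `plogis()`, and inspect the implied prior on the probability scale.

```r
# Pushforward check: what do Normal(0, sd) priors on the log-odds scale
# imply about theta on the probability scale?
prior_pushforward <- tibble(sd = c(1, 1.5, 5)) %>%
  mutate(draws = map(sd, ~ plogis(rnorm(1e5, 0, .x)))) %>%
  unnest(draws) %>%
  mutate(prior = paste0("Normal(0, ", sd, ")"))

ggplot(prior_pushforward, aes(draws, after_stat(density))) +
  geom_histogram(bins = 50, fill = "steelblue") +
  facet_wrap(~prior) +
  labs(x = "Implied prior on theta (probability scale)", y = "Density") +
  theme_classic()
```

Normal(0, 1) piles mass around 0.5, Normal(0, 5) piles it at the extremes, and Normal(0, 1.5) is close to flat.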
```{r 04-define-stan-model-logit}
# Stan code using logit parameterization
stan_model_logit_code <- "
// Stan Model: Simple Bernoulli (Logit Parameterization)
// Estimate theta on the log-odds scale
data {
int<lower=1> n;
array[n] int<lower=0, upper=1> h;
}
parameters {
real theta_logit; // Parameter is now on the unbounded log-odds scale
}
model {
// Prior on log-odds scale (e.g., Normal(0, 1.5))
// Normal(0, 1.5) on log-odds corresponds roughly to a diffuse prior on probability scale
target += normal_lpdf(theta_logit | 0, 1.5);
// Likelihood using the logit version of the Bernoulli PMF
// This tells Stan that theta_logit is on the log-odds scale
target += bernoulli_logit_lpmf(h | theta_logit);
}
generated quantities {
// Generate the prior for viz purposes
real prior_theta = inv_logit(normal_rng(0, 1.5)); // Sample from the log-odds prior
// Convert estimate back to probability scale for easier interpretation
real<lower=0, upper=1> theta = inv_logit(theta_logit);
}
"
# Define file path
stan_file_logit <- file.path(stan_model_dir, "ch06_SimpleBernoulli_logodds.stan")
# Write the Stan code to the file
writeLines(stan_model_logit_code, stan_file_logit)
cat("Stan model (logit) written to:", stan_file_logit, "\n")
```
::: {.callout-tip title="Stan Efficiency — Sufficient Statistics: Aggregate IID Trials"}
When all trials are **independently and identically distributed** (no memory, no sequential
dependency, one $\theta$ for the whole session), you can replace a loop of $N$ Bernoulli calls
with a single Binomial call on the **sufficient statistic** — the total number of successes.
```stan
// Slower: N separate Bernoulli evaluations (one autodiff node each)
for (i in 1:N)
target += bernoulli_logit_lpmf(h[i] | theta_logit);
// Faster: one Binomial evaluation — mathematically identical for IID data
target += binomial_logit_lpmf(sum(h) | N, theta_logit);
```
Pass `sum(h)` and `N` as integers rather than the full array — Stan never sees the
individual trial values.
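A quick R sanity check of the "mathematically identical" claim (ours, for illustration): the two log-likelihoods differ only by the log binomial coefficient, a constant that does not involve the parameter and therefore cancels when the posterior is normalized.

```r
# Bernoulli-sum vs. Binomial log-likelihood: equal up to log-choose(N, k),
# which does not depend on the parameter.
set.seed(1)
h_demo <- rbinom(20, 1, 0.7)
p_test <- 0.6
sum(dbinom(h_demo, size = 1, prob = p_test, log = TRUE))
dbinom(sum(h_demo), size = 20, prob = p_test, log = TRUE) - lchoose(20, sum(h_demo))
# Both lines print the same value
```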
**When this applies:** single-subject models with one rate parameter per session (Chapters 4–5).
**When it does not apply:** any model where the likelihood varies by trial — memory
models, multilevel models with per-subject parameters indexed by `subj_id`, sequential
belief-updating models. For those, the per-trial loop is necessary.
:::
```{r ch04_fit_logit_model}
## With the logit format
## Specify where the model is
file <- here("stan", "ch06_SimpleBernoulli_logodds.stan")
# File path for saved model
model_file <- here("simmodels","ch06_SimpleBernoulli_logodds.rds")
# Check if we need to rerun the simulation
if (regenerate_simulations || !file.exists(model_file)) {
# Compile the model
mod <- cmdstan_model(file,
cpp_options = list(stan_threads = FALSE),
stanc_options = list("O1"))
# The following command calls Stan with specific options.
samples_biased_logodds <- mod$sample(
data = data_for_stan,
seed = 123,
chains = 2,
parallel_chains = 2,
#threads_per_chain = 1,
iter_warmup = 1000,
iter_sampling = 2000,
refresh = 0,
max_treedepth = 20,
adapt_delta = 0.99,
)
# Save the fitted model
samples_biased_logodds$save_object(file = model_file)
cat("Generated new model fit and saved to", model_file, "\n")
} else {
# Load existing results
cat("Loading biased model (log-odds) samples...\n")
samples_biased_logodds <- readRDS(model_file)
cat("Available parameters:", paste(colnames(as_draws_df(samples_biased_logodds$draws())), collapse = ", "), "\n")
cat("Loaded existing model fit from", model_file, "\n")
}
```
### Summarizing the results
```{r 04-biased-model-log-odds-quality-assessment}
# Summary
samples_biased_logodds$summary()
# Extract posterior samples and include sampling of the prior:
draws_df_biased_logodds <- as_draws_df(samples_biased_logodds$draws())
# Explicitly extract theta parameter
theta_param_logodds <- draws_df_biased_logodds$theta
cat("Successfully extracted theta parameter with", length(theta_param_logodds), "values\n")
ggplot(draws_df_biased_logodds, aes(.iteration, theta, group = .chain, color = .chain)) +
geom_line() +
theme_classic()
# Now let's plot the density for theta (prior and posterior)
ggplot(draws_df_biased_logodds) +
geom_density(aes(theta), fill = "blue", alpha = 0.8) +
geom_density(aes(prior_theta), fill = "red", alpha = 0.8) +
  geom_vline(xintercept = 0.8, linetype = "dashed", color = "black", linewidth = 1.5) +
xlab("Rate") +
ylab("Posterior Density") +
theme_classic()
```
We can see that the results are very similar: the posterior for `theta`, mapped back to the probability scale, essentially matches the direct parameterization. The reparameterization changes the computation, not the inference.
## Bonus: Pathfinder Variational Inference vs. Full MCMC
::: {.callout-tip title="Stan Optimization: Pathfinder as Warm-Start Initialization"}
Stan's default NUTS-HMC sampler draws posterior samples by simulating a Hamiltonian dynamical system. During **warm-up**, the sampler adapts its step size and metric — a phase that can take hundreds of iterations if chains start far from the posterior mass. **Pathfinder** [@zhang2022pathfinder] is a fast variational approximation that traces the L-BFGS optimization path from an initial point to the posterior mode and fits a series of multivariate normal approximations along the way. It is not a gold-standard posterior, but it tells HMC *roughly where the posterior lives* before NUTS begins.
Three strategies, in order of increasing accuracy and cost:
| Strategy | When to use | Typical speed |
|---|---|---|
| **Pathfinder alone** | Quick exploration, prior sensitivity checks | ~10–50× faster than NUTS |
| **NUTS with default init** | Standard analysis | Baseline |
| **Pathfinder → NUTS warm-start** | Hierarchical or complex models where warm-up is slow | 20–60% faster than default NUTS |
The warm-start trick: run `mod$pathfinder(...)` to get a posterior approximation, then pass it as `init` to `mod$sample(...)`. This replaces the random Uniform(−2, 2) starting points with draws from the Pathfinder approximation, letting the sampler skip most of the warm-up trajectory. We will use this pattern throughout the book for models with challenging geometry or many parameters.
**When Pathfinder fails:** Near-degenerate posteriors (e.g., hard funnels, flat ridges) can cause Pathfinder's variational approximation to collapse. Always wrap it in `tryCatch(..., error = function(e) NULL)` and fall back to explicit per-chain inits centered on prior medians if it returns `NULL`.
:::
```{r ch05_intro_to_pathfinder}
## Bonus content: Benchmarking NUTS, Pathfinder, and Pathfinder-initialized NUTS
# We will compare three methods:
# 1. Standard NUTS (MCMC)
# 2. Pathfinder (Variational Inference)
# 3. NUTS initialized with Pathfinder
# Define file paths for all three models
file_nuts_std <- here("simmodels", "ch06_benchmark_nuts_std.rds")
file_pathfinder <- here("simmodels", "ch06_benchmark_pathfinder.rds")
file_nuts_init <- here("simmodels", "ch06_benchmark_nuts_init.rds")
stan_file_logit <- here::here("stan", "ch06_SimpleBernoulli_logodds.stan")
mod <- cmdstan_model(
stan_file_logit,
cpp_options = list(stan_threads = FALSE),
stanc_options = list("O1")
)
# --- 1. Run Standard NUTS (Baseline) ---
if (regenerate_simulations || !file.exists(file_nuts_std)) {
cat("Running Standard NUTS...\n")
fit_nuts_std <- mod$sample(
data = data_for_stan,
seed = 123,
chains = 4,
parallel_chains = 4,
iter_warmup = 1000, # Standard warmup
iter_sampling = 2000,
refresh = 0
)
fit_nuts_std$save_object(file_nuts_std)
} else {
fit_nuts_std <- readRDS(file_nuts_std)
}
# --- 2. Run Pathfinder ---
if (regenerate_simulations || !file.exists(file_pathfinder)) {
cat("Running Pathfinder...\n")
fit_pathfinder <- mod$pathfinder(
data = data_for_stan,
seed = 123,
draws = 4000,
num_paths = 4,
single_path_draws = 1000,
history_size = 50
)
fit_pathfinder$save_object(file_pathfinder)
} else {
fit_pathfinder <- readRDS(file_pathfinder)
}
# --- 3. Run NUTS initialized with Pathfinder ---
if (regenerate_simulations || !file.exists(file_nuts_init)) {
cat("Running NUTS initialized with Pathfinder...\n")
fit_nuts_init <- mod$sample(
data = data_for_stan,
seed = 123,
chains = 4,
parallel_chains = 4,
iter_warmup = 500, # Reduced warmup allowed by better init!
iter_sampling = 2000,
init = fit_pathfinder, # <--- The Magic Line
refresh = 0
)
fit_nuts_init$save_object(file_nuts_init)
} else {
fit_nuts_init <- readRDS(file_nuts_init)
}
# --- COMPARISON: TIMING ---
# Extract execution times (in seconds)
# Note: $time()$total gives the wall clock time for the computation
time_nuts_std <- fit_nuts_std$time()$total
time_pathfinder <- fit_pathfinder$time()$total
# For the initialized model, the total workflow time is Pathfinder + NUTS
time_nuts_init <- time_pathfinder + fit_nuts_init$time()$total
# Create a summary table
benchmarks <- tibble(
Method = c("Standard NUTS", "Pathfinder", "Pathfinder + NUTS"),
Time_Sec = c(time_nuts_std, time_pathfinder, time_nuts_init),
Speedup_vs_Std = time_nuts_std / Time_Sec
)
print(benchmarks)
# --- COMPARISON: POSTERIORS ---
# Extract draws
draws_std <- as_draws_df(fit_nuts_std$draws("theta")) %>% mutate(Method = "Standard NUTS")
draws_path <- as_draws_df(fit_pathfinder$draws("theta")) %>% mutate(Method = "Pathfinder")
draws_init <- as_draws_df(fit_nuts_init$draws("theta")) %>% mutate(Method = "Pathfinder + NUTS")
# Combine
all_draws <- bind_rows(draws_std, draws_path, draws_init)
# Plot
ggplot(all_draws, aes(x = theta, fill = Method)) +
geom_density(alpha = 0.4) +
geom_vline(xintercept = 0.8, linetype = "dashed") +
scale_fill_viridis_d() +
labs(
title = "Posterior Comparison",
subtitle = "Pathfinder approximates well; Initialized NUTS refines it faster than Standard NUTS",
x = "Theta (Rate)",
y = "Density"
) +
theme_classic()
```
## Moving Beyond Simple Bias: Memory Models {#sec-memory-models}
The simple biased agent model assumes choices are independent over time, influenced only by a fixed `theta`. However, behavior is often history-dependent. Let's explore models where the choice probability `theta` on trial *t* depends on previous events.
### Memory Model 1: GLM-like Approach (External Predictor)
One way to incorporate memory is to treat a summary of past events as an external predictor influencing the current choice probability, similar to a predictor in a Generalized Linear Model (GLM).
Let's assume the choice probability `theta` depends on the **cumulative rate of the opponent's 'right' choices** observed up to the *previous* trial. To make the variable more intuitive, we transform the previous rate, which is bounded in 0-1 probability space, into log-odds via a logit link. This way, a previous rate with more left than right choices yields a negative value, decreasing our propensity to choose right, while one with more right than left choices yields a positive value, increasing our propensity to choose right.
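A one-liner makes the sign logic concrete:

```r
# qlogis() maps rates below 0.5 to negative log-odds and rates above 0.5 to
# positive log-odds, so the sign encodes the direction of the push.
qlogis(c(0.2, 0.5, 0.8))
#> [1] -1.386294  0.000000  1.386294
```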
```{r ch05_generate_data_for_the_memory_agent_model}
# We subset to only include no noise and a specific rate
d1 <- d %>%
subset(true_noise == 0 & true_rate == 0.8) %>%
rename(Other = choice) %>%
mutate(cumulativerate = lag(cumulative_rate, 1))
d1$cumulativerate[1] <- 0.5 # no prior info at first trial
d1$cumulativerate[d1$cumulativerate == 0] <- 0.01
d1$cumulativerate[d1$cumulativerate == 1] <- 0.99
# Now we create the memory agent with a coefficient of 1 (in log odds)
MemoryAgent_f <- function(bias, beta, cumulativerate){
choice = rbinom(1, 1, plogis(bias + beta * qlogis(cumulativerate)))
return(choice)
}
d1$Self <- NA_integer_ # pre-allocate the agent's own choices
d1$Self[1] <- RandomAgentNoise_f(0, 0) # 0 on the log-odds scale = rate of 0.5
for (i in 2:trials) {
d1$Self[i] <- MemoryAgent_f(bias = 0, beta = 1, d1$cumulativerate[i])
}
## Create the data
data_memory <- list(
n = 120,
h = d1$Self,
  memory = d1$cumulativerate # new input: the opponent's cumulative rate of 'right' choices so far (probability scale; logit-transformed inside the model)
)
```
```{r ch05_memory_stan_model}
stan_model <- "
// The input (data) for the model. n of trials and h for (right and left) hand
data {
int<lower=1> n;
array[n] int h;
vector[n] memory; // here we add the new variable between 0.01 and .99
}
// The parameters accepted by the model.
parameters {
real bias; // how likely is the agent to pick right when the previous rate has no information (50-50)?
real beta; // how strongly is previous rate impacting the decision?
}
// The model to be estimated.
model {
// priors
target += normal_lpdf(bias | 0, .3);
target += normal_lpdf(beta | 0, .5);
// model
target += bernoulli_logit_lpmf(h | bias + beta * logit(memory));
}
generated quantities {
// Generate prior samples for Prior-Posterior Update checks
real bias_prior = normal_rng(0, 0.3);
real beta_prior = normal_rng(0, 0.5);
}
"
write_stan_file(
stan_model,
  dir = here("stan"),
basename = "ch06_MemoryBernoulli.stan")
## Specify where the model is
file <- here("stan","ch06_MemoryBernoulli.stan")
# File path for saved model
model_file_memory <- here("simmodels", "ch06_MemoryBernoulli.rds")
# Check if we need to rerun the simulation
if (regenerate_simulations || !file.exists(model_file_memory)) {
# Compile the model
mod_memory <- cmdstan_model(file)
# The following command calls Stan with specific options.
samples_memory <- mod_memory$sample(
data = data_memory,
seed = 123,
chains = 2,
parallel_chains = 2,
threads_per_chain = 1,
iter_warmup = 1000,
iter_sampling = 1000,
refresh = 0,
max_treedepth = 20,
adapt_delta = 0.99,
)
# Save the fitted model
samples_memory$save_object(file = model_file_memory)
cat("Generated new model fit and saved to", model_file_memory, "\n")
} else {
# Load existing results
cat("Loading memory model samples...\n")
samples_memory <- readRDS(model_file_memory)
cat("Available parameters:", paste(colnames(as_draws_df(samples_memory$draws())), collapse=", "), "\n")
cat("Loaded existing model fit from", model_file_memory, "\n")
}
```
### Summarizing the results
```{r 04-memory-model-quality-assessment}
# Check if samples_memory exists
if (!exists("samples_memory")) {
cat("Loading memory model samples...\n")
samples_memory <- readRDS(here("simmodels", "ch06_MemoryBernoulli.rds"))
cat("Available parameters:", paste(colnames(as_draws_df(samples_memory$draws())), collapse = ", "), "\n")
}
# Extract posterior samples and include sampling of the prior:
draws_df_memory <- as_draws_df(samples_memory$draws())
# Explicitly extract parameters
bias_param <- draws_df_memory$bias
beta_param <- draws_df_memory$beta
cat("Successfully extracted", length(bias_param), "values for bias parameter\n")
cat("Successfully extracted", length(beta_param), "values for beta parameter\n")
# Trace plot for bias
ggplot(draws_df_memory, aes(.iteration, bias, group = .chain, color = .chain)) +
geom_line() +
labs(title = "Trace plot for bias parameter") +
theme_classic()
# Trace plot for beta
ggplot(draws_df_memory, aes(.iteration, beta, group = .chain, color = .chain)) +
geom_line() +
labs(title = "Trace plot for beta parameter") +
theme_classic()
# Now let's plot the density for bias (prior and posterior)
ggplot(draws_df_memory) +
geom_density(aes(bias), fill = "blue", alpha = 0.3) +
geom_density(aes(bias_prior), fill = "red", alpha = 0.3) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 1.5) +
labs(title = "Prior-Posterior Update for Bias Parameter",
subtitle = "Blue: posterior, Red: prior, Dashed: true value") +
xlab("Bias") +
ylab("Posterior Density") +
theme_classic()
# Now let's plot the density for beta (prior and posterior)
ggplot(draws_df_memory) +
geom_density(aes(beta), fill = "blue", alpha = 0.3) +
geom_density(aes(beta_prior), fill = "red", alpha = 0.3) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "black", linewidth = 1.5) +
labs(title = "Prior-Posterior Update for Beta Parameter",
subtitle = "Blue: posterior, Red: prior, Dashed: true value") +
xlab("Beta") +
ylab("Posterior Density") +
theme_classic()
# Print summary
samples_memory$summary()
```
The trace plots should show the same "hairy caterpillar" pattern as before — well-mixed chains with no systematic drift. The prior-posterior update plots tell the story: the posterior for `bias` should be narrow and centered near 0 (log-odds of 0.5 = unbiased), and the posterior for `beta` should be centered near 1 (the true generating value). The data have meaningfully updated both parameters away from their priors. Formal quality checks — prior predictive checks, systematic parameter recovery, and SBC — come in Ch. 6.
### Memory Model 2: Internal State Variable
Instead of feeding memory as external data, we can model it as an internal state that updates within the model.
For this specific model — where memory is simply a running average of the opponent's choices — the update rule depends only on the observed data (`other`), not on any parameter being estimated. We should therefore calculate it in Stan's `transformed data` block.
::: {.callout-tip title="Stan Efficiency Rule 1 — Block Placement: Run it Once, Not Thousands of Times"}
Every Stan model has blocks that execute at very different frequencies. Choosing the **right block** is the most impactful single optimization:
| Block | Executes | Use for |
|---|---|---|
| `transformed data` | **Once**, before sampling | Any quantity derived **only from data** — precomputed distances, normalized inputs, log-transforms of fixed constants |
| `transformed parameters` | **Every leapfrog step** | Deterministic functions of parameters whose gradients must flow backward — recursive states, reparameterizations, predicted values used in the likelihood |
| `model` (local `{}` block) | Every leapfrog step, **not saved** | Scratch variables for accumulating `target +=`; no overhead from the saved-draw tape |
| `generated quantities` | **Once per saved draw** (post-warmup only) | PPCs, log-likelihoods, prior samples — expensive quantities you only need in the final posterior |
**Concrete cost of misplacement.** A typical fit runs 4 chains × 1000 warmup × ~6 leapfrog steps per draw = ~24 000 leapfrog evaluations. Any computation in `transformed parameters` runs ~24 000 times. The same computation in `transformed data` runs exactly once. For a model with 500 trials, precomputing 500 memory values in `transformed data` rather than `transformed parameters` saves ~12 million floating-point operations per fit.
**The decision tree:**
1. Does this depend **only on data** (not on any parameter)?
→ `transformed data`
2. Does this depend on **parameters** and does the likelihood or another parameter need it?
→ `transformed parameters`
3. Is this needed **only after sampling** (PPC draws, LOO log-likelihoods)?
   → `generated quantities`, gated by `run_diagnostics` (introduced in the memory model below; it becomes essential for SBC speed in Ch. 6)
4. Is this just a temporary accumulator inside a loop?
→ Local variable inside `{}` in `model` or `transformed parameters`
In the memory model below, `memory` depends only on `other` (data) → `transformed data`. In the forgetting model that follows, `memory` depends on the free parameter `forgetting` → local variable inside `model`.
**Corollary — pre-compute in R when possible.** If a quantity depends only on data and will
never change between fits, computing it in R and passing it as a data field is even faster than
`transformed data`: Stan never sees the raw computation at all. For example, the multilevel
memory model in Chapter 6 pre-computes the running opponent rate in R and passes `opp_rate_prev`
as a vector, rather than recomputing it inside Stan on every chain.
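A minimal sketch of that R-side pre-computation (hypothetical data frame and column names, for illustration only):

```r
# Running opponent rate computed once in R, then passed to Stan as data.
# cummean() gives the running average; lag() shifts it so trial t only sees
# trials 1..(t-1); 0.5 seeds the first trial; pmin/pmax keep logit() finite.
df <- tibble(other = rbinom(10, 1, 0.8)) %>%
  mutate(opp_rate_prev = pmin(pmax(lag(cummean(other), default = 0.5), 0.01), 0.99))
```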
:::
::: {.callout-tip title="Stan Efficiency — Vectorize the Likelihood Call"}
Stan evaluates each `~` or `target +=` statement as a separate autodiff node.
Replacing a trial-level loop with a **vectorized distribution call** collapses $N$ nodes
into one, reducing autodiff graph size and improving cache locality.
```stan
// Slower: N separate nodes in the autodiff graph
for (i in 1:N)
target += bernoulli_logit_lpmf(h[i] | theta_logit);
// Faster: one vectorized call — h and theta_logit are both length-N vectors
target += bernoulli_logit_lpmf(h | theta_logit);
```
The vectorized form works whenever `theta_logit` is either a **scalar** (broadcasts to all
trials) or a **vector of the same length as `h`**. In the multilevel model of Chapter 6,
`target += bernoulli_logit_lpmf(y | theta_logit[subj_id])` vectorizes the full N_total
trial loop by scatter-indexing the per-subject parameters.
**Caveat:** vectorization is not available for mixture likelihoods (`log_mix`) because each
observation requires its own separate `log_mix` call. The loop is unavoidable there (Ch. 8).
:::
```{r ch05_internal_memory_stan_model}
## Create the data — run_diagnostics = 1 for the single validation fit
data <- list(
n = 120,
h = d1$Self,
other = d1$Other,
run_diagnostics = 1L
)
stan_model <- "
// Memory-based choice model with prior and posterior predictions.
// memory[] is data-derived and lives in transformed data (computed once).
// run_diagnostics gates expensive generated quantities for SBC speed.
data {
int<lower=1> n;
array[n] int h;
array[n] int other;
int<lower=0, upper=1> run_diagnostics; // 1 = compute PPC + log_lik; 0 = skip (use for SBC)
}
transformed data {
// memory depends only on other[] (data), so it runs ONCE here.
// Moving this to transformed parameters would recompute it ~24 000 times per fit.
vector[n] memory;
memory[1] = 0.5;
for (trial in 1:(n-1)) {
memory[trial + 1] = memory[trial] + ((other[trial] - memory[trial]) / (trial + 1));
if (memory[trial + 1] < 0.01) memory[trial + 1] = 0.01;
if (memory[trial + 1] > 0.99) memory[trial + 1] = 0.99;
}
}
parameters {
real bias;
real beta;
}
model {
target += normal_lpdf(bias | 0, .3);
target += normal_lpdf(beta | 0, .5);
// Vectorized likelihood — no loop needed because memory is precomputed.
target += bernoulli_logit_lpmf(h | bias + beta * logit(memory));
}
generated quantities {
// Prior samples are always generated (cheap).
real bias_prior = normal_rng(0, 0.3);
real beta_prior = normal_rng(0, 0.5);
// Expensive diagnostics are gated — set run_diagnostics = 0 during SBC.
vector[n] log_lik;
array[n] int y_rep;
if (run_diagnostics) {
for (trial in 1:n) {
log_lik[trial] = bernoulli_logit_lpmf(h[trial] | bias + beta * logit(memory[trial]));
y_rep[trial] = bernoulli_logit_rng(bias + beta * logit(memory[trial]));
}
}
}
"
write_stan_file(
stan_model,
  dir = here("stan"),
basename = "ch06_InternalMemory.stan")
## Specify where the model is
file <- here("stan", "ch06_InternalMemory.stan")
# File path for saved model
model_file <- here("simmodels", "ch06_InternalMemory.rds")
# Check if we need to rerun the simulation
if (regenerate_simulations || !file.exists(model_file)) {
# Compile the model
mod <- cmdstan_model(file,
cpp_options = list(stan_threads = FALSE),
stanc_options = list("O1"))
# The following command calls Stan with specific options.
samples_memory_internal <- mod$sample(
data = data,
seed = 123,
    chains = 2, # match parallel_chains; multiple chains are needed for convergence checks
parallel_chains = 2,
threads_per_chain = 1,
iter_warmup = 1000,
iter_sampling = 1000,
refresh = 0,
max_treedepth = 20,
adapt_delta = 0.99,
)
# Save the fitted model
samples_memory_internal$save_object(file = model_file)
cat("Generated new model fit and saved to", model_file, "\n")
} else {
# Load existing results
samples_memory_internal <- readRDS(model_file)
cat("Loaded existing model fit from", model_file, "\n")
}
draws_df <- as_draws_df(samples_memory_internal$draws())
# 1. Check chain convergence
# Plot traces for main parameters
mcmc_trace(draws_df, pars = c("bias", "beta")) +
theme_minimal() +
ggtitle("Parameter Traces Across Chains")
# Plot rank histograms to check mixing
mcmc_rank_hist(draws_df, pars = c("bias", "beta"))
# 2. Prior-Posterior Update Check
p1 <- ggplot() +
geom_density(data = draws_df, aes(bias, fill = "Posterior"), alpha = 0.5) +
geom_density(data = draws_df, aes(bias_prior, fill = "Prior"), alpha = 0.5) +
geom_vline(xintercept = 0, linetype = "dashed") +
scale_fill_manual(values = c("Prior" = "red", "Posterior" = "blue")) +
theme_minimal() +
ggtitle("Prior-Posterior Update: Bias Parameter")
p2 <- ggplot() +
geom_density(data = draws_df, aes(beta, fill = "Posterior"), alpha = 0.5) +
geom_density(data = draws_df, aes(beta_prior, fill = "Prior"), alpha = 0.5) +
geom_vline(xintercept = 1, linetype = "dashed") +
scale_fill_manual(values = c("Prior" = "red", "Posterior" = "blue")) +
theme_minimal() +
ggtitle("Prior-Posterior Update: Beta Parameter")
p3 <- ggplot() +
geom_point(data = draws_df, aes(bias, beta), alpha = 0.5) +
theme_minimal() +
ggtitle("Correlation")
p1 + p2 + p3
```
Now that we know how to model memory as an internal state, we can play with making the update discount the past, setting a parameter that indicates after how many trials memory is lost, etc.
### Memory Model 3: Exponential Forgetting (similar to RL)
A more cognitively plausible memory model incorporates forgetting, where recent events have more influence than distant ones. This can be implemented by adding a `forgetting` parameter (often called a learning rate, α, in RL) that controls the weight of the most recent outcome versus the previous memory state.
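Concretely, with forgetting rate $\alpha \in [0, 1]$, the update implemented in the Stan code below is a weighted average of the old memory and the newest observation:

$$\text{memory}_{t+1} = (1 - \alpha)\,\text{memory}_t + \alpha \cdot \text{other}_t$$

With $\alpha = 0$ the memory never updates; with $\alpha = 1$ only the most recent trial matters; intermediate values give exponentially decaying weights to older trials.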
```{r ch05_internal_memory_with_forgetting_stan_model}
stan_model <- "
// Memory model with exponential forgetting (learning-rate style).
// memory[] depends on the free parameter 'forgetting', so it cannot go in
// transformed data. It lives as a local vector inside the model block —
// computed at every leapfrog step but never saved to the posterior draws.
data {
int<lower=1> n;
array[n] int h;
array[n] int other;
int<lower=0, upper=1> run_diagnostics;
}
parameters {
real bias;
real beta;
real<lower=0, upper=1> forgetting;
}
model {
target += beta_lpdf(forgetting | 1, 1);
target += normal_lpdf(bias | 0, .3);
target += normal_lpdf(beta | 0, .5);
// memory is a LOCAL variable — declared inside model, so it lives on the
// stack and is discarded after each leapfrog step without touching the saved-
// draw tape. This is the right pattern when memory depends on a parameter.
vector[n] memory;
for (trial in 1:n) {
if (trial == 1) memory[trial] = 0.5;
target += bernoulli_logit_lpmf(h[trial] | bias + beta * logit(memory[trial]));
if (trial < n) {
memory[trial + 1] = (1 - forgetting) * memory[trial] + forgetting * other[trial];
if (memory[trial + 1] <= 0) memory[trial + 1] = 0.01;
if (memory[trial + 1] >= 1) memory[trial + 1] = 0.99;
}
}
}
generated quantities {
real bias_prior = normal_rng(0, 0.3);
real beta_prior = normal_rng(0, 0.5);
real forgetting_prior = beta_rng(1, 1);
vector[n] log_lik;
array[n] int y_rep;
if (run_diagnostics) {
// Recompute memory for the saved draw — runs once per posterior sample.
vector[n] memory_gq;
for (trial in 1:n) {
if (trial == 1) memory_gq[trial] = 0.5;
real lin = bias + beta * logit(memory_gq[trial]);
log_lik[trial] = bernoulli_logit_lpmf(h[trial] | lin);
y_rep[trial] = bernoulli_logit_rng(lin);
if (trial < n) {
memory_gq[trial + 1] = (1 - forgetting) * memory_gq[trial] + forgetting * other[trial];
if (memory_gq[trial + 1] <= 0) memory_gq[trial + 1] = 0.01;
if (memory_gq[trial + 1] >= 1) memory_gq[trial + 1] = 0.99;
}
}
}
}
"
write_stan_file(
stan_model,
  dir = here("stan"),
basename = "ch06_InternalMemory2.stan")
## Specify where the model is
file <- here("stan","ch06_InternalMemory2.stan")
# File path for saved model
model_file <- here("simmodels","ch06_InternalMemory2.rds")
# Check if we need to rerun the simulation
if (regenerate_simulations || !file.exists(model_file)) {
# Compile the model
mod <- cmdstan_model(file,
cpp_options = list(stan_threads = FALSE),
stanc_options = list("O1"))
data_forgetting <- modifyList(data, list(run_diagnostics = 1L))
# The following command calls Stan with specific options.
samples_memory_forgetting <- mod$sample(
data = data_forgetting,
seed = 123,
chains = 4,
parallel_chains = 4,
iter_warmup = 1000,
iter_sampling = 1000,
refresh = 0,
adapt_delta = 0.95
)
# Save the fitted model
samples_memory_forgetting$save_object(file = model_file)
cat("Generated new model fit and saved to", model_file, "\n")
} else {
# Load existing results
samples_memory_forgetting <- readRDS(model_file)
cat("Loaded existing model fit from", model_file, "\n")
}
samples_memory_forgetting$summary()
```
The memory model we've implemented can be seen as part of a broader family of models that track and update beliefs based on incoming evidence. Let's explore how it relates to some key frameworks.
### Memory Model 4: Bayesian Agent (Optimal Updating)
A fully Bayesian agent wouldn't just track the rate but maintain a full probability distribution (a Beta distribution) over the likely bias of the opponent, updating it optimally according to Bayes' rule after each observation.
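The conjugate Beta-Bernoulli update implemented below amounts to simple counting: each observed 'right' choice increments $\alpha$, each 'left' choice increments $\beta$, and the expected rate is the mean of the current Beta distribution:

$$\alpha_t = \alpha_{t-1} + \text{other}_{t-1}, \qquad \beta_t = \beta_{t-1} + (1 - \text{other}_{t-1}), \qquad \mathbb{E}[\theta_t] = \frac{\alpha_t}{\alpha_t + \beta_t}$$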
```{r ch05_bayesian_memory_stan_model}
stan_model <- "
data {
int<lower=1> n; // number of trials
array[n] int h; // agent's choices (0 or 1)
array[n] int other; // other player's choices (0 or 1)
}
parameters {
real<lower=0> alpha_prior; // Prior alpha parameter
real<lower=0> beta_prior; // Prior beta parameter
}
transformed parameters {
vector[n] alpha; // Alpha parameter at each trial
vector[n] beta; // Beta parameter at each trial
vector[n] rate; // Expected rate at each trial
// Initialize with prior
alpha[1] = alpha_prior;
beta[1] = beta_prior;
rate[1] = alpha[1] / (alpha[1] + beta[1]);
// Sequential updating of Beta distribution
for(t in 2:n) {
// Update Beta parameters based on previous observation
alpha[t] = alpha[t-1] + other[t-1];
beta[t] = beta[t-1] + (1 - other[t-1]);
// Calculate expected rate
rate[t] = alpha[t] / (alpha[t] + beta[t]);
}
}
model {
// Priors on hyperparameters
target += gamma_lpdf(alpha_prior | 2, 1);
target += gamma_lpdf(beta_prior | 2, 1);
// Agent's choices follow current rate estimates
for(t in 1:n) {
target += bernoulli_lpmf(h[t] | rate[t]);
}
}
generated quantities {
// We strictly reserve posterior predictive checks for Chapter 6.
// Here, we just calculate the initial expected rate for reference.
real initial_rate = alpha_prior / (alpha_prior + beta_prior);
}
"
write_stan_file(
  stan_model,
  dir = "stan/",
  basename = "ch06_BayesianMemory.stan")
## Specify where the model is
file <- here("stan","ch06_BayesianMemory.stan")
# File path for saved model
model_file <- here("simmodels","ch06_BayesianMemory.rds")
# Check if we need to rerun the simulation
if (regenerate_simulations || !file.exists(model_file)) {
# Compile the model
mod <- cmdstan_model(file)
# The following command calls Stan with specific options.
samples_memory_bayes <- mod$sample(
  data = data,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  refresh = 0,
  max_treedepth = 20,
  adapt_delta = 0.99
)
# Save the fitted model
samples_memory_bayes$save_object(file = model_file)
cat("Generated new model fit and saved to", model_file, "\n")
} else {
# Load existing results
samples_memory_bayes <- readRDS(model_file)
cat("Loaded existing model fit from", model_file, "\n")
}
samples_memory_bayes$summary()
# Extract draws
draws_df <- as_draws_df(samples_memory_bayes$draws())
# First, the marginal posteriors of the hyperparameters alpha_prior and
# beta_prior (the agent's initial Beta belief about the opponent)
ggplot(draws_df) +
  geom_density(aes(alpha_prior), fill = "blue", alpha = 0.3) +
  geom_density(aes(beta_prior), fill = "red", alpha = 0.3) +
  theme_classic() +
  labs(title = "Posterior of the Initial Belief Hyperparameters",
       x = "Parameter Value",
       y = "Density")
# Now let's look at how the rate evolves over trials
# First melt the rate values across trials into long format
rate_df <- draws_df %>%
  dplyr::select(starts_with("rate[")) %>%
  pivot_longer(everything(),
               names_to = "trial",
               values_to = "rate",
               names_pattern = "rate\\[(\\d+)\\]") %>%
  mutate(trial = as.numeric(trial))
# Calculate summary statistics for each trial
rate_summary <- rate_df %>%
  group_by(trial) %>%
  summarise(
    mean_rate = mean(rate),
    lower = quantile(rate, 0.025),
    upper = quantile(rate, 0.975)
  )
plot_data <- tibble(trial = seq_along(data$other), choices = data$other)
# Plot the evolution of rate estimates
ggplot(rate_summary, aes(x = trial)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  geom_line(aes(y = mean_rate), color = "blue") +
  # Overlay the opponent's actual choices
  geom_line(data = plot_data,
            aes(x = trial, y = choices), color = "orange", alpha = 0.5) +
  theme_classic() +
  labs(title = "Evolution of Rate Estimates",
       x = "Trial",
       y = "Rate",
       subtitle = "Blue line: posterior mean, grey band: 95% credible interval") +
  ylim(0, 1)
# Finally, the joint posterior of the two hyperparameters
ggplot(draws_df) +
  geom_point(aes(alpha_prior, beta_prior), alpha = 0.1) +
  theme_classic() +
  labs(title = "Joint Posterior of alpha_prior and beta_prior",
       x = "alpha_prior",
       y = "beta_prior")
```
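As a sanity check on that recursion, note that sequential conjugate updating is just counting: after $t-1$ observations, $\alpha_t$ equals the initial $\alpha$ plus the number of 1s seen so far, and $\beta_t$ likewise counts the 0s. A minimal, self-contained sketch (the initial belief and opponent sequence below are made up for illustration):

```{r ch05_beta_update_check}
# The sequential Beta update in 'transformed parameters' equals direct counting.
set.seed(123)
alpha0 <- 2; beta0 <- 2                   # hypothetical initial belief
opp <- rbinom(10, size = 1, prob = 0.7)   # hypothetical opponent choices
# alpha[t] = alpha0 + number of 1s among the first t-1 observations
alpha_t <- alpha0 + cumsum(c(0, head(opp, -1)))
beta_t <- beta0 + cumsum(c(0, head(1 - opp, -1)))
rate_t <- alpha_t / (alpha_t + beta_t)    # matches rate[] in the Stan model
round(rate_t, 2)
```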
## Relationship to Rescorla-Wagner
The Rescorla-Wagner model of learning, which drives the Reinforcement Learning agents we will build in a later chapter, follows the form:
$$V_{t+1} = V_{t} + \alpha (R_{t} - V_{t})$$
where:
* $V_{t}$ is the current Value estimate (in our code: memory[t]).
* $\alpha$ is the Learning Rate (in our code: forgetting).
* $R_{t}$ is the Reward or Outcome (in our code: other[t]).
* $(R_{t} - V_{t})$ is the Prediction Error.
Our memory model with a forgetting parameter follows a very similar structure:

$$\text{memory}_{t+1} = (1 - \text{forgetting}) \times \text{memory}_{t} + \text{forgetting} \times \text{outcome}_{t}$$

If we expand this, we can see it is mathematically identical to the Rescorla-Wagner update rule:

$$\begin{aligned}
\text{memory}_{t+1} &= \text{memory}_{t} - \text{forgetting} \times \text{memory}_{t} + \text{forgetting} \times \text{outcome}_{t} \\
\text{memory}_{t+1} &= \text{memory}_{t} + \text{forgetting} \times (\text{outcome}_{t} - \text{memory}_{t})
\end{aligned}$$
This reveals that our "exponential forgetting" memory model is actually a standard Reinforcement Learning agent with a big fake moustache. The forgetting parameter dictates how quickly the agent updates their beliefs (value estimate) based on new prediction errors.
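To see the equivalence numerically, here is a minimal sketch (the learning rate and outcome sequence below are made up for illustration) that runs both update rules side by side:

```{r ch05_rw_equivalence}
# The forgetting update and the Rescorla-Wagner update trace identical paths.
set.seed(123)
forgetting <- 0.3                             # plays the role of the RW learning rate
outcome <- rbinom(100, size = 1, prob = 0.7)  # hypothetical opponent choices
memory_forget <- memory_rw <- numeric(100)
memory_forget[1] <- memory_rw[1] <- 0.5
for (t in 1:99) {
  # Weighted-average (forgetting) form
  memory_forget[t + 1] <- (1 - forgetting) * memory_forget[t] + forgetting * outcome[t]
  # Prediction-error (Rescorla-Wagner) form
  memory_rw[t + 1] <- memory_rw[t] + forgetting * (outcome[t] - memory_rw[t])
}
all.equal(memory_forget, memory_rw)  # TRUE: the two forms are the same model
```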
### Connection to Kalman Filters
Our memory model updates beliefs about the probability of right-hand choices using a weighted average of past observations. This is conceptually similar to how a Kalman filter works, though simpler:
* Kalman filters maintain both an estimate and uncertainty about that estimate
* They optimally weight new evidence based on relative uncertainty
* Our forgetting model instead uses a fixed weighting scheme (the 1/trial weight of the running-average model, or the fixed forgetting parameter); the sketch after this list contrasts the two
* The Bayesian Agent (`ch06_BayesianMemory.stan`) captures uncertainty via the Beta distribution, similar in spirit but simpler than a Kalman filter.
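For intuition only, here is a hypothetical one-dimensional Kalman-style update (not a model this chapter fits, and a Bernoulli outcome is not really Gaussian noise; the noise values below are made up). The point is that the Kalman gain, the effective learning rate, adapts as uncertainty shrinks, in contrast to our fixed forgetting weight:

```{r ch05_kalman_sketch}
# Illustrative 1D Kalman filter on a noisy 0/1 stream (assumed noise values).
set.seed(123)
estimate <- 0.5; variance <- 1   # initial belief and its uncertainty
process_noise <- 0.01            # how fast the world is assumed to drift
obs_noise <- 0.25                # how noisy each observation is assumed to be
outcome <- rbinom(20, size = 1, prob = 0.7)
gains <- numeric(20)
for (t in seq_along(outcome)) {
  variance <- variance + process_noise          # predict: uncertainty grows
  gains[t] <- variance / (variance + obs_noise) # Kalman gain = trust in new data
  estimate <- estimate + gains[t] * (outcome[t] - estimate)  # RW-style update
  variance <- (1 - gains[t]) * variance         # uncertainty shrinks after update
}
round(gains, 2)  # the effective learning rate decays, unlike fixed 'forgetting'
```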
### Connection to Hierarchical Gaussian Filter (HGF)
The HGF extends these ideas by:
* Tracking beliefs at multiple levels
* Allowing learning rates to vary over time
* Explicitly modeling environmental volatility
Our model could be seen as the simplest case of an HGF, one that:
* tracks only one level (the probability of a right-hand choice)
* has a fixed learning rate (the forgetting parameter)
* doesn't explicitly model environmental volatility
### Implications for Model Development
Understanding these relationships clarifies where our model sits in this broader family and suggests natural extensions:
* We could add uncertainty estimates to get Kalman-like behavior
* We could make the forgetting parameter dynamic to capture changing learning rates
* We could add multiple levels to track both immediate probabilities and longer-term trends
Each extension would make the model more flexible but also more complex to fit to data. The choice depends on our specific research questions and available data.
## Conclusion: Estimating Parameters and Exploring Memory
This chapter marked a crucial transition from simulating models with known parameters to the core task of **parameter estimation**: inferring plausible parameter values from observed data using Bayesian inference and Stan.
We learned how to:
* **Specify Bayesian models** in Stan, defining data, parameters, priors, and data models.
* **Fit models** using `cmdstanr` to obtain posterior distributions for parameters like choice bias (`theta`).
* **Utilize transformations** like log-odds for computational benefits and flexible prior specification.
* **Verify fits against known ground truth**, fitting each model to a simulated dataset and checking that the posterior concentrates near the generating parameter values (full parameter recovery studies follow in Chapter 6).
* **Implement different cognitive models** for history-dependent choice, exploring various ways to represent **memory**: as an external predictor, as an internal updating state, as a state with exponential forgetting (linking to Reinforcement Learning principles), and as a fully Bayesian belief update.
* **Recognize connections** between these models and broader frameworks like Kalman filters and HGF.
By fitting these models, we moved beyond simply describing behavior (like the cumulative rates in Chapter 3) to quantifying underlying latent parameters (like `bias`, `beta`, `forgetting`).
However, a single fit to a single simulated dataset is only a preliminary
sanity check — the equivalent of flipping a coin once to check if it's fair.
It does not tell us whether the model behaves well across a wide range of
parameter values, whether our credible intervals are correctly calibrated, or
whether the fitted model can actually reproduce the patterns in the data.
**Coming up in Chapter 6:** We take the Stan models built in this chapter and
subject them to a four-phase validation battery before they are allowed near
real human data:
1. **Prior Predictive Check** — do the priors allow only cognitively plausible behavior?
2. **Parameter Recovery** — does the model find the right answer across many simulated datasets and noise levels?
3. **Simulation-Based Calibration Checks (SBC)** — are our posterior intervals actually calibrated (does a 95% CI contain the truth 95% of the time)?
4. **Posterior Predictive Check** — does the fitted model reproduce key patterns in the observed data?
Only after passing all four tests does a model earn the right to touch real
empirical data.