Appendix B — Stan Optimization Cheatsheet
This appendix collects every Stan programming technique introduced across Chapters 4–11, with minimal code examples, cost estimates, and guidance on when to apply each one. The techniques are ordered by first appearance in the book.
Quick navigation
| # | Technique | Chapter |
|---|---|---|
| 1 | Block placement | Ch 4 |
| 2 | Pre-compute in R | Ch 4, 6 |
| 3 | Sufficient statistics | Ch 4 |
| 4 | Vectorize the likelihood | Ch 4, 6 |
| 5 | Priors as data | Ch 5 |
| 6 | Conditional diagnostics (`run_diagnostics`) | Ch 5 |
| 7 | Post-hoc generated quantities | Ch 5 |
| 8 | Non-centered parameterization | Ch 6 |
| 9 | Cholesky factorization | Ch 6 |
| 10 | Contiguous subject indexing | Ch 6 |
| 11 | Pathfinder warm-start | Ch 4, 6, 11 |
| 12 | Marginalizing discrete parameters | Ch 8 |
| 13 | Sequential state unrolling | Ch 9 |
| 14 | `reduce_sum` parallelism | Ch 11 |
| 15 | Combining techniques | Ch 6, 11 |
B.1 1. Block Placement — Run It Once, Not Thousands of Times
First introduced: Section 5.7 (Chapter 4, Memory Model 1)
HMC runs thousands of leapfrog steps during warm-up. Every line of code in transformed parameters or model executes at each step; every line in transformed data executes exactly once.
| Block | Runs | Put here |
|---|---|---|
| `transformed data` | Once, before sampling | Quantities derived only from data — precomputed distances, log-transforms of fixed inputs, cumulative sums |
| `transformed parameters` | Every leapfrog step (saved) | Deterministic functions of parameters whose gradients must propagate — NCP reparameterisations, recursive state, predicted values used in the likelihood |
| `model` (local vars) | Every leapfrog step (not saved) | Intermediate scratch quantities not needed outside the loop |
| `generated quantities` | Once per saved draw | Posterior predictive samples, `log_lik`, prior draws — anything expensive that is not needed for sampling |
A typical fit runs 4 chains × 1 000 warmup iterations × ~6 leapfrog steps = ~24 000 leapfrog evaluations during warm-up. Moving 500 precomputed memory values from transformed parameters to transformed data saves ~12 million floating-point operations per fit.
```stan
transformed data {
  // computed once — running average of opponent choices
  vector[N] memory;
  memory[1] = 0.5;
  for (t in 2:N)
    memory[t] = (1.0 / t) * sum(other[1:(t-1)]);
}
```

B.2 2. Pre-Compute in R — Don’t Let Stan See the Raw Computation
First introduced: Chapter 7 (Chapter 6, multilevel memory agent)
transformed data runs once before sampling — but the computation still happens inside Stan’s runtime. If a quantity depends only on observed data and will never change between fits, computing it in R and passing it as a data field is even cheaper: Stan never processes the raw inputs at all.
```r
# R: compute the running opponent rate once (O(N) vectorised R operation)
d <- d |>
  group_by(subject) |>
  mutate(opp_rate_prev = lag(cummean(other_choice), default = 0.5)) |>
  ungroup()

stan_data <- list(
  N_total = nrow(d),
  y = d$choice,
  opp_rate_prev = d$opp_rate_prev  # passed as a plain vector
  # ...
)
```

```stan
// Stan only receives the pre-computed vector — no loop needed
data {
  vector<lower=0,upper=1>[N_total] opp_rate_prev;
}
model {
  target += bernoulli_logit_lpmf(y | mu + beta * logit(opp_rate_prev));
}
```

Decision rule: If the computation involves only observed data and uses fast R/dplyr operations, do it in R. Reserve `transformed data` for cases where the computation is cleaner to express in Stan’s syntax (e.g., distance matrices over stimuli).
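The quantity computed by the dplyr pipeline is just a lagged running mean. As a language-neutral illustration of what `lag(cummean(...), default = 0.5)` produces per subject (sketched in Python only so it can be checked standalone; the book's workflow stays in R):

```python
# opp_rate_prev[t] = mean of the opponent's choices on trials 1..t-1,
# with 0.5 on the first trial (the lag default). One O(N) pass.
def opp_rate_prev(choices, default=0.5):
    out, total = [], 0.0
    for t, c in enumerate(choices):
        out.append(default if t == 0 else total / t)
        total += c
    return out

assert opp_rate_prev([1, 1, 0, 1]) == [0.5, 1.0, 1.0, 2 / 3]
```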
B.3 3. Sufficient Statistics — Aggregate IID Trials
First introduced: Chapter 5 (Chapter 4, simple bias model)
When all trials are independently and identically distributed under the same parameter, the total number of successes is a sufficient statistic — it carries all the information in the data about \(\theta\). A single Binomial call replaces \(N\) Bernoulli calls.
```stan
// data block: pass the aggregate, not the full array
data {
  int<lower=0> N;
  int<lower=0, upper=N> successes; // sum(h)
}
model {
  target += binomial_logit_lpmf(successes | N, theta_logit);
}
```

```r
stan_data <- list(N = 120, successes = sum(h))
```

One Binomial evaluation vs. 120 Bernoulli evaluations — every leapfrog step, every chain. For a single-subject model with ~24 000 leapfrog evaluations across 4 chains the difference is small in absolute time, but the principle scales: for 10 000 IID trials the saving is real.
When it applies: single-subject, single-session models where \(\theta\) is constant across all \(N\) trials (Chapters 4–5 bias models).
When it does not apply: memory models, multilevel models (per-subject \(\theta_j\)), any model where the trial-level likelihood depends on a trial-varying predictor.
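The collapse is exact, not an approximation: the Binomial log-pmf differs from the sum of the Bernoulli log-pmfs only by the constant \(\log \binom{N}{k}\), which does not depend on \(\theta\) and therefore cannot affect sampling. A quick numerical check (sketched in Python for a standalone verification; toy data):

```python
import math

h = [1, 0, 1, 1, 0, 1, 1, 0]   # toy trial outcomes
N, k = len(h), sum(h)
p = 0.7                        # any fixed value of theta

# sum of N Bernoulli log-pmfs
sum_bern = sum(math.log(p) if y else math.log(1 - p) for y in h)
# Binomial log-pmf = log C(N, k) + k*log(p) + (N-k)*log(1-p)
log_binom = (math.log(math.comb(N, k))
             + k * math.log(p) + (N - k) * math.log(1 - p))

# identical up to the theta-independent binomial coefficient
assert abs(log_binom - sum_bern - math.log(math.comb(N, k))) < 1e-12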
B.4 4. Vectorize the Likelihood Call
First introduced: Chapter 5 (Chapter 4); used throughout from Chapter 5 onward
Stan evaluates each distribution call as a node in the autodiff graph. A vectorized call over a length-\(N\) vector creates one node; a loop over \(N\) trials creates \(N\) nodes. Smaller graph = faster gradient computation and better cache use.
```stan
// Slower: N nodes in the autodiff graph
for (i in 1:N)
  target += bernoulli_logit_lpmf(h[i] | theta_logit);

// Faster: one node — mathematically identical
target += bernoulli_logit_lpmf(h | theta_logit);
```

In multilevel models, scatter-indexing by subject makes the entire trial pool vectorizable:

```stan
// N_total trials, J subjects — one call instead of N_total calls
target += bernoulli_logit_lpmf(y | theta_logit[subj_id]);
```

`log_mix()` for mixture likelihoods (Chapter 8) operates on one observation at a time because each call mixes two observation-specific log-probabilities. There is no vectorized `log_mix` — the per-observation loop is correct and unavoidable there.
B.5 5. Priors as Data — Avoid Recompilation
First introduced: Chapter 6 (Chapter 5)
Pass prior hyperparameters through the data block instead of hard-coding them. Stan recompiles whenever the code changes, never for new data. This lets you run prior sensitivity analyses and SBC sweeps without recompiling.
```stan
data {
  int<lower=0> N;
  array[N] int y;
  real prior_mu_mean;
  real<lower=0> prior_mu_sigma;
}
model {
  mu ~ normal(prior_mu_mean, prior_mu_sigma);
  y ~ bernoulli_logit(mu);
}
```

```r
data <- list(N = N, y = y,
             prior_mu_mean = 0, prior_mu_sigma = 1.5)
```

B.6 6. The run_diagnostics Flag — Conditional Generated Quantities
First introduced: Chapter 6 (Chapter 5)
Computing log_lik (N values per draw) and y_rep in generated quantities can double run-time. Gate them behind an integer flag so SBC backends can turn them off.
```stan
data {
  int<lower=0, upper=1> run_diagnostics;
}
generated quantities {
  vector[N] log_lik = rep_vector(0.0, N);
  array[N] int y_rep;
  if (run_diagnostics) {
    for (i in 1:N) {
      log_lik[i] = bernoulli_logit_lpmf(y[i] | mu);
      y_rep[i] = bernoulli_logit_rng(mu);
    }
  }
}
```

```r
# Full fit — diagnostics on
data <- list(..., run_diagnostics = 1L)

# SBC backend — diagnostics off, ~2× faster generated quantities
sbc_data <- modifyList(data, list(run_diagnostics = 0L))
```

For 1 000 trials, 4 chains, 1 000 sampling draws, `run_diagnostics = 1` generates and saves 4 million `log_lik` values. During SBC (which only needs the parameters), setting `run_diagnostics = 0` eliminates that cost entirely.
B.7 7. Post-Hoc Generated Quantities with $generate_quantities()
First introduced: Chapter 6 (Chapter 5)
The run_diagnostics flag lets you skip log_lik and y_rep during SBC. A complementary pattern avoids the trade-off entirely: fit once with diagnostics off, then replay the saved parameter draws through the generated quantities block on demand, without re-running MCMC.
```r
# Fast fit — no log_lik or y_rep during sampling
fit_fast <- mod$sample(data = list(..., run_diagnostics = 0L), ...)

# Later: add log_lik and y_rep from the saved draws
fit_gq <- mod$generate_quantities(
  fitted_params = fit_fast,
  data = list(..., run_diagnostics = 1L)
)
log_lik <- fit_gq$draws("log_lik")
y_rep <- fit_gq$draws("y_rep")
```

`$generate_quantities()` runs only the generated quantities block — the sampler is never touched. Key consequences:

- LOO and PPC can be deferred until you actually need them.
- A fit saved with `readRDS()` from a previous session can be replayed: pass the loaded object as `fitted_params`.
- SBC backends always run with `run_diagnostics = 0`; call `$generate_quantities()` only on the small subset of fits you want to inspect in detail.
The compiled model object must be the same for both the original fit and the $generate_quantities() call.
B.8 8. Non-Centered Parameterization (NCP)
First introduced: Chapter 7 (Chapter 6, multilevel biased agent)
Extended to simplex parameters: Chapter 12 (Chapter 11, multilevel decay GCM)
Centered parameterizations create funnel geometries when the group-level variance (sigma) is small: individual-level parameters collapse toward the mean, forming a narrow neck that HMC cannot traverse efficiently, producing divergences, low E-BFMI, and max_treedepth warnings.
Centered (problematic):
```stan
parameters {
  real mu;
  real<lower=0> sigma;
  vector[J] theta;
}
model {
  theta ~ normal(mu, sigma);
}
```

Non-Centered (preferred):

```stan
parameters {
  real mu;
  real<lower=0> sigma;
  vector[J] z; // N(0,1) offset
}
transformed parameters {
  vector[J] theta = mu + sigma * z;
}
model {
  z ~ std_normal();
}
```

B.8.1 Logit-Normal NCP for Simplex Parameters
For attention weights or mixing proportions modeled as a simplex, avoid the Dirichlet + kappa parameterization. Small kappa values produce nearly-flat Dirichlet distributions, creating the same funnel that centered normal models do. For a 2-element simplex use a logit-normal NCP instead:
```stan
parameters {
  real pop_logit_w_mean;
  real<lower=0> pop_logit_w_sd;
  vector[J] z_w;
}
transformed parameters {
  array[J] vector[2] w;
  for (j in 1:J) {
    real lw = pop_logit_w_mean + pop_logit_w_sd * z_w[j];
    w[j][1] = inv_logit(lw);
    w[j][2] = 1.0 - w[j][1];
  }
}
model {
  z_w ~ std_normal();
}
```

The centered Dirichlet + `kappa ~ exponential(0.1)` parameterization in Chapter 11’s multilevel decay GCM produced 4 000 max_treedepth warnings per run and failed to complete 10 iterations overnight at `adapt_delta = 0.99`, `max_treedepth = 20`. Switching to logit-normal NCP eliminated the funnel geometry entirely.
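In both the normal and the logit-normal case, the reparameterisation is an exact change of variables, not an approximation: the centered density of \(\theta\) equals the standard-normal density of the offset \(z = (\theta - \mu)/\sigma\) minus the Jacobian term \(\log \sigma\). A quick one-dimensional check of that identity (sketched in Python purely for illustration; values are arbitrary):

```python
import math

def norm_lpdf(x, mu, sd):
    # log density of Normal(mu, sd) at x
    return (-0.5 * math.log(2 * math.pi) - math.log(sd)
            - 0.5 * ((x - mu) / sd) ** 2)

mu, sigma, theta = 0.3, 0.08, 0.41   # small sigma: the funnel regime
z = (theta - mu) / sigma             # the NCP offset

centered = norm_lpdf(theta, mu, sigma)
non_centered = norm_lpdf(z, 0.0, 1.0) - math.log(sigma)  # + Jacobian
assert abs(centered - non_centered) < 1e-12
```

NCP changes the geometry HMC has to traverse, not the model being fit.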
B.9 9. Cholesky Factorization for Correlation Matrices
First introduced: Chapter 7 (Chapter 6, multilevel memory agent with correlated parameters)
Whenever a model estimates a correlation matrix \(\Omega\) over individual-level parameters, use the Cholesky-factorized form rather than sampling \(\Omega\) directly.
| Parameterization | Leapfrog cost | Why prefer/avoid |
|---|---|---|
| `corr_matrix[K] Omega` + `lkj_corr_lpdf` | \(O(K^3)\) matrix operations + positive-definiteness check | Expensive; unconstrained transform requires matrix square root |
| `cholesky_factor_corr[K] L_Omega` + `lkj_corr_cholesky_lpdf` | \(O(K^2)\) | Lies directly on its natural manifold; no matrix inversion needed |
```stan
// Avoid
parameters { corr_matrix[2] Omega; }
model { Omega ~ lkj_corr(2.0); }

// Prefer
parameters { cholesky_factor_corr[2] L_Omega; }
model { L_Omega ~ lkj_corr_cholesky(2.0); }
transformed parameters {
  // O(K^2) multiply — no square root
  matrix[2, J] deviations = diag_pre_multiply(sigma, L_Omega) * z;
}
generated quantities {
  // Reconstruct Omega once per saved draw, not every leapfrog step
  matrix[2, 2] Omega = multiply_lower_tri_self_transpose(L_Omega);
  real rho = Omega[1, 2];
}
```

Key point: `diag_pre_multiply(sigma, L_Omega) * z` produces individual-level NCP deviations directly from the Cholesky factor — the full correlation matrix \(\Omega\) is only recovered in generated quantities, running once per saved draw rather than thousands of times during warm-up.
For \(K = 2\) parameters the cost difference is modest; for \(K \geq 5\) (e.g., five per-subject RL parameters) it becomes substantial.
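The reconstruction step is cheap because for a correlation matrix the Cholesky factor is explicit in the 2×2 case: for \(\Omega = \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\), \(L = \begin{pmatrix}1 & 0\\ \rho & \sqrt{1-\rho^2}\end{pmatrix}\) and \(LL^\top = \Omega\). A standalone numerical check of what `multiply_lower_tri_self_transpose` computes (sketched in plain Python, no linear-algebra library, for illustration):

```python
import math

rho = 0.6
# 2x2 Cholesky factor of the correlation matrix [[1, rho], [rho, 1]]
L = [[1.0, 0.0],
     [rho, math.sqrt(1 - rho * rho)]]

# Omega = L @ L.T, written out elementwise
Omega = [[sum(L[i][k] * L[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]

assert abs(Omega[0][0] - 1.0) < 1e-12   # unit diagonal
assert abs(Omega[1][1] - 1.0) < 1e-12
assert abs(Omega[0][1] - rho) < 1e-12   # off-diagonal recovers rho
```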
B.10 10. Contiguous Subject Indexing — subj_start / subj_end
First introduced: Chapter 7 (Chapter 6, multilevel memory model)
Two strategies for multi-subject trial data:
Strategy A — subj_id lookup (simple; works only when trials are independent):
```stan
data { array[N_total] int subj_id; }
model {
  for (i in 1:N_total)
    target += bernoulli_lpmf(y[i] | f(theta[subj_id[i]], x[i]));
}
```

Strategy B — Contiguous block indexing (required for sequential models and `reduce_sum`):

```stan
data {
  array[J] int subj_start;
  array[J] int subj_end;
}
model {
  for (j in 1:J)
    for (t in subj_start[j]:subj_end[j])
      target += bernoulli_lpmf(y[t] | f(theta[j], x[t]));
}
```

Build the index in R:

```r
subj_start <- as.integer(tapply(seq_along(subj_id), subj_id, min))
subj_end   <- as.integer(tapply(seq_along(subj_id), subj_id, max))
```

Use Strategy B whenever:
- Trial \(t\) depends on state accumulated from trials \(1, \ldots, t-1\) (memory, belief updating, RL)
- You intend to use `reduce_sum` for within-chain parallelism (see Section B.14)
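Strategy B assumes the trial data are sorted by subject, so that each subject's rows tile the range \(1, \ldots, N_\text{total}\) with no gaps. A standalone sketch of the index construction and its contiguity invariant (in Python for checkability; the book builds the same index with `tapply` in R):

```python
# Toy trial data, already sorted by subject (a precondition of Strategy B)
subj_id = [1, 1, 1, 2, 2, 3, 3, 3, 3]

starts, ends = {}, {}
for i, j in enumerate(subj_id, start=1):  # 1-based indices, as in Stan
    starts.setdefault(j, i)               # first row seen for subject j
    ends[j] = i                           # last row seen so far

assert [starts[j] for j in sorted(starts)] == [1, 4, 6]
assert [ends[j] for j in sorted(ends)] == [3, 5, 9]
# each block starts right after the previous one ends — the contiguity
# that sequential models and reduce_sum slicing rely on
assert all(starts[j] == ends[j - 1] + 1 for j in sorted(starts) if j > 1)
```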
B.11 11. Pathfinder Warm-Start Initialisation
First benchmarked: Chapter 5 (Chapter 4)
Timing comparison for multilevel models: Chapter 7 (Chapter 6)
Three-way benchmark: Chapter 12 (Chapter 11)
Stan’s default initialisation draws chain starts from Uniform(−2, 2) in unconstrained space — far from the posterior on hierarchical models. Pathfinder runs L-BFGS from each start, fits Gaussian approximations along the path, and returns approximate posterior draws. Passing these as init to $sample() lets HMC skip most of the warm-up trajectory.
```r
# Step 1 — fast variational approximation
pf <- mod$pathfinder(data = stan_data, num_paths = 4, draws = 1000, seed = 123)

# Step 2 — draw one starting point per chain
init_pf <- lapply(1:4, function(i) as.list(pf$draws()[i, , drop = FALSE]))

# Step 3 — full MCMC from warm starts
fit <- mod$sample(data = stan_data, init = init_pf,
                  chains = 4, iter_warmup = 500, iter_sampling = 1000)
```

Pathfinder gives the largest benefit on hierarchical models and any model with correlated population parameters. It saves 20–50% of total wall time by shortening the warmup phase. On simple, well-conditioned models the warm-up is already fast — the Pathfinder overhead may outweigh the saving.
Pathfinder output is not a gold-standard posterior. Always run full MCMC and check diagnostics.
B.12 12. Marginalizing Discrete Parameters with log_mix
First introduced: Chapter 9 (Chapter 8, single and multilevel mixture models)
HMC requires continuous, differentiable parameters. Discrete latent variables (component indicators, regime switches) must be analytically marginalized out of the likelihood before sampling.
```stan
// two-component mixture
real lp1 = bernoulli_logit_lpmf(y | alpha);
real lp2 = bernoulli_logit_lpmf(y | nu);
// log( pi*exp(lp1) + (1-pi)*exp(lp2) )
target += log_mix(pi, lp1, lp2);
```

For \(K > 2\) components use `log_sum_exp` directly:

```stan
vector[K] lps;
for (k in 1:K)
  lps[k] = log(pi[k]) + component_lpmf(y | theta[k]);
target += log_sum_exp(lps);
```

```stan
// WRONG — HMC cannot compute gradients of discrete parameters
int z ~ categorical(pi);
target += bernoulli_lpmf(y | theta[z]);
```

Stan rejects this outright: parameters cannot be declared as integers, so the model will not even compile, and HMC could not compute gradients for `z` in any case. Always marginalise.
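Both forms compute the same quantity: `log_mix(pi, lp1, lp2)` is exactly the two-component `log_sum_exp` over \(\log\pi + \text{lp}_1\) and \(\log(1-\pi) + \text{lp}_2\), evaluated stably in log space. A standalone numerical check (sketched in Python; the helper names are mine, not Stan's):

```python
import math

def log_sum_exp(xs):
    # stable log(sum(exp(x))) via the max-shift trick
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_mix(pi, lp1, lp2):
    return log_sum_exp([math.log(pi) + lp1, math.log(1 - pi) + lp2])

# agrees with the naive formula where exp() is safe...
pi, lp1, lp2 = 0.3, -4.0, -6.0
naive = math.log(pi * math.exp(lp1) + (1 - pi) * math.exp(lp2))
assert abs(log_mix(pi, lp1, lp2) - naive) < 1e-12

# ...and stays finite where naive exp() would underflow to 0
assert math.isfinite(log_mix(pi, -800.0, -900.0))
```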
B.13 13. Sequential State Unrolling in transformed parameters
First introduced: Chapter 10 (Chapter 9, weighted and proportional agent models)
When the likelihood depends on a latent state that evolves trial-by-trial (belief, RL value, running weight), unroll the recursion in transformed parameters so the gradient flows back through the entire trajectory.
```stan
transformed parameters {
  vector[N] predicted_choice;
  {
    // local scope: scratch variable allocated and freed each leapfrog step
    real belief = 0.5;
    for (t in 1:N) {
      predicted_choice[t] = inv_logit(weight * (belief - 0.5));
      belief += alpha * (feedback[t] - belief);
    }
  }
}
model {
  target += bernoulli_lpmf(y | predicted_choice);
}
```

Why `transformed parameters` and not `model`?
State that enters the likelihood needs its gradient tracked by autodiff. Declare only the output vector (predicted_choice) as a transformed parameters variable; keep loop-internal scratch variables inside a {} local block to avoid placing them on the saved-draw tape.
Exception: If the loop is independent across subjects, move it into a partial_sum_lp function and use reduce_sum for across-subject parallelism (see Section B.14).
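The recursion itself is simple; what matters is that every predicted choice depends on the belief accumulated from all earlier feedback, which is why the full trajectory must sit on the autodiff path. A standalone sketch of the unrolled state update (in Python for checkability; same recursion as the Stan block above):

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def predicted_choices(feedback, alpha, weight):
    # belief starts at 0.5 and is updated by the prediction error on
    # each trial; prediction at trial t uses the belief from trials < t
    belief, out = 0.5, []
    for f in feedback:
        out.append(inv_logit(weight * (belief - 0.5)))
        belief += alpha * (f - belief)
    return out

preds = predicted_choices([1, 1, 0, 1], alpha=0.2, weight=3.0)
assert preds[0] == 0.5                     # inv_logit(0): no feedback yet
assert all(0.0 < p < 1.0 for p in preds)   # valid choice probabilities
```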
B.14 14. Intra-Chain Parallelism with reduce_sum
First introduced: Chapter 12 (Chapter 11, multilevel decay GCM)
reduce_sum exploits conditional independence across subjects: given population parameters, subject \(j\)’s trial loop shares no mutable state with subject \(k\)’s loop. The subject dimension can therefore be distributed across CPU threads within a single chain, giving near-linear speedup up to the number of subjects.
Step 1 — Write a partial_sum_lp function in the functions block:
```stan
functions {
  real partial_sum_lp(
    array[] int subj_slice, int start, int end,
    vector log_c,
    array[] int subj_start_g, array[] int subj_end_g,
    array[] int y, array[,] real x
  ) {
    real lp = 0.0;
    for (idx in 1:size(subj_slice)) {
      int j = subj_slice[idx];
      for (t in subj_start_g[j]:subj_end_g[j])
        lp += bernoulli_lpmf(y[t] | f(exp(log_c[j]), x[t]));
    }
    return lp;
  }
}
```

Step 2 — Replace the subject loop with `reduce_sum` in the model block:

```stan
model {
  // subject_ids = {1, 2, ..., J}
  target += reduce_sum(partial_sum_lp, subject_ids, 1,
                       log_c, subj_start, subj_end, y, x);
}
```

Step 3 — Compile with threading support and sample:
```r
mod_par <- cmdstan_model(stan_file,
                         cpp_options = list(stan_threads = TRUE))
n_threads <- max(1L, parallel::detectCores() %/% 4L)
fit_par <- mod_par$sample(
  data = stan_data,
  chains = 4,
  parallel_chains = 4,
  threads_per_chain = n_threads,
  iter_warmup = 500,
  iter_sampling = 1000
)
```

When is `reduce_sum` worth it?
| Condition | Effect |
|---|---|
| ≥ 20 subjects with expensive per-subject loops | Large speedup (near-linear up to J) |
| Few subjects or very cheap loops | Threading overhead may dominate |
| Within-subject sequential dependency | Still valid — parallelises across subjects, not within |
| Contiguous `subj_start`/`subj_end` indexing | Required (see Section B.10) |
The grainsize argument (the third argument to `reduce_sum`; the `1` in the call above) controls the minimum slice size. Use 1 as a starting point; increase it if the per-subject work is small.
Why can’t `reduce_sum` parallelise within a subject?
In the single-subject GCM (Chapter 11), trial \(t\) depends on the memory state accumulated from trials \(1, \ldots, t-1\). This sequential dependency is within a subject, so reduce_sum cannot help there. Parallelism across subjects is always safe because subjects are conditionally independent.
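Correctness rests on a simple invariant: because subjects are conditionally independent, the total log likelihood is a plain sum of per-subject contributions, and a sum gives the same answer no matter how its terms are partitioned into slices. A standalone sketch of that invariant (in Python; toy numbers):

```python
# Toy per-subject log-likelihood contributions
subject_lp = [-12.3, -8.1, -15.7, -9.9, -11.2]

def partial_sum(slice_ids):
    # what one reduce_sum slice contributes for its subjects
    return sum(subject_lp[j] for j in slice_ids)

total = sum(subject_lp)

# two different slicings (two thread schedules), same target increment
assert abs(partial_sum([0, 1]) + partial_sum([2, 3, 4]) - total) < 1e-12
assert abs(partial_sum([0]) + partial_sum([1, 2])
           + partial_sum([3, 4]) - total) < 1e-12
```

Within a subject no such partition exists: trial \(t\)'s contribution depends on the state left by trial \(t-1\), so those terms cannot be computed independently.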
B.15 15. Combining Techniques — Compound Speedups
Three-way timing benchmark: Chapter 12 (Chapter 11)
The techniques above stack multiplicatively on hierarchical models:
| Technique | Typical saving | When to apply |
|---|---|---|
| NCP instead of CP | 5–100× fewer divergences | Any multilevel model — always prefer NCP |
| `transformed data` precompute | 10–100× fewer flops | Static quantities derived from data only |
| `run_diagnostics = 0` in SBC | ~2× faster generated quantities | All SBC backends |
| Pathfinder warm-start | 20–50% less wall time | Hierarchical models; correlated posteriors |
| `reduce_sum` (10 subjects, 4-core laptop) | 2–3× within-chain | ≥ 20 subjects, expensive loops |
| Pathfinder + `reduce_sum` together | 3–5× vs. serial baseline | Large multilevel models, multi-core hardware |
Benchmark template (used in Chapters 6 and 11):
```r
time_serial <- system.time(
  fit_serial <- mod$sample(data = stan_data, ...)
)[["elapsed"]]

pf <- mod$pathfinder(data = stan_data, num_paths = 4, draws = 1000)
init_pf <- lapply(1:4, function(i) as.list(pf$draws()[i, , drop = FALSE]))
time_serial_pf <- system.time(
  fit_pf <- mod$sample(data = stan_data, init = init_pf, ...)
)[["elapsed"]]

n_threads <- max(1L, parallel::detectCores() %/% 4L)
time_par <- system.time(
  fit_par <- mod_par$sample(
    data = stan_data, init = init_pf,
    threads_per_chain = n_threads, ...)
)[["elapsed"]]

tibble(
  Strategy = c("Serial (baseline)", "Serial + Pathfinder", "Parallel + Pathfinder"),
  Time_sec = c(time_serial, time_serial_pf, time_par),
  Speedup  = round(time_serial / c(time_serial, time_serial_pf, time_par), 2)
)
```

B.16 Quick Reference
| Problem observed | Technique | Section |
|---|---|---|
| Redundant computation during leapfrog | Block placement → `transformed data` | §1 |
| Quantity derived from data only, easier to compute in R | Pre-compute in R | §2 |
| Many IID trials, one parameter per session | Sufficient statistics → `binomial_lpmf` | §3 |
| Trial loop over identical distribution | Vectorize the likelihood call | §4 |
| Changing priors requires recompiling | Priors as data | §5 |
| SBC / LOO is slow | `run_diagnostics` flag | §6 |
| Want PPC/LOO without re-running MCMC | `$generate_quantities()` post-hoc | §7 |
| Divergences / low E-BFMI in multilevel model | Non-centered parameterization (NCP) | §8 |
| Simplex + small kappa → funnel geometry | Logit-normal NCP for simplex | §8 |
| Model has correlation matrix \(\Omega\) | Cholesky factorization (`lkj_corr_cholesky`) | §9 |
| Sequential model with per-subject trial loops | Contiguous `subj_start`/`subj_end` indexing | §10 |
| Slow warm-up on complex / hierarchical model | Pathfinder warm-start | §11 |
| Discrete latent variables (indicators, regimes) | Marginalise with `log_mix` | §12 |
| Trial-to-trial latent state in likelihood | Sequential unrolling in `transformed parameters` | §13 |
| Many subjects, multi-core machine | `reduce_sum` intra-chain parallelism | §14 |