Appendix B — Stan Optimization Cheatsheet
This appendix collects every Stan programming technique introduced across Chapters 4–11, with minimal code examples, cost estimates, and guidance on when to apply each one. The techniques are ordered by first appearance in the book.
Quick navigation
| # | Technique | Chapter |
|---|---|---|
| 1 | Block placement | Ch 4 |
| 2 | Pre-compute in R | Ch 4, 6 |
| 3 | Sufficient statistics | Ch 4 |
| 4 | Vectorize the likelihood | Ch 4, 6 |
| 5 | Priors as data | Ch 5 |
| 6 | Conditional diagnostics (`run_diagnostics`) | Ch 5 |
| 7 | Post-hoc generated quantities | Ch 5 |
| 8 | Non-centered parameterization | Ch 6 |
| 9 | Cholesky factorization | Ch 6 |
| 10 | Contiguous subject indexing | Ch 6 |
| 11 | Pathfinder warm-start | Ch 4, 6, 11 |
| 12 | Marginalizing discrete parameters | Ch 8 |
| 13 | Sequential state unrolling | Ch 9 |
| 14 | `reduce_sum` parallelism | Ch 11 |
| 15 | Combining techniques | Ch 6, 11 |
B.1 1. Block Placement — Run It Once, Not Thousands of Times
First introduced: Section 5.7 (Chapter 4, Memory Model 1)
HMC runs thousands of leapfrog steps during warm-up. Every line of code in transformed parameters or model executes at each step; every line in transformed data executes exactly once.
| Block | Runs | Put here |
|---|---|---|
| `transformed data` | Once, before sampling | Quantities derived only from data — precomputed distances, log-transforms of fixed inputs, cumulative sums |
| `transformed parameters` | Every leapfrog step (saved) | Deterministic functions of parameters whose gradients must propagate — NCP reparameterisations, recursive state, predicted values used in the likelihood |
| `model` (local vars) | Every leapfrog step (not saved) | Intermediate scratch quantities not needed outside the loop |
| `generated quantities` | Once per saved draw | Posterior predictive samples, `log_lik`, prior draws — anything expensive that is not needed for sampling |
A typical fit runs 4 chains × 1 000 warmup iterations × ~6 leapfrog steps = ~24 000 leapfrog evaluations during warm-up. Moving 500 precomputed memory values from transformed parameters to transformed data saves ~12 million floating-point operations per fit.
```stan
transformed data {
  // computed once — running average of opponent choices
  vector[N] memory;
  memory[1] = 0.5;
  for (t in 2:N)
    memory[t] = (1.0 / t) * sum(other[1:(t-1)]);
}
```

B.2 2. Pre-Compute in R — Don’t Let Stan See the Raw Computation
First introduced: Chapter 7 (Chapter 6, multilevel memory agent)
transformed data runs once before sampling — but the computation still happens inside Stan’s runtime. If a quantity depends only on observed data and will never change between fits, computing it in R and passing it as a data field is even cheaper: Stan never processes the raw inputs at all.
```r
# R: compute the running opponent rate once (O(N) vectorised R operation)
d <- d |>
  group_by(subject) |>
  mutate(opp_rate_prev = lag(cummean(other_choice), default = 0.5)) |>
  ungroup()

stan_data <- list(
  N_total = nrow(d),
  y = d$choice,
  opp_rate_prev = d$opp_rate_prev  # passed as a plain vector
  # ...
)
```

```stan
// Stan only receives the pre-computed vector — no loop needed
data {
  vector<lower=0,upper=1>[N_total] opp_rate_prev;
}
model {
  target += bernoulli_logit_lpmf(y | mu + beta * logit(opp_rate_prev));
}
```

Decision rule: If the computation involves only observed data and uses fast R/dplyr operations, do it in R. Reserve `transformed data` for cases where the computation is cleaner to express in Stan’s syntax (e.g., distance matrices over stimuli).
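The quantity computed by the dplyr pipeline is just a lagged running mean. As a language-neutral illustration of what `lag(cummean(...), default = 0.5)` produces per subject (sketched in Python only so it can be checked standalone; the book's workflow stays in R):

```python
# opp_rate_prev[t] = mean of the opponent's choices on trials 1..t-1,
# with 0.5 on the first trial (the lag default). One O(N) pass.
def opp_rate_prev(choices, default=0.5):
    out, total = [], 0.0
    for t, c in enumerate(choices):
        out.append(default if t == 0 else total / t)
        total += c
    return out

assert opp_rate_prev([1, 1, 0, 1]) == [0.5, 1.0, 1.0, 2 / 3]
```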
B.3 3. Sufficient Statistics — Aggregate IID Trials
First introduced: Chapter 5 (Chapter 4, simple bias model)
When all trials are independently and identically distributed under the same parameter, the total number of successes is a sufficient statistic — it carries all the information in the data about \(\theta\). A single Binomial call replaces \(N\) Bernoulli calls.
```stan
// data block: pass the aggregate, not the full array
data {
  int<lower=0> N;
  int<lower=0, upper=N> successes; // sum(h)
}
model {
  target += binomial_logit_lpmf(successes | N, theta_logit);
}
```

```r
stan_data <- list(N = 120, successes = sum(h))
```

One Binomial evaluation vs. 120 Bernoulli evaluations — every leapfrog step, every chain. For a single-subject model with ~24 000 leapfrog evaluations across 4 chains the difference is small in absolute time, but the principle scales: for 10 000 IID trials the saving is real.
When it applies: single-subject, single-session models where \(\theta\) is constant across all \(N\) trials (Chapters 4–5 bias models).
When it does not apply: memory models, multilevel models (per-subject \(\theta_j\)), any model where the trial-level likelihood depends on a trial-varying predictor.
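The collapse is exact, not an approximation: the Binomial log-pmf differs from the sum of the Bernoulli log-pmfs only by the constant \(\log \binom{N}{k}\), which does not depend on \(\theta\) and therefore cannot affect sampling. A quick numerical check (sketched in Python for a standalone verification; toy data):

```python
import math

h = [1, 0, 1, 1, 0, 1, 1, 0]   # toy trial outcomes
N, k = len(h), sum(h)
p = 0.7                        # any fixed value of theta

# sum of N Bernoulli log-pmfs
sum_bern = sum(math.log(p) if y else math.log(1 - p) for y in h)
# Binomial log-pmf = log C(N, k) + k*log(p) + (N-k)*log(1-p)
log_binom = (math.log(math.comb(N, k))
             + k * math.log(p) + (N - k) * math.log(1 - p))

# identical up to the theta-independent binomial coefficient
assert abs(log_binom - sum_bern - math.log(math.comb(N, k))) < 1e-12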
B.4 4. Vectorize the Likelihood Call
First introduced: Chapter 5 (Chapter 4); used throughout from Chapter 5 onward
Stan evaluates each distribution call as a node in the autodiff graph. A vectorized call over a length-\(N\) vector creates one node; a loop over \(N\) trials creates \(N\) nodes. Smaller graph = faster gradient computation and better cache use.
```stan
// Slower: N nodes in the autodiff graph
for (i in 1:N)
  target += bernoulli_logit_lpmf(h[i] | theta_logit);

// Faster: one node — mathematically identical
target += bernoulli_logit_lpmf(h | theta_logit);
```

In multilevel models, scatter-indexing by subject makes the entire trial pool vectorizable:

```stan
// N_total trials, J subjects — one call instead of N_total calls
target += bernoulli_logit_lpmf(y | theta_logit[subj_id]);
```

`log_mix()` for mixture likelihoods (Chapter 8) operates on one observation at a time because each call mixes two observation-specific log-probabilities. There is no vectorized `log_mix` — the per-observation loop is correct and unavoidable there.
B.5 5. Priors as Data — Avoid Recompilation
First introduced: Chapter 6 (Chapter 5)
Pass prior hyperparameters through the data block instead of hard-coding them. Stan recompiles whenever the code changes, never for new data. This lets you run prior sensitivity analyses and SBC sweeps without recompiling.
```stan
data {
  int<lower=0> N;
  array[N] int y;
  real prior_mu_mean;
  real<lower=0> prior_mu_sigma;
}
model {
  mu ~ normal(prior_mu_mean, prior_mu_sigma);
  y ~ bernoulli_logit(mu);
}
```

```r
data <- list(N = N, y = y,
             prior_mu_mean = 0, prior_mu_sigma = 1.5)
```

B.6 6. The run_diagnostics Flag — Conditional Generated Quantities
First introduced: Chapter 6 (Chapter 5)
Computing log_lik (N values per draw) and y_rep in generated quantities can double run-time. Gate them behind an integer flag so SBC backends can turn them off.
```stan
data {
  int<lower=0, upper=1> run_diagnostics;
}
generated quantities {
  vector[N] log_lik = rep_vector(0.0, N);
  array[N] int y_rep;
  if (run_diagnostics) {
    for (i in 1:N) {
      log_lik[i] = bernoulli_logit_lpmf(y[i] | mu);
      y_rep[i] = bernoulli_logit_rng(mu);
    }
  }
}
```

```r
# Full fit — diagnostics on
data <- list(..., run_diagnostics = 1L)

# SBC backend — diagnostics off, ~2× faster generated quantities
sbc_data <- modifyList(data, list(run_diagnostics = 0L))
```

For 1 000 trials, 4 chains, 1 000 sampling draws, `run_diagnostics = 1` generates and saves 4 million `log_lik` values. During SBC (which only needs the parameters), setting `run_diagnostics = 0` eliminates that cost entirely.
B.7 7. Post-Hoc Generated Quantities with $generate_quantities()
First introduced: Chapter 6 (Chapter 5)
The run_diagnostics flag lets you skip log_lik and y_rep during SBC. A complementary pattern avoids the trade-off entirely: fit once with diagnostics off, then replay the saved parameter draws through the generated quantities block on demand, without re-running MCMC.
```r
# Fast fit — no log_lik or y_rep during sampling
fit_fast <- mod$sample(data = list(..., run_diagnostics = 0L), ...)

# Later: add log_lik and y_rep from the saved draws
fit_gq <- mod$generate_quantities(
  fitted_params = fit_fast,
  data = list(..., run_diagnostics = 1L)
)
log_lik <- fit_gq$draws("log_lik")
y_rep <- fit_gq$draws("y_rep")
```

`$generate_quantities()` runs only the generated quantities block — the sampler is never touched. Key consequences:

- LOO and PPC can be deferred until you actually need them.
- A fit saved with `readRDS()` from a previous session can be replayed: pass the loaded object as `fitted_params`.
- SBC backends always run with `run_diagnostics = 0`; call `$generate_quantities()` only on the small subset of fits you want to inspect in detail.
The compiled model object must be the same for both the original fit and the $generate_quantities() call.
B.8 8. Non-Centered Parameterization (NCP)
First introduced: Chapter 7 (Chapter 6, multilevel biased agent)
Extended to simplex parameters: Chapter 12 (Chapter 11, multilevel decay GCM)
Centered parameterizations create funnel geometries when the group-level variance (sigma) is small: individual-level parameters collapse toward the mean, forming a narrow neck that HMC cannot traverse efficiently, producing divergences, low E-BFMI, and max_treedepth warnings.
Centered (problematic):
```stan
parameters {
  real mu;
  real<lower=0> sigma;
  vector[J] theta;
}
model {
  theta ~ normal(mu, sigma);
}
```

Non-Centered (preferred):

```stan
parameters {
  real mu;
  real<lower=0> sigma;
  vector[J] z; // N(0,1) offset
}
transformed parameters {
  vector[J] theta = mu + sigma * z;
}
model {
  z ~ std_normal();
}
```

B.8.1 Logit-Normal NCP for Simplex Parameters
For attention weights or mixing proportions modeled as a simplex, avoid the Dirichlet + kappa parameterization. Small kappa values produce nearly-flat Dirichlet distributions, creating the same funnel that centered normal models do. For a 2-element simplex use a logit-normal NCP instead:
```stan
parameters {
  real pop_logit_w_mean;
  real<lower=0> pop_logit_w_sd;
  vector[J] z_w;
}
transformed parameters {
  array[J] vector[2] w;
  for (j in 1:J) {
    real lw = pop_logit_w_mean + pop_logit_w_sd * z_w[j];
    w[j][1] = inv_logit(lw);
    w[j][2] = 1.0 - w[j][1];
  }
}
model {
  z_w ~ std_normal();
}
```

The centered Dirichlet + `kappa ~ exponential(0.1)` parameterization in Chapter 11’s multilevel decay GCM produced 4 000 max_treedepth warnings per run and failed to complete 10 iterations overnight at `adapt_delta = 0.99`, `max_treedepth = 20`. Switching to logit-normal NCP eliminated the funnel geometry entirely.
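In both the normal and the logit-normal case, the reparameterisation is an exact change of variables, not an approximation: the centered density of \(\theta\) equals the standard-normal density of the offset \(z = (\theta - \mu)/\sigma\) minus the Jacobian term \(\log \sigma\). A quick one-dimensional check of that identity (sketched in Python purely for illustration; values are arbitrary):

```python
import math

def norm_lpdf(x, mu, sd):
    # log density of Normal(mu, sd) at x
    return (-0.5 * math.log(2 * math.pi) - math.log(sd)
            - 0.5 * ((x - mu) / sd) ** 2)

mu, sigma, theta = 0.3, 0.08, 0.41   # small sigma: the funnel regime
z = (theta - mu) / sigma             # the NCP offset

centered = norm_lpdf(theta, mu, sigma)
non_centered = norm_lpdf(z, 0.0, 1.0) - math.log(sigma)  # + Jacobian
assert abs(centered - non_centered) < 1e-12
```

NCP changes the geometry HMC has to traverse, not the model being fit.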
B.9 9. Cholesky Factorization for Correlation Matrices
First introduced: Chapter 7 (Chapter 6, multilevel memory agent with correlated parameters)
Whenever a model estimates a correlation matrix \(\Omega\) over individual-level parameters, use the Cholesky-factorized form rather than sampling \(\Omega\) directly.
| Parameterization | Leapfrog cost | Why prefer/avoid |
|---|---|---|
| `corr_matrix[K] Omega` + `lkj_corr_lpdf` | \(O(K^3)\) matrix operations + positive-definiteness check | Expensive; unconstrained transform requires matrix square root |
| `cholesky_factor_corr[K] L_Omega` + `lkj_corr_cholesky_lpdf` | \(O(K^2)\) | Lies directly on its natural manifold; no matrix inversion needed |
```stan
// Avoid
parameters { corr_matrix[2] Omega; }
model { Omega ~ lkj_corr(2.0); }

// Prefer
parameters { cholesky_factor_corr[2] L_Omega; }
model { L_Omega ~ lkj_corr_cholesky(2.0); }
transformed parameters {
  // O(K^2) multiply — no square root
  matrix[2, J] deviations = diag_pre_multiply(sigma, L_Omega) * z;
}
generated quantities {
  // Reconstruct Omega once per saved draw, not every leapfrog step
  matrix[2, 2] Omega = multiply_lower_tri_self_transpose(L_Omega);
  real rho = Omega[1, 2];
}
```

Key point: `diag_pre_multiply(sigma, L_Omega) * z` produces individual-level NCP deviations directly from the Cholesky factor — the full correlation matrix \(\Omega\) is only recovered in generated quantities, running once per saved draw rather than thousands of times during warm-up.
For \(K = 2\) parameters the cost difference is modest; for \(K \geq 5\) (e.g., five per-subject RL parameters) it becomes substantial.
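The reconstruction step is cheap because for a correlation matrix the Cholesky factor is explicit in the 2×2 case: for \(\Omega = \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\), \(L = \begin{pmatrix}1 & 0\\ \rho & \sqrt{1-\rho^2}\end{pmatrix}\) and \(LL^\top = \Omega\). A standalone numerical check of what `multiply_lower_tri_self_transpose` computes (sketched in plain Python, no linear-algebra library, for illustration):

```python
import math

rho = 0.6
# 2x2 Cholesky factor of the correlation matrix [[1, rho], [rho, 1]]
L = [[1.0, 0.0],
     [rho, math.sqrt(1 - rho * rho)]]

# Omega = L @ L.T, written out elementwise
Omega = [[sum(L[i][k] * L[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]

assert abs(Omega[0][0] - 1.0) < 1e-12   # unit diagonal
assert abs(Omega[1][1] - 1.0) < 1e-12
assert abs(Omega[0][1] - rho) < 1e-12   # off-diagonal recovers rho
```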
B.10 10. Contiguous Subject Indexing — subj_start / subj_end
First introduced: Chapter 7 (Chapter 6, multilevel memory model)
Two strategies for multi-subject trial data:
Strategy A — subj_id lookup (simple; works only when trials are independent):
```stan
data { array[N_total] int subj_id; }
model {
  for (i in 1:N_total)
    target += bernoulli_lpmf(y[i] | f(theta[subj_id[i]], x[i]));
}
```

Strategy B — Contiguous block indexing (required for sequential models and `reduce_sum`):

```stan
data {
  array[J] int subj_start;
  array[J] int subj_end;
}
model {
  for (j in 1:J)
    for (t in subj_start[j]:subj_end[j])
      target += bernoulli_lpmf(y[t] | f(theta[j], x[t]));
}
```

Build the index in R:

```r
subj_start <- as.integer(tapply(seq_along(subj_id), subj_id, min))
subj_end   <- as.integer(tapply(seq_along(subj_id), subj_id, max))
```

Use Strategy B whenever:
- Trial \(t\) depends on state accumulated from trials \(1, \ldots, t-1\) (memory, belief updating, RL)
- You intend to use `reduce_sum` for within-chain parallelism (see Section B.14)
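Strategy B assumes the trial data are sorted by subject, so that each subject's rows tile the range \(1, \ldots, N_\text{total}\) with no gaps. A standalone sketch of the index construction and its contiguity invariant (in Python for checkability; the book builds the same index with `tapply` in R):

```python
# Toy trial data, already sorted by subject (a precondition of Strategy B)
subj_id = [1, 1, 1, 2, 2, 3, 3, 3, 3]

starts, ends = {}, {}
for i, j in enumerate(subj_id, start=1):  # 1-based indices, as in Stan
    starts.setdefault(j, i)               # first row seen for subject j
    ends[j] = i                           # last row seen so far

assert [starts[j] for j in sorted(starts)] == [1, 4, 6]
assert [ends[j] for j in sorted(ends)] == [3, 5, 9]
# each block starts right after the previous one ends — the contiguity
# that sequential models and reduce_sum slicing rely on
assert all(starts[j] == ends[j - 1] + 1 for j in sorted(starts) if j > 1)
```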
B.11 11. Pathfinder Warm-Start Initialisation
First benchmarked: Chapter 5 (Chapter 4)
Timing comparison for multilevel models: Chapter 7 (Chapter 6)
Three-way benchmark: Chapter 12 (Chapter 11)
Stan’s default initialisation draws chain starts from Uniform(−2, 2) in unconstrained space — far from the posterior on hierarchical models. Pathfinder runs L-BFGS from each start, fits Gaussian approximations along the path, and returns approximate posterior draws. Passing these as init to $sample() lets HMC skip most of the warm-up trajectory.
```r
# Step 1 — fast variational approximation
pf <- mod$pathfinder(data = stan_data, num_paths = 4, draws = 1000, seed = 123)

# Step 2 — draw one starting point per chain
init_pf <- lapply(1:4, function(i) as.list(pf$draws()[i, , drop = FALSE]))

# Step 3 — full MCMC from warm starts
fit <- mod$sample(data = stan_data, init = init_pf,
                  chains = 4, iter_warmup = 500, iter_sampling = 1000)
```

Pathfinder gives the largest benefit on hierarchical models and any model with correlated population parameters. It saves 20–50% of total wall time by shortening the warmup phase. On simple, well-conditioned models the warm-up is already fast — the Pathfinder overhead may outweigh the saving.
Pathfinder output is not a gold-standard posterior. Always run full MCMC and check diagnostics.
B.12 12. Marginalizing Discrete Parameters with log_mix
First introduced: Chapter 9 (Chapter 8, single and multilevel mixture models)
HMC requires continuous, differentiable parameters. Discrete latent variables (component indicators, regime switches) must be analytically marginalized out of the likelihood before sampling.
```stan
// two-component mixture
real lp1 = bernoulli_logit_lpmf(y | alpha);
real lp2 = bernoulli_logit_lpmf(y | nu);
// log( pi*exp(lp1) + (1-pi)*exp(lp2) )
target += log_mix(pi, lp1, lp2);
```

For \(K > 2\) components use `log_sum_exp` directly:

```stan
vector[K] lps;
for (k in 1:K)
  lps[k] = log(pi[k]) + component_lpmf(y | theta[k]);
target += log_sum_exp(lps);
```

```stan
// WRONG — HMC cannot compute gradients of discrete parameters
int z ~ categorical(pi);
target += bernoulli_lpmf(y | theta[z]);
```

Stan rejects this outright: parameters cannot be declared as integers, so the model will not even compile, and HMC could not compute gradients for `z` in any case. Always marginalise.
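Both forms compute the same quantity: `log_mix(pi, lp1, lp2)` is exactly the two-component `log_sum_exp` over \(\log\pi + \text{lp}_1\) and \(\log(1-\pi) + \text{lp}_2\), evaluated stably in log space. A standalone numerical check (sketched in Python; the helper names are mine, not Stan's):

```python
import math

def log_sum_exp(xs):
    # stable log(sum(exp(x))) via the max-shift trick
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_mix(pi, lp1, lp2):
    return log_sum_exp([math.log(pi) + lp1, math.log(1 - pi) + lp2])

# agrees with the naive formula where exp() is safe...
pi, lp1, lp2 = 0.3, -4.0, -6.0
naive = math.log(pi * math.exp(lp1) + (1 - pi) * math.exp(lp2))
assert abs(log_mix(pi, lp1, lp2) - naive) < 1e-12

# ...and stays finite where naive exp() would underflow to 0
assert math.isfinite(log_mix(pi, -800.0, -900.0))
```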
B.13 13. Sequential State Unrolling in transformed parameters
First introduced: Chapter 10 (Chapter 9, weighted and proportional agent models)
When the likelihood depends on a latent state that evolves trial-by-trial (belief, RL value, running weight), unroll the recursion in transformed parameters so the gradient flows back through the entire trajectory.
```stan
transformed parameters {
  vector[N] predicted_choice;
  {
    // local scope: scratch variable allocated and freed each leapfrog step
    real belief = 0.5;
    for (t in 1:N) {
      predicted_choice[t] = inv_logit(weight * (belief - 0.5));
      belief += alpha * (feedback[t] - belief);
    }
  }
}
model {
  target += bernoulli_lpmf(y | predicted_choice);
}
```

Why `transformed parameters` and not `model`?
State that enters the likelihood needs its gradient tracked by autodiff. Declare only the output vector (predicted_choice) as a transformed parameters variable; keep loop-internal scratch variables inside a {} local block to avoid placing them on the saved-draw tape.
Exception: If the loop is independent across subjects, move it into a partial_sum_lp function and use reduce_sum for across-subject parallelism (see Section B.14).
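The recursion itself is simple; what matters is that every predicted choice depends on the belief accumulated from all earlier feedback, which is why the full trajectory must sit on the autodiff path. A standalone sketch of the unrolled state update (in Python for checkability; same recursion as the Stan block above):

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def predicted_choices(feedback, alpha, weight):
    # belief starts at 0.5 and is updated by the prediction error on
    # each trial; prediction at trial t uses the belief from trials < t
    belief, out = 0.5, []
    for f in feedback:
        out.append(inv_logit(weight * (belief - 0.5)))
        belief += alpha * (f - belief)
    return out

preds = predicted_choices([1, 1, 0, 1], alpha=0.2, weight=3.0)
assert preds[0] == 0.5                     # inv_logit(0): no feedback yet
assert all(0.0 < p < 1.0 for p in preds)   # valid choice probabilities
```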
B.14 14. Intra-Chain Parallelism with reduce_sum
First introduced: Chapter 12 (Chapter 11, multilevel decay GCM)
reduce_sum exploits conditional independence across subjects: given population parameters, subject \(j\)’s trial loop shares no mutable state with subject \(k\)’s loop. The subject dimension can therefore be distributed across CPU threads within a single chain, giving near-linear speedup up to the number of subjects.
Step 1 — Write a partial_sum_lp function in the functions block:
```stan
functions {
  real partial_sum_lp(
    array[] int subj_slice, int start, int end,
    vector log_c,
    array[] int subj_start_g, array[] int subj_end_g,
    array[] int y, array[,] real x
  ) {
    real lp = 0.0;
    for (idx in 1:size(subj_slice)) {
      int j = subj_slice[idx];
      for (t in subj_start_g[j]:subj_end_g[j])
        lp += bernoulli_lpmf(y[t] | f(exp(log_c[j]), x[t]));
    }
    return lp;
  }
}
```

Step 2 — Replace the subject loop with `reduce_sum` in the model block:

```stan
model {
  // subject_ids = {1, 2, ..., J}
  target += reduce_sum(partial_sum_lp, subject_ids, 1,
                       log_c, subj_start, subj_end, y, x);
}
```

Step 3 — Compile with threading support and sample:
```r
mod_par <- cmdstan_model(stan_file,
                         cpp_options = list(stan_threads = TRUE))
n_threads <- max(1L, parallel::detectCores() %/% 4L)
fit_par <- mod_par$sample(
  data = stan_data,
  chains = 4,
  parallel_chains = 4,
  threads_per_chain = n_threads,
  iter_warmup = 500,
  iter_sampling = 1000
)
```

When is `reduce_sum` worth it?
| Condition | Effect |
|---|---|
| ≥ 20 subjects with expensive per-subject loops | Large speedup (near-linear up to J) |
| Few subjects or very cheap loops | Threading overhead may dominate |
| Within-subject sequential dependency | Still valid — parallelises across subjects, not within |
| Contiguous `subj_start`/`subj_end` indexing | Required (see Section B.10) |
The grainsize argument (the third argument to `reduce_sum`; the `1` in the call above) controls the minimum slice size. Use 1 as a starting point; increase it if the per-subject work is small.
Why can’t `reduce_sum` parallelise within a subject?
In the single-subject GCM (Chapter 11), trial \(t\) depends on the memory state accumulated from trials \(1, \ldots, t-1\). This sequential dependency is within a subject, so reduce_sum cannot help there. Parallelism across subjects is always safe because subjects are conditionally independent.
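Correctness rests on a simple invariant: because subjects are conditionally independent, the total log likelihood is a plain sum of per-subject contributions, and a sum gives the same answer no matter how its terms are partitioned into slices. A standalone sketch of that invariant (in Python; toy numbers):

```python
# Toy per-subject log-likelihood contributions
subject_lp = [-12.3, -8.1, -15.7, -9.9, -11.2]

def partial_sum(slice_ids):
    # what one reduce_sum slice contributes for its subjects
    return sum(subject_lp[j] for j in slice_ids)

total = sum(subject_lp)

# two different slicings (two thread schedules), same target increment
assert abs(partial_sum([0, 1]) + partial_sum([2, 3, 4]) - total) < 1e-12
assert abs(partial_sum([0]) + partial_sum([1, 2])
           + partial_sum([3, 4]) - total) < 1e-12
```

Within a subject no such partition exists: trial \(t\)'s contribution depends on the state left by trial \(t-1\), so those terms cannot be computed independently.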
B.15 15. Combining Techniques — Compound Speedups
Three-way timing benchmark: Chapter 12 (Chapter 11)
The techniques above stack multiplicatively on hierarchical models:
| Technique | Typical saving | When to apply |
|---|---|---|
| NCP instead of CP | 5–100× fewer divergences | Any multilevel model — always prefer NCP |
| `transformed data` precompute | 10–100× fewer flops | Static quantities derived from data only |
| `run_diagnostics = 0` in SBC | ~2× faster generated quantities | All SBC backends |
| Pathfinder warm-start | 20–50% less wall time | Hierarchical models; correlated posteriors |
| `reduce_sum` (10 subjects, 4-core laptop) | 2–3× within-chain | ≥ 20 subjects, expensive loops |
| Pathfinder + `reduce_sum` together | 3–5× vs. serial baseline | Large multilevel models, multi-core hardware |
Benchmark template (used in Chapters 6 and 11):
```r
time_serial <- system.time(
  fit_serial <- mod$sample(data = stan_data, ...)
)[["elapsed"]]

pf <- mod$pathfinder(data = stan_data, num_paths = 4, draws = 1000)
init_pf <- lapply(1:4, function(i) as.list(pf$draws()[i, , drop = FALSE]))
time_serial_pf <- system.time(
  fit_pf <- mod$sample(data = stan_data, init = init_pf, ...)
)[["elapsed"]]

n_threads <- max(1L, parallel::detectCores() %/% 4L)
time_par <- system.time(
  fit_par <- mod_par$sample(
    data = stan_data, init = init_pf,
    threads_per_chain = n_threads, ...)
)[["elapsed"]]

tibble(
  Strategy = c("Serial (baseline)", "Serial + Pathfinder", "Parallel + Pathfinder"),
  Time_sec = c(time_serial, time_serial_pf, time_par),
  Speedup  = round(time_serial / c(time_serial, time_serial_pf, time_par), 2)
)
```

B.16 Quick Reference
| Problem observed | Technique | Section |
|---|---|---|
| Redundant computation during leapfrog | Block placement → `transformed data` | §1 |
| Quantity derived from data only, easier to compute in R | Pre-compute in R | §2 |
| Many IID trials, one parameter per session | Sufficient statistics → `binomial_lpmf` | §3 |
| Trial loop over identical distribution | Vectorize the likelihood call | §4 |
| Changing priors requires recompiling | Priors as data | §5 |
| SBC / LOO is slow | `run_diagnostics` flag | §6 |
| Want PPC/LOO without re-running MCMC | `$generate_quantities()` post-hoc | §7 |
| Divergences / low E-BFMI in multilevel model | Non-centered parameterization (NCP) | §8 |
| Simplex + small kappa → funnel geometry | Logit-normal NCP for simplex | §8 |
| Model has correlation matrix \(\Omega\) | Cholesky factorization (`lkj_corr_cholesky`) | §9 |
| Sequential model with per-subject trial loops | Contiguous `subj_start`/`subj_end` indexing | §10 |
| Slow warm-up on complex / hierarchical model | Pathfinder warm-start | §11 |
| Discrete latent variables (indicators, regimes) | Marginalise with `log_mix` | §12 |
| Trial-to-trial latent state in likelihood | Sequential unrolling in `transformed parameters` | §13 |
| Many subjects, multi-core machine | `reduce_sum` intra-chain parallelism | §14 |