Appendix B — Stan Optimization Cheatsheet

This appendix collects every Stan programming technique introduced across Chapters 4–11, with minimal code examples, cost estimates, and guidance on when to apply each one. The techniques are ordered by first appearance in the book.

Quick navigation

# Technique Chapter
1 Block placement Ch 4
2 Pre-compute in R Ch 4, 6
3 Sufficient statistics Ch 4
4 Vectorize the likelihood Ch 4, 6
5 Priors as data Ch 5
6 Conditional diagnostics (run_diagnostics) Ch 5
7 Post-hoc generated quantities Ch 5
8 Non-centered parameterization Ch 6
9 Cholesky factorization Ch 6
10 Contiguous subject indexing Ch 6
11 Pathfinder warm-start Ch 4, 6, 11
12 Marginalizing discrete parameters Ch 8
13 Sequential state unrolling Ch 9
14 reduce_sum parallelism Ch 11
15 Combining techniques Ch 6, 11
16 Sampler and compiler options App B

B.1 1. Block Placement — Run It Once, Not Thousands of Times

First introduced: Section 5.7 (Chapter 4, Memory Model 1)

HMC runs thousands of leapfrog steps during warm-up. Every line of code in transformed parameters or model executes at each step; every line in transformed data executes exactly once.

Block Runs Put here
transformed data Once, before sampling Quantities derived only from data — precomputed distances, log-transforms of fixed inputs, cumulative sums
transformed parameters Every leapfrog step (saved) Deterministic functions of parameters whose gradients must propagate — NCP reparameterisations, recursive state, predicted values used in the likelihood
model (local vars) Every leapfrog step (not saved) Intermediate scratch quantities not needed outside the loop
generated quantities Once per saved draw Posterior predictive samples, log_lik, prior draws — anything expensive that is not needed for sampling
TipCost of misplacement

A typical fit runs 4 chains × 1 000 warmup iterations × ~6 leapfrog steps = ~24 000 leapfrog evaluations during warm-up. Moving 500 precomputed memory values from transformed parameters to transformed data saves ~12 million floating-point operations per fit.

transformed data {
  // computed once — running average of opponent choices
  vector[N] memory;
  memory[1] = 0.5;
  for (t in 2:N)
    memory[t] = (1.0 / t) * sum(other[1:(t-1)]);
}

B.2 2. Pre-Compute in R — Don’t Let Stan See the Raw Computation

First introduced: Chapter 7 (Chapter 6, multilevel memory agent)

transformed data runs once before sampling — but the computation still happens inside Stan’s runtime. If a quantity depends only on observed data and will never change between fits, computing it in R and passing it as a data field is even cheaper: Stan never processes the raw inputs at all.

# R: compute the running opponent rate once (O(N) vectorised R operation)
d <- d |>
  group_by(subject) |>
  mutate(opp_rate_prev = lag(cummean(other_choice), default = 0.5)) |>
  ungroup()

stan_data <- list(
  N_total      = nrow(d),
  y            = d$choice,
  opp_rate_prev = d$opp_rate_prev   # passed as a plain vector
  # ...
)
// Stan only receives the pre-computed vector — no loop needed
data {
  vector<lower=0,upper=1>[N_total] opp_rate_prev;
}
model {
  target += bernoulli_logit_lpmf(y | mu + beta * logit(opp_rate_prev));
}

Decision rule: If the computation involves only observed data and uses fast R/dplyr operations, do it in R. Reserve transformed data for cases where the computation is cleaner to express in Stan’s syntax (e.g., distance matrices over stimuli).


B.3 3. Sufficient Statistics — Aggregate IID Trials

First introduced: Chapter 5 (Chapter 4, simple bias model)

When all trials are independently and identically distributed under the same parameter, the total number of successes is a sufficient statistic — it carries all the information in the data about \(\theta\). A single Binomial call replaces \(N\) Bernoulli calls.

// data block: pass the aggregate, not the full array
data {
  int<lower=0> N;
  int<lower=0, upper=N> successes;  // sum(h)
}
model {
  target += binomial_logit_lpmf(successes | N, theta_logit);
}
stan_data <- list(N = 120, successes = sum(h))
TipSpeedup

One Binomial evaluation vs. 120 Bernoulli evaluations — every leapfrog step, every chain. For a single-subject model with 4 chains × 24 000 leapfrog evaluations the difference is small in absolute time, but the principle scales: for 10 000 IID trials the saving is real.

When it applies: single-subject, single-session models where \(\theta\) is constant across all \(N\) trials (Chapters 4–5 bias models).

When it does not apply: memory models, multilevel models (per-subject \(\theta_j\)), any model where the trial-level likelihood depends on a trial-varying predictor.


B.4 4. Vectorize the Likelihood Call

First introduced: Chapter 5 (Chapter 4); used throughout from Chapter 5 onward

Stan evaluates each distribution call as a node in the autodiff graph. A vectorized call over a length-\(N\) vector creates one node; a loop over \(N\) trials creates \(N\) nodes. Smaller graph = faster gradient computation and better cache use.

// Slower: N nodes in the autodiff graph
for (i in 1:N)
  target += bernoulli_logit_lpmf(h[i] | theta_logit);

// Faster: one node — mathematically identical
target += bernoulli_logit_lpmf(h | theta_logit);

In multilevel models, scatter-indexing by subject makes the entire trial pool vectorizable:

// N_total trials, J subjects — one call instead of N_total calls
target += bernoulli_logit_lpmf(y | theta_logit[subj_id]);
NoteWhen vectorization is not available

log_mix() for mixture likelihoods (Chapter 8) operates on one observation at a time because each call mixes two observation-specific log-probabilities. There is no vectorized log_mix — the per-observation loop is correct and unavoidable there.


B.5 5. Priors as Data — Avoid Recompilation

First introduced: Chapter 6 (Chapter 5)

Pass prior hyperparameters through the data block instead of hard-coding them. Stan recompiles whenever the code changes, never for new data. This lets you run prior sensitivity analyses and SBC sweeps without recompiling.

data {
  int<lower=0> N;
  array[N] int y;
  real             prior_mu_mean;
  real<lower=0>    prior_mu_sigma;
}
model {
  mu ~ normal(prior_mu_mean, prior_mu_sigma);
  y  ~ bernoulli_logit(mu);
}
data <- list(N = N, y = y,
             prior_mu_mean = 0, prior_mu_sigma = 1.5)

B.6 6. The run_diagnostics Flag — Conditional Generated Quantities

First introduced: Chapter 6 (Chapter 5)

Computing log_lik (N values per draw) and y_rep in generated quantities can double run-time. Gate them behind an integer flag so SBC backends can turn them off.

data {
  int<lower=0, upper=1> run_diagnostics;
}
generated quantities {
  vector[N] log_lik = rep_vector(0.0, N);
  array[N] int y_rep;
  if (run_diagnostics) {
    for (i in 1:N) {
      log_lik[i] = bernoulli_logit_lpmf(y[i] | mu);
      y_rep[i]   = bernoulli_logit_rng(mu);
    }
  }
}
# Full fit — diagnostics on
data <- list(..., run_diagnostics = 1L)

# SBC backend — diagnostics off, ~2× faster generated quantities
sbc_data <- modifyList(data, list(run_diagnostics = 0L))
TipWhen does it matter most?

For 1 000 trials, 4 chains, 1 000 sampling draws, run_diagnostics = 1 generates and saves 4 million log_lik values. During SBC (which only needs the parameters), setting run_diagnostics = 0 eliminates that cost entirely.


B.7 7. Post-Hoc Generated Quantities with $generate_quantities()

First introduced: Chapter 6 (Chapter 5)

The run_diagnostics flag lets you skip log_lik and y_rep during SBC. A complementary pattern avoids the trade-off entirely: fit once with diagnostics off, then replay the saved parameter draws through the generated quantities block on demand, without re-running MCMC.

# Fast fit — no log_lik or y_rep during sampling
fit_fast <- mod$sample(data = list(..., run_diagnostics = 0L), ...)

# Later: add log_lik and y_rep from the saved draws
fit_gq <- mod$generate_quantities(
  fitted_params = fit_fast,
  data          = list(..., run_diagnostics = 1L)
)

log_lik <- fit_gq$draws("log_lik")
y_rep   <- fit_gq$draws("y_rep")

$generate_quantities() runs only the generated quantities block — the sampler is never touched. Key consequences:

  • LOO and PPC can be deferred until you actually need them.
  • A fit saved with readRDS() from a previous session can be replayed: pass the loaded object as fitted_params.
  • SBC backends always run with run_diagnostics = 0; call $generate_quantities() only on the small subset of fits you want to inspect in detail.

The compiled model object must be the same for both the original fit and the $generate_quantities() call.


B.8 8. Non-Centered Parameterization (NCP)

First introduced: Chapter 7 (Chapter 6, multilevel biased agent) Extended to simplex parameters: Chapter 14 (Chapter 12, multilevel decay GCM)

Centered parameterizations create funnel geometries when the group-level variance (sigma) is small: individual-level parameters collapse toward the mean, forming a narrow neck that HMC cannot traverse efficiently, producing divergences, low E-BFMI, and max_treedepth warnings.

Centered (problematic):

parameters {
  real mu;
  real<lower=0> sigma;
  vector[J] theta;
}
model {
  theta ~ normal(mu, sigma);
}

 

Non-Centered (preferred):

parameters {
  real mu;
  real<lower=0> sigma;
  vector[J] z;          // N(0,1) offset
}
transformed parameters {
  vector[J] theta = mu + sigma * z;
}
model {
  z ~ std_normal();
}

B.8.1 Logit-Normal NCP for Simplex Parameters

For attention weights or mixing proportions modeled as a simplex, avoid the Dirichlet + kappa parameterization. Small kappa values produce nearly-flat Dirichlet distributions, creating the same funnel that centered normal models do. For a 2-element simplex use a logit-normal NCP instead:

parameters {
  real          pop_logit_w_mean;
  real<lower=0> pop_logit_w_sd;
  vector[J]     z_w;
}
transformed parameters {
  array[J] vector[2] w;
  for (j in 1:J) {
    real lw  = pop_logit_w_mean + pop_logit_w_sd * z_w[j];
    w[j][1]  = inv_logit(lw);
    w[j][2]  = 1.0 - w[j][1];
  }
}
model {
  z_w ~ std_normal();
}
WarningThe cost of ignoring NCP

The centered Dirichlet + kappa ~ exponential(0.1) parameterization in Chapter 11’s multilevel decay GCM produced 4 000 max_treedepth warnings per run and failed to complete 10 iterations overnight at adapt_delta = 0.99, max_treedepth = 20. Switching to logit-normal NCP eliminated the funnel geometry entirely.

TipWhen centered parameterization is preferred

NCP is not always the right choice. When the data are highly informative relative to the prior, the centered parameterization can be more efficient. In that regime, the group-level variance \(\sigma\) is well-constrained by the data, so the individual-level parameters \(\theta_j\) are tightly funneled by the likelihood rather than the prior — and HMC can navigate the centered geometry directly without the stretched \(z\)-offsets that NCP introduces.

A practical heuristic (the park rule): prefer NCP when the per-group sample size is small relative to the between-group variance (data are weak); prefer CP when data are plentiful and the posterior for \(\sigma\) is far from zero. When in doubt, fit both and compare diagnostics.

See Vehtari et al. — the park rule for a formal treatment with diagnostics to guide the choice.


B.9 9. Cholesky Factorization for Correlation Matrices

First introduced: Chapter 7 (Chapter 6, multilevel memory agent with correlated parameters)

Whenever a model estimates a correlation matrix \(\Omega\) over individual-level parameters, use the Cholesky-factorized form rather than sampling \(\Omega\) directly.

Parameterization Leapfrog cost Why prefer/avoid
corr_matrix[K] Omega + lkj_corr_lpdf \(O(K^3)\) matrix operations + positive-definiteness check Expensive; unconstrained transform requires matrix square root
cholesky_factor_corr[K] L_Omega + lkj_corr_cholesky_lpdf \(O(K^2)\) Lies directly on its natural manifold; no matrix inversion needed
// Avoid
parameters { corr_matrix[2] Omega; }
model      { Omega ~ lkj_corr(2.0); }

// Prefer
parameters { cholesky_factor_corr[2] L_Omega; }
model      { L_Omega ~ lkj_corr_cholesky(2.0); }

transformed parameters {
  // O(K^2) multiply — no square root
  matrix[2, J] deviations = diag_pre_multiply(sigma, L_Omega) * z;
}

generated quantities {
  // Reconstruct Omega once per saved draw, not every leapfrog step
  matrix[2, 2] Omega = multiply_lower_tri_self_transpose(L_Omega);
  real rho = Omega[1, 2];
}

Key point: diag_pre_multiply(sigma, L_Omega) * z produces individual-level NCP deviations directly from the Cholesky factor — the full correlation matrix \(\Omega\) is only recovered in generated quantities, running once per saved draw rather than thousands of times during warm-up.

For \(K = 2\) parameters the cost difference is modest; for \(K \geq 5\) (e.g., five per-subject RL parameters) it becomes substantial.


B.10 10. Contiguous Subject Indexing — subj_start / subj_end

First introduced: Chapter 7 (Chapter 6, multilevel memory model)

Two strategies for multi-subject trial data:

Strategy A — subj_id lookup (simple; works only when trials are independent):

data { array[N_total] int subj_id; }
model {
  for (i in 1:N_total)
    target += bernoulli_lpmf(y[i] | f(theta[subj_id[i]], x[i]));
}

Strategy B — Contiguous block indexing (required for sequential models and reduce_sum):

data {
  array[J] int subj_start;
  array[J] int subj_end;
}
model {
  for (j in 1:J)
    for (t in subj_start[j]:subj_end[j])
      target += bernoulli_lpmf(y[t] | f(theta[j], x[t]));
}

Build the index in R:

subj_start <- as.integer(tapply(seq_along(subj_id), subj_id, min))
subj_end   <- as.integer(tapply(seq_along(subj_id), subj_id, max))

Use Strategy B whenever:

  • Trial \(t\) depends on state accumulated from trials \(1, \ldots, t-1\) (memory, belief updating, RL)
  • You intend to use reduce_sum for within-chain parallelism (see Section B.14)

B.11 11. Pathfinder Warm-Start Initialisation

First benchmarked: Chapter 5 (Chapter 4) Timing comparison for multilevel models: Chapter 7 (Chapter 6) Three-way benchmark: Chapter 14 (Chapter 12)

Stan’s default initialisation draws chain starts from Uniform(−2, 2) in unconstrained space — far from the posterior on hierarchical models. Pathfinder runs L-BFGS from each start, fits Gaussian approximations along the path, and returns approximate posterior draws. Passing these as init to $sample() lets HMC skip most of the warm-up trajectory.

# Step 1 — fast variational approximation
pf <- mod$pathfinder(data = stan_data, num_paths = 4, draws = 1000, seed = 123)

# Step 2 — draw one starting point per chain
init_pf <- lapply(1:4, function(i) as.list(pf$draws()[i, , drop = FALSE]))

# Step 3 — full MCMC from warm starts
fit <- mod$sample(data = stan_data, init = init_pf,
                  chains = 4, iter_warmup = 500, iter_sampling = 1000)
TipWhen Pathfinder helps most

Pathfinder gives the largest benefit on hierarchical models and any model with correlated population parameters. It saves 20–50% of total wall time by shortening the warmup phase. On simple, well-conditioned models the warm-up is already fast — the Pathfinder overhead may outweigh the saving.

Pathfinder output is not a gold-standard posterior. Always run full MCMC and check diagnostics.

Fallback strategy: If Pathfinder fails (e.g., due to extreme initial gradients), provide an explicit init list containing the jittered prior medians for all parameters. This is almost always better than the default Uniform(-2, 2) start.


B.12 12. Marginalizing Discrete Parameters with log_mix

First introduced: Chapter 9 (Chapter 8, single and multilevel mixture models)

HMC requires continuous, differentiable parameters. Discrete latent variables (component indicators, regime switches) must be analytically marginalized out of the likelihood before sampling.

// two-component mixture
real lp1 = bernoulli_logit_lpmf(y | alpha);
real lp2 = bernoulli_logit_lpmf(y | nu);
// log( pi*exp(lp1) + (1-pi)*exp(lp2) )
target += log_mix(pi, lp1, lp2);

For \(K > 2\) components use log_sum_exp directly:

vector[K] lps;
for (k in 1:K)
  lps[k] = log(pi[k]) + component_lpmf(y | theta[k]);
target += log_sum_exp(lps);
WarningNever sample discrete indicators directly

HMC requires a continuous, differentiable log-posterior surface to compute gradients. A discrete latent variable — such as a cluster assignment z — creates a staircase: no gradient exists at its steps, so HMC cannot traverse it.

// WRONG — z is an integer; Stan cannot differentiate theta[z] with respect to z
int z ~ categorical(pi);
target += bernoulli_lpmf(y | theta[z]);

This compiles but fails at runtime (sampler errors or silently wrong draws).

Fix: marginalise z out analytically. Instead of sampling which component each observation came from, compute the log-probability of the observation summed across all components, weighted by the mixing proportions — exactly the log_sum_exp pattern above:

// CORRECT — z never appears; the sum is continuous and differentiable
vector[K] lps;
for (k in 1:K)
  lps[k] = log(pi[k]) + bernoulli_lpmf(y | theta[k]);
target += log_sum_exp(lps);

The gradient now flows cleanly through pi and theta, and HMC works as intended.


B.13 13. Sequential State Unrolling in transformed parameters

First introduced: Chapter 10 (Chapter 9, weighted and proportional agent models)

When the likelihood depends on a latent state that evolves trial-by-trial (belief, RL value, running weight), unroll the recursion in transformed parameters so the gradient flows back through the entire trajectory.

transformed parameters {
  vector[N] predicted_choice;
  {
    // local scope: scratch variable allocated and freed each leapfrog step
    real belief = 0.5;
    for (t in 1:N) {
      predicted_choice[t] = inv_logit(weight * (belief - 0.5));
      belief += alpha * (feedback[t] - belief);
    }
  }
}
model {
  target += bernoulli_lpmf(y | predicted_choice);
}
TipWhy transformed parameters and not model?

State that enters the likelihood needs its gradient tracked by autodiff. Declare only the output vector (predicted_choice) as a transformed parameters variable; keep loop-internal scratch variables inside a {} local block to avoid placing them on the saved-draw tape.

Exception: If the loop is independent across subjects, move it into a partial_sum_lp function and use reduce_sum for across-subject parallelism (see Section B.14).


B.14 14. Intra-Chain Parallelism with reduce_sum

First introduced: Chapter 14 (Chapter 12, multilevel decay GCM)

reduce_sum exploits conditional independence across subjects: given population parameters, subject \(j\)’s trial loop shares no mutable state with subject \(k\)’s loop. The subject dimension can therefore be distributed across CPU threads within a single chain, giving near-linear speedup up to the number of subjects.

Step 1 — Write a partial_sum_lp function in the functions block:

functions {
  real partial_sum_lp(
    array[] int subj_slice, int start, int end,
    vector log_c,
    array[] int subj_start_g, array[] int subj_end_g,
    array[] int y, array[,] real x
  ) {
    real lp = 0.0;
    for (idx in 1:size(subj_slice)) {
      int j = subj_slice[idx];
      for (t in subj_start_g[j]:subj_end_g[j])
        lp += bernoulli_lpmf(y[t] | f(exp(log_c[j]), x[t]));
    }
    return lp;
  }
}

Step 2 — Replace the subject loop with reduce_sum in the model block:

model {
  // subject_ids = {1, 2, ..., J}
  target += reduce_sum(partial_sum_lp, subject_ids, 1,
                       log_c, subj_start, subj_end, y, x);
}

Step 3 — Compile with threading support and sample:

mod_par <- cmdstan_model(stan_file,
                         cpp_options = list(stan_threads = TRUE))

n_threads <- max(1L, parallel::detectCores() %/% 4L)

fit_par <- mod_par$sample(
  data             = stan_data,
  chains           = 4,
  parallel_chains  = 4,
  threads_per_chain = n_threads,
  iter_warmup      = 500,
  iter_sampling    = 1000
)
TipWhen is reduce_sum worth it?
Condition Effect
≥ 20 subjects with expensive per-subject loops Large speedup (near-linear up to J)
Few subjects or very cheap loops Threading overhead may dominate
Within-subject sequential dependency Still valid — parallelises across subjects, not within
Contiguous subj_start/subj_end indexing Required (see Section B.10)

The grainsize argument (second integer in reduce_sum) controls the minimum slice size. Use 1 as a starting point; increase if the per-subject work is small.

NoteWithin-subject loops cannot be parallelised with reduce_sum

In the single-subject GCM (Chapter 12), trial \(t\) depends on the memory state accumulated from trials \(1, \ldots, t-1\). This sequential dependency is within a subject, so reduce_sum cannot help there. Parallelism across subjects is always safe because subjects are conditionally independent.


B.15 15. Combining Techniques — Compound Speedups

Three-way timing benchmark: Chapter 14 (Chapter 12)

The techniques above stack multiplicatively on hierarchical models:

Technique Typical saving When to apply
NCP instead of CP 5–100× fewer divergences Any multilevel model — always prefer NCP
transformed data precompute 10–100× fewer flops Static quantities derived from data only
run_diagnostics = 0 in SBC ~2× faster generated quantities All SBC backends
Pathfinder warm-start 20–50% less wall time Hierarchical models; correlated posteriors
reduce_sum (10 subjects, 4-core laptop) 2–3× within-chain ≥ 20 subjects, expensive loops
Pathfinder + reduce_sum together 3–5× vs. serial baseline Large multilevel models, multi-core hardware

Benchmark template (used in Chapters 6 and 11):

time_serial <- system.time(
  fit_serial <- mod$sample(data = stan_data, ...)
)[["elapsed"]]

pf       <- mod$pathfinder(data = stan_data, num_paths = 4, draws = 1000)
init_pf  <- lapply(1:4, function(i) as.list(pf$draws()[i, , drop = FALSE]))

time_serial_pf <- system.time(
  fit_pf <- mod$sample(data = stan_data, init = init_pf, ...)
)[["elapsed"]]

n_threads <- max(1L, parallel::detectCores() %/% 4L)
time_par  <- system.time(
  fit_par <- mod_par$sample(
    data = stan_data, init = init_pf,
    threads_per_chain = n_threads, ...)
)[["elapsed"]]

tibble(
  Strategy = c("Serial (baseline)", "Serial + Pathfinder", "Parallel + Pathfinder"),
  Time_sec = c(time_serial, time_serial_pf, time_par),
  Speedup  = round(time_serial / c(time_serial, time_serial_pf, time_par), 2)
)

B.16 16. Sampler and Compiler Options

Reference: Options for improving Stan sampling speed

The techniques in §1–15 reduce the work per leapfrog step. The options below change how the sampler and compiler behave around those steps. They are cheap to try and often give meaningful savings with no code changes.

B.16.1 Sampler options (passed to $sample())

Option Default Guidance
adapt_delta 0.80 Raise to 0.90–0.95 when divergences persist after NCP; higher values force smaller step sizes and more leapfrog steps per transition
max_treedepth 10 Raise to 12–15 only when max_treedepth warnings dominate; raising without fixing geometry just burns time
iter_warmup 1000 Lower to 500 on well-conditioned models; raise to 2000 when E-BFMI warnings persist across chains
iter_sampling 1000 Raise when posterior tails matter (e.g., rare-event predictions) or when \(\hat{R}\) instability is numerical rather than geometric
chains 4 Never run fewer than 4 — \(\hat{R}\) requires between-chain comparison
parallel_chains 1 Set to chains when fitting a single model; reduce when running many parallel fits
ImportantMCMC Diagnostic Standards

Following the “Bayesian Workflow” standards, a model is only considered successfully sampled if it passes these three gates:

  1. Zero Divergences: Any divergence indicates the sampler missed part of the posterior. Fix via NCP (Section B.8) or by raising adapt_delta.
  2. \(\hat{R} < 1.01\): Large \(\hat{R}\) means chains haven’t converged to the same target.
  3. Bulk and Tail ESS > 400: (at least 100 per chain). Lower ESS means the error in estimating the mean or the tails is too high for reliable inference.
  4. No E-BFMI warnings: Energy-Bayesian Fraction of Missing Information should be \(> 0.2\). Low E-BFMI indicates the sampler is struggling to explore the tails of the posterior, often due to heavy-tailed priors or poor parameterization.
fit <- mod$sample(
  data           = stan_data,
  chains         = 4,
  parallel_chains = 4,
  iter_warmup    = 1000,
  iter_sampling  = 1000,
  adapt_delta    = 0.90,     # raise from 0.80 if divergences persist
  max_treedepth  = 12        # raise from 10 if treedepth warnings dominate
)

B.16.2 Compiler optimizations

Stan models are compiled to C++. Enabling processor-specific optimizations costs nothing at runtime but can save 10–30% on algebraically intensive models (GCMs, ToM).

mod <- cmdstan_model(
  stan_file,
  cpp_options = list(
    stan_threads = TRUE,          # required for reduce_sum (§14)
    CXXFLAGS = "-O3 -march=native" # processor-specific vectorization
  )
)
Warning-march=native portability

Binaries compiled with -march=native use CPU instructions specific to the build machine. Do not share compiled Stan objects across machines with different processor families. Recompile on the target machine.

B.16.3 Output size

By default Stan saves 8-significant-figure draws. For large models with many generated quantities, reduce to 6 to halve .csv file sizes with negligible precision loss:

fit <- mod$sample(..., sig_figs = 6)

B.17 Quick Reference

Problem observed Technique Section
Redundant computation during leapfrog Block placement → transformed data §1
Quantity derived from data only, easier to compute in R Pre-compute in R §2
Many IID trials, one parameter per session Sufficient statistics → binomial_lpmf §3
Trial loop over identical distribution Vectorize the likelihood call §4
Changing priors requires recompiling Priors as data §5
SBC / LOO is slow run_diagnostics flag §6
Want PPC/LOO without re-running MCMC $generate_quantities() post-hoc §7
Divergences / low E-BFMI in multilevel model Non-centered parameterization (NCP) §8
Simplex + small kappa → funnel geometry Logit-normal NCP for simplex §8
Model has correlation matrix \(\Omega\) Cholesky factorization (lkj_corr_cholesky) §9
Sequential model with per-subject trial loops Contiguous subj_start/subj_end indexing §10
Slow warm-up on complex / hierarchical model Pathfinder warm-start §11
Discrete latent variables (indicators, regimes) Marginalise with log_mix §12
Trial-to-trial latent state in likelihood Sequential unrolling in transformed parameters §13
Many subjects, multi-core machine reduce_sum intra-chain parallelism §14
Divergences / treedepth / slow warm-up Sampler options (adapt_delta, max_treedepth) §16
Slow compilation or runtime on algebraic models Compiler flags (-O3 -march=native) §16
Large .csv output files Reduce sig_figs to 6 §16