Skip to content

R Source Reference Map

This document maps each JaxGAM implementation task to the specific R mgcv source files and functions that implement the equivalent logic.

Setup: Local mgcv Clone

Clone the CRAN GitHub mirror and set the MGCV_SOURCE environment variable:

git clone https://github.com/cran/mgcv.git <your-path>/mgcv
export MGCV_SOURCE=<your-path>/mgcv

Add the export to your shell profile (.zshrc, .bashrc, etc.) so it persists. Agents can run echo $MGCV_SOURCE to discover the clone location.

How to Use This File

When implementing a task, find it in the table below. Read the listed R functions before writing Python using the local clone. The R code is the ground truth for numerical behavior - edge cases, special handling, and algorithmic choices that aren't captured in the design doc.

Source layout:

$MGCV_SOURCE/R/       # R source files
$MGCV_SOURCE/src/     # C source files

For example, to read the smooth constructors:

$MGCV_SOURCE/R/smooth.r

When debugging a mismatch with R, the R function listed here is where to look. Run the R code in an R session with debug(function_name) to step through and compare intermediate values.

Reading R code tips for agents: - R uses 1-based indexing. Adjust when porting. - <- is assignment. <<- is assignment to parent scope. - .C() and .Call() invoke compiled C code in src/. The C implementations are the actual hot paths - the R wrappers often just do argument validation. - @ accesses S4 slots. $ accesses list elements or S3 components. - mgcv::: accesses unexported (internal) functions. These are often the ones we need most.


Source File Inventory

$MGCV_SOURCE/
├── R/                    # R source files
│   ├── gam.fit3.r        # PIRLS, working weights, convergence
│   ├── smooth.r          # All smooth constructors (basis + penalty)
│   ├── fast-REML.r       # REML/ML criterion, Newton optimizer
│   ├── mgcv.r            # Main file: gam(), predict.gam(), summary.gam(), gam.setup(), gam.side(), interpret.gam(), pen.edf(), testStat()
│   ├── bam.r             # bam() - NOT needed for v1.0
│   ├── gamm.r            # gamm() - NOT needed for v1.0
│   ├── plots.r           # plot.gam
│   ├── efam.r            # Extended families (nb(), Tweedie, etc.)
│   └── sparse.r          # Sparse matrix routines - NOT needed for v1.0
├── src/                  # C source files
│   ├── tprs.c            # TPRS eigendecomposition (the real implementation)
│   ├── mat.c             # Matrix operations (Cholesky, QR, etc.)
│   ├── gdi.c             # Generalized derivative information (REML)
│   ├── discrete.c        # Discretization for bam - NOT needed for v1.0
│   └── mgcv.c            # Package init, misc C utilities
└── man/                  # Documentation (useful for parameter descriptions)
    ├── gam.Rd
    ├── s.Rd
    └── ...

Task-to-Source Mapping

Phase 1: Foundation

R file Function What it does
base R stats package make.link() Constructs link with linkfun, linkinv, mu.eta, valideta
base R stats package Individual family functions (e.g., binomial()$linkfun) Per-family link implementations

Key detail: R's mu.eta is dμ/dη (derivative of inverse link), not dη/dμ. Make sure our naming matches.

Task 1.2 - Families

R file Function What it does
base R stats package gaussian() Returns family object with variance, dev.resids, aic, initialize, validmu
base R stats package binomial() Same structure. Note initialize handles proportion vs count input
base R stats package poisson() Note dev.resids uses 2 * wt * (y * log(ifelse(y==0, 1, y/mu)) - (y - mu)) - the ifelse for y=0 is critical
base R stats package Gamma() Note capital G. dev.resids uses -2 * wt * (log(ifelse(y==0, 1, y/mu)) - (y - mu)/mu)

Key detail: R's dev.resids returns per-observation deviance contributions (not squared residuals and not summed). The total deviance is sum(dev.resids(...)). Watch for the sign - deviance residuals are sign(y - mu) * sqrt(abs(dev.resids)).

Key detail: initialize expressions in R families use the eval(family$initialize) pattern with mustart being set in the calling environment. This is R-specific - in Python, just return a starting μ vector.

Task 1.3 - Penalty Matrices

R file Function What it does
R/smooth.r smooth.construct.*.smooth.spec() (each smooth type) Returns $S - list of penalty matrices
Key detail: Each smooth's $S is a list of penalty matrices (usually length 1, but tensor products have one per marginal). The $sp vector in the fitted model gives the corresponding smoothing parameters.

Task 1.4 - TPRS Basis and Penalty

R file Function What it does
R/smooth.r smooth.construct.tp.smooth.spec() START HERE. R-level TPRS constructor. Calls C code for the heavy lifting.
R/smooth.r Predict.matrix.tprs.smooth() Prediction matrix construction for new data
R/smooth.r null.space.dimension() Computes null space dimension for given d and m
src/tprs.c construct_tprs() THE REAL IMPLEMENTATION. Eigendecomposition, knot selection, basis truncation. Read this carefully - the R wrapper just packages arguments.
src/tprs.c tps_g() TPS semi-kernel evaluation η(r)
src/tprs.c tprs_setup() Knot selection via max-min distance

Key details: - The eigendecomposition is of the augmented matrix [E T; T' 0] where E is the TPS kernel matrix evaluated at knots and T is the polynomial null space basis. This is not a simple eigendecomposition of E alone. - ts (shrinkage) adds an extra penalty: the eigenvalues corresponding to the null space get a small penalty (1e-4 × max eigenvalue of the wiggly penalty). The exact scaling matters for matching R. - Knot selection: when n > k, mgcv selects k knots from the data using a greedy max-min distance algorithm (tprs_setup in src/tprs.c). The knot order affects the basis, so it must match.

Task 1.5 - Cubic Regression Splines

R file Function What it does
R/smooth.r smooth.construct.cr.smooth.spec() Natural cubic regression spline constructor
R/smooth.r smooth.construct.cs.smooth.spec() Shrinkage cubic (same as cr + null space penalty)
R/smooth.r smooth.construct.cc.smooth.spec() Cyclic cubic
R/smooth.r Predict.matrix.cr.smooth() Prediction matrix for cr
R/smooth.r place.knots() Quantile-based knot placement

Key detail: mgcv's cubic regression splines are NOT B-splines. They use a specific natural spline basis evaluated at the knots directly. The penalty is the integrated squared second derivative, computed analytically from the spline coefficients. place.knots() uses quantiles of unique data values, not raw quantiles - this matters when data has ties.

Task 1.6 - Tensor Products

R file Function What it does
R/smooth.r smooth.construct.tensor.smooth.spec() te() constructor
R/smooth.r tensor.prod.model.matrix() Row-wise Kronecker product of marginal bases
R/smooth.r tensor.prod.penalties() Penalty construction from marginal penalties
R/smooth.r smooth.construct.t2.smooth.spec() t2() - different penalty structure than te()

Key detail: te() penalties are I ⊗ S_1, S_2 ⊗ I, etc. - one penalty per marginal, each Kronecker-producted with identity matrices for the other marginals. ti() is te() with main effects removed via constraints. t2() uses a different penalty construction (sum of Kronecker products) - defer t2 to later if needed.

Task 1.7 - Formula Parser

R file Function What it does
R/mgcv.r interpret.gam() THE KEY FUNCTION. Parses formula, identifies smooth terms, separates parametric from smooth.
R/mgcv.r interpret.gam0() Helper that walks the formula tree
R/smooth.r s() Smooth term specification (captures arguments, doesn't build basis)
R/smooth.r te(), ti(), t2() Tensor product term specifications

Key detail: interpret.gam() returns a list with $pf (parametric formula), $smooth.spec (list of smooth specs), and $response (response name). Each smooth spec is essentially our SmoothSpec - it holds the variable names, basis type, dimension, and options, but doesn't construct the basis yet. That happens later in gam.setup().

Task 1.8 - Factor-By Smooths

R file Function What it does
R/smooth.r smooth.construct() (search for by.var) The by variable handling in smooth construction
R/mgcv.r gam.setup() (search for by) How by variables are resolved during model setup

Key detail: When by is a factor, mgcv creates one smooth per level and sets $by.level on each. The design matrix is block-diagonal - row i has nonzeros only in the block for its factor level. The penalty is duplicated per level (each gets its own λ). When by is numeric, the basis matrix is simply multiplied column-wise by the numeric variable.

Key detail: mgcv detects factor vs numeric by using is.factor(). Integers are NOT treated as factors. Character vectors are NOT automatically converted. Only explicit R factors trigger the factor-by path.

Task 1.9 - Identifiability Constraints

R file Function What it does
R/mgcv.r gam.side() THE KEY FUNCTION. Identifies which smooths need sum-to-zero constraints to resolve identifiability.
R/mgcv.r gam.setup() (search for C matrix) How constraints are absorbed into the basis
R/smooth.r smooth.construct() (search for $C) Constraint matrix on each smooth

Key detail: gam.side() works by checking whether any smooth's column space overlaps with the intercept or another smooth's column space. When overlap is detected, it imposes a sum-to-zero constraint: 1' X_s β_s = 0. This is absorbed into the basis via QR decomposition, reducing the column count by 1. The constraint matrix C is stored on the smooth object and used to undo the absorption for prediction.

Key detail: The order of constraint application matters. gam.side() processes smooths in a specific order. If constraints are applied in a different order, the basis differs and results won't match R.

Task 1.10 - Design Matrix Assembly

R file Function What it does
R/mgcv.r gam.setup() THE MASTER ASSEMBLY FUNCTION. Constructs all smooth bases, applies constraints, assembles full X, builds penalty list.
R/mgcv.r gam() (first ~100 lines) Orchestration: formula parsing → gam.setup → fitting

Key detail: gam.setup() returns a list with $X (full model matrix), $S (list of penalty matrices), $off (column offsets per smooth in X), $sp (initial smoothing parameters), and constraint info. This is the Phase 1 output in our architecture.


Phase 2: Fitting

Task 2.3 - PIRLS Inner Loop

R file Function What it does
R/gam.fit3.r gam.fit3() THE INNER LOOP. PIRLS for standard families. ~500 lines.
R/gam.fit3.r gam.fit5() PIRLS for extended families (not needed for v1.0)
src/gdi.c gdi1(), gdi2() C implementations of the PIRLS numerics (XtWX, etc.)

Key details: - Search for step.half in gam.fit3() to find the step-halving logic. It's ~30 lines. The halving limit is 100 for the inner PIRLS loop; the outer Newton uses maxHalf = 30. - Convergence criterion: abs(pdev - old.pdev) < control$epsilon * (abs(scale) + abs(pdev)) where control$epsilon defaults to 1e-7. Note this uses penalized deviance and a multiplicative tolerance scaled by the dispersion parameter. - Working weights: computed as w * mu.eta.val^2 / variance where w is prior weights, mu.eta.val is dμ/dη, and variance is V(μ). This is W in our notation. - Working response: z <- (eta - offset) + (y - mu) / mu.eta.val. Note the offset handling. - Initialization: eval(family$initialize) sets mustart. Then eta <- family$linkfun(mustart).

Task 2.4 - REML Criterion

R file Function What it does
R/fast-REML.r Sl.initial.repara() Reparameterizes penalties for stable REML computation
R/fast-REML.r Sl.addS() Adds penalties together with smoothing parameter weighting
R/fast-REML.r fast.REML.fit() Newton optimizer for REML (ALSO contains the criterion computation)
src/gdi.c gdi1(), gdi2() C implementations of REML derivatives (gradient and Hessian of V w.r.t. log λ)
R/gam.fit3.r gam.fit3() (search for reml or ldetS) Where REML is evaluated during fitting

Key details: - The REML criterion in mgcv is: V = deviance + log|XtWX + S_λ| - log|S_λ*| where S_λ* is the penalty projected onto its range. The ldetS (log determinant of S_λ) computation is subtle - it only includes the non-zero eigenvalues. - gdi1()/gdi2() in C compute the derivatives of V w.r.t. log(sp) analytically. This is what we replace with jax.grad. But reading the analytical derivatives helps validate our AD output. - Sl.initial.repara() reparameterizes the problem so each penalty has a nice form. This is an optimization for the Newton step - we may or may not need it.

Task 2.5 - Newton Outer Optimizer

R file Function What it does
R/fast-REML.r fast.REML.fit() The outer Newton loop. Search for while to find the iteration.
R/gam.fit3.r gam.fit3() (search for while) The outer-outer loop that alternates between PIRLS and λ updates

Key detail: mgcv's outer iteration in gam.fit3 is NOT a clean Newton loop on REML. It interleaves PIRLS convergence checks with smoothing parameter updates. The exact interleaving matters for convergence on difficult models. Read the while loop structure in gam.fit3 carefully - it has PIRLS inner iterations nested inside λ update outer iterations, with a specific convergence protocol.


Phase 3: Post-Estimation

Task 3.1 - Prediction

R file Function What it does
R/mgcv.r predict.gam() Prediction from fitted GAM
R/smooth.r Predict.matrix.*() Per-smooth-type prediction matrix construction
R/smooth.r PredictMat() Dispatches to the right Predict.matrix method

Key detail: predict.gam() with type="lpmatrix" returns the full prediction matrix (our X_p). This is invaluable for debugging: if X_p matches, any prediction difference is in the coefficients, not the basis construction.

Task 3.2 - Summary and EDF

R file Function What it does
R/mgcv.r summary.gam() Summary output including smooth significance tests
R/mgcv.r pen.edf() EDF computation from hat matrix trace
R/mgcv.r testStat() Wood's (2013) test statistic for smooth significance

Key detail: EDF is computed as trace(F) where F = X (XtWX + S_λ)^{-1} XtW. Per-smooth EDF is the trace of the corresponding diagonal block. mgcv computes this from the Bayesian covariance: edf = rowSums(Vp * crossprod(X)) (which is diag(Vp @ X.T @ X) but computed without forming the full product).

Task 3.3 - Plotting

R file Function What it does
R/plots.r plot.gam() The main plotting function
R/plots.r plot.mgcv.smooth() Per-smooth-type plotting

Key detail: mgcv generates plot data by constructing a prediction grid, getting predictions ± SE, and plotting. For 1D smooths: 100 evenly-spaced points in the covariate range (n=100). For 2D: 40×40 grid (n2=40). The SE bands are ±2*SE (not ±1.96).


Accessing the Source

The mgcv source is cloned locally (see Setup above). Agents should read files directly from $MGCV_SOURCE:

# Discover the clone location
echo $MGCV_SOURCE

# Read a specific R file
$MGCV_SOURCE/R/smooth.r

# Read a C source file
$MGCV_SOURCE/src/tprs.c

For large files like smooth.r (~8000 lines), search for the specific function name rather than reading the entire file.

Pinning a version: The master branch tracks the latest CRAN release. To pin to a specific version for reproducibility, checkout a tagged release:

cd $MGCV_SOURCE && git checkout 1.9-1

The CRAN mirror is GPL-2 (Simon Wood). Attribution is required if redistributing; reading for reference is fine.


Quick Lookup: "I'm Debugging X, Where Do I Look?"

Negative Binomial (Extended Family)

R file Function What it does
R/efam.r nb() lines 161-310 NB extended family: deviance, Dd derivatives, ls (saturated loglik), aic, theta state
R/efam.r estimate.theta() lines 5-96 Theta estimation via Fellner-Schall (not used by jaxgam; we use joint Newton)
R/gam.fit4.r gam.fit4() lines 240-548 Extended PIRLS: observed weights from Dd, step-halving on penalized deviance
R/gam.fit4.r dDeta() lines 4-77 Deviance derivative utility: d²D/dη², d²D/(dηdθ), etc.
src/gdi.c ift2() lines 1368-1462 IFT for theta: dβ/dθ = -H⁻¹ X' d²D/(dηdθ)

Key details: - R's nb()$Dd() provides deviance derivatives up to 4th order w.r.t. mu plus mixed partials w.r.t. theta. jaxgam computes these via jax.grad through deviance_fn. - R uses gam.fit4 (observed weights w = 0.5 * d²D/dη²) for extended families, not gam.fit3 (Fisher weights). jaxgam mirrors this in pirls_loop when log_theta is provided. - ift2() in C computes dβ/dθ analytically. jaxgam gets this for free via the 3-primal custom_jvp on PIRLS.

General Debugging

Problem R file Function Search for
Basis matrix doesn't match R R/smooth.r smooth.construct.*.smooth.spec() The specific basis type
Penalty matrix doesn't match R/smooth.r Same constructor $S
Knots don't match R/smooth.r or src/tprs.c place.knots() or tprs_setup() Knot selection algorithm
Coefficients don't match R/gam.fit3.r gam.fit3() coef
Deviance doesn't match base R stats package *$dev.resids Per-family deviance
REML score doesn't match R/fast-REML.r fast.REML.fit() reml
EDF doesn't match R/mgcv.r pen.edf() EDF trace computation
Smoothing params don't match R/fast-REML.r fast.REML.fit() sp
Prediction doesn't match R/mgcv.r predict.gam() Use type="lpmatrix" to isolate
Step-halving behavior differs R/gam.fit3.r gam.fit3() step.half
Constraint/identifiability issue R/mgcv.r gam.side() Constraint detection logic
Factor-by doesn't match R/smooth.r smooth.construct() by.var, by.level
Summary p-values don't match R/mgcv.r summary.gam() testStat
NB theta doesn't match R R/efam.r nb() getTheta, Dd
NB deviance derivatives wrong R/gam.fit4.r dDeta() Deta, Deta2
NB observed weights wrong R/gam.fit4.r gam.fit4() w <- dd$Deta2*.5