Foundation Model Survival Analysis
This MSc dissertation at the University of Warwick, supervised by Matheus F. Torquato, investigates an unexplored intersection: whether frozen time-series foundation models can serve as viable feature extractors for survival-analytic remaining useful life (RUL) estimation, bypassing the fine-tuning paradigm entirely.
Setup
Three pre-trained foundation models (MOMENT, Chronos, and Moirai) are paired with six survival heads (CoxPH, CoxTime, DeepHit, MTLR, DSM, Weibull). All FM weights remain frozen; survival heads receive raw embeddings with no encoder adaptation.
Evaluation spans three benchmarks with fundamentally different physical natures: C-MAPSS and N-CMAPSS (simulated turbofan degradation) and XJTU-SY (accelerated bearing run-to-failure vibration at 25.6 kHz). The combinatorial matrix (3 FMs × 6 heads × multiple splits × 3 seeds) yields several hundred independent training runs.
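The frozen-extractor contract can be sketched in a few lines. This is a minimal illustration, not the dissertation's code: a fixed random projection stands in for a real transformer backbone, and all dimensions and names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen FM encoder: a fixed, never-updated linear map.
# The real backbones (MOMENT, Chronos, Moirai) are transformers; a random
# projection only illustrates the frozen feature-extraction contract.
D_SENSOR, T_WINDOW, D_EMBED = 14, 30, 64          # illustrative dimensions
W_FROZEN = rng.normal(size=(D_SENSOR * T_WINDOW, D_EMBED))

def embed(window: np.ndarray) -> np.ndarray:
    """Map one (T_WINDOW, D_SENSOR) sensor window to a D_EMBED vector."""
    return window.reshape(-1) @ W_FROZEN

# One embedding per window; only the survival head downstream sees gradients.
windows = rng.normal(size=(100, T_WINDOW, D_SENSOR))
X = np.stack([embed(w) for w in windows])         # head input, shape (100, 64)
```

The survival heads then train on `X` alone, which is what makes the several-hundred-run matrix tractable: embeddings are extracted once per FM and reused across all heads and seeds.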

Why Survival Analysis?
The dominant prognostics paradigm treats RUL as a point regression problem, collapsing the full temporal uncertainty into a scalar. Survival analysis, by contrast, models the complete conditional distribution over failure time, yielding a survival function $$S(t \mid \mathbf{x})$$ from which any quantile, confidence band, or decision threshold can be derived.
For fleet-level maintenance scheduling under asymmetric cost structures, this distributional output is operationally essential.
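Concretely, once a survival curve is available, point estimates and conservative maintenance thresholds both fall out of the same object. A minimal numpy sketch, with a purely hypothetical survival function:

```python
import numpy as np

# Toy discrete survival curve S(t | x) on a cycle grid; shape is hypothetical.
t = np.arange(200)
S = np.exp(-(t / 120.0) ** 2)                 # non-increasing, S(0) = 1

def quantile_rul(S, t, q):
    """Smallest t with S(t) <= 1 - q: the q-quantile of failure time."""
    idx = np.searchsorted(-S, -(1.0 - q))     # -S is non-decreasing
    return t[min(idx, len(t) - 1)]

median_rul = quantile_rul(S, t, 0.5)          # point estimate, if one is needed
early_bound = quantile_rul(S, t, 0.1)         # 10% of units have failed by here
expected_rul = S.sum()                        # E[T] ~ sum of S on a unit grid
```

Under asymmetric costs a scheduler would act on `early_bound` rather than `median_rul`; a point-regression model offers no such choice.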

Results
- C-index 0.949 on N-CMAPSS (Chronos + CoxPH) — near-perfect discriminative ranking of failure risk across heterogeneous flights.
- RMSE 11.4 cycles on C-MAPSS FD003 (Chronos + CoxPH), competitive with state-of-the-art regression-only methods that directly minimise MSE, despite the survival objective optimising a fundamentally different criterion.
- 8.5× error reduction versus Dintén et al. (CMES 2025) on multi-condition C-MAPSS subsets. MOMENT + CoxPH achieves RMSE 18.8 vs 149.9 on FD002 using the same backbone embeddings. The gap is attributable entirely to normalisation and head choice.
To the best of my knowledge, this is the first investigation coupling frozen foundation model embeddings with survival analysis heads.
Normalisation
Pre-processing proved decisive: the FMs differ sharply in their sensitivity to the input distribution.
MOMENT collapsed to near-chance C-index (≈ 0.5) on multi-condition C-MAPSS subsets (FD002, FD004) while performing strongly on single-condition splits. The failure traces directly to MOMENT’s pre-training objective.
Masked patch reconstruction requires the encoder to impute raw sensor values, which forces it to encode absolute signal levels. On single-condition data, levels track degradation. On multi-condition subsets (FD002 and FD004 contain six operating conditions each), sensor baselines shift between conditions, and global min-max normalisation preserves these condition-dependent offsets in the embeddings.
Stratifying normalisation per engine and per operating condition subtracts these baselines before extraction, recovering +22 percentage points of C-index on FD002.
Notably, the fix is MOMENT-specific. Chronos’s tokenisation into 4096 discrete bins is largely invariant to offsets, so it needs no such correction.
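The stratified normalisation amounts to a grouped standardisation before embedding extraction. A minimal pandas sketch with illustrative column names (shown as a z-score; the dissertation's exact scaler may differ):

```python
import numpy as np
import pandas as pd

# Toy telemetry: the operating condition shifts the sensor baseline.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "engine": np.repeat([1, 2], 50),
    "condition": rng.integers(0, 2, size=100),
    "sensor": rng.normal(size=100),
})
df["sensor"] += df["condition"] * 10.0        # condition-dependent offset

# Per-engine, per-condition z-score removes the baselines before the
# windows are handed to the frozen encoder.
grp = df.groupby(["engine", "condition"])["sensor"]
df["sensor_norm"] = (df["sensor"] - grp.transform("mean")) / grp.transform("std")
```

After this step the condition-dependent offsets are gone from every group, so masked-reconstruction embeddings can no longer encode them.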
Head Comparison
When the input representation is a frozen, non-adapted embedding space, survival heads are not interchangeable.
CoxPH was the most consistently reliable head across every FM × dataset combination tested, owing to its loss geometry rather than its capacity. The Cox partial likelihood compares relative risk within risk sets: additive shifts of the learned projection cancel exactly, and the resulting ranking (the C-index) is invariant under any strictly monotonic transformation of the risk score. This makes CoxPH robust to the distributional form of the embedding space: it only requires higher-risk units to receive higher hazard scores. FM embeddings, whose geometry was shaped by unrelated pre-training objectives, have no reason to exhibit any particular distributional shape, but they do preserve orderings that track degradation.
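The shift-invariance of the Cox loss can be verified directly. A minimal NumPy sketch of the negative log partial likelihood (all events observed, no tie handling; the data are synthetic):

```python
import numpy as np

def cox_nll(risk, time, event):
    """Negative log partial likelihood, Breslow-style, assuming no ties."""
    order = np.argsort(-time)                     # descending event time
    r, e = risk[order], event[order].astype(bool)
    log_risk_set = np.logaddexp.accumulate(r)     # log sum exp over the risk set
    return -np.sum((r - log_risk_set)[e])

rng = np.random.default_rng(0)
risk = rng.normal(size=8)
time = rng.uniform(1.0, 100.0, size=8)
event = np.ones(8)

base = cox_nll(risk, time, event)
shifted = cox_nll(risk + 5.0, time, event)        # location shift cancels
assert np.isclose(base, shifted)
```

A parametric head has no such cancellation: its loss reads the absolute values of the network output, which is exactly where a frozen, non-adapted embedding space gives no guarantees.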
Conversely, the Weibull and DSM heads degenerated across every FM × dataset combination. Both impose strong parametric assumptions: Weibull AFT requires the network’s two-dimensional output to parameterise a Weibull-shaped distribution of durations; DSM requires a mixture of log-normals. FM embedding geometry has no structural reason to satisfy these constraints.
On N-CMAPSS FD1, the same Moirai embeddings produce CoxPH C=0.934 and DSM C=0.682: the bottleneck is not the embedding but the head’s distributional assumption. The Weibull head routinely collapsed to identical survival curves for every unit across all FM × dataset pairs.

Discrimination-Calibration Gap
A recurring pattern across the matrix: excellent ranking does not imply accurate point prediction. On C-MAPSS FD003, MOMENT + CoxPH achieves C-index 0.884 (strong ordering) while producing RMSE 34.9 cycles: the survival function has the right shape but is systematically shifted, so every point estimate overshoots the true RUL.
This is a structural feature of the survival-analytic framing. Discriminative losses (partial likelihood, cross-entropy over discrete bins) are largely insensitive to location shifts of the survival function, so they can produce perfect ranking while leaving calibration uncorrected. Regression methods that directly minimise MSE will naturally hold the advantage on RMSE. The value proposition of survival analysis lies in the full distribution, calibrated uncertainty, and robustness under censoring, not in point accuracy alone.
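The gap can be reproduced in miniature: predictions with perfect ordering but a constant overshoot score a perfect C-index alongside a large RMSE. A toy sketch, no censoring, all values synthetic:

```python
import numpy as np

def c_index(pred, true):
    """Fraction of comparable pairs ranked concordantly (no censoring)."""
    conc = comp = 0
    n = len(true)
    for i in range(n):
        for j in range(i + 1, n):
            if true[i] == true[j]:
                continue
            comp += 1
            conc += (pred[i] < pred[j]) == (true[i] < true[j])
    return conc / comp

true_rul = np.array([10.0, 40.0, 70.0, 100.0, 130.0])
pred_rul = true_rul + 25.0            # perfect ordering, constant overshoot

ci = c_index(pred_rul, true_rul)                              # 1.0
rmse = float(np.sqrt(np.mean((pred_rul - true_rul) ** 2)))    # 25.0
```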
A corollary: the NASA score’s asymmetric exponential penalty makes it unreliable as a model-selection criterion on C-MAPSS, where truncated test engines require extrapolation and small calibration errors compound through $$\exp(d/13)$$. Comparisons therefore use C-index for ranking and RMSE for point accuracy.
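For reference, a sketch of the asymmetric score using the commonly cited PHM08 constants (13 for early predictions, 10 for late); these constants are an assumption of this example, not taken from the dissertation:

```python
import numpy as np

def nasa_score(pred, true, a_early=13.0, a_late=10.0):
    """Asymmetric exponential penalty; late predictions cost more."""
    d = np.asarray(pred, float) - np.asarray(true, float)
    per_unit = np.where(d < 0, np.exp(-d / a_early), np.exp(d / a_late)) - 1
    return float(per_unit.sum())

early = nasa_score([80.0], [100.0])    # 20 cycles early, score ~ 3.66
late = nasa_score([120.0], [100.0])    # 20 cycles late,  score ~ 6.39
```

The same 20-cycle error nearly doubles in cost depending on its sign, which is why a handful of poorly extrapolated truncated engines can dominate the aggregate score.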
Cross-Domain Transfer
The stronger test of a universal feature extractor is cross-domain transfer: whether embeddings learned from one signal type generalise to physically distinct degradation mechanisms without adaptation.
On XJTU-SY accelerated bearing data, the framework transfers effectively. Moirai + CoxPH achieves C-index 0.828, exceeding an LSTM+DeepHit baseline by 21 percentage points (0.607). Notably, ridge regression on the same Moirai embeddings only reaches C=0.629, near-identical to the LSTM.
The FM ranking inverts: Moirai dominates on bearings whereas Chronos leads on turbofan profiles. This can be attributed to the architectural differences between the FMs, which create different inductive biases in the learned representations:
- Chronos tokenises continuous values into 4096 discrete bins before transformer processing, which acts as an implicit denoising filter. C-MAPSS is simulator-generated telemetry with injected measurement noise that tokenisation cleans away.
- MOMENT’s masked reconstruction encodes absolute signal levels, which help on single-condition telemetry but require per-condition normalisation on multi-regime operating data.
- Moirai’s any-variate attention learns cross-channel correlations during pre-training on LOTSA (27B observations, 9 domains). On C-MAPSS this provides no measurable benefit: degradation is visible independently in each sensor, and mean-pooling aggregation discards any residual cross-channel structure. On XJTU-SY, where time-domain and frequency-domain vibration features correlate in bearing-failure-specific patterns, the architecture enables Moirai to extract more prognostic signal than the other FMs.
Accordingly, no single FM dominates across domains. Signal structure determines the right choice: tokenisation wins on noisy telemetry, cross-variate attention wins on heterogeneous multivariate features, and masked reconstruction sits between the two with a normalisation dependency.

Contributions
The experimental matrix supports four connected findings:
- Frozen FM embeddings transfer to survival analysis without fine-tuning, matching regression-only methods on discrimination while producing full distributional outputs.
- FM architectural choices predict dataset-specific performance in ways that are mechanistically explicable and not just empirical. Tokenisation, masked reconstruction, and any-variate attention each succeed on signal types that match their pre-training inductive bias.
- Discriminative survival heads (especially CoxPH) are resilient to FM embedding geometry because their losses depend only on ranking; parametric heads (Weibull, DSM) fail consistently due to distributional mismatch.
- Per-engine-per-condition normalisation is a necessary preprocessing step for masked-reconstruction FMs on multi-regime data.