Power Failure Prediction

A graph neural network catches 99% of cascading blackouts on synthetic power grid data, runs in 0.35 ms on a CPU after INT8 quantisation, and survives sensor noise.

On 14 August 2003 a transmission line in Ohio sagged onto an overgrown tree and tripped. The power it had been carrying redistributed across the rest of the grid, overloading the next set of lines, tripping them too. Within three hours, 50 million people across the northeast US and Canada had lost electricity. Estimated damage was $4 to $10 billion.

That chain reaction is a cascading failure.

Graph Topology

A power grid is a network, not a list of values. Lines between substations carry power along paths set by Kirchhoff’s laws. When one line trips, current redistributes through the paths that are still standing. Whether the grid survives the next minute depends on how the lines are connected.

Most machine learning models flatten their inputs into one row of numbers. If applied to a grid, the topology, i.e. the thing that actually drives the cascade, disappears. Graph neural networks (GNNs) keep the structure. They pass messages along edges, so the model sees how stress propagates rather than averaging it away.
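To make the contrast concrete, here is a minimal sketch of one message-passing round on a toy 4-bus grid. The grid, the stress values and the 50/50 mixing rule are illustrative assumptions, not the layer used in this project.

```python
# Toy 4-bus grid: each tuple is an undirected line between two buses.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
# One scalar "stress" value per bus (hypothetical numbers).
stress = [0.9, 0.1, 0.2, 0.1]


def neighbours(node, edges):
    """Return the buses directly connected to `node`."""
    out = []
    for u, v in edges:
        if u == node:
            out.append(v)
        elif v == node:
            out.append(u)
    return out


def message_passing_step(values, edges):
    """One round of mean aggregation: each bus mixes its own value
    with the average of its neighbours' values (equal weights)."""
    updated = []
    for node, own in enumerate(values):
        nbrs = neighbours(node, edges)
        nbr_mean = sum(values[n] for n in nbrs) / len(nbrs)
        updated.append(0.5 * own + 0.5 * nbr_mean)
    return updated


def flatten_mean(values):
    """What a flat model sees if it only keeps a grid-wide average."""
    return sum(values) / len(values)


after_one = message_passing_step(stress, edges)
grid_average = flatten_mean(stress)
```

After one round, bus 1 (directly wired to the stressed bus 0) picks up more stress than bus 3 (two hops away), while the grid-wide average carries no information about where the stress sits.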

Setup

The data is the PowerGraph IEEE-24 cascade subset (Varbella et al., NeurIPS 2024): 24 buses, 38 transmission lines, ~21,500 graphs from a power-flow simulator, and a 20.1% positive rate (cascade observed). I trained five GNN architectures (GCN, GAT, GIN, GINe, GPS), an XGBoost baseline on 42 hand-engineered statistics, a Random Forest, and a plain Transformer.

The production model is GINe, a Graph Isomorphism Network variant that reads edge features (line ratings, power flows, reactance) directly through edge-aware message passing. It was Optuna-tuned, focal-loss trained, and wrapped in a directed-edge aggregation layer. The forward/backward mixing parameter converged to α ≈ 0.5 across every seed, which matches the physics: a line trip ripples both upstream towards generators and downstream towards loads.
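The exact form of the directed-edge aggregation layer is not given here, so the following is only a sketch of one plausible reading: messages that travel along an edge's direction and messages that travel against it are summed separately, then blended with α. The function name `directed_aggregate` and the toy chain are hypothetical.

```python
def directed_aggregate(features, directed_edges, alpha=0.5):
    """Blend forward and backward messages on a directed graph.

    Forward messages flow from source to target, backward messages
    from target to source. alpha=1 keeps only forward messages,
    alpha=0 only backward ones. (Assumed form, for illustration.)
    """
    n = len(features)
    fwd = [0.0] * n  # what each node receives from upstream
    bwd = [0.0] * n  # what each node receives from downstream
    for src, dst in directed_edges:
        fwd[dst] += features[src]
        bwd[src] += features[dst]
    return [alpha * f + (1 - alpha) * b for f, b in zip(fwd, bwd)]


# Toy chain: generator (0) -> substation (1) -> load (2),
# with all the stress sitting on the middle bus.
chain = [(0, 1), (1, 2)]
node_stress = [0.0, 1.0, 0.0]

balanced = directed_aggregate(node_stress, chain, alpha=0.5)
forward_only = directed_aggregate(node_stress, chain, alpha=1.0)
backward_only = directed_aggregate(node_stress, chain, alpha=0.0)
```

With α = 0.5 the stress on the middle bus reaches the generator side and the load side equally, which is the symmetric behaviour the converged value suggests. At α = 1 or α = 0 only one side would see it.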

Graph edge over tabular methods

GNNs allow you to predict which lines will fail, not just whether a cascade happens.

GINe scores per-edge failure probabilities at PR-AUC 0.76 and AUROC 0.998 once you add betweenness centrality and load ratio as edge features. XGBoost has no per-line head, only one prediction per grid. A per-edge XGBoost head is technically possible, but you’d have to re-engineer features per topology, and you’d still lose the cross-edge coupling that message passing provides natively.

SimGRACE pre-training reduces required simulations by 94%

SimGRACE contrastive pre-training on all four PowerGraph topologies, then supervised fine-tuning on the held-out one, matched native training within seed noise (0.989 vs 0.988 BalAcc). Moreover, the frozen pre-trained backbone reaches 0.71 BalAcc against random-init’s chance-level 0.50, i.e. the encoder picks up task-relevant structure with zero target-grid labels.

According to the learning curve, you need around 3,600 MATPOWER simulations to saturate from scratch on IEEE-24. This can be a bottleneck for larger grids, which take hours to simulate. With SimGRACE pre-training, you can get 0.95 BalAcc with just 200 fine-tuning graphs, a 94% reduction in required simulations (200 / 3,600 ≈ 5.6% of the from-scratch budget). Tabular methods have no equivalent mechanism here: the 42 hand-engineered statistics would need re-deriving for every new grid.

GINe is resilient to sensor noise

On clean data, XGBoost beats GINe (0.9947 vs 0.9890 BalAcc). However, once you inject 5% Gaussian noise into the inputs, the ordering flips:

GINe: 0.926 BalAcc (loses 1.3pp)
XGBoost: 0.798 BalAcc (loses 19.6pp)

You can call this a robustness inversion. The graph itself acts as the denoiser: message passing aggregates across neighbouring lines and smooths out corrupted readings, something flat-feature models have no mechanism for. The two curves cross around ε ≈ 0.2 of normalised noise, which sets a rough deployment ceiling.
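For reference, a noise test of this kind can be sketched in a few lines. The helper names are hypothetical, and the assumption that ε is the standard deviation of zero-mean noise on unit-scale features is mine, not something the experiment setup states. The labels below are made up to show why balanced accuracy, not plain accuracy, is the metric on a dataset with roughly 20% positives.

```python
import random


def add_gaussian_noise(features, eps, rng):
    """Add zero-mean Gaussian noise with standard deviation `eps`.
    Assumes features are already normalised to a comparable scale."""
    return [x + rng.gauss(0.0, eps) for x in features]


def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


rng = random.Random(0)
clean = [0.2, 0.5, 0.9]
noisy = add_gaussian_noise(clean, eps=0.05, rng=rng)

# Made-up labels with a 20% positive rate. A model that never predicts
# a cascade is 80% accurate but scores only 0.5 balanced accuracy.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * len(y_true)
lazy_score = balanced_accuracy(y_true, always_negative)
```

Sweeping `eps` over a grid and re-scoring both models at each level is what produces the two curves whose crossing point is quoted above.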

ONNX INT8 quantisation

A control-room operator screens contingencies on a 30-minute cycle, so the model has to return a prediction well within a second. After ONNX INT8 quantisation, it shrinks 54%, from 173 KB to 80 KB, and runs in 0.35 ms on a CPU with 99.5% output agreement against the FP32 reference.
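To show what INT8 quantisation does to a weight tensor, here is a simplified sketch of symmetric per-tensor quantisation plus an agreement check. This is not ONNX Runtime's implementation, the weights and labels are invented, and reading "output agreement" as the fraction of matching predicted labels is an assumption.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in
    [-127, 127] using one shared scale. Assumes a non-zero tensor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]


def agreement_rate(labels_a, labels_b):
    """Fraction of inputs on which two models predict the same label."""
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantise_int8(weights)
restored = dequantise(codes, scale)
# Rounding error per weight is at most half a quantisation step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

fp32_labels = [1, 0, 1, 1]  # hypothetical FP32 predictions
int8_labels = [1, 0, 0, 1]  # hypothetical INT8 predictions
agreement = agreement_rate(fp32_labels, int8_labels)
```

Each weight is stored in one byte instead of four. The measured file shrinks by less than that ratio, which is plausible if part of the file (graph structure, metadata, unquantised operators) stays in its original form.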

On clean data, XGBoost wins

Aggregate accuracy on IEEE-24 favours XGBoost: 0.9947 BalAcc against GINe’s 0.9890. Tree ensembles have been hard to beat on tabular data for years (Shwartz-Ziv & Armon 2022, Grinsztajn et al. 2022). Flatten the grid to 42 statistics and XGBoost can pick up the patterns with a few hundred trees. The GNN has to learn the same patterns from scratch, which is a higher bar.

I checked whether the gap was due to feature engineering by concatenating GINe embeddings to the XGBoost feature vector. SHAP attributes 80 to 85% of the ensemble’s decisions to the GINe embedding dimensions, but adding those embeddings only lifts test accuracy by 0.035pp. SHAP measures importance within a model; lift tests whether the embeddings add information beyond the hand-crafted statistics. On this 24-bus grid summarised by 42 features, they don’t.

The GNN’s edge isn’t accuracy; it’s the capabilities above plus calibrated uncertainty. Split conformal prediction returns prediction sets with finite-sample marginal coverage guarantees (assuming calibration and test data are exchangeable): a singleton commits to a label, a pair flags uncertainty for operator review, an empty set escalates an out-of-distribution input. That routing supports the human-oversight requirement of EU AI Act Article 14, which applies to a high-risk infrastructure model that has to defer to a human.
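A minimal sketch of split conformal prediction for a classifier follows. It uses the common score 1 − p(true label). The score function used in this project is not stated, so treat that choice, the function names and the toy calibration set as assumptions.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate on held-out data: return the score threshold that
    gives at least 1 - alpha marginal coverage under exchangeability."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration points: admit every label
    return scores[k - 1]


def prediction_set(probs, q_hat):
    """All labels whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= q_hat]


# Toy calibration set: [P(no cascade), P(cascade)] and the true label.
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.95, 0.05], [0.4, 0.6]]
cal_labels = [0, 1, 0, 1]
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)

confident = prediction_set([0.97, 0.03], q_hat)  # one label
ambiguous = prediction_set([0.55, 0.45], q_hat)  # no label clears the bar
```

Which set sizes can occur depends on the score function and the calibrated threshold. With this particular score and a two-class model, a threshold above 0.5 produces pairs on ambiguous inputs, while a threshold below 0.5 produces empty sets instead.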

Post-hoc explanations are unreliable for individual predictions

Two attempts to explain individual predictions failed.

Integrated Gradients (Sundararajan et al. 2017) recovered the ground-truth cascading edges at AUROC 0.878, which looked promising until I checked the Fidelity+ score. Fidelity+ measures whether removing the top-attributed edges actually changes the prediction. The number was -0.729. Removing the supposedly important edges didn’t move the output the way a faithful attribution would, which indicates the attributions were not causally linked to the prediction.

CF-GNNExplainer (Lucic et al. 2022) flipped the prediction on 4 of 200 confident positives, a 2% validity rate.

Both line up with Rudin’s (2019) argument that post-hoc explanations are unreliable for individual decisions. They’re fine for aggregate model audit, where attribution noise averages out across cases, but not for telling an operator why the model called a cascade. Per-decision routing is what conformal abstention does instead.

Sensitivity to topology

GINe handles noisy measurements gracefully but doesn’t handle missing structure. 20% targeted edge dropout (highest-loaded lines first) drops BalAcc from 0.939 to 0.654, a 30% relative collapse. The model tolerates noise, not topology change. A real-grid change (line out of service, new substation, planned reconfiguration) would require retraining or fine-tuning.
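The targeted dropout test can be sketched as follows. The function name, the toy loadings, and the choice to round the number of dropped lines down are assumptions made for illustration.

```python
def targeted_edge_dropout(edges, loading, fraction):
    """Remove the `fraction` most heavily loaded lines.

    `loading[i]` is the load ratio of `edges[i]`. The number of lines
    to drop is rounded down (an assumption of this sketch).
    """
    n_drop = int(len(edges) * fraction)
    ranked = sorted(range(len(edges)), key=lambda i: loading[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [e for i, e in enumerate(edges) if i not in dropped]


# Toy grid with five lines and hypothetical load ratios.
lines = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]
load_ratio = [0.30, 0.95, 0.50, 0.80, 0.10]
surviving = targeted_edge_dropout(lines, load_ratio, fraction=0.4)

# Relative drop quoted above: (0.939 - 0.654) / 0.939
relative_drop = (0.939 - 0.654) / 0.939
```

Dropping the most heavily loaded lines first removes the edges that carry the most signal about an impending cascade, which is why it is a harsher test than random dropout.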

Tech stack

Training runs on PyTorch 2.11 + PyTorch Geometric (CUDA 12.8), with torch-scatter and torch-sparse for the GNN. XGBoost for the tabular baseline. Optuna for hyperparameter search and MLflow for experiment tracking. SHAP for the embedding-importance analysis. Pandera and Evidently for data validation.

The deployment is a Streamlit dashboard with Plotly visuals, running ONNX Runtime for the quantised inference path. NetworkX renders the graph topology.

Live demo · Full technical report (PDF)

The report covers the full architecture comparison (GINe vs GPS vs directed GINe), the SimGRACE pre-training math, conformal prediction calibration, the EU AI Act compliance artefact list, and the negative result on the open problem of DNS (demand not served) regression.