Few-Shot Tool Wear Prediction
Built for the PHM-AP 2025 Data Challenge. A CNC milling machine holds a small cutting insert and uses it to carve metal workpieces. Each pass across the workpiece is one “cut.” With each cut, the tool’s edge wears down a little, measured in micrometres as flank wear (VB). There are 6 training tools, each measured at only 6 of their 26 cuts, with 3 held-out tools to predict. Final result: RMSE 9.45 µm, R² 0.951, beating the competition winner’s 10.05.
Data
Each cut produces about 1.4 million sensor readings across four channels (vibration X/Y/Z, acoustic emission) sampled at 25.6 kHz, plus a controller log with spindle speed, feed rate, and force readings. Wear labels exist only at cuts 1, 6, 11, 16, 21, and 26.
There’s a lot of sensor data per cut, but only 36 wear labels to train on.
Feature Engineering
Raw signals get compressed into a single row of summary numbers per cut. Six extraction methods run in parallel:
- Time-domain stats: RMS, peak, crest factor, skewness, kurtosis per sensor channel
- catch22: 22 canonical time-series features capturing autocorrelation, entropy, and distribution shape
- Wavelet decomposition: sub-band energy and kurtosis across frequency bands; tool wear tends to excite specific vibration frequencies
- Controller aggregates: mean spindle current, feed rate variance, force statistics
- Rolling windows: the same stats computed over the 5 preceding cuts, so the model sees trends rather than snapshots
- Physics estimates: early Taylor power-law fits based on available wear measurements
This gives ~675 features per cut. Two filters prune that down to what the model actually needs. First, correlation dropping removes features that are near-copies of each other (absolute correlation above 0.95), leaving ~200 features. Second, Automatic Relevance Determination (ARD) fits a quick kernel model and scores each feature’s contribution to predicting wear, keeping the top 20-40. Three features are always retained regardless: cut number, cut number squared (wear accelerates near end of life), and the tool’s initial wear reading.
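The two pruning steps can be sketched roughly as below. This is a minimal illustration, not the project's code: the column names in `always_keep` are hypothetical, and sklearn's linear `ARDRegression` stands in for whatever kernel-based ARD model was actually used (its per-feature precisions `lambda_` play the role of relevance scores).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ARDRegression

def prune_correlated(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop features that are near-copies of an earlier feature."""
    corr = X.corr().abs()
    # Upper triangle only, so each correlated pair is counted once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=drop)

def ard_select(X: pd.DataFrame, y: np.ndarray, k: int = 30,
               always_keep=("cut_number", "cut_number_sq", "initial_wear")):
    """Rank features by ARD relevance, keep the top k plus forced columns."""
    model = ARDRegression().fit(X.values, y)
    # A large precision lambda_ shrinks that weight to zero -> low relevance.
    relevance = 1.0 / model.lambda_
    ranked = X.columns[np.argsort(relevance)[::-1]]
    keep = list(dict.fromkeys(
        list(ranked[:k]) + [c for c in always_keep if c in X.columns]))
    return X[keep]
```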
Expanding Sparse Labels
36 labels is not enough to train on. The Generic Tool Wear Model (GTWM) uses physics to fill in the 20 unmeasured cuts per tool.

Tool wear follows a consistent three-phase pattern: a rapid break-in phase in the first few cuts as the insert’s edge settles, a long steady-state phase of slow linear accumulation, then an accelerating end-of-life phase as the edge degrades. GTWM fits a piecewise curve capturing all three zones through the 6 real measurements, then reads off estimates for the 20 unmeasured cuts.
Pseudo-labels are affine-anchored (the curve passes exactly through the real measurements) and isotonic regression is applied as a final cleanup to enforce monotonicity (a tool physically cannot un-wear itself). These generated labels carry a sample weight of 0.6 versus 1.0 for real measurements, so the model treats them as informed estimates rather than ground truth.
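A compressed sketch of the label-expansion step, under stated assumptions: a shape-preserving spline (`PchipInterpolator`) stands in for the piecewise three-phase GTWM curve, since both pass exactly through the real measurements; the isotonic cleanup and the 0.6 pseudo-label weight follow the text.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator
from sklearn.isotonic import IsotonicRegression

def expand_labels(measured_cuts, measured_vb, n_cuts=26, pseudo_weight=0.6):
    """Interpolate wear at unmeasured cuts through the real measurements,
    then enforce non-decreasing wear (a tool cannot un-wear itself)."""
    cuts = np.arange(1, n_cuts + 1)
    # Anchored curve: passes exactly through the real measurements.
    curve = PchipInterpolator(measured_cuts, measured_vb)(cuts)
    # Final cleanup: project onto the nearest monotone sequence.
    vb = IsotonicRegression(increasing=True).fit_transform(cuts, curve)
    # Real measurements get full weight, pseudo-labels a reduced one.
    weights = np.where(np.isin(cuts, measured_cuts), 1.0, pseudo_weight)
    return cuts, vb, weights
```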
Expanding 36 labels to 156 dropped RMSE from ~11.5 to ~10.0, the biggest single improvement in the pipeline.
Model
From the labelled data, the model must learn to predict wear at any cut for any tool. A Gaussian Process Regressor (GPR) is well suited here: it gives calibrated uncertainty estimates and generalises well on small datasets. But a plain GPR still has to learn the full wear curve from features alone; a physics-based prior can guide that learning.

Taylor’s tool life law relates cutting speed to tool life: $$V \cdot T^n = C$$
Rearranged for flank wear over discrete cuts, this becomes a power-law growth curve. Instead of fitting one global Taylor model for all tools, a separate curve is fitted per training trajectory: $$VB_i = C_i \cdot t^{n_i} + c_i$$
Each tool gets its own parameters. The GPR then only needs to model the residuals, i.e. the small gap between the physics prediction and the actual measurement. This is a much easier task than learning the full wear curve from features, and it keeps predictions physically plausible even where training data is thin.
Even though all 6 tools run under nominally identical conditions, their wear curves differ substantially, e.g. due to insert-to-insert variation, microscopic surface differences, and slight temperature drift. A single global Taylor fit leaves large residuals for every tool. Per-trajectory fitting gives the GP a tight, tool-specific prior to correct from.
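A per-trajectory fit of $VB_i = C_i \cdot t^{n_i} + c_i$ is a small nonlinear least-squares problem. The sketch below uses `scipy.optimize.curve_fit`; the initial guesses are hypothetical starting points, not values from the project.

```python
import numpy as np
from scipy.optimize import curve_fit

def taylor_curve(t, C, n, c):
    """Per-tool Taylor power law: VB = C * t**n + c."""
    return C * np.power(t, n) + c

def fit_taylor(cuts, vb):
    """Fit one power-law wear curve per training trajectory."""
    popt, _ = curve_fit(taylor_curve, cuts, vb,
                        p0=(1.0, 0.8, float(vb[0])), maxfev=10000)
    return popt

def taylor_residuals(cuts, vb):
    """What the GP actually models: measurement minus physics prediction."""
    return vb - taylor_curve(cuts, *fit_taylor(cuts, vb))
```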

MultiTargetGPR runs three GPRs in parallel on different representations of the same target: raw VB, log-transformed log(VB + 1), and min-max normalised. Each representation exposes a different error structure. The log view compresses the high-wear tail in late cuts where errors tend to be largest; the normalised view is scale-invariant across tools. Averaging the three inverse-transformed predictions cancels out individual mistakes and stabilises results across all wear phases.
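The three-view ensemble reduces to a small wrapper class. A minimal sketch, assuming sklearn's `GaussianProcessRegressor` with a Matérn kernel; the kernel settings are placeholders (the tuned values come from the Optuna search described below), and sample weighting is omitted for brevity.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

class MultiTargetGPR:
    """Three GPs on raw, log(1 + y), and min-max views of the target;
    predictions are inverse-transformed and averaged."""

    def fit(self, X, y):
        self.y_min, self.y_max = y.min(), y.max()
        span = self.y_max - self.y_min
        targets = [y, np.log1p(y), (y - self.y_min) / span]
        self.models = []
        for t in targets:
            gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                          alpha=1e-2, normalize_y=True)
            gp.fit(X, t)
            self.models.append(gp)
        return self

    def predict(self, X):
        span = self.y_max - self.y_min
        raw = self.models[0].predict(X)
        from_log = np.expm1(self.models[1].predict(X))       # invert log1p
        from_norm = self.models[2].predict(X) * span + self.y_min
        return (raw + from_log + from_norm) / 3.0
```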
Hyperparameter tuning uses Optuna with 100 trials to find the optimal kernel parameters (alpha, length_scale, nu for Matérn) for each of the three GPRs. The objective is variance-penalised: $$\mathcal{L} = \overline{\text{RMSE}} + 0.3\,\sigma_{\text{RMSE}}$$
This discourages solutions that predict five folds perfectly but blow up on one.
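The penalised objective itself is one line; whether the spread term uses the population or sample standard deviation is an assumption here (the sketch uses numpy's population default).

```python
import numpy as np

def penalised_score(fold_rmses, penalty=0.3):
    """Variance-penalised CV objective: mean fold RMSE plus a penalty on
    fold-to-fold spread, discouraging configs that fail on one tool."""
    r = np.asarray(fold_rmses, dtype=float)
    return r.mean() + penalty * r.std()
```

An Optuna trial would then suggest `alpha`, `length_scale`, and `nu`, run the leave-one-tool-out folds, and return `penalised_score(fold_rmses)` for the study to minimise.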
Validation
With only 6 training tools, leave-one-out cross-validation is the only honest evaluation strategy: train on 5 tools, predict the 6th, rotate through all 6. Errors are measured only at the 6 real microscope measurements per tool; pseudo-labels are used for training but never for scoring.
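The fold logic can be sketched as below, under stated assumptions: `fit_predict` is a hypothetical callable wrapping the full pipeline, and the `is_real` mask restricts scoring to the microscope-measured cuts exactly as described above.

```python
import numpy as np

def loto_rmse(tool_ids, X, y, is_real, fit_predict):
    """Leave-one-tool-out CV: train on the remaining tools (real + pseudo
    labels), score the held-out tool only at its real measurements."""
    rmses = []
    for tool in np.unique(tool_ids):
        train = tool_ids != tool
        test = (tool_ids == tool) & is_real   # score on real labels only
        pred = fit_predict(X[train], y[train], X[test])
        rmses.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
    return np.array(rmses)
```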
Tool 4 was the hardest trajectory throughout, with an unusual wear pattern that no model fit cleanly. The average RMSE of 9.45 beats the competition winner’s 10.05 because the other 5 tools are predicted well enough to compensate.