Impact of Network Complexity in ANN Regressors

The simplest neural network is just linear regression. Add hidden units and you gain the capacity to fit curves. Add too many and the model starts fitting noise instead of signal. This tradeoff isn’t abstract; you can watch it happen by running the same 1D regression task with different hidden layer sizes.

Setup

Task: fit a one-dimensional nonlinear function using a single-hidden-layer network. We tested 0, 2, 4, 8, 16, and 32 hidden units, all trained with SGD at a learning rate of 0.0011 for 1000 epochs. Sigmoid activations throughout. Weights initialized uniformly in [-1, 1].
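This setup can be sketched in NumPy. The target function, noise level, and dataset size below are assumptions (the post doesn't specify them); the hyperparameters mirror the ones listed: sigmoid activations, learning rate 0.0011, 1000 epochs, weights uniform in [-1, 1]. Gradient descent is full-batch here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 1D target: the post doesn't name the function, so sin(3x) with
# mild noise stands in for "a smooth 1D function with moderate noise".
x = np.linspace(-1, 1, 64).reshape(-1, 1)   # already zero-mean, unit-scale
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(n_hidden, lr=0.0011, epochs=1000):
    """Single-hidden-layer regressor; n_hidden=0 degenerates to a linear model.
    Returns the final training MSE."""
    n = len(x)
    if n_hidden == 0:
        w = rng.uniform(-1, 1, (1, 1))
        b = np.zeros((1, 1))
        for _ in range(epochs):
            err = (x @ w + b) - y                 # d(MSE)/d(pred), up to 2/n
            w -= lr * x.T @ err / n
            b -= lr * err.mean(0, keepdims=True)
        return float((((x @ w + b) - y) ** 2).mean())
    # Weights initialized uniformly in [-1, 1], as in the post.
    W1 = rng.uniform(-1, 1, (1, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-1, 1, (n_hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        h = sigmoid(x @ W1 + b1)                  # hidden activations
        err = ((h @ W2 + b2) - y) / n             # output-layer error
        dh = (err @ W2.T) * h * (1 - h)           # backprop through sigmoid
        W2 -= lr * h.T @ err; b2 -= lr * err.sum(0)
        W1 -= lr * x.T @ dh;  b1 -= lr * dh.sum(0)
    h = sigmoid(x @ W1 + b1)
    return float(((h @ W2 + b2 - y) ** 2).mean())
```

At the post's learning rate of 0.0011 convergence is slow on this toy data; raising `lr` or `epochs` makes the gap between the linear baseline and the hidden-layer models visible sooner.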

One thing that mattered more than expected: data normalization. Sigmoid activations saturate outside of roughly [-4, 4], and unnormalized inputs push hidden units into saturation before training properly starts, making gradients vanish. Normalizing inputs to zero mean and unit variance was the difference between a network that learned and one that stalled on epoch 1.
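A minimal illustration of the saturation effect, assuming z-score normalization with statistics fit on the training data; the raw input range here is invented for demonstration:

```python
import numpy as np

def zscore(x, eps=1e-8):
    """Normalize to zero mean, unit variance (statistics from training data only)."""
    mu, sigma = x.mean(axis=0), x.std(axis=0)
    return (x - mu) / (sigma + eps), mu, sigma

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # vanishes once |z| grows past ~4

# Invented raw input range: unnormalized values this large saturate a
# sigmoid unit before training starts.
raw = np.linspace(0, 500, 101).reshape(-1, 1)
norm, mu, sigma = zscore(raw)

grad_raw = sigmoid_grad(raw)      # ~0 almost everywhere: training stalls
grad_norm = sigmoid_grad(norm)    # healthy gradients near the 0.25 maximum
```

The same `mu` and `sigma` should be reused to normalize test inputs, so train and test data pass through the network on the same scale.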

Six Configurations, One Pattern

0 hidden units (linear baseline): Learned a line. The model is structurally incapable of fitting curves, and loss plateaued early regardless of training duration.

2-8 hidden units: These models captured the function’s curvature without memorizing noise. The 8-unit model performed best on the held-out test set, with smooth predictions that tracked the underlying function.

16 and 32 hidden units: Training loss dropped lower than the 8-unit model. Test loss climbed. The models fit the noise alongside the signal, producing predictions that wiggled through training points but failed on held-out data.
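The pattern across these configurations can be reproduced deterministically with a simplification: freeze a random sigmoid hidden layer and solve the output layer by least squares, so capacity is the only variable. This is an assumption for reproducibility, not the post's method (those networks train all layers by SGD), and the data-generating function is invented.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, noise=0.3):
    # Invented stand-in for the post's smooth 1D target with moderate noise
    x = rng.uniform(-1, 1, (n, 1))
    return x, np.sin(3 * x) + noise * rng.standard_normal((n, 1))

x_tr, y_tr = make_data(20)     # small train set relative to the function
x_te, y_te = make_data(200)

def fit_eval(n_hidden):
    """Random sigmoid features + least-squares output layer.
    Returns (train MSE, test MSE)."""
    W = rng.uniform(-4, 4, (1, n_hidden))
    b = rng.uniform(-2, 2, n_hidden)
    feats = lambda x: np.column_stack(
        [1.0 / (1.0 + np.exp(-(x @ W + b))), np.ones(len(x))])
    w, *_ = np.linalg.lstsq(feats(x_tr), y_tr, rcond=None)
    mse = lambda x, y: float(((feats(x) @ w - y) ** 2).mean())
    return mse(x_tr, y_tr), mse(x_te, y_te)

for n in [2, 4, 8, 16, 32]:
    tr, te = fit_eval(n)
    print(f"{n:2d} units  train={tr:.3f}  test={te:.3f}")
```

With 20 training points and 32 hidden units the least-squares fit interpolates the noise, so training error collapses while test error does not follow.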

Bias-Variance, Concretely

Plot train loss and test loss as a function of hidden units, and you get a V-shaped test curve over a training curve that keeps falling. The minimum of the test curve is exactly where you want to stop adding capacity.

For this problem (a smooth 1D function with moderate noise), 8 hidden units was enough. The instinct to add more capacity is almost always wrong when the dataset is small relative to the function’s complexity. More units give the model more ways to overfit, not more signal to extract.

What would actually improve things: L2 regularization, dropout, more training data, or early stopping against a validation set. Those are the right levers. Doubling the hidden unit count is not.
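A quick sketch of the L2 lever, under a simplifying assumption: the hidden layer is a fixed random sigmoid layer, which makes the penalized output-layer fit a closed-form ridge regression. The post's networks train end to end, and the data here is invented; the point is only how the penalty trades training error for test error.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(-1, 1, (n, 1))
    return x, np.sin(3 * x) + 0.3 * rng.standard_normal((n, 1))

x_tr, y_tr = make_data(20)
x_te, y_te = make_data(200)

# Fixed random sigmoid hidden layer with 32 units (the overfitting regime).
W = rng.uniform(-4, 4, (1, 32))
b = rng.uniform(-2, 2, 32)
feats = lambda x: 1.0 / (1.0 + np.exp(-(x @ W + b)))

def fit(lam):
    """Ridge fit of the output layer: solve (H^T H + lam*I) w = H^T y.
    Returns (train MSE, test MSE)."""
    H = feats(x_tr)
    w = np.linalg.solve(H.T @ H + lam * np.eye(32), H.T @ y_tr)
    train = float(((H @ w - y_tr) ** 2).mean())
    test = float(((feats(x_te) @ w - y_te) ** 2).mean())
    return train, test
```

`fit(1e-12)` behaves like the unregularized 32-unit model, driving training error to nearly zero by chasing noise; a modest penalty such as `fit(0.1)` gives up a little training error and typically recovers much of the lost test performance.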
