Memorization Isn’t Learning, It’s Overfitting
Why simpler models are more generalizable
If a student memorizes every definition in a textbook, have they really grasped the theories within? Or have they proven they can store the information in the book within their biological neural network, and recall that information when needed?
Learning is about being able to generalize — to grasp the underlying concepts and apply them elsewhere. It’s about understanding not only the ‘what’ of the matter but also the ‘why.’
Parallels with information theory
Neural networks are a malleable medium for storing and transforming information. They compress features by downscaling them into a lower-dimensional representation of the broader dataset.
The diagram above shows an architecture known as an autoencoder. It first encodes the data into a compressed latent-space representation, then decodes that representation back into an approximation of the original input.
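As a rough sketch of that encode/decode loop, here is a minimal dense autoencoder in Keras. The 784-dimensional input (a flattened 28x28 image) and the layer widths are illustrative assumptions, not values from the article:

```python
# Minimal dense autoencoder sketch (assumes tf.keras; sizes are illustrative).
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 784   # e.g. a flattened 28x28 image (assumed)
latent_dim = 32   # the compressed, lower-dimensional bottleneck

# Encoder: downscale the input into the latent representation.
encoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(latent_dim, activation="relu"),
])

# Decoder: reconstruct the original input from the latent code.
decoder = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),
])

# End to end: encode, then decode; the training target is the input itself.
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10)
```

The reconstruction loss compares the decoder's output to the original input, so whatever survives the 32-dimensional bottleneck is, by construction, a compressed summary of the data.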
Memorizing noise
In discriminative learning, the information embedded in hidden layers constitutes a lower-dimensional representation of the differences between the samples in the dataset, sort of like principal component analysis (PCA). If a discriminative model is trained on every sample in a given dataset, then it is going to continually memorize every difference it can find as it relentlessly seeks to minimize loss. Like a vacuum sealer, it will suck out all of the information entropy until all of the data is rigidly wrapped in plastic. This includes the noise contained in the dataset. That’s how a model becomes overfit.
In this way, an overfit model just saves the data it is exposed to in a different, neural format — it’s copying and reformatting rather than actually learning.
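To watch this happen in miniature, the sketch below fits two polynomials of very different capacity to the same noisy samples; the high-degree fit plays the role of the overparameterized network, and every number here is an illustrative assumption:

```python
# Memorizing noise in miniature: an overly flexible model drives training
# error toward zero by fitting the noise itself (NumPy stand-in, illustrative).
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)  # signal + noise

for degree in (3, 15):            # modest capacity vs. near-interpolating capacity
    fit = Polynomial.fit(x, y, deg=degree)
    train_mse = np.mean((fit(x) - y) ** 2)
    print(f"degree {degree:2d} -> training MSE {train_mse:.4f}")

# The degree-15 fit threads (nearly) every noisy point: it has vacuum-sealed
# the training data, noise included, and will generalize poorly off those points.
```

The vanishing training error of the flexible fit is exactly the copy-and-reformat trap: the noise has been stored, not the signal.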
Minimizing entropy with forced generalization
Dividing a dataset into subsets of training, validation, and test/holdout data not only denies the model access to the total sum of the information but also penalizes it for overfitting on noise in the training data.
When the model makes an incorrect prediction based on characteristics that are present in the training data but not the validation data, it incurs loss. This forces the algorithm to learn more generalizable patterns.
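Here is a minimal sketch of that setup in Keras, using a synthetic scikit-learn dataset as a stand-in for real data; the split sizes, layer widths, and patience value are assumptions for illustration, not prescriptions:

```python
# Forced generalization via held-out data (assumes tf.keras and scikit-learn;
# the dataset and hyperparameters are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic binary classification data standing in for a real dataset.
x, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out test data the model never sees during training.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

model = keras.Sequential([
    keras.Input(shape=(x_train.shape[1],)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# validation_split carves a validation set out of the training data. Its loss
# never updates the weights, but early stopping watches it and halts training
# once the model starts fitting noise that the validation set doesn't share.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
model.fit(x_train, y_train, validation_split=0.2, epochs=100,
          callbacks=[early_stop], verbose=0)

print("held-out test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])
```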
Generalization is the entire point of predictive analytics. We train a model because we want to apply what it has learned to previously unseen data found in the wild. We seek the patterns in the microcosm that can be reproduced in the macrocosm of the broader population.
Mo’ parameters, mo’ problems
How can we apply this knowledge to our deep learning practice? The simpler the model, the better. Here simplicity is best defined by the number of trainable parameters. When an algorithm has fewer parameters, it’s forced to use those weights to find the most broadly applicable patterns possible in the training data.
Start small! Begin each training experiment with a shallow (few layers) and narrow (few neurons) network. If you find that the model isn’t learning, then try gradually expanding the number of neurons per layer and then the number of layers. That way you wind up with the simplest model possible, as in the sketch below.
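One way to make that workflow concrete is a small builder function that starts shallow and narrow and only grows when the metrics demand it. The sketch below assumes Keras and an illustrative 20-feature binary classification task; none of the sizes come from the article:

```python
# Start small, grow only if underfitting (assumes tf.keras; sizes illustrative).
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hidden_layers=1, units=8, n_features=20):
    """Build an MLP with configurable depth (layers) and width (neurons per layer)."""
    model = keras.Sequential([keras.Input(shape=(n_features,))])
    for _ in range(hidden_layers):
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Begin with the shallowest, narrowest network that could plausibly work.
small = build_model(hidden_layers=1, units=8)
print("small model parameters:", small.count_params())

# If training metrics show the model isn't learning, widen first, then deepen,
# and keep whichever is the simplest model that fits.
wider = build_model(hidden_layers=1, units=32)
deeper = build_model(hidden_layers=2, units=32)
print("wider:", wider.count_params(), "| deeper:", deeper.count_params())
```

Comparing count_params() at each step keeps the bias toward simplicity explicit: the fewer weights in play, the less room there is to vacuum-seal the noise.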