Getting Started

Evaluation Bias

Are you inadvertently training on your entire dataset?

Layne Sadler
Towards Data Science
7 min read · Mar 24, 2021


Photo Credit: Min An (http://bit.ly/pexels_artichokes)

By using insight that has been derived from the model’s performance against test data, we have effectively used our entire dataset to tune the model without ever directly training the model on the test samples.

Status quo.

The 70–30 split, we see it everywhere. Where 70% of the samples are used to train the model and 30% are set aside to test that model:

from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels,
    test_size=0.30
)

This two-way split is hardcoded into most machine learning tools as the default. However, by way of a thought experiment, we’ll explore how it introduces bias into our models.

Skepticism of academic best practices.

As a newcomer to machine learning, I was wary of letting too many ivory tower best practices creep into my workflow just for appearances.

30 percent?! Why that’s throwing out nearly a third of the data! How is that considered data ‘science?’

So it took me a while to warm up to the idea of setting aside even more data in hopes of training a more generalizable model.

After I figured out how to produce high-accuracy models on my own datasets, my skepticism of more advanced techniques was only reinforced, especially when it came to introducing a third validation split. How could denying my training split even more data possibly yield better results?

Besides, I liked my trash models! They worked! “Is anyone here touching the test data?” I asked myself. “No, no one is touching the test data.” I reasoned. The split happens early in the workflow. There’s no way the model sees it. So what was all of the validation fuss about?

Example of a flawed training process.

Let’s role-play a few iterations of a hypothetical and over-simplified two-split training process to see if we can spot where information about our test data leaks in.
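Concretely, each run below amounts to something like the following. This is a minimal Keras-style sketch, not the actual experiment: the architecture, hyperparameters, and a binary classification task are all placeholder assumptions, and it reuses the features_train/features_test arrays from the split above. Notice that the test split is passed to fit() as the “validation” data, which is where the acc and val_acc numbers in the tables come from.

from tensorflow import keras

# A hypothetical two-split workflow: the only evaluation data we have is the test split.
model = keras.Sequential([
    keras.layers.Dense(12, activation='relu', input_shape=(features_train.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(
    features_train, labels_train,
    validation_data=(features_test, labels_test),  # the test split doubles as our evaluation set.
    epochs=30, batch_size=5, verbose=0
)
# acc and val_acc for this run.
print(history.history['accuracy'][-1], history.history['val_accuracy'][-1])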

Run 1
+-------+-----------+------+
| Train | (acc) | 0.88 |
+-------+-----------+------+
| Test | (val_acc) | 0.74 |
+-------+-----------+------+

Looks like our model is actually learning. Good. Given that this is our first run, we can probably milk a lot more accuracy out of this data, so let’s change the batch size parameter and see what happens.

Run 2
+-------+-----------+------+
| Train | (acc) | 0.94 |
+-------+-----------+------+
| Test | (val_acc) | 0.87 | # increased batch size.
+-------+-----------+------+

Yes, of course the model performs better like that. Okay, okay, nearly there. Now add another layer to really extract that last level of detail.

Run 3
+-------+-----------+------+
| Train | (acc) | 0.98 |
+-------+-----------+------+
| Test | (val_acc) | 0.63 | # extra layer of depth.
+-------+-----------+------+

Bonk! Hmm, well, clearly that extra layer overfit on the training data. So let’s roll back that change, and play with the number of neurons in our existing layers instead.

Run 4
+-------+-----------+------+
| Train | (acc) | 0.95 |
+-------+-----------+------+
| Test | (val_acc) | 0.94 | # increased neurons.
+-------+-----------+------+

Eureka — true masters of the datasphere! No datum in the cosmos can escape our inference!

Reflection.

Pump the brakes. What just happened here? Where is the improvement in our model actually coming from?

Well, looking across all of the runs, what is the common denominator? I’ll give you a hint: there’s actually a second neural network in this workflow that is performing a hidden regression analysis. No, it’s not buried deep in the source code of the backpropagation. It’s been right under our noses the whole time. It’s your brain!

As data scientists, we are performing our own internal hyperparameter sweep against our performance metrics when evaluating a model. We are the common denominator that sees the overall training process.

In fact, since the neural network only uses loss as its guiding metric for improvement, other metrics like accuracy exist for the sole purpose of helping us understand how our model is performing.

Therefore, when we make changes to our model based on what we’ve learned about how previous topologies and hyperparameters affected accuracy on the test data, we are introducing massive bias into our model. By using insight that has been derived from the model’s performance against test data, we have effectively used our entire dataset to tune the model without ever directly training the model on the test samples.

How to fix.

So how can we prevent ourselves from introducing this bias into our workflow?

A good first step is to start using a third validation split for evaluating your training runs. This allows you to truly set aside your test data as a holdout split. You only use this holdout split for evaluation purposes once you feel that you already have a model that will generalize well based on how it has performed on the validation data.

Remember, the underlying reason that we use a third validation split is not to hide the samples from the algorithm. Rather, we do this to hide the test samples from ourselves as the evaluators of the performance metrics while we redesign the algorithm across many runs.

The distribution of data across your splits will look something like this:

splits = {
"train": 0.67,
"validation": 0.13,
"test": 0.20
}
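One way to produce those proportions with scikit-learn is to call train_test_split twice. This is a sketch under the assumption of a classification label, so the stratify argument applies:

from sklearn.model_selection import train_test_split

# First carve off the 20% holdout test split.
features_rest, features_test, labels_rest, labels_test = train_test_split(
    features, labels, test_size=0.20, stratify=labels
)
# Then split the remainder: 0.13 of the total is 0.13 / 0.80 = 0.1625 of what's left,
# which leaves 0.67 of the total for training.
features_train, features_val, labels_train, labels_val = train_test_split(
    features_rest, labels_rest, test_size=0.1625, stratify=labels_rest
)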

How to determine the size of the splits, you ask? It all depends on how much data you have. If you have a lot of data, then you can afford to let your validation and test splits eat into your training set. However, the raw number of samples (100, 500, 1000) isn’t what actually matters. It’s all about ensuring that each split is representative of the broader population of samples, meaning that each split contains the same degree of variability. Is a classroom of 15 students representative of a school? Does a sample of 10,000 people from Geneva accurately reflect the EU as a whole? Maybe, maybe not.

Stratification.

This broader representation is achieved by stratifying the data: making sure that each split has the same distribution of values as the dataset as a whole. Does the data have the same ‘shape’ in each split?

Image credit: Mike Yi, A Complete Guide to Histograms (2019) ChartIO
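A quick way to eyeball that ‘shape’ question is to compare the label distribution of each split. A small pandas sketch; train_df, val_df, test_df and the 'label' column are hypothetical names:

# Assumes train_df, val_df, test_df are pandas DataFrames with a categorical 'label' column.
splits = {"train": train_df, "validation": val_df, "test": test_df}
for name, split_df in splits.items():
    # Proportion of each class within the split; the dicts should look roughly alike.
    print(name, split_df['label'].value_counts(normalize=True).round(2).to_dict())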

However, even if we stratify by both our labels and all of our features, there’s a chance that there’s some unseen variability in each split. And I’m not just talking about outliers. Remember, the features that are actually gathered are but a small subset of an infinite number of characteristics that can be used to describe the samples. There are latent, unseen features that our model is trying to tease out during training.

Sidebar: If we use sklearn’s StratifiedKFold() method for k-fold cross-validation, we actually get both validation folds and stratification for free.

However, this means that you have to keep track of the sample indices of each fold grouping when feeding them into your deep learning library. Also, you may not have enough data to ensure that each fold contains samples from every class, which makes calculating metrics painful. And it does not handle stratification of continuous variables. In practice, you can probably get away with a simple validation split.
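For reference, here is roughly what that manual bookkeeping looks like. A sketch, assuming features and labels are NumPy arrays and labels holds class labels:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold_index, (train_idx, val_idx) in enumerate(skf.split(features, labels)):
    # You are responsible for remembering which indices belong to which fold.
    features_train, labels_train = features[train_idx], labels[train_idx]
    features_val, labels_val = features[val_idx], labels[val_idx]
    # ...hand these arrays to your deep learning library and record metrics per fold.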

Solution.

We’ve seen that introducing validation splits or folds adds more moving pieces to our workflow, especially if we perform cross-validation.

Fortunately, the open source AIQC framework for reproducible deep learning data preparation and batch model tuning can handle this for you!

github.com/aiqc/aiqc

Here we see how the High-Level API makes splitting and folding stratified data a breeze:

splitset = aiqc.Dataset.Tabular.make(
    dataFrame_or_filePath = df
    , dtype = None              # option to force a specific data type.
    , label_column = 'object'
    , features_excluded = None
    , size_test = 0.22
    , size_validation = 0.12
    , fold_count = 5            # number of folds for cross-validation.
    , bin_count = 3             # number of bins for stratification.
    , label_encoder = None
    , feature_encoders = None   # see next blog!
)

If you need to access the splits/folds manually, they can be fetched via methods like:

Foldset.to_numpy(
id:int
, fold_index:int
, fold_names:list #['folds_train_combined', 'fold_validation']
, include_label:bool
, include_featureset:bool
, feature_columns:list
)

Afterward, you can even set hide_test=True to prevent the automated performance metrics and charts about your holdout set from being revealed.

AIQC: automated metrics for each split/fold of each model in the batch. Filtered by performance thresholds. Here, the test split is hidden.
AIQC: example of one of the automated charts for classification analysis.
AIQC: automated metrics for classification analysis.

Takeaways.

  • Practitioners introduce bias into their model when tuning new models based on the performance of old models on the test/holdout data.
  • Adding a validation split/fold serves as a buffer to protect the test/holdout data from this scrutiny.
  • Slicing a dataset requires stratification to ensure it is representative of the population. Complicating matters, it also results in more metrics and charts that need to be calculated.
  • Unlike most machine learning tools, the AIQC API makes it easy to use validation splits/folds, dynamically keeps track of each split/fold, and automatically calculates metrics for them.

Looking forward.

Now, you may be wondering, since we’ve segmented our data so heavily, how can we encode each split/fold? That’s a topic for our next blog!


AIQC is an open source framework for rapid, rigorous, & reproducible deep learning.