Data Leakage

Are you encoding bias into your training data?

Layne Sadler
Towards Data Science


Leaky pipes
Photo Credit: (https://pixabay.com/photos/water-water-pipes-drinking-water-4803866/)

If an encoder is fit on the entire dataset, then the encoding of training data is influenced by the distribution of the validation and test data.

Have you ever wondered why data processing libraries like scikit-learn separate the encoding process into two steps, fit() and transform()? In this post, we’ll explore the reasoning behind that pattern.

Leakage Defined

The purpose of a supervised algorithm is to make predictions about data it has not seen before. Therefore, when an algorithm will be evaluated against validation and test splits, practitioners need to make sure that it does not inadvertently gain access to information about those splits during training. When this malpractice occurs, it is referred to as “data leakage.”

How Leakage Happens

One particular area where data leakage can come into play is the preprocessing (aka encoding) of numeric (continuous or non-ordinal integer) data.

The de facto encoder for handling numeric data in scikit-learn is StandardScaler. It standardizes each feature as z = (x − mean) / standard deviation, so the encoded values are centered on a mean of 0 and expressed in units of standard deviation; for roughly bell-shaped data, most values land within a few units of 0.

For example, let’s say we are normalizing daily temperatures for the city of Boston over the course of a year:

  • The average temperature (e.g. 52 Fahrenheit) will be encoded as 0
  • High temperatures (e.g. 98 Fahrenheit) map to positive values a couple of standard deviations above 0
  • And low temperatures (e.g. -5 Fahrenheit) map to correspondingly negative values

Temperature Gauges
Photo Credit: (https://unsplash.com/photos/0aqJNZ5tVBc)

Now, we could encode our entire dataset in a single step like so:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
encoded_dataset = scaler.fit_transform(dataset)

However, can you think of a reason why we wouldn’t want to include all of our splits (training, validation, test) in this standardization process? What if our test split contained a few extreme temperatures from a cold snap?

If we included these colder days in the encoding, they would pull the average (the 0 midpoint) lower, thereby swaying the encodings of our training samples to be a bit higher than they should be. The weights of the algorithm would then be biased by these shifted training inputs. In essence, the measured performance of the algorithm is inflated because it has been exposed to information about the test data that it should never have seen during training.
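Here is a minimal sketch of that effect (the temperatures and the train/test split below are invented for illustration): compare the mean the scaler learns when fit on the full dataset versus the training split alone.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical daily temperatures; the final two values are the cold snap
# that happens to land in the test split.
temps = np.array([[52.], [55.], [49.], [58.], [50.], [-5.], [-8.]])
train, test = temps[:5], temps[5:]

leaky = StandardScaler().fit(temps)   # fit on everything: leakage
clean = StandardScaler().fit(train)   # fit on training only: no leakage

print(leaky.mean_)                    # ~35.9, dragged down by the cold snap
print(clean.mean_)                    # 52.8, reflects the training data alone
print(leaky.transform(train).mean())  # > 0: training encodings shifted upward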

So Why Encode?

If the process of standardizing numeric data is prone to leakage, then why can’t it be skipped? Encoding provides two major benefits:

Equal Feature Importance — Let’s say we have two features: exam_score and SAT_score [USA college prep test]. The exam has a maximum score of 100, while the SAT has a maximum score of 1600. If we don’t normalize these two features based on their range of possible values, an algorithm will initially be prone to prioritizing SAT_score simply because of its larger magnitude. However, if we normalize both features between 0 and 1, they will be treated equally at the start of training.
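As a quick sketch (the scores below are invented for illustration), scikit-learn’s MinMaxScaler squashes both columns into the 0-to-1 range; in practice you would still fit it on the training split only:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical students: [exam_score (max 100), SAT_score (max 1600)]
scores = np.array([[95., 1450.],
                   [70., 1100.],
                   [88., 1520.]])

scaled = MinMaxScaler().fit_transform(scores)
print(scaled)  # both columns now span 0..1, so neither dwarfs the other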

Stabilize Gradients — Neural networks learn better when input values are close to zero. If you don’t believe me, try training a CNN both with and without scaling the image data first. The reason is that many activation functions are asymptotic (aka saturated) at their extremes: the sigmoid curve in the figure below flattens out near 0 and 1, and tanh near -1 and 1. In those saturated regions, even large changes in the input barely change the output, so gradients vanish and learning stalls. Conveniently, most standardization functions output values close to 0, where these curves are steepest.

Sigmoid Curve
Sigmoid Curve — Photo Credit: (https://en.wikipedia.org/wiki/Sigmoid_function#/media/File:Logistic-curve.svg)
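A few quick numbers make the saturation concrete (a minimal sketch):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 1.0, 5.0, 10.0]:
    print(x, round(sigmoid(x), 5))
# 0.0 -> 0.5, 1.0 -> 0.73106, 5.0 -> 0.99331, 10.0 -> 0.99995
# Past a few units the output barely moves, so the gradient is ~0.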

Split Before Encode

So how do we prevent data leakage from occurring during encoding?

The key is to segment the dataset into training, validation, and test splits prior to performing the standardization. However, we don’t fit_transform() each split separately. Instead, we fit() on the training split only, and then transform() every split with that same fitted scaler, like so:

scaler = StandardScaler()
scaler.fit(split_training)
encoded_training = scaler.transform(split_training)
encoded_validation = scaler.transform(split_validation)
encoded_test = scaler.transform(split_test)

This pattern is especially challenging to apply during cross-validation because the training and validation samples rotate during resampling. Keeping track of the preprocessing of folds by hand becomes impractical.
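For reference, one standard way to keep per-fold encoding straight in plain scikit-learn is to wrap the scaler and model in a Pipeline; cross_val_score clones the pipeline for each fold, so the scaler is re-fit on only that fold’s training samples (a sketch with made-up data):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 3)  # hypothetical features
y = np.random.rand(100)     # hypothetical continuous label

pipe = make_pipeline(StandardScaler(), Ridge())

# Each fold re-fits the scaler on that fold's training samples only,
# so no information leaks from the held-out fold.
print(cross_val_score(pipe, X, y, cv=5))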

Enter AIQC

Fortunately, AIQC, an open source library created by the author of this blog post, solves these challenges with its automated Pipelines() by:

  • Detecting the use of numeric sklearn preprocessors
  • fit()’ing on training samples before separately transform()’ing each split and/or fold
  • Saving every fit() for downstream decoding of predictions

import aiqc
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder

splitset = aiqc.Pipeline.Tabular.make(
    # --- Data source ---
    df_or_path = df

    # --- Label preprocessing ---
    , label_column = 'SurfaceTempK'
    , label_encoder = dict(sklearn_preprocess = StandardScaler(copy=False))

    # --- Feature preprocessing ---
    , feature_cols_excluded = 'SurfaceTempK'
    , feature_encoders = [
        dict(dtypes=['float64'], sklearn_preprocess=RobustScaler(copy=False)),
        dict(dtypes=['int64'], sklearn_preprocess=OneHotEncoder(sparse=False))
    ]
    , feature_reshape_indices = None

    # --- Stratification ---
    , size_test = 0.12
    , size_validation = 0.22
    , fold_count = None
    , bin_count = 4
)

As seen above, all the practitioner needs to do is:

  • Define the encoder(s) to use for their labels and features
  • Specify the size of their validation and test subsets
  • Specify the number of folds, if any, to use

This enables practitioners to spend more time on the actual analysis rather than juggling the encoding of various splits.

https://github.com/aiqc/aiqc

In summary, we’ve learned that data leakage occurs when an encoder is fit on non-training data. The problem is solved by fitting the encoder on the training split alone and then separately transforming each split/fold, but doing so manually introduces another dimension of data wrangling for practitioners to keep track of. Thankfully, aiqc.Pipelines() can be used to automate this and other pre/post-processing challenges.

--

AIQC is an open source framework for rapid, rigorous, & reproducible deep learning.