There is an ongoing discussion about what the science part of data science actually is. I like Karl Popper's definition of the scientific process as a process of falsifying hypotheses and evaluating predictions. Popper was born in 1902, and he was close to major progress in different fields - he observed Einstein coming up with the theory of relativity and his desire to repeatedly test it experimentally even after it was already widely accepted. Popper also observed Sigmund Freud, who approached theories from the other side: he observed some phenomenon and came up with a theory to explain it, but no hypothesis testing and no prediction evaluation were part of the process. Karl Popper called it pseudo-science.

To me, there is a strong parallel between this story and a methodology commonly used in data science: cross-validation. We cannot just create a model based on past observations without any evaluation on unseen data. A theory or model that perfectly explains past observations is of no value if it cannot be used for prediction, since there is an infinite set of competing perfect fits - perfect explanations. Freud's theories are today considered good perhaps for leisure reading, but not as scientific theory, as opposed to Einstein's work. We should therefore test our models as thoroughly and honestly as possible if we want a model that runs successfully in production for a long time.

It could look incredibly dumb to say that by calling cross_validate(X, y, 5) we are doing science, but it doesn't really matter how simple evaluating predictions is, as long as you are doing it right. Despite awesome tools like sklearn, there are still many traps we can fall into, and this small blog post shows how to prevent at least some of them and gives you a few tricks I found useful in practice.

We can start with the simplest approach - keeping one holdout set with train_test_split.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Let's load some data for regression.

X, y = make_regression_with_trend(n_samples=1000, n_features=4, n_informative=3, bias=2, noise=5, random_state=0)
print(X.shape, 'the full dataset shape')
(1000, 5) the full dataset shape
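Note that make_regression_with_trend is not a sklearn function but a small custom helper whose implementation isn't shown here. A hypothetical sketch consistent with the printed shape (1000, 5) might wrap sklearn's make_regression, add a drift to the target, and append the time index as an extra feature:

```python
import numpy as np
from sklearn.datasets import make_regression

def make_regression_with_trend(n_samples=1000, n_features=4, n_informative=3,
                               bias=2, noise=5, random_state=0):
    # hypothetical helper: standard make_regression plus a temporal trend
    X, y = make_regression(n_samples=n_samples, n_features=n_features,
                           n_informative=n_informative, bias=bias,
                           noise=noise, random_state=random_state)
    # add a drift over the (implicit) time axis to the target
    y = y + np.linspace(0, 50, n_samples)
    # append the time index as an extra feature -> shape (n_samples, n_features + 1)
    X = np.hstack([X, np.arange(n_samples).reshape(-1, 1)])
    return X, y
```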

Keep 30% of the dataset for evaluation.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, 'the train dataset shape')
print(X_test.shape, 'the test dataset shape')
(700, 5) the train dataset shape
(300, 5) the test dataset shape

And fit the linear regression.

model = LinearRegression()
model.fit(X_train, y_train).score(X_test, y_test).round(2)
0.74

R2 = 0.74, not bad, right? Let's just try it one more time.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, shuffle=False)
model.fit(X_train, y_train).score(X_test, y_test).round(2)
0.63

What happened? As usual, the devil is in the detail, and in the case of functions, the details are the default parameters. In the case of the train_test_split function, the usual suspect is the shuffle parameter, whose default is True. So if your data has any inherent structure (like sensor values from different times), you run the risk of an artificially good prediction, because you leaked information about the temporal structure into your training set.

You could assume that shuffle is True for all other splitting functions - but you would be wrong. Let's say that you want a more robust cross-validation, so you want more splits. The natural choice would be the KFold class, but its shuffle default is False - so watch out for defaults in sklearn (see e.g. this article on other problematic defaults in sklearn: https://ryxcommar.com/2019/08/30/scikit-learns-defaults-are-wrong/).
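A quick way to convince yourself about the inconsistent defaults is to inspect the signatures directly and to look at how the default KFold carves the data into contiguous blocks:

```python
import inspect
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# KFold does NOT shuffle by default...
print(inspect.signature(KFold).parameters["shuffle"].default)            # False
# ...while train_test_split does
print(inspect.signature(train_test_split).parameters["shuffle"].default) # True

# with the default KFold, each test fold is a contiguous block of rows
cv = KFold(n_splits=4)
for train_idx, test_idx in cv.split(np.arange(8)):
    print("TEST:", test_idx)
```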

Temporal structure in the dataset

The problem of data with some structure (often temporal) that results in data leakage when shuffling is, in my experience, far more common than usually acknowledged. Most training data are collected over time, and the underlying relationships are likely not fixed. Take customer or sales related data - trends, behaviors, social paradigms, everything is constantly changing, so evaluating performance on shuffled data will very likely result in overfitting (think, for example, about the impact of covid on sales of toilet paper). At first glance, it could seem that temporal changes are not that relevant for machine-generated data (sensors...), and that shuffling is just fine. It could be, but very often it is not - sensors, devices, and machines are changed and operated by humans with unstable performance, which very often leads to trends visible in the data.

The simplest thing to do is to plot your dependent variable over time, check whether there are any visible trends, and cross-validate both with and without shuffling to see if there are any significant differences. If you spot that the data are not independent of the time dimension, you should cross-validate accordingly.
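The shuffled-versus-unshuffled check can be sketched on synthetic data with an artificial trend (the drift added below is made up for illustration; any trend a linear model cannot extrapolate will do):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic regression data with a quadratic drift over the (implicit) time axis
X, y = make_regression(n_samples=1000, n_features=4, noise=5, random_state=0)
y = y + np.linspace(0, 20, 1000) ** 2

model = LinearRegression()
shuffled = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0)).mean()
unshuffled = cross_val_score(model, X, y, cv=KFold(5)).mean()

# a large gap between the two scores is a red flag for temporal leakage
print(f"shuffled R2:   {shuffled:.2f}")
print(f"unshuffled R2: {unshuffled:.2f}")
```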

Sklearn offers TimeSeriesSplit. Be sure to use version >= 0.24; before that, this class had just the n_splits and max_train_size arguments, which made it quite limited. From version 0.24 you can use test_size to specify your horizon, max_train_size to clip your data (maybe you want to keep just the last 2 years of sales data for training because older data are no longer relevant), and the gap argument, which is very useful for making the individual splits more independent. The following code nicely illustrates how the individual parameters work:

from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=3, max_train_size=10, test_size=3, gap=2)
for train_index, test_index in cv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [979 980 981 982 983 984 985 986 987 988] TEST: [991 992 993]
TRAIN: [982 983 984 985 986 987 988 989 990 991] TEST: [994 995 996]
TRAIN: [985 986 987 988 989 990 991 992 993 994] TEST: [997 998 999]

Visualize the predictions from cross-validation

One thing I don't particularly like about sklearn is that in most cases the predictions from GridSearch and cross-validation are not available, and we are left with just a few summary numbers about model performance. One workaround could be a custom implementation of a scoring function that tracks the data for you, as done e.g. in the hcrystalball package (disclosure: I'm the maintainer of this project, https://github.com/heidelbergcement/hcrystalball/blob/master/src/hcrystalball/metrics/_scorer.py), but it currently works only with hcrystalball wrappers and not with parallel computing (n_jobs>1). Another option is cross_val_predict, which returns predictions for all folds, but it doesn't work with TimeSeriesSplit. So very often the simplest solution is to use KFold on a dataset sorted by datetime, without shuffling, which we get just by specifying the cv parameter of cross_val_predict as an integer, as shown below.

import seaborn as sns

sns.set(
    font="DejaVu Sans",
    rc={
        "figure.figsize": [12, 8],
        "axes.axisbelow": False,
        "axes.edgecolor": "black",
        "axes.facecolor": "None",
        "axes.grid": False,
        "axes.labelcolor": "black",
        "axes.spines.right": False,
        "axes.spines.top": False,
        "figure.facecolor": "white",
        "lines.solid_capstyle": "round",
        "patch.edgecolor": "w",
        "text.color": "black",
        "xtick.bottom": False,
        "xtick.color": "black",
        "xtick.direction": "out",
        "xtick.top": False,
        "xtick.labelsize": "x-large",
        "ytick.labelsize": "x-large",
        "ytick.color": "black",
        "ytick.direction": "out",
        "ytick.left": False,
        "ytick.right": False,
        "axes.labelsize": 18,
        "legend.fontsize": 18,
    },
)
import pandas as pd
from sklearn.model_selection import cross_val_predict

predictions_unshuffled = cross_val_predict(model, X_train, y_train, cv=3)
pd.DataFrame({'predictions_unshuffled': predictions_unshuffled, 'actuals': y_train}).plot.scatter('predictions_unshuffled', 'actuals');

This short post was by no means a cross-validation tutorial, but if you want to consult high-quality material on everything around cross-validation, I highly recommend this great paper from Sebastian Raschka: https://arxiv.org/pdf/1811.12808.pdf.