A Novel Validation Strategy

In the previous post, we discussed the impact of a validation strategy on Machine Learning modeling results. We also touched on the importance of understanding the nature of the data when deciding on a validation strategy. In this post, I will describe the validation strategy we have chosen for the sex prediction problem.

 

Our Requirements for a Validation Strategy

 

Understanding how the developed ML model will be applied in a production environment is crucial. While the model is still being created, the validation strategy should yield estimates of its quality metrics that are as close as possible to the values the model will achieve in real use. For Neuroscience Software, the model will be applied to recordings from people it has never seen before, so the data must be split cross-subject: all records from a given subject fall entirely into either the training set or the evaluation set, never both.
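To make the requirement concrete, here is a minimal sketch of a cross-subject split using scikit-learn's GroupShuffleSplit. It is purely illustrative rather than our production code; the arrays records, labels, and subject_ids are hypothetical placeholders for per-record features, sex labels, and subject identifiers.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical placeholders: one row per EEG record, plus the ID of the
# subject each record was taken from (in practice these come from metadata).
rng = np.random.default_rng(0)
records = rng.normal(size=(1000, 64))          # 1000 records x 64 features
labels = rng.integers(0, 2, size=1000)         # sex label for each record
subject_ids = rng.integers(0, 100, size=1000)  # 100 distinct subjects

# Cross-subject split: each subject's records land entirely in train or test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(records, labels, groups=subject_ids))

# No subject appears on both sides of the split.
assert set(subject_ids[train_idx]).isdisjoint(set(subject_ids[test_idx]))
```

A naive record-level random split, by contrast, lets different records from the same subject appear in both sets, which inflates the measured accuracy relative to what the model would show in production.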

It is also important to anticipate in advance who the audience for the results will be and which methodology they will consider familiar. For example, our research team intends to compare the results we obtain with similar studies. Neuroscience Software has publications and patents planned, so our focus remains not only on our internal audience but also on the wider scientific community. A review of publications on Deep Learning with EEG revealed that researchers in this field, as a rule, have little data (from tens to thousands of patients). With data that scarce, cross-validation is the norm, and results reported with 10-fold cross-validation are likely to be seen as the most credible.

The speed at which experiments run is also essential to the success of the project. Obviously, the results of the study will be more reliable if a data scientist can run a dozen training-validation cycles per day rather than one. However, this requirement is at odds with the previous one, which pushes toward more folds; it argues instead for reducing the number of folds or for abandoning cross-validation in favor of a train/dev/test split. Before announcing our decision, let's consider one more aspect of a validation strategy.

The validation strategy should also help control under- and overfitting of the model. To control the latter, separate development (DEV) and test datasets are required. Data scientists usually run many internal experiments against the DEV dataset to work out the best decisions on data preparation algorithms, neural network architecture, and hyperparameters. We typically conduct anywhere from ten to several hundred such experiments, which inevitably leads to some overfitting to the DEV dataset. Evaluation on the test dataset is performed only for the final version of the model and detects this overfitting, if any.

Given the considerations above, rather than choosing once and for all between multi-fold cross-validation, a train-dev-test split, and fast experiment execution, we use different validation strategies for external and internal tasks: one for reporting final results and one for day-to-day development.

 

Our Validation Strategy to Report Final Results

 

We use 10-fold cross-subject cross-validation to report the final results. In each cycle, we use 10% of the data (one fold) as the test set. Over the ten cycles, this test fold “runs through” all the data we have, which gives the most informative assessment of model quality because the metric is calculated for every EEG record available to us. When training the model, we use the “Early stopping” and “Reduce on plateau” techniques. To avoid overfitting to the test fold, we set one more fold aside from the training data and use it to make the Early stopping and Reduce on plateau decisions. Thus, this DEV fold participates in training, but the model weights are never updated on it.
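The loop below is a schematic of this reporting procedure, written with scikit-learn's GroupKFold purely for illustration; run_reporting_cv and train_and_eval_fn are hypothetical names, and the actual training code (including the early-stopping logic) is assumed to live behind the train_and_eval_fn callback.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def run_reporting_cv(records, labels, subject_ids, train_and_eval_fn):
    """10-fold cross-subject CV: in each cycle one fold is the test set and
    one fold carved out of the remaining data serves as the DEV fold for the
    Early stopping / Reduce on plateau decisions.

    `train_and_eval_fn` is a hypothetical callback that trains a model on the
    train indices, monitors the DEV indices, and returns a metric on test.
    """
    outer = GroupKFold(n_splits=10)
    scores = []
    for trainval_idx, test_idx in outer.split(records, labels, groups=subject_ids):
        # Carve a DEV fold out of the remaining nine folds, again cross-subject.
        inner = GroupKFold(n_splits=9)
        train_rel, dev_rel = next(
            inner.split(records[trainval_idx], labels[trainval_idx],
                        groups=subject_ids[trainval_idx]))
        train_idx, dev_idx = trainval_idx[train_rel], trainval_idx[dev_rel]

        # DEV examples steer Early stopping / Reduce on plateau but never
        # contribute to weight updates; the test fold is scored only once.
        scores.append(train_and_eval_fn(train_idx, dev_idx, test_idx))
    return float(np.mean(scores)), float(np.std(scores))
```

If Keras were used inside such a callback, for example, the two named techniques correspond to the tf.keras.callbacks.EarlyStopping and tf.keras.callbacks.ReduceLROnPlateau callbacks, with the DEV fold passed to model.fit as validation_data.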

 

Our Development Strategy 

 

The validation strategy for the development phase is also based on a cross-subject split and divides all available data into two parts:

  • a separate test dataset containing 20% of the data;
  • the remaining 80%, which is used for 4-fold cross-validation.

Early stopping and Reduce on plateau decisions are made on the same fold on which the cross-validation metric is calculated.
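Under the same assumptions as the previous sketch (scikit-learn group-aware splitters and the hypothetical train_and_eval_fn callback), the development-phase procedure could look as follows; again, this is an illustration, not the project's actual code.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

def run_development_cv(records, labels, subject_ids, train_and_eval_fn):
    """Development-phase split: a 20% cross-subject holdout test set, plus
    4-fold cross-subject CV on the remaining 80%, where the validation fold
    of each cycle also drives Early stopping / Reduce on plateau."""
    holdout = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    pool_idx, test_idx = next(holdout.split(records, labels, groups=subject_ids))

    cv = GroupKFold(n_splits=4)
    cv_scores = []
    for train_rel, val_rel in cv.split(records[pool_idx], labels[pool_idx],
                                       groups=subject_ids[pool_idx]):
        train_idx, val_idx = pool_idx[train_rel], pool_idx[val_rel]
        # The validation fold both drives Early stopping / Reduce on plateau
        # and provides the cross-validation metric for this cycle.
        cv_scores.append(train_and_eval_fn(train_idx, val_idx, val_idx))

    # The holdout test indices are scored only once, for the final model.
    return float(np.mean(cv_scores)), test_idx
```

Because the per-cycle metric here comes from the same fold that steered early stopping, it is slightly optimistic; that is exactly why the 20% holdout test set is kept aside for the final check.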

A summary comparison of the two validation strategies:

  • Data split: a cross-subject split, which prevents session-level leakage (both strategies);
  • Folds: 10 folds (Reporting) vs. 4 folds plus a holdout test set (Development);
  • Quality of the estimate (expectation and confidence interval) of the measured metric: higher (Reporting) vs. lower (Development);
  • Duration of one experiment: longer (Reporting) vs. faster (Development);
  • Overfitting detection: exists in both strategies.

 

Next time, we will discuss the most extensive open EEG dataset, the Temple University EEG Corpus.

 
