Brainify.AI Reveals Data Leakage Issues in EEG Datasets at ISBI 2023

Apr 24, 2023 9:45:00 AM

Brainify.AI, a leading artificial intelligence research organization, recently presented the results of its study, "Data Leakage Problem in Large Multi-Site EEG Datasets," at the IEEE International Symposium on Biomedical Imaging (ISBI 2023) in Colombia. Ilya Zakharov, Senior Researcher at Brainify.AI, presented the team's findings.

The study focuses on data leakage in large-scale electroencephalogram (EEG) datasets, a problem that has been largely overlooked by the scientific community. Data leakage occurs when a machine learning (ML) model is trained on information that will not be available at prediction time or that is unrelated to the actual target, which produces misleadingly good results. The study demonstrates that a deep convolutional neural network (DCNN) can predict non-physiological information in EEG data, such as the recording location, with 99% accuracy.

Brainify.AI used data from several publicly available datasets, including TD-BRAIN, EMBARC, TUH, and CAN-BIND, totaling 15,000 participants and 26 different recording locations. All raw EEG recordings were preprocessed with the same automated algorithms and brought to a shared layout made up of the electrodes common to all datasets. A DCNN was then trained to predict the site of data collection. It was evaluated with 5-fold cross-subject cross-validation, with participants' age and sex controlled for.
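To make the two preparation steps above more concrete, here is a minimal Python sketch of finding the electrodes shared by several montages and of splitting recordings into cross-subject folds, so that no participant appears in both the training and the test set. It is an illustration only, not the study's code: the channel lists, subject IDs, and function names (`common_channels`, `cross_subject_folds`) are hypothetical, and it assumes channel names are already spelled consistently across datasets.

```python
import random


def common_channels(montages):
    """Return the channel names present in every montage, in the order of the first one."""
    shared = set(montages[0]).intersection(*montages[1:])
    return [ch for ch in montages[0] if ch in shared]


def cross_subject_folds(subject_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no subject is on both sides."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    # Assign whole subjects (not individual recordings) to folds.
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    for k in range(n_folds):
        test = [i for i, s in enumerate(subject_ids) if fold_of[s] == k]
        train = [i for i, s in enumerate(subject_ids) if fold_of[s] != k]
        yield train, test


# Hypothetical montages from three datasets (illustrative only).
montages = [
    ["Fp1", "Fp2", "Cz", "Pz", "O1", "O2"],
    ["Fp1", "Fp2", "Fz", "Cz", "Pz"],
    ["Fp1", "Cz", "Pz", "O1", "T7"],
]
shared = common_channels(montages)

# Hypothetical recordings: 10 subjects with 3 recordings each.
subject_ids = [f"sub-{i // 3:02d}" for i in range(30)]
folds = list(cross_subject_folds(subject_ids))
```

Splitting by subject rather than by recording matters here: if recordings from one person land on both sides of the split, the model can score well by recognizing the person. In practice, a library utility such as scikit-learn's `GroupKFold` provides a similar grouping guarantee.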

The model's high accuracy (99%) in predicting the exact recording location, even after controlling for demographic characteristics, shows how serious data leakage can be in large-scale, multi-site EEG datasets. It indicates that non-physiological information remains in the EEG signal even after advanced artifact rejection pipelines have been applied, which increases the likelihood of false discoveries and reduces the performance of cross-dataset prediction.
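The idea of probing data for site information can be illustrated with a toy example. The sketch below is not the study's DCNN and uses no EEG: it generates synthetic per-subject feature vectors, optionally adds a site-specific offset (a stand-in for non-physiological differences such as hardware), and checks whether a simple nearest-centroid classifier can recover the site on held-out subjects. All names, sizes, and parameter values are assumptions chosen for illustration.

```python
import random
from statistics import mean


def make_synthetic(n_sites=3, subjects_per_site=40, n_features=4, offset=3.0, seed=0):
    """One feature vector per subject: unit Gaussian noise plus a per-site offset."""
    rng = random.Random(seed)
    X, sites = [], []
    for site in range(n_sites):
        shift = offset * site  # hypothetical site signature; 0.0 means no site effect
        for _ in range(subjects_per_site):
            X.append([shift + rng.gauss(0.0, 1.0) for _ in range(n_features)])
            sites.append(site)
    return X, sites


def nearest_centroid_accuracy(X, y, train_idx, test_idx):
    """Fit per-class centroids on the training rows and score on the test rows."""
    by_class = {}
    for i in train_idx:
        by_class.setdefault(y[i], []).append(X[i])
    centroids = {c: [mean(col) for col in zip(*rows)] for c, rows in by_class.items()}

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    hits = sum(predict(X[i]) == y[i] for i in test_idx)
    return hits / len(test_idx)


# Each row is a different subject, so an index split is also a subject split.
X_leaky, sites = make_synthetic(offset=3.0)
train_idx = list(range(0, len(X_leaky), 2))
test_idx = list(range(1, len(X_leaky), 2))
acc_leaky = nearest_centroid_accuracy(X_leaky, sites, train_idx, test_idx)

# Same setup without any site-specific offset, as a reference point.
X_clean, _ = make_synthetic(offset=0.0)
acc_clean = nearest_centroid_accuracy(X_clean, sites, train_idx, test_idx)

chance = 1 / 3  # three equally sized sites
print(f"site accuracy with offset: {acc_leaky:.2f}")
print(f"site accuracy without offset: {acc_clean:.2f} (chance is about {chance:.2f})")
```

If a probe like this predicts the site far above chance on real, preprocessed data, any model trained on the pooled dataset could be exploiting the same site signature instead of physiology.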

Brainify.AI's study underlines the urgent need for new tools to remove potential sources of data leakage from EEG data and harmonize data in large-scale EEG projects. These advances are crucial for the identification of clinically relevant biomarkers of various diseases, including psychiatric, neurological, and neurodevelopmental disorders, and for the broader application of modern machine learning approaches in neuroscience.