Harmonize Project: Reproducible Scripts for "Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites"

About

The Forschungszentrum Jülich Machine Learning Library

It is currently developed and maintained by the Applied Machine Learning group at Forschungszentrum Jülich, Germany.

Overview

This repository contains all scripts and resources needed to reproduce the experiments presented in the paper "Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites." The paper explores the effectiveness of data harmonization methods, particularly in scenarios where class balance differs across data collection sites, and proposes the PrettYharmonize approach to address data leakage issues. Using this repository, researchers can replicate the study results, perform experiments on synthetic and real-world datasets, and validate the PrettYharmonize pipeline.

Paper Link: https://arxiv.org/abs/2410.19643
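
To make the notion of leakage concrete, here is a minimal sketch of a toy location-only harmonization (per-site mean removal) that is fitted on the training split and then applied unchanged to the held-out split. This is not the PrettYharmonize method and not code from this repository; the helper names `fit_site_means` and `remove_site_means` and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np


def fit_site_means(X, sites):
    """Estimate one mean vector per site from the given samples only."""
    return {int(s): X[sites == s].mean(axis=0) for s in np.unique(sites)}


def remove_site_means(X, sites, site_means):
    """Subtract each sample's site mean (toy location-only harmonization)."""
    X_out = X.astype(float).copy()
    for s, mu in site_means.items():
        X_out[sites == s] -= mu
    return X_out


# Synthetic data: two sites, the second one has an additive offset.
rng = np.random.default_rng(0)
n = 200
sites = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 3)) + sites[:, None] * 2.0

train = np.arange(n) < 150
test = ~train

# Leakage-free: parameters are estimated on the training split only ...
means_train = fit_site_means(X[train], sites[train])
X_train_h = remove_site_means(X[train], sites[train], means_train)
# ... and then applied unchanged to the held-out split.
X_test_h = remove_site_means(X[test], sites[test], means_train)

# Leaky variant, shown only for contrast: the parameters see test samples too.
means_all = fit_site_means(X, sites)
```

The only point of the sketch is the order of operations: anything estimated by the harmonization step should come from training data, so that the test split stays unseen until evaluation.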

Repository Structure:
data/ - Directory for input datasets (user-provided)
data_preprocessing/ - Scripts for raw data preprocessing to generate the site-target dependence and independence scenarios
scripts/ - Main analysis scripts for all experiments
output/ - Results will be stored here
plots/ - Notebooks for figure generation
pyproject.toml - Environment configuration
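
As a toy illustration of the two scenarios named above, the sketch below draws binary labels for two sites, once with the same class balance in both sites (site-target independence) and once with a different class balance per site (site-target dependence). It uses synthetic labels only; the function `sample_labels` and the chosen rates are hypothetical and do not reproduce the logic of the scripts in data_preprocessing/.

```python
import numpy as np


def sample_labels(sites, positive_rate_per_site, seed=0):
    """Draw binary labels whose positive rate depends on each sample's site."""
    rng = np.random.default_rng(seed)
    rates = np.array([positive_rate_per_site[int(s)] for s in sites])
    return (rng.random(len(sites)) < rates).astype(int)


sites = np.repeat([0, 1], 500)  # two sites, 500 samples each

# Independence: both sites share the same class balance.
y_indep = sample_labels(sites, {0: 0.5, 1: 0.5})
# Dependence: the class balance differs between sites.
y_dep = sample_labels(sites, {0: 0.2, 1: 0.8})

# Report the empirical positive rate per site for each scenario.
for name, y in [("independent", y_indep), ("dependent", y_dep)]:
    per_site = [round(float(y[sites == s].mean()), 2) for s in (0, 1)]
    print(name, per_site)
```

In the dependent case the site label alone already carries information about the target, which is the situation the paper studies.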

Requirements

The environment can be installed using the pyproject.toml file in this repository.

Installation

Clone the repository:

git clone https://github.com/juaml/harmonize_project.git
cd harmonize_project

Create/activate virtual environment:

Linux/Mac: python -m venv venv && source venv/bin/activate
Windows: python -m venv venv && venv\Scripts\activate

Install dependencies:

pip install .
For development: pip install -e ".[dev]"

Install PrettYharmonize:

pip install git+https://github.com/juaml/PrettYharmonize

Download data

Data must be downloaded by the user and stored in the respective folders inside data/.
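
Before running the pre-processing, it can help to confirm that each folder inside data/ actually contains files. The helper below is a small sketch for that purpose; `summarize_data_dir` is a hypothetical name, and the function makes no assumption about which dataset folders or file formats the repository expects.

```python
from pathlib import Path


def summarize_data_dir(data_dir="data"):
    """Map each subfolder of data_dir to the number of files it contains."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Expected a data directory at {root.resolve()}")
    return {
        sub.name: sum(1 for p in sub.rglob("*") if p.is_file())
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }
```

Calling `summarize_data_dir()` from the repository root returns a dictionary such as folder name to file count; a count of zero points to a dataset that still needs to be downloaded.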

Pre-processing data

The data must be pre-processed using the scripts contained in data_preprocessing/.

Run scripts

The code for the classification and regression experiments in the site-target dependence and independence scenarios is stored in scripts/.

Plot

You can replicate the figures from the results stored in output/ using the notebooks in plots/.

Citation

If you use PrettYharmonize in your work, please cite the following:
@article{nieto2024impact,
  title={Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites},
  author={Nieto, Nicol{\'a}s and Eickhoff, Simon B and Jung, Christian and Reuter, Martin and Diers, Kersten and Kelm, Malte and Lichtenberg, Artur and Raimondo, Federico and Patil, Kaustubh R},
  journal={arXiv preprint arXiv:2410.19643},
  year={2024}
}

Licensing

PrettYharmonize is released under the AGPL v3 license:

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.
