Repository for the paper *Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection*
In delayed settings, is instance incremental learning the best option regarding predictive performance and computational efficiency?
Create a new Python environment and install the requirements:
pip install -r requirements.txt
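For example, using Python's built-in `venv` module (the environment name `.venv` is illustrative; any name works):

```shell
# Create and activate a fresh virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install the project dependencies into it
pip install -r requirements.txt
```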
- Clone this repository on your machine
- Run the Notebook for reproducing results
NB: to use the notebook, you will need Jupyter installed in the Python environment you created (using pip, for example)
- First, download the datasets of interest using the link here and place them in the Datasets folder.
- For example, run:
python main.py --n_delays 0 1000 70000 --static_optim_ntrial 30 --model_name DT --dataset_name sea_g --init_fit_ratio 0.1 --n_windows 10000
- n_delays: the average label delays; delays are generated following Poisson distributions with means 0, 1000, and 70000 respectively
- static_optim_ntrial: the number of trials for tuning hyperparameters offline (before the stream evaluation)
- model_name: the model name (e.g., DT is for Decision Tree)
- dataset_name: the dataset name (sea_g here)
- init_fit_ratio: the fraction of the dataset used for the offline optimization (0.1 in the above example)
- n_windows: average number of instances (following a Poisson distribution) in each evaluation batch (10000 for this example)
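To make the delay mechanism concrete, here is a minimal sketch of how Poisson-distributed label delays could be simulated (this is an illustration of the idea described above, not the repository's actual implementation; variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

# Each arriving instance i gets its true label delays[i] time steps
# after arrival, with delays[i] drawn from a Poisson distribution
# whose mean matches the chosen n_delays value (here 1000).
mean_delay = 1000
n_instances = 5
delays = rng.poisson(lam=mean_delay, size=n_instances)

# If instance i arrives at time step i, its label becomes available at:
label_arrival = np.arange(n_instances) + delays
print(delays)
print(label_arrival)
```

With `mean_delay = 0`, every draw is 0 and labels arrive immediately, which recovers the standard non-delayed streaming setting.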