major refactoring: process and split data properly, save splitted data #73

MinaAlmasi · 2024-10-14T09:42:20Z

Summary

1. Cleaned the AI datasets before combining them with the human datasets to streamline the pipeline, dropping any generations below the minimum length before extracting metrics (closes issue "Refactoring: Weird steps in the pipeline (cleaning at various steps that could be streamlined)" #61).
2. Re-extracted metrics on the cleaned AI datasets and computed perplexity with GPT-2 on both human and AI data (the GPT-2 run took 5-8 hours).
3. Deleted scripts in src/utils (e.g., process_generations and process_metrics) that were used EVERYWHERE; this processing is now done ONCE (see point 4).
4. Moved the functionality of the src/utils/process_X scripts into scripts in src/make_dataset, where both the text and metrics datasets are created with identical train, val, test splits and saved to datasets_complete (closes issue "Create script to split data for classifiers and save to folder" #70).
5. Added check_datasets to verify that make_dataset/text_dataset.py and make_dataset/metrics_dataset.py produce identical splits in terms of ID/MODEL (important for the comparison between XGBOOST and TFIDF).
6. Renamed datasets to datasets_files to complement the newly added datasets_complete folder, distinguishing the individual files (generated text and extracted metrics) from the combined, split data (train/val/test text and metrics).
7. Updated the PCA and classifier scripts (src/pca and src/classify) to handle the new datasets.
8. Plotted loadings for ALL 64 principal components.
9. Created new heatmaps (metrics/heatmaps.py) for the ECHO presentation.
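
For reviewers, points 4 and 5 can be illustrated with a minimal sketch. This is not the code in src/make_dataset or check_datasets. The function names, the row layout (dicts with `id` and `model` keys), the split fractions, and the seed are all assumptions made for illustration, and the sketch uses only the standard library. The idea is to assign each unique ID to one split, apply that same assignment to both the text and the metrics rows, and then confirm that the (ID, MODEL) pairs agree in every split.

```python
import random


def assign_splits(ids, val_frac=0.1, test_frac=0.1, seed=42):
    """Map every unique id to exactly one of 'train', 'val' or 'test'."""
    unique_ids = sorted(set(ids))  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(unique_ids)
    n_test = int(len(unique_ids) * test_frac)
    n_val = int(len(unique_ids) * val_frac)
    assignment = {}
    for position, uid in enumerate(unique_ids):
        if position < n_test:
            assignment[uid] = "test"
        elif position < n_test + n_val:
            assignment[uid] = "val"
        else:
            assignment[uid] = "train"
    return assignment


def split_rows(rows, assignment):
    """Group rows by split; all rows sharing an id land in the same split."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[assignment[row["id"]]].append(row)
    return splits


def splits_match(text_splits, metrics_splits):
    """Return True if both datasets hold the same (id, model) pairs per split."""
    for name in ("train", "val", "test"):
        text_keys = sorted((r["id"], r["model"]) for r in text_splits[name])
        metrics_keys = sorted((r["id"], r["model"]) for r in metrics_splits[name])
        if text_keys != metrics_keys:
            return False
    return True


# Toy rows (hypothetical layout): one human text and two generations per id.
toy_ids = [f"doc{i}" for i in range(20)]
toy_models = ["human", "llama", "beluga"]
toy_text = [{"id": i, "model": m, "text": "..."} for i in toy_ids for m in toy_models]
toy_metrics = [{"id": i, "model": m, "n_tokens": 0} for i in toy_ids for m in toy_models]

toy_assignment = assign_splits(toy_ids)
print(splits_match(split_rows(toy_text, toy_assignment),
                   split_rows(toy_metrics, toy_assignment)))
```

Because the assignment is made per ID and not per row, the human text and the llama and beluga generations for the same ID cannot end up in different splits, which is the property the TFIDF versus XGBOOST comparison relies on.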

Future

New feats:

Reverse-engineer feature importances
SHAP values

Refactoring:

Consider also saving PCA splits (right now they are transformed prior to classification)
Move more files from src/utils to their respective folder if not used in multiple places


MinaAlmasi added 30 commits September 11, 2024 15:05

ci: ign. big complete datasets for now

13563cc

feat: adapt process_generations to create a big textual dataset (hope…

40370d6

… is to completely get rid of utils/process_generations)

feat: mv split_data fn from classify

ce05f93

fix: update path to complete_datasets / "text"

8a50fbd

refactor: delete as functionality is moved to make_datasets/text_dataset

0379868

fix: add drop lengths to text_dataset

fd27607

ci: ign complete_datasets content (temp)

850ae57

refactor: mv raw data to "raw data" folder

914d0a6

refactor: update paths to fit new raw_datasets folder

0ab2144

refactor: mv files to "individual_files"

9f6cfd5

refactor: change folder name again

5021a0f

refactor: update path to datasets_files

5f6f8ac

feat: save cleaned AI data

4c9ea30

fix: update path to "datasets_files"

47fd590

refactor: black formatting, isort, add docstring

905ec81

refactor: black formatting/isort

064f57d

refactor: load in cleaned data, black/isort formatting, set compute_p…

f9daf34

…erplexity = true

refactor: mv clean_ai_data out of clean (for human data) + add drop l…

f0e0b29

…engths to this

fix: change from wrong "to_csv" to "to_json"

354c6ff

fix: add lines=True and orient="records" to save not as ONE line but …

8eec1cd

…several

feat: make text dataset from already CLEANED ai data (simplifies work…

40cf435

…flow, and standardises across pipelines)

refactor: rm unused imports

35abf63

feat: clean 1.5 temp also

acb63ef

feat: add file to create metrics dataset

71b7220

fix: save to temp folder, do not sort ai paths

341f29c

feat: add check dataset to ensure metrics and text are the same

f49f542

feat: add a "check datasets" to check whether metrics and text datase…

6f2f99f

…ts match (important for tf-idf versus metrics clf)

fix: bugfix - set a maximum length if index error is given (sometimes…

03601c0

… the eval fails if the text is too long, in those cases e.g., dailymail + stories, we truncate to 1024 and compute perplexity from that)

feat: extract new metrics with perplexity compute by gpt-2

233c5a9

refactor: mv pca out of classify folder, adapt to new metrics dataset…

d50ebdc

… complete

MinaAlmasi added 24 commits October 5, 2024 17:48

refactor: move pca results of classify folder

9e69169

refactor: add cols to drop in utils to simplify code

b02f573

feat: save scaler, mv pca pipeline to run_pca from utils, keep pca_pl…

345f416

…ots in utils for now

feat: update script with new setup (transforming val data based on fi…

60002c2

…tted scaler and pca)

feat: adapt clf to new format with metrics_dataset and text_dataset, …

aa735b2

…and loading a fitted PCA and scaler, black/isort formatting

feat: new results with new clf workflow + add new tables

1fe2160

feat: plot loadings for all PC components, add title

74e7eed

feat: plot all 64 pc comps, add tqdm to plotting

66a13fe

refactor: delete old heatmaps, mv heatmaps to metrics, adapt to new s…

0a92247

…ystem

refactor: delete old misc that is no longer needed

173dff2

fix: adjust font size for plots, add black/isort formatting to script

f11f4bd

feat: add loadings matrix

3c99603

add datasets_complete to github (with no files, but instrution of setup)

ba17bc6

refactor: rm old code (functionality is moved to make_dataset/metrics…

36fda0f

…_dataset.py)

refactor: delete old code that does not work anymore

c86f008

style: black formatting

e39f83e

feat: update scripts to keep IDS in the same split (i.e., human, llam…

77f3f27

…a, beluga generations for the same id are in the same split).

refactor: delete split_data (functionality in "make_datasets" now)

adb5f3d

feat: add run script

908c58a

fix: update only for temp1 (with new metrics dataset)

291494b

feat: new results with new splitting (keeping same ids)

18c98ef

feat: add table-making to classify/run.sh

d442970

fix: rm redundant "unique_id" col from cols_to_drop

9de039e

feat: new results with re-run PCA (pca was not re-run previously)

42d7aa2

MinaAlmasi changed the title ~~major refactoring: split data properly~~ major refactoring: process and split data properly, save splitted data Oct 14, 2024

MinaAlmasi merged commit 0cc3ff5 into main Oct 14, 2024
