major refactoring: process and split data properly, save splitted data #73

Merged
merged 55 commits into main
Oct 14, 2024

Conversation

MinaAlmasi
Collaborator

@MinaAlmasi MinaAlmasi commented Oct 14, 2024

Summary

  1. Cleaned the AI datasets before combining them with the human datasets to streamline the pipeline, dropping any generations below the minimum length before extracting metrics (closes issue #61, "Refactoring: Weird steps in the pipeline (cleaning at various steps that could be streamlined)")
  2. Re-extracted metrics on the cleaned AI datasets and computed perplexity with GPT-2 on both human and AI data (the GPT-2 run took 5-8 hours)
  3. Deleted scripts in src/utils (e.g., process_generations and process_metrics) that were used EVERYWHERE; this processing is now done ONCE (see point 4)
  4. Moved the functionality of the src/utils/process_X scripts into scripts in src/make_dataset, where both text and metrics datasets are created with identical train, val, and test splits and saved to datasets_complete (closes issue #70, "Create script to split data for classifiers and save to folder")
  5. Added check_datasets to verify that make_dataset/text_dataset.py and make_dataset/metrics_dataset.py produce identical splits in terms of ID/MODEL (important for the comparison between XGBOOST and TFIDF)
  6. Renamed datasets to datasets_files to complement the newly added datasets_complete folder, distinguishing the individual files (generated text and extracted metrics) from the combined, split data (train/val/test text and metrics)
  7. Updated PCA and classifier scripts (src/pca and src/classify) to handle the new datasets
  8. Plotted loadings for ALL 64 principal components
  9. Created new heatmaps (metrics/heatmaps.py) for the ECHO presentation
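
For reviewers, the idea behind points 4 and 5 can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the hash-based assignment, the function names (`assign_split`, `split_rows`, `check_identical_splits`), the row layout with `id` and `model` keys, and the 80/10/10 fractions are all assumptions made for the example. It assigns every row to a split based only on its ID, so all generations for the same ID land in the same split, and then checks that the text and metrics datasets contain the same (ID, MODEL) pairs per split.

```python
import hashlib


def assign_split(sample_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map an ID to 'train', 'val' or 'test' deterministically from its hash.

    Hypothetical helper: the split depends only on the ID, so every row that
    shares an ID (human text and all model generations) gets the same split.
    """
    digest = hashlib.sha256(sample_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


def split_rows(rows):
    """Group rows (dicts with at least an 'id' key) into train/val/test lists."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[assign_split(row["id"])].append(row)
    return splits


def split_keys(splits):
    """Reduce each split to its set of (id, model) pairs."""
    return {
        name: {(row["id"], row["model"]) for row in rows}
        for name, rows in splits.items()
    }


def check_identical_splits(text_splits, metrics_splits) -> bool:
    """True if both datasets hold the same (id, model) pairs in every split."""
    return split_keys(text_splits) == split_keys(metrics_splits)
```

Because the assignment is a pure function of the ID, running it separately on the text rows and the metrics rows yields matching splits without sharing any state between the two scripts; the check then guards against rows that were dropped in one dataset but not the other.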

Future

New feats:

  1. Reverse-engineer feature importances
  2. SHAP values

Refactoring:

  1. Consider also saving PCA splits (right now they are transformed prior to classification)
  2. Move more files from src/utils to their respective folders if they are not used in multiple places

… is to completely get rid of utils/process_generations)
…ts match (important for tf-idf versus metrics clf)
… the eval fails if the text is too long, in those cases e.g., dailymail + stories, we truncate to 1024 and compute perplexity from that)
…and loading a fitted PCA and scaler, black/isort formatting
…a, beluga generations for the same id are in the same split).
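
The truncation mentioned in the commit notes above (cutting overly long texts to 1024 tokens before computing perplexity) can be sketched as follows. This is a hedged illustration rather than the PR's implementation: `truncated_perplexity` and `log_prob_fn` are hypothetical names, and the model call is left abstract as a function that returns one natural-log probability per scored token. The only facts relied on are that GPT-2's context window is 1024 tokens and that perplexity is the exponential of the negative mean token log-probability.

```python
import math
from typing import Callable, Sequence

GPT2_MAX_TOKENS = 1024  # GPT-2's context window


def truncated_perplexity(
    token_ids: Sequence[int],
    log_prob_fn: Callable[[Sequence[int]], Sequence[float]],
    max_tokens: int = GPT2_MAX_TOKENS,
) -> float:
    """Perplexity over at most `max_tokens` tokens.

    `log_prob_fn` is a stand-in for the language model (an assumption of this
    sketch): it receives the truncated token ids and returns the natural-log
    probability of each token it scored.
    """
    window = list(token_ids[:max_tokens])  # keep only the first max_tokens tokens
    if not window:
        raise ValueError("cannot compute perplexity of an empty sequence")
    log_probs = log_prob_fn(window)
    if len(log_probs) == 0:
        raise ValueError("log_prob_fn returned no scores")
    # Perplexity = exp(-mean log-probability).
    return math.exp(-sum(log_probs) / len(log_probs))
```

Texts shorter than the limit pass through unchanged, so the same function can be applied to every dataset and only the long ones (e.g., dailymail and stories) are affected.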
@MinaAlmasi MinaAlmasi changed the title from "major refactoring: split data properly" to "major refactoring: process and split data properly, save splitted data" Oct 14, 2024
@MinaAlmasi MinaAlmasi merged commit 0cc3ff5 into main Oct 14, 2024