PootieT/malawi_news_classification

malawi_news_classification

Low resource text classification

Welcome to the repo for our final class project for CS 505 (NLP). In this project we tackle the Malawi News Classification dataset. Within a limited time span, we tested a few data augmentation techniques, created and fine-tuned better embedding spaces with Transformer-based models, and applied some data science techniques to boost performance in feature space.

See project presentation here

Baseline models:

We tested the following baseline classifiers:

  • Support Vector Machines
  • Random Forests
  • XGBoost
  • Multi-layer Perceptron
  • Logistic Regression

To get classification results from all the models:

    python3.9 experiments/main.py -<data_dir> -<embedding_file>

  • data_dir : Directory where the training data (text) is located
  • embedding_file : Name of the embedding file
  • The results will be generated as a CSV file in this location

Data Augmentation methods:

  • Mixup - Script

       python mixUp.py -<train_data_dir> -<embeddings type>
    • "Embeddings type" is the kind of embeddings to use when augmenting the data
    • Mixup-augmented data will be generated in this location
  • NLPAug - Script

  • Manual News Scraping - Data
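
As a rough illustration of the Mixup idea listed above, the sketch below blends pairs of embedding vectors and their one-hot labels with a weight drawn from a Beta distribution. It is not taken from mixUp.py: the function names, the `alpha` value, and the toy vectors are all assumptions made for the example.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two (embedding, one-hot label) pairs.

    The mixing weight lam is drawn from Beta(alpha, alpha); alpha=0.2 is an
    assumed default, not a value taken from the repository.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


def augment(embeddings, labels, n_new, alpha=0.2, seed=0):
    """Create n_new synthetic examples by mixing random pairs of real ones."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick two distinct training examples to interpolate between.
        i, j = rng.sample(range(len(embeddings)), 2)
        x, y, _ = mixup(embeddings[i], labels[i], embeddings[j], labels[j], alpha, rng)
        synthetic.append((x, y))
    return synthetic


# Toy 2-dimensional "embeddings" with two classes (hypothetical data).
toy_embeddings = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
toy_labels = [[1, 0], [0, 1], [1, 0]]
for vec, soft_label in augment(toy_embeddings, toy_labels, n_new=3):
    print(vec, soft_label)
```

Because the labels are mixed with the same weight as the inputs, each synthetic example carries a soft label, so a classifier trained on it needs a loss that accepts label distributions rather than hard class indices.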

Types of embedding methods used:

  • Count Vectorization
  • TFIDF
  • English aligned Chichewa MT5 embeddings - Script
    python train_mt5_contrastive.py
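
To make the first two embedding types concrete, here is a minimal pure-Python sketch of count vectorization followed by TF-IDF weighting. It is only an illustration under stated assumptions: tokenization is a plain whitespace split, the IDF uses the smoothed form log((1 + N) / (1 + df)) + 1 with L2-normalized rows, and the function names and toy sentences are hypothetical. The repository's own feature extraction may differ.

```python
import math
from collections import Counter


def count_vectorize(docs):
    """Return (vocabulary, count matrix) using lowercase whitespace tokens."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        counts.append(row)
    return vocab, counts


def tfidf(counts):
    """Weight raw counts by smoothed IDF, then L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    weighted = []
    for row in counts:
        vec = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
        weighted.append([v / norm for v in vec])
    return weighted


# Toy documents (hypothetical, not from the dataset).
vocab, counts = count_vectorize(["rain floods farms", "farms need rain rain"])
print(vocab)
print(counts)
print(tfidf(counts))
```

Terms that occur in fewer documents receive a larger IDF, so within a document they outweigh equally frequent terms that appear everywhere.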

Parallel RealNews Subset

For our alignment experiment, we created our own parallel news dataset. To recreate this data:

  1. Download the RealNews dataset from the GROVER repo
  2. Split the files into smaller chunks for parallel translation (if running models), or into chunks small enough for Google Translate
    ./split_file_process_template.sh <input_path> <num_partition>
  3. Translate the files.
    1. If you are running on SCC and translating with the Marian English-Chichewa translation model, you can run
    qsub utils/run_translation_en_ny.qsub
    2. If you choose to use Google, the easiest free way is to convert the files into Excel sheets no bigger than 2 MB each and submit them manually. utils should have a file conversion script you may find helpful.
  4. Once you obtain the translation files (or check SCC /projectnb/cs505/projects/realnews), you can run alignment training with:
    python experiments/train_mt5_contrastive.py

(Make sure you modify the paths to the Chichewa and English files in the main section.)
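
For readers unfamiliar with contrastive alignment, the sketch below shows one common form of the objective: an in-batch InfoNCE loss that rewards an English sentence embedding for being closer to its own Chichewa translation than to the other translations in the batch. This README does not describe the exact loss used in train_mt5_contrastive.py, so the choice of InfoNCE, the cosine similarity, the `temperature` value, and all names here are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(en_vecs, ny_vecs, temperature=0.1):
    """Mean cross-entropy of picking the true translation among in-batch candidates.

    en_vecs[i] and ny_vecs[i] are assumed to embed the same sentence in
    English and Chichewa; every other ny_vecs[j] acts as a negative.
    """
    total = 0.0
    for i, en in enumerate(en_vecs):
        logits = [cosine(en, ny) / temperature for ny in ny_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(en_vecs)


# Toy embeddings (hypothetical): a well-aligned batch versus a shuffled one.
english = [[1.0, 0.0], [0.0, 1.0]]
chichewa_aligned = [[1.0, 0.0], [0.0, 1.0]]
chichewa_shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(english, chichewa_aligned))
print(info_nce(english, chichewa_shuffled))
```

The loss is small when each pair of translations sits close together in embedding space and large when the pairs are mismatched, which is the signal an alignment objective of this kind trains on.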
