malawi_news_classification

Low resource text classification

Welcome to the repo for the final class project for CS 505 (NLP). In this project we tackle the Malawi News Classification dataset. Within a limited time span, we tested several techniques: data augmentation, building or fine-tuning a better embedding space with Transformer-based models, and some data science techniques to boost performance in feature space.

See project presentation here

Baseline models:

The following baseline models are included:

- Support Vector Machines
- Random Forests
- XGBoost
- Multi-layer Perceptron
- Logistic Regression

To generate classification results for all the models:

    python3.9 experiments/main.py -<data_dir> -<embedding_file>

- data_dir: directory where the training data (text) is located
- embedding_file: name of the embedding file
- The results will be generated as a CSV file in this location

Data Augmentation methods:

- Mixup - Script
  ```
  python mixUp.py -<train_data_dir> -<embeddings type>
  ```
  - "Embeddings type" is the kind of embeddings to use when augmenting the data.
  - Mixup-augmented data will be generated in this location.
- NLPAug - Script
  - NLPAug Description
- Manual News Scraping - Data
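
As a rough illustration of the Mixup idea listed above, here is a minimal sketch in plain Python. It is not the code in mixUp.py: the function names, the `alpha` parameter, and the choice to pair each example with a randomly shuffled partner are assumptions made for illustration. Each augmented example is a convex combination of two embeddings and of their one-hot labels, with the mixing weight drawn from a Beta distribution.

```python
import random


def one_hot(label, num_classes):
    """Turn an integer class label into a one-hot list of floats."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Blend two (embedding, soft label) pairs with weight lam in [0, 1]."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


def mixup_dataset(embeddings, labels, num_classes, alpha=0.2, seed=0):
    """Create one mixed example per input example (hypothetical helper).

    Assumption: each example is paired with a randomly shuffled partner
    and lam is sampled from Beta(alpha, alpha).
    """
    rng = random.Random(seed)
    targets = [one_hot(label, num_classes) for label in labels]
    partners = list(range(len(embeddings)))
    rng.shuffle(partners)
    mixed = []
    for i, j in enumerate(partners):
        lam = rng.betavariate(alpha, alpha)
        mixed.append(mixup_pair(embeddings[i], targets[i],
                                embeddings[j], targets[j], lam))
    return mixed
```

The mixed labels are soft (they sum to 1 but are not one-hot), so a classifier trained on them needs a loss that accepts probability targets, or the labels need to be converted back to hard classes.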

Types of embedding methods used:

- Count Vectorization
- TF-IDF
- English-aligned Chichewa MT5 embeddings - Script
  ```
  python train_mt5_contrastive.py
  ```
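
To give a feel for what contrastive alignment of English and Chichewa embeddings can look like, here is a small sketch of an InfoNCE-style loss in plain Python. This is an assumption about the general technique, not the loss implemented in train_mt5_contrastive.py, which may differ. The function names and the `temperature` value are illustrative. The loss is low when each English sentence vector is most similar to its own Chichewa translation within the batch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce_loss(en_vecs, ny_vecs, temperature=0.1):
    """Mean cross-entropy of picking the true translation in the batch.

    en_vecs[i] and ny_vecs[i] are assumed to embed the same sentence in
    English and Chichewa; every other ny_vecs[j] acts as a negative.
    """
    total = 0.0
    for i, en in enumerate(en_vecs):
        logits = [cosine(en, ny) / temperature for ny in ny_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(en_vecs)
```

In a real training loop the vectors would come from the encoder and the loss would be computed with an autodiff framework so gradients can flow back into the model.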

Parallel RealNews Subset

For our alignment experiment, we created our own parallel news dataset. To recreate this data:

1. Download the RealNews dataset from the GROVER repo.
2. Split the files into smaller chunks for parallel translation (if running models), or into chunks small enough for Google Translate.
```
./split_file_process_template.sh <input_path> <num_partition>
```
3. Translate the files:
   1. If you are running on the SCC and translating with the Marian English-Chichewa translation model, you can run
```
qsub utils/run_translation_en_ny.qsub
```
   2. If you choose to use Google, the easiest free way is to convert the files into Excel sheets no bigger than 2 MB each and submit them manually. The utils directory should contain file conversion scripts you may find helpful.

Once you obtain the translation files (or check the SCC at /projectnb/cs505/projects/realnews), you can run alignment training with:

    python experiments/train_mt5_contrastive.py

Make sure you modify the paths to the Chichewa and English files in the main section of the script.
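
For the splitting step above, a size-capped chunking routine might look like the sketch below. It is not the logic of split_file_process_template.sh or of the conversion scripts in utils: the function name and the default cap are assumptions, and the UTF-8 byte count of the text is only a proxy for the size of the final Excel file.

```python
def chunk_lines_by_size(lines, max_bytes=2 * 1024 * 1024):
    """Group lines into consecutive chunks of at most max_bytes each.

    Sizes are measured as UTF-8 bytes. A single line longer than
    max_bytes is kept whole in its own chunk rather than being cut.
    """
    chunks, current, current_size = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8"))
        # Start a new chunk when adding this line would exceed the cap.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Because chunks are consecutive and lines are never reordered, the translated chunks can be concatenated in order to line up with the original English file.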
