CS-433 - Project 2 (Text Classification)

Folder Architecture

The project is organized into four main parts:

1. Initialization

All the code responsible for initializing the project is stored in the /src/datasets folder. The structure is as follows:

/src/datasets
│
├── build_vocab.sh
├── cooc.py
├── cut_vocab.sh
├── glove_solution.py
├── pickle_vocab.py
├── tweet_to_vector.py

2. Models

All the code for the different models is stored in the /src/models folder. The structure is as follows:

/src/models
│
├── averaged_embeddings_models
│   ├── GradientBoosting.py
│   ├── LogisticRegression.py
│   ├── NeuralNetwork.py
│   ├── SupportVectorMachine.py
│
├── sequenced_embeddings_models
│   ├── RecurrentNeuralNetwork.py

3. Utility Functions

All the code that manages file storage and loading is stored in the /src/utils folder. The structure is as follows:

/src/utils
│
├── dataloader.py
├── initialization.py
├── submission.py

4. Data Storage

All the files that contain data are stored in the /data folder. The structure is as follows:

/data
│
├── init                                // Generated by run.py
│   ├── cooc.pkl                        // Generated by run.py
│   ├── SGD_embeddings.npy              // Generated by run.py
│   ├── vocab_full.txt                  // Generated by run.py
│   ├── vocab_cut.txt                   // Generated by run.py
│   ├── vocab.pkl                       // Generated by run.py
│
├── submission                          // Generated by run.py
│   ├── <Model Name>_<Dataset Type>.csv // Generated by run.py
│
├── twitter-datasets                    // Unzipped twitter-datasets.zip
│   ├── sample_submission.csv
│   ├── test_data.txt
│   ├── train_neg_embedding.txt         // Generated by run.py
│   ├── train_neg_full_embedding.txt    // Generated by run.py
│   ├── train_neg_full.txt
│   ├── train_neg.txt
│   ├── train_pos_embedding.txt         // Generated by run.py
│   ├── train_pos_full_embedding.txt    // Generated by run.py
│   ├── train_pos_full.txt
│   ├── train_pos.txt
│
├── twitter-datasets.zip

Run Setup

As is, src/run.py generates the submission file that achieved the best score on aicrowd.com. Its parameters can easily be changed to train another model. The possible modifications are:

  • model_type - defaults to RecurrentNeuralNetwork; can be changed to GradientBoosting, LogisticRegression, SupportVectorMachine or NeuralNetwork
  • full_dataset - defaults to True; can be changed to False (recommended for a faster execution time)
  • force_generation - defaults to False; can be changed to True (not recommended)
  • Model Hyperparameters - every initialized model (e.g. model = GradientBoosting()) has default hyperparameters that can easily be changed.
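
As a rough illustration of how the parameters above fit together, here is a minimal sketch. It does not reproduce the actual contents of run.py: the `RunConfig` class and the `needs_generation` helper are hypothetical names, and the reading of force_generation as "regenerate files even if they already exist" is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path

# Model names listed in this README.
SUPPORTED_MODELS = (
    "RecurrentNeuralNetwork",
    "GradientBoosting",
    "LogisticRegression",
    "SupportVectorMachine",
    "NeuralNetwork",
)


@dataclass
class RunConfig:
    """Hypothetical container for the run.py parameters described above."""
    model_type: str = "RecurrentNeuralNetwork"
    full_dataset: bool = True
    force_generation: bool = False

    def __post_init__(self):
        # Reject model names that the README does not list.
        if self.model_type not in SUPPORTED_MODELS:
            raise ValueError(f"Unknown model_type: {self.model_type!r}")


def needs_generation(path, force_generation):
    """Regenerate a file when forced or when it does not exist yet (assumed behavior)."""
    return force_generation or not Path(path).exists()


# Faster configuration: logistic regression on the small dataset.
config = RunConfig(model_type="LogisticRegression", full_dataset=False)
```

Validating model_type up front means a typo fails immediately instead of after the lengthy file generation step.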

How to Run

To run the code and obtain the .csv submission files, follow these steps:

  1. Install the following Python libraries (pip install <library name>):
  • numpy
  • pandas
  • xgboost
  • scikit-learn
  • tensorflow
  • tqdm
  2. In your terminal, navigate to the /src directory and enter the following:
python run.py

Important Note 1: The first time this script is executed, all required files are generated and stored in the data folder. This may take a while.
Important Note 2: The default training run is very long (>10 hours) and was previously run on the EPFL SCITAS server. It is strongly recommended to modify the run.py script parameters to test the model on the small dataset.

  3. Once the script has finished, the .csv file should be stored in the data/submission folder. Submission files are named after the model and dataset used: <Model Name>_<Dataset Type>.csv
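
The naming pattern can be sketched as follows. This is an assumption-laden illustration, not the code in submission.py: the function name `submission_path` is hypothetical, and the "full" / "small" labels stand in for the <Dataset Type> values, which this README does not spell out.

```python
from pathlib import Path


def submission_path(model_name, full_dataset, submission_dir="data/submission"):
    """Build <Model Name>_<Dataset Type>.csv inside the submission folder.

    The "full" / "small" labels are assumed; the README does not list
    the actual <Dataset Type> values.
    """
    dataset_type = "full" if full_dataset else "small"
    return Path(submission_dir) / f"{model_name}_{dataset_type}.csv"


path = submission_path("LogisticRegression", full_dataset=False)
```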