Classifier that uses training data from https://data.world/crowdflower/gender-classifier-data to assign tweets to specific locations
This project aims to classify tweets using solely the content of the tweets (discounting information about the users involved). The project uses an RNN built on a multilayer bidirectional LSTM with 100-dimensional GloVe embeddings. It is primarily implemented using Torch and TensorFlow, with additional libraries for small implementation details.
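The architecture described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual model: the class name `TweetClassifier`, the hidden size, layer count, and dropout value are all assumptions. Only the 100-dimensional embedding size and the multilayer bidirectional LSTM come from the description.

```python
import torch
import torch.nn as nn


class TweetClassifier(nn.Module):
    """Hypothetical sketch: embedding -> stacked bidirectional LSTM -> linear layer."""

    def __init__(self, vocab_size, num_classes, embed_dim=100,
                 hidden_dim=128, num_layers=2, dropout=0.3, pad_idx=0):
        super().__init__()
        # embed_dim=100 matches the 100-dimensional GloVe vectors;
        # hidden_dim, num_layers, and dropout are assumed values.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True, dropout=dropout)
        # Forward and backward final states are concatenated, hence 2 * hidden_dim.
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # (num_layers * 2, batch, hidden_dim)
        # hidden[-2] and hidden[-1] are the top layer's forward and backward states.
        final = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.fc(final)                     # (batch, num_classes)


# Example: a batch of 4 token-id sequences of length 12, 3 output classes.
model = TweetClassifier(vocab_size=1000, num_classes=3)
logits = model(torch.randint(1, 1000, (4, 12)))
```

In a setup like this, the pretrained GloVe vectors would typically be copied into `model.embedding.weight` before training. How this project loads them is not shown here.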
numpy==1.20.1
dill==0.3.3
matplotlib==3.3.4
torchtext==0.9.1
torch==1.8.1
Note: GloVe embeddings are also required, although they should download automatically the first time the code is run.
All tools a user needs are available through the main directory, while each subdirectory contains helper functions. Almost all of the necessary tuning can be done through the config file, and all important functions can be called through implement.py.
- The classifier subdirectory contains all of the machine learning implementation including the dataloader, training and prediction/testing integrations.
- The dataset_dump subdirectory contains any extra data generated by the preprocessing or saved by the machine learning model
- The raw_data subdirectory contains any data provided by users prior to any preprocessing
- The tweet_preprocessor subdirectory is an open source library that we use to help us in tokenization
- The preprocess subdirectory includes all of our implementations of text preprocessing
All necessary functions may be called using implement.py. The additional command-line argument determines what is run.
- generate - generates preprocessed data from the raw CSV file
- train - trains the model on the given dataset. Reports metrics to standard output and generates a plot. Close the plot to terminate the program. Training can sometimes take several minutes.
- test - runs the model's predictions on the contents of 'test_input.txt'. Always train before testing.
You can modify test_input.txt to see the associated classification for every line in the file.
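The per-line behaviour can be pictured with a small sketch like the one below. The function `classify_file` and the `predict` callable are hypothetical stand-ins for the trained model; skipping blank lines is also an assumption, not something the project confirms.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def classify_file(path: str, predict: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return (tweet, label) pairs, one per non-empty line of the input file.

    `predict` is a placeholder for whatever the trained model exposes.
    """
    results = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        text = line.strip()
        if not text:
            continue  # assumption: blank lines are ignored
        results.append((text, predict(text)))
    return results
```

With a trained model wrapped in a function that maps a tweet to a label, `classify_file("test_input.txt", that_function)` would yield one prediction per tweet.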
Examples of usage are:
python3 interface.py generate
python3 interface.py train
python3 interface.py test
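For readers curious how a single script can route these three arguments, here is a minimal sketch using `argparse`. The handler names (`run_generate`, `run_train`, `run_test`) and their return strings are placeholders invented for illustration; the project's real entry point may be organised differently.

```python
import argparse


def run_generate():
    # Placeholder: the real step would preprocess the raw CSV file.
    return "generate"


def run_train():
    # Placeholder: the real step would train the model and show a plot.
    return "train"


def run_test():
    # Placeholder: the real step would classify each line of test_input.txt.
    return "test"


COMMANDS = {"generate": run_generate, "train": run_train, "test": run_test}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Tweet classifier entry point (sketch)")
    parser.add_argument("mode", choices=sorted(COMMANDS), help="which step to run")
    args = parser.parse_args(argv)
    # Look up the handler for the chosen mode and run it.
    return COMMANDS[args.mode]()
```

Calling `main(["train"])` runs the training handler, while an unknown mode makes `argparse` print a usage message and exit.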
Note: Granularity management is a pending feature