This project focuses on building a robust model to classify tweets related to real-world disasters. The primary objective is to predict whether a tweet is about an actual disaster (label 1) or not (label 0).
Leveraging modern natural language processing (NLP) techniques, this project implements a highly effective classification model using DeBERTa, distributed training with PyTorch, and advanced hyperparameter optimization strategies.
DeBERTa Model
- Disentangled Attention Mechanism: By separating content and positional information, this mechanism improves the model's understanding of the text's context.
- Enhanced Mask Decoder: A more sophisticated approach to token reconstruction, especially useful for complex language tasks.
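In practice, the pretrained DeBERTa backbone is typically loaded through the Hugging Face `transformers` library. A minimal sketch (the `microsoft/deberta-v3-base` checkpoint name is an assumption; the notebook may use a different variant):

```python
from transformers import AutoModel, AutoTokenizer

# Checkpoint name is an assumption; any DeBERTa variant loads the same way.
MODEL_NAME = "microsoft/deberta-v3-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
backbone = AutoModel.from_pretrained(MODEL_NAME)

# Encode a sample tweet and inspect the contextual token embeddings.
inputs = tokenizer("Forest fire near La Ronge Sask. Canada", return_tensors="pt")
outputs = backbone(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```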
PyTorch and Ray for Distributed Training
- PyTorch serves as the deep learning framework, providing flexible model definition and full control over the training loop.
- Ray enables scalable, distributed training and hyperparameter tuning to optimize performance.
Hyperparameter Optimization
- ASHA Scheduler: An efficient algorithm that prioritizes the most promising training configurations, saving time and resources.
- Optuna: A hyperparameter tuning framework, applied to identify the most effective settings for model performance.
The data preprocessing steps, sketched in code after this list, include:
- Cleaning tweet text by removing unwanted elements such as URLs and special characters.
- Tokenizing using DeBERTa’s tokenizer to maintain linguistic structure.
- Handling missing data effectively.
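A minimal sketch of the cleaning and missing-data steps, assuming the standard Kaggle disaster-tweets CSV layout with `text` and `keyword` columns (the column names and the exact regex rules are assumptions):

```python
import re

import pandas as pd

def clean_tweet(text: str) -> str:
    """Remove URLs, mentions, and special characters, then normalize whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # strip URLs
    text = re.sub(r"@\w+", "", text)                   # strip user mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)        # drop special characters
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace

df = pd.read_csv("train.csv")
df["keyword"] = df["keyword"].fillna("")               # handle missing metadata
df["text"] = df["text"].astype(str).map(clean_tweet)
```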
The `DisasterDataset` class was developed to handle text data preprocessing and conversion into a suitable format for the DeBERTa model. This class ensures the text is tokenized, padded, and truncated correctly, and prepares target labels as tensors for PyTorch.
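A sketch of what such a class can look like, assuming it wraps lists of texts and labels together with a Hugging Face tokenizer (the constructor signature and the maximum length of 128 tokens are assumptions):

```python
import torch
from torch.utils.data import Dataset

class DisasterDataset(Dataset):
    """Tokenizes tweets and returns tensors ready for the DeBERTa model."""

    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = list(texts)
        self.labels = list(labels)
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoded = self.tokenizer(
            self.texts[idx],
            padding="max_length",  # pad short tweets to max_len
            truncation=True,       # truncate long ones
            max_length=self.max_len,
            return_tensors="pt",
        )
        return {
            "input_ids": encoded["input_ids"].squeeze(0),
            "attention_mask": encoded["attention_mask"].squeeze(0),
            "label": torch.tensor(self.labels[idx], dtype=torch.long),
        }
```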
The `DisasterModel` leverages DeBERTa as the backbone for extracting text features, followed by (see the sketch after this list):
- Dropout Layers to prevent overfitting.
- Fully Connected Layers with ReLU activations to classify the tweet text.
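A sketch of this architecture, under the assumption that the classifier head consumes the backbone's [CLS]-position embedding (the intermediate hidden size, dropout rate, and checkpoint name are assumptions):

```python
import torch.nn as nn
from transformers import AutoModel

class DisasterModel(nn.Module):
    """DeBERTa backbone followed by a dropout + ReLU classification head."""

    def __init__(self, model_name="microsoft/deberta-v3-base", dropout=0.3):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        self.head = nn.Sequential(
            nn.Dropout(dropout),   # regularization against overfitting
            nn.Linear(hidden, 256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, 2),     # logits for labels 0 (not disaster) and 1 (disaster)
        )

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS]-position representation
        return self.head(cls)
```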
- The ASHA Scheduler dynamically allocated resources to the best-performing trials, improving training efficiency.
- Optuna was employed to search for optimal hyperparameters such as batch size, learning rate, and number of epochs, resulting in a well-tuned model (see the sketch below).
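A minimal sketch of how the two pieces fit together in Ray Tune: OptunaSearch proposes configurations while the ASHA scheduler stops weak trials early. The training-function body is a placeholder, the search-space bounds are assumptions, and the metric-reporting call follows recent Ray releases (older versions use `session.report` or keyword arguments):

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch

def train_disaster_model(config):
    # Placeholder: the real function would train DisasterModel for
    # config["epochs"] epochs and compute F1 on a validation fold.
    for epoch in range(config["epochs"]):
        val_f1 = 0.5 + 0.01 * epoch     # synthetic score for illustration
        tune.report({"val_f1": val_f1})  # per-epoch reports let ASHA prune

search_space = {
    "lr": tune.loguniform(1e-5, 5e-5),
    "batch_size": tune.choice([16, 32]),
    "epochs": tune.choice([2, 3, 4]),
}

tuner = tune.Tuner(
    train_disaster_model,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        metric="val_f1",
        mode="max",
        search_alg=OptunaSearch(),                         # Optuna proposes configs
        scheduler=ASHAScheduler(grace_period=1, max_t=4),  # ASHA prunes weak trials
        num_samples=20,
    ),
)
best = tuner.fit().get_best_result()
print(best.config)
```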
The model training process involved cross-validation and early stopping to avoid overfitting. Evaluation metrics like accuracy, precision, recall, and F1-score were used to assess performance on both training and validation datasets.
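Computing those metrics with scikit-learn, given arrays of true labels and model predictions (the helper name is illustrative):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary"  # label 1 = real disaster
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

print(evaluate([1, 0, 1, 1], [1, 0, 0, 1]))
```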
After training, the model was applied to test data, yielding strong performance in classifying disaster-related tweets.
This approach demonstrates the effectiveness of combining state-of-the-art NLP models with robust training and optimization techniques. The final model achieves high accuracy in classifying tweets based on whether they pertain to actual disasters.
By integrating DeBERTa with PyTorch and leveraging distributed training through Ray, this project highlights an efficient solution to disaster-related tweet classification. The use of hyperparameter optimization techniques, including ASHA and Optuna, significantly improved model performance.
To run this project, you will need to follow these steps:
- Clone or download the repository that contains the notebook.
- Install the required dependencies by running `pip install -r requirements.txt`.
- Launch Jupyter Notebook.
- Open the notebook (`deberta_tweets_classification.ipynb`) from the Jupyter interface.
- Run the cells sequentially to execute the code and train the model.