We explore the task of predicting whether a StackOverflow question is likely to be closed, using only information available at submission time, i.e., the title, body, and tags of the question. Our project treats this as a multi-class classification task over a set of imbalanced labels. The labels are the reasons for question closure.
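Because the labels are imbalanced, one common option is to weight each class inversely to its frequency when computing the loss. The sketch below illustrates that general technique only; it is not necessarily the weighting used in this project. The function name and the label names `reason_a` and `reason_b` are hypothetical placeholders.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class: total / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Made-up label distribution, for illustration only.
labels = ["reason_a"] * 8 + ["reason_b"] * 2
weights = inverse_frequency_weights(labels)
```

The resulting dictionary can be converted to a tensor in class-index order and passed to a weighted loss function.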
Dataset Link
Use the file so_dataset_cleaned.csv
The title, body, and tags can be tokenized by splitting on \t.
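A minimal sketch of the tab-based tokenization follows. It assumes the CSV has columns named `title`, `body`, and `tags`; these names are assumptions, so check them against the header of so_dataset_cleaned.csv. The sample row is made up and stands in for the real file.

```python
import csv
import io

# Assumed column names; verify against the actual CSV header.
TEXT_COLUMNS = ("title", "body", "tags")


def tokenize_row(row, columns=TEXT_COLUMNS):
    """Split each text field on tab characters, dropping empty tokens."""
    return {col: [tok for tok in row[col].split("\t") if tok] for col in columns}


def read_tokenized(handle):
    """Yield one dict of token lists per CSV row."""
    for row in csv.DictReader(handle):
        yield tokenize_row(row)


# Tiny in-memory stand-in for the real file (made-up row).
sample = io.StringIO(
    "title,body,tags\n"
    '"how\tto\tsort","i\thave\ta\tlist","python\tsorting"\n'
)
rows = list(read_tokenized(sample))
```

To read the real file, pass a handle opened with `open("so_dataset_cleaned.csv", newline="", encoding="utf-8")` to `read_tokenized` instead of the in-memory sample.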
Analysis data: 50 data points taken from the test set for qualitative analysis.
Colab BERT with losses and data augmentation
ELMo Files: download and unzip elmo_vectors.zip
SO Word2Vec: a possible alternative, Word2Vec embeddings for the software engineering domain. The file is very large (~1.5 GB).