CS685SOQuestionClosed

We explore the task of predicting whether a StackOverflow question is likely to be closed, using only information available at submission time, i.e., the title, body, and tags of the question. Our project treats this as a multi-class classification task over a set of imbalanced labels, where the labels are the reasons for question closure.
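Because the labels are imbalanced, one common option is to weight each class inversely to its frequency when training. The sketch below is illustrative only and is not taken from the project notebooks: the function name `balanced_class_weights` and the label strings are hypothetical, and the formula N / (K * count) is the usual "balanced" heuristic rather than a method the project is known to use.

```python
from collections import Counter
from typing import Dict, List


def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Return inverse-frequency weights: N / (K * count_of_class).

    N is the number of examples and K the number of distinct classes.
    Rare classes receive weights above 1, frequent ones below 1.
    """
    counts = Counter(labels)
    n_examples = len(labels)
    n_classes = len(counts)
    return {
        label: n_examples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical label distribution, for illustration only.
example_labels = ["open"] * 8 + ["off_topic"] * 2
weights = balanced_class_weights(example_labels)
```

Weights like these can be passed to a weighted loss so that errors on rare closure reasons count more than errors on the majority class.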

Dataset

Dataset Link: use the file so_dataset_cleaned.csv. The title, body, and tags can be tokenized by splitting on \t (tab).
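A minimal sketch of reading such a file and splitting each field on tabs, using only the standard library. The column names `title`, `body`, and `tags`, the comma as the CSV delimiter, and the helper names are assumptions; check the header of so_dataset_cleaned.csv before relying on them. The in-memory sample stands in for the real file.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List


def split_tokens(field: str) -> List[str]:
    """Split a tab-delimited field into tokens, dropping empty strings."""
    return [tok for tok in field.split("\t") if tok]


def read_rows(handle: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict of token lists per CSV row.

    Assumes (hypothetically) columns named title, body, and tags.
    """
    for row in csv.DictReader(handle):
        yield {
            "title": split_tokens(row["title"]),
            "body": split_tokens(row["body"]),
            "tags": split_tokens(row["tags"]),
        }


# Tiny in-memory stand-in for so_dataset_cleaned.csv (made-up content).
sample = io.StringIO(
    "title,body,tags\n"
    '"how\tto\tsort","my\tlist\tis\tunsorted","python\tsorting"\n'
)
rows = list(read_rows(sample))
```

To read the real file, replace `sample` with `open("so_dataset_cleaned.csv", newline="", encoding="utf-8")`.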

Analysis data: 50 data points taken from the test set for qualitative analysis.

Colab Notebooks:

Colab ML

Colab BERT

Colab BERT with losses and data augmentation

Word Vectors

ELMo Files: download and unzip elmo_vectors.zip

SO Word2Vec: a possible alternative, w2v embeddings for the software engineering domain. The file is very large (~1.5 GB).