Comparing Selective Masking Methods for Depression Detection in Social Media

Abstract

Identifying those at risk for depression is a crucial issue in which social media provides an excellent platform for examining the linguistic patterns of depressed individuals. A significant challenge in a depression classification problem is ensuring that the prediction model is not overly dependent on keywords, such that it fails to predict when keywords are unavailable. One promising approach is masking, i.e., by masking important words selectively and asking the model to predict the masked words, the model is forced to learn the context rather than the keywords. This study evaluates seven masking techniques, such as random masking, log-odds ratio, and the use of attention scores. In addition, whether to predict the masked words during pretraining or fine-tuning phase was also examined. Last, six class imbalance ratios were compared to determine the robustness of the masked selection methods. Key findings demonstrated that selective masking generally outperforms random masking in terms of classification accuracy. In addition, the most accurate and robust models were identified. Our research also indicated that reconstructing the masked words during the pre-training phase is more advantageous than during the fine-tuning phase. Further discussion and implications were made. This is the first study to comprehensively compare masking selection methods, which has broad implications for the field of depression classification and the general NLP.

Dataset

Reddit Self-reported Depression Diagnosis (RSDD) dataset and Time-RSDD dataset (https://georgetown-ir-lab.github.io/emnlp17-depression/)

The datasets should be loaded into the OP_datasets folder

Training Approaches

BERT further pre-train + fine-tune FURTHER-01-MLM.py and FURTHER-02-classi.py (adapted from https://github.com/GU-DataLab/stance-detection-KE-MLM and https://github.com/thunlp/SelectiveMasking)
BERT fine-tune with reconstruction objective MASKER.py (adapted from https://github.com/alinlab/MASKER)
Standard BERT fine-tune BASE-classi.py

Selective Masking Methods

Random masking random
Depression Lexicon deplex (lexicon.txt from https://github.com/gamallo/depression_classification/tree/master/lexicons)
Log-odds-ratio logodds (from https://github.com/kornosk/log-odds-ratio)
TF-IDF tfidf (adapted from https://github.com/alinlab/MASKER)
Sum attention sumatt (adapted from https://github.com/alinlab/MASKER)
Top attention prop
Neural Network NN (adapted from https://github.com/thunlp/SelectiveMasking)

get_datasets contains python script and .ipynb files for extracting, preprocesing and creating the dataset objects for training

keyword contains .ipynb files for obtaining the keywords and the resulting keywords in .txt format

src contain the source code for creating a masked dataset and training & evaluation loop

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
OP_datasets		OP_datasets
model_RSDD		model_RSDD
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Comparing Selective Masking Methods for Depression Detection in Social Media

Abstract

Dataset

Training Approaches

Selective Masking Methods

About

Releases

Languages

chanapapan/Depression-Detection

Folders and files

Latest commit

History

Repository files navigation

Comparing Selective Masking Methods for Depression Detection in Social Media

Abstract

Dataset

Training Approaches

Selective Masking Methods

About

Topics

Resources

Stars

Watchers

Forks

Releases

Languages