This repository contains the methods for producing language features from subreddits. If you use the code and want to cite our work, please use the following paper:
George Gkotsis, Anika Oellrich, Tim Hubbard, Richard Dobson, Maria Liakata, Sumithra Velupillai and Rina Dutta. The Language of Mental Health Problems in Social Media. Computational Linguistics and Clinical Psychology (CLPsych), NAACL 2016.
The repository includes two pandas DataFrames that are a small subset of the original datasets used in our study. They are provided mainly for demonstration purposes.
The complete dataset we used is available on Reddit (comments, posts).
Install the dependencies listed in requirements.txt (spaCy requires an extra step after installation).
For the syntactic features, run:
import pandas as pd
import content
df = pd.read_pickle("suicidewatch-sample.pickle")
df = content.addSyntacticFeatures(df)
For the affective features, run:
import afinnsenti
import labmt
df['text'] = df.apply(content.getTextFromRecord, axis=1)
df = afinnsenti.addEmotionalFeature(df)
df = labmt.addEmotionalFeature(df)
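To make the idea of a lexicon-based emotional feature concrete, here is a minimal sketch of how a word-valence lexicon such as AFINN can be applied to a text. It is not the code in afinnsenti or labmt: the tiny TOY_LEXICON, its scores, and the function names tokenize, score_text and mean_valence are assumptions made purely for illustration.

```python
import re

# Hypothetical toy lexicon for illustration only; the real AFINN and labMT
# word lists are far larger and use their own scores.
TOY_LEXICON = {"happy": 3, "good": 2, "sad": -2, "hopeless": -3}

TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def score_text(text, lexicon=TOY_LEXICON):
    """Sum the valence of every token found in the lexicon."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def mean_valence(text, lexicon=TOY_LEXICON):
    """Average valence over matched tokens only; None if nothing matches."""
    matched = [lexicon[t] for t in tokenize(text) if t in lexicon]
    return sum(matched) / len(matched) if matched else None
```

With this toy lexicon, score_text("I feel sad and hopeless") gives -5 and mean_valence on the same text gives -2.5. A summed score grows with text length, while a mean over matched words does not, which is one reason the two kinds of lexicon score are often reported side by side.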
For the classification, run:
import binaryClassification
binaryClassification.main()
rs = binaryClassification.readResults()
The complete classification output is also stored as a dictionary in pickle format (file: combinations-10fold.pickle).
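Because the results are stored as a pickled dictionary, they can be inspected without rerunning the classification. The sketch below assumes only that the file contains a dictionary. The layout of its keys and values is not described here, so the helpers (hypothetical names load_results and preview_keys) simply load the object and list a few keys.

```python
import pickle


def load_results(path="combinations-10fold.pickle"):
    """Load the pickled results and check that they form a dictionary."""
    with open(path, "rb") as handle:
        results = pickle.load(handle)
    if not isinstance(results, dict):
        raise TypeError(f"expected a dict, got {type(results).__name__}")
    return results


def preview_keys(results, limit=5):
    """Return up to `limit` keys so the layout can be inspected quickly."""
    return list(results)[:limit]
```

If the file was written under Python 2 and is read under Python 3, pickle.load may need the extra argument encoding="latin1". Whether that applies to this file is an assumption to verify.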