Text Mining for Sustainability: Detecting Corporate Greenwashing with The Sustainable Development Goals
This repository is part of the Master's thesis Text Mining for Sustainability: Detecting Corporate Greenwashing with The Sustainable Development Goals and hosts all the tools that were used in the research.
The following scripts were used for my thesis. They were run in the order listed below, because each step requires the output of the previous one. The PDFs are not stored in this repository but can be found on the websites of the companies.
- pdf_extractor.py Extract paragraphs of text from PDF files.
- gn_links.py Collect all article URLs from a Google News page.
- article_scraper.py Scrape paragraphs of online news articles from a list of links.
- filter_data.py Only keep texts that are at least 20 tokens long.
- aurora.py Implementation of the Aurora Universities Network SDG classifier. Requires queries.py to work. This implementation drops the windowing constraints of the original classifier.
- osdg.py Label text with the OSDG classifier. Requires that the OSDG Docker container is running.
- combine_columns.py Combine the output of Aurora and OSDG from two columns into an extra column.
- sentiment.py Add a column with a sentiment score from VADER.
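
As a rough illustration of two of the steps above, the sketch below filters texts by length and merges two label columns. It is not the code from filter_data.py or combine_columns.py. The function names, the whitespace tokenization, and the choice to merge labels as a sorted union are all assumptions made for this example.

```python
def count_tokens(text):
    """Count tokens by splitting on whitespace (an assumed tokenization)."""
    return len(text.split())


def filter_texts(texts, min_tokens=20):
    """Keep only the texts that contain at least `min_tokens` tokens."""
    return [t for t in texts if count_tokens(t) >= min_tokens]


def combine_labels(aurora_labels, osdg_labels):
    """Merge two lists of SDG numbers into one sorted list without duplicates.

    Taking the union is an assumption; the real script may combine differently.
    """
    return sorted(set(aurora_labels) | set(osdg_labels))


if __name__ == "__main__":
    short_text = "Too short to keep."
    long_text = " ".join(["word"] * 25)
    print(len(filter_texts([short_text, long_text])))
    print(combine_labels([7, 13], [13, 12]))
```

With a union, a paragraph counts as SDG-related whenever either classifier assigns it a goal. An intersection would be the stricter alternative.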
This folder contains files with all the links to news articles that were used for the research.
- Annotation_Guidelines.pdf The annotation guidelines that were used for the evaluation task.
- corpus.csv The questions from the evaluation with gold labels.
This folder contains all the data that was used for the research.
This folder contains the saved HTML search results from Google News that were used for the research.
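
For readers who want to pull links out of saved HTML pages like these, the following is a minimal sketch that uses only the Python standard library. It is not the code from gn_links.py. The class and function names are hypothetical, and real Google News pages will likely need extra filtering (for example, resolving relative or redirect links), which is left out here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, keeping first-seen order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href and links already collected.
            if name == "href" and value and value not in self.links:
                self.links.append(value)


def collect_links(html):
    """Return all unique link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<a href="https://example.com/a">A</a><a href="/b">B</a>'
    for link in collect_links(sample):
        print(link)
```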