Using machine learning to predict ocean health based on scraped BBNJ speech contents

srathnavel/bbnj-speech

BBNJ Conference Speech Data

Natural Language Processing with spacyr

This project models ocean health within a country's marine jurisdiction (as measured by the Ocean Health Index) using the country's speeches at meetings of the UN conference on Biodiversity Beyond National Jurisdiction (BBNJ).

I scraped the speech PDFs from the BBNJ website and read in the contents of each file. I then processed the tidied text with spaCy (through the spacyr package), applying lemmatization (reducing each word to its base form) and part-of-speech tagging, and kept only nouns and verbs. The cleaned dataset is quite small: it comprises 100 speeches in English given by only 40 distinct countries.
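The noun-and-verb filtering step can be illustrated with a minimal sketch. It is written in Python rather than the project's R code, and it assumes the parse has already produced one row per token with a lemma and a coarse part-of-speech tag. The rows, the `lemma_counts` helper, and the choice to exclude proper nouns and auxiliaries are all assumptions made for illustration, not the project's actual data or code.

```python
from collections import Counter

# Hypothetical parsed tokens: (doc_id, token, lemma, pos).
# These rows are invented for illustration; real rows would come from the spaCy parse.
parsed_tokens = [
    ("speech_1", "States", "state", "NOUN"),
    ("speech_1", "are", "be", "AUX"),
    ("speech_1", "building", "build", "VERB"),
    ("speech_1", "capacities", "capacity", "NOUN"),
    ("speech_1", "quickly", "quickly", "ADV"),
    ("speech_2", "The", "the", "DET"),
    ("speech_2", "instruments", "instrument", "NOUN"),
    ("speech_2", "conserve", "conserve", "VERB"),
    ("speech_2", "oceans", "ocean", "NOUN"),
]

# Assumption: only the coarse tags NOUN and VERB are kept.
KEEP_POS = {"NOUN", "VERB"}


def lemma_counts(rows, keep_pos=KEEP_POS):
    """Count lemmas per document, keeping only the requested parts of speech."""
    counts = {}
    for doc_id, _token, lemma, pos in rows:
        if pos in keep_pos:
            counts.setdefault(doc_id, Counter())[lemma.lower()] += 1
    return counts


features = lemma_counts(parsed_tokens)
for doc_id, counts in features.items():
    print(doc_id, dict(counts))
```

Each per-speech counter can then serve as one row of a lemma-count feature table for modelling.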

I set up and trained three model types: random forest, generalized linear model, and k-nearest neighbors. The random forest performed best, albeit with an R-squared of less than 0.25. I then applied this fitted model to the reserved test data to generate predictions (only 30 speeches, 30% of the dataset).
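To make the 70/30 split and the R-squared figure concrete, here is a minimal Python sketch using only the standard library. The `train_test_split` and `r_squared` helpers, the seed, and the integer speech IDs are hypothetical. The sketch uses the `1 - SS_res / SS_tot` definition of R-squared, which may differ from the metric the project actually reports (some toolkits report the squared correlation instead).

```python
import random


def train_test_split(items, test_fraction=0.3, seed=42):
    """Shuffle a copy of items and hold out a fraction of them as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical stand-in for the 100 cleaned speeches.
speech_ids = list(range(100))
train_ids, test_ids = train_test_split(speech_ids)
print(len(train_ids), len(test_ids))
```

Under this definition, a model that always predicts the mean of the observed scores has an R-squared of 0, so a value below 0.25 means the model explains less than a quarter of the variance in the scores.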

I mapped prediction error (RMSE) by country and found that the model is worse at predicting ocean health scores for smaller island countries: error is higher for Sri Lanka and parts of Polynesia. The most important lemmas and phrases, identified from a variable importance plot, included "conservation", "instrument", "capacity building", and "transfer technology".
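A per-country RMSE of the kind that could feed such a map can be sketched as follows, assuming test-set predictions are available as (country, observed score, predicted score) rows. The country names, scores, and the `rmse_by_country` helper are placeholders, not results from the project.

```python
import math
from collections import defaultdict

# Hypothetical test-set rows: (country, observed score, predicted score).
# Values are invented for illustration only.
predictions = [
    ("Country A", 70.0, 67.0),
    ("Country A", 70.0, 74.0),
    ("Country B", 60.0, 60.0),
]


def rmse_by_country(rows):
    """Root mean squared error of the predictions, grouped by country."""
    squared_errors = defaultdict(list)
    for country, observed, predicted in rows:
        squared_errors[country].append((observed - predicted) ** 2)
    return {
        country: math.sqrt(sum(errors) / len(errors))
        for country, errors in squared_errors.items()
    }


country_rmse = rmse_by_country(predictions)
for country, rmse in country_rmse.items():
    print(f"{country}: {rmse:.2f}")
```

With only 30 test speeches, some countries contribute a single speech, so their RMSE reduces to one absolute error and should be read with caution.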
