May 1st 2022 By: Kinan Al-Mouk
- Email: [email protected]
Goal: Explore the linguistic elements of the six official United Nations languages: English, Spanish, French, Russian, Arabic, and Mandarin Chinese.
Data Source: United Nations, Department for General Assembly and Conference Management: UN Parallel Corpora
This project is my term project submission for LING1340 Data Science for Linguists, taught by Na-Rae Han at the University of Pittsburgh. All data was obtained from the UN website and processed using nltk and spaCy.
- final_report.md is my final report write-up.
- DataProcessing.ipynb contains all the newer processing of the data, using spaCy for English, Spanish, French, Russian, and Mandarin.
- new_image_files/ is the folder where the matplotlib graphs from DataProcessing.ipynb are saved as .png files.
- UN_Data_Analysis.ipynb contains all processing of the initial data using nltk.
- data_samples/ is the folder where the segmented data files from the UN Parallel Corpus website are found.
- image_files/ is the folder where the matplotlib graphs from UN_Data_Analysis.ipynb are saved as .png files.
- UNv1.0.pdf is the licensing information downloaded with the original data from the UN Parallel Corpus website.
- LICENSE.md contains licensing information from UNv1.0.pdf.
- project_plan.md was my initial project plan.
- progress_report.md contains progress logs throughout the completion of the project.