UN Parallel Corpora Analysis Project Plan

Summary

My term project will analyze all UN documents in the public domain written between 1990 and 2014. The documents are available in the six official languages of the United Nations: English, Spanish, French, Arabic, Russian, and Mandarin Chinese.

Data Portion and Description

The United Nations Parallel Corpus v1.0 is composed of official records and other parliamentary documents of the United Nations that are in the public domain. These documents are mostly available in the six official languages of the United Nations. The current version of the corpus contains content that was produced and manually translated between 1990 and 2014, including sentence-level alignments.

The corpus was created as part of the United Nations commitment to multilingualism and as a reaction to the growing importance of statistical machine translation (SMT) within the Department for General Assembly and Conference Management (DGACM) translation services and the United Nations SMT system, Tapta4UN.

The purpose of the corpus is to allow access to multilingual language resources and facilitate research and progress in various natural language processing tasks, including machine translation. For convenience, the corpus is also available pre-packaged as language-specific bi-texts and as a six-language parallel corpus subset.

I will be using the fully aligned plain subcorpus in the six official UN languages, which is composed of folders organized by language, publication year, and publication symbol. The files are in XML format.
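As a rough sketch of how the folder structure could be read, the snippet below walks a directory tree and pulls the text out of every sentence element in each XML file. The function name `iter_sentences` and the sentence tag `s` are assumptions made for illustration; the actual element names in the corpus files should be checked before relying on this.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def iter_sentences(root_dir, sentence_tag="s"):
    """Yield (path, sentence_text) for each sentence element under root_dir.

    Assumption: sentences are stored in elements named `sentence_tag`.
    The real corpus schema may use a different element name.
    """
    # rglob descends through the language/year/symbol folders recursively.
    for path in sorted(Path(root_dir).rglob("*.xml")):
        tree = ET.parse(path)
        for elem in tree.iter():
            if not isinstance(elem.tag, str):
                continue
            # Drop a namespace prefix such as "{uri}s" before comparing.
            local_name = elem.tag.rsplit("}", 1)[-1]
            if local_name == sentence_tag:
                text = "".join(elem.itertext()).strip()
                if text:
                    yield path, text
```

Yielding the path alongside each sentence keeps the language, year, and symbol recoverable from the folder names, so results can later be grouped by language.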

Analysis

I will be analyzing word length, sentence length, hapaxes, type-token ratio (TTR), and similar measures for each official UN language. I am asking the question: how can each language say the same thing with different structures, and what remains the same? What can't change about language? What is the most core aspect of language that is necessary to illustrate or report the same information? Is that close to semanticity?
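The measures named above could be computed along these lines. This is a minimal sketch, not the final method: `lexical_stats` is a hypothetical helper, and the regex tokenizer is an assumption that will not segment Chinese properly (a dedicated word segmenter would be needed there).

```python
import re
from collections import Counter


def lexical_stats(sentences):
    """Return simple lexical statistics for a list of sentence strings."""
    # Assumption: a token is a run of word characters, lowercased.
    tokens = [tok.lower() for s in sentences for tok in re.findall(r"\w+", s)]
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Hapaxes are word types that occur exactly once.
    hapaxes = [word for word, count in counts.items() if count == 1]
    return {
        "tokens": n_tokens,
        "types": n_types,
        # Type-token ratio: distinct word types divided by total tokens.
        "ttr": n_types / n_tokens if n_tokens else 0.0,
        "hapax_count": len(hapaxes),
        "mean_word_length": (
            sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0
        ),
        "mean_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
    }
```

Because TTR falls as a text gets longer, comparisons across languages are fairer when each language is sampled down to the same number of tokens first.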

Presentation

UN color scheme, Pandas charts, and texts from several articles (with English translations when the file is not an English one).