Add extractor configurations for Amharic Language #759

Closed
@@ -6,6 +6,21 @@ object DateTimeParserConfig
val monthsMap = Map(
// For "ar" configuration, right-to-left rendering may seem like a bug, but it's not.
// Don't change this unless you know how it is done.
"am" -> Map(
"መስከረም" -> 1,
"ጥቅምት" -> 2,
"ኅዳር" -> 3,
"ታኅሳስ" -> 4,
"ጥር" -> 5,
"የካቲት" -> 6,
"መጋቢት" -> 7,
"ሚያዝያ" -> 8,
"ግንቦት" -> 9,
"ሰኔ" -> 10,
"ሐምሌ" -> 11,
"ነሐሴ" -> 12,
"ጳጉሜ" -> 13

Member:

I think 13 would be an invalid month value here

Member:

In addition, Google Translate gives the following translation of the above:

"September" -> 1,
"October" -> 2,
"November" -> 3,
"December" -> 4,
"January" -> 5,
"February" -> 6,
"March" -> 7,
"April" -> 8,
"May" -> 9,
"June" -> 10,
"July" -> 11,
"August" -> 12,
"Pagume" -> 13

I'm not sure how accurate this translation is, or whether there is a calendar difference, but in the general case January should map to 1, not September (with the other values shifted accordingly).

Contributor Author:

The Ethiopian calendar has 13 months. We use a different calendar system :)

Member:

I see, I thought so... :)
The thing is that this needs to map to the Gregorian calendar. Values with month 13 will fail to parse and we will lose that triple. Maybe map that to 12, or skip that line? I'm not sure which would result in fewer errors.
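
To illustrate the concern in the comment above, here is a minimal Scala sketch. It assumes the looked-up number is handed to a Gregorian validator such as `java.time.LocalDate`; the extraction framework's actual date handling is not shown in this diff. `MonthThirteenSketch`, `amMonths`, and `parseAsGregorian` are hypothetical names.

```scala
import java.time.LocalDate
import scala.util.Try

object MonthThirteenSketch {
  // Hypothetical excerpt of the "am" month map proposed in this PR.
  val amMonths: Map[String, Int] = Map(
    "መስከረም" -> 1,
    "ነሐሴ" -> 12,
    "ጳጉሜ" -> 13
  )

  // Assumption: the mapped number is treated as a Gregorian month.
  // LocalDate.of throws DateTimeException for month 13; Try turns that into None.
  def parseAsGregorian(year: Int, monthName: String, day: Int): Option[LocalDate] =
    amMonths.get(monthName).flatMap(m => Try(LocalDate.of(year, m, day)).toOption)
}
```

Under these assumptions, `parseAsGregorian(2012, "ጳጉሜ", 5)` yields `None` (the lost triple), while `parseAsGregorian(2012, "መስከረም", 1)` yields January 1, 2012: a valid date, but not the one meant, since Meskerem begins in September.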

Contributor:

Among other things...

google translate provides this as a translation

It would seem that this ought to result in a bug report against Google Translate (from a properly knowledgeable reporter such as @Meti-Adane), as the 13 Amharic month names (twelve months of 30 days and one of 5 or 6 days, with New Year's Day on September 11 or 12 depending on leap years) obviously cannot be directly translated to the 12 Gregorian month names (months of 28, 29, 30, or 31 days, with New Year's Day always on January 1).

As to the mapping here, the handling of Hebrew months (closely tied to the lunar cycle; twelve months of 29 or 30 days, plus a thirteenth leap month every few years) might provide some valuable hints.

Contributor:

@jimkont — "this needs to map to the Gregorian calendar"... but why? It seems to me that this PR reveals a significant flaw in a number of systems. There ought to be a way to express a date in any calendar, which some systems might offer to translate to one or more other calendars, including Gregorian, Amharic, Hebrew, etc. Is translation among some of these easy? Lots of systems will support them. Is translation among some of these hard? Fewer systems will support them, or there will be a few libraries produced to handle such translations.

Net of all this -- I think there ought to be an issue about non-Gregorian calendar ingestion and preservation. This is not too different from the (ongoing) efforts to losslessly handle data using multiple geocoordinate systems for Earth plus other celestial bodies (Moon, Mars, etc.).

Member:

Thanks for the prompt, @TallTed. I created an issue here: #761

@Meti-Adane, we could still merge this PR and possibly just remove the date mappings until the issue is resolved.

Contributor Author:

@jimkont Well noted. I have implemented a date converter for the Ethiopian calendar (Ethiopian to Gregorian). I will push the change list in a separate PR to make the review easier.
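
The converter mentioned above is not part of this diff. As a rough sketch of what an Ethiopian-to-Gregorian conversion can look like, the following uses the commonly cited Julian Day Number formula for the Ethiopian calendar (Amete Mihret era offset 1723856) together with `java.time.LocalDate`. All names are hypothetical, and this is not the contributor's implementation.

```scala
import java.time.LocalDate

object EthiopianDateSketch {
  // Julian Day Number offset for the Amete Mihret era (assumption: the era used in practice).
  private val AmeteMihretOffset = 1723856L
  // Julian Day Number of 1970-01-01, the origin of LocalDate.ofEpochDay.
  private val UnixEpochJulianDay = 2440588L

  // In the Ethiopian calendar, a year is a leap year when year mod 4 == 3.
  def isLeapYear(year: Int): Boolean = Math.floorMod(year, 4) == 3

  // Months 1 to 12 have 30 days; the 13th month (Pagume) has 5 days, or 6 in leap years.
  def daysInMonth(year: Int, month: Int): Int =
    if (month <= 12) 30 else if (isLeapYear(year)) 6 else 5

  def toJulianDay(year: Int, month: Int, day: Int): Long = {
    require(month >= 1 && month <= 13, s"month out of range: $month")
    require(day >= 1 && day <= daysInMonth(year, month), s"day out of range: $day")
    AmeteMihretOffset + 365L + 365L * (year - 1) + Math.floorDiv(year, 4) + 30L * month + day - 31L
  }

  // Converts an Ethiopian date to the corresponding (proleptic) Gregorian date.
  def toGregorian(year: Int, month: Int, day: Int): LocalDate =
    LocalDate.ofEpochDay(toJulianDay(year, month, day) - UnixEpochJulianDay)
}
```

For example, `EthiopianDateSketch.toGregorian(2013, 1, 1)` (1 Meskerem 2013) gives 2020-09-11, and `toGregorian(2012, 13, 5)` (5 Pagume 2012) gives 2020-09-10. Converting the full date this way avoids having to force month 13 into a Gregorian month number.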

),
"ar" -> Map("جانفي"->1,"فيفري"->2,"مارس"->3,"أفريل"->4,"ماي"->5,"جوان"->6,"جويلية"->7,"أوت"->8,"سبتمبر"->9,"أكتوبر"->10,"نوفمبر"->11,"ديسمبر"->12,
"يناير"->1,"فبراير"->2,"أبريل"->4,"مايو"->5,"يونيو"->6,"يوليو"->7,"يوليوز"->7,"أغسطس"->8,"غشت"->8,"شتنبر"->9,"نونبر"->11,"دجنبر"->12),
"bg" -> Map("януари"->1,"февруари"->2,"март"->3,"април"->4,"май"->5,"юни"->6,"юли"->7,"август"->8,"септември"->9,"октомври"->10,"ноември"->11,"декември"->12),
@@ -39,6 +39,55 @@ object DurationParserConfig
"years" -> "year",
"yr" -> "year"
),
"am" -> Map(
"second" -> "second",
"s" -> "second",
"sec" -> "second",
"seconds" -> "second",
"secs" -> "second",
"\"" -> "second",
"ሰከንድ" -> "second",
"ሴኮንድ" -> "second",
"minute" -> "minute",
"m" -> "minute",
"min" -> "minute",
"minutes" -> "minute",
"min." -> "minute",
"mins" -> "minute",
"minu" -> "minute",
"'" -> "minute",
"ደቂቃ" -> "minute",
"ደቂቃዎች" -> "minute",
"hour" -> "hour",
"h" -> "hour",
"hours" -> "hour",
"hr" -> "hour",
"hr." -> "hour",
"hrs" -> "hour",
"hrs." -> "hour",
"ሰአት" -> "hour",
"ሰዓታት" -> "hour",
"ሰዓት" -> "hour",
"day" -> "day",
"d" -> "day",
"d." -> "day",
"days" -> "day",
"ቀን" -> "day",
"ቀናት" -> "day",
"ቀኖች" -> "day",
"month" -> "month",
"months" -> "month",
"ወር" -> "month",
"ወራት" -> "month",
"ወሮች" -> "month",
"year" -> "year",
"y" -> "year",
"years" -> "year",
"yr" -> "year",
"አመት" -> "year",
"ዓመት" -> "year",
"ዓመታት" -> "year"
),
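
The map above normalizes many surface forms of a unit to one canonical English unit name. A minimal sketch of how such a lookup could be applied, assuming input of the form "<number> <unit>"; `DurationUnitSketch`, `amUnits`, and `normalize` are hypothetical names, and the real parser is not shown in this diff.

```scala
object DurationUnitSketch {
  // Hypothetical excerpt of the "am" unit map from this diff.
  val amUnits: Map[String, String] = Map(
    "ሰከንድ" -> "second",
    "ደቂቃ" -> "minute",
    "ደቂቃዎች" -> "minute",
    "ሰዓት" -> "hour",
    "ቀናት" -> "day",
    "ዓመት" -> "year"
  )

  // Assumption: an integer amount followed by a single unit token.
  private val AmountAndUnit = """\s*(\d+)\s*(\S+)\s*""".r

  // Returns the amount and the canonical unit, or None if the unit is unknown.
  def normalize(text: String): Option[(Int, String)] = text match {
    case AmountAndUnit(n, unit) => amUnits.get(unit).map(u => (n.toInt, u))
    case _ => None
  }
}
```

With this sketch, both `normalize("3 ደቂቃ")` and `normalize("3 ደቂቃዎች")` resolve to the unit `"minute"`, which is the point of listing singular and plural forms side by side.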
// For "ar" configuration, rendering right-to-left may seem like a bug, but it's not.
// Don't change this unless you know how it is done.
"ar" -> Map(
@@ -12,6 +12,7 @@ object GeoCoordinateParserConfig

//map longitude letters used in languages to the ones used in English ("E" for East and "W" for West)
val longitudeLetterMap = Map(
"am" -> Map("E" -> "E", "W" -> "W"),
"de" -> Map("E" -> "E", "O" -> "E", "W" -> "W"),
"en" -> Map("E" -> "E", "W" -> "W"),
"cs" -> Map("E" -> "E", "W" -> "W"),
@@ -22,6 +23,7 @@ object GeoCoordinateParserConfig

//map latitude letters used in languages to the ones used in English ("N" for North and "S" for South)
val latitudeLetterMap = Map(
"am" -> Map("N" -> "N", "S" -> "S"),
"en" -> Map("N" -> "N", "S" -> "S"),
"cs" -> Map("N" -> "N", "S" -> "S"),
"mk" -> Map("N" -> "N", "S" -> "S")
@@ -14,6 +14,23 @@ object ParserUtilsConfig
"bln" -> 9,
"trillion" -> 12,
"quadrillion" -> 15
),
"am" -> Map(
"አስር" -> 1,
"መቶ" -> 2,
"መቶዎች" -> 2,
"thousand" -> 3,
"ሺህ" -> 3,
"million" -> 6,
"mln" -> 6,
"ሚሊዮን" -> 6,
"billion" -> 9,
"ቢሊዮን" -> 9,
"bln" -> 9,
"trillion" -> 12,
"ትሪሊዮን" -> 12,
"quadrillion" -> 15,
"ኳድሪሊየን" -> 15
),
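
The integers in the map above read as powers of ten (thousand is 3, million is 6, and so on). A minimal sketch of applying such a scale word to a parsed number, under that assumption; `ScaleSketch`, `amScales`, and `scaled` are hypothetical names.

```scala
object ScaleSketch {
  // Hypothetical excerpt of the "am" scale map from this diff: word -> exponent of ten.
  val amScales: Map[String, Int] = Map(
    "ሺህ" -> 3,
    "ሚሊዮን" -> 6,
    "ቢሊዮን" -> 9
  )

  // Multiplies the amount by 10^exponent for a known scale word; None for unknown words.
  def scaled(amount: BigDecimal, scaleWord: String): Option[BigDecimal] =
    amScales.get(scaleWord).map(exp => amount * BigDecimal(10).pow(exp))
}
```

For instance, `ScaleSketch.scaled(BigDecimal("2.5"), "ሚሊዮን")` evaluates to a value equal to 2,500,000.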
// For "ar" configuration, rendering right-to-left may seem like a bug, but it's not.
// Don't change this unless you know how it is done.
@@ -11,6 +11,7 @@ object DateIntervalMappingConfig
// Don't change this unless you know how it is done.
val presentMap = Map(
"en" -> Set("present", "now"), // for example see https://en.wikipedia.org/wiki/Donald_Trump -> Political party -> Republican (1987–1999, 2009–2011, 2012–present)
"am" -> Set("አሁን", "እስካሁን", "እስካሁን ድረስ"),
"ar" -> Set("الحاضر"),
"be" -> Set("па гэты дзень", "па сучаснасць"),
"bg" -> Set("до наши дни", "настояще", "досега"),
@@ -38,6 +39,7 @@ object DateIntervalMappingConfig

val sinceMap = Map(
"en" -> "since",
"am" -> "(?:ጀምሮ|አንሥቶ|አንስቶ)",
"ca" -> "des del",
"es" -> "desde",
"fr" -> "depuis",
@@ -48,12 +50,14 @@ object DateIntervalMappingConfig

val onwardMap = Map(
"en" -> "onward",
"am" -> "በኋላ",
"es" -> "en adelante",
"pt" -> "adiante|avante"
)
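
The `presentMap`, `sinceMap`, and `onwardMap` entries above are plain strings and regex fragments. A minimal sketch of how they could be combined to recognize an open-ended interval, assuming the Amharic marker follows the start date; how the mapping code actually composes these fragments is not shown in this diff, and all names below are hypothetical.

```scala
object OpenIntervalSketch {
  // Values copied from the "am" entries in this diff.
  val presentWords: Set[String] = Set("አሁን", "እስካሁን", "እስካሁን ድረስ")
  val sincePattern = "(?:ጀምሮ|አንሥቶ|አንስቶ)"
  val onwardPattern = "በኋላ"

  // Assumption: "<start> <marker>", with the since/onward marker after the start value.
  private val OpenEnded =
    ("""(.+?)\s+(?:""" + sincePattern + "|" + onwardPattern + ")").r

  // Returns the start of an open-ended interval, or None if no marker is present.
  def startOfOpenInterval(text: String): Option[String] = text.trim match {
    case OpenEnded(start) => Some(start)
    case _ => None
  }

  // True when the text is one of the configured "present" words.
  def isPresent(text: String): Boolean = presentWords.contains(text.trim)
}
```

Because the "since" entry is written as a non-capturing alternation, the three spellings can be dropped into a larger pattern without changing its capture groups.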

val splitMap = Map(
"en" -> "to",
"am" -> "እስከ",
"es" -> "al|a la|a|hasta (?:el|la)",
"fr" -> "à|au",
"pl" -> "do",
@@ -6,6 +6,7 @@ object DisambiguationExtractorConfig
// For "ar" and "he" configurations, rendering right-to-left may seem like a bug, but it's not.
// Don't change this unless you know what you're doing.
val disambiguationTitlePartMap = Map(
"am" -> " (መንታ)",
"ar" -> " (توضيح)",
"bg" -> " (пояснение)",
"ca" -> " (desambiguació)",
@@ -6,6 +6,19 @@ object GenderExtractorConfig
val pronounsMap = Map(
"en" -> Map("she" -> "female", "her" -> "female", "he" -> "male", "his" -> "male", "him" -> "male", "herself" -> "female", "himself" -> "male",
"She" -> "female", "Her" -> "female", "He" -> "male", "His" -> "male", "Him" -> "male", "Herself" -> "female", "Himself" -> "male" //TODO why not just do case insensitive matches?
),
"am" -> Map(
"እሷ" -> "ሴት",
"እሷን" -> "ሴት",
"የሷ" -> "ሴት",
"እራሷን" -> "ሴት",
"እራሷ" -> "ሴት",
"እሱ" -> "ወንድ",
"እሱን" -> "ወንድ",
"የእሱ" -> "ወንድ",
"የራሱ" -> "ወንድ",
"እራሱ" -> "ወንድ",
"እራሱን" -> "ወንድ"
),
"pt" -> Map ("ela"-> "mulher", "dela" -> "mulher", "ele" -> "homem", "dele" -> "homem", "nela" -> "mulher", "nele" -> "homem",
"Ela"-> "mulher", "Dela" -> "mulher", "Ele" -> "homem", "Dele" -> "homem", "Nela" -> "mulher", "Nele" -> "homem"
@@ -9,6 +9,23 @@ object HomepageExtractorConfig
// Don't change this unless you know how it is done.

private val propertyNamesMap = Map(
"am" -> Set(
"ድህረገፅ",
"ድህረ_ገፅ",
"ገጽ",
"ድህረ ገጽ",
"ድህረ_ገጽ",
"ድረ_ገፅ",
"ድረገፅ",
"ድረገጽ",
"ድረ ገጽ",
"ድረ_ገጽ",
"ዋና_ገጽ",
"ዌብሳይት",
"website",
"web",
"site"
),
"ar" -> Set("الموقع", "الصفحة الرسمية", "موقع", "الصفحة الرئيسية", "صفحة ويب", "موقع ويب"),
"bg" -> Set("сайт", "уебсайт"),
"ca" -> Set("pàgina", "web", "lloc"),
@@ -38,6 +55,7 @@ object HomepageExtractorConfig
val supportedLanguages = propertyNamesMap.keySet

private val externalLinkSectionsMap = Map(
"am" -> "(?:የውጭ ንባብ|የውጭ ማያያዣ)",
"ar" -> "وصلات خارجية",
"bg" -> "Външни препратки",
"ca" -> "(?:Enllaços externs|Enllaço extern)",
@@ -65,6 +83,7 @@ object HomepageExtractorConfig
}

private val officialMap = Map(
"am" -> "ዋና",
"ar" -> "رسمي",
"bg" -> "официален",
"ca" -> "oficial",
@@ -12,6 +12,7 @@ object InfoboxExtractorConfig

val ignoreProperties = Map (
"en"-> Set("image", "image_photo", "map"),
"am"-> Set("ምስል", "ፎቶ", "ስዕል", "ካርታ", "አርማ"),
"ar"-> Set("صورة"),
"id"-> Set("foto", "gambar"),
"el"-> Set("εικόνα", "εικονα", "Εικόνα", "Εικονα", "χάρτης", "Χάρτης"),
@@ -10,7 +10,7 @@ object TopicalConceptsExtractorConfig
val catMainTemplates = Set(
"مزيد" ,// ar
"Infocat", "Infocatm", // ca
"Catmore", // el,ja
"Catmore", // el,ja,am
"Cat main", // en
"AP", // es
"Nagusia", // eu
2 changes: 2 additions & 0 deletions dump/extraction.default.properties
@@ -21,6 +21,8 @@ extractors=.ArticleCategoriesExtractor,.ArticlePageExtractor,.ArticleTemplatesEx
.PageLinksExtractor,.RedirectExtractor,.RevisionIdExtractor,.ProvenanceExtractor,.SkosCategoriesExtractor,\
.WikiPageLengthExtractor,.WikiPageOutDegreeExtractor

extractors.am=.MappingExtractor,.DisambiguationExtractor,.HomepageExtractor,.GenderExtractor,.TopicalConceptsExtractor

extractors.ar=.MappingExtractor,.TopicalConceptsExtractor

extractors.be=.MappingExtractor
2 changes: 2 additions & 0 deletions dump/extraction.mappings.properties
@@ -18,6 +18,8 @@ languages=@mappings

extractors=.MappingExtractor

#extractors.am=.MappingExtractor,.DisambiguationExtractor,.HomepageExtractor,.GenderExtractor,.TopicalConceptsExtractor
#
#extractors.ar=.MappingExtractor,.TopicalConceptsExtractor
#
#extractors.be=.MappingExtractor
2 changes: 1 addition & 1 deletion dump/extraction.topical.properties
@@ -15,7 +15,7 @@
# use only directories that contain a 'download-complete' file? Default is false.
require-download-complete=true

languages=ar,ca,el,en,es,eu,fr,it,pt,ru
languages=am,ar,ca,el,en,es,eu,fr,it,pt,ru

# extractor class names starting with "." are prefixed by "org.dbpedia.extraction.mappings"
