Commit

fix: Read multiple meanings from same source
trungnt2910 committed Jan 9, 2023
1 parent f9808c3 commit 8767d50
Showing 3 changed files with 37 additions and 6 deletions.
2 changes: 1 addition & 1 deletion out_vn/index.json
@@ -1 +1 @@
-{"title":"Từ điển Hán Nôm","format":3,"revision":"trungnt2910.hannom.20230109-132817.kanjidic2","sequenced":false}
+{"title":"Từ điển Hán Nôm","format":3,"revision":"trungnt2910.hannom.20230109-152759.kanjidic2","sequenced":false}
2 changes: 1 addition & 1 deletion out_vn/kanji_bank_1.json

Large diffs are not rendered by default.

39 changes: 35 additions & 4 deletions src/Converters/HanNomConverter.cs
@@ -229,6 +229,37 @@ async Task DoWork()
     Character = kanji
 };

+/* TODO: Refactor the whole thing to build some kind of "Han Nom data tree"
+ * Then, we can get rid of all the HTML and serialize this page into a cleaner data type (JSON,...)
+ * The tree of a typical entry, for reference:
+hvres han-word
+hvres-header
+hvres-word
+hvres-definition
+hvres-details
+hvres-meaning
+hvres-source
+hvres-meaning
+hvres-source
+hvres-meaning
+... More source - meaning pairs
+hvres name=....
+hvres-header
+hvres-definition
+hvres-details
+hvres-source
+hvres-meaning
+hvres-source
+hvres-meaning
+... More source - meaning pairs
+... More hvres name=... groups
+ */
+
 if (mainDetail != null)
 {
     var mainDetailFirstMeaning = mainDetail.Descendants()
@@ -292,17 +323,17 @@ async Task DoWork()

 // We only extract data from the general dictionary. Other dictionaries provide definitions
 // that are too complicated.
-var definitionTuDienPhoThongHeader = definition.Descendants()
-    .Where(d => d.HasClass("hvres-source") && d.InnerText.Trim() == "Từ điển phổ thông")
-    .FirstOrDefault();
+var definitionTuDienPhoThongHeaders = definition.Descendants()
+    .Where(d => d.HasClass("hvres-source") && d.InnerText.Trim() == "Từ điển phổ thông");

-if (definitionTuDienPhoThongHeader != null)
+foreach (var definitionTuDienPhoThongHeader in definitionTuDienPhoThongHeaders)
 {
     var definitionTuDienPhoThongContent = definitionTuDienPhoThongHeader.NextSibling;
     while (definitionTuDienPhoThongContent != null && !definitionTuDienPhoThongContent.HasClass("hvres-meaning"))
     {
         definitionTuDienPhoThongContent = definitionTuDienPhoThongContent.NextSibling;
     }

     var definitionTuDienPhoThongText = definitionTuDienPhoThongContent?.GetPlainText().Split(Environment.NewLine)
         .Select(s => s.Trim())
         .Select(s => s.Trim("0123456789. ".ToCharArray()))
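
The hunk above swaps a single FirstOrDefault() lookup for a loop over every matching hvres-source header, so that all meanings listed under the same source are read rather than only the first. A minimal, self-contained sketch of that idea follows. Node, MeaningReader, Program and the sample texts are hypothetical stand-ins that do not exist in the repository, and the real converter walks HTML nodes rather than this flat list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a parsed HTML node: only a CSS class and its text.
public record Node(string Class, string Text);

public static class MeaningReader
{
    // For every "hvres-source" header whose text equals `source`, walk forward
    // to the next "hvres-meaning" sibling and collect its text.
    public static List<string> ReadMeanings(IReadOnlyList<Node> siblings, string source)
    {
        var meanings = new List<string>();
        for (var i = 0; i < siblings.Count; ++i)
        {
            if (siblings[i].Class != "hvres-source" || siblings[i].Text.Trim() != source)
            {
                continue;
            }

            // Same idea as the NextSibling loop in the diff: skip anything
            // that is not a meaning node.
            var j = i + 1;
            while (j < siblings.Count && siblings[j].Class != "hvres-meaning")
            {
                ++j;
            }

            if (j < siblings.Count)
            {
                meanings.Add(siblings[j].Text.Trim());
            }
        }
        return meanings;
    }
}

public static class Program
{
    public static void Main()
    {
        // Placeholder data: the same source appears twice in one entry.
        var siblings = new List<Node>
        {
            new("hvres-source", "Từ điển phổ thông"),
            new("hvres-meaning", "meaning A"),
            new("hvres-source", "other source"),
            new("hvres-meaning", "meaning B"),
            new("hvres-source", "Từ điển phổ thông"),
            new("hvres-meaning", "meaning C"),
        };

        foreach (var meaning in MeaningReader.ReadMeanings(siblings, "Từ điển phổ thông"))
        {
            Console.WriteLine(meaning); // prints "meaning A", then "meaning C"
        }
    }
}
```

With a first-match-only lookup, the same input would yield just "meaning A"; iterating over all matching headers is what picks up "meaning C" as well.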