Using adversarial texts for training text normalization algorithms #375

@p16i

PyThaiNLP's current text normalization relies heavily on rules, which are sufficient in some circumstances. However, it fails badly on input such as the following:

Input: อัตราดอกเบ้ียเงนิฝากและเงนิกู้เพ่ิมขึน้
Expect: อัตราดอกเบี้ยเงินฝากและเงินกู้เพิ่มขึ้น

Would it be possible to train an ML model for text normalization? I think the approach would be similar to what we did for thai2rom, which is a seq2seq model.

As for training data, we could develop a probabilistic model that perturbs a given word according to some rules, e.g. สระลอย (floating vowels). We could then use it to generate the training data for our seq2seq normalization model.
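To make the data-generation idea concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not an existing PyThaiNLP API: the names `THAI_COMBINING`, `perturb`, and `make_pairs` are hypothetical, and the single rule used (a combining vowel or tone mark drifts one position to the right with some probability) is just one guess at how the errors in the example above arise.

```python
import random

# Assumption: upper/lower vowels and tone marks are the characters that "float".
THAI_COMBINING = set(
    "\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39"  # vowels written above/below
    "\u0e47\u0e48\u0e49\u0e4a\u0e4b\u0e4c\u0e4d"  # tone marks and other signs
)


def perturb(text, swap_prob=0.3, rng=None):
    """Randomly move combining marks one position to the right."""
    if rng is None:
        rng = random.Random()
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in THAI_COMBINING and rng.random() < swap_prob:
            # Swap the mark with the character that follows it.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the moved mark so it is not moved twice
        else:
            i += 1
    return "".join(chars)


def make_pairs(clean_texts, n_per_text=3, swap_prob=0.3, seed=0):
    """Build (noisy, clean) pairs to use as seq2seq source/target examples."""
    rng = random.Random(seed)
    return [
        (perturb(clean, swap_prob, rng), clean)
        for clean in clean_texts
        for _ in range(n_per_text)
    ]
```

Calling `make_pairs` on a list of clean Thai sentences would yield noisy/clean pairs for a seq2seq model. A real generator would likely need more rules and probabilities estimated from actual PDF-extraction output.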

From what I can see, since many official Thai documents are in PDF, this model would be very useful for preprocessing the results of PDF parsing, which is typically not robust to cases such as สระลอย.

@c4n: do you think we can leverage what you've developed for https://github.com/c4n/Thai-Adversarial-Evaluation here?
