Using adversarial texts for training text normalization algorithms #375

PyThaiNLP's current text normalization relies heavily on rules, which are sufficient in some circumstances. However, it fails badly on the following example:

![image](https://user-images.githubusercontent.com/1214890/78266931-b2e7a300-7506-11ea-9e91-700cf7971018.png)

**Input:** `อัตราดอกเบ้ียเงนิฝากและเงนิกู้เพ่ิมขึน้`
**Expect:** `อัตราดอกเบี้ยเงินฝากและเงินกู้เพิ่มขึ้น`

Would it be possible to train an ML model for text normalization? I think the approach would be similar to what we did for `thai2rom`, which is a seq2seq model.

As for training data, we could develop a probabilistic model that perturbs a given word according to some rules, e.g. สระลอย (floating vowels, where a vowel or tone mark ends up detached from its consonant). We could then use it to generate the training data for our seq2seq normalization model.
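To make the idea concrete, here is a minimal sketch of such a perturbation step. It assumes one simple rule: with some probability, a Thai combining mark (an above/below vowel or a tone mark) is swapped with the character that follows it, which is the kind of misordering seen in the example above. The names `THAI_MARKS`, `perturb`, and `make_pairs` are hypothetical and not part of PyThaiNLP, and this is not the method used in Thai-Adversarial-Evaluation.

```python
import random
from typing import List, Optional, Tuple

# Thai combining marks by code point (assumed rule set for this sketch):
# U+0E31, U+0E34..U+0E3A (above/below vowels), U+0E47..U+0E4E (tone marks etc.)
THAI_MARKS = {chr(c) for c in [0x0E31, *range(0x0E34, 0x0E3B), *range(0x0E47, 0x0E4F)]}


def perturb(text: str, swap_prob: float = 0.3,
            rng: Optional[random.Random] = None) -> str:
    """Randomly swap a Thai combining mark with the character after it."""
    rng = rng or random.Random()
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in THAI_MARKS and rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def make_pairs(clean_texts: List[str], n_variants: int = 3,
               swap_prob: float = 0.3, seed: int = 0) -> List[Tuple[str, str]]:
    """Build (noisy, clean) pairs to use as seq2seq source/target examples."""
    rng = random.Random(seed)
    pairs = []
    for clean in clean_texts:
        for _ in range(n_variants):
            noisy = perturb(clean, swap_prob, rng)
            if noisy != clean:  # keep only pairs where something changed
                pairs.append((noisy, clean))
    return pairs


if __name__ == "__main__":
    for noisy, clean in make_pairs(["เงินฝาก", "ดอกเบี้ย"]):
        print(repr(noisy), "->", repr(clean))
```

The noisy string is the model input and the clean string is the target. More rules (dropped marks, duplicated marks, and so on) could be added as further perturbation functions with their own probabilities.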

Given that many official Thai documents are distributed as PDFs, this model would be very useful for preprocessing the output of PDF parsing, which is typically not robust to cases such as สระลอย.

@c4n : do you think we can leverage what you've developed for https://github.com/c4n/Thai-Adversarial-Evaluation here?




