Single-side deduplication #928

Open
ZJaume opened this issue Nov 13, 2024 · 1 comment
Labels
quality Improving robustness and translation quality

Comments

@ZJaume
Collaborator

ZJaume commented Nov 13, 2024

Some experiments that a colleague ran during the MaCoCu project found that deduplicating on only the source side or only the target side improved translation quality. IIRC it was not clear which was better, source or target, but both were better than deduplicating on the whole sentence pair. In some cases I think the gain was about 1 BLEU point for mid-resource languages. This probably reduces the number of translation inconsistencies.

I couldn't find the table with the results, but I think this is worth exploring.

Maybe you are already doing this, but I was not sure. At least in the old pipeline dedupe is using the whole sentence pair.
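
To make the distinction concrete, here is a minimal sketch of pair-level versus single-side deduplication. It is not the pipeline's code: the function names `dedupe_pairs` and `dedupe_single_side`, the exact-string matching, and the keep-first-occurrence policy are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def dedupe_pairs(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Pair-level dedup: drop a pair only if the same (source, target) was seen."""
    seen = set()
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            yield pair


def dedupe_single_side(pairs: Iterable[Pair], side: str = "source") -> Iterator[Pair]:
    """Single-side dedup: keep only the first pair for each distinct sentence
    on the chosen side (hypothetical keep-first policy)."""
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    idx = 0 if side == "source" else 1
    seen = set()
    for pair in pairs:
        key = pair[idx]
        if key not in seen:
            seen.add(key)
            yield pair


# Toy corpus: one source with two different translations, plus an exact repeat.
corpus = [
    ("Hello", "Hola"),
    ("Hello", "Buenas"),
    ("Hello", "Hola"),
    ("Hi", "Hola"),
]

by_pair = list(dedupe_pairs(corpus))
by_source = list(dedupe_single_side(corpus, side="source"))
by_target = list(dedupe_single_side(corpus, side="target"))
```

On this toy corpus, pair-level dedup keeps both translations of "Hello", while source-side dedup keeps only the first one. That would be the mechanism behind fewer translation inconsistencies, if the effect holds.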

@ZJaume ZJaume added the quality Improving robustness and translation quality label Nov 13, 2024
@gregtatum
Member

We are de-duplicating based on source and target.
