Commit
* Update training guide
* Fix docs
* Add index file
* Remove header
* Fix docs link
* Remove tensorboard section
* Add theme
* Update navigation
* Add logo
* Use absolute links
* Fix code links
* Fix code links
* Fix link
* Clarify what config is
* Fix note for bicleaner
  Co-authored-by: Marco Castelluccio <[email protected]>
* Fix typo
  Co-authored-by: Greg Tatum <[email protected]>
* Fix link
* Fix mentioning of Marian
  Co-authored-by: Greg Tatum <[email protected]>
* Remove "my"
* Make note about snakemake more visible
* Fix phrasing
* Add link to bilceaner paper
* Add clarifications
* Add links to default training configs
* Add reference to bilceaner section
* Small fixes

---------

Co-authored-by: Marco Castelluccio <[email protected]>
Co-authored-by: Greg Tatum <[email protected]>
1 parent cf51faa, commit 2df0a3a
Showing 15 changed files with 465 additions and 184 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
remote_theme: just-the-docs/just-the-docs | ||
#color_scheme: dark | ||
title: Firefox Translations Training | ||
description: Documentation for the Firefox Translations training pipelines | ||
heading_anchors: true | ||
# doesn't work | ||
favicon_ico: "img/logo.svg" | ||
# Aux links for the upper right navigation | ||
aux_links: | ||
"GitHub": | ||
- "https://github.com/mozilla/firefox-translations-training" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
--- | ||
layout: default | ||
title: Data cleaning | ||
nav_order: 5 | ||
--- | ||
|
||
# Data cleaning | ||
|
||
Data cleaning makes datasets less noisy, which improves translation quality. | |
|
||
## Regular pipeline | ||
|
||
|
||
Config setting: | ||
``` | ||
use-opuscleaner: false | ||
``` | ||
|
||
### Dataset fixing | ||
|
||
Some datasets require fixes like detokenization. | ||
Dataset-specific and language-specific fixes are implemented in [https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes). | |
Naming convention: | ||
- `<dataset_name>.sh` for parallel dataset cleaning | ||
- `<dataset_name>.<lang>.sh` for language-specific cleaning of a parallel or monolingual dataset | |
- `/` in the dataset name should be replaced with `_` | |
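
The naming convention above can be sketched as a small helper. This is only an illustration, not code from the pipeline: the function name `fix_script_names` and the dataset name `corpus/v1` are hypothetical.

```python
def fix_script_names(dataset_name: str, lang: str) -> tuple[str, str]:
    """Return the (parallel, language-specific) fix script names for a dataset.

    Hypothetical helper: it only mirrors the documented naming convention.
    """
    # "/" in the dataset name is replaced with "_"
    safe_name = dataset_name.replace("/", "_")
    return f"{safe_name}.sh", f"{safe_name}.{lang}.sh"


# "corpus/v1" is a made-up dataset name used only for illustration
parallel_script, lang_script = fix_script_names("corpus/v1", "ru")
print(parallel_script)  # corpus_v1.sh
print(lang_script)      # corpus_v1.ru.sh
```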
|
||
### Cleaning scripts | ||
|
||
Make sure the language is present in the [clean_parallel](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script. | |
|
||
|
||
### Bicleaner | ||
|
||
It is recommended to use Bicleaner ML models to filter noisy data. | ||
See more details on how to configure it in the [Model training guide, Bicleaner section](training-guide.md/#bicleaner). | ||
|
||
|
||
## OpusCleaner | ||
|
||
Another option is to use the all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by the HPLT project. | |
|
||
Config setting: | ||
``` | ||
use-opuscleaner: true | ||
``` | ||
|
||
## Custom filter configs | ||
The idea behind OpusCleaner is to customize filter rules for each language pair and dataset | |
to get a training corpus with less noise and train higher-quality translation models. | |
|
||
Filtering rules can be tuned in an interactive UI. | ||
|
||
### Installation | ||
|
||
Install the OpusCleaner UI on a server. | ||
See the installation instructions in the [OpusCleaner readme](https://github.com/hplt-project/OpusCleaner). | ||
|
||
For local usage, run `make opuscleaner-ui` from a poetry shell. | |
Then go to `http://0.0.0.0:8000`. | ||
|
||
### Making filters | ||
|
||
Choose a language pair and download the required OPUS datasets. | ||
They will correspond to `opus_...` training datasets in the training pipeline config. | ||
|
||
Configure cleaning rules for the datasets in the UI. | ||
|
||
Copy JSON files for the produced filters `data/train-parts/*.filter.json` to | ||
`pipeline/clean/opuscleaner/configs/<src-lang-code>-<trg-lang-code>/`. | ||
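
The copy step above could be scripted roughly as follows. This is a minimal sketch, not part of the pipeline: the function `copy_filters` and its arguments are hypothetical, and it assumes the UI writes its filters under `<ui_data_dir>/train-parts/`.

```python
import shutil
from pathlib import Path


def copy_filters(ui_data_dir: Path, repo_root: Path, src: str, trg: str) -> list[Path]:
    """Copy *.filter.json files produced by the UI into the per-language-pair config directory."""
    target_dir = repo_root / "pipeline" / "clean" / "opuscleaner" / "configs" / f"{src}-{trg}"
    target_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # Only files matching *.filter.json are copied; everything else is ignored
    for filter_file in sorted((ui_data_dir / "train-parts").glob("*.filter.json")):
        copied.append(Path(shutil.copy2(filter_file, target_dir)))
    return copied
```

For example, `copy_filters(Path("data"), Path("."), "en", "ru")` would place the filters in `pipeline/clean/opuscleaner/configs/en-ru/`, assuming it is run from the repository root.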
|
||
### Default config | ||
|
||
If no custom config was specified for the dataset, | |
the [default config template](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/opuscleaner/configs/default.filters.json) will be used. | ||
|
||
Modify it if needed. Some rules require specifying the source or target language. | |
The `<src>` and `<trg>` in the template will be automatically replaced with the trained language pair. | ||
The generated default config will be copied to the target dataset cleaning directory. | ||
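
The placeholder substitution described above can be illustrated with a short sketch. It is not the pipeline's actual implementation: the function `render_default_config` is hypothetical, and the template content (filter name and parameter names) is invented for the example and does not reflect the real `default.filters.json`.

```python
import json


def render_default_config(template_text: str, src: str, trg: str) -> dict:
    """Replace the <src> and <trg> placeholders and parse the result as JSON."""
    rendered = template_text.replace("<src>", src).replace("<trg>", trg)
    return json.loads(rendered)


# Hypothetical template: the real default template may look different
template = """
{
  "filters": [
    {"filter": "example_filter", "parameters": {"LANG1": "<src>", "LANG2": "<trg>"}}
  ]
}
"""

config = render_default_config(template, "en", "ru")
print(config["filters"][0]["parameters"])
```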
|
||
### Running | ||
|
||
Enable OpusCleaner in the training pipeline config and run the pipeline as usual. | ||
OpusCleaner will replace the default [clean-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/clean-corpus.sh) script. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,9 @@ | ||
--- | ||
layout: default | ||
title: Development | ||
nav_order: 7 | ||
--- | ||
|
||
# Development | ||
|
||
## Architecture | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
--- | ||
layout: default | ||
title: Home | ||
nav_order: 1 | ||
description: "Firefox Translations Training documentation." | ||
permalink: / | ||
--- | ||
|
||
# Firefox Translations training | ||
Training pipelines for Firefox Translations machine translation models. | ||
|
||
The trained models are hosted in the [firefox-translations-models](https://github.com/mozilla/firefox-translations-models/) repository, | |
are compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator), and | |
power Firefox web page translation starting with version 118. | |
|
||
The pipeline was originally developed as part of the [Bergamot](https://browser.mt/) project, which focuses on improving client-side machine translation in a web browser. | |
|
||
## Training pipeline | ||
|
||
The pipeline is capable of training a translation model for a language pair end to end. | ||
Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters. | ||
Some settings, especially for low-resource languages, might require extra tuning. | |
|
||
We use [Marian](https://marian-nmt.github.io), the fast neural machine translation engine. | |
|
||
## Learning resources | ||
|
||
- High-level overview: [post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/) | |
- [Model training guide](training-guide.md) - practical advice on how to use the pipeline | ||
- [Reference papers](references.md) | ||
|
||
|
||
## Acknowledgements | ||
This project uses materials developed by: | ||
- Bergamot project ([github](https://github.com/browsermt), [website](https://browser.mt/)) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303 | ||
- HPLT project ([github](https://github.com/hplt-project), [website](https://hplt-project.org/)) that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546] | ||
- OPUS-MT project ([github](https://github.com/Helsinki-NLP/Opus-MT), [website](https://opus.nlpl.eu/)) | ||
- Many other open source projects and research papers (see [References](references.md)) |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
--- | ||
layout: default | ||
title: Orchestrators | ||
nav_order: 6 | ||
has_children: true | ||
has_toc: false | ||
--- | ||
|
||
# Orchestrators | ||
|
||
An orchestrator is responsible for workflow management and parallelization. | ||
|
||
Supported orchestrators: | ||
|
||
- [Taskcluster](https://taskcluster.net/) - Mozilla's task execution framework. It is also used for Firefox CI. | |
It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability. | ||
[Usage instructions](task-cluster.md). | ||
- [Snakemake](https://snakemake.github.io/) - a file-based orchestrator that can be used to run the pipeline locally or on a Slurm cluster. | |
[Usage instructions](snakemake.md). | ||
|
||
Mozilla is currently switching to Taskcluster, and the Snakemake workflow will be less actively maintained in the future. |
2df0a3a
Uh oh! Looks like an error! Details
Taskcluster-GitHub attempted to create a task for this event with the following scopes:
The expansion of these scopes is not sufficient to create the task, leading to the following:
Client ID static/taskcluster/github does not have sufficient scopes and is missing the following scopes:
This request requires the client to satisfy the following scope expression: