
Configure vocab for CJK #906

Merged: 22 commits into main, Nov 7, 2024

Conversation

@eu9ene (Collaborator) commented Oct 30, 2024

Configures the SentencePiece vocabulary for CJK languages (see the sketch after the list):

  • character coverage
  • vocab size

closes #745
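
To make the two knobs concrete, here is a minimal, hypothetical sketch of training a SentencePiece model with them set for a CJK pair. The corpus name and exact values are illustrative assumptions, not the pipeline's real settings (those live in pipeline/train/spm-vocab.sh):

```python
import sentencepiece as spm

# Illustrative values only; the pipeline's real settings live in
# pipeline/train/spm-vocab.sh.
spm.SentencePieceTrainer.train(
    input="corpus.zh-en.txt",   # hypothetical joint training corpus
    model_prefix="vocab",       # writes vocab.model and vocab.vocab
    model_type="unigram",       # SentencePiece's default algorithm
    vocab_size=64000,           # larger vocab to cover big CJK character inventories
    character_coverage=0.9995,  # SentencePiece's recommended value for CJK;
                                # 1.0 suits languages with small alphabets
)
```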

@eu9ene eu9ene requested review from gregtatum and ZJaume October 30, 2024 00:26
@eu9ene eu9ene requested review from a team as code owners October 30, 2024 00:26
@eu9ene eu9ene requested review from hneiva and removed request for a team and hneiva October 30, 2024 00:26
Review thread on pipeline/train/spm-vocab.sh (outdated, resolved)
@gregtatum (Member) left a comment

Seems reasonable to me.

@eu9ene eu9ene requested a review from ZJaume November 5, 2024 01:18
@ZJaume (Collaborator) commented Nov 6, 2024

I'm now thinking that the 64K vocab, as a workaround for split 32K vocabs, may end up slowing down decoding because the softmax will be more expensive. I also remember s2s models in the past struggling to learn with large vocabulary outputs like 80K, so I wonder if we will see some of this in the Transformer. We'll see what happens in these preliminary experiments, but I'm starting to lean more toward split vocabs.
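
A back-of-the-envelope sketch of the softmax-cost point: the decoder's output projection scales linearly with vocab size. The hidden size and token count below are illustrative assumptions, not values from this pipeline:

```python
# Rough FLOPs in the decoder's output projection
# (logits = hidden_states @ W_out), the part of decoding that grows
# with vocab size. All numbers below are assumptions for illustration.

def output_projection_flops(hidden_size: int, vocab_size: int, tokens: int) -> int:
    # One matmul per decoded token: ~2 * hidden_size * vocab_size multiply-adds.
    return 2 * hidden_size * vocab_size * tokens

hidden = 1024        # assumed transformer hidden size
tokens = 1_000_000   # assumed number of tokens decoded

for vocab in (32_000, 64_000):
    flops = output_projection_flops(hidden, vocab, tokens)
    print(f"vocab={vocab}: {flops:.2e} FLOPs")
# Going from a 32K to a 64K vocab doubles this layer's per-token cost,
# which is the decoding slowdown being discussed.
```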

@eu9ene (Collaborator, Author) commented Nov 6, 2024

> I'm now thinking that the 64K vocab, as a workaround for split 32K vocabs, may end up slowing down decoding because the softmax will be more expensive. I also remember s2s models in the past struggling to learn with large vocabulary outputs like 80K, so I wonder if we will see some of this in the Transformer. We'll see what happens in these preliminary experiments, but I'm starting to lean more toward split vocabs.

Yeah, I'll rerun the experiment with the latest fixes and we can test it. I'll also work on #913 so that we can experiment with split vocabs.
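
For reference, "split vocabs" here means one SentencePiece model per language side instead of a shared joint one, so the target-side softmax stays at 32K. A hypothetical sketch; corpus names and sizes are illustrative, not taken from #913:

```python
import sentencepiece as spm

# Hypothetical split setup: one 32K model per language side instead of
# a joint 64K model. Corpus names and sizes are illustrative assumptions.
for side, corpus in (("src", "mono.zh.txt"), ("trg", "mono.en.txt")):
    spm.SentencePieceTrainer.train(
        input=corpus,
        model_prefix=f"vocab.{side}",
        vocab_size=32000,
        # high coverage for the CJK side; full coverage for a small alphabet
        character_coverage=0.9995 if side == "src" else 1.0,
    )
```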

@eu9ene eu9ene mentioned this pull request Nov 6, 2024
@eu9ene eu9ene changed the base branch from cjk_training to main November 6, 2024 23:45
# Conflicts:
#	pipeline/alignments/align.py
#	pipeline/data/cjk.py
#	pipeline/data/dataset_importer.py
#	pipeline/data/download-mono.py
#	pipeline/data/requirements/data.in
#	poetry.lock
#	taskcluster/kinds/finetune-student/kind.yml
#	taskcluster/kinds/train-student/kind.yml
#	taskcluster/kinds/train-teacher/kind.yml
#	tests/test_alignments.py
#	tests/test_cjk.py
#	tests/test_data_importer.py
#	tests/test_training.py
#	utils/config_generator.py
@eu9ene eu9ene merged commit 28de0c8 into main Nov 7, 2024
37 checks passed
Linked issue closed by this PR: Investigate issues with SentencePiece vocabulary for CJK (#745)