
Configure vocab for CJK #906

Merged: 22 commits into main, Nov 7, 2024

Conversation

@eu9ene (Collaborator) commented Oct 30, 2024

Configures the SentencePiece vocabulary for CJK languages (see the sketch after the list):

  • character coverage
  • vocab size

closes #745
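
To make the two knobs concrete, here is a minimal, hypothetical sketch of training a SentencePiece model with them set for a CJK pair. The corpus name and exact values are illustrative assumptions, not the pipeline's real settings (those live in pipeline/train/spm-vocab.sh):

```python
import sentencepiece as spm

# Illustrative values only; the pipeline's real settings live in
# pipeline/train/spm-vocab.sh.
spm.SentencePieceTrainer.train(
    input="corpus.zh-en.txt",   # hypothetical joint training corpus
    model_prefix="vocab",       # writes vocab.model and vocab.vocab
    model_type="unigram",       # SentencePiece's default algorithm
    vocab_size=64000,           # larger vocab to cover big CJK character inventories
    character_coverage=0.9995,  # SentencePiece's recommended value for CJK;
                                # 1.0 suits languages with small alphabets
)
```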

@eu9ene eu9ene requested review from gregtatum and ZJaume October 30, 2024 00:26
@eu9ene eu9ene requested review from a team as code owners October 30, 2024 00:26
@eu9ene eu9ene requested review from hneiva and removed request for a team and hneiva October 30, 2024 00:26
Review thread on pipeline/train/spm-vocab.sh (outdated, resolved)
@gregtatum (Member) left a comment

Seems reasonable to me.

@eu9ene eu9ene requested a review from ZJaume November 5, 2024 01:18
@ZJaume (Collaborator) commented Nov 6, 2024

I'm now thinking that the 64K vocab, as a workaround for split 32K vocabs, may end up slowing down decoding because the softmax will be more expensive. I also remember s2s models in the past struggling to learn with large vocabulary outputs like 80K, so I wonder if we will see some of this in the Transformer. We'll see what happens in these preliminary experiments, but I'm starting to lean more toward split vocabs.
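
A back-of-the-envelope sketch of the softmax-cost point: the decoder's output projection scales linearly with vocab size. The hidden size and token count below are illustrative assumptions, not values from this pipeline:

```python
# Rough FLOPs in the decoder's output projection
# (logits = hidden_states @ W_out), the part of decoding that grows
# with vocab size. All numbers below are assumptions for illustration.

def output_projection_flops(hidden_size: int, vocab_size: int, tokens: int) -> int:
    # One matmul per decoded token: ~2 * hidden_size * vocab_size multiply-adds.
    return 2 * hidden_size * vocab_size * tokens

hidden = 1024        # assumed transformer hidden size
tokens = 1_000_000   # assumed number of tokens decoded

for vocab in (32_000, 64_000):
    flops = output_projection_flops(hidden, vocab, tokens)
    print(f"vocab={vocab}: {flops:.2e} FLOPs")
# Going from a 32K to a 64K vocab doubles this layer's per-token cost,
# which is the decoding slowdown being discussed.
```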

@eu9ene (Collaborator, Author) commented Nov 6, 2024

> I'm now thinking that the 64K vocab, as a workaround for split 32K vocabs, may end up slowing down decoding because the softmax will be more expensive. I also remember s2s models in the past struggling to learn with large vocabulary outputs like 80K, so I wonder if we will see some of this in the Transformer. We'll see what happens in these preliminary experiments, but I'm starting to lean more toward split vocabs.

Yeah, I'll rerun the experiment with the latest fixes and we can test it. I'll also work on #913 so that we can experiment with split vocabs.
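
For reference, "split vocabs" here means one SentencePiece model per language side instead of a shared joint one, so the target-side softmax stays at 32K. A hypothetical sketch; corpus names and sizes are illustrative, not taken from #913:

```python
import sentencepiece as spm

# Hypothetical split setup: one 32K model per language side instead of
# a joint 64K model. Corpus names and sizes are illustrative assumptions.
for side, corpus in (("src", "mono.zh.txt"), ("trg", "mono.en.txt")):
    spm.SentencePieceTrainer.train(
        input=corpus,
        model_prefix=f"vocab.{side}",
        vocab_size=32000,
        # high coverage for the CJK side; full coverage for a small alphabet
        character_coverage=0.9995 if side == "src" else 1.0,
    )
```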

@eu9ene eu9ene mentioned this pull request Nov 6, 2024
@eu9ene eu9ene changed the base branch from cjk_training to main November 6, 2024 23:45
# Conflicts:
#	pipeline/alignments/align.py
#	pipeline/data/cjk.py
#	pipeline/data/dataset_importer.py
#	pipeline/data/download-mono.py
#	pipeline/data/requirements/data.in
#	poetry.lock
#	taskcluster/kinds/finetune-student/kind.yml
#	taskcluster/kinds/train-student/kind.yml
#	taskcluster/kinds/train-teacher/kind.yml
#	tests/test_alignments.py
#	tests/test_cjk.py
#	tests/test_data_importer.py
#	tests/test_training.py
#	utils/config_generator.py
@eu9ene eu9ene merged commit 28de0c8 into main Nov 7, 2024
37 checks passed
Linked issue closed by this PR: Investigate issues with SentencePiece vocabulary for CJK (#745)