
Hotwords encoding for phonemes #981

Open
w11wo opened this issue Jun 6, 2024 · 8 comments


w11wo commented Jun 6, 2024

Hi. I have a phoneme-based Zipformer model.

Before this PR, I was able to encode hotwords as phoneme sequences, e.g. ɪ z/dʒ ʌ s t/b ɛ s t, following the older implementation used for e.g. Chinese character hotwords. But I noticed that Chinese character hotword encoding has since changed from 深 度 学 习 (whitespace between characters) to 深度学习 (no whitespace), and I assume the string parser now simply iterates over the non-whitespace characters in the sequence.

This, however, breaks my use case: a phoneme sequence containing digraphs, e.g. dʒ ʌ s t, is incorrectly split into d ʒ ʌ s t. My model's vocabulary includes digraph units, so it requires the old whitespace-delimited behavior.

Is it possible to add another modeling unit besides the currently supported ones (cjk, BPE, cjk+BPE)? Perhaps, instead of iterating over every non-whitespace character, the parser could split on whitespace first? Such a new modeling unit could hopefully support other use cases similar to mine.
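To illustrate the difference, here is a minimal, hypothetical sketch in Python (not the actual sherpa-onnx parser; the function names are made up) contrasting per-character tokenization with a whitespace-split scheme that keeps multi-character phoneme units such as digraphs intact:

```python
def tokenize_per_char(hotword: str) -> list[str]:
    # Current behaviour for CJK-style hotwords:
    # every non-whitespace character becomes its own token.
    return [ch for ch in hotword if not ch.isspace()]


def tokenize_by_space(hotword: str) -> list[str]:
    # Proposed whitespace-split behaviour:
    # trust the caller's spacing, so digraphs like "dʒ" stay whole.
    return hotword.split()


phonemes = "dʒ ʌ s t"
print(tokenize_per_char(phonemes))  # ['d', 'ʒ', 'ʌ', 's', 't'] — digraph dʒ is broken apart
print(tokenize_by_space(phonemes))  # ['dʒ', 'ʌ', 's', 't'] — digraph preserved
```

A do-not-tokenize (or whitespace-split) modeling unit would effectively select the second behaviour for vocabularies whose units can span multiple characters.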

Massive thanks for all the work and help thus far!

csukuangfj (Collaborator) commented:
@pkufool Could you have a look?


pkufool commented Jun 7, 2024

Hmm, maybe we can add an option like do-not-tokenize; I think that should fix your issue.


pkufool commented Jun 7, 2024

For now, I think you can use the older version, v1.9.24.


w11wo commented Jun 7, 2024

@pkufool Yes, the do-not-tokenize option sounds good.

I can stick to older versions for now, but I wanted to try the customizable per-word hotword scores, which are available only in the latest releases, hence the need for this new feature.


pkufool commented Jun 7, 2024

@w11wo OK, will make a PR.


w11wo commented Jun 18, 2024

Hi @pkufool, is there an update on the PR?


pkufool commented Jun 21, 2024

> Hi @pkufool, is there an update on the PR?

There is an ongoing PR: #1039


w11wo commented Jun 21, 2024

Thank you so much, @pkufool. Looking forward to it getting merged.
