About fastText's n-gram #94

Open
Wascien opened this issue Aug 14, 2022 · 1 comment

Comments


Wascien commented Aug 14, 2022

```python
def biGramHash(sequence, t, buckets):
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087) % buckets

def triGramHash(sequence, t, buckets):
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    t2 = sequence[t - 2] if t - 2 >= 0 else 0
    return (t2 * 14918087 * 18408749 + t1 * 14918087) % buckets
```

I'd like to ask about the n-grams: this is how the code maps them to ids. I'm a bit confused. For the 2-gram case, why is only the id of the first of the two words used in the hash, rather than both word ids?
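For reference, a minimal sketch of how these hash functions are typically called, producing one bucketed bi-gram id and one tri-gram id per token position (the names `words_line`, `pad_size`, and `n_gram_vocab` are assumptions for illustration, not quoted from the repository):

```python
def build_ngram_features(words_line, pad_size, n_gram_vocab):
    # For every position t, hash the preceding context into a fixed number of
    # buckets; the resulting ids index separate bi-gram / tri-gram embedding tables.
    bigram, trigram = [], []
    for t in range(pad_size):
        bigram.append(biGramHash(words_line, t, n_gram_vocab))
        trigram.append(triGramHash(words_line, t, n_gram_vocab))
    return bigram, trigram
```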


choeycui commented Oct 26, 2022

The logic is as follows:

  1. If you first build a dictionary from every bi-gram and tri-gram, the n-gram vocabulary becomes very large and the computational complexity blows up.
  2. Because of 1., the author uses hashing to cap the vocabulary size.
  3. Once the vocabulary is capped by the hashing in 2. (or by any other method), the vast majority of n-grams end up as unseen n-gram features (Chinese text is actually even sparser than English in its n-gram features for classification). Under that constraint, the hashing here applies Katz back-off, backing the bi-gram and tri-gram off to a shorter context: an N-gram falls back to the (N-1)-gram, e.g. Count(the, dog) ~= Count(dog); see the sketch after this list.
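
To make point 3. concrete, here is a hedged comparison between a hash that mixes both ids of the bi-gram and the back-off style hash used in the quoted code (the function names are hypothetical; only the second variant reflects the repository's behaviour):

```python
def biGramHashFull(sequence, t, buckets):
    # Hypothetical variant: mixes the previous id AND the current id, so each
    # bucket stands for a specific word pair. With sparse Chinese n-grams, most
    # of these pairs would never be seen again at test time.
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087 + sequence[t]) % buckets

def biGramHashBackoff(sequence, t, buckets):
    # The quoted code's behaviour: the current id is dropped, so the bi-gram
    # feature backs off to the preceding uni-gram, in the spirit of Katz
    # back-off (Count(the, dog) ~= Count(dog)).
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087) % buckets
```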

Reference:
[Mikolov et al. 2011] Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký. 2011. Strategies for training large scale neural network language models. In Workshop on Automatic Speech Recognition and Understanding. IEEE.
https://en.wikipedia.org/wiki/Katz%27s_back-off_model
