About fastText's n-gram #94

Open
Wascien opened this issue Aug 14, 2022 · 1 comment

Comments


Wascien commented Aug 14, 2022

```python
def biGramHash(sequence, t, buckets):
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087) % buckets

def triGramHash(sequence, t, buckets):
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    t2 = sequence[t - 2] if t - 2 >= 0 else 0
    return (t2 * 14918087 * 18408749 + t1 * 14918087) % buckets
```

I'd like to ask about the n-grams: this is how the code maps them to ids. I'm a bit confused. For the 2-gram case, why is only the id of the first of the two words used in the hash, rather than both word ids?
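For reference, a minimal sketch of how these hash functions are typically called, producing one bucketed bi-gram id and one tri-gram id per token position (the names `words_line`, `pad_size`, and `n_gram_vocab` are assumptions for illustration, not quoted from the repository):

```python
def build_ngram_features(words_line, pad_size, n_gram_vocab):
    # For every position t, hash the preceding context into a fixed number of
    # buckets; the resulting ids index separate bi-gram / tri-gram embedding tables.
    bigram, trigram = [], []
    for t in range(pad_size):
        bigram.append(biGramHash(words_line, t, n_gram_vocab))
        trigram.append(triGramHash(words_line, t, n_gram_vocab))
    return bigram, trigram
```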


choeycui commented Oct 26, 2022

The logic is as follows:

  1. If you first build a dictionary from every bi-gram and tri-gram, the n-gram vocabulary becomes very large and the computational complexity blows up.
  2. Because of 1., the author uses hashing to cap the vocabulary size.
  3. Once the vocabulary is capped by the hashing in 2. (or by any other method), the vast majority of n-grams end up as unseen n-gram features (Chinese text is actually even sparser than English in its n-gram features for classification). Under that constraint, the hashing here applies Katz back-off, backing the bi-gram and tri-gram off to a shorter context: an N-gram falls back to the (N-1)-gram, e.g. Count(the, dog) ~= Count(dog); see the sketch after this list.
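
To make point 3. concrete, here is a hedged comparison between a hash that mixes both ids of the bi-gram and the back-off style hash used in the quoted code (the function names are hypothetical; only the second variant reflects the repository's behaviour):

```python
def biGramHashFull(sequence, t, buckets):
    # Hypothetical variant: mixes the previous id AND the current id, so each
    # bucket stands for a specific word pair. With sparse Chinese n-grams, most
    # of these pairs would never be seen again at test time.
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087 + sequence[t]) % buckets

def biGramHashBackoff(sequence, t, buckets):
    # The quoted code's behaviour: the current id is dropped, so the bi-gram
    # feature backs off to the preceding uni-gram, in the spirit of Katz
    # back-off (Count(the, dog) ~= Count(dog)).
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087) % buckets
```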

Reference:
[Mikolov et al. 2011] Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký. 2011. Strategies for training large scale neural network language models. In Workshop on Automatic Speech Recognition and Understanding. IEEE.
https://en.wikipedia.org/wiki/Katz%27s_back-off_model
