maxent LMs #12

Open
danpovey opened this issue Jun 4, 2016 · 5 comments

@danpovey
Owner

danpovey commented Jun 4, 2016

Another issue for anyone who's watching this project:
it would be nice, as an additional baseline for the paper, to try maxent LMs.
Can someone figure out how to do this on, say, Switchboard or tedlium?

@danpovey
Owner Author

danpovey commented Jun 4, 2016

... I think the latest version of SRILM supports them, and they're supposed to be a little better than regular Kneser-Ney LMs.
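
For anyone picking this up, here is a minimal sketch of how the SRILM commands could be assembled. It assumes an SRILM build with maxent support, where `ngram-count` and `ngram` accept a `-maxent` flag. The file names (`train.txt`, `dev.txt`, `model.maxent`) and the helper function names are hypothetical, and the exact options should be checked against the man pages of the installed version.

```python
import shlex


def maxent_train_cmd(train_text, model_out, order=3):
    """Build an ngram-count command that estimates a maxent N-gram model.

    Assumes an SRILM build with maxent support; paths are placeholders.
    """
    return ["ngram-count", "-order", str(order),
            "-text", train_text, "-maxent", "-lm", model_out]


def maxent_ppl_cmd(model, dev_text, order=3):
    """Build an ngram command that reports perplexity of a maxent model."""
    return ["ngram", "-order", str(order),
            "-maxent", "-lm", model, "-ppl", dev_text]


if __name__ == "__main__":
    # Print the commands instead of running them, so this works without SRILM.
    print(shlex.join(maxent_train_cmd("train.txt", "model.maxent", order=3)))
    print(shlex.join(maxent_ppl_cmd("model.maxent", "dev.txt", order=3)))
```

The lists can be passed to `subprocess.run(cmd, check=True)` once SRILM is on the `PATH`.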

@vince62s
Contributor

FYI, on a news 1.5GB corpus, I get:
Order 3 Order 4
srilm size ppl size ppl
Unpruned 767,2 92,18 2071,7 66,86
Maxent 702,1 97,09 1952,8 70,33

not that good then

@danpovey
Owner Author

I don't really understand what you are saying here. Can you please format it
more clearly and use the English standard for decimals, i.e. dot, not comma?

I found the reason for the crash with 4-gram pruning that you reported before:
states with no counts were being discarded when we need to keep the
discount amount. The fix is not a one-liner; I'll work on it today. It
would affect even the un-pruned perplexities.

Dan

On Thu, Jun 30, 2016 at 12:21 PM, vince62s wrote:

> FYI, on a news 1.5GB corpus, I get:
> Order 3 Order 4
> srilm size ppl size ppl
> Unpruned 767,2 92,18 2071,7 66,86
> Maxent 702,1 97,09 1952,8 70,33
>
> not that good then



@vince62s
Contributor

Yeah, sorry, that was a copy-paste from Excel.
Order 3
srilm standard size=767.2 MB - ppl=92.18
srilm maxent size=702.1 MB - ppl=97.09
Order 4
srilm standard size=2071.7 MB - ppl=66.86
srilm maxent size=1952.8 MB - ppl=70.33

The corpus is "French news shuffle 2014", a text file of about 1.5 GB.
I took out 10k sentences for a dev set.
Just for info, the order 4 maxent run took 2.5 hours and used up to 70 GB of RAM.
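
To make the comparison easier to read, the relative differences can be worked out from the four figures above. The snippet below is only arithmetic on the reported numbers. The dictionary layout and function names are illustrative, not part of any tool used in this thread.

```python
# Reported (size in MB, perplexity) per model, keyed by N-gram order.
results = {
    3: {"standard": (767.2, 92.18), "maxent": (702.1, 97.09)},
    4: {"standard": (2071.7, 66.86), "maxent": (1952.8, 70.33)},
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


def compare(results):
    """Return {order: (size change %, ppl change %)} for maxent vs. standard."""
    rows = {}
    for order, models in results.items():
        std_size, std_ppl = models["standard"]
        me_size, me_ppl = models["maxent"]
        rows[order] = (relative_change(me_size, std_size),
                       relative_change(me_ppl, std_ppl))
    return rows


if __name__ == "__main__":
    for order, (d_size, d_ppl) in compare(results).items():
        print(f"order {order}: size {d_size:+.1f}%, ppl {d_ppl:+.1f}%")
```

On these figures the maxent models are somewhat smaller (about 8.5% at order 3 and 5.7% at order 4) but have roughly 5% higher perplexity at both orders.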

@vince62s
Contributor

What I am trying to say here is that these results are somewhat surprising, because when I ran it on the cantab-tedlium text corpus (entropy filtered), maxent gave better results.
But then I read Tanel's paper on maxent, and the improvements there were not so obvious either.
