Get rid of WPM's BerkeleyDB dependency #23
Comments
Is BerkeleyDB the problem or the current format? BDB is quite fast.
The problem is the BDB version that WPM uses: see #14 (comment) and the Oracle BerkeleyDB FAQ.
I've gotten a bit further with this. I'm able to convert the DBs from the Java Edition to the regular one. The next hurdle is that the values seem to use more complicated formats. Strings are in UTF-8, but the rest? If anyone wants to have a go: /scratch/dodijk/BerkeleyDB on zookst13.
In [1]: import bsddb, codecs
In [2]: db = bsddb.btopen("nlwiki-20111104-label.db")
In [3]: db.get(codecs.encode(u'Gro\xdf Vahlberg\x00', 'utf-8'))
Out[3]: '\x01\x01\x02\x02\x01\x8d\n\xe9\xa4\x01\x01\x00\x00'
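For anyone picking this up, a small sketch that may help when poking at the value format. It only restates what the session above shows: the key is the label encoded as UTF-8 with a trailing NUL byte, and the value is an opaque byte string. The helper names `make_key` and `hex_dump` are made up for illustration, and nothing here decodes the value; it just prints the bytes in a form that is easier to compare across entries.

```python
def make_key(title):
    """Encode a label the way the session above does: UTF-8 plus a trailing NUL."""
    return (title + u"\x00").encode("utf-8")


def hex_dump(value):
    """Render a byte string as space-separated hex pairs for eyeballing."""
    return " ".join("{:02x}".format(b) for b in bytearray(value))


# Key and value copied from the session above; the value layout is still unknown.
key = make_key(u"Gro\xdf Vahlberg")
value = b"\x01\x01\x02\x02\x01\x8d\n\xe9\xa4\x01\x01\x00\x00"

print(hex_dump(key))
print(hex_dump(value))
```

Dumping a handful of values side by side this way might make fixed-width fields or length prefixes stand out, but that is a guess about the format, not something confirmed here.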
The thought was to split it into removing the BDB dependency (as in definitions, relatedness) vs. WPM's CSV dependency (all the other stuff), but I think I failed to state it as such ;-).
Currently we get from WPM's BerkeleyDB:
We want to replace it with full articles (from Wikipedia XML?) and our own implementation of the relatedness calculation (not very complex).
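To give an idea of how small our own relatedness implementation could be, here is a minimal sketch. It assumes the relatedness in question is the link-based measure of Milne and Witten (a normalized-distance formula over the sets of articles linking to each of two articles); the issue itself does not say which measure is meant. The function name `link_relatedness` and its parameters are hypothetical, and the incoming-link sets would have to be extracted from the Wikipedia dump separately.

```python
import math


def link_relatedness(links_a, links_b, total_articles):
    """Link-based relatedness in [0, 1] (assumed Milne-Witten style measure).

    links_a, links_b: iterables of ids of articles linking to a and to b.
    total_articles:   total number of articles in the wiki.
    """
    a, b = set(links_a), set(links_b)
    common = a & b
    if not common:
        # No shared incoming links: treat the articles as unrelated.
        return 0.0

    larger = max(len(a), len(b))
    smaller = min(len(a), len(b))
    if total_articles <= smaller:
        raise ValueError("total_articles must exceed the size of either link set")

    # Normalized distance: 0 when both link sets coincide, growing as overlap shrinks.
    distance = (math.log(larger) - math.log(len(common))) / (
        math.log(total_articles) - math.log(smaller)
    )
    # Convert the distance to a similarity and clip negative values to 0.
    return max(0.0, 1.0 - distance)


# Toy example with made-up article ids.
print(link_relatedness({1, 2, 3, 4}, {3, 4, 5, 6}, total_articles=1000))
```

Identical link sets give 1.0 and disjoint sets give 0.0; everything else falls in between, so the only data this needs from the dump is the incoming links per article and the total article count.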