Get rid of WPM's BerkeleyDB dependency #23

Open
graus opened this issue Nov 18, 2013 · 5 comments

Comments

@graus
Contributor

graus commented Nov 18, 2013

Currently we get the following from WPM's BerkeleyDB:

  • Definitions
  • Relatedness for {in/out}link pages

We want to replace it with full articles (from Wikipedia XML?) and our own implementation of the relatedness calculation, which is not very complex.
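To give an idea of how small such a relatedness implementation could be, here is a minimal sketch. It assumes the inlink-based measure of Milne and Witten, which is the one usually associated with Wikipedia Miner; the issue itself does not say which measure is meant. The function name, its parameters, and the idea of passing inlinks as sets of page ids are hypothetical choices for illustration, not existing code in this project.

```python
import math


def relatedness(inlinks_a, inlinks_b, total_articles):
    """Link-based relatedness of two pages, in the range [0, 1].

    Assumption: Milne-Witten style measure over inlink sets.
    inlinks_a, inlinks_b: iterables of ids of pages linking to each page.
    total_articles: number of articles in the whole wiki.
    """
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        # No shared inlinks: treat the pages as unrelated.
        return 0.0

    larger = max(len(a), len(b))
    smaller = min(len(a), len(b))
    if total_articles < larger:
        raise ValueError("total_articles is smaller than an inlink set")

    denominator = math.log(total_articles) - math.log(smaller)
    if denominator == 0:
        # Both pages are linked from every article in the wiki.
        return 1.0

    distance = (math.log(larger) - math.log(len(common))) / denominator
    # Clamp, because the distance can exceed 1 for very small overlaps.
    return max(0.0, 1.0 - distance)
```

With inlink sets extracted from the Wikipedia dump, a call would look like `relatedness(inlinks[page_a], inlinks[page_b], n_articles)`, where `inlinks` and `n_articles` are placeholders for whatever our own storage ends up providing.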

@larsmans
Contributor

Is BerkeleyDB the problem or the current format? BDB is quite fast.

@dodijk
Contributor

dodijk commented Nov 18, 2013

It's the BDB version that WPM uses: #14 (comment) and Oracle BerkeleyDB FAQ.

@dodijk
Contributor

dodijk commented Nov 18, 2013

I've gotten a bit further with this: I'm able to convert the DBs from BerkeleyDB Java Edition to the regular one. The next hurdle is that the values seem to use a more complicated format. Strings are in UTF-8, but the rest? If anyone wants to have a go: /scratch/dodijk/BerkeleyDB on zookst13.

In [1]: import bsddb, codecs
In [2]: db = bsddb.btopen("nlwiki-20111104-label.db")
In [3]: db.get(codecs.encode(u'Gro\xdf Vahlberg\x00', 'utf-8'))
Out[3]: '\x01\x01\x02\x02\x01\x8d\n\xe9\xa4\x01\x01\x00\x00'
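For anyone picking this up, a small sketch of two helpers that might make poking at the records easier. It only restates what the session above shows: keys are UTF-8 strings with a trailing NUL byte, and values are opaque bytes. The names `make_key` and `hex_dump` are made up for illustration, and nothing here decodes the value format, which is still unknown.

```python
import binascii


def make_key(label):
    """Encode a label the way the lookup above does: UTF-8 plus a NUL byte."""
    return (label + u'\x00').encode('utf-8')


def hex_dump(value):
    """Render raw value bytes as space-separated hex pairs for inspection."""
    hexed = binascii.hexlify(value).decode('ascii')
    return ' '.join(hexed[i:i + 2] for i in range(0, len(hexed), 2))


# The value returned for u'Gro\xdf Vahlberg' in the session above.
raw = b'\x01\x01\x02\x02\x01\x8d\n\xe9\xa4\x01\x01\x00\x00'
print(len(raw), hex_dump(raw))
```

Dumping a handful of values side by side in this form should make it easier to spot which bytes vary per label and which look like fixed headers or counts.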

@ghost ghost assigned graus Nov 19, 2013
@dodijk
Contributor

dodijk commented Nov 20, 2013

How is this task different from #14, @graus?

@graus
Contributor Author

graus commented Nov 21, 2013

The idea was to split the work into removing the BDB dependency (definitions, relatedness) versus removing WPM's CSV dependency (everything else), but I think I failed to state it as such ;-).

3 participants