wikiparse

Imports Wikipedia XML data dumps into Elasticsearch.

Usage

  • Download the pages-articles XML dump; the link is on this page. You want pages-articles.xml.bz2. DO NOT UNCOMPRESS THE BZ2 FILE.
  • From the releases page, download the wikiparse JAR.
  • Run the jar on the BZ2 file: java -jar -Xmx1g wikiparse-0.1.0.jar --es http://localhost:9200 /var/lib/elasticsearch/enwiki-latest-pages-articles.xml.bz2
  • By default, the data is indexed into an index named en-wikipedia. This can be changed with the --index parameter.
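
If you want to script the run step, the sketch below assembles the same command line from Python. It is only an illustration: the helper name `build_wikiparse_command` and the index name `my-wikipedia` are hypothetical, and placing `--index` before the dump path is an assumption, since the usage above does not show where the flag goes.

```python
from typing import List, Optional


def build_wikiparse_command(
    dump_path: str,
    es_url: str = "http://localhost:9200",
    index: Optional[str] = None,
    jar_path: str = "wikiparse-0.1.0.jar",
    max_heap: str = "1g",
) -> List[str]:
    """Build the argument list for the wikiparse invocation shown above."""
    # The usage notes say to pass the compressed dump, so reject anything else early.
    if not dump_path.endswith(".bz2"):
        raise ValueError("expected the compressed .bz2 dump, got: " + dump_path)

    command = ["java", "-jar", f"-Xmx{max_heap}", jar_path, "--es", es_url]
    if index is not None:
        # Assumption: --index is accepted before the positional dump path.
        command += ["--index", index]
    command.append(dump_path)
    return command


if __name__ == "__main__":
    cmd = build_wikiparse_command(
        "/var/lib/elasticsearch/enwiki-latest-pages-articles.xml.bz2",
        index="my-wikipedia",  # hypothetical index name
    )
    print(" ".join(cmd))
    # To actually start the import, run the command, for example:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Leaving `index` unset omits the flag, so the import goes to the default en-wikipedia index.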

License

Wikisample.bz2 Copyright: http://en.wikipedia.org/wiki/Wikipedia:Copyrights

All code and other files are Copyright © 2013 Andrew Cholakian and are distributed under the Eclipse Public License, the same as Clojure.