Speed up bulk processing with Tika #18

Open
jeremybmerrill opened this issue Aug 15, 2014 · 7 comments

@jeremybmerrill
Contributor

Yomu is great. I'm currently using it to process thousands of documents. Unfortunately, this is very slow because Yomu currently starts the JVM for each document. That takes about 2 seconds per document, which slows me down significantly.

Tika has thought of this and included "server" mode, where Tika starts as a server and processes whatever documents are thrown at it over a socket. Starting Java in server mode takes a little longer, but only has to happen once.

I've modified Yomu to support server mode. The API is the same, but if you want server mode, put this

Yomu.server(:text)

before your code and

Yomu.kill_server!

after it.
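If an extraction raises partway through a batch, the `Yomu.kill_server!` line may never run. A minimal sketch of one way to guard against that, assuming only the `server` and `kill_server!` calls shown above; the helper name `with_tika_server` and its `backend` parameter are hypothetical and not part of Yomu:

```ruby
# Hypothetical helper (not part of Yomu): start the Tika server for one
# extraction type, run the block, and always shut the server down afterwards.
# `backend` is any object that responds to `server(type)` and `kill_server!`,
# as Yomu does on the server-mode branch.
def with_tika_server(backend, type)
  backend.server(type)
  begin
    yield
  ensure
    # Runs even if the block raises, so the server is not left running.
    backend.kill_server!
  end
end
```

With the gem loaded, this could be called as `with_tika_server(Yomu, :text) { filenames.map { |f| Yomu.new(f).text } }`, where `filenames` is your own list of paths.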

Even for just 6 documents, the speed-up is noticeable: about 12 seconds with the current version of Yomu versus about 4 with my server version.

To preserve the API as-is (tests pass on my branch with no changes), my method isn't terribly elegant (e.g. it uses class variables) and requires the target extraction type (text/html/metadata) to be selected when the server is started (this is a Tika constraint). A more elegant and Rubyish way would be to do all the server-based extraction in a block, but this would require changing the API.

If you'd be amenable to this as a patch, @yomu, I'll write tests, edit the docs and submit a PR. I'm happy, too, to submit as is or with the block-based method I mention above, based on what you think is best for the library. Until then, my version is at https://github.com/jeremybmerrill/yomu/tree/feature/servermode

@Erol
Member

Erol commented Dec 20, 2014

@jeremybmerrill My apologies for just responding now. Thanks for your work on this!

I like your idea of having a server mode for Yomu and I'm open to changing the API. If you're still up for it, may I know what changes you have in mind and the syntax for wrapping it in a block? I was thinking it would go something like this:

Yomu.start :text do |yomu|
  yomu.read 'path/to/file'
  yomu.read 'path/of/another/file'
end
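For discussion, here is a rough sketch of how a block form like this might be layered on top of the existing server-mode calls. Everything named here (`ServerSession`, `start`, `read`, the `backend` parameter) is hypothetical and does not exist in Yomu. It assumes only that the backend responds to `server(type)` and `kill_server!`, and that `backend.new(path)` returns an object with a method named after the extraction type, as `Yomu.new(filename).text` does:

```ruby
# Hypothetical sketch of a block-based API; none of these names exist in Yomu.
class ServerSession
  def initialize(backend, type)
    @backend = backend
    @type = type
  end

  # Extract from one file using the type chosen when the session started,
  # e.g. backend.new(path).text when type is :text.
  def read(path)
    @backend.new(path).public_send(@type)
  end

  # Start the server, yield a session object, and stop the server when the
  # block exits, whether it finishes normally or raises.
  def self.start(backend, type)
    backend.server(type)
    begin
      yield new(backend, type)
    ensure
      backend.kill_server!
    end
  end
end
```

A caller would then write `ServerSession.start(Yomu, :text) { |yomu| yomu.read 'path/to/file' }`. Because the extraction type is fixed once per session, this matches the Tika constraint that the type is chosen when the server starts.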

@jeremybmerrill
Contributor Author

Hi @Erol:

My code that does this was merged in via #23 (@rogeriochaves) -- he added tests to my implementation.

The syntax of this implementation (below) is not optimal and very un-Rubyish. Block-like syntax would be far better -- I think I didn't do it just because this implementation was easier and I was in a rush. There'd need to be some refactoring to allow a Yomu instance to process more than one file.

Yomu.server(:text)
Yomu.new(filename).text
Yomu.kill_server!

@xavriley

xavriley commented Jul 8, 2015

Just discovered this but only from this issue - made my processing about 100x faster! Worth putting a hint in the README perhaps?

@hatlord

hatlord commented Mar 2, 2016

Any way to make this work just for metadata? I can get Yomu to read in many files, but it is slow going to pull out the metadata.

Thanks,

@jeremybmerrill
Contributor Author

Yomu.server(:metadata)
Yomu.new(filename1).metadata
Yomu.new(filename2).metadata
Yomu.new(filename3).metadata
Yomu.kill_server!

This should work.

@hatlord

hatlord commented Mar 3, 2016

Thanks for the response, really appreciate it. I'll give that a go :)

@hatlord

hatlord commented Mar 3, 2016

Update: Yeah that did improve the speed noticeably, many many thanks :)

Now I just need to figure out why it's so slow running through "INFO Document is encrypted" lines and I'm set :)
