Some FAQs about performance running on thousands of files

If you are seeing the tika server die in the background after processing around 2,500 files, the cause is often the operating system's limit on the maximum number of open files (on Mac and some other systems). You can increase that limit on Mac using this guide. Even after raising it, the problem may persist; in that case you need to insert delays between subsequent calls and restart the tika server periodically.
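If you want to check or raise the open-file limit from inside the Python process itself, the standard-library `resource` module can do both. A minimal sketch, assuming a Unix-like system (the target of 4096 is an arbitrary example; the hard limit still caps what you can request):

```python
import resource

# Query the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limits: soft={soft}, hard={hard}")

# Raise the soft limit toward the hard limit; the OS rejects anything above it.
target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```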

Here are some tips to get around this.

Tips

  1. Divide the dataset into smaller chunks and process it chunk by chunk. This is your best bet; see the sketch after this list.

  2. What if I checked the file limit on my laptop and it shows unlimited for the hard value, but the server still dies at ~2,500 files? Do you have any suggestions? Divide your JSONs into much smaller chunk directories; try 50-100 files per directory to start. Work your way up from there to a full run of genLevelCluster.py. You may also want to introduce delays between tika calls in the similarity scripts (edit, cosine, and jaccard); see the sketch after this list.

  3. Is there an easy way to divide 95,000 files? See here for an example, or here: Split a folder into multiple subfolders in terminal/bash script. A Python version is sketched below.
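For tips 1 and 2, a minimal sketch of what chunked, throttled processing could look like with tika-python. The chunk size, the delay, and the way results are handed off are placeholder assumptions, not part of the similarity scripts named above:

```python
import time
from tika import parser

CHUNK_SIZE = 100   # start small (50-100) and work upward
DELAY_SECS = 0.5   # breathing room between calls to the tika server

def chunked(items, size):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def process_all(paths):
    for chunk in chunked(paths, CHUNK_SIZE):
        for path in chunk:
            parsed = parser.from_file(path)   # one parse per call
            # ... hand parsed["content"] / parsed["metadata"] to your
            # similarity code (edit, cosine, jaccard) here ...
            time.sleep(DELAY_SECS)            # throttle so the server keeps up
```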
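For tip 3, if you would rather do the splitting in Python than in a bash script, a sketch along the same lines (the directory layout, the 1,000-files-per-subfolder figure, and the path are all assumptions):

```python
import os
import shutil

def split_folder(src, per_dir=1000):
    """Move files from src into numbered subfolders of at most per_dir files."""
    files = sorted(
        f for f in os.listdir(src)
        if os.path.isfile(os.path.join(src, f))
    )
    for i, name in enumerate(files):
        dest = os.path.join(src, f"chunk_{i // per_dir:04d}")
        os.makedirs(dest, exist_ok=True)
        shutil.move(os.path.join(src, name), dest)

split_folder("/path/to/95000-files")  # hypothetical path
```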
