Performance Issues #442
I would like to help speed up crawler4j a bit, if possible. But since I am new to its codebase, it would be nice if someone could point me to relevant information, posts, or any kind of documentation or comments about possible bottlenecks, or where I should start looking.
P.S. My insight is that the bottleneck for scaling things up is probably the usage of BerkeleyDB.
Currently, BerkeleyDB is integrated directly into the code. The right way would be to create an interface for the DB, for which several implementations could be provided: one would be BerkeleyDB, and another internal in-memory DB could then replace it. I think someone had already tried to do just that; I am not sure what the conclusion was. (See the sketch below.)
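A minimal sketch of what such an abstraction could look like. All type and method names here are hypothetical, not existing crawler4j types; today BerkeleyDB is wired in directly:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Optional;
import java.util.Queue;
import java.util.Set;

// Hypothetical abstraction over the crawl frontier storage.
interface FrontierStore {
    void schedule(String url);     // enqueue a URL for crawling
    Optional<String> next();       // dequeue the next URL, if any
    boolean markSeen(String url);  // returns false if the URL was already seen
    void close();
}

// A trivial heap-backed implementation; a BerkeleyDB-backed one would
// implement the same interface and be selected via configuration.
class InMemoryFrontierStore implements FrontierStore {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    public synchronized void schedule(String url) {
        if (markSeen(url)) {
            queue.add(url);
        }
    }

    public synchronized Optional<String> next() {
        return Optional.ofNullable(queue.poll());
    }

    public synchronized boolean markSeen(String url) {
        return seen.add(url);
    }

    public void close() { }
}
```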
Yep, that would be nice. But I guess we first have to find the bottlenecks in the current implementation. It might be the DB, it might be the way crawler4j uses it, or it might be something else.
If you want to fetch billions of Web pages, you might look into other (distributed) Web crawler frameworks written in Java, e.g. Nutch or StormCrawler. Did you enable resumable crawling? On which OS is your crawler4j running? 100% CPU usage sounds like IO wait. Did you check this? Before I switched to StormCrawler (for scalability reasons), I used crawler4j quite heavily for focused crawling and never experienced such issues.
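For reference, resumable crawling is switched on through crawler4j's `CrawlConfig`; a minimal sketch, with the storage folder path just an example:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class ResumableConfig {
    public static CrawlConfig build() {
        CrawlConfig config = new CrawlConfig();
        // BerkeleyDB keeps the frontier state under this folder (example path).
        config.setCrawlStorageFolder("/tmp/crawler4j-data");
        // Keep crawl state on disk so an interrupted crawl can be resumed;
        // this typically also changes how BerkeleyDB flushes, i.e. its IO behavior.
        config.setResumableCrawling(true);
        return config;
    }
}
```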
Thanks for the reply, rzo1. I will try to do some profiling to see what is going on.
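One possible way to do that profiling, assuming JDK 11+ where Java Flight Recorder ships with the JDK; `runCrawl()` is a placeholder for the actual crawler setup:

```java
import java.nio.file.Path;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public class ProfiledCrawl {
    public static void main(String[] args) throws Exception {
        // Record with the JDK's built-in "default" event settings.
        Configuration jfrConfig = Configuration.getConfiguration("default");
        try (Recording recording = new Recording(jfrConfig)) {
            recording.start();
            runCrawl(); // placeholder: set up and start the crawler4j controller here
            recording.stop();
            // Open the dump in JDK Mission Control to see where time is spent
            // (e.g. BerkeleyDB calls vs. network IO).
            recording.dump(Path.of("crawl-profile.jfr"));
        }
    }

    private static void runCrawl() {
        // stand-in for the actual crawl
    }
}
```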
Dear all,
I am currently experimenting with crawler4j to download pages from the Web (I would like to download billions of pages, if possible). But at least in some of my early experiments this does not seem feasible. For example, if I start with 300 seeds, then after one day of crawling, i.e. about 200,000 downloaded pages, things slow down a lot: CPU usage is at 100% on all of my cores, and a page is downloaded every minute or so.
So basically, is this expected behavior, or is something wrong in my setup? Are there any guidelines on how I can improve things? What are the current bottlenecks in crawler4j that inhibit scaling up?
Best regards
Panagiotis
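For context, a typical crawler4j setup looks roughly like the sketch below; `MyCrawler` stands in for whatever `WebCrawler` subclass is used, and the storage path, seed, and thread count are placeholders:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlMain {
    // Placeholder crawler; a real one would override shouldVisit(...) and visit(...).
    public static class MyCrawler extends WebCrawler { }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/data/crawl");  // example path
        config.setPolitenessDelay(200);               // ms between requests to the same host
        config.setMaxPagesToFetch(-1);                // -1 = no limit

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://example.org/");   // example seed

        // The thread count is the main throughput knob; past some point more
        // threads just contend on the shared BerkeleyDB-backed frontier.
        int numberOfCrawlers = 16;
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
```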