Question Regarding the Crawler Logic #475
Hi,
I have set up the crawler to crawl a site, and I get a different page count on each run. Is there an explanation of how the logic works? How does it work with the internal DB? When we run twice over the same domain, it should return the same number of pages, shouldn't it? Or only the modified pages?
Thanks,
Vadim

The internal DB is an implementation detail used by crawler4j; you shouldn't need to worry about it or even notice it.
Each run should fetch the same number of pages if you configure the crawler correctly.
Just limit it to the domain you want to crawl and don't restrict any other factor (like depth); let it run.
If you got a different number of pages in different runs, the domain's server may not have handled all of your requests for pages from the same domain well, and blocked or simply dropped some of them.
Try limiting the crawler to a single thread if you are not in a hurry and recheck, or set a politeness delay between fetches from the same domain.
Please try it and report back.
Avi.
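
A minimal sketch of that kind of setup, based on the standard crawler4j bootstrap code. The storage folder, seed URL, and the MyCrawler class are placeholders for illustration, not part of the original discussion:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class SingleThreadCrawl {

    // Placeholder crawler: visits only pages on the target domain, skips everything else.
    public static class MyCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            return url.getURL().toLowerCase().startsWith("https://www.example.com/");
        }

        @Override
        public void visit(Page page) {
            logger.info("Fetched: {}", page.getWebURL().getURL());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-storage"); // internal frontier DB lives here
        config.setPolitenessDelay(1000);                        // wait 1000 ms between requests to the same host
        config.setMaxDepthOfCrawling(-1);                       // -1 = no depth limit (the default)

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://www.example.com/");
        controller.start(MyCrawler.class, 1);                   // a single crawler thread
    }
}
```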
Hi,
I have another question: does the crawler support sending cookies in the request?
I have a site whose authentication mechanism requires certain cookies in the request.
I'd appreciate a quick answer.
Thanks,
Vadim

Hi,
Crawler4j has an internal mechanism for crawling sites that require authentication.
Look here:
https://github.com/yasserg/crawler4j/blob/68f5c1e4fb86542e74d31c0bcb4b1ae14ba2ea71/crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlConfig.java#L193
Create an AuthInfo object and the crawler should handle logging in. I don't remember the exact configuration steps at the moment (it has been several years since I implemented it), but it should be pretty straightforward.
Avi
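
A rough sketch of wiring an AuthInfo into the config, assuming a form-based login. The login URL, credentials, and form-field names are placeholders, and the FormAuthInfo constructor shown here should be checked against the crawler4j version in use, since the API has changed between releases:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

public class AuthCrawlConfig {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-storage");

        // Form-based login: crawler4j submits the credentials to the login page
        // and reuses the resulting session cookies for subsequent requests.
        AuthInfo formAuth = new FormAuthInfo(
                "myUsername",                        // value for the username field
                "myPassword",                        // value for the password field
                "https://www.example.com/login",     // URL of the login form
                "username",                          // name attribute of the username input
                "password");                         // name attribute of the password input
        config.addAuthInfo(formAuth);

        // ...then build PageFetcher / RobotstxtServer / CrawlController with this config as usual.
    }
}
```

If the site uses HTTP Basic auth instead of a login form, BasicAuthInfo in the same package should work the same way.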