Question Regarding the Crawler Logic #475
Hi,
I have set up the crawler to crawl a site, and I get a different page count on each run. Is there an explanation of how the logic works? How does it work with the internal DB? When we run twice over the same domain, it should return the same number of pages, shouldn't it? Or only the modified pages?
Thanks,
Vadim

The internal DB is an implementation detail used by crawler4j; you shouldn't need to worry about it or even notice it.
Each run should fetch the same number of pages if you configure the crawler correctly.
Just limit it to the domain you want to crawl and don't restrict any other factor (like depth); let it run.
If you got a different number of pages in different runs, the domain's server may not have handled all of your requests for pages from the same domain well, and blocked or simply dropped some of them.
Try limiting the crawler to a single thread if you are not in a hurry and recheck, or set a politeness delay between fetches from the same domain.
Please try it and report back.
Avi.
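
A minimal sketch of that kind of setup, based on the standard crawler4j bootstrap code. The storage folder, seed URL, and the MyCrawler class are placeholders for illustration, not part of the original discussion:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class SingleThreadCrawl {

    // Placeholder crawler: visits only pages on the target domain, skips everything else.
    public static class MyCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            return url.getURL().toLowerCase().startsWith("https://www.example.com/");
        }

        @Override
        public void visit(Page page) {
            logger.info("Fetched: {}", page.getWebURL().getURL());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-storage"); // internal frontier DB lives here
        config.setPolitenessDelay(1000);                        // wait 1000 ms between requests to the same host
        config.setMaxDepthOfCrawling(-1);                       // -1 = no depth limit (the default)

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://www.example.com/");
        controller.start(MyCrawler.class, 1);                   // a single crawler thread
    }
}
```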
Hi,
I have another question: does the crawler support sending cookies in the request?
I have a site whose authentication mechanism requires certain cookies in the request.
I'd appreciate a quick answer.
Thanks,
Vadim

Hi,
Crawler4j has an internal mechanism for crawling sites that require authentication.
Look here:
https://github.com/yasserg/crawler4j/blob/68f5c1e4fb86542e74d31c0bcb4b1ae14ba2ea71/crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlConfig.java#L193
Create an AuthInfo object and the crawler should handle logging in. I don't remember the exact configuration steps at the moment (it has been several years since I implemented it), but it should be pretty straightforward.
Avi
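
A rough sketch of wiring an AuthInfo into the config, assuming a form-based login. The login URL, credentials, and form-field names are placeholders, and the FormAuthInfo constructor shown here should be checked against the crawler4j version in use, since the API has changed between releases:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

public class AuthCrawlConfig {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-storage");

        // Form-based login: crawler4j submits the credentials to the login page
        // and reuses the resulting session cookies for subsequent requests.
        AuthInfo formAuth = new FormAuthInfo(
                "myUsername",                        // value for the username field
                "myPassword",                        // value for the password field
                "https://www.example.com/login",     // URL of the login form
                "username",                          // name attribute of the username input
                "password");                         // name attribute of the password input
        config.addAuthInfo(formAuth);

        // ...then build PageFetcher / RobotstxtServer / CrawlController with this config as usual.
    }
}
```

If the site uses HTTP Basic auth instead of a login form, BasicAuthInfo in the same package should work the same way.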