preRequest function skips far more links than expected when filtering URLs by regexp
If the current behavior is a bug, please provide the steps to reproduce
const isVisitedMap = {
  '\\.com\\/users\\/[\\d]+[\\w]+': false, // CONSUMER_PROFILE_REGEXP
};
function isVisited(url){
for(const [key, value] of Object.entries(isVisitedMap)){
if(new RegExp(key).test(url)){
if(value){
return value;
}else{
isVisitedMap[key] = true;
return false;
}
}
}
return false;
}
const HCCrawler = require('headless-chrome-crawler');

(async () => {
const crawler = await HCCrawler.launch({
// Called before each request; returning false skips that URL
preRequest: (async (opt) => !isVisited(opt.url)),
// Function to be evaluated in browsers
evaluatePage: (() => ({
pagePath: window.location.href,
})),
// Function to be called with evaluated results from browsers
onSuccess: (result => {
console.log('pagePath:', result.result.pagePath);
}),
// exporter, // not defined in this snippet; pass your own exporter instance to enable it
});
// Queue a request
await crawler.queue({
url: 'https://www.example.com/',
headless: false,
maxDepth: 4,
userAgent: 'DuckDuckBot',
allowedDomains: [/example\.com$/],
skipDuplicates: true,
});
await crawler.onIdle(); // Resolved when no queue is left
await crawler.close(); // Close the crawler
})();
What is the expected behavior?
Only a single page that matches the regexp should be skipped.
What is the motivation / use case for changing the behavior?
If you crawl web sites that give public access to user profile pages (or pages for any other kind of entity), you will probably want to skip all of the user profile links except one, because they are all similar.
I thought that I could skip a single page by returning false from the preRequest function.
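To make the intended filtering easier to check in isolation, here is a minimal sketch of the same "let the first match through, skip the rest" idea, run against a handful of made-up URLs with no crawler involved. The names `PROFILE_REGEXP`, `seenPatterns`, `shouldRequest` and the sample URLs are hypothetical and not part of headless-chrome-crawler; the pattern is the one from the snippet above, written as a regexp literal.

```javascript
// Hypothetical stand-alone version of the filter; nothing here calls the crawler.
const PROFILE_REGEXP = /\.com\/users\/\d+\w+/; // same pattern as CONSUMER_PROFILE_REGEXP
const seenPatterns = new Set();

// Returns true if the URL should be requested, false if it should be skipped.
function shouldRequest(url) {
  if (!PROFILE_REGEXP.test(url)) {
    return true; // not a profile page: always request it
  }
  if (seenPatterns.has(PROFILE_REGEXP.source)) {
    return false; // one profile page was already let through: skip this one
  }
  seenPatterns.add(PROFILE_REGEXP.source);
  return true; // first profile page seen: request it
}

// Made-up URLs, used only to exercise the filter.
const sampleUrls = [
  'https://www.example.com/',
  'https://www.example.com/users/101alice',
  'https://www.example.com/users/202bob',
  'https://www.example.com/about',
];

const decisions = sampleUrls.map((url) => [url, shouldRequest(url)]);
for (const [url, crawl] of decisions) {
  console.log(crawl ? 'crawl' : 'skip ', url);
}
```

With these inputs only the second profile URL is skipped. In a real crawl, a page that is skipped is never loaded, so links that appear only on skipped profile pages would presumably never be queued, which may account for some of the missing links.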
Please tell us about your environment:
Version: "headless-chrome-crawler": "^1.8.0",
Platform / OS version: Windows 10
Node.js version: v10.16.3, npm: 6.9.0