[Feature] Improve the efficiency of the scraping process #37
The reasons I picked Puppeteer over plain HTTP libraries:
I appreciate your suggestions. Since no captcha-solving service is currently in use, I will create a feature/http-scraper branch to give it a go and we'll see how things turn out.
Testing results: I found that the stack of axios, cheerio, and random-useragent works fine for sending proxified requests. However, another major drawback of this approach is that plain HTTP requests could not obtain the search performance results. I can't find an explanation for this; I assume that, being a headless browser, Puppeteer is able to retrieve more complete HTML content for the page. This drawback actually matters because the application requirements state that:
Since this approach clearly doesn't deliver what's expected, if I had picked HTTP libraries from the beginning I would have had to switch to other alternatives anyway. I hope we're on the same page that getting it done right is better than getting it done quickly. As always, I'm open to other alternatives to enhance the scraping process. Update:
I tried this at the very start: combining proxies with random user agents while continually decreasing the
Thank you for the effort spent on trying another approach as suggested. I agree that getting things done right is important.
Since I don't know how you implemented it, I can't guess what might have gone wrong. But that solution, using axios with a random user agent (even without the proxy), should be able to get the search results as expected. I have seen other candidates take a similar approach and get the results they want. Here is a simple code example:

```typescript
import { HttpService } from '@nestjs/axios';

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12.6; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Linux i686; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edge/106.0.1370.47',
];

function randomUserAgent(): string {
  const randomIndex = Math.floor(Math.random() * USER_AGENTS.length);
  return USER_AGENTS[randomIndex];
}

// ...

const res = await this.httpService.axiosRef.get(
  `https://www.google.com/search?q=${query}&gl=us&hl=en`,
  {
    headers: {
      'User-Agent': randomUserAgent(),
    },
  },
);
const html = res.data;
// ...
```
Thank you for your sample. It turns out that the search performance results only appear with certain types of User-Agent, which means the UAs have to be hand-picked rather than generated at random. In #43 I've tried replacing Puppeteer with Axios and Cheerio to see how things turn out. The improvement was quite significant:
To conclude, my initial assumption about the pros of Puppeteer was wrong, i.e., its sole advantage of being able to solve captchas does not outweigh the costs, considering how resource-demanding it is. I truly appreciate your suggestions and support.
Thank you for spending effort on the improvement. As mentioned in #35 (comment), one issue I noticed in the PR is that the current approach processes keywords in a loop, making the process brittle and inefficient. A failure on one keyword will stop the processing of the other keywords, so a separate asynchronous process should be used for each keyword.
nimble-scraper/backend/src/services/scraper.service.ts
Lines 33 to 34 in 0687d47
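The per-keyword isolation being asked for could be sketched with `Promise.allSettled`, so that one keyword's failure is recorded without aborting the others. The `scrapeKeyword` function here is a hypothetical stand-in for the real scraper call:

```typescript
// Hypothetical stand-in for the real per-keyword scraping call.
async function scrapeKeyword(keyword: string): Promise<string> {
  if (keyword === 'bad') throw new Error(`failed: ${keyword}`);
  return `results for ${keyword}`;
}

async function scrapeAll(keywords: string[]) {
  // allSettled never rejects: each keyword gets its own outcome,
  // so one failure cannot stop the processing of the rest.
  const outcomes = await Promise.allSettled(keywords.map(scrapeKeyword));
  return outcomes.map((o, i) =>
    o.status === 'fulfilled'
      ? { keyword: keywords[i], ok: true, data: o.value }
      : { keyword: keywords[i], ok: false, error: String(o.reason) },
  );
}
```

With `Promise.allSettled`, a rejected promise for one keyword shows up as `{ ok: false }` in the results while the other keywords still complete.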
A for loop helps reduce the risk of traffic spikes and of getting detected. We could always trigger all the search requests at the same time, but that would make the proxies more prone to detection. Please understand that in a stealth job, being fast does not equal being efficient.
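The sequential pacing argued for above could still avoid a fixed, detectable `sleep` interval by randomizing the delay between requests. This is a minimal sketch; the 2–5 second bounds are arbitrary assumptions:

```typescript
// Random delay within [minMs, maxMs], so request timing does not
// form a fixed, detectable pattern.
function jitter(minMs: number, maxMs: number): number {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function scrapeSequentially(
  keywords: string[],
  scrape: (kw: string) => Promise<void>,
) {
  for (const keyword of keywords) {
    await scrape(keyword);
    // Pace requests 2-5 s apart instead of firing them all at once.
    await sleep(jitter(2000, 5000));
  }
}
```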
Issue
The scraping service has a creative way of using proxies to overcome Google's mass-search detection. However, it is using Puppeteer, which requires Chromium running in headless mode:
nimble-scraper/backend/src/services/scraper.service.ts
Lines 29 to 34 in f0673eb
It requires more resources to work, as pointed out in the README.
I'm curious why you don't use an HTTP library, e.g. axios, to send the search requests and parse the result (e.g. using a library like cheerio) instead? It would be way more efficient. Also, as mentioned in #35, instead of using sleep to make the code wait in order to overcome the detection from Google, there should be a better way, e.g. by using the proxies and rotating the user agent in the request.
Expected
The scraping process is handled in a more efficient way.