
[Feature] Improve the efficiency of the scraping process #37

Closed · longnd opened this issue Mar 9, 2023 · 6 comments · Fixed by #43
longnd commented Mar 9, 2023

Issue

The Scraping Service has a creative way of using proxies to overcome Google's mass-searching detection. However, it uses puppeteer, which requires Chromium running in headless mode:

```ts
async scrape(payload: Batch) {
  this.fileService.concurrentUploadCount++;
  const args = appEnv.IS_PROD ? ['--no-sandbox', '--disable-setuid-sandbox'] : undefined;
  const browser = await puppeteer.launch({ args });
  // ...
}
```

It requires more resources to run, as pointed out in the README:

> Currently a 2-CPU 4GB Ubuntu server with 22 proxies can handle up to 7 concurrent uploads before showing signs of scraping failures (Captcha-ed, Timeout, etc.).

I'm curious: why don't you use an HTTP library, e.g. axios, to send the search requests and parse the results (e.g. with a library like cheerio) instead? It would be far more efficient.
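For illustration, a minimal sketch of that approach (the selector and function name here are my assumptions, not code from this repo):

```ts
import axios from 'axios';
import * as cheerio from 'cheerio';

// Minimal sketch: fetch a Google results page over plain HTTP and parse it
// with cheerio. The 'a h3' selector is an assumption about Google's markup.
async function searchTitles(query: string): Promise<string[]> {
  const res = await axios.get('https://www.google.com/search', {
    params: { q: query, gl: 'us', hl: 'en' },
    headers: {
      'User-Agent':
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    },
  });
  const $ = cheerio.load(res.data);
  // Collect the result titles on the first page.
  return $('a h3').map((_, el) => $(el).text()).get();
}
```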

Also, as mentioned in #35, instead of using sleep() and making the code wait to overcome detection from Google, there should be a better way, e.g. using the proxies and rotating the user agent in each request.

Expected

The scraping process is handled in a more efficient way.

21jake (Owner) commented Mar 9, 2023

The reasons I picked puppeteer over ordinary HTTP libraries:

  • Puppeteer can solve captchas; others cannot (to my knowledge). In theory this means every search request will succeed, which significantly increases performance (since we don't need to worry about being detected, we can remove the DELAY_BETWEEN_CHUNK_MS and increase the CHUNK_SIZE). Moreover, we could remove the need for proxies and save costs. Initially, I thought being able to solve captchas alone completely outweighed its cons.
  • I didn't have an enjoyable experience in the past making proxied requests with NodeJS HTTP clients (this Axios bug especially). I should have revisited the issue to see that it's been closed.

I appreciate your suggestions. Since no captcha-solving service is currently in use, I will create a feature/http-scraper branch to give it a go and we'll see how things turn out.

@21jake
Copy link
Owner

21jake commented Mar 10, 2023

Testing results


I found that the stack of axios, cheerio, and random-useragent works fine for sending proxied requests. However, another major drawback of this approach is that plain HTTP requests couldn't obtain the search performance stats, e.g. "About 10,000 results (0.60 seconds)". In the screenshot below, the left side is the Axios cache and the right side is the Puppeteer cache.

[Screenshot: Axios cache on the left, Puppeteer cache on the right]

I can't find an explanation for this. I assume that, being a headless browser, Puppeteer is able to get more complete HTML content for the page.

This drawback actually matters, because the application requirements state:

> For each search result/keyword result page on Google, store the following information on the first results page:
> The total search results for this keyword, e.g., About 21,600,000 results (0.42 seconds)

Since this approach clearly doesn't deliver what's expected, if I had picked HTTP libraries from the beginning I would have had to switch to other alternatives anyway. I hope we're on the same page that getting it done right is better than getting it done quickly.

As always, I'm open to other alternatives to enhance the scraping process.

Update:


> there should be a better way, e.g. using the proxies and rotating the user agent in each request.

I tried this at the very start: combining proxies with random user agents while continually decreasing the sleep() delay. It didn't work out; rotating the user agent seemed to have no impact on performance at all. That makes sense, because Google's main detection criterion is the IP address of the request (they take lots of other steps too, and they're not going to be public about them). So it all boils down to keeping the proxies from being overused.
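For context, that kind of rotation looks roughly like this (a simplified sketch; the proxy values and the trimmed user-agent pool are placeholders, not the actual implementation):

```ts
import axios from 'axios';

// Placeholder proxy pool; real entries would come from configuration.
const PROXIES = [
  { host: '203.0.113.10', port: 8080 },
  { host: '203.0.113.11', port: 8080 },
];
let cursor = 0;

// Round-robin selection keeps each proxy's request rate low.
function nextProxy() {
  const proxy = PROXIES[cursor];
  cursor = (cursor + 1) % PROXIES.length;
  return proxy;
}

// Trimmed placeholder pool of user agents.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
];
const randomUserAgent = () => USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

async function proxiedSearch(query: string): Promise<string> {
  const res = await axios.get('https://www.google.com/search', {
    params: { q: query, gl: 'us', hl: 'en' },
    headers: { 'User-Agent': randomUserAgent() },
    proxy: nextProxy(),
  });
  return res.data;
}
```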


longnd (Author) commented Mar 10, 2023

Thank you for the effort spent on trying the other approach as suggested.

> I hope we're on the same page that getting it done right is better than getting it done quickly.

I agree that getting things done right is important.

> Since this approach clearly doesn't deliver what's expected, if I had picked HTTP libraries from the beginning I would have had to switch to other alternatives anyway

I don't know how you implemented it, so I can't guess whether anything went wrong. But that solution, using axios with a random user agent (even without the proxies), should be able to get the search results as expected. I have seen other candidates take a similar approach and get the results they wanted. Here is some simple code as an example:

```ts
import { HttpService } from '@nestjs/axios';

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12.6; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Linux i686; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edge/106.0.1370.47',
];

function randomUserAgent(): string {
  const randomIndex = Math.floor(Math.random() * USER_AGENTS.length);
  return USER_AGENTS[randomIndex];
}

// ...
const res = await this.httpService.axiosRef.get(
  `https://www.google.com/search?q=${query}&gl=us&hl=en`,
  {
    headers: {
      'User-Agent': randomUserAgent(),
    },
  },
);
const html = res.data;
// ...
```

21jake linked a pull request Mar 10, 2023 that will close this issue
21jake (Owner) commented Mar 10, 2023

Thank you for the sample. It turns out that the search performance stats only appear with certain types of User-Agent strings, which means the UAs have to be hand-picked rather than generated at random.
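For reference, extracting that stats line could look roughly like this (a sketch only; the #result-stats selector is an assumption about Google's current markup and can change at any time):

```ts
import * as cheerio from 'cheerio';

// Sketch: pull "About 21,600,000 results (0.42 seconds)" out of the page HTML.
// '#result-stats' is an assumed selector for Google's stats element.
function parseResultStats(html: string): string | null {
  const $ = cheerio.load(html);
  const stats = $('#result-stats').text().trim();
  return stats.length > 0 ? stats : null;
}
```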

In #43 I've replaced Puppeteer with Axios and Cheerio to see how things turn out. The results are significant:

  • The required info is scraped
  • ~65% reduction in scraping time (from 4.2 minutes to about 1.5 minutes for 100 keywords)
  • ~65% reduction in Docker image size (from 1.4 GB to 500 MB)
  • No more hacky workarounds (the sleep() is gone)
  • Better organized codebase

To conclude, my initial assumption about the pros of Puppeteer was wrong, i.e. being able to solve captchas alone does not outweigh the costs, considering how resource-demanding it is.

I truly appreciate your suggestions and support.

longnd (Author) commented Mar 11, 2023

Thank you for the effort spent on the improvement. As mentioned in #35 (comment), one issue I noticed in the PR is that the current approach processes keywords in a loop, which makes the process brittle and inefficient: a failure on one keyword stops the processing of the remaining keywords. A separate asynchronous process should be used for each keyword instead (see the sketch after the snippet below).

```ts
async scrapeKeywords(keywords: Keyword[]) {
  for (let index = 0; index < keywords.length; index++) {
    // ...
```
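Something along these lines would avoid the problem (a sketch; scrapeOne() is a hypothetical per-keyword helper, not a method in the PR):

```ts
async scrapeKeywords(keywords: Keyword[]) {
  // One independent async task per keyword: a rejection on one keyword
  // no longer blocks the others.
  const results = await Promise.allSettled(
    keywords.map((keyword) => this.scrapeOne(keyword)), // hypothetical helper
  );
  const failed = results.filter((r) => r.status === 'rejected').length;
  if (failed > 0) {
    console.warn(`${failed}/${keywords.length} keywords failed`);
  }
}
```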

21jake (Owner) commented Mar 13, 2023

The for loop helps reduce the risk of a traffic spike and of getting detected. We could always trigger all the search requests at the same time, but that would make the proxies more prone to detection. Please understand that in a stealth job, being fast does not equal being efficient.
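That said, a middle ground is possible: run a small chunk of keywords concurrently, then pause between chunks (a sketch only; CHUNK_SIZE and DELAY_BETWEEN_CHUNK_MS mirror the env names mentioned above, and scrapeOne() is a hypothetical helper):

```ts
async scrapeInChunks(keywords: Keyword[]) {
  for (let i = 0; i < keywords.length; i += CHUNK_SIZE) {
    const chunk = keywords.slice(i, i + CHUNK_SIZE);
    // Concurrency inside the chunk, sequential between chunks:
    // limits the instantaneous request rate seen by each proxy.
    await Promise.allSettled(chunk.map((k) => this.scrapeOne(k)));
    await new Promise((resolve) => setTimeout(resolve, DELAY_BETWEEN_CHUNK_MS));
  }
}
```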

21jake closed this as completed Mar 13, 2023