Not working with some sites #4

Open
Gujal00 opened this issue Jun 8, 2024 · 4 comments
@Gujal00

Gujal00 commented Jun 8, 2024

Thanks a lot for developing and sharing this solution; it works nicely for a lot of sites.
However, on some sites it doesn't solve the challenge and gives up with {"code":500,"message":"Request Timeout"}

This is one of those sites. Can you please check and advise?
https://apnetv.to/Hindi-Serials

@zfcsoftware
Owner

This seems to be caused by a devtools detector. The browser has been tested against devtools detectors and is normally not caught, so this site must be taking additional precautions. I will investigate the issue in more detail, but in the meantime you can scrape the site without any problem using the following method.

Step 1:

waitUntil: ['load', 'networkidle0']

Replace this line with the following.

waitUntil: 'domcontentloaded'
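
For context, here is a minimal sketch of where this option ends up, assuming the scraper navigates with Puppeteer's `page.goto`. The helper name `navigationOptions` and the 60-second timeout are hypothetical and not taken from the project. With `networkidle0`, Puppeteer waits until there have been no network connections for at least 500 ms, which a page that keeps making requests may never reach; `domcontentloaded` resolves as soon as the DOM has been parsed.

```javascript
// Hypothetical helper (not part of the project): builds the options
// object that would be passed as the second argument to page.goto().
function navigationOptions(timeoutMs = 60000) {
  return {
    // Resolve once the HTML is parsed instead of waiting for the
    // network to go idle.
    waitUntil: 'domcontentloaded',
    // Assumed value; use whatever timeout the scraper already has.
    timeout: timeoutMs,
  };
}

// Intended usage inside the scraper's async code:
//   await page.goto(url, navigationOptions());
```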

Step 2:
Intercept the request. Add the following code in the field below.

    // Load the interception helper (npm i puppeteer-intercept-and-modify-requests)
    const { RequestInterceptionManager } = await import('puppeteer-intercept-and-modify-requests')
    // Open a CDP session on the page so that responses can be rewritten
    const client = await page.target().createCDPSession()
    const interceptManager = new RequestInterceptionManager(client)
    await interceptManager.intercept({
        urlPattern: `https://apnetv.to/Hindi-Serials`,
        resourceType: "Document",
        modifyResponse({ body }) {
            // Remove the redirect to indexnow.html from the HTML document
            return {
                body: body.replace(`window.location.href = 'https://apnetv.to/indexnow.html';`, ''),
            };
        },
    });
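
The response rewrite can be tried on its own, without a browser. The sketch below is only an illustration: `stripRedirect` and `REDIRECT_SNIPPET` are hypothetical names, and the sample HTML (including the `devtoolsOpen` variable) is made up for the demo, not copied from the site. Unlike a plain string `replace`, which removes only the first match, it removes every occurrence and reports whether anything changed, so a pattern that has stopped matching is easy to spot.

```javascript
// The exact statement removed in the snippet above.
const REDIRECT_SNIPPET =
  "window.location.href = 'https://apnetv.to/indexnow.html';";

// Hypothetical helper: removes every occurrence of `snippet` from `body`
// and reports whether the body was actually modified.
function stripRedirect(body, snippet = REDIRECT_SNIPPET) {
  const patched = body.split(snippet).join('');
  return { body: patched, changed: patched !== body };
}

// Made-up sample document, for illustration only.
const sample =
  '<script>if (devtoolsOpen) { ' + REDIRECT_SNIPPET + ' }</script>';

const result = stripRedirect(sample);
console.log(result.changed, result.body.includes('indexnow.html'));
```

Inside `modifyResponse`, the same helper could be used as `return { body: stripRedirect(body).body }`.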

@Gujal00
Author

Gujal00 commented Jun 9, 2024

Thanks a lot for taking the time to look at this issue.
I made the changes and ran it on my Ubuntu server VM. It looks like some dependencies are missing, even though I ran npm install.

gujal@tux:~/cf-clearance-scraper$ npm run start

> [email protected] start
> node index.js

Server running on port 3000
Failed to launch the browser process! undefined
[1590:1590:0609/185637.942529:ERROR:ozone_platform_x11.cc(243)] Missing X server or $DISPLAY
[1590:1590:0609/185637.942605:ERROR:env.cc(258)] The platform failed to initialize.  Exiting.


TROUBLESHOOTING: https://pptr.dev/troubleshooting

Anyway, I will wait for you to check and release the Docker image, and then test with the image, as it will be fully self-contained. Thanks.
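
The "Missing X server or $DISPLAY" line in the log above means Chromium was started in headful mode on a machine with no display. A minimal pre-flight check, assuming a Linux host, could look like the sketch below. `canLaunchHeadful` is a hypothetical helper, and whether the project is meant to run under a virtual display such as Xvfb is an assumption, not something confirmed in this thread.

```javascript
// Hypothetical helper: returns true when a headful browser has a display
// to attach to. Only Linux is checked, since the X11/Wayland variables
// are not relevant on other platforms.
function canLaunchHeadful(env = process.env, platform = process.platform) {
  if (platform !== 'linux') return true;
  return Boolean(env.DISPLAY || env.WAYLAND_DISPLAY);
}

if (!canLaunchHeadful()) {
  // xvfb-run is a common wrapper that starts a virtual X server first.
  console.warn(
    'No DISPLAY found. One common workaround is a virtual display, ' +
      'for example: xvfb-run -a npm run start'
  );
}
```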

@zfcsoftware
Owner

puppeteer-intercept-and-modify-requests

npm i puppeteer-intercept-and-modify-requests
Just run the command above. After you make the changes and run it again, scraping will work on the relevant site.
Thank you for your feedback.

@jairoxyz

jairoxyz commented Jun 9, 2024

I added your code and tried it, and it gets cf_clearance for that site. Do you think there is a way to intercept the devtools detector for any site, rather than hard-coding the intercept-and-modify for specific sites?


3 participants