URL Encoding Not as Expected #21
Comments
Update: I have fixed the issue. Apparently the module was sensitive to the URL formatting (see below). Also, for those of you doing this at scale, please make sure to specify regions: I kept getting 404 responses and realized they were caused by the European regions. When I specified only US regions, many of those were resolved. Thanks!
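For anyone wanting to try the region restriction described above, here is a minimal sketch. It assumes `ApiGateway` accepts a `regions` list of AWS region codes; the helper names `us_only` and `build_us_gateway` and the candidate list are made up for illustration, so check them against the release you have installed.

```python
# Hypothetical candidate list; substitute the regions your account can use.
CANDIDATE_REGIONS = [
    "us-east-1", "us-east-2", "us-west-1", "us-west-2",
    "eu-west-1", "eu-central-1",
]


def us_only(regions):
    """Keep only region codes that start with the "us-" prefix."""
    return [region for region in regions if region.startswith("us-")]


def build_us_gateway(site):
    """Create a gateway limited to US regions (assumes a `regions` argument)."""
    # Imported lazily so the filtering helper works without the package.
    from requests_ip_rotator import ApiGateway

    return ApiGateway(site, regions=us_only(CANDIDATE_REGIONS))


print(us_only(CANDIDATE_REGIONS))
```

Whether a given site answers correctly from a given region probably has to be checked per target, so treat the US-only filter as a starting point rather than a rule.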
Hi! Thank you for your issue, glad it seems to be resolved. I'm going to keep this open, and will be fixing this URL formatting issue in the next release 😊
Thanks George, I've had it running for 15 hours straight now with no IP blocking. Outstanding module! (I am starting a big-data/machine-learning company that involves a ton of web scraping.) And as you know, different URLs work in different regions, so with good URL formatting and restricted regions everything has been working well, and the cost seems far lower than many of these web-scraping/proxy services.
Hi, it looks like AWS messes up the URL encoding on their end... Will take a look at patching, but in the meantime I'd recommend using the
import requests
from requests_ip_rotator import ApiGateway

SITE = "https://site.com"
gateway = ApiGateway(SITE)
gateway.start()
s = requests.Session()
s.mount(SITE, gateway)
# Query written directly into the URL string:
# path reaches target site as /search?hl=en&num=10&q=barry+bonds+after:2021-12-22+before:+2021-12-23&tbm=nws
s.get(SITE + "/search?q=barry+bonds after:2021-12-22 before: 2021-12-23&tbm=nws&hl=en&num=10")
# Query passed through the params argument:
# path reaches target site as /search?hl=en&num=10&q=barry%2Bbonds+after:2021-12-22+before:+2021-12-23&tbm=nws
s.get(SITE + "/search", params={
    "q": "barry+bonds after:2021-12-22 before: 2021-12-23",
    "tbm": "nws",
    "hl": "en",
    "num": 10
})
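The difference between the two paths above comes down to how a literal `+` is treated in a query string. The following sketch uses only Python's standard `urllib.parse` to illustrate that distinction; it does not reproduce what AWS or the module does internally, and the variable names are chosen for illustration.

```python
from urllib.parse import parse_qs, urlencode

QUERY = "barry+bonds after:2021-12-22 before: 2021-12-23"

# urlencode percent-encodes the literal "+" as %2B and turns spaces into "+".
encoded = urlencode({"q": QUERY})
print(encoded)

# Decoding the encoded form recovers the original string, "+" included.
decoded = parse_qs(encoded)["q"][0]
print(decoded)

# A raw "+" left in a query string is decoded as a space instead.
raw_plus = parse_qs("q=barry+bonds")["q"][0]
print(raw_plus)
```

So if the search term is meant to contain a real plus sign, it needs to arrive as `%2B`; if the `+` is only standing in for a space, the raw form is equivalent.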
Awesome, makes sense, thanks!
I have been attempting to use this package to scrape Google News - I am using the most recent release (v1.0.10), and have configured the AWS-CLI. The exact code sequence resulting in a failure is as follows:
Unfortunately, the result is a 429 response for me ... on the other hand, when I tried using a proxy from scrapingbee.com after initially getting blocked by Google (performing step 1), I actually did get a 200 response. I configured the AWS CLI, and I also tried inputting the keys as arguments and creating new users with the API Gateway enabled, as well as using the root key, but have had no luck.
Are you able to replicate this issue, i.e. first artificially get yourself blocked by Google and then find that you are unable to scrape using this ip-rotator module? Thank you very much for an excellent module, and Merry Christmas and happy holidays!