Merge pull request #12 from RobBrazier/add-response-retriever
Add response_retriever option to support proxied scraping
thefakequake authored Mar 15, 2022 · 2 parents 235b781 + 5d96485 · commit 8cb46d0
Showing 2 changed files with 14 additions and 2 deletions.
README.md (5 additions, 1 deletion)
```diff
@@ -82,13 +82,17 @@ for part in parts:
 
 ---
 
-### `Scraper(headers={...})`
+### `Scraper(headers={...}, response_retriever=...)`
 
 ### Parameters
 - **headers** ( [dict](https://docs.python.org/3/library/stdtypes.html#mapping-types-dict) ) - The browser headers for the requests in a dict.
 
 Note: There are headers set by default. I only recommend changing them if you are encountering scraping errors.
 
+- **response_retriever** ( [Callable](https://docs.python.org/3/library/typing.html#typing.Callable) ) - A function accepting arguments (`url, **kwargs`) that is called to retrieve the response from PCPartPicker
+
+Note: A default retriever is configured that calls pcpartpicker.com directly. I only recommend changing this if you need to configure how the request is made (e.g. via a proxy)
+
 # Scraper Methods
 
 ---
```
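The new parameter makes it straightforward to route scraping through a proxy. A minimal sketch, assuming the package is imported as `from pypartpicker import Scraper` as in the README's other examples; the `proxied_retriever` name and the proxy address are placeholders, not part of the library:

```python
import requests
from pypartpicker import Scraper

# Hypothetical proxy endpoint; substitute your own.
PROXIES = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}

def proxied_retriever(url, **kwargs):
    # The scraper forwards its headers via **kwargs; we only add the
    # proxy configuration before delegating to requests.
    return requests.get(url, proxies=PROXIES, **kwargs)

scraper = Scraper(response_retriever=proxied_retriever)
```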
pypartpicker/scraper.py (9 additions, 1 deletion)
```diff
@@ -74,11 +74,19 @@ def __init__(self, **kwargs):
         if not isinstance(headers_dict, dict):
             raise ValueError("Headers kwarg has to be a dict!")
         self.headers = headers_dict
+        response_retriever = kwargs.get("response_retriever", self.__default_response_retriever)
+        if not callable(response_retriever):
+            raise ValueError("response_retriever kwarg must be callable!")
+        self.response_retriever = response_retriever
+
+    @staticmethod
+    def __default_response_retriever(url, **kwargs):
+        return requests.get(url, **kwargs)
 
     # Private Helper Function
     def __make_soup(self, url) -> BeautifulSoup:
         # sends a request to the URL
-        page = requests.get(url, headers=self.headers)
+        page = self.response_retriever(url, headers=self.headers)
         # gets the HTML code for the website and parses it using Python's built in HTML parser
         soup = BeautifulSoup(page.content, 'html.parser')
         if "Verification" in soup.find(class_="pageTitle").get_text():
```
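Because `__make_soup` only reads `page.content`, the retriever can return any object exposing a bytes `content` attribute, which `requests.Response` satisfies. As an illustration (a hypothetical helper, not part of this commit), a retriever that retries transient failures before giving up:

```python
import time

import requests

def retrying_retriever(url, retries=3, **kwargs):
    # Retry transient network errors with simple exponential backoff.
    # The scraper only passes url and headers, so retries keeps its default.
    for attempt in range(retries):
        try:
            return requests.get(url, timeout=10, **kwargs)
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)
```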
