Ability to run `shot-scraper javascript` against several URLs at once #148
Challenge: the current UI for that command is:

How would passing multiple URLs work? It would be easier if the JavaScript came first, since then you could tack on multiple URLs as positional arguments, but that doesn't feel right given the current design. Some options:
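One possible design - JavaScript as the positional argument, with each extra URL supplied through a repeatable flag - can be sketched in isolation. This is an illustration only, not shot-scraper's actual click-based CLI; the `-m/--multi` flag name and the `build_parser()` helper are assumptions for the demo:

```python
# Illustration only (not shot-scraper's real CLI): JavaScript is the
# positional argument, and each target URL is passed via a repeatable
# -m/--multi option collected into a list.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("javascript", help="JavaScript to run on every page")
    parser.add_argument(
        "-m", "--multi", dest="urls", action="append", default=[],
        help="URL to run the JavaScript against (repeatable)",
    )
    return parser


args = build_parser().parse_args(
    ["document.title", "-m", "https://example.com/", "-m", "https://example.org/"]
)
print(args.urls)  # → ['https://example.com/', 'https://example.org/']
```

The trade-off is that this reverses the argument order of the existing command, where the URL comes first.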
I built a prototype of that second option:

```diff
diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 3f1245e..86fc7b4 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -653,6 +653,13 @@ def accessibility(
     is_flag=True,
     help="Output JSON strings as raw text",
 )
+@click.option(
+    "multis",
+    "-m",
+    "--multi",
+    help="Run same JavaScript against multiple pages",
+    multiple=True,
+)
 @browser_option
 @browser_args_option
 @user_agent_option
@@ -668,6 +675,7 @@ def javascript(
     auth,
     output,
     raw,
+    multis,
     browser,
     browser_args,
     user_agent,
@@ -704,9 +712,26 @@ def javascript(
     If a JavaScript error occurs an exit code of 1 will be returned.
     """
+    # Special case for --multi - if multis are provided but JavaScript
+    # positional option was not set, assume the first argument is JS
+    if multis and not javascript:
+        javascript = url
+        url = None
+
+    # If they didn't provide JavaScript, assume it's being piped in
     if not javascript:
         javascript = input.read()
-    url = url_or_file_path(url, _check_and_absolutize)
+
+    to_process = []
+    if url:
+        to_process.append(url_or_file_path(url, _check_and_absolutize))
+    to_process.extend(url_or_file_path(multi, _check_and_absolutize) for multi in multis)
+
+    results = []
+
+    if len(to_process) > 1 and not raw:
+        output.write("[\n")
+
     with sync_playwright() as p:
         context, browser_obj = _browser_context(
             p,
@@ -719,18 +744,28 @@ def javascript(
             auth_username=auth_username,
             auth_password=auth_password,
         )
-        page = context.new_page()
-        if log_console:
-            page.on("console", console_log)
-        response = page.goto(url)
-        skip_or_fail(response, skip, fail)
-        result = _evaluate_js(page, javascript)
+        for i, url in enumerate(to_process):
+            is_last = i == len(to_process) - 1
+            page = context.new_page()
+            if log_console:
+                page.on("console", console_log)
+            response = page.goto(url)
+            skip_or_fail(response, skip, fail)
+            result = _evaluate_js(page, javascript)
+            if raw:
+                output.write(str(result) + "\n")
+            else:
+                output.write(
+                    json.dumps(result, indent=4, default=str) + ("\n" if is_last else ",\n")
+                )
+
         browser_obj.close()
-        if raw:
-            output.write(str(result))
-            return
-        output.write(json.dumps(result, indent=4, default=str))
-        output.write("\n")
+
+    if len(to_process) > 1 and not raw:
+        output.write("]\n")
+
+    if len(results) == 1:
+        results = results[0]
 
 
 @cli.command()
```

Then used like this:

```shell
shot-scraper javascript "
async () => {
  const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
  return (new readability.Readability(document)).parse();
}" \
  -m https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/ \
  -m https://simonwillison.net/2024/Mar/26/llm-cmd/ \
  -m https://simonwillison.net/2024/Mar/23/building-c-extensions-for-sqlite-with-chatgpt-code-interpreter/ \
  -m https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-case-study/ \
  -m https://simonwillison.net/2024/Mar/16/weeknotes-the-aftermath-of-nicar/ | tee /tmp/all.json
```

It worked, but I'm not sure if the design is right - in particular it feels inconsistent with how
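The output-framing part of that prototype can be exercised on its own, without a browser. In this sketch, `write_results()` is a hypothetical helper and the dicts stand in for the per-page `_evaluate_js()` values; it reproduces the streaming behaviour above, where the JSON array wrapper is only emitted when more than one page is processed, with a comma after every element except the last:

```python
# Sketch of the prototype's output framing: results are streamed as they
# arrive; a JSON array wrapper appears only when there is more than one
# result (and --raw is off).
import io
import json


def write_results(results, output, raw=False):
    if len(results) > 1 and not raw:
        output.write("[\n")
    for i, result in enumerate(results):
        is_last = i == len(results) - 1
        if raw:
            output.write(str(result) + "\n")
        else:
            output.write(
                json.dumps(result, indent=4, default=str)
                + ("\n" if is_last else ",\n")
            )
    if len(results) > 1 and not raw:
        output.write("]\n")


buf = io.StringIO()
write_results([{"title": "A"}, {"title": "B"}], buf)
print(buf.getvalue())  # a valid JSON array containing both objects
```

A single result is written bare, matching the existing single-URL behaviour, so the output shape changes only when `-m` is used more than once.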
Here are some ideas I have come across in other scraping tools:
Side note: going through all the research in the issues, it might be an idea to allow shot-scraper to use a config file. That way, all the arguments you can pass on the command line could be put neatly in a config file.
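Since `shot-scraper multi` already reads its shots from a YAML file, YAML would be a natural format for such a config file. A purely hypothetical sketch - none of these keys exist today, they just mirror existing command-line options like `--browser`, `--user-agent`, `--timeout` and `--browser-arg`:

```yaml
# Hypothetical shot-scraper config file (no such feature exists yet);
# each key mirrors an existing command-line option.
browser: chromium
user_agent: "my-scraper/1.0"
timeout: 10000
browser_args:
  - "--no-sandbox"
```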
I found myself wanting to use the Readability trick against multiple URLs, without having to pay the startup cost of launching a new Chromium instance for each one.

Idea: a way to run `shot-scraper javascript` against more than one URL, returning an array of results.