Applying same options to multiple URLs #719

Open
jwilk opened this issue Sep 6, 2022 · 2 comments

@jwilk
Contributor

jwilk commented Sep 6, 2022

I'm watching a large number of URLs that have the same structure, so I'm applying the same set of filters to them:

url: https://example.net/9566
ssl_no_verify: true
filter:
- css: '#data'
- xpath: '//text()'
- format-json: null
- grep: '"(foo|bar)":'
- re.sub: 'T00:00:00(?=")'
- re.sub: '"'
- re.sub: '(?m)^ *|,$'
---
url: https://example.net/14026
ssl_no_verify: true
filter:
- css: '#data'
- xpath: '//text()'
- format-json: null
- grep: '"(foo|bar)":'
- re.sub: 'T00:00:00(?=")'
- re.sub: '"'
- re.sub: '(?m)^ *|,$'
---
url: https://example.net/15829
ssl_no_verify: true
filter:
- css: '#data'
- xpath: '//text()'
- format-json: null
- grep: '"(foo|bar)":'
- re.sub: 'T00:00:00(?=")'
- re.sub: '"'
- re.sub: '(?m)^ *|,$'
---
# ...

This is very tiresome to update.

So I wish I could write something like this instead:

url:
- https://example.net/9566
- https://example.net/14026
- https://example.net/15829
# ...
ssl_no_verify: true
filter:
- css: '#data'
- xpath: '//text()'
- format-json: null
- grep: '"(foo|bar)":'
- re.sub: 'T00:00:00(?=")'
- re.sub: '"'
- re.sub: '(?m)^ *|,$'
@thp
Owner

thp commented Sep 8, 2022

One quick and pragmatic way to do this would be to write a small script that generates the urls.yaml from a "template" that you specified like above. This way, you can make it as complex and/or powerful as you want.
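A minimal sketch of such a generator, assuming the shared options are kept as a literal YAML fragment and the URLs as a plain list. The names `SHARED_OPTIONS`, `URLS`, and `render_jobs` are made up for illustration; nothing here is part of the project itself.

```python
# Hypothetical generator script: prints one YAML job document per URL,
# each with the same shared options. All names are illustrative.

SHARED_OPTIONS = """\
ssl_no_verify: true
filter:
- css: '#data'
- xpath: '//text()'
- format-json: null
- grep: '"(foo|bar)":'
- re.sub: 'T00:00:00(?=")'
- re.sub: '"'
- re.sub: '(?m)^ *|,$'
"""

URLS = [
    "https://example.net/9566",
    "https://example.net/14026",
    "https://example.net/15829",
]


def render_jobs(urls, shared_options):
    """Return a multi-document YAML string with one job per URL."""
    docs = [f"url: {url}\n{shared_options}" for url in urls]
    # YAML documents in one stream are separated by a line containing '---'.
    return "---\n".join(docs)


if __name__ == "__main__":
    print(render_jobs(URLS, SHARED_OPTIONS), end="")
```

Redirecting the output of this script into urls.yaml would reproduce the repetitive file from the original post, while only the list and the fragment need to be maintained by hand.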

Your suggestion doesn't work well if, for example, you want to give different URLs different names. On the other hand, for the simple case of turning the "url" field into a list, it could work. The job parser would need to be updated to handle that, though: it would probably "expand" the data so that the rest of the codebase "sees" distinct jobs that just happen to share the same filter configuration.
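The "expand" step could look roughly like the sketch below. This is not the actual job parser; `expand_job` is a hypothetical helper that assumes each job has already been loaded into a plain dict.

```python
import copy


def expand_job(job):
    """Turn a job whose "url" is a list into one job per URL.

    Hypothetical helper: jobs with a single (string) URL are returned
    unchanged, wrapped in a one-element list.
    """
    urls = job.get("url")
    if not isinstance(urls, list):
        return [job]
    expanded = []
    for url in urls:
        # Deep-copy so each resulting job owns its own filter list.
        single = copy.deepcopy(job)
        single["url"] = url
        expanded.append(single)
    return expanded
```

Each resulting dict carries the same options and filters, so code further down the line would not need to know that the jobs came from a single entry.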

Keeping this open for now as a feature idea for the future.

thp added the enhancement label Sep 8, 2022
@trevorshannon
Contributor

Perhaps this will help @jwilk or others searching for a solution.

I generally use the global job_defaults to apply the same filters to all my URLs (docs). If I have a few URLs that need an additional filter step or perhaps a few URLs that need a certain filter step skipped, I use a custom SelectiveFilter. This is obviously just for my use case, but perhaps the idea can be generalized.

This custom SelectiveFilter allows you to define a list of regex patterns to match. A defined conventional filter is then applied selectively depending on the results of that match.

Not exactly what you want, but I think the concept of making a custom filter in your hooks.py and giving that filter some logic to either apply itself or not is a workable solution.
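The core of the selective idea can be sketched without any project-specific classes. The function below is an illustration only, not the SelectiveFilter mentioned above: `apply_selectively` and its parameters are assumed names, and hooking it into hooks.py would still require wrapping it in the filter base class the project provides.

```python
import re


def apply_selectively(data, location, patterns, inner_filter):
    """Apply inner_filter to data only if location matches a pattern.

    Hypothetical helper: `location` is the job's URL, `patterns` is a
    list of regular expressions, and `inner_filter` is any callable
    that takes the text and returns the filtered text.
    """
    if any(re.search(pattern, location) for pattern in patterns):
        return inner_filter(data)
    # No pattern matched: pass the data through untouched.
    return data


def strip_quotes(text):
    """Example inner filter: remove double quotes."""
    return text.replace('"', "")
```

For example, `apply_selectively('"foo": 1', "https://example.net/9566", [r"example\.net/\d+$"], strip_quotes)` would strip the quotes, while a URL that matches none of the patterns would leave the text as it was.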
