dag to scrape and save the current ridership URL from the NTD portal #3545
airflow/dags/sync_ntd_data_xlsx/scrape_ntd_ridership_xlsx_url.py
@@ -15,9 +15,11 @@ repos:
     rev: 6.0.0
     hooks:
       - id: flake8
-        args: ["--ignore=E501,W503"] # line too long and line before binary operator (black is ok with these)
+        args: ["--ignore=E501,W503,E231"] # line too long and line before binary operator (black is ok with these) and explicitly ignore the whitespace after colon error
Added --ignore=E231 to the flake8 args to ignore "missing whitespace after colon" errors in lambda functions, which were causing CI failures even though the code was properly formatted locally.
         types:
           - python
+        # Suppress SyntaxWarning about invalid escape sequence from calitp-data-infra dependency without modifying source
+        entry: env PYTHONWARNINGS="ignore::SyntaxWarning" flake8
Added entry: env PYTHONWARNINGS="ignore::SyntaxWarning" flake8 to suppress a SyntaxWarning about an invalid escape sequence from the calitp-data-infra dependency without modifying its source code.
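For readers unfamiliar with the filter syntax, the following is a minimal sketch of what an `ignore::SyntaxWarning` filter does, expressed with Python's standard `warnings` module. The function name `count_syntax_warnings` and the warning message are made up for illustration; the warning is raised manually here rather than coming from calitp-data-infra.

```python
import warnings


def count_syntax_warnings(suppress: bool) -> int:
    """Emit one SyntaxWarning and return how many warnings were recorded."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning by default so nothing is hidden by earlier filters.
        warnings.simplefilter("always")
        if suppress:
            # Roughly what PYTHONWARNINGS="ignore::SyntaxWarning" asks for:
            # action "ignore", any message, category SyntaxWarning.
            warnings.simplefilter("ignore", category=SyntaxWarning)
        # Hypothetical stand-in for the warning raised by the dependency.
        warnings.warn("invalid escape sequence (illustrative)", SyntaxWarning)
    return len(caught)


recorded_without_filter = count_syntax_warnings(suppress=False)
recorded_with_filter = count_syntax_warnings(suppress=True)
```

Because the filter only targets `SyntaxWarning`, other warning categories would still be reported.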
Description
When we created the Airflow operator to ingest the NTD monthly ridership tables in XLSX format, we didn't realize that DOT changes the file name every time it publishes the data.
Instead of relying on someone to remember to run the data ingestion manually every month with the new link, this PR dynamically scrapes the data download webpage to retrieve the new link, saves it in Airflow as an environment variable, and lets the request library use the updated link. To do this, it also makes the ingest Airflow task dependent on the success of the URL scraper.
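As a rough illustration of the scraping step, here is a minimal sketch that pulls the first `.xlsx` link out of a page's HTML using only the Python standard library. It is not the code from this PR: the names `XlsxLinkParser` and `find_xlsx_url`, the sample HTML, and the `example.gov` URL are all hypothetical, and fetching the page and storing the result in Airflow are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XlsxLinkParser(HTMLParser):
    """Collects absolute URLs of <a> tags whose href ends in .xlsx."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".xlsx"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_xlsx_url(html, base_url):
    """Return the first .xlsx link on the page, or raise if none is found."""
    parser = XlsxLinkParser(base_url)
    parser.feed(html)
    if not parser.links:
        # Failing loudly lets a downstream ingest task be skipped
        # instead of running with a stale link.
        raise ValueError("no .xlsx link found on page")
    return parser.links[0]


# Hypothetical page content; the real portal markup may differ.
sample_html = (
    '<p><a href="/about">About</a> '
    '<a href="/sites/files/2024-05/ridership.xlsx">Download</a></p>'
)
url = find_xlsx_url(sample_html, "https://example.gov/ntd/")
```

Raising an error when no link is found fits the dependency described above: if the scraper task fails, the ingest task that depends on it does not run.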
Type of change
How has this been tested?
local airflow
Post-merge follow-ups
monitor for expected behavior