dag to scrape and save the current ridership URL from the NTD portal #3545

Merged
merged 18 commits into from
Nov 26, 2024

@charlie-costanzo charlie-costanzo commented Nov 13, 2024

Description

When we created the Airflow operator to ingest the NTD monthly ridership tables in XLSX format, we didn't realize that DOT changes the file name every time it publishes the data.

Instead of relying on someone to remember to run the data ingestion manually every month with the new link, this PR scrapes the data download webpage for the new link, saves it in Airflow as an environment variable, and lets the request library use the updated link. To do this, it also makes the ingest Airflow task dependent on the success of the URL scraper.
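
As a rough illustration of the scraping step described above (not the code in this PR), the sketch below pulls the first `.xlsx` link out of a page's HTML using only the standard library. The names `XlsxLinkParser`, `find_xlsx_url`, and `SAMPLE_HTML`, and the `example.com` URL, are hypothetical placeholders; the real NTD page structure and the way the PR stores the result in Airflow are not shown here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for the downloaded page; the real page markup may differ.
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="/files/2024-11/Ridership%20Nov.xlsx">Download ridership data</a>
</body></html>
"""


class XlsxLinkParser(HTMLParser):
    """Collects href values of <a> tags that point at .xlsx files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.split("?")[0].lower().endswith(".xlsx"):
            self.links.append(href)


def find_xlsx_url(page_html, page_url):
    """Return the absolute URL of the first .xlsx link found in page_html."""
    parser = XlsxLinkParser()
    parser.feed(page_html)
    if not parser.links:
        # Raising here would fail a scraper task, so a dependent
        # ingest task would not run with a stale or missing link.
        raise ValueError("no .xlsx link found on page")
    return urljoin(page_url, parser.links[0])


if __name__ == "__main__":
    # Assumed page URL, for illustration only.
    print(find_xlsx_url(SAMPLE_HTML, "https://example.com/ntd/data"))
```

Raising when no link is found is one way to get the behavior the description asks for: if the scraper fails, the dependent ingest task does not proceed.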

Type of change

  • New feature

How has this been tested?

Tested in local Airflow (screenshot from 2024-11-18, 6:14:13 PM).

Post-merge follow-ups

Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.

  • Actions required (specified below)
    monitor for expected behavior

@charlie-costanzo charlie-costanzo marked this pull request as ready for review November 18, 2024 23:17
@charlie-costanzo charlie-costanzo force-pushed the scrape-ntd-ridership-xlsx-url branch 2 times, most recently from 38f0deb to 5806a3d Compare November 22, 2024 19:31
@@ -15,9 +15,11 @@ repos:
rev: 6.0.0
hooks:
- id: flake8
-args: ["--ignore=E501,W503"] # line too long and line before binary operator (black is ok with these)
+args: ["--ignore=E501,W503,E231"] # line too long and line before binary operator (black is ok with these) and explicitly ignore the whitespace after colon error
Member Author

Added E231 to the flake8 --ignore list to suppress "missing whitespace after colon" errors in lambda functions. These errors were failing CI even though the code was properly formatted locally.

types:
- python
+# Suppress SyntaxWarning about invalid escape sequence from calitp-data-infra dependency without modifying source
+entry: env PYTHONWARNINGS="ignore::SyntaxWarning" flake8
Member Author

Added entry: env PYTHONWARNINGS="ignore::SyntaxWarning" flake8 to suppress a SyntaxWarning about an invalid escape sequence raised from the calitp-data-infra dependency, without modifying its source code.

@charlie-costanzo charlie-costanzo added the data-pipeline-ingestion-and-modeling Ingesting, parsing and modeling data. Evan Siroky is product owner. label Nov 22, 2024
@charlie-costanzo charlie-costanzo merged commit 4a63342 into main Nov 26, 2024
1 check passed
@charlie-costanzo charlie-costanzo deleted the scrape-ntd-ridership-xlsx-url branch November 26, 2024 15:34
Labels
data-pipeline-ingestion-and-modeling Ingesting, parsing and modeling data. Evan Siroky is product owner.
3 participants