dag to scrape and save the current ridership URL from the NTD portal #3545

charlie-costanzo · 2024-11-13T18:09:32Z

Description

When we created the Airflow operator to ingest the NTD monthly ridership tables in XLSX format, we didn't realize that DOT changes the file name every time it publishes the data.

Instead of relying on someone to remember to manually run the data ingestion every month with the new link, this PR dynamically scrapes the data download webpage to retrieve the new link, saves it in Airflow as an environment variable, and lets the requests library use the updated link. To do this, it also makes the ingest Airflow task dependent on the success of the URL scraper.
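A minimal sketch of the scraping step described above, using only the Python standard library. It is not the code in scrape_ntd_ridership_xlsx_url.py; the names `XlsxLinkFinder` and `find_xlsx_url` and the example.gov URL are hypothetical, and it assumes the download page exposes the file as an ordinary `<a href="...xlsx">` link.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class XlsxLinkFinder(HTMLParser):
    """Collect href values of <a> tags that point at .xlsx files."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".xlsx"):
            self.links.append(href)


def find_xlsx_url(page_html: str, base_url: str) -> Optional[str]:
    """Return the first .xlsx link on the page as an absolute URL, or None."""
    finder = XlsxLinkFinder()
    finder.feed(page_html)
    if not finder.links:
        return None
    # Resolve relative hrefs against the page URL (hypothetical here).
    return urljoin(base_url, finder.links[0])


if __name__ == "__main__":
    sample_html = """
    <html><body>
      <a href="/about">About</a>
      <a href="/files/2024-11/ridership-sample.xlsx">Download</a>
    </body></html>
    """
    print(find_xlsx_url(sample_html, "https://example.gov/data/sample-page"))
```

In a DAG, the scraper task would fetch the page, pass the HTML to a helper like this, and store the result where the downstream ingest task can read it. Returning `None` when no link is found lets the scraper task fail explicitly, so the dependent ingest task does not run against a stale URL.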

Type of change

New feature

How has this been tested?

Tested in a local Airflow environment.

Post-merge follow-ups

Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.

Actions required (specified below)
monitor for expected behavior

airflow/dags/sync_ntd_data_xlsx/scrape_ntd_ridership_xlsx_url.py

charlie-costanzo · 2024-11-22T20:45:20Z

.pre-commit-config.yaml

@@ -15,9 +15,11 @@ repos:
    rev: 6.0.0
    hooks:
      - id: flake8
-        args: ["--ignore=E501,W503"] # line too long and line before binary operator (black is ok with these)
+        args: ["--ignore=E501,W503,E231"] # line too long and line before binary operator (black is ok with these) and explicitly ignore the whitespace after colon error


Added E231 to the flake8 --ignore list to suppress "missing whitespace after colon" errors in lambda functions. These errors were failing CI even though the code was formatted correctly locally.

charlie-costanzo · 2024-11-22T20:45:37Z

.pre-commit-config.yaml

        types:
          - python
+        # Suppress SyntaxWarning about invalid escape sequence from calitp-data-infra dependency without modifying source
+        entry: env PYTHONWARNINGS="ignore::SyntaxWarning" flake8


Added entry: env PYTHONWARNINGS="ignore::SyntaxWarning" flake8 to suppress a SyntaxWarning about an invalid escape sequence raised by the calitp-data-infra dependency, without modifying its source code.

charlie-costanzo self-assigned this Nov 13, 2024

charlie-costanzo requested review from evansiroky and vevetron as code owners November 13, 2024 18:09

charlie-costanzo marked this pull request as draft November 13, 2024 18:09

mjumbewu reviewed Nov 14, 2024

airflow/dags/sync_ntd_data_xlsx/scrape_ntd_ridership_xlsx_url.py (outdated, resolved)

charlie-costanzo requested a review from mjumbewu November 18, 2024 23:16

charlie-costanzo marked this pull request as ready for review November 18, 2024 23:17

charlie-costanzo requested review from tiffanychu90 and hunterowens as code owners November 22, 2024 17:20

charlie-costanzo force-pushed the scrape-ntd-ridership-xlsx-url branch 2 times, most recently from 38f0deb to 5806a3d Compare November 22, 2024 19:31

charlie-costanzo commented Nov 22, 2024

charlie-costanzo force-pushed the scrape-ntd-ridership-xlsx-url branch from 90dc13c to 605196c Compare November 22, 2024 20:48

charlie-costanzo added the data-pipeline-ingestion-and-modeling label Nov 22, 2024

charlie-costanzo force-pushed the scrape-ntd-ridership-xlsx-url branch from 605196c to 16d7592 Compare November 22, 2024 22:47

evansiroky approved these changes Nov 26, 2024

charlie-costanzo added 14 commits November 26, 2024 09:41

dag to scrape the current ridership URL from the NTD portal

51db7d0

fix naming and add some descriptions

d455622

reconfigured airflow dag setup for dependencies and special handling

1483a30

test storing variables in xcoms

a9be8a7

cleaned up imports

d475762

rebase

e7dd9d4

remove and reorganize some lingering and unnecessary code and test

33069ca

linter not working

c2dd5c5

refactor lambda for flake8

0c4e243

flake8 config change

d9d8025

flake8 config change again

67c0345

create function of url finder

160c2ba

add comment for flake8 suppression

de8286e

accidentally pushed copy file

7120ef6

charlie-costanzo added 4 commits November 26, 2024 09:41

suppress whitespace after colon error

bb53d45

last pass at configuration changes

04e2177

suppress whitespace after colon error

6afce23

remove testing comments, clean up changed files

7eefb94

charlie-costanzo force-pushed the scrape-ntd-ridership-xlsx-url branch from 16d7592 to 7eefb94 Compare November 26, 2024 14:41

charlie-costanzo merged commit 4a63342 into main Nov 26, 2024
1 check passed

charlie-costanzo deleted the scrape-ntd-ridership-xlsx-url branch November 26, 2024 15:34

charlie-costanzo mentioned this pull request Nov 26, 2024

add requirement necessary for recently merged pr 3545 #3560

Merged

2 tasks
