Bootstrap starting dataset for scraper #10

Open

2 tasks

andrewsutjahjo opened this issue Nov 30, 2021 · 0 comments

andrewsutjahjo commented Nov 30, 2021 (edited)

We need seed data before the scraper and diff-er can run any part of our pipeline.

This story takes BankTrack URLs, filenames, BankTrack's document data, and our internal metadata structure (#9), and outputs a populated starting {data_structure} object/instance that other people can use.

SPIKE FOR THIS:

Look at the PDFs housed on BankTrack's internal server
Try to programmatically find the webpage on the bank's website that links to each PDF
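
As a starting point for the second spike item, here is a minimal sketch of checking whether a given page links to a known PDF. It assumes the page HTML has already been fetched and that matching on the PDF's filename is good enough for a first pass. The names `links_to_pdf` and `_LinkCollector` and the `bank.example` URLs are made up for illustration and are not part of our codebase.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, unquote
import posixpath


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_pdf(page_html, page_url, pdf_filename):
    """Return absolute link URLs on the page whose last path segment
    matches pdf_filename (case-insensitive). Assumption: the filename
    on the bank's site is the same as the one we have on file."""
    collector = _LinkCollector()
    collector.feed(page_html)
    wanted = pdf_filename.lower()
    matches = []
    for href in collector.hrefs:
        # Resolve relative links against the page they were found on.
        absolute = urljoin(page_url, href)
        # Compare only the path's final segment, ignoring query strings.
        last_segment = posixpath.basename(unquote(urlparse(absolute).path))
        if last_segment.lower() == wanted:
            matches.append(absolute)
    return matches


# Example with an inline HTML snippet instead of a live request.
sample_html = (
    '<a href="/docs/policy.pdf">Policy</a>'
    '<a href="about.html">About</a>'
    '<a href="files/Policy.PDF?v=2">Policy (v2)</a>'
)
found = links_to_pdf(sample_html, "https://bank.example/sustainability/", "policy.pdf")
```

A page counts as a candidate source if `found` is non-empty. Filename matching will miss PDFs that were renamed on the bank's side, so the spike should probably also note how often that happens.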

Depends on #9 to determine whether this is JSON, a flat file (Parquet?), CSV, a graph database, or a qbit stored archive.

andrewsutjahjo assigned AndriiG13
