Bootstrap starting dataset for scraper #10

Open
2 tasks
andrewsutjahjo opened this issue Nov 30, 2021 · 0 comments

andrewsutjahjo commented Nov 30, 2021

We need seed data for the scraper + diff-er to start running any of our pipeline.

This story takes BankTrack URLs, filenames, BankTrack's document data, and our internal metadata structure (#9), and outputs a populated starting {data_structure} object/instance that other people can use.
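
To make the input/output shape concrete, here is a rough sketch of what one seed record could look like. Every name in it (`SeedRecord`, `build_seed`, the field names, `source_page_url`) is a placeholder I made up for illustration; the real structure and storage format are whatever #9 decides, and JSON is used here only because it is easy to inspect.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SeedRecord:
    # All field names are hypothetical; the real schema is decided in #9.
    banktrack_url: str                     # where BankTrack hosts the document
    filename: str                          # file name of the PDF
    document_data: dict                    # BankTrack's own data about the document
    source_page_url: Optional[str] = None  # bank webpage linking to the PDF, if found


def build_seed(rows):
    """Turn (url, filename, document_data) tuples into seed records."""
    return [SeedRecord(url, name, data) for url, name, data in rows]


def to_json(records):
    """Serialise the records; JSON is only a stand-in for the #9 format."""
    return json.dumps([asdict(r) for r in records], indent=2)
```

`source_page_url` starts out empty so the spike below can fill it in when a linking page is found.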

SPIKE FOR THIS:

  • Look at PDFs housed on BankTrack's internal server
  • Try to programmatically find the webpage on that bank's website that has a link to that pdf
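
One possible starting point for the second bullet, as a sketch only: given pages we have already fetched from a bank's website, collect every link on each page and report the pages that link to a file with the same name as the PDF. This assumes the copy on BankTrack's server kept the bank's original filename, which may not hold. `LinkCollector` and `pages_linking_to` are hypothetical names, and fetching or crawling the pages is deliberately left out.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import unquote, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def pages_linking_to(pdf_filename, pages):
    """Return URLs of pages that link to a file named pdf_filename.

    pages maps page URL -> HTML text (fetched elsewhere).
    Assumption: matching on filename alone is good enough for a first pass.
    """
    wanted = pdf_filename.lower()
    hits = []
    for page_url, html in pages.items():
        collector = LinkCollector(page_url)
        collector.feed(html)
        for link in collector.links:
            # Compare the decoded last path segment, case-insensitively.
            name = posixpath.basename(unquote(urlparse(link).path)).lower()
            if name == wanted:
                hits.append(page_url)
                break
    return hits
```

Filename matching will miss PDFs that were renamed on upload, so the spike should also record how often that happens.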

Depends on #9 to determine whether this is JSON, a flat file (Parquet?), CSV, a graph database, or a qbit stored archive.

2 participants