Query reuse / caching #8

Open
sgoodm opened this issue Aug 2, 2021 · 0 comments
Labels
code related to code for processing data

Comments

sgoodm commented Aug 2, 2021

Current implementation:

My current implementation of this is very basic and intended solely to avoid re-querying directions links to retrieve SVG path data. A path to a feature_df.csv from a previous run (containing the SVG path data and any initial processing for each link) can be provided and loaded in place of reprocessing any features already present in that feature_df.csv. This saves a lot of time with the current implementation of querying data for directions links (see: #1).
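Roughly, the reuse logic looks like the sketch below (a minimal illustration; names such as `unique_id` and `query_svg_path` are placeholders, not the actual code):

```python
import pandas as pd

def query_svg_path(link_row):
    # Placeholder for the slow directions-link query (see #1); the real
    # implementation would fetch and parse the SVG path data.
    return {"unique_id": link_row["unique_id"], "svg_path": None}

def build_feature_df(links_df, cache_csv_path=None):
    cached = None
    if cache_csv_path:
        # feature_df.csv from a previous run, indexed by its unique ID
        cached = pd.read_csv(cache_csv_path).set_index("unique_id")
    rows = []
    for _, link in links_df.iterrows():
        uid = link["unique_id"]
        if cached is not None and uid in cached.index:
            # cache hit: reuse SVG path data and initial processing
            row = cached.loc[uid].to_dict()
            row["unique_id"] = uid
        else:
            # cache miss: run the expensive query
            row = query_svg_path(link)
        rows.append(row)
    return pd.DataFrame(rows)
```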

Issue:

The current implementation will fail if the input data source changes (which changes the unique IDs assigned to project-link or feature combinations). It can also only utilize data from a single previous run.

Possible solutions:

This really depends on how far we want to go to deal with this. If querying directions links were faster, I would likely suggest we forgo this issue and just process the data fresh on each build. That said, I can imagine cases where accessing cached data would be useful (e.g., OSM features changed and we want to use a specific version from an old build).

  • One extreme would be to build out a full caching system keyed on the unique build, input data, TUFF ID, link, etc. I am not sure how we would want to go about specifying which cached data to use and which not to. This would likely be substantial over-engineering for this application.

  • Another approach I considered during the initial implementation was to create a separate script that just merges any number of previous builds (see the sketch after this list). This would be more hands-on and require someone to know what subset of data was processed in each build, but it would also be a convenient way of adding new data to an existing dataset without reprocessing the old data.

    • For example: you could run a full build of input_data_01.csv; the underlying data is later updated to include new projects as input_data_02.csv. The existing projects could be filtered out of input_data_02.csv and only the new projects processed, then the results from build 1 and build 2 merged.
    • This use case would get more complicated if subsets of existing data were updated, rather than new data simply being added.
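A minimal sketch of such a merge script (assuming each build wrote a feature_df.csv keyed by a `unique_id` column; names and paths are illustrative):

```python
import pandas as pd

def merge_builds(csv_paths):
    """Merge feature_df.csv outputs from any number of previous builds.

    When the same unique_id appears in more than one build, the row from
    the later path in the list wins (e.g., an updated reprocessing).
    """
    frames = [pd.read_csv(p) for p in csv_paths]
    merged = pd.concat(frames, ignore_index=True)
    # keep="last" so later builds override earlier ones for shared IDs
    return merged.drop_duplicates(subset="unique_id", keep="last")

# e.g., build 1 processed input_data_01.csv; build 2 processed only the
# new projects from input_data_02.csv
merged = merge_builds(["build_01/feature_df.csv", "build_02/feature_df.csv"])
merged.to_csv("merged_feature_df.csv", index=False)
```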

Ultimately I will likely leave this until the next update is needed and see what will be useful in practice based on data update patterns.

@sgoodm sgoodm added the code related to code for processing data label Aug 2, 2021