Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runtime: Re-ingest local files if they are updated #5479

Closed
begelundmuller opened this issue Aug 19, 2024 · 4 comments · Fixed by #5656
Closed

Runtime: Re-ingest local files if they are updated #5479

begelundmuller opened this issue Aug 19, 2024 · 4 comments · Fixed by #5656
Assignees

Comments

@begelundmuller
Copy link
Contributor

begelundmuller commented Aug 19, 2024

We have gotten several reports about file sources not being updated when the local file is changed. This is especially problematic for mapping.csv files used for access management in security policies.

We need to be careful only to do this for small local files, to avoid accidentally scanning large/remote files (even for computing a hash). Maybe it should be an optional parameter for the file: connector?

@begelundmuller
Copy link
Contributor Author

I don't have a clear idea about how we might solve this, but here are some thoughts (probably too hacky):

  • We could support a filestore_to_self executor for DuckDB and ClickHouse for ingesting from the file/local_file connector
  • If invalidate_on_change is set to true in the InputProps, we could compute a hash of the ingested files and store it in the ModelResult
  • In the call to ModelManager.Exists, we could re-compute the hash and return false if it changed

@k-anshul
Copy link
Member

k-anshul commented Sep 9, 2024

We can also consider following alternate solutions if the use case is only for mapping files kind of situations:

  1. Create a non materialised model on top of the file so that every query that reads data from the view reads data from the file. For files with few hundred rows the difference is less than 5ms.
  2. Add a source/model config invalidate_always which always returns true for ModelManager.Exists so that the model is always updated on reconcile. Since these mappings are expected to be small it should be okay to always refresh them.

1 has a limitation that dependencies will not be refreshed. 2 has a limitation that dependencies will always be refreshed.

If we want to solve this for all cases then I think the solution listed above is the only way. Two considerations though :

  1. For clickhouse the local file is actually stored on the clickhouse server so we can't compute its hash. Also given clickhouse sources are always run as models we only have the queries and not the actual path of the file.
  2. I think we also need to store the path of the file in ModelManager so that hash can be recomputed.

@begelundmuller
Copy link
Contributor Author

We can also consider following alternate solutions if the use case is only for mapping files kind of situations:

  1. Unfortunately they're often referenced in security policies and dimension/measure expressions, which get hit a ton (e.g. when listing dashboards, we check the security policy of each dashboard in the project), so I'd be too worried about the number of disk reads and CSV parses here.

  2. This might be okay, though it should be used cautiously so it doesn't invalidate large downstream models too frequently. However, there's an issue that applies here, which also applies to my suggestion above, which is that if a model file isn't edited, there is no call to Reconcile at all. Reconcile is only called on controller restart or when the resource spec changes.

If we want to solve this for all cases then I think the solution listed above is the only way. Two considerations though :

About 1. here – for ClickHouse, the file/local_file connector should still load the local path on the runtime and upload/write it into a ClickHouse table. Because the goal for the local file connector is using small data files that are checked into Git.

@k-anshul
Copy link
Member

About 1. here – for ClickHouse, the file/local_file connector should still load the local path on the runtime and upload/write it into a ClickHouse table. Because the goal for the local file connector is using small data files that are checked into Git.

Sure I think this can be handled separately. I will raise an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants