-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upload: be more cautious about grabbing files for upload #305
Conversation
6ea9020
to
cd48df2
Compare
☂️ Python Coverage
Overall Coverage
New FilesNo new covered files... Modified Files
|
aac47fb
to
d3fb06a
Compare
@@ -10,7 +10,6 @@ output.sql | |||
MRCONSO.RRF | |||
*.zip | |||
coverage.xml | |||
*.parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are parquet files checked into the repo in the test_data folder, so I modified this rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would prefer to keep this - currently, export defaults to the project root if no arg is given, which means that exporting data is easy to accidentally commit - that the original origin of this rule. I've been explicitly commiting files when needed.
I agree the specific parquet files highlighted below should be fixed to run in tmpdirs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the thinking, shouldn't we also ignore *.csv
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have nested ignores - a global ignore and then turn that off for the tests/ dir, probably. For both csv and parquet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh? Updated to be a little more complicated, but it seems to work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very clever!
version: int, | ||
version: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is always passed a string, I think this type hint was just a typo.
if any(study in str(path) for study in args["target"]): | ||
if any( | ||
path.parent.name == study and path.name.startswith(f"{study}__") | ||
for study in args["target"] | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the actual code change here.
Before, this was a simple substring check. Which for my use case (upload site/ -t core) was hitting files like site/data_metrics/data_metrics__count_c_us_core_v4_observation.parquet because it has core in its filename, even though it's not a core study file.
So... I've changed this to check that both the leaf filename starts with study__ and it is inside a directory named for the study (a requirement from part of the uploading code that takes that directory name as the study name). This is pretty strict, but also rules that these files ought to follow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, makes sense - also keeps some hand moved data from leaking into the aggregator and creating new studies accidentally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test modifications to allow testing more scenarios easily (in order to satisfy coverage tax)
preview: bool = True, | ||
raises: contextlib.AbstractContextManager = does_not_raise(), | ||
status: int = 204, | ||
version: str = "12345.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This .0
is due to #306
@@ -10,7 +10,6 @@ output.sql | |||
MRCONSO.RRF | |||
*.zip | |||
coverage.xml | |||
*.parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would prefer to keep this - currently, export defaults to the project root if no arg is given, which means that exporting data is easy to accidentally commit - that the original origin of this rule. I've been explicitly commiting files when needed.
I agree the specific parquet files highlighted below should be fixed to run in tmpdirs.
if any(study in str(path) for study in args["target"]): | ||
if any( | ||
path.parent.name == study and path.name.startswith(f"{study}__") | ||
for study in args["target"] | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, makes sense - also keeps some hand moved data from leaking into the aggregator and creating new studies accidentally.
Instead of looking for study name in the file, try harder to make sure we are looking at a study *prefix* in the filename.
d3fb06a
to
f40cfc2
Compare
Instead of looking for study name in the file, try harder to make sure we are looking at a study prefix in the filename.
Fixes #302
Checklist
docs/
needs to be updatedgenerate-md
core
study fields that not in US Core, update our list of those incore-study-details.md
manifest.toml