Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upload: be more cautious about grabbing files for upload #305

Merged
merged 1 commit into from
Oct 7, 2024

Conversation

mikix
Copy link
Contributor

@mikix mikix commented Oct 3, 2024

Instead of looking for study name in the file, try harder to make sure we are looking at a study prefix in the filename.

Fixes #302

Checklist

  • Consider if documentation in docs/ needs to be updated
    • If you've changed the structure of a table, you may need to run generate-md
    • If you've added/removed core study fields that not in US Core, update our list of those in core-study-details.md
  • Consider if tests should be added
  • Update template repo if there are changes to study configuration in manifest.toml

@mikix mikix force-pushed the mikix/safer-upload branch from 6ea9020 to cd48df2 Compare October 4, 2024 12:21
Copy link

github-actions bot commented Oct 4, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
2240 2231 100% 90% 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
cumulus_library/actions/uploader.py 100% 🟢
TOTAL 100% 🟢

updated for commit: f40cfc2 by action🐍

@mikix mikix force-pushed the mikix/safer-upload branch 3 times, most recently from aac47fb to d3fb06a Compare October 4, 2024 14:02
@@ -10,7 +10,6 @@ output.sql
MRCONSO.RRF
*.zip
coverage.xml
*.parquet
Copy link
Contributor Author

@mikix mikix Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are parquet files checked into the repo in the test_data folder, so I modified this rule.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would prefer to keep this - currently, export defaults to the project root if no arg is given, which means that exporting data is easy to accidentally commit - that the original origin of this rule. I've been explicitly commiting files when needed.

I agree the specific parquet files highlighted below should be fixed to run in tmpdirs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the thinking, shouldn't we also ignore *.csv?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have nested ignores - a global ignore and then turn that off for the tests/ dir, probably. For both csv and parquet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh? Updated to be a little more complicated, but it seems to work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very clever!

version: int,
version: str,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is always passed a string, I think this type hint was just a typo.

Comment on lines -74 to +77
if any(study in str(path) for study in args["target"]):
if any(
path.parent.name == study and path.name.startswith(f"{study}__")
for study in args["target"]
):
Copy link
Contributor Author

@mikix mikix Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the actual code change here.

Before, this was a simple substring check. Which for my use case (upload site/ -t core) was hitting files like site/data_metrics/data_metrics__count_c_us_core_v4_observation.parquet because it has core in its filename, even though it's not a core study file.

So... I've changed this to check that both the leaf filename starts with study__ and it is inside a directory named for the study (a requirement from part of the uploading code that takes that directory name as the study name). This is pretty strict, but also rules that these files ought to follow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, makes sense - also keeps some hand moved data from leaking into the aggregator and creating new studies accidentally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test modifications to allow testing more scenarios easily (in order to satisfy coverage tax)

preview: bool = True,
raises: contextlib.AbstractContextManager = does_not_raise(),
status: int = 204,
version: str = "12345.0",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This .0 is due to #306

@mikix mikix marked this pull request as ready for review October 4, 2024 14:14
@@ -10,7 +10,6 @@ output.sql
MRCONSO.RRF
*.zip
coverage.xml
*.parquet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would prefer to keep this - currently, export defaults to the project root if no arg is given, which means that exporting data is easy to accidentally commit - that the original origin of this rule. I've been explicitly commiting files when needed.

I agree the specific parquet files highlighted below should be fixed to run in tmpdirs.

Comment on lines -74 to +77
if any(study in str(path) for study in args["target"]):
if any(
path.parent.name == study and path.name.startswith(f"{study}__")
for study in args["target"]
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, makes sense - also keeps some hand moved data from leaking into the aggregator and creating new studies accidentally.

Instead of looking for study name in the file, try harder to make sure
we are looking at a study *prefix* in the filename.
@mikix mikix force-pushed the mikix/safer-upload branch from d3fb06a to f40cfc2 Compare October 4, 2024 19:02
@mikix mikix merged commit c07eea8 into main Oct 7, 2024
3 checks passed
@mikix mikix deleted the mikix/safer-upload branch October 7, 2024 12:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

upload -t core uploaded other study data
2 participants