Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

batch_job.py write_metadata: avoid ad-hoc file selection for upload #940

Closed
soxofaan opened this issue Nov 14, 2024 · 6 comments
Closed
Assignees

Comments

@soxofaan
Copy link
Member

logger.info("Writing results to object storage")
for file_path in filter(lambda x: x.is_file(), job_dir.rglob("*")):
# TODO: Get list of files to upload from metadata file.
# AFAIK, nothing else is created locally and needs to be uploaded to s3 afterwards.
if UDF_PYTHON_DEPENDENCIES_FOLDER_NAME in str(file_path):
logger.warning(f"Omitting {file_path} as the executors will not be able to access it")
else:
full_path = str(file_path.absolute())
s3_instance.upload_file(full_path, bucket, full_path.strip("/"))

Here we're building an ugly ad-hoc deny-list for "files" that should not be uploaded to S3

As mentioned in the TODO, we should use an explicit asset list to upload instead of blindly assuming everything from the job dir should be uploaded (minus some hand-picked exceptions)

@soxofaan
Copy link
Member Author

cc @EmileSonneveld

@soxofaan
Copy link
Member Author

soxofaan commented Nov 14, 2024

As an illustration that this does not scale:

if UDF_PYTHON_DEPENDENCIES_FOLDER_NAME in str(file_path):

doesn't even work as UDF_PYTHON_DEPENDENCIES_FOLDER_NAME is not in play anymore on CDSE since #845

@EmileSonneveld
Copy link
Contributor

In export_workspace list of files that exists locally an on s3 is determined by the list of stac metadata files.
For example colection.json + item.tiff.json. Probably the same can be used here

@EmileSonneveld EmileSonneveld self-assigned this Nov 14, 2024
@EmileSonneveld
Copy link
Contributor

This issue is a direct result of changes introduced in #877

EmileSonneveld added a commit that referenced this issue Nov 21, 2024
@EmileSonneveld
Copy link
Contributor

EmileSonneveld commented Nov 22, 2024

Logged what files do get uploaded. All logs on cdse dev where one of the following 2:

Writing results to object storage. paths=[PosixPath('/batch_jobs/j-XXX/job_specification.json'), PosixPath('/batch_jobs/j-XXX/job_metadata.json'), PosixPath('/batch_jobs/j-XXX/openEO_2017-03-07Z.tif'), PosixPath('/batch_jobs/j-XXX/openEO_2017-03-07Z.tif.aux.xml'), PosixPath('/batch_jobs/j-XXX/openEO_2017-03-07Z.tif.json'), PosixPath('/batch_jobs/j-XXX/collection.json')]

Writing results to object storage. paths=[PosixPath('/batch_jobs/j-XXX/job_specification.json'), PosixPath('/batch_jobs/j-XXX/job_metadata.json')]

@EmileSonneveld
Copy link
Contributor

Tested on CDSE dev without fusemount. Running the test suite seems to be slightly faster now (40min->30min, but only checked on a few CI runs)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants