New AWS jobstore. #5123
base: master
Conversation
It looks like this probably works as written, but the documentation for how the S3-backed storage works (metadata objects named by file IDs referencing ETag hashes for content-addressed data objects many-to-one) doesn't match what's actually implemented (metadata objects and data objects both named by file ID).
It might be good to actually implement the described content-addressing design, but there are some big unsolved questions:
- ETags on S3 are all MD5-based and not collision-resistant, so it's easy to get two actual files with the same MD5 and put them in the job store to break it.
- Streaming uploads don't know their hash until they are uploaded, so where should they be uploaded to?
- Without transactions, how do you maintain the consistency of many-to-one relationships between metadata objects and data objects, while still allowing data objects to be deleted when the last referencing metadata object goes away?
It might make sense to not solve this deduplication problem now, and to just admit to using one data copy per file ID as is actually implemented. If we cram executability into the ID somehow, and use the ETag provided by S3 in the response header for integrity checking, it might be possible to not have a metadata file at all.
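To make the first bullet concrete: S3 computes ETags from MD5 (plain MD5 for single-part uploads, an MD5 of the per-part MD5s plus a part count for multipart uploads), so deliberately colliding inputs are practical. A sketch of the computation, assuming unencrypted uploads and a fixed part size (my code, not from this PR):

```python
import hashlib

def s3_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    # Single-part uploads get the plain hex MD5 of the content.
    if len(data) <= part_size:
        return hashlib.md5(data).hexdigest()
    # Multipart uploads get the MD5 of the concatenated per-part MD5
    # digests, suffixed with the number of parts.
    digests = b"".join(
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    )
    part_count = -(-len(data) // part_size)  # ceiling division
    return f"{hashlib.md5(digests).hexdigest()}-{part_count}"
```

Either form is MD5-based, which is why ETag equality can't be trusted as a content-addressing key.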
-    'src/toil/utils/toilStats.py'
+    'src/toil/utils/toilStats.py',
+    'src/toil/server/utils.py',
+    'src/toil/jobStores/aws/jobStore.py',
+    'src/toil/jobStores/exceptions.py',
+    'src/toil/lib/aws/config.py',
+    'src/toil/lib/aws/s3.py',
+    'src/toil/lib/retry.py',
+    'src/toil/lib/pipes.py',
+    'src/toil/lib/checksum.py',
+    'src/toil/lib/conversions.py',
+    'src/toil/lib/iterables.py'
Can we get away with not turning off type checking for all these files? In particular, iterables.py in my Toil looks like it already has type checking all set up, and I don't see this PR adding to it, so why is it here? Why are the other ones here?
Note: This is a small json containing only: etag checksum & executability.
The reference files act like unique keys in a database, referencing the original content.
This deduplicates data on S3 if 2+ inputs/outputs have the same content.
3. Logs: The written log files of jobs that have run, plus the log file for the main Toil process. (check this!)
Has this been checked?
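For concreteness, the reference file described in the quoted docs would presumably be on the order of this (field names are my guess, not lifted from the PR):

```python
import json

# Hypothetical payload of a reference (metadata) object: just the data
# object's ETag checksum and the executable bit.
reference_blob = json.dumps(
    {"etag": "9e107d9d372bb6826bd81d3542a419d6", "executable": False}
).encode()
```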
* environment.pickle (environment variables)
* config.pickle (user options)
* pid.log (process ID of the workflow; when it finishes, the workflow either succeeded/failed)
* userScript (hot deployment(?); this looks like the job module; poke this)
Yeah, this is the user module for hot-deploying Toil Python workflows.
* TODO: are there any others? do either vg or cactus use this? should these have locks and when are they accessed? are these only written once, but read many times?
The shared file API is indeed user-facing, so we need to support it for Toil Python workflows. But it's an error for the workflow to access them concurrently; there's an exception type we can use if we detect that.
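For reference, the exception mentioned is presumably ConcurrentFileModificationException from toil.jobStores.abstractJobStore (assuming that import path still holds on this branch); a minimal sketch of using it:

```python
from toil.jobStores.abstractJobStore import ConcurrentFileModificationException

def check_unmodified(file_id: str, expected_etag: str, observed_etag: str) -> None:
    # Hypothetical helper: if a shared file's ETag changed between our
    # read and our write, another writer got there first, which the
    # shared file API treats as a workflow error.
    if expected_etag != observed_etag:
        raise ConcurrentFileModificationException(file_id)
```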
The reference files act like unique keys in a database, referencing the original content.
This deduplicates data on s3 if 2+ inputs/outputs have the same content.
We should explain here how deletions are handled alongside deduplication. We need something to prevent one writer seeing a file's data is already there and deciding not to write it again, while a deleter thinks it has deleted the last reference to the data and cleans it up, leaving a dangling reference. It's not immediately obvious from this how we intend that to work.
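To spell out the hazard: with naive bookkeeping like the sketch below (under an assumed data/&lt;etag&gt; plus refs/&lt;etag&gt;/&lt;fileID&gt; layout, not code from this PR), there is an unguarded window between checking for existing data and recording the new reference, and a symmetric one on the delete side:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def add_reference(bucket: str, etag: str, file_id: str, content: bytes) -> None:
    try:
        # Writer sees the data object already exists...
        s3.head_object(Bucket=bucket, Key=f"data/{etag}")
    except ClientError:
        s3.put_object(Bucket=bucket, Key=f"data/{etag}", Body=content)
    # ...but a deleter may remove it before this marker lands,
    # leaving a dangling reference.
    s3.put_object(Bucket=bucket, Key=f"refs/{etag}/{file_id}", Body=b"")

def delete_reference(bucket: str, etag: str, file_id: str) -> None:
    s3.delete_object(Bucket=bucket, Key=f"refs/{etag}/{file_id}")
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=f"refs/{etag}/")
    if listing.get("KeyCount", 0) == 0:
        # Racy: a writer may have just observed data/<etag> as present
        # and decided not to re-upload it.
        s3.delete_object(Bucket=bucket, Key=f"data/{etag}")
```

Whatever protocol the docs end up describing needs to close both windows, or accept and document re-uploading on a lost race.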
else:
    pass
Probably we want to settle on whether this function or the caller is responsible for doing the fallback copy. And if this function isn't going to be responsible, it should do something to make sure you can't call it with encryption on. Either don't take the extra args or throw if they specify encryption. But probably we don't want to just bail and leave the file un-copied.
# Note: this may have errors if using sse-c because of
# a bug with encryption using copy_object and copy (which uses copy_object for files <5GB):
# https://github.com/aws/aws-cli/issues/6012
# this will only happen if we attempt to copy a file previously encrypted with sse-c
# copying an unencrypted file and encrypting it as sse-c seems to work fine though
The issue linked appears to apply only to the AWS CLI, and not the API, and it also only describes adding encryption, not removing or changing it. Does it really not work if we do it through the API ourselves? (Or would we need to reach in and fiddle with headers manually?)
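For what it's worth, the S3 CopyObject API takes separate SSE-C parameters for the copy source and the destination, so an SSE-C-to-SSE-C server-side copy should be expressible directly through boto3 without header fiddling. Untested sketch (the linked issue is filed against aws-cli, not the API):

```python
import boto3

s3 = boto3.client("s3")

def copy_sse_c(bucket: str, src: str, dst: str, src_key: bytes, dst_key: bytes) -> None:
    s3.copy_object(
        Bucket=bucket,
        Key=dst,
        CopySource={"Bucket": bucket, "Key": src},
        # Key the source object is currently encrypted with.
        CopySourceSSECustomerAlgorithm="AES256",
        CopySourceSSECustomerKey=src_key,
        # Key to encrypt the destination object with.
        SSECustomerAlgorithm="AES256",
        SSECustomerKey=dst_key,
    )
```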
@@ -346,7 +346,7 @@ def bucket_location_to_region(location: Optional[str]) -> str:
     return "us-east-1" if location == "" or location is None else location

-def get_object_for_url(url: ParseResult, existing: Optional[bool] = None) -> "S3Object":
+def get_object_for_url(url: Union[ParseResult, str], existing: Optional[bool] = None) -> "S3Object":
I feel like we should have stronger opinions on which of these we ought to have. It's kind of convenient for the caller to just be able to throw whatever in here, but remembering that you can do that is harder than remembering that you need one type, I think.
And we're not updating the docstring to say that the URL doesn't actually have to be parsed.
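If we do keep the Union, one way to stay canonical internally is to normalize at the boundary and pass ParseResult everywhere else; a sketch (the helper name is mine):

```python
from typing import Union
from urllib.parse import ParseResult, urlparse

def as_parse_result(url: Union[ParseResult, str]) -> ParseResult:
    # Accept a plain string at the API boundary but use one canonical
    # type internally.
    return url if isinstance(url, ParseResult) else urlparse(url)
```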
        return self.writable

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Closeing the writable end will send EOF to the readable and cause the reader thread
Suggested change:
-        # Closeing the writable end will send EOF to the readable and cause the reader thread
+        # Closing the writable end will send EOF to the readable and cause the reader thread
@@ -69,7 +69,6 @@ lint:
     - ${MAIN_PYTHON_PKG} -m virtualenv venv && . venv/bin/activate && make prepare && make develop extras=[all]
     - ${MAIN_PYTHON_PKG} -m pip freeze
     - ${MAIN_PYTHON_PKG} --version
-    - make mypy
We probably need to keep the mypy step.
Rewritten without SDB. Still a draft.