New AWS jobstore. #5123

Draft · wants to merge 3 commits into master
Conversation

DailyDreaming (Member)
Rewritten without SDB. Still a draft.

Changelog Entry

To be copied to the draft changelog by merger:

  • PR submitter writes their recommendation for a changelog entry here

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@adamnovak adamnovak (Member) left a comment
It looks like this probably works as written, but the documentation for how the S3-backed storage works (metadata objects named by file IDs referencing ETag hashes for content-addressed data objects many-to-one) doesn't match what's actually implemented (metadata objects and data objects both named by file ID).

It might be good to actually implement the described content-addressing design, but there are some big unsolved questions:

  • ETags on S3 are all MD5-based and not collision-resistant, so it's easy to get two actual files with the same MD5 and put them in the job store to break it.
  • Streaming uploads don't know their hash until they are uploaded, so where should they be uploaded to?
  • Without transactions, how do you maintain the consistency of many-to-one relationships between metadata objects and data objects, while still allowing data objects to be deleted when the last referencing metadata object goes away?

It might make sense to not solve this deduplication problem now, and to just admit to using one data copy per file ID as is actually implemented. If we cram executability into the ID somehow, and use the ETag provided by S3 in the response header for integrity checking, it might be possible to not have a metadata file at all.
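A minimal sketch of the metadata-free variant suggested above, assuming single-part uploads so that the ETag S3 returns is just the MD5 of the content (not true for multipart or SSE-KMS/SSE-C uploads); the bucket, key, and helper name here are hypothetical:

    import hashlib

    import boto3

    s3 = boto3.client("s3")

    def read_with_integrity_check(bucket: str, key: str) -> bytes:
        # Hypothetical helper: fetch an object and verify it against the ETag
        # in the response, assuming the ETag is the plain MD5 of the bytes.
        obj = s3.get_object(Bucket=bucket, Key=key)
        data = obj["Body"].read()
        expected = obj["ETag"].strip('"')  # the ETag header value is quoted
        actual = hashlib.md5(data).hexdigest()
        if actual != expected:
            raise RuntimeError(
                f"Checksum mismatch for s3://{bucket}/{key}: {actual} != {expected}"
            )
        return data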

Comment on lines -86 to +96
'src/toil/utils/toilStats.py'
'src/toil/utils/toilStats.py',
'src/toil/server/utils.py',
'src/toil/jobStores/aws/jobStore.py',
'src/toil/jobStores/exceptions.py',
'src/toil/lib/aws/config.py',
'src/toil/lib/aws/s3.py',
'src/toil/lib/retry.py',
'src/toil/lib/pipes.py',
'src/toil/lib/checksum.py',
'src/toil/lib/conversions.py',
'src/toil/lib/iterables.py'

Can we get away with not turning off type checking for all these files? In particular, iterables.py in my Toil looks like it already has type checking all set up, and I don't see this PR adding to it, so why is it here? Why are the other ones here?

Note: This is a small json containing only: etag checksum, & executibility.
The reference files act like unique keys in a database, referencing the original content.
This deduplicates data on s3 if 2+ inputs/outputs have the same content.
3. Logs: The written log files of jobs that have run, plus the log file for the main Toil process. (check this!)

Has this been checked?

* environment.pickle: (environment variables)
* config.pickle (user options)
* pid.log (process ID of the workflow; when it finishes, the workflow either succeeded/failed)
* userScript (hot deployment(?); this looks like the job module; poke this)

Yeah, this is the user module for hot-deploying Toil Python workflows.

Comment on lines +99 to +100
* TODO: are there any others? do either vg or cactus use this? should these have locks and when are they
accessed? are these only written once, but read many times?

The shared file API is indeed user-facing, so we need to support it for Toil Python workflows. But it's an error for a workflow to access shared files concurrently; there's an exception type we can use if we detect that.

Comment on lines +90 to +91
The reference files act like unique keys in a database, referencing the original content.
This deduplicates data on s3 if 2+ inputs/outputs have the same content.

We should explain here how deletions are handled alongside deduplication. We need something to prevent one writer seeing a file's data is already there and deciding not to write it again, while a deleter thinks it has deleted the last reference to the data and cleans it up, leaving a dangling reference. It's not immediately obvious from this how we intend that to work.
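To make the race concrete, here is a hypothetical interleaving; data_exists, delete_reference, delete_data, and write_reference are made-up names standing in for whatever operations the job store actually performs:

    # Writer: uploading a file whose content hash is H.
    if data_exists(H):               # sees the data object from an earlier upload
        pass                         # decides not to re-upload the bytes

    # Deleter (concurrently): removing the last file ID that referenced H.
    delete_reference(old_file_id)    # removes the last metadata object for H
    delete_data(H)                   # H now looks unreferenced, so the bytes go away

    # Writer, resuming:
    write_reference(new_file_id, H)  # now points at data that no longer exists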

Comment on lines +234 to +235
else:
pass

Probably we want to settle on whether this function or the caller is responsible for doing the fallback copy. And if this function isn't going to be responsible, it should do something to make sure you can't call it with encryption on. Either don't take the extra args or throw if they specify encryption. But probably we don't want to just bail and leave the file un-copied.
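If the caller ends up owning the fallback, one option is for this helper to refuse encrypted copies outright rather than silently skipping them; a sketch under that assumption, with hypothetical argument names:

    from typing import Optional

    def copy_s3_object(src_bucket: str, src_key: str,
                       dst_bucket: str, dst_key: str,
                       sse_key: Optional[bytes] = None) -> None:
        if sse_key is not None:
            # Fail loudly instead of leaving the file un-copied.
            raise NotImplementedError(
                "SSE-C copies are not handled here; the caller must stream "
                "the object through itself"
            )
        ...  # plain server-side copy for the unencrypted case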

Comment on lines +227 to +231
# Note: this may have errors if using sse-c because of
# a bug with encryption using copy_object and copy (which uses copy_object for files <5GB):
# https://github.com/aws/aws-cli/issues/6012
# this will only happen if we attempt to copy a file previously encrypted with sse-c
# copying an unencrypted file and encrypting it as sse-c seems to work fine though

The issue linked appears to apply only to the AWS CLI, and not the API, and also only describes adding encryption, not removing or changing it. Does it really not work if we do it through the API ourselves? (Or would we need to reach in and fiddle with headers manually?)
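For reference, the S3 CopyObject API does take source-side SSE-C parameters, so an API-level copy of an SSE-C object would look roughly like the sketch below (untested here; whether it actually avoids the problem described in that issue is the open question, and it still only applies to objects small enough for a single copy_object call):

    import boto3

    s3 = boto3.client("s3")

    def copy_sse_c_object(src_bucket: str, src_key: str,
                          dst_bucket: str, dst_key: str, sse_key: bytes) -> None:
        # sse_key is the customer-provided key the source object was written with;
        # boto3 base64-encodes it and fills in the key MD5 automatically.
        s3.copy_object(
            Bucket=dst_bucket,
            Key=dst_key,
            CopySource={"Bucket": src_bucket, "Key": src_key},
            # Decrypt the source with its customer-provided key...
            CopySourceSSECustomerAlgorithm="AES256",
            CopySourceSSECustomerKey=sse_key,
            # ...and re-encrypt the destination with the same key.
            SSECustomerAlgorithm="AES256",
            SSECustomerKey=sse_key,
        )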

@@ -346,7 +346,7 @@ def bucket_location_to_region(location: Optional[str]) -> str:
return "us-east-1" if location == "" or location is None else location


def get_object_for_url(url: ParseResult, existing: Optional[bool] = None) -> "S3Object":
def get_object_for_url(url: Union[ParseResult, str], existing: Optional[bool] = None) -> "S3Object":

I feel like we should have stronger opinions on which of these we ought to have. It's kind of convenient for the caller to just be able to throw whatever in here, but remembering that you can do that is harder than remembering that you need one type, I think.

And we're not updating the docstring to say that the URL doesn't actually have to be parsed.
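If we do settle on a single type, one option is to keep the parameter as ParseResult and have string-holding callers parse at the boundary; a sketch of what the call site would look like under that choice (the bucket and key here are made up):

    from urllib.parse import urlparse

    # The caller parses its string once, so get_object_for_url keeps a single,
    # unambiguous parameter type.
    obj = get_object_for_url(urlparse("s3://some-bucket/some/key"), existing=True)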

return self.writable

def __exit__(self, exc_type, exc_val, exc_tb):
# Closeing the writable end will send EOF to the readable and cause the reader thread

Suggested change
# Closeing the writable end will send EOF to the readable and cause the reader thread
# Closing the writable end will send EOF to the readable and cause the reader thread

@@ -69,7 +69,6 @@ lint:
- ${MAIN_PYTHON_PKG} -m virtualenv venv && . venv/bin/activate && make prepare && make develop extras=[all]
- ${MAIN_PYTHON_PKG} -m pip freeze
- ${MAIN_PYTHON_PKG} --version
- make mypy

We probably need to keep the mypy step.
