New AWS jobstore. #5123

Draft · wants to merge 3 commits into master
Conversation

DailyDreaming (Member)
Rewritten without SDB. Still a draft.

Changelog Entry

To be copied to the draft changelog by merger:

  • PR submitter writes their recommendation for a changelog entry here

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@adamnovak adamnovak (Member) left a comment
It looks like this probably works as written, but the documentation for how the S3-backed storage works (metadata objects named by file IDs referencing ETag hashes for content-addressed data objects many-to-one) doesn't match what's actually implemented (metadata objects and data objects both named by file ID).

It might be good to actually implement the described content-addressing design, but there are some big unsolved questions:

  • ETags on S3 are all MD5-based and not collision-resistant, so it's easy to get two actual files with the same MD5 and put them in the job store to break it.
  • Streaming uploads don't know their hash until they are uploaded, so where should they be uploaded to?
  • Without transactions, how do you maintain the consistency of many-to-one relationships between metadata objects and data objects, while still allowing data objects to be deleted when the last referencing metadata object goes away?

It might make sense to not solve this deduplication problem now, and to just admit to using one data copy per file ID as is actually implemented. If we cram executability into the ID somehow, and use the ETag provided by S3 in the response header for integrity checking, it might be possible to not have a metadata file at all.
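A minimal sketch of the metadata-free variant suggested above, assuming single-part uploads so that the ETag S3 returns is just the MD5 of the content (not true for multipart or SSE-KMS/SSE-C uploads); the bucket, key, and helper name here are hypothetical:

    import hashlib

    import boto3

    s3 = boto3.client("s3")

    def read_with_integrity_check(bucket: str, key: str) -> bytes:
        # Hypothetical helper: fetch an object and verify it against the ETag
        # in the response, assuming the ETag is the plain MD5 of the bytes.
        obj = s3.get_object(Bucket=bucket, Key=key)
        data = obj["Body"].read()
        expected = obj["ETag"].strip('"')  # the ETag header value is quoted
        actual = hashlib.md5(data).hexdigest()
        if actual != expected:
            raise RuntimeError(
                f"Checksum mismatch for s3://{bucket}/{key}: {actual} != {expected}"
            )
        return data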

Comment on lines -86 to +96
'src/toil/utils/toilStats.py'
'src/toil/utils/toilStats.py',
'src/toil/server/utils.py',
'src/toil/jobStores/aws/jobStore.py',
'src/toil/jobStores/exceptions.py',
'src/toil/lib/aws/config.py',
'src/toil/lib/aws/s3.py',
'src/toil/lib/retry.py',
'src/toil/lib/pipes.py',
'src/toil/lib/checksum.py',
'src/toil/lib/conversions.py',
'src/toil/lib/iterables.py'

Can we get away with not turning off type checking for all these files? In particular, iterables.py in my Toil looks like it already has type checking all set up, and I don't see this PR adding to it, so why is it here? Why are the other ones here?

Note: This is a small json containing only: etag checksum, & executibility.
The reference files act like unique keys in a database, referencing the original content.
This deduplicates data on s3 if 2+ inputs/outputs have the same content.
3. Logs: The written log files of jobs that have run, plus the log file for the main Toil process. (check this!)

Has this been checked?

* environment.pickle: (environment variables)
* config.pickle (user options)
* pid.log (process ID of the workflow; when it finishes, the workflow either succeeded/failed)
* userScript (hot deployment(?); this looks like the job module; poke this)

Yeah, this is the user module for hot-deploying Toil Python workflows.

Comment on lines +99 to +100
* TODO: are there any others? do either vg or cactus use this? should these have locks and when are they
accessed? are these only written once, but read many times?

The shared file API is indeed user-facing, so we need to support it for Toil Python workflows. But it's an error for a workflow to access shared files concurrently; there's an exception type we can use if we detect that.

Comment on lines +90 to +91
The reference files act like unique keys in a database, referencing the original content.
This deduplicates data on s3 if 2+ inputs/outputs have the same content.

We should explain here how deletions are handled alongside deduplication. We need something to prevent one writer seeing a file's data is already there and deciding not to write it again, while a deleter thinks it has deleted the last reference to the data and cleans it up, leaving a dangling reference. It's not immediately obvious from this how we intend that to work.
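To make the race concrete, here is a hypothetical interleaving; data_exists, delete_reference, delete_data, and write_reference are made-up names standing in for whatever operations the job store actually performs:

    # Writer: uploading a file whose content hash is H.
    if data_exists(H):               # sees the data object from an earlier upload
        pass                         # decides not to re-upload the bytes

    # Deleter (concurrently): removing the last file ID that referenced H.
    delete_reference(old_file_id)    # removes the last metadata object for H
    delete_data(H)                   # H now looks unreferenced, so the bytes go away

    # Writer, resuming:
    write_reference(new_file_id, H)  # now points at data that no longer exists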

Comment on lines +234 to +235
else:
pass

Probably we want to settle on whether this function or the caller is responsible for doing the fallback copy. And if this function isn't going to be responsible, it should do something to make sure you can't call it with encryption on. Either don't take the extra args or throw if they specify encryption. But probably we don't want to just bail and leave the file un-copied.
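If the caller ends up owning the fallback, one option is for this helper to refuse encrypted copies outright rather than silently skipping them; a sketch under that assumption, with hypothetical argument names:

    from typing import Optional

    def copy_s3_object(src_bucket: str, src_key: str,
                       dst_bucket: str, dst_key: str,
                       sse_key: Optional[bytes] = None) -> None:
        if sse_key is not None:
            # Fail loudly instead of leaving the file un-copied.
            raise NotImplementedError(
                "SSE-C copies are not handled here; the caller must stream "
                "the object through itself"
            )
        ...  # plain server-side copy for the unencrypted case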

Comment on lines +227 to +231
# Note: this may have errors if using sse-c because of
# a bug with encryption using copy_object and copy (which uses copy_object for files <5GB):
# https://github.com/aws/aws-cli/issues/6012
# this will only happen if we attempt to copy a file previously encrypted with sse-c
# copying an unencrypted file and encrypting it as sse-c seems to work fine though

The issue linked appears to apply only to the AWS CLI, and not the API, and also only describes adding encryption, not removing or changing it. Does it really not work if we do it through the API ourselves? (Or would we need to reach in and fiddle with headers manually?)
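For reference, the S3 CopyObject API does take source-side SSE-C parameters, so an API-level copy of an SSE-C object would look roughly like the sketch below (untested here; whether it actually avoids the problem described in that issue is the open question, and it still only applies to objects small enough for a single copy_object call):

    import boto3

    s3 = boto3.client("s3")

    def copy_sse_c_object(src_bucket: str, src_key: str,
                          dst_bucket: str, dst_key: str, sse_key: bytes) -> None:
        # sse_key is the customer-provided key the source object was written with;
        # boto3 base64-encodes it and fills in the key MD5 automatically.
        s3.copy_object(
            Bucket=dst_bucket,
            Key=dst_key,
            CopySource={"Bucket": src_bucket, "Key": src_key},
            # Decrypt the source with its customer-provided key...
            CopySourceSSECustomerAlgorithm="AES256",
            CopySourceSSECustomerKey=sse_key,
            # ...and re-encrypt the destination with the same key.
            SSECustomerAlgorithm="AES256",
            SSECustomerKey=sse_key,
        )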

@@ -346,7 +346,7 @@ def bucket_location_to_region(location: Optional[str]) -> str:
return "us-east-1" if location == "" or location is None else location


def get_object_for_url(url: ParseResult, existing: Optional[bool] = None) -> "S3Object":
def get_object_for_url(url: Union[ParseResult, str], existing: Optional[bool] = None) -> "S3Object":

I feel like we should have stronger opinions on which of these we ought to have. It's kind of convenient for the caller to just be able to throw whatever in here, but remembering that you can do that is harder than remembering that you need one type, I think.

And we're not updating the docstring to say that the URL doesn't actually have to be parsed.
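If we do settle on a single type, one option is to keep the parameter as ParseResult and have string-holding callers parse at the boundary; a sketch of what the call site would look like under that choice (the bucket and key here are made up):

    from urllib.parse import urlparse

    # The caller parses its string once, so get_object_for_url keeps a single,
    # unambiguous parameter type.
    obj = get_object_for_url(urlparse("s3://some-bucket/some/key"), existing=True)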

return self.writable

def __exit__(self, exc_type, exc_val, exc_tb):
# Closeing the writable end will send EOF to the readable and cause the reader thread

Suggested change
# Closeing the writable end will send EOF to the readable and cause the reader thread
# Closing the writable end will send EOF to the readable and cause the reader thread

@@ -69,7 +69,6 @@ lint:
- ${MAIN_PYTHON_PKG} -m virtualenv venv && . venv/bin/activate && make prepare && make develop extras=[all]
- ${MAIN_PYTHON_PKG} -m pip freeze
- ${MAIN_PYTHON_PKG} --version
- make mypy

We probably need to keep the mypy step.
