
Conversation

@frcroth
Contributor

@frcroth frcroth commented Sep 15, 2025

  • Enables dataset deletion for virtual datasets with data stored on our managed S3.
  • For non-virtual datasets, only the directory in the binaryData directory is deleted.
  • If other datasets have realpaths in that directory, deletion is blocked. (Previously there was move-and-resymlink logic; this has been removed. See the query sketch below.)
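Conceptually, the blocking check boils down to a query like the following sketch, which mirrors the DAO query discussed in the review below (simplified; the exact code and its COALESCE handling are debated there):

```scala
// Simplified sketch (not the exact DAO code): does any *other* dataset have a
// mag whose resolved real path lies inside the directory about to be deleted?
def findDatasetsWithMagsInDirSketch(datasetId: ObjectId, absolutePathWithTrailingSlash: String) =
  run(q"""
      SELECT d._id
      FROM webknossos.dataset_mags m
      JOIN webknossos.datasets d ON m._dataset = d._id
      WHERE m.realpath IS NOT NULL
      AND starts_with(m.realpath, $absolutePathWithTrailingSlash)
      AND d._id != $datasetId
     """.as[ObjectId])
```

If this returns any rows, the deletion is blocked.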

Steps to test:

  • Delete virtual datasets, should run through. Other datasets referencing the data should still have it afterwards.
  • Delete non-virtual datasets, should run through if paths are unique.
    • Other datasets symlinking into this one should block the deletion
  • Enable S3 upload with credentials, datasets should be deletable afterwards.
  • Data on S3 buckets not listed with global credentials should never be deleted.

TODOs:

  • on datastore side, delete local and managedS3 paths
  • delete virtual datasets (paths used only by them)
  • delete non-virtual datasets (unless any other dataset’s paths point in there)
  • Allow manually deleting “Not yet fully uploaded” datasets (be it via publish or via upload)
  • Re-test delete as cleanup on failed dataset uploads
  • Re-test delete with S3 paths
  • cleanup

Issues:

  • Fixes #8803

------
  • Added changelog entry (create a $PR_NUMBER.md file in unreleased_changes or use ./tools/create-changelog-entry.py)
  • Removed dev-only changes like prints and application.conf edits
  • Considered common edge cases
  • Needs datastore update after deployment

@coderabbitai
Contributor

coderabbitai bot commented Sep 15, 2025

📝 Walkthrough

Walkthrough

Centralizes dataset deletion into a new DatasetService.deleteDataset flow that deletes data from disk and managed S3, adds ManagedS3Service and S3 utilities, refactors path-centric DAOs/clients/routes (including DELETE-with-JSON), updates controllers/routes/upload and datastore services, and tightens UPath behavior and tests.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| **Dataset controller & routes**<br>`app/controllers/DatasetController.scala`, `conf/webknossos.latest.routes`, `frontend/javascripts/admin/rest_api.ts` | Renamed `deleteOnDisk` → `delete` and the route to `/datasets/:datasetId`; added `finishUploadToPaths`; frontend DELETE now targets `/api/datasets/${datasetId}`; controller adds logging, Slack instrumentation and calls `datasetService.deleteDataset`. |
| **Dataset service & DAOs**<br>`app/models/dataset/DatasetService.scala`, `app/models/dataset/Dataset.scala` | Replaced `deleteVirtualOrDiskDataset` with `deleteDataset`; added dependencies (`DatasetLayerAttachmentsDAO`, `AnnotationDAO`, `UsedStorageService`); added path-based DAO methods and helpers `findDatasetsUsingDataFromDir` / `deleteDatasetFromDB`; updated mags/attachments handling. |
| **WK remote datastore client & controller**<br>`app/controllers/WKRemoteDataStoreController.scala`, `app/models/dataset/WKRemoteDataStoreClient.scala`, `webknossos-datastore/conf/datastore.latest.routes`, `webknossos-datastore/app/.../controllers/DataSourceController.scala` | Removed `getPaths` and `AnnotationDAO` injection; added client `getBaseDirAbsolute` and DELETE-with-JSON support; added controller endpoints `baseDirAbsolute` and `deletePaths`; controller delegates path deletion to `DataSourceService.deletePathsFromDiskOrManagedS3` and simplifies the delete-on-disk flow. |
| **Managed S3 & S3 utilities**<br>`webknossos-datastore/app/.../services/ManagedS3Service.scala`, `webknossos-datastore/app/.../helpers/S3UriUtils.scala`, `webknossos-datastore/app/.../datavault/S3DataVault.scala` | Introduced `ManagedS3Service` for grouped/batched deletions and managed-S3 checks; added `S3UriUtils` for S3 URI parsing; refactored `S3DataVault` internals to use `S3UriUtils`. |
| **Datastore RPC & clients**<br>`webknossos-datastore/app/.../rpc/RPCRequest.scala`, `webknossos-datastore/app/.../services/DSRemoteWebknossosClient.scala` | Added `RPCRequest.deleteJson[T]` (DELETE with JSON body); removed `fetchPaths` from `DSRemoteWebknossosClient`. |
| **Dataset deletion helper & UPath changes**<br>`webknossos-datastore/app/.../helpers/DatasetDeleter.scala`, `webknossos-datastore/app/.../helpers/UPath.scala` | Simplified the `DatasetDeleter` trait (removed mixin); changed the `deleteOnDisk` signature to take `datasetId` first and implement move-to-trash with retry; narrowed the `RemoteUPath.parent` type and made `startsWith` compare path semantics. |
| **DataSource service & upload**<br>`webknossos-datastore/app/.../services/DataSourceService.scala`, `webknossos-datastore/app/.../services/uploading/UploadService.scala`, `webknossos-datastore/app/.../services/DataSourceToDiskWriter.scala` | `DataSourceService` gains `managedS3Service`, `deletePathsFromDiskOrManagedS3`, and `existsOnDisk`; `UploadService` now uses `ManagedS3Service` and removes legacy S3 client logic; `DataSourceToDiskWriter` no longer mixes in `PathUtils`. |
| **Storage service & utilities**<br>`app/models/storage/UsedStorageService.scala`, `util/src/main/scala/com/scalableminds/util/io/PathUtils.scala`, `util/src/main/scala/com/scalableminds/util/tools/Fox.scala` | `UsedStorageService` removed its `DatasetService` dependency and now resolves datastores via DAO; `PathUtils` converted from trait to object; removed redundant `Fox` imports. |
| **Datastore DI & routes**<br>`webknossos-datastore/app/.../DataStoreModule.scala`, `webknossos-datastore/conf/datastore.latest.routes` | Added eager singleton binding for `ManagedS3Service`; added GET `baseDirAbsolute` and DELETE `deletePaths` routes; removed the previous `invalidateCache` route. |
| **Tests & release notes**<br>`test/backend/UPathTestSuite.scala`, `unreleased_changes/8924.md`, `frontend/javascripts/test/backend-snapshot-tests/datasets.e2e.ts` | Strengthened `UPath.startsWith` tests; added release note about managed-S3 deletions and symlink-blocking behavior; removed the dataset paths E2E test. |
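One detail behind the "grouped/batched deletions" in ManagedS3Service: S3's DeleteObjects API accepts at most 1000 keys per request, so deleting a dataset prefix means listing keys with pagination and deleting them in chunks per bucket. A minimal sketch with the AWS SDK v2 (illustrative only; the actual service additionally handles credentials, managed-S3 checks, and Fox error handling):

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{Delete, DeleteObjectsRequest, ListObjectsV2Request, ObjectIdentifier}
import scala.jdk.CollectionConverters._

// List all keys under a prefix (paginated) and delete them in batches of at
// most 1000 keys, which is the DeleteObjects limit.
def deletePrefix(client: S3Client, bucket: String, prefix: String): Unit = {
  val listRequest = ListObjectsV2Request.builder().bucket(bucket).prefix(prefix).build()
  val keys = client.listObjectsV2Paginator(listRequest).contents().asScala.map(_.key()).toSeq
  keys.grouped(1000).foreach { batch =>
    val toDelete = Delete.builder()
      .objects(batch.map(k => ObjectIdentifier.builder().key(k).build()).asJava)
      .build()
    client.deleteObjects(DeleteObjectsRequest.builder().bucket(bucket).delete(toDelete).build())
  }
}
```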

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Extra attention:
    • DatasetService.deleteDataset / deleteDatasetFromDB: ordering of datastore path deletion vs DB deletion, shared-directory detection, and annotation handling.
    • ManagedS3Service: credential parsing, S3 client/transfer manager lifecycle, pagination and batched delete correctness.
    • DAO SQL/query changes (DatasetMagsDAO / DatasetLayerAttachmentsDAO) for correct path discovery and deduplication.
    • DatasetDeleter.deleteOnDisk retry semantics and interaction with PathUtils move-to-trash.
    • RPCRequest.deleteJson and clients/controllers relying on DELETE-with-body (see the wire-level sketch after this list).
    • DI wiring for new constructor params and eager binding.
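On the DELETE-with-body point: HTTP permits a request body on DELETE, but many HTTP clients refuse to send one, which is why an explicit deleteJson helper is needed at all. A wire-level illustration in plain Scala over the JDK 11+ HttpClient (not the actual RPCRequest code; the URL and payload are made up):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// The generic method(...) builder allows a body on DELETE, unlike the
// dedicated DELETE() shortcut of many HTTP clients.
val client = HttpClient.newHttpClient()
val request = HttpRequest
  .newBuilder(URI.create("https://datastore.example.org/data/datasets/deletePaths")) // hypothetical URL
  .header("Content-Type", "application/json")
  .method("DELETE", HttpRequest.BodyPublishers.ofString("""{"paths":["s3://bucket/dataset/mag1"]}"""))
  .build()
val response = client.send(request, HttpResponse.BodyHandlers.ofString())
println(response.statusCode())
```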

Possibly related PRs

Suggested reviewers

  • MichaelBuessemeyer

Poem

🐰 I hopped through buckets, keys, and trash,
Nudged tangled symlinks into a stash.
With retries, logs, and a nimble paw,
I cleared disk and S3 without a flaw.
A carrot cheer for tidy data law!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title "Delete datasets on S3" directly and clearly describes the main change in this pull request. The changeset introduces infrastructure to enable dataset deletion from managed S3 storage, which is precisely what the title conveys. The title is concise, specific, and avoids vague terminology, making it immediately clear to a developer scanning the history what the primary objective is. |
| Linked Issues Check | ✅ Passed | The code changes address the primary objective from issue #8803: enabling the "Delete on disk" action to also remove data from the project's managed S3 storage. Key implementations include the new ManagedS3Service with pathIsInManagedS3 validation to restrict deletion to managed S3 only, updated DatasetService.deleteDataset orchestrating virtual dataset deletion, DataSourceService.deletePathsFromDiskOrManagedS3 partitioning paths between local and S3 deletions, and new DAO methods (findMagPathsUsedOnlyByThisDataset, findDatasetsWithMagsInDir, etc.) to support conflict checking for non-virtual datasets. The route changes and endpoint additions properly expose the new functionality. |
| Out of Scope Changes Check | ✅ Passed | The changes are cohesive and support the PR objective of enabling S3 deletion with proper safeguards. Key in-scope modifications include new S3 utilities (S3UriUtils), ManagedS3Service for managed S3 interaction, updated DatasetDeleter to work with the new deletion approach, DAO methods for conflict detection, and endpoint changes. The removal of legacy getPaths endpoints and move-and-resymlink logic aligns with the intentional refactoring described in the PR. Minor refactorings such as the PathUtils trait-to-object conversion appear necessary for integration with the new deletion infrastructure. All changes directly support or enable the S3 deletion feature. |
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch s3-delete

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@frcroth frcroth assigned frcroth and unassigned frcroth Sep 17, 2025
@fm3 fm3 self-assigned this Sep 23, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (7)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1)

48-48: Log the actual deduplicated path.

Line 48 logs targetPath, but when retryCount > 0, the actual destination is deduplicatedTargetPath. Log the path that was actually used.

Apply this diff:

```diff
-        logger.info(s"Successfully moved dataset from $sourcePath to $targetPath.")
+        logger.info(s"Successfully moved dataset from $sourcePath to $deduplicatedTargetPath.")
```
app/models/dataset/Dataset.scala (1)

871-890: Critical: Query returns wrong paths for non-virtual datasets

The query selects only m1.path, but for non-virtual datasets the physical path to delete is stored in m1.realPath. This means:

  • Non-virtual datasets would return logical paths instead of physical filesystem paths
  • Deletion would target the wrong locations
  • Could cause data loss or orphaned data

A past reviewer flagged this exact issue and suggested using COALESCE(m1.realPath, m1.path).

Based on learnings

Apply this diff to select the actual physical path:

```diff
-      pathsStrOpts <- run(q"""
-           SELECT m1.path FROM webknossos.dataset_mags m1
-           WHERE m1._dataset = $datasetId
-           AND m1.path IS NOT NULL
+      pathsStrOpts <- run(q"""
+           SELECT COALESCE(m1.realPath, m1.path) AS path
+           FROM webknossos.dataset_mags m1
+           WHERE m1._dataset = $datasetId
+           AND COALESCE(m1.realPath, m1.path) IS NOT NULL
            AND NOT EXISTS (
               SELECT m2.path
               FROM webknossos.dataset_mags m2
               WHERE m2._dataset != $datasetId
               AND (
-                m2.path = m1.path
+                m2.path = COALESCE(m1.realPath, m1.path)
                 OR
-                (
-                  m2.realpath IS NOT NULL AND m2.realpath = m1.realpath
-                )
+                m2.realpath = COALESCE(m1.realPath, m1.path)
               )
            )
               """.as[Option[String]])
```
app/models/dataset/DatasetService.scala (3)

548-557: Good fix: sequence deletion and refresh; errors now propagate.

Replacing _ = ... with <- for both deleteDataset and refreshStorageReport correctly sequences the Fox chain.


511-523: Critical: “used-only-by-this-dataset” can miss shared data (NULL realpath/symlink).

Current DAO logic for mags uses realpath equality that fails with NULL and can delete paths still used by other datasets. Use COALESCE to align with index semantics, or conservatively block when NULLs exist. Also consider validating paths before delete.

Suggested DAO fix (app/models/dataset/Dataset.scala):

```diff
- AND NOT EXISTS (
-    SELECT m2.path
-    FROM webknossos.dataset_mags m2
-    WHERE m2._dataset != $datasetId
-    AND (
-      m2.path = m1.path
-      OR (
-        m2.realpath IS NOT NULL AND m2.realpath = m1.realpath
-      )
-    )
- )
+ AND NOT EXISTS (
+    SELECT 1
+    FROM webknossos.dataset_mags m2
+    WHERE m2._dataset != $datasetId
+    AND COALESCE(m2.realpath, m2.path) = COALESCE(m1.realpath, m1.path)
+ )
```

Optional safety here:

```diff
-          pathsUsedOnlyByThisDataset = magPathsUsedOnlyByThisDataset ++ attachmentPathsUsedOnlyByThisDataset
+          pathsUsedOnlyByThisDataset = magPathsUsedOnlyByThisDataset ++ attachmentPathsUsedOnlyByThisDataset
+          // extra safety: validate only-local/managed-S3 and existence
+          _ <- validatePaths(pathsUsedOnlyByThisDataset, datastore)
```

Run to confirm DAOs use COALESCE:

```bash
#!/bin/bash
rg -n -C2 'findMagPathsUsedOnlyByThisDataset|COALESCE\(m2\.realpath, m2\.path\)' app/models/dataset/Dataset.scala
```

525-533: Critical: Deletion guard misses datasets with NULL realpath; use COALESCE in DAO.

findDatasetsUsingDataFromDir relies on DAO query filtering m.realpath IS NOT NULL and starts_with(m.realpath,...). Legacy NULL realpaths won’t be seen → risk of deleting shared data.

Suggested DAO fix (app/models/dataset/Dataset.scala):

```diff
- WHERE m.realpath IS NOT NULL
- AND starts_with(m.realpath, $absolutePathWithTrailingSlash)
+ WHERE starts_with(COALESCE(m.realpath, m.path), $absolutePathWithTrailingSlash)
```

Optional: add a conservative pre-check to abort if any mags under directory have realpath IS NULL until a backfill/migration makes realpath non-NULL.

Also review attachments symmetry: attachments guard uses a.path only; if attachments ever gain realpath, mirror COALESCE there too.

```bash
#!/bin/bash
rg -n -C3 'findDatasetsWithMagsInDir|starts_with\(' app/models/dataset/Dataset.scala
rg -n 'COALESCE\(m\.realpath, m\.path\)' app/models/dataset/Dataset.scala
```
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (1)

9-22: Harden S3 URI parsing: null-safety, regional endpoints, and key normalization.

  • Avoid NPEs on uri.getHost.
  • Support regional virtual-hosted endpoints (e.g., bucket.s3.us-west-2.amazonaws.com).
  • Normalize keys to avoid leading “/” for AWS S3.
  • Make short-style detection scheme-aware (s3:// only).
  • Improve non-AWS host detection.

Apply this diff to the existing functions:

```diff
@@
-  def hostBucketFromUri(uri: URI): Option[String] = {
-    val host = uri.getHost
-    if (host == null) {
-      None
-    } else if (isShortStyle(uri)) { // assume host is omitted from uri, shortcut form s3://bucket/key
-      Some(host)
-    } else if (isVirtualHostedStyle(uri)) {
-      Some(host.substring(0, host.length - ".s3.amazonaws.com".length))
-    } else if (isPathStyle(uri)) {
-      Some(uri.getPath.substring(1).split("/")(0))
-    } else {
-      None
-    }
-  }
+  def hostBucketFromUri(uri: URI): Option[String] = {
+    val hostO = hostOpt(uri)
+    if (isShortStyle(uri)) {
+      hostO
+    } else if (isVirtualHostedStyle(uri)) {
+      hostO.flatMap(bucketFromVirtualHost)
+    } else if (isPathStyle(uri)) {
+      val path = Option(uri.getPath).getOrElse("")
+      path.stripPrefix("/").split("/", 2).headOption.filter(_.nonEmpty)
+    } else None
+  }
@@
-  // https://bucket-name.s3.region-code.amazonaws.com/key-name
-  private def isVirtualHostedStyle(uri: URI): Boolean =
-    uri.getHost.endsWith(".s3.amazonaws.com")
+  // https://bucket-name.s3.amazonaws.com or https://bucket-name.s3.us-west-2.amazonaws.com
+  private def isVirtualHostedStyle(uri: URI): Boolean =
+    hostOpt(uri).exists(h => VirtualHostedPattern.pattern.matcher(h).matches)
@@
-  // https://s3.region-code.amazonaws.com/bucket-name/key-name
-  private def isPathStyle(uri: URI): Boolean =
-    uri.getHost.matches("s3(.[\\w\\-_]+)?.amazonaws.com") ||
-      (!uri.getHost.contains("amazonaws.com") && uri.getHost.contains("."))
+  // https://s3.amazonaws.com/bucket/key, https://s3.us-west-2.amazonaws.com/bucket/key
+  // or non-AWS S3-compatible hosts (MinIO, Ceph, etc.)
+  private def isPathStyle(uri: URI): Boolean =
+    hostOpt(uri).exists { h =>
+      h.matches("^s3([.-][\\w\\-_]+)?\\.amazonaws\\.com$") ||
+      (!h.contains("amazonaws.com") && h.contains("."))
+    }
@@
-  // S3://bucket-name/key-name
-  private def isShortStyle(uri: URI): Boolean =
-    !uri.getHost.contains(".")
+  // s3://bucket-name/key-name  (short style)
+  private def isShortStyle(uri: URI): Boolean =
+    "s3".equalsIgnoreCase(uri.getScheme) && hostOpt(uri).exists(h => !h.contains("."))
@@
-  def objectKeyFromUri(uri: URI): Box[String] =
-    if (isVirtualHostedStyle(uri)) {
-      Full(uri.getPath)
-    } else if (isPathStyle(uri)) {
-      Full(uri.getPath.substring(1).split("/").tail.mkString("/"))
-    } else if (isShortStyle(uri)) {
-      Full(uri.getPath.tail)
-    } else Failure(s"Not a valid s3 uri: $uri")
+  def objectKeyFromUri(uri: URI): Box[String] = {
+    val path = Option(uri.getPath).getOrElse("")
+    if (isVirtualHostedStyle(uri) || isShortStyle(uri)) {
+      Full(path.stripPrefix("/"))
+    } else if (isPathStyle(uri)) {
+      val parts = path.stripPrefix("/").split("/", 2)
+      if (parts.length == 2) Full(parts(1)) else Failure(s"No object key in $uri")
+    } else Failure(s"Not a valid s3 uri: $uri")
+  }
@@
-  def isNonAmazonHost(uri: URI): Boolean =
-    (isPathStyle(uri) && !uri.getHost.endsWith(".amazonaws.com")) || uri.getHost == "localhost"
+  def isNonAmazonHost(uri: URI): Boolean =
+    hostOpt(uri).exists(h => (isPathStyle(uri) && !h.endsWith(".amazonaws.com")) || h == "localhost")
```

Add these helpers (append within the object):

```scala
// e.g. bucket.s3.amazonaws.com or bucket.s3.us-west-2.amazonaws.com
private val VirtualHostedPattern = """^(.+?)\.s3([.-][\w-]+)?\.amazonaws\.com$""".r

private def hostOpt(uri: URI): Option[String] =
  Option(uri.getHost).orElse(Option(uri.getAuthority)).map(_.trim).filter(_.nonEmpty)

private def bucketFromVirtualHost(host: String): Option[String] = host match {
  case VirtualHostedPattern(bucket, _) => Some(bucket)
  case _                               => None
}
```

This fixes failures on regional endpoints and ensures keys do not carry a leading slash, which is critical for correct List/Delete behavior. Based on learnings.

Also applies to: 27-39, 40-50

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1)

39-55: Don’t derive bucket/endpoint from credentialName; use the matched S3 credential (name/endpoint/region).

Assuming credentialName is a logical identifier, parsing it as a URI yields wrong bucket and an invalid endpoint (e.g., https://bucket). Derive from the selected S3AccessKeyCredential instead; use its optional endpoint/region, and parse name only if it is actually a URI.

Apply this refactor:

```diff
@@
-  lazy val s3UploadBucketOpt: Option[String] =
-    // by convention, the credentialName is the S3 URI so we can extract the bucket from it.
-    S3UriUtils.hostBucketFromUri(new URI(dataStoreConfig.Datastore.S3Upload.credentialName))
+  private lazy val s3UploadCredentialOpt: Option[S3AccessKeyCredential] =
+    dataStoreConfig.Datastore.DataVaults.credentials
+      .flatMap(new CredentialConfigReader(_).getCredential)
+      .collectFirst {
+        case c @ S3AccessKeyCredential(name, _, _, _, _)
+            if dataStoreConfig.Datastore.S3Upload.credentialName == name => c
+      }
+
+  private lazy val s3UploadBaseUriOpt: Option[URI] =
+    s3UploadCredentialOpt.flatMap { c =>
+      // name might be a UPath/URI root like s3://bucket/prefix or https://minio:9000/bucket
+      tryo(UPath.fromString(c.name).toRemoteUriUnsafe).toOption
+        .orElse(tryo(new URI(c.name)).toOption)
+    }
+
+  lazy val s3UploadBucketOpt: Option[String] =
+    s3UploadBaseUriOpt.flatMap(S3UriUtils.hostBucketFromUri)
@@
-  private lazy val s3UploadEndpoint: URI = {
-    // by convention, the credentialName is the S3 URI so we can extract the bucket from it.
-    val credentialUri = new URI(dataStoreConfig.Datastore.S3Upload.credentialName)
-    new URI(
-      "https",
-      null,
-      credentialUri.getHost,
-      -1,
-      null,
-      null,
-      null
-    )
-  }
+  private lazy val s3UploadEndpointOpt: Option[URI] =
+    s3UploadCredentialOpt.flatMap(_.endpoint.map(URI.create))
```

This prevents constructing invalid endpoints and decouples behavior from a brittle config assumption. Based on learnings.


If `credentialName` is indeed a URI in your deployments, please confirm and we can simplify accordingly.

🧹 Nitpick comments (7)

webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1)

`41-41`: **Extract hardcoded retry limit as a named constant.**

The retry limit `15` is hardcoded. Consider extracting it as a private constant (e.g., `private val MaxMoveDedupRetries = 15`) for clarity and maintainability.



```diff
+  private val MaxMoveDedupRetries = 15
+
   private def deleteWithRetry(sourcePath: Path, targetPath: Path, retryCount: Int = 0)(
       implicit ec: ExecutionContext): Fox[Unit] =
-    if (retryCount > 15) {
+    if (retryCount > MaxMoveDedupRetries) {
       Fox.failure(s"Deleting dataset failed: too many retries.")
```

app/models/dataset/Dataset.scala (2)

892-906: Consider checking both realpath and path for completeness

The method currently only checks m.realpath to find datasets with mags in a directory. While this is correct for non-virtual datasets (filesystem paths), you might want to also check m.path for safety, especially if virtual datasets could have paths under the same prefix.

If needed, apply this diff to check both fields:

```diff
     run(q"""
         SELECT d._id FROM webknossos.dataset_mags m
         JOIN webknossos.datasets d ON m._dataset = d._id
-        WHERE m.realpath IS NOT NULL
-        AND starts_with(m.realpath, $absolutePathWithTrailingSlash)
+        WHERE (
+          (m.realpath IS NOT NULL AND starts_with(m.realpath, $absolutePathWithTrailingSlash))
+          OR starts_with(m.path, $absolutePathWithTrailingSlash)
+        )
         AND d._id != $ignoredDataset
         AND d._datastore = ${dataStore.name.trim}
        """.as[ObjectId])
```

1293-1306: Add NULL check for consistency and safety

Similar to the mags version, this method should filter out NULL paths to avoid potential issues with the string prefix check.

Apply this diff to add a NULL check:

```diff
     run(q"""
         SELECT d._id FROM webknossos.dataset_layer_attachments a
         JOIN webknossos.datasets d ON a._dataset = d._id
-        WHERE starts_with(a.path, $absolutePathWithTrailingSlash)
+        WHERE a.path IS NOT NULL
+        AND starts_with(a.path, $absolutePathWithTrailingSlash)
         AND d._id != $ignoredDataset
         AND d._datastore = ${dataStore.name.trim}
        """.as[ObjectId])
```
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (1)

1-7: Add tests for core URI forms.

Cover:

  • short-style URIs (s3://bucket/key)
  • virtual-hosted-style URIs, including regional endpoints (bucket.s3.us-west-2.amazonaws.com)
  • path-style URIs (s3.amazonaws.com/bucket/key) and non-AWS S3-compatible hosts
  • object-key extraction without a leading slash

Do you want me to scaffold a ScalaTest suite for this?
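As a starting point, a minimal suite might look like this (a sketch assuming ScalaTest's AnyFlatSpec is available; adjust to the project's test conventions):

```scala
import java.net.URI
import com.scalableminds.util.tools.Full
import com.scalableminds.webknossos.datastore.helpers.S3UriUtils
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class S3UriUtilsTestSuite extends AnyFlatSpec with Matchers {
  "hostBucketFromUri" should "extract the bucket from short-style URIs" in {
    S3UriUtils.hostBucketFromUri(new URI("s3://my-bucket/some/key")) shouldBe Some("my-bucket")
  }
  it should "extract the bucket from virtual-hosted-style URIs" in {
    S3UriUtils.hostBucketFromUri(new URI("https://my-bucket.s3.amazonaws.com/some/key")) shouldBe Some("my-bucket")
  }
  it should "extract the bucket from path-style URIs" in {
    S3UriUtils.hostBucketFromUri(new URI("https://s3.amazonaws.com/my-bucket/some/key")) shouldBe Some("my-bucket")
  }
  "objectKeyFromUri" should "strip the leading slash from short-style keys" in {
    S3UriUtils.objectKeyFromUri(new URI("s3://my-bucket/some/key")) shouldBe Full("some/key")
  }
}
```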

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (3)

79-86: Fail fast on invalid/non‑S3 paths; avoid aborting all buckets due to one bad path.

Currently, grouping by Option[String] may include None and the first failure can abort the whole deletion.

```diff
-  def deletePaths(paths: Seq[UPath])(implicit ec: ExecutionContext): Fox[Unit] = {
-    val pathsByBucket: Map[Option[String], Seq[UPath]] = paths.groupBy(S3UriUtils.hostBucketFromUpath)
-    for {
-      _ <- Fox.serialCombined(pathsByBucket.keys) { bucket: Option[String] =>
-        deleteS3PathsOnBucket(bucket, pathsByBucket(bucket))
-      }
-    } yield ()
-  }
+  def deletePaths(paths: Seq[UPath])(implicit ec: ExecutionContext): Fox[Unit] = {
+    val (nonS3, s3Paths) = paths.partition(p => !p.getScheme.contains(PathSchemes.schemeS3))
+    if (nonS3.nonEmpty) return Fox.failure(s"Non‑S3 paths passed to ManagedS3Service: ${nonS3.mkString(", ")}")
+
+    val grouped = s3Paths.groupBy(S3UriUtils.hostBucketFromUpath)
+    grouped.get(None).foreach { invalid =>
+      // surface all invalid paths instead of a generic failure
+      return Fox.failure(s"Could not determine S3 bucket for: ${invalid.mkString(", ")}")
+    }
+    val byBucket: Map[String, Seq[UPath]] = grouped.collect { case (Some(b), ps) => (b, ps) }
+    Fox.serialCombined(byBucket.keys.toSeq)(b => deleteS3PathsOnBucket(Some(b), byBucket(b))).map(_ => ())
+  }
```

This keeps valid buckets progressing even if a caller accidentally passes a bad path and also returns actionable error text.


75-78: transferManagerBox is unused.

Remove or use it (e.g., for future multipart/copy ops) to avoid dead code.


155-158: Speed up pathIsInManagedS3 and limit to S3 credentials.

Avoid repeatedly parsing all credentials; precompute S3 roots and filter to S3 credentials only.

Example:

```scala
private lazy val managedS3Roots: Seq[UPath] =
  globalCredentials.collect {
    case S3AccessKeyCredential(name, _, _, _, _) if name.startsWith("s3://") =>
      UPath.fromStringUnsafe(name).normalize
  }

def pathIsInManagedS3(path: UPath): Boolean =
  path.getScheme.contains(PathSchemes.schemeS3) && managedS3Roots.exists(path.startsWith)
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df11bf1 and d80fec9.

📒 Files selected for processing (6)
  • app/controllers/WKRemoteDataStoreController.scala (2 hunks)
  • app/models/dataset/Dataset.scala (3 hunks)
  • app/models/dataset/DatasetService.scala (5 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/controllers/WKRemoteDataStoreController.scala
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-12T13:07:29.637Z
Learnt from: frcroth
PR: scalableminds/webknossos#8609
File: app/models/dataset/Dataset.scala:753-775
Timestamp: 2025-05-12T13:07:29.637Z
Learning: In the `updateMags` method of DatasetMagsDAO (Scala), the code handles different dataset types distinctly:
1. Non-WKW datasets have `magsOpt` populated and use the first branch which includes axisOrder, channelIndex, and credentialId.
2. WKW datasets will have `wkwResolutionsOpt` populated and use the second branch which includes cubeLength.
3. The final branch is a fallback for legacy data.
This ensures appropriate fields are populated for each dataset type.

Applied to files:

  • app/models/dataset/Dataset.scala
🧬 Code graph analysis (5)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (6)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreConfig.scala (3)
  • DataStoreConfig (11-78)
  • DataVaults (61-63)
  • S3Upload (64-68)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (4)
  • S3UriUtils (7-52)
  • hostBucketFromUri (9-22)
  • hostBucketFromUpath (24-28)
  • objectKeyFromUri (40-49)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (8)
  • UPath (55-97)
  • toRemoteUriUnsafe (122-122)
  • toRemoteUriUnsafe (199-199)
  • getScheme (120-120)
  • getScheme (197-197)
  • fromString (60-60)
  • startsWith (139-143)
  • startsWith (209-213)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/storage/CredentialConfigReader.scala (1)
  • CredentialConfigReader (8-64)
util/src/main/scala/com/scalableminds/util/tools/Box.scala (1)
  • collectFirst (614-644)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (3)
  • combined (82-93)
  • fromFuture (40-51)
  • successful (51-54)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (4)
util/src/main/scala/com/scalableminds/util/io/PathUtils.scala (2)
  • PathUtils (14-221)
  • ensureDirectory (116-120)
app/models/dataset/WKRemoteDataStoreClient.scala (1)
  • deleteOnDisk (122-129)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
  • deleteOnDisk (413-428)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (2)
  • successful (51-54)
  • failure (56-60)
app/models/dataset/Dataset.scala (4)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (1)
  • UPath (55-97)
app/utils/sql/SqlInterpolation.scala (1)
  • q (20-39)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (2)
  • flatten (269-279)
  • toFox (12-12)
util/src/main/scala/com/scalableminds/util/tools/Box.scala (1)
  • toSingleBox (63-81)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (1)
util/src/main/scala/com/scalableminds/util/tools/Box.scala (1)
  • Full (661-699)
app/models/dataset/DatasetService.scala (6)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (1)
  • UPath (55-97)
app/models/annotation/Annotation.scala (2)
  • findOne (289-296)
  • countAllByDataset (644-656)
app/models/storage/UsedStorageService.scala (2)
  • UsedStorageService (33-227)
  • refreshStorageReportForDataset (211-227)
app/models/dataset/Dataset.scala (6)
  • deleteDataset (709-737)
  • findMagPathsUsedOnlyByThisDataset (871-892)
  • findAttachmentPathsUsedOnlyByThisDataset (1278-1293)
  • findDatasetsWithMagsInDir (892-906)
  • findDatasetsWithAttachmentsInDir (1293-1306)
  • findOne (203-210)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1)
  • deletePaths (79-86)
app/models/dataset/WKRemoteDataStoreClient.scala (2)
  • deletePaths (134-141)
  • deleteOnDisk (122-129)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: frontend-tests
  • GitHub Check: backend-tests
  • GitHub Check: build-smoketest-push
🔇 Additional comments (1)
app/models/dataset/Dataset.scala (1)

10-10: LGTM: Clean import simplification

The shift from DataSourceMagInfo to UPath aligns with the path-centric refactoring for dataset deletion.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc9908c and 6219144.

📒 Files selected for processing (1)
  • app/controllers/DatasetController.scala (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
app/controllers/DatasetController.scala (5)
app/telemetry/SlackNotificationService.scala (1)
  • SlackNotificationService (11-59)
util/src/main/scala/com/scalableminds/util/requestlogging/RequestLogging.scala (2)
  • log (55-62)
  • logTime (32-48)
util/src/main/scala/com/scalableminds/util/time/Instant.scala (5)
  • toString (15-15)
  • Instant (14-45)
  • Instant (47-103)
  • now (48-48)
  • logSince (72-75)
app/utils/WkConf.scala (1)
  • Features (135-155)
app/models/dataset/DatasetService.scala (1)
  • deleteDataset (511-538)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-smoketest-push
  • GitHub Check: backend-tests
  • GitHub Check: frontend-tests
🔇 Additional comments (1)
app/controllers/DatasetController.scala (1)

38-38: Slack notifier wiring verified and correct.

Import and DI of SlackNotificationService in DatasetController are confirmed. The service is properly defined, bound as an eager singleton in WebknossosModule.scala, and consistently used across the application (SqlClient, TaskCreationService, DatasetService, JobController, AnnotationController, and others).

@fm3 fm3 mentioned this pull request Oct 28, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)

428-436: Surface invalid/ignored paths to clients.

If DataSourceService rejects paths (non‑managed S3 or outside baseDir), propagate a 400 with details instead of a generic failure/OK. This avoids silent skips. Consider returning per‑path results.
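One possible shape for such per-path reporting, sketched with hypothetical names (a suggestion, not code from this PR):

```scala
import play.api.libs.json.{Json, OFormat}

// Hypothetical response item: report the outcome per path so clients can see
// which paths were deleted and which were rejected, and why.
case class PathDeletionResult(path: String, deleted: Boolean, reason: Option[String])

object PathDeletionResult {
  implicit val jsonFormat: OFormat[PathDeletionResult] = Json.format[PathDeletionResult]
}
```

The controller could then answer 400 with the rejected entries instead of a generic failure.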

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)

439-449: Use ManagedS3Service bucket; centralize URI building.

Switch to managedS3Service.s3UploadBucketOpt is correct. Suggest moving endpoint host/URI construction into ManagedS3Service (e.g., a helper returning UPath base) to avoid duplicating URI parsing here.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6219144 and 04b1de2.

📒 Files selected for processing (8)
  • app/controllers/WKRemoteDataStoreController.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreModule.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/S3DataVault.scala (6 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (3 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/S3DataVault.scala
🧰 Additional context used
🧬 Code graph analysis (4)
app/controllers/WKRemoteDataStoreController.scala (1)
app/models/dataset/DatasetService.scala (1)
  • deleteDatasetFromDB (548-562)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (5)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (5)
  • Fox (28-228)
  • Fox (230-303)
  • serialCombined (93-97)
  • serialCombined (97-109)
  • toFox (12-12)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (1)
  • isLocal (30-30)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (2)
  • pathIsInManagedS3 (155-159)
  • deletePaths (79-86)
util/src/main/scala/com/scalableminds/util/io/PathUtils.scala (2)
  • PathUtils (14-221)
  • deleteDirectoryRecursively (184-191)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
  • deletePaths (428-437)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (8)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/AccessTokenService.scala (4)
  • validateAccessFromTokenContext (87-94)
  • UserAccessRequest (30-30)
  • UserAccessRequest (31-72)
  • webknossos (70-72)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)
  • dataBaseDir (146-146)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (1)
  • getDataSource (190-197)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1)
  • deleteOnDisk (14-36)
app/models/dataset/WKRemoteDataStoreClient.scala (2)
  • deleteOnDisk (122-129)
  • deletePaths (134-141)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1)
  • deletePaths (79-86)
util/src/main/scala/com/scalableminds/util/mvc/ExtendedController.scala (1)
  • validateJson (204-209)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (1)
  • deletePathsFromDiskOrManagedS3 (300-311)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)
util/src/main/scala/com/scalableminds/util/io/PathUtils.scala (2)
  • PathUtils (14-221)
  • ensureDirectoryBox (122-130)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: frontend-tests
  • GitHub Check: build-smoketest-push
  • GitHub Check: backend-tests
🔇 Additional comments (12)
app/controllers/WKRemoteDataStoreController.scala (3)

6-6: LGTM! Clean import removal.

The Full import is no longer needed in this controller since the Box pattern matching has moved to DatasetService.deleteDatasetFromDB.


38-51: Excellent refactoring – proper separation of concerns.

Removing annotationDAO from the controller is correct since the annotation-count logic now resides in DatasetService.deleteDatasetFromDB. This improves maintainability by centralizing deletion logic in the service layer.


202-212: Well-architected delegation to service layer.

The refactoring correctly moves deletion orchestration to DatasetService.deleteDatasetFromDB, which now handles dataset lookup, annotation counting, conditional deletion, and storage refresh. The idempotent behavior (returning success for non-existent datasets) aligns with HTTP DELETE semantics.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (2)

27-32: DI addition looks good.
The ManagedS3Service injection integrates cleanly with existing services.


313-314: Tiny nit: helper is fine.
existsOnDisk is straightforward and used correctly in refresh path.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (2)

90-94: baseDirAbsolute endpoint: OK.
Scoped to webknossos context; returning a path string is acceptable.


415-426: deleteOnDisk flow simplification: OK.
Fetching the data source first and delegating to service is cleaner; auth via webknossos context matches WK→DS usage.

Confirm no external clients rely on the previous deleteDataset(datasetId) permission here.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (5)

23-24: Imports updated for ManagedS3Service: OK.


100-103: Constructor DI: OK.
ManagedS3Service centralization improves cohesion and config handling.


146-147: PathUtils.ensureDirectoryBox usage: good.
Better error mapping vs direct ensureDirectory.


541-548: S3 TransferManager via service: good.
Clear failure if S3 not configured; checks failedTransfers. LGTM.


582-594: Failure cleanup argument order fix: OK.
deleteOnDisk now receives datasetId first; matches signature.

Contributor

@MichaelBuessemeyer MichaelBuessemeyer left a comment


Some feedback :D

Testing results:

> Delete non-virtual datasets, should run through if paths are unique.
> Other datasets symlinking into this one should block the deletion

Looks good 👍

> Data on S3 buckets not listed with global credentials should never be deleted.

Also seemed good

> Delete virtual datasets, should run through. Other datasets referencing the data should still have it afterwards.

I am having trouble setting up a scenario where an S3 mag is shared.
I enabled uploading to S3 and made a successful upload. Using the new S3 path + credentials for explore remote also worked. But the dataset is now not rendered 🤔.
The server console says:

```
backend: 2025-10-29 14:27:27,600 [error] com.scalableminds.webknossos.datastore.services.DSDatasetErrorLoggingService - Error while loading bucket for layer 2-2-1 at BucketPosition(voxelMag1 at (32, 0, 160), bucket at (1,0,5), mag(1, 1, 1), additional coordinates: ), cuboid: Cuboid((32, 0, 160) / (1, 1, 1),32,32,32) for <path removed by me>: Could not read header at zarr.json <~ Failed to read from vault path <~ The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: 1TR63AYS5G6NZF1Y, Extended Request ID: <truncated by myself>) (SDK Attempt Count: 1) Stack trace: software.amazon.awssdk.services.s3.model.S3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: <truncated by myself>) (SDK Attempt Count: 1)
```

The rest of the testing went well

@MichaelBuessemeyer
Copy link
Contributor

Oh, and I forgot to mention that not all S3 data is cleared: the segmentation-layer/agglomerates/cumsum.json is not deleted when deleting a remote S3 dataset. Maybe because it is not listed as a used attachment?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1)

39-55: Remove @tailrec annotation.

The @tailrec annotation cannot be satisfied because the recursive call (line 52) is inside a try-catch block, preventing tail-call optimization. This was already flagged in a previous review.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (1)

300-306: Lock down local deletions to the datastore base directory.

Any caller can submit a file:// UPath pointing outside the datastore root, and we will happily recurse-delete that absolute path. From the new DELETE endpoint this lets a remote user wipe arbitrary directories (even /) on the datastore host. We must validate each local path against dataBaseDir, reject anything else, and only then delete either the directory tree or single files. Also fail fast on non-managed remote paths instead of silently skipping them.

Please apply a guard like the following:

```diff
   def deletePathsFromDiskOrManagedS3(paths: Seq[UPath]): Fox[Unit] = {
-    val localPaths = paths.filter(_.isLocal).flatMap(_.toLocalPath)
-    val managedS3Paths = paths.filter(managedS3Service.pathIsInManagedS3)
-    for {
-      _ <- Fox.serialCombined(localPaths)(PathUtils.deleteDirectoryRecursively(_).toFox)
-      _ <- managedS3Service.deletePaths(managedS3Paths)
-    } yield ()
+    val (localCandidates, remoteCandidates) = paths.partition(_.isLocal)
+    val managedS3Paths = remoteCandidates.filter(managedS3Service.pathIsInManagedS3)
+    val disallowedRemote = remoteCandidates.diff(managedS3Paths)
+    val baseDir = dataBaseDir.toAbsolutePath.normalize
+
+    for {
+      _ <- Fox.fromBool(disallowedRemote.isEmpty) ?~>
+        s"Refusing to delete non-managed remote paths: ${disallowedRemote.mkString(", ")}"
+      validatedLocal <- Fox.serialCombined(localCandidates) { upath =>
+        for {
+          local <- upath.toLocalPath.toFox
+          abs = local.toAbsolutePath.normalize
+          _ <- Fox.fromBool(abs.startsWith(baseDir)) ?~>
+            s"Refusing to delete outside datastore base directory ($baseDir): $abs"
+        } yield abs
+      }
+      _ <- Fox.serialCombined(validatedLocal) { abs =>
+        if (Files.isDirectory(abs)) PathUtils.deleteDirectoryRecursively(abs).toFox
+        else tryo(Files.deleteIfExists(abs)).map(_ => ()).toFox
+      }
+      _ <- managedS3Service.deletePaths(managedS3Paths)
+    } yield ()
   }
```

This matches the earlier review feedback but is still unresolved, so please fix before release.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1)

39-72: Stop interpreting credentialName as an S3 URI.

Datastore.S3Upload.credentialName is configured as a plain credential ID (e.g. "managed-s3"). The new code assumes it is an S3 URI, immediately parsing it with new URI(...) and forcing an endpoint override based on that host. On existing installations this throws URISyntaxException, so s3ClientBox becomes a Failure and all managed S3 deletions stop working. Even if the parse succeeded, reinterpreting the name changes semantics and breaks compatibility.

Please derive bucket/endpoint/region from the actual S3AccessKeyCredential metadata (the optional region and endpoint fields) and only override the client when those optional fields are populated. The field should remain a plain identifier. For example:

```diff
-  private lazy val s3UploadCredentialsOpt: Option[(String, String)] =
-    dataStoreConfig.Datastore.DataVaults.credentials.flatMap { credentialConfig =>
-      new CredentialConfigReader(credentialConfig).getCredential
-    }.collectFirst {
-      case S3AccessKeyCredential(credentialName, accessKeyId, secretAccessKey, _, _)
-          if dataStoreConfig.Datastore.S3Upload.credentialName == credentialName =>
-        (accessKeyId, secretAccessKey)
-    }
-
-  lazy val s3UploadBucketOpt: Option[String] =
-    S3UriUtils.hostBucketFromUri(new URI(dataStoreConfig.Datastore.S3Upload.credentialName))
-
-  private lazy val s3UploadEndpoint: URI = {
-    val credentialUri = new URI(dataStoreConfig.Datastore.S3Upload.credentialName)
-    new URI("https", null, credentialUri.getHost, -1, null, null, null)
-  }
+  private lazy val s3UploadCredentialOpt: Option[S3AccessKeyCredential] =
+    dataStoreConfig.Datastore.DataVaults.credentials
+      .flatMap(new CredentialConfigReader(_).getCredential)
+      .collectFirst {
+        case cred @ S3AccessKeyCredential(name, _, _, _, _)
+            if dataStoreConfig.Datastore.S3Upload.credentialName == name => cred
+      }
+
+  private lazy val s3UploadCredentialsOpt: Option[(String, String)] =
+    s3UploadCredentialOpt.map(cred => (cred.identifier, cred.secret))
+
+  private lazy val s3EndpointOverrideOpt: Option[URI] =
+    s3UploadCredentialOpt.flatMap(_.endpoint).flatMap(ep => tryo(new URI(ep)).toOption)
+
+  private lazy val s3Region: Region =
+    s3UploadCredentialOpt.flatMap(_.region).map(Region.of).getOrElse(Region.US_EAST_1)
```

And in the builder only call endpointOverride/forcePathStyle when s3EndpointOverrideOpt is defined:

```diff
-        .crossRegionAccessEnabled(true)
-        .forcePathStyle(true)
-        .endpointOverride(s3UploadEndpoint)
-        .region(Region.US_EAST_1)
+        .crossRegionAccessEnabled(true)
+        .region(s3Region)
+      s3EndpointOverrideOpt.foreach { endpoint =>
+        builder.endpointOverride(endpoint)
+        builder.forcePathStyle(true)
+      }
```

This keeps the existing configuration contract intact and avoids runtime failures.

🧹 Nitpick comments (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1)

47-47: Consider using s interpolator instead of f.

The f interpolator is intended for formatted strings with format specifiers (like %d, %s). Without format specifiers, s is more appropriate and avoids unnecessary overhead.

Apply this diff:

```diff
-          if (retryCount == 0) targetPath else targetPath.resolveSibling(f"${targetPath.getFileName} ($retryCount)")
+          if (retryCount == 0) targetPath else targetPath.resolveSibling(s"${targetPath.getFileName} ($retryCount)")
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04b1de2 and 368893d.

📒 Files selected for processing (4)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-04-28T14:18:04.368Z
Learnt from: frcroth
PR: scalableminds/webknossos#8236
File: webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/mesh/NeuroglancerPrecomputedMeshFileService.scala:161-166
Timestamp: 2025-04-28T14:18:04.368Z
Learning: In Scala for-comprehensions with the Fox error handling monad, `Fox.fromBool()` expressions should use the `<-` binding operator instead of the `=` assignment operator to properly propagate error conditions. Using `=` will cause validation failures to be silently ignored.

Applied to files:

  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala
📚 Learning: 2025-01-27T12:06:42.865Z
Learnt from: MichaelBuessemeyer
PR: scalableminds/webknossos#8352
File: app/models/organization/CreditTransactionService.scala:0-0
Timestamp: 2025-01-27T12:06:42.865Z
Learning: In Scala's for-comprehension with Fox (Future-like type), the `<-` operator ensures sequential execution. If any step fails, the entire chain short-circuits and returns early, preventing subsequent operations from executing. This makes it safe to perform validation checks before database operations.

Applied to files:

  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala
📚 Learning: 2025-05-12T13:07:29.637Z
Learnt from: frcroth
PR: scalableminds/webknossos#8609
File: app/models/dataset/Dataset.scala:753-775
Timestamp: 2025-05-12T13:07:29.637Z
Learning: In the `updateMags` method of DatasetMagsDAO (Scala), the code handles different dataset types distinctly:
1. Non-WKW datasets have `magsOpt` populated and use the first branch which includes axisOrder, channelIndex, and credentialId.
2. WKW datasets will have `wkwResolutionsOpt` populated and use the second branch which includes cubeLength.
3. The final branch is a fallback for legacy data.
This ensures appropriate fields are populated for each dataset type.

Applied to files:

  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala
🧬 Code graph analysis (3)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (4)
util/src/main/scala/com/scalableminds/util/io/PathUtils.scala (2)
  • PathUtils (14-221)
  • ensureDirectory (116-120)
util/src/main/scala/com/scalableminds/util/tools/Box.scala (1)
  • Full (661-699)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
  • deleteOnDisk (413-428)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (2)
  • successful (51-54)
  • failure (56-60)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (6)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (5)
  • Fox (28-228)
  • Fox (230-303)
  • serialCombined (93-97)
  • serialCombined (97-109)
  • toFox (12-12)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (3)
  • isLocal (29-29)
  • toLocalPath (111-111)
  • toLocalPath (176-176)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (2)
  • pathIsInManagedS3 (153-157)
  • deletePaths (79-86)
util/src/main/scala/com/scalableminds/util/io/PathUtils.scala (2)
  • PathUtils (14-221)
  • deleteDirectoryRecursively (184-191)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
  • deletePaths (428-437)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)
  • dataBaseDir (146-146)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (6)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreConfig.scala (3)
  • DataStoreConfig (11-75)
  • DataVaults (58-60)
  • S3Upload (61-65)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/PathSchemes.scala (1)
  • PathSchemes (3-9)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (4)
  • S3UriUtils (7-52)
  • hostBucketFromUri (9-22)
  • hostBucketFromUpath (24-28)
  • objectKeyFromUri (40-49)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (6)
  • UPath (54-96)
  • toRemoteUriUnsafe (121-121)
  • toRemoteUriUnsafe (188-188)
  • fromString (59-59)
  • startsWith (138-142)
  • startsWith (198-205)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/storage/CredentialConfigReader.scala (1)
  • CredentialConfigReader (8-64)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (4)
  • toFox (12-12)
  • combined (82-93)
  • fromFuture (40-51)
  • successful (51-54)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-smoketest-push
  • GitHub Check: backend-tests
  • GitHub Check: frontend-tests
🔇 Additional comments (2)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (2)

2-11: LGTM! Clean refactor of trait dependencies.

The trait signature and imports have been simplified appropriately. Removing DataSourceToDiskWriter and adding PathUtils aligns well with the new trash-based deletion approach.


24-36: LGTM! Idempotent deletion with clear logging.

The existence check and error handling are correct. Returning Fox.successful(()) when the path doesn't exist makes the deletion idempotent, which is the right behavior. The logging clearly distinguishes between actual deletion and no-op cases.

@fm3
Member

fm3 commented Oct 30, 2025

Thanks for taking the time to look at my PR! I implemented your code feedback :)

Regarding testing: Yes, it is difficult to set up dependent paths on S3 currently. I just re-tested this case by running a small neuron inferral with the webknossos-worker (with its new_publish branch). Worked as expected :)

Still, I’m a little confused by your testing result concerning explore. If I try to explore+add a dataset previously uploaded to S3 via the dataset upload, I get this error (as it should be). Do you still know what you did there exactly?

image

@fm3
Member

fm3 commented Oct 30, 2025

> Oh, and I forgot to mention that not all S3 data is cleared: the segmentation-layer/agglomerates/cumsum.json is not deleted when deleting a remote S3 dataset. Maybe because it is not listed as a used attachment?

I would have expected that cumsum.json would not have been uploaded to s3 in the first place, as it was not referenced in the datasource-properties.json 🤔 But yes, I think it’s acceptable for the moment that it is not deleted. It is also not very large.

@MichaelBuessemeyer
Contributor

🤔 I think the exploration result of the uploaded dataset is different. It is buggy imo: when I explore
<s3_root>/webknossos-uploads/sample_organization/delete_me-69034ce2800100b312b9616b, I get a successful exploration, but the mag is interpreted as a layer 🤔.
image

Maybe thats the difference between our test cases?

@MichaelBuessemeyer
Contributor

> Regarding testing: Yes, it is difficult to set up dependent paths on S3 currently. I just re-tested this case by running a small neuron inferral with the webknossos-worker (with its new_publish branch). Worked as expected :)

Ok, then no need for me to do this again, as this takes some time?

@MichaelBuessemeyer
Contributor

Oh I see, the error

```
backend: 2025-10-29 14:27:27,600 [error] com.scalableminds.webknossos.datastore.services.DSDatasetErrorLoggingService - Error while loading bucket for layer 2-2-1 at BucketPosition(voxelMag1 at (32, 0, 160), bucket at (1,0,5), mag(1, 1, 1), additional coordinates: ), cuboid: Cuboid((32, 0, 160) / (1, 1, 1),32,32,32) for <path removed by me>: Could not read header at zarr.json <~ Failed to read from vault path <~ The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: 1TR63AYS5G6NZF1Y, Extended Request ID: <truncated by myself>) (SDK Attempt Count: 1) Stack trace: software.amazon.awssdk.services.s3.model.S3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: <truncated by myself>) (SDK Attempt Count: 1)
```

states

> The AWS Access Key Id you provided does not exist in our records.

This means that the file does not exist, which is also what the error says:

> Could not read header at zarr.json

Therefore, I think the root cause is that the exploration thought it found a valid layer, but it was actually a mag. Thus, the datasource stored in the DB is buggy, and loading the dataset does not work properly.

@MichaelBuessemeyer
Contributor

But this exploration bug is very likely not related to this PR. So I'd say we can make this green, as I did some re-testing showing that it still works, and your latest changes look good?

@fm3
Member

fm3 commented Oct 30, 2025

Yes, I agree :) Also, no need to re-test the linked mags in virtual datasets, I did that today

Contributor

@MichaelBuessemeyer MichaelBuessemeyer left a comment


So, as this bug is unrelated, I opened #9034.

And as the code is good and doing what it should be doing 🟢

@fm3 fm3 merged commit 05f02f8 into master Nov 3, 2025
5 checks passed
@fm3 fm3 deleted the s3-delete branch November 3, 2025 12:27
fm3 added a commit that referenced this pull request Nov 5, 2025
…r upload (#9047)

We can have multiple managed s3 stores. Only one is selected for upload
but dataset deletion should work for all of them. This PR refactors the
ManagedS3Service a bit and selects the correct client with
endpoint/credential for each path.
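A rough sketch of what per-path credential selection can look like (illustrative only, not the actual refactored code; it assumes the convention, discussed in the review above, that a managed-S3 credential's name is its S3 root URI):

```scala
// Hypothetical helper: pick the credential whose root URI is the longest
// prefix of the given path; None means the path is not in any managed store.
def credentialForPath(path: UPath, credentials: Seq[S3AccessKeyCredential]): Option[S3AccessKeyCredential] =
  credentials
    .filter(c => path.startsWith(UPath.fromStringUnsafe(c.name)))
    .maxByOption(_.name.length)
```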

### URL of deployed dev instance (used for testing):
- https://___.webknossos.xyz

### Steps to test:
(I already tested locally, no need to re-test)
- In application.conf, insert multiple dataVaults.credentials for s3
endpoints. Set s3Upload.credentialName to the first of them
- Upload a small dataset
- set the s3Upload.credentialName to another of them
- make sure both datasets can be loaded
- delete both datasets from their respective settings views.
- backend logging should show they were deleted from their respective s3
buckets

### Issues:
- follow-up for #8924 
- fixes #9043

------
- [x] Added changelog entry (create a `$PR_NUMBER.md` file in
`unreleased_changes` or use `./tools/create-changelog-entry.py`)
- [x] Removed dev-only changes like prints and application.conf edits
- [x] Considered [common edge
cases](../blob/master/.github/common_edge_cases.md)
- [x] Needs datastore update after deployment
fm3 added a commit that referenced this pull request Nov 11, 2025
- When scanning datasources from directories, always add resolved mag
paths immediately
- Introduces realpaths for attachments
- Report all realpaths that could be determined, scale down logging for
those that couldn’t.
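For context, resolving a "realpath" for a local mag or attachment presumably amounts to symlink resolution along these lines (a minimal java.nio sketch, not the actual scanning code):

```scala
import java.nio.file.{Files, Path}
import scala.util.Try

// Resolve symlinks to a canonical real path; return None on failure (e.g. a
// referenced mag was deleted), so one broken path only produces a log entry
// instead of failing the whole datasource scan.
def resolveRealPath(path: Path): Option[Path] =
  if (Files.exists(path)) Try(path.toRealPath()).toOption
  else None
```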

### URL of deployed dev instance (used for testing):
- https://___.webknossos.xyz

### Steps to test:
- start wk with some datasets in the binaryData dir, watch backend
logging, should show realpath scan failures (provoke some by deleting
mags or attachments that are referenced in some
datasource-properties.jsons)
- All realpaths that could be determined should be added to the database

### TODOs:
- [x] immediate paths for datasources
- [x] simplify layerpath logic
- [x] attachment realpath column
- [x] don’t fail for datasource on single realpath failure
- [x] log failures
- [x] scan also attachments
- [x] evolution
- [x] adapt s3-delete code to also consider realpaths of attachments

### Issues:
- blocked by #8924 (paths routes changed)
- blocked by #8980 (datavault path refactoring)
- fixes #9018

------
- [x] Added changelog entry (create a `$PR_NUMBER.md` file in
`unreleased_changes` or use `./tools/create-changelog-entry.py`)
- [x] Removed dev-only changes like prints and application.conf edits
- [x] Considered [common edge
cases](../blob/master/.github/common_edge_cases.md)
- [x] Needs datastore update after deployment

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: MichaelBuessemeyer <[email protected]>

Successfully merging this pull request may close these issues: Delete datasets from S3.