
Conversation

@frcroth
Contributor

@frcroth frcroth commented Sep 15, 2025

  • Enables dataset deletion for virtual datasets with data stored on our managed S3.
  • For non-virtual datasets, only the directory in the binaryData directory is deleted.
  • If other datasets have realpaths in that directory, deletion is blocked. (Previously there was move-and-resymlink logic; this has been removed. See the query sketch below.)
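Conceptually, the blocking check boils down to a query like the following sketch, which mirrors the DAO query discussed in the review below (simplified; the exact code and its COALESCE handling are debated there):

```scala
// Simplified sketch (not the exact DAO code): does any *other* dataset have a
// mag whose resolved real path lies inside the directory about to be deleted?
def findDatasetsWithMagsInDirSketch(datasetId: ObjectId, absolutePathWithTrailingSlash: String) =
  run(q"""
      SELECT d._id
      FROM webknossos.dataset_mags m
      JOIN webknossos.datasets d ON m._dataset = d._id
      WHERE m.realpath IS NOT NULL
      AND starts_with(m.realpath, $absolutePathWithTrailingSlash)
      AND d._id != $datasetId
     """.as[ObjectId])
```

If this returns any rows, the deletion is blocked.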

Steps to test:

  • Delete virtual datasets, should run through. Other datasets referencing the data should still have it afterwards.
  • Delete non-virtual datasets, should run through if paths are unique.
    • Other datasets symlinking into this one should block the deletion
  • Enable S3 upload with credentials, datasets should be deletable afterwards.
  • Data on S3 buckets not listed with global credentials should never be deleted.

TODOs:

  • on datastore side, delete local and managedS3 paths
  • delete virtual datasets (paths used only by them)
  • delete non-virtual datasets (unless any other dataset’s paths point in there)
  • Allow manually deleting “Not yet fully uploaded” datasets (be it via publish or via upload)
  • Re-test delete as cleanup on failed dataset uploads
  • Re-test delete with S3 paths
  • cleanup

Issues:

  • Fixes #8803

------
  • Added changelog entry (create a $PR_NUMBER.md file in unreleased_changes or use ./tools/create-changelog-entry.py)
  • Removed dev-only changes like prints and application.conf edits
  • Considered common edge cases
  • Needs datastore update after deployment

@coderabbitai
Contributor

coderabbitai bot commented Sep 15, 2025

📝 Walkthrough

Walkthrough

Centralizes dataset deletion into a new DatasetService.deleteDataset flow that deletes data from disk and managed S3, adds ManagedS3Service and S3 utilities, refactors path-centric DAOs/clients/routes (including DELETE-with-JSON), updates controllers/routes/upload and datastore services, and tightens UPath behavior and tests.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| **Dataset controller & routes**<br>`app/controllers/DatasetController.scala`, `conf/webknossos.latest.routes`, `frontend/javascripts/admin/rest_api.ts` | Renamed `deleteOnDisk` → `delete` and the route to `/datasets/:datasetId`; added `finishUploadToPaths`; frontend DELETE now targets `/api/datasets/${datasetId}`; controller adds logging, Slack instrumentation and calls `datasetService.deleteDataset`. |
| **Dataset service & DAOs**<br>`app/models/dataset/DatasetService.scala`, `app/models/dataset/Dataset.scala` | Replaced `deleteVirtualOrDiskDataset` with `deleteDataset`; added dependencies (`DatasetLayerAttachmentsDAO`, `AnnotationDAO`, `UsedStorageService`); added path-based DAO methods and helpers `findDatasetsUsingDataFromDir` / `deleteDatasetFromDB`; updated mags/attachments handling. |
| **WK remote datastore client & controller**<br>`app/controllers/WKRemoteDataStoreController.scala`, `app/models/dataset/WKRemoteDataStoreClient.scala`, `webknossos-datastore/conf/datastore.latest.routes`, `webknossos-datastore/app/.../controllers/DataSourceController.scala` | Removed `getPaths` and `AnnotationDAO` injection; added client `getBaseDirAbsolute` and DELETE-with-JSON support; added controller endpoints `baseDirAbsolute` and `deletePaths`; controller delegates path deletion to `DataSourceService.deletePathsFromDiskOrManagedS3` and simplifies the delete-on-disk flow. |
| **Managed S3 & S3 utilities**<br>`webknossos-datastore/app/.../services/ManagedS3Service.scala`, `webknossos-datastore/app/.../helpers/S3UriUtils.scala`, `webknossos-datastore/app/.../datavault/S3DataVault.scala` | Introduced `ManagedS3Service` for grouped/batched deletions and managed-S3 checks; added `S3UriUtils` for S3 URI parsing; refactored `S3DataVault` internals to use `S3UriUtils`. |
| **Datastore RPC & clients**<br>`webknossos-datastore/app/.../rpc/RPCRequest.scala`, `webknossos-datastore/app/.../services/DSRemoteWebknossosClient.scala` | Added `RPCRequest.deleteJson[T]` (DELETE with JSON body); removed `fetchPaths` from `DSRemoteWebknossosClient`. |
| **Dataset deletion helper & UPath changes**<br>`webknossos-datastore/app/.../helpers/DatasetDeleter.scala`, `webknossos-datastore/app/.../helpers/UPath.scala` | Simplified the `DatasetDeleter` trait (removed mixin); changed the `deleteOnDisk` signature to take `datasetId` first and implement move-to-trash with retry; narrowed the `RemoteUPath.parent` type and made `startsWith` compare path semantics. |
| **DataSource service & upload**<br>`webknossos-datastore/app/.../services/DataSourceService.scala`, `webknossos-datastore/app/.../services/uploading/UploadService.scala`, `webknossos-datastore/app/.../services/DataSourceToDiskWriter.scala` | `DataSourceService` gains `managedS3Service`, `deletePathsFromDiskOrManagedS3`, and `existsOnDisk`; `UploadService` now uses `ManagedS3Service` and removes legacy S3 client logic; `DataSourceToDiskWriter` no longer mixes in `PathUtils`. |
| **Storage service & utilities**<br>`app/models/storage/UsedStorageService.scala`, `util/src/main/scala/com/scalableminds/util/io/PathUtils.scala`, `util/src/main/scala/com/scalableminds/util/tools/Fox.scala` | `UsedStorageService` removed its `DatasetService` dependency and now resolves datastores via DAO; `PathUtils` converted from trait to object; removed redundant `Fox` imports. |
| **Datastore DI & routes**<br>`webknossos-datastore/app/.../DataStoreModule.scala`, `webknossos-datastore/conf/datastore.latest.routes` | Added eager singleton binding for `ManagedS3Service`; added GET `baseDirAbsolute` and DELETE `deletePaths` routes; removed the previous `invalidateCache` route. |
| **Tests & release notes**<br>`test/backend/UPathTestSuite.scala`, `unreleased_changes/8924.md`, `frontend/javascripts/test/backend-snapshot-tests/datasets.e2e.ts` | Strengthened `UPath.startsWith` tests; added release note about managed-S3 deletions and symlink-blocking behavior; removed the dataset paths E2E test. |
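One detail behind the "grouped/batched deletions" in ManagedS3Service: S3's DeleteObjects API accepts at most 1000 keys per request, so deleting a dataset prefix means listing keys with pagination and deleting them in chunks per bucket. A minimal sketch with the AWS SDK v2 (illustrative only; the actual service additionally handles credentials, managed-S3 checks, and Fox error handling):

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{Delete, DeleteObjectsRequest, ListObjectsV2Request, ObjectIdentifier}
import scala.jdk.CollectionConverters._

// List all keys under a prefix (paginated) and delete them in batches of at
// most 1000 keys, which is the DeleteObjects limit.
def deletePrefix(client: S3Client, bucket: String, prefix: String): Unit = {
  val listRequest = ListObjectsV2Request.builder().bucket(bucket).prefix(prefix).build()
  val keys = client.listObjectsV2Paginator(listRequest).contents().asScala.map(_.key()).toSeq
  keys.grouped(1000).foreach { batch =>
    val toDelete = Delete.builder()
      .objects(batch.map(k => ObjectIdentifier.builder().key(k).build()).asJava)
      .build()
    client.deleteObjects(DeleteObjectsRequest.builder().bucket(bucket).delete(toDelete).build())
  }
}
```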

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Extra attention:
    • DatasetService.deleteDataset / deleteDatasetFromDB: ordering of datastore path deletion vs DB deletion, shared-directory detection, and annotation handling.
    • ManagedS3Service: credential parsing, S3 client/transfer manager lifecycle, pagination and batched delete correctness.
    • DAO SQL/query changes (DatasetMagsDAO / DatasetLayerAttachmentsDAO) for correct path discovery and deduplication.
    • DatasetDeleter.deleteOnDisk retry semantics and interaction with PathUtils move-to-trash.
    • RPCRequest.deleteJson and clients/controllers relying on DELETE-with-body (see the wire-level sketch after this list).
    • DI wiring for new constructor params and eager binding.
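On the DELETE-with-body point: HTTP permits a request body on DELETE, but many HTTP clients refuse to send one, which is why an explicit deleteJson helper is needed at all. A wire-level illustration in plain Scala over the JDK 11+ HttpClient (not the actual RPCRequest code; the URL and payload are made up):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// The generic method(...) builder allows a body on DELETE, unlike the
// dedicated DELETE() shortcut of many HTTP clients.
val client = HttpClient.newHttpClient()
val request = HttpRequest
  .newBuilder(URI.create("https://datastore.example.org/data/datasets/deletePaths")) // hypothetical URL
  .header("Content-Type", "application/json")
  .method("DELETE", HttpRequest.BodyPublishers.ofString("""{"paths":["s3://bucket/dataset/mag1"]}"""))
  .build()
val response = client.send(request, HttpResponse.BodyHandlers.ofString())
println(response.statusCode())
```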

Possibly related PRs

Suggested reviewers

  • MichaelBuessemeyer

Poem

🐰 I hopped through buckets, keys, and trash,
Nudged tangled symlinks into a stash.
With retries, logs, and a nimble paw,
I cleared disk and S3 without a flaw.
A carrot cheer for tidy data law!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title "Delete datasets on S3" directly and clearly describes the main change in this pull request. The changeset introduces infrastructure to enable dataset deletion from managed S3 storage, which is precisely what the title conveys. The title is concise, specific, and avoids vague terminology, making it immediately clear to a developer scanning the history what the primary objective is. |
| Linked Issues Check | ✅ Passed | The code changes address the primary objective from issue #8803: enabling the "Delete on disk" action to also remove data from the project's managed S3 storage. Key implementations include the new ManagedS3Service with pathIsInManagedS3 validation to restrict deletion to managed S3 only, updated DatasetService.deleteDataset orchestrating virtual dataset deletion, DataSourceService.deletePathsFromDiskOrManagedS3 partitioning paths between local and S3 deletions, and new DAO methods (findMagPathsUsedOnlyByThisDataset, findDatasetsWithMagsInDir, etc.) to support conflict checking for non-virtual datasets. The route changes and endpoint additions properly expose the new functionality. |
| Out of Scope Changes Check | ✅ Passed | The changes are cohesive and support the PR objective of enabling S3 deletion with proper safeguards. Key in-scope modifications include new S3 utilities (S3UriUtils), ManagedS3Service for managed S3 interaction, updated DatasetDeleter to work with the new deletion approach, DAO methods for conflict detection, and endpoint changes. The removal of legacy getPaths endpoints and move-and-resymlink logic aligns with the intentional refactoring described in the PR. Minor refactorings such as the PathUtils trait-to-object conversion appear necessary for integration with the new deletion infrastructure. All changes directly support or enable the S3 deletion feature. |
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch s3-delete

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@frcroth frcroth assigned frcroth and unassigned frcroth Sep 17, 2025
@fm3 fm3 self-assigned this Sep 23, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (7)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1)

48-48: Log the actual deduplicated path.

Line 48 logs targetPath, but when retryCount > 0, the actual destination is deduplicatedTargetPath. Log the path that was actually used.

Apply this diff:

```diff
-        logger.info(s"Successfully moved dataset from $sourcePath to $targetPath.")
+        logger.info(s"Successfully moved dataset from $sourcePath to $deduplicatedTargetPath.")
```
app/models/dataset/Dataset.scala (1)

871-890: Critical: Query returns wrong paths for non-virtual datasets

The query selects only m1.path, but for non-virtual datasets the physical path to delete is stored in m1.realPath. This means:

  • Non-virtual datasets would return logical paths instead of physical filesystem paths
  • Deletion would target the wrong locations
  • Could cause data loss or orphaned data

A past reviewer flagged this exact issue and suggested using COALESCE(m1.realPath, m1.path).

Based on learnings

Apply this diff to select the actual physical path:

```diff
-      pathsStrOpts <- run(q"""
-           SELECT m1.path FROM webknossos.dataset_mags m1
-           WHERE m1._dataset = $datasetId
-           AND m1.path IS NOT NULL
+      pathsStrOpts <- run(q"""
+           SELECT COALESCE(m1.realPath, m1.path) AS path
+           FROM webknossos.dataset_mags m1
+           WHERE m1._dataset = $datasetId
+           AND COALESCE(m1.realPath, m1.path) IS NOT NULL
            AND NOT EXISTS (
               SELECT m2.path
               FROM webknossos.dataset_mags m2
               WHERE m2._dataset != $datasetId
               AND (
-                m2.path = m1.path
+                m2.path = COALESCE(m1.realPath, m1.path)
                 OR
-                (
-                  m2.realpath IS NOT NULL AND m2.realpath = m1.realpath
-                )
+                m2.realpath = COALESCE(m1.realPath, m1.path)
               )
            )
               """.as[Option[String]])
```
app/models/dataset/DatasetService.scala (3)

548-557: Good fix: sequence deletion and refresh; errors now propagate.

Replacing _ = ... with <- for both deleteDataset and refreshStorageReport correctly sequences the Fox chain.


511-523: Critical: “used-only-by-this-dataset” can miss shared data (NULL realpath/symlink).

Current DAO logic for mags uses realpath equality that fails with NULL and can delete paths still used by other datasets. Use COALESCE to align with index semantics, or conservatively block when NULLs exist. Also consider validating paths before delete.

Suggested DAO fix (app/models/dataset/Dataset.scala):

```diff
- AND NOT EXISTS (
-    SELECT m2.path
-    FROM webknossos.dataset_mags m2
-    WHERE m2._dataset != $datasetId
-    AND (
-      m2.path = m1.path
-      OR (
-        m2.realpath IS NOT NULL AND m2.realpath = m1.realpath
-      )
-    )
- )
+ AND NOT EXISTS (
+    SELECT 1
+    FROM webknossos.dataset_mags m2
+    WHERE m2._dataset != $datasetId
+    AND COALESCE(m2.realpath, m2.path) = COALESCE(m1.realpath, m1.path)
+ )
```

Optional safety here:

```diff
-          pathsUsedOnlyByThisDataset = magPathsUsedOnlyByThisDataset ++ attachmentPathsUsedOnlyByThisDataset
+          pathsUsedOnlyByThisDataset = magPathsUsedOnlyByThisDataset ++ attachmentPathsUsedOnlyByThisDataset
+          // extra safety: validate only-local/managed-S3 and existence
+          _ <- validatePaths(pathsUsedOnlyByThisDataset, datastore)
```

Run to confirm DAOs use COALESCE:

```bash
#!/bin/bash
rg -n -C2 'findMagPathsUsedOnlyByThisDataset|COALESCE\(m2\.realpath, m2\.path\)' app/models/dataset/Dataset.scala
```

525-533: Critical: Deletion guard misses datasets with NULL realpath; use COALESCE in DAO.

findDatasetsUsingDataFromDir relies on DAO query filtering m.realpath IS NOT NULL and starts_with(m.realpath,...). Legacy NULL realpaths won’t be seen → risk of deleting shared data.

Suggested DAO fix (app/models/dataset/Dataset.scala):

```diff
- WHERE m.realpath IS NOT NULL
- AND starts_with(m.realpath, $absolutePathWithTrailingSlash)
+ WHERE starts_with(COALESCE(m.realpath, m.path), $absolutePathWithTrailingSlash)
```

Optional: add a conservative pre-check to abort if any mags under directory have realpath IS NULL until a backfill/migration makes realpath non-NULL.

Also review attachments symmetry: attachments guard uses a.path only; if attachments ever gain realpath, mirror COALESCE there too.

```bash
#!/bin/bash
rg -n -C3 'findDatasetsWithMagsInDir|starts_with\(' app/models/dataset/Dataset.scala
rg -n 'COALESCE\(m\.realpath, m\.path\)' app/models/dataset/Dataset.scala
```
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (1)

9-22: Harden S3 URI parsing: null-safety, regional endpoints, and key normalization.

  • Avoid NPEs on uri.getHost.
  • Support regional virtual-hosted endpoints (e.g., bucket.s3.us-west-2.amazonaws.com).
  • Normalize keys to avoid leading “/” for AWS S3.
  • Make short-style detection scheme-aware (s3:// only).
  • Improve non-AWS host detection.

Apply this diff to the existing functions:

```diff
@@
-  def hostBucketFromUri(uri: URI): Option[String] = {
-    val host = uri.getHost
-    if (host == null) {
-      None
-    } else if (isShortStyle(uri)) { // assume host is omitted from uri, shortcut form s3://bucket/key
-      Some(host)
-    } else if (isVirtualHostedStyle(uri)) {
-      Some(host.substring(0, host.length - ".s3.amazonaws.com".length))
-    } else if (isPathStyle(uri)) {
-      Some(uri.getPath.substring(1).split("/")(0))
-    } else {
-      None
-    }
-  }
+  def hostBucketFromUri(uri: URI): Option[String] = {
+    val hostO = hostOpt(uri)
+    if (isShortStyle(uri)) {
+      hostO
+    } else if (isVirtualHostedStyle(uri)) {
+      hostO.flatMap(bucketFromVirtualHost)
+    } else if (isPathStyle(uri)) {
+      val path = Option(uri.getPath).getOrElse("")
+      path.stripPrefix("/").split("/", 2).headOption.filter(_.nonEmpty)
+    } else None
+  }
@@
-  // https://bucket-name.s3.region-code.amazonaws.com/key-name
-  private def isVirtualHostedStyle(uri: URI): Boolean =
-    uri.getHost.endsWith(".s3.amazonaws.com")
+  // https://bucket-name.s3.amazonaws.com or https://bucket-name.s3.us-west-2.amazonaws.com
+  private def isVirtualHostedStyle(uri: URI): Boolean =
+    hostOpt(uri).exists(h => VirtualHostedPattern.pattern.matcher(h).matches)
@@
-  // https://s3.region-code.amazonaws.com/bucket-name/key-name
-  private def isPathStyle(uri: URI): Boolean =
-    uri.getHost.matches("s3(.[\\w\\-_]+)?.amazonaws.com") ||
-      (!uri.getHost.contains("amazonaws.com") && uri.getHost.contains("."))
+  // https://s3.amazonaws.com/bucket/key, https://s3.us-west-2.amazonaws.com/bucket/key
+  // or non-AWS S3-compatible hosts (MinIO, Ceph, etc.)
+  private def isPathStyle(uri: URI): Boolean =
+    hostOpt(uri).exists { h =>
+      h.matches("^s3([.-][\\w\\-_]+)?\\.amazonaws\\.com$") ||
+      (!h.contains("amazonaws.com") && h.contains("."))
+    }
@@
-  // S3://bucket-name/key-name
-  private def isShortStyle(uri: URI): Boolean =
-    !uri.getHost.contains(".")
+  // s3://bucket-name/key-name  (short style)
+  private def isShortStyle(uri: URI): Boolean =
+    "s3".equalsIgnoreCase(uri.getScheme) && hostOpt(uri).exists(h => !h.contains("."))
@@
-  def objectKeyFromUri(uri: URI): Box[String] =
-    if (isVirtualHostedStyle(uri)) {
-      Full(uri.getPath)
-    } else if (isPathStyle(uri)) {
-      Full(uri.getPath.substring(1).split("/").tail.mkString("/"))
-    } else if (isShortStyle(uri)) {
-      Full(uri.getPath.tail)
-    } else Failure(s"Not a valid s3 uri: $uri")
+  def objectKeyFromUri(uri: URI): Box[String] = {
+    val path = Option(uri.getPath).getOrElse("")
+    if (isVirtualHostedStyle(uri) || isShortStyle(uri)) {
+      Full(path.stripPrefix("/"))
+    } else if (isPathStyle(uri)) {
+      val parts = path.stripPrefix("/").split("/", 2)
+      if (parts.length == 2) Full(parts(1)) else Failure(s"No object key in $uri")
+    } else Failure(s"Not a valid s3 uri: $uri")
+  }
@@
-  def isNonAmazonHost(uri: URI): Boolean =
-    (isPathStyle(uri) && !uri.getHost.endsWith(".amazonaws.com")) || uri.getHost == "localhost"
+  def isNonAmazonHost(uri: URI): Boolean =
+    hostOpt(uri).exists(h => (isPathStyle(uri) && !h.endsWith(".amazonaws.com")) || h == "localhost")
```

Add these helpers (append within the object):

```scala
// e.g. bucket.s3.amazonaws.com or bucket.s3.us-west-2.amazonaws.com
private val VirtualHostedPattern = """^(.+?)\.s3([.-][\w-]+)?\.amazonaws\.com$""".r

private def hostOpt(uri: URI): Option[String] =
  Option(uri.getHost).orElse(Option(uri.getAuthority)).map(_.trim).filter(_.nonEmpty)

private def bucketFromVirtualHost(host: String): Option[String] = host match {
  case VirtualHostedPattern(bucket, _) => Some(bucket)
  case _                               => None
}
```

This fixes failures on regional endpoints and ensures keys do not carry a leading slash, which is critical for correct List/Delete behavior. Based on learnings.

Also applies to: 27-39, 40-50

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1)

39-55: Don’t derive bucket/endpoint from credentialName; use the matched S3 credential (name/endpoint/region).

Assuming credentialName is a logical identifier, parsing it as a URI yields wrong bucket and an invalid endpoint (e.g., https://bucket). Derive from the selected S3AccessKeyCredential instead; use its optional endpoint/region, and parse name only if it is actually a URI.

Apply this refactor:

```diff
@@
-  lazy val s3UploadBucketOpt: Option[String] =
-    // by convention, the credentialName is the S3 URI so we can extract the bucket from it.
-    S3UriUtils.hostBucketFromUri(new URI(dataStoreConfig.Datastore.S3Upload.credentialName))
+  private lazy val s3UploadCredentialOpt: Option[S3AccessKeyCredential] =
+    dataStoreConfig.Datastore.DataVaults.credentials
+      .flatMap(new CredentialConfigReader(_).getCredential)
+      .collectFirst {
+        case c @ S3AccessKeyCredential(name, _, _, _, _)
+            if dataStoreConfig.Datastore.S3Upload.credentialName == name => c
+      }
+
+  private lazy val s3UploadBaseUriOpt: Option[URI] =
+    s3UploadCredentialOpt.flatMap { c =>
+      // name might be a UPath/URI root like s3://bucket/prefix or https://minio:9000/bucket
+      tryo(UPath.fromString(c.name).toRemoteUriUnsafe).toOption
+        .orElse(tryo(new URI(c.name)).toOption)
+    }
+
+  lazy val s3UploadBucketOpt: Option[String] =
+    s3UploadBaseUriOpt.flatMap(S3UriUtils.hostBucketFromUri)
@@
-  private lazy val s3UploadEndpoint: URI = {
-    // by convention, the credentialName is the S3 URI so we can extract the bucket from it.
-    val credentialUri = new URI(dataStoreConfig.Datastore.S3Upload.credentialName)
-    new URI(
-      "https",
-      null,
-      credentialUri.getHost,
-      -1,
-      null,
-      null,
-      null
-    )
-  }
+  private lazy val s3UploadEndpointOpt: Option[URI] =
+    s3UploadCredentialOpt.flatMap(_.endpoint.map(URI.create))
```

This prevents constructing invalid endpoints and decouples behavior from a brittle config assumption. Based on learnings.


If `credentialName` is indeed a URI in your deployments, please confirm and we can simplify accordingly.

🧹 Nitpick comments (7)

webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1)

`41-41`: **Extract hardcoded retry limit as a named constant.**

The retry limit `15` is hardcoded. Consider extracting it as a private constant (e.g., `private val MaxMoveDedupRetries = 15`) for clarity and maintainability.



```diff
+  private val MaxMoveDedupRetries = 15
+
   private def deleteWithRetry(sourcePath: Path, targetPath: Path, retryCount: Int = 0)(
       implicit ec: ExecutionContext): Fox[Unit] =
-    if (retryCount > 15) {
+    if (retryCount > MaxMoveDedupRetries) {
       Fox.failure(s"Deleting dataset failed: too many retries.")
```

app/models/dataset/Dataset.scala (2)

892-906: Consider checking both realpath and path for completeness

The method currently only checks m.realpath to find datasets with mags in a directory. While this is correct for non-virtual datasets (filesystem paths), you might want to also check m.path for safety, especially if virtual datasets could have paths under the same prefix.

If needed, apply this diff to check both fields:

```diff
     run(q"""
         SELECT d._id FROM webknossos.dataset_mags m
         JOIN webknossos.datasets d ON m._dataset = d._id
-        WHERE m.realpath IS NOT NULL
-        AND starts_with(m.realpath, $absolutePathWithTrailingSlash)
+        WHERE (
+          (m.realpath IS NOT NULL AND starts_with(m.realpath, $absolutePathWithTrailingSlash))
+          OR starts_with(m.path, $absolutePathWithTrailingSlash)
+        )
         AND d._id != $ignoredDataset
         AND d._datastore = ${dataStore.name.trim}
        """.as[ObjectId])
```

1293-1306: Add NULL check for consistency and safety

Similar to the mags version, this method should filter out NULL paths to avoid potential issues with the string prefix check.

Apply this diff to add a NULL check:

```diff
     run(q"""
         SELECT d._id FROM webknossos.dataset_layer_attachments a
         JOIN webknossos.datasets d ON a._dataset = d._id
-        WHERE starts_with(a.path, $absolutePathWithTrailingSlash)
+        WHERE a.path IS NOT NULL
+        AND starts_with(a.path, $absolutePathWithTrailingSlash)
         AND d._id != $ignoredDataset
         AND d._datastore = ${dataStore.name.trim}
        """.as[ObjectId])
```
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (1)

1-7: Add tests for core URI forms.

Cover:

  • short-style URIs (s3://bucket/key)
  • virtual-hosted-style URIs, including regional endpoints (bucket.s3.us-west-2.amazonaws.com)
  • path-style URIs (s3.amazonaws.com/bucket/key) and non-AWS S3-compatible hosts
  • object-key extraction without a leading slash

Do you want me to scaffold a ScalaTest suite for this?
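As a starting point, a minimal suite might look like this (a sketch assuming ScalaTest's AnyFlatSpec is available; adjust to the project's test conventions):

```scala
import java.net.URI
import com.scalableminds.util.tools.Full
import com.scalableminds.webknossos.datastore.helpers.S3UriUtils
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class S3UriUtilsTestSuite extends AnyFlatSpec with Matchers {
  "hostBucketFromUri" should "extract the bucket from short-style URIs" in {
    S3UriUtils.hostBucketFromUri(new URI("s3://my-bucket/some/key")) shouldBe Some("my-bucket")
  }
  it should "extract the bucket from virtual-hosted-style URIs" in {
    S3UriUtils.hostBucketFromUri(new URI("https://my-bucket.s3.amazonaws.com/some/key")) shouldBe Some("my-bucket")
  }
  it should "extract the bucket from path-style URIs" in {
    S3UriUtils.hostBucketFromUri(new URI("https://s3.amazonaws.com/my-bucket/some/key")) shouldBe Some("my-bucket")
  }
  "objectKeyFromUri" should "strip the leading slash from short-style keys" in {
    S3UriUtils.objectKeyFromUri(new URI("s3://my-bucket/some/key")) shouldBe Full("some/key")
  }
}
```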

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (3)

79-86: Fail fast on invalid/non‑S3 paths; avoid aborting all buckets due to one bad path.

Currently, grouping by Option[String] may include None and the first failure can abort the whole deletion.

```diff
-  def deletePaths(paths: Seq[UPath])(implicit ec: ExecutionContext): Fox[Unit] = {
-    val pathsByBucket: Map[Option[String], Seq[UPath]] = paths.groupBy(S3UriUtils.hostBucketFromUpath)
-    for {
-      _ <- Fox.serialCombined(pathsByBucket.keys) { bucket: Option[String] =>
-        deleteS3PathsOnBucket(bucket, pathsByBucket(bucket))
-      }
-    } yield ()
-  }
+  def deletePaths(paths: Seq[UPath])(implicit ec: ExecutionContext): Fox[Unit] = {
+    val (nonS3, s3Paths) = paths.partition(p => !p.getScheme.contains(PathSchemes.schemeS3))
+    if (nonS3.nonEmpty) return Fox.failure(s"Non‑S3 paths passed to ManagedS3Service: ${nonS3.mkString(", ")}")
+
+    val grouped = s3Paths.groupBy(S3UriUtils.hostBucketFromUpath)
+    grouped.get(None).foreach { invalid =>
+      // surface all invalid paths instead of a generic failure
+      return Fox.failure(s"Could not determine S3 bucket for: ${invalid.mkString(", ")}")
+    }
+    val byBucket: Map[String, Seq[UPath]] = grouped.collect { case (Some(b), ps) => (b, ps) }
+    Fox.serialCombined(byBucket.keys.toSeq)(b => deleteS3PathsOnBucket(Some(b), byBucket(b))).map(_ => ())
+  }
```

This keeps valid buckets progressing even if a caller accidentally passes a bad path and also returns actionable error text.


75-78: transferManagerBox is unused.

Remove or use it (e.g., for future multipart/copy ops) to avoid dead code.


155-158: Speed up pathIsInManagedS3 and limit to S3 credentials.

Avoid repeatedly parsing all credentials; precompute S3 roots and filter to S3 credentials only.

Example:

```scala
private lazy val managedS3Roots: Seq[UPath] =
  globalCredentials.collect {
    case S3AccessKeyCredential(name, _, _, _, _) if name.startsWith("s3://") =>
      UPath.fromStringUnsafe(name).normalize
  }

def pathIsInManagedS3(path: UPath): Boolean =
  path.getScheme.contains(PathSchemes.schemeS3) && managedS3Roots.exists(path.startsWith)
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df11bf1 and d80fec9.

📒 Files selected for processing (6)
  • app/controllers/WKRemoteDataStoreController.scala (2 hunks)
  • app/models/dataset/Dataset.scala (3 hunks)
  • app/models/dataset/DatasetService.scala (5 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/controllers/WKRemoteDataStoreController.scala
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-12T13:07:29.637Z
Learnt from: frcroth
PR: scalableminds/webknossos#8609
File: app/models/dataset/Dataset.scala:753-775
Timestamp: 2025-05-12T13:07:29.637Z
Learning: In the `updateMags` method of DatasetMagsDAO (Scala), the code handles different dataset types distinctly:
1. Non-WKW datasets have `magsOpt` populated and use the first branch which includes axisOrder, channelIndex, and credentialId.
2. WKW datasets will have `wkwResolutionsOpt` populated and use the second branch which includes cubeLength.
3. The final branch is a fallback for legacy data.
This ensures appropriate fields are populated for each dataset type.

Applied to files:

  • app/models/dataset/Dataset.scala
🧬 Code graph analysis (5)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (6)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreConfig.scala (3)
  • DataStoreConfig (11-78)
  • DataVaults (61-63)
  • S3Upload (64-68)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (4)
  • S3UriUtils (7-52)
  • hostBucketFromUri (9-22)
  • hostBucketFromUpath (24-28)
  • objectKeyFromUri (40-49)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (8)
  • UPath (55-97)
  • toRemoteUriUnsafe (122-122)
  • toRemoteUriUnsafe (199-199)
  • getScheme (120-120)
  • getScheme (197-197)
  • fromString (60-60)
  • startsWith (139-143)
  • startsWith (209-213)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/storage/CredentialConfigReader.scala (1)
  • CredentialConfigReader (8-64)
util/src/main/scala/com/scalableminds/util/tools/Box.scala (1)
  • collectFirst (614-644)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (3)
  • combined (82-93)
  • fromFuture (40-51)
  • successful (51-54)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (4)
util/src/main/scala/com/scalableminds/util/io/PathUtils.scala (2)
  • PathUtils (14-221)
  • ensureDirectory (116-120)
app/models/dataset/WKRemoteDataStoreClient.scala (1)
  • deleteOnDisk (122-129)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
  • deleteOnDisk (413-428)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (2)
  • successful (51-54)
  • failure (56-60)
app/models/dataset/Dataset.scala (4)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (1)
  • UPath (55-97)
app/utils/sql/SqlInterpolation.scala (1)
  • q (20-39)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (2)
  • flatten (269-279)
  • toFox (12-12)
util/src/main/scala/com/scalableminds/util/tools/Box.scala (1)
  • toSingleBox (63-81)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (1)
util/src/main/scala/com/scalableminds/util/tools/Box.scala (1)
  • Full (661-699)
app/models/dataset/DatasetService.scala (6)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (1)
  • UPath (55-97)
app/models/annotation/Annotation.scala (2)
  • findOne (289-296)
  • countAllByDataset (644-656)
app/models/storage/UsedStorageService.scala (2)
  • UsedStorageService (33-227)
  • refreshStorageReportForDataset (211-227)
app/models/dataset/Dataset.scala (6)
  • deleteDataset (709-737)
  • findMagPathsUsedOnlyByThisDataset (871-892)
  • findAttachmentPathsUsedOnlyByThisDataset (1278-1293)
  • findDatasetsWithMagsInDir (892-906)
  • findDatasetsWithAttachmentsInDir (1293-1306)
  • findOne (203-210)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1)
  • deletePaths (79-86)
app/models/dataset/WKRemoteDataStoreClient.scala (2)
  • deletePaths (134-141)
  • deleteOnDisk (122-129)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: frontend-tests
  • GitHub Check: backend-tests
  • GitHub Check: build-smoketest-push
🔇 Additional comments (1)
app/models/dataset/Dataset.scala (1)

10-10: LGTM: Clean import simplification

The shift from DataSourceMagInfo to UPath aligns with the path-centric refactoring for dataset deletion.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc9908c and 6219144.

📒 Files selected for processing (1)
  • app/controllers/DatasetController.scala (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
app/controllers/DatasetController.scala (5)
app/telemetry/SlackNotificationService.scala (1)
  • SlackNotificationService (11-59)
util/src/main/scala/com/scalableminds/util/requestlogging/RequestLogging.scala (2)
  • log (55-62)
  • logTime (32-48)
util/src/main/scala/com/scalableminds/util/time/Instant.scala (5)
  • toString (15-15)
  • Instant (14-45)
  • Instant (47-103)
  • now (48-48)
  • logSince (72-75)
app/utils/WkConf.scala (1)
  • Features (135-155)
app/models/dataset/DatasetService.scala (1)
  • deleteDataset (511-538)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-smoketest-push
  • GitHub Check: backend-tests
  • GitHub Check: frontend-tests
🔇 Additional comments (1)
app/controllers/DatasetController.scala (1)

38-38: Slack notifier wiring verified and correct.

Import and DI of SlackNotificationService in DatasetController are confirmed. The service is properly defined, bound as an eager singleton in WebknossosModule.scala, and consistently used across the application (SqlClient, TaskCreationService, DatasetService, JobController, AnnotationController, and others).

@fm3 fm3 mentioned this pull request Oct 28, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)

428-436: Surface invalid/ignored paths to clients.

If DataSourceService rejects paths (non‑managed S3 or outside baseDir), propagate a 400 with details instead of a generic failure/OK. This avoids silent skips. Consider returning per‑path results.
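One possible shape for such per-path reporting, sketched with hypothetical names (a suggestion, not code from this PR):

```scala
import play.api.libs.json.{Json, OFormat}

// Hypothetical response item: report the outcome per path so clients can see
// which paths were deleted and which were rejected, and why.
case class PathDeletionResult(path: String, deleted: Boolean, reason: Option[String])

object PathDeletionResult {
  implicit val jsonFormat: OFormat[PathDeletionResult] = Json.format[PathDeletionResult]
}
```

The controller could then answer 400 with the rejected entries instead of a generic failure.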

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)

439-449: Use ManagedS3Service bucket; centralize URI building.

Switch to managedS3Service.s3UploadBucketOpt is correct. Suggest moving endpoint host/URI construction into ManagedS3Service (e.g., a helper returning UPath base) to avoid duplicating URI parsing here.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6219144 and 04b1de2.

📒 Files selected for processing (8)
  • app/controllers/WKRemoteDataStoreController.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreModule.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/S3DataVault.scala (6 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (3 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/S3DataVault.scala
🧰 Additional context used
🧬 Code graph analysis (4)
app/controllers/WKRemoteDataStoreController.scala (1)
app/models/dataset/DatasetService.scala (1)
  • deleteDatasetFromDB (548-562)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (5)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (5)
  • Fox (28-228)
  • Fox (230-303)
  • serialCombined (93-97)
  • serialCombined (97-109)
  • toFox (12-12)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (1)
  • isLocal (30-30)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (2)
  • pathIsInManagedS3 (155-159)
  • deletePaths (79-86)
util/src/main/scala/com/scalableminds/util/io/PathUtils.scala (2)
  • PathUtils (14-221)
  • deleteDirectoryRecursively (184-191)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
  • deletePaths (428-437)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (8)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/AccessTokenService.scala (4)
  • validateAccessFromTokenContext (87-94)
  • UserAccessRequest (30-30)
  • UserAccessRequest (31-72)
  • webknossos (70-72)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)
  • dataBaseDir (146-146)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (1)
  • getDataSource (190-197)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1)
  • deleteOnDisk (14-36)
app/models/dataset/WKRemoteDataStoreClient.scala (2)
  • deleteOnDisk (122-129)
  • deletePaths (134-141)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1)
  • deletePaths (79-86)
util/src/main/scala/com/scalableminds/util/mvc/ExtendedController.scala (1)
  • validateJson (204-209)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (1)
  • deletePathsFromDiskOrManagedS3 (300-311)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)
util/src/main/scala/com/scalableminds/util/io/PathUtils.scala (2)
  • PathUtils (14-221)
  • ensureDirectoryBox (122-130)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: frontend-tests
  • GitHub Check: build-smoketest-push
  • GitHub Check: backend-tests
🔇 Additional comments (12)
app/controllers/WKRemoteDataStoreController.scala (3)

6-6: LGTM! Clean import removal.

The Full import is no longer needed in this controller since the Box pattern matching has moved to DatasetService.deleteDatasetFromDB.


38-51: Excellent refactoring – proper separation of concerns.

Removing annotationDAO from the controller is correct since the annotation-count logic now resides in DatasetService.deleteDatasetFromDB. This improves maintainability by centralizing deletion logic in the service layer.


202-212: Well-architected delegation to service layer.

The refactoring correctly moves deletion orchestration to DatasetService.deleteDatasetFromDB, which now handles dataset lookup, annotation counting, conditional deletion, and storage refresh. The idempotent behavior (returning success for non-existent datasets) aligns with HTTP DELETE semantics.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (2)

27-32: DI addition looks good.
The ManagedS3Service injection integrates cleanly with existing services.


313-314: Tiny nit: helper is fine.
existsOnDisk is straightforward and used correctly in refresh path.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (2)

90-94: baseDirAbsolute endpoint: OK.
Scoped to webknossos context; returning a path string is acceptable.


415-426: deleteOnDisk flow simplification: OK.
Fetching the data source first and delegating to service is cleaner; auth via webknossos context matches WK→DS usage.

Confirm no external clients rely on the previous deleteDataset(datasetId) permission here.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (5)

23-24: Imports updated for ManagedS3Service: OK.


100-103: Constructor DI: OK.
ManagedS3Service centralization improves cohesion and config handling.


146-147: PathUtils.ensureDirectoryBox usage: good.
Better error mapping vs direct ensureDirectory.


541-548: S3 TransferManager via service: good.
Clear failure if S3 not configured; checks failedTransfers. LGTM.


582-594: Failure cleanup argument order fix: OK.
deleteOnDisk now receives datasetId first; matches signature.

Contributor

@MichaelBuessemeyer MichaelBuessemeyer left a comment


Some feedback :D

Testing results:

> Delete non-virtual datasets, should run through if paths are unique.
> Other datasets symlinking into this one should block the deletion

Looks good 👍

> Data on S3 buckets not listed with global credentials should never be deleted.

Also seemed good

> Delete virtual datasets, should run through. Other datasets referencing the data should still have it afterwards.

I am having trouble setting up a scenario where an S3 mag is shared.
I enabled uploading to S3 and made a successful upload. Using the new S3 path + credentials for explore remote also worked. But the dataset is now not rendered 🤔.
The server console says:

```
backend: 2025-10-29 14:27:27,600 [error] com.scalableminds.webknossos.datastore.services.DSDatasetErrorLoggingService - Error while loading bucket for layer 2-2-1 at BucketPosition(voxelMag1 at (32, 0, 160), bucket at (1,0,5), mag(1, 1, 1), additional coordinates: ), cuboid: Cuboid((32, 0, 160) / (1, 1, 1),32,32,32) for <path removed by me>: Could not read header at zarr.json <~ Failed to read from vault path <~ The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: 1TR63AYS5G6NZF1Y, Extended Request ID: <truncated by myself>) (SDK Attempt Count: 1) Stack trace: software.amazon.awssdk.services.s3.model.S3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: <truncated by myself>) (SDK Attempt Count: 1)
```

The rest of the testing went well

@MichaelBuessemeyer
Copy link
Contributor

Oh, and I forgot to mention that not all S3 data is cleared: the segmentation-layer/agglomerates/cumsum.json is not deleted when deleting a remote S3 dataset. Maybe because it is not listed as a used attachment?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1)

39-55: Remove @tailrec annotation.

The @tailrec annotation cannot be satisfied because the recursive call (line 52) is inside a try-catch block, preventing tail-call optimization. This was already flagged in a previous review.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (1)

300-306: Lock down local deletions to the datastore base directory.

Any caller can submit a file:// UPath pointing outside the datastore root, and we will happily recurse-delete that absolute path. From the new DELETE endpoint this lets a remote user wipe arbitrary directories (even /) on the datastore host. We must validate each local path against dataBaseDir, reject anything else, and only then delete either the directory tree or single files. Also fail fast on non-managed remote paths instead of silently skipping them.

Please apply a guard like the following:

```diff
   def deletePathsFromDiskOrManagedS3(paths: Seq[UPath]): Fox[Unit] = {
-    val localPaths = paths.filter(_.isLocal).flatMap(_.toLocalPath)
-    val managedS3Paths = paths.filter(managedS3Service.pathIsInManagedS3)
-    for {
-      _ <- Fox.serialCombined(localPaths)(PathUtils.deleteDirectoryRecursively(_).toFox)
-      _ <- managedS3Service.deletePaths(managedS3Paths)
-    } yield ()
+    val (localCandidates, remoteCandidates) = paths.partition(_.isLocal)
+    val managedS3Paths = remoteCandidates.filter(managedS3Service.pathIsInManagedS3)
+    val disallowedRemote = remoteCandidates.diff(managedS3Paths)
+    val baseDir = dataBaseDir.toAbsolutePath.normalize
+
+    for {
+      _ <- Fox.fromBool(disallowedRemote.isEmpty) ?~>
+        s"Refusing to delete non-managed remote paths: ${disallowedRemote.mkString(", ")}"
+      validatedLocal <- Fox.serialCombined(localCandidates) { upath =>
+        for {
+          local <- upath.toLocalPath.toFox
+          abs = local.toAbsolutePath.normalize
+          _ <- Fox.fromBool(abs.startsWith(baseDir)) ?~>
+            s"Refusing to delete outside datastore base directory ($baseDir): $abs"
+        } yield abs
+      }
+      _ <- Fox.serialCombined(validatedLocal) { abs =>
+        if (Files.isDirectory(abs)) PathUtils.deleteDirectoryRecursively(abs).toFox
+        else tryo(Files.deleteIfExists(abs)).map(_ => ()).toFox
+      }
+      _ <- managedS3Service.deletePaths(managedS3Paths)
+    } yield ()
   }
```

This matches the earlier review feedback but is still unresolved, so please fix before release.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1)

39-72: Stop interpreting credentialName as an S3 URI.

Datastore.S3Upload.credentialName is configured as a plain credential ID (e.g. "managed-s3"). The new code assumes it is an S3 URI, immediately parsing it with new URI(...) and forcing an endpoint override based on that host. On existing installations this throws URISyntaxException, so s3ClientBox becomes a Failure and all managed S3 deletions stop working. Even if the parse succeeded, reinterpreting the name changes semantics and breaks compatibility.

Please derive bucket/endpoint/region from the actual S3AccessKeyCredential metadata (the optional region and endpoint fields) and only override the client when those optional fields are populated. The field should remain a plain identifier. For example:

```diff
-  private lazy val s3UploadCredentialsOpt: Option[(String, String)] =
-    dataStoreConfig.Datastore.DataVaults.credentials.flatMap { credentialConfig =>
-      new CredentialConfigReader(credentialConfig).getCredential
-    }.collectFirst {
-      case S3AccessKeyCredential(credentialName, accessKeyId, secretAccessKey, _, _)
-          if dataStoreConfig.Datastore.S3Upload.credentialName == credentialName =>
-        (accessKeyId, secretAccessKey)
-    }
-
-  lazy val s3UploadBucketOpt: Option[String] =
-    S3UriUtils.hostBucketFromUri(new URI(dataStoreConfig.Datastore.S3Upload.credentialName))
-
-  private lazy val s3UploadEndpoint: URI = {
-    val credentialUri = new URI(dataStoreConfig.Datastore.S3Upload.credentialName)
-    new URI("https", null, credentialUri.getHost, -1, null, null, null)
-  }
+  private lazy val s3UploadCredentialOpt: Option[S3AccessKeyCredential] =
+    dataStoreConfig.Datastore.DataVaults.credentials
+      .flatMap(new CredentialConfigReader(_).getCredential)
+      .collectFirst {
+        case cred @ S3AccessKeyCredential(name, _, _, _, _)
+            if dataStoreConfig.Datastore.S3Upload.credentialName == name => cred
+      }
+
+  private lazy val s3UploadCredentialsOpt: Option[(String, String)] =
+    s3UploadCredentialOpt.map(cred => (cred.identifier, cred.secret))
+
+  private lazy val s3EndpointOverrideOpt: Option[URI] =
+    s3UploadCredentialOpt.flatMap(_.endpoint).flatMap(ep => tryo(new URI(ep)).toOption)
+
+  private lazy val s3Region: Region =
+    s3UploadCredentialOpt.flatMap(_.region).map(Region.of).getOrElse(Region.US_EAST_1)
```

And in the builder only call endpointOverride/forcePathStyle when s3EndpointOverrideOpt is defined:

```diff
-        .crossRegionAccessEnabled(true)
-        .forcePathStyle(true)
-        .endpointOverride(s3UploadEndpoint)
-        .region(Region.US_EAST_1)
+        .crossRegionAccessEnabled(true)
+        .region(s3Region)
+      s3EndpointOverrideOpt.foreach { endpoint =>
+        builder.endpointOverride(endpoint)
+        builder.forcePathStyle(true)
+      }
```

This keeps the existing configuration contract intact and avoids runtime failures.

🧹 Nitpick comments (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1)

47-47: Consider using s interpolator instead of f.

The f interpolator is intended for formatted strings with format specifiers (like %d, %s). Without format specifiers, s is more appropriate and avoids unnecessary overhead.

Apply this diff:

```diff
-          if (retryCount == 0) targetPath else targetPath.resolveSibling(f"${targetPath.getFileName} ($retryCount)")
+          if (retryCount == 0) targetPath else targetPath.resolveSibling(s"${targetPath.getFileName} ($retryCount)")
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04b1de2 and 368893d.

📒 Files selected for processing (4)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-04-28T14:18:04.368Z
Learnt from: frcroth
PR: scalableminds/webknossos#8236
File: webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/mesh/NeuroglancerPrecomputedMeshFileService.scala:161-166
Timestamp: 2025-04-28T14:18:04.368Z
Learning: In Scala for-comprehensions with the Fox error handling monad, `Fox.fromBool()` expressions should use the `<-` binding operator instead of the `=` assignment operator to properly propagate error conditions. Using `=` will cause validation failures to be silently ignored.

Applied to files:

  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala
📚 Learning: 2025-01-27T12:06:42.865Z
Learnt from: MichaelBuessemeyer
PR: scalableminds/webknossos#8352
File: app/models/organization/CreditTransactionService.scala:0-0
Timestamp: 2025-01-27T12:06:42.865Z
Learning: In Scala's for-comprehension with Fox (Future-like type), the `<-` operator ensures sequential execution. If any step fails, the entire chain short-circuits and returns early, preventing subsequent operations from executing. This makes it safe to perform validation checks before database operations.

Applied to files:

  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala
📚 Learning: 2025-05-12T13:07:29.637Z
Learnt from: frcroth
PR: scalableminds/webknossos#8609
File: app/models/dataset/Dataset.scala:753-775
Timestamp: 2025-05-12T13:07:29.637Z
Learning: In the `updateMags` method of DatasetMagsDAO (Scala), the code handles different dataset types distinctly:
1. Non-WKW datasets have `magsOpt` populated and use the first branch which includes axisOrder, channelIndex, and credentialId.
2. WKW datasets will have `wkwResolutionsOpt` populated and use the second branch which includes cubeLength.
3. The final branch is a fallback for legacy data.
This ensures appropriate fields are populated for each dataset type.

Applied to files:

  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala
🧬 Code graph analysis (3)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (4)
util/src/main/scala/com/scalableminds/util/io/PathUtils.scala (2)
  • PathUtils (14-221)
  • ensureDirectory (116-120)
util/src/main/scala/com/scalableminds/util/tools/Box.scala (1)
  • Full (661-699)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
  • deleteOnDisk (413-428)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (2)
  • successful (51-54)
  • failure (56-60)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceService.scala (6)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (5)
  • Fox (28-228)
  • Fox (230-303)
  • serialCombined (93-97)
  • serialCombined (97-109)
  • toFox (12-12)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (3)
  • isLocal (29-29)
  • toLocalPath (111-111)
  • toLocalPath (176-176)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (2)
  • pathIsInManagedS3 (153-157)
  • deletePaths (79-86)
util/src/main/scala/com/scalableminds/util/io/PathUtils.scala (2)
  • PathUtils (14-221)
  • deleteDirectoryRecursively (184-191)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
  • deletePaths (428-437)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)
  • dataBaseDir (146-146)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/ManagedS3Service.scala (6)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreConfig.scala (3)
  • DataStoreConfig (11-75)
  • DataVaults (58-60)
  • S3Upload (61-65)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/PathSchemes.scala (1)
  • PathSchemes (3-9)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/S3UriUtils.scala (4)
  • S3UriUtils (7-52)
  • hostBucketFromUri (9-22)
  • hostBucketFromUpath (24-28)
  • objectKeyFromUri (40-49)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/UPath.scala (6)
  • UPath (54-96)
  • toRemoteUriUnsafe (121-121)
  • toRemoteUriUnsafe (188-188)
  • fromString (59-59)
  • startsWith (138-142)
  • startsWith (198-205)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/storage/CredentialConfigReader.scala (1)
  • CredentialConfigReader (8-64)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (4)
  • toFox (12-12)
  • combined (82-93)
  • fromFuture (40-51)
  • successful (51-54)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-smoketest-push
  • GitHub Check: backend-tests
  • GitHub Check: frontend-tests
🔇 Additional comments (2)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/helpers/DatasetDeleter.scala (2)

2-11: LGTM! Clean refactor of trait dependencies.

The trait signature and imports have been simplified appropriately. Removing DataSourceToDiskWriter and adding PathUtils aligns well with the new trash-based deletion approach.


24-36: LGTM! Idempotent deletion with clear logging.

The existence check and error handling are correct. Returning Fox.successful(()) when the path doesn't exist makes the deletion idempotent, which is the right behavior. The logging clearly distinguishes between actual deletion and no-op cases.

@fm3
Member

fm3 commented Oct 30, 2025

Thanks for taking the time to look at my PR! I implemented your code feedback :)

Regarding testing: Yes, it is difficult to set up dependent paths on S3 currently. I just re-tested this case by running a small neuron inferral with the webknossos-worker (with its new_publish branch). Worked as expected :)

Still, I’m a little confused by your testing result concerning explore. If I try to explore+add a dataset previously uploaded to S3 via the dataset upload, I get this error (as it should be). Do you still know what you did there exactly?

image

@fm3
Member

fm3 commented Oct 30, 2025

> Oh, and I forgot to mention that not all S3 data is cleared: the segmentation-layer/agglomerates/cumsum.json is not deleted when deleting a remote S3 dataset. Maybe because it is not listed as a used attachment?

I would have expected that cumsum.json would not have been uploaded to s3 in the first place, as it was not referenced in the datasource-properties.json 🤔 But yes, I think it’s acceptable for the moment that it is not deleted. It is also not very large.

@MichaelBuessemeyer
Contributor

🤔 I think the exploration result of the uploaded dataset is different. It is buggy imo: when I explore
<s3_root>/webknossos-uploads/sample_organization/delete_me-69034ce2800100b312b9616b, I get a successful exploration, but the mag is interpreted as a layer 🤔.
image

Maybe thats the difference between our test cases?

@MichaelBuessemeyer
Contributor

> Regarding testing: Yes, it is difficult to set up dependent paths on S3 currently. I just re-tested this case by running a small neuron inferral with the webknossos-worker (with its new_publish branch). Worked as expected :)

Ok, then no need for me to do this again, as this takes some time?

@MichaelBuessemeyer
Contributor

Oh I see, the error

```
backend: 2025-10-29 14:27:27,600 [error] com.scalableminds.webknossos.datastore.services.DSDatasetErrorLoggingService - Error while loading bucket for layer 2-2-1 at BucketPosition(voxelMag1 at (32, 0, 160), bucket at (1,0,5), mag(1, 1, 1), additional coordinates: ), cuboid: Cuboid((32, 0, 160) / (1, 1, 1),32,32,32) for <path removed by me>: Could not read header at zarr.json <~ Failed to read from vault path <~ The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: 1TR63AYS5G6NZF1Y, Extended Request ID: <truncated by myself>) (SDK Attempt Count: 1) Stack trace: software.amazon.awssdk.services.s3.model.S3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: <truncated by myself>) (SDK Attempt Count: 1)
```

states

> The AWS Access Key Id you provided does not exist in our records.

This means that the file does not exist, which is also what the error says:

> Could not read header at zarr.json

Therefore, I think the root cause is that the exploration thought it found a valid layer, but it was actually a mag. Thus, the datasource stored in the DB is buggy, and loading the dataset does not work properly.

@MichaelBuessemeyer
Contributor

But this exploration bug is very likely not related to this PR. So I'd say we can make this green, as I did some re-testing showing that it still works, and your latest changes look good?

@fm3
Member

fm3 commented Oct 30, 2025

Yes, I agree :) Also, no need to re-test the linked mags in virtual datasets, I did that today

Contributor

@MichaelBuessemeyer MichaelBuessemeyer left a comment


So, as this bug is unrelated, I opened #9034.

And as the code is good and doing what it should be doing 🟢

@fm3 fm3 merged commit 05f02f8 into master Nov 3, 2025
5 checks passed
@fm3 fm3 deleted the s3-delete branch November 3, 2025 12:27
fm3 added a commit that referenced this pull request Nov 5, 2025
…r upload (#9047)

We can have multiple managed s3 stores. Only one is selected for upload
but dataset deletion should work for all of them. This PR refactors the
ManagedS3Service a bit and selects the correct client with
endpoint/credential for each path.
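A rough sketch of what per-path credential selection can look like (illustrative only, not the actual refactored code; it assumes the convention, discussed in the review above, that a managed-S3 credential's name is its S3 root URI):

```scala
// Hypothetical helper: pick the credential whose root URI is the longest
// prefix of the given path; None means the path is not in any managed store.
def credentialForPath(path: UPath, credentials: Seq[S3AccessKeyCredential]): Option[S3AccessKeyCredential] =
  credentials
    .filter(c => path.startsWith(UPath.fromStringUnsafe(c.name)))
    .maxByOption(_.name.length)
```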

### URL of deployed dev instance (used for testing):
- https://___.webknossos.xyz

### Steps to test:
(I already tested locally, no need to re-test)
- In application.conf, insert multiple dataVaults.credentials for s3
endpoints. Set s3Upload.credentialName to the first of them
- Upload a small dataset
- set the s3Upload.credentialName to another of them
- make sure both datasets can be loaded
- delete both datasets from their respective settings views.
- backend logging should show they were deleted from their respective s3
buckets

### Issues:
- follow-up for #8924 
- fixes #9043

------
- [x] Added changelog entry (create a `$PR_NUMBER.md` file in
`unreleased_changes` or use `./tools/create-changelog-entry.py`)
- [x] Removed dev-only changes like prints and application.conf edits
- [x] Considered [common edge
cases](../blob/master/.github/common_edge_cases.md)
- [x] Needs datastore update after deployment
fm3 added a commit that referenced this pull request Nov 11, 2025
- When scanning datasources from directories, always add resolved mag
paths immediately
- Introduces realpaths for attachments
- Report all realpaths that could be determined, scale down logging for
those that couldn’t.
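For context, resolving a "realpath" for a local mag or attachment presumably amounts to symlink resolution along these lines (a minimal java.nio sketch, not the actual scanning code):

```scala
import java.nio.file.{Files, Path}
import scala.util.Try

// Resolve symlinks to a canonical real path; return None on failure (e.g. a
// referenced mag was deleted), so one broken path only produces a log entry
// instead of failing the whole datasource scan.
def resolveRealPath(path: Path): Option[Path] =
  if (Files.exists(path)) Try(path.toRealPath()).toOption
  else None
```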

### URL of deployed dev instance (used for testing):
- https://___.webknossos.xyz

### Steps to test:
- start wk with some datasets in the binaryData dir, watch backend
logging, should show realpath scan failures (provoke some by deleting
mags or attachments that are referenced in some
datasource-properties.jsons)
- All realpaths that could be determined should be added to the database

### TODOs:
- [x] immediate paths for datasources
- [x] simplify layerpath logic
- [x] attachment realpath column
- [x] don’t fail for datasource on single realpath failure
- [x] log failures
- [x] scan also attachments
- [x] evolution
- [x] adapt s3-delete code to also consider realpaths of attachments

### Issues:
- blocked by #8924 (paths routes changed)
- blocked by #8980 (datavault path refactoring)
- fixes #9018

------
- [x] Added changelog entry (create a `$PR_NUMBER.md` file in
`unreleased_changes` or use `./tools/create-changelog-entry.py`)
- [x] Removed dev-only changes like prints and application.conf edits
- [x] Considered [common edge
cases](../blob/master/.github/common_edge_cases.md)
- [x] Needs datastore update after deployment

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: MichaelBuessemeyer <[email protected]>

Successfully merging this pull request may close these issues: Delete datasets from S3.