From b4a95bfc231a498a63c21143eb83eef63a9c0d6b Mon Sep 17 00:00:00 2001
From: Terry Brady
Date: Tue, 21 May 2024 10:00:28 -0700
Subject: [PATCH] Update README.md

---
 design/object-reset/README.md | 460 ++--------------------------------
 1 file changed, 24 insertions(+), 436 deletions(-)

diff --git a/design/object-reset/README.md b/design/object-reset/README.md
index 81dfab0..f439230 100644
--- a/design/object-reset/README.md
+++ b/design/object-reset/README.md
@@ -1,437 +1,25 @@

New contents of `design/object-reset/README.md`:

# Object Reset Solutions
- [Object Version Reset](version_reset.md)
  - https://github.com/CDLUC3/mrt-doc/issues/1774
  - [Object Prune Design](prune.md)
- Object Backup Bucket and Endpoint: https://github.com/CDLUC3/mrt-doc/issues/1897
- Object "Fix" Workflow:
  - https://github.com/CDLUC3/mrt-doc/issues/1876
  - https://github.com/CDLUC3/mrt-doc/issues/1815
- Admin Tool Interface for Object Fix: https://github.com/CDLUC3/mrt-doc/issues/1886
- Object Modification (mime, collection, owner): https://github.com/CDLUC3/mrt-doc/issues/1775
- Outstanding Storage Admin Tasks
  - Initiate delete of object from admin tool
    - https://github.com/CDLUC3/mrt-doc/issues/982
    - https://github.com/CDLUC3/mrt-doc/issues/981
  - Tombstone deleted objects
    - https://github.com/CDLUC3/mrt-doc/issues/983
    - https://github.com/CDLUC3/mrt-doc/issues/980
  - Change primary storage node for a collection
    - https://github.com/CDLUC3/mrt-doc/issues/979
    - https://github.com/CDLUC3/mrt-doc/issues/978
  - Admin interface for LDAP user/role management
- Nuxeo Fix
  - https://github.com/CDLUC3/mrt-doc/issues/1655
  - Others?

Previous contents of `design/object-reset/README.md`, removed by this patch:

# Object Version Reset

## Object Prune
- [Object Prune Design](prune.md)

## Solution Categories

### Note about existing manifests
- Storage Manifest (manifest.xml)
  - contains information about all versions
  - is the only cloud storage file that is actively re-written during the life of a Merritt object
  - is difficult to modify correctly since the entire object history is captured
- Ingest Manifest (generated by storage)
  - This is easy to modify
  - This is suitable for creating a new version or a new object
  - This is not suitable for re-writing history
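To make the contrast above concrete, here is a simplified, purely conceptual sketch of the information each manifest carries, using the YAML style of the scenarios later in this document. Neither file is actually YAML (the storage manifest is XML, as noted above), and all field names here are illustrative:

```yaml
# Illustrative only: conceptual content of the two manifest types
storage_manifest:            # manifest.xml - covers the full object history, rewritten on every add
  ark: ark:/111/222          # the same example ark used in the scenarios below
  versions:
    - num: 1
      files: [producer/foo, producer/bar]
    - num: 2
      files: [producer/foo, producer/bar, producer/cat]
ingest_manifest:             # generated by storage - describes a single add, easy to regenerate or edit
  files:
    - { pathname: producer/foo, digest: 'AAA', size: 10 }
    - { pathname: producer/bar, digest: 'AAB', size: 20 }
```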
### Alter Object Composition

#### Option 1: Re-write versions in place; Purge files while rewriting; No provenance for changes
- This is David's changeToken proposal
- PRO:
  - aligns with current architecture; minimal changes required
- CON:
  - this is a one-off solution
  - no provenance recording the change

#### Option 2: Re-ingest content as new object (new ark) - no local id; Delete old object
- Note: this was used for the Dash --> Dryad migration. The owner id changed, creating a new localid.
- PRO: uses existing functionality
- CON: requires a new ark; doesn't work if a localid is present; history is lost once the old object is deleted

#### Option 3: Re-ingest content as new object (new ark) - remap local id; Delete old object
- New functionality is needed to re-map an existing localid to a new ark
- PRO: minimal changes
- CON: requires a new ark; history is lost once the old object is deleted

#### Option 4: Rebuild object from storage manifest (new object_id, same ark); Purge orphaned files
- PRO: could save the storage manifest before and after as a provenance record; opportunity to review changes before applying
- CON: does not facilitate renames

#### Option 5: Collapse History and Reset object to V1 (same ark); Purge orphaned files; History is lost
- PRO:
  - may be easy to implement the cleanup logic; ark is retained
  - this could be a suitable solution to offer to depositors to initiate
- CON:
  - history is lost, no provenance of change, no opportunity to preview

#### Option 6: Collapse History and Reset object to new version (same ark); Purge orphaned files; History is documented but inaccessible
- Introduce the concept of a "stubbed" version in manifest.xml to retain history
- PRO:
  - ark is retained and history is kept; a way is needed to retain the old history without items in cloud storage
- CON:
  - may complicate replication logic
  - how would the depositor visualize the stubbed history?
  - what history would exist in the inventory database?

#### ~Option 7: Tombstone files in cloud storage to convey key deletions/key renames; History is documented but inaccessible~
- PRO:
  - change history could be reassembled by referencing 0-byte tombstones; changes are first applied to cloud storage
- CON:
  - David and Terry could not identify a benefit to justify this complexity
  - Tombstone files would complicate scan resolution issues

#### Option 8: Repair transactions: apply file delete and file replace transactions (conveyed in json) to an existing object to clean up the object
- This solution might introduce the concept of a stubbed file in the manifest.xml OR the file would be removed entirely.
- This is an attempt to define a generic solution around David's proposed changeToken fix
- A flag could be included in the transaction to allow/disallow the complete deletion of all files with identical checksum values (see the sketch below)
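A hypothetical sketch of such a repair transaction follows. The design above calls for JSON; it is shown here in the YAML style used elsewhere in this document, and every field name is illustrative rather than an implemented format:

```yaml
# Hypothetical repair transaction for Option 8 - illustrative field names only
ark: ark:/111/222
transactions:
  - op: delete
    pathname: producer/unwanted.tmp
    allow_duplicate_checksum_delete: false   # the flag discussed above
  - op: replace
    pathname: producer/foo
    digest: 'EEE'
    size: 40
reason: "Remove an accidentally ingested temporary file"
```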
#### Option 9: Use existing tools to make current version "correct"; Apply a "PRUNE" transaction to purge files that are not in the current version of the object.
- PRO:
  - leverages existing tools
  - depositor can preview purge candidates in the Merritt UI (and in an enhanced API)
  - depositor initiates the PRUNE
  - provenance can be recorded when the PRUNE is applied
    - should we record a new version to save this provenance information?
- CON:
  - the fix is a 2-step process
- Process Description
  - Repair option 1 (for depositors)
    - use mrt-delete files to correct the current object
  - Repair option 2 (for depositors and Merritt Team)
    - run a Merritt ADD using a storage-generated ingest manifest for reconstruction
    - generate the baseline manifest from the current version
    - if we expose the storage manifest paths as an input format, what restrictions do we need to set on the use of these patterns? Or, should the use of these paths be a privileged operation?
    - question: can the manifest be modified so that a rename effectively takes place?
  - Repair option 3 (for Merritt Team)
    - run a Merritt ADD using a storage-generated ingest manifest for reconstruction
    - generate the baseline manifest from ALL versions
    - this would allow versioning mistakes to be created
  - Submit the Repair manifest as a Merritt ADD
  - Review the correction in the Merritt UI
    - The Merritt Object API could add a JSON array of "prune candidates" from prior versions
  - Apply a Merritt PRUNE transaction
    - Submit this through ingest with no payload
  - Prune options
    - prune any file key that has not been pulled forward to the current version
    - prune any file key that has not been pulled forward to the current version AND that has a duplicate checksum on a different key
  - Storage will
    - delete prune-able keys from primary storage
    - remove pruned keys from the storage manifest (or mark them as pruned)
    - trigger an inventory rebuild
    - trigger a re-replication of the entire object
- Other Notes
  - The current billing process does not account for content deletion. If a motivation for fixing content is to reduce cost, we will need to add features to decrement billing numbers
    - tombstone deleted file data to a new database table
    - run a monthly reconciliation process (as needed by collection?) to account for content removal
  - Consider an API feature that would allow a user to identify content that could be pruned (a sketch follows the use cases below)
  - Create a report to total prune-able counts at a collection level

### Alter Object Hierarchy
- Change Owner
  - The latest mrt-owner.txt file will determine ownership for ALL versions of the object
  - Local id collisions might make this impossible
- Change collection
  - The latest mrt-membership.txt file will determine membership for ALL versions of the object
  - Collection changes could be incompatible with the primary storage node/secondary nodes for a collection

### Provenance Options
- Create a mrt-repair.txt file
  - Embed before and after versions of the storage manifest
  - Include date/time of operation
  - Include explanation of fix

## Use Cases

### Use Case: Nuxeo Path Name Cleanup

#### Issue
- file pathnames contain URL parameters and may not be meaningful to future users
- when the file pathname changes, a new file is created
- duplicate file content exists under different pathnames
- storage cost saving opportunity

#### Scope: 100,000+ objects

```sql
show query here
```

---

### Use Case: Depositor accidentally uploads unwanted file

#### Examples
- Accidental PII ingest
- Very large Dryad files rejected by curators in a subsequent version
- Storage cost saving opportunity

#### Scope: Case by Case

---

### Use Case: Excessive Object Versioning

#### Example
- eScholarship metadata changes
- very granular metadata changes may not be meaningful to future users accessing preservation content

---

### Use Case: Incomplete Processing of Version Addition

#### Scope: Rare, driven by system outage

---

### Use Case: Pathname Normalization

#### Issue
During a large ingest effort, a desire arises to normalize pathnames for a large batch of files that have already been ingested

---

### Use Case: Bulk delete files unsuitable for preservation

#### Issue
- thumbs.db files added for images
- .DS_Store files
- __MACOSX files
- .git or .svn repository ingested with content

---

### Use Case: Storage Savings

#### Issue: Purge unwanted files from old object versions
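Several of the use cases above (bulk delete, storage savings, excessive versioning) depend on first seeing what would be purged. Option 9 suggests that the Merritt Object API could return a list of "prune candidates"; the following is a hypothetical sketch of such a listing for the scenario object introduced in the next section. The endpoint and field names are illustrative, not an existing API:

```yaml
# Hypothetical "prune candidates" listing for the scenario object below - illustrative only
ark: ark:/111/222
current_version: 3
prune_candidates:
  - { pathname: producer/foo, version: 1, digest: 'AAA', size: 10, duplicate_checksum_in_current: false }
  - { pathname: producer/cat, version: 1, digest: 'AAC', size: 30, duplicate_checksum_in_current: true }
  - { pathname: producer/foo, version: 2, digest: 'FFF', size: 40, duplicate_checksum_in_current: false }
```

Under the stricter prune option in Option 9 (duplicate checksum required), only the producer/cat entry would be purged; under the broader option, all three candidates would be purged.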
---

## Implementation Options

### Scenario
```yaml
ark: ark:/111/222
versions:
- num: 1
  files:
    producer/foo: { size: 10, digest: 'AAA', path: "1/producer/foo" }
    producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
    producer/cat: { size: 30, digest: 'AAC', path: "1/producer/cat" }
  size: 60
- num: 2
  files:
    producer/foo: { size: 40, digest: 'FFF', path: "2/producer/foo" }
    producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
    producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
  size: 70
- num: 3
  files:
    producer/foo: { size: 40, digest: 'EEE', path: "3/producer/foo" }
    producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
    producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
  size: 40
size: 170
```

### Option: Mark Current Version as the Reset Version

```yaml
ark: ark:/111/222
versions:
- num: 1
  files:
    producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
  size: 20
- num: 2
  files:
    producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
  size: 30
- num: 3
  files:
    producer/foo: { size: 40, digest: 'EEE', path: "3/producer/foo" }
    producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
    producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
  size: 40
size: 90
```

#### Process
- User must fix the current version of the object using existing tools (ingest updates & mrt-delete)
- New transaction is run to mark the state of the current object as definitive (see the sketch below)
  - Is this sent through ingest?
  - Is this an administrative action sent to inventory?
- Any files not found in the current version of the object will be removed from the storage manifest
- Inventory will remove these files
- These files will be eligible for deletion from cloud storage
  - Using the storage scan process?
  - Using new file delete operations
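Neither the transport (ingest vs. an administrative call to inventory) nor the payload of this "mark as definitive" transaction is defined above. The following is a purely hypothetical sketch of what such a request might carry, folding in the provenance fields suggested in the Provenance Options section; every field name is illustrative:

```yaml
# Hypothetical "mark current version as reset" request - illustrative only
ark: ark:/111/222
action: mark-reset
reset_version: 3                # the version whose content becomes definitive
purge_orphaned_files: true      # drop files not pulled forward into version 3
provenance:
  date: 2024-05-21T10:00:28-07:00
  explanation: "Collapse history; only files present in the current version are retained"
```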
#### Pros
- No files are cloned/copied
- Version number stays intact
- Minimizes processing

#### Cons
- Relies heavily on the existing mrt-delete process to reconcile the current version
- No mechanism exists to normalize or correct pathnames
- No evidence is left to indicate that this process took place

---

### Option: Pull forward desired files using an ingest manifest exported from storage

#### Step 1: Pull Forward
```yaml
ark: ark:/111/222
versions:
- num: 1
  files:
    producer/foo: { size: 10, digest: 'AAA', path: "1/producer/foo" }
    producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
    producer/cat: { size: 30, digest: 'AAC', path: "1/producer/cat" }
  size: 60
- num: 2
  files:
    producer/foo: { size: 40, digest: 'FFF', path: "2/producer/foo" }
    producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
    producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
  size: 70
- num: 3
  files:
    producer/foo: { size: 40, digest: 'EEE', path: "3/producer/foo" }
    producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
    producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
  size: 40
- num: 4
  files:
    producer/foo: { size: 40, digest: 'EEE', path: "4/producer/foo" }
    producer/bar: { size: 20, digest: 'AAB', path: "4/producer/bar" }
    producer/dog: { size: 30, digest: 'AAC', path: "4/producer/dog" }
  size: 90
  reload: true
size: 260
```

#### Step 2: Purge Old Versions
```yaml
ark: ark:/111/222
versions:
- num: 4
  files:
    producer/foo: { size: 40, digest: 'EEE', path: "4/producer/foo" }
    producer/bar: { size: 20, digest: 'AAB', path: "4/producer/bar" }
    producer/dog: { size: 30, digest: 'AAC', path: "4/producer/dog" }
  size: 90
  reload: true
size: 90
```

#### Process
- Ingest manifest is generated from Merritt Storage (see the sketch below)
- Desired files are selected and renamed as needed
- Manifest is sent to Ingest as a new request type of `replace`
- A new version is generated containing a copy of all desired content
- Storage manifest should reflect that the replacement took place at a specific version
- Files from prior versions should be treated as if they no longer exist
- Older version files will be purge-able by inventory
- Older version files will be purge-able from cloud storage
- A purge process should exist to purge the older version content
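For illustration, here is a sketch of the kind of storage-generated manifest that could drive the pull-forward step for the scenario object. It is written in the YAML style used in this document rather than the actual ingest manifest syntax, and the `replace` request type is the proposal above, not an existing feature; all field names are illustrative:

```yaml
# Hypothetical storage-generated manifest submitted as a "replace" request - illustrative only
ark: ark:/111/222
request_type: replace
files:
  - { source_key: "3/producer/foo", pathname: producer/foo, digest: 'EEE', size: 40 }
  - { source_key: "1/producer/bar", pathname: producer/bar, digest: 'AAB', size: 20 }
  - { source_key: "2/producer/dog", pathname: producer/dog, digest: 'AAC', size: 30 }
# A rename would be expressed by changing the pathname while keeping the source_key
```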
#### Pros
- User can review the new state of the object before electing to purge the old versions
- The storage manifest and the version numbering will convey that this maintenance activity took place

#### Cons
- Temporary duplication of object content
- Duplication could persist if the user does not enable/approve the running of the purge process
- Replication logic will need to change to not assume that versioning starts at 1

---

### Option: Modify Merritt file paths to use content hash

```yaml
ark: ark:/111/222
versions:
- num: 1
  files:
    producer/foo: { size: 10, digest: 'AAA', path: "AAA/10" }
    producer/bar: { size: 20, digest: 'AAB', path: "AAB/20" }
    producer/cat: { size: 30, digest: 'AAC', path: "AAC/30" }
  size: 60
- num: 2
  files:
    producer/foo: { size: 40, digest: 'FFF', path: "FFF/40" }
    producer/bar: { size: 20, digest: 'AAB', path: "AAB/20" }
    producer/dog: { size: 30, digest: 'AAC', path: "AAC/30" }
  size: 40
- num: 3
  files:
    producer/foo: { size: 40, digest: 'EEE', path: "EEE/40" }
    producer/bar: { size: 20, digest: 'AAB', path: "AAB/20" }
    producer/dog: { size: 30, digest: 'AAC', path: "AAC/30" }
  size: 40
size: 140
```

#### Process
- Storage key should be `ark | hash | length` regardless of pathname/version
- Comment (dloy): this only becomes needed if we allow a reset to preserve more than one version; it is not needed when only the current version is preserved

#### Pros
- Storage optimized solution - duplicate files are only stored once per object

#### Cons
- Loss of semantically meaningful key names (dloy: serious problem)

---

### Option: Reset Version should always become version 1

- _dloy: see below._

#### Process

#### Pros

#### Cons

---

### Option: Reset object using a new add version

#### Process
- a version add with reset request is received by ingest
- storage deletes the content for the specific ark
- all add version content is treated as new using the original ark
- inventory receives the reset request
- inventory deletes all existing content on the ark
- inventory treats the added content as new

#### Pros
- same key architecture
- delete and add functions do not change for storage and inventory
- resulting storage sets all content to version 1, so there is a specific match of content to saved keys
- all changes in the new add populate the Merritt inventory without requesting hybrid processing
- preserving the ark allows outside references to continue working

#### Cons
- all content will be written as version 1, so no earlier content is saved - redundant processing
- recovery from a failed process may be an issue

---

### Option: Reset object using existing current version
- Note: this is similar to _Option: Reset Version should always become version 1_ above

#### Process
- a reset request is received by ingest - without action
- the reset is passed to storage
- storage uses the content S3 keys to copy each current file (or leave it alone) to a version 1 key. If a current file uses a previous version 10 key, then the version 10 key is copied to version 1. The current architecture will not have collisions if only the current version is preserved (see the sketch below)
- storage deletes all unused content for the specific ark (key version > 1)
- inventory deletes all existing content on the ark
- inventory treats the added content as new

#### Pros
- same key architecture
- resulting storage sets all content to version 1, so there is a specific match of content to saved keys
- all changes in the new add populate the Merritt inventory without requesting hybrid processing
- preserving the ark allows outside references to continue working

#### Cons
- new build/copy logic required in storage
- new delete logic required for this special case
- much content will be rewritten, except for matching keys
- recovery from a failed process may be an issue
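As a concrete illustration of this option, here is what the scenario object from earlier in this section might look like after the reset, assuming current-version files are copied to version 1 keys as described in the process above. This is a sketch, not an implemented behavior:

```yaml
# Scenario object after a reset that preserves only the current version - illustrative only
ark: ark:/111/222
versions:
- num: 1
  files:
    producer/foo: { size: 40, digest: 'EEE', path: "1/producer/foo" }  # copied from 3/producer/foo
    producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }  # already stored at a version 1 key
    producer/dog: { size: 30, digest: 'AAC', path: "1/producer/dog" }  # copied from 2/producer/dog
  size: 90
size: 90
```

The keys for this ark with version greater than 1 (2/producer/foo, 2/producer/dog, 3/producer/foo) would then be deleted, as described in the process above.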