Skip to content

Commit

Permalink
Update version_reset.md
Browse files Browse the repository at this point in the history
  • Loading branch information
terrywbrady authored May 21, 2024
1 parent b4a95bf commit 8aa97a5
Showing 1 changed file with 0 additions and 312 deletions.
312 changes: 0 additions & 312 deletions design/object-reset/version_reset.md
Original file line number Diff line number Diff line change
Expand Up @@ -123,315 +123,3 @@
- Include date/time of operation
- Include explanation of fix

## Use Cases

### Use Case: Nuxeo Path Name Cleanup

#### Issue
- file pathnames contain url parameters and may not be meaningful to future users
- when the file pathname changes, a new file is created
- duplicate file content exists under different pathnames
- storage cost saving opportunity

#### Scope: 100,000+ objects

```sql
show query here
```

---

### Use Case: Depositor acceidentally uploads unwanted file

#### Examples
- Accidental PII ingest
- Very large Dryad files rejected by curators in subsequent version
- Storage cost saving opportunity

#### Scope: Case by Case


---

### Use Case: Excessive Object Versioning

#### Example
- escholarship metadata changes
- very granular metadata changes may not be meaningful to future users accessing preservation content


---

### Use Case: Incomplete Processing of Version addition

#### Scope: Rare, driven by system outage


---

### Use Case: Pathname Normalization

#### Issue
During the scope of a large ingest effort, a desire arises to normalize pathnames for a large batch of files that have already been ingested


---

### Use Case: Bulk delete files unsuitable for preservation

#### Issue
- thumbs.db files added for images
- .DS_store files
- __MACOSX files
- .git or .svn repository ingested with content


---

### Use Case: Storage Savings

#### Issue: Purge unwanted files from old object versions


---

## Implmentation Options

### Scenario
```yaml
ark: ark:/111/222
versions:
- num: 1
files:
producer/foo: { size: 10, digest: 'AAA', path: "1/producer/foo" }
producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
producer/cat: { size: 30, digest: 'AAC', path: "1/producer/cat" }
size: 60
- num: 2
files:
producer/foo: { size: 40, digest: 'FFF', path: "2/producer/foo" }
producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
size: 70
- num: 3
files:
producer/foo: { size: 40, digest: 'EEE', path: "3/producer/foo" }
producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
size: 40
size: 170
```
### Option: Mark Current Version as the Reset Version
```yaml
ark: ark:/111/222
versions:
- num: 1
files:
producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
size: 20
- num: 2
files:
producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
size: 30
- num: 3
files:
producer/foo: { size: 40, digest: 'EEE', path: "3/producer/foo" }
producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
size: 40
size: 90
```
#### Process
- User must fix the current version of the object using existing tools (ingest updates & mrt-delete)
- New transaction is run to mark the state of the current object as definitive
- Is this sent through ingest?
- Is this an administrative action sent to inventory?
- Any files not found in the current version object will be removed from the storage manifest
- Inventory will remove these files
- These files will be eligible for deletion from cloud storage
- Using the storage scan process?
- Using new file delete operations
#### Pros
- No files are cloned/copied
- Version number stays intact
- Minimizes processing
#### Cons
- Relies heavily on the existing mrt-delete process to reconcile the current version
- No mechanism exists to normalize or correct pathnames
- No evidence is left to indicate that this process took place
---
### Option: Pull forward desired files using an ingest manifest exported from storage
#### Step 1: Pull Forward
```yaml
ark: ark:/111/222
versions:
- num: 1
files:
producer/foo: { size: 10, digest: 'AAA', path: "1/producer/foo" }
producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
producer/cat: { size: 30, digest: 'AAC', path: "1/producer/cat" }
size: 60
- num: 2
files:
producer/foo: { size: 40, digest: 'FFF', path: "2/producer/foo" }
producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
size: 70
- num: 3
files:
producer/foo: { size: 40, digest: 'EEE', path: "3/producer/foo" }
producer/bar: { size: 20, digest: 'AAB', path: "1/producer/bar" }
producer/dog: { size: 30, digest: 'AAC', path: "2/producer/dog" }
size: 40
- num: 4
files:
producer/foo: { size: 40, digest: 'EEE', path: "4/producer/foo" }
producer/bar: { size: 20, digest: 'AAB', path: "4/producer/bar" }
producer/dog: { size: 30, digest: 'AAC', path: "4/producer/dog" }
size: 90
reload: true
size: 260
```
#### Step 2: Purge Old Versions
```yaml
ark: ark:/111/222
versions:
- num: 4
files:
producer/foo: { size: 40, digest: 'EEE', path: "4/producer/foo" }
producer/bar: { size: 20, digest: 'AAB', path: "4/producer/bar" }
producer/dog: { size: 30, digest: 'AAC', path: "4/producer/dog" }
size: 90
reload: true
size: 90
```
#### Process
- Ingest manifest is generated from Merritt Storage
- Desired files are selected and renamed as needed
- Manifest is sent to Ingest as a new request type of `replace`
- A new version is generated containing a copy of all desired content
- Storage manifest should reflect that the replacement took place at a specific version
- Files from prior versions should be treated as if they no longer exist
- Older version files will be purge-able by inventory
- Older version files will be purge-able from cloud storage
- A purge process should exist to purge the older version content

#### Pros
- Use can review the new state of the object before electing to purge the old versions
- The storage manifest and the version numbering will convey that this maintenance activity took place

#### Cons
- Temporary duplication of object content
- Duplication could persist if the user does not enable / approve the running of the purge process
- Replication logic will need to change to not assume that versioning starts at 1


---

### Option: Modify Merritt file paths to use content hash

```yaml
ark: ark:/111/222
versions:
- num: 1
files:
producer/foo: { size: 10, digest: 'AAA', path: "AAA/10" }
producer/bar: { size: 20, digest: 'AAB', path: "AAB/20" }
producer/cat: { size: 30, digest: 'AAC', path: "AAC/30" }
size: 60
- num: 2
files:
producer/foo: { size: 40, digest: 'FFF', path: "FFF/40" }
producer/bar: { size: 20, digest: 'AAB', path: "AAB/20" }
producer/dog: { size: 30, digest: 'AAC', path: "AAC/30" }
size: 40
- num: 3
files:
producer/foo: { size: 40, digest: 'EEE', path: "EEE/40" }
producer/bar: { size: 20, digest: 'AAB', path: "AAB/20" }
producer/dog: { size: 30, digest: 'AAC', path: "AAC/30" }
size: 40
size: 140
```

#### Process
- Storage key should be `ark | hash | length` regardless of pathname/version
- Comment dloy - this only becomes needed if we allow a reset to preserve more than one version - preserving only current not needed

#### Pros
- Storage optimized solution - duplicate files are only stored once per object

#### Cons
- Loss of semantically meaningful key names (dloy-serious problem)

---

### Option: Reset Version should always become version 1

- _dloy : see below._

#### Process

#### Pros

#### Cons


---

### Option: Reset object using a new add version

#### Process
- a version add with reset request is received by ingest
- storage deletes the content for the specific ark
- all add version content is treated as new using the original ark
- inventory receives reset request
- inventory deletes all existing content on the ark
- inventory treats the added content as new

#### Pros
- same key architecture
- delete and add functions do not change for storage and inventory
- resulting storage sets all content as 1 so specific match of content to saved keys
- all changes in new add populate merritt inv without requesting hybrid processing
- preserving the ark alloww outside references to continue working

#### Cons
- all content will be written as version 1 so no earlier content is saved - redundant processing
- recovery of failed process may be issue

---

### Option: Reset object using existing current version
- Note this is similar to _Option: Reset Version should always become version 1_ above

#### Process
- a reset request is received by ingest - without action
- storage passed reset
- storage uses content S3 keys to copy (or leave alone) to a version 1 key. If current file uses previous version 10 then version 10 copied to version 1. Current architecture will not have collisions if only current is preserved
- storage delete all unused content for specific ark (key version > 1)
- inventory deletes all existing content on the ark
- inventory treats the added content as new

#### Pros
- same key architecture
- resulting storage sets all content as 1 so specific match of content to saved keys
- all changes in new add populate merritt inv without requesting hybrid processing
- preserving the ark alloww outside references to continue working

#### Cons
- new build copy logic required in storage
- new delete logic required for this special case
- Much content will be rewritten except for matching
- recovery of failed process may be issue

0 comments on commit 8aa97a5

Please sign in to comment.