Annotation Design Discussion #56
Comments
Thanks for getting this started, @doulikecookiedough. We should discuss, as I don't understand the parent/child nomenclature you are mentioning above.
It might be best to have these grouped as a single file graph for each object (serialized in JSON-LD or RDF formats), which would be indexed like other metadata. So in this case, we would have 3 annotation files, one each for |
@mbjones Thank you for the prompt feedback! When I was reviewing how we would implement To clarify, using your example above:
If we're on the same page now, I'm wondering how HashStore plays a role in the Annotation implementation...? It would appear that the calling app would still be working with metadata in HashStore with the same Public API calls. If not, should we save further discussion here for the backend dev meeting to discuss so Robyn/Rushi can also get additional context? |
Maybe, or maybe not. Depends on the format. If stored as JSON-LD, it would likely have more lines, and certainly a lot more if in RDF/XML format. |
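To make the format comparison concrete, here is a minimal sketch that serializes the same three triples as N-Triples and as JSON-LD and counts the lines. The triple labels are copied from the diagrams in this thread; the `urn:example:` identifier base and the helper names are assumptions for illustration only, and the JSON-LD layout is just one of several valid shapes.

```python
import json

# Prefix IRIs for the vocabularies named in the diagrams.
PREFIXES = {
    "ore": "http://www.openarchives.org/ore/terms/",
    "cito": "http://purl.org/spar/cito/",
    "prov": "http://www.w3.org/ns/prov#",
}
BASE = "urn:example:"  # hypothetical identifier base, not a real DataONE scheme

# (subject, predicate, object) triples copied from the diagram labels.
TRIPLES = [
    ("EntityPID1", "ore:aggregatedBy", "OREPID3"),
    ("EntityPID1", "cito:documentedBy", "EMLPID4"),
    ("EntityPID2", "prov:wasDerivedFrom", "EntityPID1"),
]


def expand(curie):
    """Expand a prefixed name such as 'ore:aggregatedBy' to a full IRI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def to_ntriples(triples):
    """One line per triple."""
    return "\n".join(
        f"<{BASE}{s}> <{expand(p)}> <{BASE}{o}> ." for s, p, o in triples
    )


def to_jsonld(triples):
    """One node object per subject, pretty-printed."""
    nodes = {}
    for s, p, o in triples:
        node = nodes.setdefault(s, {"@id": BASE + s})
        node.setdefault(p, []).append({"@id": BASE + o})
    doc = {"@context": PREFIXES, "@graph": list(nodes.values())}
    return json.dumps(doc, indent=2)


nt = to_ntriples(TRIPLES)
jl = to_jsonld(TRIPLES)
print(len(nt.splitlines()), "lines as N-Triples")
print(len(jl.splitlines()), "lines as JSON-LD")
```

The JSON-LD line count depends heavily on indentation and on how much is moved into `@context`, so the comparison is only indicative.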
Notes: Background:
Design Goal:
Annotation General Design/Infrastructure Discussion:
Potential Issue to Ponder:
Next Step:
|
Mermaid Diagram for review, summary and a few thoughts:
graph TD
subgraph RDFTriple2
direction BT
S["EntityPID1"]
P["ore:aggregatedBy"]
O["OREPID3"]
end
subgraph RDFTriple1
direction BT
S1["EntityPID1"]
P1["cito:documentedBy"]
O1["EMLPID4"]
end
subgraph RDFTriple3
direction BT
S2["EntityPID2"]
P2["prov:wasDerivedFrom"]
O2["EntityPID1"]
end
%% subgraph RDFTriple4
%% direction BT
%% S3["EMLPID4"]
%% P3["dwt:subject"]
%% O3["adcad:Hydrology"]
%% end
ANNO-1 -. "ANNO-1 content" .-> RDFTriple2
ANNO-2 -. "ANNO-2 content" .-> RDFTriple3
subgraph Dataset
%% C1["CSV-1"]
%% C2["CSV-2"]
%% C3["CSV-3"]
%% C4["CSV-4"]
%% C5["CSV-5"]
%% C6["..."]
%% C7["CSV-1000"]
end
O -. "Expresses that a data file is a member of a package" .-> Dataset
subgraph hs["`**HashStore**`"]
subgraph /objects
direction RL
OBJ-1
OBJ-2
OBJ-3
OBJ-4
OBJ-5
OBJ-6
end
subgraph /metadata
direction TB
META-0
META-1
META-2
%% META-3
%% META-4
%% META-5
ANNO-N
ANNO-0
ANNO-1
ANNO-2
ANNO-3
ANNO-4
ANNO-5
end
end
O1 -. "Expresses that a data file is documented by a metadata file" .-> META-2
Next Step:
|
To Do:
|
Updated Diagram (not final):
flowchart TD
ds((dm.dataset))
dm1(dou.mok.1) -- ore:aggregatedBy --> ds
dm2(dou.mok.rev.1) -- prov:wasDerivedFrom --> dm1
dm2(dou.mok.rev.1) -- obj:locatedAt --> sha256(hashstore:dou.mok.rev.1)
dm2 -- ore:aggregatedBy --> ds
dm1 -- prov:wasDerivedFrom --> a1(anno:dou.mok.1)
dm1 -- cito:documentedBy --> sm2(sysmeta:dou.mok.1)
ds -- dwt:subject --> hy1(adcad:Hydrology)
hy1 .-> ANNO-4
a1 .-> ANNO-3
dm1 .-> ANNO-2
ds .-> ANNO-1
ds .-> ANNO-0
ds -- cito:documentedBy --> sm1(sysmeta:dm.dataset)
sm1 .-> SYSMETA-1
sm2 .-> SYSMETA-2
sha256 .-> OBJ-1
subgraph hs["HashStore"]
subgraph /objects
direction RL
OBJ-1
OBJ-2
OBJ-3
objdot("...")
end
subgraph /metadata
direction RL
SYSMETA-1
ANNO-0
ANNO-1
ANNO-2
ANNO-3
ANNO-4
SYSMETA-2
metadot("...")
end
end
classDef orange fill:#f96,stroke-width:3px;
class ds orange
|
Updated Diagram (continued...):
flowchart TD
subgraph dmcr["dou.mok.rev.1"]
direction RL
dmpr1["dou.mok.rev.1 - prov:wasDerivedFrom - dou.mok.1"]
end
dmcr .-> ANNO-4
subgraph dmc["dou.mok.1"]
direction RL
dmp1["dou.mok.1 - ore:aggregatedBy - datasetpkg"]
dmp2["dou.mok.1 - rdf:type - hsfs:obj\n(metacat/objects/OBJ-1)"]
dmp3["dou.mok.1 - cito:documentedBy - hsfs:sysmeta\n(metacat/metadata/SYSMETA-2)"]
dmp4["dou.mok.1 - hsfs:algo - 'SHA-256'"]
dmp5["dou.mok.1 - hsfs:checksum - 'a1...f9'"]
end
dmc .-> ANNO-3
subgraph ds["datasetpkg"]
direction RL
dsp2["datasetpkg - ore:aggregates - dou.mok.1"]
dsp3["datasetpkg - cito:documentedBy - hsfs:sysmeta\n(metacat/metadata/SYSMETA-1)"]
dsp1["datasetpkg - dwt:subject - adcad:Hydrology"]
end
ds .-> ANNO-2
subgraph hs["HashStore"]
subgraph /objects
direction RL
o1["OBJ-1\n(SHA-256 Hash:\n datasetpkg')"]
OBJ-2
OBJ-3
objdot("...")
end
subgraph /metadata
direction RL
sys1["SYSMETA-1\n(SHA-256 Hash:\n datasetpkg + format_id')"]
ANNO-0
ANNO-1
ANNO-2
ANNO-3
ANNO-4
sys2["SYSMETA-2\n(SHA-256 Hash:\n dou.mok.1 + format_id')"]
metadot("...")
end
end
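The grouping drawn above (one annotation document per subject) can be sketched as follows. This is only an illustration under the assumption that every statement about a subject lands in that subject's annotation document; the triples are a subset copied from the diagram, and `group_by_subject` is a hypothetical helper, not part of HashStore.

```python
from collections import defaultdict

# A subset of the statements shown in the diagram.
TRIPLES = [
    ("dou.mok.rev.1", "prov:wasDerivedFrom", "dou.mok.1"),
    ("dou.mok.1", "ore:aggregatedBy", "datasetpkg"),
    ("dou.mok.1", "hsfs:algo", "SHA-256"),
    ("datasetpkg", "ore:aggregates", "dou.mok.1"),
    ("datasetpkg", "dwt:subject", "adcad:Hydrology"),
]


def group_by_subject(triples):
    """Collect (predicate, object) pairs under the subject they describe."""
    docs = defaultdict(list)
    for subject, predicate, obj in triples:
        docs[subject].append((predicate, obj))
    return dict(docs)


annotations = group_by_subject(TRIPLES)
for subject, statements in annotations.items():
    print(subject, "->", len(statements), "statement(s)")
```

Each key would correspond to one annotation document in `/metadata`; how those documents are serialized and addressed is left open here.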
Design Challenges & Questions (cont):
|
Proposed Semantic Data Info Package (SDIP) Diagram from Matt's sketch:
flowchart TD
H13["H13
H13 type PACKAGE
H13 contains H11
H13 contains H12
H13 contains H10"]
subgraph ORE
H12["H12
H12 type ANNO"]
H1["H1: ORE"]
P1["P1: Sysmeta"]
P1 --> H1
H12 --> H1
end
subgraph EML
H11["H11
H11 type ANNO"]
H2["H2: EML"]
P2["P2: Sysmeta"]
H11 --> H2
P2 --> H2
end
H10["H10
H10 type FOLDER
H10 contains H6
H10 contains H9"]
H13 --> H12
H13 --> H11
H13 --> H10
subgraph Blob1
H6["H6
H6 type ANNO
H6 contains H3"]
H3["H3: Data"]
P3["P3: Sysmeta"]
H6 --> H3
P3 --> H3
end
H10 --> H6
H9["H9
H9 type FOLDER
H9 contains H7
H9 contains H8"]
H10 --> H9
subgraph Blob2
H7["H7
H7 type ANNO
H7 contains H4
H4 type BLOB"]
H4["H4: Data"]
P4["P4: Sysmeta"]
H7 --> H4
P4 --> H4
end
H9 --> H7
subgraph Blob3
H8["H8
H8 type ANNO
H8 contains H5
H5 type BLOB"]
H5["H5: Data"]
P5["P5: Sysmeta"]
H8 --> H5
P5 --> H5
end
H9 --> H8
classDef cyan fill:#7ff;
class H13,H12,H11,H10,H6,H9,H7,H8 cyan
classDef mage fill:#ff7ffe;
class P1,P2,P3,P4,P5 mage
classDef lime fill:#dfffda;
class H1,H2 lime
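As a rough illustration of how a client might resolve a package from this structure, the sketch below walks the edges of the diagram from the package node `H13` down to the stored objects. The `EDGES` and `LABELS` tables are transcribed from the diagram's arrows and node labels; the traversal function is a hypothetical helper, and the sysmeta nodes (`P1` to `P5`) are left out because no edge from the package reaches them.

```python
# Outgoing edges transcribed from the diagram (parent -> children).
EDGES = {
    "H13": ["H12", "H11", "H10"],
    "H12": ["H1"],
    "H11": ["H2"],
    "H10": ["H6", "H9"],
    "H9": ["H7", "H8"],
    "H6": ["H3"],
    "H7": ["H4"],
    "H8": ["H5"],
}

# Labels of the stored objects at the bottom of the diagram.
LABELS = {"H1": "ORE", "H2": "EML", "H3": "Data", "H4": "Data", "H5": "Data"}


def reachable_leaves(root):
    """Depth-first walk that returns nodes with no outgoing edges."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        children = EDGES.get(node, [])
        if not children:
            found.append(node)
        # Reverse so children are visited in the order listed above.
        stack.extend(reversed(children))
    return found


leaves = reachable_leaves("H13")
data_objects = [node for node in leaves if LABELS.get(node) == "Data"]
print(leaves)
print(data_objects)
```

Every hop in the walk corresponds to reading one annotation or folder document, so the depth of the folder hierarchy directly affects how many reads are needed to list a package.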
|
Some more thoughts on annotations & the impacts of a change on a large dataset
A package name/id change for a dataset with a million files
Dataset member updates in a package with a million files
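One way to reason about the two cases above is to count how many annotation documents mention a given identifier. The sketch below assumes the layout from the earlier diagrams, where each member's annotation holds an `ore:aggregatedBy` statement and the package's annotation holds the matching `ore:aggregates` statements. The member ids and helper names are made up for illustration, and 1000 members stand in for a million.

```python
def build_annotations(package_id, member_ids):
    """One annotation document per identifier, keyed by subject."""
    docs = {
        package_id: [(package_id, "ore:aggregates", m) for m in member_ids]
    }
    for m in member_ids:
        docs[m] = [(m, "ore:aggregatedBy", package_id)]
    return docs


def count_docs_mentioning(docs, identifier):
    """Number of annotation documents with the identifier as subject or object."""
    return sum(
        1
        for triples in docs.values()
        if any(identifier in (s, o) for s, _, o in triples)
    )


members = [f"file.{i}" for i in range(1000)]  # hypothetical member ids
docs = build_annotations("datasetpkg", members)

# A package id change touches the package document and every member document.
print(count_docs_mentioning(docs, "datasetpkg"))
# A single member change touches that member's document and the package document.
print(count_docs_mentioning(docs, "file.0"))
```

Under these assumptions a package id change scales with the number of members, while a member update touches only two documents, one of which (the package document) is itself very large.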
|
Closing previous discussion/issue for Annotation Design: N-Triple vs JSON-LD Discussion and will continue discussions here as progress is made with the greater team regarding how to handle large packages. |
Questions & Todo:
/hashstore/metadata ? JSON-LD or EML?
Initial Proposal to kickstart the conversation (the content below is not final, and will likely change):
HashStore annotation is a mapping document that should consist of a single parent member and a list that represents the child members
hashstore/metadata is formed by calculating the SHA-256 hex digest of a given pid and formatId
hashstore/metadata
- The id/location/address of this document is formed by calculating the SHA-256 hex digest of a given pid, formatId and the string "parent". Ex. sha-256(pid + formatId + "parent")
- This document is composed of the attributes/content that describe the dataset (ex. title, author, method, keywordSet, etc.)
hashstore/metadata as the value
- The id/address of each child is formed by calculating the SHA-256 hex digest of a given pid, formatId and (int) key. Ex. sha-256(pid + formatId + 0) where 0 is the first table in the dataset
- Each child represents a data table in the dataset, or chunk of data that belongs to the dataset
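The addressing scheme proposed above can be sketched with Python's `hashlib`. This assumes plain string concatenation with no separator, UTF-8 encoding, and the integer key rendered as a decimal string; none of that is specified in the proposal. The `metadata_address` helper and the `format_id` value are hypothetical.

```python
import hashlib


def metadata_address(pid, format_id, suffix=""):
    """SHA-256 hex digest of pid + formatId + suffix (assumed concatenation)."""
    digest_input = f"{pid}{format_id}{suffix}"
    return hashlib.sha256(digest_input.encode("utf-8")).hexdigest()


pid = "dou.mok.1"
format_id = "example-format-id"  # hypothetical placeholder, not a real formatId

# Parent document: sha-256(pid + formatId + "parent")
parent_address = metadata_address(pid, format_id, "parent")

# Child documents: sha-256(pid + formatId + key), key 0 being the first table
child_addresses = [metadata_address(pid, format_id, key) for key in range(3)]

print(parent_address)
for key, address in enumerate(child_addresses):
    print(key, address)
```

Because the inputs are concatenated without a delimiter, different (pid, formatId, key) combinations could in principle produce the same input string; a separator would avoid that if the scheme is adopted.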