✋ BLOCKED - Program reports violations for data_object_set.was_generated_by references #5

Open
3 tasks
eecavanna opened this issue Jul 18, 2024 · 6 comments
Labels: bug (Something isn't working), upstream-issue (Cause of issue is in an upstream dependency), x small (less than 1 day)

Comments

@eecavanna
Collaborator

eecavanna commented Jul 18, 2024

Here is a single example.

Excerpt from violations.tsv:

source_collection_name  source_field_name  source_document_object_id  source_document_id     target_id
data_object_set         was_generated_by   649b00451ae706d7b5b1a5f1   nmdc:dobj-13-xx781m34  nmdc:omprc-13-4wkf0639

Excerpt from references.tsv:

source_collection_name  source_class_name  source_field_name  target_collection_name  target_class_name
data_object_set         DataObject         was_generated_by   workflow_execution_set  MetatranscriptomeAnalysis
data_object_set         DataObject         was_generated_by   workflow_execution_set  MetatranscriptomeAssembly
data_object_set         DataObject         was_generated_by   workflow_execution_set  MetagenomeSequencing
data_object_set         DataObject         was_generated_by   workflow_execution_set  MetagenomeAnnotation
data_object_set         DataObject         was_generated_by   workflow_execution_set  ReadQcAnalysis
data_object_set         DataObject         was_generated_by   workflow_execution_set  MetatranscriptomeAnnotation
data_object_set         DataObject         was_generated_by   workflow_execution_set  WorkflowExecution
data_object_set         DataObject         was_generated_by   workflow_execution_set  MetagenomeAssembly
data_object_set         DataObject         was_generated_by   workflow_execution_set  MagsAnalysis
data_object_set         DataObject         was_generated_by   workflow_execution_set  NomAnalysis
data_object_set         DataObject         was_generated_by   workflow_execution_set  ReadBasedTaxonomyAnalysis
data_object_set         DataObject         was_generated_by   workflow_execution_set  MetaproteomicsAnalysis
data_object_set         DataObject         was_generated_by   workflow_execution_set  MetabolomicsAnalysis

One of my teammates, who is very familiar with the schema and database, thinks this result is erroneous.

Configuration

# Download the raw content of https://github.com/microbiomedata/berkeley-schema-fy24/blob/main/nmdc_schema/nmdc_materialized_patterns.yaml
curl -o schema.yaml https://raw.githubusercontent.com/microbiomedata/berkeley-schema-fy24/main/nmdc_schema/nmdc_materialized_patterns.yaml
  • Database: Mongo in Berkeley environment as of 1:15pm PT on Thursday, July 18, 2024

Tasks

  • Fully characterize the result (check the schema, database, etc.)
  • Identify the root cause of the issue
  • Fix the issue (if local) or report the issue (if upstream)
eecavanna added the bug (Something isn't working) and x small (less than 1 day) labels Jul 18, 2024
eecavanna self-assigned this Jul 18, 2024
@eecavanna
Collaborator Author

eecavanna commented Jul 18, 2024

Excerpts from the schema (unrelated parts are replaced with # ...):

Base was_generated_by slot definition:

slots:
  # ...
  was_generated_by:
    name: was_generated_by
    from_schema: https://w3id.org/nmdc/nmdc
    mappings:
    - prov:wasGeneratedBy
    range: WorkflowExecution

Class definition:

classes:
  # ...
  DataObject:
    name: DataObject
    # ...
    slots:
    # ...
    - was_generated_by
    slot_usage:
      # ...
      was_generated_by:
        name: was_generated_by
        pattern: ^^(nmdc):(wfmag|wfmb|wfmgan|wfmgas|wfmsa|wfmp|wfmt|wfmtan|wfmtas|wfnom|wfrbt|wfrqc)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\.[0-9]{1,})$|^^(nmdc):(omprc|dgms|dgns)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$
        structured_pattern:
          syntax: ^{id_nmdc_prefix}:(wfmag|wfmb|wfmgan|wfmgas|wfmsa|wfmp|wfmt|wfmtan|wfmtas|wfnom|wfrbt|wfrqc)-{id_shoulder}-{id_blade}{id_version}$|^{id_nmdc_prefix}:(omprc|dgms|dgns)-{id_shoulder}-{id_blade}$
          interpolated: true
    class_uri: nmdc:DataObject
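
A quick way to see that the slot_usage pattern already accepts the flagged ID, even though the slot's range names only WorkflowExecution, is to test the ID against the pattern. The sketch below is plain JavaScript (Node.js) and is not part of refscan; the pattern is copied verbatim from the excerpt above (including the doubled ^) and the ID comes from the violations.tsv excerpt.

```javascript
// Pattern copied verbatim from DataObject.slot_usage.was_generated_by.
const wasGeneratedByPattern =
  /^^(nmdc):(wfmag|wfmb|wfmgan|wfmgas|wfmsa|wfmp|wfmt|wfmtan|wfmtas|wfnom|wfrbt|wfrqc)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\.[0-9]{1,})$|^^(nmdc):(omprc|dgms|dgns)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$/;

// The target ID reported in violations.tsv.
const targetId = "nmdc:omprc-13-4wkf0639";

const match = wasGeneratedByPattern.exec(targetId);

// Capture group 2 is the type code of the first alternative (wf...);
// capture group 7 is the type code of the second alternative (omprc|dgms|dgns).
const matchedWorkflowBranch = match !== null && match[2] !== undefined;
const matchedSecondBranch = match !== null && match[7] !== undefined;

console.log("Matches pattern:", match !== null);
console.log("Matched a wf* type code:", matchedWorkflowBranch);
console.log("Matched omprc|dgms|dgns:", matchedSecondBranch);
```

If the ID matches only through the second alternative, the pattern permits IDs whose type codes are not workflow type codes, which is consistent with the range and the pattern disagreeing.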

Database slot:

classes:
  # ...
  Database:
    name: Database
    # ...
    slots:
    # ...
    - workflow_execution_set
    class_uri: nmdc:Database
    tree_root: true

Base workflow_execution_set slot definition:

slots:
  # ...
  workflow_execution_set:
    name: workflow_execution_set
    description: This property links a database object to the set of workflow executions.
    from_schema: https://w3id.org/nmdc/nmdc
    mixins:
    - object_set
    range: WorkflowExecution

Class WorkflowExecution definition:

classes:
  # ...
  WorkflowExecution:
    name: WorkflowExecution
    # ...
    is_a: PlannedProcess
    abstract: true
    slots:
    # ...
    - was_informed_by
    slot_usage:
      # ...
      has_input:
        name: has_input
        required: true
        pattern: ^(nmdc):(dobj)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$
        structured_pattern:
          syntax: '{id_nmdc_prefix}:(dobj)-{id_shoulder}-{id_blade}$'
          interpolated: true
      has_output:
        name: has_output
        pattern: ^(nmdc):(dobj)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$
        structured_pattern:
          syntax: '{id_nmdc_prefix}:(dobj)-{id_shoulder}-{id_blade}$'
          interpolated: true
      # ...
      was_informed_by:
        name: was_informed_by
        required: true
    class_uri: nmdc:WorkflowExecution
    rules:
    # ...

@eecavanna
Collaborator Author

eecavanna commented Jul 18, 2024

Here's a curl command that can be used to find (via the Berkeley Runtime API) the document being referred to.

curl -X 'GET' \
  'https://api-berkeley.microbiomedata.org/nmdcschema/ids/nmdc%3Aomprc-13-4wkf0639' \
  -H 'accept: application/json'

Here is that document:

{
  "id": "nmdc:omprc-13-4wkf0639",
  "name": "Rachael_21T_04-15A_M_14Mar17_leopard_Infuse",
  "has_input": [
    "nmdc:bsm-13-4bfysc34"
  ],
  "has_output": [
    "nmdc:dobj-13-xx781m34"
  ],
  "description": "High resolution MS spectra only",
  "processing_institution": "EMSL",
  "type": "nmdc:MassSpectrometry",
  "alternative_identifiers": [
    "emsl:570856"
  ],
  "analyte_category": "nom",
  "associated_studies": [
    "nmdc:sty-11-33fbta56"
  ],
  "instrument_used": [
    "nmdc:inst-14-nstrhv39"
  ]
}
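
As a sanity check on the reference itself, the returned document can be compared with the row from violations.tsv. The sketch below is plain JavaScript with the relevant fields copied from the excerpts above; it only checks whether the target document lists the source data object in has_output, and it makes no claim about how refscan works.

```javascript
// Relevant fields from the violations.tsv row.
const violation = {
  source_document_id: "nmdc:dobj-13-xx781m34",
  target_id: "nmdc:omprc-13-4wkf0639",
};

// Relevant fields from the document returned by the Runtime API.
const targetDocument = {
  id: "nmdc:omprc-13-4wkf0639",
  has_output: ["nmdc:dobj-13-xx781m34"],
  type: "nmdc:MassSpectrometry",
};

// The reference is reciprocal if the target exists under the reported ID
// and names the source data object as one of its outputs.
const isReciprocal =
  targetDocument.id === violation.target_id &&
  targetDocument.has_output.includes(violation.source_document_id);

console.log("Target lists the data object as an output:", isReciprocal);
```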

I used this Mongo query to find out which collection that document resides in.

// Find which collection contains the document having this `id`.
const id = "nmdc:omprc-13-4wkf0639";

// Named `nmdcDb` (not `db`) so the declaration does not shadow the shell's built-in `db`.
const nmdcDb = db.getSiblingDB("nmdc");
const collectionNames = nmdcDb.getCollectionNames();
for (const collectionName of collectionNames) {
  print("Processing: " + collectionName);

  // Skip these collections.
  if (collectionName.startsWith("system.")
      || collectionName.startsWith("minter.")
      || collectionName.startsWith("_")
      || collectionName.startsWith("ids_")) {
    continue;
  }

  const collection = nmdcDb.getCollection(collectionName);

  // Search this collection, stopping at the first match.
  const cursor = collection.find({id: id}).limit(1);
  if (cursor.hasNext()) {
    print("Found in: " + collectionName);
    break;
  }
}

It resides in the data_generation_set collection.

@eecavanna
Collaborator Author

I think my teammate expected refscan to have searched the data_generation_set collection. It isn't clear to me how the schema says that DataObject.was_generated_by can refer to something in a collection named data_generation_set.
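
To make the reasoning concrete, here is a rough sketch of how a slot's range could be mapped to the collections worth searching. It is not refscan's actual implementation. The WorkflowExecution entries come from the schema excerpts above; the DataGeneration class name, its parent, and its use as the range of data_generation_set are assumptions made only for illustration.

```javascript
// Parent class of each class (the `is_a` relationship).
const isA = {
  WorkflowExecution: "PlannedProcess", // from the schema excerpt
  DataGeneration: "PlannedProcess",    // assumption, for illustration only
};

// Range of each Database slot (i.e. of each collection).
const collectionRanges = {
  workflow_execution_set: "WorkflowExecution", // from the schema excerpt
  data_generation_set: "DataGeneration",       // assumption, for illustration only
};

// Returns true if `ancestor` is `cls` itself or one of its ancestors.
function isSameOrAncestor(ancestor, cls) {
  for (let current = cls; current !== undefined; current = isA[current]) {
    if (current === ancestor) return true;
  }
  return false;
}

// Returns the collections whose range is related to the slot's range,
// either as the same class, an ancestor, or a descendant.
function collectionsForRange(slotRange) {
  return Object.entries(collectionRanges)
    .filter(([, range]) =>
      isSameOrAncestor(range, slotRange) || isSameOrAncestor(slotRange, range))
    .map(([name]) => name);
}

const targets = collectionsForRange("WorkflowExecution");
console.log(targets);
```

Under these assumptions, a range of WorkflowExecution alone never leads to data_generation_set, which would explain why the document was not found.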

@eecavanna
Collaborator Author

Turns out this particular violation (and presumably others, although I haven't confirmed it) is due to this bug in the berkeley-schema-fy24 schema.
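
One way to check whether the other violations share this cause would be to count the rows of violations.tsv by the type code of their target_id. The sketch below is plain JavaScript and assumes the file is tab-separated with the header shown in the excerpt; the sample text contains only the single row quoted earlier, and the function name is hypothetical.

```javascript
// Sample input: the header and the single row from the violations.tsv excerpt.
const violationsTsv = [
  "source_collection_name\tsource_field_name\tsource_document_object_id\tsource_document_id\ttarget_id",
  "data_object_set\twas_generated_by\t649b00451ae706d7b5b1a5f1\tnmdc:dobj-13-xx781m34\tnmdc:omprc-13-4wkf0639",
].join("\n");

// Counts violations per target ID type code (e.g. "omprc").
function countByTargetTypeCode(tsvText) {
  const [headerLine, ...rowLines] = tsvText.trim().split("\n");
  const targetIdIndex = headerLine.split("\t").indexOf("target_id");
  const counts = {};
  for (const line of rowLines) {
    const targetId = line.split("\t")[targetIdIndex];
    // "nmdc:omprc-13-4wkf0639" -> "omprc"
    const typeCode = targetId.split(":")[1].split("-")[0];
    counts[typeCode] = (counts[typeCode] || 0) + 1;
  }
  return counts;
}

const counts = countByTargetTypeCode(violationsTsv);
console.log(counts);
```

If every count falls under omprc, dgms, or dgns, the remaining violations are likely the same false positive.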

eecavanna changed the title from "Program reports violations for data_object_set.was_generated_by references (false positive)" to "Program reports violations for data_object_set.was_generated_by references" Jul 19, 2024
eecavanna added the upstream-issue (Cause of issue is in an upstream dependency) label Jul 19, 2024
@eecavanna
Collaborator Author

eecavanna commented Jul 25, 2024

Now that the missing any_of constraint has been added to the Berkeley schema (in version https://github.com/microbiomedata/berkeley-schema-fy24/releases/tag/v11.0.0rc18), I'll re-run refscan (this time, using that new version of the Berkeley schema) and check whether it still finds these violations.

Edit: I am not able to re-run it. See the comment below this one for details.

@eecavanna
Collaborator Author

Progress is blocked by the absence of the necessary YAML-formatted schema file in the berkeley-schema-fy24 repo. I created a ticket about that, here: microbiomedata/nmdc-schema#2150

eecavanna changed the title from "Program reports violations for data_object_set.was_generated_by references" to "✋ BLOCKED - Program reports violations for data_object_set.was_generated_by references" Jul 25, 2024