Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to remap data entity names #46

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

simleo
Copy link
Collaborator

@simleo simleo commented Jun 1, 2023

This is a possible answer to #34. It adds an option to remap all file and directory names to their original counterparts, based on basename values from CWLProv (note that basenames are not available for directories if the RO bundle was generated with cwltool < 3.1.20230209161050, i.e., before common-workflow-language/cwltool#1798).

However, as clearly stated in the option's help text, it can lead to an incorrect crate due to name clashes: for instance, the revsort example would not convert correctly because both the final output and the intermediate output from the rev step are called output.txt. In practice the program does not crash, but only one output.txt entity is registered, and associated to all four parameters related to a file named output.txt:

{
    "@id": "output.txt",
    "@type": "File",
    "exampleOfWork": [
        {"@id": "packed.cwl#sorttool.cwl/input"},
        {"@id": "packed.cwl#revtool.cwl/output"},
        {"@id": "packed.cwl#main/output"},
        {"@id": "packed.cwl#sorttool.cwl/output"}
    ]
}

For comparison, without name remapping, the situation is:

{
    "@id": "b9214658cc453331b62c2282b772a5c063dbd284",
    "@type": "File",
    "alternateName": "output.txt",
    "exampleOfWork": [
        {"@id": "packed.cwl#main/output"},
        {"@id": "packed.cwl#sorttool.cwl/output"}
    ]
},
{
    "@id": "97fe1b50b4582cebc7d853796ebd62e3e163aa3f",
    "@type": "File",
    "alternateName": "output.txt",
    "exampleOfWork": [
        {"@id": "packed.cwl#revtool.cwl/output"},
        {"@id": "packed.cwl#sorttool.cwl/input"}
    ]
}

where the first entity represent the final output and the second entity the intermediate one.

On the other hand, the renaming works fine in the grepucase example, which I have used for the unit test.

So this would be a kind of expert user / use with care option.

@codecov
Copy link

codecov bot commented Jun 1, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.23 🎉

Comparison is base (64397c7) 97.61% compared to head (57eae27) 97.85%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
+ Coverage   97.61%   97.85%   +0.23%     
==========================================
  Files          12       12              
  Lines        1930     2146     +216     
==========================================
+ Hits         1884     2100     +216     
  Misses         46       46              
Impacted Files Coverage Δ
src/runcrate/cli.py 97.56% <100.00%> (+0.06%) ⬆️
src/runcrate/convert.py 97.90% <100.00%> (+0.05%) ⬆️
tests/test_cli.py 100.00% <100.00%> (ø)
tests/test_cwlprov_crate_builder.py 99.90% <100.00%> (+0.01%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@jjkoehorst
Copy link

Could it perhaps check if the new file location already exists and if it does it will not continue ?

@simleo
Copy link
Collaborator Author

simleo commented Jun 23, 2023

Could it perhaps check if the new file location already exists and if it does it will not continue ?

You would still get a single entry for the file instead of multiple. It would be the first one encountered, so it would also be implementation-dependent.

@jjkoehorst
Copy link

That I fully understand but to stop the process as this as you mentioned is an issue. This to prevent and mention to people are you sure this is what you want?

@simleo
Copy link
Collaborator Author

simleo commented Jun 23, 2023

Sorry, I got the explanation wrong. Given a file id (path relative to the crate root), the code checks if that is already in the crate, because it happens frequently in normal situations. For instance, if a file is both the input of the whole workflow and the input of the first step. If it is already in the crate, it does not add it a second time. It cannot stop and raise an exception if it finds a file name it has already seen, because that happens normally even with hashes.

@simleo
Copy link
Collaborator Author

simleo commented Jul 19, 2023

I've added some logic to eliminate clashes by using different landing directories for files, depending on the process (workflow / step) and whether it's an input or an output parameter. For the revsort example, for instance, the crate directory tree looks like this:

.
|-- data
|   `-- main
|       |-- in
|       |   `-- whale.txt
|       |-- out
|       |   `-- output.txt
|       |-- rev
|       |   |-- in
|       |   |   `-- whale.txt
|       |   `-- out
|       |       `-- output.txt
|       `-- sorted
|           |-- in
|           |   `-- output.txt
|           `-- out
|               `-- output.txt
|-- packed.cwl
|-- primary-job.json
`-- ro-crate-metadata.json

This leads to some data duplication. For instance, data/main/in/whale.txt is the same as data/main/rev/in/whale.txt. But note that data/main/sorted/in/output.txt is different from data/main/sorted/out/output.txt, hence the need for the in and out subfolders.

One problem is that with name remapping activated, patch_workflow_input_collection does not work correctly anymore. It's a hack to work around a cwltool problem where secondary files are recorded only for step runs but not for the whole workflow run, so after converting parameters, workflow-level parameters with secondary files get mapped to the main entity alone (a file). The hack consists of retrieving the correct mapping from the relevant tool execution, since in the normal operation where hashes are kept for filenames and are placed at the root of the RO-Crate the resulting collection is the same. With names remapping on, the workflow-level parameter should be mapped to a separate collection where files have different paths instead, but we don't have that information from the CWLProv RO.

@stain
Copy link
Contributor

stain commented Mar 28, 2024

This option will not work well with secondary files because of mentioned hack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants