Option to remap data entity names #46

simleo · 2023-06-01T11:10:21Z

This is a possible answer to #34. It adds an option to remap all file and directory names to their original counterparts, based on basename values from CWLProv (note that basenames are not available for directories if the RO bundle was generated with cwltool < 3.1.20230209161050, i.e., before common-workflow-language/cwltool#1798).

However, as clearly stated in the option's help text, it can lead to an incorrect crate due to name clashes: for instance, the revsort example would not convert correctly because both the final output and the intermediate output from the rev step are called output.txt. In practice the program does not crash, but only one output.txt entity is registered, and associated to all four parameters related to a file named output.txt:

{
    "@id": "output.txt",
    "@type": "File",
    "exampleOfWork": [
        {"@id": "packed.cwl#sorttool.cwl/input"},
        {"@id": "packed.cwl#revtool.cwl/output"},
        {"@id": "packed.cwl#main/output"},
        {"@id": "packed.cwl#sorttool.cwl/output"}
    ]
}

For comparison, without name remapping, the situation is:

{
    "@id": "b9214658cc453331b62c2282b772a5c063dbd284",
    "@type": "File",
    "alternateName": "output.txt",
    "exampleOfWork": [
        {"@id": "packed.cwl#main/output"},
        {"@id": "packed.cwl#sorttool.cwl/output"}
    ]
},
{
    "@id": "97fe1b50b4582cebc7d853796ebd62e3e163aa3f",
    "@type": "File",
    "alternateName": "output.txt",
    "exampleOfWork": [
        {"@id": "packed.cwl#revtool.cwl/output"},
        {"@id": "packed.cwl#sorttool.cwl/input"}
    ]
}

where the first entity represent the final output and the second entity the intermediate one.

On the other hand, the renaming works fine in the grepucase example, which I have used for the unit test.

So this would be a kind of expert user / use with care option.

codecov · 2023-06-01T11:14:04Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.23 🎉

Comparison is base (64397c7) 97.61% compared to head (57eae27) 97.85%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
+ Coverage   97.61%   97.85%   +0.23%     
==========================================
  Files          12       12              
  Lines        1930     2146     +216     
==========================================
+ Hits         1884     2100     +216     
  Misses         46       46

Impacted Files	Coverage Δ
src/runcrate/cli.py	`97.56% <100.00%> (+0.06%)`	⬆️
src/runcrate/convert.py	`97.90% <100.00%> (+0.05%)`	⬆️
tests/test_cli.py	`100.00% <100.00%> (ø)`
tests/test_cwlprov_crate_builder.py	`99.90% <100.00%> (+0.01%)`	⬆️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

jjkoehorst · 2023-06-23T08:26:41Z

Could it perhaps check if the new file location already exists and if it does it will not continue ?

simleo · 2023-06-23T08:37:22Z

Could it perhaps check if the new file location already exists and if it does it will not continue ?

You would still get a single entry for the file instead of multiple. It would be the first one encountered, so it would also be implementation-dependent.

jjkoehorst · 2023-06-23T11:32:17Z

That I fully understand but to stop the process as this as you mentioned is an issue. This to prevent and mention to people are you sure this is what you want?

simleo · 2023-06-23T12:41:25Z

Sorry, I got the explanation wrong. Given a file id (path relative to the crate root), the code checks if that is already in the crate, because it happens frequently in normal situations. For instance, if a file is both the input of the whole workflow and the input of the first step. If it is already in the crate, it does not add it a second time. It cannot stop and raise an exception if it finds a file name it has already seen, because that happens normally even with hashes.

simleo · 2023-07-19T15:01:52Z

I've added some logic to eliminate clashes by using different landing directories for files, depending on the process (workflow / step) and whether it's an input or an output parameter. For the revsort example, for instance, the crate directory tree looks like this:

.
|-- data
|   `-- main
|       |-- in
|       |   `-- whale.txt
|       |-- out
|       |   `-- output.txt
|       |-- rev
|       |   |-- in
|       |   |   `-- whale.txt
|       |   `-- out
|       |       `-- output.txt
|       `-- sorted
|           |-- in
|           |   `-- output.txt
|           `-- out
|               `-- output.txt
|-- packed.cwl
|-- primary-job.json
`-- ro-crate-metadata.json

This leads to some data duplication. For instance, data/main/in/whale.txt is the same as data/main/rev/in/whale.txt. But note that data/main/sorted/in/output.txt is different from data/main/sorted/out/output.txt, hence the need for the in and out subfolders.

One problem is that with name remapping activated, patch_workflow_input_collection does not work correctly anymore. It's a hack to work around a cwltool problem where secondary files are recorded only for step runs but not for the whole workflow run, so after converting parameters, workflow-level parameters with secondary files get mapped to the main entity alone (a file). The hack consists of retrieving the correct mapping from the relevant tool execution, since in the normal operation where hashes are kept for filenames and are placed at the root of the RO-Crate the resulting collection is the same. With names remapping on, the workflow-level parameter should be mapped to a separate collection where files have different paths instead, but we don't have that information from the CWLProv RO.

stain · 2024-03-28T15:13:10Z

This option will not work well with secondary files because of mentioned hack.

simleo mentioned this pull request Jun 1, 2023

Option to make data files more user friendly #34

Open

simleo added 4 commits July 17, 2023 12:16

add option to remap data entity names to the original ones

7e7752a

finish test_remap_names

689d75b

add test_cli_convert_remap_names

57e7958

names remapping: avoid clashes by using different dirs

fc20b61

simleo force-pushed the remap_names branch from 6ad44d1 to fc20b61 Compare July 17, 2023 10:25

simleo added 5 commits July 17, 2023 13:21

names remapping: test to check there are no clashes if same basename

29e34d7

convert_param fixes

bc6b8bf

expand some unit tests

79ba621

add test_remap_names_scatter

eaaaad9

add a note to patch_workflow_input_collection

57eae27

simleo mentioned this pull request Jul 20, 2023

CWLProv: no trace of secondaryFiles for workflow activity common-workflow-language/cwltool#1879

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Option to remap data entity names #46

Option to remap data entity names #46

simleo commented Jun 1, 2023 •

edited

Loading

codecov bot commented Jun 1, 2023 •

edited

Loading

jjkoehorst commented Jun 23, 2023

simleo commented Jun 23, 2023

jjkoehorst commented Jun 23, 2023

simleo commented Jun 23, 2023

simleo commented Jul 19, 2023

stain commented Mar 28, 2024

Option to remap data entity names #46

Are you sure you want to change the base?

Option to remap data entity names #46

Conversation

simleo commented Jun 1, 2023 • edited Loading

codecov bot commented Jun 1, 2023 • edited Loading

Codecov Report

jjkoehorst commented Jun 23, 2023

simleo commented Jun 23, 2023

jjkoehorst commented Jun 23, 2023

simleo commented Jun 23, 2023

simleo commented Jul 19, 2023

stain commented Mar 28, 2024

simleo commented Jun 1, 2023 •

edited

Loading

codecov bot commented Jun 1, 2023 •

edited

Loading