-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Option to remap data entity names #46
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #46 +/- ##
==========================================
+ Coverage 97.61% 97.85% +0.23%
==========================================
Files 12 12
Lines 1930 2146 +216
==========================================
+ Hits 1884 2100 +216
Misses 46 46
☔ View full report in Codecov by Sentry. |
Could it perhaps check if the new file location already exists and if it does it will not continue ? |
You would still get a single entry for the file instead of multiple. It would be the first one encountered, so it would also be implementation-dependent. |
That I fully understand but to stop the process as this as you mentioned is an issue. This to prevent and mention to people are you sure this is what you want? |
Sorry, I got the explanation wrong. Given a file id (path relative to the crate root), the code checks if that is already in the crate, because it happens frequently in normal situations. For instance, if a file is both the input of the whole workflow and the input of the first step. If it is already in the crate, it does not add it a second time. It cannot stop and raise an exception if it finds a file name it has already seen, because that happens normally even with hashes. |
I've added some logic to eliminate clashes by using different landing directories for files, depending on the process (workflow / step) and whether it's an input or an output parameter. For the revsort example, for instance, the crate directory tree looks like this:
This leads to some data duplication. For instance, One problem is that with name remapping activated, patch_workflow_input_collection does not work correctly anymore. It's a hack to work around a cwltool problem where secondary files are recorded only for step runs but not for the whole workflow run, so after converting parameters, workflow-level parameters with secondary files get mapped to the main entity alone (a file). The hack consists of retrieving the correct mapping from the relevant tool execution, since in the normal operation where hashes are kept for filenames and are placed at the root of the RO-Crate the resulting collection is the same. With names remapping on, the workflow-level parameter should be mapped to a separate collection where files have different paths instead, but we don't have that information from the CWLProv RO. |
This option will not work well with secondary files because of mentioned hack. |
This is a possible answer to #34. It adds an option to remap all file and directory names to their original counterparts, based on
basename
values from CWLProv (note that basenames are not available for directories if the RO bundle was generated with cwltool < 3.1.20230209161050, i.e., before common-workflow-language/cwltool#1798).However, as clearly stated in the option's help text, it can lead to an incorrect crate due to name clashes: for instance, the revsort example would not convert correctly because both the final output and the intermediate output from the
rev
step are calledoutput.txt
. In practice the program does not crash, but only oneoutput.txt
entity is registered, and associated to all four parameters related to a file namedoutput.txt
:For comparison, without name remapping, the situation is:
where the first entity represent the final output and the second entity the intermediate one.
On the other hand, the renaming works fine in the
grepucase
example, which I have used for the unit test.So this would be a kind of expert user / use with care option.