CQ2 - Resource usage #10
How much memory/cpu/disk was used in run?

Comments
The first thing is knowing what we're going to model. What does each workflow manager already provide? What could be added that's missing, and how hard would it be to do that?
More generally, we need to know about any information / logging data (and how structured it is) provided by the framework that runs the workflow, not just resource usage: timestamps, user info, containerization, etc. Another useful categorization is what's available by default and what needs to be explicitly activated (e.g., user info in cwltool). |
cwltool tracks peak memory usage, and start & stop times for jobs & steps.
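For reference, start and stop times are already expressible with schema.org's startTime and endTime on actions; a minimal sketch with illustrative values (the ids and timestamps are hypothetical):

{
  "@id": "#step-1-run",
  "@type": "CreateAction",
  "name": "Run of step 1",
  "startTime": "2023-01-20T10:15:30Z",
  "endTime": "2023-01-20T10:18:02Z"
}
|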
Discussed with @ilveroluca earlier this morning. One source of confusion was my suggestion of memoryRequirements or storageRequirements for the property names, which was not consistent with the question "How much memory/cpu/disk was used in run". I have now removed that suggestion from the issue's description.

That's not to say that such indications are not useful: they are quite useful to those who want to reproduce the run, since they allow one to plan ahead, but they've got nothing to do with what happened during the run. Such indications don't come from the observation of a single run, but rather from the experience (or -- even better -- statistics) of the author(s) or anyone who's worked with the application in various scenarios. They are part of prospective, not retrospective, provenance, so we should expect them to come from the workflow's author / maintainer. Indeed, CWL has ResourceRequirement for this purpose. I've now opened #32 to track this.

Here, instead, we are focusing on resource usage information for the specific run described in the crate, such as the peak memory usage mentioned by Michael. This isn't just useful to enrich the metadata about the run: it might be the only hint available in all cases where requirements as discussed above are not available (which I expect to be the majority: even when they are known, the authors might not take the time to provide them); additionally, with a sufficiently large number of runs, it could be used to get a good estimate of the general requirements.
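For context, this is what such prospective requirements look like in CWL: a minimal ResourceRequirement sketch in CWL's JSON syntax, with illustrative values (ramMin is expressed in mebibytes):

{
  "cwlVersion": "v1.2",
  "class": "CommandLineTool",
  "baseCommand": "echo",
  "inputs": [],
  "outputs": [],
  "requirements": [
    {"class": "ResourceRequirement", "coresMin": 2, "ramMin": 4096}
  ]
}
|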
Where is that recorded? Is it available from the CWLProv output? |
About Nextflow: when the trace option is enabled, detailed statistics about memory and CPU usage are gathered for each executed step. The monitoring is done from the shell script created to execute the workflow step. As this script is written in bash and also depends on additional tools for the detailed monitoring, mainly ps, grep and awk, not all container instances allow this detailed statistics gathering. This is the information usually gathered for each executed step: […]
|
Sapporo is a Workflow Execution System (WES), so it calls workflow engines (e.g., cwltool and Nextflow) internally. The RO-Crate generated by Sapporo stores the following information (information from within the Sapporo container): […]
This information is generated in https://github.com/sapporo-wes/sapporo-service/blob/856196864e8ccda8c71bef12e5dcf7d5becb21e6/sapporo/ro_crate.py#L697 . We also add the log (or provenance) of the WES (Sapporo) itself.
This information is originally stored in Sapporo as the run_dir (https://github.com/sapporo-wes/sapporo-service#run-dir). Furthermore, we collect information about each file (input/output files): […]
This information is generated in https://github.com/sapporo-wes/sapporo-service/blob/856196864e8ccda8c71bef12e5dcf7d5becb21e6/sapporo/ro_crate.py#L318 .
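As an illustration, per-file information of this kind can be expressed with standard schema.org properties on the crate's File entities; a minimal sketch (the id, values and property choice here are assumptions, not necessarily what Sapporo emits):

{
  "@id": "outputs/result.txt",
  "@type": "File",
  "contentSize": "1024",
  "dateModified": "2023-01-20T10:18:02Z"
}
|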
From @rsirvent: |
Sorry I missed this thread. Let me elaborate a bit more on what we provide with COMPSs. We have some ways to gather statistics / understand resource usage: […]
Also, more general information about the resources (how many cores, memory, etc. they have) can be found in the COMPSs XML configuration files (resources.xml) (see: https://compss-doc.readthedocs.io/en/stable/Sections/01_Installation/06_Configuration_files.html). Hope it helps. |
What to represent?

One of the main challenges here is providing guidelines that make sense for a wide variety of operating systems and workflow engines. This means we cannot go too deep into details. For instance, Nextflow tracing provides details such as virtual memory vs resident set, but this distinction might not apply to all systems and / or be made by all workflow engines. cwltool and Arvados, for instance, simply give "max memory used", which is sufficiently general. The Nextflow tracing example shows that this kind of information can be very detailed; for generality and simplicity, I think we should focus on the most important bits, especially for the first release of the profiles. We could represent the following: […]
How to represent it?

Schema.org has things like memoryRequirements for SoftwareApplication, but these are minimum requirements to run the application. We need to describe actual resource usage and tie it to the actions. I could not find anything for this in the current RO-Crate context, so we probably need new terms. I've searched ontologies for inspiration, but could not find much (e.g., the WICUS hardware specs also seem focused on application requirements). The simplest approach is to add the properties directly to the action, for instance:

{
"@id": "#action-1",
"@type": "CreateAction",
"memoryUsage": "8.43GB",
"cpuUsage": "140.2%",
"gpuUsage": "70.3%",
"usedCpus": "2",
"usedGpus": "1",
...
}

With CPU / GPU details:

{
"@id": "#action-1",
"@type": "CreateAction",
"memoryUsage": "8.43GB",
"cpuUsage": "140.2%",
"gpuUsage": "70.3%",
"usedCpus": [{"@id": "#cpu-1"}, {"@id": "#cpu-2"}]
"usedGpus": {"@id": "#gpu-1"},
...
},
{
"@id": "#cpu-1",
"@type": "HardwareComponent",
"model": "FooBar 314Pro",
...
},
{
"@id": "#cpu-1",
"@type": "HardwareComponent",
"model": "FooBar XG666",
...
}
|
Autosubmit (which will support RO-Crate soon) keeps track of some variables, like memory, CPU and disk, that would fit the model discussed so far, I think... but there are other reported metrics that I am not sure would fit in the resource usage here (maybe they'd be reported somewhere else in the RO-Crate archive?).
I think a more flexible approach, allowing custom values to be added, would be useful, from what I understood about the topic so far (still getting familiar with RO-Crate, how to implement it, etc., sorry). -Bruno |
Several group members were of the same opinion at yesterday's meeting, with doubts expressed about the addition of "fixed" properties that might fit the descriptions given by the various engines / systems poorly. Since this will require substantially more thought, I'm removing this issue from the 0.1 milestone. |
What we can easily add for the 0.1 release is a recommendation to add engine-specific logs, reports, traces, etc. to the crate. They can easily be tied to the corresponding actions, e.g. via about:

{
"@id": "#action-1",
"@type": "CreateAction",
...
},
{
"@id": "trace-20230120-40360336.txt",
"@type": "File",
"name": "Nextflow trace for action-1",
"conformsTo": "https://www.nextflow.io/docs/latest/tracing.html#trace-report",
"encodingFormat": "text/tab-separated-values",
"about": "#action-1"
}

This is vanilla RO-Crate, so it does not require adding any terms or specific requirements. Moreover, doing this requires very little effort from the crate producer. Having the information there is already quite useful; a future framework for a uniform representation of it would then be an improvement in interoperability. |
Great, also declare […]
|
Now that 0.1 is out, recapping the latest discussions on the next steps: the general idea is to use a system based on key-value pairs. So this example:

{
"@id": "#action-1",
"@type": "CreateAction",
"memoryUsage": "8.43GB",
"cpuUsage": "140.2%",
"usedCpus": "2"
}

could become something like:

{
"@id": "#action-1",
"@type": "CreateAction",
"resourceUsage": [
{"@id": "#action-1-memory"},
{"@id": "#action-1-cpu"},
{"@id": "#action-1-nCpu"},
]
},
{
"@id": "#action-1-memory",
"@type": "PropertyValue",
"name": "memory",
"value": "8.43GB",
},
{
"@id": "#action-1-cpu",
"@type": "PropertyValue",
"name": "cpu",
"value": "140.2%",
},
{
"@id": "#action-1-nCpu",
"@type": "PropertyValue",
"name": "nCpu",
"value": "2",
}

Note that ids like "#action-1-memory" are arbitrary: any identifier would do. One advantage of this approach is that it requires just one extra term to be defined, resourceUsage. The resource usage values could also be grouped into a dedicated entity that declares which specification the keys conform to, for instance:

{
"@id": "#action-1",
"@type": "CreateAction",
"resourceUsage": {"@id": "#action-1-ru"}
},
{
"@id": "#action-1-ru",
"@type": "Dataset",
"conformsTo": "http://example.org/cwltool-rocrate-ru-spec",
"variableMeasured": [
{"@id": "#action-1-peakMemory"}
]
},
{
"@id": "#action-1-peakMemory",
"@type": "PropertyValue",
"name": "peakMemory",
"value": "8.43GB"
},
{
"@id": "http://example.org/cwltool-rocrate-ru-spec",
"@type": "CreativeWork",
"name": "cwltool RO-Crate resource usage spec",
"version": "0.1"
}

Note that I've used the existing schema.org property variableMeasured to list the individual values, and conformsTo to point to an engine-specific spec. Regarding keys: should they come from engine-specific namespaces, or from a common set defined by the profiles? (See the sketch below.)
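To make the two options concrete, a minimal sketch showing the same metric expressed with a common key and with an engine-prefixed key (the cwltool: prefix and both ids are hypothetical):

{
  "@id": "#action-1-peakMemory",
  "@type": "PropertyValue",
  "name": "peakMemory",
  "value": "8.43GB"
},
{
  "@id": "#action-2-peakMemory",
  "@type": "PropertyValue",
  "name": "cwltool:peakMemory",
  "value": "8.43GB"
}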
|
Sounds good! I've added a note in our merge request that adds RO-Crate support, about testing, to record the metrics somehow in the metadata, preferably with a format like this one (although I might push that to after our merge request is merged).
I think we should do both: use namespaces, and also have a common set of keys. But I think this common set of keys must have a really good description of each key. For instance, for […]

This way I would be able to use […]

Do you have any idea if this would be available in tools like CWL Viewer and WorkflowHub.eu? For example, if I want to search all the workflows that used […]

-Bruno |
The resource usage dataset is associated to a specific action. In a Provenance Run Crate, where there's an action for each task, one can record each task's resource usage separately. In a Workflow Run Crate you could record the sum of cores used by all tasks (in the action that represents the workflow run), but for things like usage percentages you probably want the average; see the sketch below. In Provenance Run Crates, OTOH, if you have per-task resource usage it's probably better not to record resource usage for the whole workflow run. Resources used by the WMS itself should probably be associated to the corresponding OrganizeAction.
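For instance, a workflow-level aggregation might look like this minimal sketch (the ids, names and values are illustrative): total cores as a sum over tasks, CPU usage as an average:

{
  "@id": "#workflow-run",
  "@type": "CreateAction",
  "resourceUsage": [
    {"@id": "#workflow-run-totalCores"},
    {"@id": "#workflow-run-avgCpuUsage"}
  ]
},
{
  "@id": "#workflow-run-totalCores",
  "@type": "PropertyValue",
  "name": "totalCores",
  "value": "8"
},
{
  "@id": "#workflow-run-avgCpuUsage",
  "@type": "PropertyValue",
  "name": "avgCpuUsage",
  "value": "85.5%"
}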
As for CWL Viewer and WorkflowHub.eu: they're both focused on prospective provenance, so I think it's unlikely. |
I think this is almost there -- but https://schema.org/PropertyValue already has a property, propertyID, for this purpose, so I think that would work better than the conformsTo indirection:

{
"@id": "#action-1-peakMemory",
"@type": "PropertyValue",
"name": "peakMemory",
"propertyID": "https://example.org/cwltool-rocrate-ru-spec#peakMemory",
"value": 8.43,
"unitText": "GiB"
},

(I also added unitText, and made the value a number.)

It would still make sense to add […]

Would say that […] |
A bit related: similar to what we do in Autosubmit and in our HPC for tracking energy usage, I learned today from an IIIF email about a draft for an HTTP header for carbon emissions: https://www.ietf.org/archive/id/draft-martin-http-carbon-emissions-scope-2-00.html So I believe this shows that there are groups interested in tracking resources like energy consumption, carbon emissions, etc., that are different from the most common ones like memory/cpu/disk 👍
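Such metrics would fit the same key-value pattern; a minimal sketch (the energy key, namespace and values are hypothetical; the unitCode follows the QUDT style used elsewhere in this thread):

{
  "@id": "#action-1-energy",
  "@type": "PropertyValue",
  "name": "energy",
  "propertyID": "https://example.org/terms#energyConsumed",
  "value": "1.2",
  "unitCode": "https://qudt.org/vocab/unit/KiloW-HR"
}
|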
As of today's TC, I have realized that https://schema.org/QuantitativeValue could be helpful to describe resource requirements in both prospective and retrospective provenance, due to its capability to describe the value along with minimum and maximum ones.
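A minimal sketch of that idea, with illustrative values (e.g., average memory plus the observed minimum and maximum):

{
  "@id": "#action-1-memory",
  "@type": "QuantitativeValue",
  "value": 6.1,
  "minValue": 4.0,
  "maxValue": 8.43,
  "unitText": "GiB"
}
|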
As @dgarijo has suggested in the TC chat, we could also allow pointing to Wikidata terms. |
From today's meeting: https://schema.org/Observation might be useful for cases where you have metrics that do not represent computational resources like CPU or memory, but that are still directly related.

In Autosubmit we have metrics in the prospective provenance (workflow configuration) that tell us how many nodes, how much memory, CPU per task, etc. we will use. These resource values could exist in either an Autosubmit namespace or a Slurm namespace (or both). PyCOMPSs probably uses the same Slurm resources at some point, though I'm not sure if that's available to external users before/after running the workflow.

When an Autosubmit workflow is executed, it reads the resource usage indicated in the configuration and executes on the HPC. It may use the requested resources, or fewer. So we are able to get the amount of resources actually used later (even metrics beyond what we specified, like the energy consumed, from Slurm metrics).

But the performance of climate models is not assessed only in terms of CPUs, memory and disk used. There are other metrics, like these ones from the CMIP Project (Coupled Model Intercomparison Project): […]

I will have a look to see if we can map that, from the provenance traces/logs/files, with the Schema.org Observation. Ideally users would be able to visualize both computational resource use and the performance of the model (in terms of these metrics), all from reading the workflow metadata. Thanks for the tips!!
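A minimal sketch of how a model-performance metric could be recorded with schema.org Observation (SYPD, simulated years per day, is used as a hypothetical example here, and the property choice is an assumption):

{
  "@id": "#action-1-sypd",
  "@type": "Observation",
  "variableMeasured": "SYPD",
  "value": 5.2,
  "observationDate": "2023-01-20T10:18:02Z",
  "observationAbout": {"@id": "#action-1"}
}
|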
As promised, taking https://workflowhub.eu/workflows/663 as an example:
The first part is statistics per resource (s01r1b41-ib0; only one worker was used in this run) and per method (accumulate, initPointsFrag and computeNewLocalClusters). The final part is the global statistics, aggregated over all resources (in this case it's the same, since only one worker was used). So, as a test, the last piece:
Could be represented as:
I don't know how to use the "propertyID" term. Let me know how it's going so far. |
@rsirvent the propertyID should point to a URL that identifies the property. For COMPSs-specific keys, a dedicated namespace could be used, for instance:

{
"@id": "#COMPSs_Workflow_Run_Crate_marenostrum4_SLURM_JOB_ID_30650595",
"@type": "CreateAction",
"resourceUsage": [
{"@id": "#maxTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"},
{"@id": "#executions_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"},
{"@id": "#avgTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"},
{"@id": "#minTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"}
],
...
},
{
"@id": "#maxTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "maxTime",
"propertyID": "https://w3id.org/ro/terms/compss#maxTime",
"unitCode": "https://qudt.org/vocab/unit/MilliSEC",
"value": "30"
},
{
"@id": "#executions_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "executions",
"propertyID": "https://w3id.org/ro/terms/compss#executions",
"value": "10"
},
{
"@id": "#avgTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "avgTime",
"propertyID": "https://w3id.org/ro/terms/compss#avgTime",
"value": "7",
"unitCode": "https://qudt.org/vocab/unit/MilliSEC"
},
{
"@id": "#minTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "minTime",
"propertyID": "https://w3id.org/ro/terms/compss#minTime",
"value": "6",
"unitCode": "https://qudt.org/vocab/unit/MilliSEC"
}

Note that […] |
#60 was merged as a Nextflow example. We need to formulate some text for the pages on how to do this with PropertyValue, documenting propertyID and unitCode. This relates more to Provenance Run Crate, as it is probably better suited to describing a particular process. But some statistics could make sense for the overall run as well as for the engine execution, though they could otherwise be difficult to aggregate at the workflow level. |