About SI00, CPUTime, maxCPUTime and what more #5912
Hi, (Sorry if this got much longer than I hoped.) In a previous ticket of mine (#5897) I made the remark that
to which Federico replied:
I'd like to continue this discussion in a separate ticket, so as not to pollute the other ticket.
My goal was not to bother users with HepSpec06 seconds. That may be perfect for CERN, but I wonder whether it will make our users happy. I'm sure that if we told them that CPUTime is in HS06 seconds, we would get a lot of questions about how to calculate this, and in the end we would have to do it for them. I'm wondering how other communities deal with this.
As I understand it, the CPUTime JDL parameter is multiplied by SI00/250 and then compared with the maxCPUTime parameter to find a queue that matches the JDL. Then a pilot job is submitted. I realise other parameters should also match. In our case I set maxCPUTime equal to the maximum wall clock time of the queue. If I do this and then set CPUTime just a few seconds above the maxCPUTime of a particular queue, the job will not end up in that queue anymore. That's what I meant by 'a 1-to-1 correspondence between CPUTime and maxCPUTime.'
On our local GinA compute cluster, I configured 2 queues: infra and long. The infra queue has a wall clock time of 30 minutes and can only run pvier VO jobs (it's basically for testing). In the JDL I would then set CPUTime = 1800 (which is 30 minutes). If I do not set CPUTime, the defaultCPUTime is used, which is the equivalent of 96 hours in our case. Then the job won't run, as the defaultCPUTime is much higher than the maxCPUTime.
I was thinking I might as well set the maxCPUTime parameter for the infra queue to the equivalent of 96 hours. Then I don't need the CPUTime parameter in the JDL anymore, and can just use the 'infra' tag. Or not even that: as infra only supports pvier jobs, pvier jobs automatically go to the infra queue.
In the documentation it says:
Does this mean the maximum time pilot jobs are allowed to run in the queue? At least that's what I think it means. I guess I need some advice on the best way, considering our non-CERN users, to configure DIRAC so that they are not bothered (too much) with things they should not have to, and do not want to, care about.
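The matching rule as I describe it above can be sketched roughly as follows. This only illustrates my understanding (CPUTime scaled by SI00/250, then compared with maxCPUTime); it is not DIRAC's actual matching code. The `Queue` class, the function names, the per-queue `si00` values and the 96-hour limit of the long queue are all assumptions made up for the example.

```python
from dataclasses import dataclass

REFERENCE_SI00 = 250  # the divisor quoted in the post (SI00/250)


@dataclass
class Queue:
    name: str
    max_cpu_time: int  # seconds, the value configured as maxCPUTime
    si00: float        # hypothetical SI00 value for this queue


def scaled_cpu_time(cpu_time: float, si00: float) -> float:
    """Scale the JDL CPUTime by SI00/250, as described in the post."""
    return cpu_time * si00 / REFERENCE_SI00


def matching_queues(cpu_time: float, queues: list[Queue]) -> list[str]:
    """Return the names of queues whose maxCPUTime can hold the scaled CPUTime."""
    return [
        q.name
        for q in queues
        if scaled_cpu_time(cpu_time, q.si00) <= q.max_cpu_time
    ]


# With SI00 = 250 the scaling factor is 1, which gives the
# '1-to-1 correspondence' between CPUTime and maxCPUTime.
queues = [
    Queue("infra", 30 * 60, 250),    # 30 minutes wall clock time
    Queue("long", 96 * 3600, 250),   # assumed limit, not stated in the post
]

print(matching_queues(1800, queues))       # CPUTime = 1800 in the JDL
print(matching_queues(1801, queues))       # one second above infra's limit
print(matching_queues(96 * 3600, queues))  # the 96-hour defaultCPUTime
```

Under these assumptions a job with CPUTime = 1800 can still land in infra, while anything above that (including a job that falls back to the 96-hour default) only matches the long queue.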
Replies: 1 comment
Hi, Sorry, definitions can be inconsistent between the documentation and the code, we have to work on that... Let's start with some definitions (that have not been applied in the code yet :-)):
The code and the documentation are not very clear about this,
The PilotBenchmark provides the The If the batch system is not recognized, then the
The
From what I understand, the
In this case, what is the value of the
Using the defaultCPUTime of the jobs is probably the best option to not bother your users with such a complex task as you said.
Let me know if you need further explanations.
Hi,
Sorry, definitions can be inconsistent between the documentation and the code, we have to work on that...
Let's start with some definitions (that have not been applied in the code yet :-)):
Because an application will not take the same time to run on different CPU models, this parameter is not sufficient, especially if you are using heterogeneous computing resources.
Thus, in such a context, we generally need to normalize the CPUTime.
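As a rough illustration of what normalizing means here, the sketch below multiplies the raw CPU seconds by a CPU power factor (for example an HS06 score), so that the same workload yields the same normalized value on a slow and on a fast machine. The function names and the power values are hypothetical and only serve the example; this is not the code DIRAC uses.

```python
def normalize_cpu_time(cpu_seconds: float, cpu_power: float) -> float:
    """Convert raw CPU seconds into normalized units (seconds x power)."""
    return cpu_seconds * cpu_power


def expected_seconds(normalized_cpu_time: float, cpu_power: float) -> float:
    """Estimate the raw CPU seconds a normalized workload needs on a given CPU."""
    return normalized_cpu_time / cpu_power


# The same application on two hypothetical CPU models:
slow_node = normalize_cpu_time(1000, 10)  # 1000 s on a CPU rated 10
fast_node = normalize_cpu_time(500, 20)   # 500 s on a CPU rated 20

# Both runs represent the same amount of normalized work.
print(slow_node, fast_node)

# Going the other way: how long would that work take on the fast CPU?
print(expected_seconds(slow_node, 20))
```

The point is that a limit expressed in normalized units stays meaningful across heterogeneous resources, whereas a limit in raw seconds does not.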