
Compare runtime/cost between high-cpu and standard cluster #46

Open
tomcordruw opened this issue Aug 27, 2024 · 11 comments

Run argo workflow processing significant chunk of a dataset (~3 million events) and compare the results and runtime for a standard cluster and one with higher vCPU count.

@tomcordruw tomcordruw self-assigned this Aug 27, 2024

tomcordruw commented Sep 16, 2024

Edit:
Took into account the time for the plotting step in the GCS workflows, which is missing in the NFS version.
The time shown is now the duration without the plotting step; the total with the plotting step included is shown in brackets.

Configuration:

  • events: 3000000
  • jobs: 48
  • recid: 30544
  • region: "europe-north1-b"
  • nodes: 12
  • disk type: pd-standard

Results:
NFS:

  • e2-standard-4: 4 hours 37 minutes
     Cost: 7.81 CHF

  • e2-highcpu-16: 3 hours 16 minutes
     Cost: 16.38 CHF

GCS Bucket:

argo_bucket_run.yaml:

  • e2-standard-4: 4 hours 33 minutes (4 hours 57 minutes)
     Cost: 9.01 CHF
  • e2-highcpu-16: 3 hours 24 minutes (3 hours 46 minutes)
     Cost: 18.72 CHF

argo_bucket_upload.yaml:

  • e2-standard-4: 4 hours 55 minutes (5 hours 17 minutes)
     Cost: 10.67 CHF
  • e2-highcpu-16: 3 hours 36 minutes (3 hours 59 minutes)
     Cost: 18.81 CHF
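The runtimes and costs above can also be compared as throughput per cost. A minimal sketch using the NFS numbers (timings without the plotting step, figures copied from the results):

```python
# Sketch: compare throughput (events/hour) and cost efficiency (events/CHF)
# for the NFS runs reported above.
EVENTS = 3_000_000

runs = {
    "e2-standard-4 (NFS)": {"hours": 4 + 37 / 60, "cost_chf": 7.81},
    "e2-highcpu-16 (NFS)": {"hours": 3 + 16 / 60, "cost_chf": 16.38},
}

for name, run in runs.items():
    events_per_hour = EVENTS / run["hours"]
    events_per_chf = EVENTS / run["cost_chf"]
    print(f"{name}: {events_per_hour:,.0f} events/h, "
          f"{events_per_chf:,.0f} events/CHF")
```

By this measure the high-cpu cluster finishes faster but delivers fewer events per CHF.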


katilp commented Sep 19, 2024

@tomcordruw Is the time of the bucket workflows without the final plotting step? If not, can you see from the outputs, how long did it take?

@tomcordruw

> @tomcordruw Is the time of the bucket workflows without the final plotting step? If not, can you see from the outputs, how long did it take?

Oh, that would explain it, I didn't realise that step was missing in the NFS workflow.
The plotting step is included in the total runtime here; in the tests it took 20–25 minutes, which pretty much accounts for the difference.


katilp commented Sep 20, 2024

@tomcordruw Did these jobs run with the image on the node already or does the time include the image pull?
We need to have the time without the image pull for a scalable comparison. Currently, the image pull is more than 30 mins and may vary so it can distort the comparison.

@tomcordruw

> @tomcordruw Did these jobs run with the image on the node already or does the time include the image pull? We need to have the time without the image pull for a scalable comparison. Currently, the image pull is more than 30 mins and may vary so it can distort the comparison.

Unfortunately the time includes the image pull, but I am currently testing the script after modifying it to first run the start job and pull the images.
From what I can tell, image pulling/pod initialisation takes 31–32 minutes in these configurations, which is in line with the difference between the workflows I have run with and without previously pulled images.

Of course, errors and other issues can prolong the image-pull step, so it will be accounted for from now on.


katilp commented Sep 20, 2024

@tomcordruw Is this a fair comparison?

e2-standard-4: 4 vCPUs, 16 GB mem
e2-highcpu-16: 16 vCPUs, 16 GB mem.

With 48 jobs, a 12-node e2-highcpu-16 cluster is mostly idle.

CPU-wise each node could have run 12 jobs (0.8 * 16, because we requested 800m CPU); memory-wise, 6.
Now it most likely ran only 4 (if the 48 jobs were evenly distributed across the nodes), or many nodes were idle.
And the cost scales with the time, not with the occupancy.

A fair comparison would be how many events / hour we can get with the maximum occupancy.

For memory requests, as seen in #49 (comment), we could most likely set them lower than 2.3 GB; e.g. 1.5 GB would allow ~10 jobs/node.
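The packing arithmetic here can be sketched as follows; `cpu_alloc_frac` (the fraction of a node's vCPUs actually allocatable to pods) is an assumption, since the exact GKE system reservation varies by node configuration:

```python
import math

def jobs_per_node(vcpus, mem_gb, cpu_request, mem_request_gb,
                  cpu_alloc_frac=0.8):
    """Max schedulable jobs per node, limited by CPU or memory.

    Assumes roughly cpu_alloc_frac of the vCPUs are allocatable to
    pods; the exact figure depends on the GKE node configuration.
    """
    by_cpu = math.floor(vcpus * cpu_alloc_frac / cpu_request)
    by_mem = math.floor(mem_gb / mem_request_gb)
    return min(by_cpu, by_mem)

# e2-highcpu-16 (16 vCPUs, 16 GB); jobs request 800m CPU, 2.3 GB memory
print(jobs_per_node(16, 16, 0.8, 2.3))  # -> 6, memory-bound
# Lowering the memory request to 1.5 GB:
print(jobs_per_node(16, 16, 0.8, 1.5))  # -> 10
```

In both cases memory is the binding constraint on a highcpu node, which is why lowering the memory request raises the packing from 6 to ~10 jobs/node.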


tomcordruw commented Sep 20, 2024

@katilp
Indeed, what I'm seeing supports that.
And yes, the cost is based on time, not resource usage, so I will try lowering the resource requests and see how well the high-CPU clusters can be utilised that way.

The resource usage I'm getting so far indicates a 1:2 ratio of vCPU (800m) to memory (~1.6 GB) for each job.
While no e2 machine type matches that ratio (standard is 1:4 and highcpu is 1:1), it can be achieved with custom machine types, which would reduce the amount of unused resources.
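A rough sketch of sizing a custom node shape for that 1:2 job footprint. The headroom defaults and the even-vCPU rounding are assumptions (GCE custom machine types generally require an even vCPU count above 1, and GKE reserves some resources for system daemons):

```python
import math

def custom_machine_shape(jobs, cpu_request=0.8, mem_request_gb=1.6,
                         cpu_headroom=1.0, mem_headroom_gb=2.0):
    """Suggest a node shape (vCPUs, GB) to pack `jobs` jobs.

    The headroom values leave room for system daemons; the exact
    reservation GKE makes varies, so these defaults are illustrative.
    """
    vcpus = math.ceil(jobs * cpu_request + cpu_headroom)
    if vcpus % 2:
        vcpus += 1  # custom machine types generally need an even vCPU count
    mem_gb = math.ceil(jobs * mem_request_gb + mem_headroom_gb)
    return vcpus, mem_gb

# A node intended to run 10 jobs of 800m CPU / 1.6 GB each:
print(custom_machine_shape(10))  # -> (10, 18), close to the 1:2 ratio
```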


katilp commented Sep 20, 2024

Right, but the first thing is to have a large enough number of jobs that it really fills the cluster. The number of jobs might need to differ between the two cluster types, or the high-CPU cluster could have fewer nodes. What matters is the total number of CPUs.


tomcordruw commented Sep 20, 2024

Right, so e.g. for 12 e2-highcpu-16 nodes, after adjusting resource requests, it should allow 10 jobs per node, meaning 120 jobs total, or alternatively 5 nodes for 48 jobs to have a fair comparison?


katilp commented Sep 20, 2024

> Right, so e.g. for 12 e2-highcpu-16 nodes, after adjusting resource requests, it should allow 10 jobs per node, meaning 120 jobs total, or alternatively 5 nodes for 48 jobs to have a fair comparison?

Yes, something like this. It probably requires some manual inspection. Best to start a workflow and see how the jobs go. If there are "left-overs", i.e. jobs that do not fit running in parallel, then decrease the number of jobs so that all pfnano steps go at the same time.
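The left-over check amounts to comparing the job count against the cluster's parallel capacity; a minimal sketch:

```python
def leftover_jobs(jobs, nodes, jobs_per_node):
    """Jobs that do not fit in the first parallel 'wave' and would
    leave some pfnano steps waiting behind the others."""
    capacity = nodes * jobs_per_node
    return max(0, jobs - capacity)

print(leftover_jobs(120, 12, 10))  # -> 0: all steps run at the same time
print(leftover_jobs(48, 5, 10))    # -> 0: the 5-node alternative also fits
print(leftover_jobs(130, 12, 10))  # -> 10: decrease the job count
```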

@tomcordruw

Okay, seems clear to me now!
I will do some runs and inspect how things behave and update the comparisons accordingly.
