Crystalball budgets too many rows per chunk when system has a lot of memory #55

Open
SpheMakh opened this issue Apr 6, 2023 · 7 comments · May be fixed by #56

@SpheMakh
Collaborator

SpheMakh commented Apr 6, 2023

  • crystalball version: 0.3.0
  • Python version: 3.7
  • Operating System: Ubuntu 20.04.6 LTS

Description

I was predicting from a wsclean source list. The progress bar reported 100% from the start until I terminated the run.

What I Did

Ran crystalball through caracal
https://github.com/caracal-pipeline/caracal/blob/751769ce6d6f14651c03e5988d71eef032e88d84/caracal/workers/crosscal_worker.py#L587

# 2023-04-05 10:52:22 | INFO     | crystalball:_predict | Crystalball version 0.3.0
# 2023-04-05 10:52:22 | INFO     | wsclean:import_from_wsclean | /stimela_mount/input/1934-collapsed-uhf-cat.txt contains 7830 components
# 2023-04-05 10:52:22 | INFO     | wsclean:import_from_wsclean | Total flux of 7830 selected components is 18.317970 Jy
# 2023-04-05 10:52:22 | INFO     | ms:ms_preprocess | inserting new column MODEL_DATA
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | --------------------------------------------------
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | Budgeting
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | --------------------------------------------------
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | system RAM = 1510.60 GB
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | nr of logical CPUs = 48
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | nr sources = 7830
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | nr rows    = 626913
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | nr chans   = 1988
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | nr corrs   = 2
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | sources per chunk = 4249 (auto settings)
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | rows per chunk    = 424942 (auto settings)
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | expected memory usage = 755.30 GB
# 2023-04-05 10:52:23 | INFO     | crystalball:_predict | Field J1939-6342 DDID 0 rows 148428 chans 1988 corrs 2
# Successful read/write open of default-locked table /stimela_mount/msdir/1634506452_sdp_l0.8k-cal.ms: 23 columns, 626913 rows
# 
# [##########################################] | 100% Complete (Estimate) |  2m 0s / ~ 2m 0s
# [##########################################] | 100% Complete (Estimate) |  2m 5s / ~ 2m 5s
# [##########################################] | 100% Complete (Estimate) |  2m10s / ~ 2m10s
# [##########################################] | 100% Complete (Estimate) |  2m15s / ~ 2m15s
# [##########################################] | 100% Complete (Estimate) |  2m20s / ~ 2m20s
# [##########################################] | 100% Complete (Estimate) |  2m25s / ~ 2m25s

I terminated after 3 hours.

@paoloserra
Collaborator

very strange, so far it's worked beautifully for me (except for the first few percent of the time, but then it stabilizes)

@KshitijT
Contributor

KshitijT commented Apr 6, 2023

I remember mentioning to @sjperkins a couple of days ago that the progress bar suddenly went from 99% to 1%. As @sjperkins explained, is it a matter of estimating progress rather than actually computing it?

@paoloserra
Collaborator

@KshitijT was this at the start, or somewhere in between? At the start it's acceptable

@sjperkins
Collaborator

> very strange, so far it's worked beautifully for me (except for the first few percent of the time, but then it stabilizes)

I checked this yesterday on codex-africanus master branch and it seemed to work.

I think the problem here is that there are only 4 chunks of work

# 2023-04-05 10:52:22 | INFO     | budget:get_budget | nr sources = 7830
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | nr rows    = 626913
...
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | sources per chunk = 4249 (auto settings)
# 2023-04-05 10:52:22 | INFO     | budget:get_budget | rows per chunk    = 424942 (auto settings)

so the progress bar doesn't get to make good estimates (based on historical data). If you increase the chunking on row, do things improve?
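To make the chunk count concrete, here is a minimal sketch of the arithmetic using the numbers from the log above. It assumes the total work is the number of row chunks times the number of source chunks, which is consistent with the 4 chunks mentioned. `count_chunks` is a hypothetical helper for illustration, not part of crystalball.

```python
import math


def count_chunks(nr_rows, nr_sources, rows_per_chunk, sources_per_chunk):
    """Return (row_chunks, source_chunks, total_chunks).

    Assumption: total work = row chunks x source chunks.
    """
    row_chunks = math.ceil(nr_rows / rows_per_chunk)
    source_chunks = math.ceil(nr_sources / sources_per_chunk)
    return row_chunks, source_chunks, row_chunks * source_chunks


# Values taken from the budget log.
print(count_chunks(626913, 7830, 424942, 4249))  # (2, 2, 4)

# Illustrative smaller row chunk (10000 is an arbitrary choice).
print(count_chunks(626913, 7830, 10000, 4249))   # (63, 2, 126)
```

With more chunks, a progress bar that estimates from completed tasks has more history to work with.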

@sjperkins
Collaborator

I agree, it shouldn't be saying 100% done from the start though.

@KshitijT
Contributor

KshitijT commented Apr 6, 2023

> @KshitijT was this at the start, or somewhere in between? At the start it's acceptable

This was somewhere in between.

@sjperkins changed the title from "Faulty progress bar" to "Crystall budgets too many rows per chunk when system has a lot of memory" on Apr 6, 2023
@sjperkins changed the title from "Crystall budgets too many rows per chunk when system has a lot of memory" to "Crystalball budgets too many rows per chunk when system has a lot of memory" on Apr 6, 2023
@sjperkins
Collaborator

  • The system has 1.5 TB of memory.
  • crystalball budgeting sees that it can allocate many rows (424942 out of 626913 total) and sources (4249 out of 7830 total), given 48 CPUs.
  • This results in only 4 chunks to compute, underutilising the other 44 CPUs.
  • It also confuses the progress bar.
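
One possible way to avoid this, sketched under assumptions: cap the memory-derived rows per chunk so that there is at least one row chunk per CPU. `cap_rows_per_chunk` and its `chunks_per_cpu` parameter are hypothetical names for illustration. This is not crystalball's actual budgeting code, and it is not necessarily what the linked pull request does.

```python
import math


def cap_rows_per_chunk(nr_rows, rows_per_chunk, nr_cpus, chunks_per_cpu=1):
    """Limit rows_per_chunk so rows split into >= nr_cpus * chunks_per_cpu chunks.

    Hypothetical helper: the memory budget gives an upper bound on
    rows_per_chunk, and the CPU count gives a second, tighter bound.
    """
    target_chunks = max(1, nr_cpus * chunks_per_cpu)
    cap = math.ceil(nr_rows / target_chunks)
    return max(1, min(rows_per_chunk, cap))


# Values from the budget log: 626913 rows, 424942 rows per chunk, 48 CPUs.
rows = cap_rows_per_chunk(626913, 424942, 48)
print(rows)                      # 13061
print(math.ceil(626913 / rows))  # 48 row chunks, one per CPU
```

The memory-derived value is still respected when it is already the smaller of the two bounds, so the cap never increases memory use.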

@sjperkins linked a pull request on Apr 6, 2023 that will close this issue