to_hipscat is significantly slower than to_parquet #352

Closed · 2 of 3 tasks
hombit opened this issue Jun 10, 2024 · 2 comments
Labels
bug Something isn't working

Comments

hombit (Contributor) commented Jun 10, 2024

Bug report

Catalog.to_hipscat() is much slower than Catalog._ddf.to_parquet() for large jobs.

For example, for a smaller job, to_hipscat() took 100 s while to_parquet() took only 50 s; for a larger job, to_hipscat() took 63 minutes while to_parquet() took only 12 minutes. These timings are from Bridges2 (128 cores / 256 GB), with 16 Dask workers.

For the larger job, the Dask dashboard shows no activity for most of the time, while top shows a single CPU core at 100% usage. This probably means that most of the time is spent on some planning step rather than on actual computation and I/O.
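
One way to check this could be to time the client-side graph optimization by itself, without computing anything. A minimal sketch, using the crossmatched catalog built in the code below and its ._ddf attribute:

import time

import dask

# Planning only: build and optimize the task graph without running it.
start = time.monotonic()
(optimized,) = dask.optimize(catalog._ddf)
print(f"graph optimization took {time.monotonic() - start:.1f}s "
      f"for {len(optimized.__dask_graph__())} tasks")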

Code I run
from pathlib import Path

import dask.distributed
import lsdb

HIPSCAT_PATH = Path('/ocean/projects/phy210048p/shared/hipscat/catalogs/')

PS1_BANDS = 'riz'

# Smaller version adds .cone_search(0, 45, 25*3600) to each catalog
ztf_dr17_coord = lsdb.read_hipscat(
    'hipscat/ztf_dr17_coord',
    margin_cache='hipscat/ztf_dr17_coord_2arcsec',
).query('filter == 2')  # r-band only
gaia = lsdb.read_hipscat(
    HIPSCAT_PATH / 'gaia_dr3' / 'gaia',
    margin_cache=str(HIPSCAT_PATH / 'gaia_dr3' / 'gaia_10arcs'),
    columns=[
        'ra', 'dec',
        # 'parallax',
        'parallax_over_error',
        'teff_gspphot', 'teff_gspphot_lower', 'teff_gspphot_upper',
        'logg_gspphot', 'logg_gspphot_lower', 'logg_gspphot_upper',
        # 'ag_gspphot', 'ag_gspphot_lower', 'ag_gspphot_upper',
    ],
).query(
    "parallax_over_error > 10.0"
    " and teff_gspphot_upper < 3800"
    " and (teff_gspphot_upper - teff_gspphot_lower) < 400"
    " and logg_gspphot_lower > 4.5"
    " and (logg_gspphot_upper - logg_gspphot_lower) < 0.2"
)
panstarrs = lsdb.read_hipscat(
    HIPSCAT_PATH / 'ps1' / 'ps1_otmo',
    margin_cache=str(HIPSCAT_PATH / 'ps1' / 'ps1_otmo_10arcs'),
    columns=
        ['raMean', 'decMean']
        + [f'{b}MeanPSFMag' for b in PS1_BANDS]
        + [f'{b}MeanPSFMagErr' for b in PS1_BANDS],
).query(
    "((rMeanPSFMag - iMeanPSFMag) + (rMeanPSFMagErr + iMeanPSFMagErr)) > 0.42"
    " and ((iMeanPSFMag - zMeanPSFMag) + (iMeanPSFMagErr + zMeanPSFMagErr)) > 0.23"
    " and rMeanPSFMagErr < 0.1 and iMeanPSFMagErr < 0.1 and zMeanPSFMagErr < 0.1"
)

catalog = ztf_dr17_coord.crossmatch(
    gaia,
    radius_arcsec=1,
    n_neighbors=1,
    suffixes=['', '_gaia'],
).crossmatch(
    panstarrs,
    radius_arcsec=1,
    n_neighbors=1,
    suffixes=['', ''],
)

with dask.distributed.Client(n_workers=16):
    # Slow path: the to_hipscat() run timed above.
    catalog.to_hipscat(base_catalog_path='hipscat/catalog_name', catalog_name='catalog_name')

with dask.distributed.Client(n_workers=16):
    # Fast path: plain Dask to_parquet() on the underlying dataframe.
    catalog._ddf.to_parquet('delete-me')
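
To see where the wall-clock time goes during the slow run, the to_hipscat() call could be wrapped in a Dask performance report. A minimal sketch (the report filename is arbitrary):

import dask.distributed
from dask.distributed import performance_report

with dask.distributed.Client(n_workers=16):
    # Writes an HTML summary of the task stream, scheduler, and worker activity.
    with performance_report(filename='to_hipscat_report.html'):
        catalog.to_hipscat(base_catalog_path='hipscat/catalog_name', catalog_name='catalog_name')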

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.
hombit added the bug label on Jun 10, 2024
hombit (Contributor, Author) commented Jul 11, 2024

This changes in the current release (0.2.6): the same code now runs .to_hipscat() in 59 minutes and .to_parquet() in 32 minutes.

delucchi-cmu (Contributor) commented

I tried to reproduce this issue again today, and it seems related more to the overall slowness of operations in the presence of a large task graph than to anything specific to the to_hats functionality.

This work is already tracked in this issue: lincc-frameworks/nested-dask#52
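
If the large task graph is indeed the bottleneck, one possible workaround (a sketch, not verified on this dataset) is to persist the crossmatched partitions on the cluster before writing, so that the write itself starts from a much smaller graph:

import dask.distributed

with dask.distributed.Client(n_workers=16):
    # Materialize the crossmatch results on the workers first; the
    # subsequent write then operates on already-computed partitions.
    ddf = catalog._ddf.persist()
    ddf.to_parquet('delete-me')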
