
NERSC SFAPI Reconstruction Flow #44

Open — wants to merge 20 commits into base: main
Conversation

@davramov (Contributor) commented Dec 4, 2024

Setting this up as a draft PR for now while I iron out implementation details to match the ALCF workflow.

These two steps are currently working via sfapi and the container Raja built (registry.nersc.gov/als/tomorecon_nersc_mpi_hdf5); a minimal submission sketch follows the list below:

  • Reconstruction
  • Tiff to Zarr
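For context, here is a minimal sketch of what a Superfacility API submission for these steps can look like, using sfapi_client's Client / compute / submit_job interface. The credential environment variables, queue, account, walltime, and container invocation below are placeholders and not necessarily what this PR does:

```python
import os

from sfapi_client import Client
from sfapi_client.compute import Machine

# Hypothetical credential handling; the flow reads its own values from the environment.
client_id = os.getenv("NERSC_CLIENT_ID")
private_key = os.getenv("NERSC_PEM")

# Placeholder Slurm script: queue, account, and container invocation are assumptions.
job_script = """#!/bin/bash
#SBATCH -q debug
#SBATCH -A als
#SBATCH -N 1
#SBATCH -t 00:15:00
podman-hpc run --rm registry.nersc.gov/als/tomorecon_nersc_mpi_hdf5 \\
    python reconstruction.py /path/to/input.txt
"""

with Client(client_id, private_key) as client:
    perlmutter = client.compute(Machine.perlmutter)
    job = perlmutter.submit_job(job_script)
    job.complete()  # block until the Slurm job reaches a terminal state
```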

There are still at least a few things that need to be addressed before it is production ready:

  • Pass into reconstruction.py the same file_path that new_832_file_flow and alcf_recon_flow expect (still using a legacy version of reconstruction.py that expects an input.txt file)
  • File transfers back to data832
  • Schedule Pruning
  • Pytest cases

@dylanmcreynolds (Contributor) left a comment

Nice work.

I added a number of review comments. This is a really nice start, and makes the nersc-specific code easy to see and follow. If my comments were a little picky, it's because the code was easy to follow.

tomography_hpc.py should be split into separate modules, something like: tomography_hpc.py, nersc.py, and alcf.py, so that no more code than necessary is imported (e.g., when running an ALCF job, you don't need to import NerscClient).

Also, what about job_controller.py rather than tomography_hpc.py? Since this is under flows/bl832, tomography is implied. And we might some day write the job controllers for local (non-HPC) jobs.
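To make the suggestion concrete, here is a rough sketch of what a job_controller.py split might look like. The class and method names are taken from later discussion in this thread or assumed, constructor arguments are omitted, and the PR's actual enum ends up storing the controller implementation on each member rather than branching in a factory:

```python
from abc import ABC, abstractmethod
from enum import Enum


class TomographyHPCController(ABC):
    """Facility-agnostic interface for running tomography jobs on an HPC system."""

    @abstractmethod
    def reconstruct(self, file_path: str) -> bool: ...

    @abstractmethod
    def build_multi_resolution(self, file_path: str) -> bool: ...


class HPC(Enum):
    ALCF = "alcf"
    NERSC = "nersc"


def get_controller(hpc: HPC) -> TomographyHPCController:
    # Import facility modules lazily so an ALCF run never pulls in NERSC-only
    # dependencies (and vice versa); module paths here are assumptions.
    if hpc is HPC.NERSC:
        from orchestration.flows.bl832.nersc import NERSCTomographyHPCController
        return NERSCTomographyHPCController()
    if hpc is HPC.ALCF:
        from orchestration.flows.bl832.alcf import ALCFTomographyHPCController
        return ALCFTomographyHPCController()
    raise ValueError(f"Unsupported facility: {hpc}")
```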

[10 review comments on orchestration/flows/bl832/tomography_hpc.py — outdated, resolved]
…ig.py, renamed tomography_hpc.py to job_controller.py. Added back nersc.py, which now just uses job_controller (need to add back in the other data transfer/pruning logic, etc.). Inject NERSCClient into the nersc controller init. Removed references to export controls. Fixed typing for Config832 references (and pass into the init). Turned create_nersc_client() into a static method in the class. Moved load_dotenv to the top of the module. Consolidated reconstruct() and build_multi_resolution() in the nersc implementation.
@dylanmcreynolds (Contributor) left a comment

This

…c.py. Updated get_controller() in job_controller.py to take in an HPC enum that stores the corresponding Controller implementation for that facility.
password = os.getenv("NERSC_PASSWORD")

job_script = f"""#!/bin/bash
#SBATCH -q preempt
@davramov (Contributor, Author) commented Dec 11, 2024

Several queue things I have figured out:

  • debug has no minimum time requirement, but using a different queue such as preempt requires allocating a minimum of 2 hours for a job (which I think is overkill in our case, since recon should only take 5-10 min).
  • Based on testing debug vs. preempt, it seems like the debug jobs get picked up much faster (perhaps due to the smaller time constraint, since both queues are medium priority).
  • There is also the interactive queue, which has high priority, but only 4 nodes x 2 jobs can be active simultaneously; we would have to modify the job submission script to use salloc (a comparison of the two batch headers is sketched below).

Just to add some numbers to this after some testing today:

  • jobs in the debug queue have been consistently picked up after about 10 minutes
  • 2 jobs I scheduled this morning in the preempt queue have yet to be picked up (approaching 4 hours...)
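For reference, the difference being compared here is just the batch header the flow generates. A sketch of the two variants, with account and walltime values as placeholders and the 2-hour minimum taken from the observation above:

```python
# debug: no minimum walltime; in this testing, jobs were picked up in ~10 minutes
debug_header = """#!/bin/bash
#SBATCH -q debug
#SBATCH -A als
#SBATCH -N 1
#SBATCH -t 00:15:00
"""

# preempt: requires requesting at least 2 hours; long queue waits observed
preempt_header = """#!/bin/bash
#SBATCH -q preempt
#SBATCH -A als
#SBATCH -N 1
#SBATCH -t 02:00:00
"""
```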

@davramov (Contributor, Author) commented

Bjoern let us know that we (under the 'als' project) have access to the realtime queue at NERSC, which should reduce latency. I had an issue using that queue with the following error:
Error during job submission or completion: sbatch: error: Batch job submission failed: Invalid qos specification

I opened a ticket and we are still waiting to hear from the help desk about the correct job parameters to use the queue:
https://nersc.servicenowservices.com/sp?sys_id=03d747c187a216101c6cea470cbb35e9&view=sp&id=ticket&table=incident

…ath and run respective tasks. Revert to debug queue (at least for now) since the queue time is orders of magnitude shorter than the preempt queue (want to look into the interactive queue). Moved NERSCTomographyHPCController back into job_controller.py due to circular imports. Fixed Enum implementation.
…ows/bl832/nersc.py. Removed the ALCF controller from job_controller, in anticipation of that being defined in alcf.py when I refactor that code.
@dylanmcreynolds (Contributor) commented
Now that you have a nice clean NERSCTomographyHPCController, you have something you can easily write unit tests for. I'd recommend mocking NerscClient and testing that the calls to it are as expected. Also, are we still sure that we need NerscClient?

…it for the job to complete before moving on to the next step. An OK workaround for this: NERSC/sfapi_client#93
@davramov (Contributor, Author) commented
> Now that you have a nice clean NERSCTomographyHPCController, you have something you can easily write unit tests for. I'd recommend mocking NerscClient and testing that the calls to it are as expected. Also, are we still sure that we need NerscClient?

Sounds good to me, I will work on adding unit tests. I suspect we can replace references to NerscClient with the official sfapi_client Client class, and capture that logic in create_nersc_client() in NERSCTomographyHPCController. That could be a bit cleaner.
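A sketch of the kind of unit test being suggested here (not the PR's actual test_sfapi_flow.py); the import path, the create_nersc_client() hook, and the reconstruct() behavior asserted below are assumptions based on this thread:

```python
from unittest.mock import MagicMock, patch

# Hypothetical import path for the controller under test.
from orchestration.flows.bl832.job_controller import NERSCTomographyHPCController


def test_reconstruct_submits_job_and_waits():
    mock_client = MagicMock()
    mock_compute = mock_client.compute.return_value    # stands in for the Perlmutter handle
    mock_job = mock_compute.submit_job.return_value    # stands in for the submitted job

    # Patch the client factory so no real call to NERSC is made.
    with patch.object(NERSCTomographyHPCController, "create_nersc_client", return_value=mock_client):
        controller = NERSCTomographyHPCController()
        result = controller.reconstruct(file_path="raw/sample_001.h5")

    mock_compute.submit_job.assert_called_once()   # a job script was submitted
    mock_job.complete.assert_called_once()         # and the controller waited for completion
    assert result is True
```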

@dylanmcreynolds (Contributor) left a comment

Let's do the following after discussion:

  • add a big comment to the NerscClient saying that it's deprecated and will be removed when we refactor the ptychography code

  • Remove initialization of Config832 and the sfapi Client from NERSCTomographyHPCController, passing them down from flows to tasks as parameters (see the sketch below)

  • Create unit tests for NERSCTomographyHPCController methods
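A minimal sketch of the injection described in the second bullet; the names follow the discussion above, and the import paths are assumptions:

```python
from sfapi_client import Client

from orchestration.flows.bl832.config import Config832  # hypothetical import path


class NERSCTomographyHPCController:
    """NERSC implementation of the tomography job controller."""

    def __init__(self, client: Client, config: Config832) -> None:
        # Both dependencies are required and constructed by the calling flow,
        # which keeps this class trivial to mock in unit tests.
        self.client = client
        self.config = config
```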

…l be removed. Removed init of Config832 and sfapi_client from NERSCTomographyHPCController, and instead inject them (no longer optional). Updated unit tests in
@davramov (Contributor, Author) commented
  • Added comments in orchestration/nersc.py indicating that NerscClient is deprecated and will be removed.
  • Removed init of Config832 and sfapi_client from NERSCTomographyHPCController, and instead inject them (no longer optional).
  • Updated unit tests in test_sfapi_flow.py

@davramov marked this pull request as ready for review December 20, 2024 17:47
@davramov (Contributor, Author) commented Jan 9, 2025

I met with Raja and we came up with a few next steps to wrap this up:

  • Clean up home/scratch directories to make it production ready (in recon and multires implementations)
    • Copy data from /global/cfs/cdirs/als/data_mover/8.3.2/raw to pscratch for processing
  • Reconstructions need to be saved in the appropriate scratch/ location on NERSC
    • Working through a "permission denied" issue copying reconstruction from pscratch to /global/cfs/cdirs/als/data_mover/8.3.2/scratch
  • Transfer reconstructions back to data832
  • Scheduling pruning
    • Code exists, just waiting to test once data transfers are sorted out.
  • Register Prefect flows/deployments for NERSC tomo reconstruction
    • Ready to deploy using create_deployments_832_nersc.py once the prior steps are sorted out
  • Create a new agent container on flow-prd to handle NERSC tomo reconstruction
  • Come up with a game plan for sfapi credential reauthorization (can we request an SFAPI Client that will be valid for the whole cycle?)

…. Updated nersc.py to use the realtime queue, which now works with minimal delay. Updated recon/multires jobs to use pscratch for temporary file storage/writing, after copying data from the source at /global/cfs/.../8.3.2/raw/. The next steps include copying the reconstructions to the appropriate /global/cfs/.../8.3.2/scratch/ directory, copying to data832, and scheduling pruning.
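A hedged sketch of the staging pattern this commit describes, written as the Slurm-script fragment the flow might build; the queue, account, walltime, dataset name, and exact directory layout are placeholders:

```python
file_name = "20240101_sample_001.h5"  # hypothetical dataset name

job_script = f"""#!/bin/bash
#SBATCH -q realtime
#SBATCH -A als
#SBATCH -N 1
#SBATCH -t 00:15:00

# 1) Stage raw data from CFS onto pscratch for fast local I/O
mkdir -p $PSCRATCH/8.3.2/raw $PSCRATCH/8.3.2/scratch
cp /global/cfs/cdirs/als/data_mover/8.3.2/raw/{file_name} $PSCRATCH/8.3.2/raw/

# 2) Run the reconstruction and multi-resolution steps against the pscratch copy
#    (container invocation omitted here)

# 3) Copy results back to the CFS scratch area so they can be transferred to data832
#    (this is the step currently hitting the "permission denied" issue noted above)
cp -r $PSCRATCH/8.3.2/scratch/. /global/cfs/cdirs/als/data_mover/8.3.2/scratch/
"""
```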
…ed nersc832_alsdev_pscratch endpoint in the config, and added a nersc_prune_pool and Prefect deployment.
… move data within the nersc reconstruction flow.