-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NERSC SFAPI Reconstruction Flow #44
base: main
Are you sure you want to change the base?
Conversation
…orchestration/flows/bl832/tomography_hpc.py). Linting orchestration/nersc.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work.
I added a number of review comments. This is a really nice start, and makes the nersc-specific code easy to see and follow. If my comments were a little picky, it's because the code was easy to follow.
tomocagrahpy_hpt.py should be split into sepearate modules, something like: tomography_hpc.py, nersc.py and alcf.py so that no more code than necessary is imported (e.g. when running an ALCF job, you don't need to import NerscClient.)
Also, what about job_controller.py
rather than tomography_hpc.py
? Since this is under flows/bl832, tomography is implied. And we might some day write the job controllers for local (non-HPC) jobs.
…ig.py, renamed tomography_hpc.py to job_controller.py. Added back nersc.py, which now just uses job_controller (need to add back in the other data transfer/pruning logic, etc). Inject NERSCClient into the nersc controller init. Removed references to export controls. Fixed typing for Config832 references (and pass into the init). Turned create_nersc_client() into a static method in the class. Moved load_dotenv to the top of the module. Consoldated reconstrut() and and build_multi_resolution() in the nersc implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This
…c.py. Updated get_controller() in job_controller.py to take in an HPC enum that stores the corresponding Controller implementation for that facility.
orchestration/flows/bl832/nersc.py
Outdated
password = os.getenv("NERSC_PASSWORD") | ||
|
||
job_script = f"""#!/bin/bash | ||
#SBATCH -q preempt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several queue things I have figured out:
debug
has no minimum time requirement, but using a different queue such aspreempt
requires allocating 2 hours minimum for a job (which I think is a bit overkill in our case, since recon should only take 5-10 min).- Based on testing debug vs preempt, it seems like the the debug jobs get picked up much faster (perhaps due to smaller time constraint, as they are both medium priority).
- There is also the
interactive
queue, which has a high priority, but only 4 nodes x 2 jobs can be active simultaneously. We would have to modify the job submission script to usesalloc
Just to add some numbers to this after some testing today:
- jobs in the
debug
queue have been consistently picked up after about 10 minutes - 2 jobs I scheduled this morning in the
preempt
queue have yet to be picked up (approaching 4 hours...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bjoern let us know that we (under the 'als' project) have access to the realtime
queue at NERSC, which should reduce latency. I had an issue using that queue with the following error:
Error during job submission or completion: sbatch: error: Batch job submission failed: Invalid qos specification
I opened a ticket and we are still waiting to hear from the help desk about the correct job parameters to use the queue:
https://nersc.servicenowservices.com/sp?sys_id=03d747c187a216101c6cea470cbb35e9&view=sp&id=ticket&table=incident
…ath and run respective tasks. Revert to debug queue (at least for now) since the queue time is orders of magnitude short than the preempt queue (want to look into the interactive queue). Moved NERSCTomographyHPCController back into job_controller.py due to circular imports. Fixed Enum implementation.
…ows/bl832/nersc.py. Removed the ALCF controlelr from job_controller, in anticipation of that being defined in alcf.py when I refactor that code.
Now that you have a nice clean |
…it for the job to complete before moving onto the next step. An OK workaround for this: NERSC/sfapi_client#93
Sounds good to me, I will work on adding unit tests. I suspect we can replace references to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do the following after discussion:
-
add a big comment to the NerscClient saying that it's deprecated and will be removed when we refactor the ptychography code
-
Remove initialization of Config832 and sfapi Client from NERSCTomogrpahyHPCController, passing them down from flows to tasks as parameters
-
Create unit tests for NERSCTomogrpahyHPCController methods
…l be removed. Removed init of COnfig832 and sfapi_client from NERSCTomographyHPCCOntroller, and instead inject them (no longer optional). Updated unit tests in
|
… with up to date python installations
I met with Raja and we came up with a few next steps to wrap this up:
|
…. Updated nersc.py to use the realtime queue, which now works with minimal delay. Updated recon/multires jobs to use pscratch for temporary file storage/writing, after copying data from the source at /global/cfs/.../8.3.2/raw/. The next steps include copying the reconstructions to the appropriate /global/cfs/.../8.3.2/scratch/ directory, copying to data832, and scheduling pruning.
…ed nersc832_alsdev_pscratch endpoint in the config, and added a nersc_prune_pool and prefect deployment.
… move data within the nersc reconstruction flow.
Setting this up as a draft PR for now while I iron out implementation details to match the ALCF workflow.
These two steps are currently working via sfapi and the container Raja built (registry.nersc.gov/als/tomorecon_nersc_mpi_hdf5):
There are still at least a few things that need to be addressed before it is production ready:
new_832_file_flow
andalcf_recon_flow
expect (still using a legacy version of reconstruction.py that expects an input.txt file)