add GCP workload observability feature #1167
metrics, new_time - last_step_completion, per_device_tflops, learning_rate_schedule(step), per_device_tokens
)
last_step_completion = new_time
step_time_delta = datetime.datetime.now() - last_step_completion
Changes like this need to be handled with care: JAX dispatches asynchronously by default, so functions return immediately rather than blocking until the computation finishes. In this case the results before and after this PR are roughly the same, as confirmed by this diff: https://diff.googleplex.com/#key=TLvLsJisjPpi (LHS main, RHS this PR).
The call that actually blocks is the one that checkpoints or prints an array. Here that is write_metrics, which comes below (not record_scalar_metrics or p_train_step). This is fine because write_metrics runs after last_step_completion is set, so step_time_delta has to wait a real train_step's worth of time.
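To illustrate the point about asynchronous dispatch, here is a minimal sketch, not the MaxText training loop. `fake_train_step` and `time_step` are hypothetical names made up for this example; the only JAX behaviour relied on is that a jitted call returns before the result is ready and that `block_until_ready()` waits for it.

```python
import time

import jax
import jax.numpy as jnp


@jax.jit
def fake_train_step(x):
    # Hypothetical stand-in for p_train_step: a few chained matmuls.
    for _ in range(10):
        x = jnp.tanh(x @ x)
    return x


def time_step(x, block):
    """Time one call, optionally waiting for the device to finish."""
    start = time.perf_counter()
    out = fake_train_step(x)
    if block:
        # Without this, the timer stops as soon as the work is dispatched.
        out.block_until_ready()
    elapsed = time.perf_counter() - start
    return out, elapsed


x = jnp.ones((512, 512)) / 512.0
fake_train_step(x).block_until_ready()  # warm-up so compile time is excluded

_, dispatch_only = time_step(x, block=False)
out, full_step = time_step(x, block=True)
print(f"dispatch only: {dispatch_only:.6f}s, full step: {full_step:.6f}s")
```

The unblocked measurement only covers dispatch, so it can look much shorter than the step really is. A timer placed between steps is meaningful only if something later in the loop (in this PR, write_metrics) forces the previous step's result to materialise.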
Awesome, thank you for the documentation!
Description
Add an option to enable GCP workload monitoring for MaxText workloads.
MaxText/configs/base.yml
getting_started/GCP_Workload_Monitoring.md
Tests
Tested on a Trillium TPU and confirmed that metrics are sent to cloud monarch successfully when the configs are enabled. No metrics are sent to cloud monarch when the configs are set to False.
Checklist
Before submitting this PR, please make sure (put X in square brackets):