Upload results
Co-authored-by: Andreas Prodromou <[email protected]>
Co-authored-by: Bruno Ferreira <[email protected]>
Co-authored-by: Guenther Schmuelling <[email protected]>
Co-authored-by: guschmue <[email protected]>
Co-authored-by: Matt Frank <[email protected]>
Co-authored-by: Murali Emani <[email protected]>
Co-authored-by: Nathan Wasson <[email protected]>
Co-authored-by: Noah Nisbet <[email protected]>
Co-authored-by: nvaprodromou <[email protected]>
Co-authored-by: Pablo Gonzalez <[email protected]>
Co-authored-by: Peter Mattson <[email protected]>
Co-authored-by: pgmpablo157321 <[email protected]>
Co-authored-by: Steve Farrell <[email protected]>
11 people committed Nov 7, 2023
0 parents commit 350e46f
Showing 2,492 changed files with 1,003,058 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# These owners will be the default owners for everything in the repo.
# Unless a later match takes precedence, they will be requested for review when someone opens a pull request.
* @mlcommons/wg-hpc
36 changes: 36 additions & 0 deletions .github/workflows/cla.yml
@@ -0,0 +1,36 @@

name: "cla-bot"
on:
  issue_comment:
    types: [created]
  pull_request_target:
    types: [opened,closed,synchronize]

jobs:
  cla-check:
    runs-on: ubuntu-latest
    steps:
      - name: "MLCommons CLA bot check"
        if: (github.event.comment.body == 'recheck') || github.event_name == 'pull_request_target'
        # Alpha Release
        uses: mlcommons/cla-bot@master
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # the below token should have repo scope and must be manually added by you in the repository's secret
          PERSONAL_ACCESS_TOKEN: ${{ secrets.MLCOMMONS_BOT_CLA_TOKEN }}
        with:
          path-to-signatures: 'cla-bot/v1/cla.json'
          # branch should not be protected
          branch: 'main'
          allowlist: user1,bot*
          remote-organization-name: mlcommons
          remote-repository-name: systems

#below are the optional inputs - If the optional inputs are not given, then default values will be taken
#remote-organization-name: enter the remote organization name where the signatures should be stored (Default is storing the signatures in the same repository)
#remote-repository-name: enter the remote repository name where the signatures should be stored (Default is storing the signatures in the same repository)
#create-file-commit-message: 'For example: Creating file for storing CLA Signatures'
#signed-commit-message: 'For example: $contributorName has signed the CLA in #$pullRequestNo'
#custom-notsigned-prcomment: 'pull request comment with Introductory message to ask new contributors to sign'
#custom-pr-sign-comment: 'The signature to be committed in order to sign the CLA'
#custom-allsigned-prcomment: 'pull request comment when all contributors has signed, defaults to **CLA Assistant Lite bot** All Contributors have signed the CLA.'
9 changes: 9 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,9 @@
## Contributing

The best way to contribute to MLCommons is to get involved with one of our many project communities. You can find more information about getting involved with MLCommons [here](https://mlcommons.org/en/get-involved/#getting-started).

Generally we encourage people to become an MLCommons member if they wish to contribute to MLCommons projects, but outside pull requests are very welcome too.

Regardless of whether you are a member, your organization needs to sign the MLCommons CLA. Please fill out this [CLA sign up form](https://forms.gle/Ew1KkBVpyeJDuRw67) to get started.

MLCommons project work is tracked with issue trackers and pull requests. Modify the project in your own fork and issue a pull request once you want other developers to take a look at what you have done and discuss the proposed changes. Ensure that cla-bot and other checks pass for your pull requests.
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# DeepCAM benchmark

See benchmarks/deepcam/implementations/pytorch/README.md for instructions on
acquiring and formatting the input dataset in preparation for running.
64 changes: 64 additions & 0 deletions Clemson/benchmarks/deepcam/implementations/pytorch/Dockerfile
@@ -0,0 +1,64 @@
# The MIT License (MIT)
#
# Copyright (c) 2020-2022 NVIDIA CORPORATION. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of
# this software and associated documentation files (the "Software"), to deal in
# the Software without restriction, including without limitation the rights to
# use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
# the Software, and to permit persons to whom the Software is furnished to do so,
# subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
# FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
# IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

#ARG FROM_IMAGE_NAME=gitlab-master.nvidia.com:5005/dl/dgx/pytorch:22.08-py3-devel
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.08-py3
FROM ${FROM_IMAGE_NAME}

ARG dlfw_version=22.08
ENV DLFW_VERSION=${dlfw_version}

#install mpi4py
RUN pip install h5py mpi4py

#pip install more python modules
RUN pip install wandb

#install mlperf logging
RUN pip install "git+https://github.com/mlperf/logging.git@501bbde47f005d67c6357da6e5c1931eab339f8e"

#install io_helpers
COPY io_helpers /opt/io_helpers
RUN cd /opt/io_helpers && python setup.py install

# create kernel cache dir and point pytorch to it
RUN mkdir -p /opt/pytorch/kernel_cache
ENV PYTORCH_KERNEL_CACHE_PATH /opt/pytorch/kernel_cache

#copy main scripts
COPY src/deepCam /opt/deepCam
COPY src/utils /opt/utils
COPY cleanup.sh /opt/deepCam/cleanup.sh

# worker scripts and files
COPY run_and_time.sh /workspace/run_and_time.sh
COPY run_and_time_multi.sh /workspace/run_and_time_multi.sh
COPY init_datasets.sub /workspace/init_datasets.sub
COPY run.sub /workspace/run.sub
COPY run.slurm /workspace/run.slurm
COPY configs /workspace/configs

#init empty git repo so that wandb works
RUN cd /opt/deepCam && git init

#create additional folders for mapping data in
RUN mkdir -p /data
20 changes: 20 additions & 0 deletions Clemson/benchmarks/deepcam/implementations/pytorch/LICENSE
@@ -0,0 +1,20 @@
The MIT License (MIT)

Copyright (c) 2020-2022 NVIDIA CORPORATION. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
128 changes: 128 additions & 0 deletions Clemson/benchmarks/deepcam/implementations/pytorch/README.md
@@ -0,0 +1,128 @@
# Deep Learning Climate Segmentation Benchmark

PyTorch implementation for the climate segmentation benchmark, based on the
Exascale Deep Learning for Climate Analytics codebase here:
https://github.com/azrael417/ClimDeepLearn, and the paper:
https://arxiv.org/abs/1810.01993

## Dataset

The dataset for this benchmark comes from CAM5 [1] simulations and is hosted at
NERSC. The samples are stored in HDF5 files with input images of shape
(768, 1152, 16) and pixel-level labels of shape (768, 1152). The labels have
three target classes (background, atmospheric river, tropical cyclone) and were
produced with TECA [2].
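
As a rough sizing aid, the shapes above translate into a per-sample memory footprint. The sketch below is only an estimate: the element sizes (4 bytes per input value, 1 byte per label) are assumptions that the dataset description does not state, so check the dataset README for the actual data types.

```python
from math import prod

# Shapes as stated in the dataset description above.
IMAGE_SHAPE = (768, 1152, 16)
LABEL_SHAPE = (768, 1152)

# Assumed element sizes in bytes (hypothetical; verify against the dataset README).
ASSUMED_IMAGE_ITEMSIZE = 4  # e.g. float32
ASSUMED_LABEL_ITEMSIZE = 1  # e.g. uint8, enough for three classes


def sample_bytes(image_itemsize=ASSUMED_IMAGE_ITEMSIZE,
                 label_itemsize=ASSUMED_LABEL_ITEMSIZE):
    """Return (image_bytes, label_bytes) for one uncompressed sample."""
    image_bytes = prod(IMAGE_SHAPE) * image_itemsize
    label_bytes = prod(LABEL_SHAPE) * label_itemsize
    return image_bytes, label_bytes


image_bytes, label_bytes = sample_bytes()
print(f"image: {image_bytes / 2**20:.1f} MiB, label: {label_bytes / 2**20:.2f} MiB")
```

Multiplying the result by the number of files in a split gives a first estimate of the storage needed for an uncompressed copy.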

The current recommended way to get the data is to use Globus and the following
Globus endpoint:

https://app.globus.org/file-manager?origin_id=0b226e2c-4de0-11ea-971a-021304b0cca7&origin_path=%2F

The dataset folder contains a README with some technical description of the
dataset and an All-Hist folder containing all of the data files.

### Preprocessing
The dataset is split into train/validation/test and ships with the `stats.h5` file containing summary statistics.
In order to run the benchmark with the various DALI readers, the dataset has to be converted into NumPy file format.
For this purpose, the script `src/utils/convert_hdf52npy.py` is provided (`/opt/utils/convert_hdf52npy.py` inside the container). This script reads the original HDF5 files and generates NumPy files for data and labels.
The script leverages MPI to perform a distributed conversion of the files and needs to be run for each file directory separately, e.g. for validation and training. Example:

```
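# Paths to the original HDF5 data and to the NumPy output directory
DATA_IN=<hdf5-data-path>
DATA_OUT=<numpy-data-path>
# Number of MPI tasks used for the conversion
NUM_TASKS=1
# Make sure the output directory exists, then copy the summary statistics
mkdir -p ${DATA_OUT}
cp ${DATA_IN}/stats.h5 ${DATA_OUT}/
# Convert the training and validation splits separately
mpirun -np ${NUM_TASKS} python src/utils/convert_hdf52npy.py --input_directory=${DATA_IN}/train --output_directory=${DATA_OUT}/train
mpirun -np ${NUM_TASKS} python src/utils/convert_hdf52npy.py --input_directory=${DATA_IN}/validation --output_directory=${DATA_OUT}/validation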
```

For docker users with slurm job schedulers and pyxis/enroot support we have added the script `init_datasets.sub`, which performs this conversion.
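
How `convert_hdf52npy.py` divides the files among MPI tasks is not described here. A common pattern for this kind of distributed conversion is a round-robin split over a sorted file list, sketched below. The function name `shard_files` and the file names are hypothetical and only illustrate the idea; they are not taken from the script.

```python
def shard_files(files, rank, size):
    """Return the subset of `files` that task `rank` out of `size` tasks would convert.

    Illustrative round-robin split, not the logic of convert_hdf52npy.py.
    """
    if not 0 <= rank < size:
        raise ValueError(f"rank {rank} out of range for {size} tasks")
    # Sort first so every task sees the same ordering, then take every size-th file.
    return sorted(files)[rank::size]


files = [f"data-{i:03d}.h5" for i in range(10)]  # hypothetical file names
for rank in range(3):
    print(rank, shard_files(files, rank, 3))
```

With a split like this, every file is handled by exactly one task, and the per-task workload differs by at most one file.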

## Before you run

A Dockerfile is provided under the `docker` subdirectory. The following instructions assume that you have built the Docker container and that you are using Slurm with pyxis/enroot support. If your system uses a different technology, please modify the commands accordingly. Furthermore, the dataset should have been converted to NumPy data format as described in the section above. Finally, create a file named `config_data.sh` under the configs directory. Inside that file, specify the following environment variable:

```
#!/bin/bash
export DATADIR=<root path to where the converted dataset resides>
```
Make this file executable, i.e. `chmod +x configs/config_data.sh`. If you are using docker, the path specified via the `DATADIR` environment variable will be mounted (read-only) into the container under `/data` from where the benchmark will pick it up.
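
Before submitting a job, it can help to confirm that `DATADIR` points at a converted dataset. The sketch below assumes the layout produced by the conversion example above (`stats.h5` plus `train` and `validation` directories); the helper name `missing_entries` is hypothetical and not part of the benchmark code.

```python
import os
from pathlib import Path

# Entries the conversion example above places in the output directory.
EXPECTED_ENTRIES = ("stats.h5", "train", "validation")


def missing_entries(datadir):
    """Return the expected entries that do not exist under `datadir`."""
    root = Path(datadir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


datadir = os.environ.get("DATADIR")
if datadir is None:
    print("DATADIR is not set; source configs/config_data.sh first")
else:
    print("missing:", missing_entries(datadir) or "nothing")
```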

## How to run the benchmark

The benchmark parameters are steered by environment variables. Please see `run.sub` (`/workspace/run.sub` inside the container) and `run_and_time.sh` (`/opt/deepCam/run_and_time.sh` inside the container) for more information.
In order to submit the benchmark, you need to source the configuration file for the setup you want to run. Those are under `configs` (or under `/workspace/configs` inside the container).
You also need to name your container image. Assuming you named the image `mlperf-deepcam:v1.0`, then you can submit a job as follows:

```
export CONT=mlperf-deepcam:v1.0
source configs/config_DGXA100_16x8x1.sh
sbatch <specify system-dependent additional args here> -N ${DGXNNODES} -t ${WALLTIME} run.sub
```
The parameters `DGXNNODES` and `WALLTIME` are set by the configuration files.
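
For scripted submissions, the same command line can be assembled from the environment. This is a minimal sketch that assumes the variables have already been exported by sourcing a configuration file; the function `build_sbatch_command`, the example values, and the `--partition=batch` argument are illustrative placeholders, not settings from this repository.

```python
import shlex


def build_sbatch_command(env, extra_args=()):
    """Assemble the sbatch call from the variables a sourced config provides."""
    required = ("CONT", "DGXNNODES", "WALLTIME")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return ["sbatch", *extra_args,
            "-N", str(env["DGXNNODES"]),
            "-t", str(env["WALLTIME"]),
            "run.sub"]


# Example values only; real values come from the sourced configuration file.
env = {"CONT": "mlperf-deepcam:v1.0", "DGXNNODES": "2", "WALLTIME": "60"}
print(shlex.join(build_sbatch_command(env, extra_args=["--partition=batch"])))
```

In practice `env` would be `os.environ`, and the returned list can be passed to `subprocess.run`.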

Please note that in order to pass a locally built image to enroot, you need to export it as a `sqsh` file using `enroot import/create` (see the [enroot documentation](https://github.com/NVIDIA/enroot/blob/master/doc/usage.md) for instructions). We recommend using a registry as enroot can pull the image directly from there.


## Hyperparameters

The table below contains the modifiable hyperparameters. Unless otherwise stated, parameters not
listed in the table below are fixed and changing those could lead to an invalid submission.

|Parameter Name |Default | Constraints | Description|
--- | --- | --- | ---
`--optimizer` | `"Adam"` | Optimizer of Adam or LAMB* type. This benchmark implements `"Adam"` and `"AdamW"` from PyTorch as well as `"FusedLAMB"` from NVIDIA APEX. Algorithmic equivalent implementations to those listed before are allowed. | The optimizer to choose
`--start_lr` | 1e-3 | >= 0. | Start learning rate (or base learning rate if warmup is used)
`--optimizer_betas` | `[0.9, 0.999]` | N/A | Momentum terms for Adam-type optimizers
`--weight_decay` | 1e-6 | >= 0. | L2 weight regularization term
`--lr_warmup_steps` | 0 | >= 0 | Number of steps for learning rate warmup
`--lr_warmup_factor` | 1. | >= 1. | When warmup is used, the target learning_rate will be lr_warmup_factor * start_lr
`--lr_schedule` | - | `type="multistep",milestones="<milestone_list>",decay_rate="<value>"` or `type="cosine_annealing",t_max="<value>",eta_min="<value>"` | Specifies the learning rate schedule. Multistep decays the current learning rate by `decay_rate` at every milestone in the list. Note that the milestones are in units of steps, not epochs. Number and value of milestones and the `decay_rate` can be chosen arbitrarily. For a milestone list, please specify it as whitespace-separated values, for example `milestones="5000 10000"`. For cosine annealing, the minimal lr is given by the value of `eta_min` and the period length in number of steps by `t_max`
`--batchnorm_group_size` | 1 | >= 1 | Determines how many ranks participate in the batchnorm. Specifying a value > 1 will replace nn.BatchNorm2d with nn.SyncBatchNorm everywhere in the model. Currently, nn.SyncBatchNorm only supports node-local batch normalization, but using an implementation of the same functionality that spans an arbitrary number of workers is allowed
`--gradient_accumulation_frequency` | 1 | >= 1 | Specifies the number of gradient accumulation steps before a weight update is performed
`--seed` | 333 | > 0 | Random number generator seed. Multiple submissions which employ the same seed are discouraged. Please specify a seed depending on system clock or similar.

*The LAMB optimizer has additional hyperparameters such as the global grad clipping norm value. For the purpose of this benchmark, consider all LAMB-specific parameters fixed. The defaults are specified in the [NVIDIA APEX documentation for FusedLAMB](https://nvidia.github.io/apex/_modules/apex/optimizers/fused_lamb.html).
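
To make the `--lr_schedule` format and the interplay with warmup concrete, here is a minimal sketch of the multistep case. It rests on several assumptions that the table does not confirm: `decay_rate` is treated as a multiplicative factor (as `gamma` in PyTorch's `MultiStepLR`), the warmup is a linear ramp from `start_lr` to `lr_warmup_factor * start_lr`, and milestones are counted from step 0 including warmup steps. The function names are hypothetical and this is not the benchmark's own scheduler code.

```python
import re
from bisect import bisect_right


def parse_lr_schedule(spec):
    """Parse 'key="value",key="value"' into a dict of strings."""
    return dict(re.findall(r'(\w+)="([^"]*)"', spec))


def multistep_lr(step, start_lr, milestones, decay_rate,
                 warmup_steps=0, warmup_factor=1.0):
    """Learning rate at `step` for a multistep schedule with optional warmup."""
    target_lr = warmup_factor * start_lr
    if warmup_steps > 0 and step < warmup_steps:
        # Assumed linear ramp from start_lr to the warmup target.
        return start_lr + (target_lr - start_lr) * step / warmup_steps
    # Count the milestones already reached and decay once per milestone.
    passed = bisect_right(sorted(milestones), step)
    return target_lr * decay_rate ** passed


spec = parse_lr_schedule('type="multistep",milestones="5000 10000",decay_rate="0.1"')
milestones = [int(m) for m in spec["milestones"].split()]
decay_rate = float(spec["decay_rate"])
for step in (0, 4999, 5000, 10000):
    print(step, multistep_lr(step, 1e-3, milestones, decay_rate))
```

The whitespace-separated milestone list is split only after the outer key/value pairs are parsed, which is why the quotes around each value matter.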

Note that the command line arguments do not directly correspond to logging entries. For compliance checking of output logs, use the table below:

|Key| Constraints | Required |
--- | --- | ---
`seed` | `x > 0` | `True`
`global_batch_size` | `x > 0` | `True`
`num_workers` | `x > 0` | `True`
`batchnorm_group_size` | `x > 1` | `False`
`gradient_accumulation_frequency` | `x >= 1` | `True`
`opt_name` | `x in ["Adam", "AdamW", "LAMB", "MixedPrecisionLAMB", "DistributedLAMB"]` | `True`
`opt_lr` | `x >= 0.` | `True`
`opt_betas` | unconstrained | `True`
`opt_eps` | `x == 1e-6` | `True`
`opt_weight_decay` | `x >= 0.` | `True`
`opt_bias_correction` | `x == True` | `True if optimizer_name == "LAMB" else False`
`opt_grad_averaging` | `x == True` | `True if optimizer_name == "LAMB" else False`
`opt_max_grad_norm` | `x == 1.0` | `True if optimizer_name == "LAMB" else False`
`scheduler_type` | `x in ["multistep", "cosine_annealing"]` | `True`
`scheduler_milestones` | unconstrained | `True if scheduler_type == "multistep" else False`
`scheduler_decay_rate` | `x >= 1.` | `True if scheduler_type == "multistep" else False`
`scheduler_t_max` | `x >= 0` | `True if scheduler_type == "cosine_annealing" else False`
`scheduler_eta_min` | `x >= 0.` | `True if scheduler_type == "cosine_annealing" else False`
`scheduler_lr_warmup_steps` | `x >= 0` | `False`
`scheduler_lr_warmup_factor` | `x >= 1.` | `True if scheduler_lr_warmup_steps > 0 else False`

The first column lists the keys as they would appear in the logfile. The second column lists the parameter constraints as an expression in the parameter variable x. Those can be used to generate lambda expressions in Python. The third column states whether the corresponding entry has to be in the log file. Since there are multiple optimizers and learning rate schedules to choose from, not all parameters need to be logged for a given run. This is expressed by conditional expressions in that column.
**Please note that besides the benchmark specific rules above, standard MLPerf HPC logging rules apply.**
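
The constraint and requirement columns can be turned into Python lambdas along the following lines. This is a minimal sketch covering only a subset of the keys. It assumes the log has already been parsed into a plain dict of key/value pairs and that `optimizer_name` in the table refers to the logged `opt_name` key. The names `RULES` and `check_log` are hypothetical and not part of any official compliance checker.

```python
# key -> (constraint on the value x, predicate on the whole log deciding if the key is required)
RULES = {
    "seed": (lambda x: x > 0, lambda log: True),
    "global_batch_size": (lambda x: x > 0, lambda log: True),
    "opt_name": (lambda x: x in ["Adam", "AdamW", "LAMB", "MixedPrecisionLAMB", "DistributedLAMB"],
                 lambda log: True),
    "opt_eps": (lambda x: x == 1e-6, lambda log: True),
    "opt_max_grad_norm": (lambda x: x == 1.0, lambda log: log.get("opt_name") == "LAMB"),
    "scheduler_type": (lambda x: x in ["multistep", "cosine_annealing"], lambda log: True),
    "scheduler_milestones": (lambda x: True,  # unconstrained
                             lambda log: log.get("scheduler_type") == "multistep"),
    "scheduler_t_max": (lambda x: x >= 0,
                        lambda log: log.get("scheduler_type") == "cosine_annealing"),
}


def check_log(log):
    """Return a list of human-readable violations; an empty list means the subset passes."""
    errors = []
    for key, (constraint, required) in RULES.items():
        if key not in log:
            if required(log):
                errors.append(f"missing required key: {key}")
        elif not constraint(log[key]):
            errors.append(f"constraint violated for {key}: {log[key]!r}")
    return errors


example = {"seed": 42, "global_batch_size": 128, "opt_name": "Adam", "opt_eps": 1e-6,
           "scheduler_type": "multistep", "scheduler_milestones": [5000, 10000]}
print(check_log(example))
```

Keeping the requirement as a predicate on the whole log lets the conditional entries (for example the LAMB-only keys) be expressed in the same table as the unconditional ones.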

### Using Docker

The implementation comes with a Dockerfile optimized for NVIDIA workstations but usable on
other NVIDIA multi-GPU systems. Use the Dockerfile
`docker/Dockerfile.train` to build the container and the script `src/deepCam/run_scripts/run_training.sh`
for training. The `data_dir` variable should point to the full path of the `All-Hist` directory containing the downloaded dataset.

## References

1. Wehner, M. F., Reed, K. A., Li, F., Bacmeister, J., Chen, C.-T., Paciorek, C., Gleckler, P. J., Sperber, K. R., Collins, W. D., Gettelman, A., et al.: The effect of horizontal resolution on simulation quality in the Community Atmospheric Model, CAM5.1, Journal of Advances in Modeling Earth Systems, 6, 980-997, 2014.
2. Prabhat, Byna, S., Vishwanath, V., Dart, E., Wehner, M., Collins, W. D., et al.: TECA: Petascale pattern recognition for climate science, in: International Conference on Computer Analysis of Images and Patterns, pp. 426-436, Springer, 2015.
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# JOB_SHAPE PyTorch DeepCAM

DeepCAM SCALING_TYPE-scaling closed-division submission on NUM_NODES nodes x NUM_GPU GPUs with batch size GLOBAL_BATCH_SIZE.

To run:

```
export CONT=mlperf-deepcam:v2.0
source configs/CONFIG_FILE
sbatch -N $DGXNNODES -t $WALLTIME run.sub
```