Creating Ray Workloads
-----------------------------

.. include:: ../_prerequisite.rst


**Write your training code:**

Here is a sample of Python code (performing a grid search) that leverages scikit-learn
and Ray to distribute the workload:

.. code-block:: python

    import ray
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    import os
    import argparse

    # Connect to the Ray cluster that is already running
    ray.init(address='auto', ignore_reinit_error=True)

    @ray.remote
    def gridsearch(args):
        # Each remote task runs an exhaustive search over the SVC
        # hyperparameter grid; X and y are the module-level dataset
        # created below.
        grid = GridSearchCV(
            SVC(gamma="auto", random_state=0, probability=True),
            param_grid={
                "C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
                "kernel": ["rbf", "poly", "sigmoid"],
                "shrinking": [True, False],
            },
            return_train_score=False,
            cv=args.cv,
            n_jobs=-1,
        ).fit(X, y)
        return grid.best_params_

    default_n_samples = int(os.getenv("DEFAULT_N_SAMPLES", "1000"))

    parser = argparse.ArgumentParser()
    parser.add_argument("--n_samples", default=default_n_samples, type=int, help="size of dataset")
    parser.add_argument("--cv", default=3, type=int, help="number of cross validations")
    args, unknownargs = parser.parse_known_args()

    # Generate the synthetic classification dataset used by every task
    X, y = make_classification(n_samples=args.n_samples, random_state=42)

    # Launch five grid-search tasks in parallel across the cluster
    refs = []
    for i in range(0, 5):
        refs.append(gridsearch.remote(args))

    # Block until each task finishes and collect its best parameters
    best_params = []
    for ref in refs:
        best_params.append(ray.get(ref))

    print(best_params)

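Optionally, you can smoke-test the script locally before containerizing it. Because the script calls ``ray.init(address='auto')``, it expects a running Ray cluster. The commands below are a quick sketch, assuming the ``ray`` CLI from your local Python environment: they start a single-node cluster, run the script, and tear the cluster down.

.. code-block:: bash

    ray start --head                               # start a local one-node Ray cluster
    python gridsearch.py --n_samples 1000 --cv 3   # flags match the script's argparse options
    ray stop                                       # shut the local cluster down
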
**Initialize a distributed-training folder:**

At this point you have created a training file (or files), ``gridsearch.py`` from the above
example. Now, run the command below:

.. code-block:: bash

    ads opctl distributed-training init --framework ray --version v1

This will download the ``ray`` framework artifacts and place them inside the ``oci_dist_training_artifacts`` folder.

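To sanity-check the result, you can list the generated folder; the exact contents may vary by version, but the ``Dockerfile`` and ``environments.yaml`` referenced in the next steps should be present:

.. code-block:: bash

    ls oci_dist_training_artifacts/ray/v1
    # expect to see, among other files: Dockerfile  environments.yaml
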
**Note**: Whenever you change the code, you have to build, tag, and push the image to the repository. This is done automatically by the ``ads opctl run`` CLI command.

**Containerize your code and build the container:**

The required Python dependencies are provided inside the conda environment file ``oci_dist_training_artifacts/ray/v1/environments.yaml``. If your code requires additional dependencies, update this file.

While updating ``environments.yaml``, do not remove the existing libraries; only append to the list.

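For example, to make an additional package such as ``xgboost`` available inside the container, you could append it to the dependency list. The snippet below is only a sketch of a typical conda environment file, not the exact contents of the generated one; keep everything that is already there and add your entries:

.. code-block:: yaml

    # sketch only: append to the existing dependencies, do not remove entries
    dependencies:
      - pip:
          - xgboost   # hypothetical extra dependency
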
Update the ``TAG`` and the ``IMAGE_NAME`` as per your needs:

.. code-block:: bash

    export IMAGE_NAME=<region.ocir.io/my-tenancy/image-name>
    export TAG=latest
    export MOUNT_FOLDER_PATH=.

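For example (illustrative values only; substitute your own region key, tenancy namespace, and repository name):

.. code-block:: bash

    export IMAGE_NAME=iad.ocir.io/my-tenancy/ray-gridsearch
    export TAG=1.0
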
Build the container image.

.. code-block:: bash

    ads opctl distributed-training build-image \
      -t $TAG \
      -reg $IMAGE_NAME \
      -df oci_dist_training_artifacts/ray/v1/Dockerfile

The code is assumed to be in the current working directory. To override the source code directory, use the ``-s`` flag and specify the code directory. This folder must be within the current working directory.

.. code-block:: bash

    ads opctl distributed-training build-image \
      -t $TAG \
      -reg $IMAGE_NAME \
      -df oci_dist_training_artifacts/ray/v1/Dockerfile \
      -s $MOUNT_FOLDER_PATH

If you are behind a proxy, ``ads opctl`` will automatically use your proxy settings (defined via the ``no_proxy``, ``http_proxy``, and ``https_proxy`` environment variables).

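For example, a typical proxy setup picked up from the shell environment might look like this (host and port are placeholders):

.. code-block:: bash

    export http_proxy=http://proxy.example.com:80
    export https_proxy=http://proxy.example.com:80
    export no_proxy=localhost,127.0.0.1
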
**Define your workload yaml:**

The ``yaml`` file is a declarative way to express the workload.
In this example, we bring up one main (chief) node and two worker nodes.
The training code to run is ``gridsearch.py``.
All your training code is assumed to be present inside the ``/code`` directory within the container.
Additionally, you can also put any data files inside the same directory
(and pass the location, for example ``/code/data/**``, as an argument to your training script using ``runtime -> spec -> args``).
This particular configuration will run with three nodes in total.

.. code-block:: yaml

    # Example train.yaml for defining a ray cluster
    kind: distributed
    apiVersion: v1.0
    spec:
      infrastructure:
        kind: infrastructure
        type: dataScienceJob
        apiVersion: v1.0
        spec:
          projectId: oci.xxxx.<project_ocid>
          compartmentId: oci.xxxx.<compartment_ocid>
          displayName: my_distributed_training
          logGroupId: oci.xxxx.<log_group_ocid>
          logId: oci.xxx.<log_ocid>
          subnetId: oci.xxxx.<subnet-ocid>
          shapeName: VM.Standard2.4
          blockStorageSize: 50
      cluster:
        kind: ray
        apiVersion: v1.0
        spec:
          image: "@image"
          workDir: "oci://my-bucket@my-namespace/rayexample/001"
          name: GridSearch Ray
          main:
            config:
          worker:
            config:
            replicas: 2
      runtime:
        kind: python
        apiVersion: v1.0
        spec:
          entryPoint: "gridsearch.py"
          kwargs: "--cv 5"
          env:
            - name: DEFAULT_N_SAMPLES
              value: 5000

**Note**: Make sure that the ``workDir`` points to your OCI Object Storage bucket.

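If you want to confirm the bucket is reachable before submitting the workload, one option is to list it with the OCI CLI (the bucket name and prefix below are placeholders matching the example ``workDir``):

.. code-block:: bash

    oci os object list --bucket-name my-bucket --prefix rayexample/001
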
For flex shapes, use the following in the ``train.yaml`` file (under ``infrastructure -> spec``):

.. code-block:: yaml

    shapeConfigDetails:
      memoryInGBs: 22
      ocpus: 2
    shapeName: VM.Standard.E3.Flex

**Use ads opctl to create the cluster infrastructure and run the workload:**

Do a dry run to inspect how the yaml translates to Jobs and Job Runs:

.. code-block:: bash

    ads opctl run -f train.yaml --dry-run

.. include:: ../_test_and_submit.rst

**Monitoring the workload logs**

To view the logs from a job run, you can run:

.. code-block:: bash

    ads opctl watch oci.xxxx.<job_run_ocid>

You can stream the logs from any of the job runs by passing its OCID to the ``ads opctl watch`` command. You can run this command from multiple terminals to watch all of the job runs. Typically, watching the ``mainJobRunId`` yields the most informative logs.