
Commit a85cc97

Merge branch 'main' into feature/forecast-auto-select
2 parents 2dd71e2 + 9b52ad7

File tree

3 files changed (+213, -3 lines)
Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
Creating Ray Workloads
-----------------------------

.. include:: ../_prerequisite.rst


**Write your training code:**

Here is a sample of Python code (performing Grid Search) leveraging Scikit-Learn
and Ray to distribute the workload:

.. code-block:: python

    import ray
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    import os
    import argparse

    ray.init(address='auto', ignore_reinit_error=True)

    @ray.remote
    def gridsearch(args):
        grid = GridSearchCV(
            SVC(gamma="auto", random_state=0, probability=True),
            param_grid={
                "C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
                "kernel": ["rbf", "poly", "sigmoid"],
                "shrinking": [True, False],
            },
            return_train_score=False,
            cv=args.cv,
            n_jobs=-1,
        ).fit(X, y)
        return grid.best_params_

    default_n_samples = int(os.getenv("DEFAULT_N_SAMPLES", "1000"))

    parser = argparse.ArgumentParser()
    parser.add_argument("--n_samples", default=default_n_samples, type=int, help="size of dataset")
    parser.add_argument("--cv", default=3, type=int, help="number of cross validations")
    args, unknownargs = parser.parse_known_args()

    # Using environment variable to fetch the SCHEDULER_IP is important

    X, y = make_classification(n_samples=args.n_samples, random_state=42)

    refs = []
    for i in range(0, 5):
        refs.append(gridsearch.remote(args))

    best_params = []
    for ref in refs:
        best_params.append(ray.get(ref))

    print(best_params)

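
The key pattern above is Ray's task API: decorate a function with ``@ray.remote``,
launch copies of it with ``.remote()`` (which returns object references immediately),
and collect the results with ``ray.get()``. As a minimal, self-contained sketch of that
pattern (run against a local Ray instance rather than the cluster):

.. code-block:: python

    import ray

    # Start (or connect to) a local Ray instance; the training script above
    # instead uses ray.init(address='auto') to join the cluster.
    ray.init(ignore_reinit_error=True)

    @ray.remote
    def square(x):
        # Work executed by a Ray worker process.
        return x * x

    # .remote() schedules the tasks and returns object references without blocking.
    refs = [square.remote(i) for i in range(5)]

    # ray.get() blocks until the results are available.
    print(ray.get(refs))  # [0, 1, 4, 9, 16]
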
**Initialize a distributed-training folder:**

At this point you have created a training file (or files) - ``gridsearch.py`` from the above
example. Now, run the command below.

.. code-block:: bash

    ads opctl distributed-training init --framework ray --version v1


This will download the ``ray`` framework and place it inside the ``oci_dist_training_artifacts`` folder.

**Note**: Whenever you change the code, you have to build, tag, and push the image to the repository. This is done automatically by the ``ads opctl run`` CLI command.

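
The generated folder contains, among other artifacts, the ``Dockerfile`` and the conda
environment file used in the next step. The listing below is illustrative only; the exact
contents depend on the framework version:

.. code-block:: bash

    # Illustrative only -- the exact contents may differ between versions.
    ls oci_dist_training_artifacts/ray/v1/
    # Dockerfile  environments.yaml  ...
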
**Containerize your code and build container:**

The required Python dependencies are provided inside the conda environment file ``oci_dist_training_artifacts/ray/v1/environments.yaml``. If your code requires additional dependencies, update this file.

Also, while updating ``environments.yaml``, do not remove the existing libraries. You can append to the list.

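
For example, appending an extra dependency could look like the hypothetical fragment below;
the actual ``environments.yaml`` shipped with the artifacts has its own entries, which should
be left in place:

.. code-block:: yaml

    # Hypothetical fragment -- keep the entries already present in the file.
    dependencies:
      - pip
      - pip:
          - scikit-learn
          - my-extra-package==1.2.3   # your additional dependency
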
Update the ``TAG`` and the ``IMAGE_NAME`` as per your needs:

.. code-block:: bash

    export IMAGE_NAME=<region.ocir.io/my-tenancy/image-name>
    export TAG=latest
    export MOUNT_FOLDER_PATH=.

Build the container image.

.. code-block:: bash

    ads opctl distributed-training build-image \
        -t $TAG \
        -reg $IMAGE_NAME \
        -df oci_dist_training_artifacts/ray/v1/Dockerfile

The code is assumed to be in the current working directory. To override the source code directory, use the ``-s`` flag and specify the code directory. This folder should be within the current working directory.

.. code-block:: bash

    ads opctl distributed-training build-image \
        -t $TAG \
        -reg $IMAGE_NAME \
        -df oci_dist_training_artifacts/ray/v1/Dockerfile \
        -s $MOUNT_FOLDER_PATH

If you are behind a proxy, ``ads opctl`` will automatically use your proxy settings (defined via the ``no_proxy``, ``http_proxy`` and ``https_proxy`` environment variables).

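
For example, proxy settings are typically exported in the shell before running the build (illustrative values only; substitute your own proxy endpoints):

.. code-block:: bash

    # Illustrative values only -- substitute your organization's proxy endpoints.
    export http_proxy=http://proxy.example.com:80
    export https_proxy=http://proxy.example.com:80
    export no_proxy=localhost,127.0.0.1
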
**Define your workload yaml:**

The ``yaml`` file is a declarative way to express the workload.
In this example, we bring up one main (chief) node plus the worker nodes requested under the
worker config (``replicas: 2`` below).
The training code to run is ``gridsearch.py`` (the script created above).
All your training code is assumed to be present inside the ``/code`` directory within the container.
Additionally, you can also put any data files inside the same directory
(and pass the location, for example ``/code/data/**``, as an argument to your training script using ``runtime -> spec -> args``).

.. code-block:: yaml

    # Example train.yaml for defining ray cluster
    kind: distributed
    apiVersion: v1.0
    spec:
      infrastructure:
        kind: infrastructure
        type: dataScienceJob
        apiVersion: v1.0
        spec:
          projectId: oci.xxxx.<project_ocid>
          compartmentId: oci.xxxx.<compartment_ocid>
          displayName: my_distributed_training
          logGroupId: oci.xxxx.<log_group_ocid>
          logId: oci.xxx.<log_ocid>
          subnetId: oci.xxxx.<subnet-ocid>
          shapeName: VM.Standard2.4
          blockStorageSize: 50
      cluster:
        kind: ray
        apiVersion: v1.0
        spec:
          image: "@image"
          workDir: "oci://my-bucket@my-namespace/rayexample/001"
          name: GridSearch Ray
          main:
            config:
          worker:
            config:
              replicas: 2
      runtime:
        kind: python
        apiVersion: v1.0
        spec:
          entryPoint: "gridsearch.py"
          kwargs: "--cv 5"
          env:
            - name: DEFAULT_N_SAMPLES
              value: 5000

**Note**: make sure that the ``workDir`` points to your object storage
bucket on OCI.

For flex shapes, use the following in the ``train.yaml`` file (these keys go under ``infrastructure -> spec``):

.. code:: yaml

    shapeConfigDetails:
      memoryInGBs: 22
      ocpus: 2
    shapeName: VM.Standard.E3.Flex


**Use ads opctl to create the cluster infrastructure and run the workload:**

Do a dry run to inspect how the yaml translates to the Job and Job Runs:

.. code-block:: bash

    ads opctl run -f train.yaml --dry-run

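
Once the dry-run output looks correct, the same command without the ``--dry-run`` flag submits the workload; the section included below walks through testing and submission in more detail.

.. code-block:: bash

    ads opctl run -f train.yaml
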
.. include:: ../_test_and_submit.rst

**Monitoring the workload logs**

To view the logs from a job run, you can run:

.. code-block:: bash

    ads opctl watch oci.xxxx.<job_run_ocid>

You can stream the logs from any of the job run OCIDs using the ``ads opctl watch`` command. You can run this command from multiple terminals to watch all of the job runs. Typically, watching the ``mainJobRunId`` yields the most informative logs.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
====
Ray
====


Ray is a framework for distributed computing in Python, specialized for ML workloads.
This documentation shows how to create a container and a ``yaml`` spec to run a ``Ray``
code sample in a distributed fashion.

``Ray`` provides a core package for executing Python workloads in a distributed manner,
potentially across a cluster of machines (set up through ``Ray`` itself), as well as
extensions for more traditional ML computation, such as hyperparameter optimization.


.. toctree::
  :maxdepth: 3

  creating

pyproject.toml

Lines changed: 3 additions & 3 deletions
@@ -151,13 +151,13 @@ forecast = [
     "statsmodels",
     "plotly",
     "oracledb",
-    "report-creator",
+    "report-creator==1.0.9",
 ]
 anomaly = [
     "oracle_ads[opctl]",
     "autots",
     "oracledb",
-    "report-creator",
+    "report-creator==1.0.9",
 ]
 feature-store-marketplace = [
     "oracle-ads[opctl]",
@@ -173,7 +173,7 @@ pii = [
     "scrubadub_spacy",
     "spacy-transformers==1.2.5",
     "spacy==3.6.1",
-    "report-creator",
+    "report-creator==1.0.9",
 ]
 llm = ["langchain-community<0.0.32", "langchain>=0.1.10,<0.1.14", "evaluate>=0.4.0"]
 aqua = ["jupyter_server"]
