[GSoC] Add e2e test for tune api with LLM hyperparameter optimization #2420
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/area gsoc
Ref: #2339
@helenxie-bit: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle stale
@helenxie-bit, sorry I didn't get a chance to review this PR.
@andreyvelich No worries, this test is still in progress because we need to merge the bug fix of
We should release Katib 0.18-rc.0 this week, but we can cherry-pick the bug fixes on RC.1 as well.
This PR is ready for review. Please have a look when you have time :) /cc @kubeflow/wg-automl-leads @Electronic-Waste @mahdikhashan
@helenxie-bit: GitHub didn't allow me to request PR reviews from the following users: mahdikhashan. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thank you for doing this @helenxie-bit!
Just small comments.
logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name)) | ||
|
||
# Use the test case from fine-tuning API tutorial. | ||
# https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/ |
Should we link an updated guide for Katib LLM Optimization?
Since the Katib LLM Optimization guide is still under review, should I link to the file in its current state for now?
Additionally, the example in the Katib LLM Optimization guide uses a different model and dataset compared to this one. The guide uses the LLaMa model, which requires access tokens. I’ve already applied for the access token and am awaiting approval. Once I receive it, I will test the example to see if it works.
I tried running the above example, but I ran into some unexpected errors in the storage_initializer container, and the model couldn't be downloaded successfully. It seems like the model used in this example might require different versions of transformers or other libraries. I'll look into it, but it might take some time to resolve.
If we aim to include this in Katib 0.18-rc.0 this week, we might need to stick with the current example. Otherwise, I’ll work on fixing it before RC.1.
I think it is fine to include it in RC.1 since it is a bug fix.
We can keep the URL for the Kubeflow Training docs for now.
I started reviewing this PR.
I couldn't run the external test on a Mac M1 with 8 GB RAM, using a K8s cluster created with K3d.
INFO:root:---------------------------------------------------------------
INFO:root:E2E is failed for Experiment created by tune: default/tune-example-2
INFO:root:---------------------------------------------------------------
INFO:root:---------------------------------------------------------------
DEBUG:kubernetes.client.rest:response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"experiments.kubeflow.org \"tune-example-2\" not found","reason":"NotFound","details":{"name":"tune-example-2","group":"kubeflow.org","kind":"experiments"},"code":404}
Traceback (most recent call last):
File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 183, in <module>
raise e
File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 177, in <module>
run_e2e_experiment_create_by_tune_with_llm_optimization(katib_client, exp_name_llm_optimization, exp_namespace)
File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 81, in run_e2e_experiment_create_by_tune_with_llm_optimization
katib_client.tune(
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 605, in tune
lora_config = utils.get_trial_substitutions_from_trainer(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/utils/utils.py", line 213, in get_trial_substitutions_from_trainer
parameters = json.dumps(parameters.__dict__, cls=SetEncoder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/__init__.py", line 238, in dumps
**kw).encode(obj)
^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 200, in encode
chunks = self.iterencode(o, _one_shot=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 258, in iterencode
return _iterencode(o, 0)
^^^^^^^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/utils/utils.py", line 143, in default
return json.JSONEncoder.default(self, obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 180, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type LoraRuntimeConfig is not JSON serializable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 1223, in delete_experiment
self.custom_api.delete_namespaced_custom_object(
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api/custom_objects_api.py", line 911, in delete_namespaced_custom_object
return self.delete_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs) # noqa: E501
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api/custom_objects_api.py", line 1038, in delete_namespaced_custom_object_with_http_info
return self.api_client.call_api(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
^^^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 415, in request
return self.rest_client.DELETE(url,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/rest.py", line 270, in DELETE
return self.request("DELETE", url,
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/rest.py", line 238, in request
raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '2b6ee8e1-d8e8-4ec1-9fe4-bcea39264f1a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '7a07626f-55e1-4b48-b5f1-87b4cd8b517f', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a02fdbf6-92c2-4dc6-b5f8-416b965fc7f7', 'Date': 'Wed, 29 Jan 2025 12:14:45 GMT', 'Content-Length': '246'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"experiments.kubeflow.org \"tune-example-2\" not found","reason":"NotFound","details":{"name":"tune-example-2","group":"kubeflow.org","kind":"experiments"},"code":404}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 188, in <module>
katib_client.delete_experiment(exp_name_llm_optimization, exp_namespace)
File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 1236, in delete_experiment
raise RuntimeError(f"Failed to delete Katib Experiment: {namespace}/{name}")
RuntimeError: Failed to delete Katib Experiment: default/tune-example-2
NAME READY STATUS RESTARTS AGE
katib-controller-754877f9f-zvscj 1/1 Running 0 20m
katib-db-manager-64d9c694dd-m9k4h 1/1 Running 0 20m
katib-mysql-74f9795f8b-6h55q 1/1 Running 0 20m
katib-ui-65698b4896-glq9p 1/1 Running 0 20m
training-operator-7dc56b6448-28r69 1/1 Running 0 22m
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)
I would suggest importing each e2e test's specific requirements inside its function, for example:
# Test for Experiment created with external models and datasets.
def run_e2e_experiment_create_by_tune_with_llm_optimization(
    katib_client: KatibClient,
    exp_name: str,
    exp_namespace: str,
):
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceDatasetParams,
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )
    import transformers
    from peft import LoraConfig

    # Create Katib Experiment and wait until it is finished.
    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))
This way, the scope of each test is better contained. WDYT?
logging.info("---------------------------------------------------------------") | ||
katib_client.delete_experiment(exp_name_custom_objective, exp_namespace) | ||
|
||
try: |
I would suggest iterating over a data structure, the way the unit tests do, for example:
test_tune_data = [
    (
        "tune_with_custom_objective",
        run_e2e_experiment_create_by_tune_with_custom_objective,
    ),
]
WDYT?
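As a rough sketch of how the main script could consume such a structure (this extends the list above with the second test, reuses the first tuple element as the Experiment name, and assumes the script's existing `katib_client` and `exp_namespace` setup; the names and error handling are illustrative, not the PR's actual code):

```python
test_tune_data = [
    ("tune-with-custom-objective", run_e2e_experiment_create_by_tune_with_custom_objective),
    ("tune-with-llm-optimization", run_e2e_experiment_create_by_tune_with_llm_optimization),
]

for exp_name, run_test in test_tune_data:
    try:
        # Each test creates its own Experiment and waits until it succeeds.
        run_test(katib_client, exp_name, exp_namespace)
        logging.info("E2E passed for Experiment created by tune: {}/{}".format(exp_namespace, exp_name))
    except Exception as e:
        logging.info("E2E failed for Experiment created by tune: {}/{}".format(exp_namespace, exp_name))
        raise e
    finally:
        # Clean up the Experiment regardless of the outcome.
        katib_client.delete_experiment(exp_name, exp_namespace)
```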
@@ -79,18 +156,33 @@ def objective(parameters):
client.CoreV1Api().patch_namespace(args.namespace, {'metadata': {'labels': namespace_labels}})

# Test with run_e2e_experiment_create_by_tune
exp_name = "tune-example"
exp_name_custom_objective = "tune-example-1"
exp_name_llm_optimization = "tune-example-2"
I would suggest a more meaningful name for each test. While I was looking at the test results, it was not easy for me to figure out the difference between tune-example-1 and tune-example-2. How about tune-for-an-objective-function and tune-for-external-model? WDYT? (Feel free to offer better names; these were spontaneous ideas.)
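For instance, a tiny sketch of what the rename could look like in the run script (variable names are taken from the diff above; the new Experiment names are just the placeholders suggested here):

```python
# More descriptive Experiment names than "tune-example-1" / "tune-example-2".
exp_name_custom_objective = "tune-for-an-objective-function"
exp_name_llm_optimization = "tune-for-external-model"
```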
# Print the Experiment and Suggestion.
logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
I would suggest using a prettifier to format the result on test success or failure here, for example pprint. WDYT?
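A minimal sketch of that idea using the standard-library pprint module (the call sites mirror the lines quoted above and assume the script's existing katib_client, exp_name, and exp_namespace):

```python
import logging
from pprint import pformat

# Pretty-print the Experiment and Suggestion instead of logging their raw repr.
logging.debug(pformat(katib_client.get_experiment(exp_name, exp_namespace)))
logging.debug(pformat(katib_client.get_suggestion(exp_name, exp_namespace)))
```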
It seems that the reason for the test failure on my machine is TypeError: Object of type LoraRuntimeConfig is not JSON serializable; my Python version is
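For context, the error comes from json.dumps rejecting the nested LoRA config object inside the trainer parameters. One possible direction (an assumption on my side, not the project's actual fix) would be to extend the SDK's SetEncoder, named in the traceback above, with a fallback for such objects:

```python
import json


class SetEncoder(json.JSONEncoder):
    """Sketch of an encoder that handles sets and nested config objects."""

    def default(self, obj):
        if isinstance(obj, set):
            return list(obj)
        if hasattr(obj, "__dict__"):
            # Fall back to the attribute dict for nested objects such as
            # peft's LoraRuntimeConfig, which json cannot serialize directly.
            return obj.__dict__
        return super().default(obj)


# Example: json.dumps(trainer_params.__dict__, cls=SetEncoder)
```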
Hi @helenxie-bit, we have time until this Wednesday to merge this PR before we cut Katib RC.0.
What this PR does / why we need it:
This PR adds an e2e test for the tune API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist: