[GSoC] Add e2e test for `tune` api with LLM hyperparameter optimization #2420

helenxie-bit · 2024-09-03T13:17:38Z

What this PR does / why we need it:
This PR adds an e2e test for the tune API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

Docs included if any changes are user facing

google-oss-prow · 2024-09-03T13:17:43Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

helenxie-bit · 2024-09-03T13:21:23Z

/area gsoc

helenxie-bit · 2024-09-03T13:21:49Z

Ref: #2339

google-oss-prow · 2025-01-20T10:29:27Z

@helenxie-bit: The label(s) /remove-label lifecycle/stale cannot be applied. These labels are supported: tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, lifecycle/needs-triage. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/remove-label lifecycle/stale

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

helenxie-bit · 2025-01-20T10:31:32Z

/remove-lifecycle stale

andreyvelich · 2025-01-20T12:46:05Z

@helenxie-bit, sorry I didn't get a chance to review this PR.
Do you think we can finish it before the Katib 0.18 release?

helenxie-bit · 2025-01-20T17:51:10Z

@andreyvelich No worries. This test is still in progress because we need to merge the bug fix for the tune API first. I expect to complete the test quickly afterward. When is the expected release date for Katib 0.18?

andreyvelich · 2025-01-20T17:53:37Z

We should release Katib 0.18-rc.0 this week, but we can cherry-pick the bug fixes on RC.1 as well.

helenxie-bit · 2025-01-25T01:02:00Z

This PR is ready for review. Please have a look when you have time :)

/cc @kubeflow/wg-automl-leads @Electronic-Waste @mahdikhashan

google-oss-prow · 2025-01-25T01:02:05Z

@helenxie-bit: GitHub didn't allow me to request PR reviews from the following users: mahdikhashan.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

This PR is ready for review. Please have a look when you have time :)

/cc @kubeflow/wg-automl-leads @Electronic-Waste @mahdikhashan

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

andreyvelich

Thank you for doing this @helenxie-bit!
Just small comments.

.github/workflows/free-up-disk-space/action.yaml

test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py

andreyvelich · 2025-01-27T17:32:31Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py

+    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))
+
+    # Use the test case from fine-tuning API tutorial.
+    # https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/


Should we link an updated guide for Katib LLM Optimization?

helenxie-bit · 2025-01-27

Since the Katib LLM Optimization guide is still under review, should I link to the file in its current state for now?

Additionally, the example in the Katib LLM Optimization guide uses a different model and dataset from this one. The guide uses the LLaMa model, which requires an access token. I've already applied for the token and am awaiting approval. Once I receive it, I will test the example to see if it works.

helenxie-bit · 2025-01-27

I tried running the above example, but I ran into some unexpected errors in the storage_initializer container, and the model couldn't be downloaded successfully. It seems the model used in this example might require different versions of transformers or other libraries. I'll look into it, but it might take some time to resolve.

If we aim to include this in Katib 0.18-rc.0 this week, we might need to stick with the current example. Otherwise, I'll work on fixing it before RC.1.

andreyvelich · 2025-01-27

I think it is fine to include it in RC.1 since it is a bug fix.

andreyvelich · 2025-01-27

We can keep the URL for the Kubeflow Training docs for now.

mahdikhashan · 2025-01-28T15:52:11Z

I started reviewing this pr.

mahdikhashan

I couldn't run the external-model test on a Mac M1 with 8 GB of RAM, using a K8s cluster created with K3d.

INFO:root:---------------------------------------------------------------
INFO:root:E2E is failed for Experiment created by tune: default/tune-example-2
INFO:root:---------------------------------------------------------------
INFO:root:---------------------------------------------------------------
DEBUG:kubernetes.client.rest:response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"experiments.kubeflow.org \"tune-example-2\" not found","reason":"NotFound","details":{"name":"tune-example-2","group":"kubeflow.org","kind":"experiments"},"code":404}

Traceback (most recent call last):
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 183, in <module>
    raise e
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 177, in <module>
    run_e2e_experiment_create_by_tune_with_llm_optimization(katib_client, exp_name_llm_optimization, exp_namespace)
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 81, in run_e2e_experiment_create_by_tune_with_llm_optimization
    katib_client.tune(
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 605, in tune
    lora_config = utils.get_trial_substitutions_from_trainer(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/utils/utils.py", line 213, in get_trial_substitutions_from_trainer
    parameters = json.dumps(parameters.__dict__, cls=SetEncoder)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/utils/utils.py", line 143, in default
    return json.JSONEncoder.default(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type LoraRuntimeConfig is not JSON serializable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 1223, in delete_experiment
    self.custom_api.delete_namespaced_custom_object(
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api/custom_objects_api.py", line 911, in delete_namespaced_custom_object
    return self.delete_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs)  # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api/custom_objects_api.py", line 1038, in delete_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 415, in request
    return self.rest_client.DELETE(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/rest.py", line 270, in DELETE
    return self.request("DELETE", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '2b6ee8e1-d8e8-4ec1-9fe4-bcea39264f1a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '7a07626f-55e1-4b48-b5f1-87b4cd8b517f', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a02fdbf6-92c2-4dc6-b5f8-416b965fc7f7', 'Date': 'Wed, 29 Jan 2025 12:14:45 GMT', 'Content-Length': '246'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"experiments.kubeflow.org \"tune-example-2\" not found","reason":"NotFound","details":{"name":"tune-example-2","group":"kubeflow.org","kind":"experiments"},"code":404}



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 188, in <module>
    katib_client.delete_experiment(exp_name_llm_optimization, exp_namespace)
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 1236, in delete_experiment
    raise RuntimeError(f"Failed to delete Katib Experiment: {namespace}/{name}")
RuntimeError: Failed to delete Katib Experiment: default/tune-example-2
NAME                                 READY   STATUS    RESTARTS   AGE
katib-controller-754877f9f-zvscj     1/1     Running   0          20m
katib-db-manager-64d9c694dd-m9k4h    1/1     Running   0          20m
katib-mysql-74f9795f8b-6h55q         1/1     Running   0          20m
katib-ui-65698b4896-glq9p            1/1     Running   0          20m
training-operator-7dc56b6448-28r69   1/1     Running   0          22m
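
One reading of the traceback above is that the cleanup call to delete_experiment raised its own error because the Experiment was never created, which buries the original TypeError under two further exceptions. A minimal sketch of a cleanup step that tolerates a missing resource follows. FakeClient and FakeNotFoundError are hypothetical stand-ins invented for illustration, not the real KatibClient or Kubernetes exceptions, and this is not necessarily how the test script should handle it.

```python
import logging


class FakeNotFoundError(Exception):
    """Hypothetical stand-in for a 404 'NotFound' response."""


class FakeClient:
    """Hypothetical stand-in for a client; not the real KatibClient."""

    def __init__(self):
        self.experiments = set()

    def tune(self, name):
        # Simulate a failure that happens before the Experiment is created.
        raise TypeError("Object of type LoraRuntimeConfig is not JSON serializable")

    def delete_experiment(self, name):
        if name not in self.experiments:
            raise FakeNotFoundError(name)
        self.experiments.remove(name)


def run_with_safe_cleanup(client, name):
    """Run the test and clean up without masking the original error."""
    try:
        client.tune(name)
    finally:
        try:
            client.delete_experiment(name)
        except FakeNotFoundError:
            # Nothing to delete; the original exception keeps propagating.
            logging.warning("Experiment %s was never created; skipping delete", name)
```

With this shape, the caller sees the TypeError itself as the final exception rather than a cleanup failure.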

mahdikhashan · 2025-01-29T12:27:53Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py

+    HuggingFaceDatasetParams,
+    HuggingFaceModelParams,
+    HuggingFaceTrainerParams,
+)


I would suggest importing each e2e test's specific requirements inside its function, for example:

# Test for Experiment created with external models and datasets.
def run_e2e_experiment_create_by_tune_with_llm_optimization(
    katib_client: KatibClient,
    exp_name: str,
    exp_namespace: str,
):
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceDatasetParams,
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )
    import transformers
    from peft import LoraConfig

    # Create Katib Experiment and wait until it is finished.
    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

This way, the scope of each test is better defined. WDYT?

mahdikhashan · 2025-01-29T12:33:11Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py

+        logging.info("---------------------------------------------------------------")
+        katib_client.delete_experiment(exp_name_custom_objective, exp_namespace)
+
+    try:


I would suggest simply iterating over a data structure, as the unit tests do, for example:

test_tune_data = [
    (
        "tune_with_custom_objective",
        run_e2e_experiment_create_by_tune_with_custom_objective,
    ),
]

WDYT?
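
A rough sketch of how such a table-driven loop could look is below. The two case functions are hypothetical stand-ins (the real test functions also take a katib_client argument, omitted here for brevity), and the experiment names are placeholders rather than the names used in the PR.

```python
import logging
from typing import Callable, Dict, List, Tuple


def run_custom_objective_case(exp_name: str, exp_namespace: str) -> None:
    """Hypothetical stand-in for the custom-objective test function."""
    logging.debug("Running %s/%s", exp_namespace, exp_name)


def run_llm_optimization_case(exp_name: str, exp_namespace: str) -> None:
    """Hypothetical stand-in that fails, to exercise the error path."""
    raise RuntimeError("simulated failure")


# (experiment name, test function) pairs, mirroring the suggested structure.
TEST_TUNE_DATA: List[Tuple[str, Callable[[str, str], None]]] = [
    ("tune-with-custom-objective", run_custom_objective_case),
    ("tune-with-llm-optimization", run_llm_optimization_case),
]


def run_all(exp_namespace: str) -> Dict[str, bool]:
    """Run every case and record whether it passed, without stopping early."""
    results: Dict[str, bool] = {}
    for exp_name, test_fn in TEST_TUNE_DATA:
        try:
            test_fn(exp_name, exp_namespace)
            results[exp_name] = True
        except Exception:
            logging.exception("E2E failed for %s/%s", exp_namespace, exp_name)
            results[exp_name] = False
    return results
```

Adding a new scenario then means appending one tuple instead of copying another try/except block.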

mahdikhashan · 2025-01-29T12:35:48Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py

@@ -79,18 +156,33 @@ def objective(parameters):
        client.CoreV1Api().patch_namespace(args.namespace, {'metadata': {'labels': namespace_labels}})

    # Test with run_e2e_experiment_create_by_tune
-    exp_name = "tune-example"
+    exp_name_custom_objective = "tune-example-1"
+    exp_name_llm_optimization = "tune-example-2"


I would suggest more meaningful names for the tests. While looking at the test results, it was not easy for me to tell the difference between tune-example-1 and tune-example-2.

How about tune-for-an-objective-function and tune-for-external-model? WDYT? (Feel free to offer better names; these were spontaneous ideas.)

mahdikhashan · 2025-01-29T12:39:41Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py

+
+    # Print the Experiment and Suggestion.
+    logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
+    logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))


I would suggest using a pretty-printer to format the test result (success or failure) here, for example pprint. WDYT?
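
As a small illustration of the idea, the standard-library pprint module can format a nested structure before it is logged. The dictionary below is a made-up stand-in for an Experiment, not the object actually returned by get_experiment.

```python
import logging
from pprint import pformat


def format_for_log(obj) -> str:
    """Return a multi-line, indented representation suitable for log output."""
    return pformat(obj, indent=2, width=80)


# Hypothetical stand-in for an Experiment; the real object is an SDK model.
experiment = {
    "metadata": {"name": "tune-example-1", "namespace": "default"},
    "status": {"trials_succeeded": 2},
}

logging.debug("Experiment:\n%s", format_for_log(experiment))
```

Passing the formatted string as a logging argument keeps the formatting lazy, so it is only interpolated when debug logging is enabled.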

mahdikhashan · 2025-01-29T12:46:17Z

It seems that the reason for the test failure on my machine is:

TypeError: Object of type LoraRuntimeConfig is not JSON serializable

My Python version is 3.12.7.
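
The failure mode can be reproduced in isolation. In the sketch below, FakeLoraConfig and FakeRuntimeConfig are hypothetical stand-ins for a config object that nests another object, SetOnlyEncoder is an assumed approximation of an encoder that only knows about sets, and FallbackEncoder shows one possible workaround. None of this is the actual Katib SDK code or its fix.

```python
import json


class FakeRuntimeConfig:
    """Hypothetical nested config object with a made-up field."""

    def __init__(self):
        self.some_flag = False


class FakeLoraConfig:
    """Hypothetical config holding a set and a nested object."""

    def __init__(self):
        self.r = 8
        self.target_modules = {"q_proj", "v_proj"}
        self.runtime_config = FakeRuntimeConfig()


class SetOnlyEncoder(json.JSONEncoder):
    """Serializes sets; everything else falls through and raises TypeError."""

    def default(self, obj):
        if isinstance(obj, set):
            return sorted(obj)
        return json.JSONEncoder.default(self, obj)


class FallbackEncoder(SetOnlyEncoder):
    """Additionally serializes plain objects through their attribute dict."""

    def default(self, obj):
        if hasattr(obj, "__dict__"):
            return vars(obj)
        return super().default(obj)


config = FakeLoraConfig()

try:
    json.dumps(config.__dict__, cls=SetOnlyEncoder)
except TypeError as err:
    print(f"SetOnlyEncoder failed: {err}")

print(json.dumps(config.__dict__, cls=FallbackEncoder))
```

Whether falling back to the attribute dict is appropriate depends on what the consumer of the JSON expects; dropping the nested field before serializing would be another option.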

andreyvelich · 2025-02-03T13:12:11Z

Hi @helenxie-bit, we have until this Wednesday to merge this PR before we cut Katib RC.0.
Do you have enough time to finish it?

add e2e test for tune api

6be7f29

Signed-off-by: helenxie-bit <[email protected]>

google-oss-prow bot requested review from andreyvelich, anencore94 and gaocegege September 3, 2024 13:17

google-oss-prow bot added the size/M label Sep 3, 2024

helenxie-bit mentioned this pull request Sep 3, 2024

[GSoC] Project 4: Hyperparameter Optimization API in Katib for LLMs #2339

Open

6 tasks

google-oss-prow bot added the area/gsoc label Sep 3, 2024

helenxie-bit added 2 commits September 3, 2024 21:38

upgrade training-operator sdk

1a1f119

Signed-off-by: helenxie-bit <[email protected]>

specify the version of training operator sdk

8461a49

Signed-off-by: helenxie-bit <[email protected]>

helenxie-bit changed the title ~~[GSoC] Add e2e test for tune api with LLM hyperparameter optimization~~ [WIP] Add e2e test for tune api with LLM hyperparameter optimization Sep 3, 2024

google-oss-prow bot added the do-not-merge/work-in-progress label Sep 3, 2024

fix num_labels error and update the version of training operator cont…

c860238

…roller Signed-off-by: helenxie-bit <[email protected]>

google-oss-prow bot added size/L and removed size/M labels Sep 3, 2024

helenxie-bit added 13 commits September 3, 2024 22:30

check the version of training operator

216ebd9

Signed-off-by: helenxie-bit <[email protected]>

debug

f6b96f5

Signed-off-by: helenxie-bit <[email protected]>

check import path of HuggingFaceModelParams

c636493

Signed-off-by: helenxie-bit <[email protected]>

update the version of training operator sdk

8180422

Signed-off-by: helenxie-bit <[email protected]>

update the name of experiment

6101489

Signed-off-by: helenxie-bit <[email protected]>

add step of checking pod

d67a1b8

Signed-off-by: helenxie-bit <[email protected]>

check the logs of pod

295abb6

Signed-off-by: helenxie-bit <[email protected]>

add check

e0a1b6d

Signed-off-by: helenxie-bit <[email protected]>

check reason for imagepullbackoff

1df7df9

Signed-off-by: helenxie-bit <[email protected]>

revert timeout limit

d1e1311

Signed-off-by: helenxie-bit <[email protected]>

fix format

0cc319f

Signed-off-by: helenxie-bit <[email protected]>

extend timeout limit

0383932

Signed-off-by: helenxie-bit <[email protected]>

update training operator sdk version

08c8634

Signed-off-by: helenxie-bit <[email protected]>

github-actions bot added the lifecycle/stale label Jan 20, 2025

google-oss-prow bot removed the lifecycle/stale label Jan 20, 2025

helenxie-bit added 6 commits January 23, 2025 00:14

update installation of Katib SDK with extra requires

e5bf840

Signed-off-by: helenxie-bit <[email protected]>

test trainer image built with cpu

fca94ae

Signed-off-by: helenxie-bit <[email protected]>

Merge remote-tracking branch 'upstream/master' into e2e-test-tune-api

b5cae0d

add action of free up disk space (including move docker data directory)

a785d35

Signed-off-by: helenxie-bit <[email protected]>

delete unnecessary checks and update the part of fetching pod descrip…

865379e

…tion and logs Signed-off-by: helenxie-bit <[email protected]>

delete fetching pod logs

d1ea629

Signed-off-by: helenxie-bit <[email protected]>

helenxie-bit changed the title ~~[WIP] Add e2e test for tune api with LLM hyperparameter optimization~~ [GSoC] Add e2e test for tune api with LLM hyperparameter optimization Jan 25, 2025

google-oss-prow bot removed the do-not-merge/work-in-progress label Jan 25, 2025

google-oss-prow bot requested review from a team and Electronic-Waste January 25, 2025 01:02

Electronic-Waste mentioned this pull request Jan 25, 2025

[Release] Katib 0.18 Roadmap #2386

Open

7 tasks

andreyvelich reviewed Jan 27, 2025

helenxie-bit added 3 commits January 27, 2025 09:55

add blank line at the end of free-up-disk-space yaml file

5e2e44f

Signed-off-by: helenxie-bit <[email protected]>

update experiment name

982e268

Signed-off-by: helenxie-bit <[email protected]>

update test function name to be consistent with experiment name

55c404d

Signed-off-by: helenxie-bit <[email protected]>

helenxie-bit mentioned this pull request Jan 27, 2025

[SDK] ValueError: <HUB_TOKEN> is not a valid HubStrategy, please select one of ['end', 'every_save', 'checkpoint', 'all_checkpoints'] #2495

Open

mahdikhashan reviewed Jan 29, 2025

andreyvelich added this to the v0.18 milestone Feb 3, 2025
