[GSoC] Add e2e test for tune api with LLM hyperparameter optimization #2420

Open
wants to merge 58 commits into
base: master

helenxie-bit
Contributor

What this PR does / why we need it:
This PR adds an e2e test for the tune API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: helenxie-bit <[email protected]>

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@helenxie-bit
Contributor Author

/area gsoc

@helenxie-bit
Contributor Author

Ref: #2339

@helenxie-bit helenxie-bit changed the title [GSoC] Add e2e test for tune api with LLM hyperparameter optimization [WIP] Add e2e test for tune api with LLM hyperparameter optimization Sep 3, 2024
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Sep 3, 2024
Signed-off-by: helenxie-bit <[email protected]>

@helenxie-bit: The label(s) /remove-label lifecycle/stale cannot be applied. These labels are supported: tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, lifecycle/needs-triage. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/remove-label lifecycle/stale

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@helenxie-bit
Contributor Author

/remove-lifecycle stale

@andreyvelich
Member

@helenxie-bit, sorry I didn't get a chance to review this PR.
Do you think we can finish it before the Katib 0.18 release?

@helenxie-bit
Contributor Author

@andreyvelich No worries. This test is still in progress because we need to merge the bug fix for the tune API first. I expect to complete the test quickly after that. When is the expected release date for Katib 0.18?

@andreyvelich
Member

We should release Katib 0.18-rc.0 this week, but we can cherry-pick the bug fixes into RC.1 as well.

@helenxie-bit helenxie-bit changed the title [WIP] Add e2e test for tune api with LLM hyperparameter optimization [GSoC] Add e2e test for tune api with LLM hyperparameter optimization Jan 25, 2025
@helenxie-bit
Contributor Author

This PR is ready for review. Please have a look when you have time :)

/cc @kubeflow/wg-automl-leads @Electronic-Waste @mahdikhashan

@google-oss-prow google-oss-prow bot requested review from a team and Electronic-Waste January 25, 2025 01:02

@helenxie-bit: GitHub didn't allow me to request PR reviews from the following users: mahdikhashan.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.


Member

@andreyvelich andreyvelich left a comment

Thank you for doing this @helenxie-bit!
Just small comments.

.github/workflows/free-up-disk-space/action.yaml (outdated, resolved)
logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

# Use the test case from fine-tuning API tutorial.
# https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
Member

Should we link to an updated guide for Katib LLM Optimization?

Contributor Author

Since the Katib LLM Optimization guide is still under review, should I link to the file in its current state for now?

Additionally, the example in the Katib LLM Optimization guide uses a different model and dataset compared to this one. The guide uses the LLaMa model, which requires access tokens. I’ve already applied for the access token and am awaiting approval. Once I receive it, I will test the example to see if it works.

Contributor Author

I tried running the above example, but I ran into some unexpected errors in the storage_initializer container, and the model couldn't be downloaded successfully. It seems like the model used in this example might require different versions of transformers or other libraries. I'll look into it, but it might take some time to resolve.

If we aim to include this in Katib 0.18-rc.0 this week, we might need to stick with the current example. Otherwise, I’ll work on fixing it before RC.1.

Member

I think it is fine to include it in RC.1 since it is a bug fix.

Member

We can keep the URL for the Kubeflow Training docs for now.

@mahdikhashan
Contributor

I started reviewing this PR.

Contributor

@mahdikhashan mahdikhashan left a comment

I couldn't run the external test on a Mac M1 with 8 GB of RAM, using a K8s cluster created with K3d.

INFO:root:---------------------------------------------------------------
INFO:root:E2E is failed for Experiment created by tune: default/tune-example-2
INFO:root:---------------------------------------------------------------
INFO:root:---------------------------------------------------------------
DEBUG:kubernetes.client.rest:response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"experiments.kubeflow.org \"tune-example-2\" not found","reason":"NotFound","details":{"name":"tune-example-2","group":"kubeflow.org","kind":"experiments"},"code":404}

Traceback (most recent call last):
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 183, in <module>
    raise e
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 177, in <module>
    run_e2e_experiment_create_by_tune_with_llm_optimization(katib_client, exp_name_llm_optimization, exp_namespace)
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 81, in run_e2e_experiment_create_by_tune_with_llm_optimization
    katib_client.tune(
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 605, in tune
    lora_config = utils.get_trial_substitutions_from_trainer(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/utils/utils.py", line 213, in get_trial_substitutions_from_trainer
    parameters = json.dumps(parameters.__dict__, cls=SetEncoder)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/utils/utils.py", line 143, in default
    return json.JSONEncoder.default(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type LoraRuntimeConfig is not JSON serializable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 1223, in delete_experiment
    self.custom_api.delete_namespaced_custom_object(
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api/custom_objects_api.py", line 911, in delete_namespaced_custom_object
    return self.delete_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs)  # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api/custom_objects_api.py", line 1038, in delete_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 415, in request
    return self.rest_client.DELETE(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/rest.py", line 270, in DELETE
    return self.request("DELETE", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '2b6ee8e1-d8e8-4ec1-9fe4-bcea39264f1a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '7a07626f-55e1-4b48-b5f1-87b4cd8b517f', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a02fdbf6-92c2-4dc6-b5f8-416b965fc7f7', 'Date': 'Wed, 29 Jan 2025 12:14:45 GMT', 'Content-Length': '246'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"experiments.kubeflow.org \"tune-example-2\" not found","reason":"NotFound","details":{"name":"tune-example-2","group":"kubeflow.org","kind":"experiments"},"code":404}



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mahdikhashan/kubeflow/katib/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py", line 188, in <module>
    katib_client.delete_experiment(exp_name_llm_optimization, exp_namespace)
  File "/Users/mahdikhashan/miniconda3/lib/python3.12/site-packages/kubeflow/katib/api/katib_client.py", line 1236, in delete_experiment
    raise RuntimeError(f"Failed to delete Katib Experiment: {namespace}/{name}")
RuntimeError: Failed to delete Katib Experiment: default/tune-example-2
NAME                                 READY   STATUS    RESTARTS   AGE
katib-controller-754877f9f-zvscj     1/1     Running   0          20m
katib-db-manager-64d9c694dd-m9k4h    1/1     Running   0          20m
katib-mysql-74f9795f8b-6h55q         1/1     Running   0          20m
katib-ui-65698b4896-glq9p            1/1     Running   0          20m
training-operator-7dc56b6448-28r69   1/1     Running   0          22m

HuggingFaceDatasetParams,
HuggingFaceModelParams,
HuggingFaceTrainerParams,
)
Contributor

I would suggest importing each e2e test's specific requirements inside its own function, for example:

# Test for Experiment created with external models and datasets.
def run_e2e_experiment_create_by_tune_with_llm_optimization(
    katib_client: KatibClient,
    exp_name: str,
    exp_namespace: str,
):
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceDatasetParams,
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )
    import transformers
    from peft import LoraConfig

    # Create Katib Experiment and wait until it is finished.
    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

This way, the scope of each test is more clearly defined. WDYT?

logging.info("---------------------------------------------------------------")
katib_client.delete_experiment(exp_name_custom_objective, exp_namespace)

try:
Contributor

@mahdikhashan mahdikhashan Jan 29, 2025

I would suggest something simpler: iterating over a data structure, like the unit tests do, for example:

test_tune_data = [
    (
        "tune_with_custom_objective",
        run_e2e_experiment_create_by_tune_with_custom_objective,
    ),
]

WDYT?
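
To make the suggestion above concrete, here is a minimal sketch of a table-driven runner. It is only an illustration under stated assumptions: `FakeKatibClient`, `run_ok`, `run_broken`, `run_all`, and the experiment names are hypothetical placeholders (not part of this PR or the Katib SDK), used so the snippet runs without a cluster. The only client method it relies on is `delete_experiment(name, namespace)`, which the e2e script already calls.

```python
import logging
from typing import Callable, List, Tuple


class FakeKatibClient:
    """Hypothetical stand-in for KatibClient so the sketch runs without a cluster."""

    def __init__(self) -> None:
        self.deleted: List[str] = []

    def delete_experiment(self, name: str, namespace: str) -> None:
        # Record the call instead of talking to the Kubernetes API.
        self.deleted.append(f"{namespace}/{name}")


def run_ok(client, exp_name: str, exp_namespace: str) -> None:
    """Placeholder for an e2e test function that passes."""


def run_broken(client, exp_name: str, exp_namespace: str) -> None:
    """Placeholder for an e2e test function that fails."""
    raise RuntimeError("experiment did not succeed")


# (experiment name, test function) pairs, in the spirit of the table above.
test_tune_data: List[Tuple[str, Callable]] = [
    ("tune-with-custom-objective", run_ok),
    ("tune-with-llm-optimization", run_broken),
]


def run_all(client, namespace: str) -> List[str]:
    """Run every test in the table and return the names of the failed ones."""
    failed = []
    for exp_name, test_fn in test_tune_data:
        try:
            test_fn(client, exp_name, namespace)
            logging.info("E2E succeeded for %s/%s", namespace, exp_name)
        except Exception:
            # Log the traceback and keep going so later tests still run.
            logging.exception("E2E failed for %s/%s", namespace, exp_name)
            failed.append(exp_name)
        finally:
            # Clean up after each test, whether it passed or failed.
            client.delete_experiment(exp_name, namespace)
    return failed


client = FakeKatibClient()
failed = run_all(client, "default")
```

Collecting failures instead of re-raising immediately is a design choice of this sketch; the script could equally raise after the loop if `failed` is non-empty.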

@@ -79,18 +156,33 @@ def objective(parameters):
client.CoreV1Api().patch_namespace(args.namespace, {'metadata': {'labels': namespace_labels}})

# Test with run_e2e_experiment_create_by_tune
exp_name = "tune-example"
exp_name_custom_objective = "tune-example-1"
exp_name_llm_optimization = "tune-example-2"
Contributor

I would suggest more meaningful names for the tests. While looking at the test results, it was not easy for me to tell the difference between tune-example-1 and tune-example-2.

How about tune-for-an-objective-function and tune-for-external-model? WDYT? (Feel free to offer better names; these were spontaneous ideas.)

Contributor

image


# Print the Experiment and Suggestion.
logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
Contributor

I would suggest using a pretty-printer to format the test result (success or failure) here, for example pprint. WDYT?

image
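
As a rough illustration of the pprint idea, the sketch below formats a nested structure with `pprint.pformat` before logging it. The `experiment` dict is a hypothetical, trimmed-down stand-in for whatever `get_experiment` returns; the real object and its fields may differ, so treat this as an assumption rather than the SDK's actual output.

```python
import logging
from pprint import pformat

# Hypothetical stand-in for an Experiment, reduced to a plain dict.
experiment = {
    "metadata": {"name": "tune-example", "namespace": "default"},
    "spec": {"max_trial_count": 1, "parallel_trial_count": 1},
}

# pformat returns the pretty-printed text as a string, so it can be
# passed to the logger instead of being printed directly.
formatted = pformat(experiment, width=60)
logging.debug("Experiment:\n%s", formatted)
```

With a narrow `width`, nested dictionaries are split across lines, which tends to be easier to scan in CI logs than a single long line.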

@mahdikhashan
Contributor

mahdikhashan commented Jan 29, 2025

TypeError: Object of type LoraRuntimeConfig is not JSON serializable

It seems that the reason for the test failure on my machine is:

TypeError: Object of type LoraRuntimeConfig is not JSON serializable

My Python version is 3.12.7.
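
For context, the traceback above shows `json.dumps(parameters.__dict__, cls=SetEncoder)` failing when it reaches a nested `LoraRuntimeConfig` object. The sketch below reproduces that class of error with stand-in types and shows one possible way an encoder could handle nested dataclasses. Everything here is an assumption for illustration: `RuntimeConfig`, `AdapterConfig`, `SetOnlyEncoder`, and `DataclassAwareEncoder` are hypothetical names, not the Katib or PEFT implementations, and this is not necessarily how the issue should be fixed in the SDK.

```python
import dataclasses
import json


@dataclasses.dataclass
class RuntimeConfig:
    """Hypothetical stand-in for a nested config object."""
    offload: bool = False


@dataclasses.dataclass
class AdapterConfig:
    """Hypothetical stand-in for the outer config whose __dict__ is serialized."""
    r: int = 8
    target_modules: set = dataclasses.field(default_factory=lambda: {"q_proj"})
    runtime_config: RuntimeConfig = dataclasses.field(default_factory=RuntimeConfig)


class SetOnlyEncoder(json.JSONEncoder):
    """Handles sets only; any other unknown type falls through to TypeError."""

    def default(self, obj):
        if isinstance(obj, set):
            return sorted(obj)
        return super().default(obj)


class DataclassAwareEncoder(SetOnlyEncoder):
    """Additionally converts dataclass instances to plain dicts."""

    def default(self, obj):
        if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
            return dataclasses.asdict(obj)
        return super().default(obj)


config = AdapterConfig()

# The set-only encoder fails on the nested dataclass instance.
failure = None
try:
    json.dumps(config.__dict__, cls=SetOnlyEncoder)
except TypeError as err:
    failure = str(err)

# The dataclass-aware encoder serializes the same input.
serialized = json.dumps(config.__dict__, cls=DataclassAwareEncoder)
```

Whether the real nested object is a dataclass, and whether it appears only with certain library versions, would need to be checked against the installed packages.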

@andreyvelich
Member

andreyvelich commented Feb 3, 2025

Hi @helenxie-bit, we have until this Wednesday to merge this PR before we cut Katib RC.0.
Do you have enough time to finish it?

@andreyvelich andreyvelich added this to the v0.18 milestone Feb 3, 2025