[GSoC] Add e2e test for tune api with LLM hyperparameter optimization #2420

Status: Open. Wants to merge 58 commits into base: master.

Changes from 55 commits (58 commits total)

Commits
6be7f29
add e2e test for tune api
helenxie-bit Sep 3, 2024
1a1f119
upgrade training-operator sdk
helenxie-bit Sep 3, 2024
8461a49
specify the version of training operator sdk
helenxie-bit Sep 3, 2024
c860238
fix num_labels error and update the version of training operator cont…
helenxie-bit Sep 3, 2024
216ebd9
check the version of training operator
helenxie-bit Sep 3, 2024
f6b96f5
debug
helenxie-bit Sep 3, 2024
c636493
check import path of HuggingFaceModelParams
helenxie-bit Sep 3, 2024
8180422
update the version of training operator sdk
helenxie-bit Sep 5, 2024
6101489
update the name of experiment
helenxie-bit Sep 5, 2024
d67a1b8
add step of checking pod
helenxie-bit Sep 5, 2024
295abb6
check the logs of pod
helenxie-bit Sep 5, 2024
e0a1b6d
add check
helenxie-bit Sep 5, 2024
1df7df9
check reason for imagepullbackoff
helenxie-bit Sep 5, 2024
d1e1311
revert timeout limit
helenxie-bit Sep 5, 2024
0cc319f
fix format
helenxie-bit Sep 5, 2024
0383932
extend timeout limit
helenxie-bit Sep 13, 2024
08c8634
update training operator sdk version
helenxie-bit Sep 13, 2024
7a98a00
check the logs of pod
helenxie-bit Sep 13, 2024
8862d79
rerun tests
helenxie-bit Sep 13, 2024
e4f614d
update the function of getting logs
helenxie-bit Sep 14, 2024
0385eea
add the step of describing pod
helenxie-bit Sep 14, 2024
e0c5170
check disk space
helenxie-bit Sep 14, 2024
0286f70
change work directory
helenxie-bit Sep 17, 2024
f6e5ed5
change work directory
helenxie-bit Sep 17, 2024
7ea7e43
increase timeout limit
helenxie-bit Sep 17, 2024
25d99b1
check the logs of controller and events
helenxie-bit Sep 17, 2024
fcd64fa
change work directory
helenxie-bit Sep 18, 2024
122c611
change work directory
helenxie-bit Sep 18, 2024
c1fde09
change work directory
helenxie-bit Sep 18, 2024
8ff6864
check the logs of kubelet
helenxie-bit Sep 18, 2024
da3c298
check the logs of kubelet
helenxie-bit Sep 18, 2024
a1bff26
increase cpu
helenxie-bit Sep 19, 2024
bbae57b
check the logs of training operator
helenxie-bit Sep 19, 2024
e45ceac
check the use of resources
helenxie-bit Sep 19, 2024
4ae11ed
check the logs of container 'pytorch' and 'storage_initializer'
helenxie-bit Sep 20, 2024
bedab36
fix error of checking use of resources
helenxie-bit Sep 20, 2024
7bfb3cc
add other checks to find the error reason
helenxie-bit Sep 20, 2024
efffdc2
set 'storage_config'
helenxie-bit Sep 21, 2024
2a18b17
reduce the number of tests
helenxie-bit Sep 22, 2024
c6c964b
Check container runtime logs
helenxie-bit Sep 22, 2024
28ffb96
set the driver of minikube as docker
helenxie-bit Sep 22, 2024
dc684e3
set the driver of minikube to none
helenxie-bit Sep 22, 2024
a12034c
check logs of pod
helenxie-bit Sep 24, 2024
b088815
check memory usage
helenxie-bit Sep 29, 2024
e468b27
increase 'termination_grace_period_seconds' in podspec
helenxie-bit Sep 29, 2024
64d8fef
fix annotations error
helenxie-bit Sep 29, 2024
45db42e
restart docker
helenxie-bit Sep 30, 2024
c6e91cd
delete restarting docker
helenxie-bit Sep 30, 2024
b1a2390
use original docker data directory
helenxie-bit Oct 22, 2024
e5bf840
update installation of Katib SDK with extra requires
helenxie-bit Jan 23, 2025
fca94ae
test trainer image built with cpu
helenxie-bit Jan 23, 2025
b5cae0d
Merge remote-tracking branch 'upstream/master' into e2e-test-tune-api
helenxie-bit Jan 24, 2025
a785d35
add action of free up disk space (including move docker data directory)
helenxie-bit Jan 24, 2025
865379e
delete unnecessary checks and update the part of fetching pod descrip…
helenxie-bit Jan 24, 2025
d1ea629
delete fetching pod logs
helenxie-bit Jan 25, 2025
5e2e44f
add blank line at the end of free-up-disk-space yaml file
helenxie-bit Jan 27, 2025
982e268
update experiment name
helenxie-bit Jan 27, 2025
55c404d
update test function name to be consistent with experiment name
helenxie-bit Jan 27, 2025
6 changes: 6 additions & 0 deletions .github/workflows/e2e-test-tune-api.yaml
@@ -22,10 +22,16 @@ jobs:
with:
kubernetes-version: ${{ matrix.kubernetes-version }}

- name: Install Katib SDK with extra requires
shell: bash
run: |
pip install --prefer-binary -e 'sdk/python/v1beta1[huggingface]'

- name: Run e2e test with tune API
uses: ./.github/workflows/template-e2e-test
with:
tune-api: true
training-operator: true

strategy:
fail-fast: false
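For context, the `huggingface` extra is what provides the storage initializer imports used by the test script below; a purely illustrative smoke check (not part of this PR) could be:

# Illustrative only: confirm the packages pulled in by the huggingface extra import cleanly.
import transformers  # noqa: F401
from kubeflow.storage_initializer.hugging_face import HuggingFaceModelParams  # noqa: F401

print("Katib SDK huggingface extra is importable.")
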
49 changes: 49 additions & 0 deletions .github/workflows/free-up-disk-space/action.yaml
@@ -0,0 +1,49 @@
name: Free-Up Disk Space
description: Remove Non-Essential Tools And Move Docker Data Directory to /mnt/docker

runs:
using: composite
steps:
# This step is a Workaround to avoid the "No space left on device" error.
# ref: https://github.com/actions/runner-images/issues/2840
- name: Remove unnecessary files
shell: bash
run: |
echo "Disk usage before cleanup:"
df -hT

sudo rm -rf /usr/share/dotnet
sudo rm -rf /opt/ghc
sudo rm -rf /usr/local/share/boost
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/local/share/powershell
sudo rm -rf /usr/share/swift

echo "Disk usage after cleanup:"
df -hT

- name: Prune docker images
shell: bash
run: |
docker image prune -a -f
docker system df
df -hT

- name: Move docker data directory
shell: bash
run: |
echo "Stopping docker service ..."
sudo systemctl stop docker
DOCKER_DEFAULT_ROOT_DIR=/var/lib/docker
DOCKER_ROOT_DIR=/mnt/docker
echo "Moving ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
sudo mv ${DOCKER_DEFAULT_ROOT_DIR} ${DOCKER_ROOT_DIR}
echo "Creating symlink ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
sudo ln -s ${DOCKER_ROOT_DIR} ${DOCKER_DEFAULT_ROOT_DIR}
echo "$(sudo ls -l ${DOCKER_DEFAULT_ROOT_DIR})"
echo "Starting docker service ..."
sudo systemctl daemon-reload
sudo systemctl start docker
echo "Docker service status:"
sudo systemctl --no-pager -l -o short status docker
15 changes: 2 additions & 13 deletions .github/workflows/template-setup-e2e-test/action.yaml
@@ -17,19 +17,8 @@ runs:
steps:
# This step is a Workaround to avoid the "No space left on device" error.
# ref: https://github.com/actions/runner-images/issues/2840
- name: Remove unnecessary files
shell: bash
run: |
sudo rm -rf /usr/share/dotnet
sudo rm -rf /opt/ghc
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/local/share/powershell
sudo rm -rf /usr/share/swift

echo "Disk usage after cleanup:"
df -h
- name: Free-Up Disk Space
uses: ./.github/workflows/free-up-disk-space

- name: Setup kubectl
uses: azure/setup-kubectl@v4
103 changes: 97 additions & 6 deletions test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py
@@ -1,8 +1,16 @@
import argparse
import logging

import kubeflow.katib as katib
import transformers
from kubeflow.katib import KatibClient, search
from kubeflow.storage_initializer.hugging_face import (
HuggingFaceDatasetParams,
HuggingFaceModelParams,
HuggingFaceTrainerParams,
)
Contributor:
I would suggest importing each e2e test's specific requirements inside its function, for example:

# Test for Experiment created with external models and datasets.
def run_e2e_experiment_create_by_tune_with_llm_optimization(
    katib_client: KatibClient,
    exp_name: str,
    exp_namespace: str,
):
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceDatasetParams,
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )
    import transformers
    from peft import LoraConfig

    # Create Katib Experiment and wait until it is finished.
    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

This way, the scope of each test is more clearly contained. WDYT?

from kubernetes import client
from peft import LoraConfig
from verify import verify_experiment_results

# Experiment timeout is 40 min.
@@ -11,8 +19,8 @@
# The default logging config.
logging.basicConfig(level=logging.INFO)


def run_e2e_experiment_create_by_tune(
# Test for Experiment created with custom objective function.
def run_e2e_experiment_create_by_tune_with_custom_objective(
katib_client: KatibClient,
exp_name: str,
exp_namespace: str,
@@ -57,6 +65,75 @@ def objective(parameters):
logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))

# Test for Experiment created with external models and datasets.
def run_e2e_experiment_create_by_tune_with_external_model(
katib_client: KatibClient,
exp_name: str,
exp_namespace: str,
):
# Create Katib Experiment and wait until it is finished.
logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

# Use the test case from fine-tuning API tutorial.
# https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
Member:
Should we link an updated guide for Katib LLM Optimization?

Contributor Author (helenxie-bit):
Since the Katib LLM Optimization guide is still under review, should I link to the file in its current state for now?

Additionally, the example in the Katib LLM Optimization guide uses a different model and dataset compared to this one. The guide uses the LLaMa model, which requires access tokens. I’ve already applied for the access token and am awaiting approval. Once I receive it, I will test the example to see if it works.

Contributor Author (helenxie-bit):
I tried running the above example, but I ran into some unexpected errors in the storage_initializer container, and the model couldn't be downloaded successfully. It seems like the model used in this example might require different versions of transformers or other libraries. I'll look into it, but it might take some time to resolve.

If we aim to include this in Katib 0.18-rc.0 this week, we might need to stick with the current example. Otherwise, I’ll work on fixing it before RC.1.

Member:
I think it is fine to include it in RC.1 since it is a bug fix.

Member:
We can keep the URL for the Kubeflow Training docs for now.

# Create Katib Experiment.
# And Wait until Experiment reaches Succeeded condition.
katib_client.tune(
name=exp_name,
namespace=exp_namespace,
# BERT model URI and type of Transformer to train it.
model_provider_parameters=HuggingFaceModelParams(
model_uri="hf://google-bert/bert-base-cased",
transformer_type=transformers.AutoModelForSequenceClassification,
num_labels=5,
),
# In order to save test time, use 8 samples from Yelp dataset.
dataset_provider_parameters=HuggingFaceDatasetParams(
repo_id="yelp_review_full",
split="train[:8]",
),
# Specify HuggingFace Trainer parameters.
trainer_parameters=HuggingFaceTrainerParams(
training_parameters=transformers.TrainingArguments(
output_dir="test_tune_api",
save_strategy="no",
learning_rate=search.double(min=1e-05, max=5e-05),
num_train_epochs=1,
),
# Set LoRA config to reduce number of trainable model parameters.
lora_config=LoraConfig(
r=search.int(min=8, max=32),
lora_alpha=8,
lora_dropout=0.1,
bias="none",
),
),
objective_metric_name="train_loss",
objective_type="minimize",
algorithm_name="random",
max_trial_count=1,
parallel_trial_count=1,
resources_per_trial=katib.TrainerResources(
num_workers=1,
num_procs_per_worker=1,
resources_per_worker={"cpu": "2", "memory": "10G"},
),
storage_config={
"size": "10Gi",
"access_modes": ["ReadWriteOnce"],
},
retain_trials=True,
)
experiment = katib_client.wait_for_experiment_condition(
exp_name, exp_namespace, timeout=EXPERIMENT_TIMEOUT
)

# Verify the Experiment results.
verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)

# Print the Experiment and Suggestion.
logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
Contributor:
I would suggest using a prettifier to format the test success or failure output here, for example with pprint. WDYT?

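For illustration, a minimal sketch of the suggested pprint-based formatting, reusing the `katib_client`, `exp_name`, and `exp_namespace` names from the surrounding script (the `to_dict()` call assumes the generated Katib model helpers; this is not part of the PR):

from pprint import pformat

# Hypothetical replacement for the plain logging.debug() calls above:
# pretty-print the Experiment and Suggestion objects so they are easier
# to scan in the CI logs. Assumes the generated models expose to_dict().
experiment = katib_client.get_experiment(exp_name, exp_namespace)
suggestion = katib_client.get_suggestion(exp_name, exp_namespace)
logging.debug("Experiment:\n%s", pformat(experiment.to_dict(), width=100))
logging.debug("Suggestion:\n%s", pformat(suggestion.to_dict(), width=100))
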

if __name__ == "__main__":
parser = argparse.ArgumentParser()
@@ -82,15 +159,29 @@ def objective(parameters):
exp_name = "tune-example"
exp_namespace = args.namespace
try:
run_e2e_experiment_create_by_tune(katib_client, exp_name, exp_namespace)
run_e2e_experiment_create_by_tune_with_custom_objective(katib_client, f"{exp_name}-1", exp_namespace)
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}-1")
except Exception as e:
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name}-1")
raise e
finally:
# Delete the Experiment.
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
katib_client.delete_experiment(f"{exp_name}-1", exp_namespace)

try:
@mahdikhashan (Contributor), Jan 29, 2025:
I would suggest a simpler approach: iterate over a data structure, like the unit tests do, for example:

test_tune_data = [
    (
        "tune_with_custom_objective",
        run_e2e_experiment_create_by_tune_with_custom_objective,
    ),
]

WDYT?
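
A minimal sketch of how that could look for the two tests in this script; the function and client names are taken from the code above, and the pairing structure itself is only illustrative:

# Illustrative only: drive both e2e tests from one (suffix, test function) list
# instead of duplicating the try/except/finally blocks below.
test_tune_data = [
    ("1", run_e2e_experiment_create_by_tune_with_custom_objective),
    ("2", run_e2e_experiment_create_by_tune_with_external_model),
]

for suffix, test_fn in test_tune_data:
    name = f"{exp_name}-{suffix}"
    try:
        test_fn(katib_client, name, exp_namespace)
        logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{name}")
    except Exception as e:
        logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{name}")
        raise e
    finally:
        # Delete the Experiment regardless of the outcome.
        katib_client.delete_experiment(name, exp_namespace)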

run_e2e_experiment_create_by_tune_with_external_model(katib_client, f"{exp_name}-2", exp_namespace)
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}-2")
except Exception as e:
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name}")
logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name}-2")
raise e
finally:
# Delete the Experiment.
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
katib_client.delete_experiment(exp_name, exp_namespace)
katib_client.delete_experiment(f"{exp_name}-2", exp_namespace)