[WIP] Training: Initial Documentation for Kubeflow Trainer V2 #3958

Open · wants to merge 19 commits into base: master
Changes from 13 commits
4 changes: 2 additions & 2 deletions content/en/_index.html
@@ -111,7 +111,7 @@ <h5 class="card-title text-white section-head">AutoML</h5>
</div>
</div>
<div class="card border-primary-dark">
<a href="/docs/components/training/overview/" target="_blank" rel="noopener" >
<a href="/docs/components/trainer/overview/" target="_blank" rel="noopener" >
<img
src="/docs/images/logos/tensorflow-pytorch.png"
class="card-img-top"
@@ -123,7 +123,7 @@ <h5 class="card-title text-white section-head">AutoML</h5>
<div class="card-body bg-primary-dark">
<h5 class="card-title text-white section-head">Model Training</h5>
<p class="card-text text-white">
<a href="/docs/components/training/overview/" target="_blank" rel="noopener" >Kubeflow Training Operator</a> is a unified interface for model training and fine-tuning on Kubernetes.
<a href="/docs/components/trainer/overview/" target="_blank" rel="noopener" >Kubeflow Trainer</a> is a unified interface for model training and LLM fine-tuning on Kubernetes.
It runs scalable and distributed training jobs for popular frameworks including PyTorch, TensorFlow, MPI, MXNet, PaddlePaddle, and XGBoost.
</p>
</div>
20 changes: 19 additions & 1 deletion content/en/_redirects
@@ -337,4 +337,22 @@ docs/started/requirements/ /docs/started/getting-started/
/docs/components/pipelines/v2/reference/api/kubeflow-pipeline-api-spec/ /docs/components/pipelines/reference/api/kubeflow-pipeline-api-spec/
/docs/components/pipelines/v2/reference/sdk/ /docs/components/pipelines/reference/sdk/
/docs/components/pipelines/v2/run-a-pipeline/ /docs/components/pipelines/user-guides/core-functions/run-a-pipeline/
/docs/components/pipelines/v2/version-compatibility/ /docs/components/pipelines/reference/version-compatibility/

# Kubeflow Trainer V2 (https://github.com/kubeflow/training-operator/issues/2214)
/docs/components/training/installation/ /docs/components/trainer/legacy-v1/installation/
/docs/components/training/explanation/ /docs/components/trainer/legacy-v1/explanation/
/docs/components/training/explanation/fine-tuning/ /docs/components/trainer/legacy-v1/explanation/fine-tuning/
/docs/components/training/reference/ /docs/components/trainer/legacy-v1/reference/
/docs/components/training/reference/architecture/ /docs/components/trainer/legacy-v1/reference/architecture/
/docs/components/training/reference/distributed-training/ /docs/components/trainer/legacy-v1/reference/distributed-training/
/docs/components/training/reference/fine-tuning/ /docs/components/trainer/legacy-v1/reference/fine-tuning/
/docs/components/training/user-guides/ /docs/components/trainer/legacy-v1/user-guides/
/docs/components/training/user-guides/fine-tuning/ /docs/components/trainer/legacy-v1/user-guides/fine-tuning/
/docs/components/training/user-guides/jax/ /docs/components/trainer/legacy-v1/user-guides/jax/
/docs/components/training/user-guides/job-scheduling/ /docs/components/trainer/legacy-v1/user-guides/job-scheduling/
/docs/components/training/user-guides/mpi/ /docs/components/trainer/legacy-v1/user-guides/mpi/
/docs/components/training/user-guides/paddle/ /docs/components/trainer/legacy-v1/user-guides/paddle/
/docs/components/training/user-guides/prometheus/ /docs/components/trainer/legacy-v1/user-guides/prometheus/
/docs/components/training/user-guides/tensorflow/ /docs/components/trainer/legacy-v1/user-guides/tensorflow/
/docs/components/training/user-guides/xgboost/ /docs/components/trainer/legacy-v1/user-guides/xgboost/
@@ -121,8 +121,8 @@ trialSpec:
"sidecar.istio.io/inject": "false"
```

If you use `PyTorchJob` or other Training Operator jobs in your Trial template check
[here](/docs/components/training/user-guides/tensorflow/#what-is-tfjob) how to set the annotation.
If you use `PyTorchJob` or other Training Operator jobs in your Trial template, check
[here](/docs/components/trainer/legacy-v1/user-guides/tensorflow/#what-is-tfjob) how to set the annotation.

## Running the Experiment

@@ -16,13 +16,13 @@ In Katib examples, you can find the following examples for Trial's Workers:

- [Kubernetes `Job`](https://kubernetes.io/docs/concepts/workloads/controllers/job/)

- [Kubeflow `TFJob`](/docs/components/training/user-guides/tensorflow)
- [Kubeflow `TFJob`](/docs/components/trainer/legacy-v1/user-guides/tensorflow)

- [Kubeflow `PyTorchJob`](/docs/components/training/user-guides/pytorch/)
- [Kubeflow `PyTorchJob`](/docs/components/trainer/legacy-v1/user-guides/pytorch/)

- [Kubeflow `XGBoostJob`](/docs/components/training/user-guides/xgboost)
- [Kubeflow `XGBoostJob`](/docs/components/trainer/legacy-v1/user-guides/xgboost)

- [Kubeflow `MPIJob`](/docs/components/training/user-guides/mpi)
- [Kubeflow `MPIJob`](/docs/components/trainer/legacy-v1/user-guides/mpi)

- [Tekton `Pipelines`](https://github.com/kubeflow/katib/tree/master/examples/v1beta1/tekton)

5 changes: 5 additions & 0 deletions content/en/docs/components/trainer/_index.md
@@ -0,0 +1,5 @@
+++
title = "Kubeflow Trainer"
description = "Documentation for Kubeflow Trainer"
weight = 20
+++
@@ -0,0 +1,7 @@
+++
title = "Contributor Guides"
description = "Documentation for Kubeflow Trainer contributors"
weight = 60
+++

This doc is in progress...
@@ -0,0 +1,5 @@
+++
title = "Community Guide"
Contributor:

This is under discussion, given that other components do not have this content on the website. We want to ensure this is consistent across the website.

Contributor:

I created an issue (#3971) to reflect the conversation we had and made a few updates; feel free to make any suggestions. The main idea is to not have individual pages for each project on the website, but to keep one centralized place on the website with links to the Git repos.

description = "How to get involved to Kubeflow Trainer community"
weight = 20
+++
@@ -0,0 +1,7 @@
+++
Contributor:

This is under discussion, given that other components do not have this content on the website. We want to ensure this is consistent across the website.

title = "Contributing Guide"
description = "How to contribute to Kubeflow Trainer project"
weight = 10
+++

This doc is in progress...
29 changes: 29 additions & 0 deletions content/en/docs/components/trainer/getting-started.md
@@ -0,0 +1,29 @@
+++
title = "Getting Started"
description = "Get Started with Kubeflow Trainer"
weight = 30
+++

This guide describes how to get started with Kubeflow Trainer and run distributed training
with PyTorch.

## Prerequisites

Ensure that you have access to a Kubernetes cluster with Kubeflow Trainer
control plane installed. If it is not set up yet, followÍ
Contributor:

Suggested change:
- control plane installed. If it is not set up yet, followÍ
+ control plane installed. If it is not set up yet, follow

[the installation guide](/docs/components/trainer/operator-guides/installation) to quickly deploy
Kubeflow Trainer on your local Kind cluster.
Contributor @astefanutti (Jan 21, 2025):

Suggested change:
- Kubeflow Trainer on your local Kind cluster.
+ Kubeflow Trainer.

It may be better to just say "quickly deploy Kubeflow Trainer". The provided link gives the specifics.


### Installing the Kubeflow Python SDK

Install the latest Kubeflow Python SDK version directly from the source repository:

```bash
pip install git+https://github.com/kubeflow/training-operator.git@master#subdirectory=sdk_v2
```

TODO (andreyvelich): Add command once we release SDK to PyPI: https://pypi.org/project/kubeflow

## Getting Started with PyTorch

TODO (andreyvelich): Add example from the Notebook
Contributor:

What about "This doc is in progress", or just remove the section until it is fully ready?

Member Author:
I will add the getting started example once we finish this PR with @astefanutti: kubeflow/training-operator#2387

12 changes: 12 additions & 0 deletions content/en/docs/components/trainer/legacy-v1/_index.md
@@ -0,0 +1,12 @@
+++
title = "Legacy Kubeflow Training Operator (v1)"
description = "Kubeflow Training Operator V1 Documentation"
weight = 999
+++

{{% alert title="Old Version" color="warning" %}}
This page is about **Kubeflow Training Operator V1**. For the latest information, check
[the Kubeflow Trainer V2 documentation](/docs/components/trainer).

Follow [this guide for migrating to Kubeflow Trainer V2](/docs/components/trainer/operator-guides/migration)
Contributor:

My two cents: given that the component's name changed, should we say V2, or just Kubeflow Trainer?

Member Author:

Maybe we could just say: "Follow this guide for migrating to the new Kubeflow Trainer project."
WDYT @varodrig @kubeflow/wg-training-leads?

{{% /alert %}}
@@ -10,7 +10,7 @@ share your experience using the [#kubeflow-training Slack channel](/docs/about/c
or [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
{{% /alert %}}

This page explains how the [Training Operator fine-tuning API](/docs/components/training/user-guides/fine-tuning)
This page explains how the [Training Operator fine-tuning API](/docs/components/trainer/legacy-v1/user-guides/fine-tuning)
fits into the Kubeflow ecosystem.

In the rapidly evolving landscape of machine learning (ML) and artificial intelligence (AI),
@@ -60,4 +60,4 @@ Different user personas can benefit from this feature:

## Next Steps

- Understand [the architecture behind `train` API](/docs/components/training/reference/fine-tuning).
- Understand [the architecture behind `train` API](/docs/components/trainer/legacy-v1/reference/fine-tuning).
@@ -10,8 +10,8 @@ This guide describes how to get started with the Training Operator and run a few

You need to install the following components to run examples:

- The Training Operator control plane [installed](/docs/components/training/installation/#installing-the-control-plane).
- The Training Python SDK [installed](/docs/components/training/installation/#installing-the-python-sdk).
- The Training Operator control plane [installed](/docs/components/trainer/legacy-v1/installation/#installing-the-control-plane).
- The Training Python SDK [installed](/docs/components/trainer/legacy-v1/installation/#installing-the-python-sdk).

## Getting Started with PyTorchJob

@@ -153,6 +153,6 @@ TrainingClient().get_job_logs(
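
For orientation while the full example is collapsed above, here is a minimal sketch of the flow this page builds up to — names and parameters are assumed from the V1 `kubeflow-training` SDK, not taken verbatim from this diff:

```python
from kubeflow.training import TrainingClient

def train_func():
    # Placeholder training function; the real guide trains a PyTorch model here.
    import torch
    print(f"PyTorch version: {torch.__version__}")

client = TrainingClient()  # assumes kubeconfig access to the cluster

# Package the function as a PyTorchJob and scale it across two workers.
client.create_job(name="pytorch-demo", train_func=train_func, num_workers=2)

# Fetch the logs from the job's pods (the call collapsed in the hunk above).
client.get_job_logs(name="pytorch-demo")
```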

## Next steps

- Run the [FashionMNIST example](https://github.com/kubeflow/training-operator/blob/7345e33b333ba5084127efe027774dd7bed8f6e6/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb) using the Training Operator Python SDK.
- Run the [FashionMNIST example](https://github.com/kubeflow/training-operator/blob/release-1.9/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb) using the Training Operator Python SDK.

- Learn more about [the PyTorchJob APIs](/docs/components/training/user-guides/pytorch/).
- Learn more about [the PyTorchJob APIs](/docs/components/trainer/legacy-v1/user-guides/pytorch/).
Contributor:

In "Installing the control plane", where it says "if you have already installed Kubeflow platform", the link goes to the latest version of Kubeflow. Should we replace the link with the 1.9 release?
https://v1-9-branch.kubeflow.org/docs/started/installing-kubeflow/

Member Author:
Actually, the latest version of Kubeflow Platform 1.10 will also include Training Operator v1.

Contributor:

In the Next Steps section, the link "Run your first Training Operator Job by following the Getting Started guide." points to the latest Getting Started guide instead of v1.

@@ -12,8 +12,8 @@ appropriate Kubernetes workloads to perform distributed ML training and fine-tuning.

These are the minimal requirements to install the Training Operator:

- Kubernetes >= 1.27
- `kubectl` >= 1.27
- Kubernetes >= 1.28
- `kubectl` >= 1.28
- Python >= 3.7

## Installing the Training Operator
@@ -65,7 +65,7 @@ xgboostjobs.kubeflow.org 2023-06-09T00:31:04Z
### Installing the Python SDK

The Training Operator [implements a Python SDK](https://pypi.org/project/kubeflow-training/)
to simplify creation of distributed training and fine-tuning jobs for Data Scientists.
to simplify creation of distributed training and fine-tuning jobs.

Run the following command to install the latest stable release of the Training SDK:

@@ -96,4 +96,4 @@ pip install -U "kubeflow-training[huggingface]"

## Next steps

Run your first Training Operator Job by following the [Getting Started guide](/docs/components/training/getting-started/).
Run your first Training Operator Job by following the [Getting Started guide](/docs/components/trainer/legacy-v1/getting-started/).
@@ -24,9 +24,9 @@ The Training Operator implements a centralized Kubernetes controller to orchestrate distributed training jobs.
You can run high-performance computing (HPC) tasks with the Training Operator and MPIJob since it
supports running Message Passing Interface (MPI) on Kubernetes, which is heavily used for HPC.
The Training Operator implements the V1 API version of MPI Operator. For the MPI Operator V2 version,
please follow [this guide](/docs/components/training/user-guides/mpi/) to install MPI Operator V2.
please follow [this guide](/docs/components/trainer/legacy-v1/user-guides/mpi/) to install MPI Operator V2.

<img src="/docs/components/training/images/training-operator-overview.drawio.svg"
<img src="/docs/components/trainer/legacy-v1/images/training-operator-overview.drawio.svg"
alt="Training Operator Overview"
class="mt-3 mb-3">

@@ -38,7 +38,7 @@ various distributed training strategies for different ML frameworks.
The Training Operator addresses the Model Training and Model Fine-Tuning steps in the AI/ML
lifecycle, as shown in the diagram below:

<img src="/docs/components/training/images/ml-lifecycle-training-operator.drawio.svg"
<img src="/docs/components/trainer/legacy-v1/images/ml-lifecycle-training-operator.drawio.svg"
alt="AI/ML Lifecycle Training Operator"
class="mt-3 mb-3">

@@ -49,7 +49,7 @@ Kubernetes cluster using APIs and interfaces provided by Training Operator.

- **The Training Operator is extensible and portable.**

You can deploy Training Operator on any cloud where you have Kubernetes cluster and you can
You can deploy the Training Operator on any cloud where you have Kubernetes cluster and you can
integrate your own ML frameworks written in any programming language with the Training Operator.

- **The Training Operator is integrated with the Kubernetes ecosystem.**
@@ -63,17 +63,17 @@ To perform distributed training, the Training Operator implements the following
[Custom Resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
for each ML framework:

| ML Framework | Custom Resource |
| ------------ | ------------------------------------------------------------ |
| PyTorch | [PyTorchJob](/docs/components/training/user-guides/pytorch/) |
| TensorFlow | [TFJob](/docs/components/training/user-guides/tensorflow/) |
| XGBoost | [XGBoostJob](/docs/components/training/user-guides/xgboost/) |
| MPI | [MPIJob](/docs/components/training/user-guides/mpi/) |
| PaddlePaddle | [PaddleJob](/docs/components/training/user-guides/paddle/) |
| JAX | [JAXJob](/docs/components/training/user-guides/jax/) |
| ML Framework | Custom Resource |
| ------------ | --------------------------------------------------------------------- |
| PyTorch | [PyTorchJob](/docs/components/trainer/legacy-v1/user-guides/pytorch/) |
| TensorFlow | [TFJob](/docs/components/trainer/legacy-v1/user-guides/tensorflow/) |
| XGBoost | [XGBoostJob](/docs/components/trainer/legacy-v1/user-guides/xgboost/) |
| MPI | [MPIJob](/docs/components/trainer/legacy-v1/user-guides/mpi/) |
| PaddlePaddle | [PaddleJob](/docs/components/trainer/legacy-v1/user-guides/paddle/) |
| JAX | [JAXJob](/docs/components/trainer/legacy-v1/user-guides/jax/) |
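
As a rough sketch of how these resources are reached from the Python SDK — parameter names assumed from the V1 `kubeflow-training` SDK, not stated on this page — one client covers all of the job kinds listed above:

```python
from kubeflow.training import TrainingClient

# Sketch: `job_kind` selects which custom resource the client manages;
# PyTorchJob is the SDK's default.
client = TrainingClient(job_kind="TFJob")
print(client.list_jobs())  # TFJobs in the current namespace
```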

## Next steps

- Follow [the installation guide](/docs/components/training/installation/) to deploy the Training Operator.
- Follow [the installation guide](/docs/components/trainer/legacy-v1/installation/) to deploy the Training Operator.

- Run examples from [getting started guide](/docs/components/training/getting-started/).
- Run examples from [getting started guide](/docs/components/trainer/legacy-v1/getting-started/).
@@ -18,14 +18,15 @@ The dedicated "Backend" operator was not implemented and was instead
consolidated into the "Frontend" operator.

The benefits of this approach were:

1. Shared testing and release infrastructure
2. Unlocked production grade features like manifests and metadata support
3. Simpler Kubeflow releases
4. A Single Source of Truth (SSOT) for other Kubeflow components to interact with

The V1 Training Operator architecture diagram can be seen in the diagram below:

<img src="/docs/components/training/images/training-operator-v1-architecture.drawio.svg"
<img src="/docs/components/trainer/legacy-v1/images/training-operator-v1-architecture.drawio.svg"
alt="Training Operator V1 Architecture"
class="mt-3 mb-3">

@@ -11,7 +11,7 @@ This page shows different distributed strategies that can be used by the Training Operator.
This diagram shows how the Training Operator creates PyTorch workers for the
[ring all-reduce algorithm](https://tech.preferred.jp/en/blog/technologies-behind-distributed-deep-learning-allreduce/).

<img src="/docs/components/training/images/distributed-pytorchjob.drawio.svg"
<img src="/docs/components/trainer/legacy-v1/images/distributed-pytorchjob.drawio.svg"
alt="Distributed PyTorchJob"
class="mt-3 mb-3">

@@ -34,7 +34,7 @@ the appropriate environment variables for `torchrun`.
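
For reference, each PyTorchJob replica receives the standard PyTorch distributed environment, which `torchrun` or plain training code reads roughly as in this sketch (variable names follow upstream PyTorch conventions; the snippet is illustrative, not from the page):

```python
import os

# Injected into every PyTorchJob replica so the workers can form the ring.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
master_addr = os.environ["MASTER_ADDR"]
master_port = os.environ["MASTER_PORT"]
print(f"worker {rank}/{world_size} rendezvous at {master_addr}:{master_port}")
```
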
This diagram shows how the Training Operator creates the TensorFlow parameter server (PS) and workers for
[PS distributed training](https://www.tensorflow.org/tutorials/distribute/parameter_server_training).

<img src="/docs/components/training/images/distributed-tfjob.drawio.svg"
<img src="/docs/components/trainer/legacy-v1/images/distributed-tfjob.drawio.svg"
alt="Distributed TFJob"
class="mt-3 mb-3">

@@ -5,13 +5,13 @@ weight = 10
+++

This page shows how the Training Operator implements the
[API to fine-tune LLMs](/docs/components/training/user-guides/fine-tuning).
[API to fine-tune LLMs](/docs/components/trainer/legacy-v1/user-guides/fine-tuning).

## Architecture

In the following diagram you can see how the `train` Python API works:

<img src="/docs/components/training/images/fine-tune-llm-api.drawio.svg"
<img src="/docs/components/trainer/legacy-v1/images/fine-tune-llm-api.drawio.svg"
Contributor:

It'd be great if we can update the links to the Kubernetes documentation to match the Kubernetes version supported by the legacy V1, e.g. https://v1-28.docs.kubernetes.io/docs/concepts/storage/persistent-volumes/ for the ReadOnlyMany access mode.

Member Author:

Do we really need to do it here? I think Access Mode has been a stable feature in Kubernetes since v1.18, so we don't expect any changes to this section. And since future releases of the Training Operator (e.g. v1.9.2) may support newer versions of Kubernetes, it would be hard to keep these links updated.

alt="Fine-Tune API for LLMs"
class="mt-3 mb-3">

@@ -10,15 +10,15 @@ share your experience using the [#kubeflow-training Slack channel](https://cloud
or the [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
{{% /alert %}}

This page describes how to use a [`train` API from the Training Python SDK](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/training/api/training_client.py#L112)
This page describes how to use a [`train` API from the Training Python SDK](https://github.com/kubeflow/training-operator/blob/release-1.9/sdk/python/kubeflow/training/api/training_client.py#L95)
that simplifies the ability to fine-tune LLMs with distributed PyTorchJob workers.

If you want to learn more about how the fine-tuning API fits in the Kubeflow ecosystem, head to
the [explanation guide](/docs/components/training/explanation/fine-tuning).
the [explanation guide](/docs/components/trainer/legacy-v1/explanation/fine-tuning).

## Prerequisites

You need to install the Training Python SDK [with fine-tuning support](/docs/components/training/installation/#install-the-python-sdk-with-fine-tuning-capabilities)
You need to install the Training Python SDK [with fine-tuning support](/docs/components/trainer/legacy-v1/installation/#install-the-python-sdk-with-fine-tuning-capabilities)
to run this API.

## How to use the Fine-Tuning API?
Expand Down Expand Up @@ -92,6 +92,7 @@ to fine-tune the LLM.
Platform engineers can customize the storage initializer and trainer images by setting the `STORAGE_INITIALIZER_IMAGE` and `TRAINER_TRANSFORMER_IMAGE` environment variables before executing the `train` command.

For example, in your Python code, set the env vars before executing `train`:

```python
...
os.environ['STORAGE_INITIALIZER_IMAGE'] = 'docker.io/<username>/<custom-storage-initializer_image>'
os.environ['TRAINER_TRANSFORMER_IMAGE'] = 'docker.io/<username>/<custom-trainer_transformer_image>'
...
```

@@ -102,9 +102,9 @@ TrainingClient().train(...)
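
For context, the collapsed `train(...)` call typically looks something like the sketch below; the HuggingFace provider classes and parameter names are assumptions drawn from the V1 SDK's fine-tuning extras, not from this diff:

```python
import transformers
from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

# Sketch: fine-tune a small HuggingFace model across two PyTorchJob workers.
TrainingClient().train(
    name="fine-tune-demo",
    num_workers=2,
    num_procs_per_worker=1,
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        transformer_type=transformers.AutoModelForCausalLM,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="imdb",  # any HuggingFace dataset repo id
    ),
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(output_dir="/mnt/output"),
    ),
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "16Gi"},
)
```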

## Next Steps

- Run the example to [fine-tune the TinyLlama LLM](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/examples/pytorch/language-modeling/train_api_hf_dataset.ipynb)
- Run the example to [fine-tune the TinyLlama LLM](https://github.com/kubeflow/training-operator/blob/release-1.9/examples/pytorch/language-modeling/train_api_hf_dataset.ipynb)

- Check this example to compare the `create_job` and the `train` Python API for
[fine-tuning BERT LLM](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/examples/pytorch/text-classification/Fine-Tune-BERT-LLM.ipynb).
[fine-tuning BERT LLM](https://github.com/kubeflow/training-operator/blob/release-1.9/examples/pytorch/text-classification/Fine-Tune-BERT-LLM.ipynb).

- Understand [the architecture behind `train` API](/docs/components/training/reference/fine-tuning).
- Understand [the architecture behind `train` API](/docs/components/trainer/legacy-v1/reference/fine-tuning).