Generate a YAML Component Spec from a Python DSL component function #3748
Thanks @sm-hawkfish |
Interesting topic. We were discussing internally whether YAML or decorated Python functions are a better way to provide a shared library of components. We are running into a couple of challenges with YAML files:
On the other hand, as said by others, consuming the YAML files from a code repository is much easier than consuming them from e.g. a Python package in PyPI. A feature that generates the YAML specifications from Python functions would potentially offer the flexibility of Python functions and the ease of consumption of text files. Also, CI/CD would be very flexible. So definitely interested. |
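To make the consumption point concrete, here is a minimal sketch (the URL and file paths are placeholders, not real artifacts) of how a shared component.yaml can be pulled into a pipeline with the existing SDK loaders:

```python
import kfp.components as comp

# Load a component.yaml published in a git repository (placeholder URL).
train_op = comp.load_component_from_url(
    "https://raw.githubusercontent.com/example-org/components/master/train/component.yaml"
)

# Or load it from a local checkout of the shared library (placeholder path).
train_op_local = comp.load_component_from_file("components/train/component.yaml")

# Either loader returns a factory whose call signature mirrors the component's
# inputs, so it can be used directly inside a @kfp.dsl.pipeline function.
```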
Thanks for summarizing all of these in an issue! /cc @Ark-kun @numerology |
Commenting on some of the statements for now.
I think that the component.yaml is the minimal description of the component. Your example shows that it has fewer lines than the python version. It also supports more features than the legacy dsl.component decorator.
Do you have any use for those OpenAPI schemas? Why would you use Components vs. ContainerOp?
|
I think we already support the feature that you want. The structures in kfp.components.structures can be used to define a component in Python and save it as a component.yaml file:

from kfp.components.structures import *
component_spec = ComponentSpec(
name='Dummy op',
description='Dummy component for illustrative purposes',
inputs=[
InputSpec(name='input1', type='String'),
InputSpec(name='input2', type='GCSPath'),
],
outputs=[
OutputSpec(name='output1', type='GCSPath'),
],
implementation=ContainerImplementation(container=ContainerSpec(
image="dummy-image",
command=[
"python", "runner.py",
"--input1", InputValuePlaceholder('input1'),
"--input2", InputPathPlaceholder('input2'),
"--output1", OutputPathPlaceholder('output1'),
],
))
)
component_spec.save('dummy.component.yaml')

What do you think? |
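For completeness, a short sketch of the other half of the round trip: once the spec above has been saved, the YAML file can be loaded back into a task factory and used in a pipeline definition.

```python
import kfp
from kfp import components

# Load the component.yaml produced by component_spec.save(...) above.
dummy_op = components.load_component_from_file("dummy.component.yaml")

@kfp.dsl.pipeline(name="Dummy pipeline")
def dummy_pipeline(input1: str = "hello", input2: str = "gs://bucket/file.txt"):
    # The factory's parameters are named after the component's inputs.
    dummy_op(input1=input1, input2=input2)
```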
Thank you for the quick and detailed reply @Ark-kun! The code snippet you provided answers my initial question, although I now have a follow-up :)

To take a step back, a large part of my motivation for wanting to write the Components in Python is so that I could extend the types in kfp.dsl.types. Additionally, tools like pydantic would allow us to easily define complex types, so that we could pass an object of related Pipeline params to the pipeline function, rather than denormalizing them into lots of Pipeline params with primitive types. For example, suppose I want to create a component to launch a Katib Hyperparameter tuning job: it seems convenient to have a single pipeline parameter for the whole objective spec rather than one primitive parameter per field.

I was actually just preparing some code snippets to open a separate Github Issue demonstrating how pydantic models could be used with the KFP DSL:

from typing import List
from typing import NamedTuple
import kfp
from kfp import dsl
from pydantic import BaseModel
from pydantic import Field
class SimpleParam(BaseModel):
field1: str
field2: float
class ComplexParam(BaseModel):
field1: str = Field(..., regex="^gs://.*$")
field2: float = Field(..., ge=0, le=1)
field3: List[int]
Component1Outputs = NamedTuple("Component1Outputs", [("output", SimpleParam)])
@dsl.component
def component1(arg: ComplexParam) -> Component1Outputs:
output = "/tmp/python_dummy_op/output1_path"
return dsl.ContainerOp(
name="Dummy op",
image="dummy-image",
command=["python3", "runner.py", "--input", arg, "--output", output],
file_outputs={"output": output},
)
@dsl.component
def component2(arg1: ComplexParam, arg2: SimpleParam):
return dsl.ContainerOp(
name="Dummy op",
image="dummy-image",
command=["python3", "runner.py", "--input1", arg1, "--input2", arg2],
)
@dsl.pipeline()
def my_pipeline(arg: ComplexParam):
component1_task = component1(arg=arg)
component2(arg1=arg, arg2=component1_task.outputs["output"])
if __name__ == "__main__":
arg = ComplexParam(
field1="gs://my_bucket/hello_world.txt", field2=0.5, field3=["1", "2"]
)
client = kfp.Client()
run = client.create_run_from_pipeline_func(
my_pipeline, arguments={"arg": arg}, run_name="Test custom types"
    )

I removed some hacky code from the above example that makes the pydantic Models backwards compatible with the KFP DSL.
There are still some open questions:
The reason I said your response prompted a follow-up question is that it does not look like InputSpec supports these complex type specifications. |
Hi @Ark-kun, I know that you are very busy with other Issues and Pull Requests, but I want to keep this on your radar. Are there ideas or concepts in my previous post that you would be interested in discussing further? |
I want to note some core KFP aspects:

Low level = command-line programs, strings and files

On the low conceptual level, KFP orchestrates containerized command-line programs. The "command-line programs" part is important. It helps users understand the limitations and the solutions. KFP does not orchestrate Python or Java classes. KFP does not pass in-memory objects between running programs. KFP passes data, serialized as bytes or strings.

Portability

KFP needs to be portable, language-agnostic and platform-agnostic. The users can still use Python-specific serialization formats like Pickle, but they should understand that this has negative portability implications: a Java-based downstream component won't be able to read the pickled data.

Components

KFP components are described by the ComponentSpec class and the corresponding component.yaml serialization. This is the source of truth regarding components. All other component creation methods build on that. Any new high-level component feature should be built on top of that structure. The structure is pretty flexible, so that is usually not a problem. For example, python-based components are still built on top of ComponentSpec and ContainerSpec.

Types

Let's start with the untyped world. Components exchange pieces of data blobs. Why would the user want to specify the types for their component inputs and outputs? I see several reasons:
Types in KFP components

KFP components support type specifications. The type specification is essentially a dict (and the values can also be strings, dicts or lists). The system is very flexible and allows specifying arbitrary user types. (You should not confuse types and objects.)

Compile-time reference argument compatibility checking

KFP already supports this. If both the input and the argument have types, then the system immediately checks that the types are the same when you pass an output to an input. There is no need for any additional tools. Currently the type compatibility check simply compares the two type specifications (dicts).

Compile-time constant argument value checking

KFP has some support for this. There is a limited set of types (str/String, int/Integer, float/Float, bool/Boolean, list/JsonArray, dict/JsonObject) for which the constant argument values are checked against the input type and serialized automatically. Values of other types must be manually serialized before passing them to the component. |
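As a small illustration of the "type specification is essentially a dict" point above (the type name and its properties here are made up), the structures API accepts nested dicts as types, and the compile-time check simply compares these dicts between a connected output and input:

```python
from kfp.components.structures import InputSpec, OutputSpec

# A made-up nested type specification: the key is the type name, the value
# holds arbitrary properties. The compile-time check compares these dicts.
objective_type = {"ObjectiveSpec": {"data_type": "JSON"}}

upstream_output = OutputSpec(name="objective", type=objective_type)
downstream_input = InputSpec(name="objective", type=objective_type)
```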
I think that you do not need kfp.dsl.types to declare your own custom types.
KFP already validates the types even before submission. If both the input and the output are typed, when you try to pass an output to an incompatible input, you'll get an error. No need to integrate any external tools.
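A sketch of what that looks like in practice (the component definitions are illustrative): the producer's output type and the consumer's input type below differ, so the SDK flags the connection while the pipeline is being compiled, before anything is submitted.

```python
import kfp
from kfp import components

producer = components.load_component_from_text("""
name: Produce objective
outputs:
- {name: objective, type: String}
implementation:
  container:
    image: dummy-image
    command: [produce, {outputPath: objective}]
""")

consumer = components.load_component_from_text("""
name: Consume objective
inputs:
- {name: objective, type: ObjectiveSpec}
implementation:
  container:
    image: dummy-image
    command: [consume, {inputValue: objective}]
""")

@kfp.dsl.pipeline(name="Type check demo")
def demo_pipeline():
    # String (producer output) vs. ObjectiveSpec (consumer input): incompatible.
    consumer(objective=producer().outputs["objective"])

# The type check runs while the pipeline function is compiled,
# so the mismatch is reported before anything reaches the cluster.
kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.tar.gz")
```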
I think this is a misconception. Just because you can use some python class as a KFP type, it does not mean you can just pass an object of that class to some component. KFP orchestrates containerized command-line programs. You cannot pass in-memory objects. At some point they must be serialized and sent over the network as bytes and then maybe deserialized by some code. In the KFP team we try to keep the API surface of the SDK minimal, so we only support automatic serialization of 6 primitive types. Everything else must be serialized by the pipeline code and deserialized by the component code. Remember that KFP runs arbitrary containerized command-line programs. In general the containers do not have the KFP SDK or even Python installed. A Java program won't automatically understand a Python memory object.
Sure, the component code can already do whatever it wants, and the component author can specify any structure describing the type. The SDK does not peek into that, though. Type specifications are opaque identifiers and containers are black boxes.
You can easily do that. Declare a single input. Optionally, give it some type like |
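A sketch of that single-input suggestion using the structures API (JsonObject is chosen here purely for illustration): the entire objective spec travels as one JSON-serialized value.

```python
from kfp.components.structures import (
    ComponentSpec, ContainerImplementation, ContainerSpec,
    InputSpec, InputValuePlaceholder,
)

katib_component = ComponentSpec(
    name="Katib hyperparameter tuning",
    inputs=[
        # One input carries the whole objective spec as a JSON object.
        InputSpec(name="objective_spec", type="JsonObject"),
    ],
    implementation=ContainerImplementation(container=ContainerSpec(
        image="dummy-image",
        command=[
            "python", "-m", "src.component",
            "--objective-spec", InputValuePlaceholder("objective_spec"),
        ],
    )),
)
katib_component.save("katib.component.yaml")
```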
I think it can. You can convert your complex type specifications to a JSON-like structure and use it directly. Even if InputSpec only supported a type name (a single string), you could still serialize an arbitrary type specification to that string (e.g. using JSON).

Comments on some of the samples:
So, what is the resolved command-line supposed to be?
I'm writing a Java-based component. How can I read the output of your component?
Yes. These are the questions that have shaped the SDK's API surface regarding types. This is why the SDK only supports serialization of 6 primitive types and everything else is the responsibility of the pipeline and component authors.
Serialization is custom code. There are only two places where custom code is executed - inside the launched containers in the cloud and on the pipeline author's/submitter's machine. Since the serialization must occur before the container can be launched, this only leaves the pipeline author's/submitter's machine. The complex objects must be serialized before the pipeline can be compiled or submitted. And this is what the SDK expects at this moment.
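To make that concrete with the earlier pydantic example (a sketch that assumes the ComplexParam model and my_pipeline from the snippet above): the object is turned into a JSON string on the submitter's machine, and only that string crosses the wire.

```python
import kfp

# ComplexParam and my_pipeline are the ones defined in the earlier snippet.
arg = ComplexParam(field1="gs://my_bucket/hello_world.txt", field2=0.5, field3=[1, 2])

client = kfp.Client()
run = client.create_run_from_pipeline_func(
    my_pipeline,
    # Manual serialization: a plain JSON string, not an in-memory object.
    arguments={"arg": arg.json()},
    run_name="Serialize before submit",
)
```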
There were some plans for this. It was the reason why those |
@sm-hawkfish What do you think? |
Hi @Ark-kun, I apologize for the extended delay -- I wanted to make sure I had a chance to review your comments and go through the code-base in more detail before responding.

For one, I have come to agree with you that using ComponentSpec and the component.yaml serialization as the source of truth is the right approach. To that end, I found the existing python decorator for lightweight components to be very inspirational: as you well know, it builds a full ComponentSpec from a plain Python function. For all of its benefits, there are a few drawbacks to the lightweight-component approach for our use case:
In an attempt to supercharge these lightweight components, I have taken as an example a KFP Component that submits a Katib Hyperparameter Tuning experiment:

Implementing the Component

We create a file src/katib_specifications.py containing Pydantic models that mirror the Katib specs:

from enum import Enum
from typing import List
from pydantic import BaseModel
class ObjectiveType(str, Enum):
unknown = ""
minimize = "minimize"
maximize = "maximize"
class ObjectiveSpec(BaseModel):
type: ObjectiveType
goal: float = None
objectiveMetricName: str
    additionalMetricNames: List[str] = None

There are many more specs, but this should give a feel for the syntax. From there, we create src/component.py with the component's entrypoint:

from typing import Dict
from typing import List
from typing import NamedTuple
from typing import Union
import typer
from .katib_specifications import ObjectiveSpec
class Outputs(NamedTuple):
    best_hyperparameters: Dict[str, Union[str, float, int]]
def katib_hyperparameter_tuning(
data_dir: str = typer.Option(
..., help="The GCS directory containing training data"
),
objective_spec: ObjectiveSpec = typer.Option(
...,
help="The Katib Objective to optimize",
),
) -> Outputs:
"""
    Distributed Hyperparameter Tuning (and Neural Architecture Search) using Kubeflow Katib. The
upstream documentation on available hyperparameter search algorithms is available here:
https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#search-algorithms-in-detail
"""
# Implementation - `objective_spec` can be used like a namedtuple or turned into a dictionary via objective_spec.dict()
best_hyperparameters = {"max_depth": 4}
return best_hyperparameters
if __name__ == "__main__":
    typer.run(katib_hyperparameter_tuning)

The idea is that Typer can create a CLI for you automatically using the type annotations in the function. The input provided to the component will be JSON strings, so the KFP infrastructure doesn't need to know or care about this:

python -m src.component --data-dir gs://my-bucket/my-training-data/ --objective-spec '{"type": "maximize", "objectiveMetricName": "roc_auc", "additionalMetricNames": ["accuracy"]}'

And Typer will also do the work of casting the inputs into the types declared by the annotations. In the case above, something like this would be run by Typer behind the scenes:

objective_spec = ObjectiveSpec.parse_raw(
'{"type": "maximize", "objectiveMetricName": "roc_auc", "additionalMetricNames": ["accuracy"]}'
)

I am cheating here a little bit, as I did need to make slight modifications to Typer in order to parse these JSON inputs into Pydantic models, which is outlined in this issue.

Generating the Component Spec

Since Pydantic Models can output an OpenAPI schema, I just needed to make some adjustments to extract_component_interface and related helpers so that the schema is embedded in the component specification. Here's an example of how to get the schema:

import json

from .katib_specifications import ObjectiveSpec
print(json.dumps(ObjectiveSpec.schema(), indent=2))
{
"title": "ObjectiveSpec",
"type": "object",
"properties": {
"type": {
"$ref": "#/definitions/ObjectiveTypes"
},
"goal": {
"title": "Goal",
"type": "number"
},
"objectiveMetricName": {
"title": "Objectivemetricname",
"type": "string"
},
"additionalMetricNames": {
"title": "Additionalmetricnames",
"type": "array",
"items": {
"type": "string"
}
}
},
"required": [
"type",
"objectiveMetricName"
],
"definitions": {
"ObjectiveTypes": {
"title": "ObjectiveTypes",
"description": "An enumeration.",
"enum": [
"",
"minimize",
"maximize"
],
"type": "string"
}
}
}

The end result is the following component specification:

name: Katib hyperparameter tuning
description: |-
  Distributed Hyperparameter Tuning (and Neural Architecture Search) using Kubeflow Katib. The
upstream documentation on available hyperparameter search algorithms is available here:
https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#search-algorithms-in-detail
inputs:
- name: data_dir
type:
String:
openapi_schema_validator: {type: string}
description: The GCS directory containing training data
- name: objective_spec
type:
src.katib_specifications.ObjectiveSpec:
openapi_schema_validator:
type: object
properties:
type: {$ref: '#/definitions/ObjectiveTypes'}
goal: {title: Goal, type: number}
objectiveMetricName: {title: Objectivemetricname, type: string}
additionalMetricNames:
title: Additionalmetricnames
type: array
items: {type: string}
required: [type, objectiveMetricName]
definitions:
ObjectiveTypes:
title: ObjectiveTypes
description: An enumeration.
enum: ['', minimize, maximize]
type: string
description: The Katib Objective to optimize.
outputs:
- name: best_hyperparameters
type:
Dict[str, Union[str, float, int]]:
openapi_schema_validator:
type: object
additionalProperties:
anyOf:
- {type: string}
- {type: number}
- {type: integer}
implementation:
container:
image: DUMMY_IMAGE
args:
- --data-dir
- {inputValue: data_dir}
- {inputValue: objective_spec}
- '----output-paths'
    - {outputPath: best_hyperparameters}

You can see I kept the openapi_schema_validator key nested under each type name.

Using the Component in a Pipeline

Generating the OpenAPI schema has a couple of nice benefits:

Auto-Generated Documentation

I wrote a very simple script to parse the generated component specifications and render documentation from the embedded schemas. This provides a nice reference to data scientists who are getting familiar with the inputs that each component expects.

Pipeline Compile-time validation

In addition, I wrote a small function, validate_pipeline, that validates the arguments passed to each component against its OpenAPI schema. The validation code looks something like:

from typing import Any

from kfp.components.structures import InputSpec
from openapi_schema_validator import OAS30Validator
def validate_component_input(input_spec: InputSpec, input_arg: Any):
schema = list(input_spec.type.values())[0]["openapi_schema_validator"]
validator = OAS30Validator(schema)
    validator.validate(input_arg)

This allows data scientists to get feedback on their pipeline arguments at compile time (note that the objective spec type below is "maximization" instead of "maximize"):

import kfp
from kfp.components import ComponentStore  # assuming the SDK's ComponentStore
KATIB_OBJECTIVE_SPEC = {
"type": "maximization",
"objectiveMetricName": "roc_auc",
"additionalMetricNames": ["accuracy"],
}
component_store = ComponentStore()
hyperparameter_op = component_store.load_component("hyperparameter_tuning")
# Define a pipeline and create a task from a component:
@kfp.dsl.pipeline(
name="Train Model", description="Train model",
)
def my_pipeline(
data_dir=DATA_DIR,
katib_objective_spec=KATIB_OBJECTIVE_SPEC
):
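    # NOTE: DATA_DIR and data_task are assumed to come from parts of the pipeline
    # omitted from this snippet (e.g. an upstream data-preparation step).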
hyperparameter_op(
data_dir=data_task.output,
objective_spec=katib_objective_spec
)
if __name__ == "__main__":
validate_pipeline(my_pipeline)
    kfp.compiler.Compiler().compile(my_pipeline, "/tmp/pipeline.tar.gz")

When the user runs this script, they will get:

jsonschema.exceptions.ValidationError: 'maximization' is not one of ['', 'minimize', 'maximize']
Failed validating 'enum' in schema['properties']['type']:
OrderedDict([('title', 'ObjectiveTypes'),
('description', 'An enumeration.'),
('enum', ['', 'minimize', 'maximize']),
('type', 'string'),
('nullable', False)])
On instance['type']:
    'maximization'

And they can easily correct the typo before submitting the pipeline run to the cluster. As stated above, I would be very happy to go into more detail on any of the above steps if you find aspects of this approach interesting. And congratulations on the 1.0.0 release! |
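As a footnote to the auto-documentation idea above, a sketch (the file name is assumed) of how a downstream client could read the embedded schemas back out of the generated component.yaml without any KFP-specific tooling:

```python
import yaml

with open("katib_hyperparameter_tuning.component.yaml") as f:
    spec = yaml.safe_load(f)

for inp in spec["inputs"]:
    # Each input's type is a one-key dict: {type_name: {"openapi_schema_validator": {...}}}
    type_name, type_body = next(iter(inp["type"].items()))
    schema = type_body["openapi_schema_validator"]
    print(f"{inp['name']}: {type_name} ({schema.get('type')})")
```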
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
/freeze |
/lifecycle frozen |
Hi everyone,
Using the kfp.dsl.component decorator to define KFP Components is a great user experience, since it leverages standard features of the Python language like type annotations and docstring inspection. By contrast, the YAML Component specification requires engineers to learn a new syntax and is generally more verbose than the Python equivalent (especially when it comes to writing custom type definitions using the OpenAPI Schema).
That said, one advantage of the YAML spec that was raised in the Slack channel is around the ease of distributing a readily parseable file format to a variety of different client applications.
Since both formats have their pros and cons, the purpose of this issue is to discuss the value and feasibility of enhancing the KFP Python SDK to support generating the YAML specification from the Python DSL component definition. This would give users the best of both worlds by allowing them to define components comfortably in Python and to ship Component specifications to downstream clients in YAML.
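It is worth noting that, for lightweight Python function components specifically, the SDK can already emit the YAML spec from a Python function; a minimal sketch:

```python
from kfp.components import func_to_container_op

def add(a: float, b: float) -> float:
    """Adds two numbers."""
    return a + b

# Converts the function into a component and writes the generated
# specification to a component.yaml file, in addition to returning the task factory.
add_op = func_to_container_op(
    add,
    base_image="python:3.7",
    output_component_file="add.component.yaml",
)
```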
By way of example, consider the following (equivalent) specifications of a Dummy op:

When compiled into Pipelines, the Component Metadata is nearly identical:

Yields:
CC'ing @Bobgy and @eterna2 since they were involved in the Slack discussion as well (I was unable to find Lida Li's Github username).