Generate a YAML Component Spec from a Python DSL component function #3748

Open
sm-hawkfish opened this issue May 12, 2020 · 15 comments

@sm-hawkfish

sm-hawkfish commented May 12, 2020

Hi everyone,

Using the kfp.dsl.component decorator to define KFP Components is a great user experience, since it leverages standard features of the Python language like type annotations and docstring inspection. By contrast, the YAML Component specification requires engineers to learn a new syntax and is generally more verbose than the Python equivalent (especially when it comes to writing custom type definitions using the OpenAPI Schema).

That said, one advantage of the YAML spec raised in the Slack channel is the ease of distributing a readily parseable file format to a variety of different client applications.

Since both formats have their pros and cons, the purpose of this issue is to discuss the value and feasibility of enhancing the KFP Python SDK to support generating the YAML specification from the Python DSL component definition. This would give users the best of both worlds by allowing them to define components comfortably in Python and to ship Component specifications to downstream clients in YAML.

By way of example, consider the following (equivalent) specifications of a Dummy op:

from kfp import dsl
from kfp.dsl.types import GCSPath
from kfp.dsl.types import String


@dsl.component
def dummy_op(input1: String(), input2: GCSPath()) -> {"output1": GCSPath()}:
    """
    Dummy component for illustrative purposes
    """
    output1 = "/tmp/python_dummy_op/output1_path"
    return dsl.ContainerOp(
        name="Dummy op",
        image="dummy-image",
        command=[
            "python",
            "runner.py",
            "--input1",
            input1,
            "--input2",
            input2,
            "--output1",
            output1,
        ],
        file_outputs={"output1": output1},
    )

The equivalent component.yaml:

name: Dummy op
description: Dummy component for illustrative purposes
inputs:
- {name: input1, type: {String: {openapi_schema_validator: {type: string}}}}
- {name: input2, type: {GCSPath: {openapi_schema_validator: {type: string, pattern: ^gs://.*$}}}}
outputs:
- {name: output1, type: {GCSPath: {openapi_schema_validator: {type: string, pattern: ^gs://.*$}}}}
implementation:
  container:
    image: dummy-image
    command:
      - python3
      - runner.py
      - --input1
      - inputValue: input1
      - --input2
      - inputValue: input2
      - --output1
      - outputPath: output1

When compiled into Pipelines, the Component Metadata is nearly identical:

import kfp

# yaml_dummy_op below is assumed to be loaded from the YAML spec above, e.g. via
# kfp.components.load_component_from_file("dummy.component.yaml").


@dsl.pipeline(
    name="Python Pipeline",
    description="Demonstrate use of Python DSL",
)
def python_pipeline(input1: String(), input2: GCSPath()):
    dummy_task = dummy_op(input1=input1, input2=input2)
    print("Python Pipeline", dummy_task._metadata)


@dsl.pipeline(
    name="YAML Pipeline",
    description="Demonstrate use of YAML component spec",
)
def yaml_pipeline(input1: String(), input2: GCSPath()):
    dummy_task = yaml_dummy_op(input1=input1, input2=input2)
    print("YAML Pipeline", dummy_task._metadata)


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(python_pipeline, "python_pipeline.tar.gz")
    kfp.compiler.Compiler().compile(yaml_pipeline, "yaml_pipeline.tar.gz")

Yields:

Python Pipeline
{'description': 'Dummy component for illustrative purposes',
 'inputs': [{'name': 'input1',
             'type': {'String': {'openapi_schema_validator': {'type': 'string'}}}},
            {'name': 'input2',
             'type': {'GCSPath': {'openapi_schema_validator': {'pattern': '^gs://.*$',
                                                               'type': 'string'}}}}],
 'name': 'Dummy op',
 'outputs': [{'name': 'output1',
              'type': {'GCSPath': {'openapi_schema_validator': {'pattern': '^gs://.*$',
                                                                'type': 'string'}}}}]}


YAML Pipeline
{'description': 'Dummy component for illustrative purposes',
 'inputs': [{'name': 'input1',
             'type': {'String': OrderedDict([('openapi_schema_validator',
                                              OrderedDict([('type',
                                                            'string')]))])}},
            {'name': 'input2',
             'type': {'GCSPath': OrderedDict([('openapi_schema_validator',
                                               OrderedDict([('type', 'string'),
                                                            ('pattern',
                                                             '^gs://.*$')]))])}}],
 'name': 'Dummy op',
 'outputs': [{'name': 'output1',
              'type': {'GCSPath': OrderedDict([('openapi_schema_validator',
                                                OrderedDict([('type', 'string'),
                                                             ('pattern',
                                                              '^gs://.*$')]))])}}]}

CC'ing @Bobgy and @eterna2 since they were involved in the Slack discussion as well (I was unable to find Lida Li's Github username).

@lilida
Contributor

lilida commented May 12, 2020

Thanks @sm-hawkfish

@joostgp

joostgp commented May 12, 2020

Interesting topic. We were discussing internally whether YAML or decorated Python functions are a better way to provide a shared library of components.

We are running into a couple of challenges with YAML files:

  • We have some legacy containers where parameters are passed as environment variables. I don't think this is supported in the component YAML files.
  • We have several container registries, using decorated Python functions we can parameterize the container registry of the images
  • It is more natural for developers to create the Python functions. There is not enough added value in creating the YAML files.

On the other hand, as said by others, consuming the YAML file from a code repository is much easier than consuming it from e.g. a Python package on PyPI.

A feature that generates the YAML specifications from Python functions would potentially offer the flexibility of Python functions and the ease of consumption of text files. It would also make CI/CD very flexible. So definitely interested.

@Bobgy
Contributor

Bobgy commented May 12, 2020

Thanks for summarizing all of these in an issue!

/cc @Ark-kun @numerology, who have the best knowledge here.

@Bobgy Bobgy added the status/triaged, area/sdk/dsl/compiler and kind/feature labels May 12, 2020
@Ark-kun Ark-kun self-assigned this May 13, 2020
@Ark-kun
Contributor

Ark-kun commented May 13, 2020

Commenting on some of the statements for now.

By contrast, the YAML Component specification requires engineers to learn a new syntax and is generally more verbose than the Python equivalent

I think that the component.yaml is the minimal description of the component. Your example shows that it has fewer lines than the Python version. It also supports more features than the legacy @component decorator.

(especially when it comes to writing custom type definitions using the OpenAPI Schema)

Do you have any use for those OpenAPI schemas? Those schemas as well as kfp.dsl.types are mostly deprecated.

Why would you use input1: kfp.dsl.types.String() instead of just the pythonic input1: str? The latter results in a more compact and better supported component.yaml.
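
For comparison, a minimal sketch of the same component with plain annotations (assuming the legacy @dsl.component decorator accepts str and bare type-name strings, as suggested above):

from kfp import dsl


@dsl.component
def dummy_op(input1: str, input2: "GCSPath") -> {"output1": "GCSPath"}:
    """Dummy component for illustrative purposes."""
    output1 = "/tmp/python_dummy_op/output1_path"
    return dsl.ContainerOp(
        name="Dummy op",
        image="dummy-image",
        command=[
            "python", "runner.py",
            "--input1", input1,
            "--input2", input2,
            "--output1", output1,
        ],
        file_outputs={"output1": output1},
    )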

Components vs. ContainerOp:
A component consists of an interface (input and output specifications) and an implementation (currently container and graph are supported).

ContainerOp objects are not really components. They are semi-resolved Task objects and do not contain the full component information. Given a ContainerOp produced by passing arguments to a component, you cannot restore the component; the information is lost. For example, ContainerOp does not really have a concept of inputs: when you pass arguments to dummy_op, they are injected directly into the command line and cannot be recovered. The @component decorator was added as a hack to try to preserve some of that information, but it only preserves the input types; other information is still lost. This is why we strongly discourage users from creating ContainerOp objects directly - it takes the same amount of effort to write a component.yaml file, which gives you a real reusable component that can be shared between pipelines and users.

@Ark-kun
Contributor

Ark-kun commented May 13, 2020

I think we already support the feature that you want. The classes in kfp.components.structures allow you to build your component specification in Python. The code is about the same size as the ContainerOp instantiation, while creating a real component. The specification can then be saved to component.yaml.

from kfp.components.structures import *

component_spec = ComponentSpec(
    name='Dummy op',
    description='Dummy component for illustrative purposes',
    inputs=[
        InputSpec(name='input1', type='String'),
        InputSpec(name='input2', type='GCSPath'),
    ],
    outputs=[
        OutputSpec(name='output1', type='GCSPath'),
    ],
    implementation=ContainerImplementation(container=ContainerSpec(
        image="dummy-image",
        command=[
            "python", "runner.py",
            "--input1", InputValuePlaceholder('input1'),
            "--input2", InputPathPlaceholder('input2'),
            "--output1", OutputPathPlaceholder('output1'),
        ],
    ))
)

component_spec.save('dummy.component.yaml')
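
A usage sketch, assuming the file saved above sits next to the pipeline script:

import kfp.components

# Load the saved specification back as a reusable component factory.
dummy_op = kfp.components.load_component_from_file('dummy.component.yaml')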

What do you think?

@sm-hawkfish
Author

sm-hawkfish commented May 13, 2020

Thank you for the quick and detailed reply @Ark-kun ! The code snippet you provided answers my initial question, although I now have a follow-up :)

To take a step back, a large part of my motivation for wanting to write the Components in Python is so that I could extend the types in kfp.dsl.types with my own types. By using Python, we could use a variety of existing tools like mypy and pydantic to validate Pipeline parameters before submitting, with the goal of shortening the development feedback loop.

Additionally, these tools would allow us to easily define complex types, so that we could pass an object of related Pipeline params to the pipeline function, rather than denormalizing them into lots of Pipeline params with primitive types. For example, suppose I want to create a component that launches a Katib Hyperparameter tuning job: it seems convenient to have a single pipeline parameter katib-objective, which is an object containing all of the fields outlined in the ObjectiveSpec detailed here and here. And/or each Component could have its own complex type, to make the Pipeline function input signature cleaner.

I was actually just preparing some code snippets for a separate GitHub Issue demonstrating how the kfp.dsl.types system could be revamped using something like pydantic, but I did not know that that module is considered largely deprecated. I'll share the basic idea here:

from typing import List
from typing import NamedTuple

import kfp
from kfp import dsl
from pydantic import BaseModel
from pydantic import Field

class SimpleParam(BaseModel):
    field1: str
    field2: float


class ComplexParam(BaseModel):
    field1: str = Field(..., regex="^gs://.*$")
    field2: float = Field(..., ge=0, le=1)
    field3: List[int]

Component1Outputs = NamedTuple("Component1Outputs", [("output", SimpleParam)])


@dsl.component
def component1(arg: ComplexParam) -> Component1Outputs:
    output = "/tmp/python_dummy_op/output1_path"
    return dsl.ContainerOp(
        name="Dummy op",
        image="dummy-image",
        command=["python3", "runner.py", "--input", arg, "--output", output],
        file_outputs={"output": output},
    )


@dsl.component
def component2(arg1: ComplexParam, arg2: SimpleParam):
    return dsl.ContainerOp(
        name="Dummy op",
        image="dummy-image",
        command=["python3", "runner.py", "--input1", arg1, "--input2", arg2],
    )


@dsl.pipeline()
def my_pipeline(arg: ComplexParam):
    component1_task = component1(arg=arg)
    component2(arg1=arg, arg2=component1_task.outputs["output"])


if __name__ == "__main__":
    arg = ComplexParam(
        field1="gs://my_bucket/hello_world.txt", field2=0.5, field3=["1", "2"]
    )
    client = kfp.Client()
    run = client.create_run_from_pipeline_func(
        my_pipeline, arguments={"arg": arg}, run_name="Test custom types"
    )

I removed some hacky code from the above example that makes the pydantic Models backwards compatible with the kfp.dsl.types.BaseType, which I'd be happy to share if you are interested. Fully implemented, the code would allow for the following workflow:

  • At testing time, mypy statically validates that the arg being passed to the pipeline function is actually a ComplexParam
  • At runtime, pydantic validates that the arg conforms to the ComplexParam Model schema (and performs common-sense type casting automatically)
  • At runtime, the KFP DSL type checking ensures that component I/O edges have the same type (e.g. it confirms that component1 outputs a SimpleParam and component2 expects a SimpleParam).

There are still some open questions:

  • Where the serialization of these complex types should occur
  • Relationship between this work and any work to validate / submit Pipelines from the Web UI (note that Pydantic models do define an OpenAPI Schema that could be shipped elsewhere).

The reason I said your response prompted a follow-up question is that it does not look like InputSpec could accept the complex types that I created in the above code snippet. I'd love to hear your input on this (and would be happy to open a separate Issue if you would prefer to discuss types/validation elsewhere).

@sm-hawkfish
Author

Hi @Ark-kun, I know that you are very busy with other Issues and Pull Requests, but want to keep this on your radar. Are there ideas or concepts in my previous post that you would be interested in discussing further?

@Ark-kun
Contributor

Ark-kun commented May 15, 2020

I want to note some core KFP aspects:

Low level = command-line programs, strings and files

On the low conceptual level, KFP orchestrates containerized command-line programs. The "command-line programs" part is important: it helps users understand the limitations and the solutions. KFP does not orchestrate Python or Java classes. KFP does not pass in-memory objects between running programs. KFP passes data, serialized as bytes or strings.

Portability:

KFP needs to be portable, language-agnostic and platform-agnostic. Users can still use Python-specific serialization formats like Pickle, but they should understand that this has negative portability implications - a Java-based downstream component won't be able to read the pickled data.

Components

KFP components are described by the ComponentSpec class and the corresponding component.yaml serialization. This is the source of truth regarding components: all other component creation methods build on it, and any new high-level component feature should be built on top of that structure. The structure is pretty flexible, so this is usually not a problem. For example, Python-based components are still built on top of ComponentSpec and ContainerSpec.

Types

Let's start with an untyped world, where components simply exchange pieces of data blobs. Why would a user want to specify types for their component inputs and outputs? I see several reasons:

  1. Compile-time reference argument compatibility checking. This feature prevents passing outputs of one type to an input with another type.

  2. Compile-time constant argument value checking. This feature prevents passing objects of one type to an input with another type.

  3. Visualization. The UX might visualize data of certain types based on the type information.

Types in KFP components

KFP components support type specifications. A type specification is essentially a dict (whose values can themselves be strings, dicts or lists). The system is very flexible and allows specifying arbitrary user types. (Do not confuse types and objects.)
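
For illustration, a sketch of what such type specifications can look like (the names below are made up):

# A bare type name...
simple_type = 'GCSPath'

# ...or an arbitrary nested structure describing a user-defined type.
custom_type = {'MyOrg.DatasetReference': {'format': 'csv', 'schema_version': '1'}}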

Compile-time reference argument compatibility checking

KFP already supports this. If both the input and the argument have types, the system immediately checks that the types are the same when you pass an output to an input. There is no need for any additional tools. Currently the type compatibility check simply compares the two type specifications (dicts).

Compile-time constant argument value checking

KFP has some support for this. There is a limited set of types (str/String, int/Integer, float/Float, bool/Boolean, list/JsonArray, dict/JsonObject) for which constant argument values are checked against the input type and serialized automatically. Values of other types must be manually serialized before being passed to the component.

@Ark-kun
Contributor

Ark-kun commented May 15, 2020

To take a step back, a large part of my motivation for wanting to write the Components in Python is so that I could extend the types in kfp.dsl.types with my own types.

I think that you do not need kfp.dsl.types to declare your own custom types.
You can use an arbitrary type name or type structure (dict). You can even use an object that has a to_dict method, although this is not officially supported (this is how the types in kfp.dsl.types are implemented - they just return a dict; that's it).

By using Python, we could use a variety of existing tools like mypy and pydantic to validate Pipeline parameters before submitting.

KFP already validates the types even before submission.

If both the input and the output are typed, you'll get an error when you try to pass an output to an incompatible input. There is no need to integrate any external tools.

Additionally, these tools would allow us to easily define complex types, so that we could pass an object of related Pipeline params to the pipeline function, rather than denormalizing them into lots of Pipeline params with primitive types.

I think this is a misconception. Just because you can use some Python class as a KFP type does not mean you can pass an object of that class to a component. KFP orchestrates containerized command-line programs. You cannot pass in-memory objects: at some point they must be serialized, sent over the network as bytes, and then possibly deserialized by some code.

In the KFP team we try to keep the API surface of the SDK minimal, so we only support automatic serialization of 6 primitive types. Everything else must be serialized by the pipeline code and deserialized by the component code. Remember that KFP runs arbitrary containerized command-line programs; in general the containers do not have the KFP SDK or even Python installed. A Java program won't automatically understand a Python in-memory object.

each Component could have its own complex type, to make the Pipeline function input signature cleaner.

Sure, component code can already do whatever it wants, and the component author can specify any structure describing the type.

The SDK does not peek into that, though. Type specifications are opaque identifiers and containers are black boxes.

For example, suppose I want to create a component to launch a Katib Hyperparameter tuning job, it seems convenient to have a single pipeline parameter katib-objective, which is an object containing all of the fields outlined in the ObjectiveSpec detailed here and here.

You can easily do that. Declare a single input. Optionally, give it some type like katib.ObjectiveSpec. When building a pipeline, construct an object of that type and serialize it to a string (e.g. as JSON) before passing it to the component. (If using JSON you might call the type {JsonObject: {data_type: katib.ObjectiveSpec}}, but that won't change much.)
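
A minimal sketch of that pattern (the component file name and input name here are assumptions for illustration):

import json

import kfp.components
import kfp.dsl

# Hypothetical component with a single structured 'objective_spec' input.
katib_op = kfp.components.load_component_from_file('katib_launcher.component.yaml')

objective_spec = {
    "type": "maximize",
    "goal": 0.9,
    "objectiveMetricName": "accuracy",
}


@kfp.dsl.pipeline(name='Katib example')
def my_pipeline():
    # Serialize the structured object to a JSON string before passing it in.
    katib_op(objective_spec=json.dumps(objective_spec))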

@Ark-kun
Contributor

Ark-kun commented May 15, 2020

it does not look like InputSpec could accept the complex types that I created in the above code snippet.

I think it can. You can convert your complex type specifications to a JSON-like structure and use it directly. Even if InputSpec only supported a type name (a single string), you could still serialize an arbitrary type specification into that string (e.g. using JSON).
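
A sketch of what that could look like (assuming InputSpec accepts a dict-valued type, as stated above):

from kfp.components.structures import InputSpec

input_spec = InputSpec(
    name='objective_spec',
    # An arbitrary, structured type specification used as an opaque identifier.
    type={'JsonObject': {'data_type': 'katib.ObjectiveSpec'}},
)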

Comments on some of the samples:

def component2(arg1: ComplexParam, arg2: SimpleParam):
    return dsl.ContainerOp(
        name="Dummy op",
        image="dummy-image",
        command=["python3", "runner.py", "--input1", arg1, "--input2", arg2],
    )

...
    arg = ComplexParam(
        field1="gs://my_bucket/hello_world.txt", field2=0.5, field3=["1", "2"]
    )

So, what is the resolved command-line supposed to be?

Component1Outputs = NamedTuple("Component1Outputs", [("output", SimpleParam)])

@dsl.component
def component1(arg: ComplexParam) -> Component1Outputs:
    output = "/tmp/python_dummy_op/output1_path"
    return dsl.ContainerOp(
        name="Dummy op",
        image="dummy-image",
        command=["python3", "runner.py", "--input", arg, "--output", output],
        file_outputs={"output": output},
    )

I'm writing a Java-based component. How can I read the output of your component?

There are still some open questions:
Where the serialization of these complex types should occur
Relationship between this work and any work to validate / submit Pipelines from the Web UI (note that Pydantic models do define an OpenAPI Schema that could be shipped elsewhere).

Yes. These are the questions that have shaped the SDK's API surface regarding types. This is why the SDK only supports serialization of 6 primitive types, and everything else is the responsibility of the pipeline and component authors.

Where the serialization of these complex types should occur

Serialization is custom code. There are only two places where custom code is executed - inside the launched containers in the cloud, and on the pipeline author's/submitter's machine. Since the serialization must occur before the container can be launched, that leaves only the pipeline author's/submitter's machine. The complex objects must be serialized before the pipeline can be compiled or submitted, and this is what the SDK expects at the moment.

Relationship between this work and any work to validate / submit Pipelines from the Web UI (note that Pydantic models do define an OpenAPI Schema that could be shipped elsewhere).

There were some plans for this; it was the reason the openapi_schema_validator schemas were added. The SDK makes the whole ComponentSpec available to the UX, but the feature has not been implemented on the UX side. It would be useful to have.

@Ark-kun
Contributor

Ark-kun commented Jun 8, 2020

@sm-hawkfish What do you think?

@sm-hawkfish
Author

Hi @Ark-kun , I apologize for the extended delay -- I wanted to make sure I had a chance to review your comments and go through the code-base in more detail before responding.

For one, I have come to agree with you that using func_to_component_spec is not really a "shortcut", since it is a similar amount of work to writing the Component Spec in YAML. In your response, you also mentioned some aspects of the KFP methodology (specifically, that components are containerized CLI programs) that I never meant to question. To clarify: the goal of this Issue is only to discuss possible improvements to the Python SDK that would make it easier for Component authors to create new reusable components using idiomatic Python, and for Pipeline authors to get rich, compile-time type validation.

To that end, I found the existing Python decorator for lightweight components very inspirational: as you well know, create_component_from_func uses the type annotations in the signature of the component function to generate a component.yaml specification. This strikes me as much better than either writing the component.yaml directly or writing a func_to_component_spec, since the spec is created from the component implementation itself.
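
For reference, a minimal sketch of that workflow (toy function and base image chosen purely for illustration):

from kfp.components import create_component_from_func


def add(a: float, b: float) -> float:
    """Adds two numbers."""
    return a + b


# Generates the component and writes its component.yaml specification to disk.
add_op = create_component_from_func(
    add,
    base_image='python:3.7',
    output_component_file='add.component.yaml',
)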

For all of its benefits, there are a few drawbacks to create_component_from_func, as it pertains to creating re-usable components:

  • The component source code is copied into the component.yaml, limiting the complexity of the component implementation (and requiring users to make all imports within the function)
  • The types placed in the component.yaml are limited in their ability to validate user input, since they are string names like 'JsonObject', as opposed to OpenAPI schema definitions.
  • Limited stack traces, per your issue here

In an attempt to supercharge create_component_from_func, I have made some local modifications to the KFP Python SDK and incorporated the Pydantic and Typer libraries. I will provide some snippets below showing how this looks, and am happy to provide additional detail (or contribute) if you are interested in the approach.

I have taken as an example a KFP Component that submits a Katib Hyperparameter Tuning experiment:

Implementing the Component

We create a file src/katib_specifications.py with Pydantic models that mirror the Katib specs defined here:

from enum import Enum
from typing import List

from pydantic import BaseModel


class ObjectiveTypes(str, Enum):
    unknown = ""
    minimize = "minimize"
    maximize = "maximize"


class ObjectiveSpec(BaseModel):
    type: ObjectiveTypes
    goal: float = None
    objectiveMetricName: str
    additionalMetricNames: List[str] = None

There are many more specs, but this should give a feel for the syntax.

From there, we create src/component.py which contains the function that will be the component entrypoint:

from typing import Dict
from typing import List
from typing import NamedTuple
from typing import Union

import typer

from .katib_specifications import ObjectiveSpec


class Outputs(NamedTuple):
    best_hyperparameters: Dict[str, Union[str, float, int]]


def katib_hyperparameter_tuning(
    data_dir: str = typer.Option(
        ..., help="The GCS directory containing training data"
    ),
    objective_spec: ObjectiveSpec = typer.Option(
        ...,
        help="The Katib Objective to optimize",
    ),
) -> Outputs:
    """
    Distributed Hyperparameter Tuning (and Neural Architecture Search) using Kubeflow Katib. The
    upstream documentation on available hyperparameter search algorithms is available here:
    https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#search-algorithms-in-detail
    """

    # Implementation - `objective_spec` can be used like a namedtuple or turned into a dictionary via objective_spec.dict()

    best_hyperparameters = {"max_depth": 4}
    return Outputs(best_hyperparameters=best_hyperparameters)


if __name__ == "__main__":
    typer.run(katib_hyperparameter_tuning)

The idea is that Typer can create a CLI for you automatically from the type annotations in the function. The inputs provided to the component will be JSON strings, so the KFP infrastructure doesn't need to know or care about this:

python -m src.component --data-dir gs://my-bucket/my-training-data/ --objective-spec '{"type": "maximize", "objectiveMetricName": "roc_auc", "additionalMetricNames": ["accuracy"]}'

And Typer will also do the work of casting the input types into the type declared by the annotation. In the case above, something like this would be run by Typer behind the scenes:

objective_spec = ObjectiveSpec.parse_raw(
    '{"type": "maximize", "objectiveMetricName": "roc_auc", "additionalMetricNames": ["accuracy"]}'
)

I am cheating here a little bit, as I did need to make slight modifications to Typer in order to parse these JSON inputs into Pydantic models, which is outlined in this issue.

Generating the Component Spec

Since Pydantic Models can output an OpenAPI schema, I just needed to make some adjustments to extract_component_interface and annotation_to_type_struct in the KFP SDK in order to turn the type annotations into OpenAPI schema definitions.

Here's an example of how to get the schema:

import json

from .katib_specifications import ObjectiveSpec

print(json.dumps(ObjectiveSpec.schema(), indent=2))
{
  "title": "ObjectiveSpec",
  "type": "object",
  "properties": {
    "type": {
      "$ref": "#/definitions/ObjectiveTypes"
    },
    "goal": {
      "title": "Goal",
      "type": "number"
    },
    "objectiveMetricName": {
      "title": "Objectivemetricname",
      "type": "string"
    },
    "additionalMetricNames": {
      "title": "Additionalmetricnames",
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": [
    "type",
    "objectiveMetricName"
  ],
  "definitions": {
    "ObjectiveTypes": {
      "title": "ObjectiveTypes",
      "description": "An enumeration.",
      "enum": [
        "",
        "minimize",
        "maximize"
      ],
      "type": "string"
    }
  }
}

The end result is the following component specification:

name: Katib hyperparameter tuning
description: |-
  Distributed Hyperparameter Tuning (and Neural Architecture Search) using Kubeflow Katib. The
      upstream documentation on available hyperparameter search algorithms is available here:
      https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#search-algorithms-in-detail
inputs:
- name: data_dir
  type:
    String:
      openapi_schema_validator: {type: string}
  description: The GCS directory containing training data
- name: objective_spec
  type:
    src.katib_specifications.ObjectiveSpec:
      openapi_schema_validator:
        type: object
        properties:
          type: {$ref: '#/definitions/ObjectiveTypes'}
          goal: {title: Goal, type: number}
          objectiveMetricName: {title: Objectivemetricname, type: string}
          additionalMetricNames:
            title: Additionalmetricnames
            type: array
            items: {type: string}
        required: [type, objectiveMetricName]
        definitions:
          ObjectiveTypes:
            title: ObjectiveTypes
            description: An enumeration.
            enum: ['', minimize, maximize]
            type: string
  description: The Katib Objective to optimize.
outputs:
- name: best_hyperparameters
  type:
    Dict[str, Union[str, float, int]]:
      openapi_schema_validator:
        type: object
        additionalProperties:
          anyOf:
          - {type: string}
          - {type: number}
          - {type: integer}
implementation:
  container:
    image: DUMMY_IMAGE
    args:
    - --data-dir
    - {inputValue: data_dir}
    - --objective-spec
    - {inputValue: objective_spec}
    - '----output-paths'
    - {outputPath: best_hyperparameters}

You can see I kept the key openapi_schema_validator, similar to kfp.dsl.types, so that existing features like DSL Type Checking continue to work as expected.
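
A rough sketch of the kind of annotation-to-type-struct mapping this implies (hypothetical helper; the actual SDK patch is not shown in this thread):

from pydantic import BaseModel


def pydantic_annotation_to_type_struct(annotation) -> dict:
    """Map a Pydantic model annotation to the
    {TypeName: {openapi_schema_validator: ...}} structure used above."""
    if isinstance(annotation, type) and issubclass(annotation, BaseModel):
        type_name = f"{annotation.__module__}.{annotation.__name__}"
        return {type_name: {"openapi_schema_validator": annotation.schema()}}
    raise TypeError(f"Unsupported annotation: {annotation!r}")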

Using the Component in a Pipeline

Generating the OpenAPI schema has a couple of nice benefits:

Auto-Generated Documentation

I wrote a very simple script to parse the component.yaml file and generate a static documentation site using redoc

[Screenshot: KatibDocs - auto-generated documentation for the Katib component]

This provides a nice reference to data scientists who are getting familiar with the inputs that each component expects.

Pipeline Compile-time validation

In addition, I wrote a small function validate_pipeline that will validate the inputs against the OpenAPI schema in the Component Spec.

The validation code looks something like:

from typing import Any

from kfp.components.structures import InputSpec
from openapi_schema_validator import OAS30Validator


def validate_component_input(input_spec: InputSpec, input_arg: Any):
    # The type dict has a single key (the type name); grab its OpenAPI schema.
    schema = list(input_spec.type.values())[0]["openapi_schema_validator"]
    validator = OAS30Validator(schema)
    validator.validate(input_arg)
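
And a rough sketch of how a validate_pipeline-style helper might apply this to constant arguments (hypothetical; the actual implementation is not shown here):

def validate_constant_arguments(component, arguments: dict):
    # component is a task factory loaded from a component.yaml; its
    # component_spec attribute carries the typed input specifications.
    inputs_by_name = {i.name: i for i in (component.component_spec.inputs or [])}
    for name, value in arguments.items():
        input_spec = inputs_by_name.get(name)
        if input_spec is not None and isinstance(input_spec.type, dict):
            validate_component_input(input_spec, value)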

This allows data scientists to get feedback on their pipeline arguments at compile time (note the objective spec type below is "maximization" instead of "maximize"):

import kfp
from kfp.components import ComponentStore

DATA_DIR = "gs://my-bucket/my-training-data/"  # illustrative default

KATIB_OBJECTIVE_SPEC = {
    "type": "maximization",
    "objectiveMetricName": "roc_auc",
    "additionalMetricNames": ["accuracy"],
}

component_store = ComponentStore()
hyperparameter_op = component_store.load_component("hyperparameter_tuning")


# Define a pipeline and create a task from a component:
@kfp.dsl.pipeline(
    name="Train Model", description="Train model",
)
def my_pipeline(
    data_dir=DATA_DIR,
    katib_objective_spec=KATIB_OBJECTIVE_SPEC
):

    hyperparameter_op(
        data_dir=data_dir,
        objective_spec=katib_objective_spec
    )

if __name__ == "__main__":
    # validate_pipeline is the custom validation helper described above.
    validate_pipeline(my_pipeline)

    kfp.compiler.Compiler().compile(my_pipeline, "/tmp/pipeline.tar.gz")

When the user runs this script, they will get:

jsonschema.exceptions.ValidationError: 'maximization' is not one of ['', 'minimize', 'maximize']

Failed validating 'enum' in schema['properties']['type']:
    OrderedDict([('title', 'ObjectiveTypes'),
                 ('description', 'An enumeration.'),
                 ('enum', ['', 'minimize', 'maximize']),
                 ('type', 'string'),
                 ('nullable', False)])

On instance['type']:
    'maximization'

And can easily correct the typo before submitting the pipeline run to the cluster.

As stated above, I would be very happy to go into more detail on any of the above steps if you find aspects of this approach interesting.

And congratulations on the 1.0.0 release!

@stale

stale bot commented Nov 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Nov 1, 2020
@Ark-kun
Contributor

Ark-kun commented Nov 4, 2020

/freeze

@stale stale bot removed the lifecycle/stale label Nov 4, 2020
@Bobgy
Contributor

Bobgy commented Nov 4, 2020

/lifecycle frozen
