A pipeline comprises one or more nodes that are (in many cases) connected to define execution dependencies. Each node is implemented by a component and typically performs only a single task, such as loading data, processing data, training a model, or sending an email.
A generic pipeline comprises nodes that are implemented using generic components. Elyra includes generic components that run Jupyter notebooks, Python scripts, and R scripts. Generic components are supported in every Elyra pipeline runtime environment: local/JupyterLab, Kubeflow Pipelines, and Apache Airflow.
The following tutorials cover generic pipelines:
- Introduction to generic pipelines
- Run generic pipelines on Kubeflow Pipelines
- Run generic pipelines on Apache Airflow
A runtime-specific pipeline comprises nodes that are implemented using generic components or custom components. Custom components are runtime-specific and user-provided.
In this intermediate tutorial you will learn how to add Kubeflow Pipelines components to Elyra and how to utilize them in pipelines.
The features described in this tutorial require Elyra v3.3 or later. The tutorial instructions were last updated using Elyra v3.3.0 and Kubeflow v1.4.1.
Elyra does not support Kubeflow Pipelines Python function-based components.
- JupyterLab 3.x with the Elyra extension v3.3 (or later) installed.
- Access to a local or cloud Kubeflow Pipelines deployment.
Some familiarity with Kubeflow Pipelines and Kubeflow Pipelines components is required to complete the tutorial. If you are new to Elyra, please review the Run generic pipelines on Kubeflow Pipelines tutorial. It introduces concepts and tasks that are used in this tutorial, but not explained here to avoid content duplication.
Collect the following information for your Kubeflow Pipelines installation:
- API endpoint, e.g. `http://kubernetes-service.ibm.com/pipeline`
- Namespace, for a multi-user, auth-enabled Kubeflow installation, e.g. `mynamespace`
- Username, for a multi-user, auth-enabled Kubeflow installation, e.g. `jdoe`
- Password, for a multi-user, auth-enabled Kubeflow installation, e.g. `passw0rd`
- Workflow engine type, which should be `Argo` or `Tekton`. Contact your administrator if you are unsure which engine your deployment utilizes.
Elyra utilizes S3-compatible cloud storage to make data available to notebooks and scripts while they are executed. Any kind of S3-based cloud storage should work (e.g. IBM Cloud Object Storage or Minio) as long as it can be accessed from the machine where JupyterLab/Elyra is running and from the Kubeflow Pipelines cluster.
Collect the following information:
- S3-compatible object storage endpoint, e.g. `http://minio-service.kubernetes:9000`
- S3 object storage username, e.g. `minio`
- S3 object storage password, e.g. `minio123`
- S3 object storage bucket, e.g. `pipelines-artifacts`
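If you want to verify these values before continuing, one option is to list the bucket with the AWS CLI. This optional check is not part of the tutorial and assumes the `aws` CLI is installed; it uses the example values shown above:

```bash
# Optional sanity check: confirm the bucket is reachable with the collected credentials.
# Substitute your own endpoint, credentials, and bucket name for the example values.
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123
aws --endpoint-url http://minio-service.kubernetes:9000 s3 ls s3://pipelines-artifacts
```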
Create a runtime environment configuration for your Kubeflow Pipelines installation as described in the Runtime configuration topic in the User Guide or the Run generic pipelines on Kubeflow Pipelines tutorial.
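If you prefer the command line, a runtime configuration can also be created with the `elyra-metadata` CLI. The sketch below uses the example values collected above; the display name is a placeholder and the option names may vary between Elyra releases, so check `elyra-metadata install runtimes --help` before running it:

```bash
# Sketch: create a Kubeflow Pipelines runtime configuration from the CLI.
# Option names mirror the runtime configuration schema fields (verify with --help);
# the display name is a hypothetical label.
elyra-metadata install runtimes \
  --schema_name=kfp \
  --display_name="Kubeflow Pipelines (tutorial)" \
  --api_endpoint=http://kubernetes-service.ibm.com/pipeline \
  --api_username=jdoe \
  --api_password=passw0rd \
  --engine=Argo \
  --cos_endpoint=http://minio-service.kubernetes:9000 \
  --cos_username=minio \
  --cos_password=minio123 \
  --cos_bucket=pipelines-artifacts
```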
This tutorial uses the `run-pipelines-on-kubeflow-pipelines` sample from the https://github.com/elyra-ai/examples GitHub repository.
1. Launch JupyterLab.
2. Open the Git clone wizard (Git > Clone A Repository).
3. Enter `https://github.com/elyra-ai/examples.git` as Clone URI.
4. In the File Browser navigate to `examples/pipelines/run-pipelines-on-kubeflow-pipelines`.

The cloned repository includes a set of custom component specifications that you will add to the Elyra component catalog and use in a pipeline. The 'Download File' component downloads a file from a web resource. The 'Count Rows' component counts the lines in a row-based file.
You are ready to start the tutorial.
Elyra stores information about custom components in component catalogs and makes those components available in the Visual Pipeline Editor's palette. Components can be grouped into categories to make them more easily discoverable.
Custom components are managed in the JupyterLab UI using the Pipeline components panel. You access the panel by:
- Selecting `Pipeline Components` from the JupyterLab sidebar.
- Clicking the `Open Pipeline Components` button in the pipeline editor toolbar.
- Searching for `Manage pipeline components` in the JupyterLab command palette.
You can automate the component management tasks using the `elyra-metadata install component-catalogs` CLI command.
The component catalog can access component specifications that are stored in the local file system or on remote sources. In this tutorial, 'local' refers to the file system where JupyterLab/Elyra is running. For example, if you've installed Elyra on your laptop, 'local' refers to the laptop's file system. If you've installed Elyra in a container image, 'local' refers to the container's file system.
Use locally stored component specifications if there is no (immediate) need to share the specification with other users, for example, during initial development and prototyping.
1. Open the Pipeline Components panel using one of the approaches mentioned above.

   Note that the palette may already include a few example components, depending on how you installed Elyra. These examples are included for illustrative purposes to help you get started, but you won't use them in this tutorial.

2. Add a new component catalog entry by clicking `+` and `New Filesystem Component Catalog`. The first tutorial component you are adding counts the number of rows in a file.

3. Enter or select the following:

   - Name: `analyze data`
   - Description: `analyze row based data`
   - Runtime Type: `KUBEFLOW_PIPELINES`
   - Category: `analyze`
   - Base Directory: `.../examples/pipelines/run-pipelines-on-kubeflow-pipelines/components` (on Windows: `...\examples\pipelines\run-pipelines-on-kubeflow-pipelines\components`)

     Note: Replace `...` with the path to the location where you cloned the Elyra example repository. The base directory can include `~` or `~user` to indicate the home directory. The concatenation of the base directory and each path must resolve to an absolute path or Elyra won't be able to locate the specified files.

   - Paths: `count-rows.yaml`

4. Save the component catalog entry.
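As mentioned earlier, catalog entries can also be created with the `elyra-metadata install component-catalogs` CLI command. The following is a minimal sketch using the values from the steps above; the option names reflect common Elyra CLI usage and may differ between releases, so verify them with `elyra-metadata install component-catalogs --help`:

```bash
# Sketch: register the 'Count Rows' component specification from the command line.
# Replace ... with the location of your clone of the examples repository;
# option names are an assumption based on typical Elyra CLI usage (check --help).
elyra-metadata install component-catalogs \
  --schema_name=local-file-catalog \
  --display_name="analyze data" \
  --description="analyze row based data" \
  --runtime_type=KUBEFLOW_PIPELINES \
  --categories="['analyze']" \
  --paths="['.../examples/pipelines/run-pipelines-on-kubeflow-pipelines/components/count-rows.yaml']"
```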
There are two approaches you can take to add multiple related component specifications:
- Specify multiple Path values.
- Store the related specifications in the same directory and use the `Directory` catalog type. Elyra searches the directory for specifications. Check the Include Subdirectories checkbox to search subdirectories for component specifications as well.
Refer to the descriptions in the linked documentation topic for details and examples.
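In this tutorial, for example, both component specifications reside in the same `components` directory of the cloned repository, so a single `Directory` catalog entry could register them together. The listing below is illustrative; the directory may contain additional files:

```bash
# Both tutorial component specifications live in one directory.
# Replace ... with the location of your clone of the examples repository.
ls .../examples/pipelines/run-pipelines-on-kubeflow-pipelines/components
# count-rows.yaml  download-file.yaml  ...
```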
Locally stored component specifications have the advantage that they can be quickly loaded by Elyra. If you need to share component specifications with other users, ensure that the given Paths are the same relative paths across installations. The Base Directory can differ across installations.
The URL Component Catalog type only supports web resources that can be downloaded using HTTP `GET` requests and that don't require authentication.
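To check whether a resource qualifies, you can try fetching it with an anonymous `GET` request. This is an optional sanity check, shown here for the specification used later in this tutorial:

```bash
# Verify the component specification can be downloaded without authentication.
curl -fsSL -o /dev/null \
  https://raw.githubusercontent.com/elyra-ai/examples/main/pipelines/run-pipelines-on-kubeflow-pipelines/components/download-file.yaml \
  && echo "OK: resource is accessible via unauthenticated HTTP GET"
```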
To add component specifications that are stored on the web to the catalog:

1. Open the Pipeline Components panel.
2. Add a new component catalog entry by clicking `+` and `New URL Component Catalog`.
3. Enter the following information:
   - Name: `download data`
   - Description: `download data from public sources`
   - Runtime: `KUBEFLOW_PIPELINES`
   - Category Names: `download`
   - URLs: `https://raw.githubusercontent.com/elyra-ai/examples/main/pipelines/run-pipelines-on-kubeflow-pipelines/components/download-file.yaml`
4. Save the component catalog entry.
The catalog is now populated with the custom components you'll use in the tutorial pipeline.
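You can optionally confirm that both catalog entries were saved using the CLI; the exact output format varies by Elyra version:

```bash
# List the configured component catalog entries;
# 'analyze data' and 'download data' should appear in the output.
elyra-metadata list component-catalogs
```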
Next, you'll create a pipeline that uses the registered components.
The pipeline editor's palette is populated from the component catalog. To use the components in a pipeline:
1. Open the JupyterLab Launcher.

2. Click the `Kubeflow Pipeline Editor` tile to open the Visual Pipeline Editor for Kubeflow Pipelines.

3. Expand the palette panel. Two new component categories are displayed (`analyze` and `download`), each containing the component entry that you added.

4. Drag the 'Download File' component onto the canvas to create the first pipeline node.

5. Drag the 'Count Rows' component onto the canvas to create a second node and connect the 'Download File' node to the 'Count Rows' node.

   Note that each node is tagged with an error icon. Hover over each node and review the error messages. The components require inputs, which you need to specify to render the nodes functional.
6. Open the properties of the 'Download File' node:

   - select the node and expand (↤) the properties slideout panel on the right, OR
   - right click on the node and select `Open Properties`

7. Review the node properties. The properties are a combination of Elyra metadata and information that was extracted from the component's specification:

   ```yaml
   name: Download File
   description: Downloads a file from a public HTTP/S URL using a GET request.
   inputs:
   - {name: URL, type: String, optional: false, description: 'File URL'}
   outputs:
   - {name: downloaded file, type: String, description: 'Content of the downloaded file.'}
   ...
   ```

   The component requires one input (`URL`) and produces one output (`downloaded file`), which is the content of the downloaded file.

   The node properties include:

   - `Label` (Elyra property): If specified, the value is used as the node name in the pipeline instead of the component name. Use labels to resolve naming conflicts that might arise if a pipeline uses the same component multiple times. For example, if a pipeline utilizes the 'Download File' component to download two files, you could override the node names by specifying 'Download labels' and 'Download observations' as labels.

   - `URL`: This is a required input of the 'Download File' component:

     ```yaml
     inputs:
     - {name: URL, type: String, optional: false, description: 'File URL'}
     ```

     The pipeline editor renders component inputs using an editable widget, such as a text box, and, if one was provided, displays the input's description. Since this property is marked in the specification as required, the pipeline editor enforces the constraint.

   - `downloaded file`: This is an output of the 'Download File' component:

     ```yaml
     outputs:
     - {name: downloaded file, type: String, description: 'Content of the downloaded file.'}
     ```

     The pipeline editor renders outputs using read-only widgets.

   - `Component source`: A read-only property that identifies the location from where the component specification was loaded. This property is displayed for informational purposes only.
8. Enter `https://raw.githubusercontent.com/elyra-ai/examples/main/pipelines/run-pipelines-on-kubeflow-pipelines/data/data.csv` as the value for the `URL` input property.

9. Open the properties of the 'Count Rows' node. The specification for the underlying component looks as follows:

   ```yaml
   name: Count Rows
   description: Count the number of rows in the input file
   inputs:
   - {name: input file, type: String, optional: false, description: 'Row-based file to be analyzed'}
   outputs:
   - {name: row count, type: String, description: 'Number of rows in the input file.'}
   ...
   implementation:
     ...
     command: [
       python3,
       /pipelines/component/src/count-rows.py,
       --input-file-path,
       {inputPath: input file},
       ...
   ```

   The component requires one input (`input file`) and produces one output (`row count`), which is the number of rows in this file.

   Note that Kubeflow Pipelines passes the input to the implementing Python script as a file:

   ```yaml
   --input-file-path, {inputPath: input file},
   ```

   The pipeline editor takes this as a cue and renders a selector widget for this input.

   Since the 'Count Rows' node is only connected to one upstream node ('Download File'), you can only choose from the outputs of that node. (An upstream node is a node that is connected to and executed before the node in question.)

   If a node is connected to multiple upstream nodes, you can choose the output of any of these nodes as input. For example, the output of a second download node ('Download Metadata') could not be consumed by the 'Count Rows' node unless the two nodes were connected in the pipeline.

   Elyra intentionally only supports explicit dependencies between nodes to avoid potential usability issues.
10. Save the pipeline.

11. Rename the pipeline to something meaningful:

    - right click on the pipeline editor tab and select `Rename Pipeline...`, OR
    - in the JupyterLab File Browser, right click on the `.pipeline` file
Next, you run the pipeline.
To run the pipeline on Kubeflow Pipelines:
1. Click the `Run` button in the pipeline editor toolbar.

   You can also use the `elyra-pipeline submit` command to run the pipeline using the command line interface.

2. In the run pipeline dialog select the runtime configuration you created when you completed the setup for this tutorial.

3. Start the pipeline run and monitor the execution progress in the Kubeflow Pipelines Central Dashboard.

4. Review the outputs of each pipeline task. The output of the 'Count Rows' node should indicate that the downloaded file contains five rows.

   Elyra does not store custom component outputs in cloud storage. (It only does this for generic pipeline components.) To access the output of custom components, use the Kubeflow Central Dashboard.
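The `elyra-pipeline submit` command mentioned in the first step can be used instead of the Run button. A minimal sketch, assuming the pipeline file is named `run-pipelines-on-kubeflow-pipelines.pipeline` and the runtime configuration is named `kfp-tutorial` (both names are placeholders; substitute your own):

```bash
# Submit the saved pipeline to Kubeflow Pipelines from the command line.
# The pipeline file name and runtime configuration name are placeholders.
elyra-pipeline submit run-pipelines-on-kubeflow-pipelines.pipeline \
  --runtime-config kfp-tutorial
```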
This concludes the Run pipelines on Kubeflow Pipelines tutorial. You've learned how to:
- add custom Kubeflow Pipelines components
- create a pipeline from custom components
The following resources provide additional information:

- Building Components topic in the Kubeflow Pipelines documentation
- Pipelines topic in the Elyra User Guide
- Pipeline components topic in the Elyra User Guide
- Requirements and best practices for custom pipeline components topic in the Elyra User Guide
- Example component catalog connectors
- Component catalog directory
- Command line interface topic in the Elyra User Guide