A pipeline comprises one or more nodes that are (in many cases) connected to define execution dependencies. Each node is implemented by a component and typically performs only a single task, such as loading data, processing data, training a model, or sending an email.
A generic pipeline comprises nodes that are implemented using generic components. In the current release Elyra includes generic components that run Jupyter notebooks, Python scripts, and R scripts. What generic components have in common is that they are supported in every runtime environment for Elyra pipelines: local/JupyterLab, Apache Airflow, and Kubeflow Pipelines.
The Introduction to generic pipelines tutorial outlines how to create a generic pipeline using the Visual Pipeline Editor.
In this intermediate tutorial you will learn how to run a generic pipeline on Apache Airflow, monitor pipeline execution using the Apache Airflow GUI, and access the outputs.
The tutorial instructions were last updated using Elyra v3.0 and Apache Airflow v1.10.12.
- JupyterLab 3.x with the Elyra extension v3.x (or newer) installed.
- Access to a local or cloud deployment of Apache Airflow that has been configured for use with Elyra.
Apache Airflow version 2.x is currently not supported.
Gather the following information:
- Apache Airflow API endpoint, e.g. `https://your-airflow-webserver:port`
Elyra currently supports Apache Airflow deployments that utilize GitHub or GitHub Enterprise for Directed Acyclic Graph (DAG) storage. Collect the following information:
- GitHub server API endpoint, e.g. `https://api.github.com`
- Name and owner of the repository where DAGs are stored, e.g. `your-git-org/your-dag-repo`. This repository must exist.
- Branch in the named repository, e.g. `test-dags`. This branch must exist.
- Personal access token that Elyra can use to push DAGs to the repository, e.g. `4d79206e616d6520697320426f6e642e204a616d657320426f6e64`
Elyra utilizes S3-compatible cloud storage to make data available to notebooks and Python scripts while they are executed. Any kind of cloud storage should work (e.g. IBM Cloud Object Storage or Minio) as long as it can be accessed from both the machine where JupyterLab is running and the Apache Airflow cluster. Collect the following information (an optional connectivity check follows the list below):
- S3-compatible object storage endpoint, e.g. `http://minio-service.kubernetes:9000`
- S3 object storage username, e.g. `minio`
- S3 object storage password, e.g. `minio123`
- S3 object storage bucket, e.g. `airflow-task-artifacts`
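If you want to confirm up front that the object storage endpoint is reachable from the machine where JupyterLab runs, a quick check with an S3 client such as boto3 (one option among many) can help. The endpoint, credentials, and bucket below are the tutorial's example values; substitute your own.

```python
# Optional connectivity check for the S3-compatible object storage.
# boto3 is just one S3 client option; the values below are the tutorial's
# example values -- replace them with your own.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio-service.kubernetes:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
)

# head_bucket raises an exception if the bucket is unreachable or does not exist.
s3.head_bucket(Bucket="airflow-task-artifacts")
print("Object storage is reachable and the bucket exists.")
```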
This tutorial uses the `run-generic-pipelines-on-apache-airflow` sample from the https://github.com/elyra-ai/examples GitHub repository.
- Launch JupyterLab.
- Open the Git clone wizard (Git > Clone A Repository).
- Enter `https://github.com/elyra-ai/examples.git` as Clone URI.
- In the File Browser navigate to `examples/pipelines/run-generic-pipelines-on-apache-airflow`.

The cloned repository includes a set of Jupyter notebooks and a Python script that download a weather data set from an open data directory called the Data Asset Exchange, cleanse the data, analyze the data, and perform time-series predictions. The repository also includes a pipeline named `hello-generic-world` that runs the files in the appropriate order.
You are ready to start the tutorial.
- Open the `hello-generic-world` pipeline file.
- Right click generic node `Load weather data` and select Open Properties to review its configuration.

A generic node configuration identifies the runtime environment, input artifacts (file to be executed, file dependencies, and environment variables), and output files.
Each generic node is executed in a separate container, which is instantiated using the configured runtime image.
All nodes in this tutorial pipeline are configured to utilize a pre-configured public container image that has Python and the `Pandas` package preinstalled. For your own pipelines you should always utilize custom-built container images that have the appropriate prerequisites installed. Refer to the runtime image configuration topic in the User Guide for more information.

If the container requires a specific minimum amount of resources during execution, you can specify them. For example, to speed up model training, you might want to make GPUs available.
If no custom resource requirements are defined, the defaults in the Apache Airflow environment are used.
Containers in which the notebooks or scripts are executed don't share a file system. Elyra utilizes S3-compatible cloud storage to facilitate the transfer of files from the JupyterLab environment to the containers and between containers.
Therefore you must declare the files that the notebook or script requires and the files that it produces. The node you are inspecting does not have any file input dependencies, but it does produce an output file.
Notebooks and scripts can be parameterized using environment variables. The node you are looking at requires a variable that identifies the download location of a data file (a short sketch of how such a variable is consumed follows below).
Refer to Best practices for file-based pipeline nodes in the User Guide to learn more about considerations for each configuration setting.
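For illustration, a notebook or script typically reads such a variable from its environment at run time. The variable name `DATASET_URL` in the following minimal sketch is an assumption made for this example; the node's properties show the actual variable name and value that the tutorial files use.

```python
import os

# DATASET_URL is an assumed variable name for this sketch; the node's
# properties dialog shows the actual environment variable the notebook expects.
dataset_url = os.getenv("DATASET_URL")
if dataset_url is None:
    raise ValueError("DATASET_URL is not set; configure it in the node properties.")
print(f"The data file will be downloaded from {dataset_url}")
```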
A runtime configuration in Elyra contains connectivity information for an Apache Airflow instance and S3-compatible cloud storage. In this tutorial you will use the GUI to define the configuration, but you can also use the CLI.
- From the pipeline editor tool bar (or the JupyterLab sidebar on the left side) choose Runtimes to open the runtime management panel.
- Click + and New Apache Airflow runtime to create a new configuration for your Apache Airflow deployment.
- Enter a name and a description for the configuration and optionally assign tags to support searching.
- Enter the Apache Airflow server URL, the Kubernetes namespace where Airflow is deployed, and details for the DAG GitHub repository that Airflow is monitoring.
Refer to the runtime configuration documentation for a description of each input field.
- Enter the connectivity information for your S3-compatible cloud storage:
  - The cloud object storage endpoint URL, e.g. `http://minio-service.kubernetes:9000`
  - Username, e.g. `minio`
  - Password, e.g. `minio123`
  - Bucket name, where Elyra will store the pipeline input and output artifacts, e.g. `my-elyra-artifact-bucket`

  Refer to this topic for important information about the Cloud Object Storage credentials secret.
- Save the runtime configuration.
- Expand the twistie in front of the configuration entry.
The displayed links provide access to the configured Apache Airflow GUI and the cloud storage UI (if one is available at the specified URL). Open the links to confirm connectivity.
You can run pipelines from the Visual Pipeline Editor or using the `elyra-pipeline` command line interface.
- Open the run wizard.
- The Pipeline Name is pre-populated with the pipeline file name. The specified name is used to name the DAG in Apache Airflow.
- Select `Apache Airflow` as Runtime platform.
- From the Runtime configuration drop down select the runtime configuration you just created.
- Start the pipeline run. The pipeline artifacts (notebooks, scripts, and file input dependencies) are gathered, packaged, and uploaded to cloud storage. Elyra generates a DAG and pushes it to the GitHub repository branch that you've specified in the runtime configuration.
The DAG name is derived from the pipeline name and concatenated with the current timestamp. Therefore the GitHub repository contains a DAG file for each pipeline run that you initiate from the Visual Pipeline Editor or the Elyra CLI.
The confirmation message contains three links:
- GitHub repository: opens the GitHub repository location where the DAG was saved. If this link returns a 404 error, make sure you are logged in to GitHub and your ID is authorized to access the repository.
- Run details: links to the Apache Airflow GUI, where you can monitor the pipeline execution progress.
- Object storage: links to the cloud storage bucket where the input and output artifacts are stored.
- Open the run details and object storage links in a new browser tab or window.
Elyra does not provide a monitoring interface for Apache Airflow. However, it does provide a link to the Apache Airflow GUI.
- Open the Run Details link to access the Apache Airflow GUI.
The generated DAG is configured to run only once. How soon it is executed depends on how frequently Apache Airflow polls the GitHub repository for changes.
- Click on the `hello-generic-world` DAG and select the Graph View to access the task information.
- Click on any completed task and open the log file. Note that the task is implemented using the `NotebookOp` operator, which downloads the compressed input artifact archive from the cloud storage bucket, extracts the archive, processes the notebook or script, and uploads the output artifacts to the cloud storage bucket.
- Wait for the DAG run to finish.
The DAG run outputs (completed notebooks, script output, and declared output files) are persisted in the cloud storage bucket you've configured in the runtime configuration.
Elyra does not automatically download the output artifacts from the cloud storage bucket to your JupyterLab environment after DAG execution has completed. Therefore you have to use an S3 client or, if configured, the cloud storage's web GUI to access the artifacts.
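For example, a small script using boto3 (just one of many S3 client options) can download everything a run produced. The endpoint, credentials, bucket name, and DAG-name prefix below are the tutorial's example values; substitute your own.

```python
# Download all artifacts that a DAG run stored in the cloud storage bucket.
# boto3 is only one S3 client option; endpoint, credentials, bucket, and the
# DAG-name prefix below are example values -- substitute your own.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio-service.kubernetes:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
)

bucket = "airflow-task-artifacts"
prefix = "hello-generic-world-"  # artifacts are prefixed with the DAG name (pipeline name plus timestamp)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        local_path = os.path.join(*key.split("/"))
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        s3.download_file(bucket, key, local_path)
        print(f"downloaded {key} -> {local_path}")
```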
- Navigate to the bucket you've specified in the runtime configuration to review the content.

The bucket contains, for each node, the following artifacts, which are prefixed with the DAG name:
- a `tar.gz` archive containing the notebook or script and, if applicable, its declared input file dependencies
- if the node is associated with a notebook, the artifacts include the completed notebook with its populated output cells and an HTML version of the completed notebook
- if the node is associated with a script, the artifacts include the console output that the script produced
- if applicable, the declared output files

For example, for the `load_data` notebook that was executed by the `Load weather data` node, the following artifacts should be present:

- `load_data-<UUID>.tar.gz` (input artifacts)
- `load_data.ipynb` (output artifact)
- `load_data.html` (output artifact)
- `data/noaa-weather-data-jfk-airport/jfk_weather.csv` (output artifact)
- Download the output artifacts to your local machine and inspect them.
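For instance, the declared output file can be given a quick look with pandas once it has been downloaded (assuming pandas is installed in your local environment); adjust the path below to wherever you saved the file.

```python
# Quick inspection of the downloaded weather data output artifact with pandas.
# Adjust the path to wherever you saved the file locally.
import pandas as pd

df = pd.read_csv("data/noaa-weather-data-jfk-airport/jfk_weather.csv")
print(df.shape)   # number of rows and columns
print(df.head())  # first few observations
```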
When you run a pipeline from the Visual Pipeline Editor, Elyra generates a DAG and uploads it to the configured GitHub repository. If desired, you can customize the DAG by exporting the pipeline instead:
- Open the pipeline in the Visual Pipeline Editor.
- Click the Export Pipeline button.
- Select Apache Airflow and the Runtime configuration you've created and export the pipeline.
An exported pipeline comprises two parts: the DAG Python code and the input artifact archives that were uploaded to cloud storage.
- Locate the generated `hello_generic_pipeline.py` Python script in the JupyterLab File Browser.
- Open the Python script and briefly review the generated code in the Python editor (a simplified sketch of its structure follows this list).
- To run the DAG, push the file manually to the GitHub repository.
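The generated script defines one `NotebookOp` task per pipeline node and connects the tasks according to the dependencies you defined in the editor. The following is a simplified, hand-written sketch of that structure, not the exported code itself: the DAG id is a placeholder, the import path may vary between Elyra releases, and the real file passes additional operator arguments (the notebook or script to run, the runtime image, and the cloud storage settings) that are omitted here.

```python
# Simplified sketch of the structure of an Elyra-generated DAG for Apache
# Airflow 1.10.x. This is NOT the exported code: the real file passes
# additional NotebookOp arguments (notebook or script to run, runtime image,
# cloud storage settings, declared inputs and outputs) that are omitted here.
from datetime import datetime

from airflow import DAG
from airflow_notebook.pipeline import NotebookOp  # operator used by generated tasks; import path may vary

dag = DAG(
    dag_id="hello-generic-world-0115103025",  # placeholder: pipeline name plus run timestamp
    schedule_interval="@once",                # generated DAGs are configured to run only once
    start_date=datetime(2021, 1, 1),
    catchup=False,
)

load_weather_data = NotebookOp(
    task_id="load_weather_data",
    dag=dag,
    # ... generated arguments derived from the node configuration go here ...
)

analyze_data = NotebookOp(
    task_id="analyze_data",
    dag=dag,
    # ... generated arguments omitted ...
)

# Node connections drawn in the Visual Pipeline Editor become task dependencies.
load_weather_data >> analyze_data
```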
This concludes the Run generic pipelines on Apache Airflow tutorial. You've learned how to
- create an Apache Airflow runtime configuration
- run a pipeline on Apache Airflow
- monitor the pipeline run progress in the Apache Airflow GUI
- access the pipeline run output on cloud storage
- export a pipeline as a DAG
- Pipelines topic in the Elyra User Guide
- Pipeline components topic in the Elyra User Guide
- Best practices for file-based pipeline nodes topic in the Elyra User Guide
- Runtime configuration topic in the Elyra User Guide
- Runtime image configuration topic in the Elyra User Guide
- Command line interface topic in the Elyra User Guide