An example demonstrating the NiFi Google Cloud (GC) Dataflow Job Runner
The repository contains an example demonstrating how the NiFi GC Dataflow Job Runner functions. The example includes two GC Dataflow jobs that are assumed to run in sequence. This README describes in detail only the part relevant to the repository itself; the overall example is described at [TBD].
The example suggests performing four steps:
- Integrate the NiFi GC Dataflow Job Runner package into the Apache NiFi bundle
- Create job templates based on the code from this repository
- Create and tune an Apache NiFi data flow based on the template from this repository
- Run the data flow
Instructions on how to do this can be found here: NiFi GC Dataflow Job Runner
GC Dataflow jobs are implemented with the Apache Beam framework in the form of pipelines. Apache Beam allows running a data-processing pipeline on different cluster platforms. To make this possible, Apache Beam provides a number of platform-dependent runners. One of these platforms is Google Cloud Platform (GCP), and the corresponding runner is the DataflowRunner. The provided example is based on GCP.
Once the code of a data-processing job is written, there are three options: 1) run it locally with the DirectRunner (used for tests only, not suitable for large-scale data processing); 2) run it directly on GC Dataflow; 3) create a job template and run it later. The third option is used in this example.
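In Beam terms the choice between these modes is just a matter of pipeline options. A minimal sketch, assuming the standard Beam and Dataflow option classes (the class name `RunModes` and the GCS path are placeholders, not from this repository):

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunModes {
    public static void main(String[] args) {
        DataflowPipelineOptions options =
                PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

        // Option 1: local test run with the DirectRunner (small data only).
        // options.setRunner(DirectRunner.class);

        // Option 2: launch the job directly on GC Dataflow.
        // options.setRunner(DataflowRunner.class);

        // Option 3 (used in this example): stage a reusable job template
        // instead of launching the job; the GCS path is a placeholder.
        options.setRunner(DataflowRunner.class);
        options.setTemplateLocation("gs://my-bucket/templates/my-template");
    }
}
```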
The example provides two Apache Beam pipelines. The first pipeline is BuildVocabulary. It accepts texts stored in [GC Storage](https://cloud.google.com/storage/) and builds a vocabulary of word stems. A Java implementation of the Porter stemming algorithm is used for stem extraction. The vocabulary is a text document in which each line has the following structure:
`<text file name>/<unique word stem>/<count of the stem in the text file>`
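For illustration only (the file name and counts here are invented), a few vocabulary lines might look like this:

```
hamlet.txt/love/17
hamlet.txt/king/42
hamlet.txt/denmark/9
```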
The second pipeline is ComputeLexicalDiversity. It accepts the vocabulary from the previous pipeline and computes the lexical diversity (or type-token ratio, TTR) of a text using the formula:
`TTR = <number of unique word stems> / <total number of words>`
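For example, if a text contains 200 words in total and its vocabulary lists 80 unique stems, then TTR = 80 / 200 = 0.4.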
- Clone the project as usual. The project is not runnable from scratch because some specific properties must be set first.
- So, make it runnable again (a configuration sketch follows this list):
  - Go to DataFlowDefaultOptionsBuilder and set the parameters of the runner. You may consult the official documentation. Only `options.region` is set (to "europe-west1") and `options.maxNumWorkers` (to one). Change them to more appropriate values if you want.
  - Go to BuildVocabulary and set `options.templateLocation`. It is the template path of the vocabulary-building job. It should look like `gs://bucket name/some path/template name`, i.e. it is a path within GC Storage.
  - Do the same for ComputeLexicalDiversity.

  Pay attention that to create and push a template it is necessary to: 1) set `options.runner` to DataflowRunner; 2) set `options.templateLocation` to the template path. If `options.templateLocation` is not set when you run the code, the job itself will be launched on the GC Dataflow platform.
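A minimal sketch of what such an options builder might set, assuming the standard Beam Dataflow option classes (the class name comes from the repository, but its body here is an assumption; the project ID, temp location, and template path are placeholders — only the region and maxNumWorkers values come from this README):

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataFlowDefaultOptionsBuilder {
    // Builds Dataflow options with the defaults described above.
    public static DataflowPipelineOptions build(String templateLocation) {
        DataflowPipelineOptions options =
                PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
        options.setProject("my-gcp-project");           // placeholder GCP project ID
        options.setTempLocation("gs://my-bucket/temp"); // staging area required by Dataflow
        options.setRegion("europe-west1");              // default from this README
        options.setMaxNumWorkers(1);                    // default from this README
        options.setRunner(DataflowRunner.class);        // required to create a template
        options.setTemplateLocation(templateLocation);  // e.g. gs://bucket/path/template-name
        return options;
    }
}
```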
- Templates can now be created and pushed to GC Storage (a sketch of such a main follows this list):
  - Run the main of BuildVocabulary
  - Run the main of ComputeLexicalDiversity
- As a result, you should see the templates at the given paths.
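A sketch of what such a main might do (the pipeline body is elided, the input and template paths are placeholders, and the call assumes the DataFlowDefaultOptionsBuilder sketched above — this is not the repository's actual implementation):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;

public class BuildVocabulary {
    public static void main(String[] args) {
        // Placeholder template path; the builder is the one sketched above.
        DataflowPipelineOptions options =
                DataFlowDefaultOptionsBuilder.build("gs://my-bucket/templates/build-vocabulary");

        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply("ReadTexts", TextIO.read().from("gs://my-bucket/texts/*.txt"));
        // ... stemming and per-stem counting transforms are elided ...

        // With options.templateLocation set, run() stages a template in GC Storage
        // instead of launching the job.
        pipeline.run();
    }
}
```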
The created job templates are supposed to be used in an illustrative Apache NiFi data flow. The NiFi data flow specification is stored in an XML file. The flow uses special processors called NiFi GC Dataflow Job Runners. It is assumed that the NiFi platform has already been extended with them (see NiFi GC Dataflow Job Runner).
The flow specifies the following processing:
- List the text file names stored in GC Storage at the given path
- For each text file, a stem vocabulary is built with the BuildVocabulary job
- For each vocabulary, the TTR is computed with the ComputeLexicalDiversity job
While the GC Dataflow jobs are being called, special notifications of job statuses are generated. They are passed to the specified Slack channel.