Skip to content

Sample code for Apache Beam to perform ETL from a stream-processing service (Pub/Sub) to BigQuery using Dataflow as the runner

License

Notifications You must be signed in to change notification settings

MichailParaskevopoulos/heathrow-flights-apache-beam

Repository files navigation

Heathrow Flights Streaming and ETL

This is a sample code for Apache Beam to perform ETL from a stream-processing service (Pub/Sub) to BigQuery using Dataflow as the runner.

Overview

An implementation of Apache Beam to stream the active flights either going to or leaving London Heathrow airport using Pub/Sub, process the data by joining with a dimensional table, and load them to a BigQuery table.

The project consists of the following Java classes for the ETL pipeline:

  1. com.streaming.ETL - Main method for injesting the Pub/Sub messages and the dimensional table, and loading the processed data to BigQuery
  2. com.streaming.messageParsing - Class for parsing the injested Pub/Sub messages into BigQuery row objects (TableRow)
  3. com.streaming.aggregate - Class for converting TableRows to Key/Value pairs that can be grouped by a key and aggregated
  4. com.streaming.join - Class for joining the aggregated Pub/Sub messages with TableRows from the dimensional table

The project also consists of the following scripts:

  1. pom.xml - Configuration file for Apache Maven
  2. message_subscriber - Directory with the com.subscriber.Subscriber Java Class and pom.xml file to set up a Google Cloud Function to receive messages from a streaming service and publish them to Pub/Sub

Executing Apache Beam ETL Pipelines on Google Dataflow using the Cloud SDK

From the same directory as the Apache Beam project execute:

mvn compile exec:java \
-Dexec.mainClass=com.streaming.ETL \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
    --project=${PROJECT} \
    --region=${REGION} \
    --inputTopic=${TOPIC} \
    --runner=DataflowRunner \
    --windowSize=${WINDOW}"

References

This project relies on the following resources:

  1. Streaming data source Heathrow Flights from Ably Hub
  2. OpenFlights dataset for the dimensional tables with airline and aircraft data

About

Sample code for Apache Beam to perform ETL from a stream-processing service (Pub/Sub) to BigQuery using Dataflow as the runner

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages