Serverless Data Processing Workshop

In this workshop you'll explore approaches for processing data using serverless architectures. You'll build processing infrastructure that enables operations personnel at Wild Rydes headquarters to monitor the health of the unicorn fleet. Each unicorn is equipped with a sensor that reports its location and vitals, and you'll explore approaches for processing this data both in batches and in real time.

To build this infrastructure, you will use AWS Lambda, Amazon S3, Amazon Kinesis, Amazon DynamoDB, and Amazon Athena. You'll create Lambda functions to process files and streams, use DynamoDB to persist unicorn vitals, build a serverless application that aggregates these data points using Kinesis Analytics, archive the raw data using Kinesis Firehose and Amazon S3, and use Amazon Athena to run ad hoc queries against the raw data.

Prerequisites

AWS Account

To complete this workshop, you'll need an AWS account with access to create AWS Identity and Access Management (IAM), Amazon Simple Storage Service (S3), Amazon DynamoDB, AWS Lambda, Amazon Kinesis Streams, Amazon Kinesis Analytics, Amazon Kinesis Firehose, and Amazon Athena resources.

The code and instructions in this workshop assume that only one student is using a given AWS account at a time. If you share an account with another student, you'll run into naming conflicts for certain resources. You can work around these conflicts by adding a suffix to your resource names or by using distinct Regions, but the instructions do not detail the changes required to make this work.

Region

Choose an AWS Region in which to run the workshop that supports the complete set of services covered in the material, including AWS Lambda, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon Kinesis Analytics, and Amazon Athena. Use the Region Table to determine which services are available in a Region. Regions that support these services include US East (N. Virginia) and US West (Oregon).

Kinesis Command-Line Clients

The modules that involve streaming data and Amazon Kinesis use two command-line clients to simulate and display sensor data from the unicorns in the fleet.

Producer

The producer generates sensor data from a unicorn taking a passenger on a Wild Ryde. Each second, it emits the location of the unicorn as a latitude and longitude point, the distance traveled in meters in the previous second, and the unicorn's current level of magic and health points.
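
To make the data concrete, the sketch below shows roughly what one of these per-second readings could look like when serialized as JSON and written to a Kinesis stream with the AWS SDK for Go. The struct fields, sample values, and partition-key choice are illustrative assumptions based on the description above, not the producer's exact wire format; the stream name and Region match the producer's defaults shown later in the setup steps.

    package main

    import (
        "encoding/json"
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/kinesis"
    )

    // SensorReading approximates one per-second message from a unicorn's sensor.
    // The field names here are assumptions for illustration only.
    type SensorReading struct {
        Name         string  `json:"Name"`
        Latitude     float64 `json:"Latitude"`
        Longitude    float64 `json:"Longitude"`
        Distance     int     `json:"Distance"` // meters traveled in the previous second
        MagicPoints  int     `json:"MagicPoints"`
        HealthPoints int     `json:"HealthPoints"`
    }

    func main() {
        sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
        kc := kinesis.New(sess)

        reading := SensorReading{
            Name: "Shadowfax", Latitude: 42.27, Longitude: -83.73,
            Distance: 175, MagicPoints: 110, HealthPoints: 150,
        }
        data, err := json.Marshal(reading)
        if err != nil {
            log.Fatal(err)
        }

        // Write one record, keyed by unicorn name so each unicorn's
        // readings stay in order on the same shard.
        _, err = kc.PutRecord(&kinesis.PutRecordInput{
            StreamName:   aws.String("wildrydes"),
            PartitionKey: aws.String(reading.Name),
            Data:         data,
        })
        if err != nil {
            log.Fatal(err)
        }
        log.Println("put one sensor reading")
    }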

Consumer

The consumer reads formatted JSON messages from an Amazon Kinesis stream and displays them, letting you monitor in real time what's being sent to the stream. Using the consumer, you can see the data the producer is sending and how your applications are processing that data.
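
As a rough sketch of what the consumer does under the hood, the snippet below tails a Kinesis stream with the AWS SDK for Go and prints each record's JSON payload. It assumes a stream with a single shard carrying the conventional ID shardId-000000000000 and uses the workshop's default stream name and Region; the real consumer is more robust about shard handling and formatting.

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/kinesis"
    )

    func main() {
        sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
        kc := kinesis.New(sess)

        // Start reading records that arrive after this point, from one shard.
        it, err := kc.GetShardIterator(&kinesis.GetShardIteratorInput{
            StreamName:        aws.String("wildrydes"),
            ShardId:           aws.String("shardId-000000000000"),
            ShardIteratorType: aws.String("LATEST"),
        })
        if err != nil {
            log.Fatal(err)
        }

        iterator := it.ShardIterator
        for {
            out, err := kc.GetRecords(&kinesis.GetRecordsInput{ShardIterator: iterator})
            if err != nil {
                log.Fatal(err)
            }
            for _, r := range out.Records {
                fmt.Println(string(r.Data)) // each record is a JSON sensor reading
            }
            iterator = out.NextShardIterator
            time.Sleep(time.Second)
        }
    }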

Setup

The producer and consumer are small programs written in the Go programming language. The instructions below walk through downloading binaries for macOS, Windows, or Linux and preparing them for use. If you prefer to inspect and build them yourself, the source code is included in this repository and can be compiled using Go.

  1. Using the AWS Command Line Interface or the provided links, copy the command-line clients built for your platform from an S3 bucket to your local system:

    macOS (producer, consumer)

    aws s3 cp --recursive s3://wildrydes-us-east-1/DataProcessing/kinesis-clients/macos/ .
    chmod a+x producer consumer

    Windows (producer.exe, consumer.exe)

    aws s3 cp --recursive s3://wildrydes-us-east-1/DataProcessing/kinesis-clients/windows/ .

    Linux (producer, consumer)

    aws s3 cp --recursive s3://wildrydes-us-east-1/DataProcessing/kinesis-clients/linux/ .
    chmod a+x producer consumer
  2. Run the producer with -h to view its command-line arguments:

    macOS / Linux

    $ ./producer -h
      -name string
      Unicorn Name (default "Shadowfax")
      -region string
      Region (default "us-east-1")
      -stream string
      Stream Name (default "wildrydes")

    Windows

    C:\Downloads>producer.exe -h
      -name string
      Unicorn Name (default "Shadowfax")
      -region string
      Region (default "us-east-1")
      -stream string
      Stream Name (default "wildrydes")

    Note the defaults. Running this command without any arguments will send data for a unicorn named Shadowfax to a stream named wildrydes in US East (N. Virginia).

  3. Run the consumer with -h to view its command-line arguments:

    macOS / Linux

    $ ./consumer -h
      -region string
      Region (default "us-east-1")
      -stream string
      Stream Name (default "wildrydes")

    Windows

    C:\Downloads>consumer.exe -h
      -region string
      Region (default "us-east-1")
      -stream string
      Stream Name (default "wildrydes")

    Note the defaults. Running this command without any arguments will read from the stream named wildrydes in US East (N. Virginia).

  4. The command-line clients require authentication credentials with the permission to put and get records from Amazon Kinesis Streams. These credentials can be provided to the clients by either:

    1. Using a shared credentials file

      This credentials file is the same one used by other SDKs and the AWS Command Line Interface. If you're already using a shared credentials file, you can use it for this purpose, too. If you've not yet configured credentials, run aws configure to interactively configure the CLI:

      $ aws configure
      AWS Access Key ID [None]: YOUR_ACCESS_KEY_ID_HERE
      AWS Secret Access Key [None]: YOUR_SECRET_ACCESS_KEY_HERE

      If you'd like to use a named profile, you'll need to set an environment variable with the key AWS_PROFILE and the value of the profile name to use:

      export AWS_PROFILE=workshop
    2. Using environment variables

      The clients can also use credentials set in your environment to sign requests to AWS. Set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables locally with your credentials.

      macOS / Linux

      export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID_HERE
      export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY_HERE

      Windows

      set AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID_HERE
      set AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY_HERE

    See the AWS SDK for Go configuration documentation for more details.
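
    For context, both mechanisms work because the AWS SDK for Go resolves credentials through a default chain: environment variables are checked first, then the shared credentials file. Below is a minimal sketch of how a client built on the v1 SDK picks them up, honoring AWS_PROFILE when shared config is enabled.

      package main

      import (
          "fmt"
          "log"

          "github.com/aws/aws-sdk-go/aws/session"
      )

      func main() {
          // Build a session using the SDK's default credential chain.
          // SharedConfigEnable also honors the AWS_PROFILE environment variable.
          sess := session.Must(session.NewSessionWithOptions(session.Options{
              SharedConfigState: session.SharedConfigEnable,
          }))

          creds, err := sess.Config.Credentials.Get()
          if err != nil {
              log.Fatal("no credentials found: ", err)
          }
          fmt.Println("using credentials from", creds.ProviderName)
      }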

Modules

  1. File Processing
  2. Real-time Data Streaming
  3. Streaming Aggregation
  4. Stream Processing
  5. Data Archiving

After you have completed the workshop, you can delete all of the resources that were created by following the clean-up guide.