SF crime statistics with Spark Streaming

This project is part of Udacity's Data Streaming Nanodegree program. It implements a streaming application that uses Apache Spark Structured Streaming to ingest data from a Kafka server, which in turn produces San Francisco crime incident data extracted from Kaggle.
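As a rough sketch of the producing side (the topic name, file name, and script layout below are assumptions for illustration, not the repository's actual code), the producer reads the Kaggle extract and publishes one incident per Kafka message using kafka-python:

# Hypothetical producer sketch; topic name and file path are assumptions.
import json
import time

from kafka import KafkaProducer

TOPIC = "sf.crime.police-calls"                            # assumed topic name
INPUT_FILE = "police-department-calls-for-service.json"    # assumed Kaggle extract

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open(INPUT_FILE) as f:
    records = json.load(f)

for record in records:
    producer.send(TOPIC, value=record)   # publish one incident per message
    time.sleep(0.5)                      # throttle to simulate a live stream

producer.flush()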

Install

The project requires Python 3.6 and the dependencies listed in requirements.txt.

After installing Python, set up your environment to run the code in this repository: create a new environment with Anaconda and install the dependencies.

$ conda create --name kafka-spark python=3.6
$ source activate kafka-spark
$ pip install -r requirements.txt

Run

In a terminal or command window, navigate to the top-level project directory (the one containing this README) and run the following commands. You'll need a separate terminal tab for each one:

$ cd scripts/
$ . start.sh
$ . start-zookeeper.sh
$ . start-kafka-server.sh
$ . start-broker.sh

Finally, once all the services are running, you can consume the data either with a plain Kafka consumer or with Spark. Again, navigate to the top-level project directory:

$ python consumer_server.py
$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.4 --master local[*] data_stream.py
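For reference, consumer_server.py can be approximated by a plain kafka-python consumer; the topic name below is an assumption, not necessarily the one used in the repository:

# Minimal consumer sketch (kafka-python); topic name is an assumption.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sf.crime.police-calls",              # assumed topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="sf-crime-consumer",
)

for message in consumer:
    print(message.value.decode("utf-8"))  # print each raw incident record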

Project Requirements

Screenshots

All screenshots are available in the screenshots/ folder.

Questions

1. How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

Depending on the configuration used, we can tune both latency and throughput to our requirements. For example, raising the per-partition ingestion cap (spark.streaming.kafka.maxRatePerPartition) lets the application pull more data per second, which shows up as a higher processedRowsPerSecond in the query progress reports and therefore as higher throughput.
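A hedged illustration of this (topic name and values are assumptions, not the project's exact settings): the per-trigger ingestion is capped on the Kafka source, and the achieved throughput can be read back from the query's progress reports. Run it through the spark-submit command shown above so the Kafka package is on the classpath.

# Sketch: cap per-trigger ingestion and inspect throughput afterwards.
# Topic name and values are assumptions, not the project's exact settings.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaSparkStructuredStreaming").getOrCreate()

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sf.crime.police-calls")   # assumed topic name
    .option("startingOffsets", "earliest")
    .option("maxOffsetsPerTrigger", 200)            # limit records pulled per micro-batch
    .load()
)

query = df.writeStream.format("console").start()

# Each progress report exposes the achieved throughput, e.g.:
# query.lastProgress["processedRowsPerSecond"]

query.awaitTermination()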

2. What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?

I tested different values for the parameters below, and a value of 200 for the first two gave the best performance (see the sketch after this list):

  • spark.streaming.kafka.maxRatePerPartition
  • spark.streaming.kafka.processedRowsPerSecond
  • spark.streaming.backpressure.enabled
  • spark.default.parallelism
  • spark.sql.shuffle.partitions
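For completeness, here is a sketch of how these key/value pairs would be passed to the SparkSession builder. The two 200s match the result above; the remaining values are placeholders rather than the project's final settings, and this is not the exact code in data_stream.py:

# Sketch of wiring the tested properties into the SparkSession builder.
# The 200s match the result above; the other values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("KafkaSparkStructuredStreaming")
    .config("spark.streaming.kafka.maxRatePerPartition", "200")
    .config("spark.streaming.kafka.processedRowsPerSecond", "200")
    .config("spark.streaming.backpressure.enabled", "true")
    .config("spark.default.parallelism", "100")       # placeholder value
    .config("spark.sql.shuffle.partitions", "100")    # placeholder value
    .getOrCreate()
)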

License

The contents of this repository are covered under the MIT License.