SF crime statistics with Spark Streaming

This project is part of Udacity's Data Streaming Nanodegree program. It implements a streaming application that uses Apache Spark Structured Streaming to ingest data from a Kafka server, which in turn produces San Francisco crime incident data extracted from Kaggle.
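As a rough sketch of the producing side (the topic name, file name, and script layout below are assumptions for illustration, not the repository's actual code), the producer reads the Kaggle extract and publishes one incident per Kafka message using kafka-python:

# Hypothetical producer sketch; topic name and file path are assumptions.
import json
import time

from kafka import KafkaProducer

TOPIC = "sf.crime.police-calls"                            # assumed topic name
INPUT_FILE = "police-department-calls-for-service.json"    # assumed Kaggle extract

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open(INPUT_FILE) as f:
    records = json.load(f)

for record in records:
    producer.send(TOPIC, value=record)   # publish one incident per message
    time.sleep(0.5)                      # throttle to simulate a live stream

producer.flush()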

Install

The project requires Python 3.6 and the dependencies listed in requirements.txt.

After installing Python, set up your environment to run the code in this repository: create a new environment with Anaconda and install the dependencies.

$ conda create --name kafka-spark python=3.6
$ source activate kafka-spark
$ pip install -r requirements.txt

Run

In a terminal or command window, navigate to the top-level project directory (the one containing this README) and run the following commands. You'll need a separate terminal tab for each one:

$ cd scripts/
$ . start.sh
$ . start-zookeeper.sh
$ . start-kafka-server.sh
$ . start-broker.sh

Finally, once all the services are running, you can consume the data either with a plain Kafka consumer or with Spark. Again, navigate to the top-level project directory:

$ python consumer_server.py
$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.4 --master local[*] data_stream.py
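For reference, consumer_server.py can be approximated by a plain kafka-python consumer; the topic name below is an assumption, not necessarily the one used in the repository:

# Minimal consumer sketch (kafka-python); topic name is an assumption.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sf.crime.police-calls",              # assumed topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="sf-crime-consumer",
)

for message in consumer:
    print(message.value.decode("utf-8"))  # print each raw incident record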

Project Requirements

Screenshots

All screenshots are available in the screenshots/ folder.

Questions

1. How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

Depending on the configuration used, we can tune both latency and throughput to our requirements. For example, raising the per-partition ingestion cap (spark.streaming.kafka.maxRatePerPartition) lets the application pull more data per second, which shows up as a higher processedRowsPerSecond in the query progress reports and therefore as higher throughput.
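A hedged illustration of this (topic name and values are assumptions, not the project's exact settings): the per-trigger ingestion is capped on the Kafka source, and the achieved throughput can be read back from the query's progress reports. Run it through the spark-submit command shown above so the Kafka package is on the classpath.

# Sketch: cap per-trigger ingestion and inspect throughput afterwards.
# Topic name and values are assumptions, not the project's exact settings.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaSparkStructuredStreaming").getOrCreate()

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sf.crime.police-calls")   # assumed topic name
    .option("startingOffsets", "earliest")
    .option("maxOffsetsPerTrigger", 200)            # limit records pulled per micro-batch
    .load()
)

query = df.writeStream.format("console").start()

# Each progress report exposes the achieved throughput, e.g.:
# query.lastProgress["processedRowsPerSecond"]

query.awaitTermination()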

2. What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?

I tested different values for the parameters below, and a value of 200 for the first two gave the best performance (see the sketch after this list):

  • spark.streaming.kafka.maxRatePerPartition
  • spark.streaming.kafka.processedRowsPerSecond
  • spark.streaming.backpressure.enabled
  • spark.default.parallelism
  • spark.sql.shuffle.partitions
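For completeness, here is a sketch of how these key/value pairs would be passed to the SparkSession builder. The two 200s match the result above; the remaining values are placeholders rather than the project's final settings, and this is not the exact code in data_stream.py:

# Sketch of wiring the tested properties into the SparkSession builder.
# The 200s match the result above; the other values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("KafkaSparkStructuredStreaming")
    .config("spark.streaming.kafka.maxRatePerPartition", "200")
    .config("spark.streaming.kafka.processedRowsPerSecond", "200")
    .config("spark.streaming.backpressure.enabled", "true")
    .config("spark.default.parallelism", "100")       # placeholder value
    .config("spark.sql.shuffle.partitions", "100")    # placeholder value
    .getOrCreate()
)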

License

The contents of this repository are covered under the MIT License.