This is a simple Spark application that reads in data and writes it out to a queryable store. It runs on a single-node Docker image with Spark and stores data to the local file system only, so it is intended purely for demo purposes.
- sbt
- docker
- Execute the build.sh script at the root of this project. This script runs an sbt build and a subsequent Docker build, and tags the resulting image as sumosha/traffic_accidents:latest.
```
docker run -it sumosha/traffic_accidents:latest
```
- This will launch a spark-shell session, where we can create tables, query, and aggregate data directly. See below...
- Since we are running in a non-persistent Spark environment, the data needs to be preloaded as tables using the shell. Here's how:
```scala
// Load the parquet outputs and register them as temporary views so they can be queried with SQL.
val df1 = spark.read.parquet("/tmp/output/311_service_requests.parquet")
df1.createOrReplaceTempView("service_requests")

val df2 = spark.read.parquet("/tmp/output/traffic_accidents.parquet")
df2.createOrReplaceTempView("accidents")
```
- Now we can query the service_requests and accidents tables using Spark SQL. To verify:
spark.sql("select * from service_requests").show()
spark.sql("select * from accidents").show()
- See the Spark SQL reference for the full syntax -- it is largely standard SQL.
- Example queries can be found in the queries folder; a sample aggregation is also sketched below.
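A rough sketch of the kind of aggregation that can be run in the shell (the neighborhood column name is an assumption and may not match the actual schema):

```scala
// Hypothetical aggregation: counts accidents per neighborhood.
// The `neighborhood` column is an assumption, not a confirmed field in the dataset.
spark.sql("""
  SELECT neighborhood, COUNT(*) AS accident_count
  FROM accidents
  GROUP BY neighborhood
  ORDER BY accident_count DESC
""").show()
```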
- For demonstration purposes, the Docker image contains the output of the job on its local disk. In the real world, I would write the data to an S3 or HDFS store.
- This can easily be changed to write to those systems using the s3a:// or hdfs:// URI patterns, as sketched below.
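A minimal sketch of what that change would look like, assuming the relevant Hadoop connectors and credentials are configured (the bucket and namenode names below are placeholders, not real endpoints in this project):

```scala
// Placeholder destinations -- swap in real bucket/namenode locations.
df2.write.mode("overwrite").parquet("s3a://my-bucket/output/traffic_accidents.parquet")
df2.write.mode("overwrite").parquet("hdfs://namenode:8020/output/traffic_accidents.parquet")
```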
- The Parquet format was selected because of its wide support and its fast query-ability (write once, read forever). A small illustration of that follows.
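For instance, because Parquet is columnar, selecting a handful of columns reads only those columns from disk (the column names here are assumptions, not confirmed fields):

```scala
// Column pruning: only the selected columns are read from the parquet files.
// `incident_date` and `severity` are assumed column names used for illustration.
spark.read.parquet("/tmp/output/traffic_accidents.parquet")
  .select("incident_date", "severity")
  .show()
```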
- I would use Hive or AWS Athena to query the output. I would continue to use Spark for more complex analytical queries.
- In reality, this problem can be abstracted into a more generic input --> output pattern, perhaps even with custom functions that perform specific or generic transformations. This toy app could be made more configurable for a wide variety of uses; a rough sketch of that shape follows.
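A hypothetical sketch of that generic shape (the names runJob, inputPath, outputPath, and transform are illustrative and do not exist in this repo):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Generic read -> transform -> write job; the transformation is passed in as a function.
def runJob(spark: SparkSession,
           inputPath: String,
           outputPath: String,
           transform: DataFrame => DataFrame): Unit = {
  val result = transform(spark.read.parquet(inputPath))
  result.write.mode("overwrite").parquet(outputPath)
}

// Example: a pass-through run with no transformation applied.
// runJob(spark, "/tmp/input/traffic_accidents.parquet", "/tmp/output/traffic_accidents.parquet", identity)
```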
- This should run on a Spark cluster when dealing with larger amounts of data. How that cluster is set up and launched depends on the existing infrastructure; one way to keep the code cluster-agnostic is sketched below.
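A minimal sketch of keeping the application cluster-agnostic, assuming the master is supplied at launch time rather than hard-coded:

```scala
import org.apache.spark.sql.SparkSession

// No .master() call here: the master (local[*], YARN, Kubernetes, etc.)
// is supplied at launch time, e.g. via spark-submit configuration.
val spark = SparkSession.builder()
  .appName("traffic-accidents")
  .getOrCreate()
```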