Google Dataproc Workflow template

To manage batch jobs, we use Google Cloud Dataproc Workflow templates.

Create Google Cloud Workflow

The following commands create the workflow with all parameters, ready to be triggered by the EventsAPI when needed.

  1. Log in with the gcloud CLI: gcloud init
  2. gcloud dataproc workflow-templates import epic-spark --source workflow.yaml

You can check the result in the Dataproc web interface.
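For reference, the workflow.yaml imported above follows the Dataproc workflow template schema: a managed cluster definition, a list of jobs, and a list of parameters whose fields point at job arguments. The sketch below only illustrates that shape; the cluster name, zone, jar path, step id, and parameter name are placeholders, not the actual values from this repository's workflow.yaml.

# Sketch of the shape accepted by `gcloud dataproc workflow-templates import`.
# All names below are placeholders.
placement:
  managedCluster:
    clusterName: epic-spark-cluster      # placeholder cluster name
    config:
      gceClusterConfig:
        zoneUri: us-central1-a           # placeholder zone
jobs:
- sparkJob:
    mainJarFileUri: gs://epic-spark-jars/media-spark.jar
    args:
    - gs://epic-historic-tweets/random/*
  stepId: media-spark                    # placeholder step id
parameters:
- name: EVENT                            # placeholder parameter name
  fields:
  - jobs['media-spark'].sparkJob.args[0]

When the template is instantiated, the parameter value is substituted into each listed field, which is how a run is pointed at a specific event's data. A run can also be triggered manually with gcloud dataproc workflow-templates instantiate epic-spark --parameters=EVENT=<value> (parameter name assumed).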

Creating a new Spark job

Requires: mvn (Apache Maven)

  1. mvn archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4
  2. Add the Spark dependency and the shading plugin to the generated pom.xml (see the media-spark pom.xml as an example)
  3. Create the Spark job. It should take an event name as its only argument; the event name directs the Spark job to run against a specific event. Make sure results are also written to separate folders per event (see the sketch after this list).
  4. Generate the jar: mvn package
  5. Upload the jar from the target folder to the epic-spark-jars Google Cloud bucket
  6. Add the job to workflow.yaml under the jobs tag. Use the following template, replacing the jar file and step id with your own (the step id must be unique):
- sparkJob:
    args:
    - gs://epic-historic-tweets/random/*
    mainJarFileUri: gs://epic-spark-jars/YOUR_JAR_FILE.jar
  stepId: YOUR_STEP_ID
  7. Add your step to the event parameter (see the fields list). Make sure to replace YOUR_STEP_ID with the step id you set in the previous step:
  - jobs['YOUR_STEP_ID'].sparkJob.args[0]
  8. Update the workflow: gcloud dataproc workflow-templates import epic-spark --source workflow.yaml
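For step 3, a minimal sketch of such a job's entry point is shown below. Only the single-argument contract comes from this README; the package name, class name, result bucket, and the trivial filter logic are illustrative assumptions. Note that the sample workflow args above pass a full gs:// glob, so existing jobs may treat the argument as an input path rather than a bare event name.

// Sketch only: package, class, result bucket, and filter logic are placeholders.
package edu.colorado.epic;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class EventJob {
    public static void main(String[] args) {
        // The event name is the job's only argument (per step 3 above).
        if (args.length != 1) {
            System.err.println("Usage: EventJob <event-name>");
            System.exit(1);
        }
        String event = args[0];

        SparkConf conf = new SparkConf().setAppName("epic-spark-" + event);
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the event's tweets and keep non-empty lines (stand-in for the real analysis).
        JavaRDD<String> tweets = sc.textFile("gs://epic-historic-tweets/" + event + "/*");
        JavaRDD<String> results = tweets.filter(line -> !line.isEmpty());

        // Write results into a folder namespaced by event so runs never collide
        // (the epic-spark-results bucket is an assumed name).
        results.saveAsTextFile("gs://epic-spark-results/" + event + "/output");
        sc.stop();
    }
}

After mvn package, the shaded jar in the target folder is what gets uploaded to the epic-spark-jars bucket in step 5.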