- JSON configuration-driven data movement - no Java/Scala knowledge needed
- Join and transform data among heterogeneous datastores (including NoSQL datastores) using ANSI SQL
- Deploys on Amazon EMR and AWS Fargate, but can run on any Spark cluster
- Picks up datastore credentials stored in HashiCorp Vault or AWS Secrets Manager
- Execution logs and migration history can be routed to Amazon CloudWatch and S3
- Use built-in cron scheduler, or call REST API from external schedulers
... and many more features documented here
Note: DataPull consists of two services: an API written in Java Spring Boot, and a Spark app written in Scala. Although Scala apps can run on JDK 11, the official docs recommend compiling Scala code with Java 8. The effort to upgrade to OpenJDK 11+ is tracked here
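The Docker-based build below already pins a Java 8 toolchain, so no local JDK setup is needed for it. If you choose to compile outside Docker, a quick version check helps avoid JDK mismatches:

```sh
# Confirm a Java 8 JDK is active before compiling the Scala app locally.
# Expect output along the lines of: openjdk version "1.8.0_xxx"
java -version
```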
Pre-requisite: Docker Desktop
- Clone this repo locally and check out the master branch
```sh
git clone git@github.com:homeaway/datapull.git
```
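If you don't have SSH keys registered with GitHub, cloning over HTTPS works too:

```sh
# HTTPS alternative to the SSH clone above
git clone https://github.com/homeaway/datapull.git
```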
- Build a local Docker image for running Spark as a dockerised server

```sh
cd ./datapull
docker build -f ./core/docker_spark_server/Dockerfile -t expedia/spark2.4.8-scala2.11-hadoop2.10.1 ./core/docker_spark_server
```
- Build the Scala JAR from within the `core` folder

```sh
cd ./core
cp ../master_application_config-dev.yml ./src/main/resources/application.yml
docker run \
  -e MAVEN_OPTS="-Xmx1024M -Xss128M -XX:MetaspaceSize=512M -XX:MaxMetaspaceSize=1024M -XX:+CMSClassUnloadingEnabled" \
  --rm \
  -v "${PWD}":/usr/src/mymaven \
  -v "${HOME}/.m2":/root/.m2 \
  -w /usr/src/mymaven \
  maven:3.6.3-jdk-8 mvn clean install
```
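If the build succeeds, the shaded JAR that the next step's spark-submit expects should now be under `target/`. A quick sanity check (the path is taken from the spark-submit command below):

```sh
# Verify the assembled JAR that spark-submit will run in the next step
ls -lh target/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar
```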
- Execute the sample JSON input file `Input_Sample_filesystem-to-filesystem.json`, which moves data from a CSV file `HelloWorld.csv` to a folder of JSON files named `SampleData_Json`.
```sh
docker run \
  -v $(pwd):/core \
  -w /core \
  -it \
  --rm \
  expedia/spark2.4.8-scala2.11-hadoop2.10.1 spark-submit \
  --packages org.apache.spark:spark-sql_2.11:2.4.8,org.apache.spark:spark-avro_2.11:2.4.8,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.8 \
  --deploy-mode client \
  --class core.DataPull \
  target/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar src/main/resources/Samples/Input_Sample_filesystem-to-filesystem.json local
```
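To see what a JSON input file looks like before (or after) running it, you can print the bundled sample that drives this migration:

```sh
# Inspect the sample input file for the CSV-to-JSON move described above
cat src/main/resources/Samples/Input_Sample_filesystem-to-filesystem.json
```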
- Open the relative path `target/classes/SampleData_Json` to find the result of the DataPull, i.e. the data from `target/classes/SampleData/HelloWorld/HelloWorld.csv` transformed into JSON.
Pro-tip: The folder `target/classes/SampleData_Json` is created by the Docker Spark container, so you will not be able to delete it until you take ownership of it by running `sudo chown -R $(whoami):$(whoami) .`
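To confirm the run succeeded from the terminal, you can list the output folder and peek at a few records. The part-file names are Spark-generated, so exact names will vary:

```sh
# List Spark's output; names like part-00000-<uuid>.json are auto-generated
ls target/classes/SampleData_Json
# Peek at the first few transformed records
head -n 3 target/classes/SampleData_Json/part-*
```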
Pre-requisite: IntelliJ with the Scala plugin configured. Check out this Help page if the plugin is not installed.
- Clone this repo locally and check out the master branch
- Open the folder core in IntelliJ IDE.
- When prompted, add this project as a maven project.
- By default, this source code is designed to execute the sample JSON input file `Input_Sample_filesystem-to-filesystem.json`, which moves data from a CSV file `HelloWorld.csv` to a folder of JSON files named `SampleData_Json`.
- Go to File > Project Structure..., and choose 1.8 (Java version) as the Project SDK
- Go to Run > Edit Configurations... , and do the following
- Create an Application configuration (use the + sign on the top left corner of the modal window)
- Set the Name to Debug
- Set the Main Class to `core.DataPull` (matching the `--class` argument used with spark-submit above)
- Use the classpath of module `core.DataPull`
- Set JRE to 1.8
- Click Apply and then OK
- Click Run > Debug 'Debug' to start the debug execution
- Open the relative path `target/classes/SampleData_Json` to find the result of the DataPull, i.e. the data from `target/classes/SampleData/HelloWorld/HelloWorld.csv` transformed into JSON.
Deploying DataPull to AWS involves:
- installing the DataPull API and Spark JAR in AWS Fargate, using this runbook
- running DataPulls in Amazon EMR, using this runbook
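As noted in the feature list, external schedulers can trigger DataPull through the API's REST interface once it is deployed. The host and endpoint path below are illustrative assumptions, not the documented contract; check the deployed API's documentation for the real route and payload:

```sh
# Hypothetical trigger from an external scheduler (cron, Airflow, etc.).
# Replace the host and path with your deployed DataPull API's actual endpoint.
curl -X POST "https://your-datapull-api.example.com/api/v1/DataPullPipeline" \
  -H "Content-Type: application/json" \
  -d @Input_Sample_filesystem-to-filesystem.json
```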
Please follow the instructions in manual-tests/README.md
Please create an issue in this git repo, using the bug report or feature request templates.
DataPull documentation is available at https://homeaway.github.io/datapull/. To update this documentation, please follow these steps:
- Fork the DataPull repo
- In a terminal, from the root of the repo, run one of the following:
    - if Docker is installed, run
      ```sh
      docker run --rm -it -p 8000:8000 -v ${PWD}/docs:/docs squidfunk/mkdocs-material
      ```
    - or, if MkDocs and Material for MkDocs are installed, run
      ```sh
      cd docs
      mkdocs serve
      ```
- Open http://127.0.0.1:8000 to see a preview of the documentation site. You can edit the documentation by following https://www.mkdocs.org/#getting-started
- Once you're done updating the documentation, commit and push your updates to your forked repo.
- In a terminal, from the root of the forked repo, run one of the following command blocks to update and push your `gh-pages` branch:
    - if Docker is installed, run
      ```sh
      docker run --rm -it -v ~/.ssh:/root/.ssh -v ${PWD}:/docs squidfunk/mkdocs-material gh-deploy --config-file /docs/docs/mkdocs.yml
      ```
    - or, if MkDocs and Material for MkDocs are installed, run
      ```sh
      cd docs
      mkdocs gh-deploy
      ```
- Create 2 PRs (one for the forked repo branch that you updated, another for the `gh-pages` branch) and we'll review and approve them.
Thanks again for helping make DataPull better!