This is a getting started guide to XGBoost4J-Spark on Databricks. At the end of this guide, the reader will be able to run a sample Apache Spark application that runs on NVIDIA GPUs on Databricks.
- Apache Spark 2.4+ running in Databricks Runtime 5.3 ML with GPU, 5.4 ML with GPU, or 5.5 ML with GPU. Make sure it matches the hardware and software requirements below.
- Hardware Requirements
- NVIDIA Pascal™ GPU architecture or better
- Multi-node clusters with homogeneous GPU configuration
- Software Requirements
- Ubuntu 16.04/CentOS
- CUDA V10.1/10.0/9.2
- NVIDIA driver compatible with your CUDA
- NCCL 2.4.7
The number of GPUs per node dictates the number of Spark executors that can run in that node. Each executor should only be allowed to run 1 task at any given time.
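The executor rule above can be sketched as a small helper that derives Spark settings from the cluster shape. The function name and the example cluster size are hypothetical. Expressing "one task per executor" by setting `spark.task.cpus` equal to `spark.executor.cores` is one possible approach, not necessarily what the initialization notebooks configure.

```python
def executor_settings(num_nodes, gpus_per_node, cores_per_executor):
    """Derive Spark settings for one executor per GPU and one task per executor.

    Hypothetical helper for illustration; on Databricks these values would
    normally be entered in the cluster's Spark config.
    """
    if min(num_nodes, gpus_per_node, cores_per_executor) < 1:
        raise ValueError("all arguments must be positive integers")
    return {
        # One executor per GPU across the whole cluster.
        "spark.executor.instances": str(num_nodes * gpus_per_node),
        "spark.executor.cores": str(cores_per_executor),
        # A task that claims every executor core leaves no room for a second task.
        "spark.task.cpus": str(cores_per_executor),
    }


# Assumed example: 4 single-GPU worker nodes with 8 cores per executor.
settings = executor_settings(num_nodes=4, gpus_per_node=1, cores_per_executor=8)
for key, value in sorted(settings.items()):
    print(key, value)
```

With single-GPU nodes, as this guide requires, the executor count equals the number of worker nodes.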
Create a Databricks cluster (Clusters -> + Create Cluster) that meets the above prerequisites.
- Make sure to use one of the 5.3 ML with GPU, 5.4 ML with GPU, or 5.5 LTS ML with GPU Databricks runtimes.
- Use nodes with 1 GPU each such as p3.xlarge or Standard_NC6s_v3. We currently don't support nodes with multiple GPUs. p2 (AWS) and NC12/24 (Azure) nodes do not meet the architecture requirements for the XGBoost worker (although they can be used for the driver node).
- Under Autopilot Options, disable autoscaling.
- Choose the number of workers that matches the number of GPUs you want to use.
- Select a worker type that has 1 GPU, such as p3.xlarge or NC6s_v3.
- After you start a Databricks cluster, use the initialization notebooks (5.3 & 5.4 notebook or 5.5 notebook) to set up execution.
The initialization notebooks will perform the following steps:
1. Download the CUDA and RAPIDS XGBoost4J-Spark jars
2. Create a new directory for the initialization script in the Databricks file system (DBFS)
3. Create an initialization script inside the new directory that copies the jars into the Databricks jar directory
4. Download and decompress the Sample Mortgage Notebook dataset
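Steps 2 and 3 can be sketched roughly as follows. This is not the code from the initialization notebooks. The jar file name is a placeholder, `/databricks/jars/` is assumed to be the Databricks jar directory, and a local temporary directory stands in for DBFS so that the sketch runs anywhere.

```python
import os
import tempfile


def build_init_script(jar_paths, target_dir="/databricks/jars/"):
    """Return the text of a shell script that copies each jar into target_dir.

    target_dir is an assumption about where Databricks loads jars from.
    """
    lines = ["#!/bin/bash"]
    lines += [f"cp {path} {target_dir}" for path in jar_paths]
    return "\n".join(lines) + "\n"


def write_init_script(script_dir, script_text, name="init.sh"):
    """Create script_dir if needed and write the script into it."""
    os.makedirs(script_dir, exist_ok=True)
    script_path = os.path.join(script_dir, name)
    with open(script_path, "w") as handle:
        handle.write(script_text)
    return script_path


# Placeholder jar name; the real jar names come from the download step.
jars = ["/dbfs/FileStore/jars/example-xgboost4j.jar"]
script = build_init_script(jars)

# A temporary directory stands in for dbfs:/databricks/init_scripts here.
path = write_init_script(os.path.join(tempfile.mkdtemp(), "init_scripts"), script)
print(script)
```

In a Databricks notebook the script would be written to DBFS rather than to a local directory.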
After executing the steps in the initialization notebook, complete the two procedures below, "1. Cluster initialization script" and "2. Install the xgboost4j_spark jar in the cluster", to ensure the cluster is ready for XGBoost training.
- See Initialization scripts for how to configure cluster initialization scripts.
- Edit your cluster and add the initialization script dbfs:/databricks/init_scripts/init.sh in the "Init Scripts" tab under "Advanced Options"
- Reboot the cluster
- See Libraries for how to install jars from DBFS
- Go to "Libraries" tab under your cluster and install dbfs:/FileStore/jars/xgboost4j-spark_2.x-1.0.0-Beta3.jar in your cluster by selecting the "DBFS" option for installing jars
These steps ensure that you have a GPU cluster ready for importing XGBoost notebooks or creating your own XGBoost application for training.
- See Managing Notebooks on how to import a notebook.
- Import the example notebook: XGBoost4j-Spark mortgage notebook
- Inside the mortgage example notebook, update the data paths:
  - "/data/datasets/mortgage-small/train" to "dbfs:/FileStore/tables/mortgage/csv/train/mortgage_train_merged.csv"
  - "/data/datasets/mortgage-small/eval" to "dbfs:/FileStore/tables/mortgage/csv/test/mortgage_eval_merged.csv"
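The path update can be written down as a simple lookup, which helps if the same paths appear in more than one cell. The names `PATH_UPDATES` and `update_data_path` are hypothetical and do not come from the notebook. The paths themselves are the ones listed above.

```python
# Old notebook paths mapped to the DBFS locations of the downloaded dataset.
PATH_UPDATES = {
    "/data/datasets/mortgage-small/train":
        "dbfs:/FileStore/tables/mortgage/csv/train/mortgage_train_merged.csv",
    "/data/datasets/mortgage-small/eval":
        "dbfs:/FileStore/tables/mortgage/csv/test/mortgage_eval_merged.csv",
}


def update_data_path(path):
    """Return the DBFS replacement for a known path, or the path unchanged."""
    return PATH_UPDATES.get(path, path)


train_path = update_data_path("/data/datasets/mortgage-small/train")
eval_path = update_data_path("/data/datasets/mortgage-small/eval")
print(train_path)
print(eval_path)
```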
The example notebook comes with the following configuration, which you can adjust to suit your setup. For the supported configuration options, see: xgboost parameters
params = {
    'eta': 0.1,
    'gamma': 0.1,
    'missing': 0.0,
    'treeMethod': 'gpu_hist',
    'maxDepth': 10,
    'maxLeaves': 256,
    'growPolicy': 'depthwise',
    'minChildWeight': 30.0,
    'lambda_': 1.0,
    'scalePosWeight': 2.0,
    'subsample': 1.0,
    'nthread': 1,
    'numRound': 100,
    'numWorkers': 1,
}
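One way to adjust the configuration without editing the original dictionary is sketched below. The helper `with_overrides` is hypothetical, and only a subset of the notebook's parameters is repeated to keep the example short. Setting `numWorkers` to the number of GPUs assumes one XGBoost worker per single-GPU node, in line with the cluster setup described earlier.

```python
# Subset of the notebook configuration, repeated so the sketch is self-contained.
base_params = {
    'eta': 0.1,
    'treeMethod': 'gpu_hist',
    'maxDepth': 10,
    'numRound': 100,
    'numWorkers': 1,
}


def with_overrides(params, num_gpus, **overrides):
    """Return a copy of params with overrides applied and numWorkers set.

    Unknown keys are rejected so that a typo does not silently add a
    parameter that is never read.
    """
    unknown = set(overrides) - set(params)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    updated = dict(params)
    updated.update(overrides)
    # Assumption: one XGBoost worker per GPU, one GPU per worker node.
    updated['numWorkers'] = num_gpus
    return updated


tuned = with_overrides(base_params, num_gpus=4, maxDepth=8, numRound=200)
print(tuned)
```

Copying the dictionary keeps the notebook defaults available for comparison runs.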
- Run all the cells in the notebook.
- View the results. Cells 5 (Training), 7 (Transforming), and 8 (Accuracy of Evaluation) show the output.
--------------
==> Benchmark: Training takes 6.48 seconds
--------------
--------------
==> Benchmark: Transformation takes 3.2 seconds
--------------
------Accuracy of Evaluation------
Accuracy is 0.9980699597729774
* The timings in this Getting Started guide are only illustrative. Please see our release announcement for official benchmarks.
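For readers writing their own application, output of the kind shown above could be produced by helpers along these lines. This is a sketch, not the notebook's implementation. The names `benchmark` and `accuracy` are hypothetical, and accuracy is assumed to mean the fraction of predictions that match the labels.

```python
import time


def benchmark(phase, action):
    """Run action(), print how long it took, and return (result, seconds)."""
    start = time.time()
    result = action()
    elapsed = round(time.time() - start, 2)
    print("--------------")
    print(f"==> Benchmark: {phase} takes {elapsed} seconds")
    print("--------------")
    return result, elapsed


def accuracy(labels, predictions):
    """Fraction of predictions equal to the corresponding label."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    matches = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return matches / len(labels)


# Toy stand-in for a training step; a real application would fit a model here.
total, seconds = benchmark("Training", lambda: sum(range(1000)))

score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print("------Accuracy of Evaluation------")
print(f"Accuracy is {score}")
```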