This repository consists of examples of machine learning techniques such as bootstrap and clustered bootstrap, implemented using Apache Spark and H2O Sparkling Water.
The repository consists of the following sections:
- spark-examples-scala-shell
- spark-examples-python-notebooks
- h2o-examples-scala-ide
- h2o-examples-flow_ui
- data
The spark-examples-scala-shell section consists of Scala shell examples of generalized linear models (linear regression) and Gradient Boosted Machines, each combined with bootstrap or clustered bootstrap. The examples also show how to fit distributed models on bootstrap samples in parallel in Spark.
In these examples, we demonstrate bootstrap on GLM using both Apache Spark MLlib and H2O Sparkling Water. Shell scripts include:
- Perform bootstrap on GLM in parallel using Sparkling Water (CDH 5.5.1): spark-sparkling_water-glm-bootstrap_cdh551.scala
- Perform clustered bootstrap on GBM in parallel using Sparkling Water (CDH 5.5.1): spark-sparkling_water-gbm-clustered_bootstrap_cdh551.scala
- Perform clustered bootstrap on GLM in parallel using plain Apache Spark MLlib (CDH 5.5.1): spark-mllib-glm-clustered_bootstrap_cdh551.scala
- Perform clustered bootstrap on GLM in parallel using plain Apache Spark MLlib (CDH 5.7.0): spark-mllib-glm-clustered_bootstrap_cdh570.scala
More on this section here
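As a rough illustration of the idea behind these scripts, the sketch below runs a clustered bootstrap of a linear regression slope and fits the replicate models in parallel. It is not the code from the Scala scripts: it uses plain Python with a thread pool as a stand-in for Spark, and the toy data, the `(cluster, x, y)` row layout, and all function names are assumptions made for this example.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fit_slope(rows):
    """Ordinary least squares slope of y on x for (cluster, x, y) rows."""
    n = len(rows)
    mean_x = sum(x for _, x, _ in rows) / n
    mean_y = sum(y for _, _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for _, x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for _, x, y in rows)
    return sxy / sxx


def clustered_resample(by_cluster, rng):
    """Draw cluster ids with replacement; keep every row of each drawn cluster."""
    ids = list(by_cluster)
    drawn = [rng.choice(ids) for _ in ids]
    return [row for cid in drawn for row in by_cluster[cid]]


def clustered_bootstrap_slopes(rows, n_replicates=200, seed=7, workers=4):
    """Fit one model per clustered bootstrap replicate, in parallel."""
    by_cluster = {}
    for row in rows:
        by_cluster.setdefault(row[0], []).append(row)
    # One seeded generator per replicate, so results do not depend on
    # how the threads happen to be scheduled.
    rngs = [random.Random(seed + i) for i in range(n_replicates)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda r: fit_slope(clustered_resample(by_cluster, r)), rngs)
        )


# Hypothetical toy data: 10 clusters of 5 rows each, true slope 2.0.
data_rng = random.Random(0)
rows = [
    (c, float(x), 1.0 + 2.0 * x + data_rng.gauss(0, 0.5))
    for c in range(10)
    for x in range(5)
]
slopes = clustered_bootstrap_slopes(rows)
print(len(slopes), sum(slopes) / len(slopes))
```

Resampling whole clusters rather than individual rows preserves the within-cluster correlation in every replicate, which is the point of the clustered variant. In Spark, the replicates would typically be distributed across executors instead of threads.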
The spark-examples-python-notebooks section consists of IPython notebooks that run in the PySpark shell. Notebooks include:
- Calculate confidence intervals using bootstrap technique with PySpark: pyspark-confidence_intervals-bootstrap.ipynb
- Perform clustered bootstrap on GLM using PySpark and MLlib: pyspark-mllib-glm-clustered_bootstrap.ipynb
- Perform distributed cross-validation of a random forest model using PySpark and ML Pipelines: pyspark-mllib-randomforests-crossvalidation.ipynb
- Perform parallel cross-validation of single-machine scikit-learn models using PySpark and the spark-sklearn package: pyspark-sklearn-crossvalidation.ipynb
More on this section here
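For readers new to the technique, a percentile bootstrap confidence interval can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not the notebook's PySpark code; the function name `bootstrap_ci`, its parameters, and the simulated sample are assumptions made for this example.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_replicates=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the results.
    replicates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(n_replicates)
    )
    lo_idx = int((alpha / 2) * n_replicates)
    hi_idx = int((1 - alpha / 2) * n_replicates) - 1
    return replicates[lo_idx], replicates[hi_idx]


# Hypothetical sample: 200 draws from a normal distribution (mean 10, sd 2).
sample_rng = random.Random(1)
sample = [sample_rng.gauss(10.0, 2.0) for _ in range(200)]
low, high = bootstrap_ci(sample)
print(low, high)
```

Each replicate is independent of the others, which is why the computation parallelizes naturally when the resampling is distributed across a cluster.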
The h2o-examples-scala-ide section consists of Scala code that demonstrates how to develop Spark and Sparkling Water applications in a Scala IDE. The code in this section was built using IntelliJ IDEA. Eclipse is another popular IDE, and you can use any IDE you prefer.
The three main classes in this example project are:
- com.cloudera.sa.ml.sparklingwater.GBMBootstrap.scala
- com.cloudera.sa.ml.sparklingwater.GlmBootstrap.scala
- com.cloudera.sa.ml.spark.GlmBootstrap.scala
Compile the code and build a target jar file. Once the jar file is created, copy it to the Hadoop cluster and submit a Spark job as shown below to run bootstrap on GBM using H2O algorithms.
spark-submit --master yarn-cluster --driver-memory 3g --driver-cores 2 --executor-memory 4g --executor-cores 3 \
  --num-executors 4 --jars jars/commons-csv-1.1.jar,jars/spark-csv_2.10-1.4.0.jar,./sparkling-water-1.5.14/assembly/build/libs/sparkling-water-assembly-1.5.14-all.jar \
  --class com.cloudera.sa.ml.sparklingwater.GBMBootstrap ml-examples_2.10-1.0.jar skewdata-policy-new.csv data/output/2
The h2o-examples-flow_ui section consists of Scala flow code that can be imported into H2O Flow. Learn more about H2O Flow.
- Flow file to perform clustered bootstrap on GLM in parallel using Spark (CDH 5.5.1): spark-mllib-glm-clustered_bootstrap_cdh551.flow
- Flow file to perform clustered bootstrap on GBM in parallel using Sparkling Water (CDH 5.5.1): spark-sparkling_water-gbm-clustered_bootstrap_cdh551.flow
- Flow file to perform bootstrap on GLM in parallel using Sparkling Water (CDH 5.5.1): spark-sparkling_water-glm-bootstrap_cdh551.flow
You can load these flow files in the H2O Flow UI as described in the Flow Guide.
More on this section here
The data section contains all the synthetic datasets used in the Scala and Python examples above:
- Day.csv
- Skewdata-policy-new.csv
- skewdata.csv
- simdata_20K_120vars.csv.zip (compressed)
- Learn more about Apache Spark.
- Learn more about Apache Spark MLlib and ML.
- Learn more about H2O Sparkling Water.
- Learn more about the H2O Sparkling Water developer guide.
- Learn more about Spark-sklearn.
- Learn more about Bootstrapping (Statistics).