Python API for XGBoost-Spark

This document covers the GPU-related Python API. Six new classes are introduced:

GpuDataset
GpuDataReader
XGBoostClassifier
XGBoostClassificationModel
XGBoostRegressor
XGBoostRegressionModel

GpuDataset

The full name is ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset. A GpuDataset is an object that is produced by GpuDataReaders and consumed by XGBoostClassifiers and XGBoostRegressors. No constructors or methods are exposed for this class.

GpuDataReader

The full name is ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataReader. A GpuDataReader sets options and builds a GpuDataset from data sources. Data loading is lazy: it happens only when the data is processed later.

Constructors

GpuDataReader(spark_session)
- spark_session: a SparkSession for data loading

Methods

format(source): This method sets the data format. Valid values include csv, parquet and orc.
- source: a String representing the data format to set
- returns the data reader itself
schema(schema): This method sets the data schema.
- schema: the data schema, either a StructType or a DDL-formatted String (e.g., a INT, b STRING, c DOUBLE)
- returns the data reader itself
option(key, value): This method sets a single option.
- key: a String representing the option key
- value: the option value; valid types include Boolean, Integer, Float and String
- returns the data reader itself
options(options): This method sets multiple options at once.
- options: a Dictionary[String, String] of options
- returns the data reader itself
load(*paths): This method builds a GpuDataset.
- paths: the data source paths; may be empty, a single path, or a list of paths
- returns a GpuDataset as the result
csv(*paths): This method builds a GpuDataset from CSV data.
- paths: the CSV data paths; may be a single path or a list of paths
- returns a GpuDataset as the result
parquet(*paths): This method builds a GpuDataset from Parquet data.
- paths: the Parquet data paths; may be a single path or a list of paths
- returns a GpuDataset as the result
orc(*paths): This method builds a GpuDataset from ORC data.
- paths: the ORC data paths; may be a single path or a list of paths
- returns a GpuDataset as the result
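
Because every setter returns the reader itself, the calls can be chained. The following is a minimal sketch rather than a definitive recipe: the helper names ddl_schema and build_csv_dataset are hypothetical, the reader is passed in as an argument so the snippet does not depend on a particular installation, the import path in the comment is assumed from the full class name given above, and the file path is a placeholder.

```python
def ddl_schema(columns):
    """Join (name, type) pairs into a DDL-formatted schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def build_csv_dataset(reader, columns, *paths):
    """Configure a reader through its chainable setters, then build a dataset.

    `reader` is expected to behave like the GpuDataReader documented above:
    format, schema and option return the reader itself, load returns the dataset.
    """
    return (
        reader.format("csv")
        .schema(ddl_schema(columns))
        .option("header", True)
        .load(*paths)
    )


# In a real session the reader would come from the library, for example:
#   from ml.dmlc.xgboost4j.scala.spark.rapids import GpuDataReader  # assumed import path
#   columns = [("a", "INT"), ("b", "STRING"), ("c", "DOUBLE")]
#   train_set = build_csv_dataset(GpuDataReader(spark), columns, "/data/train.csv")
```

Since loading is lazy, building the dataset this way should not read any data until it is consumed by training or prediction.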

Options

Common options
- asFloats: A Boolean flag indicating whether to cast all numeric values to floats. Default is True.
- maxRowsPerChunk: An Integer specifying the maximum number of rows per chunk. Default is 2147483647 (2^31-1).
Options for CSV
- comment: A single character; lines beginning with this character are skipped. Default is an empty string, which disables comment skipping.
- header: A Boolean flag indicating whether the first line should be used as column names. Default is False.
- nullValue: The string representation of a null (None) value. Default is an empty string.
- quote: A single character used for escaping quoted values where the separator can be part of the value. Default is ".
- sep: A single character used as the separator between adjacent values. Default is ,.
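
The CSV options above can be collected into a dictionary suitable for options(options). This is a sketch under stated assumptions: csv_options is a hypothetical helper, and encoding the Boolean header flag as the string "true" or "false" follows the usual Spark convention, which is assumed rather than confirmed for this reader.

```python
def csv_options(header=False, sep=",", quote='"', comment="", null_value=""):
    """Build a Dictionary[String, String] of CSV options with documented defaults."""
    # sep and quote are documented as single characters.
    for name, value in (("sep", sep), ("quote", quote)):
        if len(value) != 1:
            raise ValueError(f"{name} must be a single character, got {value!r}")
    # comment is either empty (disabled) or a single character.
    if len(comment) > 1:
        raise ValueError(f"comment must be empty or a single character, got {comment!r}")
    options = {
        # Assumption: Boolean values are passed as lowercase strings.
        "header": "true" if header else "false",
        "sep": sep,
        "quote": quote,
        "nullValue": null_value,
    }
    if comment:
        # Only set comment when it is enabled.
        options["comment"] = comment
    return options


# Example: pipe-separated file with a header row and '#' comment lines.
pipe_options = csv_options(header=True, sep="|", comment="#")
```

The resulting dictionary could then be handed to a reader with reader.options(pipe_options).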

XGBoostClassifier

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier. It is a wrapper around the Scala XGBoostClassifier.

Constructors

XGBoostClassifier(**params)
- all standard xgboost parameters are supported, but please note a few differences:
  - only camelCase is supported when specifying parameter names, e.g., maxDepth
  - parameter lambda is renamed to lambda_, because lambda is a keyword in Python
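
Parameter dictionaries written for other XGBoost interfaces often use snake_case names. The sketch below shows one way to convert them mechanically to the naming described above. The helper to_spark_params is hypothetical, and it only rewrites names; whether each resulting name is accepted by the constructor is not checked here.

```python
def to_spark_params(params):
    """Convert snake_case parameter names to camelCase and lambda to lambda_."""
    converted = {}
    for name, value in params.items():
        if name in ("lambda", "lambda_"):
            # lambda is a Python keyword, so the wrapper uses lambda_.
            converted["lambda_"] = value
            continue
        head, *rest = name.split("_")
        converted[head + "".join(part.capitalize() for part in rest)] = value
    return converted


# Example input using snake_case names; values are illustrative only.
spark_params = to_spark_params({"max_depth": 8, "eta": 0.1, "lambda": 1.0})
# spark_params == {"maxDepth": 8, "eta": 0.1, "lambda_": 1.0}
```

The converted dictionary can then be unpacked into the constructor, for example XGBoostClassifier(**spark_params).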

Methods

Note: Only GPU related methods are listed below.

setFeaturesCols(features_cols): This method sets the feature columns for training.
- features_cols: a list of feature column names (Strings)
- returns the classifier itself
setEvalSets(eval_sets): This method sets the evaluation sets for training.
- eval_sets: evaluation sets of type Dictionary[String, GpuDataset] (for CPU training, the type is Dictionary[String, DataFrame])
- returns the classifier itself
fit(dataset): This method triggers the training.
- dataset: a GpuDataset to train on
- returns the training result as an XGBoostClassificationModel
- Note: For CPU training, you can still call fit with a DataFrame
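
Put together, a GPU training run sets the feature columns, optionally sets evaluation sets, and then calls fit. The sketch below takes the estimator as an argument so it stays independent of any installation; the function name train, the column names, and the "eval" key in the usage comment are hypothetical.

```python
def train(estimator, features_cols, train_set, eval_sets=None):
    """Configure an estimator for GPU training and return the fitted model.

    `estimator` is expected to expose setFeaturesCols, setEvalSets and fit
    as documented above; both setters return the estimator itself.
    """
    estimator = estimator.setFeaturesCols(list(features_cols))
    if eval_sets:
        # eval_sets maps a name to a GpuDataset (or a DataFrame on CPU).
        estimator = estimator.setEvalSets(dict(eval_sets))
    return estimator.fit(train_set)


# In a real session, assuming train_set and eval_set are GpuDatasets:
#   classifier = XGBoostClassifier(maxDepth=8)
#   model = train(classifier, ["f0", "f1", "f2"], train_set, {"eval": eval_set})
```

The same sequence applies to XGBoostRegressor, which exposes the same three methods.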

XGBoostClassificationModel

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel. It is a wrapper around the Scala XGBoostClassificationModel.

Methods

Note: Only GPU related methods are listed below.

transform(dataset): This method predicts results based on the model.
- dataset: a GpuDataset to predict on
- returns a DataFrame with the prediction
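
Since transform returns a DataFrame, the usual DataFrame operations can follow it. The sketch below keeps only selected columns of the result. The function name predict is hypothetical, and the column name "prediction" is an assumption based on the usual Spark ML default, not something this page states.

```python
def predict(model, dataset, keep=("prediction",)):
    """Run prediction and keep only the requested columns of the result.

    `model` is expected to expose transform as documented above; the
    returned object is expected to support DataFrame.select.
    """
    result = model.transform(dataset)
    # Assumption: the prediction column is named "prediction".
    return result.select(*keep)


# In a real session, assuming test_set is a GpuDataset:
#   predict(model, test_set, keep=("label", "prediction")).show(5)
```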

XGBoostRegressor

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor. It is a wrapper around the Scala XGBoostRegressor.

Constructors

XGBoostRegressor(**params)
- all standard xgboost parameters are supported, but please note a few differences:
  - only camelCase is supported when specifying parameter names, e.g., maxDepth
  - parameter lambda is renamed to lambda_, because lambda is a keyword in Python

Methods

Note: Only GPU related methods are listed below.

setFeaturesCols(features_cols): This method sets the feature columns for training.
- features_cols: a list of feature column names (Strings)
- returns the regressor itself
setEvalSets(eval_sets): This method sets the evaluation sets for training.
- eval_sets: evaluation sets of type Dictionary[String, GpuDataset] (for CPU training, the type is Dictionary[String, DataFrame])
- returns the regressor itself
fit(dataset): This method triggers the training.
- dataset: a GpuDataset to train on
- returns the training result as an XGBoostRegressionModel
- Note: For CPU training, you can still call fit with a DataFrame

XGBoostRegressionModel

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel. It is a wrapper around the Scala XGBoostRegressionModel.

Methods

Note: Only GPU related methods are listed below.

transform(dataset): This method predicts results based on the model.
- dataset: a GpuDataset to predict on
- returns a DataFrame with the prediction
