MLPerf Inference Benchmarks for Recommendation Task

This is the reference implementation for MLPerf Inference benchmarks.

Supported Models

| name | framework | acc. | AUC | dataset | weights | size | prec. | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dlrm (debugging) | PyTorch | 78.82% | N/A | Criteo Kaggle DAC | N/A | ~1GB | fp32 | |
| dlrm (debugging) | PyTorch | 81.07% | N/A | Criteo Terabyte | pytorch, onnx | ~10GB | fp32 | --max-ind-range=10000000 --data-sub-sample-rate=0.875 |
| dlrm (official) | PyTorch | N/A | 80.25% | Criteo Terabyte | pytorch, onnx | ~100GB | fp32 | --max-ind-range=40000000 |

Disclaimer

This is a reference implementation of the benchmark; it is not meant to be the fastest implementation possible. The official model is the largest one, on the order of 100GB, while the interface to the other models is provided only to facilitate debugging.

The reference implementation includes all required pre-processing of the datasets. It also includes the --accuracy option to validate the accuracy and ROC AUC (receiver operating characteristic area under the curve) metrics as required by MLPerf [1].
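
For reference, the ROC AUC metric can be reproduced from a set of predictions and labels with scikit-learn (installed below); the arrays here are illustrative placeholders, not benchmark outputs:

```python
# Minimal ROC AUC sketch using scikit-learn (illustrative values, not benchmark outputs).
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([0, 1, 1, 0, 1])             # ground-truth click labels
scores = np.array([0.1, 0.8, 0.65, 0.3, 0.9])  # predicted click probabilities
print("ROC AUC:", roc_auc_score(labels, scores))
```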

The reference implementation was tested on a machine with 256GB RAM and 8x32GB GPUs.

[1] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters, 2006

Prerequisites and Installation

We support PyTorch and might add a TensorFlow backend implementation in the future.

The following steps are only needed if you run the benchmark without Docker.

Python 3.5, 3.6, or 3.7 is supported, and we recommend using Anaconda.

Install the desired backend. For PyTorch:

pip install torch torchvision
pip install scikit-learn
pip install numpy
pip install pydot
pip install torchviz
pip install protobuf
pip install tqdm
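
As an optional sanity check (this snippet is not part of the benchmark code), the installed packages should import cleanly:

```python
# Verify that the Python dependencies are importable.
import torch, torchvision, sklearn, numpy, pydot, torchviz, tqdm
import google.protobuf

print("torch", torch.__version__, "| numpy", numpy.__version__)
```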

Prepare the code and dataset

  1. Download or clone the MLPerf inference and training code
cd $HOME
mkdir ./mlperf && cd ./mlperf
git clone --recurse-submodules https://github.com/mlperf/training.git
git clone --recurse-submodules https://github.com/mlperf/inference.git
export DLRM_DIR=$HOME/mlperf/training/recommendation/dlrm
  2. Download pre-trained model weights (see links available above)
cd $HOME/mlperf/inference/v0.5/recommendation
mkdir ./model && cd ./model
mv <downloaded_file> dlrm_terabyte.pytorch
export MODEL_DIR=./model
  3. Download corresponding Criteo dataset (see links available above)
cd $HOME/mlperf/inference/v0.5/recommendation
mkdir ./criteo && cd ./criteo
mv <downloaded_file(s)> ./
export DATA_DIR=./criteo
  4. Build and install the loadgen
cd $HOME/mlperf/inference/loadgen
CFLAGS="-std=c++14" python setup.py develop --user
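
If the build succeeds, the loadgen bindings should be importable from Python; a quick check (assuming the module is exposed as mlperf_loadgen):

```python
# Sanity check that the LoadGen Python bindings are installed
# (assumes the module name is mlperf_loadgen).
import mlperf_loadgen
print(mlperf_loadgen.__file__)
```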

More information about the model weights

| File name | framework | Size (du *) | MD5 hash (md5sum *) |
| --- | --- | --- | --- |
| tb0875_10M.pt | pytorch | 12GB | b7cacffcf75f767faa9cb2af397723aa |
| tb00_40M.pt | pytorch | 90GB | 2d49a5288cddb37c3c64860a06d79bb9 |

More information about the datasets

| dataset | download link |
| --- | --- |
| Criteo Kaggle DAC | https://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/ |
| Criteo Terabyte | https://labs.criteo.com/2013/12/download-terabyte-click-logs/ |
  1. The Criteo Kaggle DAC dataset is composed of 7 days, all stored in a single file, train.txt, which is expected by the code.

  2. The Criteo Terabyte dataset is stored in several files corresponding to 24 days: day_0.gz, day_1.gz, ..., day_23.gz (~343GB).

| File name | Size (du *) | MD5 hash (md5sum *) |
| --- | --- | --- |
| day_0.gz | 15927304 | 6cef23542552c3195e9e6e2bdbc4c235 |
| day_1.gz | 16292832 | 94b73908ee8f912c175420376c8952db |
| day_2.gz | 16022296 | c3c0272c26cfaa03d932b2856a751ff5 |
| day_3.gz | 14779644 | b727ecfaaf482507bb998248833aa4c2 |
| day_4.gz | 12514396 | b99eaa6e324e49d9df4e6f840e76b6d9 |
| day_5.gz | 14169436 | 1294d0a56a90294aebf80133078d9879 |
| day_6.gz | 16753584 | 68586521483313e13aefb29e7b767fdb |
| day_7.gz | 16465452 | a2c1c4bfec20fc88b0b504175a16b644 |
| day_8.gz | 15845316 | faabf247fd56140a76effa6a3da63253 |
| day_9.gz | 16195668 | ee3347a28c1dd2fb2c92094e643c132b |
| day_10.gz | 15201768 | d043c2ec0346eb4c71aaae935416688e |
| day_11.gz | 12698432 | 8d4ba32f0c4f654a3860b6f2ae1a8ea7 |
| day_12.gz | 14029428 | 908480917ed39be2a2ad2e1c339c40b4 |
| day_13.gz | 15953380 | 567d6bfa672dd10a0cf76feaec0cf92b |
| day_14.gz | 15882872 | ed377357aecaccc5f93c754c4819fd8d |
| day_15.gz | 15382724 | 8e91f2a8d3d95202dfc3b22b88064c12 |
| day_16.gz | 14606256 | 387269870bf8ec7d285cf0e8ce82e92e |
| day_17.gz | 13483480 | 48d3538fcf04807e0be4d72072dbda0b |
| day_18.gz | 11796984 | f26e23b6ef242f40b0e3fd92c986170c |
| day_19.gz | 13059032 | 3f6f36657b0ff1258428356451eea6c8 |
| day_20.gz | 16022492 | db7ff2b830817d3b10960f02bfb68547 |
| day_21.gz | 15707044 | f1a4ba7f7a555cb4a7e724a082479f4a |
| day_22.gz | 15463520 | 848ae20c4eab730ae487acc8ddaf52ba |
| day_23.gz | 14591792 | a2748bdbc67dd544b3ac470c4f1a52df |

Please unzip all the files:

gunzip day_{0..23}.gz

to obtain the text files expected by the code: day_0, day_1, ..., day_23 (~1.1TB). A short checksum-verification sketch is given after this list.

| File name | Size (du *) | MD5 hash (md5sum *) |
| --- | --- | --- |
| day_0 | 48603344 | 331703bf14b9a699324d589efd676962 |
| day_1 | 49518248 | 1df068493bb19edce48c09a1ce1a7fca |
| day_2 | 48820460 | a5dec1724865504895e03508d6308046 |
| day_3 | 44863556 | c6de260d9eb835a2a1866b9931f4c474 |
| day_4 | 37583076 | 9a41d00282b6b87db890518a86ab001f |
| day_5 | 42691840 | 170cba5a53f2b7cea8b75e19f10e152e |
| day_6 | 50857184 | a47640af3fdffdec69c7a61f74e8b4f6 |
| day_7 | 49874324 | b0ff0428ce74fcb3f690f5330b7803d7 |
| day_8 | 48082108 | 89113bd1eed24775ff1e757feb80447a |
| day_9 | 49252856 | 730903c294a98261beefc5d8c1634fc9 |
| day_10 | 46029928 | 994fb3a9a43c0ebef0e28357a3b34068 |
| day_11 | 37983540 | a62384802da42c8e0a82aa0a26e5f99e |
| day_12 | 41832600 | 3f4dc3fca55231e7e9cff5a696325a28 |
| day_13 | 48109928 | 2ad1cddf9a1b93a6315ba29ceb31124c |
| day_14 | 48059080 | 969bf0ec8cc212fe1816c153161df367 |
| day_15 | 46383644 | fb44b44d68237d694b56f85f603f52cb |
| day_16 | 44094432 | 866d4b3ef3381fec2c77afee58dcb987 |
| day_17 | 40374712 | 89e82426b5bc49e93b40d9dbd1bbf7a4 |
| day_18 | 35039044 | 294f2d46c8653d36c12c163d9f91b5ac |
| day_19 | 38662560 | 74bc0f7b0b6dd324ecc029445c875ea3 |
| day_20 | 47981304 | 5b02f47b002cd374613d8e27f2daa7ce |
| day_21 | 47650792 | 70dfaf349be746e3ea1a7affe97f40b8 |
| day_22 | 47037532 | c9abe6cbae0b93f6702af27cda53d229 |
| day_23 | 44152268 | 08e251af4f3d1e8771ea15e405f39600 |
  3. A fake Criteo dataset can be created in place of the real datasets in order to facilitate debugging and testing. We provide a fake (random) data generator that can be used to quickly generate data samples in a format compatible with both the original and mlperf binary loaders. Please use the following script in ./tools to quickly create random samples for the corresponding models; they will be placed into the ./fake_criteo directory.
./make_fake_criteo.sh [kaggle|terabyte0875|terabyte]
mv ./fake_criteo .. && cd ..
export DATA_DIR=./fake_criteo
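
To double-check the real Criteo Terabyte files from item 2 above, the checksums in the tables can be recomputed with md5sum, or with a short script along these lines (run from the directory containing the day_* files):

```python
# Recompute MD5 hashes of the day_* files and compare them against the tables above.
import glob
import hashlib

for path in sorted(glob.glob("day_*")):
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
            md5.update(chunk)
    print(path, md5.hexdigest())
```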

Calibration set

For MLPerf Inference, we use the first 128000 rows (user-item pairs) of the second half of day_23 as the calibration set. Specifically, day_23 contains 178274637 rows in total, so we use rows 89137319 through 89265318 (both inclusive, assuming 0-based indexing) of day_23 as the calibration set.
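
As an illustration only (this snippet is not part of the reference scripts, and the output file name is arbitrary), the calibration rows could be extracted as follows:

```python
# Extract the MLPerf calibration set from day_23:
# rows with 0-based indices 89137319..89265318 (128000 rows in total).
from itertools import islice

with open("day_23", "r") as src, open("calibration_rows.txt", "w") as dst:
    for row in islice(src, 89137319, 89265319):  # islice end bound is exclusive
        dst.write(row)
```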

Running the benchmark

Download and install all the prerequisites. Both the local and Docker environments need 3 environment variables set:

export DATA_DIR=YourCriteoTerabyteLocation
export MODEL_DIR=YourModelFileLocation
export DLRM_DIR=YourDLRMSourceLocation

Run local

./run_local.sh backend model dataset device [options]

backend is one of [pytorch]
model is one of [dlrm]
dataset is one of [kaggle|terabyte]
device is one of [cpu|gpu]
options are extra arguments that are passed along

For example, to run on CPU you may choose to use:

  1. Criteo Kaggle DAC (debugging)

Offline scenario perf and accuracy modes

./run_local.sh pytorch dlrm kaggle cpu --scenario Offline --samples-to-aggregate-fix=2048 --max-batchsize=2048
./run_local.sh pytorch dlrm kaggle cpu --scenario Offline --samples-to-aggregate-fix=2048 --max-batchsize=2048 --samples-per-query-offline=1 --accuracy

Server scenario perf and accuracy modes

./run_local.sh pytorch dlrm kaggle cpu --scenario Server --samples-to-aggregate-fix=2048 --max-batchsize=2048
./run_local.sh pytorch dlrm kaggle cpu --scenario Server --samples-to-aggregate-fix=2048 --max-batchsize=2048 --accuracy
  2. Criteo Terabyte with 0.875 sub-sampling (debugging)

Offline scenario perf and accuracy modes

./run_local.sh pytorch dlrm terabyte cpu --scenario Offline --max-ind-range=10000000 --data-sub-sample-rate=0.875 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=2048 [--mlperf-bin-loader]
./run_local.sh pytorch dlrm terabyte cpu --scenario Offline --max-ind-range=10000000 --data-sub-sample-rate=0.875 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=2048 --samples-per-query-offline=1 --accuracy [--mlperf-bin-loader]

Server scenario perf and accuracy modes

./run_local.sh pytorch dlrm terabyte cpu --scenario Server  --max-ind-range=10000000 --data-sub-sample-rate=0.875 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=2048 [--mlperf-bin-loader]
./run_local.sh pytorch dlrm terabyte cpu --scenario Server  --max-ind-range=10000000 --data-sub-sample-rate=0.875 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=2048 --accuracy [--mlperf-bin-loader]
  3. Criteo Terabyte (official)

Offline scenario perf and accuracy modes

./run_local.sh pytorch dlrm terabyte cpu --scenario Offline --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=2048 --samples-per-query-offline=204800 [--mlperf-bin-loader]
./run_local.sh pytorch dlrm terabyte cpu --scenario Offline --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=2048 --samples-per-query-offline=204800 --accuracy [--mlperf-bin-loader]

Server scenario perf and accuracy modes

./run_local.sh pytorch dlrm terabyte cpu --scenario Server  --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=2048 [--mlperf-bin-loader]
./run_local.sh pytorch dlrm terabyte cpu --scenario Server  --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=2048 --accuracy [--mlperf-bin-loader]

Note that the code supports (i) the original and (ii) the mlperf binary loader, which have slightly different performance characteristics. The latter can be enabled by adding --mlperf-bin-loader to the command line.

Note that this script pre-processes the data during the first run and reuses the result in subsequent runs. The pre-processing can take a significant amount of time.

To use GPU(s), select them with the CUDA_VISIBLE_DEVICES environment variable, and run

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
./run_local.sh pytorch dlrm terabyte gpu --accuracy

Get started quickly with Docker

Ensure you have a working Docker setup on your machine.

CPU

Build the Docker image

cd $HOME/mlperf/inference/v0.5/recommendation
docker build -t dlrm-cpu docker_cpu/.

Run the Docker container in interactive mode and enter the container console

docker run -it dlrm-cpu

Inside the container, kick-start the default setup (environment, git checkout, fake dataset, and model download)

source kickstart.sh

GPU

Build the Docker image

cd $HOME/mlperf/inference/v0.5/recommendation
docker build -t dlrm-gpu docker_gpu/.

Run the Docker container in interactive mode and enter the container console

docker run --gpus all -it dlrm-gpu

Ensure CUDA is available inside the container: the following command should return True. If it returns False, check that your Docker installation has functioning CUDA and GPU support.

python -c "exec(\"import torch\nprint(torch.cuda.is_available())\")"

Inside the container, kick-start the default setup (environment, git checkout, fake dataset, model download; defaults to a single GPU). See above for changing CUDA_VISIBLE_DEVICES.

source kickstart.sh

Examples for testing

During development, running the full benchmark is impractical. Here are some options that help:

--count-samples limits the number of items in the dataset used for the accuracy pass

--duration limits the time the benchmark runs

--max-latency the target latency used for the Server scenario

--accuracy enables the accuracy pass

For example, to tune the Server scenario, try:

./run_local.sh pytorch dlrm terabyte cpu --scenario Server  --count-samples 1024 --max-ind-range=10000000 --data-sub-sample-rate=0.875 --duration 60000 --target-qps 100 --max-latency 0.1

To run with the accuracy pass, try:

./run_local.sh pytorch dlrm terabyte cpu --scenario Offline --count-samples 1024 --max-ind-range=10000000 --data-sub-sample-rate=0.875 --samples-per-query-offline=1 --samples-to-aggregate-fix=128 --accuracy [--mlperf-bin-loader]

Verifying aggregation trace

In the reference implementation, each sample is mapped to 100-700 user-item pairs following the distribution specified by tools/dist_quantile.txt. To verify that your sample aggregation trace matches the reference, please follow the steps in tools/dist_trace_verification.txt, or simply download the reference dlrm_trace_of_aggregated_samples.txt from Zenodo (MD5: 3db90209564316f2506c99cc994ad0b2).
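
As a rough, self-contained sanity check (not a replacement for tools/dist_trace_verification.txt), one can parse the bracketed range on each trace line and confirm that the aggregated-sample counts fall roughly in the 100-700 range; the snippet below assumes one [start ... end] style range per line:

```python
# Rough sanity check of an aggregation trace file: parse the "[start ... end]" range
# on each line and report the spread of aggregated-sample counts.
import re

sizes = []
with open("dlrm_trace_of_aggregated_samples.txt") as f:
    for line in f:
        m = re.search(r"\[(\d+)\D+(\d+)\]", line)
        if m:
            start, end = int(m.group(1)), int(m.group(2))
            sizes.append(end - start)  # exact off-by-one depends on the range convention

if sizes:
    print("queries:", len(sizes), "min size:", min(sizes), "max size:", max(sizes))
```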

Running accuracy script

To get the accuracy from a LoadGen accuracy JSON log file:

  1. If your SUT outputs the predictions and the ground-truth labels in a packed format, like the reference implementation, then run
python tools/accuracy-dlrm.py --mlperf-accuracy-file <LOADGEN_ACCURACY_JSON>

For instance, if the output is given in a standard directory then you can run

python ./tools/accuracy-dlrm.py --mlperf-accuracy-file=./output/pytorch-cpu/dlrm/mlperf_log_accuracy.json
  2. If your SUT outputs only the predictions, then you need to make sure that the data in day_23 are not shuffled, and run
python tools/accuracy-dlrm.py --mlperf-accuracy-file <LOADGEN_ACCURACY_JSON> --day-23-file <path/to/day_23> --aggregation-trace-file <path/to/dlrm_trace_of_aggregated_samples.txt>

Usage

usage: main.py [-h]
    [--mlperf_conf ../../mlperf.conf]
    [--user_conf user.conf]
    [--model MODEL] --model-path MODEL_PATH
    [--dataset {kaggle,terabyte}] --dataset-path DATASET_PATH
    [--profile {defaults,dlrm-kaggle-pytorch,dlrm-terabyte-pytorch}]
    [--scenario SCENARIO]
    [--max-ind-range MAX_IND_RANGE] [--data-sub-sample-rate DATA_SUB_SAMPLE_RATE]
    [--max-batchsize MAX_BATCHSIZE] [--mlperf-bin-loader]
    [--output OUTPUT] [--inputs INPUTS] [--outputs OUTPUTS]
    [--backend BACKEND] [--use-gpu] [--threads THREADS] [--duration TIME_IN_MS]
    [--count-samples COUNT] [--count-queries COUNT] [--target-qps QPS]
    [--max-latency MAX_LATENCY]  [--cache CACHE]
    [--samples-per-query-multistream NUM_SAMPLES]
    [--samples-per-query-offline NUM_SAMPLES]
    [--samples-to-aggregate-fix NUM_FIXED_SAMPLES]
    [--samples-to-aggregate-min MIN_NUM_VARIABLE_SAMPLES]
    [--samples-to-aggregate-max MAX_NUM_VARIABLE_SAMPLES]
    [--samples-to-aggregate-quantile-file FILE]
    [--samples-to-aggregate-trace-file FILE]
    [--numpy-rand-seed SEED]
    [--accuracy] [--find-peak-performance]

--mlperf_conf the mlperf config file to use for rules compliant parameters (default: ../../mlperf.conf)

--user_conf the user config file to use for user LoadGen settings such as target QPS (default: user.conf)

--model model name, i.e. dlrm.

--model-path MODEL_PATH path to the file with model weights.

--dataset use the specified dataset, one of kaggle or terabyte (the official benchmark uses Criteo Terabyte).

--dataset-path path to the dataset.

--scenario {SingleStream,MultiStream,Server,Offline} benchmarking mode to be used.

--profile {dlrm-kaggle-pytorch,dlrm-terabyte-pytorch} this fills in default command line options with the ones specified in the profile. Command line options that follow may override them.

--backend BACKEND backend to use; only the PyTorch backend is currently supported. We expect to add a TensorFlow backend in the future.

--max-ind-range the maximum number of vectors allowed in an embedding table.

--data-sub-sample-rate the rate of sub-sampling of negative samples, either 0.875 or 0.0.

--max-batchsize MAX_BATCHSIZE maximum batch size passed to the backend (default: 128). If a query contains a very large number of samples, it will be broken up into smaller mini-batches of at most MAX_BATCHSIZE samples before being forwarded to the model.

--mlperf-bin-loader flag that enables the mlperf binary loader.

--output OUTPUT location of the JSON output.

--use-gpu flag that enables the use of GPU(s). The GPUs used are controlled by the CUDA_VISIBLE_DEVICES environment variable.

--threads THREADS number of worker threads to use (default: the number of processors in the system).

--duration duration of the benchmark run in milliseconds (ms).

--count-samples COUNT number of samples from the dataset we use (default: use all samples in the dataset).

--count-queries COUNT number of queries we use (default: no limit).

--target-qps QPS target/expected QPS for the Server and Offline scenarios.

--max-latency MAX_LATENCY comma-separated list of latencies (in seconds) that we try to reach at the 99th percentile (default: 0.01,0.05,0.100).

--samples-per-query-multistream maximum number of (aggregated) samples per query in MultiStream scenario.

--samples-per-query-offline maximum number of (aggregated) samples per query in Offline scenario.

--samples-to-aggregate-fix number of samples to aggregate and treat as a single sample. This number will stay fixed during runs.

--samples-to-aggregate-min, --samples-to-aggregate-max number of samples to aggregate and treat as a single sample. This number will vary randomly between min and max during runs.

--samples-to-aggregate-quantile-file number of samples to aggregate and treat as a single sample. This number will be sampled according to a custom distribution quantile stored in a file (e.g. tools/dist_quantile.txt).

--samples-to-aggregate-trace-file filename for writing the trace of queries. Each query is written on a single line, with a range of aggregated samples indicated in square brackets.

--numpy-rand-seed random seed for numpy package.

--accuracy perform inference on the entire dataset to validate achieved model accuracy/AUC metric.

--find-peak-performance determine the maximum QPS for the Server scenario and the maximum samples per query for the MultiStream scenario; not applicable to other scenarios.

License

Apache License 2.0