Difficulty reproducing benchmarks #16

kpedro88 · 2022-01-09T01:18:40Z

I'm interested to compare my own custom reader to this one, but I'm having some trouble getting benchmark results consistent with what the readme reports. Details:

CPU: Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz
Compiler: GCC 9.3.0
Python: 3.9.6 (newer)
xgboost: 1.3.3 (newer)
m2cgen: 0.9.0
ROOT: 6.22.08 (newer)

My results:
FastForest: 1.61 s
treelite: -*
m2cgen: 1.71 s
xgboost: 0.12 s
TMVA: 8.95 s

The relative differences in these numbers are rather different from the readme (and xgboost is blazing fast somehow).

It could be useful to distribute a Dockerfile that sets up and runs all the benchmarks in a more controlled environment (OS, versions, etc.).

* The treelite benchmark (using version 2.1.0) doesn't work at all. I get the following error:

ModuleNotFoundError: No module named 'treelite.runtime'

If I swap it with treelite_runtime (a separate pip package), then there's another error:

AttributeError: module 'treelite_runtime' has no attribute 'Batch'

The text was updated successfully, but these errors were encountered:

guitargeek · 2022-01-18T15:51:50Z

Hi @kpedro88, thanks for checking out the library and opening the issue!

Indeed, updating the benchmarks was long overdue. I have now added a commit where I updated the benchmark code and results. While doing that, I noticed some performance regression in fastforest in the last year and also fixed that (see the second-to-last commit).

I hope you can reproduce the benchmarks now and that they will make sense to you! The reason why xgboost appeared so fast was that newer versions use multithreading by default, so I needed to explicitly set it to one thread now. Treelike will probably be slower than fastforest, at least that's what I observed. Fastforest and m2cgen are very similar now apparently, in your numbers FastForest was faster and for me it was m2cgen. A few years ago m2cgen was still slower, but I suppose it caught up thanks to improved compilers.

Let me know if these issue is resolved for you or if you want me to follow up on something :) I'm also very curious about your own reader!

kpedro88 · 2022-01-27T03:12:32Z

Thanks @guitargeek, I'm able to run all the benchmarks now and I get similar, though not exactly the same, behavior on my Intel CPU. I also observe that the results vary more than expected based on the randomly-generated BDT and input data.

Versions:

Compiler: GCC 11.1.0
Python: 3.9.6 (older)
xgboost: 1.5.2
m2cgen: 0.9.0
ROOT: 6.24.06 (newer)

Results:
FastForest: 1.34 s
treelite: 2.24 s
m2cgen: 1.41 s
xgboost: 1.33 s
TMVA: 9.2 s

kpedro88 · 2022-01-27T03:16:06Z

I'm just going to summarize the details of my custom reader here, since it turns out not to be faster. (At one time, I had observed it to be faster than the CMS GBRTree, but this was long ago with gcc ~5.)

GBRTree/FastForest stores the trees in separate vectors:

        std::vector<int> rootIndices_;
        std::vector<CutIndexType> cutIndices_;
        std::vector<FeatureType> cutValues_;
        std::vector<int> leftIndices_;
        std::vector<int> rightIndices_;
        std::vector<TreeResponseType> responses_;

and uses them as (inside of the index-updating while loop):

            auto r = rightIndices_[index];
            auto l = leftIndices_[index];
            index = array[cutIndices_[index]] > cutValues_[index] ? r : l;

My idea was to store all elements in a single vector to improve memory locality, originally using a struct, but it can be simplified even further:

	std::vector<int> rootIndices_;
	//"node" layout: index, cut, left, right
	std::vector<int> nodes_;
	std::vector<float> responses_;

The inside of the index-updating while loop then becomes:

				node = &nodes_[index];
				index = features[*node] <= (float&)(*++node) ? *++node : *(node+2);

Surprisingly, this is actually slower than FastForest. It took godbolt, llvm-mca, callgrind, and Vincenzo to figure out why: gcc emits branchless code for the FastForest version, but switches to conditional branching for my version. It's then slowed down by branch prediction misses.

It's not entirely clear why gcc does this, but I eventually found a modification that forces it to be branchless:

				node = &nodes_[index];
				index = *(node+2+(features[*node] > (float&)(*(node+1))));

which results in a similar speed to FastForest (still a bit slower, suffering from a few extra instructions).

Interestingly, clang (12.0.0) emits branchless instructions for both versions and they perform very similarly. (And gcc with -O3 emits branching instructions for both and they slow down...)

The demo code is in https://github.com/kpedro88/XGBoost-FastForest/blob/aos3/benchmark/benchmark-01-aos.cpp and a script to run everything is https://gist.github.com/kpedro88/4f60741a1d3b8b2830d4447f239f13a2#file-run_bdt-sh. Godbolt links:
FastForest: https://godbolt.org/z/3EosczYEf
Mine: https://godbolt.org/z/nddbMWq4h
Mine (branchless): https://godbolt.org/z/E5xrcYdMG

guitargeek self-assigned this Jan 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Difficulty reproducing benchmarks #16

Difficulty reproducing benchmarks #16

kpedro88 commented Jan 9, 2022

guitargeek commented Jan 18, 2022

kpedro88 commented Jan 27, 2022

kpedro88 commented Jan 27, 2022

Difficulty reproducing benchmarks #16

Difficulty reproducing benchmarks #16

Comments

kpedro88 commented Jan 9, 2022

guitargeek commented Jan 18, 2022

kpedro88 commented Jan 27, 2022

kpedro88 commented Jan 27, 2022