Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Difficulty reproducing benchmarks #16

Open
kpedro88 opened this issue Jan 9, 2022 · 3 comments
Open

Difficulty reproducing benchmarks #16

kpedro88 opened this issue Jan 9, 2022 · 3 comments
Assignees

Comments

@kpedro88
Copy link

kpedro88 commented Jan 9, 2022

I'm interested to compare my own custom reader to this one, but I'm having some trouble getting benchmark results consistent with what the readme reports. Details:

  • CPU: Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz
  • Compiler: GCC 9.3.0
  • Python: 3.9.6 (newer)
  • xgboost: 1.3.3 (newer)
  • m2cgen: 0.9.0
  • ROOT: 6.22.08 (newer)

My results:
FastForest: 1.61 s
treelite: -*
m2cgen: 1.71 s
xgboost: 0.12 s
TMVA: 8.95 s

The relative differences in these numbers are rather different from the readme (and xgboost is blazing fast somehow).

It could be useful to distribute a Dockerfile that sets up and runs all the benchmarks in a more controlled environment (OS, versions, etc.).

* The treelite benchmark (using version 2.1.0) doesn't work at all. I get the following error:

ModuleNotFoundError: No module named 'treelite.runtime'

If I swap it with treelite_runtime (a separate pip package), then there's another error:

AttributeError: module 'treelite_runtime' has no attribute 'Batch'
@guitargeek
Copy link
Owner

Hi @kpedro88, thanks for checking out the library and opening the issue!

Indeed, updating the benchmarks was long overdue. I have now added a commit where I updated the benchmark code and results. While doing that, I noticed some performance regression in fastforest in the last year and also fixed that (see the second-to-last commit).

I hope you can reproduce the benchmarks now and that they will make sense to you! The reason why xgboost appeared so fast was that newer versions use multithreading by default, so I needed to explicitly set it to one thread now. Treelike will probably be slower than fastforest, at least that's what I observed. Fastforest and m2cgen are very similar now apparently, in your numbers FastForest was faster and for me it was m2cgen. A few years ago m2cgen was still slower, but I suppose it caught up thanks to improved compilers.

Let me know if these issue is resolved for you or if you want me to follow up on something :) I'm also very curious about your own reader!

@guitargeek guitargeek self-assigned this Jan 18, 2022
@kpedro88
Copy link
Author

Thanks @guitargeek, I'm able to run all the benchmarks now and I get similar, though not exactly the same, behavior on my Intel CPU. I also observe that the results vary more than expected based on the randomly-generated BDT and input data.

Versions:

  • Compiler: GCC 11.1.0
  • Python: 3.9.6 (older)
  • xgboost: 1.5.2
  • m2cgen: 0.9.0
  • ROOT: 6.24.06 (newer)

Results:
FastForest: 1.34 s
treelite: 2.24 s
m2cgen: 1.41 s
xgboost: 1.33 s
TMVA: 9.2 s

@kpedro88
Copy link
Author

I'm just going to summarize the details of my custom reader here, since it turns out not to be faster. (At one time, I had observed it to be faster than the CMS GBRTree, but this was long ago with gcc ~5.)

GBRTree/FastForest stores the trees in separate vectors:

        std::vector<int> rootIndices_;
        std::vector<CutIndexType> cutIndices_;
        std::vector<FeatureType> cutValues_;
        std::vector<int> leftIndices_;
        std::vector<int> rightIndices_;
        std::vector<TreeResponseType> responses_;

and uses them as (inside of the index-updating while loop):

            auto r = rightIndices_[index];
            auto l = leftIndices_[index];
            index = array[cutIndices_[index]] > cutValues_[index] ? r : l;

My idea was to store all elements in a single vector to improve memory locality, originally using a struct, but it can be simplified even further:

	std::vector<int> rootIndices_;
	//"node" layout: index, cut, left, right
	std::vector<int> nodes_;
	std::vector<float> responses_;

The inside of the index-updating while loop then becomes:

				node = &nodes_[index];
				index = features[*node] <= (float&)(*++node) ? *++node : *(node+2);

Surprisingly, this is actually slower than FastForest. It took godbolt, llvm-mca, callgrind, and Vincenzo to figure out why: gcc emits branchless code for the FastForest version, but switches to conditional branching for my version. It's then slowed down by branch prediction misses.

It's not entirely clear why gcc does this, but I eventually found a modification that forces it to be branchless:

				node = &nodes_[index];
				index = *(node+2+(features[*node] > (float&)(*(node+1))));

which results in a similar speed to FastForest (still a bit slower, suffering from a few extra instructions).

Interestingly, clang (12.0.0) emits branchless instructions for both versions and they perform very similarly. (And gcc with -O3 emits branching instructions for both and they slow down...)

The demo code is in https://github.com/kpedro88/XGBoost-FastForest/blob/aos3/benchmark/benchmark-01-aos.cpp and a script to run everything is https://gist.github.com/kpedro88/4f60741a1d3b8b2830d4447f239f13a2#file-run_bdt-sh. Godbolt links:
FastForest: https://godbolt.org/z/3EosczYEf
Mine: https://godbolt.org/z/nddbMWq4h
Mine (branchless): https://godbolt.org/z/E5xrcYdMG

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants