-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Difficulty reproducing benchmarks #16
Comments
Hi @kpedro88, thanks for checking out the library and opening the issue! Indeed, updating the benchmarks was long overdue. I have now added a commit where I updated the benchmark code and results. While doing that, I noticed some performance regression in fastforest in the last year and also fixed that (see the second-to-last commit). I hope you can reproduce the benchmarks now and that they will make sense to you! The reason why xgboost appeared so fast was that newer versions use multithreading by default, so I needed to explicitly set it to one thread now. Treelike will probably be slower than fastforest, at least that's what I observed. Fastforest and Let me know if these issue is resolved for you or if you want me to follow up on something :) I'm also very curious about your own reader! |
Thanks @guitargeek, I'm able to run all the benchmarks now and I get similar, though not exactly the same, behavior on my Intel CPU. I also observe that the results vary more than expected based on the randomly-generated BDT and input data. Versions:
Results: |
I'm just going to summarize the details of my custom reader here, since it turns out not to be faster. (At one time, I had observed it to be faster than the CMS GBRTree, but this was long ago with gcc ~5.) GBRTree/FastForest stores the trees in separate vectors: std::vector<int> rootIndices_;
std::vector<CutIndexType> cutIndices_;
std::vector<FeatureType> cutValues_;
std::vector<int> leftIndices_;
std::vector<int> rightIndices_;
std::vector<TreeResponseType> responses_; and uses them as (inside of the index-updating while loop): auto r = rightIndices_[index];
auto l = leftIndices_[index];
index = array[cutIndices_[index]] > cutValues_[index] ? r : l; My idea was to store all elements in a single vector to improve memory locality, originally using a struct, but it can be simplified even further: std::vector<int> rootIndices_;
//"node" layout: index, cut, left, right
std::vector<int> nodes_;
std::vector<float> responses_; The inside of the index-updating while loop then becomes: node = &nodes_[index];
index = features[*node] <= (float&)(*++node) ? *++node : *(node+2); Surprisingly, this is actually slower than FastForest. It took godbolt, llvm-mca, callgrind, and Vincenzo to figure out why: gcc emits branchless code for the FastForest version, but switches to conditional branching for my version. It's then slowed down by branch prediction misses. It's not entirely clear why gcc does this, but I eventually found a modification that forces it to be branchless: node = &nodes_[index];
index = *(node+2+(features[*node] > (float&)(*(node+1)))); which results in a similar speed to FastForest (still a bit slower, suffering from a few extra instructions). Interestingly, clang (12.0.0) emits branchless instructions for both versions and they perform very similarly. (And gcc with -O3 emits branching instructions for both and they slow down...) The demo code is in https://github.com/kpedro88/XGBoost-FastForest/blob/aos3/benchmark/benchmark-01-aos.cpp and a script to run everything is https://gist.github.com/kpedro88/4f60741a1d3b8b2830d4447f239f13a2#file-run_bdt-sh. Godbolt links: |
I'm interested to compare my own custom reader to this one, but I'm having some trouble getting benchmark results consistent with what the readme reports. Details:
My results:
FastForest: 1.61 s
treelite: -*
m2cgen: 1.71 s
xgboost: 0.12 s
TMVA: 8.95 s
The relative differences in these numbers are rather different from the readme (and xgboost is blazing fast somehow).
It could be useful to distribute a Dockerfile that sets up and runs all the benchmarks in a more controlled environment (OS, versions, etc.).
* The treelite benchmark (using version 2.1.0) doesn't work at all. I get the following error:
If I swap it with
treelite_runtime
(a separate pip package), then there's another error:The text was updated successfully, but these errors were encountered: