tests : enhance llama-bench with separate timings (pp/gen t/s), added n_threads_batch#14219
thad0ctor wants to merge into ggml-org:master
Conversation
Pull Request Overview
This PR adds separate timing measurements for prompt processing and token generation in llama-bench and introduces a new command‑line argument (n_threads_batch) for batch thread specification.
- Added a new parameter “n_threads_batch” across configuration, parsing, test execution, and output formatting.
- Integrated separate metrics for prompt and generation timing (samples_prompt_ns, samples_gen_ns) and updated the markdown and SQL printers to display the new metrics.
Comments suppressed due to low confidence (2)
tools/llama-bench/llama-bench.cpp:1617
- [nitpick] Mapping the field 'n_threads_batch' to the alias 'th_batch' might be unclear; consider either using a more descriptive alias or adding an inline comment to explain the abbreviation.
if (field == "n_threads_batch") {
tools/llama-bench/llama-bench.cpp:2070
- [nitpick] Although separate timing measurements for prompt and generation are implemented, adding inline comments to explain the timing logic can improve clarity for future maintainers.
uint64_t t_ns = get_time_ns() - t_start;
std::vector<double> get_prompt_ts() const {
    if (samples_prompt_ns.empty() || n_prompt == 0) return {};
    std::vector<double> ts;
    std::transform(samples_prompt_ns.begin(), samples_prompt_ns.end(), std::back_inserter(ts),
                   [this](uint64_t t) { return 1e9 * n_prompt / t; });
    return ts;
}

std::vector<double> get_gen_ts() const {
    if (samples_gen_ns.empty() || n_gen == 0) return {};
    std::vector<double> ts;
    std::transform(samples_gen_ns.begin(), samples_gen_ns.end(), std::back_inserter(ts),
                   [this](uint64_t t) { return 1e9 * n_gen / t; });
    return ts;
}
The get_prompt_ts() and get_gen_ts() functions contain very similar code. Consider extracting a common helper function to reduce duplication.
Suggested change:

std::vector<double> get_prompt_ts() const {
    return get_ts_helper(samples_prompt_ns, n_prompt);
}

std::vector<double> get_gen_ts() const {
    return get_ts_helper(samples_gen_ns, n_gen);
}

private:
std::vector<double> get_ts_helper(const std::vector<uint64_t> & samples, int n) const {
    if (samples.empty() || n == 0) return {};
    std::vector<double> ts;
    std::transform(samples.begin(), samples.end(), std::back_inserter(ts),
                   [n](uint64_t t) { return 1e9 * n / t; });
    return ts;
}
Optional; possibly for a follow-on PR.
n-threads-batch: I think we have two different schools of thought. I view llama-bench as a tool for getting detailed information to fine-tune the performance of a model for a certain system, multi-model server, workflow, etc. As such, more parameters that can give the user insight into their performance are a value-add. If you are worried about this causing confusion, I can update the code to only show the column when the parameter was passed.

pp/gen t/s: similarly, this is an effective data point and (if such a thing exists) an industry standard when it comes to reviewing the bench performance of new models, quants, etc. This is an incredibly valuable data point for assessing models, workflows, etc. If you are that worried, I can add functionality to hide the extra token/s columns behind another parameter. I view these as elementary data points, though, and standard information provided by all backends.
I am sorry, but I don't see the point of any of these changes. Users can test different thread numbers, and when running llama.cpp normally they can use the number of threads that performed best for generation.
--threads-batch is not a llama-bench parameter, though. The use case may not appeal to you, but adding this allows users to benchmark a specific parameter; this is a benchmarking tool, after all. If someone wants to explore the interrelation between threads-batch and other settings, have at it; these are parameters that work with llama-cli or the server, so why not be able to test them? The earlier output I showed does show minor differences in performance when combining various threads/threads-batch combinations.

Regarding your last comment about pp/tg: how can one process the data differently if the default output format (json, md, etc.) doesn't provide the fidelity for a user to even estimate pp/tg t/s with a script? See the default output format: you can only infer pp t/s if -gen is 0.

You may not see the utility in this functionality, but if you look at enough models online you see users routinely measuring benchmark performance in gen and pp t/s. Averaging these into one figure limits one's ability to refine settings, model selection, etc. for a certain workflow/use case.
added gen t/s and pp t/s outputs to llama-bench
n-threads-batch args to llama-bench
Minor improvements to llama-bench
New Features
Example output: