
Conversation

@abhinaykukkadapu
Contributor

@abhinaykukkadapu abhinaykukkadapu commented Nov 10, 2025

Summary:
Changes:

  1. Add a --seq_len param to the llama script to distinguish it from max_seq_len, which is a compile-time param.
  2. Add a warning in the runner when seq_len is clamped to max_seq_len, so the clamp is not silent.
  3. Add a warning in the token generator when EOS is not reached due to an insufficient seq_len or max_seq_len (a sketch of this logic follows the test output below).

Differential Revision: D86696759

Tests

Use --seq_len=600, prompt_len=512

I 00:00:02.883890 executorch:token_generator.cpp:333] Warning: Generation stopped at seq_len limit (600) without reaching EOS token. Response may be incomplete.
I 00:00:02.884094 executorch:token_generator.cpp:346] - seq_len (600) is less than compiled max_seq_len (1024). Consider increasing --seq_len (up to 1024).

Use --seq_len=2048, prefill_ar_len=1024

I 00:00:00.546967 executorch:runner.cpp:385] Warning: Requested seq_len (2048) exceeds compiled max_seq_len (1024). Clamping to 1024.
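
For illustration, a minimal Python sketch of the clamping and EOS warnings exercised above. The actual changes live in the C++ runner and token generator (runner.cpp, token_generator.cpp); the function names and logging calls below are assumptions used only to show the logic.

import logging

# Hypothetical sketch, not the PR's C++ code: clamp the requested seq_len to the
# compiled max_seq_len and warn instead of clamping silently.
def resolve_seq_len(requested_seq_len: int, max_seq_len: int) -> int:
    if requested_seq_len > max_seq_len:
        logging.warning(
            "Requested seq_len (%d) exceeds compiled max_seq_len (%d). Clamping to %d.",
            requested_seq_len, max_seq_len, max_seq_len,
        )
        return max_seq_len
    return requested_seq_len

# Hypothetical sketch: explain why generation stopped when EOS was never produced.
def warn_if_no_eos(reached_eos: bool, seq_len: int, max_seq_len: int) -> None:
    if reached_eos:
        return
    logging.warning(
        "Generation stopped at seq_len limit (%d) without reaching EOS token. "
        "Response may be incomplete.",
        seq_len,
    )
    if seq_len < max_seq_len:
        logging.warning(
            "seq_len (%d) is less than compiled max_seq_len (%d). "
            "Consider increasing --seq_len (up to %d).",
            seq_len, max_seq_len, max_seq_len,
        )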

@pytorch-bot

pytorch-bot bot commented Nov 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15716

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 76b9c28 with merge base e774b77:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Nov 10, 2025
@meta-codesync

meta-codesync bot commented Nov 10, 2025

@abhinaykukkadapu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86696759.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Collaborator

@winskuo-quic winskuo-quic left a comment


Hi @abhinaykukkadapu,
Thanks for the PR.
I would like to know if we could achieve the same thing with the combination of
--pre_gen_pte & --max_seq_len.

For example:
During compile time, you can provide:
--max_seq_len 1024 --compile_only

During inference, you can provide:
--max_seq_len 512 --pre_gen_pte ./path_to_pregen_pte

@abhinaykukkadapu
Contributor Author

abhinaykukkadapu commented Nov 11, 2025

Hi @abhinaykukkadapu, Thanks for the PR. I would like to know if we could achieve the same thing with the combination of --pre_gen_pte & --max_seq_len.

For example: During compile time, you can provide: --max_seq_len 1024 --compile_only

During inference, you can provide: --max_seq_len 512 --pre_gen_pte ./path_to_pregen_pte

@winskuo-quic Thanks for the quick review. The goal of this additional param is to avoid confusing users of the script into thinking that --max_seq_len can be dynamic; it is a static param that is fixed at compilation.

Currently, one can pass --max_seq_len for inference, which actually made me think the total context length could be changed dynamically; I only found out after digging through the code that it is piped through as seq_len to the runner. With this new param we have a clear distinction: --max_seq_len is used at compile time and --seq_len at runtime/inference. Functionally this change is a no-op; it only improves the user experience by making it visible when we clamp the value.


parser.add_argument(
    "--seq_len",
    help="[Runtime-time] Maximum number of tokens to generate (prompt + output). If not specified, uses --max_seq_len. Will be clamped to compiled max_seq_len if exceeded.",
Collaborator


Maybe [Runtime-time] -> [Runtime]

@winskuo-quic
Collaborator

@winskuo-quic Thanks for the quick review. The goal of this additional param is to avoid confusing users of the script into thinking that --max_seq_len can be dynamic; it is a static param that is fixed at compilation.

Currently, one can pass --max_seq_len for inference, which actually made me think the total context length could be changed dynamically; I only found out after digging through the code that it is piped through as seq_len to the runner. With this new param we have a clear distinction: --max_seq_len is used at compile time and --seq_len at runtime/inference. Functionally this change is a no-op; it only improves the user experience by making it visible when we clamp the value.

I see.
I think it makes sense to add some warning messages in .cpp files to guide users, which is helpful.
However, for llama.py, I would like to know whether you think the --max_seq_len flag is misleading. The --max_seq_len flag can actually be set to a different number every time during execution, as long as it is shorter than the --max_seq_len used during compilation.

@abhinaykukkadapu
Contributor Author

abhinaykukkadapu commented Nov 11, 2025

I think it makes sense to add some warning messages in .cpp files to guide users, which is helpful.

@winskuo-quic Right, I think we should be transparent about these. I've already added the messages I think would be helpful, but please suggest any more you have in mind.

However, for llama.py, I would like to know whether you think the --max_seq_len flag is misleading. The --max_seq_len flag can actually be set to a different number every time during execution, as long as it is shorter than the --max_seq_len used during compilation.

Yeah, I think this param is misleading: it clearly represents the maximum context a model can have, so using it during inference is confusing for someone new to the QCOM delegate, who might not know we use static llama and might think it can change the total context length of the model dynamically. Also, all I did was take the QNN runner's existing param (--seq_len) and bubble it up to the llama.py script.

If you think this adds to the confusion, I'm also open to removing it and keeping only the warning messages in this PR.

abhinaykukkadapu added a commit to abhinaykukkadapu/executorch that referenced this pull request Nov 11, 2025
…5716)

Summary:

Changes:
1. add `--seq_len` param to llama script to distinguish max_seq_len which is compile time param
2. Add warnings in the runner when `seq_len` is clamped to `max_seq_len` to avoid silently clamping it.
3. Add warnings in the token generator when EOS is not reached due to insufficient seq_len or max_seq_len.

Differential Revision: D86696759
outputs.append(f.read())

seq_len = args.max_seq_len
# Use --seq_len if provided (inference-only), otherwise fall back to --max_seq_len
Contributor

@cccclai cccclai Nov 11, 2025


I don't quite follow why we need seq_len; can you share more? I feel like it might add further to the confusion...

Contributor Author

@abhinaykukkadapu abhinaykukkadapu Nov 11, 2025


The runner itself uses a param named --seq_len; it is the llama.py script that repurposes --max_seq_len and passes it as seq_len to the runner. For context, if you've followed the internal discussion on benchmark numbers: we thought we had swept the benchmarks over max_seq_len and prompt_length, but the sweep was not valid for max_seq_len because it is a compile-time param and is ignored if it is larger than what the model was compiled with.

Looking at CoreML and they use separate params as well: https://github.com/pytorch/executorch/blob/main/examples/apple/coreml/llama/run.py#L97-L103

Open to suggestions if there are better ways to distinguish this param at compile time vs. runtime from a UX perspective.
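
For illustration only, here is a minimal sketch of the kind of runtime fallback being discussed, assuming argparse attributes named seq_len and max_seq_len; the exact wiring in llama.py may differ.

# Hypothetical sketch: prefer a runtime --seq_len and fall back to --max_seq_len.
# The attribute names mirror the flags discussed in this thread; llama.py's
# actual code may differ.
def resolve_runtime_seq_len(args) -> int:
    seq_len = args.seq_len if args.seq_len is not None else args.max_seq_len
    if seq_len > args.max_seq_len:
        # The runner clamps anyway; surfacing it here avoids a silent clamp.
        print(
            f"Warning: requested seq_len ({seq_len}) exceeds compiled "
            f"max_seq_len ({args.max_seq_len}); it will be clamped."
        )
        seq_len = args.max_seq_len
    return seq_len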

Contributor


I think we need to revisit seq_len in the QNN llama runner because it seems to be only for debugging/profiling purposes. Users wouldn't need to use it, and I want to make sure we don't make it more confusing.

abhinaykukkadapu added a commit to abhinaykukkadapu/executorch that referenced this pull request Nov 13, 2025
…5716)

Summary:

Changes:
1. add `--seq_len` param to llama script to distinguish max_seq_len which is compile time param
2. Add warnings in the runner when `seq_len` is clamped to `max_seq_len` to avoid silently clamping it.
3. Add warnings in the token generator when EOS is not reached due to insufficient seq_len or max_seq_len.

Differential Revision: D86696759
Collaborator

@winskuo-quic winskuo-quic left a comment


Thanks for the explanation.
I am slightly leaning toward leaving the warning messages in the runtime and reusing the --max_seq_len flag in llama.py, which aligns with your new commit.
Thanks for the help on improving the user experience!
