
Conversation

@abhinaykukkadapu
Contributor

@abhinaykukkadapu abhinaykukkadapu commented Nov 10, 2025

Summary:
Changes:

  1. Add a --seq_len param to the llama script to distinguish it from max_seq_len, which is a compile-time param.
  2. Add a warning in the runner when seq_len is clamped to max_seq_len, so the clamp is not silent.
  3. Add a warning in the token generator when EOS is not reached due to an insufficient seq_len or max_seq_len (a sketch of this logic follows the test output below).

Differential Revision: D86696759

Tests

Use --seq_len=600, prompt_len=512

I 00:00:02.883890 executorch:token_generator.cpp:333] Warning: Generation stopped at seq_len limit (600) without reaching EOS token. Response may be incomplete.
I 00:00:02.884094 executorch:token_generator.cpp:346] - seq_len (600) is less than compiled max_seq_len (1024). Consider increasing --seq_len (up to 1024).

Use --seq_len=2048, prefill_ar_len=1024

I 00:00:00.546967 executorch:runner.cpp:385] Warning: Requested seq_len (2048) exceeds compiled max_seq_len (1024). Clamping to 1024.
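
For illustration, a minimal Python sketch of the clamping and EOS warnings exercised above. The actual changes live in the C++ runner and token generator (runner.cpp, token_generator.cpp); the function names and logging calls below are assumptions used only to show the logic.

import logging

# Hypothetical sketch, not the PR's C++ code: clamp the requested seq_len to the
# compiled max_seq_len and warn instead of clamping silently.
def resolve_seq_len(requested_seq_len: int, max_seq_len: int) -> int:
    if requested_seq_len > max_seq_len:
        logging.warning(
            "Requested seq_len (%d) exceeds compiled max_seq_len (%d). Clamping to %d.",
            requested_seq_len, max_seq_len, max_seq_len,
        )
        return max_seq_len
    return requested_seq_len

# Hypothetical sketch: explain why generation stopped when EOS was never produced.
def warn_if_no_eos(reached_eos: bool, seq_len: int, max_seq_len: int) -> None:
    if reached_eos:
        return
    logging.warning(
        "Generation stopped at seq_len limit (%d) without reaching EOS token. "
        "Response may be incomplete.",
        seq_len,
    )
    if seq_len < max_seq_len:
        logging.warning(
            "seq_len (%d) is less than compiled max_seq_len (%d). "
            "Consider increasing --seq_len (up to %d).",
            seq_len, max_seq_len, max_seq_len,
        )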

@pytorch-bot

pytorch-bot bot commented Nov 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15716

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 76b9c28 with merge base e774b77:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Nov 10, 2025
@meta-codesync

meta-codesync bot commented Nov 10, 2025

@abhinaykukkadapu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86696759.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Collaborator

@winskuo-quic winskuo-quic left a comment


Hi @abhinaykukkadapu,
Thanks for the PR.
I would like to know if we could achieve the same thing with the combination of
--pre_gen_pte & --max_seq_len.

For example:
During compile time, you can provide:
--max_seq_len 1024 --compile_only

During inference, you can provide:
--max_seq_len 512 --pre_gen_pte ./path_to_pregen_pte

@abhinaykukkadapu
Contributor Author

abhinaykukkadapu commented Nov 11, 2025

Hi @abhinaykukkadapu, Thanks for the PR. I would like to know if we could achieve the same thing with the combination of --pre_gen_pte & --max_seq_len.

For example: During compile time, you can provide: --max_seq_len 1024 --compile_only

During inference, you can provide: --max_seq_len 512 --pre_gen_pte ./path_to_pregen_pte

@winskuo-quic Thanks for the quick review. The goal of this additional param is to avoid confusing users of the script into thinking that --max_seq_len can be dynamic; it is a static param that is fixed at compilation.

Currently, one can pass --max_seq_len for inference, which actually made me think the total context length could be changed dynamically; I only found out after digging through the code that it is piped through as seq_len to the runner. With this new param we have a clear distinction: --max_seq_len is used at compile time and --seq_len at runtime/inference. Functionally this change is a no-op; it only improves the user experience by making it visible when we clamp the value.


parser.add_argument(
    "--seq_len",
    help="[Runtime-time] Maximum number of tokens to generate (prompt + output). If not specified, uses --max_seq_len. Will be clamped to compiled max_seq_len if exceeded.",
Collaborator


Maybe [Runtime-time] -> [Runtime]

@winskuo-quic
Collaborator

@winskuo-quic Thanks for the quick review. The goal of this additional param is to avoid confusing users of the script into thinking that --max_seq_len can be dynamic; it is a static param that is fixed at compilation.

Currently, one can pass --max_seq_len for inference, which actually made me think the total context length could be changed dynamically; I only found out after digging through the code that it is piped through as seq_len to the runner. With this new param we have a clear distinction: --max_seq_len is used at compile time and --seq_len at runtime/inference. Functionally this change is a no-op; it only improves the user experience by making it visible when we clamp the value.

I see.
I think it makes sense to add some warning messages in .cpp files to guide users, which is helpful.
However, for llama.py, I would like to know whether you think the --max_seq_len flag is misleading. The --max_seq_len flag can actually be set to a different number every time during execution, as long as it is shorter than the --max_seq_len used during compilation.

@abhinaykukkadapu
Contributor Author

abhinaykukkadapu commented Nov 11, 2025

I think it makes sense to add some warning messages in .cpp files to guide users, which is helpful.

@winskuo-quic Right, I think we should be transparent about these. I've already added the messages I think would be helpful, but please suggest any more you have in mind.

However, for llama.py, I would like to know whether you think the --max_seq_len flag is misleading. The --max_seq_len flag can actually be set to a different number every time during execution, as long as it is shorter than the --max_seq_len used during compilation.

Yeah, I think this param is misleading: it clearly represents the maximum context a model can have, so using it during inference is confusing for someone new to the QCOM delegate, who might not know we use static llama and might think it can change the total context length of the model dynamically. Also, all I did was take the QNN runner's existing param (--seq_len) and bubble it up to the llama.py script.

If you think this adds to the confusion, I'm also open to removing it and keeping only the warning messages in this PR.

abhinaykukkadapu added a commit to abhinaykukkadapu/executorch that referenced this pull request Nov 11, 2025
…5716)

Summary:

Changes:
1. add `--seq_len` param to llama script to distinguish max_seq_len which is compile time param
2. Add warnings in the runner when `seq_len` is clamped to `max_seq_len` to avoid silently clamping it.
3. Add warnings in the token generator when EOS is not reached due to insufficient seq_len or max_seq_len.

Differential Revision: D86696759
outputs.append(f.read())

seq_len = args.max_seq_len
# Use --seq_len if provided (inference-only), otherwise fall back to --max_seq_len
Contributor

@cccclai cccclai Nov 11, 2025


I don't quite follow why we need seq_len; can you share more? I feel like it might add further to the confusion...

Contributor Author

@abhinaykukkadapu abhinaykukkadapu Nov 11, 2025


The runner itself uses a param named --seq_len; it is the llama.py script that repurposes --max_seq_len and passes it as seq_len to the runner. For context, if you've followed the internal discussion on benchmark numbers: we thought we had swept the benchmarks over max_seq_len and prompt_length, but the sweep was not valid for max_seq_len because it is a compile-time param and is ignored if it is larger than what the model was compiled with.

Looking at CoreML and they use separate params as well: https://github.com/pytorch/executorch/blob/main/examples/apple/coreml/llama/run.py#L97-L103

Open to suggestions if there are better ways to distinguish this param at compile time vs. runtime from a UX perspective.
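
For illustration only, here is a minimal sketch of the kind of runtime fallback being discussed, assuming argparse attributes named seq_len and max_seq_len; the exact wiring in llama.py may differ.

# Hypothetical sketch: prefer a runtime --seq_len and fall back to --max_seq_len.
# The attribute names mirror the flags discussed in this thread; llama.py's
# actual code may differ.
def resolve_runtime_seq_len(args) -> int:
    seq_len = args.seq_len if args.seq_len is not None else args.max_seq_len
    if seq_len > args.max_seq_len:
        # The runner clamps anyway; surfacing it here avoids a silent clamp.
        print(
            f"Warning: requested seq_len ({seq_len}) exceeds compiled "
            f"max_seq_len ({args.max_seq_len}); it will be clamped."
        )
        seq_len = args.max_seq_len
    return seq_len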

Contributor


I think we need to revisit seq_len in the QNN llama runner because it seems to be only for debugging/profiling purposes. Users wouldn't need to use it, and I want to make sure we don't make it more confusing.

abhinaykukkadapu added a commit to abhinaykukkadapu/executorch that referenced this pull request Nov 13, 2025
…5716)

Summary:

Changes:
1. add `--seq_len` param to llama script to distinguish max_seq_len which is compile time param
2. Add warnings in the runner when `seq_len` is clamped to `max_seq_len` to avoid silently clamping it.
3. Add warnings in the token generator when EOS is not reached due to insufficient seq_len or max_seq_len.

Differential Revision: D86696759
Collaborator

@winskuo-quic winskuo-quic left a comment


Thanks for the explanation.
I am slightly leaning toward leaving the warning messages in the runtime and reusing the --max_seq_len flag in llama.py, which aligns with your new commit.
Thanks for the help on improving the user experience!
