
Qualcomm AI Engine Direct - [Multimodal] Multi-turn VLM conversation #17308

Open
DannyYuyang-quic wants to merge 1 commit into pytorch:main from CodeLinaro:dev1/danny/support_vlm_multi-turn_conversation_and_multi_image

Conversation

DannyYuyang-quic (Contributor) commented Feb 9, 2026

Summary:

  • Enable multi-turn conversation and support processing multiple images, both within a single turn and across multiple turns.
  • Refactor the chat template and runner.

Test plan

# Turn 1
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."

# Turn 2

PROMPT2="Answer the question: What's the main object in first image?"

# Turn 3
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT3="<image>Caption this image."

# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"

cc @cccclai @cbilgin

pytorch-bot bot commented Feb 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17308

Note: Links to docs will display an error until the docs builds have been completed.

⚠️ 9 Awaiting Approval

As of commit 2233a6f with merge base 5545395:

AWAITING APPROVAL - The following workflows need approval before CI can run:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label on Feb 9, 2026
DannyYuyang-quic (Contributor, Author) commented:

Hi @cccclai,
This PR extends multi-turn conversation support to VLMs.

Encoder quantization is only enabled when each turn contains a single image.
For multi-image conversations, we skip encoder quantization to maintain visual-embedding quality: regular quantization does not preserve enough precision, causing the decoder to misinterpret image features.
Until we have a reliable tuning method for encoder quantization in multi-image scenarios, we recommend keeping the vision encoder in floating point.
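
Purely as an illustration (this helper is hypothetical, not code from this PR), the single-image-per-turn condition amounts to a check like the following:

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not code from this PR: counts "<image>" placeholders
// in a prompt string.
size_t count_image_tokens(const std::string& prompt) {
  static const std::string kTag = "<image>";
  size_t count = 0;
  for (size_t pos = prompt.find(kTag); pos != std::string::npos;
       pos = prompt.find(kTag, pos + kTag.size())) {
    ++count;
  }
  return count;
}

// Encoder quantization is only considered safe when every turn carries at
// most one image; any multi-image turn keeps the vision encoder in float.
bool can_quantize_encoder(const std::vector<std::string>& turn_prompts) {
  for (const auto& prompt : turn_prompts) {
    if (count_image_tokens(prompt) > 1) {
      return false;
    }
  }
  return true;
}
```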

Below are HTP runtime results for SmolVLM-500M across a 3-turn conversation:
cc: @haowhsu-quic

Simulation turns

Turn 1:
Query:"Compare these images above and list the differences."
Image1: "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
Image2: "http://images.cocodataset.org/val2017/000000039769.jpg"

Turn 2:
Query: "Answer the question: What's the main object in first image?"

Turn 3:
Image:"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
Query="Caption this image."

Answers across the 3 turns

PyTorchObserver {"prompt_tokens":151,"generated_tokens":30,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724158141,"inference_end_ms":1753724164592,"prompt_eval_end_ms":1753724163488,"first_token_ms":1753724163488,"aggregate_sampling_time_ms":49,"SCALING_FACTOR_UNITS_PER_SECOND":1000}

PyTorchObserver {"prompt_tokens":21,"generated_tokens":13,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724164595,"inference_end_ms":1753724165818,"prompt_eval_end_ms":1753724165339,"first_token_ms":1753724165339,"aggregate_sampling_time_ms":70,"SCALING_FACTOR_UNITS_PER_SECOND":1000}

PyTorchObserver {"prompt_tokens":80,"generated_tokens":10,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724165831,"inference_end_ms":1753724169033,"prompt_eval_end_ms":1753724168665,"first_token_ms":1753724168665,"aggregate_sampling_time_ms":86,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/outputs.txt: 1 file pulled. 0.8 MB/s (779 bytes in 0.001s)
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/inference_speed.txt: 1 file pulled. 0.0 MB/s (7 bytes in 0.001s)
INFO:root:Device Inference Results[0]:
<|im_start|>User:<fake_token_around_image><global-img><image><fake_token_around_image><fake_token_around_image><global-img><image><fake_token_around_image>Compare these images above and list the differences.<end_of_utterance>
Assistant: The first image shows a cityscape with a statue of liberty in the foreground. The second image shows two tabby cats sleeping on a pink blanket.<end_of_utterance><|im_start|>User:Answer the question: What's the main object in first image?<end_of_utterance>
Assistant: The main object in the first image is a statue of liberty.<end_of_utterance><|im_start|>User:<fake_token_around_image><global-img><image><fake_token_around_image>Caption this image.<end_of_utterance>
Assistant: A flower with a yellow center and many petals.<end_of_utterance>
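
For reference, the decode throughput implied by the logs above is consistent across turns: 30 tokens in 1,104 ms (inference_end_ms minus prompt_eval_end_ms) in turn 1, 13 tokens in 479 ms in turn 2, and 10 tokens in 368 ms in turn 3, i.e. roughly 27 tokens/s in each turn.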

Script

# Turn 1
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."

# Turn 2

PROMPT2="Answer the question: What's the main object in first image?"

# Turn 3
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT3="<image>Caption this image."

# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m SM8750 --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"

DannyYuyang-quic (Contributor, Author) commented:

@pytorchbot label "release notes: qualcomm"

pytorch-bot added the release notes: qualcomm label on Feb 9, 2026
DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from 88163a8 to f96ab48 on February 10, 2026
meta-codesync bot commented Feb 10, 2026

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D92849100.

larryliu0820 (Contributor) left a comment:

Review automatically exported from Phabricator review in Meta.

metascroy (Contributor) commented:

Can we unify QNN and CoreML runners? The model definitions are similar: #16463

larryliu0820 (Contributor) commented Feb 11, 2026

@DannyYuyang-quic and folks,

Thank you for adding this feature! I have a few questions regarding the high level architecture of the runners.

I think you have seen the generic LLM runners and multimodal runners under extension/llm/runner. I'm curious to learn what the blockers are for the Qualcomm versions of these runners to extend from the extension/llm/runner base classes.

The reason for this is that we maintain a JNI layer that connects to Android. I understand that right now this is done through the irunner interface; I'm wondering whether we can reuse more components, which would largely reduce our maintenance burden.

One concrete thing is whether we can do something like:

class QNNMultimodalRunner : public executorch::extension::llm::MultimodalRunner {

and reuse the common logic. I think the team is happy to change the generic MultimodalRunner so that it can be extended easily.
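
For illustration, a minimal sketch of that shape (QnnMultimodalRunner is a hypothetical name here, and inheritable base-class constructors are an assumption; the real extension::llm::runner::MultimodalRunner API may differ):

```cpp
#include <executorch/extension/llm/runner/multimodal_runner.h>

namespace qnn_example {

// Hypothetical subclass: the point is the inheritance relationship, not the
// exact API. Shared pieces (tokenizer handling, sampling, the generate loop,
// and the JNI-facing surface) would come from the base class; only
// QNN-specific behavior would be overridden here.
class QnnMultimodalRunner
    : public executorch::extension::llm::MultimodalRunner {
 public:
  // Reuse whatever constructors the base class provides.
  using executorch::extension::llm::MultimodalRunner::MultimodalRunner;
};

} // namespace qnn_example
```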

Tell me what you think about this!

larryliu0820 (Contributor) left a comment:
Requesting changes for the comment

DannyYuyang-quic (Contributor, Author) commented:

@larryliu0820 Thanks for the explanation! I fully understand the motivation for having QNNMultimodalRunner extend the generic MultimodalRunner for JNI integration.
There are a few reasons why we did not inherit from MultimodalRunner:

  1. MultimodalRunner is still marked as ET_EXPERIMENTAL:
    Our multimodal_runner will evolve quite a bit in the near future; our short-term plan includes supporting vision, audio, and mixed-modality LLMs. We preferred to keep our implementation fully under our control so that we can iterate quickly without being constrained by the experimental API.

  2. Including the encoder and text decoder in a single PTE is not ideal for our use case:
    The current MultimodalRunner only accepts one .pte.
    However, encoder models are much harder to quantize (especially when handling multiple images in a single conversation).
    For flexibility, we want users to be able to switch encoder precision without recompiling the entire PTE; see the sketch after this list.
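
As a minimal sketch of that two-PTE layout, assuming hypothetical file names and using the executorch::extension::Module API (this is not the PR's runner code), each component loads as its own module, so the encoder artifact can be swapped between float and quantized exports independently of the decoder:

```cpp
#include <executorch/extension/module/module.h>
#include <executorch/runtime/core/error.h>

using executorch::extension::Module;
using executorch::runtime::Error;

int main() {
  // Hypothetical file names: the encoder PTE can be a floating-point or a
  // quantized export, chosen at deploy time without recompiling the decoder.
  Module vision_encoder("smolvlm_vision_encoder_fp.pte");
  Module text_decoder("smolvlm_text_decoder_quantized.pte");

  // Each program loads (and fails) independently of the other.
  if (vision_encoder.load() != Error::Ok || text_decoder.load() != Error::Ok) {
    return 1;
  }

  // A runner built on this layout would run images through
  // vision_encoder.forward(...) and feed the resulting embeddings to
  // text_decoder.forward(...).
  return 0;
}
```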

We absolutely agree that in the long run it would be beneficial to reuse more components from the generic runner.
Once our multimodal runner functionality stabilizes, we are happy to make it JNI-friendly.

I also have a question about maintaining the JNI layer:
From my understanding, if we want to maintain a JNI layer for multimodal support now, it seems we would only need to inherit from executorch::extension::llm::MultimodalRunner. Is that correct? Or are there additional components we would need to handle?

If you'd like to go deeper into the details, we're happy to start an email thread!
Thanks!

cccclai (Contributor) commented Feb 20, 2026

@larryliu0820 what do you think?

larryliu0820 (Contributor) commented:

@DannyYuyang-quic thank you for your reply. I'm happy to accept your PR if you commit to a follow-up that extends your multimodal runner from extension::llm::runner::MultimodalRunner. Being tagged as EXPERIMENTAL actually gives us the ability to iterate fast, so if you feel the current extension::llm::runner::MultimodalRunner API design is not ideal, feel free to propose changes.

DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from f96ab48 to 4da37f3 on February 23, 2026
DannyYuyang-quic (Contributor, Author) commented:

@larryliu0820 Thanks! I've added a follow-up commit that simply extends the QNN multimodal runner from extension::llm::runner::MultimodalRunner. After the pending MLLM support is done on our side, we will further refactor the runner to align it with the generic runner.

DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch twice, most recently from d11671f to bdad8bd, on March 2, 2026
DannyYuyang-quic (Contributor, Author) commented:

Hi @cccclai,
I noticed this hasn't been imported yet. Let me know if anything is needed from my side! :)

Thanks!

digantdesai added the module: qnn label on Mar 10, 2026
digantdesai (Contributor) commented:

Perhaps rebase, repush, and I can check internal CI.

DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from bdad8bd to ef72d88 on March 10, 2026
DannyYuyang-quic (Contributor, Author) commented:

> Perhaps rebase, repush, and I can check internal CI.

Rebased, thanks!

DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from ef72d88 to 0c03368 on March 12, 2026
Summary:
 - Multi-turn conversation: add conversation support for the VLM scenario
   - add a runtime chat template
 - Multi-image inputs: accept multiple images per turn and across turns
 - Extend QNNMultimodalRunner from the generic MultimodalRunner
DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from 0c03368 to 2233a6f on March 15, 2026
