
Qualcomm AI Engine Direct - [Multimodal] Multi-turn VLM conversation #17308

Open
DannyYuyang-quic wants to merge 1 commit into pytorch:main from CodeLinaro:dev1/danny/support_vlm_multi-turn_conversation_and_multi_image

Conversation

DannyYuyang-quic (Contributor) commented Feb 9, 2026

Summary:

  • Enable multi-turn conversation and support processing multiple images, both within a single turn and across multiple turns.
  • Refactor the chat template and runner.

Test plan

# Turn 1
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."

# Turn 2

PROMPT2="Answer the question: What's the main object in first image?"

# Turn 3
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT3="<image>Caption this image."

# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"

cc @cccclai @cbilgin

pytorch-bot bot commented Feb 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17308

Note: Links to docs will display an error until the docs builds have been completed.

⚠️ 9 Awaiting Approval

As of commit 2233a6f with merge base 5545395:

AWAITING APPROVAL - The following workflows need approval before CI can run:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label on Feb 9, 2026
DannyYuyang-quic (Contributor, Author) commented:

Hi @cccclai,
This PR extends multi-turn conversation support to VLMs.

Encoder quantization is only enabled when each turn contains a single image.
For multi-image conversations, we skip encoder quantization to maintain visual-embedding quality: regular quantization does not preserve enough precision, causing the decoder to misinterpret image features.
Until we have a reliable tuning method for encoder quantization in multi-image scenarios, we recommend keeping the vision encoder in floating point.
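
Purely as an illustration (this helper is hypothetical, not code from this PR), the single-image-per-turn condition amounts to a check like the following:

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not code from this PR: counts "<image>" placeholders
// in a prompt string.
size_t count_image_tokens(const std::string& prompt) {
  static const std::string kTag = "<image>";
  size_t count = 0;
  for (size_t pos = prompt.find(kTag); pos != std::string::npos;
       pos = prompt.find(kTag, pos + kTag.size())) {
    ++count;
  }
  return count;
}

// Encoder quantization is only considered safe when every turn carries at
// most one image; any multi-image turn keeps the vision encoder in float.
bool can_quantize_encoder(const std::vector<std::string>& turn_prompts) {
  for (const auto& prompt : turn_prompts) {
    if (count_image_tokens(prompt) > 1) {
      return false;
    }
  }
  return true;
}
```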

Below are HTP runtime results for SmolVLM-500M across a 3-turn conversation:
cc: @haowhsu-quic

Simulation turns

Turn 1:
Query:"Compare these images above and list the differences."
Image1: "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
Image2: "http://images.cocodataset.org/val2017/000000039769.jpg"

Turn 2:
Query: "Answer the question: What's the main object in first image?"

Turn 3:
Image:"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
Query="Caption this image."

Answers across the 3 turns

PyTorchObserver {"prompt_tokens":151,"generated_tokens":30,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724158141,"inference_end_ms":1753724164592,"prompt_eval_end_ms":1753724163488,"first_token_ms":1753724163488,"aggregate_sampling_time_ms":49,"SCALING_FACTOR_UNITS_PER_SECOND":1000}

PyTorchObserver {"prompt_tokens":21,"generated_tokens":13,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724164595,"inference_end_ms":1753724165818,"prompt_eval_end_ms":1753724165339,"first_token_ms":1753724165339,"aggregate_sampling_time_ms":70,"SCALING_FACTOR_UNITS_PER_SECOND":1000}

PyTorchObserver {"prompt_tokens":80,"generated_tokens":10,"model_load_start_ms":1753724157466,"model_load_end_ms":1753724158128,"inference_start_ms":1753724165831,"inference_end_ms":1753724169033,"prompt_eval_end_ms":1753724168665,"first_token_ms":1753724168665,"aggregate_sampling_time_ms":86,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/outputs.txt: 1 file pulled. 0.8 MB/s (779 bytes in 0.001s)
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/inference_speed.txt: 1 file pulled. 0.0 MB/s (7 bytes in 0.001s)
INFO:root:Device Inference Results[0]:
<|im_start|>User:<fake_token_around_image><global-img><image><fake_token_around_image><fake_token_around_image><global-img><image><fake_token_around_image>Compare these images above and list the differences.<end_of_utterance>
Assistant: The first image shows a cityscape with a statue of liberty in the foreground. The second image shows two tabby cats sleeping on a pink blanket.<end_of_utterance><|im_start|>User:Answer the question: What's the main object in first image?<end_of_utterance>
Assistant: The main object in the first image is a statue of liberty.<end_of_utterance><|im_start|>User:<fake_token_around_image><global-img><image><fake_token_around_image>Caption this image.<end_of_utterance>
Assistant: A flower with a yellow center and many petals.<end_of_utterance>
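
For reference, the decode throughput implied by the logs above is consistent across turns: 30 tokens in 1,104 ms (inference_end_ms minus prompt_eval_end_ms) in turn 1, 13 tokens in 479 ms in turn 2, and 10 tokens in 368 ms in turn 3, i.e. roughly 27 tokens/s in each turn.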

Script

# Turn 1
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."

# Turn 2

PROMPT2="Answer the question: What's the main object in first image?"

# Turn 3
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT3="<image>Caption this image."

# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m SM8750 --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"

DannyYuyang-quic (Contributor, Author) commented:

@pytorchbot label "release notes: qualcomm"

pytorch-bot added the release notes: qualcomm label on Feb 9, 2026
DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from 88163a8 to f96ab48 on February 10, 2026
meta-codesync bot commented Feb 10, 2026

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D92849100.

larryliu0820 (Contributor) left a comment:

Review automatically exported from Phabricator review in Meta.

metascroy (Contributor) commented:

Can we unify QNN and CoreML runners? The model definitions are similar: #16463

larryliu0820 (Contributor) commented Feb 11, 2026

@DannyYuyang-quic and folks,

Thank you for adding this feature! I have a few questions regarding the high level architecture of the runners.

I think you have seen the generic LLM runners and multimodal runners under extension/llm/runner. I'm curious to learn what the blockers are for the Qualcomm versions of these runners to extend from the extension/llm/runner base classes.

The reason for this is that we maintain a JNI layer that connects to Android. I understand that right now this is done through the irunner interface; I'm wondering whether we can reuse more components, which would largely reduce our maintenance burden.

One concrete thing is whether we can do something like:

class QNNMultimodalRunner : public executorch::extension::llm::MultimodalRunner {

and reuse the common logic. I think the team is happy to change the generic MultimodalRunner so that it can be extended easily.
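
For illustration, a minimal sketch of that shape (QnnMultimodalRunner is a hypothetical name here, and inheritable base-class constructors are an assumption; the real extension::llm::runner::MultimodalRunner API may differ):

```cpp
#include <executorch/extension/llm/runner/multimodal_runner.h>

namespace qnn_example {

// Hypothetical subclass: the point is the inheritance relationship, not the
// exact API. Shared pieces (tokenizer handling, sampling, the generate loop,
// and the JNI-facing surface) would come from the base class; only
// QNN-specific behavior would be overridden here.
class QnnMultimodalRunner
    : public executorch::extension::llm::MultimodalRunner {
 public:
  // Reuse whatever constructors the base class provides.
  using executorch::extension::llm::MultimodalRunner::MultimodalRunner;
};

} // namespace qnn_example
```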

Tell me what you think about this!

larryliu0820 (Contributor) left a comment:
Requesting changes for the comment

DannyYuyang-quic (Contributor, Author) commented:

@larryliu0820 Thanks for the explanation! I fully understand the motivation for having QNNMultimodalRunner extend the generic MultimodalRunner for JNI integration.
There are a few reasons why we did not inherit from MultimodalRunner:

  1. MultimodalRunner is still marked as ET_EXPERIMENTAL:
    Our multimodal_runner will evolve quite a bit in the near future; our short-term plan includes supporting vision, audio, and mixed-modality LLMs. We preferred to keep our implementation fully under our control so that we can iterate quickly without being constrained by the experimental API.

  2. Including the encoder and text decoder in a single PTE is not ideal for our use case:
    The current MultimodalRunner only accepts one .pte.
    However, encoder models are much harder to quantize (especially when handling multiple images in a single conversation).
    For flexibility, we want users to be able to switch encoder precision without recompiling the entire PTE; see the sketch after this list.
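
As a minimal sketch of that two-PTE layout, assuming hypothetical file names and using the executorch::extension::Module API (this is not the PR's runner code), each component loads as its own module, so the encoder artifact can be swapped between float and quantized exports independently of the decoder:

```cpp
#include <executorch/extension/module/module.h>
#include <executorch/runtime/core/error.h>

using executorch::extension::Module;
using executorch::runtime::Error;

int main() {
  // Hypothetical file names: the encoder PTE can be a floating-point or a
  // quantized export, chosen at deploy time without recompiling the decoder.
  Module vision_encoder("smolvlm_vision_encoder_fp.pte");
  Module text_decoder("smolvlm_text_decoder_quantized.pte");

  // Each program loads (and fails) independently of the other.
  if (vision_encoder.load() != Error::Ok || text_decoder.load() != Error::Ok) {
    return 1;
  }

  // A runner built on this layout would run images through
  // vision_encoder.forward(...) and feed the resulting embeddings to
  // text_decoder.forward(...).
  return 0;
}
```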

We absolutely agree that in the long run it would be beneficial to reuse more components from the generic runner.
Once our multimodal runner functionality stabilizes, we are happy to make it JNI-friendly.

I also have a question about maintaining the JNI layer:
From my understanding, if we want to maintain a JNI layer for multimodal support now, it seems we would only need to inherit from executorch::extension::llm::MultimodalRunner. Is that correct? Or are there additional components we would need to handle?

If you'd like to go deeper into the details, we're happy to start an email thread!
Thanks!

cccclai (Contributor) commented Feb 20, 2026

@larryliu0820 what do you think?

larryliu0820 (Contributor) commented:

@DannyYuyang-quic thank you for your reply. I'm happy to accept your PR if you commit to a follow-up that extends your multimodal runner from extension::llm::runner::MultimodalRunner. Being tagged as EXPERIMENTAL actually gives us the ability to iterate fast, so if you feel the current extension::llm::runner::MultimodalRunner API design is not ideal, feel free to propose changes.

DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from f96ab48 to 4da37f3 on February 23, 2026
DannyYuyang-quic (Contributor, Author) commented:

@larryliu0820 Thanks! I've added a follow-up commit that simply extends the QNN multimodal runner from extension::llm::runner::MultimodalRunner. After the pending MLLM support is done on our side, we will further refactor the runner to align it with the generic runner.

DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch twice, most recently from d11671f to bdad8bd, on March 2, 2026
DannyYuyang-quic (Contributor, Author) commented:

Hi @cccclai,
I noticed this hasn't been imported yet. Let me know if anything is needed from my side! :)

Thanks!

digantdesai added the module: qnn label on Mar 10, 2026
digantdesai (Contributor) commented:

Perhaps rebase, repush, and I can check internal CI.

DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from bdad8bd to ef72d88 on March 10, 2026
DannyYuyang-quic (Contributor, Author) commented:

> Perhaps rebase, repush, and I can check internal CI.

Rebased, thanks!

DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from ef72d88 to 0c03368 on March 12, 2026
Summary:
 - Multi-turn conversation: add conversation support for the VLM scenario
   - add a runtime chat template
 - Multi-image inputs: accept multiple images per turn and across turns
 - Extend QNNMultimodalRunner from the generic MultimodalRunner
DannyYuyang-quic force-pushed the dev1/danny/support_vlm_multi-turn_conversation_and_multi_image branch from 0c03368 to 2233a6f on March 15, 2026
