Qualcomm AI Engine Direct - [Multimodal] Multi-turn VLM conversation #17308
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17308
Note: Links to docs will display an error until the docs builds have been completed.
Hi @cccclai, note that encoder quantization is only enabled when each turn contains a single image. Below are some HTP runtime results for SmolVLM-500M in a 3-turn conversation.

Simulation turns: Turn 1, Turn 2, Turn 3 (answers for all 3 turns shown in the collapsed sections).

Script:

```shell
# Turn 1
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."
# Turn 2
PROMPT2="Answer the question: What's the main object in first image?"
# Turn 3
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT3="<image>Caption this image."
# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m SM8750 --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"
```
@pytorchbot label "release notes: qualcomm"
Force-pushed 88163a8 to f96ab48
larryliu0820 left a comment (review automatically exported from Phabricator review in Meta):
Can we unify QNN and CoreML runners? The model definitions are similar: #16463
@DannyYuyang-quic and folks, thank you for adding this feature! I have a few questions regarding the high-level architecture of the runners. I think you have seen the generic LLM runners and multimodal runners under extension/llm/runner. I'm curious what the blockers are for the Qualcomm versions of these runners to extend the extension/llm/runner base classes. The reason is that we maintain a JNI layer that connects to Android. I understand this is currently done through the irunner interface, and I'm wondering whether we can reuse more components, which would largely reduce our maintenance burden. One concrete thing is whether we can do something like [snippet omitted] and reuse the common logic. I think the team is happy to change the generic MultimodalRunner to make it easier to extend. Tell me what you think about this!
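The extension pattern being proposed could look roughly like the sketch below. The real `extension::llm::runner::MultimodalRunner` has a richer interface (loading, image prefill, generation callbacks), so every class and method name here is a simplified stand-in, not the actual ExecuTorch API:

```cpp
#include <string>

// Minimal stand-in for extension::llm::runner::MultimodalRunner; the real
// base class has a richer interface, so these names are illustrative only.
class MultimodalRunner {
 public:
  virtual ~MultimodalRunner() = default;
  // Shared logic (tokenization loop, sampling, stats, callbacks) would live
  // in the generic base.
  virtual std::string generate(const std::string& prompt) {
    return "generic:" + prompt;
  }
};

// A backend-specific runner overrides only the backend-specific parts
// (e.g. HTP graph execution, KV-cache I/O) and inherits the rest.
class QnnMultimodalRunner : public MultimodalRunner {
 public:
  std::string generate(const std::string& prompt) override {
    return "qnn:" + prompt;
  }
};

// The JNI layer can then hold a MultimodalRunner reference and stay
// backend-agnostic: this call dispatches to the QNN override at runtime.
inline std::string run_via_base(MultimodalRunner& runner,
                                const std::string& prompt) {
  return runner.generate(prompt);
}
```

This is the payoff described in the comment: the Android JNI bindings only need to know the base type, and any backend that subclasses it is reachable without backend-specific glue.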
larryliu0820 left a comment (requesting changes):
@larryliu0820 Thanks for the explanation!! I fully understand the motivation to have the
We absolutely agree that in the long run it would be beneficial to reuse more components from the generic runner. I also have a question about maintaining the JNI layer. And if you'd like to go deeper into the details, we're happy to start an email thread!
@larryliu0820 what do you think?
@DannyYuyang-quic thank you for your reply. I'm happy to accept your PR if you commit to a follow-up to extend your multimodal runner from extension::llm::runner::MultimodalRunner. Being tagged as
Force-pushed f96ab48 to 4da37f3
@larryliu0820 Thanks! I’ve added a follow-up commit to simply extend the QNN multimodal runner from
Force-pushed d11671f to bdad8bd
Hi @cccclai, Thanks!
Perhaps rebase, repush, and I can check internal CI. |
Force-pushed bdad8bd to ef72d88
Rebased, thanks! |
Force-pushed ef72d88 to 0c03368
Summary:
- Multi-turn conversation: add conversation support for the VLM scenario
- Add a runtime chat template
- Multi-image inputs: accept multiple images per turn and across turns
- Extend QNNMultimodalRunner from the generic MultimodalRunner
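The "runtime chat template" item could be sketched as below. The role markers are placeholders for illustration only; the actual template for SmolVLM comes from the model/tokenizer configuration and differs from these strings:

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Illustrative multi-turn prompt assembly. The "<|role|>" and "<|end|>"
// markers are placeholders, not SmolVLM's real chat template, which the
// runner would read from the model's tokenizer configuration.
std::string build_prompt(
    const std::vector<std::pair<std::string, std::string>>& turns) {
  std::ostringstream out;
  for (const auto& turn : turns) {
    // turn.first is the role ("user"/"assistant"), turn.second the content,
    // where image slots would appear as "<image>" tokens inside the content.
    out << "<|" << turn.first << "|>" << turn.second << "<|end|>";
  }
  // Cue the model to answer the most recent user turn.
  out << "<|assistant|>";
  return out.str();
}
```

Carrying the full turn history through the template like this is what lets later turns (e.g. "What's the main object in first image?") refer back to images and answers from earlier turns.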
Force-pushed 0c03368 to 2233a6f
Summary:
Test plan
cc @cccclai @cbilgin