Qualcomm AI Engine Direct - [Multimodal] granite-speech-3.3-2b by DannyYuyang-quic · Pull Request #18740 · pytorch/executorch

DannyYuyang-quic · 2026-04-07T16:28:43Z

Summary

Support granite-speech-3.3-2b
Extend Audio modality in QNNMultimodal AOT flow
Extend Audio modality in QNNMultimodal runner
Support encoder model sharding

Test plan

CI

python -m backends.qualcomm.tests.test_qnn_delegate TestExampleMultimodalityScript.test_static_asr --model_name granite_speech_3_3-2b build-android --executorch_root . -a . -m SM8750 -s ${SERIAL_NUM}

Script

python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m SM8750 --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"

Audio file: https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true
Prompt: "can you transcribe the speech into a written format?"
Result

I 00:00:16.333997 executorch:multimodal_runner.cpp:542] RSS after finishing text generation: 614.941406 MiB (0 if unsupported)
I 00:00:16.334231 executorch:stats.h:161] 	Prompt Tokens: 212    Generated Tokens: 201
I 00:00:16.334356 executorch:stats.h:167] 	Model Load Time:		1.460000 (seconds)
I 00:00:16.334419 executorch:stats.h:177] 	Total inference time:		14.871000 (seconds)		 Rate: 	13.516240 (tokens/second)
I 00:00:16.334480 executorch:stats.h:185] 		Prompt evaluation:	0.798000 (seconds)		 Rate: 	265.664160 (tokens/second)
I 00:00:16.334541 executorch:stats.h:196] 		Generated 201 tokens:	14.073000 (seconds)		 Rate: 	14.282669 (tokens/second)
I 00:00:16.334629 executorch:stats.h:204] 	Time to first generated token:	0.798000 (seconds)
I 00:00:16.334688 executorch:stats.h:211] 	Sampling time over 413 tokens:	0.479000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device

PyTorchObserver {"prefill_token_per_sec":265.664,"decode_token_per_sec":14.2827,"prompt_tokens":212,"generated_tokens":201,"model_load_start_ms":1744743525724,"model_load_end_ms":1744743527184,"inference_start_ms":1744743527186,"inference_end_ms":1744743542057,"prompt_eval_end_ms":1744743527984,"first_token_ms":1744743527984,"aggregate_sampling_time_ms":479,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/outputs.txt: 1 file pulled. 0.9 MB/s (1170 bytes in 0.001s)
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/inference_speed.txt: 1 file pulled. 0.0 MB/s (7 bytes in 0.002s)
[INFO 2026-04-08 00:22:11,849 llama.py:243] Device Inference Results[0]:
<|start_of_role|>system<|end_of_role|>You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>can you transcribe the speech into a written format?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>It appears you've provided a fragment of a sentence, possibly from a poem or text, and you're asking for a transcription or translation into written format. However, without the complete context or original text, it's challenging to accurately transcribe or translate it.

If we were to proceed with a hypothetical example, here's a possible continuation of the sentence in a written format:

"After his nap, Timothy leisurely stretched his foot, first one then the other, carefully selecting the choicest bits. Turning over the food, he methodically picked out the desired portions, meticulously choosing what was to be included in his meal."

This continuation assumes a narrative style, where Timothy is taking care of food preparation. The original sentence seems to be a playful or poetic exploration of a character's actions, possibly related to food preparation or a cooking process.<|end_of_text|>
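The rates in the `stats.h` lines above can be cross-checked against the timestamps in the `PyTorchObserver` record. The following is a minimal sketch, not part of the PR: the helper name `derive_stats` is hypothetical, and the meaning of each field is inferred from the fact that the arithmetic reproduces the logged numbers, not from the runner source.

```python
import json

# Copied verbatim from the PyTorchObserver line in the log above.
OBSERVER_LINE = (
    '{"prefill_token_per_sec":265.664,"decode_token_per_sec":14.2827,'
    '"prompt_tokens":212,"generated_tokens":201,'
    '"model_load_start_ms":1744743525724,"model_load_end_ms":1744743527184,'
    '"inference_start_ms":1744743527186,"inference_end_ms":1744743542057,'
    '"prompt_eval_end_ms":1744743527984,"first_token_ms":1744743527984,'
    '"aggregate_sampling_time_ms":479,"SCALING_FACTOR_UNITS_PER_SECOND":1000}'
)


def derive_stats(observer_json: str) -> dict:
    """Recompute durations (seconds) and rates (tokens/second) from timestamps.

    Hypothetical helper: field semantics are assumed from the log, e.g. that
    prefill runs from inference_start_ms to prompt_eval_end_ms.
    """
    rec = json.loads(observer_json)
    scale = rec["SCALING_FACTOR_UNITS_PER_SECOND"]  # 1000, so timestamps are in ms

    load_s = (rec["model_load_end_ms"] - rec["model_load_start_ms"]) / scale
    prefill_s = (rec["prompt_eval_end_ms"] - rec["inference_start_ms"]) / scale
    decode_s = (rec["inference_end_ms"] - rec["prompt_eval_end_ms"]) / scale
    total_s = (rec["inference_end_ms"] - rec["inference_start_ms"]) / scale

    return {
        "model_load_s": load_s,
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "total_s": total_s,
        # Prefill rate counts prompt tokens; the other two count generated tokens.
        "prefill_tok_per_s": rec["prompt_tokens"] / prefill_s,
        "decode_tok_per_s": rec["generated_tokens"] / decode_s,
        "overall_tok_per_s": rec["generated_tokens"] / total_s,
    }


if __name__ == "__main__":
    for name, value in derive_stats(OBSERVER_LINE).items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the recomputed values agree with the log: 1.460 s model load, 0.798 s prefill at about 265.664 tokens/second, 14.073 s decode at about 14.283 tokens/second, and 14.871 s total at about 13.516 tokens/second. Note that the overall rate divides only the 201 generated tokens by the total time, which is why it is lower than the decode rate.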

cc: @abhinaykukkadapu, @cccclai, @haowhsu-quic


pytorch-bot · 2026-04-07T16:28:47Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18740

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job, 1 Unrelated Failure

As of commit b01dff2 with merge base fcccda3:

NEW FAILURE - The following job has failed:

Cadence Build & Test / cpu-test / test-ops / test-ops (gh)
examples/cadence/operators/test_g3_ops.py::ATenOpTestCases::test_g3_neg_out_5

CANCELLED JOB - The following job was cancelled. Please retry:

pull / unittest-editable / windows / windows-job (gh)

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

DannyYuyang-quic · 2026-04-07T16:29:09Z

@pytorchbot label "release notes: qualcomm"

Qualcomm AI Engine Direct - [Multimodal] granite-3.3-2b-instruct

b01dff2

Summary: - Support granite-speech-3.3-2b - Extend Audio modality in QNNMultimodal AOT flow - Extend Audio modality in QNNMultimodal runner - Support encoder model sharding

DannyYuyang-quic requested review from abhinaykukkadapu, cccclai, larryliu0820, lucylq and mergennachin as code owners April 7, 2026 16:28

meta-cla bot added the CLA Signed label Apr 7, 2026

pytorch-bot bot added the release notes: qualcomm label Apr 7, 2026

DannyYuyang-quic changed the title ~~Qualcomm AI Engine Direct - [Multimodal] granite-3.3-2b-instruct~~ Qualcomm AI Engine Direct - [Multimodal] granite-speech-3.3-2b Apr 9, 2026

DannyYuyang-quic wants to merge 1 commit into pytorch:main from CodeLinaro:dev1/danny/support_audio-language_models
