Conversation
@larryliu0820 larryliu0820 commented Jan 27, 2026

With huggingface/optimum-executorch#207 we are adding a new method, "sampler", to ASR models, alongside "encoder" and "text_decoder". The flow becomes: if temperature is 0 and the "sampler" method is available, run that method; otherwise fall back to the old path. This should significantly improve performance on CUDA, since we no longer have to copy the logits from device to CPU for sampling.
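A minimal sketch of the dispatch described above. All names here (`FakeModel`, `method_names`, `run_method`, `host_sample`, `select_next_token`) are hypothetical stand-ins for illustration, not the actual runner API; the real "sampler", "encoder", and "text_decoder" methods live in the exported ExecuTorch program.

```python
# Illustrative sketch, assuming a model object that exposes named
# methods like an exported ExecuTorch program. Not the real runner.

class FakeModel:
    """Stands in for an exported program with named methods."""

    def __init__(self, methods):
        self._methods = list(methods)

    def method_names(self):
        return self._methods

    def run_method(self, name, args):
        # "sampler": on-device greedy pick, so only one token id
        # (not the whole logits tensor) crosses device -> host.
        assert name == "sampler"
        (logits,) = args
        return max(range(len(logits)), key=logits.__getitem__)


def host_sample(logits, temperature):
    # Old path: logits were copied to CPU first; sample host-side.
    # Greedy here for simplicity; temperature > 0 would sample.
    return max(range(len(logits)), key=logits.__getitem__)


def select_next_token(model, logits, temperature):
    # New flow: prefer the on-device "sampler" method when doing
    # greedy decoding (temperature == 0) and the method exists.
    if temperature == 0.0 and "sampler" in model.method_names():
        return model.run_method("sampler", [logits])
    return host_sample(logits, temperature)
```

With temperature 0 and "sampler" present, only the selected token id leaves the device; the fallback path still works for sampling with temperature > 0 or for models exported without the new method.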

Benchmark result on RTX 5080:


======================================================================
BENCHMARK SUMMARY
======================================================================
Total runs: 30
Generated tokens per run: 104

THROUGHPUT (tokens/sec):
  Min:    793.89 t/s
  Max:    845.53 t/s
  Mean:   820.35 t/s
  Stdev:  11.86 t/s

MODEL LOAD TIME (ms):
  Min:    620 ms
  Max:    2170 ms
  Mean:   700 ms
  Stdev:  279 ms

ENCODE TIME (ms, inference_start to prompt_eval_end):
  Min:    36 ms
  Max:    38 ms
  Mean:   37 ms
  Stdev:  1 ms

DECODE TIME (ms, prompt_eval_end to inference_end):
  Min:    123 ms
  Max:    131 ms
  Mean:   127 ms
  Stdev:  2 ms

======================================================================

pytorch-bot bot commented Jan 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16888

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 21 Pending

As of commit ddb0dce with merge base e4060ee (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 27, 2026
@larryliu0820 larryliu0820 requested review from Gasoonjia and mergennachin and removed request for mergennachin January 27, 2026 00:33
@larryliu0820 larryliu0820 added the release notes: desktop for desktop/laptop workstream label Jan 27, 2026
@Gasoonjia Gasoonjia left a comment


Perf is crazy!! Thanks for the great work!

And IIUC we still transfer generated tokens from GPU to CPU one by one, right? If so, perf should get another boost once we use batched transfer!
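A back-of-envelope illustration of the batched-transfer idea. The helper below is hypothetical (not part of the runner); it just counts how many device-to-host copies a generation run would need.

```python
# Hypothetical helper: count device->host copies needed to move
# num_tokens token ids when they are transferred in batches.
import math


def count_copies(num_tokens: int, batch_size: int) -> int:
    """Copies needed to move num_tokens ids, batch_size at a time."""
    return math.ceil(num_tokens / batch_size)


# For the 104 tokens per run in the benchmark above:
#   one-by-one:    count_copies(104, 1)  -> 104 transfers
#   batches of 16: count_copies(104, 16) -> 7 transfers
```

Each transfer carries a fixed launch/synchronization overhead, so shrinking 104 copies to a handful should recover most of the remaining per-token cost.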

@larryliu0820 larryliu0820 mentioned this pull request Jan 27, 2026
@larryliu0820 larryliu0820 force-pushed the sampler_method branch 2 times, most recently from a476fa8 to 800721a Compare January 27, 2026 21:21
@larryliu0820 larryliu0820 merged commit 079799c into main Jan 27, 2026
141 checks passed
@larryliu0820 larryliu0820 deleted the sampler_method branch January 27, 2026 22:24