Conversation
@larryliu0820 larryliu0820 commented Jan 27, 2026

With huggingface/optimum-executorch#207 we are adding a new method, "sampler", to ASR models, alongside "encoder" and "text_decoder". The flow becomes: if temperature is 0 and the "sampler" method is available, run that method; otherwise fall back to the old path. This should significantly improve performance on CUDA, since we no longer have to copy the logits from device to CPU for sampling.
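A minimal sketch of the dispatch described above. All names here (`FakeModel`, `method_names`, `run_method`, `host_sample`, `select_next_token`) are hypothetical stand-ins for illustration, not the actual runner API; the real "sampler", "encoder", and "text_decoder" methods live in the exported ExecuTorch program.

```python
# Illustrative sketch, assuming a model object that exposes named
# methods like an exported ExecuTorch program. Not the real runner.

class FakeModel:
    """Stands in for an exported program with named methods."""

    def __init__(self, methods):
        self._methods = list(methods)

    def method_names(self):
        return self._methods

    def run_method(self, name, args):
        # "sampler": on-device greedy pick, so only one token id
        # (not the whole logits tensor) crosses device -> host.
        assert name == "sampler"
        (logits,) = args
        return max(range(len(logits)), key=logits.__getitem__)


def host_sample(logits, temperature):
    # Old path: logits were copied to CPU first; sample host-side.
    # Greedy here for simplicity; temperature > 0 would sample.
    return max(range(len(logits)), key=logits.__getitem__)


def select_next_token(model, logits, temperature):
    # New flow: prefer the on-device "sampler" method when doing
    # greedy decoding (temperature == 0) and the method exists.
    if temperature == 0.0 and "sampler" in model.method_names():
        return model.run_method("sampler", [logits])
    return host_sample(logits, temperature)
```

With temperature 0 and "sampler" present, only the selected token id leaves the device; the fallback path still works for sampling with temperature > 0 or for models exported without the new method.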

Benchmark result on RTX 5080:


======================================================================
BENCHMARK SUMMARY
======================================================================
Total runs: 30
Generated tokens per run: 104

THROUGHPUT (tokens/sec):
  Min:    793.89 t/s
  Max:    845.53 t/s
  Mean:   820.35 t/s
  Stdev:  11.86 t/s

MODEL LOAD TIME (ms):
  Min:    620 ms
  Max:    2170 ms
  Mean:   700 ms
  Stdev:  279 ms

ENCODE TIME (ms, inference_start to prompt_eval_end):
  Min:    36 ms
  Max:    38 ms
  Mean:   37 ms
  Stdev:  1 ms

DECODE TIME (ms, prompt_eval_end to inference_end):
  Min:    123 ms
  Max:    131 ms
  Mean:   127 ms
  Stdev:  2 ms

======================================================================

pytorch-bot bot commented Jan 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16888

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 21 Pending

As of commit ddb0dce with merge base e4060ee (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 27, 2026
@larryliu0820 larryliu0820 requested review from Gasoonjia and mergennachin and removed request for mergennachin January 27, 2026 00:33
@larryliu0820 larryliu0820 added the release notes: desktop for desktop/laptop workstream label Jan 27, 2026
@Gasoonjia Gasoonjia left a comment


Perf is crazy!! Thanks for the great work!

And IIUC we still transfer generated tokens from GPU to CPU one by one, right? If so, perf should get another boost once we use batched transfer!
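A back-of-envelope illustration of the batched-transfer idea. The helper below is hypothetical (not part of the runner); it just counts how many device-to-host copies a generation run would need.

```python
# Hypothetical helper: count device->host copies needed to move
# num_tokens token ids when they are transferred in batches.
import math


def count_copies(num_tokens: int, batch_size: int) -> int:
    """Copies needed to move num_tokens ids, batch_size at a time."""
    return math.ceil(num_tokens / batch_size)


# For the 104 tokens per run in the benchmark above:
#   one-by-one:    count_copies(104, 1)  -> 104 transfers
#   batches of 16: count_copies(104, 16) -> 7 transfers
```

Each transfer carries a fixed launch/synchronization overhead, so shrinking 104 copies to a handful should recover most of the remaining per-token cost.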

@larryliu0820 larryliu0820 mentioned this pull request Jan 27, 2026
@larryliu0820 larryliu0820 force-pushed the sampler_method branch 2 times, most recently from a476fa8 to 800721a Compare January 27, 2026 21:21
@larryliu0820 larryliu0820 merged commit 079799c into main Jan 27, 2026
141 checks passed
@larryliu0820 larryliu0820 deleted the sampler_method branch January 27, 2026 22:24