11 changes: 8 additions & 3 deletions examples/models/parakeet/CMakeLists.txt
@@ -42,9 +42,14 @@ endif()

# CPU-only builds need quantized and custom ops
if(NOT EXECUTORCH_BUILD_CUDA)
-  list(APPEND link_libraries quantized_ops_lib custom_ops)
-  executorch_target_link_options_shared_lib(quantized_ops_lib)
-  executorch_target_link_options_shared_lib(custom_ops)
+  if(TARGET quantized_ops_lib)
+    list(APPEND link_libraries quantized_ops_lib)
+    executorch_target_link_options_shared_lib(quantized_ops_lib)
+  endif()
+  if(TARGET custom_ops)
+    list(APPEND link_libraries custom_ops)
+    executorch_target_link_options_shared_lib(custom_ops)
+  endif()
endif()

# XNNPACK
33 changes: 31 additions & 2 deletions examples/models/parakeet/README.md
@@ -25,7 +25,7 @@ python export_parakeet_tdt.py --audio /path/to/audio.wav
| Argument | Description |
|----------|-------------|
| `--output-dir` | Output directory for exports (default: `./parakeet_tdt_exports`) |
-| `--backend` | Backend for acceleration: `portable`, `xnnpack`, `metal`, `cuda`, `cuda-windows` (default: `xnnpack`) |
+| `--backend` | Backend for acceleration: `portable`, `xnnpack`, `vulkan`, `metal`, `cuda`, `cuda-windows` (default: `xnnpack`) |
| `--dtype` | Data type: `fp32`, `bf16`, `fp16` (default: `fp32`). Metal backend supports `fp32` and `bf16` only (no `fp16`). |
| `--audio` | Path to audio file for transcription test |

@@ -54,7 +54,7 @@ The export script supports quantizing encoder and decoder linear layers using [t
|--------|-------------|----------|
| `4w` | 4-bit weight only quantization | CUDA |
| `8w` | 8-bit weight only quantization | CUDA |
-| `8da4w` | 8-bit dynamic activation, 4-bit weight | CUDA |
+| `8da4w` | 8-bit dynamic activation, 4-bit weight | Vulkan, CUDA |
| `8da8w` | 8-bit dynamic activation, 8-bit weight | CUDA |
| `fpa4w` | Floating point activation, 4-bit weight | Metal |

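To make the table concrete, here is a rough numpy sketch of what an `8da4w`-style scheme with group size 32 does arithmetically: weights get symmetric 4-bit quantization with one scale per group of 32 input channels, and activations get symmetric 8-bit quantization with scales computed dynamically per row. This is an illustration of the numerics only — function names are invented here, and the actual torchao implementation differs.

```python
import numpy as np

def quantize_weights_4bit_grouped(w, group_size=32):
    """Symmetric 4-bit weight quantization with per-group scales.

    Illustrative sketch of an '8da4w'-style weight path, not the
    torchao implementation.
    """
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    # One scale per group; the symmetric 4-bit range is [-8, 7].
    scales = np.abs(w_groups).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(w_groups / scales), -8, 7).astype(np.int8)
    return q, scales

def quantize_activations_8bit_dynamic(x):
    """Per-row symmetric 8-bit quantization with scales computed at runtime."""
    scales = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(x / scales), -128, 127).astype(np.int8)
    return q, scales

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)
x = rng.standard_normal((4, 128)).astype(np.float32)

qw, w_scales = quantize_weights_4bit_grouped(w, group_size=32)
qx, x_scales = quantize_activations_8bit_dynamic(x)

# Dequantize both sides and compare against the fp32 matmul.
w_hat = (qw.astype(np.float32) * w_scales).reshape(w.shape)
x_hat = qx.astype(np.float32) * x_scales
err = np.abs(x_hat @ w_hat.T - x @ w.T).max()
print(f"max abs error vs fp32 matmul: {err:.3f}")
```

Smaller group sizes give each scale less dynamic range to cover, which generally improves accuracy at the cost of storing more scales.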
@@ -70,6 +70,26 @@ python export_parakeet_tdt.py \
--output-dir ./parakeet_quantized_xnnpack
```

#### Example: Dynamic Quantization for Vulkan

```bash
python export_parakeet_tdt.py \
--backend vulkan \
--qlinear_encoder 8da4w \
--qlinear_encoder_group_size 32 \
--qlinear 8da4w \
--qlinear_group_size 32 \
--vulkan_force_fp16 \
--output-dir ./parakeet_quantized_vulkan
```

> **Contributor** (comment on `--vulkan_force_fp16`): We can't use the `--dtype` flag?
>
> **Contributor Author:**
>
> - `--dtype fp16`: inputs and outputs are also cast to fp16. From the caller's perspective, input/output is fp16.
> - `--vulkan_force_fp16`: inputs and outputs are still fp32. The Vulkan backend automatically converts inputs to fp16 within the delegate and outputs back to fp32. From the caller's perspective, input/output is fp32.
>
> `--vulkan_force_fp16` is a bit simpler for client code, since callers don't have to handle the conversion to/from fp32, so I defaulted to that.
>
> Another consideration: with `--dtype fp16`, the `export_parakeet_tdt.py` script hits a guard:
>
>     export_parakeet_tdt.py: error: fp16 is not yet supported
>
> I wasn't sure whether this was because the runner binary doesn't handle fp16 input/output yet, so I opted for `--vulkan_force_fp16` instead. Would you prefer enabling the `--dtype` flag for fp16 inference?
>
> **Contributor Author:** Also updated the text to clarify the special properties of the `--vulkan_force_fp16` flag that wouldn't be covered by `--dtype`.

An additional `--vulkan_force_fp16` flag instructs the Vulkan backend to
downcast FP32 tensors to FP16 internally, forcing half-precision computation.
Input and output tensors remain FP32; the delegate automatically converts them
to and from FP16 on entry and exit. This significantly improves latency but may
slightly reduce transcription accuracy.
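The caller-visible contract can be sketched in numpy (illustration only — the real conversions happen inside the Vulkan delegate, and these function names are invented for the example):

```python
import numpy as np

def delegate_with_force_fp16(x_fp32):
    """Simulates --vulkan_force_fp16: caller I/O stays fp32, and the
    delegate downcasts on entry and upcasts on exit."""
    assert x_fp32.dtype == np.float32
    internal = x_fp32.astype(np.float16)   # downcast on entry
    y = internal * np.float16(2.0)         # half-precision compute (stand-in op)
    return y.astype(np.float32)            # upcast on exit

def delegate_with_dtype_fp16(x_fp16):
    """Simulates --dtype fp16: the caller must already provide fp16."""
    assert x_fp16.dtype == np.float16
    return x_fp16 * np.float16(2.0)

x = np.array([0.1, 0.2, 0.3], dtype=np.float32)
y1 = delegate_with_force_fp16(x)                      # fp32 in, fp32 out
y2 = delegate_with_dtype_fp16(x.astype(np.float16))   # fp16 in, fp16 out
err = np.abs(y1 - 2.0 * x).max()  # small fp16 rounding error survives the upcast
print(y1.dtype, y2.dtype, err)
```

Either way the arithmetic runs in half precision; the flags differ only in who performs the dtype conversion.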

#### Example: 4-bit Weight Quantization with Tile Packing for CUDA

```bash
@@ -186,6 +206,9 @@ make parakeet-cpu
# Metal build (macOS)
make parakeet-metal

# Vulkan build (Linux / Android)
make parakeet-vulkan

# CUDA build (Linux)
make parakeet-cuda
```
@@ -216,6 +239,12 @@ DYLD_LIBRARY_PATH=/usr/lib ./cmake-out/examples/models/parakeet/parakeet_runner
--audio_path /path/to/audio.wav \
--tokenizer_path examples/models/parakeet/parakeet_metal/tokenizer.model

# Vulkan
./cmake-out/examples/models/parakeet/parakeet_runner \
--model_path examples/models/parakeet/parakeet_vulkan/model.pte \
--audio_path /path/to/audio.wav \
--tokenizer_path examples/models/parakeet/parakeet_vulkan/tokenizer.model

# CUDA (include .ptd data file)
./cmake-out/examples/models/parakeet/parakeet_runner \
--model_path examples/models/parakeet/parakeet_cuda/model.pte \