diff --git a/examples/models/parakeet/CMakeLists.txt b/examples/models/parakeet/CMakeLists.txt
index 9354afe5f86..837639174dc 100644
--- a/examples/models/parakeet/CMakeLists.txt
+++ b/examples/models/parakeet/CMakeLists.txt
@@ -42,9 +42,14 @@ endif()
 
 # CPU-only builds need quantized and custom ops
 if(NOT EXECUTORCH_BUILD_CUDA)
-  list(APPEND link_libraries quantized_ops_lib custom_ops)
-  executorch_target_link_options_shared_lib(quantized_ops_lib)
-  executorch_target_link_options_shared_lib(custom_ops)
+  if(TARGET quantized_ops_lib)
+    list(APPEND link_libraries quantized_ops_lib)
+    executorch_target_link_options_shared_lib(quantized_ops_lib)
+  endif()
+  if(TARGET custom_ops)
+    list(APPEND link_libraries custom_ops)
+    executorch_target_link_options_shared_lib(custom_ops)
+  endif()
 endif()
 
 # XNNPACK
diff --git a/examples/models/parakeet/README.md b/examples/models/parakeet/README.md
index d893b8e18b0..ff18c827bb2 100644
--- a/examples/models/parakeet/README.md
+++ b/examples/models/parakeet/README.md
@@ -25,7 +25,7 @@ python export_parakeet_tdt.py --audio /path/to/audio.wav
 | Argument | Description |
 |----------|-------------|
 | `--output-dir` | Output directory for exports (default: `./parakeet_tdt_exports`) |
-| `--backend` | Backend for acceleration: `portable`, `xnnpack`, `metal`, `cuda`, `cuda-windows` (default: `xnnpack`) |
+| `--backend` | Backend for acceleration: `portable`, `xnnpack`, `vulkan`, `metal`, `cuda`, `cuda-windows` (default: `xnnpack`) |
 | `--dtype` | Data type: `fp32`, `bf16`, `fp16` (default: `fp32`). Metal backend supports `fp32` and `bf16` only (no `fp16`). |
 | `--audio` | Path to audio file for transcription test |
 
@@ -54,7 +54,7 @@ The export script supports quantizing encoder and decoder linear layers using [t
 |--------|-------------|----------|
 | `4w` | 4-bit weight only quantization | CUDA |
 | `8w` | 8-bit weight only quantization | CUDA |
-| `8da4w` | 8-bit dynamic activation, 4-bit weight | CUDA |
+| `8da4w` | 8-bit dynamic activation, 4-bit weight | Vulkan, CUDA |
 | `8da8w` | 8-bit dynamic activation, 8-bit weight | CUDA |
 | `fpa4w` | Floating point activation, 4-bit weight | Metal |
 
@@ -70,6 +70,26 @@ python export_parakeet_tdt.py \
     --output-dir ./parakeet_quantized_xnnpack
 ```
 
+#### Example: Dynamic Quantization for Vulkan
+
+```bash
+python export_parakeet_tdt.py \
+    --backend vulkan \
+    --qlinear_encoder 8da4w \
+    --qlinear_encoder_group_size 32 \
+    --qlinear 8da4w \
+    --qlinear_group_size 32 \
+    --vulkan_force_fp16 \
+    --output-dir ./parakeet_quantized_vulkan
+```
+
+An additional `--vulkan_force_fp16` flag is available to have the Vulkan backend
+internally downcast FP32 tensors to FP16 within the Vulkan backend, forcing
+half-precision computation. Note that input/output tensors are still FP32, and
+the delegate will automatically convert them to/from FP16 upon entering and
+exiting the delegate. This will significantly improve latency but may slightly
+reduce transcription accuracy.
+
 #### Example: 4-bit Weight Quantization with Tile Packing for CUDA
 
 ```bash
@@ -186,6 +206,9 @@ make parakeet-cpu
 # Metal build (macOS)
 make parakeet-metal
 
+# Vulkan build (Linux / Android)
+make parakeet-vulkan
+
 # CUDA build (Linux)
 make parakeet-cuda
 ```
@@ -216,6 +239,12 @@ DYLD_LIBRARY_PATH=/usr/lib ./cmake-out/examples/models/parakeet/parakeet_runner
   --audio_path /path/to/audio.wav \
   --tokenizer_path examples/models/parakeet/parakeet_metal/tokenizer.model
 
+# Vulkan
+./cmake-out/examples/models/parakeet/parakeet_runner \
+  --model_path examples/models/parakeet/parakeet_vulkan/model.pte \
+  --audio_path /path/to/audio.wav \
+  --tokenizer_path examples/models/parakeet/parakeet_vulkan/tokenizer.model
+
 # CUDA (include .ptd data file)
 ./cmake-out/examples/models/parakeet/parakeet_runner \
   --model_path examples/models/parakeet/parakeet_cuda/model.pte \