feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag #872
Add FP16 precision conversion support across all model pipeline commands:
- Create optim/fp16.py with convert_to_fp16() utility (wraps ORT float16)
- optimize: --precision fp16 with --fp16-keep-io-types and --fp16-op-block-list
- build: --precision fp16 stage between optimize and quantize
- export: --precision fp16 as post-export conversion
- Add shared precision_option() CLI decorator in utils/cli.py

Design: FP16 is a precision transformation (not a graph optimization), so it lives as a command-layer utility rather than an optimizer pipe. All three commands share the same convert_to_fp16() function.

Fixes #867

- Add algorithm, fp16, fp16_only, fp16_keep_io_types, fp16_op_block_list, and RTN fields to WinMLQuantizationConfig
- quantize_onnx now supports pure-FP16 fast path (fp16_only=True skips QDQ) and FP16 post-processing after QDQ (fp16=True, fp16_only=False)
- resolve_quant_compile_config returns fp16_only quant config for precision=fp16
- Remove _run_fp16_stage and skip-quantize hack from build.py pipelines
- Build pipeline unified: Export -> Optimize -> Quantize Stage -> Compile, where Quantize Stage handles both QDQ and FP16 conversion
- Update tests to reflect new behavior (fp16 produces quant config, not None)

- Remove fp16_postprocess from WinMLQuantizationConfig
- Add expand_precision() to decompose w4a16 into [int4, fp16] passes
- Refactor _run_quantize_stage into multi-pass loop with helper functions
- Each quantize_onnx call now does exactly one operation (single responsibility)
- Update standalone quantize command for two-pass w4a16 flow
- Add precision field to WinMLBuildConfig for pass expansion
- Add expand_precision tests

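The pass expansion named in this commit can be pictured with a minimal sketch. It is not the PR's code: the function name `expand_precision_sketch`, the lookup table, and the fallback for precisions other than w4a16 are assumptions made for illustration. Only the w4a16 → [int4, fp16] decomposition comes from the commit message.

```python
from __future__ import annotations

# Hypothetical table of multi-pass precisions. Only w4a16 is named in the
# commit message; any further entries would be guesses, so none are added.
_MULTI_PASS: dict[str, tuple[str, ...]] = {"w4a16": ("int4", "fp16")}


def expand_precision_sketch(precision: str) -> list[str]:
    """Return the ordered single-operation passes for a precision string."""
    # Assumption: a precision that is not in the table is already one pass.
    return list(_MULTI_PASS.get(precision, (precision,)))


# Each returned entry would drive exactly one quantize call.
for pass_precision in expand_precision_sketch("w4a16"):
    print(pass_precision)
```

Returning a fresh list on each call keeps callers from mutating the shared table by accident.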
- Add 'precision' parameter to quantize_onnx() that handles multi-pass expansion internally (e.g., w4a16 → [int4, fp16])
- Simplify _run_quantize_stage in build.py to a single quantize_onnx() call, with no more _make_step_config or _run_single_quantize_pass helpers
- Simplify commands/quantize.py RTN path: remove manual expand_precision loop and intermediate file management
- Delete unused _should_run_quantization() dead code from quantizer.py
- All multi-pass orchestration (intermediate files, cleanup, pass config construction) now lives in the quant layer where it belongs

Move calibration warning logic from commands/quantize.py into utils/cli.py as warn_ignored_calibration_options() so any command that needs the check can reuse it without duplicating the logic.
FP16 conversion is exclusively used by the quantizer's algorithm='fp16' path. It's not an optimizer pipe — move it to quant/fp16.py where it logically belongs. Remove optim/fp16.py entirely.
Address reviewer comment: mode and algorithm are redundant. algorithm is the active routing field; mode is kept only for serialization backward-compatibility and marked deprecated.
Remove redundant 'algorithm' field. Expand 'mode' to cover all quantization modes: static, dynamic, rtn, fp16. The old 'qdq' value is mapped to 'static' for backward compatibility. from_dict() prefers the old 'algorithm' key over 'mode' when both are present (old to_dict emitted both), preventing silent data loss when deserializing configs with algorithm='rtn' or 'fp16'.
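The deserialization rule described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `QuantConfigSketch` is a hypothetical stand-in for WinMLQuantizationConfig that models only the mode field, and the "static" default plus the validation error are assumptions. The 'qdq' → 'static' mapping and the preference for the old 'algorithm' key come from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

_VALID_MODES = ("static", "dynamic", "rtn", "fp16")
# Legacy spelling mapped onto the unified values ('qdq' becomes 'static').
_LEGACY_ALIASES = {"qdq": "static"}


@dataclass
class QuantConfigSketch:
    """Hypothetical stand-in for WinMLQuantizationConfig (mode only)."""

    mode: str = "static"  # assumption: default when no key is present

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "QuantConfigSketch":
        # Old to_dict() emitted both keys, so the legacy 'algorithm' key
        # wins over 'mode' when both are present.
        raw = data.get("algorithm") or data.get("mode") or "static"
        mode = _LEGACY_ALIASES.get(raw, raw)
        if mode not in _VALID_MODES:
            raise ValueError(f"unsupported quantization mode: {raw!r}")
        return cls(mode=mode)


# A config written by the old serializer keeps its RTN routing.
legacy = QuantConfigSketch.from_dict({"mode": "qdq", "algorithm": "rtn"})
print(legacy.mode)
```

Reading 'mode' first would turn the legacy config above into a static one, which is the silent data loss the commit message warns about.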
…ize command paths
- Split _quantize_single_pass into 3 focused methods: _quantize_fp16, _quantize_rtn, _quantize_qdq with a dispatch dict
- Consolidate 3 separate FP16/RTN/QDQ paths in commands/quantize.py into a single if/elif/else that builds config then shares execution logic
- Remove duplicated try/except, console output, and output path logic

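The dispatch-dict shape this commit describes might look roughly like the sketch below. The handler names and the `_mode_handlers` dict name come from the PR text, but everything else is assumed: the handlers here are module-level placeholders that return a description string instead of touching a model, `quantize_onnx_sketch` is a hypothetical simplification of quantize_onnx(), and the "dynamic" mode is left out because the source does not say which handler serves it.

```python
from __future__ import annotations

from typing import Callable


def _quantize_fp16(model_path: str) -> str:
    # Placeholder body: the real method performs float16 conversion.
    return f"fp16 conversion of {model_path}"


def _quantize_rtn(model_path: str) -> str:
    # Placeholder body: the real method performs weight-only int4 RTN.
    return f"RTN int4 weight quantization of {model_path}"


def _quantize_qdq(model_path: str) -> str:
    # Placeholder body: the real method runs calibrated static QDQ.
    return f"static QDQ quantization of {model_path}"


# One entry per mode; adding a mode means adding one handler and one key.
_mode_handlers: dict[str, Callable[[str], str]] = {
    "fp16": _quantize_fp16,
    "rtn": _quantize_rtn,
    "static": _quantize_qdq,
}


def quantize_onnx_sketch(model_path: str, mode: str) -> str:
    """Run exactly one quantization pass, selected by mode."""
    handler = _mode_handlers.get(mode)
    if handler is None:
        raise ValueError(f"no handler registered for mode {mode!r}")
    return handler(model_path)


print(quantize_onnx_sketch("model.onnx", "rtn"))
```

Compared with an if/elif chain, the dict keeps the routing table in one place and makes an unknown mode fail loudly rather than fall through to a default path.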
…gle-pass

Multi-pass orchestration (w4a16 = int4 + fp16) will be re-introduced in a follow-up PR via a proper Quantizer class with BaseQuantPass pipeline (see #964). For now, quantize_onnx handles one pass at a time.
- Remove _run_multi_pass, _make_pass_config, _cleanup_intermediates
- Remove precision parameter from quantize_onnx signature
- Remove multi-pass UI hint from build.py stage display
- Update docstrings to reflect single-pass design

- Remove 'handles QDQ + FP16 post-processing' comments
- Remove precision passing through extra_kwargs (no longer needed)
- quantize_onnx no longer accepts precision param

Let's use another item to fix this? One option is to also add weight / activation type flags to build and add a common resolve_helper that considers all 3.
- build.py: keep model-type quant finalizer dispatch alongside main's quantize stage
- quantizer.py: reapply per-target weight/activation symmetry override on top of main's refactored _quantize_qdq (mode-dispatched single-pass quantizer)
- qwen3 transformer-only finalizer: pin mode=static so the new mode-keyed dispatch always routes the fixed w8a16 QDQ scheme (regardless of incoming precision policy), with a regression test

Summary
Adds a unified --precision flag to winml quantize and winml build that auto-selects the quantization algorithm based on the target precision. This replaces the need to manually configure --weight-type, --activation-type, and algorithm-specific flags.

Closes #867

Precision → Algorithm Mapping
--precision
- fp16
- int4
- int8
- int16/w16a16
- w8a16
- w8a8

Architecture
This PR implements single-pass quantization only. Each --precision value maps to exactly one quantization operation:
- _quantize_fp16(): float16 conversion via onnxconverter-common
- _quantize_rtn(): weight-only int4 via MatMulNBits
- _quantize_qdq(): calibrated static quantization via onnxruntime

Dispatch is handled via a _mode_handlers dict in quantize_onnx(). The command layer builds a WinMLQuantizationConfig with mode set to "fp16", "rtn", "static", or "dynamic", then calls quantize_onnx(config=...).

Key Design Decisions
- WinMLQuantizationConfig.mode is the unified field: Literal["static", "dynamic", "rtn", "fp16"]
- from_dict() provides backward compatibility: it reads the "algorithm" key and maps it to mode

Supported Commands
--precision support
- winml build
- winml quantize
- winml config

E2E Test Results (convnext-tiny-224)
- winml quantize --precision fp16
- winml quantize --precision int4
- winml quantize --precision int8
- winml build --precision fp16
- winml build --precision int4

Follow-up PRs
- Quantizer class with BaseQuantPass pipeline: enables multi-pass (e.g., w4a16 = int4 + fp16)
- fp16_op_block_list / fp16_keep_io_types across QDQ and FP16 paths