feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag by DingmaomaoBJTU · Pull Request #872 · microsoft/winml-cli

DingmaomaoBJTU · 2026-06-11T03:04:44Z

Summary

Adds a unified --precision flag to winml quantize and winml build that auto-selects the quantization algorithm based on the target precision. This replaces the need to manually configure --weight-type, --activation-type, and algorithm-specific flags.

Closes #867

Precision → Algorithm Mapping

| `--precision` | Algorithm | Description |
| --- | --- | --- |
| `fp16` | FP16 conversion | Weights + activations → FP16 (I/O stays FP32) |
| `int4` | RTN (weight-only) | 4-bit weight via MatMulNBits, activation stays FP32 |
| `int8` | Static QDQ | Calibrated QDQ (uint8 weight + uint8 activation) |
| `int16` / `w16a16` | Static QDQ | Calibrated QDQ (int16 weight + uint16 activation) |
| `w8a16` | Static QDQ | Calibrated QDQ (uint8 weight + uint16 activation) |
| `w8a8` | Static QDQ | Calibrated QDQ (uint8 weight + uint8 activation) |
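
The mapping above can be read as a plain lookup table. The following is a minimal sketch, not the PR's implementation: `QuantPlan`, `PRECISION_PLANS`, and `resolve_precision` are hypothetical names, and the `"4bit"` weight label is a placeholder chosen for illustration.

```python
from typing import NamedTuple, Optional


class QuantPlan(NamedTuple):
    """Hypothetical record describing what one --precision value selects."""
    mode: str                       # "fp16", "rtn", or "static"
    weight_type: Optional[str]      # None when the path has no integer weight type
    activation_type: Optional[str]  # None when activations stay in float


# Transcribed from the Precision → Algorithm table; the dict itself is an assumption.
PRECISION_PLANS = {
    "fp16": QuantPlan("fp16", None, None),
    "int4": QuantPlan("rtn", "4bit", None),  # weight-only, activation stays FP32
    "int8": QuantPlan("static", "uint8", "uint8"),
    "w8a8": QuantPlan("static", "uint8", "uint8"),
    "w8a16": QuantPlan("static", "uint8", "uint16"),
    "int16": QuantPlan("static", "int16", "uint16"),
    "w16a16": QuantPlan("static", "int16", "uint16"),
}


def resolve_precision(precision: str) -> QuantPlan:
    """Return the plan for a precision string, case-insensitively."""
    key = precision.strip().lower()
    try:
        return PRECISION_PLANS[key]
    except KeyError:
        raise ValueError(f"unsupported precision: {precision!r}") from None
```

Under this reading, `int8` and `w8a8` are aliases for the same plan, as are `int16` and `w16a16`.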

Architecture

This PR implements single-pass quantization only. Each --precision value maps to exactly one quantization operation:

- FP16: _quantize_fp16(), float16 conversion via onnxconverter-common
- RTN: _quantize_rtn(), weight-only int4 via MatMulNBits
- QDQ: _quantize_qdq(), calibrated static quantization via onnxruntime
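
For readers unfamiliar with RTN (round-to-nearest), here is a minimal pure-Python sketch of the idea on a single block of weights. It assumes symmetric quantization to the signed 4-bit range [-8, 7]; it does not reproduce the packing, block size, or zero-point handling that MatMulNBits uses, and the function names are hypothetical.

```python
def rtn_quantize_int4(block):
    """Quantize one block of float weights to signed 4-bit integers.

    Returns (quantized_values, scale). No calibration data is needed:
    the scale comes from the weights themselves.
    """
    max_abs = max((abs(w) for w in block), default=0.0)
    # Map the largest magnitude onto 7; fall back to 1.0 for an all-zero block.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in block]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 4-bit values."""
    return [q * scale for q in quantized]
```

Because each weight is rounded to the nearest grid point, the reconstruction error per weight is at most half the scale, which is why the path needs no calibration pass.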

Dispatch is handled via a _mode_handlers dict in quantize_onnx(). The command layer builds a WinMLQuantizationConfig with mode set to "fp16", "rtn", "static", or "dynamic", then calls quantize_onnx(config=...).
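
The dispatch described above might look roughly like the sketch below. `QuantConfig` is a hypothetical stand-in for WinMLQuantizationConfig, the handler bodies are placeholders that only return a label, and the `"dynamic"` mode is left out because the description does not say which handler serves it.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class QuantConfig:
    """Hypothetical stand-in for WinMLQuantizationConfig."""
    mode: str = "static"


def _quantize_fp16(config: QuantConfig) -> str:
    return "fp16: float16 conversion"  # placeholder body


def _quantize_rtn(config: QuantConfig) -> str:
    return "rtn: weight-only int4"  # placeholder body


def _quantize_qdq(config: QuantConfig) -> str:
    return "static: calibrated QDQ"  # placeholder body


# One entry per mode, so each call performs exactly one operation.
_mode_handlers: Dict[str, Callable[[QuantConfig], str]] = {
    "fp16": _quantize_fp16,
    "rtn": _quantize_rtn,
    "static": _quantize_qdq,
}


def quantize_onnx(config: QuantConfig) -> str:
    """Route to the single handler registered for config.mode."""
    try:
        handler = _mode_handlers[config.mode]
    except KeyError:
        raise ValueError(f"no handler in this sketch for mode {config.mode!r}") from None
    return handler(config)
```

A dict keeps the routing in one place: adding a mode means adding one handler and one entry rather than another branch.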

Key Design Decisions

- WinMLQuantizationConfig.mode is the unified field: Literal["static", "dynamic", "rtn", "fp16"]
- RTN and FP16 paths skip calibration entirely; warnings are shown if calibration flags are provided
- QDQ precisions (int8, int16, w8a16, etc.) still require calibration data
- from_dict() provides backward compat: reads "algorithm" key and maps to mode

Supported Commands

| Command | `--precision` support | Notes |
| --- | --- | --- |
| `winml build` | ✅ | Full pipeline: export → optimize → quantize → compile |
| `winml quantize` | ✅ | Standalone quantization on existing ONNX |
| `winml config` | ✅ | Config generation respects precision |

E2E Test Results (convnext-tiny-224)

| Command | Result | Notes |
| --- | --- | --- |
| `winml quantize --precision fp16` | ✅ | 109 → 54.6 MB, 4.7 s |
| `winml quantize --precision int4` | ✅ | 109 → 23.7 MB, 3.7 s (RTN 4-bit) |
| `winml quantize --precision int8` | ✅ | 109 → 28.0 MB, 46 s (QDQ) |
| `winml build --precision fp16` | ✅ | Full pipeline, 87 s |
| `winml build --precision int4` | ✅ | Full pipeline with RTN |
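
The size figures reported for `winml quantize` translate into compression ratios with simple arithmetic. The snippet below only reworks the numbers listed in the results; the helper name is hypothetical.

```python
ORIGINAL_MB = 109.0
# Output sizes reported for `winml quantize` on convnext-tiny-224.
RESULTS_MB = {"fp16": 54.6, "int4": 23.7, "int8": 28.0}


def compression_ratio(original_mb: float, quantized_mb: float) -> float:
    """How many times smaller the quantized model is than the original."""
    return original_mb / quantized_mb


for precision, size_mb in RESULTS_MB.items():
    ratio = compression_ratio(ORIGINAL_MB, size_mb)
    print(f"{precision}: {ratio:.2f}x smaller")
# fp16: 2.00x smaller
# int4: 4.60x smaller
# int8: 3.89x smaller
```

The int4 result falls short of the 8x that 4-bit storage alone would suggest, and the int8 result is just under 4x. The results do not break down where the remaining bytes go.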

Follow-up PRs

- #964: Refactor quantizer into Quantizer class with BaseQuantPass pipeline; enables multi-pass (e.g., w4a16 = int4 + fp16)
- #963: Apply fp16_op_block_list / fp16_keep_io_types to QDQ path (and vice versa), sharing both options across the QDQ and FP16 paths

timenick

Three findings on PR #872.

🤖 Generated with GitHub Copilot CLI

Add FP16 precision conversion support across all model pipeline commands:

- Create optim/fp16.py with convert_to_fp16() utility (wraps ORT float16)
- optimize: --precision fp16 with --fp16-keep-io-types and --fp16-op-block-list
- build: --precision fp16 stage between optimize and quantize
- export: --precision fp16 as post-export conversion
- Add shared precision_option() CLI decorator in utils/cli.py

Design: FP16 is a precision transformation (not a graph optimization), so it lives as a command-layer utility rather than an optimizer pipe. All three commands share the same convert_to_fp16() function.

Fixes #867

- Add algorithm, fp16, fp16_only, fp16_keep_io_types, fp16_op_block_list, and RTN fields to WinMLQuantizationConfig
- quantize_onnx now supports pure-FP16 fast path (fp16_only=True skips QDQ) and FP16 post-processing after QDQ (fp16=True, fp16_only=False)
- resolve_quant_compile_config returns fp16_only quant config for precision=fp16
- Remove _run_fp16_stage and skip-quantize hack from build.py pipelines
- Build pipeline unified: Export -> Optimize -> Quantize Stage -> Compile, where Quantize Stage handles both QDQ and FP16 conversion
- Update tests to reflect new behavior (fp16 produces quant config, not None)

- Remove fp16_postprocess from WinMLQuantizationConfig
- Add expand_precision() to decompose w4a16 into [int4, fp16] passes
- Refactor _run_quantize_stage into multi-pass loop with helper functions
- Each quantize_onnx call now does exactly one operation (single responsibility)
- Update standalone quantize command for two-pass w4a16 flow
- Add precision field to WinMLBuildConfig for pass expansion
- Add expand_precision tests
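
As an illustration of the pass expansion this commit introduces, here is a minimal sketch. The only behavior taken from the PR is that `w4a16` expands to `["int4", "fp16"]`; treating every other precision as a single pass, and the `_COMPOSITE_PRECISIONS` table itself, are assumptions.

```python
from typing import Dict, List

# Hypothetical table of composite precisions and the passes they decompose into.
_COMPOSITE_PRECISIONS: Dict[str, List[str]] = {
    "w4a16": ["int4", "fp16"],
}


def expand_precision(precision: str) -> List[str]:
    """Return the ordered list of single-pass precisions for a precision string."""
    key = precision.strip().lower()
    # Copy the list so callers cannot mutate the shared table.
    return list(_COMPOSITE_PRECISIONS.get(key, [key]))
```

For example, `expand_precision("w4a16")` yields two passes, while `expand_precision("int8")` yields the single pass `["int8"]`.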

- Add 'precision' parameter to quantize_onnx() that handles multi-pass expansion internally (e.g., w4a16 → [int4, fp16])
- Simplify _run_quantize_stage in build.py to a single quantize_onnx() call, with no more _make_step_config or _run_single_quantize_pass helpers
- Simplify commands/quantize.py RTN path: remove manual expand_precision loop and intermediate file management
- Delete unused _should_run_quantization() dead code from quantizer.py
- All multi-pass orchestration (intermediate files, cleanup, pass config construction) now lives in the quant layer where it belongs

…ation

Move calibration warning logic from commands/quantize.py into utils/cli.py as warn_ignored_calibration_options() so any command that needs the check can reuse it without duplicating the logic.
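
A shared helper of this kind could be sketched as follows. The real signature of warn_ignored_calibration_options() is not shown in this PR page, so the parameters here are assumptions, as are the option names used in the usage note.

```python
import warnings
from typing import Any, List

# Modes that never run calibration, per the PR description.
CALIBRATION_FREE_MODES = {"rtn", "fp16"}


def warn_ignored_calibration_options(mode: str, **calibration_options: Any) -> List[str]:
    """Warn about calibration options that the selected mode will ignore.

    Returns the names of the ignored options so callers can also log them.
    """
    if mode not in CALIBRATION_FREE_MODES:
        return []
    # Only options the user actually set (non-None) are worth warning about.
    ignored = sorted(name for name, value in calibration_options.items() if value is not None)
    for name in ignored:
        warnings.warn(
            f"'{name}' is ignored when mode={mode!r}: this path does not calibrate",
            UserWarning,
            stacklevel=2,
        )
    return ignored
```

A command could call it as `warn_ignored_calibration_options("rtn", calibration_data=path, calibration_samples=None)`, where both keyword names are placeholders for whatever calibration flags the command exposes.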

FP16 conversion is exclusively used by the quantizer's algorithm='fp16' path. It's not an optimizer pipe — move it to quant/fp16.py where it logically belongs. Remove optim/fp16.py entirely.

Address reviewer comment: mode and algorithm are redundant. algorithm is the active routing field; mode is kept only for serialization backward-compatibility and marked deprecated.

Remove redundant 'algorithm' field. Expand 'mode' to cover all quantization modes: static, dynamic, rtn, fp16. The old 'qdq' value is mapped to 'static' for backward compatibility. from_dict() prefers the old 'algorithm' key over 'mode' when both are present (old to_dict emitted both), preventing silent data loss when deserializing configs with algorithm='rtn' or 'fp16'.
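
The deserialization rule described in this commit can be sketched like this. `QuantConfig` is a hypothetical stand-in for WinMLQuantizationConfig with every other field omitted, and the `"static"` default for a dict with neither key is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict

_VALID_MODES = {"static", "dynamic", "rtn", "fp16"}
_LEGACY_ALIASES = {"qdq": "static"}  # old value mapped for backward compatibility


@dataclass
class QuantConfig:
    """Hypothetical stand-in for WinMLQuantizationConfig."""
    mode: str = "static"

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "QuantConfig":
        # Prefer the legacy "algorithm" key: old configs emitted both keys,
        # and "mode" alone could not express "rtn" or "fp16" back then.
        raw = data.get("algorithm") or data.get("mode") or "static"
        mode = _LEGACY_ALIASES.get(raw, raw)
        if mode not in _VALID_MODES:
            raise ValueError(f"unknown quantization mode: {raw!r}")
        return cls(mode=mode)
```

With this ordering, an old config such as `{"algorithm": "rtn", "mode": "static"}` deserializes to `mode="rtn"` instead of silently falling back to static QDQ.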

…ize command paths

- Split _quantize_single_pass into 3 focused methods: _quantize_fp16, _quantize_rtn, _quantize_qdq with a dispatch dict
- Consolidate 3 separate FP16/RTN/QDQ paths in commands/quantize.py into a single if/elif/else that builds config then shares execution logic
- Remove duplicated try/except, console output, and output path logic

…gle-pass

Multi-pass orchestration (w4a16 = int4 + fp16) will be re-introduced in a follow-up PR via a proper Quantizer class with BaseQuantPass pipeline (see #964). For now, quantize_onnx handles one pass at a time.

- Remove _run_multi_pass, _make_pass_config, _cleanup_intermediates
- Remove precision parameter from quantize_onnx signature
- Remove multi-pass UI hint from build.py stage display
- Update docstrings to reflect single-pass design

- Remove 'handles QDQ + FP16 post-processing' comments
- Remove precision passing through extra_kwargs (no longer needed)
- quantize_onnx no longer accepts precision param

…a in QDQ path

xieofxie · 2026-06-25T06:46:50Z

commands/quantize.py:287 — --precision w4a16 --weight-type uint8 silently ignores the documented weight-type override.
The dispatch checks elif precision_lower and _is_weight_only(precision_lower) (true for w4a16/w4a8) before the QDQ branch that honors explicit --weight-type/--activation-type. The precision_option help advertises that explicit type flags override precision, but for a weight-only precision the RTN path runs unconditionally. Failure: user runs winml quantize --precision w4a16 --weight-type uint8 expecting uint8 QDQ, gets RTN int4 weight-only instead (a yellow warning is printed, but it contradicts the documented precedence).
config/precision.py:737 / quantizer.py — --precision w4a16 (and w4a8) is accepted but runs int4-only, silently dropping the activation pass.
_is_valid_precision/is_weight_only_precision accept w4a16, and quantize_onnx runs a single RTN int4 pass. expand_precision("w4a16") == ["int4", "fp16"] documents that the a16 (FP16 activation) pass should follow, but multi-pass is deferred to #964 (Refactor quantizer into Quantizer class with BaseQuantPass pipeline) and is never invoked. Failure: --precision w4a16 produces an int4/w4a32-equivalent model that doesn't match its name, with no error. Consider rejecting w4a16/w4a8 at the CLI until #964 lands, rather than producing a mislabeled artifact.
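
One way to implement the suggested rejection is a small guard ahead of dispatch. This is only a sketch of the reviewer's proposal, not code from the PR: `validate_precision` and `_DEFERRED_PRECISIONS` are hypothetical names, and the error text is illustrative.

```python
# Precisions that need a second pass, which single-pass quantize_onnx cannot run.
_DEFERRED_PRECISIONS = {"w4a16", "w4a8"}


def validate_precision(precision: str) -> str:
    """Normalize a precision string, rejecting values that need multi-pass."""
    key = precision.strip().lower()
    if key in _DEFERRED_PRECISIONS:
        raise ValueError(
            f"--precision {key} requires multi-pass quantization, which is not "
            "available yet (tracked in #964); use int4 for weight-only 4-bit"
        )
    return key
```

Failing fast here means the user gets an explicit error instead of an int4-only model carrying a w4a16 label.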

Let's use another item to fix this?

Another option is to add the weight / activation type flags to build as well, and add a common resolve_helper that considers all 3.

…rapper and dead branch

- build.py: keep model-type quant finalizer dispatch alongside main's quantize stage
- quantizer.py: reapply per-target weight/activation symmetry override on top of main's refactored _quantize_qdq (mode-dispatched single-pass quantizer)
- qwen3 transformer-only finalizer: pin mode=static so the new mode-keyed dispatch always routes the fixed w8a16 QDQ scheme (regardless of incoming precision policy), with a regression test

DingmaomaoBJTU requested a review from a team as a code owner June 11, 2026 03:04

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread tests/unit/optim/pipes/test_pipe_fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 8f5a1d2 to 9e7d8fd Compare June 11, 2026 04:15

DingmaomaoBJTU changed the title ~~feat: add --enable-fp16-conversion to winml optimize~~ feat: add --precision fp16 to optimize, build, and export commands Jun 11, 2026

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread tests/unit/optim/test_fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 9e7d8fd to 7d7a0ae Compare June 11, 2026 04:22

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread src/winml/modelkit/optim/fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 7d7a0ae to 328b5ab Compare June 11, 2026 04:32

timenick reviewed Jun 11, 2026

Comment thread src/winml/modelkit/commands/build.py Outdated

Comment thread src/winml/modelkit/commands/build.py Outdated

Comment thread tests/unit/optim/test_fp16.py Outdated

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 328b5ab to b859627 Compare June 11, 2026 05:26

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread tests/unit/optim/test_fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch 2 times, most recently from 837330d to fede96c Compare June 11, 2026 07:43

DingmaomaoBJTU changed the title ~~feat: add --precision fp16 to optimize, build, and export commands~~ feat: FP16 precision support via quantize stage + extended build --precision Jun 23, 2026

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 82c92cb to 75be8d3 Compare June 23, 2026 07:37

github-advanced-security AI found potential problems Jun 23, 2026

Comment thread src/winml/modelkit/commands/build.py Fixed

DingmaomaoBJTU changed the title ~~feat: FP16 precision support via quantize stage + extended build --precision~~ feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag Jun 23, 2026