feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag #872

Merged
DingmaomaoBJTU merged 36 commits into main from dingmaomaobjtu/feat-fp16-conversion on Jun 25, 2026

@DingmaomaoBJTU DingmaomaoBJTU commented Jun 11, 2026

Collaborator

Summary

Adds a unified --precision flag to winml quantize and winml build that auto-selects the quantization algorithm based on the target precision. This replaces the need to manually configure --weight-type, --activation-type, and algorithm-specific flags.

Closes #867

Precision → Algorithm Mapping

| --precision | Algorithm | Description |
|---|---|---|
| fp16 | FP16 conversion | Weights + activations → FP16 (I/O stays FP32) |
| int4 | RTN (weight-only) | 4-bit weight via MatMulNBits, activation stays FP32 |
| int8 | Static QDQ | Calibrated QDQ (uint8 weight + uint8 activation) |
| int16 / w16a16 | Static QDQ | Calibrated QDQ (int16 weight + uint16 activation) |
| w8a16 | Static QDQ | Calibrated QDQ (uint8 weight + uint16 activation) |
| w8a8 | Static QDQ | Calibrated QDQ (uint8 weight + uint8 activation) |
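
The mapping above can be read as a plain lookup from a precision string to a quantization mode plus tensor types. The following is a minimal Python sketch of that idea. `PrecisionPlan`, `PRECISION_PLANS`, and `resolve_precision` are hypothetical names chosen for illustration, not identifiers from this PR, and the type strings are transcribed from the Description column.

```python
from typing import NamedTuple, Optional


class PrecisionPlan(NamedTuple):
    """Hypothetical record: which mode runs and which tensor types it targets."""
    mode: str                       # "fp16", "rtn", or "static"
    weight_type: Optional[str]
    activation_type: Optional[str]  # None means activations stay FP32


# Illustrative table transcribed from the precision mapping in the PR description.
PRECISION_PLANS = {
    "fp16": PrecisionPlan("fp16", "fp16", "fp16"),
    "int4": PrecisionPlan("rtn", "int4", None),
    "int8": PrecisionPlan("static", "uint8", "uint8"),
    "int16": PrecisionPlan("static", "int16", "uint16"),
    "w16a16": PrecisionPlan("static", "int16", "uint16"),
    "w8a16": PrecisionPlan("static", "uint8", "uint16"),
    "w8a8": PrecisionPlan("static", "uint8", "uint8"),
}


def resolve_precision(precision: str) -> PrecisionPlan:
    """Look up the plan for a precision string, ignoring case and padding."""
    key = precision.strip().lower()
    if key not in PRECISION_PLANS:
        raise ValueError(f"unsupported precision: {precision!r}")
    return PRECISION_PLANS[key]
```

For example, `resolve_precision("int8").mode` evaluates to `"static"`, which is the value the command layer would place in the config's `mode` field under this sketch.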

Architecture

This PR implements single-pass quantization only. Each --precision value maps to exactly one quantization operation:

  • FP16: _quantize_fp16() — float16 conversion via onnxconverter-common
  • RTN: _quantize_rtn() — weight-only int4 via MatMulNBits
  • QDQ: _quantize_qdq() — calibrated static quantization via onnxruntime
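
To make the RTN (round-to-nearest) idea concrete, here is a pure-Python sketch of asymmetric 4-bit quantization for a single block of weights. It is an illustration of the general technique only: the function names are hypothetical, and the real MatMulNBits path works block-wise over whole weight matrices and packs two 4-bit values per byte, which this sketch does not attempt.

```python
def rtn_quantize_block(weights, bits=4):
    """Quantize one block of floats to unsigned integers by rounding to nearest."""
    qmax = (1 << bits) - 1                       # 15 for 4-bit
    lo = min(min(weights), 0.0)                  # widen the range so 0.0 is representable
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)              # integer code that represents 0.0
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def rtn_dequantize_block(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


block = [-0.8, -0.1, 0.0, 0.3, 1.2]
codes, scale, zero_point = rtn_quantize_block(block)
restored = rtn_dequantize_block(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, round(scale, 4), zero_point, round(max_error, 4))
```

No calibration data appears anywhere in the sketch, because the scale and zero point are derived from the weights alone. That is consistent with the RTN path skipping calibration.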

Dispatch is handled via a _mode_handlers dict in quantize_onnx(). The command layer builds a WinMLQuantizationConfig with mode set to "fp16", "rtn", "static", or "dynamic", then calls quantize_onnx(config=...).
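
A dict-based dispatch of this kind might look like the sketch below. The handler names and `quantize_onnx` come from the description above, but their bodies and signatures here are placeholders, and `QuantConfigSketch` is a hypothetical stand-in for `WinMLQuantizationConfig`. The description does not say which handler serves "dynamic", so the sketch leaves that mode out rather than guess.

```python
from dataclasses import dataclass


@dataclass
class QuantConfigSketch:
    """Hypothetical stand-in for WinMLQuantizationConfig; only `mode` is modeled."""
    mode: str = "static"


def _quantize_fp16(config):
    return "fp16 conversion"            # placeholder for the real conversion


def _quantize_rtn(config):
    return "rtn int4 weight-only"       # placeholder


def _quantize_qdq(config):
    return "calibrated static qdq"      # placeholder


# One handler per mode; "dynamic" is intentionally absent (see lead-in).
_mode_handlers = {
    "fp16": _quantize_fp16,
    "rtn": _quantize_rtn,
    "static": _quantize_qdq,
}


def quantize_onnx(config):
    """Run exactly one quantization operation, selected by config.mode."""
    handler = _mode_handlers.get(config.mode)
    if handler is None:
        raise ValueError(f"no handler registered for mode {config.mode!r}")
    return handler(config)
```

Keeping the routing in one dict means a new mode is added by registering one function, and an unknown mode fails loudly instead of falling through to a default path.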

Key Design Decisions

  • WinMLQuantizationConfig.mode is the unified field: Literal["static", "dynamic", "rtn", "fp16"]
  • RTN and FP16 paths skip calibration entirely — warnings shown if calibration flags are provided
  • QDQ precisions (int8, int16, w8a16, etc.) still require calibration data
  • from_dict() provides backward compat: reads "algorithm" key and maps to mode
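
The calibration rule in the list above (RTN and FP16 ignore calibration flags and warn about it) can be sketched as follows. The helper name `warn_ignored_calibration_options` appears in this PR's commit messages, but the signature, the option names, and the warning text below are assumptions made for illustration.

```python
import warnings

# Modes that never read calibration data, per the design decisions above.
CALIBRATION_FREE_MODES = {"fp16", "rtn"}


def warn_ignored_calibration_options(mode, **calibration_options):
    """Warn about calibration options that were set but will not be used.

    Returns the sorted names of the ignored options (empty when nothing is ignored).
    """
    if mode not in CALIBRATION_FREE_MODES:
        return []
    ignored = sorted(
        name for name, value in calibration_options.items() if value is not None
    )
    if ignored:
        warnings.warn(
            f"mode={mode!r} skips calibration; ignoring: {', '.join(ignored)}",
            stacklevel=2,
        )
    return ignored
```

Calling `warn_ignored_calibration_options("rtn", calibration_data="calib/")` would emit one warning and return `["calibration_data"]`, while the same call with `"static"` returns an empty list because QDQ still needs the data.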

Supported Commands

| Command | --precision support | Notes |
|---|---|---|
| winml build | | Full pipeline: export → optimize → quantize → compile |
| winml quantize | | Standalone quantization on existing ONNX |
| winml config | | Config generation respects precision |

E2E Test Results (convnext-tiny-224)

| Command | Result | Notes |
|---|---|---|
| winml quantize --precision fp16 | | 109→54.6MB, 4.7s |
| winml quantize --precision int4 | | 109→23.7MB, 3.7s (RTN 4-bit) |
| winml quantize --precision int8 | | 109→28.0MB, 46s (QDQ) |
| winml build --precision fp16 | | Full pipeline, 87s |
| winml build --precision int4 | | Full pipeline with RTN |
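
The size reductions reported above are easier to compare as ratios. The short Python snippet below recomputes them from the numbers in the table; the variable and function names are illustrative only.

```python
INPUT_SIZE_MB = 109.0

# Output sizes in MB, copied from the winml quantize rows above.
OUTPUT_SIZE_MB = {"fp16": 54.6, "int4": 23.7, "int8": 28.0}


def compression_ratio(input_mb, output_mb):
    """How many times smaller the output is than the input."""
    return input_mb / output_mb


for precision, size_mb in OUTPUT_SIZE_MB.items():
    ratio = compression_ratio(INPUT_SIZE_MB, size_mb)
    print(f"{precision}: {ratio:.2f}x smaller")
```

This works out to roughly 2.0x for fp16, 4.6x for int4, and 3.9x for int8.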

Follow-up PRs

@DingmaomaoBJTU DingmaomaoBJTU requested a review from a team as a code owner June 11, 2026 03:04
Comment thread tests/unit/optim/pipes/test_pipe_fp16.py Fixed
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 8f5a1d2 to 9e7d8fd Compare June 11, 2026 04:15
@DingmaomaoBJTU DingmaomaoBJTU changed the title from "feat: add --enable-fp16-conversion to winml optimize" to "feat: add --precision fp16 to optimize, build, and export commands" Jun 11, 2026
Comment thread tests/unit/optim/test_fp16.py Fixed
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 9e7d8fd to 7d7a0ae Compare June 11, 2026 04:22
Comment thread src/winml/modelkit/optim/fp16.py Fixed
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 7d7a0ae to 328b5ab Compare June 11, 2026 04:32

@timenick timenick left a comment

Collaborator

Three findings on PR #872.

🤖 Generated with GitHub Copilot CLI

Comment thread src/winml/modelkit/commands/build.py Outdated
Comment thread src/winml/modelkit/commands/build.py Outdated
Comment thread tests/unit/optim/test_fp16.py Outdated
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 328b5ab to b859627 Compare June 11, 2026 05:26
Comment thread tests/unit/optim/test_fp16.py Fixed
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch 2 times, most recently from 837330d to fede96c Compare June 11, 2026 07:43
@DingmaomaoBJTU DingmaomaoBJTU changed the title from "feat: add --precision fp16 to optimize, build, and export commands" to "feat: FP16 precision support via quantize stage + extended build --precision" Jun 23, 2026
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 82c92cb to 75be8d3 Compare June 23, 2026 07:37
Comment thread src/winml/modelkit/commands/build.py Fixed
@DingmaomaoBJTU DingmaomaoBJTU changed the title from "feat: FP16 precision support via quantize stage + extended build --precision" to "feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag" Jun 23, 2026
Comment thread src/winml/modelkit/quant/config.py Outdated
Comment thread src/winml/modelkit/quant/config.py Outdated
Comment thread src/winml/modelkit/commands/quantize.py Outdated
Comment thread src/winml/modelkit/commands/quantize.py Outdated
Comment thread src/winml/modelkit/quant/quantizer.py Outdated
Comment thread src/winml/modelkit/quant/quantizer.py Outdated
Comment thread src/winml/modelkit/quant/quantizer.py Outdated
Comment thread src/winml/modelkit/quant/quantizer.py Outdated
Comment thread src/winml/modelkit/commands/build.py Outdated
Comment thread src/winml/modelkit/commands/quantize.py Outdated
Comment thread src/winml/modelkit/quant/config.py
Comment thread src/winml/modelkit/quant/quantizer.py
DingmaomaoBJTU and others added 3 commits June 25, 2026 13:15
Add FP16 precision conversion support across all model pipeline commands:

- Create optim/fp16.py with convert_to_fp16() utility (wraps ORT float16)
- optimize: --precision fp16 with --fp16-keep-io-types and --fp16-op-block-list
- build: --precision fp16 stage between optimize and quantize
- export: --precision fp16 as post-export conversion
- Add shared precision_option() CLI decorator in utils/cli.py

Design: FP16 is a precision transformation (not a graph optimization), so it
lives as a command-layer utility rather than an optimizer pipe. All three
commands share the same convert_to_fp16() function.

Fixes #867

- Add algorithm, fp16, fp16_only, fp16_keep_io_types, fp16_op_block_list,
  and RTN fields to WinMLQuantizationConfig
- quantize_onnx now supports pure-FP16 fast path (fp16_only=True skips QDQ)
  and FP16 post-processing after QDQ (fp16=True, fp16_only=False)
- resolve_quant_compile_config returns fp16_only quant config for precision=fp16
- Remove _run_fp16_stage and skip-quantize hack from build.py pipelines
- Build pipeline unified: Export -> Optimize -> Quantize Stage -> Compile
  where Quantize Stage handles both QDQ and FP16 conversion
- Update tests to reflect new behavior (fp16 produces quant config, not None)
github-actions Bot added 13 commits June 25, 2026 13:16
- Remove fp16_postprocess from WinMLQuantizationConfig
- Add expand_precision() to decompose w4a16 into [int4, fp16] passes
- Refactor _run_quantize_stage into multi-pass loop with helper functions
- Each quantize_onnx call now does exactly one operation (single responsibility)
- Update standalone quantize command for two-pass w4a16 flow
- Add precision field to WinMLBuildConfig for pass expansion
- Add expand_precision tests

- Add 'precision' parameter to quantize_onnx() that handles multi-pass
  expansion internally (e.g., w4a16 → [int4, fp16])
- Simplify _run_quantize_stage in build.py to a single quantize_onnx()
  call — no more _make_step_config or _run_single_quantize_pass helpers
- Simplify commands/quantize.py RTN path — remove manual expand_precision
  loop and intermediate file management
- Delete unused _should_run_quantization() dead code from quantizer.py
- All multi-pass orchestration (intermediate files, cleanup, pass config
  construction) now lives in the quant layer where it belongs

Move calibration warning logic from commands/quantize.py into
utils/cli.py as warn_ignored_calibration_options() so any command
that needs the check can reuse it without duplicating the logic.

FP16 conversion is exclusively used by the quantizer's algorithm='fp16'
path. It's not an optimizer pipe; move it to quant/fp16.py where it
logically belongs. Remove optim/fp16.py entirely.

Address reviewer comment: mode and algorithm are redundant.
algorithm is the active routing field; mode is kept only for
serialization backward-compatibility and marked deprecated.

Remove redundant 'algorithm' field. Expand 'mode' to cover all
quantization modes: static, dynamic, rtn, fp16. The old 'qdq'
value is mapped to 'static' for backward compatibility.

from_dict() prefers the old 'algorithm' key over 'mode' when both
are present (old to_dict emitted both), preventing silent data loss
when deserializing configs with algorithm='rtn' or 'fp16'.
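
The backward-compatibility rule described in this commit message could be sketched as below. `QuantConfigSketch` is a hypothetical stand-in that models only the `mode` field, and defaulting to "static" when neither key is present is an assumption, not something the commit message states.

```python
from dataclasses import dataclass

_VALID_MODES = ("static", "dynamic", "rtn", "fp16")
_LEGACY_ALIASES = {"qdq": "static"}  # old value mapped to the new mode name


@dataclass
class QuantConfigSketch:
    """Hypothetical stand-in for WinMLQuantizationConfig; only `mode` is modeled."""
    mode: str = "static"

    @classmethod
    def from_dict(cls, data):
        # Prefer the legacy 'algorithm' key: old serializers wrote both keys,
        # and reading 'mode' first would lose algorithm='rtn' or 'fp16'.
        raw = data.get("algorithm") or data.get("mode") or "static"
        mode = _LEGACY_ALIASES.get(raw, raw)
        if mode not in _VALID_MODES:
            raise ValueError(f"unknown quantization mode: {raw!r}")
        return cls(mode=mode)
```

With this sketch, an old config such as `{"algorithm": "rtn", "mode": "static"}` loads as `mode="rtn"`, which is the data-loss case the commit message is guarding against.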
…ize command paths

- Split _quantize_single_pass into 3 focused methods: _quantize_fp16,
  _quantize_rtn, _quantize_qdq with a dispatch dict
- Consolidate 3 separate FP16/RTN/QDQ paths in commands/quantize.py into
  a single if/elif/else that builds config then shares execution logic
- Remove duplicated try/except, console output, and output path logic
…gle-pass

Multi-pass orchestration (w4a16 = int4 + fp16) will be re-introduced in
a follow-up PR via a proper Quantizer class with BaseQuantPass pipeline
(see #964). For now, quantize_onnx handles one pass at a time.

- Remove _run_multi_pass, _make_pass_config, _cleanup_intermediates
- Remove precision parameter from quantize_onnx signature
- Remove multi-pass UI hint from build.py stage display
- Update docstrings to reflect single-pass design
- Remove 'handles QDQ + FP16 post-processing' comments
- Remove precision passing through extra_kwargs (no longer needed)
- quantize_onnx no longer accepts precision param
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 5929e27 to eca92a6 Compare June 25, 2026 05:20
Comment thread src/winml/modelkit/commands/build.py Fixed
Comment thread src/winml/modelkit/quant/config.py Outdated
Comment thread src/winml/modelkit/quant/config.py
@xieofxie

Contributor
  1. commands/quantize.py:287 — --precision w4a16 --weight-type uint8 silently ignores the documented weight-type override.
    The dispatch checks elif precision_lower and _is_weight_only(precision_lower) (true for w4a16/w4a8) before the QDQ branch that honors explicit --weight-type/--activation-type. The precision_option help advertises that explicit type flags override precision, but for a weight-only precision the RTN path runs unconditionally. Failure: user runs winml quantize --precision w4a16 --weight-type uint8 expecting uint8 QDQ, gets RTN int4 weight-only instead (a yellow warning is printed, but it contradicts the documented precedence).

  2. config/precision.py:737 / quantizer.py — --precision w4a16 (and w4a8) is accepted but runs int4-only, silently dropping the activation pass.
    _is_valid_precision/is_weight_only_precision accept w4a16, and quantize_onnx runs a single RTN int4 pass. expand_precision("w4a16") == ["int4", "fp16"] documents that the a16 (FP16 activation) pass should follow, but multi-pass is deferred to #964 (Refactor quantizer into Quantizer class with BaseQuantPass pipeline) and never invoked. Failure: --precision w4a16 produces an int4/w4a32-equivalent model that doesn't match its name, with no error. Consider rejecting w4a16/w4a8 at the CLI until #964 lands, rather than producing a mislabeled artifact.

Let's use another item to fix this?

One option is to also add weight / activation type flags to build, plus a common resolve_helper that considers all 3.
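
As a rough illustration of what such a shared resolver could look like, the Python sketch below combines the three inputs (precision, weight type, activation type), lets explicit type flags override the precision defaults, and rejects w4a16/w4a8 as suggested in finding 2. Everything here is hypothetical: `resolve_qdq_types` and its tables are not part of this PR, and the defaults are copied from the QDQ rows of the precision table.

```python
from typing import Optional, Tuple

# Hypothetical defaults: (weight_type, activation_type) per QDQ precision.
_QDQ_DEFAULTS = {
    "int8": ("uint8", "uint8"),
    "w8a8": ("uint8", "uint8"),
    "w8a16": ("uint8", "uint16"),
    "int16": ("int16", "uint16"),
    "w16a16": ("int16", "uint16"),
}

# Precisions that need a second pass, which single-pass quantization cannot honor.
_NEEDS_MULTI_PASS = {"w4a16", "w4a8"}


def resolve_qdq_types(
    precision: Optional[str] = None,
    weight_type: Optional[str] = None,
    activation_type: Optional[str] = None,
) -> Tuple[str, str]:
    """Resolve weight/activation types; explicit flags win over precision defaults."""
    key = precision.strip().lower() if precision else None
    if key in _NEEDS_MULTI_PASS:
        raise ValueError(f"{precision!r} requires multi-pass quantization")
    if key is not None and key not in _QDQ_DEFAULTS:
        raise ValueError(f"{precision!r} is not a QDQ precision")
    default_weight, default_activation = _QDQ_DEFAULTS.get(key, (None, None))
    resolved_weight = weight_type or default_weight
    resolved_activation = activation_type or default_activation
    if resolved_weight is None or resolved_activation is None:
        raise ValueError("need --precision or both explicit type flags")
    return resolved_weight, resolved_activation
```

Under this sketch, `resolve_qdq_types("int8", activation_type="uint16")` returns `("uint8", "uint16")`, and `resolve_qdq_types("w4a16")` raises instead of silently producing an int4-only model.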

Comment thread src/winml/modelkit/commands/build.py Outdated
Comment thread src/winml/modelkit/commands/quantize.py Outdated
@DingmaomaoBJTU DingmaomaoBJTU merged commit bf82a18 into main Jun 25, 2026
9 checks passed
@DingmaomaoBJTU DingmaomaoBJTU deleted the dingmaomaobjtu/feat-fp16-conversion branch June 25, 2026 07:13
DingmaomaoBJTU pushed a commit that referenced this pull request Jun 25, 2026
- build.py: keep model-type quant finalizer dispatch alongside main's quantize stage
- quantizer.py: reapply per-target weight/activation symmetry override on top of
  main's refactored _quantize_qdq (mode-dispatched single-pass quantizer)
- qwen3 transformer-only finalizer: pin mode=static so the new mode-keyed
  dispatch always routes the fixed w8a16 QDQ scheme (regardless of incoming
  precision policy), with a regression test

Development

Successfully merging this pull request may close these issues.

feat: add --enable-fp16-conversion to winml optimize and --precision to winml build/export

5 participants