Skip to content

[6241485] Add support for ONNX Q/DQ node placement for DLA#1661

Open
gcunhase wants to merge 1 commit into
NVIDIA:mainfrom
gcunhase:dev/gcunhasergio/6241485_dla_calib
Open

[6241485] Add support for ONNX Q/DQ node placement for DLA#1661
gcunhase wants to merge 1 commit into
NVIDIA:mainfrom
gcunhase:dev/gcunhasergio/6241485_dla_calib

Conversation

@gcunhase

@gcunhase gcunhase commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Type of change: New feature

On DLA, the whole DLA-eligible region is compiled as one node, which runs in INT8 or FP16, and it expects scales to be present throughout. A tensor without a usable scale typically forces either that region to run in FP16 or a GPU fallback (if enabled) — otherwise the build fails.

With IQ (implicit quantization) being deprecated in TensorRT, users are migrating to ModelOpt for quantization/calibration. However, this breaks the DLA workflow since DLA still only supports IQ. The suggested workflow is then to:

  1. Use ModelOpt to obtain the EQ (explicitly quantized) model;
  2. Use NVIDIA's Q/DQ Translator Toolkit to obtain the calib.cache and layer_arg.txt files, which can be used with the non-quantized model to generate a DLA loadable.

A study on Yolov5 has shown that EQ can achieve perf parity with IQ on DLA if Q/DQ nodes are inserted at every layer, making sure all tensors have INT8 scales. From the study: "With this option, all layers’ scales can be obtained during model fine-tuning. However, this method may potentially disrupt TensorRT fusion strategy with Q/DQ layers when running inference on GPU and lead to higher latency on the GPU. For DLA, on the other hand, the rule of thumb with PTQ scales is, “The more available scales, the lower the latency.” "

This PR aims to enable a quantization path targeting DLA.

Usage

$ python -m modelopt.onnx.quantization --onnx=model.onnx --target_dla

Testing

See 6241485@10

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ❌
  • Did you update Changelog?: ✅
  • Did you get Claude approval on this PR?: ❌

Additional info

Related blogpost: https://developer.nvidia.com/blog/deploying-yolov5-on-nvidia-jetson-orin-with-cudla-quantization-aware-training-to-inference/#adding_qdq_nodes

Summary by CodeRabbit

  • New Features

    • Added --target_dla flag for INT8 quantization to enable placement of Q/DQ nodes optimized for NVIDIA DLA target hardware.
  • Documentation

    • Updated changelog with new feature documentation.

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
@gcunhase gcunhase requested a review from a team as a code owner June 9, 2026 20:09
@gcunhase gcunhase requested a review from galagam June 9, 2026 20:09
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces a --target_dla flag to the ONNX quantization pipeline, enabling Q/DQ node placement optimized for NVIDIA DLA deployment. The flag adjusts node exclusion logic in INT8 quantization, skipping MHA, GEMV, and conv-based exclusion patterns when targeting DLA.

Changes

DLA Targeting Feature

Layer / File(s) Summary
CLI flag definition and wiring
modelopt/onnx/quantization/__main__.py
The --target_dla command-line argument is added to the PTQ parser with INT8-only documentation, and the flag value is forwarded to the quantize() function call.
Main quantize() function signature and documentation
modelopt/onnx/quantization/quantize.py
The quantize() function signature is extended with a new target_dla: bool = False parameter, and the docstring documents its effect on Q/DQ placement for DLA deployment.
Main quantize() DLA logic
modelopt/onnx/quantization/quantize.py
MHA node exclusion is now conditional—find_nodes_from_mha_to_exclude() is skipped when target_dla is true—and the flag is injected into kwargs for downstream INT8 quantization logic.
INT8 quantization adjustments for DLA
modelopt/onnx/quantization/int8.py
When target_dla is enabled, op_types_to_quantize is overwritten with all node op_type values from the graph, and both GEMV detection and conv-based exclusion are skipped by extending their conditional guards.
Changelog entry
CHANGELOG.rst
Documents the new 0.46 feature for ONNX Q/DQ node placement support for DLA via the --target_dla flag.

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main feature added in this PR: support for placing ONNX Q/DQ nodes targeting NVIDIA DLA deployment via the new --target_dla flag.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed The PR properly implements numpy.load with allow_pickle as a caller-configurable parameter (defaulting to False), includes security comments, and avoids all listed anti-patterns.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@gcunhase gcunhase requested review from ajrasane and cjluo-nv and removed request for galagam June 9, 2026 20:09

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
modelopt/onnx/quantization/int8.py (3)

188-189: 💤 Low value

Update condition comment for consistency.

Similar to the GEMV exclusion above, the conv exclusion is now skipped for both autotune and target_dla. Consider adding a brief comment explaining why both conditions skip this exclusion, for maintainability.

📝 Suggested comment
 if not (autotune or kwargs.get("target_dla", False)):
+    # Skip conv exclusion for autotune (runtime-driven) and target_dla (comprehensive Q/DQ coverage)
     nodes_to_exclude.extend(find_nodes_from_convs_to_exclude(graph, quantize_mode="int8"))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/int8.py` around lines 188 - 189, The
conv-exclusion condition in int8.py currently reads `if not (autotune or
kwargs.get("target_dla", False)):` and lacks the explanatory comment present for
the GEMV exclusion; add a short comment immediately above this condition
explaining that conv nodes are only excluded when neither autotune nor
target_dla are active (because autotune/target DLA require keeping convs for
their tuning/DLA support), referencing the `find_nodes_from_convs_to_exclude`
call and `nodes_to_exclude` list so future readers understand why the exclusion
is skipped in both modes.

168-174: 💤 Low value

Update comment to mention target_dla.

The comment at line 172 states "this check will be skipped if Autotune is enabled," but after this change, the check is also skipped when target_dla=True. Please update the comment to reflect both conditions.

📝 Suggested comment update
-        # Note that this check will be skipped if Autotune is enabled as Q/DQ node placements
-        # will be decided according to TensorRT's runtime measurements.
+        # Note that this check will be skipped if Autotune or target_dla is enabled.
+        # Autotune decides Q/DQ placements via runtime measurements; target_dla requires
+        # comprehensive Q/DQ coverage for DLA scale requirements.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/int8.py` around lines 168 - 174, The comment above
the conditional that checks enable_gemv_detection_for_trt (the block starting
with if enable_gemv_detection_for_trt and not (autotune or
kwargs.get("target_dla", False))) is out of date: update the comment to state
that the GEMV/TensorCore check is skipped when either Autotune is enabled or
when target_dla=True. Locate the block that logs "Detecting GEMV patterns for
TRT optimization" and modify the explanatory comment to mention both conditions
(autotune and kwargs.get("target_dla", False)) as reasons the check will be
skipped.

155-156: ⚡ Quick win

Consider documenting the override behavior.

When target_dla=True, this code replaces any user-provided op_types_to_quantize list with all node types from the graph. While this is likely intentional for comprehensive DLA scale coverage, the override happens silently and could surprise users who provided a custom list.

Consider adding a log message or updating the docstring to make this behavior explicit.

💡 Suggested improvement
 if kwargs.get("target_dla", False):
+    logger.info("target_dla=True: quantizing all op types for DLA scale coverage")
     op_types_to_quantize = list({node.op_type for node in onnx_model.graph.node})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/int8.py` around lines 155 - 156, The code silently
overrides a user-supplied op_types_to_quantize when kwargs.get("target_dla",
False) is true by setting op_types_to_quantize = list({node.op_type for node in
onnx_model.graph.node}); update the function handling these kwargs (the block
that reads target_dla and sets op_types_to_quantize) to either (a) emit a clear
log message (e.g., logger.info/warn) stating that target_dla=True causes
op_types_to_quantize to be replaced with all graph node types, or (b) update the
function docstring to explicitly document this override behavior; implement the
logging approach so callers see the override at runtime and keep the existing
assignment to list({node.op_type for node in onnx_model.graph.node}) when
target_dla is true.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@modelopt/onnx/quantization/int8.py`:
- Around line 188-189: The conv-exclusion condition in int8.py currently reads
`if not (autotune or kwargs.get("target_dla", False)):` and lacks the
explanatory comment present for the GEMV exclusion; add a short comment
immediately above this condition explaining that conv nodes are only excluded
when neither autotune nor target_dla are active (because autotune/target DLA
require keeping convs for their tuning/DLA support), referencing the
`find_nodes_from_convs_to_exclude` call and `nodes_to_exclude` list so future
readers understand why the exclusion is skipped in both modes.
- Around line 168-174: The comment above the conditional that checks
enable_gemv_detection_for_trt (the block starting with if
enable_gemv_detection_for_trt and not (autotune or kwargs.get("target_dla",
False))) is out of date: update the comment to state that the GEMV/TensorCore
check is skipped when either Autotune is enabled or when target_dla=True. Locate
the block that logs "Detecting GEMV patterns for TRT optimization" and modify
the explanatory comment to mention both conditions (autotune and
kwargs.get("target_dla", False)) as reasons the check will be skipped.
- Around line 155-156: The code silently overrides a user-supplied
op_types_to_quantize when kwargs.get("target_dla", False) is true by setting
op_types_to_quantize = list({node.op_type for node in onnx_model.graph.node});
update the function handling these kwargs (the block that reads target_dla and
sets op_types_to_quantize) to either (a) emit a clear log message (e.g.,
logger.info/warn) stating that target_dla=True causes op_types_to_quantize to be
replaced with all graph node types, or (b) update the function docstring to
explicitly document this override behavior; implement the logging approach so
callers see the override at runtime and keep the existing assignment to
list({node.op_type for node in onnx_model.graph.node}) when target_dla is true.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 72715d5e-ad7b-4b45-afdc-fb96b39940a5

📥 Commits

Reviewing files that changed from the base of the PR and between 5584ce4 and 2f10789.

📒 Files selected for processing (4)
  • CHANGELOG.rst
  • modelopt/onnx/quantization/__main__.py
  • modelopt/onnx/quantization/int8.py
  • modelopt/onnx/quantization/quantize.py

@codecov

codecov Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.68%. Comparing base (d3acf45) to head (2f10789).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/onnx/quantization/int8.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1661       +/-   ##
===========================================
+ Coverage   56.59%   76.68%   +20.08%     
===========================================
  Files         507      508        +1     
  Lines       55794    55867       +73     
===========================================
+ Hits        31579    42840    +11261     
+ Misses      24215    13027    -11188     
Flag Coverage Δ
examples 42.31% <0.00%> (+23.75%) ⬆️
gpu 57.85% <75.00%> (+37.31%) ⬆️
unit 54.42% <87.50%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant