[6241485] Add support for ONNX Q/DQ node placement for DLA#1661
[6241485] Add support for ONNX Q/DQ node placement for DLA#1661gcunhase wants to merge 1 commit into
Conversation
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
📝 WalkthroughWalkthroughThis PR introduces a ChangesDLA Targeting Feature
🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
modelopt/onnx/quantization/int8.py (3)
188-189: 💤 Low valueUpdate condition comment for consistency.
Similar to the GEMV exclusion above, the conv exclusion is now skipped for both
autotuneandtarget_dla. Consider adding a brief comment explaining why both conditions skip this exclusion, for maintainability.📝 Suggested comment
if not (autotune or kwargs.get("target_dla", False)): + # Skip conv exclusion for autotune (runtime-driven) and target_dla (comprehensive Q/DQ coverage) nodes_to_exclude.extend(find_nodes_from_convs_to_exclude(graph, quantize_mode="int8"))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/onnx/quantization/int8.py` around lines 188 - 189, The conv-exclusion condition in int8.py currently reads `if not (autotune or kwargs.get("target_dla", False)):` and lacks the explanatory comment present for the GEMV exclusion; add a short comment immediately above this condition explaining that conv nodes are only excluded when neither autotune nor target_dla are active (because autotune/target DLA require keeping convs for their tuning/DLA support), referencing the `find_nodes_from_convs_to_exclude` call and `nodes_to_exclude` list so future readers understand why the exclusion is skipped in both modes.
168-174: 💤 Low valueUpdate comment to mention target_dla.
The comment at line 172 states "this check will be skipped if Autotune is enabled," but after this change, the check is also skipped when
target_dla=True. Please update the comment to reflect both conditions.📝 Suggested comment update
- # Note that this check will be skipped if Autotune is enabled as Q/DQ node placements - # will be decided according to TensorRT's runtime measurements. + # Note that this check will be skipped if Autotune or target_dla is enabled. + # Autotune decides Q/DQ placements via runtime measurements; target_dla requires + # comprehensive Q/DQ coverage for DLA scale requirements.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/onnx/quantization/int8.py` around lines 168 - 174, The comment above the conditional that checks enable_gemv_detection_for_trt (the block starting with if enable_gemv_detection_for_trt and not (autotune or kwargs.get("target_dla", False))) is out of date: update the comment to state that the GEMV/TensorCore check is skipped when either Autotune is enabled or when target_dla=True. Locate the block that logs "Detecting GEMV patterns for TRT optimization" and modify the explanatory comment to mention both conditions (autotune and kwargs.get("target_dla", False)) as reasons the check will be skipped.
155-156: ⚡ Quick winConsider documenting the override behavior.
When
target_dla=True, this code replaces any user-providedop_types_to_quantizelist with all node types from the graph. While this is likely intentional for comprehensive DLA scale coverage, the override happens silently and could surprise users who provided a custom list.Consider adding a log message or updating the docstring to make this behavior explicit.
💡 Suggested improvement
if kwargs.get("target_dla", False): + logger.info("target_dla=True: quantizing all op types for DLA scale coverage") op_types_to_quantize = list({node.op_type for node in onnx_model.graph.node})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/onnx/quantization/int8.py` around lines 155 - 156, The code silently overrides a user-supplied op_types_to_quantize when kwargs.get("target_dla", False) is true by setting op_types_to_quantize = list({node.op_type for node in onnx_model.graph.node}); update the function handling these kwargs (the block that reads target_dla and sets op_types_to_quantize) to either (a) emit a clear log message (e.g., logger.info/warn) stating that target_dla=True causes op_types_to_quantize to be replaced with all graph node types, or (b) update the function docstring to explicitly document this override behavior; implement the logging approach so callers see the override at runtime and keep the existing assignment to list({node.op_type for node in onnx_model.graph.node}) when target_dla is true.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@modelopt/onnx/quantization/int8.py`:
- Around line 188-189: The conv-exclusion condition in int8.py currently reads
`if not (autotune or kwargs.get("target_dla", False)):` and lacks the
explanatory comment present for the GEMV exclusion; add a short comment
immediately above this condition explaining that conv nodes are only excluded
when neither autotune nor target_dla are active (because autotune/target DLA
require keeping convs for their tuning/DLA support), referencing the
`find_nodes_from_convs_to_exclude` call and `nodes_to_exclude` list so future
readers understand why the exclusion is skipped in both modes.
- Around line 168-174: The comment above the conditional that checks
enable_gemv_detection_for_trt (the block starting with if
enable_gemv_detection_for_trt and not (autotune or kwargs.get("target_dla",
False))) is out of date: update the comment to state that the GEMV/TensorCore
check is skipped when either Autotune is enabled or when target_dla=True. Locate
the block that logs "Detecting GEMV patterns for TRT optimization" and modify
the explanatory comment to mention both conditions (autotune and
kwargs.get("target_dla", False)) as reasons the check will be skipped.
- Around line 155-156: The code silently overrides a user-supplied
op_types_to_quantize when kwargs.get("target_dla", False) is true by setting
op_types_to_quantize = list({node.op_type for node in onnx_model.graph.node});
update the function handling these kwargs (the block that reads target_dla and
sets op_types_to_quantize) to either (a) emit a clear log message (e.g.,
logger.info/warn) stating that target_dla=True causes op_types_to_quantize to be
replaced with all graph node types, or (b) update the function docstring to
explicitly document this override behavior; implement the logging approach so
callers see the override at runtime and keep the existing assignment to
list({node.op_type for node in onnx_model.graph.node}) when target_dla is true.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 72715d5e-ad7b-4b45-afdc-fb96b39940a5
📒 Files selected for processing (4)
CHANGELOG.rstmodelopt/onnx/quantization/__main__.pymodelopt/onnx/quantization/int8.pymodelopt/onnx/quantization/quantize.py
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1661 +/- ##
===========================================
+ Coverage 56.59% 76.68% +20.08%
===========================================
Files 507 508 +1
Lines 55794 55867 +73
===========================================
+ Hits 31579 42840 +11261
+ Misses 24215 13027 -11188
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
What does this PR do?
Type of change: New feature
On DLA, the whole DLA-eligible region is compiled as one node, which runs in INT8 or FP16, and it expects scales to be present throughout. A tensor without a usable scale typically forces either that region to run in FP16 or a GPU fallback (if enabled) — otherwise the build fails.
With IQ (implicit quantization) being deprecated in TensorRT, users are migrating to ModelOpt for quantization/calibration. However, this breaks the DLA workflow since DLA still only supports IQ. The suggested workflow is then to:
calib.cacheandlayer_arg.txtfiles, which can be used with the non-quantized model to generate a DLA loadable.A study on Yolov5 has shown that EQ can achieve perf parity with IQ on DLA if Q/DQ nodes are inserted at every layer, making sure all tensors have INT8 scales. From the study: "With this option, all layers’ scales can be obtained during model fine-tuning. However, this method may potentially disrupt TensorRT fusion strategy with Q/DQ layers when running inference on GPU and lead to higher latency on the GPU. For DLA, on the other hand, the rule of thumb with PTQ scales is, “The more available scales, the lower the latency.” "
This PR aims to enable a quantization path targeting DLA.
Usage
Testing
See 6241485@10
Before your PR is "Ready for review"
CONTRIBUTING.md: N/AAdditional info
Related blogpost: https://developer.nvidia.com/blog/deploying-yolov5-on-nvidia-jetson-orin-with-cudla-quantization-aware-training-to-inference/#adding_qdq_nodes
Summary by CodeRabbit
New Features
--target_dlaflag for INT8 quantization to enable placement of Q/DQ nodes optimized for NVIDIA DLA target hardware.Documentation