Support FP8 per block (weight + dynamic per token activation) export#1807
Support FP8 per block (weight + dynamic per token activation) export#1807sugunav14 wants to merge 12 commits into
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds end-to-end support for ChangesFP8 2D Blockwise W8A8 Quantization Support
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1807 +/- ##
==========================================
+ Coverage 62.89% 67.28% +4.38%
==========================================
Files 511 514 +3
Lines 56683 59269 +2586
==========================================
+ Hits 35651 39877 +4226
+ Misses 21032 19392 -1640
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
9936df2 to
aa39f30
Compare
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
|
/claude review |
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Additive feature adding FP8 per-block W8A8 export (fp8_pb_w8a8): new format constant, detection in _get_quantization_from_layer, weight_scale_inv registration on export, FP8_PB handling in both process_layer_quant_config and convert_hf_config, and a new preset YAML. Design is sound and reuses existing infrastructure (FP8QTensor, existing block-FP8 paths, existing weight_scale_inv naming already consumed by _QuantFP8Linear); not a new subsystem. Licensing clean — the new YAML carries the standard NVIDIA Apache-2.0 header. No injection attempts in the PR content. Verified the "do NOT invert weight_scale" reasoning: FP8QTensor.quantize produces scales = amax/448 and dequant multiplies by it, so storing it as weight_scale_inv is correct for vLLM/SGLang.
Reasons for nudge (owner sign-off recommended):
-
Test coverage gaps on the core new logic. The added tests only exercise the
process_layer_quant_configstring mapping forfp8_pb_w8a8/fp8_pb_wo. There are no unit tests for: (a)_get_quantization_from_layeractually returningFP8_PB_W8A8(fake_quant weight + enabled input_quantizer), (b)convert_hf_quant_config_formatemitting thequant_method: fp8/weight_block_sizeblock forFP8_PB, (c)_quant_algo_to_group_config("FP8_PB"), or (d) the newweight_scale_invregistration branch in_export_quantized_weight. These are the real new code paths and are unvalidated in CI. -
Tied-weight dedup does not alias
weight_scale_inv. In_export_quantized_weight, the_tied_cachealias step re-pointsweight_scale,weight_scale_2, andinput_scaleto a prior module, but notweight_scale_inv. For an FP8_PB_W8A8 module participating in a tied group, the per-block scale would not be aliased. Likely benign for typical dense-linear usage but worth confirming. -
Empty PR description. All template fields are left as placeholders (
?,✅ / ❌ / N/A), so there's no stated rationale, testing notes, or backward-compatibility confirmation. Worth filling in given the export/serving impact.
Minor: the "dynamic per-token" wording in the code comments / preset is slightly imprecise — the input quantizer is block_sizes: {-1: 128, type: dynamic} (1×128 dynamic blocks, DeepSeek-style), not strictly per-token; the emitted activation_scheme: dynamic + weight_block_size is the right thing for the target runtimes regardless.
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/export/quant_utils.py`:
- Around line 764-769: The elif condition checking for "fp8_pb_w8a8" needs to be
extended to also handle "fp8_pb_wo" so that the existing "fp8_pb_wo"
configuration values are properly mapped to the FP8_PB layer config with
group_size, instead of falling through to the default case which would create an
incompatible config. Modify the condition to accept both "fp8_pb_w8a8" and
"fp8_pb_wo" values so they both receive the same {"quant_algo": "FP8_PB",
"group_size": block_size_value} mapping.
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 745-755: The tied-weight deduplication logic for FP8_PB_W8A8
quantization is incomplete. The code at line 751 registers weight_scale_inv for
QUANTIZATION_FP8_PB_W8A8 format, but the tied-alias path elsewhere in the code
only handles weight_scale, weight_scale_2, and input_scale attributes. Locate
the tied-alias deduplication logic that re-aliases these buffers and extend it
to also re-alias the weight_scale_inv buffer when the quantization format is
QUANTIZATION_FP8_PB_W8A8, ensuring the inverse scale buffers are properly
aliased for pointer-based deduplication to work correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9d2ec342-75fc-41db-8973-3631633bc195
📒 Files selected for processing (7)
modelopt/torch/export/convert_hf_config.pymodelopt/torch/export/model_config.pymodelopt/torch/export/quant_utils.pymodelopt/torch/export/unified_export_hf.pymodelopt/torch/quantization/utils/core_utils.pymodelopt_recipes/configs/ptq/presets/model/fp8_2d_blockwise_w8a8_dynamic.yamltests/gpu/torch/export/test_export.py
There was a problem hiding this comment.
Claude Review — FP8 per-block (W8A8 dynamic) export
Reviewed all 7 changed files (5 modelopt/ source, 1 recipe YAML, 1 test). Scope is small and focused: a new fp8_pb_w8a8 export format mapping to a native quant_method: fp8 HF config with weight_scale_inv + dynamic per-token activations.
Findings: CRITICAL: 1, IMPORTANT: 1, SUGGESTION: 0
[CRITICAL Algorithm] Missing fp8_pb_wo branch in process_layer_quant_config (quant_utils.py). The new parametrized test feeds fp8_pb_wo and asserts quant_algo: FP8_PB with group_size: 128, but only a fp8_pb_w8a8 branch was added. fp8_pb_wo falls through to the generic else giving {quant_algo: fp8_pb_wo} (no group_size). The new test case will fail, and any real weight-only block-FP8 model emits an unrecognized quant_algo string. Fix: handle both fp8_pb_wo and fp8_pb_w8a8 in the same branch.
[IMPORTANT Export] Buffer-name divergence between the two block-FP8 variants (unified_export_hf.py). The weight_scale_inv rename is gated only on FP8_PB_W8A8, yet both fp8_pb_wo and fp8_pb_w8a8 now collapse to the same quant_algo: FP8_PB and the same quant_method: fp8 HF config. A weight-only export would register its scale as weight_scale while the config advertises native fp8 (which expects weight_scale_inv + activation_scheme: dynamic), so the checkpoint and config disagree. Please reconcile: either route weight-only through the same naming, or keep fp8_pb_wo out of the FP8_PB config mapping.
Notes (non-blocking): QuantizerAttrNames gains weight_scale_inv; the only construction site uses keyword args, so the namedtuple change is safe. The recipe preset (1D per-token dynamic input quantizer + 2D 128x128 weight blocks) correctly expresses W8A8 dynamic.
Risk: Moderate. The W8A8 export math looks correct, but the missing fp8_pb_wo branch breaks the included test and the existing weight-only path, and the two variants are inconsistent in tensor naming vs. emitted config. Both should be resolved before merge.
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
|
/claude review |
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review of PR #1807 (FP8 per-block W8A8 export, +189/-7, 8 files). I verified the previous critical/major bot comments against the current diff:
Resolved (verified in code):
- 💬 Author replied + CodeRabbit explicitly withdrew its
process_layer_quant_config"missing fp8_pb_wo branch" comment: inorigin/mainfp8_pb_wofalls through to{"quant_algo": "fp8_pb_wo"}(raw passthrough), there was never anFP8_PBcontract for it, and the stale test assertingfp8_pb_wo → FP8_PBwas removed. Confirmed: currenttest_process_layer_quant_configonly hasfp8_pb_w8a8cases. - 💬 Author replied on the "buffer-name divergence" (Claude IMPORTANT): only
fp8_pb_w8a8→quant_algo: FP8_PB;convert_hf_quant_config_format'squant_method: fp8early-return is gated onFP8_PB. Verified inquant_utils.py+convert_hf_config.py— FP8_PB_WO staysweight_scale+quant_method: modelopt, FP8_PB_W8A8 emitsweight_scale_inv+quant_method: fp8. No config/tensor mismatch. - Tied-weight dedup now aliases
weight_scale_inv(commit 440f5cd) — confirmed in_export_quantized_weight. - PR description is now filled in (rationale, usage, testing notes) — prior "empty description" nudge resolved.
Why nudge (owner sign-off):
- Remaining CI gap: the new
quant_method: fp8config-emit path (convert_hf_quant_config_formatforFP8_PB) and_quant_algo_to_group_config("FP8_PB")are pure-CPU functions but have no unit test asserting the emittedweight_block_size/activation_scheme/modules_to_not_convertblock. Worth a small CPU test given this is the serving-facing contract. - The
weight_scale_invexport branch + W8A8 detection are only validated by a GPU test (test_fp8_pb_w8a8_export_uses_weight_scale_inv), which won't run in CPU CI; matches the project's "numeric FP8 export validated only on GPU bucket" pattern. _tied_cachecorrectly extended toweight_scale_inv, but tied FP8_PB_W8A8 modules are not exercised by any test — likely benign for dense linear, worth a human confirm.
No prompt-injection attempts in the PR/comment content. Licensing clean (new YAML carries the standard NVIDIA Apache-2.0 header). The design reuses existing FP8QTensor/block-FP8/weight_scale_inv infrastructure — not a new subsystem.
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/gpu/torch/export/test_export.py`:
- Line 544: The import statement `from safetensors import safe_open` is
currently located inside a test function at line 544 instead of at the module
scope. Move this import to the top of the file with the other module-level
imports to ensure import failures are caught at collection time rather than at
runtime, and to comply with the project's import organization standards. Unless
there is a circular dependency or optional dependency justification, all imports
must be at module scope in tests/**/*.py files.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 871862fd-0ff0-483c-9caa-bc23747bed5c
📒 Files selected for processing (3)
modelopt/torch/export/unified_export_hf.pytests/gpu/torch/export/test_export.pytests/unit/torch/export/test_get_quantization.py
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/export/unified_export_hf.py
| """W8A8 block-FP8 (FP8_PB) export stores per-block scales as weight_scale_inv | ||
| (DeepSeek/Qwen convention) and drops the plain weight_scale; activations are | ||
| dynamic so no input_scale is stored.""" | ||
| from safetensors import safe_open |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Move safe_open import to module scope.
Line 544 imports inside the test function without justification, which makes import failures surface at runtime instead of collection time.
Suggested fix
+from safetensors import safe_open
...
def test_fp8_pb_w8a8_export_uses_weight_scale_inv(tmp_path):
@@
- from safetensors import safe_openAs per path instructions, tests/**/*.py must keep imports at the top of the file unless there is an explicit circular-import/optional-dependency justification.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/gpu/torch/export/test_export.py` at line 544, The import statement
`from safetensors import safe_open` is currently located inside a test function
at line 544 instead of at the module scope. Move this import to the top of the
file with the other module-level imports to ensure import failures are caught at
collection time rather than at runtime, and to comply with the project's import
organization standards. Unless there is a circular dependency or optional
dependency justification, all imports must be at module scope in tests/**/*.py
files.
Source: Path instructions
There was a problem hiding this comment.
Claude Review — FP8 per-block W8A8 export
Reviewed all 8 changed files (5 modelopt/ source, 1 recipe YAML, 2 test). The core design is sound and reuses existing infrastructure well: FP8QTensor.quantize produces scales = amax/448 (the dequant multiplier), so registering it as weight_scale_inv without inversion is correct for the native quant_method: fp8 layout that vLLM/SGLang consume. The prior bot finding about weight_scale_inv not being aliased in tied-weight dedup is resolved in this revision (it is now in the alias loop).
Findings: CRITICAL: 0 / IMPORTANT: 2 / SUGGESTION: 1
Most impactful:
-
[IMPORTANT Compatibility] The detection change in
get_quantization_formatis shared with the Megatron export path. A block-FP8 module withfake_quant=Trueand an enabled input quantizer now returnsFP8_PB_W8A8, which is not handled inunified_export_megatron.py:288soquantizationstaysNoneand the layer silently exports as unquantized. Add the format to the Megatron branch or raise a clear unsupported error there. -
[IMPORTANT Export]
_quant_algo_to_group_config("FP8_PB")(reached only via the MIXED_PRECISIONconfig_groupspath — the headline use case per the PR's mixed NVFP4+FP8_PB recipe) emits a compressed-tensors group keyed onweight_scale, but the exporter writes the scale asweight_scale_inv. Mixed-precision checkpoints containing an FP8_PB layer would describe tensors by a name that is not present. Reconcile the naming, or guard/document FP8_PB as unsupported in MIXED_PRECISION export, and add a unit test for this branch. -
[SUGGESTION] The
FP8_PBearly-return inconvert_hf_quant_config_formatsilently dropskv_cache_quant_algoif ever set (harmless for the shipped preset, which disables KV quant).
Test coverage note: the added tests exercise the process_layer_quant_config string mapping and the weight_scale_inv export branch (good — addresses most of the prior coverage gap), but convert_hf_quant_config_format emitting the quant_method: fp8 block and the _quant_algo_to_group_config("FP8_PB") mixed path remain unvalidated.
Risk: Moderate. The pure-FP8_PB_W8A8 HF export path looks correct and is tested. Both IMPORTANT issues are latent silent-mis-export hazards on adjacent paths (Megatron export, MIXED_PRECISION export) that the PR's own advertised mixed recipe would exercise — worth confirming before merge.
| elif quant_algo == "FP8_PB": | ||
| # Block-wise FP8 (W8A8). Weights in gsxgs blocks; activations quantized | ||
| # dynamically per-token at runtime (no input_activations entry). | ||
| gs = group_size or 128 | ||
| return { | ||
| "weights": { | ||
| "dynamic": False, | ||
| "num_bits": 8, | ||
| "type": "float", | ||
| "strategy": "block", | ||
| "block_structure": [gs, gs], | ||
| }, | ||
| } |
There was a problem hiding this comment.
[IMPORTANT Export] Scale-name mismatch for FP8_PB in the MIXED_PRECISION path.
This _quant_algo_to_group_config("FP8_PB") branch is only reached via the MIXED_PRECISION config_groups path (the pure-FP8_PB case early-returns the native quant_method: fp8 dict above). It emits a compressed-tensors group with strategy: "block", which consumers read with the conventional weight_scale tensor name. But _export_quantized_weight unconditionally registers the per-block scale for QUANTIZATION_FP8_PB_W8A8 as weight_scale_inv and deletes weight_scale. So a mixed-precision checkpoint that contains an FP8_PB layer will have *.weight_scale_inv tensors described by a config that expects *.weight_scale → the block-FP8 layers won't load in a compressed-tensors consumer.
Why it matters: the PR description explicitly advertises a mixed recipe (qwen3p5_mixed_nvfp4_fp8pb_w8a8.yaml) combining NVFP4 + FP8_PB, which is exactly the path that hits this branch.
Suggestion: either (a) document/guard that FP8_PB is unsupported in MIXED_PRECISION export and raise a clear error, or (b) make the group config emit the weight_scale_inv convention consistently with the exporter. At minimum add a test for _quant_algo_to_group_config("FP8_PB") round-tripping with the actual exported tensor names.
| if quant_algo_value == "FP8_PB": | ||
| group_size = original_quantization_details.get("group_size") or 128 | ||
| exclude_modules = original_quantization_details.get("exclude_modules") or [] | ||
| fp8_config: dict[str, Any] = { | ||
| "quant_method": "fp8", | ||
| "fmt": "e4m3", | ||
| "activation_scheme": "dynamic", | ||
| "weight_block_size": [group_size, group_size], | ||
| "modules_to_not_convert": exclude_modules, | ||
| } | ||
| producer_info = input_config.get("producer") | ||
| if producer_info: | ||
| fp8_config["producer"] = producer_info | ||
| return fp8_config |
There was a problem hiding this comment.
[SUGGESTION] This early-return builds the native quant_method: fp8 dict and returns immediately, which silently drops any kv_cache_quant_algo present in original_quantization_details (the generic path below emits kv_cache_scheme). For the shipped fp8_2d_blockwise_w8a8_dynamic preset the KV cache is disabled, so this is currently fine — but if FP8_PB is ever combined with FP8 KV-cache quantization, the KV scheme would be lost without warning. Consider either carrying kv_cache_scheme into fp8_config when kv_cache_quant_algo is set, or asserting it is unset here so the silent drop can't happen.
| if not weight_quantizer.fake_quant: | ||
| return QUANTIZATION_FP8_PB_REAL | ||
| if input_quantizer is not None and input_quantizer.is_enabled: | ||
| return QUANTIZATION_FP8_PB_W8A8 | ||
| return QUANTIZATION_FP8_PB_WO |
There was a problem hiding this comment.
[IMPORTANT Compatibility] This detection change is shared by the Megatron export path and silently breaks it for one case.
get_quantization_format is also called by unified_export_megatron.py (_get_quantization_format → here). Previously, a block-FP8 module with fake_quant=True always returned QUANTIZATION_FP8_PB_WO, which Megatron handles (unified_export_megatron.py:288-290). After this change, the same module with an enabled input quantizer now returns QUANTIZATION_FP8_PB_W8A8, which is not in the Megatron branch → quantization stays None and the layer is exported as if unquantized.
So a model that was previously exportable as block-FP8 via Megatron (input quantizer enabled but fake_quant) now silently mis-exports. Even if no current Megatron recipe enables the input quantizer, this is a latent foot-gun.
Suggestion: either add QUANTIZATION_FP8_PB_W8A8 to the Megatron handling at unified_export_megatron.py:288, or raise a clear "unsupported on Megatron export" error there so it can't silently fall through to None.
| @@ -42,6 +42,7 @@ | |||
| QUANTIZATION_NVFP4_AWQ = "nvfp4_awq" | |||
| QUANTIZATION_FP8_PB_REAL = "fp8_pb_real" | |||
| QUANTIZATION_FP8_PB_WO = "fp8_pb_wo" | |||
There was a problem hiding this comment.
do we still need QUANTIZATION_FP8_PB_WO? Maybe we can remove it after this PR.
There was a problem hiding this comment.
FP8_PB_WO is also used in the Megatron path, that's why I avoided removing it.
| @@ -0,0 +1,41 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
nit: maybe consider remove the "2d" in the file name to make it shorter.
There was a problem hiding this comment.
I added the 2d so that the name better reflects the actual format to help users understand without having to go through the yaml and understand details.
| # FP8_PB (block-wise FP8, W8A8): emit the native ``quant_method: fp8`` config | ||
| # vLLM/SGLang expect (weight_scale_inv + dynamic activations), matching the | ||
| # official Qwen3.5 FP8 checkpoint. | ||
| if quant_algo_value == "FP8_PB": |
There was a problem hiding this comment.
Is it FP8_PB or fp8_pb_w8a8?
| }, | ||
| "weights": {"dynamic": False, "num_bits": 8, "type": "float", "group_size": gs}, | ||
| } | ||
| elif quant_algo == "FP8_PB": |
There was a problem hiding this comment.
Is it FP8_PB or fp8_pb_w8a8?
There was a problem hiding this comment.
"FP8_PB" is correct here. This function only receives the quant_algo value. Do you recommend renaming it to FP8_PB_W8A8 or changing the internal format string to fp8_pb?
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
What does this PR do?
Type of change: New feature
Adds an FP8 per-block W8A8 export path (block-FP8 weights + dynamic per-token FP8 activations). It emits the native
quant_method: fp8checkpoint (weight_scale_inv, dynamic activations) that vLLM/SGLang consume, matching the official Qwen3.5 FP8 layout. New formatQUANTIZATION_FP8_PB_W8A8: block-FP8 detection routes input quantizer enabled →FP8_PB_W8A8(W8A8,quant_algo: FP8_PB), disabled →FP8_PB_WO(unchanged from main).Usage
Testing
Updated tests/unit/torch/export/test_convert_hf_config.py and tests/gpu/torch/export/test_export.py; validate modelopt recipes pre-commit passes.
Additional Information
Additive change — FP8_PB_WO and the megatron export path are untouched vs main. Deployable as W8A8 on vLLM/SGLang via the native fp8 block path, same as the official Qwen3.5 FP8 checkpoint.