Support FP8 per block (weight + dynamic per token activation) export by sugunav14 · Pull Request #1807 · NVIDIA/Model-Optimizer

sugunav14 · 2026-06-23T20:10:43Z

What does this PR do?

Type of change: New feature

Adds an FP8 per-block W8A8 export path (block-FP8 weights + dynamic per-token FP8 activations). It emits the native quant_method: fp8 checkpoint (weight_scale_inv, dynamic activations) that vLLM/SGLang consume, matching the official Qwen3.5 FP8 layout. New format QUANTIZATION_FP8_PB_W8A8: block-FP8 detection routes input quantizer enabled → FP8_PB_W8A8 (W8A8, quant_algo: FP8_PB), disabled → FP8_PB_WO (unchanged from main).

Usage

# qformat preset:
  python hf_ptq.py --qformat fp8_2d_blockwise_w8a8_dynamic --export_path <out> ...
# or a full mixed recipe:
  python hf_ptq.py --recipe examples/llm_ptq/qwen3p5_mixed_nvfp4_fp8pb_w8a8.yaml ...

Testing

Updated tests/unit/torch/export/test_convert_hf_config.py and tests/gpu/torch/export/test_export.py; validate modelopt recipes pre-commit passes.

Is this change backward compatible?: ✅
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: ✅
Did you update Changelog?: ❌
Did you get Claude approval on this PR?: ❌

Additional Information

Additive change — FP8_PB_WO and the megatron export path are untouched vs main. Deployable as W8A8 on vLLM/SGLang via the native fp8 block path, same as the official Qwen3.5 FP8 checkpoint.


<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added FP8-PB W8A8 block-wise quantization support across export paths, including a new `fp8_pb_w8a8` export format and recognition of `QUANTIZATION_FP8_PB_W8A8`.
  * Introduced a new PTQ preset: `fp8_2d_blockwise_w8a8_dynamic` (configurable 2D block sizes).
* **Bug Fixes**
  * HF export now writes per-block scaling as `weight_scale_inv` (2-D), and avoids stale `weight_scale` / `input_scale`, including tied-weight cases.
* **Tests**
  * Expanded unit and GPU coverage to validate format mapping and exported tensor keys/shapes.
* **Notes**
  * Megatron export explicitly rejects `QUANTIZATION_FP8_PB_W8A8` and directs to the HF export path.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

copy-pr-bot · 2026-06-23T20:10:46Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-23T20:10:51Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ce17ed53-a12c-4004-87c8-10af8176dc6c

📥 Commits

Reviewing files that changed from the base of the PR and between b4377c4 and 8251f48.

📒 Files selected for processing (2)

modelopt/torch/export/convert_hf_config.py
modelopt/torch/export/unified_export_megatron.py

🚧 Files skipped from review as they are similar to previous changes (1)

modelopt/torch/export/convert_hf_config.py

📝 Walkthrough

Walkthrough

Adds end-to-end support for FP8_PB_W8A8 quantization, including format detection, HF config conversion, export buffer naming, a PTQ preset, Megatron export rejection, and validation tests.

Changes

FP8 2D Blockwise W8A8 Quantization Support

Layer / File(s)	Summary
New constant, attr names, and PTQ preset `modelopt/torch/export/model_config.py`, `modelopt/torch/quantization/utils/core_utils.py`, `modelopt_recipes/configs/ptq/presets/model/fp8_2d_blockwise_w8a8_dynamic.yaml`	Defines `QUANTIZATION_FP8_PB_W8A8 = "fp8_pb_w8a8"`, adds `weight_scale_inv` to `QuantizerAttrNames` and `quantizer_attr_names()`, and introduces the `fp8_2d_blockwise_w8a8_dynamic` PTQ preset with 2D blockwise FP8 weight and dynamic FP8 activation quantization.
Format detection and weight packing `modelopt/torch/export/quant_utils.py`, `tests/unit/torch/export/test_get_quantization.py`	Adds `QUANTIZATION_FP8_PB_W8A8` handling in FP8 format selection, maps `fp8_pb_w8a8` to `quant_algo: "FP8_PB"` with `group_size`, extends `to_quantized_weight` packing to the new format, and updates unit tests with a new config fixture and expected format case.
HF quant config conversion for FP8_PB `modelopt/torch/export/convert_hf_config.py`	Adds an `FP8_PB` group-config branch and an early-return path that emits a native `quant_method: "fp8"` config with `fmt: "e4m3"`, `activation_scheme: "dynamic"`, `weight_block_size`, and `modules_to_not_convert`.
Export scale buffer registration and tied-weight aliasing `modelopt/torch/export/unified_export_hf.py`, `tests/gpu/torch/export/test_export.py`	Registers FP8 W8A8 scales as `weight_scale_inv`, removes stale `weight_scale` for that format, aliases `weight_scale_inv` for tied weights, and adds GPU coverage that checks exported tensors contain `weight_scale_inv` and exclude `weight_scale` and `input_scale`.
Megatron export rejection `modelopt/torch/export/unified_export_megatron.py`	Imports `QUANTIZATION_FP8_PB_W8A8` and raises `NotImplementedError` in Megatron export paths when that format is resolved.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1707: Both PRs touch the HF unified export tied-weight deduplication path in modelopt/torch/export/unified_export_hf.py, including buffer aliasing for scale-related tensors.

Suggested reviewers

Edwardf0t1
kevalmorabia97
ynankani
vishalpandya1990

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 78.57% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding FP8 per-block export with dynamic per-token activations.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	No changed modelopt/examples Python code hardcodes unsafe load flags, trust_remote_code, eval/exec, nosec, or new non-permissive deps.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch svelury/fp8pb-mxfp8-export

_{Comment @coderabbitai help to get the list of available commands.}

github-actions · 2026-06-23T20:15:15Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1807/
Built to branch `gh-pages` at 2026-06-25 03:54 UTC. Preview will be ready when the GitHub Pages deployment is complete.

codecov · 2026-06-23T20:21:19Z

Codecov Report

❌ Patch coverage is 36.66667% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.28%. Comparing base (37dbbda) to head (8251f48).
⚠️ Report is 10 commits behind head on main.

Files with missing lines	Patch %	Lines
modelopt/torch/export/convert_hf_config.py	7.69%	12 Missing ⚠️
modelopt/torch/export/unified_export_hf.py	28.57%	5 Missing ⚠️
modelopt/torch/export/quant_utils.py	71.42%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1807      +/-   ##
==========================================
+ Coverage   62.89%   67.28%   +4.38%     
==========================================
  Files         511      514       +3     
  Lines       56683    59269    +2586     
==========================================
+ Hits        35651    39877    +4226     
+ Misses      21032    19392    -1640

Flag	Coverage Δ
examples	`32.66% <20.00%> (-5.36%)`	⬇️
gpu	`31.75% <16.66%> (+11.18%)`	⬆️
regression	`14.71% <3.33%> (+0.04%)`	⬆️
unit	`54.59% <30.00%> (-0.07%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

sugunav14 · 2026-06-23T23:11:32Z

/claude review

cjluo-nv

Bot review — DM the bot to share feedback.

Additive feature adding FP8 per-block W8A8 export (fp8_pb_w8a8): new format constant, detection in _get_quantization_from_layer, weight_scale_inv registration on export, FP8_PB handling in both process_layer_quant_config and convert_hf_config, and a new preset YAML. Design is sound and reuses existing infrastructure (FP8QTensor, existing block-FP8 paths, existing weight_scale_inv naming already consumed by _QuantFP8Linear); not a new subsystem. Licensing clean — the new YAML carries the standard NVIDIA Apache-2.0 header. No injection attempts in the PR content. Verified the "do NOT invert weight_scale" reasoning: FP8QTensor.quantize produces scales = amax/448 and dequant multiplies by it, so storing it as weight_scale_inv is correct for vLLM/SGLang.

Reasons for nudge (owner sign-off recommended):

Test coverage gaps on the core new logic. The added tests only exercise the process_layer_quant_config string mapping for fp8_pb_w8a8/fp8_pb_wo. There are no unit tests for: (a) _get_quantization_from_layer actually returning FP8_PB_W8A8 (fake_quant weight + enabled input_quantizer), (b) convert_hf_quant_config_format emitting the quant_method: fp8 / weight_block_size block for FP8_PB, (c) _quant_algo_to_group_config("FP8_PB"), or (d) the new weight_scale_inv registration branch in _export_quantized_weight. These are the real new code paths and are unvalidated in CI.
Tied-weight dedup does not alias weight_scale_inv. In _export_quantized_weight, the _tied_cache alias step re-points weight_scale, weight_scale_2, and input_scale to a prior module, but not weight_scale_inv. For an FP8_PB_W8A8 module participating in a tied group, the per-block scale would not be aliased. Likely benign for typical dense-linear usage but worth confirming.
Empty PR description. All template fields are left as placeholders (?, ✅ / ❌ / N/A), so there's no stated rationale, testing notes, or backward-compatibility confirmation. Worth filling in given the export/serving impact.

Minor: the "dynamic per-token" wording in the code comments / preset is slightly imprecise — the input quantizer is block_sizes: {-1: 128, type: dynamic} (1×128 dynamic blocks, DeepSeek-style), not strictly per-token; the emitted activation_scheme: dynamic + weight_block_size is the right thing for the target runtimes regardless.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/export/quant_utils.py`:
- Around line 764-769: The elif condition checking for "fp8_pb_w8a8" needs to be
extended to also handle "fp8_pb_wo" so that the existing "fp8_pb_wo"
configuration values are properly mapped to the FP8_PB layer config with
group_size, instead of falling through to the default case which would create an
incompatible config. Modify the condition to accept both "fp8_pb_w8a8" and
"fp8_pb_wo" values so they both receive the same {"quant_algo": "FP8_PB",
"group_size": block_size_value} mapping.

In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 745-755: The tied-weight deduplication logic for FP8_PB_W8A8
quantization is incomplete. The code at line 751 registers weight_scale_inv for
QUANTIZATION_FP8_PB_W8A8 format, but the tied-alias path elsewhere in the code
only handles weight_scale, weight_scale_2, and input_scale attributes. Locate
the tied-alias deduplication logic that re-aliases these buffers and extend it
to also re-alias the weight_scale_inv buffer when the quantization format is
QUANTIZATION_FP8_PB_W8A8, ensuring the inverse scale buffers are properly
aliased for pointer-based deduplication to work correctly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9d2ec342-75fc-41db-8973-3631633bc195

📥 Commits

Reviewing files that changed from the base of the PR and between 1766d55 and 20a1f5f.

📒 Files selected for processing (7)

modelopt/torch/export/convert_hf_config.py
modelopt/torch/export/model_config.py
modelopt/torch/export/quant_utils.py
modelopt/torch/export/unified_export_hf.py
modelopt/torch/quantization/utils/core_utils.py
modelopt_recipes/configs/ptq/presets/model/fp8_2d_blockwise_w8a8_dynamic.yaml
tests/gpu/torch/export/test_export.py

claude

Claude Review — FP8 per-block (W8A8 dynamic) export

Reviewed all 7 changed files (5 modelopt/ source, 1 recipe YAML, 1 test). Scope is small and focused: a new fp8_pb_w8a8 export format mapping to a native quant_method: fp8 HF config with weight_scale_inv + dynamic per-token activations.

Findings: CRITICAL: 1, IMPORTANT: 1, SUGGESTION: 0

[CRITICAL Algorithm] Missing fp8_pb_wo branch in process_layer_quant_config (quant_utils.py). The new parametrized test feeds fp8_pb_wo and asserts quant_algo: FP8_PB with group_size: 128, but only a fp8_pb_w8a8 branch was added. fp8_pb_wo falls through to the generic else giving {quant_algo: fp8_pb_wo} (no group_size). The new test case will fail, and any real weight-only block-FP8 model emits an unrecognized quant_algo string. Fix: handle both fp8_pb_wo and fp8_pb_w8a8 in the same branch.

[IMPORTANT Export] Buffer-name divergence between the two block-FP8 variants (unified_export_hf.py). The weight_scale_inv rename is gated only on FP8_PB_W8A8, yet both fp8_pb_wo and fp8_pb_w8a8 now collapse to the same quant_algo: FP8_PB and the same quant_method: fp8 HF config. A weight-only export would register its scale as weight_scale while the config advertises native fp8 (which expects weight_scale_inv + activation_scheme: dynamic), so the checkpoint and config disagree. Please reconcile: either route weight-only through the same naming, or keep fp8_pb_wo out of the FP8_PB config mapping.

Notes (non-blocking): QuantizerAttrNames gains weight_scale_inv; the only construction site uses keyword args, so the namedtuple change is safe. The recipe preset (1D per-token dynamic input quantizer + 2D 128x128 weight blocks) correctly expresses W8A8 dynamic.

Risk: Moderate. The W8A8 export math looks correct, but the missing fp8_pb_wo branch breaks the included test and the existing weight-only path, and the two variants are inconsistent in tensor naming vs. emitted config. Both should be resolved before merge.

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

sugunav14 · 2026-06-23T23:46:09Z

/claude review

cjluo-nv

Bot review — DM the bot to share feedback.

Re-review of PR #1807 (FP8 per-block W8A8 export, +189/-7, 8 files). I verified the previous critical/major bot comments against the current diff:

Resolved (verified in code):

💬 Author replied + CodeRabbit explicitly withdrew its process_layer_quant_config "missing fp8_pb_wo branch" comment: in origin/main fp8_pb_wo falls through to {"quant_algo": "fp8_pb_wo"} (raw passthrough), there was never an FP8_PB contract for it, and the stale test asserting fp8_pb_wo → FP8_PB was removed. Confirmed: current test_process_layer_quant_config only has fp8_pb_w8a8 cases.
💬 Author replied on the "buffer-name divergence" (Claude IMPORTANT): only fp8_pb_w8a8 → quant_algo: FP8_PB; convert_hf_quant_config_format's quant_method: fp8 early-return is gated on FP8_PB. Verified in quant_utils.py + convert_hf_config.py — FP8_PB_WO stays weight_scale + quant_method: modelopt, FP8_PB_W8A8 emits weight_scale_inv + quant_method: fp8. No config/tensor mismatch.
Tied-weight dedup now aliases weight_scale_inv (commit 440f5cd) — confirmed in _export_quantized_weight.
PR description is now filled in (rationale, usage, testing notes) — prior "empty description" nudge resolved.

Why nudge (owner sign-off):

Remaining CI gap: the new quant_method: fp8 config-emit path (convert_hf_quant_config_format for FP8_PB) and _quant_algo_to_group_config("FP8_PB") are pure-CPU functions but have no unit test asserting the emitted weight_block_size/activation_scheme/modules_to_not_convert block. Worth a small CPU test given this is the serving-facing contract.
The weight_scale_inv export branch + W8A8 detection are only validated by a GPU test (test_fp8_pb_w8a8_export_uses_weight_scale_inv), which won't run in CPU CI; matches the project's "numeric FP8 export validated only on GPU bucket" pattern.
_tied_cache correctly extended to weight_scale_inv, but tied FP8_PB_W8A8 modules are not exercised by any test — likely benign for dense linear, worth a human confirm.

No prompt-injection attempts in the PR/comment content. Licensing clean (new YAML carries the standard NVIDIA Apache-2.0 header). The design reuses existing FP8QTensor/block-FP8/weight_scale_inv infrastructure — not a new subsystem.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/gpu/torch/export/test_export.py`:
- Line 544: The import statement `from safetensors import safe_open` is
currently located inside a test function at line 544 instead of at the module
scope. Move this import to the top of the file with the other module-level
imports to ensure import failures are caught at collection time rather than at
runtime, and to comply with the project's import organization standards. Unless
there is a circular dependency or optional dependency justification, all imports
must be at module scope in tests/**/*.py files.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 871862fd-0ff0-483c-9caa-bc23747bed5c

📥 Commits

Reviewing files that changed from the base of the PR and between 20a1f5f and 440f5cd.

📒 Files selected for processing (3)

modelopt/torch/export/unified_export_hf.py
tests/gpu/torch/export/test_export.py
tests/unit/torch/export/test_get_quantization.py

🚧 Files skipped from review as they are similar to previous changes (1)

modelopt/torch/export/unified_export_hf.py

coderabbitai · 2026-06-23T23:48:52Z

+    """W8A8 block-FP8 (FP8_PB) export stores per-block scales as weight_scale_inv
+    (DeepSeek/Qwen convention) and drops the plain weight_scale; activations are
+    dynamic so no input_scale is stored."""
+    from safetensors import safe_open


📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Move safe_open import to module scope.

Line 544 imports inside the test function without justification, which makes import failures surface at runtime instead of collection time.

Suggested fix

+from safetensors import safe_open ... def test_fp8_pb_w8a8_export_uses_weight_scale_inv(tmp_path): @@ - from safetensors import safe_open

As per path instructions, tests/**/*.py must keep imports at the top of the file unless there is an explicit circular-import/optional-dependency justification.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/gpu/torch/export/test_export.py` at line 544, The import statement `from safetensors import safe_open` is currently located inside a test function at line 544 instead of at the module scope. Move this import to the top of the file with the other module-level imports to ensure import failures are caught at collection time rather than at runtime, and to comply with the project's import organization standards. Unless there is a circular dependency or optional dependency justification, all imports must be at module scope in tests/**/*.py files.

Source: Path instructions

claude

Claude Review — FP8 per-block W8A8 export

Reviewed all 8 changed files (5 modelopt/ source, 1 recipe YAML, 2 test). The core design is sound and reuses existing infrastructure well: FP8QTensor.quantize produces scales = amax/448 (the dequant multiplier), so registering it as weight_scale_inv without inversion is correct for the native quant_method: fp8 layout that vLLM/SGLang consume. The prior bot finding about weight_scale_inv not being aliased in tied-weight dedup is resolved in this revision (it is now in the alias loop).

Findings: CRITICAL: 0 / IMPORTANT: 2 / SUGGESTION: 1

Most impactful:

[IMPORTANT Compatibility] The detection change in get_quantization_format is shared with the Megatron export path. A block-FP8 module with fake_quant=True and an enabled input quantizer now returns FP8_PB_W8A8, which is not handled in unified_export_megatron.py:288 so quantization stays None and the layer silently exports as unquantized. Add the format to the Megatron branch or raise a clear unsupported error there.
[IMPORTANT Export] _quant_algo_to_group_config("FP8_PB") (reached only via the MIXED_PRECISION config_groups path — the headline use case per the PR's mixed NVFP4+FP8_PB recipe) emits a compressed-tensors group keyed on weight_scale, but the exporter writes the scale as weight_scale_inv. Mixed-precision checkpoints containing an FP8_PB layer would describe tensors by a name that is not present. Reconcile the naming, or guard/document FP8_PB as unsupported in MIXED_PRECISION export, and add a unit test for this branch.
[SUGGESTION] The FP8_PB early-return in convert_hf_quant_config_format silently drops kv_cache_quant_algo if ever set (harmless for the shipped preset, which disables KV quant).

Test coverage note: the added tests exercise the process_layer_quant_config string mapping and the weight_scale_inv export branch (good — addresses most of the prior coverage gap), but convert_hf_quant_config_format emitting the quant_method: fp8 block and the _quant_algo_to_group_config("FP8_PB") mixed path remain unvalidated.

Risk: Moderate. The pure-FP8_PB_W8A8 HF export path looks correct and is tested. Both IMPORTANT issues are latent silent-mis-export hazards on adjacent paths (Megatron export, MIXED_PRECISION export) that the PR's own advertised mixed recipe would exercise — worth confirming before merge.

claude · 2026-06-24T00:11:54Z

+    elif quant_algo == "FP8_PB":
+        # Block-wise FP8 (W8A8). Weights in gsxgs blocks; activations quantized
+        # dynamically per-token at runtime (no input_activations entry).
+        gs = group_size or 128
+        return {
+            "weights": {
+                "dynamic": False,
+                "num_bits": 8,
+                "type": "float",
+                "strategy": "block",
+                "block_structure": [gs, gs],
+            },
+        }


[IMPORTANT Export] Scale-name mismatch for FP8_PB in the MIXED_PRECISION path.

This _quant_algo_to_group_config("FP8_PB") branch is only reached via the MIXED_PRECISION config_groups path (the pure-FP8_PB case early-returns the native quant_method: fp8 dict above). It emits a compressed-tensors group with strategy: "block", which consumers read with the conventional weight_scale tensor name. But _export_quantized_weight unconditionally registers the per-block scale for QUANTIZATION_FP8_PB_W8A8 as weight_scale_inv and deletes weight_scale. So a mixed-precision checkpoint that contains an FP8_PB layer will have *.weight_scale_inv tensors described by a config that expects *.weight_scale → the block-FP8 layers won't load in a compressed-tensors consumer.

Why it matters: the PR description explicitly advertises a mixed recipe (qwen3p5_mixed_nvfp4_fp8pb_w8a8.yaml) combining NVFP4 + FP8_PB, which is exactly the path that hits this branch.

Suggestion: either (a) document/guard that FP8_PB is unsupported in MIXED_PRECISION export and raise a clear error, or (b) make the group config emit the weight_scale_inv convention consistently with the exporter. At minimum add a test for _quant_algo_to_group_config("FP8_PB") round-tripping with the actual exported tensor names.

claude · 2026-06-24T00:11:55Z

+    if quant_algo_value == "FP8_PB":
+        group_size = original_quantization_details.get("group_size") or 128
+        exclude_modules = original_quantization_details.get("exclude_modules") or []
+        fp8_config: dict[str, Any] = {
+            "quant_method": "fp8",
+            "fmt": "e4m3",
+            "activation_scheme": "dynamic",
+            "weight_block_size": [group_size, group_size],
+            "modules_to_not_convert": exclude_modules,
+        }
+        producer_info = input_config.get("producer")
+        if producer_info:
+            fp8_config["producer"] = producer_info
+        return fp8_config


[SUGGESTION] This early-return builds the native quant_method: fp8 dict and returns immediately, which silently drops any kv_cache_quant_algo present in original_quantization_details (the generic path below emits kv_cache_scheme). For the shipped fp8_2d_blockwise_w8a8_dynamic preset the KV cache is disabled, so this is currently fine — but if FP8_PB is ever combined with FP8 KV-cache quantization, the KV scheme would be lost without warning. Consider either carrying kv_cache_scheme into fp8_config when kv_cache_quant_algo is set, or asserting it is unset here so the silent drop can't happen.

claude · 2026-06-24T00:11:56Z

+                if not weight_quantizer.fake_quant:
                    return QUANTIZATION_FP8_PB_REAL
+                if input_quantizer is not None and input_quantizer.is_enabled:
+                    return QUANTIZATION_FP8_PB_W8A8
+                return QUANTIZATION_FP8_PB_WO


[IMPORTANT Compatibility] This detection change is shared by the Megatron export path and silently breaks it for one case.

get_quantization_format is also called by unified_export_megatron.py (_get_quantization_format → here). Previously, a block-FP8 module with fake_quant=True always returned QUANTIZATION_FP8_PB_WO, which Megatron handles (unified_export_megatron.py:288-290). After this change, the same module with an enabled input quantizer now returns QUANTIZATION_FP8_PB_W8A8, which is not in the Megatron branch → quantization stays None and the layer is exported as if unquantized.

So a model that was previously exportable as block-FP8 via Megatron (input quantizer enabled but fake_quant) now silently mis-exports. Even if no current Megatron recipe enables the input quantizer, this is a latent foot-gun.

Suggestion: either add QUANTIZATION_FP8_PB_W8A8 to the Megatron handling at unified_export_megatron.py:288, or raise a clear "unsupported on Megatron export" error there so it can't silently fall through to None.

cjluo-nv · 2026-06-24T17:19:23Z

@@ -42,6 +42,7 @@
 QUANTIZATION_NVFP4_AWQ = "nvfp4_awq"
 QUANTIZATION_FP8_PB_REAL = "fp8_pb_real"
 QUANTIZATION_FP8_PB_WO = "fp8_pb_wo"


do we still need QUANTIZATION_FP8_PB_WO? Maybe we can remove it after this PR.

FP8_PB_WO is also used in the Megatron path, that's why I avoided removing it.

cjluo-nv · 2026-06-24T17:20:27Z

@@ -0,0 +1,41 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


nit: maybe consider remove the "2d" in the file name to make it shorter.

I added the 2d so that the name better reflects the actual format to help users understand without having to go through the yaml and understand details.

cjluo-nv · 2026-06-24T17:23:52Z

+    # FP8_PB (block-wise FP8, W8A8): emit the native ``quant_method: fp8`` config
+    # vLLM/SGLang expect (weight_scale_inv + dynamic activations), matching the
+    # official Qwen3.5 FP8 checkpoint.
+    if quant_algo_value == "FP8_PB":


Is it FP8_PB or fp8_pb_w8a8?

cjluo-nv · 2026-06-24T17:23:58Z

            },
            "weights": {"dynamic": False, "num_bits": 8, "type": "float", "group_size": gs},
        }
+    elif quant_algo == "FP8_PB":


Is it FP8_PB or fp8_pb_w8a8?

"FP8_PB" is correct here. This function only receives the quant_algo value. Do you recommend renaming it to FP8_PB_W8A8 or changing the internal format string to fp8_pb?

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

sugunav14 changed the title ~~Support FP8 per block (weight + dynamic per token activation) and MXFP8 (weight + activation) export~~ Support FP8 per block (weight + dynamic per token activation) export Jun 23, 2026

sugunav14 added 4 commits June 23, 2026 21:45

update

6211da7

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

refactor to support both weight-only and weight + sctivation

dc5d3de

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

clean up

bed3730

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

update

aa39f30

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

sugunav14 force-pushed the svelury/fp8pb-mxfp8-export branch from 9936df2 to aa39f30 Compare June 23, 2026 21:53

sugunav14 added 4 commits June 23, 2026 22:26

update

cda83b6

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

clean up

8addf5a

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

fixed failed tests

c03ba3f

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

minor

20a1f5f

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

sugunav14 marked this pull request as ready for review June 23, 2026 23:10

sugunav14 requested review from a team as code owners June 23, 2026 23:11

sugunav14 requested review from shengliangxu and sychen52 June 23, 2026 23:11

sugunav14 requested a review from cjluo-nv June 23, 2026 23:11

cjluo-nv reviewed Jun 23, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread modelopt/torch/export/quant_utils.py

Comment thread modelopt/torch/export/unified_export_hf.py

claude Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread modelopt/torch/export/quant_utils.py

claude Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread modelopt/torch/export/unified_export_hf.py

PR reviews, modelopt bot, claude

440f5cd

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

cjluo-nv reviewed Jun 23, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

claude Bot reviewed Jun 24, 2026

View reviewed changes

cjluo-nv reviewed Jun 24, 2026

View reviewed changes

sugunav14 added 3 commits June 24, 2026 21:17

fixed 4D scale export bug

b026888

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

trimmed comments

b4377c4

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

claude review

8251f48

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

		@@ -0,0 +1,41 @@
		# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Uh oh!

Conversation

sugunav14 commented Jun 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Uh oh!

copy-pr-bot Bot commented Jun 23, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Built to branch gh-pages at 2026-06-25 03:54 UTC. Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

codecov Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sugunav14 commented Jun 23, 2026

Uh oh!

cjluo-nv left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

sugunav14 commented Jun 23, 2026

Uh oh!

cjluo-nv left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

cjluo-nv Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

sugunav14 Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

cjluo-nv Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

sugunav14 Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

cjluo-nv Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

sugunav14 commented Jun 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

github-actions Bot commented Jun 23, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-06-25 03:54 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

codecov Bot commented Jun 23, 2026 •

edited

Loading