Skip to content

Support FP8 per block (weight + dynamic per token activation) export#1807

Open
sugunav14 wants to merge 12 commits into
mainfrom
svelury/fp8pb-mxfp8-export
Open

Support FP8 per block (weight + dynamic per token activation) export#1807
sugunav14 wants to merge 12 commits into
mainfrom
svelury/fp8pb-mxfp8-export

Conversation

@sugunav14

@sugunav14 sugunav14 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Type of change: New feature

Adds an FP8 per-block W8A8 export path (block-FP8 weights + dynamic per-token FP8 activations). It emits the native quant_method: fp8 checkpoint (weight_scale_inv, dynamic activations) that vLLM/SGLang consume, matching the official Qwen3.5 FP8 layout. New format QUANTIZATION_FP8_PB_W8A8: block-FP8 detection routes input quantizer enabled → FP8_PB_W8A8 (W8A8, quant_algo: FP8_PB), disabled → FP8_PB_WO (unchanged from main).

Usage

# qformat preset:
  python hf_ptq.py --qformat fp8_2d_blockwise_w8a8_dynamic --export_path <out> ...
# or a full mixed recipe:
  python hf_ptq.py --recipe examples/llm_ptq/qwen3p5_mixed_nvfp4_fp8pb_w8a8.yaml ...

Testing

Updated tests/unit/torch/export/test_convert_hf_config.py and tests/gpu/torch/export/test_export.py; validate modelopt recipes pre-commit passes.

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: ❌
  • Did you get Claude approval on this PR?: ❌

Additional Information

Additive change — FP8_PB_WO and the megatron export path are untouched vs main. Deployable as W8A8 on vLLM/SGLang via the native fp8 block path, same as the official Qwen3.5 FP8 checkpoint.


<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added FP8-PB W8A8 block-wise quantization support across export paths, including a new `fp8_pb_w8a8` export format and recognition of `QUANTIZATION_FP8_PB_W8A8`.
  * Introduced a new PTQ preset: `fp8_2d_blockwise_w8a8_dynamic` (configurable 2D block sizes).
* **Bug Fixes**
  * HF export now writes per-block scaling as `weight_scale_inv` (2-D), and avoids stale `weight_scale` / `input_scale`, including tied-weight cases.
* **Tests**
  * Expanded unit and GPU coverage to validate format mapping and exported tensor keys/shapes.
* **Notes**
  * Megatron export explicitly rejects `QUANTIZATION_FP8_PB_W8A8` and directs to the HF export path.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

@copy-pr-bot

copy-pr-bot Bot commented Jun 23, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ce17ed53-a12c-4004-87c8-10af8176dc6c

📥 Commits

Reviewing files that changed from the base of the PR and between b4377c4 and 8251f48.

📒 Files selected for processing (2)
  • modelopt/torch/export/convert_hf_config.py
  • modelopt/torch/export/unified_export_megatron.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/export/convert_hf_config.py

📝 Walkthrough

Walkthrough

Adds end-to-end support for FP8_PB_W8A8 quantization, including format detection, HF config conversion, export buffer naming, a PTQ preset, Megatron export rejection, and validation tests.

Changes

FP8 2D Blockwise W8A8 Quantization Support

Layer / File(s) Summary
New constant, attr names, and PTQ preset
modelopt/torch/export/model_config.py, modelopt/torch/quantization/utils/core_utils.py, modelopt_recipes/configs/ptq/presets/model/fp8_2d_blockwise_w8a8_dynamic.yaml
Defines QUANTIZATION_FP8_PB_W8A8 = "fp8_pb_w8a8", adds weight_scale_inv to QuantizerAttrNames and quantizer_attr_names(), and introduces the fp8_2d_blockwise_w8a8_dynamic PTQ preset with 2D blockwise FP8 weight and dynamic FP8 activation quantization.
Format detection and weight packing
modelopt/torch/export/quant_utils.py, tests/unit/torch/export/test_get_quantization.py
Adds QUANTIZATION_FP8_PB_W8A8 handling in FP8 format selection, maps fp8_pb_w8a8 to quant_algo: "FP8_PB" with group_size, extends to_quantized_weight packing to the new format, and updates unit tests with a new config fixture and expected format case.
HF quant config conversion for FP8_PB
modelopt/torch/export/convert_hf_config.py
Adds an FP8_PB group-config branch and an early-return path that emits a native quant_method: "fp8" config with fmt: "e4m3", activation_scheme: "dynamic", weight_block_size, and modules_to_not_convert.
Export scale buffer registration and tied-weight aliasing
modelopt/torch/export/unified_export_hf.py, tests/gpu/torch/export/test_export.py
Registers FP8 W8A8 scales as weight_scale_inv, removes stale weight_scale for that format, aliases weight_scale_inv for tied weights, and adds GPU coverage that checks exported tensors contain weight_scale_inv and exclude weight_scale and input_scale.
Megatron export rejection
modelopt/torch/export/unified_export_megatron.py
Imports QUANTIZATION_FP8_PB_W8A8 and raises NotImplementedError in Megatron export paths when that format is resolved.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1707: Both PRs touch the HF unified export tied-weight deduplication path in modelopt/torch/export/unified_export_hf.py, including buffer aliasing for scale-related tensors.

Suggested reviewers

  • Edwardf0t1
  • kevalmorabia97
  • ynankani
  • vishalpandya1990
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 78.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding FP8 per-block export with dynamic per-token activations.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed No changed modelopt/examples Python code hardcodes unsafe load flags, trust_remote_code, eval/exec, nosec, or new non-permissive deps.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch svelury/fp8pb-mxfp8-export

Comment @coderabbitai help to get the list of available commands.

@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1807/

Built to branch gh-pages at 2026-06-25 03:54 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@sugunav14 sugunav14 changed the title Support FP8 per block (weight + dynamic per token activation) and MXFP8 (weight + activation) export Support FP8 per block (weight + dynamic per token activation) export Jun 23, 2026
@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 36.66667% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.28%. Comparing base (37dbbda) to head (8251f48).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/export/convert_hf_config.py 7.69% 12 Missing ⚠️
modelopt/torch/export/unified_export_hf.py 28.57% 5 Missing ⚠️
modelopt/torch/export/quant_utils.py 71.42% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1807      +/-   ##
==========================================
+ Coverage   62.89%   67.28%   +4.38%     
==========================================
  Files         511      514       +3     
  Lines       56683    59269    +2586     
==========================================
+ Hits        35651    39877    +4226     
+ Misses      21032    19392    -1640     
Flag Coverage Δ
examples 32.66% <20.00%> (-5.36%) ⬇️
gpu 31.75% <16.66%> (+11.18%) ⬆️
regression 14.71% <3.33%> (+0.04%) ⬆️
unit 54.59% <30.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 sugunav14 force-pushed the svelury/fp8pb-mxfp8-export branch from 9936df2 to aa39f30 Compare June 23, 2026 21:53
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 sugunav14 marked this pull request as ready for review June 23, 2026 23:10
@sugunav14 sugunav14 requested review from a team as code owners June 23, 2026 23:11
@sugunav14

Copy link
Copy Markdown
Contributor Author

/claude review

@sugunav14 sugunav14 requested a review from cjluo-nv June 23, 2026 23:11

@cjluo-nv cjluo-nv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Additive feature adding FP8 per-block W8A8 export (fp8_pb_w8a8): new format constant, detection in _get_quantization_from_layer, weight_scale_inv registration on export, FP8_PB handling in both process_layer_quant_config and convert_hf_config, and a new preset YAML. Design is sound and reuses existing infrastructure (FP8QTensor, existing block-FP8 paths, existing weight_scale_inv naming already consumed by _QuantFP8Linear); not a new subsystem. Licensing clean — the new YAML carries the standard NVIDIA Apache-2.0 header. No injection attempts in the PR content. Verified the "do NOT invert weight_scale" reasoning: FP8QTensor.quantize produces scales = amax/448 and dequant multiplies by it, so storing it as weight_scale_inv is correct for vLLM/SGLang.

Reasons for nudge (owner sign-off recommended):

  1. Test coverage gaps on the core new logic. The added tests only exercise the process_layer_quant_config string mapping for fp8_pb_w8a8/fp8_pb_wo. There are no unit tests for: (a) _get_quantization_from_layer actually returning FP8_PB_W8A8 (fake_quant weight + enabled input_quantizer), (b) convert_hf_quant_config_format emitting the quant_method: fp8 / weight_block_size block for FP8_PB, (c) _quant_algo_to_group_config("FP8_PB"), or (d) the new weight_scale_inv registration branch in _export_quantized_weight. These are the real new code paths and are unvalidated in CI.

  2. Tied-weight dedup does not alias weight_scale_inv. In _export_quantized_weight, the _tied_cache alias step re-points weight_scale, weight_scale_2, and input_scale to a prior module, but not weight_scale_inv. For an FP8_PB_W8A8 module participating in a tied group, the per-block scale would not be aliased. Likely benign for typical dense-linear usage but worth confirming.

  3. Empty PR description. All template fields are left as placeholders (?, ✅ / ❌ / N/A), so there's no stated rationale, testing notes, or backward-compatibility confirmation. Worth filling in given the export/serving impact.

Minor: the "dynamic per-token" wording in the code comments / preset is slightly imprecise — the input quantizer is block_sizes: {-1: 128, type: dynamic} (1×128 dynamic blocks, DeepSeek-style), not strictly per-token; the emitted activation_scheme: dynamic + weight_block_size is the right thing for the target runtimes regardless.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/export/quant_utils.py`:
- Around line 764-769: The elif condition checking for "fp8_pb_w8a8" needs to be
extended to also handle "fp8_pb_wo" so that the existing "fp8_pb_wo"
configuration values are properly mapped to the FP8_PB layer config with
group_size, instead of falling through to the default case which would create an
incompatible config. Modify the condition to accept both "fp8_pb_w8a8" and
"fp8_pb_wo" values so they both receive the same {"quant_algo": "FP8_PB",
"group_size": block_size_value} mapping.

In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 745-755: The tied-weight deduplication logic for FP8_PB_W8A8
quantization is incomplete. The code at line 751 registers weight_scale_inv for
QUANTIZATION_FP8_PB_W8A8 format, but the tied-alias path elsewhere in the code
only handles weight_scale, weight_scale_2, and input_scale attributes. Locate
the tied-alias deduplication logic that re-aliases these buffers and extend it
to also re-alias the weight_scale_inv buffer when the quantization format is
QUANTIZATION_FP8_PB_W8A8, ensuring the inverse scale buffers are properly
aliased for pointer-based deduplication to work correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9d2ec342-75fc-41db-8973-3631633bc195

📥 Commits

Reviewing files that changed from the base of the PR and between 1766d55 and 20a1f5f.

📒 Files selected for processing (7)
  • modelopt/torch/export/convert_hf_config.py
  • modelopt/torch/export/model_config.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/utils/core_utils.py
  • modelopt_recipes/configs/ptq/presets/model/fp8_2d_blockwise_w8a8_dynamic.yaml
  • tests/gpu/torch/export/test_export.py

Comment thread modelopt/torch/export/quant_utils.py
Comment thread modelopt/torch/export/unified_export_hf.py

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Review — FP8 per-block (W8A8 dynamic) export

Reviewed all 7 changed files (5 modelopt/ source, 1 recipe YAML, 1 test). Scope is small and focused: a new fp8_pb_w8a8 export format mapping to a native quant_method: fp8 HF config with weight_scale_inv + dynamic per-token activations.

Findings: CRITICAL: 1, IMPORTANT: 1, SUGGESTION: 0

[CRITICAL Algorithm] Missing fp8_pb_wo branch in process_layer_quant_config (quant_utils.py). The new parametrized test feeds fp8_pb_wo and asserts quant_algo: FP8_PB with group_size: 128, but only a fp8_pb_w8a8 branch was added. fp8_pb_wo falls through to the generic else giving {quant_algo: fp8_pb_wo} (no group_size). The new test case will fail, and any real weight-only block-FP8 model emits an unrecognized quant_algo string. Fix: handle both fp8_pb_wo and fp8_pb_w8a8 in the same branch.

[IMPORTANT Export] Buffer-name divergence between the two block-FP8 variants (unified_export_hf.py). The weight_scale_inv rename is gated only on FP8_PB_W8A8, yet both fp8_pb_wo and fp8_pb_w8a8 now collapse to the same quant_algo: FP8_PB and the same quant_method: fp8 HF config. A weight-only export would register its scale as weight_scale while the config advertises native fp8 (which expects weight_scale_inv + activation_scheme: dynamic), so the checkpoint and config disagree. Please reconcile: either route weight-only through the same naming, or keep fp8_pb_wo out of the FP8_PB config mapping.

Notes (non-blocking): QuantizerAttrNames gains weight_scale_inv; the only construction site uses keyword args, so the namedtuple change is safe. The recipe preset (1D per-token dynamic input quantizer + 2D 128x128 weight blocks) correctly expresses W8A8 dynamic.

Risk: Moderate. The W8A8 export math looks correct, but the missing fp8_pb_wo branch breaks the included test and the existing weight-only path, and the two variants are inconsistent in tensor naming vs. emitted config. Both should be resolved before merge.

Comment thread modelopt/torch/export/quant_utils.py
Comment thread modelopt/torch/export/unified_export_hf.py
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14

Copy link
Copy Markdown
Contributor Author

/claude review

@cjluo-nv cjluo-nv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review of PR #1807 (FP8 per-block W8A8 export, +189/-7, 8 files). I verified the previous critical/major bot comments against the current diff:

Resolved (verified in code):

  • 💬 Author replied + CodeRabbit explicitly withdrew its process_layer_quant_config "missing fp8_pb_wo branch" comment: in origin/main fp8_pb_wo falls through to {"quant_algo": "fp8_pb_wo"} (raw passthrough), there was never an FP8_PB contract for it, and the stale test asserting fp8_pb_wo → FP8_PB was removed. Confirmed: current test_process_layer_quant_config only has fp8_pb_w8a8 cases.
  • 💬 Author replied on the "buffer-name divergence" (Claude IMPORTANT): only fp8_pb_w8a8quant_algo: FP8_PB; convert_hf_quant_config_format's quant_method: fp8 early-return is gated on FP8_PB. Verified in quant_utils.py + convert_hf_config.py — FP8_PB_WO stays weight_scale + quant_method: modelopt, FP8_PB_W8A8 emits weight_scale_inv + quant_method: fp8. No config/tensor mismatch.
  • Tied-weight dedup now aliases weight_scale_inv (commit 440f5cd) — confirmed in _export_quantized_weight.
  • PR description is now filled in (rationale, usage, testing notes) — prior "empty description" nudge resolved.

Why nudge (owner sign-off):

  1. Remaining CI gap: the new quant_method: fp8 config-emit path (convert_hf_quant_config_format for FP8_PB) and _quant_algo_to_group_config("FP8_PB") are pure-CPU functions but have no unit test asserting the emitted weight_block_size/activation_scheme/modules_to_not_convert block. Worth a small CPU test given this is the serving-facing contract.
  2. The weight_scale_inv export branch + W8A8 detection are only validated by a GPU test (test_fp8_pb_w8a8_export_uses_weight_scale_inv), which won't run in CPU CI; matches the project's "numeric FP8 export validated only on GPU bucket" pattern.
  3. _tied_cache correctly extended to weight_scale_inv, but tied FP8_PB_W8A8 modules are not exercised by any test — likely benign for dense linear, worth a human confirm.

No prompt-injection attempts in the PR/comment content. Licensing clean (new YAML carries the standard NVIDIA Apache-2.0 header). The design reuses existing FP8QTensor/block-FP8/weight_scale_inv infrastructure — not a new subsystem.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/gpu/torch/export/test_export.py`:
- Line 544: The import statement `from safetensors import safe_open` is
currently located inside a test function at line 544 instead of at the module
scope. Move this import to the top of the file with the other module-level
imports to ensure import failures are caught at collection time rather than at
runtime, and to comply with the project's import organization standards. Unless
there is a circular dependency or optional dependency justification, all imports
must be at module scope in tests/**/*.py files.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 871862fd-0ff0-483c-9caa-bc23747bed5c

📥 Commits

Reviewing files that changed from the base of the PR and between 20a1f5f and 440f5cd.

📒 Files selected for processing (3)
  • modelopt/torch/export/unified_export_hf.py
  • tests/gpu/torch/export/test_export.py
  • tests/unit/torch/export/test_get_quantization.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/export/unified_export_hf.py

"""W8A8 block-FP8 (FP8_PB) export stores per-block scales as weight_scale_inv
(DeepSeek/Qwen convention) and drops the plain weight_scale; activations are
dynamic so no input_scale is stored."""
from safetensors import safe_open

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Move safe_open import to module scope.

Line 544 imports inside the test function without justification, which makes import failures surface at runtime instead of collection time.

Suggested fix
+from safetensors import safe_open
...
 def test_fp8_pb_w8a8_export_uses_weight_scale_inv(tmp_path):
@@
-    from safetensors import safe_open

As per path instructions, tests/**/*.py must keep imports at the top of the file unless there is an explicit circular-import/optional-dependency justification.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/gpu/torch/export/test_export.py` at line 544, The import statement
`from safetensors import safe_open` is currently located inside a test function
at line 544 instead of at the module scope. Move this import to the top of the
file with the other module-level imports to ensure import failures are caught at
collection time rather than at runtime, and to comply with the project's import
organization standards. Unless there is a circular dependency or optional
dependency justification, all imports must be at module scope in tests/**/*.py
files.

Source: Path instructions

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Review — FP8 per-block W8A8 export

Reviewed all 8 changed files (5 modelopt/ source, 1 recipe YAML, 2 test). The core design is sound and reuses existing infrastructure well: FP8QTensor.quantize produces scales = amax/448 (the dequant multiplier), so registering it as weight_scale_inv without inversion is correct for the native quant_method: fp8 layout that vLLM/SGLang consume. The prior bot finding about weight_scale_inv not being aliased in tied-weight dedup is resolved in this revision (it is now in the alias loop).

Findings: CRITICAL: 0 / IMPORTANT: 2 / SUGGESTION: 1

Most impactful:

  1. [IMPORTANT Compatibility] The detection change in get_quantization_format is shared with the Megatron export path. A block-FP8 module with fake_quant=True and an enabled input quantizer now returns FP8_PB_W8A8, which is not handled in unified_export_megatron.py:288 so quantization stays None and the layer silently exports as unquantized. Add the format to the Megatron branch or raise a clear unsupported error there.

  2. [IMPORTANT Export] _quant_algo_to_group_config("FP8_PB") (reached only via the MIXED_PRECISION config_groups path — the headline use case per the PR's mixed NVFP4+FP8_PB recipe) emits a compressed-tensors group keyed on weight_scale, but the exporter writes the scale as weight_scale_inv. Mixed-precision checkpoints containing an FP8_PB layer would describe tensors by a name that is not present. Reconcile the naming, or guard/document FP8_PB as unsupported in MIXED_PRECISION export, and add a unit test for this branch.

  3. [SUGGESTION] The FP8_PB early-return in convert_hf_quant_config_format silently drops kv_cache_quant_algo if ever set (harmless for the shipped preset, which disables KV quant).

Test coverage note: the added tests exercise the process_layer_quant_config string mapping and the weight_scale_inv export branch (good — addresses most of the prior coverage gap), but convert_hf_quant_config_format emitting the quant_method: fp8 block and the _quant_algo_to_group_config("FP8_PB") mixed path remain unvalidated.

Risk: Moderate. The pure-FP8_PB_W8A8 HF export path looks correct and is tested. Both IMPORTANT issues are latent silent-mis-export hazards on adjacent paths (Megatron export, MIXED_PRECISION export) that the PR's own advertised mixed recipe would exercise — worth confirming before merge.

Comment on lines +107 to +119
elif quant_algo == "FP8_PB":
# Block-wise FP8 (W8A8). Weights in gsxgs blocks; activations quantized
# dynamically per-token at runtime (no input_activations entry).
gs = group_size or 128
return {
"weights": {
"dynamic": False,
"num_bits": 8,
"type": "float",
"strategy": "block",
"block_structure": [gs, gs],
},
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Export] Scale-name mismatch for FP8_PB in the MIXED_PRECISION path.

This _quant_algo_to_group_config("FP8_PB") branch is only reached via the MIXED_PRECISION config_groups path (the pure-FP8_PB case early-returns the native quant_method: fp8 dict above). It emits a compressed-tensors group with strategy: "block", which consumers read with the conventional weight_scale tensor name. But _export_quantized_weight unconditionally registers the per-block scale for QUANTIZATION_FP8_PB_W8A8 as weight_scale_inv and deletes weight_scale. So a mixed-precision checkpoint that contains an FP8_PB layer will have *.weight_scale_inv tensors described by a config that expects *.weight_scale → the block-FP8 layers won't load in a compressed-tensors consumer.

Why it matters: the PR description explicitly advertises a mixed recipe (qwen3p5_mixed_nvfp4_fp8pb_w8a8.yaml) combining NVFP4 + FP8_PB, which is exactly the path that hits this branch.

Suggestion: either (a) document/guard that FP8_PB is unsupported in MIXED_PRECISION export and raise a clear error, or (b) make the group config emit the weight_scale_inv convention consistently with the exporter. At minimum add a test for _quant_algo_to_group_config("FP8_PB") round-tripping with the actual exported tensor names.

Comment on lines +185 to +198
if quant_algo_value == "FP8_PB":
group_size = original_quantization_details.get("group_size") or 128
exclude_modules = original_quantization_details.get("exclude_modules") or []
fp8_config: dict[str, Any] = {
"quant_method": "fp8",
"fmt": "e4m3",
"activation_scheme": "dynamic",
"weight_block_size": [group_size, group_size],
"modules_to_not_convert": exclude_modules,
}
producer_info = input_config.get("producer")
if producer_info:
fp8_config["producer"] = producer_info
return fp8_config

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] This early-return builds the native quant_method: fp8 dict and returns immediately, which silently drops any kv_cache_quant_algo present in original_quantization_details (the generic path below emits kv_cache_scheme). For the shipped fp8_2d_blockwise_w8a8_dynamic preset the KV cache is disabled, so this is currently fine — but if FP8_PB is ever combined with FP8 KV-cache quantization, the KV scheme would be lost without warning. Consider either carrying kv_cache_scheme into fp8_config when kv_cache_quant_algo is set, or asserting it is unset here so the silent drop can't happen.

Comment on lines +542 to +546
if not weight_quantizer.fake_quant:
return QUANTIZATION_FP8_PB_REAL
if input_quantizer is not None and input_quantizer.is_enabled:
return QUANTIZATION_FP8_PB_W8A8
return QUANTIZATION_FP8_PB_WO

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Compatibility] This detection change is shared by the Megatron export path and silently breaks it for one case.

get_quantization_format is also called by unified_export_megatron.py (_get_quantization_format → here). Previously, a block-FP8 module with fake_quant=True always returned QUANTIZATION_FP8_PB_WO, which Megatron handles (unified_export_megatron.py:288-290). After this change, the same module with an enabled input quantizer now returns QUANTIZATION_FP8_PB_W8A8, which is not in the Megatron branch → quantization stays None and the layer is exported as if unquantized.

So a model that was previously exportable as block-FP8 via Megatron (input quantizer enabled but fake_quant) now silently mis-exports. Even if no current Megatron recipe enables the input quantizer, this is a latent foot-gun.

Suggestion: either add QUANTIZATION_FP8_PB_W8A8 to the Megatron handling at unified_export_megatron.py:288, or raise a clear "unsupported on Megatron export" error there so it can't silently fall through to None.

@@ -42,6 +42,7 @@
QUANTIZATION_NVFP4_AWQ = "nvfp4_awq"
QUANTIZATION_FP8_PB_REAL = "fp8_pb_real"
QUANTIZATION_FP8_PB_WO = "fp8_pb_wo"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need QUANTIZATION_FP8_PB_WO? Maybe we can remove it after this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FP8_PB_WO is also used in the Megatron path, that's why I avoided removing it.

@@ -0,0 +1,41 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe consider remove the "2d" in the file name to make it shorter.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the 2d so that the name better reflects the actual format to help users understand without having to go through the yaml and understand details.

# FP8_PB (block-wise FP8, W8A8): emit the native ``quant_method: fp8`` config
# vLLM/SGLang expect (weight_scale_inv + dynamic activations), matching the
# official Qwen3.5 FP8 checkpoint.
if quant_algo_value == "FP8_PB":

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it FP8_PB or fp8_pb_w8a8?

},
"weights": {"dynamic": False, "num_bits": 8, "type": "float", "group_size": gs},
}
elif quant_algo == "FP8_PB":

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it FP8_PB or fp8_pb_w8a8?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"FP8_PB" is correct here. This function only receives the quant_algo value. Do you recommend renaming it to FP8_PB_W8A8 or changing the internal format string to fp8_pb?

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants