Optimize fp8 block scaling Allgather for FSDP2#2789

Open
vthumbe1503 wants to merge 9 commits intoNVIDIA:mainfrom
vthumbe1503:optimize_fp8_blockwise_scaling

@vthumbe1503
Collaborator

@vthumbe1503 vthumbe1503 commented Mar 23, 2026

Description

Eliminate the columnwise all-gather for fp8_model_init with FSDP2. Weights quantized with FP8 block scaling typically use 2D blocks, in which case the columnwise data and scale inverse are just the transposes of the rowwise data and scale inverse. All-gathering only the rowwise data and scales is therefore enough.
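The transpose property can be checked on a toy model. The sketch below is pure Python and not Transformer Engine code: integer rounding stands in for the FP8 cast, the tile is 2×2 instead of 128×128, and `quantize_2d` and `transpose` are hypothetical helpers introduced only for illustration.

```python
# Toy model of 2D block scaling; integer rounding stands in for the FP8 cast.
BLOCK = 2  # assumption: real kernels use much larger tiles; 2x2 keeps this readable


def transpose(m):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def quantize_2d(w, block=BLOCK, qmax=127):
    """Quantize `w` with one scale per block x block tile.

    Returns (data, scale_inv): integer codes shaped like `w`, and one
    dequantization factor per tile.
    """
    rows, cols = len(w), len(w[0])
    data = [[0] * cols for _ in range(rows)]
    scale_inv = [[0.0] * (cols // block) for _ in range(rows // block)]
    for bi in range(rows // block):
        for bj in range(cols // block):
            tile = [
                (i, j)
                for i in range(bi * block, (bi + 1) * block)
                for j in range(bj * block, (bj + 1) * block)
            ]
            # The tile's absolute maximum does not depend on element order,
            # so a transposed tile gets exactly the same scale.
            amax = max(abs(w[i][j]) for i, j in tile) or 1.0
            scale_inv[bi][bj] = amax / qmax
            for i, j in tile:
                data[i][j] = round(w[i][j] / scale_inv[bi][bj])
    return data, scale_inv


w = [
    [0.5, -1.0, 3.0, 0.25],
    [2.0, 0.75, -0.5, 1.5],
    [-4.0, 0.1, 0.2, -0.3],
    [1.0, 8.0, 0.6, 0.9],
]

row_data, row_scale_inv = quantize_2d(w)
col_data, col_scale_inv = quantize_2d(transpose(w))

# Quantizing the transpose gives the transpose of the quantized form,
# for both the data and the per-tile scales.
assert col_data == transpose(row_data)
assert col_scale_inv == transpose(row_scale_inv)
```

The property relies on square tiles: a tile of W and the matching tile of Wᵀ contain the same values, so they share one scale. With 1D (row-vector) blocks the scale groups differ between the two layouts and the shortcut would not apply.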

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503 vthumbe1503 changed the title Optimize fp8 blockwise scaling Optimize fp8 block scaling Allgather for FSDP2 Mar 23, 2026
@vthumbe1503
Collaborator Author

/te-ci L1 pytorch

@greptile-apps
Contributor

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR optimises the FSDP2 all-gather path for FP8 2D block-scaled weights (Float8BlockwiseQTensor) by eliminating the columnwise all-gather entirely. Because 2D block scaling uses 128×128 tiles, the columnwise representation of any shard is mathematically identical to the FP8 transpose of its rowwise representation, so each device can derive columnwise data locally with a cheap tex.fp8_transpose call after gathering only the rowwise tensors — cutting the all-gather communication volume roughly in half.
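As a rough illustration of the "roughly in half" figure, the following back-of-the-envelope calculation counts the bytes gathered per weight. It assumes 1 byte per FP8 element, one float32 scale per 128×128 tile, and no padding or alignment; `allgather_bytes` is a hypothetical helper, and real buffer sizes in Transformer Engine may differ.

```python
import math


def allgather_bytes(rows, cols, block=128, scale_bytes=4, both_layouts=False):
    """Estimate bytes gathered for one weight (assumed sizes, no padding)."""
    data = rows * cols  # 1 byte per FP8 element
    scales = math.ceil(rows / block) * math.ceil(cols / block) * scale_bytes
    one_layout = data + scales
    return 2 * one_layout if both_layouts else one_layout


# Example: an 8192 x 8192 weight.
old_bytes = allgather_bytes(8192, 8192, both_layouts=True)  # rowwise + columnwise
new_bytes = allgather_bytes(8192, 8192)                     # rowwise only

assert new_bytes * 2 == old_bytes
```

Under these assumptions the saving is exactly half whenever the old path gathered both layouts; the scale tensors are a negligible fraction of the total either way.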

Key changes:

  • float8_blockwise_tensor.py – fsdp_pre_all_gather now always emits exactly two sharded tensors (rowwise_data, rowwise_scale_inv); fsdp_post_all_gather calls _create_columnwise() locally when columnwise access is needed, reusing existing GPU buffers across iterations to avoid repeated allocations.
  • float8_blockwise_tensor.py – Adopts the same reshard_after_forward / TrainingState.PRE_BACKWARD logic that already exists in float8_tensor.py and mxfp8_tensor.py, so only the relevant data form (rowwise in forward, columnwise in backward) is materialised when weights are resharded after forward.
  • float8_tensor.py and mxfp8_tensor.py – Adds a None-guard for _fsdp_param_group before accessing ._reshard_after_forward and ._training_state, addressing the previously flagged AttributeError risk.
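The gather-then-derive flow from the first bullet can be modelled without any GPU code. In this sketch, nested lists stand in for tensors, `all_gather_rows` stands in for the FSDP2 all-gather along dim 0, and the "old" path is modelled as stitching transposed shards back together. That modelling is an assumption for illustration, not a description of the previous implementation.

```python
def transpose(m):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def all_gather_rows(shards):
    """Stand-in for all-gather: concatenate per-rank shards along dim 0."""
    return [row for shard in shards for row in shard]


# Rowwise shards of a 4 x 3 weight held by two ranks (2 rows each).
rank_shards = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]

# Modelled OLD flow: every rank also communicates its columnwise shard,
# and the shards are stitched together column by column.
col_shards = [transpose(shard) for shard in rank_shards]
old_columnwise = [
    [x for shard in col_shards for x in shard[r]]
    for r in range(len(col_shards[0]))
]

# NEW flow: gather rowwise only, then derive columnwise locally.
full_rowwise = all_gather_rows(rank_shards)
new_columnwise = transpose(full_rowwise)

assert new_columnwise == old_columnwise
```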

The prior review threads around _fsdp_param_group nullability, stale columnwise buffer state, and assert vs RuntimeError are addressed in this revision.

Confidence Score: 4/5

  • PR is safe to merge; the optimization is mathematically valid for 2D block scaling and prior review concerns have been addressed.
  • The core invariant — columnwise FP8 data equals the transpose of rowwise FP8 data under 2D block scaling — is correctly exploited. Buffer reuse across iterations is preserved via the existing out=self._columnwise_data argument in tex.fp8_transpose. The null-guard for _fsdp_param_group is now present in all three files. Two minor style observations remain (import location inconsistency and a missing clarifying comment for the non-resharding path) but neither affects correctness.
  • float8_blockwise_tensor.py deserves the most attention as it contains the core optimization logic; the other two files only add a defensive null-guard.

Important Files Changed

Filename | Overview
transformer_engine/pytorch/tensor/float8_blockwise_tensor.py | Core optimization: eliminates columnwise all-gather for FP8 2D block scaling by deriving columnwise data locally via fp8_transpose after gathering only rowwise tensors; also adds null-guard for _fsdp_param_group and adopts reshard_after_forward-based usage selection matching float8_tensor.py.
transformer_engine/pytorch/tensor/float8_tensor.py | Defensive fix only: introduces null-guard for _fsdp_param_group before accessing _reshard_after_forward and _training_state, preventing AttributeError when called on modules without a direct param group.
transformer_engine/pytorch/tensor/mxfp8_tensor.py | Defensive fix only: same null-guard for _fsdp_param_group as float8_tensor.py, no functional changes to MXFP8 all-gather logic.

Sequence Diagram

sequenceDiagram
    participant FSDP2
    participant ShardedTensor as ShardedTensor (each rank)
    participant AllGather as FSDP2 All-Gather
    participant PostGather as fsdp_post_all_gather
    participant Local as Local GPU (each rank)

    Note over FSDP2,Local: OLD flow (eliminated)
    FSDP2->>ShardedTensor: fsdp_pre_all_gather
    ShardedTensor-->>FSDP2: (rowwise_data, rowwise_scale_inv, columnwise_data, columnwise_scale_inv)
    FSDP2->>AllGather: all-gather 4 tensors
    AllGather-->>PostGather: full rowwise + full columnwise (transposed back)
    PostGather-->>FSDP2: Float8BlockwiseQTensor (rowwise + columnwise)

    Note over FSDP2,Local: NEW flow (this PR)
    FSDP2->>ShardedTensor: fsdp_pre_all_gather
    ShardedTensor-->>FSDP2: (rowwise_data, rowwise_scale_inv)
    FSDP2->>AllGather: all-gather 2 tensors only (~50% comms)
    AllGather-->>PostGather: full rowwise_data + rowwise_scale_inv
    PostGather->>Local: _create_columnwise() via tex.fp8_transpose (local, reuses buffer)
    PostGather-->>FSDP2: Float8BlockwiseQTensor (rowwise + derived columnwise)

Reviews (3): Last reviewed commit: "Merge branch 'main' into optimize_fp8_bl..."

Comment on lines +641 to +642
fsdp_state = _get_module_fsdp_state(module)
reshard_after_forward = fsdp_state._fsdp_param_group._reshard_after_forward
Contributor

P1 Unguarded access to _fsdp_param_group

fsdp_state._fsdp_param_group is typed as Optional[FSDPParamGroup] in PyTorch's FSDP2 internals — it is None for any FSDP module that does not directly manage parameters (e.g. a container module whose children are individually sharded). Accessing ._reshard_after_forward on it unconditionally will raise AttributeError: 'NoneType' object has no attribute '_reshard_after_forward' in that case.

While in practice fsdp_pre_all_gather is only called for tensors managed by a param group, this assumption is implicit. A guard makes the failure mode explicit and easier to diagnose:

fsdp_state = _get_module_fsdp_state(module)
param_group = fsdp_state._fsdp_param_group
if param_group is None:
    raise RuntimeError(
        "FSDP state for this module has no parameter group; "
        "cannot determine reshard_after_forward."
    )
reshard_after_forward = param_group._reshard_after_forward
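To see the guard's effect in isolation, the suggestion can be exercised against stand-in objects. `SimpleNamespace` instances below only mimic the two attributes involved; they are not PyTorch's FSDP2 state classes, and `get_reshard_after_forward` is a hypothetical wrapper name.

```python
from types import SimpleNamespace


def get_reshard_after_forward(fsdp_state):
    """Hypothetical helper: read the flag, failing loudly without a param group."""
    param_group = fsdp_state._fsdp_param_group
    if param_group is None:
        raise RuntimeError(
            "FSDP state for this module has no parameter group; "
            "cannot determine reshard_after_forward."
        )
    return param_group._reshard_after_forward


# Stand-ins for FSDP2 state objects (not the real PyTorch classes).
managed = SimpleNamespace(
    _fsdp_param_group=SimpleNamespace(_reshard_after_forward=True)
)
container = SimpleNamespace(_fsdp_param_group=None)

assert get_reshard_after_forward(managed) is True

try:
    get_reshard_after_forward(container)
except RuntimeError as exc:
    message = str(exc)  # explicit diagnostic instead of an AttributeError
else:
    message = None
```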

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
…03/TransformerEngine into optimize_fp8_blockwise_scaling
@vthumbe1503
Collaborator Author

/te-ci L1 pytorch

Comment on lines +697 to +701
if out is not None:
    # Update existing tensor in-place (subsequent iterations)
    out._rowwise_data = rowwise_data
    out._rowwise_scale_inv = rowwise_scale_inv
    out._columnwise_data = columnwise_data
    out._columnwise_scale_inv = columnwise_scale_inv
    out._columnwise_data = None
    out._columnwise_scale_inv = None
Contributor

P2 Repeated GPU buffer allocation per backward iteration

Setting _columnwise_data = None and _columnwise_scale_inv = None before calling _create_columnwise() means that every fsdp_post_all_gather call with out is not None (i.e. every backward iteration after the first) allocates fresh GPU memory. In _create_columnwise, the tex.fp8_transpose(..., out=self._columnwise_data) call is passed out=None, so it allocates a new tensor every time. Likewise, the torch.empty(...) branch for _columnwise_scale_inv fires every iteration.

The previous all-gathered columnwise buffers could instead be reused by preserving their references and letting _create_columnwise overwrite them in-place. Consider not zeroing these fields here, and instead passing the existing (now stale) buffer as the out= hint to tex.fp8_transpose:

if out is not None:
    out._rowwise_data = rowwise_data
    out._rowwise_scale_inv = rowwise_scale_inv
    # Keep out._columnwise_data and out._columnwise_scale_inv alive so
    # _create_columnwise can reuse their underlying GPU storage via out=...
    # They will be overwritten or cleared by the calls below.

For large models running many training steps, the repeated allocation and GC of large FP8 tensors can cause GPU memory fragmentation and measurable overhead.
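The buffer-reuse idea can be sketched with plain byte buffers. Here `transpose_into` is a hypothetical stand-in for a kernel that accepts an `out=` argument, and `GatheredWeight` is an illustrative holder class; neither is part of Transformer Engine.

```python
def transpose_into(src, rows, cols, out=None):
    """Transpose a row-major `rows x cols` byte buffer.

    Writes into `out` when it is provided and already the right size;
    otherwise allocates a new bytearray.
    """
    if out is None or len(out) != rows * cols:
        out = bytearray(rows * cols)
    for i in range(rows):
        for j in range(cols):
            out[j * rows + i] = src[i * cols + j]
    return out


class GatheredWeight:
    """Hypothetical holder for gathered rowwise and derived columnwise data."""

    def __init__(self):
        self.rowwise = None
        self.columnwise = None  # kept alive across iterations

    def update(self, rowwise, rows, cols):
        self.rowwise = rowwise
        # Pass the stale buffer as `out` rather than resetting it to None.
        self.columnwise = transpose_into(rowwise, rows, cols, out=self.columnwise)


w = GatheredWeight()
w.update(bytes([1, 2, 3, 4, 5, 6]), rows=2, cols=3)
first_buffer = w.columnwise
w.update(bytes([6, 5, 4, 3, 2, 1]), rows=2, cols=3)

assert w.columnwise is first_buffer  # same storage reused on the second call
assert bytes(w.columnwise) == bytes([6, 3, 5, 2, 4, 1])
```

Had `update` reset `self.columnwise` to `None` first, the identity check would fail because each call would allocate a fresh buffer.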

Collaborator Author

Lol, you keep sending me in circles.

vthumbe1503 and others added 3 commits March 23, 2026 08:54
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Remove unnecessary columnwise data and scale inv assignments.

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Comment on lines +627 to +628
# PyTorch FSDP2 private API – tested with PyTorch 2.5+;
from torch.distributed.fsdp._fully_shard._fsdp_common import TrainingState
Contributor

P2 Inconsistent import style for TrainingState

TrainingState is imported at the module level (line 10) in float8_tensor.py and at line 13 in mxfp8_tensor.py, but here it's imported lazily inside fsdp_pre_all_gather. While the inline comment about the private API and PyTorch version is valuable, the inconsistency across the three sibling files may confuse readers.

Consider either:

  • Moving the TrainingState import to the module level and placing the version comment there (matching the other two files), or
  • Adding the same lazy-import pattern and version comment to float8_tensor.py and mxfp8_tensor.py for symmetry.
Suggested change
# PyTorch FSDP2 private API – tested with PyTorch 2.5+;
from torch.distributed.fsdp._fully_shard._fsdp_common import TrainingState
# PyTorch FSDP2 private API – tested with PyTorch 2.5+;
from torch.distributed.fsdp._fully_shard._fsdp_common import TrainingState
from transformer_engine.pytorch.distributed import _get_module_fsdp_state

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +656 to +663
if reshard_after_forward:
    training_state = param_group._training_state
    is_backward_pass = training_state == TrainingState.PRE_BACKWARD
    rowwise_usage = not is_backward_pass
    columnwise_usage = is_backward_pass
else:
    rowwise_usage = True
    columnwise_usage = self._quantizer.columnwise_usage
Contributor

P2 columnwise_usage not derived from training state in non-resharded path

When reshard_after_forward=False, the same all-gathered weight is reused through both forward and backward passes. The code sets:

rowwise_usage = True
columnwise_usage = self._quantizer.columnwise_usage

This means whether columnwise data gets derived locally (and kept) is entirely controlled by the sharded quantizer's setting, not the actual pass. The comment in the previous code explicitly noted that both forms were needed when not resharding. If self._quantizer.columnwise_usage is False (e.g. on an architecture that doesn't need the transpose), columnwise data won't be created and won't be available for the backward pass GEMM.

This matches the pre-existing float8_tensor.py behavior (same pattern there), so it's presumably already validated by the existing usage assumptions — but it would be worth a brief comment here documenting that self._quantizer.columnwise_usage must be True whenever the backward GEMM needs columnwise access for the non-resharding path.
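The case being flagged is easier to see when the branch is pulled out into a standalone function. The sketch below mirrors the quoted logic; the `TrainingState` enum is a local stand-in with only two members rather than the private FSDP2 enum, and `select_usage` is a hypothetical name.

```python
from enum import Enum, auto


class TrainingState(Enum):
    """Local stand-in for FSDP2's private TrainingState enum (subset only)."""

    FORWARD = auto()
    PRE_BACKWARD = auto()


def select_usage(reshard_after_forward, training_state, quantizer_columnwise_usage):
    """Return (rowwise_usage, columnwise_usage) following the quoted branch."""
    if reshard_after_forward:
        is_backward_pass = training_state == TrainingState.PRE_BACKWARD
        return (not is_backward_pass, is_backward_pass)
    # Non-resharding path: the gathered weight serves both passes, so the
    # columnwise form depends only on the quantizer's setting.
    return (True, quantizer_columnwise_usage)


# Resharding: one form per pass.
assert select_usage(True, TrainingState.FORWARD, True) == (True, False)
assert select_usage(True, TrainingState.PRE_BACKWARD, False) == (False, True)

# Non-resharding with columnwise_usage=False: no columnwise data is requested,
# which is the situation the comment asks to document.
assert select_usage(False, TrainingState.PRE_BACKWARD, False) == (True, False)
```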

@vthumbe1503 vthumbe1503 requested a review from ptrendx March 23, 2026 17:48