Optimize fp8 block scaling Allgather for FSDP2 by vthumbe1503 · Pull Request #2789 · NVIDIA/TransformerEngine

vthumbe1503 · 2026-03-23T00:45:10Z

Description

Eliminate the columnwise all-gather for fp8_model_init with FSDP2. When FP8 block scaling is used for weights, the scaling is typically 2D. In that case, the columnwise data and scale_inv are just the transposes of the rowwise data and scale_inv, so all-gathering only the rowwise data and scales is enough.
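
To illustrate the transpose claim, here is a minimal pure-Python sketch. It is not TransformerEngine code: the 2x2 tile size, the integer grid standing in for FP8, and the names `quantize_2d`, `BLOCK`, and `QMAX` are all assumptions chosen to keep the example small.

```python
# Toy 2D block quantization. BLOCK=2 and an integer grid stand in for the
# real square FP8 tiles; both are illustrative assumptions.
BLOCK = 2
QMAX = 127  # hypothetical largest quantized magnitude


def transpose(m):
    return [list(col) for col in zip(*m)]


def quantize_2d(w):
    """Return (q, scale_inv) with one scale per BLOCK x BLOCK tile."""
    rows, cols = len(w), len(w[0])
    q = [[0] * cols for _ in range(rows)]
    scale_inv = [[1.0] * (cols // BLOCK) for _ in range(rows // BLOCK)]
    for bi in range(rows // BLOCK):
        for bj in range(cols // BLOCK):
            r0, c0 = bi * BLOCK, bj * BLOCK
            tile = [(r, c)
                    for r in range(r0, r0 + BLOCK)
                    for c in range(c0, c0 + BLOCK)]
            amax = max(abs(w[r][c]) for r, c in tile)
            s = amax / QMAX if amax > 0 else 1.0
            scale_inv[bi][bj] = s  # dequantize with q * scale_inv
            for r, c in tile:
                q[r][c] = round(w[r][c] / s)
    return q, scale_inv


w = [[0.5, -1.0, 3.0, 0.25],
     [2.0, 0.75, -4.0, 1.5]]

q_row, s_row = quantize_2d(w)             # rowwise form
q_col, s_col = quantize_2d(transpose(w))  # "columnwise" = quantized W^T

# A square tile of W^T holds the same elements as the matching tile of W,
# so it gets the same scale and the same quantized values.
same_data = q_col == transpose(q_row)
same_scales = s_col == transpose(s_row)
```

Both flags come out true here because the tiles are square. With 1D (row-only) blocks the tiles of W^T would cover different elements, and the shortcut would not hold.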

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

for more information, see https://pre-commit.ci

vthumbe1503 · 2026-03-23T00:47:45Z

/te-ci L1 pytorch

greptile-apps · 2026-03-23T00:50:04Z

Greptile Summary

This PR optimises the FSDP2 all-gather path for FP8 2D block-scaled weights (Float8BlockwiseQTensor) by eliminating the columnwise all-gather entirely. Because 2D block scaling uses square 128×128 tiles, each tile of the transposed weight holds the same elements, and therefore the same scale, as the corresponding rowwise tile. The columnwise representation of any shard is thus mathematically identical to the FP8 transpose of its rowwise representation, so each device can derive columnwise data locally with a cheap tex.fp8_transpose call after gathering only the rowwise tensors, cutting the all-gather communication volume roughly in half.
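
As a rough check on the "roughly in half" figure, the arithmetic can be sketched as follows. The weight shape, the float32 scale size, and the helper `payload_bytes` are assumptions for illustration. Any scale padding the library applies is ignored, and the comparison assumes both orientations would otherwise have been gathered.

```python
import math

BLOCK = 128      # tile edge for 2D block scaling, per the summary
FP8_BYTES = 1    # one byte per FP8 element
SCALE_BYTES = 4  # assumes float32 scale_inv entries


def payload_bytes(rows, cols):
    """Bytes for one orientation: FP8 data plus one scale per tile."""
    tiles = math.ceil(rows / BLOCK) * math.ceil(cols / BLOCK)
    return rows * cols * FP8_BYTES + tiles * SCALE_BYTES


rows, cols = 4096, 4096  # hypothetical weight shape
rowwise = payload_bytes(rows, cols)
columnwise = payload_bytes(cols, rows)

before = rowwise + columnwise  # gather both orientations
after = rowwise                # gather rowwise only, transpose locally
ratio = after / before
```

Under these assumptions the scale tensors are a tiny fraction of the payload, so the saving is dominated by the FP8 data itself.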

Key changes:

float8_blockwise_tensor.py – fsdp_pre_all_gather now always emits exactly two sharded tensors (rowwise_data, rowwise_scale_inv); fsdp_post_all_gather calls _create_columnwise() locally when columnwise access is needed, reusing existing GPU buffers across iterations to avoid repeated allocations.
float8_blockwise_tensor.py – Adopts the same reshard_after_forward / TrainingState.PRE_BACKWARD logic that already exists in float8_tensor.py and mxfp8_tensor.py, so only the relevant data form (rowwise in forward, columnwise in backward) is materialised when weights are resharded after forward.
float8_tensor.py and mxfp8_tensor.py – Adds a None-guard for _fsdp_param_group before accessing ._reshard_after_forward and ._training_state, addressing the previously flagged AttributeError risk.
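
The gather-then-transpose flow described in the first key change can be sketched in plain Python. This is a stand-in, not the PR's implementation: lists replace GPU tensors, concatenation replaces the collective, and `GatheredWeight`, `all_gather_rows`, and `transpose_into` are hypothetical names.

```python
def all_gather_rows(shards):
    """Stand-in for an all-gather along dim 0: concatenate row shards."""
    return [row for shard in shards for row in shard]


def transpose_into(src, out=None):
    """Write src^T into `out` if its shape matches, otherwise allocate."""
    rows, cols = len(src), len(src[0])
    if out is None or len(out) != cols or len(out[0]) != rows:
        out = [[0] * rows for _ in range(cols)]
    for r in range(rows):
        for c in range(cols):
            out[c][r] = src[r][c]
    return out


class GatheredWeight:
    def __init__(self):
        self.rowwise = None
        self.columnwise = None  # kept across iterations so it can be reused

    def post_all_gather(self, shards, need_columnwise):
        # Only rowwise shards cross the network.
        self.rowwise = all_gather_rows(shards)
        if need_columnwise:
            # Columnwise form is derived locally, into the old buffer.
            self.columnwise = transpose_into(self.rowwise, out=self.columnwise)
        return self


# Two ranks, each holding one 1 x 3 row shard of quantized values.
weight = GatheredWeight().post_all_gather(
    [[[1, 2, 3]], [[4, 5, 6]]], need_columnwise=True)
first_buffer = weight.columnwise
first_values = [row[:] for row in first_buffer]

# Next iteration: new values, same shapes, so the buffer object is reused.
weight.post_all_gather([[[7, 8, 9]], [[10, 11, 12]]], need_columnwise=True)
reused = weight.columnwise is first_buffer
```

The `out=` parameter mirrors the buffer-reuse idea only in spirit. The sketch does not model how the real code handles a columnwise buffer left over from an earlier iteration when `need_columnwise` is false.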

The prior review threads around _fsdp_param_group nullability, stale columnwise buffer state, and assert vs RuntimeError are addressed in this revision.

Confidence Score: 4/5

PR is safe to merge; the optimization is mathematically valid for 2D block scaling and prior review concerns have been addressed.
The core invariant — columnwise FP8 data equals the transpose of rowwise FP8 data under 2D block scaling — is correctly exploited. Buffer reuse across iterations is preserved via the existing out=self._columnwise_data argument in tex.fp8_transpose. The null-guard for _fsdp_param_group is now present in all three files. Two minor style observations remain (import location inconsistency and a missing clarifying comment for the non-resharding path) but neither affects correctness.
float8_blockwise_tensor.py deserves the most attention as it contains the core optimization logic; the other two files only add a defensive null-guard.
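
A minimal sketch of the null-guard and usage selection follows, using stand-in classes rather than the real FSDP2 objects. `ParamGroup`, `FsdpState`, and `select_usage` are hypothetical, and the fallback of keeping both forms when the param group is missing is an assumption, not something the summary confirms.

```python
from enum import Enum, auto


class TrainingState(Enum):
    # Stand-in for the FSDP2 enum named in the summary; members are illustrative.
    FORWARD = auto()
    PRE_BACKWARD = auto()


class ParamGroup:
    # Hypothetical stand-in for the object behind `_fsdp_param_group`.
    def __init__(self, reshard_after_forward, training_state):
        self._reshard_after_forward = reshard_after_forward
        self._training_state = training_state


class FsdpState:
    def __init__(self, param_group=None):
        self._fsdp_param_group = param_group


def select_usage(fsdp_state):
    """Return (rowwise, columnwise) flags for the gathered weight."""
    group = getattr(fsdp_state, "_fsdp_param_group", None)
    if group is None or not group._reshard_after_forward:
        # Assumed fallback: the gathered weight persists from forward into
        # backward (or the state is unknown), so keep both forms.
        return True, True
    is_backward = group._training_state == TrainingState.PRE_BACKWARD
    # Resharded after forward: rowwise in forward, columnwise in backward.
    return (not is_backward), is_backward


no_group = select_usage(FsdpState())
forward = select_usage(FsdpState(ParamGroup(True, TrainingState.FORWARD)))
backward = select_usage(FsdpState(ParamGroup(True, TrainingState.PRE_BACKWARD)))
```

Without the `None` check, the attribute access on `group` would raise `AttributeError` for a state that has no param group, which is the failure the guard is meant to avoid.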

Important Files Changed

Filename	Overview
transformer_engine/pytorch/tensor/float8_blockwise_tensor.py	Core optimization: eliminates columnwise all-gather for FP8 2D block scaling by deriving columnwise data locally via fp8_transpose after gathering only rowwise tensors; also adds null-guard for _fsdp_param_group and adopts reshard_after_forward-based usage selection matching float8_tensor.py.
transformer_engine/pytorch/tensor/float8_tensor.py	Defensive fix only: introduces null-guard for _fsdp_param_group before accessing _reshard_after_forward and _training_state, preventing AttributeError when called on modules without a direct param group.
transformer_engine/pytorch/tensor/mxfp8_tensor.py	Defensive fix only: same null-guard for _fsdp_param_group as float8_tensor.py, no functional changes to MXFP8 all-gather logic.