Optimize fp8 block scaling Allgather for FSDP2 #2789
vthumbe1503 wants to merge 9 commits into NVIDIA:main
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
/te-ci L1 pytorch
Greptile Summary
This PR optimises the FSDP2 all-gather path for FP8 2D block-scaled weights.
Confidence Score: 4/5
Sequence Diagram
sequenceDiagram
participant FSDP2
participant ShardedTensor as ShardedTensor (each rank)
participant AllGather as FSDP2 All-Gather
participant PostGather as fsdp_post_all_gather
participant Local as Local GPU (each rank)
Note over FSDP2,Local: OLD flow (eliminated)
FSDP2->>ShardedTensor: fsdp_pre_all_gather
ShardedTensor-->>FSDP2: (rowwise_data, rowwise_scale_inv, columnwise_data, columnwise_scale_inv)
FSDP2->>AllGather: all-gather 4 tensors
AllGather-->>PostGather: full rowwise + full columnwise (transposed back)
PostGather-->>FSDP2: Float8BlockwiseQTensor (rowwise + columnwise)
Note over FSDP2,Local: NEW flow (this PR)
FSDP2->>ShardedTensor: fsdp_pre_all_gather
ShardedTensor-->>FSDP2: (rowwise_data, rowwise_scale_inv)
FSDP2->>AllGather: all-gather 2 tensors only (~50% comms)
AllGather-->>PostGather: full rowwise_data + rowwise_scale_inv
PostGather->>Local: _create_columnwise() via tex.fp8_transpose (local, reuses buffer)
PostGather-->>FSDP2: Float8BlockwiseQTensor (rowwise + derived columnwise)
Reviews (3): Last reviewed commit: "Merge branch 'main' into optimize_fp8_bl..."
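The "~50% comms" figure in the diagram can be checked with a little arithmetic. The sketch below is illustrative only: `gathered_bytes` is a hypothetical helper, and it assumes one byte per FP8 element, one FP32 scale per 128x128 tile, and no padding or alignment of the scale tensor. None of these details are confirmed by the PR text.

```python
import math


def gathered_bytes(rows, cols, block=128, gather_columnwise=False):
    """Bytes of one full weight moved by the all-gather (hypothetical helper).

    Assumptions: FP8 data is 1 byte per element, each block x block tile
    carries one FP32 scale, and scale padding/alignment is ignored.
    """
    data_bytes = rows * cols
    scale_bytes = math.ceil(rows / block) * math.ceil(cols / block) * 4
    per_layout = data_bytes + scale_bytes
    # The old flow gathered both layouts; the new flow gathers rowwise only.
    return per_layout * (2 if gather_columnwise else 1)


old_flow = gathered_bytes(4096, 4096, gather_columnwise=True)
new_flow = gathered_bytes(4096, 4096)
print(f"old={old_flow} bytes, new={new_flow} bytes, ratio={new_flow / old_flow}")
```

Under these assumptions the scale tensors are tiny next to the data, so nearly all of the saving comes from no longer gathering the columnwise data.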
fsdp_state = _get_module_fsdp_state(module)
reshard_after_forward = fsdp_state._fsdp_param_group._reshard_after_forward
Unguarded access to _fsdp_param_group
fsdp_state._fsdp_param_group is typed as Optional[FSDPParamGroup] in PyTorch's FSDP2 internals — it is None for any FSDP module that does not directly manage parameters (e.g. a container module whose children are individually sharded). Accessing ._reshard_after_forward on it unconditionally will raise AttributeError: 'NoneType' object has no attribute '_reshard_after_forward' in that case.
While in practice fsdp_pre_all_gather is only called for tensors managed by a param group, this assumption is implicit. A guard makes the failure mode explicit and easier to diagnose:
fsdp_state = _get_module_fsdp_state(module)
param_group = fsdp_state._fsdp_param_group
if param_group is None:
    raise RuntimeError(
        "FSDP state for this module has no parameter group; "
        "cannot determine reshard_after_forward."
    )
reshard_after_forward = param_group._reshard_after_forward
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
…03/TransformerEngine into optimize_fp8_blockwise_scaling
/te-ci L1 pytorch
  if out is not None:
      # Update existing tensor in-place (subsequent iterations)
      out._rowwise_data = rowwise_data
      out._rowwise_scale_inv = rowwise_scale_inv
-     out._columnwise_data = columnwise_data
-     out._columnwise_scale_inv = columnwise_scale_inv
+     out._columnwise_data = None
+     out._columnwise_scale_inv = None
Repeated GPU buffer allocation per backward iteration
Setting _columnwise_data = None and _columnwise_scale_inv = None before calling _create_columnwise() means that every fsdp_post_all_gather call with out is not None (i.e. every backward iteration after the first) allocates fresh GPU memory. In _create_columnwise, the tex.fp8_transpose(..., out=self._columnwise_data) call is passed out=None, so it allocates a new tensor every time. Likewise, the torch.empty(...) branch for _columnwise_scale_inv fires every iteration.
The previous all-gathered columnwise buffers could instead be reused by preserving their references and letting _create_columnwise overwrite them in-place. Consider not zeroing these fields here, and instead passing the existing (now stale) buffer as the out= hint to tex.fp8_transpose:
if out is not None:
    out._rowwise_data = rowwise_data
    out._rowwise_scale_inv = rowwise_scale_inv
    # Keep out._columnwise_data and out._columnwise_scale_inv alive so
    # _create_columnwise can reuse their underlying GPU storage via out=...
    # They will be overwritten or cleared by the calls below.
For large models running many training steps, the repeated allocation and GC of large FP8 tensors can cause GPU memory fragmentation and measurable overhead.
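The reuse pattern suggested above can be illustrated without a GPU. This is a minimal sketch, not the Transformer Engine implementation: `transpose_into` and `GatheredWeight` are hypothetical stand-ins, and plain `bytearray` buffers play the role of the FP8 tensors that `tex.fp8_transpose(..., out=...)` would write into.

```python
def transpose_into(src, rows, cols, out=None):
    """Transpose a row-major byte matrix, reusing `out` when its size fits."""
    if out is None or len(out) != rows * cols:
        out = bytearray(rows * cols)  # the only place a new buffer is allocated
    for r in range(rows):
        for c in range(cols):
            out[c * rows + r] = src[r * cols + c]
    return out


class GatheredWeight:
    """Hypothetical stand-in for the tensor updated in fsdp_post_all_gather."""

    def __init__(self):
        self.rowwise = None
        self.columnwise = None

    def update(self, rowwise, rows, cols):
        self.rowwise = rowwise
        # Do not reset self.columnwise to None first: the stale buffer is
        # handed back as `out` so its storage is overwritten in place.
        self.columnwise = transpose_into(rowwise, rows, cols, out=self.columnwise)


weight = GatheredWeight()
weight.update(bytes([1, 2, 3, 4, 5, 6]), rows=2, cols=3)
first_buffer = weight.columnwise
weight.update(bytes([6, 5, 4, 3, 2, 1]), rows=2, cols=3)
print(weight.columnwise is first_buffer)
```

Setting `self.columnwise = None` before the second `update` would force `transpose_into` down the allocation branch on every call, which is the behaviour the comment warns about.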
Lol, you keep sending me in circles.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Remove unnecessary columnwise data and scale inv assignments. Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
# PyTorch FSDP2 private API – tested with PyTorch 2.5+;
from torch.distributed.fsdp._fully_shard._fsdp_common import TrainingState
Inconsistent import style for TrainingState
TrainingState is imported at the module level (line 10) in float8_tensor.py and at line 13 in mxfp8_tensor.py, but here it's imported lazily inside fsdp_pre_all_gather. While the inline comment about the private API and PyTorch version is valuable, the inconsistency across the three sibling files may confuse readers.
Consider either:
- Moving the TrainingState import to the module level and placing the version comment there (matching the other two files), or
- Adding the same lazy-import pattern and version comment to float8_tensor.py and mxfp8_tensor.py for symmetry.
- # PyTorch FSDP2 private API – tested with PyTorch 2.5+;
- from torch.distributed.fsdp._fully_shard._fsdp_common import TrainingState
+ # PyTorch FSDP2 private API – tested with PyTorch 2.5+;
+ from torch.distributed.fsdp._fully_shard._fsdp_common import TrainingState
+ from transformer_engine.pytorch.distributed import _get_module_fsdp_state
if reshard_after_forward:
    training_state = param_group._training_state
    is_backward_pass = training_state == TrainingState.PRE_BACKWARD
    rowwise_usage = not is_backward_pass
    columnwise_usage = is_backward_pass
else:
    rowwise_usage = True
    columnwise_usage = self._quantizer.columnwise_usage
columnwise_usage not derived from training state in non-resharded path
When reshard_after_forward=False, the same all-gathered weight is reused through both forward and backward passes. The code sets:
rowwise_usage = True
columnwise_usage = self._quantizer.columnwise_usage
This means whether columnwise data gets derived locally (and kept) is entirely controlled by the sharded quantizer's setting, not the actual pass. The comment in the previous code explicitly noted that both forms were needed when not resharding. If self._quantizer.columnwise_usage is False (e.g. on an architecture that doesn't need the transpose), columnwise data won't be created and won't be available for the backward pass GEMM.
This matches the pre-existing float8_tensor.py behavior (same pattern there), so it's presumably already validated by the existing usage assumptions — but it would be worth a brief comment here documenting that self._quantizer.columnwise_usage must be True whenever the backward GEMM needs columnwise access for the non-resharding path.
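The selection logic discussed in this thread can be written as a small pure function, which makes the non-resharding case easy to inspect. This is a sketch under assumptions: `select_usages` is a hypothetical helper, and the `TrainingState` enum below is a local stand-in with only two members, not PyTorch's private FSDP2 enum.

```python
from enum import Enum, auto


class TrainingState(Enum):
    """Local stand-in for FSDP2's private TrainingState (two members only)."""

    FORWARD = auto()
    PRE_BACKWARD = auto()


def select_usages(reshard_after_forward, training_state, quantizer_columnwise_usage):
    """Return (rowwise_usage, columnwise_usage) as in the snippet under review."""
    if reshard_after_forward:
        # Weights are re-gathered per pass, so only one layout is needed at a time.
        is_backward_pass = training_state == TrainingState.PRE_BACKWARD
        return (not is_backward_pass, is_backward_pass)
    # The gathered weight is kept for both passes; the columnwise layout exists
    # only if the sharded quantizer asks for it.
    return (True, quantizer_columnwise_usage)


print(select_usages(True, TrainingState.FORWARD, False))
print(select_usages(True, TrainingState.PRE_BACKWARD, False))
print(select_usages(False, TrainingState.FORWARD, False))
```

The last call shows the case the comment is about: without resharding and with the quantizer's columnwise flag off, no columnwise layout is produced, regardless of which pass is running.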
Description
Eliminate the columnwise all-gather for fp8_model_init with FSDP2. For weights that use FP8 block scaling, we typically use 2D blocks. In that case, the columnwise data and scale inverse are just the transpose of the rowwise data and scale inverse, so all-gathering the rowwise data and scales is enough.
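A small pure-Python check illustrates why the transpose is sufficient with square 2D blocks. It is a sketch, not the Transformer Engine code: `block_amax` and `transpose` are hypothetical helpers, the per-tile absolute maximum is used as a proxy for the scale (assuming the real scale depends only on that tile's values), and a 2x2 block stands in for the real block size.

```python
def block_amax(matrix, block):
    """Per-tile max |x| for a matrix tiled into block x block squares."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [
            max(
                abs(matrix[r][c])
                for r in range(br, min(br + block, rows))
                for c in range(bc, min(bc + block, cols))
            )
            for bc in range(0, cols, block)
        ]
        for br in range(0, rows, block)
    ]


def transpose(matrix):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


weight = [
    [1.0, -2.0, 0.5, 4.0],
    [3.0, 0.25, -8.0, 1.0],
    [-0.5, 6.0, 2.0, -1.5],
    [7.0, 1.0, 0.125, 9.0],
]

# Path 1: compute tile statistics on the rowwise layout, then transpose them.
scales_then_transpose = transpose(block_amax(weight, block=2))
# Path 2: transpose the weight first, then compute tile statistics.
transpose_then_scales = block_amax(transpose(weight), block=2)
print(scales_then_transpose == transpose_then_scales)
```

Because a square tile of the transposed weight contains exactly the values of the mirrored tile in the original, both paths agree. With 1D (row-vector) blocks this would not hold, which is consistent with the description limiting the optimization to the 2D case.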