Request for additional features from TE Debug modules for low precision training. #2801

@jomitchellnv

Is your feature request related to a problem? Please describe.

  1. For LogFp4Tensor stats and LogFP8TensorStats, we want to obtain the standard deviation of the scale factors.
  2. LogFp8 and LogFP4 tensor stats can't currently be used at the same time. Using both together is needed when some layers are FP8 and others are FP4.
  3. [most important] When using the MXFP8 or NVFP4 micro-scaling methods, we only obtain global statistics over the entire tensor, not statistics for each block.
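To illustrate item 3, here is a minimal sketch in plain Python of how a single global statistic can hide block-level behaviour. It does not use any Transformer Engine API; the helper `block_amax` and the toy data are hypothetical, and the block length of 32 is the MXFP8 value mentioned in this issue.

```python
BLOCK_SIZE = 32  # MXFP8 block length from this issue; NVFP4 would use 16


def block_amax(values, block_size=BLOCK_SIZE):
    """Return the max absolute value of each consecutive block (hypothetical helper)."""
    return [
        max(abs(v) for v in values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


# Toy flattened tensor: four blocks of small values, one of which holds an outlier.
values = [0.01] * (4 * BLOCK_SIZE)
values[70] = 100.0  # index 70 falls in block 2 (indices 64..95)

global_amax = max(abs(v) for v in values)
per_block = block_amax(values)

print(global_amax)  # 100.0
print(per_block)    # [0.01, 0.01, 100.0, 0.01]
```

The global value reports only the outlier, while the per-block view shows that three of the four blocks have a much smaller range.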

Describe the solution you'd like

  1. We want to add a standard deviation metric for the scale factors, because min and max alone don't give us enough information. Serious clipping could be occurring, and we wouldn't know it from a max value alone.
  2. Currently, I've had to hack in support by modifying YAML files on the fly in order to perform FP8 and FP4 logging for networks that have mixed FP4/FP8 layers: https://github.com/NVIDIA/bionemo-framework/blob/main/bionemo-recipes/recipes/esm2_native_te/quantization.py#L94 I am hoping this can be supported natively.
  3. For LogFP8TensorStats and LogFP4TensorStats, specifically for micro-block formats such as MXFP8 / NVFP4, we need an optional mechanism to extract block-level metadata in addition to per-tensor metadata. This would include per-block scale factor statistics, e.g. (scale_value, block_indices, tensor_chunk), saved to files. By default it should work on the first block (0:32 for MXFP8, 0:16 for NVFP4) unless otherwise specified. It would be very useful for identifying issues at a finer granularity.
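As a rough sketch of items 1 and 3 (not the TE implementation), the following plain Python shows what a scale-factor standard deviation and a first-block record could look like. Everything here is an assumption for illustration: the helpers `block_scale`, `block_record`, and `scale_stats` are hypothetical, and the power-of-two scale rule is a simplified stand-in that only guarantees `amax / scale <= 448` (the largest finite FP8 E4M3 value). The actual scale computation in TE may differ.

```python
import math
import statistics

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value


def block_scale(chunk, fmt_max=FP8_E4M3_MAX):
    """Simplified power-of-two scale (assumption, not TE's exact rule)."""
    amax = max(abs(v) for v in chunk)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / fmt_max))


def block_record(values, block_size=32, block_index=0):
    """Metadata for one block; defaults to the first block as requested above."""
    start = block_index * block_size
    chunk = values[start:start + block_size]
    return {
        "scale_value": block_scale(chunk),
        "block_indices": (start, start + len(chunk)),
        "tensor_chunk": chunk,
    }


def scale_stats(values, block_size=32):
    """Min, max, and population standard deviation of the per-block scales."""
    scales = [
        block_scale(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]
    return {
        "min": min(scales),
        "max": max(scales),
        "std": statistics.pstdev(scales),
    }


# Two toy tensors whose per-block scales are [1, 1, 1, 8] and [1, 4, 4, 8].
tensor_a = [448.0] * 96 + [3584.0] * 32
tensor_b = [448.0] * 32 + [1792.0] * 64 + [3584.0] * 32

stats_a = scale_stats(tensor_a)
stats_b = scale_stats(tensor_b)
record = block_record(tensor_a)  # first block, 0:32

print(stats_a["min"], stats_a["max"])  # 1.0 8.0
print(stats_b["min"], stats_b["max"])  # 1.0 8.0
print(stats_a["std"] != stats_b["std"])  # True
print(record["scale_value"], record["block_indices"])  # 1.0 (0, 32)
```

Both toy tensors report the same min and max scale, yet their scale distributions differ, which only the standard deviation reveals. For NVFP4, `block_size=16` would be passed instead, with a scale rule appropriate to that format.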

Labels

enhancement: New feature or request
