Request for additional features from TE Debug modules for low precision training. #2801

**Is your feature request related to a problem? Please describe.**

1. For LogFP4TensorStats and LogFP8TensorStats, we want to obtain the standard deviation of the scale factors.
2. LogFP8TensorStats and LogFP4TensorStats can't currently be used at the same time. Using both is needed when some layers run in FP8 and others in FP4.
3. [most important] With the MXFP8 or NVFP4 micro-scaling methods, we only obtain global statistics over the entire tensor rather than statistics per block.
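To make items 1 and 3 concrete, here is a minimal pure-Python sketch. It assumes a simplified scale rule (block amax divided by the FP8 E4M3 maximum of 448) and the 32-element MXFP8 block size mentioned in this issue; the helper names are hypothetical and this is not how Transformer Engine computes or stores its scales. It shows two toy tensors whose scale factors share the same min and max but have a different spread, which only a standard deviation (or per-block output) would reveal.

```python
import statistics

# Assumptions for illustration only: block size 32 (MXFP8, as stated in this
# issue) and scale = block_amax / 448 (448 is the FP8 E4M3 maximum).
# Transformer Engine's real scale computation and storage format differ.
BLOCK_SIZE = 32
FP8_E4M3_MAX = 448.0


def block_scales(values, block_size=BLOCK_SIZE, fmt_max=FP8_E4M3_MAX):
    """Return one illustrative scale factor per contiguous block."""
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scales.append(amax / fmt_max)
    return scales


def summarize(scales):
    """Min, max, and population standard deviation of the scale factors."""
    return {
        "min": min(scales),
        "max": max(scales),
        "std": statistics.pstdev(scales),
    }


def tensor_from_block_amax(amaxes, block_size=BLOCK_SIZE):
    """Build a flat toy tensor whose i-th block has the given absolute maximum."""
    return [float(a) for a in amaxes for _ in range(block_size)]


# Both tensors have the same smallest and largest block amax (1 and 100),
# but the blocks in between are distributed differently.
tensor_a = tensor_from_block_amax([1, 1, 1, 100, 100, 100])
tensor_b = tensor_from_block_amax([1, 50, 50, 50, 50, 100])

stats_a = summarize(block_scales(tensor_a))
stats_b = summarize(block_scales(tensor_b))
print("tensor_a:", stats_a)
print("tensor_b:", stats_b)
```

With only min and max logged, the two tensors are indistinguishable; the `std` entry is the only value that separates them.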


**Describe the solution you'd like**
1. Add a standard deviation metric for the scale factors. Min and max alone don't tell us enough: serious clipping could be going on, and a max value by itself won't reveal it.
2. Currently, I've had to hack in support by modifying the YAML config on the fly so that FP8 and FP4 logging both work for networks with mixed FP4/FP8 layers: https://github.com/NVIDIA/bionemo-framework/blob/main/bionemo-recipes/recipes/esm2_native_te/quantization.py#L94 I am hoping this can be supported natively.
3. For LogFP8TensorStats and LogFP4TensorStats with micro-block formats such as MXFP8 / NVFP4, we need an optional mechanism to extract block-level metadata in addition to the per-tensor metadata. This would include per-block scale factor statistics, e.g. (scale_value, block_indices, tensor_chunk), saved to files. By default it should operate on the first block (0:32 for MXFP8, 0:16 for NVFP4) unless otherwise specified. This would be very useful for identifying issues at a finer granularity.
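As a rough sketch of what item 3 could produce, the snippet below extracts one record per requested block, defaulting to the first block, and writes the records as JSON lines. Everything here is an assumption for illustration: the function names, the record layout, and the simplified scale rule (block amax divided by the format maximum) are hypothetical and are not part of the Transformer Engine debug API. Only the block sizes (32 for MXFP8, 16 for NVFP4) and the field names come from this issue.

```python
import io
import json

# Block sizes as stated in this issue. The maxima are the largest finite
# values of FP8 E4M3 (448) and FP4 E2M1 (6); using amax / max as the scale
# is a simplification, not Transformer Engine's actual scaling recipe.
BLOCK_SIZES = {"MXFP8": 32, "NVFP4": 16}
FORMAT_MAX = {"MXFP8": 448.0, "NVFP4": 6.0}


def block_records(values, fmt, block_ids=(0,)):
    """Return one record per requested block; defaults to the first block."""
    block_size = BLOCK_SIZES[fmt]
    records = []
    for block_id in block_ids:
        start = block_id * block_size
        stop = min(start + block_size, len(values))
        if start >= stop:
            raise IndexError(f"block {block_id} is out of range")
        chunk = list(values[start:stop])
        amax = max(abs(v) for v in chunk)
        records.append({
            "scale_value": amax / FORMAT_MAX[fmt],
            "block_indices": [start, stop],
            "tensor_chunk": chunk,
        })
    return records


def write_records(records, fh):
    """Write records as JSON lines to an open text file handle."""
    for record in records:
        fh.write(json.dumps(record) + "\n")


# Toy flattened tensor with values cycling through -3..3.
values = [float(i % 7 - 3) for i in range(64)]

mxfp8_first = block_records(values, "MXFP8")                # block 0 -> 0:32
nvfp4_first = block_records(values, "NVFP4")                # block 0 -> 0:16
mxfp8_second = block_records(values, "MXFP8", block_ids=(1,))  # 32:64

buffer = io.StringIO()  # stands in for a real log file
write_records(mxfp8_first + nvfp4_first, buffer)
print(buffer.getvalue())
```

Keeping the block selection as an explicit `block_ids` argument keeps the default output small while still allowing a specific block to be dumped when a problem is suspected there.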
