Request for additional features from TE Debug modules for low precision training. #2801
Description
Is your feature request related to a problem? Please describe.
- For LogFP4TensorStats and LogFP8TensorStats, we want to obtain the standard deviation of the scale factors.
- LogFP8TensorStats and LogFP4TensorStats can't currently be used at the same time. Using both together would be useful when some layers are FP8 and some are FP4.
- [most important] When using the MXFP8 or NVFP4 micro-scaling methods, we only obtain global statistics over the entire tensor instead of statistics per block.
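To illustrate the first point, here is a minimal sketch (plain Python, made-up numbers, not Transformer Engine code) showing that two sets of per-block scale factors can share the same min and max while having very different spread. The helper name `scale_summary` and the scale values are hypothetical.

```python
import statistics


def scale_summary(scales):
    """Summarize a list of per-block scale factors (hypothetical helper)."""
    return {
        "min": min(scales),
        "max": max(scales),
        # Population standard deviation over all blocks.
        "std": statistics.pstdev(scales),
    }


# Two made-up sets of per-block scales with identical min and max.
bimodal = [1.0, 1.0, 1.0, 1.0, 64.0, 64.0, 64.0, 64.0]
clustered = [1.0, 32.0, 32.0, 32.0, 32.0, 32.0, 32.0, 64.0]

a = scale_summary(bimodal)
b = scale_summary(clustered)

# min/max alone cannot tell these apart; std can.
print(a)
print(b)
```

With only min and max logged, both cases look the same, which is why a spread metric would help.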
Describe the solution you'd like
- We want to add a standard deviation metric for the scale factors, because min and max alone don't give us enough information. We could have serious clipping going on and not know it if all we have is a max value.
- Currently, I've had to hack in support by modifying YAML files online in order to perform FP8 and FP4 logging for networks that have mixed FP4/FP8 layers: https://github.com/NVIDIA/bionemo-framework/blob/main/bionemo-recipes/recipes/esm2_native_te/quantization.py#L94. I am hoping this can be supported natively.
- For LogFP8TensorStats and LogFP4TensorStats, specifically with micro-block formats such as MXFP8 / NVFP4, we need an optional mechanism to extract block-level metadata in addition to per-tensor metadata. This would include per-block scale factor statistics, e.g. (scale_value, block_indices, tensor_chunk), saved to files. By default it should work on the first block (0:32 for MXFP8, 0:16 for NVFP4) unless otherwise specified. This would be very useful for identifying issues at a finer granularity.
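A rough sketch of the kind of block-level record I have in mind, in plain Python. This is not Transformer Engine code: `block_records`, its parameters, and the record keys are hypothetical. The scale is simplified to `amax / fmt_max` and skips the format-specific scale encoding (for example, the power-of-two scales used by MXFP8).

```python
E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3
E2M1_MAX = 6.0    # largest finite magnitude in FP4 E2M1


def block_records(values, block_size=32, blocks=(0,), fmt_max=E4M3_MAX):
    """Return one record per requested block of a flat list of values.

    Hypothetical helper. Simplified scale: amax / fmt_max, with no rounding
    of the scale to the format's actual scale encoding.
    """
    records = []
    for b in blocks:
        start = b * block_size
        chunk = values[start:start + block_size]
        if not chunk:
            continue  # requested block lies past the end of the tensor
        amax = max(abs(v) for v in chunk)
        records.append({
            "block_indices": (start, start + len(chunk)),
            "scale_value": amax / fmt_max,
            "tensor_chunk": chunk,
        })
    return records


# Made-up flattened tensor: small values plus one outlier in the second block.
values = [float(i % 7 - 3) for i in range(64)]
values[40] = 448.0

# Default: first block only (0:32, as for MXFP8).
first = block_records(values)[0]

# Explicitly request both 32-element blocks.
both = block_records(values, blocks=(0, 1))

# NVFP4-style block size of 16 with the FP4 E2M1 range.
first_fp4 = block_records(values, block_size=16, fmt_max=E2M1_MAX)[0]
```

Here the outlier only affects the second block's scale, which a tensor-wide statistic would not reveal.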