Discussion about this post

Wen Chen:

Hello. Thanks for trying Transformer Engine on ROCm. The reason fp8 training used more memory is that it still keeps the master weights in fp32 or bf16 while also maintaining the fp8 weights (and sometimes even their transpose).
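
For a sense of scale, here is a back-of-envelope sketch (my own numbers, not measurements from TE's allocator) of why keeping bf16 master weights alongside the fp8 copies can exceed plain bf16 training:

```python
# Rough per-parameter memory estimate: bf16 baseline vs. fp8 training that
# also keeps bf16 master weights, an fp8 copy, and possibly an fp8 transpose.
# The model size is hypothetical and the byte counts ignore optimizer state
# and activations.
n_params = 7e9  # hypothetical 7B-parameter model

bf16_only = n_params * 2                  # 2 bytes per bf16 weight
fp8_plus_master = n_params * (2 + 1 + 1)  # bf16 master + fp8 weight + fp8 transpose

print(f"bf16 weights only:          {bf16_only / 1e9:.1f} GB")
print(f"fp8 + bf16 master (+ T):    {fp8_plus_master / 1e9:.1f} GB")
```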

I am surprised that you saw a smaller speedup when using full TE than when using only TE Linear. Could you open an issue here (https://github.com/ROCm/TransformerEngine/issues) and provide a reproducer?
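
A minimal reproducer could be as small as a single te.Linear under fp8_autocast. The sketch below uses the TE PyTorch API as I recall it from the upstream docs; the layer sizes and recipe settings are placeholders, and the ROCm fork may differ slightly:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Placeholder recipe; margin/format are the upstream defaults I remember.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# Dimensions chosen to satisfy fp8 alignment requirements (multiples of 16).
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(32, 4096, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.float().sum().backward()
```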

