Support for bfloat16 #22

Open
tisnik opened this issue Feb 10, 2020 · 5 comments

Comments

@tisnik

tisnik commented Feb 10, 2020

Thank you for making this very useful and well-tested library! Are you planning to add support for the bfloat16 format, which is used in the ML field? It has different bit widths for the mantissa and exponent, but the other rules are the same as in the IEEE 754 formats.

@x448
Owner

x448 commented Feb 10, 2020

Hi Pavel, I took a quick glance at bfloat16. If I implement it, I think it would be in a separate project.

There would have to be a convenient way for me to compare results with a hardware implementation. I'd like to be able to confirm 100% of float32<-->bfloat16 conversions.

float16 was very convenient because the VM I use for coding has hardware instructions for it (F16C, aka FP16C).

@tisnik
Author

tisnik commented Feb 10, 2020

Thank you for the quick response. Yes, it totally makes sense to create bfloat16 as a separate project (though IMHO most of the code will be very similar). As far as I know, bfloat16 is supported in AVX-512 through the VCVTNE2PS2BF16, VCVTNEPS2BF16 and VDPBF16PS instructions, but I have not tried them (and it very probably won't be possible to use the Go assembler with those pretty new instructions). I planned to create a conversion library myself, but I'm not sure how to handle special cases like denormalized values, sNaNs, qNaNs etc., as some bfloat16 implementations don't follow all IEEE 754 rules.
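
To make the special cases mentioned above concrete, here is a minimal sketch of how a bfloat16 bit pattern could be classified. It assumes the usual layout (1 sign bit, 8 exponent bits, 7 mantissa bits) and assumes the float32 convention that the top mantissa bit marks a quiet NaN. The `Class` type and `Classify` function are hypothetical names for illustration, not part of any existing library, and implementations that deviate from IEEE 754 may treat some of these classes differently (for example, flushing subnormals to zero).

```go
package main

import "fmt"

// Class labels a bfloat16 bit pattern. Hypothetical type, for illustration.
type Class int

const (
	Zero Class = iota
	Subnormal
	Normal
	Infinity
	QuietNaN
	SignalingNaN
)

var classNames = [...]string{
	"zero", "subnormal", "normal", "infinity", "quiet NaN", "signaling NaN",
}

func (c Class) String() string { return classNames[c] }

// Classify inspects the 8 exponent bits and 7 mantissa bits of a bfloat16
// value stored in a uint16. The sign bit is ignored.
func Classify(b uint16) Class {
	exp := (b >> 7) & 0xFF // 8 exponent bits
	man := b & 0x7F        // 7 mantissa bits
	switch {
	case exp == 0 && man == 0:
		return Zero
	case exp == 0:
		return Subnormal
	case exp == 0xFF && man == 0:
		return Infinity
	case exp == 0xFF && man&0x40 != 0:
		// Assumption: top mantissa bit set means quiet, as in float32.
		return QuietNaN
	case exp == 0xFF:
		return SignalingNaN
	default:
		return Normal
	}
}

func main() {
	for _, b := range []uint16{0x0000, 0x0001, 0x3F80, 0x7F80, 0x7FC0, 0x7F81} {
		fmt.Printf("0x%04X -> %v\n", b, Classify(b))
	}
}
```

A conversion library would still have to decide what to do with each class (preserve, flush, or quiet), which is exactly where implementations differ.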

@agj32mrgibbits

agj32mrgibbits commented Jun 7, 2022

Looks to me like conversion between bfloat16 and float32 is a simple and fast shift:

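import (
	"math"
	"strconv"
)

// BFloat16 holds the raw 16 bits: 1 sign, 8 exponent, 7 mantissa.
type BFloat16 uint16

// ToFloat32 is exact: a bfloat16 is the upper half of a float32.
func ToFloat32(x BFloat16) float32 {
	return math.Float32frombits(uint32(x) << 16)
}

// FromFloat32 truncates the low 16 bits (rounds toward zero). A NaN whose
// payload lies only in those low bits becomes an infinity.
func FromFloat32(x float32) BFloat16 {
	return BFloat16(math.Float32bits(x) >> 16)
}

// FromBits reinterprets raw bits as a BFloat16.
func FromBits(u16 uint16) BFloat16 {
	return BFloat16(u16)
}

// Bits returns the raw bits of f.
func Bits(f BFloat16) uint16 {
	return uint16(f)
}

// String formats f as the shortest decimal that round-trips as a float32.
func (f BFloat16) String() string {
	return strconv.FormatFloat(float64(ToFloat32(f)), 'f', -1, 32)
}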

https://go.dev/play/p/jhXQvuI9Pxz
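
One caveat with the plain shift: `FromFloat32` truncates, so it rounds toward zero and can turn a NaN into an infinity when the NaN payload sits only in the low 16 bits. Below is a rough sketch of a common software alternative, round-to-nearest with ties to even, plus explicit NaN handling. `RoundBits` and `FromFloat32RNE` are hypothetical names, and this has not been checked against any hardware implementation, so treat it as one possible approach rather than a reference.

```go
package main

import (
	"fmt"
	"math"
)

// RoundBits converts raw float32 bits to bfloat16 bits using
// round-to-nearest, ties-to-even. Hypothetical helper, not a library API.
func RoundBits(bits uint32) uint16 {
	// NaN: exponent all ones and a non-zero mantissa.
	if bits&0x7FFFFFFF > 0x7F800000 {
		// Keep the sign and upper payload, and set the top mantissa bit so
		// the result cannot collapse into an infinity.
		return uint16(bits>>16) | 0x0040
	}
	// Adding 0x7FFF plus the lowest kept bit rounds halfway cases to even.
	// A carry out of the mantissa bumps the exponent, so the largest finite
	// float32 values round up to infinity.
	bits += 0x7FFF + ((bits >> 16) & 1)
	return uint16(bits >> 16)
}

// FromFloat32RNE is the float32 wrapper around RoundBits.
func FromFloat32RNE(x float32) uint16 {
	return RoundBits(math.Float32bits(x))
}

func main() {
	for _, x := range []float32{1.0, 1.01, 3.14159, float32(math.Inf(1))} {
		fmt.Printf("%v -> 0x%04X\n", x, FromFloat32RNE(x))
	}
}
```

The trade-off is a few extra integer operations per conversion in exchange for a smaller rounding error and NaNs that stay NaNs. Subnormal inputs are rounded like any other value here; an implementation that flushes them to zero would need an extra branch.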

@fxamacker
Collaborator

Support for bfloat16 is also requested in comments at #46
