SIMD Extract MSbs Intrinsic #4466

Barinzaya · 2024-11-07T16:18:03Z

This PR adds a simd_extract_msbs intrinsic, along with an extract_msbs re-export to core:simd. This intrinsic extracts the most significant bit of each element of a #simd vector and packs them into a bit_set. This behavior is similar to the SSE/AVX movemask intrinsic/movmsk instruction. This intrinsic is defined similarly to:

simd_extract_msbs :: proc(a: #simd[N]T) -> bit_set[0..<N] where type_is_integer(T) || type_is_boolean(T)

For example, this code:

v := #simd[8]i32 { -1, -2, +3, +4, -5, +6, +7, -8 }
fmt.println(simd.extract_msbs(v))

will print:

bit_set[0..=7]{0, 1, 4, 7}

since elements 0, 1, 4, and 7 have their most significant bits set (due to being negative).

This intrinsic is particularly useful in conjunction with lane-wise comparison masks, in a few particular use cases that I've found so far.

Counting matches. This can be useful on its own or in conjunction with masked_compress_store, for instance, where card(extract_msbs(v)) can be used to determine how many elements will be written. Examples:

count_range :: proc (src: []#simd[16]f32, lower, upper: f32) -> (result: int) {
	// Counts the number of values in `src` that are in the range `[lower, upper].`
	for v in src {
		mask := simd.lanes_ge(v, auto_cast lower) & simd.lanes_le(v, auto_cast upper)
		result += card(simd.extract_msbs(mask))
	}
	return
}

filter_range :: proc (dst: ^[dynamic]f32, src: []#simd[16]f32, lower, upper: f32) {
	// Appends to `dst` the values of `src` that are in the range `[lower, upper].`
	for v in src {
		mask := simd.lanes_ge(v, auto_cast lower) & simd.lanes_le(v, auto_cast upper)
		old_len := len(dst)
		non_zero_resize(dst, old_len + card(simd.extract_msbs(mask)))
		#no_bounds_check { simd.masked_compress_store(&dst[old_len], v, mask) }
	}
}

This use case can also be done via -simd_reduce_add_ordered(mask), though I find using extract_msbs to be clearer on intent.

Identifying which elements matched, e.g. for searching or looping over them. Examples:

first_range :: proc (src: []#simd[16]f32, lower, upper: f32) -> (at: int, found: bool) {
	// Returns the (flattened) index of the first value in `src` that is in the range `[lower, upper]`.
	for v, i in src {
		matches := simd.extract_msbs(simd.lanes_ge(v, auto_cast lower) & simd.lanes_le(v, auto_cast upper))
		for j in matches {
			at = 16*i + j
			found = true
			return
		}
	}
	return
}

histogram_range :: proc (bins: []int, src: []#simd[16]f32, lower, upper: f32) {
	// Generates histogram bin counts for the values in `src` in the range `[lower, upper)`.
	// bins[0] will be the number of values that fall in the first 1/Nth of the range,
	// bins[1] will be the number of values that fall in the second 1/Nth of the range, etc.
	span := (upper - lower) / f32(len(bins))
	for v in src {
		matches := simd.extract_msbs(simd.lanes_ge(v, auto_cast lower) & simd.lanes_lt(v, auto_cast upper))

		phases := (v - auto_cast lower) / auto_cast span
		indices := cast(#simd[16]i32)phases

		for j in matches {
			// Can't be done in parallel due to potential intersection
			i := int(simd.extract(indices, j))
			bins[i] += 1
		}
	}
	return
}

In more specific cases, the bit_set itself can also be directly useful for doing boolean logic en masse. Condensing the original checks to bit-masks allows the bit-masks themselves to be used in SIMD vectors to do large amounts of boolean logic at once.

Barinzaya force-pushed the simd_extract_msbs branch 2 times, most recently from 9d2e6f5 to 699d3ca Compare November 18, 2024 01:06

Added simd_extract_msbs intrinsic.

ee8a752

Barinzaya force-pushed the simd_extract_msbs branch from 699d3ca to ee8a752 Compare November 21, 2024 17:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SIMD Extract MSbs Intrinsic #4466

SIMD Extract MSbs Intrinsic #4466

Barinzaya commented Nov 7, 2024

SIMD Extract MSbs Intrinsic #4466

Are you sure you want to change the base?

SIMD Extract MSbs Intrinsic #4466

Conversation

Barinzaya commented Nov 7, 2024