Intrinsic to set value in array based on a BitMask - c

Is there an intrinsic that will set a single value at all the places in an input array where the corresponding position had a 1 bit in the provided BitMask?
For example, with bitmask 10101010 and value 121, it would set positions 0, 2, 4, 6 (counting from the left of the mask as written) to the value 121.

With AVX512, yes. Masked stores are a first-class operation in AVX512.
Use the bitmask as an AVX512 mask for a vector store to an array, using _mm512_mask_storeu_epi8(void* mem_addr, __mmask64 k, __m512i a), which compiles to vmovdqu8. (This requires AVX512BW; with only AVX512F, you can only use 32- or 64-bit element sizes.)
#include <immintrin.h>
#include <stdint.h>
void set_value_in_selected_elements(char *array, uint64_t bitmask, uint8_t value) {
    __m512i broadcastv = _mm512_set1_epi8(value);
    // Integer types are implicitly convertible to/from __mmask types;
    // the compiler emits the KMOV instruction for you.
    _mm512_mask_storeu_epi8(array, bitmask, broadcastv);
}
This compiles (with gcc7.3 -O3 -march=skylake-avx512) to:
vpbroadcastb zmm0, edx
kmovq k1, rsi
vmovdqu8 ZMMWORD PTR [rdi]{k1}, zmm0
vzeroupper
ret
If you want to write zeros in the elements where the bitmap was zero, either use a zero-masking move to create a vector from the mask and store that, or create a 0 / -1 vector with the AVX512BW or DQ intrinsic __m512i _mm512_movm_epi8(__mmask64 k). Other element sizes are available. But a masked store is safe to use even when the array size isn't a multiple of the vector width, because the unselected elements aren't read or rewritten at all; they're truly untouched. (The CPU can take a slow microcode assist if any of the untouched elements would have faulted on a real store, though.)
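For example, a minimal sketch of the zero-filling variant (assuming AVX512BW; set_value_or_zero is a name made up for this example):
#include <immintrin.h>
#include <stdint.h>
// Write `value` where the mask bit is 1 and 0 elsewhere; unlike the masked store
// above, this overwrites all 64 bytes of the destination.
void set_value_or_zero(char *array, uint64_t bitmask, uint8_t value) {
    __m512i broadcastv = _mm512_set1_epi8(value);
    __m512i v = _mm512_maskz_mov_epi8(bitmask, broadcastv); // zero-masking move
    _mm512_storeu_si512(array, v);                          // plain full-width store
}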
Without AVX512, you still asked for "an intrinsic" (singular).
There's pdep, which you can use to expand a bitmap to a byte-map. See my AVX2 left-packing answer for an example of using _pdep_u64(mask, 0x0101010101010101); to unpack each bit in mask to a byte. This gives you 8 bytes in a uint64_t. In C, if you use a union between that and an array, then it gives you an array of 0 / 1 elements. (But of course indexing the array will require the compiler to emit shift instructions, if it hasn't spilled it somewhere first. You probably just want to memcpy the uint64_t into a permanent array.)
But in the more general case (larger bitmaps), or even with 8 elements when you want to blend in new values based on the bitmask, you should use multiple intrinsics to implement the inverse of pmovmskb, and use that to blend. (See the without pdep section below)
In general, if your array fits in 64 bits (e.g. an 8-element char array), you can use pdep. Or if it's an array of 4-bit nibbles, then you can do a 16-bit mask instead of 8.
Otherwise there's no single instruction, and thus no intrinsic. For larger bitmaps, you can process it in 8-bit chunks and store 8-byte chunks into the array.
If your array elements are wider than 8 bits (and you don't have AVX512), you should probably still expand bits to bytes with pdep, but then use [v]pmovzx to expand from bytes to dwords or whatever in a vector. e.g.
// only the low 8 bits of the input matter
__m256i bits_to_dwords(unsigned bitmap) {
    uint64_t mask_bytes = _pdep_u64(bitmap, 0x0101010101010101); // expand bits to bytes
    __m128i byte_vec = _mm_cvtsi64x_si128(mask_bytes);
    return _mm256_cvtepu8_epi32(byte_vec);
}
If you want to leave elements unmodified instead of setting them to zero where the bitmask had zeros, OR with the previous contents instead of assigning / storing.
This is rather inconvenient to express in C / C++ (compared to asm). To copy 8 bytes from a uint64_t into a char array, you can (and should) just use memcpy (to avoid any undefined behaviour because of pointer aliasing or misaligned uint64_t*). This will compile to a single 8-byte store with modern compilers.
But to OR them in, you'd either have to write a loop over the bytes of the uint64_t, or cast your char array to uint64_t*. This usually works fine, because char* can alias anything so reading the char array later doesn't have any strict-aliasing UB. But a misaligned uint64_t* can cause problems even on x86, if the compiler assumes that it is aligned when auto-vectorizing. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Assigning a value other than 0 / 1
Use a multiply by 0xFF to turn the mask of 0/1 bytes into a 0 / -1 mask, and then AND that with a uint64_t that has your value broadcasted to all byte positions.
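A hedged sketch of that (assuming BMI2; set_value_in_8 is a hypothetical helper name, not an existing function):
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
// Write `value` where the corresponding mask bit is 1 and 0 elsewhere, 8 bytes at a time.
static inline void set_value_in_8(char *dst, uint8_t bitmask, uint8_t value) {
    uint64_t bytemask  = _pdep_u64(bitmask, 0x0101010101010101ULL) * 0xFF; // 0 / -1 in each byte
    uint64_t broadcast = value * 0x0101010101010101ULL;                    // value in every byte
    uint64_t result = bytemask & broadcast;
    memcpy(dst, &result, 8);   // compiles to a single 8-byte store
}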
If you want to leave elements unmodified instead of setting them to zero or to value=121, you should probably use SSE2 / SSE4 or AVX2 even if your array has byte elements. Load the old contents, vpblendvb with set1(121), using the byte-mask as a control vector.
vpblendvb only uses the high bit of each byte, so your pdep constant can be 0x8080808080808080 to scatter the input bits to the high bit of each byte, instead of the low bit. (So you don't need to multiply by 0xFF to get an AND mask).
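A sketch of that blend approach for one 8-byte chunk (assuming BMI2 + SSE4.1; the function name is invented for this example):
#include <immintrin.h>
#include <stdint.h>
// Blend `value` into 8 existing bytes wherever the corresponding mask bit is 1,
// leaving the other bytes unmodified (a read/blend/write, not a masked store).
static inline void blend_value_in_8(char *dst, uint8_t bitmask, uint8_t value) {
    // scatter each mask bit to the *high* bit of its byte: pblendvb only looks at bit 7
    uint64_t blendmask = _pdep_u64(bitmask, 0x8080808080808080ULL);
    __m128i mask_vec = _mm_cvtsi64_si128((long long)blendmask);
    __m128i old = _mm_loadl_epi64((const __m128i*)dst);            // load the 8 old bytes
    __m128i blended = _mm_blendv_epi8(old, _mm_set1_epi8((char)value), mask_vec);
    _mm_storel_epi64((__m128i*)dst, blended);                      // store 8 bytes back
}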
If your elements are dword or larger, you could use _mm256_maskstore_epi32. (Use pmovsx instead of zx to copy the sign bit when expanding the mask from bytes to dwords). This can be a perf win over a variable-blend + always read / re-write. Is it possible to use SIMD instruction for replace?.
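A sketch of that for 8 dword elements (assuming BMI2 + AVX2; the function name is an invention for this example):
#include <immintrin.h>
#include <stdint.h>
// Masked-store a dword value into the elements selected by the low 8 bits of the mask,
// leaving the other elements truly untouched.
void maskstore_value_8x32(int32_t *array, unsigned bitmask, int32_t value) {
    uint64_t mask_bytes = _pdep_u64(bitmask, 0x8080808080808080ULL); // bit -> high bit of each byte
    __m128i byte_vec = _mm_cvtsi64_si128((long long)mask_bytes);
    __m256i dword_mask = _mm256_cvtepi8_epi32(byte_vec);             // sign-extend copies bit 7 to bit 31
    _mm256_maskstore_epi32(array, dword_mask, _mm256_set1_epi32(value));
}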
Without pdep
pdep is very slow on Ryzen, and even on Intel it's maybe not the best choice.
The alternative is to turn your bitmask into a vector mask:
is there an inverse instruction to the movemask instruction in intel avx2? and
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
i.e. broadcast your bitmap to every position of a vector (or shuffle it so the right bit of the bitmap is in the corresponding byte), and use a SIMD AND to mask off the appropriate bit for that byte. Then use pcmpeqb/w/d against the AND-mask to find the elements that had their bit set.
You're probably going to want to load / blend / store if you don't want to store zeros where the bitmap was zero.
Use the compare-mask to blend on your value, e.g. with _mm_blendv_epi8 or the 256bit AVX2 version. You can handle bitmaps in 16-bit chunks, producing 16-byte vectors with just a pshufb to send bytes of it to the right elements.
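Putting those pieces together, a hedged sketch for one 16-bit chunk (assuming SSSE3 + SSE4.1; the names are invented for the example):
#include <immintrin.h>
#include <stdint.h>
// Turn a 16-bit bitmap into a byte mask without pdep, then blend `value` over the
// old contents of 16 bytes.
static inline void blend_value_in_16_nopdep(char *dst, uint16_t bitmask, uint8_t value) {
    __m128i bcast = _mm_set1_epi16((short)bitmask);
    // send the low byte of the bitmap to elements 0..7 and the high byte to elements 8..15
    __m128i spread = _mm_shuffle_epi8(bcast, _mm_set_epi8(1,1,1,1, 1,1,1,1, 0,0,0,0, 0,0,0,0));
    __m128i bitsel = _mm_set_epi8((char)0x80,0x40,0x20,0x10,8,4,2,1,
                                  (char)0x80,0x40,0x20,0x10,8,4,2,1);
    __m128i is_set = _mm_cmpeq_epi8(_mm_and_si128(spread, bitsel), bitsel); // 0 / -1 per byte
    __m128i old = _mm_loadu_si128((const __m128i*)dst);
    __m128i merged = _mm_blendv_epi8(old, _mm_set1_epi8((char)value), is_set);
    _mm_storeu_si128((__m128i*)dst, merged);
}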
It's not safe for multiple threads to do this at the same time on the same array even if their bitmaps don't intersect, unless you use masked stores, though.

Related

what is the most efficient way to flip all the bits from the least significant bit up to the most significant last 1 bit value?

Say for example I have a uint8_t that can be of any value, and I want to flip all the bits from the least significant bit up to the most significant set bit. How would I do that in the most efficient way? Is there a solution where I can avoid using a loop?
Here are some cases (the left side is the original bits, the right side is the result after the flips):
00011101 -> 00000010
00000000 -> 00000000
11111111 -> 00000000
11110111 -> 00001000
01000000 -> 00111111
[EDIT]
The type could also be larger than uint8_t, It could be uint32_t, uint64_t and __uint128_t. I just use uint8_t because it's the easiest size to show in the example cases.
In general I expect that most solutions will have roughly this form:
Compute the mask of bits that need to be flipped
XOR by that mask
As mentioned in the comments, x64 is a target of interest, and on x64 you can do step 1 like this:
Find the 1-based position p of the most significant 1, by leading zeroes (_lzcnt_u64) and subtracting that from 64 (or 32 whichever is appropriate).
Create a mask with p consecutive set bits starting from the least significant bit, probably using _bzhi_u64.
There are some variations, such as using BitScanReverse to find the most significant 1 (but it has an ugly case for zero), or using a shift instead of bzhi (but it has an ugly case for 64). lzcnt and bzhi is a good combination with no ugly cases. bzhi requires BMI2 (Intel Haswell or newer, AMD Zen or newer).
Putting it together:
x ^ _bzhi_u64(~(uint64_t)0, 64 - _lzcnt_u64(x))
Which could be further simplified to
_bzhi_u64(~x, 64 - _lzcnt_u64(x))
As shown by Peter. This doesn't follow the original 2-step plan, rather all bits are flipped, and then the bits that were originally leading zeroes are reset.
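As a complete compilable sketch of that expression (assuming BMI2 and LZCNT support, e.g. compiled with -march=haswell):
#include <immintrin.h>
#include <stdint.h>
uint64_t flip_up_to_msb(uint64_t x)
{
    // ~x flips every bit; bzhi then clears the top lzcnt(x) bits again,
    // i.e. the bits that were leading zeros in the original x.  x==0 gives 0.
    return _bzhi_u64(~x, 64 - _lzcnt_u64(x));
}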
Since those original leading zeroes form a contiguous sequence of leading ones in ~x, an alternative to bzhi could be to add the appropriate power of two to ~x (though sometimes zero, which might be thought of as 2^64, putting the set bit just beyond the top of the number). Unfortunately the power of two that we need is a bit annoying to compute; at least I could not come up with a good way to do it, so it seems like a dead end to me.
Step 1 could also be implemented in a generic way (no special operations) using a few shifts and bitwise ORs, like this:
// Get all-ones from the leading 1 down to bit 0
// On x86-64, this is probably slower than Paul R's method using BSR and shift,
// even though you have to special-case x==0
uint64_t m = x | (x >> 1);
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32;   // last step should be removed if x is 32-bit
AMD CPUs have slowish BSR (but fast LZCNT; https://uops.info/), so you might want this shift/or version for uint8_t or uint16_t (where it takes fewest steps), especially if you need compatibility with all CPUs and speed on AMD is more important than on Intel.
This generic version is also useful within SIMD elements, especially narrow ones, where we don't have a leading-zero-count until AVX-512.
TL:DR: use a uint64_t shift to implement efficiently with uint32_t when compiling for 64-bit machines that have lzcnt (AMD since K10, Intel since Haswell). Without lzcnt (only bsr that's baseline for x86) the n==0 case is still special.
For the uint64_t version, the hard part is that you have 65 different possible positions for the highest set bit, including non-existent (lzcnt producing 64 when all bits are zero). But a single shift with 64-bit operand-size on x86 can only produce one of 64 different values (assuming a constant input), since x86 shifts mask the count like foo >> (c&63)
Using a shift requires special-casing one leading-bit-position, typically the n==0 case. As Harold's answer shows, BMI2 bzhi avoids that, allowing bit counts from 0..64.
Same for 32-bit operand-size shifts: they mask c&31. But to generate a mask for uint32_t, we can use a 64-bit shift efficiently on x86-64. (Or 32-bit for uint16_t and uint8_t. Fun fact: x86 asm shifts with 8 or 16-bit operand-size still mask their count mod 32, so they can shift out all the bits without even using a wider operand-size. But 32-bit operand size is efficient, no need to mess with partial-register writes.)
This strategy is even more efficient than bzhi for a type narrower than register width.
// optimized for 64-bit mode, otherwise 32-bit bzhi or a cmov version of Paul R's is good
#ifdef __LZCNT__
#include <immintrin.h>
uint32_t flip_32_on_64(uint32_t n)
{
    uint64_t mask32 = 0xffffffff;  // (uint64_t)(uint32_t)-1u32
    // this needs to be _lzcnt_u32, not __builtin_clz; we need 32 for n==0
    // If lzcnt isn't available, we can't avoid handling n==0 specially
    uint32_t mask = mask32 >> _lzcnt_u32(n);
    return n ^ mask;
}
#endif
This works equivalently for uint8_t and uint16_t (literally the same code with the same mask, using a 32-bit lzcnt on them after zero-extension), but not for uint64_t. (You could use an unsigned __int128 shift, but shrd masks its shift count mod 64, so compilers still need some conditional behaviour to emulate it. So you might as well do a manual cmov or something, or sbb same,same to generate a 0 or -1 in a register as the mask to be shifted.)
Godbolt with gcc and clang. Note that it's not safe to replace _lzcnt_u32 with __builtin_clz; clang11 and later assume that can't produce 32 even when they compile it to an lzcnt instruction1, and optimize the shift operand-size down to 32 which will act as mask32 >> clz(n) & 31.
# clang 14 -O3 -march=haswell (or znver1 or bdver4 or other BMI2 CPUs)
flip_32_on_64:
lzcnt eax, edi # skylake fixed the output false-dependency for lzcnt/tzcnt, but not popcnt. Clang doesn't care, it's reckless about false deps except inside a loop in a single function.
mov ecx, 4294967295
shrx rax, rcx, rax
xor eax, edi
ret
Without BMI2, e.g. with -march=bdver1 or barcelona (aka k10), we get the same code-gen except with shr rax, cl. Those CPUs do still have lzcnt, otherwise this wouldn't compile.
(I'm curious if Intel Skylake Pentium/Celeron run lzcnt as lzcnt or bsf. They lack BMI1/BMI2, but lzcnt has its own feature flag.
It seems low-power uarches as recent as Tremont are missing lzcnt, though, according to InstLatx64 for a Pentium Silver N6005 Jasper Lake-D, Tremont core. I didn't manually look for the feature bit in the raw CPUID dumps of recent Pentium/Celeron, but Instlat does have those available if someone wants to check.)
Anyway, bzhi also requires BMI2, so if you're comparing against that for any size but uint64_t, this is the comparison.
This shrx version can keep its -1 constant around in a register across loops. So the mov reg,-1 can be hoisted out of a loop after inlining, if the compiler has a spare register. The best bzhi strategy doesn't need a mask constant so it has nothing to gain. _bzhi_u64(~x, 64 - _lzcnt_u64(x)) is 5 uops, but works for 64-bit integers on 64-bit machines. Its latency critical path length is the same as this. (lzcnt / sub / bzhi).
Without LZCNT, one option might be to always flip as a way to get FLAGS set for CMOV, and use -1 << bsr(n) to XOR some of them back to the original state. This could reduce critical path latency. IDK if a C compiler could be coaxed into emitting this. Especially not if you want to take advantage of the fact that real CPUs keep the BSR destination unchanged if the source was zero, but only AMD documents this fact. (Intel says it's an "undefined" result.)
(TODO: finish this hand-written asm idea.)
Other C ideas for the uint64_t case: cmov or cmp/sbb (to generate a 0 or -1) in parallel with lzcnt to shorten the critical path latency? See the Godbolt link where I was playing with that.
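One way to express the cmov idea in C is sketched below (my own sketch, not taken from the linked Godbolt; whether the compiler really emits cmov rather than a branch for the ternary isn't guaranteed):
#include <immintrin.h>
#include <stdint.h>
uint64_t flip64_cmov(uint64_t n)
{
    unsigned lz = (unsigned)_lzcnt_u64(n);
    // the ternary handles the n==0 case that a single 64-bit shift can't express
    uint64_t mask = (n == 0) ? 0 : (~0ULL >> lz);
    return n ^ mask;
}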
ARM/AArch64 saturate their shift counts, unlike how x86 masks for scalar. If one could take advantage of that safely (without C shift-count UB) that would be neat, allowing something about as good as this.
x86 SIMD shifts also saturate their counts, which Paul R took advantage of with an AVX-512 answer using vlzcnt and variable-shift. (It's not worth copying data to an XMM reg and back for one scalar shift, though; only useful if you have multiple elements to do.)
Footnote 1: clang codegen with __builtin_clz or ...ll
Using __builtin_clzll(n) will get clang to use 64-bit operand-size for the shift, since values from 32 to 63 become possible. But you can't actually use that to compile for CPUs without lzcnt: the 63-bsr(n) a compiler would use without lzcnt available would not produce the 64 we need for the n==0 case. Not unless you did n<<=1; / n|=1; or something before the bsr and adjusted the result, but that would be slower than cmov.
If you were using a 64-bit lzcnt, you'd want uint64_t mask = -1ULL, since there will be 32 extra leading zeros after zero-extending to uint64_t. Fortunately all-ones is relatively cheap to materialize on all ISAs, so use that instead of 0xffffffff00000000ULL.
Here’s a simple example for 32 bit ints that works with gcc and compatible compilers (clang et al), and is portable across most architectures.
uint32_t flip(uint32_t n)
{
    if (n == 0) return 0;
    uint32_t mask = ~0U >> __builtin_clz(n);
    return n ^ mask;
}
We could avoid the extra check for n==0 if we used lzcnt on x86-64 (or clz on ARM), and we were using a shift that allowed a count of 32. (In C, shifts by the type-width or larger are undefined behaviour. On x86, in practice the shift count is masked &31 for shifts other than 64-bit, so this could be usable for uint16_t or uint8_t using a uint32_t mask.)
Be careful to avoid C undefined behaviour, including any assumption about __builtin_clz with an input of 0; modern C compilers are not portable assemblers, even though we sometimes wish they were when the language doesn't portably expose the CPU features we want to take advantage of. For example, clang assumes that __builtin_clz(n) can't be 32 even when it compiles it to lzcnt.
See #PeterCordes's answer for details.
If your use case is performance-critical you might also want to consider a SIMD implementation for performing the bit flipping operation on a large number of elements. Here's an example using AVX512 for 32 bit elements:
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

void flip(const uint32_t in[], uint32_t out[], size_t n)
{
    assert((n & 15) == 0); // for this example we only handle arrays which are vector multiples in size
    for (size_t i = 0; i + 16 <= n; i += 16)
    {
        __m512i vin = _mm512_loadu_si512(&in[i]);
        __m512i vlz = _mm512_lzcnt_epi32(vin);
        __m512i vmask = _mm512_srlv_epi32(_mm512_set1_epi32(-1), vlz);
        __m512i vout = _mm512_xor_si512(vin, vmask);
        _mm512_storeu_si512(&out[i], vout);
    }
}
This uses the same approach as the other solutions, i.e. count leading zeroes, create mask, XOR, but for 32 bit elements it processes 16 elements per loop iteration. You could implement a 64 bit version of this similarly, but unfortunately there are no equivalent AVX512 intrinsics for element sizes smaller than 32 bits or larger than 64 bits.
You can see the above 32 bit example in action on Compiler Explorer (note: you might need to hit the refresh button at the bottom of the assembly pane to get it to re-compile and run if you get "Program returned: 139" in the output pane - this seems to be due to a glitch in Compiler Explorer currently).

Assigning a 2 byte variable to a 3 byte register?

My watchdog timer has a default value of 0x0fffff and I want to write a 2-byte variable (u2 compare) into it. What happens when I assign the value simply like this:
wdt_register = compare;
What happens to the most significant byte of the register?
Register definition: it's a 3-byte register made up of H, M, and L 8-bit registers. The 4 most significant bits of H are not used, so it's effectively a 20-bit register. The datasheet names the combined register WDTCR_20.
My question is what happens when I assign a value to the register using this line (just an example of a 2-byte value written to a 3-byte register):
WDTCR_20 = 0x1234;
Your WDT is a so-called special function register. In hardware, it may end up being three bytes, or it could be four bytes, some of which are fixed/read-only/unused. Your compiler's implementation of the write is itself implementation-dependent if the SFR is declared in a particular way that makes the compiler emit SFR-specific write instructions.
This effectively makes the result of the assignment implementation-dependent; the high eight bits might end up being discarded, might set some other microarchitectural flags, or might cause a trap/crash if they aren't set to a specific (likely all-zeros) value. It depends on the processor's datasheet; since you didn't mention a processor/toolchain, we don't know exactly.
For example, the AVR-based atmega328p datasheet shows an example of such a register.
In this case, the one-byte register is actually only three bits, effectively (bits 7..3 are fixed to zero on read and ignored on write, and could very well have no physical flip-flop or SRAM cell associated with them).

Summing 8-bit integers in __m512i with AVX intrinsics

AVX512 provides us with intrinsics to sum all cells in a __m512i vector. However, some of the counterparts are missing: there is no _mm512_reduce_add_epi8, yet.
_mm512_reduce_add_ps //horizontal sum of 16 floats
_mm512_reduce_add_pd //horizontal sum of 8 doubles
_mm512_reduce_add_epi32 //horizontal sum of 16 32-bit integers
_mm512_reduce_add_epi64 //horizontal sum of 8 64-bit integers
Basically, I need to implement MAGIC in the following snippet.
__m512i all_ones = _mm512_set1_epi16(1);
short sum_of_ones = MAGIC(all_ones);
/* now sum_of_ones contains 32, the sum of 32 ones. */
The most obvious way would be to use _mm512_storeu_epi8 to store to an array and then sum the array elements, but that would be slow, plus it might invalidate the cache. I suppose there exists a faster approach.
Bonus points for implementing _mm512_reduce_add_epi16 as well.
First of all, _mm512_reduce_add_epi64 does not correspond to a single AVX512 instruction, but it generates a sequence of shuffles and additions.
To reduce 64 epu8 values to 8 epi64 values one usually uses the vpsadbw instruction (SAD=Sum of Absolute Differences) against a zero vector, which then can be reduced further:
long reduce_add_epu8(__m512i a)
{
    return _mm512_reduce_add_epi64(_mm512_sad_epu8(a, _mm512_setzero_si512()));
}
Try it on godbolt: https://godbolt.org/z/1rMiPH. Unfortunately, neither GCC nor Clang seem to be able to optimize away the function if it is used with _mm512_set1_epi16(1).
For epi8 instead of epu8 you need to first add 128 to each element (or XOR with 0x80), then reduce it using vpsadbw, and at the end subtract 64*128 (or 8*128 from each intermediate 64-bit result). [Note: this was wrong in a previous version of this answer.]
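For example, a sketch of that signed variant, built on the epu8 reduction above:
long reduce_add_epi8(__m512i a)
{
    // bias each signed byte into unsigned range (x -> x + 128), reduce, then remove the 64*128 bias
    __m512i biased = _mm512_xor_si512(a, _mm512_set1_epi8((char)0x80));
    return _mm512_reduce_add_epi64(_mm512_sad_epu8(biased, _mm512_setzero_si512())) - 64 * 128;
}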
For epi16 I suggest having a look at what instructions _mm512_reduce_add_epi32 and _mm512_reduce_add_epi64 generate and derive from there what to do.
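One possible shortcut (my suggestion, not something the original answer spells out): widen pairs of int16 to int32 with vpmaddwd against all-ones, then reuse the built-in epi32 reduction.
int reduce_add_epi16(__m512i a)
{
    // each 32-bit lane becomes the sum of one pair of adjacent 16-bit elements
    __m512i pair_sums = _mm512_madd_epi16(a, _mm512_set1_epi16(1));
    return _mm512_reduce_add_epi32(pair_sums);
}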
Overall, as #Mysticial suggested, it depends on your context what the best approach of reducing is. E.g., if you have a very large array of int64 and want a sum as int64, you should just add them together packet-wise and only at the very end reduce one packet to a single int64.

Is bit masking comparable to "accessing an array" in bits?

All the definitions of bit masking I've seen dive right into how to bit mask, use bitwise operations, etc., without explaining a use case for any of it. Is the purpose of updating all the bits you want to keep and all the bits you want to clear to "access an array" in bits?
Is the purpose of updating all the bits you want to keep and all the bits you want to clear to "access an array" in bits?
I will say the answer is no.
When you access an array of int you'll do:
int_array[index] = 42; // Write access
int x = int_array[42]; // Read access
If you want to write similar functions to read/write a specific bit in, e.g., an unsigned int in an "array-like fashion", it could look like:
unsigned a = 0;
set_bit(a, 4); // Set bit number 4
unsigned x = get_bit(a, 4); // Get bit number 4
The implementation of set_bit and get_bit will require (among other things) some bitwise mask operation.
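A minimal sketch of what such helpers could look like (taking a pointer, so the call sites above would pass &a; alternatively set_bit could be a macro):
static inline void set_bit(unsigned *word, unsigned pos)
{
    *word |= 1u << pos;            // build a mask with a single 1 at `pos`, OR it in
}

static inline unsigned get_bit(unsigned word, unsigned pos)
{
    return (word >> pos) & 1u;     // shift the bit down, mask off everything else
}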
So yes - to access bits in an "array like fashion" you'll need masking but...
There are many other uses of bit level masking.
Example:
int buffer[64];
unsigned index = 0;
void add_to_cyclic_buffer(int n)
{
    buffer[index] = n;
    ++index;
    index &= 0x3f;  // Masking by 0x3f ensures index is always in the range 0..63
}
Example:
unsigned a = some_func();
a |= 1; // Make sure a is odd
a &= ~1; // Make sure a is even
Example:
unsigned a = some_func();
a &= ~0xf; // Make sure a is a multiple of 16
These are just a few examples of using "masking" that have nothing to do with accessing bits as an array. Many other examples could be given.
So to conclude:
Masking can be used to write functions that access bits in an array like fashion but masking is used for many other things as well.
So there are 3 (or 4) main uses.
One, as you say, is where you use the word as a set of true/false flags, where each flag is just indexed in a symmetric manner. I use 'word' here to mean the piece of discrete memory that you access in a single operation. So a byte holds 8 bit values, and a 'long long' holds 64 bits. With a bit more effort, an array of words can be used as a larger array of packed flags.
A second is where you are doing some manipulation of the value, but still consider the word to hold one value. There are many tricks like setting or clearing bottom bits to ensure alignment, or clearing top bits to get a modulus, shifting to divide or multiply by powers of 2.
A third use is where you want to pack lots of smaller-ranged values into a word. Each of the values has a particular meaning in context. This may be because you need to communicate with a device that has defined this as the protocol, or because you need to create so many objects that the space saved in each object outweighs the increase in code size and the cost in code speed (though that cost has to be weighed against the increased cache misses that would slow things down if the objects were bigger).
The fourth case, as a distinct flavour of the third, is where these fields are individual 1-bit flags that have specific meanings in the context of the code. Data objects tend to collect a number of such flags, and it is often simply more convenient to store them as bits in a single location than to use a separate byte for each flag. Generally, testing a particular fixed indexed bit or a fixed masked bit is no more expensive in code size or speed than testing a whole byte, though writing can be more complex. The storage savings are clear, so programmers will often declare an enumeration of bit masks by default when faced with creating a number of flags in a structure, or when writing a function.
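As an illustration of the third use, here is a hedged sketch of packing a few small fields into one 32-bit word; the field names and widths are invented for the example:
#include <stdint.h>

// field A: bits 0..3, field B: bits 4..15, field C: bits 16..31
static inline uint32_t pack_fields(uint32_t a, uint32_t b, uint32_t c)
{
    return ((a & 0xFu)    << 0) |
           ((b & 0xFFFu)  << 4) |
           ((c & 0xFFFFu) << 16);
}

static inline uint32_t unpack_field_b(uint32_t word)
{
    return (word >> 4) & 0xFFFu;   // the mask isolates just field B
}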

Fast 8-bit checksum algorithm for heterogenous tuples

Suppose I have triplets containing 3 heterogeneous integer types (int16_t, int32_t, int64_t) and I would like to compute an 8-bit unsigned checksum for these 3 values. Assume all of the values have a uniform distribution across all their significant bits, so we cannot cheat by truncating any of the values and concatenating them.
What's a fast way for me to compute a checksum with relatively low collision rate and non-cryptographic properties? I'm guessing I can concatenate the bytes and use a variant of Fletcher's checksum or Pearson hashing, but all of the implementations I've seen of those seem dated and I'd like to see if I can further exploit any SIMD or properties of modern (Skylake) architecture.
I'm also aware of MurmurHash but it doesn't have an 8-bit implementation.
Since you mention that all of the values are uniformly distributed across all of your bits, you can simply choose any byte in your tuple as your 8-bit hash, ignoring the remaining bits, which is essentially free. The result is a perfectly uniform hash function, which is the best possible (it will have a collision probability of 1 in 256, which is the lower bound for unpredictable input).
You only need a "better" hash function if your input bits are somehow non-uniform (which is the case the overwhelming majority of the time for real data that isn't just random numbers, but I guess your situation is different).
Modern x86 has very fast CRC32C (a hardware instruction added in SSE4.2). You might get good results from concatenating the int32 and int16 into a zero-extended int64_t, and using two CRC32C instructions to accumulate a single checksum. To get the compiler to do this for you, use intrinsics from immintrin.h: unsigned __int64 _mm_crc32_u64(unsigned __int64 crc, unsigned __int64 data).
According to Agner Fog's instruction tables, crc32 has 1 per clock throughput and 3 cycle latency on Skylake, so feeding it 2x 8 bytes and getting a 32-bit result should only take 2 uops / 6 cycle latency. Feed it the uint64_t first so concatenating the uint16 and uint32 are off the critical path, i.e. create instruction-level parallelism between the shift/or and the first crc32.
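A sketch of that (assuming SSE4.2, compiled with -msse4.2 or equivalent; the struct layout and names are invented for the example):
#include <nmmintrin.h>   // or <immintrin.h>
#include <stdint.h>

struct triple { int16_t a; int32_t b; int64_t c; };

static inline uint32_t crc32c_triple(const struct triple *t)
{
    uint64_t crc = 0;
    crc = _mm_crc32_u64(crc, (uint64_t)t->c);  // feed the int64 first
    // meanwhile, zero-extend and concatenate the int32 and int16 (off the critical path)
    uint64_t packed = ((uint64_t)(uint32_t)t->b << 16) | (uint16_t)t->a;
    crc = _mm_crc32_u64(crc, packed);
    return (uint32_t)crc;                      // then reduce to 8 bits as shown next
}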
Then horizontally XOR the crc32c down to 8 bits:
uint32_t crc = my_object_crc32(&my_object);
crc ^= crc>>16;
crc ^= crc>>8;
crc = (uint8_t)crc;
Horizontal xor to mix the bits of a wider crc / hash / checksum into an 8-bit value is applicable to any hash function you want to use.
Or simply take the low byte of the CRC32C. IDK how much if anything you gain from XORing all 4 bytes down to 1. Again, viable with any multi-byte hash function.
You could even just horizontally XOR all the bytes in your input. e.g. load with a 16-byte SSE2 load, and mask off the padding bytes, then pshufd / pxor down to 8 bytes, pshuflw / pxor down to 4 bytes.
Then another pshuflw / pxor down to 2 bytes, and movd to integer for the final shift / xor. (Or you could movd to integer earlier, especially if the compiler has BMI2 rorx to copy-and-shift with one instruction).
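A sketch of that byte-XOR reduction for one 16-byte vector (padding bytes assumed already masked to zero; this variant moves to integer registers a bit earlier than the sequence described above):
#include <immintrin.h>
#include <stdint.h>

static inline uint8_t xor_reduce_16_bytes(__m128i v)
{
    v = _mm_xor_si128(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2))); // fold 16 bytes -> 8
    v = _mm_xor_si128(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 0, 0, 1))); // fold 8 bytes -> 4
    uint32_t x = (uint32_t)_mm_cvtsi128_si32(v);                         // movd to integer
    x ^= x >> 16;                                                        // 4 -> 2
    x ^= x >> 8;                                                         // 2 -> 1
    return (uint8_t)x;
}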

Resources