How to properly use carry-less multiplication assembly (PCLMULQDQ) in zlib CRC32? - c

I've recently been playing around with CloudFlare's optimized zlib, and the results are really quite impressive.
Unfortunately, they seem to have assumed development of zlib was abandoned, and their fork broke away. I was eventually able to manually rebase their changes on to the current zlib development branch, though it was a real pain in the ass.
Anyway, there's still one major optimization in the CloudFlare code I haven't been able to utilize, namely, the fast CRC32 code implemented with the PCLMULQDQ carry-less multiplication instruction included in newer (Westmere and later) Intel processors, because:
I'm on a Mac, and neither the clang integrated assembler nor Apple's ancient GAS understand the newer GAS mnemonics used,
and
The code was lifted from the Linux kernel and is GPL2, which makes the whole library GPL2, and thereby basically renders it useless for my purposes.
So I did some hunting around, and after a few hours I stumbled onto some code that Apple is using in their bzip2: handwritten, vectorized CRC32 implementations for both arm64 and x86_64.
Bizarrely, the comments for the x86_64 assembly are (only) in the arm64 source, but it does seem to indicate that this code could be used with zlib:
This function SHOULD NOT be called directly. It should be called in a wrapper
function (such as crc32_little in crc32.c) that 1st align an input buffer to 16-byte (update crc along the way),
and make sure that len is at least 16 and SHOULD be a multiple of 16.
Unfortunately, after a few attempts, I seem to be in a bit over my head at this point, and I'm not sure how to actually do that. So I was hoping someone could show me how/where one would call the function provided.
(It also would be fantastic if there were a way to do it where the necessary features were detected at runtime, and could fall back to the software implementation if the hardware features are unavailable, so I wouldn't have to distribute multiple binaries. But, at the very least, if anyone could help me suss out how to get the library to correctly use the Apple PCLMULQDQ-based CRC32, that would go a long way, regardless.)

As it says, you need to calculate the CRC on a 16-byte-aligned buffer whose length is a multiple of 16 bytes. So you cast the current buffer pointer to uintptr_t and, for as long as its 4 least-significant bits are not all zero, you increment the pointer, feeding the bytes into an ordinary CRC-32 routine. Once you're at a 16-byte-aligned address, you round the remaining length down to a multiple of 16 and feed those bytes to the fast CRC-32, then hand the remaining tail bytes back to the slow calculation.
Something like:
// a function for adding a single byte to crc
uint32_t crc32_by_byte(uint32_t crc, uint8_t byte);
// the assembly routine
uint32_t _crc32_vec(uint32_t crc, uint8_t *input, int length);

uint32_t crc = initial_value;
uint8_t *input = whatever;
int length = whatever; // yes, the assembly uses *int* length

assert(length >= 32); // if length is less than 32, just calculate byte by byte

while ((uintptr_t)input & 0xf) { // for as long as input is not 16-byte aligned
    crc = crc32_by_byte(crc, *input++);
    length--;
}

// input is now 16-byte aligned;
// floor length to a multiple of 16
int fast_length = (length >> 4) << 4;
crc = _crc32_vec(crc, input, fast_length);

// do the remaining bytes the slow way
length -= fast_length;
while (length--) {
    crc = crc32_by_byte(crc, *input++);
}
return crc;
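As for the runtime-detection part of the question: GCC and clang both provide __builtin_cpu_supports, which consults CPUID data gathered once at startup, so a dispatching wrapper is only a few lines. A minimal sketch (the two implementation functions are placeholders for zlib's table-driven code and the assembly wrapper above):

#include <stddef.h>
#include <stdint.h>

// placeholders: the portable table-driven CRC and the PCLMULQDQ wrapper
uint32_t crc32_generic(uint32_t crc, const uint8_t *buf, size_t len);
uint32_t crc32_pclmul(uint32_t crc, const uint8_t *buf, size_t len);

uint32_t crc32_dispatch(uint32_t crc, const uint8_t *buf, size_t len)
{
    // detected once, then cached; "pclmul" is the feature name GCC/clang use
    static int has_pclmul = -1;
    if (has_pclmul < 0)
        has_pclmul = __builtin_cpu_supports("pclmul");
    return has_pclmul ? crc32_pclmul(crc, buf, len)
                      : crc32_generic(crc, buf, len);
}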

Related

String length function is unstable

So I made this strlen a while ago and everything seemed fine. But I started noticing bugs in my codebase, and after a while I tracked them down to this strlen function. I used SIMD instructions to write it, and since I'm new to writing intrinsics the code probably isn't the best it could be either.
Here is the function:
inline size_t strlen(const char* data) {
    const __m256i terminationCharacters = _mm256_setzero_si256();
    const size_t shiftAmount = ((size_t)&data) & 31;
    const __m256i* pointer = (const __m256i*)(data - shiftAmount);

    size_t length = 0;
    for (;; length += 32, ++pointer) {
        const __m256i comparingData = _mm256_load_si256(pointer);
        const __m256i comparison = _mm256_cmpeq_epi8(comparingData, terminationCharacters);

        if (!_mm256_testc_si256(terminationCharacters, comparison)) {
            const auto mask = _mm256_movemask_epi8(comparison);
            return length + _tzcnt_u32(mask >> shiftAmount);
        }
    }
}
Your attempt to combine startup handling into the aligned-vector loop has at least 2 showstopper bugs:
You exit the loop if your aligned load finds any zero bytes, even if they're from before the proper start of the string. (@James Griffin spotted this in the comments.) You need to do mask >>= shiftAmount and check that for non-zero to see if there were any matches in the part of the load that comes after the start of the string. (Don't use _mm256_testc_si256 here, just movemask and check.)
_tzcnt_u32(mask >> shiftAmount); is buggy for any vector after the first. The whole vector comes from bytes after the start of the string, so you need tzcnt to see all of the bits. Instead you want _tzcnt_u32(mask) - shiftAmount, I think.
Make yourself some test cases with 0 bytes before the actual string but inside the first aligned vector, and test cases with the final 0 in different places relative to a vector boundary, and compare your version against libc strlen. (Maybe even randomize the 0-position within the first 32 bytes, and then within the first 64 bytes after that.)
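A throwaway harness along these lines would cover that (my_strlen here stands for your function, renamed so it doesn't collide with libc; the alignment attribute is GCC/clang syntax):

#include <assert.h>
#include <stddef.h>
#include <string.h>

size_t my_strlen(const char *data); // the function under test

static void test_my_strlen(void)
{
    static char buf[4096] __attribute__((aligned(32)));
    // every misalignment of the string start, terminator at every offset
    // within the first few vectors, plus a stray 0 just before the start
    // so the "zero byte before the string" bug gets exercised
    for (size_t start = 0; start < 32; start++) {
        for (size_t len = 0; len < 96; len++) {
            memset(buf, 'x', sizeof buf);
            if (start > 0)
                buf[start - 1] = '\0';
            buf[start + len] = '\0';
            assert(my_strlen(buf + start) == strlen(buf + start));
        }
    }
}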
Your strategy for handling unaligned startup should work, if you separate it from the loop. (Is it safe to read past the end of a buffer within the same page on x86 and x64?).
Another option is a page-cross check before a first unaligned vector load from the actual start of the string. (But then you need a fallback to something else). Then go aligned: overlap is fine; as long as you calculate the final length correctly, it doesn't matter if you check the same byte twice for being zero.
(You also don't really want the compiler wasting instructions inside the loop incrementing both a pointer and a separate length, so check the resulting asm. A pointer subtraction after the loop should do the trick, even if you have to cast to uintptr_t.
Also, you can subtract the final zero position from the initial function arg instead of from the aligned pointer; then, instead of subtracting shiftAmount twice, you're not using it at all except for the initial alignment.)
Don't use the vptest intrinsic (_mm256_testc_si256) at all, even in the main loop when you should be checking all the bytes; it's not better for _mm_cmp* results. vptest is 2 uops and can't macro-fuse with a branch instruction, but vpmovmskb eax, ymm0 is 1 uop, and test eax,eax / jz .loop is a single macro-fused uop. Even better, you actually need the integer movemask result after the loop, so you already have it.
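Putting all of that together, an untested sketch of the structure I'm describing (compile with -mavx2 -mbmi; an aligned 32-byte load can never cross a page boundary, which is what makes touching bytes before the start of the string safe):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t strlen_avx2(const char *data)
{
    const __m256i zero = _mm256_setzero_si256();
    uintptr_t misalign = (uintptr_t)data & 31;   // note: data, not &data
    const __m256i *p = (const __m256i *)(data - misalign);

    // first, possibly-partial vector: shift out matches from before the string
    __m256i v = _mm256_load_si256(p);
    uint32_t mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero));
    mask >>= misalign;
    if (mask)
        return _tzcnt_u32(mask);

    // main loop: p is aligned and every byte checked is inside the string
    for (;;) {
        v = _mm256_load_si256(++p);
        mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero));
        if (mask) // subtract the original arg, so shiftAmount isn't needed here
            return (size_t)((const char *)p + _tzcnt_u32(mask) - data);
    }
}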
Related
Is it safe to read past the end of a buffer within the same page on x86 and x64?
Why does glibc's strlen need to be so complicated to run quickly? (includes links to hand-written x86-64 asm for glibc's strlen implementation.) Unless you're on a platform with a worse C library, normally you should use that, because glibc uses CPU detection during dynamic linking to select a good version of strlen (and memcpy, etc.) for your CPU. Unaligned-startup for strlen is somewhat tricky, and glibc I think makes reasonable choices, unless the function-call overhead is a big problem. It also has good loop-unrolling techniques for big strings (like _mm256_min_epu8 to get a zero in a vector element if either of 2 input vectors had a zero, so it can amortize the actual movemask/branch work over a whole cache-line of data). It might be too aggressive in ramping up to that for medium-length strings though.
Note that glibc's licence is the LGPL, so you can't just copy code from glibc into your project unless your license is compatible. Even writing an intrinsics equivalent of its asm might be questionable.
Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? - a simple SSE2 strlen that doesn't handle misalignment, in hand-written asm. And comments on benchmarking.
https://agner.org/optimize/ - guides and instruction tables, and his subroutine library (in hand-written asm) includes a strlen. (But note it's GPL licensed.)
I assume some of the BSDs and MacOS have an asm strlen under a more permissive license you could use / look at if your project isn't GPL-compatible.
No offense, but
size_t strlen(char *p)
{
    size_t ret_val = 0;
    while (*p++) ret_val++;
    return ret_val;
}
has done its work very well since long ago. Also, today's optimizing compilers generate very tight code for it, and your code is impossible to read.

Create own type of variable

Is it possible to create a custom variable type in C/C++? I want something like a "super long int" that occupies, let's say, 40 bytes and allows the same operations as a usual int (+, -, /, %, <, >, etc.).
There's nothing built-in for something like that, at least not in C. You'll need to use a big-number library like GMP. It doesn't allow for using the normal set of operators, but it can handle numbers of an arbitrarily large size.
EDIT:
If you're targeting C++, GMP does have overloaded operators that will allow you to use the standard set of operators like you would with a regular int. See the manual for more details.
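For illustration, the plain C interface looks like this (link with -lgmp); the C++ interface wraps the same mpz functions behind overloaded operators:

#include <gmp.h>
#include <stdio.h>

int main(void)
{
    mpz_t a, b, sum; // arbitrary-precision integers
    mpz_init_set_str(a, "123456789012345678901234567890", 10);
    mpz_init_set_str(b, "987654321098765432109876543210", 10);
    mpz_init(sum);
    mpz_add(sum, a, b); // sum = a + b
    gmp_printf("%Zd\n", sum);
    mpz_clears(a, b, sum, NULL);
    return 0;
}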
Some CPUs have support for working with very large numbers. With SSE on the x86/x64 architecture you can work with 128-bit (16-byte) values.
With AVX this extends to 256 bits (32 bytes), and the upcoming AVX-512 extension is supposed to bring 512 bits (64 bytes), enabling "super large" integers.
But there are two caveats to these extensions:
The compiler has to support it (GCC, for example, uses immintrin.h for AVX support and xmmintrin.h for SSE support). Alternatively, you can try to implement the abstractions via inline assembly, but then the assembler has to understand these instructions (GCC uses GAS, as far as I know).
The machine you are running the compiled code on has to support these instructions. If the CPU does not support AVX or SSE (depending on what you want to do), the application will crash on these instructions, as the CPU does not understand them.
AVX/SSE is used in the implementations of memset, memcpy, etc., since they also let you reduce the number of memory accesses by a good deal (keep in mind that, while your cache line is only loaded into cache once, each access to it still takes some cycles, and AVX/SSE help you eliminate a good chunk of those costs as well).
Here's a working example (it compiles with GCC 4.9.3; you have to add -mavx to your compiler options):
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t i;

    /*********************************************************************
    **Hack-ish way to ensure that malloc's alignment does not screw with
    **us. On this box it aligns to 0x10 bytes, but AVX needs 0x20.
    *********************************************************************/
    #define AVX_BASE (0x20ULL)
    uint64_t *real_raw = malloc(128);
    uint64_t *raw = (uint64_t*)((uintptr_t)real_raw + (AVX_BASE - ((uintptr_t)real_raw % AVX_BASE)));

    __m256i value = _mm256_setzero_si256();
    for (i = 0; i < 10; i++)
    {
        /*No special function here to do the math.*/
        value += i * i;

        /*************************************************************
        **Extract the value from the register and print the lowest
        **64-bit lane.
        *************************************************************/
        _mm256_store_si256((__m256i*)raw, value);
        printf("%lu\n", raw[0]);
    }

    _mm256_store_si256((__m256i*)raw, value);
    printf("End: %lu\n", raw[0]);
    free(real_raw);
    return 0;
}
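Note that value += i * i leans on GCC's vector extensions (the scalar is broadcast to every lane). The same loop spelled with portable intrinsics, using C11 aligned_alloc instead of the manual alignment hack, would look something like this (needs -mavx2, since _mm256_add_epi64 is an AVX2 instruction):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // C11 aligned_alloc hands back the 32-byte alignment the store wants
    uint64_t *raw = aligned_alloc(32, 32);
    __m256i value = _mm256_setzero_si256();
    for (uint64_t i = 0; i < 10; i++)
    {
        // broadcast i*i into all four 64-bit lanes, then add lane-wise
        value = _mm256_add_epi64(value, _mm256_set1_epi64x((long long)(i * i)));
        _mm256_store_si256((__m256i*)raw, value);
        printf("%llu\n", (unsigned long long)raw[0]);
    }
    free(raw);
    return 0;
}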

8 bit int vs 32 bit int on a 32 bit architecture (GCC)

While coding, I try not to use more variable memory than needed and that leads me to write code like this:
for (uint8 i = 0; i < 32; i++) {
...
}
instead of:
for (int i = 0; i < 32; i++) {
...
}
(uint8 instead of int because i only need to go up to 32)
This would make sense when coding on an 8-bit microprocessor. However, am I saving any resources by running this code on a 32-bit microprocessor (where int might be 16 or 32 bits)? Does a compiler like GCC do any magic/juju underneath when I explicitly use an 8-bit int on a 32-bit architecture?
In most cases, there won't be any difference in memory usage, because i will never be in memory: it will live in a CPU register, and you can't really use one register to store two variables. So i takes one register whether it's uint8 or uint32.
In some rare cases, i will actually be stored in memory because the loop body is so big that all the CPU registers are taken. Even then, there's a good chance you won't gain any memory, because other multi-byte variables will be aligned, and i will be followed by useless padding bytes to align the next variable.
Now, if i is actually stored in memory and there are other 8-bit variables to fill the padding, you may save some memory, but it's so little and so unlikely that it probably isn't worth it. Performance-wise, the difference between 8-bit and 32-bit is very architecture-dependent, but usually it will be the same.
Also, if you are working in a 64-bit environment, this memory saving is probably non-existent, because the 64-bit calling convention imposes 16-byte alignment on the stack (where i would be stored if it were in memory).
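If what you want to express is "at least 8 bits, but whatever is fastest on this target", that's exactly what C99's uint_fast8_t from <stdint.h> is for; a minimal sketch:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    // the library picks the fastest width that is at least 8 bits;
    // on x86-64 glibc this happens to be 8 bits, but other ABIs make
    // the "fast" types register-sized
    for (uint_fast8_t i = 0; i < 32; i++)
        printf("%u\n", (unsigned)i);
    return 0;
}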

Does an 8-bit processor have to face endianness problems?

If I have an int32-type integer in an 8-bit processor's memory (say, an 8051), how can I identify the endianness of that integer? Is it compiler-specific? I think this is important when sending multi-byte data over serial lines, etc.
With an 8 bit microcontroller that has no native support for wider integers, the endianness of integers stored in memory is indeed up to the compiler writer.
The SDCC compiler, which is widely used on 8051, stores integers in little-endian format (the user guide for that compiler claims that it is more efficient on that architecture, due to the presence of an instruction for incrementing a data pointer but not one for decrementing).
If the processor has any operations that act on multi-byte values, or has multi-byte registers, it can have an endianness.
http://69.41.174.64/forum/printable.phtml?id=14233&thread=14207 suggests that the 8051 mixes different endian-ness in different places.
The endianness is specific to the CPU architecture. Since a compiler needs to target a particular CPU, the compiler will have knowledge of the endianness as well. So if you need to send data over a serial connection, a network, etc., you may wish to use the built-in functions to put data in network byte order, especially if your code needs to support multiple architectures.
For more information, see: http://www.gnu.org/s/libc/manual/html_node/Byte-Order.html
It's not just up to the compiler: the '51 has some native 16-bit registers (DPTR and PC in the standard core, ADC_IN, DAC_OUT and such in variants) of a given endianness, which the compiler has to obey. Outside of that, the compiler is free to use any endianness it prefers, or one you choose in the project configuration...
An integer does not have endianness in it. You can't determine just from looking at the bytes whether it's big or little endian. You just have to know: For example if your 8 bit processor is little endian and you're receiving a message that you know to be big endian (because, for example, the field bus system defines big endian), you have to convert values of more than 8 bits. You'll need to either hard-code that or to have some definition on the system on which bytes to swap.
Note that swapping bytes is the easy thing. You may also have to swap bits in bit fields, since the order of bits in bit fields is compiler-specific. Again, you basically have to know this at build time.
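The portable way to avoid hand-maintained swaps is to (de)serialize each field with shifts, which spells out the byte significance explicitly and therefore works on either kind of host; a sketch for 32-bit big-endian ("network order") fields:

#include <stdint.h>

// read a 32-bit big-endian value from a byte stream; the shifts encode
// the significance of each byte, so host endianness never enters into it
uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

void write_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}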
#include <stdio.h>

int main(void)
{
    unsigned long int x = 1;
    unsigned char *px = (unsigned char *) &x;
    puts(*px == 0 ? "big endian" : "little endian");
    return 0;
}
If x is assigned the value 1 then the value 1 will be in the least significant byte.
If we then take the address of x as a pointer to bytes, that pointer points to the lowest memory location of x. If the byte there is 0, the machine is big endian; otherwise it is little endian.
#include <stdio.h>

union foo {
    int as_int;
    char as_bytes[sizeof(int)];
};

int main() {
    union foo data;
    int i;
    for (i = 0; i < sizeof(int); ++i) {
        data.as_bytes[i] = 1 + i;
    }
    printf("%0x\n", data.as_int);
    return 0;
}
Interpreting the output is up to you: with 4-byte ints, a little-endian machine prints 4030201, a big-endian machine 1020304.

Is there memset() that accepts integers larger than char?

Is there a version of memset() which sets a value that is larger than 1 byte (char)? For example, let's say we have a memset32() function, so using it we can do the following:
int32_t array[10];
memset32(array, 0xDEADBEEF, sizeof(array));
This will set the value 0xDEADBEEF in all the elements of array. Currently it seems to me this can only be done with a loop.
Specifically, I am interested in a 64 bit version of memset(). Know anything like that?
void memset64(void *dest, uint64_t value, uintptr_t size)
{
    uintptr_t i;
    for (i = 0; i < (size & ~7); i += 8)
    {
        memcpy(((char*)dest) + i, &value, 8);
    }
    for (; i < size; i++)
    {
        ((char*)dest)[i] = ((char*)&value)[i & 7];
    }
}
(Explanation, as requested in the comments: when you assign through a pointer, the compiler assumes the pointer is aligned to the type's natural alignment; for uint64_t, that is 8 bytes. memcpy() makes no such assumption. On some hardware unaligned accesses are impossible, so assignment is not a suitable solution unless you know unaligned accesses work on the hardware with a small or no penalty, or know they will never occur, or both. The compiler will replace small memcpy()s and memset()s with more suitable code, so it is not as horrible as it looks; but if you do know enough to guarantee that assignment will always work and your profiler tells you it is faster, you can replace the memcpy with an assignment. The second for() loop is there in case the amount of memory to be filled is not a multiple of 64 bits. If you know it always will be, you can simply drop that loop.)
There's no standard library function afaik. So if you're writing portable code, you're looking at a loop.
If you're writing non-portable code then check your compiler/platform documentation, but don't hold your breath because it's rare to get much help here. Maybe someone else will chip in with examples of platforms which do provide something.
The way you'd write your own depends on whether you can define in the API that the caller guarantees the dst pointer will be sufficiently aligned for 64-bit writes on your platform (or platforms if portable). On any platform that has a 64-bit integer type at all, malloc at least will return suitably-aligned pointers.
If you have to cope with non-alignment, then you need something like moonshadow's answer. The compiler may inline/unroll that memcpy with a size of 8 (and use 32- or 64-bit unaligned write ops if they exist), so the code should be pretty nippy, but my guess is it probably won't special-case the whole function for the destination being aligned. I'd love to be corrected, but fear I won't be.
So if you know that the caller will always give you a dst with sufficient alignment for your architecture, and a length which is a multiple of 8 bytes, then do a simple loop writing a uint64_t (or whatever the 64-bit int is in your compiler) and you'll probably (no promises) end up with faster code. You'll certainly have shorter code.
Whatever the case, if you do care about performance then profile it. If it's not fast enough try again with more optimisation. If it's still not fast enough, ask a question about an asm version for the CPU(s) on which it's not fast enough. memcpy/memset can get massive performance increases from per-platform optimisation.
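For reference, under those assumptions (dst suitably aligned, length a multiple of 8 bytes) the whole function collapses to a sketch like:

#include <stddef.h>
#include <stdint.h>

// caller guarantees: dest is 8-byte aligned, size is a multiple of 8
void memset64_aligned(void *dest, uint64_t value, size_t size)
{
    uint64_t *d = dest;
    for (size_t i = 0; i < size / 8; i++)
        d[i] = value;
}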
Just for the record, the following uses memcpy(..) in the following pattern. Suppose we want to fill an array with 20 integers:
--------------------
First copy one:
N-------------------
Then copy it to the neighbour:
NN------------------
Then copy them to make four:
NNNN----------------
And so on:
NNNNNNNN------------
NNNNNNNNNNNNNNNN----
Then copy enough to fill the array:
NNNNNNNNNNNNNNNNNNNN
This takes O(lg(num)) applications of memcpy(..).
int *memset_int(int *ptr, int value, size_t num) {
    if (num < 1) return ptr;
    memcpy(ptr, &value, sizeof(int));
    size_t start = 1, step = 1;
    for ( ; start + step <= num; start += step, step *= 2)
        memcpy(ptr + start, ptr, sizeof(int) * step);
    if (start < num)
        memcpy(ptr + start, ptr, sizeof(int) * (num - start));
    return ptr;
}
I thought it might be faster than a loop if memcpy(..) was optimised using some hardware block-memory-copy functionality, but it turns out that a simple loop is faster than the above with -O2 and -O3. (At least using MinGW GCC on Windows with my particular hardware.) Without the -O switch, on a 400 MB array the code above is about twice as fast as an equivalent loop and takes 417 ms on my machine, while with optimisation both go to about 300 ms. That works out to roughly a nanosecond per byte, and a clock cycle is about a nanosecond. So either there is no hardware block-memory-copy functionality on my machine, or the memcpy(..) implementation does not take advantage of it.
Check your OS documentation for a local version, then consider just using the loop.
The compiler probably knows more about optimizing memory access on any particular architecture than you do, so let it do the work.
Wrap it up as a library and compile it with all the speed improving optimizations the compiler allows.
wmemset(3) is the wide (wchar_t) version of memset; note that wchar_t is 16 bits on Windows but 32 bits on most Unix-likes, so what it fills with is platform-dependent. I think that's the closest you're going to get in C without a loop.
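A minimal usage sketch:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    // fills wchar_t elements; whether that's a 16- or 32-bit fill
    // depends on the platform, so this is not a portable memset32
    wchar_t buf[8];
    wmemset(buf, L'A', 8);
    printf("%x\n", (unsigned)buf[0]);
    return 0;
}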
If you're just targeting an x86 compiler you could try something like (VC++ example):
inline void memset32(void *buf, uint32_t n, int32_t c)
{
    __asm {
        mov ecx, n
        mov eax, c
        mov edi, buf
        rep stosd
    }
}
Otherwise just write a simple loop and trust the optimizer to know what it's doing, something like:
for (uint32_t i = 0; i < n; i++)
{
    ((int32_t *)buf)[i] = c;
}
If you make it complicated chances are it will end up slower than simpler to optimize code, not to mention harder to maintain.
You should really let the compiler optimize this for you as someone else suggested. In most cases that loop will be negligible.
But if this some special situation and you don't mind being platform specific, and really need to get rid of the loop, you can do this in an assembly block.
// pseudo code
asm
{
    rep stosq ...
}
You can probably google the stosq assembly instruction for the specifics. It shouldn't be more than a few lines of code.
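For what it's worth, a sketch in GCC/clang extended asm (x86-64 only; the constraints pin the operands to the registers rep stosq expects):

#include <stdint.h>

// fill count 64-bit words at dest with value using rep stosq:
// "+D" pins dest to rdi, "+c" pins count to rcx, "a" puts value in rax
static void memset64_stosq(uint64_t *dest, uint64_t value, uint64_t count)
{
    __asm__ volatile ("rep stosq"
                      : "+D"(dest), "+c"(count)
                      : "a"(value)
                      : "memory");
}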
write your own; it's trivial even in asm.
