String length function is unstable - c

So I made this strlen a while ago and everything seemed fine. But I started noticing bugs in my codebase, and after a while I tracked them down to this strlen function. I wrote it with SIMD intrinsics, and I'm new to writing intrinsics, so the code probably isn't the best it could be either.
Here is the function:
inline size_t strlen(const char* data) {
    const __m256i terminationCharacters = _mm256_setzero_si256();
    const size_t shiftAmount = ((size_t)data) & 31;
    const __m256i* pointer = (const __m256i*)(data - shiftAmount);
    size_t length = 0;

    for (;; length += 32, ++pointer) {
        const __m256i comparingData = _mm256_load_si256(pointer);
        const __m256i comparison = _mm256_cmpeq_epi8(comparingData, terminationCharacters);

        if (!_mm256_testc_si256(terminationCharacters, comparison)) {
            const auto mask = _mm256_movemask_epi8(comparison);
            return length + _tzcnt_u32(mask >> shiftAmount);
        }
    }
}

Your attempt to combine startup handling into the aligned-vector loop has at least 2 showstopper bugs:
You exit the loop if your aligned load finds any zero bytes, even if they're from before the proper start of the string. (@James Griffin spotted this in comments). You need to do mask >>= shiftAmount and check that for non-zero to see if there were any matches in the part of the load that comes after the start of the string. (Don't use _mm256_testc_si256, just movemask and check).
_tzcnt_u32(mask >> shiftAmount); is buggy for any vectors after the first. The whole vector comes from bytes after the start of the string, so you need tzcnt to see all of its bits. Instead, you want _tzcnt_u32(mask) - shiftAmount, I think.
Make yourself some test cases with 0 bytes before the actual string but inside the first aligned vector. And test cases with the final 0 in different places relative to a vector, and non-zero and test your version against libc strlen. (Maybe even for some randomized 0-positions within the first 32 bytes, and then within the first 64 bytes after that.)
Your strategy for handling unaligned startup should work, if you separate it from the loop. (Is it safe to read past the end of a buffer within the same page on x86 and x64?).
Another option is a page-cross check before a first unaligned vector load from the actual start of the string. (But then you need a fallback to something else). Then go aligned: overlap is fine; as long as you calculate the final length correctly, it doesn't matter if you check the same byte twice for being zero.
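For reference, a rough sketch of that page-cross check (assuming 4096-byte pages; the helper name is mine):
#include <stdint.h>

// An unaligned 32-byte load starting at p can't fault as long as it doesn't
// cross into the next 4096-byte page, i.e. all 32 bytes sit in one page.
static inline bool load32_crosses_page(const char* p) {
    return ((uintptr_t)p & 4095) > (4096 - 32);
}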
(You also don't really want the compiler to be wasting instructions inside the loop incrementing a pointer and a separate length, so check the resulting asm. A pointer-subtract after the loop should do the trick; cast to uintptr_t if needed.
Also, you can subtract the final zero-position from the initial function arg, instead of from the aligned pointer, so instead of subtracting shiftAmount twice, you're just not using it at all except for the initial alignment.)
Don't use the vptest intrinsic (_mm256_testc_si256) at all, even in the main loop when you should be checking all the bytes; it's not better for _mm_cmp* results. vptest is 2 uops and can't macro-fuse with a branch instruction. But vpmovmskb eax, ymm0 is 1 uop, and test eax,eax / jz .loop is another one macro-fused uop. And even better, you actually need the integer movemask result after the loop, so you already have it.
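Putting those fixes together, here's a rough sketch of what that might look like (untested; assumes AVX2 and BMI1, and the function name is mine; the aligned first load can read bytes before the start of the string but never crosses into an earlier page):
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

inline size_t my_strlen(const char* data) {
    const __m256i zero = _mm256_setzero_si256();
    const size_t shift = ((size_t)data) & 31;
    const __m256i* p = (const __m256i*)(data - shift);

    // First, aligned vector: shift out any match bits from before the start of the string.
    __m256i v = _mm256_load_si256(p);
    uint32_t mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero)) >> shift;
    if (mask)
        return _tzcnt_u32(mask);

    // Main loop: every byte of these vectors is past the start of the string,
    // so the raw movemask result can be used as-is.
    for (;;) {
        ++p;
        v = _mm256_load_si256(p);
        mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero));
        if (mask)
            return (size_t)((const char*)p - data) + _tzcnt_u32(mask);
    }
}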
Related
Is it safe to read past the end of a buffer within the same page on x86 and x64?
Why does glibc's strlen need to be so complicated to run quickly? (includes links to hand-written x86-64 asm for glibc's strlen implementation.) Unless you're on a platform with a worse C library, normally you should use that, because glibc uses CPU detection during dynamic linking to select a good version of strlen (and memcpy, etc.) for your CPU. Unaligned-startup for strlen is somewhat tricky, and glibc I think makes reasonable choices, unless the function-call overhead is a big problem. It also has good loop-unrolling techniques for big strings (like _mm256_min_epu8 to get a zero in a vector element if either of 2 input vectors had a zero, so it can amortize the actual movemask/branch work over a whole cache-line of data). It might be too aggressive in ramping up to that for medium-length strings though.
Note that glibc's licence is the LGPL, so you can't just copy code from glibc into your project unless your license is compatible. Even writing an intrinsics equivalent of its asm might be questionable.
Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? - a simple SSE2 strlen that doesn't handle misalignment, in hand-written asm. And comments on benchmarking.
https://agner.org/optimize/ - guides and instruction tables, and his subroutine library (in hand-written asm) includes a strlen. (But note it's GPL licensed.)
I assume some of the BSDs and MacOS have an asm strlen under a more permissive license you could use / look at if your project isn't GPL-compatible.

No offense but
size_t strlen(const char *p)
{
    size_t ret_val = 0;
    while (*p++) ret_val++;
    return ret_val;
}
has done its job very well for a very long time. Also, today's optimizing compilers generate very tight code for it, and your code is impossible to read.


Can I cast pointers like this?

Code:
unsigned char array_add[8]={0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};
...
if ((*((uint32_t*)array_add)!=0)||(*((uint32_t*)array_add+1)!=0))
{
...
}
I want to check if the array is all zero. So naturally I thought of casting the address of the array, which also happens to be the address of the first member, to an unsigned 32-bit int type, so I'll only need to do this twice, since it's a 64-bit, 8-byte array. Problem is, it compiled successfully but the program crashes every time around here.
I'm running my program on an 8bit microcontroller, cortex-M0.
How wrong am I?
In theory this could work but in practice there is a thing you aren't considering: aligned memory accesses.
If a uint32_t requires aligned memory access (e.g. to 4 bytes), then casting an array of unsigned char, which has a 1-byte alignment requirement, to uint32_t* produces a pointer to an unaligned array of uint32_t.
According to documentation:
There is no support for unaligned accesses on the Cortex-M0 processor. Any attempt to perform an unaligned memory access operation results in a HardFault exception.
In practice this is just dangerous and fragile code which invokes undefined behavior in certain circumstances, as pointed out by Olaf and better explained here.
To test multiple bytes at once, code could use memcmp().
How speedy this is depends more on the compiler, as an optimizing compiler may simply emit code that does a quick 8-byte (or two 4-byte) compare. Even the memcmp() might not be too slow on an 8-bit processor. Profiling the code helps.
Take care with micro-optimizations, as they too often are not an efficient use of a coder's time for the gains involved.
unsigned char array_add[8] = ...
const unsigned char array_zero[8]={0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};
if (memcmp(array_zero, array_add, 8) == 0) ...
Another method uses a union. Be careful not to assume whether add.array8[0] is the most or least significant byte.
union {
    uint8_t array8[8];
    uint64_t array64;
} add;

// The code below checks whether all 8 bytes of add.array8[] are zero.
if (add.array64 == 0)
In general, focus on writing clear code and reserve such small optimizations to very select cases.
I am not sure, but if your array has 8 bytes then just assign its base address to a long long variable and compare it to 0. That should solve your problem of checking whether the array is all 0.
Edit 1: After Olaf's comment I would say replace long long with int64_t. However, why not use a simple loop to iterate over the array and check? 8 chars is all you need to compare.
Edit 2: The other approach could be to OR all elements of the array and then compare the result with 0. If all are 0, then the OR will be zero. I do not know whether CMP or OR will be faster; please refer to the Cortex-M0 docs for exact CPU cycle requirements, though I would expect CMP to be slower.
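A minimal sketch of that OR-accumulate idea (portable, no alignment tricks; the function name is mine):
#include <stddef.h>

// Returns nonzero if every byte of a[0..n-1] is zero.
static int all_bytes_zero(const unsigned char *a, size_t n)
{
    unsigned char acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc |= a[i];   /* acc stays 0 only if every byte is 0 */
    return acc == 0;
}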

How to write loop in C so compiler may use branch on zero after decrement

Processors are known to have special instructions for decrementing a counter and branching if the counter is zero, with very low latency, as the branch instruction does not need to wait for the counter decrement to pass through an integer unit.
Here is a link to the ppc instruction:
https://www.ibm.com/support/knowledgecenter/ssw_aix_53/com.ibm.aix.aixassem/doc/alangref/bc.htm
My usual way of doing what I believe triggers a compiler to generate the appropriate instructions is as follows:
unsigned int ctr = n;
while (ctr--)
    a[ctr] += b[ctr];
Readability is high and it is a decrementing loop branching on zero. As you can see, the branch technically occurs if the counter is zero before the decrement. I was hoping the compiler could do some magic and make it work anyway. Q: Would a compiler have to break any fundamental rules of C in order to mangle it into the special decrement-and-branch instructions (if any)?
Another approach:
unsigned int ctr = n + 1;
while (--ctr) {
    a[ctr - 1] += b[ctr - 1];
}
The branch now happens after the decrement, but there are constants roaming around, making for ugly code. An "index" variable that is one less than the counter would make it look a little prettier, I guess. Looking at the available ppc instructions, the extra calculation for the a and b addresses can still fit in a single instruction, as a load may also perform address arithmetic (add). Not so sure about other instruction sets. My main problem though is if n+1 is larger than the maximum value. Q: Will the decrement wrap it back to the max and loop as usual?
Q: Is there a more commonly used pattern in C for allowing the common instruction?
Edit: ARM has a decrement-and-branch operation, but it branches only if the value is NOT zero. There appears to be an extra condition, just like the ppc bc. From a C point of view it is very much the same thing, so I expect a code snippet to be compilable to that form too without any C standard violation. http://www.heyrick.co.uk/armwiki/Conditional_execution
Edit: Intel has virtually the same branching instruction as ARM: http://cse.unl.edu/~goddard/Courses/CSCE351/IntelArchitecture/InstructionSetSummary.pdf
This is going to depend on the efforts of the optimization writers of your compiler.
For instance, a bdz opcode could be used at the bottom of a loop to "jump over" a different jump that returned to the top. (This would be a bad idea, but it could happen.)
loop:
blah
blah
bdz ... out
b loop
out:
Far more likely would be to decrement and branch if NOT zero, which the PPC also supports.
loop:
blah
blah
bdnz ... loop
fallthru:
Unless you have a compelling reason to try to game the opcodes, I'd suggest that you try to write clean, readable code that minimizes side effects. Your own change from post-decrement to pre-decrement is a good example of that -- one less (unused) side effect for the compiler to worry about.
That way, you'll get the most bang for your optimizing buck. If there's a platform that needs a special version of your code, you can #ifdef the whole thing, and either include inline assembly, or rewrite the code in conjunction with reading the assembly output and running the profiler.
Definitely depends on the compiler, but it's an instruction that is great for performance, so I'd expect compilers to try and maximize its usage.
Since you're linking an AIX reference, I'm assuming you're running xlc. I don't have access to an AIX machine but I do have access to xlc on a Z machine.
The equivalent Z counterpart is the Branch On Count (BCTR) instruction.
I tried 5 examples and checked the listings
int len = strlen(argv[1]);
//Loop header
argv[1][counter] += argv[2][counter];
With the following loop headers:
for (int i = 0; i < len; i++)
for (int i = len-1; i >= 0; i--)
while(--len)
while(len--)
while(len){
len--;
All 5 examples use branch on count at -O1 and higher, and none of them use it at -O0.
I'd trust a modern compiler to be able to find branch on zero opportunities with any standard loop structure.
What about this:
do
{
    a[ctr - 1] += b[ctr - 1];
} while (--ctr);
You'd need an additional check, however:
if (n != 0)
{
    /*...*/
}
if you cannot guarantee this by other means...
Oh, and be aware that ctr has different final values depending on which loop variant you select (0 in mine and your second one, ~0 in your first)...
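For what it's worth, here is how the pieces above fit together (a sketch; the guard keeps n == 0 from wrapping the counter):
void add_arrays(unsigned int *a, const unsigned int *b, unsigned int n)
{
    if (n != 0)                          /* guard: the do/while must not run with n == 0 */
    {
        unsigned int ctr = n;
        do
        {
            a[ctr - 1] += b[ctr - 1];    /* indices n-1 down to 0 */
        } while (--ctr);                 /* decrement, branch while nonzero */
    }
}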

How to properly use carry-less multiplication assembly (PCLMULQDQ) in zlib CRC32?

I've recently been playing around with CloudFlare's optimized zlib, and the results are really quite impressive.
Unfortunately, they seem to have assumed development of zlib was abandoned, and their fork broke away. I was eventually able to manually rebase their changes on to the current zlib development branch, though it was a real pain in the ass.
Anyway, there's still one major optimization in the CloudFlare code I haven't been able to utilize, namely, the fast CRC32 code implemented with the PCLMULQDQ carry-less multiplication instructions included with newer (Haswell and later, I believe) Intel processors, because:
I'm on a Mac, and neither the clang integrated assembler nor Apple's ancient GAS understand the newer GAS mnemonics used,
and
The code was lifted from the Linux kernel and is GPL2, which makes the whole library GPL2, and thereby basically renders it useless for my purposes.
So I did some hunting around, and after a few hours I stumbled onto some code that Apple is using in their bzip2: handwritten, vectorized CRC32 implementations for both arm64 and x86_64.
Bizarrely, the comments for the x86_64 assembly are (only) in the arm64 source, but it does seem to indicate that this code could be used with zlib:
This function SHOULD NOT be called directly. It should be called in a wrapper
function (such as crc32_little in crc32.c) that 1st align an input buffer to 16-byte (update crc along the way),
and make sure that len is at least 16 and SHOULD be a multiple of 16.
But unfortunately, after a few attempts, I seem to be in a bit over my head at this point, and I'm not sure how to actually do that. So I was hoping someone could show me how/where one would call the function provided.
(It also would be fantastic if there were a way to do it where the necessary features were detected at runtime, and could fall back to the software implementation if the hardware features are unavailable, so I wouldn't have to distribute multiple binaries. But, at the very least, if anyone could help me suss out how to get the library to correctly use the Apple PCLMULQDQ-based CRC32, that would go a long way, regardless.)
As it says, you need to calculate the CRC sum on a 16-byte aligned buffer whose length is a multiple of 16 bytes. Thus you'd cast the current buffer pointer to uintptr_t, and for as long as its 4 LSBs are not zero, you increment the pointer, feeding the bytes into an ordinary CRC-32 routine. Once you're at a 16-byte aligned address, you round the remaining length down to a multiple of 16, feed those bytes to the fast CRC-32, and then feed the remaining bytes to the slow calculation.
Something like:
// a function for adding a single byte to crc
uint32_t crc32_by_byte(uint32_t crc, uint8_t byte);

// the assembly routine
uint32_t _crc32_vec(uint32_t crc, uint8_t *input, int length);

uint32_t crc = initial_value;
uint8_t *input = whatever;
int length = whatever; // yes, the assembly uses *int* length.

assert(length >= 32); // if length is less than 32 just calculate byte by byte

while ((uintptr_t)input & 0xf) { // for as long as input is not 16-byte aligned
    crc = crc32_by_byte(crc, *input++);
    length--;
}

// input is now 16-byte aligned
// floor length to a multiple of 16
int fast_length = (length >> 4) << 4;
crc = _crc32_vec(crc, input, fast_length);

// do the remaining bytes
length -= fast_length;
while (length--) {
    crc = crc32_by_byte(crc, *input++);
}
return crc;
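As for the runtime-detection part of the question: with GCC or clang you could gate the fast path on __builtin_cpu_supports and fall back to the byte-at-a-time routine otherwise. A hedged sketch, where crc32_fast is assumed to be a wrapper around the alignment/length handling shown above:
#include <stdint.h>

uint32_t crc32_by_byte(uint32_t crc, uint8_t byte);            /* slow fallback */
uint32_t crc32_fast(uint32_t crc, uint8_t *input, int length); /* aligned PCLMULQDQ path */

uint32_t crc32_dispatch(uint32_t crc, uint8_t *input, int length)
{
#if defined(__GNUC__) && defined(__x86_64__)
    /* PCLMULQDQ is reported as the "pclmul" feature; only take the fast path
       when the CPU has it and the buffer is long enough to be worth it. */
    if (__builtin_cpu_supports("pclmul") && length >= 32)
        return crc32_fast(crc, input, length);
#endif
    while (length--)
        crc = crc32_by_byte(crc, *input++);
    return crc;
}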

Create own type of variable

Is it possible to create a custom type of variable in C/C++? I want something like a "super long int" that occupies, let's say, 40 bytes and allows the same operations as a usual int (+, -, /, %, <, >, etc.).
There's nothing built-in for something like that, at least not in C. You'll need to use a big-number library like GMP. It doesn't allow for using the normal set of operators, but it can handle numbers of an arbitrarily large size.
EDIT:
If you're targeting C++, GMP does have overloaded operators that will allow you to use the standard set of operators like you would with a regular int. See the manual for more details.
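For example, with GMP's C++ interface (mpz_class; link with -lgmpxx -lgmp), a minimal sketch:
#include <gmpxx.h>
#include <iostream>

int main()
{
    mpz_class a("123456789012345678901234567890");   // arbitrary-precision integer
    mpz_class b = a * a + 42;                         // the usual operators just work
    mpz_class r = b % 1000;
    std::cout << b << '\n' << r << '\n';
    return 0;
}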
Some CPUs have support for working with very large numbers. With SSE on the x86-64 architecture you can implement 128-bit values (16 bytes) that can be operated on.
With AVX this extends to 256 bits (32 bytes). The upcoming AVX-512 extension is supposed to provide 512 bits (64 bytes), thus enabling "super large" integers.
But there are two caveats to these extensions:
The compiler has to support it (GCC, for example, uses immintrin.h for AVX support and xmmintrin.h for SSE support). Alternatively you can try to implement the abstractions via inline assembler, but then the assembler has to understand these instructions (GCC uses AS, as far as I know).
The machine you are running the compiled code on has to support these instructions. If the CPU does not support AVX or SSE (depending on what you want to do), the application will crash on these instructions, as the CPU does not understand them.
AVX/SSE is used in the implementations of memset, memcpy, etc., since they also allow you to reduce the number of memory accesses by a good deal (keep in mind that, while your cache line is only going to be loaded into cache once, accessing it still takes some cycles, and AVX/SSE help you eliminate a good chunk of these costs as well).
Here is a working example (compiles with GCC 4.9.3; you have to add -mavx to your compiler options):
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t i;

    /*********************************************************************
    **Hack-ish way to ensure that malloc's alignment does not screw with
    **us. On this box it aligns to 0x10 bytes, but AVX needs 0x20.
    *********************************************************************/
    #define AVX_BASE (0x20ULL)
    uint64_t *real_raw = malloc(128);
    uint64_t *raw = (uint64_t*)((uintptr_t)real_raw + (AVX_BASE - ((uintptr_t)real_raw % AVX_BASE)));

    __m256i value = _mm256_setzero_si256();
    for(i = 0; i < 10; i++)
    {
        /* No special function here to do the math. */
        value += i * i;

        /*************************************************************
        **Extract the value from the register and print the last
        **byte.
        *************************************************************/
        _mm256_store_si256((__m256i*)raw, value);
        printf("%lu\n", raw[0]);
    }

    _mm256_store_si256((__m256i*)raw, value);
    printf("End: %lu\n", raw[0]);

    free(real_raw);
    return 0;
}

Is there memset() that accepts integers larger than char?

Is there a version of memset() which sets a value that is larger than 1 byte (char)? For example, let's say we have a memset32() function, so using it we can do the following:
int32_t array[10];
memset32(array, 0xDEADBEEF, sizeof(array));
This will set the value 0xDEADBEEF in all the elements of array. Currently it seems to me this can only be done with a loop.
Specifically, I am interested in a 64 bit version of memset(). Know anything like that?
#include <stdint.h>
#include <string.h>

void memset64(void *dest, uint64_t value, uintptr_t size)
{
    uintptr_t i;
    for (i = 0; i < (size & (~7)); i += 8)
    {
        memcpy(((char*)dest) + i, &value, 8);
    }
    for ( ; i < size; i++)
    {
        ((char*)dest)[i] = ((char*)&value)[i & 7];
    }
}
(Explanation, as requested in the comments: when you assign to a pointer, the compiler assumes that the pointer is aligned to the type's natural alignment; for uint64_t, that is 8 bytes. memcpy() makes no such assumption. On some hardware unaligned accesses are impossible, so assignment is not a suitable solution unless you know unaligned accesses work on the hardware with small or no penalty, or know that they will never occur, or both. The compiler will replace small memcpy()s and memset()s with more suitable code, so it is not as horrible as it looks; but if you do know enough to guarantee assignment will always work and your profiler tells you it is faster, you can replace the memcpy with an assignment. The second for() loop is present in case the amount of memory to be filled is not a multiple of 64 bits. If you know it always will be, you can simply drop that loop.)
There's no standard library function afaik. So if you're writing portable code, you're looking at a loop.
If you're writing non-portable code then check your compiler/platform documentation, but don't hold your breath because it's rare to get much help here. Maybe someone else will chip in with examples of platforms which do provide something.
The way you'd write your own depends on whether you can define in the API that the caller guarantees the dst pointer will be sufficiently aligned for 64-bit writes on your platform (or platforms if portable). On any platform that has a 64-bit integer type at all, malloc at least will return suitably-aligned pointers.
If you have to cope with non-alignment, then you need something like moonshadow's answer. The compiler may inline/unroll that memcpy with a size of 8 (and use 32- or 64-bit unaligned write ops if they exist), so the code should be pretty nippy, but my guess is it probably won't special-case the whole function for the destination being aligned. I'd love to be corrected, but fear I won't be.
So if you know that the caller will always give you a dst with sufficient alignment for your architecture, and a length which is a multiple of 8 bytes, then do a simple loop writing a uint64_t (or whatever the 64-bit int is in your compiler) and you'll probably (no promises) end up with faster code. You'll certainly have shorter code.
Whatever the case, if you do care about performance then profile it. If it's not fast enough try again with more optimisation. If it's still not fast enough, ask a question about an asm version for the CPU(s) on which it's not fast enough. memcpy/memset can get massive performance increases from per-platform optimisation.
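A sketch of that "caller guarantees alignment" variant (assumes dest is 8-byte aligned and size is a byte count that is a multiple of 8; the name is mine):
#include <stdint.h>
#include <stddef.h>

void memset64_aligned(void *dest, uint64_t value, size_t size)
{
    uint64_t *p = (uint64_t *)dest;   /* caller promises 8-byte alignment */
    for (size_t i = 0; i < size / 8; ++i)
        p[i] = value;                 /* plain 64-bit stores; compilers often vectorize this loop */
}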
Just for the record, the following uses memcpy(..) in a doubling pattern. Suppose we want to fill an array with 20 integers:
--------------------
First copy one:
N-------------------
Then copy it to the neighbour:
NN------------------
Then copy them to make four:
NNNN----------------
And so on:
NNNNNNNN------------
NNNNNNNNNNNNNNNN----
Then copy enough to fill the array:
NNNNNNNNNNNNNNNNNNNN
This takes O(lg(num)) applications of memcpy(..).
#include <string.h>

int *memset_int(int *ptr, int value, size_t num) {
    if (num < 1) return ptr;
    memcpy(ptr, &value, sizeof(int));
    size_t start = 1, step = 1;
    for ( ; start + step <= num; start += step, step *= 2)
        memcpy(ptr + start, ptr, sizeof(int) * step);
    if (start < num)
        memcpy(ptr + start, ptr, sizeof(int) * (num - start));
    return ptr;
}
I thought it might be faster than a loop if memcpy(..) was optimised using some hardware block memory copy functionality, but it turns out that a simple loop is faster than the above with -O2 and -O3. (At least using MinGW GCC on Windows with my particular hardware.) Without the -O switch, on a 400 MB array the code above is about twice as fast as an equivalent loop, and takes 417 ms on my machine, while with optimisation they both go to about 300 ms. Which means that it takes approximately the same number of nanoseconds as bytes, and a clock cycle is about a nanosecond. So either there is no hardware block memory copy functionality on my machine, or the memcpy(..) implementation does not take advantage of it.
Check your OS documentation for a local version, then consider just using the loop.
The compiler probably knows more about optimizing memory access on any particular architecture than you do, so let it do the work.
Wrap it up as a library and compile it with all the speed improving optimizations the compiler allows.
wmemset(3) is the wide-character version of memset (wchar_t, which is 16-bit on some platforms and 32-bit on others). I think that's the closest you're going to get in C, without a loop.
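Usage is analogous to memset, except the count is in wchar_t elements rather than bytes; a small sketch:
#include <wchar.h>

int main(void)
{
    wchar_t buf[16];
    wmemset(buf, L'A', 16);   /* fills 16 wchar_t elements, not 16 bytes */
    return 0;
}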
If you're just targeting an x86 compiler you could try something like (VC++ example):
inline void memset32(void *buf, uint32_t n, int32_t c)
{
    __asm {
        mov ecx, n
        mov eax, c
        mov edi, buf
        rep stosd
    }
}
Otherwise just make a simple loop and trust the optimizer to know what it's doing, just something like:
for (uint32_t i = 0; i < n; i++)
{
    ((int32_t *)buf)[i] = c;
}
If you make it complicated chances are it will end up slower than simpler to optimize code, not to mention harder to maintain.
You should really let the compiler optimize this for you as someone else suggested. In most cases that loop will be negligible.
But if this some special situation and you don't mind being platform specific, and really need to get rid of the loop, you can do this in an assembly block.
//pseudo code
asm
{
rep stosq ...
}
You can probably google stosq assembly command for the specifics. It shouldn't be more than a few lines of code.
write your own; it's trivial even in asm.

Resources