Ok so let's start with the most obvious solution:
memcpy(Ptr, (const char[]){'a', 'b'}, 2);
There's quite an overhead in calling a library function. Compilers sometimes don't optimize it out, and I'd rather not rely on compiler optimizations: even though GCC is smart, if I'm porting a program to more exotic platforms with poor compilers, I don't want to depend on it.
So now there's a more direct approach:
Ptr[0] = 'a';
Ptr[1] = 'b';
It doesn't involve any library-function overhead, but it makes two separate assignments. Third, we have a type pun:
*(uint16_t*)Ptr = *(uint16_t*)(unsigned char[]){'a', 'b'};
Which one should I use if in a bottleneck? What's the fastest way to copy only two bytes in C?
Regards,
Hank Sauri
Only two of the approaches you suggested are correct:
memcpy(Ptr, (const char[]){'a', 'b'}, 2);
and
Ptr[0] = 'a';
Ptr[1] = 'b';
On x86-64 GCC 10.2, both compile to identical code:
mov eax, 25185
mov WORD PTR [something], ax
This is possible because of the as-if rule.
Since a good compiler can figure out that these are identical, use the one that is easier to write in your case. If you're setting one or two bytes, use the latter; if several, use the former, or use a string instead of a compound literal array.
The third one you suggested
*(uint16_t*)Ptr = *(uint16_t*)(unsigned char[]){'a', 'b'};
also compiles to the same code when using x86-64 GCC 10.2, i.e. it would behave identically in this case.
But in addition it has two to four points of undefined behaviour: a strict-aliasing violation at both the source and the destination, coupled with a possible unaligned memory access at both. Undefined behaviour does not mean that it must not work like you intended, but neither does it mean that it has to work as you intended; the behaviour is simply undefined. It can fail to work on any processor, including x86. Why would you care about performance on a bad compiler so much that you would write code that fails to work on a good compiler?!
When in doubt, use the Compiler Explorer.
#include <string.h>
#include <stdint.h>
void c1(char *Ptr) {
memcpy(Ptr, (const char[]){'a', 'b'}, 2);
}
void c2(char *Ptr) {
Ptr[0] = 'a';
Ptr[1] = 'b';
}
void c3(char *Ptr) {
// Bad, bad, not good.
*(uint16_t*)Ptr = *(uint16_t*)(unsigned char[]){'a', 'b'};
}
compiles down to (GCC)
c1:
mov eax, 25185
mov WORD PTR [rdi], ax
ret
c2:
mov eax, 25185
mov WORD PTR [rdi], ax
ret
c3:
mov eax, 25185
mov WORD PTR [rdi], ax
ret
or (Clang)
c1: # #c1
mov word ptr [rdi], 25185
ret
c2: # #c2
mov word ptr [rdi], 25185
ret
c3: # #c3
mov word ptr [rdi], 25185
ret
In C this approach is, no doubt, the fastest:
Ptr[0] = 'a';
Ptr[1] = 'b';
This is why:
All Intel and ARM CPUs can embed small constant data (also called immediate data) in selected assembly instructions, namely the memory-to-CPU and CPU-to-memory data-transfer instructions such as MOV.
That means that when those instructions are fetched from program memory into the CPU, the immediate data arrives at the CPU along with the instruction.
'a' and 'b' are constants and can therefore enter the CPU along with the MOV instruction.
Once the immediate data is in the CPU, the CPU only has to make one access to data memory to write 'a' to Ptr[0].
Ciao,
Enrico Migliore
Related
Is
struct datainfo info = { 0 };
the same as
struct datainfo info;
memset(&info, 0, sizeof(info));
What's the difference, and which is better?
The first one is the best way by a country mile, as it guarantees that the struct members are initialised as they would be for static storage. It's also clearer.
There's no guarantee from a standards perspective that the two ways are equivalent, although a specific compiler may well optimise the first to the second, even if it ends up clobbering parts of memory discarded as padding.
(Note that in C++, the behaviour of the second way could well be undefined. Yes C is not C++ but a fair bit of C code does tend to end up being ported to C++.)
Practically, those two methods are very likely to produce the same result, probably on account of the first being compiled into a call to memset itself on today's common platforms.
From a language-lawyer perspective, the first method will zero-initialize all the members of the structure, but nothing is specified about the values any padding bytes may take (in the individual members or in the structure), while the second method will zero out all the bytes. And to be even more precise, there is no guarantee that an all-bytes-zero pattern is even an object's "zero" value.
Since (if one knows their target platform) the two are pretty much equivalent in every way that counts for the programmer, you choose the one that best suits your preferences.
Personally, I favor the initialization over the call to memset, because it happens at the point of declaration and not in a separate statement, not to mention the portability aspect. That makes it impossible to accidentally add code in between that prevents the initialization from running (however unlikely that may be), or is faulty somehow. But some may say that memset is clearer, even to a programmer reading it later who is not aware of how {0} works. I can't entirely disregard that argument either.
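If the variable needs re-zeroing later, the initializer style also extends past the declaration via a C99 compound literal. A minimal sketch (the struct members and function name here are invented for illustration):

```c
#include <assert.h>

struct datainfo { int i; double d; };

/* Reset an existing variable with a compound literal: the assignment
   gives the same member-wise zero guarantees as "= {0}" at declaration
   (padding bytes, as noted above, remain unspecified either way). */
void reset_info(struct datainfo *p) {
    *p = (struct datainfo){0};
}
```

This keeps the zeroing visible at one point instead of in a separate memset call.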
As noted by others, the code is functionally equivalent.
Using the x86-64 gcc 8.3 compiler
The code:
#include <string.h>
int main(void)
{
struct datainfo { int i; };
struct datainfo info;
memset(&info, 0, sizeof(info));
}
produces the assembly:
main:
push rbp
mov rbp, rsp
sub rsp, 16
lea rax, [rbp-4]
mov edx, 4
mov esi, 0
mov rdi, rax
call memset
mov eax, 0
leave
ret
while the code:
int main(void)
{
struct datainfo { int i; };
struct datainfo info = {0};
}
compiles to:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
mov eax, 0
pop rbp
ret
To my untrained eye, the two outputs are 11 instructions vs 6 instructions, so at least space is more efficient in the second implementation. But as noted by others, the zero initialization method is much more explicit in its intent.
This is my first question, because I couldn't find anything related to this topic.
Recently, while making a class for my C game engine project I've found something interesting:
struct Stack *S1 = new(Stack);
struct Stack *S2 = new(Stack);
S1->bPush(S1, 1, 2); //at this point
bPush is a function pointer in the structure.
So I wondered what the -> operator does in that case, and I discovered:
mov r8b,2 ; a char, written to a low point of register r8
mov dl,1 ; also a char, but to d this time
mov rcx,qword ptr [S1] ; this is the 1st parameter of function
mov rax,qword ptr [S1] ; !Why cannot I use this one?
call qword ptr [rax+1A0h] ; pointer call
So I assume -> writes the object pointer to rcx, and I'd like to reuse it in functions (methods, as they shall be). The question is: how can I do something like
push rcx
// do other call vars
pop rcx
mov qword ptr [this], rcx
before it starts writing the function's other arguments. Something with the preprocessor?
It looks like you'd have an easier time (and get asm that's the same or more efficient) if you wrote in C++ so you could use language built-in support for virtual functions, and for running constructors on initialization. Not to mention not having to manually run destructors. You wouldn't need your struct Class hack.
I'd like to implicitly pass the *this pointer, because as shown in the second asm part it does the same thing twice. Yes, that is what I'm looking for: bPush is part of a struct and cannot be called from outside, but I have to pass the pointer S1, which it already has.
You get inefficient asm because you disabled optimization.
MSVC -O2 or -Ox doesn't reload the static pointer twice. It does waste a mov instruction copying between registers, but if you want better asm use a better compiler (like gcc or clang).
The oldest MSVC on the Godbolt compiler explorer is CL19.0 from MSVC 2015, which compiles this source
struct Stack {
int stuff[4];
void (*bPush)(struct Stack*, unsigned char value, unsigned char length);
};
struct Stack *const S1 = new(Stack);
int foo(){
S1->bPush(S1, 1, 2);
//S1->bPush(S1, 1, 2);
return 0; // prevent tailcall optimization
}
into this asm (Godbolt)
# MSVC 2015 -O2
int foo(void) PROC ; foo, COMDAT
$LN4:
sub rsp, 40 ; 00000028H
mov rax, QWORD PTR Stack * __ptr64 __ptr64 S1
mov r8b, 2
mov dl, 1
mov rcx, rax ;; copy RAX to the arg-passing register
call QWORD PTR [rax+16]
xor eax, eax
add rsp, 40 ; 00000028H
ret 0
int foo(void) ENDP ; foo
(I compiled in C++ mode so I could write S1 = new(Stack) without having to copy your github code, and write it at global scope with a non-constant initializer.)
Clang7.0 -O3 loads into RCX straight away:
# clang -O3
foo():
sub rsp, 40
mov rcx, qword ptr [rip + S1]
mov dl, 1
mov r8b, 2
call qword ptr [rcx + 16] # uses the arg-passing register
xor eax, eax
add rsp, 40
ret
Strangely, clang only decides to use low-byte registers when targeting the Windows ABI with __attribute__((ms_abi)). It uses mov esi, 1 to avoid false dependencies when targeting its default Linux calling convention, not mov sil, 1.
Or if you are using optimization, then an even older MSVC is even worse. In that case you probably can't do anything in the C source to fix it, although you might try using a struct Stack *p = S1 local variable to hand-hold the compiler into loading the global into a register once and reusing it from there.
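A sketch of that hand-holding, with a stub bPush so it stands alone (the stub body and the demo function are invented for illustration):

```c
#include <assert.h>

struct Stack {
    int top;
    void (*bPush)(struct Stack *, unsigned char value, unsigned char length);
};

static void bPush_impl(struct Stack *self, unsigned char value,
                       unsigned char length) {
    (void)length;
    self->top = value;      /* stub body, just to make the sketch runnable */
}

struct Stack *S1;           /* the global, as in the question */

int demo(void) {
    struct Stack *p = S1;   /* load the global exactly once */
    p->bPush(p, 1, 2);      /* the compiler can keep p in one register */
    p->bPush(p, 3, 2);      /* ...and reuse it for the second call */
    return p->top;
}
```

With optimization on, a decent compiler loads S1 once for the whole function instead of once per call.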
I have this piece of code which segfaults when run on Ubuntu 14.04 on an AMD64 compatible CPU:
#include <inttypes.h>
#include <stdlib.h>
#include <sys/mman.h>
int main()
{
uint32_t sum = 0;
uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
uint16_t *p = (uint16_t *)(buffer + 1);
int i;
for (i=0;i<14;++i) {
//printf("%d\n", i);
sum += p[i];
}
return sum;
}
This only segfaults if the memory is allocated using mmap. If I use malloc, a buffer on the stack, or a global variable it does not segfault.
If I decrease the number of iterations of the loop to anything less than 14 it no longer segfaults. And if I print the array index from within the loop it also no longer segfaults.
Why does unaligned memory access segfault on a CPU that is able to access unaligned addresses, and why only under such specific circumstances?
Related: Pascal Cuoq's blog post shows a case where GCC assumes aligned pointers (that two int* don't partially overlap): GCC always assumes aligned pointer accesses. He also links to a 2016 blog post (A bug story: data alignment on x86) that has the exact same bug as this question: auto-vectorization with a misaligned pointer -> segfault.
gcc4.8 makes a loop prologue that tries to reach an alignment boundary, but it assumes that uint16_t *p is 2-byte aligned, i.e. that some number of scalar iterations will make the pointer 16-byte aligned.
I don't think gcc ever intended to support misaligned pointers on x86, it just happened to work for non-atomic types without auto-vectorization. It's definitely undefined behaviour in ISO C to use a pointer to uint16_t with less than alignof(uint16_t)=2 alignment. GCC doesn't warn when it can see you breaking the rule at compile time, and actually happens to make working code (for malloc where it knows the return-value minimum alignment), but that's presumably just an accident of the gcc internals, and shouldn't be taken as an indication of "support".
Try with -O3 -fno-tree-vectorize or -O2. If my explanation is correct, that won't segfault, because it will only use scalar loads (which as you say on x86 don't have any alignment requirements).
gcc knows malloc returns 16-byte aligned memory on this target (x86-64 Linux, where max_align_t is 16 bytes wide because long double has padding out to 16 bytes in the x86-64 System V ABI). It sees what you're doing and uses movdqu.
But gcc doesn't treat mmap as a builtin, so it doesn't know that it returns page-aligned memory, and applies its usual auto-vectorization strategy which apparently assumes that uint16_t *p is 2-byte aligned, so it can use movdqa after handling misalignment. Your pointer is misaligned and violates this assumption.
(I wonder if newer glibc headers use __attribute__((assume_aligned(4096))) to mark mmap's return value as aligned. That would be a good idea, and would probably have given you about the same code-gen as for malloc. Except it wouldn't work because it would break error-checking for mmap != (void*)-1, as @Alcaro points out with an example on Godbolt: https://gcc.godbolt.org/z/gVrLWT)
on a CPU that is able to access unaligned
SSE2 movdqa segfaults on unaligned, and your elements are themselves misaligned so you have the unusual situation where no array element starts at a 16-byte boundary.
SSE2 is baseline for x86-64, so gcc uses it.
Ubuntu 14.04LTS uses gcc4.8.2 (Off topic: which is old and obsolete, worse code-gen in many cases than gcc5.4 or gcc6.4 especially when auto-vectorizing. It doesn't even recognize -march=haswell.)
14 is the minimum threshold for gcc's heuristics to decide to auto-vectorize your loop in this function, with -O3 and no -march or -mtune options.
I put your code on Godbolt, and this is the relevant part of main:
call mmap #
lea rdi, [rax+1] # p,
mov rdx, rax # buffer,
mov rax, rdi # D.2507, p
and eax, 15 # D.2507,
shr rax ##### rax>>=1 discards the low byte, assuming it's zero
neg rax # D.2507
mov esi, eax # prolog_loop_niters.7, D.2507
and esi, 7 # prolog_loop_niters.7,
je .L2
# .L2 leads directly to a MOVDQA xmm2, [rdx+1]
It figures out (with this block of code) how many scalar iterations to do before reaching MOVDQA, but none of the code paths lead to a MOVDQU loop. i.e. gcc doesn't have a code path to handle the case where p is odd.
But the code-gen for malloc looks like this:
call malloc #
movzx edx, WORD PTR [rax+17] # D.2497, MEM[(uint16_t *)buffer_5 + 17B]
movzx ecx, WORD PTR [rax+27] # D.2497, MEM[(uint16_t *)buffer_5 + 27B]
movdqu xmm2, XMMWORD PTR [rax+1] # tmp91, MEM[(uint16_t *)buffer_5 + 1B]
Note the use of movdqu. There are some more scalar movzx loads mixed in: 8 of the 14 total iterations are done SIMD, and the remaining 6 with scalar. This is a missed optimization: it could easily do another 4 with a movq load, especially because that fills an XMM vector after unpacking with zero to get uint32_t elements before adding.
(There are various other missed-optimizations, like maybe using pmaddwd with a multiplier of 1 to add horizontal pairs of words into dword elements.)
Safe code with unaligned pointers:
If you do want to write code which uses unaligned pointers, you can do it correctly in ISO C using memcpy. On targets with efficient unaligned load support (like x86), modern compilers will still just use a simple scalar load into a register, exactly like dereferencing the pointer. But when auto-vectorizing, gcc won't assume that the pointer lines up with element boundaries and will use unaligned loads.
memcpy is how you express an unaligned load / store in ISO C / C++.
#include <string.h>
int sum(int *p) {
int sum=0;
for (int i=0 ; i<10001 ; i++) {
// sum += p[i];
int tmp;
#ifdef USE_ALIGNED
tmp = p[i]; // normal dereference
#else
memcpy(&tmp, &p[i], sizeof(tmp)); // unaligned load
#endif
sum += tmp;
}
return sum;
}
With gcc7.2 -O3 -DUSE_ALIGNED, we get the usual scalar until an alignment boundary, then a vector loop: (Godbolt compiler explorer)
.L4: # gcc7.2 normal dereference
add eax, 1
paddd xmm0, XMMWORD PTR [rdx]
add rdx, 16
cmp ecx, eax
ja .L4
But with memcpy, we get auto-vectorization with an unaligned load (with no intro/outro to handle alignment), unlike gcc's normal preference:
.L2: # gcc7.2 memcpy for an unaligned pointer
movdqu xmm2, XMMWORD PTR [rdi]
add rdi, 16
cmp rax, rdi # end_pointer != pointer
paddd xmm0, xmm2
jne .L2 # -mtune=generic still doesn't optimize for macro-fusion of cmp/jcc :(
# hsum into EAX, then the final odd scalar element:
add eax, DWORD PTR [rdi+40000] # this is how memcpy compiles for normal scalar code, too.
In the OP's case, simply arranging for pointers to be aligned is a better choice. It avoids cache-line splits for scalar code (or for vectorized code the way gcc does it), and it doesn't cost a lot of extra memory or space if the data layout in memory isn't fixed.
But sometimes that's not an option. memcpy fairly reliably optimizes away completely with modern gcc / clang when you copy all the bytes of a primitive type, i.e. to just a load or store, with no function call and no bouncing to an extra memory location. Even at -O0, this simple memcpy inlines with no function call, but of course tmp doesn't optimize away.
Anyway, check the compiler-generated asm if you're worried that it might not optimize away in a more complicated case, or with different compilers. For example, ICC18 doesn't auto-vectorize the version using memcpy.
uint64_t tmp=0; and then memcpy over the low 3 bytes compiles to an actual copy to memory and reload, so that's not a good way to express zero-extension of odd-sized types, for example.
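Concretely, that zero-extending 3-byte load pattern looks like this (function name is mine; it is correct ISO C, but as noted gcc tends to compile it to a real store and reload rather than a single masked load):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 3-byte value into the low bytes of a zeroed uint64_t.
   On a little-endian host this yields p[0] | p[1]<<8 | p[2]<<16. */
uint64_t load24(const unsigned char *p) {
    uint64_t tmp = 0;
    memcpy(&tmp, p, 3);   /* copies only 3 of the 8 bytes; the rest stay zero */
    return tmp;
}
```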
GNU C __attribute__((aligned(1))) and may_alias
Instead of memcpy (which won't inline on some ISAs when GCC doesn't know the pointer is aligned, i.e. exactly this use-case), you can also use a typedef with a GCC attribute to make an under-aligned version of a type.
typedef int __attribute__((aligned(1), may_alias)) unaligned_aliasing_int;
typedef unsigned long __attribute__((may_alias, aligned(1))) unaligned_aliasing_ulong;
related: Why does glibc's strlen need to be so complicated to run quickly? shows how to make a word-at-a-time bithack C strlen safe with this.
Note that it seems ICC doesn't respect __attribute__((may_alias)), but gcc/clang do. I was recently playing around with that trying to write a portable and safe 4-byte SIMD load like _mm_loadu_si32 (which GCC is missing). https://godbolt.org/z/ydMLCK has various combinations of safe everywhere but inefficient code-gen on some compilers, or unsafe on ICC but good everywhere.
aligned(1) may be less bad than memcpy on ISAs like MIPS where unaligned loads can't be done in one instruction.
You use it like any other pointer.
unaligned_aliasing_int *p = something;
int tmp = *p++;
int tmp2 = *p++;
And of course you can index it as normal like p[i].
mov rax,QWORD PTR [rbp-0x10]
mov eax,DWORD PTR [rax]
add eax,0x1
mov DWORD PTR [rbp-0x14], eax
The next lines are written in C, compiled with GCC in a GNU/Linux environment.
The assembly code is for int b = *a + 1;.
...
int a = 5;
int* ptr = &a;
int b = *a + 1;
Dereferencing what's at the address of a and adding 1 to that; after that, storing it in a new variable.
What I don't understand is the second line in that assembly code. Does it mean that I cut the QWORD to get a DWORD (one part of the QWORD) and store that into eax?
Since the code is a few lines long, I would love it broken into step-by-step sections, just to confirm that I'm on the right track and to figure out what that second line does. Thank you.
What I don't understand is the second line in that assembly code. Does it mean that I cut the QWORD to get a DWORD (one part of the QWORD) and store that into eax?
No, the 2nd line dereferences it. There's no splitting up of a qword into two dword halves. (Writing EAX zeros the upper 32 bits of RAX).
It just happens to use the same register that it was using for the pointer, because it doesn't need the pointer anymore.
Compile with optimizations enabled; it's much easier to see what's happening if gcc isn't storing/reloading all the time. (How to remove "noise" from GCC/clang assembly output?)
int foo(int *ptr) {
return *ptr + 1;
}
mov eax, DWORD PTR [rdi]
add eax, 1
ret
(On Godbolt)
int a = 5;
int* ptr = &a;
int b = *a + 1;
Your example is undefined behaviour, as you dereference an integer value converted to a pointer (in this case 5), and it will not compile at all, since you cannot dereference a plain int without a cast.
To make it work you need to cast it first.
int b = *(int *)a + 1;
https://godbolt.org/g/Yo8dd1
Explanation of your assembly code:
line 1: loads rax with the value of a (in this case 5)
line 2: dereferences this value (reads from address 5, so you will probably get a segmentation fault). This code loads from the stack only because you used the -O0 option.
When copying between two structure variables in C, does the back end do a memcpy or an item-by-item copy? Can this be compiler-dependent?
It's heavily compiler-dependent.
Consider a struct with just 2 fields
struct A { int a, b; };
Copying this struct in VS2015 in DEBUG build generates the following asm.
struct A b;
b = a;
mov eax,dword ptr [a]
mov dword ptr [b],eax
mov ecx,dword ptr [ebp-8]
mov dword ptr [ebp-18h],ecx
Now added an array of 100 char and then copy that
struct A
{
int a;
int b;
char x[100];
};
struct A a = { 1,2, {'1', '2'} };
struct A b;
b = a;
mov ecx,1Bh
lea esi,[a]
lea edi,[b]
rep movs dword ptr es:[edi],dword ptr [esi]
Now basically a memcpy is done from address of a to address of b.
It depends on the layout of the struct, the compiler, the level of optimization... a lot of factors.
You should not even think about that. Compilers are only required to ensure that the observable results of the code they generate are the same as what you asked for. Beyond that, they can optimize however they like. That means you should let the compiler choose how it copies structs.
The only case where the above rule does not apply is low-level optimization. But there, other rules apply:
never use low level optimization at early development stages
only do after identifying by profiling the bottlenecks in your code
always use benchmarking to choose the best way
remember that such low-level optimization only makes sense for one (version of a) compiler on one architecture.
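As a sketch of the two spellings being compared (function names are mine), both of which a modern optimizer typically compiles to the same copy code:

```c
#include <assert.h>
#include <string.h>

struct A { int a, b; char x[100]; };

/* Plain assignment: the compiler picks the copy strategy. */
void copy_assign(struct A *dst, const struct A *src) {
    *dst = *src;
}

/* Explicit memcpy: byte-wise, including any padding bytes. */
void copy_memcpy(struct A *dst, const struct A *src) {
    memcpy(dst, src, sizeof *dst);
}
```

If you do care in a hot path, benchmark both on the target compiler rather than assuming either is faster.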