Efficiency of struct copying - C

When copying between two structure variables in C, does the compiler do a memcpy behind the scenes or a member-by-member copy? Can this be compiler dependent?

It's heavily compiler dependent.
Consider a struct with just 2 fields:
struct A { int a, b; };
Copying this struct in VS2015 in a DEBUG build generates the following asm.
struct A b;
b = a;
mov eax,dword ptr [a]        ; load 4 bytes of a
mov dword ptr [b],eax        ; store them into b
mov ecx,dword ptr [ebp-8]    ; load the other 4 bytes
mov dword ptr [ebp-18h],ecx  ; store them into b
Now add an array of 100 chars and copy the struct again:
struct A
{
    int a;
    int b;
    char x[100];
};
struct A a = { 1,2, {'1', '2'} };
struct A b;
b = a;
mov ecx,1Bh                  ; 1Bh = 27 dwords = 108 bytes = sizeof(struct A)
lea esi,[a]                  ; source
lea edi,[b]                  ; destination
rep movs dword ptr es:[edi],dword ptr [esi]   ; block copy
Now essentially a memcpy is done from the address of a to the address of b.
So it depends on the layout of the struct, the compiler, the optimization level... a lot of factors.
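If you want to see what your own compiler does for a particular struct, a minimal test like the one below can be compiled with gcc -S -O2 (or pasted into Compiler Explorer); the function and file names here are just placeholders.
/* inspect the generated assembly with: gcc -S -O2 structcopy.c */
struct A { int a; int b; char x[100]; };

void copy_struct(struct A *dst, const struct A *src)
{
    *dst = *src;   /* whole-struct assignment; the compiler picks the copy strategy */
}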

You should not even think about that. Compilers are only required to ensure that the observable results of the code they generate are the same as what you asked for. Beyond that, they can optimize however they like. That means you should let the compiler choose how it copies structs.
The only case where the above rule does not apply is low-level optimization. But there, other rules apply:
never use low-level optimization in the early stages of development
only do it after profiling has identified the bottlenecks in your code
always use benchmarking to choose the best way (see the sketch below)
remember that such low-level optimization only makes sense for one (version of a) compiler on one architecture.
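A minimal benchmarking sketch, assuming a POSIX clock_gettime is available (on some systems you may need -D_POSIX_C_SOURCE=199309L); the struct, the iteration count, and the extra store that keeps the loop from being optimized away are all made up for illustration.
#include <stdio.h>
#include <string.h>
#include <time.h>

struct big { char data[4096]; };

static struct big src, dst;

int main(void)
{
    struct timespec t0, t1;
    long iterations = 1000000;          /* made-up count; tune for your machine */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iterations; i++) {
        src.data[0] = (char)i;          /* changes the source so the copy cannot be hoisted out */
        dst = src;                      /* or: memcpy(&dst, &src, sizeof dst); */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%d %.3f ms\n", dst.data[0],
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
    return 0;
}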

Related

What's the fastest way to copy two adjacent bytes in C?

Ok so let's start with the most obvious solution:
memcpy(Ptr, (const char[]){'a', 'b'}, 2);
There's quite an overhead in calling a library function. Compilers sometimes don't optimize it away; even though GCC is smart, I wouldn't want to rely on compiler optimizations if I'm porting a program to more exotic platforms with poor compilers.
So now there's a more direct approach:
Ptr[0] = 'a';
Ptr[1] = 'b';
It doesn't involve the overhead of a library function, but it makes two separate assignments. Third, we have a type pun:
*(uint16_t*)Ptr = *(uint16_t*)(unsigned char[]){'a', 'b'};
Which one should I use if this code is in a bottleneck? What's the fastest way to copy only two bytes in C?
Regards,
Hank Sauri
Only two of the approaches you suggested are correct:
memcpy(Ptr, (const char[]){'a', 'b'}, 2);
and
Ptr[0] = 'a';
Ptr[1] = 'b';
On X86 GCC 10.2, both compile to identical code:
mov eax, 25185
mov WORD PTR [something], ax
This is possible because of the as-if rule.
Since a good compiler can figure out that these are identical, use the one that is easier to write in your case. If you're setting one or two bytes, use the latter; if several, use the former, or use a string literal instead of a compound literal array.
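For example, with a string literal the memcpy version stays just as compact, and with GCC and Clang it typically compiles to the same single store shown above:
memcpy(Ptr, "ab", 2);   /* copies 'a' and 'b'; the terminating '\0' of "ab" is not copied */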
The third one you suggested
*(uint16_t*)Ptr = *(uint16_t*)(unsigned char[]){'a', 'b'};
also compiles to the same code when using x86-64 GCC 10.2, i.e. it would behave identically in this case.
But in addition it has two to four points of undefined behaviour: a strict aliasing violation at both the source and the destination, coupled with possibly unaligned memory access at both. Undefined behaviour does not mean that it must fail to work the way you intended, but neither does it mean that it has to work as you intended. The behaviour is simply undefined, and it can fail to work on any processor, including x86. Why would you care so much about performance on a bad compiler that you would write code that can fail to work on a good compiler?
When in doubt, use the Compiler Explorer.
#include <string.h>
#include <stdint.h>
void c1(char *Ptr) {
    memcpy(Ptr, (const char[]){'a', 'b'}, 2);
}
void c2(char *Ptr) {
    Ptr[0] = 'a';
    Ptr[1] = 'b';
}
void c3(char *Ptr) {
    // Bad, bad, not good: undefined behaviour, see above.
    *(uint16_t*)Ptr = *(uint16_t*)(unsigned char[]){'a', 'b'};
}
compiles down to (GCC)
c1:
mov eax, 25185
mov WORD PTR [rdi], ax
ret
c2:
mov eax, 25185
mov WORD PTR [rdi], ax
ret
c3:
mov eax, 25185
mov WORD PTR [rdi], ax
ret
or (Clang)
c1: # #c1
mov word ptr [rdi], 25185
ret
c2: # #c2
mov word ptr [rdi], 25185
ret
c3: # #c3
mov word ptr [rdi], 25185
ret
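Incidentally, if you ever need the reverse of c3, reading two bytes back as a uint16_t, the well-defined way is again memcpy rather than a pointer cast. A minimal sketch (the function name load2 is made up), which GCC and Clang typically compile down to a single 16-bit load:
#include <stdint.h>
#include <string.h>

uint16_t load2(const char *Ptr)
{
    uint16_t v;
    memcpy(&v, Ptr, sizeof v);   /* defined for any alignment and any effective type */
    return v;
}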
In C this approach is, no doubt, the fastest:
Ptr[0] = 'a';
Ptr[1] = 'b';
This is why:
All Intel and ARM CPUs are able to embed small constant data (also called immediate data) within selected assembly instructions. These instructions are memory-to-CPU and CPU-to-memory data transfers, such as MOV.
That means that when those instructions are fetched from program memory into the CPU, the immediate data arrives at the CPU along with the instruction.
'a' and 'b' are constants and therefore can enter the CPU along with the MOV instruction.
Once the immediate data is in the CPU, the CPU only has to make one access to data memory to write 'a' to Ptr[0].
Ciao,
Enrico Migliore

How are C structures & offsets stored in memory by linker?

I'm interested in systems programming, and want to see how structs are implemented in assembly, and how they are linked.
I've written three short .c files, with same-named structs in different files, compiled and linked them together, but I can't understand the output.
I believed that a struct is just a contiguous block of memory in assembly, in the data segment. But I can't access the values after the first int, either by using pointers or by having each function use its own file's offsets. Can someone explain the output?
I tried changing data types and using chars to avoid padding or endianness problems. I also tried using a char* pointer to print the entire structure, which surprisingly gives me weird values (not random, since they are the same in every execution).
#include <stdio.h>
struct S1
{
    int i1;
    int i2;
    int i3;
    int i4;
    int i5;
    int i6;
};
int main()
{
    struct S1 s;
    s.i1 = 5;
    s.i2 = 10;
    s.i3 = 15;
    s.i4 = 16;
    func1(s); //Implicit calls, no declarations needed
              // because linker will know where to find defs
    func2(s);
}
#include <stdio.h>
struct S1
{
    int i1;
    int i2;
    float i3;
};
void func1(struct S1 s)
{
    printf("In func1 : %d %d %f %lu\n", s.i1, s.i2, s.i3, sizeof(s));
};
#include <stdio.h>
struct S3
{
    double i1;
    int i2;
};
void func2(struct S3 s)
{
    printf("In func2 : %lf %d %lu \n", s.i1, s.i2, sizeof(s));
};
I'm getting outputs:
In func1 : 1 0 0.000000 12
In func2 : 0.000000 1 16
I expected the values to be printed
When GCC 9.2 compiles your file with main, we see these instructions for the call to func1:
mov rdx, QWORD PTR [rbp-16]
mov rax, QWORD PTR [rbp-8]
mov rdi, rdx
mov rsi, rax
mov eax, 0
call func1
Note that the compiler has loaded data into general registers—rsi, rdi, and so on. Compare this to the instructions in func1:
mov rdx, rdi
movd eax, xmm0
mov QWORD PTR [rbp-16], rdx
mov DWORD PTR [rbp-8], eax
movss xmm0, DWORD PTR [rbp-8]
In particular, note the movss instruction. That is attempting to retrieve the float i3 member from the xmm0 register. But, as we have seen, the calling routine did not put anything in the xmm0 register.
The compiler has a specification for how arguments are passed between routines. This is called an Application Binary Interface. (The ABI is shared by software intended to be compatible on a particular platform and is often recommended by the processor manufacturer.) For small structures, at least in this case, the compiler does not pass them by pointing to them in memory or reproducing their exact layout. Instead, the members are passed individually, as if they were separate arguments.
Because of this, your code is not studying how structures are laid out in memory. It is studying how structures are passed in function calls. And part of the answer is that members are sometimes passed individually, and where they are passed depends in part on their types. Because main is passing only integer data, it uses registers for integer arguments. Because func1 is expecting some floating-point data, it looks in a register for that. The result is that func1 never gets the data that is passed for i3 and i4.
While Eric's answer is good, I'd like to highlight one additional misunderstanding your question contains:
func1(s); //Implicit calls, no declarations needed
The comment is incorrect. C (since C99) requires functions to be declared before they are called. It does allow you to declare a function without a prototype, as:
void func1();
However, to call it, the promoted types of the arguments must match the parameter types in the function's definition (even if that definition is in another translation unit). If they do not match, the behavior is undefined.
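A minimal sketch of the conventional fix, assuming the intent is for all three files to operate on the same structure; the header name and include guard are placeholders:
/* s1.h - shared header */
#ifndef S1_H
#define S1_H

struct S1
{
    int i1;
    int i2;
    int i3;
    int i4;
    int i5;
    int i6;
};

void func1(struct S1 s);
void func2(struct S1 s);

#endif
main.c, func1.c and func2.c would then all #include "s1.h", so every translation unit sees the same struct S1 layout and real prototypes, and a mismatched call becomes a compile-time error instead of undefined behaviour.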

Struct zero initialization methods

Is
struct datainfo info = { 0 };
the same as
struct datainfo info;
memset(&info, 0, sizeof(info));
What's the difference, and which is better?
The first one is the best way by a country mile, as it guarantees that the struct members are initialised as they would be for static storage. It's also clearer.
There's no guarantee from a standards perspective that the two ways are equivalent, although a specific compiler may well optimise the first to the second, even if it ends up clobbering parts of memory discarded as padding.
(Note that in C++, the behaviour of the second way could well be undefined. Yes, C is not C++, but a fair bit of C code does tend to end up being ported to C++.)
Practically, those two methods are very likely to produce the same result, probably because the first is itself compiled into a call to memset on today's common platforms.
From a language-lawyer perspective, the first method will zero-initialize all the members of the structure, but nothing is specified about the values any padding bytes may take (inside individual members, or in the structure itself). The second method will zero out all the bytes. And to be even more precise, there is no guarantee that an all-bytes-zero pattern is even an object's "zero" value.
Since (if one knows their target platform) the two are pretty much equivalent in every way that counts for the programmer, you choose the one that best suits your preferences.
Personally, I favor the initialization over the call to memset, because it happens at the point of declaration and not in a separate statement, not to mention the portability aspect. That makes it impossible to accidentally add code in between that keeps the initialization from running (however unlikely that may be), or breaks it somehow. But some may say that memset is clearer, even to a programmer reading it later who is not aware of how {0} works. I can't entirely disregard that argument either.
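One more option: since C99 you can also re-zero an existing object later with a compound literal assignment, which keeps the initializer style without reaching for memset. A small sketch (the member here is a stand-in, since the real datainfo members are not shown):
struct datainfo { int value; };        /* stand-in definition for illustration */

void reset_example(void)
{
    struct datainfo info = { 0 };      /* zero-initialized at the point of declaration */
    /* ... use info ... */
    info = (struct datainfo){ 0 };     /* reset later with a C99 compound literal */
}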
As noted by others, the code is functionally equivalent.
Using the x86-64 gcc 8.3 compiler
The code:
#include <string.h>
int main(void)
{
    struct datainfo { int i; };
    struct datainfo info;
    memset(&info, 0, sizeof(info));
}
produces the assembly:
main:
push rbp
mov rbp, rsp
sub rsp, 16
lea rax, [rbp-4]
mov edx, 4
mov esi, 0
mov rdi, rax
call memset
mov eax, 0
leave
ret
while the code:
int main(void)
{
    struct datainfo { int i; };
    struct datainfo info = {0};
}
compiles to:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
mov eax, 0
pop rbp
ret
To my untrained eye, the two outputs are 11 instructions vs 6 instructions, so at least the second implementation is more space efficient. But as noted by others, the zero-initialization method is much more explicit in its intent.

Structure copy Vs Members copy (Time efficient)

We know that we can copy one structure to another directly by assignment.
struct STR
{
    int a;
    int b;
};
struct STR s1 = {4, 5};
struct STR s2;
Method 1:
s2 = s1;
will assign s1 to s2.
Method 2:
s2.a = s1.a;
s2.b = s1.b;
In terms of time efficiency, which method is faster? Or do both take the same time? Consider this especially for big structures used in data handling.
Basically you cannot tell, because it depends on the compiler, the target architecture, and so on...
However, with modern C compilers and optimization enabled, the two will usually be the same. For example, recent GCC on x86-64 generates exactly the same code for both:
struct STR s1, s2;   /* at file scope, so the generated code refers to s1[rip] and s2[rip] */

void a1()
{
    s2 = s1;
}
void a2()
{
    s2.a = s1.a;
    s2.b = s1.b;
}
Produces:
a1():
mov rax, QWORD PTR s1[rip]
mov QWORD PTR s2[rip], rax
ret
a2():
mov rax, QWORD PTR s1[rip]
mov QWORD PTR s2[rip], rax
ret
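For much larger structs the same experiment is still easy to run: GCC typically lowers a whole-struct assignment either to an inline block copy (rep movs) or to an outright call to memcpy. A sketch one could paste into Compiler Explorer to check (the struct name and the payload size are arbitrary):
struct BIGSTR
{
    int a;
    int b;
    char payload[1000];   /* arbitrary bulk, large enough to rule out a register copy */
};

struct BIGSTR big1, big2;

void a3(void)
{
    big2 = big1;          /* typically becomes rep movs or a memcpy call at -O2 */
}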

Cutting QWORD to get DWORD and performing calculations upon it

mov rax,QWORD PTR [rbp-0x10]
mov eax,DWORD PTR [rax]
add eax,0x1
mov DWORD PTR [rbp-0x14], eax
The next lines are written in C, compiled with GCC in a GNU/Linux environment.
The assembly code is for int b = *a + 1;.
...
int a = 5;
int* ptr = &a;
int b = *a + 1;
Dereferencing what is at the address held in a and adding 1 to it; after that, the result is stored in a new variable.
What I don't understand is the second line of that assembly code. Does it mean that I cut the QWORD to get a DWORD (one part of the QWORD) and store that into eax?
Since the code is only a few lines long, I would love to have it broken down step by step, just to confirm that I'm on the right track, and also to figure out what that second line does. Thank you.
What I don't understand is the second line of that assembly code. Does it mean that I cut the QWORD to get a DWORD (one part of the QWORD) and store that into eax?
No, the 2nd line dereferences it. There's no splitting up of a qword into two dword halves. (Writing EAX zeros the upper 32 bits of RAX).
It just happens to use the same register that it was using for the pointer, because it doesn't need the pointer anymore.
Compile with optimizations enabled; it's much easier to see what's happening if gcc isn't storing/reloading all the time. (How to remove "noise" from GCC/clang assembly output?)
int foo(int *ptr) {
    return *ptr + 1;
}
mov eax, DWORD PTR [rdi]
add eax, 1
ret
(On Godbolt)
int a = 5;
int* ptr = &a;
int b = *a + 1;
Your example is undefined behaviour: you dereference an integer value converted to a pointer (in this case 5), and as written it will not compile at all, because you cannot apply * to an int.
To make it compile you need to cast it first:
int b = *(int *)a + 1;
https://godbolt.org/g/Yo8dd1
Explanation of your assembly code:
line 1: loads rax with the value of a (in this case 5)
line 2: dereferences this value (it reads from address 5, so you will probably get a segmentation fault). This code goes through the stack at all only because you compiled with the -O0 option.
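For completeness, here is a version of the fragment that compiles cleanly and has defined behaviour (presumably what was originally intended), dereferencing the pointer rather than the int:
int main(void)
{
    int a = 5;
    int *ptr = &a;
    int b = *ptr + 1;   /* reads 5 through ptr and adds 1, so b is 6 */
    return b;
}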

Resources