memcpy vs assignment in C

Under what circumstances should I expect memcpys to outperform assignments on modern INTEL/AMD hardware? I am using GCC 4.2.x on a 32 bit Intel platform (but am interested in 64 bit as well).

You should never expect them to outperform assignments. The reason is that the compiler will use memcpy anyway when it thinks it would be faster (if you compile with optimization flags). If not, and if the structure is reasonably small so that it fits into registers, direct register manipulation may be used, which wouldn't require any memory access at all.
GCC has special block-move patterns internally that figure out when to change registers / memory cells directly, or when to call the memcpy function. Note that when assigning the struct, the compiler knows at compile time how big the move is going to be, so it can unroll small copies (do a move n times in a row instead of looping), for instance. Note -mno-memcpy:
-mmemcpy
-mno-memcpy
Force (do not force) the use of "memcpy()" for non-trivial block moves.
The default is -mno-memcpy, which allows GCC to inline most constant-sized copies.
Who knows better than the compiler itself when to use memcpy?
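As a quick illustration of the point above (a hedged sketch; the struct and function names are made up), a small struct assignment is typically copied through registers, while a large one is typically lowered to the same code an explicit memcpy call would produce:

struct point { int x, y; };          /* small: likely copied via registers        */
struct blob  { char data[4096]; };   /* large: likely lowered to a memcpy call    */

void copy_point(struct point *dst, const struct point *src) {
    *dst = *src;                     /* plain assignment                          */
}

void copy_blob(struct blob *dst, const struct blob *src) {
    *dst = *src;                     /* with -O2, GCC usually emits the same code */
                                     /* as memcpy(dst, src, sizeof *dst) would    */
}

Compiling both with -O2 and looking at the assembly (gcc -O2 -S) is the simplest way to confirm what your particular compiler version actually does.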

Related

Is assigning a pointer in a C program considered atomic on x86-64?

https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html#Atomic-Types says: "In practice, you can assume that int is atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of."
My question is whether pointer assignment can be considered atomic on the x86_64 architecture for a C program compiled with the gcc -m64 flag. The OS is 64-bit Linux and the CPU is an Intel(R) Xeon(R) CPU D-1548. One thread will be setting a pointer and another thread accessing the pointer. There is only one writer thread and one reader thread. The reader should get either the previous value of the pointer or the latest value, and no garbage value in between.
If it is not considered atomic, please let me know how can I use the gcc atomic builtins or maybe memory barrier like __sync_synchronize to achieve the same without using locks. Interested only in C solution and not C++. Thanks!
Bear in mind that atomicity alone is not enough for communicating between threads. Nothing prevents the compiler and CPU from reordering previous/subsequent load and store instructions with that "atomic" store. In the old days people used volatile to prevent that reordering, but volatile was never intended for use with threads and doesn't provide a way to specify a less or more restrictive memory order (see "Relationship with volatile" there).
You should use C11 atomics because they guarantee both atomicity and memory order.
For almost all architectures, pointer load and store are atomic. A once notable exception was 8086/80286 where pointers could be seg:offset; there was an l[des]s instruction which could make an atomic load; but no corresponding atomic store.
The integrity of the pointer is only a small concern; your bigger issue revolves around synchronization: the pointer was at value Y, you set it to X; how will you know when nobody is using the (old) Y value?
A somewhat related problem is that you may have stored things at X which the other thread expects to find there. Without synchronization, the other thread might see the new pointer value while what it points to is not yet up to date from its point of view.
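To make that concrete, here is a minimal hedged sketch of a publish/consume pattern with C11 atomics and release/acquire ordering (shared_msg, publish and try_consume are made-up names; reclaiming the old value is deliberately left out, since that is the harder problem mentioned above):

#include <stdatomic.h>
#include <stdlib.h>

struct msg { int payload; };

static struct msg *_Atomic shared_msg;            /* starts out NULL */

/* writer thread */
void publish(int value) {
    struct msg *m = malloc(sizeof *m);
    if (!m)
        return;
    m->payload = value;                           /* fill in the data first...   */
    atomic_store_explicit(&shared_msg, m,
                          memory_order_release);  /* ...then publish the pointer */
}

/* reader thread */
int try_consume(int *out) {
    struct msg *m = atomic_load_explicit(&shared_msg, memory_order_acquire);
    if (!m)
        return 0;
    *out = m->payload;    /* acquire pairs with release: the payload is visible */
    return 1;
}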
A plain global char *ptr should not be considered atomic. It might work sometimes, especially with optimization disabled, but you can get the compiler to make safe and efficient optimized asm by using modern language features to tell it you want atomicity.
Use C11 stdatomic.h or GNU C __atomic builtins. And see Why is integer assignment on a naturally aligned variable atomic on x86? - yes the underlying asm operations are atomic "for free", but you need to control the compiler's code-gen to get sane behaviour for multithreading.
See also LWN: Who's afraid of a big bad optimizing compiler? - weird effects of using plain vars include several really bad well-known things, but also more obscure stuff like invented loads: reading a variable more than once because the compiler decides to optimize away a local tmp and load the shared var twice instead of keeping it in a register. Using asm("" ::: "memory") compiler barriers may not be sufficient to defeat that, depending on where you put them.
So use proper atomic stores and loads that tell the compiler what you want: You should generally use atomic loads to read them, too.
#include <stdatomic.h>            // C11 way
_Atomic char *c11_shared_var;     // all access to this is atomic; the _explicit
                                  // functions are needed only for weaker ordering

void write_c11(char *newval) {
    atomic_store_explicit(&c11_shared_var, newval, memory_order_relaxed);
}

char *plain_shared_var;           // GNU C way
// This is a plain C var. Only specific accesses to it are atomic; be careful!

void write_gnu(char *newval) {
    __atomic_store_n(&plain_shared_var, newval, __ATOMIC_RELAXED);
}
Using __atomic_store_n on a plain var is the functionality that C++20 atomic_ref exposes. If multiple threads access a variable for the entire time that it needs to exist, you might as well just use C11 stdatomic because every access needs to be atomic (not optimized into a register or whatever). When you want to let the compiler load once and reuse that value, do char *tmp = c11_shared_var; (or atomic_load_explicit if you only want acquire instead of seq_cst; cheaper on a few non-x86 ISAs).
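A minimal read-side sketch to go with the stores above (it assumes the c11_shared_var declaration from the earlier snippet; reader is a made-up name):

void reader(void) {
    /* load the shared pointer once, then reuse the plain local */
    char *tmp = atomic_load_explicit(&c11_shared_var, memory_order_acquire);
    if (tmp) {
        /* ... use tmp freely; the compiler may keep it in a register ... */
    }
}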
Besides lack of tearing (atomicity of asm load or store), the other key parts of _Atomic foo * are:
The compiler will assume that other threads may have changed memory contents (like volatile effectively implies), otherwise the assumption of no data-race UB will let the compiler hoist loads out of loops. Without this, dead-store elimination might only do one store at the end of a loop, not updating the value multiple times.
The read side of the problem is usually what bites people in practice, see Multithreading program stuck in optimized mode but runs normally in -O0 - e.g. while(!flag){} becomes if(!flag) infinite_loop; with optimization enabled.
Ordering wrt. other code. e.g. you can use memory_order_release to make sure that other threads that see the pointer update also see all changes to the pointed-to data. (On x86 that's as simple as compile-time ordering, no extra barriers needed for acquire/release, only for seq_cst. Avoid seq_cst if you can; mfence or locked operations are slow.)
Guarantee that the store will compile to a single asm instruction. You'd be depending on this. It does happen in practice with sane compilers, although it's conceivable that a compiler might decide to use rep movsb to copy a few contiguous pointers, and that some machine somewhere might have a microcoded implementation that does some stores narrower than 8 bytes.
(This failure mode is highly unlikely; the Linux kernel relies on volatile load/store compiling to a single instruction with GCC / clang for its hand-rolled intrinsics. But if you just used asm("" ::: "memory") to make sure a store happened on a non-volatile variable, there's a chance.)
Also, something like ptr++ will compile to an atomic RMW operation like lock add qword [mem], 4, rather than the separate load and store that volatile would give you. (See Can num++ be atomic for 'int num'? for more about atomic RMWs.) Avoid that if you don't need it; it's slower. e.g. atomic_store_explicit(&ptr, ptr + 1, memory_order_release); - seq_cst loads are cheap on x86-64 but seq_cst stores aren't.
Also note that memory barriers can't create atomicity (lack of tearing), they can only create ordering wrt other ops.
In practice, x86-64 ABIs do have alignof(void*) == 8, so all pointer objects should be naturally aligned (except in a __attribute__((packed)) struct, which violates the ABI), so you can use __atomic_store_n on them. It should compile to what you want (a plain store, no overhead) and meet the asm requirements to be atomic.
See also When to use volatile with multi threading? - you can roll your own atomics with volatile and asm memory barriers, but don't. The Linux kernel does that, but it's a lot of effort for basically no gain, especially for a user-space program.
Side note: an often repeated misconception is that volatile or _Atomic are needed to avoid reading stale values from cache. This is not the case.
All machines that run C11 threads across multiple cores have coherent caches, not needing explicit flush instructions in the reader or writer. Just ordinary load or store instructions, like x86 mov. The key is to not let the compiler keep values of shared variable in CPU registers (which are thread-private). It normally can do this optimization because of the assumption of no data-race Undefined Behaviour. Registers are very much not the same thing as L1d CPU cache; managing what's in registers vs. memory is done by the compiler, while hardware keeps cache in sync. See When to use volatile with multi threading? for more details about why coherent caches is sufficient to make volatile work like memory_order_relaxed.
See Multithreading program stuck in optimized mode but runs normally in -O0 for an example.
"Atomic" is treated as this quantum state where something can be both atomic and not atomic at the same time because "it's possible" that "some machines" "somewhere" "might not" write "a certain value" atomically. Maybe.
That is not the case. Atomicity has a very specific meaning, and it solves a very specific problem: threads being pre-empted by the OS to schedule another thread in its place on that core. And you cannot stop a thread from executing mid-assembly instruction.
What that means is that any single assembly instruction is "atomic" by definition. And since you have register move instructions, any register-sized copy is atomic by definition. That means a 32-bit integer on a 32-bit CPU and a 64-bit integer on a 64-bit CPU are atomic -- and of course that includes pointers (ignore all the people who will tell you "some architectures" have pointers of "different size" than registers; that hasn't been the case since the 386).
You should however be careful not to hit variable caching problems (i.e. one thread writing a pointer and another trying to read it but getting an old value from its cache); use volatile as needed to prevent this.

Is there still a performance advantage to redefining standard functions like memcpy?

My question is quite simple, but I can't find any clear answer, so here I am.
Nowadays C compilers are more efficient than they were a few years ago. Is there still any advantage to redefining functions like memcpy or memset in a new project?
To be more specific, let's assume that the targeted MCU on the project is a 32-bit ARM core such as a Cortex-M or Cortex-A, and the GNU ARM toolchain is used.
Thanks
No, it is not beneficial to redefine memcpy. The problem is that your own function cannot work like the standard library memcpy, because the C compiler knows that the function with name memcpy is the one that (C11 7.24.2.1p2)
[...] copies n characters from the object pointed to by s2 into the object pointed to by s1. If copying takes place between objects that overlap, the behavior is undefined.
and it is explicitly allowed to construct any equivalent program that behaves as if such a function were called. Sometimes this will even lead to code that does not touch memory at all, the memcpy being replaced by a register copy, or by an unaligned load instruction that loads a value from memory into a register.
If you define your own superduperfastmemcpy in assembler, the C compiler will not know about what it does and will slavishly call it whenever asked to.
What can be beneficial, however, is to have a special routine for copying large blocks of memory when, for example, it is known that both source and destination addresses are divisible by 1k and all lengths are always divisible by 1k; in that case several alternative routines could be timed at program start-up and the fastest one chosen. Of course, copying large amounts of memory around is mostly a sign of bad design...
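A rough sketch of that "time several candidates at start-up" idea, under the assumption that all candidates share memcpy's signature (every name here is made up, and clock() is only a crude timer):

#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* one hypothetical hand-written candidate; a real project might add word-wise,
   unrolled or DMA-backed variants to the list below */
static void *copy_bytes(void *dst, const void *src, size_t n) {
    char *d = dst;
    const char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

static copy_fn best_copy = memcpy;   /* callers go through best_copy(dst, src, n) */

/* crude start-up benchmark: run each candidate over a scratch buffer and keep
   whichever finishes first (a sketch, not a rigorous measurement) */
static void pick_copy_routine(void) {
    static char a[1 << 16], b[1 << 16];
    copy_fn candidates[] = { memcpy, copy_bytes };
    double best = -1.0;

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        clock_t t0 = clock();
        for (int rep = 0; rep < 256; rep++)
            candidates[i](a, b, sizeof a);
        double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (best < 0 || elapsed < best) {
            best = elapsed;
            best_copy = candidates[i];
        }
    }
}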
The question is only answerable as something other than a matter of opinion because you have been specific about the target and toolchain. It is not possible to generalise (and never has been).
The GNU ARM toolchain uses the Newlib C library. Newlib is designed to be architecture-agnostic and portable. As such it is written in C rather than assembler, so its performance is determined by the code generation of the compiler and in turn by the compiler options applied when the library is built. It is possible to build it for a very specific ARM architecture, or for a more generic ARM instruction subset; that will affect performance too.
Moreover Newlib itself can be built with various conditional compilation options such as PREFER_SIZE_OVER_SPEED and __OPTIMIZE_SIZE__.
Now if you are able to generate better ARM assembler code (and have the time) than the compiler, then that is great, but such kung-fu coding skills are increasingly rare and frankly increasingly unnecessary. Do you have sufficient assembler expertise to beat the compiler, do you have the time, and do you really want to do that for every architecture you might use? It may be a premature optimisation, and rather unproductive.
In some circumstances, on targets with the capability, it may be worthwhile setting up a memory-to-memory DMA transfer. The GNU ARM compiler will not generate DMA code because that is chip-vendor dependent and not part of the ARM architecture. However memcpy is general purpose, handling arbitrary copy sizes, alignments and thread safety. For specific circumstances where DMA is optimal, it is better perhaps to define a new, differently named routine and use it where it is needed, rather than redefine memcpy and risk it being sub-optimal for the small copies which may predominate, or for multi-threaded applications.
The implementation of memcpy() in Newlib for example can be seen here. It is a reasonable idiomatic implementation and therefore sympathetic to a typical compiler optimiser, which generally works best on idiomatic code. An alternative implementation may perform better in un-optimised compilation, but if it is "unusual", the optimiser may not work as well. If you are writing it in assembler, you just have to be better than the compiler - you'd be a rare though not necessarily (commercially) valuable commodity. That said, looking at this specific implementation, it does look far less efficient for large unaligned blocks in the speed-over-size implementation. It would be possible to improve that at some small expense, perhaps, to the more common aligned copies.
Functions like memcpy belong to the standard library and are almost surely implemented in assembler, not in C.
If you redefine them it will surely be slower. If you want to optimize the copy, you should either use memmove instead, or declare the pointers as restrict to tell the compiler that they do not overlap.
The engineers who wrote the standard C library for the given architecture surely used existing assembler routines to move memory as fast as possible.
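To illustrate the restrict point, a small hedged example (copy_ints is a made-up name): the qualifier gives the compiler the same no-overlap promise that memcpy's own prototype makes, so it is free to vectorize or reorder the loop.

#include <stddef.h>

/* restrict promises that dst and src never overlap for the duration of the call */
void copy_ints(int *restrict dst, const int *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}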
EDIT:
Taking into account the remarks from some comments: any code generation that keeps the semantics of copying (including replacing memcpy with mov instructions or other code) is allowed.
For copying algorithms (including the algorithm that newlib is using) you can check this article. Quote from the article:
Special situations: If you know all about the data you're copying as well as the environment in which memcpy runs, you may be able to create a specialized version that runs very fast.
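As a hedged illustration of such a "specialized version" (the name and preconditions are made up; the caller must really guarantee them): if both buffers are word-aligned and the length is a multiple of 16 bytes, the copy can move whole words and skip all the edge handling a general-purpose memcpy needs.

#include <stddef.h>
#include <stdint.h>

/* precondition: dst and src are 4-byte aligned, nbytes is a multiple of 16 */
void copy_aligned_blocks(uint32_t *dst, const uint32_t *src, size_t nbytes) {
    for (size_t i = 0; i < nbytes / 4; i += 4) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}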
There are several points here, maybe already mentioned above:
Certified libs: usually they are not certified for use in safety-constrained environments. Development according to a certain ASPICE/CMM level is usually not provided, so these libs cannot be used in such environments.
Architecture-specific implementations: maybe your own implementation uses some very target-specific features that the libs cannot provide, e.g. specific load/store instructions (SIMD, vector-based instructions), or even a DMA-based implementation for bigger data, or different implementations in the case of a multiprocessor with different core architectures (e.g. NXP S32 with e200z4 and e200z7 cores, or ARM M5 vs. A53), where the lib would need to find out on which core it is called to get the best performance.
Since embedded development is, in C-standard terms, "freestanding" and not "hosted", a big part of the standard is "implementation-defined" or even "unspecified", and that includes the libs.

Using memcpy and friends with memory-mapped I/O

I'm working on an embedded project which involves I/O on memory-mapped FPGA registers. Pointers to these memory regions need to be marked volatile so the compiler does not "optimize out" reads and writes to the FPGA by caching values in CPU registers.
In a few cases, we want to copy a series of FPGA registers into a buffer for further use. Since the registers are mapped to contiguous addresses, memcpy seems appropriate, but passing our volatile pointer as the source argument gives a warning about discarding the volatile qualifier.
Is it safe (and sane) to cast away the volatile-ness of the pointer to suppress this warning? Unless the compiler does something magical, I can't imagine a scenario where calling memcpy would fail to perform an actual copy. The alternative is to just use a for loop and copy byte by byte, but memcpy implementations can (and do) optimize the copy based on size of the copy, alignment, etc.
As a developer of both FPGA and embedded software, I can give just one clear answer: do not use memcpy et al. for this.
Some reasons:
There is no guarantee memcpy will perform the copy in any specific order.
The compiler might very well replace the call with inline code.
Such accesses often require a specific access width. memcpy does not guarantee that.
Gaps in the register map might result in undefined behaviour.
You can, however, use a simple for loop and copy yourself. This is safe, if the registers are volatile (see below).
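A minimal sketch of that for-loop approach, assuming the FPGA registers are 32-bit and mapped as volatile uint32_t (read_fpga_block is a made-up name; the caveats about cacheability and barriers below still apply):

#include <stddef.h>
#include <stdint.h>

/* copy a block of 32-bit registers into an ordinary buffer, one register-width
   access at a time; volatile keeps the compiler from eliding or merging reads */
void read_fpga_block(volatile const uint32_t *regs, uint32_t *buf, size_t nregs) {
    for (size_t i = 0; i < nregs; i++)
        buf[i] = regs[i];
}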
Depending on your platform, volatile alone might not be sufficient. The memory area also has to be non-cacheable and strictly ordered (and possibly non-shared). Otherwise the system buses might (and for some platforms will) reorder accesses.
Furthermore, you might need barriers/fences for your CPU not to reorder accesses. Please read your hardware-specs very carefully about this.
If you need to transfer larger blocks more often, think about using DMA. If the FPGA uses PCI(e), you could use busmaster DMA with scatter/gather for instance (however, this is not easily implemented; did that myself, but might be worth the effort).
The best (and most sane) approach actually depends on multiple factors, like platform, required speed, etc. Of all possible approaches, I would deem using memcpy() one of the less sane ones, at best.
Absolutely not safe. There is no guarantee whatsoever about the order in which memcpy will copy the data, or how many bytes are copied at a time.

How can I optimize GCC compilation for memory usage?

I am developing a library which should use as little memory as possible (I am not concerned about anything else, like the binary size, or speed optimizations).
Are there any GCC flags (or any other GCC-related options) I can use? Should I avoid some level of -O* optimization?
Your library - or any code in idiomatic C - has several kinds of memory usage:
binary code size, and indeed -Os should optimize that
heap memory, using C dynamic allocation, that is malloc; you obviously should know how, and how much, heap memory is allocated (and later free-d). The actual memory consumption would depend upon your particular malloc implementation (e.g. many implementations, when calling malloc(25) could in fact consume 32 bytes), not on the compiler. BTW, you might design your library to use some memory pools or even implement your own allocator (above OS syscalls like mmap, or above malloc etc...)
local variables, that is the call frames on the call stack. This mostly depend upon your code (but an optimizing compiler, e.g. -Os or -O2 for gcc, would probably use more registers and perhaps slightly less stack when optimizing). You could pass -fstack-usage to gcc to ask it to give the size of every call frame and you might give -Wstack-usage=len to be warned when a call frame exceeds len bytes.
global or static variables. You should know how much memory they need (and you might use nm or some other binutils program to query them). BTW, declaring carefully some variables inside a function as static would lower the stack consumption (but you cannot do that for every variable or every function).
Notice also that in some limited cases, GCC is doing tail calls, and then the stack usage is lowered (since the stack frame of the caller is reused in the callee). (See also this old question).
You might also ask the compiler to pack some particular structs (beware, this could slow down performance significantly). You'll want to use type attributes like __attribute__((packed)), etc., and perhaps also some variable attributes.
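For illustration, a hedged example of what packing does on a typical 32- or 64-bit ABI (the struct names are made up and the exact sizes are implementation-defined):

#include <stdint.h>

struct sample {             /* 3 bytes of padding typically follow 'flag',  */
    uint8_t  flag;          /* so sizeof(struct sample) is usually 8        */
    uint32_t value;
};

struct sample_packed {      /* no padding: sizeof is 5, but accesses to     */
    uint8_t  flag;          /* 'value' may be slower, or may even trap, on  */
    uint32_t value;         /* some architectures                           */
} __attribute__((packed));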
Perhaps you should read more about Garbage Collection, since GC techniques, concepts, and terminology might be relevant. See this answer.
If on Linux, the valgrind tool should be useful too... (and during the debugging phase the -fsanitize=address option of recent GCC).
You might perhaps also use some code generation options like -fstack-reuse= or -fshort-enums or -fpack-struct or -fstack-limit-symbol= or -fsplit-stack ; be very careful: some such options make your binary code incompatible with your existing C (and others!) libraries (then you might need to recompile all used libraries, including your libc, with the same code generation flags).
You probably should enable link-time optimizations by compiling and linking with -flto (in addition of other optimization flags like -Os).
You certainly should use a recent version of GCC. Notice that GCC 5.1 was released a few days ago (in April 2015).
If your library is large enough to worth the effort, you might even consider customizing your GCC compiler with MELT (to help you find out how to spend less memory). This might take weeks or months of work.
There are advantages to using 'stack frames', but maintaining a frame pointer does use extra stack space.
You can tell the compiler not to use stack frames. This will (generally) slightly increase the code size but will reduce the amount of stack used.
Where possible, use char and short for values rather than int.
It is poor programming practice, but you can re-use variables and arrays for multiple purposes.
If some set of variables are mutually exclusive in usage, you can place them in a union (see the sketch after this list).
If the function parameter lists are all very short, you can force the compiler to pass all the parameters in registers (having an architecture with lots of general-purpose registers really helps here).
Only use one malloc that contains ALL the area needed for malloc-style allocations, so as to minimize allocator overhead.
There are many techniques. Most make the code much more difficult to debug/maintain and often make the code much harder for humans to read.
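A hedged sketch of the union tip above (the names and sizes are made up): two buffers that are never live at the same time can share storage, so only the larger of the two is reserved.

#include <stdint.h>

union scratch {
    char    rx_line[128];      /* used while parsing input    */
    uint8_t encode_buf[96];    /* used while building output  */
};

static union scratch scratch;  /* 128 bytes instead of 224 */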
When possible, you can use the -m32 option to compile your application as 32-bit. The application will then consume only half of the memory on 64-bit systems.
apt-get install libc6-dev-i386
gcc -m32 application.c -o application

C structure assignment uses memcpy

I have this: StructType st = StructTypeSecondInstance->st; and it generates a segfault. The strange part is that the stack backtrace shows me:
0x1067d2cc: memcpy + 0x10 (0, 10000, 1, 1097a69c, 11db0720, bfe821c0) + 310
0x103cfddc: some_function + 0x60 (0, bfe823d8, bfe82418, 10b09b10, 0, 0) +
So, does struct assignment use memcpy?
One can't tell. Small structs may even be kept in registers. Whether memcpy is used is an implementation detail (it's not even implementation-defined or unspecified -- it's just something the compiler writer chooses and does not need to document).
From a C Standard point of view, all that matters is that after the assignment, the struct members of the destination struct compare equal to the corresponding members of the source struct.
I would expect compiler writers to make a tradeoff between speed and simplicity, probably based on the size of the struct, the larger the more likely to use a memcpy. Some memcpy implementations are very sophisticated and use different algorithms depending on whether the length is some power of 2 or not, or the alignment of the src and dst pointers. Why reinvent the wheel or blow up the code with an inline version of memcpy?
It might, yes.
This shouldn't be surprising: the struct assignment needs to copy a bunch of bytes from one place to another as quickly as possible, which happens to be the exact thing memcpy() is supposed to be good at. Generating a call to it seems like a no-brainer if you're a compiler writer.
Note that this means that assigning structs with lots of padding might be less efficient than optimally, since memcpy() can't skip the padding.
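A hedged illustration of the padding point (the struct and function names are made up): on a typical 64-bit ABI the struct below occupies 16 bytes, only 9 of which hold data; whether the member-wise copy is actually faster still depends entirely on the compiler.

struct padded {
    char   c;      /* 1 byte of data, then 7 bytes of padding on a typical 64-bit ABI */
    double d;      /* 8 bytes of data                                                 */
};

void copy_whole(struct padded *dst, const struct padded *src) {
    *dst = *src;               /* may copy all 16 bytes, padding included */
}

void copy_members(struct padded *dst, const struct padded *src) {
    dst->c = src->c;           /* copies only the 9 bytes that carry data */
    dst->d = src->d;
}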
The standard doesn't say anything at all about how assignment (or any other operator) is actually realized by the compiler. There's nothing stopping a compiler from (say) generating a function call for every operation in your source file.
The compiler has license to implement assignment as it thinks best. Most of the time, with most compilers on most platforms, this means that if the structure is reasonably small, the compiler will generate an inline sequence of move instructions; if the structure is large, calling memcpy is common.
It would be perfectly valid, however, for the compiler to loop over generating random bitfields and stop when one of them matches the source of the assignment (Let's call this algorithm bogocopy).
Compilers that support non-hosted operation usually give you a switch to turn off emitting such libcalls if you're targeting a platform without an available (or complete) libc.
It depends on the compiler and platform. Assignment of big objects can use memcpy. But it should not be the reason for the segfault.
