Why doesn't GCC optimize structs?

Systems demand that certain primitives be aligned to certain boundaries in memory (ints to addresses that are multiples of 4, shorts to addresses that are multiples of 2, etc.). Of course, struct members can be ordered to waste the least space on padding.
My question is why doesn't GCC do this automatically? Is the more obvious heuristic (order variables from biggest size requirement to smallest) lacking in some way? Is some code dependent on the physical ordering of its structs (is that a good idea)?
I'm only asking because GCC is super optimized in a lot of ways but not in this one, and I'm thinking there must be some relatively cool explanation (to which I am oblivious).

gcc does not reorder the elements of a struct, because that would violate the C standard. Section 6.7.2.1 of the C99 standard states:
Within a structure object, the non-bit-field members and the units in which bit-fields reside have addresses that increase in the order in which they are declared.

Structs are frequently used to mirror the exact byte layout of binary file formats and network protocols. Reordering members would break such code. In addition, different compilers would optimize the layout differently, and linking together code compiled by both would be impossible. This simply isn't feasible.

GCC is smarter than most of us in producing machine code from our source code; however, I'd shiver if it were smarter than us in re-arranging our structs, since a struct is data that can, for example, be written to a file. A struct that starts with 4 chars and then has a 4-byte integer would be useless if read on another system where GCC had decided to re-arrange the struct members.
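To make the padding cost concrete, here is a small sketch (offsets and sizes assume a typical ABI with a 4-byte int and 2-byte short):

struct unordered {     /* declaration order forces padding: */
    char  c;           /* offset 0, then 3 bytes of padding */
    int   i;           /* offset 4 (needs 4-byte alignment) */
    short s;           /* offset 8, then 2 bytes of tail padding */
};                     /* sizeof == 12 on such an ABI */

struct reordered {     /* largest-to-smallest, done by hand: */
    int   i;           /* offset 0 */
    short s;           /* offset 4 */
    char  c;           /* offset 6, then 1 byte of tail padding */
};                     /* sizeof == 8 on such an ABI */

Since the compiler may not do this reordering for you, the programmer has to do it in the declaration, and both layouts then stay stable across compilers.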

gcc SVN does have a structure reorganization optimization (-fipa-struct-reorg), but it requires whole-program analysis and isn't very powerful at the moment.

C compilers don't automatically pack structs precisely because of alignment issues like those you mention. Accesses not on word boundaries (32-bit on most CPUs) carry a heavy penalty on x86 and cause fatal traps on some RISC architectures.
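As an illustration of what packing trades away, here is a sketch using the non-standard GCC/Clang packed attribute:

struct __attribute__((packed)) wire {
    char     tag;   /* offset 0 */
    unsigned val;   /* offset 1: misaligned for a 4-byte load */
};

/* On x86 the compiler can emit an unaligned load for w->val (slower,
 * especially across a cache-line boundary); on strict-alignment targets
 * it must fall back to byte-by-byte access, or a naive cast would trap. */
unsigned get_val(const struct wire *w) { return w->val; }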

Not saying it's a good idea, but you can certainly write code that relies on the order of the members of a struct. For example, as a hack, people often cast a pointer to a struct to the type of a certain field inside it that they want access to, then use pointer arithmetic to get there. To me this is a pretty dangerous idea, but I've seen it used, especially in C++, to force a variable that's been declared private to be publicly accessible when it's in a class from a 3rd-party library and isn't publicly encapsulated. Reordering the members would totally break that.
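A hedged sketch of that kind of hack (hypothetical struct; don't imitate it):

#include <stddef.h>

struct widget {
    int    id;
    double weight;
};

/* C guarantees that a pointer to a struct, suitably converted, points at
 * its first member, so this cast happens to be well-defined for member 0... */
int *id_of(struct widget *w) { return (int *)w; }

/* ...but reaching later members by byte offset bakes the declared order
 * and padding into the code; reordering members would silently break it. */
double *weight_of(struct widget *w) {
    return (double *)((char *)w + offsetof(struct widget, weight));
}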

You might want to try the latest gcc trunk or struct-reorg-branch, which is under active development.
https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Olga+Golovanevsky_+Memory+Layout+Optimizations+of+Structures+and+Objects.pdf

Related

Race condition when accessing adjacent members in a shared struct, according to CERT coding rule POS49-C?

According to CERT coding rule POS49-C it is possible that different threads accessing different fields of the same structure may conflict.
Instead of bit-fields, I use regular unsigned ints:
struct multi_threaded_flags {
    unsigned int flag1;
    unsigned int flag2;
};

struct multi_threaded_flags flags;

void thread1(void) {
    flags.flag1 = 1;
}

void thread2(void) {
    flags.flag2 = 2;
}
I can see that even with unsigned int, there can still be a race condition IF the compiler decides to load/store 8 bytes instead of 4.
I think the compiler will never do that, and so the race condition will never happen here, but that's just my guess.
Is there any well-defined assembly/compiler documentation regarding this case? I hope that locking, which is costly, is a last resort, needed only when this situation turns out to be undefined.
FYI, I use gcc.
The C11 memory model guarantees that accesses to distinct structure members (which aren't part of a bit-field) are independent, so you'll run into no problems modifying the two flags from different threads (i.e., the "load 8 bytes, modify 4, and write back 8" scenario is not allowed).
This guarantee does not extend in general to bitfields, so you have to be careful there.
Of course, if you are concurrently modifying the same flag from more than one thread, you'll likely trigger the prohibition against data races, so don't do that.
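If the same flag can be written by more than one thread, the C11 remedy is to make it atomic; a minimal sketch:

#include <stdatomic.h>

struct multi_threaded_flags {
    atomic_uint flag1;
    atomic_uint flag2;
};

struct multi_threaded_flags flags;

void any_thread(void) {
    atomic_store(&flags.flag1, 1);  /* no data race even if other threads
                                       store to flag1 concurrently */
}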
Before C11, ISO C had nothing to say about threads, and writing multi-threaded code relied on other standards (e.g. POSIX which defines a memory model for pthreads), and multi-threaded code essentially depended on the way real compilers worked.
Note that this CERT coding-standard rule is in the POSIX section, and appears to be about pthreads without C11. (There's a CON32-C. Prevent data races when accessing bit-fields from multiple threads rule for C11, where they solve the bit-field concurrency problem by simply promoting the bit-fields to unsigned char, which C11 defines as "separate memory locations". This rule appears to be an insufficiently-edited copy of that, because many of its suggestions suck.)
But unfortunately POSIX pthreads doesn't clearly define what a "memory location" is, and this is all they have to say on the subject:
Single UNIX® Specification, Version 4, 2016 Edition (online HTML copy, requires free registration)
4.12 Memory Synchronization
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads.
This is why C11 defines it more clearly, where only bitfields are dangerous to write from different threads (barring compiler bugs).
However, I think everyone (including all compilers) agreed that separate int variables / struct members / array elements were separate "memory locations". Most real-world software doesn't take any special precautions for int or char variables that may be written by separate threads (especially outside of structs).
A compiler that gets int wrong will cause problems all over the place unless the bug is limited to very specific circumstances.
Most bugs like this are very hard to detect with testing, because usually the other data that's non-atomically loaded and stored back isn't written by another thread very often / ever. But if a compiler always did that for every int, problems would show up in some software pretty quickly.
Normally, separate char members would also be considered separate "memory locations", but some pre-C11 implementations might have exceptions to that rule. (Especially on early Alpha AXP which famously has no byte store instruction (so a C11 implementation would have to use 32-bit char), but optimizations that invent writes when updating multiple members can happen anywhere, either by accident or because the compiler developers define "memory location" as a 32 or 64-bit word.)
There's also the issue of compiler bugs. This can affect even compilers that intend to conform to C11. For example gcc bug 52080 which affected some non-x86 architectures. (Discovered in gcc4.7 in 2012, fixed in gcc4.8 a couple months later). Using a bitfield "tricked" the compiler into doing a non-atomic read-modify-write of the containing 64-bit word, even though that included a non-bitfield member. (Bitfields are bait for compiler bugs. Any defensive / safe-coding standard should recommend avoiding them in structs where different members can be modified from different threads. And definitely don't put them next to the actual lock.)
Herb Sutter's talk atomic<> Weapons: The C++ Memory Model and Modern Hardware part 2 goes into some detail about the kinds of compiler bugs that have affected multi-threaded code. Most of these should be shaken out by now (2017) if you're using a modern compiler. Most things like inventing writes (or non-atomic read and write-back of the same value) were usually still considered bugs before C11; C11 mostly just firmed up the rules compilers were already trying to follow. It also made it easier to report such bugs, because you could say unequivocally that it violates the standard instead of just "it breaks my code".
That coding-rule article is poorly written. Its examples with adjacent bit-fields are unsafe, but it claims that all variables are at risk. This is not true in general, especially not with C11. Many users of pthreads can or already do compile with C11 compilers.
(The phrasing I'm referring to is "Bit-fields are especially prone to this behavior", which incorrectly implies that this is allowed to happen with ordinary members of structs, or variables that happen to be adjacent outside of structs)
It's part of a defensive coding standard, but it should definitely make the distinction between what the standards require and what is just belt-and-suspenders defense against compiler bugs.
Also, putting variables that will usually be accessed by different threads into one struct is generally terrible. False sharing of a cache line (typically 64 bytes) is really bad for performance, causing cache misses and (on out-of-order x86 CPUs) memory-ordering mis-speculation (like a branch mispredict requiring a roll-back.) Putting separately-used shared variables into the same byte with bit-fields is even worse, because it prevents efficient stores (any store has to be a RMW of the containing byte).
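When two mostly-independent shared variables do have to coexist, a common mitigation (a sketch; the 64-byte line size is an assumption about the target) is to give each one its own cache line:

#include <stdalign.h>

/* Each counter occupies its own 64-byte line, so thread A's stores to
 * hot_a.value never invalidate the line that thread B is using for
 * hot_b.value. */
struct padded_counter {
    alignas(64) unsigned long value;
};

struct padded_counter hot_a, hot_b;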
Solving the bit-field problem by promoting the two bit-fields to unsigned char makes much more sense than using a mutex if they need to be independently writeable from separate threads. Or even unsigned long if you're paranoid.
If the two members are often used together, it makes sense to put them nearby. But if you're going to pad to a whole long to contain both members (like that article does), you might as well make them at least unsigned char or bool instead of 1-byte bitfields.
Although honestly, having two threads modify separate members of a struct at the same time seems like poor design unless one of the members is the lock, and the modification is part of an attempt to take the lock. Using a bit-field as a lock is a bad idea unless you're targeting a specific ISA and building your own lock primitive using something like x86's lock bts instruction to atomically test-and-set a bit. Even then it's a bad idea unless you need to pack it with other bitfields for space savings; the Linux code that exposed the gcc bug with an int lock:1 member was a horrible idea.
In addition, the flags are declared volatile to ensure that the compiler will not attempt to move operations on them outside the mutex.
If your compiler needs this, your compiler is seriously broken, and will create broken code for most multi-threaded programs. (Unless the compiler bug only happens with bit-fields, because shared bit-fields are rare).
Most code doesn't make shared variables volatile, and relies on the guarantee that mutex lock/unlock stops operations from reordering at compile or run time out of the critical section.
Back in 2012, and possibly still today, gcc -pthread might affect code-gen choices in C89/C99 mode (-std=gnu99). In discussion on an LWN article about that gcc bug, this user claimed that -pthread would prohibit the compiler from doing a 64-bit load/store when modifying a 32-bit variable, but that without -pthread it could do so (although on most architectures, IDK why it would). But it turns out that gcc bug manifested even with -pthread, so it was really a bug rather than an aggressive optimization choice.
ISO C11 standard:
N1570, section 3.14 definitions:
memory location: either an object of scalar type, or a maximal sequence of adjacent bit-fields all having nonzero width
NOTE 1 Two threads of execution can update and access separate memory locations without interfering with each other.
A bit-field and an adjacent non-bit-field member are in separate memory locations. ... It is not safe to concurrently update two non-atomic bit-fields in the same structure if all members declared between them are also (non-zero-length) bit-fields, no matter what the sizes of those intervening bit-fields happen to be.
(...gives an example of a struct with bit-fields...)
So in C11, you can't assume anything about the compiler munging other bit-fields when writing one bit-field, but otherwise you're safe. Unless you use a separator :0 field to force the compiler to pad enough (or use atomic bit-ops) so that it can update your bit-field without concurrency problems for other fields. But if you want to be safe, it's probably not a good idea to use bit-fields at all in structs that are written by multiple threads at once.
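A sketch of the :0 separator mentioned above:

struct flags {
    unsigned a : 4;   /* memory location 1 */
    unsigned   : 0;   /* zero-width bit-field: the next bit-field starts in
                         a new unit, so a and b become separate memory
                         locations under C11 3.14 */
    unsigned b : 4;   /* memory location 2: safe to write from another
                         thread while a is being written */
};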
See also other Notes in the C11 standard, e.g. the one linked by @Fuz in Is it well-defined behavior to modify one element of an array while another thread modifies another element of the same array? that explicitly says that compiler transformations that would make this dangerous are disallowed.

Cost of union access vs using fundamental types

I have a large block of data where some operations would be fastest if the block were viewed as an array of 64 bit unsigned integers and others would be fastest if viewed as an array of 32 bit unsigned integers. By 'fastest', I mean fastest on average for the machines that will be running the code. My goal is to be near optimal in all the environments running the code, and I think this is possible if I use a void pointer, casting it to one of the two types for dereferencing. This brings me to my questions:
1) If I use a void pointer, will casting it to one of the two types for dereferencing be slower than directly using a pointer of the desired type?
2) Am I correct in my understanding of the standard that doing this will not violate the anti-aliasing rules, and that it will not produce any undefined or unspecified behaviour? The 32 and 64 bit types I am using exist and have no padding (this is a static assertion).
3) Am I correct in understanding the anti-aliasing rules to basically serve two purposes: type safety and compiler guarantees to enable optimization? If so, if all situations where the code I am discussing will be executed are such that no other dereferencing is happening, am I likely to lose out on any significant compiler optimizations?
I have tagged this with 'c11' because I need to prove from the c11 standard that the behaviour is well defined. Any references to the standard would be appreciated.
Finally, I would like to address a likely concern to be brought up in the responses, regarding "premature optimization". First off, this code is being run on a diverse computing cluster, where performance is critical, and I know that even a one-instruction slowdown in dereferencing would be significant. Second, testing this on all the hardware would take time I don't have to finish the project. There are a lot of different types of hardware, and I have a limited amount of time on site to actually work with the hardware. However, I am confident that an answer to this question will enable me to make the right design choice anyway.
EDIT: An answer and comments pointed out that there is an aliasing problem with this approach, which I verified directly in the c11 standard. An array of unions would require two address calculations and dereferencings in the 32 bit case, so I'd prefer a union of arrays. The questions then become:
1) Is there a performance problem in using a union member as an array as opposed to a pointer to the memory? I.e., is there a cost in union member access? Note that declaring two pointers to the arrays violates the anti-aliasing rules, so access would need to be made directly through the union.
2) Are the contents of the array guaranteed invariant when accessed through one array then through the other?
There are different aspects to your question. First of all, interpreting memory with different types has several problems:
aliasing
alignment
padding
Aliasing is a "local" problem. Inside a function, you don't want to have pointers to the same object that have different target types. If you do modify such pointed-to objects, the compiler may assume the object has not changed and misoptimize your program. If you don't do that inside a function (e.g. do a cast right at the beginning and stay with that interpretation) you should be fine for aliasing.
Alignment problems are often overlooked nowadays because many processors are quite tolerant of alignment problems, but this is not portable and might also have performance impacts. So you have to ensure that your array is aligned in a way that is suitable for all types through which you access it. This can be done with _Alignas in C11; older compilers have extensions that also allow for this. C11 adds some restrictions on alignment, e.g. that it is always a power of 2, which should enable you to write portable code with respect to this problem.
Integer type padding is something rare these days (the only exception is _Bool), but to be sure you should use types that are known not to have problems with it. In your case these are [u]int32_t and [u]int64_t, which are known to have exactly the number of bits requested and to have two's-complement representation for the signed types. If a platform doesn't support them, your program simply won't compile.
I would refrain from using a void pointer. A union of two arrays or an array of unions will do better.
Use a proper alignment on the whole type. C11 provides alignas() as a keyword. GCC has attributes for alignment which are non-standard (and work in pre-C11 standards as well). Other compilers may have none at all.
Depending on your architecture, there should be no performance impact. But this cannot be guaranteed (I do not see an issue here, however). You might even align the type to something larger than 64 bits to fill a cache line perfectly. That might speed up prefetch and writeback.
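A sketch of the union-of-arrays layout with explicit alignment (N and the 64-byte alignment are illustrative choices, not requirements):

#include <stddef.h>
#include <stdint.h>
#include <stdalign.h>

#define N 1024

/* Both views share the same storage; accessing the data through the union
 * members themselves (rather than through separately stored pointers)
 * keeps the compiler aware that the two views overlap. */
union block {
    uint32_t u32[2 * N];
    uint64_t u64[N];
};

static alignas(64) union block blk;   /* also matches a 64-byte cache line */

uint64_t sum64(void) {
    uint64_t s = 0;
    for (size_t i = 0; i < N; i++)
        s += blk.u64[i];
    return s;
}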
Aliasing refers to the fact that an object is referenced by multiple pointers at the same time. This means the same memory address can be addressed using two different "sources". The problem is that the compiler may not be aware of this and thus may hold the value of a variable in a CPU register during some calculation without writing it back to memory instantly.
If the same variable is then referenced by the other "source" (i.e. pointer), the compiler may read invalid data from the memory location.
Imo, aliasing is only relevant within a function if two pointers are passed in. So, if you do not intend to pass two pointers to the same object (or part of it) there should be no problem at all. Otherwise, you should get comfortable with (compiler) barriers. Edit: the C standard seems to be a bit more strict on that, as it requires just the lvalues accessing an object to fulfill certain criteria (C11 6.5/7 (n1570) - thks Matt McNabb).
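The classic illustration of the in-function hazard (types chosen so that, under C11 6.5/7, a store through one may be assumed not to modify the other):

long f(long *l, short *s) {
    *l = 1;
    *s = 2;      /* short is not a type that may alias long, so the
                    compiler may assume this store leaves *l untouched... */
    return *l;   /* ...and fold this read to the constant 1, which is wrong
                    if the caller passed overlapping pointers */
}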
Oh, and don't use int/long/etc. You should use the stdint.h types if you need properly sized types.

What memory address spaces are there?

What forms of memory address spaces have been used?
Today, a large flat virtual address space is common. Historically, more complicated address spaces have been used, such as a pair of a base address and an offset, a pair of a segment number and an offset, a word address plus some index for a byte or other sub-object, and so on.
From time to time, various answers and comments assert that C (or C++) pointers are essentially integers. That is an incorrect model for C (or C++), since the variety of address spaces is undoubtedly the cause of some of the C (or C++) rules about pointer operations. For example, not defining pointer arithmetic beyond an array simplifies support for pointers in a base and offset model. Limits on pointer conversion simplify support for address-plus-extra-data models.
That recurring assertion motivates this question. I am looking for information about the variety of address spaces to illustrate that a C pointer is not necessarily a simple integer and that the C restrictions on pointer operations are sensible given the wide variety of machines to be supported.
Useful information may include:
Examples of computer architectures with various address spaces and descriptions of those spaces.
Examples of various address spaces still in use in machines currently being manufactured.
References to documentation or explanation, especially URLs.
Elaboration on how address spaces motivate C pointer rules.
This is a broad question, so I am open to suggestions on managing it. I would be happy to see collaborative editing on a single generally inclusive answer. However, that may fail to award reputation as deserved. I suggest up-voting multiple useful contributions.
Just about anything you can imagine has probably been used. The first major division is between byte addressing (all modern architectures) and word addressing (pre-IBM 360/PDP-11, but I think modern Unisys mainframes are still word addressed). In word addressing, char* and void* would often be bigger than an int*; even if they were not bigger, the "byte selector" would be in the high order bits, which were required to be 0, or would be ignored for anything other than bytes. (On a PDP-10, for example, if p was a char*, (int)p < (int)(p+1) would often be false, even though int and char* had the same size.)
Among byte addressed machines, the major variants are segmented and non-segmented architectures. Both are still widespread today, although in the case of Intel 32bit (a segmented architecture with 48 bit addresses), some of the more widely used OSs (Windows and Linux) artificially restrict user processes to a single segment, simulating flat addressing.

Although I've no recent experience, I would expect even more variety in embedded processors. In particular, in the past, it was frequent for embedded processors to use a Harvard architecture, where code and data were in independent address spaces (so that a function pointer and a data pointer, cast to a large enough integral type, could compare equal).
I would say you are asking the wrong question, except as historical curiosity.
Even if your system happens to use a flat address space -- indeed, even if every system from now until the end of time uses a flat address space -- you still cannot treat pointers as integers.
The C and C++ standards leave all sorts of pointer arithmetic "undefined". That can impact you right now, on any system, because compilers will assume you avoid undefined behavior and optimize accordingly.
For a concrete example, three months ago a very interesting bug turned up in Valgrind:
https://sourceforge.net/p/valgrind/mailman/message/29730736/
(Click "View entire thread", then search for "undefined behavior".)
Basically, Valgrind was using less-than and greater-than on pointers to try to determine if an automatic variable was within a certain range. Because comparisons between pointers into different aggregates are "undefined", Clang simply optimized away all of the comparisons to return a constant true (or false; I forget).
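A distilled version of the pattern that bit Valgrind (hypothetical function; the real code was more involved):

int points_into_range(char *lo, char *hi, char *p) {
    /* If p does not point into the same object as lo/hi, these relational
     * comparisons are undefined (C11 6.5.8p5), so a compiler may assume
     * the case never occurs and fold the whole test to a constant. */
    return lo <= p && p < hi;
}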
This bug itself spawned an interesting StackOverflow question.
So while the original pointer arithmetic definitions may have catered to real machines, and that might be interesting for its own sake, it is actually irrelevant to programming today. What is relevant today is that you simply cannot assume that pointers behave like integers, period, regardless of the system you happen to be using. "Undefined behavior" does not mean "something funny happens"; it means the compiler can assume you do not engage in it. When you do, you introduce a contradiction into the compiler's reasoning; and from a contradiction, anything follows... It only depends on how smart your compiler is.
And they get smarter all the time.
There are various forms of bank-switched memory.
I worked on an embedded system that had 128 KB of total memory: 64KB of RAM and 64KB of EPROM. Pointers were only 16-bit, so a pointer into the RAM could have the same value as a pointer into the EPROM, even though they referred to different memory locations.
The compiler kept track of the type of the pointer so that it could generate the instruction(s) to select the correct bank before dereferencing a pointer.
You could argue that this was like segment + offset, and at the hardware level, it essentially was. But the segment (or more correctly, the bank) was implicit from the pointer's type and not stored as the value of a pointer. If you inspected a pointer in the debugger, you'd just see a 16-bit value. To know whether it was an offset into the RAM or the ROM, you had to know the type.
For example, Foo * could only be in RAM and const Bar * could only be in ROM. If you had to copy a Bar into RAM, the copy would actually be a different type. (It wasn't as simple as const/non-const: everything in ROM was const, but not all consts were in ROM.)
This was all in C, and I know we used non-standard extensions to make this work. I suspect a 100% compliant C compiler probably couldn't cope with this.
From a C programmer's perspective, there are three main kinds of implementation to worry about:
Those which target machines with a linear memory model, and which are designed and/or configured to be usable as a "high-level assembler"--something the authors of the Standard have expressly said they did not wish to preclude. Most implementations behave in this way when optimizations are disabled.
Those which are usable as "high-level assemblers" for machines with unusual memory architectures.
Those whose design and/or configuration make them suitable only for tasks that do not involve low-level programming, including clang and gcc when optimizations are enabled.
Memory-management code targeting the first type of implementation will often be compatible with all implementations of that type whose targets use the same representations for pointers and integers. Memory-management code for the second type of implementation will often need to be specifically tailored for the particular hardware architecture. Platforms that don't use linear addressing are sufficiently rare, and sufficiently varied, that unless one needs to write or maintain code for some particular piece of unusual hardware (e.g. because it drives an expensive piece of industrial equipment for which more modern controllers aren't available) knowledge of any particular architecture isn't likely to be of much use.
Implementations of the third type should be used only for programs that don't need to do any memory-management or systems-programming tasks. Because the Standard doesn't require that all implementations be capable of supporting such tasks, some compiler writers--even when targeting linear-address machines--make no attempt to support any of the useful semantics thereof. Even principles like "an equality comparison between two valid pointers will--at worst--either yield 0 or 1 chosen in possibly-unspecified fashion" don't apply to such implementations.

C structure assignment uses memcpy

I have this StructType st = StructTypeSecondInstance->st; and it generates a segfault. The strange part is when the stack backtrace shows me:
0x1067d2cc: memcpy + 0x10 (0, 10000, 1, 1097a69c, 11db0720, bfe821c0) + 310
0x103cfddc: some_function + 0x60 (0, bfe823d8, bfe82418, 10b09b10, 0, 0) +
So, does struct assignment use memcpy?
One can't tell. Small structs may even be kept in registers. Whether memcpy is used is an implementation detail (it's not even implementation-defined or unspecified -- it's just something the compiler writer chooses and does not need to document).
From a C Standard point of view, all that matters is that after the assignment, the struct members of the destination struct compare equal to the corresponding members of the source struct.
I would expect compiler writers to make a tradeoff between speed and simplicity, probably based on the size of the struct; the larger the struct, the more likely memcpy is to be used. Some memcpy implementations are very sophisticated and use different algorithms depending on whether the length is a power of 2 or not, or on the alignment of the src and dst pointers. Why reinvent the wheel or blow up the code with an inline version of memcpy?
It might, yes.
This shouldn't be surprising: the struct assignment needs to copy a bunch of bytes from one place to another as quickly as possible, which happens to be the exact thing memcpy() is supposed to be good at. Generating a call to it seems like a no-brainer if you're a compiler writer.
Note that this means that assigning structs with lots of padding might be less efficient than optimally, since memcpy() can't skip the padding.
The standard doesn't say anything at all about how assignment (or any other operator) is actually realized by the compiler. There's nothing stopping a compiler from (say) generating a function call for every operation in your source file.
The compiler has license to implement assignment as it thinks best. Most of the time, with most compilers on most platforms, this means that if the structure is reasonably small, the compiler will generate an inline sequence of move instructions; if the structure is large, calling memcpy is common.
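A sketch of the two regimes (what actually gets emitted varies with target and options):

struct small { int x, y; };        /* 8 bytes: typically a couple of
                                      register moves, no library call */
struct big   { char buf[4096]; };  /* typically a memcpy call or an
                                      inlined copy loop */

void copy_small(struct small *d, const struct small *s) { *d = *s; }
void copy_big  (struct big *d,   const struct big *s)   { *d = *s; }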
It would be perfectly valid, however, for the compiler to loop over generating random bitfields and stop when one of them matches the source of the assignment (Let's call this algorithm bogocopy).
Compilers that support non-hosted operation usually give you a switch to turn off emitting such libcalls if you're targeting a platform without an available (or complete) libc.
It depends on the compiler and platform. Assignment of big objects can use memcpy. But the use of memcpy cannot by itself be the reason for the segfault.

Why are structures copied via memcpy in embedded system code?

In the embedded software domain, when copying a structure to another of the same type, people don't use direct assignment; they use the memcpy() function or element-by-element copying instead. For example:

struct tag
{
    int a;
    int b;
};

struct tag exmple1 = {10, 20};
struct tag exmple2;

For copying exmple1 into exmple2, instead of writing the direct assignment

exmple2 = exmple1;

people use

memcpy(&exmple2, &exmple1, sizeof(struct tag));

or

exmple2.a = exmple1.a;
exmple2.b = exmple1.b;

Why?
One way or the other, there is nothing specific about embedded systems that makes this dangerous; the language semantics are identical for all platforms.
C has been used in embedded systems for many years, and early C compilers, before ANSI/ISO standardisation, did not support direct structure assignment. Many practitioners are either from that era, or have been taught by those that were, or are using legacy code written by such practitioners. This is probably the root of the doubt, but it is not a problem on an ISO compliant implementation. On some very resource constrained targets, the available compiler may not be fully ISO compliant for a number of reasons, but I doubt that this feature would be affected.
One issue (that applies to embedded and non-embedded alike), is that when assigning a structure, an implementation need not duplicate the value of any undefined padding bits, therefore if you performed a structure assignment, and then performed a memcmp() rather than member-by-member comparison to test for equality, there is no guarantee that they will be equal. However if you perform a memcpy(), any padding bits will be copied so that memcmp() and member-by-member comparison will yield equality.
So it is arguably safer to use memcpy() in all cases (not just embedded), but the improvement is marginal, and not conducive to readability. It would be a strange implementation that did not use the simplest method of structure assignment, and that is a simple memcpy(), so it is unlikely that the theoretical mismatch would occur.
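A sketch of that theoretical mismatch (padding placement assumes a typical ABI):

#include <string.h>

struct s { char c; int i; };    /* 3 padding bytes after c on such an ABI */

void demo(void) {
    struct s a = { 'a', 1 }, b;

    b = a;                       /* members copied; the padding bytes of b
                                    are unspecified, so memcmp(&a, &b,
                                    sizeof a) may be nonzero */
    memcpy(&b, &a, sizeof a);    /* padding copied too, so memcmp(&a, &b,
                                    sizeof a) == 0 is guaranteed */
}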
In your given code there is no problem even if you write:
example2 = example1;
But just assume if in future, the struct definition changes to:
struct tag
{
    int a[1000];
    int b;
};
Now if you execute the assignment as above, then some compilers might inline the copy element by element (or int by int), i.e.

example2.a[0] = example1.a[0];
example2.a[1] = example1.a[1];
example2.a[2] = example1.a[2];
...

which will bloat your code segment. Such code-size costs are not trivial to spot. That's why people use memcpy.
[However, I have heard that modern compilers are capable enough to use memcpy internally when such an assignment is encountered, especially for PODs.]
Copying C structures via memcpy() is often done by programmers who learned C decades ago and did not follow the standardization process since. They simply don't know that C supports assignment of structures (direct structure assignment was not available in all pre-ANSI-C89 compilers).
When they learn about this feature some still stick to the memcpy() way because it is their custom. There are also motivations that originate in cargo cult programming, e.g. it is claimed that memcpy is just faster - of course - without being able to back this up with a benchmark test case.
Structures are also memcpy()ed by some newbie programmers because they either confuse structure assignment with the assignment of a pointer to a structure - or they simply overuse memcpy() (they often also use memcpy() where strcpy() would be more appropriate).
There is also the memcmp() structure comparison anti-pattern that is sometimes cited by some programmers for using memcpy() instead of structure assignment. The reasoning behind this is the following: since C does not automatically generate a == operator for structures and writing a custom structure comparison function is tedious, memcmp() is used to compare structures. In the next step - to avoid differences in the padding bits of compared structures - memset(...,0,...) is used to initialize all structures (instead of using the C99 initializer syntax or initializing all fields separately) and memcpy() is used to copy the structures! Because memcpy() also copies the content of the padding bits ...
But note that this reasoning is flawed for several reasons:
the use of memcpy()/memcmp()/memset() introduces new error possibilities - e.g. supplying a wrong size
when the structure contains integer fields, the ordering under memcmp() changes between big- and little-endian architectures
a char array field of size n that is 0-terminated at position x must also have all elements after position x zeroed out at all times - otherwise two otherwise-equal structs compare unequal
assignment from a register to a field may also set the neighbouring padding bits to values other than 0; thus, subsequent comparisons with otherwise equal structures yield an unequal result
The last point is best illustrated with a small example (assuming architecture X):
#include <assert.h>
#include <string.h>

struct S {
    int a;   // on X: sizeof(int) == 4
    char b;  // on X: 24 padding bits are inserted after b
    int c;
};
typedef struct S S;

S s1;
memset(&s1, 0, sizeof(S));
s1.a = 0;
s1.b = 'a';
s1.c = 0;

S s2;
memcpy(&s2, &s1, sizeof(S));
assert(memcmp(&s1, &s2, sizeof(S)) == 0); // assertion is always true

s2.b = 'x';
assert(memcmp(&s1, &s2, sizeof(S)) != 0); // assertion is always true

// some computation

char x = 'x'; // on X: 'x' is stored in a 32 bit register
              // as least significant byte
              // the other bytes contain previous data

s1.b = x;     // the complete register is copied
              // i.e. the higher 3 register bytes are the new
              // padding bits in s1

assert(memcmp(&s1, &s2, sizeof(S)) == 0); // assertion is not always true
The failure of the last assertion may depend on code reordering, change of the compiler, change of compiler options and stuff like that.
Conclusion
As a general rule: to increase code correctness and portability use direct struct assignment (instead of memcpy()), C99 struct initialization syntax (instead of memset) and a custom comparison function (instead of memcmp()).
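For the struct S above, such a member-wise comparison function is short to write and immune to the padding issue:

/* Member-wise equality for struct S: padding bits never participate. */
static int S_equal(const S *x, const S *y)
{
    return x->a == y->a && x->b == y->b && x->c == y->c;
}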
In C people probably do that, because they think that memcpy would be faster. But I don't think that is true. Compiler optimizations would take care of that.
In C++ it may also have different semantics because of user defined assignment operator and copy constructors.
On top of what the others wrote, some additional points:
Using memcpy instead of a simple assignment gives a hint to someone who maintains the code that the operation might be expensive. Using memcpy in these cases improves the understanding of the code.
Embedded systems are often written with portability and performance in mind. Portability is important because you may want to re-use your code even if the CPU in the original design is not available or if a cheaper micro-controller can do the same job.
These days low-end micro-controllers come and go faster than the compiler developers can catch up, so it is not uncommon to work with compilers that use a simple byte-copy loop instead of something optimized for structure assignments. With the move to 32 bit ARM cores this is not true for a large part of embedded developers. There are however a lot of people out there who build products that target obscure 8 and 16 bit micro-controllers.
A memcpy tuned for a specific platform may be more optimal than what a compiler can generate. For example, on embedded platforms, having structures in flash memory is common. Reading from flash is not as slow as writing to it, but it is still a lot slower than an ordinary copy from RAM to RAM. An optimized memcpy function may use DMA or special features of the flash controller to speed up the copy process.
That is complete nonsense. Use whichever way you prefer. The simplest is:
exmple2=exmple1;
Whatever you do, don't do this:
exmple2.a=exmple1.a;
exmple2.b=exmple1.b;
It poses a maintainability problem because any time that anyone adds a member to the structure, they have to add a line of code to do the copy of that member. Someone is going to forget to do that and it will cause a hard to find bug.
On some implementations, the way in which memcpy() is performed may differ from the way in which "normal" structure assignment would be performed, in a manner that may be important in some narrow contexts. For example, one or the other structure operand may be unaligned and the compiler might not know about it (e.g. one memory region might have external linkage and be defined in a module written in a different language that has no means of enforcing alignment). Use of a __packed declaration would be better if a compiler supported such, but not all compilers do.
Another reason for using something other than structure assignment could be that a particular implementation's memcpy might access its operands in a sequence that would work correctly with certain kinds of volatile source or destination, while that implementation's struct assignment might use a different sequence that wouldn't work. This is generally not a good reason to use memcpy, however, since aside from the alignment issue (which memcpy is required to handle correctly in any case) the specifications for memcpy don't promise much about how the operation will be performed. It would be better to use a specially-written routine which performed the operations exactly as required (for example, if the target is a piece of hardware which needs to have 4 bytes of structure data written using four 8-bit writes rather than one 32-bit write, one should write a routine which does that, rather than hoping that no future version of memcpy decides to "optimize" the operation).
A third reason for using memcpy in some cases would be the fact that compilers will often perform small structure assignments using a direct sequence of loads and stores, rather than using a library routine. On some controllers, the amount of code this requires may vary depending upon where the structures are located in memory, to the point that the load/store sequence may end up being bigger than a memcpy call. For example, on a PICmicro controller with 1K words of code space and 192 bytes of RAM, copying a 4-byte structure from bank 1 to bank 0 would take 16 instructions. A memcpy call would take eight or nine (depending upon whether count is an unsigned char or int; with only 192 bytes of RAM total, unsigned char should be more than sufficient!). Note, however, that calling a memcpy-ish routine which assumed a hard-coded size and required both operands to be in RAM rather than code space would only require five instructions to call, and that could be reduced to four with the use of a global variable.
The first version is perfect.
The second one may be used for speed (though at this size there is no reason for it).
The third one is used only if the padding differs between target and source.
