Data alignment compiler option [duplicate] - c

I have to implement an optimized version of malloc/realloc/free (tailored for my particular application). At the moment the code runs on a particular platform, but I would like to write it in a portable way, if possible (the platform may change in the future), or at least I would like to concentrate the possible platform differences in a single point (probably a .h). I am aware of some of the problems:
differences in memory alignment
differences in the smallest memory block size suitable for "generic" allocation
differences in pointer size
(I'll ignore the differences in the basic system services for memory allocation here, since on some embedded systems they may be unavailable at all. Let's imagine that we work on a big preallocated memory block to be used as "heap").
The question(s):
Are there standard macros or functions in C for this kind of purpose?
What other issues may I face in this job?

The classic way to ensure that you maintain alignment suitable for all the basic types is to define a union:
union alloc_align {
void *dummy1;
long long dummy2;
long double dummy3;
};
...then ensure that the addresses you hand out are always offset by a multiple of sizeof (union alloc_align) from the aligned addresses you receive from the system memory allocator.
I believe a method similar to this is described in K&R.
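As a rough sketch of how that union can back an allocator over a preallocated block (arena_alloc, the arena and its size are illustrative names of my own, not a standard API):
#include <stddef.h>

/* Reuses the union alloc_align defined above. The arena is an array of that
 * union, so its base and every multiple of sizeof(union alloc_align) from it
 * are aligned for all of the member types. */
static union alloc_align heap[4096];   /* "heap" measured in alignment units */
static size_t heap_used;               /* units already handed out */

void *arena_alloc(size_t size)
{
    /* round the request up to a whole number of alignment units */
    size_t units = (size + sizeof(union alloc_align) - 1) / sizeof(union alloc_align);
    if (units > sizeof heap / sizeof heap[0] - heap_used)
        return NULL;                    /* out of space */
    void *p = &heap[heap_used];         /* always on a union boundary */
    heap_used += units;
    return p;
}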

Alignment features are only handled in the new C standard, C11. It has the keywords _Alignof and _Alignas and a function aligned_alloc. These features are not very difficult to emulate with most modern compilers (as indicated in other answers), so I'd suggest you write yourself small macros or wrappers that you'd use depending on __STDC_VERSION__.
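A minimal sketch of such a dispatch (the MY_* names are placeholders of my own; the non-C11 branches assume GCC-style and MSVC-style extensions respectively):
/* Select an alignment facility based on language level and compiler. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
  #include <stdalign.h>
  #define MY_ALIGNAS(n)    alignas(n)
  #define MY_ALIGNOF(type) alignof(type)
#elif defined(__GNUC__)
  #define MY_ALIGNAS(n)    __attribute__((aligned(n)))
  #define MY_ALIGNOF(type) __alignof__(type)
#elif defined(_MSC_VER)
  #define MY_ALIGNAS(n)    __declspec(align(n))
  #define MY_ALIGNOF(type) __alignof(type)
#else
  #error "no alignment support known for this compiler"
#endif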

Aligned memory allocation differs from compiler to compiler, unfortunately (this is one issue): on MSVC you have _aligned_malloc, you also have posix_memalign for Linux, and then there is also _mm_malloc, which works under ICC, MSVC and GCC, IIRC, and should be the most portable.
The second issue is the memory wasted by alignment; it wouldn't be major, but on embedded systems it's something to take note of.
If you are stack-allocating things that require alignment (like SIMD types), you also want to look into __attribute__((__aligned__(x))) and __declspec(align(x)).
In terms of portability of pointer arithmetic, you can use the types from stdint.h/pstdint.h, but the standard leaves the conversion between uintptr_t and a pointer implementation-defined (unfortunately standards aren't my strong point :().
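For example, a hedged sketch that combines both attribute spellings for a 16-byte-aligned stack buffer and checks the result through uintptr_t (the pointer-to-integer conversion is implementation-defined but behaves as expected on mainstream platforms; lane is just an illustrative name):
#include <stdint.h>

int simd_buffer_is_aligned(void)
{
#if defined(_MSC_VER)
    __declspec(align(16)) float lane[4];
#else
    float lane[4] __attribute__((__aligned__(16)));
#endif
    uintptr_t addr = (uintptr_t)(void *)lane;
    return (addr % 16) == 0;   /* expected to be 1 with either compiler */
}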

The main problem is that you only provide the total size of the memory block to malloc() and friends, without any information about the object granularity. If you view an allocation as an array of objects, then you have a size s that is the sizeof of the basic object, and a number n of objects in the array, e.g.:
p = malloc(sizeof(*p) * n);
If you have only the total size, then you don't know if s=4 and n=10, or if s=2 and n=20, or s=1 and n=40, because all multiply to the total size of 40 bytes.
So the basic question is: do you want a direct substitute for the original functions, e.g. because you have scattered native calls all over your code base, or do you have a centralized, DRY module of wrapper functions? In the latter case you could use functions that take s and n separately:
void *my_malloc (size_t s, size_t n)
Most of the time it is a safe bet that returning an address which is a multiple of s guarantees correct alignment.
Alternatively, when porting your implementation, you simply look at the alignment that the native malloc() uses for the target platform (e.g. multiples of 16), and use this for your own implementation.
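As an illustration of that second approach, here is a hedged runtime probe (an observation on the running platform, not a guarantee from any standard):
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* OR together a few malloc() results; the lowest set bit of the OR is the
 * alignment that every sampled block happened to share. */
static unsigned long probe_malloc_alignment(void)
{
    uintptr_t acc = 0;
    for (int i = 0; i < 16; i++) {
        void *p = malloc((size_t)(1 + 7 * i));
        if (p) {
            acc |= (uintptr_t)p;
            free(p);
        }
    }
    return (unsigned long)(acc & (~acc + 1));
}

int main(void)
{
    printf("native malloc appears to align to %lu bytes\n",
           probe_malloc_alignment());
    return 0;
}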

If you have a look at #pragma pack, this may help you as it allows you to define structure packing and is implemented on most compilers.
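A small example of the effect (the sizes in the comments are typical, not guaranteed):
#include <stdio.h>

struct normal   { char c; int i; };   /* padded: commonly 8 bytes */

#pragma pack(push, 1)
struct packed_s { char c; int i; };   /* packed: 5 bytes, i may be misaligned */
#pragma pack(pop)

int main(void)
{
    printf("%zu vs %zu\n", sizeof(struct normal), sizeof(struct packed_s));
    return 0;
}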

C says malloc returns a pointer to memory that is suitably aligned for any type of object. There is no portable way to achieve that using only C features, which has the consequence that malloc, if written in C, cannot be written portably.
(C99, 7.20.3p1) "The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object and then used to access such an object or an array of such objects in the space allocated (until the space is explicitly deallocated)."

Related

Why use _mm_malloc? (as opposed to _aligned_malloc, aligned_alloc, or posix_memalign)

There are a few options for acquiring an aligned block of memory but they're very similar and the issue mostly boils down to what language standard and platforms you're targeting.
C11
void * aligned_alloc (size_t alignment, size_t size)
POSIX
int posix_memalign (void **memptr, size_t alignment, size_t size)
Windows
void * _aligned_malloc(size_t size, size_t alignment);
And of course it's also always an option to align by hand.
Intel offers another option.
Intel
void* _mm_malloc (size_t size, size_t align)
void _mm_free (void *p)
Based on source code released by Intel, this seems to be the method of allocating aligned memory their engineers prefer, but I can't find any documentation comparing it to other methods. The closest I found simply acknowledges that other aligned memory allocation routines exist.
https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and
To dynamically allocate a piece of aligned memory, use posix_memalign,
which is supported by GCC as well as the Intel Compiler. The benefit
of using it is that you don’t have to change the memory disposal API.
You can use free() as you always do. But pay attention to the
parameter profile:
  int posix_memalign (void **memptr, size_t align, size_t size);
The Intel Compiler also provides another set of memory allocation
APIs. C/C++ programmers can use _mm_malloc and _mm_free to allocate
and free aligned blocks of memory. For example, the following
statement requests a 64-byte aligned memory block for 8 floating point
elements.
  farray = (float *)_mm_malloc(8*sizeof(float), 64);
Memory that is allocated using _mm_malloc must be freed using
_mm_free. Calling free on memory allocated with _mm_malloc or calling _mm_free on memory allocated with malloc will result in unpredictable behavior.
The clear differences from a user perspective are that _mm_malloc requires direct CPU and compiler support and memory allocated with _mm_malloc must be freed with _mm_free. Given these drawbacks, what is the reason for ever using _mm_malloc? Can it have a slight performance advantage? Historical accident?
Intel compilers support POSIX (Linux) and non-POSIX (Windows) operating systems, hence cannot rely upon either the POSIX or the Windows function. Thus, a compiler-specific but OS-agnostic solution was chosen.
C11 is a great solution but Microsoft doesn't even support C99 yet, so who knows if they will ever support C11.
Update: Unlike the C11/POSIX/Windows allocation functions, the ICC intrinsics include a deallocation function. This allows this API to use a separate heap manager from the default one. I don't know if/when it actually does that, but it can be useful to support this model.
Disclaimer: I work for Intel but have no special knowledge of these decisions, which happened long before I joined the company.
It's possible to take an existing C compiler which does not presently happen to use the identifiers _mm_malloc and _mm_free and define functions with those names that behave as required. This could be done either by having _mm_malloc act as a wrapper around malloc() which asks for a slightly oversized allocation, constructs a pointer to the first suitably aligned address within it that is at least one byte from the beginning, and stores the number of bytes skipped immediately before that address; or by having _mm_malloc request large chunks of memory from malloc() and then dispense them piecemeal. In either case, the pointers returned by _mm_malloc() would not be pointers that free() would generally know what to do with; calling _mm_free would use the byte immediately preceding the allocation to find the real start of the block received from malloc, and then pass that to free.
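A hedged sketch of that first variant (my_mm_malloc/my_mm_free are placeholder names, the alignment is assumed to be a power of two no larger than 128 so the skip count fits in one byte, and the integer/pointer round-trip is implementation-defined):
#include <stdlib.h>
#include <stdint.h>

void *my_mm_malloc(size_t size, size_t align)
{
    unsigned char *raw = malloc(size + align);       /* room to shift forward */
    if (raw == NULL)
        return NULL;
    uintptr_t addr = (uintptr_t)(raw + 1);           /* skip at least one byte */
    unsigned char *user =
        (unsigned char *)((addr + align - 1) & ~(uintptr_t)(align - 1));
    user[-1] = (unsigned char)(user - raw);          /* bytes skipped, stored just before */
    return user;
}

void my_mm_free(void *p)
{
    if (p != NULL) {
        unsigned char *user = p;
        free(user - user[-1]);                       /* recover malloc()'s pointer */
    }
}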
If an aligned-allocate function is allowed to use the internals of the malloc and free functions, however, that may eliminate the need for the extra layer of wrapping. It's possible to write _mm_malloc()/_mm_free() functions that wrap malloc/free without knowing anything about their internals, but it requires that _mm_malloc() keep book-keeping information which is separate from that used by malloc/free.
If the author of an aligned-allocate function knows how malloc and free are implemented, it will often be possible to coordinate the design of all the allocation/free functions so that free can distinguish all kinds of allocations and handle them appropriately. No single aligned-allocate implementation would be usable on all malloc/free implementations, however.
I would suggest that the most portable way to write code would probably be to select a couple of symbols that are not used anywhere else for your own allocate and free functions, so that you could then say, e.g.
#define a_alloc(align,sz) _mm_malloc((sz),(align))
#define a_free(ptr) _mm_free((ptr))
on compilers that support that, or
static inline void *aa_alloc(size_t align, size_t size)
{
    void *ret = NULL;
    if (posix_memalign(&ret, align, size) != 0)  // Guessing here
        ret = NULL;
    return ret;
}
#define a_alloc(align,sz) aa_alloc((align),(sz))
#define a_free(ptr) free((ptr))
on Posix systems, etc. For every system it should be possible to define macros or functions that will yield the necessary behavior [I think it's probably better to use macros consistently than to sometimes use macros and sometimes functions, so as to allow #if defined macroname to test whether things are defined yet].
_mm_malloc seems to have been created before there was a standard aligned_alloc function, and the need to use _mm_free is a quirk of the implementation.
My guess is that unlike when using posix_memalign, it doesn't need to over-allocate in order to guarantee alignment, instead it uses a separate alignment-aware allocator. This will save memory when allocating types with alignment different to the default alignment (typically 8 or 16 bytes).

C structure assignment uses memcpy

I have this StructType st = StructTypeSecondInstance->st; and it generates a segfault. The strange part is that the stack backtrace shows me:
0x1067d2cc: memcpy + 0x10 (0, 10000, 1, 1097a69c, 11db0720, bfe821c0) + 310
0x103cfddc: some_function + 0x60 (0, bfe823d8, bfe82418, 10b09b10, 0, 0) +
So, does struct assignment use memcpy?
One can't tell. Small structs may even be kept in registers. Whether memcpy is used is an implementation detail (it's not even implementation-defined or unspecified -- it's just something the compiler writer chooses and does not need to document).
From a C Standard point of view, all that matters is that after the assignment, the struct members of the destination struct compare equal to the corresponding members of the source struct.
I would expect compiler writers to make a tradeoff between speed and simplicity, probably based on the size of the struct, the larger the more likely to use a memcpy. Some memcpy implementations are very sophisticated and use different algorithms depending on whether the length is some power of 2 or not, or the alignment of the src and dst pointers. Why reinvent the wheel or blow up the code with an inline version of memcpy?
It might, yes.
This shouldn't be surprising: the struct assignment needs to copy a bunch of bytes from one place to another as quickly as possible, which happens to be the exact thing memcpy() is supposed to be good at. Generating a call to it seems like a no-brainer if you're a compiler writer.
Note that this means that assigning structs with lots of padding might be less efficient than optimally, since memcpy() can't skip the padding.
The standard doesn't say anything at all about how assignment (or any other operator) is actually realized by the compiler. There's nothing stopping a compiler from (say) generating a function call for every operation in your source file.
The compiler has license to implement assignment as it thinks best. Most of the time, with most compilers on most platforms, this means that if the structure is reasonably small, the compiler will generate an inline sequence of move instructions; if the structure is large, calling memcpy is common.
It would be perfectly valid, however, for the compiler to loop over generating random bitfields and stop when one of them matches the source of the assignment (Let's call this algorithm bogocopy).
Compilers that support non-hosted operation usually give you a switch to turn off emitting such libcalls if you're targeting a platform without an available (or complete) libc.
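As a small illustration of the large-struct case described above (the size threshold is compiler-dependent; struct big and copy_big are illustrative names):
struct big { double values[512]; };   /* 4 KiB - large enough that compilers
                                         commonly emit a call to memcpy */

void copy_big(struct big *dst, const struct big *src)
{
    *dst = *src;                      /* plain assignment ...              */
    /* ... which many compilers lower to the equivalent of:
       memcpy(dst, src, sizeof *dst); */
}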
It depends on the compiler and platform. Assignment of big objects can use memcpy. But that by itself should not be the cause of the segfault.

Why are structures copied via memcpy in embedded system code?

In the embedded software domain, when copying structures of the same type, people often don't use direct assignment but instead use the memcpy() function or copy each element individually.
Let's take, for example:
struct tag
{
int a;
int b;
};
struct tag exmple1 = {10,20};
struct tag exmple2;
For copying exmple1 into exmple2, instead of writing directly
exmple2 = exmple1;
people use
memcpy(&exmple2, &exmple1, sizeof(struct tag));
or
exmple2.a = exmple1.a;
exmple2.b = exmple1.b;
Why?
One way or the other, there is nothing specific about embedded systems that makes this dangerous; the language semantics are identical for all platforms.
C has been used in embedded systems for many years, and early C compilers, before ANSI/ISO standardisation did not support direct structure assignment. Many practitioners are either from that era, or have been taught by those that were, or are using legacy code written by such practitioners. This is probably the root of the doubt, but it is not a problem on an ISO compliant implementation. On some very resource constrained targets, the available compiler may not be fully ISO compliant for a number of reasons, but I doubt that this feature would be affected.
One issue (that applies to embedded and non-embedded alike), is that when assigning a structure, an implementation need not duplicate the value of any undefined padding bits, therefore if you performed a structure assignment, and then performed a memcmp() rather than member-by-member comparison to test for equality, there is no guarantee that they will be equal. However if you perform a memcpy(), any padding bits will be copied so that memcmp() and member-by-member comparison will yield equality.
So it is arguably safer to use memcpy() in all cases (not just embedded), but the improvement is marginal, and not conducive to readability. It would be a strange implementation that did not use the simplest method of structure assignment, and that is a simple memcpy(), so it is unlikely that the theoretical mismatch would occur.
In your given code there is no problem even if you write:
example2 = example1;
But just assume if in future, the struct definition changes to:
struct tag
{
int a[1000];
int b;
};
Now if you execute the assignment operator as above, then (some) compilers might inline the code for byte-by-byte (or int-by-int) copying, i.e.
example2.a[0] = example1.a[0];
example2.a[1] = example1.a[1];
example2.a[2] = example1.a[2];
...
which will result in code bloat in your code segment. Such code bloat is not trivial to spot. That's why people use memcpy.
[However, I have heard that modern compilers are capable enough to use memcpy internally when such an assignment is encountered, especially for PODs.]
Copying C structures via memcpy() is often done by programmers who learned C decades ago and did not follow the standardization process since. They simply don't know that C supports assignment of structures (direct structure assignment was not available in all pre-ANSI-C89 compilers).
When they learn about this feature some still stick to the memcpy() way because it is their custom. There are also motivations that originate in cargo cult programming, e.g. it is claimed that memcpy is just faster - of course - without being able to back this up with a benchmark test case.
Structures are also memcpy()ied by some newbie programmers because they either confuse structure assignment with the assignment of a pointer to a structure - or they simply overuse memcpy() (they often also use memcpy() where strcpy() would be more appropriate).
There is also the memcmp() structure comparison anti-pattern that is sometimes cited by some programmers for using memcpy() instead of structure assignment. The reasoning behind this is the following: since C does not automatically generate a == operator for structures and writing a custom structure comparison function is tedious, memcmp() is used to compare structures. In the next step - to avoid differences in the padding bits of compared structures - memset(...,0,...) is used to initialize all structures (instead of using the C99 initializer syntax or initializing all fields separately) and memcpy() is used to copy the structures! Because memcpy() also copies the content of the padding bits ...
But note that this reasoning is flawed for several reasons:
the use of memcpy()/memcmp()/memset() introduce new error possibilities - e.g. supplying a wrong size
when the structure contains integer fields the ordering under memcmp() changes between big- and little-endian architectures
a char array field of size n that is 0-terminated at position x must also have all elements after position x zeroed out at all times - otherwise two otherwise-equal structs compare unequal
assignment from a register to a field may also set the neighbouring padding bits to nonzero values; thus, subsequent comparisons with otherwise equal structures yield an unequal result
The last point is best illustrated with a small example (assuming architecture X):
struct S {
int a; // on X: sizeof(int) == 4
char b; // on X: 24 padding bits are inserted after b
int c;
};
typedef struct S S;
S s1;
memset(&s1, 0, sizeof(S));
s1.a = 0;
s1.b = 'a';
s1.c = 0;
S s2;
memcpy(&s2, &s1, sizeof(S));
assert(memcmp(&s1, &s2, sizeof(S)) == 0); // assertion is always true
s2.b = 'x';
assert(memcmp(&s1, &s2, sizeof(S)) != 0); // assertion is always true
// some computation
char x = 'x'; // on X: 'x' is stored in a 32 bit register
// as least significant byte
// the other bytes contain previous data
s1.b = x; // the complete register is copied
// i.e. the higher 3 register bytes are the new
// padding bits in s1
assert(memcmp(&s1, &s2, sizeof(S)) == 0); // assertion is not always true
The failure of the last assertion may depend on code reordering, change of the compiler, change of compiler options and stuff like that.
Conclusion
As a general rule: to increase code correctness and portability use direct struct assignment (instead of memcpy()), C99 struct initialization syntax (instead of memset) and a custom comparison function (instead of memcmp()).
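A hedged sketch of that advice, reusing struct S from the example above (s_equal is an illustrative name):
#include <stdbool.h>

/* Member-by-member comparison: padding bits are never inspected, so the
 * effects described above cannot influence the result. */
static bool s_equal(const struct S *x, const struct S *y)
{
    return x->a == y->a && x->b == y->b && x->c == y->c;
}

/* C99 designated initializer instead of memset(): unnamed members are
 * zero-initialized; padding contents stay unspecified but are irrelevant. */
struct S s3 = { .a = 0, .b = 'a', .c = 0 };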
In C, people probably do that because they think memcpy would be faster. But I don't think that is true; compiler optimizations would take care of that.
In C++ it may also have different semantics because of user-defined assignment operators and copy constructors.
On top of what the others wrote some additional points:
Using memcpy instead of a simple assignment gives a hint to someone who maintains the code that the operation might be expensive. Using memcpy in these cases improves the understanding of the code.
Embedded systems are often written with portability and performance in mind. Portability is important because you may want to re-use your code even if the CPU in the original design is not available or if a cheaper micro-controller can do the same job.
These days low-end micro-controllers come and go faster than the compiler developers can catch up, so it is not uncommon to work with compilers that use a simple byte-copy loop instead of something optimized for structure assignments. With the move to 32-bit ARM cores this is no longer true for a large part of embedded developers. There are, however, a lot of people out there who build products that target obscure 8- and 16-bit micro-controllers.
A memcpy tuned for a specific platform may be better than what a compiler can generate. For example, on embedded platforms having structures in flash memory is common. Reading from flash is not as slow as writing to it, but it is still a lot slower than an ordinary copy from RAM to RAM. An optimized memcpy function may use DMA or special features of the flash controller to speed up the copy process.
That is complete nonsense. Use whichever way you prefer. The simplest is:
exmple2=exmple1;
Whatever you do, don't do this:
exmple2.a=exmple1.a;
exmple2.b=exmple1.b;
It poses a maintainability problem because any time that anyone adds a member to the structure, they have to add a line of code to do the copy of that member. Someone is going to forget to do that and it will cause a hard to find bug.
On some implementations, the way in which memcpy() is performed may differ from the way in which "normal" structure assignment would be performed, in a manner that may be important in some narrow contexts. For example, one or the other structure operand may be unaligned and the compiler might not know about it (e.g. one memory region might have external linkage and be defined in a module written in a different language that has no means of enforcing alignment). Use of a __packed declaration would be better if a compiler supported such, but not all compilers do.
Another reason for using something other than structure assignment could be that a particular implementation's memcpy might access its operands in a sequence that would work correctly with certain kinds of volatile source or destination, while that implementation's struct assignment might use a different sequence that wouldn't work. This is generally not a good reason to use memcpy, however, since aside from the alignment issue (which memcpy is required to handle correctly in any case) the specifications for memcpy don't promise much about how the operation will be performed. It would be better to use a specially-written routine which performs the operations exactly as required (for example, if the target is a piece of hardware which needs to have 4 bytes of structure data written using four 8-bit writes rather than a single 32-bit write, one should write a routine which does that, rather than hoping that no future version of memcpy decides to "optimize" the operation).
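A sketch of such a special-purpose routine (copy_bytes_to_hw is an illustrative name): it guarantees one 8-bit write per byte, in ascending address order, regardless of what any memcpy implementation might choose to do.
#include <stddef.h>

void copy_bytes_to_hw(volatile unsigned char *dst,
                      const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)   /* strictly byte-sized, in-order writes */
        dst[i] = src[i];
}
It would typically be called as copy_bytes_to_hw(device_regs, (const unsigned char *)&cfg, sizeof cfg), with device_regs and cfg standing in for the real hardware window and structure.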
A third reason for using memcpy in some cases would be the fact that compilers will often perform small structure assignments using a direct sequence of loads and stores, rather than using a library routine. On some controllers, the amount of code this requires may vary depending upon where the structures are located in memory, to the point that the load/store sequence may end up being bigger than a memcpy call. For example, on a PICmicro controller with 1K words of code space and 192 bytes of RAM, copying a 4-byte structure from bank 1 to bank 0 would take 16 instructions. A memcpy call would take eight or nine (depending upon whether count is an unsigned char or int [with only 192 bytes of RAM total, unsigned char should be more than sufficient!]). Note, however, that calling a memcpy-ish routine which assumed a hard-coded size and required both operands to be in RAM rather than code space would only require five instructions to call, and that could be reduced to four with the use of a global variable.
The first version is fine.
The second one may be used for speed (though there is no reason to at this size).
The third one is used only if the padding is different for target and source.

Approved syntax for raw pointer manipulation

I am making a memory block copy routine and need to deal with blocks of raw memory in efficient chunks. My question is not about the specialized copy routine I'm making, but in how to correctly examine raw pointer alignment in C.
I have a raw pointer of memory, let's say it's already cast as a non-null char *.
In my architecture, I can very efficiently copy memory in 64-byte chunks when it is aligned to a 64-byte boundary. So the (standard) trick is that I will do a simple copy of 0-63 bytes "manually" at the head and/or tail to transform the copy from an arbitrary char* of arbitrary length to a 64-byte-aligned pointer with a length that is a multiple of 64 bytes.
Now the question is, how do you legally "examine" a pointer to determine (and manipulate) its alignment?
The obvious way is to cast it into an integer and just examine the bits:
char *pointer = something;
int p = (int)pointer;
char *alignedPointer = (char *)((p + 63) & ~63);
Note here I realize that alignedPointer doesn't point to the same memory as pointer... this is the "rounded up" pointer that I can call my efficient copy routine on, and I'll handle any other bytes at the beginning manually.
But compilers (justifiably) freak out at casting a pointer into an integer. But how else can I examine and manipulate the pointer's lower bits in LEGAL C? Ideally so that with different compilers I'd get no errors or warnings.
For integer types that are large enough to hold pointers, C99 stdint.h has:
uintptr_t
intptr_t
For data lengths there are:
size_t
ssize_t
which have been around since well before C99.
If your platform doesn't have these, you can maximise your code's portability by still using these type names, and making suitable typedefs for them.
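A sketch of the rounding using uintptr_t (the pointer/integer round-trip is implementation-defined per C99 6.3.2.3, as discussed below, but behaves as expected on mainstream platforms):
#include <stdint.h>

/* Round a pointer up to the next 64-byte boundary. */
static char *align_up_64(char *p)
{
    uintptr_t addr = (uintptr_t)p;
    return (char *)((addr + 63u) & ~(uintptr_t)63u);
}
The head to copy by hand is then align_up_64(p) - p bytes, and masking with ~(uintptr_t)63u alone rounds down for the tail.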
I don't think that in the past people were as reluctant to do their own bit-banging, but maybe the current "don't touch that" mood would be conducive to someone creating some kind of standard library for aligning pointers. Lacking some kind of official api, you have no choice but to AND and OR your way through.
Instead of int, try a datatype that's guaranteed to be the same size as a pointer (INT_PTR on Win32/64). Maybe the compiler won't freak out too much. :) Or use a union, if 64-bit compatibility is not important.
Casting pointers to and from integers is valid, but the results are implementation-defined. See section 6.3.2.3 of the standard. The intention seems to be that the results are what anybody familiar with the system would expect, and indeed this appears to be routinely the case in practice.
If the architecture in question can efficiently manipulate pointers and integers interchangeably, and the issue is just whether it will work on all compilers for that system, then the answer is that it probably will anyway.
(Certainly, if I were writing this code, I would think it fine as-is until proven otherwise. My experience has been that compilers for a given system all behave in very similar ways at this sort of level; the assembly language just suggests a particular approach, that all then take.)
"Probably works" isn't very good general advice though, so my suggestion would be just write the code that works, surround it enough suitable #ifdefs that only the known compiler(s) will compile it, and defer to memcpy in other cases.
#ifdef is rarely ideal, but it's fairly lightweight compared to other possibilities. And if implementation-defined behaviour or compiler-specific tricks are needed then the options are pretty limited anyway.
