Preferred idiom for endianness-agnostic reads

In the Plan 9 source code I often find code like this to read serialised data from a buffer with a well-defined endianness:
#include <stdint.h>
uint32_t le32read(uint8_t buf[static 4]) {
return (buf[0] | buf[1] << 8 | buf[2] << 16 | buf[3] << 24);
}
I expected both gcc and clang to compile this code into something as simple as this assembly on amd64:
.global le32read
.type le32read,@function
le32read:
mov (%rdi),%eax
ret
.size le32read,.-le32read
But contrary to my expectations, neither gcc nor clang recognizes this pattern; both produce complex assembly with multiple shifts instead.
Is there an idiom for this kind of operation that is both portable to all C99 implementations and produces good code (i.e. like the assembly presented above) across implementations?

After some research, I found (with the help of the terrific people in ##c on Freenode) that gcc 5.0 will implement optimizations for the kind of pattern described above. In fact, it compiles the C source listed in my question to exactly the assembly I listed there.
I haven't found similar information about clang, so I filed a bug report. As of Clang 9.0, clang recognises both the read as well as the write idiom and turns it into fast code.
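For illustration, here is a minimal sketch of the read idiom together with a matching write idiom. The names le32read_u and le32write_u are hypothetical, and the explicit uint32_t casts are an addition that keeps buf[3] << 24 from shifting a promoted int into its sign bit. Whether a given compiler folds either function into a single load or store still depends on its version and flags.
```c
#include <stdint.h>

/* Hypothetical name: read a little-endian 32-bit value from 4 bytes. */
static uint32_t le32read_u(const uint8_t buf[4])
{
    /* Cast each byte first so every shift is done on an unsigned 32-bit value. */
    return (uint32_t)buf[0]
         | (uint32_t)buf[1] << 8
         | (uint32_t)buf[2] << 16
         | (uint32_t)buf[3] << 24;
}

/* Hypothetical name: store a 32-bit value as 4 little-endian bytes. */
static void le32write_u(uint8_t buf[4], uint32_t v)
{
    buf[0] = (uint8_t)(v);
    buf[1] = (uint8_t)(v >> 8);
    buf[2] = (uint8_t)(v >> 16);
    buf[3] = (uint8_t)(v >> 24);
}
```
Both functions give the same result on any host byte order, because they only use arithmetic on values and never reinterpret memory.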

If you want to guarantee a conversion between the native platform order and a defined order (network order, for example), you can let the system libraries do the work and simply use the functions of <netinet/in.h>: htons, htonl, ntohs and ntohl.
But I must admit that the include file is not guaranteed: under Windows I think it is winsock.h.

You could determine endianness like in this answer. Then use the O32_HOST_ORDER macro to decide whether to cast the byte array to a uint32_t directly or to use your bit shifting expression.
#include <stdint.h>
uint32_t le32read(uint8_t buf[static 4]) {
if (O32_HOST_ORDER == O32_LITTLE_ENDIAN) {
return *(uint32_t *)&buf[0];
}
return (buf[0] | buf[1] << 8 | buf[2] << 16 | buf[3] << 24);
}
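The pointer cast in the function above assumes that buf is suitably aligned and that the compiler tolerates the aliasing. A hedged alternative sketch uses memcpy for the little-endian fast path. It assumes the GCC/clang predefined macros __BYTE_ORDER__ and __ORDER_LITTLE_ENDIAN__ in place of O32_HOST_ORDER, and the name le32read_memcpy is hypothetical.
```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: little-endian read that is safe for unaligned buffers. */
static uint32_t le32read_memcpy(const uint8_t *buf)
{
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
    /* Assumption: these macros are GCC/clang extensions, not ISO C. */
    uint32_t v;
    memcpy(&v, buf, sizeof v);   /* compilers commonly turn this into one load */
    return v;
#else
    /* Portable fallback: assemble the value byte by byte. */
    return (uint32_t)buf[0]
         | (uint32_t)buf[1] << 8
         | (uint32_t)buf[2] << 16
         | (uint32_t)buf[3] << 24;
#endif
}
```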

Related

How can I elegantly take advantage of ARM instructions like REV and RBIT when writing C code?

I am writing C code which may be compiled for the Arm Cortex-M3 microcontroller.
This microcontroller supports several useful instructions for efficiently manipulating bits in registers, including REV*, RBIT, SXT*.
When writing C code, how can I take advantage of these instructions if I need those specific functions? For example, how can I complete this code?
#define REVERSE_BIT_ORDER(x) { /* what to write here? */ }
I would like to do this without using inline assembler so that this code is both portable, and readable.
Added:
In part, I am asking how to express such a function in C elegantly. For example, it's easy to express bit shifting in C, because it's built into the language. Likewise, setting or clearing bits. But bit reversal is unknown in C, and so is very hard to express. For example, this is how I would reverse bits:
unsigned int ReverseBits(unsigned int x)
{
unsigned int ret = 0;
for (int i=0; i<32; i++)
{
ret <<= 1;
if (x & (1<<i))
ret |= 1;
}
return ret;
}
Would the compiler recognise this as bit reversal, and issue the correct instruction?

Reversing the bits of a 32-bit integer is a fairly exotic operation, which might be why the compiler doesn't recognize it. I was able to generate code that uses REV (reverse byte order), however, which is a far more common use case:
#include <stdint.h>
uint32_t endianize (uint32_t input)
{
return ((input >> 24) & 0x000000FF) |
((input >> 8) & 0x0000FF00) |
((input << 8) & 0x00FF0000) |
((input << 24) & 0xFF000000) ;
}
With gcc -O3 -mcpu=cortex-m3 -ffreestanding (for ARM32, vers 11.2.1 "none"):
endianize:
rev r0, r0
bx lr
https://godbolt.org/z/odGqzjTGz
It works for clang armv7-a 15.0.0 too, as long as you use -mcpu=cortex-m3.
So this supports the idea of avoiding manual optimizations and letting the compiler worry about them.

@Lundin's answer shows a pure-C shift/mask bithack that clang recognizes and compiles to a single rev instruction. (Or presumably to x86 bswap if targeting x86, or equivalent instructions on other ISAs that have them.)
In portable ISO C, hoping for pattern-recognition is unfortunately the best you can do, because they haven't added portable ways to expose CPU functionality; even C++ took until C++20 to add the <bit> header for things like std::popcount and C++23 std::byteswap.
(Some fairly-portable C libraries / headers have byte-reversal, e.g. as part of networking there's ntohl (net-to-host), which is an endian-swap on little-endian machines. Or there's GCC's (or glibc's?) endian.h, with htobe32 being host to big-endian 32-bit; see its man page. These are usually implemented with intrinsics that compile to a single instruction in good-quality implementations.)
Of course, if you definitely want a byte swap regardless of host endianness, you could do htole32(be32toh(x)) because one of them's a no-op and the other's a byte-swap, since ARM is either big or little endian. (It's still a byte-swap even if neither of them are NOPs, even on PDP or other mixed-endian machines, but there might be more efficient ways to do it.)
There are also some "collections of useful functions" headers with intrinsics for different compilers, with functions like byte swap. These can be of varying quality in terms of efficiency and maybe even correctness.
You can see that no, neither GCC nor clang optimize your code to rbit for ARM or AArch64. https://godbolt.org/z/Y7noP61dE . Presumably looping over bits in the other direction isn't any better. Perhaps a bithack as in In C/C++ what's the simplest way to reverse the order of bits in a byte? or Efficient Algorithm for Bit Reversal (from MSB->LSB to LSB->MSB) in C .
GCC and clang recognize the standard bithack for popcount, but I didn't check any of the answers on the bit-reverse questions.
Some languages, notably Rust, do care more about making it possible to portably express what modern CPUs can do. foo.reverse_bits() (since Rust 1.37) and foo.swap_bytes() just work for any type on any ISA. For u32 specifically, https://doc.rust-lang.org/std/primitive.u32.html#method.reverse_bits (That's Rust's equivalent of C uint32_t.)
Most mainstream C implementations have portable (across ISAs) builtins or (target-specific) intrinsics (like __REV() or __REV16()) for stuff like this.
The GNU dialect of C (GCC/clang/ICC and some others) includes __builtin_bswap32(input). See Does ARM GCC have a builtin function for the assembly 'REV' instruction?. It's named after the x86 bswap instruction, but it's just a byte-reverse that GCC / clang compile to whatever instructions can do it efficiently on the target ISA.
There's also a __builtin_bswap16(uint16_t) for swapping the bytes of a 16-bit integer, like revsh except the C semantics don't include preserving the upper 16 bits of a 32-bit integer. (Because normally you don't care about that part.) See the GCC manual for the available GNU C builtins that aren't target-specific.
There isn't a GNU C builtin or intrinsic for bitwise reverse that I could find in the manual or GCC arm-none-eabi 12.2 headers.
ARM documents an __rbit() intrinsic for their own compiler, but I think that's Keil's ARMCC, so there might not be any equivalent of that for GCC/clang.
@0___________ suggests https://github.com/ARM-software/CMSIS_5 for headers that define a function for that.
If worst comes to worst, GNU C inline asm is possible for GCC/clang, given appropriate #ifdefs. You might also want if (__builtin_constant_p(x)) to use a pure-C bit-reversal so constant-propagation can happen on compile-time constants, only using inline asm on runtime-variable values.
uint32_t output, input=...;
#if defined(__arm__) || defined (__aarch64__)
// same instruction is valid for both
asm("rbit %0,%1" : "=r"(output) : "r"(input));
#else
... // pure C fallback or something
#endif
Note that it doesn't need to be volatile because rbit is a pure function of the input operand. It's a good thing if GCC/clang are able to hoist this out of a loop. And it's a single asm instruction so we don't need an early-clobber.
This has the downside that the compiler can't fold a shift into it, e.g. if you wanted a byte-reverse, __rbit(x) >> 24 equals __rbit(x<<24), which could be done with rbit r0, r1, lsl #24. (I think).
With inline asm I don't think there's a way to tell the compiler that a r1, lsl #24 is a valid expansion for the %1 input operand. Hmm, unless there's a machine-specific constraint for that? https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html - no, no mention of "shifted" or "flexible" source operand in the ARM section.
Efficient Algorithm for Bit Reversal (from MSB->LSB to LSB->MSB) in C shows an #ifdefed version with a working fallback that uses a bithack to reverse bits within a byte, then __builtin_bswap32 or MSVC _byteswap_ulong to reverse bytes.
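As a rough sketch of that fallback shape (not the code from the linked answer), the following reverses the bits inside each byte with shift/mask steps and then reverses the byte order. The name reverse_bits32 is hypothetical, and the use of __builtin_bswap32 under __GNUC__ is an assumption about the toolchain; other compilers take the plain-C byte swap.
```c
#include <stdint.h>

/* Hypothetical name: reverse all 32 bits of x. */
static uint32_t reverse_bits32(uint32_t x)
{
    /* Swap adjacent bits, then bit pairs, then nibbles: this reverses
       the bit order inside every byte. */
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);

    /* Finally reverse the order of the four bytes. */
#if defined(__GNUC__)
    return __builtin_bswap32(x);   /* GNU C builtin, also provided by clang */
#else
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
#endif
}
```
Whether this compiles to rbit on ARM is not guaranteed; it is worth checking the generated assembly for the compiler in use.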

It would be best to use the CMSIS intrinsics:
__REV, __REV16, etc. The CMSIS header files contain much more.
You can get them from here:
https://github.com/ARM-software/CMSIS_5
and you are looking for cmsis_gcc.h file (or similar if you use another compiler).

Interestingly, ARM gcc seems to have improved its detection of byte order reversing recently. With version 11, it would detect byte reversal if done by bit shifting, or by byte swapping through a pointer. However, from version 10 and backwards, the pointer method failed to issue the REV instruction.
uint32_t endianize1 (uint32_t input)
{
return ((input >> 24) & 0x000000FF) |
((input >> 8) & 0x0000FF00) |
((input << 8) & 0x00FF0000) |
((input << 24) & 0xFF000000) ;
}
uint32_t endianize2 (uint32_t input)
{
uint32_t output;
uint8_t *in8 = (uint8_t*)&input;
uint8_t *out8 = (uint8_t*)&output;
out8[0] = in8[3];
out8[1] = in8[2];
out8[2] = in8[1];
out8[3] = in8[0];
return output;
}
endianize1:
rev r0, r0
bx lr
endianize2:
mov r3, r0
movs r0, #0
lsrs r2, r3, #24
bfi r0, r2, #0, #8
ubfx r2, r3, #16, #8
bfi r0, r2, #8, #8
ubfx r2, r3, #8, #8
bfi r0, r2, #16, #8
bfi r0, r3, #24, #8
bx lr
https://godbolt.org/z/E3xGvG9qq
So, as we wait for optimisers to improve, there are certainly ways you can help the compiler understand your intent and take good advantage of the instruction set (without resorting to micro optimisations or inline assembler). But it's likely that this will involve a good understanding of the architecture by the programmer, and examination of the output assembler.
Take advantage of http://godbolt.org to help examine the compiler output, and see what produces the best output.

Cast from array causing crash on some MCUs but not others

I have a piece of code looking like this:
void update_clock(uint8_t *time_array)
{
time_t time = *((time_t *) &time_array[0]); // <-- hangs
/* ... more code ... */
}
Where time_array is an array of 4 bytes (i.e. uint8_t time_array[4]).
I'm using arm-none-eabi-gcc to compile this for an STM32L4 processor.
While compiling this a couple of months ago I got no errors and the code is running perfectly fine on all my test MCUs. I did some updates to my environment (OpenSTM32) when coming back to this project and now this piece of code is crashing on some MCUs while working fine on others.
I still have my binary from a couple of months ago and have confirmed that this code path works fine on all of my MCUs (I have about 5 to test on), but now it works on two of them while causing a crash on three of them.
I have mitigated the problem by rewriting the code like this:
time_t time = (
((uint32_t) time_array[0]) << 0 |
((uint32_t) time_array[1]) << 8 |
((uint32_t) time_array[2]) << 16 |
((uint32_t) time_array[3]) << 24
);
While this works for now, I think the old code looks cleaner and I'm also worried that if this code path hangs I probably will have similar errors elsewhere.
Does anyone have any idea what can be causing this? Can I change anything in my setup to make the compiler work the old way again?

From version 7-2017-q4-major, arm gcc ships with newlib compiled with time_t defined as 64 bit (long long) integer, causing all sorts of problems with code that assumes it to be 32 bits. Your code is reading past the end of the source array, taking whatever is stored there as the high order bits of the time value, possibly resulting in a date before the big bang, or after the heat death of the universe, which might not be what your code expects.
If the source array is known to contain 32 bits of data, copy it to a 32 bit int32_t variable first, then you can assign it to a time_t, this way it will be properly converted, regardless of the size of time_t.
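A minimal sketch of that suggestion, assuming the four bytes are stored in the host's own byte order (as the original cast assumed); the helper name time_from_bytes32 is hypothetical:
```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: read a 32-bit timestamp from a 4-byte array. */
static time_t time_from_bytes32(const uint8_t *time_array)
{
    int32_t raw;                           /* exactly 4 bytes wide */
    memcpy(&raw, time_array, sizeof raw);  /* reads only 4 bytes, no alignment requirement */
    return (time_t)raw;                    /* value-converted, whatever the size of time_t */
}
```
Because only sizeof raw bytes are copied, nothing past the end of the array is read, even when time_t is 64 bits wide.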

Your development environment OpenSTM32 may be using a gcc compiler. If so, gcc supports the following flag:
-fno-strict-aliasing
If you are using -O2, this flag might resolve your problem.
Using memcpy is the standard advice, and is sometimes optimized-away by the compiler:
memcpy(&time, time_array, sizeof time);
Finally, you can use gcc's typeof and a compound literal with a union to generate the following safe cast:
#define PUN_CAST4(a, x) ((union {uint8_t src[4]; typeof(x) dst;}){{a[0],a[1],a[2],a[3]}}).dst
time_t time = PUN_CAST4(time_array, time);
As an example, the following code is compiled at https://godbolt.org/g/eZRXxW:
#include <stdint.h>
#include <time.h>
#include <string.h>
time_t update_clock(uint8_t *time_array) {
time_t t = *((time_t *) &time_array[0]); // assumes no alignment problem
return t;
}
time_t update_clock2(uint8_t *time_array) {
time_t t =
(uint32_t)time_array[0] << 0 |
(uint32_t)time_array[1] << 8 |
(uint32_t)time_array[2] << 16 |
(uint32_t)time_array[3] << 24;
return t;
}
time_t update_clock3(uint8_t *time_array) {
time_t t;
memcpy(&t, time_array, sizeof t);
return t;
}
#define PUN_CAST4(a, x) ((union {uint8_t src[4]; typeof(x) dst;}){{a[0],a[1],a[2],a[3]}}).dst
time_t update_clock4(uint8_t *time_array) {
time_t t = PUN_CAST4(time_array, t);
return t;
}
gcc 8.1 is good for all four examples: it generates the trivial code with -O2. But gcc 7.3 is bad for the 4th. Clang is also good for all four with -m32 for a 32-bit target, but fails on the 2nd and 4th without it.

Your issue is caused by unaligned access, or writing to the wrong area.
Compiling
#include "stdint.h"
#include "time.h"
time_t myTime;
void update_clock(uint8_t *time_array)
{
myTime = *((time_t *) &time_array[0]); // <-- hangs
/* ... more code ... */
}
with GCC 7.2.1 with the arguments -march=armv7-m -Os generates the following
update_clock(unsigned char*):
ldr r3, .L2
ldrd r0, [r0]
strd r0, [r3]
bx lr
.L2:
.word .LANCHOR0
myTime:
Because your time array is an 8-bit type, there are no alignment requirements, so if the linker has not word-aligned it, then when you dereference it as a time_t * the LDRD instruction is given a non-word-aligned address and causes a UsageFault.
The LDRD and STRD instructions are loading and storing 8 bytes, whereas your array is only 4 bytes long. I suggest you check sizeof(time_t) in your environment, and make an aligned area long enough to store it.

How to write rotate code in C to compile into the `ror` x86 instruction?

I have some code that rotates my data. I know GAS syntax has a single assembly instruction that can rotate an entire byte. However, when I try to follow any of the advice on Best practices for circular shift (rotate) operations in C++, my C code compiles into at least 5 instructions, which use up three registers-- even when compiling with -O3. Maybe those are best practices in C++, and not in C?
In either case, how can I force C to use the ROR x86 instruction to rotate my data?
The precise line of code which is not getting compiled to the rotate instruction is:
value = (((y & mask) << 1 ) | (y >> (size-1))) //rotate y right 1
^ (((z & mask) << n ) | (z >> (size-n))) // rotate z left by n
// size can be 64 or 32, depending on whether we are rotating a long or an int, and
// mask would be 0xff or 0xffffffff, accordingly
I do not mind using __asm__ __volatile__ to do this rotate, if that's what I must do. But I don't know how to do so correctly.

Your macro compiles to a single ror instruction for me... specifically, I compiled this test file:
#define ROR(x,y) ((unsigned)(x) >> (y) | (unsigned)(x) << 32 - (y))
unsigned ror(unsigned x, unsigned y)
{
return ROR(x, y);
}
as C, using gcc 6, with -O2 -S, and this is the assembly I got:
.file "test.c"
.text
.p2align 4,,15
.globl ror
.type ror, @function
ror:
.LFB0:
.cfi_startproc
movl %edi, %eax
movl %esi, %ecx
rorl %cl, %eax
ret
.cfi_endproc
.LFE0:
.size ror, .-ror
.ident "GCC: (Debian 6.4.0-1) 6.4.0 20170704"
.section .note.GNU-stack,"",@progbits
Please try to do the same, and report the assembly you get. If your test program is substantially different from mine, please tell us how it differs. If you are using a different compiler or a different version of GCC please tell us exactly which one.
Incidentally, I get the same assembly output when I compile the code in the accepted answer for "Best practices for circular shift (rotate) operations in C++", as C.

How old is your compiler? As I noted in the linked question, the UB-safe variable-count rotate idiom (with extra & masking of the count) confuses old compilers, like gcc before 4.9. Since you're not masking the shift count, it should be recognized with even older gcc.
Your big expression is maybe confusing the compiler. Write an inline function for rotate, and call it, like
value = rotr32(y & mask, 1) ^ rotr32(z & mask, n);
Much more readable, and may help stop the compiler from trying to do things in the wrong order and breaking the idiom before recognizing it as a rotate.
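The rotr32 helper is not spelled out in this answer; a minimal sketch of what it could look like, using the count-masking form of the idiom mentioned above (so a count of 0 or 32 is well defined), is:
```c
#include <stdint.h>

/* Sketch of a rotate-right helper for 32-bit values. */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;                                    /* keep the count in 0..31 */
    return (x >> n) | (x << ((32 - n) & 31));   /* the second mask avoids a shift by 32 */
}
```
Recent gcc and clang generally recognise this as a rotate, but as noted, older compilers may not, so it is worth checking the assembly.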
Maybe those are best practices in C++, and not in C?
My answer on the linked question clearly says that it's the best practice for C as well as C++. They are different languages, but they overlap completely for this, according to my testing.
Here's a version of the Godbolt link using -xc to compile as C, not C++. I had a couple C++isms in the link in the original question, for experimenting with integer types for the rotate count.
Like the original linked from the best-practices answer, it has a version that uses x86 intrinsics if available. clang doesn't seem to provide any in x86intrin.h, but other compilers have _rotl / _rotr for 32-bit rotates, with other sizes available.
Actually, I talked about rotate intrinsics at length in the answer on the best-practices question, not just in the godbolt link. Did you even read the answer there, apart from the code block? (If you did, your question doesn't reflect it.)
Using intrinsics, or the idiom in your own inline function, is much better than using inline asm. Asm defeats constant-propagation, among other things. Also, compilers can use BMI2 rorx dst, src, imm8 to copy-and-rotate with one instruction, if you compile with -march=haswell or -mbmi2. It's a lot harder to write an inline-asm rotate that can use rorx for immediate-count rotates but ror r32, cl for variable-count rotates. You could try with __builtin_constant_p(), but clang evaluates that before inlining, so it's basically useless for meta-programming style choice of which code to use. It works with gcc though. But it's still much better not to use inline asm unless you've exhausted all other avenues (like asking on SO) to avoid it. https://gcc.gnu.org/wiki/DontUseInlineAsm
Fun fact: the rotate functions in gcc's x86intrin.h are just pure C using the rotate idiom that gcc recognizes. Except for 16-bit rotates, where they use __builtin_ia32_rolhi.

You might need to be a bit more specific with what integral type / width you're rotating, and whether you have a fixed or variable rotation. ror{b,w,l,q} (8, 16, 32, 64-bit) has forms for (1), imm8, or the %cl register. As an example:
static inline uint32_t rotate_right (uint32_t u, size_t r)
{
__asm__ ("rorl %%cl, %0" : "+r" (u) : "c" (r));
return u;
}
I haven't tested this, it's just off the top of my head. And I'm sure multiple constraint syntax could be used to optimize cases where a constant (r) value is used, so %e/rcx is left alone.
If you're using a recent version of gcc or clang (or even icc), the intrinsics header <x86intrin.h> may provide __ror{b|w|d|q} intrinsics. I haven't tried them.

Best Way:
#define rotr32(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define rotr64(x, n) (((x) >> (n)) | ((x) << (64 - (n))))
More generic:
#define rotr(x, n) (((x) >> (n)) | ((x) << ((sizeof(x) << 3) - (n))))
And it compiles (in GCC) with exactly the same code as the asm versions below.
For 64 bit:
__asm__ __volatile__("rorq %b1, %0" : "=g" (u64) : "Jc" (cShift), "0" (u64));
or
static inline uint64_t CC_ROR64(uint64_t word, int i)
{
__asm__("rorq %%cl,%0"
:"=r" (word)
:"0" (word),"c" (i));
return word;
}

Portable way of sending 64-bit variable through POSIX socket

I'm designing custom network protocol and I need to send uint64_t variable (representing file's length in bytes) through socket in portable and POSIX-compliant manner.
Unfortunately manual says that integer types with width 64 are not guaranteed to exist:
If an implementation provides integer types with width 64 that meet these requirements, then the following types are required: int64_t uint64_t
What's more there is no POSIX-compliant equivalent of htonl, htons, ntohl, ntohs (note that bswap_64 is not POSIX-compliant).
What is the best practice to send 64-bit variable through socket?

You can just apply htonl() twice, of course:
const uint64_t x = ...
const uint32_t upper_be = htonl(x >> 32);
const uint32_t lower_be = htonl((uint32_t) x);
This will give you two 32-bit variables containing big-endian versions of the upper and lower 32-bit halves of the 64-bit variable x.
If you are strict POSIX, you can't use uint64_t since it's not guaranteed to exist. Then you can do something like:
typedef struct {
uint32_t upper;
uint32_t lower;
} my_uint64;
And just htonl() those directly, of course.
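A hedged sketch of how the two halves might be packed into and recovered from an 8-byte wire buffer follows. The function names pack_u64_be and unpack_u64_be are hypothetical, and the sketch assumes uint64_t is available and that <arpa/inet.h> provides htonl and ntohl as on POSIX systems.
```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: write x into out[] in network (big-endian) order. */
static void pack_u64_be(unsigned char out[8], uint64_t x)
{
    uint32_t upper_be = htonl((uint32_t)(x >> 32));
    uint32_t lower_be = htonl((uint32_t)x);
    memcpy(out, &upper_be, 4);       /* most significant half goes first */
    memcpy(out + 4, &lower_be, 4);
}

/* Hypothetical name: inverse of pack_u64_be. */
static uint64_t unpack_u64_be(const unsigned char in[8])
{
    uint32_t upper_be, lower_be;
    memcpy(&upper_be, in, 4);
    memcpy(&lower_be, in + 4, 4);
    return ((uint64_t)ntohl(upper_be) << 32) | ntohl(lower_be);
}
```
The buffer can then be passed to send() or filled by recv() as a plain run of 8 bytes.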

My personal favorite is a macro... mine looks similar to this and checks for local byte ordering before deciding how to handle the byte ordering:
// clang-format off
#if !defined(__BIG_ENDIAN__) && !defined(__LITTLE_ENDIAN__)
# if defined(__has_include)
# if __has_include(<endian.h>)
# include <endian.h>
# elif __has_include(<sys/endian.h>)
# include <sys/endian.h>
# endif
# endif
# if !defined(__LITTLE_ENDIAN__) && \
(defined(__BIG_ENDIAN__) || __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
# define __BIG_ENDIAN__
# define bswap64(i) (i) // do nothing
# else
# define __LITTLE_ENDIAN__
# define bswap64(i) ((((i)&0xFFULL) << 56) | (((i)&0xFF00ULL) << 40) | \
(((i)&0xFF0000ULL) << 24) | (((i)&0xFF000000ULL) << 8) | \
(((i)&0xFF00000000ULL) >> 8) | (((i)&0xFF0000000000ULL) >> 24) | \
(((i)&0xFF000000000000ULL) >> 40) | \
(((i)&0xFF00000000000000ULL) >> 56))
# endif
#endif

Assuming a POSIX platform with C99 or greater, {u,}int64_t are not required to exist but {u,}int_{least,fast}64_t are.
Additionally, POSIX requires {u,}int{8,16,32}_t.
So what you can do is:
#include <stdint.h>
//host-to-network (native endian to big endian)
void hton64(unsigned char *B, uint_least64_t X)
{
B[0]=X>>56&0xFF;
B[1]=X>>48&0xFF;
B[2]=X>>40&0xFF;
B[3]=X>>32&0xFF;
B[4]=X>>24&0xFF;
B[5]=X>>16&0xFF;
B[6]=X>>8&0xFF;
B[7]=X>>0&0xFF;
}
//network-to-host (big endian to native endian)
uint_least64_t ntoh64(unsigned char const *B)
{
return (uint_least64_t)B[0]<<56|
(uint_least64_t)B[1]<<48|
(uint_least64_t)B[2]<<40|
(uint_least64_t)B[3]<<32|
(uint_least64_t)B[4]<<24|
(uint_least64_t)B[5]<<16|
(uint_least64_t)B[6]<<8|
(uint_least64_t)B[7]<<0;
}
If the machine has uint64_t, then uint_least64_t will be (due to requirements imposed by the C standard) identical to uint64_t.
If it doesn't, then uint_least64_t might not be 2's-complement or it might have more value bits (I have no idea if there are such architectures), but regardless of that, the above routines will send or receive exactly (if there's more) 64 lower-order bits of it (to or from a buffer).
(Anyway, this solution should be good as a generic backend, but if you want to be slightly more optimal, then you can try to first detect your endianness and do nothing if it's a big endian platform; if it's a little endian and sizeof(uint_least64_t)*CHAR_BIT==64, then if you can detect you have byteswap.h with bswap_64, then you should use that as it's likely to compile down to a single instruction. If all else fails, I'd use something like the above.)

Detecting Endianness

I'm currently trying to create a C source code which properly handles I/O whatever the endianness of the target system.
I've selected "little endian" as my I/O convention, which means that, for big endian CPU, I need to convert data while writing or reading.
Conversion is not the issue. The problem I face is to detect endianness, preferably at compile time (since CPU do not change endianness in the middle of execution...).
Up to now, I've been using this :
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
...
#else
...
#endif
It's documented as a GCC pre-defined macro, and Visual seems to understand it too.
However, I've received report that the check fails for some big_endian systems (PowerPC).
So, I'm looking for a foolproof solution, which ensures that endianness is correctly detected, whatever the compiler and the target system. Well, most of them at least...
[Edit] : Most of the solutions proposed rely on "run-time tests". These tests may sometimes be properly evaluated by compilers during compilation, and therefore cost no real runtime performance.
However, branching with some kind of << if (0) { ... } else { ... } >> is not enough. In the current code implementation, variable and functions declaration depend on big_endian detection. These cannot be changed with an if statement.
Well, obviously, there is fall back plan, which is to rewrite the code...
I would prefer to avoid that, but, well, it looks like a diminishing hope...
[Edit 2] : I have tested "run-time tests", by deeply modifying the code. Although they do their job correctly, these tests also impact performance.
I was expecting that, since the tests have predictable output, the compiler could eliminate bad branches. But unfortunately, it doesn't work all the time. MSVC is good compiler, and is successful in eliminating bad branches, but GCC has mixed results, depending on versions, kind of tests, and with greater impact on 64 bits than on 32 bits.
It's strange. And it also means that the run-time tests cannot be ensured to be dealt with by the compiler.
Edit 3 : These days, I'm using a compile-time constant union, expecting the compiler to solve it to a clear yes/no signal.
And it works pretty well :
https://godbolt.org/g/DAafKo
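The linked snippet is not reproduced here. As a sketch of the general idea only, a constant union can be probed like this; the names endian_probe and probe_is_little_endian are hypothetical, and whether the compiler folds the test into a constant depends on the compiler and optimization level.
```c
#include <stdint.h>

/* Hypothetical probe object: a 32-bit 1 viewed as individual bytes. */
static const union {
    uint32_t      u;
    unsigned char c[4];
} endian_probe = { 1 };

/* Nonzero when the lowest-addressed byte holds the least significant
   byte of u, i.e. on a little-endian host; zero otherwise. */
static inline int probe_is_little_endian(void)
{
    return endian_probe.c[0] == 1;
}
```
Since this is an ordinary expression rather than a preprocessor test, it can select code paths but cannot switch declarations on or off.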

As stated earlier, the only "real" way to detect Big Endian is to use runtime tests.
However, sometimes, a macro might be preferred.
Unfortunately, I've not found a single "test" to detect this situation, rather a collection of them.
For example, GCC recommends: __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__. However, this only works with recent versions; with earlier versions (and other compilers) this test falsely evaluates to "true", because both undefined macros are replaced by 0 in an #if expression, and 0 == 0. So you need the more complete version: defined(__BYTE_ORDER__)&&(__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
OK, now this works for newest GCC, but what about other compilers ?
You may try __BIG_ENDIAN__ or __BIG_ENDIAN or _BIG_ENDIAN which are often defined on big endian compilers.
This will improve detection. But if you specifically target PowerPC platforms, you can add a few more tests to improve even more detection. Try _ARCH_PPC or __PPC__ or __PPC or PPC or __powerpc__ or __powerpc or even powerpc. Bind all these defines together, and you have a pretty fair chance to detect big endian systems, and powerpc in particular, whatever the compiler and its version.
So, to summarize, there is no such thing as a "standard pre-defined macros" which guarantees to detect big-endian CPU on all platforms and compilers, but there are many such pre-defined macros which, collectively, give a high probability of correctly detecting big endian under most circumstances.
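Bound together, such a collection of tests might look like the sketch below. The macro name HYP_BIG_ENDIAN is hypothetical, the list of vendor macros is only a subset of those named above, and the final default to little endian is an assumption that will be wrong on an undetected big-endian target.
```c
/* Prefer the GCC-style macros when they exist, in either direction. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#  define HYP_BIG_ENDIAN 0
#elif defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#  define HYP_BIG_ENDIAN 1
/* Otherwise fall back to macros often defined by big-endian compilers. */
#elif defined(__BIG_ENDIAN__) || defined(_ARCH_PPC) || defined(__PPC__) || \
      defined(__powerpc__)
#  define HYP_BIG_ENDIAN 1
#else
/* Assumption: nothing matched, so guess little endian. */
#  define HYP_BIG_ENDIAN 0
#endif
```
Checking __BYTE_ORDER__ first keeps the PowerPC macros from misclassifying a little-endian PowerPC build on compilers that define it.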

At compile time in C you can't do much more than trusting preprocessor #defines, and there are no standard solutions because the C standard isn't concerned with endianness.
Still, you could add an assertion that is done at runtime at the start of the program to make sure that the assumption done when compiling was true:
static inline int IsBigEndian(void)
{
int i=1;
return ! *((char *)&i);
}
/* ... */
#ifdef COMPILED_FOR_BIG_ENDIAN
assert(IsBigEndian());
#elif defined(COMPILED_FOR_LITTLE_ENDIAN)
assert(!IsBigEndian());
#else
#error "No endianness macro defined"
#endif
(where COMPILED_FOR_BIG_ENDIAN and COMPILED_FOR_LITTLE_ENDIAN are macros #defined previously according to your preprocessor endianness checks)

Instead of looking for a compile-time check, why not just use big-endian order (which is considered the "network order" by many) and use the htons/htonl/ntohs/ntohl functions provided by most UNIX-systems and Windows. They're already defined to do the job you're trying to do. Why reinvent the wheel?

Try something like:
if(*(char *)(int[]){1}) {
/* little endian code */
} else {
/* big endian code */
}
and see if your compiler resolves it at compile-time. If not, you might have better luck doing the same with a union. Actually I like defining macros using unions that evaluate to 0,1 or 1,0 (respectively) so that I can just do things like accessing buf[HI] and buf[LO].

Notwithstanding compiler-defined macros, I don't think there's a compile-time way to detect this, since determining the endianness of an architecture involves analyzing the manner in which it stores data in memory.
Here's a function which does just that:
bool IsLittleEndian () {
int i=1;
return (int)*((unsigned char *)&i)==1;
}

As others have pointed out, there isn't a portable way to check for endianness at compile-time. However, one option would be to use the autoconf tool as part of your build script to detect whether the system is big-endian or little-endian, using the AC_C_BIGENDIAN macro, which defines WORDS_BIGENDIAN on big-endian targets. In a sense, this builds a program that detects at runtime whether the system is big-endian or little-endian, then has that program output information that can then be used statically by the main source code.
Hope this helps!

This comes from p. 45 of Pointers in C:
#include <stdio.h>
#define BIG_ENDIAN 0
#define LITTLE_ENDIAN 1
int endian()
{
short int word = 0x0001;
char *byte = (char *) &word;
return (byte[0] ? LITTLE_ENDIAN : BIG_ENDIAN);
}
int main(int argc, char* argv[])
{
int value;
value = endian();
if (value == 1)
printf("The machine is Little Endian\n");
else
printf("The machine is Big Endian\n");
return 0;
}

The sockets API's ntohl function can be used for this purpose.
// Soner
#include <stdio.h>
#include <arpa/inet.h>
int main() {
if (ntohl(0x12345678) == 0x12345678) {
printf("big-endian\n");
} else if (ntohl(0x12345678) == 0x78563412) {
printf("little-endian\n");
} else {
printf("(stupid)-middle-endian\n");
}
return 0;
}

My GCC version is 9.3.0, it's configured to support powerpc64 platform, and I've tested it and verified that it supports the following macros logic:
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
......
#endif
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
.....
#endif

As of C++20, no more hacks or compiler extensions are necessary.
https://en.cppreference.com/w/cpp/types/endian
std::endian (Defined in header <bit>)
enum class endian
{
little = /*implementation-defined*/,
big = /*implementation-defined*/,
native = /*implementation-defined*/
};
If all scalar types are little-endian, std::endian::native equals std::endian::little
If all scalar types are big-endian, std::endian::native equals std::endian::big

You can't detect it at compile time to be portable across all compilers. Maybe you can change the code to do it at run-time - this is achievable.

It is not possible to detect endianness portably in C with preprocessor directives.

I took the liberty of reformatting the quoted text
As of 2017-07-18, I use union { unsigned u; unsigned char c[4]; }
If sizeof (unsigned) != 4 your test may fail.
It may be better to use
union { unsigned u; unsigned char c[sizeof (unsigned)]; }

As most have mentioned, compile time is your best bet. Assuming you do not do cross compilations and you use cmake (it will also work with other tools such as a configure script, of course) then you can use a pre-test which is a compiled .c or .cpp file and that gives you the actual verified endianness of the processor you're running on.
With cmake you use the TestBigEndian module. It sets a variable which you can then pass to your software. Something like this (untested):
include(TestBigEndian)
test_big_endian(IS_BIG_ENDIAN)
...
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DIS_BIG_ENDIAN=${IS_BIG_ENDIAN}") # C
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DIS_BIG_ENDIAN=${IS_BIG_ENDIAN}") # C++
Then in your C/C++ code you can check that IS_BIG_ENDIAN define:
#if IS_BIG_ENDIAN
...do big endian stuff here...
#else
...do little endian stuff here...
#endif
So the main problem with such a test is cross compiling since you may be on a completely different CPU with a different endianness... but at least it gives you the endianness at time of compiling the rest of your code and will work for most projects.

I provided a general approach in C that uses no preprocessor checks, only a runtime test that computes the endianness of every C integer type.
The output of this on my Linux x86_64 machine is:
fabrizio@toshibaSeb:~/git/pegaso/scripts$ gcc -o sizeof_endianess sizeof_endianess.c
fabrizio@toshibaSeb:~/git/pegaso/scripts$ ./sizeof_endianess
INTEGER TYPE | signed | unsigned | 0x010203... | Endianess
--------------+---------+------------+-------------------------+--------------
int | 4 | 4 | 04 03 02 01 | little
char | 1 | 1 | - | -
short | 2 | 2 | 02 01 | little
long int | 8 | 8 | 08 07 06 05 04 03 02 01 | little
long long int | 8 | 8 | 08 07 06 05 04 03 02 01 | little
--------------+---------+------------+-------------------------+--------------
FLOATING POINT| size |
--------------+---------+
float | 4
double | 8
long double | 16
Get source at: https://github.com/bzimage-it/pegaso/blob/master/scripts/sizeof_endianess.c
This more general approach neither detects endianness at compilation time (not possible) nor assumes that one endianness excludes another. In fact, it is important to remark that endianness is not a property of the architecture/processor but of each single type. As argued by @Christoph at https://stackoverflow.com/a/4712594/3280080, the PDP-11, for example, can have different endiannesses at the same time.
The approach consists of setting an integer to x = 0x010203..., as long as the type allows, then printing its bytes one at a time through a byte pointer, incrementing the address by one.
Can somebody please test it on a big-endian and/or mixed-endian machine?

I know I'm late to this party, but here is my take.
int is_big_endian() {
return 1 & *(uint16_t*)"01";
}
This is based on the fact that '0' is 48 in decimal and '1' is 49, so '1' has its least significant bit set, while '0' does not. I could make them '\x00' and '\x01' but I think my version makes it more readable.

#define BIG_ENDIAN ((1 >> 1 == 0) ? 0 : 1)
