Faster way to zero memory than with memset? - c

I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?
I assume that memset uses mov; however, when zeroing memory most compilers use xor, as it's faster. Correct? edit1: Wrong, as GregS pointed out: that only works with registers. What was I thinking?
Also, I asked a person who knows more assembler than I do to look at the stdlib, and he told me that on x86, memset is not taking full advantage of the 32-bit-wide registers. However, I was very tired at the time, so I'm not quite sure I understood him correctly.
edit2:
I revisited this issue and did a little testing. Here is what I tested:
#include <stdio.h>
#include <malloc.h>
#include <string.h>
#include <sys/time.h>
#define TIME(body) do { \
    struct timeval t1, t2; double elapsed; \
    gettimeofday(&t1, NULL); \
    body \
    gettimeofday(&t2, NULL); \
    elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
    printf("%s\n --- %f ---\n", #body, elapsed); } while(0)
#define SIZE 0x1000000
void zero_1(void* buff, size_t size)
{
    size_t i;
    char* foo = buff;
    for (i = 0; i < size; i++)
        foo[i] = 0;
}
/* I foolishly assume size_t has register width */
void zero_sizet(void* buff, size_t size)
{
    size_t i;
    char* bar;
    size_t* foo = buff;
    for (i = 0; i < size / sizeof(size_t); i++)
        foo[i] = 0;
    // fixes bug pointed out by tristopia
    bar = (char*)buff + size - size % sizeof(size_t);
    for (i = 0; i < size % sizeof(size_t); i++)
        bar[i] = 0;
}
int main()
{
    char* buffer = malloc(SIZE);
    TIME(
        memset(buffer, 0, SIZE);
    );
    TIME(
        zero_1(buffer, SIZE);
    );
    TIME(
        zero_sizet(buffer, SIZE);
    );
    return 0;
}
results:
zero_1 is the slowest, except at -O3. zero_sizet is the fastest, with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow at -O3). One thing of interest is that at -O3, zero_1 was as fast as zero_sizet, yet the disassembled function had roughly four times as many instructions (I think caused by loop unrolling). Also, I tried optimizing zero_sizet further, but the compiler always outdid me; no surprise there.
For now memset wins; the previous results were distorted by the CPU cache. (All tests were run on Linux.) Further testing is needed. I'll try assembler next :)
edit3: fixed bug in test code, test results are not affected
edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has an SSE-optimized routine for zeroing. This will be hard to beat.

x86 is a rather broad range of devices.
For a totally generic x86 target, an assembly block with "rep stosd" could blast out zeros to memory 32 bits at a time. Try to make sure the bulk of this work is DWORD-aligned.
For chips with MMX, an assembly loop with movq could hit 64 bits at a time.
You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or __m64. The target must be 8-byte aligned for the best performance.
For chips with SSE, movaps is fast, but only if the address is 16-byte aligned, so use byte stores until aligned, and then complete your clear with a loop of movaps.
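As a rough sketch of the "rep stos" idea with GCC-style extended asm (assuming an x86 target and a size that is a multiple of 4; "stosl" is the AT&T spelling of stosd):

#include <stddef.h>

/* Sketch: "rep stosl" stores EAX to [EDI], ECX times, advancing EDI.
   Both registers are clobbered, hence the "+" output constraints. */
static void zero_rep_stos(void *buf, size_t size)
{
    size_t dwords = size / 4;
    __asm__ volatile ("rep stosl"
                      : "+D" (buf), "+c" (dwords)
                      : "a" (0)
                      : "memory");
}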
Win32 has "ZeroMemory()", but I forget whether that's a macro for memset or an actual 'good' implementation.

memset is generally designed to be very very fast general-purpose setting/zeroing code. It handles all cases with different sizes and alignments, which affect the kinds of instructions you can use to do your work. Depending on what system you're on (and what vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture to take advantage of whatever its native properties are. It might also have internal special cases to handle the case of zeroing (versus setting some other value).
That said, if you have very specific, very performance critical memory zeroing to do, it's certainly possible that you could beat a specific memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming. :)

Nowadays your compiler should do all the work for you. As far as I know, gcc is very efficient at optimizing calls to memset away (better check the assembler, though).
Then also, avoid memset if you don't have to:
use calloc for heap memory
use proper initialization (... = { 0 }) for stack memory
And for really large chunks use mmap if you have it. This just gets zero-initialized memory from the system "for free".
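For instance (a sketch; the mmap flags shown assume Linux/POSIX):

#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    /* heap memory that arrives already zeroed */
    int *heap = calloc(1024, sizeof *heap);

    /* stack memory zeroed by an initializer */
    int stack[16] = { 0 };

    /* a really large chunk of zero pages straight from the system */
    size_t big = 64 * 1024 * 1024;
    void *huge = mmap(NULL, big, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (huge != MAP_FAILED)
        munmap(huge, big);
    free(heap);
    return stack[0];
}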

If I remember correctly (from a couple of years ago), one of the senior developers was talking about a fast way to bzero() on PowerPC (specs said we needed to zero almost all the memory on power up). It might not translate well (if at all) to x86, but it could be worth exploring.
The idea was to load a data cache line, clear that data cache line, and then write the cleared data cache line back to memory.
For what it is worth, I hope it helps.
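If that recollection is right, the instruction involved is probably PowerPC's dcbz ("data cache block zero"), which zeroes an entire cache block in place without reading it from memory first. A sketch, assuming GCC inline asm on PowerPC and a pointer aligned to the cache-block size:

/* Sketch: zero one data-cache block (PowerPC only).
   p is assumed to be cache-block aligned. */
static inline void cache_block_zero(void *p)
{
    __asm__ volatile ("dcbz 0,%0" : : "r" (p) : "memory");
}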

Unless you have specific needs or know that your compiler/stdlib is sucky, stick with memset. It's general-purpose, and should have decent performance in general. Also, compilers might have an easier time optimizing/inlining memset() because it can have intrinsic support for it.
For instance, Visual C++ will often generate inline versions of memcpy/memset that are as small as a call to the library function, thus avoiding push/call/ret overhead. And there are further possible optimizations when the size parameter can be evaluated at compile time.
That said, if you have specific needs (where size will always be tiny *or* huge), you can gain speed boosts by dropping down to assembly level. For instance, using non-temporal (streaming) stores for zeroing huge chunks of memory without polluting your L2 cache.
But it all depends - and for normal stuff, please stick to memset/memcpy :)
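To illustrate the non-temporal idea above, a sketch using SSE2 streaming stores (assumes buf is 16-byte aligned and size is a multiple of 16):

#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Sketch: streaming stores bypass the cache, so zeroing a huge
   buffer doesn't evict your working set. */
static void zero_stream(void *buf, size_t size)
{
    __m128i zero = _mm_setzero_si128();
    char *p = buf;
    for (size_t i = 0; i < size; i += 16)
        _mm_stream_si128((__m128i *)(p + i), zero);
    _mm_sfence();  /* order the streaming stores before later reads/writes */
}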

The memset function is designed to be flexible and simple, even at the expense of speed. In many implementations, it is a simple while loop that copies the specified value one byte at a time over the given number of bytes. If you want a faster memset (or memcpy, memmove, etc.), it is almost always possible to code one up yourself.
The simplest customization would be to do single-byte "set" operations until the destination address is 32- or 64-bit aligned (whatever matches your chip's architecture) and then start copying a full CPU register at a time. You may have to do a couple of single-byte "set" operations at the end if your range doesn't end on an aligned address.
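A sketch of that shape for the zeroing case, using register-width size_t stores:

#include <stdint.h>
#include <stddef.h>

/* Sketch: byte stores until aligned, whole words through the middle,
   byte stores for whatever is left at the end. */
static void zero_aligned(void *buf, size_t size)
{
    unsigned char *p = buf;
    while (size && ((uintptr_t)p % sizeof(size_t)) != 0) {  /* head */
        *p++ = 0;
        size--;
    }
    size_t *w = (size_t *)p;
    while (size >= sizeof(size_t)) {                        /* body */
        *w++ = 0;
        size -= sizeof(size_t);
    }
    p = (unsigned char *)w;
    while (size--)                                          /* tail */
        *p++ = 0;
}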
Depending on your particular CPU, you might also have some streaming SIMD instructions that can help you out. These will typically work better on aligned addresses, so the above technique for using aligned addresses can be useful here as well.
For zeroing out large sections of memory, you may also see a speed boost from splitting the range into sections and processing each section in parallel (where the number of sections matches your number of cores/hardware threads); a sketch follows.
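A minimal sketch of that split, assuming C11 <threads.h> is available and nthreads is at most 16:

#include <threads.h>
#include <string.h>
#include <stddef.h>

struct zero_job { void *ptr; size_t len; };

static int zero_worker(void *arg)
{
    struct zero_job *job = arg;
    memset(job->ptr, 0, job->len);  /* each worker clears its own slice */
    return 0;
}

static void zero_parallel(void *buf, size_t size, int nthreads)
{
    thrd_t tid[16];
    struct zero_job job[16];
    size_t chunk = size / nthreads;
    for (int i = 0; i < nthreads; i++) {
        job[i].ptr = (char *)buf + (size_t)i * chunk;
        job[i].len = (i == nthreads - 1) ? size - (size_t)i * chunk : chunk;
        thrd_create(&tid[i], zero_worker, &job[i]);
    }
    for (int i = 0; i < nthreads; i++)
        thrd_join(tid[i], NULL);
}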
Most importantly, there's no way to tell if any of this will help unless you try it. At a minimum, take a look at what your compiler emits for each case. See what other compilers emit for their standard 'memset' as well (their implementation might be more efficient than your compiler's).

There is one fatal flaw in this otherwise great and helpful test: since memset is timed first, it also pays the one-time cost of faulting in the freshly malloc'd pages (the kernel maps them lazily, so the first write to each page is slow).
Moving the timing of memset to second place and something else to first place, or simply timing memset twice, makes memset the fastest with all compile switches!
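One way to take that first-touch cost out of the measurement is an untimed warm-up pass before the timed runs, reusing the question's own macro and buffer (a sketch):

memset(buffer, 0, SIZE);  /* untimed warm-up: faults every page in once */
TIME(
    memset(buffer, 0, SIZE);
);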

That's an interesting question. I made this implementation that is just slightly faster (but hardly measurably so) when compiling a 32-bit release build on VC++ 2012. It can probably be improved a lot. Adding this in your own class in a multithreaded environment would probably give you even more performance gains, since there are some reported bottleneck problems with memset() in multithreaded scenarios.
// MemsetSpeedTest.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include "Windows.h"
#include <time.h>
#pragma comment(lib, "Winmm.lib")
using namespace std;
/** a signed 64-bit integer value type */
#define _INT64 __int64
/** a signed 32-bit integer value type */
#define _INT32 __int32
/** a signed 16-bit integer value type */
#define _INT16 __int16
/** a signed 8-bit integer value type */
#define _INT8 __int8
/** an unsigned 64-bit integer value type */
#define _UINT64 unsigned _INT64
/** an unsigned 32-bit integer value type */
#define _UINT32 unsigned _INT32
/** an unsigned 16-bit integer value type */
#define _UINT16 unsigned _INT16
/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
/** maximum allowed value in an unsigned 64-bit integer value type */
#define _UINT64_MAX 18446744073709551615ULL
#ifdef _WIN32
/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);
/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock */
#define TIMER_INIT clock_t start;double diff;
/** Use to start the performance timer */
#define TIMER_START start=clock();
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif
void *MemSet(void *dest, _UINT8 c, size_t count)
{
    size_t blockIdx;
    size_t blocks = count >> 3;
    size_t bytesLeft = count - (blocks << 3);
    _UINT64 cUll =
        c
        | (((_UINT64)c) << 8 )
        | (((_UINT64)c) << 16 )
        | (((_UINT64)c) << 24 )
        | (((_UINT64)c) << 32 )
        | (((_UINT64)c) << 40 )
        | (((_UINT64)c) << 48 )
        | (((_UINT64)c) << 56 );

    _UINT64 *destPtr8 = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = cUll;
    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);
    _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = (_UINT32)cUll;
    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);
    _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = (_UINT16)cUll;
    if (!bytesLeft) return dest;

    _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = (_UINT8)cUll;
    return dest;
}
int _tmain(int argc, _TCHAR* argv[])
{
    TIMER_INIT
    const size_t n = 10000000;
    const _UINT64 m = 1;
    const _UINT64 o = 1;
    static char test[n];  // static storage: an array this size would overflow the default stack
    {
        cout << "memset() took:" << endl;
        TIMER_START;
        for (_UINT64 i = 0; i < m; i++)
            for (_UINT64 j = 0; j < o; j++)
                memset((void*)test, 0, n);
        TIMER_STOP;
    }
    {
        cout << "MemSet() took:" << endl;
        TIMER_START;
        for (_UINT64 i = 0; i < m; i++)
            for (_UINT64 j = 0; j < o; j++)
                MemSet((void*)test, 0, n);
        TIMER_STOP;
    }
    cout << "Done" << endl;
    int wait;
    cin >> wait;
    return 0;
}
Output is as follows when release compiling for 32-bit systems:
memset() took:
5.569000
MemSet() took:
5.544000
Done
Output is as follows when release compiling for 64-bit systems:
memset() took:
2.781000
MemSet() took:
2.765000
Done
Here you can find the source code of Berkeley's memset(), which I think is the most common implementation.

memset could be inlined by the compiler as a series of efficient opcodes, unrolled for a few iterations. For very large memory blocks, like a 4000x2000 64-bit framebuffer, you can try optimizing it across several threads (which you prepare for that sole task), each setting its own part. Note that there is also bzero(), but it is more obscure and less likely to be as optimized as memset, and the compiler will surely notice you pass 0.
What the compiler usually assumes is that you memset large blocks, so for smaller blocks it would likely be more efficient to just do *(uint64_t*)p = 0 if you init a large number of small objects; a sketch follows.
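For example (a sketch; struct small is a made-up illustration, and with a compile-time-constant size a good compiler inlines either form as two plain 64-bit stores):

#include <stdint.h>
#include <string.h>

struct small { uint64_t a, b; };

/* Sketch: zeroing one small, fixed-size object; no call overhead,
   just inlined stores. */
static void clear_small(struct small *s)
{
    memset(s, 0, sizeof *s);
}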
Generally, all x86 CPUs are different (unless you compile for some standardized platform), and something you optimize for a Pentium 2 will behave differently on a Core Duo or an i486. So if you're really into it and want to squeeze out the last few bits of toothpaste, it makes sense to ship several versions of your exe, compiled and optimized for different popular CPU models. From personal experience, Clang with -march=native boosted my game's FPS from 60 to 65, compared to no -march.

Related

Vectorize random init and print for BigInt with decimal digit array, with AVX2?

How could I convert my code to AVX2 and get the same result as before?
Is it possible to use __m256i in the LongNumInit and LongNumPrint functions instead of uint8_t *L, or some similar type of variable?
My knowledge of AVX is quite limited; I investigated quite a bit, but I do not understand very well how to transform my code. Any suggestion or explanation is welcome.
I'm really interested in this code in AVX2.
void LongNumInit(uint8_t *L, size_t N)
{
    for (size_t i = 0; i < N; ++i) {
        L[i] = myRandom() % 10;
    }
}

void LongNumPrint(uint8_t *L, size_t N, uint8_t *Name)
{
    printf("%s:", Name);
    for (size_t i = N; i > 0; --i) {
        printf("%d", L[i-1]);
    }
    printf("\n");
}
int main (int argc, char **argv)
{
    int i, sum1, sum2, sum3, N = 10000, Rep = 50;
    seed = 12345;
    // obtain parameters at run time
    if (argc > 1) { N = atoi(argv[1]); }
    if (argc > 2) { Rep = atoi(argv[2]); }
    // Create Long Nums
    unsigned char *V1 = (unsigned char*) malloc(N);
    unsigned char *V2 = (unsigned char*) malloc(N);
    unsigned char *V3 = (unsigned char*) malloc(N);
    unsigned char *V4 = (unsigned char*) malloc(N);
    LongNumInit(V1, N); LongNumInit(V2, N); LongNumInit(V3, N);
    // Print last 32 digits of Long Numbers
    LongNumPrint(V1, 32, "V1");
    LongNumPrint(V2, 32, "V2");
    LongNumPrint(V3, 32, "V3");
    LongNumPrint(V4, 32, "V4");
    free(V1); free(V2); free(V3); free(V4);
    return 0;
}
The result that I obtain in my initial code is this:
V1:59348245908804493219098067811457
V2:24890422397351614779297691741341
V3:63392771324953818089038280656869
V4:00000000000000000000000000000000
This is a terrible format for BigInteger in general, see https://codereview.stackexchange.com/a/237764 for a code review of the design flaws in using one decimal digit per byte for BigInteger, and what you could/should do instead.
And see Can long integer routines benefit from SSE? for @Mysticial's notes on ways to store your data that make SIMD for BigInteger math practical, specifically partial-word arithmetic where your temporaries might not be "normalized", letting you do lazy carry handling.
But apparently you're just asking about this code, the random-init and print functions, not how to do math between two numbers in this format.
We can vectorize both of these quite well. My LongNumPrintName() is a drop-in replacement for yours.
For LongNumInit I'm just showing a building-block that stores two 32-byte chunks and returns an incremented pointer. Call it in a loop. (It naturally produces 2 vectors per call so for small N you might make an alternate version.)
LongNumInit
What's the fastest way to generate a 1 GB text file containing random digits? generates space-separated random ASCII decimal digits at about 33 GB/s on 4GHz Skylake, including overhead of write() system calls to /dev/null. (This is higher than DRAM bandwidth; cache blocking for 128kiB lets the stores hit in L2 cache. The kernel driver for /dev/null doesn't even read the user-space buffer.)
It could easily be adapted into an AVX2 version of void LongNumInit(uint8_t *L, size_t N ). My answer there uses an AVX2 xorshift128+ PRNG (vectorized with 4 independent PRNGs in the 64-bit elements of a __m256i) like AVX/SSE version of xorshift128+. That should be similar quality of randomness to your rand() % 10.
It breaks that up into decimal digits via a multiplicative inverse to divide and modulo by 10 with shifts and vpmulhuw, using Why does GCC use multiplication by a strange number in implementing integer division?. (Actually using GNU C native vector syntax to let GCC determine the magic constant and emit the multiplies and shifts for convenient syntax like v16u dig1 = v % ten; and v /= ten;)
You can use _mm256_packus_epi16 to pack two vectors of 16-bit digits into 8-bit elements instead of turning the odd elements into ASCII ' ' and the even elements into ASCII '0'..'9'. (So change vec_store_digit_and_space to pack pairs of vectors instead of ORing with a constant, see below)
Compile this with gcc, clang, or ICC (or hopefully any other compiler that understands the GNU C dialect of C99, and Intel's intrinsics).
See https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html for the __attribute__((vector_size(32))) part, and https://software.intel.com/sites/landingpage/IntrinsicsGuide/ for the _mm256_* stuff. Also https://stackoverflow.com/tags/sse/info.
#include <immintrin.h>

// GNU C native vectors let us get the compiler to do stuff like %10 on each element
typedef unsigned short v16u __attribute__((vector_size(32)));

// returns p + size of stores.  Caller should use outpos = f(vec, outpos)
// p must be aligned
__m256i* vec_store_digits(__m256i vec, __m256i *restrict p)
{
    v16u v = (v16u)vec;
    v16u ten = (v16u)_mm256_set1_epi16(10);
    v16u divisor = (v16u)_mm256_set1_epi16(6554);  // ceil((2^16-1) / 10.0)
    v16u div6554 = v / divisor;  // Basically the entropy from the upper two decimal digits: 0..65.
    // Probably some correlation with the modulo-based values, especially dig3, but we do this instead of
    // dig4 for more ILP and fewer instructions total.
    v16u dig1 = v % ten;
    v /= ten;
    v16u dig2 = v % ten;
    v /= ten;
    v16u dig3 = v % ten;
    // dig4 would overlap much of the randomness that div6554 gets

    // __m256i or v16u assignment is an aligned store
    __m256i *vecbuf = p;
    // pack 16->8 bits
    vecbuf[0] = _mm256_packus_epi16((__m256i)div6554, (__m256i)dig1);
    vecbuf[1] = _mm256_packus_epi16((__m256i)dig2, (__m256i)dig3);
    return p + 2;  // always a constant number of full vectors
}
The logic in random_decimal_fill_buffer that inserts newlines can be totally removed because you just want a flat array of decimal digits. Just call the above function in a loop until you've filled your buffer.
Handling small sizes (less than a full vector):
It would be convenient to pad your malloc up to the next multiple of 32 bytes so it's always safe to do a 32-byte load without checking for maybe crossing into an unmapped page.
And use C11 aligned_alloc to get 32-byte aligned storage. So for example, aligned_alloc(32, (size+31) & -32). This lets us just do full 32-byte stores even if N is odd. Logically only the first N bytes of the buffer hold our real data, but it's convenient to have padding we can scribble over to avoid any extra conditional checks for N being less than 32, or not a multiple of 32.
Unfortunately ISO C and glibc are missing aligned_realloc and aligned_calloc. MSVC does actually provide those: Why is there no 'aligned_realloc' on most platforms? allowing you to sometimes allocate more space at the end of an aligned buffer without copying it. A "try_realloc" would be ideal for C++ that might need to run copy-constructors if non-trivially copyable objects change address. Non-expressive allocator APIs that force sometimes-unnecessary copying are a pet peeve of mine.
LongNumPrint
Taking a uint8_t *Name arg is bad design. If the caller wants to printf a "something:" string first, they can do that. Your function should just do what printf "%d" does for an int.
Since you're storing your digits in reverse printing order, you'll want to byte-reverse into a tmp buffer and convert 0..9 byte values to '0'..'9' ASCII character values by ORing with '0'. Then pass that buffer to fwrite.
Specifically, use alignas(32) char tmpbuf[8192]; as a local variable.
You can work in fixed-size chunks (like 1kiB or 8kiB) instead of allocating a potentially huge buffer. You probably still want to go through stdio (instead of calling write() directly and managing your own I/O buffering). With an 8kiB buffer, an efficient fwrite might just pass that on to write() directly instead of memcpying into the stdio buffer. You might want to play around with tuning this, but keeping the tmp buffer comfortably smaller than half of L1d cache will mean it's still hot in cache when it's re-read after you wrote it.
Cache blocking makes the loop bounds a lot more complex but it's worth it for very large N.
Byte-reversing 32 bytes at a time:
You could avoid this work by deciding that your digits are stored in MSD-first order, but then if you did want to implement addition it would have to loop from the end backwards.
Then your function could be implemented with SIMD _mm_shuffle_epi8 to reverse 16-byte chunks, starting from the end of your digit array and writing to the beginning of your tmp buffer.
Or better, use vmovdqu / vinserti128 16-byte loads to feed _mm256_shuffle_epi8, byte-reversing within lanes and setting up for 32-byte stores.
On Intel CPUs, vinserti128 decodes to a load+ALU uop, but it can run on any vector ALU port, not just the shuffle port. So two 128-bit loads are more efficient than 256-bit load -> vpshufb -> vpermq, which would probably bottleneck on shuffle-port throughput if data was hot in cache. Intel CPUs can do up to 2 loads + 1 store per clock cycle (or in Ice Lake, 2 loads + 2 stores). We'll probably bottleneck on the front-end if there are no memory bottlenecks, so in practice we won't saturate the load+store and shuffle ports. (https://agner.org/optimize/ and https://uops.info/)
This function is also simplified by the assumption that we can always read 32 bytes from L without crossing into an unmapped page. But after a 32-byte reverse for small N, the first N bytes of the input become the last N bytes in a 32-byte chunk. It would be most convenient if we could always safely do a 32-byte load ending at the end of a buffer, but it's unreasonable to expect padding before the object.
#include <immintrin.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>

// one vector of 32 bytes of digits, reversed and converted to ASCII
static inline
void ASCIIrev32B(void *dst, const void *src)
{
    __m128i hi = _mm_loadu_si128(1 + (const __m128i*)src);  // unaligned loads
    __m128i lo = _mm_loadu_si128(src);
    __m256i v = _mm256_set_m128i(lo, hi);  // reverse 128-bit hi/lo halves

    // compilers will hoist constants out of inline functions
    __m128i byterev_lane = _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m256i byterev = _mm256_broadcastsi128_si256(byterev_lane);  // same in each lane
    v = _mm256_shuffle_epi8(v, byterev);  // in-lane reverse

    v = _mm256_or_si256(v, _mm256_set1_epi8('0'));  // digits to ASCII
    _mm256_storeu_si256(dst, v);  // Will usually be aligned in practice.
}
// Tested for N=32; could be bugs in the loop bounds for other N
// returns bytes written, like fwrite: N means no error, 0 means error in all fwrites
size_t LongNumPrint( uint8_t *num, size_t N)
{
    // caller can print a name if it wants
    const int revbufsize = 8192;  // 8kiB on the stack should be fine
    alignas(32) char revbuf[revbufsize];

    if (N < 32) {
        // TODO: maybe use a smaller revbuf for this case to avoid touching new stack pages
        ASCIIrev32B(revbuf, num);  // the data we want is at the *end* of a 32-byte reverse
        return fwrite(revbuf+32-N, 1, N, stdout);
    }

    size_t bytes_written = 0;
    const uint8_t *inp = num+N;  // start with last 32 bytes of num[]
    do {
        size_t chunksize = (inp - num >= revbufsize) ? revbufsize : inp - num;
        const uint8_t *inp_stop = inp - chunksize + 32;  // leave one full vector for the end
        uint8_t *outp = revbuf;
        while (inp > inp_stop) {  // may run 0 times
            inp -= 32;
            ASCIIrev32B(outp, inp);
            outp += 32;
        }
        // reverse first (lowest address) 32 bytes of this chunk of num
        // into last 32 bytes of this chunk of revbuf
        // if chunksize%32 != 0 this will overlap, which is fine.
        ASCIIrev32B(revbuf + chunksize - 32, inp_stop - 32);
        bytes_written += fwrite(revbuf, 1, chunksize, stdout);
        inp = inp_stop - 32;
    } while ( inp > num );

    return bytes_written;
    // caller can putchar('\n') if it wants
}
// wrapper that prints name and newline
void LongNumPrintName(uint8_t *num, size_t N, const char *name)
{
    printf("%s:", name);
    //LongNumPrint_scalar(num, N);
    LongNumPrint(num, N);
    putchar('\n');
}
// main() included on Godbolt link that runs successfully
This compiles and runs (on Godbolt) with gcc -O3 -march=haswell and produces identical output to your scalar loop for the N=32 that main passes. (I used rand() instead of MyRandom(), so we could test with the same seed and get the same numbers, using your init function.)
Untested for larger N, but the general idea of chunksize = min(ptrdiff, 8k) and using that to loop downwards from the end of num[] should be solid.
We could load (not just store) aligned vectors if we converted the first N%32 bytes and passed that to fwrite before starting the main loop. But that probably either leads to an extra write() system call, or to clunky copying inside stdio. (Unless there was already buffered text not printed yet, like Name:, in which case we already have that penalty.)
Note that it's technically C UB to decrement inp past the start of num. So inp -= 32 instead of inp = inp_stop-32 would have that UB for the iteration that leaves the outer loop. I actually avoid that in this version, but it generally works anyway, because I think GCC assumes a flat memory model and de facto defines the behaviour of pointer compares enough. And normal OSes reserve the zero page, so num definitely can't be within 32 bytes of the start of physical memory (so inp can't wrap to a high address). This paragraph is mostly left over from the first, totally untested attempt, which I thought was decrementing the pointer farther in the inner loop than it actually was.

How to change double's endianness?

I need to read data from a serial port. It arrives in little-endian, but I need to do this platform-independently, so I have to handle the endianness of double. I couldn't find anywhere how to do it, so I wrote my own function. But I am not sure about it (and I don't have a big-endian machine to try it on).
Will this work correctly? or is there some better approach I wasn't able to find?
double get_double(uint8_t * buff){
    double value;
    memcpy(&value, buff, sizeof(double));
    uint64_t tmp;
    tmp = le64toh(*(uint64_t*)&value);
    value = *(double*) &tmp;
    return value;
}
p.s. I'm counting on double being 8 bytes long, so please don't bother with that. I know there might be problems with it.
EDIT: After suggestion, that I should use union, I did this:
union double_int{
    double d;
    uint64_t i;
};

double get_double(uint8_t * buff){
    union double_int value;
    memcpy(&value, buff, sizeof(double));
    value.i = le64toh(value.i);
    return value.d;
}
better? (though I don't see much of a difference)
EDIT2: attempt #3, what do you think now?
double get_double(uint8_t * buff){
    double value;
    uint64_t tmp;
    memcpy(&tmp, buff, sizeof(double));
    tmp = le64toh(tmp);
    memcpy(&value, &tmp, sizeof(double));
    return value;
}
Edit3: I compile it with gcc -std=gnu99 -lpthread -Wall -pedantic
Edit4: After the next suggestion I added a condition checking the endianness order. I honestly have no idea what I am doing right now (shouldn't there be something like __DOUBLE_WORD_ORDER__?)
double get_double(uint8_t * buff){
    double value;
    if (__FLOAT_WORD_ORDER__ == __ORDER_BIG_ENDIAN__){
        uint64_t tmp;
        memcpy(&tmp, buff, sizeof(double));
        tmp = le64toh(tmp);
        memcpy(&value, &tmp, sizeof(double));
    }
    else {
        memcpy(&value, buff, sizeof(double));
    }
    return value;
}
I'd just go and copy the bytes manually to a temporary double and then return that. In C (and I think C++) it is fine to cast any pointer to char *; that's one of the explicit permissions for aliasing (in the standard draft n1570, par. 6.5/7), as I mentioned in one of my comments. The exception is absolutely necessary in order to get anything done that is close to hardware; including reversing bytes received over a network :-).
There is no standard compile time way to determine whether that's necessary which is a pity; if you want to avoid branches which is probably a good idea if you deal with lots of data, you should look up your compiler's documentation for proprietary defines so that you can choose the proper code branch at compile time. gcc, for example, has __FLOAT_WORD_ORDER__ set to either __ORDER_LITTLE_ENDIAN__ or __ORDER_BIG_ENDIAN__.
(Because of your question in comments: __FLOAT_WORD_ORDER__ means floating points in general. It would be a very sick mind who designs a FPU that has different byte orders for different data sizes :-). In all reality there aren't many mainstream architectures which have different byte orders for floating point vs. integer types. As Wikipedia says, small systems may differ.)
Basile pointed to ntohd, a conversion function which exists on Windows but apparently not on Linux.
My naive sample implementation would be like
/** copy the bytes at data into a double, reversing the
    byte order, and return that.
*/
double reverseValue(const char *data)
{
    double result;
    char *dest = (char *)&result;
    for(size_t i = 0; i < sizeof(double); i++)
    {
        dest[i] = data[sizeof(double)-i-1];
    }
    return result;
}
/** Adjust the byte order from network to host.
    On a big endian machine this is a NOP.
*/
double ntohd(double src)
{
#if !defined(__FLOAT_WORD_ORDER__) \
    || !defined(__ORDER_LITTLE_ENDIAN__)
#error "oops: unknown byte order"
#endif

#if __FLOAT_WORD_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return reverseValue((char *)&src);
#else
    return src;
#endif
}
There is a working example here: https://ideone.com/aV9mj4.
An improved version would cater to the given CPU -- it may have an 8 byte swap command.
If you can adapt the software both sides (emitter & receiver) you could use some serialization library and format. It could be using old XDR routines like xdr_double in your case (see xdr(3)...). You could also consider ASN1 or using textual formats like JSON.
XDR is big-endian. You might try to find some NDR implementation, which is little-endian.
See also this related question, and STFW for htond
Ok, so finally, thanks to you all, I have found the best solution. Now my code looks like this:
double reverseDouble(const char *data){
    double result;
    char *dest = (char *)&result;
    for(size_t i = 0; i < sizeof(double); i++)
        dest[i] = data[sizeof(double)-i-1];
    return result;
}

double get_double(uint8_t * buff){
    double value;
    memcpy(&value, buff, sizeof(double));
    if (__FLOAT_WORD_ORDER__ == __ORDER_BIG_ENDIAN__)
        return reverseDouble((char *)&value);
    else
        return value;
}
p.s. (checking for defines etc. is somewhere else)
If you can assume the endianness of doubles is the same as for integers (which you can't, generally, but it's almost always the same), you can read it as an integer by shifting it byte by byte, and then cast the representation to a double.
double get_double_from_little_endian(uint8_t * buff) {
    uint64_t u64 = ((uint64_t)buff[0] << 0 |
                    (uint64_t)buff[1] << 8 |
                    (uint64_t)buff[2] << 16 |
                    (uint64_t)buff[3] << 24 |
                    (uint64_t)buff[4] << 32 |
                    (uint64_t)buff[5] << 40 |
                    (uint64_t)buff[6] << 48 |
                    (uint64_t)buff[7] << 56);
    double d;
    memcpy(&d, &u64, sizeof d);  /* well-defined way to reuse the representation */
    return d;
}
This is only standard C and it's optimized well by compilers. The C standard doesn't define how the floats are represented in memory though, so compiler-dependent macros like __FLOAT_WORD_ORDER__ may give better results, but it gets more complex if you want to cover also the "mixed-endian IEEE format" which is found on ARM old-ABI.

Changing the Endiannes of an integer which can be 2,4 or 8 bytes using a switch-case statement

In a (real-time) system, computer 1 (big-endian) gets integer data from computer 2 (which is little-endian). Given the fact that we do not know the size of int, I check it using a switch statement on sizeof() and use the corresponding __builtin_bswapX method as follows (assume that this builtin method is usable).
...
int data;
getData(&data);  // not the actual function call; just represents where data comes from
...
switch (sizeof(int)) {
case 2:
    intVal = __builtin_bswap16(data);
    break;
case 4:
    intVal = __builtin_bswap32(data);
    break;
case 8:
    intVal = __builtin_bswap64(data);
    break;
default:
    break;
}
...
...
is this a legitimate way of swapping the bytes for an integer data? Or is this switch-case statement totally unnecessary?
Update: I do not have access to the internals of getData() method, which communicates with the other computer and gets the data. It then just returns an integer data which needs to be byte-swapped.
Update 2: I realize that I caused some confusion. The two computers have the same int size but we do not know that size. I hope it makes sense now.
Seems odd to assume the size of int is the same on 2 machines yet compensate for variant endian encodings.
The below only informs the int size of the receiving side and not the sending side.
switch(sizeof(int))
The sizeof(int) is the size, in chars, of an int on the local machine. It should be sizeof(int)*CHAR_BIT to get the bit size. [OP has edited the post]
The sending machine should detail the data width, as a 16, 32, 64- bit without regard to its int size and the receiving end should be able to detect that value as part of the message or an agreed upon width should be used.
Much like the hton() family converts from local endian to network endian, the trend with these functions is toward fixed-width integers:
#include <netinet/in.h>
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
So I suggest sending/receiving the "int" as a 32-bit uint32_t in network endian, as sketched below.
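A sketch of that (fd is assumed to be a connected POSIX socket descriptor; it is not part of the original question):

#include <netinet/in.h>
#include <stdint.h>
#include <unistd.h>

/* Sketch: ship an int as a fixed-width, big-endian uint32_t. */
int send_int(int fd, int value)
{
    uint32_t wire = htonl((uint32_t)value);
    return write(fd, &wire, sizeof wire) == (ssize_t)sizeof wire ? 0 : -1;
}

int recv_int(int fd, int *value)
{
    uint32_t wire;
    if (read(fd, &wire, sizeof wire) != (ssize_t)sizeof wire)
        return -1;
    *value = (int)ntohl(wire);
    return 0;
}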
[Edit]
Consider that computers exist with different endians (little and big are the most common; others exist) and various int sizes: bit widths of 32 (common), 16, 64, maybe even some oddball 36-bit and such, with room for growth to 128-bit. Let us assume N combinations. Rather than write code to convert from 1 of N formats to N different formats (N*N routines), let us define a network format and fix its endian to big and its bit width to 32. Now each computer does not care about, nor need to know, the int width/endian of the sender/recipient of data. Each platform sends/receives data via a locally optimized method between its own endian/int and the network endian/int-width.
OP describes not knowing the sender's int width, yet hints that the int width on the sender/receiver might be the same as on the local machine. If the int widths are specified to be the same and the endians are specified to be one big/one little as described, then OP's coding works.
However, such an "endians are opposite and int width the same" setup seems very selective. I would prepare code to cope with an interchange standard (network standard), as certainly, even if today it is "opposite endian, same int", tomorrow it will evolve toward a network standard.
A portable approach would not depend on any machine properties, but only rely on mathematical operations and a definition of the communication protocol that is also hardware independent. For example, given that you want to store bytes in a defined way:
void serializeLittleEndian(uint8_t *buffer, uint32_t data) {
    size_t i;
    for (i = 0; i < sizeof(uint32_t); ++i) {
        buffer[i] = data % 256;
        data /= 256;
    }
}
and to restore that data to whatever machine:
uint32_t deserializeLittleEndian(uint8_t *buffer) {
    uint32_t data = 0;
    size_t i;
    /* start at the most significant byte (buffer[3]) and work down */
    for (i = sizeof(uint32_t); i > 0; --i) {
        data *= 256;
        data += buffer[i - 1];
    }
    return data;
}
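A quick round-trip check of the pair above (a sketch; assumes both functions are in scope):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t wire[4];
    serializeLittleEndian(wire, 0xDEADBEEF);        /* sender side */
    uint32_t back = deserializeLittleEndian(wire);  /* receiver side */
    printf("%s\n", back == 0xDEADBEEF ? "round-trip ok" : "mismatch");
    return 0;
}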
EDIT: This is not portable to systems with other than 8 bits per byte, due to the use of uint8_t and uint32_t. The use of uint8_t implies a system with 8-bit chars. However, the code simply will not compile on systems where these conditions are not met, so the mismatch cannot go unnoticed. Thanks to Olaf and Chqrlie.
Yes, this is totally cool, given you fix your switch for the proper sizeof return values. One might be a little fancy and provide, for example, template specializations based on the size of int. But a switch like this is totally fine and will not produce any branches in optimized code.
As already mentioned, you generally want to define a protocol for communications across networks, which the hton/ntoh functions are mostly meant for. Network byte order is generally treated as big endian, which is what the hton/ntoh functions use. If the majority of your machines are little endian, it may be better to standardize on it instead though.
A couple people have been critical of using __builtin_bswap, which I personally consider fine as long you don't plan to target compilers that don't support it. Although, you may want to read Dan Luu's critique of intrinsics.
For completeness, I'm including a portable version of bswap that (at the very least with Clang) compiles into a bswap for x86(-64).
#include <stddef.h>
#include <stdint.h>

size_t bswap(size_t x) {
    for (size_t i = 0; i < sizeof(size_t) >> 1; i++) {
        size_t d = sizeof(size_t) - i - 1;
        size_t mh = ((size_t) 0xff) << (d << 3);
        size_t ml = ((size_t) 0xff) << (i << 3);
        size_t h = x & mh;
        size_t l = x & ml;
        size_t t = (l << ((d - i) << 3)) | (h >> ((d - i) << 3));
        x = t | (x & ~(mh | ml));
    }
    return x;
}

how to cast uint8_t array of 4 to uint32_t in c

I am trying to cast an array of uint8_t to an array of uint32_t, but it doesn't seem to be working.
Can anyone help me with this? I need to get the uint8_t values into a uint32_t.
I can do this with shifting, but I think there is an easier way.
uint32_t *v4full;
v4full = (uint32_t *)v4;
while (*v4full) {
    if (*v4full & 1)
        printf("1");
    else
        printf("0");
    *v4full >>= 1;
}
printf("\n");
Given the need to get uint8_t values to uint32_t, and the specs on in4_pton()...
Try this with a possible correction on the byte order:
uint32_t i32 = v4[0] | (v4[1] << 8) | (v4[2] << 16) | ((uint32_t)v4[3] << 24);
There is a problem with your example - actually with what you are trying to do (since you don't want the shifts).
See, it is a little known fact, but you're not allowed to switch pointer types in this manner
specifically, code like this is illegal:
type1 *vec1 = ...;
type2 *vec2 = (type2*)vec1;
// do stuff with *vec2
The only case where this is legal is if type2 is char (or unsigned char or const char etc.), but if type2 is any other type (uint32_t in your example) it's against the standard and may introduce bugs to your code if you compile with -O2 or -O3 optimization.
This is called the "strict-aliasing rule" and it allows compilers to assume that pointers of different types never point to related points in memory - so that if you change the memory of one pointer, the compiler doesn't have to reload all other pointers.
It's hard for compilers to find instances of breaking this rule, unless you make it painfully clear to it. For example, if you change your code to do this:
uint32_t v4full=*((uint32_t*)v4);
and compile using -O3 -Wall (I'm using gcc) you'll get the warning:
warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
So you can't avoid using the shifts.
Note: it will work at lower optimization settings, and it will also work at higher settings if you never change the memory pointed to by v4 and v4full. It will work, but it's still a bug and still "against the rules".
If v4full is a pointer, then the lines
uint32_t *v4full;
v4full = (uint32_t)&v4;
should throw an error, or at least a compiler warning. Maybe you meant to do
uint32_t *v4full;
v4full = (uint32_t *)v4;
where I assume v4 is itself a pointer to a uint8_t array. I realize I am extrapolating from incomplete information…
EDIT since the above appears to have addressed a typo, let's try again.
The following snippet of code works as expected - and as I think you want your code to work. Please comment on this - how is this code not doing what you want?
#include <stdio.h>
#include <inttypes.h>

int main(void) {
    uint8_t v4[4] = {1, 2, 3, 4};
    uint32_t *allOfIt;
    allOfIt = (uint32_t*)v4;
    printf("the number is %08x\n", *allOfIt);
    return 0;
}
Output:
the number is 04030201
Note - the order of the bytes in the printed number is reversed - you get 04030201 instead of 01020304 as you might have expected/wanted. This is because my machine (x86 architecture) is little-endian. If you want to make sure the order of the bytes is the way you want it (in other words, that element [0] corresponds to the most significant byte), you are better off using @bvj's solution - shifting each of the four bytes into the right position in your 32-bit integer.
Incidentally, you can see this earlier answer for a very efficient way to do this, if needed (telling the compiler to use a built in instruction of the CPU).
One other issue that makes this code non-portable is that many architectures require a uint32_t to be aligned on a four-byte boundary, but allow uint8_t to have any address. Calling this code on an improperly-aligned array would then cause undefined behavior, such as crashing the program with SIGBUS. On these machines, the only way to cast an arbitrary uint8_t[] to a uint32_t[] is to memcpy() the contents. (If you do this in four-byte chunks, the compiler should optimize to whichever of an unaligned load or two-loads-and-a-shift is more efficient on your architecture.)
If you have control over the declaration of the source array, you can #include <stdalign.h> and then declare alignas(uint32_t) uint8_t bytes[]. The classic solution is to declare both the byte array and the 32-bit values as members of a union and type-pun between them. It is also safe to use pointers obtained from malloc(), since these are guaranteed to be suitably-aligned.
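A sketch of the classic union approach from the paragraph above:

#include <stdint.h>
#include <stdio.h>

/* The uint32_t member forces suitable alignment, and reading
   v.word after writing v.bytes is well-defined in C. */
union pun {
    uint8_t bytes[4];
    uint32_t word;
};

int main(void)
{
    union pun v = { .bytes = { 1, 2, 3, 4 } };
    printf("the number is %08x\n", v.word);  /* 04030201 on little-endian */
    return 0;
}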
This is one solution:
/* convert character array to integer */
uint32_t buffChar_To_Int(char *array, size_t n){
    int number = 0;
    int mult = 1;
    n = (int)n < 0 ? -n : n;  /* quick absolute-value check */
    /* for each character in array */
    while (n--){
        /* if not a digit or '-': break if we already have a number, else keep scanning */
        if((array[n] < '0' || array[n] > '9') && array[n] != '-'){
            if(number)
                break;
            else
                continue;
        }
        if(array[n] == '-'){  /* if '-' and we have a number, negate and break */
            if(number){
                number = -number;
                break;
            }
        }
        else{  /* convert digit to numeric value */
            number += (array[n] - '0') * mult;
            mult *= 10;
        }
    }
    return number;
}
One more solution:
u32 ip;
if (!in4_pton(str, -1, (u8 *)&ip, -1, NULL))
    return -EINVAL;
... use ip as defined above (as a variable of type u32)
Here we use the result of the in4_pton function (ip) without any additional variables and casts.

_mm_crc32_u64 poorly defined

Why in the world was _mm_crc32_u64(...) defined like this?
unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v );
The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (It is, after all, CRC32 not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored, and filled with 0's on completion, so there is NO use to EVER have a 64-bit destination. I understand why Intel allowed a 64-bit destination operand on the instruction (for uniformity), but if I want to process data quickly, I want a source operand as large as possible (i.e. 64-bits if I have that much data left, smaller for the tail ends) and always a 32-bit destination operand. But the intrinsics don't allow a 64-bit source and 32-bit destination. Note the other intrinsics:
unsigned int _mm_crc32_u8 ( unsigned int crc, unsigned char v );
The type of "crc" is not an 8-bit type, nor is the return type, they are 32-bits. Why is there no
unsigned int _mm_crc32_u64 ( unsigned int crc, unsigned __int64 v );
? The Intel instruction supports this, and that is the intrinsic that makes the most sense.
Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks.
My guess is something like this:
#define CRC32(D32,S) __asm__("crc32 %0, %1" : "+xrm" (D32) : ">xrm" (S))
for GCC, and
#define CRC32(D32,S) __asm { crc32 D32, S }
for VisualStudio. Unfortunately I have little understanding of how constraints work, and little experience with the syntax and semantics of assembly level programming.
Small edit: note the macros I've defined:
#define GET_INT64(P) *(reinterpret_cast<const uint64* &>(P))++
#define GET_INT32(P) *(reinterpret_cast<const uint32* &>(P))++
#define GET_INT16(P) *(reinterpret_cast<const uint16* &>(P))++
#define GET_INT8(P) *(reinterpret_cast<const uint8 * &>(P))++
#define DO1_HW(CR,P) CR = _mm_crc32_u8 (CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = _mm_crc32_u16(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = _mm_crc32_u32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = (_mm_crc32_u64((uint64)CR, GET_INT64(P))) & 0xFFFFFFFF;
Notice how different the last macro statement is. The lack of uniformity is certainly an indication that the intrinsic has not been defined sensibly. While it is not necessary to put in the explicit (uint64) cast in the last macro, it is implicit and does happen. Disassembling the generated code shows code for both casts, 32->64 and 64->32, both of which are unnecessary.
Put another way, it's _mm_crc32_u64, not _mm_crc64_u64, but they've implemented it as if it were the latter.
If I could get the definition of CRC32 above correct, then I would want to change my macros to
#define DO1_HW(CR,P) CR = CRC32(CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = CRC32(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = CRC32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = CRC32(CR, GET_INT64(P))
The 4 intrinsic functions provided really do allow all possible uses of the Intel-defined CRC32 instruction. The instruction output is always 32 bits because the instruction is hard-coded to use a specific 32-bit CRC polynomial. However, the instruction allows your code to feed input data to it 8, 16, 32, or 64 bits at a time. Processing 64 bits at a time should maximize throughput. Processing 32 bits at a time is the best you can do if restricted to a 32-bit build. Processing 8 or 16 bits at a time could simplify your code logic if the input byte count is odd or not a multiple of 4/8.
#include <stdio.h>
#include <stdint.h>
#include <intrin.h>

int main (int argc, char *argv [])
{
    size_t index;
    uint8_t *data8;
    uint16_t *data16;
    uint32_t *data32;
    uint64_t *data64;
    uint32_t total1, total2, total3;
    uint64_t total4;
    uint64_t input [] = {0x1122334455667788, 0x1111222233334444};

    total1 = total2 = total3 = 0;
    total4 = 0;
    data8  = (void *) input;
    data16 = (void *) input;
    data32 = (void *) input;
    data64 = (void *) input;

    for (index = 0; index < sizeof input / sizeof *data8; index++)
        total1 = _mm_crc32_u8 (total1, *data8++);
    for (index = 0; index < sizeof input / sizeof *data16; index++)
        total2 = _mm_crc32_u16 (total2, *data16++);
    for (index = 0; index < sizeof input / sizeof *data32; index++)
        total3 = _mm_crc32_u32 (total3, *data32++);
    for (index = 0; index < sizeof input / sizeof *data64; index++)
        total4 = _mm_crc32_u64 (total4, *data64++);

    printf ("CRC32 result using  8-bit chunks: %08X\n", total1);
    printf ("CRC32 result using 16-bit chunks: %08X\n", total2);
    printf ("CRC32 result using 32-bit chunks: %08X\n", total3);
    printf ("CRC32 result using 64-bit chunks: %08X\n", (uint32_t)total4);
    return 0;
}
Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks.
My friend and I wrote a C++ SSE intrinsics wrapper which contains the more preferred usage of the crc32 instruction with a 64-bit source.
http://code.google.com/p/sse-intrinsics/
See the i_crc32() instruction.
(Sadly there are even more flaws with Intel's SSE intrinsic specifications on other instructions; see this page for more examples of flawed intrinsic design.)
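For reference, a sketch of the desired 64-bit-source, 32-bit-crc form with working GCC extended-asm constraints (unlike the guess in the question; MSVC would have to keep using the intrinsic and truncating):

#include <stdint.h>

/* Sketch: "crc32q" needs a 64-bit destination register, so widen
   the crc for the instruction and truncate on return. GCC/Clang only. */
static inline uint32_t crc32_u64(uint32_t crc, uint64_t data)
{
    uint64_t c = crc;
    __asm__ ("crc32q %1, %0" : "+r" (c) : "rm" (data));
    return (uint32_t)c;
}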
