_mm_crc32_u64 poorly defined - c

Why in the world was _mm_crc32_u64(...) defined like this?
unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v );
The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (It is, after all, CRC32 not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored, and filled with 0's on completion, so there is NO use to EVER have a 64-bit destination. I understand why Intel allowed a 64-bit destination operand on the instruction (for uniformity), but if I want to process data quickly, I want a source operand as large as possible (i.e. 64-bits if I have that much data left, smaller for the tail ends) and always a 32-bit destination operand. But the intrinsics don't allow a 64-bit source and 32-bit destination. Note the other intrinsics:
unsigned int _mm_crc32_u8 ( unsigned int crc, unsigned char v );
The type of "crc" is not an 8-bit type, nor is the return type, they are 32-bits. Why is there no
unsigned int _mm_crc32_u64 ( unsigned int crc, unsigned __int64 v );
? The Intel instruction supports this, and that is the intrinsic that makes the most sense.
Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks.
My guess is something like this:
#define CRC32(D32,S) __asm__("crc32 %0, %1" : "+xrm" (D32) : ">xrm" (S))
for GCC, and
#define CRC32(D32,S) __asm { crc32 D32, S }
for VisualStudio. Unfortunately I have little understanding of how constraints work, and little experience with the syntax and semantics of assembly level programming.
Small edit: note the macros I've defined:
#define GET_INT64(P) *(reinterpret_cast<const uint64* &>(P))++
#define GET_INT32(P) *(reinterpret_cast<const uint32* &>(P))++
#define GET_INT16(P) *(reinterpret_cast<const uint16* &>(P))++
#define GET_INT8(P) *(reinterpret_cast<const uint8 * &>(P))++
#define DO1_HW(CR,P) CR = _mm_crc32_u8 (CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = _mm_crc32_u16(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = _mm_crc32_u32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = (_mm_crc32_u64((uint64)CR, GET_INT64(P))) & 0xFFFFFFFF;
Notice how different the last macro statement is. The lack of uniformity is certainly an indication that the intrinsic has not been defined sensibly. While it is not necessary to put the explicit (uint64) cast in the last macro, the conversion is implicit and happens anyway. Disassembling the generated code shows code for both casts, 32->64 and 64->32, both of which are unnecessary.
Put another way, it's _mm_crc32_u64, not _mm_crc64_u64, but they've implemented it as if it were the latter.
If I could get the definition of CRC32 above correct, then I would want to change my macros to
#define DO1_HW(CR,P) CR = CRC32(CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = CRC32(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = CRC32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = CRC32(CR, GET_INT64(P))

The 4 intrinsic functions provided really do allow all possible uses of the Intel-defined CRC32 instruction. The instruction's output is always 32 bits because the instruction is hard-coded to use a specific 32-bit CRC polynomial. However, the instruction allows your code to feed input data to it 8, 16, 32, or 64 bits at a time. Processing 64 bits at a time should maximize throughput. Processing 32 bits at a time is the best you can do if restricted to a 32-bit build. Processing 8 or 16 bits at a time could simplify your code logic if the input byte count is odd or not a multiple of 4/8.
#include <stdio.h>
#include <stdint.h>
#include <intrin.h>
int main (int argc, char *argv [])
{
    int index;
    uint8_t *data8;
    uint16_t *data16;
    uint32_t *data32;
    uint64_t *data64;
    uint32_t total1, total2, total3;
    uint64_t total4;
    uint64_t input [] = {0x1122334455667788, 0x1111222233334444};
    total1 = total2 = total3 = total4 = 0;
    data8 = (void *) input;
    data16 = (void *) input;
    data32 = (void *) input;
    data64 = (void *) input;
    for (index = 0; index < sizeof input / sizeof *data8; index++)
        total1 = _mm_crc32_u8 (total1, *data8++);
    for (index = 0; index < sizeof input / sizeof *data16; index++)
        total2 = _mm_crc32_u16 (total2, *data16++);
    for (index = 0; index < sizeof input / sizeof *data32; index++)
        total3 = _mm_crc32_u32 (total3, *data32++);
    for (index = 0; index < sizeof input / sizeof *data64; index++)
        total4 = _mm_crc32_u64 (total4, *data64++);
    printf ("CRC32 result using 8-bit chunks: %08X\n", total1);
    printf ("CRC32 result using 16-bit chunks: %08X\n", total2);
    printf ("CRC32 result using 32-bit chunks: %08X\n", total3);
    /* The upper 32 bits are always zero, so truncating for %08X loses nothing. */
    printf ("CRC32 result using 64-bit chunks: %08X\n", (uint32_t) total4);
    return 0;
}
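To make the "32-bit accumulator, 64-bit source" shape concrete, here is a minimal portable sketch of what the instruction computes, written as plain C rather than with intrinsics. The function names (crc32c_u8, crc32c_u64, crc32c_buffer) are made up for this illustration and are not part of any Intel or compiler API. It assumes the polynomial is CRC-32C (Castagnoli), processed bit-reflected, and that a 64-bit source is consumed low byte first.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected form of the CRC-32C (Castagnoli) polynomial 0x1EDC6F41,
 * the polynomial the x86 crc32 instruction is hard-wired to use. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Software model of the 8-bit form: fold one byte into a 32-bit CRC. */
static uint32_t crc32c_u8(uint32_t crc, uint8_t v)
{
    crc ^= v;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ ((crc & 1u) ? CRC32C_POLY_REFLECTED : 0u);
    return crc;
}

/* Hypothetical wrapper with the signature the question asks for:
 * 32-bit accumulator in, 64-bit data in, 32-bit accumulator out.
 * The eight bytes are consumed least-significant byte first, which
 * matches memory order on a little-endian machine. */
static uint32_t crc32c_u64(uint32_t crc, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        crc = crc32c_u8(crc, (uint8_t)(v >> (8 * i)));
    return crc;
}

/* Byte-wise CRC-32C of a buffer with the conventional initial value and
 * final inversion (the accumulate step itself applies neither). */
static uint32_t crc32c_buffer(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = crc32c_u8(crc, *p++);
    return ~crc;
}
```

On a compiler that provides the SSE4.2 intrinsics, the body of crc32c_u64 could presumably be replaced by (uint32_t)_mm_crc32_u64(crc, v); since the upper 32 bits of the result are always zero, the truncating cast loses nothing.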

My friend and I wrote a c++ sse intrinsics wrapper which contains the more preferred usage of the crc32 instruction with 64bit src.
See the i_crc32() instruction.
(sadly there are even more flaws with intel's sse intrinsic specifications on other instructions, see this page for more examples of flawed intrinsic design)


Save 128 bit value (unsigned) when max data type has 64 bits

I have to save the number of non zero entries in a matrix with dimensions that could be
as big as uint64_t x uint64_t resulting in a 128 bit value.
I'm not sure which data type would be right for this variable in C, as it would require 128 bits (unsigned).
I would use __int128 as a data type but my problem is that when I test the max. supported data type on my system with
#include <stdio.h>
#include <stdint.h>
int main() {
    /* sizeof yields a size_t, so %zu is the matching conversion. */
    printf("maxUInt: %zu\n", sizeof(uintmax_t));
    printf("maxInt: %zu", sizeof(intmax_t));
}
It gives the following result:
maxUInt: 8
maxInt: 8
meaning that 8 Bytes is the maximum for number representation.
So this is troubling me as the result is possibly 128 bits == 16 Bytes big.
Will __int128 still work in my case?
We're talking about the size of an array, so uintmax_t and intmax_t are irrelevant.
malloc() accepts a size_t. The following therefore computes the limit of how much you can request:
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    printf( "2^( %zu * %d ) bytes\n", sizeof( size_t ), CHAR_BIT );
}
For me, that's 2^( 8 * 8 ) octets.
But I'm on an x86-64 machine. Those don't support nearly that much memory. The instruction set only supports 2^48 octets of memory.
281,474,976,710,656 (1/65,536 of what 64 bits can support)
But no x86-64 machine supports that much. Current hardware only supports 2^40 octets of memory.
1,099,511,627,776 (1/16,777,216 of what 64 bits can support)
So unless you have some very special hardware, 64 bits is more than enough to store the size of any array your machine can handle.
Still, let's answer your question about support for __int128 and unsigned __int128. These two types, if supported, are an extension to the standard. And they are apparently not candidates for intmax_t and uintmax_t, at least on my compiler. So checking the size of intmax_t and uintmax_t is not useful for detecting their support.
If you want to check if you have support for __int128 or unsigned __int128, simply try to use them.
__int128 i = 0;
unsigned __int128 u = 0;
If both uintmax_t and unsigned __int128 are too small, you can still use extended precision math, such as by using two 64-bit integers in the manner shown in Maxim Egorushkin's answer.
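As a rough sketch of both routes, the following computes the full 128-bit product of two 64-bit dimensions. It uses unsigned __int128 when the compiler predefines __SIZEOF_INT128__ (GCC and Clang do when the type is available) and falls back to two 64-bit halves otherwise. The U128 struct and the mul_u64 name are invented for this example.

```c
#include <stdint.h>

/* Hypothetical 128-bit unsigned value held as two 64-bit halves. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} U128;

/* Full 128-bit product of two 64-bit values, e.g. rows * columns. */
static U128 mul_u64(uint64_t a, uint64_t b)
{
    U128 r;
#ifdef __SIZEOF_INT128__
    /* Compiler extension path: let the compiler do the wide multiply. */
    unsigned __int128 p = (unsigned __int128)a * b;
    r.hi = (uint64_t)(p >> 64);
    r.lo = (uint64_t)p;
#else
    /* Portable fallback: schoolbook multiplication on 32-bit halves. */
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    /* Sum of three values below 2^32 each, so it cannot overflow 64 bits. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
#endif
    return r;
}
```

Either path returns the same hi/lo pair, so the rest of the program does not need to know whether __int128 was available.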
One portable option is to construct a counter out of multiple smaller units:
typedef struct BigCounterC {
uint64_t count_[2];
} BigCounterC;
void BigCounterC_increment(BigCounterC* counter) {
    // Increment the higher units when the lower unit of unsigned type wraps around reaching 0.
    for(size_t n = sizeof counter->count_ / sizeof *counter->count_; n-- && !++counter->count_[n];);
}
int main() {
    BigCounterC c2 = {0}; // Zero-initialize.
    return 0;
}
C++ version:
#include <cstdint>
#include <type_traits>
template<class Unit, size_t N>
struct BigCounter {
    static_assert(std::is_unsigned_v<Unit>); // Unsigned overflow is well defined.
    Unit count_[N] = {}; // Zero-initialize.
    BigCounter& operator++() noexcept {
        // Increment the higher units when the lower unit of unsigned type wraps around reaching 0.
        for(auto n = N; n-- && !++count_[n];);
        return *this;
    }
};
int main() {
    BigCounter<uint64_t, 2> c;
}

Endianness conversion without relying on undefined behavior

I am using C to read a .png image file, and if you're not familiar with the PNG encoding format, useful integer values are encoded in .png files in the form of 4-byte big-endian integers.
My computer is a little-endian machine, so to convert from a big-endian uint32_t that I read from the file with fread() to a little-endian one my computer understands, I've been using this little function I wrote:
#include <stdint.h>
uint32_t convertEndian(uint32_t val){
    union{
        uint32_t value;
        char bytes[sizeof(uint32_t)];
    }in,out;
    in.value=val;
    for(int i=0;i<sizeof(uint32_t);++i)
        out.bytes[i]=in.bytes[sizeof(uint32_t)-1-i];
    return out.value;
}
This works beautifully on my x86_64 UNIX environment, gcc compiles without error or warning even with the -Wall flag, but I feel rather confident that I'm relying on undefined behavior and type-punning that may not work as well on other systems.
Is there a standard function I can call that can reliably convert a big-endian integer to one the native machine understands, or if not, is there an alternative safer way to do this conversion?
I see no real UB in OP's code.
Portability issues: yes.
"type-punning that may not work as well on other systems" is not a problem with OP's C code yet may cause trouble with other languages.
Yet how about a big (PNG) endian to host instead?
Extract the bytes by address (lowest address which has the MSByte to highest address which has the LSByte - "big" endian) and form the result with the shifted bytes.
Something like:
uint32_t Endian_BigToHost32(uint32_t val) {
    union {
        uint32_t u32;
        uint8_t u8[sizeof(uint32_t)]; // uint8_t ensures a byte is 8 bits.
    } x = { .u32 = val };
    return
        ((uint32_t)x.u8[0] << 24) |
        ((uint32_t)x.u8[1] << 16) |
        ((uint32_t)x.u8[2] << 8) |
        x.u8[3];
}
Tip: many libraries have an implementation-specific function to do this efficiently, for example be32toh.
IMO it'd be better style to read from bytes into the desired format, rather than apparently memcpy'ing a uint32_t and then internally manipulating the uint32_t. The code might look like:
uint32_t read_be32(uint8_t *src) // must be unsigned input
{
    return (src[0] * 0x1000000u) + (src[1] * 0x10000u) + (src[2] * 0x100u) + src[3];
}
It's quite easy to get this sort of code wrong, so make sure you get it from high rep SO users 😉. You may often see the alternative suggestion return (src[0] << 24) + (src[1] << 16) + (src[2] << 8) + src[3]; however, that causes undefined behaviour if src[0] >= 128 due to signed integer overflow, because of the unfortunate rule that the integer promotions take uint8_t to signed int. It also causes undefined behaviour on a system with 16-bit int, because the shift counts are too large.
Modern compilers should be smart enough to optimize this, e.g. the assembly produced by clang for a little-endian target is:
read_be32: # @read_be32
    mov eax, dword ptr [rdi]
    bswap eax
    ret
However, I see that gcc 10.1 produces much more complicated code; this seems to be a surprising missed-optimization bug.
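The shift-based variant criticised above becomes well defined once each byte is converted to uint32_t before it is shifted, so that no arithmetic happens in signed int. A minimal sketch (the name read_be32_shift is made up here to keep it apart from read_be32):

```c
#include <stdint.h>

/* Read a big-endian 32-bit value from four bytes. Each byte is widened
 * to uint32_t before shifting, so no shift is performed in signed int
 * and the result is the same regardless of host endianness. */
static uint32_t read_be32_shift(const uint8_t *src)
{
    return ((uint32_t)src[0] << 24) |
           ((uint32_t)src[1] << 16) |
           ((uint32_t)src[2] <<  8) |
            (uint32_t)src[3];
}
```

For PNG this would be called on the four bytes read from the file, for example read_be32_shift(buffer + offset), rather than on a uint32_t that was filled in by fread().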
This solution doesn't rely on accessing inactive members of a union, but relies instead on unsigned integer bit-shift operations which can portably and safely convert from big-endian to little-endian or vice versa
#include <stdint.h>
uint32_t convertEndian32(uint32_t in){
    return ((in&0xffu)<<24)|((in&0xff00u)<<8)|((in&0xff0000u)>>8)|((in&0xff000000u)>>24);
}
This code reads a uint32_t from a pointer of uchar_t in big endian storage, independently of the endianness of your architecture. (The code just acts as if it was reading a base 256 number)
uint32_t read_bigend_int(uchar_t *p, int sz)
{
    uint32_t result = 0;
    while(sz--) {
        result <<= 8; /* multiply by base */
        result |= *p++; /* and add the next digit */
    }
    return result;
}
if you call, for example:
int main()
{
    /* ... */
    uchar_t buff[1024];
    read(fd, buff, sizeof buff);
    uint32_t value = read_bigend_int(buff + offset, sizeof value);
    /* ... */
}

Unsigned short int operation with Intel Intrinsics

I want to do some operation using the Intel intrinsics (vector of unsigned int of 16 bit) and the operations are the following :
load or set from an array of unsigned short int.
Div and Mod operations with unsigned short int.
Multiplication operation with unsigned short int.
Store operation of unsigned short int into an array.
I looked into the Intrinsics guide but it looks like there are only intrinsics for short integers and not the unsigned ones. Could someone have any trick that could help me with this ?
In fact I'm trying to store an image of a specific raster format in an array with a specific ordering. So I have to calculate the index where each pixel value is going to be stored:
unsigned int Index(unsigned int interleaving_depth, unsigned int x_size, unsigned int y_size, unsigned int z_size, unsigned int Pixel_number)
{
    unsigned int x = 0, y = 0, z = 0, reminder = 0, i = 0;
    y = Pixel_number/(x_size*z_size);
    reminder = Pixel_number % (x_size*z_size);
    i = reminder/(x_size*interleaving_depth);
    reminder = reminder % (x_size*interleaving_depth);
    if(i == z_size/interleaving_depth){
        x = reminder/(z_size - i*interleaving_depth);
        reminder = reminder % (z_size - i*interleaving_depth);
    }
    else
    {
        x = reminder/interleaving_depth;
        reminder = reminder % interleaving_depth;
    }
    z = interleaving_depth*i + reminder;
    if(z >= z_size)
        z = z_size - 1;
    return x + y*x_size + z*x_size*y_size;
}
If you only want the low half of the result, multiplication is the same binary operation for signed or unsigned. So you can use pmullw on either. There are separate high-half multiply instructions for signed and unsigned short, though: _mm_mulhi_epu16 (pmulhuw) vs. _mm_mulhi_epi16 (pmulhw).
Similarly, you don't need an _mm_set_epu16 because it's the same operation: on x86 casting to signed doesn't change the bit-pattern, so Intel only bothered to provide _mm_set_epi16, but you can use it with args like 0xFFFFu instead of -1 with no problems. (Using Intel intrinsics automatically means your code only has to be portable to x86 32 and 64 bit.)
Load / store intrinsics don't change the data at all.
SSE/AVX doesn't have integer division or mod instructions. If you have compile-time-constant divisors, do it yourself with a multiply/shift. You can look at compiler output to get the magic constant and shift counts (Why does GCC use multiplication by a strange number in implementing integer division?), or even let gcc auto-vectorize something for you. Or even use GNU C native vector syntax to divide:
#include <immintrin.h>
__m128i div13_epu16(__m128i a)
{
    typedef unsigned short __attribute__((vector_size(16))) v8uw;
    v8uw tmp = (v8uw)a;
    v8uw divisor = (v8uw)_mm_set1_epi16(13);
    v8uw result = tmp/divisor;
    return (__m128i)result;
    // clang allows "lax" vector type conversions without casts
    // gcc allows vector / scalar, e.g. tmp / 13. Clang requires set1
    // to work with both, we need to jump through all the syntax hoops
}
compiles to this asm with gcc and clang (Godbolt compiler explorer):
pmulhuw xmm0, XMMWORD PTR .LC0[rip] # tmp93,
psrlw xmm0, 2 # tmp95,
.section .rodata
.value 20165
# repeats 8 times
If you have runtime-variable divisors, it's going to be slower, but you can use http://libdivide.com/. It's not too bad if you reuse the same divisor repeatedly, so you only have to calculate a fixed-point inverse for it once, but code to use an arbitrary inverse needs a variable shift count which is less efficient with SSE (well also for integer), and potentially more instructions because some divisors require a more complicated sequence than others.
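To see why the multiply/shift pair in the compiler output above divides by 13, here is a scalar model of a single 16-bit lane. It is only a sketch: the function names are invented, and the constant 20165 and the shift count 2 are taken from the asm listing rather than derived independently. Note that 20165 * 13 = 262145 = 2^18 + 1, so 20165 is 2^18 / 13 rounded up.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of the pmulhuw + psrlw sequence:
 * keep the high 16 bits of a * 20165, then shift right by 2. */
static uint16_t div13_u16(uint16_t a)
{
    uint32_t product = (uint32_t)a * 20165u;   /* pmulhuw keeps bits 31..16 */
    return (uint16_t)((product >> 16) >> 2);   /* psrlw by 2 */
}

/* Returns 1 if the model agrees with a / 13 for every 16-bit input. */
static int div13_u16_matches_division(void)
{
    for (uint32_t a = 0; a <= 0xFFFFu; a++)
        if (div13_u16((uint16_t)a) != a / 13u)
            return 0;
    return 1;
}
```

The vector version does the same thing in all eight lanes at once; the exhaustive check is cheap here because there are only 65536 possible inputs per lane.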

Casting hex string to signed int results in different values in different platforms

I am dealing with an edge case in a program that I want to be multi-platform. Here is the extract of the problem:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void print_bits(size_t const size, void const * const ptr){
    unsigned char *b = (unsigned char*) ptr;
    unsigned char byte;
    int i, j;
    for (i=size-1;i>=0;i--)
    {
        for (j=7;j>=0;j--)
        {
            byte = (b[i] >> j) & 1;
            printf("%u", byte);
        }
    }
}
int main() {
    char* ascii = "0x80000000";
    int myint = strtol(ascii, NULL, 16);
    printf("%s to signed int is %d and bits are:\t", ascii, myint);
    print_bits(sizeof myint, &myint);
    return 0;
}
So when I compile with GCC on Linux I get this output:
0x80000000 to signed int is -2147483648 and bits are: 10000000000000000000000000000000
On a Windows, using MSVC and MinGW I get:
0x80000000 to signed int is 2147483647 and bits are: 01111111111111111111111111111111
I think the GCC outputs the correct expected values. My question is, where does this difference come from and how to make sure that on all compilers I get the correct result?
The reason behind this code is, I have to check if the MSB (bit #31) of the HEX value is 0 or 1. Then, I have to get the unsigned integer value of the next 7 bits (#30 to #24). In the case of 0x80000000 these 7 bits should result in 0:
int msb_is_set = myint & 1;
uint8_t next_7_bits;
next_7_bits = myint >> 24; //fine on GCC, outputs 0 for the next 7 bits
#ifdef WIN32 //If I do not do this, next_7_bit will be 127 on Windows instead of 0
if(msb_is_set )
    next_7_bits = myint >> 1;
#endif
P.S. This is on the same machine (i5 64bit)
You're dealing with different data models here.
Windows 64 uses LLP64, which means only long long and pointers are 64bit. As strtol converts to long, it converts to a 32bit value, and 0x80000000 in a 32bit signed integer is negative.
Linux 64 uses LP64, so long, long long and pointers are 64bit. I guess you see what's happening here now ;)
Thanks to the comments, I realize my initial answer was wrong. The different outcome indeed has to do with the differing models on those platforms. But: in case of the LP64 model, you have a conversion to a signed type that cannot hold the value, which is implementation defined. int is 32bit on both platforms and a 32bit int just cannot hold 0x80000000. So the correct answer is: you shouldn't expect any result from your code on Linux64. On Win64, as long is only 32bit, strtol() correctly returns LONG_MAX for 0x80000000, which happens to be just one smaller than your input.
int myint = strtol(ascii, NULL, 16);
strtol is 'string to long', not string to int.
Also, you probably want 0x80000000 to be an unsigned long.
You may find that on (that version of) Linux long is 64 bits, whereas on (that version of) Windows long is 32 bits.
Don't do this:
#ifdef __GCC__
because a compiler switch might change the way things work. Better to do something like:
In some header somewhere:
#ifdef __GCC__
#ifdef __MSVC__
Then in your main code:
next_7_bits = myint >> 24;
if(msb_is_set )
next_7_bits = myint >> 1;
Your code should handle the implementation details, and the header should check which implementation is required by which compiler.
This separates out the code required from detecting which method is required for this compiler. In your header you can do more complex detection of compiler features.
#ifdef __GCC__ && __GCCVERION__ > 1.23
This is about your update. Although I'm not sure what your intention is, let's first point out some mistakes:
#ifdef WIN32
The macro always defined when targeting win32 is _WIN32, not WIN32.
Then you have another #ifdef checking for GCC, but this will not do what you expect: GCC also exists on win32 and it uses the same data model as MSVC. IOW, you can have both defined, __GCC__ and _WIN32.
You say you want to know whether the MSB is set. Then just make sure to convert your string to an unsigned int and directly check this bit:
#include <limits.h>
// [...]
unsigned int myint = strtoul(ascii, NULL, 16); // <- strtoul(), not strtol()!
unsigned int msb = 1U << (sizeof(unsigned int) * CHAR_BIT - 1);
if (myint & msb)
{
    // msb is set
}
Btw, see this answer for a really portable way to get the number of bits in an integer type. sizeof() * CHAR_BIT will fail on a platform with padding bits.
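Putting the pieces of this thread together, a sketch of the whole task might look like the following: parse into an unsigned 32-bit value with strtoul(), reject anything that does not fit, and then pick out bit 31 and bits 30 to 24 with shifts and masks. The helper names are made up for this example, and the fixed-width uint32_t is an assumption about what the input represents.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a hexadecimal string (with or without a 0x prefix) into a uint32_t.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 * Note: strtoul() also accepts a leading sign; this sketch does not filter it. */
static int parse_hex32(const char *text, uint32_t *out)
{
    char *end;
    errno = 0;
    unsigned long value = strtoul(text, &end, 16);
    if (end == text || *end != '\0')
        return -1;                      /* no digits, or trailing junk */
    if (errno == ERANGE || value > 0xFFFFFFFFul)
        return -1;                      /* does not fit in 32 bits */
    *out = (uint32_t)value;
    return 0;
}

/* Bit 31 as 0 or 1. */
static unsigned msb_of(uint32_t v)
{
    return (unsigned)(v >> 31);
}

/* Bits 30..24 as an unsigned value in 0..127. */
static unsigned next_7_bits_of(uint32_t v)
{
    return (unsigned)((v >> 24) & 0x7Fu);
}
```

Because every step works on an unsigned type of known width, the result no longer depends on whether long is 32 or 64 bits, and no #ifdef per platform is needed.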

Faster way to zero memory than with memset?

I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?
I assume that memset uses mov, however when zeroing memory most compilers use xor as it's faster, correct? edit1: Wrong, as GregS pointed out that only works with registers. What was I thinking?
Also I asked a person who knew of assembler more than me to look at the stdlib, and he told me that on x86 memset is not taking full advantage of the 32 bit wide registers. However at that time I was very tired, so I'm not quite sure I understood it correctly.
I revisited this issue and did a little testing. Here is what I tested:
#include <stdio.h>
#include <malloc.h>
#include <string.h>
#include <sys/time.h>
#define TIME(body) do { \
struct timeval t1, t2; double elapsed; \
gettimeofday(&t1, NULL); \
body \
gettimeofday(&t2, NULL); \
elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
printf("%s\n --- %f ---\n", #body, elapsed); } while(0)
#define SIZE 0x1000000
void zero_1(void* buff, size_t size)
{
    size_t i;
    char* foo = buff;
    for (i = 0; i < size; i++)
        foo[i] = 0;
}
/* I foolishly assume size_t has register width */
void zero_sizet(void* buff, size_t size)
{
    size_t i;
    char* bar;
    size_t* foo = buff;
    for (i = 0; i < size / sizeof(size_t); i++)
        foo[i] = 0;
    // fixes bug pointed out by tristopia
    bar = (char*)buff + size - size % sizeof(size_t);
    for (i = 0; i < size % sizeof(size_t); i++)
        bar[i] = 0;
}
int main()
{
    char* buffer = malloc(SIZE);
    TIME(memset(buffer, 0, SIZE););
    TIME(zero_1(buffer, SIZE););
    TIME(zero_sizet(buffer, SIZE););
    return 0;
}
zero_1 is the slowest, except for -O3. zero_sizet is the fastest, with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow for -O3). One thing of interest is that at -O3 zero_1 was equally fast as zero_sizet; however, the disassembled function had roughly four times as many instructions (I think caused by loop unrolling). Also, I tried optimizing zero_sizet further, but the compiler always outdid me, which is no surprise.
For now memset wins, previous results were distorted by CPU cache. (all tests were run on Linux) Further testing needed. I'll try assembler next :)
edit3: fixed bug in test code, test results are not affected
edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has a SSE optimized routine for zero. It will be hard to beat this.
x86 is a rather broad range of devices.
For a totally generic x86 target, an assembly block with "rep stosd" could blast out zeros to memory 32 bits at a time. Try to make sure the bulk of this work is DWORD aligned.
For chips with MMX, an assembly loop with movq could hit 64 bits at a time.
You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or __m64. The target must be 8-byte aligned for the best performance.
For chips with SSE, movaps is fast, but only if the address is 16-byte aligned, so use stosb until aligned, and then complete your clear with a loop of movaps.
Win32 has "ZeroMemory()", but I forget if that's a macro to memset, or an actual 'good' implementation.
memset is generally designed to be very very fast general-purpose setting/zeroing code. It handles all cases with different sizes and alignments, which affect the kinds of instructions you can use to do your work. Depending on what system you're on (and what vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture to take advantage of whatever its native properties are. It might also have internal special cases to handle the case of zeroing (versus setting some other value).
That said, if you have very specific, very performance critical memory zeroing to do, it's certainly possible that you could beat a specific memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming. :)
Nowadays your compiler should do all the work for you. At least as far as I know, gcc is very efficient at optimizing calls to memset away (better check the assembler, though).
Then also, avoid memset if you don't have to:
use calloc for heap memory
use proper initialization (... = { 0 }) for stack memory
And for really large chunks use mmap if you have it. This just gets zero initialized memory from the system "for free".
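The first two suggestions can be sketched in a few lines. The struct and function names below are hypothetical and exist only to show the two forms of zero-initialization.

```c
#include <stdlib.h>

/* Hypothetical record type, used only for illustration. */
struct sample {
    int    id;
    double weight;
    char   name[16];
};

/* Heap: calloc returns memory with all bytes zero, so no memset is needed.
 * The caller owns the result and must free() it; NULL means failure. */
static int *make_zeroed_ints(size_t count)
{
    return calloc(count, sizeof(int));
}

/* Stack: "= {0}" zero-initializes every member of the aggregate. */
static struct sample make_empty_sample(void)
{
    struct sample s = {0};
    return s;
}
```

In both cases the zeroing is expressed as part of obtaining the object, which leaves the compiler and the allocator free to choose how (or whether) to write the zeros.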
If I remember correctly (from a couple of years ago), one of the senior developers was talking about a fast way to bzero() on PowerPC (specs said we needed to zero almost all the memory on power up). It might not translate well (if at all) to x86, but it could be worth exploring.
The idea was to load a data cache line, clear that data cache line, and then write the cleared data cache line back to memory.
For what it is worth, I hope it helps.
Unless you have specific needs or know that your compiler/stdlib is sucky, stick with memset. It's general-purpose, and should have decent performance in general. Also, compilers might have an easier time optimizing/inlining memset() because it can have intrinsic support for it.
For instance, Visual C++ will often generate inline versions of memcpy/memset that are as small as a call to the library function, thus avoiding push/call/ret overhead. And there's further possible optimizations when the size parameter can be evaluated at compile-time.
That said, if you have specific needs (where size will always be tiny *or* huge), you can gain speed boosts by dropping down to assembly level. For instance, using write-through operations for zeroing huge chunks of memory without polluting your L2 cache.
But it all depends - and for normal stuff, please stick to memset/memcpy :)
The memset function is designed to be flexible and simple, even at the expense of speed. In many implementations, it is a simple while loop that copies the specified value one byte at a time over the given number of bytes. If you are wanting a faster memset (or memcpy, memmove, etc), it is almost always possible to code one up yourself.
The simplest customization would be to do single-byte "set" operations until the destination address is 32- or 64-bit aligned (whatever matches your chip's architecture) and then start copying a full CPU register at a time. You may have to do a couple of single-byte "set" operations at the end if your range doesn't end on an aligned address.
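A minimal sketch of that head/body/tail scheme, restricted to zeroing and to 64-bit words, could look like this. The name zero_words is made up, and the word stores go through a fixed-size memcpy rather than a cast pointer, which is an assumption of this sketch to stay clear of aliasing rules; whether it beats the library memset on a given system is something only measurement can tell.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Zero `size` bytes at `buff`: single bytes until the address is 8-byte
 * aligned, then 8 bytes at a time, then single bytes for the tail. */
static void zero_words(void *buff, size_t size)
{
    unsigned char *p = buff;
    const uint64_t zero = 0;

    /* Head: byte stores until p is aligned (or the range is exhausted). */
    while (size > 0 && ((uintptr_t)p % sizeof(uint64_t)) != 0) {
        *p++ = 0;
        size--;
    }
    /* Body: whole 64-bit words. A memcpy of a fixed 8 bytes avoids aliasing
     * problems; compilers typically lower it to a single store. */
    while (size >= sizeof(uint64_t)) {
        memcpy(p, &zero, sizeof zero);
        p += sizeof zero;
        size -= sizeof zero;
    }
    /* Tail: remaining bytes that do not fill a word. */
    while (size > 0) {
        *p++ = 0;
        size--;
    }
}
```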
Depending on your particular CPU, you might also have some streaming SIMD instructions that can help you out. These will typically work better on aligned addresses, so the above technique for using aligned addresses can be useful here as well.
For zeroing out large sections of memory, you may also see a speed boost by splitting the range into sections and processing each section in parallel (where the number of sections is the same as your number of cores/hardware threads).
Most importantly, there's no way to tell if any of this will help unless you try it. At a minimum, take a look at what your compiler emits for each case. See what other compilers emit for their standard 'memset' as well (their implementation might be more efficient than your compiler's).
There is one fatal flaw in this otherwise great and helpful test:
As memset is the first instruction, there seems to be some "memory overhead" or so which makes it extremely slow.
Moving the timing of memset to second place and something else to first place or simply timing memset twice makes memset the fastest with all compile switches!!!
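One way to keep that first-use cost out of the measurement is to touch the buffer once before starting the clock. The sketch below assumes the extra cost comes from the first touch of freshly allocated memory (the question's later edit blames the CPU cache instead); the function name is invented and clock() is used only because it is portable.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Time one memset over `size` bytes, in milliseconds of CPU time.
 * An untimed warm-up pass writes the whole buffer first, so the timed
 * pass does not pay for the first touch of the memory.
 * Caveat: an optimizer may drop stores whose results are never read,
 * so the caller should read the buffer after this returns. */
double time_memset_ms(void *buffer, size_t size)
{
    memset(buffer, 1, size);            /* warm-up, not timed */

    clock_t start = clock();
    memset(buffer, 0, size);            /* timed pass */
    clock_t stop = clock();

    return (double)(stop - start) * 1000.0 / CLOCKS_PER_SEC;
}
```

Applying the same warm-up to every candidate, and running each one several times, makes the order of the tests matter much less.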
That's an interesting question. I made this implementation that is just slightly faster (but hardly measurable) when 32-bit release compiling on VC++ 2012. It probably can be improved on a lot. Adding this in your own class in a multithreaded environment would probably give you even more performance gains since there are some reported bottleneck problems with memset() in multithreaded scenarios.
// MemsetSpeedTest.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include <iostream>
#include "Windows.h"
#include <time.h>
#pragma comment(lib, "Winmm.lib")
using namespace std;
/** a signed 64-bit integer value type */
#define _INT64 __int64
/** a signed 32-bit integer value type */
#define _INT32 __int32
/** a signed 16-bit integer value type */
#define _INT16 __int16
/** a signed 8-bit integer value type */
#define _INT8 __int8
/** an unsigned 64-bit integer value type */
#define _UINT64 unsigned _INT64
/** an unsigned 32-bit integer value type */
#define _UINT32 unsigned _INT32
/** an unsigned 16-bit integer value type */
#define _UINT16 unsigned _INT16
/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
/** maximum allowed value in an unsigned 64-bit integer value type */
#define _UINT64_MAX 18446744073709551615ULL
#ifdef _WIN32
/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);
/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock */
#define TIMER_INIT clock_t start;double diff;
/** Use to start the performance timer */
#define TIMER_START start=clock();
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif
void *MemSet(void *dest, _UINT8 c, size_t count)
{
    size_t blockIdx;
    size_t blocks = count >> 3;
    size_t bytesLeft = count - (blocks << 3);
    // Replicate the byte value into all eight bytes of a 64-bit word.
    _UINT64 cUll =
        ((_UINT64)c)
        | (((_UINT64)c) << 8 )
        | (((_UINT64)c) << 16 )
        | (((_UINT64)c) << 24 )
        | (((_UINT64)c) << 32 )
        | (((_UINT64)c) << 40 )
        | (((_UINT64)c) << 48 )
        | (((_UINT64)c) << 56 );
    _UINT64 *destPtr8 = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = cUll;
    if (!bytesLeft) return dest;
    blocks = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);
    _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = (_UINT32)cUll;
    if (!bytesLeft) return dest;
    blocks = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);
    _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = (_UINT16)cUll;
    if (!bytesLeft) return dest;
    _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = (_UINT8)cUll;
    return dest;
}
int _tmain(int argc, _TCHAR* argv[])
{
    TIMER_INIT
    const size_t n = 10000000;
    const _UINT64 m = _UINT64_MAX;
    const _UINT64 o = 1;
    char test[n];
    {
        cout << "memset()" << endl;
        TIMER_START
        for (int i = 0; i < m ; i++)
            for (int j = 0; j < o ; j++)
                memset((void*)test, 0, n);
        TIMER_STOP
    }
    {
        cout << "MemSet() took:" << endl;
        TIMER_START
        for (int i = 0; i < m ; i++)
            for (int j = 0; j < o ; j++)
                MemSet((void*)test, 0, n);
        TIMER_STOP
    }
    cout << "Done" << endl;
    int wait;
    cin >> wait;
    return 0;
}
Output is as follows when release compiling for 32-bit systems:
memset() took:
MemSet() took:
Output is as follows when release compiling for 64-bit systems:
memset() took:
MemSet() took:
Here you can find the source code of Berkeley's memset(), which I think is the most common implementation.
memset could be inlined by the compiler as a series of efficient opcodes, unrolled for a few cycles. For very large memory blocks, like a 4000x2000 64-bit framebuffer, you can try optimizing it across several threads (which you prepare for that sole task), each setting its own part. Note that there is also bzero(), but it is more obscure, and less likely to be as optimized as memset, and the compiler will surely notice that you pass 0.
What the compiler usually assumes is that you memset large blocks, so for smaller blocks it would likely be more efficient to just do *(uint64_t*)p = 0, if you initialize a large number of small objects.
Generally, all x86 CPUs are different (unless you compile for some standardized platform), and something you optimize for Pentium 2 will behave differently on Core Duo or i486. So if you're really into it and want to squeeze out the last few bits of toothpaste, it makes sense to ship several versions of your exe compiled and optimized for different popular CPU models. From personal experience, Clang -march=native boosted my game's FPS from 60 to 65, compared to no -march.
