Vectorize equality test without SIMD - c

I would like to vectorize an equality test in which all elements in a vector are compared against the same value, and the results are written to an array of 8-bit words. Each 8-bit word in the resulting array should be zero or one. (This is a little wasteful, but bit packing the booleans is not an important detail in this problem.) This function can be written as:
#include <stdint.h>
void vecEq (uint8_t* numbers, uint8_t* results, int len, uint8_t target) {
    for (int i = 0; i < len; i++) {
        results[i] = numbers[i] == target;
    }
}
If we knew that both vectors were 256-bit aligned, we could start by broadcasting target into an AVX register and then using SIMD's _mm256_cmpeq_epi8 to perform 32 equality tests at a time. However, in the setting I'm working in, both numbers and results have been allocated by a runtime (the GHC runtime, but this is irrelevant). They are both guaranteed to be 64-bit aligned. Is there any way to vectorize this operation, preferably without using AVX registers?
The approach I've considered is broadcasting the 8-bit word to a 64-bit word up front and then XORing it with 8 elements at a time. This doesn't work, though, because I cannot find a vectorized way to convert the result of the XOR (zero means equal, anything else means unequal) to the equality-test result I need (0 means unequal, 1 means equal, nothing else should ever exist). Roughly, the sketch I have is:
void vecEq (uint64_t* numbers, uint64_t* results, int len, uint8_t target) {
    uint64_t targetA = (uint64_t)target;
    uint64_t targetB = targetA<<56 | targetA<<48 | targetA<<40 | targetA<<32 | targetA<<24 | targetA<<16 | targetA<<8 | targetA;
    for (int i = 0; i < len; i++) {          // len counts 64-bit words here
        uint64_t tmp = numbers[i] ^ targetB; // a zero byte means that byte matched
        results[i] = ... something with tmp ...;
    }
}
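One possible way to fill in that last line with plain 64-bit operations is a SWAR trick: OR each byte's bits down into that byte's lowest bit, mask, and flip. A sketch (vecEqSwar is just an illustrative name; the multiply is the same byte broadcast as the shift/OR chain above):
#include <stdint.h>

void vecEqSwar(const uint64_t* numbers, uint64_t* results, int len, uint8_t target) {
    const uint64_t ones = 0x0101010101010101ULL;
    uint64_t targetB = (uint64_t)target * ones;  // broadcast the byte
    for (int i = 0; i < len; i++) {              // len counts 64-bit words here
        uint64_t tmp = numbers[i] ^ targetB;     // 0x00 in each equal byte
        tmp |= tmp >> 4;                         // OR each byte's bits down...
        tmp |= tmp >> 2;
        tmp |= tmp >> 1;                         // ...into that byte's bit 0
        uint64_t nonzero = tmp & ones;           // 0x01 where bytes differed
        results[i] = nonzero ^ ones;             // 0x01 where bytes matched
    }
}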

Further to the comments above (the code will vectorise just fine). If you are using AVX, the best strategy is usually just to use unaligned load/store intrinsics. They have no extra cost if your data does happen to be aligned, and are as cheap as the HW can make them for cases of misalignment. (On Intel CPUs, there's still a penalty for loads/stores that span two cache lines, aka a cache-line split).
Ideally you can still align your buffers by 32, but if your data has to come from L2 or L3 or RAM, misalignment often doesn't make a measurable difference. And the best strategy for dealing with possible misalignment is usually just to let the HW handle it, instead of scalar up to an alignment boundary or something like you'd do with SSE, or with AVX512 where alignment matters again (any misalignment leads to every load/store being a cache-line split).
Just use _mm256_loadu_si256 / _mm256_storeu_si256 and forget about it.
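For the equality test itself that could look something like the sketch below: the compare gives 0xFF per matching byte, so AND with 1 to get the 0/1 bytes the question wants, and a scalar loop handles any tail (vecEq_avx2 is just an illustrative name):
#include <immintrin.h>
#include <stdint.h>

void vecEq_avx2(const uint8_t *numbers, uint8_t *results, int len, uint8_t target)
{
    __m256i t   = _mm256_set1_epi8((char)target);
    __m256i one = _mm256_set1_epi8(1);
    int i = 0;
    for (; i + 32 <= len; i += 32) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(numbers + i));
        __m256i eq = _mm256_cmpeq_epi8(v, t);           // 0xFF where equal
        _mm256_storeu_si256((__m256i *)(results + i),
                            _mm256_and_si256(eq, one)); // 0x01 where equal
    }
    for (; i < len; i++)                                // scalar tail
        results[i] = numbers[i] == target;
}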
As an interesting aside, Visual C++ will no longer emit aligned loads or stores, even if you request them.
https://godbolt.org/z/pL9nw9 (e.g. vmovups instead of vmovaps)
If compiling with GCC, you probably want to use -march=haswell or -march=znver1 not just -mavx2, or at least -mno-avx256-split-unaligned-load and -mno-avx256-split-unaligned-store so 256-bit unaligned loads compile to single instructions. The CPUs that benefit from those tune=generic defaults don't support AVX2, for example Sandybridge and Piledriver.

Related

Can I use SIMD to bucket sort / categorize?

I'm curious about SIMD and wondering if it can handle this use case.
Let's say I have an array of 2048 integers, like
[0x018A, 0x004B, 0x01C0, 0x0234, 0x0098, 0x0343, 0x0222, 0x0301, 0x0398, 0x0087, 0x0167, 0x0389, 0x03F2, 0x0034, 0x0345, ...]
Note how they all start with either 0x00, 0x01, 0x02, or 0x03. I want to split them into 4 arrays:
One for all the integers starting with 0x00
One for all the integers starting with 0x01
One for all the integers starting with 0x02
One for all the integers starting with 0x03
I imagine I would have code like this:
#include <stdint.h>

int main() {
    uint16_t in[2048] = ...;
    // 4 arrays, one for each category
    uint16_t out[4][2048];
    // Pointers to the next available slot in each of the arrays
    uint16_t *nextOut[4] = { out[0], out[1], out[2], out[3] };
    for (uint16_t *nextIn = in; nextIn < in + 2048; nextIn += 4) {
        (*** magic simd instructions here ***)
        // Equivalent non-simd code:
        uint16_t categories[4];
        for (int i = 0; i < 4; i++) {
            categories[i] = (nextIn[i] & 0xFF00) >> 8;  // 0..3
        }
        for (int i = 0; i < 4; i++) {
            uint16_t category = categories[i];
            *nextOut[category] = nextIn[i];
            nextOut[category]++;
        }
    }
    // Now I have my categorized arrays!
}
I imagine that my first inner loop doesn't need SIMD, it can be just a (x & 0xFF00FF00FF00FF00) instruction, but I wonder if we can make that second inner loop into a SIMD instruction.
Is there any sort of SIMD instruction for this "categorizing" action that I'm doing?
The "insert" instructions seem somewhat promising, but I'm a bit too green to understand the descriptions at https://software.intel.com/en-us/node/695331.
If not, does anything come close?
Thanks!
You can do it with SIMD, but how fast it is will depend on exactly what instruction sets you have available, and how clever you are in your implementation.
One approach is to take the array and "sift" it to separate out elements that belong in different buckets. For example, grab 32 bytes from your array, which will have 16 16-bit elements. Use some cmpgt instructions to get a mask which determines whether each element falls into the 00 + 01 bucket or the 02 + 03 bucket. Then use some kind of "compress" or "filter" operation to move all masked elements contiguously into one end of a register, and then do the same for the unmasked elements.
Then repeat this one more time to sort out 00 from 01 and 02 from 03.
With AVX2 you could start with this question for inspiration on the "compress" operation. With AVX512 you could use the vcompress instruction to help out: it does exactly this operation but only at 32-bit or 64-bit granularity so you'd need to do a couple at least per vector.
You could also try a vertical approach, where you load N vectors and then swap elements between them so that the 0th vector has the smallest elements, etc. At this point, you can use a more optimized algorithm for the compress stage (e.g., if you vertically sort enough vectors, the vectors at the ends may be entirely one category, all starting with 0x00, and so on).
Finally, you might also consider organizing your data differently, either at the source or as a pre-processing step: separating out the "category" byte which is always 0-3 from the payload byte. Many of the processing steps only need to happen on one or the other, so you can potentially increase efficiency by splitting them out that way. For example, you could do the comparison operation on 32 bytes that are all categories, and then do the compress operation on the 32 payload bytes (at least in the final step where each category is unique).
This would lead to arrays of byte elements, not 16-bit elements, where the "category" byte is implicit. You've cut your data size in half, which might speed up everything else you want to do with the data in the future.
If you can't produce the source data in this format, you could use the bucketing as an opportunity to remove the tag byte as you put the payload into the right bucket, so the output is uint8_t out[4][2048];. If you're doing a SIMD left-pack with a pshufb byte-shuffle as discussed in comments, you could choose a shuffle control vector that packs only the payload bytes into the low half.
(Until AVX512BW, x86 SIMD doesn't have any variable-control word shuffles, only byte or dword, so you already needed a byte shuffle which can just as easily separate payloads from tags at the same time as packing payload bytes to the bottom.)
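To make the pshufb left-pack concrete, here is a rough sketch of the simpler one-compare-per-bucket variant rather than the two-level sift (SSSE3 assumed; packTable, buildPackTable, leftPackStore and bucket8 are illustrative names, not library functions). buildPackTable must run once up front, and each store writes a full 16 bytes, so the per-bucket output buffers need a little slack past the end:
#include <immintrin.h>
#include <stdint.h>

static uint8_t packTable[256][16];   // byte-shuffle controls, built once

static void buildPackTable(void)
{
    for (int m = 0; m < 256; m++) {
        int out = 0;
        for (int lane = 0; lane < 8; lane++) {
            if (m & (1 << lane)) {   // keep this 16-bit lane
                packTable[m][2 * out]     = (uint8_t)(2 * lane);
                packTable[m][2 * out + 1] = (uint8_t)(2 * lane + 1);
                out++;
            }
        }
        for (int j = 2 * out; j < 16; j++)
            packTable[m][j] = 0x80;  // zero the unused tail of the register
    }
}

// Append the lanes of v selected by mask8 to *dst, return the advanced pointer.
static uint16_t *leftPackStore(uint16_t *dst, __m128i v, unsigned mask8)
{
    __m128i ctrl   = _mm_loadu_si128((const __m128i *)packTable[mask8]);
    __m128i packed = _mm_shuffle_epi8(v, ctrl);
    _mm_storeu_si128((__m128i *)dst, packed);   // writes 16 bytes; keep slack
    return dst + __builtin_popcount(mask8);     // GCC/Clang builtin
}

// Categorize 8 elements at a time: one compare per bucket.
static void bucket8(const uint16_t *in, uint16_t *nextOut[4])
{
    __m128i v  = _mm_loadu_si128((const __m128i *)in);
    __m128i hi = _mm_and_si128(v, _mm_set1_epi16((short)0xFF00));
    for (int b = 0; b < 4; b++) {
        __m128i eq = _mm_cmpeq_epi16(hi, _mm_set1_epi16((short)(b << 8)));
        // pack the 16-bit masks to bytes so movemask yields one bit per lane
        unsigned mask8 = (unsigned)_mm_movemask_epi8(
                             _mm_packs_epi16(eq, _mm_setzero_si128()));
        nextOut[b] = leftPackStore(nextOut[b], v, mask8);
    }
}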

Copy from one memory to another skipping constant bytes in C

I am working on an embedded system application. I want to copy from source to destination, skipping a constant number of bytes. For example: source[6] = {0,1,2,3,4,5} and I want the destination to be {0,2,4}, skipping one byte each time. Unfortunately memcpy cannot fulfil this requirement. How can I achieve this in C without using a loop? I have a large amount of data to process and the loop adds too much time overhead.
My current implementation is something like this, which takes up to 5-6 milliseconds to copy 1500 bytes:
unsigned int len_actual = 1500;
/* Fill in the SPI DMA buffer. */
while (len_actual-- != 0)
{
    *(tgt_handle->spi_tx_buff++) = ((*write_irp->buffer++)) | (2 << 16) | DSPI_PUSHR_CONT;
}
You could write a "cherry picker" function
void * memcpk(void * destination, const void * source,
              size_t num, size_t size,
              int (*test)(const void * item));
which copies at most num "objects", each having size size, from source to destination. Only the objects that satisfy the test are copied.
Then with
int oddp(const void * intptr) { return (*((int *)intptr))%2; }
int evenp(const void * intptr) { return !oddp(intptr); }
you could do
int destination[6];
memcpk(destination, source, 6, sizeof(int), evenp);
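A minimal sketch of how such a memcpk could be implemented (the prototype is the one above; the body is one possible way to flesh it out):
#include <stddef.h>
#include <string.h>

void *memcpk(void *destination, const void *source,
             size_t num, size_t size,
             int (*test)(const void *item))
{
    char *dst = destination;
    const char *src = source;
    for (size_t i = 0; i < num; i++) {
        const void *item = src + i * size;
        if (test(item)) {            /* copy only objects that pass the test */
            memcpy(dst, item, size);
            dst += size;
        }
    }
    return destination;
}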
Almost all CPUs have caches, which means that (e.g.) when you modify one byte the CPU fetches an entire cache line from RAM, modifies the byte in the cache, then writes the entire cache line back to RAM. By skipping small pieces you add overhead (more instructions for the CPU to care about) and won't reduce the amount of data transferred between cache and RAM.
Also, typically memcpy() is optimised to copy larger pieces. For example, if you copy an array of bytes but the CPU is capable of copying 32-bits (4 bytes) at once, then memcpy() will probably do the majority of the copying as a loop with 4 bytes per iteration (to reduce the number of reads and writes and reduce the number of loop iterations).
In other words: code to avoid copying specific bytes will make it significantly slower than memcpy() for multiple reasons.
To avoid that, you really want to separate the data that needs to be copied from the data that doesn't - e.g. put everything that doesn't need to be copied at the end of the array and only copy the first part of the array (so that it remains "copy a contiguous area of bytes").
If you can't do that, the next alternative to consider would be masking. For example, if you have an array of bytes where some bytes shouldn't be copied, then you'd also have an array of "mask bytes" and do something like dest[i] = (dest[i] & mask[i]) | (src[i] & ~mask[i]); in a loop. This sounds horrible (and is horrible) until you optimise it by operating on larger pieces (e.g. if the CPU can copy 32-bit pieces, masking allows you to do 4 bytes per iteration by pretending all of the arrays are arrays of uint32_t). Note that for this technique wider is better - e.g. if the CPU supports operations on 256-bit pieces (AVX on 80x86) you'd be able to do 32 bytes per iteration of the loop. It also helps if you can make guarantees about the size and alignment (e.g. if the CPU can operate on 32 bits/4 bytes at a time, ensure that the size of the arrays is always a multiple of 4 bytes and that the arrays are always 4-byte aligned; even if it means adding unused padding at the end).
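A rough sketch of that masking loop on 32-bit pieces, assuming both arrays are 4-byte aligned and padded to a multiple of 4 bytes (masked_copy32 is an illustrative name):
#include <stdint.h>
#include <stddef.h>

void masked_copy32(uint8_t *dest, const uint8_t *src,
                   const uint8_t *mask, size_t size)
{
    uint32_t *d = (uint32_t *)dest;
    const uint32_t *s = (const uint32_t *)src;
    const uint32_t *m = (const uint32_t *)mask;
    for (size_t i = 0; i < size / 4; i++) {
        /* keep dest bytes where mask bits are 1, take src bytes where 0 */
        d[i] = (d[i] & m[i]) | (s[i] & ~m[i]);
    }
}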
Also note that depending on which CPU it actually is, there might be special support in the instruction set. For one example, modern 80x86 CPUs (that support SSE2) have a maskmovdqu instruction that is designed specifically for selectively writing some bytes but not others. In that case, you'd need to resort to intrinsics or inline assembly because "pure C" has no support for this type of thing (beyond bitwise operators).
Having overlooked your speed requirements:
You may try to find a way which solves the problem without copying at all.
Some ideas here:
If you want to iterate the destination array, you could define a kind of "picky iterator" for source that advances to the next element you allow: instead of iter++, do iter = advance_source(iter) (see the sketch below).
If you want to search the destination array then wrap a function around bsearch() that searches source and inspects the result. And so on.
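A tiny sketch of the "picky iterator" idea, assuming we keep every other byte as in the question's example (advance_source and use_byte are illustrative names):
#include <stddef.h>
#include <stdint.h>

void use_byte(uint8_t b);                       /* hypothetical consumer */

static const uint8_t *advance_source(const uint8_t *iter)
{
    return iter + 2;                            /* skip one constant byte */
}

void process_without_copy(const uint8_t *source, size_t len)
{
    for (const uint8_t *it = source; it < source + len; it = advance_source(it))
        use_byte(*it);
}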
Depending on your processor memory width, and number of internal registers, you might be able to speed this up by using shift operations.
You need to know if your processor is big-endian or little-endian.
Let's say you have a 32-bit processor and bus, and at least 4 spare registers that the compiler can use for optimisation. This means you can write 4 bytes into the same target word, having read 2 source words. Note that you are reading the bytes you are going to discard.
You can also improve the speed by making sure that everything is word aligned, and ignoring the gaps between the buffers, so not having to worry about the odd counts of bytes.
So, for little-endian:
inline unsigned long CopyEven(unsigned long a, unsigned long b)
{
    unsigned long c = a & 0xff;          // byte 0 of a -> byte 0 of c
    c |= (a >> 8)  & 0xff00;             // byte 2 of a -> byte 1 of c
    c |= (b << 16) & 0xff0000;           // byte 0 of b -> byte 2 of c
    c |= (b << 8)  & 0xff000000;         // byte 2 of b -> byte 3 of c
    return c;
}

unsigned long* d = (unsigned long*)dest;
unsigned long* s = (unsigned long*)source;
for (int count = 0; count < sourceLenBytes; count += 8)
{
    *d = CopyEven(s[0], s[1]);
    d++;
    s += 2;
}

Best way to store 256 bit AVX vectors into unsigned long integers

I was wondering what is the best way to store a 256-bit AVX vector into 4 64-bit unsigned long integers. According to the Intrinsics Guide at https://software.intel.com/sites/landingpage/IntrinsicsGuide/ I could only figure out how to do this with maskstore (code below). But is that the best way, or do other methods exist?
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    unsigned long long int i, j;
    unsigned long long int bit[32][4];     // 256-bit random numbers
    unsigned long long int bit_out[32][4]; // copies used to check the stores
    for (i = 0; i < 32; i++) {             // fill with 64-bit random integers
        for (j = 0; j < 4; j++) {
            bit[i][j] = rand();
            bit[i][j] = bit[i][j] << 32 | rand();
        }
    }
    //-------------------- load masking -------------------------
    __m256i v_bit[32];
    __m256i mask;
    unsigned long long int mask_ar[4];
    mask_ar[0] = ~(0ULL); mask_ar[1] = ~(0ULL); mask_ar[2] = ~(0ULL); mask_ar[3] = ~(0ULL);
    mask = _mm256_loadu_si256((__m256i const *)mask_ar);
    //-------------------- load masking ends ---------------------
    //-------------------- load the vectors ----------------------
    for (i = 0; i < 32; i++) {
        v_bit[i] = _mm256_loadu_si256((__m256i const *)bit[i]);
    }
    //-------------------- load the vectors ends -----------------
    //-------------------- extract from the vectors --------------
    for (i = 0; i < 32; i++) {
        _mm256_maskstore_epi64((long long *)bit_out[i], mask, v_bit[i]);
    }
    //-------------------- extract from the vectors ends ---------
    for (i = 0; i < 32; i++) {             // verify the round trip
        for (j = 0; j < 4; j++) {
            if (bit[i][j] != bit_out[i][j])
                printf("----ERROR----\n");
        }
    }
    return 0;
}
As others said in the comments, you do not need to use a mask store in this case. The following loop produces no error in your program:
for (i = 0; i < 32; i++) {
    _mm256_storeu_si256((__m256i *)bit_out[i], v_bit[i]);
}
So the instruction you are looking for is _mm256_storeu_si256: it stores a __m256i vector to an unaligned address. If your data is aligned you can use _mm256_store_si256 instead. To see your vectors' values you can use this function:
#include <stdalign.h>
alignas(32) unsigned long long int tempu64[4];

void printVecu64(__m256i vec)
{
    _mm256_store_si256((__m256i *)&tempu64[0], vec);
    printf("[0]= %llu, [1]=%llu, [2]=%llu, [3]=%llu \n\n",
           tempu64[0], tempu64[1], tempu64[2], tempu64[3]);
}
_mm256_maskstore_epi64 lets you choose which elements you store to memory. This instruction is useful when you want to store only some of a vector's elements and leave the other memory locations unchanged.
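For illustration, a small sketch of that (the lane values and the mask are made up):
#include <immintrin.h>

void maskstore_demo(void)
{
    long long out[4] = {9, 9, 9, 9};
    __m256i v    = _mm256_set_epi64x(40, 30, 20, 10);
    __m256i mask = _mm256_set_epi64x(0, -1, 0, -1); // MSB of a lane selects it
    _mm256_maskstore_epi64(out, mask, v);           // out is now {10, 9, 30, 9}
}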
I was reading the Intel 64 and IA-32 Architectures Optimization Reference Manual (248966-032, 2016), page 410, and interestingly found out that unaligned stores are still a performance killer:
11.6.3 Prefer Aligned Stores Over Aligned Loads
There are cases where it is possible to align only a subset of the
processed data buffers. In these cases, aligning data buffers used for
store operations usually yields better performance than aligning data
buffers used for load operations. Unaligned stores are likely to cause
greater performance degradation than unaligned loads, since there is a
very high penalty on stores to a split cache-line that crosses pages.
This penalty is estimated at 150 cycles. Loads that cross a page
boundary are executed at retirement. In Example 11-12, unaligned store
address can affect SAXPY performance for 3 unaligned addresses to
about one quarter of the aligned case.
I shared this here because some people said there was no difference between aligned/unaligned stores except when debugging!

optimized byte array shifter

I'm sure this has been asked before, but I need to implement a shift operator on a byte array of variable length. I've looked around a bit but I have not found any standard way of doing it. I came up with an implementation which works, but I'm not sure how efficient it is. Does anyone know of a standard way to shift an array, or at least have any recommendations on how to boost the performance of my implementation?
#include <string.h>
#include <stddef.h>

char* baLeftShift(const char* array, size_t size, signed int displacement, char* result)
{
    memcpy(result, array, size);
    short shiftBuffer = 0;
    char carryFlag = 0;
    char* byte;
    if (displacement > 0)
    {
        for (; displacement--;)
        {
            for (byte = &result[size - 1]; byte >= result; byte--)
            {
                shiftBuffer = (unsigned char)*byte;   /* avoid sign extension */
                shiftBuffer <<= 1;
                *byte = carryFlag | (char)shiftBuffer;
                carryFlag = ((char*)&shiftBuffer)[1]; /* bit shifted out the top */
            }
        }
    }
    else
    {
        char* end = result + size;
        displacement = -displacement;
        for (; displacement--;)
        {
            for (byte = result; byte < end; byte++)
            {
                shiftBuffer = (unsigned char)*byte;   /* avoid sign extension */
                shiftBuffer <<= 7;
                *byte = carryFlag | ((char*)&shiftBuffer)[1];
                carryFlag = (char)shiftBuffer;        /* bit shifted out the bottom */
            }
        }
    }
    return result;
}
If I can just add to what @dwelch is saying, you could try this.
Just move the bytes to their final locations. Then you are left with a shift count such as 3, for example, if each byte still needs to be left-shifted 3 bits into the next higher byte. (This assumes in your mind's eye the bytes are laid out in ascending order from right to left.)
Then rotate each byte to the left by 3. A lookup table might be faster than individually doing an actual rotate. Then, in each byte, the 3 bits to be shifted are now in the right-hand end of the byte.
Now make a mask M, which is (1<<3)-1, which is simply the low order 3 bits turned on.
Now, in order, from high order byte to low order byte, do this:
c[i] ^= M & (c[i] ^ c[i-1])
That will copy bits to c[i] from c[i-1] under the mask M.
For the last byte, just use a 0 in place of c[i-1].
For right shifts, same idea.
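A sketch of that recipe for a left shift by k bits (0 < k < 8), with byte 0 taken as the least significant byte as described above; rotl8 and baLeftShiftBits are illustrative names, and rotl8 could be replaced by the lookup table:
#include <stdint.h>
#include <stddef.h>

static inline uint8_t rotl8(uint8_t b, unsigned k)
{
    return (uint8_t)((b << k) | (b >> (8 - k)));
}

void baLeftShiftBits(uint8_t *c, size_t size, unsigned k)   /* 0 < k < 8 */
{
    if (size == 0) return;
    uint8_t m = (uint8_t)((1u << k) - 1);        /* low-order k bits */
    for (size_t i = 0; i < size; i++)
        c[i] = rotl8(c[i], k);                   /* carried-out bits now sit at the bottom */
    for (size_t i = size - 1; i > 0; i--)
        c[i] ^= m & (c[i] ^ c[i - 1]);           /* copy carry bits up from the byte below */
    c[0] &= (uint8_t)~m;                         /* zeros shift in at the low end */
}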
My first suggestion would be to eliminate the for loops around the displacement. You should be able to do the necessary shifts without the for(;displacement--;) loops. For displacements of magnitude greater than 7, things get a little trickier because your inner loop bounds will change and your source offset is no longer 1. i.e. your input buffer offset becomes magnitude / 8 and your shift becomes magnitude % 8.
It does look inefficient and perhaps this is what Nathan was referring to.
assuming a char is 8 bits where this code is running, there are two things to do. First move the whole bytes: for example, if your input array is 0x00,0x00,0x12,0x34 and you shift left 8 bits then you get 0x00 0x12 0x34 0x00, and there is no reason to do that in a loop 8 times one bit at a time. So start by shifting the whole chars in the array by (displacement>>3) locations, some sort of for(ra=(displacement>>3); ra<size; ra++) array[ra-(displacement>>3)] = array[ra];, and pad the holes created with zeros. Then finish with a single pass that shifts the array by the remaining (displacement&7) bits, combining each byte with the bits carried in from its neighbour (the shift amounts involved are displacement&7 and 7-(displacement&7)). A good compiler will precompute (displacement>>3), displacement&7 and 7-(displacement&7), and a good processor will have enough registers to keep all of those values. You might help the compiler by making separate variables for each of those items, but depending on the compiler and how you are using it, it could make it worse too.
The bottom line, though, is to time the code. Perform a thousand 1-bit shifts, then a thousand 2-bit shifts, etc., and time the whole thing; then try a different algorithm, time it the same way, and see if the optimizations make a difference for better or worse. If you know ahead of time that this code will only ever be used for single-bit or less-than-8-bit shifts, adjust the timing test accordingly.
Your use of the carry flag implies that you are aware that many processors have instructions specifically for chaining infinitely long shifts using the standard register length, one bit at a time: rotate through carry, basically. The C language does not support this directly, so for chaining single-bit shifts you could consider assembler and likely outperform the C code; at least the single-bit shifts are faster than C code can do them. A hybrid: move the bytes, then if the number of bits to shift (displacement&7) is less than, say, 4, use the assembler, else use a C loop. Again, the timing tests will tell you where the optimizations are.
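And a sketch of the move-whole-bytes-first approach for a left shift, using the layout from the example above (index 0 is the most significant byte); the name is illustrative and the carry here comes from the next byte shifted right by 8-(displacement&7):
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Left-shift `a` (index 0 = most significant byte) by `displacement` bits.
   Assumes 0 <= displacement < 8*size. */
void baLeftShiftFast(uint8_t *a, size_t size, unsigned displacement)
{
    size_t   byteShift = displacement >> 3;
    unsigned bitShift  = displacement & 7;

    /* Step 1: move whole bytes toward index 0 and zero-fill the vacated tail. */
    memmove(a, a + byteShift, size - byteShift);
    memset(a + size - byteShift, 0, byteShift);

    /* Step 2: shift the remaining 0..7 bits, pulling carry bits from the next byte. */
    if (bitShift) {
        for (size_t i = 0; i + 1 < size; i++)
            a[i] = (uint8_t)((a[i] << bitShift) | (a[i + 1] >> (8 - bitShift)));
        a[size - 1] = (uint8_t)(a[size - 1] << bitShift);
    }
}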

Faster way to zero memory than with memset?

I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?
I assume that memset uses mov; however, when zeroing memory most compilers use xor as it's faster, correct? edit1: Wrong, as GregS pointed out, that only works with registers. What was I thinking?
Also I asked a person who knows more about assembler than I do to look at the stdlib, and he told me that on x86 memset is not taking full advantage of the 32-bit wide registers. However, at that time I was very tired, so I'm not quite sure I understood it correctly.
edit2:
I revisited this issue and did a little testing. Here is what I tested:
#include <stdio.h>
#include <malloc.h>
#include <string.h>
#include <sys/time.h>

#define TIME(body) do { \
    struct timeval t1, t2; double elapsed; \
    gettimeofday(&t1, NULL); \
    body \
    gettimeofday(&t2, NULL); \
    elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
    printf("%s\n --- %f ---\n", #body, elapsed); } while(0)

#define SIZE 0x1000000

void zero_1(void* buff, size_t size)
{
    size_t i;
    char* foo = buff;
    for (i = 0; i < size; i++)
        foo[i] = 0;
}

/* I foolishly assume size_t has register width */
void zero_sizet(void* buff, size_t size)
{
    size_t i;
    char* bar;
    size_t* foo = buff;
    for (i = 0; i < size / sizeof(size_t); i++)
        foo[i] = 0;
    // fixes bug pointed out by tristopia
    bar = (char*)buff + size - size % sizeof(size_t);
    for (i = 0; i < size % sizeof(size_t); i++)
        bar[i] = 0;
}

int main()
{
    char* buffer = malloc(SIZE);
    TIME(
        memset(buffer, 0, SIZE);
    );
    TIME(
        zero_1(buffer, SIZE);
    );
    TIME(
        zero_sizet(buffer, SIZE);
    );
    return 0;
}
results:
zero_1 is the slowest, except at -O3. zero_sizet is the fastest, with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow at -O3). One thing of interest is that at -O3 zero_1 was equally fast as zero_sizet, yet the disassembled function had roughly four times as many instructions (I think caused by loop unrolling). Also, I tried optimizing zero_sizet further, but the compiler always outdid me, no surprise there.
For now memset wins; the previous results were distorted by the CPU cache. (All tests were run on Linux.) Further testing is needed. I'll try assembler next :)
edit3: fixed bug in test code, test results are not affected
edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has a SSE optimized routine for zero. It will be hard to beat this.
x86 is a rather broad range of devices.
For a totally generic x86 target, an assembly block with "rep stosd" could blast out zeros to memory 32 bits at a time (see the sketch after this list). Try to make sure the bulk of this work is DWORD aligned.
For chips with MMX, an assembly loop with movq could hit 64 bits at a time.
You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or __m64. The target must be 8-byte aligned for the best performance.
For chips with SSE, movaps is fast, but only if the address is 16-byte aligned, so use byte stores until aligned, and then complete your clear with a loop of movaps.
Win32 has ZeroMemory(), but I forget if that's a macro for memset or an actual 'good' implementation.
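A sketch of the rep stos idea with GCC/Clang inline assembly on x86/x86-64 (dst assumed suitably aligned and count given in dwords; zero_dwords is an illustrative name, not the only way to write it):
#include <stddef.h>

/* Zero `count` 32-bit dwords starting at dst using rep stosl. */
static inline void zero_dwords(void *dst, size_t count)
{
    __asm__ volatile ("rep stosl"
                      : "+D" (dst), "+c" (count)   /* rdi/edi = dest, rcx/ecx = count */
                      : "a" (0)                    /* eax = value to store            */
                      : "memory");
}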
memset is generally designed to be very very fast general-purpose setting/zeroing code. It handles all cases with different sizes and alignments, which affect the kinds of instructions you can use to do your work. Depending on what system you're on (and what vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture to take advantage of whatever its native properties are. It might also have internal special cases to handle the case of zeroing (versus setting some other value).
That said, if you have very specific, very performance critical memory zeroing to do, it's certainly possible that you could beat a specific memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming. :)
Nowadays your compiler should do all the work for you. From what I know, gcc is very efficient at optimizing calls to memset away (better check the assembler, though).
Then also, avoid memset if you don't have to:
use calloc for heap memory
use proper initialization (... = { 0 }) for stack memory
And for really large chunks use mmap if you have it. This just gets zero initialized memory from the system "for free".
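For example (a Linux-flavoured sketch; the variable names are made up, and anonymous mmap pages come back already zeroed):
#include <stdlib.h>
#include <sys/mman.h>

void zeroed_memory_examples(void)
{
    /* Heap: calloc returns zeroed memory, no explicit memset needed. */
    double *table = calloc(1024, sizeof *table);

    /* Stack: aggregate initialization zeroes the whole array. */
    int counts[256] = { 0 };

    /* Really large chunks: anonymous mmap hands back zero-filled pages. */
    size_t big = 64 * 1024 * 1024;
    void *buf = mmap(NULL, big, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    (void)table; (void)counts; (void)buf;   /* silence unused warnings */
}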
If I remember correctly (from a couple of years ago), one of the senior developers was talking about a fast way to bzero() on PowerPC (specs said we needed to zero almost all the memory on power up). It might not translate well (if at all) to x86, but it could be worth exploring.
The idea was to load a data cache line, clear that data cache line, and then write the cleared data cache line back to memory.
For what it is worth, I hope it helps.
Unless you have specific needs or know that your compiler/stdlib is sucky, stick with memset. It's general-purpose, and should have decent performance in general. Also, compilers might have an easier time optimizing/inlining memset() because it can have intrinsic support for it.
For instance, Visual C++ will often generate inline versions of memcpy/memset that are as small as a call to the library function, thus avoiding push/call/ret overhead. And there's further possible optimizations when the size parameter can be evaluated at compile-time.
That said, if you have specific needs (where size will always be tiny *or* huge), you can gain speed boosts by dropping down to assembly level. For instance, using write-through operations for zeroing huge chunks of memory without polluting your L2 cache.
But it all depends - and for normal stuff, please stick to memset/memcpy :)
The memset function is designed to be flexible and simple, even at the expense of speed. In many implementations, it is a simple while loop that copies the specified value one byte at a time over the given number of bytes. If you are wanting a faster memset (or memcpy, memmove, etc), it is almost always possible to code one up yourself.
The simplest customization would be to do single-byte "set" operations until the destination address is 32- or 64-bit aligned (whatever matches your chip's architecture) and then start copying a full CPU register at a time. You may have to do a couple of single-byte "set" operations at the end if your range doesn't end on an aligned address.
Depending on your particular CPU, you might also have some streaming SIMD instructions that can help you out. These will typically work better on aligned addresses, so the above technique for using aligned addresses can be useful here as well.
For zeroing out large sections of memory, you may also see a speed boost by splitting the range into sections and processing each section in parallel (where the number of sections is the same as your number of cores/hardware threads).
Most importantly, there's no way to tell if any of this will help unless you try it. At a minimum, take a look at what your compiler emits for each case. See what other compilers emit for their standard 'memset' as well (their implementation might be more efficient than your compiler's).
There is one fatal flaw in this otherwise great and helpful test:
As memset is the first thing timed, it pays the cost of faulting in the freshly allocated pages (the "memory overhead"), which makes it look extremely slow.
Moving the timing of memset to second place and something else to first place or simply timing memset twice makes memset the fastest with all compile switches!!!
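One possible tweak inside main(), reusing the test's own TIME macro (the warm-up value is arbitrary):
char* buffer = malloc(SIZE);
memset(buffer, 1, SIZE);          /* untimed warm-up: faults the pages in */
TIME(
    memset(buffer, 0, SIZE);      /* now measures memset, not page faults */
);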
That's an interesting question. I made this implementation that is just slightly faster (but hardly measurable) when 32-bit release compiling on VC++ 2012. It probably can be improved on a lot. Adding this in your own class in a multithreaded environment would probably give you even more performance gains since there are some reported bottleneck problems with memset() in multithreaded scenarios.
// MemsetSpeedTest.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include "Windows.h"
#include <time.h>
#pragma comment(lib, "Winmm.lib")
using namespace std;
/** a signed 64-bit integer value type */
#define _INT64 __int64
/** a signed 32-bit integer value type */
#define _INT32 __int32
/** a signed 16-bit integer value type */
#define _INT16 __int16
/** a signed 8-bit integer value type */
#define _INT8 __int8
/** an unsigned 64-bit integer value type */
#define _UINT64 unsigned _INT64
/** an unsigned 32-bit integer value type */
#define _UINT32 unsigned _INT32
/** an unsigned 16-bit integer value type */
#define _UINT16 unsigned _INT16
/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
/** maximum allowed value in an unsigned 64-bit integer value type */
#define _UINT64_MAX 18446744073709551615ULL
#ifdef _WIN32
/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);
/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock */
#define TIMER_INIT clock_t start;double diff;
/** Use to start the performance timer */
#define TIMER_START start=clock();
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif
void *MemSet(void *dest, _UINT8 c, size_t count)
{
size_t blockIdx;
size_t blocks = count >> 3;
size_t bytesLeft = count - (blocks << 3);
_UINT64 cUll =
c
| (((_UINT64)c) << 8 )
| (((_UINT64)c) << 16 )
| (((_UINT64)c) << 24 )
| (((_UINT64)c) << 32 )
| (((_UINT64)c) << 40 )
| (((_UINT64)c) << 48 )
| (((_UINT64)c) << 56 );
_UINT64 *destPtr8 = (_UINT64*)dest;
for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = cUll;
if (!bytesLeft) return dest;
blocks = bytesLeft >> 2;
bytesLeft = bytesLeft - (blocks << 2);
_UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = (_UINT32)cUll;
if (!bytesLeft) return dest;
blocks = bytesLeft >> 1;
bytesLeft = bytesLeft - (blocks << 1);
_UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = (_UINT16)cUll;
if (!bytesLeft) return dest;
_UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = (_UINT8)cUll;
return dest;
}
int _tmain(int argc, _TCHAR* argv[])
{
TIMER_INIT
const size_t n = 10000000;
const _UINT64 m = _UINT64_MAX;
const _UINT64 o = 1;
char test[n];
{
cout << "memset()" << endl;
TIMER_START;
for (int i = 0; i < m ; i++)
for (int j = 0; j < o ; j++)
memset((void*)test, 0, n);
TIMER_STOP;
}
{
cout << "MemSet() took:" << endl;
TIMER_START;
for (int i = 0; i < m ; i++)
for (int j = 0; j < o ; j++)
MemSet((void*)test, 0, n);
TIMER_STOP;
}
cout << "Done" << endl;
int wait;
cin >> wait;
return 0;
}
Output is as follows when release compiling for 32-bit systems:
memset() took:
5.569000
MemSet() took:
5.544000
Done
Output is as follows when release compiling for 64-bit systems:
memset() took:
2.781000
MemSet() took:
2.765000
Done
Here you can find the source code of Berkeley's memset(), which I think is the most common implementation.
memset could be inlined by the compiler as a series of efficient opcodes, unrolled for a few cycles. For very large memory blocks, like a 4000x2000 64-bit framebuffer, you can try optimizing it across several threads (which you prepare for that sole task), each setting its own part. Note that there is also bzero(), but it is more obscure and less likely to be as optimized as memset, and the compiler will surely notice you pass 0.
What the compiler usually assumes is that you memset large blocks, so for smaller blocks it would likely be more efficient to just do *(uint64_t*)p = 0, if you init a large number of small objects.
Generally, all x86 CPUs are different (unless you compile for some standardized platform), and something you optimize for a Pentium 2 will behave differently on a Core Duo or an i486. So if you are really into it and want to squeeze out the last few bits of toothpaste, it makes sense to ship several versions of your exe, compiled and optimized for different popular CPU models. From personal experience, Clang with -march=native boosted my game's FPS from 60 to 65, compared to no -march.
