I am trying to beat the memcpy function by writing NEON intrinsics for the same operation. Below is my logic:
uint8_t* m_input;  // size: 400 x 300
uint8_t* m_output; // size: 400 x 300
// not showing the complete code for memory creation
memcpy(m_output, m_input, sizeof(m_output[0]) * 300 * 400);
NEON:
int32_t ht_index, wd_index;
uint8x16_t vector8x16_image;

for (int32_t htI = 0; htI < m_roiHeight; htI++) {
    ht_index = htI * m_roiWidth;
    for (int32_t wdI = 0; wdI < m_roiWidth; wdI += 16) {
        wd_index = ht_index + wdI;
        vector8x16_image = vld1q_u8(&m_input[wd_index]);
        vst1q_u8(&m_output[wd_index], vector8x16_image);
    }
}
I verified these results multiple times on i.MX6 hardware.
Results:
memcpy: 0.039 ms
NEON memcpy: 0.02841 ms
I read somewhere that without preload (PLD) instructions we cannot beat memcpy.
If that is true, then how is my code giving these results?
Is it right or wrong?
If correctly written, a non-NEON memcpy() should be able to saturate the L3 bandwidth on your device, but for smaller transfers (fitting entirely within L1 or L2 cache) things can be different. Your test probably fits within L2 cache.
Unfortunately memcpy has to work for calls of any size, so it can't reasonably optimise for the in-cache and out-of-cache cases at the same time as optimising for very short copies, where the cost of working out which optimisation would be best becomes the dominant factor.
Even so, it's possible that your test isn't fair. You have to be sure that both implementations aren't subject to different cache preconditions or different virtual page layout.
Make sure neither test is run entirely before the other. Test some of one implementation, then test some of the other, then back to the first and back to the second a few times, to make sure they're not subject to any warm-up conditions. And use the same buffers for both to ensure that there's no characteristic of different parts of your virtual address space that harms one implementation only.
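For instance, a minimal sketch of an interleaved harness (time_one(), memcpy_copy and neon_copy are hypothetical placeholders, not code from the question):

double time_a[8], time_b[8];
for (int round = 0; round < 8; round++) {
    time_a[round] = time_one(memcpy_copy, m_output, m_input, 300 * 400);
    time_b[round] = time_one(neon_copy,   m_output, m_input, 300 * 400);
}
// discard the first round or two as warm-up, then compare the medians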
Also, there are cases your memcpy doesn't handle (for example, a width that isn't a multiple of 16), but these shouldn't matter much for large transfers.
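On the preload point: a hedged sketch of what adding preloads to the NEON loop might look like (__builtin_prefetch is the GCC builtin that emits PLD on ARM; the 256-byte distance is an assumption you'd have to tune, not a measured value):

for (int32_t wdI = 0; wdI < m_roiWidth; wdI += 16) {
    wd_index = ht_index + wdI;
    __builtin_prefetch(&m_input[wd_index + 256]);  // hint: pull a future line into cache
    vst1q_u8(&m_output[wd_index], vld1q_u8(&m_input[wd_index]));
}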
Related
I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
    doSomething()
    drawSomething()
    doSomething2()
    sendSomething()
}
The main loop runs more than 60 times per second.
I want to see the performance breakdown, how much time each procedure takes.
My concern is that if I print the time interval at every entry and exit of each procedure, it would incur huge performance overhead.
I am curious what the idiomatic way of measuring the performance is. Is printing or logging good enough?
Generally: for repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; it's easy to distort results unless you understand the implications of doing that. For very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program; e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
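A minimal sketch of that pattern in C (clock_gettime is POSIX; do_work() is a hypothetical stand-in for whichever procedure you're measuring):

#include <stdio.h>
#include <time.h>

#define N 1000
void do_work(void);            // hypothetical procedure under test
static long long elapsed_ns[N];

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

void measure(void) {
    for (int i = 0; i < N; i++) {
        long long t0 = now_ns();
        do_work();
        elapsed_ns[i] = now_ns() - t0;   // store now...
    }
    for (int i = 0; i < N; i++)
        printf("%lld\n", elapsed_ns[i]); // ...print only after the loop
}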
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs, and it runs the test for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
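For example, a hedged sketch of pre-faulting a buffer before the timed region (the 64 MiB size is arbitrary):

#include <stdlib.h>
#include <string.h>

char *make_warm_buffer(size_t len) {
    char *buf = malloc(len);
    memset(buf, 1, len);  // non-zero writes force real physical pages, defeating the shared zero page
    return buf;           // time your code on this buffer, not on a fresh allocation
}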
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, but its throughput when run repeatedly with independent inputs is higher. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).
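A hedged sketch of that difference (compile with optimization and actually use the return values, or both loops get optimized away; exact ratios are CPU-specific):

// Latency-bound: each add depends on the previous result.
double serial_sum(long n) {
    double x = 1.0;
    for (long i = 0; i < n; i++)
        x += 1e-9;          // one serial dependency chain
    return x;
}

// Throughput-bound: four independent chains can overlap in the pipeline,
// doing ~4x the work in roughly the same time on a typical modern core.
double parallel_sums(long n) {
    double a = 1.0, b = 1.0, c = 1.0, d = 1.0;
    for (long i = 0; i < n; i++) {
        a += 1e-9; b += 1e-9; c += 1e-9; d += 1e-9;
    }
    return a + b + c + d;   // use the results so the loops aren't dead code
}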
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
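A hedged sketch combining the two tricks above (compute() is a hypothetical function under test, not from the question):

extern double compute(int);   // hypothetical function under test

void bench(int argc) {
    double out[1024];
    for (int i = 0; i < 1024; i++)
        out[i] = compute(i + argc);          // argc: the compiler can't constant-fold this
    volatile double sink = out[argc % 1024]; // compiler must materialize the whole array
    (void)sink;
}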
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.
I am measuring the cycle count of different C functions which I am trying to make constant-time in order to mitigate side-channel attacks (crypto).
I am working with a microcontroller (an Aurix from Infineon) which has an onboard cycle counter that is incremented each clock tick and which I can read out.
Consider the following:
int result[32], cnt = 0;
int secret[32];
/** some other code ***/
reset_and_startCounter();           // resets cycles to 0 and starts the counter
int tmp = readCycles();             // read cycles before the function call
function(secret);                   // the function to measure; should be constant-time
result[cnt++] = readCycles() - tmp; // read cycles again and subtract to get the count
When I measure the cycles as shown above, I sometimes receive a different number of cycles depending on the input given to the function (~1-10 cycles difference; the function itself takes about 3000 cycles).
I wondered whether it was not yet perfectly constant-time, with the calculations depending on some input. So I looked into the function and did the following:
void function(int* input){
    reset_and_startCounter();
    int tmp = readCycles();
    /*********************************
     ***** calculations on input *****
     *********************************/
    result[cnt++] = readCycles() - tmp;
}
and I received the same number of cycles no matter what input was given.
I then also measured the time needed to call the function, and to return from it. Both measurements were the same no matter what the input.
I was compiling with the gcc flags -O3 and -fomit-frame-pointer: -O3 because the runtime is critical and I need it to be fast. Also important: no other code was running on the microcontroller (no OS etc.).
Does anyone have a possible explanation for this? I want to be sure that my code is constant-time, and that those extra cycles are just measurement artifacts...
And sorry for not providing runnable code here, but I believe not many people have an Aurix lying around :O
Thank you
The Infineon Aurix microcontroller you're using is designed for hard real-time applications. It has intentionally been designed to provide consistent runtime performance -- it lacks most of the features that can lead to inconsistent performance on more sophisticated CPUs, like cache memory or branch prediction.
While showing that your code has constant runtime on this part is a start, it is still possible for your code to have variable runtime when run on other CPUs. It is also possible that a device containing this CPU may leak information through other channels, particularly through power analysis. If making your application resistant to side-channel analysis is critical, you may want to consider using a part designed for cryptographic applications. (The Aurix is not such a part.)
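As an aside, a hedged sketch of the usual branchless idiom for keeping timing independent of secret data on CPUs that do have branch prediction (illustrative only, not code from the question):

#include <stdint.h>

// Returns a if bit == 1, b if bit == 0, with no secret-dependent branch.
uint32_t ct_select(uint32_t a, uint32_t b, uint32_t bit) {
    uint32_t mask = (uint32_t)0 - (bit & 1);  // 0x00000000 or 0xFFFFFFFF
    return (a & mask) | (b & ~mask);
}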
I wrote a particular function in a project using AVX2, AVX and SSE compiler intrinsics. I am aware of the penalty when the CPU transitions between AVX/AVX2 and SSE states, so I set the Enhanced Instruction Set option to AVX2 in the Visual Studio project settings.
In my code I repeatedly use some data in a for loop. The structure of my code is mostly like what is shown below:
// I gather the data that I am going to access again and again and put it
// into variables so that I use minimal array indexing
__m256 a = (code to get a)
__m256 b = (code to get b)
.........
for (int i = 0; i < large_number; i++)
{
    c = arrayofc[i];
    // operate on a and b and other variables gathered outside the loop
    d += result of operations;
}
The problem that I am facing is, this function performs really well but at certain runs of the program it slows down by a factor of 10 to 15, whereas other functions in the same program slow down by a factor of at most 2.
I used boost timers to measure performance, Visual Studio performance profiler and also GPU view. All indicate that at certain runs of my program this function performs horribly slow. My program does not give random results; every time it gives identical results.
GPUview did not show any other thread interfering with this function either.
For a while I thought that, given that I cache my variables and the loop body is just vectorized floating-point operations, Intel SpeedStep (which was turned on) was slowing down this function specifically, as this function is likely more CPU-bound than the other functions, which are probably more memory-bound. But my guess turned out to be wrong, as I tested with SpeedStep disabled and still had the same issue.
I used software prefetch to cache the variable I gather outside the loop too but without benefit.
I am still not sure whether it could be caused by virtualization. The task manager of the computer I am working on shows that CPU utilization is very small (1-5%). Memory utilization is around 40%, and sometimes disk utilization is around 100%.
Any help regarding this issue will be greatly appreciated.
@Paul R Sorry for the question being a bit vague.
@Rotem I think the reason is eviction from the cache.
@Harold It cannot be related to denormals: I get the same results every time, so if denormals were the cause they would slow the process down every time.
I probably found the answer to the question but I am not able to verify it right now. I will post the test results as soon as I can.
What I did in my function, to reduce array indexing and maximize the use of registers, was to put some data from arrays into variables.
For example, I want to access Darray[0], Darray[1] ... Darray[6];
At the start of the loop I used the code
__m256 D0 = Darray[0];
__m256 D1 = Darray[1];
and so on. Most of the time the compiler generates assembly in which such variables are kept in registers, but in this case the register pressure was too high and they were spilled to various memory locations instead. I printed out the address of D0 and the differences between it and the addresses of the other variables D1, D2, etc.
This is the result that I got (the first number is the address of D0 and the next ones are the offsets from it):
280449315232 -704 -768 -640 -736 -608
Even though I access the variables in my code sequentially, sometimes they are quite far apart.
This is the result for another array (this one is the most surprising):
280449314144 416 512
Another one:
812970176 128 192 256 224 160 1152
Thus when I access one variable, it is unlikely to bring any of the others into the cache. In one iteration of the loop I might bring all the variables into the cache, but some other program can evict them at any time. Had I used an array, then even if elements I fetched were evicted, accessing one element would bring its neighbours into the cache along with it.
I will use arrays again for most of the data and try to fit the rest in registers. I will benchmark and report my findings in this post.
Thank you.
Where do x86-64's SSE instructions (vector instructions) outperform the normal instructions? What I'm seeing is that the frequent loads and stores required for executing SSE instructions nullify any gain we get from the vector calculation. Could someone give me an example of SSE code that performs better than normal code?
Maybe it's because I am passing each parameter separately, like this...
__m128i a = _mm_set_epi32(pa[0], pa[1], pa[2], pa[3]);
__m128i b = _mm_set_epi32(pb[0], pb[1], pb[2], pb[3]);
__m128i res = _mm_add_epi32(a, b);
for (i = 0; i < 4; i++)
    po[i] = res.m128i_i32[i];
Isn't there a way I can pass all 4 integers in one go, I mean pass the whole 128 bits of pa at once? And assign res.m128i_i32 to po in one go?
Summarizing comments into an answer:
You have basically fallen into the same trap that catches most first-timers. Basically there are two problems in your example:
You are misusing _mm_set_epi32().
You have a very low computation/load-store ratio. (1 to 3 in your example)
_mm_set_epi32() is a very expensive intrinsic. Although it's convenient to use, it doesn't compile to a single instruction. Some compilers (such as VS2010) can generate very poor performing code when using _mm_set_epi32().
Instead, since you are loading contiguous blocks of memory, you should use _mm_load_si128(). That requires that the pointer is aligned to 16 bytes. If you can't guarantee this alignment, you can use _mm_loadu_si128() - but with a performance penalty. Ideally, you should properly align your data so that you don't need to resort to using _mm_loadu_si128().
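For example, a minimal sketch of the question's add using aligned loads and a store (assuming pa, pb and po point to 16-byte-aligned arrays of four ints):

#include <emmintrin.h>

__m128i a   = _mm_load_si128((const __m128i*)pa);  // all four ints in one load
__m128i b   = _mm_load_si128((const __m128i*)pb);
__m128i res = _mm_add_epi32(a, b);
_mm_store_si128((__m128i*)po, res);                // all four results in one store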
To be truly efficient with SSE, you'll also want to maximize your computation/load-store ratio. A target that I shoot for is 3 - 4 arithmetic instructions per memory access. This is a fairly high ratio. Typically you have to refactor the code or redesign the algorithm to increase it. Combining passes over the data is a common approach.
Loop unrolling is often necessary to maximize performance when you have large loop bodies with long dependency chains.
Some examples of SO questions that successfully use SSE to achieve speedup.
C code loop performance (non-vectorized)
C code loop performance [continued] (vectorized)
How do I achieve the theoretical maximum of 4 FLOPs per cycle? (contrived example for achieving peak processor performance)
I'm trying to write a crc32 implementation on linux that's as fast as possible, as an exercise in learning to optimise C. I've tried my best, but I haven't been able to find many good resources online. I'm not even sure if my buffer size is sensible; it was chosen by repeated experimentation.
#include <stdio.h>
#define BUFFSIZE 1048567
const unsigned long int lookupbase = 0xEDB88320;
unsigned long int crctable[256] = {
0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
/* LONG LIST OF PRECALCULATED VALUES */
0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D};
int main(int argc, char *argv[]){
    register unsigned long int x;
    int i;
    register unsigned char *c, *endbuff;
    unsigned char buff[BUFFSIZE];
    register FILE *thisfile = NULL;
    for (i = 1; i < argc; i++){
        thisfile = fopen(argv[i], "r");
        if (thisfile == NULL) {
            printf("Unable to open ");
        } else {
            x = 0xFFFFFFFF;
            c = &(buff[0]);
            endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
            while (c != endbuff){
                while (c != endbuff){
                    x = (x >> 8) ^ crctable[(x & 0xFF) ^ *c];
                    c++;
                }
                c = &(buff[0]);
                endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
            }
            fclose(thisfile);
            x = x ^ 0xFFFFFFFF;
            printf("%08lX ", x);
        }
        printf("%s\n", argv[i]);
    }
    return 0;
}
Thanks in advance for any suggestions or resources I can read through.
On Linux? Forget about the register keyword, that's just a suggestion to the compiler and, from my experience with gcc, it's a waste of space. gcc is more than capable of figuring that out for itself.
I would just make sure you're compiling with the insane optimisation level, -O3, and check that. I've seen gcc produce code at that level which took me hours to understand, so sneaky that it was.
And, on the buffer size, make it as large as you possibly can. Even with buffering, the cost of calling fread is still a cost, so the less you do it, the better. You would see a huge improvement if you increased the buffer size from 1K to 1M, not so much if you up it from 1M to 2M, but even a small amount of increased performance is an increase. And, 2M isn't the upper bound of what you can use, I'd set it to one or more gigabytes if possible.
You may then want to put it at file level (rather than inside main). At some point, the stack won't be able to hold it.
As with most optimisations, you can usually trade space for time. Keep in mind that, for small files (less than 1M), you won't see any improvement since there is still only one read no matter how big you make the buffer. You may even find a slight slowdown if the loading of the process has to take more time to set up memory.
But, since this would only be for small files (where the performance isn't a problem anyway), it shouldn't really matter. Large files, where the performance is an issue, should hopefully find an improvement.
And I know I don't need to tell you this (since you indicate you are doing it), but I will mention it anyway for those who don't know: Measure, don't guess! The ground is littered with the corpses of those who optimised with guesswork :-)
You've asked for several values to be stored in registers, but 32-bit x86 only has a handful of general-purpose registers to begin with: that's a lot of burden to place on the few that remain, which is one reason I expect register really only prevents you from ever using &foo to find the address of the variable. I don't think any modern compiler even uses it as a hint these days. Feel free to remove all the uses of register and re-time your application.
Since you're reading in huge chunks of the file yourself, you might as well use open(2) and read(2) directly, and remove all the standard IO handling behind the scenes. Another common approach is to open(2) and mmap(2) the file into memory: let the OS page it in as pages are required. This may allow future pages to be optimistically read from disk while you're doing your computation: this is a common access pattern, and one the OS designers have attempted to optimize. (The simple mechanism of mapping the entire file at once does put an upper limit on the size of the files you can handle, probably about 2.5 gigabytes on 32-bit platforms and absolutely huge on 64-bit platforms. Mapping the file in chunks will allow you to handle arbitrary sized files even on 32-bit platforms, but at the cost of loops like you've got now for reading, but for mapping.)
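A minimal sketch of the mmap approach, dropped into the question's main (POSIX calls; error handling omitted for brevity, and x / crctable / argv[i] come from the question's code):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int fd = open(argv[i], O_RDONLY);
struct stat st;
fstat(fd, &st);
unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
for (off_t k = 0; k < st.st_size; k++)
    x = (x >> 8) ^ crctable[(x & 0xFF) ^ data[k]];  // same inner loop, no fread
munmap(data, st.st_size);
close(fd);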
As David Gelhar points out, you're using an odd-length buffer -- this might complicate the code path of reading the file into memory. If you want to stick with reading from files into buffers, I suggest using a multiple of 8192 (two pages of memory), as it won't have special cases until the last loop.
If you're really into eking out the last bit of speed and don't mind drastically increasing the size of your pre-computation table, you can look at the file in 16-bit chunks rather than just 8-bit chunks. Frequently, accessing memory along 16-bit alignment is faster than along 8-bit alignment, and you'd cut the number of iterations through your loop in half, which usually gives a substantial speed boost. The downside, of course, is increased memory pressure: a 65,536-entry table rather than a 256-entry one (256 KiB rather than 1 KiB at 4 bytes per entry), and the much larger table is much less likely to fit entirely in the CPU cache.
And the last optimization idea that crosses my mind is to fork(2) into 2, 3, or 4 processes (or use threading), each of which can compute the crc32 of a portion of the file, and then combine the end results after all processes have completed. crc32 may not be computationally intensive enough to actually benefit from trying to use multiple cores out of SMP or multicore computers, and figuring out how to combine partial computations of crc32 may not be feasible -- I haven't looked into it myself :) -- but it might repay the effort, and learning how to write multi-process or multi-threaded software is well worth the effort regardless.
You are not going to be able to speed up the actual arithmetic of the CRC calculation, so the areas you can look at are the overhead of (a) reading the file, and (b) looping.
You're using a pretty large buffer size, which is good (but why is it an odd number?). Using a read(2) system call (assuming you're on a unix-like system) instead of the fread(3) standard library function may save you one copy operation (copying the data from fread's internal buffer into your buffer).
For the loop overhead, look into loop unrolling.
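For instance, a hedged sketch of unrolling the question's inner loop by four (note the serial CRC dependency chain through x limits the gain; mostly this just amortizes the loop-control overhead):

while (endbuff - c >= 4) {
    x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[0]];
    x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[1]];
    x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[2]];
    x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[3]];
    c += 4;
}
while (c != endbuff) {   // tail: handle the last 0-3 bytes
    x = (x >> 8) ^ crctable[(x & 0xFF) ^ *c];
    c++;
}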
Your code also has some redundancies that you might want to eliminate.
sizeof (unsigned char) is 1 (by definition in C); no need to explicitly compute it
c = &(buff[0]) is exactly equivalent to c = buff
Neither of these changes will improve the performance of the code (assuming a decent compiler), but they will make it clearer and more in accordance with usual C style.