I wrote a particular function in a project using AVX2, AVX and SSE compiler intrinsics. I am aware of the penalty when the CPU changes states between AVX/AVX2 and SSE modes so I set the Enhanced Instruction Set to AVX2 in Visual Studio project settings.
In my code I repeatedly use some data in a for loop. The structure of my code is mostly like what is shown below:
// I gather the data that I am going to access again and again and put it
// into variables so that I use minimal array indexing.
__m256 a = (code to get a);
__m256 b = (code to get b);
.........
for (int i = 0; i < largeNumber; i++)
{
    __m256 c = arrayofc[i];
    // operate with a and b and other variables gathered outside the loop
    d += result of operations;
}
The problem I am facing is that this function usually performs really well, but on certain runs of the program it slows down by a factor of 10 to 15, whereas other functions in the same program slow down by a factor of at most 2.
I used Boost timers to measure performance, the Visual Studio performance profiler, and also GPUView. All of them indicate that on certain runs of my program this function performs horribly slowly. My program does not give random results; every time it gives identical results.
GPUView did not show any other thread interfering with this function either.
At one point I thought that, since I cache my variables and the loop is nothing but vectorized floating-point operations, Intel SpeedStep (which was turned on) might be slowing down this function specifically, as it is likely more CPU-bound than the other functions, which are probably more memory-bound. But my guess turned out to be wrong: I tested with SpeedStep disabled and still had the same issue.
I also used software prefetching on the data I gather outside the loop, but without benefit.
I am still not sure whether it could be caused by virtualization or not. The Task Manager of the computer I am working on shows that CPU utilization is very small (1-5%), memory utilization is around 40%, and sometimes disk utilization is around 100%.
Any help regarding this issue will be greatly appreciated.
@Paul R: Sorry for the question being a bit vague.
@Rotem: I think the reason is cache eviction.
@Harold: It cannot be related to denormals. I get the same results every time, so if denormals were the cause they would slow the function down on every run, not just on some of them.
I probably found the answer to the question but I am not able to verify it right now. I will post the test results as soon as I can.
What I did in my function was this: to reduce array indexing and maximize the use of registers, I put some data from arrays into variables.
For example, I want to access Darray[0], Darray[1], ..., Darray[6].
At the start of the loop I used code like
__m256 D0 = Darray[0];
__m256 D1 = Darray[1];
and so on. Most of the time the compiler generates assembly in which these variables live in registers, but in this case the register pressure was too high, so they were not kept in registers and were instead spilled to various memory locations. I printed out the address of D0 and the differences in address to the other variables D1, D2, etc.
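Roughly like this (a minimal sketch; check_spill and its Darray argument are only for illustration, and taking the addresses itself forces the locals into memory, which is what had already happened here anyway):

#include <cstdio>
#include <immintrin.h>

// Print the address of D0 and the offsets of the other locals relative to it.
void check_spill(const __m256 *Darray)
{
    __m256 D0 = Darray[0], D1 = Darray[1], D2 = Darray[2];
    std::printf("%p %td %td\n", (void *)&D0,
                (char *)&D1 - (char *)&D0,
                (char *)&D2 - (char *)&D0);
}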
This is the result that I got (the first number is the address of D0 and the next ones are the offsets from it):
280449315232 -704 -768 -640 -736 -608
Even though I access the variables in my code sequentially, sometimes they are quite far apart.
This is the result for another array (this one is the most surprising):
280449314144 416 512
Another one:
812970176 128 192 256 224 160 1152
Thus when I access one variable it is unlikely that I will bring the next one into the cache. One iteration of the loop might bring all of the variables into the cache, but another program can evict them from the cache at any time. If I used an array, then even if the values I fetched into the cache got evicted, I would still bring in neighbouring elements whenever I access one of them.
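A rough sketch of the array-based layout I have in mind (process, Darray and arrayofc are placeholders, and the multiply-add just stands in for the real work):

#include <immintrin.h>

// Copy the hot values into one small contiguous block so any spill lands in a
// few adjacent cache lines instead of scattered stack slots.
__m256 process(const __m256 *Darray, const __m256 *arrayofc, int n)
{
    __m256 D[7];
    for (int k = 0; k < 7; ++k)
        D[k] = Darray[k];               // 224 contiguous bytes, about 4 cache lines
    __m256 d = _mm256_setzero_ps();
    for (int i = 0; i < n; ++i) {
        __m256 c = arrayofc[i];
        d = _mm256_add_ps(d, _mm256_mul_ps(c, D[0]));   // stand-in for the real work
    }
    return d;
}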
I will use arrays again for most of the data and try to fit the rest in registers. I will benchmark and report my findings in this post.
Thank you.
I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
    doSomething();
    drawSomething();
    doSomething2();
    sendSomething();
}
The main loop runs more than 60 times per second.
I want to see the performance breakdown, how much time each procedure takes.
My concern is that if I print the time interval at every entry and exit of each procedure, it will incur huge performance overhead.
I am curious what the idiomatic way of measuring the performance is.
Is printing or logging good enough?
Generally: for repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; it's easy to distort results unless you understand the implications of doing that. For very short things, throughput and latency are different, so measure them separately by making each iteration either use the result of the previous one or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program.
e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
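For example, a minimal sketch of that (the doSomething / drawSomething bodies are just placeholders for the question's procedures; only the record-then-print structure matters):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

void doSomething()   {}   // placeholders for the question's procedures
void drawSomething() {}

int main()
{
    using clock = std::chrono::steady_clock;
    std::vector<double> tDo, tDraw;              // one entry per frame
    for (int frame = 0; frame < 600; ++frame) {  // ~10 seconds at 60 Hz
        auto a = clock::now();
        doSomething();
        auto b = clock::now();
        drawSomething();
        auto c = clock::now();
        tDo.push_back(std::chrono::duration<double>(b - a).count());
        tDraw.push_back(std::chrono::duration<double>(c - b).count());
    }
    double sumDo = 0, sumDraw = 0;               // report after the loop, not inside it
    for (std::size_t i = 0; i < tDo.size(); ++i) { sumDo += tDo[i]; sumDraw += tDraw[i]; }
    std::printf("doSomething: %.3f ms/frame, drawSomething: %.3f ms/frame\n",
                1e3 * sumDo / tDo.size(), 1e3 * sumDraw / tDraw.size());
    return 0;
}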
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
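A minimal sketch of the workaround, assuming a plain heap allocation:

#include <cstddef>
#include <cstring>

// Write to a fresh allocation before the timed region so page faults and the
// copy-on-write zero page stay out of the measurement.
int main()
{
    const std::size_t n = 16 * 1024 * 1024;       // 64 MB of floats
    float *buf = new float[n];                    // fresh pages from the OS
    std::memset(buf, 1, n * sizeof(float));       // non-zero write faults in real pages
    // ... start timer, run the code under test on buf, stop timer ...
    delete[] buf;
    return 0;
}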
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has a latency from inputs to outputs, but its throughput when run repeatedly with independent inputs is higher. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).
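A sketch of the difference (latency_chain / throughput4 are made-up names; compile without -ffast-math so the compiler can't reassociate the chains):

// Dependent chain: each add waits for the previous result (measures latency).
double latency_chain(double x, double y, long n)
{
    for (long i = 0; i < n; ++i)
        x += y;
    return x;
}

// Independent accumulators: the adds can overlap (measures throughput).
double throughput4(double y, long n)
{
    double a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (long i = 0; i < n; i += 4) {
        a0 += y; a1 += y; a2 += y; a3 += y;
    }
    return a0 + a1 + a2 + a3;
}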
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
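For example, a minimal sketch combining both tricks (the multiply loop is just a stand-in for the work under test):

#include <cstdio>

// argc as an opaque input, a volatile read as the output sink.
int main(int argc, char **)
{
    double output[1024];
    double x = argc;                      // not a compile-time constant
    for (int i = 0; i < 1024; ++i)
        output[i] = x * i + 1.0;          // stand-in for the work under test
    volatile double sink = output[argc];  // forces the array to be materialized
    (void)sink;
    return 0;
}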
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.
I have 2 pieces of C++ code:
Code 1: fewer assignments to the variable (check first)
while (alive)
{
    if (health < healthMax) health = healthMax;
}
Code 2: no check, always assign the variable
while (alive)
{
    health = healthMax;
}
I don't know exactly how set and get work, but I personally think that a set will change/write the data in memory, while a get only reads memory, so it's best to prefer gets and reduce sets; that's why I prefer Code 1 for now. Am I thinking about it right?
Thank you for reading :)
No. The assignment will hopefully compile to a move between registers, which is cheaper than a conditional branch.
If health is a global, you might want to manually sink the store to the global out of the loop, but even a store on every iteration isn't too bad. Repeated stores to the same memory location are cheap, because they will hit in L1 cache. You can expect a throughput of ~1 per clock, without hogging memory bandwidth for other cores.
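A minimal sketch of manually sinking the store (globals named as in the question, loop body left as a placeholder):

// Keep the value in a local while the loop runs, store to the global once afterwards.
bool alive;
int health, healthMax;

void update()
{
    int h = healthMax;       // lives in a register inside the loop
    while (alive) {
        // ... per-iteration game logic that may change alive ...
    }
    health = h;              // single store, sunk out of the loop
}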
Since you tagged this as assembly, see the x86 tag wiki for links to performance details for that platform, especially Agner Fog's stuff. A lot of the concepts are similar for other architectures.
Is there a way, using C or assembler or maybe even C#, to get an accurate measure of how long it takes to execute an ADD instruction?
Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.
On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.
On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.
That said, yes, if you really want to, you can (usually -- depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless instead of completely meaningless isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can be executed out of order as described above. To get meaningful results, you need to surround it by an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions that's available to "user mode" (i.e., ring 3) programs. That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:
    .align 16
    CPUID
    CPUID
    CPUID
    RDTSC
    ; sequence under test
    ADD eax, ebx
    ; end of sequence under test
    CPUID
    RDTSC
Then you compare that to a result from doing the same thing, but with the sequence under test removed. That's leaving out quite a few details, of course -- at minimum you need to:
set the registers up correctly before each CPUID
save the value in EAX:EDX after the first RDTSC
subtract the result of the first RDTSC from the result of the second
Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.
Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.
Then execute the same loop again, this time with the code under test in the body. The time for this loop, minus the overhead (from the empty-loop case), is the time due to the 10 million repetitions of your code under test. So, divide by the number of iterations.
Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If it's a significant chunk of code, a few tens of thousands might suffice.
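A rough sketch of that subtract-the-empty-loop approach (the volatile loop counter is only there to stop the compiler from deleting the loops, and it adds the same overhead to both measurements):

#include <cstdio>
#include <ctime>

// Time an empty loop, then the same loop with the code under test, and subtract.
int main()
{
    const long ITERS = 10000000L;
    volatile int sink = 0;

    std::clock_t t0 = std::clock();
    for (volatile long i = 0; i < ITERS; i++) { /* empty */ }
    std::clock_t t1 = std::clock();

    std::clock_t t2 = std::clock();
    for (volatile long i = 0; i < ITERS; i++) { sink = sink + 1; }  // code under test
    std::clock_t t3 = std::clock();

    double overhead = double(t1 - t0) / CLOCKS_PER_SEC;
    double total    = double(t3 - t2) / CLOCKS_PER_SEC;
    std::printf("~%.3g seconds per iteration\n", (total - overhead) / ITERS);
    return 0;
}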
In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C if you are conversant with inline assembly. Others have posted more elegant solutions for how to get a measurement without the repetition, but the repetition technique is always available, for example on an embedded processor that doesn't have the nice timing instructions mentioned by others.
Note, however, that on modern pipelined processors, instruction-level parallelism may confound your results. Because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of a given instruction take N times as long as a single one.
Okay, the problem you are going to encounter is that on an OS like Windows, Linux, Unix, MacOS, AmigaOS and all the others, there are lots of processes already running in the background, and they will impact performance. The only real way of calculating the actual time of an instruction is to disassemble your motherboard and test each component using external hardware. It depends whether you absolutely want to do this yourself, or simply want to find out how fast a typical revision of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and those results are available to the public; all you need to do is ask and they'll send you a free CD-ROM (or maybe a DVD) with the results on it. You can do it yourself, but be warned that Intel processors in particular contain many redundant instructions that are no longer desirable, let alone necessary, so this will take up a lot of your time, though I can absolutely see the fun in doing it. PS: if it's purely to help push your own machine's hardware to its theoretical maximum in a personal project, then Just Jeff's answer above is excellent for generating tidy instruction-speed averages under real-world conditions.
No, but you can calculate it based upon the number of clock cycles the add instruction requires multiplied by the clock rate of the CPU. Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.
That said, why do you care?
I want to port a small piece of code to an ARM Cortex-A8 processor. Both the L1 cache and the L2 cache are very limited. There are 3 arrays in my program. Two of them are accessed sequentially (sizes: array A: 6 MB, array B: 3 MB) and the access pattern for the third array (array C: 3 MB) is unpredictable. Though the calculations are not very rigorous, there are huge numbers of cache misses when accessing array C. One solution I thought of was to allocate more cache (L2) space for array C and less for arrays A and B, but I'm not able to find any way to achieve this. I went through the preload engine of ARM but could not find anything useful.
It would be a good idea to split the cache and allocate each array in a different part of it.
Unfortunately that is not possible. The caches of the Cortex-A8 just aren't that flexible. The good old StrongARM had a secondary cache for exactly this splitting purpose, but it's not available anymore. We have L1 and L2 caches instead (overall a good change, IMHO).
However, there is a thing you can do:
The NEON SIMD unit of the Cortex-A8 lags behind the general-purpose processing unit by around 10 processor cycles. With clever programming you can issue cache prefetches from the general-purpose unit but do the memory accesses via NEON. The delay between the two pipelines gives the cache a bit of time to perform the prefetches, so your average cache-miss time will be lower.
The drawback is that you must never move the result of a calculation back from NEON to the ARM unit. Since NEON lags behind, this causes a full CPU pipeline flush, which is almost as costly as a cache miss, if not more so.
The difference in performance can be significant. As a rough guess, I would expect a speed improvement of somewhere between 20% and 30%.
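A minimal sketch of the idea (assuming GCC or Clang with NEON enabled; the array c and the summation are placeholders for your real access pattern):

#include <arm_neon.h>

// Prefetch from the ARM pipeline a few iterations ahead; do the loads in NEON.
float sum_c(const float *c, int n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int i = 0; i + 4 <= n; i += 4) {
        __builtin_prefetch(&c[i + 64]);              // PLD issued by the ARM side
        acc = vaddq_f32(acc, vld1q_f32(&c[i]));      // NEON load lags the ARM pipeline
    }
    float lanes[4];
    vst1q_f32(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];   // one NEON-to-ARM move, after the loop
}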
From what I could find via Google, it looks like ARMv7 (which is the version of the ISA that Cortex A8 supports) has cache-flush capability, though I couldn't find a clear reference on how to use it -- perhaps you can do better if you spend more time on it than the minute or two I spent typing "ARM cache flush" into a search box and reading the results.
In any case, you should be able to achieve an approximation of what you want by periodically issuing "flush" instructions to flush out the parts of A and B that you know you no longer need.
I am trying to figure out whether my code's inner loop is hitting a hardware design barrier or a barrier caused by a lack of understanding on my part. There's a bit more to it, but the simplest question I can come up with is as follows:
If I have the following code:
float px[32768], py[32768], pz[32768];
float xref, yref, zref, deltax, deltay, deltaz;

initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);

for (int i = 0; i < 32768 - 1; i++) {
    xref = px[i];
    yref = py[i];
    zref = pz[i];
    for (int j = 0; j < 32768 - 1; j++) {
        deltax = xref - px[j];
        deltay = yref - py[j];
        deltaz = zref - pz[j];
    }
}
What type of maximum theoretical speed up would I be able to see by going to SSE instructions in a situation where I have complete control over code (assembly, intrinsics, whatever) but no control over runtime environment other than architecture (i.e. it's a multi-user environment so I can't do anything about how the OS kernel assigns time to my particular process).
Right now I'm seeing a speedup of 3x with my code, when I would have thought using SSE would give me much more vector depth than the 3x speedup indicates (presumably the 3x speedup tells me I have a 4x maximum theoretical throughput). (I've tried things such as making deltax/deltay/deltaz arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only a 3x speedup.) I'm using the Intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics, obviously.
It depends on the CPU. But the theoretical max won't get above 4x. I don't know of a CPU which can execute more than one SSE instruction per clock cycle, which means that it can at most compute 4 values per cycle.
Most CPUs can do at least one scalar floating-point instruction per cycle, so in this case you'd see a theoretical max of a 4x speedup.
But you'll have to look up the specific instruction throughput for the CPU you're running on.
A practical speedup of 3x is pretty good though.
I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector and 1 from the next, then 2 and 2, and so on. If you established some kind of queue that loads and processes the data 4 components at a time, then separates it afterwards, that might work.
Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I said above.
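Something along these lines (a rough sketch; the accumulate step is just a stand-in for whatever the real inner loop computes with the deltas):

#include <xmmintrin.h>

// Unroll the inner loop so SSE handles four j values at once, broadcasting the
// reference point into each lane (arrays and bounds as in the question).
__m128 inner(const float *px, const float *py, const float *pz, int i, int n)
{
    __m128 xr = _mm_set1_ps(px[i]);
    __m128 yr = _mm_set1_ps(py[i]);
    __m128 zr = _mm_set1_ps(pz[i]);
    __m128 acc = _mm_setzero_ps();
    for (int j = 0; j + 4 <= n; j += 4) {
        __m128 dx = _mm_sub_ps(xr, _mm_loadu_ps(&px[j]));
        __m128 dy = _mm_sub_ps(yr, _mm_loadu_ps(&py[j]));
        __m128 dz = _mm_sub_ps(zr, _mm_loadu_ps(&pz[j]));
        // stand-in for whatever the real loop does with the deltas:
        acc = _mm_add_ps(acc, _mm_add_ps(dx, _mm_add_ps(dy, dz)));
    }
    return acc;
}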
Consider: how wide is a float? How wide is the SSEx instruction? The ratio should give you some kind of reasonable upper bound.
It's also worth noting that out-of-order pipelines play havoc with getting good estimates of speedup.
You should consider loop tiling - the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because everything probably still fits in the L2 at 384 KB, but there is easily an order of magnitude difference between an L1 cache hit and an L2 cache hit, so this could make a big difference for you.
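A rough sketch of what that tiling could look like (the tile size is a guess to be tuned; the delta computation is the one from the question):

// Walk j in blocks small enough that the three slices of px/py/pz stay in L1
// while the i loop reuses them.
void tiled(const float *px, const float *py, const float *pz, int n)
{
    const int TILE = 2048;                 // 3 * 2048 * 4 bytes = 24 KB per tile
    for (int jj = 0; jj < n; jj += TILE) {
        int jend = (jj + TILE < n) ? jj + TILE : n;
        for (int i = 0; i < n; ++i) {
            float xref = px[i], yref = py[i], zref = pz[i];
            for (int j = jj; j < jend; ++j) {
                float deltax = xref - px[j];
                float deltay = yref - py[j];
                float deltaz = zref - pz[j];
                (void)deltax; (void)deltay; (void)deltaz;   // use the deltas here
            }
        }
    }
}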