Using SSE to speed up lower_bound function - c

In a project I'm currently working on I often need to find the lowest possible index in a sorted array at which an element can be inserted (like std::lower_bound in C++).
It seems attractive to use SSE to speed up my algorithm, since I'm working with uint32 arrays whose size is typically the size of a processor cache line.
I've never used SSE instructions before, so I can't figure out what an SSE implementation of this function would look like. Please give me hints to help me write it optimally with SSE.

Nothing like std::lower_bound is going to scale well using SSE. The reason SSE makes things faster is that it allows you to do several calculations at once. For example, a single SSE instruction might result in 4 multiply operations going on at once. However, the way std::lower_bound operates cannot be parallelized, because each step in the algorithm requires the comparison results of previous steps. Plus, it's already O(lg n), and as a result unlikely to be a bottleneck.
Moreover, before moving to inline assembly, you should know that whenever you use inline assembly, you defeat most compiler optimizations that might occur on that section of your program, and often as a result your program will be slower -- compilers usually write better assembler than us humans do.
If you want to use SSE, you are better off using intrinsics: special "functions" or keywords provided by the compiler that map to SSE instructions but otherwise allow optimizations to occur. Such intrinsics are available in Microsoft's Visual C++ as well as the GNU Compiler Collection (and probably in most other compilers; consult your compiler's documentation).
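To give an idea of what intrinsics look like, here is a minimal sketch for the case in the question, assuming the array holds exactly 16 uint32 values (one 64-byte cache line). It is not the std::lower_bound algorithm: instead of bisecting, it counts how many elements are smaller than the key, which gives the same index for a sorted array and has no dependency between steps. The function name lower_bound16_sse2 is made up for illustration, and whether this beats a plain binary search is something only a benchmark can tell.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index of the first element >= key in a sorted
   array of exactly 16 uint32_t values. Counts the elements that are
   < key instead of bisecting. */
static size_t lower_bound16_sse2(const uint32_t *a, uint32_t key)
{
    /* SSE2 only has signed 32-bit compares; flipping the sign bit of both
       operands turns an unsigned comparison into a signed one. */
    const __m128i bias = _mm_set1_epi32(INT32_MIN);
    const __m128i k = _mm_xor_si128(_mm_set1_epi32((int32_t)key), bias);
    __m128i acc = _mm_setzero_si128();

    for (int i = 0; i < 16; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        v = _mm_xor_si128(v, bias);
        /* Each lane is all ones (-1) where a[i] < key, else 0. */
        __m128i lt = _mm_cmpgt_epi32(k, v);
        /* Subtracting -1 adds 1 to that lane's counter. */
        acc = _mm_sub_epi32(acc, lt);
    }

    /* Sum the four per-lane counters. */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return (size_t)(lanes[0] + lanes[1] + lanes[2] + lanes[3]);
}
```

A return value of 16 means every element is smaller than the key, i.e. the insertion point is past the end.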
Rather than trying to speed up std::lower_bound using SSE, you should try to not need to call it in the first place. For example, if you're constantly inserting elements into the vector using lower_bound, you should know that what you've effectively created is insertion sort, and a poor insertion sort at that, which will require quadratic time. You would likely be better off simply putting your new elements at the end of the vector and then sorting the vector when you need it sorted, which reduces things to an O(n lg n) sort. If your data access patterns are such that you would be re-sorting too often, then you should use something like a std::set instead, which provides O(lg n) insertions rather than the O(n + lg n) insertions you're currently getting with the vectors.
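Since the question is tagged C, the "append, then sort once" idea could look roughly like the following sketch. It assumes the caller has already made room for the new elements; append_then_sort and cmp_u32 are hypothetical names, and qsort stands in for whatever sort you prefer.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort on uint32_t values. */
static int cmp_u32(const void *pa, const void *pb)
{
    uint32_t a = *(const uint32_t *)pa;
    uint32_t b = *(const uint32_t *)pb;
    return (a > b) - (a < b);
}

/* Hypothetical helper: copy `count` new values to the end of `data`
   (which already holds `len` values and has room for len + count),
   then sort the whole array once. Returns the new length. */
static size_t append_then_sort(uint32_t *data, size_t len,
                               const uint32_t *extra, size_t count)
{
    for (size_t i = 0; i < count; i++)
        data[len + i] = extra[i];
    qsort(data, len + count, sizeof data[0], cmp_u32);
    return len + count;
}
```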
And of course, remember to benchmark :)

Related

Does Intel MKL or some similar library provide a vectorized way to count the number of elements in an array fulfilling some condition in C?

The problem
I'm working on implementing and refining an optimization algorithm with some fairly large arrays (from tens of millions of floats and up), using mainly Intel MKL in C (not C++, at least not so far) to squeeze out every possible bit of performance. Now I've run into a silly problem: I have a parameter that sets maxima and minima for subsets of a set of (tens of millions of) coefficients. Actually applying these maxima and minima using MKL functions is easy: I can create equally sized vectors with the limits for every element and use v?Fmax and v?Fmin to apply them. But I also need to account for this clipping in my error metric, which requires me to count the number of elements that fall outside these constraints.
However, I can't find an MKL function that allows me to do things like counting the number of elements that fulfill some condition, the way you can create and sum logical arrays with e.g. NumPy in Python or in MATLAB. Irritatingly, when I try to google this question, I only get answers relating to Python and R.
Obviously I can just write a loop that increments a counter for each element that fulfills one of the conditions, but if there is an already optimized implementation that allows me to achieve this, I would much prefer that just owing to the size of my arrays.
Does anyone know of a clever way to achieve this robustly and very efficiently using Intel MKL (maybe with the statistics toolbox or some creative use of elementary functions?), a similarly optimized library that does this, or a highly optimized way to hand-code this? I've been racking my brain trying to come up with some out-of-the box method, but I'm coming up empty.
Note that it's necessary for me to be able to do this in C, that it's not viable for me to shift this task to my Python frontend, and that it is indeed necessary for me to code this particular subprogram in C in the first place.
Thanks!
If you were using C++, count_if from the algorithms library with an execution policy of par_unseq may parallelize and vectorize the count. On Linux at least, it typically uses Intel TBB to do this.
It's not likely to be as easy in C. Because C doesn't have concepts like templates, callables or lambdas, the only way to specialize a generic (library-provided) count() function would be to pass a function pointer as a callback (like qsort() does). Unless the compiler manages to devirtualize and inline the callback, you can't vectorize at all, leaving you with (possibly thread-parallelized) scalar code. OTOH, if you use, for example, gcc vector intrinsics (my favourite!), you get vectorization but not parallelization. You could try to combine the approaches, but I'd say get over yourself and use C++.
However, if you only need vectorization, you can almost certainly just write sequential code and have the compiler autovectorize, unless the predicate for what should be counted is poorly written, or your compiler is braindamaged.
For example, gcc vectorizes the code on x86 if at least SSE4 instructions are available (-msse4). With AVX, AVX2 or AVX-512 (-mavx / -mavx2 / -mavx512f) you can get wider vectors to do more elements at once. In general, if you're compiling on the same hardware you will be running the program on, I'd recommend letting gcc autodetect the optimal instruction set extensions (-march=native).
Note that in the provided code, the conditions should not use the short-circuiting OR (||), because then the read from the max-vector is semantically forbidden if the comparison with the min-vector was already true for the current element, severely hindering vectorization (though AVX-512 could potentially vectorize this with somewhat catastrophic slowdown).
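The code this answer refers to is not reproduced here, so the following is only a sketch of the kind of loop being described, assuming float data with per-element lower and upper limits. The name count_clipped and its parameters are made up for illustration. Compile with something like gcc -O3 -march=native and check the generated assembly to see whether it was actually vectorized.

```c
#include <stddef.h>

/* Hypothetical helper: count elements of x that lie outside [lo[i], hi[i]].
   The bitwise | evaluates both comparisons unconditionally, so the loop
   body has no branches and both limit arrays are always read. */
static size_t count_clipped(const float *x, const float *lo,
                            const float *hi, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)((x[i] < lo[i]) | (x[i] > hi[i]));
    return count;
}
```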
I'm pretty sure gcc is not nearly optimal in the code it generates for AVX-512, since it could do the OR in the mask registers (k registers) with kor[b/w/d/q], but maybe somebody with more experience in AVX-512 (*cough* Peter Cordes *cough*) could weigh in on that.
MKL doesn't provide such functions, but you could check another performance library, IPP, which contains a set of threshold functions that could be useful in your case. Please refer to the IPP Developer Reference for more details: https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/essential-functions/conversion-functions/threshold.html

Complexity for C code on ARM

I am trying to run my code on an ARM device. So far it's running, and I also have a tool to measure complexity. Now I have lots of standard functions I am using to perform mathematical operations, like dividing, multiplying, adding and so on.
Is it easier (i.e. less complex) if I write those functions as e.g.
result = a + b;
or as
"qadd %0, %1, %4;"
which would be arm code for this operation if the values are in the respective registers. I am just wondering if writing everything in ARM code would really reduce the complexity.
Also, how does that behave for conditionals (like if and else)?
Thank you.
Let the compiler take care of it, until you discover a bottleneck.
Note that QADD is saturating arithmetic and has different behavior to the C code you show.
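To make the difference concrete, here is a rough C emulation of a saturating 32-bit signed add. The function name saturating_add32 is made up; it only illustrates the clamping behaviour and is not how you would normally get the compiler to emit QADD. Plain a + b on int, by contrast, has undefined behaviour in C when the result overflows.

```c
#include <stdint.h>

/* Hypothetical emulation of a saturating add: a signed 32-bit add whose
   result is clamped to [INT32_MIN, INT32_MAX] instead of overflowing. */
static int32_t saturating_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;  /* cannot overflow in 64 bits */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```

So the two snippets in the question only agree as long as the sum stays in range.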

Can clang/gcc optimize linked-list trees?

I have a C program that has a tree implemented with linked-lists (child->parent and sibling->sibling).
I'm very new to compiler optimizations. I have seen and read about -O through -O3, and I think I've read about speeding up nested for loops and such.
If I want to increase performance with my tree implementation, do I need to start thinking about reimplementing it? Or perhaps I can just crunch the compiler?
Compiler optimizations won't change your data structure into something else. The best you'll get is a local array variable being kept entirely in registers and optimized away.
In theory with whole-program optimization, a compiler could figure out what you're doing with a data structure, and use a better one. In practice, if we had that we'd just tack on some natural-language processing and we'd have an AI to write classes / libraries based on English descriptions.
Your best bet is to use -O3, or -Ofast if -ffast-math and similar "unsafe" optimizations are OK. Even better: use -fprofile-generate and -fprofile-use to optimize based on which loops run a lot, which way branches usually go, and things like that. This will get the compiler to do as much as possible to minimize any constant factors in the run-time of your algorithm.
To improve the worst-case lookup times, you do need to change your algorithm. Either one of the many flavours of tree that involves re-balancing to avoid degenerate cases, or a different data structure entirely (e.g. hash table).

What is the limit of optimization using SIMD?

I need to optimize some C code, which does lots of physics computations, using SIMD extensions on the SPE of the Cell Processor. Each vector operator can process 4 floats at the same time. So ideally I would expect a 4x speedup in the most optimistic case.
Do you think the use of vector operators could give bigger speedups?
Thanks
The best optimization occurs in rethinking the algorithm. Eliminate unnecessary steps. Find a more direct way of accomplishing the same result. Compute the solution in a domain more relevant to the problem.
For example, if the vector array is a list of n points which are all on the same line, then it is sufficient to transform the end points only and interpolate the intermediate points.
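A minimal sketch of that idea, under two assumptions that are mine and not from the question: the points are evenly spaced along the segment, and the transform is affine (only then does interpolating between the transformed end points give the same result as transforming every point). Point2, transform and transform_segment are hypothetical names.

```c
#include <stddef.h>

typedef struct { float x, y; } Point2;

/* Hypothetical affine map, standing in for whatever transform the
   real code applies to each point. */
static Point2 transform(Point2 p)
{
    Point2 q = { 2.0f * p.x - 1.0f * p.y + 3.0f,
                 1.0f * p.x + 2.0f * p.y - 4.0f };
    return q;
}

/* Fill out[0..n-1] with the images of n evenly spaced points on the
   segment from a to b, calling transform() only twice. */
static void transform_segment(Point2 a, Point2 b, Point2 *out, size_t n)
{
    Point2 ta = transform(a), tb = transform(b);
    for (size_t i = 0; i < n; i++) {
        float s = (n > 1) ? (float)i / (float)(n - 1) : 0.0f;
        out[i].x = ta.x + s * (tb.x - ta.x);
        out[i].y = ta.y + s * (tb.y - ta.y);
    }
}
```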
It CAN give better speedups than 4x over straight floating point, as the SIMD instructions could be less exact (not so much as to cause too many problems, though) and so take fewer cycles to execute. It really depends.
The best plan is to learn as much as possible about the processor you are optimising for. You may find it can give you far better than 4x improvements. You may find out it can't. We can't say, though, without knowing more about the algorithm you are optimising and what CPU you are targeting.
On their own, no. But if the process of re-writing your algorithms to support them also happens to improve, say, cache locality or branching behaviour, then you could find unrelated speed-ups. However, this is true of any re-write...
This is entirely possible.
You can do more clever instruction-level micro optimizations than a compiler, if you know what you're doing.
Most SIMD instruction sets offer several powerful operations that don't have any equivalent in normal scalar FPU/ALU code (e.g. PAVG/PMIN etc. in SSE2). Even if these don't fit your problem exactly, you can often combine these instructions to great effect.
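As a small illustration on x86 (not Cell), the SSE2 intrinsic _mm_avg_epu8 maps to the byte form of PAVG and computes a rounded average of 16 unsigned bytes at once, without the intermediate overflow you would have to handle by widening in scalar code. The wrapper name average16_u8 is made up for this sketch.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: out[i] = (a[i] + b[i] + 1) / 2 for 16 bytes,
   computed without overflow by a single packed-average instruction. */
static void average16_u8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(va, vb));
}
```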
Not sure about Cell, but most SIMD instruction sets have features to optimize memory access, for example to prefetch data into cache. I've had very good results with these.
Now this isn't Cell or PPC at all, but a simple image convolution filter of mine got a 20x speedup (C vs. SSE2) on Atom, which is higher than the level of parallelism (16 pixels at a time).
It depends on the architecture. For the moment I assume the x86 architecture (aka SSE).
You can get a factor of four on tight loops easily. Just replace your existing math with SSE instructions and you're done.
You can even get a little more than that, because if you use SSE you do the math in registers which are usually not used by the compiler. This frees up the general-purpose registers for other tasks such as loop control and address calculation. In short, the code that surrounds the SSE instructions will be more compact and execute faster.
And then there is the option to hint to the memory controller how you want to access the memory, e.g. whether you want to store data in a way that bypasses the cache or not. For bandwidth-hungry algorithms that may give you some extra speed on top of that.
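On x86, one way to express the "bypass the cache" hint is a non-temporal (streaming) store. The sketch below assumes the destination is 16-byte aligned and the length is a multiple of 4; fill_streaming is a made-up name, and whether streaming stores help at all depends heavily on the access pattern, so measure before relying on them.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: fill dst[0..n-1] with value using non-temporal
   stores. Assumes dst is 16-byte aligned and n is a multiple of 4. */
static void fill_streaming(float *dst, size_t n, float value)
{
    const __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);
    /* Order the weakly-ordered streaming stores before any later stores. */
    _mm_sfence();
}
```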

Practical use of automatic vectorization?

Has anyone taken advantage of the automatic vectorization that gcc can do? In the real world (as opposed to example code)? Does it take restructuring of existing code to take advantage? Are there a significant number of cases in any production code that can be vectorized this way?
I have yet to see either GCC or Intel C++ automatically vectorize anything but very simple loops, even when given the code of algorithms that can (and were, after I manually rewrote them using SSE intrinsics) be vectorized.
Part of this is being conservative - especially when faced with possible pointer aliasing, it can be very difficult for a C/C++ compiler to 'prove' to itself that a vectorization would be safe, even if you as the programmer know that it is. Most compilers (sensibly) prefer to not optimize code rather than risking miscompiling it. This is one area where higher level languages have a real advantage over C, at least in theory (I say in theory since I'm not actually aware of any automatically vectorizing ML or Haskell compilers).
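In C99 and later, one way to tell the compiler what you know about aliasing is the restrict qualifier. The following is a minimal sketch with a made-up function name; it is a promise by the caller, not something the compiler checks, and a compiler may also vectorize the unqualified version by adding a runtime overlap test.

```c
#include <stddef.h>

/* y[i] += a * x[i]. The restrict qualifiers promise that x and y do not
   overlap, so the compiler does not have to assume that a store to y[i]
   could change a later x[j]. */
static void saxpy(size_t n, float a,
                  const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Calling this with overlapping arrays is undefined behaviour, which is exactly the guarantee the compiler is relying on.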
Another part of it is simply analytical limitations: most research in vectorization, as I understand it, is related to optimizing classical numerical problems (fluid dynamics, say), which were the bread and butter of most vector machines until a few years ago (when, between CUDA/OpenCL, Altivec/SSE, and the STI Cell, vector programming in various forms became widely available in commercial systems).
It's fairly unlikely that code written with a scalar processor in mind will be easy for a compiler to vectorize. Happily, many things you can do to make it easier for a compiler to understand how to vectorize it, like loop tiling and partial loop unrolling, also (tend to) help performance on modern processors even if the compiler doesn't figure out how to vectorize it.
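As a sketch of what partial unrolling can look like, here is a sum over floats split across four independent accumulators (sum_unrolled4 is a made-up name). The four chains do not depend on each other, which helps a scalar pipeline and also matches the shape of a 4-wide vector add. Note that this changes the order of the floating-point additions, so the result can differ slightly from a straightforward loop.

```c
#include <stddef.h>

/* Sum of x[0..n-1] using four independent partial sums. */
static float sum_unrolled4(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    float total = (s0 + s1) + (s2 + s3);

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        total += x[i];
    return total;
}
```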
It is hard to use in any business logic, but it gives speedups when you are processing volumes of data in the same way.
A good example is sound/video processing, where you apply the same operation to every sample/pixel.
I have used VisualDSP for this, and you had to check the results after compiling to see whether vectorization was really used where it should be.
Vectorized instructions are not limited to Cell processors: most modern workstation-class CPUs have them (PPC, x86 since the Pentium III, SPARC, etc.). When used well for floating-point operations, they can help quite a lot with very compute-intensive tasks (filters, etc.). In my experience, automatic vectorization does not work so well.
You may have noticed that pretty much no one actually knows how to make good use of GCC's automatic vectorization. If you search around the web to see people's comments, it always comes down to the idea that GCC allows you to enable automatic vectorization, but it extremely rarely makes actual use of it, and so if you want to use SIMD acceleration (e.g. MMX, SSE, AVX, NEON, AltiVec), then you basically have to figure out how to write it using compiler intrinsics or Assembly language code.
But the problem with intrinsics is that you effectively need to understand the Assembly language side of it and then also learn the Intrinsics method of describing what you want, which is likely to result in much less efficient code than if you wrote it in Assembly code (such as by a factor of 10x), because the compiler is still going to have trouble making good use of your intrinsic instructions!
For example, you might be using SIMD Intrinsics so that many operations can be performed in parallel at the same time, but your compiler will probably generate Assembly code that transfers the data between the SIMD registers and the normal CPU registers and back, effectively making your SIMD code run at a similar speed (or even slower) than normal code!
So basically:
If you want up to 100% speedups (2x speed), then either buy the official Intel/ARM compilers or convert some of your code to use SIMD C/C++ intrinsics.
If you want 1000% speedups (10x speed), then write it in Assembly code using SIMD instructions by hand. Or, if available on your hardware, use GPU acceleration instead, such as OpenCL or Nvidia's CUDA SDK, since they can provide similar speedups on the GPU as SIMD does on the CPU.

Resources