Is it quicker to do division batch-wise or just once? - theory

Suppose I have the numbers: 1, 2, 4, 7, 12 and 18 (randomly chosen). When computing the mean is it quicker to do:
mean = 0
loop through [1, 2, 4, 7, 12, 18]
    for each item:
        increase mean by (item / total number of items)
or
mean = 0
loop through [1, 2, 4, 7, 12, 18]
    for each item:
        increase mean by (item)
divide mean by 6 (the total number of items)
or does it make no difference to the speed of the algorithm? This is a purely theoretical question (ignoring the specific chosen compiler, etc.)

It depends to an extent on how smart your compiler is - some will perform lots of optimizations, others not as much. The optimization level can often be configured so that you can vary the degree to which the compiler transforms code.
In the limit of optimization, a compiler could recognize that your code computes a constant value, pre-compute it during compilation, and simply store the result in the output variable. This absolutely is possible.
Another possibility is that the compiler recognizes the loop is over a fixed range and so it removes the loop and expands the summation.
In the limit of no optimization, your first snippet does 6 division operations while the second snippet does just one. Since the counts of all other operations are the same for the two snippets, the first snippet cannot be faster and is likely slower than the second.
It might be instructive to learn about optimization in your language's compiler or interpreter and experiment with various levels of optimization to see how the performance of each snippet changes. You'll probably need larger input samples to get useful time measurements out.
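For anyone who wants to experiment, here is a minimal C sketch of the two snippets (the array, the helper names and the expected output are illustrative only):

#include <stdio.h>

#define N 6

/* Variant 1: divide on every iteration (N divisions in total). */
static double mean_divide_each(const int *values)
{
    double mean = 0.0;
    for (int i = 0; i < N; ++i)
        mean += (double)values[i] / N;
    return mean;
}

/* Variant 2: sum first, divide once at the end (1 division in total). */
static double mean_divide_once(const int *values)
{
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
        sum += values[i];
    return sum / N;
}

int main(void)
{
    int values[N] = {1, 2, 4, 7, 12, 18};
    printf("%f\n", mean_divide_each(values));  /* 7.333333 */
    printf("%f\n", mean_divide_once(values));  /* 7.333333 */
    return 0;
}

Timing the two functions over a much larger array, at different optimization levels (e.g. -O0 versus -O2 with gcc or clang), is the experiment suggested above.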

Related

Effect of non-divisible loop sizes at runtime on openMP SIMD

After reading several different articles and not finding an answer I am going to introduce the problem and then ask the question.
I have a section of code that can be reduced down to a series of loops that look like the following.
#pragma omp parallel for simd
for (int i = 0; i < a*b*c; i++)
{
    array1[i] += array2[i] * array3[i];
}
Now, most examples of SIMD use that I have encountered have a, b and c fixed at compile time, allowing the optimisation to take place. However, my code requires that the values of a, b and c are determined at run time.
Let's say that, for the case of the computer I am using, the register can fit 4 values, and that the value of a*b*c is 127. My understanding of how this is compiled is that the compiler will vectorise everything that is wholly divisible by 4, then serialise the rest (please correct this if I am wrong). However, this is when the compiler has full knowledge of the problem. If I were to now allow a run-time choice of a, b and c and came to the value of 127, how would vectorisation proceed? Naively I would assume that the code behind the scenes is intelligent enough to understand this might happen, to have both a serial and a vector version, and to call the most suitable one. However, as this is an assumption, I would appreciate someone more knowledgeable on the subject enlightening me further, as I don't want accidental overflows, or non-processing of data, due to a misunderstanding.
On the off chance this matters, I am using OpenMP 4.0 with a C gcc compiler, although I am hoping this will not change your answer as I will always attempt to use the latest OpenMP version and unfortunately may need to routinely change compiler.
Typically, a compiler will unroll beyond the simd length. For optimum results, particularly with gcc, you would specify this unroll factor, e.g. --param max-unroll-times=2 (if you don't expect much longer loops). With a simd length of 4, the loop would then consume 8 iterations at a time, leaving a remainder. gcc would build a remainder loop, somewhat like Duff's device, which might have 15 iterations, and would calculate where to jump in at run time.
The Intel compiler handles a vectorized remainder loop in a different way: supposing you have 2 simd widths available, the remainder loop would use the shorter width without unrolling, so that the serial part is as short as possible. When compiling for the general case of unaligned data, there is a remainder loop at both ends, with the one at the beginning limited to the length required for alignment of the stored values.
With the combination omp parallel for simd, the situation gets more complicated: normally, the loop chunks must vary in size, and one might argue that the interior chunks should be set up for alignment, with the end chunks smaller (not normally done).
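As a rough illustration of the shape of the generated code, here is a hand-written sketch of a vectorizable main loop plus a scalar remainder loop; the simd width of 4 and the float element type are assumptions for the example, and real compilers choose the width, unrolling and alignment handling themselves:

void fma_arrays(float *array1, const float *array2, const float *array3, int n)
{
    const int width = 4;              /* assumed simd width, for illustration */
    int main_end = n - (n % width);   /* largest multiple of width <= n */

    /* Main loop: the trip count is a multiple of the simd width, so it can
       be vectorized without a tail. */
    for (int i = 0; i < main_end; i++)
        array1[i] += array2[i] * array3[i];

    /* Remainder loop: at most width-1 scalar iterations. */
    for (int i = main_end; i < n; i++)
        array1[i] += array2[i] * array3[i];
}

With n = 127 this sketch runs 124 iterations in the main loop and 3 in the remainder loop; the compiler-generated version differs in the details (unrolling, alignment peeling, where it jumps in), but the split into a full-width part and a leftover part is the same.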

Compile time data-specific optimizations

In some cases, one knows at compile time what a particular piece of algorithmic data looks like, and as such might wish to convey this information to the compiler. This question is about how one might best achieve that.
By way of example, consider the following sparse matrix multiplication, in which the matrix is constant and known at compile time:
matrix = [ 0, 210, 0, 248, 137]
[ 0, 0, 0, 0, 239]
[ 0, 0, 0, 0, 0]
[116, 112, 0, 0, 7]
[ 0, 0, 0, 0, 165]
In such a case, a fully branchless implementation of the matrix-vector multiplication could be written for an arbitrary input vector:
#include <stdio.h>

#define ARRAY_SIZE 8

/* Flattened representation of the non-zero entries of the matrix above:
   matrix[i] is the value, input_indices[i] its column, output_indices[i] its row. */
static const int matrix[ARRAY_SIZE] = {210, 248, 137, 239, 116, 112, 7, 165};
static const int input_indices[ARRAY_SIZE] = {1, 3, 4, 4, 0, 1, 4, 4};
static const int output_indices[ARRAY_SIZE] = {0, 0, 0, 1, 3, 3, 3, 4};

static void matrix_multiply(int *input_array, int *output_array)
{
    for (int i = 0; i < ARRAY_SIZE; ++i) {
        output_array[output_indices[i]] +=
            matrix[i] * input_array[input_indices[i]];
    }
}

int main()
{
    int test_input[5] = {36, 220, 212, 122, 39};
    int output[5] = {0};

    matrix_multiply(test_input, output);
    for (int i = 0; i < 5; ++i) {
        printf("%d\n", output[i]);
    }
}
which prints the correct result for the matrix-vector multiplication (81799, 9321, 0, 29089, 6435).
Further optimisations can be envisaged that build on data specific knowledge about the memory locality of reference.
Now, clearly this is an approach which can be used, but it starts getting unwieldy when the size of the data gets big (say ~100MB in my case) and also in any real world situation would depend on meta-programming to generate the associated data dependent knowledge.
Does the general strategy of baking in data specific knowledge have mileage as regards optimisation? If so, what is the best approach to do this?
In the example given, on one level the whole thing can be reduced to knowledge about ARRAY_SIZE, with the arrays set at runtime. This leads me to think the approach is limited (and is really a data structures problem), but I'm very interested to know if the general approach of data-derived compile-time optimisations is useful in any situation.
I don't think this is a very good answer to this question but I'm going to try offering it anyway. It's also more of a search for the same basic answer.
I work in 3D VFX including raytracing where it's not uncommon to take a fairly modest input with data structures that build in under a second, and then do a monumental amount of processing subsequently to the point where a user might wait hours for a quality production render in a difficult lighting situation.
In theory at least, this could go so much faster if we could make these "data-specific optimizations". Variables could turn into literal constants, significantly less branching could be required, data that is known to always have an upper bound of 45 elements could be allocated on the stack instead of the heap or use another form of memory preallocated in advance, locality of reference could be exploited to a greater degree than ever before, vectorization could be applied more easily, achieving both thread-safety and efficiency could be a lot easier, etc.
Where this gets awkward for me is that this requires information about user inputs which can only be provided after the usual notion of "compile-time". So a lot of my interest here relates to code-generation techniques while the application is running.
Now, clearly this is an approach which can be used, but it starts getting unwieldy when the size of the data gets big (say ~100MB in my case) and also in any real world situation would depend on meta-programming to generate the associated data dependent knowledge.
I think beyond that, if the data size gets excessive, then we do often need a good share of branching and variables just to avoid generating so much code that we start becoming bottlenecked by icache misses.
Yet even the ability to turn a dozen frequently-accessed variables into compile-time constants, and to allow a handful of data structures to exploit greater knowledge of the specified input (with the aid of an aggressive optimizer), may yield great mileage here, especially considering how well optimizers do when they have the necessary information in advance.
Some of this could be tackled normally with increasingly elaborate and generalized code, metaprogramming techniques, etc, yet there's a peak to how far we can go there: an optimizer can only optimize as much as the information it has available in advance allows. The difficulty here is providing that information in a practical way. And, as you already guessed, this can quickly get unwieldy and difficult to maintain, and productivity starts to become just as great a concern as efficiency (if not greater).
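As a small, purely hypothetical illustration of the "turn frequently-accessed variables into compile-time constants" idea: one could generate a tiny header from the user's scene description and rebuild (or JIT) the hot kernel against it. All names below are made up for the sketch:

/* Hypothetical contents of a generated header (scene_config.h), produced
   from the user's input before the hot kernel is (re)compiled. */
#define NUM_LIGHTS 3
#define MAX_BOUNCES 4

/* Hot kernel compiled against the generated constants. */
float shade(const float *light_intensity)
{
    float total = 0.0f;
    /* NUM_LIGHTS is a literal constant here, so the compiler is free to
       fully unroll this loop and keep everything in registers. */
    for (int i = 0; i < NUM_LIGHTS; ++i)
        total += light_intensity[i];
    return total / MAX_BOUNCES;
}

The same information passed as ordinary function arguments would leave the loop bound and the divisor opaque to the optimizer.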
So the most promising techniques to me revolve around code-generation techniques tuned for a specific problem domain, but not for a specific input (optimizing for the specific input will lean more on the optimizer; the code generation is there so that we can provide more of the information the optimizer needs, more easily and appropriately). A modest example that already does something like this is Open Shading Language, which uses JIT compilation to exploit this idea to a modest level:
OSL uses the LLVM compiler framework to translate shader networks into machine code on the fly (just in time, or "JIT"), and in the process heavily optimizes shaders and networks with full knowledge of the shader parameters and other runtime values that could not have been known when the shaders were compiled from source code. As a result, we are seeing our OSL shading networks execute 25% faster than the equivalent shaders hand-crafted in C! (That's how our old shaders worked in our renderer.)
While a 25% improvement over handwritten code is modest, that's still a big deal in a production renderer, and it seems like we could go far beyond that.
The use of nodes as a visual programming language also offers a more restrictive environment that helps reduce human errors, allows expressing solutions at a higher-level, seeing the results of changes made on the fly (instant turnaround), etc. -- so it adds not only efficiency but that productivity we need to avoid getting lost in such optimizations. Maintaining and building the code generator could be a little complex, but it only needs to have the minimal amount of code required and doesn't scale in complexity with the amount of code generated using it.
So apologies -- this is more of an extended comment than an answer to your question, but I think we're searching for a similar thing.

Searching missing number - simple example

A little task on search algorithms and complexity in C. I just want to make sure I'm right.
I have n natural numbers from 1 to n+1, ordered from smallest to largest, and I need to find the missing one.
For example: 1 2 3 5 6 7 8 9 10 11 - ans: 4
The fastest and simplest answer is to do one loop and compare every number with the one that comes after it. The complexity of that is O(n) in the worst case.
I thought maybe I'm missing something and I can find it using binary search. Can anybody think of a more efficient algorithm for this simple example?
Like O(log(n)) or something?
There are obviously two answers:
If your problem is a purely theoretical one, especially for large n, you'd do something like a binary search: check whether the element in the middle of the current boundaries still has the value it should have if nothing were missing below it, and narrow the search to the half that contains the mismatch.
However, if this is a practical question, for modern systems executing programs written in C and compiled by a modern, highly optimizing compiler, and for n << 10000, I'd assume that the linear search approach is much, much faster, simply because it can be vectorized so easily. In fact, modern CPUs have instructions to, for example:
load 4 integers at once,
subtract four other integers,
compare the result to [4 4 4 4],
increment the counter by 4,
load the next 4 integers,
and so on. This lends itself very neatly to the fact that CPUs and memory controllers prefetch linear memory, and thus jumping around in logarithmically descending step sizes can have an enormous performance impact.
So: For large n, where linear search would be impractical, go for the binary search approach; for n where that is questionable, go for the linear search. If you not only have SIMD capabilities but also multiple cores, you will want to split your problem. If your problem is not actually exactly 1 missing number, you might want to use a completely different approach ... The whole O(n) business is generally more of a benchmark usable purely for theoretical constructs, and unless the difference is immensely large, is rarely the sole reason to pick a specific algorithm in a real-world implementation.
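For completeness, here is a minimal C sketch of the binary-search approach (assuming the sorted array holds n of the numbers 1..n+1 with exactly one of them missing):

#include <stdio.h>

/* Returns the missing value, assuming a[] holds n of the numbers 1..n+1
   in increasing order with exactly one number missing. */
static int find_missing(const int *a, int n)
{
    int lo = 0, hi = n - 1;
    int missing = n + 1;          /* correct if the last value is the missing one */

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] == mid + 1) {
            lo = mid + 1;         /* everything up to mid is in place */
        } else {
            missing = mid + 1;    /* the gap is at or before mid */
            hi = mid - 1;
        }
    }
    return missing;
}

int main(void)
{
    int a[] = {1, 2, 3, 5, 6, 7, 8, 9, 10, 11};
    printf("%d\n", find_missing(a, 10));   /* prints 4 */
    return 0;
}

This does O(log n) comparisons; the linear scan described above remains the simpler (and, for small n, often faster) alternative.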
For a comparison-based algorithm, you can't beat Lg(N) comparisons in the worst case. This is simply because the answer is a number between 1 and N and it takes Lg(N) bits of information to represent such a number. (And a comparison gives you a single bit.)
Unless the distribution of the answers is very skewed, you can't do much better than Lg(N) on average.
Now I don't see how a non-comparison-based method could exploit the fact that the sequence is ordered, and do better than O(N).

What is the logical difference between loops and recursive functions?

I came across this video, which discusses how most recursive functions can be written with for loops, but when I thought about it, I couldn't see the logical difference between the two. I found this topic here, but it only focuses on the practical difference, as do many other similar topics on the web. So what is the logical difference in the way a loop and a recursion are handled?
Bottom line up front -- recursion is more versatile but in practice is generally less efficient than looping.
A loop could in principle always be implemented as a recursion if you wished to do so. In practice the limits of stack resources put serious constraints on the size of the problems you can address. I can and have built loops that iterate a billion times, something I'd never try with recursion unless I was certain the compiler could and would convert the recursion into a loop. Because of the stack limits and efficiency, people often try to find a looping equivalent for recursions.
Tail recursions can always be converted to loops. However, there are recursions that can't be converted. As an example, I work with statistical design of experiments. Sometimes a large design is constructed by "crossing" several smaller sub-designs. Crossing is where you concatenate every row of a second design to each row of the first. For two sub-designs, all this needs is simple nested looping, but for three or more designs you need to increase the level of nesting, adding one level of nesting for each additional sub-design. So while this is nested looping in principle, in practice the amount of nesting is variable. If you tried to implement it with looping you'd have to revise your program to add/subtract nested loops every time you were dealing with a different number of sub-designs to be crossed, so you can't write an immutable loop-based version. This can easily be implemented with recursion. In this case, I'm happy to trade a slight amount of efficiency, because I wrote and debugged the code 6 years ago and haven't had to revise it since, despite creating lots of crossed designs of varying complexity since then.
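A stripped-down sketch of that "variable nesting depth" point in C: crossing an arbitrary number of sub-designs, where each additional sub-design would otherwise require one more hand-written nested loop. The row representation is simplified to single integers purely for illustration:

#include <stdio.h>

#define NUM_DESIGNS 3

/* Simplified sub-designs: each "row" is just a label. */
static const int designs[NUM_DESIGNS][2] = { {1, 2}, {10, 20}, {100, 200} };
static const int rows_per_design[NUM_DESIGNS] = { 2, 2, 2 };

/* One recursion level per sub-design: each level stands in for one level
   of loop nesting, so the nesting depth can vary at run time. */
static void cross(int level, int *chosen)
{
    if (level == NUM_DESIGNS) {              /* one full crossed row chosen */
        for (int i = 0; i < NUM_DESIGNS; ++i)
            printf("%d ", chosen[i]);
        printf("\n");
        return;
    }
    for (int r = 0; r < rows_per_design[level]; ++r) {
        chosen[level] = designs[level][r];
        cross(level + 1, chosen);
    }
}

int main(void)
{
    int chosen[NUM_DESIGNS];
    cross(0, chosen);                        /* prints the 2*2*2 = 8 crossed rows */
    return 0;
}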
One way to think through this is that the choice for recursion or iteration depends on how you think about the problem being solved. Certain "ways of thinking" lead more naturally to recursive solutions, and other ways of thinking lead to more iterative solutions. For any problem, you can in principle think in a way that gives you a recursive solution or a way that gives you an iterative solution. (Sometimes the iterative solution will just end up simulating a recursion stack, but there is no actual recursion there.)
Here's an example. You have an array of integers (positive or negative), and you want to find the maximum segment sum. A segment is a piece of the array that is contiguous. So in the array [3, -4, 2, 1, -2, 4], the maximum segment sum is 5, and you get that from the segment [2, 1, -2, 4]; its sum is 5.
OK - so how might we solve this problem? One thing you might do is reason like this: "if I knew the maximum segment sum in the left half, and the maximum segment sum in the right half, then maybe I could somehow jam those together and figure out the maximum segment sum overall". This idea would require you to find the maximum segment sum on the two subhalves, and this is a smaller instance of the original problem. This is recursion, and a direct translation of this idea into code would therefore be recursive.
But the maximum segment sum problem isn't "recursive" or "iterative" -- it can be both, depending on how you think about the solution. I gave a recursive thought process above. Here is an iterative process: "well, if I add up the elements in each of the segments that start at some index i and end at some index j, I can just take the maximum of these to solve the problem". And directly trying to code this approach would give you triply nested loops (and a bad mark on an assignment because it's horribly inefficient!).
So, the same problem, depending on how the problem is conceptualized, can lead to a recursive or iterative solution. Now, I happened to choose a problem where there are many ways of solving it, and where there are reasonable recursive and iterative solutions. Some problems, however, admit only one type of solution, and that solution may be most naturally implemented using recursion or iteration. For example, if I asked you to write a function that keeps asking the user to enter a letter until they enter y or n, you might start thinking: "keep repeating the prompt and asking for input..." and before you know it you have some iterative code. Perhaps you might instead think recursively: "if the user enters y or n, I am done; otherwise ask the user for y or n"... in which case you'd generate a recursive algorithm. But the recursion here doesn't give you much: it unnecessarily uses a stack and doesn't make the program any faster. (Recursion sometimes makes it easier to prove correctness, in which case you might present something recursively even though you could alternately give a reasonable iterative solution.)
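To make the two thought processes concrete, here is a minimal C sketch of each: a direct translation of the iterative "try every segment" idea, and a divide-and-conquer translation of the recursive idea. Neither is presented as the best-known solution (Kadane's algorithm solves this in linear time); they only illustrate how the two conceptualizations turn into code:

#include <stdio.h>

/* Iterative idea, translated directly: try every segment [i..j]. O(n^3). */
static int max_segment_sum_iterative(const int *a, int n)
{
    int best = 0;                            /* the empty segment counts as 0 */
    for (int i = 0; i < n; ++i) {
        for (int j = i; j < n; ++j) {
            int sum = 0;
            for (int k = i; k <= j; ++k)
                sum += a[k];
            if (sum > best)
                best = sum;
        }
    }
    return best;
}

/* Recursive idea: the best segment lies in the left half, in the right half,
   or straddles the middle. O(n log n). */
static int max_segment_sum_recursive(const int *a, int lo, int hi)
{
    if (lo > hi)
        return 0;
    if (lo == hi)
        return a[lo] > 0 ? a[lo] : 0;

    int mid = (lo + hi) / 2;
    int left = max_segment_sum_recursive(a, lo, mid);
    int right = max_segment_sum_recursive(a, mid + 1, hi);

    /* Best segment crossing the middle: extend leftwards and rightwards from mid. */
    int sum = 0, best_left = 0, best_right = 0;
    for (int i = mid; i >= lo; --i) {
        sum += a[i];
        if (sum > best_left) best_left = sum;
    }
    sum = 0;
    for (int i = mid + 1; i <= hi; ++i) {
        sum += a[i];
        if (sum > best_right) best_right = sum;
    }
    int cross = best_left + best_right;

    int best = left > right ? left : right;
    return cross > best ? cross : best;
}

int main(void)
{
    int a[] = {3, -4, 2, 1, -2, 4};
    printf("%d\n", max_segment_sum_iterative(a, 6));      /* 5 */
    printf("%d\n", max_segment_sum_recursive(a, 0, 5));   /* 5 */
    return 0;
}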

Which is faster — sorting or multiplying a small array of elements?

Reading through Cactus Kev's Poker Hand Evaluator, I noticed the following statements:
At first, I thought that I could always simply sort the hand first before passing it to the evaluator; but sorting takes time, and I didn't want to waste any CPU cycles sorting hands. I needed a method that didn't care what order the five cards were given as.
...
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
...
Since multiplication is one of the fastest calculations a computer can make, we have shaved hundreds of milliseconds off our time had we been forced to sort each hand before evaluation.
I have a hard time believing this.
Cactus Kev represents each card as a 4-byte integer, and evaluates hands by calling eval_5cards( int c1, int c2, int c3, int c4, int c5 ). We could represent cards as one byte, and a poker hand as a 5-byte array. Sorting this 5-byte array to get a unique hand must be pretty fast. Is it faster than his approach?
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them? If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
Thanks!
Good answers everyone; I'm working on benchmarking the performance of sorting vs multiplication, to get some hard performance statistics.
Of course it depends a lot on the CPU of your computer, but a typical Intel CPU (e.g. Core 2 Duo) can multiply two 32-bit numbers within 3 CPU clock cycles. For a sort algorithm to beat that, it needs to be faster than the 3 * 4 = 12 CPU cycles the four multiplications take, which is a very tight constraint. None of the standard sorting algorithms can do it in fewer than 12 cycles for sure. The comparison of two numbers alone will take one CPU cycle, the conditional branch on the result will take another, and whatever you do then will take at least one more (swapping two cards will actually take at least 4 CPU cycles). So multiplying wins.
Of course this is not taking the latency into account to fetch the card value from either 1st or 2nd level cache or maybe even memory; however, this latency applies to either case, multiplying and sorting.
Without testing, I'm sympathetic to his argument. You can do it in 4 multiplications, as compared to sorting, which is n log n. Specifically, the optimal sorting network requires 9 comparisons. The evaluator then has to at least look at every element of the sorted array, which is another 5 operations.
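For the curious, here is one 9-comparator network for 5 elements sketched in C (the comparator ordering below is one valid optimal arrangement; in practice each compare-exchange would typically be done branchlessly with min/max rather than with an if):

#include <stdio.h>

/* Compare-exchange: after the call, *a <= *b. */
static void cswap(int *a, int *b)
{
    if (*a > *b) { int t = *a; *a = *b; *b = t; }
}

/* 9-comparator sorting network for 5 elements. */
static void sort5(int v[5])
{
    cswap(&v[0], &v[1]); cswap(&v[3], &v[4]);
    cswap(&v[2], &v[4]); cswap(&v[2], &v[3]);
    cswap(&v[1], &v[4]); cswap(&v[0], &v[3]);
    cswap(&v[0], &v[2]); cswap(&v[1], &v[3]);
    cswap(&v[1], &v[2]);
}

int main(void)
{
    int hand[5] = {12, 3, 7, 1, 9};
    sort5(hand);
    for (int i = 0; i < 5; ++i)
        printf("%d ", hand[i]);    /* 1 3 7 9 12 */
    printf("\n");
    return 0;
}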
Sorting is not intrinsically harder than multiplying numbers. On paper, they're about the same, and you also need a sophisticated multiplication algorithm to make large multiplication competitive with large sort. Moreover, when the proposed multiplication algorithm is feasible, you can also use bucket sort, which is asymptotically faster.
However, a poker hand is not an asymptotic problem. It's just 5 cards and he only cares about one of the 13 number values of the card. Even if multiplication is complicated in principle, in practice it is implemented in microcode and it's incredibly fast. What he's doing works.
Now, if you're interested in the theoretical question, there is also a solution using addition rather than multiplication. There can only be 4 cards of any one value, so you could just as well assign the values 1,5,25,...,5^12 and add them. It still fits in 32-bit arithmetic. There are also other addition-based solutions with other mathematical properties. But it really doesn't matter, because microcoded arithmetic is so much faster than anything else that the computer is doing.
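A minimal sketch of that addition-based encoding (ranks numbered 0..12 here; the representation is an assumption for the example, not Cactus Kev's actual layout):

#include <stdio.h>

/* Order-independent key for a 5-card hand: at most 4 cards share a rank,
   so each rank's count fits in one base-5 digit; sum 5^rank per card. */
static unsigned hand_key(const int ranks[5])
{
    static const unsigned pow5[13] = {
        1u, 5u, 25u, 125u, 625u, 3125u, 15625u, 78125u, 390625u,
        1953125u, 9765625u, 48828125u, 244140625u
    };
    unsigned key = 0;
    for (int i = 0; i < 5; ++i)
        key += pow5[ranks[i]];
    return key;
}

int main(void)
{
    int hand_a[5] = {0, 3, 3, 7, 12};   /* same multiset of ranks...  */
    int hand_b[5] = {12, 3, 7, 0, 3};   /* ...in a different order    */
    printf("%u %u\n", hand_key(hand_a), hand_key(hand_b));   /* equal keys */
    return 0;
}

The worst-case key (four cards of rank 12 plus one of rank 11) is well under 2^31, so the claim that it fits in 32-bit arithmetic checks out.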
5 elements can be sorted using an optimized decision tree, which is much faster than using a general-purpose sorting algorithm.
However, the fact remains that sorting means lots of branches (as do the comparisons that are necessary afterwards). Branches are really bad for modern pipelined CPU architectures, especially branches that go either way with similar likelihood (thus defeating branch prediction logic). That, much more than the theoretical cost of multiplication vs. comparisons, makes multiplication faster.
But if you could build custom hardware to do the sorting, it might end up faster.
That shouldn't really be relevant, but he is correct. Sorting takes much longer than multiplying.
The real question is what he did with the resulting prime product, and how that was helpful (since I would expect factoring it to take longer than sorting).
It's hard to think of any sorting operation that could be faster than multiplying the same set of numbers. At the processor level, the multiplication is just load, load, multiply, load, multiply, ..., with maybe some manipulation of the accumulator thrown in. It's linear, easily pipelined, no comparisons with the associated branch mis-prediction costs. It should average about 2 instructions per value to be multiplied. Unless the multiply instruction is painfully slow, it's really hard to imagine a faster sort.
One thing worth mentioning is that even if your CPU's multiply instruction is dead slow (or nonexistent...) you can use a lookup table to speed things even further.
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
That's an example of a non-positional number system.
I can't find the link to the theory. I studied it as part of applied algebra, somewhere around Euler's totient function and encryption. (I may have the terminology wrong, as I studied all of that in my native language.)
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them?
RAM is an external resource and is generally slower compared to the CPU. Sorting 5 ints would always have to go to RAM due to swap operations. Add to that the overhead of the sorting function itself, and multiplication stops looking all that bad.
I think that on modern CPUs integer multiplication would pretty much always be faster than sorting, since several multiplications can be executed at the same time on different ALUs, while there is only one bus connecting the CPU to RAM.
If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
5 integers can be sorted quite quickly using bubble sort: qsort would use more memory (for recursion), while a well-optimized bubble sort would work completely from the d-cache.
As others have pointed out, sorting alone isn't quicker than multiplying for 5 values. This ignores, however, the rest of his solution. After disdaining a 5-element sort, he proceeds to do a binary search over an array of 4888 values - at least 12 comparisons, more than the sort ever required!
Note that I'm not saying there's a better solution that involves sorting - I haven't given it enough thought, personally - just that sorting alone is only part of the problem.
He also didn't have to use primes. If he simply encoded the value of each card in 4 bits, he'd need 20 bits to represent a hand, giving a range of 0 to 2^20 = 1048576, about 1/100th of the range produced using primes, and small enough (though still suffering cache coherency issues) to produce a lookup table over.
Of course, an even more interesting variant is to take 7 cards, such as are found in games like Texas Holdem, and find the best 5 card hand that can be made from them.
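A sketch of the 4-bit packing mentioned two paragraphs up, assuming the five rank values (0..12) have already been sorted so that any ordering of the same hand produces the same 20-bit key:

#include <stdio.h>

/* Pack five sorted rank values (0..12, 4 bits each) into a 20-bit key,
   usable as an index into a lookup table of size 2^20. */
static unsigned pack_hand(const int sorted_ranks[5])
{
    unsigned key = 0;
    for (int i = 0; i < 5; ++i)
        key = (key << 4) | (unsigned)sorted_ranks[i];
    return key;
}

int main(void)
{
    int ranks[5] = {1, 3, 7, 9, 12};     /* assumed already sorted */
    printf("%u\n", pack_hand(ranks));    /* index into a 1,048,576-entry table */
    return 0;
}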
The multiplication is faster.
Multiplication of any given array will always be faster than sorting the array, presuming the multiplication yields a meaningful result, and the lookup table is irrelevant because the code is designed to evaluate a poker hand, so you'd need to do a lookup on the sorted set anyway.
An example of a ready made Texas Hold'em 7- and 5-card evaluator can be found here with documentation and further explained here. All feedback welcome at the e-mail address found therein.
You don't need to sort, and can typically (~97% of the time) get away with just 6 additions and a couple of bit shifts when evaluating 7-card hands. The algorithm uses a generated lookup table which occupies about 9MB of RAM and is generated almost instantly. Cheap. All of this is done inside of 32 bits, and "inlining" the 7-card evaluator is good for evaluating about 50m randomly generated hands per second on my laptop.
Oh, and multiplication is faster than sorting.

Resources