Superlinear speedup of the quicksort algorithm [closed] - c

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I've implemented both a sequential version and a parallel version of quicksort.
To verify the speedup, I used the worst case of quicksort for my implementation: the source array is already sorted, and in my case the pivot is always the first element of the array.
So the partition generates two sets: one containing the elements less than the pivot, and another with the elements greater than the pivot, the latter having n - 1 elements, where n is the number of elements of the array passed as the argument to the quicksort function. The recursion depth is N - 1, where N is the number of elements of the original array passed to the quicksort function.
Note: the sets are actually represented by two variables containing the initial and final positions of the part of the array that corresponds to the elements smaller than the pivot and the elements greater than the pivot. The whole division happens in place, which means no new array is created in the process. The difference in the parallel version is that more than one array is created, and the elements are divided equally between them (each sorted as in the sequential case). To join the elements in the parallel case, the merge algorithm was used.
The speedup obtained was higher than the theoretical one: with two threads the speedup achieved was more than 2x compared to the sequential version (3x to be more precise), and with 4 threads the speedup was 10x.
The computer where I ran the threads is a 4-core machine (Phenom II X4) running Ubuntu Linux 10.04, 64-bit if I am not wrong. The compiler is gcc 4.4, and no flags were passed to the compiler except for linking the pthread library for the parallel implementation.
So, does someone know the reason for the superlinear speedup achieved? Can someone give me any pointers, please?

It would really be best to use a performance analyzer to dig into this in more detail, but my first guess is that this kind of superlinear speedup is caused by the fact that you get more cache space when you add threads. That way, more data is read from cache. Since the cost of a memory transfer is really high, this can easily improve performance.
Are you using Amdahl's law to evaluate your maximum speedup?
Hope this helps.

If you see a 3x speedup with two threads versus one, and a 10x speedup with four threads versus one, something fishy is going on.
Amdahl's law states that speedup is 1/(1-P+P/S), where P is the portion of the algorithm that is parallel and S is the speedup factor of the parallel portion. Assuming that S=4 for four cores (the best possible result), a 10x speedup would require P=1.2, which is impossible (it has to be between 0 and 1, inclusive).
Put another way, if you could get a 10x improvement with 4 cores, then you could simply use one core to simulate four cores and still get a 2.5x improvement (or thereabouts).
Put yet another way, four cores over one second perform fewer operations than one core over ten seconds. So the parallel version is actually performing fewer operations, and if that is the case there is no reason why the serial version couldn't also perform fewer operations.
Possible conclusions:
Something could be wrong with your sequential version. Perhaps it is optimized poorly.
Something could be wrong with your parallel versions. They could be incorrect.
The measurement could be performed incorrectly. This is quite common.
Impossible conclusions:
This algorithm scales superlinearly.

Related

C/Linux: How to find the perfect number of threads to use, to minimize the time of execution? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
Let's say I have a vector of N numbers. I have to do computations on the vector, so I will assign each thread a portion of the vector to compute on. For example, if I have 100 elements and 10 threads, the first thread will work on elements 0..9, the second on elements 10..19, and so on. How do I find the perfect number of threads in order to minimize the execution time? Of course, we will consider N to be a pretty big number, so we can observe differences. What is the relation between the number of threads necessary for the execution time to be minimal and the number of cores on my machine?
There is no exact formula or relation as such, but you can judge it depending on your use case.
With multithreading, you can decide which performance metric you want to optimise:
Latency
Throughput
Your use case is a single task that needs to be performed in as little time as possible, so we need to focus on optimising latency.
To improve latency, the task can be divided into, let's say, N subtasks, as you mentioned in your question.
Multithreading will not always help you minimise the runtime. For example, if the input size is small (say, 100), your multithreaded program may end up taking more time than a single-threaded one, because of the cost of thread management. This may not be true with a large input.
So, for any use case, the best way is to rely on some real measurements, as follows:
For an array of size N, you can plot latency vs. number of threads and check what best suits your use case.
For example, have a look at the following plot:
From the above plot, we can conclude that for an array of constant size N and a processor with 4 cores,
the optimal number of threads = 6.
Generally, we prefer number of threads = number of cores of the processor. From the above plot, it is evident that this is not always true: the processor had 4 cores, but the optimal latency was achieved using 6 threads.
Now, for each N, you can test for the parameters that best optimise your use case.
PS: You can study the concept of virtual cores and find out why latency starts increasing after the number of threads goes from 6 to 8.
PS: Full credit for the image goes to Michael Pogrebinsky and his fantastic course on Udemy - Java Multithreading, Concurrency & Performance Optimization.
Unfortunately there is no easy answer on this one. You should essentially try out different configurations and then pick the best suited.
There is no easy answer to this. Intel have the MKL / IPP libraries for this reason; these libraries contain the necessary information for each CPU Intel make to judge what the best breakdown of threads for an operation is. A pretty good guess would be 1 thread per core.
On the topic of how much data a thread should be allocated to process (i.e. how much of the vector it will chew on), a good guideline is the L1 cache size (normally 32kB). If you judge it so that the amount of vector + the size of the corresponding results is less than 32kB, that gives the cache system the best possible chance of delivering data to the thread and taking the results away again.

OpenCL: How to distribute a calculation on different devices without multithreading

Following my former post about comparing the time required to do a simple array addition job (C[i]=A[i]+B[i]) on different devices, I improved the code a little bit to repeat the process for different array length and give back the time required:
The X axis is the array length on a log (base 2) scale, and the Y axis is the time on a log (base 10) scale. As can be seen, somewhere between 2^13 and 2^14 the GPUs become faster than the CPU. I guess it is because the memory allocation becomes negligible compared to the calculation. (GPI1 is a typo; I meant GPU1.)
Now, hoping my C-OpenCL code is correct, I can get an estimate of the time required to do an array addition on each device: f1(n) for the CPU, f2(n) for the first GPU and f3(n) for the second GPU. If I have an array job of length n, I should theoretically be able to divide it into 3 parts, n1+n2+n3=n, in a way that satisfies f1(n1)=f2(n2)=f3(n3), and distribute them over the three devices on my system to get the fastest possible calculation. I think I could do it using, let's say, OpenMP or any other multithreading method, and use the cores of my CPU to host three different OpenCL tasks. That's not what I'd like to do, because:
It is a waste of resources. Two of the cores would just be hosting while they could be used for calculation.
It makes the code more complicated.
I'm not sure how to do it. I'm now using the Apple Clang compiler with -framework OpenCL to compile the code, but for OpenMP I have to use the GNU compiler. I don't know how to use both OpenMP and OpenCL with one of these compilers.
Now I'm wondering if there is any way to do this distribution without multithreading? For example, one of the CPU cores could assign the tasks to the three devices in turn, then catch the results in the same (or a different) order and concatenate them. It would probably need a bit of experimenting to tune the timing of the assignment of the subtasks, but I guess it should be possible.
I'm a total beginner with OpenCL, so I would appreciate it if you could help me figure out whether this is possible and how to do it. Maybe there are already some examples doing so; please let me know. Thanks in advance.
P.S. I have also posted this question here and here on Reddit.
The problem as stated implicitly tells you that the solution should be concurrent (asynchronous): you need to collect the results from three different devices at the same time, otherwise you will run a process first on device A, then on device B, and then on device C (in which case it would be better to run a single process on the fastest device). If you plan to learn to exploit OpenCL programming efficiently (whether on multicore CPUs or GPUs), you should get comfortable with asynchronous programming, which is indeed multithreaded.

Counting FLOPs and size of data and check whether function is memory-bound or cpu-bound

I am going to analyse and optimize some C code, and therefore I first have to check whether the functions I want to optimize are memory-bound or CPU-bound. In general I know how to do this, but I have some questions about counting floating-point operations and analysing the size of the data used. Look at the following for loop, which I want to analyse. The values of the array are doubles (that is, 8 bytes each):
for (int j = 0; j < N; j++) {
    for (int i = 1; i < Nt; i++) {
        matrix[j*Nt+i] = matrix[j*Nt+i-1] * mu + matrix[j*Nt+i] * sigma;
    }
}
1) How many floating-point operations do you count? I thought about 3*(Nt-1)*N... but do I have to count the operations within the array index computations, too (the j*Nt+i in matrix[j*Nt+i] is 2 more operations per access)?
2) How much data is transferred? 2*(Nt-1)*N*8 bytes or 3*(Nt-1)*N*8 bytes? I mean, every entry of the matrix has to be loaded, and after the calculation the new value is stored to that index of the array (that is 1 load and 1 store). But this value is then used in the next calculation. Is another load operation needed for that, or is the value (matrix[j*Nt+i-1]) already available without a load?
Thanks a lot!
With this type of code, the direct sort of analysis you are proposing to do can be almost completely misleading. The only meaningful information about the performance of the code is actually measuring how fast it runs in practice (benchmarking).
This is because modern compilers and processors are very clever about optimizing code like this, and it will end up executing in a way which is nothing like your straightforward analysis. The compiler will optimize the code, rearranging the individual operations. The processor will itself try to execute the individual sub-operations in parallel and/or pipelined, so that for example computation is occurring while data is being fetched from memory.
It's useful to think about algorithmic complexity, to distinguish between O(n) and O(n²) and so on, but constant factors (like the 2*... or 3*... you ask about) are completely moot, because in practice they vary depending on lots of details.

Understand whether code sample is CPU bound or Memory bound

As a general question to those working on optimization and performance tuning of programs: how do you figure out if your code is CPU-bound or memory-bound? I understand these concepts in general, but if I have, say, 'y' loads and stores and '2y' computations, how does one go about finding the bottleneck?
Also, can you figure out where exactly you are spending most of your time, and say that if you load 'x' amount of data into cache in every loop iteration (assuming it's memory-bound), then your code will run faster? Is there any precise way to determine this 'x', other than trial and error?
Are there any tools you'd use, say on the IA-32 or IA-64 architectures? Does VTune help?
For example, I'm currently doing the following:
I have 26 8*8 matrices of complex doubles and I have to perform a MVM (matrix vector multiplication) with (~4000) vectors of length 8, for each of these 26 matrices. I use SSE to perform the complex multiplication.
/* Copy 26 matrices to temporary storage */
for (int i = 0; i < 4000; i += 2) {  /* loop over the 4000 vectors */
    for (int k = 0; k < 26; k++) {   /* loop over the 26 matrices */
        /*
         * Perform MVM in blocks of 2 between the k-th matrix and
         * the i-th and (i+1)-th vectors
         */
    }
}
The 26 matrices take 26 KB (L1 cache is 32 KB) and I have laid the vectors out in memory so that I have stride-1 accesses. Once I perform the MVM of a vector with the 26 matrices, I don't visit it again, so I don't think cache blocking will help. I have used vectorization, but I'm still stuck at 60% of peak performance.
I tried copying, say, 64 vectors into temporary storage for every iteration of the outer loop, thinking they'd be in cache and help, but it only decreased performance. I tried using _mm_prefetch() in the following way: when I am done with about half the matrices, I prefetch the next 'i' and 'i+1' vectors, but that hasn't helped either.
I have done all this assuming it's memory-bound, but I want to know for sure. Is there a way?
To my understanding, the best way is to profile your application/workload. Depending on the input data, the characteristics of an application/workload can vary significantly. However, these behaviors can be quantified into a few phases [2, 3], and a histogram can broadly tell you the most frequent path of the workload to be optimized. The question you are asking would also require benchmark programs (like SPEC2006, PARSEC, MediaBench, etc.) for a given architecture, and is difficult to answer in general terms (it is an active area of research in computer architecture). However, for specific cases a quantitative result can be stated for different memory hierarchies. You can use tools like:
Perf
OProfile
VTune
LikWid
LTTng
and other monitoring and simulation tools to get profiling traces of the application. You can look at performance counters like IPC and CPI (for CPU-boundedness), and at memory accesses, cache misses, cache accesses and other memory counters for determining memory-boundedness. Memory accesses per cycle (MPC) is often used to determine the memory-boundedness of an application/workload.
To specifically improve matrix multiplication, I would suggest using an optimized implementation, as in LAPACK.

Which is faster — sorting or multiplying a small array of elements?

Reading through Cactus Kev's Poker Hand Evaluator, I noticed the following statements:
At first, I thought that I could always simply sort the hand first before passing it to the evaluator; but sorting takes time, and I didn't want to waste any CPU cycles sorting hands. I needed a method that didn't care what order the five cards were given as.
...
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
...
Since multiplication is one of the fastest calculations a computer can make, we have shaved hundreds of milliseconds off our time had we been forced to sort each hand before evaluation.
I have a hard time believing this.
Cactus Kev represents each card as a 4-byte integer, and evaluates hands by calling eval_5cards( int c1, int c2, int c3, int c4, int c5 ). We could represent cards as one byte, and a poker hand as a 5-byte array. Sorting this 5-byte array to get a unique hand must be pretty fast. Is it faster than his approach?
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them? If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
Thanks!
Good answers everyone; I'm working on benchmarking the performance of sorting vs multiplication, to get some hard performance statistics.
Of course it depends a lot on your CPU, but a typical Intel CPU (e.g. Core 2 Duo) can multiply two 32-bit numbers in 3 clock cycles. For a sort algorithm to beat the 4 multiplications needed for 5 cards, it needs to finish in under 3 * 4 = 12 CPU cycles, which is a very tight constraint. None of the standard sorting algorithms can do it in fewer than 12 cycles for sure. The comparison of two numbers alone takes one CPU cycle, the conditional branch on the result takes another, and whatever you do then takes at least one more (swapping two cards will actually take at least 4 CPU cycles). So multiplying wins.
Of course this does not take into account the latency of fetching the card value from first- or second-level cache or maybe even memory; however, that latency applies to both cases, multiplying and sorting.
Without testing, I'm sympathetic to his argument. You can do it with 4 multiplications, as compared to sorting, which is n log n. Specifically, the optimal sorting network for 5 elements requires 9 comparisons. The evaluator then has to at least look at every element of the sorted array, which is another 5 operations.
Sorting is not intrinsically harder than multiplying numbers. On paper, they're about the same, and you also need a sophisticated multiplication algorithm to make large multiplication competitive with large sort. Moreover, when the proposed multiplication algorithm is feasible, you can also use bucket sort, which is asymptotically faster.
However, a poker hand is not an asymptotic problem. It's just 5 cards and he only cares about one of the 13 number values of the card. Even if multiplication is complicated in principle, in practice it is implemented in microcode and it's incredibly fast. What he's doing works.
Now, if you're interested in the theoretical question, there is also a solution using addition rather than multiplication. There can only be 4 cards of any one value, so you could just as well assign the values 1,5,25,...,5^12 and add them. It still fits in 32-bit arithmetic. There are also other addition-based solutions with other mathematical properties. But it really doesn't matter, because microcoded arithmetic is so much faster than anything else that the computer is doing.
5 elements can be sorted using an optimized decision tree, which is much faster than using a general-purpose sorting algorithm.
However, the fact remains that sorting means lots of branches (as do the comparisons that are necessary afterwards). Branches are really bad for modern pipelined CPU architectures, especially branches that go either way with similar likelihood (thus defeating branch prediction logic). That, much more than the theoretical cost of multiplication vs. comparisons, makes multiplication faster.
But if you could build custom hardware to do the sorting, it might end up faster.
That shouldn't really be relevant, but he is correct: sorting takes much longer than multiplying.
The real question is what he did with the resulting prime product, and how that was helpful (since factoring it, I would expect, would take longer than sorting).
It's hard to think of any sorting operation that could be faster than multiplying the same set of numbers. At the processor level, the multiplication is just load, load, multiply, load, multiply, ..., with maybe some manipulation of the accumulator thrown in. It's linear, easily pipelined, no comparisons with the associated branch mis-prediction costs. It should average about 2 instructions per value to be multiplied. Unless the multiply instruction is painfully slow, it's really hard to imagine a faster sort.
One thing worth mentioning is that even if your CPU's multiply instruction is dead slow (or nonexistent...) you can use a lookup table to speed things even further.
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
That's an example of a non-positional number system.
I can't find the link to the theory. I studied it as part of applied algebra, somewhere around Euler's totient and encryption. (I may have the terminology wrong, as I studied all of this in my native language.)
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them?
RAM is an external resource and is generally slow compared to the CPU. Sorting 5 ints would always have to go to RAM due to swap operations. Add the overhead of the sorting function itself, and multiplication stops looking all that bad.
I think on modern CPUs integer multiplication would pretty much always be faster than sorting, since several multiplications can be executed at the same time on different ALUs, while there is only one bus connecting the CPU to RAM.
If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
5 integers can be sorted quite quickly using bubble sort: qsort would use more memory (for recursion), while a well-optimized bubble sort works entirely from the d-cache.
As others have pointed out, sorting alone isn't quicker than multiplying for 5 values. This ignores, however, the rest of his solution. After disdaining a 5-element sort, he proceeds to do a binary search over an array of 4888 values - at least 12 comparisons, more than the sort ever required!
Note that I'm not saying there's a better solution that involves sorting - I haven't given it enough thought, personally - just that sorting alone is only part of the problem.
He also didn't have to use primes. If he simply encoded the value of each card in 4 bits, he'd need 20 bits to represent a hand, giving a range of 0 to 2^20 = 1048576, about 1/100th of the range produced using primes, and small enough (though still suffering cache coherency issues) to produce a lookup table over.
Of course, an even more interesting variant is to take 7 cards, such as are found in games like Texas Holdem, and find the best 5 card hand that can be made from them.
The multiplication is faster.
Multiplication of any given array will always be faster than sorting the array, presuming the multiplication produces a meaningful result; and the lookup table is irrelevant, because the code is designed to evaluate a poker hand, so you'd need a lookup on the sorted set anyway.
An example of a ready made Texas Hold'em 7- and 5-card evaluator can be found here with documentation and further explained here. All feedback welcome at the e-mail address found therein.
You don't need to sort, and can typically (~97% of the time) get away with just 6 additions and a couple of bit shifts when evaluating 7-card hands. The algorithm uses a generated lookup table which occupies about 9 MB of RAM and is generated near-instantly. Cheap. All of this is done inside 32 bits, and "inlining" the 7-card evaluator is good for evaluating about 50M randomly generated hands per second on my laptop.
Oh, and multiplication is faster than sorting.
