Efficient parallel scheduling algorithm for calculating average array value - arrays

I'm trying to find the average value of an array of floating point values using multiple threads on a single machine. I'm not concerned with the size of the array or memory constraints (assume a moderately sized array, large enough to warrant multiple threads). In particular, I'm looking for the most efficient scheduling algorithm. It seems to me that a static block approach would be the most efficient.
So, given that I have x machine cores, it would seem reasonable to chunk the array into array.size/x values and have each core sum the results for their respective array chunk. Then, the summed results from each core are added and the final result is this value divided by the total number of array elements (note: in the case of the # of array elements not being exactly divisible by x, I am aware of the optimization to distribute the elements as evenly as possible across the threads).
The array will obviously be shared between threads, but since there are no writes involved, I won't need to involve any locking mechanisms or worry about synchronization issues.
My question is: is this actually the most efficient approach for this problem?
In contrast, for example, consider the static interleaved approach. In this case, if I had four cores (threads), then thread one would operate on array elements 0, 4, 8, 12... whereas thread two would operate on elements 1, 5, 9, 13... This would seem worse since each core would be continually getting cache misses, whereas the static block approach means that each core operates on success values and takes advantage of data locality. Some tests that I've run seem to back this up.
So, can anyone point out a better approach than static block, or confirm that this is most likely the best approach?
Edit:
I'm using Java and Linux (Ubuntu). I'm trying not to think to much about the language/platform involved, and just look at the problem abstractly from a scheduling point of view that involves manually assigning workload to multiple threads. But I understand that the language and platform are important factors.
Edit-2:
Here's some timings (nano time / 1000) with varying array sizes (doubles).
Sequential timings used a single java thread.
The others implemented their respective scheduling strategies using all available (4) cores working in parallel.
1,000,000 elements:
---Sequential
5765
1642
1565
1485
1444
1511
1446
1448
1465
1443
---Static Block
15857
4571
1489
1529
1547
1496
1445
1415
1452
1661
---Static Interleaved
9692
4578
3071
7204
5312
2298
4518
2427
1874
1900
50,000,000 elements:
---Sequential
73757
69280
70255
78510
74520
69001
69593
69586
69399
69665
---Static Block
62827
52705
55393
53843
57408
56276
56083
57366
57081
57787
---Static Interleaved
179592
306106
239443
145630
171871
303050
233730
141827
162240
292421

Your system doesn't seem to have the memory bandwidth to take advantage of 4 threads on this problem. Doing floating point addition of elements is just not enough work to keep the CPU busy at the rate memory can deliver data... your 4 cores are sharing the same memory controller/DRAM... and are waiting on memory. You probably will see the same speedup if you use 2 threads instead of 4.
Interleaving is a bad idea, as you said and as you confirmed, why waste precious memory bandwidth bringing data into a core and then only using one fourth of it. If you are lucky and the threads run somewhat in sync then you will get reuse of data in the Level 2 or Level 3 cache, but you will still bring data into the L1 cache and only use a fraction.
Update: when adding 50 million elements one concern is loss of precision, log base 2 of 50 million is about 26 bits, and double precision floating point has 53 effective bits (52 explicit and 1 implied). The best case is when all elements have similar exponents (similar in magnitude). Things get worse if the numbers in the array have a large range of exponents, in the worst case the range is large and they are sorted in descending order of magnitude. Precision of your final average can be improved by sorting the array and adding in ascending order. See this SO question for more insight into the precision problem when adding a large number of items: Find the average within variable number of doubles.

Related

Parallelizing radix sort for floating point numbers using pthread library in C

I am trying to parallelize the radix sort using POSIX threads using C language. The specialty is the radix sort needs to be implemented for floating-point numbers. Currently, the code is running sequentially but I have no idea how to parallelize the code. Can anyone help me with this? Any help is appreciated.
Radix sorts are pretty hard to parallelize efficiently on CPUs. There is two parts in a radix sort: the creation of the histogram and the bucket filling.
To create an histogram in parallel you can fill local histograms in each thread and then perform a (tree-based) reduction of the histograms to build a global one. This strategy scale well as long as the histogram are small relative to the data chunks computed by each thread. An alternative way to parallelize this step is to use atomic adds to fill directly a shared histogram. This last method scale barely when thread write accesses conflicts (which often happens on small histograms and many threads). Note that in both solutions, the input array is evenly distributed between threads.
Regarding the bucket filling part, one solution is to make use of atomic adds to fill the buckets: 1 atomic counter per bucket is needed so that each thread can push back items safely. This solution only scale when threads do not often access to the same bucket (bucket conflicts). This solution is not great as the scalability of the algorithm is strongly dependent of the content of the input array (sequential in the worst case). There are solutions to reduces conflicts between threads (better scalability) at the expense of more work (slower with few threads). One is to fill the buckets from both sides: threads with an even ID fill the buckets in ascending order while threads with an odd ID fill them in descending order. Note that it is important to take into account false sharing to maximize performance.
A simple way to parallelize radix sort for all but the first pass is to use a most significant digit (MSD) pass to split up the array into bins, each of which can then be sorted concurrently. This approach relies on having somewhat uniform distribution of values, at least in terms of the most significant digit, so that the bins are reasonably equal in size.
For example, using a digit size of 8 bits (base 256), use a MSD pass to split up the array into 256 bins. Assuming there are t threads, then sort t bins at a time, using least significant digit first radix sort.
For larger arrays, it may help to use a larger initial digit size to split up the array into a larger number of bins, with the goal of getting t bins to fit in cache.
Link to a non-parallelized radix sort that uses MSD for first pass, then LSD for next 3 passes. The loop at the end of RadixSort() to sort the 256 bins could be parallelized:
Radix Sort Optimization
For the first pass, you could use the parallel method in Jerome Richard's answer, but depending on the data pattern, it may not help much, due to cache and memory conflicts.

Understanding O(1) vs O(n) Time Complexity Intuitively

I understand that O(1) is constant-time, which means that the operation does not depend on the input size, and O(n) is linear time, which means that the operation changes linearly with input size.
If I had an algorithm that could simply go directly to an array index rather than going through each index one-by-one to find the required one, that would be considered constant-time rather than linear-time, right? This is what the textbooks say. But, intuitively, I don't understand how a computer could work this way: Wouldn't the computer still need to go through each index one-by-one, from 0 to (potentially) n, in order to find the specific index given to it? But then, is this not the same as what a linear-time algorithm does?
Edit
My response to ElKamina's answer elaborates on how my confusion extends to hardware:
But wouldn't the computer have to check where it is on its journey to
the index? For instance, if it's trying to find index 3, "I'm at index
0 so I need address 0 + 3", "Ok, now I'm at address 1, so I have to
move forward 2", "Ok, now I'm at address 2, so I have to move forward
1", "Ok, now I'm at index 3". Isn't those the same thing as what
linear-time algorithms do? How can the computer not do it
sequentially?
Theory
Imagine you have an array which stores events in the order they happened. If each event takes the same amount of space in a computer's memory, you know where that array begins, and you know what number event you're interested in, then you can precalculate the location of each event.
Imagine you want to store records and key them by telephone numbers. Since there are many numbers, you can calculate a hash of each one. The simplest hash you might apply is to treat the telephone number like a regular number and take it modulus the length of the array you'd like to store the number in. Again, you can assume each record takes the same amount of space, you know the number of records, you know where the array begins, and you know the offset of the event of interest. From these, you can precalculate the location of each event.
If array items have different sizes, then instead fill the array with pointers to the actual items. Your lookup then has two stages: find the appropriate array element and then follow it to the item in question.
Much like we can use shmancy GPS systems to tell us where an address is, but we still need to do the work of driving there, the problem with accessing memory is not knowing where an item is, it's getting there.
Answer to your question
With this in mind, the answer to your question is that look-up is almost never free, but it also is rarely O(N).
Tape memory: O(N)
Tape memory requires O(N) seeks, for obvious reasons: you have to spool and unspool the tape to position it to the needed location. It's slow. It's also cheap and reliable, so it's still in use today in long-term back-up systems. Special algorithms which account for the physical nature of the tape can speed up operations on it slightly.
Notice that, per the foregoing, the problem with tape is not that we don't know where the thing is we're trying to find. The problem is getting the physical medium to get there. The nature of a good tape algorithm is to try to minimize the total amount of tape spooled and unspooled over a grouping of operations.
Speaking of which, what if, instead of having one long tape, we had two shorter tapes: this would reduce the point-to-point travel time. What if we had four tapes?
Disk memory: O(N), but smaller
Hard drives make a huge reduction in seek time by turning the tape into a series of rings. Now, even though there are N memory spaces on a disk, any one can be accessed in short order by moving the drive head and the disk to the appropriate point. (Figuring out how to express this in big-oh notation is a challenge.)
Again, if you use faster disks or smaller disks, you can optimize performance.
RAM: O(1), but with caveats
Pretty much everyone who answers this question is going to fixate on RAM, since that's what programmers work with most frequently. Look to their answers for fuller explanations.
But, just briefly, RAM is a natural extension of the ideas developed above. The RAM holds N items and we know where the item we want is. However, this time there's nothing that needs to mechanically move in order for us to get to that item. In addition, we saw that by having more short tapes or smaller, faster drives, we could get to the memory we wanted faster. RAM takes this idea to its extreme.
For practical purposes, you can think of RAM as being a collection of little memory stores, all strung together. Your computer doesn't know exactly where in RAM a particular item is, just the collection it belongs to. So it grabs the whole collection, consisting of thousands or millions of bytes. It stashes this in something like an L3 cache.
But where is a particular item in that cache? Again, you can think of the computer as not really knowing, it just grabs the a subset which is guaranteed to include the item and passes it to the L2 cache.
And again, for the L1 cache.
And, at this point, we've gone from gigabytes (or terabytes) of RAM to something like 3-30 kilobytes. And, at this level, your computer (finally) knows exactly where the item is and grabs it for processing.
This kind of hierarchical behavior means that accessing adjacent items in RAM is much faster than randomly accessing different points all across RAM. That was also true of tape drives and hard disks.
However, unlike tape drives and hard disks, the worst-case time where all the caches are missed is not dependent on the amount of memory (or, at least, is very weakly dependent: path lengths, speed of light, &c)! For this reason, you can treat it as an O(1) operation in the size of the memory.
Comparing speeds
Knowing this, we can talk about access speed by looking at Latency Numbers Every Programmer Should Know:
Latency Comparison Numbers
--------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
In more human terms, these look like:
Minute:
L1 cache reference 0.5 s One heart beat (0.5 s)
Branch mispredict 5 s Yawn
L2 cache reference 7 s Long yawn
Mutex lock/unlock 25 s Making a coffee
Hour:
Main memory reference 100 s Brushing your teeth
Compress 1K bytes with Zippy 50 min One episode of a TV show (including ad breaks)
Day:
Send 2K bytes over 1 Gbps network 5.5 hr From lunch to end of work day
Week:
SSD random read 1.7 days A normal weekend
Read 1 MB sequentially from memory 2.9 days A long weekend
Round trip within same datacenter 5.8 days A medium vacation
Read 1 MB sequentially from SSD 11.6 days Waiting for almost 2 weeks for a delivery
Year:
Disk seek 16.5 weeks A semester in university
Read 1 MB sequentially from disk 7.8 months Almost producing a new human being
The above 2 together 1 year
Decade:
Send packet CA->Netherlands->CA 4.8 years Average time it takes to complete a bachelor's degree
Underlying any calculation of time complexity is a cost model. Cost models tend to be oversimplified; for example, we generally talk about the time complexity of sort algorithms in terms of how many elements do we have to compare to each other.
The assumption underlying concluding that indexing into an array is O(1) is that of random access memory; that we can access location N by encoding N on the address lines of the memory bus, and the contents of that location come back on the data bus. If memory were sequential access (e.g., accessing off of a magnetic tape), we'd assume a different cost model.
Imagine computer memory as buckets, say you have 10 buckets in from of you.
if someone tells you to pick something up from bucket number 8, you will not first stick your hand into bucket 1 to 7. you would simply put your hand directly into bucket 8.
Arrays work the same way, in most languages map to some form of memory layout. so e.g. if you have an byte array of 10 that would be 10 sequential bytes.
other types could vary in size depending if the content is a value type/struct or if it is a reference type where the array would consist of pointers.
We assume that the memory is "Random Access Memory" (also known as RAM), not the tape or disk memory. In RAM you can access any address in constant time. See the corresponding wiki article for more information on how it works.
Also, elements of the array are stored sequentially. Say we want to store integers in Java which take up 4 bytes. If we wanted to look for kth element, we would directly look at start + 4 * k location in the memory.
You could implement an array in other ways as well. For example, you could implement the array with a linked list, in which case it would take O(n) time to access an element. But this is not how arrays are implemented typically.
No one here has explained why (IMO) in sufficient detail you can access it in O(1) time in detail, so I will try to:
As a note before I do, this is probably trivializing how complex the hardware in the computer has become, but hopefully it's something along the right path. You would cover this in a Computer Organization course that goes into the guts of the hardware.
When you have circuits, the voltage passed through the computer propagates very fast, and the results that come back depends on the pulse of the clock. Take this diagram for example:
https://upload.wikimedia.org/wikipedia/commons/3/3d/Square_array_of_mosfet_cells_read.png
The following is missing parts that you would learn properly from a textbook or course (or online), but omission of those details should still leave you with sufficient enough of a high level overview for a rough idea of how this works:
The address you send as bits will go up the left side of the image, and based on the address size you send, the voltage will be properly sent to the proper memory cell that has the data you want. Upon the cell receiving voltage, it will then emit the value back down to the bottom (which also is basically instant), and now you've read the 'value stored in memory' since the data you want has arrived. Because of how fast voltage travels, you pretty much almost instantly get the result due to the speed of voltage change in circuits. This means it does not depend on traversing the elements before it since you can just go to it, which is the idea behind RAM. The bottleneck comes from the clock pulse with the latches, which when you take a computer organization course you will see what we do and why we do it.
This is why we consider it doable in O(1) time.
Now an Operating Systems and Computer Organization course would show you all about how this is connected under the hood, why its way more complex than what I've written (and what might not even be that accurate anymore), but hopefully gives you an intuition as to why we can do it in constant time.
Since complexity notation hides the constants under the hood (which from the above, we can assume it's constant time to go to any offset in memory), it then would make sense that we can jump to any array offset in O(1) time from a high level point of view -- which is what complexity analysis aims to do for us -- compared to. This is also why we don't need to traverse over every element in memory to get where we want, which as you said is O(n).
Assuming the data structure you are talking about is a vector/array, you can easily reach index 'x' by incrementing whatever you use to iterate over it.
Say you have a vector of struct "A" where A occupies 20bytes, say you want to get to index 28 and you know the vector starts at memory location 'x', than you simply need to go to x + 20 bytes and that is your element.
With a data structure like a list the lookup time will be O(n) since its not continously assigned you have to jump from pointer to pointer.
With a binary tree its O(log2(n)) ... etc
So the answer here is that it depend on your structure. I would recommend reading some books about fundamental data structures, those might help you greatly in gaining more theoretical understanding of the various concepts you are using.

Understand whether code sample is CPU bound or Memory bound

As a general question to those working on optimization and performance tuning of programs, how do you figure out if your code is CPU bound or Memory bound? I understand these concepts in general, but if I have say, 'y' amounts of loads and stores and '2y' computations, how does one go about finding what is the bottleneck?
Also can you figure out where exactly you are spending most of your time and say, if you load 'x' amount of data into cache (if its memory bound), in every loop iteration, then your code will run faster? Is there any precise way to determine this 'x', other than trial and error?
Are there any tools that you'll use, say on the IA-32 or IA-64 architecture? Doest VTune help?
For example, I'm currently doing the following:
I have 26 8*8 matrices of complex doubles and I have to perform a MVM (matrix vector multiplication) with (~4000) vectors of length 8, for each of these 26 matrices. I use SSE to perform the complex multiplication.
/*Copy 26 matrices to temporary storage*/
for(int i=0;i<4000;i+=2){//Loop over the 4000 vectors
for(int k=0;k<26;k++){//Loop over the 26 matrices
/*
Perform MVM in blocks of '2' between kth matrix and
'i' and 'i+1' vector
*/
}
}
The 26 matrices take 26kb (L1 cache is 32KB) and I have laid the vectors out in memory such that I have stride'1' accesses. Once I perform MVM on a vector with the 27 matrices, I don't visit them again, so I don't think cache blocking will help. I have used vectorization but I'm still stuck on 60% of peak performance.
I tried copying, say 64 vectors, into temporary storage, for every iteration of the outer loop thinking they'll be in cache and help, but its only decreased performance. I tried using _mm_prefetch() in the following way: When I am done with about half the matrices, I load the next 'i' and 'i+1' vector into memory, but that too hasn't helped.
I have done all this assuming its memory bound but I want to know for sure. Is there a way?
To my understanding the best way is profiling your application/workload. Based on the input data, the characteristic of the application/workload can significantly vary. These behaviors can however be quantified with to few phases Ref[2, 3] and a histogram can broadly tell the most frequent path of the workload to be optimized. The question that you are asking will also require benchmark programs [like SPEC2006, PARSEC, Media bench etc] for an architecture and is difficult to answer in general terms ( and is an active part of research in computer architecture). However, for specific cases a quantitative result can be stated for different memory hierarchies. You can use tools like:
Perf
OProfile
VTune
LikWid
LLTng
and other monitoring and simulation tools to get the profiling traces of the application. You can look at performance counters like IPC, CPI ( for CPU bound) and memory access, cache misses, cache access , and other memory counters for determining memory boundedness.like IPC, Memory access per cycle (MPC), is often used to determine the memory boundedness of an application/workload.
To specifically improve matrix multiplication, I would suggest using a optimized algorithm as in LAPACK.

Which is faster — sorting or multiplying a small array of elements?

Reading through Cactus Kev's Poker Hand Evaluator, I noticed the following statements:
At first, I thought that I could always simply sort the hand first before passing it to the evaluator; but sorting takes time, and I didn't want to waste any CPU cycles sorting hands. I needed a method that didn't care what order the five cards were given as.
...
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
...
Since multiplication is one of the fastest calculations a computer can make, we have shaved hundreds of milliseconds off our time had we been forced to sort each hand before evaluation.
I have a hard time believing this.
Cactus Kev represents each card as a 4-byte integer, and evaluates hands by calling eval_5cards( int c1, int c2, int c3, int c4, int c5 ). We could represent cards as one byte, and a poker hand as a 5-byte array. Sorting this 5-byte array to get a unique hand must be pretty fast. Is it faster than his approach?
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them? If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
Thanks!
Good answers everyone; I'm working on benchmarking the performance of sorting vs multiplication, to get some hard performance statistics.
Of course it depends a lot on the CPU of your computer, but a typical Intel CPU (e.g. Core 2 Duo) can multiply two 32 Bit numbers within 3 CPU clock cycles. For a sort algorithm to beat that, the algorithm needs to be faster than 3 * 4 = 12 CPU cycles, which is a very tight constraint. None of the standard sorting algorithms can do it in less than 12 cycles for sure. Alone the comparison of two numbers will take one CPU cycle, the conditional branch on the result will also take one CPU cycle and whatever you do then will at least take one CPU cycle (swapping two cards will actually take at least 4 CPU cycles). So multiplying wins.
Of course this is not taking the latency into account to fetch the card value from either 1st or 2nd level cache or maybe even memory; however, this latency applies to either case, multiplying and sorting.
Without testing, I'm sympathetic to his argument. You can do it in 4 multiplications, as compared to sorting, which is n log n. Specifically, the optimal sorting network requires 9 comparisons. The evaluator then has to at least look at every element of the sorted array, which is another 5 operations.
Sorting is not intrinsically harder than multiplying numbers. On paper, they're about the same, and you also need a sophisticated multiplication algorithm to make large multiplication competitive with large sort. Moreover, when the proposed multiplication algorithm is feasible, you can also use bucket sort, which is asymptotically faster.
However, a poker hand is not an asymptotic problem. It's just 5 cards and he only cares about one of the 13 number values of the card. Even if multiplication is complicated in principle, in practice it is implemented in microcode and it's incredibly fast. What he's doing works.
Now, if you're interested in the theoretical question, there is also a solution using addition rather than multiplication. There can only be 4 cards of any one value, so you could just as well assign the values 1,5,25,...,5^12 and add them. It still fits in 32-bit arithmetic. There are also other addition-based solutions with other mathematical properties. But it really doesn't matter, because microcoded arithmetic is so much faster than anything else that the computer is doing.
5 elements can be sorted using an optimized decision tree, which is much faster than using a general-purpose sorting algorithm.
However, the fact remains that sorting means lots of branches (as do the comparisons that are necessary afterwards). Branches are really bad for modern pipelined CPU architectures, especially branches that go either way with similar likelihood (thus defeating branch prediction logic). That, much more than the theoretical cost of multiplication vs. comparisons, makes multiplication faster.
But if you could build custom hardware to do the sorting, it might end up faster.
That shouldn't really be relevant, but he is correct. Sorting takes much longer than multiplying.
The real question is what he did with the resulting prime number, and how that was helpful (since factoring it I would expect to take longer than sorting.
It's hard to think of any sorting operation that could be faster than multiplying the same set of numbers. At the processor level, the multiplication is just load, load, multiply, load, multiply, ..., with maybe some manipulation of the accumulator thrown in. It's linear, easily pipelined, no comparisons with the associated branch mis-prediction costs. It should average about 2 instructions per value to be multiplied. Unless the multiply instruction is painfully slow, it's really hard to imagine a faster sort.
One thing worth mentioning is that even if your CPU's multiply instruction is dead slow (or nonexistent...) you can use a lookup table to speed things even further.
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
That's a example of a non-positional number system.
I can't find the link to the theory. I studied that as part of applied algebra, somewhere around the Euler's totient and encryption. (I can be wrong with terminology as I have studied all that in my native language.)
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them?
RAM is an external resource and is generally slower compared to the CPU. Sorting 5 of ints would always have to go to RAM due to swap operations. Add here the overhead of sorting function itself, and multiplication stops looking all that bad.
I think on modern CPUs integer multiplication would pretty much always faster than sorting, since several multiplications can be executed at the same time on different ALUs, while there is only one bus connecting CPU to RAM.
If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
5 integers can be sorted quite quickly using bubble sort: qsort would use more memory (for recursion) while well optimized bubble sort would work completely from d-cache.
As others have pointed out, sorting alone isn't quicker than multiplying for 5 values. This ignores, however, the rest of his solution. After disdaining a 5-element sort, he proceeds to do a binary search over an array of 4888 values - at least 12 comparisons, more than the sort ever required!
Note that I'm not saying there's a better solution that involves sorting - I haven't given it enough thought, personally - just that sorting alone is only part of the problem.
He also didn't have to use primes. If he simply encoded the value of each card in 4 bits, he'd need 20 bits to represent a hand, giving a range of 0 to 2^20 = 1048576, about 1/100th of the range produced using primes, and small enough (though still suffering cache coherency issues) to produce a lookup table over.
Of course, an even more interesting variant is to take 7 cards, such as are found in games like Texas Holdem, and find the best 5 card hand that can be made from them.
The multiplication is faster.
Multiplication of any given array will always be faster than sorting the array, presuming the multiplication results in a meaningful result, and the lookup table is irrelevant because the code is designed to evaluate a poker hand so you'd need to do a lookup on the sorted set anyway.
An example of a ready made Texas Hold'em 7- and 5-card evaluator can be found here with documentation and further explained here. All feedback welcome at the e-mail address found therein.
You don't need to sort, and can typically (~97% of the time) get away with just 6 additions and a couple of bit shifts when evaluating 7-card hands. The algo uses a generated look up table which occupies about 9MB of RAM and is generated in a near-instant. Cheap. All of this is done inside of 32-bits, and "inlining" the 7-card evaluator is good for evaluating about 50m randomly generated hands per second on my laptop.
Oh, and multiplication is faster than sorting.

Why is that data structures usually have a size of 2^n?

Is there a historical reason or something ? I've seen quite a few times something like char foo[256]; or #define BUF_SIZE 1024. Even I do mostly only use 2n sized buffers, mostly because I think it looks more elegant and that way I don't have to think of a specific number. But I'm not quite sure if that's the reason most people use them, more information would be appreciated.
There may be a number of reasons, although many people will as you say just do it out of habit.
One place where it is very useful is in the efficient implementation of circular buffers, especially on architectures where the % operator is expensive (those without a hardware divide - primarily 8 bit micro-controllers). By using a 2^n buffer in this case, the modulo, is simply a case of bit-masking the upper bits, or in the case of say a 256 byte buffer, simply using an 8-bit index and letting it wraparound.
In other cases alignment with page boundaries, caches etc. may provide opportunities for optimisation on some architectures - but that would be very architecture specific. But it may just be that such buffers provide the compiler with optimisation possibilities, so all other things being equal, why not?
Cache lines are usually some multiple of 2 (often 32 or 64). Data that is an integral multiple of that number would be able to fit into (and fully utilize) the corresponding number of cache lines. The more data you can pack into your cache, the better the performance.. so I think people who design their structures in that way are optimizing for that.
Another reason in addition to what everyone else has mentioned is, SSE instructions take multiple elements, and the number of elements input is always some power of two. Making the buffer a power of two guarantees you won't be reading unallocated memory. This only applies if you're actually using SSE instructions though.
I think in the end though, the overwhelming reason in most cases is that programmers like powers of two.
Hash Tables, Allocation by Pages
This really helps for hash tables, because you compute the index modulo the size, and if that size is a power of two, the modulus can be computed with a simple bitwise-and or & rather than using a much slower divide-class instruction implementing the % operator.
Looking at an old Intel i386 book, and is 2 cycles and div is 40 cycles. A disparity persists today due to the much greater fundamental complexity of division, even though the 1000x faster overall cycle times tend to hide the impact of even the slowest machine ops.
There was also a time when malloc overhead was occasionally avoided at great length. Allocation's available directly from the operating system would be (still are) a specific number of pages, and so a power of two would be likely to make the most use of the allocation granularity.
And, as others have noted, programmers like powers of two.
I can think of a few reasons off the top of my head:
2^n is a very common value in all of computer sizes. This is directly related to the way bits are represented in computers (2 possible values), which means variables tend to have ranges of values whose boundaries are 2^n.
Because of the point above, you'll often find the value 256 as the size of the buffer. This is because it is the largest number that can be stored in a byte. So, if you want to store a string together with a size of the string, then you'll be most efficient if you store it as: SIZE_BYTE+ARRAY, where the size byte tells you the size of the array. This means the array can be any size from 1 to 256.
Many other times, sizes are chosen based on physical things (for example, the size of the memory an operating system can choose from is related to the size of the registers of the CPU etc) and these are also going to be a specific amount of bits. Meaning, the amount of memory you can use will usually be some value of 2^n (for a 32bit system, 2^32).
There might be performance benefits/alignment issues for such values. Most processors can access a certain amount of bytes at a time, so even if you have a variable whose size is let's say) 20 bits, a 32 bit processor will still read 32 bits, no matter what. So it's often times more efficient to just make the variable 32 bits. Also, some processors require variables to be aligned to a certain amount of bytes (because they can't read memory from, for example, addresses in the memory that are odd). Of course, sometimes it's not about odd memory locations, but locations that are multiples of 4, or 6 of 8, etc. So in these cases, it's more efficient to just make buffers that will always be aligned.
Ok, those points came out a bit jumbled. Let me know if you need further explanation, especially point 4 which IMO is the most important.
Because of the simplicity (read also cost) of base 2 arithmetic in electronics: shift left (multiply by 2), shift right (divide by 2).
In the CPU domain, lots of constructs revolve around base 2 arithmetic. Busses (control & data) to access memory structure are often aligned on power 2. The cost of logic implementation in electronics (e.g. CPU) makes for arithmetics in base 2 compelling.
Of course, if we had analog computers, the story would be different.
FYI: the attributes of a system sitting at layer X is a direct consequence of the server layer attributes of the system sitting below i.e. layer < x. The reason I am stating this stems from some comments I received with regards to my posting.
E.g. the properties that can be manipulated at the "compiler" level are inherited & derived from the properties of the system below it i.e. the electronics in the CPU.
I was going to use the shift argument, but could think of a good reason to justify it.
One thing that is nice about a buffer that is a power of two is that circular buffer handling can use simple ands rather than divides:
#define BUFSIZE 1024
++index; // increment the index.
index &= BUFSIZE; // Make sure it stays in the buffer.
If it weren't a power of two, a divide would be necessary. In the olden days (and currently on small chips) that mattered.
It's also common for pagesizes to be powers of 2.
On linux I like to use getpagesize() when doing something like chunking a buffer and writing it to a socket or file descriptor.
It's makes a nice, round number in base 2. Just as 10, 100 or 1000000 are nice, round numbers in base 10.
If it wasn't a power of 2 (or something close such as 96=64+32 or 192=128+64), then you could wonder why there's the added precision. Not base 2 rounded size can come from external constraints or programmer ignorance. You'll want to know which one it is.
Other answers have pointed out a bunch of technical reasons as well that are valid in special cases. I won't repeat any of them here.
In hash tables, 2^n makes it easier to handle key collissions in a certain way. In general, when there is a key collission, you either make a substructure, e.g. a list, of all entries with the same hash value; or you find another free slot. You could just add 1 to the slot index until you find a free slot; but this strategy is not optimal, because it creates clusters of blocked places. A better strategy is to calculate a second hash number h2, so that gcd(n,h2)=1; then add h2 to the slot index until you find a free slot (with wrap around). If n is a power of 2, finding a h2 that fulfills gcd(n,h2)=1 is easy, every odd number will do.

Resources