Time complexity of basic instructions in C

I have a question about algorithmic complexity.
Do the basic operations in C all have equivalent complexity? If not, how would they be ordered:
if, reading/writing a single cell of a matrix, a+b, a*b, a = b ...
Thanks

No. The basic instructions in C cannot be ordered by any kind of wall-time or theoretic complexity. This is not specified and probably cannot be specified by the Standard; rather, these properties arise from the interaction of the code, the OS, and the underlying architecture.
I think you're looking for information on cycles per instruction.
However, even this is not the whole story. Modern CPUs have hierarchical caches. If your algorithm operates on data which is primarily in a fast cache, it will run much faster than a program which must repeatedly access data from RAM, the hard drive, or over a network. The amount of calculation done per load is an application's arithmetic intensity; roofline models provide a tool for thinking about this. You can achieve better cache utilization via blocking and other techniques, and the subfield of communication-avoiding algorithms explores this in depth.
Ultimately, the C language is a high-level abstraction of what a processor actually does. In standard cost models we think of all instructions as taking the same amount of time. In more accurate, but potentially more difficult to use, cache-aware cost models, data movement is treated as being more expensive.
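To make the data-movement point concrete, here is a small self-contained C sketch (my own illustration, not taken from any reference): both functions perform exactly the same O(N^2) additions, but the column-order traversal strides across memory and typically runs several times slower on real hardware.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2048

/* Row-order sum: consecutive accesses stay within the same cache lines. */
double sum_row_major(const double *m) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i * N + j];
    return s;
}

/* Same O(N^2) additions, but column-order accesses stride N doubles apart
   and miss the cache far more often. */
double sum_col_major(const double *m) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i * N + j];
    return s;
}

int main(void) {
    double *m = malloc(sizeof(double) * N * N);
    if (!m) return 1;
    for (long i = 0; i < (long)N * N; i++) m[i] = 1.0;

    clock_t t0 = clock();
    double a = sum_row_major(m);
    clock_t t1 = clock();
    double b = sum_col_major(m);
    clock_t t2 = clock();

    printf("row-major: %.3f s   col-major: %.3f s   (sums %.0f %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, a, b);
    free(m);
    return 0;
}

Compile with optimizations (e.g. -O2) and compare the two timings; the exact ratio depends on your cache sizes and hardware.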

Complexity is not about the time it takes to execute "basic" code lines like addition, multiplication, division and so on.
Even if these expressions have different execution time they all have complexity O(1).
Complexity is about what happens when some variable quantity changes. That quantity can be many different things. Some examples could be "the number of elements in an array", "the number of elements in a linked list", "the size of a file", "the size of a matrix".
For instance - if you write code that has to find the largest value in an array of integers, the execution time depends on the number of elements in the array. The code will have to visit every array element to check if it's larger than the previous elements. Consequently, the complexity is O(N), where N is the number of elements. From that we can't say how much time it will take to find the largest element but we can say that it will take 10 times longer to execute on a 1000 element array than on a 100 element array.
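In C, that scan might look like the following minimal sketch (the function name is mine):

#include <limits.h>

/* Returns the largest value in arr[0..n-1]; it visits every element exactly
   once, so the work grows linearly with n -- O(N). Assumes n >= 1. */
int find_max(const int *arr, int n) {
    int max = INT_MIN;
    for (int i = 0; i < n; i++) {
        if (arr[i] > max)
            max = arr[i];
    }
    return max;
}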
Now if you did the same with a linked list (i.e. find the largest element) the complexity would again be O(N). However, this does not say that a linked list performs just the same as an array. It only says that it scales in the same way as an array.
A simplified way to say it: if there are no loops involved, the complexity is always O(1).

Related

Is O(cn) at least as fast as O(n) in a non-asymptotic sense?

So first of all let me talk about the motivation for this question. Let's suppose you have to find the minimum and the maximum values in an array. In this case, you have two ways of doing so.
The first one consists in iterating over the array and finding the maximum value, then doing the same thing to find the minimum value. This solution is O(2n).
The second one consists in iterating over the array just one time and finding both the minimum and maximum value at the same time. This solution is O(n).
Even though the time complexity has been halved, for each iteration of the O(n) solution you now have twice as many instructions (ignoring how the compiler can possibly optimize these instructions), so I believe they should take the same amount of time to execute.
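To make the two variants concrete, here is a rough C sketch of both (function names are mine; both assume n >= 1):

/* Two passes: one for the maximum, one for the minimum -- "O(2n)". */
void minmax_two_passes(const int *a, int n, int *min, int *max) {
    *max = a[0];
    for (int i = 1; i < n; i++) if (a[i] > *max) *max = a[i];
    *min = a[0];
    for (int i = 1; i < n; i++) if (a[i] < *min) *min = a[i];
}

/* One pass doing both comparisons per element -- "O(n)", but roughly the
   same total number of comparisons as the version above. */
void minmax_one_pass(const int *a, int n, int *min, int *max) {
    *min = *max = a[0];
    for (int i = 1; i < n; i++) {
        if (a[i] > *max) *max = a[i];
        if (a[i] < *min) *min = a[i];
    }
}

Both versions perform roughly 2*(n-1) comparisons in total; the difference is only in how those comparisons are grouped into passes.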
Let me give you a second example. Now you need to reverse an array. Again, you have two ways of doing so.
The first one is to create a new array and iterate over the data array, filling the new array back to front. This solution is O(n).
The second one is to iterate over the data array, swapping the 0th and (n-1)th elements, then the 1st and (n-2)th elements, and so on (using this strategy) until you reach the middle of the array. This solution is O((1/2)n).
Again, even though the time complexity has been cut in half, you have three times more instructions per iteration. You're iterating over (1/2)n elements, but for each iteration you have to perform three XOR instructions. If you used an auxiliary variable instead of XOR, you would still need two extra instructions to perform the swap, so now I believe that O((1/2)n) should actually be worse than O(n).
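Again, as a rough C sketch (function names are mine; the in-place version uses a temporary variable rather than XOR, but the instruction count per swap is comparable):

/* Copy into a fresh array, back to front -- one pass over n elements. */
void reverse_copy(const int *src, int *dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[n - 1 - i];
}

/* In place: swap the ends and move toward the middle -- n/2 iterations,
   but each iteration does a three-step swap through a temporary. */
void reverse_in_place(int *a, int n) {
    for (int i = 0, j = n - 1; i < j; i++, j--) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}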
Having said these things, my question is the following:
Ignoring space complexity, garbage collection and possible compiler optimizations: given an O(c1*n) algorithm and an O(c2*n) algorithm with c1 > c2, can I be sure that the O(c1*n) algorithm is as fast as or faster than the O(c2*n) one?
This question is cool because it can make a difference in how I write code from here on. If the "more complex" (c1) way is as fast as the "less complex" (c2) way but more readable, I'm sticking with the "more complex" one.
c1 > c2, can I be sure that the algorithm that gives me O(c1n) is as fast or faster than the one that gives me O(c2n)?
The whole issue lies within the words "fast" or "faster". Computational complexity doesn't strictly measure what we intuitively understand as "fast". Without going into mathematical details (although it's a good idea: https://en.wikipedia.org/wiki/Big_O_notation), it answers the question "how much slower does it get as my input grows". So if you have O(n^2) complexity you can roughly expect that doubling the size of the input will make your algorithm take 4 times more time. Whereas for linear complexity, a 2 times bigger input only doubles the running time. As you can see, it's relative, so any constants cancel out.
To sum up: from the way you ask your question, it doesn't seem the big-O notation is the correct tool here.
By definition, if c1 and c2 are constants, O(c1*n) === O(c2*n) === O(n). That is, the number of operations per element of your array of length n is completely irrelevant in this kind of complexity analysis.
All that it will tell you is that "it's linear". That is, if you have 1 bazillion operations for an array of length n, then you'll have 2 bazillion operations for an array of length 2*n (plus or minus something that grows slower than linear).
can I assume that having O(c1n) and O(c2n) algorithms so that c1 > c2, can I be sure that the algorithm that gives me O(c1n) is as fast or faster than the one that gives me O(c2n)?
Nope, not at all.
First, because the constants there are meaningless in that analysis. There's no other way to put it: whatever restrictions you put on c1 and c2 are absolutely irrelevant for big-O analysis. The whole idea is that it discards those constants.
Second, because they don't tell you anything that would enable you to compare the two algorithms runtime for a specific value of n.
Such complexity analysis only enables you to compare the asymptotic behavior of algorithms. Real-world problems in general don't care about where the asymptotes are.
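For reference, the usual formal definition (a standard textbook statement, not specific to this question) shows directly why the constants vanish:

$$f(n) \in O(g(n)) \iff \exists\, c > 0,\ \exists\, n_0 \ \text{such that}\ f(n) \le c \cdot g(n) \ \text{for all}\ n \ge n_0.$$

Taking f(n) = c1*n and g(n) = n, the witness c = c1 shows that c1*n is in O(n); the same argument with c = c2 shows that c2*n is in O(n). Both algorithms therefore land in exactly the same class, no matter how c1 and c2 compare.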
Assume that A1(n) is the number of operations Algorithm 1 needs for an input of length n, and A2(n) is the same for Algorithm 2. You could have:
A1(n) = 10n + 900
A2(n) = 100n
The complexity of both is O(A1) = O(A2) = O(n). For small inputs, A2 is faster. For large inputs, A1 is faster. The point where they change is n == 10.
This question is cool because it can make a difference in how I write code from here on. If the "more complex" (c1) way is as fast as the "less complex" (c2) way but more readable, I'm sticking with the "more complex" one.
Not only that, but also there's the fact that when you have 2 different algorithms that are really of different complexity classes (e.g., linear vs quadratic), it might still make sense to use the one of higher complexity as it may still be faster.
For example:
A3(n) = n^2
A4(n) = n + 10^20.
That is, Algorithm 3 is quadratic, while Algorithm 4 is linear but has a huge constant initialization cost.
For inputs of size of up to around n == 10^10, it will be faster to use the quadratic algorithm.
It may very well be the case that all relevant inputs for your specific problem fall within that range, meaning that the quadratic algorithm would be the better, faster choice.
The bottom line is: for analyzing the actual time it will take to run an algorithm on a given input (or a given bounded range of inputs, as nearly all real-world problems are) and compare it with another algorithm, big-O analysis is meaningless.
Another way to put it: you're asking a practical "engineering" question (i.e., which option is better / faster) but trying to answer the question with a tool that's only useful for "theoretical" analysis. That tool is important, yes. But it has no chance of giving you the answer you're looking for, by design.
By definition, time complexity ignores constants. So O((1/2)n) == O(n) == O(2n) == O(cn).
Your example of O((1/2)n) shows why this is the case, because the constants can measure units of anything, so comparing them is meaningless.
You can never tell which algorithm is faster based only on the time complexity. But, you can tell which one would be faster as n approaches infinity. Since constants are removed from the time complexity, they would be considered equal and therefore with O(c1n) and O(c2n) you still would not be able to tell which one is faster even as n approaches infinity.
(my theoretical computer science courses were a couple of decades ago)
O(cn) is O(n).
It's still a linear search over the array.

Is array access always constant-time / O(1)?

From Richard Bird, Pearls of Functional Algorithm Design (2010), page 6:
For a pure functional programmer, an update operation takes logarithmic time in the size of the array. To be fair, procedural programmers also appreciate that constant-time indexing and updating are only possible when the arrays are small.
Under what conditions do arrays have non-constant-time access? Is this related to CPU cache?
Most modern machine architectures try to offer small unit time access to memory.
They fail. Instead we see layers of caches with differing speeds.
The problem is simple: the speed of light. If you need an enormous memory [array] (in the extreme, imagine a memory the size of the Andromeda galaxy), it will take enormous space, and light cannot cross enormous space in short periods of time. Information cannot travel faster than the speed of light. You are screwed by physics from the start.
So the best you can do is build part of memory "nearby" where light takes only fractions of nanosecond to traverse (thus registers and L1 cache), and part of the memory far away (the disk drive). Other practical complications ensue, such as capacitance (think inertia) to slow down access to things further away.
Now, if you are willing to take the access time of your farthest memory element as "unit" time, yes, access to everything takes the same amount of time, i.e., O(1). In practical computing, we treat RAM memory this way most of the time, and we leave out other, slower devices to avoid screwing up our simple model.
Then you discover people that aren't satisfied with that, and voila, you have people optimizing for cache line access. So it may be O(1) in theory, and acts like O(1) for small arrays (that fit in the first level of cache), but it often is not in practice.
An extreme practical case is an array that doesn't fit in main memory; now an array access may cause paging from the disk.
Sometimes we don't care even in that case. Google is essentially a giant cache. We tend to think of Google searches as O(1).
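If you want to see the non-uniformity for yourself, here is a rough C sketch (my own, with arbitrary sizes): it performs the same number of pseudo-random reads against arrays of growing size, and the measured time per access typically jumps each time the working set outgrows another cache level.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const long reads = 1L << 25;                       /* same read count per round */
    for (long n = 1L << 10; n <= 1L << 26; n <<= 4) {  /* 4 KB ... 256 MB of ints */
        int *a = malloc(n * sizeof *a);
        if (!a) break;
        for (long i = 0; i < n; i++) a[i] = (int)i;

        long sum = 0;
        unsigned long long x = 12345;                  /* cheap LCG to scramble indices */
        clock_t t0 = clock();
        for (long i = 0; i < reads; i++) {
            x = x * 6364136223846793005ULL + 1;
            sum += a[(x >> 20) % (unsigned long long)n];
        }
        clock_t t1 = clock();

        printf("%9ld ints: %6.1f ns/access (sum %ld)\n", n,
               (double)(t1 - t0) / CLOCKS_PER_SEC / reads * 1e9, sum);
        free(a);
    }
    return 0;
}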
Can oranges be red?
Yes, they can be red for a number of reasons -
You color them red.
You grow a genetically modified variety.
You grow them on Mars, the red planet, where everything is supposed to look red.
The list of reasons, some practical (given today's technology) and some impractical (fiction, or a future reality), goes on...
The point is, I think the question you are asking is really about two orthogonal concepts. Namely -
Big O Notation - "In mathematics, big O notation describes the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions."
vs
Practicalities (hardware and software) a good software engineer should be aware of, while architecting / designing their app and writing code.
In other words, while the concept of Big O notation can be called academic, it is the most appropriate way of classifying an algorithm's complexity (time / space), and that's where it ends. There is no need to muddy the waters with orthogonal concerns.
To be clear, I am not saying that one should not be aware of the under-the-hood implementation details and workings of things that affect the performance of the software you write, but there is no point in mixing the two together. For example, does it make sense to say -
Arrays do not have constant time access (with indexes) because -
Large arrays do not fit in CPU cache, and hence incur high cost of cache misses.
On a system under memory pressure, the array, big or small, has been swapped out from Physical Memory to Hard Disk, and not only is impacted by a cache miss, but also a hard page fault.
On a system under extreme CPU load, the thread which reads the array can be pre-empted, and may not get a chance to execute for several seconds.
On a hypothetical OS which backs its memory not just with disk, but with additional memory on another computer in another corner of the world, array access will be unimaginably slow.
Like my orange example, as you read through these increasingly absurd examples, I hope the point I am trying to make is clear.
Conclusion - Any day, I'd answer the question "Do arrays have constant-time O(1) access (with indexes)?" with yes: without any doubt or ifs and buts, they do.
EDIT:
Put it another way - If O(1) is not the answer.. then neither is O(log n), or O(n log n), or O(n^2) or O(n ^ 3)..... and certainly not 42.
He is talking about computation models, and in particular the word-based RAM machine.
A RAM machine is a formalization of something very similar to an actual computer: we model the computer memory as a big array of memory words of w bits each, and we can read/write any word in O(1) time.
But we still have something important to define: how large should a word be?
We need a word size w = Ω(log n) to be able at least to address the n parts of our input.
For this reason, word-based RAMs normally assume a word length of O(log n).
But having the word length of your machine depend on the size of the input appears strange and unrealistic.
What if we keep the word length fixed? Then even following a pointer needs Ω(log n) time just to read the entire pointer.
We need Ω(log n) words to store a pointer and Ω(log n) time to access an input element.
If a language supports sparse arrays, access to the array would have to go through a directory, and a tree-structured directory would have non-constant access time. Or did you mean realistic conditions? ;-)

What is a good measure for comparing algorithms?

Well, I was reading an article about comparing two algorithms by first analyzing them.
My teacher taught me that you can analyze an algorithm by directly counting the number of steps it takes.
For example:
/* assumes #include <stdio.h> */
void printArray(int arr[], int n) {
    for (int i = 0; i < n; i++)
        printf("%d\n", arr[i]);
}
will have complexity O(N), where N is the size of the array, since it repeats the for loop body N times.
while
void printMatrix(int m, int n, int arr[m][n]) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            printf("%d ", arr[i][j]);
        }
    }
}
will have complexity O(M*N), which is O(N^2) when M = N, since the statements inside the inner for loop are executed M*N times.
Similarly, an algorithm is O(log N) if it divides its input into 2 equal parts at each step, and so on.
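For example, binary search halves the remaining range at every step; a minimal C sketch:

/* Returns the index of key in the sorted array a[0..n-1], or -1 if absent.
   Each iteration halves the search range, so the loop runs O(log N) times. */
int binary_search(const int *a, int n, int key) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] == key) return mid;
        if (a[mid] < key)  lo = mid + 1;
        else               hi = mid - 1;
    }
    return -1;
}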
But according to that article:
The measures "execution time" and "number of statements" aren't good for analyzing the algorithm,
because:
execution time is system-dependent, and
the number of statements varies with the programming language used.
and it states that
the ideal solution is to express the running time of the algorithm as a function of the input size N, that is, f(n).
That confused me a little. How can you calculate running time if you consider execution time not to be a good measure?
Can experts here please elaborate on this?
Thanks in advance.
When you say "complexity of O(N)", that is referred to as "Big-O notation", which is the same as the "ideal solution" you mentioned in your post. It is a way of expressing run time as a function of input size.
I think where you got confused was when it said "express running time" - it didn't mean express it as a numerical value (which is what execution time is), it meant express it in Big-O notation. I think you just got tripped up on the terminology.
Execution time is indeed system-dependent, but it also depends on the number of instructions the algorithm executes.
Also, I do not understand how the number of steps is irrelevant, given that algorithms are analyzed as language-agnostic and without paying any attention to whatever features and syntactic sugar various languages provide.
The one measure of algorithm analysis I have always encountered since I started analyzing algorithms is the number of executed instructions and I fail to see how this metric may be irrelevant.
At the same time, complexity classes are meant as an "order of magnitude" indication of how fast or slow an algorithm is. They are dependent on the number of executed instructions and independent of the system the algorithm runs on, because by definition an elementary operation (such as the addition of two numbers) should take constant time, however large or small that "constant" is in practice; therefore complexity classes do not change. The constants inside the expression for the exact complexity function may indeed vary from system to system, but what is actually relevant for algorithm comparison is the complexity class, as only by comparing those can you find out how an algorithm behaves on increasingly large inputs (asymptotically) compared to another algorithm.
Big-O notation waves away constants (both fixed cost and constant multipliers). So any function that takes kn+c operations to complete is (by definition!) O(n), regardless of k and c. This is why it's often better to take real-world measurements (profiling) of your algorithms in action with real data, to see how fast they effectively are.
But execution time, obviously, varies depending on the data set -- if you're trying to come up with a general measure of performance that's not based on a specific usage scenario, then execution time is less valuable (unless you're comparing all algorithms under the same conditions, and even then it's not necessarily fair unless you model the majority of possible scenarios, and not just one).
Big-O notation becomes more valuable as you move to larger data sets. It gives you a rough idea of the performance of an algorithm, assuming reasonable values for k and c. If you have a million numbers you want to sort, then it's safe to say you want to stay away from any O(n^2) algorithm, and try to find a better O(n lg n) algorithm. If you're sorting three numbers, the theoretical complexity bound doesn't matter anymore, because the constants dominate the resources taken.
Note also that while the number of statements a given algorithm can be expressed in varies wildly between programming languages, the number of constant-time steps that need to be executed (at the machine level for your target architecture, which is typically one where integer arithmetic and memory accesses take a fixed amount of time, or more precisely are bounded by a fixed amount of time) does not. It is this bound on the maximum number of fixed-cost steps required by an algorithm that big-O measures, which has no direct relation to actual running time for a given input, yet still describes roughly how much work must be done for a given data set as the size of the set grows.
In comparing algorithms, execution speed is important, as mentioned by others, but other factors like memory space are crucial too.
Memory space also uses order-of-complexity notation.
Code could sort an array in place using a bubble sort, needing only a handful of extra memory, O(1). Other methods, though faster, may need O(log N) memory.
Other, more esoteric measures include code complexity, like cyclomatic complexity, and readability.
Traditionally, computer science measures algorithm effectiveness (speed) by the number of comparisons or sometimes data accesses, using "Big O notation". This is so because the number of comparisons (and/or data accesses) is a good mathematical model to describe the efficiency of certain algorithms, searching and sorting ones in particular, where O(log n) is considered the fastest possible search in theory.
This theoretical model has always had several flaws though. It assumes that comparisons (and/or data accesses) are what takes time, and that the time for performing things like function calls and branching/looping is negligible. This is of course nonsense in the real world.
In the real world, a recursive binary search algorithm might for example be extremely slow compared to a quick & dirty linear search implemented with a plain for loop, because on the given system, the function call overhead is what takes the most time, not the comparisons.
There are a whole lot of things that affect performance. As CPUs evolve, more such things are invented. Nowadays, you might have to consider things like data alignment, instruction pipelining, branch prediction, data cache memory, multiple CPU cores and so on. All these technologies make traditional algorithm theory rather irrelevant.
To write the most effective code possible, you need to have a specific system in mind and you need in-depth knowledge about said system. Fortunately, compilers have evolved a lot too, so a lot of the in-depth system knowledge can be left to the person who implements a compiler port for the specific system.
Generally, I think many programmers today spend far too much time pondering program speed and coming up with "clever things" to get better performance. Back in the days when CPUs were slow and compilers were terrible, such things were very important. But today, a good, modern programmer focuses on making the code bug-free, readable, maintainable, reusable, secure, portable and so on. It doesn't matter how fast your program is if it is a buggy mess of unreadable crap. So deal with performance when the need arises.

Counting FLOPs and data size to check whether a function is memory-bound or CPU-bound

I am going to analyse and optimize some C code, and therefore I first have to check whether the functions I want to optimize are memory-bound or CPU-bound. In general I know how to do this, but I have some questions about counting floating-point operations and analysing the size of the data that is used. Look at the following for loop, which I want to analyse. The values of the array are doubles (that is, 8 bytes each):
for (int j = 0; j < N; j++) {
    for (int i = 1; i < Nt; i++) {
        matrix[j*Nt+i] = matrix[j*Nt+i-1] * mu + matrix[j*Nt+i] * sigma;
    }
}
1) How many floating-point operations do you count? I thought about 3*(Nt-1)*N... but do I have to count the operations inside the array subscripts, too (matrix[j*Nt+i], which is 2 more operations per access)?
2) How much data is transferred? 2*((Nt-1)*N)*8 bytes or 3*((Nt-1)*N)*8 bytes? I mean, every entry of the matrix has to be loaded. After the calculation, the new value is saved to that index of the array (so far that is 1 load and 1 store). But this value is used for the next calculation. Is another load operation needed for that, or is this value (matrix[j*Nt+i-1]) already available without a load operation?
Thx a lot!!!
With this type of code, the direct sort of analysis you are proposing to do can be almost completely misleading. The only meaningful information about the performance of the code is actually measuring how fast it runs in practice (benchmarking).
This is because modern compilers and processors are very clever about optimizing code like this, and it will end up executing in a way which is nothing like your straightforward analysis. The compiler will optimize the code, rearranging the individual operations. The processor will itself try to execute the individual sub-operations in parallel and/or pipelined, so that for example computation is occurring while data is being fetched from memory.
It's useful to think about algorithmic complexity, to distinguish between O(n) and O(n²) and so on, but constant factors (like the 2*... or 3*... you ask about) are completely moot because they vary in practice depending on lots of details.
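If you do want numbers, the simplest thing is to time the loop itself. A minimal harness might look like this (a sketch only; N, Nt, mu and sigma are placeholder values you would replace with your real ones, and the GFLOP/s figure uses the naive 3-flops-per-update count):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int N = 1000, Nt = 10000;          /* placeholder sizes */
    const double mu = 0.5, sigma = 0.5;      /* placeholder coefficients */
    double *matrix = malloc(sizeof(double) * (size_t)N * Nt);
    if (!matrix) return 1;
    for (long k = 0; k < (long)N * Nt; k++) matrix[k] = 1.0;

    clock_t t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 1; i < Nt; i++)
            matrix[j*Nt + i] = matrix[j*Nt + i - 1] * mu + matrix[j*Nt + i] * sigma;
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double gflops = 3.0 * (Nt - 1) * N / secs / 1e9;   /* naive 3 flops per update */
    printf("%.3f s, ~%.2f GFLOP/s (naive count), check=%g\n",
           secs, gflops, matrix[(long)N * Nt - 1]);
    free(matrix);
    return 0;
}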

Which is faster — sorting or multiplying a small array of elements?

Reading through Cactus Kev's Poker Hand Evaluator, I noticed the following statements:
At first, I thought that I could always simply sort the hand first before passing it to the evaluator; but sorting takes time, and I didn't want to waste any CPU cycles sorting hands. I needed a method that didn't care what order the five cards were given as.
...
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
...
Since multiplication is one of the fastest calculations a computer can make, we have shaved hundreds of milliseconds off our time had we been forced to sort each hand before evaluation.
I have a hard time believing this.
Cactus Kev represents each card as a 4-byte integer, and evaluates hands by calling eval_5cards( int c1, int c2, int c3, int c4, int c5 ). We could represent cards as one byte, and a poker hand as a 5-byte array. Sorting this 5-byte array to get a unique hand must be pretty fast. Is it faster than his approach?
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them? If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
Thanks!
Good answers everyone; I'm working on benchmarking the performance of sorting vs multiplication, to get some hard performance statistics.
Of course it depends a lot on the CPU of your computer, but a typical Intel CPU (e.g. Core 2 Duo) can multiply two 32-bit numbers within 3 CPU clock cycles. For a sort algorithm to beat that, the algorithm needs to be faster than 3 * 4 = 12 CPU cycles, which is a very tight constraint. None of the standard sorting algorithms can do it in less than 12 cycles for sure. The comparison of two numbers alone will take one CPU cycle, the conditional branch on the result will also take one CPU cycle, and whatever you do then will take at least one more CPU cycle (swapping two cards will actually take at least 4 CPU cycles). So multiplying wins.
Of course this is not taking the latency into account to fetch the card value from either 1st or 2nd level cache or maybe even memory; however, this latency applies to either case, multiplying and sorting.
Without testing, I'm sympathetic to his argument. You can do it in 4 multiplications, as compared to sorting, which is n log n. Specifically, the optimal sorting network requires 9 comparisons. The evaluator then has to at least look at every element of the sorted array, which is another 5 operations.
Sorting is not intrinsically harder than multiplying numbers. On paper, they're about the same, and you also need a sophisticated multiplication algorithm to make large multiplication competitive with large sort. Moreover, when the proposed multiplication algorithm is feasible, you can also use bucket sort, which is asymptotically faster.
However, a poker hand is not an asymptotic problem. It's just 5 cards and he only cares about one of the 13 number values of the card. Even if multiplication is complicated in principle, in practice it is implemented in microcode and it's incredibly fast. What he's doing works.
Now, if you're interested in the theoretical question, there is also a solution using addition rather than multiplication. There can only be 4 cards of any one value, so you could just as well assign the values 1,5,25,...,5^12 and add them. It still fits in 32-bit arithmetic. There are also other addition-based solutions with other mathematical properties. But it really doesn't matter, because microcoded arithmetic is so much faster than anything else that the computer is doing.
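For concreteness, here is a rough C sketch of both order-independent encodings (function names and the exact tables are my own choices; card ranks are assumed to be 0..12):

/* Prime-product key: assign a prime to each rank and multiply; the product is
   the same regardless of card order. The table is simply the first 13 primes,
   and the largest possible product (41^4 * 37) still fits in 32 bits. */
unsigned int prime_key(const int ranks[5]) {
    static const unsigned int primes[13] =
        {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41};
    unsigned int key = 1;
    for (int i = 0; i < 5; i++)
        key *= primes[ranks[i]];
    return key;
}

/* Addition-based alternative described above: weight rank r by 5^r and add.
   At most four cards share a rank, so no base-5 digit can overflow, and the
   maximum possible sum (below 5 * 5^12) still fits in 32 bits. */
unsigned int base5_key(const int ranks[5]) {
    static const unsigned int pow5[13] =
        {1, 5, 25, 125, 625, 3125, 15625, 78125, 390625,
         1953125, 9765625, 48828125, 244140625};
    unsigned int key = 0;
    for (int i = 0; i < 5; i++)
        key += pow5[ranks[i]];
    return key;
}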
5 elements can be sorted using an optimized decision tree, which is much faster than using a general-purpose sorting algorithm.
However, the fact remains that sorting means lots of branches (as do the comparisons that are necessary afterwards). Branches are really bad for modern pipelined CPU architectures, especially branches that go either way with similar likelihood (thus defeating branch prediction logic). That, much more than the theoretical cost of multiplication vs. comparisons, makes multiplication faster.
But if you could build custom hardware to do the sorting, it might end up faster.
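For the curious, here is roughly what a branch-light small sort can look like in C, using a 9-comparator sorting network for 5 values (macro and function names are mine; compilers often turn the conditionals into conditional-move or min/max instructions rather than branches):

/* Compare-exchange: put the smaller value in a, the larger in b. */
#define SWAP_IF(a, b) do { int _lo = (a) < (b) ? (a) : (b); \
                           int _hi = (a) < (b) ? (b) : (a); \
                           (a) = _lo; (b) = _hi; } while (0)

/* Sort exactly 5 ints with 9 compare-exchange steps (an optimal network). */
void sort5(int v[5]) {
    SWAP_IF(v[0], v[3]); SWAP_IF(v[1], v[4]);
    SWAP_IF(v[0], v[2]); SWAP_IF(v[1], v[3]);
    SWAP_IF(v[0], v[1]); SWAP_IF(v[2], v[4]);
    SWAP_IF(v[1], v[2]); SWAP_IF(v[3], v[4]);
    SWAP_IF(v[2], v[3]);
}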
That shouldn't really be relevant, but he is correct. Sorting takes much longer than multiplying.
The real question is what he did with the resulting prime number, and how that was helpful (since I would expect factoring it to take longer than sorting).
It's hard to think of any sorting operation that could be faster than multiplying the same set of numbers. At the processor level, the multiplication is just load, load, multiply, load, multiply, ..., with maybe some manipulation of the accumulator thrown in. It's linear, easily pipelined, no comparisons with the associated branch mis-prediction costs. It should average about 2 instructions per value to be multiplied. Unless the multiply instruction is painfully slow, it's really hard to imagine a faster sort.
One thing worth mentioning is that even if your CPU's multiply instruction is dead slow (or nonexistent...) you can use a lookup table to speed things even further.
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
That's an example of a non-positional number system.
I can't find the link to the theory. I studied it as part of applied algebra, somewhere around Euler's totient and encryption. (I may be wrong about the terminology, as I studied all of this in my native language.)
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them?
RAM is an external resource and is generally slower compared to the CPU. Sorting 5 ints would always have to go to RAM due to swap operations. Add to that the overhead of the sorting function itself, and multiplication stops looking all that bad.
I think that on modern CPUs integer multiplication would pretty much always be faster than sorting, since several multiplications can be executed at the same time on different ALUs, while there is only one bus connecting the CPU to RAM.
If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
5 integers can be sorted quite quickly using bubble sort: qsort would use more memory (for recursion), while a well-optimized bubble sort would work entirely from the data cache.
As others have pointed out, sorting alone isn't quicker than multiplying for 5 values. This ignores, however, the rest of his solution. After disdaining a 5-element sort, he proceeds to do a binary search over an array of 4888 values - at least 12 comparisons, more than the sort ever required!
Note that I'm not saying there's a better solution that involves sorting - I haven't given it enough thought, personally - just that sorting alone is only part of the problem.
He also didn't have to use primes. If he simply encoded the value of each card in 4 bits, he'd need 20 bits to represent a hand, giving a range of 0 to 2^20 = 1048576, about 1/100th of the range produced using primes, and small enough (though still suffering cache coherency issues) to produce a lookup table over.
Of course, an even more interesting variant is to take 7 cards, such as are found in games like Texas Holdem, and find the best 5 card hand that can be made from them.
The multiplication is faster.
Multiplication of any given array will always be faster than sorting the array, presuming the multiplication results in a meaningful result, and the lookup table is irrelevant because the code is designed to evaluate a poker hand so you'd need to do a lookup on the sorted set anyway.
An example of a ready made Texas Hold'em 7- and 5-card evaluator can be found here with documentation and further explained here. All feedback welcome at the e-mail address found therein.
You don't need to sort, and can typically (~97% of the time) get away with just 6 additions and a couple of bit shifts when evaluating 7-card hands. The algorithm uses a generated lookup table which occupies about 9 MB of RAM and is generated in a near instant. Cheap. All of this is done within 32 bits, and "inlining" the 7-card evaluator is good for evaluating about 50 million randomly generated hands per second on my laptop.
Oh, and multiplication is faster than sorting.
