Optimization of variable multiplication - c

[This was initially on matrices, but I guess it applies to any variable generically]
Say we have Var1 * Var2 * Var3 * Var4.
One of them changes sporadically, and which one changes is random.
Is it possible to minimize multiplications?
If I do the following when Var1 changes:
newVar1 * savedVar2Var3Var4
then I noticed that I need to recalculate savedVar2Var3Var4 each time Var2, Var3, or Var4 changes.
Would that recalculation of 'saved combinations' defeat the purpose?

If you had a lot more numbers to multiply or if multiplication was extremely expensive then there is one thing I can think of to do.
If you had a huge number of numbers to multiply then you could separate them into sub-sets and memoize the product of each set. When a particular set changes, because one of its members changed, the memoized product becomes invalid and needs to be recomputed. You could do this at several levels depending on how expensive multiplication is, how much memory you have available, and how often things change.
How to best implement this in C probably depends on how the variables go about changing -- if an event comes in that says "here is a new value for C" then you can invalidate all products that had C in them (or check that the old C actually differs from the new C before invalidating). If they are volatile variables then you will probably just have to compare each of the current values to the previous values (and this will probably take as much or more time as just multiplying on any machine with a hardware multiply instruction).
So, if you have:
answer = A * B * C * D * E * F * G * H;
then you could separate those out to:
answer = ( (A * B) * (C * D) ) * ( (E * F) * (G * H) );
Then, if rather than having this multiplication done directly in C you were to do it on an expression tree:
                    answer
                       *
                    /     \
                 /           \
            ABCD               EFGH
              *                  *
           /     \            /     \
         AB       CD        EF       GH
          *        *         *        *
         / \      / \       / \      / \
        A   B    C   D     E   F    G   H
Then at each level (well maybe just the top few levels) you could have a memoized sub-answer as well as some data to tell you if the variables below it had changed. If events come in to tell you to change a variable then that could cause the invalidation of the expression to propagate upward upon receipt of the event (or just recompute the memoized sub-answers for each event). If variables just magically change and you have to examine them to tell that they did change then you have more work to do.
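Here is a minimal C sketch of that two-level scheme for eight scalars; the struct layout, the names, and the dirty-flag bookkeeping are mine, purely to illustrate the invalidation idea:
#include <stdbool.h>
/* Hypothetical two-level cache for answer = (A*B*C*D) * (E*F*G*H). */
typedef struct {
    double leaf[8];        /* A .. H */
    double low[4];         /* AB, CD, EF, GH */
    double mid[2];         /* ABCD, EFGH */
    bool   low_dirty[4];
    bool   mid_dirty[2];
} ProdTree;
/* An event "leaf i changed" marks the cached products above it invalid. */
static void set_leaf(ProdTree *t, int i, double v) {
    t->leaf[i] = v;
    t->low_dirty[i / 2] = true;
    t->mid_dirty[i / 4] = true;
}
static double product(ProdTree *t) {
    int i;
    for (i = 0; i < 4; i++)
        if (t->low_dirty[i]) {
            t->low[i] = t->leaf[2*i] * t->leaf[2*i + 1];
            t->low_dirty[i] = false;
        }
    for (i = 0; i < 2; i++)
        if (t->mid_dirty[i]) {
            t->mid[i] = t->low[2*i] * t->low[2*i + 1];
            t->mid_dirty[i] = false;
        }
    return t->mid[0] * t->mid[1];   /* the top-level multiply is always redone */
}
Mark every flag dirty (or call set_leaf() once per leaf) before the first product() call; after that, a change to a single leaf costs at most three multiplications instead of the seven a full recomputation needs.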
Oh, another way to do this just popped in my head, and I'm ashamed that I didn't think of it earlier. If you do know the old and new values of a variable that has changed then, as long as the old value was not 0, you could just:
new_answer = (old_answer * new_var) / old_var;
In real math this would work, but in computer math this might lose too much precision for your purposes.

In the first place, micro-optimizations like this are almost never worthwhile. Time your program to see if there is a performance problem, profile to see where the problem is, and test after making changes to see if you've made things better or worse.
In the second place, multiplications of numbers are generally fast in modern CPUs, while branches can be more expensive.
In the third place, the way you're setting it up, if Var1 changes, you'll need to recalculate savedVar1Var2Var3, savedVar1Var2Var4, savedVar1Var3Var4, and the whole product. Obviously, you're better off just recalculating the total.

Yes, it is possible.
For scalars there will probably be no benefit. For largish matrix math, you could compute and store Var1*Var2 and Var3*Var4; your result is the product of these 2 things. Now when one variable changes you only need 2 multiplications instead of 3: update whichever of the 2 stored products contains the changed variable, then update the result.
There you have it, 2 multiplications instead of 3 with each update. This will only benefit you if the common case really is for only one of them to change at a time, but if that's true it should help a lot.

I don't think you save any time. Every time one of the N variables changes, you need to calculate (N - 1) additional products, right? Say you have A, B, C, and D. A changes, and you have saved the product of B, C, and D, but now you must recalculate your cached ABC, ABD, and ACD products. You are, in fact, doing additional work. ABCD is three multiply operations, while ABCD, ABC, ACD, and ABD work out to SEVEN.

The answer depends on how often the values change. With your example, calculating savedVar2Var3Var4 costs you two multiplications, with one additional multiplication each time Var1 changes (or you otherwise need to calculate the total). So: how many times do Var2, Var3, Var4 change, compared to Var1?
If Var1 changes more than about 3 times as often as the others, it should be worth recalculating savedVar2Var3Var4 as needed.

I don't think the gain is worth the effort, unless your "multiply" operation involves heavy calculations (matrices?).
edit: I've added an example that shows you... it's not worth it :)
/* T stands in for whatever type is being multiplied; a plain double here,
   but it could just as well be a matrix type. */
typedef double T;
T multiply(T v1, T v2, T v3, T v4)
{
    /* Cached pairwise products and the last-seen operand values.
       In C, function-scope statics need constant initializers, so the
       caches are filled on the first call instead. */
    static int first = 1;
    static T v1xv2, v1xv3, v1xv4, v2xv3, v2xv4, v3xv4;
    static T v1save, v2save, v3save, v4save;
    if (first)
    {
        first = 0;
        v1save = v1; v2save = v2; v3save = v3; v4save = v4;
        v1xv2 = v1*v2; v1xv3 = v1*v3; v1xv4 = v1*v4;
        v2xv3 = v2*v3; v2xv4 = v2*v4; v3xv4 = v3*v4;
    }
    if (v1save != v1)
    {
        v1save = v1;
        v1xv2 = v1*v2;
        v1xv3 = v1*v3;
        v1xv4 = v1*v4;
    }
    if (v2save != v2)
    {
        v2save = v2;
        v1xv2 = v1*v2;
        v2xv3 = v2*v3;
        v2xv4 = v2*v4;
    }
    if (v3save != v3)
    {
        v3save = v3;
        v1xv3 = v1*v3;
        v2xv3 = v2*v3;
        v3xv4 = v3*v4;
    }
    if (v4save != v4)
    {
        v4save = v4;
        v1xv4 = v1*v4;
        v2xv4 = v2*v4;
        v3xv4 = v3*v4;
    }
    return v1xv2*v3xv4;
}

Suppose you had a sum of many many variables, like Sum = A+B+C+D+.... and one of them changed, like C. If C' is the old value of C, then you could just say Sum += (C-C');
Same idea for a product: Product *= C/C';. (For matrices, they would have to be invertible.)
Of course, you might get creeping roundoff errors, so once in a while you could recalculate the whole thing.
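A hedged C sketch of that update-by-delta idea for the product case (the function names are mine); the guard against a zero old value and the occasional full recompute are exactly the caveats mentioned above:
/* Recompute the product from scratch (used initially, and now and then to flush rounding error). */
static double product_full(const double *v, int n) {
    double p = 1.0;
    int i;
    for (i = 0; i < n; i++)
        p *= v[i];
    return p;
}
/* Called when one factor changes from old_val to new_val; only valid if old_val != 0. */
static double product_update(double product, double old_val, double new_val) {
    return (product / old_val) * new_val;
}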

I would try something like this:
var12 = var1*var2;
var34 = var3*var4;
result = var12*var34;
while (values available) {
    get new values;
    if (var1 changes || var2 changes) var12 = var1*var2;
    if (var3 changes || var4 changes) var34 = var3*var4;
    result = var12*var34;
}
There is no overhead beyond the change checking, and it can be used for matrices (it doesn't rely on commutativity, only associativity).

Related

Is there a way to customize the default parallelization behavior of whole-array statements in Chapel?

According to the available documentation for Chapel, (whole-)array
statements like
A = B + alpha * C; // with A, B, and C being arrays, and alpha some scalar
are implemented in the language as the following forall iteration:
forall (a,b,c) in zip(A,B,C) do
a = b + alpha * c;
Thus, array statements are by default executed by a team of parallel
threads. Unfortunately, this also seems to completely preclude the
(either partial or complete) vectorization of such statements. This
can lead to performance surprises for programmers who are used to languages like Fortran or Python/Numpy (where the default behavior typically is to have array statements be only vectorized).
For codes that use (whole-)array statements with arrays of small to
moderate size, the loss of vectorization (confirmed by Linux hardware
performance counters) and the significant overhead inherent to
parallel threads (which are unsuited to effectively exploit the
fine-grained data-parallelism available in such problems) can result
in significant loss of performance. As an example consider the
following versions of Jacobi iteration that all solve the same problem
on a domain of 300 x 300 zones:
Jacobi_1 employs array-statements, as follows:
/*
* Jacobi_1
*
* This program (adapted from the Chapel distribution) performs
* niter iterations of the Jacobi method for the Laplace equation
* using (whole-)array statements.
*
*/
config var n = 300; // size of n x n grid
config var niter = 10000; // number of iterations to perform
proc main() {
const Domain = {0..n+1,0..n+1}; // domain including boundary points
var iteration = 0; // iteration counter
var X, XNew: [Domain] real = 0.0; // declare arrays:
// X stores approximate solution
// XNew stores the next solution
X[n+1,1..n] = 1.0; // Set south boundary values to 1.0
do {
// compute next approximation
XNew[1..n,1..n] =
( X[0..n-1,1..n] + X[2..n+1,1..n] +
X[1..n,2..n+1] + X[1..n,0..n-1] ) / 4.0;
// update X with next approximation
X[1..n,1..n] = XNew[1..n,1..n];
// advance iteration counter
iteration += 1;
} while (iteration < niter);
writeln("Jacobi computation complete.");
writeln("# of iterations: ", iteration);
} // main
Jacobi_2 employs serial for-loops throughout (i.e. only (auto-)vectorization
by the back-end C-compiler is allowed):
/*
* Jacobi_2
*
* This program (adapted from the Chapel distribution) performs
* niter iterations of the Jacobi method for the Laplace equation
* using (serial) for-loops.
*
*/
config var n = 300; // size of n x n grid
config var niter = 10000; // number of iterations to perform
proc main() {
const Domain = {0..n+1,0..n+1}; // domain including boundary points
var iteration = 0; // iteration counter
var X, XNew: [Domain] real = 0.0; // declare arrays:
// X stores approximate solution
// XNew stores the next solution
for j in 1..n do
X[n+1,j] = 1.0; // Set south boundary values to 1.0
do {
// compute next approximation
for i in 1..n do
for j in 1..n do
XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
X[i,j+1] + X[i,j-1] ) / 4.0;
// update X with next approximation
for i in 1..n do
for j in 1..n do
X[i,j] = XNew[i,j];
// advance iteration counter
iteration += 1;
} while (iteration < niter);
writeln("Jacobi computation complete.");
writeln("# of iterations: ", iteration);
} // main
Jacobi_3, finally, has the innermost loops vectorized and only the
outermost loops threaded:
/*
* Jacobi_3
*
* This program (adapted from the Chapel distribution) performs
* niter iterations of the Jacobi method for the Laplace equation
* using both parallel and serial (vectorized) loops.
*
*/
config var n = 300; // size of n x n grid
config var niter = 10000; // number of iterations to perform
proc main() {
const Domain = {0..n+1,0..n+1}; // domain including boundary points
var iteration = 0; // iteration counter
var X, XNew: [Domain] real = 0.0; // declare arrays:
// X stores approximate solution
// XNew stores the next solution
for j in vectorizeOnly(1..n) do
X[n+1,j] = 1.0; // Set south boundary values to 1.0
do {
// compute next approximation
forall i in 1..n do
for j in vectorizeOnly(1..n) do
XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
X[i,j+1] + X[i,j-1] ) / 4.0;
// update X with next approximation
forall i in 1..n do
for j in vectorizeOnly(1..n) do
X[i,j] = XNew[i,j];
// advance iteration counter
iteration += 1;
} while (iteration < niter);
writeln("Jacobi computation complete.");
writeln("# of iterations: ", iteration);
} // main
Running these codes on a laptop with 2 processor-cores and using two
parallel threads, one finds that Jacobi_1 is (surprisingly)
more than ten times slower than Jacobi_2, which itself is (expectedly)
~1.6 times slower than Jacobi_3.
Unfortunately, this default behavior makes array statements completely
unattractive for my use cases, even for algorithms which would benefit
enormously from the more concise notation and readability that
(whole-)array statements can provide.
Are there ways for the user in Chapel to change this default behavior?
That is, can a user customize the default parallelization of whole-array
statements in a way that such array statements, as used in Jacobi_1, will
behave either like the code in Jacobi_2 (which would be useful for code development and debugging purposes), or the code in Jacobi_3 (which, among those three, would be the method of choice for production calculations)?
I have tried to achieve this by plugging calls to "vectorizeOnly()" into
the definition of "Domain" above, but to no avail.
Chapel's intent is to support vectorization automatically within the per-task serial loops that are used to implement forall loops (for cases that are legally vectorizable). Yet that capability is not well-supported today, as you note (even the vectorizeOnly() iterator that you're using is only considered prototypical).
I'll mention that we tend to see better vectorization results when using Chapel's LLVM back-end than we do with the (default) C back-end, and that we've seen even better results when utilizing Simon Moll's LLVM-based Region Vectorizer (Saarland University). But we've also seen cases where the LLVM back-end underperforms the C back-end, so your mileage may vary. But if you care about vectorization, it's worth a try.
To your specific question:
Are there ways for the user in Chapel to change this default behavior?
There are. For explicit forall loops, you can write your own parallel iterator which can be used to specify a different implementation strategy for a forall loop than our default iterators use. If you implement one that you like, you can then write (or clone and modify) a domain map (background here) to govern how loops over a given array are implemented by default (i.e., if no iterator is explicitly invoked). This permits end-users to specify different implementation policies for a Chapel array than the ones we support by default.
With respect to your three code variants, I'm noting that the first uses multidimensional zippering which is known to have significant performance problems today. This is the likely main cause of performance differences between it and the others. For example, I suspect that if you rewrote it using the form forall (i,j) in Domain ... and then used +/-1 indexing per-dimension, you'd see a significant improvement (and, I'd guess, performance that's much more comparable to the third case).
For the third, I'd be curious whether the benefits you're seeing are due to vectorization or simply due to multitasking since you've avoided the performance problem of the first and the serial implementation of the second. E.g., have you checked to see whether using the vectorizeOnly() iterator added any performance improvement over the same code without that iterator (or used tools on the binary files to inspect whether vectorization is occurring?)
In any Chapel performance study, make sure to throw the --fast compiler flag. And again, for best vectorization results, you might try the LLVM back-end.

How to prevent recursion after looping once

I just realized that was a dumb question. Curious if anyone can still find a loophole though.
Source code:
married(trump,obama).
married(trump,goat).
married(pepee,pepper).
married(X,Y) :- married(Y,X),!. % not awesome because of infinite recursion
Goal: ex. married(trump, putin).
Trace:
first base case fails.
second base case fails.
third base case fails.
married(trump,putin) = married(putin,trump),!.
What I want it to do is try married(putin,trump), where all the earlier base cases will fail again; we already tried switching the args and failed, so don't recurse, just return false.
I get a stack overflow because neither married(trump,putin) nor married(putin,trump) ever returns true or false before the !, so the cut is never reached.
The easier and saner way is to just rewrite the code to prevent the recursion. I'm curious whether there is a way to try switching the args once and return fail if that fails. If you have a long list of facts, you can cut that list in half if you can try arg1,arg2 and vice versa, and potentially much more if crazier permutation scenarios come up.
Any insights will be awesome thanks.
You are on the right track with "switching args once and return fail if that fails", even though that is worded very imperatively and does not cover all modes we expect from such a relation.
For this to work, you need to separate this into two predicates. It is easy to show that a single predicate with the given interface is not sufficient.
First, the auxiliary predicate:
married_(a, b).
married_(c, d).
etc.
Then, the main predicate, essentially as you suggest:
married(X, Y) :- married_(X, Y).
married(X, Y) :- married_(Y, X).
Adding impurities to your solution makes matters worse: Almost invariably, you will destroy the generality of your relations, raising the question why you are using a declarative language at all.
Example query:
?- married(X, Y).
X = a,
Y = b ;
X = c,
Y = d ;
X = b,
Y = a ;
X = d,
Y = c.
Strictly speaking, you can of course also do this with only a single predicate, but you need to carry around additional information if you do it this way.
For example:
married(_, a, b).
married(_, c, d).
married(first, X, Y) :- married(second, Y, X).
Example query:
?- married(_, X, Y).
X = a,
Y = b ;
X = c,
Y = d ;
X = b,
Y = a ;
X = d,
Y = c.
This closely follows the approach you describe: "We tried switching args before. So don't do it again."

Indexing sliced array in matlab??? [duplicate]

For example, if I want to read the middle value from magic(5), I can do so like this:
M = magic(5);
value = M(3,3);
to get value == 13. I'd like to be able to do something like one of these:
value = magic(5)(3,3);
value = (magic(5))(3,3);
to dispense with the intermediate variable. However, MATLAB complains about Unbalanced or unexpected parenthesis or bracket on the first parenthesis before the 3.
Is it possible to read values from an array/matrix without first assigning it to a variable?
It actually is possible to do what you want, but you have to use the functional form of the indexing operator. When you perform an indexing operation using (), you are actually making a call to the subsref function. So, even though you can't do this:
value = magic(5)(3, 3);
You can do this:
value = subsref(magic(5), struct('type', '()', 'subs', {{3, 3}}));
Ugly, but possible. ;)
In general, you just have to change the indexing step to a function call so you don't have two sets of parentheses immediately following one another. Another way to do this would be to define your own anonymous function to do the subscripted indexing. For example:
subindex = @(A, r, c) A(r, c); % An anonymous function for 2-D indexing
value = subindex(magic(5), 3, 3); % Use the function to index the matrix
However, when all is said and done the temporary local variable solution is much more readable, and definitely what I would suggest.
There was just a good blog post on Loren on the Art of MATLAB a couple of days ago with a couple of gems that might help. In particular, using helper functions like:
paren = @(x, varargin) x(varargin{:});
curly = @(x, varargin) x{varargin{:}};
where paren() can be used like
paren(magic(5), 3, 3);
would return
ans = 13
I would also surmise that this will be faster than gnovice's answer, but I haven't checked (Use the profiler!!!). That being said, you also have to include these function definitions somewhere. I personally have made them independent functions in my path, because they are super useful.
These functions and others are now available in the Functional Programming Constructs add-on which is available through the MATLAB Add-On Explorer or on the File Exchange.
How do you feel about using undocumented features:
>> builtin('_paren', magic(5), 3, 3) %# M(3,3)
ans =
13
or for cell arrays:
>> builtin('_brace', num2cell(magic(5)), 3, 3) %# C{3,3}
ans =
13
Just like magic :)
UPDATE:
Bad news, the above hack doesn't work anymore in R2015b! That's fine, it was undocumented functionality and we cannot rely on it as a supported feature :)
For those wondering where to find this type of thing, look in the folder fullfile(matlabroot,'bin','registry'). There's a bunch of XML files there that list all kinds of goodies. Be warned that calling some of these functions directly can easily crash your MATLAB session.
At least in MATLAB 2013a you can use getfield like:
a=rand(5);
getfield(a,{1,2}) % etc
to get the element at (1,2)
Unfortunately, syntax like magic(5)(3,3) is not supported by MATLAB. You need to use temporary intermediate variables; you can free up the memory after use, e.g.
tmp = magic(5);
myVar = tmp(3,3);
clear tmp
Note that if you compare running times with the standard way (assign the result and then access entries), they are exactly the same.
subs = @(M,i,j) M(i,j);
>> for nit=1:10;tic;subs(magic(100),1:10,1:10);tlap(nit)=toc;end;mean(tlap)
ans =
0.0103
>> for nit=1:10,tic;M=magic(100); M(1:10,1:10);tlap(nit)=toc;end;mean(tlap)
ans =
0.0101
In my opinion, the bottom line is: MATLAB does not have pointers, you have to live with it.
It could be more simple if you make a new function:
function [ element ] = getElem( matrix, index1, index2 )
element = matrix(index1, index2);
end
and then use it:
value = getElem(magic(5), 3, 3);
Your initial notation is the most concise way to do this:
M = magic(5); %create
value = M(3,3); % extract useful data
clear M; %free memory
If you are doing this in a loop you can just reassign M every time and ignore the clear statement as well.
To complement Amro's answer, you can use feval instead of builtin. There is no difference, really, unless you try to overload the operator function:
BUILTIN(...) is the same as FEVAL(...) except that it will call the
original built-in version of the function even if an overloaded one
exists (for this to work, you must never overload
BUILTIN).
>> feval('_paren', magic(5), 3, 3) % M(3,3)
ans =
13
>> feval('_brace', num2cell(magic(5)), 3, 3) % C{3,3}
ans =
13
What's interesting is that feval seems to be just a tiny bit quicker than builtin (by ~3.5%), at least in Matlab 2013b, which is weird given that feval needs to check if the function is overloaded, unlike builtin:
>> tic; for i=1:1e6, feval('_paren', magic(5), 3, 3); end; toc;
Elapsed time is 49.904117 seconds.
>> tic; for i=1:1e6, builtin('_paren', magic(5), 3, 3); end; toc;
Elapsed time is 51.485339 seconds.

Simple hash functions

I'm trying to write a C program that uses a hash table to store different words and I could use some help.
Firstly, I create a hash table with the size of a prime number which is closest to the number of the words I have to store, and then I use a hash function to find an address for each word.
I started with the simplest function, adding the letters together, which ended up with 88% collision.
Then I started experimenting with the function and found out that whatever I change it to, the collisions don't get lower than 35%.
Right now I'm using
unsigned int stringToHash(char *word, unsigned int hashTableSize){
    unsigned int counter, hashAddress = 0;
    for (counter = 0; word[counter] != '\0'; counter++){
        hashAddress = hashAddress*word[counter] + word[counter] + counter;
    }
    return (hashAddress % hashTableSize);
}
which is just a random function that I came up with, but it gives me the best results - around 35% collision.
I've been reading articles on hash functions for the past few hours and I tried a few simple ones, such as djb2, but all of them gave me even worse results. (djb2 resulted in 37% collisions, which isn't much worse, but I was expecting something better rather than worse.)
I also don't know how to use some of the other, more complex ones, such as the murmur2, because I don't know what the parameters (key, len, seed) they take in are.
Is it normal to get more than 35% collisions, even with using the djb2, or am I doing something wrong?
What are the key, len and seed values?
Try sdbm:
hashAddress = 0;
for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = word[counter] + (hashAddress << 6) + (hashAddress << 16) - hashAddress;
}
Or djb2:
hashAddress = 5381;
for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = ((hashAddress << 5) + hashAddress) + word[counter];
}
Or Adler32:
uint32_t adler32(const void *buf, size_t buflength) {
    const uint8_t *buffer = (const uint8_t*)buf;
    uint32_t s1 = 1;
    uint32_t s2 = 0;
    for (size_t n = 0; n < buflength; n++) {
        s1 = (s1 + buffer[n]) % 65521;
        s2 = (s2 + s1) % 65521;
    }
    return (s2 << 16) | s1;
}
// ...
hashAddress = adler32(word, strlen(word));
None of these are really great, though. If you really want good hashes, you need something more complex like lookup3, murmur3, or CityHash for example.
Note that a hashtable is expected to have plenty of collisions as soon as it is filled by more than 70-80%. This is perfectly normal and will even happen if you use a very good hash algorithm. That's why most hashtable implementations increase the capacity of the hashtable (e.g. capacity * 1.5 or even capacity * 2) as soon as you are adding something to the hashtable and the ratio size / capacity is already above 0.7 to 0.8. Increasing the capacity means a new hashtable is created with a higher capacity, all values from the current one are added to the new one (therefore they must all be rehashed, as their new index will be different in most cases), the new hashtable array replaces the old one and the old one is released/freed. If you plan on hashing 1000 words, a hashtable capacity of at least 1250 is recommended, better 1400 or even 1500.
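As a rough illustration of that resizing policy (size, capacity, and rehash() here are placeholders for whatever your own table code provides):
/* after each insertion: grow once the load factor passes ~0.75 */
if ((double)size / (double)capacity > 0.75) {
    rehash(capacity * 2);   /* allocate a bigger bucket array and re-insert (re-hash) every entry */
}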
Hashtables are not supposed to be "filled to brim", at least not if they shall be fast and efficient (thus they always should have spare capacity). That's the downside of hashtables, they are fast (O(1)), yet they will usually waste more space than would be necessary for storing the same data in another structure (when you store them as a sorted array, you will only need a capacity of 1000 for 1000 words; the downside is that the lookup cannot be faster than O(log n) in that case). A collision free hashtable is not possible in most cases either way. Pretty much all hashtable implementations expect collisions to happen and usually have some kind of way to deal with them (usually collisions make the lookup somewhat slower, but the hashtable will still work and still beat other data structures in many cases).
Also note that if you are using a pretty good hash function, there is no requirement, nor even an advantage, in the hashtable having a power-of-2 capacity if you are cropping hash values using modulo (%) in the end. The reason why many hashtable implementations always use power-of-2 capacities is that they do not use modulo; instead they use AND (&) for cropping, because an AND operation is among the fastest operations you will find on most CPUs (modulo is never faster than AND, in the best case it would be equally fast, in most cases it is a lot slower). If your hashtable uses power-of-2 sizes, you can replace any modulo with an AND operation:
x % 4 == x & 3
x % 8 == x & 7
x % 16 == x & 15
x % 32 == x & 31
...
This only works for power-of-2 sizes, though. If you use modulo, power-of-2 sizes can only buy something if the hash is a very bad hash with a very bad "bit distribution". A bad bit distribution is usually caused by hashes that do not use any kind of bit shifting (>> or <<) or any other operations that would have a similar effect as bit shifting.
I created a stripped down lookup3 implementation for you:
#include <stdint.h>
#include <stdlib.h>
#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))
#define mix(a,b,c) \
{ \
a -= c; a ^= rot(c, 4); c += b; \
b -= a; b ^= rot(a, 6); a += c; \
c -= b; c ^= rot(b, 8); b += a; \
a -= c; a ^= rot(c,16); c += b; \
b -= a; b ^= rot(a,19); a += c; \
c -= b; c ^= rot(b, 4); b += a; \
}
#define final(a,b,c) \
{ \
c ^= b; c -= rot(b,14); \
a ^= c; a -= rot(c,11); \
b ^= a; b -= rot(a,25); \
c ^= b; c -= rot(b,16); \
a ^= c; a -= rot(c,4); \
b ^= a; b -= rot(a,14); \
c ^= b; c -= rot(b,24); \
}
uint32_t lookup3 (
const void *key,
size_t length,
uint32_t initval
) {
uint32_t a,b,c;
const uint8_t *k;
const uint32_t *data32Bit;
data32Bit = key;
a = b = c = 0xdeadbeef + (((uint32_t)length)<<2) + initval;
while (length > 12) {
a += *(data32Bit++);
b += *(data32Bit++);
c += *(data32Bit++);
mix(a,b,c);
length -= 12;
}
k = (const uint8_t *)data32Bit;
switch (length) {
case 12: c += ((uint32_t)k[11])<<24;
case 11: c += ((uint32_t)k[10])<<16;
case 10: c += ((uint32_t)k[9])<<8;
case 9 : c += k[8];
case 8 : b += ((uint32_t)k[7])<<24;
case 7 : b += ((uint32_t)k[6])<<16;
case 6 : b += ((uint32_t)k[5])<<8;
case 5 : b += k[4];
case 4 : a += ((uint32_t)k[3])<<24;
case 3 : a += ((uint32_t)k[2])<<16;
case 2 : a += ((uint32_t)k[1])<<8;
case 1 : a += k[0];
break;
case 0 : return c;
}
final(a,b,c);
return c;
}
This code is not as highly optimized for performance as the original code, and therefore it is a lot simpler. It is also not as portable as the original code, but it is portable to all major consumer platforms in use today. It also completely ignores CPU endianness, yet that is not really an issue: it will work on big and little endian CPUs. Just keep in mind that it will not calculate the same hash for the same data on big and little endian CPUs, but that is no requirement; it will calculate a good hash on both kinds of CPUs and it's only important that it always calculates the same hash for the same input data on a single machine.
You would use this function as follows:
unsigned int stringToHash(char *word, unsigned int hashTableSize){
    unsigned int initval;
    unsigned int hashAddress;
    initval = 12345;
    hashAddress = lookup3(word, strlen(word), initval);
    return (hashAddress % hashTableSize);
    // If the hashtable is guaranteed to always have a size that is a power of 2,
    // replace the line above with the following more effective line:
    //     return (hashAddress & (hashTableSize - 1));
}
You may wonder what initval is. Well, it is whatever you want it to be. You could call it a salt. It will influence the hash values, yet the hash values will not get better or worse in quality because of this (at least not in the average case; it may lead to more or fewer collisions for very specific data, though). E.g. you can use different initval values if you want to hash the same data twice but each time should produce a different hash value (there is no guarantee it will, but it is rather likely if initval is different; if it creates the same value, this would be a very unlucky coincidence that you must treat as a kind of collision). It is not advisable to use different initval values when hashing data for the same hashtable (this will rather cause more collisions on average). Another use for initval is if you want to combine a hash with some other data, in which case the already existing hash becomes initval when hashing the other data (so both the other data and the previous hash influence the outcome of the hash function). You may even set initval to 0 if you like, or pick a random value when the hashtable is created (and always use this random value for this instance of hashtable, yet each hashtable has its own random value).
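For example, combining two strings into a single hash by feeding the first result back in as initval could look like this (a sketch that reuses the lookup3() above; firstName and lastName are made-up fields):
uint32_t h;
h = lookup3(firstName, strlen(firstName), 12345);
h = lookup3(lastName, strlen(lastName), h);   /* the previous hash acts as initval */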
A note on collisions:
Collisions are usually not such a huge problem in practice; it usually does not pay off to waste tons of memory just to avoid them. The question is rather how you are going to deal with them in an efficient way.
You said you are currently dealing with 9000 words. If you were using an unsorted array, finding a word in the array will need 4500 comparisons on average. On my system, 4500 string comparisons (assuming that words are between 3 and 20 characters long) need 38 microseconds (0.000038 seconds). So even such a simple, inefficient algorithm is fast enough for most purposes. Assuming that you are sorting the word list and use a binary search, finding a word in the array will need only 13 comparisons on average. 13 comparisons are close to nothing in terms of time; it's too little to even benchmark reliably. So if finding a word in a hashtable needs 2 to 4 comparisons, I wouldn't even waste a single second on the question whether that may be a huge performance problem.
In your case, a sorted list with binary search may even beat a hashtable by far. Sure, 13 comparisons need more time than 2-4 comparisons, however, in case of a hashtable you must first hash the input data to perform a lookup. Hashing alone may already take longer than 13 comparisons! The better the hash, the longer it will take for the same amount of data to be hashed. So a hashtable only pays off performance-wise if you have a really huge amount of data or if you must update the data frequently (e.g. constantly adding/removing words to/from the table, since these operations are less costly for a hashtable than they are for a sorted list). The fact that a hashtable is O(1) only means that, regardless of how big it is, a lookup will always need approximately the same amount of time. O(log n) only means that the lookup grows logarithmically with the number of words; that means more words, slower lookup. Yet the Big-O notation says nothing about absolute speed! This is a big misunderstanding. It is not said that an O(1) algorithm always performs faster than an O(log n) one. The Big-O notation only tells you that if the O(log n) algorithm is faster for a certain number of values and you keep increasing the number of values, the O(1) algorithm will certainly overtake the O(log n) algorithm at some point of time, but your current word count may be far below that point. Without benchmarking both approaches, you cannot say which one is faster by just looking at the Big-O notation.
Back to collisions. What should you do if you run into a collision? If the number of collisions is small, and here I don't mean the overall number of collisions (the number of words that are colliding in the hashtable) but the per index one (the number of words stored at the same hashtable index, so in your case maybe 2-4), the simplest approach is to store them as a linked list. If there was no collision so far for this table index, there is just a single key/value pair. If there was a collision, there is a linked list of key/value pairs. In that case your code must iterate over the linked list and verify each of the keys and return the value if it matches. Going by your numbers, this linked list won't have more than 4 entries and doing 4 comparisons is insignificant in terms of performance. So finding the index is O(1), finding the value (or detecting that this key is not in the table) is O(n), but here n is only the number of linked list entries (so it is 4 at most).
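A minimal sketch of that linked-list ("separate chaining") layout in C; the types and names are illustrative, and it reuses the stringToHash() from the question:
#include <string.h>
typedef struct Entry {
    char         *key;
    int           value;
    struct Entry *next;      /* next key/value pair that hashed to the same index */
} Entry;
typedef struct {
    Entry      **buckets;    /* array of list heads, one per table index */
    unsigned int capacity;
} HashTable;
/* Returns the entry for key, or NULL if the key is not in the table. */
Entry *lookup(HashTable *t, const char *key) {
    unsigned int index = stringToHash((char *)key, t->capacity);
    Entry *e;
    for (e = t->buckets[index]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e;
    return NULL;
}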
If the number of collisions rises, a linked list can become too slow, and you may instead store a dynamically sized, sorted array of key/value pairs, which allows lookups of O(log n) and again, n is only the number of keys in that array, not of all keys in the hashtable. Even if there were 100 collisions at one index, finding the right key/value pair takes at most 7 comparisons. That's still close to nothing. Despite that, if you really have 100 collisions at one index, either your hash algorithm is unsuited for your key data or the hashtable is far too small in capacity. The disadvantage of a dynamically sized, sorted array is that adding/removing keys is somewhat more work than in case of a linked list (code-wise, not necessarily performance-wise). So using a linked list is usually sufficient if you keep the number of collisions low enough, and it is almost trivial to implement such a linked list yourself in C and add it to an existing hashtable implementation.
Most hashtable implementations I have seen use such a "fallback to an alternate data structure" to deal with collisions. The disadvantage is that these require a little bit of extra memory to store the alternative data structure and a bit more code to also search for keys in that structure. There are also solutions that store collisions inside the hashtable itself and that don't require any additional memory. However, these solutions have a couple of drawbacks. The first drawback is that every collision increases the chances for even more collisions as more data is added. The second drawback is that while lookup times for existing keys degrade linearly with the number of collisions so far (and as I said before, every collision leads to even more collisions as data is added), lookup times for keys not in the hashtable degrade even worse, and in the end, if you perform a lookup for a key that is not in the hashtable (yet you cannot know without performing the lookup), the lookup may take as long as a linear search over the whole hashtable (YUCK!!!). So if you can spare the extra memory, go for an alternate structure to handle collisions.
Firstly, I create a hash table with the size of a prime number which is the closest to the number of the words I have to store, and then I use a hash function to find an address for each word.
...
return (hashAddress%hashTableSize);
Since the number of different hashes is comparable to the number of words, you cannot expect to have many fewer collisions.
I made a simple statistical test with a random hash (which is the best you could achieve) and found that 26% is the limiting collision rate if you have #words == #different hashes.

Optimizing C loops

I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array, are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
    double J[151][151];
    /* Other relevant variables declared */
    calcJac(data,J,y);
    /* Use J */
}

static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    /* The first expensive loop */
    int iter, jter;
    for (iter=0; iter<151; iter++) {
        for (jter = 0; jter<151; jter++) {
            J[iter][jter] = 0;
        }
    }
    /* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component, or is it something else about the loop?
for (iter = 1; iter<151; iter++) {
    for(jter = 1; jter<151; jter++){
        P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
    }
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter=1; iter<151; iter++) {
    for (jter=1; jter<151; jter++) {
        Jv_scratch += J[iter][jter]*Ith(v,jter);
    }
    Ith(Jv,iter) = Jv_scratch;
    Jv_scratch = 0;
}
1) No, they're not. You can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex calculation to compute the position in P and the position in J.
You may well get better performance by stepping through with pointers:
for (iter = 1; iter < 151; iter++)
{
    /* point at the first element each row actually touches */
    double *pP = &P[iter - 1][0];
    double *pJ = &data->J[iter][1];
    for (jter = 1; jter < 151; jter++, pP++, pJ++)
    {
        *pP = -gamma * *pJ;
    }
}
This way you move much of the array index calculation outside of the loop.
3) The best practice is to try and move as many calculations out of the loop as possible. Much like I did on the loop above.
First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
First, variables on the stack are not initialized for you. But there are faster ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    memset((void*)J, 0, sizeof(double) * 151 * 151);
    /* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).
Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness is caused by the order in which you access array elements, which can produce thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.
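To give a flavour of the "blocking" idea, without claiming this is anywhere near a tuned library routine, here is a sketch of a tiled matrix multiplication; N and the block size BS are illustrative values:
#define N  151
#define BS 32    /* tile size; tune so a few tiles fit in cache */

/* C = A * B, computed tile by tile to improve cache reuse. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
{
    int i, j, k, ii, jj, kk;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            C[i][j] = 0.0;
    for (ii = 0; ii < N; ii += BS)
        for (kk = 0; kk < N; kk += BS)
            for (jj = 0; jj < N; jj += BS)
                for (i = ii; i < ii + BS && i < N; i++)
                    for (k = kk; k < kk + BS && k < N; k++) {
                        double a = A[i][k];
                        for (j = jj; j < jj + BS && j < N; j++)
                            C[i][j] += a * B[k][j];
                    }
}
Even then, a pre-existing BLAS implementation will almost certainly beat hand-rolled code like this.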
Initialization of an array to zero.
When J is declared to be a double
array are the values of the array
initialized to zero? If not, is there
a fast way to set all the elements to
zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, ie:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically for this specific case, I very much doubt it will be wise to allocate 151*151*sizeof(double) bytes on the stack no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.
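For instance, a heap allocation that keeps the familiar J[i][j] indexing might look like this (a sketch; error handling kept minimal):
#include <stdlib.h>
#include <string.h>

double (*J)[151] = malloc(sizeof(double[151][151]));   /* one block, rows of 151 doubles */
if (J == NULL) {
    /* handle the allocation failure */
}
memset(J, 0, sizeof(double[151][151]));
/* ... fill and use J[i][j] exactly as before ... */
free(J);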
In the
relatively slow loop below, is
accessing a matrix that is contained
in a structure 'data' the the slow
component or is it something else
about the loop?
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimization things such as counting down towards zero rather than up, or to use ++i rather than i++ etc etc. But the compiler really should be able to handle such things for you.
As for matrix addition, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really have need for high accuracy, I'd consider using float or int to speed up the algorithm.
