Fast bi-directional hash of two integers in C

I am writing a Linux kernel module and I need to come up with a hashing function that takes two integers for input. Because the code runs in kernel space, none of the standard libraries are available to me.
Basically, I need a hashing function where:
hash(a, b) = c
hash(b, a) = c
Where acceptable inputs for a and b are unsigned 32-bit integers. The hashing function should return an unsigned 64-bit integer. Collisions (i.e. hash(a, b) = c and hash(d, f) = c as well) are not desirable, as these values will be used in a binary search tree. The result of the search is a linked list of possible results that is then iterated over, where a and b are actually compared. So some collision is acceptable, but the fewer collisions, the fewer iterations required, and the faster it will run.
Performance is also of extreme importance: this lookup will be performed for every packet the system receives, as I am writing a firewall application (the integers are actually packet source and destination addresses). This function is used to look up existing network sessions.
Thank you for your time.

Pseudocode of how you can do it:
if a > b
    return (a << 32) | b;
else
    return (b << 32) | a;
This satisfies hash(a,b) == hash(b,a), utilizes the full 64 bit space, and shouldn't have collisions ...I think :)
Be careful not to shift the 32-bit variables directly; shifting a 32-bit value by 32 bits is undefined behaviour. Use intermediate 64-bit variables or inline casts instead:
#include <stdint.h>

uint64_t myhash(uint32_t a, uint32_t b)
{
    uint64_t a64 = (uint64_t) a;
    uint64_t b64 = (uint64_t) b;
    return (a > b) ? ((a64 << 32) | b64) : ((b64 << 32) | a64);
}
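A quick symmetry check (test program mine, assuming the myhash() definition above):
#include <assert.h>
#include <stdint.h>

int main(void)
{
    assert(myhash(1, 2) == myhash(2, 1)); /* order-independent */
    assert(myhash(1, 2) != myhash(1, 3)); /* distinct pairs stay distinct */
    return 0;
}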

#define MYHASH(a,b) ( (((UINT64) max(a,b)) << 32) | ((UINT64) min(a,b)) )

((uint64_t)(a | b) << 32) + (a & b)
is commutative, but note that it does collide: any two pairs that share the same bitwise OR and AND hash identically (see the sketch below).
I have to think more about it though ...
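To make the collision concrete (demonstration mine, with a hypothetical or_and_hash() wrapper): the pairs {1, 2} and {3, 0} both have OR = 3 and AND = 0.
#include <assert.h>
#include <stdint.h>

static uint64_t or_and_hash(uint32_t a, uint32_t b)
{
    return ((uint64_t)(a | b) << 32) + (a & b);
}

int main(void)
{
    assert(or_and_hash(1, 2) == or_and_hash(3, 0)); /* both yield 3ULL << 32 */
    return 0;
}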

How about ((uint64_t)max(a, b) << 32) | (uint64_t)min(a, b)? This would avoid collisions entirely, as there is no possible overlap between inputs. I can't speak to the distribution though, as that depends on your input values.

(a ^ b) | ((uint64_t)(a ^ ~b) << 32);


Is there any way in C to check at compile time if you are on an architecture where multiplication is fast?

Is there any way for C code to tell whether it is being compiled on an architecture where multiplication is fast? Is there some macro __FAST_MULT__ or something which is defined on those architectures?
For example, assume you are implementing a function to determine the Hamming weight of a 64-bit integer via the shift-and-add method*. There are two optimal algorithms for that: one requires 17 arithmetic operations, while the other requires only 12, but one of those is a multiplication operation. The second algorithm is thus 30% faster, if you are running on hardware where multiplication takes the same amount of time as addition - but much, much slower on a system where multiplication is implemented as repeated addition.
Thus, when writing such a function, it would be useful to be able to check at compile time whether this is the case, and switch between the two algorithms as appropriate:
unsigned int popcount_64(uint64_t x) {
    x -= (x >> 1) & 0x5555555555555555;                             // put count of each 2 bits into those 2 bits
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333); // put count of each 4 bits into those 4 bits
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f;                        // put count of each 8 bits into those 8 bits
#ifdef __FAST_MULT__
    return (x * 0x0101010101010101) >> 56; // returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
#else // __FAST_MULT__
    x += x >> 8;  // put count of each 16 bits into their lowest 8 bits
    x += x >> 16; // put count of each 32 bits into their lowest 8 bits
    x += x >> 32; // put count of each 64 bits into their lowest 8 bits
    return x & 0x7f;
#endif // __FAST_MULT__
}
Is there any way to do this?
* Yes, I am aware of the __builtin_popcount() functions; this is just an example.
No, standard C does not provide any such facility. It is possible that particular compilers provide such a thing as an extension, but I am not specifically aware of any that actually do.
This sort of thing can be tested during build configuration, for example via Autoconf or CMake, in which case you can provide the symbol yourself where appropriate.
Alternatively, some C compilers definitely do provide macros that indicate the architecture for which the code is being compiled. You can use that in conjunction with knowledge of the details of various machine architectures to choose between the two algorithms -- that's what such macros are intended for, after all.
Or you can rely on the person building the program to choose, by configuration option, by defining a macro, or whatever.
I don't believe there is a predefined macro that specifically addresses the fast multiplication feature.
There are, however, a lot of predefined compiler macros for different architectures, so if you already know in advance which architectures or CPUs support fast multiplication, you can use those macros to define your own application-specific one that signifies fast multiplication.
E.g.:
#if (defined __GNUC__ && defined __arm__ && defined __ARM_ARCH_7__) || \
    (defined __CC_ARM && (__TARGET_ARCH_ARM == 7))
#define FAST_MULT
#endif

How do I implement a bitset of k bits in C? [duplicate]

I have been using the Bitset class in Java and I would like to do something similar in C. I suppose I would have to do it manually, as with most things in C. What would be an efficient way to implement one?
byte bitset[]
maybe
bool bitset[]
?
CCAN has a bitset implementation you can use: http://ccan.ozlabs.org/info/jbitset.html
But if you do end up implementing it yourself (for instance if you don't like the dependencies on that package), you should use an array of ints and use the native size of the computer architecture:
#define WORD_BITS (8 * sizeof(unsigned int))
unsigned int * bitarray = calloc((size + WORD_BITS - 1) / WORD_BITS, sizeof(unsigned int));

static inline void setIndex(unsigned int * bitarray, size_t idx) {
    bitarray[idx / WORD_BITS] |= (1U << (idx % WORD_BITS));
}
Don't use a specific size (e.g. with uint64 or uint32), let the computer use what it wants to use and adapt to that using sizeof.
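For completeness, clearing and testing a bit follow the same pattern (these counterparts are my sketch, not part of the original answer):
static inline void clearIndex(unsigned int * bitarray, size_t idx) {
    bitarray[idx / WORD_BITS] &= ~(1U << (idx % WORD_BITS));
}

static inline int testIndex(const unsigned int * bitarray, size_t idx) {
    return (bitarray[idx / WORD_BITS] >> (idx % WORD_BITS)) & 1U;
}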
Nobody mentioned what the C FAQ recommends, which is a bunch of good-old-macros:
#include <limits.h> /* for CHAR_BIT */
#define BITMASK(b) (1 << ((b) % CHAR_BIT))
#define BITSLOT(b) ((b) / CHAR_BIT)
#define BITSET(a, b) ((a)[BITSLOT(b)] |= BITMASK(b))
#define BITCLEAR(a, b) ((a)[BITSLOT(b)] &= ~BITMASK(b))
#define BITTEST(a, b) ((a)[BITSLOT(b)] & BITMASK(b))
#define BITNSLOTS(nb) (((nb) + CHAR_BIT - 1) / CHAR_BIT)
(via http://c-faq.com/misc/bitsets.html)
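To illustrate (usage example mine, in the spirit of the FAQ's own demo), a 47-bit set is declared and manipulated like this:
#include <string.h> /* for memset */

char bitarray[BITNSLOTS(47)];
memset(bitarray, 0, BITNSLOTS(47));
BITSET(bitarray, 23);
if (BITTEST(bitarray, 23))
    BITCLEAR(bitarray, 23);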
Well, byte bitset[] seems a little misleading, no?
Use bit fields in a struct, and then you can maintain a collection of these types (or use them otherwise as you see fit):
struct packed_struct {
    unsigned int b1:1;
    unsigned int b2:1;
    unsigned int b3:1;
    unsigned int b4:1;
    /* etc. */
} packed;
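Access is then ordinary member syntax (usage example mine); the compiler does the masking for you:
packed.b1 = 1;     /* set a flag   */
if (packed.b2)     /* test a flag  */
    packed.b3 = 0; /* clear a flag */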
I recommend my BITSCAN C++ library (version 1.0 has just been released). BITSCAN is specifically oriented towards fast bit-scanning operations. I have used it to implement NP-hard combinatorial problems involving simple undirected graphs, such as maximum clique (see the BBMC algorithm, a leading exact solver).
A comparison between BITSCAN and standard solutions STL bitset and BOOST dynamic_bitset is available here:
http://blog.biicode.com/bitscan-efficiency-at-glance/
You can give my PackedArray code a try with a bitsPerItem of 1.
It implements a random access container where items are packed at the bit level. In other words, it acts as if you were able to manipulate, e.g., a uint9_t or uint17_t array:
PackedArray principle:
. compact storage of <= 32 bits items
. items are tightly packed into a buffer of uint32_t integers
PackedArray requirements:
. you must know in advance how many bits are needed to hold a single item
. you must know in advance how many items you want to store
. when packing, behavior is undefined if items have more than bitsPerItem bits
PackedArray general in memory representation:
|-------------------------------------------------- - - -
|       b0       |       b1       |       b2       |
|-------------------------------------------------- - - -
| i0 | i1 | i2 | i3 | i4 | i5 | i6 | i7 | i8 | i9 |
|-------------------------------------------------- - - -
. items are tightly packed together
. several items end up inside the same buffer cell, e.g. i0, i1, i2
. some items span two buffer cells, e.g. i3, i6
As usual you need to first decide what sort of operations you need to perform on your bitset. Perhaps some subset of what Java defines? After that you can decide how best to implement it. You can certainly look at the source for BitSet.java in OpenJDK for ideas.
Make it an array of uint64_t.

Explain this Function

Can someone explain to me the reason why someone would want to use bitwise comparison?
example:
#include <stdio.h>

int f(int x) {
    return x & (x - 1);
}

int main() {
    printf("F(10) = %d", f(10));
}
This is what I really want to know: "Why check for common set bits"
x is any positive number.
Bitwise operations are used for three reasons:
You can use the least possible space to store information
You can compare/modify an entire register (e.g. 32, 64, or 128 bits depending on your processor) in a single CPU instruction, usually taking a single clock cycle. That means you can do a lot of work (of certain types) blindingly fast compared to regular arithmetic.
It's cool, fun and interesting. Programmers like these things, and they can often be the differentiator when there is no difference between techniques in terms of efficiency/performance.
You can use this for all kinds of very handy things. For example, in my database I can store a lot of true/false information about my customers in a tiny space (a single byte can store 8 different true/false facts) and then use '&' operations to query their status:
Is my customer Male and Single and a Smoker?
if ((customerFlags & (maleFlag | singleFlag | smokerFlag)) ==
    (maleFlag | singleFlag | smokerFlag))
Is my customer (any combination of) Male Or Single Or a Smoker?
if ((customerFlags & (maleFlag | singleFlag | smokerFlag)) != 0)
Is my customer not Male and not Single and not a Smoker?
if ((customerFlags & (maleFlag | singleFlag | smokerFlag)) == 0)
Aside from just "checking for common bits", you can also do:
Certain arithmetic, e.g. value & 15 is a much faster equivalent of value % 16. This only works when the divisor is a power of two (and the value is unsigned or non-negative), but when you can use it, it can be a great optimisation.
Data packing/unpacking. e.g. a colour is often expressed as a 32-bit integer that contains Alpha, Red, Green and Blue byte values. The Red value might be extracted with an expression like red = (value >> 16) & 255; (shift the value down 16 bit positions and then carve off the bottom byte)
Data manipulation and swizzling. Some clever tricks can be achieved with bitwise operations. For example, swapping two integer values without needing to use a third temporary variable, or converting ARGB colour values into another format (e.g RGBA or BGRA)
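For instance, the temporary-free swap just mentioned is the classic XOR trick (sketch mine; note the aliasing guard, since x ^= x would zero the value):
void xor_swap(int *a, int *b)
{
    if (a != b) {  /* if a and b point to the same object, the trick would zero it */
        *a ^= *b;
        *b ^= *a;  /* *b now holds the original *a */
        *a ^= *b;  /* *a now holds the original *b */
    }
}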
The Ur-example is "testing if a number is even or odd":
unsigned int number = ...;
bool isOdd = (0 != (number & 1));
More complex uses include bitmasks (multiple boolean values in a single integer, each one taking up one bit of space) and encryption/hashing (which frequently involve bit shifting, XOR, etc.)
The example you've given is kinda odd, but I use bitwise comparisons all the time in embedded code.
I'll often have code that looks like the following:
volatile uint32_t *flags = (volatile uint32_t *)0x000A000;
bool flagA = *flags & 0x1;
bool flagB = *flags & 0x2;
bool flagC = *flags & 0x4;
It's not a bitwise comparison. It doesn't return a boolean.
Bitwise operators are used to read and modify individual bits of a number.
n & 0x8 // Peek at bit3
n |= 0x8 // Set bit3
n &= ~0x8 // Clear bit3
n ^= 0x8 // Toggle bit3
Bits are used in order to save space. 8 chars takes a lot more memory than 8 bits in a char.
The following example computes the first and last address of an IP subnet, given an IP address in the subnet and the subnet mask (here 255.255.255.0):
uint32_t mask = ((((255 << 8) | 255) << 8) | 255) << 8;   /* 255.255.255.0 */
uint32_t ip = (((((192 << 8) | 168) << 8) | 3) << 8) | 4; /* 192.168.3.4   */
uint32_t first = ip & mask;                               /* 192.168.3.0   */
uint32_t last = ip | ~mask;                               /* 192.168.3.255 */
E.g. if you have a number of status flags, then in order to save space you may want to store each flag as a bit.
So x, if declared as a byte, would hold 8 flags.
I think you mean bitwise combination (in your case a bitwise AND operation). This is a very common operation in cases where a byte, word or dword value is handled as a collection of bits, e.g. status information in SCADA or control programs.
Your example tests whether x has at most 1 bit set. f returns 0 if x is a power of 2 and non-zero if it is not.
Your particular example clears the lowest set bit of x, so f(x) is non-zero exactly when x has more than one bit set.
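Tracing the question's own call makes this concrete (worked example mine):
/* f(10):  x         = 1010  (10)
           x - 1     = 1001  ( 9)
           x & (x-1) = 1000  ( 8)
   The lowest set bit of 10 is cleared, so the program prints "F(10) = 8". */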

Casting unsigned int to unsigned short int with bit operator

I would like to cast unsigned int (32-bit) A to unsigned short int (16-bit) B in the following way:
if A <= 2^16-1 then B=A
if A > 2^16-1 then B=2^16-1
In other words, cast A, but if it exceeds the maximum value representable in 16 bits, set it to that maximum.
How can this be achieved with bit operations or another non-branching method?
It will work for unsigned values:
b = -!!(a >> 16) | a;
or, something similar:
static inline unsigned short int fn(unsigned int a) {
    return (-(a >> 16) >> 16) | a;
}
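A quick sanity check of fn() above (test mine):
#include <assert.h>

int main(void)
{
    assert(fn(0x00001234u) == 0x1234u); /* fits in 16 bits: passes through */
    assert(fn(0x00010000u) == 0xFFFFu); /* too large: saturates */
    assert(fn(0xFFFFFFFFu) == 0xFFFFu);
    return 0;
}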
Find minimum of two integers without branching:
http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
"On some rare machines where branching is very expensive and no conditional move instructions exist, the above expression might be faster than the obvious approach, r = (x < y) ? x : y, even though it involves two more instructions. (Typically, the obvious approach is best, though.)"
Just to kick things off, here's a brain-dead benchmark. I'm trying to get a 50/50 mix of large and small values "at random":
#include <iostream>
#include <stdint.h>

int main() {
    uint32_t total = 0;
    uint32_t n = 27465;
    for (int i = 0; i < 1000*1000*500; ++i) {
        n *= 30029; // worst PRNG in the world
        uint32_t a = n & 0x1ffff;
#ifdef EMPTY
        uint16_t b = a; // gives the wrong total, of course.
#endif
#ifdef NORMAL
        uint16_t b = (a > 0xffff) ? 0xffff : a;
#endif
#ifdef RUSLIK
        uint16_t b = (-(a >> 16) >> 16) | a;
#endif
#ifdef BITHACK
        uint16_t b = a ^ ((0xffff ^ a) & -(0xffff < a));
#endif
        total += b;
    }
    std::cout << total << "\n";
}
On my compiler (gcc 4.3.4 on cygwin with -O3), NORMAL wins, followed by RUSLIK, then BITHACK, respectively 0.3, 0.5 and 0.9 seconds slower than the empty loop. Really this benchmark means nothing, I haven't even checked the emitted code to see whether the compiler's smart enough to outwit me somewhere. But I like ruslik's anyway.
1) With an intrinsic on a CPU that natively does this sort of conversion.
2) You're probably not going to like this, but:
c = a >> 16; /* c previously declared as an unsigned short */
/* Saturate 'c' with 1s if there are any 1s, by first propagating
1s rightward, then leftward. */
c |= c >> 8;
c |= c >> 4;
c |= c >> 2;
c |= c >> 1;
c |= c << 1;
c |= c << 2;
c |= c << 4;
c |= c << 8;
b = a | c; /* implicit truncation */
First off, the phrase "non-branching method" doesn't technically make sense when discussing C code; the optimizer may find ways to remove branches from "branchy" C code, and conversely would be entirely within its rights to replace your clever non-branching code with a branch just to spite you (or because some heuristic said it would be faster).
That aside, the simple expression:
uint16_t b = a > UINT16_MAX ? UINT16_MAX : a;
despite "having a branch", will be compiled to some sort of (branch-free) conditional move (or possible just a saturate) by many compilers on many systems (I just tried three different compilers for ARM and Intel, and all generated a conditional move).
I would use that simple, readable expression. If and only if your compiler isn't smart enough to optimize it (or your target architecture doesn't have conditional moves), and if you have benchmark data that shows this to be a bottleneck for your program, then I would (a) find a better compiler and (b) file a bug against your compiler and only then look for clever hacks.
If you're really, truly devoted to being too clever by half, then ruslik's second suggestion is actually quite beautiful (much nicer than a generic min/max).

Emulating variable bit-shift using only constant shifts?

I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.
The particular PowerPC processor I'm working on has the quirk that a shift-by-constant-immediate, like
int ShiftByConstant( int x ) { return x << 3 ; }
is fast, single-op, and superscalar, whereas a shift-by-variable, like
int ShiftByVar( int x, int y ) { return x << y ; }
is a microcoded operation that takes 7-11 cycles to execute while the entire rest of the pipeline stops dead.
What I'd like to do is figure out which non-microcoded integer PPC ops the sraw decodes into and then issue them individually. This won't help with the latency of the sraw itself — it'll replace one op with six — but in between those six ops I can dual-dispatch some work to the other execution units and get a net gain.
I can't seem to find anywhere what μops sraw decodes into — does anyone know how I can replace a variable bit-shift with a sequence of constant shifts and basic integer operations? (A for loop or a switch or anything with a branch in it won't work because the branch penalty is even bigger than the microcode penalty, even for correctly-predicted branches.)
This needn't be answered in assembly; I'm hoping to learn the algorithm rather than the particular code, so an answer in C or a high level language or even pseudo code would be perfectly helpful.
Edit: A couple of clarifications that I should add:
I'm not even a little bit worried about portability
PPC has a conditional-move, so we can assume the existence of a branchless intrinsic function
int isel(int a, int b, int c) { return a >= 0 ? b : c; }
(if you write out a ternary that does the same thing I'll get what you mean)
integer multiplication is also microcoded and even slower than sraw. :-(
On Xenon PPC, the latency of a predicted branch is 8 cycles, so even one makes it as costly as the microcoded instruction. Jump-to-pointer (any indirect branch or function pointer) is a guaranteed mispredict, a 24 cycle stall.
Here you go...
I decided to try these out as well, since Mike Acton claimed it would be faster than using the CELL/PS3 microcoded shift on his CellPerformance site, where he suggests avoiding the indirect shift. However, in all my tests, using the microcoded version was not only faster than a full generic branch-free replacement for indirect shift, it also takes far less memory for the code (1 instruction).
The only reason I did these as templates was to get the right output for both signed (usually arithmetic) and unsigned (logical) shifts.
template <typename T> FORCEINLINE T VariableShiftLeft(T nVal, int nShift)
{   // 31-bit shift capability (rolls over at 32 bits)
    const int bMask1 = -(1 & nShift);
    const int bMask2 = -(1 & (nShift >> 1));
    const int bMask3 = -(1 & (nShift >> 2));
    const int bMask4 = -(1 & (nShift >> 3));
    const int bMask5 = -(1 & (nShift >> 4));
    nVal = (nVal & bMask1) + nVal; // nVal = ((nVal << 1) & bMask1) | (nVal & (~bMask1));
    nVal = ((nVal << (1 << 1)) & bMask2) | (nVal & (~bMask2));
    nVal = ((nVal << (1 << 2)) & bMask3) | (nVal & (~bMask3));
    nVal = ((nVal << (1 << 3)) & bMask4) | (nVal & (~bMask4));
    nVal = ((nVal << (1 << 4)) & bMask5) | (nVal & (~bMask5));
    return nVal;
}

template <typename T> FORCEINLINE T VariableShiftRight(T nVal, int nShift)
{   // 31-bit shift capability (rolls over at 32 bits)
    const int bMask1 = -(1 & nShift);
    const int bMask2 = -(1 & (nShift >> 1));
    const int bMask3 = -(1 & (nShift >> 2));
    const int bMask4 = -(1 & (nShift >> 3));
    const int bMask5 = -(1 & (nShift >> 4));
    nVal = ((nVal >> 1) & bMask1) | (nVal & (~bMask1));
    nVal = ((nVal >> (1 << 1)) & bMask2) | (nVal & (~bMask2));
    nVal = ((nVal >> (1 << 2)) & bMask3) | (nVal & (~bMask3));
    nVal = ((nVal >> (1 << 3)) & bMask4) | (nVal & (~bMask4));
    nVal = ((nVal >> (1 << 4)) & bMask5) | (nVal & (~bMask5));
    return nVal;
}
EDIT: Note on isel()
I saw your isel() code on your website.
// if a >= 0, return x, else y
int isel( int a, int x, int y )
{
    int mask = a >> 31; // arithmetic shift right, splat out the sign bit
    // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
    return x + ((y - x) & mask);
}
FWIW, if you rewrite your isel() to do a mask and mask complement, it will be faster on your PowerPC target since the compiler is smart enough to generate an 'andc' opcode. It's the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly. You just need to change the return to this for the PowerPC versions:
return (x & (~mask)) + (y & mask);
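Spelled out in full, the suggested variant looks like this (my rendering of the change; the name isel2 is only to distinguish it):
// if a >= 0, return x, else y -- mask/mask-complement form
int isel2( int a, int x, int y )
{
    int mask = a >> 31;                // all ones if a < 0, zero otherwise
    return (x & (~mask)) + (y & mask); // the ~mask operand maps to PowerPC's andc
}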
How about this:
if (y & 16) x <<= 16;
if (y & 8) x <<= 8;
if (y & 4) x <<= 4;
if (y & 2) x <<= 2;
if (y & 1) x <<= 1;
It will probably take longer to execute, but it is easier to interleave if you have other code to go in between.
Let's assume that your max shift is 31, so the shift amount is a 5-bit number. Because shifting is cumulative, we can break this into five constant shifts. The obvious version uses branching, but you ruled that out.
Let N be a number between 0 and 4. You want to shift x by 2^N if the bit whose value is 2^N is set in y, otherwise keep x intact. Here is one way to do it:
#define SHIFT(N) x = isel(((y >> N) & 1) - 1, x << (1 << N), x);
The macro assigns to x either x << 2^N or x, depending on whether the Nth bit is set in y or not.
And then the driver:
SHIFT(0); SHIFT(1); SHIFT(2); SHIFT(3); SHIFT(4);
Note that N is a macro parameter and becomes constant.
I don't know, though, whether this will actually be faster than the variable shift. If it were, one wonders why the microcode wouldn't run this instead...
This one breaks my head. I've now discarded a half dozen ideas. All of them exploit the notion that adding a thing to itself shifts left 1, doing the same to the result shifts left 4, and so on. If you keep all the partial results for shift left 0, 1, 2, 4, 8, and 16, then by testing bits 0 to 4 of the shift variable you can get your initial shift. Now do it again, once for each 1 bit in the shift variable. Frankly, you might as well send your processor out for coffee.
The one place I'd look for real help is Hank Warren's Hacker's Delight (which is the only useful part of this answer).
How about this:
static const int multiplicands[32] = { 1, 2, 4, 8, 16, 32, /* ... etc ... */ };

int ShiftByVar( int x, int y )
{
    //return x << y;
    return x * multiplicands[y];
}
If the shift count can be calculated far in advance, then I have two ideas that might work:
Using self-modifying code
Just modify the shift-amount immediate in the instruction. Alternatively, generate code dynamically for the functions with variable shift.
Group the values with the same shift count together if possible, and do the operation all at once using Duff's device or a function pointer to minimize branch misprediction:
// shift by constant functions
typedef int (*shiftFunc)(int); // the shift function
#define SHL(n) int shl##n(int x) { return x << (n); }
SHL(1)
SHL(2)
SHL(3)
...
shiftFunc shiftLeft[] = { shl1, shl2, shl3... };
int arr[MAX]; // all the values that need to be shifted with the same amount
shiftFunc shl = shiftLeft[3]; // when you want to shift by 3
for (int i = 0; i < MAX; i++)
arr[i] = shl(arr[i]);
This method might also be done in combination with self-modifying or run-time code generation to remove the need for a function pointer.
Edit: As commented, unfortunately there's no branch prediction on jump to register at all, so the only way this could work is generating code as I said above, or using SIMD
If the range of the values is small, a lookup table is another possible solution:
#define S(x, n) ((x) + 0) << (n), ((x) + 1) << (n), ((x) + 2) << (n), ((x) + 3) << (n), \
                ((x) + 4) << (n), ((x) + 5) << (n), ((x) + 6) << (n), ((x) + 7) << (n)
#define S2(x, n) S((x + 0)*8, n), S((x + 1)*8, n), S((x + 2)*8, n), S((x + 3)*8, n), \
                 S((x + 4)*8, n), S((x + 5)*8, n), S((x + 6)*8, n), S((x + 7)*8, n)
uint8_t shl[8][256] = {
{ S2(0U, 0), S2(8U, 0), S2(16U, 0), S2(24U, 0) },
{ S2(0U, 1), S2(8U, 1), S2(16U, 1), S2(24U, 1) },
...
{ S2(0U, 7), S2(8U, 7), S2(16U, 7), S2(24U, 7) },
}
Now x << n is simply shl[n][x], with x being a uint8_t. The table costs 2 KB (8 × 256 B) of memory. However, for 16-bit values you'll need a 2 MB table (16 shift amounts × 64 K values × 2 B each), which may still be viable, and you can do a 32-bit shift by combining two 16-bit shifts together.
There is some good stuff here regarding bit manipulation black magic:
Advanced bit manipulation fu (Christer Ericson's blog)
Don't know if any of it's directly applicable, but if there is a way, likely there are some hints to that way in there somewhere.
Here's something that is trivially unrollable:
int result = value;
for (int i = 0; i < 5; ++i)
{
    int mask = -((k >> i) & 1); // all ones if bit i of the shift count k is set
    result = ((result << (1 << i)) & mask) | (result & ~mask); // replace with isel if appropriate; 1 << i is constant once unrolled
}
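Unrolled, every shift amount becomes a compile-time constant (unrolled sketch mine, using the corrected mask-select form above; unsigned types keep the shifts well defined):
#include <stdint.h>

uint32_t shift_left_var(uint32_t value, unsigned k) /* k in [0, 31] */
{
    uint32_t r = value, m;
    m = -(k & 1u);        r = ((r << 1)  & m) | (r & ~m);
    m = -((k >> 1) & 1u); r = ((r << 2)  & m) | (r & ~m);
    m = -((k >> 2) & 1u); r = ((r << 4)  & m) | (r & ~m);
    m = -((k >> 3) & 1u); r = ((r << 8)  & m) | (r & ~m);
    m = -((k >> 4) & 1u); r = ((r << 16) & m) | (r & ~m);
    return r;
}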
