What are the int8 matrix multiply instructions in Neoverse V1? - arm

This WikiChip article states that Neoverse V1 has int8 instructions that allow 256 operations per CPU clock (per core, presumably):
I'm trying to understand what these instructions are. Do they take int8 input and accumulate the results in int8's or int16s (risking overflow or requiring saturation), or do they accumulate into int32?
What are these instructions? Are they listed in https://developer.arm.com/documentation/dui0801/k/A64-SIMD-Vector-Instructions/ ?

What are these instructions?
smopa for the int8 and int16 types, bfmopa for the BF16 type. They are documented there.
Do they take int8 input and accumulate the results in int8's or int16s (risking overflow or requiring saturation), or do they accumulate into int32?
The int8 version accumulates into int32.
Unfortunately, the documentation quality is mediocre. I would recommend that Arm hire a good technical writer to document their hardware.
Still, I think that instruction does something like the following C++.
It is untested because I don't have hardware that supports that ISA.
#include <array>
#include <cstdint>
using std::array;
// Sketch for 256-bit vectors: 32 int8 lanes per operand, 8x8 tile of 32-bit accumulators.
void smopa( array<int8_t, 32> a, array<int8_t, 32> b, array<array<int, 8>, 8>& acc,
    array<bool, 32> mask1, array<bool, 32> mask2, bool subtract )
{
    for( int r = 0; r < 8; r++ )
        for( int c = 0; c < 8; c++ )
        {
            int sum = acc[ r ][ c ];
            // Each accumulator receives a 4-element dot product
            for( int i = 0; i < 4; i++ )
            {
                int ir = r * 4 + i;
                int ic = c * 4 + i;
                // Lanes disabled by either mask contribute nothing
                if( !( mask1[ ir ] && mask2[ ic ] ) )
                    continue;
                int p = (int)a[ ir ] * (int)b[ ic ];
                sum = subtract ? sum - p : sum + p;
            }
            acc[ r ][ c ] = sum;
        }
}
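To see why the accumulator width matters, here is a minimal plain-C sketch of one 4-element int8 dot product accumulated into an int32. It only models the arithmetic, not the instruction itself, and the names dot4_i8 and overflows_int16 are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulate the dot product of four int8 pairs into a 32-bit accumulator. */
int32_t dot4_i8(const int8_t a[4], const int8_t b[4], int32_t acc)
{
    assert(a != NULL && b != NULL);
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];  /* each product is at most 16384 */
    return acc;
}

/* Returns 1 if the value would not fit in a 16-bit accumulator. */
int overflows_int16(int32_t v)
{
    return v > INT16_MAX || v < INT16_MIN;
}
```

With all eight inputs equal to -128, a single call already yields 65536, which no int16 can hold, while an int32 accumulator has room for many thousands of such steps.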

Related

What is the matrix/vector operation that corresponds to this code?

Here is the code:
long long mul(long long x)
{
uint64_t M[64] = INIT;
uint64_t result = 0;
for ( int i = 0; i < 64; i++ )
{
uint64_t a = x & M[i];
uint64_t b = 0;
while ( a ){
b ^= a & 1;
a >>= 1;
}
result |= b << (63 - i);
}
return result;
}
This code implements multiplication of a matrix and a vector over GF(2). It returns result as the product of the 64x64 matrix M and the 64-element vector x.
I want to know what linear algebraic operation (over GF(2)) this code is:
long long unknown(long long x)
{
    uint64_t A[] = INIT;
    uint64_t a = 0, b = 0;
    int i, j;
    for( i = 1; i <= 64; i++ ){
        for( j = i; j <= 64; j++ ){
            if( ((x >> (64-i)) & 1) && ((x >> (64-j)) & 1) )
                a ^= A[b];
            b++;
        }
    }
    return a;
}
I want to know what linear algebraic operation( on GF(2) ) this code is:
Of course you mean GF(2)^64, the space of 64-dimensional vectors over GF(2).
Consider first the loop structure:
for( i = 1; i <= 64; i++ ){
for( j = i; j <= 64; j++ ){
That's looking at every distinct pair of indices (the indices themselves not necessarily distinct from each other). That should provide a first clue. We then see
if( ((x >> (64-i)) & 1) && ((x >> (64-j)) & 1) )
, which is testing whether vector x has both bit i and bit j set. If it does, then we add a row of matrix A into accumulation variable a, by vector sum (== element-wise exclusive or). By incrementing b on every inner-loop iteration, we ensure that each iteration services a different row of A. And that also tells us that A must have 64 * 65 / 2 = 2080 rows (that matter).
In general, this is not a linear operation at all. The criterion for an operation o on a vector space over GF(2) to be linear boils down to this expression holding for all pairs of vectors x and y:
o(x + y) = o(x) + o(y)
Now, for notational convenience, let's consider the space GF(2)^2 instead of GF(2)^64; the result can be extended from the former to the latter simply by adding zeroes. Let x be the bit vector (1, 0) (represented, for example, by the integer 2). Let y be the bit vector (0, 1) (represented by the integer 1). And let A be this matrix:
1 0
0 1
1 0
Your operation has the following among its results:
operand   result   as integer   comment
x         (1, 0)   2            Only the first row is accumulated
y         (1, 0)   2            Only the third row is accumulated
x + y     (0, 1)   1            All rows are accumulated
Clearly, it is not the case that o(x) + o(y) = o(x + y) for this x, y, and characteristic A, so the operation is not linear for this A.
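The counterexample can be checked mechanically. The sketch below generalizes the question's loop to n bits; quad_op, is_additive_for and EXAMPLE_A are hypothetical names introduced here, and the rows of A are stored as integers with the first vector component in the most significant bit, as in the question.

```c
#include <assert.h>
#include <stdint.h>

/* XOR together row A[b] for every index pair (i, j), 1 <= i <= j <= n,
   whose bits are both set in x. Bit 1 is the most significant of the n bits,
   mirroring the (64 - i) shifts in the question. */
uint64_t quad_op(const uint64_t *A, int n, uint64_t x)
{
    assert(n >= 1 && n <= 64);
    uint64_t acc = 0;
    int row = 0;
    for (int i = 1; i <= n; i++) {
        for (int j = i; j <= n; j++) {
            if (((x >> (n - i)) & 1) && ((x >> (n - j)) & 1))
                acc ^= A[row];
            row++;  /* every (i, j) pair owns exactly one row */
        }
    }
    return acc;
}

/* The 3-row matrix from the counterexample: (1,0), (0,1), (1,0). */
const uint64_t EXAMPLE_A[3] = { 2, 1, 2 };

/* Returns 1 if o(x + y) == o(x) + o(y) for this pair, 0 otherwise. */
int is_additive_for(const uint64_t *A, int n, uint64_t x, uint64_t y)
{
    return quad_op(A, n, x ^ y) == (quad_op(A, n, x) ^ quad_op(A, n, y));
}
```

Calling is_additive_for(EXAMPLE_A, 2, 2, 1) returns 0, which matches the table: o(x) + o(y) is (0, 0), while o(x + y) is (0, 1).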
There are matrices A for which the corresponding operation is linear, but what linear operation they represent will depend on A. For example, it is possible to represent a wide variety of matrix-vector multiplications this way. It's not clear to me whether linear operations other than matrix-vector multiplications can be represented in this form, but I'm inclined to think not.

How to generate random 64-bit unsigned integer in C

I need to generate random 64-bit unsigned integers using C. I mean, the range should be 0 to 18446744073709551615. RAND_MAX is 1073741823.
I found some solutions in the links which might be possible duplicates, but the answers mostly concatenate some rand() results or perform some incremental arithmetic operations. So the results are always 18 or 20 digits long. I also want outcomes like 5, 11, 33387, not just 3771778641802345472.
By the way, I don't have much experience with C, but any approach, code samples, and ideas would be helpful.
Concerning "So results are always 18 digits or 20 digits."
See @Thomas's comment. If you generate random numbers long enough, the code will create ones like 5, 11 and 33387. If the code generates 1,000,000,000 numbers/second, it may take a year, as very small numbers < 100,000 are so rare amongst all 64-bit numbers.
rand() simply returns random bits. A simplistic method pulls 1 bit at a time:
uint64_t rand_uint64_slow(void) {
uint64_t r = 0;
for (int i=0; i<64; i++) {
r = r*2 + rand()%2;
}
return r;
}
Assuming RAND_MAX is some power of 2 minus 1, as in OP's case 1073741823 == 0x3FFFFFFF, take advantage of the fact that at least 15 bits are generated each time. The following code will call rand() 5 times - a tad wasteful. Instead, bits shifted out could be saved for the next random number, but that brings in other issues. Leave that for another day.
uint64_t rand_uint64(void) {
uint64_t r = 0;
for (int i=0; i<64; i += 15 /*30*/) {
r = r*((uint64_t)RAND_MAX + 1) + rand();
}
return r;
}
A portable loop count method avoids the 15 /*30*/ - But see 2020 edit below.
#if RAND_MAX/256 >= 0xFFFFFFFFFFFFFF
#define LOOP_COUNT 1
#elif RAND_MAX/256 >= 0xFFFFFF
#define LOOP_COUNT 2
#elif RAND_MAX/256 >= 0x3FFFF
#define LOOP_COUNT 3
#elif RAND_MAX/256 >= 0x1FF
#define LOOP_COUNT 4
#else
#define LOOP_COUNT 5
#endif
uint64_t rand_uint64(void) {
uint64_t r = 0;
for (int i=LOOP_COUNT; i > 0; i--) {
r = r*(RAND_MAX + (uint64_t)1) + rand();
}
return r;
}
The autocorrelation effects commented here are caused by a weak rand(). C does not specify a particular method of random number generation. The above relies on rand() - or whatever base random function employed - being good.
If rand() is sub-par, then code should use other generators. Yet one can still use this approach to build up larger random numbers.
[Edit 2020]
Hallvard B. Furuseth provides a nice way to determine the number of bits in RAND_MAX when it is a Mersenne number - a power of 2 minus 1.
#define IMAX_BITS(m) ((m)/((m)%255+1) / 255%255*8 + 7-86/((m)%255+12))
#define RAND_MAX_WIDTH IMAX_BITS(RAND_MAX)
_Static_assert((RAND_MAX & (RAND_MAX + 1u)) == 0, "RAND_MAX not a Mersenne number");
uint64_t rand64(void) {
uint64_t r = 0;
for (int i = 0; i < 64; i += RAND_MAX_WIDTH) {
r <<= RAND_MAX_WIDTH;
r ^= (unsigned) rand();
}
return r;
}
If you don't need cryptographically secure pseudo random numbers, I would suggest using MT19937-64. It is a 64 bit version of Mersenne Twister PRNG.
Please, do not combine rand() outputs and do not build upon other tricks. Use existing implementation:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html
If you have a sufficiently good source of random bytes (like, say, /dev/random or /dev/urandom on a Linux machine), you can simply consume 8 bytes from that source and concatenate them. If they are independent and uniformly distributed, you're set.
If you don't, you MAY get away with doing the same, but there are likely to be some artefacts in your pseudo-random generator that give a toe-hold for all sorts of hi-jinx.
Example code assuming we have an open binary FILE *source:
/* Implementation #1, slightly more elegant than looping yourself */
uint64_t random64bit(void)
{
    uint64_t rv;
    size_t count;
    /* retry until one complete 8-byte item has been read */
    do {
        count = fread(&rv, sizeof(rv), 1, source);
    } while (count != 1);
    return rv;
}
/* Implementation #2 */
uint64_t random64bit(void)
{
    uint64_t rv = 0;
    int c;
    for (size_t i = 0; i < sizeof(rv); i++) {
        /* retry until a byte is available */
        do {
            c = fgetc(source);
        } while (c < 0);
        rv = (rv << 8) | (c & 0xff);
    }
    return rv;
}
If you replace "read random bytes from a randomness device" with "get bytes from a function call", all you have to do is to adjust the shifts in method #2.
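As a rough illustration of "get bytes from a function call", the sketch below takes the source as a function pointer. The typedefs and function names are hypothetical, and counting_bytes is a deterministic stand-in used only to make the byte order visible; it is not a random source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical source type: returns one byte (0..255) per call. */
typedef int (*byte_source_fn)(void);

/* Method #2 with the file read replaced by a function call. */
uint64_t random64_from_bytes(byte_source_fn next_byte)
{
    assert(next_byte != NULL);
    uint64_t rv = 0;
    for (size_t i = 0; i < sizeof rv; i++)
        rv = (rv << 8) | (uint64_t)(next_byte() & 0xff);
    return rv;
}

/* Hypothetical source type that yields 16 bits per call. */
typedef unsigned (*word16_source_fn)(void);

/* Same idea with adjusted shifts: 4 calls of 16 bits each. */
uint64_t random64_from_words16(word16_source_fn next_word)
{
    assert(next_word != NULL);
    uint64_t rv = 0;
    for (int i = 0; i < 4; i++)
        rv = (rv << 16) | (uint64_t)(next_word() & 0xffffu);
    return rv;
}

/* Demonstration source only: yields 1, 2, 3, ... */
int counting_bytes(void)
{
    static int n = 0;
    return ++n & 0xff;
}
```

The first call to random64_from_bytes(counting_bytes) returns 0x0102030405060708, showing that the first byte fetched ends up most significant.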
You're vastly more likely to get a "number with many digits" than one with a "small number of digits": of all the numbers between 0 and 2 ** 64, roughly 95% have 19 or more decimal digits, so really that is what you will mostly get.
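The 95% figure can be reproduced with a few lines of arithmetic. This is a minimal sketch; fraction_with_at_least_digits is a hypothetical helper, and it uses double, so it is only meaningful for large digit counts where rounding does not matter.

```c
#include <assert.h>
#include <stdint.h>

/* Fraction of all 2^64 values that have at least `digits` decimal digits. */
double fraction_with_at_least_digits(int digits)
{
    assert(digits >= 1 && digits <= 20);
    if (digits == 1)
        return 1.0;                 /* every value has at least one digit */
    uint64_t threshold = 1;         /* becomes 10^(digits - 1) */
    for (int i = 1; i < digits; i++)
        threshold *= 10;
    const double total = 18446744073709551616.0;  /* 2^64 */
    return (total - (double)threshold) / total;
}
```

For 19 digits this gives (2^64 - 10^18) / 2^64, about 0.946, and for 20 digits about 0.458, so nearly half of all results use the full 20 digits.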
If you are willing to use a repetitive pseudo random sequence and you can deal with a bunch of values that will never happen (like even numbers? ... don't use just the low bits), an LCG or MCG are simple solutions. Wikipedia: Linear congruential generator can get you started (there are several more types including the commonly used Wikipedia: Mersenne Twister). And this site can generate a couple prime numbers for the modulus and the multiplier below. (caveat: this sequence will be guessable and thus it is NOT secure)
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
uint64_t
mcg64(void)
{
    static uint64_t i = 1;
    /* note: the 64-bit product wraps modulo 2^64 before the % is applied */
    return (i = (164603309694725029ull * i) % 14738995463583502973ull);
}
int
main(int ac, char * av[])
{
    for (int i = 0; i < 10; i++)
        printf("%016" PRIx64 "\n", mcg64());
}
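In the generator above, the product 164603309694725029 * i is computed in 64-bit arithmetic, so it wraps modulo 2^64 before the % is applied and the sequence is no longer a true multiplicative congruential sequence modulo the prime. One way around this, sketched here in portable C with the hypothetical helpers addmod64, mulmod64 and mcg64_step, is to do the multiplication by repeated modular doubling. It is slower than a 128-bit multiply but needs no compiler extensions.

```c
#include <assert.h>
#include <stdint.h>

#define MCG_MULT 164603309694725029ULL     /* multiplier from the answer */
#define MCG_MOD  14738995463583502973ULL   /* modulus from the answer */

/* (a + b) mod m without overflowing 64 bits; requires a, b < m. */
uint64_t addmod64(uint64_t a, uint64_t b, uint64_t m)
{
    assert(a < m && b < m);
    return (a >= m - b) ? a - (m - b) : a + b;
}

/* (a * b) mod m by binary double-and-add; requires a, b < m. */
uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t m)
{
    assert(a < m && b < m);
    uint64_t result = 0;
    while (b != 0) {
        if (b & 1)
            result = addmod64(result, a, m);
        a = addmod64(a, a, m);   /* a = 2a mod m */
        b >>= 1;
    }
    return result;
}

/* Advance the caller-owned state and return the new value. */
uint64_t mcg64_step(uint64_t *state)
{
    *state = mulmod64(MCG_MULT, *state % MCG_MOD, MCG_MOD);
    return *state;
}
```

Starting from a state of 1, the first value returned is the multiplier itself; the state must never be 0, since 0 maps to 0 forever.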
I have tried this code here and it seems to work fine there.
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <math.h>
int main(){
    srand(time(NULL));
    int a = rand();
    int b = rand();
    int c = rand();
    int d = rand();
    long e = (long)a*b;
    e = labs(e);
    long f = (long)c*d;
    f = labs(f);
    long long answer = (long long)e*f;
    printf("value %lld",answer);
    return 0;
}
I ran a few iterations and I get the following outputs:
value 1869044101095834648
value 2104046041914393000
value 1587782446298476296
value 604955295827516250
value 41152208336759610
value 57792837533816000
If you have a 32-bit or 16-bit random value, generate 2 or 4 of them and combine them into one 64-bit value with << and |.
uint64_t rand_uint64(void) {
    // Assuming RAND_MAX is 2^30 - 1, i.e. rand() yields 30 random bits.
    uint64_t r = rand();
    r = r<<30 | rand();
    r = r<<30 | rand();
    return r;
}
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>
unsigned long long int randomize(unsigned long long int uint_64);
int main(void)
{
srand(time(0));
unsigned long long int random_number = randomize(18446744073709551615ULL);
printf("%llu\n",random_number);
random_number = randomize(123);
printf("%llu\n",random_number);
return 0;
}
unsigned long long int randomize(unsigned long long int uint_64)
{
char buffer[100] , data[100] , tmp[2];
//convert llu to string,store in buffer
sprintf(buffer, "%llu", uint_64);
//store buffer length
size_t len = strlen(buffer);
//x : store converted char to int, rand_num : random number , index of data array
int x , rand_num , index = 0;
//condition that prevents the program from generating number that is bigger input value
bool Condition = 0;
//iterate over buffer array
for( int n = 0 ; n < len ; n++ )
{
//store the first character of buffer
tmp[0] = buffer[n];
tmp[1] = '\0';
//convert it to integer,store in x
x = atoi(tmp);
if( n == 0 )
{
//if first iteration,rand_num must be less than or equal to x
rand_num = rand() % ( x + 1 );
//if generated random number does not equal to x,condition is true
if( rand_num != x )
Condition = 1;
//convert the integer to its corresponding character and store it in the data array; increment index
data[index] = rand_num + '0';
index++;
}
//if not first iteration,do the following
else
{
if( Condition )
{
rand_num = rand() % ( 10 );
data[index] = rand_num + '0';
index++;
}
else
{
rand_num = rand() % ( x + 1 );
if( rand_num != x )
Condition = 1;
data[index] = rand_num + '0';
index++;
}
}
}
data[index] = '\0';
char *ptr ;
//convert the data array to unsigned long long int
unsigned long long int ret = strtoull(data,&ptr,10);
return ret;
}

Pseudo-Random number generator based on LCG

I want to implement a pseudo-random number generator in xv6. I am trying to implement the linear congruential generator algorithm, but I don't understand how to seed it. Here is a piece of my code. I know this code won't work because X is not changing globally. I don't understand how to do that.
static int X = 1;
int random_g(int M)
{
int a = 1103515245, c = 12345;
X = (a * X + c) % M;
return X;
}
Incorrect code.
Do not use % on X, the random state variable, to update the state. Use % to form the return value.
Use unsigned types to avoid signed integer overflow (UB) - Perhaps unsigned, unsigned long, unsigned long long. Wider affords a longer sequence.
To match a = 1103515245, c = 12345, we want m = 2^31.
static unsigned long X = 1;
int random_g(int M) {
const unsigned long a = 1103515245, c = 12345;
#define m 0x80000000
int r = (X % M) + 1; // [1 ... M]
X = (a * X + c) % m;
return r;
}
Additional code is needed to remove the typical modulo-M bias. Many SO posts cover that.
Ref: Why 1103515245 is used in rand? and http://wiki.osdev.org/Random_Number_Generator
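One common way to remove that bias is rejection sampling: discard raw values from the incomplete last bucket and retry. The sketch below is an assumption about what such "additional code" could look like, not the answer's own method; lcg_seed, lcg_next and random_unbiased are hypothetical names. It also derives the result from the high bits of the state, since the low bits of a power-of-two LCG have short periods.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t lcg_state = 1;   /* hypothetical global generator state */

void lcg_seed(uint32_t seed)
{
    lcg_state = seed & 0x7fffffffu;
}

/* One step of X = (a * X + c) mod 2^31; returns a value in [0, 2^31). */
uint32_t lcg_next(void)
{
    lcg_state = (1103515245u * lcg_state + 12345u) & 0x7fffffffu;
    return lcg_state;
}

/* Uniform value in [1, M], obtained by rejecting the uneven tail. */
int random_unbiased(int M)
{
    assert(M >= 1);
    const uint32_t range  = 0x80000000u;           /* number of raw values */
    const uint32_t bucket = range / (uint32_t)M;   /* raw values per result */
    const uint32_t limit  = bucket * (uint32_t)M;  /* first rejected value */
    uint32_t x;
    do {
        x = lcg_next();
    } while (x >= limit);
    return (int)(x / bucket) + 1;
}
```

When M divides 2^31 exactly, as with M = 8, nothing is ever rejected; otherwise fewer than M raw values out of 2^31 are discarded.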
I don't know how much that helps you, but if you have an Intel Ivy Bridge or later generation processor, you could try to use the RDRAND instruction. Something along these lines:
static int X;
int
random_g (int M)
{
    unsigned long v;
    // RDRAND RAX, hand-encoded for 64-bit mode (REX.W 0F C7 /6); in 32-bit
    // mode drop the 0x48 prefix. The carry flag (success) is not checked here.
    asm volatile(".byte 0x48, 0x0F, 0xC7, 0xF0" : "=a"(v) : : "cc");
    X = (int)(v & 0x7fffffff); // X = rdrand_val
    unsigned a = 1103515245, c = 12345;
    X = (int)((a * (unsigned)X + c) % (unsigned)M);
    return X;
}
I haven't tested the above code, as I can't build xv6 right now, but it should give you a hint as to how you can utilise your processor's RNG.
In the following code, random_g is a self-seeding random number generator that returns values between 1 and M. The main function tests the function for the specific case where M is 8.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <time.h>
int random_g( int M )
{
static uint32_t X = 1;
static bool ready = false;
if ( !ready )
{
X = (uint32_t)time(NULL);
ready = true;
}
static const uint32_t a = 1103515245;
static const uint32_t c = 12345;
X = (a * X + c);
uint64_t temp = (uint64_t)X * (uint64_t)M;
temp >>= 32;
temp++;
return (int)temp;
}
int main(void)
{
int i, r;
int M = 8;
int *histogram = calloc( M+1, sizeof(int) );
for ( i = 0; i < 1000000; i++ )
{
r = random_g( M );
if ( i < 10 )
printf( "%d\n", r );
if ( r < 1 || r > M )
{
printf( "bad number: %d\n", r );
break;
}
histogram[r]++;
}
printf( "\n" );
for ( i = 1; i <= M; i++ )
printf( "%d %6d\n", i, histogram[i] );
free( histogram );
}

Optimize Bilinear Resize Algorithm in C

Can anyone spot any way to improve the speed of the following bilinear resizing algorithm?
I need to improve speed, as this is critical, while keeping good image quality. It is expected to be used on mobile devices with slow CPUs.
The algorithm is used mainly for up-scaling. Any other faster bilinear algorithm would also be appreciated. Thanks
void resize(int* input, int* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
int a, b, c, d, x, y, index;
float x_ratio = ((float)(sourceWidth - 1)) / targetWidth;
float y_ratio = ((float)(sourceHeight - 1)) / targetHeight;
float x_diff, y_diff, blue, red, green ;
int offset = 0 ;
for (int i = 0; i < targetHeight; i++)
{
for (int j = 0; j < targetWidth; j++)
{
x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = (y * sourceWidth + x) ;
a = input[index] ;
b = input[index + 1] ;
c = input[index + sourceWidth] ;
d = input[index + sourceWidth + 1] ;
// blue element
blue = (a&0xff)*(1-x_diff)*(1-y_diff) + (b&0xff)*(x_diff)*(1-y_diff) +
(c&0xff)*(y_diff)*(1-x_diff) + (d&0xff)*(x_diff*y_diff);
// green element
green = ((a>>8)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>8)&0xff)*(x_diff)*(1-y_diff) +
((c>>8)&0xff)*(y_diff)*(1-x_diff) + ((d>>8)&0xff)*(x_diff*y_diff);
// red element
red = ((a>>16)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>16)&0xff)*(x_diff)*(1-y_diff) +
((c>>16)&0xff)*(y_diff)*(1-x_diff) + ((d>>16)&0xff)*(x_diff*y_diff);
output [offset++] =
0x000000ff | // alpha
((((int)red) << 24)&0xff0000) |
((((int)green) << 16)&0xff00) |
((((int)blue) << 8)&0xff00);
}
}
}
Off the top of my head:
Stop using floating-point, unless you're certain your target CPU has it in hardware with good performance.
Make sure memory accesses are cache-optimized, i.e. clumped together.
Use the fastest data types possible. Sometimes this means smallest, sometimes it means "most native, requiring least overhead".
Investigate if signed/unsigned for integer operations have performance costs on your platform.
Investigate if look-up tables rather than computations gain you anything (but these can blow the caches, so be careful).
And, of course, do lots of profiling and measurements.
In-Line Cache and Lookup Tables
Cache your computations in your algorithm.
Avoid duplicate computations (like (1-y_diff) or (x_ratio * j))
Go through all the lines of your algorithm, and try to identify patterns of repetitions. Extract these to local variables. And possibly extract to functions, if they are short enough to be inlined, to make things more readable.
Use a lookup-table
It's quite likely that, if you can spare some memory, you can implement a "store" for your RGB values and simply "fetch" them based on the inputs that produced them. Maybe you don't need to store all of them, but you could experiment and see if some come back often. Alternatively, you could "fudge" your colors and thus end up with fewer values to store for more lookup inputs.
If you know the boundaries for your inputs, you can calculate the complete domain space and figure out what makes sense to cache. For instance, if you can't cache the whole R, G, B values, maybe you can at least pre-compute the shifts ((b>>16) and so forth) that are most likely deterministic in your case.
Use the Right Data Types for Performance
If you can avoid double and float variables, use int. On most architectures, int would be the fastest type for computations because of the memory model. You can still achieve decent precision by simply shifting your units (i.e. use 1026 as int instead of 1.026 as double or float). It's quite likely that this trick would be enough for you.
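The "shift your units" idea applied to the interpolation itself might look like the following sketch. It assumes 8-bit fractional weights, where 256 stands for 1.0, and the same a/b/c/d neighbour layout as the question (top-left, top-right, bottom-left, bottom-right). The names bilerp8 and bilerp_argb are hypothetical, and the pixel layout is taken to be 0xAARRGGBB as on the input side of the question.

```c
#include <assert.h>
#include <stdint.h>

/* Blend four 8-bit channel values; fx and fy are weights in [0, 256]. */
unsigned bilerp8(unsigned a, unsigned b, unsigned c, unsigned d,
                 unsigned fx, unsigned fy)
{
    assert(fx <= 256 && fy <= 256);
    unsigned top    = a * (256 - fx) + b * fx;   /* scaled by 256 */
    unsigned bottom = c * (256 - fx) + d * fx;   /* scaled by 256 */
    /* scaled by 65536; add half a unit before the shift to round */
    return (top * (256 - fy) + bottom * fy + 32768u) >> 16;
}

/* Apply bilerp8 to the blue, green and red channels of four pixels. */
uint32_t bilerp_argb(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                     unsigned fx, unsigned fy)
{
    uint32_t out = 0xff000000u;   /* opaque alpha */
    for (int shift = 0; shift <= 16; shift += 8) {
        unsigned v = bilerp8((a >> shift) & 0xff, (b >> shift) & 0xff,
                             (c >> shift) & 0xff, (d >> shift) & 0xff,
                             fx, fy);
        out |= (uint32_t)v << shift;
    }
    return out;
}
```

The largest intermediate value is 255 * 65536 + 32768, which fits comfortably in 32 bits, so no floating point or 64-bit arithmetic is needed in the inner loop.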
x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = (y * sourceWidth + x) ;
Could surely use some optimization: you were using x_ratio * (j-1) just a few cycles earlier, so all you really need here is x += x_ratio.
My random guess (use a profiler instead of letting people guess!):
The compiler has to generate code that works when input and output overlap, which means it has to generate loads of redundant stores and loads. Add restrict to the input and output parameters to remove that safety feature.
You could also try using a=b; and c=d; instead of loading them again.
Here is my version; steal some ideas. My C-fu is quite weak, so some lines are pseudocode, but you can fix them.
void resize(int* input, int* output,
int sourceWidth, int sourceHeight,
int targetWidth, int targetHeight
) {
// Let's create some lookup tables!
// you can move them into 2-dimensional arrays to
// group together values used at the same time to help processor cache
int sx[0..targetWidth ]; // target->source X lookup
int sy[0..targetHeight]; // target->source Y lookup
int mx[0..targetWidth ]; // left pixel's multiplier
int my[0..targetHeight]; // bottom pixel's multiplier
// we don't have to calc indexes every time, find out when
bool reloadPixels[0..targetWidth ];
bool shiftPixels[0..targetWidth ];
int shiftReloadPixels[0..targetWidth ]; // can be combined if necessary
int v; // temporary value
for (int j = 0; j < targetWidth; j++){
// (8bit + targetBits + sourceBits) should be < max int
v = 256 * j * (sourceWidth-1) / (targetWidth-1);
sx[j] = v / 256;
mx[j] = v % 256;
reloadPixels[j] = j ? ( sx[j-1] != sx[j] ? 1 : 0)
: 1; // always load first pixel
// if no reload -> then no shift too
shiftPixels[j] = j ? ( sx[j-1]+1 == sx[j] ? 2 : 0)
: 0; // nothing to shift at first pixel
shiftReloadPixels[j] = reloadPixels[j] | shiftPixels[j];
}
for (int i = 0; i < targetHeight; i++){
v = 256 * i * (sourceHeight-1) / (targetHeight-1);
sy[i] = v / 256;
my[i] = v % 256;
}
int shiftReload;
int srcIndex;
int srcRowIndex;
int offset = 0;
int lm, rm, tm, bm; // left / right / top / bottom multipliers
int a, b, c, d;
for (int i = 0; i < targetHeight; i++){
srcRowIndex = sy[ i ] * sourceWidth;
tm = my[i];
bm = 255 - tm;
for (int j = 0; j < targetWidth; j++){
// too much ifs can be too slow, measure.
// always true for first pixel in a row
if( shiftReload = shiftReloadPixels[ j ] ){
srcIndex = srcRowIndex + sx[j];
if( shiftReload & 2 ){
a = b;
c = d;
}else{
a = input[ srcIndex ];
c = input[ srcIndex + sourceWidth ];
}
b = input[ srcIndex + 1 ];
d = input[ srcIndex + 1 + sourceWidth ];
}
lm = mx[j];
rm = 255 - lm;
// WTF?
// Input AA RR GG BB
// Output RR GG BB AA
if( j ){
leftOutput = rightOutput ^ 0xFFFFFF00;
}else{
leftOutput =
// blue element
((( ( (a&0xFF)*tm
+ (c&0xFF)*bm )*lm
) & 0xFF0000 ) >> 8)
// green element
| ((( ( ((a>>8)&0xFF)*tm
+ ((c>>8)&0xFF)*bm )*lm
) & 0xFF0000 )) // no need to shift
// red element
| ((( ( ((a>>16)&0xFF)*tm
+ ((c>>16)&0xFF)*bm )*lm
) & 0xFF0000 ) << 8 )
;
}
rightOutput =
// blue element
((( ( (b&0xFF)*tm
+ (d&0xFF)*bm )*lm
) & 0xFF0000 ) >> 8)
// green element
| ((( ( ((b>>8)&0xFF)*tm
+ ((d>>8)&0xFF)*bm )*lm
) & 0xFF0000 )) // no need to shift
// red element
| ((( ( ((b>>16)&0xFF)*tm
+ ((d>>16)&0xFF)*bm )*lm
) & 0xFF0000 ) << 8 )
;
output[offset++] =
// alpha
0x000000ff
| leftOutput
| rightOutput
;
}
}
}

Optimizing repeated modulus within a loop

I have this statement in my C program that I want to optimize. By optimization I particularly mean using bitwise operators (but any other suggestion is also fine).
uint64_t h_one = hash[0];
uint64_t h_two = hash[1];
for ( int i=0; i<k; ++i )
{
(uint64_t *) k_hash[i] = ( h_one + i * h_two ) % size; //suggest some optimization for this line.
}
Any suggestion will be of great help.
Edit:
As of now size can be any int, but that is not a problem: we can round it up to the next prime (but maybe not to a power of two, since for larger values the powers of 2 grow rapidly and that would waste a lot of memory).
h_two is a 64-bit int (basically a chunk of 64 bits).
so essentially you're doing
k_0 = h_1 mod s
k_1 = (h_1 + h_2) mod s = (k_0 + h_2) mod s
k_2 = (h_1 + h_2 + h_2) mod s = (k_1 + h_2) mod s
..
k_n = (k_(n-1) + h_2) mod s
Depending on overflow issues (which shouldn't differ from the original if size is less than half of 2**64), this could be faster (less easy to parallelize though):
uint64_t h_one = hash[0];
uint64_t h_two = hash[1];
k_hash[0] = h_one % size;
for ( int i=1; i<k; ++i )
{
(uint64_t *) k_hash[i] = ( k_hash[i-1] + h_two ) % size;
}
Note there is a possibility that your compiler already came to this form, depending on which optimization flags you use.
Of course this only eliminated one multiplication. If you want to eliminate or reduce the modulo, I guess that based on h_two%size and h_1%size you can predetermine the steps where you have to explicitly call %size, something like this:
uint64_t h_one = hash[0]%size;
uint64_t h_two = hash[1]%size;
k_hash[0] = h_one;
step = (size-(h_one))/(h_two)-1;
for ( int i=1; i<k; ++i )
{
(uint64_t *) k_hash[i] = ( k_hash[i-1] + h_two );
if(i==step)
{
k_hash[i] %= size;
}
}
Note I'm not sure of the formula (didn't test it); it's more a general idea. This would greatly depend on how good your branch prediction is (and how big a performance hit a misprediction is). Also, it's only likely to help if step is big.
edit: or more simply (and probably with the same performance) -thanks to Mystical:
uint64_t h_one = hash[0]%size;
uint64_t h_two = hash[1]%size;
k_hash[0] = h_one;
for ( int i=1; i<k; ++i )
{
    k_hash[i] = ( k_hash[i-1] + h_two );
    // both terms are below size, so one conditional subtraction replaces the modulo
    if(k_hash[i] >= size)
    {
        k_hash[i] -= size;
    }
}
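Wrapped into a function, the conditional-subtraction idea could be sketched as follows. fill_k_hash is a hypothetical name, and the assert encodes the assumption that size is at most 2^63, so that the sum of two reduced values cannot wrap around 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Fill k_hash[i] = (h_one + i * h_two) mod size for 0 <= i < k,
   using one addition and at most one subtraction per element. */
void fill_k_hash(uint64_t *k_hash, int k,
                 uint64_t h_one, uint64_t h_two, uint64_t size)
{
    assert(size != 0 && size <= UINT64_MAX / 2 + 1);
    if (k <= 0)
        return;
    h_two %= size;
    k_hash[0] = h_one % size;
    for (int i = 1; i < k; ++i) {
        uint64_t next = k_hash[i - 1] + h_two;  /* below 2 * size, cannot wrap */
        if (next >= size)
            next -= size;
        k_hash[i] = next;
    }
}
```

For h_one = 7, h_two = 3, size = 5 and k = 5 this produces 2, 0, 3, 1, 4, the same values as (7 + 3 * i) % 5. When h_one + i * h_two would overflow 64 bits, the original expression wraps and the two versions can differ.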
If size is a power of two, then a bitwise AND with size - 1 computes "% size":
k_hash[i] = (h_one + i * h_two) & (size - 1);
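A small sketch of the mask trick, with a guard that checks the power-of-two precondition. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if n is a power of two (exactly one bit set), 0 otherwise. */
int is_power_of_two(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* x % size for a power-of-two size, computed with a single AND. */
uint64_t mod_pow2(uint64_t x, uint64_t size)
{
    assert(is_power_of_two(size));
    return x & (size - 1);
}
```

For example, mod_pow2(29, 8) is 5, the same as 29 % 8. The mask keeps only the low bits of the hash, so this variant relies on those bits being well mixed.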
