Could someone help me understand the SSE implementation of FAST corner detection in OpenCV? I understand the algorithm but not the implementation. Could somebody walk me through the code?
The code is long, so thank you in advance.
I am using OpenCV 2.4.11 and the code goes like this:
__m128i delta = _mm_set1_epi8(-128);
__m128i t = _mm_set1_epi8((char)threshold);
__m128i m0, m1;
__m128i v0 = _mm_loadu_si128((const __m128i*)ptr);
I think the following has something to do with threshold checking, but I can't understand the use of delta:
__m128i v1 = _mm_xor_si128(_mm_subs_epu8(v0, t), delta);
v0 = _mm_xor_si128(_mm_adds_epu8(v0, t), delta);
Now it checks the neighboring 4 pixels, but again, what is the use of delta?
__m128i x0 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[0])), delta);
__m128i x1 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[4])), delta);
__m128i x2 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[8])), delta);
__m128i x3 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[12])), delta);
m0 = _mm_and_si128(_mm_cmpgt_epi8(x0, v0), _mm_cmpgt_epi8(x1, v0));
m1 = _mm_and_si128(_mm_cmpgt_epi8(v1, x0), _mm_cmpgt_epi8(v1, x1));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x1, v0), _mm_cmpgt_epi8(x2, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x1), _mm_cmpgt_epi8(v1, x2)));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x2, v0), _mm_cmpgt_epi8(x3, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x2), _mm_cmpgt_epi8(v1, x3)));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x3, v0), _mm_cmpgt_epi8(x0, v0)));
m1 = _mm_or_si128(m1, _mm_and_si128(_mm_cmpgt_epi8(v1, x3), _mm_cmpgt_epi8(v1, x0)));
m0 = _mm_or_si128(m0, m1);
Here it checks the continuity of the neighboring pixels. (Right?)
int mask = _mm_movemask_epi8(m0);
if( mask == 0 )
continue;
This is another puzzle for me. Why shift 8 bytes to the left? I assume the mask tells me the location of the corner candidate, but why 8 bytes?
if( (mask & 255) == 0 )
{
j -= 8;
ptr -= 8;
continue;
}
I gave up at this point...
__m128i c0 = _mm_setzero_si128(), c1 = c0, max0 = c0, max1 = c0;
for( k = 0; k < N; k++ )
{
__m128i x = _mm_xor_si128(_mm_loadu_si128((const __m128i*)(ptr + pixel[k])), delta);
m0 = _mm_cmpgt_epi8(x, v0);
m1 = _mm_cmpgt_epi8(v1, x);
c0 = _mm_and_si128(_mm_sub_epi8(c0, m0), m0);
c1 = _mm_and_si128(_mm_sub_epi8(c1, m1), m1);
max0 = _mm_max_epu8(max0, c0);
max1 = _mm_max_epu8(max1, c1);
}
max0 = _mm_max_epu8(max0, max1);
int m = _mm_movemask_epi8(_mm_cmpgt_epi8(max0, K16));
for( k = 0; m > 0 && k < 16; k++, m >>= 1 )
if(m & 1)
{
cornerpos[ncorners++] = j+k;
if(nonmax_suppression)
curr[j+k] = (uchar)cornerScore<patternSize>(ptr+k, pixel, threshold);
}
As harold said, delta is used to make an unsigned comparison.
Let's describe this implementation step by step:
__m128i x0 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[0])), delta);
__m128i x1 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[4])), delta);
__m128i x2 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[8])), delta);
__m128i x3 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[12])), delta);
m0 = _mm_and_si128(_mm_cmpgt_epi8(x0, v0), _mm_cmpgt_epi8(x1, v0));
m1 = _mm_and_si128(_mm_cmpgt_epi8(v1, x0), _mm_cmpgt_epi8(v1, x1));
m0 = _mm_or_si128(m0, _mm_and_si128(_mm_cmpgt_epi8(x1, v0), _mm_cmpgt_epi8(x2, v0)));
......
Here it's not checking 4 neighboring pixels. It checks 4 points spaced around the circle, for example, like this:
Here they check that the "corner condition" is true for these 4 points, because if it is not true there cannot be enough contiguous circle pixels satisfying the "corner condition", so it can't be a corner pixel. If the mask is zero, it means that none of the pixels in the vector can be a corner, so they advance by 16 pixels.
int mask = _mm_movemask_epi8(m0);
if( mask == 0 )
continue;
If the mask is not zero, but the "corner condition" is not true for the first 8 pixels, they step back by 8 pixels so the remaining pixels are checked on the next iteration.
if( (mask & 255) == 0 )
{
j -= 8;
ptr -= 8;
continue;
}
And the final step. Here they count the number of neighboring pixels which are greater than x + threshold in the c0 counter, and those which are less than x - threshold in the c1 counter.
Here they generate the masks for those conditions:
__m128i x = _mm_xor_si128(_mm_loadu_si128((const __m128i*)(ptr + pixel[k])), delta);
m0 = _mm_cmpgt_epi8(x, v0);
m1 = _mm_cmpgt_epi8(v1, x);
Note that if the condition is true for an element of the vector, its value is set to 0xFF, which is -1 if we treat it as a signed char.
c0 = _mm_and_si128(_mm_sub_epi8(c0, m0), m0);
c1 = _mm_and_si128(_mm_sub_epi8(c1, m1), m1);
If an element of the mask is -1, it increments the c0 or c1 counter because of the subtraction (for example, c0 - (-1)). But if it is equal to zero, they reset the counter to zero (_mm_and_si128).
Then they need to store the maximum value of the counters:
max0 = _mm_max_epu8(max0, c0);
max1 = _mm_max_epu8(max1, c1);
So they store the maximum number of contiguous neighboring pixels which satisfy the "corner condition".
Here they determine which pixels are actually corners and which are not:
max0 = _mm_max_epu8(max0, max1);
int m = _mm_movemask_epi8(_mm_cmpgt_epi8(max0, K16));
for( k = 0; m > 0 && k < 16; k++, m >>= 1 )
if(m & 1)
{
cornerpos[ncorners++] = j+k;
if(nonmax_suppression)
curr[j+k] = (uchar)cornerScore<patternSize>(ptr+k, pixel, threshold);
}
I hope this helps. I'm sorry for my bad English.
delta is a mask in which only the sign bits are set. They use it because they want to compare for unsigned greater-than, but there is only a signed comparison.
Adding 128 (or subtracting it, because -128 == 128 modulo 256) and XORing with it do the same thing (if you're working with bytes), because
a + b == (a ^ b) + ((a & b) << 1)
and if b only has the top bit set, the ((a & b) << 1) term must be zero (a & b can have the top bit set, but it's shifted out).
Then as you can see in the diagram below, subtracting 128 "shifts" the entire range down in such a way that a signed comparison will give the same result as an unsigned comparison would have given on the original range.
|0 ... 127 ... 255| unsigned
|-128 ... 0 ... 127| signed
I don't know about the rest, I hope someone else can answer that.
Related
Consider that you want to calculate the low 128-bits of the result of multiplying a 64-bit and 128-bit unsigned number, and that the largest multiplication you have available is the C-like 64-bit multiplication which takes two 64-bit unsigned inputs and returns the low 64-bits of the result.
How many multiplications are needed?
Certainly you can do it with eight: break all the inputs up into 32-bit chunks and use your 64-bit multiplication to do the 4 * 2 = 8 required full-width 32*32->64 multiplications, but can one do better?
Of course the algorithm should do only a "reasonable" number of additions or other basic arithmetic on top of the multiplications (I'm not interested in solutions that re-invent multiplication as an addition loop and hence claim "zero" multiplications).
Four, but it starts to get a little tricky.
Let a and b be the numbers to be multiplied, with a0 and a1 being the low and high 32 bits of a, respectively, and b0, b1, b2, b3 being 32-bit groups of b, from low to high respectively.
The desired result is the remainder of (a0 + a1·2^32) · (b0 + b1·2^32 + b2·2^64 + b3·2^96) modulo 2^128.
We can rewrite that as (a0 + a1·2^32) · (b0 + b1·2^32) +
(a0 + a1·2^32) · (b2·2^64 + b3·2^96) modulo 2^128.
The remainder of the latter term modulo 2^128 can be computed as a single 64-bit by 64-bit multiplication (whose result is implicitly multiplied by 2^64).
Then the former term can be computed with three multiplications using a
carefully implemented Karatsuba step. The simple version would involve a 33-bit by 33-bit to 66-bit product which is not available, but there is a trickier version that avoids it:
z0 = a0 * b0
z2 = a1 * b1
z1 = abs(a0 - a1) * abs(b0 - b1) * sgn(a0 - a1) * sgn(b1 - b0) + z0 + z2
The last line contains only one multiplication; the other two pseudo-multiplications are just conditional negations. Absolute-difference and conditional-negate are annoying to implement in pure C, but it could be done.
Of course, without Karatsuba, 5 multiplies.
Karatsuba is wonderful, but these days a 64 x 64 multiply can be over in 3 clocks and a new one can be scheduled every clock. So the overhead of dealing with the signs and what not can be significantly greater than the saving of one multiply.
For a straightforward 64 x 64 multiply you need:
r0 = a0*b0
r1 = a0*b1
r2 = a1*b0
r3 = a1*b1
where you need to add r0 = r0 + (r1 << 32) + (r2 << 32)
and add r3 = r3 + (r1 >> 32) + (r2 >> 32) + carry
where the carry is the carry from the additions to r0, and the result is r3:r0.
typedef struct { uint64_t w0, w1 ; } uint64x2_t ;
uint64x2_t
mulu64x2(uint64_t x, uint64_t m)
{
uint64x2_t r ;
uint64_t r1, r2, rx, ry ;
uint32_t x1, x0 ;
uint32_t m1, m0 ;
x1 = (uint32_t)(x >> 32) ;
x0 = (uint32_t)x ;
m1 = (uint32_t)(m >> 32) ;
m0 = (uint32_t)m ;
r1 = (uint64_t)x1 * m0 ;
r2 = (uint64_t)x0 * m1 ;
r.w0 = (uint64_t)x0 * m0 ;
r.w1 = (uint64_t)x1 * m1 ;
rx = (uint32_t)r1 ;
rx = rx + (uint32_t)r2 ; // add the ls halves, collecting carry
ry = r.w0 >> 32 ; // pick up ms of r0
r.w0 += (rx << 32) ; // complete r0
rx += ry ; // complete addition, rx >> 32 == carry !
r.w1 += (r1 >> 32) + (r2 >> 32) + (rx >> 32) ;
return r ;
}
For Karatsuba, the suggested:
z1 = abs(a0 - a1) * abs(b0 - b1) * sgn(a0 - a1) * sgn(b1 - b0) + z0 + z2
is trickier than it looks... for a start, if z1 is 64 bits, then you need to somehow collect the carry which this addition can generate... and that is complicated by the signedness issues.
z0 = a0*b0
z1 = ax*bx -- ax = (a1 - a0), bx = (b0 - b1)
z2 = a1*b1
where you need to add r0 = z0 + (z1 << 32) + (z0 << 32) + (z2 << 32)
and add r1 = z2 + (z1 >> 32) + (z0 >> 32) + (z2 >> 32) + carry
where the carry is the carry from the additions to create r0, and the result is r1:r0,
and where you must take into account the signedness of ax, bx and z1.
uint64x2_t
mulu64x2_karatsuba(uint64_t a, uint64_t b)
{
uint64_t a0, a1, b0, b1 ;
uint64_t ax, bx, zx, zy ;
uint as, bs, xs ;
uint64_t z0, z2 ;
uint64x2_t r ;
a0 = (uint32_t)a ; a1 = a >> 32 ;
b0 = (uint32_t)b ; b1 = b >> 32 ;
z0 = a0 * b0 ;
z2 = a1 * b1 ;
ax = (uint64_t)(a1 - a0) ;
bx = (uint64_t)(b0 - b1) ;
as = (uint)(ax > a1) ; // sign of magic middle, a
bs = (uint)(bx > b0) ; // sign of magic middle, b
xs = (uint)(as ^ bs) ; // sign of magic middle, x = a * b
ax = (uint64_t)((ax ^ -(uint64_t)as) + as) ; // abs magic middle a
bx = (uint64_t)((bx ^ -(uint64_t)bs) + bs) ; // abs magic middle b
zx = (uint64_t)(((ax * bx) ^ -(uint64_t)xs) + xs) ;
xs = xs & (uint)(zx != 0) ; // discard sign if z1 == 0 !
zy = (uint32_t)zx ; // start ls half of z1
zy = zy + (uint32_t)z0 + (uint32_t)z2 ;
r.w0 = z0 + (zy << 32) ; // complete ls word of result.
zy = zy + (z0 >> 32) ; // complete carry
zx = (zx >> 32) - ((uint64_t)xs << 32) ; // start ms half of z1
r.w1 = z2 + zx + (z0 >> 32) + (z2 >> 32) + (zy >> 32) ;
return r ;
}
I did some very simple timings (using times(), running on Ryzen 7 1800X):
using gcc __int128................... ~780 'units'
using mulu64x2()..................... ~895
using mulu64x2_karatsuba()... ~1,095
...so, yes, you can save a multiply by using Karatsuba, but whether it's worth doing rather depends.
If A and B are of type uint8_t and I want the result C = A*B % N where N is 2^16, how do I do this if I can't use integers (so I can't declare N as an integer, only uint8_t) in the C language?
N.B.: A, B and C are stored in uint8_t arrays, so they are "expressed" as uint8_t but their values can be bigger.
In general there is no easy way to do this.
Firstly you need to implement the multiply with carry between A and B for each uint8_t block. See the answer here.
Division by 2^16 really means "disregard" the last 16 bits, i.e. "don't use" the last two uint8_t elements (as you use an array of them). Since you have the modulus operator, this means just the opposite: you only need to keep the last two uint8_ts.
Take the lowest two uint8 of A (say a0 and a1) and B (say b0 and b1):
split each uint8 in high and low part
a0h = a0 >> 4; // the same as a0h = a0/16;
a0l = a0 % 16; // the same as a0l = a0 & 0x0f;
a1h = a1 >> 4;
a1l = a1 % 16;
b0h = b0 >> 4;
b0l = b0 % 16;
b1h = b1 >> 4;
b1l = b1 % 16;
Multiply the lower parts first (x is a buffer var)
x = a0l * b0l;
The first part of the result is the last four bits of x, let's call it s0l
s0l = x % 16;
The top four bits of x are the carry.
c = x>>4;
Multiply the higher parts of the first uint8 and add the carry.
x = (a0h * b0h) + c;
The first part of the result is the last four bits of x, let's call it s0h. And we need to get carry again.
s0h = x % 16;
c = x>>4;
We can now combine the s0:
s0 = (s0h << 4) + s0l;
Do exactly the same for the s1 (but don't forget to add the carry!):
x = (a1l * b1l) + c;
s1l = x % 16;
c = x>>4;
x = (a1h * b1h) + c;
s1h = x % 16;
c = x>>4;
s1 = (s1h << 4) + s1l;
Your result at this point is c, s1 and s0 (you need the carry for the next multiplications, e.g. s2, s3, s4, ...). As your formula says % (2^16), you already have your result: s1 and s0. If you have to divide by something else, you should do something similar to the code above, but for division. In that case be careful to catch division by zero (for integers it is undefined behaviour, not NaN)!
You can put A, B, C and S in arrays and loop through the indexes to make the code cleaner.
Here's my effort. I took the liberty of using larger integers and pointers for looping through the arrays. The numbers are represented by arrays of uint8_t in big-endian order. All the intermediate results are kept in uint8_t variables. The code could be made more efficient if intermediate results could be stored in wider integer variables!
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
static void add_c(uint8_t *r, size_t len_r, uint8_t x)
{
uint8_t o;
while (len_r--) {
o = r[len_r];
r[len_r] += x;
if (o <= r[len_r])
break;
x = 1;
}
}
void multiply(uint8_t *res, size_t len_res,
const uint8_t *a, size_t len_a, const uint8_t *b, size_t len_b)
{
size_t ia, ib, ir;
for (ir = 0; ir < len_res; ir++)
res[ir] = 0;
for (ia = 0; ia < len_a && ia < len_res; ia++) {
uint8_t ah, al, t;
t = a[len_a - ia - 1];
ah = t >> 4;
al = t & 0xf;
for (ib = 0; ib < len_b && ia + ib < len_res; ib++) {
uint8_t bh, bl, x, o, c0, c1;
t = b[len_b - ib - 1];
bh = t >> 4;
bl = t & 0xf;
c0 = al * bl;
c1 = ah * bh;
o = c0;
t = al * bh;
x = (t & 0xf) << 4;
c0 += x;
x = (t >> 4);
c1 += x;
if (o > c0)
c1++;
o = c0;
t = ah * bl;
x = (t & 0xf) << 4;
c0 += x;
x = (t >> 4);
c1 += x;
if (o > c0)
c1++;
add_c(res, len_res - ia - ib, c0);
add_c(res, len_res - ia - ib - 1, c1);
}
}
}
int main(void)
{
uint8_t a[2] = { 0xee, 0xdd };
uint8_t b[2] = { 0xcc, 0xbb };
uint8_t r[4];
multiply(r, sizeof(r), a, sizeof(a), b, sizeof(b));
printf("0x%02X%02X * 0x%02X%02X = 0x%02X%02X%02X%02X\n",
a[0], a[1], b[0], b[1], r[0], r[1], r[2], r[3]);
return 0;
}
Output:
0xEEDD * 0xCCBB = 0xBF06976F
I am trying to implement a random-number generator with a Mersenne prime (2^31 - 1) as the modulus. The following working code was based on several related posts:
How do I extract specific 'n' bits of a 32-bit unsigned integer in C?
Fast multiplication and subtraction modulo a prime
Fast multiplication modulo 2^16 + 1
However,
It does not work with uint32_t hi, lo;, which means I do not understand the signed vs. unsigned aspect of the problem.
Based on #2 above, I was expecting the answer to be (hi + lo), which means I do not understand why the following statement is needed.
if (x1 > r)
x1 += r + 2;
Can someone please clarify the source of my confusion?
Can the code itself be improved?
Should the generator avoid 0 or 2^31 - 1 as a seed?
How would the code change for a prime (2^p - k)?
Original code
#include <stdio.h>
#include <inttypes.h>
// x1 = a*x0 (mod 2^31-1)
int32_t lgc_m(int32_t a, int32_t x)
{
printf("x %"PRId32"\n", x);
if (x == 2147483647){
printf("x1 %"PRId64"\n", (int64_t)0);
return (0);
}
uint64_t c, r = 1;
c = (uint64_t)a * (uint64_t)x;
if (c < 2147483647){
printf("x1 %"PRId64"\n", c);
return (c);
}
int32_t hi=0, lo=0;
int i, p = 31;//2^31-1
for (i = 1; i < p; ++i){
r |= 1 << i;
}
lo = (c & r) ;
hi = (c & ~r) >> p;
uint64_t x1 = (uint64_t ) (hi + lo);
// NOT SURE ABOUT THE NEXT STATEMENT
if (x1 > r)
x1 += r + 2;
printf("c %"PRId64"\n", c);
printf("r %"PRId64"\n", r);
printf("\tlo %"PRId32"\n", lo);
printf("\thi %"PRId32"\n", hi);
printf("x1 %"PRId64"\n", x1);
printf("\n" );
return((int32_t) x1);
}
int main(void)
{
int32_t r;
r = lgc_m(1583458089, 1);
r = lgc_m(1583458089, 2000000000);
r = lgc_m(1583458089, 2147483646);
r = lgc_m(1583458089, 2147483647);
return(0);
}
The following if statement
if (x1 > r)
x1 += r + 2;
should be written as
if (x1 > r)
x1 -= r;
Both results are the same modulo 2^32:
x1 + r + 2 = x1 + 2^31 - 1 + 2 = x1 + 2^31 + 1
x1 - r = x1 - (2^31 - 1) = x1 - 2^31 + 1
The two expressions differ by exactly 2^32. The first solution overflows an int32_t and assumes that the conversion from uint64_t to int32_t wraps modulo 2^32. While many C compilers handle the conversion this way, this is not mandated by the C standard. The actual result is implementation-defined.
The second solution avoids the overflow and works with both int32_t and uint32_t.
You can also use an integer constant for r:
uint64_t r = 0x7FFFFFFF; // 2^31 - 1
Or simply
uint64_t r = INT32_MAX;
EDIT: For primes of the form 2^p - k, you have to use masks with p bits and calculate the result with
uint32_t x1 = (k * hi + lo) % ((UINT32_C(1) << p) - k);
If k * hi + lo can overflow a uint32_t (that is, (k + 1) * (2^p - 1) >= 2^32), you have to use 64-bit arithmetic:
uint32_t x1 = ((uint64_t)a * x) % ((UINT32_C(1) << p) - k);
(Note the UINT32_C(1): a plain 1 << 31 would overflow a signed int.)
Depending on the platform, the latter might be faster anyway.
Sue provided this as a solution:
With some experimentation (new code at the bottom), I was able to use
uint32_t, which further suggests that I do not understand how the
signed integers work with bit operations.
The following code uses uint32_t for input as well as hi and lo.
#include <stdio.h>
#include <inttypes.h>
// x1 = a*x0 (mod 2^31-1)
uint32_t lgc_m(uint32_t a, uint32_t x)
{
printf("x %"PRId32"\n", x);
if (x == 2147483647){
printf("x1 %"PRId64"\n", (int64_t)0);
return (0);
}
uint64_t c, r = 1;
c = (uint64_t)a * (uint64_t)x;
if (c < 2147483647){
printf("x1 %"PRId64"\n", c);
return (c);
}
uint32_t hi=0, lo=0;
int i, p = 31;//2^31-1
for (i = 1; i < p; ++i){
r |= 1 << i;
}
hi = c >> p;
lo = (c & r) ;
uint64_t x1 = (uint64_t ) ((hi + lo) );
// NOT SURE ABOUT THE NEXT STATEMENT
if (x1 > r){
printf("x1 - r = %"PRId64"\n", x1- r);
x1 -= r;
}
printf("c %"PRId64"\n", c);
printf("r %"PRId64"\n", r);
printf("\tlo %"PRId32"\n", lo);
printf("\thi %"PRId32"\n", hi);
printf("x1 %"PRId64"\n", x1);
printf("\n" );
return((uint32_t) x1);
}
int main(void)
{
uint32_t r;
r = lgc_m(1583458089, 1583458089);
r = lgc_m(1583458089, 2147483645);
return(0);
}
The issue was my assumption that the reduction would be complete
after one pass. If (x > 2^31 - 1), then by definition the
reduction has not occurred and a second pass is necessary. Subtracting
2^31 - 1 in that case does the trick. In the second attempt
above, r = 2^31 - 1 and is therefore the modulus; x -= r achieves
the final reduction.
Perhaps someone with expertise in random numbers or modular reduction
could explain it better.
Cleaned function without printf()s.
uint32_t lgc_m(uint32_t a, uint32_t x){
uint64_t c, x1, m = 2147483647; //modulus: m = 2^31-1
if (x == m)
return (0);
c = (uint64_t)a * (uint64_t)x;
if (c < m)//no reduction necessary
return (c);
uint32_t hi, lo, p = 31;//2^p-1, p = 31
hi = c >> p;
lo = c & m;
x1 = (uint64_t)(hi + lo);
if (x1 > m){//one more pass needed
//this block can be replaced by x1 -= m;
hi = x1 >> p;
lo = (x1 & m);
x1 = (uint64_t)(hi + lo);
}
return((uint32_t) x1);
}
Okay, so I'm trying to write a function to reverse a long (64 bits) in C, and I'm getting some weird results with my bitshifting.
long reverse_long(long x) {
int i;
for(i=0; i<4; i++) {
long a = 0x00000000000000FF<<(8*i);
long b = 0xFF00000000000000>>(8*i);
a = x&a;
b = x&b;
a=a<<8*(7-2*i);
b=b>>8*(7-2*i);
x=x&(~(0x00000000000000FF<<(8*i)));
x=x&(~(0xFF00000000000000>>(8*i)));
x=x|a;
x=x|b;
}
}
On line 4 (long a = 0x00000000000000FF<<(8*i)), I'm shifting a byte of ones to the left by 8 bits for each iteration of the loop, which works fine for the first, second, and third iterations, but on the fourth iteration I get something like 0xFFFFFFFFFF000000, when I should be getting 0x00000000FF000000.
Line 5 (long b = 0xFF00000000000000>>(8*i)) works just fine though, and gives me the value 0x000000FF00000000.
Can anyone tell me what's going on here?
To understand potential problems in your code you need to understand the following things:
The type and value of integer literals
Rules about left-shifting signed values
Rules about right-shifting signed values
Rules about ~ on signed values
Rules about shifting a value by the width of its type or more
The behaviour of out-of-range integral conversions
That's quite a lot of things to remember. To avoid having to deal with all sorts of weird issues (for example, long a = 0x00000000000000FF<<(8*i); causes undefined behaviour when i == 3), I would recommend the following:
Only use unsigned variables and constants (including x)
Use constants which are the correct width for the type
Further, your code makes the assumption that long is 64-bit. This is not always true. It would be better to do one of the following two things:
Have your code work for unsigned long, whatever size unsigned long might be
Use uint64_t instead of long
To cut a long story short, this is how your code should look if we just fix errors relating to the points I listed above (and do not change the algorithm):
uint64_t reverse_long(uint64_t x)
{
int i;
for(i=0; i<4; i++)
{
uint64_t a = 0xFFull << (8*i);
uint64_t b = 0xFF00000000000000ull >> (8*i);
a = x&a;
b = x&b;
a=a<<8*(7-2*i);
b=b>>8*(7-2*i);
x=x&(~(0xFFull<<(8*i)));
x=x&(~(0xFF00000000000000ull>>(8*i)));
x=x|a;
x=x|b;
}
return x; // don't forget this
}
note: I have used the ull suffix to create 64-bit literals. Actually this only guarantees at least 64 bits, but since everything is unsigned here it makes no difference; excess bits will just get truncated. To be very precise, write (uint64_t)0xFF instead of 0xFFull, etc.
You've received excellent advice on where you code went awry, but I thought you might like to see an alternate approach to reversing that might be a bit simpler.
uint64_t reverse_long(uint64_t n) {
uint8_t* a = (uint8_t*)&n;
uint8_t* b = a + 7;
while(a < b) {
uint8_t t = *b;
*b-- = *a;
*a++ = t;
}
return n;
}
a) Regarding your error:
What you are doing there:
long a = 0x00000000000000FF<<(8*i);
Creates the signed int constant 0xFF;
Shifts it left by i bytes.
When it is shifted by 3 bytes, the constant becomes 0xFF000000;
When that is assigned to a signed long, sign extension is performed:
0xFF000000 -> 0xFFFFFFFFFF000000;
b) Regarding your code:
There is a simpler way to write your function, for example:
unsigned long reverse_long(unsigned long x) {
unsigned long rc = 0;
int i = 8;
do {
rc = (rc << 8) | (unsigned char)x;
x >>= 8;
} while(--i);
return rc;
}
The right shifting of signed longs is problematic when they're negative. This minor variant on your code, which is only safe for 64-bit machines (where sizeof(long) == 8), ensures that the constants are long and the intermediate variables a and b are unsigned long to avoid those problems. The code contains quite a lot of diagnostics.
#include <stdio.h>
long reverse_long(long x);
long reverse_long(long x)
{
int i;
for (i = 0; i < 4; i++)
{
printf("x0 0x%.16lX\n", x);
unsigned long a = 0x00000000000000FFL << (8 * i);
unsigned long b = 0xFF00000000000000L >> (8 * i);
a &= x;
b &= x;
printf("a0 0x%.16lX; b0 0x%.16lX\n", a, b);
a <<= 8 * (7 - 2 * i);
b >>= 8 * (7 - 2 * i);
printf("a1 0x%.16lX; b1 0x%.16lX\n", a, b);
x &= (~(0x00000000000000FFL << (8 * i)));
x &= (~(0xFF00000000000000L >> (8 * i)));
printf("x1 0x%.16lX\n", x);
x |= a | b;
printf("x2 0x%.16lX\n", x);
}
return x;
}
int main(void)
{
long x = 0xFEDCBA9876543210L;
printf("0x%.16lX <=> 0x%.16lX\n", x, reverse_long(x));
return 0;
}
Output:
x0 0xFEDCBA9876543210
a0 0x0000000000000010; b0 0xFE00000000000000
a1 0x1000000000000000; b1 0x00000000000000FE
x1 0x00DCBA9876543200
x2 0x10DCBA98765432FE
x0 0x10DCBA98765432FE
a0 0x0000000000003200; b0 0x00DC000000000000
a1 0x0032000000000000; b1 0x000000000000DC00
x1 0x1000BA98765400FE
x2 0x1032BA987654DCFE
x0 0x1032BA987654DCFE
a0 0x0000000000540000; b0 0x0000BA0000000000
a1 0x0000540000000000; b1 0x0000000000BA0000
x1 0x103200987600DCFE
x2 0x1032549876BADCFE
x0 0x1032549876BADCFE
a0 0x0000000076000000; b0 0x0000009800000000
a1 0x0000007600000000; b1 0x0000000098000000
x1 0x1032540000BADCFE
x2 0x1032547698BADCFE
0xFEDCBA9876543210 <=> 0x1032547698BADCFE
Alternative Implementations
This is a variant of the program above, with the reverse_long() changed to reverse_uint64_v1() and using uint64_t instead of long and unsigned long. The printing is upgraded using PRIX64 format, but also commented out since it is being used in a performance test. The reverse_uint64_v2() function does fewer operations per cycle, though it does more cycles (8 instead of 4). It copies the low order byte of what's left of the input value into the low order byte of the current output value after it's been shifted left 8 places. The reverse_uint64_v3() function does a loop-unrolling of reverse_uint64_v2(), and micro-optimizes by avoiding one assignment to b and one extra shift at the end.
#include <inttypes.h>
#include <stdio.h>
#include "timer.h"
uint64_t reverse_uint64_v1(uint64_t x);
uint64_t reverse_uint64_v2(uint64_t x);
uint64_t reverse_uint64_v3(uint64_t x);
uint64_t reverse_uint64_v1(uint64_t x)
{
for (int i = 0; i < 4; i++)
{
//printf("x0 0x%.16" PRIX64 "\n", x);
uint64_t a = UINT64_C(0x00000000000000FF) << (8 * i);
uint64_t b = UINT64_C(0xFF00000000000000) >> (8 * i);
a &= x;
b &= x;
//printf("a0 0x%.16" PRIX64 "; b0 0x%.16" PRIX64 "\n", a, b);
a <<= 8 * (7 - 2 * i);
b >>= 8 * (7 - 2 * i);
//printf("a1 0x%.16" PRIX64 "; b1 0x%.16" PRIX64 "\n", a, b);
x &= ~(UINT64_C(0x00000000000000FF) << (8 * i));
x &= ~(UINT64_C(0xFF00000000000000) >> (8 * i));
//printf("x1 0x%.16" PRIX64 "\n", x);
x |= a | b;
//printf("x2 0x%.16" PRIX64 "\n", x);
}
return x;
}
uint64_t reverse_uint64_v2(uint64_t x)
{
uint64_t r = 0;
for (size_t i = 0; i < sizeof(uint64_t); i++)
{
uint64_t b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
}
return r;
}
uint64_t reverse_uint64_v3(uint64_t x)
{
uint64_t b;
uint64_t r;
r = x & 0xFF; // Optimization 1
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
// x >>= 8; // Optimization 2
return r;
}
static void timing_test(uint64_t (*reverse)(uint64_t))
{
Clock clk;
clk_init(&clk);
uint64_t ur = 0;
uint64_t lb = UINT64_C(0x0123456789ABCDEF);
uint64_t ub = UINT64_C(0xFEDCBA9876543210);
uint64_t inc = UINT64_C(0x287654321);
uint64_t cnt = 0;
clk_start(&clk);
for (uint64_t u = lb; u < ub; u += inc)
{
ur += (*reverse)(u);
cnt++;
}
clk_stop(&clk);
char buffer[32];
printf("Sum = 0x%.16" PRIX64 " Count = %" PRId64 " Time = %s\n", ur, cnt,
clk_elapsed_us(&clk, buffer, sizeof(buffer)));
}
int main(void)
{
uint64_t u = UINT64_C(0xFEDCBA9876543210);
printf("0x%.16" PRIX64 " <=> 0x%.16" PRIX64 "\n", u, reverse_uint64_v1(u));
printf("0x%.16" PRIX64 " <=> 0x%.16" PRIX64 "\n", u, reverse_uint64_v2(u));
printf("0x%.16" PRIX64 " <=> 0x%.16" PRIX64 "\n", u, reverse_uint64_v3(u));
timing_test(reverse_uint64_v1);
timing_test(reverse_uint64_v2);
timing_test(reverse_uint64_v3);
timing_test(reverse_uint64_v1);
timing_test(reverse_uint64_v2);
timing_test(reverse_uint64_v3);
return 0;
}
Example output:
0xFEDCBA9876543210 <=> 0x1032547698BADCFE
0xFEDCBA9876543210 <=> 0x1032547698BADCFE
0xFEDCBA9876543210 <=> 0x1032547698BADCFE
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.543540
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.822616
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 7.303825
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.943668
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 7.314660
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 7.295862
The sum and count have two purposes. First, they provide a cross-check that the results from the three functions are the same. Second, they ensure that the compiler doesn't do anything like optimize the whole loop out of business.
As you can see, there is not a lot of difference between the v2 and v3 timings, but the v1 code is quite a bit slower than the v2 or v3 code. For clarity, then, I'd use the v2 code.
For comparison, I also added a 'do nothing' function:
uint64_t reverse_uint64_v4(uint64_t x)
{
return x;
}
Clearly, the sum from this is different, but the count is the same, so it measures the overhead of the loop control and counting. The times I got on two runs were:
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.965360
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 7.197267
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 7.454553
Sum = 0x09EBA33CFF9869C2 Count = 1683264863 Time = 3.607310
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.381292
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.804442
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.797625
Sum = 0x09EBA33CFF9869C2 Count = 1683264863 Time = 3.541233
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.438374
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.805865
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.797086
Sum = 0x09EBA33CFF9869C2 Count = 1683264863 Time = 3.532735
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.426701
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.824182
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.834344
Sum = 0x09EBA33CFF9869C2 Count = 1683264863 Time = 3.510904
Clearly, about half the time in the test function is in the loop and function call overhead.
I'm looking for fast code for 64-bit (unsigned) cube roots. (I'm using C and compiling with gcc, but I imagine most of the work required will be language- and compiler-agnostic.) I will denote by ulong a 64-bit unsigned integer.
Given an input n I require the (integral) return value r to be such that
r * r * r <= n && n < (r + 1) * (r + 1) * (r + 1)
That is, I want the cube root of n, rounded down. Basic code like
return (ulong)pow(n, 1.0/3);
is incorrect because of rounding toward the end of the range. Unsophisticated code like
ulong
cuberoot(ulong n)
{
ulong ret = pow(n + 0.5, 1.0/3);
if (n < 100000000000001ULL)
return ret;
if (n >= 18446724184312856125ULL)
return 2642245ULL;
if (ret * ret * ret > n) {
ret--;
while (ret * ret * ret > n)
ret--;
return ret;
}
while ((ret + 1) * (ret + 1) * (ret + 1) <= n)
ret++;
return ret;
}
gives the correct result, but is slower than it needs to be.
This code is for a math library and it will be called many times from various functions. Speed is important, but you can't count on a warm cache (so suggestions like a 2,642,245-entry binary search are right out).
For comparison, here is code that correctly calculates the integer square root.
ulong squareroot(ulong a) {
ulong x = (ulong)sqrt((double)a);
if (x > 0xFFFFFFFF || x*x > a)
x--;
return x;
}
The book "Hacker's Delight" has algorithms for this and many other problems. The code is online here. EDIT: That code doesn't work properly with 64-bit ints, and the instructions in the book on how to fix it for 64-bit are somewhat confusing. A proper 64-bit implementation (including test case) is online here.
I doubt that your squareroot function works "correctly" - it should be ulong a for the argument, not n :) (but the same approach would work using cbrt instead of sqrt, although not all C math libraries have cube root functions).
I've adapted the algorithm presented in section 1.5.2 (the kth root) of Modern Computer Arithmetic (Brent and Zimmermann). For the case k == 3, and given a 'relatively' accurate over-estimate for the initial guess, this algorithm seems to outperform the 'Hacker's Delight' code above.
Not only that, but MCA as a text provides theoretical background as well as a proof of correctness and terminating criteria.
Provided that we can produce a 'relatively' good initial over-estimate, I haven't been able to find a case that exceeds (7) iterations. (Is this effectively related to 64-bit values having 2^6 bits?) Either way, it's an improvement over the (21) iterations of the linear O(b) convergence in the Hacker's Delight code, despite that code having a loop body that is evidently much faster.
The initial estimate I've used is based on a 'rounding up' of the number of significant bits in the value (x). Given (b) significant bits in (x), we can say: 2^(b - 1) <= x < 2^b. I state without proof (though it should be relatively easy to demonstrate) that: 2^ceil(b / 3) > x^(1/3)
static inline uint32_t u64_cbrt (uint64_t x)
{
    uint64_t r0 = 1, r1;

    /* IEEE-754 cbrt *may* not be exact. */
    if (x == 0) /* cbrt(0) : */
        return (0);

    int b = (64) - __builtin_clzll(x);
    r0 <<= (b + 2) / 3; /* ceil(b / 3) */

    do /* quadratic convergence: */
    {
        r1 = r0;
        r0 = (2 * r1 + x / (r1 * r1)) / 3;
    }
    while (r0 < r1);

    return ((uint32_t) r1); /* floor(cbrt(x)); */
}
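To sanity-check the floor property, a standalone harness can probe cube boundaries. The function body is copied from above so this compiles on its own (boundary_ok is my name, and __builtin_clzll keeps it GCC/Clang-only):

```c
#include <stdint.h>

/* u64_cbrt from the answer above, copied so this check stands alone. */
static inline uint32_t u64_cbrt(uint64_t x)
{
    uint64_t r0 = 1, r1;
    if (x == 0)
        return 0;
    int b = 64 - __builtin_clzll(x);
    r0 <<= (b + 2) / 3;                     /* initial over-estimate 2^ceil(b/3) */
    do {
        r1 = r0;
        r0 = (2 * r1 + x / (r1 * r1)) / 3;  /* integer Newton step for r^3 = x */
    } while (r0 < r1);
    return (uint32_t)r1;
}

/* At a cube boundary r^3, the floor root must drop by exactly 1 at r^3 - 1. */
static int boundary_ok(uint64_t r)
{
    uint64_t c = r * r * r;
    return u64_cbrt(c) == r && u64_cbrt(c - 1) == r - 1;
}
```

Boundaries are where naive floating-point approaches go wrong, so they make the sharpest test cases.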
A cbrt call probably isn't all that useful - unlike the sqrt call, which can be efficiently implemented on modern hardware. That said, I've seen gains for sets of values under 2^53 (exactly representable in IEEE-754 doubles), which surprised me.
The only downside is the division by: (r * r) - this can be slow, as the latency of integer division continues to fall behind other advances in ALUs. The division by a constant: (3) is handled by reciprocal methods on any modern optimising compiler.
It's interesting that Intel's 'Icelake' microarchitecture will significantly improve integer division - an operation that seems to have been neglected for a long time. I simply won't trust the 'Hacker's Delight' answer until I can find a sound theoretical basis for it. And then I have to work out which variant is the 'correct' answer.
You could try a Newton's step to fix your rounding errors:
ulong r = (ulong)pow(n, 1.0/3);
if (r == 0)
    return r; /* avoid divide by 0 later on */
ulong r3 = r*r*r;
ulong slope = 3*r*r;
ulong r1 = r + 1;
ulong r13 = r1*r1*r1;
/* making sure to handle unsigned arithmetic correctly */
if (n >= r13) r += (n - r3)/slope;
if (n < r3)  r -= (r3 - n)/slope;
A single Newton step ought to be enough, but you may have off-by-one (or possibly more?) errors. You can check and fix those with a final check-and-increment step, as in your original question:
while(r*r*r > n) --r;
while((r+1)*(r+1)*(r+1) <= n) ++r;
or some such.
(I admit I'm lazy; the right way to do it is to carefully check to determine which (if any) of the check&increment things is actually necessary...)
If pow is too expensive, you can use a count-leading-zeros instruction to get an approximation to the result, then use a lookup table, then some Newton steps to finish it.
int k = __builtin_clzll(n); // # of leading zeros (often a single assembly insn; the ll variant, since n is 64-bit)
int b = 64 - k;             // # of bits in n
int top8 = n >> (b - 8);    // top 8 bits of n (top bit is always 1)
int approx = table[b][top8 & 0x7f];
Given b and top8, you can use a lookup table (in my code, 8K entries) to find a good approximation to cuberoot(n). Use some Newton steps (see comingstorm's answer) to finish it.
// On my pc: Math.Sqrt 35 ns, cbrt64 <70ns, cbrt32 <25 ns, (cbrt12 < 10ns)
// cbrt64(ulong x) is a C# version of:
// http://www.hackersdelight.org/hdcodetxt/acbrt.c.txt (acbrt1)
// cbrt32(uint x) is a C# version of:
// http://www.hackersdelight.org/hdcodetxt/icbrt.c.txt (icbrt1)
// Union in C#:
// http://www.hanselman.com/blog/UnionsOrAnEquivalentInCSairamasTipOfTheDay.aspx
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit)]
public struct fu_32 // float <==> uint
{
    [FieldOffset(0)]
    public float f;
    [FieldOffset(0)]
    public uint u;
}
private static uint cbrt64(ulong x)
{
    if (x >= 18446724184312856125) return 2642245;
    float fx = (float)x;
    fu_32 fu32 = new fu_32();
    fu32.f = fx;
    uint uy = fu32.u / 4;
    uy += uy / 4;
    uy += uy / 16;
    uy += uy / 256;
    uy += 0x2a5137a0;
    fu32.u = uy;
    float fy = fu32.f;
    fy = 0.33333333f * (fx / (fy * fy) + 2.0f * fy);
    int y0 = (int)
        (0.33333333f * (fx / (fy * fy) + 2.0f * fy));
    uint y1 = (uint)y0;
    ulong y2, y3;
    if (y1 >= 2642245)
    {
        y1 = 2642245;
        y2 = 6981458640025;
        y3 = 18446724184312856125;
    }
    else
    {
        y2 = (ulong)y1 * y1;
        y3 = y2 * y1;
    }
    if (y3 > x)
    {
        y1 -= 1;
        y2 -= 2 * y1 + 1;
        y3 -= 3 * y2 + 3 * y1 + 1;
        while (y3 > x)
        {
            y1 -= 1;
            y2 -= 2 * y1 + 1;
            y3 -= 3 * y2 + 3 * y1 + 1;
        }
        return y1;
    }
    do
    {
        y3 += 3 * y2 + 3 * y1 + 1;
        y2 += 2 * y1 + 1;
        y1 += 1;
    }
    while (y3 <= x);
    return y1 - 1;
}
private static uint cbrt32(uint x)
{
    uint y = 0, z = 0, b = 0;
    int s = x < 1u << 24 ? x < 1u << 12 ? x < 1u << 06 ? x < 1u << 03 ? 00 : 03 :
                                          x < 1u << 09 ? 06 : 09 :
                           x < 1u << 18 ? x < 1u << 15 ? 12 : 15 :
                                          x < 1u << 21 ? 18 : 21 :
            x >= 1u << 30 ? 30 : x < 1u << 27 ? 24 : 27;
    do
    {
        y *= 2;
        z *= 4;
        b = 3 * y + 3 * z + 1 << s;
        if (x >= b)
        {
            x -= b;
            z += 2 * y + 1;
            y += 1;
        }
        s -= 3;
    }
    while (s >= 0);
    return y;
}
private static uint cbrt12(uint x) // x < ~255
{
    uint y = 0, a = 0, b = 1, c = 0;
    while (a < x)
    {
        y++;
        b += c;
        a += b;
        c += 6;
    }
    if (a != x) y--;
    return y;
}
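The cbrt12 trick deserves a note: it marches through consecutive cubes by finite differences. The gap between cubes is (y+1)^3 - y^3 = 3y^2 + 3y + 1, and that gap itself grows by 6y per step, so each candidate costs three additions and no multiplies. The same routine in C, to match the rest of the thread:

```c
#include <stdint.h>

/* Cube root by marching through consecutive cubes with finite
   differences; only sensible for small x, since it takes cbrt(x)
   iterations to get there. */
static uint32_t cbrt_small(uint32_t x)
{
    uint32_t y = 0, a = 0, b = 1, c = 0;
    while (a < x) {
        y++;
        b += c;     /* b = y^3 - (y-1)^3, updated by the second difference */
        a += b;     /* a = y^3 */
        c += 6;     /* c = 6y after this iteration */
    }
    if (a != x)     /* overshot (a > x): step back to the floor */
        y--;
    return y;
}
```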
Starting from the code within the GitHub gist from the answer of Fabian Giesen, I have arrived at the following, faster implementation:
#include <stdint.h>

static inline uint64_t icbrt(uint64_t x) {
    uint64_t b, y, bits = 3*21;
    int s;
    for (s = bits - 3; s >= 0; s -= 3) {
        if ((x >> s) == 0)
            continue;
        x -= (uint64_t)1 << s;  /* plain 1 << s overflows int for s >= 31 */
        y = 1;
        for (s = s - 3; s >= 0; s -= 3) {
            y += y;
            b = 1 + 3*y*(y + 1);
            if ((x >> s) >= b) {
                x -= b << s;
                y += 1;
            }
        }
        return y;
    }
    return 0;
}
While the above is still somewhat slower than methods relying on the GNU-specific __builtin_clzll, it does not rely on compiler specifics and is thus completely portable.
The bits constant
Lowering the constant bits leads to faster computation, but the highest number x for which the function gives correct results is (1 << bits) - 1. Also, bits must be a multiple of 3 and be at most 64, meaning that its maximum value is really 3*21 == 63. With bits = 3*21, icbrt() thus works for input x <= 9223372036854775807. If we know that a program is working with limited x, say x < 1000000, then we can speed up the cube root computation by setting bits = 3*7, since (1 << 3*7) - 1 = 2097151 >= 1000000.
64-bit vs. 32-bit integers
Though the above is written for 64-bit integers, the logic is the same for 32-bit:
#include <stdint.h>
static inline uint32_t icbrt(uint32_t x) {
    uint32_t b, y, bits = 3*7; /* or whatever is appropriate */
    int s;
    for (s = bits - 3; s >= 0; s -= 3) {
        if ((x >> s) == 0)
            continue;
        x -= 1 << s;
        y = 1;
        for (s = s - 3; s >= 0; s -= 3) {
            y += y;
            b = 1 + 3*y*(y + 1);
            if ((x >> s) >= b) {
                x -= b << s;
                y += 1;
            }
        }
        return y;
    }
    return 0;
}
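As a quick check of the bits limit described above, here is a standalone copy with bits fixed at 3*7 (renamed icbrt7, my name), which should be exact for all x <= (1 << 21) - 1 = 2097151:

```c
#include <stdint.h>

/* The 32-bit icbrt() above with bits = 3*7 baked in. */
static uint32_t icbrt7(uint32_t x)
{
    uint32_t b, y;
    int s;
    for (s = 3*7 - 3; s >= 0; s -= 3) {
        if ((x >> s) == 0)      /* skip leading all-zero octal digits */
            continue;
        x -= 1u << s;
        y = 1;
        for (s = s - 3; s >= 0; s -= 3) {
            y += y;
            b = 1 + 3*y*(y + 1);
            if ((x >> s) >= b) {
                x -= b << s;
                y += 1;
            }
        }
        return y;
    }
    return 0;
}
```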
I would research how to do it by hand, and then translate that into a computer algorithm, working in base 2 rather than base 10.
We end up with an algorithm something like (pseudocode):
Find the largest n such that (1 << 3n) <= input.
result = 1 << n.
For i in (n-1)..0:
    if ((result | 1 << i)**3) <= input:
        result |= 1 << i.
We can optimize the calculation of (result | 1 << i)**3 by observing that the bitwise-or is equivalent to addition here (bit i is not yet set in result), expanding to result**3 + 3 * b * result**2 + 3 * b**2 * result + b**3 with b = 1 << i, caching the values of result**3 and result**2 between iterations, and using shifts instead of multiplication.
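The pseudocode above translates directly into C. A minimal sketch (the name and the 2642245 clamp are mine; it recomputes the candidate cube each step rather than caching result**3 and result**2 as the optimization suggests):

```c
#include <stdint.h>

/* Bit-by-bit restoring cube root, per the hand method above.
   floor(cbrt(2^64 - 1)) = 2642245 < 2^22, so 22 result bits suffice;
   candidates above 2642245 are skipped, since their cube would
   overflow 64 bits and so cannot be <= n anyway. */
static uint64_t cbrt_bitwise(uint64_t n)
{
    uint64_t r = 0;
    for (int i = 21; i >= 0; i--) {
        uint64_t c = r | ((uint64_t)1 << i);  /* try setting bit i */
        if (c <= 2642245 && c * c * c <= n)
            r = c;
    }
    return r;
}
```

With the incremental-cube optimization the multiply per candidate disappears, at the cost of a few extra additions and cached values.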
You can try to adapt this C algorithm:
#include <limits.h>

// return a number that, when multiplied by itself twice, makes N.
unsigned cube_root(unsigned n) {
    unsigned a = 0, b;
    for (int c = sizeof(unsigned) * CHAR_BIT / 3 * 3; c >= 0; c -= 3) {
        a <<= 1;
        b = a + (a << 1), b = b * a + b + 1;   /* b = 3*a*a + 3*a + 1 */
        if (n >> c >= b)
            n -= b << c, ++a;
    }
    return a;
}
There is also:
// return the number that was multiplied by itself to reach N.
unsigned square_root(const unsigned num) {
    unsigned a, b, c, d;
    for (b = a = num, c = 1; a >>= 1; ++c);  /* c = bit length of num */
    for (c = 1 << (c & -2); c; c >>= 2) {
        d = a + c;
        a >>= 1;
        if (b >= d)
            b -= d, a += c;
    }
    return a;
}
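As a sanity check, cube_root() above (copied here, lightly reformatted, so the snippet stands alone) can be probed at the extremes of the 32-bit range; floor(cbrt(2^32 - 1)) is 1625:

```c
#include <limits.h>

/* cube_root() from the answer above: hardware-style restoring cube root,
   processing the result one bit (3 input bits) per iteration. */
static unsigned cube_root(unsigned n)
{
    unsigned a = 0, b;
    for (int c = sizeof(unsigned) * CHAR_BIT / 3 * 3; c >= 0; c -= 3) {
        a <<= 1;
        b = a + (a << 1), b = b * a + b + 1;   /* b = 3*a*a + 3*a + 1 */
        if (n >> c >= b)
            n -= b << c, ++a;
    }
    return a;
}
```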
Source