I'm trying to write a get_random_4digit function that generates a 4-digit number with non-repeating digits in the range 1-9, using only ints, if, while and functions (so no arrays etc.).
This is the code I have, but it is not working as intended. Could anyone point me in the right direction?
int get_random_4digit() {
int d1 = rand() % 9 + 1;
int d2 = rand() % 9 + 1;
while (true) {
if (d1 != d2) {
int d3 = rand() % 9 + 1;
if (d3 != d1 || d3 != d2) {
int d4 = rand() % 9 + 1;
if (d4 != d1 || d4 != d2 || d4 != d3) {
random_4digit = (d1 * 1000) + (d2 * 100) + (d3 * 10) + d4;
break;
}
}
}
}
printf("Random 4digit = %d\n", random_4digit);
}
A KISS-approach could be this:
int getRandom4Digits() {
uint16_t acc = 0;
uint16_t used = 0;
for (int i = 0; i < 4; i++) {
int idx;
do {
idx = rand() % 9; // Not equidistributed but never mind...
} while (used & (1 << idx));
acc = acc * 10 + (idx + 1);
used |= (1 << idx);
}
return acc;
}
This looks terribly dumb at first, but a quick analysis shows that it really isn't so bad: the expected number of calls to rand() is about 4.9.
The expected number of inner loop steps [and corresponding calls to rand(), if we assume rand() % 9 to be i.i.d.] will be:
9/9 + 9/8 + 9/7 + 9/6 ~ 4.9107.
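The sum above can be reproduced with a few lines of C. This is only a sketch of the arithmetic; expected_rand_calls is a hypothetical helper name, and it assumes each draw is uniform over the pool and independent of the others:

```c
/* Expected number of rand() calls needed to draw `picks` distinct values
   from a pool of `pool` values when repeats are simply retried.
   Hypothetical helper, assuming uniform and independent draws. */
double expected_rand_calls(int pool, int picks) {
    double total = 0.0;
    for (int k = 0; k < picks; k++) {
        /* with k values already used, a draw succeeds with probability
           (pool - k) / pool, so it takes pool / (pool - k) tries on average */
        total += (double)pool / (pool - k);
    }
    return total;
}
```

Calling expected_rand_calls(9, 4) adds 9/9 + 9/8 + 9/7 + 9/6, which is roughly 4.91.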
There are 9 possibilities for the first digit, 8 possibilities for the second digit, 7 possibilities for the third digit and 6 possibilities for the last digit. This works out to "9*8*7*6 = 3024 permutations".
Start by getting a random number from 0 to 3023. Let's call that P. To do this without causing a biased distribution use something like do { P = rand() & 0xFFF; } while(P >= 3024);.
Note: If you don't care about uniform distribution you could just do P = rand() % 3024;. In this case lower values of P will be more likely because RAND_MAX doesn't divide by 3024 nicely.
The first digit has 9 possibilities, so do d1 = P % 9 + 1; P = P / 9;.
The second digit has 8 possibilities, so do d2 = P % 8 + 1; P = P / 8;.
The third digit has 7 possibilities, so do d3 = P % 7 + 1; P = P / 7;.
For the last digit you can just do d4 = P + 1; because we know P can't be too high.
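Next, convert each "possibility" into a digit. For d1 you do nothing. For d2 you need to increase it if it's greater than or equal to d1, like if(d2 >= d1) d2++;. Do the same for d3 and d4, but compare against the previously chosen digits in ascending order (smallest first); otherwise an increment can land on a digit that was already checked. For example, with d1 = 5 and d2 = 3, a d3 of 4 that is compared first against d1 and then against d2 ends up as 5 again.
The final code will be something like:
int get_random_4digit() {
    int P, d1, d2, d3, d4, lo, mid, hi, t;
    do {
        P = rand() & 0xFFF;
    } while(P >= 3024);
    d1 = P % 9 + 1; P = P / 9;
    d2 = P % 8 + 1; P = P / 8;
    d3 = P % 7 + 1; P = P / 7;
    d4 = P + 1;
    if(d2 >= d1) d2++;
    /* sort d1 and d2 so that lo < hi */
    lo = d1; hi = d2;
    if(lo > hi) { t = lo; lo = hi; hi = t; }
    if(d3 >= lo) d3++;
    if(d3 >= hi) d3++;
    /* insert d3 so that lo < mid < hi */
    mid = d3;
    if(mid < lo) { t = mid; mid = lo; lo = t; }
    if(mid > hi) { t = mid; mid = hi; hi = t; }
    if(d4 >= lo) d4++;
    if(d4 >= mid) d4++;
    if(d4 >= hi) d4++;
    return d1*1000 + d2*100 + d3*10 + d4;
}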
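Because P only takes 3024 values, a decode step like this can be checked exhaustively. The sketch below is a test harness rather than part of the answer: decode_permutation, has_distinct_nonzero_digits and count_valid_outputs are hypothetical names, the decoder compares against the earlier digits in ascending order, and the harness uses a small array for bookkeeping, which the question's no-arrays rule would not allow in the solution itself.

```c
/* Hypothetical decoder: maps P in [0, 3024) to a 4-digit number. */
int decode_permutation(int P) {
    int d1, d2, d3, d4, lo, mid, hi, t;
    d1 = P % 9 + 1; P /= 9;
    d2 = P % 8 + 1; P /= 8;
    d3 = P % 7 + 1; P /= 7;
    d4 = P + 1;
    if (d2 >= d1) d2++;
    lo = d1; hi = d2;                          /* sort the first two digits */
    if (lo > hi) { t = lo; lo = hi; hi = t; }
    if (d3 >= lo) d3++;                        /* skip used digits, smallest first */
    if (d3 >= hi) d3++;
    mid = d3;                                  /* sort the first three digits */
    if (mid < lo) { t = mid; mid = lo; lo = t; }
    if (mid > hi) { t = mid; mid = hi; hi = t; }
    if (d4 >= lo) d4++;
    if (d4 >= mid) d4++;
    if (d4 >= hi) d4++;
    return d1 * 1000 + d2 * 100 + d3 * 10 + d4;
}

/* Returns 1 if every decimal digit of n is in 1..9 and none repeats. */
int has_distinct_nonzero_digits(int n) {
    int used = 0;
    while (n > 0) {
        int d = n % 10;
        if (d == 0 || (used & (1 << d))) return 0;
        used |= 1 << d;
        n /= 10;
    }
    return 1;
}

/* Counts the values of P whose result is valid and not seen before. */
int count_valid_outputs(void) {
    char seen[10000] = {0};                    /* bookkeeping for the harness only */
    int valid = 0;
    for (int P = 0; P < 3024; P++) {
        int n = decode_permutation(P);
        if (n >= 1000 && n <= 9999 && has_distinct_nonzero_digits(n) && !seen[n]) {
            seen[n] = 1;
            valid++;
        }
    }
    return valid;
}
```

If the decoder is a bijection onto the 9*8*7*6 permutations, count_valid_outputs() returns 3024; any smaller value means some P produced a repeated digit or a duplicate number.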
You could start with an integer number, 0x123456789, and pick random nibbles from it (a nibble is the 4 bits that make up one digit of the hex value). When a nibble has been selected, remove it from the number and continue picking from those left.
This makes exactly 4 calls to rand() and has no if or other conditions (other than the loop condition).
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int get_random_4digit() {
uint64_t bits = 0x123456789; // nibbles
int res = 0;
// pick random nibbles
for(unsigned last = 9 - 1; last > 9 - 1 - 4; --last) {
unsigned lsh = last * 4; // shift last nibble
unsigned sel = (rand() % (last + 1)) * 4; // shift for random nibble
// multiply with 10 and add the selected nibble
res = res * 10 + ((bits & (0xFULL << sel)) >> sel);
// move the last unselected nibble right to where the selected
// nibble was:
bits = (bits & ~(0xFULL << sel)) |
((bits & (0xFULL << lsh)) >> (lsh - sel));
}
return res;
}
Another variant could be to use the same value, 0x123456789, and do a Fisher-Yates shuffle on the nibbles. When the shuffle is done, return the 4 lowest nibbles. This is more expensive since it randomizes the order of all 9 nibbles - but it makes it easy if you want to select an arbitrary amount of them afterwards.
Example:
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
uint16_t get_random_4digit() {
uint64_t bits = 0x123456789; // nibbles
// shuffle the nibbles
for(unsigned idx = 9 - 1; idx > 0; --idx) {
unsigned ish = idx * 4; // index shift
// shift for random nibble to swap with `idx`
unsigned swp = (rand() % (idx + 1)) * 4;
// extract the selected nibbles
uint64_t a = (bits & (0xFULL << ish)) >> ish;
uint64_t b = (bits & (0xFULL << swp)) >> swp;
// swap them
bits &= ~((0xFULL << ish) | (0xFULL << swp));
bits |= (a << swp) | (b << ish);
}
return bits & 0xFFFF; // return the 4 lowest nibbles
}
The bit manipulation can probably be optimized, but I wrote it the way I thought of it, so it's probably better for readability to leave it as-is.
You can then print the value as a hex value to get the output you want - or extract the 4 nibbles and convert it for decimal output.
int main() {
srand(time(NULL));
uint16_t res = get_random_4digit();
// print directly as hex:
printf("%X\n", res);
// or extract the nibbles and multiply to get decimal result - same output:
uint16_t a = (res >> 12) & 0xF;
uint16_t b = (res >> 8) & 0xF;
uint16_t c = (res >> 4) & 0xF;
uint16_t d = (res >> 0) & 0xF;
uint16_t dec = a * 1000 + b * 100 + c * 10 + d;
printf("%d\n", dec);
}
You should keep generating digits until a distinct one is found:
int get_random_4digit() {
int random_4digit = 0;
/* We must have 4 digits number - at least 1234 */
while (random_4digit < 1000) {
int digit = rand() % 9 + 1;
/* check if generated digit is not in the result */
for (int number = random_4digit; number > 0; number /= 10)
if (number % 10 == digit) {
digit = 0; /* digit has been found, we'll try once more */
break;
}
if (digit > 0) /* unique digit generated, we add it to result */
random_4digit = random_4digit * 10 + digit;
}
return random_4digit;
}
One way to do this is to create an array with all 9 digits, pick a random one and remove it from the list.
Something like this:
uint_fast8_t digits[]={1,2,3,4,5,6,7,8,9}; //only 1-9 are allowed, 0 is not allowed
uint_fast8_t left=4; //How many digits are left to create
unsigned result=0; //Store the 4-digit number here
while(left--)
{
uint_fast8_t digit=getRand(9-4+left); //pick a random index
result=result*10+digits[digit];
//Move all digits above the selected one 1 index down.
//This removes the picked digit from the array.
while(digit<8)
{
digits[digit]=digits[digit+1];
digit++;
}
}
You said you need a solution without arrays. Luckily, we can store up to sixteen 4-bit numbers in a single uint64_t. Here is an example that uses a uint64_t to store the digit list so that no array is needed.
#include <stdint.h>
#include <inttypes.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
unsigned getRand(unsigned max)
{
return rand()%(max+1);
}
//Creates a uint64_t that is used as an array.
//Use no more than 16 values and don't use values >0x0F
//The last argument will have index 0
uint64_t fakeArrayCreate(uint_fast8_t count, ...)
{
uint64_t result=0;
va_list args;
va_start (args, count);
while(count--)
{
result=(result<<4) | va_arg(args,int);
}
return result;
}
uint_fast8_t fakeArrayGet(uint64_t array, uint_fast8_t index)
{
return array>>(4*index)&0x0F;
}
uint64_t fakeArraySet(uint64_t array, uint_fast8_t index, uint_fast8_t value)
{
array = array & ~((uint64_t)0x0F<<(4*index));
array = array | ((uint64_t)value<<(4*index));
return array;
}
unsigned getRandomDigits(void)
{
uint64_t digits = fakeArrayCreate(9,9,8,7,6,5,4,3,2,1);
uint_fast8_t left=4;
unsigned result=0;
while(left--)
{
uint_fast8_t digit=getRand(9-4+left);
result=result*10+fakeArrayGet(digits,digit);
//Move all digits above the selected one 1 index down.
//This removes the picked digit from the array.
while(digit<8)
{
digits=fakeArraySet(digits,digit,fakeArrayGet(digits,digit+1));
digit++;
}
}
return result;
}
//Test our function
int main(int argc, char **argv)
{
srand(argc > 1 ? (unsigned)atoi(argv[1]) : 1u); //seed from the first argument, if one was given
printf("%u\n",getRandomDigits());
}
You could use a partial Fisher-Yates shuffle on an array of 9 digits, stopping after 4 digits:
// Return random integer from 0 to n-1
// (for n in range 1 to RAND_MAX+1u).
int get_random_int(unsigned int n) {
unsigned int x = (RAND_MAX + 1u) / n;
unsigned int limit = x * n;
int s;
do {
s = rand();
} while (s >= limit);
return s / x;
}
// Return random 4-digit number from 1234 to 9876 with no
// duplicate digits and no 0 digit.
int get_random_4digit(void) {
char possible[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
int result = 0;
int i;
// Uses partial Fisher-Yates shuffle.
for (i = 0; i < 4; i++) {
// Get random position rand_pos from remaining possibilities i to 8
// (positions before i contain previous selected digits).
int rand_pos = i + get_random_int(9 - i);
// Select digit from position rand_pos.
char digit = possible[rand_pos];
// Exchange digits at positions i and rand_pos.
possible[rand_pos] = possible[i];
possible[i] = digit; // not really needed
// Put selected digit into result.
result = result * 10 + digit;
}
return result;
}
EDIT: I forgot the requirement "while only using int's, if, while and functions, so no arrays etc.", so feel free to ignore this answer!
If normal C integer types are allowed including long long int, the get_random_4digit() function above can be replaced with the following to satisfy the requirement:
// Return random 4-digit number from 1234 to 9876 with no
// duplicate digits and no 0 digit.
int get_random_4digit(void) {
long long int possible = 0x123456789; // 4 bits per digit
int result = 0;
int i;
// Uses partial Fisher-Yates shuffle.
i = 0;
while (i < 4) {
// Determine random position rand_pos in remaining possibilities 0 to 8-i.
int rand_pos = get_random_int(9 - i);
// Select digit from position rand_pos.
int digit = (possible >> (4 * rand_pos)) & 0xF;
// Replace digit at position rand_pos with digit at position 0.
possible ^= ((possible ^ digit) & 0xF) << (4 * rand_pos);
// Shift remaining possible digits down one position.
possible >>= 4;
// Put selected digit into result.
result = result * 10 + digit;
i++;
}
return result;
}
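The rejection loop in get_random_int is there because a plain rand() % n is slightly biased whenever n does not divide the number of possible rand() results. Below is a minimal sketch of the counting argument, using a hypothetical helper count_hits and assuming RAND_MAX is 32767 (the smallest value the C standard permits; many implementations use a larger one):

```c
/* Number of values v in [0, range) with v % modulus == residue.
   Hypothetical helper; `range` stands for RAND_MAX + 1. */
unsigned count_hits(unsigned range, unsigned modulus, unsigned residue) {
    unsigned full = range / modulus;   /* complete cycles through all residues */
    unsigned extra = range % modulus;  /* leftover values cover residues 0..extra-1 */
    return full + (residue < extra ? 1u : 0u);
}
```

Under that assumption there are 32768 possible results, so count_hits(32768, 3024, r) is 11 for r below 2528 and 10 for the remaining residues, which is the bias toward lower values that rand() % 3024 would have. For rand() % 9 the imbalance is much smaller: 3641 hits for residues 0 to 7 and 3640 for residue 8.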
There are multiple answers to this question already, but none of them seem to fit the requirement of using only ints, if, while and functions. Here is a modified version of Pelle Evensen's simple solution:
#include <stdlib.h>
int get_random_4digit(void) {
int acc = 0, used = 0, i = 0;
while (i < 4) {
int idx = rand() % 9; // Not strictly uniform but never mind...
if (!(used & (1 << idx))) {
acc = acc * 10 + idx + 1;
used |= 1 << idx;
i++;
}
}
return acc;
}
Very short: I am having issues understanding the workings of this code. It is much more efficient than my 20 or so lines that get the same outcome. I understand how left shift is supposed to work and the bitwise OR, but would appreciate a little guidance to understand how the two come together to make the line in the for loop work.
The code is meant to take in an array of bits (bits) of a given size (count) and return the integer value of the bits.
unsigned binary_array_to_numbers(const unsigned *bits, size_t count) {
unsigned res = 0;
for (size_t i = 0; i < count; i++)
res = res << 1 | bits[i];
return res;
}
EDIT: As requested, here is my newbie solution, which still passed all tests. I have also added a sample assignment to bits[]:
unsigned binary_array_to_numbers(const unsigned *bits, size_t count)
{
int i, j = 0;
unsigned add = 0;
for (i = count - 1; i >= 0; i--){
if(bits[i] == 1){
if(j >= 1){
j = j * 2;
add = add + j;
}
else{
j++;
add = add + j;
}
}
else {
if( j>= 1){
j = j * 2;
}
else{
j++;
}
}
}
return add;
}
int main(void){
const unsigned bits[] = {0,1,1,0};
size_t count = sizeof(bits)/sizeof(bits[0]);
binary_array_to_numbers(bits, count);
}
a breakdown:
- every left shift operation on a binary number effectively multiplies it by 2: 0111(7) << 1 = 1110(14)
- consider rhubarbdog answer - the operation can be seen as two separate actions: first left-shift (multiply by two), then OR with the current bit being reviewed
- the PC does not distinguish between the value displayed and the binary representation of the number
let's try and review a case in which your input is:
bits = {0, 1, 0, 1};
count = 4;
unsigned binary_array_to_numbers(const unsigned *bits, size_t count) {
    unsigned res = 0;
    for (size_t i = 0; i < count; i++) {
        res = res << 1;      // (a)
        res = res | bits[i]; /* (b) according to rhubarbdog answer */
    }
    return res;
}
iteration 0:
- bits[i] = 0;
- (a) res = b0; (left shift of 0)
- (b) res = b0; (bitwise OR with 0)
iteration 1:
- bits[i] = 1;
- (a) res = b0; (left shift of 0)
- (b) res = b1; (bitwise OR with 1)
iteration 2:
- bits[i] = 0;
- (a) res = b10; (left shift of 1 - decimal value is 2)
- (b) res = b10; (bitwise OR with 0)
iteration 3:
- bits[i] = 1;
- (a) res = b100; (left shift of b10 - decimal value is 4)
- (b) res = b101; (bitwise OR with 1)
the final result for res is binary(101) and decimal(5) as one would expect
NOTE: the use of unsigned is a must since a signed value will be interpreted as a negative value if the MSB is 1
hope that helps...
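Consider them as 2 operations. I'll re-write res = ... as 2 lines:
res = res << 1;
res = res | bits[i];
On the first pass with a 1 bit, res gets set to 1. On the next pass it is shifted (*2); because it is now even, OR-ing in the next bit is the same as adding it (+1 if the bit is 1).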
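To see that the shift-and-OR line is just "multiply by two, then add the next bit", the two forms can be put side by side. This is a small sketch; the function names bits_to_number_shift and bits_to_number_arith are made up for the comparison, and it assumes every array element is 0 or 1:

```c
#include <stddef.h>

/* Shift-and-OR form, as in the question. */
unsigned bits_to_number_shift(const unsigned *bits, size_t count) {
    unsigned res = 0;
    for (size_t i = 0; i < count; i++)
        res = (res << 1) | bits[i];   /* make room for one bit, then set it */
    return res;
}

/* Arithmetic form: the same steps written as multiply and add. */
unsigned bits_to_number_arith(const unsigned *bits, size_t count) {
    unsigned res = 0;
    for (size_t i = 0; i < count; i++)
        res = res * 2 + bits[i];      /* res * 2 is even, so adding 0 or 1 never carries */
    return res;
}
```

For bits = {0, 1, 0, 1} both functions return 5, and for {0, 1, 1, 0} both return 6.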
I was just wondering if anyone had any insight on how to convert a uint32_t hex value to an ASCII decimal string and display it on an LCD. An algorithm would help so I can program it in C. The hex value I'm getting comes from an ADC; I take it and convert it for the LCD. The ADC data is a 16-bit value and the LCD is 16x2.
void Hex2DecToLCD(){
Algorithm goes here
}
Regards
void Hex_To_BCD_To_LCD(uint32_t ADCData)
{
char BCD[5];
uint32_t Dig_0, Dig_1, Dig_2, Dig_3, Dig_4;
int i = 0;
uint32_t temp = 0x0;
uint32_t bin_inp = ADCData;
bin_inp *= 0x8; // scale up the ADC value
temp = bin_inp;
Dig_4 = temp/10000;
BCD[4] = Dig_4 + 0x30;
temp = temp - (Dig_4 * 10000);
Dig_3 = temp/1000;
BCD[3] = Dig_3 + 0x30;
temp = temp - (Dig_3 * 1000);
Dig_2 = temp/100;
BCD[2] = Dig_2 + 0x30;
temp = temp - (Dig_2 * 100);
Dig_1 = temp/10;
BCD[1] = Dig_1 + 0x30;
temp = temp - (Dig_1 * 10);
Dig_0 = temp;
BCD[0] = Dig_0 + 0x30;
//Data to LCD, most significant digit first
for (i = 4; i >= 0; i--){
Data_To_LCD(BCD[i]);
}
}
The answer provided by Julien in 2015 probably works, but I think its formatting and methodology could be improved a little bit.
My microcontroller relies on uint8_t for most integer values, including the transmission of ASCII bytes over UART.
I hope this answer helps whoever is looking for an easy drop-in uint32_t replacement for %lu in printf.
#include <stdint.h>
void putDwordDecimalValue(uint32_t value) {
if (value == 0) {
putChar('0');
return;
}
uint32_t valueCopy;
valueCopy = value;
int length = 0;
while (valueCopy != 0) {
length++;
valueCopy /= 10;
}
int i;
uint8_t asciiDigits[10]; // Max size of 4294967295 => 10 digits
for (i = 0; i < length; i++) {
uint8_t lastDigit = value % 10;
value /= 10;
asciiDigits[length - (i + 1)] = 0x30 + lastDigit; // 0x30 = 0 in ASCII
}
for (i = 0; i < length; i++)
putChar(asciiDigits[i]);
}
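Both answers send characters straight to a device-specific output function (Data_To_LCD, putChar), which makes them hard to try out on a PC. As a sketch of the same digit-extraction idea that writes into a buffer instead, here is a hypothetical helper u32_to_decimal; the caller is assumed to provide room for 11 characters (up to 10 digits plus the terminator):

```c
#include <stdint.h>

/* Writes `value` as decimal ASCII into `out` and returns the digit count.
   Hypothetical helper; `out` must hold at least 11 chars. */
int u32_to_decimal(uint32_t value, char *out) {
    char tmp[10];                 /* digits, least significant first */
    int len = 0;
    do {
        tmp[len++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);         /* do/while so that 0 still yields "0" */
    for (int i = 0; i < len; i++) /* reverse into most-significant-first order */
        out[i] = tmp[len - 1 - i];
    out[len] = '\0';
    return len;
}
```

The resulting string can then be sent to the display one character at a time with whatever output routine the target provides.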
I've been developing a cryptographic algorithm on the GPU and am currently stuck on an algorithm to perform large integer addition. Large integers are represented in the usual way, as a sequence of 32-bit words.
For example, we can use one thread to add two 32-bit words. For simplicity, let us assume that the numbers to be added are of the same length and that the number of threads per block == number of words. Then:
__global__ void add_kernel(int *C, const int *A, const int *B) {
int x = A[threadIdx.x];
int y = B[threadIdx.x];
int z = x + y;
int carry = (z < x);
/** do carry propagation in parallel somehow ? */
............
z = z + newcarry; // update the resulting words after carry propagation
C[threadIdx.x] = z;
}
I am pretty sure that there is a way to do carry propagation via some tricky reduction procedure, but I could not figure it out.
I had a look at the CUDA Thrust extensions, but a big integer package does not seem to be implemented yet.
Perhaps someone can give me a hint on how to do that in CUDA?
You are right, carry propagation can be done via prefix sum computation but it's a bit tricky to define the binary function for this operation and prove that it is associative (needed for parallel prefix sum). As a matter of fact, this algorithm is used (theoretically) in Carry-lookahead adder.
Suppose we have two large integers a[0..n-1] and b[0..n-1].
Then we compute (i = 0..n-1):
s[i] = a[i] + b[i];
carryin[i] = (s[i] < a[i]);
We define two functions:
generate[i] = carryin[i];
propagate[i] = (s[i] == 0xffffffff);
with quite intuitive meaning: generate[i] == 1 means that the carry is generated at
position i while propagate[i] == 1 means that the carry will be propagated from position
(i - 1) to (i + 1). Our goal is to compute the function carryout[0..n-1] used to update the resulting sum s[0..n-1]. carryout can be computed recursively as follows:
carryout[i] = generate[i] OR (propagate[i] AND carryout[i-1])   for i = 1..n-1
carryout[0] = generate[0]   (no carry enters position 0)
Here carryout[i] == 1 if carry is generated at position i OR it is generated sometimes earlier AND propagated to position i. Finally, we update the resulting sum:
s[i] = s[i] + carryout[i-1]; for i = 1..n-1
carry = carryout[n-1];
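Before parallelizing, it can help to have the recurrence written out sequentially on the host as a reference to test against. The following is a minimal sketch under stated assumptions, not the GPU code: bignum_add_reference is a hypothetical name, words are 32 bits, and the least significant word is at index 0.

```c
#include <stdint.h>
#include <stddef.h>

/* Sequential reference: s = a + b over n 32-bit words, least significant
   word first. Returns the carry out of the top word. Hypothetical helper. */
uint32_t bignum_add_reference(uint32_t *s, const uint32_t *a,
                              const uint32_t *b, size_t n) {
    uint32_t carry = 0;                        /* carry into word 0 is zero */
    for (size_t i = 0; i < n; i++) {
        uint32_t sum = a[i] + b[i];            /* wraps modulo 2^32 */
        uint32_t generate = (sum < a[i]);      /* this word overflowed by itself */
        uint32_t propagate = (sum == 0xffffffffu); /* an incoming carry passes through */
        s[i] = sum + carry;                    /* apply the carry from below */
        carry = generate | (propagate & carry);    /* carry out of word i */
    }
    return carry;
}
```

Adding {0xffffffff, 0xffffffff, 0} and {1, 0, 0} exercises both cases: word 0 generates a carry, word 1 propagates it, and the result is {0, 0, 1} with no final carry.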
Now it is quite straightforward to prove that carryout function is indeed binary associative and hence parallel prefix sum computation applies. To implement this on CUDA, we can merge both flags 'generate' and 'propagate' in a single variable since they are mutually exclusive, i.e.:
cy[i] = (s[i] == -1u ? -1u : 0) | carryin[i];
In other words,
cy[i] = 0xffffffff if propagate[i]
cy[i] = 1 if generate[i]
cy[i] = 0 otherwise
Then, one can verify that the following formula computes prefix sum for carryout function:
cy[i] = max((int)cy[i], (int)cy[k]) & cy[i];
for all k < i. The example code below shows large addition for 2048-word integers. Here I used CUDA blocks with 512 threads:
// add & output carry flag
#define UADDO(c, a, b) \
asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
// add with carry & output carry flag
#define UADDC(c, a, b) \
asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
#define WS 32
__global__ void bignum_add(unsigned *g_R, const unsigned *g_A,const unsigned *g_B) {
extern __shared__ unsigned shared[];
unsigned *r = shared;
const unsigned N_THIDS = 512;
unsigned thid = threadIdx.x, thid_in_warp = thid & WS-1;
unsigned ofs, cf;
uint4 a = ((const uint4 *)g_A)[thid],
b = ((const uint4 *)g_B)[thid];
UADDO(a.x, a.x, b.x) // adding 128-bit chunks with carry flag
UADDC(a.y, a.y, b.y)
UADDC(a.z, a.z, b.z)
UADDC(a.w, a.w, b.w)
UADDC(cf, 0, 0) // save carry-out
// memory consumption: 49 * N_THIDS / 64
// use "alternating" data layout for each pair of warps
volatile short *scan = (volatile short *)(r + 16 + thid_in_warp +
49 * (thid / 64)) + ((thid / 32) & 1);
scan[-32] = -1; // put identity element
if(a.x == -1u && a.x == a.y && a.x == a.z && a.x == a.w)
// this indicates that carry will propagate through the number
cf = -1u;
// "Hillis-and-Steele-style" reduction
scan[0] = cf;
cf = max((int)cf, (int)scan[-2]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-4]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-8]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-16]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-32]) & cf;
scan[0] = cf;
int *postscan = (int *)r + 16 + 49 * (N_THIDS / 64);
if(thid_in_warp == WS - 1) // scan leading carry-outs once again
postscan[thid >> 5] = cf;
__syncthreads();
if(thid < N_THIDS / 32) {
volatile int *t = (volatile int *)postscan + thid;
t[-8] = -1; // load identity symbol
cf = t[0];
cf = max((int)cf, (int)t[-1]) & cf;
t[0] = cf;
cf = max((int)cf, (int)t[-2]) & cf;
t[0] = cf;
cf = max((int)cf, (int)t[-4]) & cf;
t[0] = cf;
}
__syncthreads();
cf = scan[0];
int ps = postscan[(int)((thid >> 5) - 1)]; // postscan[-1] equals to -1
scan[0] = max((int)cf, ps) & cf; // update carry flags within warps
cf = scan[-2];
if(thid_in_warp == 0)
cf = ps;
if((int)cf < 0)
cf = 0;
UADDO(a.x, a.x, cf) // propagate carry flag if needed
UADDC(a.y, a.y, 0)
UADDC(a.z, a.z, 0)
UADDC(a.w, a.w, 0)
((uint4 *)g_R)[thid] = a;
}
Note that macros UADDO / UADDC might not be necessary anymore since CUDA 4.0 has corresponding intrinsics (however I am not entirely sure).
Note also that, though parallel reduction is quite fast, if you need to add several large integers in a row, it might be better to use some redundant representation (which was suggested in comments above), i.e., first accumulate the results of additions in 64-bit words, and then perform one carry propagation at the very end in "one sweep".
I thought I would post my answer also, in addition to @asm's, so this SO question can be a sort of repository of ideas. Similar to @asm, I detect and store the carry condition as well as the "carry-through" condition, i.e. when the intermediate word result is all 1's (0xF...FFF), so that if a carry were to propagate into this word, it would "carry through" to the next word.
I didn't use any PTX or asm in my code, so I chose to use 64-bit unsigned ints instead of 32-bit, to achieve the 2048x32bit capability, using 1024 threads.
A larger difference from @asm's code is in my parallel carry propagation scheme. I construct a bit-packed array ("carry") where each bit represents the carry condition generated from the independent intermediate 64-bit adds from each of the 1024 threads. I also construct a bit-packed array ("carry_through") where each bit represents the carry_through condition of the individual 64-bit intermediate results. For 1024 threads, this amounts to 1024/64 = 16x64 bit words of shared memory for each bit-packed array, so total shared mem usage is 64+3 32-bit quantities. With these bit-packed arrays, I perform the following to generate a combined propagated carry indicator:
carry = carry | (carry_through ^ ((carry & carry_through) + carry_through));
(note that carry is shifted left by one: carry[i] indicates that the result of a[i-1] + b[i-1] generated a carry)
The explanation is as follows:
1. The bitwise AND of carry and carry_through generates the candidates where a carry will interact with a sequence of one or more carry-through conditions.
2. Adding the result of step 1 to carry_through generates a result whose changed bits represent all words that will be affected by the propagation of the carry into the carry_through sequence.
3. Taking the exclusive-or of carry_through and the result from step 2 shows the affected results, each indicated with a 1 bit.
4. Taking the bitwise OR of the result from step 3 and the ordinary carry indicators gives a combined carry condition, which is then used to update all the intermediate results.
Note that the addition in step 2 requires another multi-word add (for big ints composed of more than 64 words). I believe this algorithm works, and it has passed the test cases I have thrown at it.
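The formula is easier to follow on a small case where everything fits in one mask. The sketch below is host-side C for illustration only; combine_carries is a hypothetical name, and it assumes at most 64 words so that a single 64-bit mask holds all flags and no multi-word add is needed:

```c
#include <stdint.h>

/* carry:         bit i set means word i-1 generated a carry
                  (the mask is already shifted left by one).
   carry_through: bit i set means word i's intermediate sum is all ones.
   Returns a mask whose bit i is set if word i must be incremented.
   Hypothetical helper for at most 64 words. */
uint64_t combine_carries(uint64_t carry, uint64_t carry_through) {
    uint64_t candidates = carry & carry_through;      /* carries that hit a run of all-ones words */
    uint64_t rippled = candidates + carry_through;    /* the add ripples through each such run */
    return carry | (carry_through ^ rippled);         /* changed bits plus the plain carries */
}
```

For example, if word 0 generates a carry (carry = 0x2) and words 1 and 2 are all ones (carry_through = 0x6), the result is 0xE: words 1, 2 and 3 are each incremented. With the same carry_through but no incoming carry, the result is 0.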
Here is my example code which implements this:
// parallel add of large integers
// requires CC 2.0 or higher
// compile with:
// nvcc -O3 -arch=sm_20 -o paradd2 paradd2.cu
#include <stdio.h>
#include <stdlib.h>
#define MAXSIZE 1024 // the number of 64 bit quantities that can be added
#define LLBITS 64 // the number of bits in a long long
#define BSIZE ((MAXSIZE + LLBITS -1)/LLBITS) // MAXSIZE when packed into bits
#define nTPB MAXSIZE
// define either GPU or GPUCOPY, not both -- for timing
#define GPU
//#define GPUCOPY
#define LOOPCNT 1000
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
// perform c = a + b, for unsigned integers of psize*64 bits.
// all work done in a single threadblock.
// multiple threadblocks are handling multiple separate addition problems
// least significant word is at a[0], etc.
__global__ void paradd(const unsigned size, const unsigned psize, unsigned long long *c, const unsigned long long *a, const unsigned long long *b){
__shared__ unsigned long long carry_through[BSIZE];
__shared__ unsigned long long carry[BSIZE+1];
__shared__ volatile unsigned mcarry;
__shared__ volatile unsigned mcarry_through;
unsigned idx = threadIdx.x + (psize * blockIdx.x);
if ((threadIdx.x < psize) && (idx < size)){
// handle 64 bit unsigned add first
unsigned long long cr1 = a[idx];
unsigned long long lc = cr1 + b[idx];
// handle carry
if (threadIdx.x < BSIZE){
carry[threadIdx.x] = 0;
carry_through[threadIdx.x] = 0;
}
if (threadIdx.x == 0){
mcarry = 0;
mcarry_through = 0;
}
__syncthreads();
if (lc < cr1){
if ((threadIdx.x%LLBITS) != (LLBITS-1))
atomicAdd(&(carry[threadIdx.x/LLBITS]), (2ull<<(threadIdx.x%LLBITS)));
else atomicAdd(&(carry[(threadIdx.x/LLBITS)+1]), 1);
}
// handle carry-through
if (lc == 0xFFFFFFFFFFFFFFFFull)
atomicAdd(&(carry_through[threadIdx.x/LLBITS]), (1ull<<(threadIdx.x%LLBITS)));
__syncthreads();
if (threadIdx.x < ((psize + LLBITS-1)/LLBITS)){
// only 1 warp executing within this if statement
unsigned long long cr3 = carry_through[threadIdx.x];
cr1 = carry[threadIdx.x] & cr3;
// start of sub-add
unsigned long long cr2 = cr3 + cr1;
if (cr2 < cr1) atomicAdd((unsigned *)&mcarry, (2u<<(threadIdx.x)));
if (cr2 == 0xFFFFFFFFFFFFFFFFull) atomicAdd((unsigned *)&mcarry_through, (1u<<threadIdx.x));
if (threadIdx.x == 0) {
unsigned cr4 = mcarry & mcarry_through;
cr4 += mcarry_through;
mcarry |= (mcarry_through ^ cr4);
}
if (mcarry & (1u<<threadIdx.x)) cr2++;
// end of sub-add
carry[threadIdx.x] |= (cr2 ^ cr3);
}
__syncthreads();
if (carry[threadIdx.x/LLBITS] & (1ull<<(threadIdx.x%LLBITS))) lc++;
c[idx] = lc;
}
}
int main() {
unsigned long long *h_a, *h_b, *h_c, *d_a, *d_b, *d_c, *c;
unsigned at_once = 256; // valid range = 1 .. 65535
unsigned prob_size = MAXSIZE ; // valid range = 1 .. MAXSIZE
unsigned dsize = at_once * prob_size;
cudaEvent_t t_start_gpu, t_start_cpu, t_end_gpu, t_end_cpu;
float et_gpu, et_cpu, tot_gpu, tot_cpu;
tot_gpu = 0;
tot_cpu = 0;
if (sizeof(unsigned long long) != (LLBITS/8)) {printf("Word Size Error\n"); return 1;}
if ((c = (unsigned long long *)malloc(dsize * sizeof(unsigned long long))) == 0) {printf("Malloc Fail\n"); return 1;}
cudaHostAlloc((void **)&h_a, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
cudaCheckErrors("cudaHostAlloc1 fail");
cudaHostAlloc((void **)&h_b, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
cudaCheckErrors("cudaHostAlloc2 fail");
cudaHostAlloc((void **)&h_c, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
cudaCheckErrors("cudaHostAlloc3 fail");
cudaMalloc((void **)&d_a, dsize * sizeof(unsigned long long));
cudaCheckErrors("cudaMalloc1 fail");
cudaMalloc((void **)&d_b, dsize * sizeof(unsigned long long));
cudaCheckErrors("cudaMalloc2 fail");
cudaMalloc((void **)&d_c, dsize * sizeof(unsigned long long));
cudaCheckErrors("cudaMalloc3 fail");
cudaMemset(d_c, 0, dsize*sizeof(unsigned long long));
cudaEventCreate(&t_start_gpu);
cudaEventCreate(&t_end_gpu);
cudaEventCreate(&t_start_cpu);
cudaEventCreate(&t_end_cpu);
for (unsigned loops = 0; loops <LOOPCNT; loops++){
//create some test cases
if (loops == 0){
for (int j=0; j<at_once; j++)
for (int k=0; k<prob_size; k++){
int i= (j*prob_size) + k;
h_a[i] = 0xFFFFFFFFFFFFFFFFull;
h_b[i] = 0;
}
h_a[prob_size-1] = 0;
h_b[prob_size-1] = 1;
h_b[0] = 1;
}
else if (loops == 1){
for (int i=0; i<dsize; i++){
h_a[i] = 0xFFFFFFFFFFFFFFFFull;
h_b[i] = 0;
}
h_b[0] = 1;
}
else if (loops == 2){
for (int i=0; i<dsize; i++){
h_a[i] = 0xFFFFFFFFFFFFFFFEull;
h_b[i] = 2;
}
h_b[0] = 1;
}
else {
for (int i = 0; i<dsize; i++){
h_a[i] = (((unsigned long long)lrand48())<<33) + (unsigned long long)lrand48();
h_b[i] = (((unsigned long long)lrand48())<<33) + (unsigned long long)lrand48();
}
}
#ifdef GPUCOPY
cudaEventRecord(t_start_gpu, 0);
#endif
cudaMemcpy(d_a, h_a, dsize*sizeof(unsigned long long), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMemcpy1 fail");
cudaMemcpy(d_b, h_b, dsize*sizeof(unsigned long long), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMemcpy2 fail");
#ifdef GPU
cudaEventRecord(t_start_gpu, 0);
#endif
paradd<<<at_once, nTPB>>>(dsize, prob_size, d_c, d_a, d_b);
cudaCheckErrors("Kernel Fail");
#ifdef GPU
cudaEventRecord(t_end_gpu, 0);
#endif
cudaMemcpy(h_c, d_c, dsize*sizeof(unsigned long long), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy3 fail");
#ifdef GPUCOPY
cudaEventRecord(t_end_gpu, 0);
#endif
cudaEventSynchronize(t_end_gpu);
cudaEventElapsedTime(&et_gpu, t_start_gpu, t_end_gpu);
tot_gpu += et_gpu;
cudaEventRecord(t_start_cpu, 0);
//also compute result on CPU for comparison
for (int j=0; j<at_once; j++) {
unsigned rc=0;
for (int n=0; n<prob_size; n++){
unsigned i = (j*prob_size) + n;
c[i] = h_a[i] + h_b[i];
if (c[i] < h_a[i]) {
c[i] += rc;
rc=1;}
else {
if ((c[i] += rc) != 0) rc=0;
}
if (c[i] != h_c[i]) {printf("Results mismatch at offset %u, GPU = 0x%llX, CPU = 0x%llX\n", i, h_c[i], c[i]); return 1;}
}
}
cudaEventRecord(t_end_cpu, 0);
cudaEventSynchronize(t_end_cpu);
cudaEventElapsedTime(&et_cpu, t_start_cpu, t_end_cpu);
tot_cpu += et_cpu;
if ((loops%(LOOPCNT/10)) == 0) printf("*\n");
}
printf("\nResults Match!\n");
printf("Average GPU time = %fms\n", (tot_gpu/LOOPCNT));
printf("Average CPU time = %fms\n", (tot_cpu/LOOPCNT));
return 0;
}
I'm looking for fast code for 64-bit (unsigned) cube roots. (I'm using C and compiling with gcc, but I imagine most of the work required will be language- and compiler-agnostic.) I will denote by ulong a 64-bit unsigned integer.
Given an input n I require the (integral) return value r to be such that
r * r * r <= n && n < (r + 1) * (r + 1) * (r + 1)
That is, I want the cube root of n, rounded down. Basic code like
return (ulong)pow(n, 1.0/3);
is incorrect because of rounding toward the end of the range. Unsophisticated code like
ulong
cuberoot(ulong n)
{
ulong ret = pow(n + 0.5, 1.0/3);
if (n < 100000000000001ULL)
return ret;
if (n >= 18446724184312856125ULL)
return 2642245ULL;
if (ret * ret * ret > n) {
ret--;
while (ret * ret * ret > n)
ret--;
return ret;
}
while ((ret + 1) * (ret + 1) * (ret + 1) <= n)
ret++;
return ret;
}
gives the correct result, but is slower than it needs to be.
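When comparing candidate implementations, the requirement on r can be turned into a checker. Evaluated literally in 64-bit arithmetic, (r + 1) * (r + 1) * (r + 1) overflows for the largest root, so the sketch below treats that case separately. is_floor_cbrt is a hypothetical helper, and the constant 2642245 is taken from the code above:

```c
#include <stdint.h>

/* Returns 1 if r is the cube root of n rounded down, for 64-bit n.
   Hypothetical helper for testing candidate implementations. */
int is_floor_cbrt(uint64_t n, uint64_t r) {
    const uint64_t r_max = 2642245u;   /* largest r whose cube fits in 64 bits */
    if (r > r_max) return 0;           /* r * r * r would overflow */
    if (r * r * r > n) return 0;       /* r is too large */
    if (r == r_max) return 1;          /* (r + 1)^3 exceeds every 64-bit n */
    return n < (r + 1) * (r + 1) * (r + 1);
}
```

A test loop could then call is_floor_cbrt(n, candidate(n)) for values of n around perfect cubes, where rounding errors are most likely to show up.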
This code is for a math library and it will be called many times from various functions. Speed is important, but you can't count on a warm cache (so suggestions like a 2,642,245-entry binary search are right out).
For comparison, here is code that correctly calculates the integer square root.
ulong squareroot(ulong a) {
ulong x = (ulong)sqrt((double)a);
if (x > 0xFFFFFFFF || x*x > a)
x--;
return x;
}
The book "Hacker's Delight" has algorithms for this and many other problems. The code is online here. EDIT: That code doesn't work properly with 64-bit ints, and the instructions in the book on how to fix it for 64-bit are somewhat confusing. A proper 64-bit implementation (including test case) is online here.
I doubt that your squareroot function works "correctly" - it should be ulong a for the argument, not n :) (but the same approach would work using cbrt instead of sqrt, although not all C math libraries have cube root functions).
I've adapted the algorithm presented in 1.5.2 (the kth root) in Modern Computer Arithmetic (Brent and Zimmermann). For the case of (k == 3), and given a 'relatively' accurate over-estimate of the initial guess, this algorithm seems to out-perform the 'Hacker's Delight' code above.
Not only that, but MCA as a text provides theoretical background as well as a proof of correctness and terminating criteria.
Provided that we can produce a 'relatively' good initial over-estimate, I haven't been able to find a case that exceeds (7) iterations. (Is this effectively related to 64-bit values having 2^6 bits?) Either way, it's an improvement over the (21) iterations of the HacDel code, which converges linearly, O(b), although its loop body is evidently much faster.
The initial estimate I've used is based on a 'rounding up' of the number of significant bits in the value (x). Given (b) significant bits in (x), we can say: 2^(b - 1) <= x < 2^b. I state without proof (though it should be relatively easy to demonstrate) that: 2^ceil(b / 3) > x^(1/3)
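The bound stated without proof follows in one line from the definition of b. A short derivation, assuming only that x >= 1 has exactly b significant bits:

```latex
\[
x < 2^{b}
\;\Longrightarrow\;
x^{1/3} < 2^{b/3} \le 2^{\lceil b/3 \rceil}
\]
```

So 2^ceil(b / 3) is always a strict over-estimate of the cube root. In C integer arithmetic, ceil(b / 3) for b >= 1 can be computed as (b + 2) / 3.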
#include <stdint.h>
static inline uint32_t u64_cbrt (uint64_t x)
{
uint64_t r0 = 1, r1;
/* IEEE-754 cbrt *may* not be exact. */
if (x == 0) /* cbrt(0) : */
return (0);
int b = (64) - __builtin_clzll(x);
r0 <<= (b + 2) / 3; /* ceil(b / 3) */
do /* quadratic convergence: */
{
r1 = r0;
r0 = (2 * r1 + x / (r1 * r1)) / 3;
}
while (r0 < r1);
return ((uint32_t) r1); /* floor(cbrt(x)); */
}
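The bound on the initial estimate that is stated without proof above follows in one line from the definition of b. A short derivation, using only the symbols already introduced (x and b):

```latex
x < 2^{b}
\;\Longrightarrow\;
x^{1/3} < 2^{b/3} \le 2^{\lceil b/3 \rceil}
```

So 2^ceil(b / 3) is always a strict over-estimate of the real cube root. For b >= 1, the integer expression (b + 2) / 3 used in the code equals ceil(b / 3).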
A cbrt call probably isn't all that useful, unlike the sqrt call, which can be efficiently implemented on modern hardware. That said, I've seen gains for sets of values under 2^53 (exactly represented in IEEE-754 doubles), which surprised me.
The only downside is the division by (r * r), which can be slow, as the latency of integer division continues to fall behind other advances in ALUs. The division by the constant 3 is handled by reciprocal methods on any modern optimising compiler.
It's interesting that Intel's 'Ice Lake' microarchitecture will significantly improve integer division, an operation that seems to have been neglected for a long time. I simply won't trust the 'Hacker's Delight' answer until I can find a sound theoretical basis for it. And then I have to work out which variant is the 'correct' answer.
You could try a Newton's step to fix your rounding errors:
ulong r = (ulong)pow(n, 1.0/3);
if(r==0) return r; /* avoid divide by 0 later on */
ulong r3 = r*r*r;
ulong slope = 3*r*r;
ulong r1 = r+1;
ulong r13 = r1*r1*r1;
/* making sure to handle unsigned arithmetic correctly */
if(n >= r13) r+= (n - r3)/slope;
if(n < r3) r-= (r3 - n)/slope;
A single Newton step ought to be enough, but you may still have off-by-one (or possibly larger) errors. You can check for and fix those with a final check-and-adjust step, as in your original question:
while(r*r*r > n) --r;
while((r+1)*(r+1)*(r+1) <= n) ++r;
or some such.
(I admit I'm lazy; the right way to do it is to carefully check to determine which (if any) of the check&increment things is actually necessary...)
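One thing to watch in the final check: for n close to 2^64, (r+1)*(r+1)*(r+1) can overflow a 64-bit integer, because 2642245 is the largest value whose cube still fits. Below is a minimal sketch of an overflow-safe version of the correction step. The names ICBRT_MAX, cube_le and icbrt_fixup are hypothetical (they are not from the answer above), and the estimate r is assumed to be close to the true root already, since each loop moves only one step at a time.

```c
#include <stdint.h>

/* Largest r such that r*r*r still fits in 64 bits (2642245^3 < 2^64). */
#define ICBRT_MAX 2642245u

/* Returns 1 if r^3 <= n, without ever overflowing. */
static int cube_le(uint64_t r, uint64_t n)
{
    return r <= ICBRT_MAX && r * r * r <= n;
}

/* Turns an approximate cube root r of n into floor(cbrt(n)). */
static uint64_t icbrt_fixup(uint64_t n, uint64_t r)
{
    while (!cube_le(r, n))      /* estimate too high: step down */
        --r;
    while (cube_le(r + 1, n))   /* estimate too low: step up */
        ++r;
    return r;
}
```

The result of the pow-plus-Newton step would be passed in as r. The first loop cannot underflow, because cube_le(0, n) is true for every n.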
If pow is too expensive, you can use a count-leading-zeros instruction to get an approximation to the result, then use a lookup table, then some Newton steps to finish it.
int k = __builtin_clzll(n); // counts # of leading zeros of a 64-bit n (often a single assembly insn); undefined for n == 0
int b = 64 - k; // # of bits in n
int top8 = n >> (b - 8); // top 8 bits of n (top bit is always 1); assumes b >= 8
int approx = table[b][top8 & 0x7f];
Given b and top8, you can use a lookup table (in my code, 8K entries) to find a good approximation to cuberoot(n). Use some Newton steps (see comingstorm's answer) to finish it.
// On my pc: Math.Sqrt 35 ns, cbrt64 <70ns, cbrt32 <25 ns, (cbrt12 < 10ns)
// cbrt64(ulong x) is a C# version of:
// http://www.hackersdelight.org/hdcodetxt/acbrt.c.txt (acbrt1)
// cbrt32(uint x) is a C# version of:
// http://www.hackersdelight.org/hdcodetxt/icbrt.c.txt (icbrt1)
// Union in C#:
// http://www.hanselman.com/blog/UnionsOrAnEquivalentInCSairamasTipOfTheDay.aspx
using System.Runtime.InteropServices;
[StructLayout(LayoutKind.Explicit)]
public struct fu_32 // float <==> uint
{
[FieldOffset(0)]
public float f;
[FieldOffset(0)]
public uint u;
}
private static uint cbrt64(ulong x)
{
if (x >= 18446724184312856125) return 2642245;
float fx = (float)x;
fu_32 fu32 = new fu_32();
fu32.f = fx;
uint uy = fu32.u / 4;
uy += uy / 4;
uy += uy / 16;
uy += uy / 256;
uy += 0x2a5137a0;
fu32.u = uy;
float fy = fu32.f;
fy = 0.33333333f * (fx / (fy * fy) + 2.0f * fy);
int y0 = (int)
(0.33333333f * (fx / (fy * fy) + 2.0f * fy));
uint y1 = (uint)y0;
ulong y2, y3;
if (y1 >= 2642245)
{
y1 = 2642245;
y2 = 6981458640025;
y3 = 18446724184312856125;
}
else
{
y2 = (ulong)y1 * y1;
y3 = y2 * y1;
}
if (y3 > x)
{
y1 -= 1;
y2 -= 2 * y1 + 1;
y3 -= 3 * y2 + 3 * y1 + 1;
while (y3 > x)
{
y1 -= 1;
y2 -= 2 * y1 + 1;
y3 -= 3 * y2 + 3 * y1 + 1;
}
return y1;
}
do
{
y3 += 3 * y2 + 3 * y1 + 1;
y2 += 2 * y1 + 1;
y1 += 1;
}
while (y3 <= x);
return y1 - 1;
}
private static uint cbrt32(uint x)
{
uint y = 0, z = 0, b = 0;
int s = x < 1u << 24 ? x < 1u << 12 ? x < 1u << 06 ? x < 1u << 03 ? 00 : 03 :
x < 1u << 09 ? 06 : 09 :
x < 1u << 18 ? x < 1u << 15 ? 12 : 15 :
x < 1u << 21 ? 18 : 21 :
x >= 1u << 30 ? 30 : x < 1u << 27 ? 24 : 27;
do
{
y *= 2;
z *= 4;
b = 3 * y + 3 * z + 1 << s;
if (x >= b)
{
x -= b;
z += 2 * y + 1;
y += 1;
}
s -= 3;
}
while (s >= 0);
return y;
}
private static uint cbrt12(uint x) // x < ~255
{
uint y = 0, a = 0, b = 1, c = 0;
while (a < x)
{
y++;
b += c;
a += b;
c += 6;
}
if (a != x) y--;
return y;
}
Starting from the code within the GitHub gist from the answer of Fabian Giesen, I have arrived at the following, faster implementation:
#include <stdint.h>
static inline uint64_t icbrt(uint64_t x) {
uint64_t b, y, bits = 3*21;
int s;
for (s = bits - 3; s >= 0; s -= 3) {
if ((x >> s) == 0)
continue;
x -= (uint64_t)1 << s; /* widen before shifting: s can be as large as 60 */
y = 1;
for (s = s - 3; s >= 0; s -= 3) {
y += y;
b = 1 + 3*y*(y + 1);
if ((x >> s) >= b) {
x -= b << s;
y += 1;
}
}
return y;
}
return 0;
}
While the above is still somewhat slower than methods relying on the GNU-specific __builtin_clzll, it does not rely on any compiler extensions and is thus completely portable.
The bits constant
Lowering the constant bits leads to faster computation, but the highest number x for which the function gives correct results is (1 << bits) - 1. Also, bits must be a multiple of 3 and be at most 64, meaning that its maximum value is really 3*21 == 63. With bits = 3*21, icbrt() thus works for input x <= 9223372036854775807. If we know that a program is working with limited x, say x < 1000000, then we can speed up the cube root computation by setting bits = 3*7, since (1 << 3*7) - 1 = 2097151 >= 1000000.
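As a small illustration of the rule above, here is a sketch of a helper that picks the smallest suitable value of bits for a known upper limit on x. The name icbrt_bits_for is hypothetical. The result is capped at 63, so for limits of 2^63 or more it returns 63 even though that is not sufficient.

```c
#include <stdint.h>

/* Smallest multiple of 3 such that (1 << bits) - 1 >= xmax, capped at 63. */
static unsigned icbrt_bits_for(uint64_t xmax)
{
    unsigned bits = 3;
    while (bits < 63 && (xmax >> bits) != 0)
        bits += 3;
    return bits;
}
```

For the example in the text, icbrt_bits_for(999999) gives 21, that is 3*7. In practice the value would be worked out once and written into icbrt() as a constant, so that the compiler can still optimise the loop bounds.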
64-bit vs. 32-bit integers
Though the above is written for 64-bit integers, the logic is the same for 32-bit:
#include <stdint.h>
static inline uint32_t icbrt(uint32_t x) {
uint32_t b, y, bits = 3*7; /* or whatever is appropriate */
int s;
for (s = bits - 3; s >= 0; s -= 3) {
if ((x >> s) == 0)
continue;
x -= 1 << s;
y = 1;
for (s = s - 3; s >= 0; s -= 3) {
y += y;
b = 1 + 3*y*(y + 1);
if ((x >> s) >= b) {
x -= b << s;
y += 1;
}
}
return y;
}
return 0;
}
I would research how to do it by hand, and then translate that into a computer algorithm, working in base 2 rather than base 10.
We end up with an algorithm something like (pseudocode):
Find the largest n such that (1 << 3n) <= input (assuming input >= 1).
result = 1 << n.
For i in (n-1)..0:
    if ((result | 1 << i)**3) <= input:
        result |= 1 << i.
We can optimize the calculation of (result | 1 << i)**3 by observing that the bitwise-or is equivalent to addition here (bit i of result is still clear), expanding to result**3 + 3 * (1 << i) * result**2 + 3 * (1 << 2*i) * result + (1 << 3*i), caching the values of result**3 and result**2 between iterations, and using shifts instead of multiplication.
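The bit-by-bit idea can be sketched directly in C for 64-bit inputs. This is an unoptimized illustration, not the cached-shift version described above. The name icbrt_bitwise is hypothetical. The comparisons are written as <= so that perfect cubes map to their exact root, and the cube test is done with a division (cand <= x / cand^2) so that nothing overflows for inputs near 2^64.

```c
#include <stdint.h>

/* floor(cbrt(x)), found by setting result bits from the top down. */
static uint32_t icbrt_bitwise(uint64_t x)
{
    if (x == 0)
        return 0;

    /* Largest n with 2^(3n) <= x; n cannot exceed 21 for 64-bit x. */
    int n = 0;
    while (n < 21 && (x >> (3 * (n + 1))) != 0)
        ++n;

    uint64_t result = (uint64_t)1 << n;
    for (int i = n - 1; i >= 0; --i) {
        uint64_t cand = result | ((uint64_t)1 << i);
        /* cand^3 <= x  <=>  cand <= x / cand^2 (integer division);
           cand < 2^22, so cand * cand fits easily in 64 bits. */
        if (cand <= x / (cand * cand))
            result = cand;
    }
    return (uint32_t)result;
}
```

The division makes this slower than the shift-only variants elsewhere in this thread; it is meant only to make the pseudocode concrete.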
You can try to adapt this C algorithm:
#include <limits.h>
// return a number that, when multiplied by itself twice, makes N.
unsigned cube_root(unsigned n){
unsigned a = 0, b;
for (int c = sizeof(unsigned) * CHAR_BIT / 3 * 3 ; c >= 0; c -= 3) {
a <<= 1;
b = a + (a << 1), b = b * a + b + 1 ;
if (n >> c >= b)
n -= b << c, ++a;
}
return a;
}
There is also:
// return the number that was multiplied by itself to reach N.
unsigned square_root(const unsigned num) {
unsigned a, b, c, d;
for (b = a = num, c = 1; a >>= 1; ++c);
for (c = 1u << ((c - 1) & ~1u); c; c >>= 2) { /* highest power of 4 <= num; avoids shifting by 32 when num >= 2^31 */
d = a + c;
a >>= 1;
if (b >= d)
b -= d, a += c;
}
return a;
}
Source