I'm trying to separate the numbers of a sequence, and store them all in an array.
From what little I have seen of C, I am doing nothing wrong, and the program compiles perfectly, but the moment it goes to print the numbers, it just doesn't work.
The explanation of what I'm trying to do is at the end.
long int number;
do
{
number = get_long("number:\n");
}
while (number<1 || number>9999999999999999);
int numbers[16], n;
//We separate the numbers, from right to left
for (long int I=10; I>100000000000000000; I*=10)
{
for (long int J=1; J>100000000000000000; J*=10)
{
for (n=0; n>16; n++)
{
numbers[n]=(number%I)/J;
}
}
}
printf("%i\n", numbers[1]);
It is supposed to accept numbers of 1 digit up until 16 digits, and separate each digit.
For example, if we had 16, it would separate 1 and 6 into two digits, making the 6 the first digit, and the 1 the second, so it would start counting from right to left. It's supposed to store each digit in an array of 16 spaces. Then I would just print the second digit, just to make sure it does work, but when I run it, it just gives me 0; meaning it doesn't work, but I see no problem with it.
It probably is that I'm either too inexperienced, or I don't have the necessary knowledge, to be able to see the problem in the code.
You have incorrect loop termination checks, so the loops are never entered.
After reversing > to <, you end up evaluating the body of the inner loop 16*16*16 = 4096 times even though there are only 16 digits. There should only be one loop of 16 iterations.
A long int is only guaranteed to support numbers up to 2,147,483,647. Instead, use one of long long int, int_least64_t or int64_t, or one of their unsigned counterparts.
You were attempting to write the following:
uint64_t mod = 10; // Formerly named I
uint64_t div = 1; // Formerly named J
for (int n=0; n<16; ++n) {
numbers[n] = ( number % mod ) / div;
mod *= 10;
div *= 10;
}
But that's a bit more complicated than needed. Let's swap the order of the division and modulus.
uint64_t div = 1;
for (int n=0; n<16; ++n) {
numbers[n] = ( number / div ) % 10;
div *= 10;
}
Finally, we can simplify a bit more if we don't mind clobbering number in the process.
for (int n=0; n<16; ++n) {
numbers[n] = number % 10;
number /= 10;
}
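Putting the last variant into a complete function, here is a minimal sketch. The name split_digits, the MAX_DIGITS constant and the returned digit count are my own additions for illustration, not part of the original program.

```c
#include <stdint.h>

#define MAX_DIGITS 16  /* assumed upper bound, matching the question */

/* Stores the decimal digits of `number` in `digits`, least significant
 * digit first. Unused high slots are set to 0. Returns the number of
 * significant digits (at least 1, so 0 counts as one digit). */
int split_digits(uint64_t number, int digits[MAX_DIGITS])
{
    int count = 0;
    for (int n = 0; n < MAX_DIGITS; ++n) {
        digits[n] = (int)(number % 10);
        if (number != 0 || n == 0)
            count = n + 1;      /* this position still holds a real digit */
        number /= 10;
    }
    return count;
}
```

With number = 16, digits[0] is 6 and digits[1] is 1, which is the right-to-left order the question asks for.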
All of your for loops are using operator> when they should be using operator< instead. Thus the loop conditions are always false (10 is not greater than 100000000000000000, 1 is not greater than 100000000000000000, 0 is not greater than 16), so the loops are never entered, and numbers[] is left unfilled.
Fixing that, you still have a logic problem. Think of what the result of (number%I)/J is when number is 16 and I and J are large values. The result of operator/ is typically 0! On some loop iterations, numbers[] gets populated with correct values. But other iterations will then overwrite numbers[] with 0s. Once all of the loops are finished, only the 0s are left.
This Online Demo demonstrates this in action.
If using a long variable, the value range can be as small as -2147483648 to 2147483647 (this is the case on implementations with a 32-bit long, as noted by @Eric P in comments).
So the expression while (number<1 || number>9999999999999999); (and similar) does not make sense on such an implementation: number can never get anywhere near 9999999999999999. The same goes for the expression ...J>100000000000000000; J*=10). (It's really moot at this point, but > should be <.)
Consider using a string approach:
Using a null-terminated char array (C string) to hold the initial value, the essential steps are pretty straightforward and could include the following:
char number[17];//room for 16 characters + null terminator
scanf("%16s", number);//string comprised of maximum of 16 digits
int len = (int)strlen(number);//number of characters actually read
int num_array[len];//using VLA
memset(num_array, 0, sizeof num_array);//zero array
for(int i = 0;i < len; i++)
{
if(number[i] < '0' || number[i] > '9') break;//qualify input. Break if non-numeric
num_array[i] = number[i] - '0';//convert character to its numeric value
}
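The same steps can be wrapped in a small helper. This is only a sketch: the function name digits_from_string and its parameters are hypothetical, and it takes the string as an argument instead of calling scanf so that it is easy to test.

```c
#include <stddef.h>

/* Converts the leading decimal digits of `text` into `out`, in the order
 * they appear in the string, storing at most `cap` entries. Stops at the
 * first character that is not a digit (including the terminating NUL).
 * Returns the number of digits stored. */
size_t digits_from_string(const char *text, int *out, size_t cap)
{
    size_t count = 0;
    while (count < cap && text[count] >= '0' && text[count] <= '9') {
        out[count] = text[count] - '0';
        ++count;
    }
    return count;
}
```

Note that the digits come out left to right here; if the right-to-left order from the question is needed, read the result backwards starting at index count - 1.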
Hi: I have been ramping up on C and I have a couple of philosophical questions about arrays and pointers and how to make things simple, quick, and small, or at least balance the three, I suppose.
I imagine an MCU sampling an input every so often and storing the sample in an array, called "val", of size "NUM_TAPS". The index of 'val' gets decremented for the next sample after the current, so for instance if val[0] just got stored, the next value needs to go into val[NUM_TAPS-1].
At the end of the day I want to be able to refer to the newest sample as x[0] and the oldest sample as x[NUM_TAPS-1] (or equivalent).
It is a slightly different problem than many have solved on this and other forums describing rotating, circular, queue etc. buffers. I don't need (I think) a head and tail pointer because I always have NUM_TAPS data values. I only need to remap the indexes based on a "head pointer".
Below is the code I came up with. It seems to be working fine but it raises a few more questions I'd like to pose to the wider, much more expert community:
Is there a better way to assign indexes than a conditional assignment
(to wrap indexes < 0) with the modulus operator (to wrap indexes >
NUM_TAPS -1)? I can't think of a way that pointers to pointers would
help, but does anyone else have thoughts on this?
Instead of shifting the data itself as in a FIFO to organize the
values of x, I decided here to rotate the indexes. I would guess that
for data structures close to or smaller in size than the pointers
themselves that data moves might be the way to go but for very large
numbers (floats, etc.) perhaps the pointer assignment method is the
most efficient. Thoughts?
Is the modulus operator generally considered close in speed to
conditional statements? For example, which is generally faster?:
offset = (++offset)%N;
OR
offset++;
if (NUM_TAPS == offset) { offset = 0; }
Thank you!
#include <stdio.h>
#define NUM_TAPS 10
#define STARTING_VAL 0
#define HALF_PERIOD 3
void main (void) {
register int sample_offset = 0;
int wrap_offset = 0;
int val[NUM_TAPS];
int * pval;
int * x[NUM_TAPS];
int live_sample = 1;
//START WITH 0 IN EVERY LOCATION
pval = val; /* 1st address of val[] */
for (int i = 0; i < NUM_TAPS; i++) { *(pval + i) = STARTING_VAL ; }
//EVENT LOOP (SAMPLE A SQUARE WAVE EVERY PASS)
for (int loop = 0; loop < 30; loop++) {
if (0 == loop%HALF_PERIOD && loop > 0) {live_sample *= -1;}
*(pval + sample_offset) = live_sample; //really stupid square wave generator
//assign pointers in 'x' based on the starting offset:
for (int i = 0; i < NUM_TAPS; i++) { x[i] = pval+(sample_offset + i)%NUM_TAPS; }
//METHOD #1: dump the samples using pval:
//for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*(pval+(sample_offset + i)%NUM_TAPS)); }
//printf("\n");
//METHOD #2: dump the samples using x:
for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*x[i]); }
printf("\n");
sample_offset = (sample_offset - 1)%NUM_TAPS; //represents the next location of the sample to be stored, relative to pval
sample_offset = (sample_offset < 0 ? NUM_TAPS -1 : sample_offset); //wrap around if the sample_offset goes negative
}
}
The cost of a % operator is about 26 clock cycles, since it is implemented using the DIV instruction. An if statement is likely faster: its instructions are already in the pipeline, so the processor only has to skip a few of them, which it can do quickly.
Note that both solutions are slow compared to doing a BITWISE AND operation which takes only 1 clock cycle. For reference, if you want gory detail, check out this chart for the various instruction costs (measured in CPU Clock ticks)
http://www.agner.org/optimize/instruction_tables.pdf
The best way to do a fast modulo on a buffer index is to use a power of 2 value for the number of buffers so then you can use the quick BITWISE AND operator instead.
#define NUM_TAPS 16
With a power of 2 value for the number of buffers, you can use a bitwise AND to implement modulo very efficiently. Recall that bitwise AND with a 1 leaves the bit unchanged, while bitwise AND with a 0 leaves the bit zero.
So by doing a bitwise AND of NUM_TAPS-1 with your incremented index, assuming that NUM_TAPS is 16, then it will cycle through the values 0,1,2,...,14,15,0,1,...
This works because NUM_TAPS-1 equals 15, which is 00001111 in binary. The bitwise AND results in a value where only the last 4 bits are preserved, while any higher bits are zeroed.
So everywhere you use "% NUM_TAPS", you can replace it with "& (NUM_TAPS-1)". For example:
#define NUM_TAPS 16
...
//assign pointers in 'x' based on the starting offset:
for (int i = 0; i < NUM_TAPS; i++)
{ x[i] = pval + ((sample_offset + i) & (NUM_TAPS-1)); } // parentheses needed: + binds tighter than &
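To make the masking trick concrete, here is a small sketch using unsigned indexes. The helper names wrap, next_index and prev_index are made up for illustration; the only real requirement is that NUM_TAPS is a power of two.

```c
#define NUM_TAPS 16u                 /* must be a power of two */
#define MOD_MASK (NUM_TAPS - 1u)     /* 15, i.e. the low 4 bits */

/* Reduces any unsigned index to the range 0 .. NUM_TAPS-1. */
unsigned wrap(unsigned i) { return i & MOD_MASK; }

/* Index of the slot after i, wrapping from 15 back to 0. */
unsigned next_index(unsigned i) { return (i + 1u) & MOD_MASK; }

/* Index of the slot before i, wrapping from 0 back to 15.
 * Unsigned subtraction wraps around, and the mask keeps the low bits. */
unsigned prev_index(unsigned i) { return (i - 1u) & MOD_MASK; }
```

For unsigned values, wrap(i) gives the same result as i % NUM_TAPS, and prev_index(0) yields 15 without any conditional.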
Here is your code modified to work with BITWISE AND, which is the fastest solution.
#include <stdio.h>
#define NUM_TAPS 16 // Use a POWER of 2 for speed, 16=2^4
#define MOD_MASK (NUM_TAPS-1) // Saves typing and makes code clearer
#define STARTING_VAL 0
#define HALF_PERIOD 3
void main (void) {
register int sample_offset = 0;
int wrap_offset = 0;
int val[NUM_TAPS];
int * pval;
int * x[NUM_TAPS];
int live_sample = 1;
//START WITH 0 IN EVERY LOCATION
pval = val; /* 1st address of val[] */
for (int i = 0; i < NUM_TAPS; i++) { *(pval + i) = STARTING_VAL ; }
//EVENT LOOP (SAMPLE A SQUARE WAVE EVERY PASS)
for (int loop = 0; loop < 30; loop++) {
if (0 == loop%HALF_PERIOD && loop > 0) {live_sample *= -1;}
*(pval + sample_offset) = live_sample; //really stupid square wave generator
//assign pointers in 'x' based on the starting offset:
for (int i = 0; i < NUM_TAPS; i++) { x[i] = pval + ((sample_offset + i) & MOD_MASK); }
//METHOD #1: dump the samples using pval:
//for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*(pval + ((sample_offset + i) & MOD_MASK))); }
//printf("\n");
//METHOD #2: dump the samples using x:
for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ",*x[i]); }
printf("\n");
// sample_offset = (sample_offset - 1)%NUM_TAPS; //represents the next location of the sample to be stored, relative to pval
// sample_offset = (sample_offset < 0 ? NUM_TAPS -1 : sample_offset); //wrap around if the sample_offset goes negative
// MOD_MASK works faster than the above
sample_offset = (sample_offset - 1) & MOD_MASK;
}
}
At the end of the day I want to be able to refer to the newest sample as x[0] and the oldest sample as x[NUM_TAPS-1] (or equivalent).
Any way you implement this is very expensive, because each time you record a new sample, you have to move all the other samples (or pointers to them, or an equivalent). Pointers don't really help you here. In fact, using pointers as you do is probably a little more costly than just working directly with the buffer.
My suggestion would be to give up the idea of "remapping" indices persistently, and instead do it only virtually, as needed. I'd probably ease that and ensure it is done consistently by writing data access macros to use in place of direct access to the buffer. For example,
// expands to an expression designating the sample at the specified
// (virtual) index
#define SAMPLE(index) (val[((index) + sample_offset) % NUM_TAPS])
You would then use SAMPLE(n) instead of x[n] to read the samples.
I might consider also providing a macro for adding new samples, such as
// Updates sample_offset and records the given sample at the new offset
#define RECORD_SAMPLE(sample) do { \
sample_offset = (sample_offset + NUM_TAPS - 1) % NUM_TAPS; \
val[sample_offset] = sample; \
} while (0)
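For anyone who prefers functions to macros, the same idea can be sketched as follows. This swaps the macros for two plain functions (record_sample and get_sample are names I made up), and it assumes val and sample_offset are file-scope variables.

```c
#define NUM_TAPS 10

static int val[NUM_TAPS];      /* sample storage, initially all zero */
static int sample_offset = 0;  /* slot that holds the newest sample */

/* Steps the offset back by one slot (wrapping around) and stores the
 * sample there. Adding NUM_TAPS first keeps the operand of % non-negative. */
void record_sample(int sample)
{
    sample_offset = (sample_offset + NUM_TAPS - 1) % NUM_TAPS;
    val[sample_offset] = sample;
}

/* Returns the sample at virtual index `index`, where 0 is the newest
 * sample and NUM_TAPS - 1 is the oldest. */
int get_sample(int index)
{
    return val[(index + sample_offset) % NUM_TAPS];
}
```

After recording the values 1 through 12 in order, get_sample(0) is 12 and get_sample(NUM_TAPS - 1) is 3, with no data or pointers moved along the way.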
With regard to your specific questions:
Is there a better way to assign indexes than a conditional assignment (to wrap indexes < 0) with the modulus operator (to wrap
indexes > NUM_TAPS -1)? I can't think of a way that pointers to
pointers would help, but does anyone else have thoughts on this?
I would choose modulus over a conditional every time. Do, however, watch out for taking the modulus of a negative number (see above for an example of how to avoid doing so); such a computation may not mean what you think it means. For example -1 % 2 == -1, because C specifies that (a/b)*b + a%b == a for any a and b such that the quotient is representable.
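As a sketch of how to wrap an index that may be negative using modulus only, assuming n is positive and small enough that the addition cannot overflow (wrap_index is a hypothetical helper name):

```c
/* Wraps any int index into the range 0 .. n-1.
 * i % n lies in -(n-1) .. n-1, so adding n makes it non-negative,
 * and the second % brings it back below n. */
int wrap_index(int i, int n)
{
    return ((i % n) + n) % n;
}
```

For instance, wrap_index(-1, 10) gives 9, whereas the plain expression -1 % 10 evaluates to -1.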
Instead of shifting the data itself as in a FIFO to organize the values of x, I decided here to rotate the indexes. I would guess that
for data structures close to or smaller in size than the pointers
themselves that data moves might be the way to go but for very large
numbers (floats, etc.) perhaps the pointer assignment method is the
most efficient. Thoughts?
But your implementation does not rotate the indices. Instead, it shifts pointers. Not only is this about as expensive as shifting the data themselves, but it also adds the cost of indirection for access to the data.
Additionally, you seem to have the impression that pointer representations are small compared to representations of other built-in data types. This is rarely the case. Pointers are usually among the largest of a given C implementation's built-in data types. In any event, neither shifting around the data nor shifting around pointers is efficient.
Is the modulus operator generally considered close in speed to conditional statements? For example, which is generally faster?:
On modern machines, the modulus operator is much faster on average than a conditional whose result is difficult for the CPU to predict. CPUs these days have long instruction pipelines, and they perform branch prediction and corresponding speculative computation to enable them to keep these full when a conditional instruction is encountered, but when they discover that they have predicted incorrectly, they need to flush the whole pipeline and redo several computations. When that happens, it's a lot more expensive than a small number of unconditional arithmetical operations.
So my teacher is having us work with for loops and one of our assignments is to make a for loop that will change any base 2 number to base 10. I'll post what I have done so far. I'm only in AP Computer Science so the code will look amateurish.
public long getBaseTen( )
{
long ten=0;
for (int i = 0; i < binary.length()-1; i++)
{
if (binary.charAt(binary.length()-i-1) == '0');
ten += 0;
if (binary.charAt(binary.length()-i-1) == '1');
ten += Math.pow(2, i);
}
return ten;
}
binary is a string variable that contains the base 2 number earlier specified by the user. I need to convert this base 2 string into base 10 and store that number into long ten. Right now whenever I call this method, I always get the same number depending on the length of the string. If the string is 2 letters long, it will always return a 1, if it's 3 letters long, it will always return a 3, if it's 4 letters long, it will always return a 7 and so on. Help would be very much appreciated.
The problem in your code is that your if statements are terminated too early:
if (binary.charAt(binary.length()-i-1) == '0');
ten += 0;
should be
if (binary.charAt(binary.length()-i-1) == '0'){
ten += 0;
}
There are, of course, some other things that could be done differently, but you'll figure that out along the way.
Remove the semicolons at the end of the "if" lines. They shouldn't be there; they're being interpreted as
if(whatever)
; // null statement -- do nothing
(You could also add braces around the block of code controlled by the if, but that's optional when you're just trying to control a single statement. Some folks always use the braces, but that decision is very much a matter of coding style.)
A string of 0's and 1's follows a power-of-two rule. The rightmost digit is worth 2^0, then 2^1, 2^2 and so on, with the exponent increasing by one (and the value doubling) at each position. Knowing this, you can write a simple for loop to do the conversion.
For example:
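long tens = 0;
for (int i = 1; i <= binary.length(); i++) {
    // Only '1' digits contribute; the digit at index i-1 is worth 2^(length-i)
    if (binary.charAt(i - 1) == '1') {
        tens += (long) Math.pow(2, binary.length() - i);
    }
}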
For example, if the binary number is 0101, we know this to be 5. We start with binary.length()-1, which is 3; perfect, since the leftmost digit 0 is weighted by 2^3. The second digit 1 is weighted by 2^2, and binary.length()-i at that point is 2.
If you follow the logic this should work, may need a small syntax fix.
You don't need two if statements; one is enough:
public long getBaseTen()
{
long ten=0;
for (int i = 0; i < binary.length(); i++)
{
if (binary.charAt(i) == '1')
ten += Math.pow(2, binary.length()-i-1);
}
return ten;
}
Note that the loop condition i < binary.length()-1 is wrong too: it stops one character short.
You can see how it works here http://ideone.com/swzitQ
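For comparison, here is a minimal sketch of the same conversion written in C rather than Java. It accumulates the value by doubling instead of calling a power function; the name binary_to_decimal and the -1 error return are my own choices, and it assumes the input has few enough digits for the result to fit in a long long.

```c
/* Converts a string of '0' and '1' characters to its numeric value.
 * Returns -1 if any other character is found. An empty string yields 0. */
long long binary_to_decimal(const char *binary)
{
    long long value = 0;
    for (const char *p = binary; *p != '\0'; ++p) {
        if (*p != '0' && *p != '1')
            return -1;                    /* not a binary digit */
        value = value * 2 + (*p - '0');   /* shift left, add new digit */
    }
    return value;
}
```

Reading left to right and doubling as you go avoids both the index arithmetic and the floating-point power call.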
I need a fast way to get the position of all one bits in a 64-bit integer. For example, given x = 123703, I'd like to fill an array idx[] = {0, 1, 2, 4, 5, 8, 9, 13, 14, 15, 16}. We can assume we know the number of bits a priori. This will be called 10^12 to 10^15 times, so speed is of the essence. The fastest answer I've come up with so far is the following monstrosity, which uses each byte of the 64-bit integer as an index into tables that give the number of bits set in that byte and the positions of the ones:
int64_t x; // this is the input
unsigned char idx[K]; // this is the array of K bits that are set
unsigned char *dst=idx, *src;
unsigned char zero, one, two, three, four, five; // these hold the 0th-5th bytes
zero = x & 0x0000000000FFUL;
one = (x & 0x00000000FF00UL) >> 8;
two = (x & 0x000000FF0000UL) >> 16;
three = (x & 0x0000FF000000UL) >> 24;
four = (x & 0x00FF00000000UL) >> 32;
five = (x & 0xFF0000000000UL) >> 40;
src=tab0+tabofs[zero ]; COPY(dst, src, n[zero ]);
src=tab1+tabofs[one ]; COPY(dst, src, n[one ]);
src=tab2+tabofs[two ]; COPY(dst, src, n[two ]);
src=tab3+tabofs[three]; COPY(dst, src, n[three]);
src=tab4+tabofs[four ]; COPY(dst, src, n[four ]);
src=tab5+tabofs[five ]; COPY(dst, src, n[five ]);
where COPY is a switch statement to copy up to 8 bytes, n is an array of the number of bits set in a byte and tabofs gives the offset into tabX, which holds the positions of the set bits in the X-th byte. This is about 3x faster than unrolled loop-based methods with __builtin_ctz() on my Xeon E5-2609. (See below.) I am currently iterating x in lexicographical order for a given number of bits set.
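For readers without the full source, here is a simplified sketch of the table idea. It is not the code above: it uses a single table shared by all bytes and adds the byte offset at run time, and the names bit_count, bit_positions, init_tables and extract_positions are hypothetical.

```c
#include <stdint.h>

static unsigned char bit_count[256];        /* set bits in each byte value */
static unsigned char bit_positions[256][8]; /* their positions, ascending */

/* Fills both tables; call once before extract_positions(). */
void init_tables(void)
{
    for (int v = 0; v < 256; ++v) {
        int n = 0;
        for (int b = 0; b < 8; ++b)
            if (v & (1 << b))
                bit_positions[v][n++] = (unsigned char)b;
        bit_count[v] = (unsigned char)n;
    }
}

/* Writes the positions of the set bits of x into idx in ascending order
 * and returns how many were written. idx needs room for 64 entries. */
int extract_positions(uint64_t x, unsigned char *idx)
{
    int total = 0;
    for (int byte = 0; byte < 8; ++byte) {
        unsigned v = (unsigned)(x >> (8 * byte)) & 0xFFu;
        for (int k = 0; k < bit_count[v]; ++k)
            idx[total++] = (unsigned char)(bit_positions[v][k] + 8 * byte);
    }
    return total;
}
```

The version in the question trades memory for speed by keeping one pre-offset table per byte and unrolling both loops.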
Is there a better way?
EDIT: Added an example (that I have subsequently fixed). Full code is available here: http://pastebin.com/79X8XL2P . Note: GCC with -O2 seems to optimize it away, but Intel's compiler (which I used to compose it) doesn't...
Also, let me give some additional background to address some of the comments below. The goal is to perform a statistical test on every possible subset of K variables out of a universe of N possible explanatory variables; the specific target right now is N=41, but I can see some projects needing N up to 45-50. The test basically involves factorizing the corresponding data submatrix. In pseudocode, something like this:
double doTest(double *data, int64_t model) {
int nidx, idx[];
double submatrix[][];
nidx = getIndices(model, idx); // get the locations of ones in model
// copy data into submatrix
for(int i=0; i<nidx; i++) {
for(int j=0; j<nidx; j++) {
submatrix[i][j] = data[idx[i]][idx[j]];
}
}
factorize(submatrix, nidx);
return the_answer;
}
I coded up a version of this for an Intel Phi board that should complete the N=41 case in about 15 days, of which ~5-10% of the time is spent in a naive getIndices() so right off the bat a faster version could save a day or more. I'm working on an implementation for NVidia Kepler too, but unfortunately the problem I have (ludicrous numbers of small matrix operations) is not ideally suited to the hardware (ludicrously large matrix operations). That said, this paper presents a solution that seems to achieve hundreds of GFLOPS/s on matrices of my size by aggressively unrolling loops and performing the entire factorization in registers, with the caveat that the dimensions of the matrix be defined at compile-time. (This loop unrolling should help reduce overhead and improve vectorization in the Phi version too, so getIndices() will become more important!) So now I'm thinking my kernel should look more like:
double *data; // move data to GPU/Phi once into shared memory
template<unsigned int K> double doTestUnrolled(int *idx) {
double submatrix[K][K];
// copy data into submatrix
#pragma unroll
for(int i=0; i<K; i++) {
#pragma unroll
for(int j=0; j<K; j++) {
submatrix[i][j] = data[idx[i]][idx[j]];
}
}
factorizeUnrolled<K>(submatrix);
return the_answer;
}
The Phi version solves each model in a 'cilk_for' loop from model=0 to 2^N (or, rather, a subset for testing), but now in order to batch work for the GPU and amortize the kernel launch overhead I have to iterate model numbers in lexicographical order for each of K=1 to 41 bits set (as doynax noted).
EDIT 2: Now that vacation is over, here are some results on my Xeon E5-2602 using icc version 15. The code that I used to benchmark is here: http://pastebin.com/XvrGQUat. I perform the bit extraction on integers that have exactly K bits set, so there is some overhead for the lexicographic iteration measured in the "Base" column in the table below. These are performed 2^30 times with N=48 (repeating as necessary).
"CTZ" is a loop that uses the GCC intrinsic __builtin_ctzll to get the lowest order bit set:
for(int i=0; i<K; i++) {
idx[i] = __builtin_ctzll(tmp);
lb = tmp & -tmp; // get lowest bit
tmp ^= lb; // remove lowest bit from tmp
}
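For reference, a self-contained sketch of the CTZ approach might look like this. Unlike the loop above it does not assume K is known and simply runs until no bits are left; positions_ctz is a made-up name, and __builtin_ctzll requires GCC or Clang.

```c
#include <stdint.h>

/* Writes the positions of the set bits of x into idx in ascending order
 * and returns how many were written. idx needs room for 64 entries.
 * __builtin_ctzll is undefined for 0, so the loop checks x first. */
int positions_ctz(uint64_t x, unsigned char *idx)
{
    int count = 0;
    while (x != 0) {
        idx[count++] = (unsigned char)__builtin_ctzll(x);
        x &= x - 1;   /* clear the lowest set bit */
    }
    return count;
}
```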
Mark is Mark's branchless for loop:
for(int i=0; i<K; i++) {
*dst = i;
dst += x & 1;
x >>= 1;
}
Tab1 is my original table-based code with the following copy macro:
#define COPY(d, s, n) \
switch(n) { \
case 8: *(d++) = *(s++); \
case 7: *(d++) = *(s++); \
case 6: *(d++) = *(s++); \
case 5: *(d++) = *(s++); \
case 4: *(d++) = *(s++); \
case 3: *(d++) = *(s++); \
case 2: *(d++) = *(s++); \
case 1: *(d++) = *(s++); \
case 0: break; \
}
Tab2 is the same code as Tab1, but the copy macro just moves 8 bytes as a single copy (taking ideas from doynax and Lưu Vĩnh Phúc... but note this does not ensure alignment):
#define COPY2(d, s, n) { *((uint64_t *)d) = *((uint64_t *)s); d+=n; }
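Since COPY2 casts unsigned char pointers to uint64_t pointers, it can run into alignment and aliasing trouble. A hedged alternative sketch uses memcpy with a fixed size of 8, which compilers commonly turn into a single 64-bit move. The function name copy8_advance is hypothetical, and I have not benchmarked it against the macro.

```c
#include <string.h>

/* Copies 8 bytes from src to *dst unconditionally, then advances *dst
 * by n (0..8), so only the first n copied bytes are kept. Both buffers
 * must have at least 8 accessible bytes at the given positions. */
static inline void copy8_advance(unsigned char **dst,
                                 const unsigned char *src, int n)
{
    memcpy(*dst, src, 8);
    *dst += n;
}
```

As with COPY2, the destination buffer needs some slack at the end, because up to 7 bytes past the kept data are overwritten on each call.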
Here are the results. I guess my initial claim that Tab1 is 3x faster than CTZ only holds for large K (where I was testing). Mark's loop is faster than my original code, but getting rid of the branch in the COPY2 macro takes the cake for K > 8.
K       Base      CTZ     Mark     Tab1     Tab2
001    4.97s    6.42s    6.66s   18.23s   12.77s
002    4.95s    8.49s    7.28s   19.50s   12.33s
004    4.95s    9.83s    8.68s   19.74s   11.92s
006    4.95s   16.86s    9.53s   20.48s   11.66s
008    4.95s   19.21s   13.87s   20.77s   11.92s
010    4.95s   21.53s   13.09s   21.02s   11.28s
015    4.95s   32.64s   17.75s   23.30s   10.98s
020    4.99s   42.00s   21.75s   27.15s   10.96s
030    5.00s  100.64s   35.48s   35.84s   11.07s
040    5.01s  131.96s   44.55s   44.51s   11.58s
I believe the key to performance here is to focus on the larger problem rather than on micro-optimizing the extraction of bit positions out of a random integer.
Judging by your sample code and previous SO question you are enumerating all words with K bits set in order, and extracting the bit indices out of these. This greatly simplifies matters.
If so then instead of rebuilding the bit position each iteration try directly incrementing the positions in the bit array. Half of the time this will involve a single loop iteration and increment.
Something along these lines:
// Walk through all len-bit words with num-bits set in order
void enumerate(size_t num, size_t len) {
size_t i;
unsigned int bitpos[64 + 1];
// Seed with the lowest word plus a sentinel
for(i = 0; i < num; ++i)
bitpos[i] = i;
bitpos[i] = 0;
// Here goes the main loop
do {
// Do something with the resulting data
process(bitpos, num);
// Increment the least-significant series of consecutive bits
for(i = 0; bitpos[i + 1] == bitpos[i] + 1; ++i)
bitpos[i] = i;
// Stop on reaching the top
} while(++bitpos[i] != len);
}
// Test function
void process(const unsigned int *bits, size_t num) {
do
printf("%u ", bits[--num]);
while(num);
putchar('\n');
}
Not particularly optimized but you get the general idea.
Here's something very simple which might be faster - no way to know without testing. Much will depend on the number of bits set vs. the number unset. You could unroll this to remove branching altogether but with today's processors I don't know if it would speed up at all.
unsigned char idx[K+1]; // need one extra for overwrite protection
unsigned char *dst=idx;
for (unsigned char i = 0; i < 50; i++)
{
*dst = i;
dst += x & 1;
x >>= 1;
}
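Wrapped into a function covering all 64 bits, the branchless loop could be sketched like this. The name positions_branchless and the return value are my additions.

```c
#include <stdint.h>

/* Writes the positions of the set bits of x into idx in ascending order
 * and returns how many were found. Every iteration stores i, but dst
 * only moves forward when bit i is set, so unset bits get overwritten.
 * idx needs one slot more than the number of set bits as scratch space;
 * 65 entries are always enough. */
int positions_branchless(uint64_t x, unsigned char *idx)
{
    unsigned char *dst = idx;
    for (unsigned char i = 0; i < 64; i++) {
        *dst = i;        /* always write the candidate position */
        dst += x & 1;    /* keep it only if the current bit is set */
        x >>= 1;
    }
    return (int)(dst - idx);
}
```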
P.S. your sample output in the question is wrong, see http://ideone.com/2o032E
As a minimal modification:
int64_t x;
char idx[K+1];
char *dst=idx;
const int BITS = 8;
for (int i = 0 ; i < 64+BITS; i += BITS) {
int y = (x & ((1<<BITS)-1));
char* end = strcat(dst, tab[y]); // tab[y] is a _string_
for (; dst != end; ++dst)
{
*dst += (i - 1); // tab[] is null-terminated so bit positions are 1 to BITS.
}
x >>= BITS;
}
The choice of BITS determines the size of the table. 8, 13 and 16 are logical choices. Each entry is a string, zero-terminated and containing bit positions with 1 offset. I.e. tab[5] is "\x03\x01". The inner loop fixes this offset.
Slightly more efficient: replace the strcat and inner loop by
char const* ptr = tab[y];
while (*ptr)
{
*dst++ = *ptr++ + (i-1);
}
Loop unrolling can be a bit of a pain if the loop contains branches, because copying those branch statements doesn't help the branch predictor. I'll happily leave that decision to the compiler.
One thing I'm considering is that tab[y] is an array of pointers to strings. These are highly similar: "\x1" is a suffix of "\x3\x1". In fact, each string which doesn't start with "\x8" is a suffix of a string which does. I'm wondering how many unique strings you need, and to what degree tab[y] is in fact needed. E.g. by the logic above, tab[128+x] == tab[x]-1.
[edit]
Nevermind, you definitely need 128 tab entries starting with "\x8" since they're never the suffix of another string. Still, the tab[128+x] == tab[x]-1 rule means that you can save half the entries, but at the cost of two extra instructions: char const* ptr = tab[x & 0x7F] - ((x>>7) & 1). (Set up tab[] to point after the \x8)
Using char won't increase speed; in fact it often needs more ANDing and sign/zero extension during calculation. Smaller integer types should only be used for very large arrays that need to fit in cache.
Another thing you can improve is the COPY macro. Instead of copying byte by byte, copy a whole word at a time where possible:
static inline void COPY(unsigned char *dst, unsigned char *src, int n)
{
switch(n) { // remember to align dst and src when declaring
case 8:
*((int64_t*)dst) = *((int64_t*)src);
break;
case 7:
*((int32_t*)dst) = *((int32_t*)src);
*((int16_t*)(dst + 4)) = *((int16_t*)(src + 4));
dst[6] = src[6];
break;
case 6:
*((int32_t*)dst) = *((int32_t*)src);
*((int16_t*)(dst + 4)) = *((int16_t*)(src + 4));
break;
case 5:
*((int32_t*)dst) = *((int32_t*)src);
dst[4] = src[4];
break;
case 4:
*((int32_t*)dst) = *((int32_t*)src);
break;
case 3:
*((int16_t*)dst) = *((int16_t*)src);
dst[2] = src[2];
break;
case 2:
*((int16_t*)dst) = *((int16_t*)src);
break;
case 1:
dst[0] = src[0];
break;
case 0:
break;
}
}
Also, since tabofs[x] and n[x] are often accessed close together, try placing them next to each other in memory so that they are always in cache at the same time:
struct TAB_N
{
int16_t n, tabofs;
} tab_n[256]; // an array of 256 entries, not a typedef
src=tab0+tab_n[b0].tabofs; COPY(dst, src, tab_n[b0].n);
src=tab0+tab_n[b1].tabofs; COPY(dst, src, tab_n[b1].n);
src=tab0+tab_n[b2].tabofs; COPY(dst, src, tab_n[b2].n);
src=tab0+tab_n[b3].tabofs; COPY(dst, src, tab_n[b3].n);
src=tab0+tab_n[b4].tabofs; COPY(dst, src, tab_n[b4].n);
src=tab0+tab_n[b5].tabofs; COPY(dst, src, tab_n[b5].n);
Last but not least, gettimeofday is not meant for performance measurement. Use a high-resolution timer instead, such as QueryPerformanceCounter on Windows or clock_gettime with CLOCK_MONOTONIC on POSIX systems; it's much more precise.
Your code uses a 1-byte (256-entry) index table. You can speed it up by about a factor of 2 if you use a 2-byte (65536-entry) index table.
Unfortunately, you probably cannot extend that further: for a 3-byte index the table size would be at least 16MB, which is not likely to fit into the CPU's local cache, and it would only make things slower.
Assuming sparsity in number of set bits,
int count = 0;
uint64_t tmp_bitmap = x; // keep all 64 bits of x
while (tmp_bitmap > 0) {
int next_psn = __builtin_ffsll(tmp_bitmap) - 1; // 64-bit variant of ffs
tmp_bitmap &= (tmp_bitmap-1); // clear the lowest set bit
id[count++] = next_psn;
}
The question is what are you going to do with the collection of positions?
If you have to iterate many times over it, then yes, it might be interesting to gather them once as you are doing now, and iterate many.
But if it's for iterating just once or few times, then you might think of not creating an intermediate array of positions, and just invoke a processing block closure/function at each encountered 1 while iterating on bits.
Here is a naive example of bit iterator I wrote in Smalltalk:
LargePositiveInteger>>bitsDo: aBlock
| mask offset |
1 to: self digitLength do: [:iByte |
offset := (iByte - 1) << 3.
mask := (self digitAt: iByte).
[mask = 0]
whileFalse:
[aBlock value: mask lowBit + offset.
mask := mask bitAnd: mask - 1]]
A LargePositiveInteger is an Integer of arbitrary length composed of byte digits.
lowBit answers the rank of the lowest set bit and is implemented as a lookup table with 256 entries.
In C++ 2011 you can easily pass a closure, so it should be easy to translate.
uint64_t x;
unsigned int mask;
void (*process_bit_position)(unsigned int);
unsigned char offset = 0;
unsigned char lowBitTable[16] = {0,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0}; // 0-based, first entry is unused
while( x )
{
mask = x & 0xFUL;
while (mask)
{
process_bit_position( lowBitTable[mask]+offset );
mask &= mask - 1;
}
offset += 4;
x >>= 4;
}
The example is demonstrated with a 4 bit table, but you can easily extend it to 13 bits or more if it fits in cache.
To help branch prediction, the inner loop could be rewritten as for(i=0;i<nbit;i++) with an additional table lookup nbit=numBitTable[mask], then unrolled with a switch (the compiler might do that itself), but I'll let you measure how it performs first...
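Here is the same 4-bit table loop as a self-contained sketch with a callback. The names for_each_set_bit, bit_callback and add_position, as well as the extra context pointer, are my own additions for illustration.

```c
#include <stdint.h>

/* Called once per set bit; `context` is passed through unchanged. */
typedef void (*bit_callback)(unsigned position, void *context);

/* Invokes `callback` with the position of every set bit of x,
 * lowest position first, processing 4 bits at a time. */
void for_each_set_bit(uint64_t x, bit_callback callback, void *context)
{
    /* Index of the lowest set bit of a 4-bit value; entry 0 is unused. */
    static const unsigned char low_bit[16] =
        {0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0};
    unsigned offset = 0;
    while (x) {
        unsigned mask = (unsigned)(x & 0xF);
        while (mask) {
            callback(low_bit[mask] + offset, context);
            mask &= mask - 1;   /* clear the lowest set bit */
        }
        offset += 4;
        x >>= 4;
    }
}

/* Example callback: adds each reported position to the unsigned sum
 * that `context` points to. */
void add_position(unsigned position, void *context)
{
    *(unsigned *)context += position;
}
```

Calling for_each_set_bit(x, add_position, &sum) with sum initialized to 0 leaves the sum of all set bit positions in sum, without building an intermediate array.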
Has this been found to be too slow?
Small and crude, but it's all in the cache and CPU registers;
void mybits(uint64_t x, unsigned char *idx)
{
unsigned char n = 0;
do {
if (x & 1) *(idx++) = n;
n++;
} while (x >>= 1); // If x is signed this will never end
*idx = (unsigned char) 255; // List Terminator
}
It's still 3 times faster to unroll the loop and produce an array of 64 true/false values (which isn't quite what's wanted)
void mybits_3_2(uint64_t x, idx_type idx[])
{
#define SET(i) (idx[i] = (x >> (i)) & 1) /* 0 or 1, safe for any idx_type and for i >= 32 */
SET( 0);
SET( 1);
SET( 2);
SET( 3);
...
SET(63);
}
Here's some tight code, written for 1-byte (8-bits), but it should easily, obviously expand to 64-bits.
int main(void)
{
int x = 187;
int ans[8] = {-1,-1,-1,-1,-1,-1,-1,-1};
int idx = 0;
while (x)
{
switch (x & ~(x-1))
{
case 0x01: ans[idx++] = 0; break;
case 0x02: ans[idx++] = 1; break;
case 0x04: ans[idx++] = 2; break;
case 0x08: ans[idx++] = 3; break;
case 0x10: ans[idx++] = 4; break;
case 0x20: ans[idx++] = 5; break;
case 0x40: ans[idx++] = 6; break;
case 0x80: ans[idx++] = 7; break;
}
x &= x-1;
}
getchar();
return 0;
}
Output array should be:
ans = {0,1,3,4,5,7,-1,-1};
If I take "I need a fast way to get the position of all one bits in a 64-bit integer" literally...
I realise this is a few weeks old, however and out of curiosity, I remember way back in my assembly days with the CBM64 and Amiga using an arithmetic shift and then examining the carry flag - if it's set then the shifted bit was 1, if clear then it's zero
e.g. for an arithmetic shift left (examining from bit 64 to bit 0)....
pseudo code (ignore instruction mix etc errors and oversimplification...been a while):
move #64+1, counter
loop. ASL 64bitinteger
BCS carryset
decctr. dec counter
bne loop
exit
carryset.
//store #counter-1 (i.e. bit position) in datastruct indexed by counter
jmp decctr
...I hope you get the idea.
I've not used assembly since then but I'm wondering if we could use some C++ in-line assembly similar to the above to do something similar here. We could do the whole conversion in assembly (very few lines of code), building up an appropriate data structure. C++ could simply examine the answer.
If this is possible then I'd imagine it to be pretty fast.
A simple solution, but perhaps not the fastest, depending on the times of the log and pow functions:
#include <math.h>
#include <stdio.h>
void getSetBits(unsigned long num){
int bit;
while(num){
bit = log2(num);
num -= pow(2, bit);
printf("%i\n", bit); // use bit number
}
}
Complexity: O(D), where D is the number of set bits.
Hi I got this division alg to display the integer and floating point values.
How can I get MAX_REM? It is supposed to be the size of the buffer where our characters are going to be stored, so the size has to be the number of digits, but I don't know how to get that. Thanks!
void divisionAlg(unsigned int value)
{
int MAX_BASE=10;
const char *charTable = {"0123456789ABCDEF"}; // lookup table for converting remainders
char rembuf[MAX_REM + 1]; // holds remainder(s) and provision null at the end
int index; //
int i; // loop variable
unsigned int rem; // remainder
unsigned int base; // we'll be using base 10
ssize_t numWritten; // holds number of bytes written from write() system call
base = 10;
// validate base
if (base < 2 || base > MAX_BASE)
err_sys("oops, the base is wrong");
// For some reason, every time this method is called after the initial call, rembuf
// is magically filled with a bunch of garbage; this just sets everything to null.
// NOTE: memset() wasn't working either, so I have to use a stupid for-loop
for (i=0; i<MAX_REM; i++)
rembuf[i] = '\0';
rembuf[MAX_REM] = 0; // set last element to zero
index = MAX_REM; // start at the end of rembuf when adding in the remainders
do
{
// calculate remainder and divide valueBuff by the base
rem = value % base;
value /= base;
// convert remainder into ASCII value via lookup table and store in buffer
index--;
rembuf[index] = charTable[rem];
} while (value != 0);
// display value
if ((numWritten = write(STDOUT_FILENO, rembuf, MAX_REM + 1)) == -1)
err_sys("something went wrong with the write");
} // end of divisionAlg()
The formula for how many digits a number takes in a given base is:
digits = floor(log(number)/log(base))+1;
However, in this case, I'd probably just assume the worst case, since it's no more than 32, and calculating it will be "expensive". So just #define MAX_REM 32, and then keep track of how many digits you actually put into rembuf (you already have index for that, so it's no extra cost really). You'll obviously need to calculate the number of bytes to write out as well, but that shouldn't require any special math.
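A minimal sketch of that approach, assuming unsigned int is at most 32 bits wide. The function name to_base and its signature are hypothetical; it fills a caller-supplied buffer instead of calling write(), so the caller can output exactly the returned number of bytes.

```c
#include <stddef.h>
#include <string.h>

#define MAX_REM 32   /* worst case: 32 binary digits of a 32-bit value */

/* Writes `value` in `base` (2..16) into buf as a NUL-terminated string.
 * Returns the number of digits written, or 0 if the base is invalid.
 * buf must have room for MAX_REM + 1 characters. */
size_t to_base(unsigned int value, unsigned int base, char buf[MAX_REM + 1])
{
    static const char table[] = "0123456789ABCDEF";
    char rembuf[MAX_REM];
    size_t index = MAX_REM;   /* digits are filled in from the end */

    if (base < 2 || base > 16)
        return 0;
    do {
        rembuf[--index] = table[value % base];
        value /= base;
    } while (value != 0);

    size_t digits = MAX_REM - index;   /* how many slots were used */
    memcpy(buf, rembuf + index, digits);
    buf[digits] = '\0';
    return digits;
}
```

The digit count falls out of index for free, so no logarithm is needed: to_base(255, 16, buf) stores "FF" and returns 2.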