I'm translating code from Maple to C in order to optimize performance. In order to save time, I've hard coded a 2-dimensional array for the 3 cases that I need to run asap. Later I'll add functions that generate this array so that I can run any case.
Here's how I tried to define the array schur (N and dim are predetermined ints, and numPar is an int as well):
// load Schur functions
switch (N) {
case 3:
    numPar = 3;
    int schur[numPar][dim] = {
        {1,0,0,0},
        {0,1,1,0},
        {0,0,0,1},
    };
    break;
case 4:
    numPar = 5;
    int schur[numPar][dim] = {
        {1,0,0,0,0,0,0,0},
        {0,1,1,0,1,0,0,0},
        {0,0,1,0,0,1,0,0},
        {0,0,0,1,0,1,1,0},
        {0,0,0,0,0,0,0,1},
    };
    break;
case 5:
    numPar = 7;
    int schur[numPar][dim] = {
        {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
        {0,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0},
        {0,0,1,0,1,1,0,0,0,1,1,0,0,0,0,0},
        {0,0,0,1,0,1,1,0,0,1,1,0,1,0,0,0},
        {0,0,0,0,0,1,1,0,0,0,1,1,0,1,0,0},
        {0,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0},
        {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1},
    };
    break;
default:
}
Clearly this will not work. However, I'm at a loss as to how to rewrite it so that it does work. One idea is to flatten the array, but that will obfuscate my code rather badly later on. Suggestions are greatly appreciated.
You can allocate the multidimensional array to be as large as the largest case. Then, based on the switch case, fill only the portion you need and access only the portion you filled.
So for example for the 3 by 4 array:
int schur[7][16]; // sized for the largest case (N = 5: 7 x 16)

int staticArray[3][4] = {
    {1,0,0,0},
    {0,1,1,0},
    {0,0,0,1},
};
for (int i = 0; i < 3; ++i) {
    for (int j = 0; j < 4; ++j) {
        schur[i][j] = staticArray[i][j];
    }
}
Since you're concerned about space, and since your larger arrays appear to be mostly zeros with relatively few ones, you might want to consider a "sparse array" solution. Access speed would be much slower, but the amount of memory used might be much less.
Websearching on that phrase will find implementations; which one would be best depends on how you intend to use these arrays.
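For illustration, here is a minimal sketch of one such representation (my own, not any particular library): store only the (row, column) coordinates of the ones.

typedef struct {
    int row, col;
} Entry;

/* The N = 3 case from the question: ones at (0,0), (1,1), (1,2), (2,3). */
static const Entry schur3_ones[] = {
    {0, 0}, {1, 1}, {1, 2}, {2, 3}
};
static const int schur3_count = sizeof schur3_ones / sizeof schur3_ones[0];

/* Membership test by linear scan; a sorted array or hash map would be
   faster for the larger cases. */
static int schur_at(const Entry *ones, int count, int i, int j)
{
    for (int k = 0; k < count; k++)
        if (ones[k].row == i && ones[k].col == j)
            return 1;
    return 0;
}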
switch (N) {
    case 3: {
        numPar = 3;
        int tmp1[3][4] = {
            {1,0,0,0},
            {0,1,1,0},
            {0,0,0,1},
        }; // then copy this array to the array you want
        break;
    }
    case 4: {
        numPar = 5;
        int tmp2[5][8] = {
            {1,0,0,0,0,0,0,0},
            {0,1,1,0,1,0,0,0},
            {0,0,1,0,0,1,0,0},
            {0,0,0,1,0,1,1,0},
            {0,0,0,0,0,0,0,1},
        }; // then copy this array to the array you want
        break;
    }
    case 5: {
        numPar = 7;
        int tmp3[7][16] = {
            {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
            {0,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0},
            {0,0,1,0,1,1,0,0,0,1,1,0,0,0,0,0},
            {0,0,0,1,0,1,1,0,0,1,1,0,1,0,0,0},
            {0,0,0,0,0,1,1,0,0,0,1,1,0,1,0,0},
            {0,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0},
            {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1},
        }; // then copy this array to the array you want
        break;
    }
    default:
        break;
}
First, note the hopefully obvious problem: you can't use variables in the dimensions of an initialized array in C, only constants (C99 variable-length arrays do allow variable dimensions, but a VLA cannot have an initializer). For example, your first declaration could work like this:
int schur[][4] = {
    {1,0,0,0},
    {0,1,1,0},
    {0,0,0,1}
};
and the compiler will happily figure out how many rows to allocate (only the first dimension may be omitted) ... if, of course, you also weren't trying to declare the same variable multiple times in your switch statement. :-)
The second thing to keep in mind is that the construct:
int myArray[3][4] = { {1, 0, ... }, { 0, 1, ... }, ... };
declares one contiguous block of 3 x 4 integers, not an array of pointers. In the example above, schur is an array of 3 rows of 4 ints each; in an expression its name decays to a pointer to its first row, of type int (*)[4], which is not interchangeable with int **.
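A small illustration (mine, for clarity) of that decay rule:

#include <stdio.h>

int main(void)
{
    int a[3][4] = { {1,0,0,0}, {0,1,1,0}, {0,0,0,1} };
    int (*rows)[4] = a;          /* the array name converts to "pointer to row" */
    printf("%d\n", rows[1][2]);  /* prints 1: the same element as a[1][2] */
    return 0;
}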
This being C, there are of course a number of different ways to accomplish what you're trying to do. (Steve's Law of Computing: "If there exists one way to do something, there exists an infinite number of ways to do the same thing.")
What comes to mind first for the three cases you show above is to declare the 3 arrays you need, then just return the appropriate one from the switch statement:
int schur3[][4] = {
    {1,0,0,0},
    {0,1,1,0},
    {0,0,0,1}
};
int schur4[][8] = {
    {1,0,0,0,0,0,0,0},
    {0,1,1,0,1,0,0,0},
    {0,0,1,0,0,1,0,0},
    {0,0,0,1,0,1,1,0},
    {0,0,0,0,0,0,0,1}
};
int schur5[][16] = {
    {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
    {0,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0},
    {0,0,1,0,1,1,0,0,0,1,1,0,0,0,0,0},
    {0,0,0,1,0,1,1,0,0,1,1,0,1,0,0,0},
    {0,0,0,0,0,1,1,0,0,0,1,1,0,1,0,0},
    {0,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0},
    {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1}
};
/* The three arrays have different row widths, so hand out a table of row
   pointers for each. */
int *schur3rows[] = { schur3[0], schur3[1], schur3[2] };
int *schur4rows[] = { schur4[0], schur4[1], schur4[2], schur4[3], schur4[4] };
int *schur5rows[] = { schur5[0], schur5[1], schur5[2], schur5[3],
                      schur5[4], schur5[5], schur5[6] };

/* Note that what you get is a pointer to an array of pointers! */
int * * getSchurArray(int N)
{
    switch (N)
    {
        case 3:
            return (schur3rows);
        case 4:
            return (schur4rows);
        case 5:
            return (schur5rows);
    }
    return 0; /* unsupported N */
}
(Caveat: No, I didn't run that through a compiler yet, so I won't guarantee there are no typos!)
Now, if you want to make this dynamic, and you really want to stick with C, you're going to have to use malloc(), which is how you do dynamic arrays in C. In your case, you need to do something along the lines of:
int * * createSchurArray(int numPar, int dim)
{
    /* malloc() requires number of bytes, which is number of entries */
    /* times the size of each entry. */
    int * * answer = malloc(numPar * sizeof(int *));
    for (int rowIndex = 0; rowIndex < numPar; rowIndex++)
    {
        answer[rowIndex] = malloc(dim * sizeof(int));
        for (int colIndex = 0; colIndex < dim; colIndex++)
        {
            answer[rowIndex][colIndex] = schurValue(numPar, dim, rowIndex, colIndex);
        }
    }
    return answer; /* hand the filled array back to the caller */
}
where implementation of:
int schurValue(int numPar, int dim, int rowIndex, int colIndex)
is left as an exercise for someone who understands what you're trying to do with Schur functions. :-)
(Oh, wait - did I break an "only one smiley per answer" rule?)
I need a fast way to get the position of all one bits in a 64-bit integer. For example, given x = 123703, I'd like to fill an array idx[] = {0, 1, 2, 4, 5, 8, 9, 13, 14, 15, 16}. We can assume we know the number of bits a priori. This will be called 10^12 - 10^15 times, so speed is of the essence. The fastest answer I've come up with so far is the following monstrosity, which uses each byte of the 64-bit integer as an index into tables that give the number of bits set in that byte and the positions of the ones:
int64_t x;            // this is the input
unsigned char idx[K]; // this is the array of K bits that are set
unsigned char *dst = idx, *src;
unsigned char zero, one, two, three, four, five; // these hold the 0th-5th bytes
// only six bytes (48 bits) are extracted, since N <= 48 in this problem
zero  =  x & 0x0000000000FFULL;
one   = (x & 0x00000000FF00ULL) >> 8;
two   = (x & 0x000000FF0000ULL) >> 16;
three = (x & 0x0000FF000000ULL) >> 24;
four  = (x & 0x00FF00000000ULL) >> 32;
five  = (x & 0xFF0000000000ULL) >> 40;
src = tab0 + tabofs[zero ]; COPY(dst, src, n[zero ]);
src = tab1 + tabofs[one  ]; COPY(dst, src, n[one  ]);
src = tab2 + tabofs[two  ]; COPY(dst, src, n[two  ]);
src = tab3 + tabofs[three]; COPY(dst, src, n[three]);
src = tab4 + tabofs[four ]; COPY(dst, src, n[four ]);
src = tab5 + tabofs[five ]; COPY(dst, src, n[five ]);
where COPY is a switch statement to copy up to 8 bytes, n is an array of the number of bits set in each byte, and tabofs gives the offset into tabX, which holds the positions of the set bits in the X-th byte. This is about 3x faster than unrolled loop-based methods with __builtin_ctz() on my Xeon E5-2609. (See below.) I am currently iterating x in lexicographical order for a given number of bits set.
Is there a better way?
EDIT: Added an example (that I have subsequently fixed). Full code is available here: http://pastebin.com/79X8XL2P. Note: GCC with -O2 seems to optimize it away, but Intel's compiler (which I used to compose it) doesn't...
Also, let me give some additional background to address some of the comments below. The goal is to perform a statistical test on every possible subset of K variables out of a universe of N possible explanatory variables; the specific target right now is N=41, but I can see some projects needing N up to 45-50. The test basically involves factorizing the corresponding data submatrix. In pseudocode, something like this:
double doTest(double *data, int64_t model) {
    int nidx, idx[];
    double submatrix[][];
    nidx = getIndices(model, idx); // get the locations of ones in model
    // copy data into submatrix
    for (int i = 0; i < nidx; i++) {
        for (int j = 0; j < nidx; j++) {
            submatrix[i][j] = data[idx[i]][idx[j]];
        }
    }
    factorize(submatrix, nidx);
    return the_answer;
}
I coded up a version of this for an Intel Phi board that should complete the N=41 case in about 15 days, of which ~5-10% of the time is spent in a naive getIndices(), so right off the bat a faster version could save a day or more. I'm working on an implementation for NVidia Kepler too, but unfortunately the problem I have (ludicrous numbers of small matrix operations) is not ideally suited to the hardware (ludicrously large matrix operations). That said, this paper presents a solution that seems to achieve hundreds of GFLOPS on matrices of my size by aggressively unrolling loops and performing the entire factorization in registers, with the caveat that the dimensions of the matrix be defined at compile-time. (This loop unrolling should help reduce overhead and improve vectorization in the Phi version too, so getIndices() will become more important!) So now I'm thinking my kernel should look more like:
double *data; // move data to GPU/Phi once into shared memory
template<unsigned int K> double doTestUnrolled(int *idx) {
    double submatrix[K][K];
    // copy data into submatrix
    #pragma unroll
    for (int i = 0; i < K; i++) {
        #pragma unroll
        for (int j = 0; j < K; j++) {
            submatrix[i][j] = data[idx[i]][idx[j]];
        }
    }
    factorizeUnrolled<K>(submatrix);
    return the_answer;
}
The Phi version solves each model in a `cilk_for` loop from model=0 to 2^N (or, rather, a subset for testing), but now in order to batch work for the GPU and amortize the kernel launch overhead I have to iterate model numbers in lexicographical order for each of K=1 to 41 bits set (as doynax noted).
EDIT 2: Now that vacation is over, here are some results on my Xeon E5-2609 using icc version 15. The code that I used to benchmark is here: http://pastebin.com/XvrGQUat. I perform the bit extraction on integers that have exactly K bits set, so there is some overhead for the lexicographic iteration, measured in the "Base" column in the table below. These are performed 2^30 times with N=48 (repeating as necessary).
"CTZ" is a loop that uses the the gcc intrinsic __builtin_ctzll to get the lowest order bit set:
uint64_t tmp = x, lb;
for (int i = 0; i < K; i++) {
    idx[i] = __builtin_ctzll(tmp);
    lb = tmp & -tmp; // get lowest bit
    tmp ^= lb;       // remove lowest bit from tmp
}
"Mark" is Mark's branchless for loop:
// scan all N bit positions (not just K); dst advances only past set bits
for (int i = 0; i < N; i++) {
    *dst = i;
    dst += x & 1;
    x >>= 1;
}
Tab1 is my original table-based code with the following copy macro:
#define COPY(d, s, n) \
switch(n) { /* each case falls through, copying the remaining bytes */ \
case 8: *(d++) = *(s++); \
case 7: *(d++) = *(s++); \
case 6: *(d++) = *(s++); \
case 5: *(d++) = *(s++); \
case 4: *(d++) = *(s++); \
case 3: *(d++) = *(s++); \
case 2: *(d++) = *(s++); \
case 1: *(d++) = *(s++); \
case 0: break; \
}
Tab2 is the same code as Tab1, but the copy macro just moves 8 bytes as a single copy (taking ideas from doynax and Lưu Vĩnh Phúc... but note this does not ensure alignment, and since it always writes 8 bytes the destination needs slack after the last index):
#define COPY2(d, s, n) { *((uint64_t *)d) = *((uint64_t *)s); d+=n; }
Here are the results. I guess my initial claim that Tab1 is 3x faster than CTZ only holds for large K (where I was testing). Mark's loop is faster than my original code, but getting rid of the branch in the COPY2 macro takes the cake for K > 8.
K Base CTZ Mark Tab1 Tab2
001 4.97s 6.42s 6.66s 18.23s 12.77s
002 4.95s 8.49s 7.28s 19.50s 12.33s
004 4.95s 9.83s 8.68s 19.74s 11.92s
006 4.95s 16.86s 9.53s 20.48s 11.66s
008 4.95s 19.21s 13.87s 20.77s 11.92s
010 4.95s 21.53s 13.09s 21.02s 11.28s
015 4.95s 32.64s 17.75s 23.30s 10.98s
020 4.99s 42.00s 21.75s 27.15s 10.96s
030 5.00s 100.64s 35.48s 35.84s 11.07s
040 5.01s 131.96s 44.55s 44.51s 11.58s
I believe the key to performance here is to focus on the larger problem rather than on micro-optimizing the extraction of bit positions out of a random integer.
Judging by your sample code and previous SO question you are enumerating all words with K bits set in order, and extracting the bit indices out of these. This greatly simplifies matters.
If so then instead of rebuilding the bit position each iteration try directly incrementing the positions in the bit array. Half of the time this will involve a single loop iteration and increment.
Something along these lines:
// Walk through all len-bit words with num-bits set in order
void enumerate(size_t num, size_t len) {
size_t i;
unsigned int bitpos[64 + 1];
// Seed with the lowest word plus a sentinel
for(i = 0; i < num; ++i)
bitpos[i] = i;
bitpos[i] = 0;
// Here goes the main loop
do {
// Do something with the resulting data
process(bitpos, num);
// Increment the least-significant series of consecutive bits
for(i = 0; bitpos[i + 1] == bitpos[i] + 1; ++i)
bitpos[i] = i;
// Stop on reaching the top
} while(++bitpos[i] != len);
}
// Test function
void process(const unsigned int *bits, size_t num) {
do
printf("%d ", bits[--num]);
while(num);
putchar('\n');
}
Not particularly optimized but you get the general idea.
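For instance (my addition, to make the ordering concrete), enumerate(2, 4) visits the six 4-bit words with two bits set in increasing numeric order, and the test process() prints the set positions most-significant first:

enumerate(2, 4);
/* prints:
   1 0    (0011)
   2 0    (0101)
   2 1    (0110)
   3 0    (1001)
   3 1    (1010)
   3 2    (1100) */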
Here's something very simple which might be faster - no way to know without testing. Much will depend on the number of bits set vs. the number unset. You could unroll this to remove branching altogether but with today's processors I don't know if it would speed up at all.
unsigned char idx[K+1]; // need one extra for overwrite protection
unsigned char *dst = idx;
for (unsigned char i = 0; i < 50; i++)
{
    *dst = i;
    dst += x & 1;
    x >>= 1;
}
P.S. Your sample output in the question is wrong; see http://ideone.com/2o032E
As a minimal modification:
int64_t x;
char idx[K+1];
char *dst = idx;
const int BITS = 8;
for (int i = 0; i < 64; i += BITS) { // covers all 64 bits for BITS = 8, 13, or 16
    int y = (x & ((1 << BITS) - 1));
    // stpcpy (POSIX) returns the end of the copy, which is what we need here
    char* end = stpcpy(dst, tab[y]); // tab[y] is a _string_
    for (; dst != end; ++dst)
    {
        *dst += (i - 1); // tab[] is null-terminated so bit positions are 1 to BITS
    }
    x >>= BITS;
}
The choice of BITS determines the size of the table. 8, 13 and 16 are logical choices. Each entry is a string, zero-terminated and containing bit positions with 1 offset. I.e. tab[5] is "\x03\x01". The inner loop fixes this offset.
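For what it's worth, here is a sketch (my own, not from the answer) of how such a tab[] could be generated for BITS = 8: each byte value maps to its set-bit positions, 1-based and most significant first, which is also what makes the suffix-sharing observation below work.

static char tab[1 << 8][9]; /* up to 8 positions plus the terminating NUL */

static void build_tab(void) {
    for (int y = 0; y < (1 << 8); y++) {
        int k = 0;
        for (int b = 7; b >= 0; b--)          /* MSB first, e.g. tab[5] = "\x03\x01" */
            if (y & (1 << b))
                tab[y][k++] = (char)(b + 1);  /* 1-based so '\0' can terminate */
        tab[y][k] = '\0';
    }
}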
Slightly more efficient: replace the string copy and inner loop by
char const* ptr = tab[y];
while (*ptr)
{
    *dst++ = *ptr++ + (i - 1);
}
Loop unrolling can be a bit of a pain if the loop contains branches, because copying those branch statements doesn't help the branch predictor. I'll happily leave that decision to the compiler.
One thing I'm considering is that tab is an array of pointers to strings, and those strings are highly similar: "\x1" is a suffix of "\x3\x1". In fact, each string which doesn't start with "\x8" is a suffix of a string which does. I'm wondering how many unique strings you need, and to what degree tab[] is in fact needed. E.g. by the logic above, tab[128+x] == tab[x]-1.
[edit]
Nevermind, you definitely need 128 tab entries starting with "\x8" since they're never the suffix of another string. Still, the tab[128+x] == tab[x]-1 rule means that you can save half the entries, but at the cost of two extra instructions: char const* ptr = tab[x & 0x7F] - ((x>>7) & 1). (Set up tab[] to point after the \x8)
Using char won't increase speed; in fact, it often requires extra ANDing and sign/zero extension while calculating. Only for very large arrays that should fit in cache are smaller int types worth using.
Another thing you can improve is the COPY macro. Instead of copying byte by byte, copy a whole word when possible:
static inline void COPY(unsigned char *dst, unsigned char *src, int n)
{
    switch (n) { // remember to align dst and src when declaring
    case 8:
        *((int64_t*)dst) = *((int64_t*)src);
        break;
    case 7:
        *((int32_t*)dst) = *((int32_t*)src);
        *((int16_t*)(dst + 4)) = *((int16_t*)(src + 4));
        dst[6] = src[6];
        break;
    case 6:
        *((int32_t*)dst) = *((int32_t*)src);
        *((int16_t*)(dst + 4)) = *((int16_t*)(src + 4));
        break;
    case 5:
        *((int32_t*)dst) = *((int32_t*)src);
        dst[4] = src[4];
        break;
    case 4:
        *((int32_t*)dst) = *((int32_t*)src);
        break;
    case 3:
        *((int16_t*)dst) = *((int16_t*)src);
        dst[2] = src[2];
        break;
    case 2:
        *((int16_t*)dst) = *((int16_t*)src);
        break;
    case 1:
        dst[0] = src[0];
        break;
    case 0:
        break;
    }
}
Also, since tabofs[x] and n[x] are often accessed close to each other, try putting them close together in memory to make sure they are always in the cache at the same time:
typedef struct TAB_N
{
    int16_t n, tabofs;
} TAB_N;

TAB_N tab_n[256]; // n and tabofs for each byte value, side by side

src = tab0 + tab_n[b0].tabofs; COPY(dst, src, tab_n[b0].n);
src = tab1 + tab_n[b1].tabofs; COPY(dst, src, tab_n[b1].n);
src = tab2 + tab_n[b2].tabofs; COPY(dst, src, tab_n[b2].n);
src = tab3 + tab_n[b3].tabofs; COPY(dst, src, tab_n[b3].n);
src = tab4 + tab_n[b4].tabofs; COPY(dst, src, tab_n[b4].n);
src = tab5 + tab_n[b5].tabofs; COPY(dst, src, tab_n[b5].n);
Last but not least, gettimeofday is not suitable for performance counting. Use QueryPerformanceCounter instead; it's much more precise.
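(Since gettimeofday is a POSIX call, the OP is presumably not on Windows; the POSIX counterpart is clock_gettime with CLOCK_MONOTONIC. A minimal sketch, my addition, not the answerer's code:)

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... code under test ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.9f s\n", secs);
    return 0;
}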
Your code is using a 1-byte (256-entry) index table. You can speed it up by a factor of 2 if you use a 2-byte (65536-entry) index table.
Unfortunately, you probably cannot extend that further: a 3-byte table would be 16 MB, unlikely to fit in the CPU's local cache, and it would only make things slower.
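As a rough sketch of what the 2-byte variant could look like (the table names and layout are my assumptions, not the answerer's code):

#include <stdint.h>

static unsigned char positions16[1 << 16][16]; /* set-bit positions of each 16-bit value */
static unsigned char count16[1 << 16];         /* popcount of each 16-bit value */

static void build_tables16(void) {
    for (int v = 0; v < (1 << 16); v++) {
        int k = 0;
        for (int b = 0; b < 16; b++)
            if (v & (1 << b))
                positions16[v][k++] = (unsigned char)b;
        count16[v] = (unsigned char)k;
    }
}

static unsigned char *extract16(uint64_t x, unsigned char *dst) {
    for (int shift = 0; shift < 64; shift += 16) {
        unsigned int v = (unsigned int)((x >> shift) & 0xFFFFu);
        for (int k = 0; k < count16[v]; k++)
            *dst++ = (unsigned char)(positions16[v][k] + shift);
    }
    return dst;
}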
Assuming sparsity in the number of set bits:
uint64_t tmp_bitmap = x;
int count = 0;
while (tmp_bitmap > 0) {
    int next_psn = __builtin_ffsll(tmp_bitmap) - 1; // ffs is 1-based, so subtract 1
    tmp_bitmap &= (tmp_bitmap - 1);                 // clear the lowest set bit
    idx[count++] = next_psn;
}
The question is: what are you going to do with the collection of positions?
If you have to iterate over it many times, then yes, it may pay to gather the positions once, as you are doing now, and iterate over them many times.
But if it's for iterating just once or a few times, then you might skip the intermediate array of positions and instead invoke a processing closure/function on each 1 encountered while iterating over the bits.
Here is a naive example of a bit iterator I wrote in Smalltalk:
LargePositiveInteger>>bitsDo: aBlock
| mask offset |
1 to: self digitLength do: [:iByte |
offset := (iByte - 1) << 3.
mask := (self digitAt: iByte).
[mask = 0]
whileFalse:
[aBlock value: mask lowBit + offset.
mask := mask bitAnd: mask - 1]]
A LargePositiveInteger is an Integer of arbitrary length composed of byte digits.
lowBit answers the rank of the lowest bit and is implemented as a lookup table with 256 entries.
In C++11 you can easily pass a closure, so it should be easy to translate.
uint64_t x;
unsigned int mask;
void (*process_bit_position)(unsigned int);
unsigned char offset = 0;
unsigned char lowBitTable[16] = {0,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0}; // 0-based, first entry is unused
while( x )
{
mask = x & 0xFUL;
while (mask)
{
process_bit_position( lowBitTable[mask]+offset );
mask &= mask - 1;
}
offset += 4;
x >>= 4;
}
The example is demonstrated with a 4-bit table, but you can easily extend it to 13 bits or more if that fits in cache.
For branch prediction, the inner loop could be rewritten as a for(i=0;i<nbit;i++) with an additional table nbit=numBitTable[mask], then unrolled with a switch (could the compiler do it?), but I'll let you measure how it performs first... (a sketch of that rewrite follows).
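Here is my reading of that suggestion as a complete sketch (the function name is mine); the popcount table gives the inner loop a known trip count, which is what makes unrolling possible:

#include <stdint.h>

static const unsigned char lowBitTable[16] = {0,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0};
static const unsigned char numBitTable[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};

static void for_each_set_bit(uint64_t x, void (*process_bit_position)(unsigned int))
{
    unsigned int offset = 0;
    while (x) {
        unsigned int mask = (unsigned int)(x & 0xFu);
        int nbit = numBitTable[mask];          /* known trip count */
        for (int i = 0; i < nbit; i++) {
            process_bit_position(lowBitTable[mask] + offset);
            mask &= mask - 1;                  /* clear the lowest set bit */
        }
        offset += 4;
        x >>= 4;
    }
}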
Has this been found to be too slow?
Small and crude, but it's all in the cache and CPU registers:
void mybits(uint64_t x, unsigned char *idx)
{
    unsigned char n = 0;
    do {
        if (x & 1) *(idx++) = n;
        n++;
    } while (x >>= 1); // If x is signed this will never end
    *idx = (unsigned char) 255; // List Terminator
}
It's still 3 times faster to unroll the loop and produce an array of 64 true/false values (which isn't quite what's wanted):
void mybits_3_2(uint64_t x, idx_type idx[])
{
#define SET(i) (idx[i] = (x & (1ULL << i)))
    SET( 0);
    SET( 1);
    SET( 2);
    SET( 3);
    ...
    SET(63);
}
Here's some tight code, written for 1 byte (8 bits), but it should expand easily to 64 bits.
#include <stdio.h>

int main(void)
{
    int x = 187;
    int ans[8] = {-1,-1,-1,-1,-1,-1,-1,-1};
    int idx = 0;
    while (x)
    {
        switch (x & ~(x-1)) // isolate the lowest set bit
        {
            case 0x01: ans[idx++] = 0; break;
            case 0x02: ans[idx++] = 1; break;
            case 0x04: ans[idx++] = 2; break;
            case 0x08: ans[idx++] = 3; break;
            case 0x10: ans[idx++] = 4; break;
            case 0x20: ans[idx++] = 5; break;
            case 0x40: ans[idx++] = 6; break;
            case 0x80: ans[idx++] = 7; break;
        }
        x &= x-1; // clear the lowest set bit
    }
    getchar();
    return 0;
}
Output array should be:
ans = {0,1,3,4,5,7,-1,-1};
If I take "I need a fast way to get the position of all one bits in a 64-bit integer" literally...
I realise this is a few weeks old, but out of curiosity: I remember, way back in my assembly days on the CBM64 and Amiga, using an arithmetic shift and then examining the carry flag. If it's set, the shifted-out bit was 1; if clear, it's zero.
e.g. for an arithmetic shift left (examining from bit 64 to bit 0)...
pseudo code (ignore the instruction mix, errors, and oversimplification... it's been a while):
          move #64+1, counter
loop:     ASL 64bitinteger
          BCS carryset
decctr:   dec counter
          bne loop
          exit
carryset:
          ; store #counter-1 (i.e. bit position) in datastruct indexed by counter
          jmp decctr
...I hope you get the idea.
I've not used assembly since then, but I'm wondering if we could use some C++ inline assembly along the above lines here. We could do the whole conversion in assembly (very few lines of code), building up an appropriate data structure; C++ could simply examine the answer.
If this is possible then I'd imagine it to be pretty fast.
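For reference, here is a rough C rendering of the same shift-and-test idea (my sketch, not the answerer's assembly); note it yields the positions from high to low, mirroring the bit-64-to-bit-0 scan:

#include <stdint.h>

static int bits_by_shifting(uint64_t x, unsigned char *idx)
{
    int n = 0;
    for (int pos = 63; pos >= 0; pos--) {
        if (x & 0x8000000000000000ULL) /* the bit that ASL would shift into carry */
            idx[n++] = (unsigned char)pos;
        x <<= 1;
    }
    return n; /* number of set bits found */
}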
A simple solution, but perhaps not the fastest, depending on the times of the log and pow functions:

#include <math.h>
#include <stdio.h>

void getSetBits(unsigned long num) {
    int bit;
    while (num) {
        bit = log2(num);      // index of the highest set bit
        num -= pow(2, bit);   // clear it (double precision limits this above 2^53)
        printf("%i\n", bit);  // use bit number
    }
}
Complexity: O(D), where D is the number of set bits.
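(For the question's example, getSetBits(123703) would print the set bits highest-first: 16, 15, 14, 13, 9, 8, 5, 4, 2, 1, 0.)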