Macro to iterate over struct members

Macro to iterate over struct members - c

I use a struct of bit fields to access each colour channel in a pixel, the problem is that quite often I have code that applies in the same way to each channel, but because I cannot just iterate over the members of a struct in C I end up having 3 copies of the same code for each member, or more inconveniently have to use switch-case statements.
I figured it would be more elegant if I could use a macro so that I can access a member by providing a number, ideally a macro that would make .CHAN(i) become either .r, .g or .b depending on whether the integer variable i contains a 0, 1 or 2. Except I have no idea how one would make such a macro or even if that's possible.
A detail but each member is something like 12 bits, not 8 as one might expect, so I cannot just turn it into an array or have a union with a pointer. Also X-Macros won't do as I often need to do many things to each channel before doing the same to another channel, in other words the for loop for going through each member can contain a lot more than just one line.
EDIT: Here's some code, first the struct:
typedef struct
{
uint32_t b:12;
uint32_t g:12;
uint32_t r:12;
uint32_t a:12;
} lrgb_t;
Now an example of what my problem looks like in code:
for (ic=0; ic<3; ic++)
{
for (i=0; i<curvecount; i++)
{
curve[i].p0.x = (double) i;
curve[i].p3.x = (double) i+1.;
switch (ic) // this is what I'm trying to eliminate
{
case 0:
curve[i].p0.y = pancol[i].r / 4095.;
curve[i].p3.y = pancol[i+1].r / 4095.;
break;
case 1:
curve[i].p0.y = pancol[i].g / 4095.;
curve[i].p3.y = pancol[i+1].g / 4095.;
break;
case 2:
curve[i].p0.y = pancol[i].b / 4095.;
curve[i].p3.y = pancol[i+1].b / 4095.;
break;
}
// Ideally this would be replaced by something like this, CHAN() being an hypothetical macro
// curve[i].p0.y = pancol[i].CHAN(ic) / 4095.;
// curve[i].p3.y = pancol[i+1].CHAN(ic) / 4095.;
}
... // more stuff that ultimately results in a bunch of pixels being written, channel after channel
}

as pointed out in the comments, this doesn't really address the OP's problem because the members on his struct are bitfields that wouldn't align with an array. I'll keep the answer here though, in hopes it can still be useful to someone.
I think a union is what you want.
You can write your struct such as
union
{
struct
{
float r;
float g;
float b;
}rgb;
float channel[3];
} color;
This way the struct will be in the same place in memory as the float[3], and you can effectively access the same members as either a struct member or as an element in the array.
You might have to look up the exact syntax, but you get the idea.

One possibility might be to wrap the repeated code into a function, and then call it for each of the channels:
typedef struct {
int r:12;
int g:12;
int b:12;
} Pixel;
int inc(int val) {
return val + 1;
}
int main(void) {
Pixel p = {0, 0, 0};
p.r = inc(p.r);
p.g = inc(p.g);
p.b = inc(p.b);
return 0;
}

After reading the code that you added I made some changes to my suggested macro
#define CHAN(ic) \
(ic == 1) ? curve[i].p0.y = pancol[i].r / 4095; curve[i].p3.y = pancol[i+1].r / 4095; : \
(ic == 2) ? curve[i].p0.y = pancol[i].g / 4095; curve[i].p3.y = pancol[i+1].g / 4095; : \
curve[i].p0.y = pancol[i].b / 4095; curve[i].p3.y = pancol[i+1].b / 4095;
The macro CHAN(ic) will evaluate 'ic' in order to decided which member to manipulate. If 'ic' is 1 then the member '.r' will be manipulated if 'ic' is 2 then '.g' will be manipulated, and if 'ic' is neither 1 or 2 then '.b' will be manipulated because of this assumption you must make sure that 'ic' is properly set otherwise you could screw with the value of panco[i].b and pancol[i+1].b . You code should look something like the following but you will most likely need to tweak the macro a bit let me know if you have any questions.
//#define CHAN(ic) here
for (ic=0; ic<3; ic++)
{
for (i=0; i<curvecount; i++)
{
curve[i].p0.x = (double) i;
curve[i].p3.x = (double) i+1.;
CHAN(ic)
}
... // more stuff that ultimately results in a bunch of pixels being written, channel after channel
}
Also please note that my macro will do exactly the same thing as your switch case. The only difference is that it is defined in a macro the point I am trying to make is that the difference between the switch case and the macro is purely visual.

Related

My C expression must have struct or union type

I have a method that is intended to print out values as per other methods defined. I will post the methods directly involved to reduce the code. However, Visual Studio returns the error
expression must have struct or union type
My code is as below:
int i;
double cat[12];
#define MAX 3
#define GRAMS 64
struct CatFoodInfo {
int SKU[MAX], CalPerServ[MAX];
double price[MAX], lbs[MAX], kg[MAX], grams[GRAMS], servings[MAX], costPerServ[MAX], costPerCal[MAX];
} CatFoodInfo;
void displayReportData() {
for (i = 0; i < MAX; i++) {
int sku = cat[i + (3 * i)];
double price = cat[i + (1 + (3 * i))];
double lbs = cat[i + (2 + (3 * i))];
int calPerServ = cat[i + (3 + (3 * i))];
int kg = cat[kg];
printf("%07d %10.2lf %10.1lf %10.4lf %9d %8d %8.1lf %7.2lf %7.5lf", sku, price, lbs, kg, cat.grams[i], calPerServ, cat.servings[i], cat.costPerServ[i], cat.costPerCal[i]);
}
}
The above code is will return that error. I have tried changing the values on the printf from cat.grams[i] to CatFoodInfo.grams[i]. This gets rid of the error but only zero is returned.
What is the correct way around this or where am I going wrong?

It isn't quite clear what's your purpose.
cat is an array of doubles, it has no idea about the struct CatFoodInfo or its fields, hence cat.grams[i] doesn't make sense. For that, cat should be a struct of type CatFoodInfo, e.g.:
typedef struct {
...
} CatFoodInfo;
CatFoodInfo cat;
But this change the meaning, now cat isn't an array, there is only one element of type CatFoodInfo, so you will have problems with the other parts of the code (e.g. cat[kg]).
However, to treat cat as an array with CatFoodInfo elements, you need:
typedef struct {
...
} CatFoodInfo;
CatFoodInfo cat[12];
But again, you would have to rewrite most of your code, e.g. cat.grams[i] to cat[j].grams[i].

Removing a switch-case in a function called from a loop

I have a function called blend_pixels() whose task is to blend a single pixel onto another pixel according to the specified blending mode. That function is in turn called by pretty much any function that wants to draw anything.
The problem is that function is called for every single pixel, that means it's called tens of millions of times a second, and it contains a switch-case statement going through all possible blending modes until it finds the right one.
Obviously this is somewhat slower than calling a function that directly does the desired operations, and that's the problem I'm trying to fix. The parent functions that call blend_pixels() usually just pass on the blending mode that they themselves received as an argument once called, so I can't just have them called a small function that would only do one blending mode. But the choice only needs to be done once for every call of the parent function (the parent functions operate on a lot of pixels per call whereas blend_pixels() is called for every single pixel, in a loop going through all the necessary pixels).
The function looks like this:
void blend_pixels(lrgb_t *bg, lrgb_t fg, int32_t p, const int mode)
{
int32_t r, g, b;
switch (mode)
{
case SOLID:
*bg = fg;
break;
case ADD:
r = (fg.r * p >> 15) + bg->r; if (r>ONE) bg->r = ONE; else bg->r = r;
g = (fg.g * p >> 15) + bg->g; if (g>ONE) bg->g = ONE; else bg->g = g;
b = (fg.b * p >> 15) + bg->b; if (b>ONE) bg->b = ONE; else bg->b = b;
break;
case SUB:
r = -(fg.r * p >> 15) + bg->r; if (r<0) bg->r = 0; else bg->r = r;
g = -(fg.g * p >> 15) + bg->g; if (g<0) bg->g = 0; else bg->g = g;
b = -(fg.b * p >> 15) + bg->b; if (b<0) bg->b = 0; else bg->b = b;
break;
case MUL:
... // you get the idea
}
}
and is called in this kind of way:
void parent_function(lrgb_t *fb, int w, int h, lrgb_t colour, ... int blendingmode)
{
...
for (iy=y0; iy<y1; iy++)
for (ix=x0; ix<x1; ix++)
{
p = some_weighting_formula();
blend_pixels(&fb[iy*w+ix], colour, p, blendingmode);
}
}
which itself might be called like:
parent_function(fb, w, h, orange, ... /*whatever*/, ADD);
"ADD" being an integer from an enum
So clearly any switch-case to pick the blending algorithm should be done outside of parent_function's loops. But how?

You can do this with function pointers.
First define a typedef for your function pointer:
typedef void (*blend_function)(lrgb_t *, lrgb_t, int32_t);
Then break out each part of blend_pixels into its own function, each with identical parameters and return type as the typedef:
void blend_pixels_add(lrgb_t *bg, lrgb_t fg, int32_t p)
...
void blend_pixels_sub(lrgb_t *bg, lrgb_t fg, int32_t p)
...
void blend_pixels_mult(lrgb_t *bg, lrgb_t fg, int32_t p)
...
Then in your parent function, you can assign a variable of the function pointer type, and assign it the address of the function you want to use:
void parent_function(lrgb_t *fb, int w, int h, lrgb_t colour, ... int blendingmode)
{
...
blend_function blend;
switch (blendingmode)
{
case ADD:
blend = blend_pixels_add;
break;
case SUB:
blend = blend_pixels_sub;
break;
...
}
for (iy=y0; iy<y1; iy++)
for (ix=x0; ix<x1; ix++)
{
p = some_weighting_formula();
blend(&fb[iy*w+ix], colour, p);
}
}

Addressing your concern that "and it contains a switch-case statement going through all possible blending modes until it finds the right one.", this is probably not what really happens.
Switch statements are generally compiled into what is called a jump table. In a jump table, the code does not step through all of the cases looking for the correct one, instead the argument of the switch() statement is used as the index in an array of addresses. Something like:
jump_table[SOLID] -> case SOLID address
jump_table[ADD] -> case ADD address
...
So, in this sort of implementation, a switch statement that is considering many, many values should be just as fast as a hand-coded function pointer solution because that is essentially what the compiler builds.

Most efficient way to use different members of struct in a function?

I'm writing for a very limited resource embedded processor. I have a struct that captures a time series of events, and I'd like to use the same graphing function against different values of different types. Something like (very stripped down; don't sweat uninitialized values, etc):
#define GRAPH_ONE 1
#define GRAPH_TWO 2
struct _event_t {
unsigned long timestamp;
int sensorOne;
float sensorTwo;
}
typedef struct _event_t event_t;
event_t[10] MyEvents;
void GraphSensor(byte graphType) {
for (unsigned int i = 0; i < 9; i++) {
// Get minimum value from series
if (MyEvents[i].?????) ...
}
}
How can I have my function operate on different members of the struct? I can see doing it with a ton of switch (graphType) statements, but that's pretty ugly. There could easily be 8 or 10 members of the struct. I could move all of those to a separate function and make every bit of data access call that function, always returning a float (which should be OK for my graph). Finally I could convert to C++ classes, which opens other means.
None of those feel right. Is there a better approach, preferably a very lightweight one?

You could wrap the accessor you need in a function, and pass that to the function that walks the array and aggregates the results. For example
float getSensor1(const event_t* t)
{
return t->sensorOne;
}
float getSensor2(const event_t* t)
{
return t->sensorTwo;
}
void GraphSensor(float (*accessor)(const event_t*)) {
// Get minimum value from series
float min_value = MAX_FLOAT;
for (unsigned int i = 0; i < 9; i++) {
float x = accessor(MyEvents + i);
if (x < min_value)
min_value = x;
}
}
/* later on . . . */
GraphSensor(getSensor1);
GraphSensor(getsensor2);
You are basically decoupling the access of the data from the operation on it, and homogenizing it all to floats. The aggregation operation could also be encapsulated into a function, too. But that's getting pretty close to map-reduce. :)

You could change the struct to an array which uses perhaps all floats. In that way the data handling is completely homogeneous.
#define N_SENSORS 12
#define N_EVENTS 10
float MyEvents [N_EVENTS] [N_SENSORS];
void GraphSensor(byte graphType)
{
float min = 1e38;
for (unsigned int i = 0; i < N_EVENTS; i++)
{
// Get minimum value from series
if (MyEvents[i][graphType] < min)
min = MyEvents[i][graphType];
}
}
Perhaps the timestamp could be there too in element zero maybe using spreadsheet conventions: integer part is days since 1970 or 1900 and fractional part is portion of the day (so noon = .5).

C: defining multidimensional arrays in a switch

I'm translating code from Maple to C in order to optimize performance. In order to save time, I've hard coded a 2-dimensional array for the 3 cases that I need to run asap. Later I'll add functions that generate this array so that I can run any case.
Here's how I tried to define the array schur: (here N and dim are pre-determined ints, and numPar is an int as well).
// load Schur functions
switch (N) {
case 3:
numPar = 3;
int schur[numPar][dim] = {
{1,0,0,0},
{0,1,1,0},
{0,0,0,1},
};
break;
case 4:
numPar = 5;
int schur[numPar][dim] = {
{1,0,0,0,0,0,0,0},
{0,1,1,0,1,0,0,0},
{0,0,1,0,0,1,0,0},
{0,0,0,1,0,1,1,0},
{0,0,0,0,0,0,0,1},
};
break;
case 5:
numPar = 7;
int schur[numPar][dim] = {
{1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
{0,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0},
{0,0,1,0,1,1,0,0,0,1,1,0,0,0,0,0},
{0,0,0,1,0,1,1,0,0,1,1,0,1,0,0,0},
{0,0,0,0,0,1,1,0,0,0,1,1,0,1,0,0},
{0,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0},
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1},
};
break;
default:
}
Clearly this will not work. However, I'm at a loss as to how to rewrite it so that it does work. One idea is to flatten the array, but that will obfuscate my code rather badly later on. Suggestions are greatly appreciated.

You can allocate the multidimensional array to be as large as the largest case. Based on the switch case you can only fill it to the size you need, and then only access it to the size you filled.
So for example for the 3 by 4 array:
int staticArray[3][4] = {
{1,0,0,0},
{0,1,1,0},
{0,0,0,1},
};
for (int i = 0; i<3; ++i) {
for (int j = 0; j<4; ++j) {
schur[i][j] = staticArray[i][j];
}
}

Since you're concerned about space, and since your larger arrays appear to be mostly zeros with relatively few ones, you might want to consider a "sparse array" solution. Access speed would be much slower, but the amount of memory used might be much less.
Websearching on that phrase will find implementations; which one would be best depends on how you intend to use these arrays.

switch (N) {
case 3:
numPar = 3;
int tmp1[3][dim] = {
{1,0,0,0},
{0,1,1,0},
{0,0,0,1},
};//then copy this rray to thry you want
break;
case 4:
numPar = 5;
int tmp2[5][dim] = {
{1,0,0,0,0,0,0,0},
{0,1,1,0,1,0,0,0},
{0,0,1,0,0,1,0,0},
{0,0,0,1,0,1,1,0},
{0,0,0,0,0,0,0,1},
};//then copy this rray to thry you want
break;
case 5:
numPar = 7;
int tmp3[7][dim] = {
{1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
{0,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0},
{0,0,1,0,1,1,0,0,0,1,1,0,0,0,0,0},
{0,0,0,1,0,1,1,0,0,1,1,0,1,0,0,0},
{0,0,0,0,0,1,1,0,0,0,1,1,0,1,0,0},
{0,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0},
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1},
};//then copy this rray to thry you want
break;
default:
}

First, note the hopefully obvious problem that you can't use variables when declaring an array in C, only constants. For example, your first declaration could work like this:
int schur[][] = {
{1,0,0,0},
{0,1,1,0},
{0,0,0,1}
};
and the compiler will happily figure out just how much space to allocate ... if, of course, you also weren't trying to declare the same variable multiple times in your switch statement. :-)
The second thing to keep in mind is that the construct:
int myArray[][] = { {1, 0, ... }, { 0, 1, ... }, ... };
declares an array of pointers to arrays of integers. In that example, schur is an array of 3 pointers, each of which points to an array of 4 integers.
This being C, there is of course a number of different ways to accomplish what you're trying to do. (Steve's Law of Computing: "If there exists one way to do something, there exists an infinite number of ways to do the same thing.")
What comes to mind first for the three cases you show above is to declare the 3 arrays you need, then just return the appropriate one from the switch statement:
int schur3[][] = {
{1,0,0,0},
{0,1,1,0},
{0,0,0,1}
};
int schur4[][] = {
{1,0,0,0,0,0,0,0},
{0,1,1,0,1,0,0,0},
{0,0,1,0,0,1,0,0},
{0,0,0,1,0,1,1,0},
{0,0,0,0,0,0,0,1}
};
int schur5[][] = {
{1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
{0,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0},
{0,0,1,0,1,1,0,0,0,1,1,0,0,0,0,0},
{0,0,0,1,0,1,1,0,0,1,1,0,1,0,0,0},
{0,0,0,0,0,1,1,0,0,0,1,1,0,1,0,0},
{0,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0},
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1}
};
/* Note that what you get is a pointer to an array of pointers! */
int * * getSchurArray(int N)
{
switch (N)
{
case 3:
return (schur3);
case 4:
return (schur4);
case 5:
return (schur5);
}
}
(Caveat: No, I didn't run that through a compiler yet, so I won't guarantee there are no typos!)
Now, if you want to make this dynamic, and you really want to stick with C, you're going to have to use malloc(), which is how you do dynamic arrays in C. In your case, you need to do something along the lines of:
int * * createSchurArray(int numPar, int dim)
{
/* malloc() requires number of bytes, which is number of entries */
/* times the size of each entry. */
int * * answer = malloc(numPar * sizeof(int *));
for (int rowIndex = 0; rowIndex < numPar; rowIndex++)
{
answer[rowIndex] = malloc(dim * sizeof(int));
for (int colIndex = 0; colIndex < dim; colIndex++)
{
answer[rowIndex][colIndex] = schurValue(numPar, dim, rowIndex, colIndex);
}
}
}
where implementation of:
int schurValue(int numPar, int dim, int rowIndex, int colIndex)
is left as an exercise for someone who understands what you're trying to do with Schur functions. :-)
(Oh, wait - did I break an "only one smiley per answer" rule?)

What is the fastest way to return the positions of all set bits in a 64-bit integer?

I need a fast way to get the position of all one bits in a 64-bit integer. For example, given x = 123703, I'd like to fill an array idx[] = {0, 1, 2, 4, 5, 8, 9, 13, 14, 15, 16}. We can assume we know the number of bits a priori. This will be called 1012 - 1015 times, so speed is of the essence. The fastest answer I've come up with so far is the following monstrosity, which uses each byte of the 64-bit integer as an index into tables that give the number of bits set in that byte and the positions of the ones:
int64_t x; // this is the input
unsigned char idx[K]; // this is the array of K bits that are set
unsigned char *dst=idx, *src;
unsigned char zero, one, two, three, four, five; // these hold the 0th-5th bytes
zero = x & 0x0000000000FFUL;
one = (x & 0x00000000FF00UL) >> 8;
two = (x & 0x000000FF0000UL) >> 16;
three = (x & 0x0000FF000000UL) >> 24;
four = (x & 0x00FF00000000UL) >> 32;
five = (x & 0xFF0000000000UL) >> 40;
src=tab0+tabofs[zero ]; COPY(dst, src, n[zero ]);
src=tab1+tabofs[one ]; COPY(dst, src, n[one ]);
src=tab2+tabofs[two ]; COPY(dst, src, n[two ]);
src=tab3+tabofs[three]; COPY(dst, src, n[three]);
src=tab4+tabofs[four ]; COPY(dst, src, n[four ]);
src=tab5+tabofs[five ]; COPY(dst, src, n[five ]);
where COPY is a switch statement to copy up to 8 bytes, n is array of the number of bits set in a byte and tabofs gives the offset into tabX, which holds the positions of the set bits in the X-th byte. This is about 3x faster than unrolled loop-based methods with __builtin_ctz() on my Xeon E5-2609. (See below.) I am currently iterating x in lexicographical order for a given number of bits set.
Is there a better way?
EDIT: Added an example (that I have subsequently fixed). Full code is available here: http://pastebin.com/79X8XL2P . Note: GCC with -O2 seems to optimize it away, but Intel's compiler (which I used to compose it) doesn't...
Also, let me give some additional background to address some of the comments below. The goal is to perform a statistical test on every possible subset of K variables out of a universe of N possible explanatory variables; the specific target right now is N=41, but I can see some projects needing N up to 45-50. The test basically involves factorizing the corresponding data submatrix. In pseudocode, something like this:
double doTest(double *data, int64_t model) {
int nidx, idx[];
double submatrix[][];
nidx = getIndices(model, idx); // get the locations of ones in model
// copy data into submatrix
for(int i=0; i<nidx; i++) {
for(int j=0; j<nidx; j++) {
submatrix[i][j] = data[idx[i]][idx[j]];
}
}
factorize(submatrix, nidx);
return the_answer;
}
I coded up a version of this for an Intel Phi board that should complete the N=41 case in about 15 days, of which ~5-10% of the time is spent in a naive getIndices() so right off the bat a faster version could save a day or more. I'm working on an implementation for NVidia Kepler too, but unfortunately the problem I have (ludicrous numbers of small matrix operations) is not ideally suited to the hardware (ludicrously large matrix operations). That said, this paper presents a solution that seems to achieve hundreds of GFLOPS/s on matrices of my size by aggressively unrolling loops and performing the entire factorization in registers, with the caveat that the dimensions of the matrix be defined at compile-time. (This loop unrolling should help reduce overhead and improve vectorization in the Phi version too, so getIndices() will become more important!) So now I'm thinking my kernel should look more like:
double *data; // move data to GPU/Phi once into shared memory
template<unsigned int K> double doTestUnrolled(int *idx) {
double submatrix[K][K];
// copy data into submatrix
#pragma unroll
for(int i=0; i<K; i++) {
#pragma unroll
for(int j=0; j<K; j++) {
submatrix[i][j] = data[idx[i]][idx[j]];
}
}
factorizeUnrolled<K>(submatrix);
return the_answer;
}
The Phi version solves each model in a `cilk_for' loop from model=0 to 2N (or, rather, a subset for testing), but now in order to batch work for the GPU and amortize the kernel launch overhead I have to iterate model numbers in lexicographical order for each of K=1 to 41 bits set (as doynax noted).
EDIT 2: Now that vacation is over, here are some results on my Xeon E5-2602 using icc version 15. The code that I used to benchmark is here: http://pastebin.com/XvrGQUat. I perform the bit extraction on integers that have exactly K bits set, so there is some overhead for the lexicographic iteration measured in the "Base" column in the table below. These are performed 230 times with N=48 (repeating as necessary).
"CTZ" is a loop that uses the the gcc intrinsic __builtin_ctzll to get the lowest order bit set:
for(int i=0; i<K; i++) {
idx[i] = __builtin_ctzll(tmp);
lb = tmp & -tmp; // get lowest bit
tmp ^= lb; // remove lowest bit from tmp
}
Mark is Mark's branchless for loop:
for(int i=0; i<K; i++) {
*dst = i;
dst += x & 1;
x >>= 1;
}
Tab1 is my original table-based code with the following copy macro:
#define COPY(d, s, n) \
switch(n) { \
case 8: *(d++) = *(s++); \
case 7: *(d++) = *(s++); \
case 6: *(d++) = *(s++); \
case 5: *(d++) = *(s++); \
case 4: *(d++) = *(s++); \
case 3: *(d++) = *(s++); \
case 2: *(d++) = *(s++); \
case 1: *(d++) = *(s++); \
case 0: break; \
}
Tab2 is the same code as Tab1, but the copy macro just moves 8 bytes as a single copy (taking ideas from doynax and Lưu Vĩnh Phúc... but note this does not ensure alignment):
#define COPY2(d, s, n) { *((uint64_t *)d) = *((uint64_t *)s); d+=n; }
Here are the results. I guess my initial claim that Tab1 is 3x faster than CTZ only holds for large K (where I was testing). Mark's loop is faster than my original code, but getting rid of the branch in the COPY2 macro takes the cake for K > 8.
K Base CTZ Mark Tab1 Tab2
001 4.97s 6.42s 6.66s 18.23s 12.77s
002 4.95s 8.49s 7.28s 19.50s 12.33s
004 4.95s 9.83s 8.68s 19.74s 11.92s
006 4.95s 16.86s 9.53s 20.48s 11.66s
008 4.95s 19.21s 13.87s 20.77s 11.92s
010 4.95s 21.53s 13.09s 21.02s 11.28s
015 4.95s 32.64s 17.75s 23.30s 10.98s
020 4.99s 42.00s 21.75s 27.15s 10.96s
030 5.00s 100.64s 35.48s 35.84s 11.07s
040 5.01s 131.96s 44.55s 44.51s 11.58s

I believe the key to performance here is to focus on the larger problem rather than on micro-optimizing the extraction of bit positions out of a random integer.
Judging by your sample code and previous SO question you are enumerating all words with K bits set in order, and extracting the bit indices out of these. This greatly simplifies matters.
If so then instead of rebuilding the bit position each iteration try directly incrementing the positions in the bit array. Half of the time this will involve a single loop iteration and increment.
Something along these lines:
// Walk through all len-bit words with num-bits set in order
void enumerate(size_t num, size_t len) {
size_t i;
unsigned int bitpos[64 + 1];
// Seed with the lowest word plus a sentinel
for(i = 0; i < num; ++i)
bitpos[i] = i;
bitpos[i] = 0;
// Here goes the main loop
do {
// Do something with the resulting data
process(bitpos, num);
// Increment the least-significant series of consecutive bits
for(i = 0; bitpos[i + 1] == bitpos[i] + 1; ++i)
bitpos[i] = i;
// Stop on reaching the top
} while(++bitpos[i] != len);
}
// Test function
void process(const unsigned int *bits, size_t num) {
do
printf("%d ", bits[--num]);
while(num);
putchar('\n');
}
Not particularly optimized but you get the general idea.

Here's something very simple which might be faster - no way to know without testing. Much will depend on the number of bits set vs. the number unset. You could unroll this to remove branching altogether but with today's processors I don't know if it would speed up at all.
unsigned char idx[K+1]; // need one extra for overwrite protection
unsigned char *dst=idx;
for (unsigned char i = 0; i < 50; i++)
{
*dst = i;
dst += x & 1;
x >>= 1;
}
P.S. your sample output in the question is wrong, see http://ideone.com/2o032E

As a minimal modification:
int64_t x;
char idx[K+1];
char *dst=idx;
const int BITS = 8;
for (int i = 0 ; i < 64+BITS; i += BITS) {
int y = (x & ((1<<BITS)-1));
char* end = strcat(dst, tab[y]); // tab[y] is a _string_
for (; dst != end; ++dst)
{
*dst += (i - 1); // tab[] is null-terminated so bit positions are 1 to BITS.
}
x >>= BITS;
}
The choice of BITS determines the size of the table. 8, 13 and 16 are logical choices. Each entry is a string, zero-terminated and containing bit positions with 1 offset. I.e. tab[5] is "\x03\x01". The inner loop fixes this offset.
Slightly more efficient: replace the strcat and inner loop by
char const* ptr = tab[y];
while (*ptr)
{
*dst++ = *ptr++ + (i-1);
}
Loop unrolling can be a bit of a pain if the loop contains branches, because copying those branch statements doesn't help the branch predictor. I'll happily leave that decision to the compiler.
One thing I'm considering is that tab[y] is an array of pointers to strings. These are highly similar: "\x1" is a suffix of "\x3\x1". In fact, each string which doesn't start with "\x8" is a suffix of a string which does. I'm wondering how many unique strings you need, and to what degree tab[y] is in fact needed. E.g. by the logic above, tab[128+x] == tab[x]-1.
[edit]
Nevermind, you definitely need 128 tab entries starting with "\x8" since they're never the suffix of another string. Still, the tab[128+x] == tab[x]-1 rule means that you can save half the entries, but at the cost of two extra instructions: char const* ptr = tab[x & 0x7F] - ((x>>7) & 1). (Set up tab[] to point after the \x8)

Using char wouldn't help you to increase speed but in fact often needs more ANDing and sign/zero extending while calculating. Only in the case of very large arrays that should fit in cache, smaller int types should be used
Another thing you can improve is the COPY macro. Instead of copy byte-by-byte, copy the whole word if possible
inline COPY(unsigned char *dst, unsigned char *src, int n)
{
switch(n) { // remember to align dst and src when declaring
case 8:
*((int64_t*)dst) = *((int64_t*)src);
break;
case 7:
*((int32_t*)dst) = *((int32_t*)src);
*((int16_t*)(dst + 4)) = *((int32_t*)(src + 4));
dst[6] = src[6];
break;
case 6:
*((int32_t*)dst) = *((int32_t*)src);
*((int16_t*)(dst + 4)) = *((int32_t*)(src + 4));
break;
case 5:
*((int32_t*)dst) = *((int32_t*)src);
dst[4] = src[4];
break;
case 4:
*((int32_t*)dst) = *((int32_t*)src);
break;
case 3:
*((int16_t*)dst) = *((int16_t*)src);
dst[2] = src[2];
break;
case 2:
*((int16_t*)dst) = *((int16_t*)src);
break;
case 1:
dst[0] = src[0];
break;
case 0:
break;
}
Also, since tabofs[x] and n[x] is often access close to each other, try putting it close in memory to make sure they are always in cache at the same time
typedef struct TAB_N
{
int16_t n, tabofs;
} tab_n[256];
src=tab0+tab_n[b0].tabofs; COPY(dst, src, tab_n[b0].n);
src=tab0+tab_n[b1].tabofs; COPY(dst, src, tab_n[b1].n);
src=tab0+tab_n[b2].tabofs; COPY(dst, src, tab_n[b2].n);
src=tab0+tab_n[b3].tabofs; COPY(dst, src, tab_n[b3].n);
src=tab0+tab_n[b4].tabofs; COPY(dst, src, tab_n[b4].n);
src=tab0+tab_n[b5].tabofs; COPY(dst, src, tab_n[b5].n);
Last but not least, gettimeofday is not for performance counting. Use QueryPerformanceCounter instead, it's much more precise

Your code is using 1-byte (256 entries) index table. You can speed it up by factor of 2 if you use 2-byte (65536 entries) index table.
Unfortunately, you probably cannot extend that further - for 3-bytes table size would be 16MB, not likely to fit into CPU local cache, and it would only make things slower.

Assuming sparsity in number of set bits,
int count = 0;
unsigned int tmp_bitmap = x;
while (tmp_bitmap > 0) {
int next_psn = __builtin_ffs(tmp_bitmap) - 1;
tmp_bitmap &= (tmp_bitmap-1);
id[count++] = next_psn;
}

The question is what are you going to do with the collection of positions?
If you have to iterate many times over it, then yes, it might be interesting to gather them once as you are doing now, and iterate many.
But if it's for iterating just once or few times, then you might think of not creating an intermediate array of positions, and just invoke a processing block closure/function at each encountered 1 while iterating on bits.
Here is a naive example of bit iterator I wrote in Smalltalk:
LargePositiveInteger>>bitsDo: aBlock
| mask offset |
1 to: self digitLength do: [:iByte |
offset := (iByte - 1) << 3.
mask := (self digitAt: iByte).
[mask = 0]
whileFalse:
[aBlock value: mask lowBit + offset.
mask := mask bitAnd: mask - 1]]
A LargePositiveInteger is an Integer of arbitrary length composed of byte digits.
The lowBit answer the rank of lowest bit and is implemented as a lookup table with 256 entries.
In C++ 2011 you can easily pass a closure, so it should be easy to translate.
uint64_t x;
unsigned int mask;
void (*process_bit_position)(unsigned int);
unsigned char offset = 0;
unsigned char lowBitTable[16] = {0,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0}; // 0-based, first entry is unused
while( x )
{
mask = x & 0xFUL;
while (mask)
{
process_bit_position( lowBitTable[mask]+offset );
mask &= mask - 1;
}
offset += 4;
x >>= 4;
}
The example is demonstrated with a 4 bit table, but you can easily extend it to 13 bits or more if it fits in cache.
For branch prediction, the inner loop could be rewritten as a for(i=0;i<nbit;i++) with an additional tablenbit=numBitTable[mask] then unrolled with a switch (the compiler could do it?), but I let you measure how it performs first...

Has this been found to be too slow?
Small and crude, but it's all in the cache and CPU registers;
void mybits(uint64_t x, unsigned char *idx)
{
unsigned char n = 0;
do {
if (x & 1) *(idx++) = n;
n++;
} while (x >>= 1); // If x is signed this will never end
*idx = (unsigned char) 255; // List Terminator
}
It's still 3 times faster to unroll the loop and produce an array of 64 true/false values (which isn't quite what's wanted)
void mybits_3_2(uint64_t x, idx_type idx[])
{
#define SET(i) (idx[i] = (x & (1UL<<i)))
SET( 0);
SET( 1);
SET( 2);
SET( 3);
...
SET(63);
}

Here's some tight code, written for 1-byte (8-bits), but it should easily, obviously expand to 64-bits.
int main(void)
{
int x = 187;
int ans[8] = {-1,-1,-1,-1,-1,-1,-1,-1};
int idx = 0;
while (x)
{
switch (x & ~(x-1))
{
case 0x01: ans[idx++] = 0; break;
case 0x02: ans[idx++] = 1; break;
case 0x04: ans[idx++] = 2; break;
case 0x08: ans[idx++] = 3; break;
case 0x10: ans[idx++] = 4; break;
case 0x20: ans[idx++] = 5; break;
case 0x40: ans[idx++] = 6; break;
case 0x80: ans[idx++] = 7; break;
}
x &= x-1;
}
getchar();
return 0;
}
Output array should be:
ans = {0,1,3,4,5,7,-1,-1};

If I take "I need a fast way to get the position of all one bits in a 64-bit integer" literally...
I realise this is a few weeks old, however and out of curiosity, I remember way back in my assembly days with the CBM64 and Amiga using an arithmetic shift and then examining the carry flag - if it's set then the shifted bit was 1, if clear then it's zero
e.g. for an arithmetic shift left (examining from bit 64 to bit 0)....
pseudo code (ignore instruction mix etc errors and oversimplification...been a while):
move #64+1, counter
loop. ASL 64bitinteger
BCS carryset
decctr. dec counter
bne loop
exit
carryset.
//store #counter-1 (i.e. bit position) in datastruct indexed by counter
jmp decctr
...I hope you get the idea.
I've not used assembly since then but I'm wondering if we could use some C++ in-line assembly similar to the above to do something similar here. We could do the whole conversion in assembly (very few lines of code), building up an appropriate data structure. C++ could simply examine the answer.
If this is possible then I'd imagine it to be pretty fast.

A simple solution, but perhaps not the fastest, depending on the times of the log and pow functions:
#include<math.h>
void getSetBits(unsigned long num){
int bit;
while(num){
bit = log2(num);
num -= pow(2, bit);
printf("%i\n", bit); // use bit number
}
}
Complexity O(D) | D is the number of set bits.