I need to store a large number, but due to limitations in an old game engine, I am restricted to working with signed short (I can, however, use as many of these as I want).
I need to split an unsigned long (0 to 4,294,967,295) into multiple signed short (-32,768 to 32,767). Then I need to recombine the multiple signed short into a new unsigned long later.
For example, take the number 4,000,000,000. This should be split into multiple signed short and then recombined into unsigned long.
Is this possible in C? Thanks.
In addition to dbush's answer you can also use a union, e.g.:
union
{
unsigned long longvalue;
signed short shortvalues[2];
}
value;
The array of two shorts overlays the single long value.
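A minimal sketch of how that overlay could be used for the round trip, assuming a 32-bit unsigned type and 16-bit shorts. The fixed-width types from stdint.h and the names split32, split_union and join_union are illustrative and not part of the answer above. Which half lands in shortvalues[0] depends on the byte order of the machine, but the round trip itself does not:

```c
#include <stdint.h>

/* Hypothetical helper type: one 32-bit value overlaid by two 16-bit halves. */
union split32 {
    uint32_t longvalue;
    int16_t  shortvalues[2];
};

/* Store v, then read it back as two signed 16-bit halves. */
void split_union(uint32_t v, int16_t out[2])
{
    union split32 u;
    u.longvalue = v;
    out[0] = u.shortvalues[0];
    out[1] = u.shortvalues[1];
}

/* Rebuild the 32-bit value from the two halves, in the same order. */
uint32_t join_union(const int16_t in[2])
{
    union split32 u;
    u.shortvalues[0] = in[0];
    u.shortvalues[1] = in[1];
    return u.longvalue;
}
```

Calling split_union(4000000000u, parts) and later join_union(parts) should hand back the original number, as long as both calls run on the same platform.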
I assume your problem is finding a place to store these large values. There are options we haven't yet explored which don't involve splitting the values up and recombining them:
Write them to a file, and read them back later. This might seem silly at first, but considering the bigger picture, if the values end up in a file later on then this might seem like the most attractive option.
Declare your unsigned long to have static storage duration e.g. outside of any blocks of code A.K.A globally (I hate that term) or using the static keyword inside a block of code.
None of the other answers so far are strictly portable, not that it seems like it should matter to you. You seem to be describing a twos complement 16-bit signed short representation and a 32-bit unsigned long representation (you should put assertions in place to ensure this is the case), which has implications that restrict the options for the implementation (that is, the C compiler, the OS, the CPU, etc)... so the portability issues associated with them are unlikely to occur. In case you're curious, however, I'll discuss those issues anyway.
The portability issues associated are that one type or the other might have padding bits causing the sizes to mismatch, and that there might be trap representations for short.
Changing the type but not the representation is by far cleaner and easier to get right, though not portable. This includes the union hack; you could also avoid the union by casting an unsigned long * to a short *. These are the cleanest solutions, which makes Ken Clement's answer my favourite so far, despite the non-portability.
The binary shift (>> and <<), and (&) and or (|) operators introduce additional portability issues when you use them on signed types; they're also bulky and clumsy, leading to more code to debug and a higher chance of mistakes.
You need to consider that while ULONG_MAX is guaranteed to be at least 4,294,967,295, SHRT_MIN is not guaranteed by the C standard to be -32,768; it might be -32,767 (which is quite uncommon indeed, though still possible)... There might be a negative zero or trap representation in place of that -32,768 value.
This means you can't portably rely upon a pair of signed shorts being able to represent all of the values of an unsigned long; even when the sizes match up you need another bit to account for the two missing values.
With this in mind, you could use a third signed short and keep only 12 bits in each, so that every stored value stays non-negative... The implementation-defined and undefined behaviours of the shift approaches could be avoided that way.
signed short x = (value ) & 0xFFF,
y = (value >> 12) & 0xFFF,
z = (value >> 24) & 0xFFF;
value = (unsigned long) x
+ ((unsigned long) y << 12)
+ ((unsigned long) z << 24);
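Another way to sidestep both the signed shifts and the out-of-range conversions is to bias each 16-bit half by 32,768, so that it is stored as an ordinary in-range value. The sketch below is only an illustration; the names to_short, from_short, split_biased and join_biased are made up here. It still assumes that short can hold -32,768, which, as noted above, the standard does not promise:

```c
/* Map a 16-bit chunk (0..65535) onto -32768..32767 with plain arithmetic. */
short to_short(unsigned long chunk)
{
    return (short)((long)chunk - 32768L);
}

/* Undo the bias: -32768..32767 back to 0..65535. */
unsigned long from_short(short s)
{
    return (unsigned long)((long)s + 32768L);
}

/* All shifts and masks are applied to unsigned long only. */
void split_biased(unsigned long value, short *hi, short *lo)
{
    *lo = to_short(value & 0xFFFFUL);
    *hi = to_short((value >> 16) & 0xFFFFUL);
}

unsigned long join_biased(short hi, short lo)
{
    return (from_short(hi) << 16) | from_short(lo);
}
```

The stored shorts are not the raw bit patterns of the halves, so this only suits cases where the game engine merely has to carry the values around, not interpret them.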
You can do it like this (I used fixed size types to properly illustrate how it works):
#include<stdio.h>
#include<stdint.h>
int main()
{
uint32_t val1;
int16_t val2a, val2b;
uint32_t val3;
val1 = 0x11223344;
printf("val1=%08x\n", val1);
// to short
val2a = val1 >> 16;
val2b = val1 & 0xFFFF;
printf("val2a=%04x\n", val2a);
printf("val2b=%04x\n", val2b);
// to long
val3 = (uint32_t)(uint16_t)val2a << 16;
val3 |= (uint16_t)val2b; // go through uint16_t so a negative half is not sign-extended
printf("val3=%08x\n", val3);
return 0;
}
Output:
val1=11223344
val2a=1122
val2b=3344
val3=11223344
There are any number of ways to do it. One thing to consider is that unsigned long may not have the same size on different hardware/operating systems. You can use the exact-width types found in stdint.h to avoid ambiguity (e.g. uint8_t, uint16_t, etc.). One implementation incorporating exact-width types (and cheesy hex values) would be:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <limits.h>
int main (void) {
uint64_t a = 0xfacedeadbeefcafe, b = 0;
uint16_t s[4] = {0};
uint32_t i = 0, n = 0;
printf ("\n a : 0x%16"PRIx64"\n\n", a);
/* separate uint64_t into 4 uint16_t */
for (i = 0; i < sizeof a; i += 2, n++)
printf (" s[%"PRIu32"] : 0x%04"PRIx16"\n", n,
(s[n] = (a >> (i * CHAR_BIT))));
/* combine 4 uint16_t into uint64_t */
for (n = i = 0; i < sizeof b; i += 2, n++)
b |= (uint64_t)s[n] << i * CHAR_BIT;
printf ("\n b : 0x%16"PRIx64"\n\n", b);
return 0;
}
Output
$ ./bin/uint64_16
a : 0xfacedeadbeefcafe
s[0] : 0xcafe
s[1] : 0xbeef
s[2] : 0xdead
s[3] : 0xface
b : 0xfacedeadbeefcafe
This is one possible solution (which assumes ulong is 32-bits, and sshort is 16-bits):
unsigned long L1, L2;
signed short S1, S2;
L1 = 0x12345678; /* Initial ulong to store away into two sshort */
S1 = L1 & 0xFFFF; /* Store component 1 */
S2 = L1 >> 16; /* Store component 2*/
L2 = ((unsigned long)(unsigned short)S2 << 16) | (unsigned short)S1; /* Retrieve ulong from two sshort; the unsigned casts prevent sign extension */
/* Print results */
printf("Initial value: 0x%08lx\n",L1);
printf("Stored component 1: 0x%04hx\n",S1);
printf("Stored component 2: 0x%04hx\n",S2);
printf("Retrieved value: 0x%08lx\n",L2);
I have a program that uses the following two functions 99.9999% of time:
unsigned int getBit(unsigned char *byte, unsigned int bitPosition)
{
return (*byte & (1 << bitPosition)) >> bitPosition;
}
void setBit(unsigned char *byte, unsigned int bitPosition, unsigned int bitValue)
{
*byte = (*byte | (1 << bitPosition)) ^ ((bitValue ^ 1) << bitPosition);
}
Can this be improved? The processing speed of the program mainly depends on the speed of these two functions.
UPDATE
I will benchmark each answer provided below and write up the timings I get. For reference, the compiler used is gcc on the Mac OS X platform:
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
I compile without any specific arguments like: gcc -o program program.c
If you think I should set some optimizations, feel free to suggest.
The CPU is:
2,53 GHz Intel Core 2 Duo
While processing 21.5 MB of data with my originally provided functions it takes about:
Time: 13.565221
Time: 13.558416
Time: 13.566042
Time is in seconds (these are three tries).
-- UPDATE 2 --
I've used the -O3 optimization (gcc -O3 -o program program.c) option and now I'm getting these results:
Time: 6.168574
Time: 6.170481
Time: 6.167839
I'll redo the other benchmarks now...
If you want to stick with functions, then for the first one:
unsigned int getBit(unsigned char *byte, unsigned int bitPosition)
{
return (*byte >> bitPosition) & 1;
}
For the second one:
void setBit(unsigned char *byte, unsigned int bitPosition, unsigned int bitValue)
{
if(bitValue == 0)
*byte &= ~(1 << bitPosition);
else
*byte |= (1 << bitPosition);
}
However, I suspect that the function call/return overhead will swamp the actual bit-flipping. A good compiler might inline these function calls anyways, but you may get some improvement by defining these as macros:
#define getBit(b, p) ((*(b) >> (p)) & 1)
#define setBit(b, p, v) (*(b) = ((v) ? (*(b) | (1 << (p))) : (*(b) & (~(1 << (p))))))
@user694733 pointed out that branch prediction might be a problem and could cause a slowdown. As such it might be good to define separate setBit and clearBit functions:
void setBit(unsigned char *byte, unsigned int bitPosition)
{
*byte |= (1 << bitPosition);
}
void clearBit(unsigned char *byte, unsigned int bitPosition)
{
*byte &= ~(1 << bitPosition);
}
And their corresponding macro versions:
#define setBit(b, p) (*(b) |= (1 << (p)))
#define clearBit(b, p) (*(b) &= ~(1 << (p)))
The separate functions/macros would be useful if the calling code hard-codes the value passed for the bitValue argument in the original version.
Share and enjoy.
How about:
#include <stdbool.h>
#include <stdint.h>
bool getBit(unsigned char byte, unsigned int bitPosition)
{
return (byte & (1 << bitPosition)) != 0;
}
No need to use a shift operator to "physically" shift the masked-out bit into position 0, just use a comparison operator and let the compiler deal with it. This should of course also be made inline if possible.
For the second one, it's complicated by the fact that it's basically "assignBit", i.e. it takes the new value of the indicated bit as a parameter. I'd try using the explicit branch:
unsigned char setBit(unsigned char byte, unsigned int bitPosition, bool value)
{
const uint8_t mask = 1 << bitPosition;
if(value)
return byte | mask;
return byte & ~mask;
}
Generally, these things are best left to the compiler's optimizer.
But why do you need functions for such trivial tasks? A C programmer should not get shocked when they encounter basic stuff like this:
x |= 1<<n; // set bit
x &= ~(1<<n); // clear bit
x ^= 1<<n; // toggle bit
y = x & (1<<n); // read bit
There is no real reason to hide simple things like these behind functions. You won't make the code more readable, because you can always assume that the reader of your code knows C. It rather seems like pointless wrapper functions to hide away "scary" operators that the programmer isn't familiar with.
That being said, the introduction of the functions may cause a lot of overhead code. To turn your functions back into the core operations shown above, the optimizer would have to be quite good.
If you for some reason persist in using the functions, any attempt at manual optimization is going to be questionable practice. The use of inline, register and similar keywords is likely superfluous. A compiler with its optimizer enabled should be far more capable than the programmer of deciding when to inline and when to put things in registers.
As usual, it doesn't make sense to manually optimize code, unless you know more about the given CPU than the person who wrote the compiler port for it. Most often this is not the case.
What you can harmlessly do as manual optimization, is to get rid of unsigned char (you shouldn't be using the native C types for this anyhow). Instead use the uint_fast8_t type from stdint.h. Using this type means: "I would like to have an uint8_t, but if the CPU prefers a larger type for alignment/performance reasons, it can use that instead".
EDIT
There are different ways to set a bit to either 1 or 0. For maximum readability, you would write this:
uint8_t val = either_1_or_0;
...
if(val == 1)
byte |= 1<<n;
else
byte &= ~(1<<n);
This does however include a branch. Let's assume we know that the branch is a known performance bottleneck on the given system, to justify the otherwise questionable practice of manual optimization. We could then set the bit to either 1 or 0 without a branch, in the following manner:
byte = (byte & ~(1<<n)) | (val<<n);
And this is where the code is turning a bit unreadable. Read the above as:
Take the byte and preserve everything in it, except for the bit we want to set to 1 or 0.
Clear this bit.
Then set it to either 1 or 0.
Note that the whole right side sub-expression is pointless if val is zero. So on a "generic system" this code is possibly slower than the readable version. So before writing code like this, we would have to know that our CPU is very good at bit-flipping and not-so-good at branch prediction.
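To make the comparison concrete, here is a small sketch that wraps both forms in functions and checks that they agree. The function names are hypothetical, and the branchless form assumes val is exactly 0 or 1:

```c
#include <stdint.h>

/* Readable version: branch on the new value. */
uint8_t assign_bit_branch(uint8_t byte, unsigned n, unsigned val)
{
    if (val)
        byte |= (uint8_t)(1u << n);
    else
        byte &= (uint8_t)~(1u << n);
    return byte;
}

/* Branchless version: clear the bit, then OR in val (which must be 0 or 1). */
uint8_t assign_bit_branchless(uint8_t byte, unsigned n, unsigned val)
{
    return (uint8_t)((byte & ~(1u << n)) | (val << n));
}

/* Count disagreements over every byte, bit position and value.
   A result of 0 means the two forms are interchangeable. */
unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned b = 0; b < 256; b++)
        for (unsigned n = 0; n < 8; n++)
            for (unsigned v = 0; v < 2; v++)
                bad += assign_bit_branch((uint8_t)b, n, v)
                       != assign_bit_branchless((uint8_t)b, n, v);
    return bad;
}
```

Which of the two is faster is something only a measurement on the target CPU and compiler can settle.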
You can benchmark with the following variations and keep the best of all solutions.
inline unsigned int getBit(unsigned char *byte, unsigned int bitPosition)
{
const unsigned char mask = (unsigned char)(1U << bitPosition);
return !!(*byte & mask);
}
inline void setBit(unsigned char *byte, unsigned int bitPosition, unsigned int bitValue)
{
const unsigned char mask = (unsigned char)(1U << bitPosition);
bitValue ? (*byte |= mask) : (*byte &= ~mask); /* the parentheses are required in C */
}
If your algorithm only needs a zero vs. non-zero result from getBit, you can remove the !! from the return. (To return 0 or 1, I found @BobJarvis's version really clean.)
If your algorithm can pass the bit mask to be set or reset to setBit function, you won't need to calculate mask explicitly.
So depending on the code calling these functions, it may be possible to cut on time.
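As a rough illustration of the mask-passing idea, the variants below take a ready-made mask instead of a bit position. The names get_mask, set_mask and clear_mask are hypothetical; a caller that walks the bits in order can keep one mask and shift it left once per step instead of recomputing it from a position:

```c
/* Hypothetical mask-based variants: the caller supplies a mask such as 0x08. */
unsigned get_mask(const unsigned char *byte, unsigned char mask)
{
    return (*byte & mask) != 0;
}

void set_mask(unsigned char *byte, unsigned char mask)
{
    *byte |= mask;
}

void clear_mask(unsigned char *byte, unsigned char mask)
{
    *byte &= (unsigned char)~mask;
}
```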
How would you do that in C? (Example: 10110001 becomes 10001101 if we had to mirror 8 bits). Are there any instructions on certain processors that would simplify this task?
It's actually called "bit reversal", and is commonly done in FFT scrambling. The O(log N) way is (for up to 32 bits):
uint32_t reverse(uint32_t x, int bits)
{
x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1); // Swap _<>_
x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2); // Swap __<>__
x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4); // Swap ____<>____
x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8); // Swap ...
x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16); // Swap ...
return x >> (32 - bits);
}
Maybe this small "visualization" helps:
The first 3 assignments, shown on a uint8_t:
b7 b6 b5 b4 b3 b2 b1 b0
-> <- -> <- -> <- -> <-
----> <---- ----> <----
----------> <----------
Well, if we're doing ASCII art, here's mine:
7 6 5 4 3 2 1 0
X X X X
6 7 4 5 2 3 0 1
\ X / \ X /
X X X X
/ X \ / X \
4 5 6 7 0 1 2 3
\ \ \ X / / /
\ \ X X / /
\ X X X /
X X X X
/ X X X \
/ / X X \ \
/ / / X \ \ \
0 1 2 3 4 5 6 7
It kind of looks like FFT butterflies. Which is why it pops up with FFTs.
Per Rich Schroeppel in this MIT memo (if you can read past the assembler), the following will reverse the bits in an 8bit byte providing that you have 64bit arithmetic available:
byte = (byte * 0x0202020202ULL & 0x010884422010ULL) % 1023;
Which sort of fans the bits out (the multiply), selects them (the and) and then shrinks them back down (the modulus).
Is it actually an 8bit quantity that you have?
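A small sketch of the multiply/and/modulus trick wrapped in a function, together with a one-bit-at-a-time reference to check it against. The function names are illustrative only:

```c
#include <stdint.h>

/* Reverse the 8 bits of b with one multiply, one AND and one modulus
   (needs 64-bit arithmetic for the intermediate product). */
uint8_t reverse8_mul(uint8_t b)
{
    return (uint8_t)((b * 0x0202020202ULL & 0x010884422010ULL) % 1023);
}

/* Reference implementation: move one bit at a time. */
uint8_t reverse8_loop(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1));
        b >>= 1;
    }
    return r;
}

/* Number of byte values on which the two disagree. */
unsigned reverse8_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned v = 0; v < 256; v++)
        bad += reverse8_mul((uint8_t)v) != reverse8_loop((uint8_t)v);
    return bad;
}
```

With the example from the question, reverse8_mul(0xB1) (10110001) gives 0x8D (10001101).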
Nearly a duplicate of Most Efficient Algorithm for Bit Reversal ( from MSB->LSB to LSB->MSB) in C (which has a lot of answers, including one AVX2 answer for reversing every 8-bit char in an array).
X86
On x86 with SSSE3 (Core2 and later, Bulldozer and later), pshufb (_mm_shuffle_epi8) can be used as a nibble LUT to do 16 lookups in parallel. You only need 8 lookups for the 8 nibbles in a single 32-bit integer, but the real problem is splitting the input bytes into separate nibbles (with their upper half zeroed). It's basically the same problem as for pshufb-based popcount.
avx2 register bits reverse shows how to do this for a packed vector of 32-bit elements. The same code ported to 128-bit vectors would compile just fine with AVX.
It's still good for a single 32-bit int because x86 has very efficient round-trip between integer and vector regs: int bitrev = _mm_cvtsi128_si32 ( rbit32( _mm_cvtsi32_si128(input) ) );. That only costs 2 extra movd instructions to get an integer from an integer register into XMM and back. (Round trip latency = 3 cycles on an Intel CPU like Haswell.)
ARM:
rbit has single-cycle latency, and does a whole 32-bit integer in one instruction.
Fastest approach is almost sure to be a lookup table:
out[0]=lut[in[3]];
out[1]=lut[in[2]];
out[2]=lut[in[1]];
out[3]=lut[in[0]];
Or if you can afford 128k of table data (by afford, I mean cpu cache utilization, not main memory or virtual memory utilization), use 16-bit units:
out[0]=lut[in[1]];
out[1]=lut[in[0]];
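A sketch of the byte-wise variant, assuming a 256-entry table that is filled once at startup. It uses shifts rather than the in/out byte arrays above, and the names lut, init_lut and reverse32_lut are placeholders:

```c
#include <stdint.h>

static uint8_t lut[256];   /* lut[b] = b with its 8 bits reversed */

/* Fill the table once, before the first lookup. */
void init_lut(void)
{
    for (unsigned v = 0; v < 256; v++) {
        uint8_t r = 0;
        for (unsigned i = 0; i < 8; i++)
            if (v & (1u << i))
                r |= (uint8_t)(0x80u >> i);
        lut[v] = r;
    }
}

/* Reverse 32 bits: reverse each byte via the table and swap the byte order. */
uint32_t reverse32_lut(uint32_t x)
{
    return ((uint32_t)lut[x & 0xFF] << 24) |
           ((uint32_t)lut[(x >> 8) & 0xFF] << 16) |
           ((uint32_t)lut[(x >> 16) & 0xFF] << 8) |
            (uint32_t)lut[(x >> 24) & 0xFF];
}
```

After init_lut(), reverse32_lut(0x11223344) yields 0x22CC4488. Whether the table beats the shift-and-mask version depends on how hot it stays in cache.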
The naive / slow / simple way is to extract the low bit of the input and shift it into another variable that accumulates a return value.
#include <stdint.h>
uint32_t mirror_u32(uint32_t input) {
uint32_t returnval = 0;
for (int i = 0; i < 32; ++i) {
int bit = input & 0x01;
returnval <<= 1;
returnval += bit; // Shift the isolated bit into returnval
input >>= 1;
}
return returnval;
}
For other types, the number of bits of storage is sizeof(input) * CHAR_BIT, but that includes potential padding bits that aren't part of the value. The fixed-width types are a good idea here.
The += instead of |= makes gcc compile it more efficiently for x86 (using x86's shift-and-add instruction, LEA). Of course, there are much faster ways to bit-reverse; see the other answers. This loop is good for small code size (no large masks), but otherwise pretty much no advantage.
Compilers unfortunately don't recognize this loop as a bit-reverse and optimize it to ARM rbit or whatever. (See it on the Godbolt compiler explorer)
If you are interested in a more embedded approach: when I worked with an armv7a system, I found the RBIT instruction.
So within a C function using GNU extended asm I could use:
uint32_t bit_reverse32(uint32_t inp32)
{
uint32_t out = 0;
asm("RBIT %0, %1" : "=r" (out) : "r" (inp32));
return out;
}
Some compilers expose an intrinsic C wrapper for this (armcc has __rbit), and gcc also has some intrinsics via ACLE, but with gcc-arm-linux-gnueabihf I could not find __rbit, so I came up with the code above.
I didn't look, but I suppose on other platforms you could create similar solutions.
I've also just figured out a minimal solution for mirroring 4 bits (a nibble) in only 16 bits temporary space.
mirr = ( (orig * 0x222) & 0x1284 ) % 63
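The expression can be checked exhaustively, since there are only 16 inputs. A sketch, with hypothetical function names:

```c
/* Mirror the low 4 bits of orig. The product is at most 15 * 0x222 = 0x1FFE,
   so every intermediate value fits in 16 bits. */
unsigned mirror4(unsigned orig)
{
    return ((orig * 0x222u) & 0x1284u) % 63u;
}

/* Reference: build the mirrored nibble one bit at a time. */
unsigned mirror4_loop(unsigned orig)
{
    unsigned r = 0;
    for (unsigned i = 0; i < 4; i++)
        if (orig & (1u << i))
            r |= 8u >> i;
    return r;
}

/* Number of nibble values on which the two disagree. */
unsigned mirror4_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned v = 0; v < 16; v++)
        bad += mirror4(v) != mirror4_loop(v);
    return bad;
}
```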
I think I would make a lookup table of bitpatterns 0-255. Read each byte and with the lookup table reverse that byte and afterwards arrange the resulting bytes appropriately.
quint64 mirror(quint64 a,quint8 l=64) {
quint64 b=0;
for(quint8 i=0;i<l;i++) {
b|=((a>>(l-i-1))&1)<<i; // take bit (l-i-1) of a and place it at bit i
}
return b;
}
This function can mirror fewer than 64 bits. For instance, it can mirror 12 bits.
quint64 and quint8 are defined in Qt, but they can be redefined in any way you like.
If you have been staring at Mike DeSimone's great answer (like me), here is a "visualization" of the first 3 assignments, with a uint8_t example:
b7 b6 b5 b4 b3 b2 b1 b0
-> <- -> <- -> <- -> <-
----> <---- ----> <----
----------> <----------
So first, bitwise swap, then "two-bit-group" swap and so on.
I'm sure most people won't consider my approach either elegant or efficient: it's aimed at being portable and somewhat "straightforward".
#include <limits.h> // CHAR_BIT
unsigned bit_reverse( unsigned s ) {
unsigned d;
int i;
for( i=CHAR_BIT*sizeof( unsigned ),d=0; i; s>>=1,i-- ) {
d <<= 1;
d |= s&1;
}
return d;
}
This function pulls the least significant bit from the source bitstring s and pushes it into the destination bitstring d, so the first bit pulled ends up as the most significant one.
You can replace the unsigned data type with whatever suits your case, from unsigned char (CHAR_BIT bits, usually 8) to unsigned long long (at least 64 bits).
Of course, there can be CPU-specific instructions (or instruction sets) that could be used instead of my plain C code.
But then that wouldn't be "C language" but rather assembly instruction(s) in a C wrapper.
int mirror (int input)
{// return bit mirror of 8 digit number
int tmp2;
int out=0;
for (int i=0; i<8; i++)
{
out = out << 1;
tmp2 = input & 0x01;
out = out | tmp2;
input = input >> 1;
}
return out;
}
I am running through a memory block of binary data byte-wise.
Currently I am doing something like this:
for (i = 0; i < data->Count; i++)
{
byte = &data->Data[i];
((*byte & Masks[0]) == Masks[0]) ? Stats.FreqOf1++; // syntax incorrect but you get the point.
((*byte & Masks[1]) == Masks[1]) ? Stats.FreqOf1++;
((*byte & Masks[2]) == Masks[2]) ? Stats.FreqOf1++;
((*byte & Masks[3]) == Masks[3]) ? Stats.FreqOf1++;
((*byte & Masks[4]) == Masks[4]) ? Stats.FreqOf1++;
((*byte & Masks[5]) == Masks[5]) ? Stats.FreqOf1++;
((*byte & Masks[6]) == Masks[6]) ? Stats.FreqOf1++;
((*byte & Masks[7]) == Masks[7]) ? Stats.FreqOf1++;
}
Where Masks is:
for (i = 0; i < 8; i++)
{
Masks[i] = 1 << i;
}
(I somehow did not manage to do it as fast in a loop or in an inlined function, so I wrote it out.)
Does anyone have any suggestions on how to improve this first loop? I am rather inexperienced with getting down to bits.
This may seem like a stupid thing to do. But I am in the process of implementing a compression algorithm. I just want to have the bit accessing part down right.
Thanks!
PS: This is on the Visual Studio 2008 compiler, so it would be nice if the suggestions applied to that compiler.
PPS: I just realized, that I don't need to increment two counts. One would be enough. Then compute the difference to the total bits at the end.
But that would be specific to just counting. What I really want done fast is the bit extraction.
EDIT:
The lookup table idea that was brought forward is nice.
I realize though that I posed the question wrong in the title.
Because in the end what I want to do is not count the bits, but access each bit as fast as possible.
ANOTHER EDIT:
Is it possible to advance a pointer by just one bit in the data?
ANOTHER EDIT:
Thank you for all your answers so far.
What I want to implement in the next steps is a nonsophisticated binary arithmetic coder that does not analyze the context. So I am only interested in single bits for now. Eventually it will become a Context-adaptive BAC but I will leave that for later.
Processing 4 bytes instead of 1 byte could be an option. But a loop over 32 bits is costly as well, isn't it?
The fastest way is probably to build a lookup table of byte values versus the number of bits set in that byte. At least that was the answer when I interviewed at Google.
Use a table that maps each byte value (256) to the number of 1's in it. (The # of 0's is just (8 - # of 1's)). Then iterate over the bytes and perform a single lookup for each byte, instead of multiple lookups and comparisons. For example:
int onesCount = 0;
for (i = 0; i < data->Count; i++)
{
byte = &data->Data[i];
onesCount += NumOnes[*byte]; // index with the byte's value, not with the pointer
}
Stats.FreqOf1 += onesCount;
Stats.FreqOf0 += (data->Count * 8) - onesCount;
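The loop above takes the NumOnes table for granted. One way it could be built, as a self-contained sketch (the helper names and the choice of unsigned char entries are assumptions, not part of the answer):

```c
#include <stddef.h>

static unsigned char NumOnes[256];   /* NumOnes[b] = number of 1 bits in byte value b */

/* Build the table once: each entry reuses the entry for the same value
   with its lowest bit shifted out. */
void init_num_ones(void)
{
    NumOnes[0] = 0;
    for (unsigned v = 1; v < 256; v++)
        NumOnes[v] = (unsigned char)((v & 1u) + NumOnes[v >> 1]);
}

/* Count the 1 bits in a buffer with one table lookup per byte. */
size_t count_ones(const unsigned char *data, size_t count)
{
    size_t ones = 0;
    for (size_t i = 0; i < count; i++)
        ones += NumOnes[data[i]];
    return ones;
}
```

The number of 0 bits then follows as count * 8 minus the returned value.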
I did not really understand what you're trying to do. But if you just want to get access to the bits of a bitmap, you can use these (untested!!!) functions:
#include <stddef.h>
_Bool isbitset(unsigned char * bitmap, size_t idx)
{
return bitmap[idx / 8] & (1 << (idx % 8)) ? 1 : 0;
}
void setbit(unsigned char * bitmap, size_t idx)
{
bitmap[idx / 8] |= (1 << (idx % 8));
}
void unsetbit(unsigned char * bitmap, size_t idx)
{
bitmap[idx / 8] &= ~(1 << (idx % 8));
}
void togglebit(unsigned char * bitmap, size_t idx)
{
bitmap[idx / 8] ^= (1 << (idx % 8));
}
Edit: Ok, I think I understand what you want to do: Fast iteration over a sequence of bits. Therefore, we don't want to use the random access functions from above, but read a whole word of data at once.
You might use any unsigned integer type you like, but you should choose one which is likely to correspond to the word size of your architecture. I'll go with uint_fast32_t from stdint.h:
uint_fast32_t * data = __data_source__;
for(; __condition__; ++data)
{
uint_fast32_t mask = 1;
uint_fast32_t current = *data;
for(; mask; mask <<= 1)
{
if(current & mask)
{
// bit is set
}
else
{
// bit is not set
}
}
}
From the inner loop, you can set the bit with
*data |= mask;
unset the bit with
*data &= ~mask;
and toggle the bit with
*data ^= mask;
Warning: The code might behave unexpectedly on big-endian architectures!
You could use a precomputed lookup table, i.e:
static int bitcount_lookup[256] = { ..... } ; /* or make it a global and compute the values in code */
...
for( ... )
byte = ...
Stats.FreqOf1 += bitcount_lookup[byte];
Here is a method how to count the 1 bits of a 32bit integer (based on Java's Integer.bitCount(i) method):
unsigned bitCount(unsigned i) {
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
i = (i + (i >> 4)) & 0x0f0f0f0f;
i = i + (i >> 8);
i = i + (i >> 16);
return i & 0x3f;
}
So you can cast your data to int and move forward in 4 byte steps.
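A sketch of what moving forward in 4-byte steps might look like. It copies each group of four bytes with memcpy instead of casting the pointer, which sidesteps alignment and aliasing concerns; that substitution, the leftover-byte handling and the function names are my additions:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Parallel bit count of one 32-bit word (same steps as bitCount above). */
unsigned bit_count32(uint32_t i)
{
    i = i - ((i >> 1) & 0x55555555u);
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u);
    i = (i + (i >> 4)) & 0x0f0f0f0fu;
    i = i + (i >> 8);
    i = i + (i >> 16);
    return i & 0x3fu;
}

/* Count 1 bits in a byte buffer, four bytes at a time. */
size_t count_ones_words(const unsigned char *data, size_t count)
{
    size_t ones = 0, i = 0;
    for (; i + 4 <= count; i += 4) {
        uint32_t word;
        memcpy(&word, data + i, sizeof word);   /* byte order does not matter for counting */
        ones += bit_count32(word);
    }
    for (; i < count; i++)                      /* leftover 0..3 bytes, one at a time */
        ones += bit_count32(data[i]);
    return ones;
}
```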
Here is a simple one I whipped up on just a single 32 bit value, but you can see it wouldn't be hard to adapt it to any number of bits....
int ones = 0;
unsigned int x = 0xdeadbeef;
unsigned int v = x; // shift a copy so x is still intact for the printf
for(int y = 0;y < 32;y++)
{
if((v & 0x1) == 0x1) ones++;
v = (v >> 1);
}
printf("%x contains %d ones and %d zeros.\n", x, ones, 32-ones);
Notice however, that the loop destroys the value it shifts, which is why it works on the copy v. If you are doing this on data you need to keep, make a copy of it first in the same way.
Doing this in __asm would probably be a better, maybe faster way, but it's hard to say with how well the compiler can optimize...
With each solution you consider, each one will have drawbacks. A lookup table or a bit shifter (like mine), both have drawbacks.
Larry
ttobiass - Keep in mind your inline functions are important in applications like you are talking about, but there are things you need to keep in mind. You CAN get the performance out of the inline code, just remember a couple things.
inline in debug mode does not exist. (Unless you force it)
the compiler will inline functions as it sees fit. Often, if you tell it to inline a function, it may not do it at all. Even if you use __forceinline. Check MSDN for more info on inlining.
Only certain functions can even be inlined. For example, you cannot inline a recursive function.
You'll get your best performance out of your project settings for the C/C++ language, and how you construct your code. At this point, it's important to understand Heap vs. Stack operations, calling conventions, memory alignment, etc.
I know this does not answer your question exactly, but you mention performance, and how to get the best performance, and these things are key.
To join the link wagon:
counting bits
If this is not a case of premature optimization and you truly need to squeeze out every last femtosecond, then you're probably better off with a 256-element static array that you populate once with the bit-count of each byte value, then
Stats.FreqOf1 += bitCountTable[byte]
and when the loop is done:
Stats.FreqOf0 = ((data->Count * 8) - Stats.FreqOf1)
There's a whole chapter on the different techniques for this in the book Beautiful Code. You can read (most of) it on Google books starting here.
A faster way to extract bits is to use:
bitmask= data->Data[i];
while (bitmask)
{
bit_set_as_power_of_two= bitmask & -bitmask;
bitmask&= bitmask - 1;
}
If you just want to count bits set, a per-byte LUT in cache would be fast, but you can also do it in constant time with the interleaved bit counting method in the link in this answer.
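A sketch of that loop as a function that collects each set bit as a power of two, lowest first. The name extract_set_bits and the output-array interface are made up for illustration; the negation is written as 0u - v to keep the arithmetic unsigned:

```c
/* Visit only the set bits of v, lowest first.
   Stores each one as a power of two in out[] and returns how many were stored. */
unsigned extract_set_bits(unsigned v, unsigned out[], unsigned max)
{
    unsigned n = 0;
    while (v && n < max) {
        out[n++] = v & (0u - v);   /* isolate the lowest set bit */
        v &= v - 1;                /* clear it */
    }
    return n;
}
```

The loop runs once per set bit rather than once per bit position, which is where the saving comes from on sparse data.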
I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.
The particular PowerPC processor I'm working on has the quirk that a shift-by-constant-immediate, like
int ShiftByConstant( int x ) { return x << 3 ; }
is fast, single-op, and superscalar, whereas a shift-by-variable, like
int ShiftByVar( int x, int y ) { return x << y ; }
is a microcoded operation that takes 7-11 cycles to execute while the entire rest of the pipeline stops dead.
What I'd like to do is figure out which non-microcoded integer PPC ops the sraw decodes into and then issue them individually. This won't help with the latency of the sraw itself — it'll replace one op with six — but in between those six ops I can dual-dispatch some work to the other execution units and get a net gain.
I can't seem to find anywhere what μops sraw decodes into — does anyone know how I can replace a variable bit-shift with a sequence of constant shifts and basic integer operations? (A for loop or a switch or anything with a branch in it won't work because the branch penalty is even bigger than the microcode penalty, even for correctly-predicted branches.)
This needn't be answered in assembly; I'm hoping to learn the algorithm rather than the particular code, so an answer in C or a high level language or even pseudo code would be perfectly helpful.
Edit: A couple of clarifications that I should add:
I'm not even a little bit worried about portability
PPC has a conditional-move, so we can assume the existence of a branchless intrinsic function
int isel(int a, int b, int c) { return a >= 0 ? b : c; }
(if you write out a ternary that does the same thing I'll get what you mean)
integer multiplication is also microcoded and even slower than sraw. :-(
On Xenon PPC, the latency of a predicted branch is 8 cycles, so even one makes it as costly as the microcoded instruction. Jump-to-pointer (any indirect branch or function pointer) is a guaranteed mispredict, a 24 cycle stall.
Here you go...
I decided to try these out as well since Mike Acton claimed it would be faster than using the CELL/PS3 microcoded shift on his CellPerformance site where he suggests to avoid the indirect shift. However, in all my tests, using the microcoded version was not only faster than a full generic branch-free replacement for indirect shift, it takes way less memory for the code (1 instruction).
The only reason I did these as templates was to get the right output for both signed (usually arithmetic) and unsigned (logical) shifts.
template <typename T> FORCEINLINE T VariableShiftLeft(T nVal, int nShift)
{ // 31-bit shift capability (Rolls over at 32-bits)
const int bMask1=-(1&nShift);
const int bMask2=-(1&(nShift>>1));
const int bMask3=-(1&(nShift>>2));
const int bMask4=-(1&(nShift>>3));
const int bMask5=-(1&(nShift>>4));
nVal=(nVal&bMask1) + nVal; //nVal=((nVal<<1)&bMask1) | (nVal&(~bMask1));
nVal=((nVal<<(1<<1))&bMask2) | (nVal&(~bMask2));
nVal=((nVal<<(1<<2))&bMask3) | (nVal&(~bMask3));
nVal=((nVal<<(1<<3))&bMask4) | (nVal&(~bMask4));
nVal=((nVal<<(1<<4))&bMask5) | (nVal&(~bMask5));
return(nVal);
}
template <typename T> FORCEINLINE T VariableShiftRight(T nVal, int nShift)
{ // 31-bit shift capability (Rolls over at 32-bits)
const int bMask1=-(1&nShift);
const int bMask2=-(1&(nShift>>1));
const int bMask3=-(1&(nShift>>2));
const int bMask4=-(1&(nShift>>3));
const int bMask5=-(1&(nShift>>4));
nVal=((nVal>>1)&bMask1) | (nVal&(~bMask1));
nVal=((nVal>>(1<<1))&bMask2) | (nVal&(~bMask2));
nVal=((nVal>>(1<<2))&bMask3) | (nVal&(~bMask3));
nVal=((nVal>>(1<<3))&bMask4) | (nVal&(~bMask4));
nVal=((nVal>>(1<<4))&bMask5) | (nVal&(~bMask5));
return(nVal);
}
EDIT: Note on isel()
I saw your isel() code on your website.
// if a >= 0, return x, else y
int isel( int a, int x, int y )
{
int mask = a >> 31; // arithmetic shift right, splat out the sign bit
// mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
return x + ((y - x) & mask);
};
FWIW, if you rewrite your isel() to do a mask and mask complement, it will be faster on your PowerPC target since the compiler is smart enough to generate an 'andc' opcode. It's the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly. You just need to change the return to this for the PowerPC versions:
return (x & (~mask)) + (y & mask);
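For comparison, the two forms side by side as a sketch. The function names are hypothetical, and both rely on >> of a negative int32_t being an arithmetic shift, which is implementation-defined in C:

```c
#include <stdint.h>

/* if a >= 0 return x, else y: add-the-difference form. */
int32_t isel_add(int32_t a, int32_t x, int32_t y)
{
    int32_t mask = a >> 31;               /* all ones if a < 0, else 0 (assumes arithmetic shift) */
    return x + ((y - x) & mask);
}

/* Same selection with a mask and its complement. */
int32_t isel_andc(int32_t a, int32_t x, int32_t y)
{
    int32_t mask = a >> 31;
    return (x & ~mask) + (y & mask);      /* the two ANDs do not depend on each other */
}
```

The second form also avoids computing y - x, which can overflow for operands of opposite sign and large magnitude.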
How about this:
if (y & 16) x <<= 16;
if (y & 8) x <<= 8;
if (y & 4) x <<= 4;
if (y & 2) x <<= 2;
if (y & 1) x <<= 1;
will probably take longer yet to execute but easier to interleave if you have other code to go between.
Let's assume that your max shift is 31. So the shift amount is a 5-bit number. Because shifting is cumulative, we can break this into five constant shifts. The obvious version uses branching, but you ruled that out.
Let N be a number between 0 and 4. You want to shift x by 2^N if the bit whose value is 2^N is set in y, otherwise keep x intact. Here is one way to do it:
#define SHIFT(N) x = isel(((y >> N) & 1) - 1, x << (1 << N), x)
The macro assigns to x either x << 2ᴺ or x, depending on whether the Nth bit is set in y or not.
And then the driver:
SHIFT(0); SHIFT(1); SHIFT(2); SHIFT(3); SHIFT(4);
Note that N is a macro variable and becomes constant.
Don't know though if this is going to be actually faster than the variable shift. If it would be, one wonders why the microcode wouldn't run this instead...
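A sketch of that decomposition in plain C, with the stage index running over bits 0 to 4 of the shift count so that the constant shifts are 1, 2, 4, 8 and 16. It uses unsigned values to keep the shifts well defined; isel_u32 is a hypothetical stand-in for the PPC conditional move, and whether a compiler turns its ternary into branch-free code is not guaranteed:

```c
#include <stdint.h>

/* Hypothetical stand-in for the conditional-move intrinsic: b if a >= 0, else c. */
uint32_t isel_u32(int32_t a, uint32_t b, uint32_t c)
{
    return a >= 0 ? b : c;
}

/* One stage: shift x by the constant 2^N if bit N of y is set. */
#define SHIFT_STAGE(x, y, N) \
    ((x) = isel_u32((int32_t)(((y) >> (N)) & 1) - 1, (x) << (1u << (N)), (x)))

/* x << y for y in 0..31, using only shifts by compile-time constants. */
uint32_t shift_left_var(uint32_t x, unsigned y)
{
    SHIFT_STAGE(x, y, 0);
    SHIFT_STAGE(x, y, 1);
    SHIFT_STAGE(x, y, 2);
    SHIFT_STAGE(x, y, 3);
    SHIFT_STAGE(x, y, 4);
    return x;
}
```

The five stages are independent of any loop counter, which is what would let other work be interleaved between them.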
This one breaks my head. I've now discarded a half dozen ideas. All of them exploit the notion that adding a thing to itself shifts left 1, doing the same to the result shifts left 2, and so on. If you keep all the partial results for shift left 0, 1, 2, 4, 8, and 16, then by testing bits 0 to 4 of the shift variable you can get your initial shift. Now do it again, once for each 1 bit in the shift variable. Frankly, you might as well send your processor out for coffee.
The one place I'd look for real help is Hank Warren's Hacker's Delight (which is the only useful part of this answer).
How about this:
int multiplicands[] = { 1, 2, 4, 8, 16, 32, ... etc ...};
int ShiftByVar( int x, int y )
{
//return x << y;
return x * multiplicands[y];
}
If the shift count can be calculated far in advance then I have two ideas that might work
Using self-modifying code
Just modify the shift amount immediate in the instruction. Alternatively generate code dynamically for the functions with variable shift
Group the values with the same shift count together if possible, and do the operation all at once using Duff's device or function pointer to minimize branch misprediction
// shift by constant functions
typedef int (*shiftFunc)(int); // the shift function
#define SHL(n) int shl##n(int x) { return x << (n); }
SHL(1)
SHL(2)
SHL(3)
...
shiftFunc shiftLeft[] = { shl1, shl2, shl3... };
int arr[MAX]; // all the values that need to be shifted with the same amount
shiftFunc shl = shiftLeft[3 - 1]; // when you want to shift by 3 (the table starts at shl1)
for (int i = 0; i < MAX; i++)
arr[i] = shl(arr[i]);
This method might also be done in combination with self-modifying or run-time code generation to remove the need for a function pointer.
Edit: As commented, unfortunately there's no branch prediction on jump to register at all, so the only way this could work is generating code as I said above, or using SIMD
If the range of the values is small, lookup table is another possible solution
#define S(x, n) ((x) + 0) << (n), ((x) + 1) << (n), ((x) + 2) << (n), ((x) + 3) << (n), \
((x) + 4) << (n), ((x) + 5) << (n), ((x) + 6) << (n), ((x) + 7) << (n)
#define S2(x, n) S((x + 0)*8, n), S((x + 1)*8, n), S((x + 2)*8, n), S((x + 3)*8, n), \
S((x + 4)*8, n), S((x + 5)*8, n), S((x + 6)*8, n), S((x + 7)*8, n)
uint8_t shl[8][256] = { // one row of 256 entries per shift count
{ S2(0U, 0), S2(8U, 0), S2(16U, 0), S2(24U, 0) },
{ S2(0U, 1), S2(8U, 1), S2(16U, 1), S2(24U, 1) },
...
{ S2(0U, 7), S2(8U, 7), S2(16U, 7), S2(24U, 7) },
};
Now x << n is simply shl[n][x] with x being an uint8_t. The table costs 2KB (8 × 256 B) of memory. However, for 16-bit values you'll need a 2MB table (16 × 64 Ki entries × 2 B), which may still be viable, and you can do a 32-bit shift by combining two 16-bit shifts together.
There is some good stuff here regarding bit manipulation black magic:
Advanced bit manipulation fu (Christer Ericson's blog)
Don't know if any of it's directly applicable, but if there is a way, likely there are some hints to that way in there somewhere.
Here's something that is trivially unrollable:
int result= value;
for (int i= 0; i<5; ++i)
{
int mask= -(k & 1); // all ones if this bit of k is set, otherwise 0
result= ((result << (1 << i)) & mask) | (result & ~mask); // replace with isel if appropriate; the shift count is a constant once unrolled
k >>= 1;
}