How to determine the number of useful bits of a number? - c

I'm writing a function, which determine the number of useful bits of a 16 bits integer.
int16_t
f(int16_t x)
{
/* ... */
}
For example, the number "00000010 00100101" has 10 useful bits. I think I should use some bitwise operators, but I don't know how. I'm looking for some ways to do it.

If you're using gcc (or a gcc-compatible compiler such as ICC) then you can use built in intrinsics, e.g.
#include <limits.h>
int f(int16_t x)
{
return x != 0 ? sizeof(x) * CHAR_BIT - __builtin_clz(x) : 0;
}
This assumes you just want the number of bits to the right of the last leading zero bit.
For MSVC you can use _BitScanReverse with some adjustment.
Otherwise if you need this to be portable then you can implement your own general purpose clz function, see e.g. http://en.wikipedia.org/wiki/Find_first_set

These are called bitscan operations, and on intel architecture there are assembly instruction ( you can call directly from C ) see here. If you are using a MS compiler start from here.

Logarithms compute the number of digits needed to represent a certain number to a certain base:
Let [x] be x, rounded to the next integer.
Then [log_b(x)] is the number of digits needed to represent x to base b.
Hence, if you want to know the number of significant bits of some x in C, then ceil(log2(x)) will tell you.
Since there is no algorithm that will tell you the number of leading zeros of a binary representation in constant time, computing the logarithm may actually be faster than naively iterating.

Related

Operating Rightmost/Leftmost n-Bits, Not All the Bits of A Integer Type Data Variable

In a programming-task, I have to add a smaller integer in variable B (data type int)
to a larger integer (20 decimal integer) in variable A (data type long long int),
then compare A with variable C which is also as large integer (data type long long int) as A.
What I realized, since I add a smaller B to A,
I don't need to check all the digits of A when I compare that with C, in other words, we don't need to check all the bits of A and C.
Given that I know, how many bits from the right I need to check, say n-bits,
is there a way/technique to check only those specific n-bits from the right (not all the bits of A, C) to make the program faster in c programming language?
Because for comparing all the bits take more time, and since I am working with large number, the program becomes slower.
Every time I search in the google, bit-masking appears which uses all the bits of A, C, that doesn't do what I am asking for, so probably I am not using correct terminology, please help.
Addition:
Initial comments of this post made me think there is no way but i found the following -
Bit Manipulation by University of Colorado Boulder
(#cuboulder, after 7:45)
...the bit band region is accessed via a bit band alías, each bit in a
supported bit band region has its own unique address and we can access
that bit using a pointer to its bit band alias location, the least
significant bit in an alias location can be sent or cleared and that
will be mapped to the bit in the corresponding data or peripheral
memory, unfortunately this will not help you if you need to write to
multiple bit locations in memory dependent operations only allow a
single bit to be cleared or set...
Is above what I a asking for? if yes then
where I can find the detail as beginner?
Updated question:
Is there a way/technique to check only those specific n-bits from the right (not all the bits of A, C) to make the program faster in c programming language (or any other language) that makes the program faster?
Your assumption that comparing fewer bits is faster might be true in some cases but is probably not true in most cases.
I'm only familiar with x86 CPUs. A x86-64 Processor has 64 bit wide registers. These can be accessed as 64 bit registers but the lower bits also as 32, 16 and 8 bit registers. There are processor instructions which work with the 64, 32, 16 or 8 bit part of the registers. Comparing 8 bits is one instruction but so is comparing 64 bits.
If using the 32 bit comparison would be faster than the 64 bit comparison you could gain some speed. But it seems like there is no speed difference for current processor generations. (Check out the "cmp" instruction with the link to uops.info from #harold.)
If your long long data type is actually bigger then the word size of your processor, then it's a different story. E.g. if your long long is 64 bit but your are on a 32 bit processor then these instructions cannot be handled by one register and you would need multiple instructions. So if you know that comparing only the lower 32 bits would be enough this could save some time.
Also note that comparing only e.g. 20 bits would actually take more time then comparing 32 bits. You would have to compare 32 bits and then mask the 12 highest bits. So you would need a comparison and a bitwise and instruction.
As you see this is very processor specific. And you are on the processors opcode level. As #RawkFist wrote in his comment you could try to get the C compiler to create such instructions but that does not automatically mean that this is even faster.
All of this is only relevant if these operations are executed a lot. I'm not sure what you are doing. If e.g. you add many values B to A and compare them to C each time it might be faster to start with C, subtract the B values from it and compare with 0. Because the compare-operation works internally like a subtraction. So instead of an add and a compare instruction a single subtraction would be enough within the loop. But modern CPUs and compilers are very smart and optimize a lot. So maybe the compiler automatically performs such or similar optimizations.
Try this question.
Is there a way/technique to check only those specific n-bits from the right (not all the bits of A, C) to make the program faster in c programming language (or any other language) that makes the program faster?
Yes - when A + B != C. We can short-cut the comparison once a difference is found: from least to most significant.
No - when A + B == C. All bits need comparison.
Now back to OP's original question
Is there a way/technique to check only those specific n-bits from the right (not all the bits of A, C) to make the program faster in c programming language (or any other language) that makes the program faster?
No. In order to do so, we need to out-think the compiler. A well enabled compiler itself will notice any "tricks" available for long long + (signed char)int == long long and emit efficient code.
Yet what about really long compares? How about a custom uint1000000 for A and C?
For long compares of a custom type, a quick compare can be had.
First, select a fast working type. unsigned is a prime candidate.
typedef unsigned ufast;
Now define the wide integer.
#include <limits.h>
#include <stdbool.h>
#define UINT1000000_N (1000000/(sizeof(ufast) * CHAR_BIT))
typedef struct {
// Least significant first
ufast digit[UINT1000000_N];
} uint1000000;
Perform the addition and compare one "digit" at a time.
bool uint1000000_fast_offset_compare(const uint1000000 *A, unsigned B,
const uint1000000 *C) {
ufast carry = B;
for (unsigned i = 0; i < UINT1000000_N; i++) {
ufast sum = A->digit[i] + carry;
if (sum != C->digit[i]) {
return false;
}
carry = sum < A->digit[i];
}
return true;
}

How do you convert the remainder of a division operation to a fixed point in C?

I understand the concept of fixed point pretty well at this point, but I'm having trouble making a logical jump.
I'm working with M68000 CPUs using gcc with no standard libraries of any sort. Using DIVU/DIVS opcodes, I can obtain the quotient and the remainder. Given a Q16.16 fixed point value stored in an unsigned 32bit memory space, I know I can put the quotient in the upper 16 bits. However, how does one convert the integer remainder into the fractional portion of the fixed point value?
I'm sure this is something simple and I'm just missing it. Any help would be greatly appreciated.
The way to think about it is that fixed point numbers are actually integers hold the value of your number times some fixed multiplier. You want to build you fixed point operations out of the integer operations you have available in your hardware.
So for a 16.16 fixed-point format, your multiplier is 65536 (216), so if you want to do a divide c = a/b, the numbers (integers) you have to work with are actually a' = a * 65536 and b' = b * 65536 and you want to find c' = c * 65536. So substituting into the desired c = a/b, you have
c'/65536 = (a'/65536) / (b'/65536) = a'/b'
c' = 65536 * a' / b'
So you actually want to first (integer) mulitply the fixed-point value of a by 65536 (left shift by 16), then do an integer divide by the fixed point value of b, and that will give you the fixed point value of c. The issue is that the first multiply will almost certainly overflow 32 bits, so you need a 64 bit (actually only 48 bit) intermediate. So if you're using a 68020+ with a 64/32 DIVS.L instruction (divides a 64 bit value in a pair of registers by a 32 bit value), you're fine. You don't need the remainder at all.
If you're using a pure 68000 that doesn't have the wide divide, you'll need to do 16-bit long division on the values (where you use 16 bit numbers as "digits", so you're dividing a 3-"digit" number by a 2-"digit" one)

platform independent way to reduce precision of floating point constant values

The use case:
I have some large data arrays containing floating point constants that.
The file defining that array is generated and the template can be easily adapted.
I would like to make some tests, how reduced precision does influence the results in terms of quality, but also in compressibility of the binary.
Since I do not want to change other source code than the generated file, I am looking for a way to reduce the precision of the constants.
I would like to limit the mantissa to a fixed number of bits (set the lower ones to 0). But since floating point literals are in decimal, there are some difficulties, specifying numbers in a way that the binary representation does contain all zeros at the lower mantissa bits.
The best case would be something like:
#define FP_REDUCE(float) /* some macro */
static const float32_t veryLargeArray[] = {
FP_REDUCE(23.423f), FP_REDUCE(0.000023f), FP_REDUCE(290.2342f),
// ...
};
#undef FP_REDUCE
This should be done at compile time and it should be platform independent.
The following uses the Veltkamp-Dekker splitting algorithm to remove n bits (with rounding) from x, where p = 2n (for example, to remove eight bits, use 0x1p8f for the second argument). The casts to float32_t coerce the results to that type, as the C standard otherwise permits implementations to use more precision within expressions. (Double-rounding could produce incorrect results in theory, but this will not occur when float32_t is the IEEE basic 32-bit binary format and the C implementation computes this expression in that format or the 64-bit format or wider, as the former is the desired format and the latter is wide enough to represent intermediate results exactly.)
IEEE-754 binary floating-point is assumed, with round-to-nearest. Overflow occurs if x•(p+1) rounds to infinity.
#define RemoveBits(x, p) (float32_t) (((float32_t) ((x) * ((p)+1))) - (float32_t) (((float32_t) ((x) * ((p)+1))) - (x))))
What you're asking for can be done with varying degrees of partial portability, but not absolute unless you want to run the source file through your own preprocessing tool at build time to reduce the precision. If that's an option for you, it's probably your best one.
Short of that, I'm going to assume at least that your floating point types are base 2 and obey Annex F/IEEE semantics. This should be a reasonable assumption, but the latter is false with gcc on platforms (including 32-bit x86) with extended-precision under the default standards-conformance profile; you need -std=cNN or -fexcess-precision=standard to fix it.
One approach is to add and subtract a power of two chosen to cause rounding to the desired precision:
#define FP_REDUCE(x,p) ((x)+(p)-(p))
Unfortunately, this works in absolute precisions, not relative, and requires knowing the right value p for the particular x, which is going to be equal to the value of the leading base-2 place of x, times 2 raised to the power of FLT_MANT_DIG minus the bits of precision you want. This cannot be evaluated as a constant expression for use as an initializer, but you can write it in terms of FLT_EPSILON and, if you can assume C99+, a preprocessor-token-pasting to form a hex float literal, yielding the correct value for this factor. But you still need to know the power of two for the leading digit of x; I don't see any way to extract that as a constant expression.
Edit: I believe this is fixable, so as not to need an absolute precision but rather automatically scale to the value, but it depends on correctness of a work in progress. See Is there a correct constant-expression, in terms of a float, for its msb?. If that works I will later integrate the result with this answer.
Another approach I like, if your compiler supports compound literals in static initializers and if you can assume IEEE type representations, is using a union and masking off bits:
union { float x; uint32_t r; } fr;
#define FP_REDUCE(x) ((union fr){.r=(union fr){x}.r & (0xffffffffu<<n)}.x)
where n is the number of bits you want to drop. This will round towards zero rather than to nearest; if you want to make it round to nearest, it should be possible by adding an appropriate constant to the low bits before masking, but you have to take care about what happens when the addition overflows into the exponent bits.

fixed point arithmetic in modern systems

I'd like to start out by saying this isn't about optimizations so please refrain from dragging this topic down that path. My purpose for using fixed point arithmetic is because I want to control the precision of my calculations without using floating point.
With that being said let's move on. I wanted to have 17 bits for range and 15 bits for the fractional part. The extra bit is for the signed value. Here are some macros below.
const int scl = 18;
#define Double2Fix(x) ((x) * (double)(1 << scl))
#define Float2Fix(x) ((x) * (float)(1 << scl))
#define Fix2Double(x) ((double)(x) / (1 << scl))
#define Fix2Float(x) ((float)(x) / (1 << scl))
Addition and subtraction are fairly straight forward but things gets a bit tricky with mul and div.
I've seen two different ways to handle these two types of operations.
1) if I am using 32 bits then use a temp 64bit variable to store intermediate multiplication steps then scale at the end.
2) right in the multiplication step scale both variables to a lesser bit range before multiplication. For example if you have a 32 bit register with 16 bits for the whole number you could shift like this:
(((a)>>8)*((b)>>6) >> 2) or some combination that makes sense for you app.
It seems to me that if you design your fixed point math around 32 bits it might be impractical to always depend on having a 64bit variable able to store your intermediate values but on the other hand shifting to a lower scale will seriously reduce your range and precision.
questions
Since i'd like to avoid trying to force the cpu to try to create a 64bit type in the middle of my calculations is the shifting to lower bit values the only other alternative?
Also i've notice
int b = Double2Fix(9.1234567890);
printf("double shift:%f\n",Fix2Double(b));
int c = Float2Fix(9.1234567890);
printf("float shift:%f\n",Fix2Float(c));
double shift:9.123444
float shift:9.123444
Is that precision loss just a part of using fixed point numbers?
Since i'd like to avoid trying to force the cpu to try to create a 64bit type in the middle of my calculations is the shifting to lower bit values the only other alternative?
You have to work with the hardware capabilities, and the only available operations you'll find are:
Multiply N x N => low N bits (native C multiplication)
Multiply N x N => high N bits (the C language has no operator for this)
Multiply N x N => all 2N bits (cast to wider type, then multiply)
If the instruction set has #3, and the CPU implements it efficiently, then there's no need to worry about the extra-wide result it produces. For x86, you can pretty much take these as a given. Anyway, you said this wasn't an optimization question :) .
Sticking to just #1, you'll need to break the operands into pieces of (N/2) bits and do long multiplication, which is likely to generate more work. There are still cases where it's the right thing to do, for instance implementing #3 (software extended arithmetic) on a CPU that doesn't have it or #2.
Is that precision loss just a part of using fixed point numbers?
log2( 9.1234567890 – 9.123444 ) = –16.25, and you used 16 bits of precision, so yep, that's very typical.

Is bitwise & equivalent to modulo operation

I came across the following snippet which in my opinion is to convert an integer into binary equivalence.
Can anyone tell me why an &1 is used instead of %2 ? Many thanks.
for (i = 0; i <= nBits; ++i) {
bits[i] = ((unsigned long) x & 1) ? 1 : 0;
x = x / 2;
}
The representation of unsigned integers is specified by the Standard: An unsigned integer with n value bits represents numbers in the range [0, 2n), with the usual binary semantics. Therefore, the least significant bit is the remainder of the value of the integer after division by 2.
It is debatable whether it's useful to replace readable mathematics with low-level bit operations; this kind of style was popular in the 70s when compilers weren't very smart. Nowadays I think you can assume that a compiler will know that dividing by two can be realized as bit shift etc., so you can just Write What You Mean.
what the code snippet does, is not to convert a unsigned int into a binary number (it's internal representation is already binary). It created a bit array with the values of the unsigned int's bits. Spreads it out over an array if you will.
e.g. x=3 => bits[2]=0 bits[1]=1 bits[0]=1
To do this
it selects the last bit of the number and places it the bits array
(the &1 operation).
then shifts the number to the right by one position ( /2 is
equivalent to >>1).
Repeats the above operations for all the bits
You could have used %2 instead of &1, the generated code should be the same. But I guess it's just a matter of programming style and preference. For most programmers, the &1 is a lot clearer than %2.
In your example, %2 and &1 are the same. Which one to take is probably simply a matter of taste. While %2 is probably more easier to read for people with a strong mathematics background, &1 is easier to understand for people with a strong technical background.
They are equivalent in the very special case. It's an old Fortran influenced style.

Resources