Benefit of writing ((1<<24) - 1) instead of 0xFFFFFF? - c

I have a piece of code in C with the following:
a = b & ((1<<24) - 1);
If I am not mistaken, this is equivalent to:
a = b & 0xFFFFFF;
What is the benefit in terms of performance to write the first one? For me it is more complicated to read, but I suppose the guy who wrote that had a better C background than I have.
Thanks

There is no difference in performance since the compiler will perform the calculation for you.
The first option may be used to explicitly clarify that you are using 24 set bits. This is harder to count in the second option.

In all likelihood, there isn't any performance difference since the compiler will figure out that ((1<<24) - 1) is a constant expression, and will evaluate it at compile time.
We can only speculate about why the original author of the code chose to write it the way they did. Perhaps they thought it better expressed the intent ("mask out all but the 24 least significant bits of b").
If that was their reasoning, I personally would tend to agree with them.
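To back up the "evaluated at compile time" point, here is a minimal sketch, assuming a C11 compiler and an int of at least 26 bits. The names MASK_24, low24_shift and low24_hex are made up for illustration. static_assert and enum constants both require integer constant expressions, so the code only compiles if the shift form is a compile-time constant.

```c
#include <assert.h>

/* Rejected at compile time unless both sides are integer constant
   expressions with the same value (assumes int has at least 26 bits). */
static_assert(((1 << 24) - 1) == 0xFFFFFF, "shift form equals the hex literal");

/* An enum constant must also be a compile-time constant. */
enum { MASK_24 = (1 << 24) - 1 };

/* Hypothetical helpers: the two spellings from the question. */
static int low24_shift(int b) { return b & ((1 << 24) - 1); }
static int low24_hex(int b)   { return b & 0xFFFFFF; }
```

Comparing the generated assembly of the two helpers (for example with gcc -S) is a way to confirm that a given compiler emits the same mask for both.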

I can't see any benefit from the performance point of view, as aix says.
To me, anyway, the first version communicates more clearly than the second that the constant value is 2^24-1. Of course, I guess this is just an opinion.

If it isn't part of a larger block of code, I like your use of 0xFFFFFF better.
But, it can conceivably be part of a group of similar statements. Then the shift version is (arguably) better.
switch (binaryprefix) {
default: a = 0; break;
case DECABIN: a = b & ((1 << 1) - 1); break;
case HECTOBIN: a = b & ((1 << 2) - 1); break;
case KILOBIN: a = b & ((1 << 3) - 1); break;
case MEGABIN: a = b & ((1 << 6) - 1); break;
/* ... */
case ZETTABIN: a = b & ((1 << 21) - 1); break;
case YOTTABIN: a = b & ((1 << 24) - 1); break;
}

No benefit in performance for doing ((1<<24) - 1). It might be slower since it has to perform some operations (<< and -), while 0xFFFFFF is a constant. Best case the compiler will calculate the 1st at compile time and they'd be equivalent.

Generally, you should avoid using statements like the first.
The only scenario I can think of where the first form would be preferable is if the number 24 has a meaning. (In which case it should have been defined and named anyway.)
Like, if for some reason in this line of code it can be 24, and in a different place it might be 22.

Strictly speaking, the expression
(1<<24)
is non-portable and may be undefined behaviour, because 1 is treated as an int, and the standard guarantees only 16 bits for an int. That matters only if you still happen to code for such an implementation. Of course, if a and b are ints, you can safely deduce that the code targets more modern implementations only, with ints of 32 or more bits.
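A sketch of how the mask could be spelled so that it does not depend on the width of int. The names MASK24_UL and low_bits_mask are hypothetical. It relies on unsigned long being at least 32 bits wide, which the standard guarantees, and on uint32_t being available.

```c
#include <assert.h>
#include <stdint.h>

/* 1UL has type unsigned long (at least 32 bits), so shifting it by 24
   is well defined even where int is only 16 bits wide. */
#define MASK24_UL ((1UL << 24) - 1UL)

static_assert(MASK24_UL == 0xFFFFFFUL, "portable 24-bit mask");

/* Hypothetical helper: mask with the low `bits` bits set.
   Shifting by the full width is undefined, so 32 and above is special-cased. */
static uint32_t low_bits_mask(unsigned bits) {
    return bits >= 32 ? UINT32_MAX : (((uint32_t)1 << bits) - 1u);
}
```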

Related

Implementing "logical not" using less than 5 bitwise operators

As part of my CS classes I've recently completed the pretty popular "Data Lab" assignments. In these assignments you are supposed to implement simple binary operations in C with as few operations as possible.
For those who are not familiar with the "Data Lab" a quick overview about the rules:
You may not call functions, cast or use control structures (e.g. if)
You may assign variables at no operator cost, but only int is allowed
The fewer operators you use, the better
You may assume int is 32 bits wide (sizeof(int) == 4)
Negative numbers are represented in 2's complement
The task is to implement a logical not called 'bang' (where bang(x) returns !x) by only using the following operators: ~ & ^ | + << >>
The function prototype is defined as
int bang(int x)
The best implementation I could find (using 5 operators) was the following:
return ((x | (~x +1)) >> 31) + 1
However, there seems to be a way to accomplish this with even fewer operators, since I found a results website[1] from some German university where two people apparently found a solution with fewer than 5 operators. But I can't seem to figure out how they accomplished that.
[1] http://rtsys.informatik.uni-kiel.de/~rt-teach/ss09/v-sysinf2/dlcontest.html (logicalNeg column)
To clarify: This is not about how to solve the problem, but how to solve it with fewer operations.
Only slightly cheating:
int bang(int x) {
return ((x ^ 0xffffffffU) + 1UL) >> 32;
}
is the only way I can think of to do it in only 3 operations. Assumes a 32-bit int and 64-bit long...
If you take the liberty of assuming that int addition overflow is well-defined and wraps (rather than being undefined behavior), then there's a solution with four operators:
((a | (a + 0x7fffffff)) >> 31) + 1
I think you are assuming that overflow is defined to wrap otherwise your function ((x | (~x + 1)) >> 31) + 1 has undefined behavior for x=INT_MIN.
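To see what the five-operator and four-operator formulas do when the addition wraps, here is a sketch that models the wraparound with uint32_t. The casts break the Data Lab rules, so this is only a checking harness, not a contest entry. The names bang5 and bang4 are made up, and the final shift assumes that >> on a negative int32_t is an arithmetic shift, which is implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Five-operator formula: ((x | (~x + 1)) >> 31) + 1.
   The arithmetic is done in uint32_t so that ~x + 1 wraps instead of overflowing. */
static int bang5(int32_t x) {
    uint32_t ux = (uint32_t)x;
    int32_t t = (int32_t)(ux | (~ux + 1u));   /* sign bit is set iff x != 0 */
    return (t >> 31) + 1;                      /* -1 + 1 = 0, or 0 + 1 = 1 */
}

/* Four-operator formula: ((a | (a + 0x7fffffff)) >> 31) + 1, same modeling. */
static int bang4(int32_t a) {
    uint32_t ua = (uint32_t)a;
    int32_t t = (int32_t)(ua | (ua + 0x7fffffffu));
    return (t >> 31) + 1;
}

/* Spot check against the built-in ! operator. */
static void check_bang(void) {
    const int32_t samples[] = { 0, 1, -1, 2, INT32_MAX, INT32_MIN };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(bang5(samples[i]) == !samples[i]);
        assert(bang4(samples[i]) == !samples[i]);
    }
}
```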
why not just :-
int bang(int x)
{
return 1 >> x;
}

What does (r+1 + (r >> 8)) >> 8 do?

In some old C/C++ graphics related code, that I have to port to Java and JavaScript I found this:
b = (b+1 + (b >> 8)) >> 8; // very fast
Where b is a short int for blue, and the same code is seen for r and g (red and green). The comment is not helpful.
I cannot figure out what it does, apart from obvious shifting and adding. I can port without understanding, I just ask out of curiosity.
y = ( x + 1 + (x>>8) ) >> 8 // very fast
This is a fixed-point approximation of division by 255. Conceptually, this is useful for normalizing calculations based on pixel values such that 255 (typically the maximum pixel value) maps to exactly 1.
It is described as very fast because fully general integer division is a relatively slow operation on many CPUs -- although it is possible that your compiler would make a similar optimization for you if it can deduce the input constraints.
This works based on the idea that 257/(256*256) is a very close approximation of 1/255, and that x*257/256 can be formulated as x+(x>>8). The +1 is rounding support which allows the formula to exactly match the integer division x/255 for all values of x in [0..65534].
Some algebra on the inner portion may make things a bit more clear...
x*257/256
= (x*256+x)/256
= x + x/256
= x + (x>>8)
There is more discussion here: How to do alpha blend fast? and here: Division via Multiplication
By the way, if you want round-to-nearest, and your CPU can do fast multiplies, the following is accurate for all uint16_t dividend values -- actually [0..(2^16)+126].
y = ((x+128)*257)>>16 // divide by 255 with round-to-nearest for x in [0..65662]
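The stated ranges can be checked by brute force. The sketch below uses hypothetical helper names and compares both formulas against plain integer division over a range given by the caller. Because 255 is odd, x/255 never lands exactly on a half, so round-to-nearest is written here as (x + 127) / 255.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating divide by 255, as in the original snippet. */
static uint32_t div255_trunc(uint32_t x) { return (x + 1 + (x >> 8)) >> 8; }

/* Round-to-nearest divide by 255 using one multiply. */
static uint32_t div255_round(uint32_t x) { return ((x + 128) * 257) >> 16; }

/* Count inputs in [0, last] where the shift formula differs from x / 255. */
static unsigned count_trunc_mismatches(uint32_t last) {
    unsigned bad = 0;
    for (uint32_t x = 0; x <= last; x++)
        bad += (div255_trunc(x) != x / 255);
    return bad;
}

/* Count inputs in [0, last] where the multiply formula differs from rounding. */
static unsigned count_round_mismatches(uint32_t last) {
    unsigned bad = 0;
    for (uint32_t x = 0; x <= last; x++)
        bad += (div255_round(x) != (x + 127) / 255);
    return bad;
}
```

Calling count_trunc_mismatches(65534) and count_round_mismatches(65662) should both return 0. The first disagreement appears one step past each bound, at x = 65535 and x = 65663 respectively.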
Looks like it is meant to check if blue (or red or green) is fully used. It evaluates to 1, when b is 255, and is 0 for all lower values.
A common use case of when you'd want to use a formula that's more accurate than 257/256 is when you have to combine a lot of alpha values together for each pixel. As one example, when doing image shrinking, you need to combine 4 alphas for each source pixel contributing to the destination, and then combine all the source pixels contributing to the destination.
I posted an infinitely accurate bit twiddling version of /255 but it was rejected without reason. So I'll add that I implement alpha blending hardware for a living, I write real time graphics code and game engines for a living, and I've published articles on this topic in conferences like MICRO, so I really know what I'm talking about. And it might be useful or at least entertaining for people to understand the more accurate formula that is EXACTLY 1/255:
Version 1: x = (x + (x >> 8)) >> 8
- no constant added, won't satisfy (x * 255) / 255 = x, but will look fine in most cases.
Version 2: x = (x + (x >> 8) + 1) >> 8
- WILL satisfy (x * 255) / 255 = x for integers, but won't hit correct integer values for all alphas
Version 3: (simple integer rounding):
(x + (x >> 8) + 128) >> 8
- Won't hit correct integer values for all alphas, but will on average be closer than Version 2 at the same cost.
Version 4: Infinitely accurate version, to any level of precision desired, for any number of composite alphas: (useful for image resizing, rotation, etc.):
[(x + (x >> 8)) >> 8] + [ ( (x & 255) + (x >> 8) ) >> 8]
Why is version 4 infinitely accurate?
Because 1/255 = 1/256 + 1/65536 + 1/256^3 + 1/256^4 + ...
The simplest expression above (version 1) doesn't handle rounding, but it also doesn't handle the carries that occur from this infinite number of identical sum columns. The new term added above determines the carry out (0 or 1) from this infinite number of base 256 digits. By adding it, you are getting the same result as if you added all the infinite addends. At which point you can round by adding a half bit to whatever accuracy point you want.
Not needed for the OP perhaps, but people should know that you don't need to approximate at all. The formula above is actually more accurate than double precision floating point.
As for speed: In hardware, this method is faster than even a single (full width) add. In software, you have to consider throughput vs latency. In latency, it may still be faster than a narrow multiply (definitely faster than a full width multiply), but in the OP context, you can unroll many pixels at once, and since modern multiply units are pipelined, you are still OK. In translation to Java, you probably have no narrow multiplies, so this could still be faster, but need to check.
WRT the one person who said "why not use the built in OS capabilities for alpha blitting?": If you already have a substantial graphical code base in that OS, this might be a fine option. If not, you're looking at hundreds to thousands of times as many lines of code to leverage the OS version, code that's far harder to write and debug than this code. And in the end, the OS code you have isn't portable at all, while this code can be used anywhere.
I suspect that it is trying to do the following:
boolean isBFullyOn = false;
if (b == 0xff) {
isBFullyOn = true;
}
Back in the days of slow processors, smart bit-shifting tricks like the above could be faster than the obvious if-then-else logic. It avoids a jump instruction, which was costly.
It probably also sets an overflow flag in the processor which was used for some later logic. This is all highly dependent on the target processor.
And this is also speculation on my part!
It computes b+1 + b/256, and then divides the result by 256.
Written with bit shifts, the compiler can translate this into CPU-level shift instructions instead of using the FPU or library division functions.
b = (b + (b >> 8)) >> 8; is basically b = b * 257 / 256 followed by another division by 256, that is, b * 257 / 65536.
I would consider the +1 an ugly hack to compensate for the mean reduction of 0.5 caused by the inner >> 8.
I would write it as b = (b + 128 + ((b +128)>> 8)) >> 8; instead.
Running this test code:
public void test() {
Set<Integer> results = new HashSet<Integer>();
// short int ranges between -32767 and 32767
for (int i = -32767; i <= 32767; i++) {
int b = (i + 1 + (i >> 8)) >> 8;
if (!results.contains(b)) {
System.out.println(i + " -> " + b);
results.add(b);
}
}
}
Produces all possible values between -129 and 128. However, if you are working with 8-bit colours (0 - 255) then the only possible outputs are 0 (for 0 - 254) and 1 (for 255), so it is likely that it is attempting the function @kaykay posted.

Looking for decent-quality PRNG with only 32 bits of state

I'm trying to implement a tolerable-quality version of the rand_r interface, which has the unfortunate interface requirement that its entire state is stored in a single object of type unsigned, which for my purposes means exactly 32 bits. In addition, I need its output range to be [0,2³¹-1]. The standard solution is using a LCG and dropping the low bit (which has the shortest period), but this still leaves very poor periods for the next few bits.
My initial thought was to use two or three iterations of the LCG to generate the high/low or high/mid/low bits of the output. However, such an approach does not preserve the non-biased distribution; rather than each output value having equal frequency, many occur multiple times, and some never occur at all.
Since there are only 32 bits of state, the period of the PRNG is bounded by 2³², and in order to be non-biased, the PRNG must output each value exactly twice if it has full period or exactly once if it has period 2³¹. Shorter periods cannot be non-biased.
Is there any good known PRNG algorithm that meets these criteria?
One good (but probably not the fastest) possibility, offering very high quality, would be to use a 32-bit block cipher in CTR mode. Basically, your RNG state would simply be a 32-bit counter that gets incremented by one for each RNG call, and the output would be the encryption of that counter value using the block cipher with some arbitrarily chosen fixed key. For extra randomness, you could even provide a (non-standard) function to let the user set a custom key.
There aren't a lot of 32-bit block ciphers in common use, since such a short block size introduces problems for cryptographic use. (Basically, the birthday paradox lets you distinguish the output of such a cipher from a random function with a non-negligible probability after only about 2¹⁶ = 65536 outputs, and after 2³² outputs the non-randomness obviously becomes certain.) However, some ciphers with an adjustable block size, such as XXTEA or HPC, will let you go down to 32 bits, and should be suitable for your purposes.
(Edit: My bad, XXTEA only goes down to 64 bits. However, as suggested by CodesInChaos in the comments, Skip32 might be another option. Or you could build your own 32-bit Feistel cipher.)
The CTR mode construction guarantees that the RNG will have a full period of 2³² outputs, while the standard security claim of (non-broken) block ciphers is essentially that it is not computationally feasible to distinguish their output from a random permutation of the set of 32-bit integers. (Of course, as noted above, such a permutation is still easily distinguished from a random function taking 32-bit values.)
Using CTR mode also provides some extra features you may find convenient (even if they're not part of the official API you're developing against), such as the ability to quickly seek into any point in the RNG output stream just by adding or subtracting from the state.
On the other hand, you probably don't want to follow the common practice of seeding the RNG by just setting the internal state to the seed value, since that would cause the output streams generated from nearby seeds to be highly similar (basically just the same stream shifted by the difference of the seeds). One way to avoid this issue would be to add an extra encryption step to the seeding process, i.e. to encrypt the seed with the cipher and set the internal counter value equal to the result.
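A minimal sketch of the counter-plus-permutation construction, using a home-made 32-bit Feistel network as suggested above. The round function, the round keys and every name here are made up for illustration, and nothing about this has been vetted as a cipher or tested statistically. The only property it demonstrates is that a Feistel network is a bijection on 32-bit words, so a counter fed through it visits every output exactly once per period. It assumes unsigned is 32 bits wide, as in the question.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical round keys; any fixed values preserve the bijection. */
static const uint16_t ROUND_KEYS[4] = { 0x3A5Cu, 0xC96Eu, 0x517Bu, 0xE2D4u };

/* Ad-hoc round function (an assumption, not a vetted cipher component). */
static uint16_t round_fn(uint16_t half, uint16_t key) {
    uint32_t v = (uint32_t)half * 0x9E3Bu + key;
    return (uint16_t)(v ^ (v >> 9));
}

/* Four-round balanced Feistel network over two 16-bit halves. */
static uint32_t feistel32(uint32_t x) {
    uint16_t left = (uint16_t)(x >> 16), right = (uint16_t)x;
    for (int i = 0; i < 4; i++) {
        uint16_t next = (uint16_t)(left ^ round_fn(right, ROUND_KEYS[i]));
        left = right;
        right = next;
    }
    return ((uint32_t)left << 16) | right;
}

/* Inverse network; its existence is what makes feistel32 a permutation. */
static uint32_t feistel32_inverse(uint32_t y) {
    uint16_t left = (uint16_t)(y >> 16), right = (uint16_t)y;
    for (int i = 3; i >= 0; i--) {
        uint16_t prev = (uint16_t)(right ^ round_fn(left, ROUND_KEYS[i]));
        right = left;
        left = prev;
    }
    return ((uint32_t)left << 16) | right;
}

/* rand_r-style wrapper: the state is a plain counter, and the output is
   the permuted counter with the lowest bit dropped, giving [0, 2^31 - 1]. */
static int ctr_rand_r(unsigned *seed) {
    *seed += 1u;
    return (int)(feistel32((uint32_t)*seed) >> 1);
}
```

Since each 32-bit value appears once per period before the shift, each 31-bit output appears exactly twice, which matches the non-bias requirement stated in the question.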
A 32-bit maximal-period Galois LFSR might work for you. Try:
r = (r >> 1) ^ (-(r & 1) & 0x80200003);
The one problem with LFSRs is that you can't produce the value 0. So this one has a range of 1 to 2^32-1. You may want to tweak the output or else stick with a good LCG.
Besides using a Lehmer MCG, there's a couple you could use:
32-bit variants of Xorshift have a guaranteed period of 2³²−1 using a 32-bit state:
uint32_t state;
uint32_t xorshift32(void) {
state ^= state << 13;
state ^= state >> 17;
state ^= state << 5;
return state;
}
That's the original 32-bit recommendation from 2003 (see paper). Depending on your definition of "decent quality", that should be fine. However it fails the binary rank tests of Diehard, and 5/10 tests of SmallCrush.
Alternate version with better mixing and constants (passes SmallCrush and Crush):
uint32_t xorshift32amx(void) {
uint32_t s = __builtin_bswap32(state * 1597334677);
state ^= state << 13;
state ^= state >> 17;
state ^= state << 5;
return state + s;
}
Based on research here and here.
There's also Mulberry32 which has a period of exactly 2³²:
uint32_t mulberry32(void) {
uint32_t z = state += 0x6D2B79F5;
z = (z ^ z >> 15) * (1 | z);
z ^= z + (z ^ z >> 7) * (61 | z);
return z ^ z >> 14;
}
This is probably your best option. It's quite good. Author states "It passes gjrand's 13 tests with no failures and a total P-value of 0.984 (where 1 is perfect and 0.1 or less is a failure) on 4GB of generated data. That's a quarter of the full period". It appears to be an improvement over SplitMix32.
"SplitMix32", adopted from xxHash/MurmurHash3 (Weyl sequence):
uint32_t splitmix32(void) {
uint32_t z = state += 0x9e3779b9;
z ^= z >> 15; // 16 for murmur3
z *= 0x85ebca6b;
z ^= z >> 13;
z *= 0xc2b2ae35;
return z ^= z >> 16;
}
The quality might be questionable here, but its 64-bit big brother has a lot of fans (passes BigCrush). So the general structure is worth looking at.
Elaborating on my comment...
A block cipher in counter mode gives a generator in approximately the following form (except using much bigger data types):
uint32_t state = 0;
uint32_t rand()
{
state = next(state);
return temper(state);
}
Since cryptographic security hasn't been specified (and in 32 bits it would be more or less futile), a simpler, ad-hoc tempering function should do the trick.
One approach is where the next() function is simple (e.g., return state + 1;) and temper() compensates by being complex (as in the block cipher).
A more balanced approach is to implement LCG in next(), since we know that it also visits all possible states but in a random(ish) order, and to find an implementation of temper() which does just enough work to cover the remaining problems with LCG.
Mersenne Twister includes such a tempering function on its output. That might be suitable. Also, this question asks for operations which fulfill the requirement.
I have a favourite, which is to bit-reverse the word, and then multiply it by some constant (odd) number. That may be overly complex if bit-reverse isn't a native operation on your architecture.
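A sketch of the "LCG in next(), bit-reverse and multiply in temper()" idea. The LCG constants are the well-known Numerical Recipes pair, which gives a full period of 2^32; the odd multiplier in temper() and all function names are arbitrary choices made for illustration, and the statistical quality of the result is untested here. Both steps of temper() are bijections on 32-bit words, so the output stays unbiased.

```c
#include <assert.h>
#include <stdint.h>

/* Full-period LCG step modulo 2^32 (Numerical Recipes constants). */
static uint32_t lcg_next(uint32_t state) {
    return state * 1664525u + 1013904223u;
}

/* Reverse the bit order of a 32-bit word with the usual swap ladder. */
static uint32_t bitrev32(uint32_t v) {
    v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);
    v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);
    v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);
    return (v >> 16) | (v << 16);
}

/* Tempering: reversal moves the long-period high bits of the LCG to the
   bottom, and the odd multiply then spreads them upward again.
   The multiplier is a hypothetical choice; any odd constant is invertible. */
static uint32_t temper(uint32_t state) {
    return bitrev32(state) * 0x9E3779B1u;
}

/* rand_r-style wrapper returning a value in [0, 2^31 - 1]. */
static int tempered_rand_r(uint32_t *state) {
    *state = lcg_next(*state);
    return (int)(temper(*state) >> 1);
}
```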

Better way to do predicate assignment in C?

What I'm trying to do is avoid the following:
if(*ptr > 128) {
number = 5;
}
Such code performs poorly when there's no clear pattern as to which way the branch will go. What I came up with is this:
int arr[] = { number, 5 };
int cond = *ptr > 128;
number = arr[cond];
Based on my testing, that runs more than twice as fast as doing the conditional when the input is random. What I'm wondering is if there's a more clever way to do this, perhaps using bitwise operators.
A clever compiler should definitely compile this to a conditional move with the right optimization settings; check the disassembly to be sure.
There is this branchless solution:
int mask = -(*ptr <= 128); /* all ones when number keeps its old value, zero when it becomes 5 */
number = (number & mask) | (5 & ~mask);
The last line can also be
number = ((mask & (number ^ 5)) ^ 5);
if you're looking to use one less operation. But, caveat emptor, the compiler won't be able to optimize either of these nearly as well. You are best leaving this particular optimization for the compiler to worry about, unless you specifically know that the compiler is unable to make the optimization (in that case, you may want to check your compiler version or flags).
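A self-contained sketch of the mask-select idiom, with hypothetical helper names. Here the mask is all ones when the condition holds, which assumes two's complement so that -1 has every bit set. Whether this beats what the compiler generates for a plain conditional is something to measure rather than assume.

```c
#include <assert.h>

/* AND/OR form: picks if_true when cond is nonzero, otherwise if_false. */
static int select_or(int cond, int if_true, int if_false) {
    int mask = -(cond != 0);              /* 0 or -1 (all ones) */
    return (if_true & mask) | (if_false & ~mask);
}

/* XOR form: one operation fewer, same result. */
static int select_xor(int cond, int if_true, int if_false) {
    int mask = -(cond != 0);
    return ((if_true ^ if_false) & mask) ^ if_false;
}

/* The assignment from the question, written without a branch. */
static int update_number(int number, unsigned char value) {
    return select_xor(value > 128, 5, number);
}
```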

Having issues with some homework, C programming stuff

So the problem is to add 3 numbers together(2's complement) in C. Normally should be very simple, but the hard part of this problem is that you can only use the ops ! ~ & ^ | << >>, no kind of loops, or function calls, or anything fancy. Just those ops. He gives us a function that adds 2 words together. The return of the function I'm writing (sum3) is return sum(word1, word2). My responsibility is to determine what to set word1 and word2 to in order for the call to the sum function to give me the proper answer. Oh, and also I can only use 16 total of those ops up there.
I tried setting word1 to x ^ y, and word2 to (x & y) << 1 to see if I at least got the right answer from that for the first 2 numbers, and it always ends up correct. However, I have no idea how to throw z into the mix without messing everything up. I think this is the biggest problem... somebody please help, I messed up and didn't realize this was due in 5 hours from now, so I'm freaking out. At least a good hint... something, anything.
Just a hint: a + b == (a ^ b) + ((a & b) << 1). Here a & b is the expression for carry.
As you can see, this transformation reduces an add on N bits to some logical operations and an add on N-1 bits. If N is given, you can manually unroll the loop, and the whole result will contain only XOR, AND and SHL(1).
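The same identity extends to three operands, which is how a carry-save adder works: the per-bit sum is the XOR of all three inputs, and a carry is produced wherever at least two inputs have the bit set. The sketch below is one possible reading of the assignment, not the instructor's reference solution. The sum function is a stand-in for the provided two-word adder, and unsigned types are used so that the shift is well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the two-word adder handed out with the assignment (assumption). */
static uint32_t sum(uint32_t a, uint32_t b) { return a + b; }

/* Three-input add using 8 of the allowed operators (2 XOR, 3 AND, 2 OR, 1 shift). */
static uint32_t sum3(uint32_t x, uint32_t y, uint32_t z) {
    uint32_t word1 = x ^ y ^ z;                           /* bitwise sum, carries ignored */
    uint32_t word2 = ((x & y) | (x & z) | (y & z)) << 1;  /* carries, moved up one place */
    return sum(word1, word2);
}
```

For example, with x = 1, y = 2, z = 3 the XOR term is 0 and the carry term is 3 << 1 = 6, so the final two-word add returns 6.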
