C quick calculation of next multiple of 4?

What's a fast way to round up an unsigned int to a multiple of 4?
A multiple of 4 has the two least significant bits 0, right? So I could mask them out and then do a switch statement, adding either 1,2 or 3 to the given uint.
That's not a very elegant solution..
There's also the arithmetic roundup:
myint == 0 ? 0 : ((myint+3)/4)*4
Probably there's a better way including some bit operations?

(myint + 3) & ~0x03
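Adding 3 pushes any value that is not already a multiple of 4 past the next multiple of 4; the mask then rounds down to that multiple. Rounding down to a multiple of 4 is the same as subtracting the remainder modulo 4, which can be done by masking since the divisor is a power of 2.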

I assume that what you are trying to achieve is the alignment of the input number, i.e. if the original number is already a multiple of 4, then it doesn't need to be changed. However, this is not clear from your question. Maybe you want next multiple even when the original number is already a multiple? Please, clarify.
In order to align an arbitrary non-negative number i on an arbitrary boundary n you just need to do
i = i / n * n;
But this will align it towards negative infinity. In order to align it towards positive infinity, add n - 1 before performing the alignment
i = (i + n - 1) / n * n;
This is already good enough for all intents and purposes. In your case it would be
i = (i + 3) / 4 * 4;
However, if you would prefer to squeeze a few CPU clocks out of this, you might use the fact that i / 4 * 4 can be replaced with the bit-twiddling i & ~0x3, giving you
i = (i + 3) & ~0x3;
although it wouldn't surprise me if modern compilers could figure out the latter by themselves.
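As a minimal sketch of the two forms above (the division form for any boundary n, and the mask form for a boundary of 4), assuming unsigned 32-bit values and that i + n - 1 does not overflow. The helper names round_up_div and round_up4_mask are illustrative and do not come from the answers.

```c
#include <stdint.h>

/* Round i up to the nearest multiple of n (n > 0) using division.
   Assumption: i + n - 1 does not overflow. */
static uint32_t round_up_div(uint32_t i, uint32_t n)
{
    return (i + n - 1) / n * n;
}

/* Same result for n == 4, using a mask because 4 is a power of two.
   Assumption: i + 3 does not overflow. */
static uint32_t round_up4_mask(uint32_t i)
{
    return (i + 3u) & ~3u;
}
```

Both helpers leave a value that is already a multiple of 4 unchanged, for example 8 stays 8 while 5, 6 and 7 all become 8.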

If by "next multiple of 4" you mean the smallest multiple of 4 that is larger than your unsigned int value myint, then this will work:
(myint | 0x03) + 1;

(myint + 4) & 0xFFFC

If you want the next multiple of 4 strictly greater than myint, this solution will do (similar to previous posts):
(myint + 4) & ~3u
If you instead want to round up to the nearest multiple of 4 (leaving myint unchanged if it is a multiple of 4), this should work:
(0 == (myint & 0x3)) ? myint : ((myint + 4) & ~3u);
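The two interpretations can be put side by side in a short sketch. The function names are hypothetical, and both assume the addition does not wrap around at the top of the unsigned range.

```c
/* Smallest multiple of 4 strictly greater than x (4 -> 8, 5 -> 8).
   Setting the two low bits and adding 1 is equivalent to (x + 4) & ~3u. */
static unsigned next_multiple4_strict(unsigned x)
{
    return (x | 3u) + 1u;
}

/* Smallest multiple of 4 greater than or equal to x (4 -> 4, 5 -> 8). */
static unsigned round_up4(unsigned x)
{
    return (x + 3u) & ~3u;
}
```

The only inputs on which the two differ are exact multiples of 4, so the choice depends on what "next" should mean for those.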

myint = (myint + 4) & 0xfffffffc
This is assuming that by "next multiple of 4" that you are always moving upwards; i.e. 5 -> 8 and 4 -> 8.

This is branch-free, generally configurable, easy to understand (if you know about C byte strings), and it lets you avoid thinking about the bit size of myInt:
myInt += "\x00\x03\x02\x01"[myInt & 0x3];
Only downside is a possible single memory access to elsewhere (static string storage) than the stack.


How to choose between two values in a random way?

I am trying to choose one of two numbers randomly: 2 or -2. Is there a way to do it? I am trying to implement an algorithm to create a maze.
You have rand() from the C standard library, which returns a pseudo-random integer in the range [0, RAND_MAX]. You can use this function and choose one of the two numbers checking if the value returned is above or below RAND_MAX/2.
First, use srand() to initialize the pseudo-random number generator with some seed. It's common to use time() to do this as it returns a different value each time.
srand(time(NULL));
int rnd = rand();
int result = (rnd > RAND_MAX/2) ? 2 : -2;
Alternatively you could use the least significant bit of the value returned by rand(), as suggested in this other answer, since half the values returned are odd and half are even:
int result = (rnd & 1) ? 2 : -2;
There are many ways to do this, my favorite is:
a + rand() % 2 * (b - a);
but it isn't obvious what it does, and it doesn't improve efficiency either (although if you made it a macro or inline function and only ever called it with constants, modern compilers should evaluate the arithmetic at compile time). So the most elegant way to do this is to use some kind of condition; here's an example with the ternary operator:
(rand() % 2)? a: b;
BTW: There are many ways to choose between 0 and 1. I used rand() % 2 because it's the most commonly used technique,
but if you happen to be targeting the 6502 architecture, where there is no modulo/division instruction, you can use the bitwise AND operator, as in rand() & ANY_POWER_OF_TWO, or a comparison, as in rand() > HALF_MAX_RAND.
You can use this. It uses bitwise operations to generate either 2 or -2 without branching:
-((rand() & 1) << 2) + 2
I note that you should use srand() to seed the random number generator before using it; I commonly use srand(time(NULL)).
Step-by-step:
(rand() & 1) generates a random number: either 0 or 1.
<< 2 multiplies the previous result by 4, and the - in front of -((rand() & 1) << 2) negates that, so the result is either 0 or -4.
+ 2 adds 2, so the result is either 2 or -2.
If you'd like to see a more arithmetic-like approach that may be easier to follow, here it is:
rand() % 2 * -4 + 2
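The step-by-step description above can be sketched as a pair of small functions. Splitting the bit mapping from the call to rand() is my own choice, made so the mapping can be checked in isolation; the names are hypothetical.

```c
#include <stdlib.h>

/* Map the lowest bit of `bit` to a step without branching:
   0 -> 2, 1 -> -2. */
static int plus_or_minus_two(int bit)
{
    return -((bit & 1) << 2) + 2;
}

/* Pick 2 or -2 using the lowest bit of rand().
   Assumption: the caller has seeded the generator once,
   e.g. with srand(time(NULL)). */
static int random_step(void)
{
    return plus_or_minus_two(rand());
}
```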

What does (r+1 + (r >> 8)) >> 8 do?

In some old C/C++ graphics related code, that I have to port to Java and JavaScript I found this:
b = (b+1 + (b >> 8)) >> 8; // very fast
Where b is a short int for blue, and the same code is seen for r and g (red and green). The comment is not helpful.
I cannot figure out what it does, apart from obvious shifting and adding. I can port without understanding, I just ask out of curiosity.
y = ( x + 1 + (x>>8) ) >> 8 // very fast
This is a fixed-point approximation of division by 255. Conceptually, this is useful for normalizing calculations based on pixel values such that 255 (typically the maximum pixel value) maps to exactly 1.
It is described as very fast because fully general integer division is a relatively slow operation on many CPUs -- although it is possible that your compiler would make a similar optimization for you if it can deduce the input constraints.
This works based on the idea that 257/(256*256) is a very close approximation of 1/255, and that x*257/256 can be formulated as x+(x>>8). The +1 is rounding support which allows the formula to exactly match the integer division x/255 for all values of x in [0..65534].
Some algebra on the inner portion may make things a bit more clear...
x*257/256
= (x*256+x)/256
= x + x/256
= x + (x>>8)
There is more discussion here: How to do alpha blend fast? and here: Division via Multiplication
By the way, if you want round-to-nearest, and your CPU can do fast multiplies, the following is accurate for all uint16_t dividend values -- actually [0..(2^16)+126].
y = ((x+128)*257)>>16 // divide by 255 with round-to-nearest for x in [0..65662]
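The stated ranges can be checked exhaustively. The sketch below compares both formulas against ordinary integer division; the function names are illustrative, and (x + 127) / 255 is used as the round-to-nearest reference (no ties can occur because 255 is odd).

```c
#include <stdint.h>

/* Truncating x / 255 via shifts and adds; exact for x in [0, 65534]. */
static uint32_t div255_floor(uint32_t x)
{
    return (x + 1 + (x >> 8)) >> 8;
}

/* x / 255 rounded to nearest via one multiply; exact for x in [0, 65662]. */
static uint32_t div255_round(uint32_t x)
{
    return ((x + 128) * 257) >> 16;
}

/* Returns 1 if both formulas agree with plain integer division
   over the ranges stated above, 0 otherwise. */
static int div255_check(void)
{
    for (uint32_t x = 0; x <= 65534; x++)
        if (div255_floor(x) != x / 255)
            return 0;
    for (uint32_t x = 0; x <= 65662; x++)
        if (div255_round(x) != (x + 127) / 255)
            return 0;
    return 1;
}
```

The first formula breaks down at x = 65535, where it yields 256 instead of 257, which is why the exact range stops at 65534.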
Looks like it is meant to check if blue (or red or green) is fully used. It evaluates to 1, when b is 255, and is 0 for all lower values.
A common use case of when you'd want to use a formula that's more accurate than 257/256 is when you have to combine a lot of alpha values together for each pixel. As one example, when doing image shrinking, you need to combine 4 alphas for each source pixel contributing to the destination, and then combine all the source pixels contributing to the destination.
I posted an infinitely accurate bit twiddling version of /255 but it was rejected without reason. So I'll add that I implement alpha blending hardware for a living, I write real time graphics code and game engines for a living, and I've published articles on this topic in conferences like MICRO, so I really know what I'm talking about. And it might be useful or at least entertaining for people to understand the more accurate formula that is EXACTLY 1/255:
Version 1: x = (x + (x >> 8)) >> 8
- no constant added, won't satisfy (x * 255) / 255 = x, but will look fine in most cases.
Version 2: x = (x + (x >> 8) + 1) >> 8
- WILL satisfy (x * 255) / 255 = x for integers, but won't hit correct integer values for all alphas
Version 3: (simple integer rounding):
(x + (x >> 8) + 128) >> 8
- Won't hit correct integer values for all alphas, but will on average be closer than Version 2 at the same cost.
Version 4: Infinitely accurate version, to any level of precision desired, for any number of composite alphas: (useful for image resizing, rotation, etc.):
[(x + (x >> 8)) >> 8] + [ ( (x & 255) + (x >> 8) ) >> 8]
Why is version 4 infinitely accurate?
Because 1/255 = 1/256 + 1/65536 + 1/256^3 + 1/256^4 + ...
The simplest expression above (version 1) doesn't handle rounding, but it also doesn't handle the carries that occur from this infinite number of identical sum columns. The new term added above determines the carry out (0 or 1) from this infinite number of base 256 digits. By adding it, you are getting the same result as if you added all the infinite addends. At which point you can round by adding a half bit to whatever accuracy point you want.
Not needed for the OP perhaps, but people should know that you don't need to approximate at all. The formula above is actually more accurate than double precision floating point.
As for speed: In hardware, this method is faster than even a single (full width) add. In software, you have to consider throughput vs latency. In latency, it may still be faster than a narrow multiply (definitely faster than a full width multiply), but in the OP context, you can unroll many pixels at once, and since modern multiply units are pipelined, you are still OK. In translation to Java, you probably have no narrow multiplies, so this could still be faster, but need to check.
WRT the one person who said "why not use the built in OS capabilities for alpha blitting?": If you already have a substantial graphical code base in that OS, this might be a fine option. If not, you're looking at hundreds to thousands of times as many lines of code to leverage the OS version, code that's far harder to write and debug than this code. And in the end, the OS code you have isn't portable at all, while this code can be used anywhere.
I suspect that it is trying to do the following:
boolean isBFullyOn = false;
if (b == 0xff) {
isBFullyOn = true;
}
Back in the days of slow processors, smart bit-shifting tricks like the above could be faster than the obvious if-then-else logic. It avoids a jump instruction, which was costly.
It probably also sets an overflow flag in the processor which was used for some later logic. This is all highly dependent upon the target processor.
And also on my part speculative!!
It computes b + 1 + b/256 and then divides that sum by 256.
Because it is written with bit shifts, the compiler translates it into CPU-level shift instructions instead of using the FPU or library division functions.
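b = (b + (b >> 8)) >> 8; is basically b = b * 257 / 65536, since the inner b + (b >> 8) is b * 257 / 256 and the outer shift divides by 256 again.
I would consider the +1 an ugly hack to compensate for the average loss of 0.5 caused by truncation in the inner >> 8.
I would write it as b = (b + 128 + ((b + 128) >> 8)) >> 8; instead.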
Running this test code:
public void test() {
Set<Integer> results = new HashSet<Integer>();
// short int ranges between -32767 and 32767
for (int i = -32767; i <= 32767; i++) {
int b = (i + 1 + (i >> 8)) >> 8;
if (!results.contains(b)) {
System.out.println(i + " -> " + b);
results.add(b);
}
}
}
Produces all possible values between -129 and 128. However, if you are working with 8-bit colours (0 - 255) then the only possible outputs are 0 (for 0 - 254) and 1 (for 255) so it is likely that it is attempting the function @kaykay posted.

How to use bit manipulation to find 5th bit, and to return number of 1 bits in an integer

Assume Z is an unsigned integer. Using ~, <<, >>, &, | , +, and - provide statements which return the desired result.
I am allowed to introduce new binary values if needed.
I have these problems:
1. Extract the 5th bit from the left of Z.
For this I was thinking about doing something like
x x x x x x x x
& 0 0 0 0 1 0 0 0
___________________
0 0 0 0 1 0 0 0
Does this make sense for extracting the fifth bit? I am not totally sure how I would make this work by using just Z when I do not know its values. (I am relatively new to all of this). Would this type of idea work though?
2. Return the number of 1 bits in Z.
Here I have no real idea how to work this out. What I really need to know is how to operate on just Z with the allowed operators, but I'm not sure exactly how.
Like I said I am new to this, so any help is appreciated.
Problem 1
You’re right on the money. I’d do an & and a >> so that you get either a nice 0 or 1.
result = (z & 0x08) >> 3;
However, this may not be strictly necessary. For example, if you’re trying to check whether the bit is set as part of an if conditional, you can exploit C’s definition of anything nonzero as true.
if (z & 0x08)
do_stuff();
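A general form of the mask-and-shift idea, as a sketch: the helper name bit_at and the convention that k counts from 0 at the least significant bit are assumptions of this example, so the question's "5th bit" corresponds to k = 3 or k = 4 depending on which end and which base you count from.

```c
/* Return bit k of z as 0 or 1, where k = 0 is the least significant bit.
   Assumption: k is smaller than the width of unsigned int. */
static unsigned bit_at(unsigned z, unsigned k)
{
    return (z >> k) & 1u;
}
```

With this helper, bit_at(z, 3) gives the same value as (z & 0x08) >> 3.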
Problem 2
There are a whole variety of ways to do this. According to that page, the following methodology dates from 1960, though it wasn’t published in C until 1988.
for (result = 0; z; result++)
z &= z - 1;
Exactly why this works might not be obvious at first, but if you work through a few examples, you’ll quickly see why it does.
It’s worth noting that this operation – determining the number of 1 bits in a number – is sufficiently important to have a name (population count or Hamming weight) and, on recent Intel and AMD processors, a dedicated instruction. If you’re using GCC, you can use the __builtin_popcount intrinsic.
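To see why the loop above works, note that z & (z - 1) clears exactly the lowest set bit of z, so the loop body runs once per 1 bit. A self-contained sketch of the same technique follows; the function name is illustrative.

```c
/* Count the 1 bits in z by repeatedly clearing the lowest set bit.
   Example: 10110 -> 10100 -> 10000 -> 00000 takes three steps,
   so the count for binary 10110 is 3. */
static unsigned popcount_clear_lowest(unsigned z)
{
    unsigned count = 0;
    while (z != 0) {
        z &= z - 1;   /* clears the lowest set bit */
        count++;
    }
    return count;
}
```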
Problem 1 looks right, except you should finish it by shifting the result right by 4 to get that bit after the mask.
To implement the mask, you need to know what integer is represented by a single 5th bit. That number is 2^4 = 16 (counting bits from 1 at the least significant end). So you can just AND z with 16 and shift the result right by 4.
Problem 2:
int answer = 0;
while (z != 0){ //stop when there are no more 1 bits in z
//the following masks the lowest bit in z and adds it into answer
//if z ends with a 0, nothing is added, otherwise 1 is added
answer += (z & 1);
//this shifts z right by 1 to get the next higher bit
z >>= 1;
}
return answer;
To find out the value of the fifth bit, you don't care about the bottom bits so you can get rid of them:
unsigned int answer = z >> 4;
The fifth bit becomes the bottom bit, so you can strip it off with a bitwise-AND:
answer = answer & 1;
To find the number of 1-bits in a number you can apply stakSmashr's solution. You could optimise this further if you know you need to count the number of bits in a lot of integers - precompute the number of bits in every possible 8-bit number and store it in a table. There will only be 256 entries in the table so it won't use much memory. Then, you can loop over your data one byte at a time and find the answer from the table. This lookup will be quicker than looping again over each bit.
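The table approach might look like the sketch below, assuming 32-bit inputs. The names bits_in_byte, init_bits_in_byte and popcount_table are hypothetical, and the table is filled at run time here only to keep the example short; a precomputed constant array would work equally well.

```c
#include <stdint.h>

/* bits_in_byte[i] holds the number of 1 bits in the 8-bit value i. */
static uint8_t bits_in_byte[256];

/* Fill the table once. Each entry reuses the entry for i >> 1
   and adds the lowest bit of i. */
static void init_bits_in_byte(void)
{
    bits_in_byte[0] = 0;
    for (int i = 1; i < 256; i++)
        bits_in_byte[i] = (uint8_t)((i & 1) + bits_in_byte[i >> 1]);
}

/* Count the 1 bits of a 32-bit value with four table lookups.
   Assumption: init_bits_in_byte() has been called first. */
static unsigned popcount_table(uint32_t z)
{
    return bits_in_byte[z & 0xff]
         + bits_in_byte[(z >> 8) & 0xff]
         + bits_in_byte[(z >> 16) & 0xff]
         + bits_in_byte[z >> 24];
}
```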

Guaranteeing enough storage space for 4*ceil(n/3), where n is an int

Let's say n is an integer (an int variable in C). I need enough space for “4 times the ceiling of n divided by 3” bytes. How do I guarantee enough space for this?
Do you think malloc(4*(int)ceil(n/3.0)) will do, or do I have to add, say, 1 in order to be absolutely safe (due to possible rounding errors)?
You can achieve the same thing with pure integer arithmetic, which guarantees that you allocate the correct amount of memory:
malloc(4*((n+2)/3))
An alternative to KerrekSB's general formula which guarantees that only one division is used, is to calculate
(n+m-1)/m
To see that it produces the same, write n = k*m + r with 0 <= r < m. Then n%m == r, and if r == 0, we have n+m-1 = k*m + (m-1) and (n+m-1)/m == k, otherwise n+m-1 = (k+1)*m + (r-1) and (n+m-1)/m == k+1.
Most modern hardware gives you the quotient (n/m) in one register and the remainder (n%m) in another when you do an integer division, so you can get both parts of Kerrek's formula in one division, and most compilers will do so. If the compiler doesn't, but uses two divisions, the calculation will be considerably slower, so if the computation is done often and performance is an issue, you can work around the compiler's weakness with somewhat less obvious code.
In the given case, the malloc would be
malloc(4*((n+2)/3));
But since it's not obvious to everyone what that formula does, if you use it, explain it in a comment, and if you don't need to use it, use the more obvious code.
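One way to follow that advice is to wrap the formula in a small, commented helper. This is a sketch under the assumptions that n is non-negative, m is positive, and n + m - 1 does not overflow; the names ceil_div and bytes_needed are illustrative.

```c
#include <stddef.h>

/* Ceiling of n / m using a single division.
   Assumptions: m > 0 and n + m - 1 does not overflow. */
static size_t ceil_div(size_t n, size_t m)
{
    return (n + m - 1) / m;
}

/* Number of bytes for "4 times the ceiling of n divided by 3". */
static size_t bytes_needed(size_t n)
{
    return 4 * ceil_div(n, 3);
}
```

A call such as malloc(bytes_needed(n)) then reads the same as the original requirement, and the arithmetic is exact because no floating point is involved.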
To compute the ceiling of n / m integrally, just say:
n / m + (n % m == 0 ? 0 : 1)
All in all, say malloc(4 * (n / 3 + (n % 3 ? 1 : 0)));.
While Kerrek SB has a precise answer, in practice most engineers would use malloc(4 + 4 * n / 3) or, similarly, malloc(4 * (1 + n / 3)). C evaluates n / 3 as an integer division, truncating the remainder. Adding a little more to the expression ensures that space for any fraction discarded by the division is still allocated.
At most, this might waste four bytes (when n is a multiple of 3). Only if there are thousands of these allocations would any extra computation to account for that be justified, maybe. Implementations of malloc often round storage allocations up to multiples of 4, 8, or 16 bytes to simplify their housekeeping.
Consider the cost of 3 bytes of memory: at current pricing of $5 to $15 per gigabyte, three bytes cost roughly $0.000 000 015 to $0.000 000 045.

How to map a long integer number to a N-dimensional vector of smaller integers (and fast inverse)?

Given a N-dimensional vector of small integers is there any simple way to map it with one-to-one correspondence to a large integer number?
Say, we have N=3 vector space. Can we represent a vector X=[(int16)x1,(int16)x2,(int16)x3] using an integer (int48)y? The obvious answer is "Yes, we can". But the question is: "What is the fastest way to do this and its inverse operation?"
Will this new 1-dimensional space possess some very special useful properties?
For the above example you have 3 * 32 = 96 bits of information, so without any a priori knowledge you need 96 bits for the equivalent long integer.
However, if you know that your x1, x2, x3, values will always fit within, say, 16 bits each, then you can pack them all into a 48 bit integer.
In either case the technique is very simple you just use shift, mask and bitwise or operations to pack/unpack the values.
Just to make this concrete, if you have a 3-dimensional vector of 8-bit numbers, like this:
uint8_t vector[3] = { 1, 2, 3 };
then you can join them into a single (24-bit number) like so:
uint32_t all = (vector[0] << 16) | (vector[1] << 8) | vector[2];
This number would, if printed using this statement:
printf("the vector was packed into %06x", (unsigned int) all);
produce the output
the vector was packed into 010203
The reverse operation would look like this:
uint8_t v2[3];
v2[0] = (all >> 16) & 0xff;
v2[1] = (all >> 8) & 0xff;
v2[2] = all & 0xff;
Of course this all depends on the size of the individual numbers in the vector and the length of the vector together not exceeding the size of an available integer type, otherwise you can't represent the "packed" vector as a single number.
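For the case in the question (three 16-bit values in 48 bits), the same shift-and-mask idea could be sketched as follows. It assumes unsigned 16-bit components and stores the 48 bits in a uint64_t, since C has no standard 48-bit type; signed components would need a cast on the way in and out. The function names are hypothetical.

```c
#include <stdint.h>

/* Pack three 16-bit values into the low 48 bits of a 64-bit integer.
   x1 ends up in bits 47..32, x2 in bits 31..16, x3 in bits 15..0. */
static uint64_t pack3x16(uint16_t x1, uint16_t x2, uint16_t x3)
{
    return ((uint64_t)x1 << 32) | ((uint64_t)x2 << 16) | (uint64_t)x3;
}

/* Inverse of pack3x16: the casts keep only the low 16 bits of each shift. */
static void unpack3x16(uint64_t y, uint16_t out[3])
{
    out[0] = (uint16_t)(y >> 32);
    out[1] = (uint16_t)(y >> 16);
    out[2] = (uint16_t)y;
}
```

The casts to uint64_t before shifting matter: without them the shift by 32 would be applied to a promoted int, which is too narrow on common platforms.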
If you have sets Si, i=1..n of size Ci = |Si|, then the cartesian product set S = S1 x S2 x ... x Sn has size C = C1 * C2 * ... * Cn.
This motivates an obvious way to do the packing one-to-one. If you have elements e1,...,en from each set, each in the range 0 to Ci-1, then you give the element e=(e1,...,en) the value e1+C1*(e2 + C2*(e3 + C3*(...C(n-1)*en...))).
You can do any permutation of this packing if you feel like it, but unless the values are perfectly correlated, the size of the full set must be the product of the sizes of the component sets.
In the particular case of three 32 bit integers, if they can take on any value, you should treat them as one 96 bit integer.
If you particularly want to, you can map small values to small values through any number of means (e.g. filling out spheres with the L1 norm), but you have to specify what properties you want to have.
(For example, one can map (n,m) to (max(n,m)-1)^2 + k where k=n if n<=m and k=n+m if n>m--you can draw this as a picture of filling in a square like so:
1 2 5 | draw along the edge of the square this way
4 3 6 v
8 7
if you start counting from 1 and only worry about positive values; for integers, you can spiral around the origin.)
I'm writing this without having time to check details, but I suspect the best way is to represent your long integer via modular arithmetic, using k different integers which are mutually prime. The original integer can then be reconstructed using the Chinese remainder theorem. Sorry this is a bit sketchy, but hope it helps.
To expand on Rex Kerr's generalised form, in C you can pack the numbers like so:
X = e[n];
X *= MAX_E[n-1] + 1;
X += e[n-1];
/* ... */
X *= MAX_E[0] + 1;
X += e[0];
And unpack them with:
e[0] = X % (MAX_E[0] + 1);
X /= (MAX_E[0] + 1);
e[1] = X % (MAX_E[1] + 1);
X /= (MAX_E[1] + 1);
/* ... */
e[n] = X;
(Where MAX_E[n] is the greatest value that e[n] can have). Note that these maximum values are likely to be constants, and may be the same for every e, which will simplify things a little.
The shifting / masking implementations given in the other answers are a generalisation of this, for cases where the MAX_E + 1 values are powers of 2 (and thus the multiplication and division can be done with a shift, the addition with a bitwise-or and the modulus with a bitwise-and).
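Written as loops, the pack and unpack sequences above might look like this sketch. It assumes the product of all radices fits in 64 bits and that each e[i] is less than radix[i], where radix[i] stands for MAX_E[i] + 1; the function names and the radix array are introduced here for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack count values into one integer, treating e[i] as a digit in base
   radix[i]. Starts from the last element, as in the sequence above.
   Assumptions: e[i] < radix[i], and the product of the radices fits
   in 64 bits. */
static uint64_t mixed_pack(const uint32_t *e, const uint32_t *radix, size_t count)
{
    uint64_t x = 0;
    for (size_t i = count; i-- > 0; )
        x = x * radix[i] + e[i];
    return x;
}

/* Inverse of mixed_pack: peel off one digit per radix, lowest first. */
static void mixed_unpack(uint64_t x, uint32_t *e, const uint32_t *radix, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        e[i] = (uint32_t)(x % radix[i]);
        x /= radix[i];
    }
}
```

For example, with radices {10, 7, 5} the vector {3, 6, 4} packs to (4 * 7 + 6) * 10 + 3 = 343.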
There are some totally non-portable ways to make this really fast using packed unions and direct memory accesses. That you really need this kind of speed is doubtful. Methods using shifts and masks should be fast enough for most purposes. If not, consider using specialized processors such as GPUs, for which vector support is optimized (parallel).
This naive storage does not possess any useful property that I can foresee, except that you can perform some computations (add, sub, bitwise logical operators) on the three coordinates at once, as long as you use positive integers only and you don't overflow for add and sub.
You'd better be quite sure you won't overflow (or won't go negative for sub) or the vector will become garbage.
#include <stdint.h> // for uint8_t
long x;
uint8_t * p = (uint8_t *)&x;
or
union X {
long L;
uint8_t A[sizeof(long)/sizeof(uint8_t)];
};
works if you don't care about the endianness. In my experience compilers generate better code with the union because it doesn't set off their "you took the address of this, so I must keep it in RAM" rules as quickly. These rules will get set off if you try to index the array with stuff that the compiler can't optimize away.
If you do care about the endianness then you need to mask and shift.
I think what you want can be solved using multi-dimensional space filling curves. The link gives a lot of references on this, which in turn give different methods and insights. Here's a specific example of an invertible mapping. It works for any dimension N.
As for useful properties, these mappings are related to Gray codes.
Hard to say whether this was what you were looking for, or whether the "pack 3 16-bit ints into a 48-bit int" does the trick for you.
