implementation of rand()

I am writing some embedded code in C and need to use the rand() function. Unfortunately, rand() is not supported in the library for the controller. I need a simple implementation that is fast, but more importantly has little space overhead, that produces relatively high-quality random numbers. Does anyone know which algorithm to use or sample code?
EDIT: It's for image processing, so "relatively high quality" means decent cycle length and good uniform properties.

Check out this collection of random number generators from George Marsaglia. He's a leading expert in random number generation, so I'd be confident using anything he recommends. The generators in that list are tiny, some requiring only a couple unsigned longs as state.
Marsaglia's generators are definitely "high quality" by your standards of long period and good uniform distribution. They pass stringent statistical tests, though they wouldn't do for cryptography.

Use the C code for LFSR113 from L'Ecuyer:
unsigned int lfsr113_Bits (void)
{
static unsigned int z1 = 12345, z2 = 12345, z3 = 12345, z4 = 12345;
unsigned int b;
b = ((z1 << 6) ^ z1) >> 13;
z1 = ((z1 & 4294967294U) << 18) ^ b;
b = ((z2 << 2) ^ z2) >> 27;
z2 = ((z2 & 4294967288U) << 2) ^ b;
b = ((z3 << 13) ^ z3) >> 21;
z3 = ((z3 & 4294967280U) << 7) ^ b;
b = ((z4 << 3) ^ z4) >> 12;
z4 = ((z4 & 4294967168U) << 13) ^ b;
return (z1 ^ z2 ^ z3 ^ z4);
}
Very high quality and fast. Do NOT use rand() for anything.
It is worse than useless.

Here is a link to an ANSI C implementation of a few random number generators.

I've made a collection of random number generators, "simplerandom", that are compact and suitable for embedded systems. The collection is available in C and Python.
I've looked around for a bunch of simple and decent ones I could find, and put them together in a small package. They include several Marsaglia generators (KISS, MWC, SHR3), and a couple of L'Ecuyer LFSR ones.
All the generators return an unsigned 32-bit integer, and typically have a state made of 1 to 4 32-bit unsigned integers.
Interestingly, I found a few issues with the Marsaglia generators, and I've tried to fix/improve all those issues. Those issues were:
SHR3 generator (component of Marsaglia's 1999 KISS generator) was broken.
MWC low 16 bits have only an approx 2^29.1 period. So I made a slightly improved MWC, which gives the low 16 bits a 2^59.3 period, which is the overall period of this generator.
I uncovered a few issues with seeding, and tried to make robust seeding (initialisation) procedures, so they won't break if you give them a "bad" seed value.

I recommend the academic paper Two Fast Implementations of the Minimal Standard Random Number Generator by David Carta. You can find a free PDF through Google. The original paper on the Minimal Standard Random Number Generator is also worth reading.
Carta's code gives fast, high-quality random numbers on 32-bit machines. For a more thorough evaluation, see the paper.
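To make the idea concrete, here is a minimal sketch of a division-free Minimal Standard step in the spirit of Carta's paper. The multiplier 16807 and the modulus 2^31 - 1 are the Minimal Standard parameters; the function names and the exact code layout are assumptions of this sketch, not a copy of the paper's listing.

```c
#include <stdint.h>

/* Minimal Standard generator: seed = 16807 * seed mod (2^31 - 1).
 * The reduction avoids division by using 2^31 == 1 (mod 2^31 - 1).
 * The seed must lie in [1, 2^31 - 2]. minstd_next is a hypothetical name. */
static uint32_t minstd_next(uint32_t *seed)
{
    uint32_t lo = 16807u * (*seed & 0xFFFFu);  /* multiplier times low 16 bits  */
    uint32_t hi = 16807u * (*seed >> 16);      /* multiplier times high 15 bits */

    lo += (hi & 0x7FFFu) << 16;  /* part of hi that stays below bit 31 */
    lo += hi >> 15;              /* part at or above bit 31, folded back in */

    if (lo > 0x7FFFFFFFu)        /* one conditional subtraction completes the reduction */
        lo -= 0x7FFFFFFFu;

    *seed = lo;
    return lo;
}

/* Helper for checking: the state reached after n steps from a given seed. */
static uint32_t minstd_after(uint32_t seed, unsigned n)
{
    while (n--)
        minstd_next(&seed);
    return seed;
}
```

Starting from seed 1, the value after 10000 steps should be 1043618065, the check value published with the Minimal Standard generator, so `minstd_after(1, 10000)` is a quick self-test on a new target.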

Mersenne twister
A bit from Wikipedia:
It was designed to have a period of 2^19937 − 1 (the creators of the algorithm proved this property). In practice, there is little reason to use a larger period, as most applications do not require 2^19937 unique combinations (2^19937 is approximately 4.3 × 10^6001; this is many orders of magnitude larger than the estimated number of particles in the observable universe, which is 10^80).
It has a very high order of dimensional equidistribution (see linear congruential generator). This implies that there is negligible serial correlation between successive values in the output sequence.
It passes numerous tests for statistical randomness, including the Diehard tests. It passes most, but not all, of the even more stringent TestU01 Crush randomness tests.
Source code for many languages is available at the link.

I'd take one from the GNU C library; the source is available to browse online.
http://qa.coreboot.org/docs/libpayload/rand_8c-source.html
But if you have any concern at all about the quality of the random numbers, you should probably look at more carefully written mathematical libraries. It's a big subject and the standard rand implementations aren't highly thought of by experts.
Here's another possibility: http://www.boost.org/doc/libs/1_39_0/libs/random/index.html
(If you find you have too many options, you could always pick one at random.)

I found this: Simple Random Number Generation, by John D. Cook.
It should be easy to adapt to C, given that it's only a few lines of code.
Edit: and you could clarify what you mean by "relatively high-quality". Are you generating encryption keys for nuclear launch codes, or random numbers for a game of poker?
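As an illustration of how small such a generator can be in C, here is a sketch of Marsaglia's multiply-with-carry pair. Whether it matches the linked article line for line is an assumption; the function names and the zero-seed handling are hypothetical additions for this sketch.

```c
#include <stdint.h>

/* Two 16-bit multiply-with-carry generators combined into one 32-bit output.
 * The default seeds are the commonly quoted ones; treat them as an
 * assumption of this sketch. */
static uint32_t m_z = 362436069u;
static uint32_t m_w = 521288629u;

/* Hypothetical seeding helper: a zero state word would stay zero forever,
 * so zero seeds are replaced by the defaults. */
static void mwc_seed(uint32_t z, uint32_t w)
{
    m_z = z ? z : 362436069u;
    m_w = w ? w : 521288629u;
}

static uint32_t mwc_next(void)
{
    m_z = 36969u * (m_z & 65535u) + (m_z >> 16);  /* low half times multiplier, plus carry */
    m_w = 18000u * (m_w & 65535u) + (m_w >> 16);
    return (m_z << 16) + m_w;                     /* combine the two streams */
}
```

The whole state is two 32-bit words, which fits the space constraint in the question. It is not suitable for cryptography.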

Better yet, use multiple linear feedback shift registers and combine them together.
Assuming that sizeof(unsigned) == 4:
unsigned t1 = 0, t2 = 0;
unsigned random()
{
unsigned b;
b = t1 ^ (t1 >> 2) ^ (t1 >> 6) ^ (t1 >> 7);
t1 = (t1 >> 1) | (~b << 31);
b = (t2 << 1) ^ (t2 << 2) ^ (t1 << 3) ^ (t2 << 4);
t2 = (t2 << 1) | (~b >> 31);
return t1 ^ t2;
}

The standard solution is to use a linear feedback shift register.
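For reference, a minimal sketch of a single linear feedback shift register follows. It uses the widely documented 16-bit tap set (bits 16, 14, 13, 11); the function names are hypothetical, and a real rand() replacement would need several steps per returned number, since each step produces only one new bit.

```c
#include <stdint.h>

/* 16-bit Fibonacci LFSR, polynomial x^16 + x^14 + x^13 + x^11 + 1.
 * This tap choice is maximal-length. The state must never be zero. */
static uint16_t lfsr16_step(uint16_t lfsr)
{
    unsigned bit = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1u;
    return (uint16_t)((lfsr >> 1) | (bit << 15));  /* shift right, feed new bit in at the top */
}

/* Count the steps until the start state comes around again. */
static unsigned long lfsr16_period(uint16_t start)
{
    uint16_t s = start;
    unsigned long n = 0;
    do {
        s = lfsr16_step(s);
        n++;
    } while (s != start);
    return n;
}
```

From any nonzero start state the register should cycle through all 65535 nonzero states before repeating; the all-zero state maps to itself and must be avoided when seeding.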

There is a simple RNG named KISS, which combines three simple generators into one.
/* Implementation of a 32-bit KISS generator which uses no multiply instructions */
static unsigned int x=123456789,y=234567891,z=345678912,w=456789123,c=0;
unsigned int JKISS32() {
int t;
y ^= (y<<5); y ^= (y>>7); y ^= (y<<22);
t = z+w+c; z = w; c = t < 0; w = t&2147483647;
x += 1411392427;
return x + y + w;
}
There is also a website for testing RNGs: http://www.phy.duke.edu/~rgb/General/dieharder.php

Related

How does XorShift32 work?

I have this homework where I need to implement xorshift32 (I can't use anything else) so I can generate some numbers, but I don't understand how the algorithm works or how to implement it.
I am trying to print the generated number, but I don't know how to call the xorshift32 function because of the state[static 1] argument.
uint32_t xorshift32(uint32_t state[static 1])
{
uint32_t x = state[0];
x ^= x << 13;
x ^= x >> 17;
x ^= x << 5;
state[0] = x;
return x;
}
I do not have much information on xorshift32 other than what is on Wikipedia (en.wikipedia.org/wiki/Xorshift).

This is an extended comment to the good answer by Jabberwocky.
The Xorshift variants, rand(), and basically all random number generator functions, are actually pseudorandom number generators. They are not "real random", because the sequence of numbers they generate depends on their internal state; but they are "pseudorandom", because if you do not know the generator internal state, the sequence of numbers they generate is random in the statistical sense.
George Marsaglia, the author of the Xorshift family of pseudorandom number generators, also developed a set of statistical tools called Diehard tests that can be used to analyse the "randomness" of the sequences generated. Currently, the TestU01 tests are probably the most widely used and trusted; in particular, the 160-test BigCrush set.
The sequence generated by ordinary pseudorandom number generators often allows one to determine the internal state of the generator. This means that observing a long enough generated sequence allows one to fairly reliably predict the future sequence. Cryptographically secure pseudorandom number generators avoid that, usually by applying a cryptographically secure hash function to the output; one would need a catalog of the entire sequence to be able to follow it. When the periods are longer than 2^256 or so, there is not enough baryonic matter in the entire observable universe to store the sequence.
My own favourite PRNG is Xorshift64*, which has a period of 2^64-1, and passes all but the MatrixRank test in BigCrush. In C99 and later, you can implement it using
#include <inttypes.h>
typedef struct {
uint64_t state;
} prng_state;
static inline uint64_t prng_u64(prng_state *const p)
{
uint64_t state = p->state;
state ^= state >> 12;
state ^= state << 25;
state ^= state >> 27;
p->state = state;
return state * UINT64_C(2685821657736338717);
}
The state can be initialized to any nonzero uint64_t. (A zero state will lead the generator to generate all zeros forever. The period is 2^64-1, because the generator will visit each 64-bit state (excluding zero) exactly once during each period.)
It is good enough for most use cases, and extremely fast. It belongs to the class of linear-feedback shift register pseudorandom number generators.
Note that the variant which returns a uniform distribution between 0 and 1,
static inline double prng_one(prng_state *p)
{
return prng_u64(p) / 18446744073709551616.0;
}
uses the high bits; the high 32 bits of the sequence do pass all BigCrush tests in the TestU01 suite, so this is a surprisingly good (in randomness and efficiency) generator for double-precision uniform random numbers -- my typical use case.
The format above allows multiple independent generators in a single process, by specifying the generator state as a parameter. If the basic generator is implemented in a header file (thus the static inline; it is a preprocessor macro-like function), you can switch between generators by switching between header files, and recompiling the binary.
(You are usually better off by using a single generator, unless you use multiple threads in a pseudorandom number heavy simulator, in which case using a separate generator for each thread will help a lot; avoids cacheline ping-pong between threads competing for the generator state, in particular.)
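Returning to prng_one above, one detail is worth checking: with the usual round-to-nearest conversion, a 64-bit value close to 2^64 becomes exactly 2^64 as a double, so the quotient can come out as exactly 1.0. If a half-open interval [0, 1) matters, a common alternative keeps only the top 53 bits. The helper below is a sketch with a hypothetical name; it takes the raw 64-bit output (for example, the result of prng_u64) as its argument.

```c
#include <stdint.h>

/* Map a 64-bit value to a double in [0, 1) using only its top 53 bits.
 * A double has a 53-bit significand, so (x >> 11) converts exactly and the
 * largest possible result is 1 - 2^-53. */
static inline double u64_to_unit_double(uint64_t x)
{
    return (double)(x >> 11) * (1.0 / 9007199254740992.0);  /* divide by 2^53 */
}
```

This still uses the high bits of the generator output, which are the ones described above as passing the tests.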
The rand() function in most C standard library implementations is a linear-congruential generator. They often suffer from poor choices of the coefficients, and nowadays, also from the relative slowness of the modulo operator (when the modulus is not a power of two).
The most widely used pseudorandom number generator is the Mersenne Twister, by Makoto Matsumoto (松本 眞) and Takuji Nishimura (西村 拓士). It is a twisted generalized linear feedback shift register, and has quite a large state (about 2500 bytes) and very long period (2^19937-1).
When we talk of true random number generators, we usually mean a combination of a pseudorandom number generator (usually a cryptographically secure one), and a source of entropy; random bits with at least some degree of true physical randomness.
In Linux, Mac OS, and BSDs at least, the operating system kernel exposes a source of pseudorandom numbers (getentropy() in Linux and OpenBSD, getrandom() in Linux, /dev/urandom, /dev/arandom, /dev/random in many Unixes, and so on). Entropy is gathered from physical electronic sources, like internal processor latencies, physical interrupt line timings, (spinning disk) hard drive timings, possibly even keyboard and mice. Many motherboards and some processors even have hardware random number sources that can be used as sources for entropy (or even directly as "trusted randomness sources").
The exclusive-or operation (^ in C) is used to mix in randomness to the generator state. This works, because exclusive-or between a known bit and a random bit results in a random bit; XOR preserves randomness. When mixing entropy pools (with some degree of randomness in the bit states) using XOR, the result will have at least as much entropy as the sources had.
Note that this does not mean that you get "better" random numbers by mixing the output of two or more generators. The statistics of true randomness is hard for humans to grok (just look at how poor the common early rand() implementations were! HORRIBLE!). It is better to pick a generator (or a set of generators to switch between at compile time, or at run time) that passes the BigCrush tests, and ensure it has a good random initial state on every run. That way you leverage the work of many mathematicians and others who have worked on these things for decades, and can concentrate on the other stuff, what you yourself are good at.

The C code in the wikipedia article is somewhat misleading:
Here is a working example that uses both the 32 bit and the 64 bit versions:
#include <stdio.h>
#include <stdint.h>
/* The state word must be initialized to non-zero */
uint32_t xorshift32(uint32_t state[])
{
/* Algorithm "xor" from p. 4 of Marsaglia, "Xorshift RNGs" */
uint32_t x = state[0];
x ^= x << 13;
x ^= x >> 17;
x ^= x << 5;
state[0] = x;
return x;
}
uint64_t xorshift64(uint64_t state[])
{
uint64_t x = state[0];
x ^= x << 13;
x ^= x >> 7;
x ^= x << 17;
state[0] = x;
return x;
}
int main()
{
uint32_t state[1] = {1234}; // "seed" (can be anything but 0)
for (int i = 0; i < 50; i++)
{
printf("%lu\n", (unsigned long)xorshift32(state)); // cast so the format matches on every platform
}
uint64_t state64[1] = { 1234 }; // "seed" (can be anything but 0)
for (int i = 0; i < 50; i++)
{
printf("%llu\n", (unsigned long long)xorshift64(state64)); // uint64_t is not always unsigned long long
}
}
The mathematical aspects are explained in the Wikipedia article and in its footnotes.
The rest is basic C language knowledge, ^ is the C bitwise XOR operator.

Generating (very) large non-repeating integer sequence without pre-shuffling

Background
I have a simple media client/server I've written, and I want to generate a non-obvious time value I send with each command from the client to the server. The timestamps will have a fair bit of data in them (nano-second resolution, even if it's not truly accurate, due to limitations of timer sampling in modern operating systems), etc.
What I'm trying to do (on Linux, in C) is to generate a one-to-one sequence of n-bit values (let's assume data is stored in 128-bit array-of-int elements for now) with no overlapping/colliding values. I would then take a pseudo-random 128-bit value/number as a "salt", apply it to the timestamp, and then start sending off commands to the server, incrementing the pre-salted/pre-hashed value.
The reason the timestamp size is so large is because the timestamp may have to accommodate a very large duration of time.
Question
How could I accomplish such a sequence (non-colliding) with an initial salt value? The best approach that sounds along the lines of my goal is from this post, which notes:
If option 1 isn't "random" enough for you, use the CRC-32 hash of said
global (32-bit) counter. There is a 1-to-1 mapping (bijection) between
N-bit integers and their CRC-N so uniqueness will still be guaranteed.
However, I do not know:
If that can (efficiently) be extended to 128-bit data.
If some sort of addition-to/multiplication-by salt-value to provide the initial seed for the sequence would disrupt it or introduce collisions.
Follow-up
I realize that I could use a 128bit random hash from libssl or something similar, but I want the remote server, using the same salt value, to be able to convert the hashed timestamps back into their true values.
Thank you.

You could use a linear congruential generator. With the right parameters, it is guaranteed to produce a non-repeating [unique] sequence with a full period (i.e. no collisions).
This is what random(3) uses in TYPE_0 mode. I adapted it for a full unsigned int range and the seed can be any unsigned int (See my sample code below).
I believe it can be extended to 64 or 128 bits. I'd have a look at https://en.wikipedia.org/wiki/Linear_congruential_generator for the constraints on the parameters that prevent collisions and give good randomness.
Following the wiki page guidelines, you could produce one that can take any 128 bit value as the seed and will not repeat until all possible 128 bit numbers have been generated.
You may need to write a program to generate suitable parameter pairs and then test them for the "best" randomness. This would be a one time operation.
Once you've got them, just plug these parameters into your equation in your actual application.
Here's some code of mine that I had been playing with when I was looking for something similar:
// _prngstd -- get random number
static inline u32
_prngstd(prng_p prng)
{
long rhs;
u32 lhs;
// NOTE: random is faster and has a _long_ period, but it _only_ produces
// positive integers but jrand48 produces positive _and_ negative
#if 0
rhs = jrand48(btc->btc_seed);
lhs = rhs;
#endif
// this has collisions
#if 0
rhs = rand();
PRNG_FLIP;
#endif
// this has collisions because it defaults to TYPE_3
#if 0
rhs = random();
PRNG_FLIP;
#endif
// this is random in TYPE_0 (linear congruential) mode
#if 0
prng->prng_state = ((prng->prng_state * 1103515245) + 12345) & 0x7fffffff;
rhs = prng->prng_state;
PRNG_FLIP;
#endif
// this is random in TYPE_0 (linear congruential) mode with the mask
// removed to get full range numbers
// this does _not_ produce overlaps
#if 1
prng->prng_state = ((prng->prng_state * 1103515245) + 12345);
rhs = prng->prng_state;
lhs = rhs;
#endif
return lhs;
}
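The full-period claim can be checked directly on a small scale. For a power-of-two modulus, the Hull-Dobell conditions from the Wikipedia page reduce to "the increment is odd and the multiplier is 1 modulo 4". The sketch below tests those conditions and then walks a 16-bit version of the same generator (the low 16 bits of 1103515245 and 12345) to confirm that every value appears exactly once. All function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hull-Dobell conditions for x -> a*x + c (mod 2^n): c odd and a % 4 == 1. */
static int lcg_pow2_full_period(uint32_t a, uint32_t c)
{
    return (c & 1u) != 0 && (a & 3u) == 1u;
}

/* 16-bit LCG built from the low 16 bits of the constants used above. */
static uint16_t lcg16_next(uint16_t s)
{
    return (uint16_t)(s * 20077u + 12345u);   /* 20077 == 1103515245 mod 65536 */
}

/* Take 65536 steps from seed and count how many distinct states were visited. */
static unsigned long lcg16_distinct(uint16_t seed)
{
    static unsigned char seen[65536];
    unsigned long distinct = 0;
    uint16_t s = seed;

    memset(seen, 0, sizeof seen);
    for (unsigned long i = 0; i < 65536ul; i++) {
        if (!seen[s]) {
            seen[s] = 1;
            distinct++;
        }
        s = lcg16_next(s);
    }
    return distinct;
}
```

A result of 65536 from `lcg16_distinct` means no value repeated before the whole range was exhausted. The same two conditions apply at 64 or 128 bits, although a full period says nothing about how random the sequence looks.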

The short answer is encryption. With a set of 128 bit values feed them into AES and get a different set of 128 bit values out. Because encryption is reversible the outputs are guaranteed unique for unique inputs with a fixed key.
Encryption is a reversible one-to-one mapping of the input values to the output values, each set is a full permutation of the other.
Since you are presumably not repeating your inputs, then ECB mode is probably sufficient, unless you want a greater degree of security. ECB mode is vulnerable if used repeatedly with identical inputs, which does not appear to be the case here.
For inputs shorter than 128 bits, use a fixed padding method to make them the right length. As long as the uniqueness of inputs is not affected, padding can be reasonably flexible. Zero padding, at either end (or at the beginning of internal fields), may well be sufficient.
I do not know your detailed requirements, so feel free to modify my advice.

Somewhere between linear congruential generators and encryption functions there are hashes that can convert linear counts into passable pseudorandom numbers.
If you happen to have 128-bit integer types handy (e.g., __int128 in GCC when building for a 64-bit target), or are willing to implement such long multiplies by hand, then you could extend the construction used in SplitMix64. I did a fairly superficial search and came up with the following parameters:
uint128_t mix(uint128_t x) {
uint128_t m0 = (uint128_t)0xecfb1b9bc1f0564f << 64
| 0xc68dd22b9302d18d;
uint128_t m1 = (uint128_t)0x4a4cf0348b717188 << 64
| 0xe2aead7d60f8a0df;
x ^= x >> 59;
x *= m0;
x ^= x >> 60;
x *= m1;
x ^= x >> 84;
return x;
}
and its inverse:
uint128_t unmix(uint128_t x) {
uint128_t im0 = (uint128_t)0x367ce11aef44b547 << 64
| 0x424b0c012b51d945;
uint128_t im1 = (uint128_t)0xef0323293e8f059d << 64
| 0x351690f213b31b1f;
x ^= x >> 84;
x *= im1;
x ^= x >> 60 ^ x >> (2 * 60);
x *= im0;
x ^= x >> 59 ^ x >> (2 * 59);
return x;
}
I'm not sure if you wanted just a random sequence, or a way to obfuscate an arbitrary timestamp (since you said you wanted to decode the values, they must be more interesting than a linear counter), but one derives from the other simply enough:
uint128_t encode(uint128_t time, uint128_t salt) {
return mix((time + 1) * salt);
}
uint128_t generate(uint128_t salt) {
static uint128_t t = 0;
return encode(t++, salt);
}
static uint128_t inv(uint128_t d) {
uint128_t i = d;
while (i * d != 1) {
i *= 2 - i * d;
}
return i;
}
uint128_t decode(uint128_t etime, uint128_t salt) {
return unmix(etime) * inv(salt) - 1;
}
Note that salt chooses one of 2^127 sequences of non-repeating 128-bit values (we lose one bit because salt must be odd), but there are (2^128)! possible sequences that could have been generated. Elsewhere I'm looking at extending the parameterisation so that more of these sequences can be visited, but I started goofing around with the above method for increasing the randomness of the sequence to hide any problems where the parameters could pick not-so-random (but provably distinct) sequences.
Obviously uint128_t isn't a standard type, and so my answer is not C, but you can use either a bignumber library or a compiler extension to make the arithmetic work. For clarity I relied on the compiler extension. All the operations rely on C-like unsigned overflow behaviour (take the low-order bits of the arbitrary-precision result).
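For readers without a 128-bit type, the same mix/unmix structure can be tried at 64 bits in standard C. This is a sketch, not the 128-bit parameters above: it uses the multipliers and shifts of the SplitMix64 finalizer, and it computes the inverse multipliers at run time with the same Newton iteration as inv, so no inverse constants have to be trusted. The names inv64, mix64 and unmix64 are hypothetical.

```c
#include <stdint.h>

/* Multiplicative inverse of an odd d modulo 2^64, by Newton iteration. */
static uint64_t inv64(uint64_t d)
{
    uint64_t i = d;               /* d*d == 1 (mod 8), so 3 bits are already right */
    while (i * d != 1)
        i *= 2 - i * d;           /* each pass doubles the number of correct bits */
    return i;
}

/* SplitMix64-style finalizer: every step is invertible, so the whole is a bijection. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= UINT64_C(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x *= UINT64_C(0x94d049bb133111eb);
    x ^= x >> 31;
    return x;
}

/* Undo mix64 one step at a time, in reverse order.
 * The inverses are recomputed here for clarity; precompute them in real use. */
static uint64_t unmix64(uint64_t x)
{
    x ^= (x >> 31) ^ (x >> 62);                   /* inverse of x ^= x >> 31 */
    x *= inv64(UINT64_C(0x94d049bb133111eb));
    x ^= (x >> 27) ^ (x >> 54);                   /* inverse of x ^= x >> 27 */
    x *= inv64(UINT64_C(0xbf58476d1ce4e5b9));
    x ^= (x >> 30) ^ (x >> 60);                   /* inverse of x ^= x >> 30 */
    return x;
}
```

Undoing `x ^= x >> s` takes the terms `x >> s`, `x >> 2s`, and so on, until the shift reaches the word width, which is why the 84-bit shift in the 128-bit version is its own inverse.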

Fast hashing of 32 bit values to between 0 and 254 inclusive

I'm looking for a fast way in C to hash 32-bit numbers more or less uniformly between 0 and 254. 255 is reserved for a special purpose.
As an added constraint, I'm looking for a method that would map well to being used with ISA-specific vector intrinsics or to a language like OpenCL or CUDA without introducing control flow divergence between the vector lanes/threads.
Ordinarily, I would just use the following code to hash the number between 0 and 255, as this is just a fast way of doing x mod 256.
inline uint8_t hash(uint32_t x){ return x & 255; }
I could just give in and use the following:
inline uint8_t hash(uint32_t x){ return x % 255; }
However, this solution seems unimaginative and unlikely to be the highest performing solution. I found code at this site (http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml#exmod15) that appears to provide a reasonable solution for scalar code and have inserted it here for your convenience.
uint32_t mod255( uint32_t a ) {
a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */
a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */
if (a < 255) return a;
if (a < (2 * 255)) return a - 255;
return a - (2 * 255);
}
I see two potential performance issues with this code:
The large number of if statements makes me question how easy it will be for a compiler or human :) to effectively vectorize the code without leading to control flow divergence within a warp/wavefront on a SIMT architecture or vectorized execution on a multicore CPU. If such divergence does occur, it will reduce parallel efficiency, as the divergent paths will have to be run in series.
It looks like it could be troublesome for a branch predictor (not applicable on common GPU architectures) as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Any recommendations on alternatives that I could use are most welcome. Alternatively, let me know if what I am asking for is unreasonable.

The "if statements on GPU kill performance" is a popular misconception which desperately wants to live on, it seems.
The large number of if statements makes me question how easy it will
be for a compiler or human :) to vectorize the code.
First of all I wouldn't consider 2 if statements a "large number of if statements", and those are so short and trivial that I'm willing to bet the compiler will turn them into branchless conditional moves or predicated instructions. There will be no performance penalty at all. (Do check the generated assembly, however).
It looks like it could be troublesome for a branch predictor as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Current GPUs do not have branch predictors. Note however that depending on the underlying hardware, operation on integers (and notably shifting) may be quite costly.
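If you would rather not depend on the compiler removing the branches, the reduction can also be written without any if statements. The sketch below is one possible formulation, not the code from the linked page; the function name is hypothetical. It folds the base-256 digits three times, which also covers inputs such as 0xFFFFFF00, whose digit sum reaches 765 and for which mod255 as quoted in the question appears to return the reserved value 255.

```c
#include <stdint.h>

/* x mod 255 without branches. Since 256 == 1 (mod 255), the base-256 digits
 * of x can be summed without changing the remainder. */
static inline uint8_t mod255_branchless(uint32_t x)
{
    x = (x >> 16) + (x & 0xFFFFu);   /* at most 0x1FFFE */
    x = (x >> 8) + (x & 0xFFu);      /* at most 765     */
    x = (x >> 8) + (x & 0xFFu);      /* at most 256     */
    /* Subtract 255 when x is 255 or 256; the comparison yields 0 or 1 and is
     * turned into an all-zeros or all-ones mask, so no jump is required. */
    x -= 255u & (0u - (uint32_t)(x >= 255u));
    return (uint8_t)x;
}
```

Whether this beats a plain `x % 255` with a constant divisor depends on the target, so it is worth comparing the generated code for both.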

I would just do this:
uchar fast_mod255( uint a32 ) {
ushort a16 = (a32 >> 16) + (a32 & 0xFFFF); /* sum base 2**16 digits */
uchar a8 = (a16 >> 8) + (a16 & 0xFF); /* sum base 2**8 digits */
return (a8 % 255);
}
Another option is to just do:
uchar fast_mod255( uchar4 a ) {
return (dot(a) % 255); // or return (distance(a) % 255);
}
GPUs are very efficient in computing distances and dot products, even in 4 dimensions. And it is a valid way of hashing as well, discarding the overflowed values.
No branching, and a clever compiler can even optimize it out. Or do you really need that values that fall in the 255 zone have a scattered pattern instead of 1?

I wanted to answer my own question because over the last 2 years I have seen ways to get around a slow integer divide instruction. The easiest way is to make the integer a compile-time constant. Any decent modern compiler should replace the integer divide with an equivalent set of other instructions with typically higher throughput (how many such instructions can be retired per cycle) and reduced latency (how many cycles it takes the instruction to execute). If you're curious, check out Hacker's Delight (an excellent book on low-level computer arithmetic).
I wanted to share another finding, which I found on Daniel Lemire's blog (located here). The code that follows doesn't compute mod 255 but does something similar, which is equally useful in a number of applications and much faster.
Suppose that you have a set of numbers S that are uniformly randomly picked from the range 0 to 2^k - 1 inclusive, where k >= 0. In this case, if you care only about mapping numbers roughly uniformly from 0 to 254 inclusive, you may do the following:
For each number n in a set S, you may map n to one of the 255 candidate values by multiplying n by 255 and then arithmetically shifting the result to the right by k bits.
Here is the function that you call on each n for a fixed value of k:
int map_to_0_to_254(int n, int k){
return (n * 255) >> k;
}
As an example, if the values for the argument n range uniformly randomly from 0 to 4095 (2^12 - 1),
then map_to_0_to_254(n, 12) will return a value in the range 0 to 254 inclusive.
Here is a more general templated version in C++ for mapping to range from 0 to range_size - 1 inclusive:
template<typename T>
T map_to_0_to_range_size_minus_1(T n, T range_size, T k){
return (n * range_size) >> k;
}
REMEMBER that this code assumes that the inputs for n are roughly uniformly randomly distributed between 0 and 2^k - 1 inclusive. If that property holds, then the outputs will be roughly uniformly distributed between 0 and range_size - 1 inclusive. The larger 2^k is relative to range_size, the more uniform the mapping will be for a fixed set of inputs.
Why This is Useful
This approach has applications to computing hash functions for hash tables where the number of bins is not a power of 2. Those operations would ordinarily require a long-latency integer divide instruction, which is often an order of magnitude slower to execute than an integer multiply, because you often do not know the number of bins in the hash table at compile time.
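For full 32-bit inputs (k = 32), the product n * 255 no longer fits in an int, so the multiplication has to be done in a wider type. A minimal C sketch under that assumption, with a hypothetical function name:

```c
#include <stdint.h>

/* Map a uniformly distributed 32-bit value into [0, range_size - 1]
 * with one 64-bit multiply and one shift (k is fixed at 32 here). */
static inline uint32_t map_to_range(uint32_t x, uint32_t range_size)
{
    return (uint32_t)(((uint64_t)x * range_size) >> 32);
}
```

With range_size = 255 the result is always between 0 and 254, so the reserved value 255 from the question is never produced. Unlike x % 255, this mapping relies on the high bits of x being well distributed.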

What does (r+1 + (r >> 8)) >> 8 do?

In some old C/C++ graphics-related code that I have to port to Java and JavaScript, I found this:
b = (b+1 + (b >> 8)) >> 8; // very fast
Where b is a short int for blue, and the same code is used for r and g (red and green). The comment is not helpful.
I cannot figure out what it does, apart from obvious shifting and adding. I can port without understanding, I just ask out of curiosity.

y = ( x + 1 + (x>>8) ) >> 8 // very fast
This is a fixed-point approximation of division by 255. Conceptually, this is useful for normalizing calculations based on pixel values such that 255 (typically the maximum pixel value) maps to exactly 1.
It is described as very fast because fully general integer division is a relatively slow operation on many CPUs -- although it is possible that your compiler would make a similar optimization for you if it can deduce the input constraints.
This works based on the idea that 257/(256*256) is a very close approximation of 1/255, and that x*257/256 can be formulated as x+(x>>8). The +1 is rounding support which allows the formula to exactly match the integer division x/255 for all values of x in [0..65534].
Some algebra on the inner portion may make things a bit more clear...
x*257/256
= (x*256+x)/256
= x + x/256
= x + (x>>8)
There is more discussion here: How to do alpha blend fast? and here: Division via Multiplication
By the way, if you want round-to-nearest, and your CPU can do fast multiplies, the following is accurate for all uint16_t dividend values -- actually [0..(2^16)+126].
y = ((x+128)*257)>>16 // divide by 255 with round-to-nearest for x in [0..65662]
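Both range claims are small enough to check exhaustively. The sketch below compares the two shift forms against plain integer division over the stated ranges; the function names are hypothetical. For the rounded form the reference is (x + 127) / 255, which equals round-to-nearest because 255 is odd and no exact ties occur.

```c
#include <stdint.h>

/* Truncating x / 255, stated to be exact for x in [0, 65534]. */
static inline uint32_t div255_trunc(uint32_t x)
{
    return (x + 1 + (x >> 8)) >> 8;
}

/* Round-to-nearest x / 255, stated to be exact for x in [0, 65662]. */
static inline uint32_t div255_round(uint32_t x)
{
    return ((x + 128) * 257) >> 16;
}

/* Count the inputs where a fast form disagrees with ordinary division. */
static unsigned div255_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t x = 0; x <= 65534; x++)
        bad += div255_trunc(x) != x / 255;
    for (uint32_t x = 0; x <= 65662; x++)
        bad += div255_round(x) != (x + 127) / 255;
    return bad;
}
```

A mismatch count of zero confirms the ranges. The first input past each range (65535 and 65663) is where the corresponding form stops agreeing with true division.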

Looks like it is meant to check if blue (or red or green) is fully used. It evaluates to 1, when b is 255, and is 0 for all lower values.

A common use case of when you'd want to use a formula that's more accurate than 257/256 is when you have to combine a lot of alpha values together for each pixel. As one example, when doing image shrinking, you need to combine 4 alphas for each source pixel contributing to the destination, and then combine all the source pixels contributing to the destination.
I posted an infinitely accurate bit twiddling version of /255 but it was rejected without reason. So I'll add that I implement alpha blending hardware for a living, I write real time graphics code and game engines for a living, and I've published articles on this topic in conferences like MICRO, so I really know what I'm talking about. And it might be useful or at least entertaining for people to understand the more accurate formula that is EXACTLY 1/255:
Version 1: x = (x + (x >> 8)) >> 8
- no constant added, won't satisfy (x * 255) / 255 = x, but will look fine in most cases.
Version 2: x = (x + (x >> 8) + 1) >> 8
- WILL satisfy (x * 255) / 255 = x for integers, but won't hit correct integer values for all alphas
Version 3: (simple integer rounding):
(x + (x >> 8) + 128) >> 8
- Won't hit correct integer values for all alphas, but will on average be closer than Version 2 at the same cost.
Version 4: Infinitely accurate version, to any level of precision desired, for any number of composite alphas: (useful for image resizing, rotation, etc.):
[(x + (x >> 8)) >> 8] + [ ( (x & 255) + (x >> 8) ) >> 8]
Why is version 4 infinitely accurate?
Because 1/255 = 1/256 + 1/65536 + 1/256^3 + 1/256^4 + ...
The simplest expression above (version 1) doesn't handle rounding, but it also doesn't handle the carries that occur from this infinite number of identical sum columns. The new term added above determines the carry out (0 or 1) from this infinite number of base 256 digits. By adding it, you are getting the same result as if you added all the infinite addends. At which point you can round by adding a half bit to whatever accuracy point you want.
Not needed for the OP perhaps, but people should know that you don't need to approximate at all. The formula above is actually more accurate than double precision floating point.
As for speed: In hardware, this method is faster than even a single (full width) add. In software, you have to consider throughput vs latency. In latency, it may still be faster than a narrow multiply (definitely faster than a full width multiply), but in the OP context, you can unroll many pixels at once, and since modern multiply units are pipelined, you are still OK. In translation to Java, you probably have no narrow multiplies, so this could still be faster, but need to check.
WRT the one person who said "why not use the built in OS capabilities for alpha blitting?": If you already have a substantial graphical code base in that OS, this might be a fine option. If not, you're looking at hundreds to thousands of times as many lines of code to leverage the OS version - code that's far harder to write and debug than this code. And in the end, the OS code you have isn't portable at all, while this code can be used anywhere.

I suspect that it is trying to do the following:
boolean isBFullyOn = false;
if (b == 0xff) {
isBFullyOn = true;
}
Back in the days of slow processors, smart bit-shifting tricks like the above could be faster than the obvious if-then-else logic. It avoids a jump instruction, which was costly.
It probably also sets an overflow flag in the processor, which was used for some later logic. This is all highly dependent upon the target processor.
And also on my part speculative!!

It computes b + 1 + b/256 and then divides the result by 256.
Written with bit shifts, the compiler translates this into CPU-level shift instructions instead of using the FPU or library division functions.

b = (b + (b >> 8)) >> 8; is basically b = b * 257 / 65536, which approximates b / 255.
I would consider the +1 an ugly hack to compensate for the mean reduction of 0.5 caused by the truncation in the inner >> 8.
I would write it as b = (b + 128 + ((b +128)>> 8)) >> 8; instead.
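A minimal sketch of both variants in C, assuming the expression is applied to the product of two 8-bit channel values (so the input lies between 0 and 255 * 255 = 65025) rather than to a single byte. The function names div255_trunc and div255_round are illustrative only and do not come from the original code.

```c
#include <stdint.h>

/* Truncating x / 255 without a divide instruction.
   Exact for 0 <= x <= 65025 (the product of two 8-bit values). */
uint32_t div255_trunc(uint32_t x)
{
    return (x + 1 + (x >> 8)) >> 8;
}

/* x / 255 rounded to the nearest integer, same input range.
   Adding 128 first is the "half" that turns truncation into rounding. */
uint32_t div255_round(uint32_t x)
{
    uint32_t t = x + 128;
    return (t + (t >> 8)) >> 8;
}
```

Over that input range both functions agree with plain integer division (truncated and rounded, respectively), which is why no floating-point approximation is needed.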

Running this test code:
public void test() {
Set<Integer> results = new HashSet<Integer>();
// Java short ranges from -32768 to 32767; this loop covers -32767 to 32767
for (int i = -32767; i <= 32767; i++) {
int b = (i + 1 + (i >> 8)) >> 8;
if (!results.contains(b)) {
System.out.println(i + " -> " + b);
results.add(b);
}
}
}
Produces all possible values between -129 and 128. However, if you are working with 8-bit colours (0 - 255) then the only possible outputs are 0 (for 0 - 254) and 1 (for 255), so it is likely that it is attempting the function @kaykay posted.

Looking for decent-quality PRNG with only 32 bits of state

I'm trying to implement a tolerable-quality version of the rand_r interface, which has the unfortunate interface requirement that its entire state is stored in a single object of type unsigned, which for my purposes means exactly 32 bits. In addition, I need its output range to be [0,2³¹-1]. The standard solution is using a LCG and dropping the low bit (which has the shortest period), but this still leaves very poor periods for the next few bits.
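To make the short-period problem concrete, here is a small sketch that measures how quickly the low k bits of an LCG state repeat. The multiplier and increment are the widely quoted Numerical Recipes constants, chosen here only as an example of a full-period LCG modulo 2^32; the function names are hypothetical.

```c
#include <stdint.h>

/* One step of a full-period LCG modulo 2^32
   (example constants from Numerical Recipes). */
uint32_t lcg_next(uint32_t s)
{
    return s * 1664525u + 1013904223u;
}

/* Count the steps until the low k bits (1 <= k <= 31) of the state
   first return to their starting value. The low k bits never depend
   on the higher bits, so this count is the period of those bits. */
uint32_t low_bits_period(unsigned k)
{
    uint32_t mask = ((uint32_t)1 << k) - 1u;
    uint32_t s = 0;
    uint32_t steps = 0;
    do {
        s = lcg_next(s);
        steps++;
    } while ((s & mask) != 0);
    return steps;
}
```

The low k bits repeat after only 2^k steps, so even after the lowest state bit is dropped, the lowest output bit has period 4, the next one 8, and so on.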
My initial thought was to use two or three iterations of the LCG to generate the high/low or high/mid/low bits of the output. However, such an approach does not preserve the non-biased distribution; rather than each output value having equal frequency, many occur multiple times, and some never occur at all.
Since there are only 32 bits of state, the period of the PRNG is bounded by 2³², and in order to be non-biased, the PRNG must output each value exactly twice if it has full period or exactly once if it has period 2³¹. Shorter periods cannot be non-biased.
Is there any good known PRNG algorithm that meets these criteria?

One good (but probably not the fastest) possibility, offering very high quality, would be to use a 32-bit block cipher in CTR mode. Basically, your RNG state would simply be a 32-bit counter that gets incremented by one for each RNG call, and the output would be the encryption of that counter value using the block cipher with some arbitrarily chosen fixed key. For extra randomness, you could even provide a (non-standard) function to let the user set a custom key.
There aren't a lot of 32-bit block ciphers in common use, since such a short block size introduces problems for cryptographic use. (Basically, the birthday paradox lets you distinguish the output of such a cipher from a random function with a non-negligible probability after only about 2^16 = 65536 outputs, and after 2^32 outputs the non-randomness obviously becomes certain.) However, some ciphers with an adjustable block size, such as XXTEA or HPC, will let you go down to 32 bits, and should be suitable for your purposes.
(Edit: My bad, XXTEA only goes down to 64 bits. However, as suggested by CodesInChaos in the comments, Skip32 might be another option. Or you could build your own 32-bit Feistel cipher.)
The CTR mode construction guarantees that the RNG will have a full period of 2^32 outputs, while the standard security claim of (non-broken) block ciphers is essentially that it is not computationally feasible to distinguish their output from a random permutation of the set of 32-bit integers. (Of course, as noted above, such a permutation is still easily distinguished from a random function taking 32-bit values.)
Using CTR mode also provides some extra features you may find convenient (even if they're not part of the official API you're developing against), such as the ability to quickly seek into any point in the RNG output stream just by adding or subtracting from the state.
On the other hand, you probably don't want to follow the common practice of seeding the RNG by just setting the internal state to the seed value, since that would cause the output streams generated from nearby seeds to be highly similar (basically just the same stream shifted by the difference of the seeds). One way to avoid this issue would be to add an extra encryption step to the seeding process, i.e. to encrypt the seed with the cipher and set the internal counter value equal to the result.
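As a rough illustration of the construction described in this answer, here is a sketch using a home-made 4-round Feistel network over two 16-bit halves. This is not Skip32 or any vetted cipher: the round function, the round keys, and all function names are assumptions made up for the example, and it offers no cryptographic security. It only shows the shape: a counter as state, a 32-bit permutation as output function, and an encrypted seed.

```c
#include <stdint.h>

/* Arbitrary fixed round keys (assumption; a real cipher would derive
   these from a user-supplied key). */
static const uint32_t ROUND_KEYS[4] = {
    0xA511E9B3u, 0x3C6EF372u, 0xDAA66D2Bu, 0x78DDE6E4u
};

/* Hypothetical round function. It does not need to be invertible:
   the Feistel structure is a permutation whatever F computes. */
static uint16_t round_f(uint16_t half, uint32_t round_key)
{
    uint32_t x = ((uint32_t)half + round_key) * 0x9E3779B1u;
    return (uint16_t)(x >> 16);
}

uint32_t feistel32_encrypt(uint32_t block)
{
    uint16_t left = (uint16_t)(block >> 16);
    uint16_t right = (uint16_t)block;
    for (int i = 0; i < 4; i++) {
        uint16_t tmp = right;
        right = (uint16_t)(left ^ round_f(right, ROUND_KEYS[i]));
        left = tmp;
    }
    return ((uint32_t)left << 16) | right;
}

/* Inverse permutation, included only to show that no value is lost. */
uint32_t feistel32_decrypt(uint32_t block)
{
    uint16_t left = (uint16_t)(block >> 16);
    uint16_t right = (uint16_t)block;
    for (int i = 3; i >= 0; i--) {
        uint16_t tmp = left;
        left = (uint16_t)(right ^ round_f(left, ROUND_KEYS[i]));
        right = tmp;
    }
    return ((uint32_t)left << 16) | right;
}

/* Seed by encrypting, so nearby seeds do not give shifted streams. */
void ctr_seed(uint32_t *state, uint32_t seed)
{
    *state = feistel32_encrypt(seed);
}

/* Encrypt the counter, advance it, and drop one bit so the result
   lies in [0, 2^31 - 1]; each value occurs exactly twice per period. */
uint32_t ctr_rand(uint32_t *state)
{
    return feistel32_encrypt((*state)++) >> 1;
}
```

Seeking in the stream is then just a matter of adding to or subtracting from the state before the next call.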

A 32-bit maximal-period Galois LFSR might work for you. Try:
r = (r >> 1) ^ (-(r & 1) & 0x80200003);
The one problem with LFSRs is that you can't produce the value 0. So this one has a range of 1 to 2^32-1. You may want to tweak the output or else stick with a good LCG.
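One way the tweak might look, as a sketch: the step function is the line from this answer, while the seeding helper and the 31-bit output function are hypothetical additions for the rand_r-style range asked about in the question.

```c
#include <stdint.h>

/* One step of the 32-bit Galois LFSR given in this answer. */
uint32_t lfsr32_step(uint32_t r)
{
    return (r >> 1) ^ (-(r & 1u) & 0x80200003u);
}

/* Hypothetical seeding helper: 0 is a fixed point of the LFSR,
   so a zero seed is remapped to 1. */
uint32_t lfsr32_seed(uint32_t seed)
{
    return seed ? seed : 1u;
}

/* Advance the state and return a value in [0, 2^31 - 1]. */
uint32_t lfsr32_rand31(uint32_t *state)
{
    *state = lfsr32_step(*state);
    return *state >> 1;
}
```

Because the state never reaches 0, the output 0 occurs once per period while every other 31-bit value occurs twice, so the result is very slightly biased. Consecutive states also differ by a single shift, which is one reason a good LCG with output mixing may still be preferable.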

Besides using a Lehmer MCG, there's a couple you could use:
32-bit variants of Xorshift have a guaranteed period of 2^32−1 using a 32-bit state:
#include <stdint.h>
uint32_t state = 1; /* xorshift needs a nonzero seed: 0 is a fixed point */
uint32_t xorshift32(void) {
state ^= state << 13;
state ^= state >> 17;
state ^= state << 5;
return state;
}
That's the original 32-bit recommendation from 2003 (see paper). Depending on your definition of "decent quality", that should be fine. However it fails the binary rank tests of Diehard, and 5/10 tests of SmallCrush.
Alternate version with better mixing and constants (passes SmallCrush and Crush):
uint32_t xorshift32amx(void) {
uint32_t s = __builtin_bswap32(state * 1597334677u); /* GCC/Clang builtin */
state ^= state << 13;
state ^= state >> 17;
state ^= state << 5;
return state + s;
}
Based on research here and here.
There's also Mulberry32 which has a period of exactly 2^32:
uint32_t mulberry32(void) {
uint32_t z = state += 0x6D2B79F5;
z = (z ^ z >> 15) * (1 | z);
z ^= z + (z ^ z >> 7) * (61 | z);
return z ^ z >> 14;
}
This is probably your best option. It's quite good. The author states: "It passes gjrand's 13 tests with no failures and a total P-value of 0.984 (where 1 is perfect and 0.1 or less is a failure) on 4GB of generated data. That's a quarter of the full period". It appears to be an improvement over SplitMix32.
"SplitMix32", adapted from xxHash/MurmurHash3 (Weyl sequence):
uint32_t splitmix32(void) {
uint32_t z = state += 0x9e3779b9;
z ^= z >> 15; // 16 for murmur3
z *= 0x85ebca6b;
z ^= z >> 13;
z *= 0xc2b2ae35;
return z ^= z >> 16;
}
The quality might be questionable here, but its 64-bit big brother has a lot of fans (passes BigCrush). So the general structure is worth looking at.
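To tie this back to the rand_r interface in the question, here is a sketch that wraps the SplitMix32 steps above around a caller-supplied state. The function name is hypothetical, and the code assumes unsigned is exactly 32 bits wide, as the question does.

```c
#include <stdint.h>

/* rand_r-style wrapper (hypothetical name) around the SplitMix32 steps.
   Assumes unsigned is 32 bits wide. */
int rand_r_splitmix32(unsigned *seedp)
{
    uint32_t z = (uint32_t)(*seedp) + 0x9e3779b9u;
    *seedp = z;               /* Weyl sequence: odd increment, period 2^32 */
    z ^= z >> 15;
    z *= 0x85ebca6bu;
    z ^= z >> 13;
    z *= 0xc2b2ae35u;
    z ^= z >> 16;
    return (int)(z >> 1);     /* drop the low bit: range [0, 2^31 - 1] */
}
```

Each mixing step here (xor with a right shift, multiplication by an odd constant) is invertible modulo 2^32, so the 32-bit value before the final shift takes every value exactly once per period. Dropping one bit therefore yields each 31-bit output exactly twice, which is the non-bias condition stated in the question.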

Elaborating on my comment...
A block cipher in counter mode gives a generator in approximately the following form (except using much bigger data types):
uint32_t state = 0;
uint32_t rand()
{
state = next(state);
return temper(state);
}
Since cryptographic security hasn't been specified (and in 32 bits it would be more or less futile), a simpler, ad-hoc tempering function should do the trick.
One approach is where the next() function is simple (e.g., return state + 1;) and temper() compensates by being complex (as in the block cipher).
A more balanced approach is to implement LCG in next(), since we know that it also visits all possible states but in a random(ish) order, and to find an implementation of temper() which does just enough work to cover the remaining problems with LCG.
Mersenne Twister includes such a tempering function on its output. That might be suitable. Also, this question asks for operations which fulfill the requirement.
I have a favourite, which is to bit-reverse the word, and then multiply it by some constant (odd) number. That may be overly complex if bit-reverse isn't a native operation on your architecture.
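A sketch of that favourite, under some assumptions: the LCG constants are the Numerical Recipes ones and the odd multiplier in temper_bitrev() is an arbitrary choice, neither taken from this answer; the function names are made up for the example. The idea is that reversing the bits moves the LCG's strong high bits into the low positions, and the multiplication then spreads them back upward.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word using five swap steps. */
uint32_t bitrev32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* next(): a full-period LCG modulo 2^32 (example constants). */
uint32_t next_state(uint32_t s)
{
    return s * 1664525u + 1013904223u;
}

/* temper(): bit-reverse, then multiply by an odd constant
   (arbitrary choice). Both steps are invertible. */
uint32_t temper_bitrev(uint32_t s)
{
    return bitrev32(s) * 0x9E3779B1u;
}

/* Advance the state and return a value in [0, 2^31 - 1]. */
uint32_t tempered_rand(uint32_t *state)
{
    *state = next_state(*state);
    return temper_bitrev(*state) >> 1;
}
```

Since both the LCG step and the tempering are bijections on 32-bit words, the generator keeps its full period and outputs each 31-bit value exactly twice per period. Whether the statistical quality is sufficient would still need to be checked with a test suite.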
