Generating (very) large non-repeating integer sequence without pre-shuffling


Background
I have a simple media client/server I've written, and I want to generate a non-obvious time value I send with each command from the client to the server. The timestamps will have a fair bit of data in them (nanosecond resolution, even if it's not truly accurate, due to limitations of timer sampling in modern operating systems).
What I'm trying to do (on Linux, in C) is to generate a one-to-one sequence of n-bit values (let's assume data is stored in 128-bit array-of-int elements for now) with no overlapping/colliding values. I would then take a pseudo-random 128-bit value/number as a "salt", apply it to the timestamp, and then start sending off commands to the server, incrementing the pre-salted/pre-hashed value.
The reason the timestamp size is so large is because the timestamp may have to accommodate a very large duration of time.
Question
How could I accomplish such a sequence (non-colliding) with an initial salt value? The best approach that sounds along the lines of my goal is from this post, which notes:
If option 1 isn't "random" enough for you, use the CRC-32 hash of said
global (32-bit) counter. There is a 1-to-1 mapping (bijection) between
N-bit integers and their CRC-N so uniqueness will still be guaranteed.
However, I do not know:
If that can (efficiently) be extended to 128-bit data.
If some sort of addition-to/multiplication-by salt-value to provide the initial seed for the sequence would disrupt it or introduce collisions.
Follow-up
I realize that I could use a 128bit random hash from libssl or something similar, but I want the remote server, using the same salt value, to be able to convert the hashed timestamps back into their true values.
Thank you.

You could use a linear congruential generator. With the right parameters, it is guaranteed to produce a non-repeating [unique] sequence with a full period (i.e. no collisions).
This is what random(3) uses in TYPE_0 mode. I adapted it for a full unsigned int range and the seed can be any unsigned int (See my sample code below).
I believe it can be extended to 64 or 128 bits. I'd have a look at: https://en.wikipedia.org/wiki/Linear_congruential_generator to see about the constraints on parameters to prevent collisions and good randomness.
Following the wiki page guidelines, you could produce one that can take any 128 bit value as the seed and will not repeat until all possible 128 bit numbers have been generated.
You may need to write a program to generate suitable parameter pairs and then test them for the "best" randomness. This would be a one time operation.
Once you've got them, just plug these parameters into your equation in your actual application.
Here's some code of mine that I had been playing with when I was looking for something similar:
// _prngstd -- get random number
static inline u32
_prngstd(prng_p prng)
{
long rhs;
u32 lhs;
// NOTE: random is faster and has a _long_ period, but it _only_ produces
// positive integers but jrand48 produces positive _and_ negative
#if 0
rhs = jrand48(btc->btc_seed);
lhs = rhs;
#endif
// this has collisions
#if 0
rhs = rand();
PRNG_FLIP;
#endif
// this has collisions because it defaults to TYPE_3
#if 0
rhs = random();
PRNG_FLIP;
#endif
// this is random in TYPE_0 (linear congruential) mode
#if 0
prng->prng_state = ((prng->prng_state * 1103515245) + 12345) & 0x7fffffff;
rhs = prng->prng_state;
PRNG_FLIP;
#endif
// this is random in TYPE_0 (linear congruential) mode with the mask
// removed to get full range numbers
// this does _not_ produce overlaps
#if 1
prng->prng_state = ((prng->prng_state * 1103515245) + 12345);
rhs = prng->prng_state;
lhs = rhs;
#endif
return lhs;
}
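The full-period claim can be checked exhaustively on a small modulus. The sketch below is not the code above: it uses a hypothetical 16-bit generator whose multiplier (25173) and increment (13849) are assumptions, picked only because they satisfy the usual conditions for a power-of-two modulus (odd increment, multiplier congruent to 1 modulo 4). The helper name lcg16_period is likewise made up for illustration.

```c
#include <stdint.h>

/* Hypothetical 16-bit LCG: x -> (a*x + c) mod 2^16.
 * a = 25173 (a % 4 == 1) and c = 13849 (odd) are illustrative values. */
static uint16_t lcg16_next(uint16_t x)
{
    return (uint16_t)(x * 25173u + 13849u);
}

/* Count how many steps it takes for the sequence to return to its seed.
 * The step is a bijection (a is odd), so the loop always terminates. */
static uint32_t lcg16_period(uint16_t seed)
{
    uint16_t x = seed;
    uint32_t steps = 0;
    do {
        x = lcg16_next(x);
        steps++;
    } while (x != seed);
    return steps;
}
```

If the parameters are suitable, the period equals the number of possible states (65536 here), regardless of the seed. The same check is not feasible at 64 or 128 bits, which is why the parameter conditions matter there.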

The short answer is encryption. With a set of 128 bit values feed them into AES and get a different set of 128 bit values out. Because encryption is reversible the outputs are guaranteed unique for unique inputs with a fixed key.
Encryption is a reversible one-to-one mapping of the input values to the output values, each set is a full permutation of the other.
Since you are presumably not repeating your inputs, then ECB mode is probably sufficient, unless you want a greater degree of security. ECB mode is vulnerable if used repeatedly with identical inputs, which does not appear to be the case here.
For inputs shorter than 128 bits, then use a fixed padding method to make them the right length. As long as the uniqueness of inputs is not affected, then padding can be reasonably flexible. Zero padding, at either end (or at the beginning of internal fields) may well be sufficient.
I do not know your detailed requirements, so feel free to modify my advice.

Somewhere between linear congruential generators and encryption functions there are hashes that can convert linear counts into passable pseudorandom numbers.
If you happen to have 128-bit integer types handy (eg., __int128 in GCC when building for a 64-bit target), or are willing to implement such long multiplies by hand, then you could extend on the construction used in SplitMix64. I did a fairly superficial search and came up with the following parameters:
uint128_t mix(uint128_t x) {
uint128_t m0 = (uint128_t)0xecfb1b9bc1f0564f << 64
| 0xc68dd22b9302d18d;
uint128_t m1 = (uint128_t)0x4a4cf0348b717188 << 64
| 0xe2aead7d60f8a0df;
x ^= x >> 59;
x *= m0;
x ^= x >> 60;
x *= m1;
x ^= x >> 84;
return x;
}
and its inverse:
uint128_t unmix(uint128_t x) {
uint128_t im0 = (uint128_t)0x367ce11aef44b547 << 64
| 0x424b0c012b51d945;
uint128_t im1 = (uint128_t)0xef0323293e8f059d << 64
| 0x351690f213b31b1f;
x ^= x >> 84;
x *= im1;
x ^= x >> 60 ^ x >> (2 * 60);
x *= im0;
x ^= x >> 59 ^ x >> (2 * 59);
return x;
}
I'm not sure if you wanted just a random sequence, or a way to obfuscate an arbitrary timestamp (since you said you wanted to decode the values they must be more interesting than a linear counter), but one derives from the other simply enough:
uint128_t encode(uint128_t time, uint128_t salt) {
return mix((time + 1) * salt);
}
uint128_t generate(uint128_t salt) {
static uint128_t t = 0;
return encode(t++, salt);
}
static uint128_t inv(uint128_t d) {
uint128_t i = d;
while (i * d != 1) {
i *= 2 - i * d;
}
return i;
}
uint128_t decode(uint128_t etime, uint128_t salt) {
return unmix(etime) * inv(salt) - 1;
}
Note that salt chooses one of 2^127 sequences of non-repeating 128-bit values (we lose one bit because salt must be odd), but there are (2^128)! possible sequences that could have been generated. Elsewhere I'm looking at extending the parameterisation so that more of these sequences can be visited, but I started goofing around with the above method for increasing the randomness of the sequence to hide any problems where the parameters could pick not-so-random (but provably distinct) sequences.
Obviously uint128_t isn't a standard type, and so my answer is not C, but you can use either a bignumber library or a compiler extension to make the arithmetic work. For clarity I relied on the compiler extension. All the operations rely on C-like unsigned overflow behaviour (take the low-order bits of the arbitrary-precision result).
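For readers without a 128-bit type, the same encode/decode structure can be tried at 64 bits in standard C. This is only a sketch under assumptions: it swaps in the published SplitMix64 finalizer constants and shifts instead of the 128-bit parameters above, and it computes the inverse multipliers at run time rather than hard-coding them. The names mix64, unmix64, encode64 and decode64 are made up for illustration.

```c
#include <stdint.h>

/* Multiplicative inverse of an odd d modulo 2^64 (Newton iteration).
 * d must be odd; an even d has no inverse and the loop would not end. */
static uint64_t inv64(uint64_t d)
{
    uint64_t i = d;              /* d*d == 1 (mod 8), so 3 bits are right */
    while (i * d != 1)
        i *= 2 - i * d;          /* each pass doubles the correct bits */
    return i;
}

#define M0 UINT64_C(0xbf58476d1ce4e5b9)   /* SplitMix64 multipliers */
#define M1 UINT64_C(0x94d049bb133111eb)

static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= M0;
    x ^= x >> 27; x *= M1;
    x ^= x >> 31;
    return x;
}

/* Undo mix64 step by step, in reverse order. */
static uint64_t unmix64(uint64_t x)
{
    x ^= (x >> 31) ^ (x >> 62);
    x *= inv64(M1);
    x ^= (x >> 27) ^ (x >> 54);
    x *= inv64(M0);
    x ^= (x >> 30) ^ (x >> 60);
    return x;
}

/* salt must be odd so that multiplication by it is reversible. */
static uint64_t encode64(uint64_t time, uint64_t salt)
{
    return mix64((time + 1) * salt);
}

static uint64_t decode64(uint64_t etime, uint64_t salt)
{
    return unmix64(etime) * inv64(salt) - 1;
}
```

Every step is a bijection on 64-bit values, so distinct timestamps give distinct encoded values for a fixed odd salt. In real use the inverse constants would be computed once, not on every call.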

Related

How does XorShift32 works?

I have this homework where I need to implement xorshift32 (I can't use anything else) so I can generate some numbers, but I don't understand how the algorithm works or how to implement it.
I am trying to print the generated number, but I don't know how to call the xorshift32 function because of the state[static 1] argument.
uint32_t xorshift32(uint32_t state[static 1])
{
uint32_t x = state[0];
x ^= x << 13;
x ^= x >> 17;
x ^= x << 5;
state[0] = x;
return x;
}
I do not have much information on xorshift32 other than what is on Wikipedia (en.wikipedia.org/wiki/Xorshift).

This is an extended comment to the good answer by Jabberwocky.
The Xorshift variants, rand(), and basically all random number generator functions, are actually pseudorandom number generators. They are not "real random", because the sequence of numbers they generate depends on their internal state; but they are "pseudorandom", because if you do not know the generator internal state, the sequence of numbers they generate is random in the statistical sense.
George Marsaglia, the author of the Xorshift family of pseudorandom number generators, also developed a set of statistical tools called Diehard tests that can be used to analyse the "randomness" of the sequences generated. Currently, the TestU01 tests are probably the most widely used and trusted; in particular, the 160-test BigCrush set.
The sequence generated by ordinary pseudorandom number generators often allows one to determine the internal state of the generator. This means that observing a long enough generated sequence allows one to fairly reliably predict the future sequence. Cryptographically secure pseudorandom number generators avoid that, usually by applying a cryptographically secure hash function to the output; one would need a catalog of the entire sequence to be able to follow it. When the periods are longer than 2^256 or so, there is not enough baryonic matter in the entire observable universe to store the sequence.
My own favourite PRNG is Xorshift64*, which has a period of 2^64-1, and passes all but the MatrixRank test in BigCrush. In C99 and later, you can implement it using
#include <inttypes.h>
typedef struct {
uint64_t state;
} prng_state;
static inline uint64_t prng_u64(prng_state *const p)
{
uint64_t state = p->state;
state ^= state >> 12;
state ^= state << 25;
state ^= state >> 27;
p->state = state;
return state * UINT64_C(2685821657736338717);
}
The state can be initialized to any nonzero uint64_t. (A zero state will lead the generator to generate all zeros till infinity. The period is 2^64-1, because the generator will visit each 64-bit state (excluding zero) exactly once during each period.)
It is good enough for most use cases, and extremely fast. It belongs to the class of linear-feedback shift register pseudorandom number generators.
Note that the variant which returns an uniform distribution between 0 and 1,
static inline double prng_one(prng_state *p)
{
return prng_u64(p) / 18446744073709551616.0;
}
uses the high bits; the high 32 bits of the sequence pass all BigCrush tests in the TestU01 suite, so this is a surprisingly good (in both randomness and efficiency) generator for double-precision uniform random numbers, which is my typical use case.
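One caveat with dividing a full 64-bit value by 2^64: a double has a 53-bit significand, so inputs very close to 2^64 can round up during the conversion and give exactly 1.0. A common alternative, which is not part of the code above, keeps only the top 53 bits so that every result is exactly representable and strictly below 1. The function name u64_to_unit is an assumption for illustration.

```c
#include <stdint.h>

/* Map a 64-bit value to [0, 1) using only its top 53 bits.
 * 9007199254740992.0 is 2^53, so the largest result is (2^53 - 1) / 2^53. */
static double u64_to_unit(uint64_t x)
{
    return (double)(x >> 11) * (1.0 / 9007199254740992.0);
}
```

It would be called as u64_to_unit(prng_u64(&state)) in place of the plain division.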
The format above allows multiple independent generators in a single process, by specifying the generator state as a parameter. If the basic generator is implemented in a header file (thus the static inline; it is a preprocessor macro-like function), you can switch between generators by switching between header files, and recompiling the binary.
(You are usually better off by using a single generator, unless you use multiple threads in a pseudorandom number heavy simulator, in which case using a separate generator for each thread will help a lot; avoids cacheline ping-pong between threads competing for the generator state, in particular.)
The rand() function in most C standard library implementations is a linear-congruential generator. They often suffer from poor choices of the coefficients, and nowadays, also from the relative slowness of the modulo operator (when the modulus is not a power of two).
The most widely used pseudorandom number generator is the Mersenne Twister, by Makoto Matsumoto (松本 眞) and Takuji Nishimura (西村 拓士). It is a twisted generalized linear feedback shift register, and has quite a large state (about 2500 bytes) and very long period (2^19937-1).
When we talk of true random number generators, we usually mean a combination of a pseudorandom number generator (usually a cryptographically secure one), and a source of entropy; random bits with at least some degree of true physical randomness.
In Linux, Mac OS, and BSDs at least, the operating system kernel exposes a source of pseudorandom numbers (getentropy() in Linux and OpenBSD, getrandom() in Linux, /dev/urandom, /dev/arandom, /dev/random in many Unixes, and so on). Entropy is gathered from physical electronic sources, like internal processor latencies, physical interrupt line timings, (spinning disk) hard drive timings, possibly even keyboard and mice. Many motherboards and some processors even have hardware random number sources that can be used as sources for entropy (or even directly as "trusted randomness sources").
The exclusive-or operation (^ in C) is used to mix in randomness to the generator state. This works, because exclusive-or between a known bit and a random bit results in a random bit; XOR preserves randomness. When mixing entropy pools (with some degree of randomness in the bit states) using XOR, the result will have at least as much entropy as the sources had.
Note that this does not mean that you get "better" random numbers by mixing the output of two or more generators. The statistics of true randomness are hard for humans to grok (just look at how poor the common early rand() implementations were! HORRIBLE!). It is better to pick a generator (or a set of generators to switch between at compile time, or at run time) that passes the BigCrush tests, and ensure it has a good random initial state on every run. That way you leverage the work of many mathematicians and others who have worked on these things for decades, and can concentrate on the other stuff, what you yourself are good at.

The C code in the wikipedia article is somewhat misleading:
Here is a working example that uses both the 32 bit and the 64 bit versions:
#include <stdio.h>
#include <stdint.h>
/* The state word must be initialized to non-zero */
uint32_t xorshift32(uint32_t state[])
{
/* Algorithm "xor" from p. 4 of Marsaglia, "Xorshift RNGs" */
uint32_t x = state[0];
x ^= x << 13;
x ^= x >> 17;
x ^= x << 5;
state[0] = x;
return x;
}
uint64_t xorshift64(uint64_t state[])
{
uint64_t x = state[0];
x ^= x << 13;
x ^= x >> 7;
x ^= x << 17;
state[0] = x;
return x;
}
int main(void)
{
    uint32_t state[1] = {1234}; // "seed" (can be anything but 0)
    for (int i = 0; i < 50; i++)
    {
        printf("%u\n", (unsigned)xorshift32(state));
    }
    uint64_t state64[1] = { 1234 }; // "seed" (can be anything but 0)
    for (int i = 0; i < 50; i++)
    {
        // cast so the argument matches %llu on every platform
        printf("%llu\n", (unsigned long long)xorshift64(state64));
    }
}
The mathematical aspects are explained in the Wikipedia article and in its footnotes.
The rest is basic C language knowledge, ^ is the C bitwise XOR operator.
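On the calling question specifically: a parameter written as uint32_t state[static 1] is still a pointer parameter; the static 1 only promises that it points to at least one element. So the address of an ordinary variable can be passed as well. A minimal sketch, where the helper xorshift32_fill is a made-up name for illustration:

```c
#include <stdint.h>

/* Same algorithm as above; the state is taken as a plain pointer. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical helper: store n successive outputs for one seed in out[]. */
static void xorshift32_fill(uint32_t seed, uint32_t *out, int n)
{
    uint32_t state = seed;            /* must be nonzero */
    for (int i = 0; i < n; i++)
        out[i] = xorshift32(&state);  /* pass the address of the state */
}
```

With a seed of 1, the first value produced is 270369 (0x42021), which is handy as a quick sanity check of an implementation.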

Fast hashing of 32 bit values to between 0 and 254 inclusive

I'm looking for a fast way in C to hash 32-bit numbers more or less uniformly between 0 and 254. 255 is reserved for a special purpose.
As an added constraint, I'm looking for a method that would map well to being used with ISA-specific vector intrinsics or to a language like OpenCL or CUDA without introducing control flow divergence between the vector lanes/threads.
Ordinarily, I would just use the following code to hash the number between 0 and 255, as this is just a fast way of doing x mod 256.
inline uint8_t hash(uint32_t x){ return x & 255; }
I could just give in and use the following:
inline uint8_t hash(uint32_t x){ return x % 255; }
However, this solution seems unimaginative and unlikely to be the highest performing solution. I found code at this site (http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml#exmod15) that appears to provide a reasonable solution for scalar code and have inserted it here for your convenience.
uint32_t mod255( uint32_t a ) {
a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */
a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */
if (a < 255) return a;
if (a < (2 * 255)) return a - 255;
return a - (2 * 255);
}
I see two potential performance issues with this code:
The large number of if statements makes me question how easy it will be for a compiler or human :) to effectively vectorize the code without leading to control flow divergence within a warp/wavefront on a SIMT architecture or vectorized execution on a multicore CPU. If such divergence does occur, it will reduce parallel efficiency, as the divergent paths will have to be run in series.
It looks like it could be troublesome for a branch predictor (not applicable on common GPU architectures) as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Any recommendations on alternatives that I could use are most welcome. Alternatively, let me know if what I am asking for is unreasonable.

The "if statements on GPU kill performance" is a popular misconception which desperately wants to live on, it seems.
The large number of if statements makes me question how easy it will
be for a compiler or human :) to vectorize the code.
First of all I wouldn't consider 2 if statements a "large number of if statements", and those are so short and trivial that I'm willing to bet the compiler will turn them into branchless conditional moves or predicated instructions. There will be no performance penalty at all. (Do check the generated assembly, however).
It looks like it could be troublesome for a branch predictor as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Current GPUs do not have branch predictors. Note however that depending on the underlying hardware, operation on integers (and notably shifting) may be quite costly.
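If you would rather not depend on the compiler removing the branches, the final correction can be written without any if at all. The following is a sketch, not the code from the linked page: it adds a third folding step so the intermediate value is at most 256, then subtracts 255 through a mask. The extra fold also covers inputs such as 0xFFFFFF00, where two folds leave 765 and subtracting 2 * 255 would yield 255 instead of 0.

```c
#include <stdint.h>

/* Branch-free x mod 255, using 2^8 == 1 (mod 255) to sum base-256 digits. */
static uint32_t mod255_branchless(uint32_t a)
{
    a = (a >> 16) + (a & 0xFFFF);   /* at most 0x1FFFE */
    a = (a >> 8) + (a & 0xFF);      /* at most 765 */
    a = (a >> 8) + (a & 0xFF);      /* at most 256 */
    /* mask is all ones when a is 255 or 256, zero otherwise */
    a -= 255u & (0u - (uint32_t)(a >= 255u));
    return a;
}
```

Whether this beats the compiler's own expansion of x % 255 depends on the target, so it is worth benchmarking both.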

I would just do this:
uchar fast_mod255( uint a32 ) {
ushort a16 = (a32 >> 16) + (a32 & 0xFFFF); /* sum base 2**16 digits */
uchar a8 = (a16 >> 8) + (a16 & 0xFF); /* sum base 2**8 digits */
return (a8 % 255);
}
Another option is to just do:
uchar fast_mod255( uchar4 a ) {
return (dot(a) % 255); // or return (distance(a) % 255);
}
GPUs are very efficient at computing distances and dot products, even in 4 dimensions, and this is a valid way of hashing as well, with the overflowed values discarded.
No branching, and a clever compiler can even optimize it out. Or do you really need that values that fall in the 255 zone have a scattered pattern instead of 1?

I wanted to answer my own question because over the last 2 years I have seen ways to get around a slow integer divide instruction. The easiest way is to make the integer a compile-time constant. Any decent modern compiler should replace the integer divide with an equivalent set of other instructions with typically higher throughput (how many such instructions can be retired per cycle) and reduced latency (how many cycles it takes the instruction to execute). If you're curious, check out Hacker's Delight (an excellent book on low-level computer arithmetic).
I wanted to share another finding, which I found on Daniel Lemire's blog (located here). The code that follows doesn't compute mod 255 but does something similar, which is equally useful in a number of applications and much faster.
Suppose that you have a set of numbers S that are uniformly randomly picked from the range 0 to 2^k - 1 inclusive, where k >= 0. In this case, if you care only about mapping numbers roughly uniformly from 0 to 254 inclusive, you may do the following:
For each number n in a set S, you may map n to one of the 255 candidate values by multiplying n by 255 and then shifting the result to the right by k bits.
Here is the function that you call on each n for a fixed value of k:
int map_to_0_to_254(int n, int k){
return (n * 255) >> k;
}
As an example, if the values for the argument n range uniformly randomly from 0 to 4095 (2^12 - 1),
then map_to_0_to_254(n, 12) will return a value in the range 0 to 254 inclusive.
Here is a more general templated version in C++ for mapping to range from 0 to range_size - 1 inclusive:
template<typename T>
T map_to_0_to_range_size_minus_1(T n, T range_size, T k){
return (n * range_size) >> k;
}
REMEMBER that this code assumes that the inputs for n are roughly uniformly randomly distributed between 0 and 2^k - 1 inclusive. If that property holds, then the outputs will be roughly uniformly distributed between 0 and range_size - 1 inclusive. The larger 2^k is relative to range_size, the more uniform the mapping will be for a fixed set of inputs.
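For the original question, where the inputs are full 32-bit values, k is 32 and the product needs 64 bits. A minimal C sketch of that case (the function name map_to_range is an assumption):

```c
#include <stdint.h>

/* Map x, assumed roughly uniform over [0, 2^32), onto [0, range_size).
 * The 64-bit product avoids overflow; the shift keeps its high 32 bits. */
static uint32_t map_to_range(uint32_t x, uint32_t range_size)
{
    return (uint32_t)(((uint64_t)x * range_size) >> 32);
}
```

With range_size set to 255 the result is always in 0 to 254, so the reserved value 255 is never produced. Note that this uses the high bits of x, so inputs that differ only in their low bits tend to land in the same bin.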
Why This is Useful
This approach has applications to computing hash functions for hash tables where the number of bins is not a power of 2. Those operations would ordinarily require a long-latency integer divide instruction, which is often an order of magnitude slower to execute than an integer multiply, because you often do not know the number of bins in the hash table at compile time.

Deterministic bit scrambling to filter coordinates

I am trying to write a function that, given an (x,y) coordinate pair and the random seed of the program, will pseudo-randomly return true for some preset percentage of all such pairs. There are no limits on x or y beyond the restrictions of the data type, which is a 32-bit signed int.
My current approach is to scramble the bits of x, y, and the seed together and then compare the resulting number to the percentage:
float percentage = 0.005;
...
unsigned int n = (x ^ y) ^ seed;
return (((float) n / UINT_MAX) < percentage);
However, it seems that this approach would be biased for certain values of x and y. For example, if it returns true for (0,a), it will also return true for (a,0).
I know this implementation that just XORs them together is naive. Is there a better bit-scrambling algorithm to use here that will not be biased?
Edit: To clarify, I am not starting with a set of (x,y) coordinates, nor am I trying to get a fixed-size set of coordinates that evaluate to true. The function should be able to evaluate a truth value for arbitrary x, y, and seed, with the percentage controlling the average frequency of "true" coordinates.

The easy solution is to use a good hashing algorithm. You can do the range check on the value of hash(seed || x || y).
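As a rough illustration of that range check, here is a sketch in C. The mixing function is an assumption (the SplitMix64 finalizer, used only as a stand-in for whichever hash you pick), and the names hash_coords and coord_selected are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in 64-bit mixer (SplitMix64 finalizer); any good hash would do. */
static uint64_t mix64(uint64_t z)
{
    z ^= z >> 30; z *= UINT64_C(0xbf58476d1ce4e5b9);
    z ^= z >> 27; z *= UINT64_C(0x94d049bb133111eb);
    z ^= z >> 31;
    return z;
}

/* Hash seed || x || y: x and y occupy separate halves of one word,
 * so (0, a) and (a, 0) are different inputs. */
static uint64_t hash_coords(uint64_t seed, int32_t x, int32_t y)
{
    uint64_t xy = ((uint64_t)(uint32_t)x << 32) | (uint32_t)y;
    return mix64(mix64(seed) ^ xy);
}

/* True for roughly `fraction` of all pairs; fraction is in [0, 1]. */
static bool coord_selected(uint64_t seed, int32_t x, int32_t y, double fraction)
{
    uint64_t threshold = (uint64_t)(fraction * 4294967296.0); /* fraction * 2^32 */
    return (hash_coords(seed, x, y) >> 32) < threshold;
}
```

Comparing integers against a precomputed threshold avoids the float division in the original snippet; the result for a given (seed, x, y) never changes between calls.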
Of course, selecting points individually with percentage p does not guarantee that you will end up with a sample whose size will be exactly p * N. (That's the expected size of the sample, but any given sample will deviate a bit.) If you want to get a sample of size precisely k from a universe of N objects, you can use the following simple algorithm:
Examine the elements in the universe one at a time until k reaches 0.
When examining element i, add it to the sample if its hash value mapped onto the range [0, N-i) is less than k. If you add the element to the sample, decrement k.
There's no way to get the arithmetic absolutely perfect (since there is no way to perfectly partition 2^i different hash values into n buckets unless n is a power of 2), so there will always be a tiny bias. (Floating point arithmetic does not help; the number of possible floating point values is also fixed, and suffers from the same bias.)
If you do 64-bit arithmetic, the bias will be truly tiny, but the arithmetic is more complicated unless your environment provides a 128-bit multiply. So you might feel satisfied with 32-bit computations, where the bias of one in a couple of thousand million [Note 1] doesn't matter. Here, you can use the fact that any 32 bits in your hash should be as unbiased as any other 32 bits, assuming your hash algorithm is any good (see below). So the following check should work fine:
// I need k elements from a remaining universe of n, and I have a 64-bit hash.
// Return true if I should select this element
bool select(uint32_t n, uint32_t k, uint64_t hash) {
return ((hash & (uint32_t)(-1)) * (uint64_t)n) >> 32 < k;
}
// Untested example sampler
// select exactly k elements from U, using a seed value
// (hash_function is a placeholder for whichever 64-bit hash you choose)
template <typename E>
std::vector<E> sample(const std::vector<E>& U, uint64_t seed, uint32_t k) {
  std::vector<E> retval;
  for (uint32_t n = U.size(); k && n; --n) {
    const E& elt = U[n - 1];
    // n counts the elements not yet examined, including this one
    if (select(n, k, hash_function(seed, elt))) {
      retval.push_back(elt);
      --k;
    }
  }
  return retval;
}
Assuming you need to do this a lot, you'll want to use a fast hash algorithm; since you're not actually working in a secure environment, you don't need to worry about whether the algorithm is cryptographically secure.
Many high-speed hashing algorithms work on 64-bit units, so you could maximize the speed by constructing a 128-bit input consisting of a 64-bit seed and the two 32-bit co-ordinates. You can then unroll the hash loop to do exactly two blocks.
I won't venture a guess at the best hash function for your purpose. You might want to check out one or more of these open-source hashing functions:
Farmhash https://code.google.com/p/farmhash/
Murmurhash https://code.google.com/p/smhasher/
xxhash https://code.google.com/p/xxhash/
siphash https://github.com/majek/csiphash/
... and many more.
Notes
[Note 1] A couple of billion, if you're on that side of the Atlantic.

I would prefer feeding seed, x, and y through a Combined Linear Congruential Generator.
This is generally much faster than hashing, and it is designed specifically for the purpose: To output a pseudo-random number uniformly in a certain range.
Using coefficients recommended by Wichmann-Hill (which are also used in some versions of Microsoft Excel) we can do:
si = 171 * s % 30269;
xi = 172 * x % 30307;
yi = 170 * y % 30323;
r_combined = fmod(si/30269. + xi/30307. + yi/30323., 1.);
return r_combined < percentage;
Where s is the seed on the first call, and the previous si on each subsequent call. (Thanks to rici's comment for this point.)

Looking for decent-quality PRNG with only 32 bits of state

I'm trying to implement a tolerable-quality version of the rand_r interface, which has the unfortunate interface requirement that its entire state is stored in a single object of type unsigned, which for my purposes means exactly 32 bits. In addition, I need its output range to be [0,2³¹-1]. The standard solution is using a LCG and dropping the low bit (which has the shortest period), but this still leaves very poor periods for the next few bits.
My initial thought was to use two or three iterations of the LCG to generate the high/low or high/mid/low bits of the output. However, such an approach does not preserve the non-biased distribution; rather than each output value having equal frequency, many occur multiple times, and some never occur at all.
Since there are only 32 bits of state, the period of the PRNG is bounded by 2³², and in order to be non-biased, the PRNG must output each value exactly twice if it has full period or exactly once if it has period 2³¹. Shorter periods cannot be non-biased.
Is there any good known PRNG algorithm that meets these criteria?

One good (but probably not the fastest) possibility, offering very high quality, would be to use a 32-bit block cipher in CTR mode. Basically, your RNG state would simply be a 32-bit counter that gets incremented by one for each RNG call, and the output would be the encryption of that counter value using the block cipher with some arbitrarily chosen fixed key. For extra randomness, you could even provide a (non-standard) function to let the user set a custom key.
There aren't a lot of 32-bit block ciphers in common use, since such a short block size introduces problems for cryptographic use. (Basically, the birthday paradox lets you distinguish the output of such a cipher from a random function with a non-negligible probability after only about 2^16 = 65536 outputs, and after 2^32 outputs the non-randomness obviously becomes certain.) However, some ciphers with an adjustable block size, such as XXTEA or HPC, will let you go down to 32 bits, and should be suitable for your purposes.
(Edit: My bad, XXTEA only goes down to 64 bits. However, as suggested by CodesInChaos in the comments, Skip32 might be another option. Or you could build your own 32-bit Feistel cipher.)
The CTR mode construction guarantees that the RNG will have a full period of 2^32 outputs, while the standard security claim of (non-broken) block ciphers is essentially that it is not computationally feasible to distinguish their output from a random permutation of the set of 32-bit integers. (Of course, as noted above, such a permutation is still easily distinguished from a random function taking 32-bit values.)
Using CTR mode also provides some extra features you may find convenient (even if they're not part of the official API you're developing against), such as the ability to quickly seek into any point in the RNG output stream just by adding or subtracting from the state.
On the other hand, you probably don't want to follow the common practice of seeding the RNG by just setting the internal state to the seed value, since that would cause the output streams generated from nearby seeds to be highly similar (basically just the same stream shifted by the difference of the seeds). One way to avoid this issue would be to add an extra encryption step to the seeding process, i.e. to encrypt the seed with the cipher and set the internal counter value equal to the result.
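To make the "build your own 32-bit Feistel cipher" idea concrete, here is a toy sketch. Everything in it is an assumption for illustration: the four round keys, the round function, and the number of rounds are arbitrary and have not been analysed for statistical or cryptographic quality. What the Feistel structure does guarantee is that the mapping is a permutation of the 32-bit integers, whatever the round function is.

```c
#include <stdint.h>

/* Arbitrary illustrative round keys; a real design would derive them from a key. */
static const uint32_t round_keys[4] = {
    0xA511E9B3u, 0x3C6EF372u, 0x9E3779B9u, 0x7F4A7C15u
};

/* Arbitrary round function: it does not need to be invertible. */
static uint16_t round_fn(uint16_t half, uint32_t key)
{
    uint32_t v = ((uint32_t)half ^ key) * 0x9E3779B1u;
    return (uint16_t)(v >> 16);
}

static uint32_t feistel32_encrypt(uint32_t x)
{
    uint16_t l = (uint16_t)(x >> 16), r = (uint16_t)x;
    for (int i = 0; i < 4; i++) {
        uint16_t t = r;
        r = (uint16_t)(l ^ round_fn(r, round_keys[i]));
        l = t;
    }
    return ((uint32_t)l << 16) | r;
}

/* Runs the rounds backwards; exists here to show the mapping is reversible. */
static uint32_t feistel32_decrypt(uint32_t y)
{
    uint16_t l = (uint16_t)(y >> 16), r = (uint16_t)y;
    for (int i = 3; i >= 0; i--) {
        uint16_t t = l;
        l = (uint16_t)(r ^ round_fn(l, round_keys[i]));
        r = t;
    }
    return ((uint32_t)l << 16) | r;
}

/* rand_r-style wrapper: the state is just a counter, output is 31 bits. */
static int ctr_rand_r(unsigned *state)
{
    uint32_t out = feistel32_encrypt((uint32_t)(*state)++);
    return (int)(out >> 1);
}
```

Because the 32-bit outputs form a permutation, dropping one bit means each 31-bit value appears exactly twice over the full period, which matches the non-bias requirement in the question.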

A 32-bit maximal-period Galois LFSR might work for you. Try:
r = (r >> 1) ^ (-(r & 1) & 0x80200003);
The one problem with LFSRs is that you can't produce the value 0. So this one has a range of 1 to 2^32-1. You may want to tweak the output or else stick with a good LCG.
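Wrapped as a function, the step and the zero lock-up look like this. The sketch only restates the one-liner above; the function name lfsr32_step is an assumption, and the mask is taken from the answer without independently verifying that it gives a maximal period.

```c
#include <stdint.h>

/* One Galois LFSR step: shift right, and XOR in the tap mask
 * when the bit shifted out was 1. */
static uint32_t lfsr32_step(uint32_t r)
{
    return (r >> 1) ^ ((0u - (r & 1u)) & 0x80200003u);
}
```

A state of 0 maps to 0 forever, and no nonzero state ever reaches 0, so the seed must be nonzero. Writing the negation as 0u - (r & 1u) keeps the arithmetic unsigned.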

Besides using a Lehmer MCG, there's a couple you could use:
32-bit variants of Xorshift have a guaranteed period of 2^32-1 using a 32-bit state:
uint32_t state;
uint32_t xorshift32(void) {
state ^= state << 13;
state ^= state >> 17;
state ^= state << 5;
return state;
}
That's the original 32-bit recommendation from 2003 (see paper). Depending on your definition of "decent quality", that should be fine. However it fails the binary rank tests of Diehard, and 5/10 tests of SmallCrush.
Alternate version with better mixing and constants (passes SmallCrush and Crush):
uint32_t xorshift32amx(void) {
int s = __builtin_bswap32(state * 1597334677);
state ^= state << 13;
state ^= state >> 17;
state ^= state << 5;
return state + s;
}
Based on research here and here.
There's also Mulberry32, which has a period of exactly 2^32:
uint32_t mulberry32(void) {
uint32_t z = state += 0x6D2B79F5;
z = (z ^ z >> 15) * (1 | z);
z ^= z + (z ^ z >> 7) * (61 | z);
return z ^ z >> 14;
}
This is probably your best option. It's quite good. Author states "It passes gjrand's 13 tests with no failures and a total P-value
of 0.984 (where 1 is perfect and 0.1 or less is a failure) on 4GB of
generated data. That's a quarter of the full period". It appears to be an improvement over SplitMix32.
"SplitMix32", adopted from xxHash/MurmurHash3 (Weyl sequence):
uint32_t splitmix32(void) {
uint32_t z = state += 0x9e3779b9;
z ^= z >> 15; // 16 for murmur3
z *= 0x85ebca6b;
z ^= z >> 13;
z *= 0xc2b2ae35;
return z ^= z >> 16;
}
The quality might be questionable here, but its 64-bit big brother has a lot of fans (passes BigCrush). So the general structure is worth looking at.

Elaborating on my comment...
A block cipher in counter mode gives a generator in approximately the following form (except using much bigger data types):
uint32_t state = 0;
uint32_t rand()
{
state = next(state);
return temper(state);
}
Since cryptographic security hasn't been specified (and in 32 bits it would be more or less futile), a simpler, ad-hoc tempering function should do the trick.
One approach is where the next() function is simple (eg., return state + 1;) and temper() compensates by being complex (as in the block cipher).
A more balanced approach is to implement LCG in next(), since we know that it also visits all possible states but in a random(ish) order, and to find an implementation of temper() which does just enough work to cover the remaining problems with LCG.
Mersenne Twister includes such a tempering function on its output. That might be suitable. Also, this question asks for operations which fulfill the requirement.
I have a favourite, which is to bit-reverse the word, and then multiply it by some constant (odd) number. That may be overly complex if bit-reverse isn't a native operation on your architecture.
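As a rough sketch of that last idea (bit-reverse, then multiply by an odd constant), the following uses a plain counter for next() and shows that the tempering step can be undone, which is what the receiving side would need. The names bit_reverse32, temper, untemper, inverse_odd32, next_unique and the constant TEMPER_MULT are all hypothetical; any odd multiplier would work, and no claim is made about statistical quality.

```c
#include <stdint.h>

/* Assumption: any odd constant works; this particular value is arbitrary. */
#define TEMPER_MULT 0x9E3779B1u

/* Reverse the bit order of a 32-bit word without relying on intrinsics. */
uint32_t bit_reverse32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Both steps are bijections on 32-bit words, so distinct inputs
   always give distinct outputs. */
uint32_t temper(uint32_t x)
{
    return bit_reverse32(x) * TEMPER_MULT;
}

/* Multiplicative inverse of an odd number modulo 2^32 (Newton iteration).
   For odd a, a*a == 1 (mod 8), so a itself is correct to 3 bits;
   each step doubles the number of correct bits: 3, 6, 12, 24, 48. */
uint32_t inverse_odd32(uint32_t a)
{
    uint32_t inv = a;
    for (int i = 0; i < 4; i++)
        inv *= 2u - a * inv;
    return inv;
}

/* Undo temper(): multiply by the inverse constant, then reverse the bits again. */
uint32_t untemper(uint32_t y)
{
    return bit_reverse32(y * inverse_odd32(TEMPER_MULT));
}

/* next() is a plain counter here, so no value repeats within 2^32 calls. */
uint32_t next_unique(uint32_t *counter)
{
    return temper((*counter)++);
}
```

Because the counter visits every 32-bit value once and temper() is invertible, the receiver can recover the counter with untemper(). Swapping the counter for a full-period LCG, as suggested above, would keep that property.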

How to map a long integer number to a N-dimensional vector of smaller integers (and fast inverse)?

Given an N-dimensional vector of small integers, is there any simple way to map it one-to-one to a large integer?
Say, we have N=3 vector space. Can we represent a vector X=[(int16)x1,(int16)x2,(int16)x3] using an integer (int48)y? The obvious answer is "Yes, we can". But the question is: "What is the fastest way to do this and its inverse operation?"
Will this new 1-dimensional space possess some very special useful properties?

For the above example you have 3 * 32 = 96 bits of information, so without any a priori knowledge you need 96 bits for the equivalent long integer.
However, if you know that your x1, x2, x3, values will always fit within, say, 16 bits each, then you can pack them all into a 48 bit integer.
In either case the technique is very simple: you just use shift, mask and bitwise-or operations to pack/unpack the values.

Just to make this concrete, if you have a 3-dimensional vector of 8-bit numbers, like this:
uint8_t vector[3] = { 1, 2, 3 };
then you can join them into a single (24-bit number) like so:
uint32_t all = (vector[0] << 16) | (vector[1] << 8) | vector[2];
This number would, if printed using this statement:
printf("the vector was packed into %06x", (unsigned int) all);
produce the output
the vector was packed into 010203
The reverse operation would look like this:
uint8_t v2[3];
v2[0] = (all >> 16) & 0xff;
v2[1] = (all >> 8) & 0xff;
v2[2] = all & 0xff;
Of course this all depends on the size of the individual numbers in the vector and the length of the vector together not exceeding the size of an available integer type, otherwise you can't represent the "packed" vector as a single number.
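For the case in the question (three 16-bit values in 48 bits), the same idea might look like the sketch below. It assumes a uint64_t is used to hold the 48-bit result; the function names pack3x16 and unpack3x16 are made up for illustration.

```c
#include <stdint.h>

/* Pack three 16-bit values into the low 48 bits of a 64-bit integer.
   The casts widen each value before shifting so no bits are lost. */
uint64_t pack3x16(uint16_t x1, uint16_t x2, uint16_t x3)
{
    return ((uint64_t)x1 << 32) | ((uint64_t)x2 << 16) | (uint64_t)x3;
}

/* Inverse of pack3x16: the conversion to uint16_t keeps only the low 16 bits. */
void unpack3x16(uint64_t y, uint16_t out[3])
{
    out[0] = (uint16_t)(y >> 32);
    out[1] = (uint16_t)(y >> 16);
    out[2] = (uint16_t)y;
}
```

With this layout, pack3x16(1, 2, 3) gives 0x000100020003, mirroring the 010203 output of the 8-bit example.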

If you have sets Si, i=1..n of size Ci = |Si|, then the cartesian product set S = S1 x S2 x ... x Sn has size C = C1 * C2 * ... * Cn.
This motivates an obvious way to do the packing one-to-one. If you have elements e1,...,en from each set, each in the range 0 to Ci-1, then you give the element e=(e1,...,en) the value e1+C1*(e2 + C2*(e3 + C3*(...Cn*en...))).
You can do any permutation of this packing if you feel like it, but unless the values are perfectly correlated, the size of the full set must be the product of the sizes of the component sets.
In the particular case of three 32 bit integers, if they can take on any value, you should treat them as one 96 bit integer.
If you particularly want to, you can map small values to small values through any number of means (e.g. filling out spheres with the L1 norm), but you have to specify what properties you want to have.
(For example, one can map (n,m) to (max(n,m)-1)^2 + k where k=n if n<=m and k=n+m if n>m--you can draw this as a picture of filling in a square like so:
1 2 5 | draw along the edge of the square this way
4 3 6 v
8 7
if you start counting from 1 and only worry about positive values; for integers, you can spiral around the origin.)

I'm writing this without having time to check details, but I suspect the best way is to represent your long integer via modular arithmetic, using k different integers which are mutually prime. The original integer can then be reconstructed using the Chinese remainder theorem. Sorry this is a bit sketchy, but hope it helps.
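A small sketch of what that could look like, assuming small pairwise coprime moduli whose product fits comfortably in 64 bits. The function names to_residues, from_residues and mod_inverse are hypothetical, and the brute-force inverse is only reasonable for small moduli.

```c
#include <stddef.h>
#include <stdint.h>

/* Modular inverse of a modulo m by brute-force search; fine for small m.
   Returns 0 if no inverse exists (a and m not coprime). */
static uint64_t mod_inverse(uint64_t a, uint64_t m)
{
    a %= m;
    for (uint64_t t = 1; t < m; t++)
        if ((a * t) % m == 1)
            return t;
    return 0;
}

/* Long integer -> vector of residues, one per modulus. */
void to_residues(uint64_t x, const uint32_t *mod, uint32_t *r, size_t k)
{
    for (size_t i = 0; i < k; i++)
        r[i] = (uint32_t)(x % mod[i]);
}

/* Vector of residues -> long integer (Chinese remainder theorem,
   built up one modulus at a time). Valid for x below the product
   of the moduli, which must be pairwise coprime. */
uint64_t from_residues(const uint32_t *r, const uint32_t *mod, size_t k)
{
    uint64_t x = 0, m = 1; /* invariant: x is the solution modulo m */
    for (size_t i = 0; i < k; i++) {
        /* Find t such that x + m*t == r[i] (mod mod[i]). */
        uint64_t diff = ((uint64_t)r[i] + mod[i] - x % mod[i]) % mod[i];
        uint64_t t = (diff * mod_inverse(m % mod[i], mod[i])) % mod[i];
        x += m * t;
        m *= mod[i];
    }
    return x;
}
```

For example, with moduli 7, 11 and 13 every integer below 1001 maps to a distinct residue triple, and 500 maps to (3, 5, 6). Note that the reverse direction is noticeably more expensive than shifting and masking.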

To expand on Rex Kerr's generalised form, in C you can pack the numbers like so:
X = e[n];
X *= MAX_E[n-1] + 1;
X += e[n-1];
/* ... */
X *= MAX_E[0] + 1;
X += e[0];
And unpack them with:
e[0] = X % (MAX_E[0] + 1);
X /= (MAX_E[0] + 1);
e[1] = X % (MAX_E[1] + 1);
X /= (MAX_E[1] + 1);
/* ... */
e[n] = X;
(Where MAX_E[n] is the greatest value that e[n] can have). Note that these maximum values are likely to be constants, and may be the same for every e, which will simplify things a little.
The shifting / masking implementations given in the other answers are a special case of this, for when the MAX_E + 1 values are powers of 2 (and thus the multiplication and division can be done with a shift, the addition with a bitwise-or and the modulus with a bitwise-and).
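The unrolled pack and unpack steps above can also be written as loops. This is only a sketch: the names mixed_radix_pack and mixed_radix_unpack are invented here, radix[i] stands for MAX_E[i] + 1, and it assumes the product of all radices fits in a uint64_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack n digits e[0..n-1], where 0 <= e[i] < radix[i], into one integer.
   e[0] ends up as the least significant digit. */
uint64_t mixed_radix_pack(const uint32_t *e, const uint32_t *radix, size_t n)
{
    uint64_t x = 0;
    for (size_t i = n; i-- > 0; )
        x = x * radix[i] + e[i];
    return x;
}

/* Inverse of mixed_radix_pack: peel the digits off from least significant up. */
void mixed_radix_unpack(uint64_t x, uint32_t *e, const uint32_t *radix, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        e[i] = (uint32_t)(x % radix[i]);
        x /= radix[i];
    }
}
```

For instance, digits {3, 1, 4} with radices {10, 2, 5} pack to 3 + 10 * (1 + 2 * 4) = 93.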

There are some totally non-portable ways to make this really fast using packed unions and direct memory access. That you really need this kind of speed is suspicious. Methods using shifts and masks should be fast enough for most purposes. If not, consider using specialized processors such as GPUs, for which vector support is optimized (parallel).
This naive storage does not possess any useful property that I can foresee, except that you can perform some computations (add, sub, bitwise logical operators) on the three coordinates at once, as long as you use positive integers only and don't overflow for add and sub.
You'd better be quite sure you won't overflow (or go negative for sub), or the vector will become garbage.

#include <stdint.h> // for uint8_t
long x;
uint8_t *p = (uint8_t *)&x; // explicit cast required; byte order depends on the host
or
union X {
long L;
uint8_t A[sizeof(long)/sizeof(uint8_t)];
};
works if you don't care about endianness. In my experience compilers generate better code with the union because it doesn't set off their "you took the address of this, so I must keep it in RAM" rules as quickly. These rules will get set off if you try to index the array with stuff that the compiler can't optimize away.
If you do care about endianness then you need to mask and shift.
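A minimal sketch of the mask-and-shift alternative, which gives the same answer on any host because it works on the value rather than on its memory layout. The helper names byte_of and store_be32 are hypothetical.

```c
#include <stdint.h>

/* Byte i of x, counting from the least significant byte (i = 0).
   The conversion to uint8_t keeps only the low 8 bits. */
uint8_t byte_of(uint64_t x, unsigned i)
{
    return (uint8_t)(x >> (8u * i));
}

/* Write x into out[] most significant byte first (big-endian order),
   independent of the byte order of the machine running the code. */
void store_be32(uint32_t x, uint8_t out[4])
{
    out[0] = (uint8_t)(x >> 24);
    out[1] = (uint8_t)(x >> 16);
    out[2] = (uint8_t)(x >> 8);
    out[3] = (uint8_t)x;
}
```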

I think what you want can be solved using multi-dimensional space filling curves. The link gives a lot of references on this, which in turn give different methods and insights. Here's a specific example of an invertible mapping. It works for any dimension N.
As for useful properties, these mappings are related to Gray codes.
Hard to say whether this was what you were looking for, or whether the "pack 3 16-bit ints into a 48-bit int" does the trick for you.
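One of the simpler mappings in this family is bit interleaving (a Morton or Z-order code). It is not necessarily the example the links above refer to, and it preserves locality less well than a Hilbert curve, but it is easy to invert. The sketch below handles N = 3 with 16-bit coordinates; the names morton3_encode and morton3_decode are made up for illustration.

```c
#include <stdint.h>

/* Interleave the bits of x, y and z: bit i of x goes to bit 3*i,
   bit i of y to bit 3*i + 1, bit i of z to bit 3*i + 2.
   The result occupies the low 48 bits. */
uint64_t morton3_encode(uint16_t x, uint16_t y, uint16_t z)
{
    uint64_t code = 0;
    for (unsigned i = 0; i < 16; i++) {
        code |= (uint64_t)((x >> i) & 1u) << (3 * i);
        code |= (uint64_t)((y >> i) & 1u) << (3 * i + 1);
        code |= (uint64_t)((z >> i) & 1u) << (3 * i + 2);
    }
    return code;
}

/* Inverse of morton3_encode: pull every third bit back out. */
void morton3_decode(uint64_t code, uint16_t *x, uint16_t *y, uint16_t *z)
{
    uint16_t rx = 0, ry = 0, rz = 0;
    for (unsigned i = 0; i < 16; i++) {
        rx |= (uint16_t)(((code >> (3 * i)) & 1u) << i);
        ry |= (uint16_t)(((code >> (3 * i + 1)) & 1u) << i);
        rz |= (uint16_t)(((code >> (3 * i + 2)) & 1u) << i);
    }
    *x = rx;
    *y = ry;
    *z = rz;
}
```

Unlike plain concatenation, points that are close in all three coordinates tend to get nearby codes, which is the kind of "special property" the question asks about.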
