mapping from one set to another - c

I have one set of consecutive integer values and a corresponding set of non-consecutive values, for example:
0 -> 22
1 -> 712
2 -> 53
3 -> 12323
...
and so on.
The number of items is very large (about 10^9 to 10^10), so using a plain array is not an option.
Is there a data structure capable of fast mapping from the first set of values to the second with moderate memory requirements? For example:
ret = map(0); // returns 22
ret = map(3); // returns 12323
Edit: the values in this set really are generated using a pseudo-random number generator, so it is not possible to assume any specific distribution. The question is: is it possible to lower the memory requirements (perhaps at the price of lookup speed)? I mean something like "perfect hashing" - the time required to generate such a "perfect hash" doesn't matter.

As your key range is contiguous, the obvious solution is to store your values in a contiguous int[]. Then value i is arr[i]. Since the values are generated by a PRNG, it will be difficult to apply further compression.
Another solution, which trades time for space, is to store the seed of your RNG and recalculate values on the fly. This approach can be made faster, at the cost of some space, by storing intermediate seeds, i.e. the seed for keys 1000, 2000, etc.

You may be able to save some space by using exactly the number of bits required by each value. For example if your values are only 24 bits, you can save a byte over 32-bit integers. That said, there is only so much memory you can save.
On 64-bit machines it would be feasible to mmap() a file to a memory address, thus getting around the physical memory limit by using disk storage, at the price of performance.
But since you mentioned using a pseudo-random generator to generate the values, how about just storing the RNG seed for specific indexes and calculating the rest of the values as needed? For example you could store the seed for indexes 0, 100, 200, ... and calculate e.g. 102 by re-seeding the RNG for 100 and calling the generator function three times.
Such an approach would reduce the memory needed by a large factor (100 in this case) and you could lessen the performance cost by bunching or caching your queries.
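For illustration, here is a minimal sketch of this checkpointing idea in C. The generator is a stand-in 64-bit LCG, and the names next_random, build_checkpoints, map and the interval/size constants are placeholders rather than anything from the original setup; the item count is scaled down so the checkpoint table stays small.
#include <stdint.h>

/* Stand-in PRNG step: advances *state and returns the next value.
   Substitute whatever generator actually produced your data. */
static uint32_t next_random(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (uint32_t)(*state >> 32);
}

#define CHECKPOINT_INTERVAL 100ULL
#define NUM_ITEMS           1000000ULL  /* scaled down for the sketch */
#define NUM_CHECKPOINTS     ((NUM_ITEMS + CHECKPOINT_INTERVAL - 1) / CHECKPOINT_INTERVAL)

static uint64_t checkpoints[NUM_CHECKPOINTS];

/* Done once: record the generator state before each block of values. */
static void build_checkpoints(uint64_t seed)
{
    uint64_t state = seed;
    for (uint64_t i = 0; i < NUM_ITEMS; i++) {
        if (i % CHECKPOINT_INTERVAL == 0)
            checkpoints[i / CHECKPOINT_INTERVAL] = state;
        (void)next_random(&state);
    }
}

/* Look up value i by replaying the generator from the nearest checkpoint. */
static uint32_t map(uint64_t i)
{
    uint64_t state = checkpoints[i / CHECKPOINT_INTERVAL];
    uint32_t value = 0;
    for (uint64_t k = 0; k <= i % CHECKPOINT_INTERVAL; k++)
        value = next_random(&state);
    return value;
}
Memory drops by a factor of CHECKPOINT_INTERVAL, and each lookup costs at most that many generator steps; that interval is the time/space knob to turn.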

If the range of your function is the set of numbers generated by a pseudo-random number generator in sequence, then you can compress the series down to, well, the code which generates the sequence plus the state of the PRNG before starting. For example, the (infinite) series of digits comprising the decimal expansion of pi is easily (and, technically, infinitely) compressed to the code that generates that series; your series could be seen as an example of something almost identical.
So, if you are willing to wait for a long time to get the last elements in the series, you can get very good compression, by writing your series not into a data structure but out of a function. That is at one end of your time/space trade-off spectrum.
At the other end of the spectrum is an array of all the numbers; this uses lots of space but gives very quick (O(1)) access to any desired element in the set. This doesn't seem to appeal to you for a variety of reasons, but I'm not sure that a cleverer data structure than an array will offer much space saving, or, for that matter, time saving.
The one obvious solution I see is to save a set of intermediate states of the PRNG at intervals, so your 'data' structure would become:
ret(0) = prng(seed, other_parameters, ...)
ret(10^5-1) = prng(seed', other_parameters, ...)
ret(2*(10^5)-1) = prng(seed'', other_parameters, ...)
etc. Then, to get element 9765, say, you read (the state of the PRNG at) ret(0) and generate the 9765th pseudo-random number thereafter.

Ok, so the intent is to trade speed for less memory usage.
Imagine that you have some sort of loop that fills the array.
int array[intendedArraySize];
seed = 3;
for (size_t z = 0; z < intendedArraySize; z++)
{
    array[z] = some_int_psn_generator(seed);
}
After which you can display the values.
for (size_t z = 0; z < intendedArraySize; z++)
{
    std::cout << z << " " << array[z] << std::endl;
}
If that is indeed the case, consider discarding the array altogether, by simply recalculating the value each time.
for (size_t z = 0; z < intendedArraySize; z++)
{
    std::cout << z << " " << some_int_psn_generator(z) << std::endl;
}

Related

Fast C random boolean generator

I'm interested in generating fast random booleans (or equivalently a Bernoulli(0.5) random variable) in C. Of course if one has a fast random generator with a decent statistical behaviour the problem "sample a random Bernoulli(0.5)" is easily solved: sample x uniformly in (0,1) and return 1 if x<0.5, 0 otherwise.
Suppose speed is the most important thing, now I have two questions/considerations:
Many random double generators first generate an integer m uniformly in a certain range [0,M] and then simply return the division m/M. Wouldn't it be faster just to check whether m < M/2 (here M/2 is fixed, so we are saving one division)?
Is there any faster way to do it? At the end, we're asking for way less statistical properties here: we're maybe still interested in a long period but, for example, we don't care about the uniformity of the distribution (as long as roughly 50% of the values are in the first half of the range).
Extracting, say, the last bit of a random number can wreak havoc, as linear congruential generators can alternate between odd and even numbers¹. A scheme like clock() & 1 would also have ghastly correlation planes.
Consider a solution based on the quick and dirty generator of Donald Knuth: for uint32_t I, the sequence
I = 1664525 * I + 1013904223;
and 2 * I < I is the conditional yielding the Boolean drawing. Here I'm relying on the wrap-around behaviour of I which should occur half the time, and a potentially expensive division is avoided.
Testing I <= 0x7FFFFFFF is less flashy and might be faster still, but the hardcoding of the midpoint is not entirely satisfactory.
¹ The generator I present here does.
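Wrapped up as a compilable function, the idea might look like this (the function name and the use of a file-scope I are my own choices for the sketch):
#include <stdbool.h>
#include <stdint.h>

static uint32_t I = 1;  /* generator state; seed however you like */

/* One step of Knuth's quick-and-dirty LCG. 2 * I wraps around
   (i.e. 2 * I < I holds) exactly when the high bit of I is set,
   which happens half the time, so that comparison is the Boolean draw. */
bool rand_bool_knuth(void)
{
    I = 1664525u * I + 1013904223u;
    return 2u * I < I;
}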
I'm interested in generating fast random booleans
Using an LCG can be fast, yet since the OP needs only a bool result, consider extracting only 1 bit at a time from a reasonable generator and saving the rest for later. @Akshay L Aradhya
Example based on @R.. and @R.. code.
extern uint32_t lcg64_temper(uint64_t *seed); // see R.. code
static uint64_t gseed; // Initialize this in some fashion.
static unsigned gcount = 0;
bool rand_bool(void) {
    static uint32_t rbits;
    if (gcount == 0) {
        gcount = 32; // I'd consider using 31 here, just to cope with some LCG weaknesses.
        rbits = lcg64_temper(&gseed);
    }
    gcount--;
    bool b = rbits & 1;
    rbits >>= 1;
    return b;
}

Fast hashing of 32 bit values to between 0 and 254 inclusive

I'm looking for a fast way in C to hash 32-bit numbers more or less uniformly between 0 and 254. 255 is reserved for a special purpose.
As an added constraint, I'm looking for a method that would map well to being used with ISA-specific vector intrinsics or to a language like OpenCL or CUDA without introducing control flow divergence between the vector lanes/threads.
Ordinarily, I would just use the following code to hash the number between 0 and 255, as this is just a fast way of doing x mod 256.
inline uint8_t hash(uint32_t x){ return x & 255; }
I could just give in and use the following:
inline uint8_t hash(uint32_t x){ return x % 255; }
However, this solution seems unimaginative and unlikely to be the highest performing solution. I found code at this site (http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml#exmod15) that appears to provide a reasonable solution for scalar code and have inserted it here for your convenience.
uint32_t mod255( uint32_t a ) {
    a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */
    a = (a >> 8) + (a & 0xFF);    /* sum base 2**8 digits */
    if (a < 255) return a;
    if (a < (2 * 255)) return a - 255;
    return a - (2 * 255);
}
I see two potential performance issues with this code:
The large number of if statements makes me question how easy it will be for a compiler or human :) to effectively vectorize the code without leading to control flow divergence within a warp/wavefront on a SIMT architecture or vectorized execution on a multicore CPU. If such divergence does occur, it will reduce parallel efficiency, as the divergent paths will have to be run in series.
It looks like it could be troublesome for a branch predictor (not applicable on common GPU architectures) as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Any recommendations on alternatives that I could use are most welcome. Alternatively, let me know if what I am asking for is unreasonable.
The "if statements on GPU kill performance" is a popular misconception which desperately wants to live on, it seems.
The large number of if statements makes me question how easy it will be for a compiler or human :) to vectorize the code.
First of all I wouldn't consider 2 if statements a "large number of if statements", and those are so short and trivial that I'm willing to bet the compiler will turn them into branchless conditional moves or predicated instructions. There will be no performance penalty at all. (Do check the generated assembly, however).
It looks like it could be troublesome for a branch predictor as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Current GPUs do not have branch predictors. Note however that depending on the underlying hardware, operation on integers (and notably shifting) may be quite costly.
I would just do this:
uchar fast_mod255( uint a32 ) {
    ushort a16 = (a32 >> 16) + (a32 & 0xFFFF); /* sum base 2**16 digits */
    uchar a8 = (a16 >> 8) + (a16 & 0xFF);      /* sum base 2**8 digits */
    return (a8 % 255);
}
Another option is to just do:
uchar fast_mod255( uchar4 a ) {
    return (dot(a) % 255); // or return (distance(a) % 255);
}
GPUs are very efficient at computing distances and dot products, even in 4 dimensions. And it is a valid way of hashing as well, discarding the overflowed values.
No branching, and a clever compiler can even optimize it out. Or do you really need the values that fall into the 255 zone to follow a scattered pattern instead of a single one?
I wanted to answer my own question because over the last 2 years I have seen ways to get around a slow integer divide instruction. The easiest way is to make the integer a compile-time constant. Any decent modern compiler should replace the integer divide with an equivalent set of other instructions with typically higher throughput (how many such instructions can be retired per cycle) and reduced latency (how many cycles it takes the instruction to execute). If you're curious, check out Hacker's Delight (an excellent book on low-level computer arithmetic).
I wanted to share another finding, which I found on Daniel Lemire's blog (located here). The code that follows doesn't compute mod 255 but does something similar, which is equally useful in a number of applications and much faster.
Suppose that you have a set of numbers S that are uniformly randomly picked from the range 0 to 2^k - 1 inclusive, where k >= 0. In this case, if you care only about mapping numbers roughly uniformly from 0 to 254 inclusive, you may do the following:
For each number n in a set S, you may map n to one of the 255 candidate values by multiplying n by 255 and then shifting the result to the right by k bits.
Here is the function that you call on each n for a fixed value of k:
int map_to_0_to_254(int n, int k){
    return (n * 255) >> k;
}
As an example, if the values for the argument n range uniformly randomly from 0 to 4095 (2^12 - 1),
then map_to_0_to_254(n, 12) will return a value in the range 0 to 254 inclusive.
Here is a more general templated version in C++ for mapping to range from 0 to range_size - 1 inclusive:
template<typename T>
T map_to_0_to_range_size_minus_1(T n, T range_size, T k){
    return (n * range_size) >> k;
}
REMEMBER that this code assumes that the inputs for n are roughly uniformly randomly distributed between 0 and 2^k - 1 inclusive. If that property holds, then the outputs will be roughly uniformly distributed between 0 and range_size - 1 inclusive. The larger 2^k is relative to range_size, the more uniform the mapping will be for a fixed set of inputs.
Why This is Useful
This approach has applications to computing hash functions for hash tables where the number of bins is not a power of 2. Those operations would ordinarily require a long-latency integer divide instruction, which is often an order of magnitude slower to execute than an integer multiply, because you often do not know the number of bins in the hash table at compile time.
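For example (my own illustration, assuming 32-bit hash values that are roughly uniform over the whole 32-bit range, i.e. k = 32), the bucket index for a table with a non-power-of-two number of bins becomes a multiply and a shift:
#include <stdint.h>

/* Map a roughly uniform 32-bit hash to [0, num_bins) without a divide.
   The 64-bit product keeps full precision before the shift. */
static inline uint32_t bucket_index(uint32_t hash, uint32_t num_bins)
{
    return (uint32_t)(((uint64_t)hash * num_bins) >> 32);
}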

How to generate a logarithmic spaced array in C

I am trying to generate a logarithmically spaced array in C.
For example, starting at 100 and ending at 500, with 40 logarithmically spaced points.
Can anyone help me? Are there any logspace() functions available?
With no further constraints, simply divide the linear interval [ln(100)..ln(500)] into as many equidistant subintervals as you need. Then take the exp() of each point.
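A minimal sketch of that approach (the function name logspace and its start/stop/count interface are just my choice for the example; count must be at least 2):
#include <math.h>
#include <stdio.h>

/* Fill out[0..count-1] with count points from start to stop inclusive,
   spaced evenly in log space. */
void logspace(double start, double stop, int count, double *out)
{
    double log_start = log(start);
    double step = (log(stop) - log_start) / (count - 1);
    for (int i = 0; i < count; i++)
        out[i] = exp(log_start + i * step);
}

int main(void)
{
    double points[40];
    logspace(100.0, 500.0, 40, points);
    for (int i = 0; i < 40; i++)
        printf("%f\n", points[i]);
    return 0;
}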
Arrays always use linear, integer, step-by-one indexing, so you have to map the logarithmic scale to the linear index. This can be done either by simply taking log(log_index), or with a table of ranges and a linear search in it. For log(), there might be approximations which suit your needs better and are faster than a full-grown (float) logarithm function.
You might for instance take the number of the uppermost 1-bit in the log-index and use the next n lower bits as range-index:
// all vars are size_t (unsigned at least!)
base_index = get_number_of_uppermost_bit(log_index);
shift = (base_index > 3U) ? (base_index - 3U) : 0;
lin_index = base_index * 8U + ((log_index >> shift) & (8U - 1U));
The values of 8 and 3 (ld(8)) are the number of entries per log-range. Note that the entries within each range are linearly spaced (sometimes an acceptable approximation). You can also apply the algorithm to the lower bits, which effectively gives you an integer log function, but the above is faster and might be sufficient. Alternatively, you can use a lookup table for the lower 3 bits.
A decimal stepping would be more difficult that way and pretty inefficient.

Looking for decent-quality PRNG with only 32 bits of state

I'm trying to implement a tolerable-quality version of the rand_r interface, which has the unfortunate interface requirement that its entire state is stored in a single object of type unsigned, which for my purposes means exactly 32 bits. In addition, I need its output range to be [0,2³¹-1]. The standard solution is using an LCG and dropping the low bit (which has the shortest period), but this still leaves very poor periods for the next few bits.
My initial thought was to use two or three iterations of the LCG to generate the high/low or high/mid/low bits of the output. However, such an approach does not preserve the non-biased distribution; rather than each output value having equal frequency, many occur multiple times, and some never occur at all.
Since there are only 32 bits of state, the period of the PRNG is bounded by 2³², and in order to be non-biased, the PRNG must output each value exactly twice if it has full period or exactly once if it has period 2³¹. Shorter periods cannot be non-biased.
Is there any good known PRNG algorithm that meets these criteria?
One good (but probably not the fastest) possibility, offering very high quality, would be to use a 32-bit block cipher in CTR mode. Basically, your RNG state would simply be a 32-bit counter that gets incremented by one for each RNG call, and the output would be the encryption of that counter value using the block cipher with some arbitrarily chosen fixed key. For extra randomness, you could even provide a (non-standard) function to let the user set a custom key.
There aren't a lot of 32-bit block ciphers in common use, since such a short block size introduces problems for cryptographic use. (Basically, the birthday paradox lets you distinguish the output of such a cipher from a random function with a non-negligible probability after only about 2¹⁶ = 65536 outputs, and after 2³² outputs the non-randomness obviously becomes certain.) However, some ciphers with an adjustable block size, such as XXTEA or HPC, will let you go down to 32 bits, and should be suitable for your purposes.
(Edit: My bad, XXTEA only goes down to 64 bits. However, as suggested by CodesInChaos in the comments, Skip32 might be another option. Or you could build your own 32-bit Feistel cipher.)
The CTR mode construction guarantees that the RNG will have a full period of 2³² outputs, while the standard security claim of (non-broken) block ciphers is essentially that it is not computationally feasible to distinguish their output from a random permutation of the set of 32-bit integers. (Of course, as noted above, such a permutation is still easily distinguished from a random function taking 32-bit values.)
Using CTR mode also provides some extra features you may find convenient (even if they're not part of the official API you're developing against), such as the ability to quickly seek into any point in the RNG output stream just by adding or subtracting from the state.
On the other hand, you probably don't want to follow the common practice of seeding the RNG by just setting the internal state to the seed value, since that would cause the output streams generated from nearby seeds to be highly similar (basically just the same stream shifted by the difference of the seeds). One way to avoid this issue would be to add an extra encryption step to the seeding process, i.e. to encrypt the seed with the cipher and set the internal counter value equal to the result.
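To make the shape of this concrete, here is a toy sketch of the CTR construction built on a hand-rolled 4-round Feistel permutation. The round function and constants are arbitrary choices for illustration, not a vetted cipher such as Skip32; what matters for the full-period argument is only that a Feistel network is a bijection on the 32-bit block.
#include <stdint.h>

/* Toy 4-round Feistel permutation on 32-bit blocks (16-bit halves). */
static uint32_t feistel32(uint32_t block)
{
    static const uint16_t keys[4] = { 0x9E37, 0x79B9, 0x7F4A, 0x7C15 };
    uint16_t left  = (uint16_t)(block >> 16);
    uint16_t right = (uint16_t)block;

    for (int round = 0; round < 4; round++) {
        /* Arbitrary round function mixing the right half with a round key. */
        uint16_t f = (uint16_t)((right ^ keys[round]) * 0x6255u);
        f ^= (uint16_t)(f >> 7);
        uint16_t new_right = (uint16_t)(left ^ f);
        left  = right;
        right = new_right;
    }
    return ((uint32_t)left << 16) | right;
}

/* CTR mode: the entire 32-bit RNG state is just a counter. */
static uint32_t ctr_state;

uint32_t rand32_ctr(void)
{
    return feistel32(ctr_state++);
}
Because feistel32() is a permutation of the 32-bit integers and the counter visits every value before wrapping, the output stream has period exactly 2³².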
A 32-bit maximal-period Galois LFSR might work for you. Try:
r = (r >> 1) ^ (-(r & 1) & 0x80200003);
The one problem with LFSRs is that you can't produce the value 0. So this one has a range of 1 to 2^32-1. You may want to tweak the output or else stick with a good LCG.
Besides using a Lehmer MCG, there are a couple of generators you could use:
32-bit variants of Xorshift have a guaranteed period of 2³²−1 using a 32-bit state:
uint32_t state; /* must be seeded to a nonzero value */
uint32_t xorshift32(void) {
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}
That's the original 32-bit recommendation from 2003 (see paper). Depending on your definition of "decent quality", that should be fine. However it fails the binary rank tests of Diehard, and 5/10 tests of SmallCrush.
Alternate version with better mixing and constants (passes SmallCrush and Crush):
uint32_t xorshift32amx(void) {
    int s = __builtin_bswap32(state * 1597334677);
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state + s;
}
Based on research here and here.
There's also Mulberry32, which has a period of exactly 2³²:
uint32_t mulberry32(void) {
    uint32_t z = state += 0x6D2B79F5;
    z = (z ^ z >> 15) * (1 | z);
    z ^= z + (z ^ z >> 7) * (61 | z);
    return z ^ z >> 14;
}
This is probably your best option. It's quite good. The author states: "It passes gjrand's 13 tests with no failures and a total P-value of 0.984 (where 1 is perfect and 0.1 or less is a failure) on 4GB of generated data. That's a quarter of the full period". It appears to be an improvement over SplitMix32.
"SplitMix32", adopted from xxHash/MurmurHash3 (Weyl sequence):
uint32_t splitmix32(void) {
    uint32_t z = state += 0x9e3779b9;
    z ^= z >> 15; // 16 for murmur3
    z *= 0x85ebca6b;
    z ^= z >> 13;
    z *= 0xc2b2ae35;
    return z ^= z >> 16;
}
The quality might be questionable here, but its 64-bit big brother has a lot of fans (passes BigCrush). So the general structure is worth looking at.
Elaborating on my comment...
A block cipher in counter mode gives a generator in approximately the following form (except using much bigger data types):
uint32_t state = 0;
uint32_t rand()
{
    state = next(state);
    return temper(state);
}
Since cryptographic security hasn't been specified (and in 32 bits it would be more or less futile), a simpler, ad-hoc tempering function should do the trick.
One approach is where the next() function is simple (eg., return state + 1;) and temper() compensates by being complex (as in the block cipher).
A more balanced approach is to implement LCG in next(), since we know that it also visits all possible states but in a random(ish) order, and to find an implementation of temper() which does just enough work to cover the remaining problems with LCG.
Mersenne Twister includes such a tempering function on its output. That might be suitable. Also, this question asks for operations which fulfill the requirement.
I have a favourite, which is to bit-reverse the word, and then multiply it by some constant (odd) number. That may be overly complex if bit-reverse isn't a native operation on your architecture.
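As a sketch of that shape (an LCG as next() and bit-reverse-then-multiply as temper(); the particular constants are illustrative choices, not tuned ones):
#include <stdint.h>

static uint32_t state = 1;

/* Full-period 32-bit LCG step (Numerical Recipes constants). */
static uint32_t next(uint32_t s)
{
    return s * 1664525u + 1013904223u;
}

/* Reverse the bits of the word, then multiply by an odd constant.
   Both steps are bijections, so the generator keeps its full period. */
static uint32_t temper(uint32_t x)
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
    x = (x << 16) | (x >> 16);
    return x * 0x9E3779B9u;
}

uint32_t rand32(void)
{
    state = next(state);
    return temper(state);
}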

How to define 2-bit numbers in C, if possible?

For my university project I'm simulating a process called random sequential adsorption.
One of the things I have to do involves randomly depositing squares (which cannot overlap) onto a lattice until there is no more room left, repeating the process several times in order to find the average 'jamming' coverage %.
Basically I'm performing operations on a large array of integers, of which 3 possible values exist: 0, 1 and 2. The sites marked with '0' are empty, the sites marked with '1' are full. Initially the array is defined like this:
int i, j;
int n = 1000000000;
int array[n][n];
for(j = 0; j < n; j++)
{
    for(i = 0; i < n; i++)
    {
        array[i][j] = 0;
    }
}
Say I want to deposit 5*5 squares randomly on the array (that cannot overlap), so that the squares are represented by '1's. This would be done by choosing the x and y coordinates randomly and then creating a 5*5 square of '1's with the topleft point of the square starting at that point. I would then mark sites near the square as '2's. These represent the sites that are unavailable since depositing a square at those sites would cause it to overlap an existing square. This process would continue until there is no more room left to deposit squares on the array (basically, no more '0's left on the array)
Anyway, to the point. I would like to make this process as efficient as possible, by using bitwise operations. This would be easy if I didn't have to mark sites near the squares. I was wondering whether creating a 2-bit number would be possible, so that I can account for the sites marked with '2'.
Sorry if this sounds really complicated, I just wanted to explain why I want to do this.
You can't create a datatype that is 2 bits in size since it wouldn't be addressable. What you can do is pack several 2-bit numbers into a larger cell:
struct Cell {
    unsigned int a : 2;
    unsigned int b : 2;
    unsigned int c : 2;
    unsigned int d : 2;
};
This specifies that each of the members a, b, c and d should occupy two bits in memory.
EDIT: This is just an example of how to create 2-bit variables, for the actual problem in question the most efficient implementation would probably be to create an array of int and wrap up the bit fiddling in a couple of set/get methods.
Instead of a two-bit array you could use two separate 1-bit arrays. One holds filled squares and one holds adjacent squares (or available squares if this is more efficient).
I'm not really sure that this has any benefit though over packing 2-bit fields into words.
I'd go for byte arrays unless you are really short of memory.
The basic idea
Unfortunately, there is no way to do this in C. You can create arrays of 1 byte, 2 bytes, etc., but you can't create arrays of bits.
The best thing you can do, then, is to write a new library for yourself, which makes it look like you're dealing with arrays of 2 bits, but in reality does a lot of hard work. The same way that the string libraries give you functions that work on "strings" (which in C are just arrays), you'll be creating a new library which works on "bit arrays" (which in reality will be arrays of integers, with a few special functions to deal with them as-if they were arrays of bits).
NOTE: If you're new to C, and haven't learned the ideas of "creating a new library/module", or the concept of "abstraction", then I'd recommend learning about them before you continue with this project. Understanding them is IMO more important than optimizing your program to use a little less space.
How to implement this new "library" or module
For your needs, I'd create a new module called "2-bit array", which exports functions for dealing with the 2-bit arrays, as you need them.
It would have a few functions that deal with setting/reading bits, so that you can work with it as if you have an actual array of bits (you'll actually have an array of integers or something, but the module will make it seem like you have an array of bits).
Using this module would look something like this:
// This is just an example of how to use the functions in the twoBitArray library.
twoB my_array = Create2BitArray(size); // This will "create" a twoBitArray and return it.
SetBit(my_array, 5, 1); // Set bit 5 to 1.
bit b = GetBit(my_array, 5); // Where bit is typedefed to an int by your module.
What the module will actually do is implement all these functions using regular-old arrays of integers.
For example, the function GetBit(), for GetBit(my_arr, 17), will calculate that it's the 1st bit in the 4th integer of your array (depending on sizeof(int), obviously), and you'd return it by using bitwise operations.
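A minimal sketch of such a module follows. Since each element here is two bits, I use cell-oriented names SetCell/GetCell instead of the SetBit/GetBit placeholders above, and pack 16 cells per uint32_t; error handling and a matching destroy/free function are omitted.
#include <stdint.h>
#include <stdlib.h>

/* A "2-bit array": values 0..3 packed 16 to a uint32_t word. */
typedef struct {
    uint32_t *words;
    size_t    size;   /* number of 2-bit cells */
} twoB;

twoB Create2BitArray(size_t size)
{
    twoB a;
    a.size  = size;
    a.words = calloc((size + 15) / 16, sizeof(uint32_t)); /* all cells start at 0 */
    return a;
}

void SetCell(twoB a, size_t i, unsigned value) /* value is 0, 1 or 2 */
{
    size_t   word  = i / 16;
    unsigned shift = (unsigned)(i % 16) * 2;
    a.words[word] = (a.words[word] & ~(3u << shift)) | ((value & 3u) << shift);
}

unsigned GetCell(twoB a, size_t i)
{
    return (a.words[i / 16] >> ((i % 16) * 2)) & 3u;
}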
You can compact one dimension of the array into sub-integer cells. To convert a coordinate (let's say x, for example) to a position inside a byte:
byte cell = array[i][ x / 4 ];
byte mask = 0x03 << ((x % 4) * 2);
byte data = (cell & mask) >> ((x % 4) * 2);
To write data, do the reverse.
