How to map a long integer number to an N-dimensional vector of smaller integers (and fast inverse)? - c

Given an N-dimensional vector of small integers, is there any simple way to map it one-to-one to a large integer number?
Say we have an N=3 vector space. Can we represent a vector X=[(int16)x1,(int16)x2,(int16)x3] using an integer (int48)y? The obvious answer is "Yes, we can". But the question is: "What is the fastest way to do this and its inverse operation?"
Will this new 1-dimensional space possess some very special useful properties?

If your three components are full 32-bit integers, you have 3 * 32 = 96 bits of information, so without any a priori knowledge you need 96 bits for the equivalent long integer.
However, if you know that your x1, x2, x3 values will always fit within, say, 16 bits each, then you can pack them all into a 48-bit integer.
In either case the technique is very simple: you just use shift, mask, and bitwise OR operations to pack/unpack the values.

Just to make this concrete, if you have a 3-dimensional vector of 8-bit numbers, like this:
uint8_t vector[3] = { 1, 2, 3 };
then you can join them into a single (24-bit) number like so:
uint32_t all = (vector[0] << 16) | (vector[1] << 8) | vector[2];
This number would, if printed using this statement:
printf("the vector was packed into %06x", (unsigned int) all);
produce the output
the vector was packed into 010203
The reverse operation would look like this:
uint8_t v2[3];
v2[0] = (all >> 16) & 0xff;
v2[1] = (all >> 8) & 0xff;
v2[2] = all & 0xff;
Of course this all depends on the size of the individual numbers in the vector and the length of the vector together not exceeding the size of an available integer type; otherwise you can't represent the "packed" vector as a single number.
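For the question's 3 x 16-bit case specifically, here is a minimal sketch of the same technique (my function names; using unsigned components and holding the packed value in the low 48 bits of a uint64_t):
#include <stdint.h>

uint64_t pack48(const uint16_t v[3])
{
    return ((uint64_t)v[0] << 32) | ((uint64_t)v[1] << 16) | v[2];
}

void unpack48(uint64_t y, uint16_t v[3])
{
    v[0] = (uint16_t)(y >> 32);
    v[1] = (uint16_t)(y >> 16);
    v[2] = (uint16_t)y;
}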

If you have sets Si, i=1..n of size Ci = |Si|, then the cartesian product set S = S1 x S2 x ... x Sn has size C = C1 * C2 * ... * Cn.
This motivates an obvious way to do the packing one-to-one. If you have elements e1,...,en from each set, each in the range 0 to Ci-1, then you give the element e=(e1,...,en) the value e1 + C1*(e2 + C2*(e3 + ... + C(n-1)*en)...).
You can do any permutation of this packing if you feel like it, but unless the values are perfectly correlated, the size of the full set must be the product of the sizes of the component sets.
In the particular case of three 32 bit integers, if they can take on any value, you should treat them as one 96 bit integer.
If you particularly want to, you can map small values to small values through any number of means (e.g. filling out spheres with the L1 norm), but you have to specify what properties you want to have.
(For example, one can map (n,m) to (max(n,m)-1)^2 + k, where k=n if n<=m and k=n+m if n>m; you can draw this as a picture of filling in a square layer by layer, along its edges:
1 2 5
4 3 6
8 9 7
if you start counting from 1 and only worry about positive values; for integers, you can spiral around the origin.)
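If it helps, here is a small hypothetical function (mine, not from the answer) implementing that square-filling map for positive n, m counted from 1:
unsigned pair_square(unsigned n, unsigned m)
{
    unsigned M = n > m ? n : m;        /* which "layer" of the square we are in */
    unsigned k = (n <= m) ? n : n + m; /* position along that layer's edge */
    return (M - 1) * (M - 1) + k;
}
For instance, pair_square(3,1) gives 8 and pair_square(3,3) gives 7, matching the picture above.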

I'm writing this without having time to check details, but I suspect the best way is to represent your long integer via modular arithmetic, using k different moduli which are pairwise coprime. The original integer can then be reconstructed using the Chinese remainder theorem. Sorry this is a bit sketchy, but I hope it helps.
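To make the idea concrete, here is a minimal sketch (mine), assuming three pairwise coprime moduli whose product covers the value range; the naive search in combine() stands in for a real CRT reconstruction with precomputed modular inverses:
#include <stdint.h>

#define M0 251u
#define M1 253u
#define M2 255u /* pairwise coprime; product is just under 2^24 */

void split(uint32_t y, uint32_t r[3]) /* y must be < M0*M1*M2 */
{
    r[0] = y % M0;
    r[1] = y % M1;
    r[2] = y % M2;
}

uint32_t combine(const uint32_t r[3])
{
    uint32_t y = r[0];                 /* keeps y congruent to r[0] mod M0 */
    while (y % M1 != r[1] || y % M2 != r[2])
        y += M0;                       /* step until the other residues match */
    return y;                          /* fine for a sketch; slow for production */
}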

To expand on Rex Kerr's generalised form, in C you can pack the numbers like so:
X = e[n];
X *= MAX_E[n-1] + 1;
X += e[n-1];
/* ... */
X *= MAX_E[0] + 1;
X += e[0];
And unpack them with:
e[0] = X % (MAX_E[0] + 1);
X /= (MAX_E[0] + 1);
e[1] = X % (MAX_E[1] + 1);
X /= (MAX_E[1] + 1);
/* ... */
e[n] = X;
(Where MAX_E[n] is the greatest value that e[n] can have). Note that these maximum values are likely to be constants, and may be the same for every e, which will simplify things a little.
The shifting / masking implementations given in the other answers are a special case of this, for when the MAX_E + 1 values are powers of 2 (and thus the multiplication and division can be done with a shift, the addition with a bitwise OR, and the modulus with a bitwise AND).
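Wrapped up as self-contained functions (a sketch, assuming the product of the component ranges fits in a uint64_t; the names are mine):
#include <stdint.h>

uint64_t pack_mixed(const unsigned e[], const unsigned max_e[], int n)
{
    uint64_t x = 0;
    for (int i = n - 1; i >= 0; i--)   /* fold in components from last to first */
        x = x * (max_e[i] + 1) + e[i];
    return x;
}

void unpack_mixed(uint64_t x, unsigned e[], const unsigned max_e[], int n)
{
    for (int i = 0; i < n; i++) {
        e[i] = (unsigned)(x % (max_e[i] + 1));
        x /= max_e[i] + 1;
    }
}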

There are some totally non-portable ways to make this really fast using packed unions and direct accesses to memory, though that you really need this kind of speed is suspicious. Methods using shifts and masks should be fast enough for most purposes. If not, consider using specialized processors like GPUs, for which vector support is optimized (parallel).
This naive storage does not possess any useful property that I can foresee, except that you can perform some computations (add, sub, logical bitwise operators) on the three coordinates at once, as long as you use positive integers only and you don't overflow for add and sub; a small demonstration follows below.
You'd better be quite sure you won't overflow (or won't go negative for sub) or the vector will become garbage.
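For example (my own illustration, using the 8-bit packing from the earlier answer), one 32-bit add sums all three coordinates at once, provided no field overflows into its neighbour:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t a = (1u << 16) | (2u << 8) | 3u; /* packed [1, 2, 3] */
    uint32_t b = (4u << 16) | (5u << 8) | 6u; /* packed [4, 5, 6] */
    uint32_t sum = a + b;                     /* packed [5, 7, 9]: no carries cross fields */
    printf("%06x\n", sum);                    /* prints 050709 */
    return 0;
}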

#include <stdint.h> // for uint8_t
long x;
uint8_t *p = (uint8_t *)&x; // the cast is required: &x has type long *
or
union X {
    long L;
    uint8_t A[sizeof(long)/sizeof(uint8_t)];
};
works if you don't care about the endianness. In my experience compilers generate better code with the union because it doesn't set off their "you took the address of this, so I must keep it in RAM" rules as quickly. Those rules will get set off if you try to index the array with anything the compiler can't optimize away.
If you do care about endianness then you need to mask and shift.

I think what you want can be solved using multi-dimensional space filling curves. The link gives a lot of references on this, which in turn give different methods and insights. Here's a specific example of an invertible mapping. It works for any dimension N.
As for useful properties, these mappings are related to Gray codes.
Hard to say whether this was what you were looking for, or whether the "pack 3 16-bit ints into a 48-bit int" does the trick for you.
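For illustration (my example, not taken from those references): Morton / Z-order interleaving is one of the simplest invertible mappings of this kind. It interleaves the bits of the three 16-bit coordinates into a 48-bit key, so vectors that are close in space tend to get close keys; the inverse just extracts every third bit.
#include <stdint.h>

uint64_t morton3(uint16_t x, uint16_t y, uint16_t z)
{
    uint64_t key = 0;
    for (int i = 0; i < 16; i++) {
        key |= ((uint64_t)((x >> i) & 1)) << (3 * i);
        key |= ((uint64_t)((y >> i) & 1)) << (3 * i + 1);
        key |= ((uint64_t)((z >> i) & 1)) << (3 * i + 2);
    }
    return key;
}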

Related

Fast hashing of 32 bit values to between 0 and 254 inclusive

I'm looking for a fast way in C to hash 32-bit numbers more or less uniformly between 0 and 254. 255 is reserved for a special purpose.
As an added constraint, I'm looking for a method that would map well to being used with ISA-specific vector intrinsics or to a language like OpenCL or CUDA without introducing control flow divergence between the vector lanes/threads.
Ordinarily, I would just use the following code to hash the number between 0 and 255, as this is just a fast way of doing x mod 256.
inline uint8_t hash(uint32_t x){ return x & 255; }
I could just give in and use the following:
inline uint8_t hash(uint32_t x){ return x % 255; }
However, this solution seems unimaginative and unlikely to be the highest performing solution. I found code at this site (http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml#exmod15) that appears to provide a reasonable solution for scalar code and have inserted it here for your convenience.
uint32_t mod255( uint32_t a ) {
    a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */
    a = (a >> 8) + (a & 0xFF);    /* sum base 2**8 digits */
    if (a < 255) return a;
    if (a < (2 * 255)) return a - 255;
    return a - (2 * 255);
}
I see two potential performance issues with this code:
The large number of if statements makes me question how easy it will be for a compiler or human :) to effectively vectorize the code without leading to control flow divergence within a warp/wavefront on a SIMT architecture or vectorized execution on a multicore CPU. If such divergence does occur, it will reduce parallel efficiency, as the divergent paths will have to be run in series.
It looks like it could be troublesome for a branch predictor (not applicable on common GPU architectures) as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Any recommendations on alternatives that I could use are most welcome. Alternatively, let me know if what I am asking for is unreasonable.
The "if statements on GPU kill performance" is a popular misconception which desperately wants to live on, it seems.
The large number of if statements makes me question how easy it will
be for a compiler or human :) to vectorize the code.
First of all I wouldn't consider 2 if statements a "large number of if statements", and those are so short and trivial that I'm willing to bet the compiler will turn them into branchless conditional moves or predicated instructions. There will be no performance penalty at all. (Do check the generated assembly, however).
It looks like it could be troublesome for a branch predictor as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Current GPUs do not have branch predictors. Note however that depending on the underlying hardware, operation on integers (and notably shifting) may be quite costly.
I would just do this, keeping the intermediate sums wide enough that no carries are lost:
uchar fast_mod255( uint a32 ) {
    uint a16 = (a32 >> 16) + (a32 & 0xFFFF); /* sum base 2**16 digits; can be up to 17 bits */
    uint a8 = (a16 >> 8) + (a16 & 0xFF);     /* sum base 2**8 digits; can be up to 10 bits */
    return (a8 % 255);                       /* final fold; a8 <= 765 */
}
Another option is to just do:
uchar fast_mod255( uchar4 a ) {
    /* dot() in OpenCL takes two float vectors, so convert and dot with ones */
    return ((uint)dot(convert_float4(a), (float4)(1.0f)) % 255);
}
GPUs are very efficient at computing distances and dot products, even in 4 dimensions, and it is a valid way of hashing as well, discarding the overflowed values.
No branching, and a clever compiler can even optimize it out. Or do you really need the values that fall in the 255 zone to have a scattered pattern instead of mapping to a single value?
I wanted to answer my own question because over the last 2 years I have seen ways to get around a slow integer divide instruction. The easiest is to make the divisor a compile-time constant. Any decent modern compiler will then replace the integer divide with an equivalent set of other instructions with typically higher throughput (how many such instructions can be retired per cycle) and lower latency (how many cycles the instruction takes to execute). If you're curious, check out Hacker's Delight (an excellent book on low-level computer arithmetic).
I wanted to share another finding, which I found on Daniel Lemire's blog (located here). The code that follows doesn't compute mod 255 but does something similar, which is equally useful in a number of applications and much faster.
Suppose that you have a set of numbers S that are uniformly randomly picked from the range 0 to 2^k - 1 inclusive, where k >= 0. In this case, if you care only about mapping numbers roughly uniformly from 0 to 254 inclusive, you may do the following:
For each number n in the set S, map n to one of the 255 candidate values by multiplying n by 255 and then shifting the result right by k bits.
Here is the function that you call on each n for a fixed value of k:
int map_to_0_to_254(int n, int k){
    return (int)(((int64_t)n * 255) >> k); /* widen first so n * 255 cannot overflow an int */
}
As an example, if the values for the argument n range uniformly randomly from 0 to 4095 (2^12 - 1),
then map_to_0_to_254(n, 12) will return a value in the range 0 to 254 inclusive.
Here is a more general templated version in C++ for mapping to range from 0 to range_size - 1 inclusive:
template<typename T>
T map_to_0_to_range_size_minus_1(T n, T range_size, T k){
    return (n * range_size) >> k; // T must be wide enough to hold n * range_size
}
REMEMBER that this code assumes that the inputs for n are roughly uniformly randomly distributed between 0 and 2^k - 1 inclusive. If that property holds, then the outputs will be roughly uniformly distributed between 0 and range_size - 1 inclusive. The larger 2^k is relative to range_size, the more uniform the mapping will be for a fixed set of inputs.
Why This is Useful
This approach has applications to computing hash functions for hash tables where the number of bins is not a power of 2. Those operations would ordinarily require a long-latency integer divide instruction, which is often an order of magnitude slower to execute than an integer multiply, because you often do not know the number of bins in the hash table at compile time.
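As a concrete sketch of that use case (my example; bucket() is a hypothetical name), the 64-bit multiply-shift maps a 32-bit hash to a bin count that need not be a power of 2, with no divide:
#include <stdint.h>

static inline uint32_t bucket(uint32_t hash, uint32_t num_bins)
{
    /* assumes hash is roughly uniform over the full 32-bit range (k = 32) */
    return (uint32_t)(((uint64_t)hash * num_bins) >> 32);
}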

How do you use bitwise operators, masks, to find if a number is a multiple of another number?

So I have been told that this can be done and that bitwise operations and masks can be very useful but I must be missing something in how they work.
I am trying to calculate whether a number, say x, is a multiple of y. If x is a multiple of y great end of story, otherwise I want to increase x to reach the closest multiple of y that is greater than x (so that all of x fits in the result). I have just started learning C and am having difficulty understanding some of these tasks.
Here is what I have tried but when I input numbers such as 5, 9, or 24 I get the following respectively: 0, 4, 4.
if(x&(y-1)){ //if not 0 then multiple of y
x = x&~(y-1) + y;
}
Any explanations, examples of the math that is occurring behind the scenes, are greatly appreciated.
EDIT: So to clarify, I somewhat understand the shifting of bits to get whether an item is a multiple. (As was explained in a reply 10100 is a multiple of 101 as it is just shifted over). If I have the number 16, which is 10000, its complement is 01111. How would I use this complement to see if an item is a multiple of 16? Also can someone give a numerical explanation of the code given above? Showing this may help me understand why it does not work. Once I understand why it does not work I will be able to problem solve on my own I believe.
Why would you even think about using bit-wise operations for this? They certainly have their place but this isn't it.
A better method is to simply use something like:
unsigned multGreaterOrEqual(unsigned x, unsigned y) {
    if ((x % y) == 0)
        return x;
    return (x / y + 1) * y;
}
In the trivial cases, every number that is a power-of-2 multiple of another is just that number shifted to the left (this doesn't apply when the shift would alter the sign bit).
For example
10100
is 4 times
101
and
10100
is 2 times
1010
As for other multiples, they would have to be found by combining the outputs of two shifts. You might want to look up some primitive means of computer division, where division looks roughly like
x = a / b
implemented like
x = 0
buffer = a
while buffer is bigger than or equal to b, do:
    subtract b from buffer
    add 1 to x
done
Faster routines try to figure out higher place values first, skipping lots of subtractions; a shift-and-subtract sketch of that idea follows. All of these routines can be done bitwise, but it is a big pain; in the ALU these routines are done bitwise. You might want to look up a digital logic design book for more ideas.
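As a rough sketch of that idea (mine, not from a textbook): classic restoring division works out the quotient one bit at a time, from the highest place value down, assuming b != 0:
#include <stdint.h>

uint32_t div_bitwise(uint32_t a, uint32_t b)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((a >> i) & 1); /* bring down the next bit of a */
        if (r >= b) {                  /* does b fit at this place value? */
            r -= b;
            q |= 1u << i;
        }
    }
    return q;                          /* r now holds a % b */
}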
Ok, so I have discovered what the error was in my code, and since the majority say that it is impossible to calculate whether a number is a multiple of another number using masks, I figured I would share what I have learned.
It is possible! - if you are using the correct data types, that is (and if y is a power of two, which is what makes the mask trick work at all).
The code given above works if y is declared as a constant unsigned long, since the x being passed in was also an unsigned long. The key point is not the long or constant part but that the numbers are unsigned. The sign bit causes miscalculation, since the first bit of a signed number indicates the sign, and bitwise operations can muddle it.
So here is my code if we are looking for multiples of 16:
const unsigned long y = 16; //declared globally in my case
Then an unsigned long is passed to the function which runs the following code:
if(x & (y-1)){ /* nonzero means x is NOT a multiple of y */
    x = (x & ~(y-1)) + y; /* the parentheses matter: + binds tighter than & */
}
x will now be the nearest multiple of 16 that is greater than or equal to the original x. (Note that this mask trick only works because 16 is a power of two.)
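Equivalently, the same power-of-two trick folds into one branch-free expression (a sketch; unlike the guarded version above it is a single statement, and exact multiples are likewise left unchanged):
unsigned long round_up_pow2(unsigned long x, unsigned long y)
{
    /* y must be a power of two; assumes x + y - 1 does not overflow */
    return (x + y - 1) & ~(y - 1);
}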

Notation for fixed point representation

I'm looking for a commonly understandable notation to define a fixed point number representation.
The notation should be able to define both a power-of-two factor (using fractional bits) and a generic factor (sometimes I'm forced to use this, though less efficient). And also an optional offset should be defined.
I already know some possible notations, but all of them seem to be constrained to specific applications.
For example the Simulink notation would perfectly fit my needs, but it's known only in the Simulink world. Furthermore the overloaded usage of the fixdt() function is not so readable.
TI defines a really compact Q format, but the sign is implicit, and it doesn't handle a generic (non-power-of-two) factor.
ASAM uses a generic 6-coefficient rational function with 2nd-degree numerator and denominator polynomials (COMPU_METHOD). Very generic, but not so friendly.
See also the Wikipedia discussion.
The question is only about the notation (not efficiency of the representation nor fixed-point manipulation). So it's a matter of code readability, maintainability and testability.
Ah, yes. Having good naming annotations is absolutely critical to not introducing bugs with fixed-point arithmetic. I use an explicit version of the Q notation which handles any split between M integer bits and N fraction bits by appending _Q<M>_<N> to the name of the variable. This also makes it possible to include the signedness. There are no run-time performance penalties for this. Example:
uint8_t length_Q2_6; // unsigned, 2 bit integer, 6 bit fraction
int32_t sensor_calibration_Q10_21; // signed (1 bit), 10 bit integer, 21 bit fraction.
/*
* Calculations with the bc program (with '-l' argument):
*
* sqrt(3)
* 1.73205080756887729352
*
* obase=16
* sqrt(3)
* 1.BB67AE8584CAA73B0
*/
const uint32_t SQRT_3_Q7_25 = 1 << 25 | 0xBB67AE85U >> 7; /* Unsigned shift super important here! */
In case someone has not fully understood why such annotation is extremely important:
Can you spot whether there is a bug in the following two examples?
Example 1:
speed_fraction = fix32_udiv(25, speed_percent << 25, 100 << 25);
squared_speed = fix32_umul(25, speed_fraction, speed_fraction);
tmp1 = fix32_umul(25, squared_speed, SQRT_3);
tmp2 = fix32_umul(12, tmp1 >> (25-12), motor_volt << 12);
Example 2:
speed_fraction_Q7_25 = fix32_udiv(25, speed_percent << 25, 100 << 25);
squared_speed_Q7_25 = fix32_umul(25, speed_fraction_Q7_25, speed_fraction_Q7_25);
tmp1_Q7_25 = fix32_umul(25, squared_speed_Q7_25, SQRT_3_Q1_31);
tmp2_Q20_12 = fix32_umul(12, tmp1_Q7_25 >> (25-12), motor_volt << 12);
Imagine if one file contained #define SQRT_3 (1 << 25 | 0xBB67AE85U >> 7) and another file contained #define SQRT_3 (1 << 31 | 0xBB67AE85U >> 1) and code was moved between those files. For example 1 this has a high chance of going unnoticed and introduce the bug that is present in example 2 which here is done deliberately and has a zero chance of being done accidentally.
Actually the Q format is the most used representation in commercial applications: you use it when you need to deal with fractional numbers FAST and your processor does not have an FPU (floating point unit), so it cannot use the float and double data types natively and has to emulate instructions for them, which is very expensive.
Usually you use the Q format to represent only the fractional part, though this is not a must; you get more precision for your representation that way. Here's what you need to consider:
number of bits you use (Q15 uses 16-bit data types, usually short int)
the first bit is the sign bit (out of 16 bits you are left with 15 for the data value)
the rest of the bits are used to store the fractional part of your number.
since you are representing fractional numbers, your value is somewhere in [0,1)
you can choose to use some bits for the integer part as well, but you would lose precision - e.g. if you wanted to represent 3.3 in Q format, you would need 1 bit for sign, 2 bits for the integer part, and would be left with 13 bits for the fractional part (assuming you are using a 16-bit representation) -> this format is called 2Q13
Example: Say you want to represent 0.3 in Q15 format; you apply the Rule of Three:
1 = 2^15 = 32768 = 0x8000
0.3 = X
-------------
X = 0.3*32768 ≈ 9830 = 0x2666
You lost precision by doing this but at least the computation is fast now.
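The same rule of three in code (a sketch assuming Q15 stored in an int16_t, valid for values in [-1, 1)):
#include <stdint.h>

int16_t to_q15(double v)    { return (int16_t)(v * 32768.0); }
double  from_q15(int16_t q) { return q / 32768.0; }
/* to_q15(0.3) == 9830 == 0x2666; from_q15(9830) is roughly 0.29998 */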
In C, you can't use a user defined type like a builtin one. If you want to do that, you need to use C++. In that language you can define a class for your fixed point type, overload all the arithmetic operators (+, -, *, /, %, +=, -=, *=, /=, %=, --, ++, cast to other types), so that usage of the instances of this class really behave like the builtin types.
In C, you need to do what you want explicitly. There are two basic approaches.
Approach 1: Do the fixed point adjustments in the user code.
This is overhead-free, but you need to remember to do the correct adjustments. I believe, it is easiest to just add the number of past point bits to the end of the variable name, because the type system won't do you much good, even if you typedef'd all the point positions you use. Here is an example:
int64_t a_7 = (int64_t)(7.3*(1<<7)); //a variable with 7 past point bits
int64_t b_5 = (int64_t)(3.78*(1<<5)); //a variable with 5 past point bits
int64_t sum_7 = a_7 + (b_5 << 2); //to add those two variables, we need to adjust the point position in b
int64_t product_12 = a_7 * b_5; //the product produces a number with 12 past point bits
Of course, this is a lot of hassle, but at least you can easily check at every point whether the point adjustment is correct.
Approach 2: Define a struct for your fixed point numbers and encapsulate the arithmetic on it in a bunch of functions. Like this:
typedef struct FixedPoint {
    int64_t data;
    uint8_t pointPosition;
} FixedPoint;

FixedPoint fixed_add(FixedPoint a, FixedPoint b) {
    if(a.pointPosition >= b.pointPosition) {
        return (FixedPoint){
            .data = a.data + (b.data << (a.pointPosition - b.pointPosition)),
            .pointPosition = a.pointPosition
        };
    } else {
        return (FixedPoint){
            .data = (a.data << (b.pointPosition - a.pointPosition)) + b.data,
            .pointPosition = b.pointPosition
        };
    }
}
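Usage might look like this (my example, reusing the constants from approach 1):
FixedPoint a = { .data = (int64_t)(7.3  * (1 << 7)), .pointPosition = 7 };
FixedPoint b = { .data = (int64_t)(3.78 * (1 << 5)), .pointPosition = 5 };
FixedPoint s = fixed_add(a, b); /* result carries pointPosition = 7 */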
This approach is a bit cleaner in the usage, however, it introduces significant overhead. That overhead consists of:
The function calls.
The copying of the structs for parameter and result passing, or the pointer dereferences if you use pointers.
The need to calculate the point adjustments at runtime.
This is pretty much similar to the overhead of a C++ class without templates. Using templates would move some decisions back to compile time, at the cost of losing flexibility.
This object based approach is probably the most flexible one, and it allows you to add support for non-binary point positions in a transparent way.

Quick integer logarithm for special case

I have integer values ranging from 32-8191 which I want to map to a roughly logarithmic scale. If I were using base 2, I could just count the leading zero bits and map them into 8 slots, but this is too coarse-grained; I need 32 slots (and more would be better, but I need them to map to bits in a 32-bit value), which comes out to a base of roughly 1.18-1.20 for the logarithm. Anyone have some tricks for computing this value, or a reasonable approximation, very fast?
My intuition is to break the range down into 2 or 3 subranges with conditionals, and use a small lookup table for each, but I wonder if there's some trick I could do with count-leading-zeros then refining the result, especially since the results don't have to be exact but just roughly logarithmic.
Why not use the next two bits after the leading bit? You can first partition the number into 8 bins by the position of the leading bit, and use the next two bits to further divide each bin into four. In this case, you can use a simple shift operation which is very fast; a C sketch of this scheme follows at the end of this answer.
Edit: If you think using the logarithm is a viable solution, here is the general algorithm:
Let a be the base of the logarithm, and let the range be (b_min, b_max) = (32, 8191). You can find the base using the formula:
log(b_max/b_min) / log(a) = 32 bins
which gives you a ≈ 1.1892026. If you use this a as the base of the logarithm, you can map the range (b_min, b_max) into (log_a(b_min), log_a(b_max)) = (20.0004, 52.0004).
Now you only need to subtract 20.0004 from every element to get the range (0, 32). This guarantees that the elements are logarithmically uniform. Done.
Note: an element may fall out of range because of numerical error. You should compute the exact value yourself.
Note 2: log_a(b) = log(b)/log(a)
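Here is the promised C sketch of this scheme (mine; assumes GCC/Clang's __builtin_clz and v in [32, 8191]):
#include <stdint.h>

static inline unsigned log_slot(uint32_t v)
{
    unsigned nbits = 32 - __builtin_clz(v); /* 6..13 for this input range */
    return ((nbits - 6) << 2) | ((v >> (nbits - 3)) & 0x3); /* 8 bins x 4 sub-bins = 32 slots */
}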
Table lookup is one option, and that table isn't that big. If an 8K table is too big, and you have a count leading zeros instruction, you can use a table lookup on the top few bits.
nbits = 32 - count_leading_zeros(v) # number of bits in number
highbits = v >> (nbits - 4) # top 4 bits. Top bit is always a 1.
log_base_2 = nbits + table[highbits & 0x7]
You populate the table with some approximation of log_2:
table[i] = approx(log_2(1 + i/8.0))
If you want to stay in integer arithmetic, multiply the last line by a convenient factor.
An answer I just came up with, based on IEEE 754 floating point:
((union { float v; uint32_t r; }){ x }.r >> 21 & 127) - 16
It maps 32-8192 onto 0-31 roughly logarithmically (same as hwlau's answer).
Improved version (cut out useless bitwise and):
((union { float v; uint32_t r; }){ x }.r >> 21) - 528
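Written out long-hand (same trick; assumes 32-bit IEEE 754 floats, with the type pun done through a union as C allows):
#include <stdint.h>

unsigned log_slot_float(uint32_t x) /* x in [32, 8191] */
{
    union { float v; uint32_t r; } u = { (float)x };
    return (u.r >> 21) - 528; /* 5 low exponent bits + top 2 mantissa bits */
}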

How to define 2-bit numbers in C, if possible?

For my university project I'm simulating a process called random sequential adsorption.
One of the things I have to do involves randomly depositing squares (which cannot overlap) onto a lattice until there is no more room left, repeating the process several times in order to find the average 'jamming' coverage %.
Basically I'm performing operations on a large array of integers, of which 3 possible values exist: 0, 1 and 2. The sites marked with '0' are empty, the sites marked with '1' are full. Initially the array is defined like this:
int i, j;
int n = 1000000000;
int array[n][n];
for(j = 0; j < n; j++)
{
    for(i = 0; i < n; i++)
    {
        array[i][j] = 0;
    }
}
Say I want to deposit 5*5 squares randomly on the array (that cannot overlap), so that the squares are represented by '1's. This would be done by choosing the x and y coordinates randomly and then creating a 5*5 square of '1's with the top-left point of the square starting at that point. I would then mark sites near the square as '2's. These represent the sites that are unavailable, since depositing a square at those sites would cause it to overlap an existing square. This process would continue until there is no more room left to deposit squares on the array (basically, no more '0's left in the array).
Anyway, to the point. I would like to make this process as efficient as possible, by using bitwise operations. This would be easy if I didn't have to mark sites near the squares. I was wondering whether creating a 2-bit number would be possible, so that I can account for the sites marked with '2'.
Sorry if this sounds really complicated, I just wanted to explain why I want to do this.
You can't create a datatype that is 2-bits in size since it wouldn't be addressable. What you can do is pack several 2-bit numbers into a larger cell:
struct Cell {
    unsigned int a : 2;
    unsigned int b : 2;
    unsigned int c : 2;
    unsigned int d : 2;
};
This specifies that each of the members a, b, c and d should occupy two bits in memory.
EDIT: This is just an example of how to create 2-bit variables, for the actual problem in question the most efficient implementation would probably be to create an array of int and wrap up the bit fiddling in a couple of set/get methods.
Instead of a two-bit array you could use two separate 1-bit arrays. One holds filled squares and one holds adjacent squares (or available squares if this is more efficient).
I'm not really sure that this has any benefit though over packing 2-bit fields into words.
I'd go for byte arrays unless you are really short of memory.
The basic idea
Unfortunately, there is no way to do this in C. You can create arrays of 1 byte, 2 bytes, etc., but you can't create arrays of bits.
The best thing you can do, then, is to write a new library for yourself, which makes it look like you're dealing with arrays of 2 bits, but in reality does a lot of hard work. The same way that the string libraries give you functions that work on "strings" (which in C are just arrays), you'll be creating a new library which works on "bit arrays" (which in reality will be arrays of integers, with a few special functions to deal with them as-if they were arrays of bits).
NOTE: If you're new to C, and haven't learned the ideas of "creating a new library/module", or the concept of "abstraction", then I'd recommend learning about them before you continue with this project. Understanding them is IMO more important than optimizing your program to use a little less space.
How to implement this new "library" or module
For your needs, I'd create a new module called "2-bit array", which exports functions for dealing with the 2-bit arrays, as you need them.
It would have a few functions that deal with setting/reading bits, so that you can work with it as if you have an actual array of bits (you'll actually have an array of integers or something, but the module will make it seem like you have an array of bits).
Using this module would look something like this:
// This is just an example of how to use the functions in the twoBitArray library.
twoB my_array = Create2BitArray(size); // This will "create" a twoBitArray and return it.
SetBit(my_array, 5, 1); // Set bit 5 to 1.
bit b = GetBit(my_array, 5); // Where bit is typedef'd to an int by your module.
What the module will actually do is implement all these functions using regular-old arrays of integers.
For example, the function GetBit(), for GetBit(my_arr, 17), will calculate which integer of your array bit 17 falls in (depending on sizeof(int), obviously), and extract the bit using bitwise operations.
You can compact one dimension of the array into sub-integer cells. To convert a coordinate (let's say x, for example) to a position inside a byte, with four 2-bit cells per byte:
uint8_t cell = array[i][x / 4];
uint8_t shift = (x % 4) * 2;
uint8_t mask = 0x03 << shift;
uint8_t data = (cell & mask) >> shift;
To write data, do the reverse: clear the cell with the inverted mask, then OR in the new value (see the sketch below).
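A hypothetical write helper in the same spirit (my sketch, not from the answer; row is one row of the byte-packed array):
#include <stdint.h>

void set2(uint8_t *row, int x, uint8_t value) /* value in 0..3 */
{
    unsigned shift = (x % 4) * 2;
    row[x / 4] = (uint8_t)((row[x / 4] & ~(0x03u << shift)) | ((value & 0x03u) << shift));
}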
