I want to find the fastest possible way of multiplying two small boolean matrices, where small means 8x8, 9x9, ..., 16x16. This routine will be used a lot, so it needs to be very efficient; please don't suggest that the straightforward solution should be fast enough.
For the special cases of 8x8 and 16x16 I already have fairly efficient implementations, based on the solution found here, where we treat the entire matrix as a uint64_t or uint64_t[4] respectively. On my machine this is roughly 70-80 times faster than the straightforward implementation.
However, in the case of 8 < k < 16, I don't really know how I can leverage any reasonable representation in order to enable such clever tricks as above.
So basically, I'm open to any suggestions using any kind of representation (of the matrices) and function signature. You may assume that this targets either a 32-bit or 64-bit architecture (pick whichever best suits your suggestion).
Given two 4x4 matrices a = 0010,0100,1111,0001 and b = 1100,0001,0100,0100, one can first calculate the transpose b' = 1000,1011,0000,0100.
The resulting matrix is then M(i,j) = a x b mod 2 == popcount(a[i] & b'[j]) & 1; // or parity
From that one can notice that the complexity only grows as n^2, as long as the bit vector fits in a computer word.
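As an illustration, here is a minimal sketch of that scheme for W <= 16 (my own helper, not from the original post): one row per uint16_t, with bit j of a[i] holding element (i,j), and bt holding the precomputed transpose of b. __builtin_popcount assumes GCC/Clang.

#include <stdint.h>

void mul_gf2_popcnt(int W, const uint16_t *a, const uint16_t *bt, uint16_t *c)
{
    for (int i = 0; i < W; ++i) {
        uint16_t row = 0;
        for (int j = 0; j < W; ++j)
            row |= (uint16_t)((__builtin_popcount(a[i] & bt[j]) & 1) << j); /* parity of the dot product */
        c[i] = row;
    }
}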
This can be sped up for 8x8 matrices at least, provided that some special permutation and bit-selection operations are available. One can iterate exactly N times with NxN bits in a vector (so 16x16 is pretty much the limit).
Each step consists of accumulating, i.e. Result(n+1) = Result(n) XOR A(n) .& B(n), where Result(0) = 0, A(n) is A <<< n ('<<<' being columnwise rotation of elements), and B(n) copies diagonal elements from the matrix B:
    a b c          a e i          d h c          g b f
B = d e f   B(0) = a e i   B(1) = d h c   B(2) = g b f
    g h i          a e i          d h c          g b f
And after thinking about it a bit further, a better option is to rotate matrix B row-wise ('^^^') and select A(n) as column-copied diagonals from A:
    a b c          a a a          b b b          c c c
A = d e f   A(0) = e e e   A(1) = f f f   A(2) = d d d
    g h i          i i i          g g g          h h h
EDIT To benefit later readers, here is the full solution for WxW matrix multiplication, W <= 16, in portable C.
#include <stdint.h>

#define W 16 /* matrix dimension; pad smaller matrices with zeros (or ones on the diagonal) */

void matrix_mul_gf2(uint16_t *a, uint16_t *b, uint16_t *c)
{
    /* these arrays can be read in two successive xmm registers or in a single ymm */
    uint16_t D[16];       /* temporary */
    uint16_t C[16] = {0}; /* result */
    uint16_t B[16];
    uint16_t A[16];
    int i, j;
    uint16_t top_row;
    /* Preprocess B (while reading from input):
       "un-tilt" the diagonal to bit position 0x8000 */
    for (i = 0; i < W; i++) B[i] = (b[i] << i) | (b[i] >> (W - i));
    for (i = 0; i < W; i++) A[i] = a[i]; /* just read in matrix 'a' */
    /* Loop W times.
       Can be parallelized 4x with MMX, 8x with XMM and 16x with YMM instructions. */
    for (j = 0; j < W; j++) {
        for (i = 0; i < W; i++) D[i] = ((int16_t)B[i]) >> 15; /* copy the sign bit over the whole row */
        for (i = 0; i < W; i++) B[i] <<= 1;                   /* prepare B for the next round */
        for (i = 0; i < W; i++) C[i] ^= A[i] & D[i];          /* accumulate the partial product */
        /* rotate the rows of A up by one */
        top_row = A[0];
        for (i = 0; i < W - 1; i++) A[i] = A[i + 1];
        A[W - 1] = top_row;
    }
    for (i = 0; i < W; i++) c[i] = C[i]; /* return result */
}
How about padding it out to the next "clever" (e.g. 8 or 16) size, with all '1' on the diagonal?
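As a sketch (my own illustration, assuming the 8-rows-in-a-uint64_t layout used by the mul8x8 answer below, with row i in bits 8*i..8*i+7): padding with ones on the unused part of the diagonal embeds the kxk matrix as a block of an 8x8 one, and the fast 8x8 routine then yields the kxk product in the low k rows and columns.

#include <stdint.h>

uint64_t pad_to_8x8(const uint8_t *rows, int k) /* k <= 8, row i in rows[i], bits 0..k-1 */
{
    uint64_t M = 0;
    for (int i = 0; i < 8; ++i) {
        uint8_t row = (i < k) ? rows[i] : (uint8_t)(1u << i); /* identity padding */
        M |= (uint64_t)row << (8 * i);
    }
    return M;
}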
Depending on your application, storing both the matrix and its transpose together might help. You will save a lot of time that would otherwise be spent transposing during matrix multiplications, at the expense of some memory and some extra operations.
There is a faster method for multiplying 8x8 matrices using 64-bit multiplication along with some simple bit trickery, which works for either GF(2) or boolean algebra.
Assuming the three matrices are each packed in 8 consecutive rows of 8 bits inside a 64-bit int, we can use multiplication to scatter the bits and do the job in just one for loop:
uint64_t mul8x8 (uint64_t A, uint64_t B) {
    const uint64_t ROW = 0x00000000000000FF;
    const uint64_t COL = 0x0101010101010101;
    uint64_t C = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t p = COL & (A >> i);     // bit i of each row of A, one bit per byte
        uint64_t r = ROW & (B >> i*8);   // row i of B, in the low byte
        C |= (p*r);                      // the multiply broadcasts row i of B to every
                                         // row of A whose bit i is set; use ^ for GF(2) instead
    }
    return C;
}
The code for 16x16 is straightforward if you can afford blocking the rows for improved efficiency.
This trick is also used extensively in high-performance linear algebra libraries, and consists in partitioning the matrix into N/M x N/M blocks of MxM submatrices, with M = 2^m chosen to maximize locality in cache. The usual way to deal with N % M != 0 is to pad rows and columns with 0s so one can use the same algorithm for all block multiplications.
We can apply the same ideas to boolean matrices of variable dimension 8 <= N <= 16, as long as we can afford to have the matrices represented internally in a row-blocking format. We just assume the matrix is 16x16 and the last 16-N rows and columns are filled with 0s:
void mul16x16 (uint64_t C[2][2], const uint64_t A[2][2], const uint64_t B[2][2]) {
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            C[i][j] = mul8x8(A[i][0], B[0][j])
                    | mul8x8(A[i][1], B[1][j]); // once again, use ^ instead for GF(2)
}
Notice we have done a 16x16 matrix multiplication in just 8 calls to mul8x8, i.e. 64 integer products, plus some bit operations.
Also, mul8x8 can be much improved with modern SSE/AVX vector instructions. In theory it is possible to perform all 8 products in parallel with one AVX512 instruction (we still need to scatter the data to the ZMM register first) and then reduce horizontally using log2(8) = 3 instructions.
TL;DR
For ARM intrinsics, how do you feed a 128-bit variable of type uint8x16_t into a function expecting uint16x8_t?
EXTENDED VERSION
Context: I have a greyscale image, 1 byte per pixel. I want to downscale it by a factor of 2. For each 2x2 input box, I want to take the minimum pixel. In plain C, the code looks like this:
#define min(a,b) ((a) < (b) ? (a) : (b)) /* assumed helper */

for (int y = 0; y < rows; y += 2) {
    uint8_t* p_out = outBuffer + (y / 2) * outStride;
    uint8_t* p_in = inBuffer + y * inStride;
    for (int x = 0; x < cols; x += 2) {
        *p_out = min(min(p_in[0], p_in[1]), min(p_in[inStride], p_in[inStride + 1]));
        p_out++;
        p_in += 2;
    }
}
where both rows and cols are multiples of 2. I call "stride" the step in bytes that it takes to go from one pixel to the pixel immediately below it in the image.
Now I want to vectorize this. The idea is:
1. take 2 consecutive rows of pixels
2. load 16 bytes in a from the top row, and load the 16 bytes immediately below in b
3. compute the minimum byte by byte between a and b. Store in a.
4. create a copy of a shifted right by 1 byte (8 bits). Store it in b.
5. compute the minimum byte by byte between a and b. Store in a.
6. store every second byte of a in the output image (discarding half of the bytes)
I want to write this using NEON intrinsics. The good news is that for each step there exists an intrinsic that matches it.
For example, at point 3 one can use (from here):
uint8x16_t vminq_u8(uint8x16_t a, uint8x16_t b);
And at point 4 one can use one of the following, with a shift of 8 bits (from here):
uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);
That's because I do not care what happens to bytes 1, 3, 5, 7, 9, 11, 13, 15, since they will be discarded from the final result anyway. (The correctness of this has been verified and it's not the point of the question.)
HOWEVER, the output of vminq_u8 is of type uint8x16_t, and it is NOT compatible with the shift intrinsics that I would like to use. In C++ I addressed the problem with this templated data structure, but I have been told that the problem cannot be reliably addressed using a union (edit: although that answer refers to C++, and in fact in C type punning IS allowed), nor by using pointer casts, because this would break the strict aliasing rule.
What is the way to combine different data types while using ARM Neon intrinsics?
For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operator.
In some situations, you might want to treat a vector as having a different type, without changing its value. A set of intrinsics is provided to perform this type of conversion.
So, assuming a and b are declared as:
uint8x16_t a, b;
Your point 4 can be written as(*):
b = vreinterpretq_u8_u16(vrshrq_n_u16(vreinterpretq_u16_u8(a), 8) );
However, note that unfortunately this does not address data types using an array of vector types, see ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?
(*) It should be said that this is much more cumbersome than the equivalent (in this specific context) SSE code, as SSE has only one 128-bit integer data type (namely __m128i):
__m128i b = _mm_srli_si128(a,1);
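For completeness, the whole inner loop from the question might then look like the following sketch (my own untested illustration; the function name is hypothetical, cols is assumed to be a multiple of 16, and I use the plain vshrq_n_u16 rather than the rounding vrshrq_n_u16 so the surviving even bytes are exact):

#include <arm_neon.h>
#include <stdint.h>

void downscale2x_min(const uint8_t* inBuffer, int inStride,
                     uint8_t* outBuffer, int outStride,
                     int rows, int cols)
{
    for (int y = 0; y < rows; y += 2) {
        const uint8_t* p_in = inBuffer + y * inStride;
        uint8_t* p_out = outBuffer + (y / 2) * outStride;
        for (int x = 0; x < cols; x += 16) {
            uint8x16_t top = vld1q_u8(p_in);             // 16 pixels from the top row
            uint8x16_t bot = vld1q_u8(p_in + inStride);  // the 16 pixels below them
            uint8x16_t m = vminq_u8(top, bot);           // vertical minimum (step 3)
            // step 4: shift each 16-bit lane right by 8 bits, via vreinterpret
            uint16x8_t s = vshrq_n_u16(vreinterpretq_u16_u8(m), 8);
            // step 5: horizontal minimum lands in the even bytes
            uint8x16_t m2 = vminq_u8(m, vreinterpretq_u8_u16(s));
            // step 6: narrowing keeps the low byte of each lane, i.e. every second byte
            vst1_u8(p_out, vmovn_u16(vreinterpretq_u16_u8(m2)));
            p_in += 16;
            p_out += 8;
        }
    }
}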
I am working on Project Euler Problem 513, Integral median:
ABC is an integral sided triangle with sides a ≤ b ≤ c. mc is the median connecting C and the midpoint of AB. F(n) is the number of such triangles with c ≤ n for which mc has integral length as well. F(10) = 3 and F(50) = 165.
Find F(100000).
Analysis:
a <= b <= c <= n == 100000
ABC is a triangle, so abs(a-b) < c < a+b
mc = sqrt(2*a^2 + 2*b^2 - c^2) / 2 (median length formula, Wikipedia)
mc is an integer, so 2*a^2 + 2*b^2 - c^2 must be a perfect square and divisible by 4.
Code:
#include <stdio.h>
#include <math.h>

#define N 100000
#define MAX(a,b) (((a)>(b))?(a):(b))

int main(void) {
    unsigned long count = 0;
    unsigned long a, b, c;
    double mc;
    for (a = 1; a <= N; a++) {
        printf("%lu\n", a); /* progress */
        for (b = a; b <= N; b++)
            for (c = MAX(b, b - a); c <= N && c < a + b; c++) {
                mc = sqrt(2.0 * a * a + 2.0 * b * b - c * c) / 2.0;
                if (mc - (unsigned long)mc == 0)
                    count++;
            }
    }
    printf("\ncpt == %lu\n", count);
    return 0;
}
Issues:
It works fine for small n, but the complexity of the solution is too high. I assume it is O(n^3) (am I wrong?), which would take days for n = 100000.
How could I improve this, mathematically or algorithmically?
Updates
I got those suggestions:
Calculating the power of a outside the b/c loops and the power of b outside the c loop. This improved the performance slightly.
c cannot be odd, so a and b must have the same parity. This improved the performance 4 times.
Using threads to divide the work over several cores. This may improve it by a factor close to the number of cores.
A mathematical solution posted on math.stackexchange. It claims O(N^(5/2)) for a basic solution and can achieve O(N^2) by using O(N^2) memory. I haven't tested it yet.
Since this is a Project Euler problem, you are supposed to be able to do it in about a minute of computing time on a modern computer. They don't always stick to that, but it indicates that a running time of k*n^2 or k*n^2*log(n) is probably fine if the constant isn't too bad, but probably not k*n^2.5 or k*n^3.
As SleuthEye commented, the side c must be even or else the inside of the square root would have to be odd, so taking the square root and dividing by 2 could not make an integer.
You can simplify the equation to 4(mc^2+(c/2)^2) = 2(a^2+b^2).
Here is one approach: Create two dictionaries, left and right. For each, let the keys be possible values of that side of the equation, and let the values be a list of the pairs (mc,c/2) or (a,b) which produce the key. For the right dictionary, we only need to consider where a and b have the same parity, and where 1<=a<=b<=n. For the left, we only need to consider 1<=c/2<=n/2 and 1<=mc<=sqrt(3)/2 n since 4mc^2 = 2a^2+2b^2-c^2 <= 3b^2 <=3n^2.
Then go through the possible keys, and compare the elements of the values from each dictionary, finding the number of compatible ((mc,c/2),(a,b)) pairs where b <= c < a+b. This inner step is not constant time, but the maximum and average lengths of the lists are not too long. The ways to write n as a sum of two squares roughly correspond (up to units) to the ways to factor n in the Gaussian integers, and just as the largest number of factors of an integer does not grow too rapidly, the same is true in the Gaussian integers. This step takes O(n^epsilon) time for any epsilon>0. So, the total running time is O(n^(2+epsilon)) for any epsilon>0.
In practice, if you run out of memory, you can construct partial dictionaries where the keys are restricted to be in particular ranges. This parallelizes well, too.
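As a rough illustration, here is a sketch of the two-dictionary idea in C++ (my own code, practical only for small n as written; for n = 100000 you would restrict the keys to ranges as just described). The key is both sides of the equivalent equation 2*(mc^2 + (c/2)^2) = a^2 + b^2:

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

uint64_t F(uint64_t n)
{
    // left[key]  -> (mc, c/2) pairs; right[key] -> (a, b) pairs, a <= b, same parity
    std::unordered_map<uint64_t, std::vector<std::pair<uint64_t, uint64_t>>> left, right;
    for (uint64_t a = 1; a <= n; ++a)
        for (uint64_t b = a; b <= n; b += 2)                   // b has the same parity as a
            right[a * a + b * b].push_back({a, b});
    for (uint64_t h = 1; 2 * h <= n; ++h)                      // h = c/2
        for (uint64_t mc = 1; 4 * mc * mc <= 3 * n * n; ++mc)  // mc <= sqrt(3)/2 * n
            left[2 * (mc * mc + h * h)].push_back({mc, h});
    uint64_t count = 0;
    for (auto& [key, lpairs] : left) {
        auto it = right.find(key);
        if (it == right.end()) continue;
        for (auto& lp : lpairs) {
            uint64_t c = 2 * lp.second;  // lp is (mc, h); c = 2h
            for (auto& ab : it->second)
                if (ab.second <= c && c < ab.first + ab.second)  // b <= c < a+b
                    ++count;
        }
    }
    return count;
}

int main() { std::printf("F(50) = %llu\n", (unsigned long long)F(50)); } // expect 165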
I am trying to implement big integer addition in CUDA using the following code:
__global__ void add(unsigned *A, unsigned *B, unsigned *C /* output */, int radix) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    A[id] = A[id] + B[id];
    C[id] = A[id] / radix;
    __syncthreads();
    A[id] = A[id] % radix + ((id > 0) ? C[id - 1] : 0);
    __syncthreads();
    C[id] = A[id];
}
but it does not work properly, and I also don't know how to handle the extra carry bit. Thanks.
TL;DR build a carry-lookahead adder where each individual additionner adds modulo radix, instead of modulo 2
Additions need incoming carries
The problem in your model is that you have a rippling carry. See ripple-carry adders.
If you were in an FPGA that wouldn't be a problem because they have dedicated logic to do that fast (carry chains, they're cool). But alas, you're on a GPU !
That is, for a given id, you only know the input carry (thus whether you are going to sum A[id]+B[id] or A[id]+B[id]+1) when all the sums with smaller id values have been computed. As a matter of fact, initially, you only know the first carry.
A[3]+B[3] + ?   A[2]+B[2] + ?   A[1]+B[1] + ?   A[0]+B[0] + 0
     |               |               |               |
     v               v               v               v
   C[3]            C[2]            C[1]            C[0]
Characterize the carry output
And each sum also has a carry output, which isn't on the drawing. So you have to think of the addition in this larger scheme as a function with 3 inputs and 2 outputs : (C, c_out) = add(A, B, c_in)
In order to not wait O(n) for the sum to complete (where n is the number of items your sum is cut into), you can precompute all the possible results at each id. That isn't such a huge load of work, since A and B don't change, only the carries. So you have 2 possible outputs : (c_out0, C) = add(A, B, 0) and (c_out1, C') = add(A, B, 1).
Now with all these results, we need to basically implement a carry lookahead unit.
For that, we need to figure out two functions of each sum's carry output, P and G:
P a.k.a. all of the following definitions
Propagate
"if a carry comes in, then a carry will go out of this sum"
c_out1 && !c_out0
A + B == radix-1
G a.k.a. all of the following definitions
Generate
"whatever carry comes in, a carry will go out of this sum"
c_out1 && c_out0
c_out0
A + B >= radix
So in other terms, c_out = G or (P and c_in). So now we have the start of an algorithm that can tell us, for each id, the carry output as a function of its carry input directly:
At each id, compute C[id] = A[id]+B[id]+0
Get G[id] = C[id] > radix -1
Get P[id] = C[id] == radix-1
Logarithmic tree
Now we can finish in O(log(n)), even though tree-like things are nasty on GPUs - but still shorter than waiting. Indeed, from 2 additions next to each other, we can get a group G and a group P:
For id and id+1 :
step = 2
if id % step == 0, do steps 6 through 10, otherwise, do nothing
group_P = P[id] and P[id+step/2]
group_G = (P[id+step/2] and G[id]) or G[id+step/2]
c_in[id+step/2] = G[id] or (P[id] and c_in[id])
step = step * 2
if step < n, go to 5
At the end (after repeating steps 5-10 for every level of your tree with less ids every time), everything will be expressed in terms of Ps and Gs which you computed, and c_in[0] which is 0. On the wikipedia page there are formulas for the grouping by 4 instead of 2, which will get you an answer in O(log_4(n)) instead of O(log_2(n)).
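As an aside, here is a hypothetical sequential model of such a lookahead tree (my own illustration, written as a Kogge-Stone-style prefix over the (G, P) pairs; on the GPU, each inner loop iteration would be one thread):

#include <cstdint>
#include <vector>

// After the scan, G[i] holds the carry OUT of limb i (the carry into limb 0 is 0),
// so the carry INTO limb i is G[i-1].
std::vector<uint8_t> carries(std::vector<uint8_t> G, std::vector<uint8_t> P)
{
    size_t n = G.size();
    for (size_t step = 1; step < n; step *= 2) {
        std::vector<uint8_t> G2 = G, P2 = P;
        for (size_t i = step; i < n; ++i) {       // one GPU thread per i
            G2[i] = G[i] | (P[i] & G[i - step]);  // group generate
            P2[i] = P[i] & P[i - step];           // group propagate
        }
        G.swap(G2);
        P.swap(P2);
    }
    std::vector<uint8_t> c_in(n, 0);
    for (size_t i = 1; i < n; ++i) c_in[i] = G[i - 1];
    return c_in;
}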
Hence the end of the algorithm :
At each id, get c_in[id]
return (C[id]+c_in[id]) % radix
Take advantage of hardware
What we really did in this last part was mimic the circuitry of a carry-lookahead adder with logic. However, we already have adders in the hardware that do similar things (by definition).
Let us replace our definitions of P and G based on radix with those based on 2, like the logic inside our hardware, mimicking a sum of 2 bits a and b at each stage: P = a ^ b (xor), and G = a & b (logical and). In other words, a = P or G and b = G. So if we create an intP integer and an intG integer, where each bit is respectively the P and G we computed from each id's sum (limiting us to 64 sums), then the addition (intP | intG) + intG has the exact same carry propagation as our elaborate logical scheme.
The reduction to form these integers will still be a logarithmic operation I guess, but that was to be expected.
The interesting part is that each bit of the sum is a function of its carry input. Indeed, every bit of the sum is eventually a function of 3 bits: (a + b + c_in) % 2.
If at that bit P == 1, then a + b == 1, thus (a + b + c_in) % 2 == !c_in.
Otherwise, a + b is either 0 or 2, and (a + b + c_in) % 2 == c_in.
Thus we can trivially form the integer (or rather bit array) int_cin = ((intP | intG) + intG) ^ intP, with ^ being xor.
Thus we have an alternate ending to our algorithm, replacing steps 4 and later :
at each id, shift P and G by id: P = P << id and G = G << id
do an OR-reduction to get intG and intP, which are the OR of all the P and G for id 0..63
compute (once) int_cin = ((intP | intG) + intG) ^ intP
at each id, get c_in = (int_cin & (1 << id)) ? 1 : 0;
return (C[id] + c_in) % radix
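A minimal host-side sketch of this trick (my own illustration, not CUDA), in radix 10 with 4 limbs:

#include <cstdint>
#include <cstdio>

int main()
{
    const unsigned radix = 10;
    unsigned A[4] = {9, 9, 9, 0}, B[4] = {1, 0, 0, 0}, C[4]; // 999 + 1, least significant limb first
    uint64_t intP = 0, intG = 0;
    for (int i = 0; i < 4; ++i) {               // one GPU thread per limb
        unsigned s = A[i] + B[i];
        if (s == radix - 1) intP |= 1ull << i;  // propagate
        if (s >= radix)     intG |= 1ull << i;  // generate
    }
    // the hardware adder replays the whole carry chain for us:
    uint64_t int_cin = ((intP | intG) + intG) ^ intP;  // bit i = carry into limb i
    for (int i = 0; i < 4; ++i)
        C[i] = (A[i] + B[i] + ((int_cin >> i) & 1)) % radix;
    printf("%u%u%u%u\n", C[3], C[2], C[1], C[0]); // prints 1000
    return 0;
}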
PS: also, watch out for integer overflow in your arrays if radix is big. If it isn't, then the whole thing doesn't really make sense, I guess...
PPS: in the alternate ending, if you have more than 64 items, characterize them by their P and G as if radix were 2^64, re-run the same steps at the higher level (reduction, get c_in), and then get back to the lower level and apply step 7 with the carry in from the higher level.
I am using a base-conversion algorithm to generate a permutation from a large integer (split into 32-bit words).
I use a relatively standard algorithm for this:
/* N = count, k is the permutation index (0..N!-1), A[N] contains 0..N-1 */
i = 0;
while (N > 1) {
    swap A[i] and A[i + (k % N)]
    k = k / N
    N = N - 1
    i = i + 1
}
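(For reference, a runnable version of this sketch for word-sized k - my own helper; with large integers, k would be a bignum, but the loop structure is the same:)

#include <cstdint>
#include <utility>
#include <vector>

std::vector<int> nth_permutation(int N, uint64_t k) // assumes k < N!
{
    std::vector<int> A(N);
    for (int v = 0; v < N; ++v) A[v] = v;
    for (int i = 0; N - i > 1; ++i) {
        uint64_t n = (uint64_t)(N - i);   // the pseudocode's shrinking N
        std::swap(A[i], A[i + k % n]);
        k /= n;
    }
    return A;
}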
Unfortunately, the divide and modulo in each iteration add up, especially when moving to large integers - but it seems I could just use multiplication!
/* As before, N is count, K is index, A[N] contains 0..N-1 */
/* Split is arbitrarily 128 (bits), for my current choice of N */
/* "Adjust" is precalculated: (1 << Split)/(N!) */
a = k * Adjust; /* a can be treated as a fixed-point fraction */
i = 0;
while (N > 1) {
    a = a * N;
    index = a >> Split;
    a = a & ((1 << Split) - 1); /* actually, just zeroing a register */
    swap A[i] and A[i + index]
    N = N - 1
    i = i + 1
}
This is nicer, but doing large integer multiplies is still sluggish.
Question 1:
Is there a way of doing this faster?
Eg. Since I know that N*(N-1) is less than 2^32, could I pull out those numbers from one word, and merge in the 'leftovers'?
Or, is there a way to modify an arithmetic decoder to pull out the indices one at a time?
Question 2:
For the sake of curiosity: if I use multiplication to convert a number to base 10 without the adjustment, then the result is multiplied by (10^digits / 2^shift). Is there a tricky way to remove this factor while working with the decimal digits? Even with the adjustment factor, this seems like it would be faster -- why don't standard libraries use this instead of divide and mod?
Seeing that you are talking about numbers like 2^128/(N!), it seems that in your problem N is going to be rather small (N < 35 according to my calculations).
I suggest taking the original algorithm as a starting point; first switch the direction of the loop:
i = 2;
while (i < N) {
    swap A[N - 1 - i] and A[N - i + k % i]
    k = k / i
    i = i + 1
}
Now change the loop to do several permutations per iteration. I guess the speed of division is the same regardless of the number i, as long as i < 2^32.
Split the range 2...N-1 into sub-ranges so that the product of the numbers in each sub-range is less than 2^32:
2, 3, 4, ..., 12: product is 479001600
13, 14, ..., 19: product is 253955520
20, 21, ..., 26: product is 3315312000
27, 28, ..., 32: product is 652458240
33, 34, 35: product is 39270
Then, divide the long number k by the products instead of dividing by i. Each iteration will yield a remainder (less than 2^32) and a smaller number k. When you have the remainder, you can work with it in an inner loop using the original algorithm; which will now be faster because it doesn't involve long division.
Here is some code:
static const int rangeCount = 5;
static const int rangeLimit[rangeCount] = {13, 20, 27, 33, 36};
static uint32_t rangeProduct[rangeCount] = {
    479001600,
    253955520,
    3315312000,
    652458240,
    39270
};

for (int rangeIndex = 0; rangeIndex < rangeCount; ++rangeIndex)
{
    // The following two lines involve long division;
    // math libraries probably calculate both quotient and remainder
    // in one function call
    uint32_t rangeRemainder = k % rangeProduct[rangeIndex];
    k /= rangeProduct[rangeIndex];

    // A range starts where the previous range ended
    int rangeStart = (rangeIndex == 0) ? 2 : rangeLimit[rangeIndex - 1];

    // Iterate over the range
    for (int i = rangeStart; i < rangeLimit[rangeIndex] && i < n; ++i)
    {
        // The following two lines involve a 32-bit division;
        // it produces both quotient and remainder in one Pentium instruction
        int remainder = rangeRemainder % i;
        rangeRemainder /= i;
        std::swap(permutation[n - 1 - i], permutation[n - i + remainder]);
    }
}
Of course, this code can be extended into more than 128 bits.
Another optimization could involve extraction of powers of 2 from the products of ranges; this might add a slight speedup by making the ranges longer. Not sure whether this is worthwhile (maybe for large values of N, like N=1000).
I don't know about algorithms, but the ones you use seem pretty simple, so I don't really see how you could optimize the algorithm itself.
You may use alternative approaches:
use ASM (assembler) - from my experience, after a long time spent figuring out how a certain algorithm should be written in ASM, it ended up being slower than the version generated by the compiler :) probably because the compiler also knows how to lay out the code so the CPU cache is used more efficiently, and/or which instructions are actually faster in which situations (this was on GCC/Linux).
use multi-processing:
make your algorithm multithreaded, and make sure you run with the same number of threads as available CPU cores (most CPUs nowadays have multiple cores / multithreading)
make your algorithm capable of running on multiple machines on a network, and devise a way of sending the numbers to the machines, so you may use their CPU power as well.
Given a N-dimensional vector of small integers is there any simple way to map it with one-to-one correspondence to a large integer number?
Say we have an N=3 vector space. Can we represent a vector X=[(int16)x1,(int16)x2,(int16)x3] using an integer (int48)y? The obvious answer is "yes, we can". But the question is: what is the fastest way to do this and its inverse operation?
Will this new 1-dimensional space possess some very special useful properties?
For the above example you have 3 * 32 = 96 bits of information, so without any a priori knowledge you need 96 bits for the equivalent long integer.
However, if you know that your x1, x2, x3, values will always fit within, say, 16 bits each, then you can pack them all into a 48 bit integer.
In either case the technique is very simple: you just use shift, mask and bitwise-or operations to pack/unpack the values.
Just to make this concrete, if you have a 3-dimensional vector of 8-bit numbers, like this:
uint8_t vector[3] = { 1, 2, 3 };
then you can join them into a single (24-bit number) like so:
uint32_t all = (vector[0] << 16) | (vector[1] << 8) | vector[2];
This number would, if printed using this statement:
printf("the vector was packed into %06x", (unsigned int) all);
produce the output
the vector was packed into 010203
The reverse operation would look like this:
uint8_t v2[3];
v2[0] = (all >> 16) & 0xff;
v2[1] = (all >> 8) & 0xff;
v2[2] = all & 0xff;
Of course this all depends on the size of the individual numbers in the vector and the length of the vector together not exceeding the size of an available integer type, otherwise you can't represent the "packed" vector as a single number.
If you have sets Si, i=1..n of size Ci = |Si|, then the cartesian product set S = S1 x S2 x ... x Sn has size C = C1 * C2 * ... * Cn.
This motivates an obvious way to do the packing one-to-one. If you have elements e1,...,en from each set, each in the range 0 to Ci-1, then you give the element e=(e1,...,en) the value e1+C1*(e2 + C2*(e3 + C3*(...Cn*en...))).
You can do any permutation of this packing if you feel like it, but unless the values are perfectly correlated, the size of the full set must be the product of the sizes of the component sets.
In the particular case of three 32 bit integers, if they can take on any value, you should treat them as one 96 bit integer.
If you particularly want to, you can map small values to small values through any number of means (e.g. filling out spheres with the L1 norm), but you have to specify what properties you want to have.
(For example, one can map (n,m) to (max(n,m)-1)^2 + k, where k=n if n<=m and k=n+m if n>m - you can draw this as a picture of filling in a square:

1 2 5   |  draw along the edge of the square this way
4 3 6   v
8 9 7

if you start counting from 1 and only worry about positive values; for integers, you can spiral around the origin.)
I'm writing this without having time to check details, but I suspect the best way is to represent your long integer via modular arithmetic, using k different integers which are mutually prime. The original integer can then be reconstructed using the Chinese remainder theorem. Sorry this is a bit sketchy, but hope it helps.
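To make the residue idea concrete, here is a small sketch (my own illustration with arbitrary example moduli; in practice you would pick word-sized mutually prime moduli and precompute the constants):

#include <cstdint>
#include <cstdio>

// modular inverse via the extended Euclidean algorithm (a coprime to m)
static int64_t inv_mod(int64_t a, int64_t m)
{
    int64_t t = 0, newt = 1, r = m, newr = a % m;
    while (newr != 0) {
        int64_t q = r / newr;
        int64_t tmp = t - q * newt; t = newt; newt = tmp;
        tmp = r - q * newr; r = newr; newr = tmp;
    }
    return t < 0 ? t + m : t;
}

int main()
{
    const int64_t m[3] = {1009, 1013, 1019};  // pairwise coprime moduli
    const int64_t M = m[0] * m[1] * m[2];     // values 0..M-1 are representable
    int64_t x = 123456789 % M;
    int64_t r[3];                             // split: three independent residues
    for (int i = 0; i < 3; ++i) r[i] = x % m[i];
    int64_t y = 0;                            // reconstruct via the Chinese remainder theorem
    for (int i = 0; i < 3; ++i) {
        int64_t Mi = M / m[i];
        y = (y + r[i] * Mi % M * inv_mod(Mi % m[i], m[i])) % M;
    }
    std::printf("x = %lld, reconstructed = %lld\n", (long long)x, (long long)y);
    return 0;
}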
To expand on Rex Kerr's generalised form, in C you can pack the numbers like so:
X = e[n];
X *= MAX_E[n-1] + 1;
X += e[n-1];
/* ... */
X *= MAX_E[0] + 1;
X += e[0];
And unpack them with:
e[0] = X % (MAX_E[0] + 1);
X /= (MAX_E[0] + 1);
e[1] = X % (MAX_E[1] + 1);
X /= (MAX_E[1] + 1);
/* ... */
e[n] = X;
(Where MAX_E[n] is the greatest value that e[n] can have). Note that these maximum values are likely to be constants, and may be the same for every e, which will simplify things a little.
The shifting / masking implementations given in the other answers are a generalisation of this, for cases where the MAX_E + 1 values are powers of 2 (and thus the multiplication and division can be done with a shift, the addition with a bitwise-or and the modulus with a bitwise-and).
There are some totally non-portable ways to make this really fast using packed unions and direct memory access. But it's doubtful that you really need this kind of speed: methods using shifts and masks should be fast enough for most purposes. If not, consider using specialized processors like GPUs, for which vector support is optimized (parallel).
This naive storage does not possess any useful property that I can foresee, except that you can perform some computations (add, sub, logical bitwise operators) on the three coordinates at once, as long as you use positive integers only and you don't overflow for add and sub.
You'd better be quite sure you won't overflow (or won't go negative for sub), or the vector will become garbage.
#include <stdint.h> // for uint8_t

long x;
uint8_t *p = (uint8_t *)&x;

or

union X {
    long L;
    uint8_t A[sizeof(long)/sizeof(uint8_t)];
};

works if you don't care about the endianness. In my experience, compilers generate better code with the union, because it doesn't set off their "you took the address of this, so I must keep it in RAM" rules as quickly. These rules will get set off if you try to index the array with something that the compiler can't optimize away.
If you do care about the endian then you need to mask and shift.
I think what you want can be solved using multi-dimensional space filling curves. The link gives a lot of references on this, which in turn give different methods and insights. Here's a specific example of an invertible mapping. It works for any dimension N.
As for useful properties, these mappings are related to Gray codes.
Hard to say whether this was what you were looking for, or whether the "pack 3 16-bit ints into a 48-bit int" does the trick for you.
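As a concrete starting point, the simplest member of that family is the Z-order (Morton) curve, which just interleaves the coordinate bits. A sketch for three 16-bit coordinates packed into a 48-bit index (the bit-spreading constants are the standard 3D Morton masks):

#include <cstdint>

static uint64_t split_by_3(uint64_t x) // spread 16 bits out to every 3rd bit position
{
    x &= 0xFFFF;
    x = (x | x << 32) & 0x001F00000000FFFF;
    x = (x | x << 16) & 0x001F0000FF0000FF;
    x = (x | x << 8)  & 0x100F00F00F00F00F;
    x = (x | x << 4)  & 0x10C30C30C30C30C3;
    x = (x | x << 2)  & 0x1249249249249249;
    return x;
}

uint64_t morton3(uint16_t x1, uint16_t x2, uint16_t x3)
{
    return split_by_3(x1) | (split_by_3(x2) << 1) | (split_by_3(x3) << 2);
}

Nearby vectors tend to get nearby indices, much more so than with plain bit concatenation, which is the kind of locality property space-filling curves are used for; a Hilbert curve does better still, at the cost of a more involved transform.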