2D morton code encode/decode 64bits - c

How do I encode/decode Morton codes (Z-order) given [x, y] as 32-bit unsigned integers, producing a 64-bit Morton code, and vice versa?
I do have xy2d and d2xy, but only for coordinates 16 bits wide producing a 32-bit Morton number. I searched the net a lot but couldn't find anything. Please help.

If it is possible for you to use architecture-specific instructions, you will likely be able to accelerate the operation beyond what is possible using bit-twiddling hacks.
For example, if you write code for Intel Haswell and later CPUs, you can use the BMI2 instruction set, which contains the pext and pdep instructions. These can (among other great things) be used to build your functions.
Here is a complete example (tested with GCC):
#include <immintrin.h>
#include <stdint.h>

// on GCC, compile with option -mbmi2; requires Haswell or better.
// Note: the 64-bit pdep/pext forms are needed here -- the 32-bit forms
// with 32-bit masks would only interleave 16 bits of each coordinate.
uint64_t xy_to_morton(uint32_t x, uint32_t y)
{
    return _pdep_u64(x, 0x5555555555555555) | _pdep_u64(y, 0xaaaaaaaaaaaaaaaa);
}

void morton_to_xy(uint64_t m, uint32_t *x, uint32_t *y)
{
    *x = (uint32_t)_pext_u64(m, 0x5555555555555555);
    *y = (uint32_t)_pext_u64(m, 0xaaaaaaaaaaaaaaaa);
}
If you have to support earlier CPUs or the ARM platform, not all is lost. You may still get help, at least for the xy_to_morton function, from instructions aimed at cryptography.
A lot of CPUs have support for carry-less multiplication these days. On ARM that is vmull_p8 from the NEON instruction set. On x86 you will find it as PCLMULQDQ from the CLMUL instruction set (available since 2010).
The trick here is that a carry-less multiplication of a number with itself returns a bit pattern that contains the original bits of the argument with zero bits interleaved. So it is identical to the pdep-style spread with the 0x55... mask shown above. E.g. it turns the following byte:
+----+----+----+----+----+----+----+----+
| b7 | b6 | b5 | b4 | b3 | b2 | b1 | b0 |
+----+----+----+----+----+----+----+----+
Into:
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 0 | b7 | 0 | b6 | 0 | b5 | 0 | b4 | 0 | b3 | 0 | b2 | 0 | b1 | 0 | b0 |
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
Now you can build the xy_to_morton function as (here shown for CLMUL instruction set):
#include <wmmintrin.h>
#include <stdint.h>

// on GCC, compile with option -mpclmul
uint64_t carryless_square(uint32_t x)
{
    __m128i v = _mm_set_epi64x(0, x);      // load x into the low 64 bits
    v = _mm_clmulepi64_si128(v, v, 0);     // carry-less square
    return (uint64_t)_mm_cvtsi128_si64(v); // low 64 bits of the result
}

uint64_t xy_to_morton(uint32_t x, uint32_t y)
{
    return carryless_square(x) | (carryless_square(y) << 1);
}
_mm_clmulepi64_si128 generates a 128-bit result, of which we only use the lower 64 bits. So you can even improve upon the version above and use a single _mm_clmulepi64_si128 to do the job.
That is as good as you can get on mainstream platforms (e.g. modern ARM with NEON and x86). Unfortunately, I don't know of any trick to speed up the morton_to_xy function using the cryptography instructions, and I tried really hard for several months.
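The single-multiplication variant mentioned above can be sketched as follows (my sketch, not code from the original answer; it assumes an x86-64 target with PCLMUL and relies on the GCC/clang target pragma in place of the -mpclmul flag). Packing y into the upper half works because carry-less squaring is linear over XOR: the cross terms x*y appear twice and cancel, so the low 64 bits of the 128-bit product hold x interleaved with zeros and the high 64 bits hold y interleaved with zeros.

```c
#include <immintrin.h>
#include <stdint.h>

// on GCC/clang, either compile with -mpclmul or rely on this pragma
#pragma GCC target("pclmul")

uint64_t xy_to_morton_1mul(uint32_t x, uint32_t y)
{
    __m128i v  = _mm_set_epi64x(0, ((uint64_t)y << 32) | x);
    __m128i sq = _mm_clmulepi64_si128(v, v, 0x00);   /* one carry-less square */
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(sq);   /* x spread to even bits */
    uint64_t hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(sq, sq)); /* y spread */
    return lo | (hi << 1);
}
```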

void xy2d_morton(uint64_t x, uint64_t y, uint64_t *d)
{
    x = (x | (x << 16)) & 0x0000FFFF0000FFFF;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FF;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0F;
    x = (x | (x << 2))  & 0x3333333333333333;
    x = (x | (x << 1))  & 0x5555555555555555;
    y = (y | (y << 16)) & 0x0000FFFF0000FFFF;
    y = (y | (y << 8))  & 0x00FF00FF00FF00FF;
    y = (y | (y << 4))  & 0x0F0F0F0F0F0F0F0F;
    y = (y | (y << 2))  & 0x3333333333333333;
    y = (y | (y << 1))  & 0x5555555555555555;
    *d = x | (y << 1);
}
// morton_1 - extract even bits
uint32_t morton_1(uint64_t x)
{
    x = x & 0x5555555555555555;
    x = (x | (x >> 1))  & 0x3333333333333333;
    x = (x | (x >> 2))  & 0x0F0F0F0F0F0F0F0F;
    x = (x | (x >> 4))  & 0x00FF00FF00FF00FF;
    x = (x | (x >> 8))  & 0x0000FFFF0000FFFF;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFF;
    return (uint32_t)x;
}

// (pointers rather than C++ references, since this is C)
void d2xy_morton(uint64_t d, uint64_t *x, uint64_t *y)
{
    *x = morton_1(d);
    *y = morton_1(d >> 1);
}

The naïve code would be the same regardless of the bit count. If you don't need a super-fast bit-twiddling version, this will do:
uint32_t x;
uint32_t y;
uint64_t z = 0;

for (int i = 0; i < sizeof(x) * 8; i++)
{
    z |= (x & (uint64_t)1 << i) << i | (y & (uint64_t)1 << i) << (i + 1);
}
If you need faster bit twiddling, then this one should work. Note that x and y have to be held in 64-bit variables (the intermediate shifts would overflow 32 bits).
uint64_t x;
uint64_t y;
uint64_t z = 0;
x = (x | (x << 16)) & 0x0000FFFF0000FFFF;
x = (x | (x << 8)) & 0x00FF00FF00FF00FF;
x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0F;
x = (x | (x << 2)) & 0x3333333333333333;
x = (x | (x << 1)) & 0x5555555555555555;
y = (y | (y << 16)) & 0x0000FFFF0000FFFF;
y = (y | (y << 8)) & 0x00FF00FF00FF00FF;
y = (y | (y << 4)) & 0x0F0F0F0F0F0F0F0F;
y = (y | (y << 2)) & 0x3333333333333333;
y = (y | (y << 1)) & 0x5555555555555555;
z = x | (y << 1);


How to implement SWAR unsigned less-than?

I'm trying to use uint64_t as if it was 8 lanes of uint8_ts; my goal is to implement a lane-by-lane less-than. This operation, given x and y, should produce a result with 0xFF in a lane if the value for the corresponding lane in x is less than the value for that lane in y, and 0x00 otherwise. A lane-by-lane less-than-or-equal would also work.
Based on what I've seen, I'm guessing I would need a lanewise difference-or-zero operation (defined as doz(x, y) = if (x < y) then 0 else (x - y)), and then to use that to construct a selection mask. However, all the lane-wise subtraction approaches I've seen are signed, and I'm not sure how I would use them to do this kind of task.
Is there a way I could do this, using difference-or-zero or some other way?
It turns out that basing it on DOZ is the wrong way to go after all; a full DOZ is not needed for this.
"However, all the lane-wise subtraction approaches I've seen are signed"
This is surprising, because subtraction is neither signed nor unsigned: there is only one subtraction, and it can be interpreted both ways. At least, that's how it works in the two's-complement world.
For reference, SWAR subtraction looks like this (source: SIMD and SWAR Techniques; H is the mask of lane high bits, 0x8080808080808080 for byte lanes):
SWAR sub z = x - y
z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)
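For concreteness, that formula transcribes to C like this (my transcription, for 8-bit lanes in a uint64_t):

```c
#include <stdint.h>

/* SWAR byte-wise subtract: subtract the low 7 bits of each lane with
   borrows contained (the OR-ed-in H bit absorbs them), then patch the
   true high bits back in with the XOR term. */
uint64_t swar_sub8(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8080808080808080ull;
    return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H);
}
```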
And DOZ could be based on that. A full DOZ is overkill though; if it were a primitive, using it would make sense. A SWAR DOZ would work by computing the difference and then zeroing it out if x < y — and that condition, x < y, is what we wanted all along. So let's just compute the condition and not the whole DOZ. It comes down to this question: when is there a borrow out of the high bit?
If the high bit of x is zero and the high bit of y is one.
If the high bits of x and y are the same, and the high bit of their difference is one. Equivalently: if the high bits of x and y are the same, and there is a borrow out of the second highest bit.
The first part of SWAR sub, ((x | H) - (y &~H)), computes (among other things) the borrow out of the second highest bit. The high bit of the SWAR difference is the inverse of the borrow out of the second highest bit (that bit from H either gets "eaten" by the borrow, or not).
Putting it together, SWAR unsigned-less-than could work like this:
tmp = ((~x ^ y) & ~((x | H) - (y &~H)) | (~x & y)) & H
less_than_mask = (tmp << 1) - (tmp >> 7)
Parts:
(~x ^ y) = mask of "bits are the same", used for "high bits are the same"
~((x | H) - (y &~H)) = difference of the low parts of elements, used for "borrow out of second highest bit"
(~x & y) = mask of "x is zero and y is one", used for "high bit of x is zero and high bit of y is one"
& H near the end, used to grab only the bits that correspond to the borrow out of the high bit
(tmp << 1) - (tmp >> 7) spreads out the bits grabbed by the previous step into lane-masks. Alternative: (tmp >> 7) * 255. This is the only step where the SWAR logic explicitly depends on the lane size, and it needs to be the same for every lane, even though for SWAR sub you could mix lane sizes.
One operation can be removed at the expression level by applying De Morgan's law:
tmp = (~(x ^ y | (x | H) - (y & ~H)) | ~x & y) & H
But ~x needs to be computed anyway, so at the assembly level that may not help, depending on how it gets compiled.
Perhaps some simplification is possible.
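Collected into a function, the recipe above looks like this (a direct transcription of the tmp / less_than_mask expressions, for 8-bit lanes):

```c
#include <stdint.h>

/* Byte-wise unsigned less-than: returns 0xFF in each lane where the
   byte of x is less than the byte of y, else 0x00. */
uint64_t swar_ltu8(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8080808080808080ull;
    /* grab, per lane, the "borrow out of the high bit" into bit 7 */
    uint64_t tmp = (((~x ^ y) & ~((x | H) - (y & ~H))) | (~x & y)) & H;
    /* spread each grabbed bit into a full-byte mask */
    return (tmp << 1) - (tmp >> 7);
}
```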
Which approach will work the fastest will depend on what kind of instructions are available in the processor architecture of the target platform, such as shift-plus, add, three-input adds, three-input logical instructions. It also depends on whether one desired a throughput- or latency-optimized version, and the superscalarity of the processor architecture.
The following ISO C99 code provides two alternatives. One uses an unsigned byte-wise comparison taken directly from the literature (LTU_VARIANT = 1); the other (LTU_VARIANT = 0) I devised myself on the basis of a halving add (i.e. the sum of two integers divided by two, rounded down). This is based on the fact that for two's-complement integers a, b each in [0,255], a <u b ⇔ ~a + b >= 256.
However, this would require nine bits for the sum, so we can use a <u b ⇔ ((~a + b) >> 1) >= 128 instead, where the average can be computed within 8 bits by a well-known bit-twiddling technique. The only processor architecture I know of that offers a SIMD halving add as a hardware instruction (vhadd) is Arm NEON.
I have included a test framework for a functional test, but benchmarking will be needed to establish which version performs better on a given platform.
Partial overlap of this answer with other answers is probable; there are only so many different ways to skin a kumquat.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#define LTU_VARIANT (1) // 0 or 1
#define UINT64_H8 (0x8080808080808080U) // byte-wise sign bits (MSBs)
uint64_t sign_to_mask8 (uint64_t a)
{
    a = a & UINT64_H8;    // isolate sign bits
    a = a + a - (a >> 7); // extend them to full bytes to create masks
    return a;
}
uint64_t vhaddu8 (uint64_t a, uint64_t b)
{
    /* Peter L. Montgomery's observation (newsgroup comp.arch, 2000/02/11,
       https://groups.google.com/d/msg/comp.arch/gXFuGZtZKag/_5yrz2zDbe4J):
       (A+B)/2 = (A AND B) + (A XOR B)/2.
    */
    return (a & b) + (((a ^ b) >> 1) & ~UINT64_H8);
}
uint64_t ltu8_core (uint64_t a, uint64_t b)
{
    /* Sebastiano Vigna, "Broadword implementation of rank/select queries."
       In: International Workshop on Experimental and Efficient Algorithms,
       pp. 154-168, Springer Berlin Heidelberg, 2008.
    */
    return (((a | UINT64_H8) - (b & ~UINT64_H8)) | (a ^ b)) ^ (a | ~b);
}
uint64_t vcmpltu8 (uint64_t a, uint64_t b)
{
#if LTU_VARIANT==1
    return sign_to_mask8 (ltu8_core (a, b));
#else // LTU_VARIANT
    return sign_to_mask8 (vhaddu8 (~a, b));
#endif // LTU_VARIANT
}
uint64_t ref_func (uint64_t a, uint64_t b)
{
    uint8_t a0 = (uint8_t)((a >>  0) & 0xff);
    uint8_t a1 = (uint8_t)((a >>  8) & 0xff);
    uint8_t a2 = (uint8_t)((a >> 16) & 0xff);
    uint8_t a3 = (uint8_t)((a >> 24) & 0xff);
    uint8_t a4 = (uint8_t)((a >> 32) & 0xff);
    uint8_t a5 = (uint8_t)((a >> 40) & 0xff);
    uint8_t a6 = (uint8_t)((a >> 48) & 0xff);
    uint8_t a7 = (uint8_t)((a >> 56) & 0xff);
    uint8_t b0 = (uint8_t)((b >>  0) & 0xff);
    uint8_t b1 = (uint8_t)((b >>  8) & 0xff);
    uint8_t b2 = (uint8_t)((b >> 16) & 0xff);
    uint8_t b3 = (uint8_t)((b >> 24) & 0xff);
    uint8_t b4 = (uint8_t)((b >> 32) & 0xff);
    uint8_t b5 = (uint8_t)((b >> 40) & 0xff);
    uint8_t b6 = (uint8_t)((b >> 48) & 0xff);
    uint8_t b7 = (uint8_t)((b >> 56) & 0xff);
    uint8_t r0 = (a0 < b0) ? 0xff : 0x00;
    uint8_t r1 = (a1 < b1) ? 0xff : 0x00;
    uint8_t r2 = (a2 < b2) ? 0xff : 0x00;
    uint8_t r3 = (a3 < b3) ? 0xff : 0x00;
    uint8_t r4 = (a4 < b4) ? 0xff : 0x00;
    uint8_t r5 = (a5 < b5) ? 0xff : 0x00;
    uint8_t r6 = (a6 < b6) ? 0xff : 0x00;
    uint8_t r7 = (a7 < b7) ? 0xff : 0x00;
    return ( ((uint64_t)r0 <<  0) +
             ((uint64_t)r1 <<  8) +
             ((uint64_t)r2 << 16) +
             ((uint64_t)r3 << 24) +
             ((uint64_t)r4 << 32) +
             ((uint64_t)r5 << 40) +
             ((uint64_t)r6 << 48) +
             ((uint64_t)r7 << 56) );
}
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
    uint64_t a, b, res, ref, n = 0;
    printf ("Testing vcmpltu8: byte-wise unsigned comparison with mask result\n");
    printf ("using LTU variant %d\n", LTU_VARIANT);
    do {
        a = KISS64;
        b = KISS64;
        res = vcmpltu8 (a, b);
        ref = ref_func (a, b);
        if (res != ref) {
            printf ("\nerr # a=%016" PRIx64 " b=%016" PRIx64 " : res=%016" PRIx64 " ref=%016" PRIx64 "\n",
                    a, b, res, ref);
            return EXIT_FAILURE;
        }
        n++;
        if (!(n & 0xffffff)) printf ("\r%016" PRIx64, n);
    } while (a);
    printf ("\ntest passed\n");
    return EXIT_SUCCESS;
}
Here's an architecture-independent approach. I'm sure it could use refinement, but it seems to be working fine. With x86 gcc/clang, it compiles to 20/19 instructions.
The idea is to first solve the problem when both bytes are either less than 128 or not, setting bit 7 in each byte with that result. Then patch up the other cases. Finally smear the bit 7's downward.
#include <stdio.h>
#include <stdint.h>
uint64_t bwlt(uint64_t a, uint64_t b) {
    uint64_t lo7  = ~0ull / 255 * 127,            // low 7 bits set in each byte
             alo7 = a & lo7,                      // mask low 7 bits in a
             blo7 = b & lo7,                      // mask low 7 bits in b
             r    = (lo7 - alo7 + blo7) & ~lo7,   // set 8th bits where a < b
             diff = (a ^ b) & ~lo7;               // 8th bits that differ
    r &= ~(a & diff);                             // unset if a[i]_7=1, b[i]_7=0
    r |= b & diff;                                // set if a[i]_7=0, b[i]_7=1
    return (r << 1) - (r >> 7);
}
int main(void) {
    uint64_t a = 0x11E1634052A6B7CB;
    uint64_t b = 0x1EAEF1E85F26734E;
    printf("r=%016llx\n", (unsigned long long)bwlt(a, b));
    return 0;
}
One test case:
$ gcc foo.c -o foo
$ ./foo
r=ff00ffffff000000
I enjoyed figuring out how to create a SWAR x LT (less than) y function with 64-bit unsigned ints, using only logical operators and arithmetic + and -.
I looked at some information on the web (https://www.chessprogramming.org/SIMD_and_SWAR_Techniques), and from there I got the idea that the function could be built starting from the subtraction (x - y).
Looking at the meaning of the highest bit of x, y and (x - y) when unsigned ints are used, I created the following truth table, where:
R (result) is 1 when the LT condition occurs,
D is the highest bit of the subtraction (x - y),
X is the highest bit of the x value under test,
Y is the highest bit of the y value under test.
D X Y | R
0 0 0 | 0
0 0 1 | 1
0 1 0 | 0
0 1 1 | 0
1 0 0 | 1
1 0 1 | 1
1 1 0 | 0
1 1 1 | 1
Applying a Karnaugh map (https://getcalc.com/karnaugh-map/3variable-kmap-solver.htm) to the table above, we obtain the following formula:
(~X & Y) | (D & ~X) | (D & Y)
from which the macro SWARLTU(x, y) arose (see the file swar.h below).
Since I was not satisfied, I observed the assembly the compiler generated for the macro SWARLTU, and then, following that code, I wrote the macro SWARLTU2(x, y) (see the file swar.h below). This last macro should be logically optimized.
A limitation of this code is that the value marking a true LT result is 0x80, not 0xFF as requested in the question.
The program can be launched in three different ways:
Without parameters: it performs 10 tests on random numbers.
With one parameter: the parameter gives the number of random tests to perform.
With two parameters (two numbers in the form 0xnnnnn): only the check of the entered values is shown.
Here is the code.
The file swar.h (this file also contains other SWAR macros, e.g. SHL, SHR):
#ifndef SWAR_H
#define SWAR_H
/*
https://www.chessprogramming.org/SIMD_and_SWAR_Techniques
SWAR add z = x + y
z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)
SWAR sub z = x - y
z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)
SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y)
z = (x & y) + (((x ^ y) & ~L) >> 1)
*/
// 0 1 2 3 4 5 6 7
#define SWARH 0x8080808080808080LL
#define SWARL 0x0101010101010101LL
#define SWARADD(x,y) \
    ((( (x) &~SWARH) + ( (y) &~SWARH)) ^ (( (x) ^ (y) ) & SWARH))
#define SWARSUB(x,y) \
    ((( (x) | SWARH) - ( (y) &~SWARH)) ^ (( (x) ^~(y) ) & SWARH))
#define SWARAVE(x,y) \
    (( (x) & (y) ) + ((( (x) ^ (y)) & ~SWARL) >> 1))
#define SWARLTI(x,y) \
    ( SWARSUB(x,y) & SWARH )
#define SWARSHL(x) \
    (((x) & (~SWARH)) << 1)
#define SWARSHR(x) \
    (((x) & (~SWARL)) >> 1)
/*** Computing unsigned less-than
Truth table considering the HIGH bit of the
Difference, X value, Y value
D X Y | R
0 0 0 | 0
0 0 1 | 1
0 1 0 | 0
0 1 1 | 0
1 0 0 | 1
1 0 1 | 1
1 1 0 | 0
1 1 1 | 1
***/
#define _SWARDH (SWARSUB(x,y) & SWARH)
#define _SWARXH ((x)&SWARH)
#define _SWARYH ((y)&SWARH)
#define SWARLTU(x,y) \
    ((~_SWARXH & _SWARYH) | (_SWARDH & ~_SWARXH) | (_SWARDH & _SWARYH))
// Elaborated from the generated ASM of the previous macro.
#define SWARLTU2(X,Y) \
    ((((~(X & SWARH)) & ((((~(X ^ Y)) & SWARH) ^ ((X | SWARH) - Y)) | Y)) | \
      ((((~(X ^ Y)) & SWARH) ^ ((X | SWARH) - Y)) & Y)) & SWARH)
#endif // SWAR_H
The file main.c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <inttypes.h>
#include <time.h>
#include "swar.h"
char * verifyltu(char * rs, uint64_t x, uint64_t y, uint64_t v);
void printvalues(uint64_t x, uint64_t y, uint64_t r, uint64_t r1);
uint64_t rand64(void);
int main(int argc, char *argv[])
{
    int rnd = 1;
    size_t i, n = 10;
    uint64_t x = 0, y = 0, r, r1;
    srand(time(NULL));
    if (argc > 1) {
        if (argc == 2) {
            n = strtoul(argv[1], NULL, 0);
        } else {
            x = strtoull(argv[1], NULL, 0);
            y = strtoull(argv[2], NULL, 0);
            rnd = 0;
        }
    }
    if (rnd) {
        for (i = 0; i < n; i++) {
            x = rand64();
            y = rand64();
            r = SWARLTU(x, y);
            r1 = SWARLTU2(x, y);
            printvalues(x, y, r, r1);
        }
    } else {
        r = SWARLTU(x, y);
        r1 = SWARLTU2(x, y);
        printvalues(x, y, r, r1);
    }
    return 0;
}
char * verifyltu(char * rs, uint64_t x, uint64_t y, uint64_t v)
{
    size_t i;
    uint8_t *xs, *ys, *vs;
    xs = (uint8_t *)&x; ys = (uint8_t *)&y;
    vs = (uint8_t *)&v;
    for (i = 0; i < sizeof(uint64_t); i++) {
        if ( ( xs[i] < ys[i] && vs[i] & 0x80 ) ||
             ( !(xs[i] < ys[i]) && !(vs[i] & 0x80) ) )
        {
            rs[i*2] = '*'; rs[i*2+1] = ' ';
        } else {
            rs[i*2] = '-'; rs[i*2+1] = ' ';
        }
    }
    rs[i*2] = 0;
    return rs;
}
void printvalues(uint64_t x, uint64_t y, uint64_t r, uint64_t r1)
{
    char rs[17], rs1[17];
    printf(
        "X    %016" PRIX64 " <\n"
        "Y    %016" PRIX64 "\n"
        "     ----------------\n"
        "LTU  %016" PRIX64 "\n"
        "*=Ok %s\n"
        "LTU2 %016" PRIX64 "\n"
        "*=Ok %s\n\n",
        x, y,
        r, verifyltu(rs, x, y, r),
        r1, verifyltu(rs1, x, y, r1)
    );
}
uint64_t rand64(void)
{
    uint64_t x;
    x = rand(); x = (x << 32) + rand();
    return x;
}
I came up with
uint64_t magic(uint64_t a, uint64_t b) {
    const uint64_t H = 0x8080808080808080ull;
    uint64_t c = (a | H) - (b & ~H);
    uint64_t d = a ^ b;
    uint64_t e = ((a & d) | (c & ~d)) & H;
    return e ^ H;
}
The logic goes pretty much the same path as in Harold's; the difference is in interpreting the top bits:

c = 1aaaaaaa - 0bbbbbbb   -> the high (H) bit of c ends up 0 iff a < b

a = 80, b = 00   different sign -> select b, i.e. ~a
a = 00, b = 80   different sign -> select b, i.e. ~a
a = 80, b = 80   same sign      -> select ~c
a = 00, b = 00   same sign      -> select ~c

If one can work with the inverted mask (i.e. a >= b), then the final ^H can be omitted as well.
The instruction counts of the results, using clang / godbolt for arm64 and x64, with (and without) sign_to_mask:

instructions   arm64      x64
-------------+----------+---------
vhaddu8      |  5  (8)  |  8 (13)
ltu8_core    |  7 (10)  | 11 (15)
magic        |  9 (10)  | 12 (15)
harold       |  9 (11)  | 14 (17)
bwlt         | 10 (12)  | 15 (18)
SirJoBlack   | 11 (13)  | 16 (19)

How to read a reversed number from a binary file?

I would like to read a 32-bit number from a binary file in C. The problem is that the order of the bits is reversed: for example, the 3-digit number 110 would stand for 3, not for 6. At the beginning we have the least significant bit (2^0), then 2^1 and so on. Is there any simple way to do this in C, or do I have to write all the logic myself (read the first bit, multiply it by 2^0, add it to the sum, and repeat to the end)?
You have many possible ways:
Portable:
(not my algorithm)
uint32_t rev(uint32_t x)
{
    x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
    x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
    x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
    x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
    return ((x >> 16) | (x << 16));
}
or
uint32_t bit_reverse_4bytes(uint32_t x)
{
    x = ((x & 0xF0F0F0F0) >> 4) | ((x & 0x0F0F0F0F) << 4);
    x = ((x & 0xCCCCCCCC) >> 2) | ((x & 0x33333333) << 2);
    return ((x & 0xAAAAAAAA) >> 1) | ((x & 0x55555555) << 1);
}
Naive
uint32_t naive_reverse(uint32_t x)
{
    uint32_t result = 0;
    for (int i = 0; i < 32; i++)
    {
        result <<= 1;    // make room first,
        result |= x & 1; // then bring in the next bit
        x >>= 1;
    }
    return result;
}
or a lookup table.
Not portable, but the most efficient:
Many processors have a special instruction for this, for example:
ARM - rbit, with the intrinsic unsigned int __rbit(unsigned int val)
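For the file-reading part of the question, here is a hedged usage sketch (the portable rev() is reproduced so the snippet stands alone; the byte order of the word in the file is an assumption about the format):

```c
#include <stdio.h>
#include <stdint.h>

static uint32_t rev(uint32_t x) /* portable bit reversal, as above */
{
    x = ((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1);
    x = ((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2);
    x = ((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4);
    x = ((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8);
    return (x >> 16) | (x << 16);
}

/* Read one 32-bit word (assumed little-endian byte order) whose bits
   are stored least-significant-first, and undo the bit reversal. */
int read_reversed_u32(FILE *f, uint32_t *out)
{
    uint8_t buf[4];
    if (fread(buf, 1, 4, f) != 4)
        return -1; /* short read or EOF */
    uint32_t raw = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
                 | ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    *out = rev(raw);
    return 0;
}
```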

Morton Reverse Encoding for a 3D grid

I have a 3D grid/array, say u[nx+2][ny+2][nz+2]. The trailing +2 corresponds to two layers of halo cells in each of the three dimensions x, y, z. I have another grid which allows for refinement (using a quadtree), hence I have the Morton index (or Z-order) of each of its cells.
Let's say that without refinement the two grids are alike in physical reality (except that the second grid doesn't have halo cells). What I want to find is: for a cell q with Morton id mid, what are the corresponding i, j and k indices in the 3D grid? Basically, a decoding of the mid or Z-order to get the corresponding i, j, k for the u matrix.
Looking for a C solution, but general comments in any other programming language are also OK.
For forward encoding I am following the magic-bits method as shown in:
Morton Encoding using different methods
Morton encoding is just interleaving the bits of two or more components.
If we number binary digits in increasing order of significance, so that the least significant binary digit in an unsigned integer is digit 0 (and binary digit i has value 2^i), then binary digit i of component k (out of N components) corresponds to binary digit (i*N + k) in the Morton code.
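This digit mapping can be sketched directly, one bit at a time, as a naive reference (assuming N = 3 components of 21 bits each, which is what fits in 64 bits):

```c
#include <stdint.h>

/* Bit i of component k lands at bit (i*3 + k) of the Morton code. */
uint64_t morton3_naive(uint32_t x, uint32_t y, uint32_t z)
{
    uint64_t m = 0;
    for (int i = 0; i < 21; i++) {
        m |= (((uint64_t)(x >> i) & 1) << (3 * i))
           | (((uint64_t)(y >> i) & 1) << (3 * i + 1))
           | (((uint64_t)(z >> i) & 1) << (3 * i + 2));
    }
    return m;
}
```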
Here are two simple functions to encode and decode three-component Morton codes:
#include <stdlib.h>
#include <inttypes.h>
/* This source is in the public domain. */
/* Morton encoding in binary (components 21-bit: 0..2097151)
0zyxzyxzyxzyxzyxzyxzyxzyxzyxzyxzyxzyxzyxzyxzyxzyxzyxzyxzyxzyxzyx */
#define BITMASK_0000000001000001000001000001000001000001000001000001000001000001 UINT64_C(18300341342965825)
#define BITMASK_0000001000001000001000001000001000001000001000001000001000001000 UINT64_C(146402730743726600)
#define BITMASK_0001000000000000000000000000000000000000000000000000000000000000 UINT64_C(1152921504606846976)
/* 0000000ccc0000cc0000cc0000cc0000cc0000cc0000cc0000cc0000cc0000cc */
#define BITMASK_0000000000000011000000000011000000000011000000000011000000000011 UINT64_C(844631138906115)
#define BITMASK_0000000111000000000011000000000011000000000011000000000011000000 UINT64_C(126113986927919296)
/* 00000000000ccccc00000000cccc00000000cccc00000000cccc00000000cccc */
#define BITMASK_0000000000000000000000000000000000001111000000000000000000001111 UINT64_C(251658255)
#define BITMASK_0000000000000000000000001111000000000000000000001111000000000000 UINT64_C(1030792212480)
#define BITMASK_0000000000011111000000000000000000000000000000000000000000000000 UINT64_C(8725724278030336)
/* 000000000000000000000000000ccccccccccccc0000000000000000cccccccc */
#define BITMASK_0000000000000000000000000000000000000000000000000000000011111111 UINT64_C(255)
#define BITMASK_0000000000000000000000000001111111111111000000000000000000000000 UINT64_C(137422176256)
/* ccccccccccccccccccccc */
#define BITMASK_21BITS UINT64_C(2097151)
static inline void morton_decode(uint64_t m, uint32_t *xto, uint32_t *yto, uint32_t *zto)
{
    const uint64_t mask0 = BITMASK_0000000001000001000001000001000001000001000001000001000001000001,
                   mask1 = BITMASK_0000001000001000001000001000001000001000001000001000001000001000,
                   mask2 = BITMASK_0001000000000000000000000000000000000000000000000000000000000000,
                   mask3 = BITMASK_0000000000000011000000000011000000000011000000000011000000000011,
                   mask4 = BITMASK_0000000111000000000011000000000011000000000011000000000011000000,
                   mask5 = BITMASK_0000000000000000000000000000000000001111000000000000000000001111,
                   mask6 = BITMASK_0000000000000000000000001111000000000000000000001111000000000000,
                   mask7 = BITMASK_0000000000011111000000000000000000000000000000000000000000000000,
                   mask8 = BITMASK_0000000000000000000000000000000000000000000000000000000011111111,
                   mask9 = BITMASK_0000000000000000000000000001111111111111000000000000000000000000;
    uint64_t x = m,
             y = m >> 1,
             z = m >> 2;
    /* 000c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c */
    x = (x & mask0) | ((x & mask1) >> 2) | ((x & mask2) >> 4);
    y = (y & mask0) | ((y & mask1) >> 2) | ((y & mask2) >> 4);
    z = (z & mask0) | ((z & mask1) >> 2) | ((z & mask2) >> 4);
    /* 0000000ccc0000cc0000cc0000cc0000cc0000cc0000cc0000cc0000cc0000cc */
    x = (x & mask3) | ((x & mask4) >> 4);
    y = (y & mask3) | ((y & mask4) >> 4);
    z = (z & mask3) | ((z & mask4) >> 4);
    /* 00000000000ccccc00000000cccc00000000cccc00000000cccc00000000cccc */
    x = (x & mask5) | ((x & mask6) >> 8) | ((x & mask7) >> 16);
    y = (y & mask5) | ((y & mask6) >> 8) | ((y & mask7) >> 16);
    z = (z & mask5) | ((z & mask6) >> 8) | ((z & mask7) >> 16);
    /* 000000000000000000000000000ccccccccccccc0000000000000000cccccccc */
    x = (x & mask8) | ((x & mask9) >> 16);
    y = (y & mask8) | ((y & mask9) >> 16);
    z = (z & mask8) | ((z & mask9) >> 16);
    /* 0000000000000000000000000000000000000000000ccccccccccccccccccccc */
    if (xto) *xto = x;
    if (yto) *yto = y;
    if (zto) *zto = z;
}
static inline uint64_t morton_encode(uint32_t xsrc, uint32_t ysrc, uint32_t zsrc)
{
    const uint64_t mask0 = BITMASK_0000000001000001000001000001000001000001000001000001000001000001,
                   mask1 = BITMASK_0000001000001000001000001000001000001000001000001000001000001000,
                   mask2 = BITMASK_0001000000000000000000000000000000000000000000000000000000000000,
                   mask3 = BITMASK_0000000000000011000000000011000000000011000000000011000000000011,
                   mask4 = BITMASK_0000000111000000000011000000000011000000000011000000000011000000,
                   mask5 = BITMASK_0000000000000000000000000000000000001111000000000000000000001111,
                   mask6 = BITMASK_0000000000000000000000001111000000000000000000001111000000000000,
                   mask7 = BITMASK_0000000000011111000000000000000000000000000000000000000000000000,
                   mask8 = BITMASK_0000000000000000000000000000000000000000000000000000000011111111,
                   mask9 = BITMASK_0000000000000000000000000001111111111111000000000000000000000000;
    uint64_t x = xsrc,
             y = ysrc,
             z = zsrc;
    /* 0000000000000000000000000000000000000000000ccccccccccccccccccccc */
    x = (x & mask8) | ((x << 16) & mask9);
    y = (y & mask8) | ((y << 16) & mask9);
    z = (z & mask8) | ((z << 16) & mask9);
    /* 000000000000000000000000000ccccccccccccc0000000000000000cccccccc */
    x = (x & mask5) | ((x << 8) & mask6) | ((x << 16) & mask7);
    y = (y & mask5) | ((y << 8) & mask6) | ((y << 16) & mask7);
    z = (z & mask5) | ((z << 8) & mask6) | ((z << 16) & mask7);
    /* 00000000000ccccc00000000cccc00000000cccc00000000cccc00000000cccc */
    x = (x & mask3) | ((x << 4) & mask4);
    y = (y & mask3) | ((y << 4) & mask4);
    z = (z & mask3) | ((z << 4) & mask4);
    /* 0000000ccc0000cc0000cc0000cc0000cc0000cc0000cc0000cc0000cc0000cc */
    x = (x & mask0) | ((x << 2) & mask1) | ((x << 4) & mask2);
    y = (y & mask0) | ((y << 2) & mask1) | ((y << 4) & mask2);
    z = (z & mask0) | ((z << 2) & mask1) | ((z << 4) & mask2);
    /* 000c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c00c */
    return x | (y << 1) | (z << 2);
}
The functions work symmetrically. To decode, binary digits and digit groups are shifted into larger consecutive units; to encode, binary digit groups are split and spread by shifting. Examine the masks (the BITMASK_ constants are named after their binary digit patterns) and the shift operations to understand in detail how the encoding and decoding happen.
While the two functions are quite efficient, they are not optimized.
The above functions have been verified to work across a few billion round trips using random 21-bit unsigned integer components: decoding a Morton-encoded value yields the original three components.

Morton ordering in 3D using uint64_t as input

I am trying to use a Morton code to produce a unique encoding for a given (x, y, z), where x, y, z are double-precision floating-point numbers. I presume that I can use a type cast to convert the floats to integers and run Morton ordering on those integers. For example, consider the following C++ code. (I don't know how to do the same in C.)
double x=-1.123456789123456E205;
int64_t i = reinterpret_cast<int64_t &>(x);
cout<<i<<endl;
output >>> i = -1548698869907112442
And ditto for the remaining values. Once I have the "reinterpreted" values, I would like to use them as input for a Morton-coding subroutine.
I checked that the above type cast works fine in reverse:
double y = reinterpret_cast<double &>(i);
cout<<setprecision(16)<<y<<endl;
output>>-1.123456789123456e+205
I managed to find some code for Morton coding, some of it on this forum, but none of it used int64_t in 3D. Hence I need the help of the experts on the forum for how to encode and decode int64_t integers.
I managed to reverse-engineer the following code. Unfortunately there is some bug; I am not getting the proper numbers when I run the decode part. I would appreciate any help figuring out what is wrong.
2D morton code encode/decode 64bits.
#include <iostream>
#include <stdint.h>
#include <iomanip>
using namespace std;

uint64_t code_2D_M(double xd, double yd) {
    uint64_t x = reinterpret_cast<uint64_t &>(xd);
    uint64_t y = reinterpret_cast<uint64_t &>(yd);
    x = (x | (x << 16)) & 0x0000FFFF0000FFFF;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FF;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0F;
    x = (x | (x << 2))  & 0x3333333333333333;
    x = (x | (x << 1))  & 0x5555555555555555;
    y = (y | (y << 16)) & 0x0000FFFF0000FFFF;
    y = (y | (y << 8))  & 0x00FF00FF00FF00FF;
    y = (y | (y << 4))  & 0x0F0F0F0F0F0F0F0F;
    y = (y | (y << 2))  & 0x3333333333333333;
    y = (y | (y << 1))  & 0x5555555555555555;
    return x | (y << 1);
}

uint64_t code_3D_M(double xd, double yd, double zd) {
    uint64_t x = reinterpret_cast<uint64_t &>(xd);
    uint64_t y = reinterpret_cast<uint64_t &>(yd);
    uint64_t z = reinterpret_cast<uint64_t &>(zd);
    x = (x | (x << 16)) & 0x0000FFFF0000FFFF;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FF;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0F;
    x = (x | (x << 2))  & 0x3333333333333333;
    x = (x | (x << 1))  & 0x5555555555555555;
    y = (y | (y << 16)) & 0x0000FFFF0000FFFF;
    y = (y | (y << 8))  & 0x00FF00FF00FF00FF;
    y = (y | (y << 4))  & 0x0F0F0F0F0F0F0F0F;
    y = (y | (y << 2))  & 0x3333333333333333;
    y = (y | (y << 1))  & 0x5555555555555555;
    z = (y | (y << 16)) & 0x0000FFFF0000FFFF;
    z = (y | (y << 8))  & 0x00FF00FF00FF00FF;
    z = (y | (y << 4))  & 0x0F0F0F0F0F0F0F0F;
    z = (y | (y << 2))  & 0x3333333333333333;
    z = (y | (y << 1))  & 0x5555555555555555;
    return x | (y << 1) | (z << 2);
}

double decode_M(uint64_t x)
{
    x = x & 0x5555555555555555;
    x = (x | (x >> 1))  & 0x3333333333333333;
    x = (x | (x >> 2))  & 0x0F0F0F0F0F0F0F0F;
    x = (x | (x >> 4))  & 0x00FF00FF00FF00FF;
    x = (x | (x >> 8))  & 0x0000FFFF0000FFFF;
    x = (x | (x >> 16)) & 0xFFFFFFFFFFFFFFFF;
    return reinterpret_cast<double &>(x);
}
int main (void) {
    uint64_t mort;
    double x, y, z;
    // test input
    x = 2.123456789123459E205;
    y = 1.789789123456129E205;
    z = 9.999999912345779E205;
    // echo the input
    cout << setprecision(17) << x << endl;
    cout << setprecision(17) << y << endl;
    cout << setprecision(17) << z << endl;
    // encode 2D case
    mort = code_2D_M(x, y);
    // decode and print the results to see if all was fine
    cout << setprecision(17) << decode_M(mort >> 0) << endl;
    cout << setprecision(17) << decode_M(mort >> 1) << endl;
    // encode 3D case
    mort = code_3D_M(x, y, z);
    // decode and print the results to see if all was fine
    cout << setprecision(17) << decode_M(mort >> 0) << endl;
    cout << setprecision(17) << decode_M(mort >> 1) << endl;
    cout << setprecision(17) << decode_M(mort >> 2) << endl;
    return 0;
}
I am doing this because I would like to store the coordinates not as a 3D point (x, y, z) but as a single long integer, and decode them when needed. By doing so I will reduce the size of my coordinate storage array threefold.
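Two likely problems in the code above, noted as hints rather than a verified fix: the z stages of code_3D_M read from y instead of z, and the 2D spread masks place bits only two apart, while 3D interleaving needs them three apart — so at most 21 bits of each component fit in one 64-bit code, and a double's full 64-bit pattern cannot round-trip through a single uint64_t. A hedged sketch of a 3-apart spread for 21-bit components (using the standard magic-bits constants):

```c
#include <stdint.h>

/* Spread 21 input bits so consecutive bits land three positions apart. */
static uint64_t spread3(uint64_t x)
{
    x &= 0x1FFFFF;                           /* keep 21 bits */
    x = (x | (x << 32)) & 0x001F00000000FFFF;
    x = (x | (x << 16)) & 0x001F0000FF0000FF;
    x = (x | (x << 8))  & 0x100F00F00F00F00F;
    x = (x | (x << 4))  & 0x10C30C30C30C30C3;
    x = (x | (x << 2))  & 0x1249249249249249;
    return x;
}

uint64_t morton3_encode(uint32_t x, uint32_t y, uint32_t z)
{
    return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
}
```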

2d Morton code 64bits decode function

The first function encodes [x, y] as a 64-bit-wide Morton code, where x and y are 32-bit-wide integers, using "Interleave bits by Binary Magic Numbers".
What would be the reverse function?
void xy2d_morton_64bits(uint64_t x, uint64_t y, uint64_t *d)
{
    x = (x | (x << 16)) & 0x0000FFFF0000FFFF;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FF;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0F;
    x = (x | (x << 2))  & 0x3333333333333333;
    x = (x | (x << 1))  & 0x5555555555555555;
    y = (y | (y << 16)) & 0x0000FFFF0000FFFF;
    y = (y | (y << 8))  & 0x00FF00FF00FF00FF;
    y = (y | (y << 4))  & 0x0F0F0F0F0F0F0F0F;
    y = (y | (y << 2))  & 0x3333333333333333;
    y = (y | (y << 1))  & 0x5555555555555555;
    *d = x | (y << 1);
}

void d2xy_morton_64bits(uint64_t d, uint64_t *x, uint64_t *y)
{
    ????
}
Split d into evens and odds, then use a similar set of shifts and masks to compress the bits back together:

x = d & 0x5555555555555555;
x = (x | x >> 1) & 0x3333333333333333; // converts 0a0b0c0d.. -> 00ab00cd...
x = (x | x >> 2) & 0x0f0f0f0f0f0f0f0f; // converts 00ab00cd.. -> 0000abcd...
//etc.
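The elided steps complete to the following (my completion of the sketch, mirroring the encoder's masks in reverse):

```c
#include <stdint.h>

/* Collapse the even-position bits of a 64-bit value into the low 32 bits. */
static uint64_t compact_even_bits(uint64_t x)
{
    x &= 0x5555555555555555;
    x = (x | (x >> 1))  & 0x3333333333333333;  /* 0a0b0c0d -> 00ab00cd */
    x = (x | (x >> 2))  & 0x0F0F0F0F0F0F0F0F;  /* 00ab00cd -> 0000abcd */
    x = (x | (x >> 4))  & 0x00FF00FF00FF00FF;
    x = (x | (x >> 8))  & 0x0000FFFF0000FFFF;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFF;
    return x;
}

void d2xy_morton_64bits(uint64_t d, uint64_t *x, uint64_t *y)
{
    *x = compact_even_bits(d);       /* even bits carried x */
    *y = compact_even_bits(d >> 1);  /* odd bits carried y */
}
```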
