I have written a MariaDB/MySQL UDF in C for working with spatial data. The data is available to the function as an unsigned char*. The binary encoding of the data begins with a byte-order flag signalling whether the stream is encoded little-endian or big-endian. Because of this, I have used the following macros to read unsigned 32-bit ints from the stream:
#define U32BIT_LE_DATA(ptr) ((uint32_t)*(ptr) | ((uint32_t)*(ptr + 1) << 8) | ((uint32_t)*(ptr + 2) << 16) | ((uint32_t)*(ptr + 3) << 24))
#define U32BIT_BE_DATA(ptr) ((uint32_t)*(ptr + 3) | ((uint32_t)*(ptr + 2) << 8) | ((uint32_t)*(ptr + 1) << 16) | ((uint32_t)*(ptr) << 24))
uint32_t var = U32BIT_LE_DATA(ptr); // Little endian encoding
uint32_t var = U32BIT_BE_DATA(ptr); // Big endian encoding
The stream also contains doubles that I need to parse (64-bit, 8-byte values in IEEE 754 double-precision format). I know I can do:
double var;
memcpy(&var, ptr, sizeof(double));
But this code is not very safe with regard to portability. I am aware that if I know my machine's endianness, I can simply reverse the order of the bytes before calling memcpy. Nevertheless, is there a more reliable way to decode a double from, or encode it to, 64-bit IEEE 754 double-precision format with a specified endianness, without needing to know the endianness (and system-specific double layout) of the machine running the code?
typedef union
{
    double  d;
    uint8_t b[sizeof(double)];
} u64;

inline double toDoubleLE(const uint8_t *arr, int endianness)
{
    u64 u;
    if (endianness)
    {
        for (size_t x = 0; x < sizeof(u); x++)
        {
            u.b[sizeof(u) - x - 1] = arr[x];
        }
    }
    else
    {
        for (size_t x = 0; x < sizeof(u); x++)
        {
            u.b[x] = arr[x];
        }
    }
    return u.d;
}
double fooLE(uint8_t *arr)
{
    return toDoubleLE(arr, 0);
}

double foobE(uint8_t *arr)
{
    return toDoubleLE(arr, 1);
}
compilers are "smart" and x86-64 will convert it to 2 machine code operations.
fooLE:
movq xmm0, QWORD PTR [rdi]
ret
foobE:
mov rax, QWORD PTR [rdi]
bswap rax
movq xmm0, rax
ret
https://godbolt.org/z/ofpDGe
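The encoding direction is just the mirror image; here is a minimal sketch (the function name and swap-flag convention are mine, not part of the original answer):
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: write a double into an 8-byte buffer, reversing the bytes when the
   requested byte order differs from the machine's, mirroring toDoubleLE above. */
static inline void doubleToBytes(double d, uint8_t *arr, int swap)
{
    union { double d; uint8_t b[sizeof(double)]; } u;
    u.d = d;
    if (swap)
    {
        for (size_t x = 0; x < sizeof(u); x++)
            arr[x] = u.b[sizeof(u) - x - 1];
    }
    else
    {
        memcpy(arr, u.b, sizeof(u));
    }
}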
Nevertheless, is there a more reliable way to decode a double from or encode it to a 64bit IEEE 754 double-precision floating point
On POSIX or Unix systems, consider using XDR or ASN.1.
If the data is not too big, consider representing it in textual form, e.g. JSON (with Jansson) or perhaps YAML. Then read carefully the documentation of fscanf and fprintf (in particular the %a format specifier).
See also the floating-point-gui.de
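For instance, a minimal sketch of such a textual round-trip with the %a hexadecimal float format (the function name and buffer size are illustrative):
#include <stdio.h>

/* Sketch: serialize a double to text with %a and read it back with %la.
   The %a form is exact for binary floating point, so nothing is lost. */
static int roundtrip_demo(void)
{
    double in = 3.141592653589793, out = 0.0;
    char buf[64];

    snprintf(buf, sizeof buf, "%a", in);   /* e.g. "0x1.921fb54442d18p+1" */
    if (sscanf(buf, "%la", &out) != 1)
        return 1;
    return (in == out) ? 0 : 1;            /* exact round-trip expected */
}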
I have an embedded device producing a stream of bits which I receive stored in an array of uint16_t. I know that the 16 bits at a particular offset are actually a two's complement binary number, i.e. an int16.
What is the best way to interpret these bits as a signed number without invoking undefined or implementation-defined behavior?
Given uint16_t data[N];:
int16_t value = (int16_t)data[offset];
Works on my platform, but is definitely implementation defined.
union { int16_t value; uint16_t raw; } u;
u.raw = data[offset];
int16_t value = u.value;
Also works, but is also implementation defined, as far as I know.
What about this?
int16_t value = *(int16_t*)&(data[offset]);
I'm having trouble finding a clear standards-based answer for the best practice here.
Converting through a union is not implementation-defined. Per the C standard, reading a union member other than the one last stored reinterprets the bytes, and the value bits of corresponding signed/unsigned integer types must represent the same values. So (union { uint16_t u; int16_t i; }) {data[offset]} .i gives the int16_t value represented by the bits of the uint16_t data[offset], with no undefined or implementation-defined behavior.
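For example, a minimal sketch wrapping that compound-literal expression (the helper name is illustrative):
#include <stdint.h>

/* Sketch: reinterpret the bits of a uint16_t as int16_t via a union
   compound literal, exactly as described above. */
static int16_t as_int16(uint16_t raw)
{
    return (union { uint16_t u; int16_t i; }) {raw} .i;
}

/* Usage with the question's buffer:
   int16_t value = as_int16(data[offset]); */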
Another solution is:
#include <stdint.h>

// Convert a uint16_t to the int16_t represented by the same bits.
static int16_t ConvertUInt16ToInt16(uint16_t u)
{
    /* int16_t is specified to be two's complement, so we interpret the most
       significant bit as -32768 instead of 32768: if the high bit is set
       (representing 32768 in a uint16_t), we subtract 65536, yielding a net
       interpretation of -32768, its value in two's complement. The cast to
       int32_t keeps the expression within bounds. The result is then
       automatically converted to the int16_t return type, which does not
       change the value since it is representable in int16_t.
    */
    return u - ((int32_t) u >> 15 << 16);
}
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

static void Test(uint16_t x)
{
    int16_t Expected = (union { uint16_t u; int16_t i; }) {x} .i;
    int16_t Observed = ConvertUInt16ToInt16(x);
    if (Expected != Observed)
    {
        printf("Error with 0x%08" PRIx16 ":\n", x);
        printf("\tExpected %" PRId16 " but observed %" PRId16 ".\n",
            Expected, Observed);
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    Test(0);
    for (uint16_t i = 1; i; ++i)
        Test(i);
}
Apple Clang 11.0 actually fully analyzes this program at compile time and compiles it to a program that immediately returns success without any looping and run-time testing:
_main:
pushq %rbp
movq %rsp, %rbp
xorl %eax, %eax
popq %rbp
retq
I am using C to read a .png image file; in case you're not familiar with the PNG encoding format, useful integer values are encoded in .png files as 4-byte big-endian integers.
My computer is a little-endian machine, so to convert from a big-endian uint32_t that I read from the file with fread() to a little-endian one my computer understands, I've been using this little function I wrote:
#include <stdint.h>

uint32_t convertEndian(uint32_t val)
{
    union {
        uint32_t value;
        char bytes[sizeof(uint32_t)];
    } in, out;
    in.value = val;
    for (int i = 0; i < sizeof(uint32_t); ++i)
        out.bytes[i] = in.bytes[sizeof(uint32_t) - 1 - i];
    return out.value;
}
This works beautifully on my x86_64 UNIX environment, gcc compiles without error or warning even with the -Wall flag, but I feel rather confident that I'm relying on undefined behavior and type-punning that may not work as well on other systems.
Is there a standard function I can call that can reliably convert a big-endian integer to one the native machine understands, or if not, is there an alternative safer way to do this conversion?
I see no real UB in OP's code.
Portability issues: yes.
"type-punning that may not work as well on other systems" is not a problem with OP's C code yet may cause trouble with other languages.
Yet how about converting big (PNG) endian to host endian instead?
Extract the bytes by address ("big" endian: the lowest address holds the MSByte, the highest address holds the LSByte) and form the result from the shifted bytes.
Something like:
uint32_t Endian_BigToHost32(uint32_t val)
{
    union {
        uint32_t u32;
        uint8_t  u8[sizeof(uint32_t)];  // uint8_t ensures a byte is 8 bits.
    } x = { .u32 = val };
    return
        ((uint32_t)x.u8[0] << 24) |
        ((uint32_t)x.u8[1] << 16) |
        ((uint32_t)x.u8[2] << 8)  |
        x.u8[3];
}
Tip: many libraries have an implementation-specific function to do this efficiently, e.g. be32toh.
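be32toh is not standard C, though; on glibc it comes from <endian.h> (the BSDs use <sys/endian.h>). A minimal sketch under that assumption:
#define _DEFAULT_SOURCE        /* expose be32toh on glibc */
#include <endian.h>            /* BSDs: <sys/endian.h> */
#include <stdint.h>
#include <string.h>

/* Sketch: copy the 4 bytes read from the PNG into a uint32_t, then let the
   library convert from big-endian to host byte order. */
static uint32_t png_read_u32(const unsigned char *buf)
{
    uint32_t be;
    memcpy(&be, buf, sizeof be);
    return be32toh(be);
}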
IMO it'd be better style to read from bytes into the desired format, rather than apparently memcpy'ing a uint32_t and then internally manipulating the uint32_t. The code might look like:
uint32_t read_be32(const uint8_t *src) // bytes must be unsigned
{
    return (src[0] * 0x1000000u) + (src[1] * 0x10000u) + (src[2] * 0x100u) + src[3];
}
It's quite easy to get this sort of code wrong, so make sure you get it from high-rep SO users 😉. You may often see the alternative suggestion return (src[0] << 24) + (src[1] << 16) + (src[2] << 8) + src[3];. However, that causes undefined behaviour if src[0] >= 128, because the integer promotions take uint8_t to signed int and the shift then overflows; it also causes undefined behaviour on a system with 16-bit int because the shifts are wider than int.
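If you prefer the shift style, casting each byte to uint32_t before shifting avoids both problems; a sketch:
#include <stdint.h>

/* Sketch: same as read_be32, but each byte is widened to uint32_t first,
   so the shifts never happen in (possibly 16-bit) signed int. */
uint32_t read_be32_shift(const uint8_t *src)
{
    return ((uint32_t)src[0] << 24) |
           ((uint32_t)src[1] << 16) |
           ((uint32_t)src[2] << 8)  |
            (uint32_t)src[3];
}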
Modern compilers should be smart enough to optimize either version; e.g. the assembly produced by clang (targeting little-endian x86-64) is:
read_be32: # #read_be32
mov eax, dword ptr [rdi]
bswap eax
ret
However, gcc 10.1 produces much more complicated code; this seems to be a surprising missed-optimization bug.
This solution doesn't rely on accessing inactive members of a union; instead it uses unsigned integer bit-shift operations, which can portably and safely convert from big-endian to little-endian or vice versa:
#include <stdint.h>

uint32_t convertEndian32(uint32_t in)
{
    return ((in & 0xffu) << 24) | ((in & 0xff00u) << 8) |
           ((in & 0xff0000u) >> 8) | ((in & 0xff000000u) >> 24);
}
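As a usage sketch for the PNG case (assuming a little-endian host, since convertEndian32 is an unconditional byte swap; the helper name is illustrative):
#include <stdint.h>
#include <string.h>

/* Sketch: buf holds the 4 bytes of a PNG field as read with fread.
   memcpy gives us those bytes as a uint32_t in PNG (big-endian) order,
   and the swap turns it into little-endian host order. */
static uint32_t png_field(const unsigned char buf[4])
{
    uint32_t v;
    memcpy(&v, buf, sizeof v);
    return convertEndian32(v);
}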
This code reads a uint32_t from a pointer to unsigned char holding big-endian storage, independently of the endianness of your architecture. (The code simply acts as if it were reading a base-256 number.)
#include <stdint.h>

uint32_t read_bigend_int(const unsigned char *p, int sz)
{
    uint32_t result = 0;
    while (sz--) {
        result <<= 8;    /* multiply by the base */
        result |= *p++;  /* and add the next digit */
    }
    return result;
}
If you call it, for example, like this:
int main()
{
    /* ... */
    unsigned char buff[1024];
    read(fd, buff, sizeof buff);
    uint32_t value = read_bigend_int(buff + offset, sizeof value);
    /* ... */
}
I want to do some operations using Intel intrinsics on vectors of 16-bit unsigned integers, and the operations are the following:
load or set from an array of unsigned short int.
Div and Mod operations with unsigned short int.
Multiplication operation with unsigned short int.
Store operation of unsigned short int into an array.
I looked into the Intrinsics Guide, but it looks like there are only intrinsics for signed short integers and not unsigned ones. Does anyone have a trick that could help me with this?
In fact I'm trying to store an image of a specific raster format in an array with a specific ordering. So I have to calculate the index where each pixel value is going to be stored:
unsigned int Index(unsigned int interleaving_depth, unsigned int x_size,
                   unsigned int y_size, unsigned int z_size, unsigned int Pixel_number)
{
    unsigned int x = 0, y = 0, z = 0, remainder = 0, i = 0;

    y = Pixel_number / (x_size * z_size);
    remainder = Pixel_number % (x_size * z_size);
    i = remainder / (x_size * interleaving_depth);
    remainder = remainder % (x_size * interleaving_depth);
    if (i == z_size / interleaving_depth) {
        x = remainder / (z_size - i * interleaving_depth);
        remainder = remainder % (z_size - i * interleaving_depth);
    }
    else {
        x = remainder / interleaving_depth;
        remainder = remainder % interleaving_depth;
    }
    z = interleaving_depth * i + remainder;
    if (z >= z_size)
        z = z_size - 1;
    return x + y * x_size + z * x_size * y_size;
}
If you only want the low half of the result, multiplication is the same binary operation for signed or unsigned, so you can use pmullw (_mm_mullo_epi16) on either. There are separate high-half multiply instructions for signed and unsigned short, though: _mm_mulhi_epu16 (pmulhuw) vs. _mm_mulhi_epi16 (pmulhw).
Similarly, you don't need an _mm_set_epu16 because it's the same operation: on x86 casting to signed doesn't change the bit-pattern, so Intel only bothered to provide _mm_set_epi16, but you can use it with args like 0xFFFFu instead of -1 with no problems. (Using Intel intrinsics automatically means your code only has to be portable to x86 32 and 64 bit.)
Load / store intrinsics don't change the data at all.
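Putting those three points together, a minimal sketch of loading, multiplying and storing unsigned shorts with the plain epi16 intrinsics (the array names and scalar tail loop are mine):
#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Sketch: element-wise a[i] *= b[i] for unsigned shorts, 8 at a time.
   pmullw produces the low 16 bits of each product, which is the correct
   result for unsigned operands as well as signed ones. */
static void mul_u16_arrays(uint16_t *a, const uint16_t *b, size_t n)
{
    size_t i;
    for (i = 0; i + 8 <= n; i += 8)
    {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(a + i), _mm_mullo_epi16(va, vb));
    }
    for (; i < n; i++)                 /* scalar tail */
        a[i] = (uint16_t)(a[i] * b[i]);
}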
SSE/AVX doesn't have integer division or mod instructions. If you have compile-time-constant divisors, do it yourself with a multiply/shift. You can look at compiler output to get the magic constant and shift counts (Why does GCC use multiplication by a strange number in implementing integer division?), or even let gcc auto-vectorize something for you. Or even use GNU C native vector syntax to divide:
#include <immintrin.h>

__m128i div13_epu16(__m128i a)
{
    typedef unsigned short __attribute__((vector_size(16))) v8uw;
    v8uw tmp = (v8uw)a;
    v8uw divisor = (v8uw)_mm_set1_epi16(13);
    v8uw result = tmp / divisor;
    return (__m128i)result;
    // clang allows "lax" vector type conversions without casts.
    // gcc allows vector / scalar, e.g. tmp / 13.  Clang requires set1;
    // to work with both, we need to jump through all the syntax hoops.
}
compiles to this asm with gcc and clang (Godbolt compiler explorer):
div13_epu16:
pmulhuw xmm0, XMMWORD PTR .LC0[rip] # tmp93,
psrlw xmm0, 2 # tmp95,
ret
.section .rodata
.LC0:
.value 20165
# repeats 8 times
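For reference, that constant implements x/13 as (x * 20165) >> 18 (pmulhuw shifts right by 16, and the psrlw by 2 more); a quick scalar sketch to confirm it for every 16-bit input:
#include <stdint.h>
#include <stdio.h>

/* Sketch: check that (x * 20165) >> 18 == x / 13 for all uint16_t x,
   which is exactly what pmulhuw + psrlw 2 compute per element above. */
int main(void)
{
    for (uint32_t x = 0; x <= 0xFFFF; x++)
        if ((x * 20165u) >> 18 != x / 13)
        {
            printf("mismatch at %u\n", x);
            return 1;
        }
    puts("ok");
    return 0;
}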
If you have runtime-variable divisors it's going to be slower, but you can use http://libdivide.com/. It's not too bad if you reuse the same divisor repeatedly, so you only have to calculate a fixed-point inverse for it once; but code using an arbitrary inverse needs a variable shift count, which is less efficient with SSE (and also in scalar integer code), and potentially more instructions, because some divisors require a more complicated sequence than others.
I have the following method in C that takes two 16-bit short ints and:
Adds the two integers
If the carry flag is set, add 1 to the result
Negate (NOT) all the bits in the final results
Return the result:
short __declspec(naked) getchecksum(short s1, short s2)
{
    __asm
    {
        mov ax, word ptr [esp+4]
        mov bx, word ptr [esp+8]
        add ax, bx
        jnc skip_add
        add ax, 1
    skip_add:
        not ax
        ret
    }
}
I had to write this in inline assembly because I do not know any way to test the carry flag without using assembler. Does anyone know of a way to do this?
No (C has no notion of flags at all), but that doesn't mean you can't get the same result. If you use 32-bit integers to do the addition, the 17th bit is the carry, so you can write it like this:
uint16_t getchecksum(uint16_t s1, uint16_t s2)
{
    uint32_t u1 = s1, u2 = s2;
    uint32_t sum = u1 + u2;
    sum += sum >> 16;
    return ~sum;
}
I've made the types unsigned to prevent trouble. That may not be necessary on your platform.
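A quick sanity check of the C version above, exercising the wrap-around case (values chosen to force a carry out of bit 15):
#include <assert.h>
#include <stdint.h>

/* Sketch: 0xFFFF + 0x0001 = 0x10000; the carry wraps back in as +1, giving
   0x0001, and the final complement yields 0xFFFE, matching the original
   add/jnc/not assembly sequence. */
int main(void)
{
    assert(getchecksum(0xFFFFu, 0x0001u) == 0xFFFEu);
    assert(getchecksum(0x0001u, 0x0002u) == (uint16_t)~0x0003u);
    return 0;
}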
You don't need to access the flags to do higher-precision arithmetic. There is a carry if the sum is smaller than either of the operands, so you can do it like this:
short getchecksum(short s1, short s2)
{
    short s = s1 + s2;
    if ((unsigned short)s < (unsigned short)s1)
        s++;
    return ~s;
}
There are already many questions about adding and carrying on SO: Efficient 128-bit addition using carry flag, Multiword addition in C
However, in C, arithmetic is always done in at least int, so you can simply add the values directly if int has more than 16 bits on your system. In your case the inline assembly is 16-bit x86, so I guess you're on Turbo C, which you should get rid of ASAP (reason: Why not to use Turbo C++?). On other systems that have a 16-bit int you can use long, which is guaranteed by the standard to be at least 32 bits:
short getchecksum(short s1, short s2)
{
    long s = s1 + s2;
    return ~((s & 0xffff) + ((s >> 16) & 0x1));
}
Why in the world was _mm_crc32_u64(...) defined like this?
unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v );
The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (It is, after all, CRC32 not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored, and filled with 0's on completion, so there is NO use to EVER have a 64-bit destination. I understand why Intel allowed a 64-bit destination operand on the instruction (for uniformity), but if I want to process data quickly, I want a source operand as large as possible (i.e. 64-bits if I have that much data left, smaller for the tail ends) and always a 32-bit destination operand. But the intrinsics don't allow a 64-bit source and 32-bit destination. Note the other intrinsics:
unsigned int _mm_crc32_u8 ( unsigned int crc, unsigned char v );
The type of "crc" is not an 8-bit type, nor is the return type, they are 32-bits. Why is there no
unsigned int _mm_crc32_u64 ( unsigned int crc, unsigned __int64 v );
? The Intel instruction supports this, and that is the intrinsic that makes the most sense.
Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks.
My guess is something like this:
#define CRC32(D32,S) __asm__("crc32 %0, %1" : "+xrm" (D32) : ">xrm" (S))
for GCC, and
#define CRC32(D32,S) __asm { crc32 D32, S }
for VisualStudio. Unfortunately I have little understanding of how constraints work, and little experience with the syntax and semantics of assembly level programming.
Small edit: note the macros I've defined:
#define GET_INT64(P) *(reinterpret_cast<const uint64* &>(P))++
#define GET_INT32(P) *(reinterpret_cast<const uint32* &>(P))++
#define GET_INT16(P) *(reinterpret_cast<const uint16* &>(P))++
#define GET_INT8(P) *(reinterpret_cast<const uint8 * &>(P))++
#define DO1_HW(CR,P) CR = _mm_crc32_u8 (CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = _mm_crc32_u16(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = _mm_crc32_u32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = (_mm_crc32_u64((uint64)CR, GET_INT64(P))) & 0xFFFFFFFF;
Notice how different the last macro statement is. The lack of uniformity is certainly an indication that the intrinsic has not been defined sensibly. While it is not necessary to put the explicit (uint64) cast in the last macro, it is implicit and does happen. Disassembling the generated code shows code for both casts, 32->64 and 64->32, both of which are unnecessary.
Put another way, it's _mm_crc32_u64, not _mm_crc64_u64, but they've implemented it as if it were the latter.
If I could get the definition of CRC32 above correct, then I would want to change my macros to
#define DO1_HW(CR,P) CR = CRC32(CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = CRC32(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = CRC32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = CRC32(CR, GET_INT64(P))
The 4 intrinsic functions provided really do allow all possible uses of the Intel-defined CRC32 instruction. The instruction's output is always 32 bits because the instruction is hard-coded to use a specific 32-bit CRC polynomial. However, the instruction allows your code to feed it input data 8, 16, 32, or 64 bits at a time. Processing 64 bits at a time should maximize throughput. Processing 32 bits at a time is the best you can do if restricted to a 32-bit build. Processing 8 or 16 bits at a time can simplify your code logic if the input byte count is odd or not a multiple of 4/8.
#include <stdio.h>
#include <stdint.h>
#include <intrin.h>

int main (int argc, char *argv [])
{
    int index;
    uint8_t  *data8;
    uint16_t *data16;
    uint32_t *data32;
    uint64_t *data64;
    uint32_t total1, total2, total3;
    uint64_t total4;
    uint64_t input [] = {0x1122334455667788, 0x1111222233334444};

    total1 = total2 = total3 = total4 = 0;
    data8  = (void *) input;
    data16 = (void *) input;
    data32 = (void *) input;
    data64 = (void *) input;

    for (index = 0; index < sizeof input / sizeof *data8; index++)
        total1 = _mm_crc32_u8 (total1, *data8++);

    for (index = 0; index < sizeof input / sizeof *data16; index++)
        total2 = _mm_crc32_u16 (total2, *data16++);

    for (index = 0; index < sizeof input / sizeof *data32; index++)
        total3 = _mm_crc32_u32 (total3, *data32++);

    for (index = 0; index < sizeof input / sizeof *data64; index++)
        total4 = _mm_crc32_u64 (total4, *data64++);

    printf ("CRC32 result using  8-bit chunks: %08X\n", total1);
    printf ("CRC32 result using 16-bit chunks: %08X\n", total2);
    printf ("CRC32 result using 32-bit chunks: %08X\n", total3);
    printf ("CRC32 result using 64-bit chunks: %08X\n", (uint32_t) total4);
    return 0;
}
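As a follow-up to the point about odd byte counts, a sketch that feeds 8 bytes at a time and finishes the tail with the byte-wide intrinsic (the function name and memcpy-based loads are mine; requires SSE4.2, e.g. gcc -msse4.2):
#include <nmmintrin.h>   /* SSE4.2 CRC32 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: accumulate CRC32-C over an arbitrary-length buffer, 64 bits at a
   time, then byte by byte for the remainder. memcpy avoids alignment and
   aliasing issues when pulling 8-byte chunks out of the byte stream. */
static uint32_t crc32c_buffer(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len >= 8)
    {
        uint64_t chunk;
        memcpy(&chunk, p, sizeof chunk);
        crc = (uint32_t)_mm_crc32_u64(crc, chunk);
        p += 8;
        len -= 8;
    }
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return crc;
}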
Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks.
My friend and I wrote a C++ SSE intrinsics wrapper which contains the preferred usage of the crc32 instruction with a 64-bit source.
http://code.google.com/p/sse-intrinsics/
See the i_crc32() intrinsic.
(Sadly, there are even more flaws in Intel's SSE intrinsic specifications for other instructions; see this page for more examples of flawed intrinsic design.)