I have the following method in C that takes two 16-bit short ints and:
Adds the two integers
If the carry flag is set, adds 1 to the result
Negates (NOTs) all the bits in the final result
Returns the result:
short __declspec(naked) getchecksum(short s1, short s2)
{
    __asm
    {
        mov  ax, word ptr [esp+4]
        mov  bx, word ptr [esp+8]
        add  ax, bx
        jnc  skip_add
        add  ax, 1
    skip_add:
        not  ax
        ret
    }
}
I had to write this in inline assembly because I do not know any way to test the carry flag without using assembler. Does anyone know of a way to do this?
No (C has no notion of flags at all), but that doesn't mean you can't get the same result. If you use 32-bit integers to do the addition, the 17th bit is the carry. So you can write it like this:
#include <stdint.h>

uint16_t getchecksum(uint16_t s1, uint16_t s2)
{
    uint32_t u1 = s1, u2 = s2;
    uint32_t sum = u1 + u2;  /* bit 16 of sum is the carry out of the 16-bit add */
    sum += sum >> 16;        /* fold the carry back in (end-around carry) */
    return ~sum;             /* truncated to 16 bits on return */
}
I've made the types unsigned to prevent trouble. That may not be necessary on your platform.
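As a quick sanity check (a minimal usage sketch; the sample inputs are arbitrary and it assumes the uint16_t getchecksum above is in the same file), the folded carry matches what the assembly version computes:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    /* 0xFFFF + 0x0001 = 0x10000: bit 16 (the carry) is folded back in before the NOT */
    assert(getchecksum(0xFFFFu, 0x0001u) == 0xFFFEu);
    /* no carry: 0x1234 + 0x0001 = 0x1235, complemented to 0xEDCA */
    assert(getchecksum(0x1234u, 0x0001u) == 0xEDCAu);
    puts("checksum folding behaves as expected");
    return 0;
}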
You don't need to access the flags to do higher-precision arithmetic. There is a carry if the (truncated) sum is smaller than either of the operands, so you can do it like this:
short getchecksum(short s1, short s2)  /* plain C body: the function must not be naked */
{
    short s = s1 + s2;
    if ((unsigned short)s < (unsigned short)s1)  /* wrapped around: carry out of the 16-bit add */
        s++;
    return ~s;
}
There are already many questions about adding and carrying on SO: Efficient 128-bit addition using carry flag, Multiword addition in C
However, in C arithmetic is always done in at least int width, so you can simply add the values directly if int has more than 16 bits on your system. In your case the inline assembly is 16-bit x86, so I guess you're on Turbo C, which you should get rid of ASAP (reason: Why not to use Turbo C++?). On other systems that have a 16-bit int you can use long, which the standard guarantees to be at least 32 bits:
short getchecksum(short s1, short s2)
{
    /* do the addition in long so the carry lands in bit 16 even where int is 16 bits */
    long s = (long)(unsigned short)s1 + (unsigned short)s2;
    return ~((s & 0xffff) + ((s >> 16) & 0x1));
}
I have written a MariaDB/MySQL UDF in C for working with spatial data. The data is available to the function as an unsigned char*. The binary encoding of the data begins with a toggle bit signalling whether the stream is encoded little-endian or big-endian. Since this is the case, I have used the following macros to read unsigned 32-bit ints from the stream:
#define U32BIT_LE_DATA(ptr) (*(ptr)<<0) | (*(ptr + 1)<<8) | (*(ptr + 2)<<16) | (*(ptr + 3)<<24)
#define U32BIT_BE_DATA(ptr) (*(ptr + 3)<<0) | (*(ptr + 2)<<8) | (*(ptr + 1)<<16) | (*(ptr)<<24)
uint32_t var = U32BIT_LE_DATA(ptr); // Little endian encoding
uint32_t var = U32BIT_BE_DATA(ptr); // Big endian encoding
The stream also has doubles that I need to parse (64-bit (8 byte) double-precision data using the IEEE 754 double-precision format). I know I can do:
double var;
memcpy(&var, ptr, sizeof(double));
But this code is not very safe with regard to portability. I am aware that if I know my machine's endianness, I can simply reverse the order of the bytes before calling memcpy. Nevertheless, is there a more reliable way to decode a double from, or encode it to, the 64-bit (8-byte) IEEE 754 double-precision format with a specified endianness, without needing to know the endianness (and system-specific double layout) of the machine running the code?
#include <stdint.h>
#include <stddef.h>

typedef union
{
    double  d;
    uint8_t b[sizeof(double)];
} u64;

inline double toDoubleLE(const uint8_t *arr, int endianess)
{
    u64 u;
    if (endianess)
    {
        /* stream byte order is the opposite of the host's: reverse while copying */
        for (size_t x = 0; x < sizeof(u); x++)
        {
            u.b[sizeof(u) - x - 1] = arr[x];
        }
    }
    else
    {
        /* same byte order: copy straight through */
        for (size_t x = 0; x < sizeof(u); x++)
        {
            u.b[x] = arr[x];
        }
    }
    return u.d;
}
double fooLE(uint8_t *arr)
{
    return toDoubleLE(arr, 0);
}

double foobE(uint8_t *arr)
{
    return toDoubleLE(arr, 1);
}
Compilers are "smart", and on x86-64 they will convert it to a couple of machine instructions:
fooLE:
mov rax, QWORD PTR [rdi]
movq xmm0, rax
ret
foobE:
mov rax, QWORD PTR [rdi]
bswap rax
movq xmm0, rax
ret
https://godbolt.org/z/ofpDGe
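For the encoding direction, the same union can be driven in reverse; a minimal sketch that reuses the u64 type and the endianess flag convention of toDoubleLE above (the name fromDoubleLE is just illustrative):

/* Inverse of toDoubleLE: write d into arr using the requested byte order. */
void fromDoubleLE(uint8_t *arr, double d, int endianess)
{
    u64 u;
    u.d = d;
    if (endianess)
    {
        /* reverse the bytes on the way out */
        for (size_t x = 0; x < sizeof(u); x++)
            arr[x] = u.b[sizeof(u) - x - 1];
    }
    else
    {
        /* emit the host's byte order unchanged */
        for (size_t x = 0; x < sizeof(u); x++)
            arr[x] = u.b[x];
    }
}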
Nevertheless, is there a more reliable way to decode a double from or encode it to a 64bit IEEE 754 double-precision floating point
On POSIX or Unix systems, consider using XDR or ASN.1.
If the data is not too big, consider representing it in textual form, e.g. JSON (with Jansson) or perhaps YAML. Then read carefully the documentation of fscanf and fprintf (in particular the %a format specifier).
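For example, C99's %a / %la conversions give an exact textual round-trip for a double (a minimal sketch; the buffer size is just a convenient choice):

#include <stdio.h>

int main(void)
{
    double in = 0.1, out = 0.0;
    char buf[64];                        /* plenty for the hex-float form of a double */

    snprintf(buf, sizeof buf, "%a", in); /* exact hexadecimal floating constant */
    sscanf(buf, "%la", &out);            /* parse it back into a double */

    printf("%s -> %.17g (round-trip exact: %d)\n", buf, out, in == out);
    return 0;
}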
See also the floating-point-gui.de
I am using C to read a .png image file, and if you're not familiar with the PNG encoding format, useful integer values are encoded in .png files in the form of 4-byte big-endian integers.
My computer is a little-endian machine, so to convert from a big-endian uint32_t that I read from the file with fread() to a little-endian one my computer understands, I've been using this little function I wrote:
#include <stdint.h>

uint32_t convertEndian(uint32_t val){
    union {
        uint32_t value;
        char bytes[sizeof(uint32_t)];
    } in, out;
    in.value = val;
    for(int i = 0; i < sizeof(uint32_t); ++i)
        out.bytes[i] = in.bytes[sizeof(uint32_t) - 1 - i];
    return out.value;
}
This works beautifully on my x86_64 UNIX environment, gcc compiles without error or warning even with the -Wall flag, but I feel rather confident that I'm relying on undefined behavior and type-punning that may not work as well on other systems.
Is there a standard function I can call that can reliably convert a big-endian integer to one the native machine understands, or if not, is there an alternative safer way to do this conversion?
I see no real UB in OP's code.
Portability issues: yes.
"type-punning that may not work as well on other systems" is not a problem with OP's C code yet may cause trouble with other languages.
Yet how about converting big (PNG) endian to host order instead?
Extract the bytes by address (lowest address which has the MSByte to highest address which has the LSByte - "big" endian) and form the result with the shifted bytes.
Something like:
uint32_t Endian_BigToHost32(uint32_t val) {
    union {
        uint32_t u32;
        uint8_t  u8[sizeof(uint32_t)];  // uint8_t ensures a byte is 8 bits.
    } x = { .u32 = val };
    return
        ((uint32_t)x.u8[0] << 24) |
        ((uint32_t)x.u8[1] << 16) |
        ((uint32_t)x.u8[2] <<  8) |
        x.u8[3];
}
Tip: many libraries have an implementation-specific function to do this efficiently. Example: be32toh.
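For example, with glibc's <endian.h> (the BSDs provide the same functions in <sys/endian.h>, so the header and feature-test macros are platform-specific), the swap can be delegated entirely; a minimal sketch:

#include <stdint.h>
#include <string.h>
#include <endian.h>   /* glibc; on BSD use <sys/endian.h> instead */

uint32_t be32_to_host(const unsigned char *src)
{
    uint32_t raw;
    memcpy(&raw, src, sizeof raw); /* copy the 4 big-endian bytes unchanged */
    return be32toh(raw);           /* library does the byte swap only if needed */
}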
IMO it'd be better style to read from bytes into the desired format, rather than apparently memcpy'ing a uint32_t and then internally manipulating the uint32_t. The code might look like:
uint32_t read_be32(uint8_t *src) // must be unsigned input
{
return (src[0] * 0x1000000u) + (src[1] * 0x10000u) + (src[2] * 0x100u) + src[3];
}
It's quite easy to get this sort of code wrong, so make sure you get it from high-rep SO users 😉. You may often see the alternative suggestion return (src[0] << 24) + (src[1] << 16) + (src[2] << 8) + src[3]; however, that causes undefined behaviour if src[0] >= 128, because the integer promotions take uint8_t to signed int, so the shift overflows a signed type. It also causes undefined behaviour on a system with 16-bit int, due to the over-wide shifts.
Modern compilers should be smart enough to optimize this; e.g. the assembly produced by clang for a little-endian target is:
read_be32: # #read_be32
mov eax, dword ptr [rdi]
bswap eax
ret
However, I see that gcc 10.1 produces much more complicated code; this seems to be a surprising missed-optimization bug.
This solution doesn't rely on accessing inactive members of a union; instead it relies on unsigned integer bit-shift operations, which can portably and safely swap between big-endian and little-endian byte order:
#include <stdint.h>
uint32_t convertEndian32(uint32_t in)
{
    return ((in & 0xffu) << 24) | ((in & 0xff00u) << 8) |
           ((in & 0xff0000u) >> 8) | ((in & 0xff000000u) >> 24);
}
This code reads a uint32_t from a pointer to unsigned char (uchar_t) stored in big-endian order, independently of the endianness of your architecture. (The code just acts as if it were reading a base-256 number.)
#include <stdint.h>

typedef unsigned char uchar_t;   /* not a standard type; defined here for completeness */

uint32_t read_bigend_int(const uchar_t *p, int sz)
{
    uint32_t result = 0;
    while (sz--) {
        result <<= 8;    /* multiply by the base */
        result |= *p++;  /* and add the next digit */
    }
    return result;
}
if you call, for example:
int main()
{
    /* ... */
    uchar_t buff[1024];
    read(fd, buff, sizeof buff);
    uint32_t value = read_bigend_int(buff + offset, sizeof value);
    /* ... */
}
This question already has answers here:
Position of least significant bit that is set
My current (bad) solution:
void initialization(void) {
    for (unsigned int i = 0; i < pageSize; i++) {
        if (2 << i == pageSize) {
            offset = i;
            break;
        }
    }
}
Example:
pageSize = 4096; solve for x in pageSize = 2^x.
I feel like there should be a much better way of doing this without a loop and without math.h, especially since the base is always 2 and there's so much bit-shifting machinery built around powers of 2.
clang is capable of recognizing what this function does:
unsigned int log2i(unsigned int n) {
    unsigned int i = 0;
    while (n >>= 1) {
        i++;
    }
    return i;
}
log2i:
shr edi
lzcnt ecx, edi
mov eax, 32
sub eax, ecx
ret
But since you have an exact power of two, in practice it's probably fine, and better, to use a GCC/clang/etc. builtin even though it's not standard C:
unsigned int log2Exact(unsigned int n) {
    return __builtin_ctz(n);
}
log2Exact:
tzcnt eax, edi
ret
My problem was I didn't really know what I was looking for. I needed to find the highest bit set (or most significant bit set, MSB). Multiple solutions can be found on Bit Twiddling Hacks. Hopefully my poorly worded question helps others Googling similarly poorly worded questions at some point.
For my specific purpose this solution works:
int v = pageSize;  // 32-bit integer to find the log base 2 of
int r;             // result of log_2(v) goes here
union { unsigned int u[2]; double d; } t;  // temp

// On a little-endian target the high word of the double lives at u[1].
t.u[__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__] = 0x43300000;  // exponent field of 2^52
t.u[__BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__] = v;           // mantissa: t.d == 2^52 + v
t.d -= 4503599627370496.0;                                    // subtract 2^52, leaving (double)v
r = (t.u[__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__] >> 20) - 0x3FF;  // biased exponent -> log2
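Wrapped in a function so it can be unit-tested (a sketch that assumes GCC/clang's predefined __BYTE_ORDER__ macros as in the snippet above, a 32-bit unsigned int, and IEEE 754 doubles whose word order matches the integer byte order):

#include <assert.h>

static int log2_float_trick(unsigned int v)
{
    union { unsigned int u[2]; double d; } t;
    t.u[__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__] = 0x43300000;
    t.u[__BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__] = v;
    t.d -= 4503599627370496.0;  /* 2^52 */
    return (int)(t.u[__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__] >> 20) - 0x3FF;
}

int main(void)
{
    assert(log2_float_trick(4096) == 12);  /* pageSize = 2^12 */
    assert(log2_float_trick(1) == 0);
    assert(log2_float_trick(6) == 2);      /* non-powers of two floor to log2 */
    return 0;
}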
I want to do some operations using Intel intrinsics on vectors of 16-bit unsigned ints, and the operations are the following:
load or set from an array of unsigned short int.
Div and Mod operations with unsigned short int.
Multiplication operation with unsigned short int.
Store operation of unsigned short int into an array.
I looked into the Intrinsics guide, but it looks like there are only intrinsics for signed short integers and not unsigned ones. Does anyone have a trick that could help me with this?
In fact I'm trying to store an image of a specific raster format in an array with a specific ordering. So I have to calculate the index where each pixel value is going to be stored:
unsigned int Index(unsigned int interleaving_depth, unsigned int x_size, unsigned int y_size, unsigned int z_size, unsigned int Pixel_number)
{
    unsigned int x = 0, y = 0, z = 0, reminder = 0, i = 0;

    y = Pixel_number/(x_size*z_size);
    reminder = Pixel_number % (x_size*z_size);
    i = reminder/(x_size*interleaving_depth);
    reminder = reminder % (x_size*interleaving_depth);

    if(i == z_size/interleaving_depth){
        x = reminder/(z_size - i*interleaving_depth);
        reminder = reminder % (z_size - i*interleaving_depth);
    }
    else
    {
        x = reminder/interleaving_depth;
        reminder = reminder % interleaving_depth;
    }
    z = interleaving_depth*i + reminder;

    if(z >= z_size)
        z = z_size - 1;

    return x + y*x_size + z*x_size*y_size;
}
If you only want the low half of the result, multiplication is the same binary operation for signed or unsigned, so you can use pmullw on either. There are separate high-half multiply instructions for signed and unsigned short, though: _mm_mulhi_epu16 (pmulhuw) vs. _mm_mulhi_epi16 (pmulhw).
Similarly, you don't need an _mm_set_epu16 because it's the same operation: on x86 casting to signed doesn't change the bit-pattern, so Intel only bothered to provide _mm_set_epi16, but you can use it with args like 0xFFFFu instead of -1 with no problems. (Using Intel intrinsics automatically means your code only has to be portable to x86 32 and 64 bit.)
Load / store intrinsics don't change the data at all.
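Putting the load / multiply / store points together, a minimal sketch (the function name mul_u16 and the element-wise-product task are just illustrative, not from the question):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise a[i] * b[i] for unsigned 16-bit ints, keeping the low 16 bits.
   pmullw (_mm_mullo_epi16) is fine on unsigned data because the low half of
   the product is identical for signed and unsigned operands. */
void mul_u16(uint16_t *dst, const uint16_t *a, const uint16_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_mullo_epi16(va, vb));
    }
    for (; i < n; i++)          /* scalar tail for the last n % 8 elements */
        dst[i] = (uint16_t)(a[i] * b[i]);
}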
SSE/AVX doesn't have integer division or mod instructions. If you have compile-time-constant divisors, do it yourself with a multiply/shift. You can look at compiler output to get the magic constant and shift counts (Why does GCC use multiplication by a strange number in implementing integer division?), or even let gcc auto-vectorize something for you. Or use GNU C native vector syntax to divide:
#include <immintrin.h>
__m128i div13_epu16(__m128i a)
{
    typedef unsigned short __attribute__((vector_size(16))) v8uw;
    v8uw tmp = (v8uw)a;
    v8uw divisor = (v8uw)_mm_set1_epi16(13);
    v8uw result = tmp/divisor;
    return (__m128i)result;
    // clang allows "lax" vector type conversions without casts
    // gcc allows vector / scalar, e.g. tmp / 13.  Clang requires set1
    // to work with both, we need to jump through all the syntax hoops
}
compiles to this asm with gcc and clang (Godbolt compiler explorer):
div13_epu16:
pmulhuw xmm0, XMMWORD PTR .LC0[rip] # tmp93,
psrlw xmm0, 2 # tmp95,
ret
.section .rodata
.LC0:
.value 20165
# repeats 8 times
If you have runtime-variable divisors, it's going to be slower, but you can use http://libdivide.com/. It's not too bad if you reuse the same divisor repeatedly, so you only have to calculate a fixed-point inverse for it once; but code that uses an arbitrary inverse needs a variable shift count, which is less efficient with SSE (and for scalar integer code too), and potentially more instructions, because some divisors require a more complicated sequence than others.
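As a scalar illustration of the "compute the inverse once, reuse it for many dividends" idea with libdivide (a sketch; it assumes libdivide's C interface, i.e. libdivide_u32_gen / libdivide_u32_do from libdivide.h, so check against the version you ship):

#include <stddef.h>
#include <stdint.h>
#include "libdivide.h"   /* http://libdivide.com/ */

void divide_all(uint32_t *vals, size_t n, uint32_t divisor)
{
    struct libdivide_u32_t d = libdivide_u32_gen(divisor);  /* fixed-point inverse, computed once */
    for (size_t i = 0; i < n; i++)
        vals[i] = libdivide_u32_do(vals[i], &d);            /* multiply+shift instead of hardware div */
}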
I have an unsigned 32 bit integer encoded in the following way:
the first 6 bits define the opcode
next 8 bits define a register
next 18 bits are a two's complement signed integer value.
I am currently decoding this number (uint32_t inst) using:
const uint32_t opcode = ((inst >> 26) & 0x3F);
const uint32_t r1 = (inst >> 18) & 0xFF;
const int32_t value = ((inst >> 17) & 0x01) ? -(131072 - (inst & 0x1FFFF)) : (inst & 0x1FFFF);
I can measure a significant overhead while decoding the value, and I am quite sure it is due to the ternary operator (essentially an if statement) used to test the sign and perform the negation.
Is there a way to perform value decoding in a faster way?
Your expression is more complicated than it needs to be, especially in needlessly involving the ternary operator. The following expression computes the same results for all inputs without involving the ternary operator.* It is a good candidate for a replacement, but as with any optimization problem, it is essential to test:
const int32_t value = (int32_t)(inst & 0x1FFFF) - (int32_t)(inst & 0x20000);
Or this variation on #doynax's suggestion along similar lines might be more optimizer-friendly:
const int32_t value = (int32_t)(inst & 0x3FFFF ^ 0x20000) - (int32_t)0x20000;
In each case, the casts avoid implementation-defined behavior; on many architectures they would be no-ops as far as the machine code is concerned. On those architectures, these expressions involve fewer operations in all cases than does yours, not to mention being unconditional.
Competitive alternatives involving shifting may also optimize well, but all such alternatives necessarily rely on implementation-defined behavior because of integer overflow of a left shift, a negative integer being the left-hand operand of a right shift, and/or converting an out-of-range value to a signed integer type. You will have to determine for yourself whether that constitutes a problem.
* as compiled by GCC 4.4.7 for x86_64. The original expression invokes implementation-defined behavior for some inputs, so on other implementations the two expressions might compute different values for those inputs.
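If you want to convince yourself the two expressions agree, an exhaustive check over all 2^18 encodings is cheap (a small test sketch; the reference decode mirrors the original ternary but adds casts so the arithmetic stays well-defined):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    for (uint32_t low18 = 0; low18 < (1u << 18); low18++) {
        /* reference: sign-test-and-negate, kept in signed arithmetic */
        int32_t ref = (low18 & 0x20000) ? -(int32_t)(131072 - (low18 & 0x1FFFF))
                                        : (int32_t)(low18 & 0x1FFFF);
        int32_t branchless = (int32_t)(low18 & 0x1FFFF) - (int32_t)(low18 & 0x20000);
        assert(branchless == ref);
    }
    return 0;
}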
A standard (even though non-portable) practice is a left-shift followed by an arithmetic right-shift:
const int32_t temp = inst << 14; // "shift out" the 14 unneeded bits
const int32_t value = temp >> 14; // shift the number back; sign-extend
This involves a conversion from uint32_t to int32_t and a right-shift of a possibly negative int32_t; both operations are implementation-defined, i.e. not strictly portable (they work on 2's complement systems, which covers pretty much every architecture you will meet). If you want the best performance and are willing to rely on implementation-defined behavior, you can use this code.
As a single expression:
const int32_t value = (int32_t)(inst << 14) >> 14;
Note: the following looks cleaner, will also typically work, but involves undefined behavior (signed integer overflow):
const int32_t value = (int32_t)inst << 14 >> 14;
Don't use it! (even though you probably won't receive any warning or error about it).
For ideal compiler output with no implementation-defined or undefined behaviour, use #doynax's 2's complement decoding expression:
value = (int32_t)((inst & 0x3FFFF) ^ 0x20000) - (int32_t)0x20000;
The casts make sure we're doing signed subtraction, rather than unsigned with wraparound and then assigning that bit-pattern to a signed integer.
This compiles to optimal asm on ARM, where gcc uses sbfx r1, r1, #0, #18 (signed bitfield-extract) to sign-extend bits [17:0] into a full int32_t register. On x86, it uses shl by 14 and sar by 14 (arithmetic shift) to do the same thing. This is a clear sign that gcc recognizes the 2's complement pattern and uses whatever is most optimal on the target machine to sign-extend the bitfield.
There isn't a portable way to make sure bitfields are ordered the way you want them. gcc appears to order bitfields from LSB to MSB for little-endian targets, but MSB to LSB for big-endian targets. You can use an #if to get identical asm output for ARM with/without -mbig-endian, just like the other methods, but there's no guarantee that other compilers work the same.
If gcc/clang didn't see through the xor and sub, it would be worth considering the <<14 / >>14 implementation which hand-holds the compiler towards doing it that way. Or considering the signed/unsigned bitfield approach with an #if.
But since we can get ideal asm from gcc/clang with fully safe and portable code, we should just do that.
See the code on the Godbolt Compiler Explorer, for versions from most of the answers. You can look at asm output for x86, ARM, ARM64, or PowerPC.
// have to put the results somewhere, so the function doesn't optimize away
struct decode {
    //unsigned char opcode, r1;
    unsigned int opcode, r1;
    int32_t value;
};

// in real code you might return the struct by value, but there's less ABI variation
// when looking at the ASM this way (some ABIs would pack the struct into registers)
void decode_two_comp_doynax(struct decode *result, uint32_t inst) {
    result->opcode = (inst >> 26) & 0x3F;
    result->r1     = (inst >> 18) & 0xFF;
    result->value  = ((inst & 0x3FFFF) ^ 0x20000) - 0x20000;
}
# clang 3.7.1 -O3 -march=haswell (enables BMI1 bextr)
mov eax, esi
shr eax, 26 # grab the top 6 bits with a shift
mov dword ptr [rdi], eax
mov eax, 2066 # 0x812; only AMD provides bextr r32, r32, imm. Intel has to set up the constant separately
bextr eax, esi, eax # extract the middle bitfield
mov dword ptr [rdi + 4], eax
shl esi, 14 # <<14
sar esi, 14 # >>14 (arithmetic shift)
mov dword ptr [rdi + 8], esi
ret
You may consider using bit-fields to simplify your code.
typedef struct inst_type {
#ifdef MY_MACHINE_NEEDS_THIS
    uint32_t opcode : 6;
    uint32_t r1     : 8;
    int32_t  value  : 18;
#else
    int32_t  value  : 18;
    uint32_t r1     : 8;
    uint32_t opcode : 6;
#endif
} inst_type;
const uint32_t opcode = inst.opcode;
const uint32_t r1 = inst.r1;
const int32_t value = inst.value;
Direct bit manipulation often performs better, but not always. Using John Bollinger's answer as a baseline, the above structure results in one fewer instruction to extract the three values of interest on GCC (but fewer instructions does not necessarily mean faster).
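One detail the snippet above leaves implicit is how the raw 32-bit word gets into the structure; a memcpy keeps that step well-defined (a sketch, assuming C11 for _Static_assert and that the #ifdef above selects the layout matching your compiler's bit-field ordering):

#include <stdint.h>
#include <string.h>

int32_t decode_value(uint32_t raw)
{
    _Static_assert(sizeof(inst_type) == sizeof(uint32_t),
                   "bit-fields are expected to pack into one 32-bit word");
    inst_type inst;
    memcpy(&inst, &raw, sizeof raw);  /* reinterpret the raw word as the bit-field struct */
    return inst.value;                /* the compiler emits the sign extraction */
}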
const uint32_t opcode = ((inst >> 26) & 0x3F);
const uint32_t r1 = (inst >> 18) & 0xFF;
const uint32_t negative = ((inst >> 17) & 0x01);
const int32_t value = -(negative * 131072 - (inst & 0x1FFFF));
When negative is 1, this evaluates to -(131072 - (inst & 0x1FFFF)); when it is 0, it evaluates to -(0 - (inst & 0x1FFFF)), which is equal to inst & 0x1FFFF.