Popcnt using inline assembly language in C [duplicate]

A simple implementation of the popcnt function in C:
int popcnt(uint64_t x) {
    int s = 0;
    for (int i = 0; i < 64; i++) {
        if ((x >> i) & 1) s++;  /* test bit i */
    }
    return s;
}
I am using inline assembly language (x86-64) to implement popcnt,
int asm_popcnt(uint64_t x) {
int i = 0, sum = 0;
uint64_t tmp = 0;
asm ( ".Pct: \n\t"
"movq %[xx], %[tm]\n\t"
"andq $0x1, %[tm]\n\t"
"test %[tm], %[tm]\n\t"
"je .Grt \n\t"
"incl %[ss] \n\t"
".Grt: \n\t"
"shrq $0x1, %[xx]\n\t"
"incl %[ii] \n\t"
"cmpl $0x3f, %[ii]\n\t"
"jle .Pct \n\t"
: [ss] "+r"(sum)
: [xx] "r"(x) , [ii] "r"(i),
[tm] "r"(tmp)
);
return sum;
}
but I received WA (wrong answer) from the online judge.
I tested all powers of 2 (from 0x1 to (0x1 << 63)) on my computer and each returned 1, which suggests that my asm_popcnt can identify every bit of any 64-bit integer, since all other integers are just combinations of 0x1, 0x2, 0x4, etc. (for example, 0x11a = 0x2 "or" 0x8 "or" 0x10 "or" 0x100). So there shouldn't be any case for which the OJ returns "WA". Is there anything wrong in my code? The jump instruction?
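A likely cause, in line with the linked duplicates: the asm writes to x, i and tmp, but they are declared as input-only operands. The compiler may then assume they are unchanged and may even place i and tmp (both 0) in the same register. The flags are also modified without a "cc" clobber, and the named labels can collide if the function is inlined more than once. A minimal sketch of the same loop with those points addressed, assuming x86-64 with GCC or Clang; asm_popcnt_fixed is a hypothetical name:

```c
#include <stdint.h>

/* Sketch only: same algorithm as the question, x86-64 AT&T syntax. */
int asm_popcnt_fixed(uint64_t x) {
    int i = 0, sum = 0;
    uint64_t tmp;
    __asm__ (
        "1:\n\t"                  /* numeric local label, safe when inlined repeatedly */
        "movq %[xx], %[tm]\n\t"
        "andq $0x1, %[tm]\n\t"    /* andq already sets ZF, so no separate test is needed */
        "je 2f\n\t"
        "incl %[ss]\n\t"
        "2:\n\t"
        "shrq $0x1, %[xx]\n\t"
        "incl %[ii]\n\t"
        "cmpl $0x3f, %[ii]\n\t"
        "jle 1b\n\t"
        /* every register the asm writes is an output: "+r" is read-write,
           "=&r" is a write-only scratch that must not overlap any input */
        : [ss] "+r"(sum), [xx] "+r"(x), [ii] "+r"(i), [tm] "=&r"(tmp)
        :
        : "cc");                  /* the loop changes the condition flags */
    return sum;
}
```

Testing only powers of 2 can hide this kind of bug, because what goes wrong depends on how the compiler allocates registers at a given optimization level, not on the input value.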

Related

Moving data into __uint24 with assembly

I originally had the following C code:
volatile register uint16_t counter asm("r12");
__uint24 getCounter() {
__uint24 res = counter;
res = (res << 8) | TCNT0;
return res;
}
This function runs in some hot places and is inlined, and I'm trying to cram a lot of stuff into an ATtiny13, so it came time to optimize it.
That function compiles to:
getCounter:
movw r24,r12
ldi r26,0
clr r22
mov r23,r24
mov r24,r25
in r25,0x32
or r22,r25
ret
I came up with this assembly:
inline __uint24 getCounter() {
//__uint24 res = counter;
//res = (res << 8) | TCNT0;
uint32_t result;
asm(
"in %A[result],0x32" "\n\t"
"movw %C[result],%[counter]" "\n\t"
"mov %B[result],%C[result]" "\n\t"
"mov %C[result],%D[result]" "\n\t"
: [result] "=r" (result)
: [counter] "r" (counter)
:
);
return (__uint24) result;
}
The reason for uint32_t is to "allocate" a fourth consecutive register and to let the compiler know it is clobbered (since I cannot write something like "%D[result]" in the clobber list).
Is my assembly correct? From my testing it seems to be.
Is there a way to let the compiler optimize getCounter() better, so that there is no need for confusing assembly?
Is there a better way to do this in assembly?

Write registers data into array using asm C

I created a program that writes register data into variables using asm, and it seems to work well. Then I decided to replace the variables with an array and write the register data into the array. I used the same approach, but when I print the variables and the array members they have different values, although they should be the same.
What am I doing wrong when writing the register values into an array? As I understand it, this should work the same way as writing to a standalone variable.
void read_registers(void)
{
int ebx_val, ecx_val, edx_val;
char reg_name[4][4] = {"ebx", "ecx", "edx"};
int reg_val[3];
printk("\n===OLD VALUES BELOW===");
test_syscall();/*inside of the syscall registers were written 0xDEADBEEF*/
__asm__ volatile (
"\t movl %%ebx,%0" : "=r"(ebx_val));
__asm__ volatile (
"\t movl %%ecx,%0" : "=r"(ecx_val));
__asm__ volatile (
"\t movl %%edx,%0" : "=r"(edx_val));
printk("\nReg ebx val user mode 0x%x\n", ebx_val);
printk("\nReg ecx val user mode 0x%x\n", ecx_val);
printk("\nReg edx val user mode 0x%x\n", edx_val);
printk("\n===NEW VALUES BELOW===");
__asm__ volatile (
"\t movl %%ebx,%0" : "=r"(reg_val[0]));
__asm__ volatile (
"\t movl %%ecx,%0" : "=r"(reg_val[1]));
__asm__ volatile (
"\t movl %%edx,%0" : "=r"(reg_val[2]));
for (int i = 0; i < 3; i++)
{
    printk("\nReg %s val is 0x%x\n", reg_name[i], reg_val[i]);
}
}

Last stretch of rounding function in ASM

What I essentially have to do is make what is in Main work.
I'm on the last stretch of this assignment (which will likely take just as long as it took me to get here). I'm having trouble figuring out how to take the roundingMode that is passed to roundD and use it in ASM.
Also, there is a block of just comments. As far as I can tell, that's all I have left to do. Does that sound right?
#include <stdio.h>
#include <stdlib.h>
#define PRECISION 3
#define RND_CTL_BIT_SHIFT 10
// floating point rounding modes: IA-32 Manual, Vol. 1, p. 4-20
typedef enum {
ROUND_NEAREST_EVEN = 0 << RND_CTL_BIT_SHIFT,
ROUND_MINUS_INF = 1 << RND_CTL_BIT_SHIFT,
ROUND_PLUS_INF = 2 << RND_CTL_BIT_SHIFT,
ROUND_TOWARD_ZERO = 3 << RND_CTL_BIT_SHIFT
} RoundingMode;
double roundD(double n, RoundingMode roundingMode)
{
// do not change anything above this comment
int oldCW = 0x0000;
int newCW = 0xF3FF;
int mask = 0x0300;
int tempVar = 0x0000;
asm(" push %eax \n"
" push %ebx \n"
" fstcw %[oldCWOut] \n" //store FPU CW into OldCW
" mov %%eax, %[oldCWOut] \n" //store old FPU CW into tempVar
" mov %[tempVarIn], %%eax \n"
" add %%eax, %[maskIn] \n" //isolate rounding bits
" add %%eax, %[roundModeOut] \n" //adding rounding modifier
//shift in old bits to tempFPU
//do rounding calculation
//store result into n
" fldcw %[oldCWIn] \n" //restoring the FPU CW to normal
" pop %ebx \n"
" pop %eax \n"
: [oldCWOut] "=m" (oldCW),
[newCWOut] "=m" (newCW),
[maskOut] "=m" (mask),
[tempVarOut] "=m" (tempVar),
[roundModeOut] "=m" (roundMode)
: [oldCWIn] "m" (oldCW),
[newCWIn] "m" (newCW),
[maskIn] "m" (mask),
[tempVarIn] "m" (tempVar),
[roundModeIn] "m" (roundMode)
:"eax", "ebx"
);
return n;
// do not change anything below this comment, except for printing out your name
}
int main(int argc, char **argv)
{
double n = 0.0;
if (argc > 1)
n = atof(argv[1]);
printf("roundD even %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_NEAREST_EVEN));
printf("roundD down %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_MINUS_INF));
printf("roundD up %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_PLUS_INF));
printf("roundD zero %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_TOWARD_ZERO));
return 0;
}

While C might like to pretend that enum is not just an integer, it is just an integer. If you can't use roundingMode directly in the assembly, create an integer local variable and set it equal to the roundingMode parameter.
I'm just offering this as a suggestion to you. I've never used inline assembly before and I've never used x86 assembly before, but if all you need to do is reference the parameter, what I said above should work.
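One way to follow that suggestion is to take the mode as a plain integer and merge it into the x87 control word. The following is a minimal sketch, assuming x86 with GCC or Clang; round_with_mode is a hypothetical helper, not the assignment's roundD, and it assumes the goal is rounding to an integer. It uses the mask 0x0C00 (bits 10 and 11, the rounding-control field that RND_CTL_BIT_SHIFT points at) in place of the question's 0x0300, which covers the precision-control bits.

```c
#include <stdint.h>

/* Hypothetical helper: rounds n to an integer using the given x87 rounding mode.
   rounding_mode is expected to be one of the RoundingMode values (0..3 << 10). */
double round_with_mode(double n, int rounding_mode) {
    uint16_t old_cw, new_cw;

    /* save the current FPU control word */
    __asm__ volatile ("fnstcw %0" : "=m"(old_cw));

    /* replace only the rounding-control bits */
    new_cw = (uint16_t)((old_cw & ~0x0C00) | (rounding_mode & 0x0C00));

    __asm__ volatile (
        "fldcw %1\n\t"      /* switch to the requested rounding mode */
        "frndint\n\t"       /* round st(0) to an integer in that mode */
        "fldcw %2\n\t"      /* restore the caller's control word */
        : "+t"(n)           /* "t" is the top of the x87 register stack */
        : "m"(new_cw), "m"(old_cw));
    return n;
}
```

Passing the operands through constraints avoids the manual push/pop of eax and ebx, since the compiler chooses and preserves registers itself.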

How do you detect the CPU architecture type during run-time with GCC and inline asm?

I need to find the architecture type of a CPU. I do not have access to /proc/cpuinfo, as the machine is running syslinux. I know there is a way to do it with inline ASM, however I believe my syntax is incorrect as my variable iedx is not being set properly.
I'm drudging along with ASM, and by no means an expert. If anyone has any tips or can point me in the right direction, I would be much obliged.
static int is64Bit(void) {
int iedx = 0;
asm("mov %eax, 0x80000001");
asm("cpuid");
asm("mov %0, %%eax" : : "a" (iedx));
if ((iedx) && (1 << 29))
{
return 1;
}
return 0;
}

How many bugs can you fit in so few lines ;)
Try
static int is64bit(void) {
int iedx = 0;
asm volatile ("movl $0x80000001, %%eax\n"
"cpuid\n"
: "=d"(iedx)
: /* No Inputs */
: "eax", "ebx", "ecx"
);
if(iedx & (1 << 29))
{
return 1;
}
return 0;
}
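For reference, the question's version has several separate problems: AT&T syntax puts the source first and needs a $ prefix on immediates, registers are not preserved between separate asm statements, "a"(iedx) declares an input instead of an output, and && is a logical test where a bitwise & is needed. An alternative that avoids hand-written asm is the <cpuid.h> header shipped with GCC and Clang. A minimal sketch, assuming an x86 target; is_long_mode_capable is a hypothetical name:

```c
#include <cpuid.h>

/* Returns 1 if CPUID reports long mode (64-bit) support, 0 otherwise. */
static int is_long_mode_capable(void) {
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is not supported */
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;

    /* bit 29 of EDX in leaf 0x80000001 is the long-mode flag */
    return (edx >> 29) & 1;
}
```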

How to do unsigned saturating addition in C?

What is the best (cleanest, most efficient) way to write saturating addition in C?
The function or macro should add two unsigned inputs (need both 16- and 32-bit versions) and return all-bits-one (0xFFFF or 0xFFFFFFFF) if the sum overflows.
Target is x86 and ARM using gcc (4.1.2) and Visual Studio (for simulation only, so a fallback implementation is OK there).

You probably want portable C code here, which your compiler will turn into proper ARM assembly. ARM has conditional moves, and these can be conditional on overflow. The algorithm then becomes: add and conditionally set the destination to unsigned(-1), if overflow was detected.
uint16_t add16(uint16_t a, uint16_t b)
{
uint16_t c = a + b;
if (c < a) /* Can only happen due to overflow */
c = -1;
return c;
}
Note that this differs from the other algorithms in that it corrects overflow, instead of relying on another calculation to detect overflow.
x86-64 clang 3.7 -O3 output for adds32: significantly better than any other answer:
add edi, esi
mov eax, -1
cmovae eax, edi
ret
ARMv7: gcc 4.8 -O3 -mcpu=cortex-a15 -fverbose-asm output for adds32:
adds r0, r0, r1 # c, a, b
it cs
movcs r0, #-1 # conditional-move
bx lr
16bit: still doesn't use ARM's unsigned-saturating add instruction (UQADD16)
add r1, r1, r0 # tmp114, a
movw r3, #65535 # tmp116,
uxth r1, r1 # c, tmp114
cmp r0, r1 # a, c
ite ls #
movls r0, r1 #,, c
movhi r0, r3 #,, tmp116
bx lr #
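The compiler output above refers to adds32, which is not shown in the answer. Presumably it follows the same pattern as add16; a minimal sketch under that assumption:

```c
#include <stdint.h>

/* Assumed 32-bit counterpart of add16: add, then correct on wrap-around. */
uint32_t adds32(uint32_t a, uint32_t b)
{
    uint32_t c = a + b;      /* unsigned addition wraps modulo 2^32 */
    if (c < a)               /* can only happen if the sum wrapped */
        c = UINT32_MAX;
    return c;
}
```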

In plain C:
uint16_t sadd16(uint16_t a, uint16_t b) {
return (a > 0xFFFF - b) ? 0xFFFF : a + b;
}
uint32_t sadd32(uint32_t a, uint32_t b) {
return (a > 0xFFFFFFFF - b) ? 0xFFFFFFFF : a + b;
}
which is almost macro-ized and directly conveys the meaning.

In IA32 without conditional jumps:
uint32_t sadd32(uint32_t a, uint32_t b)
{
#if defined IA32
__asm
{
mov eax,a
xor edx,edx
add eax,b
setnc dl
dec edx
or eax,edx
}
#elif defined ARM
// ARM code
#else
// non-IA32/ARM way, copy from above
#endif
}

In ARM you may already have saturated arithmetic built-in. The ARMv5 DSP-extensions can saturate registers to any bit-length. Also, on ARM saturation is usually cheap because you can execute most instructions conditionally.
ARMv6 even has saturated addition, subtraction and all the other stuff for 32 bits and packed numbers.
On the x86 you get saturated arithmetic either via MMX or SSE.
All this needs assembler, so it's not what you've asked for.
There are C-tricks to do saturated arithmetic as well. This little code does saturated addition on four bytes of a dword. It's based on the idea to calculate 32 half-adders in parallel, e.g. adding numbers without carry overflow.
This is done first. Then the carries are calculated, added and replaced with a mask if the addition would overflow.
uint32_t SatAddUnsigned8(uint32_t x, uint32_t y)
{
uint32_t signmask = 0x80808080;
uint32_t t0 = (y ^ x) & signmask;
uint32_t t1 = (y & x) & signmask;
x &= ~signmask;
y &= ~signmask;
x += y;
t1 |= t0 & x;
t1 = (t1 << 1) - (t1 >> 7);
return (x ^ t0) | t1;
}
You can get the same for 16 bits (or any kind of bit-field) by changing the signmask constant and the shifts at the bottom like this:
uint32_t SatAddUnsigned16(uint32_t x, uint32_t y)
{
uint32_t signmask = 0x80008000;
uint32_t t0 = (y ^ x) & signmask;
uint32_t t1 = (y & x) & signmask;
x &= ~signmask;
y &= ~signmask;
x += y;
t1 |= t0 & x;
t1 = (t1 << 1) - (t1 >> 15);
return (x ^ t0) | t1;
}
uint32_t SatAddUnsigned32 (uint32_t x, uint32_t y)
{
uint32_t signmask = 0x80000000;
uint32_t t0 = (y ^ x) & signmask;
uint32_t t1 = (y & x) & signmask;
x &= ~signmask;
y &= ~signmask;
x += y;
t1 |= t0 & x;
t1 = (t1 << 1) - (t1 >> 31);
return (x ^ t0) | t1;
}
Above code does the same for 16 and 32 bit values.
If you don't need the feature that the functions add and saturate multiple values in parallel just mask out the bits you need. On ARM you also want to change the signmask constant because ARM can't load all possible 32 bit constants in a single cycle.
Edit: The parallel versions are most likely slower than the straight forward methods, but they are faster if you have to saturate more than one value at a time.
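To check the bit trick, it helps to have a plain lane-by-lane version to compare against. A minimal sketch of what SatAddUnsigned8 is meant to compute; sat_add_u8x4_ref is a hypothetical name introduced here for testing only:

```c
#include <stdint.h>

/* Reference version: saturating add on each of the four bytes, one at a time. */
uint32_t sat_add_u8x4_ref(uint32_t x, uint32_t y)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (x >> (8 * lane)) & 0xFF;
        uint32_t b = (y >> (8 * lane)) & 0xFF;
        uint32_t s = a + b;
        if (s > 0xFF)            /* clamp this lane to its maximum */
            s = 0xFF;
        result |= s << (8 * lane);
    }
    return result;
}
```

For example, with x = 0x01FF7F80 and y = 0x01018080, three of the four lanes overflow, and both this reference and SatAddUnsigned8 give 0x02FFFFFF.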

If you care about performance, you really want to do this sort of stuff in SIMD, where x86 has native saturating arithmetic.
Because of this lack of saturating arithmetic in scalar math, there are cases in which operations done with 4-variable-wide SIMD are more than 4 times faster than the equivalent C (and the same holds for 8-variable-wide SIMD):
sub8x8_dct8_c: 1332 clocks
sub8x8_dct8_mmx: 182 clocks
sub8x8_dct8_sse2: 127 clocks

Zero branch solution:
uint32_t sadd32(uint32_t a, uint32_t b)
{
uint64_t s = (uint64_t)a+b;
return -(s>>32) | (uint32_t)s;
}
A good compiler will optimize this to avoid doing any actual 64-bit arithmetic (s>>32 will merely be the carry flag, and -(s>>32) is the result of sbb %eax,%eax).
In x86 asm (AT&T syntax, a and b in eax and ebx, result in eax):
add %eax,%ebx
sbb %eax,%eax
or %ebx,%eax
8- and 16-bit versions should be obvious. Signed version might require a bit more work.
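For completeness, the 16-bit version can be sketched the same way: widen to 32 bits, then turn the carry into an all-ones mask. The name sadd16_branchless is an assumption made for this example:

```c
#include <stdint.h>

/* 16-bit variant of the zero-branch idea above. */
uint16_t sadd16_branchless(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;       /* at most 0x1FFFE, so bit 16 is the carry */
    /* -(s >> 16) is 0 without carry and 0xFFFFFFFF with carry */
    return (uint16_t)(-(s >> 16) | s);
}
```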

uint32_t saturate_add32(uint32_t a, uint32_t b)
{
uint32_t sum = a + b;
if ((sum < a) || (sum < b))
return ~((uint32_t)0);
else
return sum;
} /* saturate_add32 */
uint16_t saturate_add16(uint16_t a, uint16_t b)
{
uint16_t sum = a + b;
if ((sum < a) || (sum < b))
return ~((uint16_t)0);
else
return sum;
} /* saturate_add16 */
Edit: Now that you've posted your version, I'm not sure mine is any cleaner/better/more efficient/more studly.

The current implementation we are using is:
#define sadd16(a, b) (uint16_t)( ((uint32_t)(a)+(uint32_t)(b)) > 0xffff ? 0xffff : ((a)+(b)))
#define sadd32(a, b) (uint32_t)( ((uint64_t)(a)+(uint64_t)(b)) > 0xffffffff ? 0xffffffff : ((a)+(b)))

I'm not sure if this is faster than Skizz's solution (always profile), but here's an alternative no-branch assembly solution. Note that this requires the conditional move (CMOV) instruction, which I'm not sure is available on your target.
uint32_t sadd32(uint32_t a, uint32_t b)
{
__asm
{
mov eax, a
add eax, b
mov edx, 0xffffffff
cmovc eax, edx
}
}

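I suppose the best way for x86 is to use inline assembler to check the carry flag after the addition (for unsigned operands it is the carry flag, not the overflow flag, that signals a wrapped sum). Something like:
add eax, ebx
jnc ##1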
or eax, 0FFFFFFFFh
##1:
.......
It's not very portable, but IMHO the most efficient way.

Just in case someone wants to know an implementation without branching using 2's complement 32bit integers.
Warning! This code uses the undefined operation "shift right by -1" and therefore relies on the x86 SAR instruction masking the count operand to 5 bits.
int32_t sadd(int32_t a, int32_t b){
int32_t sum = a+b;
int32_t overflow = ((a^sum)&(b^sum))>>31;
return (overflow<<31)^(sum>>overflow);
}
It's the best implementation known to me
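Where undefined behaviour is a concern, a version that checks the range before adding is an option. This is a minimal sketch of a different technique from the one above (it branches); sadd_s32 is a hypothetical name:

```c
#include <stdint.h>

/* Signed saturating add that never evaluates an overflowing a + b. */
int32_t sadd_s32(int32_t a, int32_t b)
{
    if (b > 0 && a > INT32_MAX - b)   /* sum would exceed INT32_MAX */
        return INT32_MAX;
    if (b < 0 && a < INT32_MIN - b)   /* sum would fall below INT32_MIN */
        return INT32_MIN;
    return a + b;
}
```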

The best performance will usually involve inline assembly (as some have already stated).
But for portable C, these functions only involve one comparison and no type-casting (and thus I believe optimal):
unsigned saturate_add_uint(unsigned x, unsigned y)
{
if (y > UINT_MAX - x) return UINT_MAX;
return x + y;
}
unsigned short saturate_add_ushort(unsigned short x, unsigned short y)
{
if (y > USHRT_MAX - x) return USHRT_MAX;
return x + y;
}
As macros, they become:
#define SATURATE_ADD_UINT(x, y) (((y)>UINT_MAX-(x)) ? UINT_MAX : ((x)+(y)))
#define SATURATE_ADD_USHORT(x, y) (((y)>USHRT_MAX-(x)) ? USHRT_MAX : ((x)+(y)))
I leave versions for 'unsigned long' and 'unsigned long long' as an exercise to the reader. ;-)

An alternative to the branch free x86 asm solution is (AT&T syntax, a and b in eax and ebx, result in eax):
add %eax,%ebx
sbb $0,%ebx

int saturating_add(int x, int y)
{
int w = sizeof(int) << 3;
int msb = 1 << (w-1);
int s = x + y;
int sign_x = msb & x;
int sign_y = msb & y;
int sign_s = msb & s;
int nflow = sign_x && sign_y && !sign_s;
int pflow = !sign_x && !sign_y && sign_s;
int nmask = (~!nflow + 1);
int pmask = (~!pflow + 1);
return (nmask & ((pmask & s) | (~pmask & ~msb))) | (~nmask & msb);
}
This implementation doesn't use control flow, comparison operators (==, !=) or the ?: operator. It uses only bitwise and logical operators.

Using C++ you could write a more flexible variant of Remo.D's solution:
template<typename T>
T sadd(T first, T second)
{
static_assert(std::is_integral<T>::value, "sadd is not defined for non-integral types");
return first > std::numeric_limits<T>::max() - second ? std::numeric_limits<T>::max() : first + second;
}
This can be easily translated to C using the limits defined in limits.h. Please also note that the fixed-width integer types might not be available on your system.

//function-like macro to add signed vals,
//then test for overflow and clamp to max if required
#define SATURATE_ADD(a,b,val) ( {\
if( (a>=0) && (b>=0) )\
{\
val = a + b;\
if (val < 0) {val=0x7fffffff;}\
}\
else if( (a<=0) && (b<=0) )\
{\
val = a + b;\
if (val > 0) {val=-1*0x7fffffff;}\
}\
else\
{\
val = a + b;\
}\
})
I did a quick test and it seems to work, but I have not tested it extensively yet. It works with signed 32-bit values.
Note: the editor on the web page does not let me post the macro properly, since it does not handle the non-indented syntax.

Saturation arithmetic is not standard in C, but it is often implemented via compiler intrinsics, so the most efficient way will not be the cleanest. You must add #ifdef blocks to select the proper implementation. MSalters's answer is the fastest for the x86 architecture. For ARM you need to use the __qadd16 function (ARM compiler) or _arm_qadd16 (Microsoft Visual Studio) for the 16-bit version and __qadd for the 32-bit version. Each is translated automatically to a single ARM instruction.
Links:
__qadd16
_arm_qadd16
__qadd
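As an illustration of the intrinsic route on GCC and Clang, which is not mentioned in the answer above, the __builtin_add_overflow builtin (available in GCC 5 and later and in Clang) reports whether the addition wrapped. A minimal sketch; sadd32_builtin is a hypothetical name:

```c
#include <stdint.h>

/* Unsigned saturating add using the GCC/Clang overflow builtin. */
static inline uint32_t sadd32_builtin(uint32_t a, uint32_t b)
{
    uint32_t r;
    /* the builtin stores the wrapped sum in r and returns nonzero on overflow */
    return __builtin_add_overflow(a, b, &r) ? UINT32_MAX : r;
}
```

On other compilers this would sit behind an #ifdef, with one of the portable versions as the fallback.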

I'll add solutions that were not yet mentioned above.
Intel x86 has the ADC (add with carry) instruction, which is exposed as the _addcarry_u32() intrinsic. There should be a similar intrinsic for ARM.
This allows a very fast uint32_t saturated addition on Intel x86:
#include <stdint.h>
#include <immintrin.h>
uint32_t add_sat_u32(uint32_t a, uint32_t b) {
uint32_t r, carry = _addcarry_u32(0, a, b, &r);
return r | (-carry);
}
Intel x86 MMX saturated addition instructions can be used to implement uint16_t variant:
#include <stdint.h>
#include <immintrin.h>
uint16_t add_sat_u16(uint16_t a, uint16_t b) {
return _mm_cvtsi64_si32(_mm_adds_pu16(
_mm_cvtsi32_si64(a),
_mm_cvtsi32_si64(b)
));
}
I don't mention ARM solution, as it can be implemented by other generic solutions from other answers.
