Last stretch of rounding function in ASM - c

What I essentially have to do is make what is in main work.
I'm on the last stretch of this assignment (which will likely take just as long as it took me to get here). I'm having trouble figuring out how to take the roundingMode that is passed to roundD and use it in the ASM.
Also, there is a block of comments in the asm; as far as I can tell, those steps are all I have left to do. Does that sound right?
#include <stdio.h>
#include <stdlib.h>
#define PRECISION 3
#define RND_CTL_BIT_SHIFT 10
// floating point rounding modes: IA-32 Manual, Vol. 1, p. 4-20
typedef enum {
ROUND_NEAREST_EVEN = 0 << RND_CTL_BIT_SHIFT,
ROUND_MINUS_INF = 1 << RND_CTL_BIT_SHIFT,
ROUND_PLUS_INF = 2 << RND_CTL_BIT_SHIFT,
ROUND_TOWARD_ZERO = 3 << RND_CTL_BIT_SHIFT
} RoundingMode;
double roundD(double n, RoundingMode roundingMode)
{
// do not change anything above this comment
int oldCW = 0x0000;
int newCW = 0xF3FF;
int mask = 0x0300;
int tempVar = 0x0000;
asm(" push %eax \n"
" push %ebx \n"
" fstcw %[oldCWOut] \n" //store FPU CW into OldCW
" mov %%eax, %[oldCWOut] \n" //store old FPU CW into tempVar
" mov %[tempVarIn], %%eax \n"
" add %%eax, %[maskIn] \n" //isolate rounding bits
" add %%eax, %[roundModeOut] \n" //adding rounding modifier
//shift in old bits to tempFPU
//do rounding calculation
//store result into n
" fldcw %[oldCWIn] \n" //restoring the FPU CW to normal
" pop %ebx \n"
" pop %eax \n"
: [oldCWOut] "=m" (oldCW),
[newCWOut] "=m" (newCW),
[maskOut] "=m" (mask),
[tempVarOut] "=m" (tempVar),
[roundModeOut] "=m" (roundMode)
: [oldCWIn] "m" (oldCW),
[newCWIn] "m" (newCW),
[maskIn] "m" (mask),
[tempVarIn] "m" (tempVar),
[roundModeIn] "m" (roundMode)
:"eax", "ebx"
);
return n;
// do not change anything below this comment, except for printing out your name
}
int main(int argc, char **argv)
{
double n = 0.0;
if (argc > 1)
n = atof(argv[1]);
printf("roundD even %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_NEAREST_EVEN));
printf("roundD down %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_MINUS_INF));
printf("roundD up %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_PLUS_INF));
printf("roundD zero %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_TOWARD_ZERO));
return 0;
}

While C might like to pretend that enum is not just an integer, it is just an integer. If you can't use roundingMode directly in the assembly, create an integer local variable and set it equal to the roundingMode parameter.
I'm just offering this as a suggestion to you. I've never used inline assembly before and I've never used x86 assembly before, but if all you need to do is reference the parameter, what I said above should work.
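For illustration, here is a minimal sketch of that suggestion, assuming GCC or Clang on x86 with the x87 FPU available. The local `mode`, the 16-bit control-word variables, the 0x0C00 mask (bits 10 and 11, matching RND_CTL_BIT_SHIFT) and the use of `frndint` are assumptions of this sketch, not part of the assignment's skeleton:

```c
#define RND_CTL_BIT_SHIFT 10

typedef enum {
    ROUND_NEAREST_EVEN = 0 << RND_CTL_BIT_SHIFT,
    ROUND_MINUS_INF    = 1 << RND_CTL_BIT_SHIFT,
    ROUND_PLUS_INF     = 2 << RND_CTL_BIT_SHIFT,
    ROUND_TOWARD_ZERO  = 3 << RND_CTL_BIT_SHIFT
} RoundingMode;

double roundD(double n, RoundingMode roundingMode)
{
    int mode = roundingMode;          /* plain integer copy of the enum */
    unsigned short oldCW, newCW;      /* the x87 control word is 16 bits wide */

    /* Save the current control word so it can be restored afterwards. */
    __asm__ volatile ("fnstcw %0" : "=m" (oldCW));

    /* Clear the two rounding-control bits, then set the requested mode. */
    newCW = (unsigned short)((oldCW & ~0x0C00) | (mode & 0x0C00));

    __asm__ volatile (
        "fldcw %2   \n\t"             /* switch to the requested rounding mode */
        "frndint    \n\t"             /* round st(0) to an integer */
        "fldcw %1   \n\t"             /* restore the original control word */
        : "+t" (n)                    /* n is passed in and out through st(0) */
        : "m" (oldCW), "m" (newCW));
    return n;
}
```

The `"+t"` constraint lets the compiler load n onto the FPU stack and store it back, so the template itself only has to change the control word and round.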

Related

Popcnt using inline assembly language in C [duplicate]

A simple implementation of the popcnt function in C:
int popcnt(uint64_t x) {
int s = 0;
for (int i = 0; i < 64; i++) {
if ((x >> i) & 1) s++;
}
return s;
}
I am using inline assembly language (x86-64) to implement popcnt,
int asm_popcnt(uint64_t x) {
int i = 0, sum = 0;
uint64_t tmp = 0;
asm ( ".Pct: \n\t"
"movq %[xx], %[tm]\n\t"
"andq $0x1, %[tm]\n\t"
"test %[tm], %[tm]\n\t"
"je .Grt \n\t"
"incl %[ss] \n\t"
".Grt: \n\t"
"shrq $0x1, %[xx]\n\t"
"incl %[ii] \n\t"
"cmpl $0x3f, %[ii]\n\t"
"jle .Pct \n\t"
: [ss] "+r"(sum)
: [xx] "r"(x) , [ii] "r"(i),
[tm] "r"(tmp)
);
return sum;
}
but it received WA (wrong answer) from the online judge.
I tested all powers of 2 (from 0x1 to (0x1 << 63)) on my computer and it returned 1 for each, which suggests that my asm_popcnt can identify every bit of a 64-bit integer, since all other integers are just combinations of 0x1, 0x2, 0x4, etc. (for example, 0x11a = 0x2 | 0x8 | 0x10 | 0x100). Therefore there shouldn't be any case where the OJ returns WA. Is there anything wrong in my code? The jump instructions?
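One thing worth checking: the template writes to x, i and tmp, but all three are declared as inputs. GCC assumes inputs are unchanged by the asm and may even place i and tmp (both 0) in the same register. The named labels .Pct and .Grt would also clash if the function were inlined more than once. The following is a sketch under those assumptions, for GCC or Clang on x86-64, that declares every modified operand as an output and uses numeric local labels; on other targets it falls back to a plain C loop. `popcnt_ref` is a name introduced here for illustration:

```c
#include <stdint.h>

/* Portable reference: test each bit by shifting right. */
int popcnt_ref(uint64_t x) {
    int s = 0;
    for (int i = 0; i < 64; i++)
        if ((x >> i) & 1) s++;
    return s;
}

int asm_popcnt(uint64_t x) {
#if defined(__x86_64__) && defined(__GNUC__)
    int i = 0, sum = 0;
    uint64_t tmp;
    __asm__ (
        "1:                   \n\t"
        "movq  %[xx], %[tm]   \n\t"
        "andq  $0x1, %[tm]    \n\t"
        "test  %[tm], %[tm]   \n\t"
        "je    2f             \n\t"   /* low bit clear: skip the increment */
        "incl  %[ss]          \n\t"
        "2:                   \n\t"
        "shrq  $0x1, %[xx]    \n\t"
        "incl  %[ii]          \n\t"
        "cmpl  $0x3f, %[ii]   \n\t"
        "jle   1b             \n\t"   /* loop while i <= 63 */
        : [ss] "+r" (sum), [xx] "+r" (x), [ii] "+r" (i),
          [tm] "=&r" (tmp)            /* scratch register, written early */
        :
        : "cc");                      /* the flags are modified */
    return sum;
#else
    return popcnt_ref(x);             /* non-x86-64 fallback */
#endif
}
```

Testing only powers of two cannot expose this kind of problem reliably, because the result depends on which registers the compiler happens to pick at a given optimization level.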

Using FPU with C inline assembly

I wrote a vector structure like this:
struct vector {
float x1, x2, x3, x4;
};
Then I created a function which does some operations with inline assembly using the vector:
struct vector *adding(const struct vector v1[], const struct vector v2[], int size) {
struct vector vec[size];
int i;
for(i = 0; i < size; i++) {
asm(
"FLDL %4 \n" //v1.x1
"FADDL %8 \n" //v2.x1
"FSTL %0 \n"
"FLDL %5 \n" //v1.x2
"FADDL %9 \n" //v2.x2
"FSTL %1 \n"
"FLDL %6 \n" //v1.x3
"FADDL %10 \n" //v2.x3
"FSTL %2 \n"
"FLDL %7 \n" //v1.x4
"FADDL %11 \n" //v2.x4
"FSTL %3 \n"
:"=m"(vec[i].x1), "=m"(vec[i].x2), "=m"(vec[i].x3), "=m"(vec[i].x4) //wyjscie
:"g"(&v1[i].x1), "g"(&v1[i].x2), "g"(&v1[i].x3), "g"(&v1[i].x4), "g"(&v2[i].x1), "g"(&v2[i].x2), "g"(&v2[i].x3), "g"(&v2[i].x4) //wejscie
:
);
}
return vec;
}
Everything looks OK, but when I try to compile this with GCC I get these errors:
Error: Operand type mismatch for 'fadd'
Error: Invalid instruction suffix for 'fld'
On OS X in Xcode everything works correctly. What is wrong with this code?
Coding Issues
I'm not looking at making this efficient (I'd be using SSE/SIMD if the processor supports it). Since this part of the assignment is to use the FPU stack then here are some concerns I have:
Your function declares a local stack based variable:
struct vector vec[size];
The problem is that your function returns a vector * and you do this:
return vec;
This is very bad. The stack based variable could get clobbered after the function returns and before the data gets consumed by the caller. One alternative is to allocate memory on the heap rather than the stack. You can replace struct vector vec[size]; with:
struct vector *vec = malloc(sizeof(struct vector)*size);
This would allocate enough space for an array of size number of vector. The person who calls your function would have to use free to deallocate the memory from the heap when finished.
Your vector structure uses float, not double. The instructions FLDL, FADDL, FSTL all operate on double (64-bit floats). Each of these instructions will load and store 64-bits when used with a memory operand. This would lead to the wrong values being loaded/stored to/from the FPU stack. You should be using FLDS, FADDS, FSTS to operate on 32-bit floats.
In the assembler templates you use the g constraint on the inputs. This means the compiler is free to use any general purpose registers, a memory operand, or an immediate value. FLDS, FADDS, FSTS do not take immediate values or general purpose registers (non-FPU registers) so if the compiler attempts to do so it will likely produce errors similar to Error: Operand type mismatch for xxxx.
Since these instructions understand a memory reference use m instead of g constraint. You will need to remove the & (ampersands) from the input operands since m implies that it will be dealing with the memory address of a variable/C expression.
You don't pop the values off the FPU stack when finished. FST with a single operand copies the value at the top of the stack to the destination, but the value remains on the stack. You should store it and pop it off with an FSTP instruction. You want the FPU stack to be empty when your assembler template ends. The FPU stack is very limited, with only 8 slots available. If the FPU stack is not clear when the template completes, you run the risk of the FPU stack overflowing on subsequent executions. Since the template leaves 4 values on the stack each time it runs, the third time it executes (the third loop iteration, or a later call to adding) should fail.
To simplify the code a bit I'd recommend using a typedef to define vector. Define your structure this way:
typedef struct {
float x1, x2, x3, x4;
} vector;
All references to struct vector can simply become vector.
With all of these things in mind your code could look something like this:
typedef struct {
float x1, x2, x3, x4;
} vector;
vector *adding(const vector v1[], const vector v2[], int size) {
vector *vec = malloc(sizeof(vector)*size);
int i;
for(i = 0; i < size; i++) {
__asm__(
"FLDS %4 \n" //v1.x1
"FADDS %8 \n" //v2.x1
"FSTPS %0 \n"
"FLDS %5 \n" //v1.x2
"FADDS %9 \n" //v2.x2
"FSTPS %1 \n"
"FLDS %6 \n" //v1.x3
"FADDS %10 \n" //v2.x3
"FSTPS %2 \n"
"FLDS %7 \n" //v1.x4
"FADDS %11 \n" //v2.x4
"FSTPS %3 \n"
:"=m"(vec[i].x1), "=m"(vec[i].x2), "=m"(vec[i].x3), "=m"(vec[i].x4)
:"m"(v1[i].x1), "m"(v1[i].x2), "m"(v1[i].x3), "m"(v1[i].x4),
"m"(v2[i].x1), "m"(v2[i].x2), "m"(v2[i].x3), "m"(v2[i].x4)
:
);
}
return vec;
}
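From the caller's side, the ownership rule described above might look like the following sketch. `adding_plain` is a hypothetical plain C stand-in for the inline assembly version (same signature, same malloc-based return), used here only so the example builds on any target; `demo_sum` is likewise an illustrative name:

```c
#include <stdlib.h>

typedef struct {
    float x1, x2, x3, x4;
} vector;

/* Hypothetical stand-in for adding(): the result is heap-allocated,
   so the caller owns it and must free it. */
vector *adding_plain(const vector v1[], const vector v2[], int size) {
    vector *vec = malloc((size_t)size * sizeof *vec);
    if (vec == NULL)
        return NULL;                  /* allocation failed */
    for (int i = 0; i < size; i++) {
        vec[i].x1 = v1[i].x1 + v2[i].x1;
        vec[i].x2 = v1[i].x2 + v2[i].x2;
        vec[i].x3 = v1[i].x3 + v2[i].x3;
        vec[i].x4 = v1[i].x4 + v2[i].x4;
    }
    return vec;
}

/* Caller: use the result, then release it. Returns x4 of the second sum,
   or -1.0f if the allocation failed. */
float demo_sum(void) {
    const vector a[2] = {{1, 2, 3, 4}, {5, 6, 7, 8}};
    const vector b[2] = {{10, 20, 30, 40}, {50, 60, 70, 80}};
    vector *sum = adding_plain(a, b, 2);
    if (sum == NULL)
        return -1.0f;
    float result = sum[1].x4;         /* 8 + 80 */
    free(sum);                        /* the caller releases the memory */
    return result;
}
```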
Alternative Solutions
I don't know the parameters of the assignment, but if it were to make you use GCC extended assembler templates to manually do an operation on the vector with an FPU instruction then you could define the vector with an array of 4 float. Use a nested loop to process each element of the vector independently passing each through to the assembler template to be added together.
Define the vector as:
typedef struct {
float x[4];
} vector;
The function as:
vector *adding(const vector v1[], const vector v2[], int size) {
int i, e;
vector *vec = malloc(sizeof(vector)*size);
for(i = 0; i < size; i++)
for (e = 0; e < 4; e++) {
__asm__(
"FADDP\n"
:"=t"(vec[i].x[e])
:"0"(v1[i].x[e]), "u"(v2[i].x[e])
:"st(1)" // FADDP pops one input, so GCC must be told st(1) is consumed
);
}
return vec;
}
This uses the i386 machine constraints t and u on the operands. Rather than passing a memory address we allow GCC to pass them via the top two slots on the FPU stack. t and u are defined as:
t
Top of 80387 floating-point stack (%st(0)).
u
Second from top of 80387 floating-point stack (%st(1)).
The no operand form of FADDP does this:
Add ST(0) to ST(1), store result in ST(1), and pop the register stack
We pass the two values to add at the top of the stack and perform an operation leaving ONLY the result in ST(0). We can then get the assembler template to copy the value on the top of the stack and pop it off automatically for us.
We can use an output operand of =t to specify the value we want moved is from the top of the FPU stack. =t will also pop (if needed) the value off the top of FPU stack for us. We can also use the top of the stack as an input value too! If the output operand is %0 we can reference it as an input operand with the constraint 0 (which means use the same constraint as operand 0). The second vector value will use the u constraint so it is passed as the second FPU stack element (ST(1))
A slight improvement that could potentially allow GCC to optimize the code it generates would be to use the % modifier on the first input operand. The % modifier is documented as:
Declares the instruction to be commutative for this operand and the following operand. This means that the compiler may interchange the two operands if that is the cheapest way to make all operands fit the constraints. ‘%’ applies to all alternatives and must appear as the first character in the constraint. Only read-only operands can use ‘%’.
Because x+y and y+x yield the same result we can tell the compiler that it can swap the operand marked with % with the one defined immediately after in the template. "0"(v1[i].x[e]) could be changed to "%0"(v1[i].x[e])
Disadvantages: We've reduced the code in the assembler template to a single instruction, and we've used the template to do most of the work setting things up and tearing them down. The problem is that if the vectors live in memory, we transfer values between memory and the FPU registers more times than we might like. The code generated may not be very efficient, as we can see in this Godbolt output.
We can force memory usage by applying the idea in your original code to the template. This code may yield more reasonable results:
vector *adding(const vector v1[], const vector v2[], int size) {
int i, e;
vector *vec = malloc(sizeof(vector)*size);
for(i = 0; i < size; i++)
for (e = 0; e < 4; e++) {
__asm__(
"FADDS %2\n"
:"=&t"(vec[i].x[e])
:"0"(v1[i].x[e]), "m"(v2[i].x[e])
);
}
return vec;
}
Note: I've removed the % modifier in this case. In theory it should work, but GCC seems to emit less efficient code (CLANG seems okay) when targeting x86-64. I'm unsure if it is a bug; whether my understanding is lacking in how this operator should work; or there is an optimization being done I don't understand. Until I look at it closer I am leaving it off to generate the code I would expect to see.
In the last example we are forcing the FADDS instruction to operate on a memory operand. The Godbolt output is considerably cleaner, with the loop itself looking like:
.L3:
flds (%rdi) # MEM[base: _51, offset: 0B]
addq $16, %rdi #, ivtmp.6
addq $16, %rcx #, ivtmp.8
FADDS (%rsi) # _31->x
fstps -16(%rcx) # _28->x
addq $16, %rsi #, ivtmp.9
flds -12(%rdi) # MEM[base: _51, offset: 4B]
FADDS -12(%rsi) # _31->x
fstps -12(%rcx) # _28->x
flds -8(%rdi) # MEM[base: _51, offset: 8B]
FADDS -8(%rsi) # _31->x
fstps -8(%rcx) # _28->x
flds -4(%rdi) # MEM[base: _51, offset: 12B]
FADDS -4(%rsi) # _31->x
fstps -4(%rcx) # _28->x
cmpq %rdi, %rdx # ivtmp.6, D.2922
jne .L3 #,
In this final example GCC unwound the inner loop and only the outer loop remains. The code generated by the compiler is similar in nature to what was produced by hand in the original question's assembler template.

Divide and store quotient and remainder in different arrays

The standard div() function returns a div_t struct, for example:
/* div example */
#include <stdio.h> /* printf */
#include <stdlib.h> /* div, div_t */
int main ()
{
div_t divresult;
divresult = div (38,5);
printf ("38 div 5 => %d, remainder %d.\n", divresult.quot, divresult.rem);
return 0;
}
My case is a bit different; I have this
#define NUM_ELTS 21433
int main ()
{
unsigned int quotients[NUM_ELTS];
unsigned int remainders[NUM_ELTS];
int i;
for(i=0;i<NUM_ELTS;i++) {
divide_single_instruction(&quotients[i], &remainders[i]);
}
}
I know that the assembly division instruction produces both results at once, so I need to do the same here to save CPU cycles: basically, move the quotient from EAX and the remainder from EDX into the memory locations where my arrays are stored. How can this be done without including asm {} or SSE intrinsics in my C code? It has to be portable.
Since you're writing to the arrays in-place (replacing numerator and denominator with quotient and remainder) you should store the results to temporary variables before writing to the arrays.
void foo (unsigned *num, unsigned *den, int n) {
int i;
for(i=0;i<n;i++) {
unsigned q = num[i]/den[i], r = num[i]%den[i];
num[i] = q, den[i] = r;
}
}
produces this main loop assembly
.L5:
movl (%rdi,%rcx,4), %eax
xorl %edx, %edx
divl (%rsi,%rcx,4)
movl %eax, (%rdi,%rcx,4)
movl %edx, (%rsi,%rcx,4)
addq $1, %rcx
cmpl %ecx, %r8d
jg .L5
There are some more complicated cases where it helps to save the quotient and remainder when they are first used. For example in testing for primes by trial division you often see a loop like this
for (p = 3; p <= n/p; p += 2)
if (!(n % p)) return 0;
It turns out that GCC does not use the remainder from the first division and therefore it does the division instruction twice which is unnecessary. To fix this you can save the remainder when the first division is done like this:
for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p)
if (!r) return 0;
This speeds up the result by a factor of two.
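Put into a complete function, the loop above might look like this sketch. The name `is_prime` and the handling of small and even n are additions made here so the fragment can be compiled and run on its own:

```c
/* Trial division that computes the quotient and the remainder together,
   once per candidate divisor, and reuses both. */
int is_prime(unsigned n) {
    unsigned p, q, r;
    if (n < 2) return 0;
    if (n < 4) return 1;              /* 2 and 3 are prime */
    if (n % 2 == 0) return 0;         /* even numbers above 2 are not */
    for (p = 3, q = n / p, r = n % p; p <= q; p += 2, q = n / p, r = n % p)
        if (!r) return 0;             /* p divides n */
    return 1;
}
```

The condition p <= q plays the same role as p <= n/p in the original loop, but q is already available from the division that produced r.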
So in general GCC does a good job particularly if you save the quotient and remainder when they are first calculated.
The general rule here is to trust your compiler to do something fast. You can always disassemble the code and check that the compiler is doing something sane. It's important to realise that a good compiler knows a lot about the machine, often more than you or me.
Also let's assume you have a good reason for needing to "count cycles".
For your example code I agree that the x86 "idiv" instruction is the obvious choice. Let's see what my compiler (MS Visual C 2013) does if I just write the most naive code I can:
struct divresult {
int quot;
int rem;
};
struct divresult divrem(int num, int den)
{
return (struct divresult) { num / den, num % den };
}
int main()
{
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
}
And the compiler gives us:
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
01121000 push 1
01121002 push 2
01121004 push 1123018h
01121009 call dword ptr ds:[1122090h] ;;; this is printf()
Wow, I was outsmarted by the compiler. Visual C knows how division works so it just precalculated the result and inserted constants. It didn't even bother to include my function in the final code. We have to read in the integers from console to force it to actually do the calculation:
int main()
{
int num, den;
scanf("%d, %d", &num, &den);
struct divresult res = divrem(num, den);
printf("%d, %d", res.quot, res.rem);
}
Now we get:
struct divresult res = divrem(num, den);
01071023 mov eax,dword ptr [num]
01071026 cdq
01071027 idiv eax,dword ptr [den]
printf("%d, %d", res.quot, res.rem);
0107102A push edx
0107102B push eax
0107102C push 1073020h
01071031 call dword ptr ds:[1072090h] ;;; printf()
So you see, the compiler (or this compiler at least) already does what you want, or something even more clever.
From this we learn to trust the compiler and only second-guess it when we know it isn't doing a good enough job already.
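Applied to the two-array layout from the question, a portable version could be as simple as the sketch below. `divide_all` and `divide_all_demo` are hypothetical names, and whether a given compiler merges the / and % into one divide instruction is something to confirm in the disassembly rather than assume:

```c
#include <stddef.h>

/* Hypothetical helper: inputs stay untouched, results go to two
   separate arrays. Denominators are assumed to be nonzero. */
void divide_all(const unsigned *num, const unsigned *den,
                unsigned *quotients, unsigned *remainders, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Adjacent / and % on the same operands give the compiler the
           chance to serve both from a single divide. */
        unsigned q = num[i] / den[i];
        unsigned r = num[i] % den[i];
        quotients[i]  = q;
        remainders[i] = r;
    }
}

/* Usage: 38 / 5 and 17 / 4; returns 1 when the results are as expected. */
int divide_all_demo(void)
{
    const unsigned num[2] = {38, 17};
    const unsigned den[2] = {5, 4};
    unsigned quotients[2], remainders[2];
    divide_all(num, den, quotients, remainders, 2);
    return quotients[0] == 7 && remainders[0] == 3 &&
           quotients[1] == 4 && remainders[1] == 1;
}
```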

Assembly language parse error before '[' token

I get a parse error on line 24 (I believe): "parse error before '[' token".
Also, if any of you would like to give me some helpful tips and insights into my project, I would appreciate that very much. I'm building a pow function with all calculations done in asm. This piece of code is meant to change the FPU to round toward zero so that I can split the exponent into 2 parts (example: 2^3.2 = 2^3 * 2^0.2).
#include <stdio.h>
#include <stdlib.h>
#define PRECISION 3
#define RND_CTL_BIT_SHIFT 10
// floating point rounding modes: IA-32 Manual, Vol. 1, p. 4-20
typedef enum {
ROUND_TOWARD_ZERO = 3 << RND_CTL_BIT_SHIFT
} RoundingMode;
int main(int argc, char **argv)
{
int fpMask = 0x1F9FF;
int localVar = 0x00000;
asm(" FSTCW %[localVarIn] \n" // store FPU control word into localVar
" add %[localVarIn], %[fpMaskOut] \n" // add fpMaskIn to localVar
: [localVarOut] "=m" (localVar)
: [fpMaskOut] "=m" (fpMask)
: [localVarIn] "m" (localVar)
: [fpMaskIn] "m" (localVar)
);
printf("FPU is 0x%08X\n\n", localVar);
return 0;
}
I believe your operand lists are slightly wrong: you have one colon per operand, so the third operand lands where GCC expects the clobber list. In GCC, these take the form of the following:
asm("MY ASM CODE" : output0, output1, outputn
: input0, input1, inputn
: clobber0, clobber1, clobbern);
Note that each class is colon-separated (i.e. the set of outputs, the set of inputs etc), then each element within a set is comma-separated. Therefore, try changing your ASM to the following:
asm(" FSTCW %[localVarIn] \n" // store FPU control word into localVar
" add %[localVarIn], %[fpMaskOut] \n" // add fpMaskIn to localVar
: [localVarOut] "=m" (localVar), // Outputs...
[fpMaskOut] "=m" (fpMask)
: [localVarIn] "m" (localVar), // Inputs...
[fpMaskIn] "m" (localVar)
);
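Beyond the colon layout, x86 `add` cannot take two memory operands, so the masking is easier to do in C between two small templates. The following is a minimal sketch assuming GCC or Clang on x86; the helper names and RND_CTL_MASK are hypothetical, and the control word is held in a 16-bit variable because FSTCW/FLDCW operate on 16 bits:

```c
#define RND_CTL_BIT_SHIFT 10
#define RND_CTL_MASK (3 << RND_CTL_BIT_SHIFT)   /* bits 10 and 11 */

/* Read the x87 control word. */
unsigned short get_fpu_cw(void) {
    unsigned short cw;
    __asm__ volatile ("fnstcw %[cwOut]" : [cwOut] "=m" (cw));
    return cw;
}

/* Load a new x87 control word. */
void set_fpu_cw(unsigned short cw) {
    __asm__ volatile ("fldcw %[cwIn]" : : [cwIn] "m" (cw));
}

/* Switch to round-toward-zero (both rounding-control bits set) and
   return the previous control word so the caller can restore it. */
unsigned short set_round_toward_zero(void) {
    unsigned short old = get_fpu_cw();
    set_fpu_cw((unsigned short)(old | RND_CTL_MASK));
    return old;
}
```

A caller would keep the returned value and pass it back to set_fpu_cw once the truncating step is done, so the rest of the program sees the original rounding mode.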

Conversion short to int and sum with NEON

I want to convert the following function to NEON:
int dot4_c(unsigned char v0[4], unsigned char v1[4]){
int r=0;
r = v0[0]*v1[0];
r += v0[1]*v1[1];
r += v0[2]*v1[2];
r += v0[3]*v1[3];
return r;
}
I think I'm almost there, but there is an error because it does not work correctly:
int dot4_neon_hfp(unsigned char v0[4], unsigned char v1[4])
{
asm volatile (
"vld1.16 {d2, d3}, [%0] \n\t" //d2={x0,y0}, d3={z0, w0}
"vld1.16 {d4, d5}, [%1] \n\t" //d4={x1,y1}, d5={z1, w1}
"vcvt.32.u16 d2, d2 \n\t" //conversion
"vcvt.32.u16 d3, d3 \n\t"
"vcvt.32.u16 d4, d4 \n\t"
"vcvt.32.u16 d5, d5 \n\t"
"vmul.32 d0, d2, d4 \n\t" //d0= d2*d4
"vmla.32 d0, d3, d5 \n\t" //d0 = d0 + d3*d5
"vpadd.32 d0, d0 \n\t" //d0 = d[0] + d[1]
:: "r"(v0), "r"(v1) :
);
}
How can I get this working?
As mentioned, you must load at least 8 bytes at a time with NEON. As long as the load doesn't go past the end of your buffer, you can ignore the extra bytes. Here is how to do it with intrinsics:
uint8x8_t v0_vec, v1_vec;
uint16x8_t vproduct;
uint32x2_t vsum32;
v0_vec = vld1_u8(v0); // extra bytes will be ignored as long as you can safely read them
v1_vec = vld1_u8(v1);
// you didn't specify if the product of your vector fits in 8-bits, so I assume it needs to be widened to 16-bits
vproduct = vmull_u8(v0_vec, v1_vec);
vsum32 = vpaddl_u16(vget_low_u16(vproduct)); // pairwise add lower half (first 4 u16's)
return vget_lane_u32(vsum32, 0) + vget_lane_u32(vsum32, 1);
If you absolutely can't load 8 bytes from your source pointers, you can manually load a 32-bit value into a NEON register (the 4 bytes) and then cast it to the proper intrinsic type.
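A sketch of that last variant, assuming a little-endian ARM target with NEON: the 4 bytes are copied into a 32-bit value, placed in a NEON register and reinterpreted as bytes. The function name `dot4` and the scalar fallback for non-NEON builds are additions made here for illustration:

```c
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

int dot4(const unsigned char v0[4], const unsigned char v1[4])
{
#if defined(__ARM_NEON)
    uint32_t w0, w1;
    memcpy(&w0, v0, 4);               /* read exactly 4 bytes, no over-read */
    memcpy(&w1, v1, 4);
    /* Put the 32-bit value in a D register and view it as 8 bytes. */
    uint8x8_t a = vreinterpret_u8_u32(vdup_n_u32(w0));
    uint8x8_t b = vreinterpret_u8_u32(vdup_n_u32(w1));
    uint16x8_t prod = vmull_u8(a, b);                    /* widen to 16 bits */
    uint32x2_t sum = vpaddl_u16(vget_low_u16(prod));     /* first 4 products */
    return (int)(vget_lane_u32(sum, 0) + vget_lane_u32(sum, 1));
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    return v0[0] * v1[0] + v0[1] * v1[1] + v0[2] * v1[2] + v0[3] * v1[3];
#endif
}
```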

Resources