Using FPU with C inline assembly - c

I wrote a vector structure like this:
struct vector {
float x1, x2, x3, x4;
};
Then I created a function which does some operations with inline assembly using the vector:
struct vector *adding(const struct vector v1[], const struct vector v2[], int size) {
struct vector vec[size];
int i;
for(i = 0; i < size; i++) {
asm(
"FLDL %4 \n" //v1.x1
"FADDL %8 \n" //v2.x1
"FSTL %0 \n"
"FLDL %5 \n" //v1.x2
"FADDL %9 \n" //v2.x2
"FSTL %1 \n"
"FLDL %6 \n" //v1.x3
"FADDL %10 \n" //v2.x3
"FSTL %2 \n"
"FLDL %7 \n" //v1.x4
"FADDL %11 \n" //v2.x4
"FSTL %3 \n"
:"=m"(vec[i].x1), "=m"(vec[i].x2), "=m"(vec[i].x3), "=m"(vec[i].x4) //wyjscie
:"g"(&v1[i].x1), "g"(&v1[i].x2), "g"(&v1[i].x3), "g"(&v1[i].x4), "g"(&v2[i].x1), "g"(&v2[i].x2), "g"(&v2[i].x3), "g"(&v2[i].x4) //wejscie
:
);
}
return vec;
}
Everything looks OK, but when I try to compile this with GCC I get these errors:
Error: Operand type mismatch for 'fadd'
Error: Invalid instruction suffix for 'fld'
On OS/X in XCode everything working correctly. What is wrong with this code?

Coding Issues
I'm not looking at making this efficient (I'd be using SSE/SIMD if the processor supports it). Since this part of the assignment is to use the FPU stack then here are some concerns I have:
Your function declares a local stack based variable:
struct vector vec[size];
The problem is that your function returns a vector * and you do this:
return vec;
This is very bad. The stack based variable could get clobbered after the function returns and before the data gets consumed by the caller. One alternative is to allocate memory on the heap rather than the stack. You can replace struct vector vec[size]; with:
struct vector *vec = malloc(sizeof(struct vector)*size);
This would allocate enough space for an array of size number of vector. The person who calls your function would have to use free to deallocate the memory from the heap when finished.
Your vector structure uses float, not double. The instructions FLDL, FADDL, FSTL all operate on double (64-bit floats). Each of these instructions will load and store 64-bits when used with a memory operand. This would lead to the wrong values being loaded/stored to/from the FPU stack. You should be using FLDS, FADDS, FSTS to operate on 32-bit floats.
In the assembler templates you use the g constraint on the inputs. This means the compiler is free to use any general purpose registers, a memory operand, or an immediate value. FLDS, FADDS, FSTS do not take immediate values or general purpose registers (non-FPU registers) so if the compiler attempts to do so it will likely produce errors similar to Error: Operand type mismatch for xxxx.
Since these instructions understand a memory reference use m instead of g constraint. You will need to remove the & (ampersands) from the input operands since m implies that it will be dealing with the memory address of a variable/C expression.
You don't pop the values off the FPU stack when finished. FST with a single operand copies the value at the top of the stack to the destination. The value on the stack remains. You should store it and pop it off with an FSTP instruction. You want the FPU stack to be empty when your assembler template ends. The FPU stack is very limited with only 8 slots available. If the FPU stack is not clear when the template completes then you run the risk of the FPU stack overflowing on subsequent calls. Since you leave 4 values on the stack on each call, calling the function adding a third time should fail.
To simplify the code a bit I'd recommend using a typedef to define vector. Define your structure this way:
typedef struct {
float x1, x2, x3, x4;
} vector;
All references to struct vector can simply become vector.
With all of these things in mind your code could look something like this:
typedef struct {
float x1, x2, x3, x4;
} vector;
vector *adding(const vector v1[], const vector v2[], int size) {
vector *vec = malloc(sizeof(vector)*size);
int i;
for(i = 0; i < size; i++) {
__asm__(
"FLDS %4 \n" //v1.x1
"FADDS %8 \n" //v2.x1
"FSTPS %0 \n"
"FLDS %5 \n" //v1.x2
"FADDS %9 \n" //v2.x2
"FSTPS %1 \n"
"FLDS %6 \n" //v1->x3
"FADDS %10 \n" //v2->x3
"FSTPS %2 \n"
"FLDS %7 \n" //v1->x4
"FADDS %11 \n" //v2->x4
"FSTPS %3 \n"
:"=m"(vec[i].x1), "=m"(vec[i].x2), "=m"(vec[i].x3), "=m"(vec[i].x4)
:"m"(v1[i].x1), "m"(v1[i].x2), "m"(v1[i].x3), "m"(v1[i].x4),
"m"(v2[i].x1), "m"(v2[i].x2), "m"(v2[i].x3), "m"(v2[i].x4)
:
);
}
return vec;
}
Alternative Solutions
I don't know the parameters of the assignment, but if it were to make you use GCC extended assembler templates to manually do an operation on the vector with an FPU instruction then you could define the vector with an array of 4 float. Use a nested loop to process each element of the vector independently passing each through to the assembler template to be added together.
Define the vector as:
typedef struct {
float x[4];
} vector;
The function as:
vector *adding(const vector v1[], const vector v2[], int size) {
int i, e;
vector *vec = malloc(sizeof(vector)*size);
for(i = 0; i < size; i++)
for (e = 0; e < 4; e++) {
__asm__(
"FADDPS\n"
:"=t"(vec[i].x[e])
:"0"(v1[i].x[e]), "u"(v2[i].x[e])
);
}
return vec;
}
This uses the i386 machine constraints t and u on the operands. Rather than passing a memory address we allow GCC to pass them via the top two slots on the FPU stack. t and u are defined as:
t
Top of 80387 floating-point stack (%st(0)).
u
Second from top of 80387 floating-point stack (%st(1)).
The no operand form of FADDP does this:
Add ST(0) to ST(1), store result in ST(1), and pop the register stack
We pass the two values to add at the top of the stack and perform an operation leaving ONLY the result in ST(0). We can then get the assembler template to copy the value on the top of the stack and pop it off automatically for us.
We can use an output operand of =t to specify the value we want moved is from the top of the FPU stack. =t will also pop (if needed) the value off the top of FPU stack for us. We can also use the top of the stack as an input value too! If the output operand is %0 we can reference it as an input operand with the constraint 0 (which means use the same constraint as operand 0). The second vector value will use the u constraint so it is passed as the second FPU stack element (ST(1))
A slight improvement that could potentially allow GCC to optimize the code it generates would be to use the % modifier on the first input operand. The % modifier is documented as:
Declares the instruction to be commutative for this operand and the following operand. This means that the compiler may interchange the two operands if that is the cheapest way to make all operands fit the constraints. ‘%’ applies to all alternatives and must appear as the first character in the constraint. Only read-only operands can use ‘%’.
Because x+y and y+x yield the same result we can tell the compiler that it can swap the operand marked with % with the one defined immediately after in the template. "0"(v1[i].x[e]) could be changed to "%0"(v1[i].x[e])
Disadvantages: We've reduced the code in the assembler template to a single instruction, and we've used the template to do most of the work setting things up and tearing it down. The problem is that if the vectors are likely going to be memory bound then we transfer between FPU registers and memory and back more times than we may like it to. The code generated may not be very efficient as we can see in this Godbolt output.
We can force memory usage by applying the idea in your original code to the template. This code may yield more reasonable results:
vector *adding(const vector v1[], const vector v2[], int size) {
int i, e;
vector *vec = malloc(sizeof(vector)*size);
for(i = 0; i < size; i++)
for (e = 0; e < 4; e++) {
__asm__(
"FADDS %2\n"
:"=&t"(vec[i].x[e])
:"0"(v1[i].x[e]), "m"(v2[i].x[e])
);
}
return vec;
}
Note: I've removed the % modifier in this case. In theory it should work, but GCC seems to emit less efficient code (CLANG seems okay) when targeting x86-64. I'm unsure if it is a bug; whether my understanding is lacking in how this operator should work; or there is an optimization being done I don't understand. Until I look at it closer I am leaving it off to generate the code I would expect to see.
In the last example we are forcing the FADDS instruction to operate on a memory operand. The Godbolt output is considerably cleaner, with the loop itself looking like:
.L3:
flds (%rdi) # MEM[base: _51, offset: 0B]
addq $16, %rdi #, ivtmp.6
addq $16, %rcx #, ivtmp.8
FADDS (%rsi) # _31->x
fstps -16(%rcx) # _28->x
addq $16, %rsi #, ivtmp.9
flds -12(%rdi) # MEM[base: _51, offset: 4B]
FADDS -12(%rsi) # _31->x
fstps -12(%rcx) # _28->x
flds -8(%rdi) # MEM[base: _51, offset: 8B]
FADDS -8(%rsi) # _31->x
fstps -8(%rcx) # _28->x
flds -4(%rdi) # MEM[base: _51, offset: 12B]
FADDS -4(%rsi) # _31->x
fstps -4(%rcx) # _28->x
cmpq %rdi, %rdx # ivtmp.6, D.2922
jne .L3 #,
In this final example GCC unwound the inner loop and only the outer loop remains. The code generated by the compiler is similar in nature to what was produced by hand in the original question's assembler template.

Related

C - efficiently changing a function pointer based on command line input

I have several similar functions, say A, B, C. I want to choose one of them with command line options. Also, I'm calling that function billion times because of that instead of checking a variable inside a function billion times, I'm defining a function pointer Phi and set it to desired function just one time. But when I set, Phi = A, (so no user input considered) my code runs in ~24 secs, when I add an if-else and set Phi to desired function, my code runs in ~30 secs with exact same parameters. (Of course command line option sets Phi to A) What is the efficient way to handle this case?
My functions:
double funcA(double r)
{
return 0;
}
double funcB(double r)
{
return 1;
}
double funcC(double r)
{
return r;
}
void computationFunctionFast(Context *userInputs) {
double (*Phi)(double) = funcA;
/* computation codes */
}
void computationFunctionSlow(Context *userInputs) {
double (*Phi)(double);
switch (userInputs->funcEnum) {
case A:
Phi = funcA;
break;
case B:
Phi = funcB;
break;
case C:
Phi = funcC;
}
/* computation codes */
}
I've tried gcc, clang, icx with -O2 and -O3 optimizations. (gcc has no performance difference in mentioned cases but has the worst performance) Although I'm using C, I've tried std::function too. I've tried defining Phi function in different scopes etc.
Generally, there are a few things here that are slightly bad for performance:
Branches/comparisons lead to inefficient use of branch prediction/instruction cache and might affect pipelining too.
Function pointers are notoriously inefficient since they generally block inlining and generally the compiler can't do much about them.
Here's an example based on your code:
double computationFunctionSlow (int input, double val) {
double (*Phi)(double);
switch (input) {
case 0: Phi = funcA; break;
case 1: Phi = funcB; break;
case 2: Phi = funcC; break;
}
double res = Phi(val);
return res;
}
clang 15.0.0 x86_64 -O3 gives:
computationFunctionSlow: # #computationFunctionSlow
cmp edi, 2
ja .LBB3_1
movsxd rax, edi
lea rcx, [rip + .Lswitch.table.computationFunctionSlow]
jmp qword ptr [rcx + 8*rax] # TAILCALL
.LBB3_1:
xorps xmm0, xmm0
ret
.Lswitch.table.computationFunctionSlow:
.quad funcA
.quad funcB
.quad funcC
Even though the numbers I picked are adjacent, the usual compilers fail to optimize out the comparison cmp. Even when I include a default: return 0; it is still there. You can quite easily manually optimize any switch with contiguous indices like this into a function pointer jump table:
double computationFunctionSlow (int input, double val) {
double (*Phi[3])(double) = {funcA, funcB, funcC};
double res = Phi[input](val);
return res;
}
clang 15.0.0 x86_64 -O3 gives:
computationFunctionSlow: # #computationFunctionSlow
movsxd rax, edi
lea rcx, [rip + .L__const.computationFunctionSlow.Phi]
jmp qword ptr [rcx + 8*rax] # TAILCALL
.L__const.computationFunctionSlow.Phi:
.quad funcA
.quad funcB
.quad funcC
This leads to slightly better code here as the comparison instruction/branch is now removed. However, this is really a micro optimization that shouldn't have that much impact of performance. You have to benchmark it for sure to see if there's any improvement.
(Also gcc 12.2 didn't optimize this code as good, why I went with clang for this example.)
Godbolt link: https://godbolt.org/z/ja4zerj7o
There isn't a more "efficient" way to handle this case, you are already doing what you should.
The difference in timing you observe is because:
In the first case (Phi = funcA) the compiler knows the function will always be the same and is therefore able to optimize its calls. Depending on what your "computation code" does, this could mean inlining the function and simplifying a lot of calculations for you.
In the second case (Phi = <choice from user>) the compiler cannot know which function will be selected, and therefore cannot optimize any of the calls made to it by the rest of the code. It also cannot propagate optimizations to other parts of your "computation code" like in the first case.
In general, there isn't much you can do. Dynamic function pointers inherently add a bit of runtime overhead and make optimizations harder (or impossible).
What you could try is duplicating the "computation code" inside different functions or different branches that you only enter after asserting that Phi is equal to a constant, like so:
void computationFunctionSlow(Context *userInputs) {
if (userInputs->funcEnum == A) {
const double (*Phi)(double) = funcA;
// computation code
} else if (...) {
// ...
}
}
In the above piece of code, the compiler knows that inside any of those if blocks the value of Phi can only have one value, and could therefore be able to perform the same optimizations discussed in point 1 above.
There's no need to put an enum in your userInputs when all you do with it is use it to select a function pointer. Just add the function pointer in the structure directly and eliminate the branching done on every call.
Instead of
struct Context
{
.
.
.
enum funcType funcEnum;
};
use
struct Context
{
.
.
.
double (*phi)(double);
};
You'd wind up with something like this:
void computationFunctionSlow(Context *userInputs) {
/* computation codes */
double result = userInputs->phi( data );
}

MSVC Inline Assembly: Freeing FPU registers for performance

While playing a little with FPU using MSVC's Inline Assembly, I got a little confused about freeing FPU registers in favor of increasing performance...
For example:
#include <stdio.h>
double fpu_add(register double x, register double y) {
double res = 0.0;
__asm {
fld x
fld y
fadd
fstp res
}
return res;
}
int main(void) {
double x = fpu_add(5.0, 2.0);
(void) printf("x = %f\n", x);
return 0;
}
When do I have to ffree the FPU registers in Inline Assembly?
In that example would performance be better if I decided to ffree the st(1) register?
Also is fstp a shorthand for instructions below?
__asm {
fst res
ffree st(0)
}
NOTE: I know FPU instructions are a bit old nowdays, But dealing with them as another option along with SSE
The ffree instruction allows you to mark any slot of the x87 fo stack as free without actually changing the stack pointer. So ffree st(0) does NOT pop the stack, just marks the top value of the stack as free/invalid, so any following instruction that tries to access it will get a floating point exception.
To actually pop to the stack you need both ffree st(0) and fincstp (to increment the pointer). Or better, fstp st(0) to do both those things with a single cheap instruction. Or fstp st(1) to keep the top-of-stack value and discard the old st(1).
But it's usually even better and easier (and faster) to use the p suffixed versions of other instructions. In your case, you probably want
__asm {
fld x // push x on the stack
fld y // push y on the stack
faddp // pop a value and add it to the (now) tos
fstp res // pop and store tos
}
This ends up pushing and popping two values, leaving the fp stack in the same state as it was before. Leaving stuff on the fp stack is likely to cause problems with other fp code, if the compiler is generating x87 fp code, so should be avoided.
Or even better, use memory-source fadd to save instructions, if you're optimizing for CPUs where that's not slower. (Check Agner Fog's microarch PDF and instruction tables for P5 Pentium and newer: seems to be fine, at least break even, and saves a uop on more modern CPUs like Core2 that can do micro-fusion of memory source operands.)
__asm {
fld x // push x on the stack
fadd y // ST0 += y
fstp res // pop and store tos
}
But MSVC inline asm is inherently slow for wrapping a single instruction like fadd, forcing inputs to be in memory, even if the compiler had them available in registers before the asm statement. And forcing the result to be stored in the asm and then reloaded for the return statement, unless you use a hack like leaving a value in st(0) and falling off the end of a function without a return statement. (MSVC does actually support this even when inlining, but clang-cl / clang -fasm-blocks does not.)
GNU C inline asm could wrap a single fadd instruction with appropriate constraints to ask for inputs in x87 registers and tell the compiler where the output is (in st(0)), but you'd still have to choose between fadd and faddp, not letting the compiler pick based on whether it had values in registers or a value from memory. (https://stackoverflow.com/tags/inline-assembly/info)
Compilers aren't terrible, they will make code at least this good from plain C source. Inline asm is generally not useful for performance, unless you're writing a whole loop that's carefully tuned for a specific CPU, or for a case where the compiler does a poor job with something. (Look at the compiler's optimized asm output, e.g. on https://godbolt.org/)

multiplication instruction error in inline assembly

Consider following program:
#include <stdio.h>
int main(void) {
int foo = 10, bar = 15;
__asm__ __volatile__("add %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
printf("foo+bar=%d\n", foo);
}
I know that add instruction is used for addition, sub instruction is used for subtraction & so on. But I didn't understand these lines:
__asm__ __volatile__("add %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
What is the exact meaning of :"=a"(foo) :"a"(foo), "b"(bar) ); ? What it does ? And when I try to use mul instruction here I get following error for the following program:
#include <stdio.h>
int main(void) {
int foo = 10, bar = 15;
__asm__ __volatile__("mul %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
printf("foo*bar=%d\n", foo);
}
Error: number of operands mismatch for `mul'
So, why I am getting this error ? How do I solve this error ? I've searched on google about these, but I couldn't find solution of my problem. I am using windows 10 os & processor is intel core i3.
What is the exact meaning of :"=a"(foo) :"a"(foo), "b"(bar) );
There is a detailed description of how parameters are passed to the asm instruction here. In short, this is saying that bar goes into the ebx register, foo goes into eax, and after the asm is executed, eax will contain an updated value for foo.
Error: number of operands mismatch for `mul'
Yeah, that's not the right syntax for mul. Perhaps you should spend some time with an x86 assembler reference manual (for example, this).
I'll also add that using inline asm is usually a bad idea.
Edit: I can't fit a response to your question into a comment.
I'm not quite sure where to start. These questions seem to indicate that you don't have a very good grasp of how assembler works at all. Trying to teach you asm programming in a SO answer is not really practical.
But I can point you in the right direction.
First of all, consider this bit of asm code:
movl $10, %eax
movl $15, %ebx
addl %ebx, %eax
Do you understand what that does? What will be in eax when this completes? What will be in ebx? Now, compare that with this:
int foo = 10, bar = 15;
__asm__ __volatile__("add %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
By using the "a" constraint, you are asking gcc to move the value of foo into eax. By using the "b" constraint you are asking it to move bar into ebx. It does this, then executes the instructions for the asm (ie add). On exit from the asm, the new value for foo will be in eax. Get it?
Now, let's look at mul. According to the docs I linked you to, we know that the syntax is mul value. That seems weird, doesn't it? How can there only be one parameter to mul? What does it multiple the value with?
But if you keep reading, you see "Always multiplies EAX by a value." Ahh. So the "eax" register is always implied here. So if you were to write mul %ebx, that would really be mean mul ebx, eax, but since it always has to be eax, there's no real point it writing it out.
However, it's a little more complicated than that. ebx can hold a 32bit value number. Since we are using ints (instead of unsigned ints), that means that ebx could have a number as big as 2,147,483,647. But wait, what happens if you multiply 2,147,483,647 * 10? Well, since 2,147,483,647 is already as big a number as you can store in a register, the result is much too big to fit into eax. So the multiplication (always) uses 2 registers to output the result from mul. This is what that link meant when it referred "stores the result in EDX:EAX."
So, you could write your multiplication like this:
int foo = 10, bar = 15;
int upper;
__asm__ ("mul %%ebx"
:"=a"(foo), "=d"(upper)
:"a"(foo), "b"(bar)
:"cc"
);
As before, this puts bar in ebx and foo in eax, then executes the multiplication instruction.
And after the asm is done, eax will contain the lower part of the result and edx will contain the upper. If foo * bar < 2,147,483,647, then foo will contain the result you need and upper will be zero. Otherwise, things get more complicated.
But that's as far as I'm willing to go. Other than that, take an asm class. Read a book.
PS You might also look at this answer and the 3 comments that follow that show why even your "add" example is "wrong."
PPS If this answer has resolved your question, don't forget to click the check mark next to it so I get my karma points.

Returning function arguments from assembly

I'm using an AMD 64bit (I don't think it matters what exact architecture) on Linux, also 64bit. Compiling with gcc to elf64.
I've seen from the C ABI that integer arguments are passed to a function via general purpose registers, and I can find the values on the assembly side of my code (the callee). The problem arises when I need to retrieve the results from the callee to the caller.
As far as I can understand, RAX gets the 1st integer returned value, and I can easily find and use that one. The 2nd integer returned value is passed via RDX. And this is the point that baffles me.
I can also see, from the C ABI, that RDX is the register used to pass the third integer function argument from the caller to the callee, but my function doesn't use a third argument.
How do I get the RDX out from my function? Do I have to fake an argument in the function just to be able to refer to it on the caller side?
fixed point multiplication 16.16:
called from C looks like:
typedef long int Fixedpoint;
Fixedpoint _FixedMul(Fixedpoint v1, Fixedpoint v2);
and this is the function itself:
_FixedMul:
push bp
mov bp, sp
; entering the function EDI contains v1, ESI contains v2. So:
mov eax, edi ; eax = v1
imul dword esi ; eax = v1 * v2
; at this point EDX contains the higher part of the
; imul moltiplication, EAX the lower one.
add eax, 8000h ; round by adding 2^(-17)
adc edx, 0 ; whole part of result is in DX
shr eax, 16 ; put the fractional part in AX
pop bp
ret
from System V Application Binary Interface
AMD64 Architecture Processor Supplement
Returning of Values algorithm:
The returning of values is done according to the following
Classify the return type with the classification algorithm.
If the type has class MEMORY, then the caller provides space for the return
value and passes the address of this storage in %rdi as if it were the first
argument to the function. In effect, this address becomes a “hidden” first ar-
gument. This storage must not overlap any data visible to the callee through
other names than this argument.
On return %rax will contain the address that has been passed in by the
caller in %rdi.
If the class is INTEGER, the next available register of the sequence %rax,
%rdx is used.
I hope is more clear what i mean.
PS: sorry for confusion I've made in comments. Thanks for the tip.
The AMD64 calling conventions (System V ABI) designate a dual-use role for register RDX: it may be used to pass the third argument (if any) and it may be used as second return register. (cf. Figure 3.4, page 21)
Depending on the function signature it's used for either or none of these roles.
So why are there 2 return registers (i.e. RAX and RDX)? This allows a function to return values of up to 128 bits in general purpose registers - such as __int128 or a struct with two 8 byte fields.
Thus, to access the RDX register from your C call site you just have to adjust your function's signature. That means instead of
typedef long int Fixedpoint;
Fixedpoint _FixedMul(Fixedpoint v1, Fixedpoint v2);
declare it as:
typedef long int Fixedpoint;
struct Fixedpoint_Pair { // when used as return value:
Fixedpoint fst; // - passed in RAX
Fixedpoint snd; // - passed in RDX
};
typedef struct Fixedpoint_Pair Fixedpoint_Pair;
Fixedpoint_Pair _FixedMul(Fixedpoint v1, Fixedpoint v2);
Then you can access RDX like this in your C code:
Fixedpoint_Pair r = _FixedMul(a, b);
printf("RDX: %ld\n", r.snd);
References:
page 18,19: especially 'The classification of aggregate (structure and arrays)', point 3:
If the size of the aggregate exceeds a single eightbyte, each is classified
separately. Each eightbyte gets initialized to class NO_CLASS.
page 22: 'Return of Values', point 3:
If the class is INTEGER, the next available register of the sequence %rax,
%rdx is used.
Your English is absolutely fine.
How arguments are passed depends on the calling convention that has been adopted - the two most common ones, __stdcall and __cdecl, use the stack to pass all arguments, but for example the __fastcall convention will use registers for the first two args, and in x64, it's still different. See here for a comprehensive list.
Actual return values are returned in (E/R)AX; however, if the function receives some pointers to "return" more values, those pointers will be passed as normal arguments - as previously stated, usually on the stack.

Multithreading with inline assembly and access to a c variable

I'm using inline assembly to construct a set of passwords, which I will use to brute force against a given hash. I used this website as a reference for the construction of the passwords.
This is working flawlessly in a singlethreaded environment. It produces an infinite amount of incrementing passwords.
As I have only basic knowledge of asm, I understand the idea. The gcc uses ATT, so I compile with -masm=intel
During the attempt to multithread the program, I realize that this approach might not work.
The following code uses 2 global C variables, and I assume that this might be the problem.
__asm__("pushad\n\t"
"mov edi, offset plaintext\n\t" <---- global variable
"mov ebx, offset charsetTable\n\t" <---- again
"L1: movzx eax, byte ptr [edi]\n\t"
" movzx eax, byte ptr [charsetTable+eax]\n\t"
" cmp al, 0\n\t"
" je L2\n\t"
" mov [edi],al\n\t"
" jmp L3\n\t"
"L2: xlat\n\t"
" mov [edi],al\n\t"
" inc edi\n\t"
" jmp L1\n\t"
"L3: popad\n\t");
It produces a non deterministic result in the plaintext variable.
How can i create a workaround, that every thread accesses his own plaintext variable? (If this is the problem...).
I tried modifying this code, to use extended assembly, but I failed every time. Probably due to the fact that all tutorials use ATT syntax.
I would really appreciate any help, as I'm stuck for several hours now :(
Edit: Running the program with 2 threads, and printing the content of plaintext right after the asm instruction, produces:
b
b
d
d
f
f
...
Edit2:
pthread_create(&thread[i], NULL, crack, (void *) &args[i]))
[...]
void *crack(void *arg) {
struct threadArgs *param = arg;
struct crypt_data crypt; // storage for reentrant version of crypt(3)
char *tmpHash = NULL;
size_t len = strlen(param->methodAndSalt);
size_t cipherlen = strlen(param->cipher);
crypt.initialized = 0;
for(int i = 0; i <= LIMIT; i++) {
// intel syntax
__asm__ ("pushad\n\t"
//mov edi, offset %0\n\t"
"mov edi, offset plaintext\n\t"
"mov ebx, offset charsetTable\n\t"
"L1: movzx eax, byte ptr [edi]\n\t"
" movzx eax, byte ptr [charsetTable+eax]\n\t"
" cmp al, 0\n\t"
" je L2\n\t"
" mov [edi],al\n\t"
" jmp L3\n\t"
"L2: xlat\n\t"
" mov [edi],al\n\t"
" inc edi\n\t"
" jmp L1\n\t"
"L3: popad\n\t");
tmpHash = crypt_r(plaintext, param->methodAndSalt, &crypt);
if(0 == memcmp(tmpHash+len, param->cipher, cipherlen)) {
printf("success: %s\n", plaintext);
break;
}
}
return 0;
}
Since you're already using pthreads, another option is making the variables that are modified by several threads into per-thread variables (threadspecific data). See pthread_getspecific OpenGroup manpage. The way this works is like:
In the main thread (before you create other threads), do:
static pthread_key_y tsd_key;
(void)pthread_key_create(&tsd_key); /* unlikely to fail; handle if you want */
and then within each thread, where you use the plaintext / charsetTable variables (or more such), do:
struct { char *plainText, char *charsetTable } *str =
pthread_getspecific(tsd_key);
if (str == NULL) {
str = malloc(2 * sizeof(char *));
str.plainText = malloc(size_of_plaintext);
str.charsetTable = malloc(size_of_charsetTable);
initialize(str.plainText); /* put the data for this thread in */
initialize(str.charsetTable); /* ditto */
pthread_setspecific(tsd_key, str);
}
char *plaintext = str.plainText;
char *charsetTable = str.charsetTable;
Or create / use several keys, one per such variable; in that case, you don't get the str container / double indirection / additional malloc.
Intel assembly syntax with gcc inline asm is, hm, not great; in particular, specifying input/output operands is not easy. I think to get that to use the pthread_getspecific mechanism, you'd change your code to do:
__asm__("pushad\n\t"
"push tsd_key\n\t" <---- threadspecific data key (arg to call)
"call pthread_getspecific\n\t" <---- gets "str" as per above
"add esp, 4\n\t" <---- get rid of the func argument
"mov edi, [eax]\n\t" <---- first ptr == "plainText"
"mov ebx, [eax + 4]\n\t" <---- 2nd ptr == "charsetTable"
...
That way, it becomes lock-free, at the expense of using more memory (one plaintext / charsetTable per thread), and the expense of an additional function call (to pthread_getspecific()). Also, if you do the above, make sure you free() each thread's specific data via pthread_atexit(), or else you'll leak.
If your function is fast to execute, then a lock is a much simpler solution because you don't need all the setup / cleanup overhead of threadspecific data; if the function is either slow or very frequently called, the lock would become a bottleneck though - in that case the memory / access overhead for TSD is justified. Your mileage may vary.
Protect this function with mutex outside of inline Assembly block.

Resources