In the following example, running a 32-bit executable on a 64-bit machine is faster, and I don't understand why. I tried two examples, one using a division and the other a multiplication. The multiplication performs as expected; however, the division result is surprising.
We can see in the assembly that the compiler calls _alldiv, which emulates a 64-bit division on a 32-bit architecture, so it should be slower than simply using the idiv instruction. So I don't understand the results I got.
My setup is: Windows 10 x64, Visual Studio 2019
To time the code I use Measure-Command { .\out.exe }:
Multiplication
32-bit executable: 3360 ms
64-bit executable: 1469 ms
Division
32-bit executable: 7383 ms
64-bit executable: 8567 ms
Code
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>
#include <Windows.h>

volatile int64_t m = 32;
volatile int64_t n = 12;
volatile int64_t result;

int main(void)
{
    for (size_t i = 0; i < (1 << 30); i++)
    {
#ifdef DIVISION
        result = m / n;
#else
        result = m * n;
#endif
        m += 1;
        n += 3;
    }
}
64-bit disassembly (division)
for (size_t i = 0; i < (1 << 30); i++)
00007FF60DA81000 mov r8d,40000000h
00007FF60DA81006 nop word ptr [rax+rax]
{
result = m / n;
00007FF60DA81010 mov rcx,qword ptr [n (07FF60DA83038h)]
00007FF60DA81017 mov rax,qword ptr [m (07FF60DA83040h)]
00007FF60DA8101E cqo
00007FF60DA81020 idiv rax,rcx
00007FF60DA81023 mov qword ptr [result (07FF60DA83648h)],rax
m += 1;
00007FF60DA8102A mov rax,qword ptr [m (07FF60DA83040h)]
00007FF60DA81031 inc rax
00007FF60DA81034 mov qword ptr [m (07FF60DA83040h)],rax
n += 3;
00007FF60DA8103B mov rax,qword ptr [n (07FF60DA83038h)]
00007FF60DA81042 add rax,3
00007FF60DA81046 mov qword ptr [n (07FF60DA83038h)],rax
00007FF60DA8104D sub r8,1
00007FF60DA81051 jne main+10h (07FF60DA81010h)
}
}
32-bit disassembly (division)
for (size_t i = 0; i < (1 << 30); i++)
00A41002 mov edi,40000000h
00A41007 nop word ptr [eax+eax]
{
result = m / n;
00A41010 mov edx,dword ptr [n (0A43018h)]
00A41016 mov eax,dword ptr ds:[00A4301Ch]
00A4101B mov esi,dword ptr [m (0A43020h)]
00A41021 mov ecx,dword ptr ds:[0A43024h]
00A41027 push eax
00A41028 push edx
00A41029 push ecx
00A4102A push esi
00A4102B call _alldiv (0A41CD0h)
00A41030 mov dword ptr [result (0A433A0h)],eax
00A41035 mov dword ptr ds:[0A433A4h],edx
m += 1;
00A4103B mov eax,dword ptr [m (0A43020h)]
00A41040 mov ecx,dword ptr ds:[0A43024h]
00A41046 add eax,1
00A41049 mov dword ptr [m (0A43020h)],eax
00A4104E adc ecx,0
00A41051 mov dword ptr ds:[0A43024h],ecx
n += 3;
00A41057 mov eax,dword ptr [n (0A43018h)]
00A4105C mov ecx,dword ptr ds:[0A4301Ch]
00A41062 add eax,3
00A41065 mov dword ptr [n (0A43018h)],eax
00A4106A adc ecx,0
00A4106D mov dword ptr ds:[0A4301Ch],ecx
00A41073 sub edi,1
00A41076 jne main+10h (0A41010h)
}
}
Edit
To investigate further, as Chris Dodd suggested, I slightly modified my code as follows:
volatile int64_t m = 32000000000;
volatile int64_t n = 12000000000;
volatile int64_t result;
This time I got these results:
Division
32-bit executable: 22407 ms
64-bit executable: 17812 ms
If you look at instruction timings for x86 processors, it turns out that on recent Intel processors a 64-bit divide is 3-4x as expensive as a 32-bit divide. And if you look at the internals of _alldiv (link in the comments above), for your values, which will always fit in 32 bits, it uses a single 32-bit divide...
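For reference, that fast path has roughly this shape (a C sketch of the idea only, not Microsoft's actual _alldiv source; the helper name is made up):

#include <stdint.h>

/* Sketch: a 64/64 signed divide helper that, like _alldiv, falls back to a
 * single 32-bit division when both operands happen to fit in 32 bits
 * (edge cases such as INT32_MIN / -1 are ignored here). */
static int64_t div64_sketch(int64_t num, int64_t den)
{
    if (num >= INT32_MIN && num <= INT32_MAX &&
        den >= INT32_MIN && den <= INT32_MAX)
    {
        return (int32_t)num / (int32_t)den;   /* one 32-bit idiv */
    }
    /* otherwise: the full 64-bit long-division path (done in software
     * by the runtime helper on a 32-bit target) */
    return num / den;
}

With m and n starting at 32 and 12, every iteration of the benchmark stays on the cheap 32-bit branch.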
Related
Function 1.
It is a function that returns a pointer.
char *abc(unsigned int a, unsigned int b)
{
    //do something here ...
}
Function 2
Function 2 builds on function 1.
I am trying to store the abc function call in an array; however, I am getting the error: error: assignment to expression with array type.
fun2()
{
    unsigned int x, y;
    x = 5, y = 6;
    char *array1;
    char array2;
    for (i = 0; i < 3; i++)
    {
        array2[i] = abc(x, y);
    }
}
You can't store the invocation of a function in C because it would defeat many popular optimizations involving passing parameters in registers. Normally, arguments are assigned to the parameters immediately before control is transferred to the call site; compilers may choose registers to hold those values, but those registers are volatile, so if the actual call were delayed they could be overwritten in the meantime, possibly even by another call that also passes its arguments in registers. A solution, which I've personally implemented, is to have a helper routine simulate the call for you by re-assigning the proper registers, and any further arguments, to the stack; in this case you store the argument values in flat memory. But this must be done in assembly written exclusively for this purpose and specific to your target architecture. On the other hand, if your architecture does not use any such optimizations, it could be quite a bit easier, but hand-written assembly would still be required.
In any case, this is not a feature that standard C (or even pre-standard C, as far as I know) has ever implemented.
For example, this is an implementation for x86-64 I wrote some time ago (for MSVC's MASM assembler):
PUBLIC makeuniquecall
.data
makeuniquecall_jmp_table dq zero_zero, one_zero, two_zero, three_zero ; ordinary
makeuniquecall_jmp_table_one dq zero_one, one_one, two_one, three_one ; single precision
makeuniquecall_jmp_table_two dq zero_two, one_two, two_two, three_two ; double precision
.code
makeuniquecall PROC
;rcx - function pointer
;rdx - raw argument data
;r8 - a byte array specifying each register parameter if it's float and the last qword is the size of the rest
push r12
push r13
push r14
mov r12, rcx
mov r13, rdx
mov r14, r8
; first store the stack vars
mov rax, [r14 + 4] ; retrieve size of stack
sub rsp, rax
mov rdi, rsp
xor rdx, rdx
mov r8, 8
div r8
mov rcx, rax
mov rsi, r13
;add rsi, 32
rep movs qword ptr [rdi], qword ptr [rsi]
xor r10,r10
cycle:
mov rax, r14
add rax, r10
movzx rax, byte ptr [rax]
test rax, rax
jnz jmp_one
lea rax, makeuniquecall_jmp_table
jmp qword ptr[rax + r10 * 8]
jmp_one:
cmp rax, 1
jnz jmp_two
lea rax, makeuniquecall_jmp_table_one
jmp qword ptr[rax + r10 * 8]
jmp_two:
lea rax, makeuniquecall_jmp_table_two
jmp qword ptr[rax + r10 * 8]
zero_zero::
mov rcx, qword ptr[r13+r10*8]
jmp continue
one_zero::
mov rdx, qword ptr[r13+r10*8]
jmp continue
two_zero::
mov r8, qword ptr[r13+r10*8]
jmp continue
three_zero::
mov r9, qword ptr[r13+r10*8]
jmp continue
zero_one::
movss xmm0, dword ptr[r13+r10*8]
jmp continue
one_one::
movss xmm1, dword ptr[r13+r10*8]
jmp continue
two_one::
movss xmm2, dword ptr[r13+r10*8]
jmp continue
three_one::
movss xmm3, dword ptr[r13+r10*8]
jmp continue
zero_two::
movsd xmm0, qword ptr[r13+r10*8]
jmp continue
one_two::
movsd xmm1, qword ptr[r13+r10*8]
jmp continue
two_two::
movsd xmm2, qword ptr[r13+r10*8]
jmp continue
three_two::
movsd xmm3, qword ptr[r13+r10*8]
continue:
inc r10
cmp r10, 4
jb cycle
mov r14, [r14 + 4] ; retrieve size of stack
call r12
add rsp, r14
pop r14
pop r13
pop r12
ret
makeuniquecall ENDP
END
And your code will look something like this:
#include <stdio.h>
#include <string.h>

char* abc(unsigned int a, unsigned int b)
{
    printf("a - %d, b - %d\n", a, b);
    return "return abc str\n";
}

extern char *makeuniquecall();

int main(void)
{
    unsigned int x, y;
    x = 5, y = 6;

#pragma pack(push, 4)
    struct {
        struct { char maskargs[4]; unsigned long long szargs; } invok;
        char *(*pfunc)();
        unsigned long long args[2], shadow[2];
    } array2[3];
#pragma pack(pop)

    for (int i = 0; i < 3; i++)
    {
        memset(array2[i].invok.maskargs, 0, sizeof array2[i].invok.maskargs); // integer args only - no floats passed
        array2[i].invok.szargs = 8 * 4; // consider shadow space
        array2[i].pfunc = abc;
        array2[i].args[0] = x;
        array2[i].args[1] = y;
    }

    // now do the calls
    for (int i = 0; i < 3; i++)
        printf("%s\n", ((char *(*)())makeuniquecall)(array2[i].pfunc, array2[i].args, &array2[i].invok));
}
You probably won't need that for your specific case; you will get away with simply storing each argument and calling the function directly, i.e. (plus this method won't be x86-64 specific):
//now do the calls
for (int i = 0; i < 3; i++)
    printf("%s\n", array2[i].pfunc(array2[i].args[0], array2[i].args[1]));
But my implementation gives you the flexibility to store a different number of arguments for each call.
Note: consider this guide for running the above examples on MSVC (since it requires adding an .asm file for the assembly code).
I love such noob questions, since they make you think about why feature X or Y doesn't actually exist in the language.
I need to do some operations with 48-bit variables, so I had two options:
Create my own structure with 48 bit variables, or
Use unsigned long long (64 bits).
As the operations will not overflow 48 bits, I considered using 64-bit variables to be overkill, so I created a base structure:
#ifdef __GNUC__
#define PACK( __Declaration__ ) __Declaration__ __attribute__((__packed__))
#endif
#ifdef _MSC_VER
#define PACK( __Declaration__ ) __pragma( pack(push, 1) ) __Declaration__ __pragma( pack(pop))
#endif
PACK(struct uint48 {
    unsigned long long v : 48;
});
and created some code to check the speed of the operations:
#include <stdio.h>
#include <time.h>
#ifdef __GNUC__
#define PACK( __Declaration__ ) __Declaration__ __attribute__((__packed__))
#endif
#ifdef _MSC_VER
#define PACK( __Declaration__ ) __pragma( pack(push, 1) ) __Declaration__ __pragma( pack(pop))
#endif
PACK(struct uint48 {
    unsigned long long v : 48;
});
void TestProductLong();
void TestProductLong02();
void TestProductPackedStruct();
void TestProductPackedStruct02();
clock_t start, end;
double cpu_time_used;
int cycleNumber = 100000;
int main(void)
{
    TestProductLong();
    TestProductLong02();
    TestProductPackedStruct();
    TestProductPackedStruct02();
    return 0;
}
void TestProductLong() {
    start = clock();
    for (int i = 0; i < cycleNumber; i++) {
        unsigned long long varlong01 = 155782;
        unsigned long long varlong02 = 15519994;
        unsigned long long product01 = varlong01 * varlong02;
        unsigned long long varlong03 = 155782;
        unsigned long long varlong04 = 15519994;
        unsigned long long product02 = varlong03 * varlong04;
        unsigned long long addition = product01 + product02;
    }
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("TestProductLong() took %f seconds to execute \n", cpu_time_used);
}
void TestProductLong02() {
    start = clock();
    unsigned long long varlong01;
    unsigned long long varlong02;
    unsigned long long product01;
    unsigned long long varlong03;
    unsigned long long varlong04;
    unsigned long long product02;
    unsigned long long addition;
    for (int i = 0; i < cycleNumber; i++) {
        varlong01 = 155782;
        varlong02 = 15519994;
        product01 = varlong01 * varlong02;
        varlong03 = 155782;
        varlong04 = 15519994;
        product02 = varlong03 * varlong04;
        addition = product01 + product02;
    }
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("TestProductLong02() took %f seconds to execute \n", cpu_time_used);
}
void TestProductPackedStruct() {
    start = clock();
    for (int i = 0; i < cycleNumber; i++) {
        struct uint48 x01;
        struct uint48 x02;
        struct uint48 x03;
        x01.v = 155782;
        x02.v = 15519994;
        x03.v = x01.v * x02.v;
        struct uint48 x04;
        struct uint48 x05;
        struct uint48 x06;
        x04.v = 155782;
        x05.v = 15519994;
        x06.v = x04.v * x05.v;
        struct uint48 x07;
        x07.v = x03.v + x06.v;
    }
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("TestProductPackedStruct() took %f seconds to execute \n", cpu_time_used);
}
void TestProductPackedStruct02() {
    start = clock();
    struct uint48 x01;
    struct uint48 x02;
    struct uint48 x03;
    struct uint48 x04;
    struct uint48 x05;
    struct uint48 x06;
    struct uint48 x07;
    for (int i = 0; i < cycleNumber; i++) {
        x01.v = 155782;
        x02.v = 15519994;
        x03.v = x01.v * x02.v;
        x04.v = 155782;
        x05.v = 15519994;
        x06.v = x04.v * x05.v;
        x07.v = x03.v + x06.v;
    }
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("TestProductPackedStruct02() took %f seconds to execute \n", cpu_time_used);
}
But I got the following results
TestProductLong() took 0.000188 seconds to execute
TestProductLong02() took 0.000198 seconds to execute
TestProductPackedStruct() took 0.001231 seconds to execute
TestProductPackedStruct02() took 0.001231 seconds to execute
So the operations using unsigned long long took less time than the ones using the packed structure.
Why is that?
Would it be better, then, to use unsigned long long instead?
Is there a better way to pack structures?
As I'm currently unrolling loops, using the correct data structure could significantly impact the performance of my application.
Thank you.
Although you know that the operations on the 48-bit values will not overflow, the compiler cannot know this! Further, with the vast majority of compilers and platforms, your uint48 structure will actually be implemented as a 64-bit data type, of which only the low 48 bits are ever used.
So, after any arithmetic (or other) operation on the .v data, the 'unused' 16 bits of the (internal) 64-bit representation need to be cleared, to ensure that any future access to that data gives the true, 48-bit-only value.
Thus, using the clang-cl compiler in Visual Studio 2019, the following (rather trivial) function using the native uint64_t type:
extern uint64_t add64(uint64_t a, uint64_t b) {
    return a + b;
}
generates the expected, highly efficient assembly code:
lea rax, [rcx + rdx]
ret
However, using (an equivalent of) your 48-bit packed structure:
#pragma pack(push, 1)
typedef struct uint48 {
    unsigned long long v : 48;
} uint48_t;
#pragma pack(pop)

extern uint48_t add48(uint48_t a, uint48_t b) {
    uint48_t c;
    c.v = a.v + b.v;
    return c;
}
requires additional assembly code to ensure that any overflow into the 'unused' bits is discarded:
add rcx, rdx
movabs rax, 281474976710655 # This is 0x0000FFFFFFFFFFFF - clearing top 16 bits!
and rax, rcx
ret
Note that the MSVC compiler generates very similar code.
Thus, you should expect that using native uint64_t variables will generate more efficient code than your 'space-saving' structure.
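If you still want 48-bit truncation somewhere, a compromise (just a sketch, not taken from the answer above) is to keep everything in uint64_t and mask only at the points where the truncated value actually matters:

#include <stdint.h>

#define MASK48 ((uint64_t)0xFFFFFFFFFFFF)  /* low 48 bits set */

/* Do the arithmetic in full 64-bit registers and truncate once at the end,
 * instead of paying for a mask after every operation on a bit-field. */
static inline uint64_t mul_add48(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    uint64_t p1 = a * b;          /* no masking needed mid-computation */
    uint64_t p2 = c * d;
    return (p1 + p2) & MASK48;    /* single truncation at the boundary */
}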
Your test procedure is wrong. Why?
Packing a single-member struct actually does nothing.
You compile with -O0, and testing execution speed with no optimizations does not make any sense. If you compile it with optimizations, your code is wiped out entirely: https://godbolt.org/z/9ibP_8
Here is the code reworked so that it can be optimized (since you do not use the values, they have to be global or at least static, and a compiler memory barrier (clobber) is added):
https://godbolt.org/z/BL9uJE
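A rough sketch of the kind of rework meant here (GCC/clang syntax for the memory clobber; MSVC would need a different barrier), with the results kept in globals so the stores cannot be optimized away:

unsigned long long product01, product02, addition;

void TestProductLongOptimizable(int cycleNumber)
{
    for (int i = 0; i < cycleNumber; i++) {
        unsigned long long varlong01 = 155782;
        unsigned long long varlong02 = 15519994;
        unsigned long long varlong03 = 155782;
        unsigned long long varlong04 = 15519994;

        product01 = varlong01 * varlong02;
        product02 = varlong03 * varlong04;
        addition  = product01 + product02;

        __asm__ __volatile__("" ::: "memory"); /* compiler-level barrier (clobber) */
    }
}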
The difference comes from trimming the results to 48 bits.
If you pack the struct (which is not necessary here), you force the compiler to access the variables byte by byte, because only bytes are always aligned: https://godbolt.org/z/2iV7vq
You can also use a mixed approach (a rough sketch of such a type appears after the listings below); it is not portable, as it relies on endianness and the bit-field implementation: https://godbolt.org/z/J3-it_
so the code will compile to:
unsigned long long:
mov QWORD PTR varlong01[rip], 155782
mov QWORD PTR varlong02[rip], 15519994
mov QWORD PTR product01[rip], rdx
mov QWORD PTR varlong03[rip], 155782
mov QWORD PTR varlong04[rip], 15519994
mov QWORD PTR product02[rip], rdx
mov QWORD PTR addition[rip], rcx
not packed struct
mov rdx, QWORD PTR x01[rip]
and rdx, rax
or rdx, 155782
mov QWORD PTR x01[rip], rdx
mov rdx, QWORD PTR x02[rip]
and rdx, rax
or rdx, 15519994
mov QWORD PTR x02[rip], rdx
mov rdx, QWORD PTR x03[rip]
and rdx, rax
or rdx, rsi
mov QWORD PTR x03[rip], rdx
mov rdx, QWORD PTR x04[rip]
and rdx, rax
or rdx, 155782
mov QWORD PTR x04[rip], rdx
mov rdx, QWORD PTR x05[rip]
and rdx, rax
or rdx, 15519994
mov QWORD PTR x05[rip], rdx
mov rdx, QWORD PTR x06[rip]
and rdx, rax
or rdx, rsi
mov QWORD PTR x06[rip], rdx
mov rdx, QWORD PTR x07[rip]
and rdx, rax
or rdx, rdi
mov QWORD PTR x07[rip], rdx
packed struct
mov BYTE PTR x01[rip], -122
mov BYTE PTR x01[rip+1], 96
mov BYTE PTR x01[rip+2], 2
mov BYTE PTR x01[rip+3], 0
mov BYTE PTR x01[rip+4], 0
mov BYTE PTR x01[rip+5], 0
mov BYTE PTR x02[rip], -6
mov BYTE PTR x02[rip+1], -48
mov BYTE PTR x02[rip+2], -20
mov BYTE PTR x02[rip+3], 0
mov BYTE PTR x02[rip+4], 0
mov BYTE PTR x02[rip+5], 0
mov BYTE PTR x03[rip], -36
mov BYTE PTR x03[rip+1], 34
mov BYTE PTR x03[rip+2], 71
mov BYTE PTR x03[rip+3], -20
mov BYTE PTR x03[rip+4], 50
mov BYTE PTR x03[rip+5], 2
mov BYTE PTR x04[rip], -122
mov BYTE PTR x04[rip+1], 96
mov BYTE PTR x04[rip+2], 2
mov BYTE PTR x04[rip+3], 0
mov BYTE PTR x04[rip+4], 0
mov BYTE PTR x04[rip+5], 0
mov BYTE PTR x05[rip], -6
mov BYTE PTR x05[rip+1], -48
mov BYTE PTR x05[rip+2], -20
mov BYTE PTR x05[rip+3], 0
mov BYTE PTR x05[rip+4], 0
mov BYTE PTR x05[rip+5], 0
mov BYTE PTR x06[rip], -36
mov BYTE PTR x06[rip+1], 34
mov BYTE PTR x06[rip+2], 71
mov BYTE PTR x06[rip+3], -20
mov BYTE PTR x06[rip+4], 50
mov BYTE PTR x06[rip+5], 2
mov BYTE PTR x07[rip], -72
mov BYTE PTR x07[rip+1], 69
mov BYTE PTR x07[rip+2], -114
mov BYTE PTR x07[rip+3], -40
mov BYTE PTR x07[rip+4], 101
mov BYTE PTR x07[rip+5], 4
I got a task to write an assembly routine that can be called from C and is declared as follows:
extern int solve_equation(long int a, long int b,long int c, long int *x, long int *y);
that finds a solution to the equation
a * x + b * y = c
in the range -2147483648 < x, y < 2147483647, by checking all options.
The value returned from the routine is 1 if a solution is found and 0 otherwise.
You must take into consideration that the results of the calculations a * x, b * y, and a * x + b * y can exceed 32 bits.
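In plain C, the required behaviour is essentially this (a reference sketch only; it assumes long long is 64 bits and is of course far too slow to run to completion, but it is what the assembly below is meant to mirror):

/* Brute-force reference: find x, y in the open range
 * (-2147483648, 2147483647) with a*x + b*y == c, using 64-bit intermediates. */
int solve_equation_ref(long a, long b, long c, long *x, long *y)
{
    for (long long xi = -2147483647LL; xi < 2147483647LL; xi++) {
        for (long long yi = -2147483647LL; yi < 2147483647LL; yi++) {
            long long sum = a * xi + b * yi;   /* fits in 64 bits for these ranges */
            if (sum == c) {
                *x = (long)xi;
                *y = (long)yi;
                return 1;
            }
        }
    }
    return 0;
}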
.MODEL SMALL
.DATA
C DQ ?
SUM DQ 0
MUL1 DQ ?
MUL2 DQ ?
X DD ?
Y DD ?
.CODE
.386
PUBLIC _solve_equation
_solve_equation PROC NEAR
PUSH BP
MOV BP,SP
PUSH SI
MOV X,-2147483648
MOV Y,-2147483648
MOV ECX,4294967295
FOR1:
CMP ECX,0
JE FALSE
PUSH ECX
MOV Y,-2147483648
MOV ECX,4294967295
FOR2:
MOV SUM,0
CMP ECX,0
JE SET_FOR1
MOV EAX,DWORD PTR [BP+4]
IMUL X
MOV DWORD PTR MUL1,EAX
MOV DWORD PTR MUL1+4,EDX
MOV EAX,DWORD PTR [BP+8]
IMUL Y
MOV DWORD PTR MUL2,EAX
MOV DWORD PTR MUL2+4,EDX
MOV EAX, DWORD PTR MUL1
ADD DWORD PTR SUM,EAX
MOV EAX, DWORD PTR MUL2
ADD DWORD PTR SUM,EAX
MOV EAX, DWORD PTR MUL1+4
ADD DWORD PTR SUM+4,EAX
MOV EAX, DWORD PTR MUL2+4
ADD DWORD PTR SUM+4,EAX
CMP SUM,-2147483648
JL SET_FOR2
CMP SUM,2147483647
JG SET_FOR2
MOV EAX,DWORD PTR [BP+12]
CMP DWORD PTR SUM,EAX
JE TRUE
SET_FOR2:
DEC ECX
INC Y
JMP FOR2
SET_FOR1:
POP ECX
DEC ECX
JMP FOR1
FALSE:
MOV AX,0
JMP SOF
TRUE:
MOV SI,WORD PTR [BP+16]
MOV EAX,X
MOV DWORD PTR [SI],EAX
MOV SI,WORD PTR [BP+18]
MOV EAX,Y
MOV DWORD PTR [SI],EAX
MOV AX,1
SOF:
POP SI
POP BP
RET
_solve_equation ENDP
END
Is this the right way to work with large numbers?
I get "argument to operation or instruction has illegal size" when I try to do:
MOV SUM,0
CMP SUM,-2147483648
CMP SUM,2147483647
main code:
int main()
{
    long int x, y, flag;
    flag = solve_equation(-5, 4, 2147483647, &x, &y);
    if (flag == 1)
        printf("%ld*%ld + %ld*%ld = %ld\n", -5L, x, 4L, y, 2147483647);
    return 0;
}
output
-5*-2147483647 + 4*-2147483647 = 2147483647
I'm using DOSBox 0.74 and TCC.
You're using 16-bit code, so 64-bit operand-size isn't available. Your assembler magically associates a size with sum, because you defined it with sum dq 0.
So mov sum, 0 is equivalent to mov qword ptr [sum], 0, which of course won't assemble in 16 or 32-bit mode; you can only operate on up to 32 bits at once with integer operations.
(32-bit operand-size is available in 16-bit mode on 386-compatible CPUs, using the same machine encodings that allows 16-bit operand size in 32-bit mode. But 64-bit operand size is only available in 64-bit mode. Unlike 386, AMD64 didn't add any new prefixes or anything to previously-existing modes, for various reasons.)
You could zero the whole 64-bit sum with an SSE store, or even compare with SSE4.2 pcmpgtq, but that's probably not what you want.
It looks like you want to check if 64-bit sum fits in 32 bits. (i.e. if it is a sign-extended 32-bit integer).
So really you just need to check that all 32 high bits are the same and match bit 31 of the low half.
mov eax, dword ptr [sum]
cdq ; sign extend eax into edx:eax
; i.e. copy bit 31 of EAX to all bits of EDX
cmp edx, dword ptr [sum+4]
je small_sum
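In C terms, the check above is just a sign-extension comparison (a small sketch for clarity):

#include <stdint.h>

/* A 64-bit value fits in 32 bits exactly when sign-extending its low half
 * reproduces the whole value -- the same test the cdq/cmp sequence performs. */
static int fits_in_32_bits(int64_t sum)
{
    return sum == (int32_t)sum;
}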
I've been struggling for a while with the performance of the network coding in an application I'm developing (see Optimzing SSE-code, Improving performance of network coding-encoding, and OpenCL distribution). Now I'm quite close to achieving acceptable performance. This is the current state of the innermost loop (which is where >99% of the execution time is spent):
while (elementIterations-- > 0)
{
    unsigned int firstMessageField = *(currentMessageGaloisFieldsArray++);
    unsigned int secondMessageField = *(currentMessageGaloisFieldsArray++);
    __m128i valuesToMultiply = _mm_set_epi32(0, secondMessageField, 0, firstMessageField);
    __m128i multipliedHalves = _mm_mul_epu32(valuesToMultiply, fragmentCoefficentVector);
}
Do you have any suggestions on how to further optimize this? I understand that it's hard to do without more context but any help is appreciated!
Now that I'm awake, here's my answer:
In your original code, the bottleneck is almost certainly _mm_set_epi32. This single intrinsic gets compiled into this mess in your assembly:
633415EC xor edi,edi
633415EE movd xmm3,edi
...
633415F6 xor ebx,ebx
633415F8 movd xmm4,edi
633415FC movd xmm5,ebx
63341600 movd xmm0,esi
...
6334160B punpckldq xmm5,xmm3
6334160F punpckldq xmm0,xmm4
...
63341618 punpckldq xmm0,xmm5
What is this? 9 instructions?!?!?! Pure overhead...
Another place that seems odd is that the compiler didn't merge the adds and loads:
movdqa xmm3,xmmword ptr [ecx-10h]
paddq xmm0,xmm3
should have been merged into:
paddq xmm0,xmmword ptr [ecx-10h]
I'm not sure if the compiler went brain-dead, or if it actually had a legitimate reason to do that... Anyways, it's a small thing compared to the _mm_set_epi32.
Disclaimer: The code I will present from here on violates strict-aliasing. But non-standard compliant methods are often needed to achieve maximum performance.
Solution 1: No Vectorization
This solution assumes allZero is really all zeros.
The loop is actually simpler than it looks. Since there isn't a lot of arithmetic, it might be better to just not vectorize:
// Test Data
unsigned __int32 fragmentCoefficentVector = 1000000000;
__declspec(align(16)) int currentMessageGaloisFieldsArray_[8] = {10,11,12,13,14,15,16,17};
int *currentMessageGaloisFieldsArray = currentMessageGaloisFieldsArray_;
__m128i currentUnModdedGaloisFieldFragments_[8];
__m128i *currentUnModdedGaloisFieldFragments = currentUnModdedGaloisFieldFragments_;
memset(currentUnModdedGaloisFieldFragments,0,8 * sizeof(__m128i));
int elementIterations = 4;
// The Loop
while (elementIterations > 0){
    elementIterations -= 1;

    // Default 32 x 32 -> 64-bit multiply code
    unsigned __int64 r0 = currentMessageGaloisFieldsArray[0] * (unsigned __int64)fragmentCoefficentVector;
    unsigned __int64 r1 = currentMessageGaloisFieldsArray[1] * (unsigned __int64)fragmentCoefficentVector;

    // Use this for Visual Studio. VS doesn't know how to optimize 32 x 32 -> 64-bit multiply
    // unsigned __int64 r0 = __emulu(currentMessageGaloisFieldsArray[0], fragmentCoefficentVector);
    // unsigned __int64 r1 = __emulu(currentMessageGaloisFieldsArray[1], fragmentCoefficentVector);

    ((__int64*)currentUnModdedGaloisFieldFragments)[0] += r0 & 0x00000000ffffffff;
    ((__int64*)currentUnModdedGaloisFieldFragments)[1] += r0 >> 32;
    ((__int64*)currentUnModdedGaloisFieldFragments)[2] += r1 & 0x00000000ffffffff;
    ((__int64*)currentUnModdedGaloisFieldFragments)[3] += r1 >> 32;

    currentMessageGaloisFieldsArray += 2;
    currentUnModdedGaloisFieldFragments += 2;
}
Which compiles to this on x64:
$LL4#main:
mov ecx, DWORD PTR [rbx]
mov rax, r11
add r9, 32 ; 00000020H
add rbx, 8
mul rcx
mov ecx, DWORD PTR [rbx-4]
mov r8, rax
mov rax, r11
mul rcx
mov ecx, r8d
shr r8, 32 ; 00000020H
add QWORD PTR [r9-48], rcx
add QWORD PTR [r9-40], r8
mov ecx, eax
shr rax, 32 ; 00000020H
add QWORD PTR [r9-24], rax
add QWORD PTR [r9-32], rcx
dec r10
jne SHORT $LL4#main
and this on x86:
$LL4#main:
mov eax, DWORD PTR [esi]
mul DWORD PTR _fragmentCoefficentVector$[esp+224]
mov ebx, eax
mov eax, DWORD PTR [esi+4]
mov DWORD PTR _r0$31463[esp+228], edx
mul DWORD PTR _fragmentCoefficentVector$[esp+224]
add DWORD PTR [ecx-16], ebx
mov ebx, DWORD PTR _r0$31463[esp+228]
adc DWORD PTR [ecx-12], edi
add DWORD PTR [ecx-8], ebx
adc DWORD PTR [ecx-4], edi
add DWORD PTR [ecx], eax
adc DWORD PTR [ecx+4], edi
add DWORD PTR [ecx+8], edx
adc DWORD PTR [ecx+12], edi
add esi, 8
add ecx, 32 ; 00000020H
dec DWORD PTR tv150[esp+224]
jne SHORT $LL4#main
It's possible that both of these are already faster than your original (SSE) code... On x64, unrolling it will make it even better.
Solution 2: SSE2 Integer Shuffle
This solution unrolls the loop to 2 iterations:
// Test Data
__m128i allZero = _mm_setzero_si128();
__m128i fragmentCoefficentVector = _mm_set1_epi32(1000000000);
__declspec(align(16)) int currentMessageGaloisFieldsArray_[8] = {10,11,12,13,14,15,16,17};
int *currentMessageGaloisFieldsArray = currentMessageGaloisFieldsArray_;
__m128i currentUnModdedGaloisFieldFragments_[8];
__m128i *currentUnModdedGaloisFieldFragments = currentUnModdedGaloisFieldFragments_;
memset(currentUnModdedGaloisFieldFragments,0,8 * sizeof(__m128i));
int elementIterations = 4;
// The Loop
while (elementIterations > 1){
    elementIterations -= 2;

    // Load 4 elements. If needed use unaligned load instead.
    // messageField = {a, b, c, d}
    __m128i messageField = _mm_load_si128((__m128i*)currentMessageGaloisFieldsArray);

    // Get into this form:
    // values0 = {a, x, b, x}
    // values1 = {c, x, d, x}
    __m128i values0 = _mm_shuffle_epi32(messageField, 216);
    __m128i values1 = _mm_shuffle_epi32(messageField, 114);

    // Multiply by "fragmentCoefficentVector"
    values0 = _mm_mul_epu32(values0, fragmentCoefficentVector);
    values1 = _mm_mul_epu32(values1, fragmentCoefficentVector);

    __m128i halves0 = _mm_unpacklo_epi32(values0, allZero);
    __m128i halves1 = _mm_unpackhi_epi32(values0, allZero);
    __m128i halves2 = _mm_unpacklo_epi32(values1, allZero);
    __m128i halves3 = _mm_unpackhi_epi32(values1, allZero);

    halves0 = _mm_add_epi64(halves0, currentUnModdedGaloisFieldFragments[0]);
    halves1 = _mm_add_epi64(halves1, currentUnModdedGaloisFieldFragments[1]);
    halves2 = _mm_add_epi64(halves2, currentUnModdedGaloisFieldFragments[2]);
    halves3 = _mm_add_epi64(halves3, currentUnModdedGaloisFieldFragments[3]);

    currentUnModdedGaloisFieldFragments[0] = halves0;
    currentUnModdedGaloisFieldFragments[1] = halves1;
    currentUnModdedGaloisFieldFragments[2] = halves2;
    currentUnModdedGaloisFieldFragments[3] = halves3;

    currentMessageGaloisFieldsArray += 4;
    currentUnModdedGaloisFieldFragments += 4;
}
which gets compiled to this (x86): (x64 isn't too different)
$LL4#main:
movdqa xmm1, XMMWORD PTR [esi]
pshufd xmm0, xmm1, 216 ; 000000d8H
pmuludq xmm0, xmm3
movdqa xmm4, xmm0
punpckhdq xmm0, xmm2
paddq xmm0, XMMWORD PTR [eax-16]
pshufd xmm1, xmm1, 114 ; 00000072H
movdqa XMMWORD PTR [eax-16], xmm0
pmuludq xmm1, xmm3
movdqa xmm0, xmm1
punpckldq xmm4, xmm2
paddq xmm4, XMMWORD PTR [eax-32]
punpckldq xmm0, xmm2
paddq xmm0, XMMWORD PTR [eax]
punpckhdq xmm1, xmm2
paddq xmm1, XMMWORD PTR [eax+16]
movdqa XMMWORD PTR [eax-32], xmm4
movdqa XMMWORD PTR [eax], xmm0
movdqa XMMWORD PTR [eax+16], xmm1
add esi, 16 ; 00000010H
add eax, 64 ; 00000040H
dec ecx
jne SHORT $LL4#main
Only slightly longer than the non-vectorized version for two iterations. This uses very few registers, so you can further unroll this even on x86.
Explanations:
As Paul R mentioned, unrolling to two iterations allows you to combine the initial load into one SSE load. This also has the benefit of getting your data into the SSE registers.
Since the data starts off in the SSE registers, _mm_set_epi32 (which gets compiled into about ~9 instructions in your original code) can be replaced with a single _mm_shuffle_epi32.
I suggest you unroll your loop by a factor of 2 so that you can load 4 messageField values using one _mm_load_XXX, and then unpack these four values into two vector pairs and process them as per the current loop. That way you won't have a lot of messy code being generated by the compiler for _mm_set_epi32 and all your loads and stores will be 128 bit SSE loads/stores. This will also give the compiler more opportunity to schedule instructions optimally within the loop.
My ASM-to-C code emulation is nearly done; I'm just trying to solve these second-pass problems.
Let's say I have this ASM function:
401040 MOV EAX,DWORD PTR [ESP+8]
401044 MOV EDX,DWORD PTR [ESP+4]
401048 PUSH ESI
401049 MOV ESI,ECX
40104B MOV ECX,EAX
40104D DEC EAX
40104E TEST ECX,ECX
401050 JE 401083
401052 PUSH EBX
401053 PUSH EDI
401054 LEA EDI,[EAX+1]
401057 MOV AX,WORD PTR [ESI]
40105A XOR EBX,EBX
40105C MOV BL,BYTE PTR [EDX]
40105E MOV ECX,EAX
401060 AND ECX,FFFF
401066 SHR ECX,8
401069 XOR ECX,EBX
40106B XOR EBX,EBX
40106D MOV BH,AL
40106F MOV AX,WORD PTR [ECX*2+45F81C]
401077 XOR AX,BX
40107A INC EDX
40107B DEC EDI
40107C MOV WORD PTR [ESI],AX
40107F JNE 401057
401081 POP EDI
401082 POP EBX
401083 POP ESI
401084 RET 8
My program generates the following for it:
int Func_401040() {
regs.d.eax = *(unsigned int *)(regs.d.esp+0x00000008);
regs.d.edx = *(unsigned int *)(regs.d.esp+0x00000004);
regs.d.esp -= 4;
*(unsigned int *)(regs.d.esp) = regs.d.esi;
regs.d.esi = regs.d.ecx;
regs.d.ecx = regs.d.eax;
regs.d.eax--;
if(regs.d.ecx == 0)
goto label_401083;
regs.d.esp -= 4;
*(unsigned int *)(regs.d.esp) = regs.d.ebx;
regs.d.esp -= 4;
*(unsigned int *)(regs.d.esp) = regs.d.edi;
regs.d.edi = (regs.d.eax+0x00000001);
regs.x.ax = *(unsigned short *)(regs.d.esi);
regs.d.ebx ^= regs.d.ebx;
regs.h.bl = *(unsigned char *)(regs.d.edx);
regs.d.ecx = regs.d.eax;
regs.d.ecx &= 0x0000FFFF;
regs.d.ecx >>= 0x00000008;
regs.d.ecx ^= regs.d.ebx;
regs.d.ebx ^= regs.d.ebx;
regs.h.bh = regs.h.al;
regs.x.ax = *(unsigned short *)(regs.d.ecx*0x00000002+0x0045F81C);
regs.x.ax ^= regs.x.bx;
regs.d.edx++;
regs.d.edi--;
*(unsigned short *)(regs.d.esi) = regs.x.ax;
JNE 401057
regs.d.edi = *(unsigned int *)(regs.d.esp);
regs.d.esp += 4;
regs.d.ebx = *(unsigned int *)(regs.d.esp);
regs.d.esp += 4;
label_401083:
regs.d.esi = *(unsigned int *)(regs.d.esp);
regs.d.esp += 4;
return 0x8;
}
Since the JNE 401057 doesn't follow a CMP or TEST, how do I handle it in the C code?
The most recent instruction that modified flags is the dec, which sets ZF when its operand hits 0. So the jne is about equivalent to if (regs.d.edi != 0) goto label_401057;.
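Concretely, the loop portion of your generated code could be patched like this (a sketch that keeps your regs emulation style; only the label and the branch are new):

label_401057:
    regs.x.ax = *(unsigned short *)(regs.d.esi);
    /* ... the rest of the loop body exactly as generated above ... */
    regs.d.edi--;
    *(unsigned short *)(regs.d.esi) = regs.x.ax;
    if (regs.d.edi != 0)          /* JNE consumes the ZF that DEC EDI just set */
        goto label_401057;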
BTW: ret 8 isn't equivalent to return 8. The ret instruction's operand is the number of bytes to add to ESP when returning. (It's commonly used to clean up the stack.) It'd be kinda like
return eax;
regs.d.esp += 8;
except that semi-obviously, this won't work in C -- the return makes any code after it unreachable.
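The workable ordering in that style is to adjust ESP first and only then return the value (a sketch; how your emitter accounts for the pushed return address is up to your framework):

regs.d.esp += 8;     /* RET 8: the callee discards the two dword arguments */
return regs.d.eax;   /* the function's result is whatever is left in EAX */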
This is actually a part of the calling convention -- [ESP+4] and [ESP+8] are arguments passed to the function, and the ret is cleaning those up. This isn't the usual C calling convention; it looks more like fastcall or thiscall, considering the function expects a value in ECX.