I'm working on a C-based program that uses assembly for image flipping. This is the pseudocode that is supposed to work (always using images of 240x320):
voltearHorizontal(imgO, imgD){
    dirOrig = imgO;
    dirDest = imgD;
    dirOrig = dirOrig + 239*320; //bring the pointer to the first pixel of the last row
    for(f=0; f<240; f++){
        for(c=0; c<320; c++){
            [dirDest] = [dirOrig];
            dirOrig++;
            dirDest++;
        }
        dirOrig = dirOrig + 640; //move the pixel to the first one of the upper row
    }
}
But when translated to assembly, the first rows are not copied in the result, leaving that space black.
https://gyazo.com/7a76f147da96ae2bc27e109593ed6df8
This is the code I've written, which is supposed to work, and this is what actually happens to the image:
https://gyazo.com/2e389248d9959a786e736eecd3bf1531
Why, with this code, are the upper lines of pixels of the origin image not written/read to the second image? What part of the code did I get wrong?
I think I have no tags left to add for my problem; thanks for any help on where I am wrong. Also, the horizontal flip (the one above is the vertical) simply makes the program finish unexpectedly:
https://gyazo.com/a7a18cf10ac3c06fc73a93d9e55be70c
Any special reason why you wrote it as slow assembly? Why don't you just keep it in fast C++? https://godbolt.org/g/2oIpzt
#include <cstring>

void voltearHorizontal(const unsigned char* imgO, unsigned char* imgD) {
    imgO += 239*320; //bring the pointer to the first pixel of the last row
    for(unsigned f=0; f<240; ++f) {
        memcpy(imgD, imgO, 320);
        imgD += 320;
        imgO -= 320;
    }
}
Compiled with gcc 6.3 -O3, this becomes:
voltearHorizontal(unsigned char const*, unsigned char*):
lea rax, [rdi+76480]
lea r8, [rdi-320]
mov rdx, rsi
.L2:
mov rcx, QWORD PTR [rax]
lea rdi, [rdx+8]
mov rsi, rax
sub rax, 320
and rdi, -8
mov QWORD PTR [rdx], rcx
mov rcx, QWORD PTR [rax+632]
mov QWORD PTR [rdx+312], rcx
mov rcx, rdx
add rdx, 320
sub rcx, rdi
sub rsi, rcx
add ecx, 320
shr ecx, 3
cmp rax, r8
rep movsq
jne .L2
rep ret
I.e. roughly 800% more efficient than your inline asm.
Anyway, in your question the problem is:
dirOrig=dirOrig+640;//move the pixel to the first one of the upper row
You need to do -= 640 to return two lines up: the inner loop has already advanced dirOrig a full row forward, so you must step back over that row and over the one above it.
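With that sign fixed, the pixel-by-pixel loop from the pseudocode works. A minimal C sketch (assuming 8-bit pixels in a 240-row by 320-column buffer; names taken from the question):

```c
#define ROWS 240
#define COLS 320

/* Vertical flip: read rows bottom-to-top, write them top-to-bottom. */
void voltearHorizontal(const unsigned char *imgO, unsigned char *imgD) {
    const unsigned char *dirOrig = imgO + (ROWS - 1) * COLS; /* first pixel of last row */
    unsigned char *dirDest = imgD;
    for (int f = 0; f < ROWS; f++) {
        for (int c = 0; c < COLS; c++)
            *dirDest++ = *dirOrig++;
        dirOrig -= 2 * COLS; /* was +640 in the question: back over the copied row, then the one above */
    }
}
```

Because each pixel is one byte here, 2 * COLS is exactly the 640 from the pseudocode, just with the correct sign.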
About those inline asm screenshots: put them as text into the question. From a quick look at them, I would simply rewrite this in C++ and leave it to the compiler. You are doing many performance-wrong things in your asm, so I don't see the point; plus inline asm is ugly, hard to maintain, and hard to write correctly.
I did check even that asm in the picture. You have the line counter in eax, but you use al to copy the pixel, so copying destroys the line counter value.
Use debugger next time.
BTW, your pictures are 320x240, not 240x320.
This is my first question on this platform. I'm trying to modify the pixels of an image file and copy them to memory requested with calloc. When the code tries to dereference the pointer to the calloc'd memory at offset 16360 to write, an "access violation writing location" exception is thrown. Sometimes the offset is slightly higher or lower. The amount of memory requested is correct.
When I write equivalent code in C++ with calloc, it works, but not in assembly. I've also tried to request a higher amount of memory in assembly and to raise the heap and stack size in the Visual Studio settings, but nothing works for the assembly code. I also had to set the option /LARGEADDRESSAWARE:NO before I could even build and run the program.
I know that the AVX instruction set would be better suited for this, but the code would contain slightly more lines, so I kept it simpler for this question. I'm also not a pro; I did this to practice the AVX instruction set.
Many thanks in advance :)
const uint8_t* getImagePtr(sf::Image** image, const char* imageFilename, uint64_t* imgSize) {
sf::Image* img = new sf::Image;
img->loadFromFile(imageFilename);
sf::Vector2u sz = img->getSize();
*imgSize = uint64_t((sz.x * sz.y) * 4u);
*image = img;
return img->getPixelsPtr();
}
EXTRN getImagePtr:PROC
EXTRN calloc:PROC
.data
imagePixelPtr QWORD 0 ; contains address to source array of 8 bit pixels
imageSize QWORD 0 ; contains size in bytes of the image file
image QWORD 0 ; contains pointer to image object
newImageMemory QWORD 0 ; contains address to destination array
imageFilename BYTE "IMAGE.png", 0 ; name of the file
.code
mainasm PROC
sub rsp, 40
mov rcx, OFFSET image
mov rdx, OFFSET imageFilename
mov r8, OFFSET imageSize
call getImagePtr
mov imagePixelPtr, rax
mov rcx, 1
mov rdx, imageSize
call calloc
add rsp, 40
cmp rax, 0
je done
mov newImageMemory, rax
mov rcx, imageSize
xor eax, eax
mov bl, 20
SomeLoop:
mov dl, BYTE PTR [imagePixelPtr + rax]
add dl, bl
mov BYTE PTR [newImageMemory + rax], dl ; exception when dereferencing and writing to offset 16360
inc rax
loop SomeLoop
done:
ret
mainasm ENDP
END
Let's translate this line back into C:
mov BYTE PTR [newImageMemory + rax], dl ;
In C, this is more or less equivalent to:
*((unsigned char *)&newImageMemory + rax) = dl;
Which is clearly not what you want. It's writing to an offset from the location of newImageMemory, and not to an offset from where newImageMemory points to.
You will need to keep newImageMemory in a register if you want to use it as the base address for an offset.
While we're at it, this line is also wrong, for the same reason:
mov dl, BYTE PTR [imagePixelPtr + rax]
It just happens not to crash.
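The difference between the two interpretations can be shown directly in C (a minimal sketch; `newImageMemory` here is a stand-in global for the QWORD variable in the .data section):

```c
#include <stdlib.h>

unsigned char *newImageMemory; /* stands in for the QWORD variable in .data */

/* What [newImageMemory + rax] in the MASM code actually computes:
   an offset from the variable's own storage location. */
unsigned char *buggy_address(size_t offset) {
    return (unsigned char *)&newImageMemory + offset;
}

/* What was intended: an offset into the buffer the variable points to. */
unsigned char *intended_address(size_t offset) {
    return newImageMemory + offset;
}
```

Once the offset walks past the few bytes of the .data variables, the buggy form lands in unmapped or read-only memory, which matches the access violation at a small, slightly varying offset.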
So let's say I have two functions to choose from based on whether a number is even or odd. I came up with this:
(void (*[])()){f1, f2}[n%2]();
Is it more or less efficient than simply:
n%2 ? f2() : f1();
Profile it; it's most likely too small to measure.
Assuming a naive compiler, the second will be a lot shorter to execute. But the first is written terribly; it could be squashed down to
((n & 1) ? f1 : f2)();
Now it's pretty much a toss-up. The first generates something like
test al, 1
jz +3
call f1
jmp +1
call f2
and the second something like
test al, 1
jz +3
lea rcx, [f1]
jmp +1
lea rcx, [f2]
call rcx
but a good optimizer could flatten that down to
lea rcx, [f1]
test al, 1
cmovcc rcx, f2
call rcx
While all this is true, the initial statement applies: it's most likely too small to measure.
An additional question in the comments asks about "easier to expand"; well, yes. After a surprisingly small number of functions, the array lookup becomes faster. I'm skeptical and would not write the array inline, but somebody could come along and prove me wrong. If there are more than two, I would write
static void (*dispatch_table[])() = {f1, f2};
dispatch_table[n % (sizeof(dispatch_table) / sizeof(dispatch_table[0]))]();
The sizeof expression is a compile-time constant, which lets the compiler optimize the % into something more performant. It's written this way so that adding more entries to the array doesn't require changing the second argument of %.
As pointed out in the comments, I didn't handle negative n. Most RNG sources don't generate negative numbers.
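A self-contained sketch of the dispatch-table pattern, with a cast to unsigned so that negative n also yields a valid index (the f1/f2 bodies and the recording global are illustrative, not from the original):

```c
#include <stddef.h>

static int last_called; /* records which function ran, for demonstration only */

static void f1(void) { last_called = 1; }
static void f2(void) { last_called = 2; }

static void (*dispatch_table[])(void) = {f1, f2};

void dispatch(int n) {
    /* Cast to unsigned: a negative n would make n % 2 negative and index
       out of bounds; the unsigned remainder preserves parity. */
    size_t idx = (unsigned)n % (sizeof(dispatch_table) / sizeof(dispatch_table[0]));
    dispatch_table[idx]();
}
```

Adding a third function is then just one more initializer in the array; the modulus adjusts itself.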
Godbolt to the rescue (64-bit Clang 11.0 with -O3 optimizations): https://godbolt.org/z/MWjPnn
First implementation:
void impl1(int n)
{
(void (*[])()){evenfunc ,oddfunc}[n%2]();
}
mov qword ptr [rsp - 16], offset evenfunc()
mov qword ptr [rsp - 8], offset oddfunc()
mov eax, edi
shr eax, 31
add eax, edi
and eax, -2
sub edi, eax
movsxd rax, edi
jmp qword ptr [rsp + 8*rax - 16] # TAILCALL
Second implementation:
void impl2(int n)
{
n%2 ? oddfunc() : evenfunc();
}
test dil, 1
jne .LBB1_1
jmp evenfunc() # TAILCALL
.LBB1_1:
jmp oddfunc() # TAILCALL
While studying, I came across the use of (i + 1) mod SIZE to cycle through an array of elements.
So I wondered whether this method is more efficient than an if statement...
For example:
#define SIZE 15

int main(int argc, char *argv[]) {
    int items[SIZE];
    for(int i = 0; items[0] < 5; i = (i + 1) % SIZE)
        items[i] += 1;
    return 0;
}
Is it more efficient than this?
#define SIZE 15

int main(int argc, char *argv[]) {
    int items[SIZE];
    for(int i = 0; items[0] < 5; i++) {
        if(i == SIZE) i = 0;
        items[i] += 1;
    }
    return 0;
}
Thanks for the answers and your time.
You can check the assembly online (e.g. here). The result depends on the architecture and the optimization level, but without optimization, for x64 with GCC, you get this code (as a simple example).
Example 1:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
mov eax, DWORD PTR [rbp-4]
add eax, 1
movsx rdx, eax
imul rdx, rdx, -2004318071
shr rdx, 32
add edx, eax
mov ecx, edx
sar ecx, 3
cdq
sub ecx, edx
mov edx, ecx
mov DWORD PTR [rbp-4], edx
mov ecx, DWORD PTR [rbp-4]
mov edx, ecx
sal edx, 4
sub edx, ecx
sub eax, edx
mov DWORD PTR [rbp-4], eax
jmp .L3
.L2:
mov eax, 0
pop rbp
ret
Example 2:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L4:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
cmp DWORD PTR [rbp-4], 15
jne .L3
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
add DWORD PTR [rbp-4], 1
jmp .L4
.L2:
mov eax, 0
pop rbp
ret
You can see that, for this specific case on x86, the solution without the modulo is much shorter.
Although you are only asking about mod vs. branch, there are really more like five cases, depending on the actual implementation of the mod and of the branch:
Modulus-based
Power-of-two
If the value of SIZE is known to the compiler and is a power of 2, the mod will compile into a single and like this and will be very efficient in performance and code size. The and is still part of the loop increment dependency chain though, putting a speed limit on the performance of 2 cycles per iteration unless the compiler is clever enough to unroll it and keep the and out of the carried chain (gcc and clang weren't).
Known, not power-of-two
On the other hand, if the value of SIZE is known but not a power of two, then you are likely to get a multiplication-based implementation of the fixed modulus value, like this. This generally takes something like 4-6 instructions, which end up part of the dependency chain. So this will limit your performance to something like 1 iteration every 5-8 cycles, depending exactly on the latency of the dependency chain.
Unknown
In your example SIZE is a known constant, but in the more general case, where it is not known at compile time, you will get a division instruction on platforms that support it. Something like this.
That is good for code size, since it's a single instruction, but probably disastrous for performance because now you have a slow division instruction as part of the carried dependency for the loop. Depending on your hardware and the type of the SIZE variable, you are looking at 20-100 cycles per iteration.
Branch-based
You wrote a branch in your code, but the compiler may decide to implement it either as a conditional jump or as a conditional move. At -O2, gcc decides on a jump and clang on a conditional move.
Conditional Jump
This is the direct interpretation of your code: use a conditional branch to implement the i == SIZE condition.
It has the advantage of making the condition a control dependency, not a data dependency, so your loop will mostly run at full speed when the branch is not taken.
However, performance could be seriously impacted if the branch mispredicts often. That depends heavily on the value of SIZE and on your hardware. Modern Intel should be able to predict nested loops like this up to 20-something iterations, but beyond that it will mispredict once every time the inner loop is exited. Of course, if SIZE is very large then the single mispredict won't matter much anyway, so the worst case is SIZE just large enough to mispredict.
Conditional Move
clang uses a conditional move to update i. This is a reasonable option, but it does mean a carried data flow dependency of 3-4 cycles.
1 Either actually a constant like your example or effectively a constant due to inlining and constant propagation.
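The power-of-two case above can be illustrated in C (a sketch; for unsigned values, i % 16 and i & 15 are the same operation, and the AND form is what the compiler emits):

```c
#define SIZE 16 /* power of two, so % compiles down to a single AND */

/* Wrap-around increment as written in the question. */
unsigned next_index(unsigned i) {
    return (i + 1) % SIZE;
}

/* The mask form the compiler actually generates (and eax, 15). */
unsigned next_index_mask(unsigned i) {
    return (i + 1) & (SIZE - 1);
}
```

Note this equivalence only holds for unsigned (or known non-negative) values; for signed negative operands, % and & disagree, which is why the known-but-not-power-of-two and signed cases need the longer multiply-based sequences.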
So my question is basic, but I had a hard time finding anything on the internet.
Let's say I want to write a function in C that calls an external NASM function written in x86_64 assembly.
I want to pass the external function two char* of numbers, perform some arithmetic operations on the two, and return a char* of the result. My idea was to iterate over [rdi] and [rsi] somehow, saving the result in rax (i.e. add rax, [rdi], [rsi]), but I'm having a hard time actually doing so. What would be the right way to go over each character? Increasing [rsi] and [rdi]? Also, would I only need to move into rax the value of the first character?
Thanks in advance!
If you could post your assembly/C code, it would be easier to suggest changes.
For any assembly, I would start with C code (since I think in C :)), convert it to assembly using a compiler, and then optimize the assembly as needed. Assuming you need to write a function which takes two strings, adds them, and returns the result as an int, like the following:
int ext_asm_func(unsigned char *arg1, unsigned char *arg2, int len)
{
    int i, result = 0;
    for(i = 0; i < len; i++) {
        result += arg1[i] + arg2[i];
    }
    return result;
}
Here is the assembly (generated by gcc, https://godbolt.org/g/1N6vBT):
ext_asm_func(unsigned char*, unsigned char*, int):
test edx, edx
jle .L4
lea r9d, [rdx-1]
xor eax, eax
xor edx, edx
add r9, 1
.L3:
movzx ecx, BYTE PTR [rdi+rdx]
movzx r8d, BYTE PTR [rsi+rdx]
add rdx, 1
add ecx, r8d
add eax, ecx
cmp r9, rdx
jne .L3
rep ret
.L4:
xor eax, eax
ret
I have a problem with asm code that works when mixed with C, but does not work when called from asm code with the proper parameters.
;; array - RDI, x- RSI, y- RDX
getValue:
mov r13, rsi
sal r13, $3
mov r14, rdx
sal r14, $2
mov r15, [rdi+r13]
mov rax, [r15+r14]
ret
Technically I want to keep the rdi, rsi and rdx registers untouched, and thus I use other ones.
I am using an x64 machine, so my pointers are 8 bytes. Technically speaking, this code is supposed to do:
int getValue(int** array, int x, int y) {
return array[x][y];
}
It somehow works inside my C code, but does not when used from asm this way:
mov rdi, [rdi] ;; get first pointer - first row
mov r9, $4 ;; we want second element from the row
mov rax, [rdi+r9] ;; get the element (4 bytes vs 8 bytes???)
mov rdi, FMT ;; prepare printf format "%d", 10, 0
mov rsi, rax ;; we want to print the element we just fetched
mov eax, $0 ;; say we have no non-integer argument
call printf ;; always gives 0, no matter what's in the matrix
Can someone see into this and help me? Thanks in advance.
The sal r14, $2 implies the elements are dwords, so the last line before the ret shouldn't load a qword. Besides, x86 has nice scaling addressing modes, so you can do this:
mov rax, [rdi + rsi * 8] ; load pointer to column
mov eax, [rax + rdx * 4] ; note this loads a dword
ret
That implies that you have an array of pointers to columns, which is unusual. You can do that, but was it intended?
This is a standard matrix of integers.
int** array;
sizeof(int*) == 8
sizeof(int) == 4
How I see it: at first I have a pointer to a contiguous block of memory, without gaps, that holds all the row pointers one by one (index by index). So I say "let's go to the rsi-th element of the array", and that's why I scale by rsi * 8 bytes. Then I'm in the same situation again, but the pointer now points to a block of integers, i.e. 4-byte items, so there I scale by 4 bytes.
Is my thinking wrong?
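That mental model can be checked in C (a sketch with a small hypothetical matrix built as an array of row pointers; the helper names are illustrative):

```c
#include <stdlib.h>

/* The C function the asm implements: one 8-byte-scaled load, then a 4-byte-scaled one. */
int getValue(int **array, int x, int y) {
    return array[x][y];
}

/* The same two loads computed by hand, mirroring
   mov rax, [rdi + rsi*8]  ; fetch row pointer, 8-byte stride
   mov eax, [rax + rdx*4]  ; fetch element, 4-byte stride  */
int getValueByHand(int **array, int x, int y) {
    int *row = *(int **)((char *)array + (size_t)x * sizeof(int *));
    return *(int *)((char *)row + (size_t)y * sizeof(int));
}
```

So the two strides are right; the remaining pitfalls are the ones already noted (loading a qword for a 4-byte element, and clobbering callee-saved registers like r13-r15 without preserving them).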