Can't write to memory requested with malloc/calloc in x64 Assembly

This is my first question on this platform. I'm trying to modify the pixels of an image file and copy them to memory requested with calloc. When the code dereferences the pointer to that memory at offset 16360 to write, an "access violation writing location" exception is thrown. Sometimes the offset is slightly higher or lower. The amount of memory requested is correct. When I write equivalent code in C++ with calloc, it works, but not in assembly. I've also tried requesting a larger amount of memory in assembly and raising the heap and stack sizes in the Visual Studio settings, but nothing works for the assembly code. I also had to set the option /LARGEADDRESSAWARE:NO before I could even build and run the program.
I know that the AVX instruction sets would be better suited for this, but the code would contain slightly more lines, so I simplified it for this question. I'm also not a pro; I originally wrote this to practice the AVX instruction set.
Many thanks in advance :)
const uint8_t* getImagePtr(sf::Image** image, const char* imageFilename, uint64_t* imgSize) {
sf::Image* img = new sf::Image;
img->loadFromFile(imageFilename);
sf::Vector2u sz = img->getSize();
*imgSize = uint64_t((sz.x * sz.y) * 4u);
*image = img;
return img->getPixelsPtr();
}
EXTRN getImagePtr:PROC
EXTRN calloc:PROC
.data
imagePixelPtr QWORD 0 ; contains address to source array of 8 bit pixels
imageSize QWORD 0 ; contains size in bytes of the image file
image QWORD 0 ; contains pointer to image object
newImageMemory QWORD 0 ; contains address to destination array
imageFilename BYTE "IMAGE.png", 0 ; name of the file
.code
mainasm PROC
sub rsp, 40
mov rcx, OFFSET image
mov rdx, OFFSET imageFilename
mov r8, OFFSET imageSize
call getImagePtr
mov imagePixelPtr, rax
mov rcx, 1
mov rdx, imageSize
call calloc
add rsp, 40
cmp rax, 0
je done
mov newImageMemory, rax
mov rcx, imageSize
xor eax, eax
mov bl, 20
SomeLoop:
mov dl, BYTE PTR [imagePixelPtr + rax]
add dl, bl
mov BYTE PTR [newImageMemory + rax], dl ; exception when dereferencing and writing to offset 16360
inc rax
loop SomeLoop
done:
ret
mainasm ENDP
END

Let's translate this line back into C:
mov BYTE PTR [newImageMemory + rax], dl ;
In C, this is more or less equivalent to:
*((unsigned char *)&newImageMemory + rax) = dl;
Which is clearly not what you want. It's writing to an offset from the location of newImageMemory, and not to an offset from where newImageMemory points to.
You will need to keep newImageMemory in a register if you want to use it as the base address for an offset.
While we're at it, this line is also wrong, for the same reason:
mov dl, BYTE PTR [imagePixelPtr + rax]
It just happens not to crash.
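The difference between indexing from the pointer variable and indexing from the memory it points to can be reproduced in C. The following is only a minimal sketch; the function name brighten and its parameters are illustrative and do not come from the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds `delta` to every byte of `src` and stores the result in `dst`.
 * Both pointers are held in locals, which is the C analogue of loading
 * the base addresses into registers before indexing with [reg + rax]. */
void brighten(const uint8_t *src, uint8_t *dst, size_t size, uint8_t delta)
{
    for (size_t i = 0; i < size; ++i) {
        /* dst[i] indexes the memory that dst POINTS TO.
         * ((uint8_t *)&dst)[i] would instead index the bytes of the
         * pointer variable itself, which is what
         * [newImageMemory + rax] does in the question. */
        dst[i] = (uint8_t)(src[i] + delta);
    }
}
```

In the assembly, the corresponding change would be to load both pointers into registers once before the loop (for example mov rsi, imagePixelPtr and mov rdi, newImageMemory) and then address [rsi + rax] and [rdi + rax].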

Related

flip an image assembly code

I'm working on a C-based program that uses assembly for image flipping. The pseudocode that is supposed to work is this one (always using images of 240x320):
voltearHorizontal(imgO, imgD){
dirOrig = imgO;
dirDest = imgD;
dirOrig = dirOrig + 239*320; //bring the pointer to the first pixel of the last row
for(f=0; f<240; f++){
for(c=0; c<320; c++){
[dirDest]=[dirOrig];
dirOrig++;
dirDest++;
}
dirOrig=dirOrig+640;//move the pixel to the first one of the upper row
}
}
But when I implement this in assembly, the first rows are not read, leaving that area of the result black.
https://gyazo.com/7a76f147da96ae2bc27e109593ed6df8
This is the code I've written, which is supposed to work, and this is what actually happens to the image:
https://gyazo.com/2e389248d9959a786e736eecd3bf1531
Why are the upper lines of pixels of the original image not read/written to the second image with this code? What part of the code did I get wrong?
I think I have no tags left to put for my problem; thanks for any help that can be given (on where I am wrong). Also, the horizontal flip (the one above is the vertical) simply terminates the program unexpectedly:
https://gyazo.com/a7a18cf10ac3c06fc73a93d9e55be70c

Any special reason why you are writing it in slow assembler?
Why don't you just keep it in fast C++? https://godbolt.org/g/2oIpzt
#include <cstring>
void voltearHorizontal(const unsigned char* imgO, unsigned char* imgD) {
imgO += 239*320; //bring the pointer to the first pixel of the last row
for(unsigned f=0; f<240; ++f) {
memcpy(imgD, imgO, 320);
imgD += 320;
imgO -= 320;
}
}
gcc 6.3 with -O3 compiles this to:
voltearHorizontal(unsigned char const*, unsigned char*):
lea rax, [rdi+76480]
lea r8, [rdi-320]
mov rdx, rsi
.L2:
mov rcx, QWORD PTR [rax]
lea rdi, [rdx+8]
mov rsi, rax
sub rax, 320
and rdi, -8
mov QWORD PTR [rdx], rcx
mov rcx, QWORD PTR [rax+632]
mov QWORD PTR [rdx+312], rcx
mov rcx, rdx
add rdx, 320
sub rcx, rdi
sub rsi, rcx
add ecx, 320
shr ecx, 3
cmp rax, r8
rep movsq
jne .L2
rep ret
I.e. something like 800% more efficient than your inline asm.
Anyway, in your question the problem is:
dirOrig=dirOrig+640;//move the pixel to the first one of the upper row
You need to do -= 640 to return two lines up.
About the inline asm in the screenshots: put it into the question as text. From a quick look, I would simply rewrite it in C++ and leave it to the compiler. You are doing many things in your asm that hurt performance, so I don't see any point in it; besides, inline asm is ugly, hard to maintain, and hard to write correctly.
I did even check the asm in the picture. You keep the line counter in eax, but you use al to copy the pixel, which destroys the line counter value.
Use a debugger next time.
BTW, your pictures are 320x240, not 240x320.
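The -= 640 correction can also be sketched in plain C while keeping the two nested loops of the original pseudocode. This is a minimal sketch assuming one byte per pixel; the name flip_rows and the width/height parameters are illustrative and not part of the question.

```c
#include <stddef.h>

/* Copies a width x height image of 1-byte pixels from src to dst
 * with the order of the rows reversed. */
void flip_rows(const unsigned char *src, unsigned char *dst,
               size_t width, size_t height)
{
    if (width == 0 || height == 0)
        return;

    /* Start at the first pixel of the last row. */
    const unsigned char *orig = src + (height - 1) * width;

    for (size_t f = 0; f < height; ++f) {
        for (size_t c = 0; c < width; ++c)
            *dst++ = *orig++;

        /* orig now sits one full row past the start of the row just
         * copied, so stepping BACK two rows lands on the start of the
         * row above (the "-= 640" for width 320). */
        if (f + 1 < height)
            orig -= 2 * width;
    }
}
```

For the images in the question, the call would be flip_rows(imgO, imgD, 320, 240).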

A few assembly instructions

Could you please help me understand the purpose of the two assembly instructions below? (For more context, the assembly and C code are at the end.) Thanks!
movzx edx,BYTE PTR [edx+0xa]
mov BYTE PTR [eax+0xa],dl
===================================
Assembly code below:
push ebp
mov ebp,esp
and esp,0xfffffff0
sub esp,0x70
mov eax,gs:0x14
mov DWORD PTR [esp+0x6c],eax
xor eax,eax
mov edx,0x8048520
lea eax,[esp+0x8]
mov ecx,DWORD PTR [edx]
mov DWORD PTR [eax],ecx
mov ecx,DWORD PTR [edx+0x4]
mov DWORD PTR [eax+0x4],ecx
movzx ecx,WORD PTR [edx+0x8]
mov WORD PTR [eax+0x8],cx
movzx edx,BYTE PTR [edx+0xa] ; instruction 1
mov BYTE PTR [eax+0xa],dl ; instruction 2
mov edx,DWORD PTR [esp+0x6c]
xor edx,DWORD PTR gs:0x14
je 804844d <main+0x49>
call 8048320 <__stack_chk_fail@plt>
leave
ret
===================================
C source code below (without the library includes):
int main() {
char str_a[100];
strcpy(str_a, "eeeeefffff");
}

The compiler inlined the strcpy() call because the code generator can tell that 11 bytes need to be copied: the string literal "eeeeefffff" has 10 characters, plus one for the zero terminator.
The optimizer unrolled the copy loop into 4 moves, moving 4 + 4 + 2 + 1 bytes. It needs to be done this way because there is no processor instruction that moves 3 bytes. The instructions you are asking about copy the 11th byte. Using movzx is a bit overkill, but it is probably faster than loading only the DL register, since writing the full EDX avoids a dependency on the register's previous contents.
Observe the changes in the generated code when you alter the string. Adding an extra letter should unroll to 3 moves, 4 + 4 + 4. When the string gets too long you ought to see it fall back to something like memmove.
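The 4 + 4 + 2 + 1 split can be written out by hand in C to make the correspondence with the generated moves visible. This is only an illustrative sketch; copy11 is a made-up name, and a real compiler produces this pattern on its own from strcpy.

```c
#include <stdint.h>
#include <string.h>

/* Copies exactly 11 bytes as 4 + 4 + 2 + 1, mirroring the inlined strcpy. */
void copy11(char *dst, const char *src)
{
    uint32_t dword;
    uint16_t word;

    memcpy(&dword, src, 4);     memcpy(dst, &dword, 4);     /* bytes 0..3 */
    memcpy(&dword, src + 4, 4); memcpy(dst + 4, &dword, 4); /* bytes 4..7 */
    memcpy(&word, src + 8, 2);  memcpy(dst + 8, &word, 2);  /* bytes 8..9 */

    /* Byte 10 is the zero terminator; this is the movzx / mov BYTE PTR
     * pair the question asks about. */
    dst[10] = src[10];
}
```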

Why do byte spills occur and what do they achieve?

What is a byte spill?
When I dump the x86 ASM from an LLVM intermediate representation generated from a C program, there are numerous spills, usually of a 4 byte size. I cannot figure out why they occur and what they achieve.
They seem to "cut" pieces of the stack off, but in an unusual way:
## this fragment comes from a C program right before a malloc() call to a struct.
## there are other spills in different circumstances in this same program, so it
## is not related exclusively to malloc()
...
sub ESP, 84
mov EAX, 60
mov DWORD PTR [ESP + 80], 0
mov DWORD PTR [ESP], 60
mov DWORD PTR [ESP + 60], EAX # 4-byte Spill
call malloc
mov ECX, 60
...

A register spill is what happens when there are more live values than registers to hold them: the excess "spills" over, meaning some values must be saved to memory. The instruction is saving the value of EAX, likely because EAX is clobbered by malloc (it holds the return value) and there is no other spare register to save it in (and, for whatever reason, the compiler has decided it needs the constant 60 in the register later).
By the looks of it, the compiler could certainly have omitted mov DWORD PTR [ESP + 60], EAX and instead repeated the mov EAX, 60 where it would otherwise mov EAX, DWORD PTR [ESP + 60] or whatever offset it used, because the saved value of EAX cannot be other than 60 at that point. However, compilation is not guaranteed to be perfectly optimal.
Bear in mind also that after sub ESP, 84, the stack pointer is not adjusted again (except by the call instruction, which of course pushes the return address). The following instructions use ESP as a base address for memory operands, not as a destination.

Array value fetching in asm x64

I have a problem with asm code that works when mixed with C, but does not when used in asm code with proper parameters.
;; array - RDI, x- RSI, y- RDX
getValue:
mov r13, rsi
sal r13, $3
mov r14, rdx
sal r14, $2
mov r15, [rdi+r13]
mov rax, [r15+r14]
ret
Technically I want to keep the rdi, rsi and rdx registers untouched and thus I use other ones.
I am using an x64 machine and thus my pointers have 8 bytes. Technically speaking this code is supposed to do:
int getValue(int** array, int x, int y) {
return array[x][y];
}
it somehow works inside my C code, but does not when used in asm in this way:
mov rdi, [rdi] ;; get first pointer - first row
mov r9, $4 ;; we want second element from the row
mov rax, [rdi+r9] ;; get the element (4 bytes vs 8 bytes???)
mov rdi, FMT ;; prepare printf format "%d", 10, 0
mov rsi, rax ;; we want to print the element we just fetched
mov eax, $0 ;; say we have no non-integer argument
call printf ;; always gives 0, no matter what's in the matrix
Can someone look into this and help me? Thanks in advance.

The sal r14, $2 implies the elements are dwords, so the last line before the ret shouldn't load a qword. Besides, x86 has nice scaling addressing modes, so you can do this:
mov rax, [rdi + rsi * 8] ; load pointer to column
mov eax, [rax + rdx * 4] ; note this loads a dword
ret
That implies that you have an array of pointers to columns, which is unusual. You can do that, but was it intended?
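The two scaled loads can be mirrored in C with explicit byte offsets, which makes the 8-byte versus 4-byte element sizes visible. This is a sketch assuming a 64-bit target where sizeof(int *) is 8 and sizeof(int) is 4; get_value is an illustrative name.

```c
#include <stdint.h>
#include <string.h>

/* Returns array[x][y] using explicit byte arithmetic. */
int get_value(int **array, int64_t x, int64_t y)
{
    int *row;
    int value;

    /* mov rax, [rdi + rsi * 8]: load a pointer-sized element. */
    memcpy(&row, (const char *)array + x * (int64_t)sizeof(int *), sizeof row);

    /* mov eax, [rax + rdx * 4]: load only a dword, not a qword. */
    memcpy(&value, (const char *)row + y * (int64_t)sizeof(int), sizeof value);

    return value;
}
```

Loading 8 bytes in the second step would pull the neighbouring element into the upper half of the result, which is why the final load has to go into eax rather than rax.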

This is a standard matrix of integers.
int** array;
sizeof(int*) == 8
sizeof(int) == 4
The way I see it, the array is first a pointer to a contiguous block of memory (no "blanks") that holds all the row pointers one after another, index by index. So I say "let's go to the rsi-th element of the array", and that's why I offset by rsi * 8 bytes. Then I have the same situation again, except that the pointer now points to a block of integers, i.e. 4-byte items. That's why I scale by 4 bytes there.
Is my thinking wrong?

How Do I Access Thread Local Storage From ml64.exe (MSVC 64-bit X64 Assembler)?

The following C function attempts to prevent recursion in multicore code in a thread-safe manner using a thread local storage variable. However, for reasons that are somewhat complicated, I NEED to write this function in X64 assembler (Intel X86 / AMD 64-bit) and assemble it with ml64.exe from VC2010. I know how to do this if I'm using global variables but I'm not sure how to do it properly with a TLS variable that has __declspec(thread).
__declspec(thread) int tls_VAR = 0;
void norecurse( )
{
if(0==tls_VAR)
{
tls_VAR=1;
DoWork();
tls_VAR=0;
}
}
Note: This is what VC2010 kicks out for the function. However, MASM (ml64.exe) doesn't support the gs:88 or OFFSET FLAT: parts of the code.
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
include listing.inc
INCLUDELIB MSVCRTD
INCLUDELIB OLDNAMES
PUBLIC norecurse
EXTRN DoWork:PROC
EXTRN tls_VAR:DWORD
EXTRN _tls_index:DWORD
pdata SEGMENT
$pdata$norecurse DD imagerel $LN4
DD imagerel $LN4+70
DD imagerel $unwind$norecurse
pdata ENDS
xdata SEGMENT
$unwind$norecurse DD 040a01H
DD 06340aH
DD 07006320aH
; Function compile flags: /Ogtpy
xdata ENDS
_TEXT SEGMENT
norecurse PROC
; File p:\hackytests\64bittest2010\64bittest\64bittest.cpp
; Line 19
$LN4:
mov QWORD PTR [rsp+8], rbx
push rdi
sub rsp, 32 ; 00000020H
; Line 20
mov ecx, DWORD PTR _tls_index
mov rax, QWORD PTR gs:88
mov edi, OFFSET FLAT:tls_VAR
mov rbx, QWORD PTR [rax+rcx*8]
cmp DWORD PTR [rbx+rdi], 0
jne SHORT $LN1@norecurse
; Line 22
mov DWORD PTR [rbx+rdi], 1
; Line 23
call DoWork
; Line 24
mov DWORD PTR [rbx+rdi], 0
$LN1@norecurse:
; Line 26
mov rbx, QWORD PTR [rsp+48]
add rsp, 32 ; 00000020H
pop rdi
ret 0
norecurse ENDP
_TEXT ENDS
END

As your answer indicates, the problem comes down to finding the MASM equivalents of the following two lines in the assembly listing generated by Microsoft's C++ compiler:
mov rax, QWORD PTR gs:88
mov edi, OFFSET FLAT:tls_VAR
The first line is easy. Just replace gs:88 with gs:[88].
The second line is less obvious. The OFFSET FLAT: operator is a red herring. It means use the offset relative to the beginning of the "FLAT" segment. With the 32-bit version of MASM, the FLAT segment is the segment that includes the entire 4G address space. This is the segment that's used for both the code and data segments as part of the 32-bit flat memory model. The 64-bit version of MASM doesn't support memory models; it essentially always assumes a 64-bit version of the flat memory model, so it doesn't support the FLAT keyword. As a result, the plain OFFSET operator ends up meaning the same thing. (In fact, with the 32-bit assembler, plain OFFSET also normally means the same thing because PECOFF only supports the flat memory model.)
However, using OFFSET here won't work. That's because it would use the offset of the address of tls_VAR in memory relative to address 0. In other words, it would use the absolute address of tls_VAR in memory. What's needed here is the offset relative to the beginning of the TLS data section.
So the compiler must be doing something special here. In order to find out, I dumped the relocations in the object file generated while compiling your example C code:
> dumpbin /relocations t215a.obj
...
RELOCATIONS #4
Symbol Symbol
Offset Type Applied To Index Name
-------- ---------------- ----------------- -------- ------
00000008 REL32 00000000 14 _tls_index
00000016 SECREL 00000000 8 tls_VAR
0000002D REL32 00000000 C DoWork
...
As you can see, it generates a relocation of type SECREL for the reference to tls_VAR. This makes the relocation relative to the base of the section in the generated executable in which the symbol appears. In this case that's the .tls section, so this relocation generates an offset relative to the beginning of the section used for static TLS data.
So now the question becomes how to get MASM to generate the same SECREL relocation the compiler emits. This turns out to have an easy solution as well: just replace OFFSET FLAT: with SECTIONREL.
So with these changes (and a bit of optimization) your function becomes:
EXTERN tls_VAR:DWORD
EXTERN _tls_index:DWORD
EXTERN DoWork:PROC
PUBLIC norecurse
_TEXT SEGMENT
norecurse PROC
push rbx
sub rsp, 32
mov rax, gs:[88]
mov ecx, _tls_index
mov rbx, [rax + rcx * 8]
cmp DWORD PTR [rbx + SECTIONREL tls_VAR], 0
jne return
mov DWORD PTR [rbx + SECTIONREL tls_VAR], 1
call DoWork
mov DWORD PTR [rbx + SECTIONREL tls_VAR], 0
return:
add rsp, 32
pop rbx
ret
norecurse ENDP
_TEXT ENDS
END
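The address computation performed by the function above can be modelled in portable C to show why a section-relative offset is needed rather than an absolute address. This is only a conceptual sketch: tls_blocks, tls_index, and secrel are hypothetical stand-ins for the per-thread pointer array read from gs:[88], the value of _tls_index, and the value produced by SECTIONREL tls_VAR.

```c
#include <stddef.h>
#include <stdint.h>

/* Models:
 *     mov rax, gs:[88]            ; per-thread array of TLS block pointers
 *     mov rbx, [rax + rcx * 8]    ; this module's TLS block for the thread
 *     ... [rbx + SECTIONREL tls_VAR]
 * All three parameters are hypothetical stand-ins (see the text above). */
int *tls_var_address(void **tls_blocks, uint32_t tls_index, size_t secrel)
{
    char *block = (char *)tls_blocks[tls_index];
    return (int *)(block + secrel);
}
```

Every thread has its own block, so the same offset yields a different address per thread; an absolute address, which is what plain OFFSET produces, could only ever refer to one fixed location.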

I was able to hack around the issue. My implementation in assembler is less efficient than the code generated by the C compiler, though, because I was not able to figure out how to use the following two addressing modes:
mov rax, QWORD PTR gs:88
mov edi, OFFSET FLAT:tls_VAR
For (1), I had to load 88 into rax and use gs:[rax] to access the TLS base for the thread.
For (2), the lack of OFFSET FLAT in MASM (ml64.exe) meant that I had to be more clever. I computed an offset by subtracting _tls_start from the thread's TLS base; that offset can then be applied to TLS variables in assembler to access their thread-local values.
PUBLIC norecurse
EXTRN _tls_index:DWORD
EXTRN _tls_start:DWORD
EXTRN tls_VAR:DWORD
EXTRN DoWork:PROC
_TEXT SEGMENT
norecurse PROC
; non-volatile
push rbx
sub rsp,32
; The gs segment register refers to the base address of the TEB on x64.
; 88 (0x58) is the offset in the TEB for the ThreadLocalStoragePointer member on x64
mov rax,88
mov edx, DWORD PTR _tls_index
mov rax, gs:[rax]
mov r11, QWORD PTR [rax+rdx*8]
lea r10, _tls_start
; r11 will be the offset-adjusted TLS base
sub r11, r10
; rbx will be the thread-local address of tls_VAR
lea rdx, tls_VAR
lea rbx,[r11+rdx]
cmp DWORD PTR [rbx], 0
jne @F
mov DWORD PTR [rbx], 1
call DoWork
mov DWORD PTR [rbx], 0
@@:
add rsp,32
pop rbx
ret
norecurse ENDP
_TEXT ENDS
END
I'd love to see a more efficient method, or pointers on how to actually use the two addressing modes I couldn't figure out with MASM (ml64.exe), though.

Check out TlsGetValue, TlsSetValue, and friends.
