INT 13 Extension Read in C

I can use the extended read functions of BIOS INT 13h from assembly without trouble, using the code below:
; *************************************************************************
; Setup DISK ADDRESS PACKET
; *************************************************************************
jmp strtRead
DAPACK :
db 010h ; Packet Size
db 0 ; Always 0
blkcnt:
dw 1 ; Sectors Count
db_add :
dw 07e00h ; Transfer Offset
dw 0 ; Transfer Segment
d_lba :
dd 1 ; Starting LBA(0 - n)
dd 0 ; Bios 48 bit LBA
; *************************************************************************
; Start Reading Sectors using INT13 Func 42
; *************************************************************************
strtRead:
mov si, OFFSET DAPACK; Load DPACK offset to SI
mov ah, 042h ; Function 42h
mov dl, 080h ; Drive ID
int 013h; Call INT13h
I want to convert this into a C-callable function, but I have no idea how to pass the parameters (drive ID, sector count, buffer segment:offset, and so on) from C to assembly.
I am using MSVC and MASM, and working with nothing except BIOS functions.
Can anyone help?
Update:
I have tried the function below, but nothing is ever loaded into the buffer:
void read_sector()
{
static unsigned char currentMBR[512] = { 0 };
struct disk_packet //needed for int13 42h
{
byte size_pack; //size of packet must be 16 or 16+
byte reserved1; //reserved
byte no_of_blocks; //nof blocks for transfer
byte reserved2; //reserved
word offset; //offset address
word segment; //segment address
dword lba1;
dword lba2;
} disk_pack;
disk_pack.size_pack = 16; //set size to 16
disk_pack.no_of_blocks = 1; //1 block ie read one sector
disk_pack.reserved1 = 0; //reserved word
disk_pack.reserved2 = 0; //reserved word
disk_pack.segment = 0; //segment of buffer
disk_pack.offset = (word)&currentMBR[0]; //offset of buffer
disk_pack.lba1 = 0; //lba first 32 bits
disk_pack.lba2 = 0; //last 32 bit address
_asm
{
mov dl, 080h;
mov[disk_pack.segment], ds;
mov si, disk_pack;
mov ah, 42h;
int 13h
; jc NoError; //No error, ignore error code
; mov bError, ah; // Error, get the error code
NoError:
}
}

Sorry to post this as "answer"; I want to post this as "comment" but it is too long...
Different compilers have a different syntax of inline assembly. This means that the correct syntax of the following lines:
mov[disk_pack.segment], ds;
mov si, disk_pack;
... depends on the compiler used. Unfortunately I do not use 16-bit C compilers, so I cannot help you on this point.
The next thing I see in your program is the following one:
disk_pack.segment = 0; //segment of buffer
disk_pack.offset = (word)&currentMBR[0]; //offset of buffer
With a 99% chance this will lead to a problem. Instead I would do the following:
struct disk_packet //needed for int13 42h
{
byte size_pack;
byte reserved;
word no_of_blocks; // note that this is 16-bit!
void far *data; // <- This line is the main change!
dword lba1;
dword lba2;
} disk_pack;
...
disk_pack.size_pack = 16;
disk_pack.no_of_blocks = 1;
disk_pack.reserved = 0;
disk_pack.data = &currentMBR[0]; // also note the change here
disk_pack.lba1 = 0;
disk_pack.lba2 = 0;
...
Note that some compilers name the keyword "_far" or "__far" instead of "far".
A third problem is that some (buggy) BIOSes require ES to be equal to the segment value from disk_pack, and a fourth is that many compilers require inline assembly code to preserve most registers (modifying AX, CX and DX is normally OK).
These two could be solved the following way:
push ds;
push es;
push si;
mov dl, 080h;
// TODO here: Set ds:si to disk_pack in a compiler-specific way
mov es,[si+6];
mov ah, 42h;
int 13h;
...
pop si;
pop es;
pop ds;
In my opinion the "#pragma pack" should not be necessary because all elements in the structure are properly aligned.
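The layout argument above can be checked at compile time. The following is a minimal sketch in portable C, assuming fixed-width types from stdint.h; the names `struct dap` and `make_dap` are hypothetical and not part of any BIOS or compiler API. It only builds the packet (it does not issue INT 13h) and splits a linear address below 1 MiB into a real-mode segment:offset pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Disk Address Packet for INT 13h AH=42h, laid out as in the answer above. */
struct dap {
    uint8_t  size;      /* packet size, 16 */
    uint8_t  reserved;  /* always 0 */
    uint16_t count;     /* number of sectors to transfer (16-bit field) */
    uint16_t offset;    /* transfer buffer offset */
    uint16_t segment;   /* transfer buffer segment */
    uint32_t lba_lo;    /* starting LBA, low 32 bits */
    uint32_t lba_hi;    /* starting LBA, high 32 bits */
};

/* Every field is naturally aligned, so no packing pragma should be needed. */
_Static_assert(sizeof(struct dap) == 16, "DAP must be 16 bytes");
_Static_assert(offsetof(struct dap, offset) == 4, "buffer pointer at byte 4");
_Static_assert(offsetof(struct dap, lba_lo) == 8, "LBA at byte 8");

/* Hypothetical helper: fill a packet for a buffer at a linear address
   below 1 MiB, normalised so that the offset is in the range 0..15. */
static struct dap make_dap(uint32_t linear, uint16_t sectors, uint32_t lba)
{
    struct dap p = {0};
    p.size    = 16;
    p.count   = sectors;
    p.offset  = (uint16_t)(linear & 0xFu);
    p.segment = (uint16_t)(linear >> 4);
    p.lba_lo  = lba;
    p.lba_hi  = 0;
    return p;
}
```

For the buffer used in the assembly version, `make_dap(0x7E00, 1, 1)` yields segment 0x07E0 and offset 0, which addresses the same memory as 0000:7E00.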

Related

Get EDID info in C (UEFI): read the ES:DI register?

I am developing an OS and I want to get the EDID from the monitor. I found some asm code (https://wiki.osdev.org/EDID) that returns the EDID in the ES:DI registers:
mov ax, 0x4f15
mov bl, 0x01
xor cx, cx
xor dx, dx
int 0x10
;AL = 0x4F if function supported
;AH = status (0 is success, 1 is fail)
;ES:DI contains the EDID
How can I get the AL, AH, and ES:DI values in a C file?
Actually, I am developing a 64-bit UEFI OS:
LoadGDT:
lgdt [rdi]
mov ax, 0x10
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax
pop rdi
mov rax, 0x08
push rax
push rdi
retfq
GLOBAL LoadGDT
I am able to run the asm code above and call it from C using global functions.
That page on osdev.org contains code intended to be run when the CPU is in 16-bit real mode.
You can tell not only from the registers involved but also from the fact that int 10h is used.
This is a well-known BIOS interrupt service that is written in 16-bit real-mode code.
If you target UEFI, then your bootloader is actually a UEFI application, which is a PE32(+) image.
If the CPU is 64-bit capable, the firmware will switch into long mode (64-bit mode) and load your bootloader.
Otherwise, it will switch into protected mode (32-bit mode).
In any case, real mode is never used in UEFI.
You can call 16-bit code from protected/long mode with the use of a 16-bit code segment in the GDT/LDT, but you cannot call real-mode code (i.e. code written to work with the real-mode segmentation) because segmentation works completely differently between the modes.
Plus, in real mode the interrupts are dispatched through the IVT and not the IDT, so you would need to get the original entry point for interrupt 10h.
UEFI protocol EFI_EDID_DISCOVERED_PROTOCOL
Luckily, UEFI has a replacement for most basic services offered by the legacy BIOS interface.
In this case, you can use the EFI_EDID_DISCOVERED_PROTOCOL and eventually apply any override from the platform firmware with the use of EFI_EDID_OVERRIDE_PROTOCOL.
The EFI_EDID_DISCOVERED_PROTOCOL is straightforward to use: it's just a (Size, Data) pair.
typedef struct _EFI_EDID_DISCOVERED_PROTOCOL {
UINT32 SizeOfEdid;
UINT8 *Edid;
} EFI_EDID_DISCOVERED_PROTOCOL;
(from gnu-efi)
The format of the buffer Edid can be found in the VESA specification or even on Wikipedia.
As an example, I wrote a simple UEFI application with gnu-efi and x86_64-w64-mingw32 (a version of GCC and tools that target PEs).
I avoided using efilib.h in order to use gnu-efi just for the definition of the structures related to UEFI.
The code sucks: it assumes at most 10 handles support the EDID protocol, and I wrote only a partial structure for the EDID data (because I got bored).
But this should be enough to get the idea.
NOTE: my VM didn't return any EDID information, so the code is not completely tested!
#include <efi.h>
//You are better off using this lib
//#include <efilib.h>
EFI_GUID gEfiEdidDiscoveredProtocolGuid = EFI_EDID_DISCOVERED_PROTOCOL_GUID;
EFI_SYSTEM_TABLE* gST = NULL;
typedef struct _EDID14 {
UINT8 Signature[8];
UINT16 ManufacturerID;
UINT16 ManufacturerCode;
UINT32 Serial;
UINT8 Week;
UINT8 Year;
UINT8 Major;
UINT8 Minor;
UINT8 InputParams; // video input parameters are a single byte (offset 20) in the EDID layout
UINT8 HSize;
UINT8 VSize;
UINT8 Gamma;
//...Omitted...
} EDID14_RAW;
VOID Print(CHAR16* string)
{
gST->ConOut->OutputString(gST->ConOut, string);
}
VOID PrintHex(UINT64 number)
{
CHAR16* digits = L"0123456789abcdef";
CHAR16 buffer[2] = {0, 0};
for (INTN i = 64-4; i >= 0; i-=4)
{
buffer[0] = digits[(number >> i) & 0xf];
Print(buffer);
}
}
VOID PrintDec(UINT64 number)
{
CHAR16 buffer[21] = {0};
UINTN i = 19;
do
{
buffer[i--] = L'0' + (number % 10);
number = number / 10;
}
while (number && i >= 0);
Print(buffer + i + 1);
}
#define MANUFACTURER_DECODE_LETTER(x) ( L'A' + ( (x) & 0x1f ) - 1 )
EFI_STATUS efi_main(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE* SystemTable)
{
EFI_STATUS Status = EFI_SUCCESS;
EFI_HANDLE EDIDHandles[10];
UINTN Size = sizeof(EFI_HANDLE) * 10;
EFI_EDID_DISCOVERED_PROTOCOL* EDID;
gST = SystemTable;
if ( EFI_ERROR( (Status = SystemTable->BootServices->LocateHandle(ByProtocol, &gEfiEdidDiscoveredProtocolGuid, NULL, &Size, EDIDHandles)) ) )
{
Print(L"Failed to get EDID handles: "); PrintHex(Status); Print(L"\r\n");
return Status;
}
for (INTN i = 0; i < Size/sizeof(EFI_HANDLE); i++)
{
if (EFI_ERROR( (Status = SystemTable->BootServices->OpenProtocol(
EDIDHandles[i], &gEfiEdidDiscoveredProtocolGuid, (VOID**)&EDID, ImageHandle, NULL, EFI_OPEN_PROTOCOL_GET_PROTOCOL)) ) )
{
Print(L"Failed to get EDID info for handle "); PrintDec(i); Print(L": "); PrintHex(Status); Print(L"\r\n");
return Status;
}
if (EDID->SizeOfEdid == 0 || EDID->Edid == NULL)
{
Print(L"No EDID data for handle "); PrintDec(i); Print(L"\r\n");
continue;
}
/*
THIS CODE IS NOT TESTED!
! ! ! D O N O T U S E ! ! !
*/
EDID14_RAW* EdidData = (EDID14_RAW*)EDID->Edid;
// The manufacturer ID is stored big-endian in the EDID block, so swap the bytes first
UINT16 Id = (UINT16)((EdidData->ManufacturerID >> 8) | (EdidData->ManufacturerID << 8));
CHAR16 Manufacturer[4] = {0};
Manufacturer[0] = MANUFACTURER_DECODE_LETTER(Id >> 10);
Manufacturer[1] = MANUFACTURER_DECODE_LETTER(Id >> 5);
Manufacturer[2] = MANUFACTURER_DECODE_LETTER(Id);
Print(L"Manufacturer ID: "); Print(Manufacturer); Print(L"\r\n");
// HSize and VSize are the physical screen size in centimetres, not a pixel resolution
Print(L"Screen size (cm): "); PrintDec(EdidData->HSize); Print(L"X"); PrintDec(EdidData->VSize); Print(L"\r\n");
}
return Status;
}
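Once the Edid buffer is in hand, decoding it is independent of UEFI. The following is a minimal sketch in plain C that works on the raw bytes rather than on a struct, which avoids padding and byte-order surprises. The function names are hypothetical; the checks rely on the base EDID block being 128 bytes with a fixed 8-byte header, a checksum byte that makes all 128 bytes sum to zero modulo 256, and a big-endian manufacturer ID at bytes 8 and 9.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return 1 if the first 128 bytes look like a valid base EDID block. */
static int edid_is_valid(const uint8_t *edid, size_t len)
{
    static const uint8_t header[8] =
        { 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00 };
    uint8_t sum = 0;

    if (len < 128 || memcmp(edid, header, sizeof header) != 0)
        return 0;
    for (size_t i = 0; i < 128; i++)
        sum = (uint8_t)(sum + edid[i]);
    return sum == 0;
}

/* Decode the three-letter manufacturer ID into out (NUL-terminated).
   Each letter is a 5-bit value where 1 means 'A'. */
static void edid_manufacturer(const uint8_t *edid, char out[4])
{
    uint16_t id = (uint16_t)((edid[8] << 8) | edid[9]);

    out[0] = (char)('A' + ((id >> 10) & 0x1F) - 1);
    out[1] = (char)('A' + ((id >> 5) & 0x1F) - 1);
    out[2] = (char)('A' + (id & 0x1F) - 1);
    out[3] = '\0';
}
```

In the UEFI application above this would be called with `EDID->Edid` and `EDID->SizeOfEdid` after the NULL and size checks.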
ACPI
If you don't want to use these UEFI protocols you can use ACPI. Each display output device has a _DDC method that is documented in the ACPI specification and can be used to return the EDID data (either as a buffer of 128 or 256 bytes).
This method is conceptually simple but in practice it requires writing a full-blown ACPI parser (including the AML VM) which is a lot of work.
However, ACPI is necessary for modern OSes and so you can use it, later on, to get the EDID data without having to worry about UEFI protocols.

Assembly, draw an image

I need to draw a QR code using assembly (Intel syntax) and C (C99) in DOS, but it seems I have too little memory for it.
I tried to store the image as a bit array:
image
db 11111110b,
...
But I still got no result (Illegal read from 9f208c70, CS:IP 192:9f20734f), and now I don't know what to do. Here is the code I used:
module.asm:
[bits 16]
global setpixel
global setVM
global getch
global getPixelBlock
section .text
setVM:
push bp
mov bp, sp
mov ax, [bp+6]
mov ah, 0
int 10h
pop bp
ret
setpixel:
push bp
mov bp,sp
xor bx, bx
mov cx, [bp+6]
mov dx, [bp+10]
mov al, [bp+14]
mov ah, 0ch
int 10h
pop bp
ret
getch:
push bp
mov ah,0
int 16h
mov ah,0
pop bp
ret
getPixelBlock:
push bp
mov cx, [bp+6]
mov bx, image
add bx, cx
mov ax, [bx]
pop bp
ret
section .data
image
db 11111110b,
db 10011011b,
db 11111100b,
db 00010011b,
db 00010000b,
db 01101110b,
db 10110000b,
db 10111011b,
db 01110101b,
db 01100101b,
db 11011011b,
db 10100000b,
db 00101110b,
db 11000001b,
db 01110001b,
db 00000111b,
db 11111010b,
db 10101111b,
db 11100000b,
db 00011000b,
db 00000000b,
db 11010011b,
db 01011111b,
db 01101011b,
db 11100100b,
db 11101000b,
db 00110101b,
db 11001111b,
db 01001111b,
db 11100000b,
db 00011011b,
db 11010001b,
db 00100111b,
db 00000011b,
db 10000000b,
db 00000011b,
db 10001111b,
db 11111010b,
db 00100000b,
db 01010000b,
db 01000110b,
db 01011011b,
db 10111010b,
db 01001111b,
db 01010101b,
db 11010110b,
db 10001110b,
db 00101110b,
db 10010001b,
db 01111011b,
db 00000101b,
db 01100001b,
db 10001111b,
db 11101110b,
db 11000001b
main.c:
__asm(".code16gcc\n");
int run();
int _start()
{
return run();
} // Dirty hack to code as I used to
#include "ASM.inl"
#include "Painter.inl"
int run()
{
setVM(0x10);
_brushSize = 5;
drawLogo(30,30);
uint ret = (uint)getch();
return ret>>5;
}
ASM.inl
#ifndef __ASM_H__
#define __ASM_H__
typedef unsigned short int uint;
typedef unsigned char uchar;
void setpixel(uint x, uint y, uint color);
void setVM(uint vm);
uchar getch();
uchar getPixelBlock(uchar);
#endif /* __ASM_H__ */
Painter.inl:
/**
* You can create other colors by using bitwise or
*/
enum Color {
White = 0b0111,
Black = 0b0000,
Red = 0b0100,
Green = 0b0010,
Blue = 0b0001,
Bright = 0b1000,
};
int _brushSize = 5;
void rect(uint x, uint y, uint width, uint height, uint color)
{
uint i,j;
for (i=x; i<width+x; i++) {
for (j=y; j<height+y; j++) {
setpixel(i,j,color);
}
}
}
uint getColor(uchar element, uchar offset)
{
element = element & (1 << offset) >> offset;
return element ? Black : White;
}
void drawLogo(uint x, uint y)
{
uchar current;
uchar counter = 0;
for (uint i=0; i<21; i++) {
for (uint j=0; j<21; j++) {
counter = i*21+j;
current = getPixelBlock((uchar)counter/8);
rect(x+i*_brushSize, y+j*_brushSize, _brushSize, _brushSize, getColor(current, counter%8));
}
}
}
Compilation script:
#!/bin/bash
nasm -f elf32 module.asm -o module.o
gcc -std=c99 -m32 -ffreestanding -masm=intel -c main.c -o main.o
ld -m elf_i386 -Ttext=0x100 main.o module.o -o os.com
objcopy os.com -O binary
GCC version: 4.8.3 (Gentoo 4.8.3 p1.1, pie-0.5.9)
NASM version: 2.11.05
DOSBox version: 0.74
What am I doing wrong? Maybe I should write directly into graphics memory or something like that? Or maybe I should change the GCC optimization settings?
The assembly code looks generally alright. You might want to check the interrupt calling sequences against the order of parameters on the stack by setting a breakpoint right on the int 10h and checking the register values. I haven't done that stuff for well over 20 years, and I'm rusty.
You have at least two probable operator precedence problems. I don't think these do the right thing.
element = element & (1 << offset) >> offset;
current = getPixelBlock((uchar)counter/8);
You have a hard-coded 'magic number': 21. I have no idea what that means.
After that, the question is: where did it crash? Time to get that debugger stoked up and paying for itself.
I meant to ask: why on earth write this stuff in assembly? You can easily call int 10h either directly from C, from embedded asm in C, or by a single wrapper function.
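The two precedence problems mentioned above can be made concrete. In C, shifts bind tighter than bitwise AND, so `element & (1 << offset) >> offset` parses as `element & ((1 << offset) >> offset)`, which is just `element & 1`. A cast binds tighter than division, so `(uchar)counter/8` casts first and divides afterwards; and since counter is declared uchar in drawLogo, it wraps at 256 although the 21x21 grid has 441 cells. A minimal sketch of helpers that avoid both issues (the names are hypothetical, and the bit order within each byte is left as in the question):

```c
typedef unsigned char uchar;

/* Return 1 if bit `offset` (0 = least significant) of element is set. */
static unsigned bit_is_set(uchar element, unsigned offset)
{
    return (element >> offset) & 1u;
}

/* Index of the byte holding cell (row, col) in a bit-packed grid.
   The arithmetic is done in unsigned int, so it cannot wrap at 256. */
static unsigned block_index(unsigned row, unsigned col, unsigned width)
{
    return (row * width + col) / 8u;
}

/* Position of cell (row, col) within its byte. */
static unsigned bit_offset(unsigned row, unsigned col, unsigned width)
{
    return (row * width + col) % 8u;
}
```

For the last cell of a 21x21 grid, `block_index(20, 20, 21)` is 55, so the table needs 56 bytes to cover every cell.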
The way you define your data with a trailing comma introduces an extra byte with zero value. At least in my assembler!
I think you need to double the value of CURRENT in the DRAWLOGO function in order to synchronize with the data.
The function GETPIXELBLOCK receives values from 0 to 55, which is 1 more than the data lines available!

Cache size estimation on your system?

I got this program from this link (https://gist.github.com/jiewmeng/3787223). I have been searching the web to gain a better understanding of processor caches (L1 and L2). I want to write a program that lets me guess the size of the L1 and L2 caches on my new laptop (just for learning purposes; I know I could check the spec).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define KB 1024
#define MB 1024 * 1024
int main() {
unsigned int steps = 256 * 1024 * 1024;
static int arr[4 * 1024 * 1024];
int lengthMod;
unsigned int i;
double timeTaken;
clock_t start;
int sizes[] = {
1 * KB, 4 * KB, 8 * KB, 16 * KB, 32 * KB, 64 * KB, 128 * KB, 256 * KB,
512 * KB, 1 * MB, 1.5 * MB, 2 * MB, 2.5 * MB, 3 * MB, 3.5 * MB, 4 * MB
};
int results[sizeof(sizes)/sizeof(int)];
int s;
/*for each size to test for ... */
for (s = 0; s < sizeof(sizes)/sizeof(int); s++)
{
lengthMod = sizes[s] - 1;
start = clock();
for (i = 0; i < steps; i++)
{
arr[(i * 16) & lengthMod] *= 10;
arr[(i * 16) & lengthMod] /= 10;
}
timeTaken = (double)(clock() - start)/CLOCKS_PER_SEC;
printf("%d, %.8f \n", sizes[s] / 1024, timeTaken);
}
return 0;
}
The output of the program on my machine is as follows. How do I interpret the numbers? What does this program tell me?
1, 1.07000000
4, 1.04000000
8, 1.06000000
16, 1.13000000
32, 1.14000000
64, 1.17000000
128, 1.20000000
256, 1.21000000
512, 1.19000000
1024, 1.23000000
1536, 1.23000000
2048, 1.46000000
2560, 1.21000000
3072, 1.45000000
3584, 1.47000000
4096, 1.94000000
You need direct access to memory.
I do not mean DMA transfer by this. Memory must of course be accessed by the CPU (otherwise you are not measuring the caches), but as directly as possible. Measurements will probably not be very accurate on Windows/Linux, because services and other processes can disturb the caches at runtime. Measure many times and average for better results (or take the fastest time, or filter the samples). For best accuracy use DOS and asm, for example:
rep + movsb,movsw,movsd
rep + stosb,stosw,stosd
so that you measure the memory transfer and not something else, as in your code.
Measure the raw transfer times and plot a graph:
x axis is the transfer block size
y axis is the transfer speed
Zones with the same transfer rate correspond to the same cache layer.
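The measurement idea above can be sketched in portable C before reaching for asm. This is only an approximation under stated assumptions: it uses memcpy instead of rep movsd, clock() instead of a high-resolution timer, and the function name `copy_rate_mib_s` is hypothetical. An optimizing compiler or a noisy system can distort the numbers, so treat the output as a rough curve rather than a precise result.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy a block of `block` bytes repeatedly until roughly `total` bytes
   have been moved, and return the observed rate in MiB/s.
   Returns 0.0 if block is 0 or memory cannot be allocated. */
static double copy_rate_mib_s(size_t block, size_t total)
{
    unsigned char *src, *dst;
    volatile unsigned char sink = 0;
    size_t reps, r;
    clock_t start;
    double secs;

    if (block == 0)
        return 0.0;
    src = malloc(block);
    dst = malloc(block);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return 0.0;
    }
    memset(src, 1, block);
    memset(dst, 0, block);

    reps = total / block;
    if (reps < 4)
        reps = 4;                      /* always repeat a few times */

    start = clock();
    for (r = 0; r < reps; r++) {
        src[0] = (unsigned char)r;     /* change the input every pass */
        memcpy(dst, src, block);
        sink ^= dst[0];                /* keep the copy observable */
    }
    secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (secs < 1e-9)
        secs = 1e-9;                   /* avoid division by zero */

    free(src);
    free(dst);
    return (double)block * (double)reps / secs / (1024.0 * 1024.0);
}
```

Calling this for a range of block sizes (for example 1 KiB up to 32 MiB) and plotting rate against block size should show plateaus, with drops near the points where the working set stops fitting in a cache level.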
[Edit1] I could not find my old source code for this, so I put something together just now in C++ for Windows:
Time measurement:
//---------------------------------------------------------------------------
double performance_Tms=-1.0, // counter period [ms]
performance_tms= 0.0; // measured time [ms]
//---------------------------------------------------------------------------
void tbeg()
{
LARGE_INTEGER i;
if (performance_Tms<=0.0) { QueryPerformanceFrequency(&i); performance_Tms=1000.0/double(i.QuadPart); }
QueryPerformanceCounter(&i); performance_tms=double(i.QuadPart);
}
//---------------------------------------------------------------------------
double tend()
{
LARGE_INTEGER i;
QueryPerformanceCounter(&i); performance_tms=double(i.QuadPart)-performance_tms; performance_tms*=performance_Tms;
return performance_tms;
}
//---------------------------------------------------------------------------
Benchmark (32bit app):
//---------------------------------------------------------------------------
DWORD sizes[]= // used transfer block sizes
{
1<<10, 2<<10, 3<<10, 4<<10, 5<<10, 6<<10, 7<<10, 8<<10, 9<<10,
10<<10, 11<<10, 12<<10, 13<<10, 14<<10, 15<<10, 16<<10, 17<<10, 18<<10,
19<<10, 20<<10, 21<<10, 22<<10, 23<<10, 24<<10, 25<<10, 26<<10, 27<<10,
28<<10, 29<<10, 30<<10, 31<<10, 32<<10, 48<<10, 64<<10, 80<<10, 96<<10,
112<<10,128<<10,192<<10,256<<10,320<<10,384<<10,448<<10,512<<10, 1<<20,
2<<20, 3<<20, 4<<20, 5<<20, 6<<20, 7<<20, 8<<20, 9<<20, 10<<20,
11<<20, 12<<20, 13<<20, 14<<20, 15<<20, 16<<20, 17<<20, 18<<20, 19<<20,
20<<20, 21<<20, 22<<20, 23<<20, 24<<20, 25<<20, 26<<20, 27<<20, 28<<20,
29<<20, 30<<20, 31<<20, 32<<20,
};
const int N=sizeof(sizes)>>2; // number of used sizes
double pmovsd[N]; // measured transfer rate rep MOVSD [MB/sec]
double pstosd[N]; // measured transfer rate rep STOSD [MB/sec]
//---------------------------------------------------------------------------
void measure()
{
int i;
BYTE *dat; // pointer to used memory
DWORD adr,siz,num; // local variables for asm
double t,t0;
HANDLE hnd; // process handle
// enable priority change (huge difference)
#define measure_priority
// enable critical sections (no difference)
// #define measure_lock
for (i=0;i<N;i++) pmovsd[i]=0.0;
for (i=0;i<N;i++) pstosd[i]=0.0;
dat=new BYTE[sizes[N-1]+4]; // last DWORD +4 Bytes (should be 3 but i like 4 more)
if (dat==NULL) return;
#ifdef measure_priority
hnd=GetCurrentProcess(); if (hnd!=NULL) { SetPriorityClass(hnd,REALTIME_PRIORITY_CLASS); CloseHandle(hnd); }
Sleep(200); // wait to change take effect
#endif
#ifdef measure_lock
CRITICAL_SECTION lock; // lock handle
InitializeCriticalSectionAndSpinCount(&lock,0x00000400);
EnterCriticalSection(&lock);
#endif
adr=(DWORD)(dat);
for (i=0;i<N;i++)
{
siz=sizes[i]; // siz = actual block size
num=(8<<20)/siz; // compute n (times to repeat the measurement)
if (num<4) num=4;
siz>>=2; // size / 4 because of 32bit transfer
// measure overhead
tbeg(); // start time measurement
asm {
push esi
push edi
push ecx
push ebx
push eax
mov ebx,num
mov al,0
loop0: mov esi,adr
mov edi,adr
mov ecx,siz
// rep movsd // es,ds already set by C++
// rep stosd // es already set by C++
dec ebx
jnz loop0
pop eax
pop ebx
pop ecx
pop edi
pop esi
}
t0=tend(); // stop time measurement
// measurement 1
tbeg(); // start time measurement
asm {
push esi
push edi
push ecx
push ebx
push eax
mov ebx,num
mov al,0
loop1: mov esi,adr
mov edi,adr
mov ecx,siz
rep movsd // es,ds already set by C++
// rep stosd // es already set by C++
dec ebx
jnz loop1
pop eax
pop ebx
pop ecx
pop edi
pop esi
}
t=tend(); // stop time measurement
t-=t0; if (t<1e-6) t=1e-6; // remove overhead and avoid division by zero
t=double(siz<<2)*double(num)/t; // Byte/ms
pmovsd[i]=t/(1.024*1024.0); // MByte/s
// measurement 2
tbeg(); // start time measurement
asm {
push esi
push edi
push ecx
push ebx
push eax
mov ebx,num
mov al,0
loop2: mov esi,adr
mov edi,adr
mov ecx,siz
// rep movsd // es,ds already set by C++
rep stosd // es already set by C++
dec ebx
jnz loop2
pop eax
pop ebx
pop ecx
pop edi
pop esi
}
t=tend(); // stop time measurement
t-=t0; if (t<1e-6) t=1e-6; // remove overhead and avoid division by zero
t=double(siz<<2)*double(num)/t; // Byte/ms
pstosd[i]=t/(1.024*1024.0); // MByte/s
}
#ifdef measure_lock
LeaveCriticalSection(&lock);
DeleteCriticalSection(&lock);
#endif
#ifdef measure_priority
hnd=GetCurrentProcess(); if (hnd!=NULL) { SetPriorityClass(hnd,NORMAL_PRIORITY_CLASS); CloseHandle(hnd); }
#endif
delete[] dat; // allocated with new[], so release with delete[]
}
//---------------------------------------------------------------------------
Here the arrays pmovsd[] and pstosd[] hold the measured 32-bit transfer rates [MByte/sec]. You can configure the code by enabling or commenting out the two defines at the start of the measure function.
Graphical Output:
To maximize accuracy you can change the process priority class to maximum. You can also create the measuring thread with max priority (I tried it, but it actually messed things up) and add a critical section so the test is interrupted by the OS less often (I saw no visible difference with and without threads). If you want to use byte transfers, take into account that they use only 16-bit registers, so you need to add a loop and address iteration.
PS.
If you try this on a notebook, you should warm up the CPU first to be sure that you measure at top CPU/memory speed. So no Sleeps. Some dummy loops before the measurement will do, but they should run for at least a few seconds. You can also synchronize this with a CPU frequency measurement and loop while the frequency is rising, stopping once it saturates.
The asm instruction RDTSC is best for this (but beware, its meaning has changed slightly on newer architectures).
If you are not on Windows, change the functions tbeg and tend to your OS equivalents.
[edit2] further improvements of accuracy
After finally solving a problem with VCL affecting measurement accuracy, which I discovered thanks to this question (more about it here), you can improve accuracy by doing the following prior to the benchmark:
set process priority class to realtime
set process affinity to single CPU
so you measure just single CPU on multi-core
flush DATA and Instruction CACHEs
For example:
// before mem benchmark
DWORD process_affinity_mask=0;
DWORD system_affinity_mask =0;
HANDLE hnd=GetCurrentProcess();
if (hnd!=NULL)
{
// priority
SetPriorityClass(hnd,REALTIME_PRIORITY_CLASS);
// affinity
GetProcessAffinityMask(hnd,&process_affinity_mask,&system_affinity_mask);
process_affinity_mask=1;
SetProcessAffinityMask(hnd,process_affinity_mask);
GetProcessAffinityMask(hnd,&process_affinity_mask,&system_affinity_mask);
}
// flush CACHEs
for (DWORD i=0;i<sizes[N-1];i+=7)
{
dat[i]+=i;
dat[i]*=i;
dat[i]&=i;
}
// after mem benchmark
if (hnd!=NULL)
{
SetPriorityClass(hnd,NORMAL_PRIORITY_CLASS);
SetProcessAffinityMask(hnd,system_affinity_mask);
}
So the more accurate measurement looks like this:
Your lengthMod variable doesn't do what you think it does. You want it to limit the size of your data set, but you have 2 problems there -
Doing a bitwise AND with a power of 2 would mask off all bits except the one that's on. If for e.g. lengthMod is 1k (0x400), then all indices lower than 0x400 (meaning i=1 to 63) would simply map to index 0, so you'll always hit the cache. That's probably why the results are so fast. Instead use lengthMod - 1 to create a correct mask (0x400 --> 0x3ff, which would mask just the upper bits and leave the lower ones intact).
Some of the values for lengthMod are not a power of 2, so doing the lengthMod-1 isn't going to work there as some of the mask bits would still be zeros. Either remove them from the list, or use a modulo operation instead of lengthMod-1 altogether. See also my answer here for a similar case.
Another issue is that 16B jumps are probably not enough to skip a cache line, as most common CPUs work with 64-byte cache lines, so you get only one miss for every 4 iterations. Use (i*64) instead.
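The masking point above can be illustrated with a small sketch. `count - 1` is a valid wrap-around mask only when count is a power of two; for any other size a modulo keeps the index inside the range. The helper names below are hypothetical, and the element counts are illustrative only.

```c
#include <stddef.h>

/* Return 1 if n is a non-zero power of two. */
static int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Wrap the access index (i * stride) into [0, count) for any count > 0.
   Uses the cheap mask when count is a power of two, modulo otherwise. */
static size_t wrap_index(size_t i, size_t stride, size_t count)
{
    size_t pos = i * stride;

    if (is_power_of_two(count))
        return pos & (count - 1);
    return pos % count;
}
```

With count = 6, for instance, the naive mask gives `2 & 5 == 0`, while `wrap_index(1, 2, 6)` returns 2, so the mask would silently skip part of the data set.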

C - Inline asm patching at runtime

I am writing a program in C and I use inline asm. The inline assembler code has some addresses that I want to patch at runtime.
A quick sample of the code is this:
void __declspec(naked) inline(void)
{
mov eax, 0xAABBCCDD
call 0xAABBCCDD
}
Say I want to modify the 0xAABBCCDD value from the main C program.
What I tried to do is call VirtualProtect on the pointer to the function in order to make it writable, and then call memcpy to write the appropriate values into the code.
DWORD old;
VirtualProtect(inline, len, PAGE_EXECUTE_READWRITE, &old);
However, VirtualProtect fails and GetLastError() returns 487, which means an attempt to access an invalid address. Does anyone have a clue about this problem?
Thanks
Doesn't this work?
int X = 0xAABBCCDD;
void __declspec(naked) inline(void)
{
mov eax, [X]
call [X]
}
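The same idea can be expressed without any assembly: keep the value and the call target in ordinary variables and change those at runtime, so the code pages never need to be made writable. This is a minimal sketch in portable C; all names (`patch_value`, `patch_target`, `repatch`, `run_patched`) and the two sample target functions are hypothetical.

```c
#include <stdint.h>

typedef uint32_t (*target_fn)(uint32_t);

/* Two stand-in targets, used only to show that the call is redirected. */
static uint32_t add_one(uint32_t v)   { return v + 1u; }
static uint32_t times_two(uint32_t v) { return v * 2u; }

/* The "patchable" operands live in data, not in the instruction stream. */
static uint32_t  patch_value  = 0xAABBCCDDu;
static target_fn patch_target = add_one;

/* Equivalent of "mov eax, [X]" followed by an indirect call. */
static uint32_t run_patched(void)
{
    return patch_target(patch_value);
}

/* Change both operands at runtime; no VirtualProtect is involved. */
static void repatch(uint32_t value, target_fn target)
{
    patch_value  = value;
    patch_target = target;
}
```

After `repatch(21, times_two)`, `run_patched()` calls the new target with the new value.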
To do it against another process at runtime:
Create a variable that holds the program base address
Get the target RVA (Relative Virtual Address)
Then calculate the real address like this: PA = RVA + BASE
Then call it from your inline assembly
You can get the base address like this:
DWORD dwGetModuleBaseAddress(DWORD dwProcessID)
{
TCHAR zFileName[MAX_PATH];
ZeroMemory(zFileName, MAX_PATH);
HANDLE hProcess = OpenProcess(PROCESS_ALL_ACCESS, true, dwProcessID);
HANDLE hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPMODULE, dwProcessID);
DWORD dwModuleBaseAddress = 0;
if (hSnapshot != INVALID_HANDLE_VALUE)
{
MODULEENTRY32 ModuleEntry32 = { 0 };
ModuleEntry32.dwSize = sizeof(MODULEENTRY32);
if (Module32First(hSnapshot, &ModuleEntry32))
{
do
{
if (wcscmp(ModuleEntry32.szModule, L"example.exe") == 0)
{
dwModuleBaseAddress = (DWORD_PTR)ModuleEntry32.modBaseAddr;
break;
}
} while (Module32Next(hSnapshot, &ModuleEntry32));
}
CloseHandle(hSnapshot);
CloseHandle(hProcess);
}
return dwModuleBaseAddress;
}
Assuming you have a local variable and your base address:
mov dword ptr ss : [ebp - 0x14] , eax;
mov eax, dword ptr BaseAddress;
add eax, RVA; // BASE + RVA gives the real address (PA)
call eax;
mov eax, dword ptr ss : [ebp - 0x14] ;
You have to restore the value of the register after the call returns, since it may be used later in the code, assuming you're trying to patch an existing application that may depend on the eax register after your call. This method has its disadvantages, but it should at least give an idea of what to do.

Multithreading with inline assembly and access to a c variable

I'm using inline assembly to construct a set of passwords, which I will use to brute force against a given hash. I used this website as a reference for the construction of the passwords.
This is working flawlessly in a singlethreaded environment. It produces an infinite amount of incrementing passwords.
As I have only basic knowledge of asm, I understand the idea. The gcc uses ATT, so I compile with -masm=intel
During the attempt to multithread the program, I realize that this approach might not work.
The following code uses 2 global C variables, and I assume that this might be the problem.
__asm__("pushad\n\t"
"mov edi, offset plaintext\n\t" <---- global variable
"mov ebx, offset charsetTable\n\t" <---- again
"L1: movzx eax, byte ptr [edi]\n\t"
" movzx eax, byte ptr [charsetTable+eax]\n\t"
" cmp al, 0\n\t"
" je L2\n\t"
" mov [edi],al\n\t"
" jmp L3\n\t"
"L2: xlat\n\t"
" mov [edi],al\n\t"
" inc edi\n\t"
" jmp L1\n\t"
"L3: popad\n\t");
It produces a non deterministic result in the plaintext variable.
How can I create a workaround so that every thread accesses its own plaintext variable? (If this is the problem...)
I tried modifying this code, to use extended assembly, but I failed every time. Probably due to the fact that all tutorials use ATT syntax.
I would really appreciate any help, as I'm stuck for several hours now :(
Edit: Running the program with 2 threads, and printing the content of plaintext right after the asm instruction, produces:
b
b
d
d
f
f
...
Edit2:
pthread_create(&thread[i], NULL, crack, (void *) &args[i]))
[...]
void *crack(void *arg) {
struct threadArgs *param = arg;
struct crypt_data crypt; // storage for reentrant version of crypt(3)
char *tmpHash = NULL;
size_t len = strlen(param->methodAndSalt);
size_t cipherlen = strlen(param->cipher);
crypt.initialized = 0;
for(int i = 0; i <= LIMIT; i++) {
// intel syntax
__asm__ ("pushad\n\t"
//mov edi, offset %0\n\t"
"mov edi, offset plaintext\n\t"
"mov ebx, offset charsetTable\n\t"
"L1: movzx eax, byte ptr [edi]\n\t"
" movzx eax, byte ptr [charsetTable+eax]\n\t"
" cmp al, 0\n\t"
" je L2\n\t"
" mov [edi],al\n\t"
" jmp L3\n\t"
"L2: xlat\n\t"
" mov [edi],al\n\t"
" inc edi\n\t"
" jmp L1\n\t"
"L3: popad\n\t");
tmpHash = crypt_r(plaintext, param->methodAndSalt, &crypt);
if(0 == memcmp(tmpHash+len, param->cipher, cipherlen)) {
printf("success: %s\n", plaintext);
break;
}
}
return 0;
}
Since you're already using pthreads, another option is making the variables that are modified by several threads into per-thread variables (threadspecific data). See pthread_getspecific OpenGroup manpage. The way this works is like:
In the main thread (before you create other threads), do:
static pthread_key_t tsd_key;
(void)pthread_key_create(&tsd_key, NULL); /* unlikely to fail; handle if you want */
and then within each thread, where you use the plaintext / charsetTable variables (or more such), do:
struct { char *plainText; char *charsetTable; } *str =
pthread_getspecific(tsd_key);
if (str == NULL) {
str = malloc(sizeof(*str));
str->plainText = malloc(size_of_plaintext);
str->charsetTable = malloc(size_of_charsetTable);
initialize(str->plainText); /* put the data for this thread in */
initialize(str->charsetTable); /* ditto */
pthread_setspecific(tsd_key, str);
}
char *plaintext = str->plainText;
char *charsetTable = str->charsetTable;
Or create / use several keys, one per such variable; in that case, you don't get the str container / double indirection / additional malloc.
Intel assembly syntax with gcc inline asm is, hm, not great; in particular, specifying input/output operands is not easy. I think to get that to use the pthread_getspecific mechanism, you'd change your code to do:
__asm__("pushad\n\t"
"push tsd_key\n\t" <---- threadspecific data key (arg to call)
"call pthread_getspecific\n\t" <---- gets "str" as per above
"add esp, 4\n\t" <---- get rid of the func argument
"mov edi, [eax]\n\t" <---- first ptr == "plainText"
"mov ebx, [eax + 4]\n\t" <---- 2nd ptr == "charsetTable"
...
That way, it becomes lock-free, at the expense of using more memory (one plaintext / charsetTable per thread), and the expense of an additional function call (to pthread_getspecific()). Also, if you do the above, make sure you free() each thread's specific data, for example through the destructor argument of pthread_key_create(), or else you'll leak.
If your function is fast to execute, then a lock is a much simpler solution because you don't need all the setup / cleanup overhead of threadspecific data; if the function is either slow or very frequently called, the lock would become a bottleneck though - in that case the memory / access overhead for TSD is justified. Your mileage may vary.
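A third option, besides thread-specific data and a lock, is to make the generator reentrant: each thread owns its buffer and passes a pointer to it, so no global is touched. The following is a minimal sketch in plain C rather than inline asm, and it is not a translation of the charsetTable lookup in the question. The function name `next_candidate` is hypothetical, and the carry order (first character changes fastest) is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Advance buf to the next candidate over charset, odometer style.
   buf must be NUL-terminated and have room to grow by one character.
   Only buf is modified, so threads with separate buffers do not interfere. */
static void next_candidate(char *buf, const char *charset)
{
    size_t n = strlen(charset);
    size_t pos = 0;

    if (n == 0)
        return;                        /* nothing to enumerate */
    for (;;) {
        const char *p;
        size_t idx;

        if (buf[pos] == '\0') {        /* carried past the end: grow */
            buf[pos] = charset[0];
            buf[pos + 1] = '\0';
            return;
        }
        p = strchr(charset, buf[pos]);
        idx = (p != NULL) ? (size_t)(p - charset) : n - 1;
        if (idx + 1 < n) {             /* no carry needed */
            buf[pos] = charset[idx + 1];
            return;
        }
        buf[pos] = charset[0];         /* wrap and carry to next position */
        pos++;
    }
}
```

In crack(), each thread would then declare its own local buffer, for example `char plaintext[64] = "";`, and call `next_candidate(plaintext, charset)` before `crypt_r`.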
Protect this function with a mutex outside of the inline assembly block.
