Ways to divide the high/low byte from a 16bit address? - c

I'm developing software for an 8051 processor. A frequent task is splitting a 16-bit address into its high and low bytes. I want to see how many ways there are to do it. The ways I've come up with so far are (say ptr is a 16-bit pointer and int is a 16-bit int; note that rn and arn are registers):
bitwise operation
ADDH = (unsigned int) ptr >> 8;
ADDL = (unsigned int) ptr & 0x00FF;
SDCC gives the following assembly code
; t.c:32: ADDH = (unsigned int) ptr >> 8;
mov ar6,r3
mov ar7,r4
mov _main_ADDH_1_1,r7
; t.c:33: ADDL = (unsigned int) ptr & 0x00FF;
mov _main_ADDL_1_1,r6
Keil C51 gives me:
; SOURCE LINE # 32
0045 AA00 R MOV R2,ptr+01H
0047 A900 R MOV R1,ptr+02H
0049 AE02 MOV R6,AR2
004B EE MOV A,R6
004C F500 R MOV ADDH,A
; SOURCE LINE # 33
004E AF01 MOV R7,AR1
0050 EF MOV A,R7
0051 F500 R MOV ADDL,A
which contains a lot of useless code, IMHO.
pointer trick
ADDH = ((unsigned char *)&ptr)[0];
ADDL = ((unsigned char *)&ptr)[1];
SDCC gives me:
; t.c:37: ADDH = ((unsigned char *)&ptr)[0];
mov _main_ADDH_1_1,_main_ptr_1_1
; t.c:38: ADDL = ((unsigned char *)&ptr)[1];
mov _main_ADDL_1_1,(_main_ptr_1_1 + 0x0001)
Keil C51 gives me:
; SOURCE LINE # 37
006A 850000 R MOV ADDH,ptr
; SOURCE LINE # 38
006D 850000 R MOV ADDL,ptr+01H
which is the same as the SDCC version.
Andrey's mathematical approach
ADDH = ptr / 256;
ADDL = ptr % 256;
SDCC gives:
; t.c:42: ADDH = (unsigned int)ptr / 256;
mov ar5,r3
mov ar6,r4
mov ar7,r6
mov _main_ADDH_1_1,r7
; t.c:43: ADDL = (unsigned int)ptr % 256;
mov _main_ADDL_1_1,r5
I have no idea why SDCC uses the r7 register...
Keil C51 gives me:
; SOURCE LINE # 42
0079 AE00 R MOV R6,ptr
007B AF00 R MOV R7,ptr+01H
007D AA06 MOV R2,AR6
007F EA MOV A,R2
0080 F500 R MOV ADDH,A
; SOURCE LINE # 43
0082 8F00 R MOV ADDL,R7
I have no idea why Keil uses the R2 register either...
semaj's union approach
typedef union
{
unsigned short u16;
unsigned char u8[2];
} U16_U8;
U16_U8 ptr;
// Do something to set the variable ptr
ptr.u16 = ?;
ADDH = ptr.u8[0];
ADDL = ptr.u8[1];
SDCC gives me
; t.c:26: ADDH = uptr.u8[0];
mov _main_ADDH_1_1,_main_uptr_1_1
; t.c:27: ADDL = uptr.u8[1];
mov _main_ADDL_1_1,(_main_uptr_1_1 + 0x0001)
Keil C51 gives me:
; SOURCE LINE # 26
0028 850000 R MOV ADDH,uptr
; SOURCE LINE # 27
002B 850000 R MOV ADDL,uptr+01H
which is very similar to the pointer trick. However, this approach requires two more bytes of memory to store the union.
Does anyone have any other bright ideas? ;)
And can anyone tell me which way is more efficient?
In case anyone is interested, here is the test case:
typedef union
{
unsigned short u16;
unsigned char u8[2];
} U16_U8;
// call a function on the ADDs to prevent the compiler from optimizing them away
void swap(unsigned char *a, unsigned char *b)
{
unsigned char tm;
tm = *a;
*a = *b;
*b = tm;
}
main (void)
{
char c[] = "hello world.";
unsigned char xdata *ptr = (unsigned char xdata *)c;
unsigned char ADDH, ADDL;
unsigned char i = 0;
U16_U8 uptr;
uptr.u16 = (unsigned short)ptr;
for ( ; i < 4 ; i++, uptr.u16++){
ADDH = uptr.u8[0];
ADDL = uptr.u8[1];
swap(&ADDH, &ADDL);
}
for (i = 0; i < 4 ; i++, ptr++){ // reset i, otherwise this loop never runs
ADDH = (unsigned int) ptr >> 8;
ADDL = (unsigned int) ptr & 0x00FF;
swap(&ADDH, &ADDL);
}
for (i = 0; i < 4 ; i++, ptr++){ // reset i, otherwise this loop never runs
ADDH = ((unsigned char *)&ptr)[0];
ADDL = ((unsigned char *)&ptr)[1];
swap(&ADDH, &ADDL);
}
for (i = 0; i < 4 ; i++, ptr++){ // reset i, otherwise this loop never runs
ADDH = (unsigned int)ptr / 256;
ADDL = (unsigned int)ptr % 256;
swap(&ADDH, &ADDL);
}
}

The most efficient way is completely dependent on the compiler. You definitely have to figure out how to get an assembly listing from your compiler for an 8051 project.
One method you might try that is similar to those already mentioned is a union:
typedef union
{
unsigned short u16;
unsigned char u8[2];
} U16_U8;
U16_U8 ptr;
// Do something to set the variable ptr
ptr.u16 = ?;
ADDH = ptr.u8[0];
ADDL = ptr.u8[1];
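One caveat with the union (and with the pointer trick): which array element holds the high byte depends on the byte order the compiler uses for 16-bit values, and the listings in the question do not show that. The following is a minimal sketch, assuming unsigned short is two bytes wide; high_byte_first and split_union are hypothetical helper names, not part of SDCC or Keil C51.

```c
typedef union
{
    unsigned short u16;
    unsigned char u8[2];
} U16_U8;

/* Returns 1 when the compiler stores the high byte of a 16-bit value first. */
static int high_byte_first(void)
{
    U16_U8 probe;
    probe.u16 = 0x0100;
    return probe.u8[0] == 0x01;
}

/* Splits addr into high and low bytes regardless of the storage order. */
static void split_union(unsigned short addr, unsigned char *hi, unsigned char *lo)
{
    U16_U8 v;
    v.u16 = addr;
    if (high_byte_first()) {
        *hi = v.u8[0];
        *lo = v.u8[1];
    } else {
        *hi = v.u8[1];
        *lo = v.u8[0];
    }
}
```

If the compiler is known to store the high byte first, the fixed u8[0]/u8[1] order shown above is enough; the assembly listing is the place to confirm it.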

Another not so bright way to split the address:
ADDH = ptr / 256;
ADDL = ptr % 256;

The first one is the most efficient, since it is done in a single instruction.
No, sorry, I was wrong. I forgot that the 8051 instruction set has only 1-bit shift (rotate) instructions. The second should be faster, but the compiler may generate poor code, so beware and check the assembly output.
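For an unsigned 16-bit value, dividing by 256 and taking the remainder yield exactly the same bytes as shifting by 8 and masking, which is why a compiler is free to reduce either form to plain byte moves. A small sketch of both forms, assuming unsigned short is 16 bits; split_shift and split_div are hypothetical names used only for this comparison.

```c
/* Shift-and-mask form. */
static void split_shift(unsigned short addr, unsigned char *hi, unsigned char *lo)
{
    *hi = (unsigned char)(addr >> 8);
    *lo = (unsigned char)(addr & 0x00FFu);
}

/* Divide-and-remainder form; same result for unsigned operands. */
static void split_div(unsigned short addr, unsigned char *hi, unsigned char *lo)
{
    *hi = (unsigned char)(addr / 256u);
    *lo = (unsigned char)(addr % 256u);
}
```

Whether the two forms cost the same on a given 8051 compiler is something only the generated listing can answer.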

I just create two defines (as follows).
It seems more straightforward and less error-prone.
#define HI(x) ((x) >> 8)
#define LO(x) ((x) & 0xFF)
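As written, HI(x) returns everything above bit 7, so an argument wider than 16 bits yields more than one byte. A hedged variant that always produces a single byte is sketched below; HI8 and LO8 are hypothetical names, and the sketch assumes unsigned short is 16 bits and unsigned char is 8 bits.

```c
/* Truncate to 16 bits first, then take the upper byte. */
#define HI8(x) ((unsigned char)(((unsigned short)(x)) >> 8))
/* Keep only the lowest byte. */
#define LO8(x) ((unsigned char)((x) & 0xFFu))
```

For example, HI8(0x1234) evaluates to 0x12 and LO8(0x1234) to 0x34.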


Pointer and array usage confusion

There is a code excerpt from official Quake 2 source code:
unsigned *buf;
dheader_t header;
...
header = *(dheader_t *)buf; // #1
for (i=0 ; i<sizeof(dheader_t)/4 ; i++)
((int *)&header)[i] = LittleLong ( ((int *)&header)[i]); // #2
Can someone please explain, in as much detail as possible, what lines #1 and #2 really do? I'm rather confused.
P.S
Here is the rest of the definitions if it helps:
int LittleLong (int l) {return _LittleLong(l);}
...
typedef struct
{
int ident;
int version;
lump_t lumps[HEADER_LUMPS];
} dheader_t;
P.S. 2
I've linked the original full source file above, in case it is needed.
This is some seriously brittle code and you shouldn't write code like this.
It goes through the struct int by int and does something with each int inside _LittleLong. Very likely this function performs a 32-bit conversion from a big-endian integer to a little-endian one, meaning that the source you are looking at is likely related to receiving IP packets.
Looking at what the code does step by step:
for (i=0 ; i<sizeof(dheader_t)/4 ; i++) is a sloppier way of writing sizeof(dheader_t)/sizeof(int). That is: iterate through the struct int by int, chunks of 32 bits.
(int *)&header converts from a dheader_t* to an int*. This is actually well-defined by a special rule in C that allows us to convert from a pointer to a struct to a pointer to its first member or vice versa, and here the first member is an int.
However, doing so is only well-defined for the first member. Instead they take the converted int* and apply array dereferencing on it: ((int *)&header)[i]. This is undefined behavior in C, a so-called strict aliasing violation, and could also cause alignment problems in some situations. Bad.
The int read from the struct through this dereferencing is then passed along to LittleLong which very likely does a big -> little endian conversion.
((int *)&header)[i] = and here it is written back to where it was grabbed from.
Better, safer, well-defined and possibly faster code could look like:
#include <stdint.h> /* uint32_t */
#include <string.h> /* memcpy */
void endianify (dheader_t* header)
{
_Static_assert(sizeof(dheader_t)%sizeof(uint32_t)==0,
"Broken struct: dheader_t");
unsigned char* start = (unsigned char*)header;
unsigned char* end = start + sizeof(dheader_t);
for(unsigned char* i=start; i!=end; i+=sizeof(uint32_t))
{
uint32_t tmp;
memcpy(&tmp,i,sizeof(uint32_t));
i[0]= (tmp >> 24) & 0xFF;
i[1]= (tmp >> 16) & 0xFF;
i[2]= (tmp >> 8) & 0xFF;
i[3]= (tmp >> 0) & 0xFF;
}
}
Disassembly:
endianify:
mov eax, DWORD PTR [rdi]
bswap eax
mov DWORD PTR [rdi], eax
mov eax, DWORD PTR [rdi+4]
bswap eax
mov DWORD PTR [rdi+4], eax
mov eax, DWORD PTR [rdi+8]
bswap eax
mov DWORD PTR [rdi+8], eax
mov eax, DWORD PTR [rdi+12]
bswap eax
mov DWORD PTR [rdi+12], eax
mov eax, DWORD PTR [rdi+16]
bswap eax
mov DWORD PTR [rdi+16], eax
ret
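To try the memcpy-based idea in isolation, here is a minimal, self-contained sketch. It uses a hypothetical demo_header_t in place of the real dheader_t (whose lump_t and HEADER_LUMPS are not shown here), and it swaps every 32-bit word unconditionally, which is a simplification rather than what LittleLong necessarily does.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for dheader_t: only 32-bit members, so no padding is expected. */
typedef struct
{
    uint32_t ident;
    uint32_t version;
    uint32_t lumps[2];
} demo_header_t;

/* Reverses the byte order of one 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Walks the struct in 32-bit steps through unsigned char*, avoiding int* aliasing. */
static void swap_header_words(demo_header_t *h)
{
    unsigned char *p = (unsigned char *)h;
    size_t i;
    for (i = 0; i < sizeof *h; i += sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, p + i, sizeof w);
        w = swap32(w);
        memcpy(p + i, &w, sizeof w);
    }
}
```

Accessing the object through unsigned char* and memcpy is what keeps this within the aliasing rules discussed above.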

Keyboard interrupts in long mode. 64 bit os

I am trying to create my first OS from scratch using different tutorials. So far I have a simple kernel with simple paging, a GDT for 64-bit mode, and the switch to long mode. But there is a problem with keyboard interrupts. I have read lots of topics about this, and I think a double fault occurs when I start typing. Please help me understand and fix this problem. Here is the repository with my code: https://github.com/alexanderian76/TestOS
So the problem is that if I type anything, QEMU resets the system instead of showing symbols on the display. As far as I understand, every interrupt resets it.
This is my gdt64:
gdt64:
dq 0 ; zero entry
.code: equ $ - gdt64 ; new
dq (1<<43) | (1<<44) | (1<<47) | (1<<53) ; code segment
.pointer:
dw $ - gdt64 - 1
dq gdt64
This is the entry into long mode:
start:
mov esp, stack_top
call check_multiboot
call check_cpuid
call check_long_mode
call set_up_page_tables ; new
call enable_paging ; new
; print `OK` to screen
mov dword [0xb8000], 0x2f4b2f4f
lgdt [gdt64.pointer]
jmp gdt64.code:long_mode_start
hlt
long_mode_start:
; print `OKAY` to screen
mov ax, 0
mov ss, ax
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov rax, 0x2f592f412f4b2f4f
mov qword [0xb8000], rax
call main
hlt
And this is the IDT:
#define PIC2_COMMAND 0xA0
#define PIC2_DATA 0xA1
#define ICW1_INIT 0x10
#define ICW1_ICW4 0x01
#define ICW4_8086 0x01
struct IDT_entry {
unsigned short offset_lowerbits;
unsigned short selector;
unsigned char ist;
unsigned short offset_mid;
unsigned int zero;
unsigned char type_attr;
unsigned int offset_higherbits;
};
//*********************************
extern struct IDT_entry IDT[IDT_SIZE];
void load_idt_entry()
{
for(unsigned long long int t = 0; t < 256; t++) {
IDT[t].offset_lowerbits = (unsigned short)(((unsigned long long int)&isr1 & 0x000000000000ffff));
IDT[t].offset_mid = (unsigned short)(((unsigned long long int)&isr1 & 0x00000000ffff0000) >> 16);
IDT[t].offset_higherbits = (unsigned int)(((unsigned long long int)&isr1 & 0xffffffff00000000) >> 32);
IDT[t].selector = 0x08;
IDT[t].type_attr = 0x8e;
IDT[t].zero = 0;
IDT[t].ist = 0;
RemapPic();
outb(0x21, 0xfd);
outb(0xa1, 0xff);
LoadIDT();
}
}
void outb(unsigned short port, unsigned char val){
asm volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}
unsigned char inb(unsigned short port){
unsigned char returnVal;
asm volatile ("inb %1, %0"
: "=a"(returnVal)
: "Nd"(port));
return returnVal;
}
void RemapPic(){
unsigned char a1, a2;
a1 = inb(PIC1_DATA);
a2 = inb(PIC2_DATA);
outb(PIC1_COMMAND, ICW1_INIT | ICW1_ICW4);
outb(PIC2_COMMAND, ICW1_INIT | ICW1_ICW4);
outb(PIC1_DATA, 0);
outb(PIC2_DATA, 8);
outb(PIC1_DATA, 4);
outb(PIC2_DATA, 2);
outb(PIC1_DATA, ICW4_8086);
outb(PIC2_DATA, ICW4_8086);
outb(PIC1_DATA, a1);
outb(PIC2_DATA, a2);
}
Sorry, this is the first time I have asked something here, but I can't solve this problem by myself.
Thank you so much!
Your IDT_entry seems to have the fields in the wrong order. type_attr should be after the ist and the zero should be at the end. Also your POPALL macro actually has pushes not pops.
diff --git a/long_mode_init.asm b/long_mode_init.asm
index cd64e24..926afae 100644
--- a/long_mode_init.asm
+++ b/long_mode_init.asm
@@ -19,13 +19,13 @@ extern IDT
%endmacro
%macro POPALL 0
- push r11
- push r10
- push r9
- push r8
- push rdx
- push rcx
- push rax
+ pop r11
+ pop r10
+ pop r9
+ pop r8
+ pop rdx
+ pop rcx
+ pop rax
%endmacro
diff --git a/main.c b/main.c
index b1bfa1c..22ef2fe 100644
--- a/main.c
+++ b/main.c
@@ -41,10 +41,10 @@ struct IDT_entry {
unsigned short offset_lowerbits;
unsigned short selector;
unsigned char ist;
- unsigned short offset_mid;
- unsigned int zero;
unsigned char type_attr;
+ unsigned short offset_mid;
unsigned int offset_higherbits;
+ unsigned int zero;
};
//*********************************
With these changes it no longer crashes.
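The field order matters because the CPU expects each long-mode IDT entry to be exactly 16 bytes in a fixed layout; with the original ordering, a compiler following the usual x86-64 alignment rules also inserts padding, so the struct is no longer 16 bytes. A hedged sketch that checks the layout at compile time is shown below. The names idt_entry64 and idt_set_gate are hypothetical, and fixed-width types from stdint.h are assumed to be available in the kernel build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Long-mode interrupt gate, in the corrected field order. */
struct idt_entry64 {
    uint16_t offset_low;   /* handler address bits 0..15 */
    uint16_t selector;     /* code segment selector */
    uint8_t  ist;          /* interrupt stack table index */
    uint8_t  type_attr;    /* gate type and attributes */
    uint16_t offset_mid;   /* handler address bits 16..31 */
    uint32_t offset_high;  /* handler address bits 32..63 */
    uint32_t zero;         /* reserved */
};

/* Fail the build if the compiler lays the struct out differently. */
static_assert(sizeof(struct idt_entry64) == 16, "IDT entry must be 16 bytes");
static_assert(offsetof(struct idt_entry64, type_attr) == 5, "type_attr at byte 5");
static_assert(offsetof(struct idt_entry64, offset_high) == 8, "offset_high at byte 8");

/* Fills one gate from a 64-bit handler address. */
static void idt_set_gate(struct idt_entry64 *e, uint64_t handler,
                         uint16_t selector, uint8_t type_attr)
{
    e->offset_low  = (uint16_t)(handler & 0xFFFFu);
    e->offset_mid  = (uint16_t)((handler >> 16) & 0xFFFFu);
    e->offset_high = (uint32_t)(handler >> 32);
    e->selector    = selector;
    e->type_attr   = type_attr;
    e->ist         = 0;
    e->zero        = 0;
}
```

A call such as idt_set_gate(&IDT[t], (uint64_t)&isr1, 0x08, 0x8e) would mirror the values used in the question's loop.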

CPU usage C Packed struct vs Unsigned Long Long operations

I need to do some operations with 48 bit variables, so I had two options:
Create my own structure with 48 bit variables, or
Use unsigned long long (64 bits).
As the operations will not overflow 48 bits, I considered that using 64 bit variables was an overkill, so I created a base structure
#ifdef __GNUC__
#define PACK( __Declaration__ ) __Declaration__ __attribute__((__packed__))
#endif
#ifdef _MSC_VER
#define PACK( __Declaration__ ) __pragma( pack(push, 1) ) __Declaration__ __pragma( pack(pop))
#endif
PACK(struct uint48 {
unsigned long long v : 48;
});
and created some code to check for speed in the operations
#include <stdio.h>
#include <time.h>
#ifdef __GNUC__
#define PACK( __Declaration__ ) __Declaration__ __attribute__((__packed__))
#endif
#ifdef _MSC_VER
#define PACK( __Declaration__ ) __pragma( pack(push, 1) ) __Declaration__ __pragma( pack(pop))
#endif
PACK(struct uint48 {
unsigned long long v : 48;
});
void TestProductLong();
void TestProductLong02();
void TestProductPackedStruct();
void TestProductPackedStruct02();
clock_t start, end;
double cpu_time_used;
int cycleNumber = 100000;
int main(void)
{
TestProductLong();
TestProductLong02();
TestProductPackedStruct();
TestProductPackedStruct02();
return 0;
}
void TestProductLong() {
start = clock();
for (int i = 0; i < cycleNumber;i++) {
unsigned long long varlong01 = 155782;
unsigned long long varlong02 = 15519994;
unsigned long long product01 = varlong01 * varlong02;
unsigned long long varlong03 = 155782;
unsigned long long varlong04 = 15519994;
unsigned long long product02 = varlong03 * varlong04;
unsigned long long addition = product01 + product02;
}
end = clock();
cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
printf("TestProductLong() took %f seconds to execute \n", cpu_time_used);
}
void TestProductLong02() {
start = clock();
unsigned long long varlong01;
unsigned long long varlong02;
unsigned long long product01;
unsigned long long varlong03;
unsigned long long varlong04;
unsigned long long product02;
unsigned long long addition;
for (int i = 0; i < cycleNumber;i++) {
varlong01 = 155782;
varlong02 = 15519994;
product01 = varlong01 * varlong02;
varlong03 = 155782;
varlong04 = 15519994;
product02 = varlong03 * varlong04;
addition = product01 + product02;
}
end = clock();
cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
printf("TestProductLong02() took %f seconds to execute \n", cpu_time_used);
}
void TestProductPackedStruct() {
start = clock();
for (int i = 0; i < cycleNumber; i++) {
struct uint48 x01;
struct uint48 x02;
struct uint48 x03;
x01.v = 155782;
x02.v = 15519994;
x03.v = x01.v * x02.v;
struct uint48 x04;
struct uint48 x05;
struct uint48 x06;
x04.v = 155782;
x05.v = 15519994;
x06.v = x04.v * x05.v;
struct uint48 x07;
x07.v = x03.v + x06.v;
}
end = clock();
cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
printf("TestProductPackedStruct() took %f seconds to execute \n", cpu_time_used);
}
void TestProductPackedStruct02() {
start = clock();
struct uint48 x01;
struct uint48 x02;
struct uint48 x03;
struct uint48 x04;
struct uint48 x05;
struct uint48 x06;
struct uint48 x07;
for (int i = 0; i < cycleNumber; i++) {
x01.v = 155782;
x02.v = 15519994;
x03.v = x01.v * x02.v;
x04.v = 155782;
x05.v = 15519994;
x06.v = x04.v * x05.v;
x07.v = x03.v + x06.v;
}
end = clock();
cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
printf("TestProductPackedStruct02() took %f seconds to execute \n", cpu_time_used);
}
But I got the following results
TestProductLong() took 0.000188 seconds to execute
TestProductLong02() took 0.000198 seconds to execute
TestProductPackedStruct() took 0.001231 seconds to execute
TestProductPackedStruct02() took 0.001231 seconds to execute
So the operations using unsigned long long took less time than the ones using the packed structure.
Why is that?
Would it be better, then, to use unsigned long long instead?
Is there a better way to pack structures?
As I'm unrolling loops right now, using the correct data structure could impact the performance of my application significantly.
Thank you.
Although you know that the operations on the 48-bit values will not overflow, a compiler cannot know this! Further, with the vast majority of compilers and platforms, your uint48 structure will actually be implemented as a 64-bit data type, of which only the low 48 bits will ever be used.
So, after any arithmetic (or other) operations on the .v data, the 'unused' 16 bits of the (internal) 64-bit representation will need to be cleared, to ensure that any future accesses to that data will give the true, 48-bit-only value.
Thus, using the clang-cl compiler in Visual Studio 2019, the following (rather trivial) function using the native uint64_t type:
extern uint64_t add64(uint64_t a, uint64_t b) {
return a + b;
}
generates the expected, highly efficient assembly code:
lea rax, [rcx + rdx]
ret
However, using (an equivalent of) your 48-bit packed structure:
#pragma pack(push, 1)
typedef struct uint48 {
unsigned long long v : 48;
} uint48_t;
#pragma pack(pop)
extern uint48_t add48(uint48_t a, uint48_t b) {
uint48_t c;
c.v = a.v + b.v;
return c;
}
requires additional assembly code to ensure that any overflow into the 'unused' bits is discarded:
add rcx, rdx
movabs rax, 281474976710655 # This is 0x0000FFFFFFFFFFFF - clearing top 16 bits!
and rax, rcx
ret
Note that the MSVC compiler generates very similar code.
Thus, you should expect that using native, uint64_t variables will generate more efficient code than your 'space-saving' structure.
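The masking can also be observed from C: a result stored into the 48-bit field wraps at 2^48, while the plain 64-bit addition does not. A minimal sketch, assuming the compiler accepts unsigned long long as a bit-field type (GCC, Clang and MSVC do, but the C standard leaves this implementation-defined). Here add48 takes and returns uint64_t so the two functions can be compared directly, which differs from the struct-based signature above.

```c
#include <stdint.h>

struct uint48 {
    unsigned long long v : 48;
};

/* Plain 64-bit addition: nothing is discarded below bit 64. */
static uint64_t add64(uint64_t a, uint64_t b)
{
    return a + b;
}

/* Addition through the bit-field: the store into sum.v keeps only 48 bits. */
static uint64_t add48(uint64_t a, uint64_t b)
{
    struct uint48 x, y, sum;
    x.v = a;
    y.v = b;
    sum.v = x.v + y.v;
    return sum.v;
}
```

Adding 1 to 0xFFFFFFFFFFFF gives 0 through add48 but 0x1000000000000 through add64, which is the truncation the extra and instruction implements.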
Your test procedure is wrong. Why?
Packing a one-member struct actually does nothing.
You ran it at -O0, and measuring execution speed without optimizations makes no sense. If you compile it with optimizations, your code is wiped out entirely :) https://godbolt.org/z/9ibP_8
Once the code is rearranged so that it survives optimization (since the values are never used, the variables have to be global or at least static, and a compiler memory barrier (clobber) has to be added):
https://godbolt.org/z/BL9uJE
the difference comes from trimming the results to 48 bits.
If you pack the struct (which is not necessary here), you force the compiler to access the variables byte by byte, because only bytes are always aligned: https://godbolt.org/z/2iV7vq
You can also use a mixed approach, which is not portable because it relies on endianness and the bit-field implementation: https://godbolt.org/z/J3-it_
So the code compiles to:
unsigned long long:
mov QWORD PTR varlong01[rip], 155782
mov QWORD PTR varlong02[rip], 15519994
mov QWORD PTR product01[rip], rdx
mov QWORD PTR varlong03[rip], 155782
mov QWORD PTR varlong04[rip], 15519994
mov QWORD PTR product02[rip], rdx
mov QWORD PTR addition[rip], rcx
not packed struct
mov rdx, QWORD PTR x01[rip]
and rdx, rax
or rdx, 155782
mov QWORD PTR x01[rip], rdx
mov rdx, QWORD PTR x02[rip]
and rdx, rax
or rdx, 15519994
mov QWORD PTR x02[rip], rdx
mov rdx, QWORD PTR x03[rip]
and rdx, rax
or rdx, rsi
mov QWORD PTR x03[rip], rdx
mov rdx, QWORD PTR x04[rip]
and rdx, rax
or rdx, 155782
mov QWORD PTR x04[rip], rdx
mov rdx, QWORD PTR x05[rip]
and rdx, rax
or rdx, 15519994
mov QWORD PTR x05[rip], rdx
mov rdx, QWORD PTR x06[rip]
and rdx, rax
or rdx, rsi
mov QWORD PTR x06[rip], rdx
mov rdx, QWORD PTR x07[rip]
and rdx, rax
or rdx, rdi
mov QWORD PTR x07[rip], rdx
packed struct
mov BYTE PTR x01[rip], -122
mov BYTE PTR x01[rip+1], 96
mov BYTE PTR x01[rip+2], 2
mov BYTE PTR x01[rip+3], 0
mov BYTE PTR x01[rip+4], 0
mov BYTE PTR x01[rip+5], 0
mov BYTE PTR x02[rip], -6
mov BYTE PTR x02[rip+1], -48
mov BYTE PTR x02[rip+2], -20
mov BYTE PTR x02[rip+3], 0
mov BYTE PTR x02[rip+4], 0
mov BYTE PTR x02[rip+5], 0
mov BYTE PTR x03[rip], -36
mov BYTE PTR x03[rip+1], 34
mov BYTE PTR x03[rip+2], 71
mov BYTE PTR x03[rip+3], -20
mov BYTE PTR x03[rip+4], 50
mov BYTE PTR x03[rip+5], 2
mov BYTE PTR x04[rip], -122
mov BYTE PTR x04[rip+1], 96
mov BYTE PTR x04[rip+2], 2
mov BYTE PTR x04[rip+3], 0
mov BYTE PTR x04[rip+4], 0
mov BYTE PTR x04[rip+5], 0
mov BYTE PTR x05[rip], -6
mov BYTE PTR x05[rip+1], -48
mov BYTE PTR x05[rip+2], -20
mov BYTE PTR x05[rip+3], 0
mov BYTE PTR x05[rip+4], 0
mov BYTE PTR x05[rip+5], 0
mov BYTE PTR x06[rip], -36
mov BYTE PTR x06[rip+1], 34
mov BYTE PTR x06[rip+2], 71
mov BYTE PTR x06[rip+3], -20
mov BYTE PTR x06[rip+4], 50
mov BYTE PTR x06[rip+5], 2
mov BYTE PTR x07[rip], -72
mov BYTE PTR x07[rip+1], 69
mov BYTE PTR x07[rip+2], -114
mov BYTE PTR x07[rip+3], -40
mov BYTE PTR x07[rip+4], 101
mov BYTE PTR x07[rip+5], 4
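The point about the optimizer deleting unused work can be reproduced with a small timing harness. The sketch below uses volatile variables instead of the globals plus memory clobber described above, so it is a swapped-in technique rather than the code behind the links; bench_products and sink are hypothetical names.

```c
#include <time.h>

/* Every store to a volatile object must be performed, so the loop body survives -O2. */
static volatile unsigned long long sink;

/* Multiplies the question's two constants `iterations` times and reports the CPU time used. */
static unsigned long long bench_products(int iterations, double *seconds)
{
    volatile unsigned long long a = 155782ULL;    /* volatile: not constant-folded */
    volatile unsigned long long b = 15519994ULL;
    unsigned long long total = 0;
    clock_t start = clock();
    int i;

    for (i = 0; i < iterations; i++) {
        sink = a * b;
        total += sink;
    }
    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return total;
}
```

Returning the accumulated total gives the caller something to check, which is a second guard against the work being removed.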

NASM x86 core dump writing on memory [duplicate]

This question already has answers here:
How can I pass parameters in assembler x86 function call?
Can't pass parameter from C to Assembly code
Why does IA-32 have a non-intuitive caller and callee register saving convention?
What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64
Closed 3 years ago.
I am learning NASM assembly and trying to write an LFSR routine and call it from a C program to measure the difference in execution time, but I have failed to figure out the problem with my code.
My LFSR C version works just fine and is defined as follows:
int lfsr(){
int cont = 0;
uint32_t start_state = SEED;
uint32_t lfsr = start_state;
uint32_t bit;
while (cont != 16777216) {
bit = ((lfsr >> 1) ^ (lfsr >> 5) ^ (lfsr >> 7) ^ (lfsr >> 13)) & 1;
lfsr = (lfsr >> 1) | (bit << 23);
lfsr_nums[cont] = lfsr;
cont++;
}
return cont;
}
My NASM x86 version is based on the C version and generates the numbers the same way the C code does. It should take a pointer to an array as its parameter, fill that array with the generated numbers, and return the number of values generated. The LFSR logic works fine (I checked the generated numbers), but the code still gives me a segmentation fault (core dumped) error.
With gdb, the message says that the error is in the do procedure. While debugging, I found that the failing instruction is mov dword [esi + 4 * eax], ebx; if I comment it out, the code no longer segfaults.
section .text
global lfsr_nasm
lfsr_nasm:
push dword ebx;
mov esi, edi ; vec
mov eax, 0 ;Cont = 0
mov ebx, 0x1313 ; lfst = start_state = seed
do:
mov ecx, ebx ; ecx = lfst
shr ecx, 1 ; lfsr >> 1
mov edx, ebx ; edx = lfst
shr edx, 5; lfst >> 5
xor ecx, edx ; lfst >> 1 ^ lfsr >> 5
mov edx, ebx ; edx = lfsr
shr edx, 7 ; edx = lfst >> 7
xor ecx, edx; lfst >> 1 ^ lfsr >> 5 ^ lfsr >> 7
mov edx, ebx ; edx = lfsr
shr edx, 13 ; edx = lfst >> 13
xor ecx, edx; lfst >> 1 ^ lfsr >> 5 ^ lfsr >> 7 ^ lfsr >> 13
and ecx, 1 ;ecx = bit
shr ebx, 1 ;lfsr >> 1
shl ecx, 23 ; bit << 23
or ebx, ecx ; lfsr = (lfsr >> 1) | (bit << 23);
mov dword [esi + 4 * eax], ebx
inc eax ; cont++
cmp eax, 16777216; cont != 16777216
jne do ;
pop dword ebx;
ret
The way I make the call in C, and declare my vector and NASM function:
extern int lfsr_nasm (uint32_t *vec);
uint32_t lfsr_nums[16777216];
int main(int argc, char *argv[]){
int cont;
cont = lfsr_nasm(lfsr_nums);
for(int i = 0; i < 16777216; i++){
printf("%d ", lfsr_nums[i]);
}
}
I believe that the vector is too big for NASM or C and that the program may be accessing invalid memory, but I couldn't find anything to confirm this belief, nor a fix for the problem. I have already tried malloc and calloc.
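For cross-checking the assembly output against the C logic, as described in the question, a single-step reference function can be handy. The sketch below lifts the loop body of the C version into lfsr_step and adds a comparison helper; both names are hypothetical, and the seed is passed in because the value of SEED is not shown (the assembly uses 0x1313).

```c
#include <stdint.h>

/* One LFSR step, copied from the body of the C loop. */
static uint32_t lfsr_step(uint32_t lfsr)
{
    uint32_t bit = ((lfsr >> 1) ^ (lfsr >> 5) ^ (lfsr >> 7) ^ (lfsr >> 13)) & 1u;
    return (lfsr >> 1) | (bit << 23);
}

/* Returns the index of the first value in vec that differs from the
   reference sequence, or -1 if all `count` values match. */
static int lfsr_first_mismatch(const uint32_t *vec, uint32_t seed, int count)
{
    uint32_t state = seed;
    int i;
    for (i = 0; i < count; i++) {
        state = lfsr_step(state);
        if (vec[i] != state)
            return i;
    }
    return -1;
}
```

Running the helper on a short prefix of the array keeps the check independent of the full 16777216-element buffer.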

Assembly how to translate JNE to C Code without ZF flag access

My ASM-to-C code emulation is nearly done; I'm just trying to solve these second-pass problems.
Let's say I have this ASM function:
401040 MOV EAX,DWORD PTR [ESP+8]
401044 MOV EDX,DWORD PTR [ESP+4]
401048 PUSH ESI
401049 MOV ESI,ECX
40104B MOV ECX,EAX
40104D DEC EAX
40104E TEST ECX,ECX
401050 JE 401083
401052 PUSH EBX
401053 PUSH EDI
401054 LEA EDI,[EAX+1]
401057 MOV AX,WORD PTR [ESI]
40105A XOR EBX,EBX
40105C MOV BL,BYTE PTR [EDX]
40105E MOV ECX,EAX
401060 AND ECX,FFFF
401066 SHR ECX,8
401069 XOR ECX,EBX
40106B XOR EBX,EBX
40106D MOV BH,AL
40106F MOV AX,WORD PTR [ECX*2+45F81C]
401077 XOR AX,BX
40107A INC EDX
40107B DEC EDI
40107C MOV WORD PTR [ESI],AX
40107F JNE 401057
401081 POP EDI
401082 POP EBX
401083 POP ESI
401084 RET 8
My program generates the following for it:
int Func_401040() {
regs.d.eax = *(unsigned int *)(regs.d.esp+0x00000008);
regs.d.edx = *(unsigned int *)(regs.d.esp+0x00000004);
regs.d.esp -= 4;
*(unsigned int *)(regs.d.esp) = regs.d.esi;
regs.d.esi = regs.d.ecx;
regs.d.ecx = regs.d.eax;
regs.d.eax--;
if(regs.d.ecx == 0)
goto label_401083;
regs.d.esp -= 4;
*(unsigned int *)(regs.d.esp) = regs.d.ebx;
regs.d.esp -= 4;
*(unsigned int *)(regs.d.esp) = regs.d.edi;
regs.d.edi = (regs.d.eax+0x00000001);
regs.x.ax = *(unsigned short *)(regs.d.esi);
regs.d.ebx ^= regs.d.ebx;
regs.h.bl = *(unsigned char *)(regs.d.edx);
regs.d.ecx = regs.d.eax;
regs.d.ecx &= 0x0000FFFF;
regs.d.ecx >>= 0x00000008;
regs.d.ecx ^= regs.d.ebx;
regs.d.ebx ^= regs.d.ebx;
regs.h.bh = regs.h.al;
regs.x.ax = *(unsigned short *)(regs.d.ecx*0x00000002+0x0045F81C);
regs.x.ax ^= regs.x.bx;
regs.d.edx++;
regs.d.edi--;
*(unsigned short *)(regs.d.esi) = regs.x.ax;
JNE 401057
regs.d.edi = *(unsigned int *)(regs.d.esp);
regs.d.esp += 4;
regs.d.ebx = *(unsigned int *)(regs.d.esp);
regs.d.esp += 4;
label_401083:
regs.d.esi = *(unsigned int *)(regs.d.esp);
regs.d.esp += 4;
return 0x8;
}
Since JNE 401057 doesn't follow a CMP or TEST, how do I express it in C code?
The most recent instruction that modified flags is the dec, which sets ZF when its operand hits 0. So the jne is about equivalent to if (regs.d.edi != 0) goto label_401057;.
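One way to make this mechanical in a translator is to record ZF whenever an emulated instruction updates it and let the jump read the recorded value. A minimal sketch follows; the Regs layout with a zf field, emu_dec32 and run_loop are hypothetical and much simpler than the regs structure in the question.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register file with an explicit zero flag. */
typedef struct {
    uint32_t edi;
    uint32_t edx;
    int zf;
} Regs;

/* DEC: decrement and record whether the result is zero. */
static void emu_dec32(Regs *r, uint32_t *reg)
{
    (*reg)--;
    r->zf = (*reg == 0);
}

/* Mirrors the loop shape at 401057: body, DEC EDI, a MOV that leaves flags alone, JNE. */
static uint32_t run_loop(uint32_t count)
{
    Regs r;
    assert(count != 0);          /* the original guards this with TEST/JE */
    r.edi = count;
    r.edx = 0;
    r.zf = 0;
label_401057:
    r.edx++;                     /* INC EDX: its ZF result is overwritten by the DEC below */
    emu_dec32(&r, &r.edi);
    /* MOV WORD PTR [ESI],AX would go here; MOV does not modify flags */
    if (!r.zf)
        goto label_401057;       /* JNE 401057 */
    return r.edx;                /* number of iterations executed */
}
```

Tracking the flag explicitly also covers cases where several non-flag instructions sit between the flag-setting instruction and the jump.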
BTW: ret 8 isn't equivalent to return 8. The ret instruction's operand is the number of bytes to add to ESP when returning. (It's commonly used to clean up the stack.) It'd be kinda like
return eax;
regs.d.esp += 8;
except that semi-obviously, this won't work in C -- the return makes any code after it unreachable.
This is actually a part of the calling convention -- [ESP+4] and [ESP+8] are arguments passed to the function, and the ret is cleaning those up. This isn't the usual C calling convention; it looks more like fastcall or thiscall, considering the function expects a value in ECX.
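In an emulator, the stack cleanup of ret 8 can be modeled by popping the return address and then advancing ESP by the immediate. The sketch below uses a small emulated stack array instead of real pointers; stack_mem, push32, pop32 and emulate_ret_imm are hypothetical names, not part of the generated code in the question.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_SIZE 64u
static uint8_t stack_mem[STACK_SIZE];   /* hypothetical emulated stack */

typedef struct {
    uint32_t esp;                       /* index into stack_mem */
} Regs;

static void push32(Regs *r, uint32_t value)
{
    assert(r->esp >= 4u);
    r->esp -= 4u;
    memcpy(&stack_mem[r->esp], &value, sizeof value);
}

static uint32_t pop32(Regs *r)
{
    uint32_t value;
    assert(r->esp + 4u <= STACK_SIZE);
    memcpy(&value, &stack_mem[r->esp], sizeof value);
    r->esp += 4u;
    return value;
}

/* RET imm: pop the return address, then discard imm bytes of arguments. */
static uint32_t emulate_ret_imm(Regs *r, uint32_t imm)
{
    uint32_t return_address = pop32(r);
    r->esp += imm;
    return return_address;
}
```

After two pushed arguments and a pushed return address, emulate_ret_imm(&r, 8) leaves ESP where it was before the arguments were pushed.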
