Setting up interrupts in protected mode (x86)

What is the process of setting up interrupts for protected mode?
This link says one should:
Make space for the interrupt descriptor table
Tell the CPU where that space is (see GDT Tutorial: lidt works the very same way as lgdt)
Tell the PIC that you no longer want to use the BIOS defaults (see Programming the PIC chips)
Write a couple of ISR handlers (see Interrupt Service Routines) for both IRQs and exceptions
Put the addresses of the ISR handlers in the appropriate descriptors
Enable all supported interrupts in the IRQ mask (of the PIC)
The third step makes no sense to me (I looked at this link but there wasn't anything about telling the PIC anything) so I ignored it and completed the next two steps, only to be clueless once again when I reached the final step. However, from my understanding of interrupts, both of the steps I didn't understand relate to hardware interrupts from the PIC controller and shouldn't affect the interrupts raised by the PIT on IRQ 0. I therefore ignored this step as well.
When I ran my code it compiled fine and even ran in a virtual machine, but the interrupt seemed to fire only once. I then realised that I wasn't sending EOI to the PIC, preventing it from raising any more interrupts. However, adding mov al, 0x20 and out 0x20, al just before the iret instruction makes the virtual machine crash.
Here's my IDT:
; idt
idt_start :
dw 0x00 ; The interrupt handler is located at absolute address 0x00
dw CODE_SEG ; CODE_SEG points to the GDT entry for code
db 0x0 ; The unused byte
db 0b11101001 ; 1110 Defines a 32 bit Interrupt gate, 0 is mandatory, privilege level = 0 (0b00), the last bit is one so that the CPU knows that the interrupt will be used
dw 0x00 ; The higher part of the offset (0x00) is 0x00
idt_end:
idt_descriptor :
dw idt_end - idt_start - 1 ; Size of our idt, always one less than the actual size
dd idt_start ; Start address of our idt
Here's my interrupt handler (located at absolute location 0x00 in memory):
ISR_0:
push eax
add [0x300], byte
mov al, 0x20
out 0x20, al
pop eax
iret
times 512-($-$$) db 0
This is the code I use to enter protected mode and load the GDT and IDT into memory:
[bits 16]
switch_to_pm:
cli
lgdt [gdt_descriptor]
lidt [idt_descriptor]
mov eax, cr0
or eax, 1
mov cr0,eax
jmp CODE_SEG:init_pm
[bits 32]
init_pm :
mov ax, DATA_SEG
mov ds, ax
mov ss, ax
mov es, ax
mov fs, ax
mov gs, ax
mov ebp, 0x90000
mov esp, ebp
sti
call BEGIN_PM
My main function (that checks the value of 0x300) is as follows:
void main() {
char iii[15];
int * aa = (int *)0x300;
for (;;)
{
setCursor(0, 0);
print(itoab(*aa, iii));
}
}
By the way, I have verified using a memory dump that everything is loaded at the correct address and everything is exactly where it is expected. For example, 0x300 is a free part of memory used simply to simplify my code.

Let's look at how a comparatively small kernel, namely Linux 0.01, does it!
Make space for the interrupt descriptor table
This is done two times (well, technically only one time): first, the bootloader (the path is /boot/boot.s) initializes the IDTR, so the CPU is happy when jumping into Protected Mode. The IDTR content is as follows:
idt_48:
.word 0 | idt limit=0
.word 0,0 | idt base=0L
The IDTR is loaded like this:
lidt idt_48 | load idt with 0,0
Now, the jump can be performed.
Note that there is no IDT here. It's just a dummy, so no error occurs somewhere in the kernel.
Afterwards, the real IDT is initialized (the path is /boot/head.s). The space is allocated like this:
_idt: .fill 256,8,0 # idt is uninitialized
Tell the CPU where that space is (see GDT Tutorial: lidt works the very same way as lgdt)
lidt takes the address of a 6-byte memory operand that holds the new IDTR content: a 16-bit limit followed by a 32-bit linear base address. That content looks like this:
idt_descr:
.word 256*8-1 # idt contains 256 entries
.long _idt
The IDTR is initialized as follows:
lidt idt_descr
Tell the PIC that you no longer want to use the BIOS defaults (see Programming the PIC chips)
As @RossRidge mentioned in the comments to your question, that means remapping the IRQ interrupt vectors (IVs).
Since the PIC's default IVs overlap with the Intel x86 exception vectors, one of the two has to move. The exception vectors are hard-wired, so we need to remap the PIC vectors.
See also this comment right above the corresponding code by Linus:
| well, that went ok, I hope. Now we have to reprogram the interrupts :-(
| we put them right after the intel-reserved hardware interrupts, at
| int 0x20-0x2F. There they won't mess up anything. Sadly IBM really
| messed this up with the original PC, and they haven't been able to
| rectify it afterwards. Thus the bios puts interrupts at 0x08-0x0f,
| which is used for the internal hardware interrupts as well. We just
| have to reprogram the 8259's, and it isn't fun.
Now, here's the real code. The jmps in between are for synchronizing CPU and PIC, so the CPU won't send data the PIC cannot receive yet. This is comparable to wait states when writing to memory: when the CPU is faster than the memory/memory arbiter, it needs to wait some time before accessing memory the next time.
mov al,#0x11 | initialization sequence
out #0x20,al | send it to 8259A-1
.word 0x00eb,0x00eb | jmp $+2, jmp $+2
out #0xA0,al | and to 8259A-2
.word 0x00eb,0x00eb
mov al,#0x20 | start of hardware int's (0x20)
out #0x21,al
.word 0x00eb,0x00eb
mov al,#0x28 | start of hardware int's 2 (0x28)
out #0xA1,al
.word 0x00eb,0x00eb
mov al,#0x04 | 8259-1 is master
out #0x21,al
.word 0x00eb,0x00eb
mov al,#0x02 | 8259-2 is slave
out #0xA1,al
.word 0x00eb,0x00eb
mov al,#0x01 | 8086 mode for both
out #0x21,al
.word 0x00eb,0x00eb
out #0xA1,al
.word 0x00eb,0x00eb
mov al,#0xFF | mask off all interrupts for now
out #0x21,al
.word 0x00eb,0x00eb
out #0xA1,al
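The same initialization sequence can be written as a table of port writes, which makes the order of the ICW1 to ICW4 bytes easier to see. This is a minimal C sketch, not code from Linux 0.01: the type port_write and the function pic_remap_sequence are hypothetical names, and a real kernel would pass each entry to an outb-style port write followed by a short I/O delay.

```c
#include <stddef.h>
#include <stdint.h>

/* One byte to be written to one I/O port (hypothetical helper type). */
struct port_write {
    uint16_t port;
    uint8_t  value;
};

enum {
    PIC1_CMD = 0x20, PIC1_DATA = 0x21,   /* master 8259A */
    PIC2_CMD = 0xA0, PIC2_DATA = 0xA1    /* slave 8259A  */
};

/*
 * Fill `seq` with the ten writes that remap the master PIC to vectors
 * master_base..master_base+7 and the slave to slave_base..slave_base+7,
 * then mask every IRQ line. `seq` must have room for 10 entries.
 * Returns the number of entries written.
 */
size_t pic_remap_sequence(struct port_write *seq,
                          uint8_t master_base, uint8_t slave_base)
{
    size_t n = 0;
    seq[n++] = (struct port_write){ PIC1_CMD,  0x11 };        /* ICW1: init, ICW4 follows */
    seq[n++] = (struct port_write){ PIC2_CMD,  0x11 };
    seq[n++] = (struct port_write){ PIC1_DATA, master_base }; /* ICW2: vector base */
    seq[n++] = (struct port_write){ PIC2_DATA, slave_base };
    seq[n++] = (struct port_write){ PIC1_DATA, 0x04 };        /* ICW3: slave sits on IRQ2 */
    seq[n++] = (struct port_write){ PIC2_DATA, 0x02 };        /* ICW3: slave identity 2 */
    seq[n++] = (struct port_write){ PIC1_DATA, 0x01 };        /* ICW4: 8086 mode */
    seq[n++] = (struct port_write){ PIC2_DATA, 0x01 };
    seq[n++] = (struct port_write){ PIC1_DATA, 0xFF };        /* OCW1: mask all lines */
    seq[n++] = (struct port_write){ PIC2_DATA, 0xFF };
    return n;
}
```

Calling pic_remap_sequence(seq, 0x20, 0x28) reproduces the values used in the assembly above.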
Write a couple of ISR handlers (see Interrupt Service Routines) for both IRQs and exceptions
For exceptions, you can find the handler code in /kernel/traps.c and /kernel/asm.s.
Some exceptions push an error code on the stack prior to jumping to the handler, which you have to pop off or the iret instruction will fail. A page fault also writes the corresponding virtual address to cr2 in addition.
The IRQ handlers are spread across the whole system. -.- The timer and disk interrupt handlers are in /kernel/system_call.s, the keyboard interrupt handler is in /kernel/keyboard.s, for example.
Put the addresses of the ISR handlers in the appropriate descriptors
The initialization for exceptions is done in /kernel/traps.c in the trap_init function:
void trap_init(void)
{
int i;
set_trap_gate(0,&divide_error);
set_trap_gate(1,&debug);
set_trap_gate(2,&nmi);
set_system_gate(3,&int3); /* int3-5 can be called from all */
set_system_gate(4,&overflow);
set_system_gate(5,&bounds);
set_trap_gate(6,&invalid_op);
set_trap_gate(7,&device_not_available);
set_trap_gate(8,&double_fault);
set_trap_gate(9,&coprocessor_segment_overrun);
set_trap_gate(10,&invalid_TSS);
set_trap_gate(11,&segment_not_present);
set_trap_gate(12,&stack_segment);
set_trap_gate(13,&general_protection);
set_trap_gate(14,&page_fault);
set_trap_gate(15,&reserved);
set_trap_gate(16,&coprocessor_error);
for (i=17;i<32;i++)
set_trap_gate(i,&reserved);
/* __asm__("movl $0x3ff000,%%eax\n\t"
"movl %%eax,%%db0\n\t"
"movl $0x000d0303,%%eax\n\t"
"movl %%eax,%%db7"
:::"ax");*/
}
The IRQ handler entry initializations are again spread across several files. sched_init from /kernel/sched.c initializes the timer interrupt handler's address, for instance.
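For a kernel that does not use the Linux macros, filling in one descriptor by hand looks roughly like the following. This is a sketch under assumptions: idt_gate and make_gate are hypothetical names, and the layout shown is the standard 8-byte gate of a 32-bit IDT. A ring-0 32-bit interrupt gate ends up with the access byte 0x8E (present bit set, DPL 0, type 0xE).

```c
#include <stdint.h>

/* One 8-byte gate descriptor in a 32-bit IDT. */
struct idt_gate {
    uint16_t offset_low;   /* handler address, bits 0..15 */
    uint16_t selector;     /* code segment selector from the GDT */
    uint8_t  zero;         /* reserved, must be 0 */
    uint8_t  type_attr;    /* P (bit 7) | DPL (bits 5..6) | 0 | gate type (bits 0..3) */
    uint16_t offset_high;  /* handler address, bits 16..31 */
};

enum {
    GATE_INTERRUPT32 = 0x0E,  /* clears IF on entry */
    GATE_TRAP32      = 0x0F   /* leaves IF unchanged */
};

/* Build a present gate for `handler` with the given selector, type and DPL. */
struct idt_gate make_gate(uint32_t handler, uint16_t selector,
                          uint8_t type, uint8_t dpl)
{
    struct idt_gate g;
    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = selector;
    g.zero        = 0;
    g.type_attr   = (uint8_t)(0x80 | ((dpl & 3) << 5) | (type & 0x0F));
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
```

An entry for a handler at address 0 with a code selector of 0x08 would be make_gate(0, 0x08, GATE_INTERRUPT32, 0); the selector value here is only an example.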
Enable all supported interrupts in the IRQ mask (of the PIC)
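Strictly speaking, the PIC mask is cleared per IRQ line where each handler is installed; sched_init, for instance, clears bit 0 of the master PIC's mask for the timer. Interrupts are then enabled on the CPU side in /init/main.c in the main function with the macro sti. It is defined in /asm/system.h as follows: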
#define sti() __asm__ ("sti"::)
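Note that sti only sets the CPU's interrupt flag. Which IRQ lines actually get through is decided by the PIC mask registers (ports 0x21 and 0xA1), and after the remapping code above every line is still masked (0xFF). The sketch below computes the new mask values; pic_unmask is a hypothetical helper name, and the combined 16-bit mask (low byte for the master, high byte for the slave) is just a convenience of this example.

```c
#include <stdint.h>

/*
 * Clear the mask bit for `irq` (0..15) in a combined mask:
 * low byte = master PIC (port 0x21), high byte = slave PIC (port 0xA1).
 * Slave IRQs reach the CPU through the master's IRQ2 cascade line,
 * so that line is unmasked as well for irq >= 8.
 */
uint16_t pic_unmask(uint16_t mask, unsigned irq)
{
    mask &= (uint16_t)~(1u << irq);
    if (irq >= 8)
        mask &= (uint16_t)~(1u << 2);
    return mask;
}
```

To let only the timer through, a kernel would write the low byte of pic_unmask(0xFFFF, 0), which is 0xFE, to port 0x21. The handler must also send EOI (0x20 to port 0x20) before iret, otherwise the PIC will not deliver the next interrupt on that line.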

Related

Any attempt to put a string to the screen in Protected Mode causes reboot

I have just recently gone into Protected Mode when developing an OS from scratch. I have managed to get into C and make functions to print characters to the screen (thanks Michael Petch for helping me reach this stage). Anyway, whenever I try to make a routine that loops through a string literal and prints every character in it, well, there's a bit of a problem. QEMU just goes into a boot loop, restarts again and again, and I am never able to see my beautiful green-on-black video mode. If I move this out of a routine and print it character-by-character in the kmain() function (that part of which I have removed), everything works fine and dandy. Here's the file where I try to implement a string printing function:
vga.c -
#include <vga.h>
size_t terminal_row;
size_t terminal_column;
uint8_t terminal_color;
uint16_t *terminal_buffer;
volatile uint16_t * const VIDMEM = (volatile uint16_t *) 0xB8000;
size_t strlen(const char *s)
{
size_t len = 0;
while(s[len]) {
len++;
}
return len;
}
void terminal_init(void)
{
terminal_row = 0;
terminal_column = 0;
terminal_color = vga_entry_color(LGREEN, BLACK);
for(size_t y = 0; y < VGA_HEIGHT; y++) {
for(size_t x = 0; x < VGA_WIDTH; x++) {
const size_t index = y * VGA_WIDTH + x;
VIDMEM[index] = vga_entry(' ', terminal_color);
}
}
}
void terminal_putentryat(char c, uint8_t color, size_t x, size_t y)
{
const size_t index = y * VGA_WIDTH + x;
VIDMEM[index] = vga_entry(c, color);
}
void terminal_putchar(char c)
{
terminal_putentryat(c, terminal_color, terminal_column, terminal_row);
if(++terminal_column == VGA_WIDTH) {
terminal_column = 0;
if(++terminal_row == VGA_HEIGHT) {
terminal_row = 0;
}
}
}
void terminal_puts(const char *s)
{
size_t n = strlen(s);
for (size_t i=0; i < n; i++) {
terminal_putchar(s[i]);
}
}
I read my kernel into memory with this bootloader code:
extern kernel_start ; External label for start of kernel
global boot_start ; Make this global to suppress linker warning
bits 16
boot_start:
xor ax, ax ; Set DS to 0. xor register to itself zeroes register
mov ds, ax
mov ss, ax ; Stack just below bootloader SS:SP=0x0000:0x7c00
mov sp, 0x7c00
mov ah, 0x00
mov al, 0x03
int 0x10
load_kernel:
mov ah, 0x02 ; call function 0x02 of int 13h (read sectors)
mov al, 0x01 ; read one sector (512 bytes)
mov ch, 0x00 ; track 0
mov cl, 0x02 ; sector 2
mov dh, 0x00 ; head 0
; mov dl, 0x00 ; drive 0, floppy 1. Comment out DL passed to bootloader
xor bx, bx ; segment 0x0000
mov es, bx ; segments must be loaded from non immediate data
mov bx, 0x7E00 ; load the kernel right after the bootloader in memory
.readsector:
int 13h ; call int 13h
jc .readsector ; error? try again
jmp 0x0000:kernel_start ; jump to the kernel at 0x0000:0x7e00
I have an assembly stub at the start of my kernel that enters protected mode, zeroes the BSS section, issues a CLD and calls into my C code:
; These symbols are defined by the linker. We use them to zero BSS section
extern __bss_start
extern __bss_sizel
; Export kernel entry point
global kernel_start
; This is the C entry point defined in kmain.c
extern kmain ; kmain is C entry point
bits 16
section .text
kernel_start:
cli
in al, 0x92
or al, 2
out 0x92, al
lgdt[toc]
mov eax, cr0
or eax, 1
mov cr0, eax
jmp 0x08:start32 ; The FAR JMP is simplified since our segment is 0
section .rodata
gdt32:
dd 0
dd 0
dw 0x0FFFF
dw 0
db 0
db 0x9A
db 0xCF
db 0
dw 0x0FFFF
dw 0
db 0
db 0x92
db 0xCF
db 0
gdt_end:
toc:
dw gdt_end - gdt32 - 1
dd gdt32 ; The GDT base is simplified since our segment is now 0
bits 32
section .text
start32:
mov ax, 0x10
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax
mov esp, 0x9c000 ; Set the stack to grow down from area under BDA/Video memory
; We need to zero out the BSS section. We'll do it a DWORD at a time
cld
lea edi, [__bss_start] ; Start address of BSS
lea ecx, [__bss_sizel] ; Length of BSS in DWORDS
xor eax, eax ; Set to 0x00000000
rep stosd ; Do clear using string store instruction
call kmain
I have a specialized linker script that places the bootloader at 0x7c00 and the kernel at 0x7e00.
What's the problem and how can I fix it? I've made my git repo available if more information is needed.

TL;DR : You haven't read your entire kernel into memory with your bootloader in start.asm. Missing code and/or data is causing your kernel to crash with a triple fault which results in a reboot. You will need to read more sectors as your kernel grows.
I noticed that your generated lunaos.img is larger than 1024 bytes. The bootloader is 512 bytes, and the kernel after it is slightly more than 512 bytes. That means the kernel now spans multiple sectors. In your bootloader you load a single 512-byte sector with this code:
load_kernel:
mov ah, 0x02 ; call function 0x02 of int 13h (read sectors)
mov al, 0x01 ; read one sector (512 bytes)
mov ch, 0x00 ; track 0
mov cl, 0x02 ; sector 2
mov dh, 0x00 ; head 0
; mov dl, 0x00 ; drive 0, floppy 1. Comment out DL passed to bootloader
xor bx, bx ; segment 0x0000
mov es, bx ; segments must be loaded from non immediate data
mov bx, 0x7E00 ; load the kernel right after the bootloader in memory
.readsector:
int 13h ; call int 13h
jc .readsector ; error? try again
In particular:
mov al, 0x01 ; read one sector (512 bytes)
This is at the heart of your problem. Since you are booting as a floppy I'd recommend generating a 1.44MiB file and placing your bootloader and kernel in it with:
dd if=/dev/zero of=bin/lunaos.img bs=1024 count=1440
dd if=bin/os.bin of=bin/lunaos.img bs=512 conv=notrunc seek=0
The first command makes a 1.44MiB file filled with zeros. The second uses conv=notrunc to tell DD not to truncate the file after writing. seek=0 tells DD to start writing at the first logical sector in the file. The result is that os.bin is placed inside a 1.44MiB image starting at logical sector 0, without truncating the original file when finished.
A properly sized disk image of a known floppy disk size makes it easier to use in some emulators.
A 1.44MiB floppy has 36 sectors per track (18 sectors per head, 2 heads per track). If you run your code on real hardware, some BIOSes may not load across a track boundary. You're likely safe reading 35 sectors with your disk read. The first sector was read by the BIOS off track 0 head 0. There are 35 more sectors on the first track. I'd amend the line above to be:
mov al, 35 ; read 35 sectors (35*512 = 17920 bytes)
This would allow your kernel to be 35*512 bytes long = 17920 bytes with minimum hassles even on real hardware. Any larger than that you will have to consider modifying your bootloader with a loop that attempts to read more than one track. To complicate matters you'd have to concern yourself that larger kernels will eventually exceed the 64k segment limit. The disk reads would probably have to be modified to use a segment (ES) that isn't 0. If your kernel gets that large your bootloader can be fixed at that time.
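The arithmetic behind those numbers can be sketched in C. The helper names sectors_needed and lba_to_chs are hypothetical; the geometry constants are the usual ones for a 1.44MB floppy (18 sectors per head, 2 heads), and "cylinder" below is what the answer calls a track.

```c
#include <stdint.h>

enum { SECTOR_SIZE = 512, SECTORS_PER_HEAD = 18, HEADS = 2 };

/* Number of 512-byte sectors needed to hold `bytes` bytes, rounded up. */
uint32_t sectors_needed(uint32_t bytes)
{
    return (bytes + SECTOR_SIZE - 1) / SECTOR_SIZE;
}

/* Cylinder/head/sector triple in the form int 13h expects in CH/DH/CL. */
struct chs {
    uint16_t cylinder;
    uint8_t  head;
    uint8_t  sector;   /* 1-based */
};

/* Convert a zero-based logical sector number to CHS for the geometry above. */
struct chs lba_to_chs(uint32_t lba)
{
    struct chs r;
    r.cylinder = (uint16_t)(lba / (SECTORS_PER_HEAD * HEADS));
    r.head     = (uint8_t)((lba / SECTORS_PER_HEAD) % HEADS);
    r.sector   = (uint8_t)(lba % SECTORS_PER_HEAD + 1);
    return r;
}
```

A kernel of 513 bytes already needs two sectors, which is why a one-sector read stops working as soon as the kernel grows past 512 bytes. Logical sector 1 (the first kernel sector) maps to cylinder 0, head 0, sector 2, matching the CH/DH/CL values in the bootloader, and logical sector 36 is the first one on the next cylinder.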
Debugging
Since you are in protected mode and using QEMU, I highly suggest you consider using a debugger. QEMU supports remote debugging with GDB. It's not difficult to set up, and since you have generated an ELF executable of your kernel you can also use symbolic debugging.
You will want to add -Fdwarf to your NASM assembly commands right after -felf32 to enable debug information. Add the -g option to your GCC commands to enable debug information. The command below should start up your bootloader/kernel; automatically break on kmain; use os.elf for debug symbols; and display the source code and registers in the terminal.
qemu-system-i386 -fda bin/lunaos.img -S -s &
gdb bin/os.elf \
-ex 'target remote localhost:1234' \
-ex 'layout src' \
-ex 'layout regs' \
-ex 'break *kmain' \
-ex 'continue'
There are many tutorials on using GDB if you search with Google. There is a cheat sheet that describes most of the basic commands and their syntax.
If you ever find yourself having trouble with interrupts, the GDT, or paging in the future, I recommend using Bochs for debugging those aspects of an operating system. Although Bochs doesn't have a symbolic debugger, it makes up for that by identifying low-level problems more easily than QEMU. Debugging real-mode code like bootloaders is also easier in Bochs, given that it understands 20-bit segment:offset addressing, unlike QEMU.

Recover from Hard Fault on Cortex M0+

Until now I had a Hard fault handler in C that I defined in the vector table:
.sect ".intvecs"
.word _top_of_main_stack
.word _c_int00
.word NMI
.word Hard_Fault
.word Reserved
.word Reserved
.word Reserved
.word Reserved
.word Reserved
.word Reserved
.word Reserved
.word Reserved
.word Reserved
.word Reserved
.word Reserved
.word Reserved
....
....
....
One of our tests triggers a hard fault (on purpose) by writing to a non existing address. Once the test is done, the handler returns to the calling function and the cortex recovers from the fault. Worth mentioning that the handler does not have any arguments.
Now I'm in the phase of writing a real handler.
I created a struct for the stack frame so we can print PC, LR, and xPSR in case of a fault:
typedef struct
{
int R0 ;
int R1 ;
int R2 ;
int R3 ;
int R12 ;
int LR ;
int ReturnAddress ;
int xPSR ;
} InterruptStackFrame_t ;
My hard fault handler in C is defined:
void Hard_Fault(InterruptStackFrame_t* p_stack_frame)
{
// Write to external memory that I can read from outside
/* prints a message containing information about stack frame:
* p_stack_frame->LR, p_stack_frame->PC, p_stack_frame->xPSR,
* (uint32_t)p_stack_frame (SP)
*/
}
I created an assembly function:
.thumbfunc _hard_fault_wrapper
_hard_fault_wrapper: .asmfunc
MRS R0, MSP ; store pointer to stack frame
BL Hard_Fault ; go to C function handler
POP {R0-R7} ; pop out all stack frame
MOV PC, R5 ; jump to LR that was in the stack frame (the calling function before the fault)
.endasmfunc
This is the right time to say that I don't have an OS, so I do not have to check bit[2] of LR because I definitely know that I use MSP and not PSP.
The program compiles and runs properly and I used JTAG to ensure that all registers restore to the wanted values.
When executing the last command (MOV PC, R5) the PC returns to the correct address, but at some point, the debugger indicates that the M0 is locked in a hard fault and cannot recover.
I do not understand the difference between using a C function as a handler or an assembly function that calls a C function.
Does anyone know what is the problem?
Eventually, I will use an assert function that halts the processor, but I want that to be optional and under my control.

To explain "old_timer"'s comment:
When entering an exception or interrupt handler on the Cortex the LR register has a special value.
Normally you return from the exception handler by simply jumping to that value (by writing that value to the PC register).
The Cortex CPU will then automatically pop all the registers from the stack and it will reset the interrupt logic.
When directly jumping to the PC stored on the stack however you will destroy some registers and you don't restore the interrupt logic.
Therefore this is not a good idea.
Instead I'd do something like this:
.thumbfunc _hard_fault_wrapper
_hard_fault_wrapper: .asmfunc
MRS R0, MSP
B Hard_Fault
EDIT
Using the B instruction may not work because the "distance" allowed for the B instruction is more limited than for the BL instruction.
However there are two possibilities you could use (unfortunately I'm not sure if these will definitely work).
The first one will return to the address that had been passed in the LR register when entering the assembler handler:
.thumbfunc _hard_fault_wrapper
_hard_fault_wrapper: .asmfunc
MRS R0, MSP
PUSH {LR}
BL Hard_Fault
POP {PC}
The second one will indirectly do the jump:
.thumbfunc _hard_fault_wrapper
_hard_fault_wrapper: .asmfunc
MRS R0, MSP
LDR R1, =Hard_Fault
MOV PC, R1
EDIT 2
You cannot use LR because it holds EXC_RETURN value. ... You have to read the LR from stack and you must clean the stack from the stack frame, because the interrupted program doesn't know that a frame was stored.
According to the Cortex M3 manual you must exit from an exception handler by writing one of the three EXC_RETURN values to the PC register.
If you simply jump to the LR value stored in the stack frame you remain in the exception handler!
If a fault then occurs later in the program, the CPU will assume that an exception happened inside the exception handler, and it hangs.
I assume that the Cortex M0 works the same way as the M3 on this point.
If you want to modify some CPU register during the exception handler you can modify the stack frame. The CPU will automatically pop all registers from the stack frame when you write the EXC_RETURN value to the PC register.
If you want to modify one of the registers not present in the stack frame (such as R5) you can directly modify it in the exception handler.
And this shows another problem of your interrupt handler:
The instruction POP {R0-R7} will set registers R4 to R7 to values that do not match the program that has been interrupted. R12 will also be destroyed depending on the C code. This means that in the program being interrupted these four registers suddenly change while the program is not prepared for that!
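Modifying the stack frame, as described above, could look like the following sketch. The names exception_frame_t, exc_return_uses_psp and skip_faulting_instruction are hypothetical. It assumes a core without an FPU, that the stacked return address points at the faulting instruction, and that this instruction is a 16-bit Thumb instruction; none of this is guaranteed for every fault.

```c
#include <stdbool.h>
#include <stdint.h>

/* Registers the core stacks on exception entry, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;              /* LR of the interrupted code */
    uint32_t return_address;  /* where execution resumes after the handler */
    uint32_t xpsr;
} exception_frame_t;

/* EXC_RETURN values a handler finds in LR on entry (no FPU). */
#define EXC_RETURN_HANDLER_MSP 0xFFFFFFF1u
#define EXC_RETURN_THREAD_MSP  0xFFFFFFF9u
#define EXC_RETURN_THREAD_PSP  0xFFFFFFFDu

/* Bit 2 of EXC_RETURN tells which stack pointer holds the frame. */
bool exc_return_uses_psp(uint32_t exc_return)
{
    return (exc_return & (1u << 2)) != 0;
}

/*
 * Make the exception return resume after the faulting instruction.
 * Assumes a 16-bit Thumb instruction; a 32-bit one would need +4.
 */
void skip_faulting_instruction(exception_frame_t *frame)
{
    frame->return_address += 2;
}
```

The C handler would call skip_faulting_instruction on the frame pointer it received in R0 and then return normally, so that the EXC_RETURN value preserved by the wrapper (the PUSH {LR} / POP {PC} variant) ends up in PC and the hardware pops the modified frame.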

When a syscall is called by a userspace program, how does execution transfer back to kernelspace?

I've been studying a lot about the ABI for x86-64, writing Assembly, and studying how the stack and heap work.
Given the following code:
#include <linux/seccomp.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
// execute the seccomp syscall (could be any syscall)
seccomp(...);
return 0;
}
In Assembly for x86-64, this would do the following:
Align the stack pointer (as it's off by 8 bytes by default).
Setup registers and the stack for any arguments for the call seccomp.
Execute the following Assembly call seccomp.
When seccomp returns, it's likely that the C runtime will call exit(0), as far as I know.
I'd like to talk about what happens between step three and four above.
I currently have my stack for the currently running process with its own data in registers and on the stack. How does the userspace process turn over execution to the kernel? Does the kernel just pick up when the call is made and then push to and pop from the same stack?
I believe I heard somewhere that syscalls don't happen immediately but on certain CPU ticks or interrupts. Is this true? How does this happen, for example, on Linux?

syscalls don't happen immediately but on certain CPU ticks or interrupts
Totally wrong. The CPU doesn't just sit there doing nothing until a timer interrupt. On most architectures, including x86-64, switching to kernel mode takes tens to hundreds of cycles, but not because the CPU is waiting for anything. It's just a slow operation.
Note that glibc provides function wrappers around nearly every syscall, so if you look at disassembly you'll just see a normal-looking function call.
What really happens (x86-64 as an example):
See the AMD64 SysV ABI docs, linked from the x86 tag wiki. It specifies which registers to put args in, and that system calls are made with the syscall instruction. Intel's insn ref manual (also linked from the tag wiki) documents in full detail every change that syscall makes to the architectural state of the CPU. If you're interested in the history of how it was designed, I dug up some interesting mailing list posts from the amd64 mailing list between AMD architects and kernel devs. AMD updated the behaviour before the release of the first AMD64 hardware so it was actually usable for Linux (and other kernels).
32-bit x86 uses the int 0x80 instruction for syscalls, or sysenter. syscall isn't available in 32-bit mode on Intel CPUs, and sysenter isn't available in 64-bit mode on AMD CPUs. You can run int 0x80 in 64-bit code, but you still get the 32-bit API that treats pointers as 32-bit (i.e. don't do it). BTW, perhaps you were confused about syscalls having to wait for interrupts because of int 0x80? Running that instruction fires that interrupt on the spot, jumping right to the interrupt handler. 0x80 is not an interrupt that hardware can trigger, either, so that interrupt handler only ever runs after a software-triggered system call.
AMD64 syscall example:
#include <stdlib.h>
#include <unistd.h>
#include <linux/unistd.h> // for __NR_write
const char msg[]="hello world!\n";
ssize_t amd64_write(int fd, const char*msg, size_t len) {
ssize_t ret;
asm volatile("syscall" // volatile because we still need the side-effect of making the syscall even if the result is unused
: "=a"(ret) // outputs
: [callnum]"a"(__NR_write), // inputs: syscall number in rax,
"D" (fd), "S"(msg), "d"(len) // and args, in same regs as the function calling convention
: "rcx", "r11", // clobbers: syscall always destroys rcx/r11, but Linux preserves all other regs
"memory" // "memory" to make sure any stores into buffers happen in program order relative to the syscall
);
return ret; // raw kernel return value: byte count, or -errno on failure
}
int main(int argc, char *argv[]) {
amd64_write(1, msg, sizeof(msg)-1);
return 0;
}
int glibcwrite(int argc, char**argv) {
write(1, msg, sizeof(msg)-1); // don't write the trailing zero byte
return 0;
}
compiles to this asm output, with the godbolt Compiler Explorer:
gcc's -masm=intel output is somewhat MASM-like, in that it uses the OFFSET keyword to get the address of a label.
.rodata
msg:
.string "hello world!\n"
.text
main: // using an in-line syscall
mov eax, 1 # __NR_write
mov edx, 13 # string length
mov esi, OFFSET FLAT:msg # string pointer
mov edi, eax # file descriptor = 1 happens to be the same as __NR_write
syscall
xor eax, eax # zero the return value
ret
glibcwrite: // using the normal way that you get from compiler output
sub rsp, 8 // keep the stack 16B-aligned for the function call
mov edx, 13 // put args in registers
mov esi, OFFSET FLAT:msg
mov edi, 1
call write
xor eax, eax
add rsp, 8
ret
glibc's write wrapper function just puts 1 in eax and runs syscall, then checks the return value and sets errno on error. The cmp/jne at the top appears to branch to a separate path (not shown) used when the process is multi-threaded.
// objdump -R -Mintel -d /lib/x86_64-linux-gnu/libc.so.6
...
00000000000f7480 <__write>:
f7480: 83 3d f9 27 2d 00 00 cmp DWORD PTR [rip+0x2d27f9],0x0 # 3c9c80 <argp_program_version_hook+0x1f8>
f7487: 75 10 jne f7499 <__write+0x19>
f7489: b8 01 00 00 00 mov eax,0x1
f748e: 0f 05 syscall
f7490: 48 3d 01 f0 ff ff cmp rax,0xfffffffffffff001 // -4095: return values in [-4095, -1] are -errno
f7496: 73 31 jae f74c9 <__write+0x49>
f7498: c3 ret
... more code to handle cases where one of those branches was taken
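The cmp rax,0xfffffffffffff001 / jae pair in the disassembly is the error check: Linux returns -errno in rax, and raw values in the range [-4095, -1] are treated as errors. A minimal sketch of that convention in C follows; syscall_failed and syscall_result are hypothetical names, not glibc functions, and the real wrapper stores the error in errno rather than in a caller-supplied variable.

```c
#include <stdbool.h>

/* Largest errno value the kernel encodes in a raw return value. */
#define MAX_ERRNO 4095L

/* True if a raw syscall return value lies in [-4095, -1]. */
bool syscall_failed(long raw)
{
    /* Same unsigned comparison as "cmp rax, -4095; jae". */
    return (unsigned long)raw >= (unsigned long)-MAX_ERRNO;
}

/*
 * Convert a raw kernel return value to the libc convention:
 * on error store the positive errno in *err and return -1,
 * otherwise return the value unchanged.
 */
long syscall_result(long raw, int *err)
{
    if (syscall_failed(raw)) {
        *err = (int)-raw;
        return -1;
    }
    return raw;
}
```

With the amd64_write example above, a successful write of the 13-byte message gives a raw value of 13, while a write to a bad file descriptor gives -9 (that is, -EBADF), which the wrapper turns into a -1 return with errno set to 9.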

syscalls don't happen immediately but on certain CPU ticks or interrupts
Certainly the effect of your syscall could be dependent on many things including ticks. Scheduler granularity or the resolution of timing could be limited to tick period, e.g. But the call itself should happen "immediately" (inline with execution).
How does the userspace process turn over execution to the kernel? Does the kernel just pick up when the call is made and then push to and pop from the same stack?
It probably varies slightly between architectures but in general the syscall arguments are assembled by the libc and then a processor exception is generated in order to change context.
For additional details, see: "How system calls work on x86 linux"

Hardware Processor Counters Incorrectly Resetting

I wrote a program which reads the APERF/MPERF counters on an Intel chip (page 2 on http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf).
These counters are readable/writable via the rdmsr/wrmsr instructions, and I'm currently simply reading them at a regular interval via a device driver in Windows 7. The counters are 64 bits, and increment approximately with each processor clock, so you'd expect them to take a very long time to overflow, but when I read the counters, their value jumps around as if they are being reset by another program.
Is there any way to track down what program would be resetting the counters? Could something else be causing incorrect values to be read? The relevant assembly and corresponding C functions I'm using are attached below. The 64-bit result from rdmsr is saved into edx:eax, so to make sure I wasn't missing any numbers in the r_x registers, I run the command multiple times to check them all.
C:
long long test1, test2, test3, test4;
test1 = TST1();
test2 = TST2();
test3 = TST3();
test4 = TST4();
status = RtlStringCbPrintfA(buffer, sizeof(buffer), "Value: %llu %llu %llu %llu\n", test1, test2, test3, test4);
Assembly:
;;;;;;;;;;;;;;;;;;;
PUBLIC TST1
TST1 proc
mov ecx, 231 ; 0xE7
rdmsr
ret ; returns rax
TST1 endp
;;;;;;;;;;;;;;;;;;;
PUBLIC TST2
TST2 proc
mov ecx, 231 ; 0xE7
rdmsr
mov rax, rbx
ret ; returns rax
TST2 endp
;;;;;;;;;;;;;;;;;;;
PUBLIC TST3
TST3 proc
mov ecx, 231 ; 0xE7
rdmsr
mov rax, rcx
ret ; returns rax
TST3 endp
;;;;;;;;;;;;;;;;;;;
PUBLIC TST4
TST4 proc
mov ecx, 231 ; 0xE7
rdmsr
mov rax, rdx
ret ; returns rax
TST4 endp
The result that prints out is something like below, but the only register which ever changes is the rax register, and it doesn't increase monotonically (can jump around):
Value: 312664 37 231 0
Value: 252576 37 231 0
Value: 1051857 37 231 0

I was not able to figure out what was resetting my counters, but I was able to determine the frequency. The Intel docs state that when one counter overflows, the other counter also will. So even though the counters are constantly resetting, the ratio of aperf and mperf still does represent the processor's frequency.
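The ratio calculation can be sketched as follows. The names msr_value and effective_hz are hypothetical, and base_hz (the nominal frequency at which MPERF counts) is an assumption the caller has to supply. Note that rdmsr delivers the value split across two registers, high half in EDX and low half in EAX, so both halves must be combined.

```c
#include <stdint.h>

/* Combine the two halves returned by rdmsr: EDX = high 32 bits, EAX = low 32 bits. */
uint64_t msr_value(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)edx << 32) | eax;
}

/*
 * Estimate the effective frequency from APERF and MPERF counts taken
 * over the same period: base_hz * APERF / MPERF.
 * Returns 0 if MPERF is 0 (no reference cycles counted).
 */
uint64_t effective_hz(uint64_t base_hz, uint64_t aperf, uint64_t mperf)
{
    if (mperf == 0)
        return 0;
    return (uint64_t)((double)base_hz * (double)aperf / (double)mperf);
}
```

For example, with a nominal 2 GHz clock, 1500 APERF counts against 1000 MPERF counts corresponds to an effective 3 GHz.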

It seems that Windows 7 and Windows 8 read and reset the writeable APERF/MPERF counters on AMD processors. So, you want to access the read-only APERF/MPERF counters at registers 0xc00000E7/E8.
But there is a new issue. On some of the latest AMD processors (Family 0x16 processors), those registers are not always supported. To determine if those registers are supported, you have to read the EffFreqRO bit in CPUID Fn8000_0007_EDX. As stated before, all this applies only to AMD processors.

test&set and test&test&set LOCK implementations in ASM for C

I'm looking for test&set and test&test&set lock implementations in x86 assembly to use from my C code. I don't want implementations in C, but plain assembly.
Please point me to some useful ones.
Thanks in advance!

Here you have a simple implementation of test&set under IA32 x86
//eax = pointer on 32 bit lock variable
//Variable must be 4 byte aligned
//edx = bit test and set number from 0..31
lock bts dword ptr [eax], edx
setnc al //al is 1 if bts instruction was successful
And here you have a simple implementation of looped test&set
//eax = pointer on 32 bit lock variable
//Variable must be 4 byte aligned
//edx = bit test and set number from 0..31
#wait:
pause //CPU hint for waiting in loop
lock bts dword ptr [eax], edx
jc #wait //waiting in loop!!!
Remember that waiting in a loop will freeze the application thread, so it is smart to also implement a maximum wait time for the loop.
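For comparison only, since the question asks for assembly: the same two patterns, including the bounded wait, can be expressed with C11 atomics, and the compiler then emits the atomic instruction itself (on x86 typically an xchg, which is implicitly locked). This is a sketch; spinlock_t, tas_try_lock, ttas_lock_bounded and spin_unlock are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} spinlock_t;

/* test&set: one atomic exchange; returns true if the lock was acquired. */
bool tas_try_lock(spinlock_t *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/*
 * test&test&set with a bounded wait: spin on a plain load and only
 * attempt the atomic exchange when the lock looks free.
 * Returns false if the lock was not acquired within max_spins attempts.
 */
bool ttas_lock_bounded(spinlock_t *l, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(&l->locked, memory_order_relaxed) == 0 &&
            tas_try_lock(l))
            return true;
    }
    return false;
}

/* Release the lock. */
void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The plain load in ttas_lock_bounded is the extra "test": waiting threads read their cached copy of the lock word instead of issuing a locked read-modify-write on every iteration.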

Depending on the architecture, you can do this in a single instruction or by disabling interrupts.
80386 and later compatible processors have the bts instruction which, combined with the lock prefix, does test-and-set atomically with the test result in the carry flag. Here is a great explanation of how to use PPC instructions to implement mutexes.
Others require something like:
cli ;; Clear interrupts flag.
move r0, r1 ;; Copy the value into r0.
ori r1, 1 ;; Set the bit in r1 (r1 holds the value to test-and-set.)
sti ;; Re-enable interrupts.
