Can i use interrupts in inline assembly - c

I want to use interrupts in inline asm but it is not letting me I think my app is this
__asm__("movb %ah,9;"
"movb %al,0x41;"
"movb %bh,0x0;"
"movw %cx,0x1; "
"movb %bl,0x0F ;"
"int $10; ");
I am getting nothing after running this just a blank space what should I do?

Can i use interrupts in inline assembly?
Yes, as long as the interrupt exists and you inform the compiler of how to deal with it correctly (e.g. "clobber list" to tell the compiler which registers changed, etc). See the GCC manual and https://stackoverflow.com/tags/inline-assembly/info
And as long as you use the correct number, likely you wanted int $0x10 instead of int $10. And you'll want to mov immediates into registers, not mov registers to absolute memory addresses.
If that doesn't work, I'd be tempted to suspect that you're trying to use BIOS "video services" to print a character on the screen; but you're not in real mode (and doing it in 32-bit code or 64-bit code); and the BIOS might not exist at all (e.g. a UEFI system); and an OS may have taken over control of the hardware (and broken all the assumptions that BIOS expects, like disabling obsolete "legacy junk emuation" in certain devices, reconfiguring memory, and timers and interrupt controllers; so that even if BIOS exists and you're in real mode you still can't use it safely).

Related

GCC Extended assembly pin local variable to any register except r12

Basically I am looking for a way that I pin a temporary to any register except r12.
I know I can "hint" the compiler to pin to a single register with:
// Toy example. Obviously an unbalanced `pop` in
// extended assembly will cause serious problems.
register long tmp asm("rdi"); // or just clober rdi and use it directly.
asm volatile("pop %[tmp]\n" // using pop hence don't want r12
: [tmp] "=&r" (tmp)
:
:);
and this will generally work as in avoiding r12 but might mess up the compilers register allocation elsewhere.
Is it possible to do this without forcing the compiler to use a single register?
Note that register asm doesn't truly "pin" a variable to a register, it only ensures that uses of that variable as an operand in inline asm will use that register. In principle the variable may be stored elsewhere in between. See https://gcc.gnu.org/onlinedocs/gcc-11.1.0/gcc/Local-Register-Variables.html#Local-Register-Variables. But it sounds like all you really need is to ensure that your pop instruction doesn't use r12 as its operand, possibly because of Why is POP slow when using register R12?. I'm not aware of any way to do precisely this, but here are some options that may help.
The registers rax, rbx, rcx, rdx, rsi, rdi each have their own constraint letters, a,b,c,d,S,D respectively (the other registers don't). So you can get about halfway there by doing
long tmp;
asm volatile("pop %[tmp]\n"
: [tmp] "=&abcdSD" (tmp)
:
:);
This way the compiler has the option to choose any of those six registers, which should give the register allocator a lot more flexibility.
Another option is to declare that your asm clobbers r12, which will prevent the compiler from allocating operands there:
long tmp;
asm volatile("pop %[tmp]\n"
: [tmp] "=&r" (tmp)
:
: "r12");
The tradeoff is that it will also not use r12 to cache local variables across the asm, since it assumes that it may be modified. Hopefully it will be smart enough to just avoid using r12 in that part of the code at all, but if it can't, it may emit extra register moves or spill to the stack around your asm. Still, it's less brutal than -ffixed-r12 which would prevent the compiler from using r12 anywhere in the entire source file.
Future readers should note that in general it is unsafe to modify the stack pointer inside inline asm on x86-64. The compiler assumes that rsp isn't changed by inline asm, and it may access stack variables via effective addresses with constant offsets relative to rsp, at any time. Moreover, x86-64 uses a red zone, so even a push/pop pair is unsafe, because there may be important data stored below rsp. (And an unexpected pop may mean that other important data is no longer in the red zone and thus subject to overwriting by signal handlers.) So, you shouldn't do this unless you're willing to carefully read the generated assembly after every recompilation to make sure the compiler hasn't decided to do any of these things. (And before you ask, you cannot fix this by declaring a clobber of rsp; that's not supported.)

How to prevent LDM/STM instructions expansion in ARM Compiler 5 armcc inline assembler?

I'm trying to generate AXI bus burst accesses using STM/LDM instructions in inline assembly in .c file compiled with ARM Compiler 5 armcc.
inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
__asm {
STMIA addr!, { w0, w1 }
}
}
But ARM Compiler armcc User Guide, paragraph 7.18 is saying:
"All LDM and STM instructions are expanded into a sequence of LDR and STR instructions with equivalent effect. However, the compiler might subsequently recombine the separate instructions into an LDM or STM during optimization."
And that is what really happens in practice, LDM/STM are expanded into a set of LDR/STR in some cases and order of these instuctions is arbitrary.
This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions).
To resolve this it's possible to use embedded assembler instead of inline assembler, but this leads to extra function calls-returns what affects performance.
So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.
Target CPU: Cortex M0+ (ARMv6-M).
Edit:
Slave devices are all on-chip devices, most of them are non-memory devices. For every register of non-memory slave that supports burst access region of address space is reserved (for example [0x10000..0x10100]), I'm not completely sure why, maybe CPU or bus doesn't support fixed (non-incremental) addresses. HW ignores offsets within this region. Full request can be 16 bytes for example and first word of the full request is first word written (even if offset is non-zero).
So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.
A little bit about compiler optimizations. Register allocation is one of it's toughest jobs. The heart of any compiler's code generation is probably around when it allocates physical CPU registers. Most compilers are using Single static assignment or SSA to rename your 'C' variables into a bunch of pseudo variable (or time order variables).
In order for your STMIA and LDMIA to work you need the loads and stores to be consistent. Ie, if it is stmia [rx], {r3,r7} and a restore like ldmia [rx], {r4,r8} with the 'r3' mapping to the new 'r4' and the stored 'r7' mapping to the restored 'r8'. This is not simple for any compiler to implement generically as 'C' variables will be assigned according to need. Different versions of the same variable maybe in different registers. To make the stm/ldm work those variable must be assigned so that register increments in the right order. Ie, for the ldmia above if the compiler want the stored r7 in r0 (maybe a return value?), there is no way for it to create a good ldm instruction without generating additional code.
You may have gotten gcc to generate this, but it was probably luck. If you proceed with only gcc, you will probably find it doesn't work as well.
See: ldm/stm and gcc for issues with GCC stm/ldm.
Taking your example,
inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
__asm {
STMIA addr!, { w0, w1 }
}
}
The value of inline is that the whole function body may be put right in the code. The caller might have the w0 and w1 in registers R8 and R4. If the function is not inline, then the compile must place them in R1 and R2 but may have generated extra moves. It is difficult for any compiler to fulfil the requirements of the ldm/stm generically.
This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions).
If the hardware is a particular non-memory slave peripheral on the bus, then you can wrap the functionality to write to this slave in an external wrapper and force the register allocation (see AAPCS) so that ldm/stm will work. This will result in a performance hit which could be mitigated by some custom assembler in the driver for the device.
However, it sounds like the device might be memory? In this case, you have a problem. Normally, memory devices like this will use a cache only? If your CPU has an MPU (memory protection unit) and can enable both data and code cache, then you might resolve this issue. Cache lines will always be burst accesses. Care only needs to be taken in the code to setup the MPU and the data cache. OPs Cortex-M0+ has no cache and the devices are non-memory so this will not be possible (nor needed).
If your device is memory and you have no data cache then your issue is probably unresolvable (without massive effort) and you need different hardware. Or you can wrap it like the peripheral device and take a performance hit; loosing the benefits of the random access of the memory device.

How does including assembly inline with C code work?

I've seen code for Arduino and other hardware that have assembly inline with C, something along the lines of:
asm("movl %ecx %eax"); /* moves the contents of ecx to eax */
__asm__("movb %bh (%eax)"); /*moves the byte from bh to the memory pointed by eax */
How does this actually Work? I realize every compiler is different, but what are the common reasons this is done, and how could someone take advantage of this?
The inline assembler code goes right into the complete assembled code untouched and in one piece. You do this when you really need absolutely full control over your instruction sequence, or maybe when you can't afford to let an optimizer have its way with your code. Maybe you need every clock tick. Maybe you need every single branch of your code to take the exact same number of clock ticks, and you pad with NOPs to make this happen.
In any case, lots of reasons why someone may want to do this, but you really need to know what you're doing. These chunks of code will be pretty opaque to your compiler, and its likely you won't get any warnings if you're doing something bad.
Usually the compiler will just insert the assembler instructions right into its generated assembler output. And it will do this with no regard for the consequences.
For example, in this code the optimiser is performing copy propagation, whereby it sees that y=x, then z=y. So it replaces z=y with z=x, hoping that this will allow it to perform further optimisations. Howver, it doesn't spot that I've messed with the value of x in the mean time.
char x=6;
char y,z;
y=x; // y becomes 6
_asm
rrncf x, 1 // x becomes 3. Optimiser doesn't see this happen!
_endasm
z=y; // z should become 6, but actually gets
// the value of x, which is 3
To get around this, you can essentially tell the optimiser not to perform this optimisation for this variable.
volatile char x=6; // Tell the compiler that this variable could change
// all by itself, and any time, and therefore don't
// optimise with it.
char y,z;
y=x; // y becomes 6
_asm
rrncf x, 1 // x becomes 3. Optimiser doesn't see this happen!
_endasm
z=y; // z correctly gets the value of y, which is 6
Historically, C compilers generated assembly code, which would then be translated to machine code by an assembler. Inline assembly arises as a simple feature — in the intermediate assembly code, at that point, inject some user-picked code. Some compilers directly generate machine code, in which case they contain an assembler or call an external assembler to generate the machine code for the inline assembly snippets.
The most common use for assembly code is to use specialized processor instructions that the compiler isn't able to generate. For example, disabling interrupts for a critical section, controlling processor features (cache, MMU, MPU, power management, querying CPU capabilities, …), accessing coprocessors and hardware peripherals (e.g. inb/outb instructions on x86), etc. You'll rarely find asm("movl %ecx %eax"), because that affects general-purpose registers that the C code around it is also using, but something like asm("mcr p15, 0, 0, c7, c10, 5") has its use (data memory barrier on ARM). The OSDev wiki has several examples with code snippets.
Assembly code is also useful to implement features that break C's flow control model. A common example is context switching between threads (whether cooperative or preemptive, whether in the same address space or not) requiring assembly code to save and restore register values.
Assembly code is also useful to hand-optimize small bits of code for memory or speed. As compilers are getting smarter, this is rarely relevant at the application level nowadays, but it's still relevant in much of the embedded world.
There are two ways to combine assembly with C: with inline assembly, or by linking assembly modules with C modules. Linking is arguably cleaner but not always applicable: sometimes you need that one instruction in the middle of a function (e.g. for register saving on a context switch, a function call would clobber the registers), or you don't want to pay the cost of a function call.
Most C compilers support inline assembly, but the syntax varies. It is typically introduced by the keyword asm, _asm, __asm or __asm__. In addition to the assembly code itself, the inline assembly construct may contain additional code that allows you to pass values between assembly and C (for example, requesting that the value of a local variable is copied to a register on entry), or to declare that the assembly code clobbers or preserves certain registers.
asm("") and __asm__ are both valid usage. Basically, you can use __asm__ if the keyword asm conflicts with something in your program. If you have more than one instructions, you can write one per line in double quotes, and also suffix a ’\n’ and ’\t’ to the instruction. This is because gcc sends each instruction as a string to as(GAS) and by using the newline/tab you can send correctly formatted lines to the assembler. The code snippet in your question is basic inline.
In basic inline assembly, there is only instructions. In extended assembly, you can also specify the operands. It allows you to specify the input registers, output registers and a list of clobbered registers. It is not mandatory to specify the registers to use, you can leave that to GCC and that probably fits into GCC’s optimization scheme better. An example for the extended asm is:
__asm__ ("movl %eax, %ebx\n\t"
"movl $56, %esi\n\t"
"movl %ecx, $label(%edx,%ebx,$4)\n\t"
"movb %ah, (%ebx)");
Notice that the '\n\t' at the end of each line except the last, and each line is enclosed in quotes. This is because gcc sends each as instruction to as as a string as I mentioned before. The newline/tab combination is required so that the lines are fed to as according to the correct format.

Unexpected warning on GNU ARM Assembler

I am writing some bare metal code for the Raspberry Pi and am getting an unexpected warning from the ARM cross assembler on Windows. The instructions causing the warnings were:
stmdb sp!,{r0-r14}^
and
ldmia sp!,{r0-r14}^
The warning is:
Warning: writeback of base register is UNPREDICTABLE
I can sort of understand this as although the '^' modifier tells the processor to store the user mode copies of the registers, it doesn't know what mode the processor will be in when the instruction is executed and there doesn't appear to be a way to tell it. I was a little more concerned to get the same warning for:
stmdb sp!,{r0-r9,sl,fp,ip,lr}^
and:
ldmia sp!,{r0-r9,sl,fp,ip,lr}^
despite the fact that I am explicitly not storing ANY sp register.
My concern is that, although I used to do a lot of assembler coding about 15 years ago, ARM code is new to me and I may be misunderstanding something! Also, if I can safely ignore the warnings, is there any way to suppress them?
The ARM Architecture Reference Manual says that writeback is not allowed in LDM/SMT of user registers. It is allowed in the exception return case, where pc is in the register list.
LDM (exception return)
LDM{<amode>}<c> <Rn>{!},<registers_with_pc>^
LDM (user registers)
LDM{<amode>}<c> <Rn>,<registers_without_pc>^
The term "writeback" refers not to the presence or absence of SP in the register list, but to the ! symbol which means the instruction is supposed to update the SP value with the end of transfer area address. The base register (SP) value will be used from the current mode, not the User mode, so you can still load or store user-mode SP value into your stack. From the ARM ARM B9.3.6 LDM (User registers):
In a PL1 mode other than System mode, Load Multiple (User registers)
loads multiple User mode registers from consecutive memory locations
using an address from a base register. The registers loaded cannot
include the PC. The processor reads the base register value normally,
using the current mode to determine the correct Banked version of the
register. This instruction cannot writeback to the base register.
The encoding diagram reflects this by specifying the bit 21 (W, writeback) as '(0)' which means that the result is unpredictable if the bit is not 0.
So the solution is just to not specify the ! and decrement or increment SP manually if necessary.

Triple fault in home grown kernel

I am trying to write a kernel, mostly for entertainment purposes, and I am running into a problem were I believe it is triple faulting. Everything worked before I attempted to enable paging. The code that is breaking is this:
void switch_page_directory(page_directory_t *dir){
current_directory = dir;
asm volatile("mov %0, %%cr3":: "r"(&dir->tablesPhysical));
u32int cr0;
asm volatile("mov %%cr0, %0": "=r"(cr0));
cr0 |= 0x80000000;//enable paging
asm volatile("mov %0, %%cr0":: "r"(cr0)); //this line breaks
}//switch page directory
I have been following a variety of tutorials / documents for this but the one I am using to paging is thus http://www.jamesmolloy.co.uk/tutorial_html/6.-Paging.html . I am not sure what other code will be useful in figuring this out but if there is more I should provide I will be more than happy to do so.
Edit=====
I believe the CS,DS and SS are selecting correct entries here's the code used to set them
global gdt_flush
extern gp
gdt_flush:
lgdt [gp] ; Load the GDT with our 'gp' which is a special pointer
mov ax, 0x10 ; 0x10 is the offset in the GDT to our data segment
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax
jmp 0x08:flush2 ; 0x08 is the offset to our code segment: Far jump!
flush2:
ret ; Returns back to the C code!
and here's the gdt struct itself
struct gdt_entry{
unsigned short limit_low;
unsigned short base_low;
unsigned char base_middle;
unsigned char access;
unsigned char granularity;
unsigned char base_high;
} __attribute__((packed));
struct gdt_ptr{
unsigned short limit;
unsigned int base;
} __attribute__((packed));
struct gdt_entry gdt[5];
struct gdt_ptr gp;
The IDT is very similar to this.
GDT: you don't say what the contents of the GDT entries are, but the stuff you've shown looks fairly similar to the earlier part of the tutorial you linked to and if you've set up the entries in the same way, then all should be well (i.e. flat segment mapping with a ring 0 code segment for CS, ring 0 data segment for everything else, both with a base of 0 and a limit of 4GB).
IDT: probably doesn't matter anyway if interrupts are disabled and you're not (yet) expecting to cause page faults.
Page tables: incorrect page tables do seem like the most likely suspect. Make sure that your identity mapping covers all of the code, data and stack memory that you're using (at least).
The source code linked to at the bottom of http://www.jamesmolloy.co.uk/tutorial_html/6.-Paging.html definitely builds something that does work correctly with both QEMU and Bochs, so hopefully you can compare what you're doing with what that's doing, and work out what is wrong.
QEMU is good in general, but I would recommend Bochs for developing really low-level stuff - it includes (or can be configured to include) a very handy internal debugger. e.g. set reset_on_triple_fault=0 on the cpu: line of the config file, set a breakpoint in the switch_page_directory() code, run to the breakpoint, then single step instructions and see what happens...
You can link qemu to a gdb debugger session via the remote debugger tools in gdb. This can be done by issuing the following commands:
qemu -s [optional arguments]
Then inside your gdb session, open your kernel executable, and after setting a break-point in your switch_page_directory() function, type in the following command at the gdb prompt:
target remote localhost:1234
You can then single-step through your kernel at the breakpoint, and see where your triple-fault is taking place.
Another step to consider is to actually install some default exception handlers in your IDT ... the reason you are triple-faulting is because an exception is being thrown by the CPU, but there is no proper exception handler available to handle it. Thus with some default handlers installed, especially the double-fault handler, you can effectively stop the kernel without going into a triple-fault that automatically resets the PC.
Finally, make sure you have re-programmed the PIC before going into protected mode ... otherwise the default hardware interrupts it's programmed to trigger from the BIOS in real-mode will now trigger the exception interrupts in protected mode.
I was also facing the same problem with paging tutorial.But after some searching i found the solution it was happening because as soon as paging is enabled, all address become virtual and to solve it we must map the virtual addresses to the same physical addresses so they refer to the same thing and this is called identity mapping.
you can follow this link for further help in implementing Identity Maping.
and one more thing you have memset the newly allocated space to zero because it may contain garbage values and memset was not done in tutorial it will work on bochs because it set the space to zero for you but other emulator(qemu) and real hardware are so kind.

Resources