gcc asm too many memory references - c

I am trying to read the time from the CMOS using asm but i get this error:
/tmp/ccyx8l5L.s:1236: Error: too many memory references for 'mov'
/tmp/ccyx8l5L.s:1240: Error: too many memory references for 'out'
/tmp/ccyx8l5L.s:1244: Error: too many memory references for 'in'
/tmp/ccyx8l5L.s:1252: Error: too many memory references for 'mov'
and here is the code:
for (index = 0; index < 128; index++) {
asm("cli");
asm("mov al, index"); /* Move index address*/
asm("out 0x70,al"); /* Copy address to CMOS register*/
/* some kind of real delay here is probably best */
asm("in al,0x71"); /* Fetch 1 byte to al*/
asm("sti"); /* Enable interrupts*/
asm("mov tvalue,al");
array[index] = tvalue;
}
I am using gcc to compile

gcc uses AT&T syntax.
Compile with -masm=intel

As Janycz points out, your code uses intel syntax for the assembler, while gcc (by default) expects at&t. If you are using gcc, how about something like this:
int main()
{
unsigned char array[128];
for (unsigned char index = 0; index < 128; index++) {
asm("cli\n\t"
"out %%al,$0x70\n\t"
/* some kind of real delay here is probably best */
"in $0x71, %%al\n\t" /* Fetch 1 byte to al*/
"sti" /* Enable interrupts*/
: "=a" (array[index]) : "a" (index) );
}
}
This minimizes the amount of asm you have to write (4 lines vs your 6) and allows the compiler to perform more optimizations. See the docs for gcc's inline asm to understand how all this works.
I haven't tried running this, since it won't run on protected mode operating systems (like Windows). In and Out are protected instructions and can't be used by user-mode applications. I assume you are running this on DOS or something?

Related

How do I call hex data stored in an array with inline assembly?

I have an OS project that I am working on and I am trying to call data that I have read from the disk in C with inline assembly.
I have already tried reading the code and executing it with the assembly call instruction, using inline assembly.
void driveLoop() {
uint16_t sectors = 31;
uint16_t sector = 0;
uint16_t basesector = 40000;
uint32_t i = 40031;
uint16_t code[sectors][256];
int x = 0;
while(x==0) {
read(i);
for (int p=0; p < 256; p++) {
if (readOut[p] == 0) {
} else {
x = 1;
//kprint_int(i);
}
}
i++;
}
kprint("Found sector!\n");
kprint("Loading OS into memory...\n");
for (sector=0; sector<sectors; sector++) {
read(basesector+sector);
for (int p=0; p<256; p++) {
code[sector][p] = readOut[p];
}
}
kprint("Done loading.\n");
kprint("Attempting to call...\n");
asm volatile("call (%0)" : : "r" (&code));
When the inline assembly is called I expect it to run the code from the sectors I read from the "disk" (this is in a VM, because its a hobby OS). What it does instead is it just hangs.
I probably don't much understand how variables, arrays, and assembly work, so if you could fill me in, that would be nice.
EDIT: The data I am reading from the disk is a binary file that was added
to the disk image file with
cat kernel.bin >> disk.img
and the kernel.bin is compiled with
i686-elf-ld -o kernel.bin -Ttext 0x4C4B40 *insert .o files here* --oformat binary
What it does instead is it just hangs.
Run your OS inside BOCHS so you can use BOCHS's built-in debugger to see exactly where it's stuck.
Being able to debug lockups, including with interrupts disabled, is probably very useful...
asm volatile("call (%0)" : : "r" (&code)); is unsafe because of missing clobbers.
But even worse than that it will load a new EIP value from the first 4 bytes of the array, instead of setting EIP to that address. (Unless the data you're loading is an array of pointers, not actual machine code?)
You have the %0 in parentheses, so it's an addressing mode. The assembler will warn you about an indirect call without *, but will assemble it like call *(%eax), with EAX = the address of code[0][0]. You actually want a call *%eax or whatever register the compiler chooses, register-indirect not memory-indirect.
&code and code are both just a pointer to the start of the array; &code doesn't create an anonymous pointer object storing the address of another address. &code takes the address of the array as a whole. code in this context "decays" to a pointer to the first object.
https://gcc.gnu.org/wiki/DontUseInlineAsm (for this).
You can get the compiler to emit a call instruction by casting the pointer to a function pointer.
__builtin___clear_cache(&code[0][0], &code[30][255]); // don't optimize away stores into the buffer
void (*fptr)(void) = (void*)code; // casting to void* instead of the actual target type is simpler
fptr();
That will compile (with optimization enabled) to something like lea 16(%esp), %eax / call *%eax, for 32-bit x86, because your code[][] buffer is an array on the stack.
Or to have it emit a jmp instead, do it at the end of a void function, or return funcptr(); in a non-void function, so the compiler can optimize the call/ret into a jmp tailcall.
If it doesn't return, you can declare it with __attribute__((noreturn)).
Make sure the memory page / segment is executable. (Your uint16_t code[]; is a local, so gcc will allocate it on the stack. This might not be what you want. The size is a compile-time constant so you could make it static, but if you do that for other arrays in other sibling functions (not parent or child), then you lose out on the ability to reuse a big chunk of stack memory for different arrays.)
This is much better than your unsafe inline asm. (You forgot a "memory" clobber, so nothing tells the compiler that your asm actually reads the pointed-to memory). Also, you forgot to declare any register clobbers; presumably the block of code you loaded will have clobbered some registers if it returns, unless it's written to save/restore everything.
In GNU C you do need to use __builtin__clear_cache when casting a data pointer to a function pointer. On x86 it doesn't actually clear any cache, it's telling the compiler that the stores to that memory are not dead because it's going to be read by execution. See How does __builtin___clear_cache work?
Without that, gcc could optimize away the copying into uint16_t code[sectors][256]; because it looks like a dead store. (Just like with your current inline asm which only asks for the pointer in a register.)
As a bonus, this part of your OS becomes portable to other architectures, including ones like ARM without coherent instruction caches where that builtin expands to a actual instructions. (On x86 it purely affects the optimizer).
read(basesector+sector);
It would probably be a good idea for your read function to take a destination pointer to read into, so you don't need to bounce data through your readOut buffer.
Also, I don't see why you'd want to declare your code as a 2D array; sectors are an artifact of how you're doing your disk I/O, not relevant to using the code after it's loaded. The sector-at-a-time thing should only be in the code for the loop that loads the data, not visible in other parts of your program.
char code[sectors * 512]; would be good.

How to do computations with addresses at compile/linking time?

I wrote some code for initializing the IDT, which stores 32-bit addresses in two non-adjacent 16-bit halves. The IDT can be stored anywhere, and you tell the CPU where by running the LIDT instruction.
This is the code for initializing the table:
void idt_init(void) {
/* Unfortunately, we can't write this as loops. The first option,
* initializing the IDT with the addresses, here looping over it, and
* reinitializing the descriptors didn't work because assigning a
* a uintptr_t (from (uintptr_t) handler_func) to a descr (a.k.a.
* uint64_t), according to the compiler, "isn't computable at load
* time."
* The second option, storing the addresses as a local array, simply is
* inefficient (took 0.020ms more when profiling with the "time" command
* line program!).
* The third option, storing the addresses as a static local array,
* consumes too much space (the array will probably never be used again
* during the whole kernel runtime).
* But IF my argument against the third option will be invalidated in
* the future, THEN it's the best option I think. */
/* Initialize descriptors of exception handlers. */
idt[EX_DE_VEC] = idt_trap(ex_de);
idt[EX_DB_VEC] = idt_trap(ex_db);
idt[EX_NMI_VEC] = idt_trap(ex_nmi);
idt[EX_BP_VEC] = idt_trap(ex_bp);
idt[EX_OF_VEC] = idt_trap(ex_of);
idt[EX_BR_VEC] = idt_trap(ex_br);
idt[EX_UD_VEC] = idt_trap(ex_ud);
idt[EX_NM_VEC] = idt_trap(ex_nm);
idt[EX_DF_VEC] = idt_trap(ex_df);
idt[9] = idt_trap(ex_res); /* unused Coprocessor Segment Overrun */
idt[EX_TS_VEC] = idt_trap(ex_ts);
idt[EX_NP_VEC] = idt_trap(ex_np);
idt[EX_SS_VEC] = idt_trap(ex_ss);
idt[EX_GP_VEC] = idt_trap(ex_gp);
idt[EX_PF_VEC] = idt_trap(ex_pf);
idt[15] = idt_trap(ex_res);
idt[EX_MF_VEC] = idt_trap(ex_mf);
idt[EX_AC_VEC] = idt_trap(ex_ac);
idt[EX_MC_VEC] = idt_trap(ex_mc);
idt[EX_XM_VEC] = idt_trap(ex_xm);
idt[EX_VE_VEC] = idt_trap(ex_ve);
/* Initialize descriptors of reserved exceptions.
* Thankfully we compile with -std=c11, so declarations within
* for-loops are possible! */
for (size_t i = 21; i < 32; ++i)
idt[i] = idt_trap(ex_res);
/* Initialize descriptors of hardware interrupt handlers (ISRs). */
idt[INT_8253_VEC] = idt_int(int_8253);
idt[INT_8042_VEC] = idt_int(int_8042);
idt[INT_CASC_VEC] = idt_int(int_casc);
idt[INT_SERIAL2_VEC] = idt_int(int_serial2);
idt[INT_SERIAL1_VEC] = idt_int(int_serial1);
idt[INT_PARALL2_VEC] = idt_int(int_parall2);
idt[INT_FLOPPY_VEC] = idt_int(int_floppy);
idt[INT_PARALL1_VEC] = idt_int(int_parall1);
idt[INT_RTC_VEC] = idt_int(int_rtc);
idt[INT_ACPI_VEC] = idt_int(int_acpi);
idt[INT_OPEN2_VEC] = idt_int(int_open2);
idt[INT_OPEN1_VEC] = idt_int(int_open1);
idt[INT_MOUSE_VEC] = idt_int(int_mouse);
idt[INT_FPU_VEC] = idt_int(int_fpu);
idt[INT_PRIM_ATA_VEC] = idt_int(int_prim_ata);
idt[INT_SEC_ATA_VEC] = idt_int(int_sec_ata);
for (size_t i = 0x30; i < IDT_SIZE; ++i)
idt[i] = idt_trap(ex_res);
}
The macros idt_trap and idt_int, and are defined as follows:
#define idt_entry(off, type, priv) \
((descr) (uintptr_t) (off) & 0xffff) | ((descr) (KERN_CODE & 0xff) << \
0x10) | ((descr) ((type) & 0x0f) << 0x28) | ((descr) ((priv) & \
0x03) << 0x2d) | (descr) 0x800000000000 | \
((descr) ((uintptr_t) (off) & 0xffff0000) << 0x30)
#define idt_int(off) idt_entry(off, 0x0e, 0x00)
#define idt_trap(off) idt_entry(off, 0x0f, 0x00)
idt is an array of uint64_t, so these macros are implicitly cast to that type. uintptr_t is the type guaranteed to be capable of holding pointer values as integers and on 32-bit systems usually 32 bits wide. (A 64-bit IDT has 16-byte entries; this code is for 32-bit).
I get the warning that the initializer element is not constant due to the address modification in play.
It is absolutely sure that the address is known at linking time.
Is there anything I can do to make this work? Making the idt array automatic would work but this would require the whole kernel to run in the context of one function and this would be some bad hassle, I think.
I could make this work by some additional work at runtime (as Linux 0.01 also does) but it just annoys me that something technically feasible at linking time is actually infeasible.
Related: Solution needed for building a static IDT and GDT at assemble/compile/link time - a linker script for ld can shift and mask to break up link-time-constant addresses. No earlier step has the final addresses, and relocation entries are limited in what they can represent in a .o.
The main problem is that function addresses are link-time constants, not strictly compile time constants. The compiler can't just get 32b binary integers and stick that into the data segment in two separate pieces. Instead, it has to use the object file format to indicate to the linker where it should fill in the final value (+ offset) of which symbol when linking is done. The common cases are as an immediate operand to an instruction, a displacement in an effective address, or a value in the data section. (But in all those cases it's still just filling in 32-bit absolute address so all 3 use the same ELF relocation type. There's a different relocation for relative displacements for jump / call offsets.)
It would have been possible for ELF to have been designed to store a symbol reference to be substituted at link time with a complex function of an address (or at least high / low halves like on MIPS for lui $t0, %hi(symbol) / ori $t0, $t0, %lo(symbol) to build address constants from two 16-bit immediates). But in fact the only function allowed is addition/subtraction, for use in things like mov eax, [ext_symbol + 16].
It is of course possible for your OS kernel binary to have a static IDT with fully resolved addresses at build time, so all you need to do at runtime is execute a single lidt instruction. However, the standard
build toolchain is an obstacle. You probably can't achieve this without post-processing your executable.
e.g. you could write it this way, to produce a table with the full padding in the final binary, so the data can be shuffled in-place:
#include <stdint.h>
#define PACKED __attribute__((packed))
// Note, this is the 32-bit format. 64-bit is larger
typedef union idt_entry {
// we will postprocess the linker output to have this format
// (or convert at runtime)
struct PACKED runtime { // from OSdev wiki
uint16_t offset_1; // offset bits 0..15
uint16_t selector; // a code segment selector in GDT or LDT
uint8_t zero; // unused, set to 0
uint8_t type_attr; // type and attributes, see below
uint16_t offset_2; // offset bits 16..31
} rt;
// linker output will be in this format
struct PACKED compiletime {
void *ptr; // offset bits 0..31
uint8_t zero;
uint8_t type_attr;
uint16_t selector; // to be swapped with the high16 of ptr
} ct;
} idt_entry;
// #define idt_ct_entry(off, type, priv) { .ptr = off, .type_attr = type, .selector = priv }
#define idt_ct_trap(off) { .ct = { .ptr = off, .type_attr = 0x0f, .selector = 0x00 } }
// generate an entry in compile-time format
extern void ex_de(); // these are the raw interrupt handlers, written in ASM
extern void ex_db(); // they have to save/restore *all* registers, and end with iret, rather than the usual C ABI.
// it might be easier to use asm macros to create this static data,
// just so it can be in the same file and you don't need cross-file prototypes / declarations
// (but all the same limitations about link-time constants apply)
static idt_entry idt[] = {
idt_ct_trap(ex_de),
idt_ct_trap(ex_db),
// ...
};
// having this static probably takes less space than instructions to write it on the fly
// but not much more. It would be easy to make a lidt function that took a struct pointer.
static const struct PACKED idt_ptr {
uint16_t len; // encoded as bytes - 1, so 0xffff means 65536
void *ptr;
} idt_ptr = { sizeof(idt) - 1, idt };
/****** functions *********/
// inline
void load_static_idt(void) {
asm volatile ("lidt %0"
: // no outputs
: "m" (idt_ptr));
// memory operand, instead of writing the addressing mode ourself, allows a RIP-relative addressing mode in 64bit mode
// also allows it to work with -masm=intel or not.
}
// Do this once at at run-time
// **OR** run this to pre-process the binary, after link time, as part of your build
void idt_convert_to_runtime(void) {
#ifdef DEBUG
static char already_done = 0; // make sure this only runs once
if (already_done)
error;
already_done = 1;
#endif
const int count = sizeof idt / sizeof idt[0];
for (int i=0 ; i<count ; i++) {
uint16_t tmp1 = idt[i].rt.selector;
uint16_t tmp2 = idt[i].rt.offset_2;
idt[i].rt.offset_2 = tmp1;
idt[i].rt.selector = tmp2;
// or do this swap in fewer insns with SSE or MMX pshufw, but using vector instructions before setting up the IDT may be insane.
}
}
This does compile. See a diff of the -m32 and -m64 asm output on the Godbolt compiler explorer. Look at the layout in the data section (note that .value is a synonym for .short, and is 16 bits.) (But note that the IDT table format is different for 64-bit mode.)
I think I have the size calculation correct (bytes - 1), as documented in http://wiki.osdev.org/Interrupt_Descriptor_Table. Minimum value 100h bytes long (encoded as 0x99). See also https://en.wikibooks.org/wiki/X86_Assembly/Global_Descriptor_Table. (lgdt size/pointer works the same way, although the table itself has a different format.)
The other option, instead of having the IDT static in the data section, is to have it in the bss section, with the data stored as immediate constants in a function that will initialize it (or in an array read by that function).
Either way, that function (and its data) can be in a .init section whose memory you re-use after it's done. (Linux does this to reclaim memory from code and data that's only needed once, at startup.) This would give you the optimal tradeoff of small binary size (since 32b addresses are smaller than 64b IDT entries), and no runtime memory wasted on code to set up the IDT. A small loop that runs once at startup is negligible CPU time. (The version on Godbolt fully unrolls because I only have 2 entries, and it embeds the address into each instruction as a 32-bit immediate, even with -Os. With a large enough table (just copy/paste to duplicate a line) you get a compact loop even at -O3. The threshold is lower for -Os.)
Without memory-reuse haxx, probably a tight loop to rewrite 64b entries in place is the way to go. Doing it at build time would be even better, but then you'd need a custom tool to run the tranformation on the kernel binary.
Having the data stored in immediates sounds good in theory, but the code for each entry would probably total more than 64b, because it couldn't loop. The code to split an address into two would have to be fully unrolled (or placed in a function and called). Even if you had a loop to store all the same-for-multiple-entries stuff, each pointer would need a mov r32, imm32 to get the address in a register, then mov word [idt+i + 0], ax / shr eax, 16 / mov word [idt+i + 6], ax. That's a lot of machine-code bytes.
One way would be to use an intermediate jump table that is located at a fixed address. You could initialize the idt with addresses of the locations in this table (which will be compile time constant). The locations in the jump table will contain jump instructions to the actual isr routines.
The dispatch to an isr will be indirect as follows:
trap -> jump to intermediate address in the idt -> jump to isr
One way to create a jump table at a fixed address is as follows.
Step 1: Put jump table in a section
// this is a jump table at a fixed address
void jump(void) __attribute__((section(".si.idt")));
void jump(void) {
asm("jmp isr0"); // can also be asm("call ...") depending on need
asm("jmp isr1");
asm("jmp isr2");
}
Step 2: Instruct the linker to locate the section at a fixed address
SECTIONS
{
.so.idt 0x600000 :
{
*(.si.idt)
}
}
Put this in the linker script right after the .text section. This will make sure that the executable code in the table will go into a executable memory region.
You can instruct the linker to use your script as follows using the --script option in the Makefile.
LDFLAGS += -Wl,--script=my_script.lds
The following macro gives you the address of the location which contains the jump (or call) instruction to the corresponding isr.
// initialize the idt at compile time with const values
// you can find a cleaner way to generate offsets
#define JUMP_ADDR(off) ((char*)0x600000 + 4 + (off * 5))
You will then initialize the idt as follows using modified macros.
// your real idt will be initialized as follows
#define idt_entry(addr, type, priv) \
( \
((descr) (uintptr_t) (addr) & 0xffff) | \
((descr) (KERN_CODE & 0xff) << 0x10) | \
((descr) ((type) & 0x0f) << 0x28) | \
((descr) ((priv) & 0x03) << 0x2d) | \
((descr) 0x1 << 0x2F) | \
((descr) ((uintptr_t) (addr) & 0xffff0000) << 0x30) \
)
#define idt_int(off) idt_entry(JUMP_ADDR(off), 0x0e, 0x00)
#define idt_trap(off) idt_entry(JUMP_ADDR(off), 0x0f, 0x00)
descr idt[] =
{
...
idt_trap(ex_de),
...
idt_int(int_casc),
...
};
A demo working example is below, which shows dispatch to a isr with a non-fixed address from a instruction at a fixed address.
#include <stdio.h>
// dummy isrs for demo
void isr0(void) {
printf("==== isr0\n");
}
void isr1(void) {
printf("==== isr1\n");
}
void isr2(void) {
printf("==== isr2\n");
}
// this is a jump table at a fixed address
void jump(void) __attribute__((section(".si.idt")));
void jump(void) {
asm("jmp isr0"); // can be asm("call ...")
asm("jmp isr1");
asm("jmp isr2");
}
// initialize the idt at compile time with const values
// you can find a cleaner way to generate offsets
#define JUMP_ADDR(off) ((char*)0x600000 + 4 + (off * 5))
// dummy idt for demo
// see below for the real idt
char* idt[] =
{
JUMP_ADDR(0),
JUMP_ADDR(1),
JUMP_ADDR(2),
};
int main(int argc, char* argv[]) {
int trap;
char* addr = idt[trap = argc - 1];
printf("==== idt[%d]=%p\n", trap, addr);
asm("jmp *%0\n" : :"m"(addr));
}

Errors using inline assembly in C

I'm trying my hand at assembly in order to use vector operations, which I've never really used before, and I'm admittedly having a bit of trouble grasping some of the syntax.
The relevant code is below.
unit16_t asdf[4];
asdf[0] = 1;
asdf[1] = 2;
asdf[2] = 3;
asdf[3] = 4;
uint16_t other = 3;
__asm__("movq %0, %%mm0"
:
: "m" (asdf));
__asm__("pcmpeqw %0, %%mm0"
:
: "r" (other));
__asm__("movq %%mm0, %0" : "=m" (asdf));
printf("%u %u %u %u\n", asdf[0], asdf[1], asdf[2], asdf[3]);
In this simple example, I'm trying to do a 16-bit compare of "3" to each element in the array. I would hope that the output would be "0 0 65535 0". But it won't even assemble.
The first assembly instruction gives me the following error:
error: memory input 0 is not directly addressable
The second instruction gives me a different error:
Error: suffix or operands invalid for `pcmpeqw'
Any help would be appreciated.
You can't use registers directly in gcc asm statements and expect them to match up with anything in other asm statements -- the optimizer moves things around. Instead, you need to declare variables of the appropriate type and use constraints to force those variables into the right kind of register for the instruction(s) you are using.
The relevant constraints for MMX/SSE are x for xmm registers and y for mmx registers. For your example, you can do:
#include <stdint.h>
#include <stdio.h>
typedef union xmmreg {
uint8_t b[16];
uint16_t w[8];
uint32_t d[4];
uint64_t q[2];
} xmmreg;
int main() {
xmmreg v1, v2;
v1.w[0] = 1;
v1.w[1] = 2;
v1.w[2] = 3;
v1.w[3] = 4;
v2.w[0] = v2.w[1] = v2.w[2] = v2.w[3] = 3;
asm("pcmpeqw %1,%0" : "+x"(v1) : "x"(v2));
printf("%u %u %u %u\n", v1.w[0], v1.w[1], v1.w[2], v1.w[3]);
}
Note that you need to explicitly replicate the 3 across all the relevant elements of the second vector.
From intel reference manual:
PCMPEQW mm, mm/m64 Compare packed words in mm/m64 and mm for equality.
PCMPEQW xmm1, xmm2/m128 Compare packed words in xmm2/m128 and xmm1 for equality.
Your pcmpeqw uses an "r" register which is wrong. Only "mm" and "m64" registers
valter
The code above failed when expanding the asm(), it never tried to even assemble anything. In this case, you are trying to use the zeroth argument (%0), but you didn't give any.
Check out the GCC Inline assembler HOWTO, or read the relevant chapter of your local GCC documentation.
He's right, the optimizer is changing register contents. Switching to intrinsics and using volatile to keep things a little more in place might help.

How to give hint to gcc about loop count

Knowing the number of iteration a loop will go through allows the compiler to do some optimization. Consider for instance the two loops below :
Unknown iteration count :
static void bitreverse(vbuf_desc * vbuf)
{
unsigned int idx = 0;
unsigned char * img = vbuf->usrptr;
while(idx < vbuf->bytesused) {
img[idx] = bitrev[img[idx]];
idx++;
}
}
Known iteration count
static void bitreverse(vbuf_desc * vbuf)
{
unsigned int idx = 0;
unsigned char * img = vbuf->usrptr;
while(idx < 1280*400) {
img[idx] = bitrev[img[idx]];
idx++;
}
}
The second version will compile to faster code, because it will be unrolled twice (on ARM with gcc 4.6.3 and -O2 at least). Is there a way to make assertion on the loop count that gcc will take into account when optimizing ?
There is hot attribute on functions to give a hint to compiler about hot-spot: http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html. Just abb before your function:
static void bitreverse(vbuf_desc * vbuf) __attribute__ ((pure));
Here the docs about 'hot' from gcc:
hot The hot attribute on a function is used to inform the compiler
that the function is a hot spot of the compiled program. The function
is optimized more aggressively and on many target it is placed into
special subsection of the text section so all hot functions appears
close together improving locality. When profile feedback is available,
via -fprofile-use, hot functions are automatically detected and this
attribute is ignored.
The hot attribute on functions is not implemented in GCC versions
earlier than 4.3.
The hot attribute on a label is used to inform the compiler that path
following the label are more likely than paths that are not so
annotated. This attribute is used in cases where __builtin_expect
cannot be used, for instance with computed goto or asm goto.
The hot attribute on labels is not implemented in GCC versions earlier
than 4.8.
Also you can try to add __builtin_expect around your idx < vbuf->bytesused - it will be hint that in most cases the expression is true.
In both cases I'm not sure that your loop will be optimized.
Alternatively you can try profile-guided optimization. Build profile-generating version of program with -fprofile-generate; run it on target, copy profile data to build-host and rebuild with -fprofile-use. This will give a lot of information to compiler.
In some compilers (not in GCC) there are loop pragmas, including "#pragma loop count (N)" and "#pragma unroll (M)", e.g. in Intel; unroll in IBM; vectorizing pragmas in MSVC
ARM compiler (armcc) also has some loop pragmas: unroll(n) (via 1):
Loop Unrolling: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/CJACACFE.html and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/CJAHJDAB.html
and __promise intrinsic:
Using __promise to improve vectorization
The __promise(expr) intrinsic is a promise to the compiler that a given expression is nonzero. This enables the compiler to improve vectorization by optimizing away code that, based on the promise you have made, is redundant.
The disassembled output of Example 3.21 shows the difference that __promise makes, reducing the disassembly to a simple vectorized loop by the removal of a scalar fix-up loop.
Example 3.21. Using __promise(expr) to improve vectorization code
void f(int *x, int n)
{
int i;
__promise((n > 0) && ((n&7)==0));
for (i=0; i<n;i++) x[i]++;
}
You can actually specify the exact count with __builtin_expect, like this:
while (idx < __builtin_expect(vbuf->bytesused, 1280*400)) {
This tells gcc that vbuf->bytesused is expected to be 1280*400 at runtime.
Alas, this does nothing for optimization with current gcc version. Haven't tried with 4.8, though.
Edit: Just realized that every standard C compiler has a way to exactly specify the loop count, via assert. Since the assert
#include <assert.h>
...
assert(loop_count == 4096);
for (i = 0; i < loop_count; i++) ...
will call exit() or abort() if the condition is not true, any compiler with value propagation will know the exact value of loop_count. I always thought that this would be the most elegant and standard-conforming way to give such optimization hints. Now, I want a C compiler that actually uses this information.
Note that if you want to make this faster, bytewise unrolling might be less effective than using a wider lookup table. A 16-bit table would occupy 128K, and thus often fit into the CPU cache. If the data is not completely random, an even wider table (3 bytes) might be effective.
2-byte example:
unsigned short *bitrev2;
...
for (idx = 0; idx < vbuf->bytesused; idx += 2) {
*(unsigned short *)(&img[idx]) = bitrev2[*(unsigned short *)(&img[idx]);
}
This is an optimization the compiler can't perform, regardless of the information you give it.

Can this be atomically executed?

I would like to know whether it is possible to ensure line is atomically executed, given that it could be executed by both the ISR and Main context. I'm working on an ARM9 (LPC313x) and using RealView 4 (armcc).
foo() {
..
stack_var = ++volatile_var; // line
..
}
I'm looking for any routine like _atomic_ for C166, direct assembly code, etc. I would prefer not to have to disable the interrupts.
Thank you very much.
No, I don't think that you ever can expect ++volatile_var to be atomic, even if you don't have the assignment. Use a proper atomic primitive for that. If your compiler doesn't provide such an extension you easily find short inline assembler for that on the web. The assembler instructions are call ldrex and strex for atomic exchange on arm, I think.
Edit: it seems that the specific processor type that is asked for in the question does not implement these instructions.
Edit: The following should work with gcc, for another compiler one probably has to adapt the __asm__ parts.
inline
size_t arm_ldrex(size_t volatile*ptr) {
size_t ret;
__asm__ volatile ("ldrex %0,[%1]\t# load exclusive\n"
: "=&r" (ret)
: "r" (ptr)
: "cc", "memory"
);
return ret;
}
inline
_Bool arm_strex(size_t volatile*ptr, size_t val) {
size_t error;
__asm__ volatile ("strex %0,%1,[%2]\t# store exclusive\n"
: "=&r" (error)
: "r" (val), "r" (ptr)
: "cc", "memory"
);
return !error;
}
inline
size_t atomic_add_fetch(size_t volatile *object, size_t operand) {
for (;;) {
size_t oldval = arm_ldrex(object);
size_t newval = oldval + operand;
if (arm_strex(object, newval)) return newval;
}
}
From a quick look, the C166 _atomic_ macro seems to utilize an instruction that effectively masks interrupts for the duration of a specified number of instructions.
There is nothing directly corresponding to that in the ARM architecture.
You could of course use the swp instruction (or __swp intrinsic in the RealView toolchain) to implement a lock around the critical section. ldrex/strex mentioned in another answer do not exist in ARM architecture version 5, which includes the ARM9 processors.
http://infocenter.arm.com/help/topic/com.arm.doc.dui0491c/CJAHDCHB.html and http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/Chdbbbai.html respectively.
A simplistic lock implementation around this (using the RealView toolchain) would be:
{
/* Loop until lock acquired */
while (__swp(LOCKED, &lockvar) == LOCKED);
..
/* Critical section */
..
lockvar = UNLOCKED;
}
However, this will lead to deadlock in the ISR context when the Main thread is holding the lock.
I think masking interrupts around the operation is likely to be the least hairy solution, although if your Main context is executing in User mode it will require a system call to implement.

Resources