GCC 4.7.2 Optimization Problems

Summary
I'm porting ST's USB OTG Library to a custom STM32F4 board using the latest version of the Sourcery CodeBench Lite toolchain (GCC arm-none-eabi 4.7.2).
When I compile the code with -O0, the program runs fine. When I compile with -O1 or -O2, it fails. By "fails" I mean it just stops: no hard fault, nothing. (Obviously it is doing something, but I don't have an emulator to debug with and find out. My hard fault handler is not being called.)
Details
I'm trying to make a call to the following function...
void USBD_Init(USB_OTG_CORE_HANDLE *pdev,
               USB_OTG_CORE_ID_TypeDef coreID,
               USBD_DEVICE *pDevice,
               USBD_Class_cb_TypeDef *class_cb,
               USBD_Usr_cb_TypeDef *usr_cb);
...but it doesn't seem to make it into the function body. (Is this a symptom of "stack-smashing"?)
The structures passed to this function have the following definitions:
typedef struct USB_OTG_handle
{
    USB_OTG_CORE_CFGS cfg;
    USB_OTG_CORE_REGS regs;
    DCD_DEV dev;
} USB_OTG_CORE_HANDLE, *PUSB_OTG_CORE_HANDLE;
typedef enum
{
    USB_OTG_HS_CORE_ID = 0,
    USB_OTG_FS_CORE_ID = 1
} USB_OTG_CORE_ID_TypeDef;
typedef struct _Device_TypeDef
{
    uint8_t *(*GetDeviceDescriptor)(uint8_t speed, uint16_t *length);
    uint8_t *(*GetLangIDStrDescriptor)(uint8_t speed, uint16_t *length);
    uint8_t *(*GetManufacturerStrDescriptor)(uint8_t speed, uint16_t *length);
    uint8_t *(*GetProductStrDescriptor)(uint8_t speed, uint16_t *length);
    uint8_t *(*GetSerialStrDescriptor)(uint8_t speed, uint16_t *length);
    uint8_t *(*GetConfigurationStrDescriptor)(uint8_t speed, uint16_t *length);
    uint8_t *(*GetInterfaceStrDescriptor)(uint8_t speed, uint16_t *length);
} USBD_DEVICE, *pUSBD_DEVICE;
typedef struct _Device_cb
{
    uint8_t (*Init)(void *pdev, uint8_t cfgidx);
    uint8_t (*DeInit)(void *pdev, uint8_t cfgidx);
    /* Control Endpoints */
    uint8_t (*Setup)(void *pdev, USB_SETUP_REQ *req);
    uint8_t (*EP0_TxSent)(void *pdev);
    uint8_t (*EP0_RxReady)(void *pdev);
    /* Class Specific Endpoints */
    uint8_t (*DataIn)(void *pdev, uint8_t epnum);
    uint8_t (*DataOut)(void *pdev, uint8_t epnum);
    uint8_t (*SOF)(void *pdev);
    uint8_t (*IsoINIncomplete)(void *pdev);
    uint8_t (*IsoOUTIncomplete)(void *pdev);
    uint8_t *(*GetConfigDescriptor)(uint8_t speed, uint16_t *length);
    uint8_t *(*GetUsrStrDescriptor)(uint8_t speed, uint8_t index, uint16_t *length);
} USBD_Class_cb_TypeDef;
typedef struct _USBD_USR_PROP
{
    void (*Init)(void);
    void (*DeviceReset)(uint8_t speed);
    void (*DeviceConfigured)(void);
    void (*DeviceSuspended)(void);
    void (*DeviceResumed)(void);
    void (*DeviceConnected)(void);
    void (*DeviceDisconnected)(void);
} USBD_Usr_cb_TypeDef;
I've tried to include all the source code relevant to this problem. If you want to see the entire source code you can download it here: http://www.st.com/st-web-ui/static/active/en/st_prod_software_internet/resource/technical/software/firmware/stm32_f105-07_f2_f4_usb-host-device_lib.zip
Solutions Attempted
I tried playing with #pragma GCC optimize ("O0"), __attribute__((optimize("O0"))), and declaring certain definitions as volatile, but nothing worked. I'd rather just modify the code to make it play nicely with the optimizer anyway.
Question
How can I modify this code to make it play nice with GCC's optimizer?

There doesn't seem to be anything wrong with the code you showed, so this answer will be more general.
What are typical errors with "close to hardware" code that works properly unoptimized and fails with higher optimization levels?
Think about the differences between -O0 and -O1/-O2. Among other things, the optimizer unrolls loops (which doesn't seem dangerous here), tries to hold values in registers as long as possible, eliminates dead code, and reorders instructions.
Improved register usage typically leads to problems at higher optimization levels if hardware registers that can change at any time aren't properly declared volatile (see PokyBrain's comment above). The optimized code will try to hold values in CPU registers as long as possible, so your program fails to notice changes on the hardware side. Make sure to declare hardware registers volatile.
Dead code elimination will likely lead to problems if you need to read a hardware register purely for its side effect on the hardware, which the compiler does not know about, and you don't do anything with the value you just read. Such a read may be optimized away if it does not go through a volatile-qualified object. Casting the dummy read to (void) documents that the value is discarded on purpose and silences the warning the compiler would otherwise issue. Make sure dummy reads go through volatile and are cast to (void).
Instruction reordering: if you need to access different hardware registers in a certain sequence to produce the desired result, keep in mind that the compiler preserves the order of volatile accesses relative to each other, but it is free to move non-volatile accesses across them as it sees fit. You will need to insert memory barriers into your code to enforce the required access sequence (__asm__ __volatile__("" ::: "memory");). On cores that can reorder memory accesses in hardware, a CPU barrier instruction may be needed as well. Make sure to add memory barriers where needed.
Although unlikely, it might still be the case that you found a compiler bug. Optimization is not an easy job, especially when it comes close to hardware. It might be worth a peek into the gcc bug database.
If all this doesn't help, sometimes you just can't avoid digging into the generated assembler code to make sure it's doing what it is supposed to do.
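To make the three points above concrete, here is a minimal sketch. It is not taken from ST's library: the fake_periph_t layout, the PERIPH macro and start_transfer() are hypothetical names, and the "peripheral" is ordinary memory so the code can be compiled and run on a host. On a real target the struct would sit at an address taken from the reference manual.

```c
#include <stdint.h>

/* Hypothetical register block (an assumption for illustration only). */
typedef struct {
    volatile uint32_t status;   /* on real hardware, reading may clear a flag */
    volatile uint32_t control;
    volatile uint32_t data;
} fake_periph_t;

static fake_periph_t fake_periph;      /* stand-in for a memory-mapped device */
#define PERIPH (&fake_periph)

/* Compiler-only barrier: GCC will not move memory accesses across it. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

uint32_t plain_buffer[4];              /* non-volatile data, e.g. a DMA buffer */

void start_transfer(uint32_t value)
{
    plain_buffer[0] = value;   /* ordinary store: the optimizer may move it... */
    COMPILER_BARRIER();        /* ...but not past this point */
    PERIPH->data = value;      /* volatile stores are emitted in program order */
    PERIPH->control = 1u;
    (void)PERIPH->status;      /* dummy read: kept because status is volatile */
}
```

The barrier here only constrains the compiler. Whether a hardware barrier instruction is also needed depends on the core and the bus, which this sketch does not try to model.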

Related

How to tell gcc to not align function parameters on the stack?

I am trying to decompile an executable for the 68000 processor into C code, replacing the original subroutines with C functions one by one.
The problem I faced is that I don't know how to make gcc use the calling convention that matches the one used in the original program. I need the parameters on the stack to be packed, not aligned.
Let's say we have the following function
int fun(char arg1, short arg2, int arg3) {
    return arg1 + arg2 + arg3;
}
If we compile it with
gcc -m68000 -Os -fomit-frame-pointer -S source.c
we get the following output
fun:
        move.b 7(%sp),%d0
        ext.w %d0
        move.w 10(%sp),%a0
        lea (%a0,%d0.w),%a0
        move.l %a0,%d0
        add.l 12(%sp),%d0
        rts
As we can see, the compiler assumed that the parameters are at addresses 7(%sp), 10(%sp) and 12(%sp), but to work with the original program they need to be at addresses 4(%sp), 5(%sp) and 7(%sp).
One possible solution is to write the function in the following way (the processor is big-endian):
int fun(int bytes4to7, int bytes8to11) {
    char arg1 = bytes4to7>>24;
    short arg2 = (bytes4to7>>8)&0xffff;
    int arg3 = ((bytes4to7&0xff)<<24) | (bytes8to11>>8);
    return arg1 + arg2 + arg3;
}
However, the code looks messy, and I was wondering: is there a way to both keep the code clean and achieve the desired result?
UPD: I made a mistake. The offsets I'm looking for are actually 5(%sp), 6(%sp) and 8(%sp): the chars should be aligned like the shorts, but the shorts and the ints are still packed.
Hopefully, this doesn't change the essence of the question.
UPD 2: It turns out that the 68000 C Compiler by Sierra Systems gives the described offsets (as in UPD, with 2-byte alignment).
However, the question is about tweaking calling conventions in gcc (or perhaps another modern compiler).

Here's a way with a packed struct. I compiled it on an x86 with -m32 and got the desired offsets in the disassembly, so I think it should still work for an mc68000:
typedef struct {
    char arg1;
    short arg2;
    int arg3;
} __attribute__((__packed__)) fun_t;
int
fun(fun_t fun)
{
    return fun.arg1 + fun.arg2 + fun.arg3;
}
But, I think there's probably a still cleaner way. It would require knowing more about the other code that generates such a calling sequence. Do you have the source code for it?
Does the other code have to remain in asm? With the source, you could adjust the offsets in the asm code to be compatible with modern C ABI calling conventions.
I've been programming in C since 1981 and spent years doing mc68000 C and assembler code (for apps, kernel, device drivers), so I'm somewhat familiar with the problem space.
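The layout the packed struct produces can be checked at compile time. The sketch below reuses fun_t from the answer and adds static assertions on the member offsets. It assumes a GCC-compatible compiler with a 2-byte short and a 4-byte int, and it only verifies the in-memory layout of the struct, not how a particular m68k ABI places a struct argument on the stack.

```c
#include <stddef.h>

typedef struct {
    char  arg1;
    short arg2;
    int   arg3;
} __attribute__((__packed__)) fun_t;

/* Offsets relative to the start of the struct: no padding is inserted. */
_Static_assert(offsetof(fun_t, arg1) == 0, "arg1 is the first byte");
_Static_assert(offsetof(fun_t, arg2) == 1, "arg2 follows arg1 directly");
_Static_assert(offsetof(fun_t, arg3) == 3, "arg3 follows arg2 directly");

int fun(fun_t f)
{
    return f.arg1 + f.arg2 + f.arg3;
}
```

If the compiler accepts this file, the members sit at relative offsets 0, 1 and 3, which corresponds to the 4/5/7 pattern from the original question when the struct starts at 4(%sp).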

It's not a gcc 'fault'; it is the 68k architecture that requires the stack to always be aligned on 2 bytes.
So there is simply no way to break 2-byte alignment on the hardware stack.
"but to work with the original program they need to have addresses 4(%sp), 5(%sp) and 7(%sp)"
Accessing word or long values at an odd memory address will immediately trigger an address error exception on the 68000.

To get integral parameters passed with 2-byte alignment instead of 4-byte alignment, you can change the default int size to 16 bits with -mshort. You then need to replace every int in your code with long (if you want it to stay 32 bits wide). The crude way to do that is to also pass -Dint=long to the compiler. Obviously, you will break ABI compatibility with object files compiled with -mno-short (which appears to be the default for gcc).

Compatibility between IAR C/C++ Compiler and GCC

I have a code block written in C for the IAR C/C++ Compiler.
__no_init uint8_t u8ramPhysStart_startUp @ 0x20000000;
__no_init uint8_t u8ramPhysEnd_startUp @ 0x2002FFFF;
__no_init uint8_t u8ramTestStart_startUp @ 0x20004008;
__no_init uint8_t u8ramTestEnd_startUp @ 0x20008008;
#define START_ASM (&u8ramPhysStart_startUp)
#define RAMSTART_STRTUP ((uint32_t)START_ASM)
My goal is to convert it, or rather to make it GCC compatible. To do this, I rewrote the above code as:
unsigned char u8ramPhysStart_startUp __asm("@ 0x20000000");
unsigned char u8ramPhysEnd_startUp __asm("@ 0x2002FFFF");
unsigned char u8ramTestStart_startUp __asm("@ 0x20004008");
unsigned char u8ramTestEnd_startUp __asm("@ 0x20008008");
But after compilation I get the following error:
C:\Users\Pc\AppData\Local\Temp\ccyuCWQT.s: Assembler messages:
C:\Users\Pc\AppData\Local\Temp\ccyuCWQT.s:971: Error: expected symbol name
C:\Users\Pc\AppData\Local\Temp\ccyuCWQT.s:972: Error: expected symbol name
Does anyone know what it means?

I believe the gcc code should be something like
uint8_t __attribute__ ((section(".my_section"))) u8ramPhysStart_startUp;
where .my_section is something you have added to the linker script.
That being said, the only way to make allocation at absolute addresses portable is to stick to pure ISO C:
#define u8ramPhysStart_startUp (*(volatile uint8_t*)0x20000000u)
or in case you want a pointer to an address:
#define u8ramPhysStart_startUp ((volatile uint8_t*)0x20000000u)
The disadvantage of this is that it doesn't actually allocate any memory, but relies on a linker script to handle that part. That's preferable in most cases.
Another disadvantage is that you won't be able to view these "variable" names in a debugger, since they are actually not variables at all. And that's the main reason why some tool chains come up with things like the non-standard @ syntax.
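As a rough sketch of the macro approach, the snippet below keeps the names from the question and routes them through a pointer cast. TARGET_BUILD, REG8, host_ram and mark_ram_start() are hypothetical names introduced only for this example. On a host build the address is redirected to a small array so the code can actually run, and uintptr_t replaces uint32_t because host pointers may be wider than 32 bits.

```c
#include <stdint.h>

/* Treat the byte at a given address as a volatile uint8_t object. */
#define REG8(addr) (*(volatile uint8_t *)(addr))

#ifdef TARGET_BUILD                    /* hypothetical build switch */
#define RAM_PHYS_START_ADDR 0x20000000u
#else
static uint8_t host_ram[16];           /* host stand-in for target RAM */
#define RAM_PHYS_START_ADDR ((uintptr_t)host_ram)
#endif

#define u8ramPhysStart_startUp REG8(RAM_PHYS_START_ADDR)
#define START_ASM              (&u8ramPhysStart_startUp)
#define RAMSTART_STRTUP        ((uintptr_t)START_ASM)

/* Writes through the macro exactly as if it were a variable. */
void mark_ram_start(uint8_t value)
{
    u8ramPhysStart_startUp = value;
}
```

Nothing is allocated by these macros, so on the target the linker script still has to keep the region at 0x20000000 free of other data.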

What's the difference between __builtin_popcountll and _mm_popcnt_u64?

I was trying to count how many 1 bits there are in 512 MB of memory, and I found two possible methods: _mm_popcnt_u64() and __builtin_popcountll() from the gcc builtins.
_mm_popcnt_u64() is said to use a CPU instruction from SSE4.2, which seems to be the fastest, and __builtin_popcountll() is expected to use a table lookup.
So I think __builtin_popcountll() should be a little slower than _mm_popcnt_u64().
However, when I measured them, the two methods took almost the same time. I strongly suspect that they work the same way.
I also got this in popcntintrin.h
/* Calculate a number of bits set to 1. */
extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_popcnt_u32 (unsigned int __X)
{
    return __builtin_popcount (__X);
}
#ifdef __x86_64__
extern __inline long long __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_popcnt_u64 (unsigned long long __X)
{
    return __builtin_popcountll (__X);
}
#endif
So I'm confused about how on earth __builtin_popcountll() works.

_mm_popcnt_u64 is part of <nmmintrin.h>, a header devised by Intel for utility functions for accessing SSE 4.2 instructions.
__builtin_popcountll is a GCC extension.
_mm_popcnt_u64 is portable to non-GNU compilers, and __builtin_popcountll is portable to non-SSE-4.2 CPUs. But on systems where both are available, both should compile to the exact same code.
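For illustration, here is a small sketch of counting set bits in a buffer with the builtin, next to a plain reference loop that needs no compiler support. The function names count_ones() and popcount_naive() are made up for this example. With a flag such as -mpopcnt or a suitable -march, GCC will typically turn the builtin into the popcnt instruction; without it, GCC falls back to a generic routine, so the source stays the same either way.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference version: clear the lowest set bit until none remain. */
unsigned popcount_naive(uint64_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;   /* drops exactly one set bit per iteration */
        n++;
    }
    return n;
}

/* Count the set bits in a buffer of 64-bit words using the GCC builtin. */
uint64_t count_ones(const uint64_t *words, size_t nwords)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += (uint64_t)__builtin_popcountll(words[i]);
    return total;
}
```

Comparing the generated assembly with and without -mpopcnt is probably the quickest way to see which code path a given build actually uses.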

If you compile without an -march flag, i.e. for the x86_64 default, the builtin should be slower: the compiler cannot assume that the popcnt instruction is available, so instead of inlining it, it emits a call to a generic library routine.

gcc unused-but-set-variable warning on volatile

I have a little function that writes values to HW using a volatile variable
void gige_rx_prepare(void) {
    volatile uint hw_write;
    // more code here
    hw_write = 0x32;
}
The gcc version 4.7.3 (Altera 13.1 Build 162) flags this variable as set but unused even though, being a volatile, it facilitates writing of the HW registers.
I still would like to see this warning on any other variable. Is there a way to avoid this warning on volatile variables without resorting to setting gcc attributes for each volatile variable in the code?

A local variable is not a good representation of a h/w register, and that's part of the reason why you see the warning.
The compiler complains (correctly) because hw_write is a local variable on the stack. In this case the compiler has enough information to infer that the assignment is pointless. If it were a global variable or a pointer to a volatile uint, there would be no warning, because the variable's lifetime would not be limited to the scope of the function and it could therefore be used somewhere else.
The following examples compile without any warnings:
volatile int hw_write2; // h/w register
void gige_rx_prepare2(void) {
    // more code here
    hw_write2 = 0x32;
}
void gige_rx_prepare3(void) {
    volatile int *hw_write3 = (void*)0x1234; // pointer to h/w register.
    // more code here
    *hw_write3 = 0x32;
}

Function pointer location not getting passed

I've got some C code I'm targeting for an AVR. The code is being compiled with avr-gcc, basically the gnu compiler with the right backend.
What I'm trying to do is create a callback mechanism in one of my event/interrupt driven libraries, but I seem to be having some trouble keeping the value of the function pointer.
To start, I have a static library. It has a header file (twi_master_driver.h) that looks like this:
#ifndef TWI_MASTER_DRIVER_H_
#define TWI_MASTER_DRIVER_H_
#define TWI_INPUT_QUEUE_SIZE 256
// define callback function pointer signature
typedef void (*twi_slave_callback_t)(uint8_t*, uint16_t);
typedef struct {
    uint8_t buffer[TWI_INPUT_QUEUE_SIZE];
    volatile uint16_t length; // currently used bytes in the buffer
    twi_slave_callback_t slave_callback;
} twi_global_slave_t;
typedef struct {
    uint8_t slave_address;
    volatile twi_global_slave_t slave;
} twi_global_t;
void twi_init(uint8_t slave_address, twi_global_t *twi, twi_slave_callback_t slave_callback);
#endif
Now the C file (twi_driver.c):
#include <stdint.h>
#include "twi_master_driver.h"
void twi_init(uint8_t slave_address, twi_global_t *twi, twi_slave_callback_t slave_callback)
{
    twi->slave.length = 0;
    twi->slave.slave_callback = slave_callback;
    twi->slave_address = slave_address;
    // temporary workaround <- why does this work??
    twi->slave.slave_callback = twi->slave.slave_callback;
}
void twi_slave_interrupt_handler(twi_global_t *twi)
{
    (twi->slave.slave_callback)(twi->slave.buffer, twi->slave.length);
    // some other stuff (nothing touches twi->slave.slave_callback)
}
Then I build those two files into a static library (.a) and construct my main program (main.c)
#include
#include
#include
#include
#include "twi_master_driver.h"
// ...define microcontroller safe way for mystdout ...
twi_global_t bus_a;
ISR(TWIC_TWIS_vect, ISR_NOBLOCK)
{
    twi_slave_interrupt_handler(&bus_a);
}
void my_callback(uint8_t *buf, uint16_t len)
{
    uint16_t i; // same width as len, so the loop also terminates for len > 255
    fprintf(&mystdout, "C: ");
    for(i = 0; i < len; i++)
    {
        fprintf(&mystdout, "%d,", buf[i]);
    }
    fprintf(&mystdout, "\n");
}
int main(int argc, char **argv)
{
    twi_init(2, &bus_a, &my_callback);
    // ...PMIC setup...
    // enable interrupts.
    sei();
    // (code that causes interrupt to fire)
    // spin while the rest of the application runs...
    while(1){
        _delay_ms(1000);
    }
    return 0;
}
I carefully trigger the events that cause the interrupt to fire and call the appropriate handler. Using some fprintfs I'm able to tell that the location assigned to twi->slave.slave_callback in the twi_init function is different than the one in the twi_slave_interrupt_handler function.
Though the numbers are meaningless, in twi_init the value is 0x13b, and in twi_slave_interrupt_handler when printed the value is 0x100.
By adding the commented workaround line in twi_driver.c:
twi->slave.slave_callback = twi->slave.slave_callback;
The problem goes away, but this is clearly a magic and undesirable solution. What am I doing wrong?
As far as I can tell, I've marked appropriate variables volatile, and I've tried marking other portions volatile and removing the volatile markings. I came up with the workaround when I noticed removing fprintf statements after the assignment in twi_init caused the value to be read differently later on.
The problem seems to be with how I'm passing around the function pointer -- and notably the portion of the program that is accessing the value of the pointer (the function itself?) is technically in a different thread.
Any ideas?
Edits:
resolved typos in code.
links to actual files: http://straymark.com/code/ [test.c|twi_driver.c|twi_driver.h]
fwiw: compiler options: -Wall -Os -fpack-struct -fshort-enums -funsigned-char -funsigned-bitfields -mmcu=atxmega128a1 -DF_CPU=2000000UL
I've tried the same code included directly (rather than via a library) and I've got the same issue.
Edits (round 2):
I removed all the optimizations, without my "workaround" the code works as expected. Adding back -Os causes an error. Why is -Os corrupting my code?

Just a hunch, but what happens if you switch these two lines around:
twi->slave.slave_callback = slave_callback;
twi->slave.length = 0;
Does removing the -fpack-struct gcc flag fix the problem? I wonder if you haven't stumbled upon a bug where writing that length field is overwriting part of the callback value.
It looks to me like with the -Os optimisations on (you could try combinations of the individual optimisations enabled by -Os to see exactly which one is causing it), the compiler isn't emitting the right code to manipulate the uint16_t length field when it's not aligned on a 2-byte boundary. This happens when you include a twi_global_slave_t inside a twi_global_t that is packed, because the initial uint8_t member of twi_global_t causes the twi_global_slave_t struct to be placed at an odd address.
If you make that initial field of twi_global_t a uint16_t it will probably fix it (or you could turn off struct packing). Try the latest gcc build and see if it still happens - if it does, you should be able to create a minimal test case that shows the problem, so you can submit a bug report to the gcc project.
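The odd placement described above can be reproduced on a host compiler. The sketch below copies the two structs from the question and marks them packed with an attribute, as an assumed stand-in for the -fpack-struct flag. The helper names length_offset() and callback_offset() are made up for this example. It shows only where the fields land, not whether avr-gcc mishandles them.

```c
#include <stddef.h>
#include <stdint.h>

#define TWI_INPUT_QUEUE_SIZE 256

typedef void (*twi_slave_callback_t)(uint8_t *, uint16_t);

typedef struct {
    uint8_t buffer[TWI_INPUT_QUEUE_SIZE];
    volatile uint16_t length;
    twi_slave_callback_t slave_callback;
} __attribute__((packed)) twi_global_slave_t;

typedef struct {
    uint8_t slave_address;              /* 1 byte, so what follows starts at 1 */
    volatile twi_global_slave_t slave;
} __attribute__((packed)) twi_global_t;

/* Byte offset of slave.length from the start of twi_global_t. */
size_t length_offset(void)
{
    return offsetof(twi_global_t, slave) + offsetof(twi_global_slave_t, length);
}

/* Byte offset of slave.slave_callback from the start of twi_global_t. */
size_t callback_offset(void)
{
    return offsetof(twi_global_t, slave) + offsetof(twi_global_slave_t, slave_callback);
}
```

With packing, length ends up at offset 257 and the callback pointer right behind it at 259, so a wrongly generated 16-bit store to length could plausibly clobber the first byte of the callback.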

This really sounds like a stack/memory corruption issue. If you run avr-size on your elf file, what do you get? Make sure (data + bss) < the RAM you have on the part. These types of issues are very difficult to track down. The fact that removing/moving unrelated code changes the behavior is a big red flag.

Replace "&my_callback" with "my_callback" in function main().
Because different threads access the callback address, try protecting it with a mutex or read-write lock.
If the callback function pointer isn't accessed by a signal handler, then the "volatile" qualifier is unnecessary.
