GCC: Specify target address for actual symbol - c

knowing that similar questions have been asked a few times already, let me just explain what I'm trying to do and why: I have an embedded system which consists of an ARM Cortex-M4 microcontroller and an FPGA. The FPGA can be configured by an on-board flash memory without even needing the microcontroller. But sometimes it would be nice for the microcontroller to be able to reconfigure the FPGA, so it has access to the required programming signals.
The bitstream which needs to be sent to the FPGA is some 150kBytes, but changes much less frequently than the MCU firmware. Since programming that much data with the programmer/debugger (Segger J-Link) takes some time, I would prefer not to erase and reprogram this bitstream every time the firmware changes (but the bitstream is identical). The debugger does reprogram only those memory blocks which actually have changed, so it happily skipped programming of the bitstream if it would always be located at the same address. But this is obviously not the case if I just include it some way like
static const uint8_t fpga_bitstream[] = {
// ...
};
since the linker is free to decide where to place this data.
Now the question: What would be a (not too intrusive) way to have the linker put this symbol always to the same address? I am aware of the --defsym linker option, but this seems to be ignored when the symbol in question is defined in a source file (instead of just being referenced and declared as extern). The only way I know which works (I've used that some time ago already) is to use a custom linker file which defines a separate partition in the MEMORY part and then creates a new section to put the symbol into, using e.g.
static const uint8_t fpga_bitstream[] __attribute__((section(".fpgabitstream"))) = {
// ...
};
This project uses a small operating system (Nut/OS), however, which brings its own linker files. So I'd be fine if I could append some data to it using the -Tfile.ld option, but I'd prefer not having to change the original script, as this is located in the source tree of the operating system itself.
Thanks,
Philipp

As it doesn't seem to be possible to "add" to an existing linker script, probably the easiest solution would be to reserve a separate area of flash (e.g. an area at a fixed address near end with a little reserve for future growth) for the FPGA configuration and load from that fixed address.
Since there is probably nothing useful you can do to the FPGA bitstream anyway from the MCU side (other than loading it), I don't see any advantage in embedding the FPGA config into your MCU binary. You only need start address and size in your MCU code, both of which can be made known to the MCU by the --defsym linker option (without linker script modification).
I have a very similar configuration (with FPGA config and MCU code living on separate areas of flash) over here which works well. MCU flash is used regularily while FPGA is flashed only now and then.

Related

How can I relocate firmware for stm32?

I have built firmware for stm32f4, so I have *.elf an *.bin files. If I load *.bin file into internal flash it runs correctly. But if I want load this firmware to another address different from default (0x08000000) and run it from my bootloader, of course it does not works. I fix memory area address in project settings (I use CooCox 1.7.6 if it matter) and it begins runing from the bootloader.
I don't want rebuild project every time I load firmware in standalone mode or using bootloader. So I'm looking for method that will allow me to make *.bin files from *.elf, which will be able to load from the different addresses.
I tried manualy interupt vector table fixing. It allow me to run small blinker project but it doesn't work for more complex projects. I must be missing something.
Then I tried objcopy with --change-addresses but unfortunately it doesn't work. I get the same file without any difference.
I think, it is a general problem and there are solution for it but I must be missing something important.
You have to compile your code as position independent. See the -fpic and -fPIC options of gcc. Or you need a loader that understands the relocation table of the ELF image.
I just happened to complete an odyssey on STM32 Cortex-M4 (and M0) with Position-Independent Code firmware images operated by a bootloader. I will give you and overview and link the full article below. My evaluation board is NUCLEO-L432KC with target MCU STM32L432KCU6U. The magic goes like this:
On high level, there are couple of things needed:
A very simple bootloader. It only needs to read 2 4-byte words from the flash location of the firmware image. First is stack pointer address and second is Reset_Handler address. Bootloader needs to jump to Reset_Handler. But as the firmware is relocated further in the flash, the Reset_Handler is actually a bit further. See this picture for clarification:
So, bootloader adds the correct offset before jumping to Reset_Handler of the firmware image. No other patching is done, but bootloader (in my solution at least) stores location and offset of the firmware image, calculates checksum and passes this information in registers for the firmware image to use.
Then we need modifications in firmware linker file to force the ISR vector table in the beginning of RAM. For Cortex-M4 and VTOR (Vector Offset Table Register) the ISR vector table needs to be aligned to 512 boundary. In the beginning of RAM it is there naturally. In linker script we also dedicate a position in RAM for Global Offset Table (GOT) for easier manipulation. Address ranges should be exported via linker script symbols also.
We need also C compiler options. Basically these:
-fpic
-mpic-register=r9
-msingle-pic-base
-mno-pic-data-is-text-relative
They go to C compiler only! This creates the Global Offset Table (GOT) accounting.
Finally we need dedicated and tedious assembly bootstrap routines in the firmware image project. These routines perform the normal startup task of setting up the C runtime environment, but they additionally also go and read ISR and GOT from flash and copy them to RAM. In RAM, the addresses of those tables pointing to flash are offset by the amount the firmware image is running apart from bootloader. Example pictures:
I spent last 6 months researching this subject and I have written an in-depth article about it here:
https://techblog.paalijarvi.fi/2022/01/16/portable-position-independent-code-pic-bootloader-and-firmware-for-arm-cortex-m0-and-cortex-m4/
I do not think so that would work well in your case because when you are compiling the file to run with boot loader, you also needs to make sure that Interrupt vector also have been moved to the new location else there will be a problem in executing some of the functions.

How does OS execute binary files in virtual memory?

For example in my program I called a function foo(). The compiler and assembler would eventually write jmp someaddr in the binary. I know the concept of virtual memory. The program would think that it has the whole memory at disposal, and the start position is 0x000. In this way the assembler can calculate the position of foo().
But in fact this is not decided until runtime right? I have to run the program to know where I loaded the program into, hence the address of the jmp. But when the program actually runs, how does the OS come in and change the address of the jmp? These are direct CPU instructions right?
This question can't be answered in general because it's totally hardware and OS dependent. However a typical answer is that the initially loaded program can be compiled as you say: Because the VM hardware gives each program its own address space, all addresses can be fixed when the program is linked. No recalculation of addresses at load time is needed.
Things get much more interesting with dynamically loaded libraries because two used by the same initially loaded program might be compiled with the same base address, so their address spaces overlap.
One approach to this problem is to require Position Independent Code in DLLs. In such code all addresses are relative to the code itself. Jumps are usually relative to the PC (though a code segment register can also be used). Data are also relative to some data segment or base register. To choose the runtime location, the PIC code itself needs no change. Only the segment or base register(s) need(s) be set whenever in the prelude of every DLL routine.
PIC tends to be a bit slower than position dependent code because there's additional address arithmetic and the PC and/or base registers can bottleneck the processor's instruction pipeline.
So the other approach is for the loader to rebase the DLL code when necessary to eliminate address space overlaps. For this the DLL must include a table of all the absolute addresses in the code. The loader computes an offset between the assumed code and data base addresses and actual, then traverses the table, adding the offset to each absolute address as the program is copied into VM.
DLLs also have a table of entry points so that the calling program knows where the library procedures start. These must be adjusted as well.
Rebasing is not great for performance either. It slows down loading. Moreover, it defeats sharing of DLL code. You need at least one copy per rebase offset.
For these reasons, DLLs that are part of Windows are deliberately compiled with non-overlapping VM address spaces. This speeds loading and allows sharing. If you ever notice that a 3rd party DLL crunches the disk and loads slowly, while MS DLLs like the C runtime library load quickly, you are seeing the effects of rebasing in Windows.
You can infer more about this topic by reading about object file formats. Here is one example.
Position-independent code is code that you can run from any address. If you have a jmp instruction in position-independent code, it will often be a relative jump, which jumps to an offset from the current location. When you copy the code, it won't change the offsets between parts of the code so it will still work.
Relocatable code is code that you can run from any address, but you might have to modify the code first (maybe you can't just copy it). The code will contain a relocation table which tells how it needs to be modified.
Non-relocatable code is code that must be loaded at a certain address or it will not work.
Each program is different, it depends on how the program was written, or the compiler settings, or other various factors.
Shared libraries are usually compiled as position-independent code, which allows the same library to be loaded at different locations in different processes, without having to load multiple copies into memory. The same copy can be shared between processes, even though it is at a different address in each process.
Executables are often non-relocatable, but they can be position-independent. Virtual memory allows each program to have the entire address space (minus some overhead) to itself, so each executable can choose the address at which it's loaded without worrying about collisions with other executables. Some executables are position-independent, which can be used to increase security (ASLR).
Object files and static libraries are usually relocatable code. The linker will relocate them when combining them to create a shared library, executable, or other image.
Boot loaders and operating system kernels are almost always non-relocatable.
Yes, it is at runtime. The operating system, the part managing starting and switching tasks is ideally at a different protection level, it has more power. It knows what memory is in use and allocates some for the new task. It configures the mmu so that the new task has a virtual address space starting at zero or whatever the rule is for that operating system and processor. How you get into user mode at that starting address, is very processor specific.
One method for example is the hardware might save some state not just address but mode or virtual id or something when an interrupt occurs, lets say on the stack. And the return from interrupt instruction as defined by that processor takes the address, and state/mode, off of the stack and switches there (causing lets assume the mmu to react to its next fetch based on the new mode not the old). For a processor that works like that then you may be able to fake an interrupt return by placing the right items on the stack such that when you kick the interrupt return instruction it basically does a jump with additional features of mode switching, etc.
The ARM family for example (not cortex-m) has a processor state register for what you are running now (in the case of an interrupt or service call) and a second state register for where you came from, the state that was interrupted, when you do the proper return you give it the address and it switches back to that mode using the other register. You can directly access that register from the non-users modes so you can manipulate the state of the return. There is no return instruction in arm, just flavors of jump (modifications to the program counter), so it is a special kind of jump.
The short answer is that it is very specific to the processor as to what your choices are for jumping to the first time or returning to after a task switch to a running task in an application mode in a virtual address space. Either directly or indirectly the processor documentation will describe these modes and how you change them. If not explicitly described then you have to figure out on your own from the instructions and the mmu protections and such how to switch tasks.

C code that checksums itself *in ram*

I'm trying to get a ram-resident image to checksum itself, which is proving easier said than done.
The code is first compiled on a cross development platform, generating an .elf output. A utility is used to strip out the binary image, and that image is burned to flash on the target platform, along with the image size. When the target is started, it copies the binary to the correct region of ram, and jumps to it. The utility also computes a checksum of all the words in the elf that are destined for ram, and that too is burned into the flash. So my image theoretically could checksum its own ram resident image using the a-priori start address and the size saved in flash, and compare to the sum saved in flash.
That's the theory anyway. The problem is that once the image begins executing, there is change in the .data section as variables are modified. By the time the sum is done, the image that has been summed is no longer the image for which the utility calculated the sum.
I've eliminated change due to variables defined by my application, by moving the checksum routine ahead of all other initializations in the app (which makes sense b/c why run any of it if an integrity check fails, right?), but the killer is the C run time itself. It appears that there are some items relating to malloc and pointer casting and other things that are altered before main() is even entered.
Is the entire idea of self-checksumming C code lame? If there was a way to force app and CRT .data into different sections, I could avoid the CRT thrash, but one might argue that if the goal is to integrity check the image before executing (most of) it, that initialized CRT data should be part of that. Is there a way to make code checksum itself in RAM like this at all?
FWIW, I seem stuck with a requirement for this. Personally I'd have thought that the way to go is to checksum the binary in the flash, before transfer to ram, and trust the loader and the ram. Paranoia has to end somewhere right?
Misc details: tool chain is GNU, image contains .text, .rodata and .data as one contiguously loaded chunk. There is no OS, this is bare metal embedded. Primary loader essentially memcpy's my binary into ram, at a predetermined address. No relocations occur. VM is not used. Checksum only needs testing once at init only.
updated
Found that by doing this..
__attribute__((constructor)) void sumItUp(void) {
// sum it up
// leave result where it can be found
}
.. that I can get the sum done before almost everything except the initialization of the malloc/sbrk vars by the CRT init, and some vars owned by "impure.o" and "locale.o". Now, the malloc/sbrk value is something I know from the project linker script. If impure.o and locale.o could be mitigated, might be in business.
update
Since I can control the entry point (by what's stated in flash for the primary loader), it seems the best angle of attack now is to use a piece of custom assembler code to set up stack and sdata pointers, call the checksum routine, and then branch into the "normal" _start code.
If the checksum is done EARLY enough, you could use ONLY stack variables, and not write to any data-section variables - that is, make EVERYTHING you need to perform the checksumming [and all preceding steps to get to that point] ONLY use local variables for storing things in [you can read global data of course].
I'm fairly convinced that the right way is to trust the flash & loader to load what is in the flash. If you want to checksum the code, sure, go and do that [assuming it's not being modified by the loader of course - for example runtime loading of shared libraries or relocation of the executable itself, such as random virtual address spaces and such]. But the data loaded from flash can't be relied upon once execution starts properly.
If there is a requirement from someone else that you should do this, then please explain to them that this is not feasible to implement, and that "the requirement, as it stands" is "broken".
I'd suggest approaching this like an executable packer, like upx.
There are several things in the other answers and in your question that, for lack of a better term, make me nervous.
I would not trust the loader or anything in flash that wasn't forced on me.
There is source code floating around on the net that was used to secure one of, I think, HTCs recent phones. Look around on forum.xda-developers.com and see if you can find it and use it for an example.
I would push back on this requirement. Cellphone manufacturers spend a lot of time on keeping their images locked down and, eventually, all of them are beaten. This seems like a vicious circle.
Can you use the linker script to place impure.o and locale.o before or after everything else, allowing you to checksum everything but those and the malloc/sbrk stuff? I'm guessing malloc and sbrk are called in the bootloader that loads your application, so the thrash caused by those cannot be eliminated?
It's not an answer to just tell you to fight this requirement, but I agree that this seems to be over-thought. I'm sure you can't go into any amount of detail, but I'm assuming the spec authors are concerned about malicious users/hackers, rather than regular memory corruption due to cosmic rays, etc. In this case, if a malicious user/hacker can change what's loaded into RAM, they can just change your checksumming routine (which is itself running from RAM, correct?) to always return a happy status, no matter how well the checksum routine they aren't running anymore is designed.
Even if they are concerned about regular memory corruption, this checksum routine would only catch that if the error occurred during the original copy to memory, which is actually the least likely time such an error would occur, simply because the system hasn't been running long enough to have a high probability of a corruption event.
In general, what you want to do is impossible, since on many (most?) platforms the program loader may "relocate" some program address constants.
Can you update the loader to perform the checksum test on the flash resident binary image, before it is copied to ram?

What is the deal with position-independent code (PIC)?

Could somebody explain why I should be interested in compiling position-independent code, and also why should I avoid it?
Making code position-independent adds a layer of abstraction, which requires an additional lookup step at runtime for certain operations (usually pertaining to accessing variables with static storage).
So if you don't need it, don't use it!
There are specific situations where you must produce PIC (namely when creating run-time loadable code, such as a plug-in module or library), but the added flexibility comes at a price.
The gory details depend on how your loader works on on whether you are building a executable or a library, but there is a sense in which this is all a problem for the build system and the compiler, not for you.
If you really want to understand you need to consider where the code gets put in the address space before execution starts and what set of branching instructions your chip provides. Are branches relative or absolute? Is access to the data segment relative or absolute?
If branches are absolute, then the code must be loaded to a reliable address or it won't work. That's position dependent code. Many simple (or older) operating systems accommodate this by always loading a program to the same place.
Relative branches mean that the can be placed at any location in memory. That is position independent code.
Again, all you need to know is the recipe for invoking your compiler and linker on your platform.
PIC code usually has to be slightly larger because the compiler can't use instructions that encode relative address offsets. Without PIC, many addresses can be encoded with 16 or 8 bits relative to current PC. Sometimes in embedded systems, PIC is useful. For example if you want to have patch code that can run at various physical addresses.

Fixed address variable in C

For embedded applications, it is often necessary to access fixed memory locations for peripheral registers. The standard way I have found to do this is something like the following:
// access register 'foo_reg', which is located at address 0x100
#define foo_reg *(int *)0x100
foo_reg = 1; // write to foo_reg
int x = foo_reg; // read from foo_reg
I understand how that works, but what I don't understand is how the space for foo_reg is allocated (i.e. what keeps the linker from putting another variable at 0x100?). Can the space be reserved at the C level, or does there have to be a linker option that specifies that nothing should be located at 0x100. I'm using the GNU tools (gcc, ld, etc.), so am mostly interested in the specifics of that toolset at the moment.
Some additional information about my architecture to clarify the question:
My processor interfaces to an FPGA via a set of registers mapped into the regular data space (where variables live) of the processor. So I need to point to those registers and block off the associated address space. In the past, I have used a compiler that had an extension for locating variables from C code. I would group the registers into a struct, then place the struct at the appropriate location:
typedef struct
{
BYTE reg1;
BYTE reg2;
...
} Registers;
Registers regs _at_ 0x100;
regs.reg1 = 0;
Actually creating a 'Registers' struct reserves the space in the compiler/linker's eyes.
Now, using the GNU tools, I obviously don't have the at extension. Using the pointer method:
#define reg1 *(BYTE*)0x100;
#define reg2 *(BYTE*)0x101;
reg1 = 0
// or
#define regs *(Registers*)0x100
regs->reg1 = 0;
This is a simple application with no OS and no advanced memory management. Essentially:
void main()
{
while(1){
do_stuff();
}
}
Your linker and compiler don't know about that (without you telling it anything, of course). It's up to the designer of the ABI of your platform to specify they don't allocate objects at those addresses.
So, there is sometimes (the platform i worked on had that) a range in the virtual address space that is mapped directly to physical addresses and another range that can be used by user space processes to grow the stack or to allocate heap memory.
You can use the defsym option with GNU ld to allocate some symbol at a fixed address:
--defsym symbol=expression
Or if the expression is more complicated than simple arithmetic, use a custom linker script. That is the place where you can define regions of memory and tell the linker what regions should be given to what sections/objects. See here for an explanation. Though that is usually exactly the job of the writer of the tool-chain you use. They take the spec of the ABI and then write linker scripts and assembler/compiler back-ends that fulfill the requirements of your platform.
Incidentally, GCC has an attribute section that you can use to place your struct into a specific section. You could then tell the linker to place that section into the region where your registers live.
Registers regs __attribute__((section("REGS")));
A linker would typically use a linker script to determine where variables would be allocated. This is called the "data" section and of course should point to a RAM location. Therefore it is impossible for a variable to be allocated at an address not in RAM.
You can read more about linker scripts in GCC here.
Your linker handles the placement of data and variables. It knows about your target system through a linker script. The linker script defines regions in a memory layout such as .text (for constant data and code) and .bss (for your global variables and the heap), and also creates a correlation between a virtual and physical address (if one is needed). It is the job of the linker script's maintainer to make sure that the sections usable by the linker do not override your IO addresses.
When the embedded operating system loads the application into memory, it will load it in usually at some specified location, lets say 0x5000. All the local memory you are using will be relative to that address, that is, int x will be somewhere like 0x5000+code size+4... assuming this is a global variable. If it is a local variable, its located on the stack. When you reference 0x100, you are referencing system memory space, the same space the operating system is responsible for managing, and probably a very specific place that it monitors.
The linker won't place code at specific memory locations, it works in 'relative to where my program code is' memory space.
This breaks down a little bit when you get into virtual memory, but for embedded systems, this tends to hold true.
Cheers!
Getting the GCC toolchain to give you an image suitable for use directly on the hardware without an OS to load it is possible, but involves a couple of steps that aren't normally needed for normal programs.
You will almost certainly need to customize the C run time startup module. This is an assembly module (often named something like crt0.s) that is responsible initializing the initialized data, clearing the BSS, calling constructors for global objects if C++ modules with global objects are included, etc. Typical customizations include the need to setup your hardware to actually address the RAM (possibly including setting up the DRAM controller as well) so that there is a place to put data and stack. Some CPUs need to have these things done in a specific sequence: e.g. The ColdFire MCF5307 has one chip select that responds to every address after boot which eventually must be configured to cover just the area of the memory map planned for the attached chip.
Your hardware team (or you with another hat on, possibly) should have a memory map documenting what is at various addresses. ROM at 0x00000000, RAM at 0x10000000, device registers at 0xD0000000, etc. In some processors, the hardware team might only have connected a chip select from the CPU to a device, and leave it up to you to decide what address triggers that select pin.
GNU ld supports a very flexible linker script language that allows the various sections of the executable image to be placed in specific address spaces. For normal programming, you never see the linker script since a stock one is supplied by gcc that is tuned to your OS's assumptions for a normal application.
The output of the linker is in a relocatable format that is intended to be loaded into RAM by an OS. It probably has relocation fixups that need to be completed, and may even dynamically load some libraries. In a ROM system, dynamic loading is (usually) not supported, so you won't be doing that. But you still need a raw binary image (often in a HEX format suitable for a PROM programmer of some form), so you will need to use the objcopy utility from binutil to transform the linker output to a suitable format.
So, to answer the actual question you asked...
You use a linker script to specify the target addresses of each section of your program's image. In that script, you have several options for dealing with device registers, but all of them involve putting the text, data, bss stack, and heap segments in address ranges that avoid the hardware registers. There are also mechanisms available that can make sure that ld throws an error if you overfill your ROM or RAM, and you should use those as well.
Actually getting the device addresses into your C code can be done with #define as in your example, or by declaring a symbol directly in the linker script that is resolved to the base address of the registers, with a matching extern declaration in a C header file.
Although it is possible to use GCC's section attribute to define an instance of an uninitialized struct as being located in a specific section (such as FPGA_REGS), I have found that not to work well in real systems. It can create maintenance issues, and it becomes an expensive way to describe the full register map of the on-chip devices. If you use that technique, the linker script would then be responsible for mapping FPGA_REGS to its correct address.
In any case, you are going to need to get a good understanding of object file concepts such as "sections" (specifically the text, data, and bss sections at minimum), and may need to chase down details that bridge the gap between hardware and software such as the interrupt vector table, interrupt priorities, supervisor vs. user modes (or rings 0 to 3 on x86 variants) and the like.
Typically these addresses are beyond the reach of your process. So, your linker wouldn't dare put stuff there.
If the memory location has a special meaning on your architecture, the compiler should know that and not put any variables there. That would be similar to the IO mapped space on most architectures. It has no knowledge that you're using it to store values, it just knows that normal variables shouldn't go there. Many embedded compilers support language extensions that allow you to declare variables and functions at specific locations, usually using #pragma. Also, generally the way I've seen people implement the sort of memory mapping you're trying to do is to declare an int at the desired memory location, then just treat it as a global variable. Alternately, you could declare a pointer to an int and initialize it to that address. Both of these provide more type safety than a macro.
To expand on litb's answer, you can also use the --just-symbols={symbolfile} option to define several symbols, in case you have more than a couple of memory-mapped devices. The symbol file needs to be in the format
symbolname1 = address;
symbolname2 = address;
...
(The spaces around the equals sign seem to be required.)
Often, for embedded software, you can define within the linker file one area of RAM for linker-assigned variables, and a separate area for variables at absolute locations, which the linker won't touch.
Failing to do this should cause a linker error, as it should spot that it's trying to place a variable at a location already being used by a variable with absolute address.
This depends a bit on what OS you are using. I'm guessing you are using something like DOS or vxWorks. Generally the system will have certian areas of the memory space reserved for hardware, and compilers for that platform will always be smart enough to avoid those areas for their own allocations. Otherwise you'd be continually writing random garbage to disk or line printers when you meant to be accessing variables.
In case something else was confusing you, I should also point out that #define is a preprocessor directive. No code gets generated for that. It just tells the compiler to textually replace any foo_reg it sees in your source file with *(int *)0x100. It is no different than just typing *(int *)0x100 in yourself everywhere you had foo_reg, other than it may look cleaner.
What I'd probably do instead (in a modern C compiler) is:
// access register 'foo_reg', which is located at address 0x100
const int* foo_reg = (int *)0x100;
*foo_reg = 1; // write to foo_regint
x = *foo_reg; // read from foo_reg

Resources