Is is possible to have a linker generate a binary, that is as close to the previously generated one as possible? - linker

Here is the context: I work with microcontrollers, where the resulting binary is written to an internal flash, and executed from there. The flash is erased and written in 4KiB blocks. The flasher is smart enough to skip over blocks that do not need to change.
During development, it is typical to make minor changes and recompile and re-flash dozens of times a day.
What I want to achieve is, if possible, have the linker keep the existing structure as much as possible. For example, if I remove an if-check somewhere, the code gets a few bytes smaller, and all subsequent information becomes shifted by a few bytes, hence the entire flash gets erased and reprogrammed.
If, somehow, the linker was able to add padding between the linked symbols, a big are of the flash could often be skipped without overwriting. This will obviously not work when the code gets bigger.
Are there any gcc options that can help with automating such tasks (without handcrafting massive linker scripts)? Are there any compilers that have at least partial support for such behaviours?

Related

Can small functions defined in dynamic libraries be inlined?

This is a follow-up to this question: Can the linker inline functions?
This time I am wondering about the same optimization, not at link time, but at run time when linking to a dynamic library. Is it possible at all? Do modern OSes do it? Why?
In theory it's possible, but there are many reasons not to do it. In practice, "dynamic linking" is not really full linking; position-independent code is used for all but the main program (and possibly the main program too), whereby the full range of relocations a full (static) linker might have to perform are not needed. Instead, only a small number of relocation types are required, which basically just amount to filling in addresses of functions and objects contained in libraries in a big contiguous table. Of course, such references in objects of static storage duration in the .data segment must also be filled in, so it's a little bit more work than just filling in a contiguous table, but the key point is that only data, not code, is modified.
If you start modifying code at all, you throw away most of the advantages of dynamic linking: code pages could not be shared among multiple instances of the application/library, and a lot more time would be spent at startup duplicating (via page faults and copy-on-write semantics) the mapped code pages. And this is just the minimal cost for patching a few bytes here and there in code.
For actually inlining code from dynamic libraries, what you would have to do would amount to full link-time optimization. Measure how much time it takes to LTO-link a large program, and then ask yourself if users would be acceptable waiting that long every time they start the program. The answer is almost surely no.

Need to migrate my firmware image to ROM mask

I need to create a ROM mask that provides some functions. However, it should be possible to overwrite the functions for providing firmware patches. Therefore, a patch table should be located in a Flash memory that may be overwritten by later firmware upgrades, whereas the main part of the firmware is located in mask ROM and cannot be modified later.
Does anyone have any idea how this is done? what is the best practice to create patch tables?
Patch table - Basically your ROM has one more built in jumps to a RAM (Flash) adddresses built into it. The flash always has some sort of jump back to the loction just after your ROM jump call to return to your original code. You now have the ability to change your program's behavior from RAM. This assumes you are allowed to run code from RAM of course. If not, then only data tables can be changed on the fly.
Now, just having a single jump on startup allows you to modify starting state code like version numbers or other global/constant data, but that may not be enough. You can add another jump to flash that is run every "so often" to your code as well so that it can update program state after it is running - something like once a vblank or once a application loop.
The above should give you enough lee-way to change data that has changed over time since release or perhaps to fix a slight logic error, but it won't allow you to whole-sale change functions. To do that, you need more code. And that more code depends on just how much RAM you can use.
For instance, if you had enough RAM, and running from Flash RAM didn't hurt perforamnce too much (and is allowed), you could make things very flexible by copying all your key ROM functions into RAM on startup, then calling your RAM patch jump described above, which would allow new code stored elsewhere in your RAM patch to overwrite any previously copied code. If you took this approach, you would also want to make sure that you left some extra space around your original functions to allow new padded functions a bit of room to grow. The original code to be copied to RAM could be stored compressed as well to save ROM space. This allows you to change anything after the fact. It also introduces a very easy way to exploit your code if something else is allowed to write to RAM so beware.
Hope this helps.

C code that checksums itself *in ram*

I'm trying to get a ram-resident image to checksum itself, which is proving easier said than done.
The code is first compiled on a cross development platform, generating an .elf output. A utility is used to strip out the binary image, and that image is burned to flash on the target platform, along with the image size. When the target is started, it copies the binary to the correct region of ram, and jumps to it. The utility also computes a checksum of all the words in the elf that are destined for ram, and that too is burned into the flash. So my image theoretically could checksum its own ram resident image using the a-priori start address and the size saved in flash, and compare to the sum saved in flash.
That's the theory anyway. The problem is that once the image begins executing, there is change in the .data section as variables are modified. By the time the sum is done, the image that has been summed is no longer the image for which the utility calculated the sum.
I've eliminated change due to variables defined by my application, by moving the checksum routine ahead of all other initializations in the app (which makes sense b/c why run any of it if an integrity check fails, right?), but the killer is the C run time itself. It appears that there are some items relating to malloc and pointer casting and other things that are altered before main() is even entered.
Is the entire idea of self-checksumming C code lame? If there was a way to force app and CRT .data into different sections, I could avoid the CRT thrash, but one might argue that if the goal is to integrity check the image before executing (most of) it, that initialized CRT data should be part of that. Is there a way to make code checksum itself in RAM like this at all?
FWIW, I seem stuck with a requirement for this. Personally I'd have thought that the way to go is to checksum the binary in the flash, before transfer to ram, and trust the loader and the ram. Paranoia has to end somewhere right?
Misc details: tool chain is GNU, image contains .text, .rodata and .data as one contiguously loaded chunk. There is no OS, this is bare metal embedded. Primary loader essentially memcpy's my binary into ram, at a predetermined address. No relocations occur. VM is not used. Checksum only needs testing once at init only.
updated
Found that by doing this..
__attribute__((constructor)) void sumItUp(void) {
// sum it up
// leave result where it can be found
}
.. that I can get the sum done before almost everything except the initialization of the malloc/sbrk vars by the CRT init, and some vars owned by "impure.o" and "locale.o". Now, the malloc/sbrk value is something I know from the project linker script. If impure.o and locale.o could be mitigated, might be in business.
update
Since I can control the entry point (by what's stated in flash for the primary loader), it seems the best angle of attack now is to use a piece of custom assembler code to set up stack and sdata pointers, call the checksum routine, and then branch into the "normal" _start code.
If the checksum is done EARLY enough, you could use ONLY stack variables, and not write to any data-section variables - that is, make EVERYTHING you need to perform the checksumming [and all preceding steps to get to that point] ONLY use local variables for storing things in [you can read global data of course].
I'm fairly convinced that the right way is to trust the flash & loader to load what is in the flash. If you want to checksum the code, sure, go and do that [assuming it's not being modified by the loader of course - for example runtime loading of shared libraries or relocation of the executable itself, such as random virtual address spaces and such]. But the data loaded from flash can't be relied upon once execution starts properly.
If there is a requirement from someone else that you should do this, then please explain to them that this is not feasible to implement, and that "the requirement, as it stands" is "broken".
I'd suggest approaching this like an executable packer, like upx.
There are several things in the other answers and in your question that, for lack of a better term, make me nervous.
I would not trust the loader or anything in flash that wasn't forced on me.
There is source code floating around on the net that was used to secure one of, I think, HTCs recent phones. Look around on forum.xda-developers.com and see if you can find it and use it for an example.
I would push back on this requirement. Cellphone manufacturers spend a lot of time on keeping their images locked down and, eventually, all of them are beaten. This seems like a vicious circle.
Can you use the linker script to place impure.o and locale.o before or after everything else, allowing you to checksum everything but those and the malloc/sbrk stuff? I'm guessing malloc and sbrk are called in the bootloader that loads your application, so the thrash caused by those cannot be eliminated?
It's not an answer to just tell you to fight this requirement, but I agree that this seems to be over-thought. I'm sure you can't go into any amount of detail, but I'm assuming the spec authors are concerned about malicious users/hackers, rather than regular memory corruption due to cosmic rays, etc. In this case, if a malicious user/hacker can change what's loaded into RAM, they can just change your checksumming routine (which is itself running from RAM, correct?) to always return a happy status, no matter how well the checksum routine they aren't running anymore is designed.
Even if they are concerned about regular memory corruption, this checksum routine would only catch that if the error occurred during the original copy to memory, which is actually the least likely time such an error would occur, simply because the system hasn't been running long enough to have a high probability of a corruption event.
In general, what you want to do is impossible, since on many (most?) platforms the program loader may "relocate" some program address constants.
Can you update the loader to perform the checksum test on the flash resident binary image, before it is copied to ram?

What is the deal with position-independent code (PIC)?

Could somebody explain why I should be interested in compiling position-independent code, and also why should I avoid it?
Making code position-independent adds a layer of abstraction, which requires an additional lookup step at runtime for certain operations (usually pertaining to accessing variables with static storage).
So if you don't need it, don't use it!
There are specific situations where you must produce PIC (namely when creating run-time loadable code, such as a plug-in module or library), but the added flexibility comes at a price.
The gory details depend on how your loader works on on whether you are building a executable or a library, but there is a sense in which this is all a problem for the build system and the compiler, not for you.
If you really want to understand you need to consider where the code gets put in the address space before execution starts and what set of branching instructions your chip provides. Are branches relative or absolute? Is access to the data segment relative or absolute?
If branches are absolute, then the code must be loaded to a reliable address or it won't work. That's position dependent code. Many simple (or older) operating systems accommodate this by always loading a program to the same place.
Relative branches mean that the can be placed at any location in memory. That is position independent code.
Again, all you need to know is the recipe for invoking your compiler and linker on your platform.
PIC code usually has to be slightly larger because the compiler can't use instructions that encode relative address offsets. Without PIC, many addresses can be encoded with 16 or 8 bits relative to current PC. Sometimes in embedded systems, PIC is useful. For example if you want to have patch code that can run at various physical addresses.

C memory space and #defines

I am working on an embedded system, so memory is precious for me.
One issue that has been recurring is that I've been running out of memory space when attempting to compile a program for it. This is usually fixed by limiting the number of typedefs, etc that can take up a lot of space.
There is a macro generator that I use to create a file with a lot of #define's in it.
Some of these are simple values, others are boundary checks
ie
#define SIGNAL1 (float)0.03f
#define SIGNAL1_ISVALID(value) ((value >= 0.0f) && (value <= 10.0f))
Now, I don't use all of these defines. I use some, but not actually the majority.
I have been told that they don't actually take up any memory if they are not used, but I was unsure on this point. I'm hoping that by cutting out the unused ones that I can free up some extra memory (but again, I was told this is pointless).
Do unused #define's take up any memory space?
No, #defines take up no space unless they are used - #defines work like find/replace; whenever the compiler sees the left half, it'll replace it with the right half before it actually compiles.
So, if you have:
float f = SIGNAL1;
The compiler will literally interpret the statement:
float f = (float)0.03f;
It will never see the SIGNAL1, it won't show up in a debugger, etc.
This is usually fixed by limiting the number of typedefs, etc that can take up a lot of space.
You seem somewhat confused because typedef's do not take up space at runtime. They are merely aliases for data types. Now you may have instances of large structures (typedef'd or otherwise), but it is the instance that takes space, not the type definition. I wonder what 'etc' might cover in this statement.
Macro instances are replaced in the source code with their definition, and code generated accordingly, an unused macro does not result in any generated code.
Things that take up space are:
Executable code (functions/member functions)
Data instantiation (including C++ object instances)
The amount of space allocated to the stack (or stacks in a multi-threaded system).
What is left is typically available for dynamic memory allocation (RAM), or is unused or made for non-volatile storage (Flash/EPROM).
Reducing memory usage is primarily a case of selecting/designing efficient data structures, using appropriate data types, and efficient code and algorithm design. It is best to target the area that will get the greatest benefit. To see the size of objects and code in your application, get the linker to generate a map file. That will tell you which are the largest functions, as well as the sizes of global and static objects.
Source file text length is not a good guide to code size. Large amounts of C code is declarative (typically header files are all declarative), and do not generate memory occupying code or data.
An embedded system does not necessarily imply small memory, so you should specify. I have worked on systems with 64Mb RAM and 2Mb Flash, and even that is modest compared with many systems. A typical micro-controller with on-chip resources however, will generally have much less (especially SRAM which takes up a lot of chip area). Also whether your system is Harvard or Von Neumann architecture is relevant here, since in a Harvard architecture data and code spaces are separate, so we need to know what it is you are short of. If Von Neumann, the code/data usage is still relevant if the code is running from ROM, or is it is copied from ROM to RAM at run-time (i.e. different types of memory, even if they are in the same address space).
Clifford
Well, yes and no.
No, unused #defines won't increase the size of the resulting binary.
Yes, all #defines (whether used or unused) must be known by the compiler when building the binary.
By your question it's a bit ambiguous how you use the compiler, but it almost seems that you try to build directly on an embedded device; have you tried a cross-compiler? :)
unused #defines doesn't take up space in the resulting executable. They do take up memory in the compiler itself whist compiling.

Resources