Embedded: memcpy/memset not used by most CRT startup code ― why? - c

Context:
I'm working on an ARM target, more specifically a Cortex-M4F microcontroller from ST. When working on such platforms (microcontrollers in general), there's obviously no OS; in order to get a working C/C++ "environment" (and, moreover, to be standard compliant with regard to the initialization of variables) there must be some kind of startup code run at reset that does the minimum setup required before explicitly calling main. Such startup code, as I hinted, must initialize the initialized global and static variables (such as int foo = 42; at global scope) and zero out the other globals (such as int bar; at global scope). Then, if necessary, global "ctors" are called.
On a microcontroller, that simply means that the startup code has to copy data from flash to RAM for every initialized global (all in section '.data') and clear the others (all in '.bss'). Because I use GCC, I must supply such startup code myself, and I happily analyzed several startup files (and their associated linker scripts!) bundled with numerous examples I've found on the Internet, all using the same demo board I'm developing on.
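For reference, the hand-written loops in question usually amount to something like the following C sketch. The function names and the pointer parameters are illustrative assumptions; in a real startup file the bounds would come from symbols defined in the linker script, and the assert calls are there only for checking the sketch on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Copy the initial values of .data from its load address (flash)
   to its run address (RAM), one 32-bit word at a time.
   The parameters stand in for linker-script symbols (assumption). */
void copy_data(const uint32_t *load, uint32_t *start, const uint32_t *end)
{
    assert(start <= end); /* host-side sanity check only */
    while (start < end) {
        *start++ = *load++;
    }
}

/* Clear .bss so that globals without an initializer read as zero. */
void zero_bss(uint32_t *start, const uint32_t *end)
{
    assert(start <= end); /* host-side sanity check only */
    while (start < end) {
        *start++ = 0u;
    }
}
```

Neither loop calls into any library, which is exactly the property the existing startup files seem to rely on.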
Question:
As stated, I've seen numerous startup files, and they initialize globals in different ways, some more efficient in terms of space and time than others. But they all have something odd in common: they use neither memset nor memcpy, resorting instead to hand-written loops to do the job. As it appears natural to me to use standard functions when possible (simple "DRY principle"), I tried the following in lieu of the initial hand-written loops:
/* Initialize .data section */
ldr r0, DATA_START /* destination: run address in RAM */
ldr r1, DATA_LOAD /* source: load address in flash */
ldr r2, DATA_SIZE
bl memcpy /* memcpy(DATA_START, DATA_LOAD, DATA_SIZE); */
/* Initialize .bss section */
ldr r0, BSS_START
mov r1, #0
ldr r2, BSS_SIZE
bl memset /* memset(BSS_START, 0, BSS_SIZE); */
... and it worked perfectly. The space savings are negligible, but it is clearly dead simple now.
So, I thought about it, and I see no reason to do hand-written loops in this case:
memcpy and memset are very likely to be linked in the executable anyway, because the programmer would use it directly, or indirectly through another library;
It is smaller;
Speed is not a very important factor for startup code, but nevertheless it is likely faster;
It's nearly impossible to get it wrong.
Any idea why one wouldn't rely on memcpy and memset for startup code?

I suspect the startup code does not want to make assumptions about the implementation of memcpy and such in libc. For example, the implementation of memcpy might use a global variable set by libc initialization code to report which cpu extensions are available, in order to provide optimized SIMD copying on machines that support such operations. At the point where the early "crt" startup code is running, the storage for such a global might be completely uninitialized (containing random junk), in which case it would be dangerous to call memcpy. Even if making the call works for you, it's a consequence of the implementation (or maybe even the unpredictable results of UB...) making it work; this is probably not something the crt code wants to depend on.
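To make that concern concrete, here is a minimal sketch of a hypothetical library copy routine that dispatches through a global set by the library's own init code. None of these names (selected_copy, lib_init, lib_memcpy) come from a real libc; they are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Plain byte loop: touches no global state at all. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) {
        *d++ = *s++;
    }
    return dst;
}

/* Hypothetical library state. It lives in .bss, so on bare metal it holds
   junk until the startup code has cleared .bss and lib_init() has run. */
static copy_fn selected_copy;

/* Hypothetical library init: a real one might probe CPU features here. */
void lib_init(void)
{
    selected_copy = copy_bytes;
}

int lib_copy_is_ready(void)
{
    return selected_copy != NULL;
}

/* Dispatching copy: only safe once selected_copy has been set. */
void *lib_memcpy(void *dst, const void *src, size_t n)
{
    assert(selected_copy != NULL);
    return selected_copy(dst, src, n);
}
```

On a hosted system the loader zeroes selected_copy, so a premature call is at least detectable. In early startup code the pointer could hold arbitrary RAM contents, and calling through it would jump to a random address.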

Whether the standard library is linked at all is a decision for the application developer (-nostdlib may be used, for example), but the start-up code is required, so it cannot make any assumptions.
Further, the purpose of the start-up code is to establish an environment in which C code can run; before that is complete, it is by no means a given that any library code that might reasonably assume a complete run-time environment will run correctly. For the functions in question this is perhaps not an issue in many cases, but you cannot know that.
The start-up code has to at least establish a stack and initialise static data, in C++ it additionally calls the constructors of global static objects. The standard library might reasonably assume those are established, so using the standard library before then may conceivably result in erroneous behaviour.
Finally you should be clear that the C language and the C standard library are distinct entities. The language must necessarily be capable of standing alone.

I don't think this is likely to have anything to do with "assumptions about the internal state of memcpy/memset"; they are unlikely to use any global resources (though I suppose some odd cases exist where they do).
Startup code on microcontrollers is usually written in "inline assembler" in this manner, simply because it runs at an early stage, where a stack might not yet be present and the MMU setup may not yet have been executed. Init code therefore doesn't want to risk putting anything on the stack, simple as that. Function calls put things on the stack.
So while this happened to be the initialization code of the static storage copy-down, you are likely to find the same inline assembler in other such init code as well. For example you will likely find some fundamental register setup code written in assembler somewhere before the copy-down, and you will also find the MMU setup in assembler somewhere around there too.

Related

Functions splitting effect on running time

I am writing DSP code in C (in a Windows environment). The code is to be modified by another engineer to run on a Cortex-M4. This engineer claims that, to reduce running time, many of the functions I have implemented should be merged into one function. I would prefer to avoid that, to preserve clarity and testability.
Does his claim make sense? If so, where can I read about it? Otherwise, can I show that he is wrong without a comparison of running time?
Does his claim make sense?
Depends on context. Modern compilers are perfectly able to inline function calls, but that usually means that those functions must be placed in the same translation unit (essentially the same .c file).
If your functions are in the same .c file then their claim is wrong; if you have the functions scattered across multiple files, then their claim is likely correct.
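As a small illustration of the same-translation-unit case, the sketch below keeps each processing step in its own function. The function names and the Q15 gain format are assumptions chosen for the example, not taken from the question.

```c
#include <assert.h>
#include <stdint.h>

/* Each step is its own function, so it can be tested in isolation. */
static inline int32_t apply_gain(int32_t sample, int32_t gain_q15)
{
    /* Q15 fixed point: 32768 represents a gain of 1.0 */
    return (int32_t)(((int64_t)sample * gain_q15) / 32768);
}

static inline int16_t saturate16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* All three functions share one translation unit, so the optimizer can
   see both helper bodies and is free to inline them here. */
int16_t process_sample(int16_t in, int32_t gain_q15)
{
    assert(gain_q15 >= 0); /* this sketch only handles non-negative gains */
    return saturate16(apply_gain(in, gain_q15));
}
```

Whether the calls are actually inlined depends on the compiler and optimization level, so checking the generated code is still the only way to be sure.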
If so, where can I read about it?
Function inlining has been around for some 30 years. C even added an inline keyword for it in 1999 (C++ had one earlier still), though during the 2000s compilers became smarter than programmers at determining when and what to inline. Nowadays, with modern compilers, the inline keyword is mostly considered obsolete as an optimization hint.
Otherwise, can I show that he is wrong without a comparison of running time?
By disassembling the optimized code and seeing whether there are any function calls. Still, function calls are relatively cheap on Cortex-M (unless there are a ton of different parameters), so manually removing them would be a very small optimization.
As always there's a choice between code size and execution speed.
If you wish to remove the stack overhead of calling a new function but wish to keep your code modular then consider using the inline function attribute suitable for your compiler e.g.
static inline void com_ClearMessageBuffer(uint8_t* pBuffer, uint32_t length)
{
NRF_LOG_DEBUG("com_ClearMessageBuffer");
memset(pBuffer, 0, length);
}
Then at compile time your inline function code will be inserted into the code flow wherever it is called.
This will speed execution, but when called multiple times increase the code size.

C startup code is only written in assembly confusion

I understand that the C startup code is for initializing the C runtime environment: it initializes static variables, sets up the stack pointer etc., and finally branches to main().
They say that this can only be written in assembly language as it's platform-specific. However, can't this still be written in C and compiled for the specific platform?
Function calls of course would not be possible because we "more than likely" don't have the stack pointer set up at that stage. I still can't see any other main reasons. Thanks in advance.
Startup code can be written in C language only if:
The implementation provides all the intrinsic functions needed to set up hardware features that cannot be set using standard C.
The implementation provides a mechanism for placing fragments of code and data in a specific place and in a specific order (gcc's support for ld linker scripts, for example).
If both conditions are met you can write the startup code in C language.
I use my own startup code written in C (instead of one provided by the chip vendors) for Cortex-M microcontrollers as ARM provides CMSIS header files with all needed inline assembly functions and gcc based toolchain gives me full memory layout control.
Most of the problem with writing early startup code in C is, in fact, the absence of a properly structured stack. It's worse than just not being able to make function calls. All of a C compiler's generated machine code assumes the existence of a stack, pointed to by the ABI-specified register, that can be used for scratch storage at any time. Changing this assumption would be so much work as to amount to a complete second "back end" for the compiler—way more work than continuing to write early startup code by hand in assembly.
Early bootstrap code, bringing up the machine from power-on, also has to do a bunch of special operations that can't usually be accessed from C, like configuring interrupts and virtual memory. And it may have to deal with the code not having been loaded at the address it was linked for, or the relocation table not having been processed, or other similar problems; these also break pervasive assumptions made by the C compiler (e.g. that it can inject a call to memcpy whenever it wants).
Despite all that, most of a user mode C library's startup code will, in fact, be written in C, for exactly the reason you are thinking. Nobody wants to write more code in assembly, over and over for each supported ISA, than absolutely necessary.
A minimal C runtime environment requires a stack, and a jump to a start address. Setting the stack pointer on most architectures requires assembly code. Once a stack is available it is possible to run code generated from C source.
ARM Cortex-M devices load the stack pointer and start address from the vector table on reset, so can in fact boot directly into code generated from C source.
On other architectures, the minimal assembly required is to set the stack pointer and jump to the start address. Thereafter it is possible to write the other start-up tasks in C (or even C++). Such start-up code is responsible for establishing the full C runtime, so it must not assume static initialisation or library initialisation (no heap or filesystem, for example); those are things that the start-up code itself must do.
In that sense you can run code generated from C source, but the environment is not strictly conforming until main() has been called, so there are some constraints.
Even where assembly code is used, it need not be the whole start-up code that is in assembly.
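The Cortex-M reset behaviour described above can be modelled on a host as follows. This is only a simulation sketch: the struct layout covers just the first two vector table entries, the stack address is a made-up value, and simulate_reset stands in for what the core does in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*handler_t)(void);

/* First two entries of a Cortex-M vector table. */
typedef struct {
    uintptr_t initial_sp; /* word 0: loaded into SP on reset */
    handler_t reset;      /* word 1: address where execution starts */
} vector_table_t;

int startup_ran; /* set by the handler so the demo can observe the call */

/* Reset handler written in C: the stack is already valid on entry. */
void reset_handler(void)
{
    /* A real handler would copy .data, clear .bss, then call main(). */
    startup_ran = 1;
}

/* Hypothetical top-of-RAM address, chosen only for illustration. */
const vector_table_t vectors = { (uintptr_t)0x20020000u, reset_handler };

/* Host-side model of what the core does in hardware at reset. */
typedef struct {
    uintptr_t sp;
    handler_t pc;
} core_state_t;

core_state_t simulate_reset(const vector_table_t *vt)
{
    core_state_t core;
    assert(vt != NULL && vt->reset != NULL);
    core.sp = vt->initial_sp;
    core.pc = vt->reset;
    return core;
}
```

On a real target the table would be placed at the start of flash by the linker script, and no code at all runs before the handler is entered.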

Shall I use register class variables in modern C programs?

In C++, the keyword register was removed in its latest standard ISO/IEC 14882:2017 (C++17).
But in C too, I see more and more coders who avoid declaring objects with the register storage-class specifier, because its purported benefit is said to be almost nil, as in #user253751's answer:
register does not cause the compiler to store a value in a register. register does absolutely nothing. Only extremely old compilers used register to know which variables to store in registers. New compilers do it automatically. Even 20-year-old compilers do it automatically.
Is the use of register class variables and with that the use of the keyword register deprecated?
Shall I use register class variables in my modern programs? Or is this behavior redundant and deprecated?
There is no benefit to using register. Modern compilers substantially ignore it — they can handle register allocation better than you can. The only thing it prevents is taking the address of the variable, which is not a significant benefit.
None of my own code uses register any more. The code I work on loses register when I get to work on a file — but it takes time to get through 17,000+ files (and I only change a file when I have an external reason to change it — but it can be a flimsy reason).
As #JonathanLeffler stated it is ignored in most cases.
Some compilers have a special extension syntax if you want to keep a variable in a particular register.
gcc: a global or local variable can be placed in a particular register. This option is not available on all platforms; I know that the AVR and ARM ports implement it.
example:
register int x asm ("10");
int foo(int y)
{
x = bar(x);
x = bar1(x);
return x*x;
}
https://godbolt.org/z/qwAZ8x
More information: https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/Explicit-Register-Variables.html#Explicit-Register-Variables
But to be honest, I have never used it in my programming career (30+ years).
It's effectively deprecated and offers no real benefit.
C is a product of the early 1970s, and the register keyword served as a hint to the compiler that a) this particular object was going to be used a lot, so b) you might want to store it somewhere other than main memory - IOW, a register or some other "fast" memory.
It may have made a difference then - now, it's pretty much ignored. The only measurable effect is that it prevents you from taking the address of that object.
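A short sketch of that one remaining effect, using a hypothetical sum_array function. The commented-out line is the part that a conforming C compiler has to reject.

```c
#include <assert.h>
#include <stddef.h>

long sum_array(const int *values, size_t count)
{
    register long total = 0; /* a hint only; compilers may ignore it */
    register size_t i;

    assert(values != NULL || count == 0);
    for (i = 0; i < count; i++) {
        total += values[i];
    }
    /* long *p = &total;  <- not allowed: the address of a register
       object cannot be taken */
    return total;
}
```

Removing both register keywords changes nothing about what the function computes.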
First of all, this feature is NOT deprecated: "register" in this context (global or local register variables) is a GNU extension, and that extension is not deprecated.
In your example, R10 (or the register that GCC internally assigns REGNO(reg) = 10), is a global register. "global" here means, that all code in your application must agree on that usage. This is usually not the case for code from libraries like libc, libm or libgcc because they are not compiled with -ffixed-10. Moreover, global registers might conflict with the ABI. avr-gcc for example might pass values in R10. In avr-gcc, R2...R9 are not used by the ABI and not by code from libgcc (except for 64-bit double).
In a hard real-time app with avr-gcc I used global regs as a (premature) optimization, only to notice that the performance gain was minuscule.
Local register variables, however, are very handy when it comes to integrating non-ABI functions, for example assembly functions that don't comply with the GCC ABI, without the need for assembly wrappers.

Is it possible to instruct C to not zero-initialize global arrays?

I'm writing an embedded application and almost all of my RAM is used by global byte-arrays. When my firmware boots it starts by overwriting the whole BSS section in RAM with zeroes, which is completely unnecessary in my case.
Is there some way I can instruct the compiler that it doesn't need to zero-initialize certain arrays? I know this can also be solved by declaring them as pointers, and using malloc(), but there are several reasons I want to avoid that.
The problem is that standard C enforces zero initialization of static objects. If the compiler skips it, it wouldn't conform to the C standard.
On embedded systems compilers there is usually a non-standard option "compact startup" or similar. When enabled, no initialization of static/global objects will occur at all, anywhere in the program. How to do this depends on your compiler, or in this case, on your gcc port.
If you mention which system you are using, someone might be able to provide a solution for that particular compiler port.
This means that any static/global (static storage duration) variable that you initialize explicitly will no longer be initialized. You will have to initialize it at run time, that is, instead of static int x=1; you will have to write static int x; x=1;. It is rather common to write embedded C programs in this manner, to make them compatible with compilers where the static initialization is disabled.
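A minimal sketch of that style, with made-up configuration variables. The names and values are assumptions for illustration only.

```c
#include <assert.h>

/* With static initialization disabled, "static int baud_rate = 9600;"
   would leave baud_rate holding whatever happened to be in RAM.
   So: no initializers here, everything is assigned at run time. */
static int baud_rate;
static int retry_limit;

/* Must be called once, early in main(), before the values are read. */
void config_init(void)
{
    baud_rate = 9600;
    retry_limit = 3;
}

void config_set_baud_rate(int rate)
{
    assert(rate > 0);
    baud_rate = rate;
}

int config_baud_rate(void)
{
    return baud_rate;
}

int config_retry_limit(void)
{
    return retry_limit;
}
```

The same source also behaves correctly on a compiler that does perform static initialization, which is why the pattern is portable.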
It turned out that the linker-script included in my toolchain has a special "noinit" section.
__attribute__ ((section (".noinit")))
/** Forces the compiler to not automatically zero the given global
variable on startup, so that the current RAM contents is retained.
Under most conditions this value will be random due to the
behaviour of volatile memory once power is removed, but may be used in some specific
circumstances, like the passing of values back after a system watchdog reset.
*/
So all global variables marked with that attribute will not be zero-initialised during boot.
The C standard REQUIRES global data to be initialized to zero.
It is possible that SOME embedded system manufacturers provide a way to bypass this requirement, but there are certainly many typical applications that would simply fail if the "initialize to zero" wasn't done.
Some compilers also allow you to have further sections, which may have other characteristics than the 'bss' section.
The other alternative is of course to "make your own allocation". Since it's an embedded system, I suppose you have control over how the application and data is loaded into RAM, in particular, what addresses are used for that.
So, you could use a pointer, and simply use your own mechanism for assigning the pointer to a memory region that is reserved for whatever you need large arrays for. This avoids the rather complex usage of malloc - and it gives you a more or less permanent address, so you don't have to worry about trying to find where your data is later on. This will of course have a small effect on performance, since it adds another level of indirection, but in most cases, that disappears as soon as the array is used as an argument to a function, as it decays to a pointer at that point anyways.
There are a few workarounds like:
Deleting the BSS section from the binary or setting its size to 0 or 1. This will not work if the loader must explicitly allocate memory for all sections. This will work if the loader simply copies data to the RAM.
Declaring your arrays as extern in C code and defining the symbols (along with their addresses) either in assembly code in separate assembly files or in the linker script. Again, if memory must be explicitly allocated, this won't work.
Patching or removing the relevant BSS-zeroing code either in the loader or in the startup code that is executed in your program before main().
All embedded compilers should allow a noinit segment. With the IAR AVR compiler the variables you don't want to be initialised are simply declared as follows:
__no_init uint16_t foo;
The most useful reason for this is to allow variables to maintain their values over a watchdog or brown-out reset, which of course doesn't happen in computer-based C programs, hence its omission from standard C.
Just search your compiler manual for "noinit" or something similar.
Are you sure the binary format actually includes a BSS section in the binary? In the binary formats I've worked with, BSS is simply an integer that tells the kernel/loader how much memory to allocate and zero out.
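For the watchdog-reset use case, a common pattern is to guard the uninitialized data with a marker value. The sketch below uses the GCC section attribute and assumes the linker script provides a ".noinit" section that the startup code leaves alone; the NOINIT macro, the marker value, and the function name are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the linker script has a ".noinit" section that the startup
   code neither copies nor clears. On non-ELF hosts the macro is empty. */
#if defined(__GNUC__) && defined(__ELF__)
#define NOINIT __attribute__((section(".noinit")))
#else
#define NOINIT
#endif

#define RESET_MAGIC 0xC0FFEE42u /* arbitrary marker value */

static uint32_t reset_magic NOINIT;
static uint32_t reset_count NOINIT;

/* Call once per boot. Returns how many boots have happened since the
   RAM contents were last lost (for example at power-on). */
uint32_t note_reset(void)
{
    if (reset_magic != RESET_MAGIC) {
        /* RAM holds junk: treat this as a cold start */
        reset_magic = RESET_MAGIC;
        reset_count = 0u;
    }
    reset_count++;
    assert(reset_count != 0u); /* counter wrap is not handled in this sketch */
    return reset_count;
}
```

There is a small chance that random RAM contents match the marker after power-on, so real firmware often adds a checksum or also consults the reset-cause register.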
There definitely is no general way in C to get uninitialized global variables. This would be a function of your compiler/linker/runtime system and highly specific to that.
with gcc, -fno-zero-initialized-in-bss

Fake anonymous functions in C

In this SO thread, Brian Postow suggested a solution involving fake anonymous functions:
make a comp(L) function that returns the version of comp for arrays of length L... that way L becomes a parameter, not a global
How do I implement such a function?
See the answer I just posted to that question. You can use the callback(3) library to generate new functions at runtime. It's not standards compliant, since it involves lots of ugly platform-specific hacks, but it does work on a large number of systems.
The library takes care of allocating memory, making sure that memory is executable, and flushing the instruction cache if necessary, in order to ensure that code which is dynamically generated (i.e. the closure) is executable. It essentially generates stubs of code that might look like this on x86:
pop %ecx
push $THUNK
push %ecx
jmp $function
THUNK:
.long $parameter
And then returns the address of the first instruction. What this stub does is store the return address in ECX (a scratch register in the x86 calling convention), push an extra parameter onto the stack (a pointer to a thunk), and then re-push the return address. Then it jumps to the actual function. This fools the function into thinking it has an extra parameter, which is the hidden context of the closure.
It's actually more complicated than that (the actual function called at the end of the stub is __vacall_r, not the function itself, and __vacall_r() handles more implementation details), but that's the basic principle.
I don't believe you can do that with C99 -- there's no partial application or closure facility available unless you start manually generating machine code at runtime.
Apple's recently proposed blocks would work, though you need compiler support for that. Here's a brief overview of blocks. I have no idea when/if any vendor outside Apple will support them.
It is not possible to generate ordinary functions during run-time in either C or C++. What Brian suggested is based on a big "if": "...if you can fake anonymous functions...". And the answer to that "if" is: no, you can't. (Although it is not clear what he meant by "fake".)
(In C++ it is possible to generate function-like objects at run time, but not ordinary functions.)
The above applies to the standard C and C++ languages. Particular implementations can support various implementation-provided extensions and/or manually-implemented hacks, like "closures", "delegates" and similar stuff. None of that, of course, has anything to do with the standard C/C++ languages.
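Within standard C, the usual substitute is to pass the context explicitly alongside the function pointer, so that L is a parameter rather than a global. The sketch below is one way to do that; comp_rows, sort_with_ctx and the simple insertion sort are assumptions for illustration, not an existing library API.

```c
#include <assert.h>
#include <stddef.h>

/* Comparator that receives its "captured" state through ctx. */
typedef int (*cmp_ctx_fn)(const void *a, const void *b, void *ctx);

/* Compares two int arrays lexicographically; ctx points to the length L. */
int comp_rows(const void *a, const void *b, void *ctx)
{
    const int *x = a;
    const int *y = b;
    size_t len = *(const size_t *)ctx;
    for (size_t i = 0; i < len; i++) {
        if (x[i] != y[i]) {
            return (x[i] < y[i]) ? -1 : 1;
        }
    }
    return 0;
}

static void swap_bytes(unsigned char *a, unsigned char *b, size_t n)
{
    while (n--) {
        unsigned char t = *a;
        *a++ = *b;
        *b++ = t;
    }
}

/* Insertion sort by adjacent swaps that forwards ctx to every comparison. */
void sort_with_ctx(void *base, size_t count, size_t elem_size,
                   cmp_ctx_fn cmp, void *ctx)
{
    unsigned char *p = base;
    assert(base != NULL || count == 0);
    for (size_t i = 1; i < count; i++) {
        size_t j = i;
        while (j > 0 &&
               cmp(p + (j - 1) * elem_size, p + j * elem_size, ctx) > 0) {
            swap_bytes(p + (j - 1) * elem_size, p + j * elem_size, elem_size);
            j--;
        }
    }
}
```

This does not help when the callee is the standard qsort, whose comparator takes no context argument, which is what pushes people toward the run-time code generation tricks described above.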
