Does Clang misunderstand the 'const' pointer specifier?

In the code below I noticed that Clang fails to perform a better optimisation unless the pointers carry an explicit restrict qualifier:
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
typedef struct {
    uint32_t event_type;
    uintptr_t param;
} event_t;

typedef struct
{
    event_t *queue;
    size_t size;
    uint16_t num_of_items;
    uint8_t rd_idx;
    uint8_t wr_idx;
} queue_t;

static bool queue_is_full(const queue_t *const queue_ptr)
{
    return queue_ptr->num_of_items == queue_ptr->size;
}

static size_t queue_get_size_mask(const queue_t *const queue_ptr)
{
    return queue_ptr->size - 1;
}

int queue_enqueue(queue_t *const queue_ptr, const event_t *const event_ptr)
{
    if(queue_is_full(queue_ptr))
    {
        return 1;
    }

    queue_ptr->queue[queue_ptr->wr_idx++] = *event_ptr;
    queue_ptr->num_of_items++;
    queue_ptr->wr_idx &= queue_get_size_mask(queue_ptr);

    return 0;
}
I compiled this code with clang version 11.0.0 (clang-1100.0.32.5)
clang -O2 -arch armv7m -S test.c -o test.s
In the disassembled file I saw that the generated code re-reads the memory:
_queue_enqueue:
.cfi_startproc
# %bb.0:
ldrh r2, [r0, #8] ---> reads the queue_ptr->num_of_items
ldr r3, [r0, #4] ---> reads the queue_ptr->size
cmp r3, r2
itt eq
moveq r0, #1
bxeq lr
ldrb r2, [r0, #11] ---> reads the queue_ptr->wr_idx
adds r3, r2, #1
strb r3, [r0, #11] ---> stores the queue_ptr->wr_idx + 1
ldr.w r12, [r1]
ldr r3, [r0]
ldr r1, [r1, #4]
str.w r12, [r3, r2, lsl #3]
add.w r2, r3, r2, lsl #3
str r1, [r2, #4]
ldrh r1, [r0, #8] ---> !!! re-reads the queue_ptr->num_of_items
adds r1, #1
strh r1, [r0, #8]
ldrb r1, [r0, #4] ---> !!! re-reads the queue_ptr->size (only the first byte)
ldrb r2, [r0, #11] ---> !!! re-reads the queue_ptr->wr_idx
subs r1, #1
ands r1, r2
strb r1, [r0, #11] ---> !!! stores the updated queue_ptr->wr_idx once again after applying the mask
movs r0, #0
bx lr
.cfi_endproc
# -- End function
After adding the restrict keyword to the pointers, these unneeded re-reads just vanished:
int queue_enqueue(queue_t * restrict const queue_ptr, const event_t * restrict const event_ptr)
I know that in Clang, strict aliasing is disabled by default. But in this case the event_ptr pointer is declared const, so the object's contents cannot be modified through this pointer, and thus it cannot affect the contents that queue_ptr points to (even in the case where the objects overlap in memory), right?
So is this a compiler optimisation bug, or is there indeed some weird case where the object pointed to by queue_ptr can be affected by event_ptr, assuming this declaration:
int queue_enqueue(queue_t *const queue_ptr, const event_t *const event_ptr)
By the way, I tried to compile the same code for an x86 target and observed a similar optimisation issue.
The generated assembly with the restrict keyword doesn't contain the re-reads:
_queue_enqueue:
.cfi_startproc
# %bb.0:
ldr r3, [r0, #4]
ldrh r2, [r0, #8]
cmp r3, r2
itt eq
moveq r0, #1
bxeq lr
push {r4, r6, r7, lr}
.cfi_def_cfa_offset 16
.cfi_offset lr, -4
.cfi_offset r7, -8
.cfi_offset r6, -12
.cfi_offset r4, -16
add r7, sp, #8
.cfi_def_cfa r7, 8
ldr.w r12, [r1]
ldr.w lr, [r1, #4]
ldrb r1, [r0, #11]
ldr r4, [r0]
subs r3, #1
str.w r12, [r4, r1, lsl #3]
add.w r4, r4, r1, lsl #3
adds r1, #1
ands r1, r3
str.w lr, [r4, #4]
strb r1, [r0, #11]
adds r1, r2, #1
strh r1, [r0, #8]
movs r0, #0
pop {r4, r6, r7, pc}
.cfi_endproc
# -- End function
Addition:
After some discussion with Lundin in the comments to his answer, I got the impression that the re-reads could be caused by the compiler assuming that queue_ptr->queue might potentially point to *queue_ptr itself. So I changed the queue_t struct to contain an array instead of a pointer:
typedef struct
{
    event_t queue[256]; // changed from pointer to array with max size
    size_t size;
    uint16_t num_of_items;
    uint8_t rd_idx;
    uint8_t wr_idx;
} queue_t;
However, the re-reads remained. I still can't understand what could make the compiler think that the queue_t fields may be modified and thus require re-reads... The following declaration eliminates the re-reads:
int queue_enqueue(queue_t * restrict const queue_ptr, const event_t *const event_ptr)
But I don't understand why queue_ptr has to be declared as a restrict pointer to prevent the re-reads (unless it is a compiler optimization "bug").
P.S.
I also couldn't find any link to file/report an issue against Clang for a problem that doesn't cause the compiler to crash...

[talking about the original program]
This is caused by a deficiency in the TBAA metadata generated by Clang.
If you emit LLVM IR with -S -emit-llvm you'll see (snipped for brevity):
...
%9 = load i8, i8* %wr_idx, align 1, !tbaa !12
%10 = trunc i32 %8 to i8
%11 = add i8 %10, -1
%conv4 = and i8 %11, %9
store i8 %conv4, i8* %wr_idx, align 1, !tbaa !12
br label %return
...
!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 1, !"min_enum_size", i32 4}
!2 = !{!"clang version 10.0.0 (/home/chill/src/llvm-project 07da145039e1a6a688fb2ac2035b7c062cc9d47d)"}
!3 = !{!4, !9, i64 8}
!4 = !{!"queue", !5, i64 0, !8, i64 4, !9, i64 8, !6, i64 10, !6, i64 11}
!5 = !{!"any pointer", !6, i64 0}
!6 = !{!"omnipotent char", !7, i64 0}
!7 = !{!"Simple C/C++ TBAA"}
!8 = !{!"int", !6, i64 0}
!9 = !{!"short", !6, i64 0}
!10 = !{!4, !8, i64 4}
!11 = !{!4, !5, i64 0}
!12 = !{!4, !6, i64 11}
See the TBAA metadata !4: this is the type descriptor for queue_t (btw, I added names to the structs, e.g. typedef struct queue ...; you may see an empty string there instead). Each element in the description corresponds to a struct field. Look at !5, which describes the event_t *queue field: it's "any pointer"! At this point we've lost all information about the actual type of the pointer, which tells me the compiler must assume that writes through this pointer can modify any memory object.
That said, there's a new form of TBAA metadata which is more precise (it still has deficiencies, but more on that later ...).
Compile the original program with -Xclang -new-struct-path-tbaa. My exact command line was (and I've included stddef.h instead of stdlib.h since that was a development build without libc):
./bin/clang -I lib/clang/10.0.0/include/ -target armv7m-eabi -O2 -Xclang -new-struct-path-tbaa -S queue.c
The resulting assembly is (again, some fluff snipped):
queue_enqueue:
push {r4, r5, r6, r7, lr}
add r7, sp, #12
str r11, [sp, #-4]!
ldrh r3, [r0, #8]
ldr.w r12, [r0, #4]
cmp r12, r3
bne .LBB0_2
movs r0, #1
ldr r11, [sp], #4
pop {r4, r5, r6, r7, pc}
.LBB0_2:
ldrb r2, [r0, #11] ; load `wr_idx`
ldr.w lr, [r0] ; load `queue` member
ldrd r6, r1, [r1] ; load data pointed to by `event_ptr`
add.w r5, lr, r2, lsl #3 ; compute address to store the event
str r1, [r5, #4] ; store `param`
adds r1, r3, #1 ; increment `num_of_items`
adds r4, r2, #1 ; increment `wr_idx`
str.w r6, [lr, r2, lsl #3] ; store `event_type`
strh r1, [r0, #8] ; store new value for `num_of_items`
sub.w r1, r12, #1 ; compute size mask
ands r1, r4 ; bitwise and size mask with `wr_idx`
strb r1, [r0, #11] ; store new value for `wr_idx`
movs r0, #0
ldr r11, [sp], #4
pop {r4, r5, r6, r7, pc}
Looks good, doesn't it! :D
I mentioned earlier that there are deficiencies with the "new struct path"; for those, see the mailing list.
PS. I'm afraid there's no general lesson to be learned in this case. In principle, the more information one is able to give the compiler, the better: things like restrict, strong typing (no gratuitous casts, type punning, etc.), relevant function and variable attributes ... but not in this case; the original program already contained all the necessary information. It is just a compiler deficiency, and the best way to tackle those is to raise awareness: ask on the mailing list and/or file bug reports.

The queue member of *queue_ptr could point at the same memory as event_ptr. Compilers tend to produce less efficient code when they can't rule out that two pointers point at the same memory. So there's nothing strange about restrict leading to better code.
The const qualifiers don't really matter, because they were added by the function and the original object could be modifiable elsewhere. In particular, the * const doesn't add anything, because the pointer is already a local copy of the original pointer, so nobody, including the caller, cares whether the function modifies that local copy.
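To illustrate (a minimal sketch, not from the question itself): const only restricts what this function may do through the pointer p, not whether the pointed-to object changes:
int observe(const int *p, int *q)
{
    int before = *p;
    *q = 42;            /* if p and q alias, this also changes *p */
    return *p - before; /* so the compiler must re-read *p here */
}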
"Strict aliasing" rather refers to the cases where the compiler can cut corners like when assuming that a uint16_t* can't point at a uint8_t* etc. But in your case you have two completely compatible types, one of them is merely wrapped in an outer struct.

As far as I can tell, yes, in your code the contents of queue_ptr's pointee cannot be modified. Is it an optimization bug? It's a missed optimization opportunity, but I wouldn't call it a bug. The compiler doesn't "misunderstand" const; it just doesn't have/doesn't do the necessary analyses to determine that the object cannot be modified in this specific scenario.
As a side note: queue_is_full(queue_ptr) could modify the contents of *queue_ptr even though it takes a const queue_t *const parameter, because it can legally cast away the constness as long as the original object is not const. That being said, the definition of queue_is_full is visible and available to the compiler, so it can ascertain that it indeed does not.
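For example (a contrived sketch reusing the question's types), this is valid C as long as the underlying queue_t object was not itself declared const:
static bool queue_is_full(const queue_t *const queue_ptr)
{
    ((queue_t *)queue_ptr)->rd_idx = 0; /* casting away const is legal here */
    return queue_ptr->num_of_items == queue_ptr->size;
}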

As you know, your code appears to modify the data, invalidating the const state:
queue_ptr->num_of_items++; // this stores data in the memory
Without the restrict keyword, the compiler must assume that the two types might share the same memory space.
This is required in the updated example because event_t is a member of queue_t and strict aliasing applies on:
... an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or...
In the original example there are a number of reasons the types might be considered to alias, leading to the same result (e.g., the use of a char pointer and the fact that the types might be considered compatible "enough" on some architectures, if not all).
Hence, the compiler is required to reload the memory after it was mutated to avoid possible conflicts.
The const keyword doesn't really enter into this, because a mutation happens through a pointer that might point to the same memory address.
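For instance (a contrived sketch), nothing at the call site forbids the queue storage from overlapping the queue_t object itself:
queue_t q = {0};
q.queue = (event_t *)&q; /* the element storage now overlaps the queue_t itself */
q.size = 8;
event_t ev = { 1, 2 };
queue_enqueue(&q, &ev);  /* the element store may clobber q.size, q.num_of_items, ... */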
(EDIT) For your convenience, here's the full rule regarding access to a variable:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types (88):
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the object,
— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
(88) The intent of this list is to specify those circumstances in which an object may or may not be aliased.
P.S.
The _t suffix is reserved by POSIX. You might consider using a different suffix.
It's common practice to use _s for structs and _u for unions.

Related

Avoid volatile bit-field assignment expression reading or writing memory several times

I want to use a volatile bit-field struct to set a hardware register, like the following code:
union foo {
    uint32_t value;
    struct {
        uint32_t x : 1;
        uint32_t y : 3;
        uint32_t z : 28;
    };
};

union foo f = {0};

int main()
{
    volatile union foo *f_ptr = &f;
    //union foo tmp;
    *f_ptr = (union foo) {
        .x = 1,
        .y = 7,
        .z = 10,
    };
    //*f_ptr = tmp;
    return 0;
}
However, the compiler turns this into several STR/LDR accesses to the hardware register.
That is terrible, because the hardware may act immediately each time the register is written.
main:
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
movw r3, #:lower16:.LANCHOR0
movs r0, #0
movt r3, #:upper16:.LANCHOR0
ldr r2, [r3]
orr r2, r2, #1
str r2, [r3]
ldr r2, [r3]
orr r2, r2, #14
str r2, [r3]
ldr r2, [r3]
and r2, r2, #15
orr r2, r2, #160
str r2, [r3]
bx lr
.size main, .-main
.global f
.bss
.align 2
My gcc version is: arm-linux-gnueabi-gcc (Linaro GCC 4.9-2017.01) 4.9.4,
and I build with -O2 optimisation.
I have tried using a local variable to work around this problem:
union foo {
    uint32_t value;
    struct {
        uint32_t x : 1;
        uint32_t y : 3;
        uint32_t z : 28;
    };
};

union foo f = {0};

int main()
{
    volatile union foo *f_ptr = &f;
    union foo tmp;
    tmp = (union foo) {
        .x = 1,
        .y = 7,
        .z = 10,
    };
    *f_ptr = tmp;
    return 0;
}
With this, it does not STR to the hardware register several times:
main:
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
movs r1, #10
movs r2, #15
movw r3, #:lower16:.LANCHOR0
bfi r2, r1, #4, #28
movt r3, #:upper16:.LANCHOR0
movs r0, #0
str r2, [r3]
bx lr
.size main, .-main
.global f
.bss
.align 2
I think it is still not a good idea to use a local variable, considering the binary size limitations of embedded systems.
Is there any way to handle this problem without using local variable?
I think this is a bug in GCC. Per discussion below, you might consider using:
f_ptr->value = (union foo) {
    .x = 1,
    .y = 7,
    .z = 10,
} .value;
By the C standard, the code a compiler generates for a program may not access a volatile object when the original C code nominally does not access the object. The code *f_ptr = (union foo) { .x = 1, .y = 7, .z = 10, }; is a single assignment to *f_ptr. So we would expect this to generate a single store to *f_ptr; generating two stores is a violation of the standard’s requirements.
We could consider an explanation for this to be that GCC is treating the aggregate (the union and/or the structure within it) as several objects, each individually volatile, rather than one aggregated volatile object.1 But, if this were so, then it ought to generate separate 16-bit strh instructions for the parts (per the original example code, which had 16-bit parts), not the 32-bit str instructions we see.
While using a local variable appears to work around the issue, I would not rely on that, because the assignment of the compound literal above is semantically equivalent, so the cause of why GCC generates broken assembly code for one sequence of code and not the other is unclear. With different circumstances (such as additional or modified code in the function or other variations that might affect optimization), GCC might generate broken code with the local variable too.
What I would do is avoid using an aggregate for the volatile object. The hardware register is, presumably, physically more like a 32-bit unsigned integer than like a structure of bit-fields (even though semantically it is defined with bit-fields). So I would define the register as volatile uint32_t and use that type when assigning values to it. Those values could be prepared with bit shifts or structures with bit-fields or whatever other method you prefer.
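A minimal sketch of that approach, reusing the bit layout from the question (the register address and macro names here are made up):
#include <stdint.h>

#define FOO_X(v) (((uint32_t)(v) & 0x1u) << 0)       /* x : 1  */
#define FOO_Y(v) (((uint32_t)(v) & 0x7u) << 1)       /* y : 3  */
#define FOO_Z(v) (((uint32_t)(v) & 0xFFFFFFFu) << 4) /* z : 28 */

#define FOO_REG (*(volatile uint32_t *)0x40000000u)  /* hypothetical register address */

void set_foo(void)
{
    FOO_REG = FOO_X(1) | FOO_Y(7) | FOO_Z(10); /* exactly one 32-bit store */
}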
It should not be necessary to avoid using local variables, as the optimizer should effectively eliminate them. However, if you wish to neither change the register definition nor use local variables, an alternative is the code I opened with:
f_ptr->value = (union foo) {
    .x = 1,
    .y = 7,
    .z = 10,
} .value;
That prepares the value to be stored but then assigns it using the uint32_t member of the union rather than the whole union, and testing with ARM GCC 4.6.4 on Compiler Explorer (the closest match I could find to what you are using) suggests it generates a single store with minimal code:
main:
ldr r3, .L2
mov r2, #175
str r2, [r3, #0]
mov r0, #0
bx lr
.L2:
.word .LANCHOR0
.LANCHOR0 = . + 0
f:
Footnote
1 I would consider this to be a bug too, as the C standard makes no provision for applying the volatile qualifier on a union or structure declaration as applying to the members rather than to the whole aggregate. For arrays, it does say that qualifiers apply to the elements rather than the whole array (C 2018 6.7.3 10). It has no such wording for unions or structures.
You can force the aggregate union to be written in one go with:
f_ptr->value = (union foo) {
    .x = 10,
    .y = 20,
}.value;

// produced asm
mov r1, #10
orr r1, r1, #1310720
str r1, [r0]
bx lr
There seems to be no need for bitfields in your program: using uint16_t types should make it simpler and generate better code:
#include <stdint.h>

union foo {
    uint32_t value;
    struct {
        uint16_t x;
        uint16_t y;
    };
};

extern union foo f;

int main() {
    volatile union foo *f_ptr = &f;
    *f_ptr = (union foo) {
        .x = 10,
        .y = 20,
    };
    return 0;
}
Code generated by arm gcc 4.6.4 linux, as produced by Godbolt Compiler Explorer:
main:
ldr r3, .L2
mov r0, #0
mov r2, #10
str r0, [r3, #0]
strh r2, [r3, #0] # movhi
mov r2, #20
strh r2, [r3, #2] # movhi
bx lr
.L2:
.word f
The code is much simpler but still performs a redundant store for the 32-bit value: str r0, [r3, #0] when storing the union in one shot.
Investigating this, I tried different approaches and got surprising results: using struct or union assignment generates code potentially inappropriate for memory-mapped hardware registers; storing the fields element-wise seems required to generate the proper code:
#include <stdint.h>

union foo {
    uint32_t value;
    struct {
        uint16_t x;
        uint16_t y;
    };
};

extern union foo f;

void store_1(void) {
    volatile union foo *f_ptr = &f;
    *f_ptr = (union foo) {
        .x = 10,
        .y = 20,
    };
}

void store_2(void) {
    volatile union foo *f_ptr = &f;
    union foo bar = { .x = 10, .y = 20, };
    *f_ptr = bar;
}

void store_3(void) {
    volatile union foo *f_ptr = &f;
    f_ptr->x = 10;
    f_ptr->y = 20;
}

int main() {
    return 0;
}
Furthermore, removing the uint32_t value; generates calls to memcpy for the struct assignment version.
Code generated by arm gcc 4.6.4 linux:
store_1:
ldr r3, .L2
mov r2, #0
str r2, [r3, #0]
mov r2, #10
strh r2, [r3, #0] # movhi
mov r2, #20
strh r2, [r3, #2] # movhi
bx lr
.L2:
.word f
store_2:
ldr r3, .L5
ldr r2, .L5+4
str r2, [r3, #0]
bx lr
.L5:
.word f
.word 1310730
store_3:
ldr r3, .L8
mov r2, #10
strh r2, [r3, #0] # movhi
mov r2, #20
strh r2, [r3, #2] # movhi
bx lr
.L8:
.word f
main:
mov r0, #0
bx lr
Further investigation seems to link the issue to the use of volatile union foo *f_ptr = &f; instead of tagging the union members as volatile:
#include <stdint.h>

union foo {
    uint32_t value;
    struct {
        volatile uint16_t x;
        volatile uint16_t y;
    };
};

extern union foo f;

void store_1(void) {
    union foo *f_ptr = &f;
    *f_ptr = (union foo) {
        .x = 10,
        .y = 20,
    };
    *f_ptr = (union foo) {
        .x = 10,
        .y = 20,
    };
}

void store_2(void) {
    union foo *f_ptr = &f;
    union foo bar = { .x = 10, .y = 20, };
    *f_ptr = bar;
    *f_ptr = bar;
}

void store_3(void) {
    union foo *f_ptr = &f;
    f_ptr->x = 10;
    f_ptr->y = 20;
    f_ptr->x = 10;
    f_ptr->y = 20;
}
Code generated:
store_1:
ldr r3, .L2
mov r1, #10
mov r2, #20
strh r1, [r3, #0] # movhi
strh r2, [r3, #2] # movhi
strh r1, [r3, #0] # movhi
strh r2, [r3, #2] # movhi
bx lr
.L2:
.word f
store_2:
ldr r3, .L5
ldr r2, .L5+4
str r2, [r3, #0]
bx lr
.L5:
.word f
.word 1310730
store_3:
ldr r3, .L8
mov r1, #10
mov r2, #20
strh r1, [r3, #0] # movhi
strh r2, [r3, #2] # movhi
strh r1, [r3, #0] # movhi
strh r2, [r3, #2] # movhi
bx lr
.L8:
.word f
As you can see, assigning the union does not generate appropriate code in store_2, even when qualifying the value member as volatile too.
Using C99 compound literals seems to work correctly in store_1, generating the repeated stores when the fields are qualified as volatile.
Yet I would recommend assigning the fields explicitly as in store_3, making the assignment order explicit too. If instead you want to generate a single 32-bit store, assuming it is correct for your hardware, Aki Suihkonen suggested an interesting approach.
The original problem is a side effect of how the compiler generates code for assigning compound literals to structures and unions: it first initializes the destination to all bits zero, then stores the members specified in the compound literal explicitly. Redundant stores are eliminated unless the destination is volatile qualified. I don't believe this behavior is mandated by the C Standard, so it may well be compiler specific.

How do I know if the compiler will optimize a variable?

I am new to microcontrollers. I have read a lot of articles and documentation about volatile variables in C. What I understood is that with volatile we are telling the compiler not to cache or optimize the variable. However, I still didn't get when this should really be used. For example, let's say I have a simple counter and a for loop like this:
for(int i=0; i < blabla.length; i++) {
    //code here
}
or maybe when I write a simple piece of code like this:
int i=1;
int j=1;
printf("the sum is: %d\n", i+j);
I have never cared about compiler optimization for such examples. But in many cases, if the variable is not declared volatile, the output won't be as expected. How would I know that I have to care about compiler optimization in other examples?
Simple example:
int flag = 1;
while (flag)
{
    do something that doesn't involve flag
}
This can be optimized to:
while (true)
{
    do something
}
because the compiler knows that flag never changes.
with this code:
volatile int flag = 1;
while (flag)
{
    do something that doesn't involve flag
}
nothing will be optimized, because now the compiler knows: "although the program doesn't change flag inside the while loop, it might change anyway".
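A typical scenario where this matters (a sketch; the handler name is made up): an interrupt handler clears the flag behind the compiler's back, so the loop only terminates if flag is volatile:
volatile int flag = 1;

void TIMER0_IRQHandler(void) /* hypothetical interrupt vector entry */
{
    flag = 0; /* ends the while (flag) loop in the main code */
}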
According to cppreference:
volatile object - an object whose type is volatile-qualified, or a subobject of a volatile object, or a mutable subobject of a const-volatile object. Every access (read or write operation, member function call, etc.) made through a glvalue expression of volatile-qualified type is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be optimized out or reordered with another visible side effect that is sequenced-before or sequenced-after the volatile access. This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order). Any attempt to refer to a volatile object through a non-volatile glvalue (e.g. through a reference or pointer to non-volatile type) results in undefined behavior.
This explains why some optimizations can't be made by the compiler, since it can't entirely predict at compile time when the value will be modified. This qualifier is useful to tell the compiler that it shouldn't do these optimizations, because the value can be changed in a way unknown to the compiler.
I have not worked with microcontrollers recently, but I think that the states of the various electrical input and output pins have to be marked as volatile, since the compiler doesn't know that they can be changed externally (in this case by means other than code, like when you plug in a component).
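For example (a sketch with a made-up register address), a busy-wait on an input pin only works if the access is volatile:
#include <stdint.h>

#define GPIO_IN (*(volatile uint32_t *)0x48000010u) /* hypothetical input register */

void wait_for_button(void)
{
    while ((GPIO_IN & 0x1u) == 0)
    {
        /* without volatile, the compiler could hoist the read and spin forever */
    }
}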
Just try it. First off there is the language and what is possible to optimize, and then there is what the compiler actually figures out and optimizes; that something can be optimized does not mean the compiler will figure it out, nor that it will always produce the code you expect.
Volatile has nothing to do with caching of any kind (did we not just get this question recently using that term?). Volatile indicates to the compiler that the variable should not be optimized into a register or optimized away; let us say "all" accesses to that variable must go back to memory. Different compilers have different understandings of how to use volatile, though. I have seen clang (llvm) and gcc (gnu) disagree: when the variable was used twice in a row or something like that, clang didn't do two reads, it only did one.
It was a Stack Overflow question, you are welcome to search for it; the clang code was slightly faster than gcc, simply because of one less instruction, due to differences of opinion on how to implement volatile. So even there the main compiler folks can't agree on what it really means. It's the nature of the C language: lots of implementation-defined features. Pro tip: avoid them (volatile, bitfields, unions, etc.), certainly across compile domains.
void fun0 ( void )
{
    unsigned int i;
    unsigned int len;

    len = 5;
    for(i=0; i < len; i++)
    {
    }
}
00000000 <fun0>:
0: 4770 bx lr
This is completely dead code: it does nothing, it touches nothing, all the items are local, so it can all go away; simply return.
unsigned int fun1 ( void )
{
    unsigned int i;
    unsigned int len;

    len = 5;
    for(i=0; i < len; i++)
    {
    }
    return i;
}
00000004 <fun1>:
4: 2005 movs r0, #5
6: 4770 bx lr
This one returns something. The compiler can figure out it is counting and that the value after the loop is what gets returned... so just return that value; no need for variables or any other code generation, the rest is dead code.
unsigned int fun2 ( unsigned int len )
{
    unsigned int i;

    for(i=0; i < len; i++)
    {
    }
    return i;
}
00000008 <fun2>:
8: 4770 bx lr
Like fun1 except the value is passed in, in a register, which just happens to be the same register as the return value for the ABI for this target. So you do not even have to copy the length to the return value in this case. For other architectures or ABIs we would hope that this optimizes to return = len and that gets sent back: a simple mov instruction.
unsigned int fun3 ( unsigned int len )
{
    volatile unsigned int i;

    for(i=0; i < len; i++)
    {
    }
    return i;
}
0000000c <fun3>:
c: 2300 movs r3, #0
e: b082 sub sp, #8
10: 9301 str r3, [sp, #4]
12: 9b01 ldr r3, [sp, #4]
14: 4298 cmp r0, r3
16: d905 bls.n 24 <fun3+0x18>
18: 9b01 ldr r3, [sp, #4]
1a: 3301 adds r3, #1
1c: 9301 str r3, [sp, #4]
1e: 9b01 ldr r3, [sp, #4]
20: 4283 cmp r3, r0
22: d3f9 bcc.n 18 <fun3+0xc>
24: 9801 ldr r0, [sp, #4]
26: b002 add sp, #8
28: 4770 bx lr
2a: 46c0 nop ; (mov r8, r8)
It gets significantly different here; that is a lot of code compared to the ones thus far. We would like to think that volatile indicates that all uses of that variable touch the memory for that variable.
12: 9b01 ldr r3, [sp, #4]
14: 4298 cmp r0, r3
16: d905 bls.n 24 <fun3+0x18>
Load i and compare it against len; if i is no longer below len, we are done, exit the loop.
18: 9b01 ldr r3, [sp, #4]
1a: 3301 adds r3, #1
1c: 9301 str r3, [sp, #4]
i was less than len so we need to increment it, read it, change it, write it back.
1e: 9b01 ldr r3, [sp, #4]
20: 4283 cmp r3, r0
22: d3f9 bcc.n 18 <fun3+0xc>
Do the i < len test again, and either loop again or do not.
24: 9801 ldr r0, [sp, #4]
Get i from RAM so it can be returned.
All reads and writes of i involved the memory that holds i. Because we asked for that, the loop is not dead code; each iteration has to be implemented in order to handle all the touches of that variable's memory.
void fun4 ( void )
{
    unsigned int a;
    unsigned int b;

    a = 1;
    b = 1;
    fun3(a+b);
}
0000002c <fun4>:
2c: 2300 movs r3, #0
2e: b082 sub sp, #8
30: 9301 str r3, [sp, #4]
32: 9b01 ldr r3, [sp, #4]
34: 2b01 cmp r3, #1
36: d805 bhi.n 44 <fun4+0x18>
38: 9b01 ldr r3, [sp, #4]
3a: 3301 adds r3, #1
3c: 9301 str r3, [sp, #4]
3e: 9b01 ldr r3, [sp, #4]
40: 2b01 cmp r3, #1
42: d9f9 bls.n 38 <fun4+0xc>
44: 9b01 ldr r3, [sp, #4]
46: b002 add sp, #8
48: 4770 bx lr
4a: 46c0 nop ; (mov r8, r8)
This not only optimized out the addition and the a and b variables, it also inlined the fun3 function.
void fun5 ( void )
{
    volatile unsigned int a;
    unsigned int b;

    a = 1;
    b = 1;
    fun3(a+b);
}
0000004c <fun5>:
4c: 2301 movs r3, #1
4e: b082 sub sp, #8
50: 9300 str r3, [sp, #0]
52: 2300 movs r3, #0
54: 9a00 ldr r2, [sp, #0]
56: 9301 str r3, [sp, #4]
58: 9b01 ldr r3, [sp, #4]
5a: 3201 adds r2, #1
5c: 429a cmp r2, r3
5e: d905 bls.n 6c <fun5+0x20>
60: 9b01 ldr r3, [sp, #4]
62: 3301 adds r3, #1
64: 9301 str r3, [sp, #4]
66: 9b01 ldr r3, [sp, #4]
68: 429a cmp r2, r3
6a: d8f9 bhi.n 60 <fun5+0x14>
6c: 9b01 ldr r3, [sp, #4]
6e: b002 add sp, #8
70: 4770 bx lr
fun3 is inlined here as well, but the a variable is read from memory every time instead of being optimized out:
58: 9b01 ldr r3, [sp, #4]
5a: 3201 adds r2, #1
void fun6 ( void )
{
    unsigned int i;
    unsigned int len;

    len = 5;
    for(i=0; i < len; i++)
    {
        fun3(i);
    }
}
00000074 <fun6>:
74: 2300 movs r3, #0
76: 2200 movs r2, #0
78: 2100 movs r1, #0
7a: b082 sub sp, #8
7c: 9301 str r3, [sp, #4]
7e: 9b01 ldr r3, [sp, #4]
80: 3201 adds r2, #1
82: 9b01 ldr r3, [sp, #4]
84: 2a05 cmp r2, #5
86: d00d beq.n a4 <fun6+0x30>
88: 9101 str r1, [sp, #4]
8a: 9b01 ldr r3, [sp, #4]
8c: 4293 cmp r3, r2
8e: d2f7 bcs.n 80 <fun6+0xc>
90: 9b01 ldr r3, [sp, #4]
92: 3301 adds r3, #1
94: 9301 str r3, [sp, #4]
96: 9b01 ldr r3, [sp, #4]
98: 429a cmp r2, r3
9a: d8f9 bhi.n 90 <fun6+0x1c>
9c: 3201 adds r2, #1
9e: 9b01 ldr r3, [sp, #4]
a0: 2a05 cmp r2, #5
a2: d1f1 bne.n 88 <fun6+0x14>
a4: b002 add sp, #8
a6: 4770 bx lr
This one I found interesting; it could have been optimized better, and based on my gnu experience I am kind of confused. But as pointed out, this is how it is: you can expect one thing but the compiler does what it does.
9c: 3201 adds r2, #1
9e: 9b01 ldr r3, [sp, #4]
a0: 2a05 cmp r2, #5
The i variable in the fun6 function is put on the stack for some reason; it is not volatile, so it does not demand that kind of access every time. But that is how they implemented it.
If I build with an older version of gcc I see this
9c: 3201 adds r2, #1
9e: 9b01 ldr r3, [sp, #4]
a0: 2a05 cmp r2, #5
Another thing to note is that gnu at least is not getting better with every version; it has at times been getting worse. This is a simple case.
void fun7 ( void )
{
    unsigned int i;
    unsigned int len;

    len = 5;
    for(i=0; i < len; i++)
    {
        fun2(i);
    }
}
0000013c <fun7>:
13c: e12fff1e bx lr
Okay, too extreme (no surprise in the result); let us try this:
void more_fun ( unsigned int );

void fun8 ( void )
{
    unsigned int i;
    unsigned int len;

    len = 5;
    for(i=0; i < len; i++)
    {
        more_fun(i);
    }
}
000000ac <fun8>:
ac: b510 push {r4, lr}
ae: 2000 movs r0, #0
b0: f7ff fffe bl 0 <more_fun>
b4: 2001 movs r0, #1
b6: f7ff fffe bl 0 <more_fun>
ba: 2002 movs r0, #2
bc: f7ff fffe bl 0 <more_fun>
c0: 2003 movs r0, #3
c2: f7ff fffe bl 0 <more_fun>
c6: 2004 movs r0, #4
c8: f7ff fffe bl 0 <more_fun>
cc: bd10 pop {r4, pc}
ce: 46c0 nop ; (mov r8, r8)
No surprise there: it chose to unroll the loop because 5 is below some threshold.
void fun9 ( unsigned int len )
{
    unsigned int i;

    for(i=0; i < len; i++)
    {
        more_fun(i);
    }
}
000000d0 <fun9>:
d0: b570 push {r4, r5, r6, lr}
d2: 1e05 subs r5, r0, #0
d4: d006 beq.n e4 <fun9+0x14>
d6: 2400 movs r4, #0
d8: 0020 movs r0, r4
da: 3401 adds r4, #1
dc: f7ff fffe bl 0 <more_fun>
e0: 42a5 cmp r5, r4
e2: d1f9 bne.n d8 <fun9+0x8>
e4: bd70 pop {r4, r5, r6, pc}
That is what I was looking for. So in this case the i variable is in a register (r4), not on the stack as shown above. The calling convention says r4 and some number of registers after it (r5, r6, ...) must be preserved. This is calling an external function the optimizer can't see into, so it has to implement the loop such that the function is called that many times, with each of the values, in order. Not dead code.
Textbook/classroom teaching implies that local variables live on the stack, but they do not have to. i is not declared volatile, so instead: take a non-volatile register, r4; save it on the stack so the caller does not lose its state; use r4 as i; and the callee function more_fun either will not touch it or will return it as it found it. You add a push, but save a bunch of loads and stores in the loop: yet another optimization based on the target and the ABI.
Volatile is a suggestion/recommendation/desire to the compiler that it keep an address for the variable and perform actual load and store accesses to that variable when it is used. The ideal use case is when you have a control/status register in a hardware peripheral and need all of the accesses described in the code to happen, in the order coded, with no optimization. As for caching, that is independent of the language: you have to set up the cache and the MMU (or other solution) so that control and status registers are not cached and the peripheral is touched exactly when we want it to be touched. It takes both layers: you need to tell the compiler to perform all the accesses, and you need the memory system not to get in the way of those accesses.
Without volatile, and depending on the command-line options you use and the list of optimizations the compiler has been programmed to attempt, the compiler will try to perform those optimizations as they are programmed in the compiler's code. If the compiler can't see into a called function, like more_fun above, because it is not in this optimization domain, then the compiler must functionally represent all the calls, in order. If it can see the callee and inlining is allowed, the compiler can, if programmed to do so, essentially pull the function inline with the caller and THEN optimize that whole blob as if it were one function, based on the other available options. It is not uncommon for the callee function to be bulky because of its nature, yet when specific values are passed by a caller and the compiler can see all of it, the caller plus callee code can end up smaller than the callee implementation alone.
You will often see folks who want to, for example, learn assembly language by examining the output of a compiler do something like this:
void fun10 ( void )
{
    int a;
    int b;
    int c;

    a = 5;
    b = 6;
    c = a + b;
}
not realizing that this is dead code and should be optimized out if an optimizer is used. They ask a Stack Overflow question, and someone says you need to turn the optimizer off; now you get a lot of loads and stores, you have to understand and keep track of stack offsets, and while it is valid asm code you can study, it is not what you were hoping for. Instead, something like this is more valuable to that effort:
unsigned int fun11 ( unsigned int a, unsigned int b )
{
    return(a+b);
}
The inputs are unknown to the compiler and a return value is required, so it can't treat this as dead code; it has to implement it.
And this is a simple case demonstrating that the caller plus callee is smaller than the callee alone:
000000ec <fun11>:
ec: 1840 adds r0, r0, r1
ee: 4770 bx lr
000000f0 <fun12>:
f0: 2007 movs r0, #7
f2: 4770 bx lr
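The source of fun12 is not shown above; based on the explanation that follows, it was presumably something along these lines (my reconstruction, not the original):
unsigned int fun12 ( void )
{
    unsigned int a;
    unsigned int b;

    a = 3;
    b = 4;
    return(fun11(a,b));
}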
While that may not look simpler, it has inlined the code, optimized out the a = 3 and b = 4 assignments, optimized out the addition operation, and simply pre-computed the result and returned it.
Certainly with gcc you can cherry-pick the optimizations you want to add or block; there is a laundry list of them you can go research.
With very little practice you can see what is optimizable, at least within the view of the function, and then hope the compiler figures it out. Certainly visualizing inlining takes more work, but really it is the same: you just visually inline it.
Now, there are ways with gnu and llvm to optimize across files, basically the whole project, so more_fun would be visible and the functions that call it might get further optimized beyond what you see in the object file of the one file with the caller. It takes certain command lines on the compile and/or link for this to work, and I have not memorized them. With llvm there is a way to merge bytecode and then optimize that, but it does not always do what you hoped as far as whole-project optimization goes.
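If you want to experiment, gcc's link-time optimization is enabled with -flto on both the compile and the link steps (a sketch; the file names are made up):
arm-none-eabi-gcc -O2 -flto -c caller.c
arm-none-eabi-gcc -O2 -flto -c more_fun.c
arm-none-eabi-gcc -O2 -flto caller.o more_fun.o -o prog.elf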

Overeager struct packing warnings with `__attribute__((packed))`?

I'm implementing a binary logging system on a 32 bit ARM mcu (Atmel SAM4SD32C, a Cortex-M4/ARMv7E-M part), and in the process of designing my data structures. My goal is to describe the log format as a packed struct, and simply union the struct with a char array, for writing to the log device (a SD card, via FatFS, in this case).
Basically, I have a very simple struct:
typedef struct adc_samples_t
{
    int32_t adc_samples[6];
    uint64_t acq_time;
    int8_t overrun;
    uint8_t padding_1;
    uint8_t padding_2;
    uint8_t padding_3;
} __attribute__((packed, aligned(4))) adc_sample_set;
Now, my architecture is 32 bits, so as far as I understand, access to any member /other/ than the overrun member should be 32-bit aligned, and therefore incur no extra overhead. Furthermore, the aligned(4) attribute should force any instantiation of the struct onto a 32-bit aligned boundary.
However, compiling the above struct definition produces a pile of warnings:
In file included from ../src/main.c:13:0:
<snip>\src\fs\fs-logger.h(10,10): warning: packed attribute causes inefficient alignment for 'adc_samples' [-Wattributes]
int32_t adc_samples[6];
^
<snip>\src\fs\fs-logger.h(12,11): warning: packed attribute causes inefficient alignment for 'acq_time' [-Wattributes]
uint64_t acq_time;
As far as I know (and I'm now realizing this is a big assumption), I assumed that 32-bit alignment was all that was needed for optimal member positioning on 32-bit ARM. Oddly, the only members that do /not/ produce warnings are the overrun and padding_X members, which I don't understand the cause of. (Ok, the ARM docs say byte accesses are always aligned.)
What, exactly, is going on here? I assume (possibly incorrectly) that the struct instantiation will be on a 4-byte boundary. Does the compiler require a broader alignment (on 8-byte boundaries)?
Edit: Ok, digging into the ARM docs (the magic words here were "Cortex-M4 alignment"):
3.3.5. Address alignment
An aligned access is an operation where a word-aligned address is used for a word, dual word, or multiple word access, or where a halfword-aligned address is used for a halfword access. Byte accesses are always aligned.
The Cortex-M4 processor supports unaligned access only for the following instructions:
LDR, LDRT
LDRH, LDRHT
LDRSH, LDRSHT
STR, STRT
STRH, STRHT
All other load and store instructions generate a UsageFault exception if they perform an unaligned access, and therefore their accesses must be address aligned. For more information about UsageFaults see Fault handling.
Unaligned accesses are usually slower than aligned accesses. In addition, some memory regions might not support unaligned accesses. Therefore, ARM recommends that programmers ensure that accesses are aligned. To trap
accidental generation of unaligned accesses, use the UNALIGN_TRP bit in the Configuration and Control Register, see Configuration and Control Register.
How is my 32-bit aligned value not word-aligned? The user guide defines "Aligned" as the following:
Aligned
A data item stored at an address that is divisible by the
number of bytes that defines the data size is said to be aligned.
Aligned words and halfwords have addresses that are divisible by four
and two respectively. The terms word-aligned and halfword-aligned
therefore stipulate addresses that are divisible by four and two
respectively.
I assumed that 32-bit alignment was all that was needed for optimal component positioning on 32-bit ARM
It is.
But you don't have 32-bit alignment here [in the originally-asked question] because:
Specifying the packed attribute for struct and union types is equivalent to specifying the packed attribute on each of the structure or union members.
given that:
The packed attribute specifies that a variable or structure field should have the smallest possible alignment—one byte for a variable, and one bit for a field, unless you specify a larger value with the aligned attribute.
In other words, if you want a packed structure to still have some minimum alignment after you've forced the alignment of all members, and thus of the type itself, down to nothing, you need to say so. The fact that that might not actually make -Wpacked shut up is a different matter: GCC may well just spit that warning out reflexively before it actually considers any further alignment modifiers.
Note that in terms of serialisation, you don't necessarily need to pack it anyway. The members fit in 9 words exactly, so the only compiler padding anywhere is an extra word at the end to round the total size up to 40 bytes, since acq_time forces the struct to a natural alignment of 8. Unless you want to operate on a whole array of these things at once, you can get away with simply ignoring that and still treating the members as one 36-byte chunk.
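For illustration, the natural layout without packed, as described above (offsets assume the usual ARM EABI rules):
typedef struct adc_samples_t
{
    int32_t adc_samples[6]; /* offsets 0..20 */
    uint64_t acq_time;      /* offset 24, naturally 8-byte aligned */
    int8_t overrun;         /* offset 32 */
    uint8_t padding_1;      /* offset 33 */
    uint8_t padding_2;      /* offset 34 */
    uint8_t padding_3;      /* offset 35 */
    /* 4 bytes of tail padding: sizeof == 40, alignment 8 */
} adc_sample_set;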
Ok, at this point, I'm somewhat confident that the warning is being emitted in error.
I have a statically defined instance of the struct, and at one point I zero it:
adc_sample_set running_average;
int accumulated_samples;

inline void zero_average_buf(void)
{
    accumulated_samples = 0;

    running_average.adc_samples[0] = 0;
    running_average.adc_samples[1] = 0;
    running_average.adc_samples[2] = 0;
    running_average.adc_samples[3] = 0;
    running_average.adc_samples[4] = 0;
    running_average.adc_samples[5] = 0;
    running_average.overrun = 0;
    running_average.acq_time = 0;
}
The disassembly for the function is the follows:
{
004005F8 push {r3, lr}
accumulated_samples = 0;
004005FA movs r2, #0
004005FC ldr r3, [pc, #36]
004005FE str r2, [r3]
running_average.adc_samples[0] = 0;
00400600 ldr r3, [pc, #36]
00400602 str r2, [r3]
running_average.adc_samples[1] = 0;
00400604 str r2, [r3, #4]
running_average.adc_samples[2] = 0;
00400606 str r2, [r3, #8]
running_average.adc_samples[3] = 0;
00400608 str r2, [r3, #12]
running_average.adc_samples[4] = 0;
0040060A str r2, [r3, #16]
running_average.adc_samples[5] = 0;
0040060C str r2, [r3, #20]
running_average.overrun = 0;
0040060E strb.w r2, [r3, #32]
running_average.acq_time = 0;
00400612 movs r0, #0
00400614 movs r1, #0
00400616 strd r0, r1, [r3, #24]
Note that r3 in the above is 0x2001ef70, which is indeed 4-byte aligned. r2 is the literal value 0.
The str opcode requires 4-byte alignment. The strd opcode only requires 4-byte alignment as well, since it appears to really be two sequential 4-byte operations, though I don't know how it actually works internally.
If I intentionally mis-align my struct, to force the slow-path copy operation:
typedef struct adc_samples_t
{
    int8_t overrun;
    int32_t adc_samples[6];
    uint64_t acq_time;
    uint8_t padding_1;
    uint8_t padding_2;
    uint8_t padding_3;
} __attribute__((packed, aligned(8))) adc_sample_set;
I get the following assembly:
{
00400658 push {r3, lr}
accumulated_samples = 0;
0040065A movs r3, #0
0040065C ldr r2, [pc, #84]
0040065E str r3, [r2]
running_average.adc_samples[0] = 0;
00400660 ldr r2, [pc, #84]
00400662 strb r3, [r2, #1]
00400664 strb r3, [r2, #2]
00400666 strb r3, [r2, #3]
00400668 strb r3, [r2, #4]
running_average.adc_samples[1] = 0;
0040066A strb r3, [r2, #5]
0040066C strb r3, [r2, #6]
0040066E strb r3, [r2, #7]
00400670 strb r3, [r2, #8]
running_average.adc_samples[2] = 0;
00400672 strb r3, [r2, #9]
00400674 strb r3, [r2, #10]
00400676 strb r3, [r2, #11]
00400678 strb r3, [r2, #12]
running_average.adc_samples[3] = 0;
0040067A strb r3, [r2, #13]
0040067C strb r3, [r2, #14]
0040067E strb r3, [r2, #15]
00400680 strb r3, [r2, #16]
running_average.adc_samples[4] = 0;
00400682 strb r3, [r2, #17]
00400684 strb r3, [r2, #18]
00400686 strb r3, [r2, #19]
00400688 strb r3, [r2, #20]
running_average.adc_samples[5] = 0;
0040068A strb r3, [r2, #21]
0040068C strb r3, [r2, #22]
0040068E strb r3, [r2, #23]
00400690 strb r3, [r2, #24]
running_average.overrun = 0;
00400692 mov r1, r2
00400694 strb r3, [r1], #25
running_average.acq_time = 0;
00400698 strb r3, [r2, #25]
0040069A strb r3, [r1, #1]
0040069C strb r3, [r1, #2]
0040069E strb r3, [r1, #3]
004006A0 strb r3, [r1, #4]
004006A2 strb r3, [r1, #5]
004006A4 strb r3, [r1, #6]
004006A6 strb r3, [r1, #7]
So, pretty clearly, I'm getting the proper aligned-copy behaviour with my original struct definition, despite the compiler apparently incorrectly warning that it will result in inefficient accesses.

Why isn't gcc (ARM) using Global Register Variables as source operands?

Here is a C source code example:
register int a asm("r8");
register int b asm("r9");

int main() {
    int c;
    a=2;
    b=3;
    c=a+b;
    return c;
}
And this is the assembled code generated using a arm gcc cross compiler:
$ arm-linux-gnueabi-gcc -c global_reg_var_test.c -Wa,-a,-ad
...
mov r8, #2
mov r9, #3
mov r2, r8
mov r3, r9
add r3, r2, r3
...
When using -frename-registers, the behaviour was the same. (updated. Before I had said with -O3.)
So the question is: why does gcc add the 3rd and 4th MOVs instead of 'ADD R3, R8, R9'?
Context: I need to optimize a code in a simulated inorder cpu (gem5 arm minorcpu) that doesn't rename registers.
I took the real example (posted in comments) and put it on the Godbolt compiler explorer. The main inefficiency in calc() is that src1 and src2 are globals it has to load from memory, instead of args passed in registers.
I didn't look at main, just calc.
register int sum asm ("r4");
register int r asm ("r5");
register int c asm ("r6");
register int k asm ("r7");
register int temp1 asm ("r8"); // really? you're using two global register vars for scratch temporaries? Just let the compiler do its job.
register int temp2 asm ("r9");
register long n asm ("r10");

int *src1, *src2, *dst;

void calc() {
    temp1 = r*n;
    temp2 = k*n;
    temp1 = temp1+k;
    temp2 = temp2+c;
    // you get bad code for this because src1 and src2 are globals, not args passed in regs
    sum = sum + src1[temp1] * src2[temp2];
}
# gcc 4.8.2 -O3 -Wall -Wextra -Wa,-a,-ad -fverbose-asm
mla r0, r10, r7, r6 # temp2.9, n, k, c ## tmp = k*n + c
movw r3, #:lower16:.LANCHOR0 # tmp136,
mla r8, r10, r5, r7 # temp1, n, r, k ## temp1 = r*n + k
movt r3, #:upper16:.LANCHOR0 # tmp136,
ldmia r3, {r1, r2} # tmp136,, ## load both pointers, since they're stored adjacently in memory
mov r9, r0 # temp2, temp2.9 ## This insn is wasted: the first MLA should have had this as the dest
ldr r3, [r1, r8, lsl #2] # *_22, *_22
ldr r2, [r2, r9, lsl #2] # *_28, *_28
mla r4, r2, r3, r4 # sum, *_28, *_22, sum
bx lr #
For some reason, one of the integer multiply-accumulate (mla) instructions uses r8 (temp1) as the destination, but the other one writes to r0 (a scratch reg), and only later moves the result to r9 (temp2).
The sum += src1[temp1] * src2[temp2] is done with an mla that reads and writes r4 (sum).
Why do you need temp1 and temp2 to be globals? That's just going to stop the optimizer from doing aggressive optimizations that don't calculate exactly the same temporaries that the C source does. Fortunately the C memory model is weak enough that it should be able to reorder assignments to them, although this might actually be why it didn't MLA into temp2 directly, since it decided to do that calculation first. (Hmm, does the memory model even apply? Other threads can't see our registers at all, so those globals are all effectively thread-local. It should allow relaxed ordering for assignments to globals. Signal handlers can see these globals, and could run at any point. gcc isn't following strict source order, since in the source both multiplies happen before either add.)
Godbolt doesn't have a newer ARM gcc version, so I can't easily test a newer gcc. A newer gcc might do a better job with this.
BTW, I tried a version of the function using local variables for temporaries, and didn't actually get better results. Probably because there are still so many register globals that gcc couldn't pick convenient regs for the temporaries.
// same register globals, except for temp1 and temp2.
void calc_local_tmp() {
    int t1 = r*n + k;
    sum += src1[t1] * src2[k*n + c];
}
push {lr} # gcc decides to push to get a tmp reg
movw r3, #:lower16:.LANCHOR0 # tmp131,
mla lr, r10, r5, r7 # tmp133, n.1, r, k.2
movt r3, #:upper16:.LANCHOR0 # tmp131,
mla ip, r7, r10, r6 # tmp137, k.2, n.1, c
ldr r2, [r3] # src1, src1
ldr r0, [r3, #4] # src2, src2
ldr r1, [r2, lr, lsl #2] # *_10, *_10
ldr r3, [r0, ip, lsl #2] # *_20, *_20
mla r4, r3, r1, r4 # sum, *_20, *_10, sum
ldr pc, [sp], #4 #
Compiling with -fcall-used-r8 -fcall-used-r9 didn't help; gcc makes the same code that pushes lr to get an extra temporary. It fails to use ldmia (load-multiple) because it makes a sub-optimal choice of which temporary to put in which reg. (&src1 in r0 would let it load src1 and src2 into r2 and r3.)

Optimize C or assembly code in size for Cortex-M0

I need to reduce the code bloat for the Cortex-M0 microprocessor.
At startup the ROM data has to be copied to the RAM data once. Therefore I have this piece of code:
void __startup( void ){
    extern unsigned int __data_init_start;
    extern unsigned int __data_start;
    extern unsigned int __data_end;

    // s, d and e are global unsigned int * variables, declared elsewhere
    // copy .data section from flash to ram
    s = & __data_init_start;
    d = & __data_start;
    e = & __data_end;
    while( d != e ){
        *d++ = *s++;
    }
}
The assembly code that is generated by the compiler looks like this:
ldr r1, .L10+8
ldr r2, .L10+12
sub r0, r1, r2
lsr r3, r0, #2
add r3, r3, #1
lsl r1, r3, #2
mov r3, #0
.L4:
add r3, r3, #4
cmp r3, r1
beq .L9
.L5:
ldr r4, .L10+16
add r0, r2, r3
add r4, r3, r4
sub r4, r4, #4
ldr r4, [r4]
sub r0, r0, #4
str r4, [r0]
b .L4
How can I optimize this code so the code size is at minimum?
The compiler (or you!) does not realize that the range to copy is end - start. There seems to be some unnecessary shuffling of data going on: the two adds and the sub in the loop. Also, it seems to me the compiler makes sure that the number of bytes to copy is a multiple of 4. An obvious optimization, then, is to make sure it is in advance! Below I assume it is (if not, the bne will fail and happily keep on copying, trampling all over your memory).
Using my decade-old ARM assembler knowledge (yes, that is a major disclaimer) and post-incrementing, I think the following short snippet is what it can be condensed to. From 18 instructions down to 8; not too bad. If it works.
ldr r1, __data_init_start
ldr r2, __data_start
ldr r3, __data_end
sub r4, r3, r2
.L1:
ldr r3, [r1], #4 ; safe to re-use r3 here
str r3, [r2], #4
subs r4, r4, #4
bne .L1
It may be that the platform guarantees that writing through an unsigned int * can change an unsigned int * value (i.e. it doesn't take advantage of the type-mismatch aliasing rules).
In that case the code is inefficient because e is a global variable, and the generated code logic must take into account that writing to *d may change the value of e.
Making at least e a local should solve this problem (most compilers know that aliasing a local whose address is never taken is not possible from a C point of view).
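A sketch of that fix applied to the original function:
void __startup( void ){
    extern unsigned int __data_init_start;
    extern unsigned int __data_start;
    extern unsigned int __data_end;

    // local pointers: the compiler now knows the stores cannot change them
    unsigned int *s = & __data_init_start;
    unsigned int *d = & __data_start;
    unsigned int * const e = & __data_end;
    while( d != e ){
        *d++ = *s++;
    }
}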
