How to distinguish armhf (ARMv7) and armel (ARMv4) in C code? - c

In the executable I'm writing I have 2 implementations of the same function, one for armhf (fast) and one for armel (slow). At runtime I'd like to detect the CPU type, and call the armhf implementation if armhf was detected. How do I detect the CPU? I need something like this in C code:
int is_cpu_armhf(void) {
...
}
The code may contain inline assembly, but preferably it shouldn't contain a call to a library function or a system call, because it should work with multiple libraries and multiple operating systems.
I've found https://github.com/pytorch/cpuinfo/tree/master/src/arm , but it doesn't seem to use any inline assembly; instead it relies on the operating system to get the CPU information.

... I have two implementations of the same function, one for armhf (fast) and one for armel (slow). At runtime I'd like to detect the CPU type, and call the armhf implementation if armhf was detected. How do I detect the CPU? I need something like this in C code ...
As @Ruslan noted, the CPU feature registers are mostly privileged on ARM: privileged code (the kernel) can read the feature mask with an MRS instruction, but user space cannot. Recent kernels emulate a cpuid-like facility for ARM by trapping those reads, but you cannot rely on it being present on older kernels.
At runtime you may be able to parse /proc/cpuinfo on Linux for cpu arch and features. You may also be able to call getauxval and read the bits from the auxiliary vector.
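For example, here is a minimal Linux-only sketch of the auxiliary-vector approach (getauxval needs glibc 2.16+; HWCAP_NEON and HWCAP_VFPv3 come from <asm/hwcap.h>, and header/bit availability depends on your libc and kernel headers):
#include <sys/auxv.h>    /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>   /* HWCAP_NEON, HWCAP_VFPv3 */

/* Returns nonzero if the kernel reports NEON or VFPv3, which is a reasonable
   proxy for "the fast armhf-style implementation is safe to call". */
int is_cpu_armhf_auxv(void) {
    unsigned long hwcap = getauxval(AT_HWCAP);
    return (hwcap & (HWCAP_NEON | HWCAP_VFPv3)) != 0;
}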
What I have found works best is:
- Try getauxval for the architecture and feature bits.
- Fall back to a SIGILL probe if getauxval fails.
The SIGILL probe is expensive: you set up a SIGILL handler and then execute the ARMv5 or ARMv7 instruction you care about. If you catch a SIGILL, you know the instruction is not available.
SIGILL probes are used by Crypto++ and OpenSSL. For example, movw and movt were added in ARMv7. Here is the code Crypto++ uses to probe for ARMv7 with the movw and movt instructions; OpenSSL does something similar in crypto/armcap.c.
bool CPU_ProbeARMv7()
{
    volatile bool result = true;
    volatile SigHandler oldHandler = signal(SIGILL, SigIllHandler);
    if (oldHandler == SIG_ERR)
        return false;

    volatile sigset_t oldMask;
    if (sigprocmask(0, NULLPTR, (sigset_t*)&oldMask))
        return false;

    if (setjmp(s_jmpSIGILL))
        result = false;
    else
    {
        unsigned int a;
        asm volatile (
#if defined(__thumb__)
            ".inst.n 0xf241, 0x2034 \n\t"   // movw r0, 0x1234
            ".inst.n 0xf2c1, 0x2034 \n\t"   // movt r0, 0x1234
            "mov %0, r0 \n\t"               // mov [a], r0
#else
            ".inst 0xe3010234 \n\t"         // movw r0, 0x1234
            ".inst 0xe3410234 \n\t"         // movt r0, 0x1234
            "mov %0, r0 \n\t"               // mov [a], r0
#endif
            : "=r" (a) : : "r0");
        result = (a == 0x12341234);
    }

    sigprocmask(SIG_SETMASK, (sigset_t*)&oldMask, NULLPTR);
    signal(SIGILL, oldHandler);
    return result;
}
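For context, the excerpt relies on a few helpers defined elsewhere in Crypto++'s cpu.cpp. A minimal sketch of what they look like (the names match the excerpt; the bodies are my paraphrase, and NULLPTR is Crypto++'s null-pointer shim):
#include <csetjmp>
#include <csignal>

extern "C" {
    typedef void (*SigHandler)(int);
}

static jmp_buf s_jmpSIGILL;

// The handler simply jumps back to the setjmp point in the probe, which then
// reports the probed instruction as unavailable.
static void SigIllHandler(int)
{
    longjmp(s_jmpSIGILL, 1);
}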
The volatiles are required in the probes so the values survive the setjmp/longjmp. Also see What sense do these clobbered variable warnings make?
On Android you should use android_getCpuFamily() and android_getCpuFeatures() instead of getauxval.
The ARM folks say you should NOT parse /proc/cpuinfo. Also see ARM Blog and Runtime Detection of CPU Features on an armv8-a CPU. (Non-paywall version here).
DO NOT perform SIGILL-based feature probes on iOS devices. Apple devices trash memory. For Apple devices use something like How to get device make and model on iOS? instead.
You also need to enable code paths based on compiler options. That is a whole 'nother can of worms. For that problem see Detect ARM NEON availability in the preprocessor?
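For instance, a minimal sketch of compile-time gating with the ACLE macros (HAVE_NEON_PATH is a hypothetical name; whether the macros are defined depends on your compiler and the -march/-mfpu flags in effect):
/* Only build the NEON code path when the compiler is allowed to emit NEON
   (e.g. -mfpu=neon). __ARM_NEON is the ACLE macro; __ARM_NEON__ is the
   older GCC spelling. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
# define HAVE_NEON_PATH 1
#else
# define HAVE_NEON_PATH 0
#endif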
For some additional source code to examine, see cpu.cpp in Crypto++. It is the place where Crypto++ does things like call getauxval, android_getCpuFamily() and android_getCpuFeatures().
The Crypto++ SIGILL probes live in specific source files since a source file usually needs a compiler option to enable an arch, like -march=armv7-a and -mfpu=neon for ARM. That's why ARMv7 and NEON are detected in neon_simd.cpp. (There are other similar files for i686 and x86_64, Altivec, PowerPC, and Aarch64).
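In build terms that looks something like this (file names are hypothetical; only the probe's translation unit sees the ISA flags):
gcc -c cpu_generic.c                             # no arch options; safe on any ARM
gcc -c -march=armv7-a -mfpu=neon probe_armv7.c   # only the probe file gets the ARMv7/NEON flags
gcc main.c cpu_generic.o probe_armv7.o -o app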
Here is what the getauxval and android_getCpuFamily() code looks like in Crypto++. CPU_QueryARMv7 is used first; only if CPU_QueryARMv7 fails is a SIGILL feature probe used.
inline bool CPU_QueryARMv7()
{
#if defined(__ANDROID__) && defined(__arm__)
    if (((android_getCpuFamily() & ANDROID_CPU_FAMILY_ARM) != 0) &&
        ((android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_ARMv7) != 0))
        return true;
#elif defined(__linux__) && defined(__arm__)
    if ((getauxval(AT_HWCAP) & HWCAP_ARMv7) != 0 ||
        (getauxval(AT_HWCAP) & HWCAP_NEON) != 0)
        return true;
#elif defined(__APPLE__) && defined(__arm__)
    // Apple hardware is ARMv7 or above.
    return true;
#endif
    return false;
}
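Putting the two halves together gives something shaped like the is_cpu_armhf() the question asked for (a sketch; CPU_QueryARMv7 and CPU_ProbeARMv7 are the two functions shown in this answer):
int is_cpu_armhf(void)
{
    if (CPU_QueryARMv7())             // cheap: getauxval / Android cpufeatures / Apple
        return 1;
    return CPU_ProbeARMv7() ? 1 : 0;  // expensive SIGILL probe as a last resort
}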
The ARM instructions for movw and movt were disassembled from the following source code:
int a;
asm volatile("movw %0,%1 \n"
             "movt %0,%1 \n"
             : "=r"(a) : "i"(0x1234));
00000010 <_Z5test2v>: // ARM
10: e3010234 movw r0, #4660 ; 0x1234
14: e3410234 movt r0, #4660 ; 0x1234
18: e12fff1e bx lr
0000001c <_Z5test3v>: // Thumb
1c: f241 2034 movw r0, #4660 ; 0x1234
20: f2c1 2034 movt r0, #4660 ; 0x1234
24: e12fff1e bx lr
Here is what reading the feature registers with MRS looks like. It is very similar to reading the CPUID bitmask on x86. The code below can be used to read the crypto feature bits on Aarch64, but it requires Exception Level 1 (EL1) or above, and user space runs at EL0. Attempting to run it from userland results in a SIGILL and termination.
#if defined(__arm64__) || defined(__aarch64__)
    uint64_t caps = 0;  // Read ID_AA64ISAR0_EL1
    __asm __volatile("mrs %0, " "id_aa64isar0_el1" : "=r" (caps));
#elif defined(__arm__) || defined(__aarch32__)
    uint32_t caps = 0;  // Read ID_ISAR5_EL1
    __asm __volatile("mrs %0, " "id_isar5_el1" : "=r" (caps));
#endif
The benefit of issuing the instructions yourself with .inst is that you do not need arch options when compiling the source file:
unsigned int a;
asm volatile (
#if defined(__thumb__)
    ".inst.n 0xf241, 0x2034 \n\t"   // movw r0, 0x1234
    ".inst.n 0xf2c1, 0x2034 \n\t"   // movt r0, 0x1234
    "mov %0, r0 \n\t"               // mov [a], r0
#else
    ".inst 0xe3010234 \n\t"         // movw r0, 0x1234
    ".inst 0xe3410234 \n\t"         // movt r0, 0x1234
    "mov %0, r0 \n\t"               // mov [a], r0
#endif
    : "=r" (a) : : "r0");
You can compile the above code without arch options:
gcc cpu-test.c -o cpu-test.o
If you were to use movw and movt:
int a;
asm volatile("movw %0,%1 \n"
             "movt %0,%1 \n"
             : "=r"(a) : "i"(0x1234));
then your compiler would need to support ARMv7, and you would need to use the arch option:
gcc -march=armv7 cpu-test.c -o cpu-test.o
And GCC would then be free to use ARMv7 instructions throughout the source file, which could cause a SIGILL outside your protected code.
I've experienced Clang using the wrong instruction set on x86; see Crypto++ Issue 751. GCC will surely follow. In the Clang case, I needed to compile a source file with -mavx so I could use AVX intrinsics. Clang generated AVX code outside my protected block and it crashed on an old Core 2 Duo machine. (The unsafe code Clang generated was the initialization of a std::string.)
In the case of ARM the problem is that you need -march=armv7 to enable the ISA with movw and movt, and the compiler then thinks it can use that ISA, too. It is a design bug in the compiler: the user's arch and the compiler's arch are conflated. In reality, because of the compiler design, you need a user arch and a separate compiler arch.

Related

Error: width suffixes are invalid in ARM mode

I'm trying to manually issue the ARMv7 movt and movw instructions for a CPU feature test, and I'm hitting a compile error with Clang.
The test program is below. According to the ARM folks, .inst.w is the way to do this. It handles big-endian and little-endian properly, and places the code in the .text section instead of a data section.
$ cat test.cxx
int test()
{
    int a;
    asm volatile (
        ".inst.w 0xf2412334 \n\t"   // movw r3, 0x1234
        ".inst.w 0xf2c12334 \n\t"   // movt r3, 0x1234
        "mov %0, r3 \n\t"           // mov [a], r3
        : "=r" (a) : : "r3");
    return a;
}
GCC is fine:
$ g++ -O1 -march=armv7-a test.cxx -c
$ objdump --disassemble test.o
...
00000000 <_Z4testv>:
0: f241 2334 movw r3, #4660 ; 0x1234
4: f2c1 2334 movt r3, #4660 ; 0x1234
8: 4618 mov r0, r3
a: 4770 bx lr
However, Clang:
$ clang++ -O1 -march=armv7-a test.cxx -c
test.cxx:5:2: error: width suffixes are invalid in ARM mode
".inst.w 0xf2412334 \n\t" // movw r3, 0x1234
^
<inline asm>:1:2: note: instantiated into assembly here
.inst.w 0xf2412334
^
test.cxx:5:25: error: width suffixes are invalid in ARM mode
".inst.w 0xf2412334 \n\t" // movw r3, 0x1234
^
<inline asm>:2:2: note: instantiated into assembly here
.inst.w 0xf2c12334
^
2 errors generated.
If I change .inst.w to .inst, then Clang produces garbage:
$ clang++ -O1 -march=armv7-a test.cxx -c
$ objdump --disassemble test.o
...
00000000 <_Z4testv>:
0: f2412334 vcge.s8 d18, d1, d20
4: f2c12334 vbic.i32 d18, #5120 ; 0x00001400
8: e1a00003 mov r0, r3
c: e12fff1e bx lr
I verified Clang is defining __GNUC__, so it should be able to consume this code.
How do I get Clang to assemble the movt and movw instructions?
The main difference is that your GCC is configured to default to Thumb mode, while Clang isn't.
ARM has two different 32-bit instruction sets, ARM and Thumb, and even if the instruction names are similar, the encodings are different. The ARM instruction set encodes all instructions as fixed-length 32-bit words, while Thumb originally was a much smaller instruction set with all instructions being 16 bits. Since Thumb2 (which is what ARMv7 has), an instruction can be either a single 16-bit unit or a pair of two 16-bit units.
The disassembly you showed indicates this:
0: f241 2334 movw r3, #4660 ; 0x1234
4: f2c1 2334 movt r3, #4660 ; 0x1234
8: 4618 mov r0, r3
a: 4770 bx lr
The latter two instructions are plain 16-bit opcodes (4618 and 4770), while the former two are pairs of 16-bit units (f241 2334 and f2c1 2334), printed with a space between the halves.
The Clang disassembly, however, doesn't split the opcodes in half and shows full 32-bit opcodes for all instructions:
0: f2412334 vcge.s8 d18, d1, d20
4: f2c12334 vbic.i32 d18, #5120 ; 0x00001400
8: e1a00003 mov r0, r3
c: e12fff1e bx lr
In this case, passing -mthumb to Clang should get the same behaviour as GCC, and vice versa, passing -marm to GCC should reproduce the same failure there.
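That is, something like the following (illustrative invocations; the expected outcomes follow from the explanation above):
$ clang++ -O1 -march=armv7-a -mthumb test.cxx -c   # Thumb mode: .inst.w assembles as with GCC
$ g++ -O1 -march=armv7-a -marm test.cxx -c         # ARM mode: expected to hit the same width-suffix failure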
The .w suffix to .inst is to indicate that the value should be handled as a wide 32 bit instruction (as opposed to a narrow 16 bit one), which only makes sense in Thumb mode. IIRC, both GCC (since some time) and Clang (since release 8) should be able to deduce the kind of Thumb instruction without the .w suffix as well.
Instead of forcing the compiler to one mode or another, you probably want something like this instead though:
asm volatile (
#ifdef __thumb__
    ".inst.w 0xf2412334 \n\t"   // movw r3, 0x1234
    ".inst.w 0xf2c12334 \n\t"   // movt r3, 0x1234
#else
    ".inst 0xe3013234 \n\t"     // movw r3, 0x1234
    ".inst 0xe3413234 \n\t"     // movt r3, 0x1234
#endif
    "mov %0, r3 \n\t"           // mov [a], r3
    : "=r" (a) : : "r3");

How to get lower and higher 32 bits of a 64-bit integer for gcc inline asm? (ARMV5 platform)

I have a project on the armv5te platform, and I have to rewrite some functions in assembly to use the enhanced DSP instructions.
I use int64_t a lot for accumulators, but I have no idea how to pass it to the ARM SMULL instruction (http://www.keil.com/support/man/docs/armasm/armasm_dom1361289902800.htm).
How can I pass the lower or upper 32 bits of a 64-bit variable to a 32-bit register? (I know I could use an intermediate int32_t variable, but it does not look good.)
I know the compiler would do it for me, but I'm just writing a small function as an example.
int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
    int64_t tmp_acc;
    asm("SMULL %0, %1, %2, %3"
        : "=r"(tmp_acc), "=r"(tmp_acc)  // no idea how to pass tmp_acc;
        : "r"(x), "r"(y)
    );
    return tmp_acc + acc;
}
You don't need and shouldn't use inline asm for this. The compiler can do even better than smull, and use smlal to multiply-accumulate with one instruction:
int64_t accum(int64_t acc, int32_t x, int32_t y) {
    return acc + x * (int64_t)y;
}
which compiles (with gcc8.2 -O3 -mcpu=arm10e on the Godbolt compiler explorer) to this asm: (ARM10E is an ARMv5 microarchitecture I picked from Wikipedia's list)
accum:
smlal r0, r1, r3, r2 #, y, x
bx lr #
As a bonus, this pure C also compiles efficiently for AArch64.
https://gcc.gnu.org/wiki/DontUseInlineAsm
If you insist on shooting yourself in the foot and using inline asm (or, in the general case with other instructions, there might be a situation where you'd actually want this):
First, beware that smull output registers aren't allowed to overlap the first input register, so you have to tell the compiler about this. An early-clobber constraint on the output operand(s) will do the trick of telling the compiler it can't have inputs in those registers. I don't see a clean way to tell the compiler that the 2nd input can be in the same register as an output.
This restriction is lifted in ARMv6 and later (see this Keil documentation) "Rn must be different from RdLo and RdHi in architectures before ARMv6", but for ARMv5 compatibility you need to make sure the compiler doesn't violate this when filling in your inline-asm template.
Optimizing compilers can optimize away a shift/OR that combines 32-bit C variables into a 64-bit C variable, when targeting a 32-bit platform. They already store 64-bit variables as a pair of registers, and in normal cases can figure out there's no actual work to be done in the asm.
So you can specify a 64-bit input or output as a pair of 32-bit variables.
#include <stdint.h>

int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
    uint32_t prod_lo, prod_hi;
    asm("SMULL %0, %1, %2, %3"
        : "=&r" (prod_lo), "=&r"(prod_hi)  // early clobber for pre-ARMv6
        : "r"(x), "r"(y)
    );
    int64_t prod = ((int64_t)prod_hi) << 32;
    prod |= prod_lo;   // + here won't optimize away, but | does, with gcc
    return acc + prod;
}
Unfortunately the early-clobber means we need 6 total registers, but the ARM calling convention only has 6 call-clobbered registers (r0..r3, lr, and ip (aka r12)). And one of them is LR, which has the return address so we can't lose its value. Probably not a big deal when inlined into a regular function that already saves/restores several registers.
Again from Godbolt:
# gcc -O3 output with early-clobber, valid even before ARMv6
testFunc:
str lr, [sp, #-4]! #, Save return address (link register)
SMULL ip, lr, r2, r3 # prod_lo, prod_hi, x, y
adds r0, ip, r0 #, prod, acc
adc r1, lr, r1 #, prod, acc
ldr pc, [sp], #4 # return by popping the return address into PC
# gcc -O3 output without early-clobber (&) on output constraints:
# valid only for ARMv6 and later
testFunc:
SMULL r3, r2, r2, r3 # prod_lo, prod_hi, x, y
adds r0, r3, r0 #, prod, acc
adc r1, r2, r1 #, prod, acc
bx lr #
Or you can use a single "=r"(prod) constraint on the 64-bit variable and use the %Q0 / %R0 operand modifiers to select which half of %0 you get. Unfortunately, gcc and clang emit less efficient asm this way for some reason, saving more registers (and maintaining 8-byte stack alignment): 2 instead of 1 for gcc, 4 instead of 2 for clang.
// Using an int64_t directly with inline asm, with the %Q0 and %R0 modifiers:
// Q is the low half, R is the high half.
int64_t testFunc2(int64_t acc, int32_t x, int32_t y)
{
    int64_t prod;   // gcc and clang seem to want more free registers this way
    asm("SMULL %Q0, %R0, %1, %2"
        : "=&r" (prod)   // early clobber for pre-ARMv6
        : "r"(x), "r"(y)
    );
    return acc + prod;
}
again compiled with gcc -O3 -mcpu=arm10e. (clang saves/restores 4 registers)
# gcc -O3 with the early-clobber so it's safe on ARMv5
testFunc2:
push {r4, r5} #
SMULL r4, r5, r2, r3 # prod, x, y
adds r0, r4, r0 #, prod, acc
adc r1, r5, r1 #, prod, acc
pop {r4, r5} #
bx lr #
So for some reason it seems to be more efficient to manually handle the halves of a 64-bit integer with current gcc and clang. This is obviously a missed optimization bug.

Conversion of ARM code into C

Here is my assembly code for A9,
ldr x1, = 0x400020 // Const value may be address also
ldr w0, = 0x200018 // Const value may be address also
str w0, [x1]
Is the C code below the expected equivalent?
*((u32 *)0x400020) = 0x200018;
When I cross-checked it with a compiler, it gave a different result: mov and movs instead of ldr. How do I create an ldr in C?
When I cross-checked it with a compiler, it gave a different result: mov and movs
It sounds to me like you compiled the C code with a compiler targetting AArch32, but the assembly code you've shown looks like it was written for AArch64.
Here's what I get when I compile with ARM64 GCC 5.4 and optimization level O3 (comments added by me):
mov x0, 32 # x0 = 0x20
mov w1, 24 # w1 = 0x18
movk x0, 0x40, lsl 16 # x0[31:16] = 0x40
movk w1, 0x20, lsl 16 # w1[31:16] = 0x20
str w1, [x0]
How do I create an ldr in C?
I can't see any good reason why you'd want the compiler to generate an LDR in this case.
LDR reg,=value is a pseudo-instruction that allows you to load immediates that cannot be encoded directly in the instruction word. The assembler achieves this by placing the value (e.g. 0x200018) in a literal pool, and then replacing ldr w0, =0x200018 with a PC-relative load from that literal pool (i.e. something like ldr w0,[pc,#offset_to_value]). Accessing memory is slow, so the compiler generated another sequence of instructions for you that achieves the same thing in a more efficient manner.
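Schematically, the assembler expands the pseudo-instruction roughly like this (label name invented for illustration):
    ldr w0, =0x200018       // what you wrote (pseudo-instruction)
    // ... is assembled as a PC-relative load from a literal pool:
    ldr w0, .Lpool          // i.e. ldr w0, [pc, #offset_to_value]
    ...
.Lpool:
    .word 0x200018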
Pseudo-instructions are mainly a convenience for humans writing assembly code, making the code easier for them or their colleagues to read/write/maintain. Unlike a human being, a compiler doesn't get fatigued by repeating the same task over and over, and therefore doesn't have as much need for conveniences like that.
TL;DR: The compiler will generate what it thinks is the best (according to the current optimization level) instruction sequence. Also, that particular form of LDR is a pseudo-instruction, so you might not be able to get a compiler to generate it even if you disable all optimizations.

Is it possible to access hardware registers through inline assembly

I am trying to access a hardware register on a Broadcom ARM processor through inline assembly. I have accessed hardware registers through bare-metal programming before, and now I am trying to incorporate that bare-metal code into a C file using asm. Here is my code, which toggles GPIO 17 on a Raspberry Pi 2:
void main() {
    __asm__(
        ".section .init\n\t"
        ".globl _start\n\t"
        "_start:"
        "ldr r0,=0x3F200000\n\t"
        "mov r1, #1\n\t"
        "lsl r1, #21\n\t"
        "str r1, [r0, #4]\n\t"
        "loop$:\n\t"
        "mov r1, #1\n\t"
        "lsl r1, #17\n\t"
        "str r1, [r0, #28]\n\t"
        "mov r1, #1\n\t"
        "lsl r1, #17\n\t"
        "str r1, [r0, #40]\n\t"
        "b loop$\n\t"
    );
}
but when I compile it with gcc file.c it throws the following error:
/tmp/ccrfp9mv.s: Assembler messages:
/tmp/ccrfp9mv.s: Error: .size expression for main does not evaluate to a constant
You get Error: .size expression for main does not evaluate to a constant because you change sections inside a function. As you can see on the Godbolt compiler explorer, the compiler emits asm directives to calculate ELF metadata, with lines like:
.size main, .-main # size_of_main = current_pos - start_of_main
Since you switch sections inside the body of main, the distance between main and the end of main isn't known until link time, and it's not possible to get the linker to fill in this piece of metadata that late. (.size has to be an assemble-time constant, not just a link-time constant).
Like people commented, you should do the whole thing in C, e.g. with a global like
#include <stdint.h>
volatile uint32_t *const GPIO17 = (uint32_t*)0x3F200000; // address is const, contents aren't.
Presumably you need to ask the OS for access to that MMIO register. Part of the OS's job is to stop programs from talking directly to the hardware and messing up other programs that are doing the same thing at the same time.
Even if your code assembled, it won't link because your definition of _start will conflict with the one provided by the libc runtime code.
Don't try to define a function in inline asm inside another function. Write a stand-alone function if you want to do that.
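For illustration, here is a rough bare-metal C equivalent of the inline asm in the question; the base address and the +4/+28/+40 offsets are taken from that code (register names per the BCM2835/6 peripheral layout), and under Linux you would have to mmap /dev/mem or use a GPIO driver rather than dereference the physical address:
#include <stdint.h>

#define GPIO_BASE ((volatile uint32_t *)0x3F200000)  /* physical address from the question */

void toggle_gpio17(void) {
    GPIO_BASE[1] = 1u << 21;        /* offset 4 (GPFSEL1): configure GPIO17 as output */
    for (;;) {
        GPIO_BASE[7]  = 1u << 17;   /* offset 28 (GPSET0): drive GPIO17 high */
        GPIO_BASE[10] = 1u << 17;   /* offset 40 (GPCLR0): drive GPIO17 low  */
    }
}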

Linaro g++ aarch64 compilation causes unalignment fault

I'm using Linaro g++ for ARM aarch64 to compile a simple cpp file:
int main()
{
    char *helloMain = "main module (crm.c)";
    long faculty, num = 12;
    int stop, mainLoop = 1;
    char word[80] = "";
}
After running objdump on the generated ELF file, I got its asm code:
0000000000001270 <main>:
int main()
{
1270: d101c3ff sub sp, sp, #0x70
char *helloMain = "main module (crm.c)";
1274: 90000020 adrp x0, 5000 <_malloc_trim_r+0x160>
1278: 9111c000 add x0, x0, #0x470
127c: f90003e0 str x0, [sp]
long faculty, num = 12;
1280: d2800180 movz x0, #0xc
1284: f90007e0 str x0, [sp,#8]
int stop,mainLoop = 1;
1288: 52800020 movz w0, #0x1
128c: b90013e0 str w0, [sp,#16]
char word[80] = "";
1290: 910063e0 add x0, sp, #0x18
1294: 90000021 adrp x1, 5000 <_malloc_trim_r+0x160>
1298: 91122021 add x1, x1, #0x488
129c: 39400021 ldrb w1, [x1]
12a0: 39000001 strb w1, [x0]
12a4: 91000400 add x0, x0, #0x1
12a8: a9007c1f stp xzr, xzr, [x0]
12ac: a9017c1f stp xzr, xzr, [x0,#16]
12b0: a9027c1f stp xzr, xzr, [x0,#32]
12b4: a9037c1f stp xzr, xzr, [x0,#48]
12b8: f900201f str xzr, [x0,#64]
12bc: b900481f str wzr, [x0,#72]
12c0: 7900981f strh wzr, [x0,#76]
12c4: 3901381f strb wzr, [x0,#78]
}
12c8: 52800000 movz w0, #0x0
12cc: 9101c3ff add sp, sp, #0x70
12d0: d65f03c0 ret
Before executing this code on an ARMv8 board, sp is initialized to an address aligned to 0x1000.
Executing this code raised an alignment fault exception at
12a8: a9007c1f stp xzr, xzr, [x0]
I noticed x0 was incremented by 0x1, so it was only byte-aligned when the stp instruction was executed.
Why didn't g++ align it to 0x10 to avoid such an alignment fault exception?
The g++ version is:
gcc 4.8.1 20130506 (prerelease) (crosstool-NG linaro-1.13.1-4.8-2013.05 - Linaro GCC 2013.05)
From the manual:
-munaligned-access
-mno-unaligned-access
Enables (or disables) reading and writing of 16- and 32- bit values from addresses that are not 16- or 32- bit aligned.
By default unaligned access is disabled for all pre-ARMv6 and all
ARMv6-M architectures, and enabled for all other architectures. If
unaligned access is not enabled then words in packed data structures
will be accessed a byte at a time.
The ARM attribute Tag_CPU_unaligned_access will be set in the
generated object file to either true or false, depending upon the
setting of this option. If unaligned access is enabled then the
preprocessor symbol __ARM_FEATURE_UNALIGNED will also be defined.
AArch64/ARMv8 supports unaligned access out of the box, so GCC assumes it's available. If this is not the case, you may have to disable it explicitly with the above switch. It's also possible that the "prerelease" version you're using is not quite finished yet and various bugs/issues are present.
EDIT
As mentioned in the comments, the corresponding AArch64 options are:
-mstrict-align
-mno-strict-align
Avoid or allow generating memory accesses that may not be aligned on a natural object boundary as described in the architecture specification.
By the way, the code behaves like this because GCC interpreted the assignment literally:
Copy the string "" (so just a single zero byte) to the start of the buffer.
Fill the rest of the buffer with zeroes.
I suspect that if you enable optimizations, the unaligned access will be gone.
Or, if you use char word[80] = {0}, it should do the zeroing in one go.
After some study of the ARMv8 architecture, I got a deeper understanding of the data abort exception I hit.
Why did this alignment fault exception occur?
As @IgorSkochinsky mentioned, AArch64/ARMv8 supports unaligned access. But since I'm working in a simple bare-metal environment, the MMU isn't enabled, so memory is treated as Device memory, and Device memory doesn't support unaligned access. If the MMU is enabled, this exception is gone.
How do I force GCC to compile an ELF file free of unaligned accesses?
From the manual, -mno-unaligned-access should be enough, but for my GCC version:
gcc 4.8.1 20130506 (prerelease) (crosstool-NG
linaro-1.13.1-4.8-2013.05 - Linaro GCC 2013.05)
it says there's no such option. In my case, another option -mstrict-align solved this problem.
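For reference, the invocation then looks something like this (toolchain prefix and file names are placeholders for your setup):
$ aarch64-linux-gnu-g++ -mstrict-align crm.cpp -o crm.elf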
