unaligned access with memcpy - c

I'm using a netif struct (similar to http://www.nongnu.org/lwip/structnetif.html) and I have a question related to alignment. I noticed that every int starts at an address that is a multiple of 4 (e.g. 0x20010db0). However, let's take a look at the following:
struct netif {
...
u8_t hwaddr_len (at address 0x20010db8)
u8_t[8] hwaddr (at address 0x20010db9)
u8_t mtu (at address 0x20010dc1)
...
}
From what I understand, hwaddr_len is aligned on 4 bytes, hwaddr is "aligned" on 1 byte (because it's a u8_t, it isn't aligned on 4 bytes (32 bits)) and mtu is "aligned" on 1 byte. After that, all the other members of the struct are aligned on 4 bytes again. So I think this should be fine, even if hwaddr is not aligned on a 4-byte boundary, but when I try to do a memcpy from "src" to hwaddr, I get an unaligned access error.
I'm compiling with the ARM GCC compiler. Does anyone have an idea why it is failing?
Ps : I don't have much knowledge about ARM alignment issues, sorry if my question may seem obvious.
EDIT :
Version of the compiler : gcc-arm-none-eabi-4_9-2015q3
The section where it is failing :
lpwif_get_slla(struct lpwif *lpwif, void *lla, unsigned char lla_len)
{
WpanDeviceP dev = lpwif->dev->wpan;
unsigned char len = 0;
if (lla_len >= 8) {
if (lpwif->eui[0] == 0xff) {
/* Fetch WPAN Device's Long Address. */
uint64_t addr64;
memset(&addr64, 0xff, sizeof(addr64));
WpanGet(dev, WpanPibAttr_macExtendedAddress, &addr64, 8);
/* Always return address in network-byte order */
lpwif->eui[0] = (addr64 >> 56) & 0xff;
lpwif->eui[1] = (addr64 >> 48) & 0xff;
lpwif->eui[2] = (addr64 >> 40) & 0xff;
lpwif->eui[3] = (addr64 >> 32) & 0xff;
lpwif->eui[4] = (addr64 >> 24) & 0xff;
lpwif->eui[5] = (addr64 >> 16) & 0xff;
lpwif->eui[6] = (addr64 >> 8) & 0xff;
lpwif->eui[7] = (addr64 >> 0) & 0xff;
}
if (lpwif->eui[0] == 0xff) return 0; /* Device has no EUI-64 address. */
if (lla) memcpy(lla, lpwif->eui, 8);
}
And this method is called by `lpwif_get_slla(&state->lpwif, netif->hwaddr, 8);
Disassembly:
if (lpwif->eui[0] == 0xff) return 0; /* Device has no EUI-64 address. */
10098ac: 6bfb ldr r3, [r7, #60] ; 0x3c
10098ae: f893 3024 ldrb.w r3, [r3, #36] ; 0x24
10098b2: 2bff cmp r3, #255 ; 0xff
10098b4: d101 bne.n 10098ba <lpwif_get_slla+0x10e>
10098b6: 2300 movs r3, #0
10098b8: e07f b.n 10099ba <lpwif_get_slla+0x20e>
if (lla) memcpy(lla, lpwif->eui, 8);
10098ba: 6bbb ldr r3, [r7, #56] ; 0x38
10098bc: 2b00 cmp r3, #0
10098be: d006 beq.n 10098ce <lpwif_get_slla+0x122>
10098c0: 6bfb ldr r3, [r7, #60] ; 0x3c
10098c2: 3324 adds r3, #36 ; 0x24
10098c4: 6bb8 ldr r0, [r7, #56] ; 0x38
10098c6: 4619 mov r1, r3
10098c8: 2208 movs r2, #8
10098ca: 4b3f ldr r3, [pc, #252] ; (10099c8 <lpwif_get_slla+0x21c>)
10098cc: 4798 blx r3
return 8;
10098ce: 2308 movs r3, #8
10098d0: e073 b.n 10099ba <lpwif_get_slla+0x20e>

Try replacing the memcpy with a copy using a simple for loop. The compiler is probably optimizing it, assuming the memory is aligned.

Related

ARM Cortex M0, BYTE or WORD access, which one is the best?

What is the real benefit of using byte access (instead of word) on a Cortex M0+.
LPC845 does have for all package pins (GPIO) one "BYTE" register and one "WORD" register.
If I want to set a GPIO using WORD register, I would use something like this
*((volatile uint32_t*)0xA0001088) = 1;
0000036e: movs r4, #137 ; 0x89
00000370: ldr r0, [pc, #80] ; (0x3c4 <AppInit+144>)
00000372: str r4, [r0, #0]
In the same time, I could use BYTE register
(using this, GPIO peripheral will automatically ignore everything is over 0xFF, writing 1 to pin)
*((volatile uint8_t*)0xA000102C) = 1;
0000039c: ldr r3, [pc, #48] ; (0x3d0 <AppInit+156>)
0000039e: adds r1, #56 ; 0x38
000003a0: strb r1, [r3, #0]
Now both techniques are consuming the same number of instructions.
Which one is the best to use and why?

Creating a empty array to be passed into a function . Using Arm Assembly

Hello I am trying to create an array of 3 indexes that will be filled by a function that I pass it into.
# r4 contains length
sub sp, sp, #12 # allocate 3 indexes for NEW array
mov r3, #4 # r3 = 4
mul sp, sp, r3 # sp = (# array elements)*4
mov r0, r4 # r0 = r4 = length parameter
mov r1, sp # r1 = (sp*index number) len[0] parameter
bl convert2pixel # convert2pixel(length, len)
Something is incorrect about this. because i keep segfaulting when i branch to convert2pixel
Convert2Pixel C Code:
/*
 * Split 'length' into 3 little-endian bytes: len[0] = bits 7:0,
 * len[1] = bits 15:8, len[2] = bits 23:16.
 *
 * Uses unsigned arithmetic throughout: right-shifting a negative int is
 * implementation-defined in C, so 'length' is converted to unsigned first.
 * Result is identical to the old code for all non-negative lengths.
 */
void convert2pixel(int length, unsigned char len[3])
{
// byte3-byte2-byte1 layout: len[2]-len[1]-len[0]
unsigned int v = (unsigned int)length;
len[0] = (unsigned char)(v & 0xFFu);         /* bits 7:0  */
len[1] = (unsigned char)((v >> 8) & 0xFFu);  /* bits 15:8 */
len[2] = (unsigned char)((v >> 16) & 0xFFu); /* bits 23:16 */
}

S32K146EVB Read Collision when erasing/writing flash

when i try to erase or write to the program flash on my S32K146 EVB i run into a Fault at the moment the FTFC should execute the command. Also the RDCOLLERR bit in the FTFC_STAT register is set. This is the Error from S32DS:
BusFault: A precise (synchronous) data access error has occurred. Possible location: 0x00000BA0.
The PC stopped at 0xb8a.
This is the disassembly:
11 while ((FTFC->FSTAT & FTFC_FSTAT_CCIF_MASK) == 0);
00000b88: nop
00000b8a: ldr r3, [pc, #20] ; (0xba0 <execute_command+44>)
00000b8c: ldrb r3, [r3, #0]
00000b8e: uxtb r3, r3
00000b90: sxtb r3, r3
00000b92: cmp r3, #0
00000b94: bge.n 0xb8a <execute_command+22>
12 return;
00000b96: nop
13 }
00000b98: mov sp, r7
00000b9a: pop {r7}
00000b9c: bx lr
00000b9e: nop
00000ba0: movs r0, r0
00000ba2: ands r2, r0
Strangely enough this does not happen, when i step through the program line by line. Then the flash gets programmed correctly.
This is my routine for erasing a flash sector:
/*
 * Erase the flash sector containing 'addr' via the S32K146 FTFC controller.
 * Interrupts are masked for the duration of the command sequence.
 * NOTE(review): FCCOB byte ordering and the 0x09 command code are taken on
 * trust here — confirm against the S32K1xx reference manual for this part.
 */
void flash_erase_section(unsigned int addr)
{
// reject addresses outside both the program-flash and FlexNVM ranges
if ((addr > FLASH_END_ADDRESS && addr < FLEXNVM_START_ADDRESS) || addr > FLEXNVM_END_ADDRESS){
return;
}
asm volatile("cpsid i"); /* mask interrupts around the flash command */
// wait until any previous command finishes (CCIF set => controller idle)
while ((FTFC->FSTAT & FTFC_FSTAT_CCIF_MASK) == 0);
// clear stale error flags (write-1-to-clear) before staging a new command
FTFC->FSTAT = FTFC_FSTAT_ACCERR_MASK | FTFC_FSTAT_FPVIOL_MASK;
FTFC->FCCOB[3] = 0x09; // erase flash section command
FTFC->FCCOB[2] = (addr >> 16) & 0xFF; // address[23:16]
FTFC->FCCOB[1] = (addr >> 8) & 0xFF; // address[15:8]
FTFC->FCCOB[0] = addr & 0xF0; // address[7:0], low nibble cleared: 128 bit aligned
execute_command();
asm volatile("cpsie i"); /* restore interrupts */
return;
}
The error happens in execute_command():
/*
 * Launch the staged FTFC command and busy-wait until it completes.
 * Writing 1 to CCIF starts the command; CCIF reads as 0 while the
 * controller is busy.
 * NOTE(review): the |= performs a read-modify-write of FSTAT, so it writes
 * back whatever w1c error bits were set at read time, clearing them as a
 * side effect — a plain assignment of just the CCIF mask is the usual
 * idiom. Confirm the intended behavior against the reference manual.
 */
void execute_command()
{
FTFC->FSTAT |= FTFC_FSTAT_CCIF_MASK;
while ((FTFC->FSTAT & FTFC_FSTAT_CCIF_MASK) == 0);
}
As mentioned earlier, this only happens when NOT debugging step by step. I suspect this has something to do with the flash being busy, but i did not find anything that would help me understand.
Thank you for your help.
I found a workaround. It seems that the MCU threw a Bus Fault because, by accessing the flash memory, the cached instructions became invalid. Disabling caching by writing LMEM->PCCRMR = 0; resolved the issue.
Nonetheless, it would be interesting to know whether there is a solution that doesn't involve disabling caching altogether.

ARM Cortex M7 unaligned access and memcpy

I am compiling this code for a Cortex M7 using GCC:
// copy manually
/*
 * Store 'value' to 'ptr' one byte at a time, least-significant byte first
 * (little-endian layout). Byte stores are always naturally aligned, so this
 * is safe for any 'ptr', aligned or not.
 *
 * Fix: cast with the standard 'uint8_t' (the parameter's own type) instead
 * of the non-standard 'u8' typedef, which is not defined in this snippet.
 */
void write_test_plain(uint8_t * ptr, uint32_t value)
{
*ptr++ = (uint8_t)(value);
*ptr++ = (uint8_t)(value >> 8);
*ptr++ = (uint8_t)(value >> 16);
*ptr++ = (uint8_t)(value >> 24);
}
// copy using memcpy
void write_test_memcpy(uint8_t * ptr, uint32_t value)
{
void *px = (void*)&value;
memcpy(ptr, px, 4);
}
int main(void)
{
extern uint8_t data[];
extern uint32_t value;
// i added some offsets to data to
// make sure the compiler cannot
// assume it's aligned in memory
write_test_plain(data + 2, value);
__asm volatile("": : :"memory"); // just to split inlined calls
write_test_memcpy(data + 5, value);
... do something with data ...
}
And I get the following Thumb2 assembly with -O2:
// write_test_plain(data + 2, value);
800031c: 2478 movs r4, #120 ; 0x78
800031e: 2056 movs r0, #86 ; 0x56
8000320: 2134 movs r1, #52 ; 0x34
8000322: 2212 movs r2, #18 ; 0x12
8000324: 759c strb r4, [r3, #22]
8000326: 75d8 strb r0, [r3, #23]
8000328: 7619 strb r1, [r3, #24]
800032a: 765a strb r2, [r3, #25]
// write_test_memcpy(data + 5, value);
800032c: 4ac4 ldr r2, [pc, #784] ; (8000640 <main+0x3a0>)
800032e: 923b str r2, [sp, #236] ; 0xec
8000330: 983b ldr r0, [sp, #236] ; 0xec
8000332: f8c3 0019 str.w r0, [r3, #25]
Can someone explain how the memcpy version works? This looks like inlined 32-bit store to the destination address, but isn't this a problem since data + 5 is most certainly not aligned to a 4-byte boundary?
Is this perhaps some optimization which happens due to some undefined behavior in my source?
For Cortex-M processors unaligned loads and stores of bytes, half-words, and words are usually allowed and most compilers use this when generating code unless they are instructed not to. If you want to prevent gcc from assuming the unaligned accesses are OK, you can use the -mno-unaligned-access compiler flag.
If you specify this flag gcc will no longer inline the call to memcpy and write_test_memcpy looks like
write_test_memcpy(unsigned char*, unsigned long):
push {lr}
sub sp, sp, #12
movs r2, #4
add r3, sp, #8
str r1, [r3, #-4]!
mov r1, r3
bl memcpy
add sp, sp, #12
ldr pc, [sp], #4
Cortex-M 7 , M4, M3 M33, M23 does support unaligned access
M0, M+ doesn't support unaligned access
however you can disable the support of unaligned access in cortexm7 by setting bit UNALIGN_TRP in configuration and control register and any unaligned access will generate usage fault.
From compiler perspective, default setting is that generated assembly code does unaligned access unless you disable this by using the compile flag -mno-unaligned-access

What's faster on ARM? MUL or (SHIFT + SUB)?

Which is faster on ARM?
*p++ = (*p >> 7) * 255;
or
*p++ = ((*p >> 7) << 8) - 1
Essentially what I'm doing here is taking an 8-bit word and setting it to 255 if >= 128, and 0 otherwise.
If p is char below statement is just an assignment to 255.
*p++ = ((*p >> 7) << 8) - 1
If p is int, then of course it is a different story.
You can use GCC Explorer to see how the assembly output looks like. Below is appearently what you get from Linaro's arm-linux-gnueabi-g++ 4.6.3 with -O2 -march=armv7-a flags;
/*
 * Demonstration functions for comparing the ARM assembly each variant
 * generates; kept byte-identical so they correspond to the listing below.
 * NOTE(review): in the '*p++ = (*p ...)' forms, the read of 'p' on the
 * right-hand side is unsequenced with respect to the 'p++' side effect on
 * the left — undefined behavior in C. They are shown only to discuss the
 * compiler's output, not as code to imitate.
 */
void test(char *p) {
*p++ = (*p >> 7) * 255;
}
void test2(char *p) {
/* for signed char, (*p >> 7) << 8 is -256 or 0, so -1 truncates to 255 or 255;
   the compiler folds this to a constant store (see listing below) */
*p++ = ((*p >> 7) << 8) - 1 ;
}
void test2_i(int *p) {
*p++ = ((*p >> 7) << 8) - 1 ;
}
void test3(char *p) {
/* well-defined alternative: all-ones iff the (signed) byte is >= 128 unsigned */
*p++ = *p >= 128 ? ~0 : 0;
}
void test4(char *p) {
/* well-defined alternative: all-ones iff the MSB is set */
*p++ = *p & 0x80 ? ~0 : 0;
}
creates
test(char*):
ldrb r3, [r0, #0] # zero_extendqisi2
sbfx r3, r3, #7, #1
strb r3, [r0, #0]
bx lr
test2(char*):
movs r3, #255
strb r3, [r0, #0]
bx lr
test2_i(int*):
ldr r3, [r0, #0]
asrs r3, r3, #7
lsls r3, r3, #8
subs r3, r3, #1
str r3, [r0, #0]
bx lr
test3(char*):
ldrsb r3, [r0, #0]
cmp r3, #0
ite lt
movlt r3, #255
movge r3, #0
strb r3, [r0, #0]
bx lr
test4(char*):
ldrsb r3, [r0, #0]
cmp r3, #0
ite lt
movlt r3, #255
movge r3, #0
strb r3, [r0, #0]
bx lr
If you are not nitpicking best is always to check assembly of the generated code over such details. People tend to overestimate compilers, I agree most of the time they do great but I guess it is anyone's right to question generated code.
You should also be careful interpreting instructions, since they won't always match a cycle-accurate listing due to the core's architectural features like out-of-order execution and superscalar deep pipelines. So the shortest sequence of instructions might not always win.
Well, to answer the question in your title, on ARM, a SHIFT+SUB can be done in a single instruction with 1 cycle latency, while a MUL usually has multiple cycles of latency. So the shift will usually be faster.
To answer the implied question of what C code to write for this, generally you are best off with the simplest code that expresses your intent:
*p++ = *p >= 128 ? ~0 : 0; // set byte to all ones iff >= 128
or
*p++ = *p & 0x80 ? ~0 : 0; // set byte to all ones based on the MSB
this will generally get converted by the compiler into the fastest way of doing it, whether that is a shift and whatever, or a conditional move.
Despite the fact that your compiler can optimize the line quite well (and reading the assembly will tell you what is really done), you can refer from this page to know exactly how much cycles a MUL can take.

Resources