Cast from array causing crash on some MCUs but not others


I have a piece of code looking like this:
void update_clock(uint8_t *time_array)
{
time_t time = *((time_t *) &time_array[0]); // <-- hangs
/* ... more code ... */
}
Where time_array is an array of 4 bytes (i.e. uint8_t time_array[4]).
I'm using arm-none-eabi-gcc to compile this for an STM32L4 processor.
While compiling this a couple of months ago I got no errors and the code is running perfectly fine on all my test MCUs. I did some updates to my environment (OpenSTM32) when coming back to this project and now this piece of code is crashing on some MCUs while working fine on others.
I still have my binary from a couple of months ago and have confirmed that this code path works fine on all of my MCUs (I have about 5 to test on), but now it works on two of them while causing a crash on three of them.
I have mitigated the problem by rewriting the code like this:
time_t time = (
((uint32_t) time_array[0]) << 0 |
((uint32_t) time_array[1]) << 8 |
((uint32_t) time_array[2]) << 16 |
((uint32_t) time_array[3]) << 24
);
While this works for now, I think the old code looks cleaner and I'm also worried that if this code path hangs I probably will have similar errors elsewhere.
Does anyone have any idea what can be causing this? Can I change anything in my setup to make the compiler work the old way again?

From version 7-2017-q4-major, arm gcc ships with newlib compiled with time_t defined as 64 bit (long long) integer, causing all sorts of problems with code that assumes it to be 32 bits. Your code is reading past the end of the source array, taking whatever is stored there as the high order bits of the time value, possibly resulting in a date before the big bang, or after the heat death of the universe, which might not be what your code expects.
If the source array is known to contain 32 bits of data, copy it to a 32-bit int32_t variable first, then assign that to a time_t; this way it will be properly converted regardless of the size of time_t.
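A minimal sketch of that suggestion, assuming the four bytes are stored in the MCU's native byte order and hold a signed 32-bit timestamp. The helper name time_from_bytes is made up for illustration and is not part of the original code.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: reads exactly 4 bytes, never sizeof(time_t) bytes. */
time_t time_from_bytes(const uint8_t *time_array)
{
    int32_t raw;                           /* always 32 bits wide */
    memcpy(&raw, time_array, sizeof raw);  /* the source needs no particular alignment */
    return (time_t)raw;                    /* widened if time_t is 64 bits */
}
```

Because the copy length is sizeof raw rather than sizeof(time_t), the function never reads past the end of a 4-byte array, whichever newlib build is in use.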

Your development environment, OpenSTM32, may be using a gcc compiler. If so, gcc supports the following compiler flag:
-fno-strict-aliasing
If you are using -O2, this flag might resolve your problem.
Using memcpy is the standard advice, and is sometimes optimized-away by the compiler:
memcpy(&time, time_array, sizeof time);
Finally, you can use gcc's typeof and a compound literal with a union to generate the following safe cast:
#define PUN_CAST4(a, x) ((union {uint8_t src[4]; typeof(x) dst;}){{a[0],a[1],a[2],a[3]}}).dst
time_t time = PUN_CAST4(time_array, time);
As an example, the following code is compiled at https://godbolt.org/g/eZRXxW:
#include <stdint.h>
#include <time.h>
#include <string.h>
time_t update_clock(uint8_t *time_array) {
time_t t = *((time_t *) &time_array[0]); // assumes no alignment problem
return t;
}
time_t update_clock2(uint8_t *time_array) {
time_t t =
(uint32_t)time_array[0] << 0 |
(uint32_t)time_array[1] << 8 |
(uint32_t)time_array[2] << 16 |
(uint32_t)time_array[3] << 24;
return t;
}
time_t update_clock3(uint8_t *time_array) {
time_t t;
memcpy(&t, time_array, sizeof t);
return t;
}
#define PUN_CAST4(a, x) ((union {uint8_t src[4]; typeof(x) dst;}){{a[0],a[1],a[2],a[3]}}).dst
time_t update_clock4(uint8_t *time_array) {
time_t t = PUN_CAST4(time_array, t);
return t;
}
gcc 8.1 is good for all four examples: it generates the trivial code with -O2. But gcc 7.3 is bad for the 4th. Clang is also good for all four with -m32 for a 32-bit target, but fails on the 2nd and 4th without it.

Your issue is caused by unaligned access, or writing to the wrong area.
Compiling
#include "stdint.h"
#include "time.h"
time_t myTime;
void update_clock(uint8_t *time_array)
{
myTime = *((time_t *) &time_array[0]); // <-- hangs
/* ... more code ... */
}
with GCC 7.2.1 with the arguments -march=armv7-m -Os generates the following
update_clock(unsigned char*):
ldr r3, .L2
ldrd r0, [r0]
strd r0, [r3]
bx lr
.L2:
.word .LANCHOR0
myTime:
Because your time array is an 8-bit type, it has no alignment requirement. If the linker has not word-aligned it, dereferencing it as a time_t * hands the LDRD instruction a non-word-aligned address and causes a UsageFault.
The LDRD and STRD instructions are loading and storing 8 bytes, whereas your array is only 4 bytes long. I suggest you check sizeof(time_t) in your environment, and make an aligned area long enough to store it.
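One way to act on that advice, as a sketch only: the names is_aligned_for_time_t and union time_buffer are hypothetical, and _Alignof assumes a C11 compiler. The union gives a byte view of storage that is guaranteed to be both large enough and correctly aligned for a time_t.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Returns nonzero if p is suitably aligned to be dereferenced as a time_t. */
int is_aligned_for_time_t(const void *p)
{
    return ((uintptr_t)p % _Alignof(time_t)) == 0;
}

/* Hypothetical receive buffer: aligned for time_t and sizeof(time_t) bytes long. */
union time_buffer {
    time_t  value;
    uint8_t bytes[sizeof(time_t)];
};
```

Filling buf.bytes and then reading buf.value avoids both the unaligned LDRD and the read past a 4-byte array, at the cost of having to decide what goes into the upper bytes when only 32 bits are received.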

Related

Cross-compiling C program for ARMv8-A in Linux X86_64 system

I am new to the ARM architecture and am experimenting with cache cleaning on ARM.
I am following the "Programmer’s Guide for ARMv8-A", since gem5 has this implementation as per (https://www.gem5.org/documentation/general_docs/architecture_support/arm_implementation/).
I am trying to cross-compile below code in linux x86_64 system using
arm-linux-gnueabi-gcc test_arm.c -o test,
but I am getting following error.
/tmp/ccTM2bcE.s: Assembler messages:
/tmp/ccTM2bcE.s:38: Error: selected processor does not support requested special purpose register -- `mrs r3,ctr_el0'
/tmp/ccTM2bcE.s:69: Error: bad instruction `dc cavu,r3'
/tmp/ccTM2bcE.s:150: Error: selected processor does not support `dsb ish' in ARM mode
/tmp/ccTM2bcE.s:159: Error: selected processor does not support `dsb ish' in ARM mode
code
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <stdint.h>
void clean_invalidate(uint64_t addr){
uint64_t ctr_el0 = 0;
if(ctr_el0 == 0)
asm volatile("mrs %0, ctr_el0":"=r"(ctr_el0)::);
const size_t dcache_line_size = 4 << ((ctr_el0 >>16)&15);
addr = addr & ~(dcache_line_size - 1);
asm volatile("dc cvau, %0"::"r"(addr):);
}
int main(){
int a[1000];
int index = 0;
uint64_t addr = 0;
double time_spend = 0.0;
clock_t begin = clock();
for(int i=0;i<100;i++){
index = rand()%1000;
a[index] = index;
addr = (uint64_t)(&a[index]);
asm volatile("dsb ish");
clean_invalidate(addr);
asm volatile("dsb ish");
int b = a[index];
}
clock_t end = clock();
time_spend = (double)(end-begin)/CLOCKS_PER_SEC;
printf("Time:%f\n",time_spend);
return 0;
}
Can someone please help me compile this code for ARMv8-A on a Linux x86 system?
PS: You can ignore the cast from pointer to integer of different size warning.

I think mrs %0,ctr_el0 is an ARMv8 aarch64 instruction, and arm-linux-gnueabi-gcc is the armv7/aarch32 compiler; you have to use aarch64-linux-gnu-gcc instead.
And dc cavu does not seem to exist, did you mean dc cvau?
With those two changes it compiles.
To be honest, there is also MRS in ARMv7 in addition to MRC, but I haven't fully understood when each one should be used in there. aarch64 has only MRS so it's simpler.
For the specific case of CTR_EL0, there exists an analogous aarch32 register CTR, but that one is accessed with MRC according to the manual, not MRS.
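To make the architecture dependency explicit, the register read can be guarded so the AArch64 instruction is only emitted when targeting AArch64. This is a sketch: read_ctr_el0 and dcache_line_bytes are hypothetical names, the fallback value 0 is just a placeholder, and the line-size formula is the one from the question's code.

```c
#include <stddef.h>
#include <stdint.h>

/* DminLine is bits [19:16] of CTR_EL0: log2 of the smallest D-cache line, in 4-byte words. */
size_t dcache_line_bytes(uint64_t ctr_el0)
{
    return (size_t)4 << ((ctr_el0 >> 16) & 15);
}

/* Hypothetical wrapper: only uses mrs ..., ctr_el0 when compiling for AArch64. */
uint64_t read_ctr_el0(void)
{
#if defined(__aarch64__)
    uint64_t v;
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(v));
    return v;
#else
    return 0; /* placeholder: this target has no CTR_EL0 */
#endif
}
```

With an aarch32 compiler such as arm-linux-gnueabi-gcc the mrs line is simply skipped, instead of failing in the assembler.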
Here are a gazillion runnable examples that might be of interest as well:
https://cirosantilli.com/linux-kernel-module-cheat/#dump-regs
https://cirosantilli.com/linux-kernel-module-cheat/#arm-userland-assembly

The problem comes with the instruction:
asm volatile("mrs %0, ctr_el0":"=r"(ctr_el0)::);
which is translated directly into an assembler instruction, so it is tied to your ARM architecture. You should look at the registers available on your architecture and check whether this one is included. If not, you need to find another register with a similar purpose.

GCC baremetal inline-assembly SI register not playing nicely with pointers

Well, this is obviously a beginner's question, but this is my first attempt at making an operating system in C (Actually, I'm almost entirely new to C.. I'm used to asm) so, why exactly is this not valid? As far as I know, a pointer in C is just a uint16_t used to point to a certain area in memory, right (or a uint32_t and that's why it's not working)?
I've made the following kernel (I've already made a bootloader and all in assembly to load the resulting KERNEL.BIN file):
kernel.c
void printf(char *str)
{
__asm__(
"mov si, %0\n"
"pusha\n"
"mov ah, 0x0E\n"
".repeat:\n"
"lodsb\n"
"cmp al, 0\n"
"je .done\n"
"int 0x10\n"
"jmp .repeat\n"
".done:\n"
"popa\n"
:
: "r" (str)
);
return;
}
int main()
{
char *msg = "Hello, world!";
printf(msg);
__asm__("jmp $");
return 0;
}
I've used the following command to compile kernel.c:
gcc kernel.c -ffreestanding -m32 -std=c99 -g -O0 -masm=intel -o kernel.bin
which returns the following error:
kernel.c:3: Error: operand type mismatch for 'mov'
What exactly might be the cause of this error?

As Michael Petch already explained, use inline assembly only for the absolute minimum of code that cannot be written in C. Where you do need inline assembly, you have to be extremely careful to get the constraints and clobber list right.
Always let GCC do the job of passing the values in the right registers; just specify which register each value should be in.
For your problem you probably want to do something like this
#include <stdint.h>
void print( const char *str )
{
for ( ; *str; str++) {
__asm__ __volatile__("int $0x10" : : "a" ((int16_t)((0x0E << 8) + *str)), "b" ((int16_t)0) : );
}
}
EDIT: Your assembly has the problem that you try to pass a pointer in a 16-bit register. This cannot work for 32-bit code, where pointers are also 32 bits wide.
If you do want to generate 16-bit real-mode code, there is the -m16 option. But that does not make GCC a true 16-bit compiler; it has its limitations. Essentially it issues a .code16gcc directive in the code.
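As an illustration of letting GCC place the pointer, the sketch below uses the "S" constraint so the compiler loads the string pointer into the full-width source-index register itself, with no hand-written mov si. It counts bytes rather than calling int 0x10 so that it can run in an ordinary user-mode program. The function name count_bytes is made up for this example, and the non-x86 branch is only there so the file still builds elsewhere.

```c
#include <stddef.h>

/* Hypothetical example: walk a string with lodsb, letting GCC fill ESI/RSI. */
size_t count_bytes(const char *str)
{
    size_t n = 0;
    unsigned char c;
    for (;;) {
#if defined(__i386__) || defined(__x86_64__)
        /* "=a": result arrives in AL; "+S": pointer lives in ESI/RSI and is advanced. */
        __asm__ __volatile__("lodsb" : "=a"(c), "+S"(str) : : "memory");
#else
        c = (unsigned char)*str++;  /* portable fallback on other targets */
#endif
        if (c == 0)
            break;
        n++;
    }
    return n;
}
```

The constraints tell the compiler which registers are read and written, so no pusha/popa is needed and the pointer is never truncated to 16 bits.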

You can't simply use 16bit assembly instructions on 32-bit pointers and expect it to work. si is the lower 16bit of the esi register (which is 32bit).
gcc -m32 and -m16 both use 32-bit pointers. -m16 just uses address-size and operand-size prefixes to do mostly the same thing as normal -m32 mode, but running in real mode.
If you try to use 16bit addressing in a 32bit application you'll drop the high part of your pointers, and simply go to a different place.
Just try to read a book on intel 32bit addressing modes, and protected mode, and you'll see that many things are different on that mode.
(and if you try to switch to 64bit mode, you'll see that everything changes again)
A bootloader is something different, as a CPU reset normally forces the CPU to begin in 16-bit real mode. This is completely different from 32-bit protected mode; switching to protected mode is one of the first things the operating system does. Bootloaders work in 16-bit mode, and there pointers are 16 bits wide (well, actually 20 bits wide once the proper segment register is combined with the address).
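The arithmetic behind both points can be written out as a small sketch. The function names and the sample addresses are chosen only for illustration.

```c
#include <stdint.h>

/* Keeping only the low 16 bits of a 32-bit address, as a 16-bit register would. */
uint16_t low16(uint32_t addr)
{
    return (uint16_t)(addr & 0xFFFFu);
}

/* Real-mode address: segment * 16 + offset, nominally a 20-bit linear address. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

For example, a 32-bit pointer 0x0804A010 truncated to 16 bits becomes 0xA010, a different location entirely, while the real-mode pairs 07C0:0000 and 0000:7C00 both name linear address 0x7C00.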

Vector Sum using AVX Inline Assembly on XeonPhi

I am new to using the Xeon Phi Intel co-processor. I want to write code for a simple vector sum using AVX 512-bit instructions. I use k1om-mpss-linux-gcc as the compiler and want to write inline assembly. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <assert.h>
#include <stdint.h>
void* aligned_malloc(size_t size, size_t alignment) {
uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t));
uintptr_t t = r + sizeof(uintptr_t);
uintptr_t o =(t + alignment) & ~(uintptr_t)alignment;
if (!r) return NULL;
((uintptr_t*)o)[-1] = r;
return (void*)o;
}
int main(int argc, char* argv[])
{
printf("Starting calculation...\n");
int i;
const int length = 65536;
unsigned *A = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
unsigned *B = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
unsigned *C = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
for(i=0; i<length; i++){
A[i] = 1;
B[i] = 2;
}
const int AVXLength = length / 16;
unsigned char * pA = (unsigned char *) A;
unsigned char * pB = (unsigned char *) B;
unsigned char * pC = (unsigned char *) C;
for(i=0; i<AVXLength; i++ ){
__asm__("vmovdqa32 %1,%%zmm0\n"
"vmovdqa32 %2,%%zmm1\n"
"vpaddd %0,%%zmm0,%%zmm1;"
: "=m" (pC) : "m" (pA), "m" (pB));
pA += 64;
pB += 64;
pC += 64;
}
// To prove that the program actually worked
for (i=0; i <5 ; i++)
{
printf("C[%d] = %f\n", i, C[i]);
}
}
However, when I run the program, I get a segmentation fault from the asm part. Can somebody help me with that?
Thanks

Xeon Phi Knights Corner doesn't support AVX. It only supports a special set of vector extensions, called Intel Initial Many Core Instructions (Intel IMCI) with a vector size of 512b. So trying to put any sort of AVX specific assembly into a KNC code will lead to crashes.
Just wait for Knights Landing. It will support AVX-512 vector extensions.

Although Knights Corner (KNC) does not have AVX512, it has something very similar. Many of the mnemonics are the same. In fact, in the OP's case the mnemonics vmovdqa32 and vpaddd are the same for AVX512 and KNC.
The opcodes likely differ, but the compiler/assembler takes care of this. In the OP's case he/she is using a special version of GCC, k1om-mpss-linux-gcc, which is part of the many-core software stack for KNC and which presumably generates the correct opcodes. One can compile on the host using k1om-mpss-linux-gcc and then scp the binary to the KNC card. I learned about this from a comment in this question.
As to why the OP's code is failing, I can only guess, since I don't have a KNC card to test with.
In my limited experience with GCC inline assembly I have learned that it's good to look at the generated assembly in the object file to make sure the compiler did what you expect.
When I compile your code with a normal version of GCC I see that the line "vpaddd %0,%%zmm0,%%zmm1;" produces assembly with the semicolon. I don't think the semicolon should be there. That could be one problem.
But since the OP's mnemonics are the same as AVX512, we can use AVX512 intrinsics to figure out the correct assembly
#include <x86intrin.h>
void foo(int *A, int *B, int *C) {
__m512i a16 = _mm512_load_epi32(A);
__m512i b16 = _mm512_load_epi32(B);
__m512i s16 = _mm512_add_epi32(a16,b16);
_mm512_store_epi32(C, s16);
}
and gcc -mavx512f -O3 -S knc.c produces
vmovdqa64 (%rsi), %zmm0
vpaddd (%rdi), %zmm0, %zmm0
vmovdqa64 %zmm0, (%rdx)
GCC chose vmovdqa64 instead of vmovdqa32 even though the Intel documentation says it should be vmovdqa32. I am not sure why. I don't know what the difference is. I could have used the intrinsic _mm512_load_si512, which does exist and according to Intel should map to vmovdqa32, but GCC maps it to vmovdqa64 as well. I am not sure why there are also _mm512_load_epi32 and _mm512_load_epi64 now. SSE and AVX don't have these corresponding intrinsics.
Based on GCC's code here is the inline assembly I would use
__asm__ ("vmovdqa64 (%1), %%zmm0\n"
"vpaddd (%2), %%zmm0, %%zmm0\n"
"vmovdqa64 %%zmm0, (%0)"
:
: "r" (pC), "r" (pA), "r" (pB)
: "memory"
);
Maybe vmovdqa32 should be used instead of vmovdqa64 but I expect it does not matter.
I used the register constraint r instead of the memory constraint m because, in my past experience, the m constraint did not produce the assembly I expected.
Another possibility to consider is to use a version of GCC that supports AVX512 intrinsics to generate the assembly and then use the special KNC version of GCC to convert the assembly to binary. For example
gcc-5.1 -O3 -S foo.c
k1om-mpss-linux-gcc foo.s
This may be asking for trouble since k1om-mpss-linux-gcc is likely an older version of GCC. I have never done something like this before but it may work.
As explained here, the reason for the AVX512 intrinsics
_mm512_load/store(u)_epi32
_mm512_load/store(u)_epi64
_mm512_load/store(u)_si512
is that the parameters have been converted to void*. For example, with SSE you have to cast
int *x;
__m128i v;
_mm_store_si128((__m128i*)x, v);
whereas with AVX512 you no longer need to
int *x;
__m512i v;
_mm512_store_epi32(x, v);
//_mm512_store_si512(x, v); //this is also fine
It's still not clear to me why there are both vmovdqa32 and vmovdqa64 (GCC only seems to use vmovdqa64 currently), but it's probably similar to movaps and movapd in SSE, which have no real difference and exist only in case they make a difference in the future.
The purpose of vmovdqa32 and vmovdqa64 is masking, which can be done with these intrinsics
_mm512_mask_load/store_epi32
_mm512_mask_load/store_epi64
Without masks the instructions are equivalent.

Assembly label address incorrect on 32-bit processors

I have some simple code that finds the difference between two assembly labels:
#include <stdio.h>
static void foo(void){
__asm__ __volatile__("_foo_start:");
printf("Hello, world.\n");
__asm__ __volatile__("_foo_end:");
}
int main(void){
extern const char foo_start[], foo_end[];
printf("foo_start: %p, foo_end: %p\n", foo_start, foo_end);
printf("Difference = 0x%tx.\n", foo_end - foo_start);
foo();
return 0;
}
Now, this code works perfectly on 64-bit processors, just like you would expect it to. However, on 32-bit processors, the address of foo_start is the same as foo_end.
I'm sure it has to do with 32 to 64 bit. On i386, it results in 0x0, and x86_64 results in 0x7. On ARMv7 (32 bit), it results in 0x0, while on ARM64, it results in 0xC. (the 64-bit results are correct, I checked them with a disassembler)
I'm using Clang+LLVM to compile.
I'm wondering if it has to do with non-lazy pointers. In the assembly output of both 32-bit processor archs mentioned above, they have something like this at the end:
L_foo_end$non_lazy_ptr:
.indirect_symbol _foo_end
.long 0
L_foo_start$non_lazy_ptr:
.indirect_symbol _foo_start
.long 0
However, this is not present in the assembly output of both x86_64 and ARM64. I messed with removing the non-lazy pointers and addressing the labels directly yesterday, but to no avail. Any ideas on why this happens?
EDIT:
It appears that when compiled for 32 bit processors, foo_start[] and foo_end[] point to main. I....I'm so confused.

I didn't check this on real code, but I suspect you are a victim of instruction reordering. As long as you do not define proper memory barriers, the compiler is free to move your code around within the function as it sees fit, since there is no interdependency between the labels and the printf() call.
Try adding ::: "memory" to your asm statements which should nail them where you wrote them.

I finally found the solution (or, alternative, I suppose). Apparently, the && operator can be used to get the address of C labels, removing the need for me to use inline assembly at all. I don't think it's in the C standard, but it looks like Clang supports it, and I've heard GCC does too.
#include <stdio.h>
int main(void){
foo_start:
printf("Hello, world.\n");
foo_end:
printf("Foo has ended.");
void* foo_start_ptr = &&foo_start;
void* foo_end_ptr = &&foo_end;
printf("foo_start: %p, foo_end: %p\n", foo_start_ptr, foo_end_ptr);
printf("Difference: 0x%tx\n", (char*)foo_end_ptr - (char*)foo_start_ptr); // pointer difference is a ptrdiff_t, matching %tx
return 0;
}
Now, this only works if the labels are in the same function, but for what I intend to use this for, it's perfect. No more ASM, and it doesn't leave a symbol behind. It appears to work just how I need it to. (Not tested on ARM64)

What is this x86 inline assembly doing?

I came across this code and need to understand what it is doing. It just seems to be declaring two bytes and then doing nothing...
uint64_t x;
__asm__ __volatile__ (".byte 0x0f, 0x31" : "=A" (x));
Thanks!

This is generating two bytes (0F 31) directly into the code stream. This is an RDTSC instruction, which reads the time-stamp counter into EDX:EAX, which will then be copied to the variable 'x' by the output constraint "=A"(x).

0F 31 is the x86 opcode for the RDTSC (read time stamp counter) instruction; it places the value read into the EDX and EAX registers.
The __asm__ directive isn't just declaring two bytes, it's placing inline assembly into the C code. Presumably, the program has a way of using the value in those registers immediately afterwards.
http://en.wikipedia.org/wiki/Time_Stamp_Counter

It's inserting an 0F 31 opcode, which according to this site is:
0F 31 P1+ f2 RDTSC EAX EDX IA32_T... Read Time-Stamp Counter
Then it is storing the result in the x variable

It's inline asm for rdtsc, with the machine-code encoding written out to support really old assemblers that don't know the mnemonic.
Unfortunately, it only works correctly in 32-bit code because "=A" doesn't split 64-bit operands in half in 64-bit code. (The gcc manual even uses rdtsc as an example to illustrate this.)
The safe way to write this, which compiles to optimal code with gcc -m32 or -m64, is:
#include <stdint.h>
uint64_t timestamp_safe(void)
{
unsigned long tsc_low, tsc_high; // not uint32_t: saves a zero-extend for -m64 (but not x32 :/)
asm volatile("rdtsc" : "=d"(tsc_high), "=a" (tsc_low));
return ((uint64_t)tsc_high << 32) | tsc_low;
}
In 32bit code, it's just rdtsc/ret, but in 64bit code it does the necessary shift/or to get both halves into rax for the return value.
See it on the Godbolt compiler explorer.