Can you help me with code ('C' or ARM assembly) for marking a memory region as "Normal", thereby allowing unaligned memory access? I understand we need to enable the MMU before doing this. I'm new to the ARM architecture.
Thanks!
If all you need is unaligned access, try clearing SCTLR bit 1 (the A, alignment check, bit) in CP15:
mrc p15, 0, r0, c1, c0, 0
bic r0, r0, #2
mcr p15, 0, r0, c1, c0, 0
Note that the MMU does need to be on for this to help: with the MMU disabled, data accesses are treated as Strongly-ordered, and unaligned accesses to Strongly-ordered or Device memory fault regardless of the A bit. Unaligned access only works on Normal memory.
#define L1PointerTo2ndLevelPageTable(BASE_ADDRESS, P, DOMAIN) ((BASE_ADDRESS)<<10U | (P)<<9U | (DOMAIN)<<5U | 1U)
#define L1Section(BASE_ADDRESS, SBZ, NG, S, APX, TEX, AP, P, DOMAIN, XN, C, B) ((BASE_ADDRESS)<<20U | (SBZ)<<19U | 0U<<18U | (NG)<<17U | (S)<<16U | (APX)<<15U | (TEX)<<12U | (AP)<<10U | (P)<<9U | (DOMAIN)<<5U | (XN)<<4U | (C)<<3U | (B)<<2U | 2U)
#define L1SuperSection(BASE_ADDRESS, SBZ, NG, S, APX, TEX, AP, P, DOMAIN, XN, C, B) ((BASE_ADDRESS)<<24U | (SBZ)<<19U | 1U<<18U | (NG)<<17U | (S)<<16U | (APX)<<15U | (TEX)<<12U | (AP)<<10U | (P)<<9U | (DOMAIN)<<5U | (XN)<<4U | (C)<<3U | (B)<<2U | 2U)
unsigned long au32PageTableL1[4096U] __attribute__((aligned(0x4000)));
//extern unsigned long _endof_consts[4096U];
unsigned long u32EnableMMU()
{
    unsigned long u32Index_l;
    unsigned long sctlr;

    /* Create the L1 page table: one 1 MiB section per entry, flat-mapped. */
    for (u32Index_l = 0; u32Index_l < sizeof(au32PageTableL1)/sizeof(au32PageTableL1[0]); u32Index_l++)
    {
        au32PageTableL1[u32Index_l] = L1Section
        (
            u32Index_l, /* BASE_ADDRESS */
            0b0,        /* SBZ */
            0b0,        /* NG */
            0b0,        /* S */
            0b0,        /* APX */
            0b001,      /* TEX */
            0b11,       /* AP */
            0b0,        /* P */
            0,          /* DOMAIN */
            0b0,        /* XN */
            0b1,        /* C */
            0b1         /* B */
        );
    }
    /* Set TTBR0 to the page table base. */
    __asm("MCR p15, 0, %0, c2, c0, 0" : : "r" (au32PageTableL1));
    /* Set domain 0 to client access in the DACR; without this every access faults. */
    __asm("MCR p15, 0, %0, c3, c0, 0" : : "r" (1U));
    /* Invalidate the entire unified TLB. */
    __asm("MCR p15, 0, %0, c8, c7, 0" : : "r" (0U));
    /* Ensure all TLB maintenance operations complete before continuing. */
    __asm("dsb");
    /* Enable the MMU. */
    __asm("MRC p15, 0, %0, c1, c0, 0" : "=r" (sctlr));
    sctlr |= 1U;
    __asm("MCR p15, 0, %0, c1, c0, 0" : : "r" (sctlr));
    /* Flush the pipeline so following instructions execute with the MMU on. */
    __asm("isb");
    return 0;
}
Cortex-M3/M4/M7 support the LDREX and STREX assembler instructions, and with these CMSIS provides, for example, ATOMIC_MODIFY_REG, which ensures an atomic modification of a (u)int32_t (i.e. clear some bits and set some, maybe other, bits).
Now I thought there could equivalently be something like ATOMIC_INC and ATOMIC_DEC to atomically increment or decrement a (u)int32_t variable. But there isn't.
Is there something wrong with this idea? I could easily change ATOMIC_MODIFY_REG into ATOMIC_INC, but testing whether it really is atomic is not so easy.
I am using STM32CubeIDE, latest version.
Thanks for any help.
Edit: not sure anymore if ATOMIC_MODIFY_REG is really CMSIS. Here is the ATOMIC_MODIFY_REG I have in STM32CubeIDE:
/* Atomic 32-bit register access macro to clear and set one or several bits */
#define ATOMIC_MODIFY_REG(REG, CLEARMSK, SETMASK)                                \
    do {                                                                         \
        uint32_t val;                                                            \
        do {                                                                     \
            val = (__LDREXW((__IO uint32_t *)&(REG)) & ~(CLEARMSK)) | (SETMASK); \
        } while ((__STREXW(val, (__IO uint32_t *)&(REG))) != 0U);                \
    } while(0)
CMSIS 5.1 defines macros/functions for the LDREX/STREX functionality. They have variants for 'B'yte, 'H'alfword and 'W'ord, e.g. __LDREXH, __STREXB. CMSIS does not use them to implement atomic/lock-free primitives itself, but you can use them to implement your own.
| Instruction | CMSIS function                                     |
| ----------- | -------------------------------------------------- |
| LDREX       | uint32_t __LDREXW (uint32_t *addr)                 |
| LDREXH      | uint16_t __LDREXH (uint16_t *addr)                 |
| LDREXB      | uint8_t __LDREXB (uint8_t *addr)                   |
| STREX       | uint32_t __STREXW (uint32_t value, uint32_t *addr) |
| STREXH      | uint32_t __STREXH (uint16_t value, uint16_t *addr) |
| STREXB      | uint32_t __STREXB (uint8_t value, uint8_t *addr)   |
| CLREX       | void __CLREX (void)                                |
For single-core CPUs running bare-metal single-threaded code, there is no reason for an atomic increment (most drivers won't want one). For bit fields shared with interrupts, it can be important to do the read-modify-write atomically. If you have an RTOS with scheduling, the scheduler MUST perform a CLREX on a context switch. You can also use single-reader/single-writer structures such as a ring buffer/FIFO without resorting to atomics. However, I assume you have the knowledge and the need to have these primitives.
To confirm a 'C' equivalent, I would compare against either C++ atomics or the generated code. See godbolt for the assembler produced from this SO example:
create_id():
        ldr     r3, =ID
        dmb     ish
.L2:
        ldrex   r0, [r3]
        adds    r2, r0, #1
        strex   r1, r2, [r3]
        cmp     r1, #0
        bne     .L2
        dmb     ish
        bx      lr
This gives,

void atomic_inc(uint32_t *val)
{
    uint32_t tmp; /* declared here so it is in scope in the while condition */
    __DMB();
    do {
        tmp = __LDREXW(val);
        tmp++;
    } while (__STREXW(tmp, val) != 0U);
    __DMB();
}
That is a CMSIS-style 'C' implementation. Alternatively, I would just use the C++ atomics and expose them as extern "C". There is no overhead for object files that just use the C++ atomic primitives, and I would trust (and verify) the compiler; lots of smart people have worked on making it correct.
You probably don't need the __DMB() on non-SMP systems, but it is not that harmful unless performance is a must. As you can see, the loop could in principle retry forever; you may wish to insert a retry counter depending on your application space and design criteria.
I would like to ask how to write inline assembly for the Store-Conditional instruction in RISC-V. Below is some brief background (RISCV-ISA-Specification, page 40, section 7.2):
SC writes a word in rs2 to the address in rs1, provided a valid reservation still exists on that address. SC writes zero to rd on success or a nonzero code on failure.
The instruction that we will be focusing on is SC.D - store-conditional a 64-bit value. As shown on page 106 of RISCV-ISA-Specification, the instruction format is as follows:
00011 | aq<1> | rl<1> | rs2<5> | rs1<5> | 011 | rd<5> | 0101111
In order to use inline assembly to generate the corresponding code for the SC.D instruction, we need 3 registers. The register list can be found here.
Each register field of the instruction is 5 bits wide, so there are 32 general-purpose registers in RISC-V: x0, x1, ..., x31. Each register has an ABI (application binary interface) name; for instance, register x16 corresponds to the a6 register, so its 5-bit value is 10000.
I choose the following registers assignment:
rs2: a6 register (register x16, i.e. 0b10000)
rs1: a7 register (register x17, i.e. 0b10001)
rd: s4 register (register x20, i.e. 0b10100)
Hence, by filling in the corresponding register bits of the original instruction, we have the following:
00011 | aq<1> | rl<1> | 10000 | 10001 | 011 | 10100 | 0101111
The two bits aq and rl specify the ordering constraints (page 40 of RISCV-ISA-Specification):
If both the aq and rl bits are set, the atomic
memory operation is sequentially consistent and cannot be observed to happen before any earlier
memory operations or after any later memory operations in the same RISC-V hart, and can only be
observed by any other hart in the same global order of all sequentially consistent atomic memory
operations to the same address domain.
So we just set both bits to 1 since we want SC.D to be executed atomically. Now we have the final instruction bits:
00011 | 1 | 1 | 10000 | 10001 | 011 | 10100 | 0101111
-> 00011111|00001000|10111010|00101111
0x1f 0x08 0xba 0x2f
Since RISC-V uses little endian, the corresponding inline assembly can be generated by:
__asm__ volatile(".byte 0x2f, 0xba, 0x08, 0x1f");
There are also some other preparations like loading values into rs1(a7) and rs2(a6) registers. Therefore, I have the following code (but it did not work as expected):
/**
 * rs2: holds the value to be written. I pick a6 register.
 * rs1: holds the address to be written to. I pick a7 register.
 * rd: holds the return value of SC.D instruction. I pick s4 register.
 *
 * #src: the value to be written. rs2. a6 register
 * #dst: the address to be written to. rs1. a7 register
 * #rd: the value that holds the return value of SC.D
 */
static inline void sc(void *src, void *dst, uint64_t *rd) {
    uint64_t *tmp_src = (uint64_t *)src;
    uint64_t src_val = *tmp_src; // 13
    uint64_t dst_addr = (uint64_t)dst;
    uint64_t ret = 100;

    // first of all, need to prepare the registers a6 and a7.
    /* load value to be written into register a6 */
    __asm__ volatile("ld a6, %0" : : "m"(src_val));
    /* load the address to be written to into register a7 */
    __asm__ volatile("ld a7, %0" : : "m"(dst_addr));
    /* the actual SC.D: */
    __asm__ volatile(".byte 0x2f, 0xba, 0x08, 0x1f");
    // __asm__ volatile("sc.d s4, a6, (a7)"); // this does not work either.
    /* obtain the value in register s4 */
    __asm__ volatile("sd s4, %0" : "=m"(ret));
    *rd = ret;
    return;
}
int main() {
    uint64_t *src = malloc(sizeof(uint64_t));
    uint64_t *dst = malloc(sizeof(uint64_t));
    uint64_t rd = 20;
    *src = 13;
    *dst = 3;
    sc(src, dst, &rd); // write value 13 into #dst, so #dst should be 13 afterwards
    // the expected output should be "dst: 13, rd: 0"
    // What I get: "dst: 3, rd: 1"
    printf("dst: %ld, rd: %ld\n", *dst, rd);
    return 0;
}
The result does not seem to change the dst value. May I know which part I am doing wrong? Any hints would be appreciated.
I'm learning how to program STM32 Nucleo F446RE board using registers.
To know the position of a register, I take from datasheets the boundary address and the offset.
However, I cannot calculate the sum of them. Here is an example:

volatile uint32_t *GPIOA = 0x0;      // Initialization of the boundary address
GPIOA = (uint32_t*)0x40020000;       // Boundary address from datasheet
volatile uint32_t *GPIOA_ODR = 0x0;  // Initialization of GPIOA_ODR register
GPIOA_ODR = GPIOA + (uint32_t*)0x14; // Sum of the boundary address and the offset (i.e. 0x14)

The last line gives me an error, do you know how to calculate it correctly?
Thank you very much in advance.
It is wrong. If you want to use this extremely inconvenient way:

#define GPIOA      0x40020000UL
#define ODR_OFFSET 0x14UL
#define GPIO_ODR   (*(volatile uint32_t *)(GPIOA + ODR_OFFSET))

But why a #define and not a pointer variable? The #define is more compiler friendly and saves one memory read:

https://godbolt.org/z/LdLLVN

#define GPIOA      0x40020000UL
#define ODR_OFFSET 0x14UL
#define GPIO_ODR   (*(volatile uint32_t *)(GPIOA + ODR_OFFSET))

volatile uint32_t *pGPIO_ODR = (volatile uint32_t *)(GPIOA + ODR_OFFSET);

void foo(uint32_t x)
{
    GPIO_ODR = x;
}

void bar(uint32_t x)
{
    *pGPIO_ODR = x;
}

and the resulting code:

foo:
        ldr     r3, .L3
        str     r0, [r3, #20]
        bx      lr
.L3:
        .word   1073872896
bar:
        ldr     r3, .L6
        ldr     r3, [r3]
        str     r0, [r3]
        bx      lr
.L6:
        .word   .LANCHOR0
pGPIO_ODR:
        .word   1073872916
The cast should be outside the constant value; you are adding the GPIOA address and 0x14 to generate a new address. But beware of pointer arithmetic: GPIOA is a uint32_t *, so GPIOA + 0x14 advances by 0x14 elements (0x50 bytes), not 0x14 bytes. Cast to an integer type before adding the byte offset:
GPIOA_ODR = (uint32_t*)((uint32_t)GPIOA + 0x14);
I tried, but nothing changed. If I insert GPIOA_ODR = (uint32_t*)(0x40020000 + 0x14); it works, whereas if I insert GPIOA_ODR = (uint32_t*)(GPIOA + 0x14); it doesn't.
Any other ideas?
Thank you very much for the answer. The complete code I'm using is the following:
int main(int argc, char* argv[])
{
    /** RCC **/
    /* RCC */
    volatile uint32_t *RCC = 0x0;
    RCC = (uint32_t*)0x40023800;
    /* RCC_AHB1ENR: enable the GPIOA clock */
    volatile uint32_t *RCC_AHB1ENR = 0x0;
    RCC_AHB1ENR = (uint32_t*)(0x40023800 + 0x30);
    *RCC_AHB1ENR |= 0x1;

    /** GPIOA **/
    /* GPIOA */
    volatile uint32_t *GPIOA = 0x0;
    GPIOA = (uint32_t*)0x40020000;
    /* GPIOA_MODER: PA8 as output (MODER8 = 01) */
    volatile uint32_t *GPIOA_MODER = 0x0;
    GPIOA_MODER = (uint32_t*)(0x40020000 + 0x00);
    *GPIOA_MODER |= 1u << 16;
    *GPIOA_MODER &= ~(1u << 17);
    /* GPIOA_ODR */
    volatile uint32_t *GPIOA_ODR = 0x0;
    GPIOA_ODR = (uint32_t*)(GPIOA + 0x14);
    *GPIOA_ODR |= 1u << 8;
}
This code doesn't work correctly because of the line GPIOA_ODR = (uint32_t*)(GPIOA + 0x14);. If I insert GPIOA_ODR = (uint32_t*)(0x40020000 + 0x14); it works correctly.
I have to implement SPI communication between a microcontroller and another chip. The chip accepts a 16-bit word, but the abstraction library requires the data to be sent as two 8-bit bytes. Now I want to make a wrapper so I can easily create requests for read and write... but I have not had any success yet. Here is how it is supposed to be:
The table below shows the 16 bits. The MSB can be 0 for write or 1 for read, the address can be from 0x0 to 0x7, and the data is 11 bits.
R/W | ADDRESS | DATA
B15 | B14-B11 | B10-B0
0 | 0000 | 00000000000
W0 | A3, A2, A1, A0 | D10, D9, D8, D7, D6, D5, D4, D3, D2, D1, D0
For example, if I want to read from register 0x1 I think I have to set the bits like this:
W0 | A3, A2, A1, A0 | D10, D9, D8, D7, D6, D5, D4, D3, D2, D1, D0
1 | 0 0 0 1 | 0 0 0 0 0 0 0 0 0 0 0
Or reading from register 0x7:
W0 | A3, A2, A1, A0 | D10, D9, D8, D7, D6, D5, D4, D3, D2, D1, D0
1 | 0 1 1 1 | 0 0 0 0 0 0 0 0 0 0 0
I have tried to create this struct/union to see if it can work:
typedef struct {
    uint8_t acc_mode:1;
    uint8_t reg_addr:4;
    uint8_t reg_data:8; //TODO fix me should be 11
} DRVStruct;

typedef union {
    DRVStruct content;
    uint16_t all;
} DRVUnion;

void DRV_PrepareReadMsg(uint8_t reg, uint8_t* msgBuffer) {
    DRVUnion temp;
    temp.content.acc_mode = 1;
    temp.content.reg_addr = reg;
    temp.content.reg_data = 0; //read mode does not need data!
    msgBuffer[1] = temp.all & 0xFF;
    msgBuffer[0] = temp.all >> 8;
}
I am getting strange results... from time to time I get an answer from the SPI (I am sure the SPI communication itself is OK; my code for preparing the messages is the problem).
So the questions are:
Am I taking the right approach?
How can I increase the bit width of reg_data from 8 to 11 without getting a compile error?
What do you suggest as a better approach?
This seems to work:

#include <stdio.h>
#include <stdint.h>

typedef union {
    struct { // no struct tag, since it is not needed...
        uint16_t acc_mode:1;
        uint16_t reg_addr:4;
        uint16_t reg_data:11; // 11 bits fit now that the base type is uint16_t
    } bits;
    uint16_t all;
    uint8_t bytes[2]; // extra bonus when little-endian ;-)
} DRVUnion;

int main(void)
{
    DRVUnion uni, uni13[13];
    printf("Size=%zu, %zu\n", sizeof uni, sizeof uni13);
    return 0;
}
For x64 I can use this:
{
    uint64_t hi, lo;

    // hi:lo = 64-bit x 64-bit multiply of c[0] and b[0]
    __asm__("mulq %3\n\t"
            : "=d" (hi),
              "=a" (lo)
            : "%a" (c[0]),
              "rm" (b[0])
            : "cc");

    a[0] += hi;
    a[1] += lo;
}
But I'd like to perform the same calculation portably. For instance to work on x86.
As I understand the question, you want a portable, pure-C implementation of 64-bit multiplication with the 128-bit result stored in two 64-bit values. In that case, this article purports to have what you need. That code is written for C++; it doesn't take much to turn it into C code:
void mult64to128(uint64_t op1, uint64_t op2, uint64_t *hi, uint64_t *lo)
{
    uint64_t u1 = (op1 & 0xffffffff);
    uint64_t v1 = (op2 & 0xffffffff);
    uint64_t t = (u1 * v1);
    uint64_t w3 = (t & 0xffffffff);
    uint64_t k = (t >> 32);

    op1 >>= 32;
    t = (op1 * v1) + k;
    k = (t & 0xffffffff);
    uint64_t w1 = (t >> 32);

    op2 >>= 32;
    t = (u1 * op2) + k;
    k = (t >> 32);

    *hi = (op1 * op2) + w1 + k;
    *lo = (t << 32) + w3;
}
Since you have gcc as a tag, note that you can just use gcc's 128-bit integer type:
typedef unsigned __int128 uint128_t;
// ...
uint64_t x, y;
// ...
uint128_t result = (uint128_t)x * y;
uint64_t lo = result;
uint64_t hi = result >> 64;
The accepted solution isn't really the best solution, in my opinion.
It is confusing to read.
It has some funky carry handling.
It doesn't take advantage of the fact that 64-bit arithmetic may be available.
It displeases ARMv6, the God of Absolutely Ridiculous Multiplies. Whoever uses UMAAL shall not lag but have eternal 64-bit to 128-bit multiplies in 4 instructions.
Joking aside, it is much better to optimize for ARMv6 than any other platform because it will have the most benefit. x86 needs a complicated routine and it would be a dead end optimization.
The best way I have found (and used in xxHash3) is this, which takes advantage of multiple implementations using macros:
It is a tiny bit slower than mult64to128 on x86 (by 1-2 instructions), but a lot faster on ARMv6.
#include <stdint.h>
#ifdef _MSC_VER
#  include <intrin.h>
#endif

/* Prevents a partial vectorization from GCC. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__i386__)
__attribute__((__target__("no-sse")))
#endif
static uint64_t multiply64to128(uint64_t lhs, uint64_t rhs, uint64_t *high)
{
    /*
     * GCC and Clang usually provide __uint128_t on 64-bit targets,
     * although Clang also defines it on WASM despite having to use
     * builtins for most purposes - including multiplication.
     */
#if defined(__SIZEOF_INT128__) && !defined(__wasm__)
    __uint128_t product = (__uint128_t)lhs * (__uint128_t)rhs;
    *high = (uint64_t)(product >> 64);
    return (uint64_t)(product & 0xFFFFFFFFFFFFFFFF);

    /* Use the _umul128 intrinsic on MSVC x64 to hint for mulq. */
#elif defined(_MSC_VER) && defined(_M_X64)
#   pragma intrinsic(_umul128)
    /* This intentionally has the same signature. */
    return _umul128(lhs, rhs, high);
#else
    /*
     * Fast yet simple grade-school multiply that avoids
     * 64-bit carries with the properties of multiplying by 11
     * and takes advantage of UMAAL on ARMv6 to only need 4
     * calculations.
     */

    /* First calculate all of the cross products. */
    uint64_t lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t hi_hi = (lhs >> 32)        * (rhs >> 32);

    /* Now add the products together. These will never overflow. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
    uint64_t upper = (hi_lo >> 32) + (cross >> 32)        + hi_hi;

    *high = upper;
    return (cross << 32) | (lo_lo & 0xFFFFFFFF);
#endif /* portable */
}
On ARMv6, you can't get much better than this, at least on Clang:
multiply64to128:
        push    {r4, r5, r11, lr}
        umull   r12, r5, r2, r0
        umull   r2, r4, r2, r1
        umaal   r2, r5, r3, r0
        umaal   r4, r5, r3, r1
        ldr     r0, [sp, #16]
        mov     r1, r2
        strd    r4, r5, [r0]
        mov     r0, r12
        pop     {r4, r5, r11, pc}
The accepted solution generates a bunch of adds and adc, as well as an extra umull in Clang due to an instcombine bug.
I further explain the portable method in the link I posted.