Dealing with reserved register bits of an ARM chip - c

I am working with the registers of an ARM Cortex M3. In the documentation, some of the bits may be "reserved". It is unclear to me how I should deal with these reserved bits when writing to such registers.
Are these reserved bits even writeable? Should I be careful not to touch them? Will something bad happen if I do?

This is a classic embedded-world problem: what to do with reserved bits! First, you should NOT write random values into them, lest your code become unportable. What happens when the architecture assigns a meaning to those reserved bits in the future? Your code will break. So the best mantra when dealing with registers that have reserved bits is Read-Modify-Write: read the register contents, modify only the bits you want, then write the value back so that the reserved bits are untouched. ("Untouched" does not mean we don't write to them, but that we write back whatever was there before.)
For example, say there is a register in which only the LSBit has meaning and all the others are reserved. I would do this:
ldr r0,=memoryAddress   @ point r0 at the register
ldr r1,[r0]             @ read: current contents, reserved bits included
orr r1,r1,#1            @ modify: set only the LSB
str r1,[r0]             @ write back: reserved bits keep their old values
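The same pattern in C, for completeness (a sketch; MY_REG and its address are hypothetical stand-ins for whatever register you are dealing with):

#include <stdint.h>

#define MY_REG (*(volatile uint32_t *)0x40000000u)  /* hypothetical register address */

void set_lsb(void)
{
    uint32_t v = MY_REG;  /* read: current contents, reserved bits included */
    v |= 1u;              /* modify: touch only the bit we own */
    MY_REG = v;           /* write back: reserved bits keep their old values */
}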

If there is no other clue in the documentation, write a zero. You cannot avoid writing to reserved bits that are spread around a 32-bit register.

Read-Modify-Write should work most of the time; however, there are cases where reserved bits are undefined on read but must be written with a specific value. See this post from the LPC2000 group (the whole thread is quite interesting too). So always check the docs carefully, and also any errata that are available. When in doubt, or when the docs are unclear, don't hesitate to write to the manufacturer.

Ideally you should read-modify-write, but there is no guarantee of success: when you change to a newer chip with different bits, you are changing your code anyway. I have seen vendors where writing zeros to the reserved bits failed when they revved the chip, and the code had to be touched. So there are no guarantees. The biggest clue is when, in the vendor's example code, you see one register or set of registers that is clearly read-modify-write while another is clearly just a write. That could be different developers writing different sections of the example, or there could be a register in that peripheral that is sensitive, has an undocumented bit, and needs the read-modify-write.
On the chips that I work on, I make sure that bits which are undocumented (to the customer) but not unused are marked in some way to stand out from the other unused bits. We normally mark unused/reserved bits as zero, and these other bits get a name and a "must write this value" marking. Not all vendors do this.
The bottom line is there is no guarantee. Assume all documentation and example programs have bugs, and that you have to hack your way through to figure out what is right and what is wrong. No matter what path you take (read-modify-write, write zeros, etc.) you will be wrong from time to time and will have to redo the code to match a hardware change. I strongly suggest that if a vendor has a chip ID of some sort, your software reads that ID, and if it is an ID you have not tested your code against, declares a failure and refuses to program that part. In production testing, long before a customer sees the product, the part change will then be detected, and software will be involved in understanding the reason for it; the resolution is either that the alternate part is incompatible and rejected, or that the software changes, etc.
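A minimal sketch of that ID gate in C (the register address and ID values here are hypothetical; use whatever ID mechanism your vendor actually provides):

#include <stdint.h>

#define CHIP_ID_REG (*(volatile uint32_t *)0x40048024u)  /* hypothetical device-ID register */

static const uint32_t tested_ids[] = { 0x412FC231u };    /* revisions this code was validated on */

int chip_supported(void)
{
    uint32_t id = CHIP_ID_REG;
    for (unsigned i = 0; i < sizeof tested_ids / sizeof tested_ids[0]; i++)
        if (tested_ids[i] == id)
            return 1;
    return 0;  /* unknown silicon: declare a failure, don't program the part */
}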

Reserved most of the time means the bits aren't used on this chip, but they might be used on future devices (in another product line). (Most chip manufacturers produce one peripheral driver and use it for all their chips. That way it's mostly copy-paste work and there is less chance of errors.) Most of the time it doesn't matter if you write to reserved bits in peripheral registers, because there isn't any logic attached to them.
It is possible that whatever you write to them won't be stored, and the next time you read the register the bits will seem unchanged.

Related

ARM SVE: svld1(mask, ptr) vs svldff1(svptrue<>, ptr)

In ARM SVE there are masked load instructions, svld1, and there are also first-faulting loads, svldff1(svptrue<>).
Questions:
Does it make sense to do svld1 with a mask, as opposed to svldff1?
The behaviour of the mask in svldff1 seems confusing. Is there a practical reason to provide a mask other than svptrue to svldff1?
Is there any performance difference between svld1 and svldff1?
Both ldff1 and ld1 can be used to load a vector register. According to my informal tests on an AWS Graviton processor, I find no performance difference, in the sense that both instructions (ldff1 and ld1) seem to have roughly the same performance characteristics. However, ldff1 reads and writes the first-fault register (FFR). This implies that you cannot do more than one ldff1 at a time within an 'FFR group', since they are order-sensitive and depend crucially on the FFR.
Furthermore, the ldff1 instruction is meant to be used along with the rdffr instruction, which generates a mask indicating which loads were successful. Using the rdffr instruction will obviously add some cost: I assume it might need to run after ldff1w, increasing the latency by at least a cycle. And then, of course, you have to do something with the mask that rdffr produces...
Obviously, there is bound to be some small overhead tied to the FFR (clearing, setting, accessing).
"Is there a practical reason to provide a not just svptrue mask for svldff1": The documentation states that the leading inactive elements (up to the fault) are predicated to zero.

C99 "atomic" load in baremetal portable library

I'm working on a portable library for baremetal embedded applications.
Assume that I have a timer ISR that increments a counter, and that in the main loop this counter is read in a most-certainly-not-atomic load.
I'm trying to ensure load consistency (i.e. that I'm not reading garbage because the load was interrupted and the value changed) without resorting to disabling interrupts. It does not matter if the value changed after reading the counter as long as the read value is proper. Does this do the trick?
uint32_t read(volatile uint32_t *var){
    uint32_t value;
    do { value = *var; } while (value != *var);  /* retry until two consecutive reads agree */
    return value;
}
It's highly unlikely that there's any sort of portable solution for this, not least because plenty of C-only platforms are really C-only and use one-off compilers, i.e. nothing mainstream and modern-standards-compliant like gcc or clang. So if you're truly targeting entrenched C, it's all quite platform-specific and not portable, to the point where "C99" support is a lost cause. The best you can expect for portable C code is ANSI C support, referring to the very first non-draft C standard published by ANSI. That is still, unfortunately, the common denominator that major vendors get away with. I mean: Zilog somehow gets away with it, even though they are now but a division of Littelfuse, formerly a division of IXYS Semiconductor that Littelfuse had acquired.
For example, here are some compilers where there's only a platform-specific way of doing it:
Zilog eZ8 using a "recent" Zilog C compiler (anything 20 years old or less is OK): an 8-bit read-modify-write is atomic. 16-bit operations where the compiler generates word-aligned word instructions like LDWX, INCW, DECW are atomic as well. If the read-modify-write otherwise fits into 3 instructions or fewer, you'd prepend the operation with asm("\tATM");. Otherwise, you'd need to disable the interrupts: asm("\tPUSHF\n\tDI");, and subsequently re-enable them: asm("\tPOPF");.
Zilog ZNEO is a 16-bit platform with 32-bit registers; read-modify-write accesses on registers are atomic, but a memory read-modify-write usually round-trips via a register and takes 3 instructions - thus prepend the R-M-W operation with asm("\tATM");.
Zilog Z80 and eZ80 require wrapping the code in asm("\tDI") and asm("\tEI"), although this is valid only when it's known that interrupts are always enabled when your code runs. If they may not be enabled, there's a problem, since the Z80 does not allow reading the state of IFF1 - the interrupt enable flip-flop. So you'd need to save a "shadow" of its state somewhere and use that value to conditionally re-enable interrupts. Unfortunately, the eZ80 does not provide an interrupt controller register that would allow access to IEF1 (eZ80 uses the IEFn nomenclature instead of IFFn) - so this architectural oversight is carried over from the venerable Z80 to the "modern" one.
Those aren't necessarily the most popular platforms out there, and many people don't bother with Zilog compilers due to their fairly poor quality (low enough that yours truly had to write an eZ8-targeting compiler*). Yet such odd corners are the mainstay of C-only code bases, and library code has no choice but to accommodate this, if not directly then at least by providing macros that can be redefined with platform-specific magic.
E.g. you could provide empty-by-default macros MYLIB_BEGIN_ATOMIC(vector) and MYLIB_END_ATOMIC(vector) that would be used to wrap code requiring atomic access with respect to a given interrupt vector (or e.g. -1 for all interrupt vectors). Naturally, replace MYLIB_ with a "namespace" prefix specific to your library.
To enable platform-specific optimizations such as ATM vs. DI on "modern" Zilog platforms, an additional argument could be provided to the macro to separate the presumably "short" sequences, those the compiler is apt to generate three-instruction sequences for, from longer ones. Such micro-optimization usually requires an assembly output audit (easily automatable) to verify the assumption about instruction sequence length, but at least the data to drive the decision would be available, and the user would have the choice of using it or ignoring it.
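A minimal sketch of the scheme (the MYLIB_ names are the hypothetical ones proposed above; a platform port redefines the macros, e.g. with the Zilog asm() sequences shown earlier):

#include <stdint.h>

/* Defaults: no-ops for platforms where the access is already atomic. */
#ifndef MYLIB_BEGIN_ATOMIC
#define MYLIB_BEGIN_ATOMIC(vector)
#endif
#ifndef MYLIB_END_ATOMIC
#define MYLIB_END_ATOMIC(vector)
#endif

uint32_t mylib_read_u32(volatile uint32_t *var)
{
    uint32_t value;
    MYLIB_BEGIN_ATOMIC(-1);  /* -1: atomic with respect to all vectors */
    value = *var;
    MYLIB_END_ATOMIC(-1);
    return value;
}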
*If some lost soul wants to know anything bordering on the arcane re. eZ8 - ask away. I know entirely too much about that platform, in details so gory that even modern Hollywood CG and SFX would have a hard time reproducing the true depth of the experience on-screen. I'm also possibly the only one out there running the 20MHz eZ8 parts occasionally at 48MHz clock - as sure a sign of demonic possession as the multiverse allows. If you think it's outrageous that such depravity makes it into production hardware - I'm with you. Alas, business case is business case, laws of physics be damned.
Are you running on any systems that have uint32_t larger than a single assembly-instruction word read/write size? If not, the load from memory should be a single instruction and therefore atomic (assuming the bus is also word-sized...). You get in trouble when the compiler breaks it up into multiple smaller reads/writes. Otherwise, I've always had to resort to DI/EI. You could have the user configure your library so that it knows whether atomic instructions or a minimum 32-bit word size are available, to prevent interrupt twiddling. If you have these guarantees, you don't need the verification code.
To answer the question though: on a system that must split the reads/writes, your code is not safe. Imagine a case where you read your value correctly in the "do" part, but the value gets split during the "while" check. Further, in an extreme case, this is an infinite loop. For complete safety, you'd need a retry count and an error condition to prevent that. The loop case is extreme for sure, but I'd want the guard just in case; it does, of course, make the run time longer.
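A sketch of that guarded version (MAX_RETRIES is an arbitrary illustrative bound):

#include <stdint.h>
#include <stdbool.h>

#define MAX_RETRIES 8  /* arbitrary; tune to how often the value can change */

bool read_u32(volatile uint32_t *var, uint32_t *out)
{
    for (unsigned i = 0; i < MAX_RETRIES; i++) {
        uint32_t value = *var;
        if (value == *var) {  /* still not tear-proof, as the walkthrough below shows */
            *out = value;
            return true;
        }
    }
    return false;  /* report an error instead of looping forever */
}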
Let's show a failure case as an example - using 16-bit numbers on a machine that reads 8 bits at a time, to make it easier to follow:
Value to read from memory *var is 0x1234
Read 8-bit 0x12
*var becomes 0x5678
Read 8-bit 0x78 - value is now 0x1278 (invalid)
*var becomes 0x1234
Verification step reads 8-bit 0x12
*var becomes 0x5678
Verification reads 8-bit 0x78
Value confirmed "correct" as 0x1278, but this is an error, as *var was only ever 0x1234 or 0x5678.
Another failure case would be when *var just happens to change at the same frequency as your code is running, which could lead to an infinite loop as each verification fails. Even if it did eventually break out, this would be a very hard-to-track performance bug.

Why is "ldr pc, [pc, #imm]" in thumb unpredictable

The Thumb-2 reference specifies that LDR PC, [PC, #imm] (type 2) is unpredictable if the target address is not 4-byte aligned.
From my experience, on some processors this works perfectly fine, and on others it fails miserably (which is why it took me quite a while to trace the fault to this alignment issue).
So I was wondering if there's some real explanation for this (beyond "just don't do it").
With ARM, language like that often means that at some point, past or present, they had a specific core where it didn't work. So just don't do it. It may work perfectly well with your core. It may or may not have anything to do with the instruction set; they can always make that instruction work if they want to, aligned or not - it's just a matter of putting the gates down. Which is why it is most likely that one or more specific implementations had a problem and were already released before it was found.
In the old days with ARM - and it may still be true - they would put this language in when they had specifically implemented something that is in fact predictable, and use it as a way to see if you were using stolen code or whatever; to cover the what-if-you-cloned-an-ARM kind of thing. I think picoTurbo pretty much covered that and put it to bed. ARM's legal team makes short work of that now.
The program counter in particular is a bit messy, especially with pipelines; the "two ahead" thing is all synthesized now, and probably has been since Acorn days. It is just a bad idea in general to use the PC on the right side of the comma except for specific cases (PC-relative loads, jump tables, etc.), so you may see that kind of language with respect to the PC simply so they don't have to add the logic and clock cycles to make that instruction just work with the PC on the right. In this case (a PC-relative load), again, they probably have one or more implementations, cut and pasted from each other, that have a problem; or they made this rule for performance, gate count, or timing closure reasons. Timing closure: your design can only run as fast as the longest pole in the tent - the longest time a combinational signal takes to settle, covering variations in manufacturing, temperature, and other environmental factors, plus margin. So before tape-in you compute these, examine them, and decide: do we want to split this into two or more clocks? Is it tied to a specific feature? Do we want to just remove that feature? Repeat synthesis and timing closure until your expected max clock rate is at or above what you wanted for this product.
It could also be that they didn't trap the unaligned access in this case (not 4-byte aligned is an unaligned access), and they may not have properly implemented it, assuming it would be trapped - or who knows why. You can maybe try to test that by planting specific bytes on either side of the unaligned address, and then planting code at the combinations of addresses where execution might land. Unless you are a chip vendor you can't necessarily see this otherwise (if it doesn't trap); as a chip vendor you would be able to simulate it and see exactly what is happening - and of course you would have the code as well, and could see exactly why it doesn't work if you have an implementation that doesn't.
Looking in the early ARM ARM (ARMv4T/ARMv5T and some ARMv6), it is even more generic on LDR <Rd>, [<Rn>, <offset>]:
If the memory address is not word-aligned and no data abort occurs, the value written to the destination register is UNPREDICTABLE.
It doesn't even get into using the PC as one or more of the registers.
TL;DR: it is highly likely to be one of two things. 1) They have at least one core that has a bug, fixed in later cores of the same family or in other designs. 2) They have a design reason (often timing/performance) that made it undesirable to implement the unaligned access, so they allow it to produce garbage - likely not truly unpredictable, but not worth the lengthy explanation of what the result is, as it doesn't help you anyway.
Just because it worked on one core one time for you doesn't mean it always works; you could be getting lucky with the code and core in question. If you have access to the errata, you may find your answer - both the why and the fix. Thumb is supported on all ARM cores from ARMv4T to the present, and many of those cores are do-overs from scratch, so just because you find it in one errata document with a fix doesn't mean other designs didn't simply rely on the documentation saying "don't do this" and not bother to make it work.
The main reason (I think) is that instructions that load the PC or SP have side effects and are difficult to manage (efficiently) in the CPU. Compared with the original ARM instruction set, the newer instruction sets (including AArch64) restrict which instructions can have these side effects.

How do I determine the start and end of instructions in an object file?

So, I've been trying to write an emulator, or at least understand how stuff works. I have a decent grasp of assembly, particularly Z80 and x86, but I've never really understood how an object file (or in my case, a .gb ROM file) indicates the start and end of an instruction.
I'm trying to parse out the opcode for each instruction, but it occurred to me that it's not like there's a line break after every instruction. So how does this happen? To me, it just looks like a bunch of bytes, with no way to tell the difference between an opcode and its operands.
For most CPUs - and I believe Z80 falls in this category - the length of an instruction is implicit.
That is, you must decode the instruction in order to figure out how long it is.
If you're writing an emulator you don't really ever need to be able to obtain a full disassembly. You know what the program counter is now, you know whether you're expecting a fresh opcode, an address, a CB-page opcode or whatever, and you just deal with it. What people end up writing, in effect, is usually a per-opcode recursive-descent parser; a skeleton of one is sketched below.
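For instance, here is a Game Boy-style fetch/decode step in C (a sketch; only a few of the 256 opcodes shown). The point is that the length of an instruction is never stored anywhere - it falls out of decoding the opcode:

#include <stdint.h>

void step(const uint8_t *rom, uint16_t *pc)
{
    uint8_t op = rom[(*pc)++];          /* fetch the opcode byte */
    switch (op) {
    case 0x00:                          /* NOP      - 1 byte  */
        break;
    case 0x06: {                        /* LD B,d8  - 2 bytes */
        uint8_t d8 = rom[(*pc)++];
        (void)d8;                       /* would go into register B */
        break;
    }
    case 0xC3:                          /* JP a16   - 3 bytes */
        *pc = (uint16_t)(rom[*pc] | (rom[*pc + 1] << 8));
        break;
    case 0xCB: {                        /* CB prefix: a second opcode byte follows */
        uint8_t cb = rom[(*pc)++];
        (void)cb;
        break;
    }
    /* ...and so on for the remaining opcodes */
    }
}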
To get to a full disassembler, most people add some mild simulation, recursively tracking control flow. Instructions are found; what remains is deduced to be data.
Not so much on the GB, where storage was plentiful (by comparison) and piracy had a physical barrier, but on other platforms it was reasonably common to save space, or to effect disassembly-proof code, by writing code where a branch into the middle of an opcode would create a multiplexed second stream of operations, or where the same thing might be achieved by suddenly reusing valid data as valid code. One of Orlando's 6502 efforts even reused some of the loader text - regular ASCII - as decrypting code. That sort of stuff is very hard to crack because there's no simple assembly for it, and a disassembler therefore usually won't be able to figure out what to do heuristically. Conversely, on a suitably accurate emulator such code should just work exactly as it did originally.

AVR 8bit, C standard compliance regarding bit accessing of SFRs

One of my colleagues ran into some strange problems while programming an ATmega, related to accessing input/output ports.
Observing the problem, and after some research, I concluded that we should avoid accessing SFRs using operations which may compile to SBI or CBI instructions if we aim for safe, C-standard-compliant software. I am looking to find out whether this decision was justified, i.e. whether my concerns here are valid.
The datasheet of the Atmel processor is here, it's an ATMega16. I will refer to some pages of this document below.
I will refer to the C standard using the version found on this site under the WG14 N1256 link.
The SBI and CBI instructions of the processor operate at the bit level, accessing only the bit in question. So they are not true Read-Modify-Write (R-M-W) instructions, since, as I understand it, they do not perform a read (of the targeted 8-bit SFR).
On page 50 of the above datasheet, the first sentence begins All AVR ports have true Read-Modify-Write functionality..., while further on it specifies that this only applies to accesses with the SBI and CBI instructions, which technically are not R-M-W. The datasheet does not define what reading, for example, the PORTx registers is supposed to return (it does, however, indicate that they are readable). So I assumed that reading these SFRs is undefined (they might return the last thing written to them, or the current input state, or whatever).
On page 70 it lists some external interrupt flags; this is interesting because it is where the nature of the SBI and CBI instructions becomes important. The flags are set when an interrupt occurs, and they may be cleared by writing a one to them. So if SBI were a true R-M-W instruction, it would clear all three flags regardless of the bit specified in the opcode.
And now let's get into the matters of C.
The compiler itself is truly irrelevant; the only important fact is that it might use the CBI and SBI instructions in certain situations, which I think makes it non-compliant.
In the above-mentioned C99 standard, section 5.1.2.3 Program execution, points 2 and 3, refers to this (on page 13), as does 6.7.3 Type qualifiers, point 6 (on page 109). The latter mentions that What constitutes an access to an object that has volatile-qualified type is implementation-defined; however, a few phrases earlier it requires that any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine.
Also note that hardware ports such as that used in the example are declared volatile in the appropriate headers.
Example:
PORTA |= 1U << 6;
This is known to translate to an SBI, which implies that only a Write access happens on the volatile (PORTA) object. However, if one were to write:
var = 6;
...
PORTA |= 1U << var;
That would not translate to an SBI, even though it still sets only one bit (since SBI has the bit to set encoded in the opcode). So this will expand to a true R-M-W sequence, with a potentially different result than above (in the case of PORTA this is undefined behaviour, as far as I could deduce from the datasheet).
By the C standard, this behaviour might or might not be permitted. It is messy in that two things happen here which mix together. One, the more apparent, is the lack of the Read access in one of the cases. The other, less apparent, is how the Write is performed.
If the compiled code omits the Read, it might fail to trigger hardware behaviour that is tied to such an access. However, the AVR, as far as I know, has no such mechanism, so this might pass by the standard.
The Write is more interesting; however, it also involves the Read.
Omitting the Read in the case of SBI implies that the affected SFRs must all work like latches (or any bit not working like that must be tied to 0 or 1), so the compiler can be sure of what it would read from them if it actually did the access. If this were not the case, the compiler would be at least buggy. By the way, this also clashes with the fact that the datasheet does not define what is read from the PORTx registers.
How the Write is performed is also a source of inconsistency: the result is different depending on how the compiler compiles it (a CBI or SBI affecting only one bit, versus a byte write affecting all bits). So code written to clear or set one bit might either "work" (as in not "accidentally" clearing interrupt flags), or not, if the compiler produces a true R-M-W sequence instead.
Maybe these are technically permitted by the C standard (as "implementation-defined" behaviour, with the compiler deducing in these cases that the Read access to the volatile object is unnecessary), but I would at least consider it a buggy or inconsistent implementation.
Another example:
PORTA = PORTA | (1U << 6);
It is clearly visible that, normally, to conform with the standard, a Read and then a Write of PORTA should be carried out. But compiled to an SBI, it will lack the Read access, although as above this may pass for a mix of implementation-defined behaviour and the compiler deducing that the Read is unnecessary here. (Or was my assumption wrong that a |= b is identical to a = a | b?)
So based on these, I settled on avoiding this type of code, as it is (or may become) unclear how it will behave depending on whether the compiler uses SBI or CBI or a true R-M-W sequence.
To tell the truth, I mostly went after various forum posts etc. in resolving this, not analysing actual compiler output. Not my project, after all (and I am not at work now). I accepted, reading AVRFreaks for example, that AVR-GCC would output these instructions in the situations mentioned above, which alone may pose a problem, even if we wouldn't observe it with the compiler version we actually used. (However, I think this held in our case, as my suggestion to implement port accesses using shadow work variables fixed the problems my colleague observed - see the sketch below.)
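The shadow-variable approach, for reference (a sketch; it assumes PORTA is only ever written through these helpers, and it ignores interrupt atomicity):

#include <avr/io.h>
#include <stdint.h>

static uint8_t porta_shadow;  /* our copy of what PORTA should hold */

void porta_set_bit(uint8_t bit)
{
    porta_shadow |= (uint8_t)(1u << bit);
    PORTA = porta_shadow;  /* always a plain byte write, never SBI or an SFR read */
}

void porta_clear_bit(uint8_t bit)
{
    porta_shadow &= (uint8_t)~(1u << bit);
    PORTA = porta_shadow;
}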
Note: I edited the middle based on some research on the C (C99) standard.
Edit: Reading the AVR libc FAQ, I again found something which contradicts the automatic use of SBI or CBI. It is the last question and answer, where it specifically states that, since the ports are declared volatile, the compiler cannot optimize out the read access, according to the rules of the C language (as it phrases it).
I also understand that it is very unlikely that this particular behaviour (using SBI or CBI) would directly introduce bugs; but by masking "bugs" it may introduce very nasty ones in the long run, if someone accidentally generalizes from this behaviour while not understanding the AVR at the assembly level.
You should probably stop trying to apply the C memory model to I/O registers; they are not plain memory. In the case of the PORTn registers, it is in fact irrelevant whether an access is a single-bit write or an R-M-W operation, unless interrupts are mixed in. If you read-modify-write, an interrupt may alter state in between, causing a race condition - but that would be exactly the same issue for plain memory. The advantage of the SBI/CBI instructions there is that they are atomic.
The PORTn registers are readable, and also drive the output buffers. They are not different functions on read and write (as on PIC), but a normal register. (Newer PICs also make the output registers readable at the LAT addresses, precisely so you won't need a shadow variable.) Other SFRs, such as PINn or the interrupt flags, have more complicated behaviour. On recent AVRs, writing to PINn instead toggles bits in PORTn, which again is useful for its fast and atomic operation. Writing 1s to interrupt flag registers clears them, again to prevent race conditions.
The point is, these features are in place to produce correct behaviour for hardware-aware programs, even if some of it looks odd in C code (i.e. using reg = _BV(2); instead of reg &= ~_BV(2);). Precise compliance with the C standard is an impractical goal when the code is by its very nature hardware-specific (though semantic similarity does help, which the interrupt flag behaviour fails at). Wrapping the odd constructs in inline functions or macros with names that explain what they truly do is probably a good idea, or at least commenting on what the effects are. A set of such I/O routines could also form the basis of a hardware abstraction layer that may help you port code.
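As an example of such a wrapper on the ATmega16 (GIFR and INTF0 come from avr/io.h; the function name is mine):

#include <avr/io.h>

/* Clear only the INT0 flag. Note the plain write, not |=: since writing a 1
   clears a flag, GIFR |= _BV(INTF0) would read the register and also clear
   any other flag that happened to be set at that moment. */
static inline void clear_int0_flag(void)
{
    GIFR = _BV(INTF0);
}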
Trying to interpret the C specification strictly here is also rather confusing, as it doesn't admit to addressing bits (which is what SBI and CBI do), and digging through my old (1992) copy finds that volatile accesses may result in several implementation-defined behaviours, including the possibility of no access at all.
