Issue with global variable while making 32-bit counter - c

I am trying to do quadrature decoding on an Atmel XMEGA AVR microcontroller. The XMEGA has only 16-bit counters, and I have already used up all the available timers.
To make a 32-bit counter, I use one 16-bit counter and, in its overflow/underflow interrupt, increment or decrement a 16-bit global variable, so that the two combined form a 32-bit counter.
ISR(timer_16bit)
{
    if (quad_enc_mov_forward)
    {
        timer_over_flow++;
    }
    else if (quad_enc_mov_backward)
    {
        timer_over_flow--;
    }
}
So far it is working fine. But I need to use this 32-bit value in various tasks running in parallel. I'm trying to read the 32-bit value as below:
uint32_t current_count = timer_over_flow;
current_count = current_count << 16;
current_count = current_count + timer_16bit_count;
`timer_16bit_count` is a hardware register.
The problem I am facing is that between reading timer_over_flow into current_count in the first statement and adding timer_16bit_count, the 16-bit timer may overflow and wrap to zero. This can produce a completely wrong value.
I am also trying to read this 32-bit value in multiple tasks.
Is there a way to prevent this data corruption and get a working 32-bit value?
Details sought by different members:
My motor can move forward or backward and accordingly counter increments/decrements.
Regarding the ISR: before starting my motor I set the global variables (quad_enc_mov_forward and quad_enc_mov_backward) so that on an overflow/underflow timer_over_flow changes accordingly.
Variables that are modified in the ISR are declared as volatile.
Multiple tasks means that I'm using an RTOS kernel with about 6 tasks (mostly 3 tasks running in parallel).
On the XMEGA I'm directly reading the TCC0_CNT register for the lower 16 bits.

One solution is:
uint16_t a, b, c;
do {
    a = timer_over_flow;
    b = timer_16bit_count;
    c = timer_over_flow;
} while (a != c);
uint32_t counter = (uint32_t) a << 16 | b;
Per comment from user5329483, this must not be used with interrupts disabled, since the hardware counter fetched into b may be changing while the interrupt service routine (ISR) that modifies timer_over_flow would not run if interrupts are disabled. It is necessary that the ISR interrupt this code if a wrap occurs during it.
This gets the counters and checks whether the high word changed. If it did, this code tries again. When the loop exits, we know the low word did not wrap during the reads. (Unless there is a possibility we read the high word, then the low word wrapped, then we read the low word, then it wrapped the other way, then we read the high word. If that can happen in your system, an alternative is to add a flag that the ISR sets when the high word changes. The reader would clear the flag, read the timer words, and read the flag. If the flag is set, it tries again.)
Note that timer_over_flow, timer_16bit_count, and the flag, if used, must be volatile.
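A variant of the flag idea can be sketched as follows. Instead of a flag that the reader clears, this sketch uses a change counter that only the ISR writes, which avoids one reader task clearing a flag that another task is still relying on. This is a minimal sketch under assumptions, not tested on hardware: `overflow_events` is a hypothetical name, the ISR is assumed to increment it every time it changes timer_over_flow, and timer_16bit_count is declared here as an ordinary volatile variable standing in for the hardware count register so the fragment compiles off-target.

```c
#include <stdint.h>

/* Stand-ins so the sketch compiles off-target. On the real part,
   timer_16bit_count is the hardware count register. */
volatile uint16_t timer_over_flow;
volatile uint16_t timer_16bit_count;

/* Hypothetical: the ISR does overflow_events++ whenever it changes
   timer_over_flow. Only the ISR writes it; readers only compare it. */
volatile uint8_t overflow_events;

uint32_t read_counter32(void)
{
    uint8_t  before;
    uint16_t high, low;

    do {
        before = overflow_events;        /* snapshot the change counter  */
        high   = timer_over_flow;
        low    = timer_16bit_count;
    } while (before != overflow_events); /* ISR ran during the reads     */

    return ((uint32_t)high << 16) | low;
}
```

As with the loop above, this must run with interrupts enabled so that the ISR can preempt the reads.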
If the wrap-two-times scenario cannot happen, then you can eliminate the loop:
Read a, b, and c as above.
Compare b to 0x8000.
If b has a high value, either there was no wrap, it was read before a wrap upward (0xffff to 0), or it was read after a wrap downward. Use the lower of a or c.
Otherwise, either there was no wrap, b was read after a wrap upward, or it was read before a wrap downward. Use the larger of a or c.
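The selection rule above could be sketched as a pure function. This is a sketch under assumptions: `combine_counter` is a hypothetical name, and the "lower"/"larger" comparison is done modulo 2^16 (an addition not in the description above) so that it still picks the right word when timer_over_flow itself wraps between 0xFFFF and 0.

```c
#include <stdint.h>

/* a: high word read first, b: hardware count, c: high word read last. */
uint32_t combine_counter(uint16_t a, uint16_t b, uint16_t c)
{
    /* Modular comparison: c counts as larger when it is 1..0x7FFF ahead of a. */
    uint16_t diff = (uint16_t)(c - a);
    int c_is_larger = (diff != 0u) && (diff < 0x8000u);
    uint16_t lower  = c_is_larger ? a : c;
    uint16_t larger = c_is_larger ? c : a;

    /* High b: read before an upward wrap or after a downward one, so the
       lower high word belongs to it. Low b: the opposite cases. */
    uint16_t high = (b >= 0x8000u) ? lower : larger;

    return ((uint32_t)high << 16) | b;
}
```

For example, with a = 5 and c = 6 (an upward wrap happened during the reads), b = 0xFFF0 pairs with 5 and b = 0x0010 pairs with 6.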

The #1 fundamental embedded systems programming FAQ:
Any variable shared between the caller and an ISR, or between different ISRs, must be protected against race conditions. To prevent some compilers from doing incorrect optimizations, such variables should also be declared as volatile.
Those who don't understand the above are not qualified to write code containing ISRs. Or programs containing multiple processes or threads for that matter. Programmers who don't realize the above will always write very subtle, very hard-to-catch bugs.
Some means to protect against race conditions could be one of these:
Temporarily disabling the specific interrupt during access.
Temporarily disabling all maskable interrupts during access (crude way).
Atomic access, verified in the machine code.
A mutex or semaphore. On single-core MCUs where interrupts cannot be interrupted in turn, you can use a bool as "poor man's mutex".
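The first option, for a plain variable shared with one ISR, might look like the sketch below. Everything hardware-specific is an assumption: `shared_irq_disable()` and `shared_irq_enable()` are hypothetical hooks that on a real part would clear and set the enable bit of the one interrupt source that writes the variable; here they only track state so the fragment can be compiled and run off-target.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical hooks: stand-ins for clearing/setting the interrupt-enable
   bit of the one interrupt that writes shared_ticks. */
bool irq_enabled = true;
static void shared_irq_disable(void) { irq_enabled = false; }
static void shared_irq_enable(void)  { irq_enabled = true; }

/* Written by the ISR, read by task code. */
volatile uint32_t shared_ticks;

uint32_t read_shared_ticks(void)
{
    uint32_t copy;

    shared_irq_disable();   /* the ISR cannot run between the byte loads    */
    copy = shared_ticks;    /* a multi-instruction read on an 8-bit CPU     */
    shared_irq_enable();

    return copy;
}
```

The sketch re-enables the interrupt unconditionally; code that may be called with the interrupt already disabled would need to save and restore the previous state instead.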

Just reading TCC0_CNT in multithreaded code is a race condition if you do not handle it correctly. Check the section on reading 16-bit registers in the XMEGA manual. You should read the lower byte first (the compiler will probably handle this transparently for you). When the lower byte is read, the higher byte is (atomically) copied into the TEMP register. Reading the high byte then reads the TEMP register, not the counter. This ensures an atomic read of the 16-bit value, but only if nothing accesses the TEMP register between the low-byte and high-byte reads.
Note that this TEMP register is shared between all counters, so a context switch at the right (wrong) moment will probably trash its content and therefore your high byte. You need to disable interrupts for this 16-bit read. Because the XMEGA will execute one instruction after the sei with interrupts still disabled, the best way is probably:
cli
ld [low_byte]
sei
ld [high byte]
It disables interrupts for four CPU cycles (if I counted it correctly).
An alternative would be to save the shared TEMP register(s) on each context switch. It is possible (not sure if likely) that your OS already does this, but be sure to check. Even so, you need to make sure a colliding access does not occur from an ISR.
This precaution should be applied to any 16bit register read in your code. Either make sure TEMP register is correctly saved/restored (or not used by multiple threads at all) or disable interrupts when reading/writing 16bit value.

This problem is indeed a very common and very hard one. All solutions to it will have a caveat regarding timing constraints in the lower-priority layers. To clarify: the highest-priority function in your system is the hardware counter; its response time defines the maximum frequency that you can ultimately sample. The next lower priority in your solution is the interrupt routine, which tries to keep track of bit 2^16, and the lowest is your application-level code, which tries to read the 32-bit value. The question now is whether you can quantify the shortest time between two level changes on the A and B inputs of your encoder. The shortest time usually occurs not at the highest speed at which your real-world axis rotates but when halting at a position: through minimal vibrations the encoder can swing back and forth between two increments, thereby producing e.g. a falling and a rising edge on the same encoder output in short succession. Iff (if and only if) you can guarantee that your interrupt processing time is shorter (by a margin) than this minimal time, you can use such a method to virtually extend the coordinate range of your encoder.

Related

When is a Cortex write to a device realised

When writing to device registers on a Cortex M0 (in my case, on an STM32L073), a question arises as to how careful one should be in a) ordering accesses to device memory and b) deciding that a change to a peripheral configuration has actually completed to the point that any dependencies become valid.
Taking a specific example to change the internal voltage regulator to a different voltage. You write the change to PWR->CR and read the status from PWR->CSR. I see code that does something like this:
Write to PWR->CR to set the voltage range
Spin until (PWR->CSR & voltage flag) becomes zero
In my mind there are three issues here:
Access ordering. This is Device Memory so transaction order is preserved relative to other Device access transactions. I would assume this means a DSB is not required between the write to CR and the read from CSR. A linked question and the answer to this is: [ARM CortexA]Difference between Strongly-ordered and Device Memory Type
Device memory can be buffered. Is there a possibility that a write to CR could still be in progress when the read from CSR occurs? This would mean that the voltage flag would be clear and the code would proceed, when in fact the flag hasn't gone high yet!
Hardware response time. Is there a latency between the write and the effects becoming final? In actuality this should always be documented - for the STM32 the docs definitively say that the flag is set when the CR register changes.
Are there any race condition possibilities here? It's really the buffering that worries me - that a peripheral write is still in progress when a peripheral read takes place.
Access ordering.
Accesses are strongly ordered and you do not need barrier instructions to read back the same register.
Device memory can be buffered. Is there a possibility that a write to CR
Yes, it is possible. But it is not because of buffering but because of the bus propagation time. It may take several clocks before a particular operation will go through all bridges.
Hardware response time. Is there a latency between the write and the
effects becoming final
Even if there is latency, it is not important from your point of view. If you set a bit in the CR register and wait for the result in the status register, simply wait for the status bit to have the expected value.
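The write-then-poll sequence from the question could be sketched like this. It is a sketch under assumptions rather than vendor code: `pwr_regs_t` is a hypothetical stand-in for the PWR register block so the fragment compiles off-target, the bit positions are assumed and should be checked against the reference manual, and the `max_polls` limit is an addition so that a flag that never clears cannot hang the caller.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PWR register block; on the real part use
   the PWR definition from the vendor header instead. */
typedef struct {
    volatile uint32_t CR;
    volatile uint32_t CSR;
} pwr_regs_t;

/* Assumed bit positions; confirm against the reference manual. */
#define VOS_MASK  (3u << 11)
#define VOSF_FLAG (1u << 4)

/* Returns true once the voltage flag reads zero, false if max_polls
   reads all saw it set. */
bool set_voltage_range(pwr_regs_t *pwr, uint32_t vos_bits, uint32_t max_polls)
{
    /* Write the new range; volatile keeps this store ahead of the reads. */
    pwr->CR = (pwr->CR & ~VOS_MASK) | (vos_bits & VOS_MASK);

    while (max_polls-- > 0u) {
        if ((pwr->CSR & VOSF_FLAG) == 0u) {
            return true;
        }
    }
    return false;
}
```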

Why NOP/few extra lines of code/optimization of pointer aliasing helps? [Fujitsu MB90F543 MCU C code]

I am trying to fix a bug found in a mature program for the Fujitsu MB90F543. The program has worked for nearly 10 years, but it was discovered that under some special circumstances it fails to do two things at its very beginning. One of them is crucial.
After low- and high-level initialization (ports, pins, peripherals, IRQ handlers), configuration data is read over SPI from EEPROM and the status LEDs are turned on for a moment (to turn them on, data is sent over SPI to an LED driver).
When those special circumstances occur, the first (and only the first) function that performs just a few EEPROM reads fails, and additionally a few of the LEDs that should turn on do not.
The program is written in C and compiled using Softune v30L32.
Surprisingly, adding a single __asm(" NOP ") in the low-level hardware init is sufficient to make the program work as expected under the mentioned circumstances. Turning off 'Control optimization of pointer aliasing' in the Optimization settings is also sufficient. Adding just a few lines of code in various places helps too.
I have compared (DIFFed) ASM listings of compiled program for a version with and without __asm(" NOP ") and with both aforementioned optimizer settings and they all look just fine.
The only warning Softune compiler has been printing for years during compilation is as follows:
*** W1372L: The section is placed outside the RAM area or the I/O area (IOXTND)
I do realize it's rather general question, but maybe someone who has a bigger picture will be able to point out possible cause.
Have you got an idea what may cause such a weird behaviour? How to locate the bug and fix it?
During the initialization a few long (about 20 ms) delay loops are used. They don't help, even though they were increased from about 2 ms, yet a single NOP on any line of the hardware initialization function, or even before or after the function, helps.
Both wait loops work. I have checked this using an oscilloscope (I added an LED turn-on before and a turn-off after).
I have checked the timing hypothesis by slowing down the SPI clock from 1 MHz to 500 kHz. It does not change anything. Slowing down to 250 kHz causes watchdog resets, as some parts of the code take too long to execute (>25 ms).
One more thing: I have observed that adding local variables in any source file sometimes makes the problem disappear or reappear. The same goes for initializing uninitialized local variables. Adding a few extra lines of code in any of the files either hides or reveals the problem.
void main(void)
{
    watchdog_init();
    // waiting for power supply to stabilize
    wait; // about 45ms
    hardware_init();
    clear_watchdog();
    application_init();
    clear_watchdog();
    wait; // about 20ms
    test_LED();
    {...}
}
void hardware_init (void)
{
    __asm("NOP"); // why does this help? It can be on any line of the function
    io_init(); // ports initialization
    clk_init();
    timer_init();
    adc_init();
    spi_init();
    LED_init();
    spi_start();
    key_driver_init();
    can_init();
    irq_init(); // set IRQ priorities and global IRQ enable
}
Could be one of many things but two spring to mind.
Timing.
Maybe the wait is not long enough for power to stabilize and not everything is synced to the clock. The NOP gets everything back in sync.
Alignment.
Perhaps the NOP gets your instructions aligned on a 32- or 64-bit boundary expected by the hardware (we used to do this a lot in mainframe assemblers, as I/O operations often expected things to be on double-word boundaries).
The problem was solved. It was caused by a trivial bug.
The EEPROM's nHOLD and nCS signals were not initialized immediately after the MCU's reset, but only just before the first use of the EEPROM. As a result they were 0, and therefore active.
This means the EEPROM was selected but waiting on hold. Meanwhile another SPI transfer started. After 6 of the 8 CLK pulses, the EEPROM's nHOLD I/O pin was initialized and driven high. The EEPROM was no longer on hold, so it clocked in the last two bits of data meant for another peripheral. Every subsequent operation on the EEPROM found it out of sync with CLK and MOSI.
When I added a NOP or anything else, the nHOLD 0->1 edge shifted to after the last CLK pulse, so CLK and MOSI were in sync.
All I had to do was initialize all of the EEPROM's SPI lines, in particular nHOLD and nCS, right after the MCU reset.

create a small delay in a Linux interrupt handler

I'm working on an interrupt handler with a hardware design group and we're trying to figure out where a bug is. I'm reading a chip over the SPI bus at 5 kHz. The chip loads 4 bytes and triggers a data-ready pin.
My interrupt handler wakes up, reads 4 bytes off the SPI bus and stores the data in a buffer. Strangely enough, though, every 17th read gives 4 bytes of all 0's, which is not right. One of the options we're exploring is that the chip isn't always actually ready when it sends the data-ready signal.
So, I know I can't sleep in an interrupt handler, but I'd like to try and introduce a delay of 10 or 20 microseconds. Right now I have a for loop which counts to 100,000 then processes the interrupt. I haven't seen any changes, so I thought I might see if someone has a better technique for busy waiting. Or at least a better way of figuring out how many loop iterations I should go through, as I'm not sure how long this takes, or if the compiler is simply optimizing out the whole thing.
I don't know if you have access to any pseudorandom number generation libraries on your embedded device, but doing large-number multiplication followed by mod will definitely take some cycles. Instead of simply adding 1 (which is very fast at the hardware level, and the compiler can optimize it to shifting since you're doing it a static number of times), use a random number seed (does the system have access to a time clock?) if available and do large-number multiplication, modulus or factorial operations; negative-number division also takes forever. Remember, division takes the longest at the hardware level. Use that to your advantage.
I assume your compiler will strip out a simple loop.
You should use volatile.
volatile unsigned long i;
for (i = 0; i < 1000000; i++)
    continue;
I also assume that this will not remove the problem or help you.
I can't believe that an SPI peripheral has such a bug.
But it's possible that you read the data from the SPI FIFO too slowly,
so some of the received data is dropped.
You should check the error flags of the SPI module, as well as its RX-empty and RX-full flags.

Why is disabling interrupts necessary here?

static void RadioReleaseSPI(void) {
    __disable_interrupt();
    spiTxRxByteCount &= ~0x0100;
    __enable_interrupt();
}
I understand that multiple tasks may attempt to use the SPI resource. spiTxRxByteCount is a global variable used to keep track of whether the SPI is currently in use by another task. When a task requires the SPI it can check the status of spiTxRxByteCount to see if the SPI is being used. When a task is done using the SPI it calls this function and clears the bit, to indicate that the SPI is now free. But why disable the interrupts first and then re-enable them after? Just paranoia?
The &= will do a read-modify-write operation - it's not atomic. You don't want an interrupt changing things in the middle of that, resulting in the write over-writing with an incorrect value.
You need to disable interrupts to ensure atomic access. You don't want any other process to access and potentially modify that variable while you're reading it.
From Introduction to Embedded Computing:
The Need for Atomic Access
Imagine this scenario: foreground program, running on an 8-bit uC,
needs to examine a 16-bit variable, call it X. So it loads the high
byte and then loads the low byte (or the other way around, the order
doesn’t matter), and then examines the 16-bit value. Now imagine an
interrupt with an associated ISR that modifies that 16-bit variable.
Further imagine that the value of the variable happens to be 0x1234 at
a given time in the program execution. Here is the Very Bad Thing
that can happen:
foreground loads high byte (0x12)
ISR occurs, modifies X to 0xABCD
foreground loads low byte (0xCD)
foreground program sees a 16-bit value of 0x12CD.
The problem is that a supposedly indivisible piece of data, our
variable X, was actually modified in the process of accessing it,
because the CPU instructions to access the variable were divisible.
And thus our load of variable X has been corrupted. You can see that
the order of the variable read does not matter. If the order were
reversed in our example, the variable would have been incorrectly read
as 0xAB34 instead of 0x12CD. Either way, the value read is neither
the old valid value (0x1234) nor the new valid value (0xABCD).
Writing ISR-referenced data is no better. This time assume that the
foreground program has written, for the benefit of the ISR, the
previous value 0x1234, and then needs to write a new value 0xABCD. In
this case, the VBT is as follows:
foreground stores new high byte (0xAB)
ISR occurs, reads X as 0xAB34
foreground stores new low byte (0xCD)
Once again the code (this time the ISR) sees neither the previous
valid value of 0x1234, nor the new valid value of 0xABCD, but rather
the invalid value of 0xAB34.
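The torn read described in the quoted passage can be reproduced in a small simulation. This is only an illustration that runs on any host: `torn_read` is a hypothetical name, and the "ISR" is simulated by an ordinary assignment placed between the two byte loads.

```c
#include <stdint.h>

/* Simulates an 8-bit CPU loading a 16-bit variable in two steps, with an
   ISR rewriting the variable between the two byte loads. */
uint16_t torn_read(uint16_t before, uint16_t after_isr)
{
    uint16_t x = before;

    uint8_t high = (uint8_t)(x >> 8);     /* foreground loads high byte */
    x = after_isr;                        /* ISR fires and modifies X   */
    uint8_t low = (uint8_t)(x & 0xFFu);   /* foreground loads low byte  */

    return (uint16_t)(((uint16_t)high << 8) | low);
}
```

Calling `torn_read(0x1234, 0xABCD)` yields 0x12CD, which is neither the old nor the new value.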
While spiTxRxByteCount &= ~0x0100; may look like a single instruction in C, it is actually several instructions to the CPU. Compiled in GCC, the assembly listing looks like so:
57:atomic.c **** spiTxRxByteCount &= ~0x0100;
68 .loc 1 57 0
69 004d A1000000 movl _spiTxRxByteCount, %eax
69 00
70 0052 80E4FE andb $254, %ah
71 0055 A3000000 movl %eax, _spiTxRxByteCount
71 00
If an interrupt comes in in-between any of those instructions and modifies the data, your first ISR can potentially read the wrong value. So you need to disable interrupts before you operate on it and also declare the variable volatile.
There are two reasons for why you should be disabling interrupts:
The &= is a read-modify-write operation, which is by nature not atomic. It consists of a read, a bitwise AND, and a write. You don't want this operation to be interrupted by an ISR (interrupt service routine). The ISR could modify spiTxRxByteCount after the read and before the write. The write would then be based on an outdated value and you would lose information.
__disable_interrupt() and __enable_interrupt() serve as software barriers. Even if optimization is enabled, the compiler must not move the read or the write across the two barriers. Also, the compiler must not cache the value of spiTxRxByteCount across the two barriers. If there were no barriers, the compiler would be allowed to hold a copy of spiTxRxByteCount in some CPU register even across multiple invocations of RadioReleaseSPI(). This would typically happen if inlining is enabled and RadioReleaseSPI() is called repeatedly.
That disabling and enabling interrupts serves as barriers is at least as important as avoiding the interruption by an ISR, IMHO. But it seems to be overlooked, sometimes.
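The lost update that the first reason describes can be shown with a small host-side simulation. It is only an illustration under assumptions: `release_with_interrupt` is a hypothetical name, the variable is taken to be 16 bits wide, and the "ISR" is simulated by an ordinary statement placed between the load and the store.

```c
#include <stdint.h>

/* Simulates the unprotected sequence: load, (ISR modifies), AND, store. */
uint16_t release_with_interrupt(uint16_t initial, uint16_t isr_sets_bits)
{
    uint16_t shared = initial;

    uint16_t tmp = shared;          /* foreground: load                  */
    shared |= isr_sets_bits;        /* ISR runs here and sets some bits  */
    tmp &= (uint16_t)~0x0100u;      /* foreground: clear bit 8 in copy   */
    shared = tmp;                   /* foreground: store the stale copy  */

    return shared;
}
```

Starting from 0x0100 with the simulated ISR setting bit 0, the result is 0x0000 rather than 0x0001: the bit set by the ISR has been overwritten.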

How to read two 32bit counters as a 64bit integer without race condition

At memory 0x100 and 0x104 are two 32-bit counters. They represent a 64-bit timer and are constantly incrementing.
How do I correctly read from two memory addresses and store the time as a 64-bit integer?
One incorrect solution:
x = High
y = Low
result = x << 32 + y
(The program could be swapped out and in the meantime Low overflows...)
Additional requirements:
Use C only, no assembly
The bus is 32-bit, so no way to read them in one instruction.
Your program may get context switched at any time.
No mutex or locks available.
Some high-level explanation is okay. Code not necessary. Thanks!
I learned this from David L. Mills, who attributes it to Leslie Lamport:
1. Read the upper half of the timer into H.
2. Read the lower half of the timer into L.
3. Read the upper half of the timer again into H'.
4. If H == H' then return {H, L}, otherwise go back to step 1.
Assuming that the timer itself updates atomically then this is guaranteed to work -- if L overflowed somewhere between steps 1 and 2, then H will have incremented between steps 1 and 3, and the test in step 4 will fail.
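The four steps could be sketched in C as follows. This is a minimal sketch: `read_timer64` is a hypothetical name, and the two halves are passed in as pointers because the question does not say which of the two addresses holds the upper half.

```c
#include <stdint.h>

/* Reads a 64-bit counter exposed as two 32-bit halves, retrying whenever
   the upper half changed while the lower half was being read. */
uint64_t read_timer64(const volatile uint32_t *high_reg,
                      const volatile uint32_t *low_reg)
{
    uint32_t h, l, h2;

    do {
        h  = *high_reg;   /* step 1 */
        l  = *low_reg;    /* step 2 */
        h2 = *high_reg;   /* step 3 */
    } while (h != h2);    /* step 4: retry if the upper half moved */

    return ((uint64_t)h << 32) | l;
}
```

Note the cast to uint64_t before the shift; shifting a 32-bit value by 32 would be undefined behavior in C.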
Given the nature of the memory (a timer), you should be able to read A, read B, read A' and compare A to A', if they match you have your answer. Otherwise repeat.
It sort of depends on what other constraints there are on this memory. If it's something like a system clock, the above will handle the situation where 0x0000FFFF goes to 0x00010000, and, depending on the order you read it in, you would otherwise erroneously end up with 0x00000000 or 0x0001FFFF.
In addition to what has already been said, you won't get more accurate timing reads than your interrupt / context switch jitter allows. If you fear an interrupt / context switch in the middle of a timer polling, the solution is not to adopt some strange read-read-read-compare algorithm, nor is it to use memory barriers or semaphores.
The solution is to use a hardware interrupt for the timer, with an interrupt service routine that cannot be interrupted when executed. This will give the highest possible accuracy, if you actually have need of such.
The obvious and presumably intended answer is already given by Hobbs and jkerian:
1. sample High
2. sample Low
3. read High again - if it differs from the sample from step 1, return to step 1
On some multi-CPU/core hardware, this doesn't actually work properly. Unless you have a memory barrier to ensure that you're not reading High and Low from your own core's cache, updates from another core - even if 64-bit atomic and flushed to some shared memory - aren't guaranteed to be visible in your core in a timely fashion. While High and Low must be volatile-qualified, this is not sufficient.
The higher the frequency of updates, the more probable and significant the errors due to this issue.
There is no portable way to do this without some C wrappers for OS/CPU-specific memory barriers, mutexes, atomic operations, etc.
Brooks' comment below mentions that this does work for certain CPUs, such as modern AMDs.
If you can guarantee that the maximum time of context switch is significantly less than half the low word rollover period, you can use that fact to decide whether the Low value was read before or after its rollover, and choose the correct high word accordingly.
H1 = High; L = Low; H2 = High;
if (H2 != H1 && L < 0x7FFFFFFF) { H1 = H2; }
result = ((uint64_t)H1 << 32) + L;
This avoids the 'repeat' phase of other solutions.
The problem statement didn't include whether the counters could roll over all 64 bits several times between reads. So I might try alternating reading both 32-bit words a few thousand times, more if needed, store them in 2 vector arrays, run a linear regression fit modulo 2^32 against both vectors, and apply slope-matching constraints of that ratio to the possible results, then use the estimated regression fit to predict the count value back to the desired reference time.

Resources