I'm trying to implement a thread-safe queue that holds data coming in on a UART. The queue is written to from the UART RX-complete ISR, and it needs to be read by the application in another thread that processes the data. Since I'm running all of this on a bare-metal system without any RTOS support, I'm wondering if there is a better data structure to use here, because with a queue there are common variables that both threads need to access, and this might cause a race condition.
I realize as I'm writing this that this is the producer-consumer problem and the only way I have solved this in the past is with mutexes. Is there an alternative to that approach?
Edit:
The processor being used is an STMicroelectronics Cortex-M0 based part. I looked into some mutex implementations for the M0 but couldn't find anything definitive. This is mostly because the M0 core does not support the LDREX/STREX exclusive-access instructions that are present on M3 and M4 cores and are usually used to implement the atomic operations mutexes require.
As for the system, the code runs straight to main after booting and has NO OS functionality. Even the scheduler was something that was written by me and simply looks at a table that holds function pointers and calls them.
The requirement is that one thread writes into a memory location from the ISR to store data coming in through the UART RX channel and another thread reads from those memory locations to process the data received. So my initial thought was that I would push to a queue from the ISR and read from it using the application thread, but that is looking less and less feasible because of the race condition that comes out of a producer-consumer setup (with the ISR being the producer and the application being the consumer).
Your M0 is a uniprocessor, so you can disable interrupts to get basic mutual exclusion:
int q_put(int c, Q *q)
{
    int ps, n, r;

    ps = disable();
    if ((n = q->tail + 1) == q->len)
        n = 0;
    if (n != q->head) {         /* queue not full */
        q->buf[q->tail] = c;
        q->tail = n;
        r = 0;
    } else {
        r = -1;
    }
    restore(ps);
    return r;
}
int q_get(Q *q)
{
    int ps, n, r;

    ps = disable();
    if ((n = q->head) == q->tail) {
        r = -1;                 /* queue empty */
    } else {
        r = q->buf[n] & 0xff;
        q->head = (n + 1 == q->len) ? 0 : n + 1;
    }
    restore(ps);
    return r;
}
where disable disables interrupts returning the previous state, and restore sets the interrupt state to its argument.
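The answer leaves Q, disable(), and restore() undefined. Here is a host-testable sketch of the missing pieces, with the queue layout implied by q_put()/q_get() above. The disable()/restore() stubs are placeholders: on a Cortex-M0 they would save PRIMASK, execute CPSID i, and write PRIMASK back (e.g. via the CMSIS __get_PRIMASK()/__disable_irq()/__set_PRIMASK() intrinsics) — that mapping is my assumption, since the answer doesn't show it. q_put()/q_get() are repeated so the sketch compiles standalone.

```c
/* Layout implied by q_put()/q_get(): a circular array where one slot
 * is always kept empty to distinguish full from empty. */
typedef struct {
    unsigned char *buf;
    int len;                    /* capacity of buf[]; one slot stays empty */
    volatile int head, tail;
} Q;

/* Stubs: on a Cortex-M0, disable() would save PRIMASK and execute
 * CPSID i; restore() would write the saved PRIMASK back. */
static int disable(void) { return 0; }
static void restore(int ps) { (void)ps; }

int q_put(int c, Q *q)
{
    int ps = disable(), n, r;
    if ((n = q->tail + 1) == q->len)
        n = 0;
    if (n != q->head) {
        q->buf[q->tail] = (unsigned char)c;
        q->tail = n;
        r = 0;
    } else {
        r = -1;                 /* queue full */
    }
    restore(ps);
    return r;
}

int q_get(Q *q)
{
    int ps = disable(), n, r;
    if ((n = q->head) == q->tail) {
        r = -1;                 /* queue empty */
    } else {
        r = q->buf[n] & 0xff;
        q->head = (n + 1 == q->len) ? 0 : n + 1;
    }
    restore(ps);
    return r;
}
```

With len == 4 the queue holds at most 3 bytes, because one slot is sacrificed to tell "full" apart from "empty" without a separate count variable.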
If it is bare metal, then you won't have any mutexes or higher-level concepts, so you need to implement something similar yourself. This is a common scenario, however.
The normal data type to use for this is a ring buffer, which is a form of queue implemented over a circular array. You should write that one as a separate module, but give it two parameters: the interrupt register and the bit mask for setting/clearing the relevant bit in that register. Then let the ring buffer code temporarily disable the UART RX interrupt during the copy from the ring buffer to the calling application. This will protect against race conditions.
Since UART is most of the time relatively slow (< 115.2 kbps), disabling the RX interrupt for a brief moment is harmless, since you only need a couple of microseconds to do the copy. The theory behind this is that the ISR runs once per data byte received, but the caller runs completely asynchronously in relation to the data. If the caller blocks the ISR for longer than the time it takes to clock in 2 data bytes, there will be overrun errors and data loss.
Which in practice means that the caller should only block the ISR for shorter time than it takes to clock in 1 data byte, because we don't know how far the UART has gotten in clocking in the current byte, at the time we disable the RX interrupt. If the interrupt is disabled at the point the byte is clocked in, that should still be fine since it should become a pending interrupt and trigger once you enable the RX interrupt once again. (At least all UART hardware I've ever used works like this, but double-check the behavior of your specific one just to be sure.)
All of this assumes that you can do the copy faster than the time it takes to clock in 1+8+1 new bits on the UART (no parity, 1 stop bit). So if you are running at, for example, 115.2 kbps, your code must be faster than 1/115200 * (1+8+1) = 86.8 µs. If you are only copying less than a 32-bit word during that time, a Cortex-M should have no trouble keeping up, assuming you run at a sensible clock speed (somewhere around 8-48 MHz) and not on some low-power clock.
You always need to check for overrun and framing errors. Depending on the UART hardware, these might be separate interrupts or the same one as RX. Then handle those errors in whatever way makes sense for the application. If both sender and receiver are configured correctly and you didn't mess up the timing calculations, you shouldn't see any such errors.
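As a concrete illustration of the "pass the interrupt register and bit mask as parameters" idea above, here is a minimal consumer-side sketch. The register and bit names are placeholders, not a real device map — on actual hardware the pointer would refer to the UART's interrupt-enable register and the mask to its RX-interrupt bit:

```c
#include <stdint.h>

typedef struct {
    volatile uint32_t *irq_enable_reg; /* placeholder: UART irq-enable register */
    uint32_t irq_mask;                 /* placeholder: RX-interrupt bit */
    uint8_t buf[64];
    volatile uint16_t head, tail;
} rx_ringbuf;

/* Consumer side: mask only the UART RX interrupt during the copy, so
 * other interrupts keep running; a byte arriving meanwhile leaves the
 * RX interrupt pending and it fires right after re-enabling. */
static int rx_ringbuf_get(rx_ringbuf *rb, uint8_t *out)
{
    int ok = 0;

    *rb->irq_enable_reg &= ~rb->irq_mask;  /* disable UART RX irq */
    if (rb->head != rb->tail) {
        *out = rb->buf[rb->head];
        rb->head = (rb->head + 1u) % sizeof rb->buf;
        ok = 1;
    }
    *rb->irq_enable_reg |= rb->irq_mask;   /* re-enable; pending irq fires */
    return ok;
}
```

Because only one source is masked, the critical section costs at most a few cycles of RX latency, which is exactly the budget computed above.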
In Linux KCOV code, why is this barrier() placed?
void notrace __sanitizer_cov_trace_pc(void)
{
    struct task_struct *t;
    enum kcov_mode mode;

    t = current;
    /*
     * We are interested in code coverage as a function of a syscall inputs,
     * so we ignore code executed in interrupts.
     */
    if (!t || in_interrupt())
        return;
    mode = READ_ONCE(t->kcov_mode);
    if (mode == KCOV_MODE_TRACE) {
        unsigned long *area;
        unsigned long pos;

        /*
         * There is some code that runs in interrupts but for which
         * in_interrupt() returns false (e.g. preempt_schedule_irq()).
         * READ_ONCE()/barrier() effectively provides load-acquire wrt
         * interrupts, there are paired barrier()/WRITE_ONCE() in
         * kcov_ioctl_locked().
         */
        barrier();
        area = t->kcov_area;
        /* The first word is number of subsequent PCs. */
        pos = READ_ONCE(area[0]) + 1;
        if (likely(pos < t->kcov_size)) {
            area[pos] = _RET_IP_;
            WRITE_ONCE(area[0], pos);
        }
    }
}
A barrier() call prevents the compiler from re-ordering instructions. However, how is that related to interrupts here? Why is it needed for semantic correctness?
Without barrier(), the compiler would be free to access t->kcov_area before t->kcov_mode. It's unlikely to want to do that in practice, but that's not the point. Without some kind of barrier, C rules allow the compiler to create asm that doesn't do what we want. (The C11 memory model has no ordering guarantees beyond what you impose explicitly: in C11 via stdatomic, or in Linux / GNU C via barriers like barrier() or smp_rmb().)
As described in the comment, barrier() is creating an acquire-load wrt. code running on the same core, which is all you need for interrupts.
mode = READ_ONCE(t->kcov_mode);
if (mode == KCOV_MODE_TRACE) {
        ...
        barrier();
        area = t->kcov_area;
        ...
I'm not familiar with kcov in general, but it looks like seeing a certain value in t->kcov_mode with an acquire load makes it safe to read t->kcov_area. (Because whatever code writes that object writes kcov_area first, then does a release-store to kcov_mode.)
https://preshing.com/20120913/acquire-and-release-semantics/ explains acq / rel synchronization in general.
Why isn't smp_rmb() required? (Even on weakly-ordered ISAs where acquire ordering would need a fence instruction to guarantee seeing other stores done by another core.)
An interrupt handler runs on the same core that was doing the other operations, just like a signal handler interrupts a thread and runs in its context. struct task_struct *t = current means that the data we're looking at is local to a single task. This is equivalent to something within a single thread in user-space. (Kernel pre-emption leading to re-scheduling on a different core will use whatever memory barriers are necessary to preserve correct execution of a single thread when that other core accesses the memory this task had been using).
The user-space C11 stdatomic equivalent of this barrier is atomic_signal_fence(memory_order_acquire). Signal fences only have to block compile-time reordering (like Linux barrier()), unlike atomic_thread_fence that has to emit a memory barrier asm instruction.
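A minimal user-space sketch of that pairing, with made-up names: the writer plays the role of kcov_ioctl_locked() (publish the area, then a compiler-only release, then set the mode), and the reader plays __sanitizer_cov_trace_pc(). Because a signal or interrupt handler runs on the same core as the code it interrupts, atomic_signal_fence() blocks only compile-time reordering and emits no fence instruction:

```c
#include <stdatomic.h>
#include <stddef.h>

static long *kcov_area;           /* payload, published by the writer */
static _Atomic int kcov_mode;     /* 1 = tracing enabled */

/* Writer (e.g. the ioctl path): publish the data first, then do a
 * compiler-only release before setting the mode flag. */
void writer_enable(long *area)
{
    kcov_area = area;                           /* 1: publish the data */
    atomic_signal_fence(memory_order_release);  /* 2: compiler barrier */
    atomic_store_explicit(&kcov_mode, 1, memory_order_relaxed);
}

/* Reader (e.g. a handler): load the flag, then do a compiler-only
 * acquire before touching the data it guards. */
long *reader_try_trace(void)
{
    if (atomic_load_explicit(&kcov_mode, memory_order_relaxed) != 1)
        return NULL;
    atomic_signal_fence(memory_order_acquire);  /* pairs with the release */
    return kcov_area;             /* safe: area was written before mode */
}
```

The fences cost nothing at run time on any mainstream ISA; they only forbid the compiler from hoisting the kcov_area access above the kcov_mode check.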
Out-of-order CPUs do reorder things internally, but the cardinal rule of OoO exec is to preserve the illusion of instructions running one at a time, in order, for the core running the instructions. This is why you don't need a memory barrier for the asm equivalent of a = 1; b = a; to correctly load the 1 you just stored; hardware preserves the illusion of serial execution (footnote 1) in program order. (Typically via having loads snoop the store buffer and store-forward from stores to loads for stores that haven't committed to L1d cache yet.)
Instructions in an interrupt handler logically run after the point where the interrupt happened (as per the interrupt-return address). Therefore we just need the asm instructions in the right order (barrier()), and hardware will make everything work.
Footnote 1: There are some explicitly-parallel ISAs like IA-64 and the Mill, but they provide rules that asm can follow to be sure that one instruction sees the effect of another earlier one. Same for classic MIPS I load delay slots and stuff like that. Compilers take care of this for compiled C.
I am trying to do quadrature decoding using an Atmel XMEGA AVR microcontroller. The XMEGA has only 16-bit counters, and in addition I have used up all the available timers.
Now, to make a 32-bit counter, I have used one 16-bit counter, and in its overflow/underflow interrupt I increment/decrement a 16-bit global variable, so that by combining them we get a 32-bit counter.
ISR(timer_16bit)
{
    if (quad_enc_mov_forward)
    {
        timer_over_flow++;
    }
    else if (quad_enc_mov_backward)
    {
        timer_over_flow--;
    }
}
So far it is working fine. But I need to use this 32-bit value in various tasks running in parallel. I'm trying to read the 32-bit value as below:
uint32_t current_count = timer_over_flow;
current_count = current_count << 16;
current_count = current_count + timer_16bit_count;
`timer_16bit_count` is a hardware register.
Now the problem I am facing: after I read timer_over_flow into current_count in the first statement, by the time I add timer_16bit_count an overflow may have occurred and the 16-bit timer may have wrapped to zero. This can produce a totally wrong combined value.
And I am trying to read this 32-bit value in multiple tasks.
Is there a way to prevent this data corruption and get a working 32-bit value?
Details sought by different members:
My motor can move forward or backward and accordingly counter increments/decrements.
In the case of the ISR: before starting my motor I set the global variables (quad_enc_mov_forward & quad_enc_mov_backward), so that on an overflow/underflow, timer_over_flow gets changed accordingly.
Variables that are modified in the ISR are declared as volatile.
Multiple tasks means that I'm using an RTOS kernel with about 6 tasks (mostly 3 tasks running in parallel).
In the XMEGA I'm directly reading the TCC0.CNT register for the lower 16 bits.
One solution is:
uint16_t a, b, c;

do {
    a = timer_over_flow;
    b = timer_16bit_count;
    c = timer_over_flow;
} while (a != c);

uint32_t counter = (uint32_t) a << 16 | b;
Per comment from user5329483, this must not be used with interrupts disabled, since the hardware counter fetched into b may be changing while the interrupt service routine (ISR) that modifies timer_over_flow would not run if interrupts are disabled. It is necessary that the ISR interrupt this code if a wrap occurs during it.
This gets the counters and checks whether the high word changed. If it did, this code tries again. When the loop exits, we know the low word did not wrap during the reads. (Unless there is a possibility we read the high word, then the low word wrapped, then we read the low word, then it wrapped the other way, then we read the high word. If that can happen in your system, an alternative is to add a flag that the ISR sets when the high word changes. The reader would clear the flag, read the timer words, and read the flag. If the flag is set, it tries again.)
Note that timer_over_flow, timer_16bit_count, and the flag, if used, must be volatile.
If the wrap-two-times scenario cannot happen, then you can eliminate the loop:
Read a, b, and c as above.
Compare b to 0x8000.
If b has a high value, either there was no wrap, it was read before a wrap upward (0xffff to 0), or it was read after a wrap downward. Use the lower of a or c.
Otherwise, either there was no wrap, b was read after a wrap upward, or it was read before a wrap downward. Use the larger of a or c.
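The loop-free variant described in the steps above can be sketched like this, assuming the double-wrap scenario cannot occur. The names mirror the question; timer_16bit_count stands in for the hardware register, whose 16-bit read must itself be atomic, as another answer in this thread discusses:

```c
#include <stdint.h>

volatile uint16_t timer_over_flow;    /* high word, maintained by the ISR */
volatile uint16_t timer_16bit_count;  /* stands in for the hardware counter */

uint32_t read_counter32(void)
{
    uint16_t a = timer_over_flow;     /* high word, first read */
    uint16_t b = timer_16bit_count;   /* low word */
    uint16_t c = timer_over_flow;     /* high word, second read */
    uint16_t hi;

    if (b >= 0x8000u)
        hi = (a < c) ? a : c;  /* high b: pre-up-wrap or post-down-wrap,
                                  so pair it with the lower high word */
    else
        hi = (a > c) ? a : c;  /* low b: post-up-wrap or pre-down-wrap,
                                  so pair it with the higher high word */

    return (uint32_t)hi << 16 | b;
}
```

When a == c (no wrap happened between the reads), both branches reduce to the same value, so the mid-point test only matters in the wrap case.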
The #1 fundamental embedded systems programming FAQ:
Any variable shared between the caller and an ISR, or between different ISRs, must be protected against race conditions. To prevent some compilers from doing incorrect optimizations, such variables should also be declared as volatile.
Those who don't understand the above are not qualified to write code containing ISRs. Or programs containing multiple processes or threads for that matter. Programmers who don't realize the above will always write very subtle, very hard-to-catch bugs.
Some means to protect against race conditions could be one of these:
Temporarily disabling the specific interrupt during access.
Temporarily disabling all maskable interrupts during access (crude way).
Atomic access, verified in the machine code.
A mutex or semaphore. On single-core MCUs where interrupts cannot be interrupted in turn, you can use a bool as a "poor man's mutex".
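For illustration, here is one way the last option, the "poor man's mutex", might look on a single-core MCU (the names and the defer-to-a-counter scheme are my own for the example, not from the answer). The main loop raises a flag around its access; the ISR, which always sees the main loop's last write on a single core, defers its update instead of racing:

```c
#include <stdbool.h>
#include <stdint.h>

static volatile bool locked;        /* the "poor man's mutex" flag */
static volatile uint32_t shared;    /* counter shared with the ISR */
static volatile uint32_t deferred;  /* increments the ISR postponed */

/* Interrupt context: if the main loop holds the flag, postpone the
 * update instead of touching `shared` mid-transaction. */
void tick_isr(void)
{
    if (locked)
        deferred++;
    else
        shared++;
}

/* Main-loop context: take the flag, do ALL shared accesses under it
 * (including folding in postponed work), then release it. */
uint32_t read_ticks(void)
{
    uint32_t snapshot;

    locked = true;          /* single core: the ISR sees this at once */
    shared += deferred;     /* apply postponed increments under the lock */
    deferred = 0;
    snapshot = shared;
    locked = false;
    return snapshot;
}
```

Note that this only works because the ISR can never run concurrently with itself and the main loop cannot preempt the ISR; on anything multi-core, or with nested ISRs touching the same data, a real atomic primitive is required.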
Just reading TCC0.CNT in multithreaded code is a race condition if you do not handle it correctly. Check the section on reading 16-bit registers in the XMEGA manual. You should read the low byte first (this will probably be handled transparently for you by the compiler). When the low byte is read, the high byte is (atomically) copied into the TEMP register. Reading the high byte then reads the TEMP register, not the counter. In this way an atomic read of the 16-bit value is ensured, but only if there is no access to the TEMP register between the low- and high-byte reads.
Note that this TEMP register is shared between all counters, so a context switch at the right (wrong) moment will probably trash its content, and therefore your high byte. You need to disable interrupts for this 16-bit read. Because the XMEGA executes one more instruction after sei before serving interrupts, the best way is probably:
cli
ld  [low_byte]
sei
ld  [high_byte]
It disables interrupts for four CPU cycles (if I counted it correctly).
An alternative would be to save the shared TEMP register(s) on each context switch. It is possible (not sure how likely) that your OS already does this, but be sure to check. Even so, you need to make sure colliding access does not occur from an ISR.
This precaution should be applied to any 16bit register read in your code. Either make sure TEMP register is correctly saved/restored (or not used by multiple threads at all) or disable interrupts when reading/writing 16bit value.
This problem is indeed a very common and very hard one. All solutions to it will have a caveat regarding timing constraints in the lower-priority layers. To clarify this: the highest-priority function in your system is the hardware counter; its response time defines the maximum frequency that you can eventually sample. The next lower priority in your solution is the interrupt routine, which tries to keep track of bit 2^16, and the lowest is your application-level code, which tries to read the 32-bit value. The question now is whether you can quantify the shortest time between two level changes on the A and B inputs of your encoder. The shortest time usually occurs not at the highest speed your real-world axis is rotating, but when halting at a position: through minimal vibrations the encoder can swing back and forth between two increments, thereby producing e.g. a falling and a rising edge on the same encoder output in short succession. Iff (if and only if) you can guarantee that your interrupt processing time is shorter (by a margin) than this minimal time can you use such a method to virtually extend the coordinate range of your encoder.
I am programming a microcontroller of the PIC24H family and using xc16 compiler.
I am relaying U1RX-data to U2TX within main(), but when I try that in an ISR it does not work.
I am sending commands to U1RX, and the ISR is down below. At U2RX, data bytes are coming in constantly, and I want to relay 500 of them through U1TX. The result is that U1TX relays the first 4 data bytes from U2RX, but then re-sends the 4th byte over and over again.
When I copy the for loop below into my main() it all works properly. In the ISR, it is as if U2RX's corresponding FIFO buffer is not being cleared when read, so the buffer overflows and stops receiving further incoming data on U2RX. I would really appreciate it if someone could show me how to approach the problem here. The variables tmp and command are declared globally.
void __attribute__((__interrupt__, auto_psv, shadow)) _U1RXInterrupt(void)
{
    command = U1RXREG;
    if (command == 'd') {
        for (i = 0; i < 500; i++) {
            while (U2STAbits.URXDA == 0)
                ;                       /* wait for a byte on U2 RX */
            tmp = U2RXREG;
            while (U1STAbits.UTXBF == 1)
                ;                       /* wait for U1 TX buffer space */
            U1TXREG = tmp;
        }
    }
}
Edit: I added the first line in the ISR().
Trying to draw an answer from the various comments.
If main() has nothing else to do, and there are no other interrupts, you might be able to "get away with" patching all 500 chars from one UART to the other under interrupt, once the first interrupt has occurred, and perhaps it would be a useful exercise to get that working.
But that's not how you should use an interrupt. If you have other tasks in main(), and equal or lower priority interrupts, the relatively huge time that this interrupt will take (500 chars at 9600 baud = half a second) will make the processor what is known as "interrupt-bound", that is, the other processes are frozen out.
As your project gains complexity, you won't want to restrict main() to this task, and there is no need for it to be involved at all after setting up the UARTs and IRQs. After that it can calculate π ad infinitum if you want.
I am a bit perplexed as to your sequence of operations. A command 'd' is received from U1 which tells you to patch 500 chars from U2 to U1.
I suggest one way to tackle this (and there are many) seeing as you really want to use interrupts, is to wait until the command is received from U1 - in main(). You then configure, and enable, interrupts for RXD on U2.
Then the job of the ISR will be to receive data from U2 and transmit it through U1. If both UARTs have the same clock and the same baud rate, there should not be a synchronisation problem, since a UART is typically buffered internally: once it begins to transmit, the TXD register is available to hold another character, so any stagnation in the ISR should be minimal.
I can't write the actual code for you, since it would be supposed to work, but here is some very pseudo code, and I don't have a PIC handy (or wish to research its operational details).
ISR:
    invoked because U2 has received a char
    you *might* need to check RXD status as a required sequence to clear the interrupt
    read the RXD register, which also might clear the interrupt status
    if not, specifically clear the interrupt status
    while (U1 TXD busy)
        ;
    write char to U1
    if (chars received == 500)
        disable U2 RXD interrupt
    return from interrupt
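Translated into C against the register names used in the question, the pseudo code might look roughly like this. The SFRs and the clear-flag helper are stubbed here only so the sketch is self-contained; on a real PIC24/xc16 build they come from the device header, and the exact interrupt-flag clearing sequence must be checked against the datasheet:

```c
#include <stdint.h>

/* --- stubs standing in for the xc16 device header (assumption) --- */
static volatile struct { unsigned UTXBF:1; } U1STAbits;
static volatile uint16_t U2RXREG, U1TXREG;
static volatile uint16_t IEC1_U2RXIE = 1;  /* placeholder: U2 RX irq-enable bit */
static void clear_u2rx_flag(void) {}       /* placeholder: IFS1bits.U2RXIF = 0 */

static volatile uint16_t relayed;          /* characters forwarded so far */

/* One interrupt per received character: clear the flag, read the char
 * (which drains the RX FIFO on real hardware), forward it through U1,
 * and disable this interrupt source after the 500th byte. */
void u2rx_isr(void)
{
    clear_u2rx_flag();
    uint16_t c = U2RXREG;
    while (U1STAbits.UTXBF)                /* wait for TX holding space */
        ;
    U1TXREG = c;
    if (++relayed == 500)
        IEC1_U2RXIE = 0;                   /* hand control back to main() */
}
```

Each invocation now blocks for at most one TX-holding-register wait instead of 500 of them, which is the whole point of the restructuring.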
ISRs must be kept lean and mean, and the code made hyper-efficient, if there is any hope of keeping up with the buffer on a UART. Experiment with the baud rate just to find the point at which your code can keep up, to help discover the right heuristic and see how far away you are from achieving your goal.
Success could depend on how fast your microcontroller is as well, and how many tasks it is running. If the microcontroller has a built-in UART, theoretically you should be able to keep the FIFO from overflowing. On the other hand, if you paired up a UART with an insufficiently powerful microcontroller, you might not be able to optimize your way out of the problem.
Besides the suggestion to offload the lower-priority work to the main thread and keep the ISR fast (that someone made in the comments), you will want to carefully look at the timing of all of the lines of code and try every trick in the book to get them to run faster. One expensive instruction can ruin your whole day, so get real creative in finding ways to save time.
EDIT: Another thing to consider - look at the assembly language your C compiler creates. A good compiler should let you inline assembly language instructions to allow you to hyper-optimize for your particular case. Generally in an ISR it would just be a small number of instructions that you have to find and implement.
EDIT 2: A PIC24-series part should be fast enough if you code it right and select a fast oscillator or crystal and run the chip at a good clock rate. Also consider the divisor the UART might be using to achieve its rate vs. the PIC clock rate. It is conceivable (to me) that an even division that could be accomplished internally via shifting would be better than one where math was required.
Consider this code here:
// Stupid I/O delay routine necessitated by historical PC design flaws
static void
delay(void)
{
inb(0x84);
inb(0x84);
inb(0x84);
inb(0x84);
}
What is port 0x84? Why is it a design flaw? delay() is used in the serial_putc() function:
static void
serial_putc(int c)
{
int i;
for (i = 0;
!(inb(COM1 + COM_LSR) & COM_LSR_TXRDY) && i < 12800;
i++)
delay();
outb(COM1 + COM_TX, c);
}
The file is from lab1 of the course Operating System Engineering from OCW.
The serial port is a piece of hardware with semantics you have to accept. It usually has a shift register that does the conversion from parallel to serial data. It can have a holding register for the next byte to send, or even a FIFO for more than one byte. That's why you have to poll the line status register (LSR).
There are some hardware revisions out there that don't behave correctly. Your code looks like a workaround for a bug in old hardware. It shouldn't be necessary to read port 0x84 here.
But the delay implementation can't be optimized out when you increase the compiler optimization level, since it accesses the I/O range. Running this code on up-to-date hardware might be problematic if the run-time performance gives too little delay. You will have to verify that the maximum time that can be waited in the loop is sufficient for the UART to shift out one byte. Keep in mind that this is baud-rate dependent, while your code example isn't.
Port 0x84 is used to access the "extra page register" (Overview). But reading this register should be a no-op; only the read operation itself is important, to consume CPU cycles.
I'm writing code for an embedded system (Cortex M0) and do not have all the luxuries of mutexes/spinlocks/etc. Is there a simple way to add data to a shared buffer (log-file) which will be flushed to disk from my Main() loop?
If there is only a single producer (1 interrupt) and single consumer (main-loop), I could use a simple buffer where the producer increases the 'head' and the consumer the 'tail'. And it will be perfectly safe. But now that I have multiple producers (interrupts) it seems like I'm stuck.
I could give each interrupt its own buffer, and combine them in Main(), but this will require a lot of extra RAM and complexity.
You can implement this through a simple ring buffer (circular array), where you turn off the hardware interrupt sources during access. It only needs the functions init, add and remove.
I'm not certain how your particular MCU handles interrupts, but most likely they will remain pending, as long as you only enable/disable the particular hardware peripheral's interrupt. Depending on the nature of your application, you could also disable the global interrupt mask, but that's rather crude.
Generally, you don't need to worry about missing interrupts, because if the code that handles the incoming interrupts is slower than the interrupt frequency, no software in the world will fix it. You would either have to accept data loss or increase the CPU clock to dodge such scenarios. But of course you should always try to keep the code inside the ISR as compact as possible.
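A minimal init/add/remove sketch of such a ring buffer follows. Because several ISRs produce into the same buffer (and can preempt each other if their priorities differ), add() also runs with interrupts masked. The irq_save()/irq_restore() stubs are placeholders: on a Cortex-M0 they would save and restore PRIMASK (or flip a per-peripheral interrupt-enable bit, as discussed above); they are stubbed here so the sketch compiles standalone:

```c
#include <stdint.h>

#define LOG_BUF_SIZE 128u           /* power of two keeps the math cheap */

typedef struct {
    uint8_t buf[LOG_BUF_SIZE];
    volatile uint32_t head, tail;   /* free-running: tail = write, head = read */
} log_ring;

/* Stubs: on target these would save PRIMASK + CPSID i, and restore. */
static uint32_t irq_save(void)      { return 0; }
static void irq_restore(uint32_t m) { (void)m; }

void log_init(log_ring *r)
{
    r->head = r->tail = 0;
}

/* Safe to call from any ISR or from main(). */
int log_add(log_ring *r, uint8_t c)
{
    uint32_t m = irq_save();
    int ok = 0;

    if (r->tail - r->head < LOG_BUF_SIZE) {     /* not full */
        r->buf[r->tail % LOG_BUF_SIZE] = c;
        r->tail++;
        ok = 1;
    }
    irq_restore(m);
    return ok;
}

/* Consumer side, called from main() to drain toward the log file. */
int log_remove(log_ring *r, uint8_t *c)
{
    uint32_t m = irq_save();
    int ok = 0;

    if (r->head != r->tail) {
        *c = r->buf[r->head % LOG_BUF_SIZE];
        r->head++;
        ok = 1;
    }
    irq_restore(m);
    return ok;
}
```

One buffer then serves all producers, avoiding the per-interrupt buffers (and the extra RAM) mentioned in the question; the cost is a critical section of a handful of cycles per byte.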