Atomic Block for reading vs ARM SysTicks


I am currently porting my DCF77 library (you may find the source code at GitHub) from Arduino (AVR based) to Arduino Due (ARM Cortex M3). I am an absolute beginner with the ARM platform.
With the AVR based Arduino I can use avr-libc to get atomic blocks. Basically this blocks all interrupts during the block and will allow interrupts later on again. For the AVR this was fine. Now for the ARM Cortex things start to get complicated.
First of all: for the current uses of the library this approach would work as well. So my first question is: is there something similar to the "ATOMIC" macros of avr-libc for ARM? Obviously other people have thought of something in this direction. Since I am using gcc I could enhance these macros to work almost exactly like the avr-libc ATOMIC macros. I already found some CMSIS documentation; however, this seems only to provide an "enable_irq" macro instead of a "restore_irq" macro.
Question 1: is there any library out there (for gcc) that already does this?
Because ARM has different priority interrupts I could establish the atomicity in different ways as well. In my case the "atomic" blocks must only make sure that they are not interrupted by the systick interrupt. So actually I would not need to block everything to make my blocks "atomic enough". Searching further I found an ARM synchronization primitives article in the developer infocenter. In particular, it hints at lockless programming. According to the article this is an advanced concept on which there are many publications. Searching the net I found only general explanations of the concept, e.g. here. I assume that a lockless implementation would be very cool, but at this time I do not feel confident enough on ARM to implement this from scratch.
Question 2: does anyone have some hints for me on lockless reads of memory blocks on ARM Cortex M3?
As I already said I only need to protect the lower priority thread from sysTicks. So another option would be to disable sysTicks briefly. Since I am implementing a timing sensitive clock algorithm this must not slow down the overall sysTick frequency in the long run. Introducing some small jitter would be OK though. At this time I would find this most attractive.
Question 3: is there any good way to block sysTick interrupts without losing any ticks?
I also found the CMSIS documentation for semaphores. However I am somewhat overwhelmed. Especially I am wondering if I should use CMSIS and how to do this on an Arduino Due.
Question 4: What would be my best option? Or where should I continue reading?
Partial Answer:
with the hint from Notlikethat I implemented
#if defined(ARDUINO_ARCH_AVR)
    #include <util/atomic.h>
    #define CRITICAL_SECTION ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
#elif defined(ARDUINO_ARCH_SAM)
    // Workaround as suggested by Stackoverflow user "Notlikethat"
    // http://stackoverflow.com/questions/27998059/atomic-block-for-reading-vs-arm-systicks
    static inline int __int_disable_irq(void) {
        int primask;
        asm volatile("mrs %0, PRIMASK\n" : "=r"(primask));
        // the "memory" clobber keeps the compiler from moving
        // protected accesses in front of the mask instruction
        asm volatile("cpsid i\n" ::: "memory");
        return primask & 1;
    }
    static inline void __int_restore_irq(int *primask) {
        // re-enable only if interrupts were enabled on entry
        if (!(*primask)) {
            asm volatile ("" ::: "memory");
            asm volatile("cpsie i\n");
        }
    }
    // This critical section macro borrows heavily from
    // avr-libc util/atomic.h
    // --> http://www.nongnu.org/avr-libc/user-manual/atomic_8h_source.html
    #define CRITICAL_SECTION \
        for (int primask_save __attribute__((__cleanup__(__int_restore_irq))) = __int_disable_irq(), \
                 __ToDo = 1; \
             __ToDo; __ToDo = 0)
#else
    #error Unsupported controller architecture
#endif
This macro does more or less what I need. However I find there is room for improvement as this blocks all interrupts although it would be sufficient to block only systicks. So Question 3 is still open.

Most of what you've referenced is about synchronising memory accesses between multiple CPUs, or pre-emptively scheduled threads on the same CPU, which seems entirely inappropriate given the stated situation. "Atomicity" in that sense refers to guaranteeing that when one observer is updating memory, any observer reading memory sees either the initial state, or the updated state, but never something part-way in between.
"Atomicity" with respect to interrupts follows the same principle - i.e. ensuring that if an interrupt occurs, a sequence of code has either not run at all, or run completely - but is a conceptually different thing [1]. There are only two things guaranteed to be atomic w.r.t. interrupts: a single instruction [2], or a sequence of instructions executed with interrupts disabled.
The "right" way to achieve that is indeed via the CPSID/CPSIE instructions, which are wrapped in the __disable_irq()/__enable_irq() intrinsics. Note that there are two "stages" of interrupt handling in the system: the M3 core itself only has a single IRQ signal - it's the external NVIC's job to do all the routing/multiplexing/prioritisation of the system IRQs into this one line. When the CPU wants to enter a critical section, all it needs to do is mask its own IRQ input with CPSID, do what it needs, then unmask with CPSIE, at which point any pending IRQ from the NVIC will be taken immediately.
For the case of nested/re-entrant critical sections, the intrinsics provide a handy int __disable_irq(void) form which returns the previous state, so you can unmask conditionally on that.
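To see why returning the previous state matters for nesting, here is a minimal host-side sketch. It does not touch real hardware: PRIMASK is simulated by an ordinary variable (sim_primask), and crit_enter()/crit_exit() are names invented for this illustration. On the real core the two helpers would be the MRS/CPSID and CPSIE sequences.

```c
#include <assert.h>

/* Simulated PRIMASK bit: 0 = interrupts enabled, 1 = interrupts masked.
 * Hypothetical stand-in so the nesting logic can be run on a PC. */
static int sim_primask = 0;

/* Mask interrupts and return the previous mask bit. */
static int crit_enter(void) {
    int previous = sim_primask;
    sim_primask = 1;
    return previous;
}

/* Unmask only if interrupts were enabled when the section was entered. */
static void crit_exit(int previous) {
    if (previous == 0) {
        sim_primask = 0;
    }
}

static int shared_counter = 0;

/* Helper with its own critical section; may be called with or
 * without interrupts already masked. */
static void bump_counter(void) {
    int saved = crit_enter();
    assert(sim_primask == 1);   /* we are inside the protected region */
    shared_counter++;
    crit_exit(saved);           /* must not unmask if the caller had masked */
}

/* Outer critical section that calls the helper twice.
 * Returns the mask bit observed between the two calls. */
static int bump_twice(void) {
    int saved = crit_enter();
    bump_counter();
    int mask_in_between = sim_primask;  /* inner exit left the mask set */
    bump_counter();
    crit_exit(saved);
    return mask_in_between;
}
```

With an unconditional "enable" in the inner helper, the outer section would silently lose its protection after the first call; the saved state prevents that.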
For other compilers which don't offer such intrinsics, it's straightforward enough to roll your own, e.g.:
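static inline int disable_irq(void) {
    int primask;
    /* read the current mask state, then mask interrupts */
    asm volatile("mrs %0, PRIMASK\n"
                 "cpsid i\n" : "=r"(primask) : : "memory");
    return primask & 1;   /* 1 = interrupts were already masked */
}
static inline void enable_irq(int primask) {
    /* unmask only if interrupts were enabled on entry */
    if (!primask)
        asm volatile("cpsie i\n" : : : "memory");
}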
[1] One confusing overlap is the latter sense is often used to achieve the former in single-CPU multitasking - if interrupts are off, no other thread can get scheduled until you've finished, thus will never see partially-updated memory.
[2] With the possible exception of load/store-multiple instructions - in the low-latency interrupt configuration, these can be interrupted, and either restarted or continued upon return.
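For the lockless read asked about in Question 2, one common pattern on a single core is a sequence counter: the interrupt handler bumps a counter before and after it updates the data, and the reader retries whenever the counter changed during its copy. The sketch below is only an illustration under stated assumptions: the type clock_time_t, the function names, and the sim_irq_hook test hook (which fakes an interrupt arriving in the middle of a read) are all invented here, and the writer is assumed to run only in the interrupt handler, so it is never itself interrupted by the reader.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical multi-byte value shared between an ISR and the main code. */
typedef struct {
    uint8_t hour, minute, second;
} clock_time_t;

static volatile uint32_t seq = 0;          /* even: data consistent */
static volatile clock_time_t shared_time;

/* Writer: assumed to be called only from the interrupt handler. */
static void time_publish(uint8_t h, uint8_t m, uint8_t s) {
    seq++;                                 /* odd: update in progress */
    shared_time.hour = h;
    shared_time.minute = m;
    shared_time.second = s;
    seq++;                                 /* even again: update complete */
}

/* Test-only hook that simulates an interrupt arriving mid-read. */
static void (*sim_irq_hook)(void) = 0;

/* Reader: never masks interrupts, retries if an update got in between. */
static clock_time_t time_read(void) {
    clock_time_t copy;
    uint32_t before, after;
    do {
        before = seq;
        copy.hour = shared_time.hour;
        if (sim_irq_hook) {                /* fake interrupt, fires once */
            void (*hook)(void) = sim_irq_hook;
            sim_irq_hook = 0;
            hook();
        }
        copy.minute = shared_time.minute;
        copy.second = shared_time.second;
        after = seq;
    } while (before != after || (before & 1u));
    assert(copy.minute < 60 && copy.second < 60);
    return copy;
}

/* Example "interrupt": the clock rolls over to 12:00:00. */
static void demo_isr(void) {
    time_publish(12, 0, 0);
}
```

The reader never delays the tick interrupt, at the price of an occasional retry. Without the retry, a read that straddles the rollover from 11:59:59 could return the torn value 11:00:00.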

Related

Critical sections in ARM

I am experienced in implementing critical sections on the AVR family of processors, where all you do is disable interrupts (with a memory barrier of course), do the critical operation, and then reenable interrupts:
void my_critical_function()
{
cli(); //Disable interrupts
// Mission critical code here
sei(); //Enable interrupts
}
Now my question is this:
Does this simple method apply to the ARM architecture of processor as well? I have heard things about the processor doing lookahead on the instructions, and other black magic, and was wondering primarily if these types of things could be problematic to this implementation of critical sections.

Assuming you're on a Cortex-M processor, take a look at the LDREX and STREX instructions, which are available in C via the __LDREXW() and __STREXW() macros provided by CMSIS (the Cortex Microcontroller Software Interface Standard). They can be used to build extremely lightweight mutual exclusion mechanisms.
Basically,
data = __LDREXW(address)
works like data = *address except that it sets an 'exclusive access flag' in the CPU. When you've finished manipulating your data, write it back using
success = __STREXW(data, address)
which is like *address = data but will only succeed in writing if the exclusive access flag is still set. If it does succeed in writing then it also clears the flag. It returns 0 on success and 1 on failure. If the STREX fails, you have to go back to the LDREX and try again. (Note the argument order: CMSIS declares __STREXW with the value first and the address second.)
For simple exclusive access to a shared variable, nothing else is required. For example:
do {
    data = __LDREXW(address);
    data++;
} while (__STREXW(data, address));
The interesting thing about this mechanism is that it's effectively 'last come, first served'; if this code is interrupted and the interrupt handler uses LDREX and STREX, the handler's STREX will succeed and the (lower-priority) user code will have to retry.
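The same retry loop can also be written portably with C11 atomics, which GCC typically lowers to an LDREX/STREX pair on Cortex-M3 and newer cores. The sketch below is an illustration only: the variable event_count and the function count_add_saturating() are made-up names, and the saturating add stands in for any read-modify-write that a plain atomic increment cannot express.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter shared between an ISR and the main code. */
static _Atomic uint32_t event_count = 0;

/* Add delta to the counter, clamping the result at limit.
 * Returns the value that was stored. */
static uint32_t count_add_saturating(uint32_t delta, uint32_t limit) {
    uint32_t old = atomic_load(&event_count);
    uint32_t desired;
    do {
        /* compute the new value from the snapshot in old */
        desired = (old >= limit || limit - old < delta) ? limit
                                                        : old + delta;
        /* on failure, old is refreshed with the current value
         * and the loop recomputes desired from it */
    } while (!atomic_compare_exchange_weak(&event_count, &old, desired));
    assert(desired <= limit);
    return desired;
}
```

As with the hand-written LDREX/STREX loop, whoever is interrupted is the one who retries, so an interrupt handler using the same function is never blocked by the main code.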
If you're using an operating system, the same primitives can be used to build 'proper' semaphores and mutexes (see this application note, for example); but then again if you're using an OS you probably already have access to mutexes through its API!

ARM architecture is very wide and as I understand you probably mean ARM Cortex M micro controllers.
You can use this technique, but many ARM uCs offer much more. As I do not know what the actual hardware is, I can only give you some examples:
Bit-band area: in these memory regions you can set and reset bits atomically.
Hardware semaphores (STM32H7)
Hardware mutexes (some NXP uCs)
etc etc.

DSB on ARM Cortex M4 processors

I have read the ARM documentation and it appears that they say in some places that the Cortex M4 can reorder memory writes, while in other places it indicates that M4 will not.
Specifically I am wondering if the DMB instruction is needed like:
volatile int flag=0;
char buffer[10];
void foo(char c)
{
    __ASM volatile ("dmb" : : : "memory");
    __disable_irq(); //disable IRQ as we use flag in ISR
    buffer[0]=c;
    flag=1;
    __ASM volatile ("dmb" : : : "memory");
    __enable_irq();
}

Uh, it depends on what your flag is, and it also varies from chip to chip.
In case that flag is stored in memory:
DSB is not needed here. An interrupt handler that would access flag would have to load it from memory first. Even if your previous write is still in progress the CPU will make sure that the load following the store will happen in the correct order.
If your flag is stored in peripheral memory:
Now it gets interesting. Let's assume flag is in some hardware peripheral. A write to it may make an interrupt pending or acknowledge an interrupt (aka clear a pending interrupt). Contrary to the memory example above, this effect happens without the CPU having to read the flag first. So the automatic ordering of stores and loads won't help you. Also, writes to flag may take effect with a surprisingly long delay due to different clock domains between the CPU and the peripheral.
So the following scenario can happen:
you write flag=1 to clear a handled interrupt.
you enable interrupts by calling __enable_irq()
interrupts get enabled, write to flag=1 is still pending.
wheee, an interrupt is pending and the CPU jumps to the interrupt handler.
flag=1 takes effect. You're now in an interrupt handler without anything to do.
Executing a DSB in front of __enable_irq() will prevent this problem because whatever is triggered by flag=1 will be in effect before __enable_irq() executes.
If you think that this case is purely academic: Nope, it's real.
Just think about a real-time clock. These usually run at 32 kHz. If you write into its peripheral space from a CPU running at 64 MHz, it can take a whopping 2000 cycles before the write takes effect. Now for real-time clocks the data-sheet usually shows specific sequences that make sure you don't run into this problem.
The same thing can however happen with slow peripherals.
My personal anecdote happened when implementing power-saving late in a project. Everything was working fine. Then we reduced the peripheral clock speed of I²C and SPI peripherals to the lowest possible speed we could get away with. This can save lots of power and extend battery life. What we found out was that suddenly interrupts started to do unexpected things. They seemed to fire twice each time, wreaking havoc. Putting a DSB at the end of each affected interrupt handler fixed this because - you can guess - the lower clock speed caused us to leave the interrupt handlers before clearing the interrupt source was in effect due to the slow peripheral clock.
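The shape of such a handler can be sketched as follows. Everything hardware-specific here is an assumption: sim_irq_status stands in for a memory-mapped interrupt status register whose real address and clearing rule (many peripherals use write-1-to-clear) come from the data sheet, and DATA_SYNC_BARRIER() is a macro defined only for this example. It emits a real DSB when compiled for a Cortex-M3/M4 and falls back to a plain compiler barrier elsewhere, so the logic can still be run on a PC.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
/* Cortex-M3/M4 build: wait until outstanding writes have completed. */
#define DATA_SYNC_BARRIER() __asm__ volatile("dsb" ::: "memory")
#else
/* Host build: compiler barrier only, no hardware effect. */
#define DATA_SYNC_BARRIER() __asm__ volatile("" ::: "memory")
#endif

/* Stand-in for a peripheral's interrupt status register. */
static volatile uint32_t sim_irq_status = 0;
#define IRQ_FLAG_RX (1u << 0)

static unsigned rx_events_handled = 0;

/* Interrupt handler: service the event, clear the source, then make
 * sure the clearing write is out before the handler returns. */
static void peripheral_isr(void) {
    if (sim_irq_status & IRQ_FLAG_RX) {
        rx_events_handled++;
        sim_irq_status &= ~IRQ_FLAG_RX;  /* acknowledge the interrupt */
        DATA_SYNC_BARRIER();             /* do not return before this lands */
    }
    assert((sim_irq_status & IRQ_FLAG_RX) == 0);
}
```

Reading the status register back after the write is another commonly used way to make sure the acknowledge has reached a slow peripheral; which of the two a given part needs is something to check in its documentation.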

This section of the Cortex M4 generic device user guide enumerates the factors which can affect reordering.
the processor can reorder some memory accesses to improve efficiency, providing this does not affect the behavior of the instruction sequence.
the processor has multiple bus interfaces
memory or devices in the memory map have different wait states
some memory accesses are buffered or speculative.
You should also bear in mind that both DSB and ISB are often required (in that order), and that C does not make any guarantees about the ordering (except in-thread volatile accesses).
You will often observe that the short pipeline and instruction sequences can combine in such a way that the race conditions seem unreachable with a specific compiled image, but this isn't something you can rely on. Either the timing conditions might be rare (but possible), or subsequent code changes might change the resulting instruction sequence.

Atomic disable and restore interrupts from ISR and non-ISR context: may it be different on some platform?

I work with embedded stuff, namely PIC32 Microchip CPUs these days.
I'm familiar with several real-time kernels: AVIX, FreeRTOS, TNKernel, and in all of them we have 2 versions of nearly all functions: one for calling from task, and second one for calling from ISR.
Of course it makes sense for functions that could switch context and/or sleep: obviously, ISR can't sleep, and context switch should be done in different manner. But there are several functions that do not switch context nor sleep: say, it may return system tick count, or set up software timer, etc.
Now, I'm implementing my own kernel: TNeoKernel, which has well-formed code and is carefully tested, and I'm considering providing "universal" functions in some cases: ones that can be called from either task or ISR context. But since all three aforementioned kernels use separate functions, I'm afraid I'm going to do something wrong.
Say, in task and ISR context, TNKernel uses different routines for disabling/restoring interrupts, but as far as I see, the only possible difference is that ISR functions may be "compiled out" as an optimization if the target platform doesn't support nested interrupts. But if target platform supports nested interrupts, then disabling/restoring interrupts looks absolutely the same for task and ISR context.
So, my question is: are there platforms on which disabling/restoring interrupts from ISR should be done differently than from non-ISR context?
If there are no such platforms, I'd prefer to go with "universal" functions. If you have any comments on this approach, they are highly appreciated.
UPD: I don't like having two sets of functions because they lead to notable code duplication and complication. Say, I need to provide a function that should start a software timer. Here is what it looks like:
enum TN_RCode _tn_timer_start(struct TN_Timer *timer, TN_Timeout timeout)
{
    /* ... real job is done here ... */
}

/*
 * Function to be called from task
 */
enum TN_RCode tn_timer_start(struct TN_Timer *timer, TN_Timeout timeout)
{
    TN_INTSAVE_DATA;  //-- define the variable to store interrupt status,
                      //   it is used by TN_INT_DIS_SAVE()
                      //   and TN_INT_RESTORE()
    enum TN_RCode rc = TN_RC_OK;

    //-- check that function is called from right context
    if (!tn_is_task_context()){
        rc = TN_RC_WCONTEXT;
        goto out;
    }

    //-- disable interrupts
    TN_INT_DIS_SAVE();

    //-- perform real job, after all
    rc = _tn_timer_start(timer, timeout);

    //-- restore interrupts state
    TN_INT_RESTORE();

out:
    return rc;
}

/*
 * Function to be called from ISR
 */
enum TN_RCode tn_timer_istart(struct TN_Timer *timer, TN_Timeout timeout)
{
    TN_INTSAVE_DATA_INT;  //-- define the variable to store interrupt status,
                          //   it is used by TN_INT_DIS_SAVE()
                          //   and TN_INT_RESTORE()
    enum TN_RCode rc = TN_RC_OK;

    //-- check that function is called from right context
    if (!tn_is_isr_context()){
        rc = TN_RC_WCONTEXT;
        goto out;
    }

    //-- disable interrupts
    TN_INT_IDIS_SAVE();

    //-- perform real job, after all
    rc = _tn_timer_start(timer, timeout);

    //-- restore interrupts state
    TN_INT_IRESTORE();

out:
    return rc;
}
So, we need wrappers like the ones above for nearly all system functions. This is an inconvenience, for me as a kernel developer as well as for kernel users.
The only difference is that different macros are used: for task, these are TN_INTSAVE_DATA, TN_INT_DIS_SAVE(), TN_INT_RESTORE(); for interrupts these are TN_INTSAVE_DATA_INT, TN_INT_IDIS_SAVE(), TN_INT_IRESTORE().
For the platforms that support nested interrupts (ARM, PIC32), these macros are identical. For other platforms that don't support nested interrupts, TN_INTSAVE_DATA_INT, TN_INT_IDIS_SAVE() and TN_INT_IRESTORE() are expanded to nothing. So it is a bit of performance optimization, but the cost is too high in my opinion: it's harder to maintain, it's not so convenient to use, and the code size increases.

It's all a matter of design and CPU capabilities. I'm not familiar with any of the PICs but, for example, Freescale (Motorola) MCUs (among many others) have the ability to move the Condition Code Register (CCR) into the accumulator and back. This allows one to save the previous state of the Interrupt Enable/Disable Mask, and restore it at the end, without worrying about bluntly enabling interrupts where they should stay disabled (inside ISRs).
To answer, however, which platform(s) must do it differently inside and outside ISRs would require one to be familiar with all of them, or at least one that fails this test. If there is a CPU that does not allow saving and restoring the CCR (as mentioned above), one would have no option but to do it differently for each case.

Kernel functions that normally cause scheduling to occur have simpler ISR versions because the scheduler runs on return from interrupt (there is usually an interrupt epilogue required to do that), not from the scheduling function itself.
It is simple enough to create a function that will work in any context, but it adds a small overhead. However the safety afforded by not calling an inappropriate function is probably worth it.
For example:
OSStatus semGive( OSSem sem )
{
    return isInterrupt() ? ISR_SemGive( sem ) : OS_SemGive( sem ) ;
}
The implementation of isInterrupt() is platform dependent, and is discussed at Safely detect, if function is called from an ISR?
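As one possible illustration of the platform-dependent part: on a Cortex-M core the IPSR register holds the number of the active exception and reads as 0 in thread (task) context, so a context check can be built on it. The sketch below is an assumption-laden example, not any kernel's actual code: read_ipsr(), is_interrupt(), sem_give() and the two counting stubs are invented names, and on a non-Cortex-M build the register is replaced by a plain variable (sim_ipsr) so the dispatch logic can be tried on a PC.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
/* Cortex-M3/M4 build: read the real IPSR register. */
static inline uint32_t read_ipsr(void) {
    uint32_t ipsr;
    __asm__ volatile("mrs %0, ipsr" : "=r"(ipsr));
    return ipsr;
}
#else
/* Host build: simulated IPSR, 0 = thread mode, nonzero = in a handler. */
static uint32_t sim_ipsr = 0;
static inline uint32_t read_ipsr(void) { return sim_ipsr; }
#endif

/* Nonzero exception number means we are inside an exception handler. */
static inline int is_interrupt(void) {
    return read_ipsr() != 0;
}

/* Stubs standing in for the two context-specific implementations. */
static unsigned task_path_calls = 0;
static unsigned isr_path_calls = 0;
static void sem_give_from_task(void) { task_path_calls++; }
static void sem_give_from_isr(void)  { isr_path_calls++; }

/* "Universal" entry point that picks the right variant at run time. */
static void sem_give(void) {
    if (is_interrupt()) {
        sem_give_from_isr();
    } else {
        sem_give_from_task();
    }
    assert(task_path_calls + isr_path_calls > 0);
}
```

The run-time check costs a register read and a branch per call, which is the small overhead mentioned above.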

How to achieve multitasking in a microcontroller?

I wrote a program for a wrist watch utilizing an 8051 micro-controller using embedded C. There are a total of six 7-segment displays, laid out like this:
 _______________________
|       |       |       |   two 7-segments for showing HOURS
|  HR   |  MIN  |  SEC  |   two 7-segments for showing MINUTES and
|_______._______._______|   two 7-segments for showing SECONDS
  7-segment LED display
To update the hours, minutes and seconds, we used 3 for loops. That means that first the seconds will update, then the minutes, and then the hours. Then I asked my professor why can't we update simultaneously (I mean hours increment after an hour without waiting for the minutes to update). He told me we can't do parallel processing because of the sequential execution of the instructions.
Question:
A digital birthday card which will play music continuously whilst blinking LED's simultaneously. A digital alarm clock will produce beeps at particular time. While it is producing sound, the time will continue updating. So sound and time increments both are running in parallel. How did they achieve these results with sequential execution?
How does one run multiple tasks simultaneously (scheduling) in a micro-controller?

First, what's with this sequential execution. There's just one core, one program space, one counter. The MPU executes one instruction at a time and then moves to another, in sequence. In this system there's no inherent mechanism to make it stop doing one thing and start doing another - it's all one program, and it's entirely in hands of programmer what the sequence will be and what it will do; it will last uninterrupted, one instruction at a time in sequence, as long as the MPU is running, and nothing else will happen, unless the programmer made it happen first.
Now, to multitasking:
Normally, operating systems provide multitasking, with quite complex scheduling algorithms.
Normally, microcontrollers run without operating system.
So, how do you achieve multitasking in microcontroller?
The simple answer is "you don't". But as usual, the simple answer rarely covers more than 5% of cases...
You'd have an extremely hard time writing a real, preemptive multitasking. Most microcontrollers just don't have the facilities for that, and things an Intel CPU does with a couple specific instructions would require you to write miles of code. Better forget classic multitasking for microcontrollers unless you really have nothing better to do with your time.
Now, there are two usual approaches that are frequently used instead, with far less hassle.
Interrupts
Most microcontrollers have different interrupt sources, often including timers. So, the main loop runs one task continuously, and when the timer counts to zero, interrupt is issued. The main loop is stopped and execution jumps to an address known as 'interrupt vector'. There, a different procedure is launched, performing a different one-off task. Once that finishes (possibly resetting the timer if need be), you return from the interrupt and main loop is resumed.
Microcontrollers often have a few timers, and you can assign one task per timer, not to mention tasks on other, external interrupts (say, keyboard input - key pressed, or data arriving over RS232.)
While this approach is very limited, it really suffices for the great majority of cases; specifically yours: set up the timer to cycle every 1 s, on interrupt calculate the new time, update the display, then leave the interrupt. In the main loop wait for the date to reach the birthday, and when it does, start playing the music and blinking the LEDs.
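A minimal sketch of that arrangement, with the hardware parts left out: timer_1s_isr() is assumed to be hooked to a timer interrupt that fires once per second (how that is configured depends on the chip and is not shown), and display_needs_update() is what the main loop would poll. All names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Time of day, written only by the timer interrupt. */
static volatile uint8_t hours, minutes, seconds;

/* Set by the interrupt, cleared by the main loop. */
static volatile uint8_t display_dirty;

/* Assumed to be called from a timer interrupt once per second. */
static void timer_1s_isr(void) {
    uint8_t s = seconds, m = minutes, h = hours;
    if (++s == 60) {
        s = 0;
        if (++m == 60) {
            m = 0;
            if (++h == 24) {
                h = 0;
            }
        }
    }
    assert(s < 60 && m < 60 && h < 24);
    seconds = s;
    minutes = m;
    hours = h;
    display_dirty = 1;          /* tell the main loop to redraw */
}

/* Main-loop side: returns 1 exactly once per time change. */
static int display_needs_update(void) {
    if (!display_dirty) {
        return 0;
    }
    display_dirty = 0;
    return 1;
}
```

Hours roll over without waiting on a separate loop for minutes or seconds: the whole cascade happens inside one short interrupt, and the main loop stays free for music and LEDs.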
Cooperative multitasking
This is how it was done in the early days. You need to write your 'tasks' as subroutines, each with a finite state machine (or a single pass of a loop) inside, and the "OS" is a simple loop of jumps to consecutive tasks, in sequence.
After each jump the MPU starts executing the given task, and will continue until the task returns control, after first saving its state so that it can be recovered when the task is started again. Each pass of the task should be very short. Any delay loops must be replaced with wait states in the finite state machine (if the condition is not satisfied, return; if it is, change the state). All longer loops must be unrolled into distinct states ("State: copying block of data; copy byte N, increase N; N = end? yes: next state, no: return control").
Writing that way is more difficult, but the solution is more robust. In your case you might have four tasks:
clock
display update
play sound
blink LED
Clock returns control if no new second arrived. If it did, it recalculates the number of seconds, minutes, hours, date, and then returns.
Display updates the displayed values. If you multiplex over the digits on the 7-segment display, each pass will update one digit, the next pass the next one, etc.
Playing sound will wait (yield) while it's not birthday. If it's birthday, pick the sample value from memory, output it to speaker, yield. Optionally yield if you were called earlier than you were supposed to output next sound.
Blinking - well, output the right state to LED, yield.
Very short loops - say, 10 iterations of 5 lines - are still allowed, but anything longer should be transformed into a state of the finite state engine which the process is.
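A small sketch of this style, reduced to two of the four tasks. The time base now_ms is assumed to be advanced elsewhere (for instance by a timer), and every name here is made up for the illustration. Each task does at most one short step per call and then returns, which is the "yield".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical millisecond time base, advanced outside the tasks. */
static uint32_t now_ms = 0;

/* Clock task: returns immediately unless a new second has arrived. */
static uint32_t clock_deadline = 1000;
static uint32_t clock_seconds = 0;
static void clock_task(void) {
    if ((int32_t)(now_ms - clock_deadline) < 0) {
        return;                              /* nothing to do: yield */
    }
    clock_seconds++;
    clock_deadline += 1000;
}

/* Blink task as a two-state machine instead of a delay loop. */
typedef enum { BLINK_WAIT, BLINK_TOGGLE } blink_state_t;
static blink_state_t blink_state = BLINK_WAIT;
static uint32_t blink_deadline = 500;
static int led_on = 0;
static unsigned led_toggles = 0;
static void blink_task(void) {
    switch (blink_state) {
    case BLINK_WAIT:                         /* replaces delay(500) */
        if ((int32_t)(now_ms - blink_deadline) >= 0) {
            blink_state = BLINK_TOGGLE;
        }
        break;
    case BLINK_TOGGLE:
        led_on = !led_on;
        led_toggles++;
        blink_deadline += 500;
        blink_state = BLINK_WAIT;
        break;
    }
}

/* The "OS": one pass calls every task once, in sequence. */
typedef void (*task_fn)(void);
static const task_fn tasks[] = { clock_task, blink_task };
static void scheduler_pass(void) {
    for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++) {
        assert(tasks[i] != 0);
        tasks[i]();
    }
}
```

On the target, main() would simply call scheduler_pass() in an endless loop; the sound and display tasks would be added to the table in the same way.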
Now, if you're feeling hardcore, you may try going about...
pre-emptive multitasking.
Each task is a procedure that would normally execute infinitely, doing just its own thing. It is written normally, trying not to step on other procedures' memory but otherwise using resources as if there was nothing else in the world that could need them.
Your OS task is launched from a timer interrupt.
Upon getting started by the interrupt, the OS task must save all current volatile state of the last task - registers, the interrupt return address (from which the task should be resumed), current stack pointer, keeping that in a record of that task.
Then, using the scheduler algorithm, it picks another process from the list, which should start now; restores all of its state, then overwrites own return-from-interrupt address with the address of where that process left off, when preempted previously. Upon ending the interrupt normal operation of the preempted process is resumed, until another interrupt which switches control to OS again.
As you can see, there's a lot of overhead, with saving and restoring the complete state of the program instead of just what the task needs at the moment, but the program doesn't need to be written as a finite state machine - normal sequential style suffices.

While SF provides an excellent overview of multitasking there is also some additional hardware most microcontrollers have that let them do things simultaneously.
Illusion of simultaneous execution - Technically your professor is correct and updating simultaneously cannot be done. However, processors are very fast. For many tasks they can execute sequentially, like updating each 7 segment display one at a time, but it does it so fast that human perception cannot tell that each display was updated sequentially. The same applies to sound. Most audible sound is in the kilohertz range while processors run in the megahertz range. The processor has plenty of time to play part of a sound, do something else, then return to playing a sound without your ear being able to detect the difference.
Interrupts - SF covered the execution of interrupts well so I'll gloss over the mechanics and talk more about hardware. Most micro-controllers have small hardware modules that operate simultaneously with instruction execution. Timers, UARTs, and SPI are common modules that do a specific action while the main portion of the processor carries out instructions. When a given module completes its task it notifies the processor and the processor jumps to the interrupt code for the module. This mechanism allows you to do things like transmit a byte over a UART (which is relatively slow) while executing instructions.
PWM - PWM (Pulse Width Modulation) is a hardware module that essentially generates a square wave whose high and low parts don't have to be equal in length (I am simplifying here). One could be longer than the other, or they could be the same size. You configure in hardware the period and the width of the high part, and then the PWM generates the wave continuously. This module can be used to drive motors or even generate sound: the speed of a motor depends on the ratio of the two parts (the duty cycle), while the pitch of a sound depends on the period of the wave. To play music, a processor would only need to change the period when it is time for the note to change (perhaps based on a timer interrupt) and it can execute other instructions in the meantime.
DMA - DMA (Direct Memory Access) is a specific type of hardware that automatically copies bytes from one memory location to another. Something like an ADC might continuously write a converted value to a specific register in memory. A DMA controller can be configured to read continuously from one address (the ADC output) while writing sequentially to a range of memory (like the buffer to receive multiple ADC conversions before averaging). All of this happens in hardware while the main processor executes instructions.
Timers, UART, SPI, ADC, etc - There are many other hardware modules (too many to cover here) that perform a specific task simultaneously with program execution.
TL/DR - While program instructions can only be executed sequentially, the processor can usually execute them fast enough that they appear to happen simultaneously. Meanwhile, most micro-controllers have additional hardware that accomplishes specific tasks simultaneously with program execution.
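To make the PWM point concrete, here is a sketch of the arithmetic a program might do when a note changes, assuming a timer-based PWM module where one register sets the period (which gives the pitch) and another sets how long the output stays high within each period. The struct, the function name and the clock frequency used in the example are all hypothetical; real register names and widths come from the data sheet.

```c
#include <assert.h>
#include <stdint.h>

/* Values that would be written to the PWM's period and compare registers. */
typedef struct {
    uint32_t period_ticks;  /* timer ticks per output cycle */
    uint32_t high_ticks;    /* ticks the output stays high per cycle */
} pwm_setting_t;

/* timer_hz:     clock feeding the PWM counter
 * tone_hz:      desired output frequency
 * duty_percent: share of each period spent high, 0..100 */
static pwm_setting_t pwm_for_tone(uint32_t timer_hz, uint32_t tone_hz,
                                  uint32_t duty_percent) {
    assert(tone_hz > 0 && duty_percent <= 100);
    pwm_setting_t s;
    s.period_ticks = timer_hz / tone_hz;   /* integer division, rounds down */
    s.high_ticks =
        (uint32_t)(((uint64_t)s.period_ticks * duty_percent) / 100u);
    return s;
}
```

For example, with an assumed 1 MHz timer clock, pwm_for_tone(1000000, 440, 50) yields a period of 2272 ticks with 1136 of them high. Once the two values are written, the hardware keeps producing the tone on its own until the next note.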

The answers by Zack and SF. nicely cover the big picture. But sometimes a working example is valuable.
While I could glibly suggest browsing the source kit to the Linux kernel (which is both open source and provides multitasking even on single-core machines), that is not the best place to start for an understanding of how to actually implement a scheduler.
A much better place to start is with the source kit to one of the hundreds (if not thousands) of real time operating systems. Many of these are open source, and most can run even on extremely small processors, including the 8051. I'll describe Micrium's uC/OS-II here in more detail because it has a typical set of features and it is the one I've used extensively. Others I've evaluated in the past include OS-9, eCos, and FreeRTOS. With those names as a starting point along with keywords like "RTOS" Google will reward you with names of many others.
My first reach for an RTOS kernel would be uC/OS-II (or its newer family member uC/OS-III). This is a commercial product that started life as an educational exercise for readers of Embedded Systems Design magazine. The magazine articles and their attached source code became the subject of one of the better books on the subject. The OS is open source, but does carry license restrictions on commercial use. In the interest of disclosure, I am the author of the port of uC/OS-II to the ColdFire MCF5307.
Since it was originally written as an educational tool, the source code is well documented. The text book (as of the 2nd edition on my shelf here somewhere, at least) is well written as well, and goes into a lot of theoretical background on each of the features it supports.
I successfully used it in several product development projects, and would consider it again for a project that needs multitasking but does not need to carry the weight of a full OS like Linux.
uC/OS-II provides a preemptive task scheduler, along with a useful collection of inter-task communications primitives (semaphore, mutex, mailbox, message queue), timers, and a thread-safe pooled memory allocator.
It also supports task priority, and includes deadlock prevention if used correctly.
It is written entirely in a subset of standard C (meeting almost all requirements of the MISRA-C:1998 guidelines), which helped make it possible for it to receive a variety of safety critical certifications.
While my applications were never in safety critical systems, it was comforting to know that the OS kernel on which I was standing had achieved those ratings. It provided assurance that the most likely reason I had a bug was either a misunderstanding of how a primitive worked, or possibly more likely, was actually a bug in my application logic.
Most RTOSes (and uC/OS-II especially) are able to run in limited resources. uC/OS-II can be built in as little as 6KB of code, and with as little as 1KB of RAM required for OS structures.
The bottom line is that apparent concurrency can be achieved in a variety of ways, and one of those ways is to use an OS kernel designed to schedule and execute each concurrent task in parallel by sharing the resources of the sequential CPU among all the tasks. For simple cases, all you might need is interrupt handlers and a main loop, but when your requirements grow to the point of implementing several protocols, managing a display, managing user input, background computation, and monitoring overall system health, standing on a well-designed RTOS kernel along with known to work communications primitives can save a lot of development and debugging effort.

Well, I see a lot of ground covered by other answers; so, hopefully I don't end up turning this into something bigger than I intend. (TL;DR: Girl to the rescue! :D). But, I do have (what I believe to be) a very good solution to offer; so I hope you can make use of it. I only have a small amount of experience with the 8051; although I did work for ~3 months (plus ~3 more full-time) on another microcontroller, with moderate success. In the course of that I ended up doing a little bit of almost everything the little thing had to offer: serial communications, SPI, PWM signals, servo control, DIO, thermocouples, and so forth. While I was working on it, I lucked out and came across an excellent (IMO) solution for (cooperative) 'thread' scheduling, which mixed well with some small amount of additional real-time stuff done off of interrupts on the PIC. And, of course, other interrupt handlers for the other devices.
pt_thread: Invented by Adam Dunkels (with Oliver Schmidt) (v1.0 released in Feb. 2005). His site is a great introduction to them, and includes downloads through v1.4 from Oct. 2006. I am very glad to have gone to look again, because there's an item from Jan. 2009 stating that Larry Ruane used event-driven techniques "[for] a complete reimplementation [using GCC; and with] a very nice syntax", available on sourceforge. Unfortunately, it looks like there are no updates to either since around 2009; but the 2006 version served me very well. The last news item (from Dec. 2009) notes that "Sonic Unleashed" indicated in its manual that protothreads were used!
One of the things that I think are awesome about pt_threads is that they're so simple; and, whatever the benefits of the newer (Ruane) version, it's certainly more complex. Although it may well be worth taking a look at, I am going to stick with Dunkels' original implementation here. His original pt_threads "library" consists of five header files. And, really, that seems like an overstatement: once I minified a few macros and other things, removed the doxygen sections and examples, and culled the comments down to the bare minimum that I still felt gave an explanation, it clocks in at just around 115 lines (included below).
There are examples included with the source tarball, and a very nice .pdf document (or .html) available on his site (linked above). But, let me walk through a quick example to elucidate some of the concepts. (Not the macros themselves; it took me a while to grok those, and they aren't really necessary just to use the functionality. :D)
Unfortunately, I've run out of time for tonight; but I will try to get back on at some point tomorrow to write up a little example; either way, there are a ton of resources on his website, linked above; it's a fairly straightforward procedure, the tricky part for me (as I suppose it is with any cooperative multi-threading; Win 3.1 anyone? :D) was ensuring that I had properly cycle-counted the code, so as not to overrun the time I needed to process the next thing before yielding the pt_thread.
I hope this gives you a start; let me know how it goes if you try it out!
FILE: pt.h
#ifndef __PT_H__
#define __PT_H__
#include "lc.h"
// NOTE: the enums are mine to compress space; originally all were #defines
enum PT_STATUS_ENUM { PT_WAITING, PT_YIELDED, PT_EXITED, PT_ENDED };
struct pt { lc_t lc; }; // protothread control structure (pt_thread)
#define PT_INIT(pt) LC_INIT((pt)->lc) // initializes pt_thread prior to use
// you can use this to declare pt_thread functions
#define PT_THREAD(name_args) char name_args
// NOTE: looking at this, I think I might define my own macro as follows, so as not
// to have to redeclare the struct pt *pt every time.
//#define PT_DECLARE(name, args) char name(struct pt *pt, args)
// start/end pt_thread (inside implementation fn); must always be paired
#define PT_BEGIN(pt) { char PT_YIELD_FLAG = 1; LC_RESUME((pt)->lc)
#define PT_END(pt) LC_END((pt)->lc);PT_YIELD_FLAG = 0;PT_INIT(pt);return PT_ENDED;}
// {block, yield} 'pt' {until,while} 'c' is true
#define PT_WAIT_UNTIL(pt,c) do { \
LC_SET((pt)->lc); if(!(c)) {return PT_WAITING;} \
} while(0)
#define PT_WAIT_WHILE(pt, cond) PT_WAIT_UNTIL((pt), !(cond))
#define PT_YIELD_UNTIL(pt, cond) \
do { PT_YIELD_FLAG = 0; LC_SET((pt)->lc); \
if((PT_YIELD_FLAG == 0) || !(cond)) { return PT_YIELDED; } } while(0)
// NOTE: no corresponding "YIELD_WHILE" exists; oversight? [shelleybutterfly]
//#define PT_YIELD_WHILE(pt,cond) PT_YIELD_UNTIL((pt), !(cond))
// block pt_thread 'pt', waiting for child 'thread' to complete
#define PT_WAIT_THREAD(pt, thread) PT_WAIT_WHILE((pt), PT_SCHEDULE(thread))
// spawn pt_thread 'child' as child of 'pt', waiting until 'thread' exits
#define PT_SPAWN(pt, child, thread) do { \
PT_INIT((child)); PT_WAIT_THREAD((pt),(thread)); } while(0)
// block and cause pt_thread to restart its execution at its PT_BEGIN()
#define PT_RESTART(pt) do { PT_INIT(pt); return PT_WAITING; } while(0)
// exit the pt_thread; if a child, then parent will unblock and run
#define PT_EXIT(pt) do { PT_INIT(pt); return PT_EXITED; } while(0)
// schedule pt_thread: fn ret != 0 if pt is running, or 0 if exited
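#define PT_SCHEDULE(f) ((f) < PT_EXITED)
// yield from the current pt_thread
#define PT_YIELD(pt) do { PT_YIELD_FLAG = 0; LC_SET((pt)->lc); \
if(PT_YIELD_FLAG == 0) { return PT_YIELDED; } } while(0)
#endif /* __PT_H__ */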
FILE: lc.h
#ifndef __LC_H__
#define __LC_H__
#ifdef LC_INCLUDE
#include LC_INCLUDE
#else
#include "lc-switch.h"
#endif /* LC_INCLUDE */
#endif /* __LC_H__ */
FILE: lc-switch.h
// WARNING: implementation using switch() won't work with an LC_SET() inside a switch()!
#ifndef __LC_SWITCH_H__
#define __LC_SWITCH_H__
typedef unsigned short lc_t;
#define LC_INIT(s) s = 0;
#define LC_RESUME(s) switch(s) { case 0:
#define LC_SET(s) s = __LINE__; case __LINE__:
#define LC_END(s) }
#endif /* __LC_SWITCH_H__ */
FILE: lc-addrlabels.h
#ifndef __LC_ADDRLABELS_H__
#define __LC_ADDRLABELS_H__
typedef void * lc_t;
#define LC_INIT(s) s = NULL
#define LC_RESUME(s) do { if(s != NULL) { goto *s; } } while(0)
#define LC_CONCAT2(s1, s2) s1##s2
#define LC_CONCAT(s1, s2) LC_CONCAT2(s1, s2)
#define LC_END(s)
#define LC_SET(s) \
do {LC_CONCAT(LC_LABEL, __LINE__):(s)=&&LC_CONCAT(LC_LABEL,__LINE__);} while(0)
#endif /* __LC_ADDRLABELS_H__ */
FILE: pt-sem.h
#ifndef __PT_SEM_H__
#define __PT_SEM_H__
#include "pt.h"
struct pt_sem { unsigned int count; };
// macros to initialize, await, and signal a pt_sem semaphore
#define PT_SEM_INIT(s, c) (s)->count = c
#define PT_SEM_WAIT(pt, s) do \
{ PT_WAIT_UNTIL(pt, (s)->count > 0); --(s)->count; } while(0)
#define PT_SEM_SIGNAL(pt, s) ++(s)->count
#endif /* __PT_SEM_H__ */
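To make the switch-based trick from lc-switch.h a bit more concrete, here is a minimal, self-contained sketch. It does not use the headers above: the DEMO_-prefixed macros, blink_thread, and run_demo are hypothetical stand-ins made up for illustration, and the "timer" is just a counter advanced by the calling loop so the whole thing can be tried on a PC. Treat it as an illustration of the idea, not as the library's own code.

```c
/* Simplified stand-ins for the switch-based protothread macros.
   All DEMO_ names are hypothetical and exist only for this sketch. */
typedef unsigned short demo_lc_t;
struct demo_pt { demo_lc_t lc; };
enum { DEMO_WAITING, DEMO_ENDED };

#define DEMO_INIT(pt)  ((pt)->lc = 0)
#define DEMO_BEGIN(pt) switch ((pt)->lc) { case 0:
/* Record the current line, then return until cond holds; the next call
   jumps straight back to the recorded case label. */
#define DEMO_WAIT_UNTIL(pt, cond) \
    do { (pt)->lc = __LINE__; case __LINE__: \
         if (!(cond)) return DEMO_WAITING; } while (0)
#define DEMO_END(pt)   } DEMO_INIT(pt); return DEMO_ENDED

unsigned demo_ticks;   /* simulated timer tick, advanced by the caller */
int demo_led_toggles;  /* stands in for toggling a real LED */

/* Toggle the "LED" three times, five ticks apart, without ever blocking. */
int blink_thread(struct demo_pt *pt)
{
    /* Locals do not survive the return, so persistent state is static. */
    static unsigned deadline;
    static int count;

    DEMO_BEGIN(pt);
    for (count = 0; count < 3; count++) {
        deadline = demo_ticks + 5;
        DEMO_WAIT_UNTIL(pt, demo_ticks >= deadline);
        demo_led_toggles++;
    }
    DEMO_END(pt);
}

/* Stand-in for the main loop: call the thread once per tick until it ends. */
int run_demo(void)
{
    struct demo_pt pt;
    DEMO_INIT(&pt);
    demo_ticks = 0;
    demo_led_toggles = 0;
    while (blink_thread(&pt) != DEMO_ENDED)
        demo_ticks++;
    return demo_led_toggles;
}
```

In a real main loop you would call several such functions in turn on every pass; each one returns quickly when its condition is not yet met, which is what makes the scheduling cooperative. Note that two wait macros must not share a source line, since the line number is the resume point.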
[☆] about a week learning about microcontrollers[†] and a week playing with it during an evaluation to see if it could meet our needs for a little line-replaceable remote I/O unit. (long story short: no)
[†] The 8051 Microcontroller, Third Edition was suggested to me as the 8051 programming "bible". I don't know if it is or not, but I was certainly able to get my head around things using it.[‡]
[‡] and even looking over it again now I don't see much not to like about it. :) well, I mean... I wish I hadn't bought two copies; but they were so cheap!
LICENSE AGREEMENT (where applicable)
This post contains code based on (or taken from) 'The Protothreads Library' (referred to herein and henceforth as "PTLIB";
including v1.4 and earlier revisions) relying extensively on the source code as well as the documentation for PTLIB.
PTLIB original source code and documentation was received from, and freely available for download at the author's PTLIB site
'http://dunkels.com/adam/pt/', available through a link on the downloads page at 'http://dunkels.com/adam/pt/download.html'
or directly via 'http://dunkels.com/adam/download/pt-1.4.tar.gz'.
This post consists of original text, for which I hereby give to you (with love!) under a full waiver of whatever copyright
interest I may have, under the following terms: "copyheart ♥ 2014, shelleybutterfly, share with love!"; or, if you prefer,
a fully non-restrictive, attribution-only license appropriate to the material (such as Apache 2.0 for software; or CC-BY
license for text) so that you may use it as you see fit, so that it may best suit your needs.
This post also contains source code, almost entirely created from the original source by removing explanatory material,
reformatting, and paraphrasing the in-line documentation/comments, as well as a few modifications/additions by me
(shelleybutterfly on the stackexchange network). Anything derivative of PTLIB for which I may have, legally, gained any
copyright or other interest, I hereby cede all such interest back to and all copyright interest in the original work to
the original copyright holder, as specified in the license from PTLIB, which follows this agreement.
In any jurisdiction where it is not possible for the terms above to apply to you for whatever reason, then, for whatever
interest I have in the material, I hereby offer it to you under any non-restrictive, attribution-only, license of your
choosing; or, should this also not be possible, then I give permission to stack exchange inc to provide it to you under
whatever terms they determine to be acceptable in your jurisdiction.
All source code from PTLIB, and that which is derivative of PTLIB, that is not covered under other terms detailed above
hereby provided to 'stack exchange inc' and to you under the following agreement:
LICENSE AGREEMENT for "The Protothreads Library"
Copyright (c) 2004-2005, Swedish Institute of Computer Science. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the
following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following
disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the Institute nor the names of its contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE INSTITUTE AND CONTRIBUTORS `AS IS' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
EVENT SHALL THE INSTITUTE OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
Author: Adam Dunkels

There are some really good answers here, but just a little more context regarding your birthday card example might be a good lead in before digging in with the longer answers.
The way a single cpu can seem to do multiple things at once is by rapidly switching between tasks, as well as employing the assistance of timers, interrupts and independent hardware units that can do things independently of the cpu. (See @Zack's answer for a nice discussion and starter list of HW.) So for your birthday card, the cpu could be telling a bit of audio hardware "play this chunk of sound", then go blink the LED, then come back and load the next bit of sound before the first portion is finished playing. In this situation, the cpu might take, say, 1 msec to load audio that plays for 5 msec of real time, leaving you with 4 msec to do something else before loading the next bit of sound.
The digital clock might beep by setting up a bit of PWM hardware to output at some frequency to a piezo buzzer, setting a timer for an interrupt to stop the beep, then going off to check a real time counter to see if the time display LEDs need to be updated. When the timer fires the interrupt, your code shuts off the PWM.
The details will vary according to the hardware of the chip, and going over the datasheet is the way to find out what capability a given microcontroller might have, and how to access it.

I have had good experiences with FreeRTOS, even though it uses a fair bit of memory. FreeRTOS gives you true preemptive threading, there are tons of ports if you ever want to upgrade those dusty old 8051s, there are semaphores and message queues and priorities and all kinds of stuff, and it's totally free. I've only worked with the Arduino port personally, but it seems to be one of the most popular of the free RTOSes.
I think they sell a book that isn't free, but there's enough info on their website and in the Arduino examples to pretty much figure it out.

Embedded Programming, Wait for 12.5 us

I'm programming on the C2000 F28069 Experimenters Kit. I'm toggling a GPIO output every 12.5 microseconds, 5 times in a row. I decided I don't want to use interrupts (though I will if I absolutely have to). I want to just wait that amount of time in terms of clock cycles.
My clock is running at 80MHz, so 12.5 us should be 1000 clock cycles. When I use a loop:
for(i=0;i<1000;i++)
I get a result that is way too long (not 12.5 us). What other techniques can I use?
Is sleep(n); something that I can use on a microcontroller? If so, which header file do I need to download and where can I find it? Also, now that I think about it, sleep(n); takes an int input, so that wouldn't even work... any other ideas?

Summary: Use the PWM or Timer peripherals to generate output pulses.
First, the clock speed of the CPU has a complex relationship to actual code execution speed, and in many CPUs there is more than one clock rate involved in different stages of the execution. The chip you reference has several internal clock sources, for instance. Further, each individual instruction will likely take a different number of clocks to execute, and some cores can execute part of (or all of) several instructions simultaneously.
To rigorously create a loop that required 12.5 µs to execute without using a timing interrupt or other hardware device would require careful hand coding in assembly language along with careful accounting of the execution time of each instruction.
But you are writing in C, not assembler.
So the first question you have to ask is what machine code was actually generated for your loop. And the second question is did you enable the optimizer, and to what level.
As written, a decent optimizer will determine that the loop for (i=0; i<1000; i++) ; has no visible side effects, and therefore is just a slow way of writing ;, and can be completely removed.
If it does compile the loop, it could be written naively using perhaps as many as 5 instructions, or as few as one or two. I am not personally familiar with this particular TI CPU architecture, so I won't attempt to guess at the best possible implementation.
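A rough sketch of that difference, assuming any standard C compiler (the function names are made up for illustration). The first loop has no observable side effects, so an optimizer is free to delete it. Declaring the counter volatile forces every read and write to happen, so the loop survives, but how long each iteration takes still depends on the instructions the compiler emits. This keeps the delay from disappearing; it does not by itself calibrate it to 12.5 µs.

```c
#include <stdint.h>

/* No visible side effects: an optimizing compiler may remove the whole loop. */
void delay_plain(uint32_t n)
{
    for (uint32_t i = 0; i < n; i++) {
        /* empty */
    }
}

/* The volatile counter must really be read and written on each pass,
   so the loop is kept. Its duration is still compiler- and CPU-dependent. */
uint32_t delay_volatile(uint32_t n)
{
    volatile uint32_t i;
    for (i = 0; i < n; i++) {
        /* empty */
    }
    return i;  /* number of iterations actually executed */
}
```

Comparing the assembly output for both functions at different optimization levels is a quick way to see what the compiler actually did with each loop.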
All that said, learning about the CPU architecture and its efficiency is important to building reliable and efficient embedded systems. But given that the chip has peripheral devices built-in that provide hardware support for PWM (pulse width modulated) outputs as well as general purpose hardware timer/counters you would be far better off learning to use the hardware to generate the waveform for you.
I would start by collecting every document available on the CPU core and its peripherals, especially app notes and sample code.
The C compiler will have an option to emit and preserve an assembly language source file. I would use that as a guide to study the structure of the code generated for critical loops and other bottlenecks, as well as the effects of the compiler's various optimization levels.
The tool suite should have a mechanism for profiling your running code. Before embarking on heroic measures in pursuit of optimizations, use that first to identify the actual bottlenecks. Even if it lacks decent profiling, you are likely to have spare GPIO pins that can be toggled around critical sections of code and measured with a logic analyzer or oscilloscope.

The chip you refer to lists PWM (pulse width modulation) hardware as one of its major features. You should rely on it; please refer to the appropriate application guide. Generally you cannot guarantee 12.5 µs periods from the application layer (and should not try to). Even if you managed to do so directly from the application layer, it would be a bad idea: any change in your firmware code can break the timing.

If you use a timer peripheral with PWM output capability, as already suggested by @RBerteig, then you can generate an accurate timing signal with zero software overhead. If you need to do other work synchronously with the clock, then you can use the timer interrupt to trigger that too. However, if you process interrupts at an interval of 12.5 µs you may find that your processor spends a great deal of time context switching rather than performing useful work.
If you simply want an accurate delay, then you should still use a hardware timer and poll its reload flag rather than process its interrupt. This allows consistent timing independent of the compiler's code generation or processor speed, and allows you to add other code within the loop without extending the total loop time. You would poll it in a loop during which you might do other work as well. The timing jitter and determinism will depend on what other work you do in the loop, but for an empty loop, the reaction to the timer event will probably be faster than the latency of an interrupt handler.
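A minimal sketch of the polling idea, with loud assumptions: it compares against a free-running counter rather than polling a reload flag, and read_timer_counts() and toggle_gpio() are hypothetical stand-ins that are simulated here so the code can run on a PC. On the real part they would read the timer's count register and write the GPIO register, and the counter direction and width would have to match the hardware. The 80 MHz figure comes from the question.

```c
#include <stdint.h>

#define CPU_HZ 80000000u  /* 80 MHz, as stated in the question */

/* Convert a delay in nanoseconds to timer counts, assuming the timer
   is clocked at CPU_HZ with no prescaler. */
uint32_t delay_counts(uint32_t delay_ns)
{
    return (uint32_t)(((uint64_t)CPU_HZ * delay_ns) / 1000000000u);
}

/* Hypothetical stand-in for a free-running up-counter: each read
   advances the simulated count by a few "cycles". */
uint32_t sim_counter;
uint32_t read_timer_counts(void)
{
    sim_counter += 7u;
    return sim_counter;
}

/* Hypothetical stand-in for the GPIO toggle. */
int gpio_level;
int gpio_edges;
void toggle_gpio(void)
{
    gpio_level = !gpio_level;
    gpio_edges++;
}

/* Toggle the pin n times, one period apart, by polling the counter.
   Each deadline is derived from the previous deadline rather than from
   "now", so polling latency does not accumulate from edge to edge. */
void toggle_n_times(int n, uint32_t period)
{
    uint32_t next = read_timer_counts() + period;
    for (int i = 0; i < n; i++) {
        /* Signed difference keeps the comparison valid across wrap-around. */
        while ((int32_t)(read_timer_counts() - next) < 0) {
            /* short, bounded work could go here */
        }
        toggle_gpio();
        next += period;
    }
}
```

For the case in the question, delay_counts(12500) gives 1000 counts, and toggle_n_times(5, 1000) produces the five edges. Any work placed inside the polling loop adds jitter up to its own duration, so it should stay much shorter than the period.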
