How to achieve multitasking in a microcontroller? - c

I wrote a program for a wrist watch on an 8051 microcontroller in embedded C. There are a total of six 7-segment displays, arranged like this:
 ________________________
|       |       |        |    two 7-segments for showing HOURS,
|  HR   |  MIN  |  SEC   |    two 7-segments for showing MINUTES, and
|_______._______.________|    two 7-segments for showing SECONDS
       7-segment LED display
To update the hours, minutes and seconds, we used three for loops: the seconds update first, then the minutes, and then the hours. I asked my professor why we can't update them simultaneously (i.e., have the hours increment when an hour passes without waiting on the minutes loop). He told me we can't do parallel processing because the instructions execute sequentially.
Question:
Consider a digital birthday card that plays music continuously while blinking LEDs at the same time, or a digital alarm clock that produces beeps at a particular time: while it is producing sound, the time keeps updating, so the sound and the time increments run in parallel. How do they achieve these results with sequential execution?
How does one run multiple tasks simultaneously (scheduling) in a micro-controller?

First, about this sequential execution: there is just one core, one program space, one program counter. The MPU executes one instruction at a time and then moves to the next, in sequence. There is no inherent mechanism in this system to make it stop doing one thing and start doing another - it is all one program, and it is entirely in the hands of the programmer what the sequence will be and what it will do. It will run uninterrupted, one instruction at a time, in sequence, for as long as the MPU is running, and nothing else will happen unless the programmer made it happen first.
Now, to multitasking:
Normally, operating systems provide multitasking, with quite complex scheduling algorithms.
Normally, microcontrollers run without operating system.
So, how do you achieve multitasking in microcontroller?
The simple answer is "you don't". But as usual, the simple answer rarely covers more than 5% of cases...
You'd have an extremely hard time writing real, preemptive multitasking. Most microcontrollers just don't have the facilities for it, and things an Intel CPU does with a couple of specific instructions would require you to write miles of code. Better to forget classic multitasking on microcontrollers unless you really have nothing better to do with your time.
Now, there are two usual approaches that are frequently used instead, with far less hassle.
Interrupts
Most microcontrollers have different interrupt sources, often including timers. So the main loop runs one task continuously, and when the timer counts down to zero, an interrupt is issued. The main loop is stopped and execution jumps to an address known as the 'interrupt vector'. There, a different procedure runs, performing a different one-off task. Once that finishes (resetting the timer if need be), you return from the interrupt and the main loop resumes.
Microcontrollers often have a few timers, and you can assign one task per timer, not to mention tasks on other, external interrupts (say, keyboard input - key pressed, or data arriving over RS232.)
While this approach is very limited, it really suffices for the great majority of cases - specifically yours: set up the timer to fire every second, on each interrupt recalculate the new time and update the display, then leave the interrupt. In the main loop, wait for the date to reach the birthday, and when it does, start playing the music and blinking the LEDs.
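As a rough sketch (Keil C51 syntax; the reload values assume a 12 MHz clock and are only illustrative), the whole clock can be driven from a timer interrupt while the main loop does something else:
/* Sketch only: Keil C51 syntax, reload values assume a 12 MHz clock.      */
/* Timer 0 fires every 50 ms; 20 overflows make one second.                */
#include <reg51.h>

volatile unsigned char ticks = 0;
volatile unsigned char hours = 0, minutes = 0, seconds = 0;

void timer0_isr(void) interrupt 1          /* Timer 0 overflow vector        */
{
    TH0 = 0x3C; TL0 = 0xB0;                /* reload for ~50 ms at 12 MHz    */
    if (++ticks >= 20) {                   /* 20 x 50 ms = 1 s               */
        ticks = 0;
        if (++seconds >= 60) { seconds = 0;
            if (++minutes >= 60) { minutes = 0;
                if (++hours >= 24) hours = 0;
            }
        }
    }
}

void main(void)
{
    TMOD = 0x01;                           /* Timer 0, 16-bit mode           */
    TH0 = 0x3C; TL0 = 0xB0;
    ET0 = 1; EA = 1; TR0 = 1;              /* enable interrupt, start timer  */
    while (1) {
        /* main loop: refresh the 7-segment display, play music, blink LEDs... */
    }
}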
Cooperative multitasking
This is how it was done in the early days. You need to write your 'tasks' as subroutines, each with a finite state machine (or a single pass of a loop) inside, and the "OS" is a simple loop of jumps to consecutive tasks, in sequence.
After each jump the MPU starts executing the given task and continues until the task returns control, first saving its state so it can recover it when started again. Each pass of a task's job should be very short. Any delay loops must be replaced with wait states in the finite state machine (if the condition is not satisfied, return; if it is, change the state). All longer loops must be unrolled into distinct states ("State: copying a block of data; copy byte N, increment N; N == end? yes: next state, no: return control").
Writing that way is more difficult, but the solution is more robust. In your case you might have four tasks:
clock
display update
play sound
blink LED
Clock returns control if no new second arrived. If it did, it recalculates the number of seconds, minutes, hours, date, and then returns.
Display update refreshes the displayed values. If you multiplex over the digits of the 7-segment display, each pass updates one digit, the next pass the next one, and so on.
Playing sound waits (yields) while it's not the birthday. If it is the birthday, it picks the next sample value from memory, outputs it to the speaker, and yields - optionally yielding immediately if it was called earlier than the next sample is due.
Blinking - well, output the right state to LED, yield.
Very short loops - say, 10 iterations of 5 lines - are still allowed, but anything longer should be transformed into a state of the finite state machine that each task effectively is.
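A minimal sketch of such a loop (the task and variable names are made up for illustration; second_elapsed would be set by a timer interrupt configured elsewhere):
/* Cooperative scheduler sketch: names are made up; each task keeps its own     */
/* state in static variables and returns (yields) quickly.                      */
volatile unsigned char second_elapsed;    /* set to 1 by a timer interrupt elsewhere */

void task_clock(void)
{
    if (!second_elapsed) return;          /* nothing to do yet: yield                */
    second_elapsed = 0;
    /* recalculate seconds, minutes, hours, date here */
}

void task_display(void)
{
    static unsigned char digit;
    /* output one multiplexed digit here, then yield; the next call does the next digit */
    digit = (digit + 1) % 6;
}

void task_sound(void)  { /* if it's the birthday: output the next sample, then yield */ }
void task_blink(void)  { /* set the LED according to the current phase, then yield   */ }

void main(void)                           /* the whole "OS" is this loop             */
{
    for (;;) {
        task_clock();
        task_display();
        task_sound();
        task_blink();
    }
}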
Now, if you're feeling hardcore, you may try going about...
pre-emptive multitasking.
Each task is a procedure that would normally execute indefinitely, doing just its own thing: written normally, trying not to step on other procedures' memory, but otherwise using resources as if there were nothing else in the world that could need them.
Your OS task is launched from a timer interrupt.
Upon being started by the interrupt, the OS task must save all the current volatile state of the last task - registers, the interrupt return address (from which the task should be resumed), the current stack pointer - keeping it all in a record belonging to that task.
Then, using the scheduler algorithm, it picks from the list another process that should run now, restores all of its state, and overwrites its own return-from-interrupt address with the address where that process left off when it was previously preempted. Upon ending the interrupt, normal operation of that process resumes, until another interrupt switches control back to the OS.
As you can see, there's a lot of overhead, with saving and restoring the complete state of the program instead of just what the task needs at the moment, but the program doesn't need to be written as a finite state machine - normal sequential style suffices.
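Purely as an illustration (the actual register save/restore is CPU-specific assembly and is only hinted at in comments), the bookkeeping looks roughly like this:
/* Illustration only: the save/restore steps are CPU-specific and shown as comments. */
#define NUM_TASKS 4

struct task_state {
    unsigned int pc;        /* where to resume this task            */
    unsigned int sp;        /* its stack pointer                    */
    /* ... saved registers ... */
};

struct task_state tasks[NUM_TASKS];
unsigned char current = 0;

void os_timer_interrupt(void)
{
    /* 1. save registers, return address and stack pointer of the       */
    /*    interrupted task into tasks[current] (assembly on a real MCU) */

    current = (current + 1) % NUM_TASKS;   /* trivial round-robin scheduler */

    /* 2. restore registers and stack pointer from tasks[current], then */
    /*    return-from-interrupt to tasks[current].pc                    */
}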

While SF provides an excellent overview of multitasking, most microcontrollers also have additional hardware that lets them do things simultaneously.
Illusion of simultaneous execution - Technically your professor is correct: truly simultaneous updates cannot be done. However, processors are very fast. They can execute many tasks sequentially, such as updating each 7-segment display one at a time, but do it so fast that human perception cannot tell the displays were updated one after another. The same applies to sound: most audible sound is in the kilohertz range while processors run in the megahertz range, so the processor has plenty of time to play part of a sound, do something else, then return to playing the sound without your ear being able to detect the difference.
Interrupts - SF covered the execution of interrupts well, so I'll gloss over the mechanics and talk more about hardware. Most microcontrollers have small hardware modules that operate simultaneously with instruction execution. Timers, UARTs, and SPI are common modules that perform a specific action while the main portion of the processor carries out instructions. When a given module completes its task it notifies the processor, and the processor jumps to the interrupt code for that module. This mechanism allows you to do things like transmit a byte over UART (which is relatively slow) while executing instructions.
PWM - PWM (Pulse Width Modulation) is a hardware module that essentially generates a rectangular wave: a repeating high phase and low phase, which don't have to be equal (I am simplifying here). One can be longer than the other, or they can be the same length. You configure the lengths in hardware and the PWM then generates the waveform continuously. This module can be used to drive motors or even generate sound, where the speed of the motor or the frequency of the sound depends on the ratio of the two phases. To play music, a processor only needs to change the ratio when it is time for the note to change (perhaps based on a timer interrupt) and can execute other instructions in the meantime (see the short sketch after this list).
DMA - DMA (Direct Memory Access) is a specific type of hardware that automatically copies bytes from one memory location to another. Something like an ADC might continuously write a converted value to a specific register in memory. A DMA controller can be configured to read continuously from one address (the ADC output) while writing sequentially to a range of memory (like the buffer to receive multiple ADC conversions before averaging). All of this happens in hardware while the main processor executes instructions.
Timers, UART, SPI, ADC, etc - There are many other hardware modules (too many to cover here) that perform a specific task simultaneously with program execution.
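As an illustration of the PWM idea above (pwm_set_period() and PWM_CLOCK_HZ are invented placeholders, not a particular chip's API), playing a melody reduces to updating the period whenever a note timer expires:
/* Sketch with invented names: pwm_set_period() and PWM_CLOCK_HZ stand in for   */
/* whatever your chip's PWM driver/registers provide.                           */
#define PWM_CLOCK_HZ 1000000UL
extern void pwm_set_period(unsigned long counts);   /* hypothetical driver call */

static const unsigned int melody_hz[] = { 440, 494, 523, 587 };   /* A4, B4, C5, D5 */
static unsigned char note = 0;

void note_timer_interrupt(void)           /* fires when the current note should end */
{
    note = (note + 1) % (sizeof melody_hz / sizeof melody_hz[0]);
    pwm_set_period(PWM_CLOCK_HZ / melody_hz[note]);  /* hardware keeps toggling the pin */
    /* the CPU is free to do anything else until the next note change */
}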
TL;DR - While program instructions can only be executed sequentially, the processor can usually execute them fast enough that they appear to happen simultaneously. Meanwhile, most microcontrollers have additional hardware that accomplishes specific tasks simultaneously with program execution.

The answers by Zack and SF. nicely cover the big picture. But sometimes a working example is valuable.
While I could glibly suggest browsing the source kit to the Linux kernel (which is both open source and provides multitasking even on single-core machines), that is not the best place to start for an understanding of how to actually implement a scheduler.
A much better place to start is with the source kit of one of the hundreds (if not thousands) of real-time operating systems. Many of these are open source, and most can run even on extremely small processors, including the 8051. I'll describe Micrium's uC/OS-II here in more detail because it has a typical set of features and it is the one I've used extensively. Others I've evaluated in the past include OS-9, eCos, and FreeRTOS. With those names as a starting point, along with keywords like "RTOS", Google will reward you with the names of many others.
My first reach for an RTOS kernel would be uC/OS-II (or its newer family member uC/OS-III). This is a commercial product that started life as an educational exercise for readers of Embedded Systems Design magazine. The magazine articles and their attached source code became the subject of one of the better books on the subject. The OS is open source, but does carry license restrictions on commercial use. In the interest of disclosure, I am the author of the port of uC/OS-II to the ColdFire MCF5307.
Since it was originally written as an educational tool, the source code is well documented. The text book (as of the 2nd edition on my shelf here somewhere, at least) is well written as well, and goes into a lot of theoretical background on each of the features it supports.
I successfully used it in several product development projects, and would consider it again for a project that needs multitasking but does not need to carry the weight of a full OS like Linux.
uC/OS-II provides a preemptive task scheduler, along with a useful collection of inter-task communications primitives (semaphore, mutex, mailbox, message queue), timers, and a thread-safe pooled memory allocator.
It also supports task priority, and includes deadlock prevention if used correctly.
It is written entirely in a subset of standard C (meeting almost all requirements of the MISRA-C:1998 guidelines), which helped make it possible for it to receive a variety of safety-critical certifications.
While my applications were never in safety-critical systems, it was comforting to know that the OS kernel on which I was standing had achieved those ratings. It provided assurance that the most likely reason I had a bug was either a misunderstanding of how a primitive worked or, more likely, a bug in my own application logic.
Most RTOSes (and uC/OS-II especially) are able to run in limited resources. uC/OS-II can be built in as little as 6KB of code, and with as little as 1KB of RAM required for OS structures.
The bottom line is that apparent concurrency can be achieved in a variety of ways, and one of those ways is to use an OS kernel designed to schedule and execute each concurrent task by sharing the resources of the sequential CPU among all the tasks. For simple cases, all you might need is interrupt handlers and a main loop, but when your requirements grow to the point of implementing several protocols, managing a display, handling user input, background computation, and monitoring overall system health, standing on a well-designed RTOS kernel with known-good communications primitives can save a lot of development and debugging effort.
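To give a flavor of the uC/OS-II API (stack sizes and priorities below are arbitrary examples; the exact startup details depend on the port):
/* Sketch of uC/OS-II startup; stack sizes and priorities are arbitrary examples. */
#include "ucos_ii.h"

#define TASK_STK_SIZE 128
static OS_STK clock_stk[TASK_STK_SIZE];
static OS_STK display_stk[TASK_STK_SIZE];

static void clock_task(void *p_arg)
{
    (void)p_arg;
    for (;;) {
        /* update the time of day here ... */
        OSTimeDly(OS_TICKS_PER_SEC);               /* sleep one second   */
    }
}

static void display_task(void *p_arg)
{
    (void)p_arg;
    for (;;) {
        /* refresh the 7-segment digits here ... */
        OSTimeDly(1);                              /* yield for one tick */
    }
}

void main(void)
{
    OSInit();
    OSTaskCreate(clock_task,   (void *)0, &clock_stk[TASK_STK_SIZE - 1],   4);
    OSTaskCreate(display_task, (void *)0, &display_stk[TASK_STK_SIZE - 1], 5);
    OSStart();                                     /* never returns      */
}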

Well, I see a lot of ground covered by other answers; so, hopefully I don't end up turning this into something bigger than I intend. (TL;DR: Girl to the rescue! :D). But, I do have (what I believe to be) a very good solution to offer; so I hope you can make use of it. I only have a small amount of experience with the 8051[☆]; although I did work for ~3 months (plus ~3 more full-time) on another microcontroller, with moderate success. In the course of that I ended up doing a little bit of almost everything the little thing had to offer: serial communications, SPI, PWM signals, servo control, DIO, thermocouples, and so forth. While I was working on it, I lucked out and came across an excellent (IMO) solution for (cooperative) 'thread' scheduling, which mixed well with some small amount of additional real-time stuff done off of interrupts on the PIC. And, of course, other interrupt handlers for the other devices.
pt_thread: Invented by Adam Dunkels (with Oliver Schmidt), with v1.0 released in Feb. 2005. His site is a great introduction to them, and includes downloads through v1.4 from Oct. 2006. I am very glad to have gone to look again, because there's an item from Jan. 2009 stating that Larry Ruane used event-driven techniques "[for] a complete reimplementation [using GCC; and with] a very nice syntax", available on SourceForge. Unfortunately, it looks like there have been no updates to either since around 2009, but the 2006 version served me very well. The last news item (from Dec. 2009) notes that "Sonic Unleashed" indicated in its manual that protothreads were used!
One of the things that I think is awesome about pt_threads is that they're so simple; whatever the benefits of the newer (Ruane) version, it's certainly more complex. Although that one may well be worth a look, I am going to stick with Dunkels' original implementation here. His original pt_threads "library" consists of five header files - and even that seems like an overstatement: once I minified a few macros and other things, removed the doxygen sections and examples, and culled the comments down to the bare minimum I still felt gave an explanation, it clocks in at just around 115 lines (included below).
There are examples included with the source tarball, and a very nice .pdf document (or .html) is available on his site (linked above). But let me walk through a quick example to elucidate some of the concepts. (Not the macros themselves - it took me a while to grok those, and they aren't really necessary just to use the functionality. :D)
Unfortunately, I've run out of time for tonight, but I will try to get back on at some point tomorrow to write up a little example; either way, there are a ton of resources on his website, linked above. It's a fairly straightforward procedure; the tricky part for me (as I suppose it is with any cooperative multi-threading; Win 3.1 anyone? :D) was ensuring that I had properly cycle-counted the code, so as not to overrun the time I needed to process the next thing before yielding the pt_thread.
I hope this gives you a start; let me know how it goes if you try it out!
FILE: pt.h
#ifndef __PT_H__
#define __PT_H__
#include "lc.h"
// NOTE: the enums are mine to compress space; originally all were #defines
enum PT_STATUS_ENUM { PT_WAITING, PT_YIELDED, PT_EXITED, PT_ENDED };
struct pt { lc_t lc; }; // protothread control structure (pt_thread)
#define PT_INIT(pt) LC_INIT((pt)->lc) // initializes pt_thread prior to use
// you can use this to declare pt_thread functions
#define PT_THREAD(name_args) char name_args
// NOTE: looking at this, I think I might define my own macro as follows, so as not
// to have to redeclare the struct pt *pt every time.
//#define PT_DECLARE(name, args) char name(struct pt *pt, args)
// start/end pt_thread (inside implementation fn); must always be paired
#define PT_BEGIN(pt) { char PT_YIELD_FLAG = 1; LC_RESUME((pt)->lc)
#define PT_END(pt) LC_END((pt)->lc);PT_YIELD_FLAG = 0;PT_INIT(pt);return PT_ENDED;}
// {block, yield} 'pt' {until,while} 'c' is true
#define PT_WAIT_UNTIL(pt,c) do { \
LC_SET((pt)->lc); if(!(c)) {return PT_WAITING;} \
} while(0)
#define PT_WAIT_WHILE(pt, cond) PT_WAIT_UNTIL((pt), !(cond))
#define PT_YIELD_UNTIL(pt, cond) \
do { PT_YIELD_FLAG = 0; LC_SET((pt)->lc); \
if((PT_YIELD_FLAG == 0) || !(cond)) { return PT_YIELDED; } } while(0)
// NOTE: no corresponding "YIELD_WHILE" exists; oversight? [shelleybutterfly]
//#define PT_YIELD_WHILE(pt,cond) PT_YIELD_UNTIL((pt), !(cond))
// block pt_thread 'pt', waiting for child 'thread' to complete
#define PT_WAIT_THREAD(pt, thread) PT_WAIT_WHILE((pt), PT_SCHEDULE(thread))
// spawn pt_thread 'child' as a child of 'pt', waiting until 'thread' exits
#define PT_SPAWN(pt,child,thread) do { \
PT_INIT((child)); PT_WAIT_THREAD((pt),(thread)); } while(0)
// block and cause pt_thread to restart its execution at its PT_BEGIN()
#define PT_RESTART(pt) do { PT_INIT(pt); return PT_WAITING; } while(0)
// exit the pt_thread; if a child, then parent will unblock and run
#define PT_EXIT(pt) do { PT_INIT(pt); return PT_EXITED; } while(0)
// schedule pt_thread: returns non-zero while pt is running, 0 once it has exited
#define PT_SCHEDULE(f) ((f) < PT_EXITED)
// yield from the current pt_thread until it is scheduled again
#define PT_YIELD(pt) do { PT_YIELD_FLAG = 0; LC_SET((pt)->lc); \
if(PT_YIELD_FLAG == 0) { return PT_YIELDED; } } while(0)
#endif /* __PT_H__ */
FILE: lc.h
#ifndef __LC_H__
#define __LC_H__
#ifdef LC_INCLUDE
#include LC_INCLUDE
#else
#include "lc-switch.h"
#endif /* LC_INCLUDE */
#endif /* __LC_H__ */
FILE: lc-switch.h
// WARNING: implementation using switch() won't work with an LC_SET() inside a switch()!
#ifndef __LC_SWITCH_H__
#define __LC_SWITCH_H__
typedef unsigned short lc_t;
#define LC_INIT(s) s = 0;
#define LC_RESUME(s) switch(s) { case 0:
#define LC_SET(s) s = __LINE__; case __LINE__:
#define LC_END(s) }
#endif /* __LC_SWITCH_H__ */
FILE: lc-addrlabels.h
#ifndef __LC_ADDRLABELS_H__
#define __LC_ADDRLABELS_H__
typedef void * lc_t;
#define LC_INIT(s) s = NULL
#define LC_RESUME(s) do { if(s != NULL) { goto *s; } } while(0)
#define LC_CONCAT2(s1, s2) s1##s2
#define LC_CONCAT(s1, s2) LC_CONCAT2(s1, s2)
#define LC_END(s)
#define LC_SET(s) \
do {LC_CONCAT(LC_LABEL, __LINE__):(s)=&&LC_CONCAT(LC_LABEL,__LINE__);} while(0)
#endif /* __LC_ADDRLABELS_H__ */
FILE: pt-sem.h
#ifndef __PT_SEM_H__
#define __PT_SEM_H__
#include "pt.h"
struct pt_sem { unsigned int count; };
// macros to initialize, wait on, and signal a pt_sem semaphore
#define PT_SEM_INIT(s, c) (s)->count = c
#define PT_SEM_WAIT(pt, s) do \
{ PT_WAIT_UNTIL(pt, (s)->count > 0); --(s)->count; } while(0)
#define PT_SEM_SIGNAL(pt, s) ++(s)->count
#endif /* __PT_SEM_H__ */
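A minimal usage sketch (assuming the headers above, with clock_ticks incremented by a timer interrupt configured elsewhere) would look something like this:
/* Usage sketch: clock_ticks is assumed to be incremented by a timer interrupt. */
#include "pt.h"

static volatile unsigned int clock_ticks;
static struct pt pt_blink, pt_clock;

static PT_THREAD(blink_thread(struct pt *pt))
{
    static unsigned int last;
    PT_BEGIN(pt);
    for (;;) {
        PT_WAIT_UNTIL(pt, clock_ticks - last >= 50);   /* ~0.5 s, say        */
        last = clock_ticks;
        /* toggle the LED here */
    }
    PT_END(pt);
}

static PT_THREAD(clock_thread(struct pt *pt))
{
    static unsigned int last;
    PT_BEGIN(pt);
    for (;;) {
        PT_WAIT_UNTIL(pt, clock_ticks - last >= 100);  /* one second, say    */
        last = clock_ticks;
        /* increment seconds/minutes/hours here */
    }
    PT_END(pt);
}

int main(void)
{
    PT_INIT(&pt_blink);
    PT_INIT(&pt_clock);
    for (;;) {                        /* the "scheduler" is just a loop      */
        blink_thread(&pt_blink);
        clock_thread(&pt_clock);
    }
}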
[☆] About a week learning about microcontrollers[†] and a week playing with it during an evaluation to see if it could meet our needs for a little line-replaceable remote I/O unit. (Long story short: no.)
[†] The 8051 Microcontroller, Third Edition was suggested to me as the 8051 programming "bible". I don't know if it is or not, but I was certainly able to get my head around things using it.[‡]
[‡] and even looking over it again now I don't see much not to like about it. :) well, I mean... I wish I hadn't bought two copies; but they were so cheap!
LICENSE AGREEMENT (where applicable)
This post contains code based on (or taken from) 'The Protothreads Library' (referred to herein and henceforth as "PTLIB";
including v1.4 and earlier revisions) relying extensively on the source code as well as the documentation for PTLIB.
PTLIB original source code and documentation was received from, and freely available for download at the author's PTLIB site
'http://dunkels.com/adam/pt/', available through a link on the downloads page at 'http://dunkels.com/adam/pt/download.html'
or directly via 'http://dunkels.com/adam/download/pt-1.4.tar.gz'.
This post consists of original text, for which I hereby give to you (with love!) under a full waiver of whatever copyright
interest I may have, under the following terms: "copyheart ♥ 2014, shelleybutterfly, share with love!"; or, if you prefer,
a fully non-restrictive, attribution-only license appropriate to the material (such as Apache 2.0 for software; or CC-BY
license for text) so that you may use it as you see fit, so that it may best suit your needs.
This post also contains source code, almost entirely created from the original source by removing explanatory material,
reformatting, and paraphrasing the in-line documentation/comments, as well as a few modifications/additions by me
(shelleybutterfly on the stackexchange network). Anything derivative of PTLIB for which I may have, legally, gained any
copyright or other interest, I hereby cede all such interest back to and all copyright interest in the original work to
the original copyright holder, as specified in the license from PTLIB, which follows this agreement.
In any jurisdiction where it is not possible for the terms above to apply to you for whatever reason, then, for whatever
interest I have in the material, I hereby offer it to you under any non-restrictive, attribution-only, license of your
choosing; or, should this also not be possible, then I give permission to stack exchange inc to provide it to you under
whatever terms they determine to be acceptable in your jurisdiction.
All source code from PTLIB, and that which is derivative of PTLIB, that is not covered under other terms detailed above
hereby provided to 'stack exchange inc' and to you under the following agreement:
LICENSE AGREEMENT for "The Protothreads Library"
Copyright (c) 2004-2005, Swedish Institute of Computer Science. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the
following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following
disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the Institute nor the names of its contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE INSTITUTE AND CONTRIBUTORS `AS IS' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
EVENT SHALL THE INSTITUTE OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
Author: Adam Dunkels

There are some really good answers here, but a little more context regarding your birthday card example might be a good lead-in before digging into the longer answers.
The way a single CPU can seem to do multiple things at once is by rapidly switching between tasks, as well as employing the assistance of timers, interrupts and independent hardware units that can do things without the CPU's involvement (see Zack's answer for a nice discussion and starter list of such hardware). So for your birthday card, the CPU could tell a bit of audio hardware "play this chunk of sound", then go blink the LED, then come back and load the next bit of sound before the first portion has finished playing. In this situation, the CPU might take, say, 1 ms to load audio that plays for 5 ms of real time, leaving 4 ms to do something else before loading the next bit of sound.
The digital clock might beep by setting up a bit of PWM hardware to output some frequency to a piezo buzzer and a timer interrupt to stop the beep, then go off and check a real-time counter to see if the time display LEDs need to be updated. When the timer fires the interrupt, your code shuts off the PWM.
The details vary according to the hardware of the chip, and going over the datasheet is the way to find out what capabilities a given microcontroller has and how to access them.

I have had good experiences with FreeRTOS, even though it uses a fair bit of memory. FreeRTOS gives you true preemptive threading, there are tons of ports if you ever want to upgrade those dusty old 8051s, there are semaphores and message queues and priorities and all kinds of stuff, and it's totally free. I've only worked with the Arduino port personally, but it seems to be one of the most popular of the free RTOSes.
I think they sell a book that isn't free, but there's enough info on their website and in the Arduino examples to pretty much figure it out.

Related

How to use perf or other utilities to measure the elapsed time of a loop/function

I am working on a project requiring profiling the target applications at first.
What I want to know is the exact time consumed by a loop body/function. The platform is a BeagleBone Black board running Debian, with perf_4.9 installed.
gettimeofday() can provide microsecond resolution, but I still want more accurate results. It looks like perf can give cycle statistics and thus be a good fit for my purposes. However, perf seems to only analyze the whole application rather than individual loops/functions.
I tried the instructions posted in "Using perf probe to monitor performance stats during a particular function", but it did not work well.
I am just wondering if there is any example application in C I can test and use on this board for my purpose. Thank you!
Caveat: This is more of a comment than an answer, but it's a bit too long for just a comment.
Thanks a lot for advising a new function. I tried that but get a little unsure about its accuracy. Yes, it can offer nanosecond resolution but there is inconsistency.
There will be inconsistency if you use two different clock sources.
What I do is first use clock_gettime() to measure a loop body; the approximate elapsed time measured this way is around 1.4 µs. Then I put GPIO instructions, pull high and pull low, at the beginning and end of the loop body respectively, and measure the signal frequency on this GPIO with an oscilloscope.
A scope is useful if you're trying to debug the hardware. It can also show what's on the pins. But, in 40+ years of doing performance measurement/improvement/tuning, I've never used it to tune software.
In fact, I would trust the CPU clock more than I would trust the scope for software performance numbers.
For a production product, you may have to measure performance on a system deployed at a customer site [because the issue only shows up on that one customer's machine]. You may have to debug this remotely and can't hook up a scope there. So, you need something that can work without external probe/test rigs.
To my surprise, the frequency is around 1.8MHz, i.e., ~500ns. This inconsistency makes me a little confused... – GeekTao
The difference could be just round off error based on different time bases and latency in getting in/out of the device (GPIO pins). I presume you're just using GPIO in this way to facilitate benchmark/timing. So, in a way, you're not measuring the "real" system, but the system with the GPIO overhead.
In tuning, one is less concerned with absolute values than relative ones. That is, clock_gettime is ultimately based on a number of high-resolution clock ticks (at 1 ns/tick or better, from the system's free-running TSC (time stamp counter)). What the clock frequency actually is doesn't matter as much. If you measure a loop/function and get a duration of X, then change some code and get X+n, that tells you whether the code got faster or slower.
500ns isn't that large an amount. Almost any system wide action (e.g. timeslicing, syscall, task switch, etc.) could account for that. Unless you've mapped the GPIO registers into app memory, the syscall overhead could dwarf that.
In fact, just the overhead of calling clock_gettime could account for that.
Although clock_gettime is technically a syscall, Linux maps the code directly into the application via the vDSO mechanism, so there is no syscall overhead. But even the userspace code has some calculations to do.
For example, I have two x86 PCs. On one system the overhead of the call is 26 ns. On another system, the overhead is 1800 ns. Both these systems are at least 2GHz+
For your beaglebone/arm system, the base clock rate may be less, so overhead of 500 ns may be ballpark.
I usually benchmark the overhead and subtract it out from the calculations.
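For example, a rough way to measure that overhead is to time back-to-back clock_gettime calls and keep the minimum:
// Rough sketch: measure the cost of clock_gettime itself by calling it back-to-back.
// (On very old glibc you may need to link with -lrt.)
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    long best = 1000000000L;

    for (int i = 0; i < 100000; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long d = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
        if (d < best)
            best = d;            /* keep the minimum as the overhead estimate */
    }
    printf("clock_gettime overhead ~ %ld ns\n", best);
    return 0;
}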
And, on the x86, the actual code just gets the CPU's TSC value (via the rdtsc instruction) and does some adjustment. ARM has a similar hardware register, but it requires special care to map userspace access to it (a coprocessor instruction, IIRC).
Speaking of ARM, I was doing a commercial ARM product (an nVidia Jetson, to be exact). We were very concerned about the latency of incoming video frames.
The hardware engineer didn't trust TSC [or software in general ;-)] and was trying to use a scope and an LED [controlled by a GPIO pin]: the moment the LED flash/pulse showed up inside the video frame (i.e., the coordinates of the white dot in the frame) was [effectively] the time measurement.
It took a while to convince the engineer, but, eventually I was able to prove that the clock_gettime/TSC approach was more accurate/reliable.
And, certainly, easier to use. We had multiple test/development/SDK boards but could only hook up the scope/LED rig on one at a time.

Finding a source of entropy in an embedded system?

For a small embedded device (TI MSP430F2274), I am trying to create a Pseudo Random Number Generator (PRNG), but I am having difficulty identifying a potential source of entropy to use as a seed. Unfortunately, the device does not have enough available memory space to include time.h and incorporate the function srand(time(0)). Has anyone had experience with this family of devices, or has incorporated a PRNG in an embedded device that is constrained by its memory space? Thank you.
Your part (MSP430F2274) does not offer any generic solution or support, but your application may do so. Any unpredictable and asynchronous external event or value that is guaranteed to occur or be available before or exactly when you need it can be used.
For example, the part has a pair of 16-bit timers. With one of these running, on detection of some asynchronous trigger event such as a user button press, the value of the counter at that moment may be used as a seed.
Alternatively, if you have an analogue input with a continuously varying and asynchronous signal, simply read that value at any time, perhaps taking multiple samples spaced over a suitable time interval to build a larger seed if necessary.
Even without a specific signal, an otherwise unused ADC input channel is likely to have sufficient noise to make its least significant bit unpredictable - you might concatenate the LSB from a number of independent samples to generate a seed of the required length.
Essentially any unpredictable external event that your application already supports may suffice. Without details of your application it is not possible to advise specifically, but given that this is specifically a mixed-signal microcontroller, there will presumably be some suitable external unpredictability?
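A sketch of the ADC least-significant-bit idea (adc_read_channel() and UNUSED_CHANNEL are placeholders for whatever your ADC driver provides):
/* Sketch: adc_read_channel() and UNUSED_CHANNEL are placeholders for your ADC driver. */
extern unsigned int adc_read_channel(unsigned char channel);   /* hypothetical driver call */
#define UNUSED_CHANNEL 7                                        /* any floating/noisy input */

unsigned int gather_seed(void)
{
    unsigned int seed = 0;
    unsigned char i;
    for (i = 0; i < 16; i++) {
        seed = (seed << 1) | (adc_read_channel(UNUSED_CHANNEL) & 1u);  /* keep the noisy LSB */
        /* optionally wait a little between samples so they are less correlated */
    }
    return seed;    /* feed this to srand() or your own PRNG */
}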
If you have multiple clock sources (and the MSP430F2274 seems to have that at a glance), you can use the unpredictable drift between these sources for entropy if you absolutely have nothing better.
The way to do it is to use two sources: one as a time base, measuring ticks of the other during a fixed period. The count of ticks will vary slightly because the two clock sources are independent. Depending on what options are available, this may be done with timers; otherwise even the watchdog could be an option, configured as an interval timer (if nothing else, it is usually capable of running on a different clock source than the main clock).
This method may require some time to set up: the clocks don't deviate much from their specified frequency, so you need to wait relatively long to gather a meaningful amount of random deviation between them (a second or so may be sufficient).
Otherwise, as Clifford mentioned, you could gather entropy from your environment, which is definitely superior if you have such an environment available. The one good thing about this method (drift between clock sources) is that it is very likely available in just about any setup.
By the way, you cannot do srand(time(0)) anyway: where did you expect time() to get the count of seconds since the epoch on a microcontroller? :)

Beginner - While() - Optimization

I am new to embedded development, and a while ago I read some code for a PIC24xxxx.
void i2c_Write(char data) {
    while (I2C2STATbits.TBF) {};      // wait until the transmit buffer is free
    IFS3bits.MI2C2IF = 0;             // clear the master I2C interrupt flag
    I2C2TRN = data;                   // load the byte to transmit
    while (I2C2STATbits.TRSTAT) {};   // wait until the transmission completes
    Nop();
    Nop();
}
What do you think about the while conditions? Doesn't the microcontroller use a lot of CPU for that?
I asked myself this question and surprisingly saw a lot of similar code on the internet.
Is there not a better way to do it?
What about the Nop() too, why two of them?
Generally, there are two ways to interact with hardware:
Busy wait
Interrupt based
In your case, in order to interact with the I2C device, your software first waits until the TBF bit is cleared, which means the I2C peripheral is ready to accept a byte to send.
Then your software writes the byte into the device and waits until the TRSTAT bit is cleared, meaning the data has been correctly processed by the I2C peripheral.
The code you are showing is written with busy-wait loops, meaning that the CPU is actively waiting for the hardware. This is indeed a waste of resources, but in some cases (e.g. the I2C interrupt line is not connected or not available) it is the only way to do it.
If you used interrupts, you would ask the hardware to tell you whenever a given event happens - for instance, the TBF bit being cleared.
The advantage of that is that, while the hardware is doing its thing, you can continue doing other work, or just sleep to save battery.
I'm not an expert in I2C, so the interrupt events I have described are most likely not accurate, but that gives you an idea of why you see two while loops.
Now, regarding the pros and cons: an interrupt-based implementation is more efficient but more difficult to write, since you have to process asynchronous events coming from the hardware; a busy-wait implementation is easy to write but wastes CPU time - which might still be fast enough for you. A rough sketch of the interrupt-based approach follows below.
Finally, I have no idea why the two Nop()s are needed there. Most likely a tweak needed because the CPU would otherwise move on too fast.
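Here is a rough sketch of what the interrupt-driven version could look like (XC16 style; the exact ISR name and IEC3bits enable bit are assumptions - check them against your device's header and datasheet, and real code would also handle ACK and error conditions):
/* Rough sketch of an interrupt-driven write (XC16 style); the ISR name and the */
/* IEC3bits enable bit are assumptions - verify them for your specific PIC24.   */
#include <xc.h>

static volatile const char *tx_buf;     /* data still to be sent    */
static volatile int tx_len;             /* bytes remaining          */

void i2c_Write_async(const char *data, int len)
{
    tx_buf = data + 1;
    tx_len = len - 1;
    IFS3bits.MI2C2IF = 0;               /* clear any pending flag                         */
    IEC3bits.MI2C2IE = 1;               /* enable the master I2C2 interrupt (assumption)  */
    I2C2TRN = data[0];                  /* kick off the first byte                        */
    /* the CPU is now free to run other code; the ISR feeds the remaining bytes */
}

void __attribute__((interrupt, no_auto_psv)) _MI2C2Interrupt(void)
{
    IFS3bits.MI2C2IF = 0;               /* acknowledge the interrupt */
    if (tx_len > 0) {
        I2C2TRN = *tx_buf++;            /* send the next byte        */
        tx_len--;
    }
}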
When doing these kinds of transactions (I2C/SPI) you find yourself in one of two situations: bit-banging, or some form of hardware assist. Bit-banging is easier to implement, read, and debug, and is often quite portable from one chip/family to the next, but it burns a lot of CPU. Microcontrollers are mostly there to be custom hardware - like a CPLD or FPGA that is easier to program; they are there to burn CPU cycles pretending to be hardware designs. With I2C or SPI you are trying to create a specific waveform on some number of I/O pins on the device, and at times latch the inputs. The bus has a spec and is sometimes slower than your CPU - sometimes not; sometimes once you add the software and compiler overhead you might end up not needing a timer for delays, you might be just slow enough. But ideally you look at the waveform and you simply create it: raise pin X, delay n ms, raise pin Y, delay n ms, drop pin Y, delay 2*n ms, and so on. Those delays can come from tuned loops (count from 0 to 1341) or from polling a timer until it reaches Z ticks of some clock. A massive waste of CPU, but the point is you are really just being programmable hardware, and hardware would be burning time waiting as well.
When you have a peripheral in your MCU that assists, it might do much or most of the timing for you, but maybe not all of it; perhaps you have to assert/deassert chip select yourself and then the SPI logic does the clock and data timing in and out for you. These peripherals are generally very specific to one family from one chip vendor - perhaps common across that vendor's chips, but never portable vendor to vendor - so there is a learning curve. And perhaps, in your case, if the CPU is fast enough it might be possible to start the next operation so soon that it violates the bus timing, so you have to kill more time (maybe that is why you have those Nop()s).
Think of an MCU as a software-programmable CPLD or FPGA and this waste makes a lot more sense. Unfortunately, unlike a CPLD or FPGA, you are single-threaded, so you can't be doing several trivial things in parallel with clock-accurate timing (exactly this many clocks: task A switches state and changes an output). Interrupts help, but it's not quite the same - change one line of code and your timing changes.
In this case, especially with the Nop()s, you should probably be using a scope anyway to see the I2C bus, and while you have it on the scope you can try with and without those calls to see how they affect the waveform. It could also be a case of a bug or quirk in the peripheral - maybe you can't hit some register too fast or the peripheral breaks. Or it could be a workaround for a bug in a chip from five years ago; the bug is long gone, but they just kept re-using the code. You will see that a lot in vendor libraries.
What do you think about the while condition? Doesn't the microcontroller use a lot of CPU for that?
No, since the transmit buffer won't stay full for very long.
I asked myself this question and surprisingly saw a lot of similar code in internet.
What would you suggest instead?
Is there not a better way to do it? (I hate crazy loops :D)
Not that I, you, or apparently anyone else knows of. In what way do you think it could be any better? The transmit buffer won't stay full long enough to make it useful to retask the CPU.
What about the Nop() too, why two of them?
The Nop's ensure that the signal remains stable long enough. This makes this code safe to call under all conditions. Without it, it would only be safe to call this code if you didn't mess with the i2c bus immediately after calling it. But in most cases, this code would be called in a loop anyway, so it makes much more sense to make it inherently safe.

Embedded Programming, Wait for 12.5 us

I'm programming on the C2000 F28069 Experimenters Kit. I'm toggling a GPIO output every 12.5 microseconds 5 times in a row. I decided I don't want to use interrupts (though I will if I absolutely have to). I want to just wait that amount of times in terms of clock cycles.
My clock is running at 80MHz, so 12.5 us should be 1000 clock cycles. When I use a loop:
for(i=0;i<1000;i++)
I get a result that is way too long (not 12.5 us). What other techniques can I use?
Is sleep(n); something that I can use on a microcontroller? If so, which header file do I need to download and where can I find it? Also, now that I think about it, sleep(n); takes an int input, so that wouldn't even work... any other ideas?
Summary: Use the PWM or Timer peripherals to generate output pulses.
First, the clock speed of the CPU has a complex relationship to actual code execution speed, and in many CPUs there is more than one clock rate involved in different stages of the execution. The chip you reference has several internal clock sources, for instance. Further, each individual instruction will likely take a different number of clocks to execute, and some cores can execute part of (or all of) several instructions simultaneously.
To rigorously create a loop that required 12.5 µs to execute without using a timing interrupt or other hardware device would require careful hand coding in assembly language along with careful accounting of the execution time of each instruction.
But you are writing in C, not assembler.
So the first question you have to ask is what machine code was actually generated for your loop. And the second question is did you enable the optimizer, and to what level.
As written, a decent optimizer will determine that the loop for (i=0; i<1000; i++) ; has no visible side effects, and therefore is just a slow way of writing ;, and can be completely removed.
If it does compile the loop, it could be written naively using perhaps as many as 5 instructions, or as few as one or two. I am not personally familiar with this particular TI CPU architecture, so I won't attempt to guess at the best possible implementation.
All that said, learning about the CPU architecture and its efficiency is important to building reliable and efficient embedded systems. But given that the chip has peripheral devices built-in that provide hardware support for PWM (pulse width modulated) outputs as well as general purpose hardware timer/counters you would be far better off learning to use the hardware to generate the waveform for you.
I would start by collecting every document available on the CPU core and its peripherals, especially app notes and sample code.
The C compiler will have an option to emit and preserve an assembly language source file. I would use that as a guide to study the structure of the code generated for critical loops and other bottlenecks, as well as the effects of the compiler's various optimization levels.
The tool suite should have a mechanism for profiling your running code. Before embarking on heroic measures in pursuit of optimizations, use that first to identify the actual bottlenecks. Even if it lacks decent profiling, you are likely to have spare GPIO pins that can be toggled around critical sections of code and measured with a logic analyzer or oscilloscope.
The chip you refer to has PWM (pulse width modulation) hardware advertised as one of its major features. You should rely on this; please refer to the appropriate application guide. Generally you cannot guarantee 12.5 µs periods from the application layer (and should not try to). Even if you managed to do so directly from the application layer, it would be a bad idea: any change in your firmware could break the timing.
If you use a timer peripheral with PWM output capability, as suggested by RBerteig already, then you can generate an accurate timing signal with zero software overhead. If you need to do other work synchronously with the clock, you can use the timer interrupt to trigger that too. However, if you process interrupts at an interval of 12.5 µs, you may find that your processor spends a great deal of time context switching rather than performing useful work.
If you simply want an accurate delay, then you should still use a hardware timer and poll its reload flag rather than process its interrupt. This allows consistent timing independent of the compiler's code generation or processor speed, and allows you to add other code within the loop without extending the total loop time. You would poll it in a loop during which you might do other work as well. The timing jitter and determinism will depend on what other work you do in the loop, but for an empty loop, reaction to the timer event will probably be faster than the latency of an interrupt handler.
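A sketch of that polling approach (timer_overflow_flag(), timer_clear_flag() and gpio_toggle() are placeholders, not the real C2000 register names; the timer is assumed to be configured elsewhere for a 12.5 µs period):
/* Sketch only: the three calls below are placeholders for the real timer/GPIO      */
/* registers; the timer is assumed to reload every 12.5 us (1000 cycles at 80 MHz). */
extern int  timer_overflow_flag(void);   /* placeholder: read the period/overflow flag */
extern void timer_clear_flag(void);      /* placeholder: acknowledge/clear that flag   */
extern void gpio_toggle(void);           /* placeholder: toggle the GPIO output pin    */

void pulse_train(void)
{
    int i;
    for (i = 0; i < 10; i++) {           /* 10 edges = 5 complete pulses               */
        while (!timer_overflow_flag())    /* wait for the hardware period to elapse     */
            ;                             /* short other work could go here instead     */
        timer_clear_flag();               /* acknowledge the reload                     */
        gpio_toggle();                    /* flip the output pin                        */
    }
}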

How do I ensure my program runs from beginning to end without interruption?

I'm attempting to time code using RDTSC (no other profiling software I've tried is able to time to the resolution I need) on Ubuntu 8.10. However, I keep getting outliers from task switches and interrupts firing, which are causing my statistics to be invalid.
Considering my program runs in a matter of milliseconds, is it possible to disable all interrupts (which would inherently switch off task switches) in my environment? Or do I need to go to an OS which allows me more power? Would I be better off using my own OS kernel to perform this timing code? I am attempting to prove an algorithm's best/worst case performance, so it must be totally solid with timing.
The relevant code I'm using currently is:
inline uint64_t rdtsc()
{
    uint64_t ret;
    asm volatile("rdtsc" : "=A" (ret));
    return ret;
}

void test(int readable_out, uint32_t start, uint32_t end, uint32_t (*fn)(uint32_t, uint32_t))
{
    int i;
    for(i = 0; i <= 100; i++)
    {
        uint64_t clock1 = rdtsc();
        uint32_t ans = fn(start, end);
        uint64_t clock2 = rdtsc();
        uint64_t diff = clock2 - clock1;

        if(readable_out)
            printf("[%3d]\t\t%u [%llu]\n", i, ans, diff);
        else
            printf("%llu\n", diff);
    }
}
Extra points to those who notice I'm not properly handling overflow conditions in this code. At this stage I'm just trying to get a consistent output without sudden jumps due to my program losing the timeslice.
The nice value for my program is -20.
So to recap, is it possible for me to run this code without interruption from the OS? Or am I going to need to run it on bare hardware in ring0, so I can disable IRQs and scheduling? Thanks in advance!
If you call nanosleep() to sleep for a second or so immediately before each iteration of the test, you should get a "fresh" timeslice for each test. If you compile your kernel with 100HZ timer interrupts, and your timed function completes in under 10ms, then you should be able to avoid timer interrupts hitting you that way.
To minimise other interrupts, deconfigure all network devices, configure your system without swap and make sure it's otherwise quiescent.
Tricky. I don't think you can turn the operating system 'off' and guarantee strict scheduling.
I would turn this upside down: given that it runs so fast, run it many times to collect a distribution of outcomes. Given that standard Ubuntu Linux is not a real-time OS in the narrow sense, all alternative algorithms would run in the same setup --- and you can then compare your distributions (using anything from summary statistics to quantiles to qqplots). You can do that comparison with Python, or R, or Octave, ... whichever suits you best.
You might be able to get away with running FreeDOS, since it's a single process OS.
Here's the relevant text from the second link:
Microsoft's DOS implementation, which is the de facto standard for DOS systems in the x86 world, is a single-user, single-tasking operating system. It provides raw access to hardware, and only a minimal layer for OS APIs for things like file I/O. This is a good thing when it comes to embedded systems, because you often just need to get something done without an operating system in your way.
DOS has (natively) no concept of threads and no concept of multiple, on-going processes. Application software makes system calls via the use of an interrupt interface, calling various hardware interrupts to handle things like video and audio, and calling software interrupts to handle various things like reading a directory, executing a file, and so forth.
Of course, you'll probably get the best performance actually booting FreeDOS onto actual hardware, not in an emulator.
I haven't actually used FreeDOS, but I assume that since your program seems to be standard C, you'll be able to use whatever the standard compiler is for FreeDOS.
If your program runs in milliseconds, and if you are running on Linux:
Make sure that your timer frequency (on linux) is set to 100Hz (not 1000Hz).
(cd /usr/src/linux; make menuconfig, and look at "Processor type and features" -> "Timer frequency")
This way your CPU will get interrupted every 10ms.
Furthermore, consider that the default CPU time slice on Linux is 100 ms, so with a nice level of -20 you will not get descheduled if you are only running for a few milliseconds.
Also, you are looping 101 times on fn(); consider calibrating your system by running the same loop with fn() replaced by a no-op.
Compute statistics (average + stddev) instead of printing every iteration (printing would consume your scheduled timeslice, and the terminal will eventually get scheduled, etc. - avoid that).
RDTSC benchmark sample code
You can use chrt -f 99 ./test to run ./test with the maximum realtime priority. Then at least it won't be interrupted by other user-space processes.
Also, installing the linux-rt package will install a real-time kernel, which will give you more control over interrupt handler priority via threaded interrupts.
If you run as root, you can call sched_setscheduler() and give yourself a real-time priority. Check the documentation.
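For example (standard Linux API; requires root or CAP_SYS_NICE):
// Give the current process SCHED_FIFO priority 99 (requires root / CAP_SYS_NICE).
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 99 };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    /* ... run the timing-sensitive benchmark here ... */
    return 0;
}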
Maybe there is some way to disable preemptive scheduling on linux, but it might not be needed. You could potentially use information from /proc/<pid>/schedstat or some other object in /proc to sense when you have been preempted, and disregard those timing samples.
