I am working on a project that first requires profiling the target applications.
What I want to know is the exact time consumed by a loop body or function. The platform is a BeagleBone Black board running Debian, with perf_4.9 installed.
gettimeofday() provides microsecond resolution, but I want more accurate results. It looks like perf can report cycle counts and would therefore be a good fit. However, perf can only analyze the whole application, not individual loops or functions.
I tried the instructions posted in "Using perf probe to monitor performance stats during a particular function", but they did not work well.
I am just wondering if there is any example application in C I can test and use on this board for my purpose. Thank you!
Caveat: This is more of comment than an answer but it's a bit too long for just a comment.
Thanks a lot for advising a new function. I tried that but get a little unsure about its accuracy. Yes, it can offer nanosecond resolution but there is inconsistency.
There will be inconsistency if you use two different clock sources.
What I do is first use clock_gettime() to measure a loop body; the approximate elapsed time is around 1.4us this way. Then I put GPIO instructions, pull high and pull down, at the beginning and end of the loop body, respectively, and measure the signal frequency on this GPIO with an oscilloscope.
A scope is useful if you're trying to debug the hardware. It can also show what's on the pins. But, in 40+ years of doing performance measurement/improvement/tuning, I've never used it to tune software.
In fact, I would trust the CPU clock more than I would trust the scope for software performance numbers
For a production product, you may have to measure performance on a system deployed at a customer site [because the issue only shows up on that one customer's machine]. You may have to debug this remotely and can't hook up a scope there. So, you need something that can work without external probe/test rigs.
To my surprise, the frequency is around 1.8MHz, i.e., ~500ns. This inconsistency makes me a little confused... – GeekTao
The difference could be just round off error based on different time bases and latency in getting in/out of the device (GPIO pins). I presume you're just using GPIO in this way to facilitate benchmark/timing. So, in a way, you're not measuring the "real" system, but the system with the GPIO overhead.
In tuning, one is less concerned with absolute values than relative. That is, clock_gettime is ultimately based on number of highres clock ticks (at 1ns/tick or better from the system's free running TSC (time stamp counter)). What the clock frequency actually is doesn't matter as much. If you measure a loop/function and get X duration. Then, you change some code and get X+n, this tells you whether the code got faster or slower.
500ns isn't that large an amount. Almost any system wide action (e.g. timeslicing, syscall, task switch, etc.) could account for that. Unless you've mapped the GPIO registers into app memory, the syscall overhead could dwarf that.
In fact, just the overhead of calling clock_gettime could account for that.
Although the clock_gettime is technically a syscall, linux will map the code directly into the app's code via the VDSO mechanism so there is no syscall overhead. But, even the userspace code has some calculations to do.
For example, I have two x86 PCs. On one system the overhead of the call is 26 ns. On another system, the overhead is 1800 ns. Both these systems are at least 2GHz+
For your beaglebone/arm system, the base clock rate may be less, so overhead of 500 ns may be ballpark.
I usually benchmark the overhead and subtract it out from the calculations.
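As a rough sketch of that calibrate-and-subtract approach (assuming a POSIX system with CLOCK_MONOTONIC; the names now_ns, timer_overhead_ns, measure_ns and loop_body are made up for illustration, and loop_body stands in for the code under test):

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Estimate the cost of one timer read: the smallest gap seen
 * between two back-to-back reads over many trials. */
static int64_t timer_overhead_ns(int trials)
{
    assert(trials > 0);
    int64_t best = INT64_MAX;
    for (int i = 0; i < trials; i++) {
        int64_t t0 = now_ns();
        int64_t t1 = now_ns();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Time one call of fn() and subtract the timer overhead.
 * The result is clamped at zero for very short functions. */
static int64_t measure_ns(void (*fn)(void), int64_t overhead)
{
    int64_t t0 = now_ns();
    fn();
    int64_t t1 = now_ns();
    int64_t net = (t1 - t0) - overhead;
    return net < 0 ? 0 : net;
}

/* Hypothetical code under test; volatile keeps the loop from being optimized away. */
static volatile uint32_t sink;
static void loop_body(void)
{
    for (uint32_t i = 0; i < 1000; i++)
        sink += i;
}
```

A caller would compute the overhead once, e.g. timer_overhead_ns(10000), and then pass it to measure_ns(loop_body, overhead) for each run. Taking the minimum rather than the mean for the overhead is a design choice: it filters out runs that were disturbed by a task switch or interrupt.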
And, on the x86, the actual code just gets the CPU's TSC value (via the rdtsc instruction) and does some adjustment. For arm, it has a similar H/W register but requires special care to map userspace access to it (a coprocessor instruction, IIRC).
Speaking of arm, I was doing a commercial arm product (an nVidia Jetson to be exact). We were very concerned about latency of incoming video frames.
The H/W engineer didn't trust TSC [or software in general ;-)] and was trying to use a scope and an LED [controlled by a GPIO pin], measuring when the LED flash/pulse showed up inside the video frame (i.e. the coordinates of the white dot in the video frame were [effectively] a time measurement).
It took a while to convince the engineer, but, eventually I was able to prove that the clock_gettime/TSC approach was more accurate/reliable.
And, certainly, easier to use. We had multiple test/development/SDK boards but could only hook up the scope/LED rig on one at a time.
I wrote a program for a wrist watch using an 8051 microcontroller, in embedded C. There are a total of six 7-segment displays, as such:
 _______________________
|       |       |       |   two 7-segments for showing HOURS
|  HR   |  MIN  |  SEC  |   two 7-segments for showing MINUTES and
|_______._______._______|   two 7-segments for showing SECONDS
  7-segment LED display
To update the hours, minutes and seconds, we used 3 for loops. That means that first the seconds will update, then the minutes, and then the hours. Then I asked my professor why can't we update simultaneously (I mean hours increment after an hour without waiting for the minutes to update). He told me we can't do parallel processing because of the sequential execution of the instructions.
Question:
A digital birthday card will play music continuously whilst blinking LEDs simultaneously. A digital alarm clock will produce beeps at a particular time. While it is producing sound, the time will continue updating. So sound and time increments are both running in parallel. How did they achieve these results with sequential execution?
How does one run multiple tasks simultaneously (scheduling) in a micro-controller?
First, what's with this sequential execution. There's just one core, one program space, one counter. The MPU executes one instruction at a time and then moves to another, in sequence. In this system there's no inherent mechanism to make it stop doing one thing and start doing another - it's all one program, and it's entirely in hands of programmer what the sequence will be and what it will do; it will last uninterrupted, one instruction at a time in sequence, as long as the MPU is running, and nothing else will happen, unless the programmer made it happen first.
Now, to multitasking:
Normally, operating systems provide multitasking, with quite complex scheduling algorithms.
Normally, microcontrollers run without operating system.
So, how do you achieve multitasking in microcontroller?
The simple answer is "you don't". But as usual, the simple answer rarely covers more than 5% of cases...
You'd have an extremely hard time writing real, preemptive multitasking. Most microcontrollers just don't have the facilities for that, and things an Intel CPU does with a couple of specific instructions would require you to write miles of code. Better forget classic multitasking for microcontrollers unless you really have nothing better to do with your time.
Now, there are two usual approaches that are frequently used instead, with far less hassle.
Interrupts
Most microcontrollers have different interrupt sources, often including timers. So, the main loop runs one task continuously, and when the timer counts to zero, an interrupt is issued. The main loop is stopped and execution jumps to an address known as the 'interrupt vector'. There, a different procedure is launched, performing a different one-off task. Once that finishes (possibly resetting the timer if need be), you return from the interrupt and the main loop is resumed.
Microcontrollers often have a few timers, and you can assign one task per timer, not to mention tasks on other, external interrupts (say, keyboard input - key pressed, or data arriving over RS232.)
While this approach is very limited, it really suffices for the great majority of cases; specifically yours: set up the timer to cycle every 1s, on interrupt calculate the new time, change the display, then leave the interrupt. In the main loop, wait for the date to reach the birthday, and when it does, start playing the music and blinking the LEDs.
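A minimal sketch of that split, written as portable C so it can run on a PC. The names timer_tick_isr and main_loop_pass are hypothetical, and on a real microcontroller the handler would be attached to the timer's interrupt vector with compiler-specific syntax rather than called directly:

```c
#include <stdint.h>

/* Time of day shared between the interrupt handler and the main loop.
 * volatile: the values change behind the main loop's back. */
static volatile uint8_t hours, minutes, seconds;
static volatile uint8_t display_dirty;  /* set by the ISR, cleared by the main loop */

/* Body of the once-per-second timer interrupt. Seconds, minutes and
 * hours are all brought up to date in one short pass. */
static void timer_tick_isr(void)
{
    if (++seconds == 60) {
        seconds = 0;
        if (++minutes == 60) {
            minutes = 0;
            if (++hours == 24)
                hours = 0;
        }
    }
    display_dirty = 1;
}

/* One pass of the main loop: redraw only if the ISR changed the time.
 * Returns 1 if a redraw happened, 0 otherwise. */
static int main_loop_pass(void)
{
    if (!display_dirty)
        return 0;
    display_dirty = 0;
    /* a real program would write hours/minutes/seconds to the display here */
    return 1;
}
```

Keeping the handler this short is deliberate: the longer an interrupt runs, the longer the main loop (and any other interrupt) is held up.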
Cooperative multitasking
This is how it was done in the early days. You need to write your 'tasks' as subroutines, each with a finite state machine (or a single pass of a loop) inside, and the "OS" is a simple loop of jumps to consecutive tasks, in sequence.
After each jump the MPU starts executing the given task, and will continue until the task returns control, after first saving its state, to recover it when it's started again. Each pass of the task job should be very short. Any delay loops must be replaced with wait states in the finite state engine (if the condition is not satisfied, return. If it is, change the state.) All longer loops must be unrolled into distinct states ("State: copying block of data, copy byte N, increase N, N=end? yes: next state, no: return control").
Writing that way is more difficult, but the solution is more robust. In your case you might have four tasks:
clock
display update
play sound
blink LED
Clock returns control if no new second arrived. If it did, it recalculates the number of seconds, minutes, hours, date, and then returns.
Display updates the displayed values. If you multiplex over the digits on the 7-segment display, each pass will update one digit, the next pass the next one, etc.
Playing sound will wait (yield) while it's not birthday. If it's birthday, pick the sample value from memory, output it to speaker, yield. Optionally yield if you were called earlier than you were supposed to output next sound.
Blinking - well, output the right state to LED, yield.
Very short loops - say, 10 iterations of 5 lines - are still allowed, but anything longer should be transformed into a state of the finite state engine which the process is.
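The shape of such a cooperative "OS" can be sketched in portable C as below. Everything here is an assumption for illustration: the tick counter ticks is presumed to be advanced elsewhere (e.g. by a timer interrupt) at TICKS_PER_SECOND, and the task bodies only update variables instead of touching real hardware.

```c
#include <stdint.h>

#define TICKS_PER_SECOND 100u     /* assumed rate of the time base */

static uint32_t ticks;            /* time base, advanced elsewhere */
static uint8_t  secs;             /* maintained by clock_task */
static uint8_t  led_on;           /* maintained by blink_task */
static uint8_t  active_digit;     /* which of the 6 digits is being driven */

/* Clock task: return at once unless a whole second has elapsed. */
static void clock_task(void)
{
    static uint32_t last;
    if (ticks - last < TICKS_PER_SECOND)
        return;                   /* nothing to do yet: give control back */
    last += TICKS_PER_SECOND;
    secs = (uint8_t)((secs + 1) % 60);
}

/* Display task: refresh ONE digit per pass, then return. */
static void display_task(void)
{
    active_digit = (uint8_t)((active_digit + 1) % 6);
}

/* Blink task: a two-state machine that toggles every half second. */
static void blink_task(void)
{
    static uint32_t last;
    if (ticks - last < TICKS_PER_SECOND / 2)
        return;
    last += TICKS_PER_SECOND / 2;
    led_on = !led_on;
}

typedef void (*task_fn)(void);
static const task_fn tasks[] = { clock_task, display_task, blink_task };

/* One round of the "OS": call every task once, in sequence. */
static void scheduler_round(void)
{
    for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++)
        tasks[i]();
}
```

The main program would simply call scheduler_round() forever. Note that no task ever waits in a loop; each one checks its condition, does a small amount of work if it can, and returns.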
Now, if you're feeling hardcore, you may try going about...
pre-emptive multitasking.
Each task is a procedure that would normally execute infinitely, doing just its own thing. It is written normally, trying not to step on other procedures' memory but otherwise using resources as if there was nothing else in the world that could need them.
Your OS task is launched from a timer interrupt.
Upon getting started by the interrupt, the OS task must save all current volatile state of the last task - registers, the interrupt return address (from which the task should be resumed), current stack pointer, keeping that in a record of that task.
Then, using the scheduler algorithm, it picks another process from the list, which should start now; restores all of its state, then overwrites its own return-from-interrupt address with the address where that process left off when it was previously preempted. Upon ending the interrupt, normal operation of the preempted process is resumed, until another interrupt switches control to the OS again.
As you can see, there's a lot of overhead, with saving and restoring the complete state of the program instead of just what the task needs at the moment, but the program doesn't need to be written as a finite state machine - normal sequential style suffices.
While SF provides an excellent overview of multitasking there is also some additional hardware most microcontrollers have that let them do things simultaneously.
Illusion of simultaneous execution - Technically your professor is correct and updating simultaneously cannot be done. However, processors are very fast. For many tasks they can execute sequentially, like updating each 7 segment display one at a time, but it does it so fast that human perception cannot tell that each display was updated sequentially. The same applies to sound. Most audible sound is in the kilohertz range while processors run in the megahertz range. The processor has plenty of time to play part of a sound, do something else, then return to playing a sound without your ear being able to detect the difference.
Interrupts - SF covered the execution of interrupts well so I'll gloss over the mechanics and talk more about hardware. Most micro controllers have small hardware modules that operate simultaneously with instruction execution. Timers, UARTS, and SPI are common modules that do a specific action while the main portion of the processor carries out instructions. When a given module completes its task it notifies the processor and the processor jumps to the interrupt code for the module. This mechanism allows you to do things like transmit a byte over uart (which is relatively slow) while executing instructions.
PWM - PWM (Pulse Width Modulation) is a hardware module that essentially generates a continuous square wave in which the high and low parts of each cycle don't have to be the same length (I am simplifying here). One could be longer than the other, or they could be the same size. You configure in hardware the size of the two parts and then the PWM generates them continuously. This module can be used to drive motors or even generate sound: the speed of the motor depends on the ratio of the two parts (the duty cycle), while the pitch of a sound depends on how often the cycle repeats (the frequency). To play music, a processor would only need to change the PWM settings when it is time for the note to change (perhaps based on a timer interrupt) and it can execute other instructions in the meantime.
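To illustrate how little the processor has to do per note, here is a sketch of the arithmetic for a generic counter-based PWM. The 1 MHz timer clock, the struct and the function name are all assumptions; real register names and widths come from the datasheet of the particular chip.

```c
#include <assert.h>
#include <stdint.h>

#define PWM_CLOCK_HZ 1000000u     /* assumed timer input clock: 1 MHz */

struct pwm_setting {
    uint32_t period_ticks;        /* timer counts per full cycle: sets the pitch */
    uint32_t compare_ticks;       /* counts spent high per cycle: sets the duty cycle */
};

/* Compute the two values a typical PWM timer needs for a given tone. */
static struct pwm_setting pwm_for_note(uint32_t freq_hz, uint32_t duty_percent)
{
    assert(freq_hz > 0 && freq_hz <= PWM_CLOCK_HZ && duty_percent <= 100);
    struct pwm_setting s;
    s.period_ticks  = PWM_CLOCK_HZ / freq_hz;                 /* integer division, rounds down */
    s.compare_ticks = s.period_ticks * duty_percent / 100u;
    return s;
}
```

For a 440 Hz tone at 50% duty this gives a period of 2272 counts and a compare value of 1136. Once those two values are written to the hardware, the tone keeps playing with no further CPU involvement until the next note.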
DMA - DMA (Direct Memory Access) is a specific type of hardware that automatically copies bytes from one memory location to another. Something like an ADC might continuously write a converted value to a specific register in memory. A DMA controller can be configured to read continuously from one address (the ADC output) while writing sequentially to a range of memory (like the buffer to receive multiple ADC conversions before averaging). All of this happens in hardware while the main processor executes instructions.
Timers, UART, SPI, ADC, etc - There are many other hardware modules (too many to cover here) that perform a specific task simultaneously with program execution.
TL/DR - While program instructions can only be executed sequentially, the processor can usually execute them fast enough that they appear to happen simultaneously. Meanwhile, most micro-controllers have additional hardware that accomplishes specific tasks simultaneously with program execution.
The answers by Zack and SF. nicely cover the big picture. But sometimes a working example is valuable.
While I could glibly suggest browsing the source kit to the Linux kernel (which is both open source and provides multitasking even on single-core machines), that is not the best place to start for an understanding of how to actually implement a scheduler.
A much better place to start is with the source kit to one of the hundreds (if not thousands) of real time operating systems. Many of these are open source, and most can run even on extremely small processors, including the 8051. I'll describe Micrium's uC/OS-II here in more detail because it has a typical set of features and it is the one I've used extensively. Others I've evaluated in the past include OS-9, eCos, and FreeRTOS. With those names as a starting point along with keywords like "RTOS" Google will reward you with names of many others.
My first reach for an RTOS kernel would be uC/OS-II (or its newer family member uC/OS-III). This is a commercial product that started life as an educational exercise for readers of Embedded Systems Design magazine. The magazine articles and their attached source code became the subject of one of the better books on the subject. The OS is open source, but does carry license restrictions on commercial use. In the interest of disclosure, I am the author of the port of uC/OS-II to the ColdFire MCF5307.
Since it was originally written as an educational tool, the source code is well documented. The text book (as of the 2nd edition on my shelf here somewhere, at least) is well written as well, and goes into a lot of theoretical background on each of the features it supports.
I successfully used it in several product development projects, and would consider it again for a project that needs multitasking but does not need to carry the weight of a full OS like Linux.
uC/OS-II provides a preemptive task scheduler, along with a useful collection of inter-task communications primitives (semaphore, mutex, mailbox, message queue), timers, and a thread-safe pooled memory allocator.
It also supports task priority, and includes deadlock prevention if used correctly.
It is written entirely in a subset of standard C (meeting almost all requirements of the MISRA-C:1998 guidelines), which helped make it possible for it to receive a variety of safety critical certifications.
While my applications were never in safety critical systems, it was comforting to know that the OS kernel on which I was standing had achieved those ratings. It provided assurance that the most likely reason I had a bug was either a misunderstanding of how a primitive worked, or possibly more likely, was actually a bug in my application logic.
Most RTOSes (and uC/OS-II especially) are able to run in limited resources. uC/OS-II can be built in as little as 6KB of code, and with as little as 1KB of RAM required for OS structures.
The bottom line is that apparent concurrency can be achieved in a variety of ways, and one of those ways is to use an OS kernel designed to schedule and execute each concurrent task in parallel by sharing the resources of the sequential CPU among all the tasks. For simple cases, all you might need is interrupt handlers and a main loop, but when your requirements grow to the point of implementing several protocols, managing a display, managing user input, background computation, and monitoring overall system health, standing on a well-designed RTOS kernel along with known to work communications primitives can save a lot of development and debugging effort.
Well, I see a lot of ground covered by other answers; so, hopefully I don't end up turning this into something bigger than I intend. (TL;DR: Girl to the rescue! :D). But, I do have (what I believe to be) a very good solution to offer; so I hope you can make use of it. I only have a small amount of experience with the 8051[☆]; although I did work for ~3 months (plus ~3 more full-time) on another microcontroller, with moderate success. In the course of that I ended up doing a little bit of almost everything the little thing had to offer: serial communications, SPI, PWM signals, servo control, DIO, thermocouples, and so forth. While I was working on it, I lucked out and came across an excellent (IMO) solution for (cooperative) 'thread' scheduling, which mixed well with some small amount of additional real-time stuff done off of interrupts on the PIC. And, of course, other interrupt handlers for the other devices.
pt_thread: Invented by Adam Dunkels (with Oliver Schmidt) (v1.0 released in Feb. 2005). His site is a great introduction to them, and includes downloads through v1.4 from Oct. 2006. I am very glad to have gone to look again, because there's an item from Jan. 2009 stating that Larry Ruane used event-driven techniques "[for] a complete reimplementation [using GCC; and with] a very nice syntax", available on sourceforge. Unfortunately, it looks like there are no updates to either since around 2009; but the 2006 version served me very well. The last news item (from Dec. 2009) notes that "Sonic Unleashed" indicated in its manual that protothreads were used!
One of the things that I think are awesome about pt_threads is that they're so simple; and, whatever the benefits of the newer (Ruane) version, it's certainly more complex. Although it may well be worth taking a look at, I am going to stick with Dunkels' original implementation here. His original pt_threads "library" consists of five header files. And, really, that seems like an overstatement: once I minified a few macros and other things, removed the doxygen sections and examples, and culled the comments down to the bare minimum that I felt still gave an explanation, it clocks in at just around 115 lines (included below.)
There are examples included with the source tarball, and a very nice .pdf document (or .html) available on his site (linked above.) But, let me walk through a quick example to elucidate some of the concepts. (Not the macros themselves, it took me a while to grok those, and they aren't really necessary just to use the functionality. :D)
Unfortunately, I've run out of time for tonight; but I will try to get back on at some point tomorrow to write up a little example; either way, there are a ton of resources on his website, linked above; it's a fairly straightforward procedure, the tricky part for me (as I suppose it is with any cooperative multi-threading; Win 3.1 anyone? :D) was ensuring that I had properly cycle-counted the code, so as not to overrun the time I needed to process the next thing before yielding the pt_thread.
I hope this gives you a start; let me know how it goes if you try it out!
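For a flavor of what a protothread looks like in use, here is a minimal sketch. To keep it compilable on its own it does not include pt.h; instead it defines simplified stand-ins for a few of the switch-based macros (same names, reduced functionality). The blink_thread function, the ticks counter and the led variable are hypothetical.

```c
/* Simplified stand-ins for the switch-based macros from pt.h and
 * lc-switch.h, inlined so this sketch compiles by itself. */
struct pt { unsigned short lc; };
enum { PT_WAITING, PT_YIELDED, PT_EXITED, PT_ENDED };
#define PT_INIT(pt)          ((pt)->lc = 0)
#define PT_BEGIN(pt)         switch ((pt)->lc) { case 0:
#define PT_WAIT_UNTIL(pt, c) do { (pt)->lc = __LINE__; case __LINE__: \
                                  if (!(c)) return PT_WAITING; } while (0)
#define PT_END(pt)           } PT_INIT(pt); return PT_ENDED

static unsigned ticks;   /* hypothetical time base, advanced elsewhere */
static int led;          /* hypothetical LED state */

/* Blink forever: 3 ticks on, 3 ticks off. The code reads like an
 * ordinary loop, but every PT_WAIT_UNTIL returns to the caller when its
 * condition is false and resumes at the same spot on the next call.
 * Locals that must survive a wait are static, because the function
 * really does return each time. */
static char blink_thread(struct pt *pt)
{
    static unsigned start;

    PT_BEGIN(pt);
    for (;;) {
        led = 1;
        start = ticks;
        PT_WAIT_UNTIL(pt, ticks - start >= 3);
        led = 0;
        start = ticks;
        PT_WAIT_UNTIL(pt, ticks - start >= 3);
    }
    PT_END(pt);
}
```

The main loop would initialize a struct pt once with PT_INIT and then call blink_thread(&pt) over and over, alongside any other threads. As the warning in lc-switch.h says, this switch-based trick breaks if a wait is placed inside a switch statement of your own.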
FILE: pt.h
#ifndef __PT_H__
#define __PT_H__
#include "lc.h"
// NOTE: the enums are mine to compress space; originally all were #defines
enum PT_STATUS_ENUM { PT_WAITING, PT_YIELDED, PT_EXITED, PT_ENDED };
struct pt { lc_t lc; }; // protothread control structure (pt_thread)
#define PT_INIT(pt) LC_INIT((pt)->lc) // initializes pt_thread prior to use
// you can use this to declare pt_thread functions
#define PT_THREAD(name_args) char name_args
// NOTE: looking at this, I think I might define my own macro as follows, so as not
// to have to redeclare the struct pt *pt every time.
//#define PT_DECLARE(name, args) char name(struct pt *pt, args)
// start/end pt_thread (inside implementation fn); must always be paired
#define PT_BEGIN(pt) { char PT_YIELD_FLAG = 1; LC_RESUME((pt)->lc)
#define PT_END(pt) LC_END((pt)->lc);PT_YIELD_FLAG = 0;PT_INIT(pt);return PT_ENDED;}
// {block, yield} 'pt' {until,while} 'c' is true
#define PT_WAIT_UNTIL(pt,c) do { \
LC_SET((pt)->lc); if(!(c)) {return PT_WAITING;} \
} while(0)
#define PT_WAIT_WHILE(pt, cond) PT_WAIT_UNTIL((pt), !(cond))
#define PT_YIELD_UNTIL(pt, cond) \
do { PT_YIELD_FLAG = 0; LC_SET((pt)->lc); \
if((PT_YIELD_FLAG == 0) || !(cond)) { return PT_YIELDED; } } while(0)
// NOTE: no corresponding "YIELD_WHILE" exists; oversight? [shelleybutterfly]
//#define PT_YIELD_WHILE(pt,cond) PT_YIELD_UNTIL((pt), !(cond))
// block pt_thread 'pt', waiting for child 'thread' to complete
#define PT_WAIT_THREAD(pt, thread) PT_WAIT_WHILE((pt), PT_SCHEDULE(thread))
// spawn pt_thread 'ch' as child of 'pt', waiting until 'thr' exits
#define PT_SPAWN(pt,ch,thr) do { \
PT_INIT((ch)); PT_WAIT_THREAD((pt),(thr)); } while(0)
// block and cause pt_thread to restart its execution at its PT_BEGIN()
#define PT_RESTART(pt) do { PT_INIT(pt); return PT_WAITING; } while(0)
// exit the pt_thread; if a child, then parent will unblock and run
#define PT_EXIT(pt) do { PT_INIT(pt); return PT_EXITED; } while(0)
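// schedule pt_thread: fn ret != 0 if pt is running, or 0 if exited
#define PT_SCHEDULE(f) ((f) < PT_EXITED)
// yield from the current pt_thread
#define PT_YIELD(pt) do { PT_YIELD_FLAG = 0; LC_SET((pt)->lc); \
if(PT_YIELD_FLAG == 0) { return PT_YIELDED; } } while(0)
#endif /* __PT_H__ */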
FILE: lc.h
#ifndef __LC_H__
#define __LC_H__
#ifdef LC_INCLUDE
#include LC_INCLUDE
#else
#include "lc-switch.h"
#endif /* LC_INCLUDE */
#endif /* __LC_H__ */
FILE: lc-switch.h
// WARNING: implementation using switch() won't work with an LC_SET() inside a switch()!
#ifndef __LC_SWITCH_H__
#define __LC_SWITCH_H__
typedef unsigned short lc_t;
#define LC_INIT(s) s = 0;
#define LC_RESUME(s) switch(s) { case 0:
#define LC_SET(s) s = __LINE__; case __LINE__:
#define LC_END(s) }
#endif /* __LC_SWITCH_H__ */
FILE: lc-addrlabels.h
#ifndef __LC_ADDRLABELS_H__
#define __LC_ADDRLABELS_H__
typedef void * lc_t;
#define LC_INIT(s) s = NULL
#define LC_RESUME(s) do { if(s != NULL) { goto *s; } } while(0)
#define LC_CONCAT2(s1, s2) s1##s2
#define LC_CONCAT(s1, s2) LC_CONCAT2(s1, s2)
#define LC_END(s)
#define LC_SET(s) \
do {LC_CONCAT(LC_LABEL, __LINE__):(s)=&&LC_CONCAT(LC_LABEL,__LINE__);} while(0)
#endif /* __LC_ADDRLABELS_H__ */
FILE: pt-sem.h
#ifndef __PT_SEM_H__
#define __PT_SEM_H__
#include "pt.h"
struct pt_sem { unsigned int count; };
// macros to initialize, await, and signal a pt_sem semaphore
#define PT_SEM_INIT(s, c) (s)->count = c
#define PT_SEM_WAIT(pt, s) do \
{ PT_WAIT_UNTIL(pt, (s)->count > 0); --(s)->count; } while(0)
#define PT_SEM_SIGNAL(pt, s) ++(s)->count
#endif /* __PT_SEM_H__ */
[☆] about a week learning about microcontrollers[†] and a week playing with it during an evaluation to see if it could meet our needs for a little line-replaceable remote I/O unit. (long story, short: no)
[†] The 8051 Microcontroller, Third Edition was suggested to me as the 8051 programming "bible". I don't know if it is or not, but I was certainly able to get my head around things using it.[‡]
[‡] and even looking over it again now I don't see much not to like about it. :) well, I mean... I wish I hadn't bought two copies; but they were so cheap!
LICENSE AGREEMENT (where applicable)
This post contains code based on (or taken from) 'The Protothreads Library' (referred to herein and henceforth as "PTLIB";
including v1.4 and earlier revisions) relying extensively on the source code as well as the documentation for PTLIB.
PTLIB original source code and documentation was received from, and freely available for download at the author's PTLIB site
'http://dunkels.com/adam/pt/', available through a link on the downloads page at 'http://dunkels.com/adam/pt/download.html'
or directly via 'http://dunkels.com/adam/download/pt-1.4.tar.gz'.
This post consists of original text, for which I hereby give to you (with love!) under a full waiver of whatever copyright
interest I may have, under the following terms: "copyheart ♥ 2014, shelleybutterfly, share with love!"; or, if you prefer,
a fully non-restrictive, attribution-only license appropriate to the material (such as Apache 2.0 for software; or CC-BY
license for text) so that you may use it as you see fit, so that it may best suit your needs.
This post also contains source code, almost entirely created from the original source by removing explanatory material,
reformatting, and paraphrasing the in-line documentation/comments, as well as a few modifications/additions by me
(shelleybutterfly on the stackexchange network). Anything derivative of PTLIB for which I may have, legally, gained any
copyright or other interest, I hereby cede all such interest back to and all copyright interest in the original work to
the original copyright holder, as specified in the license from PTLIB, which follows this agreement.
In any jurisdiction where it is not possible for the terms above to apply to you for whatever reason, then, for whatever
interest I have in the material, I hereby offer it to you under any non-restrictive, attribution-only, license of your
choosing; or, should this also not be possible, then I give permission to stack exchange inc to provide it to you under
whatever terms they determine to be acceptable in your jurisdiction.
All source code from PTLIB, and that which is derivative of PTLIB, that is not covered under other terms detailed above
hereby provided to 'stack exchange inc' and to you under the following agreement:
LICENSE AGREEMENT for "The Protothreads Library"
Copyright (c) 2004-2005, Swedish Institute of Computer Science. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the
following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following
disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the Institute nor the names of its contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE INSTITUTE AND CONTRIBUTORS `AS IS' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
EVENT SHALL THE INSTITUTE OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
Author: Adam Dunkels
There are some really good answers here, but just a little more context regarding your birthday card example might be a good lead in before digging in with the longer answers.
The way a single cpu can seem to do multiple things at once is by rapidly switching between tasks, as well as employing the assistance of timers, interrupts and independent hardware units that can do things independently of the cpu. (See Zack's answer for a nice discussion and starter list of HW.) So for your birthday card, the cpu could be telling a bit of audio hardware "play this chunk of sound", then go blink the LED, then come back and load the next bit of sound before the first portion is finished playing. In this situation, the cpu might take, say, 1 msec to load audio that plays for 5 msec of real time, leaving you with 4 msec to do something else before loading the next bit of sound.
The digital clock might beep by setting up a bit of PWM hardware to output at some frequency to a piezo buzzer, setting a timer for an interrupt to stop the beep, then going off to check a real time counter to see if the time display LEDs need to be updated. When the timer fires the interrupt, your code shuts off the PWM.
The details will vary according to the hardware of the chip, and going over the datasheet is the way to find out what capability a given microcontroller might have, and how to access it.
I have had good experiences with FreeRTOS, even though it uses a fair bit of memory. FreeRTOS gives you true preemptive threading, there's tons of ports if you ever want to upgrade those dusty old 8051s, there's semaphores and message queues and priorities and all kinds of stuff and it's totally free. I've only worked with the Arduino port personally, but it seems to be one of the most popular of the free RTOSes.
I think they sell a book that isn't free, but there's enough info on their website and in the arduino examples to pretty much figure it out.