Embedded Firmware Architecture - C

I’ve been writing increasingly complex firmware and am starting to notice that my knowledge of design patterns and architecture is a bit lacking. I’m trying to work on developing these skills and am hoping for some input. Note: this is for embedded C for microcontrollers.
I’m working with a concept for a new project as an exercise right now that goes something like this:
We have a battery management module with user I/O:
- The main controller is responsible for I/O (button, LCD, UART debug), detecting things like the charger being plugged in/unplugged, and managing high-level operations.
- The sub-controller is a battery management controller (pretty much a custom PMIC) capable of monitoring battery levels, running charging/discharging firmware, etc.
- The PMIC interfaces with a fuel gauge IC that it uses to read battery information.
- The interfaces between the two controllers, the fuel gauge, and the LCD are all I2C.
Here is a rough system diagram:
Now what I’m trying to do is to come up with a good firmware architecture that will allow for expandability (adding multiple batteries, adding more sensors, changing the LCD (or other) interface from I2C to SPI, etc), and for testing (simulate button presses via UART, replace battery readings with simulated values to test PMIC charge firmware, etc).
What I would normally do is write a custom driver for each peripheral, and a firmware module for each block. I would implement a flagging module as well with a globally available get/set that would be used throughout the system. For example my timers would set 100Hz, 5Hz, and 1Hz flags, which the main loop would handle by calling the individual modules at their desired rates. Then the modules themselves could set flags for the main loop to handle for events like I2C transaction complete, transaction timed out, temperature exceeded, etc.
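In code, what I do today looks roughly like this (a stripped-down sketch; all the module and init names are made up for illustration):

/* flag module: globally available get/set used throughout the system */
volatile uint8_t flag_100hz, flag_5hz, flag_1hz;

void timer_isr(void)                     /* assumed to fire at 100 Hz */
{
    static uint8_t count = 0;
    flag_100hz = 1;
    count++;
    if ((count % 20) == 0) flag_5hz = 1;
    if (count >= 100)      { flag_1hz = 1; count = 0; }
}

int main(void)
{
    system_init();
    while (1)
    {
        if (flag_100hz) { flag_100hz = 0; button_module_run(); }
        if (flag_5hz)   { flag_5hz = 0;   lcd_module_run(); }
        if (flag_1hz)   { flag_1hz = 0;   pmic_comms_run(); }
        /* modules set their own flags (I2C done, timeout, over-temperature, ...) for the loop to handle */
    }
}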
What I am hoping to get from this is a few suggestions on a better way to architect the system to achieve my goals of scalability, encapsulation and abstraction. It seems like what I’m doing is sort of a pseudo event-driven system but one that’s been hacked together.
In any case here’s my attempt at an architecture diagram:

The concept of an "event bus" is over-complicated. In many cases, the simplest approach is to minimize the number of things that need to happen asynchronously, and instead have a "main poll" routine which runs on an "as often as convenient" basis and calls polling routines for each subsystem. It may be helpful to have such a routine in a compilation unit by itself, so that the essence of that file would simply be a list of all polling functions used by other subsystems, rather than anything with semantics of its own. If one has a "get button push" routine, one can have a loop within that routine which calls the main poll routine until a button is pushed, there's a keyboard timeout, or something else happens that the caller needs to deal with. That would then allow the main UI to be implemented using code like:
void maybe_do_something_fun(void)
{
    while (1)
    {
        show_message("Do something fun?");
        wait_for_button();
        if (button_hit(YES_BUTTON))
        {
            /* ... do something fun */
            return;
        }
        else if (button_hit(NO_BUTTON))
        {
            /* ... do something boring */
            return;
        }
    }
}
This is often much more convenient than trying to have a giant state machine and say that if the code is in the STATE_MAYBE_DO_SOMETHING_FUN state and the yes or no button is pushed, it will need to advance to the STATE_START_DOING_SOMETHING_FUN or STATE_START_DOING_SOMETHING_BORING state.
Note that if one uses this approach, one will need to ensure that the worst-case time between calls to main_poll will always satisfy the timeliness requirements of the polling operations handled through main_poll, but in cases where that requirement can be met, this approach can be far more convenient and efficient than doing everything necessary to support preemptively-scheduled multi-threaded code, along with the locks and other guards needed to make it work reliably.
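A sketch of what such a polling file and a blocking-style helper might look like (the subsystem poll functions here are hypothetical):

/* main_poll.c - deliberately nothing but a list of the subsystems' polling functions */
void main_poll(void)
{
    uart_poll();
    i2c_poll();
    charger_detect_poll();
    button_poll();
}

/* e.g. in the button driver: block the caller while keeping the rest of the system serviced */
void wait_for_button(void)
{
    while (!button_event_pending() && !keyboard_timed_out())
        main_poll();
}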

Related

Making a driver library for a slow module watchdog friendly

Context
I'm making some libraries to manage internet protocols through GPRS. Some parts of this communication (done through UART) are rather slow (some can take more than 30 seconds) because the module has to connect through GPRS.
First I made a driver library to control the module and manage TCP/IP connections. This library worked with blocking functions; for example, a function like Init_GPRS_connection() could take several seconds to finish. I have been made aware that this is bad practice, because now I have to implement a watchdog timer, and this kind of function is not friendly with the short timeouts watchdogs have (I cannot kick the timer before it expires).
What I have thought
I need to rewrite part of my libraries to be watchdog friendly. For this purpose I have thought of this scheme: I need functions that have a state machine inside, which poll data acquired through UART interrupts to advance through the state machine, so that I can write code like:
GPRS_typef Init_GPRS_connection(void){
    switch(state){ // state would be a global variable holding the current state of the state machine
    ....           // here would be all the states of the state machine
    case end:
        state = 0;
        return Done;
    }
}

while(Init_GPRS_connection() != Done){
    Do_stuff(); //Like kick the Watchdog
}
But I see a few problems with this solution:
This is a less user-friendly implementation; the user has to be careful using this library driver because extra lines of code will always be necessary (kind of defeating the purpose of using functions).
If, for some reason, the module doesn't answer at some point, the code would get stuck in the state machine, yet the watchdog would still be kicked outside this function even though the code is stuck in a loop; this kind of defeats the purpose of using a watchdog timer.
My question
What kind of implementation should I use to make a user- and watchdog-friendly driver library? How do other driver libraries manage this?
Extra information
All this in the context of embedded systems
I would like to implement the watchdog kicking action outside the driver's functions
Given where you are, and assuming you do not want too much upheaval to your project to "do it properly", what you might do is add a variable watchdog timeout extension, such that you set a counter that is decremented in a timer interrupt, and if the counter is not zero, the watchdog is reset.
That way you are not allowing the timer interrupt to reset the watchdog indefinitely while your main thread is stuck, but you can extend the watchdog immediately before executing any blocking code, essentially setting a timeout for that operation.
So you might have (pseudocode):
static volatile uint8_t wdg_repeat_count = 0 ;
void extendWatchdog( uint8_t repeat ) { wdg_repeat_count = repeat ; }
void timerISR( void )
{
    if( wdg_repeat_count > 0 )
    {
        resetWatchdog() ;
        wdg_repeat_count-- ;
    }
}
Then you can either:
extendWatchdog( CONNECTION_INIT_WDG_TIMEOUT ) ;
while(Init_GPRS_connection() != Done){
Do_stuff(); //Like kick the Watchdog
}
or continue to use your existing non-state-machine based solution:
extendWatchdog( CONNECTION_INIT_WDG_TIMEOUT ) ;
bool connected = Init_GPRS_connection() ;
if( connected ) ...
The idea is compatible with both what you have and what you propose; it simply allows you to extend the watchdog timeout beyond that dictated by the hardware.
I suggest a uint8_t, because it prevents a lazy developer simply setting a large value and effectively disabling the watchdog protection, and it is likely to be atomic and so shareable between the main and interrupt context.
All that said, it would clearly have been better to design in your integrity infrastructure from the outset at the architectural level rather than trying to bolt it on after the event. For example, if you were using an RTOS, you might reset the watchdog in a low-priority task that, if starved, would cause a watchdog expiry, and that "watchdog task" could be used to monitor the other tasks to ensure they are scheduling as expected.
Without an RTOS you might have a "big-loop" architecture with each "task" implemented as a state-machine. In your example you seem to have missed the point of a state-machine: "initialising connection" should be a single state of a high-level state-machine, and the internals of that state may themselves be a state-machine (hierarchical state machines). So your entire system would be a single master state-machine in the main loop, and the watchdog reset once at each loop iteration. Nothing in any sub-state should block, to ensure the loop time is low and deterministic. That is how, for example, the Arduino framework's loop() function should work (when done properly - unfortunately seldom the case in examples). To understand how to implement a real-time deterministic state-machine architecture you could do worse than look at the work of Miro Samek. The framework described therein is available via his company.
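A rough sketch of the loop structure described above, with the watchdog reset exactly once per iteration (task names invented for illustration):

int main( void )
{
    system_init() ;
    for(;;)
    {
        resetWatchdog() ;    /* the only place the watchdog is kicked */
        gprs_task() ;        /* top-level FSM; "initialising connection" is one state, itself a sub-FSM */
        ui_task() ;
        logging_task() ;     /* none of these may block, keeping the loop time low and deterministic */
    }
}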
You should make your library non-blocking, but other than that, you should not worry about the watchdog at all. The watchdog management should be left to the user.
To allow the user to do other work while your library is waiting, you can use these approaches:
Provide a function to feed the data into your library (e.g. receive()). The user should call this function when the data is available, for example from the interrupt. As this function can be called from the interrupt, make sure it does not do heavy processing. Typically, you would just buffer the data and process it later (Step 2).
Provide a function, that user calls periodically, that updates the state of your library and does any other housekeeping tasks (like timeout detection). Typically, this function is called run(), process(), tick() or something along these lines. The user would call this function in their main loop or from a dedicated RTOS task.
Provide a way to tell the user the state of your library. You can do it either by some sort of getState() function or using a callback or both. Based on this information, the user can implement their own state machine to do things on connect, disconnect etc.
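Put together, the three pieces above might look something like this sketch (gprs_receive/gprs_process/gprs_get_state and the rest are example names for the pattern, not an existing API):

/* library interface (declarations) */
void gprs_receive(uint8_t byte);      /* 1. called from the UART RX interrupt: just buffer the byte */
void gprs_process(void);              /* 2. called periodically: parse buffered data, advance the FSM, detect timeouts */
gprs_state_t gprs_get_state(void);    /* 3. expose state so the user can build their own logic on top */

/* typical user code - the watchdog stays entirely under the user's control */
int main(void)
{
    gprs_init();
    while (1)
    {
        kick_watchdog();
        gprs_process();
        if (gprs_get_state() == GPRS_CONNECTED)
            do_connected_work();
    }
}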

How do I efficiently use the same interrupt handler for every (identical) peripheral port?

(Hopefully) simplified version of my problem:
Say I'm using every GPIO port of my Cortex-M4 MCU to do the exact same thing, like read the port on a pin-level change. I've simplified my code so it's port-agnostic, but I'm having issues with a nice solution for re-using the same interrupt handler function.
Is there a way I can use the same interrupt handler function while
having a method of finding which port triggered the interrupt? Ideally some O(1)/doesn't scale up depending on how many ports the board has.
Should I just have different handlers for each port that call the same function that takes in a "port" parameter? (Best I could come up with so far)
So like:
void worker (uint32_t gpio_id) {
    /* work goes here */
}

void GPIOA_IRQ_Handler(void) { worker(GPIOA_id); }
void GPIOB_IRQ_Handler(void) { worker(GPIOB_id); }
void GPIOC_IRQ_Handler(void) { worker(GPIOC_id); }
...
My actual problem:
I'm learning about and fiddling around with FreeRTOS and creating simple drivers for a debug/stdio UART, some buttons that are on my dev board, and so on. So far I've been making drivers for a specific peripheral/port.
Now I'm looking to make an I2C driver without knowing which interface I'm gonna use (there are 10 I2C ports in my MCU), and to potentially allow the driver code to be used on multiple ports at the same time. I'd know all the ports used at compile-time though.
I have a pretty good idea on how to make the driver to be port-agnostic, except I'm getting hung up on figuring out a nice way to find which port triggered the interrupt using a single handler function. (besides cycling through every port's interrupt status reg since that's O(n)).
Like I said, the best I came up with is to not have a single handler and instead have different handlers on the vector table that all call the same "worker" function, passing a "port" parameter. This method clutters up the driver code, but it is O(1) (unless you take code-complexity into account).
Am I going about this all wrong and should I just "keep it simple stupid" and implement the driver according to the port(s)/use-case I will actually need in the simplest way possible? (I don't even have plans to use multiple I2C buses, I just thought it'd be interesting to implement.)
Thank you in advance, hopefully the post isn't too ambiguous or long (I feel like it's pretty long sry).
Is there a way I can use the same interrupt handler function while having a method of finding which port triggered the interrupt?
Only if the different interrupts are cleared in the same way and your application doesn't care which pin triggered the interrupt. Quite an unlikely use-case.
Should I just have different handlers for each port that call the same function that takes in a "port" parameter?
Yeah that's usually what I do. You should pass on the parameters from the ISR to the function, that are unique to the specific interrupt. Important: note that the function should be inline static! A fast ISR is significantly more important than saving a tiny bit of flash by re-using the same function. So in the machine code you'll have 4 different ISRs with the worker function inlined. (Might want to disable inlining in debug build though.)
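Applied to the snippet from the question, that looks roughly like this (the status-register access is a placeholder, since it is vendor specific):

/* static inline: each ISR gets its own inlined copy, so there is no extra call overhead */
static inline void gpio_worker(uint32_t gpio_id)
{
    uint32_t pins = gpio_read_and_clear_irq_status(gpio_id);  /* placeholder for the real register access */
    handle_pin_change(gpio_id, pins);
}

void GPIOA_IRQ_Handler(void) { gpio_worker(GPIOA_id); }
void GPIOB_IRQ_Handler(void) { gpio_worker(GPIOB_id); }
void GPIOC_IRQ_Handler(void) { gpio_worker(GPIOC_id); }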
Am I going about this all wrong and should just "keep it simple stupid"
Sounds like you are doing it right. A properly written driver should be able to handle multiple hardware peripheral instances with the same code. That being said, C programmers have a tendency to obsess about avoiding code repetition. "KISS" is often far more sound than avoiding code repetition. Avoiding repetition is of course nice, but not your top priority here.
Priorities in this case should be, most important first:
- As fast, slim interrupts as possible.
- Readable code.
- Flash size used.
- Avoiding code repetition.

Where does finite-state machine code belong in µC?

I asked this question on EE forum. You guys on StackOverflow know more about coding than we do on EE so maybe you can give more detail information about this :)
When I learned about microcontrollers, teachers taught me to always end the code with while(1); with no code inside that loop.
This was to be sure that the software gets "stuck" there while interrupts keep working. When I asked them if it was possible to put some code in this infinite loop, they told me it was a bad idea. Knowing that, I now try my best to keep this loop empty.
I now need to implement a finite state machine in a microcontroller. At first view, it seems that that code belongs in this loop. That makes coding easier.
Is that a good idea? What are the pros and cons?
This is what I plan to do :
void main(void)
{
    // init phase
    while(1)
    {
        switch(current_State)
        {
        case 1:
            if(...)
            {
                current_State = 2;
            }
            else if(...)
            {
                current_State = 3;
            }
            else
                current_State = 4;
            break;
        case 2:
            if(...)
            {
                current_State = 3;
            }
            else if(...)
            {
                current_State = 1;
            }
            else
                current_State = 5;
            break;
        }
    }
}
Instead of:
void main(void)
{
    // init phase
    while(1);
}
And manage the FSM with interrupts.
It is like saying that all functions should return in one place, or other such habits. There is one type of design where you might want to do this, one that is purely interrupt/event based. There are products that go completely the other way: polled, and not event driven at all. And anything in between.
What matters is doing your system engineering, that's it, end of story. Interrupts add complication and risk; they have a higher price than not using them. Automatically making any design interrupt driven is automatically a bad decision; it simply means there was no effort put into the design, the requirements, the risks, etc.
Ideally you want most of your code in the main loop, and you want your interrupts lean and mean in order to keep the latency down for other time-critical tasks. Not all MCUs have a complicated interrupt priority system that would allow you to burn a lot of time or have all of your application in handlers. Inputs into your system engineering may help choose the MCU, but here again you are adding risk.
You have to ask yourself what the tasks are that your MCU has to do, what if any latency there is for each task from when an event happens until it has to start responding and until it has to finish, and per event/task what portion, if any, can be deferred. Can any be interrupted while doing the task, can there be a gap in time? All the questions you would ask for a hardware design, or a CPLD or FPGA design, except there you have real parallelism.
What you are likely to end up with in real world solutions are some portion in interrupt handlers and some portion in the main (infinite) loop. The main loop polling breadcrumbs left by the interrupts and/or directly polling status registers to know what to do during the loop. If/when you get to where you need to be real time you can still use the main super loop, your real time response comes from the possible paths through the loop and the worst case time for any of those paths.
Most of the time you are not going to need to do this much work. Maybe some interrupts, maybe some polling, and a main loop doing some percentage of the work.
As you should know from the EE world, if a teacher/other says there is one and only one way to do something and everything else is by definition wrong... time to find a new teacher, and/or pretend to drink the kool-aid, pass the class and move on with your life. Also note that the classroom experience is not the real world. There are so many things that can go wrong with MCU development that you are really in a controlled sandbox with ideally only a few variables you can play with, so that you don't have to spend years to get through a few-month class. Some percentage of the rules they state in class are there to get you through the class and/or to get the teacher through the class; it's easier to grade papers if you tell folks a function can't be bigger than X, or no gotos, or whatever. The first thing you should do when the class is over, or add to your lifetime bucket list, is to question all of these rules. Research and try on your own, fall into the traps and dig out.
When doing embedded programming, one commonly used idiom is to use a "super loop" - an infinite loop that begins after initialization is complete that dispatches the separate components of your program as they need to run. Under this paradigm, you could run the finite state machine within the super loop as you're suggesting, and continue to run the hardware management functions from the interrupt context as it sounds like you're already doing. One of the disadvantages to doing this is that your processor will always be in a high power draw state - since you're always running that loop, the processor can never go to sleep. This would actually also be a problem in any of the code you had written however - even an empty infinite while loop will keep the processor running. The solution to this is usually to end your while loop with a series of instructions to put the processor into a low power state (completely architecture dependent) that will wake it when an interrupt comes through to be processed. If there are things happening in the FSM that are not driven by any interrupts, a normally used approach to keep the processor waking up at periodic intervals is to initialize a timer to interrupt on a regular basis to cause your main loop to continue execution.
One other thing to note, if you were previously executing all of your code from the interrupt context - interrupt service routines (ISRs) really should be as short as possible, because they literally "interrupt" the main execution of the program, which may cause unintended side effects if they take too long. A normal way to handle this is to have handlers in your super loop that are just signalled to by the ISR, so that the bulk of whatever processing that needs to be done is done in the main context when there is time, rather than interrupting a potentially time critical section of your main context.
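On a Cortex-M part, those two ideas combined could look like this sketch (the flags and helper functions are made up; __WFI() is the CMSIS wait-for-interrupt intrinsic):

volatile uint8_t fsm_tick;            /* set by a periodic timer ISR */
volatile uint8_t uart_rx_ready;       /* set by the UART ISR after buffering its data */

int main(void)
{
    init();
    while (1)
    {
        if (fsm_tick)      { fsm_tick = 0;      run_state_machine(); }
        if (uart_rx_ready) { uart_rx_ready = 0; process_uart_data(); }

        __WFI();  /* sleep until the next interrupt; an event landing just before this
                     is simply handled after the next wake-up (e.g. the next timer tick) */
    }
}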
What you should implement is your choice, and depends on how easy you want your code to be to debug.
There are times when it will be right to use the while(1); statement at the end of the code, if your µC handles everything in interrupts (ISRs). In other applications the µC will be used with code inside an infinite loop (called the polling method):
while(1)
{
//code here;
}
And in some other applications, you might mix the ISR method with the polling method.
Regarding 'debugging easiness': using only the ISR method (putting the while(1); statement at the end) will give you a hard time debugging your code, since when an interrupt event triggers, the debugger of choice will not give you step-by-step event and register reading to follow. Also, please note that writing completely ISR-based code is not recommended, since ISRs should do minimal work (such as incrementing a counter or raising/clearing a flag) and be able to exit swiftly.
It belongs in one thread that executes it in response to input messages from a producer-consumer queue. All the interrupts etc. fire input to the queue and the thread processes them through its FSM serially.
It's the only way I've found to avoid undebuggable messes whilst retaining the low latency and efficient CPU use of interrupt-driven I/O.
'while(1);' UGH!

what's wrong in using interrupt handlers as event listeners

My system is simple enough that it runs without an OS, I simply use interrupt handlers like I would use event listener in a desktop program. In everything I read online, people try to spend as little time as they can in interrupt handlers, and give the control back to the tasks. But I don't have an OS or real task system, and I can't really find design information on OS-less targets.
I have basically one interrupt handler that reads a chunk of data from the USB and writes the data to memory, and one interrupt handler that reads the data, sends the data out on GPIO and schedules itself on a hardware timer again.
What's wrong with using the interrupts the way I do, and using the NVIC (I use a Cortex-M3) to manage the work hierarchy?
First of all, in the context of this question, let's refer to the OS as a scheduler.
Now, unlike threads, interrupt service routines are "above" the scheduling scheme.
In other words, the scheduler has no "control" over them.
An ISR enters execution as a result of a HW interrupt, which sets the PC to a different address in the code-section (more precisely, to the interrupt-vector, where you "do a few things" before calling the ISR).
Hence, essentially, the priority of any ISR is higher than the priority of the thread with the highest priority.
So one obvious reason to spend as little time as possible in an ISR, is the "side effect" that ISRs have on the scheduling scheme that you design for your system.
Since your system is purely interrupt-driven (i.e., no scheduler and no threads), this is not an issue.
However, if nested ISRs are not allowed, then interrupts must be disabled from the moment an interrupt occurs and until the corresponding ISR has completed. In that case, if any interrupt occurs while an ISR is in execution, then your program will effectively ignore it.
So the longer you spend inside an ISR, the higher the chances are that you'll "miss out" on an interrupt.
In many desktop programs, events are sent to a queue and there is some "event loop" that handles this queue. This event loop handles events one by one, so it is not possible for one event to interrupt another. It is also good practice in event-driven programming to keep all event handlers as short as possible because they are not interruptible.
In bare-metal programming, interrupts are similar to events but they are not sent to a queue:
- Execution of interrupt handlers is not sequential; they can be interrupted by an interrupt with higher priority (numerically lower number on Cortex-M3).
- There is no queue of identical interrupts - e.g. you can't detect multiple GPIO interrupts while you are in that interrupt - which is the reason you should keep all routines as short as possible.
It is possible to implement queues by yourself, feed these queues by interrupts and consume these queues in your super loop (consume while disabling all interrupts). By this approach, you can get sequential processing of interrupts. If you keep your handlers short, this is mostly not needed and you can do the work in handlers directly.
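A minimal version of such a queue might be a ring buffer like this (disable_interrupts()/enable_interrupts() and the event helpers stand in for whatever the target provides):

#define EVT_QUEUE_SIZE 16

static volatile uint8_t evt_queue[EVT_QUEUE_SIZE];
static volatile uint8_t evt_head, evt_tail;

void gpio_isr(void)                              /* producer, interrupt context */
{
    uint8_t next = (uint8_t)((evt_head + 1) % EVT_QUEUE_SIZE);
    if (next != evt_tail)                        /* drop the event if the queue is full */
    {
        evt_queue[evt_head] = read_pin_state();
        evt_head = next;
    }
}

void super_loop(void)                            /* consumer, main-loop context */
{
    while (1)
    {
        uint8_t evt = 0, have_event;

        disable_interrupts();                    /* brief critical section around the queue */
        have_event = (evt_head != evt_tail);
        if (have_event)
        {
            evt = evt_queue[evt_tail];
            evt_tail = (uint8_t)((evt_tail + 1) % EVT_QUEUE_SIZE);
        }
        enable_interrupts();

        if (have_event)
            handle_event(evt);
    }
}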
It is also good practice in OS-based systems to use queues, semaphores and "interrupt handler tasks" to handle interrupts.
With bare metal it is perfectly fine to design for application-bound or interrupt/event-bound so long as you do your analysis. So if you know what events/interrupts are coming at what rate, and you can ensure that you will handle all of them in the desired/designed amount of time, you can certainly take your time in the event/interrupt handler rather than be quick and send a flag to the foreground task.
The common approach of course is to get in and out fast, saving just enough info to handle the thing in the foreground task. The foreground task has to spin its wheels of course looking for event flags, prioritizing, etc.
You could of course make it more complicated: when the interrupt/event comes, save state, and return to the foreground handler in foreground mode rather than interrupt mode.
Now that is all general, but specific to the Cortex-M3 I don't think there are really modes like on the big-brother ARMs. So long as you take a real-time approach and make sure your handlers are deterministic, and you do your system engineering and ensure that no situation happens where the events/interrupts stack up such that the response is not deterministic, too late, too long, or loses data, it is okay.
What you have to ask yourself is whether all events can be serviced in time in all circumstances:
For example:
If your interrupt system were run-to-completion, will the servicing of one interrupt cause unacceptable delay in the servicing of another?
On the other hand, if the interrupt system is priority-based and preemptive, will the servicing of a high priority interrupt unacceptably delay a lower one?
In the latter case, you could use Rate Monotonic Analysis to assign priorities to assure the greatest responsiveness (the shortest execution-time handlers get the highest priority). In the first case your system may lack a degree of determinism, and performance will be variable under both event load, and code changes.
One approach is to divide the handler into real-time critical and non-critical sections: the time-critical code can be done in the handler, then a flag set to prompt the non-critical action to be performed in the "background" non-interrupt context in a "big-loop" system that simply polls event flags or shared data for work to complete. Often all that might be necessary in the interrupt handler is to copy some data or timestamp some event - making the data available for background processing without holding up processing of new events.
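A small illustration of that split (the names are placeholders; a single slot is enough only if events cannot arrive faster than the loop runs):

volatile uint16_t latest_sample;
volatile uint32_t latest_timestamp;
volatile uint8_t  sample_ready;

void capture_isr(void)                           /* time-critical part only */
{
    latest_sample    = read_capture_register();  /* grab the data before it can be overwritten */
    latest_timestamp = read_timer_count();
    sample_ready     = 1;                        /* breadcrumb for the big loop */
}

void background_poll(void)                       /* called from the big loop */
{
    if (sample_ready)
    {
        sample_ready = 0;
        filter_and_log(latest_sample, latest_timestamp);   /* non-critical work, background context */
    }
}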
For more sophisticated scheduling, there are a number of simple, low-cost or free RTOS schedulers that provide multi-tasking, synchronisation, IPC and timing services with very small footprints and can run on very low-end hardware. If you have a hardware timer and 10K of code space (sometimes less), you can deploy an RTOS.
I am taking your described problem first
As I interpret it, your goal is to create a device which, by receiving commands over USB, drives some GPIO outputs, such as LEDs, relays, etc. For this simple task, your approach seems to be fine (if the USB layer can work with it adequately).
A prioritization problem exists though: in this case it may be that if you overload the USB side (with data from the other end of the cable), and the interrupt handling it has higher priority than the one triggered by the timer handling the GPIO, the GPIO side may miss ticks (as others explained, interrupts can't queue).
In your case this is about all that needs to be considered.
Some general guidance
For the "spend as little time in the interrupt handler as possible" the rationale is just what others told: an OS may realize a queue, etc., however hardware interrupts offer no such concepts. If the event causing the interrupt happens, the CPU enters your handler. Then until you handle it's source (such as reading a receive holding register in the case of a UART), you lose any further occurrences of that event. After this point, until exiting the handler, you may receive whether the event happened, but not how many times (if the event happened again while the CPU was still processing the handler, the associated interrupt line goes active again, so after you return from the handler, the CPU immediately re-enters it provided nothing higher priority is waiting).
Above I described the general concept observable on 8 bit processors and the AVR 32bit (I have experience with these).
When designing such low-level systems (no OS, one "background" task, and some interrupts) it is fundamental to understand what goes on on each priority level (if you utilize such). In general, you would make the most real-time critical tasks the highest priority, taking the most care of serving those fast, while being more relaxed with the lower priority levels.
From another aspect, usually at the design phase it can be planned how the system should react to missed interrupts, since where there are interrupts, missing one will eventually happen anyway. Critical data going across communication lines should have adequate checksums, an especially critical timer should be sourced from a count register rather than from event counting, and the like.
Another nasty part of interrupts is their asynchronous nature. If you fail to design the related locks properly, they will eventually corrupt something, giving nightmares to the poor soul who has to debug it. The "spend as little time in the interrupt handler as possible" statement also encourages you to keep the interrupt code reasonably short, which means less code to consider for this problem as well. If you have also worked with multitasking assisted by an RTOS, you should know this part (there are some differences though: a higher-priority interrupt handler's code does not need protection against a lower-priority handler's).
If you can properly design your architecture regarding the necessary asynchronous tasks, getting around without an OS (from the no multitasking aspect) may even prove to be a nicer solution. It needs way more thinking to design it properly, however later there are much less locking related problems. I got through some mid-sized safety critical projects designed over a single background "task" with very few and little interrupts, and the experience and maintenance demands regarding those (especially the tracing of bugs) were quite satisfactory compared to some others in the company built over multitasking concepts.

How do you test your interrupt handling module?

I've got an interrupt handling module which controls the interrupt controller hardware on an embedded processor. Now I want to add more tests to it. Currently, the tests only test whether nesting of interrupts works, by making two software interrupts from within an ISR, one with low priority and one with high priority. How can I test this module further?
I suggest that you try to create other stimuli as well.
Often, hardware interrupts can also be triggered by software (for automated testing) or by the debugger setting a flag. Or as an interrupt via I/O. Or a timer interrupt. Or you can just set the interrupt pending bit in the interrupt controller via the debugger while you are single-stepping.
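For example, on a Cortex-M with CMSIS available, a test can pend a real hardware interrupt from software (UART1_IRQn and the instrumentation counter are hypothetical, for illustration):

#include <assert.h>
#include "device.h"                            /* whichever vendor header pulls in CMSIS */

extern volatile uint32_t uart_isr_run_count;   /* instrumentation counter incremented by the ISR */

void test_fire_uart_irq(void)
{
    uint32_t before = uart_isr_run_count;
    NVIC_SetPendingIRQ(UART1_IRQn);            /* pend the interrupt from software */
    __DSB();                                   /* give the interrupt a chance to be taken */
    __ISB();
    assert(uart_isr_run_count == before + 1);
}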
You can add some runtime checks on things which are not supposed to happen. Sometimes I elect to set output pins to monitor externally (nice if you have an oscilloscope or logic analyser...)
void low_prio_isr(void)
{
    LOW_PRIO_ISR = 1;
    if (1 == HIGH_PRIO_ISR)
    {
        /* this may never happen; dummy statement to allow a breakpoint in the debugger */
    }
}

void high_prio_isr(void)
{
    HIGH_PRIO_ISR = 1;
}
The disadvantage of the software interrupt is that the moment is fixed; always the same instruction. I believe you would like to see evidence that it always works; deadlock free.
For interrupt service routines I find code reviews very valuable. In the end you can only test the situations you've imagined and at some point the effort of testing will be very high. ISRs are notoriously difficult to debug.
I think it is useful to provide tests for the following:
- isr is not interrupted for lower priority interrupt
- isr is not interrupted for same priority interrupt
- isr is interrupted for higher priority interrupt
- maximum nesting count within stack limitations.
Some of your tests may stay in the code as instrumentation (so you can monitor, for instance, the maximum nesting level).
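One way to keep such instrumentation in the code (a sketch; isr_enter()/isr_exit() would be called at the top and bottom of every ISR):

static volatile uint8_t isr_nesting_level;
static volatile uint8_t isr_max_nesting;   /* inspect via the debugger or a status/debug command */

static inline void isr_enter(void)
{
    uint8_t level = ++isr_nesting_level;   /* if the increment is not atomic on your target,
                                              wrap it in a brief interrupt-disable section */
    if (level > isr_max_nesting)
        isr_max_nesting = level;
}

static inline void isr_exit(void)
{
    --isr_nesting_level;
}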
Oh, and one more thing: I've generally managed to keep ISRs so short that I can refrain from nesting... If you can, this will gain you additional simplicity and more performance.
[EDIT]
Of course, ISRs need to be tested on hardware in system too. Apart from the bit-by-bit, step-by-step approach you may want to prove:
- stability of system at maximum interrupt load (preferably several times the predicted maximum load; if your 115kbps serial driver can also handle 2MBps you'll be ok!)
- correct moment of enabling / disabling isr, especially if system also enters a sleep mode
- # of interrupts. Can be surprising if you add mechanical switches, mechanical rotary (hundreds of break/contact moments before reaching steady situation)
I recommend real-hardware testing. Interrupt handling is inherently random and unpredictable.
Use a signal generator and feed a square wave into the appropriate interrupt pin. Use multiple generators (or one with multiple outputs) to test multiple IRQ lines and verify priority handling.
Experiment with dialing the frequency up & down on the signal generators (vary the rates between them), and see what happens. Have lots of diagnostic code to verify the state of the interrupt controller in the various states.
Alternative: If your platform has timers that can trigger interrupts, you can use them instead of external hardware.
I'm not an embedded developer, so I don't know if this is possible, but how about decoupling the code that handles the interrupts from the callback-registration mechanism? That would allow you to write simulator code firing interrupt events as you like...
For stuff like this I highly recommend something like the SPIN model checker. You wind up testing the algorithm, not the code, but the testing is exhaustive. Back in the day, I found a bug in gdb using this technique.
