ARM GIC Interrupt starvation - arm

Not sure if there are similar questions. I tried to backread but can't find any, so here it is.
In my bare-metal application that uses ARM Cortex-A9 (dual core with GIC), some of the interrupt sources are 4 FPGA interrupts (let's say IRQ ID 58, 59, 60, 61) that have the same priority and the idea is that all simultaneously trigger continuously in run-time. I can say the interrupt handlers may qualify as long, but not very long.
All interrupts fire and are detected by GIC and all are flagged as PENDING. The problem is, only the two higher ID'ed interrupts (58, 59) get handled by CPU, starving the other two. Once 58 or 59 are done, their source will trigger again and grab the CPU over and over again. My other interrupts are indefinitely being starved.
I played around with priority, assigning higher interrupts to 60 and 61. Sure enough, 60 and 61 triggered and got handled by CPU, but 58 and 59 are starved. So it's really an issue of starvation.
Is there any way out of here, such that the other two will still be processed given their triggering rate?

Assuming the GIC implementation is one of ARM's designs, then the arbitration scheme for multiple interrupts at the same priority is fixed at "dispatch the lowest-numbered one", so if you were hoping it could be changed to some kind of round-robin scheme you're probably out of luck.
That said, if these interrupts are more or less permanently asserted and you're taking them back-to-back then that's a sign that you probably don't need to use interrupts, or at least that the design of your code is inappropriate. Depending on the exact nature of the task, here are some ideas I'd consider:
Just run a continuous polling loop cycling through each device in turn. If there are periods when each device might not need servicing and it's not straightforward to tell, retain a trivial interrupt handler that just atomically sets a flag/sequence number/etc. to inform the loop who's ready.
Handle all the interrupts on one core, and the actual processing on the other. The handler just grabs the necessary data, stuffs it into a queue, and returns as quickly as possible, while the other guy just steadily chews through the queue.
If catching every single interrupt is less important than just getting "enough" of each of them on average, leave each one disabled for a suitable timeout after handling it. Alternatively, hack up your own round-robin scheduling by having only one enabled at a time, and the handler reenables the next interrupt instead of the one just taken.

In my bare-metal application that uses ARM Cortex-A9 (dual core with GIC)...
Is there any way out of here, such that the other two will still be processed given their triggering rate?
Of course there are many ways.
You have a dual CPU so you can route a set to each CPU; 58/59 to CPU0 and 60/61 to CPU1. It is not clear how you have handled things with the distributor nor the per-CPU interfaces.
A 2nd way is to just read the status in the 58/59 handlers of the 60/61 and do the work. Ie, you can always read a status of another interrupt from the IRQ handler.
You can also service each and every pending interrupt recorded at the start of the IRQ before acknowledging the original source. A variant of '2' implemented at the IRQ controller layer.
I believe that most of these solutions avoid needless context save/restores and should also be more efficient.
Of course if you are asking the CPU to do more work than it can handle, priorities don't matter. The issue may be your code is not efficient; either the bare metal interrupt infrastructure or your FPGA IRQ handler. It is also quite likely the FPGA to CPU interface is not designed well. You may need to add FIFOs in the FPGA to buffer the data so the CPU can handle more data at a time. I have worked with several FPGA designers. They have a lot of flexibility and usually if you ask for something that will make the IRQ handler more efficient, they can implement it.

Related

How to know whether an IRQ was served immediately on ARM Cortex M0+ (or any other MCU)

For my application (running on an STM32L082) I need accurate (relative) timestamping of a few types of interrupts. I do this by running a timer at 1 MHz and taking its count as soon as the ISR is run. They are all given the highest priority so they pre-empt less important interrupts. The problem I'm facing is that they may still be delayed by other interrupts at the same priority and by code that disables interrupts, and there seems to be no easy way to know this happened. It is no problem that the ISR was delayed, as long as I know that the particular timestamp is not accurate because of this.
My current approach is to let each ISR and each block of code with interrupts disabled check whether interrupts are pending using NVIC->ISPR[0] and flagging this for the pending ISR. Each ISR checks this flag and, if needed, flags the timestamp taken as not accurate.
Although this works, it feels like it's the wrong way around. So my question is: is there another way to know whether an IRQ was served immediately?
The IRQs in question are EXTI4-15 for a GPIO pin change and RTC for the wakeup timer. Unfortunately I'm not in the position to change the PCB layout and use TIM input capture on the input pin, nor to change the MCU used.
update
The fundamental limit to accuracy in the current setup is determined by the nature of the internal RTC calibration, which periodically adds/removes 32kHz ticks, leading to ~31 µs jitter. My goal is to eliminate (or at least detect) additional timestamping inaccuracies where possible. Having interrupts blocked incidentally for, say, 50+ µs is hard to avoid and influences measurements, hence the need to at least know when this occurs.
update 2
To clarify, I think this is a software question, asking if a particular feature exists and if so, how to use it. The answer I am looking for is one of: "yes it is possible, just check bit X of register Y", or "no it is not possible, but MCU ... does have such a feature, called ..." or "no, such a feature is generally not available on any platform (but the common workaround is ...)". This information will guide me (and future readers) towards a solution in software, and/or requirements for better hardware design.
In general
The ideal solution for accurate timestamping is to use timer capture hardware (built-in to the microcontroller, or an external implementation). Aside from that, using a CPU with enough priority levels to make your ISR always the highest priority could work, or you might be able to hack something together by making the DMA engine sample the GPIO pins (specifics below).
Some microcontrollers have connections between built-in peripherals that allow one peripheral to trigger another (like a GPIO pin triggering timer capture even though it isn't a dedicated timer capture input pin). Manufacturers have different names for this type of interconnection, but a general overview can be found on Wikipedia, along with a list of the various names. Exact capabilities vary by manufacturer.
I've never come across a feature in a microcontroller for indicating if an ISR was delayed by a higher priority ISR. I don't think it would be a commonly-used feature, because your ISR can be interrupted by a higher priority ISR at any moment, even after you check the hypothetical was_delayed flag. A higher priority ISR can often check if a lower priority interrupt is pending though.
For your specific situation
A possible approach is to use a timer and DMA (similar to audio streaming, double-buffered/circular modes are preferred) to continuously sample your GPIO pins to a buffer, and then you scan the buffer to determine when the pins changed. Note that this means the CPU must scan the buffer before it is overwritten again by DMA, which means the CPU can only sleep in short intervals and must keep the timer and DMA clocks running. ST's AN4666 is a relevant document, and has example code here (account required to download example code). They're using a different microcontroller, but they claim the approach can be adapted to others in their lineup.
Otherwise, with your current setup, I don't think there is a better solution than the one you're using (the flag that's set when you detect a delay). The ARM Cortex-M0+ NVIC does not have a feature to indicate if an ISR was delayed.
A refinement to your current approach might be making the ISRs as short as possible, so they only do the timestamp collection and then put any other work into a queue for processing by the main application at a lower priority (only applicable if the work is more complex than the enqueue operation, and if the work isn't time-sensitive). Eliminating or making the interrupts-disabled regions short should also help.

STM32 RTOS timer interrupt and threads

I am working on a project where I need to execute 2 pieces of code off TIM interrupts. One of them has a slightly higher priority than the other, and both will be running on 2 different timers (of course not at the same time interval). Due to both timers being proportional to another (one is 1KHz, one is 8Khz) both will trigger at the same time.
Since I am already using the RTOS middle-ware for another purposes (threads of a much lower priority than these too), I was thinking of creating one thread of each these routines.
However, looking at how cubeMX is generating code, I am even wondering if this is possible.
I can start/stop these timers from any thread, but there is only one HAL_TIM_PeriodElapsedCallback which you usually fill with if statements like so:
if (htim->Instance == TIM2)
Am I correct to assume, regardless of which thread the timers are started from, the TIM callback will always occur "outside" of the RTOS environment?
if so, what would be a better strategy to achieve something close to what I need?
Cheers
Interrupts will triger. But remember:
Its priority (not the RTOS priority as they are unrelated) must be lower the SVC interrupt if you want to use any ...fromISR RTOS functions
They will not happen at the same time (as you have only one core)
I am working on a project where I need to execute 2 pieces of code off
TIM interrupts. One of them has a slightly higher priority than the
other, and both will be running on 2 different timers...
What exactly do you mean by "one of them has a [..] higher priority" - the HW timer events will occur just when the timer underflows occur. I think you mean, the handler code servicing the timeout events.
... (of course not at the same time interval). Due to both timers being proportional to another (one is 1KHz, one is 8Khz) both will trigger at the same time.
In embedded realtime programming, you should never build on the assumption that IRQ events are not occurring at the same time: Your ISR handlers may be suppressed at the moment when a trigger event occurs. This way, even if two concurrent events trigger closely after each other, it may look for your software code as if they had triggered at the same time. The solution is what your question points at: Context priorities (of tasks (= "threads") and ISRs (= "Interrupt handlers")) let you avoid the question which event came earlier and control which event to treat first.
Since I am already using the RTOS middle-ware for another purposes (threads of a much lower priority than these too), I was thinking of creating one thread of each these routines.
You are free to deploy code to an RTOS task or to an ISR, but keep in mind that any ISR will have a higher priority than any task. Your TIM event will trigger an ISR (= interrupt context), but you can (and often should) use the ISR to send a notification (or event, or semaphore, or queue message) to a task in order to have the main part of the timer event processed at the lower priority of a task.
However, looking at how cubeMX is generating code, I am even wondering if this is possible.
CubeMX is not limiting you to use or not use tasks. The question is rather how far CubeMX will generate the code you need, and how much you have to add manually. Please note that you don't have to use the CubeMX feature to generate tasks through its configuration, but this can be done by your own C code, too.
I can start/stop these timers from any thread, but there is only one HAL_TIM_PeriodElapsedCallback which you usually fill with if statements like so:
if (htim->Instance == TIM2)
Am I correct to assume, regardless of which thread the timers are started from, the TIM callback will always occur "outside" of the RTOS environment?
Yes, you are. The question who started the timer is not relevant to the context type/selection triggered by the timer. In any case, the TIM will trigger its ISR (at the interrupt priority configured for that interrupt).
If you use the CubeHAL library, it will implement the root of that ISR, check which of the TIMs related to that ISR have elapsed, and invoke the code you printed. Here, you can insert your user code to the different TIM instances (like TIM2 in your case).
if so, what would be a better strategy to achieve something close to what I need?
Re-check your favourite textbook on RTOS and microcontrollers. Any SO answer cannot include all the theory to solve the problem properly.
Decide whether there will be any more urgent reaction on your system than treating the timeout events. If no, you may implement the timeout reaction in the ISR handler. If yes (or in cases of doubt), implement the ISR with a task notification that goes to a task where you do what the timeout event requires. This may be the task from where you started the timer, or another one.

what's wrong in using interrupt handlers as event listeners

My system is simple enough that it runs without an OS, I simply use interrupt handlers like I would use event listener in a desktop program. In everything I read online, people try to spend as little time as they can in interrupt handlers, and give the control back to the tasks. But I don't have an OS or real task system, and I can't really find design information on OS-less targets.
I have basically one interrupt handler that reads a chunk of data from the USB and write the data to memory, and one interrupt handler that reads the data, sends the data on GPIO and schedule itself on an hardware timer again.
What's wrong with using the interrupts the way I do, and using the NVIC (I use a cortex-M3) to manage the work hierarchy ?
First of all, in the context of this question, let's refer to the OS as a scheduler.
Now, unlike threads, interrupt service routines are "above" the scheduling scheme.
In other words, the scheduler has no "control" over them.
An ISR enters execution as a result of a HW interrupt, which sets the PC to a different address in the code-section (more precisely, to the interrupt-vector, where you "do a few things" before calling the ISR).
Hence, essentially, the priority of any ISR is higher than the priority of the thread with the highest priority.
So one obvious reason to spend as little time as possible in an ISR, is the "side effect" that ISRs have on the scheduling scheme that you design for your system.
Since your system is purely interrupt-driven (i.e., no scheduler and no threads), this is not an issue.
However, if nested ISRs are not allowed, then interrupts must be disabled from the moment an interrupt occurs and until the corresponding ISR has completed. In that case, if any interrupt occurs while an ISR is in execution, then your program will effectively ignore it.
So the longer you spend inside an ISR, the higher the chances are that you'll "miss out" on an interrupt.
In many desktop programs, events are send to queue and there is some "event loop" that handle this queue. This event loop handles event by event so it is not possible to interrupt one event by other one. It also is good practise in event driven programming to have all event handlers as short as possible because they are not interruptable.
In bare metal programming, interrupts are similar to events but they are not send to queue.
execution of interrupt handlers is not sequential, they can be interrupted by interrupt with higher priority (numerically lower number in Cortex-M3)
there is no queue of same interrupts - e.g. you can't detect multiple GPIO interrupts while you are in that interrupt - this is the reason you should have all routines as short as possible.
It is possible to implement queues by yourself, feed these queues by interrupts and consume these queues in your super loop (consume while disabling all interrupts). By this approach, you can get sequential processing of interrupts. If you keep your handlers short, this is mostly not needed and you can do the work in handlers directly.
It is also good practise in OS based systems that they are using queues, semaphores and "interrupt handler tasks" to handle interrupts.
With bare metal it is perfectly fine to design for application bound or interrupt/event bound so long as you do your analysis. So if you know what events/interrupts are coming at what rate and you can insure that you will handle all of them in the desired/designed amount of time, you can certainly take your time in the event/interrupt handler rather than be quick and send a flag to the foreground task.
The common approach of course is to get in and out fast, saving just enough info to handle the thing in the foreground task. The foreground task has to spin its wheels of course looking for event flags, prioritizing, etc.
You could of course make it more complicated and when the interrupt/event comes, save state, and return to the forground handler in the forground mode rather than interrupt mode.
Now that is all general but specific to the cortex-m3 I dont think there are really modes like big brother ARMs. So long as you take a real-time approach and make sure your handlers are deterministic, and you do your system engineering and insure that no situation happens where the events/interrupts stack up such that the response is not deterministic, not too late or too long or loses stuff it is okay
What you have to ask yourself is whether all events can be services in time in all circumstances:
For example;
If your interrupt system were run-to-completion, will the servicing of one interrupt cause unacceptable delay in the servicing of another?
On the other hand, if the interrupt system is priority-based and preemptive, will the servicing of a high priority interrupt unacceptably delay a lower one?
In the latter case, you could use Rate Monotonic Analysis to assign priorities to assure the greatest responsiveness (the shortest execution-time handlers get the highest priority). In the first case your system may lack a degree of determinism, and performance will be variable under both event load, and code changes.
One approach is to divide the handler into real-time critical and non-critical sections, the time-critical code can be done in the handler, then a flag set to prompt the non-critical action to be performed in the "background" non-interrupt context in a "big-loop" system that simply polls event flags or shared data for work to complete. Often all that might be necessary in the interrupt handler is to copy some data to timestamp some event - making data available for background processing without holding up processing of new events.
For more sophisticated scheduling, there are a number of simple, low-cost or free RTOS schedulers that provide multi-tasking, synchronisation, IPC and timing services with very small footprints and can run on very low-end hardware. If you have a hardware timer and 10K of code space (sometimes less), you can deploy an RTOS.
I am taking your described problem first
As I interpret it your goal is to create a device which by receiving commands from the USB, outputs some GPIO, such as LEDs, relays etc. For this simple task, your approach seems to be fine (if the USB layer can work with it adequately).
A prioritizing problem exists though, in this case it may be that if you overload the USB side (with data from the other end of the cable), and the interrupt handling it is higher priority than that triggered by the timer, handling the GPIO, the GPIO side may miss ticks (like others explained, interrupts can't queue).
In your case this is about what could be considered.
Some general guidance
For the "spend as little time in the interrupt handler as possible" the rationale is just what others told: an OS may realize a queue, etc., however hardware interrupts offer no such concepts. If the event causing the interrupt happens, the CPU enters your handler. Then until you handle it's source (such as reading a receive holding register in the case of a UART), you lose any further occurrences of that event. After this point, until exiting the handler, you may receive whether the event happened, but not how many times (if the event happened again while the CPU was still processing the handler, the associated interrupt line goes active again, so after you return from the handler, the CPU immediately re-enters it provided nothing higher priority is waiting).
Above I described the general concept observable on 8 bit processors and the AVR 32bit (I have experience with these).
When designing such low-level systems (no OS, one "background" task, and some interrupts) it is fundamental to understand what goes on on each priority level (if you utilize such). In general, you would make the most real-time critical tasks the highest priority, taking the most care of serving those fast, while being more relaxed with the lower priority levels.
From an other aspect usually at design phase it can be planned how the system should react to missed interrupts, since where there are interrupts, missing one will eventually happen anyway. Critical data going across communication lines should have adequate checksums, an especially critical timer should be sourced from a count register, not from event counting, and the likes.
An other nasty part of interrupts is their asynchronous nature. If you fail to design the related locks properly, they will eventually corrupt something giving nightmares to that poor soul who will have to debug it. The "spend as little time in the interrupt handler as possible" statement also encourages you to keep the interrupt code reasonably short which means less code to consider for this problem as well. If you also worked with multitasking assisted by an RTOS you should know this part (there are some differences though: a higher priority interrupt handler's code does not need protection against a lower priority handler's).
If you can properly design your architecture regarding the necessary asynchronous tasks, getting around without an OS (from the no multitasking aspect) may even prove to be a nicer solution. It needs way more thinking to design it properly, however later there are much less locking related problems. I got through some mid-sized safety critical projects designed over a single background "task" with very few and little interrupts, and the experience and maintenance demands regarding those (especially the tracing of bugs) were quite satisfactory compared to some others in the company built over multitasking concepts.

What is the irq latency due to the operating system?

How can I estimate the irq latency on ARM processor?
What is the definition for irq latency?
Interrupt Request (irq) latency is the time that takes for interrupt request to travel from source of the interrupt to the point when it will be serviced.
Because there are different interrupts coming from different sources via different paths, obviously their latency is depending on the type of the interrupt. You can find table with very good explanations about latency (both value and causes) for particular interrupts on ARM site
You can find more information about it in ARM9E-S Core Technical Reference Manual:
4.3 Maximum interrupt latency
If the sampled signal is asserted at the same time as a multicycle instruction has started
its second or later cycle of execution, the interrupt exception entry does not start until
the instruction has completed.
The longest LDM instruction is one that loads all of the registers, including the PC.
Counting the first Execute cycle as 1, the LDM takes 16 cycles.
• The last word to be transferred by the LDM is transferred in cycle 17, and the abort
status for the transfer is returned in this cycle.
• If a Data Abort happens, the processor detects this in cycle 18 and prepares for
the Data Abort exception entry in cycle 19.
• Cycles 20 and 21 are the Fetch and Decode stages of the Data Abort entry
respectively.
• During cycle 22, the processor prepares for FIQ entry, issuing Fetch and Decode
cycles in cycles 23 and 24.
• Therefore, the first instruction in the FIQ routine enters the Execute stage of the
pipeline in stage 25, giving a worst-case latency of 24 cycles.
and
Minimum interrupt latency
The minimum latency for FIQ or IRQ is the shortest time the request can be sampled
by the input register (one cycle), plus the exception entry time (three cycles). The first
interrupt instruction enters the Execute pipeline stage four cycles after the interrupt is
asserted
There are three parts to interrupt latency:
The interrupt controller picking up the interrupt itself. Modern processors tend to do this quite quickly, but there is still some time between the device signalling it's pin and the interrupt controller picking it up - even if it's only 1ns, it's time [or whatever the method of signalling interrupts are].
The time until the processor starts executing the interrupt code itself.
The time until the actual code supposed to deal with the interrupt is running - that is, after the processor has figured out which interrupt, and what portion of driver-code or similar should deal with the interrupt.
Normally, the operating system won't have any influence over 1.
The operating system certainly influences 2. For example, an operating system will sometimes disable interrupts [to avoid an interrupt interfering with some critical operation, such as for example modifying something to do with interrupt handling, or when scheduling a new task, or even when executing in an interrupt handler. Some operating systems may disable interrupts for several milliseconds, where a good realtime OS will not have interrupts disabled for more than microseconds at the most.
And of course, the time it takes from the first instruction in the interrupt handler runs, until the actual driver code or similar is running can be quite a few instructions, and the operating system is responsible for all of them.
For real time behaviour, it's often the "worst case" that matters, where in non-real time OS's, the overall execution time is much more important, so if it's quicker to not enable interrupts for a few hundred instructions, because it saves several instructions of "enable interrupts, then disable interrupts", a Linux or Windows type OS may well choose to do so.
Mats and Nemanja give some good information on interrupt latency. There are two is one more issue I would add, to the three given by Mats.
Other simultaneous/near simultaneous interrupts.
OS latency added due to masking interrupts. Edit: This is in Mats answer, just not explained as much.
If a single core is processing interrupts, then when multiple interrupts occur at the same time, usually there is some resolution priority. However, interrupts are often disabled in the interrupt handler unless priority interrupt handling is enabled. So for example, a slow NAND flash IRQ is signaled and running and then an Ethernet interrupt occurs, it may be delayed until the NAND flash IRQ finishes. Of course, if you have priorty interrupts and you are concerned about the NAND flash interrupt, then things can actually be worse, if the Ethernet is given priority.
The second issue is when mainline code clears/sets the interrupt flag. Typically this is done with something like,
mrs r9, cpsr
biceq r9, r9, #PSR_I_BIT
Check arch/arm/include/asm/irqflags.h in the Linux source for many macros used by main line code. A typical sequence is like this,
lock interrupts;
manipulate some flag in struct;
unlock interrupts;
A very large interrupt latency can be introduced if that struct results in a page fault. The interrupts will be masked for the duration of the page fault handler.
The Cortex-A9 has lots of lock free instructions that can prevent this by never masking interrupts; because of better assembler instructions than swp/swpb. This second issue is much like the IRQ latency due to ldm/stm type instructions (these are just the longest instructions to run).
Finally, a lot of the technical discussions will assume zero-wait state RAM. It is likely that the cache will need to be filled and if you know your memory data rate (maybe 2-4 machine cycles), then the worst case code path would multiply by this.
Whether you have SMP interrupt handling, priority interrupts, and lock free main line depends on your kernel configuration and version; these are issues for the OS. Other issues are intrinsic to the CPU/SOC interrupt controller, and to the interrupt code itself.

disable_local_irq and kernel timers

While doing SMP porting of some of our drivers (on
powerpc target) we observed some behavior on which I need you guys to
shed some light:
On doing a local_irq_disable() on a UP system the jiffies tend to
freeze i.e. the count stops incrementing. Is this expected? I thought
that the decrementer interrupt is 'internal' and should not get
affected by the local_irq_disable() kind off call since I expected it to
disable local IRQ interrupt processing (external interrupt). The
system of course freezes then also upon doing a local_irq_enable() the
jiffies count jumps and it seems to be compensating for the 'time
lapse' between the local_irq_disable() and enable() call.
Doing the same on an SMP system (P2020 with 2 e500 cores) the
results are surprising. Firstly the module that is being inserted to
do this testing always executes on core 1. Further it sometimes does
not see a freeze of 'jiffies' counter and sometimes we see that it
indeed freezes. Again in case of a freeze of count it tends to jump
after doing a local_irq_enable(). I have no idea why this may be
happening.
Do we know in case of an SMP do both cores run a schedule timer, so
that in some cases we do not see a freeze of jiffies counts or is it
just on core 0 ?
Also since the kernel timers rely on 'jiffies' -- this would mean that
none of our kernel timers will fire if local_irq_disable() has been
done? What would be the case this is done on one of the cores in an
SMP system?
There are many other questions, but I guess these will be enough to
begin on a general discussion about the same :)
TIA
NS
Some more comments from the experimentation done.
My understanding at this point in time is that since kernel timers depend on 'jiffies' to fire, they wont actually fire on a UP system when I issue a local_irq_save(). Infact some of our code is based on the assumption that when I do issue a local_irq_save() it guarantees protection against interrupts on the local processor and kernel timers as well.
However carrying out the same experiment on an SMP system, even with both cores executing a local_irq_save(), the jiffies do NOT stop incrementing and the system doesn't freeze. How is this possible ? Is LINUX using some other mechanism to trigger timer interrupts in the SMP system or possibly using IPIs? This also breaks our assumption that local_irq_disable() will protect the system against kernel timers running on the same core atleast.
How do we go about writing a code that is safe against async events i.e. interrupts and kernel timers and is valid for both UP and SMP.
local_irq_disable only disables interrupts on the current core, so, when you're single core, everything is disabled (including timer interrupts) and that is why jiffies are not updated.
When running on SMP, sometimes you happen to disable the interrupts on the core that's updating the jiffies, sometimes not.
This usually is not a problem, because interrupts are supposed to be disabled only for a very short periods, and all scheduled timers will fire after interrupts gets enabled again.
How do you know that your module always run on core 1? On current versions of the kernel, it may even be running on more than one core at the same time (that is, if you didn't forced it to don't do it).
There are several facets to this problem. Lets take them 1 by 1.
1.
a)
local_irq_save() simply clears the IF flag of the eflags register. IRQ handlers can run concurently on the other cores.
global_irq_save() is not available because that would required interprocessor communication to implement and it is not really needed anyway since local irq disabling is intended for very short period of time only.
b)
modern APICs allows IRQ dynamic distribution among the present cores and besides rare exceptions, the kernel essentially programs the necessary registers to obtain a round-robin distribution of the IRQs.
The consequence of that is that if the irqs are disabled long enough locally, when the APIC delivers an IRQ to the core that has them disabled, the end result will be that the system will globally stop receiving this particular IRQ up to the point where the irqs are finally reenabled locally on the core that received the last IRQ of that type.
2.
Concerning the different results concerning jiffies updates and irq disabling, it depends on the selected clocksource.
You can figure out which one is choosen by consulting:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
if you have tsc as clocksource then all cores have it locally. However if your clocksource is something else ie: HPET an external device, then jiffies will become frozen for the reasons described in point #1.

Resources