Preserving the execution pipeline in C

Return values are frequently checked for errors, but the code that should continue executing on success can be structured in different ways.
if (!ret)
{
    doNoErrorCode();
}
exit(1);
or
if (ret)
{
    exit(1);
}
doNoErrorCode();
Heavyweight CPUs can speculate about branches taken in near proximity/locality using simple statistics. I studied a 4-bit mechanism for branch speculation with states (-2, -1, 0, +1, +2), where zero means unknown and +2 means the branch is predicted taken.
Considering the simple technique above, my questions are about how to structure code. I assume there must be a convention among major compilers and major architectures. These are my two questions:
When the code isn't in an often-visited loop, which boolean value is favored while the pipeline is being filled?
Speculation about a branch must begin at true, false, or zero (the pipeline must be filled with something). Which is it likely to be?

The behavior varies among CPUs, and the compiler often reorders instructions.
You will find all the information you need in these manuals: http://agner.org/optimize/.
In my opinion the only way to know what happens is to read the assembly code generated by the compiler.

On gcc you can use __builtin_expect to provide the compiler with branch prediction information. To make it slightly easier you can then borrow the likely/unlikely macros used e.g. in the Linux kernel:
#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)
and then e.g.
if (unlikely(!some_function()))
    error_handling();
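Putting it together with the error-check pattern from the question, a minimal sketch might look like the following. This is gcc/clang only, since __builtin_expect is a GNU extension; some_call() and the stub bodies are placeholders, not code from the question:
#include <stdio.h>
#include <stdlib.h>

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

/* Hypothetical stand-ins for the functions in the question. */
static int some_call(void) { return 0; }          /* non-zero means error */
static void doNoErrorCode(void) { puts("ok"); }

int main(void)
{
    int ret = some_call();

    if (unlikely(ret))    /* hint: the error path is rarely taken */
    {
        exit(1);
    }
    doNoErrorCode();      /* the common, fall-through path */
    return 0;
}
With the hint, the compiler is free to lay out the common path as the straight-line fall-through and move the exit(1) path out of the way, which is exactly the layout the question is asking about.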


C99 "atomic" load in baremetal portable library

I'm working on a portable library for baremetal embedded applications.
Assume that I have a timer ISR that increments a counter, and that in the main loop this counter is read in a most certainly non-atomic load.
I'm trying to ensure load consistency (i.e. that I'm not reading garbage because the load was interrupted and the value changed) without resorting to disabling interrupts. It does not matter if the value changed after reading the counter as long as the read value is proper. Does this do the trick?
uint32_t read(volatile uint32_t *var)
{
    uint32_t value;
    do { value = *var; } while (value != *var);
    return value;
}
It's highly unlikely that there's any sort of a portable solution for this, not least because plenty of C-only platforms are really C-only and use one-off compilers, i.e. nothing mainstream and modern-standards-compliant like gcc or clang. So if you're truly targeting entrenched C, then it's all quite platform-specific and not portable - to the point where "C99" support is a lost cause. The best you can expect for portable C code is ANSI C support - referring to the very first non-draft C standard published by ANSI. That is still, unfortunately, the common denominator that major vendors get away with. I mean: Zilog somehow gets away with it, even if they are now but a division of Littelfuse, formerly a division of IXYS Semiconductor that Littelfuse had acquired.
For example, here are some compilers where there's only a platform-specific way of doing it:
Zilog eZ8 using a "recent" Zilog C compiler (anything 20 years old or less is OK): 8-bit value read-modify-write is atomic. 16-bit operations where the compiler generates word-aligned word instructions like LDWX, INCW, DECW are atomic as well. If the read-modify-write otherwise fits into 3 instructions or less, you'd prepend the operation with asm("\tATM");. Otherwise, you'd need to disable the interrupts: asm("\tPUSHF\n\tDI");, and subsequently re-enable them: asm("\tPOPF");.
Zilog ZNEO is a 16 bit platform with 32-bit registers, and read-modify-write accesses on registers are atomic, but memory read-modify-write round-trips via a register, usually, and takes 3 instructions - thus prepend the R-M-W operation with asm("\tATM").
Zilog Z80 and eZ80 require wrapping the code in asm("\tDI") and asm("\tEI"), although this is valid only when it's known that the interrupts are always enabled when your code runs. If they may not be enabled, then there's a problem since Z80 does not allow reading the state of IFF1 - the interrupt enable flip-flop. So you'd need to save a "shadow" of its state somewhere, and use that value to conditionally enable interrupts. Unfortunately, eZ80 does not provide an interrupt controller register that would allow access to IEF1 (eZ80 uses the IEFn nomenclature instead of IFFn) - so this architectural oversight is carried over from the venerable Z80 to the "modern" one.
Those aren't necessarily the most popular platforms out there, and many people don't bother with Zilog compilers due to their fairly poor quality (low enough that yours truly had to write an eZ8-targeting compiler*). Yet such odd corners are the mainstay of C-only code bases, and library code has no choice but to accommodate this, if not directly then at least by providing macros that can be redefined with platform-specific magic.
E.g. you could provide empty-by-default macros MYLIB_BEGIN_ATOMIC(vector) and MYLIB_END_ATOMIC(vector) that would be used to wrap code that requires atomic access with respect to a given interrupt vector (or e.g. -1 if with respect to all interrupt vectors). Naturally, replace MYLIB_ with a "namespace" prefix specific to your library.
To enable platform-specific optimizations such as ATM vs DI on "modern" Zilog platforms, an additional argument could be provided to the macro to separate the presumed "short" sequences that the compiler is apt to generate three-instruction sequences for from longer ones. Such micro-optimization usually requires an assembly-output audit (easily automatable) to verify the assumption about the instruction sequence length, but at least the data to drive the decision would be available, and the user would have a choice of using it or ignoring it.
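As a concrete illustration, here is a minimal sketch of what such macros might look like. The default definitions are empty, and the Zilog-style override shown in the comment is an assumption based on the inline-assembly snippets above, not a tested port:
#include <stdint.h>

/* mylib_atomic.h - empty-by-default atomic-section macros (sketch).
   A platform port supplies its own definitions before this point. */
#ifndef MYLIB_BEGIN_ATOMIC
#define MYLIB_BEGIN_ATOMIC(vector)  /* no-op by default */
#endif
#ifndef MYLIB_END_ATOMIC
#define MYLIB_END_ATOMIC(vector)    /* no-op by default */
#endif

/* A hypothetical Zilog-style port might instead define:
   #define MYLIB_BEGIN_ATOMIC(vector)  asm("\tPUSHF\n\tDI")
   #define MYLIB_END_ATOMIC(vector)    asm("\tPOPF")
*/

/* Library code then brackets shared-data access like this: */
void mylib_increment(volatile uint32_t *counter)
{
    MYLIB_BEGIN_ATOMIC(-1);   /* -1: atomic with respect to all vectors */
    (*counter)++;
    MYLIB_END_ATOMIC(-1);
}
The point of the empty defaults is that the library compiles unchanged on platforms where the accesses are already atomic, while odd corners such as the Zilog parts above can inject their DI/ATM magic without touching library source.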
*If some lost soul wants to know anything bordering on the arcane re. eZ8 - ask away. I know entirely too much about that platform, in details so gory that even modern Hollywood CG and SFX would have a hard time reproducing the true depth of the experience on-screen. I'm also possibly the only one out there running the 20MHz eZ8 parts occasionally at 48MHz clock - as sure a sign of demonic possession as the multiverse allows. If you think it's outrageous that such depravity makes it into production hardware - I'm with you. Alas, business case is business case, laws of physics be damned.
Are you running on any systems that have uint32_t larger than a single assembly-instruction word read/write size? If not, the I/O to memory should be a single instruction and therefore atomic (assuming the bus is also word sized...). You get in trouble when the compiler breaks it up into multiple smaller reads/writes. Otherwise, I've always had to resort to DI/EI. You could have the user configure your library so that it knows whether atomic instructions or a minimum 32-bit word size are available, to prevent interrupt twiddling. If you have these guarantees, you don't need the verification code.
To answer the question though, on a system that must split the reads/writes, your code is not safe. Imagine a case where you read your value correctly in the "do" part, but the value gets split during the "while" part check. Further, in an extreme case, this is an infinite loop. For complete safety, you'd need a retry count and error condition to prevent that (see the sketch after the failure case below). The loop case is extreme for sure, but I'd want it just in case. That of course makes the run time longer.
Let's show a failure case as an example, using 16-bit numbers on a machine that reads 8 bits at a time to make it easier to follow:
Value to read from memory *var is 0x1234
Read 8-bit 0x12
*var becomes 0x5678
Read 8-bit 0x78 - value is now 0x1278 (invalid)
*var becomes 0x1234
Verification step reads 8-bit 0x12
*var becomes 0x5678
Verification reads 8-bit 0x78
The value 0x1278 is confirmed as "correct", but this is an error, as *var was only ever 0x1234 or 0x5678.
Another failure case would be when *var just happens to change at the same frequency as your code is running, which could lead to an infinite loop as each verification fails. Even if it did break out eventually, this would be a very hard-to-track performance bug.
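A bounded-retry variant of the question's read function, along the lines suggested above, might look like the sketch below. The retry limit and the boolean error return are illustrative assumptions, and note that two matching reads still do not make the access atomic on a machine that splits it, as the failure case just shown demonstrates:
#include <stdbool.h>
#include <stdint.h>

#define READ_MAX_RETRIES 8   /* arbitrary bound, chosen for illustration */

/* Returns true and stores a snapshot in *out when two consecutive reads
   match, or false if they kept disagreeing for READ_MAX_RETRIES attempts. */
bool read_checked(volatile uint32_t *var, uint32_t *out)
{
    for (unsigned i = 0; i < READ_MAX_RETRIES; ++i)
    {
        uint32_t value = *var;
        if (value == *var)   /* second read matched: accept the snapshot */
        {
            *out = value;
            return true;
        }
    }
    return false;            /* caller decides how to handle the failure */
}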

How to find issues related to Data consistency in an Embedded C code base?

Let me explain what I mean by a data consistency issue. Take the following scenario for example:
uint16 x,y;
x=0x01FF;
y=x;
Clearly, these variables are 16 bit, but if an 8-bit CPU is used with this code, the read or write operations will not be atomic. Thus an interrupt can occur in between and change the value. This is one situation which MIGHT lead to data inconsistency.
Here's another example,
if (x > 7) // x is a global variable
{
    switch (x)
    {
    case 8:  // do something
        break;
    case 10: // do something
        break;
    default: // do default
        break;
    }
}
In the above code excerpt, if an interrupt changes the value of x from 8 to 5 after the if statement but before the switch statement, we end up in the default case instead of case 8.
Please note, I'm looking for ways to detect such scenarios (but not solutions)
Are there any tools that can detect such issues in Embedded C?
It is possible for a static analysis tool that is context (thread/interrupt) aware to determine the use of shared data, and such a tool could recognise specific mechanisms that protect such data (or the lack thereof).
One such tool is Polyspace Code Prover; it is very expensive and very complex, and does a lot more besides what is described above. Specifically, to quote (elided) from the whitepaper here:
With abstract interpretation the following program elements are interpreted in new ways:
[...]
Any global shared data may change at any time in a multitask program, except when protection
mechanisms, such as memory locks or critical sections, have been applied
[...]
It may have improved in the long time since I used it, but one issue I had was that it worked on a lock-access-unlock idiom, where you specified to the tool what the lock/unlock calls or macros were. The problem with that is that the C++ project I worked on used a smarter method where a locking object (mutex, scheduler-lock or interrupt disable for example) locked on instantiation (in the constructor) and unlocked in the destructor so that it unlocked automatically when the object went out of scope (a lock by scope idiom). This meant that the unlock was implicit and invisible to Polyspace. It could however at least identify all the shared data.
Another issue with the tool is that you must specify all thread and interrupt entry points for concurrency analysis, and in my case these were private-virtual functions in task and interrupt classes, again making them invisible to Polyspace. This was solved by conditionally making the entry-points public for the abstract analysis only, but meant that the code being tested does not have the exact semantics of the code to be run.
Of course these are non-problems for C code, and in my experience Polyspace is much more successfully applied to C in any case; you are far less likely to be writing code in a style to suit the tool rather than the tool working with your existing code-base.
There are no such tools as far as I am aware. And that is probably because you can't detect them.
Pretty much every operation in your C code has the potential to get interrupted before it is finished. Less obvious than the 16 bit scenario, there is also this:
uint8_t a, b;
...
a = b;
There is no guarantee that this is atomic! The above assignment may well translate to multiple assembler instructions, such as 1) load b into a register, 2) store the register at the memory address of a. You can't know this unless you disassemble the C code and check.
This can create very subtle bugs. Any assumption of the kind "as long as I use 8 bit variables on my 8 bit CPU, I can't get interrupted" is naive. Even if such code would result in atomic operations on a given CPU, the code is non-portable.
The only reliable, fully portable solution is to use some manner of semaphore. On embedded systems, this could be as simple as a bool variable. Another solution is to use inline assembler, but that can't be ported across platforms.
To solve this, C11 introduced the qualifier _Atomic to the language. However, C11 support among embedded systems compilers is still mediocre.
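Where a C11 toolchain is available, a minimal sketch of the _Atomic approach for an ISR-incremented counter might look like this. The ISR name is a placeholder, since the actual vector/attribute syntax is compiler-specific, and on small targets the compiler may implement these atomics with hidden locking or interrupt disabling:
#include <stdatomic.h>
#include <stdint.h>

/* Counter shared between a timer ISR and the main loop. */
static _Atomic uint32_t tick_count;

/* Hypothetical timer ISR body; register it with your vector mechanism. */
void timer_isr(void)
{
    atomic_fetch_add_explicit(&tick_count, 1, memory_order_relaxed);
}

/* Called from the main loop; an atomic load cannot observe a torn value. */
uint32_t read_ticks(void)
{
    return atomic_load_explicit(&tick_count, memory_order_relaxed);
}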

How to enable the DIV instruction in ASM output of C compiler

I am using vbcc compiler to translate my C code into Motorola 68000 ASM.
For whatever reason, every time I use the division (just integer, not floats) in code, the compiler only inserts the following stub into the ASM output (that I get generated upon every recompile):
public __ldivs
jsr __ldivs
I explicitly searched for all variations of DIVS/DIVU, but every single time, there is just that stub above. The code itself works (I debugged it on target device), so the final code does have the DIV instruction, just not the intermediate output.
Since this is the most expensive instruction and it's in an inner loop, I really gotta experiment with tweaking the code to get the max performance of it.
However, I can't do that if I don't see the resulting ASM code. Any ideas how to enable it? The compiler manual does not specify anything like that, so there clearly must be some other - probably common - higher principle in play?
From the vbcc compiler system manual by Volker Barthelmann:
4.1 Additional options
This backend provides the following additional options:
-cpu=n Generate code for cpu n (e.g. -cpu=68020), default: 68000.
...
4.5 CPUs
The values of -cpu=n have those effects:
...
n>=68020
32bit multiplication/division/modulo is done with the mul?.l, div?.l and
div?l.l instructions.
The original 68000 CPU didn't have support for 32-bit divides, only 16-bit division, so by default vbcc doesn't generate 32-bit divide instructions.
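If the target has to stay a plain 68000, one hedged workaround is to keep the division within 16-bit range so the compiler can use the CPU's native 16-bit divide instead of calling __ldivs; whether vbcc actually does this for a given expression is an assumption that should be checked in the generated assembly:
#include <stdint.h>

/* Sketch: a 32/32-bit division becomes a __ldivs runtime call on a plain
   68000, but a division whose dividend fits in 32 bits and whose divisor
   fits in 16 bits is a candidate for a single DIVU.W/DIVS.W instruction.
   Verify the emitted assembly; this is compiler-dependent. */
uint16_t scale(uint16_t value, uint16_t divisor)
{
    return (uint16_t)(value / divisor);   /* candidate for DIVU.W */
}
Otherwise, as the manual excerpt says, building with -cpu=68020 (or higher) makes the compiler emit the 32-bit div?.l instructions inline.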
Basically your question doesn't even belong here. You're asking about the workings of your compiler, not the 68K CPU family.
Since this is the most expensive instruction and it's in an inner loop, I really gotta experiment with tweaking the code to get the max performance of it.
Then you are already fighting windmills. Choosing an obscure C compiler while at the same time desiring top performance are conflicting goals.
If you really need MC68000 code compatibility, the choice of C is questionable. Since the 68000 has zero cache, the store/load orgies that simple C compilers tend to produce en masse have a huge performance impact. The impact lessens considerably for the higher members of the family and may become invisible on the superscalar pipelined ones (erm, one; the 68060).
Switch to 68020 code model if target platform permits, and switch compiler if you're not satisfied with your current one.

How to write ISO-C compliant code while allowing multiple instructions between sequencing points?

I have a subtle question; I would like to write code that is portable (that's why I am sticking to any of the last three ISO C standard definitions) and machine-independent (thus, assembler is out of the question), but that lets the compiler pack several (independent) instructions into one CPU cycle.
I thought that using the comma operator would do the trick, but the standard says that each comma is a sequence point, so it would not do.
I would like to take advantage of multiple independent assignments, additions, etc. (just as a register variable is an indication to the compiler of possible optimizations and of the independence of the operations).
Does anyone have any idea?
Let the compiler do optimizations.
The compiler can optimize across sequence points when it recognizes that they are independent and without interactions.
For example, in code:
a = x+y;
b = y+z;
A compiler can recognize that the assignments to a and b are fully independent of each other, and can do both at the same time, despite the sequence point.
As a general rule, you cannot do a better job than the compiler.
Let the compiler do its job of creating fast, efficient code, and you should focus on your job: writing clear, unambiguous instructions for bug-free algorithms.
The compiler generates code. The processor executes it. It is up to the processor to perform more than one instruction per cycle, and modern processors are quite good at this. If operations are independent, the processor will figure it out.
The processor will also rearrange instructions, and often perform multiple instructions that are nowhere near together in your source code. There is nothing that you can do to help in source code.
Your question is deeply misguided, as other answerers have pointed out. (Compilers usually reorder things and do all sorts of horrible stuff even when things are separated by sequence points; conversely, doing two things that can "interfere" with one another that aren't separated by a sequence point is undefined behaviour.) However, you can do what you're asking in a bit of a silly way.
The evaluations of different arguments to a function call are not sequenced with respect to one another, so you can make up a dummy function like this:
void dont_sequence(int a, int b) { (void)a; (void)b; }
and use it like this:
dont_sequence(i += 2, j += 4);
Again, I don't believe there is any purpose to this. This won't help any compiler I've ever used. The compiler doesn't have to follow your instructions; it's only required to generate code that behaves as if it followed your instructions, and that's what modern compilers do.
TL;DR there is no such trick available. Choose a different language.
1. The C language was designed to be portable and machine-independent, but it does not have language constructs to clearly express data-flow independence or other hints that might be utilized by compilers when targeting processors with different parallel granularity (see e.g. the article Threading and Parallel Programming Constructs used in multicore systems development: Part 2 for a discussion of such constructs).
2. An abstract, machine-independent compiler target (which reflects current processor architectures) was described by computer scientist Donald Knuth as MMIX. There is also a GCC port available that can target this processor, so you might check your C code against its output.
3. For a more detailed explanation of how compilers and processors derive their hints (you call them sequencing points), see e.g. the book Processor Architecture: From Dataflow to Superscalar and Beyond by Jurij Silc, Borut Robic, and Theo Ungerer.
4. For a list of portable and machine-independent languages that explicitly support parallelism, see e.g. Wikipedia: List of concurrent and parallel programming languages.
5. For some discussion about how to use C for parallel programming, see e.g. Which is the best parallel programming language for initiating undergraduate students in the world of multicore/parallel computing?

Is there a difference in performance when swapping if/else condition?

Is there a difference in performance between
if (array[i] == -1) {
    doThis();
}
else {
    doThat();
}
and
if (array[i] != -1) {
    doThat();
}
else {
    doThis();
}
when I already know that there is only one element (or in general few elements) with the value -1?
That will depend entirely on how your compiler chooses to optimise it. You have no guarantee as to which is faster. If you really need to give hints to the compiler, look at the likely/unlikely macros used in the Linux kernel, which are defined thus:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
Which means you can use
if (likely(something)) { ... }
or
if (unlikely(something)) { ... }
Details here: http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
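Applied to the example from the question, assuming the -1 case really is rare, a sketch might look like this (gcc/clang only, using the unlikely() macro defined above; doThis()/doThat() are the question's placeholders):
void doThis(void);
void doThat(void);

void process(const int *array, int i)
{
    if (unlikely(array[i] == -1)) {   /* the rare -1 case */
        doThis();
    }
    else {
        doThat();                     /* the common case */
    }
}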
Moral: write your code for readability, and not how you think the compiler will optimise it, as you are likely to be wrong.
Performance is always implementation dependent. If it is sufficiently important to you, then you need to benchmark it in your environment.
Having said that: there is probably no difference, because modern compilers are likely to turn both versions into equally efficient machine code.
One thing that might cause a difference is if the different code order changes the compiler's branch prediction heuristics. This can occasionally make a noticeable difference.
The compiler wouldn't know about your actual data, so it will produce roughly the same low-level code.
However, given that if statements generate assembly branches and jumps, your code may run a little faster in the second version, because when the value is not -1 execution falls through to the very next instruction, whereas in the first version it would need to jump to a new instruction address, which may be costly, especially when you deal with a large number of values (say millions).
That would depend on which condition is encountered first. As such there is not such a big difference.
If you have to test a lot of conditions, a switch statement would be faster than nested if-else.
