Mapping inputs to an array (A better way?) - c

I'm working in an embedded system and have "mapped" some defines to an array for inputs.
volatile int INPUT_ARRAY[40];
#define INPUT01 INPUT_ARRAY[0]
#define INPUT02 INPUT_ARRAY[1]
// section 2
if ( INPUT01 && INPUT02 ) {
writepin(outputpin, value);
}
If I want to read from Input 1, I can simply say newvariable = INPUT01 or I can compare data with Input 1, like in section 2 of my code. I'm not sure if this is a normal way of mapping the name INPUT01 to where the array position is. Or for an Input pin in the first place. Each array value represents a binary pin, and are read into the array by decoding a port value (16 bit). Question: Is using the defines and array like this reasonably efficient?

Yes, your solution is efficient.
Before the C compiler even sees your code, the C preprocessor substitutes INPUT_ARRAY[0] for INPUT01 and, similarly, INPUT_ARRAY[1] for INPUT02; so this substitution uses zero time and zero power at run time.
Moreover, when the C compiler sees INPUT_ARRAY[1] in the preprocessed code, it adds 1 at compile time to the base address of INPUT_ARRAY. Therefore, you get maximal efficiency at run time.
Admittedly, were you manually to turn your C compiler's optimizer off, as with the -O0 option of GCC, then it is conceivable that the compiler would emit assembly code to add the 1 at run time. So don't do that.
The only likely exception to the foregoing would be the case that the base address of INPUT_ARRAY were unknown to the compiler at run time, not likely because INPUT_ARRAY were dynamically allocated on the heap (which would make little sense for hardware device addressing), but likely because the base address of INPUT_ARRAY were configurable during boot via device configuration registers. Some hardware does this, but if yours does, why, that is exactly the reason your MCU (or MPU) possesses an index-offset indirect addressing mode in the first place. Though this mode engages the MCU's integer arithmetic unit, [a] the mode does not multiply (multiplication being a power-hungry operation); and, [b] anyway, the mode is such a normal, often-used mode that MCUs are invariably designed to support it efficiently—not perhaps as efficiently as precomputed direct addressing, but as efficiently as one can reasonably expect for such a use. The MCU's manufacturer knows that device pins are things you need to address. The engineer who designed your MCU will have given priority to making the index-offset indirect mode as efficient as possible for this and other reasons. (You could maybe still cheat the matter to save a few millijoules via self-modifying code, if your MCU even allowed that; but, as an engineer, you'd regret the cheat, I suspect, unless security and maintainability were non-issues to you. The problem probably is not much of a real problem. Index-offset indirect addressing is the normal technique when the base address remains unknown until run time. If you really need to save that last millijoule, then you might not be using a C compiler for your code's inner loop, anyway, but might be handcrafting assembly code.)
I suspect that you would find it instructive to tell your compiler to emit assembly code for your inspection. I do not know which compiler you are using but, if you were using GCC, then gcc -S myfile.c.

Related

C99 "atomic" load in baremetal portable library

I'm working on a portable library for baremetal embedded applications.
Assume that I have a timer ISR that increments a counter and, in the main loop, this counter read is from in a most certainly not atomic load.
I'm trying to ensure load consistency (i.e. that I'm not reading garbage because the load was interrupted and the value changed) without resorting to disabling interrupts. It does not matter if the value changed after reading the counter as long as the read value is proper. Does this do the trick?
uint32_t read(volatile uint32_t *var){
uint32_t value;
do { value = *var; } while(value != *var);
return value;
}
It's highly unlikely that there's any sort of a portable solution for this, not least because plenty of C-only platforms are really C-only and use one-off compilers, i.e. nothing mainstream and modern-standards-compliant like gcc or clang. So if you're truly targeting entrenched C, then it's all quite platform-specific and not portable - to the point where "C99" support is a lost cause. The best you can expect for portable C code is ANSI C support - referring to the very first non-draft C standard published by ANSI. That is still, unfortunately, the common denominator - that major vendors get away with. I mean: Zilog somehow gets away with it, even if they are now but a division of Littelfuse, formerly a division of IXYS Semiconductor that Littelfuse had acquired.
For example, here are some compilers where there's only a platform-specific way of doing it:
Zilog eZ8 using a "recent" Zilog C compiler (anything 20 years old or less is OK): 8-bit value read-modify-write is atomic. 16-bit operations where the compiler generates word-aligned word instructions like LDWX, INCW, DECW are atomic as well. If the read-modify-write otherwise fits into 3 instructions or less, you'd prepend the operation with asm("\tATM");. Otherwise, you'd need to disable the interrupts: asm("\tPUSHF\n\tDI");, and subsequently re-enable them: asm("\tPOPF");.
Zilog ZNEO is a 16 bit platform with 32-bit registers, and read-modify-write accesses on registers are atomic, but memory read-modify-write round-trips via a register, usually, and takes 3 instructions - thus prepend the R-M-W operation with asm("\tATM").
Zilog Z80 and eZ80 require wrapping the code in asm("\tDI") and asm("\tEI"), although this is valid only when it's known that the interrupts are always enabled when your code runs. If they may not be enabled, then there's a problem since Z80 does not allow reading the state of IFF1 - the interrupt enable flip-flop. So you'd need to save a "shadow" of its state somewhere, and use that value to conditionally enable interrupts. Unfortunately, eZ80 does not provide an interrupt controller register that would allow access to IEF1 (eZ80 uses the IEFn nomenclature instead of IFFn) - so this architectural oversight is carried over from the venerable Z80 to the "modern" one.
Those aren't necessarily the most popular platforms out there, and many people don't bother with Zilog compilers due to their fairly poor quality (low enough that yours truly had to write an eZ8-targeting compiler*). Yet such odd corners are the mainstay of C-only code bases, and library code has no choice but to accommodate this, if not directly then at least by providing macros that can be redefined with platform-specific magic.
E.g. you could provide empty-by-default macros MYLIB_BEGIN_ATOMIC(vector) and MYLIB_END_ATOMIC(vector) that would be used to wrap code that requires access atomic with respect to a given interrupt vector (or e.g. -1 if with respect to all interrupt vectors). Naturally, replace MYLIB_ with a "namespace" prefix specific to your library.
To enable platform-specific optimizations such as ATM vs DI on "modern" Zilog platforms, an additional argument could be provided to the macro to separate the presumed "short" sequences that the compiler is apt to generate three-instruction sequences for vs. longer ones. Such micro-optimization requires usually an assembly output audit (easily automatable) to verify the assumption of the instruction sequence length, but at least the data to drive the decision would be available, and the user would have a choice of using it or ignoring it.
*If some lost soul wants to know anything bordering on the arcane re. eZ8 - ask away. I know entirely too much about that platform, in details so gory that even modern Hollywood CG and SFX would have a hard time reproducing the true depth of the experience on-screen. I'm also possibly the only one out there running the 20MHz eZ8 parts occasionally at 48MHz clock - as sure a sign of demonic possession as the multiverse allows. If you think it's outrageous that such depravity makes it into production hardware - I'm with you. Alas, business case is business case, laws of physics be damned.
Are you running on any systems that have uint32_t larger than a single assembly instruction word read/write size? If not, the IO to memory should be a single instructions and therefore atomic (assuming the bus is also word sized...) You get in trouble when the compiler breaks it up into multiple smaller read/writes. Otherwise, I've always had to resort to DI/EI. You could have the user configure your library such that it has information if atomic instructions or minimum 32-bit word size are available to prevent interrupt twiddling. If you have these guarantees, you don't need to verification code.
To answer the question though, on a system that must split the read/writes, your code is not safe. Imagine a case where you read your value in correctly in the "do" part, but the value gets split during the "while" part check. Further, in an extreme case, this is an infinite loop. For complete safety, you'd need a retry count and error condition to prevent that. The loop case is extreme for sure, but I'd want it just in case. That of course makes the run time longer.
Let's show a failure case for examples - will use 16-bit numbers on a machine that reads 8-bit values at a time to make it easier to follow:
Value to read from memory *var is 0x1234
Read 8-bit 0x12
*var becomes 0x5678
Read 8-bit 0x78 - value is now 0x1278 (invalid)
*var becomes 0x1234
Verification step reads 8-bit 0x12
*var becomes 0x5678
Verification reads 8-bit 0x78
Value confirmed correct 0x1278, but this is an error as *var was only 0x1234 and 0x5678.
Another failure case would be when *var just happens to change at the same frequency as your code is running, which could lead to an infinite loop as each verification fails. Or even if it did break out eventually, this would be a very hard to track performance bug.

Compiler Hints and Semantics for Optimizations [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I spent the last couple of weeks optimizing a numerical algorithm. Through a combination of precomputation, memory alignment, compiler hints and flags, and trial and error experimentation, I brought the run-time down over an order of magnitude. I have not yet explicitly vectorized using intrinsics or used multi-threading.
Frequently when working on this type of problem, there is an initialization routine, after which, many parameters become constant. These might be filter lengths, the expression of a switch statement, for loop length or iteration increment. If the parameters were know at compile time, the compiler should be able to do a much more effective job of optimization by knowing exactly how to unroll loops, replace index calculations with instructions that have the offset encoded in the instruction, simplify or eliminate expressions at compile time, possibly eliminate switch statements, etc. The most extreme way of dealing with this problem would be to run the initialization routine (at run-time), then run the compiler on the critical function to be optimized using some kind of plugin that allows iteration over the abstract syntax tree, replace the parameters with constants, and finally dynamically link to the shared object. If the routine is short, it could be dynamically compiled inside the binary using a number of tools.
More practically, I rely very heavily on alignment, gcc __builtin_assume_aligned, restrict, manual loop unrolling, and compiler flags to get the compiler to do what I want given the unknown value of parameters at compile time. I'm wondering what other options are available to me that are at least close to portable. I only use intrinsics as a last resort since it's not portable and a lot of work. Specifically, how can I provide the compiler (gcc) with additional information concerning loop variables using either language semantics, compiler extensions, or external tools so it can do a better job of doing optimizations for me. Similarly is there any way to qualify variables as having a stride so that loads and stores are always aligned, thus more easily enabling the auto-vectorization and loop unrolling process.
These issues come up frequently, so I am hoping there is some more elegant method of solving them. What follows are examples of the kind of problems I hand optimize but I believe the compiler ought to be able to do for me. These are not intended to be further questions.
Sometimes you have a filter, the length of which is not a multiple of the length of the longest SIMD register, and there may be memory alignment issues as well. In this case case I either (A) unroll the loop by a multiple of the vector register and call into the unoptimized code for the epilogue/prologue or (B) pad the start or end of the filter with zeros. I've recently learned gcc and other compilers have the ability to peel loops. From the limited documentation I've been able to find, I believe the finest grain control you have over peeling is over entire functions (rather than individual loops) using compiler directives. Further, there are some parameters you can provide, but it's mostly just an upper or lower bound on the amount of unrolling or number of instructions produced.
In order to really know the best method of unrolling/peeling or zero padding, the compiler needs to know something about the length of the loop and/or the size of the increment. For example, it would be very helpful to know that a loop is likely to have a length greater than a million or less than 100. It would be helpful to know that the loop will always run either 32 or 34 times. In fact, since the compiler knows much more about the computer architecture than I do, it would be much better if it made all the unrolling decisions based on information I provide about the loop variables. I had a situation where I wanted the compiler to unroll a loop. I specifically gave it the #pragma GCC optimize ("unroll-loops") directive. However, what it required to work was also the statement N &= ~7, thus informing the compiler that the loop length was a multiple of 8. This is not a semantic feature of the language, and it does not have the effect of changing the value of N. It was strictly to inform the static analyzer of the compiler that the loop was already a multiple of the length of the AVX register. In this case I was lucky and it worked because gcc is very clever. But in other cases, my hints don't seem to work (or they do, but there is no compiler feedback to let me know the additional information was of no value). In one case I had to explicitly tell the compiler not to unroll the loop because the outer loop was very short and the overhead was not worth it. With the optimizer on the maximum setting, often the only way to know what's going on is to look at the assembly listing, make some changes, and try again.
In another situation I carefully unrolled a loop so the compiler would use the AVX registers. The manual unrolling was probably necessary as the compiler doesn't have sufficient information about the length of the loop or that the length was of a particular multiple. Unfortunately, the inner loop was accessing an unaligned array of floats of length four per group (16 byte alignment). The compiler was using only the legacy 128 bit XMM registers. After making a weak attempt to vectorize using AVX intrinsics, I discovered the extra overhead of the unaligned access made the performance no better than what gcc was doing. So I thought, I could align each group of floats on the start of a cache line and use a stride equal to the cache length (or half, which is the length of and AVX register) to eliminate the alignment problem. However, this may turn out to be ineffective due to the extra memory bandwidth. It's certainly more work on my part. It makes the code harder to understand. And, at the very least, getting the stride right would depend on compile time constants I would need to supply. I wonder if there is some simpler way to do this relying on the compiler to do all the work? I would be willing to try it if it meant only changing a line or two of code. It's not worth it if I have to do it manually (in this case anyway). (Thinking about it as I write this, I may be able to use a union or struct with 48 bytes of padding and and a few extra lines of code. I would have to give that some thought...)

How to enable the DIV instruction in ASM output of C compiler

I am using vbcc compiler to translate my C code into Motorola 68000 ASM.
For whatever reason, every time I use the division (just integer, not floats) in code, the compiler only inserts the following stub into the ASM output (that I get generated upon every recompile):
public __ldivs
jsr __ldivs
I explicitly searched for all variations of DIVS/DIVU, but every single time, there is just that stub above. The code itself works (I debugged it on target device), so the final code does have the DIV instruction, just not the intermediate output.
Since this is the most expensive instruction and it's in an inner loop, I really gotta experiment with tweaking the code to get the max performance of it.
However, I can't do it if I don't see the resulting ASM code. Any ideas how to enable it ? The compiler manual does not specify anything like that, so there must clearly must be some other - probably common - higher principle in play ?
From the vbcc compiler system manual by Volker Barthelmann:
4.1 Additional options
This backend provides the following additional options:
-cpu=n Generate code for cpu n (e.g. -cpu=68020), default: 68000.
...
4.5 CPUs
The values of -cpu=n have those effects:
...
n>=68020
32bit multiplication/division/modulo is done with the mul?.l, div?.l and
div?l.l instructions.
The original 68000 CPU didn't have support for 32-bit divides, only 16-bit division, so by default vbcc doesn't generate 32-bit divide instructions.
Basically your question doesn't even belong here. You're asking about the workings of your compiler not the 68K cpu family.
Since this is the most expensive instruction and it's in an inner loop, I really gotta experiment with tweaking the code to get the max performance of it.
Then you are already fighting windmills. Chosing an obscure C compiler while at the same time desiring top performance are conflicting goals.
If you really need MC68000 code compatibility, the choice of C is questionable. Since the 68000 has zero cache, store/load orgies that simple C compilers tend to produce en masse, have a huge performance impact. It lessens considerably for the higher members and may become invisible on the superscalar pipelined ones (erm, one; the 68060).
Switch to 68020 code model if target platform permits, and switch compiler if you're not satisfied with your current one.

Does memory dependence speculation prevent BN_consttime_swap from being constant-time?

Context
The function BN_consttime_swap in OpenSSL is a thing of beauty. In this snippet, condition has been computed as 0 or (BN_ULONG)-1:
#define BN_CONSTTIME_SWAP(ind) \
do { \
t = (a->d[ind] ^ b->d[ind]) & condition; \
a->d[ind] ^= t; \
b->d[ind] ^= t; \
} while (0)
…
BN_CONSTTIME_SWAP(9);
…
BN_CONSTTIME_SWAP(8);
…
BN_CONSTTIME_SWAP(7);
The intention is that so as to ensure that higher-level bignum operations take constant time, this function either swaps two bignums or leaves them in place in constant time. When it leaves them in place, it actually reads each word of each bignum, computes a new word that is identical to the old word, and write that result back to the original location.
The intention is that this will take the same time as if the bignums had effectively been swapped.
In this question, I assume a modern, widespread architecture such as those described by Agner Fog in his optimization manuals. Straightforward translation of the C code to assembly (without the C compiler undoing the efforts of the programmer) is also assumed.
Question
I am trying to understand whether the construct above characterizes as a “best effort” sort of constant-time execution, or as perfect constant-time execution.
In particular, I am concerned about the scenario where bignum a is already in the L1 data cache when the function BN_consttime_swap is called, and the code just after the function returns start working on the bignum a right away. On a modern processor, enough instructions can be in-flight at the same time for the copy not to be technically finished when the bignum a is used. The mechanism allowing the instructions after the call to BN_consttime_swap to work on a is memory dependence speculation. Let us assume naive memory dependence speculation for the sake of the argument.
What the question seems to boil down to is this:
When the processor finally detects that the code after BN_consttime_swap read from memory that had, contrary to speculation, been written to inside the function, does it cancel the speculative execution as soon as it detects that the address had been written to, or does it allow itself to keep it when it detects that the value that has been written is the same as the value that was already there?
In the first case, BN_consttime_swap looks like it implements perfect constant-time. In the second case, it is only best-effort constant-time: if the bignums were not swapped, execution of the code that comes after the call to BN_consttime_swap will be measurably faster than if they had been swapped.
Even in the second case, this is something that looks like it could be fixed for the foreseeable future (as long as processors remain naive enough) by, for each word of each of the two bignums, writing a value different from the two possible final values before writing either the old value again or the new value. The volatile type qualifier may need to be involved at some point to prevent an ordinary compiler to over-optimize the sequence, but it still sounds possible.
NOTE: I know about store forwarding, but store forwarding is only a shortcut. It does not prevent a read being executed before the write it is supposed to come after. And in some circumstances it fails, although one would not expect it to in this case.
Straightforward translation of the C code to assembly (without the C compiler undoing the efforts of the programmer) is also assumed.
I know it's not the thrust of your question, and I know that you know this, but I need to rant for a minute. This does not even qualify as a "best effort" attempt to provide constant-time execution. A compiler is licensed to check the value of condition, and skip the whole thing if condition is zero. Obfuscating the setting of condition makes this less likely to happen, but is no guarantee.
Purportedly "constant-time" code should not be written in C, full stop. Even if it is constant time today, on the compilers that you test, a smarter compiler will come along and defeat you. One of your users will use this compiler before you do, and they will not be aware of the risk to which you have exposed them. There are exactly three ways to achieve constant time that I am aware of: dedicated hardware, assembly, or a DSL that generates machine code plus a proof of constant-time execution.
Rant aside, on to the actual architecture question at hand: assuming a stupidly naive compiler, this code is constant time on the µarches with which I am familiar enough to evaluate the question, and I expect it to broadly be true for one simple reason: power. I expect that checking in a store queue or cache if a value being stored matches the value already present and conditionally short-circuiting the store or avoiding dirtying the cache line on every store consumes more energy than would be saved in the rare occasion that you get to avoid some work. However, I am not a CPU designer, and do not presume to speak on their behalf, so take this with several tablespoons of salt, and please consult one before assuming this to be true.
This blog post, and the comments made by the author, Henry, on the subject of this question should be considered as authoritative as anyone should allowed to expect. I will reproduce the latter here for archival:
I didn’t think the case of overwriting a memory location with the same value had a practical use. I think the answer is that in current processors, the value of the store is irrelevant, only the address is important.
Out here in academia, I’ve heard of two approaches to doing memory disambiguation: Address-based, or value-based. As far as I know, current processors all do address-based disambiguation.
I think the current microbenchmark has some evidence that the value isn’t relevant. Many of the cases involve repeatedly storing the same value into the same location (particularly those with offset = 0). These were not abnormally fast.
Address-based schemes uses a store queue and a load queue to track outstanding memory operations. Loads check the store queue to for an address match (Should this load do store-to-load forwarding instead of reading from cache?), while stores check the load queue (Did this store clobber the location of a later load I allowed to execute early?). These checks are based entirely on addresses (where a store and load collided). One advantage of this scheme is that it’s a fairly straightforward extension on top of store-to-load forwarding, since the store queue search is also used there.
Value-based schemes get rid of the associative search (i.e., faster, lower power, etc.), but requires a better predictor to do store-to-load forwarding (Now you have to guess whether and where to forward, rather than searching the SQ). These schemes check for ordering violations (and incorrect forwarding) by re-executing loads at commit time and checking whether their values are correct. In these schemes, if you have a conflicting store (or made some other mistake) that still resulted in the correct result value, it would not be detected as an ordering violation.
Could future processors move to value-based schemes? I suspect they might. They were proposed in the mid-2000s(?) to reduce the complexity of the memory execution hardware.
The idea behind constant-time implementation is not to actually perform everything in constant time. That will never happen on an out-of-order architecture.
The requirement is that no secret information can be revealed by timing analysis.
To prevent this there are basically two requirements:
a) Do not use anything secret as a stop condition for a loop, or as a predicate to a branch. Failing to do so will open you to a branch prediction attack https://eprint.iacr.org/2006/351.pdf
b) Do not use anything secret as an index to memory access. This leads to cache timing attacks http://www.daemonology.net/papers/htt.pdf
As for your code: assuming that your secret is "condition" and possibly the contents of a and b the code is perfectly constant time in the sense that its execution does not depend on the actual contents of a, b and condition. Of course the locality of a and b in memory will affect the execution time of the loop, but not the CONTENTS which are secret.
That is assuming of course condition was computed in a constant time manner.
As for C optimizations: the compiler can only optimize code based on information it knows. If "condition" is truly secret the compiler should not be able to discern it contents and optimize. If it can be deducted from your code then the compiler will most likely make optimization for the 0 case.

How do I determine the start and end of instructions in an object file?

So, I've been trying to write an emulator, or at least understand how stuff works. I have a decent grasp of assembly, particularly z80 and x86, but I've never really understood how an object file (or in my case, a .gb ROM file) indicates the start and end of an instruction.
I'm trying to parse out the opcode for each instruction, but it occurred to me that it's not like there's a line break after every instruction. So how does this happen? To me, it just looks like a bunch of bytes, with no way to tell the difference between an opcode and its operands.
For most CPUs - and I believe Z80 falls in this category - the length of an instruction is implicit.
That is, you must decode the instruction in order to figure out how long it is.
If you're writing an emulator you don't really ever need to be able to obtain a full disassembly. You know what the program counter is now, you know whether you're expecting a fresh opcode, an address, a CB page opcode or whatever and you just deal with it. What people end up writing, in effect, is usually a per-opcode recursive descent parser.
To get to a full disassembler, most people impute some mild simulation, recursively tracking flow. Instructions are found, data is then left by deduction.
Not so much on the GB where storage was plentiful (by comparison) and piracy had a physical barrier, but on other platforms it was reasonably common to save space or to effect disassembly-proof code by writing code where a branch into the middle of an opcode would create a multiplexed second stream of operations, or where the same thing might be achieved by suddenly reusing valid data as valid code. One of Orlando's 6502 efforts even re-used some of the loader text — regular ASCII — as decrypting code. That sort of stuff is very hard to crack because there's no simple assembly for it and a disassembler therefore usually won't be able to figure out what to do heuristically. Conversely, on a suitably accurate emulator such code should just work exactly as it did originally.

Resources