C99 "atomic" load in baremetal portable library - c

I'm working on a portable library for baremetal embedded applications.
Assume that I have a timer ISR that increments a counter and that, in the main loop, this counter is read with a load that is almost certainly not atomic.
I'm trying to ensure load consistency (i.e. that I'm not reading garbage because the load was interrupted and the value changed) without resorting to disabling interrupts. It does not matter if the value changed after reading the counter as long as the read value is proper. Does this do the trick?
uint32_t read(volatile uint32_t *var){
    uint32_t value;
    do { value = *var; } while(value != *var);
    return value;
}

It's highly unlikely that there's any sort of a portable solution for this, not least because plenty of C-only platforms are really C-only and use one-off compilers, i.e. nothing mainstream and modern-standards-compliant like gcc or clang. So if you're truly targeting entrenched C, then it's all quite platform-specific and not portable - to the point where "C99" support is a lost cause. The best you can expect for portable C code is ANSI C support - referring to the very first non-draft C standard published by ANSI. That is still, unfortunately, the common denominator - that major vendors get away with. I mean: Zilog somehow gets away with it, even if they are now but a division of Littelfuse, formerly a division of IXYS Semiconductor that Littelfuse had acquired.
For example, here are some compilers where there's only a platform-specific way of doing it:
Zilog eZ8 using a "recent" Zilog C compiler (anything 20 years old or less is OK): 8-bit value read-modify-write is atomic. 16-bit operations where the compiler generates word-aligned word instructions like LDWX, INCW, DECW are atomic as well. If the read-modify-write otherwise fits into 3 instructions or less, you'd prepend the operation with asm("\tATM");. Otherwise, you'd need to disable the interrupts: asm("\tPUSHF\n\tDI");, and subsequently re-enable them: asm("\tPOPF");.
Zilog ZNEO is a 16 bit platform with 32-bit registers, and read-modify-write accesses on registers are atomic, but memory read-modify-write round-trips via a register, usually, and takes 3 instructions - thus prepend the R-M-W operation with asm("\tATM").
Zilog Z80 and eZ80 require wrapping the code in asm("\tDI") and asm("\tEI"), although this is valid only when it's known that the interrupts are always enabled when your code runs. If they may not be enabled, then there's a problem since Z80 does not allow reading the state of IFF1 - the interrupt enable flip-flop. So you'd need to save a "shadow" of its state somewhere, and use that value to conditionally enable interrupts. Unfortunately, eZ80 does not provide an interrupt controller register that would allow access to IEF1 (eZ80 uses the IEFn nomenclature instead of IFFn) - so this architectural oversight is carried over from the venerable Z80 to the "modern" one.
Those aren't necessarily the most popular platforms out there, and many people don't bother with Zilog compilers due to their fairly poor quality (low enough that yours truly had to write an eZ8-targeting compiler*). Yet such odd corners are the mainstay of C-only code bases, and library code has no choice but to accommodate this, if not directly then at least by providing macros that can be redefined with platform-specific magic.
E.g. you could provide empty-by-default macros MYLIB_BEGIN_ATOMIC(vector) and MYLIB_END_ATOMIC(vector) that would be used to wrap code that requires access atomic with respect to a given interrupt vector (or e.g. -1 if with respect to all interrupt vectors). Naturally, replace MYLIB_ with a "namespace" prefix specific to your library.
To enable platform-specific optimizations such as ATM vs DI on "modern" Zilog platforms, an additional argument could be provided to the macro to separate the presumed "short" sequences, for which the compiler is apt to generate three instructions or fewer, from longer ones. Such micro-optimization usually requires an audit of the assembly output (easily automatable) to verify the assumption about the instruction sequence length, but at least the data to drive the decision would be available, and the user would have a choice of using it or ignoring it.
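A minimal sketch of such hooks follows, under stated assumptions: the names MYLIB_ALL_VECTORS, mylib_ticks and mylib_read_ticks, and the second is_short argument, are hypothetical and only illustrate the idea. By default the macros expand to nothing useful, so a port has to supply real definitions before including the library.

```c
#include <stdint.h>

/* Hypothetical hooks. `vector` names the interrupt the access must be
 * atomic against; `is_short` is a hint that the wrapped sequence is
 * expected to compile to three instructions or fewer. Both default to
 * no-ops, so a port must define them before this point. */
#ifndef MYLIB_BEGIN_ATOMIC
#define MYLIB_BEGIN_ATOMIC(vector, is_short) ((void)0)
#endif
#ifndef MYLIB_END_ATOMIC
#define MYLIB_END_ATOMIC(vector, is_short) ((void)0)
#endif

/* Assumed convention from the text: -1 means "all interrupt vectors". */
#define MYLIB_ALL_VECTORS (-1)

/* Counter incremented by the timer ISR (name is illustrative). */
static volatile uint32_t mylib_ticks;

/* Read the counter inside the port-supplied atomic section. */
uint32_t mylib_read_ticks(void)
{
    uint32_t value;
    MYLIB_BEGIN_ATOMIC(MYLIB_ALL_VECTORS, 1);
    value = mylib_ticks;
    MYLIB_END_ATOMIC(MYLIB_ALL_VECTORS, 1);
    return value;
}
```

A Z80 port where interrupts are known to be enabled could, for instance, define the two macros as the asm("\tDI") and asm("\tEI") statements mentioned above and ignore the hint.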
*If some lost soul wants to know anything bordering on the arcane re. eZ8 - ask away. I know entirely too much about that platform, in details so gory that even modern Hollywood CG and SFX would have a hard time reproducing the true depth of the experience on-screen. I'm also possibly the only one out there running the 20MHz eZ8 parts occasionally at 48MHz clock - as sure a sign of demonic possession as the multiverse allows. If you think it's outrageous that such depravity makes it into production hardware - I'm with you. Alas, business case is business case, laws of physics be damned.

Are you running on any systems where uint32_t is larger than what a single load/store instruction can read or write? If not, the access to memory should be a single instruction and therefore atomic (assuming the bus is also word sized...). You get in trouble when the compiler breaks it up into multiple smaller reads/writes. Otherwise, I've always had to resort to DI/EI. You could have the user configure your library so that it knows whether atomic instructions or a minimum 32-bit word size are available, to avoid interrupt twiddling. If you have these guarantees, you don't need the verification code.
To answer the question though, on a system that must split the read/writes, your code is not safe. Imagine a case where you read your value in correctly in the "do" part, but the value gets split during the "while" part check. Further, in an extreme case, this is an infinite loop. For complete safety, you'd need a retry count and error condition to prevent that. The loop case is extreme for sure, but I'd want it just in case. That of course makes the run time longer.
Let's show a failure case as an example. I'll use 16-bit numbers on a machine that reads 8-bit values at a time to make it easier to follow:
Value to read from memory *var is 0x1234
Read 8-bit 0x12
*var becomes 0x5678
Read 8-bit 0x78 - value is now 0x1278 (invalid)
*var becomes 0x1234
Verification step reads 8-bit 0x12
*var becomes 0x5678
Verification reads 8-bit 0x78
Value "confirmed" as 0x1278, but this is an error, as *var only ever held 0x1234 or 0x5678.
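The sequence above can be replayed on a host machine. This is only a simulation under stated assumptions: the sim_ names are hypothetical, the 16-bit variable is modelled as two bytes, and the "ISR" is called by hand after every byte read so that the interleaving matches the steps listed.

```c
#include <stdint.h>

/* Simulated 16-bit variable held as two bytes (high byte first), standing
 * in for *var on a machine that can only load 8 bits at a time. */
static uint8_t sim_mem[2];
static unsigned sim_step;

static void sim_store16(uint16_t v)
{
    sim_mem[0] = (uint8_t)(v >> 8);
    sim_mem[1] = (uint8_t)(v & 0xFFu);
}

/* Hypothetical ISR: alternates the variable between 0x5678 and 0x1234. */
static void sim_isr(void)
{
    sim_store16((sim_step++ % 2u == 0u) ? 0x5678u : 0x1234u);
}

/* Non-atomic 16-bit load with the "ISR" firing after each byte read. */
static uint16_t sim_torn_load(void)
{
    uint8_t hi = sim_mem[0];
    sim_isr();
    uint8_t lo = sim_mem[1];
    sim_isr();
    return (uint16_t)(((unsigned)hi << 8) | lo);
}

/* The read-until-stable loop from the question, on the simulated variable. */
static uint16_t sim_read(void)
{
    uint16_t value;
    do { value = sim_torn_load(); } while (value != sim_torn_load());
    return value;
}
```

Starting from sim_store16(0x1234) with sim_step at 0, sim_read() returns 0x1278 on its first pass, a value the variable never held.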
Another failure case would be when *var just happens to change at the same frequency as your code is running, which could lead to an infinite loop as each verification fails. Even if it did break out eventually, this would be a performance bug that is very hard to track down.
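A bounded variant of the loop, as suggested above, might look like the sketch below. The name read_bounded and the limit READ_MAX_RETRIES are assumptions made for illustration. Note that this only removes the infinite loop: on a target that splits the load, a torn value can still be "confirmed" exactly as in the listing above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Arbitrary illustrative limit; tune it for the target. */
#define READ_MAX_RETRIES 8u

/* Reads *var twice per attempt and accepts the value only when both
 * reads agree. Returns false when no two consecutive reads matched
 * within READ_MAX_RETRIES attempts; *out is left untouched then. */
bool read_bounded(volatile uint32_t *var, uint32_t *out)
{
    unsigned tries;
    for (tries = 0; tries < READ_MAX_RETRIES; ++tries) {
        uint32_t first = *var;
        uint32_t second = *var;
        if (first == second) {
            *out = first;
            return true;
        }
    }
    return false;
}
```

The caller then has to decide what a false return means for the application, which is the cost of not disabling interrupts.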

Related

Mapping inputs to an array (A better way?)

I'm working in an embedded system and have "mapped" some defines to an array for inputs.
volatile int INPUT_ARRAY[40];
#define INPUT01 INPUT_ARRAY[0]
#define INPUT02 INPUT_ARRAY[1]
// section 2
if ( INPUT01 && INPUT02 ) {
writepin(outputpin, value);
}
If I want to read from Input 1, I can simply say newvariable = INPUT01, or I can compare data with Input 1, as in section 2 of my code. I'm not sure if this is a normal way of mapping the name INPUT01 to the array position, or of handling an input pin in the first place. Each array value represents a binary pin, and the values are read into the array by decoding a port value (16 bit). Question: Is using the defines and array like this reasonably efficient?
Yes, your solution is efficient.
Before the C compiler even sees your code, the C preprocessor substitutes INPUT_ARRAY[0] for INPUT01 and, similarly, INPUT_ARRAY[1] for INPUT02; so this substitution uses zero time and zero power at run time.
Moreover, when the C compiler sees INPUT_ARRAY[1] in the preprocessed code, the constant offset (one element, i.e. sizeof(int) bytes) is folded into the address of INPUT_ARRAY at build time rather than computed at run time. Therefore, you get maximal efficiency at run time.
Admittedly, were you manually to turn your C compiler's optimizer off, as with the -O0 option of GCC, then it is conceivable that the compiler would emit assembly code to add the 1 at run time. So don't do that.
The only likely exception to the foregoing would be the case in which the base address of INPUT_ARRAY is unknown to the compiler at compile time. That is unlikely to come about because INPUT_ARRAY is dynamically allocated on the heap (which would make little sense for hardware device addressing), but it could come about because the base address of INPUT_ARRAY is configurable during boot via device configuration registers. Some hardware does this, but if yours does, that is exactly the reason your MCU (or MPU) possesses an index-offset indirect addressing mode in the first place. Though this mode engages the MCU's integer arithmetic unit, [a] the mode does not multiply (multiplication being a power-hungry operation); and, [b] anyway, the mode is such a normal, often-used mode that MCUs are invariably designed to support it efficiently: not perhaps as efficiently as precomputed direct addressing, but as efficiently as one can reasonably expect for such a use. The MCU's manufacturer knows that device pins are things you need to address. The engineer who designed your MCU will have given priority to making the index-offset indirect mode as efficient as possible for this and other reasons. (You could maybe still cheat the matter to save a few millijoules via self-modifying code, if your MCU even allowed that; but, as an engineer, you'd regret the cheat, I suspect, unless security and maintainability were non-issues to you. The problem probably is not much of a real problem. Index-offset indirect addressing is the normal technique when the base address remains unknown until run time. If you really need to save that last millijoule, then you might not be using a C compiler for your code's inner loop, anyway, but might be handcrafting assembly code.)
I suspect that you would find it instructive to tell your compiler to emit assembly code for your inspection. I do not know which compiler you are using but, if you were using GCC, then gcc -S myfile.c would do it.
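For completeness, the decoding step that the question mentions could be sketched as follows. The function decode_port, its base parameter and the INPUT_COUNT name are assumptions, since the question does not show how the port value is unpacked; bit n of the port is taken to map to INPUT_ARRAY[base + n].

```c
#include <stdint.h>

#define INPUT_COUNT 40
volatile int INPUT_ARRAY[INPUT_COUNT];
#define INPUT01 INPUT_ARRAY[0]
#define INPUT02 INPUT_ARRAY[1]

/* Hypothetical decoder: copies bit n of a 16-bit port value into
 * INPUT_ARRAY[base + n] as 0 or 1, without running past the array. */
void decode_port(uint16_t port_value, int base)
{
    int bit;
    for (bit = 0; bit < 16 && base + bit < INPUT_COUNT; ++bit) {
        INPUT_ARRAY[base + bit] = (int)((port_value >> bit) & 1u);
    }
}
```

After decode_port(value, 0), a test such as if (INPUT01 && INPUT02) reads two fixed array slots, which is the constant-index access discussed above.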

How to enable the DIV instruction in ASM output of C compiler

I am using vbcc compiler to translate my C code into Motorola 68000 ASM.
For whatever reason, every time I use the division (just integer, not floats) in code, the compiler only inserts the following stub into the ASM output (that I get generated upon every recompile):
public __ldivs
jsr __ldivs
I explicitly searched for all variations of DIVS/DIVU, but every single time, there is just that stub above. The code itself works (I debugged it on target device), so the final code does have the DIV instruction, just not the intermediate output.
Since this is the most expensive instruction and it's in an inner loop, I really gotta experiment with tweaking the code to get the max performance of it.
However, I can't do it if I don't see the resulting ASM code. Any ideas how to enable it? The compiler manual does not specify anything like that, so there must clearly be some other, probably common, higher principle in play?
From the vbcc compiler system manual by Volker Barthelmann:
4.1 Additional options
This backend provides the following additional options:
-cpu=n Generate code for cpu n (e.g. -cpu=68020), default: 68000.
...
4.5 CPUs
The values of -cpu=n have those effects:
...
n>=68020
32bit multiplication/division/modulo is done with the mul?.l, div?.l and
div?l.l instructions.
The original 68000 CPU didn't have support for 32-bit divides, only 16-bit division, so by default vbcc doesn't generate 32-bit divide instructions.
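To see why a 32-bit divide turns into a library call, recall that the 68000's DIVU takes a 32-bit dividend and a 16-bit divisor and yields a 16-bit quotient and a 16-bit remainder. The sketch below models one such step in C and composes a 32-bit by 16-bit unsigned divide from two of them. It is only an illustration: the function names are hypothetical, this is not the implementation of vbcc's __ldivs, and a full 32-by-32 divide needs more work than this.

```c
#include <stdint.h>

/* Models one DIVU-sized step: 32-bit dividend, 16-bit divisor, 16-bit
 * quotient and remainder. The caller must make sure the quotient fits
 * in 16 bits and that the divisor is nonzero. */
static uint16_t divu_step(uint32_t dividend, uint16_t divisor, uint16_t *rem)
{
    *rem = (uint16_t)(dividend % divisor);
    return (uint16_t)(dividend / divisor);
}

/* 32-bit by 16-bit unsigned divide built from two such steps
 * (schoolbook long division in base 65536). d must be nonzero. */
uint32_t div32by16(uint32_t n, uint16_t d, uint16_t *rem)
{
    uint16_t r;
    /* High half first; its remainder is smaller than d ... */
    uint16_t q_hi = divu_step(n >> 16, d, &r);
    /* ... so the second quotient is guaranteed to fit in 16 bits. */
    uint16_t q_lo = divu_step(((uint32_t)r << 16) | (n & 0xFFFFu), d, rem);
    return ((uint32_t)q_hi << 16) | q_lo;
}
```

If the divisor in the inner loop is known to fit in 16 bits, restructuring along these lines may be worth trying, but whether the compiler then emits DIVU directly is something to verify in the assembly output.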
Basically your question doesn't even belong here. You're asking about the workings of your compiler not the 68K cpu family.
Since this is the most expensive instruction and it's in an inner loop, I really gotta experiment with tweaking the code to get the max performance of it.
Then you are already fighting windmills. Choosing an obscure C compiler while at the same time desiring top performance means pursuing conflicting goals.
If you really need MC68000 code compatibility, the choice of C is questionable. Since the 68000 has zero cache, store/load orgies that simple C compilers tend to produce en masse, have a huge performance impact. It lessens considerably for the higher members and may become invisible on the superscalar pipelined ones (erm, one; the 68060).
Switch to 68020 code model if target platform permits, and switch compiler if you're not satisfied with your current one.

AVR 8bit, C standard compliance regarding bit accessing of SFRs

One of my colleagues ran in some strange problems with programming an ATMega, related to accessing input - output ports.
After observing the problem and doing some research, I concluded that we should avoid accessing SFRs using operations which may compile to SBI or CBI instructions if we aim for safe, C standard compliant software. I would like to know whether this decision was right, that is, whether my concerns here are valid.
The datasheet of the Atmel processor is here, it's an ATMega16. I will refer to some pages of this document below.
I will refer to the C standard using the version found on this site under the WG14 N1256 link.
The SBI and CBI instructions of the processor operate at bit-level accessing only the bit in question. So they are not true Read-Modify-Write (R-M-W) instructions since they, as I understand, do not perform a read (of the targeted 8 bit SFR).
On page 50 of the above datasheet the first sentence begins with All AVR ports have true Read-Modify-Write functionality..., while it goes on to specify that this only applies to accesses with the SBI and CBI instructions, which technically are not R-M-W. The datasheet does not define what reading, for example, the PORTx registers is supposed to return (it does, however, indicate that they are readable). So I assumed that reading these SFRs is undefined (they might return the last thing written to them, or the current input state, or whatever).
On page 70 it lists some external interrupt flags, this is interesting because this is where the nature of the SBI and CBI instructions come to be important. The flags are set when an interrupt occurred, and they may be cleared by writing them to one. So if SBI was a true R-M-W instruction, it would clear all three flags regardless of the bit specified in the opcode.
And now let's get into the matters of C.
The compiler itself is truly irrelevant, the only important fact is that it might use the CBI and SBI instructions in certain situations which I think make it non-compliant.
In the above mentioned C99 standard, the section 5.1.2.3 Program execution, point 2 and 3 refers to this (on page 13), and 6.7.3 Type qualifiers, point 6 (on page 109). The latter mentions that What constitutes an access to an object that has volatile-qualified type is implementation-defined, however a few phrases before it requires that any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine.
Also note that hardware ports such as that used in the example are declared volatile in the appropriate headers.
Example:
PORTA |= 1U << 6;
This is known to translate to an SBI. This implies that only a Write access happens on the volatile (PORTA) object. However if one would write:
var = 6;
...
PORTA |= 1U << var;
That would not translate to an SBI even though it will still only set one bit (since SBI has the bit to set encoded in the opcode). So this will expand to a true R-M-W sequence with a potentially different result than above (in the case of PORTA this is undefined behaviour as far as I could deduce from the datasheet).
By the C standard this behaviour might or might not be permitted. It is also messy in that two things get mixed up here. One, the more apparent, is the lack of the Read access in one of the cases. The other, less apparent, is how the Write is performed.
If the compiled code omits the Read, it might fail to trigger hardware behaviour which is tied to such an access. However the AVR as far as I know has no such mechanism, so it might pass by the standard.
The Write is more interesting, however it also takes in the Read.
Omitting the Read in the case of using SBI implies that the affected SFRs must all work like latches (or any bit not working like that is tied to either 0 or 1), so the compiler can be sure of what it would read from them if it actually did the access. If this were not the case then the compiler would at least be buggy. By the way, this also clashes with the fact that the datasheet does not define what is read from the PORTx registers.
How the write is performed is also a source of inconsistency: the result is different depending on how the compiler compiles it (a CBI or SBI affecting only one bit, a byte write affecting all bits). So writing code to clear / set one bit might either "work" (as in not "accidentally" clearing interrupt flags), or not if the compiler produces a true R-M-W sequence instead.
Maybe these are technically permitted by the C standard (as "implementation defined" behaviour, with the compiler deducing in these cases that the Read access to the volatile object is not necessary), but at least I would consider it a buggy or inconsistent implementation.
Another example:
PORTA = PORTA | (1U << 6);
It is clearly visible that, to conform with the standard, a Read and then a Write of PORTA should normally be carried out. With SBI, however, the Read access will be missing, although as above this may pass for a mix of implementation defined behaviour and the compiler deducing that the Read is unnecessary here. (Or was my assumption wrong, namely that a |= b is identical to a = a | b?)
So based on these I settled with that we should avoid these types of code as it is (or may be in the future) unclear how they might behave depending on whether the compiler would use SBI or CBI, or a true R-M-W sequence.
To tell the truth, I mostly relied on various forum posts to resolve this rather than analysing actual compiler output. It is not my project after all (and I am not at work right now). From reading AVRFreaks, for example, I accepted that AVR-GCC would output these instructions in the situations mentioned above, which alone may pose a problem even if we would not observe it with the compiler version we actually used. (However, I think it did apply in this case, since my suggestion to implement port accesses using shadow work variables fixed the problems my colleague observed.)
Note: I edited the middle based on some research on the C (C99) standard.
Edit: Reading the AVR Libc FAQ I again found something which contradicts the automatic use of SBI or CBI. It is the last question & answer where it specifically states that since the ports are declared volatile the compiler can not optimize out the read access, according to the rules of the C language (as it phrases).
I also understand that it is very unlikely that this particular behaviour (that is using SBI or CBI) would directly introduce bugs, but by masking "bugs" it may introduce very nasty ones in the long run if someone accidentally generalizes based on this behaviour while not understanding the AVR at assembly level.
You should probably stop trying to apply the C memory model to I/O registers. They are not plain memory. In the case of PORTn registers, it is in fact irrelevant whether it is a single bit write or a R-M-W operation unless you're mixing in interrupts. If you do read-modify-write an interrupt may alter state in between, causing a race condition; but that would be exactly the same issue for memory. The advantage of the SBI/CBI instructions there is that they are atomic.
The PORTn registers are readable, and also drive the output buffers. They are not different functions on read and write (as on PIC), but a normal register. Newer PICs also have the output registers readable on LAT addresses, precisely so you won't need a shadow variable. Other SFRs such as PINn or interrupt flags have more complicated behaviour. On recent AVRs, writing to PINn instead toggles bits in PORTn, which again is useful for its fast and atomic operation. Writing 1s to interrupt flag registers clears them, again to prevent race conditions.
The point is, these features are in place to produce correct behaviour for hardware-aware programs, even if some of it looks odd in C code (e.g. using reg=_BV(2); instead of reg&=~_BV(2); to clear a write-one-to-clear flag). Precise compliance with the C standard is an impractical goal when the code is by its very nature hardware specific (though semantic similarity does help, which the interrupt flag behaviour fails at). Wrapping the odd constructs in inline functions or macros with names that explain what they truly do is probably a good idea, or at least commenting what the effects are. A set of such I/O routines could also form the basis of a hardware abstraction layer that may help you port code.
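A minimal sketch of such named wrappers follows. The function names are hypothetical, and the registers are passed as plain volatile pointers so that the code compiles without the AVR headers; with those headers one would pass something like &PORTA or &GIFR. Whether a given call ends up as SBI/CBI or as a full read-modify-write sequence still depends on the compiler and on what it can see at the call site.

```c
#include <stdint.h>

/* Clear one flag in a write-1-to-clear register by writing only that
 * bit, so that other pending flags are not cleared by accident. */
static inline void flag_clear_w1c(volatile uint8_t *flag_reg, uint8_t bit)
{
    *flag_reg = (uint8_t)(1u << bit);
}

/* Set one output bit; the other bits are written back as read. */
static inline void port_set_bit(volatile uint8_t *port, uint8_t bit)
{
    *port |= (uint8_t)(1u << bit);
}

/* Clear one output bit; the other bits are written back as read. */
static inline void port_clear_bit(volatile uint8_t *port, uint8_t bit)
{
    *port &= (uint8_t)~(1u << bit);
}
```

The point of the names is that a reader sees flag_clear_w1c(...) and knows that a plain store of a single 1 bit is intended, rather than wondering why the code does not use &= ~mask.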
Trying to interpret the C specification strictly here is also rather confusing, as it doesn't admit to addressing bits (which is what SBI and CBI do), and digging through my old (1992) copy finds that volatile accesses may result in several implementation defined behaviours, including the possibility of no accesses at all.

Dealing with reserved register bits of an ARM chip

I am working with the registers of an ARM Cortex M3. In the documentation, some of the bits may be "reserved". It is unclear to me how I should deal with these reserved bits when writing on the registers.
Are these reserved bits even writeable? Should I be cautious to not touch them? Will something bad happen if I touch them?
This is a classic embedded-world problem: what to do with reserved bits! First, you should NOT write arbitrary values into them, lest your code become unportable. What happens when the architecture assigns a new meaning to the reserved bits in the future? Your code will break. So the best mantra when dealing with registers that have reserved bits is Read-Modify-Write, i.e. read the register contents, modify only the bits you want, and then write back the value so that the reserved bits are untouched (untouched does not mean that we don't write to them, but that we write back what was there before).
For example, say there is a register in which only the LSBit has meaning and all others are reserved. I would do this
ldr r0,=memoryAddress
ldr r1,[r0]
orr r1,r1,#1
str r1,[r0]
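The same read-modify-write can be written in C. This is a sketch with a hypothetical helper name, reg_update_bits; the register is passed as a volatile pointer so that the read and the write are each actually performed.

```c
#include <stdint.h>

/* Changes only the bits selected by `mask`. Every other bit, reserved
 * ones included, is written back with the value that was just read. */
static inline void reg_update_bits(volatile uint32_t *reg,
                                   uint32_t mask, uint32_t value)
{
    uint32_t tmp = *reg;                  /* read   */
    tmp = (tmp & ~mask) | (value & mask); /* modify */
    *reg = tmp;                           /* write  */
}
```

Setting only the least significant bit, as in the assembly above, would then be reg_update_bits(reg, 1u, 1u).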
If there is no other clue in the documentation, write a zero. You cannot avoid writing to a few reserved bits spread around in a 32-bit register.
Read-Modify-Write should work most of the time, however there are cases where reserved bits are undefined on read but must be written with a specific value. See this post from the LPC2000 group (the whole thread is quite interesting too). So, always check the docs carefully, and also any errata that's available. When in doubt or docs are unclear, don't hesitate to write to the manufacturer.
Ideally you should read-modify-write, but there is no guarantee of success; when you change to a newer chip with different bits, you are changing your code anyway. I have seen vendors where writing zeros to the reserved bits failed when they revved the chip, and the code had to be touched. So there are no guarantees. The biggest clue is whether, in the vendor's code, a register or set of registers is clearly read-modify-written or clearly just written. A difference could be down to different developers writing different sections of the example, or it could mean that a register in that peripheral is sensitive, has an undocumented bit, and needs the read-modify-write.
On the chips that I work on I make sure that undocumented (to the customer), but not unused bits are marked in some way to stand out from other unused bits. We normally mark unused/reserved bits as zero, and these other bits get a name, and a must write this value marking. Not all vendors do this.
The bottom line is there is no guarantee, assume all documentation and example programs have bugs and you have to hack your way through to figure out what is right and what is wrong. No matter what path you take (read-modify-write, write zeros, etc) you will be wrong from time to time and have to re-do the code to match a hardware change. I strongly suggest that if a vendor has a chip id of some sort, that your software reads that ID and if it is an id that you have not tested your code against, declare a failure and not program that part. In production testing long before a customer sees the product, the part change will get detected and software will be involved in understanding the reason for the part change, the resolution being the alternate part is not compatible and rejected or the software changes, etc.
Reserved most of the time means that the bits aren't used in this chip, but they might be used on future devices (or another product line). (Most chip manufacturers produce one peripheral driver and use it for all their chips. This way it's mostly copy-paste work and there is less chance of errors.) Most of the time it doesn't matter if you write to reserved bits in peripheral registers, because there isn't any logic attached to them.
It is possible that if you write something to them, it won't be stored, and the next time you read the register the bits seem unchanged.

What does the C compiler do with bitfields?

I'm working on an embedded project (PowerPC target, Freescale Metrowerks Codewarrior compiler) where the registers are memory-mapped and defined in nice bitfields to make twiddling the individual bit flags easy.
At the moment, we are using this feature to clear interrupt flags and control data transfer. Although I haven't noticed any bugs yet, I was curious if this is safe. Is there some way to safely use bit fields, or do I need to wrap each in DISABLE_INTERRUPTS ... ENABLE_INTERRUPTS?
To clarify: the header supplied with the micro has fields like
union {
vuint16_t R;
struct {
vuint16_t MTM:1; /* message buffer transmission mode */
vuint16_t CHNLA:1; /* channel assignement */
vuint16_t CHNLB:1; /* channel assignement */
vuint16_t CCFE:1; /* cycle counter filter enable */
vuint16_t CCFMSK:6; /* cycle counter filter mask */
vuint16_t CCFVAL:6; /* cycle counter filter value */
} B;
} MBCCFR;
I assume setting a bit in a bitfield is not atomic. Is this a correct assumption? What kind of code does the compiler actually generate for bitfields? Performing the mask myself using the R (raw) field might make it easier to remember that the operation is not atomic (it is easy to forget that an assignment like CAN_A.IMASK1.B.BUF00M = 1 isn't atomic).
Your advice is appreciated.
Atomicity depends on the target and the compiler. AVR-GCC, for example, tries to detect bit accesses and emits bit set or clear instructions if possible. Check the assembler output to be sure ...
EDIT: Here is a resource for atomic instructions on PowerPC directly from the horse's mouth:
http://www.ibm.com/developerworks/library/pa-atom/
It is correct to assume that setting bitfields is not atomic. The C standard isn't particularly clear on how bitfields should be implemented and various compilers go various ways on them.
If you really only care about your target architecture and compiler, disassemble some object code.
Generally, your code will achieve the desired result but be much less efficient than code using macros and shifts. That said, it's probably more readable to use your bit fields if you don't care about performance here.
You could always write a setter wrapper function for the bits that is atomic, if you're concerned about future coders (including yourself) being confused.
Yes, your assumption is correct, in the sense that you may not assume atomicity. On a specific platform you might get it as an extra, but you can't rely on it in any case.
Basically the compiler performs the masking and so on for you. It might be able to take advantage of corner cases or special instructions. If you are interested in efficiency, look at the assembler that your compiler produces for this; it is usually quite instructive. As a rule of thumb I'd say that modern compilers produce code that is as efficient as medium programming effort would be. Really deep bit twiddling for your specific compiler could perhaps gain you some cycles.
I think that using bitfields to model hardware registers is not a good idea.
So much about how bitfields are handled by a compiler is implementation-defined (including how fields that span byte or word boundaries are handled, endianness issues, and exactly how getting, setting and clearing bits is implemented). See C/C++: Force Bit Field Order and Alignment
To verify that register accesses are being handled how you might expect or need them to be handled, you would have to carefully study the compiler docs and/or look at the emitted code. I suppose that if the headers supplied with the microprocessor toolset use them, you can assume that most of my concerns are taken care of. However, I'd guess that atomic access isn't necessarily...
I think it's best to handle these type of bit-level accesses of hardware registers using functions (or macros, if you must) that perform explicit read/modify/write operations with the bit mask that you need, if that's what your processor requires.
Those functions could be modified for architectures that support atomic bit-level accesses (such as the ARM Cortex M3's "bit-banding" addressing). I don't know if the PowerPC supports anything like this - the M3 is the only processor I've dealt with that supports it in a general fashion. And even the M3's bit-banding supports only 1-bit accesses; if you're dealing with a field that's 6 bits wide, you have to go back to the read/modify/write scenario.
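Such explicit accessors might be sketched as below, working on the raw (R) view of the register. The field positions are an assumption made for illustration: they take the first-declared bit-field as the most significant bit, which puts CCFVAL in the low 6 bits and CCFMSK in the 6 bits above it. Check the device manual rather than trusting this layout. The helper names field_get and field_set are hypothetical.

```c
#include <stdint.h>

/* Assumed layout (illustration only): CCFVAL = bits 5..0,
 * CCFMSK = bits 11..6 of the 16-bit raw register value. */
#define MBCCFR_CCFVAL_SHIFT 0u
#define MBCCFR_CCFVAL_MASK  (0x3Fu << MBCCFR_CCFVAL_SHIFT)
#define MBCCFR_CCFMSK_SHIFT 6u
#define MBCCFR_CCFMSK_MASK  (0x3Fu << MBCCFR_CCFMSK_SHIFT)

/* Extract one field from the raw register value. */
static inline uint16_t field_get(const volatile uint16_t *reg,
                                 uint16_t mask, unsigned shift)
{
    return (uint16_t)((*reg & mask) >> shift);
}

/* Explicit read/modify/write of one field. NOT atomic: wrap the call
 * in whatever interrupt protection the target requires. */
static inline void field_set(volatile uint16_t *reg, uint16_t mask,
                             unsigned shift, uint16_t value)
{
    uint16_t tmp = *reg;
    tmp = (uint16_t)((tmp & ~mask) | (((unsigned)value << shift) & mask));
    *reg = tmp;
}
```

Because the read, the modification and the write are spelled out, it is harder to forget that the update can be interrupted in the middle.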
It totally depends on the architecture and compiler whether the bitfield operations are atomic or not. My personal experience tells: don't use bitfields if you don't have to.
I'm pretty sure that on powerpc this is not atomic, but if your target is a single core system then you can just:
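/* Platform hooks, assumed to be provided elsewhere for the target. */
void interrupts_block(void);
void interrupts_enable(void);

/* Call from an ISR, or with interrupts already blocked. */
void update_reg_from_isr(volatile unsigned * reg_addr, unsigned set, unsigned clear, unsigned toggle) {
    unsigned reg = *reg_addr;   /* single read of the register */
    reg |= set;
    reg &= ~clear;
    reg ^= toggle;
    *reg_addr = reg;            /* single write back */
}

/* Call from main-line code. Note that this re-enables interrupts
   unconditionally, so it assumes they were enabled on entry. */
void update_reg(volatile unsigned * reg_addr, unsigned set, unsigned clear, unsigned toggle) {
    interrupts_block();
    update_reg_from_isr(reg_addr, set, clear, toggle);
    interrupts_enable();
}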
I don't remember if powerpc's interrupt handlers are interruptible, but if they are then you should just use the second version always.
If your target is a multiprocessor system then you should make locks (spinlocks, which disable interrupts on the local processor and then wait for any other processors to finish with the lock) that protect access to things like hardware registers, and acquire the needed locks before you access the register, and then release the locks immediately after you have finished updating the register (or registers).
I read once how to implement locks in powerpc -- it involved telling the processor to watch the memory bus for a certain address while you did some operations and then checking back at the end of those operations to see if the watch address had been written to by another core. If it hadn't then your operation was successful; if it had then you had to redo the operation. This was in a document written for compiler, library, and OS developers. I don't remember where I found it (probably somewhere on IBM.com) but a little hunting should turn it up. It probably also has info on how to do atomic bit twiddling.
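The "try, check, redo" pattern described here matches PowerPC's load-reserve / store-conditional instruction pair (lwarx / stwcx.). As a portable stand-in, and explicitly a different technique from hand-written PowerPC assembly, the same retry structure can be sketched with a C11 compare-and-exchange loop. This requires a C11 compiler with <stdatomic.h>, the function name atomic_set_bits is hypothetical, and the sketch is meant for ordinary shared memory words, not for memory-mapped hardware registers.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically ORs `bits` into *word and returns the previous value.
 * If another core or an interrupt changed *word between our read and
 * our write, the exchange fails and the operation is redone. */
uint32_t atomic_set_bits(_Atomic uint32_t *word, uint32_t bits)
{
    uint32_t old = atomic_load(word);
    while (!atomic_compare_exchange_weak(word, &old, old | bits)) {
        /* On failure `old` now holds the current value; retry. */
    }
    return old;
}
```

C11 also offers atomic_fetch_or, which does this in one call; the loop is spelled out here only to show the retry structure.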

Resources