Explain how the AF flag works in an x86 instructions? - c

I have a little 8086 emulator and I've had a long standing bug for like 2 years now that AF does not behave properly inside of sub and add instructions.
My current way of computing its value is this for 8 bit numbers and subtraction:
uint8_t base=... , subt=...
base=base&0xF;
subt=subt&0xF; //isolate bottom nibble
if((int16_t)base-subt>7 || (int16_t)base-subt<-7){
flags.af=1;
}else{
flags.af=0;
}
(assuming an instruction like sub base,subt )
And for adding it's like this:
uint8_t base=... , adder=...
base=base&0xF;
adder=adder&0xF; //isolate bottom nibble
if(base+adder>7 || base+adder<-7){
flags.af=1;
}else{
flags.af=0;
}
(for an instruction like add base,adder)
How do I properly calculate the AF flag in my emulator for such instructions?

flags.af = (((base-subt) & ~0xf) != 0);
Checks to see if the upper bits are anything except zero, which would indicate an overflow or underflow out of the bottom 4 bits.
Here's a version that's a little closer to your original. Note that the difference between two 4-bit quantities will never be greater than 15. Likewise the addition will never be less than 0.
flags.af = ((int16_t)base-subt < -15);
flags.af = ((int16_t)base+adder > 15);
Putting parentheses around a boolean expression is just a style preference of mine, I know they're redundant.

Related

Comparing two pairs of 4 variables and returning the number of matches?

Given the following struct:
struct four_points {
uint32_t a, b, c, d;
}
What would be the absolute fastest way to compare two such structures and return the number of variables that match (in any position)?
For example:
four_points s1 = {0, 1, 2, 3};
four_points s2 = {1, 2, 3, 4};
I'd be looking for a result of 3, since three numbers match between the two structs. However, given the following:
four_points s1 = {1, 0, 2, 0};
four_points s2 = {0, 1, 9, 7};
Then I'd expect a result of only 2, because only two variables match between either struct (despite there being two zeros in the first).
I've figured out a few rudimentary systems for performing the comparison, but this is something that is going to be called a couple million times in a short time span and needs to be relatively quick. My current best attempt was to use a sorting network to sort all four values for either input, then loop over the sorted values and keep a tally of the values that are equal, advancing the current index of either input accordingly.
Is there any kind of technique that might be able to perform better then a sort & iteration?
On modern CPUs, sometimes brute force applied properly is the way to go. The trick is writing code that isn't limited by instruction latencies, just throughput.
Are duplicates common? If they're very rare, or have a pattern, using a branch to handle them makes the common case faster. If they're really unpredictable, it's better to do something branchless. I was thinking about using a branch to check for duplicates between positions where they're rare, and going branchless for the more common place.
Benchmarking is tricky because a version with branches will shine when tested with the same data a million times, but will have lots of branch mispredicts in real use.
I haven't benchmarked anything yet, but I have come up with a version that skips duplicates by using OR instead of addition to combine found-matches. It compiles to nice-looking x86 asm that gcc fully unrolls. (no conditional branches, not even loops).
Here it is on godbolt. (g++ is dumb and uses 32bit operations on the output of x86 setcc, which only sets the low 8 bits. This partial-register access will produce slowdowns. And I'm not even sure it ever zeroes the upper 24bits at all... Anyway, the code from gcc 4.9.2 looks good, and so does clang on godbolt)
// 8-bit types used because x86's setcc instruction only sets the low 8 of a register
// leaving the other bits unmodified.
// Doing a 32bit add from that creates a partial register slowdown on Intel P6 and Sandybridge CPU families
// Also, compilers like to insert movzx (zero-extend) instructions
// because I guess they don't realize the previous high bits are all zero.
// (Or they're tuning for pre-sandybridge Intel, where the stall is worse than SnB inserting the extra uop itself).
// The return type is 8bit because otherwise clang decides it should generate
// things as 32bit in the first place, and does zero-extension -> 32bit adds.
int8_t match4_ordups(const four_points *s1struct, const four_points *s2struct)
{
const int32_t *s1 = &s1struct->a; // TODO: check if this breaks aliasing rules
const int32_t *s2 = &s2struct->a;
// ignore duplicates by combining with OR instead of addition
int8_t matches = 0;
for (int j=0 ; j<4 ; j++) {
matches |= (s1[0] == s2[j]);
}
for (int i=1; i<4; i++) { // i=0 iteration is broken out above
uint32_t s1i = s1[i];
int8_t notdup = 1; // is s1[i] a duplicate of s1[0.. i-1]?
for (int j=0 ; j<i ; j++) {
notdup &= (uint8_t) (s1i != s1[j]); // like dup |= (s1i == s1[j]); but saves a NOT
}
int8_t mi = // match this iteration?
(s1i == s2[0]) |
(s1i == s2[1]) |
(s1i == s2[2]) |
(s1i == s2[3]);
// gcc and clang insist on doing 3 dependent OR insns regardless of parens, not that it matters
matches += mi & notdup;
}
return matches;
}
// see the godbolt link for a main() simple test harness.
On a machine with 128b vectors that can work with 4 packed 32bit integers (e.g. x86 with SSE2), you can broadcast each element of s1 to its own vector, deduplicate, and then do 4 packed-compares. icc does something like this to autovectorize my match4_ordups function (check it out on godbolt.)
Store the compare results back to integer registers with movemask, to get a bitmap of which elements compared equal. Popcount those bitmaps, and add the results.
This led me to a better idea: Getting all the compares done with only 3 shuffles with element-wise rotation:
{ 1d 1c 1b 1a }
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1a 1d 1c 1b }
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1b 1a 1d 1c } # if dups didn't matter: do this shuffle on s2
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1c 1b 1a 1d } # if dups didn't matter: this result from { 1a ... }
== == == == packed-compare with
{ 2d 2c 2b 2a } { 2b ...
That's only 3 shuffles, and still does all 16 comparisons. The trick is combining them with ORs where we need to merge duplicates, and then being able to count them efficiently. A packed-compare outputs a vector with each element = zero or -1 (all bits set), based on the comparison between the two elements in that position. It's designed to make a useful operand to AND or XOR to mask off some vector elements, e.g. to make v1 += v2 & mask conditional on a per-element basis. It also works as just a boolean truth value.
All 16 compares with only 2 shuffles is possible by rotating one vector by two, and the other vector by one, and then comparing between the four shifted and unshifted vectors. This would be great if we didn't need to eliminate dups, but since we do, it matters which results are where. We're not just adding all 16 comparison results.
OR together the packed-compare results down to one vector. Each element will be set based on whether that element of s2 had any matches in s1. int _mm_movemask_ps (__m128 a) to turn the vector into a bitmap, then popcount the bitmap. (Nehalem or newer CPU required for popcnt, otherwise fall back to a version with a 4-bit lookup table.)
The vertical ORs take care of duplicates in s1, but duplicates in s2 is a less obvious extension, and would take more work. I did eventually think of a way that was less than twice as slow (see below).
#include <stdint.h>
#include <immintrin.h>
typedef struct four_points {
int32_t a, b, c, d;
} four_points;
//typedef uint32_t four_points[4];
// small enough to inline, only 62B of x86 instructions (gcc 4.9.2)
static inline int match4_sse_noS2dup(const four_points *s1pointer, const four_points *s2pointer)
{
__m128i s1 = _mm_loadu_si128((__m128i*)s1pointer);
__m128i s2 = _mm_loadu_si128((__m128i*)s2pointer);
__m128i s1b= _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 2, 1));
// no shuffle needed for first compare
__m128i match = _mm_cmpeq_epi32(s1 , s2); //{s1.d==s2.d?-1:0, 1c==2c, 1b==2b, 1a==2a }
__m128i s1c= _mm_shuffle_epi32(s1, _MM_SHUFFLE(1, 0, 3, 2));
s1b = _mm_cmpeq_epi32(s1b, s2);
match = _mm_or_si128(match, s1b); // merge dups by ORing instead of adding
// note that we shuffle the original vector every time
// multiple short dependency chains are better than one long one.
__m128i s1d= _mm_shuffle_epi32(s1, _MM_SHUFFLE(2, 1, 0, 3));
s1c = _mm_cmpeq_epi32(s1c, s2);
match = _mm_or_si128(match, s1c);
s1d = _mm_cmpeq_epi32(s1d, s2);
match = _mm_or_si128(match, s1d); // match = { s2.a in s1?, s2.b in s1?, etc. }
// turn the the high bit of each 32bit element into a bitmap of s2 elements that have matches anywhere in s1
// use float movemask because integer movemask does 8bit elements.
int matchmask = _mm_movemask_ps (_mm_castsi128_ps(match));
return _mm_popcnt_u32(matchmask); // or use a 4b lookup table for CPUs with SSE2 but not popcnt
}
See the version that eliminates duplicates in s2 for the same code with lines in a more readable order. I tried to schedule instructions in case the CPU was only just barely decoding instructions ahead of what was executing, but gcc puts the instructions in the same order regardless of what order you put the intrinsics in.
This is extremely fast, if there isn't a store-forwarding stall in the 128b loads. If you just wrote the struct with four 32bit stores, running this function within the next several clock cycles will produce a stall when it tries to load the whole struct with a 128b load. See Agner Fog's site. If calling code already has many of the 8 values in registers, the scalar version could be a win, even though it'll be slower for a microbenchmark test that only reads the structs from memory.
I got lazy on cycle-counting for this, since dup-handling isn't done yet. IACA says Haswell can run it with a throughput of one iteration per 4.05 clock cycles, and latency of 17 cycles (Not sure if that's including the memory latency of the loads. There's a lot of instruction-level parallelism available, and all the instructions have single-cycle latency, except for movmsk(2) and popcnt(3)). It's slightly slower without AVX, because gcc chooses a worse instruction ordering, and still wastes a movdqa instruction copying a vector register.
With AVX2, this could do two match4 operations in parallel, in 256b vectors. AVX2 usually works as two 128b lanes, rather than full 256b vectors. Setting up your code to be able to take advantage of 2 or 4 (AVX-512) match4 operations in parallel will give you gains when you can compile for those CPUs. It's not essential for both the s1s or s2s to be stored contiguously so a single 32B load can get two structs. AVX2 has a fairly fast load 128b to the upper lane of a register.
Handling duplicates in s2
Maybe compare s2 to a shifted instead of rotated version of itself.
#### comparing S2 with itself to mask off duplicates
{ 0 2d 2c 2b }
{ 2d 2c 2b 2a } == == ==
{ 0 0 2d 2c }
{ 2d 2c 2b 2a } == ==
{ 0 0 0 2d }
{ 2d 2c 2b 2a } ==
Hmm, if zero can occur as a regular element, we may need to byte-shift after the compare as well, to turn potential false positives into zeros. If there was a sentinel value that couldn't appear in s1, you could shift in elements of that, instead of 0. (SSE has PALIGNR, which gives you any contiguous 16B window you want of the contents of two registers appended. Named for the use-case of simulating an unaligned load from two aligned loads. So you'd have a constant vector of that element.)
update: I thought of a nice trick that avoids the need for an identity element. We can actually get all 6 necessary s2 vs. s2 comparisons to happen with just two vector compares, and then combine the results.
Doing the same compare in the same place in two vectors lets you OR two results together without having to mask before the OR. (Works around the lack of a sentinel value).
Shuffling the output of the compares instead of extra shuffle&compare of S2. This means we can get d==a done next to the other compares.
Notice that we aren't limited to shuffling whole elements around. Byte-wise shuffle to get bytes from different compare results into a single vector element, and compare that against zero. (This saves less than I'd hoped, see below).
Checking for duplicates is a big slowdown (esp. in throughput, not so much in latency). So you're still best off arranging for a sentinel value in s2 that will never match any s1 element, which you say is possible. I only present this because I thought it was interesting. (And gives you an option in case you need a version that doesn't require sentinels sometime.)
static inline
int match4_sse(const four_points *s1pointer, const four_points *s2pointer)
{
// IACA_START
__m128i s1 = _mm_loadu_si128((__m128i*)s1pointer);
__m128i s2 = _mm_loadu_si128((__m128i*)s2pointer);
// s1a = unshuffled = s1.a in the low element
__m128i s1b= _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 2, 1));
__m128i s1c= _mm_shuffle_epi32(s1, _MM_SHUFFLE(1, 0, 3, 2));
__m128i s1d= _mm_shuffle_epi32(s1, _MM_SHUFFLE(2, 1, 0, 3));
__m128i match = _mm_cmpeq_epi32(s1 , s2); //{s1.d==s2.d?-1:0, 1c==2c, 1b==2b, 1a==2a }
s1b = _mm_cmpeq_epi32(s1b, s2);
match = _mm_or_si128(match, s1b); // merge dups by ORing instead of adding
s1c = _mm_cmpeq_epi32(s1c, s2);
match = _mm_or_si128(match, s1c);
s1d = _mm_cmpeq_epi32(s1d, s2);
match = _mm_or_si128(match, s1d);
// match = { s2.a in s1?, s2.b in s1?, etc. }
// s1 vs s2 all done, now prepare a mask for it based on s2 dups
/*
* d==b c==a b==a d==a #s2b
* d==c c==b b==a d==a #s2c
* OR together -> s2bc
* d==abc c==ba b==a 0 pshufb(s2bc) (packed as zero or non-zero bytes within the each element)
* !(d==abc) !(c==ba) !(b==a) !0 pcmpeq setzero -> AND mask for s1_vs_s2 match
*/
__m128i s2b = _mm_shuffle_epi32(s2, _MM_SHUFFLE(1, 0, 0, 3));
__m128i s2c = _mm_shuffle_epi32(s2, _MM_SHUFFLE(2, 1, 0, 3));
s2b = _mm_cmpeq_epi32(s2b, s2);
s2c = _mm_cmpeq_epi32(s2c, s2);
__m128i s2bc= _mm_or_si128(s2b, s2c);
s2bc = _mm_shuffle_epi8(s2bc, _mm_set_epi8(-1,-1,0,12, -1,-1,-1,8, -1,-1,-1,4, -1,-1,-1,-1));
__m128i dupmask = _mm_cmpeq_epi32(s2bc, _mm_setzero_si128());
// see below for alternate insn sequences that can go here.
match = _mm_and_si128(match, dupmask);
// turn the the high bit of each 32bit element into a bitmap of s2 matches
// use float movemask because integer movemask does 8bit elements.
int matchmask = _mm_movemask_ps (_mm_castsi128_ps(match));
int ret = _mm_popcnt_u32(matchmask); // or use a 4b lookup table for CPUs with SSE2 but not popcnt
// IACA_END
return ret;
}
This requires SSSE3 for pshufb. It and a pcmpeq (and a pxor to generate a constant) are replacing a shuffle (bslli(s2bc, 12)), an OR, and an AND.
d==bc c==ab b==a a==d = s2b|s2c
d==a 0 0 0 = byte-shift-left(s2b) = s2d0
d==abc c==ab b==a a==d = s2abc
d==abc c==ab b==a 0 = mask(s2abc). Maybe use PBLENDW or MOVSS from s2d0 (which we know has zeros) to save loading a 16B mask.
__m128i s2abcd = _mm_or_si128(s2b, s2c);
//s2bc = _mm_shuffle_epi8(s2bc, _mm_set_epi8(-1,-1,0,12, -1,-1,-1,8, -1,-1,-1,4, -1,-1,-1,-1));
//__m128i dupmask = _mm_cmpeq_epi32(s2bc, _mm_setzero_si128());
__m128i s2d0 = _mm_bslli_si128(s2b, 12); // d==a 0 0 0
s2abcd = _mm_or_si128(s2abcd, s2d0);
__m128i dupmask = _mm_blend_epi16(s2abcd, s2d0, 0 | (2 | 1));
//__m128i dupmask = _mm_and_si128(s2abcd, _mm_set_epi32(-1, -1, -1, 0));
match = _mm_andnot_si128(dupmask, match); // ~dupmask & match; first arg is the one that's inverted
I can't recommend MOVSS; it will incur extra latency on AMD because it runs in the FP domain. PBLENDW is SSE4.1. popcnt is available on AMD K10, but PBLENDW isn't (some Barcelona-core PhenomII CPUs are probably still in use). Actually, K10 doesn't have PSHUFB either, so just require SSE4.1 and POPCNT, and use PBLENDW. (Or use the PSHUFB version, unless it's going to cache-miss a lot.)
Another option to avoid a loading a vector constant from memory is to movemask s2bc, and use integer instead of vector ops. However, it looks like that'll be slower, because the extra movemask isn't free, and integer ANDN isn't usable. BMI1 didn't appear until Haswell, and even Skylake Celerons and Pentiums won't have it. (Very annoying, IMO. It means compilers can't start using BMI for even longer.)
unsigned int dupmask = _mm_movemask_ps(cast(s2bc));
dupmask |= dupmask << 3; // bit3 = d==abc. garbage in bits 4-6, careful if using AVX2 to do two structs at once
// only 2 instructions. compiler can use lea r2, [r1*8] to copy and scale
dupmask &= ~1; // clear the low bit
unsigned int matchmask = _mm_movemask_ps(cast(match));
matchmask &= ~dupmask; // ANDN is in BMI1 (Haswell), so this will take 2 instructions
return _mm_popcnt_u32(matchmask);
AMD XOP's VPPERM (pick bytes from any element of two source registers) would let the byte-shuffle replace the OR that merges s2b and s2c as well.
Hmm, pshufb isn't saving me as much as I thought, because it requires a pcmpeqd, and a pxor to zero a register. It's also loading its shuffle mask from a constant in memory, which can miss in the D-cache. It is the fastest version I've come up with, though.
If inlined into a loop, the same zeroed register could be used, saving one instruction. However, OR and AND can run on port0 (Intel CPUs), which can't run shuffle or compare instructions. The PXOR doesn't use any execution ports, though (on Intel SnB-family microarchitecture).
I haven't run real benchmarks of any of these, only IACA.
The PBLENDW and PSHUFB versions have the same latency (22 cycles, compiled for non-AVX), but the PSHUFB version has better throughput (one per 7.1c, vs. one per 7.4c, because PBLENDW needs the shuffle port, and there's already a lot of contention for it.) IACA says the version using PANDN with a constant instead of PBLENDW is also one-per-7.4c throughput, disappointingly. Port0 isn't saturated, so IDK why it's as slow as PBLENDW.
Old ideas that didn't pan out.
Leaving them in for the benefit of people looking for things to try when using vectors for related things.
Dup-checking s2 with vectors is more work than checking s2 vs. s1, because one compare is as expensive as 4 if done with vectors. The shuffling or masking needed after the compare, to remove false positives if there's no sentinel value, is annoying.
Ideas so far:
Shift s2 over by an element, and compare it to itself. Mask off false positives from shifting in 0. Vertically OR these together, and use it to ANDN the s1 vs s2 vector.
scalar code to do the smaller number of s2 vs. itself comparisons, building a bitmask to use before popcnt.
Broadcast s2.d and check it against s2 (all positions). But that puts the results horizontally in one vector, instead of vertically in 3 vectors. To use that, maybe PTEST / SETCC to make a mask for the bitmap (to apply before popcount). (PTEST with a mask of _mm_setr_epi32(0, -1, -1, -1), to only test the c,b,a, not d==d). Do (c==a | c==b) and b==a with scalar code, and combine that into a mask. Intel Haswell and later have 4 ALU execution ports, but only 3 of them can run vector instructions, so some scalar code in the mix could fill port6. AMD has even more separation between vector and integer execution resources.
shuffle s2 to get all the necessary comparisons done somehow, then shuffle the outputs. Maybe use movemask -> 4-bit lookup table for something?

Why is it the convention of some programmers to use long bit shifts instead of directly assigning a value / storing it as a constant

Consider the following code from the wikipedia article on square root algorythms
short isqrt(short num) {
short res = 0;
short bit = 1 << 14; // The second-to-top bit is set: 1 << 30 for 32 bits
// "bit" starts at the highest power of four <= the argument.
while (bit > num)
bit >>= 2;
while (bit != 0) {
if (num >= res + bit) {
num -= res + bit;
res = (res >> 1) + bit;
}
else
res >>= 1;
bit >>= 2;
}
return res;
}
Why would someone ever do this:
short bit = 1<< 14;
Instead of just assigning the value directly:
short bit = 0x4000;
Some RISC instruction sets will let you shift by a given amount, which is handy. MIPS lets you supply the shamt parameter for example. Other instruction sets aren't so handy. An MSP430 (16 bit - not the extended instructions) compiler would need to render this as a looped call to the RLA pseudo instruction I suppose.
So in some cases it seems like it does not 'hurt' to do it this way, but in other cases it seems like it could. So is there ever an advantage to having long shifts like that? Because it seems like it would make code less portable in a certain sense.
Do X86 or other CISC machines do something magical with this that I just don't know about? :)
It is just more explicit, and, as it involves only constants, will most certainly be evaluated at compile time, not at run time, so in the end, the resulting code will be exactly the same.
C has integer constant expressions, which lend themselves to evaluation during translation. That is why you can do things like this:
enum {
foo_bit,
bar_bit,
baz_bit,
foo_flag = 1 << foo_bit,
bar_flag = 1 << bar_bit,
baz_flag = 1 << baz_bit
};
static unsigned int some_flags = bar_flag | baz_flag;
Because it has static duration, the initializer for some_flags must be an integer constant expression; computable during translation. What we give it certainly is, even though there are multiple computations involved. This example only has one magic number: 1.
More options to match documentation. Whatever is driving the code, drives the style.
A supporting document may say the 14th bit (of a range 0 to 15) is set and the following code relates directly to that. Note 1 << 14 is signed.
int x = 1 << 14;
Docs may say use mask 0x4000, then the following is better. Note 0x4000 is unsigned.
int mask = 0x4000;
It's more readable and just as #jcaron said, it will all be the same thing
after all. Here is the relevant part if I compile to assembly, not even
optimised, just -S:
_isqrt: ## #isqrt
[...]
Ltmp4:
.cfi_def_cfa_register %rbp
movw %di, %ax
movw %ax, -2(%rbp)
movw $0, -4(%rbp)
movw $16384, -6(%rbp) ## imm = 0x4000
Writing 16384, 0x4000 or 1<<14 in this declaration is a matter of
preference.
Maybe even if your compiler supports it, why not 0b10000000000000? :-)

Finding position of '1's efficiently in an bit array

I'm wiring a program that tests a set of wires for open or short circuits. The program, which runs on an AVR, drives a test vector (a walking '1') onto the wires and receives the result back. It compares this resultant vector with the expected data which is already stored on an SD Card or external EEPROM.
Here's an example, assume we have a set of 8 wires all of which are straight through i.e. they have no junctions. So if we drive 0b00000010 we should receive 0b00000010.
Suppose we receive 0b11000010. This implies there is a short circuit between wire 7,8 and wire 2. I can detect which bits I'm interested in by 0b00000010 ^ 0b11000010 = 0b11000000. This tells me clearly wire 7 and 8 are at fault but how do I find the position of these '1's efficiently in an large bit-array. It's easy to do this for just 8 wires using bit masks but the system I'm developing must handle up to 300 wires (bits). Before I started using macros like the following and testing each bit in an array of 300*300-bits I wanted to ask here if there was a more elegant solution.
#define BITMASK(b) (1 << ((b) % 8))
#define BITSLOT(b) ((b / 8))
#define BITSET(a, b) ((a)[BITSLOT(b)] |= BITMASK(b))
#define BITCLEAR(a,b) ((a)[BITSLOT(b)] &= ~BITMASK(b))
#define BITTEST(a,b) ((a)[BITSLOT(b)] & BITMASK(b))
#define BITNSLOTS(nb) ((nb + 8 - 1) / 8)
Just to further show how to detect an open circuit. Expected data: 0b00000010, received data: 0b00000000 (the wire isn't pulled high). 0b00000010 ^ 0b00000000 = 0b0b00000010 - wire 2 is open.
NOTE: I know testing 300 wires is not something the tiny RAM inside an AVR Mega 1281 can handle, that is why I'll split this into groups i.e. test 50 wires, compare, display result and then move forward.
Many architectures provide specific instructions for locating the first set bit in a word, or for counting the number of set bits. Compilers usually provide intrinsics for these operations, so that you don't have to write inline assembly. GCC, for example, provides __builtin_ffs, __builtin_ctz, __builtin_popcount, etc., each of which should map to the appropriate instruction on the target architecture, exploiting bit-level parallelism.
If the target architecture doesn't support these, an efficient software implementation is emitted by the compiler. The naive approach of testing the vector bit by bit in software is not very efficient.
If your compiler doesn't implement these, you can still code your own implementation using a de Bruijn sequence.
How often do you expect faults? If you don't expect them that often, then it seems pointless to optimize the "fault exists" case -- the only part that will really matter for speed is the "no fault" case.
To optimize the no-fault case, simply XOR the actual result with the expected result and a input ^ expected == 0 test to see if any bits are set.
You can use a similar strategy to optimize the "few faults" case, if you further expect the number of faults to typically be small when they do exist -- mask the input ^ expected value to get just the first 8 bits, just the second 8 bits, and so on, and compare each of those results to zero. Then, you just need to search for the set bits within the ones that are not equal to zero, which should narrow the search space to something that can be done pretty quickly.
You can use a lookup table. For example log-base-2 lookup table of 255 bytes can be used to find the most-significant 1-bit in a byte:
uint8_t bit1 = log2[bit_mask];
where log2 is defined as follows:
uint8_t const log2[] = {
0, /* not used log2[0] */
0, /* log2[0x01] */
1, 1 /* log2[0x02], log2[0x03] */
2, 2, 2, 2, /* log2[0x04],..,log2[0x07] */
3, 3, 3, 3, 3, 3, 3, 3, /* log2[0x08],..,log2[0x0F */
...
}
On most processors a lookup table like this will go to ROM. But AVR is a Harvard machine and to place data in code space (ROM) requires special non-standard extension, which depends on the compiler. For example the IAR AVR compiler would need use the extended keyword __flash. In WinAVR (GNU AVR) you would need to use the PROGMEM attribute, but it's more complex than that, because you would also need to use special macros to to read from the program space.
I think there is only one way to do this:
Create an array out "outdata". Each item of the array can for example correspond an 8-bit port register.
Send the outdata on the wires.
Read back this data as "indata".
Store the indata in an array mapped exactly as the outdata.
In a loop, XOR each byte of outdata with each byte of indata.
I would strongly recommend inline functions instead of those macros.
Why can't your MCU handle 300 wires?
300/8 = 37.5 bytes. Rounded to 38. It needs to be stored twice, outdata and indata, 38*2 = 76 bytes.
You can't spare 76 bytes of RAM?
I think you're missing the forest through the trees. Seems like a bed of nails test. First test some assumptions:
1) You know which pins should be live for each pin tested/energized.
2) you have a netlist translated for step 1 into a file on sd
If you operate on a byte level as well as bit, it simplifies the issue. If you energize a pin, there is an expected pattern out stored in your file. First find the mismatched bytes; identify mismatched pins in the byte; finally store the energized pin with the faulty pin numbers.
You don't need an array for searching, or results. general idea:
numwires=300;
numbytes=numwires/8 + (numwires%8)?1:0;
for(unsigned char currbyte=0; currbyte<numbytes; currbyte++)
{
unsigned char testbyte=inchar(baseaddr+currbyte)
unsigned char goodbyte=getgoodbyte(testpin,currbyte/*byte offset*/);
if( testbyte ^ goodbyte){
// have a mismatch report the pins
for(j=0, mask=0x01; mask<0x80;mask<<=1, j++){
if( (mask & testbyte) != (mask & goodbyte)) // for clarity
logbadpin(testpin, currbyte*8+j/*pin/wirevalue*/, mask & testbyte /*bad value*/);
}
}

Increasing performance of 32bit math on 16bit processor

I am working on some firmware for an embedded device that uses a 16 bit PIC operating at 40 MIPS and programming in C. The system will control the position of two stepper motors and maintain the step position of each motor at all times. The max position of each motor is around 125000 steps so I cannot use a 16bit integer to keep track of the position. I must use a 32 bit unsigned integer (DWORD). The motor moves at 1000 steps per second and I have designed the firmware so that steps are processed in a Timer ISR. The timer ISR does the following:
1) compare the current position of one motor to the target position, if they are the same set the isMoving flag false and return. If they are different set the isMoving flag true.
2) If the target position is larger than the current position, move one step forward, then increment the current position.
3) If the target position is smaller than the current position, move one step backward, then decrement the current position.
Here is the code:
void _ISR _NOPSV _T4Interrupt(void)
{
static char StepperIndex1 = 'A';
if(Device1.statusStr.CurrentPosition == Device1.statusStr.TargetPosition)
{
Device1.statusStr.IsMoving = 0;
// Do Nothing
}
else if (Device1.statusStr.CurrentPosition > Device1.statusStr.TargetPosition)
{
switch (StepperIndex1) // MOVE OUT
{
case 'A':
SetMotor1PosB();
StepperIndex1 = 'B';
break;
case 'B':
SetMotor1PosC();
StepperIndex1 = 'C';
break;
case 'C':
SetMotor1PosD();
StepperIndex1 = 'D';
break;
case 'D':
default:
SetMotor1PosA();
StepperIndex1 = 'A';
break;
}
Device1.statusStr.CurrentPosition--;
Device1.statusStr.IsMoving = 1;
}
else
{
switch (StepperIndex1) // MOVE IN
{
case 'A':
SetMotor1PosD();
StepperIndex1 = 'D';
break;
case 'B':
SetMotor1PosA();
StepperIndex1 = 'A';
break;
case 'C':
SetMotor1PosB();
StepperIndex1 = 'B';
break;
case 'D':
default:
SetMotor1PosC();
StepperIndex1 = 'C';
break;
}
Device1.statusStr.CurrentPosition++;
Device1.statusStr.IsMoving = 1;
}
_T4IF = 0; // Clear the Timer 4 Interrupt Flag.
}
The target position is set in the main program loop when move requests are received. The SetMotorPos lines are just macros to turn on/off specific port pins.
My question is: Is there any way to improve the efficiency of this code? The code functions fine as is if the positions are 16bit integers but as 32bit integers there is too much processing. This device must communicate with a PC without hesitation and as written there is a noticeable performance hit. I really only need 18 bit math but I don't know of an easy way of doing that! Any constructive input/suggestions would be most appreciated.
Warning: all numbers are made up...
Supposing that the above ISR has about 200 (likely, fewer) instructions of compiled code and those include the instructions to save/restore the CPU registers before and after the ISR, each taking 5 clock cycles (likely, 1 to 3) and you call 2 of them 1000 times a second each, we end up with 2*1000*200*5 = 2 millions of clock cycles per second or 2 MIPS.
Do you actually consume the rest 38 MIPS elsewhere?
The only thing that may be important here and I can't see it, is what's done inside of the SetMotor*Pos*() functions. Do they do any complex calculations? Do they perform some slow communication with the motors, e.g. wait for them to respond to the commands sent to them?
At any rate, it's doubtful that such simple code would be noticeably slower when working with 32-bit integers than with 16-bit.
If your code is slow, find out where time is spent and how much, profile it. Generate a square pulse signal in the ISR (going to 1 when the ISR starts, going to 0 when the ISR is about to return) and measure its duration with an oscilloscope. Or do whatever is easier to find it out. Measure the time spent in all parts of the program, then optimize where really necessary, not where you have previously thought it would be.
The difference between 16 and 32 bits arithmetic shouldn't be that big, I think, since you use only increment and comparision. But maybe the problem is that each 32-bit arithmetic operation implies a function call (if the compiler isn't able/willing to do inlining of simpler operations).
One suggestion would be to do the arithmetic yourself, by breaking the Device1.statusStr.CurrentPosition in two, say, Device1.statusStr.CurrentPositionH and Device1.statusStr.CurrentPositionL. Then use some macros to do the operations, like:
#define INC(xH,xL) {xL++;if (xL == 0) xH++;}
I would get rid of the StepperIndex1 variable and instead use the two low-order bits of CurrentPosition to keep track of the current step index. Alternately, keep track of the current position in full rotations (rather than each step), so it can fit in a 16 bit variable. When moving, you only increment/decrement the position when moving to phase 'A'. Of course, this means you can only target each full rotation, rather than every step.
Sorry, but you are using bad program design.
Let's check the difference between 16 bit and 32 bit PIC24 or PIC33 asm code...
16 bit increment
inc PosInt16 ;one cycle
So 16 bit increment takes one cycle
32bit increment
clr Wd ;one cycle
inc low PosInt32 ;one cycle
addc high PosInt32, Wd ;one cycle
and 32 increment takes three cycles.
The total difference is 2 cycles or 50ns (nano seconds).
Simple calcolation will show you all. You have 1000 steps per second and 40Mips DSP so you have 40000 instructions per step at 1000 steps per second. More than enough!
When you change it from 16bit to 32bit do you change any of the compile flags to tell it to compile as a 32bit application instead.
have you tried compiling with the 32bit extensions but using only 16bit integers. do you still get such a performance drop?
It's likely that just by changing from 16bit to 32bit that some operations are compiled differently, perhaps do a Diff between the two sets of compiled ASM code and see what is actually different, is it lots or is it only a couple of lines.
Solutions would be maybe instead of using a 32bit integer, just use two 16bit integers,
when the valueA is int16.Max then set it to 0 and then increment valueB by 1 otherwise just incriment ValueA by 1, when value B is >= 3 you then check valueA >= 26696 (or something similar depending if you use unsigned or signed int16) and then you have your motor checking at 12500.

x86 Assembly: INC and DEC instruction and overflow flag

In x86 assembly, the overflow flag is set when an add or sub operation on a signed integer overflows, and the carry flag is set when an operation on an unsigned integer overflows.
However, when it comes to the inc and dec instructions, the situation seems to be somewhat different. According to this website, the inc instruction does not affect the carry flag at all.
But I can't find any information about how inc and dec affect the overflow flag, if at all.
Do inc or dec set the overflow flag when an integer overflow occurs? And is this behavior the same for both signed and unsigned integers?
============================= EDIT =============================
Okay, so essentially the consensus here is that INC and DEC should behave the same as ADD and SUB, in terms of setting flags, with the exception of the carry flag. This is also what it says in the Intel manual.
The problem is I can't actually reproduce this behavior in practice, when it comes to unsigned integers.
Consider the following assembly code (using GCC inline assembly to make it easier to print out results.)
int8_t ovf = 0;
__asm__
(
"movb $-128, %%bh;"
"decb %%bh;"
"seto %b0;"
: "=g"(ovf)
:
: "%bh"
);
printf("Overflow flag: %d\n", ovf);
Here we decrement a signed 8-bit value of -128. Since -128 is the smallest possible value, an overflow is inevitable. As expected, this prints out: Overflow flag: 1
But when we do the same with an unsigned value, the behavior isn't as I expect:
int8_t ovf = 0;
__asm__
(
"movb $255, %%bh;"
"incb %%bh;"
"seto %b0;"
: "=g"(ovf)
:
: "%bh"
);
printf("Overflow flag: %d\n", ovf);
Here I increment an unsigned 8-bit value of 255. Since 255 is the largest possible value, an overflow is inevitable. However, this prints out: Overflow flag: 0.
Huh? Why didn't it set the overflow flag in this case?
The overflow flag is set when an operation would cause a sign change. Your code is very close. I was able to set the OF flag with the following (VC++) code:
char ovf = 0;
_asm {
mov bh, 127
inc bh
seto ovf
}
cout << "ovf: " << int(ovf) << endl;
When BH is incremented the MSB changes from a 0 to a 1, causing the OF to be set.
This also sets the OF:
char ovf = 0;
_asm {
mov bh, 128
dec bh
seto ovf
}
cout << "ovf: " << int(ovf) << endl;
Keep in mind that the processor does not distinguish between signed and unsigned numbers. When you use 2's complement arithmetic, you can have one set of instructions that handle both. If you want to test for unsigned overflow, you need to use the carry flag. Since INC/DEC don't affect the carry flag, you need to use ADD/SUB for that case.
IntelĀ® 64 and IA-32 Architectures Software Developer's Manuals
Look at the appropriate manual Instruction Set Reference, A-M. Every instruction is precisely documented.
Here is the INC section on affected flags:
The CF flag is not affected. The OF, SZ, ZF, AZ, and PF flags are set according to the result.
try changing your test to pass in the number rather than hard code it, then have a loop that tries all 256 numbers to find the one if any that affects the flag. Or have the asm perform the loop and exit out when it hits the flag and or when it wraps around to the number it started with (start with something other than 0x00, 0x7f, 0x80, or 0xFF).
EDIT
.globl inc
inc:
mov $33, %eax
top:
inc %al
jo done
jmp top
done:
ret
.globl dec
dec:
mov $33, %eax
topx:
dec %al
jo donex
jmp topx
donex:
ret
Inc overflows when it goes from 0x7F to 0x80. dec overflows when it goes from 0x80 to 0x7F, I suspect the problem is in the way you are using inline assembler.
As many of the other answers have pointed out, INC and DEC do not affect the CF, whereas ADD and SUB do.
What has not been said yet, however, is that this might make a performance difference. Not that you'd usually be bothered by that unless you are trying to optimise the hell out of a routine, but essentially not setting the CF means that INC/DEC only write to part of the flags register, which can cause a partial flag register stall, see Intel 64 and IA-32 Architectures Optimization Reference Manual or Agner Fog's optimisation manuals.
Except for the carry flag inc sets the flags the same way as add operand 1 would.
The fact that inc does not affect the carry flag is very important.
http://oopweb.com/Assembly/Documents/ArtOfAssembly/Volume/Chapter_6/CH06-2.html#HEADING2-117
The CPU/ALU is only capable of handling unsigned binary numbers, and then it uses OF, CF, AF, SF, ZF, etc., to allow you to decide whether to use it as a signed number (OF), an unsigned number (CF) or a BCD number (AF).
About your problem, remember to consider the binary numbers themselves, as unsigned.
**Also, the overflow and the OF require 3 numbers: The input number, a second number to use in the arithmetic, and the result number.
Overflow is activated only if the first and second numbers have the same value for the sign bit (the most significant bit) and the result has a different sign. As in, adding 2 negative numbers resulted in a positive number, or adding 2 positive numbers resulted in a negative number:
if( (Sign_Num1==Sign_Num2) && (Sign_Result!=Sign_Num1) ) OF=1;
else OF=0;
For your first problem, you are using -128 as the first number. The second number is implicitly -1, used by the DEC instruction. So we really have the binary numbers 0x80 and 0xFF. Both them have the sign bit set to 1. The result is 0x7F, which is a number with the sign bit set to 0. We got 2 initial numbers with the same sign, and a result with a different sign, so we indicate an overflow. -128-1 resulted in 127, and thus the overflow flag is set to indicate a wrong signed result.
For your second problem, you are using 255 as the first number. The second number is implicitly 1, used by the INC instruction. So we really have the binary numbers 0xFF and 0x01. Both them have a different sign bit, so it is not possible to get an overflow (it is only possible to overflow when basically adding 2 numbers of the same sign, but it is never possible to overflow with 2 numbers of a different sign because they will never lead to go beyond the possible signed value). The result is 0x00, and it doesn't set the overflow flag because 255+1, or more exactly, -1+1 gives 0, which is obviously correct for signed arithmetic.
Remember that for the overflow flag to be set, the 2 numbers being added/subtracted need to have the sign bit with the same value, and then the result must have a sign bit with a value different from them.
What the processor does is set the appropriate flags for the results of these instructions (add, adc, dec, inc, sbb, sub) for both the signed and unsigned cases i e two different flag results for every op. The alternative would be having two sets of instructions where one sets signed-related flags and the other the unsigned-related. If the issuing compiler is using unsigned variables in the operation it will test carry and zero (jc, jnc, jb, jbe etc), if signed it tests overflow, sign and zero (jo, jno, jg, jng, jl, jle etc).

Resources