Efficient tiny boolean matrix multiplication

I have some unsigned 16-bit integer s which I'd like to map to an unsigned 32-bit integer r in such a way that each flipped bit in s flips at most one (given) bit in r; that is, simply a mapping between bit positions 0..15 and 0..31. So we can see this as a matrix equation
Ps = r
where P is a 32 x 16 boolean matrix, s is a 16 x 1 boolean vector and r is 32 x 1 boolean vector. I have a gut feeling there exists some super simple hack that I'm missing. Important note: the target machine is a 16 bit mcu!
Here's the best I can do:
static u16 P[32] = someArrayOrWhatever();
u32 FsiPermutationHack(u16 s) {
u32 r = 0; /* must start at zero, since the loop only ORs bits in */
for (u16 i = 0; i < 32; i++)
{
r |= ((u32)((P[i] & s) > 0) << i);
}
return r;
}
The rationale is this: the i:th bit of r is 1 if and only if (P[i] & s) != 0x0000. I am too stupid to disassemble stuff, but I am guessing this would be like ~100 instructions IF we didn't have to do that stupid u32 cast. But then again, perhaps the compiler auto-splits the loop in two for us in which case it's looking pretty good for us.
Apologies for the tangent, just thought I'd share my attempted solution -- do you have a better one?

Inasmuch as you say,
I am guessing this would be like ~100 instructions IF we didn't have
to do that stupid u32 cast. But then again, perhaps the compiler
auto-splits the loop in two for us in which case it's looking pretty
good for us.
and
I have a gut feeling there exists some super simple hack that I'm missing
, I will interpret you to be asking how to minimize the use of 32-bit arithmetic in this code intended for a 16-bit processor.
You really ought to learn how to disassemble and check the compiled result to see whether the compiler does automatically split the loop as you hypothesize, but supposing that it does not, I don't see why you couldn't do the same manually:
static u16 P[32]; /* value assigned elsewhere */
u32 FsiPermutationHack(u16 s) {
u16 *P_hi = P + 16;
u16 r_lo = 0;
u16 r_hi = 0;
for (u16 i = 0; i < 16; i++) {
r_lo |= (u16)((P[i] & s) != 0) << i; /* cast keeps a shift by 15 out of the sign bit of a 16-bit int */
r_hi |= (u16)((P_hi[i] & s) != 0) << i;
}
return ((u32) r_hi << 16) + r_lo;
}
That supposes u16 and u32 to be unsigned 16-bit and 32-bit (respectively) integers with no padding bits.
Note also that the idea that performing arithmetic with type u16 instead of u32 should be an improvement assumes that type u32 has a higher integer conversion rank than unsigned int. Roughly speaking, that comes down to the implementation's unsigned int being a 16-bit type. That's entirely plausible for an implementation for a 16-bit processor. On a system whose int and unsigned int are instead 32-bit types, however, all narrower integer operands would be promoted to 32 bits anyway.
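To make the comparison concrete, here is a minimal sketch that puts the straightforward 32-bit loop next to the split 16-bit loop and checks that they agree. The typedefs, the function names and the example matrix filled in by fill_example are assumptions for illustration only; the real contents of P are not given in the question.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;  /* assumed: exactly 16 bits, no padding */
typedef uint32_t u32;  /* assumed: exactly 32 bits, no padding */

/* Reference version: one row of P per result bit, 32-bit accumulation. */
u32 map_reference(const u16 P[32], u16 s) {
    u32 r = 0;
    for (u16 i = 0; i < 32; i++) {
        r |= (u32)((P[i] & s) != 0) << i;
    }
    return r;
}

/* Split version: two 16-bit accumulators, widened only at the end. */
u32 map_split(const u16 P[32], u16 s) {
    const u16 *P_hi = P + 16;
    u16 r_lo = 0;
    u16 r_hi = 0;
    for (u16 i = 0; i < 16; i++) {
        /* the u16 cast avoids shifting into the sign bit of a 16-bit int */
        r_lo |= (u16)((u16)((P[i] & s) != 0) << i);
        r_hi |= (u16)((u16)((P_hi[i] & s) != 0) << i);
    }
    return ((u32)r_hi << 16) | r_lo;
}

/* Hypothetical example matrix: result bit i mirrors source bit (i % 16). */
void fill_example(u16 P[32]) {
    for (u16 i = 0; i < 32; i++) {
        P[i] = (u16)(1u << (i % 16));
    }
}

/* Sanity check: both versions agree for every 16-bit input. */
void check_versions_agree(const u16 P[32]) {
    for (u32 s = 0; s <= 0xFFFFu; s++) {
        assert(map_reference(P, (u16)s) == map_split(P, (u16)s));
    }
}
```

With the example matrix, an input of 0x8001 sets bits 0, 15, 16 and 31 of the result. Whether the split version is actually faster on a given MCU is something only the disassembly or a measurement can tell.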
Update:
As far as the possibility of a better alternative algorithm, I observe that each bit of the result is computed from a different element of array P, that the whole value of each element is used, and that the element size is the same as the target machine's native word size. There seems then no scope for performing fewer 16-bit bitwise AND operations than there are array elements (but see below).
If we accept that each array element must be processed separately, then the provided implementation does a pretty good job of approaching it efficiently:
It performs only 16-bit computations until the time comes to assemble the final result;
It computes both the upper and lower halves of the result in the same loop, thus incurring only 16 iterations' worth of loop overhead instead of 32;
It largely removes the extra indexing arithmetic that would otherwise have been required, by creating P_hi for accessing the upper half of the array.
It would be possible to manually unroll the loop to possibly save a few more cycles, but that's the kind of optimization that you absolutely should rely on your compiler to perform for you.
As far as "bit twiddling hacks", the only scope I see for anything of that nature would be processing adjacent pairs of 16-bit array elements as 32-bit unsigned integers. That would allow performing one 32-bit bitwise AND in place of each two 16-bit ANDs. That would be coupled with two 32-bit comparisons (vs. two 16-bit comparisons in the above code). The 16-bit shift and bitwise OR operations of the above approach could be retained. Aside from that having formally undefined behavior as a result of violating the strict aliasing rule, that would involve 32-bit arithmetic, which presumably is about half as fast as 16-bit arithmetic on your 16-bit machine. Performance is better measured than predicted, but I don't see any reason to expect a significant win from that approach.

Related

Are leftshift operators dependent on register size?

Let uint8 and uint16 be datatypes for 8bit and 16bit positive integers.
uint8 a = 1;
uint16 b = a << 8;
I tested this program on 32Bit architecture with result
b = 256
Would the same program on a system with 8-bit registers yield the result:
b = 0 ?
because all bits in the register get shifted out by a << 8?

Registers are irrelevant. This is about the width of your types.
When you shift a value by more bits than it possesses, the behaviour is undefined. The compiler, the program, the computer, the tax office can legally manifest any results accordingly. And, no, that's not just theoretical.
However, operands in C are promoted before interesting things are done on them. So, your uint8_t becomes an int before the left-shift.
Now it depends on your architecture (as determined by your compiler configuration) as to what happens: is int on your implementation only 8-bit? No, it's not! The result, then — regardless of any "register size" — must abide by the rules of the language, yielding the mathematically appropriate answer (256). And, even if it were, you'd hit that undefined behaviour so the question would be moot.
Under the bonnet, if more than one register is needed to hold a variable, then that's what will and must happen (at whatever performance cost is implied as a result). That's if a register is used at all; remember, you're programming in an abstraction, not hand-crafting machine code. The program snippet you showed can be completely optimised away during compilation and doesn't require any runtime instructions at all.

Would the same program on a system with 8-bit registers yield the result b = 0?
No.
In the expression a << 8 the variable a will get promoted to an int before the bit shift. And an int is guaranteed to be at least 16 bits.
b will have the value 256 on all platforms unless there's a bug in the compiler.
However, if you changed the second line to uint32 b = a << 16; you might get strange results. a would still get promoted to an int, but if int is two bytes long, then a << 16 will invoke undefined behavior.
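As a small sketch of the two cases discussed above (the function names are made up for illustration), the first helper shifts by 8 after an explicit conversion to unsigned int, and the second widens to 32 bits before shifting by 16 so that the shift count stays below the width of the operand:

```c
#include <assert.h>
#include <stdint.h>

/* The uint8_t operand is converted to a type of at least 16 bits before
   the shift, so no bits are lost whatever the register width. Using
   unsigned int keeps values such as 0x80 away from the sign bit of a
   16-bit int. */
uint16_t shift_left_8(uint8_t a) {
    return (uint16_t)((unsigned int)a << 8);
}

/* Shifting by 16 needs a type wider than 16 bits, so widen first. */
uint32_t shift_left_16(uint8_t a) {
    return (uint32_t)a << 16;
}

/* Small self-check mirroring the question's values. */
void check_shifts(void) {
    assert(shift_left_8(1) == 256u);
    assert(shift_left_16(1) == 65536ul);
}
```

The explicit conversions are a design choice rather than a requirement for a = 1: the plain a << 8 from the question is already well defined for that value, but it would overflow a 16-bit int for inputs of 0x80 and above.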

AVR uint8_t doesn't get correct value

I have a uint8_t that should contain the result of a bitwise calculation. The debugger says the variable is set correctly, but when i check the memory, the var is always at 0. The code proceeds like the var is 0, no matter what the debugger tells me. Here's the code:
temp = (path_table & (1 << current_bit)) >> current_bit;
//temp is always 0, debugger shows correct value
if (temp > 0) {
DS18B20_send_bit(pin, 0x01);
} else {
DS18B20_send_bit(pin, 0x00);
}
Temp's a uint8_t, path_table's a uint64_t and current_bit's a uint8_t. I've tried to make them all uint64_t but nothing changed. I've also tried using unsigned long long int instead. Nothing again.
The code always enters the else clause.
Chip's Atmega4809, and uses uint64_t in other parts of the code with no issues.
Note - If anyone knows a more efficient/compact way to extract a single bit from a variable i would really appreciate if you could share ^^

1 is an integer constant, of type int. The expression 1 << current_bit also has type int, but for 16-bit int, the result of that expression is undefined when current_bit is larger than 14. The behavior being undefined in your case, then, it is plausible that your debugger presents results for the overall expression that seem inconsistent with the observed behavior. If you used an unsigned int constant instead, i.e. 1u, then the shift would be well defined for current_bit up to 15, but it would still be undefined for larger values, because shifting by the full width of the promoted type or more is undefined even for unsigned types. Either way, a 16-bit mask cannot select bits 16 through 63 of path_table.
Solve this problem by performing the computation in a type wide enough to hold the result. Here's a compact, correct, and pretty clear way to correct your code to do that:
DS18B20_send_bit(pin, (path_table & (((uint64_t) 1) << current_bit)) != 0);
Or if path_table has an unsigned type then I prefer this, though it's more of a departure from your original:
DS18B20_send_bit(pin, (path_table >> current_bit) & 1);
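Both corrected forms can be wrapped in small helpers, sketched below. The names get_bit64 and get_bit64_mask are hypothetical, and the asserts simply document the precondition that the bit index stays below 64.

```c
#include <assert.h>
#include <stdint.h>

/* Returns bit n of v (0 or 1) by shifting the 64-bit value down.
   n must be below 64: shifting a 64-bit value by 64 or more is undefined. */
uint8_t get_bit64(uint64_t v, uint8_t n) {
    assert(n < 64);
    return (uint8_t)((v >> n) & 1u);
}

/* Same result via a mask; the constant is widened BEFORE the shift,
   so the mask is computed in 64 bits rather than in int. */
uint8_t get_bit64_mask(uint64_t v, uint8_t n) {
    assert(n < 64);
    return (v & ((uint64_t)1 << n)) != 0;
}
```

A call such as DS18B20_send_bit(pin, get_bit64(path_table, current_bit)) would then replace the if/else from the question.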

Realization #1 here is that AVR is a 1980s-1990s technology core. It is not an x64 PC that chews 64-bit numbers for breakfast, but an extremely inefficient 8-bit MCU. As such:
It likes 8 bit arithmetic.
It will struggle with 16 bit arithmetic, by doing tricks with 16 bit index registers, double accumulators or whatever 8 bit core tricks it prefers to do.
It will literally take ages to execute 32 bit arithmetic, by invoking software libraries inline.
It will probably melt through the floor if attempting 64 bit arithmetic.
Before you do anything else, you need to get rid of all 64 bit arithmetic and radically minimize the use of 32 bit arithmetic. Period. There should be no single variable of uint64_t in your code or you are doing it very very wrong.
A related point: on 8-bit MCUs, int is 16 bits wide, which is the minimum width the C standard allows.
In the code 1<<current_bit, the integer constant 1 is of type int. Meaning that if current_bit is 15 or larger, you will shift bits into the sign bit of this temporary int. This is always a bug. Strictly speaking this is undefined behavior. In practice, you might end up with random change of sign of your numbers.
To avoid this, never use any form of bitwise operators on signed numbers. When mixing integer constants such as 1 with bitwise operators, change them to 1u to avoid bugs like the one mentioned.
If anyone knows a more efficient/compact way to extract a single bit from a variable i would really appreciate if you could share
The most efficient way in C is: uint8_t variable; ... if(variable & (1u << bits)). This should translate to the relevant "branch if bit set" instruction.
My general advice would be to find your tool chain's disassembler and see what machine code the C code actually generated. You don't have to be an assembler guru to read it; peeking at the instruction set should be enough.

In C, How do I calculate the signed difference between two 48-bit unsigned integers?

I've got two values from an unsigned 48bit nanosecond counter, which may wrap.
I need the difference, in nanoseconds, of the two times.
I think I can assume that the readings were taken at roughly the same time, so of the two possible answers I think I'm safe taking the smallest.
They're both stored as uint64_t. Because I don't think I can have 48 bit types.
I'd like to calculate the difference between them, as a signed integer (presumably int64_t), accounting for the wrapping.
so e.g. if I start out with
x=5
y=3
then the result of x-y is 2, and will stay so if I increment both x and y, even as they wrap over the top of the max value 0xffffffffffff
Similarly if x=3, y=5, then x-y is -2, and will stay so whenever x and y are incremented simultaneously.
If I could declare x,y as uint48_t, and the difference as int48_t, then I think
int48_t diff = x - y;
would just work.
How do I simulate this behaviour with the 64-bit arithmetic I've got available?
(I think any computer this is likely to run on will use 2's complement arithmetic)
P.S. I can probably hack this out, but I wonder if there's a nice neat standard way to do this sort of thing, which the next person to read my code will be able to understand.
P.P.S Also, this code is going to end up in the tightest of tight loops, so something that will compile efficiently would be nice, so that if there has to be a choice, speed trumps readability.

You can simulate a 48-bit unsigned integer type by just masking off the top 16 bits of a uint64_t after any arithmetic operation. So, for example, to take the difference between those two times, you could do:
uint64_t diff = (after - before) & 0xffffffffffff;
You will get the right value even if the counter wrapped around during the procedure. If the counter didn't wrap around, the masking is not needed but not harmful either.
Now if you want this difference to be recognized as a signed integer by your compiler, you have to sign extend the 48th bit. That means that if the 48th bit is set, the number is negative, and you want to set the 49th through the 64th bit of your 64-bit integer. I think a simple way to do that is:
int64_t diff_signed = (int64_t)(diff << 16) >> 16;
Warning: You should probably test this to make sure it works, and also beware there is implementation-defined behavior when I cast the uint64_t to an int64_t, and I think there is implementation-defined behavior when I shift a signed negative number to the right. I'm sure a C language lawyer could come up with something more robust.
Update: The OP points out that if you combine the operation of taking the difference and doing the sign extension, there is no need for masking. That would look like this:
int64_t diff = (int64_t)((x - y) << 16) >> 16; /* left shift done while still unsigned; left-shifting a negative signed value is undefined */
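For readers who want to avoid the implementation-defined cast and right shift altogether, one possible alternative is to mask to 48 bits and subtract 2^48 by hand when bit 47 is set. This is a sketch of a different technique from the shift trick above, and the names MASK48, SIGN48 and diff48 are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MASK48 UINT64_C(0xFFFFFFFFFFFF)   /* low 48 bits */
#define SIGN48 (UINT64_C(1) << 47)        /* sign bit of a 48-bit value */

/* Signed difference x - y of two 48-bit counter readings, modulo 2^48.
   The result lies in [-2^47, 2^47). */
int64_t diff48(uint64_t x, uint64_t y) {
    assert(x <= MASK48 && y <= MASK48);
    uint64_t d = (x - y) & MASK48;   /* unsigned wraparound is well defined */
    if (d & SIGN48) {
        /* bit 47 set: the 48-bit pattern represents d - 2^48 */
        return (int64_t)d - (int64_t)(MASK48 + 1);
    }
    return (int64_t)d;
}
```

For the values in the question, diff48(5, 3) is 2 and diff48(3, 5) is -2, and the same holds after both readings wrap past 0xffffffffffff. The branch may cost a little compared with the two shifts, so in a tight loop it is worth measuring both.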

struct Nanosecond48{
unsigned long long u48 : 48;
// int res : 12; // just for clarity, don't need this one really
};
Here we just use the explicit width of the field to be 48 bits, and with that (admittedly somewhat awkward) type you leave it up to your compiler to properly handle different architectures/platforms/whatnot.
Like the following:
Nanosecond48 u1, u2, overflow;
overflow.u48 = -1L;
u1.u48 = 3;
u2.u48 = 5;
const auto diff = (u2.u48 + (overflow.u48 + 1) - u1.u48) & 0x0000FFFFFFFFFFFF;
Of course in the last statement you can just do the remainder operation with % (overflow.u48 + 1) if you prefer.

Do you know which was the earlier reading and which was later? If so:
diff = (earlier <= later) ? later - earlier : WRAPVAL - earlier + later;
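where WRAPVAL is (1ULL << 48), is pretty easy to read. (The constant needs the ULL suffix: a plain 1 << 48 would shift an int by more than its width.)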
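A minimal sketch of that idea as a function follows, together with a branch-free variant that relies on unsigned wraparound. The function names are hypothetical, and both assume the two readings really are ordered as earlier and later with at most one wrap between them.

```c
#include <assert.h>
#include <stdint.h>

#define WRAPVAL (UINT64_C(1) << 48)   /* 2^48, built in a 64-bit type */

/* Elapsed ticks from earlier to later on a 48-bit counter. */
uint64_t elapsed48(uint64_t earlier, uint64_t later) {
    assert(earlier < WRAPVAL && later < WRAPVAL);
    return (earlier <= later) ? later - earlier
                              : WRAPVAL - earlier + later;
}

/* Branch-free equivalent: subtract modulo 2^64, then reduce modulo 2^48. */
uint64_t elapsed48_masked(uint64_t earlier, uint64_t later) {
    return (later - earlier) & (WRAPVAL - 1);
}
```

Both return 3 for a reading of 0xfffffffffffe followed by a reading of 1.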

Should I use the stdint.h integer types on 32/64 bit machines?

One thing that bugs me about the regular c integer declarations is that their names are strange, "long long" being the worst. I am only building for 32 and 64 bit machines so I do not necessarily need the portability that the library offers, however I like that the name for each type is a single word in similar length with no ambiguity in size.
// multiple word types are hard to read
// long integers can be 32 or 64 bits depending on the machine
unsigned long int foo = 64;
long int bar = -64;
// easy to read
// no ambiguity
uint64_t foo = 64;
int64_t bar = -64;
On 32 and 64 bit machines:
1) Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
2) If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
for (int8_t i = 0; i < 10; i++) {
}
3) Whenever I use an integer that I know will never be negative is it ok to prefer using the unsigned version even if I do not need the extra range it provides?
// instead of the one above
for (uint8_t i = 0; i < 10; i++) {
}
4) Is it safe to use a typedef for the types included from stdint.h
typedef int32_t signed_32_int;
typedef uint32_t unsigned_32_int;
edit: both answers were equally good and I couldn't really lean towards one so I just picked the answerer with lower rep

1) Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
Yes it can be slower. Use int_fast16_t instead. Profile the code as needed. Performance is very implementation dependent. A prime benefit of int16_t is its small, well defined size (also it must be 2's complement) as used in structures and arrays, not so much for speed.
The typedef name int_fastN_t designates the fastest signed integer type with a width of at least N. C11 §7.20.1.3 2
2) If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
Yes but that savings in code and speed is questionable. Suggest int instead. Emitted code tends to be optimal in speed/size with the native int size.
3) Whenever I use an integer that I know will never be negative is it OK to prefer using the unsigned version even if I do not need the extra range it provides?
Using some unsigned type is preferred when the math is strictly unsigned (such as array indexing with size_t), yet code needs to watch for careless application like
for (unsigned i = 10 ; i >= 0; i--) // infinite loop
4) Is it safe to use a typedef for the types included from stdint.h
Almost always. Types like int16_t are optional. Maximum portability uses required types uint_least16_t and uint_fast16_t for code to run on rare platforms that use bit widths like 9, 18, etc.

Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
Yes. Some CPUs do not have dedicated 16-bit arithmetic instructions; arithmetic on 16-bit integers must be emulated with an instruction sequence along the lines of:
r1 = r2 + r3
r1 = r1 & 0xffff
The same principle applies to 8-bit types.
Use the "fast" integer types in <stdint.h> to avoid this -- for instance, int_fast16_t will give you an integer that is at least 16 bits wide, but may be wider if 16-bit types are nonoptimal.
If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
Don't bother; just use int. Using a narrower type doesn't actually save any space, and may cause you issues down the line if you decide to increase the number of iterations to over 127 and forget that the loop variable is using a narrow type.
Whenever I use an integer that I know will never be negative is it ok to prefer using the unsigned version even if I do not need the extra range it provides?
Best avoided. Certain C idioms do not work properly on unsigned integers; for instance, you cannot write a loop of the form:
for (i = 100; i >= 0; i--) { … }
if i is an unsigned type, because i >= 0 will always be true!
Is it safe to use a typedef for the types included from stdint.h
Safe from a technical perspective, but it'll annoy other developers who have to work with your code.
Get used to the <stdint.h> names. They're standardized and reasonably easy to type.
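The countdown loop mentioned above can still be written with an unsigned counter if the test is moved in front of the decrement. The sketch below is one common way to do it; count_down is a made-up name used only for this illustration.

```c
#include <assert.h>

/* Visits indices n-1 down to 0 with an unsigned counter and returns how
   many iterations ran. The condition compares the OLD value of i with 0
   and then decrements, so the loop ends cleanly after index 0 instead of
   wrapping around. */
unsigned count_down(unsigned n) {
    unsigned visits = 0;
    for (unsigned i = n; i-- > 0; ) {
        assert(i < n);   /* inside the body i runs n-1, n-2, ..., 0 */
        visits++;
    }
    return visits;
}
```

After the loop i has wrapped to UINT_MAX, which is harmless here because i is not used again.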

Yes, a smaller type can absolutely be slower. On my laptop (Intel Haswell), a microbenchmark that counts up and down between 0 and 65535 on two registers 2 billion times takes
1.313660150s - ax dx (16-bit)
1.312484805s - eax edx (32-bit)
1.312270238s - rax rdx (64-bit)
Minuscule but repeatable differences in timing. (I wrote the benchmark in assembly, because C compilers may optimize it to a different register size.)
Using the smallest type for a short loop will work, but you'll have to keep it up to date if you change the bounds, and the C compiler will probably optimize it to the same assembly code anyway.
Preferring unsigned types is totally fine as long as it's correct C. Keep in mind that unsigned overflow is defined and signed overflow is undefined, and compilers do take advantage of that for optimization. For example,
void foo(int start, int count) {
for (int i = start; i < start + count; i++) {
// If these were unsigned, the loop would execute 0 times when
// "start + count" wraps to a number smaller than "start".
// With signed arithmetic, that may happen, or the compiler
// may assume this loop always runs "count" times.
// For defined behavior, avoid signed overflow.
}
}
Yes, it is safe to typedef the stdint.h types. Also, inttypes.h (part of standard C since C99, and specified by POSIX as well) extends stdint.h with some useful functions and macros.

On embedded platforms, is it more efficient to use unsigned int instead of (implicitly signed) int?

I've got into this habit of always using unsigned integers where possible in my code, because the processor can do divides by powers of two on unsigned types, which it can't with signed types. Speed is critical for this project. The processor operates at up to 40 MIPS.
My processor has an 18 cycle divide, but it takes longer than the single cycle barrel shifter. So is it worth using unsigned integers here to speed things up or do they bring other disadvantages? I'm using a dsPIC33FJ128GP802 - a member of the dsPIC33F series by Microchip. It has single cycle multiply for both signed and unsigned ints. It also has sign and zero extend instructions.
For example, it produces this code when mixing signed and unsigned integers.
026E4 97E80F mov.b [w15-24],w0
026E6 FB0000 se w0,w0
026E8 97E11F mov.b [w15-31],w2
026EA FB8102 ze w2,w2
026EC B98002 mul.ss w0,w2,w0
026EE 400600 add.w w0,w0,w12
026F0 FB8003 ze w3,w0
026F2 100770 subr.w w0,#16,w14
I'm using C (GCC for dsPIC.)

I think we all need to know a lot more about the peculiarities of your processor to answer this question. Why can't it do divides by powers of two on signed integers? As far as I remember the operation is the same for both. I.e.
10/2 = 00001010 goes to 00000101
-10/2 = 11110110 goes to 11111011
Maybe you should write some simple code doing an unsigned divide and a signed divide and compare the compiled output.
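One detail worth checking in such a comparison: the shift and the division agree for -10, but not for odd negative values, because C division truncates toward zero while an arithmetic shift rounds toward negative infinity. The sketch below shows the usual bias correction a compiler has to add for signed operands. The function names are hypothetical, and the signed version assumes that >> on a negative int is an arithmetic shift, which is implementation-defined in C (GCC documents it that way).

```c
#include <assert.h>

/* Unsigned: a plain logical shift is exactly x / 2^k. */
unsigned udiv_pow2(unsigned x, unsigned k) {
    assert(k < 15);   /* keeps the sketch valid even for a 16-bit int */
    return x >> k;
}

/* Signed: add a bias of 2^k - 1 to negative inputs before shifting so
   that the result truncates toward zero like x / 2^k does.
   ASSUMPTION: >> on a negative int is an arithmetic shift. */
int sdiv_pow2(int x, unsigned k) {
    assert(k < 15);   /* so that 2^k - 1 fits in a 16-bit int */
    int bias = (x < 0) ? (int)((1u << k) - 1u) : 0;
    return (x + bias) >> k;
}
```

For example, -7 / 2 is -3 in C, and sdiv_pow2(-7, 1) yields the same value thanks to the bias, whereas a bare arithmetic shift of -7 by one would typically give -4. The extra test and add are what can make the signed case slightly more expensive than the unsigned one.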
Also benchmarking is a good idea. It doesn't need to be precise. Just have an array of a few thousand numbers, start a timer and start dividing them a few million times and time how long it takes. Maybe do a few billion times if your processor is fast. E.g.
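int s_numbers[] = { etc. etc. };
int s_array_size = sizeof(s_numbers) / sizeof(s_numbers[0]); /* element count, not byte count */
unsigned int u_numbers[] = { etc. etc.};
unsigned int u_array_size = sizeof(u_numbers) / sizeof(u_numbers[0]);
long i; /* a 16-bit int could not count to 100000000 */
volatile int s_result; /* volatile so the divisions are not optimized away */
volatile unsigned int u_result;
/* Start timer. */
for(i = 0; i < 100000000; i++)
{
s_result = s_numbers[i % s_array_size] / s_numbers[(i + 1) % s_array_size];
}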
/* Stop timer and print difference. */
/* Repeat for unsigned integers. */
Written in a hurry to show the principle, please forgive any errors.
It won't give precise benchmarking but should give a general idea of which is faster.

I don't know much about the instruction set available on your processor but a quick look makes me think that it has instructions that may be used for both arithmetic and logical shifts, which should mean that shifting a signed value costs about the same as shifting an unsigned value, and dividing by powers of 2 for each using the shifts should also cost the same. (my knowledge about this is from a quick glance at some intrinsic functions for a C compiler that targets your processor family).
That being said, if you are working with values which are to be interpreted as unsigned then you might as well declare them as unsigned. For the last few years I've been using the types from stdint.h more and more, and usually I end up using the unsigned versions because my values are either inherently unsigned or I'm just using them as bit arrays.

Generate assembly both ways and count cycles.

I'm going to guess that unsigned division by powers of two is faster because it can simply do a right shift as needed without needing to worry about sign extension.
As for disadvantages: detecting arithmetic overflows, overflowing a signed type because you didn't realize it while using unsigned, etc. Nothing blocking, just different things to watch out for.
