Is a binary operation faster than memmove? - c

I'm writing a digital filter, and I need to keep the last X values and sum them all together.
Now there are two possible approaches to this. Either I shift the whole array using memmove to make room for the next value, and use hard-coded indices into the array in my summing algorithm:
memmove(&Fifo[0], &Fifo[1], 12 * 4); // Shift array to the left
Result += Factor[1] * (Fifo[5] + Fifo[7]);
Result += Factor[2] * (Fifo[4] + Fifo[8]);
Result += Factor[3] * (Fifo[3] + Fifo[9]);
Result += Factor[4] * (Fifo[2] + Fifo[10]);
Result += Factor[5] * (Fifo[1] + Fifo[11]);
Result += Factor[6] * (Fifo[0] + Fifo[12]);
Alternatively, I don't copy any memory, but increment a counter instead and calculate each index from it using a modulo operation (like a circular buffer):
i++; // Increment the index
Result += Factor[1] * (Fifo[(i + 5) % 13] + Fifo[(i + 7) % 13]);
Result += Factor[2] * (Fifo[(i + 4) % 13] + Fifo[(i + 8) % 13]);
Result += Factor[3] * (Fifo[(i + 3) % 13] + Fifo[(i + 9) % 13]);
Result += Factor[4] * (Fifo[(i + 2) % 13] + Fifo[(i + 10) % 13]);
Result += Factor[5] * (Fifo[(i + 1) % 13] + Fifo[(i + 11) % 13]);
Result += Factor[6] * (Fifo[(i + 0) % 13] + Fifo[(i + 12) % 13]);
Since this is an embedded ARM CPU, I'm wondering which would be more efficient. I assume the CPU has to move at least one 32-bit value internally to do the modulo operation, so could moving the whole array be just as fast as calculating the right indices?

If you need to know which is faster, you need to do a benchmark. If you want to know why, you need to examine the assembly.
That being said, there is also a halfway solution which could be good enough: use a buffer larger than needed and only do the memmove when the buffer is full. That way you only have to keep track of a starting offset, without worrying about the problems that come with circular buffers. You do have to use more memory, though.
So if you wish to keep 5 elements and use a buffer for 10 elements, you only have to do the memmove every 5 insertions (except on the first pass, when you can do 10 insertions). A sketch of the idea follows.
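A minimal sketch of that halfway scheme, assuming 32-bit samples; the sizes and names here are mine, chosen to match the 5-of-10 example:

#include <stdint.h>
#include <string.h>

#define WINDOW   5             /* elements the filter actually needs */
#define CAPACITY 10            /* oversized buffer, twice the window */

static int32_t  buf[CAPACITY];
static unsigned start = 0;     /* offset of the oldest kept element  */
static unsigned count = 0;     /* elements currently stored          */

/* Append a sample; compact with memmove only when the buffer fills. */
static void push(int32_t sample)
{
    if (start + count == CAPACITY) {    /* ran off the end: compact  */
        memmove(buf, buf + start, WINDOW * sizeof buf[0]);
        start = 0;
        count = WINDOW;
    }
    buf[start + count] = sample;
    if (count < WINDOW)
        count++;                        /* still filling the window  */
    else
        start++;                        /* slide the window forward  */
}

The summing code then reads buf[start] through buf[start + WINDOW - 1] with plain offsets, no modulo needed.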

I've done exactly that on a Cortex-M0 (LPC11C14) for a FIR filter of size 15 (Savitzky-Golay for measuring line voltage).
I found that in my case copying was somewhat slower than using a circular buffer of size 16 and computing the indices using the modulo operator. Note that 16 is a power of two, which makes the modulo very cheap: it reduces to a bitwise AND.
I tried several variants and used a port pin for measuring execution time; I recommend you do the same.
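A sketch of that power-of-two indexing, assuming 32-bit samples (all names are mine); the mask replaces the modulo entirely:

#include <stdint.h>

#define RING_SIZE 16                  /* power of two                */
#define RING_MASK (RING_SIZE - 1)     /* 0x0F                        */

static int32_t  ring[RING_SIZE];
static unsigned head = 0;             /* counts up forever           */

static void ring_push(int32_t sample)
{
    ring[head & RING_MASK] = sample;  /* i % 16 becomes i & 15       */
    head++;
}

static int32_t ring_get(unsigned offset)  /* offset back from newest */
{
    return ring[(head - 1 - offset) & RING_MASK];
}

Unsigned wraparound makes the subtraction safe, since the mask keeps only the low four bits.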

Assuming 32-bit values, a modulo on ARM can be executed in 2 assembly instructions, but so can moving one value in memory (one instruction to load it into a register, one to store it back). So there is no definitive answer here; it will depend on the code around it.
My gut feeling says you should go for the circular buffer approach.

There is a third way which requires neither memmove nor modulo, involving two switch blocks. I'm too lazy to type it up in full, but the idea is that you calculate the offset, use the first switch to process one 'half' of the buffer, then recalculate the offset and use the second switch to process the other half. You basically enter the second switch where the first one 'left'. Note that in one switch block the instruction order has to be reversed; see the sketch below.
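A minimal sketch of the idea (my code), shown for a 4-element buffer with a plain sum instead of the weighted filter to keep it short; each case deliberately falls through to the next:

#include <stdint.h>

/* Sum all elements of a 4-entry ring in logical order, without
 * memmove or modulo; 'start' (0..3) indexes the oldest element. */
static int32_t sum_ring4(const int32_t buf[4], unsigned start)
{
    int32_t sum = 0;
    switch (start) {           /* first run: buf[start] .. buf[3]    */
    case 0: sum += buf[0];     /* fall through */
    case 1: sum += buf[1];     /* fall through */
    case 2: sum += buf[2];     /* fall through */
    case 3: sum += buf[3];
    }
    switch (start) {           /* second run: buf[0] .. buf[start-1] */
    case 3: sum += buf[2];     /* note the reversed case order       */
    case 2: sum += buf[1];     /* fall through */
    case 1: sum += buf[0];     /* fall through */
    case 0: break;
    }
    return sum;
}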

My intuition says that the memmove may cause all sorts of memory conflicts and prevent internal bypasses, since you load and store to the same area, perhaps even the same cache lines. Some processors would simply give up on optimizing this and defer all the memory operations, effectively serializing them. An embedded CPU may be simple enough to do this anyway, but I'm talking about the general case: on x86 or even a Cortex-A15 you may see a bigger penalty.

Related

What is the advantage of this sizing code in C?

Apologies for the generic question title; I wasn't sure how to phrase it properly (suggestions welcome!).
I'm trying to get my head around some of the code for the CommonMark parser and came across this:
/* Oversize the buffer by 50% to guarantee amortized linear time
* complexity on append operations. */
bufsize_t new_size = target_size + target_size / 2;
new_size += 1;
new_size = (new_size + 7) & ~7;
So given a number, e.g. 32, it will add 32/2 [giving 48], add 1 [49], add 7 [56], and finally AND that with ~7, i.e. -8 [still 56].
Is this a common pattern? Specifically the adding of a number and then ANDing with its complement.
Is anyone able to provide any insight into what this is doing and what advantages, if any, exist?
The (+7) & ~7 part rounds the number up to the next multiple of 8. This trick works only with powers of 2 (7 is 2^3 - 1). If you want to round to a multiple of 32, use 31 instead of 7.
The reason to round the size to a multiple of 8 is probably specific to the algorithm.
It is also possible that the author of the code knows how the memory allocator works. If the allocator uses internally blocks of memory of multiple of 8 bytes, an allocation request of any number of bytes between 1 and 8 uses an entire block. By asking for a block having a size that is multiple of 8 one gets several extra bytes for the same price.
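To make the pattern concrete, a minimal generalized sketch (the function name is mine):

#include <stdio.h>

/* Round x up to a multiple of 'align'; align must be a power of 2. */
static unsigned round_up(unsigned x, unsigned align)
{
    return (x + align - 1) & ~(align - 1);
}

int main(void)
{
    printf("%u\n", round_up(49, 8));   /* 56: rounds up              */
    printf("%u\n", round_up(56, 8));   /* 56: already a multiple     */
    return 0;
}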

what (r+1 + (r >> 8)) >> 8 does?

In some old C/C++ graphics-related code that I have to port to Java and JavaScript, I found this:
b = (b+1 + (b >> 8)) >> 8; // very fast
Where b is a short int for blue, and the same code appears for r and g (red & green). The comment is not helpful.
I cannot figure out what it does, apart from obvious shifting and adding. I can port without understanding, I just ask out of curiosity.
y = ( x + 1 + (x>>8) ) >> 8 // very fast
This is a fixed-point approximation of division by 255. Conceptually, this is useful for normalizing calculations based on pixel values such that 255 (typically the maximum pixel value) maps to exactly 1.
It is described as very fast because fully general integer division is a relatively slow operation on many CPUs -- although it is possible that your compiler would make a similar optimization for you if it can deduce the input constraints.
This works based on the idea that 257/(256*256) is a very close approximation of 1/255, and that x*257/256 can be formulated as x+(x>>8). The +1 is rounding support which allows the formula to exactly match the integer division x/255 for all values of x in [0..65534].
Some algebra on the inner portion may make things a bit more clear...
x*257/256
= (x*256+x)/256
= x + x/256
= x + (x>>8)
There is more discussion here: How to do alpha blend fast? and here: Division via Multiplication
By the way, if you want round-to-nearest, and your CPU can do fast multiplies, the following is accurate for all uint16_t dividend values -- actually [0..(2^16)+126].
y = ((x+128)*257)>>16 // divide by 255 with round-to-nearest for x in [0..65662]
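A brute-force check (my addition) of the earlier claim that the truncating formula matches x/255 for all x in [0..65534]:

#include <assert.h>
#include <stdint.h>

int main(void)
{
    /* (x + 1 + (x >> 8)) >> 8 equals x / 255 over the stated range */
    for (uint32_t x = 0; x <= 65534; x++)
        assert(((x + 1 + (x >> 8)) >> 8) == x / 255);
    return 0;
}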
Looks like it is meant to check whether blue (or red or green) is fully used. It evaluates to 1 when b is 255, and to 0 for all lower values.
A common use case of when you'd want to use a formula that's more accurate than 257/256 is when you have to combine a lot of alpha values together for each pixel. As one example, when doing image shrinking, you need to combine 4 alphas for each source pixel contributing to the destination, and then combine all the source pixels contributing to the destination.
I posted an infinitely accurate bit-twiddling version of /255 but it was rejected without reason. So I'll add that I implement alpha-blending hardware for a living, write real-time graphics code and game engines for a living, and have published articles on this topic in conferences like MICRO, so I really know what I'm talking about. And it might be useful, or at least entertaining, for people to understand the more accurate formula that is EXACTLY 1/255:
Version 1: x = (x + (x >> 8)) >> 8
- No constant added; won't satisfy (x * 255) / 255 == x, but will look fine in most cases.
Version 2: x = (x + (x >> 8) + 1) >> 8
- WILL satisfy (x * 255) / 255 == x for integers, but won't hit correct integer values for all alphas.
Version 3 (simple integer rounding): x = (x + (x >> 8) + 128) >> 8
- Won't hit correct integer values for all alphas, but will on average be closer than Version 2 at the same cost.
Version 4: infinitely accurate version, to any level of precision desired, for any number of composited alphas (useful for image resizing, rotation, etc.):
x = ((x + (x >> 8)) >> 8) + (((x & 255) + (x >> 8)) >> 8)
Why is version 4 infinitely accurate?
Because 1/255 = 1/256 + 1/65536 + 1/256^3 + 1/256^4 + ...
The simplest expression above (version 1) doesn't handle rounding, but it also doesn't handle the carries that occur from this infinite number of identical sum columns. The new term added above determines the carry out (0 or 1) from this infinite number of base 256 digits. By adding it, you are getting the same result as if you added all the infinite addends. At which point you can round by adding a half bit to whatever accuracy point you want.
Not needed for the OP perhaps, but people should know that you don't need to approximate at all. The formula above is actually more accurate than double precision floating point.
As for speed: In hardware, this method is faster than even a single (full width) add. In software, you have to consider throughput vs latency. In latency, it may still be faster than a narrow multiply (definitely faster than a full width multiply), but in the OP context, you can unroll many pixels at once, and since modern multiply units are pipelined, you are still OK. In translation to Java, you probably have no narrow multiplies, so this could still be faster, but need to check.
WRT the one person who said "why not use the built-in OS capabilities for alpha blitting?": if you already have a substantial graphical code base in that OS, this might be a fine option. If not, you're looking at hundreds to thousands of times as many lines of code to leverage the OS version, code that's far harder to write and debug than this. And in the end the OS code isn't portable at all, while this code can be used anywhere.
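For anyone who wants to check the accuracy claims above, a small brute-force harness (my addition) compares the versions against exact integer division over the 16-bit input range:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t bad1 = 0, bad2 = 0, bad4 = 0;
    for (uint32_t x = 0; x < 65536; x++) {
        uint32_t exact = x / 255;
        uint32_t v1 = (x + (x >> 8)) >> 8;
        uint32_t v2 = (x + (x >> 8) + 1) >> 8;
        uint32_t v4 = ((x + (x >> 8)) >> 8)
                    + (((x & 255) + (x >> 8)) >> 8);
        bad1 += (v1 != exact);      /* count disagreements per version */
        bad2 += (v2 != exact);
        bad4 += (v4 != exact);
    }
    printf("mismatches vs x/255: v1=%u v2=%u v4=%u\n",
           (unsigned)bad1, (unsigned)bad2, (unsigned)bad4);
    return 0;
}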
I suspect that it is trying to do the following:
boolean isBFullyOn = false;
if (b == 0xff) {
    isBFullyOn = true;
}
Back in the days of slow processors, smart bit-shifting tricks like the above could be faster than the obvious if-then-else logic: they avoid a jump statement, which was costly.
It probably also set an overflow flag in the processor which was used for some later logic. This is all highly dependent upon the target processor.
And also speculative on my part!
It is the value of b+1 + b/256, with that whole calculation divided by 256.
Written with bit shifts, the compiler translates it into CPU-level shift instructions instead of using FPU or library division functions.
b = (b + (b >> 8)) >> 8; is basically b = b * 257 / 256.
I would consider the +1 an ugly hack to compensate for the -0.5 average bias caused by the truncating inner >> 8.
I would write it as b = (b + 128 + ((b + 128) >> 8)) >> 8; instead.
Running this test code:
public void test() {
    Set<Integer> results = new HashSet<Integer>();
    // short int ranges between -32767 and 32767
    for (int i = -32767; i <= 32767; i++) {
        int b = (i + 1 + (i >> 8)) >> 8;
        if (!results.contains(b)) {
            System.out.println(i + " -> " + b);
            results.add(b);
        }
    }
}
Produces all possible values between -129 and 128. However, if you are working with 8-bit colours (0-255) then the only possible outputs are 0 (for 0-254) and 1 (for 255), so it is likely attempting the function @kaykay posted.

optimizing a line of C code for 8 bit processor

I'm working on an 8-bit processor and have written code in a C compiler. More than 140 lines of code take just 1200 bytes, yet this single line takes more than 200 bytes of ROM space. eeprom_read() is a function; the problem seems to be the multiplications by 1000, 100, and 10.
romAddr = eeprom_read(146)*1000 + eeprom_read(147)*100 +
eeprom_read(148)*10 + eeprom_read(149);
The processor is 8-bit and the data type of romAddr is int. Is there any way to write this line in a more optimized way?
It's possible that the thing that uses the most space is the use of multiplication. If your processor lacks an instruction to do multiplication, the compiler is forced to use software to do it step by step, which can require quite a bit of code.
It's hard to say, since you don't specify anything about your target processor (or which compiler you're using).
One way might be to somehow try to reduce inlining, so the code to multiply by 10 (which is used in all four terms) can be re-used.
To know if this is the case at all, the machine code must be inspected. By the way, the use of decimal constants for an address calculation is really odd.
Sometimes the multiplication can be compiled into a sequence of additions, yes. You can optimize it, say, by using the left-shift operator:
A*1000 = A*512 + A*256 + A*128 + A*64 + A*32 + A*8
Or the same thing (the parentheses are required, since + binds tighter than << in C):
(A<<9) + (A<<8) + (A<<7) + (A<<6) + (A<<5) + (A<<3)
This is still way longer than a single "multiply" instruction, but your processor apparently doesn't have one anyway, so this might be the next best thing.
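Wrapped in a function so the compiler emits the sequence only once, a sketch (my code, assuming the stored values are decimal digits 0..9, as the weights suggest, so the result stays well within 16 bits):

#include <stdint.h>

/* Multiply by 1000 using shifts and adds: 1000 = 512+256+128+64+32+8. */
static uint16_t mul1000(uint16_t a)
{
    return (uint16_t)((a << 9) + (a << 8) + (a << 7)
                    + (a << 6) + (a << 5) + (a << 3));
}

A corresponding mul100 and mul10 could be written the same way, or the Horner-style rewrite further below avoids the large constants entirely.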
You're concerned about space, not time, right?
You've got four function calls, with an integer argument being passed to each one, followed by a multiplication by a constant, followed by adding.
Just as a first guess, that could be
load integer constant into register (6 bytes)
push register (2 bytes)
call eeprom_read (6 bytes)
adjust stack (4 bytes)
load integer multiplier into register (6 bytes)
push both registers (4 bytes)
call multiplication routine (6 bytes)
adjust stack (4 bytes)
load temporary sum into a register (6 bytes)
add to that register the result of the multiplication (2 bytes)
store back in the temporary sum (6 bytes).
Let's see, 6+2+6+4+6+4+6+4+6+2+6= about 52 bytes per call to eeprom_read.
The last call would be shorter because it doesn't do the multiply.
I would try calling eeprom_read not with arguments like 146 but with (unsigned char)146, and multiplying not by 1000 but by (unsigned short)1000.
That way, you might be able to tease the compiler into using shorter instructions, and possibly using a multiply instruction rather than a multiply function call.
Also, the call to eeprom_read might be macro'ed into a direct memory fetch, saving the pushing of the argument, the calling of the function, and the stack adjustment.
Another trick could be to store each one of the four products in a local variable, and add them all together at the end. That could generate less code.
All these possibilities would also make it faster, as well as smaller, though you probably don't need to care about that.
Another possibility for saving space could be to use a loop, like this:
static unsigned short powerOf10[] = {1000, 100, 10, 1};
unsigned short i;
romAddr = 0;
for (i = 146; i < 150; i++) {
    romAddr += powerOf10[i - 146] * eeprom_read(i);
}
which should save space by having the call and the multiply only once, plus the looping instructions, rather than four copies.
In any case, get handy with the assembler language that the compiler generates.
It depends very, very much on the compiler, but I would suggest that you at least simplify the multiplication this way:
romAddr = ((eeprom_read(146)*10 + eeprom_read(147))*10 +
eeprom_read(148))*10 + eeprom_read(149);
You could put this in a loop:
uint8_t i = 146;
romAddr = eeprom_read(i);
for (i = 147; i < 150; i++)
    romAddr = romAddr * 10 + eeprom_read(i);
Hopefully the compiler should recognise how much simpler it is to multiply a 16-bit value by ten, compared with separately implementing multiplications by 1000 and 100.
I'm not completely comfortable relying on the compiler to deal with the loop effectively, though.
Maybe:
uint8_t hi, lo;
hi = (uint8_t)eeprom_read(146) * (uint8_t)10 + (uint8_t)eeprom_read(147);
lo = (uint8_t)eeprom_read(148) * (uint8_t)10 + (uint8_t)eeprom_read(149);
romAddr = hi * (uint8_t)100 + lo;
All of these are untested.

Guaranteeing enough storage space for 4*ceil(n/3), where n is an int

Let's say n is an integer (an int variable in C). I need enough space for “4 times the ceiling of n divided by 3” bytes. How do I guarantee enough space for this?
Do you think malloc(4*(int)ceil(n/3.0)) will do, or do I have to add, say, 1 in order to be absolutely safe (due to possible rounding errors)?
You can achieve the same thing with pure integer arithmetic, which guarantees that you allocate the correct amount of memory:
malloc(4*((n+2)/3))
An alternative to Kerrek SB's general formula, one which guarantees that only one division is used, is to calculate
(n+m-1)/m
To see that it produces the same, write n = k*m + r with 0 <= r < m. Then n%m == r, and if r == 0, we have n+m-1 = k*m + (m-1) and (n+m-1)/m == k, otherwise n+m-1 = (k+1)*m + (r-1) and (n+m-1)/m == k+1.
Most modern hardware gives you the quotient (n/m) in one register and the remainder (n%m) in another when you do an integer division, so you can get both parts of Kerrek's formula in one division, and most compilers will do so. If the compiler doesn't, but uses two divisions, the calculation will be considerably slower, so if the computation is done often and performance is an issue, you can work around the compiler's weakness with somewhat less obvious code.
In the given case, the malloc would be
malloc(4*((n+2)/3));
But since it's not obvious to everyone what that formula does, if you use it, explain it in a comment, and if you don't need to use it, use the more obvious code.
To compute the ceiling of n / m integrally, just say:
n / m + (n % m == 0 ? 0 : 1)
All in all, say malloc(4 * (n / 3 + (n % 3 ? 1 : 0)));.
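A small demonstration (my code) that the two ceiling formulas agree, using n = 10 as an example:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 10, m = 3;
    int a = n / m + (n % m == 0 ? 0 : 1);   /* division plus remainder test */
    int b = (n + m - 1) / m;                /* single-division form         */
    printf("%d %d\n", a, b);                /* prints: 4 4                  */

    char *p = malloc(4 * ((n + 2) / 3));    /* 16 bytes for n = 10          */
    free(p);
    return 0;
}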
While Kerrek SB has a precise answer, in practice most engineers would use malloc(4 + 4 * n / 3) or (equivalently) malloc(4 * (1 + n / 3)). The rules of C evaluate n/3 as an integer, truncating the remainder away; adding a little more to the expression ensures that any fraction lost to the division is still allocated.
At most, this wastes three bytes. Only if there were thousands of these would any extra computation to account for that be justified, if at all. Implementations of malloc often round allocations up to multiples of 4, 8, or 16 bytes to simplify their housekeeping anyway.
Consider the cost of 3 bytes of memory: at current pricing of $5 to $15 per gigabyte, three bytes cost roughly $0.000 000 015 to $0.000 000 045.

C quick calculation of next multiple of 4?

What's a fast way to round up an unsigned int to a multiple of 4?
A multiple of 4 has the two least significant bits 0, right? So I could mask them out and then do a switch statement, adding either 1,2 or 3 to the given uint.
That's not a very elegant solution, though.
There's also the arithmetic roundup:
myint == 0 ? 0 : ((myint+3)/4)*4
Probably there's a better way including some bit operations?
(myint + 3) & ~0x03
The addition of 3 shifts the value so that the next multiple of 4 becomes the previous multiple of 4, which the mask then produces by rounding down; masking works as the modulo operation here because the divisor is a power of 2.
I assume that what you are trying to achieve is alignment of the input number, i.e. if the original number is already a multiple of 4, then it doesn't need to be changed. However, this is not clear from your question. Maybe you want the next multiple even when the original number is already a multiple? Please clarify.
In order to align an arbitrary non-negative number i on an arbitrary boundary n you just need to do
i = i / n * n;
But this will align it towards negative infinity. In order to align it towards positive infinity, add n - 1 before performing the alignment:
i = (i + n - 1) / n * n;
This is already good enough for all intents and purposes. In your case it would be
i = (i + 3) / 4 * 4;
However, if you would prefer to squeeze a few CPU clocks out of this, you can use the fact that i / 4 * 4 can be replaced with the bit-twiddling i & ~0x3, giving you
i = (i + 3) & ~0x3;
although it wouldn't surprise me if modern compilers could figure out the latter by themselves.
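A quick check (my code) that the division form and the mask form agree:

#include <stdio.h>

int main(void)
{
    unsigned vals[] = {5, 7, 8};
    for (int k = 0; k < 3; k++) {
        unsigned i = vals[k];
        printf("%u -> %u %u\n", i, (i + 3) / 4 * 4, (i + 3) & ~0x3u);
    }
    return 0;   /* prints: 5 -> 8 8, 7 -> 8 8, 8 -> 8 8 */
}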
If by "next multiple of 4" you mean the smallest multiple of 4 that is larger than your unsigned int value myint, then this will work:
(myint | 0x03) + 1;
(myint + 4) & ~0x3 (equivalent to a 0xFFFC mask, but correct for unsigned ints wider than 16 bits)
If you want the next multiple of 4 strictly greater than myint, this solution will do (similar to previous posts):
(myint + 4) & ~3u
If you instead want to round up to the nearest multiple of 4 (leaving myint unchanged if it is a multiple of 4), this should work:
((myint & 0x3) == 0) ? myint : ((myint + 4) & ~3u);
myint = (myint + 4) & 0xfffffffc
This is assuming that by "next multiple of 4" that you are always moving upwards; i.e. 5 -> 8 and 4 -> 8.
This is branch-free, generally configurable, easy to understand (if you know about C byte strings), and it lets you avoid thinking about the bit size of myInt:
myInt += "\x00\x03\x02\x01"[myInt & 0x3];
The only downside is a possible memory access to somewhere other than the stack (the static string storage).
