Optimizing a line of C code for an 8-bit processor

I'm working on an 8-bit processor and have written the code in C. More than 140 lines of code compile to just 1200 bytes, yet this single line takes more than 200 bytes of ROM space. eeprom_read() is a function; the problem seems to be the multiplications by 1000, 100, and 10.
romAddr = eeprom_read(146) * 1000 + eeprom_read(147) * 100 +
          eeprom_read(148) * 10 + eeprom_read(149);
The processor is 8-bit and the data type of romAddr is int. Is there any way to write this line in a more optimized way?

It's possible that the thing using the most space is the multiplication. If your processor lacks a hardware multiply instruction, the compiler is forced to do it in software, step by step, which can require quite a bit of code.
It's hard to say, since you don't specify anything about your target processor (or which compiler you're using).
One way might be to reduce inlining, so that the code to multiply by 10 (which is needed in all four terms) can be reused.
To know if this is the case at all, the machine code must be inspected. By the way, the use of decimal constants for an address calculation is really odd.
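For illustration, a minimal sketch of that idea (untested; the noinline attribute is GCC syntax, and the helper name is invented):

#include <stdint.h>

/* Keep the multiply-by-10 out of line so all four terms share one copy. */
static uint16_t __attribute__((noinline)) mul10(uint16_t x)
{
    return (uint16_t)((x << 3) + (x << 1));   /* 8x + 2x = 10x */
}

/* Horner form, so multiply-by-10 is the only multiplication needed:
   romAddr = mul10(mul10(mul10(eeprom_read(146)) + eeprom_read(147))
                   + eeprom_read(148)) + eeprom_read(149);            */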

Sometimes the multiplication can be compiled into a sequence of additions, yes. You can optimize it, say, by using the left-shift operator.
A*1000 = A*512 + A*256 + A*128 + A*64 + A*32 + A*8
Or the same thing in C (parenthesized, because << binds more loosely than +):
(A<<9) + (A<<8) + (A<<7) + (A<<6) + (A<<5) + (A<<3)
This is still much longer than a single "multiply" instruction, but your processor apparently doesn't have one anyway, so this might be the next best thing.
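As a C sketch (untested): note that the value must be widened to 16 bits before shifting, since the digit starts out as an 8-bit value and the result does not fit in 8 bits.

#include <stdint.h>

static uint16_t mul1000(uint8_t a)
{
    uint16_t t = a;   /* widen before shifting */
    return (uint16_t)((t << 9) + (t << 8) + (t << 7)
                    + (t << 6) + (t << 5) + (t << 3));   /* 512+256+128+64+32+8 = 1000 */
}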

You're concerned about space, not time, right?
You've got four function calls, with an integer argument being passed to each one, followed by a multiplication by a constant, followed by adding.
Just as a first guess, that could be
load integer constant into register (6 bytes)
push register (2 bytes)
call eeprom_read (6 bytes)
adjust stack (4 bytes)
load integer multiplier into register (6 bytes)
push both registers (4 bytes)
call multiplication routine (6 bytes)
adjust stack (4 bytes)
load temporary sum into a register (6 bytes)
add to that register the result of the multiplication (2 bytes)
store back in the temporary sum (6 bytes)
Let's see: 6+2+6+4+6+4+6+4+6+2+6 = 52 bytes per call to eeprom_read.
The last call would be shorter because it doesn't do the multiply.
I would try calling eeprom_read not with arguments like 146 but with (unsigned char)146, and multiplying not by 1000 but by (unsigned short)1000.
That way, you might be able to tease the compiler into using shorter instructions, and possibly using a multiply instruction rather than a multiply function call.
Also, the call to eeprom_read might be macro'ed into a direct memory fetch, saving the pushing of the argument, the calling of the function, and the stack adjustment.
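For example, a hypothetical macro along these lines, assuming the EEPROM is memory-mapped (EEPROM_BASE is an invented name; check your part's headers for the real one):

#define EEPROM_READ(addr) (*(volatile const unsigned char *)(EEPROM_BASE + (addr)))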
Another trick could be to store each one of the four products in a local variable, and add them all together at the end. That could generate less code.
All these possibilities would also make it faster, as well as smaller, though you probably don't need to care about that.
Another possibility for saving space could be to use a loop, like this:
static unsigned short powerOf10[] = {1000, 100, 10, 1};
unsigned short i;

romAddr = 0;
for (i = 146; i < 150; i++) {
    romAddr += powerOf10[i - 146] * eeprom_read(i);
}
which should save space by having the call and the multiply only once, plus the looping instructions, rather than four copies.
In any case, get comfortable with the assembly language that the compiler generates.

It depends very, very much on the compiler, but I would suggest that you at least simplify the multiplication this way:
romAddr = ((eeprom_read(146) * 10 + eeprom_read(147)) * 10 +
            eeprom_read(148)) * 10 + eeprom_read(149);
You could put this in a loop:
uint8_t i = 146;
romAddr = eeprom_read(i);
for (i = 147; i < 150; i++)
    romAddr = romAddr * 10 + eeprom_read(i);
Hopefully the compiler should recognise how much simpler it is to multiply a 16-bit value by ten, compared with separately implementing multiplications by 1000 and 100.
I'm not completely comfortable relying on the compiler to deal with the loop effectively, though.
Maybe:
uint8_t hi, lo;
hi = (uint8_t)eeprom_read(146) * (uint8_t)10 + (uint8_t)eeprom_read(147);
lo = (uint8_t)eeprom_read(148) * (uint8_t)10 + (uint8_t)eeprom_read(149);
romAddr = hi * (uint8_t)100 + lo;
All of these are untested.

Related

What is the advantage of this sizing code in C?

Apologies for the generic question title, I wasn't sure how to phrase it properly (suggestions welcome!)
I'm trying to get my head around some of the code for the CommonMark parser and came across this:
/* Oversize the buffer by 50% to guarantee amortized linear time
* complexity on append operations. */
bufsize_t new_size = target_size + target_size / 2;
new_size += 1;
new_size = (new_size + 7) & ~7;
So given a number, e.g. 32, it will add 32 / 2 [giving 48], add 1 [49], add 7 [56], finally ANDing that with ~7 (i.e. -8) [56].
Is this a common pattern? Specifically the adding of a number and then ANDing with its complement.
Is anyone able to provide any insight into what this is doing and what advantages, if any, exist?
The (+7) & ~7 part rounds the number up to the next multiple of 8. It only works when the multiple is a power of 2 (7 is 2^3 - 1). If you want to round up to a multiple of 32, use 31 instead of 7.
The reason to round the size to a multiple of 8 is probably specific to the algorithm.
It is also possible that the author of the code knows how the memory allocator works. If the allocator uses internally blocks of memory of multiple of 8 bytes, an allocation request of any number of bytes between 1 and 8 uses an entire block. By asking for a block having a size that is multiple of 8 one gets several extra bytes for the same price.
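The idiom generalizes to any power-of-two multiple; a sketch:

#include <stddef.h>

/* Round x up to the next multiple of n; n must be a power of two.
   ROUND_UP_POW2(49, 8) == 56, matching the worked example above. */
#define ROUND_UP_POW2(x, n) (((x) + ((n) - 1)) & ~((size_t)(n) - 1))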

Is a binary operation faster than memmove?

I'm writing a digital filter, and I need to keep the last X values and sum them all together.
Now there are two possible approaches to this. Either I shift the whole array using memmove to make room for the next value, keeping the indexes into the array hard-coded in my summing algorithm:
memmove(&Fifo[0], &Fifo[1], 12 * 4); // Shift array to the left
Result += Factor[1] * (Fifo[5] + Fifo[7]);
Result += Factor[2] * (Fifo[4] + Fifo[8]);
Result += Factor[3] * (Fifo[3] + Fifo[9]);
Result += Factor[4] * (Fifo[2] + Fifo[10]);
Result += Factor[5] * (Fifo[1] + Fifo[11]);
Result += Factor[6] * (Fifo[0] + Fifo[12]);
Or alternatively, I don't copy any memory, but increment a counter instead, and calculate each index from that using a modulo operation (like a circular buffer).
i++; // Increment the index
Result += Factor[1] * (Fifo[(i + 5) % 13] + Fifo[(i + 7) % 13]);
Result += Factor[2] * (Fifo[(i + 4) % 13] + Fifo[(i + 8) % 13]);
Result += Factor[3] * (Fifo[(i + 3) % 13] + Fifo[(i + 9) % 13]);
Result += Factor[4] * (Fifo[(i + 2) % 13] + Fifo[(i + 10) % 13]);
Result += Factor[5] * (Fifo[(i + 1) % 13] + Fifo[(i + 11) % 13]);
Result += Factor[6] * (Fifo[(i + 0) % 13] + Fifo[(i + 12) % 13]);
Since it's an embedded ARM CPU, I was wondering which would be more efficient. Since I assume the CPU has to move at least one 32-bit value internally to do the modulo operation, could it be that moving the whole array is just as fast as calculating the right indexes?
If you need to know which is faster, you need to benchmark. If you want to know why, you need to examine the assembly.
That said, there is also a halfway solution which could be good enough: use a buffer larger than needed, and only do the memmove when the buffer is full. That way you only have to keep track of a starting offset, and you avoid the problems that come with circular buffers. You do have to use more memory, though.
So if you want to keep 5 elements and use a buffer for 10, you only have to do a memmove every 5 insertions (except on the first pass, when you can do 10 insertions).
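A sketch of that idea, sized to match the question's 13-element filter (the names, sizes, and layout here are illustrative and untested):

#include <string.h>
#include <stdint.h>

#define NEEDED   13            /* taps actually used */
#define CAPACITY (2 * NEEDED)  /* oversized: one memmove per 13 pushes */

static int32_t Fifo[CAPACITY];
static int start, count;

static void fifo_push(int32_t sample)
{
    if (start + count == CAPACITY) {   /* buffer full: compact once */
        memmove(Fifo, Fifo + start, (size_t)count * sizeof Fifo[0]);
        start = 0;
    }
    Fifo[start + count] = sample;
    if (count < NEEDED)
        count++;                       /* still filling the window */
    else
        start++;                       /* slide the window forward */
}

/* The newest NEEDED samples are always Fifo[start] .. Fifo[start + NEEDED - 1]. */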
I've done exactly that on a Cortex M0 (LPC11C14) for a FIR filter of size 15 (Savitzky-Golay for measuring line voltage).
I found that in my case copying was somewhat slower than using a circular buffer of size 16 and computing the indices using the modulo operator. Note that 16 is a power of two, which makes division very cheap.
I tried several variants and used a port pin for measuring execution time, I recommend you do the same.
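For reference, the power-of-two trick looks like this (a sketch; % 16 compiles down to the same single AND as the explicit mask):

#include <stdint.h>

#define RING_SIZE 16u                 /* power of two: modulo becomes a mask */

static int32_t Ring[RING_SIZE];
static uint32_t head;

static void ring_push(int32_t sample)
{
    Ring[head & (RING_SIZE - 1u)] = sample;   /* same as head % RING_SIZE */
    head++;
}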
Assuming 32-bit values, modulo on ARM can be executed in 2 assembly instructions, but so can moving a value in memory (1 instruction to load it into a register, 1 to store it back). So there's no definitive answer here; it will depend on the code around it.
My gut feeling says you should go for the circular buffer approach.
There is a third way which requires neither memmove nor modulo, involving two switch blocks. I'm too lazy to type it up, but the idea is that you calculate the offset, use the first switch to compute one 'half' of the buffer, then recalculate the offset and use the second switch to compute the other half. You basically enter the second switch where the first one 'left'. Note that in one switch block the instruction order would have to be reversed.
My intuition says that the memmove may cause all sorts of memory conflicts and prevent internal bypasses, since you load from and store to the same area, perhaps even the same cache lines. Some processors would simply give up on optimizing this and defer all the memory operations, effectively serializing them (an embedded CPU may be simple enough to do this anyway, but I'm talking about the general case: on x86 or even a Cortex-A15 you may see a bigger penalty).

SIMD store delay

I have the following type of code
short v[8] __attribute__((aligned(16)));
...
// in an inlined function:
_mm_store_si128((__m128i *)v, some_m128i_value);
... // some more operations (4 additions)
outp[0] = v[1] / 2; // <- first access of v since the previous store
When I annotate this code with perf, this single line accounts for 18% of the whole sampling! And when I say line, I mean at the assembly level: the instruction immediately after the load from v accounts for the 18%.
Is it a cache miss? How can I test that?
I don't really need to store the result, but how can I avoid a round trip through memory and still individually access the eight shorts composing my __m128i value?
Update:
If I use _mm_extract_epi16, the overall performance is no better, but the waiting is divided equally between each access instead of hitting just the first one.
Instead of doing a SIMD store followed by scalar loads you should be using _mm_extract_epi16 (PEXTRW) to get 16 bit scalar values directly from your 128 bit SSE register without going via memory, e.g.
outp[0] = _mm_extract_epi16(some_m128i_value, 6);
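One caveat: _mm_extract_epi16 zero-extends the 16-bit lane into an int, so for signed short data, cast the result back before doing arithmetic on it. Matching the original outp[0] = v[1] / 2:

outp[0] = (short)_mm_extract_epi16(some_m128i_value, 1) / 2;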

Is there a standard, strided version of memcpy?

I have a column vector A which is 10 elements long. I have a matrix B which is 10 by 10. The memory storage for B is column major. I would like to overwrite the first row in B with the column vector A.
Clearly, I can do:
for (int i = 0; i < 10; i++)
{
    B[0 + 10 * i] = A[i];
}
where I've left the zero in 0 + 10 * i to highlight that B uses column-major storage (zero is the row-index).
After some shenanigans in CUDA-land tonight, it occurred to me that there might be a CPU function to perform a strided memcpy. I guess at a low level, performance would depend on the existence of a strided load/store instruction, which I don't recall x86 assembly having.
Short answer: The code you have written is as fast as it's going to get.
Long answer: The memcpy function is written using some complicated intrinsics or assembly because it operates on memory operands that have arbitrary size and alignment. If you are overwriting a column of a matrix, then your operands will have natural alignment, and you won't need to resort to the same tricks to get decent speed.
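There is no standard strided memcpy, but if you want to hide the loop, a trivial helper is all it takes (illustrative only; the double element type is an assumption):

#include <stddef.h>

/* Copy n elements from a contiguous source into a destination with a
   fixed element stride; equivalent to the loop in the question. */
static void copy_strided(double *dst, size_t dst_stride,
                         const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i * dst_stride] = src[i];
}

/* copy_strided(&B[0], 10, A, 10);  -- overwrites row 0 of column-major B */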

32x32 Multiply and add optimization

I'm working on optimizing an application. I found that I need to optimize an inner loop for improved performance.
rgiFilter is a 16-bit array.
for (i = 0; i < iLen; i++) {
    iPredErr = (I32)*rgiResidue;
    rgiFilter = rgiFilterBuf;
    rgiPrevVal = rgiPrevValRdBuf + iRecent;
    rgiUpdate = rgiUpdateRdBuf + iRecent;
    iPred = iScalingOffset;
    for (j = 0; j < iOrder_Div_8; j++) {
        iPred += (I32) rgiFilter[0] * rgiPrevVal[0];
        rgiFilter[0] += rgiUpdate[0];
        iPred += (I32) rgiFilter[1] * rgiPrevVal[1];
        rgiFilter[1] += rgiUpdate[1];
        iPred += (I32) rgiFilter[2] * rgiPrevVal[2];
        rgiFilter[2] += rgiUpdate[2];
        iPred += (I32) rgiFilter[3] * rgiPrevVal[3];
        rgiFilter[3] += rgiUpdate[3];
        iPred += (I32) rgiFilter[4] * rgiPrevVal[4];
        rgiFilter[4] += rgiUpdate[4];
        iPred += (I32) rgiFilter[5] * rgiPrevVal[5];
        rgiFilter[5] += rgiUpdate[5];
        iPred += (I32) rgiFilter[6] * rgiPrevVal[6];
        rgiFilter[6] += rgiUpdate[6];
        iPred += (I32) rgiFilter[7] * rgiPrevVal[7];
        rgiFilter[7] += rgiUpdate[7];
        rgiFilter += 8;
        rgiPrevVal += 8;
        rgiUpdate += 8;
    }
}
Your only bet is to do more than one operation at a time, and that means one of these three options:
SSE instructions (SIMD). You process multiple memory locations with a single instruction.
Multi-threading (MIMD). This works best if you have more than one CPU core. Split your array into multiple, similarly sized strips that are independent of each other (dependencies increase this option's complexity a lot, to the point of being slower than calculating everything sequentially if you need a lot of locks). Note that the array has to be big enough to offset the extra context-switching and synchronization overhead (it's pretty small, but not negligible). Best for four cores or more.
Both at once. If your array is really big, you could gain a lot by combining both.
If rgiFilterBuf, rgiPrevValRdBuf and rgiUpdateRdBuf are function parameters that don't alias, declare them with the restrict qualifier. This will allow the compiler to optimise more aggressively.
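A sketch of what that might look like (the signature itself is hypothetical; only the restrict qualifiers are the point):

#include <stdint.h>

void predict(int16_t *restrict rgiFilterBuf,          /* written through rgiFilter */
             const int16_t *restrict rgiPrevValRdBuf,
             const int16_t *restrict rgiUpdateRdBuf,
             int iLen, int iOrder_Div_8);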
As some others have commented, your inner loop looks like it may be a good fit for vector processing instructions (like SSE, if you're on x86). Check your compiler's intrinsics.
I don't think you can do much to optimize it in C. Your compiler might have options to generate SIMD code, but you probably need to just go and write your own SIMD assembly code if performance is critical...
You can replace the inner loop with very few SSE2 intrinsics
see _mm_madd_epi16 (PMADDWD) to replace the eight
iPred += (I32) rgiFilter[] * rgiPrevVal[];
and _mm_add_epi16 (PADDW) or _mm_add_epi32 (PADDD) to replace the eight
rgiFilter[] += rgiUpdate[];
You should see a nice acceleration with that alone.
These intrinsics are available in the Microsoft and Intel compilers; GCC supports the same x86 intrinsics through <emmintrin.h>.
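A minimal sketch of one unrolled 8-element step with those intrinsics (untested; assumes 16-byte-aligned int16_t buffers, with the four 32-bit partial sums in acc added into iPred after the loop):

#include <emmintrin.h>
#include <stdint.h>

static __m128i step8(int16_t *rgiFilter, const int16_t *rgiPrevVal,
                     const int16_t *rgiUpdate, __m128i acc)
{
    __m128i f = _mm_load_si128((const __m128i *)rgiFilter);
    __m128i p = _mm_load_si128((const __m128i *)rgiPrevVal);
    __m128i u = _mm_load_si128((const __m128i *)rgiUpdate);

    /* PMADDWD: eight 16x16 products, pairwise-summed into four I32 lanes */
    acc = _mm_add_epi32(acc, _mm_madd_epi16(f, p));

    /* PADDW: rgiFilter[k] += rgiUpdate[k] for all eight lanes at once */
    _mm_store_si128((__m128i *)rgiFilter, _mm_add_epi16(f, u));
    return acc;
}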
EDIT: based on the comments below I would change the following...
If you have mixed types, the compiler is not always smart enough to figure it out. I would suggest the following to make the pattern more obvious and give it a better chance at autovectorizing:
Declare rgiFilter[] as I32 for the purposes of this function. You will pay one copy.
Change iPred to an I32 array iPred[] as well.
Do the iPred[] summing outside the inner (or even the outer) loop.
Pack similar instructions in groups of four:
iPred[0] += rgiFilter[0] * rgiPrevVal[0];
iPred[1] += rgiFilter[1] * rgiPrevVal[1];
iPred[2] += rgiFilter[2] * rgiPrevVal[2];
iPred[3] += rgiFilter[3] * rgiPrevVal[3];
rgiFilter[0] += rgiUpdate[0];
rgiFilter[1] += rgiUpdate[1];
rgiFilter[2] += rgiUpdate[2];
rgiFilter[3] += rgiUpdate[3];
This should be enough for the Intel compiler to figure it out.
Ensure that iPred is held in a register (not read from memory before and not written back to memory after each += operation).
Optimize the memory layout for the level-1 cache. Ensure that the three arrays do not fight for the same cache entries. This depends on the CPU architecture and isn't simple at all.
Loop unrolling and vectorizing should be left to the compiler.
See GCC's auto-vectorization documentation.
Start out by making sure that the data is laid out linearly in memory so that you get no cache misses. This doesn't seem to be an issue here, though.
If you can't use SSE for the operations (and if the compiler fails at it, look at the assembly), try splitting the code into several smaller for-loops (one for each 0..8 group). Compilers tend to do better optimizations on loops that perform fewer operations (except in cases like this, where vectorization/SSE might apply).
16-bit integers are more expensive for a 32/64-bit architecture to use (unless it has dedicated 16-bit registers). Try converting the data to 32 bits before the loop (most 64-bit architectures have 32-bit registers as well, AFAIK).
Pretty good code.
At each step, you're basically doing three things, a multiplication and two additions.
The other suggestions are good. Also, I've sometimes found that I get faster code if I separate those activities into different loops (see the sketch after this list):
one loop to do the multiplications and save the products to a temporary array,
one loop to sum that array into iPred,
one loop to add rgiUpdate to rgiFilter.
With the unrolling, your loop overhead is negligible, but if the number of different things done inside each loop is minimized, the compiler can sometimes make better use of its registers.
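A sketch of that split for one 8-element group (untested; tmp is a scratch array introduced for illustration):

I32 tmp[8];
int k;

for (k = 0; k < 8; k++)
    tmp[k] = (I32)rgiFilter[k] * rgiPrevVal[k];   /* multiplies only */
for (k = 0; k < 8; k++)
    iPred += tmp[k];                              /* summing only */
for (k = 0; k < 8; k++)
    rgiFilter[k] += rgiUpdate[k];                 /* updates only */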
There's lots of optimizations that you can do that involve introducing target specific code. I'll stick mostly with generic stuff, though.
First, if you are going to loop with index limits then you should usually try to loop downward.
Change:
for (i = 0; i < iLen; i++) {
to
for (i = iLen - 1; i >= 0; i--) {
This can take advantage of the fact that many common processors essentially do a comparison with 0 for the results of any math operation, so you don't have to do an explicit comparison.
This only works, though, if going backwards through the loop has the same results and if the index is signed (though you can sneak around that).
Alternately you could try limiting by pointer math. This might eliminate the need for an explicit index (counter) variable, which could speed things up, especially if registers are in short supply.
for (p = rgiFilter; p < rgiFilter + 8; ) {
    iPred += (I32)*p * *rgiPrevVal++;
    *p++ += *rgiUpdate++;
    ....
}
This also gets rid of the odd updating at the end of your inner loop, which could confuse the compiler and make it produce worse code. You may also find that the loop unrolling you did produces results no better than if you had only two statements in the body of the inner loop; the compiler is likely able to make good decisions about how this loop should be rolled/unrolled. Or you might just want to make sure that the loop is unrolled twice, since rgiFilter is an array of 16-bit values, and see if the compiler can take advantage of accessing it just twice to accomplish two reads and two writes with one 32-bit load and one 32-bit store.
for (p = rgiFilter; p < rgiFilter + 8; ) {
    I16 x = *p;
    I16 y = *(p + 1); // Hope that the compiler can combine these loads
    iPred += (I32)x * *rgiPrevVal++;
    iPred += (I32)y * *rgiPrevVal++;
    *p++ += *rgiUpdate++;
    *p++ += *rgiUpdate++; // Hope that the compiler can combine these stores
    ....
}
If your compiler and/or target processor supports it you can also try issuing prefetch instructions. For instance gcc has:
__builtin_prefetch (const void * addr)
__builtin_prefetch (const void * addr, int rw)
__builtin_prefetch (const void * addr, int rw, int locality)
These can be used to tell the compiler that, if the target has prefetch instructions, it should use them to try to get addr into the cache ahead of time. Optimally these should be issued once per cache-line step per array you're working on. The rw argument tells the compiler whether you want to read from or write to the address. The locality argument has to do with whether the data needs to stay in cache after you access it. The compiler just tries to generate the right instructions for this as best it can figure out; if it can't do what you ask for on a certain target, it simply does nothing, and nothing is hurt.
Also, since the __builtin_ functions are special, the normal rules about variable numbers of arguments don't really apply: this is a hint to the compiler, not a call to a function.
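For instance, inside the inner loop you might hint one cache line ahead on each array (a sketch; the 64-byte line size, i.e. 32 int16_t elements, is an assumption about the target):

__builtin_prefetch(rgiPrevVal + 32, 0, 1);   /* read-only stream */
__builtin_prefetch(rgiUpdate + 32, 0, 1);    /* read-only stream */
__builtin_prefetch(rgiFilter + 32, 1, 1);    /* read and written */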
You should also look into any vector operations your target supports as well as any generic or platform specific functions, builtins, or pragmas that your compiler supports for doing vector operations.
