I have a memory region which is divided into blocks of a predefined size BLOCKSIZE. Given a memory chunk, defined by its offset OFFSET in bytes and its size SIZE in bytes, how do I efficiently calculate the number of blocks containing this memory chunk?
For example, let's say the BLOCKSIZE=8. Then a memory chunk with OFFSET=0 and SIZE=16 will take 2 blocks, but a chunk with OFFSET=4 and SIZE=16 will take 3 blocks.
I can write a formula like this (using integer arithmetic in C):
numberOfBlocks = (OFFSET + SIZE - 1) / BLOCKSIZE - (OFFSET / BLOCKSIZE) + 1;
This calculation takes 2 divisions and 4 additions/subtractions. Can we do better, provided that the BLOCKSIZE is a power of 2 and OFFSET >= 0 and SIZE > 0?
UPDATE: I understand that the division can be replaced by shifting in this case.
Can we do better, provided that the BLOCKSIZE is a power of 2?
I don't think so. Your (corrected) formula is basically (index of the first block after the chunk) - (index of the first block containing any part of the chunk). You could formulate it differently -- say, as the sum of a base number of blocks plus an adjustment for certain layouts that require one extra block -- but that actually increases the number of operations needed by a couple:
numberOfBlocks = (SIZE + BLOCKSIZE - 1) / BLOCKSIZE
        + (((SIZE - 1) % BLOCKSIZE) + (OFFSET % BLOCKSIZE)) / BLOCKSIZE;
I don't see any way around performing (at least) two integer divisions (or equivalent bit shifts), because any approach to the calculation requires computing two block counts. These two computations cannot be combined, because each one requires a separate remainder truncation.
That BLOCKSIZE is a power of two may help you choose more efficient operations, but it does not help reduce the number of operations required. However, you could reduce the number of operations slightly if you could rely on SIZE to be a multiple of BLOCKSIZE. In that case, you could do this:
numberOfBlocks = SIZE / BLOCKSIZE + ((OFFSET % BLOCKSIZE) ? 1 : 0);
Alternatively, if it would be sufficient to compute an upper bound on the number of blocks covered, then you could do this:
numberOfBlocksBound = SIZE / BLOCKSIZE + 2;
or slightly tighter in many cases, but more costly to compute:
numberOfBlocksBound = (SIZE + BLOCKSIZE - 1) / BLOCKSIZE + 1;
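For completeness, here is a sketch of the original formula with the power-of-two divisions written as shifts (blocks_spanned and log2_blocksize are names introduced here; BLOCKSIZE == 1u << log2_blocksize, OFFSET >= 0 and SIZE > 0 are assumed):

/* Sketch only: number of blocks touched by [offset, offset + size). */
static unsigned blocks_spanned(unsigned offset, unsigned size, unsigned log2_blocksize)
{
    return ((offset + size - 1) >> log2_blocksize)   /* last block touched  */
         - (offset >> log2_blocksize)                /* first block touched */
         + 1;
}

/* blocks_spanned(0, 16, 3) == 2 and blocks_spanned(4, 16, 3) == 3,
 * matching the examples in the question. */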
Related
Apologies for the generic question title, I wasn't sure how to phrase it properly (suggestions welcome!)
I'm trying to get my head around some of the code for the CommonMark parser and came across this:
/* Oversize the buffer by 50% to guarantee amortized linear time
* complexity on append operations. */
bufsize_t new_size = target_size + target_size / 2;
new_size += 1;
new_size = (new_size + 7) & ~7;
So given a number, e.g. 32, it will add 32 / 2 [48], add 1 [49], add 7 [56], and finally AND that with ~7, i.e. -8 [56].
Is this a common pattern? Specifically the adding of a number and then ANDing with its complement.
Is anyone able to provide any insight into what this is doing and what advantages, if any, exist?
The (+7) & ~7 part rounds the number up to the next multiple of 8 (leaving it unchanged if it already is one). It only works for powers of 2 (7 is 2^3 - 1). If you want to round to a multiple of 32, use 31 instead of 7.
The reason to round the size to a multiple of 8 is probably specific to the algorithm.
It is also possible that the author of the code knows how the memory allocator works. If the allocator internally uses blocks of memory that are multiples of 8 bytes, an allocation request for any number of bytes between 1 and 8 uses up an entire block. By asking for a size that is a multiple of 8, one gets several extra usable bytes for the same price.
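As a hedged illustration of the pattern itself (round_up_pow2 is a name introduced here, not anything from the cmark source), the idiom generalizes to any power-of-two alignment:

#include <stddef.h>

/* Round n up to the next multiple of align; align must be a power of two. */
static size_t round_up_pow2(size_t n, size_t align)
{
    return (n + (align - 1)) & ~(align - 1);
}

/* round_up_pow2(49, 8) == 56, round_up_pow2(56, 8) == 56,
 * round_up_pow2(49, 32) == 64 */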
Consider two coordinate systems, one for the objects themselves and one for the chunks the objects are contained in. Let's consider a chunk size of 4, meaning that the object at coordinate 0 is in chunk 0, the object at coordinate 3 is also in chunk 0, but the object at coordinate 8 is in chunk 2, and the object at coordinate -4 is in chunk -1.
Calculating the chunk number for an object at a positive position is easy: objectx / chunksize
But I cannot find a formula that calculates the correct chunk position for objects at negative positions:
-4/4 = -1 is correct, but -2/4 = 0 is not the required result; if I subtract 1, then -2/4 - 1 = -1 is correct, but -4/4 - 1 = -2 is now wrong ...
Is there a sweet, short way to calculate each position, or do I need to check 2 conditions:
chunkx = objectx > 0 ?
objectx / chunksize :
objectx % chunksize == 0 ?
objectx / chunksize :
objectx / chunksize - 1;
Alternative:
chunkx = objectx > 0 || objectx % chunksize == 0 ?
objectx / chunksize :
objectx / chunksize -1;
On a side note: calculating the position of an object within the chunk is:
internalx = objectx - chunkx * chunksize
for both positive and negative (-4 -> 0 ; -2 -> 2; 1 -> 1; 4 -> 0)
Is there a more elegant way to calculate this that I am blatantly overlooking here?
If you can afford to convert your numbers to floating point and have a cheap floor function you can use floor(-1.0/4.0) to get -1 as you wish, but the conversion to floating point may be more expensive than the branch.
Another option is to work with positive numbers only by adding a large enough number (multiple of chunk size) to your object coordinate, and subtracting that number divided by the chunk size from your chunk coordinate. This may be cheaper than a branch.
For your second question, if your chunk size happens to be a power of 2 as in your example, you can use a binary AND: (-1 & (chunksize - 1)) == 3
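Here is a minimal sketch of the biasing idea from the first paragraph above (CHUNKSIZE, BIAS and the function names are mine; BIAS must be a multiple of the chunk size and large enough that objectx + BIAS can never go negative):

#define CHUNKSIZE 4
#define BIAS      (1 << 20)     /* multiple of CHUNKSIZE, assumed "large enough" */

int chunk_of(int objectx)
{
    /* shift into positive range, divide, then shift the result back */
    return (objectx + BIAS) / CHUNKSIZE - BIAS / CHUNKSIZE;
}

int pos_in_chunk(int objectx)
{
    return (objectx + BIAS) % CHUNKSIZE;   /* -4 -> 0, -2 -> 2, 1 -> 1, 4 -> 0 */
}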
If your chunk size is a power of two you can do:
chunkx = (objectx & ~(chunksize - 1)) / chunksize;
If the chunk size is also constant, the compiler can probably turn that into a trivial AND and shift (relying on an arithmetic right shift for signed values, which practically every compiler provides). E.g., for a chunk size of 4:
chunkx = (objectx & ~3) >> 2;
For the general case, I don't think you can eliminate the branch, but you can eliminate the slow modulo operation by offsetting the number before division:
chunkx = ((objectx >= 0) ? (objectx) : (objectx - (chunksize - 1))) / chunksize;
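A small sketch to compare this and the mask-based variant against the branching formula from the question (chunk size 4; the output is purely illustrative):

#include <stdio.h>

int main(void)
{
    const int chunksize = 4;
    int objectx;

    for (objectx = -8; objectx <= 8; objectx++) {
        int branching = (objectx > 0 || objectx % chunksize == 0)
                            ? objectx / chunksize
                            : objectx / chunksize - 1;
        int masked    = (objectx & ~(chunksize - 1)) / chunksize;  /* two's complement assumed */
        int offsetted = ((objectx >= 0) ? objectx
                                        : objectx - (chunksize - 1)) / chunksize;
        printf("%3d -> %3d %3d %3d\n", objectx, branching, masked, offsetted);
    }
    return 0;
}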
I'm writing a digital filter, and I need to keep the last X values and sum them all together.
Now there are two possible approaches to this. Either I shift the whole array using memmove to make room for the next value, and have the right indexes to the array as hard-coded values in my summing algorithm.
memmove(&Fifo[0], &Fifo[1], 12 * 4); // Shift array to the left
Result += Factor[1] * (Fifo[5] + Fifo[7]);
Result += Factor[2] * (Fifo[4] + Fifo[8]);
Result += Factor[3] * (Fifo[3] + Fifo[9]);
Result += Factor[4] * (Fifo[2] + Fifo[10]);
Result += Factor[5] * (Fifo[1] + Fifo[11]);
Result += Factor[6] * (Fifo[0] + Fifo[12]);
Or alternatively, I don't copy any memory, but increment a counter instead, and calculate each index from that using a modulo operation (like a circular buffer).
i++; // Increment the index
Result += Factor[1] * (Fifo[(i + 5) % 13] + Fifo[(i + 7) % 13]);
Result += Factor[2] * (Fifo[(i + 4) % 13] + Fifo[(i + 8) % 13]);
Result += Factor[3] * (Fifo[(i + 3) % 13] + Fifo[(i + 9) % 13]);
Result += Factor[4] * (Fifo[(i + 2) % 13] + Fifo[(i + 10) % 13]);
Result += Factor[5] * (Fifo[(i + 1) % 13] + Fifo[(i + 11) % 13]);
Result += Factor[6] * (Fifo[(i + 0) % 13] + Fifo[(i + 12) % 13]);
Since it's an embedded ARM CPU, I was wondering which would be more efficient. Since I assume that the CPU has to move at least one 32-bit value internally to do the modulo operation anyway, could it be that just moving the whole array is just as fast as calculating the right indexes?
If you need to know which is faster, you need to benchmark. If you want to know why, you need to examine the assembly.
That being said, there is a halfway solution which could be good enough: use a buffer larger than needed and only do the memmove when the buffer is full. That way you only have to keep track of a starting offset, and you avoid the problems that come with circular buffers. You do have to use more memory, though.
So if you wish to keep 5 elements and use a buffer of 10 elements, you only have to do the memmove every 5 insertions (except on the first pass, when you can do 10 insertions).
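A minimal sketch of that idea, assuming a 13-tap filter and a buffer twice that size (WINDOW, CAPACITY, start and push_sample are names introduced here, and the buffer is assumed to be primed with WINDOW samples already):

#include <string.h>
#include <stdint.h>

#define WINDOW   13
#define CAPACITY (2 * WINDOW)           /* the bigger the buffer, the rarer the memmove */

static int32_t Fifo[CAPACITY];
static unsigned start;                  /* index of the oldest live sample */

static void push_sample(int32_t x)
{
    if (start + WINDOW < CAPACITY) {
        Fifo[start + WINDOW] = x;       /* room left: append and slide the window */
        start++;
    } else {
        /* end of the buffer reached: compact the newest WINDOW-1 samples to the front */
        memmove(&Fifo[0], &Fifo[start + 1], (WINDOW - 1) * sizeof Fifo[0]);
        Fifo[WINDOW - 1] = x;
        start = 0;
    }
}

/* The summing code then uses fixed offsets from start, e.g. Fifo[start + 5] + Fifo[start + 7]. */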
I've done exactly that on a Cortex-M0 (LPC11C14) for a FIR filter of size 15 (Savitzky-Golay for measuring line voltage).
I found that in my case copying was somewhat slower than using a circular buffer of size 16 and computing the indices with the modulo operator. Note that 16 is a power of two, which makes the modulo very cheap (it reduces to a bitwise AND).
I tried several variants and used a port pin for measuring execution time, I recommend you do the same.
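A sketch of that layout (the buffer size of 16 is from the answer above; Buf, head, put and tap are names I'm introducing), where the modulo degenerates to a bitwise AND:

#include <stdint.h>

#define BUFLEN 16                       /* power of two >= the 15 filter taps */

static int32_t Buf[BUFLEN];
static unsigned head;                   /* index of the newest sample */

static void put(int32_t x)
{
    head = (head + 1) & (BUFLEN - 1);   /* & 15 instead of % 16 */
    Buf[head] = x;
}

static int32_t tap(unsigned k)          /* k-th most recent sample, k < BUFLEN */
{
    return Buf[(head - k) & (BUFLEN - 1)];
}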
Assuming 32-bit values, a modulo on ARM can be executed in 2 assembly instructions, but so can moving a value in memory (1 instruction to load it into a register, 1 to store it back). So there is no definitive answer here; it will depend on the code around it.
My gut feeling says you should go for the circular buffer approach.
There is a third way, which requires neither memmove nor modulo, involving two switch blocks. I'm too lazy to type it up, but the idea is that you calculate the offset, use the first switch to sum one 'half' of the buffer, then recalculate the offset and use the second switch to sum the other half. You basically enter the second switch where the first one 'left off'. Note that in one of the switch blocks the instruction order has to be reversed.
My intuition says that the memmove may cause all sorts of memory conflicts and prevent internal bypasses, since you load from and store to the same area, perhaps even the same cache lines. Some processors would simply give up on optimizing this and defer all the memory operations, effectively serializing them (an embedded CPU may be simple enough to do this anyway, but I'm talking about the general case - on x86 or even a Cortex-A15 you may get a bigger penalty).
I'm working on an 8-bit processor and have written the code in C. More than 140 lines of code take just 1200 bytes, yet this single line takes more than 200 bytes of ROM space. eeprom_read() is a function; the problem seems to be with the multiplications by 1000, 100, and 10.
romAddr = eeprom_read(146)*1000 + eeprom_read(147)*100 +
eeprom_read(148)*10 + eeprom_read(149);
The processor is 8-bit and the data type of romAddr is int. Is there any way to write this line in a more optimized way?
It's possible that the thing that uses the most space is the use of multiplication. If your processor lacks an instruction to do multiplication, the compiler is forced to use software to do it step by step, which can require quite a bit of code.
It's hard to say, since you don't specify anything about your target processor (or which compiler you're using).
One way might be to try to reduce inlining, so that a single multiply-by-10 routine can be re-used for all the terms (1000 and 100 are just repeated multiplications by 10), as sketched after this answer.
To know if this is the case at all, the machine code must be inspected. By the way, the use of decimal constants for an address calculation is really odd.
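One hedged sketch of that idea (mul10 is a name introduced here, not code from this answer; keeping it out of line may require a compiler-specific attribute or pragma):

/* A single shared multiply-by-ten routine; 1000 and 100 are reached by
 * applying it repeatedly (Horner form), so only this one routine is needed. */
static int mul10(int x)
{
    return (x << 3) + (x << 1);         /* 8*x + 2*x == 10*x */
}

romAddr = mul10(mul10(mul10(eeprom_read(146)) + eeprom_read(147))
                + eeprom_read(148))
          + eeprom_read(149);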
Sometimes the multiplication can be compiled into a sequence of additions and shifts, yes. You can optimize it yourself, say, by using the left-shift operator.
A*1000 = A*512 + A*256 + A*128 + A*64 + A*32 + A*8
Or the same thing:
(A<<9) + (A<<8) + (A<<7) + (A<<6) + (A<<5) + (A<<3)
This still is way longer than a single "multiply" instruction, but your processor apparently doesn't have one anyway, so this might be the next best thing.
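Put into compilable form it would look like this (a sketch; note the parentheses, since + binds tighter than << in C):

static unsigned mul1000_shifts(unsigned a)
{
    /* 512 + 256 + 128 + 64 + 32 + 8 == 1000 */
    return (a << 9) + (a << 8) + (a << 7) + (a << 6) + (a << 5) + (a << 3);
}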
You're concerned about space, not time, right?
You've got four function calls, with an integer argument being passed to each one, followed by a multiplication by a constant, followed by adding.
Just as a first guess, that could be
load integer constant into register (6 bytes)
push register (2 bytes)
call eeprom_read (6 bytes)
adjust stack (4 bytes)
load integer multiplier into register (6 bytes)
push both registers (4 bytes),
call multiplication routine (6 bytes)
adjust stack (4 bytes)
load temporary sum into a register (6 bytes)
add to that register the result of the multiplication (2 bytes)
store back in the temporary sum (6 bytes).
Let's see, 6+2+6+4+6+4+6+4+6+2+6= about 52 bytes per call to eeprom_read.
The last call would be shorter because it doesn't do the multiply.
I would try calling eeprom_read not with arguments like 146 but with (unsigned char)146, and multiplying not by 1000 but by (unsigned short)1000.
That way, you might be able to tease the compiler into using shorter instructions, and possibly using a multiply instruction rather than a multiply function call.
Also, the call to eeprom_read might be macro'ed into a direct memory fetch, saving the pushing of the argument, the calling of the function, and the stack adjustment.
Another trick could be to store each one of the four products in a local variable, and add them all together at the end. That could generate less code.
All these possibilities would also make it faster, as well as smaller, though you probably don't need to care about that.
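For illustration, a sketch of the local-variable idea combined with the casts suggested above (untested, and whether it actually helps depends entirely on the compiler):

/* Keep each product in a local, then add them all at the end. */
unsigned short p0 = eeprom_read((unsigned char)146) * (unsigned short)1000;
unsigned short p1 = eeprom_read((unsigned char)147) * (unsigned short)100;
unsigned short p2 = eeprom_read((unsigned char)148) * (unsigned short)10;
unsigned short p3 = eeprom_read((unsigned char)149);

romAddr = p0 + p1 + p2 + p3;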
Another possibility for saving space could be to use a loop, like this:
static unsigned short powerOf10[] = {1000, 100, 10, 1};
unsigned short i;
romAddr = 0;
for (i = 146; i < 150; i++) {
    romAddr += powerOf10[i - 146] * eeprom_read(i);
}
which should save space by having the call and the multiply only once, plus the looping instructions, rather than four copies.
In any case, get handy with the assembler language that the compiler generates.
It depends very, very much on the compiler, but I would suggest that you at least simplify the multiplication this way:
romAddr = ((eeprom_read(146)*10 + eeprom_read(147))*10 +
eeprom_read(148))*10 + eeprom_read(149);
You could put this in a loop:
uint8_t i = 146;
romAddr = eeprom_read(i);
for (i = 147; i < 150; i++)
    romAddr = romAddr * 10 + eeprom_read(i);
Hopefully the compiler should recognise how much simpler it is to multiply a 16-bit value by ten, compared with separately implementing multiplications by 1000 and 100.
I'm not completely comfortable relying on the compiler to deal with the loop effectively, though.
Maybe:
uint8_t hi, lo;
hi = (uint8_t)eeprom_read(146) * (uint8_t)10 + (uint8_t)eeprom_read(147);
lo = (uint8_t)eeprom_read(148) * (uint8_t)10 + (uint8_t)eeprom_read(149);
romAddr = hi * (uint8_t)100 + lo;
All of these are untested.
Let's say n is an integer (an int variable in C). I need enough space for “4 times the ceiling of n divided by 3” bytes. How do I guarantee enough space for this?
Do you think malloc(4*(int)ceil(n/3.0)) will do, or do I have to add, say, 1 in order to be absolutely safe (due to possible rounding errors)?
You can achieve the same thing with pure integer arithmetic, which guarantees that you allocate the correct amount of memory:
malloc(4*((n+2)/3))
An alternative to Kerrek SB's general formula, one which guarantees that only one division is used, is to calculate
(n+m-1)/m
To see that it produces the same, write n = k*m + r with 0 <= r < m. Then n%m == r, and if r == 0, we have n+m-1 = k*m + (m-1) and (n+m-1)/m == k, otherwise n+m-1 = (k+1)*m + (r-1) and (n+m-1)/m == k+1.
Most modern hardware gives you the quotient (n/m) in one register and the remainder (n%m) in another when you do an integer division, so you can get both parts of Kerrek's formula in one division, and most compilers will do so. If the compiler doesn't, but uses two divisions, the calculation will be considerably slower, so if the computation is done often and performance is an issue, you can work around the compiler's weakness with somewhat less obvious code.
In the given case, the malloc would be
malloc(4*((n+2)/3));
But since it's not obvious to everyone what that formula does, if you use it, explain it in a comment, and if you don't need to use it, use the more obvious code.
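Wrapped up as a tiny helper (ceil_div is a name introduced here; n is the question's variable and is assumed non-negative), the intent stays readable while keeping the single division:

#include <stdlib.h>

/* Ceiling of n/m using the single-division formula above; m > 0. */
static size_t ceil_div(size_t n, size_t m)
{
    return (n + m - 1) / m;
}

/* ... */
unsigned char *p = malloc(4 * ceil_div((size_t)n, 3));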
To compute the ceiling of n / m integrally, just say:
n / m + (n % m == 0 ? 0 : 1)
All in all, say malloc(4 * (n / 3 + (n % 3 ? 1 : 0)));.
While Kerrek SB has a precise answer, in practice most engineers would use malloc(4 + 4 * n / 3) or the slightly tighter malloc(4 * (1 + n / 3)). The rules of C evaluate n / 3 as integer division, truncating the remainder away. Adding a little more to the expression ensures that any fraction ignored by the division is still covered by the allocation.
At most, this might waste four bytes. Only if there were thousands of these allocations would any extra computation to account for that be justified, and maybe not even then. Implementations of malloc often round allocations up to multiples of 4, 8, or 16 bytes anyway to simplify their housekeeping.
Consider the cost of four bytes of memory: at current pricing of $5 to $15 per gigabyte, four bytes cost a few hundred-millionths of a dollar.