What's the most efficient way to calculate the amount of padding for 8-bit data that needs to be a multiple of 32-bit in C?
At the moment I do it like this:
pad = (4-size%4)%4;
As long as the optimizing compiler uses bitmasking for the % 4 instead of division, I think your code is probably pretty good. This might be a slight improvement:
// only the last 2 bits (hence & 3) matter
pad = (4 - (size & 3)) & 3;
But again, the optimizing compiler is probably smart enough to be reducing your code to this anyway. I can't think of anything better.
// align n bytes on size boundary
pad n size = (~n + 1) & (size - 1)
This is similar to TypeIA's solution and uses only operations that map directly to machine instructions.
(~n + 1) computes the two's-complement negation of n, the value that gives 0 when added to n.
& (size - 1) keeps only the relevant low bits; this requires size to be a power of two.
examples
pad 13 8 = 3
pad 11 4 = 1
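As a minimal C sketch of the formula above (the name pad_to is hypothetical, and align is assumed to be a nonzero power of two):

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of padding needed to bring n up to the next multiple of align.
 * Assumption: align is a nonzero power of two, so (align - 1) is a bit mask. */
size_t pad_to(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* ~n + 1 is the two's-complement negation of n; this wraps safely
     * because size_t is unsigned. */
    return (~n + 1) & (align - 1);
}
```

With this, pad_to(13, 8) gives 3 and pad_to(11, 4) gives 1, matching the examples.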
pad = (-size)&3;
This should be the fastest.
size 0: pad 0
size 1: pad 3
size 2: pad 2
size 3: pad 1
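The three expressions proposed in this thread should agree for every size. A small sketch to check that, assuming size is an unsigned type so that negation wraps instead of overflowing (all function names here are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

size_t pad_mod(size_t size)  { return (4 - size % 4) % 4; }   /* the question's version */
size_t pad_mask(size_t size) { return (4 - (size & 3)) & 3; } /* masking instead of % */
size_t pad_neg(size_t size)  { return (0 - size) & 3; }       /* same as (-size) & 3 */

/* Asserts that all three forms agree for sizes 0 .. limit-1. */
void check_pad_forms(size_t limit)
{
    for (size_t s = 0; s < limit; s++) {
        assert(pad_mod(s) == pad_mask(s));
        assert(pad_mask(s) == pad_neg(s));
    }
}
```

Which form is fastest depends on the compiler and target; an optimizing compiler may well emit the same instructions for all three.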
I have a header that can be any number of bits, and there is a variable called ByteAlign that's calculated by subtracting the current file position from the file position at the beginning of the file. The point of this variable is to pad the header to the next complete byte. So if the header takes up 57 bits, ByteAlign needs to be 7 bits long to pad the header to 64 bits total, or 8 bytes.
Solutions that don't work:
Variable % 8 - 8, the result is the answer, but negative.
8 % Variable; this is completely inaccurate, and gives answers like 29, which is blatantly wrong, the largest number it should be is 7.
How exactly do I do this?
The number of bytes you need to accommodate n bits is (n + 7) / 8, using integer (truncating) division.
The number of bits in this is 8 * ((n + 7) / 8).
The amount of padding is thus 8 * ((n + 7) / 8) - n.
This should work:
(8 - (Variable & 7)) & 7
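Both answers can be written as small C functions. This is only a sketch with hypothetical names, assuming the bit count is held in an unsigned int. It also shows why Variable % 8 - 8 comes out negative: for 57 bits that expression is 1 - 8 = -7, which is the right magnitude with the wrong sign.

```c
#include <assert.h>
#include <limits.h>

/* Padding bits needed to reach the next whole byte, by rounding up to bytes first. */
unsigned pad_bits_div(unsigned nbits)
{
    assert(nbits <= UINT_MAX - 7);      /* guard the + 7 against wraparound */
    unsigned nbytes = (nbits + 7) / 8;  /* integer division truncates */
    return 8 * nbytes - nbits;
}

/* The same result using masks only. */
unsigned pad_bits_mask(unsigned nbits)
{
    return (8 - (nbits & 7)) & 7;
}
```

For a 57-bit header both functions return 7, and for a header that is already a multiple of 8 bits both return 0.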
I'm trying to perform a matrix transposition of 8 bit arrays, each n bits long (n is around 70,000), into a byte array of n elements.
Context information: The 8 n-bits arrays are RGB data for 8 channels. I need to have one byte representing the nth-bit position of the 8 arrays. This will be running on an ARM Cortex-M3 processor and needs to perform as fast as possible since I'm generating 8 simultaneous signals using the resulting array.
I've come up with a pseudo algorithm (in the link) to do this, but I'm afraid it might be too costly for the processor.
Pseudo Algorithm
I'm looking for the fastest executing code. Size is of secondary importance.
I will appreciate suggestions.
This is what I implemented but the results are not that good.
do{
    for(b=0;b<24;b++){ //Optimize to for(b=24;b!=0;b--)
        m = 1 << b;
        *dataBytes = *dataBytes + __ROR((*s0 & m),32+b-0); //strip 0 data
        *dataBytes = *dataBytes + __ROR((*s1 & m),32+b-1); //strip 1 data
        *dataBytes = *dataBytes + __ROR((*s2 & m),32+b-2); //strip 2 data
        *dataBytes = *dataBytes + __ROR((*s3 & m),32+b-3); //strip 3 data
        *dataBytes = *dataBytes + __ROR((*s4 & m),32+b-4); //strip 4 data
        *dataBytes = *dataBytes + __ROR((*s5 & m),32+b-5); //strip 5 data
        *dataBytes = *dataBytes + __ROR((*s6 & m),32+b-6); //strip 6 data
        *dataBytes = *dataBytes + __ROR((*s7 & m),32+b-7); //strip 7 data
        dataBytes++;
    }
    s0 += 3;
    s1 += 3;
    s2 += 3;
    s3 += 3;
    s4 += 3;
    s5 += 3;
    s6 += 3;
    s7 += 3;
}while(n--);
S0 to 7 are the 8 individual vectors from which the bits are being taken in groups of 24.
N is the number of groups, m is the mask and b is the mask position.
dataBytes is the resulting array.
There are two things that are always present when optimizing,
Memory bandwidth
CPU clocks
Bandwidth
Your current algorithm loads a byte at a time. You can do this more efficiently by loading at least 32 bits at a time, which makes better use of the ARM bus. Most likely the final algorithm will not be bus bound, and if it is, you will already have optimized for that.
For the different ARM CPUs, there are instructions like pld, etc. which can try to optimize bus usage by prefetching the next data elements in advance. This may or may not apply to your Cortex-M. Another technique is to relocate the data to faster memory such as TCM, if possible.
CPU speed
Pixel processing is almost always sped up by SIMD-type instructions. The Cortex-M has instructions labelled SIMD. Don't get hung up on the label SIMD; use the concept. If you have loaded multiple bytes into a word, then you can use a table.
const unsigned long bits[16] = {
    0, 1, 0x100, 0x101,
    0x10000, 0x10001, 0x10100, 0x10101,
    0x1000000, 0x1000001, 0x1000100, 0x1000101,
    0x1010000, 0x1010001, 0x1010100, 0x1010101
};
A similar concept is used in many CRC algorithms on the Internet. Process each nibble (4 bits) and form the next four bytes of output a bit at a time. There is probably a multiplication value which can replace the table, but this depends on the speed of your multiply, which depends on the type of Cortex-M and/or ARM.
Definitely prototype in C and then convert to assembler, or use inline assembler if possible. If you have many mov statements in your algorithm, it is a signal that a compiler could probably allocate the registers better than you. Many sophisticated algorithms use a code generator (scripted in Python, Perl, etc.) which may unroll whatever optimum loop you end up with and also track registers in an algorithmic way.
Note: Double check my table; it is just a first crack and I have not actually coded this particular algorithm. There may be slicker ways to process multiple bits at a time, but the idea is probably fruitful.
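One way the table idea could be applied is sketched below. This is not the answer author's code: the function name, the fixed-width types, and the bit ordering (bit k of each output byte comes from channel k, lowest source bit first) are all assumptions made for illustration. Each nibble is spread so that its four bits land in the least significant bit of four separate bytes, then shifted left by the channel number and ORed into an accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* Entry i has bit (8 * j) set when bit j of i is set. */
static const uint32_t spread[16] = {
    0x00000000u, 0x00000001u, 0x00000100u, 0x00000101u,
    0x00010000u, 0x00010001u, 0x00010100u, 0x00010101u,
    0x01000000u, 0x01000001u, 0x01000100u, 0x01000101u,
    0x01010000u, 0x01010001u, 0x01010100u, 0x01010101u
};

/* Transposes bits first_bit .. first_bit+3 of eight channel words.
 * Afterwards, bit k of out[j] equals bit (first_bit + j) of ch[k]. */
void transpose_nibble(const uint32_t ch[8], unsigned first_bit, uint8_t out[4])
{
    assert(first_bit <= 28);
    uint32_t acc = 0;
    for (unsigned k = 0; k < 8; k++) {
        uint32_t nib = (ch[k] >> first_bit) & 0xFu;
        acc |= spread[nib] << k;    /* k < 8, so bits never cross a byte boundary */
    }
    for (unsigned j = 0; j < 4; j++)
        out[j] = (uint8_t)(acc >> (8 * j));
}
```

This produces four output bytes per eight table lookups, instead of one output byte per eight mask-and-rotate steps. Whether it is actually faster on a given Cortex-M3 would have to be measured.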
I am studying the Redis source code, and in zmalloc.c there is this function:
size_t zmalloc_size(void *ptr) {
    void *realptr = (char*)ptr-PREFIX_SIZE;
    size_t size = *((size_t*)realptr);
    /* Assume at least that all the allocations are padded at sizeof(long) by
     * the underlying allocator. */
    if (size&(sizeof(long)-1)) size += sizeof(long)-(size&(sizeof(long)-1));
    return size+PREFIX_SIZE;
}
I am confused by this line:
if (size&(sizeof(long)-1)) size += sizeof(long)-(size&(sizeof(long)-1));
What is its effect? Memory padding? Then why sizeof(long)?
Yes, it seems to be to include the memory padding with the assumption that all allocations are padded at the sizeof(long) (as said by the comment).
Pseudo-code example:
size = 6 // as an example
sizeof(long) == 4
size & (sizeof(long) - 1) == 6 & (4 - 1) == 6 & 3 == 2
size += 4 - 2
size == 8 // two bytes of padding included
I'm pretty fresh in C though so you should probably not take my word for it. I'm not sure why one can assume that the underlying allocator will align at the size of long, perhaps it's only a decent approximation that is sufficient for zmalloc_size's use-case.
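The rounding step can be isolated into a function to make it easier to experiment with. This is a sketch with hypothetical names; the alignment is passed as a parameter (zmalloc_size uses sizeof(long)) and is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds size up to the next multiple of align, written the way zmalloc_size does it. */
size_t round_up_branch(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);  /* power of two assumed */
    if (size & (align - 1))
        size += align - (size & (align - 1));
    return size;
}

/* Branch-free equivalent: add align - 1, then clear the low bits. */
size_t round_up_mask(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

With align equal to 4, a size of 6 becomes 8 and a size of 8 stays 8, as in the pseudo-code above.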
I am trying to figure out a way to get as much out of the limited memory in my microcontroller (32kb) and am seeking suggestions or pointers to an algorithm that performs what I am attempting to do.
Some background: I am sending Manchester Encoded bits out a SPI (Serial Peripheral Interface) directly from DMA. As the smallest possible unit I can store data into DMA is a byte (8 bits), I am having to represent my 1's as 0b11110000 and my 0's as 0b00001111. This basically means that for every bit of information, I need to use a byte (8 bits) of memory. Which is very inefficient.
If I could reduce this, so that my 1's are represented as 0b10 and my 0's as 0b01, I'd only have to use 1/4 of a byte (2 bits) for every bit of information, which is fine for my solution.
Now, if I could save to DMA in bits, this would not be a problem, but of course I need to work with bytes. So I know the solution to my problem involves collecting the 8 bits (or in my case, 4 2bits) and then storing to DMA as a byte.
Questions:
Is there a standard way to solve this problem?
How can I somehow create an 8-bit number from a collection of four 2-bit numbers? I do not want the sum of these numbers, but the bit pattern they form when placed side by side.
For example: I have the following 4 2 bit numbers (keeping in mind that 0b10 represents 1 and 0b01 represents 0) (Also, the type these are stored in is open to the solution, as obviously there is no such thing as a 2 bit type)
Number1: 0b01 Number 2: 0b10 Number 3: 0b10 Number4: 0b01
And I want to create the following 8 bit number from these:
8 Bit Number: 0b01 10 10 01 or without the spaces 0b01101001 (0x69)
I am programming in C.
It seems that you can pack four numbers a, b, c, d, all of which of value zero or one, like so:
64 * (a + 1) + 16 * (b + 1) + 4 * (c + 1) + (d + 1)
This is using the fact that x + 1 encodes your two-bit integer: 1 becomes 0b10, and 0 becomes 0b01.
It's Manchester encoding so 0b11110000 and 0b00001111 should be the only candidates. If so, then reduce the memory by a factor of 8.
uint8_t PackedByte = 0;
for (i=0; i<8; i++) {
    PackedByte <<= 1;
    if (buf[i] == 0xF0) // 0b11110000
        PackedByte++;
}
On the other hand, if it's Manchester encoding and the encoding may not be perfect, then there are 3 results: 0, 1, indeterminate.
uint8_t PackedByte = 0;
for (i=0; i<8; i++) {
    int upper = BitCount(buf[i] >> 4);
    int lower = BitCount(buf[i] & 0xF);
    PackedByte <<= 1; // make room for the next bit
    if (upper > lower)
        PackedByte++;
    else if (upper == lower)
        Handle_Indeterminate();
}
Various simplifications absent in the above, but shown for logic flow.
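The logic of the two loops above could be combined into one self-contained routine along these lines. It is only a sketch: bit_count4 stands in for the undefined BitCount, and the function names and the error-return convention are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of set bits in the low four bits of x. */
unsigned bit_count4(unsigned x)
{
    return (x & 1u) + ((x >> 1) & 1u) + ((x >> 2) & 1u) + ((x >> 3) & 1u);
}

/* Classifies one received byte: 1, 0, or -1 when the two halves tie. */
int decode_symbol(uint8_t b)
{
    unsigned upper = bit_count4(b >> 4);
    unsigned lower = bit_count4(b);
    if (upper > lower) return 1;
    if (upper < lower) return 0;
    return -1;
}

/* Packs eight symbol bytes into one byte, first symbol in the most significant bit.
 * Returns 0 on success, -1 if any symbol is indeterminate. */
int pack_symbols(const uint8_t buf[8], uint8_t *packed)
{
    assert(buf != NULL && packed != NULL);
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        int bit = decode_symbol(buf[i]);
        if (bit < 0) return -1;
        out = (uint8_t)((out << 1) | (unsigned)bit);
    }
    *packed = out;
    return 0;
}
```

A clean 0xF0 decodes as 1 and 0x0F as 0; a byte such as 0x3C, with two bits set in each half, is reported as indeterminate.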
To get the number abcd from (a,b,c,d), you need to shift each number into its place and OR them together:
(a<<6)|(b<<4)|(c<<2)|d
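A sketch of both steps in C, with hypothetical function names: pack4 is the shift-and-OR expression above, and manchester_nibble builds the 2-bit codes directly from four data bits (1 becomes 0b10, 0 becomes 0b01, most significant data bit first, which is an assumption about the desired order).

```c
#include <assert.h>
#include <stdint.h>

/* Packs four 2-bit values into one byte, with a in the top two bits. */
uint8_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    assert(a < 4 && b < 4 && c < 4 && d < 4);
    return (uint8_t)((a << 6) | (b << 4) | (c << 2) | d);
}

/* Encodes the low four data bits of nibble, most significant first. */
uint8_t manchester_nibble(uint8_t nibble)
{
    uint8_t out = 0;
    for (int i = 3; i >= 0; i--) {
        unsigned bit = (nibble >> i) & 1u;
        out = (uint8_t)((out << 2) | (bit ? 0x2u : 0x1u));
    }
    return out;
}
```

The question's example corresponds to pack4(1, 2, 2, 1), which is 0x69; the same byte results from manchester_nibble(0x6), since the data bits are 0, 1, 1, 0.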
I was recently asked in an interview how to set the 513th bit of a char[1024] in C, but I'm unsure how to approach the problem. I saw How do you set, clear, and toggle a single bit?, but how do I choose the bit from such a large array?
int bitToSet = 513;
inArray[bitToSet / 8] |= (1 << (bitToSet % 8));
...making certain assumptions about character size and desired endianness.
EDIT: Okay, fine. You can replace 8 with CHAR_BIT if you want.
#include <limits.h>
int charContaining513thBit = 513 / CHAR_BIT;
int offsetOf513thBitInChar = 513 - charContaining513thBit*CHAR_BIT;
int bit513 = array[charContaining513thBit] >> offsetOf513thBitInChar & 1;
You have to know the width of characters (in bits) on your machine. For pretty much everyone, that's 8. You can use the constant CHAR_BIT from limits.h in a C program. You can then do some fairly simple math to find the offset of the bit (depending on how you count them).
Numbering bits from the left, with the 2⁷ bit in a[0] being bit 0, the 2⁰ bit being bit 7, and the 2⁷ bit in a[1] being bit 8, this gives:
offset = 513 / CHAR_BIT; /* using integer (truncating) math, of course */
bit = 513 % CHAR_BIT;
a[offset] |= (0x80>>bit);
There are many sane ways to number bits, here are two:
a[0]                      a[1]
0  1  2  3  4  5  6  7    8  9 10 11 12 13 14 15    This is the above
7  6  5  4  3  2  1  0   15 14 13 12 11 10  9  8    This is |= (1<<bit)
You could also number from the other end of the array (treating it as one very large big-endian number).
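The two numberings in the table can be written as a pair of helpers. This is a sketch; the function names are hypothetical, and the array is taken as unsigned char to avoid sign issues when shifting.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bit 0 is the least significant bit of a[0] (the |= (1<<bit) numbering). */
void set_bit_lsb0(unsigned char *a, size_t bit)
{
    assert(a != NULL);
    a[bit / CHAR_BIT] |= (unsigned char)(1u << (bit % CHAR_BIT));
}

/* Bit 0 is the most significant bit of a[0] (the |= (0x80>>bit) numbering). */
void set_bit_msb0(unsigned char *a, size_t bit)
{
    assert(a != NULL);
    a[bit / CHAR_BIT] |= (unsigned char)(1u << (CHAR_BIT - 1 - bit % CHAR_BIT));
}
```

With 8-bit chars, bit 513 lives in byte 64 under either numbering; the first helper sets 0x02 there and the second sets 0x40.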
Small optimization:
The / and % operators are rather slow, even on a lot of modern CPUs, with modulus being slightly slower. I would replace them with the equivalent operations using bit shifting (and subtraction), which obviously only works nicely when the second operand is a power of two. An optimizing compiler will usually make this substitution itself when the divisor is a constant.
x / 8 becomes x >> 3
x % 8 becomes x-((x>>3)<<3)
For this second operation, just reuse the result of the initial division.
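A sketch of these replacements, with hypothetical function names. It assumes unsigned operands; for negative signed values, shifting and division do not give the same results, so the substitution is only safe for non-negative numbers.

```c
#include <assert.h>

/* For unsigned x, x >> 3 is the same value as x / 8. */
unsigned byte_index(unsigned x) { return x >> 3; }

/* x % 8 computed by reusing the quotient, as described above. */
unsigned bit_index_sub(unsigned x)
{
    unsigned q = x >> 3;
    return x - (q << 3);
}

/* The same remainder with a single mask. */
unsigned bit_index_mask(unsigned x) { return x & 7u; }

/* Asserts that the shift forms match / and % for 0 .. limit-1. */
void check_shift_forms(unsigned limit)
{
    for (unsigned x = 0; x < limit; x++) {
        assert(byte_index(x) == x / 8);
        assert(bit_index_sub(x) == x % 8);
        assert(bit_index_mask(x) == x % 8);
    }
}
```

For bit 513 this gives byte 64 and bit position 1 within that byte.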
Depending on the desired order (left to right versus right to left), it might change. This is expanded into several lines of code to hopefully show the intended steps more clearly (or perhaps it just obfuscates the intention). The general idea, assuming 8 bits per byte, is to choose the byte as:
int bitNum = 513;
int bytePos = bitNum / 8;
Then the bit position would be computed as:
int bitInByte = bitNum % 8;
Then set the bit (assuming the goal is to set it to 1 as opposed to clear or toggle it):
charArray[bytePos] |= ( 1 << bitInByte );
When you say 513th are you using index 0 or 1 for the 1st bit? If it's the former your post refers to the bit at index 512. I think the question is valid since everywhere else in C the first index is always 0.
BTW
static char chr[1024];
...
chr[512>>3]=1<<(512&0x7);