Improving speed of bit copying in a lossless audio encoding algorithm (written in C) - c

I'm trying to implement a lossless audio codec that will be able to process data coming in at roughly 190 kHz to then be stored to an SD card using SPI DMA. I've found that the algorithm basically works, but has certain bottlenecks that I can't seem to overcome. I was hoping to get some advice on how to best optimize a certain portion of the code that I found to be the "slowest". I'm writing in C on a TI DSP and am using -O3 optimization.
for (j = packet_to_write.bfp_bits; j > 0; j--)
{
    encoded_data[(filled / 16)] |= ((buf_filt[i] >> (j - 1)) & 1) << (filled++ % 16);
}
In this section of code, I am taking X number of bits from the original data and fitting it into a buffer of encoded data. I've found that the loop is fairly costly and when I am working with a set of data represented by 8+ bits, then this code is too slow for my application. Loop unrolling doesn't really work here since each block of data can be represented by a different number of bits. The "filled" variable represents a bit counter filling up Uint16 indices in the encoded_data buffer.
I'd like some help understanding where bottlenecks may come from in this snippet of code (and hopefully I can take those findings and apply that to other areas of the algo). The authors of the paper that I'm reading (whose algorithm I'm trying to replicate) noted that they used a mixture of C and assembly code, but I'm not sure how assembly would be useful in this case.
Finally, the code itself is functional and I have done some extensive testing on actual audio samples. It's just not fast enough for real-time!
Thanks!

You really need to change the representation that you use for the output data. Instead of just a target buffer and the number of bits written, expand this to:
//complete words that have been written
uint16_t *encoded_data;
//number of complete words that have been written
unsigned filled_words;
//bits waiting to be written to encoded_data, LSB first
uint32_t encoded_bits;
//number of bits in encoded_bits
unsigned filled_bits;
This uses a single 32-bit word to buffer bits until we have enough to write out a complete uint16_t. This greatly simplifies the shifting and masking, because you always have at least 16 free bits to write into.
Then you can write out n bits of any source word like this:
void write_bits(uint16_t bits, unsigned n) {
    uint32_t mask = ((uint32_t)0xFFFF) >> (16 - n);
    encoded_bits |= (uint32_t)(bits & mask) << filled_bits;
    filled_bits += n;
    if (filled_bits >= 16) {
        encoded_data[filled_words++] = (uint16_t)encoded_bits;
        encoded_bits >>= 16;
        filled_bits -= 16;
    }
}
and instead of your loop, you just write
write_bits(buf_filt[i], packet_to_write.bfp_bits);
No one-bit-at-a-time operations are required.
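One detail the snippet above leaves out: when encoding finishes, up to 15 bits can still be waiting in encoded_bits, so you need a flush step. A minimal sketch (repeating write_bits for context; the buffer size and names are placeholders, not from your codebase):

```c
#include <stdint.h>

/* Hypothetical globals matching the representation described above. */
static uint16_t encoded_data[1024];
static unsigned filled_words;
static uint32_t encoded_bits;
static unsigned filled_bits;

void write_bits(uint16_t bits, unsigned n) {
    uint32_t mask = ((uint32_t)0xFFFF) >> (16 - n);
    encoded_bits |= (uint32_t)(bits & mask) << filled_bits;
    filled_bits += n;
    if (filled_bits >= 16) {
        encoded_data[filled_words++] = (uint16_t)encoded_bits;
        encoded_bits >>= 16;
        filled_bits -= 16;
    }
}

/* Zero-pad the leftover bits and emit the final partial word. */
void flush_bits(void) {
    if (filled_bits > 0) {
        encoded_data[filled_words++] = (uint16_t)encoded_bits;
        encoded_bits = 0;
        filled_bits = 0;
    }
}
```

Call flush_bits() once after the last sample is encoded; the final partial word is zero-padded, which the decoder must tolerate.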

Related

Copy from one memory to another skipping constant bytes in C

I am working on an embedded system application. I want to copy from source to destination, skipping a constant number of bytes. For example: source[6] = {0,1,2,3,4,5} and I want the destination to be {0,2,4}, skipping one byte each time. Unfortunately memcpy could not fulfil my requirement. How can I achieve this in C without using a loop, as I have a large amount of data to process and a loop incurs too much time overhead?
My current implementation is something like this, which takes up to 5-6 milliseconds for 1500 bytes:
unsigned int len_actual = 1500;
/* Fill in the SPI DMA buffer. */
while (len_actual-- != 0)
{
    *(tgt_handle->spi_tx_buff++) = ((*write_irp->buffer++)) | (2 << 16) | DSPI_PUSHR_CONT;
}
You could write a "cherry picker" function
void * memcpk(void * destination, const void * source,
              size_t num, size_t size,
              int (*test)(const void * item));
which copies at most num "objects", each having size size from
source to destination. Only the objects that satisfy the test are copied.
Then with
int oddp(const void * intptr) { return (*((int *)intptr))%2; }
int evenp(const void * intptr) { return !oddp(intptr); }
you could do
int destination[6];
memcpk(destination, source, 6, sizeof(int), evenp);
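For reference, one possible body for memcpk — just a sketch of the prototype above, not a tuned implementation (the is_even predicate here stands in for evenp):

```c
#include <stddef.h>
#include <string.h>

/* Keep objects whose int value is even (mirrors the evenp above). */
static int is_even(const void *item) { return (*(const int *)item) % 2 == 0; }

/* Copies at most num objects of 'size' bytes from source to destination,
 * keeping only those that satisfy 'test'. Returns a pointer one past the
 * last object written, so the caller can compute how many were kept. */
void *memcpk(void *destination, const void *source,
             size_t num, size_t size,
             int (*test)(const void *item))
{
    unsigned char *dst = destination;
    const unsigned char *src = source;
    for (size_t i = 0; i < num; i++, src += size) {
        if (test(src)) {             /* copy only objects passing the test */
            memcpy(dst, src, size);
            dst += size;
        }
    }
    return dst;
}
```

Note that this still loops internally; it packages the cherry-picking, it does not remove the per-object cost.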
Almost all CPUs have caches, which means that (e.g.) when you modify one byte the CPU fetches an entire cache line from RAM, modifies the byte in the cache, then writes the entire cache line back to RAM. By skipping small pieces you add overhead (more instructions for the CPU to care about) and won't reduce the amount of data transferred between cache and RAM.
Also, typically memcpy() is optimised to copy larger pieces. For example, if you copy an array of bytes but the CPU is capable of copying 32-bits (4 bytes) at once, then memcpy() will probably do the majority of the copying as a loop with 4 bytes per iteration (to reduce the number of reads and writes and reduce the number of loop iterations).
In other words, code to avoid copying specific bytes will make it significantly slower than memcpy() for multiple reasons.
To avoid that, you really want to separate the data that needs to be copied from the data that doesn't - e.g. put everything that doesn't need to be copied at the end of the array and only copy the first part of the array (so that it remains "copy a contiguous area of bytes").
If you can't do that, the next alternative to consider would be masking. For example, if you have an array of bytes where some bytes shouldn't be copied, then you'd also have an array of "mask bytes" and do something like dest[i] = (dest[i] & mask[i]) | (src[i] & ~mask[i]); in a loop. This sounds horrible (and is horrible) until you optimise it by operating on larger pieces - e.g. if the CPU can copy 32-bit pieces, masking allows you to do 4 bytes per iteration by pretending all of the arrays are arrays of uint32_t. Note that for this technique wider is better - e.g. if the CPU supports operations on 256-bit pieces (AVX on 80x86) you'd be able to do 32 bytes per iteration of the loop. It also helps if you can make guarantees about the size and alignment (e.g. if the CPU can operate on 32 bits/4 bytes at a time, ensure that the size of the arrays is always a multiple of 4 bytes and that the arrays are always 4-byte aligned, even if it means adding unused padding at the end).
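A minimal sketch of that masking loop on 32-bit words (assuming all three arrays are 4-byte aligned and the same length in words):

```c
#include <stdint.h>
#include <stddef.h>

/* For each 32-bit word, keep the dest bytes where the mask bits are 1
 * and take the source bytes where they are 0. */
void masked_copy32(uint32_t *dest, const uint32_t *src,
                   const uint32_t *mask, size_t words)
{
    for (size_t i = 0; i < words; i++)
        dest[i] = (dest[i] & mask[i]) | (src[i] & ~mask[i]);
}
```

Widening the element type to whatever the CPU moves natively is the whole trick; the loop body itself doesn't change.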
Also note that depending on which CPU it actually is, there might be special support in the instruction set. For one example, modern 80x86 CPUs (that support SSE2) have a maskmovdqu instruction that is designed specifically for selectively writing some bytes but not others. In that case, you'd need to resort to intrinsics or inline assembly because "pure C" has no support for this type of thing (beyond bitwise operators).
Having overlooked your speed requirements:
You may try to find a way which solves the problem without copying at all.
Some ideas here:
If you want to iterate the destination array you could define a kind of "picky iterator" for source that advances to the next number you allow: instead of iter++, do iter = advance_source(iter).
If you want to search the destination array then wrap a function around bsearch() that searches source and inspects the result. And so on.
Depending on your processor memory width, and number of internal registers, you might be able to speed this up by using shift operations.
You need to know if your processor is big-endian or little-endian.
Let's say you have a 32-bit processor and bus, and at least 4 spare registers that the compiler can use for optimisation. This means you can write 4 bytes into the same target word, having read 2 source words. Note that you are reading the bytes you are going to discard.
You can also improve the speed by making sure that everything is word aligned, and ignoring the gaps between the buffers, so not having to worry about the odd counts of bytes.
So, for little-endian:
inline unsigned long CopyEven(unsigned long a, unsigned long b)
{
    unsigned long c = a & 0xff;
    c |= (a >> 8) & 0xff00;
    c |= (b << 16) & 0xff0000;
    c |= (b << 8) & 0xff000000;
    return c;
}

unsigned long* d = (unsigned long*)dest;
unsigned long* s = (unsigned long*)source;
for (int count = 0; count < sourceLenBytes; count += 8)
{
    *d = CopyEven(s[0], s[1]);
    d++;
    s += 2;
}

Is my Blowfish algorithm "standard"?

I wrote my own implementation of a program, in C, with which I can encrypt a text file and vice versa.
The Blowfish algorithm itself is the standard one.
But then my thought is this: if I assemble a set of 4 chars from the file into a long, say 0x12345678, I can decode it because I know the proper order in which I read the file.
On the other hand, using a pre-made function like memcpy(), the content read ends up ordered like 0x87654321, not as my previous function does. But the algorithm used is the same.
Is there a "standard" way to read and acquire data from a file, or are both of the previous examples fine? On an online site (blowfish online), the version using memcpy() does not agree with it when using ECB mode. The version that acquires the data like 0x12345678 works fine with the site. ("Working" means making an encrypted file with my program and decrypting it online.)
In other words: should anything I encrypt with my program be decryptable (knowing the key) by other people who don't know my program, as a general rule?
EDIT: the memcpy() function puts the lowest index of the array at the least significant end of the integer.
This is the code which manipulates data for a 64-bit block:
memcpy(&cl, &file_cache[i], sizeof(unsigned long));
memcpy(&cr, &file_cache[i + 4], sizeof(unsigned long));
And this is the core part (which works fine, by correctly rearranging the read from the buffer, i.e. looping 8 times per block) of the same portion, which uses bitwise magic instead of memcpy() and copes with the endianness problem:
if (i == 0) {
    cl <<= 24;
    L |= 0xff000000 & cl;
}
else if (i == 1) {
    cl <<= 16;
    L |= 0x00ff0000 & cl;
}
else if (i == 2) {
    cl <<= 8;
    L |= 0x0000ff00 & cl;
}
else if (i == 3) {
    L |= 0x000000ff & cl;
}
else if (i == 4) {
    cl <<= 24;
    R |= 0xff000000 & cl;
}
else if (i == 5) {
    cl <<= 16;
    R |= 0x00ff0000 & cl;
}
else if (i == 6) {
    cl <<= 8;
    R |= 0x0000ff00 & cl;
}
else if (i == 7) {
    R |= 0x000000ff & cl;
}
Then L and R are sent to be encrypted. This last implementation agrees with other Blowfish implementations online, so in principle it should be better.
Which implementation is faster/better/lighter/stronger?
If memcpy() is the one advised, is there a convenient and fast way to reverse/mirror the contents of cl and cr?
Note that the leftmost byte is usually the "first byte sent/received" in cryptography; i.e. if you have an array, then the lowest index is to the left. If nothing has been specified, then this is the de facto standard.
However, the Blowfish test vectors - as indicated by GregS - explicitly specify this default order, so there is no need to guess:
...
All data is shown as a hex string with 012345 loading as
data[0]=0x01;
data[1]=0x23;
data[2]=0x45;
...
As long as your code produces the same test vectors then you're OK, keeping in mind that your input / output should comply with the order of the test vectors.
It is highly recommended to make any cryptographic API operate on bytes (or rather, octets), not on other data types even if those bytes are internally handled as 32 or 64 bit words. The time required for conversion to/from bytes should be minimal compared to the actual encryption/decryption.
If you read the file as a sequence of 4-byte words then you would need to account for the endianness of those words in the memory layout, swapping the bytes as required to ensure the individual bytes are handled in a consistent order.
However, if you read/write your file as a sequence of bytes, and stored directly in sequence in memory (in an unsigned char array for example) then the data in file should have the same layout as in memory. That way you can obtain a consistent encoding/decoding whether you encode directly from/to memory or from/to file.
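To make that concrete, here is a sketch of endianness-independent packing and unpacking for a 64-bit Blowfish block, byte-at-a-time with shifts. The function names are illustrative, not from any particular library; this is the if/else ladder from the question collapsed into two loops.

```c
#include <stdint.h>

/* Assemble the two 32-bit Blowfish halves from an 8-byte block, treating
 * the lowest-index byte as the most significant (the order the test
 * vectors specify). Behaves identically on any host endianness. */
void load_block(const unsigned char *block, uint32_t *L, uint32_t *R)
{
    *L = ((uint32_t)block[0] << 24) | ((uint32_t)block[1] << 16)
       | ((uint32_t)block[2] << 8)  |  (uint32_t)block[3];
    *R = ((uint32_t)block[4] << 24) | ((uint32_t)block[5] << 16)
       | ((uint32_t)block[6] << 8)  |  (uint32_t)block[7];
}

/* The inverse, for writing ciphertext back out as bytes. */
void store_block(unsigned char *block, uint32_t L, uint32_t R)
{
    for (int i = 0; i < 4; i++) {
        block[i]     = (unsigned char)(L >> (24 - 8 * i));
        block[i + 4] = (unsigned char)(R >> (24 - 8 * i));
    }
}
```

Because only shifts on individual bytes are used, no byte-swapping or endianness detection is ever needed.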

logic operators & bit separation calculation in C (PIC programming)

I am programming a PIC18F94K20 to work in conjunction with a MCP7941X I2C RTCC chip and a 24AA128 I2C CMOS serial EEPROM device. Currently I have code which successfully initialises the seconds/days/etc. values of the RTCC and starts the timer, toggling an LED upon the turnover of every second.
I am attempting to augment the code to read back the correct data for these values, however I am running into trouble when I try to account for the various 'extra' bits in the values. The memory map may help elucidate my problem somewhat:
Taking, for example, the hours column, or the 02h address: bit 6 is set to 1 to toggle 12-hour time, adding 01000000 to the hours byte. I can read back the entire contents of the byte at this address, but I want to employ an if statement to detect whether 12- or 24-hour time is in place, and adjust accordingly. I'm not worried about the 10-hour bits, as I can calculate that easily enough with a BCD conversion loop (I think).
I earlier used the bitwise OR operator in C to augment the original hours data to 24. I initialised the hours in this particular case to 0x11, and set the 12 hour control bit which is 0x64. When setting the time:
WriteI2C(0x11|0x64);
which as you can see uses the bitwise OR.
When reading back the hours, how can I incorporate operators into my code to separate the superfluous bits from the actual time bits? I tried doing something like this:
current_seconds = ReadI2C();
current_seconds = ST & current_seconds;
but that completely ruins everything. It compiles, but the device gets 'stuck' on this sequence.
How do I separate the ST / AMPM / VBATEN bits from the actual data I need, and what would a good method be of implementing for loops for the various circumstances they present (e.g. reading back 12 hour time if bit 6 = 0 and 24 hour time if bit6 = 1, and so on).
I'm a bit of a C novice and this is my first foray into electronics so I really appreciate any help. Thanks.
To remove (zero) a bit, you can AND the value with a mask having all other bits set, i.e., the complement of the bits that you wish to zero, e.g.:
value_without_bit_6 = value & ~(1<<6);
To isolate a bit within an integer, you can AND the value with a mask having only those bits set. For checking flags this is all you need to do, e.g.,
if (value & (1<<6)) {
// bit 6 is set
} else {
// bit 6 is not set
}
To read the value of a small integer offset within a larger one, first isolate the bits, and then shift them right by the index of the lowest bit (to get the least significant bit into correct position), e.g.:
value_in_bits_4_and_5 = (value & ((1<<4)|(1<<5))) >> 4;
For more readable code, you should use constants or #defined macros to represent the various bit masks you need, e.g.:
#define BIT_VBAT_EN (1<<3)
if (value & BIT_VBAT_EN) {
// VBAT is enabled
}
Another way to do this is to use bitfields to define the organisation of bits, e.g.:
typedef union {
    struct {
        unsigned ones:4;
        unsigned tens:3;
        unsigned st:1;
    } seconds;
    uint8_t byte;
} seconds_register_t;

seconds_register_t sr;
sr.byte = READ_ADDRESS(0x00);
unsigned int seconds = sr.seconds.ones + sr.seconds.tens * 10;
A potential problem with bitfields is that the code generated by the compiler may be unpredictably large or inefficient, which is sometimes a concern with microcontrollers, but obviously it's nicer to read and write. (Another problem often cited is that the organisation of bit fields, e.g., endianness, is largely unspecified by the C standard and thus not guaranteed portable across compilers and platforms. However, it is my opinion that low-level development for microcontrollers tends to be inherently non-portable, so if you find the right bit layout I wouldn't consider using bitfields “wrong”, especially for hobbyist projects.)
Yet you can accomplish similarly readable syntax with macros; it's just the macro itself that is less readable:
#define GET_SECONDS(r) ( ((r) & 0x0F) + (((r) & 0x70) >> 4) * 10 )
uint8_t sr = READ_ADDRESS(0x00);
unsigned int seconds = GET_SECONDS(sr);
Regarding the bit masking itself, you are going to want to make a model of that memory map in your microcontroller. The simplest, crudest way to do that is to #define a number of bit masks, like this:
#define REG1_ST 0x80u
#define REG1_10_SECONDS 0x70u
#define REG1_SECONDS 0x0Fu
#define REG2_10_MINUTES 0x70u
...
And then when reading each byte, mask out the data you are interested in. For example:
bool st = (data & REG1_ST) != 0;
uint8_t ten_seconds = (data & REG1_10_SECONDS) >> 4;
uint8_t seconds = (data & REG1_SECONDS);
The important part is to minimize the amount of "magic numbers" in the source code.
Writing data:
reg1 = 0;
reg1 |= st ? REG1_ST : 0;
reg1 |= (ten_seconds << 4) & REG1_10_SECONDS;
reg1 |= seconds & REG1_SECONDS;
Please note that I left out the I2C communication of this.

Applications of bitwise operators in C and their efficiency? [duplicate]

This question already has answers here:
Real world use cases of bitwise operators [closed]
(41 answers)
Closed 6 years ago.
I am new to bitwise operators.
I understand how the logic functions work to get the final result. For example, when you bitwise AND two numbers, the final result is going to be the AND of those two numbers (1 & 0 = 0; 1 & 1 = 1; 0 & 0 = 0). Same with OR, XOR, and NOT.
What I don't understand is their application. I tried looking everywhere and most of them just explain how bitwise operations work. Of all the bitwise operators I only understand the application of shift operators (multiplication and division). I also came across masking. I understand that masking is done using bitwise AND but what exactly is its purpose and where and how can I use it?
Can you elaborate on how I can use masking? Are there similar uses for OR and XOR?
The low-level use case for the bitwise operators is to perform base 2 math. There is the well known trick to test if a number is a power of 2:
if ((x & (x - 1)) == 0) {
printf("%d is a power of 2\n", x);
}
But, it can also serve a higher-level function: set manipulation. You can think of a collection of bits as a set. To explain, let each bit in a byte represent one of 8 distinct items, say the planets in our solar system (Pluto is no longer considered a planet, so 8 bits are enough!):
#define Mercury (1 << 0)
#define Venus (1 << 1)
#define Earth (1 << 2)
#define Mars (1 << 3)
#define Jupiter (1 << 4)
#define Saturn (1 << 5)
#define Uranus (1 << 6)
#define Neptune (1 << 7)
Then, we can form a collection of planets (a subset) using |:
unsigned char Giants = (Jupiter|Saturn|Uranus|Neptune);
unsigned char Visited = (Venus|Earth|Mars);
unsigned char BeyondTheBelt = (Jupiter|Saturn|Uranus|Neptune);
unsigned char All = (Mercury|Venus|Earth|Mars|Jupiter|Saturn|Uranus|Neptune);
Now, you can use a & to test if two sets have an intersection:
if (Visited & Giants) {
puts("we might be giants");
}
The ^ operation is often used to see what is different between two sets (the union of the sets minus their intersection):
if (Giants ^ BeyondTheBelt) {
puts("there are non-giants out there");
}
So, think of | as union, & as intersection, and ^ as union minus the intersection.
Once you buy into the idea of bits representing a set, then the bitwise operations are naturally there to help manipulate those sets.
One application of bitwise ANDs is checking if a single bit is set in a byte. This is useful in networked communication, where protocol headers attempt to pack as much information into the smallest area as is possible in an effort to reduce overhead.
For example, the IPv4 header uses the top 3 bits of its 6th byte to tell whether the given IP packet can be fragmented, and if so, whether to expect more fragments of the given packet to follow. If these fields each occupied a full byte instead, each IP packet would be 21 bits larger than necessary. That translates to a huge amount of unnecessary data through the internet every day.
To retrieve these 3 bits, a bitwise AND could be used along side a bit mask to determine if they are set.
unsigned char mymask = 0x80;
if ((ipheader[6] & mymask) == mymask)
    // the top bit of the 6th byte of the IP header is set
Small sets, as has been mentioned. You can do a surprisingly large number of operations quickly, intersection and union and (symmetric) difference are obviously trivial, but for example you can also efficiently:
get the lowest item in the set with x & -x
remove the lowest item from the set with x & (x - 1)
add all items smaller than the smallest present item
add all items higher than the smallest present item
calculate their cardinality (though the algorithm is nontrivial)
permute the set in some ways, that is, change the indexes of the items (not all permutations are equally efficient)
calculate the lexicographically next set that contains as many items (Gosper's Hack)
1 and 2 and their variations can be used to build efficient graph algorithms on small graphs, for example see algorithm R in The Art of Computer Programming 4A.
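Tricks 1 and 2 (plus the fill-below variant) in C, for concreteness; the function names are mine:

```c
#include <stdint.h>

/* Lowest set bit of x, as a mask. E.g. 0b10110 -> 0b00010. */
uint32_t lowest_item(uint32_t x) { return x & (0u - x); }

/* x with its lowest set bit cleared. E.g. 0b10110 -> 0b10100. */
uint32_t drop_lowest(uint32_t x) { return x & (x - 1); }

/* Set all bits below the lowest set bit. E.g. 0b10100 -> 0b10111. */
uint32_t fill_below(uint32_t x)  { return x | (x - 1); }
```

All three are branchless single-instruction-count operations, which is why the small-set representation pays off.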
Other applications of bitwise operations include, but are not limited to,
Bitboards, important in many board games. Chess without bitboards is like Christmas without Santa. Not only is it a space-efficient representation, you can do non-trivial computations directly with the bitboard (see Hyperbola Quintessence)
sideways heaps, and their application in finding the Nearest Common Ancestor and computing Range Minimum Queries.
efficient cycle-detection (Gosper's Loop Detection, found in HAKMEM)
adding offsets to Z-curve addresses without deconstructing and reconstructing them (see Tesseral Arithmetic)
These uses are more powerful, but also advanced, rare, and very specific. They show, however, that bitwise operations are not just a cute toy left over from the old low-level days.
Example 1
If you have 10 booleans that "work together" you can simplify your code a lot.
int B1  = 0x001;   // each flag gets its own bit
int B2  = 0x002;
int B10 = 0x200;   // the 10th flag is bit 9

int someValue = get_a_value_from_somewhere();

if ((someValue & (B1 | B10)) == (B1 | B10)) {
    // B1 and B10 are both set
}
Example 2
Interfacing with hardware. An address on the hardware may need bit level access to control the interface. e.g. an overflow bit on a buffer or a status byte that can tell you the status of 8 different things. Using bit masking you can get down the the actual bit of info you need.
if (register & 0x80) {
// top bit in the byte is set which may have special meaning.
}
This is really just a specialized case of example 1.
Bitwise operators are particularly useful in systems with limited resources as each bit can encode a boolean. Using many chars for flags is wasteful as each takes one byte of space (when they could be storing 8 flags each).
Commonly microcontrollers have C interfaces for their IO ports in which each bit controls 1 of 8 ports. Without bitwise operators these would be quite difficult to control.
Regarding masking, it is common to use both & and |:
x & 0x0F //ensures the 4 high bits are 0
x | 0x0F //ensures the 4 low bits are 1
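The two combine into the usual read-modify-write idiom for replacing a bit field without disturbing its neighbours (the function name and field width here are illustrative):

```c
#include <stdint.h>

/* Replace the 4-bit field at bit offset 'shift' inside 'reg' with 'val',
 * leaving every other bit untouched: clear the field with &~mask, then
 * merge in the new value with |. */
uint8_t set_field(uint8_t reg, unsigned shift, uint8_t val)
{
    uint8_t mask = (uint8_t)(0x0F << shift);
    return (uint8_t)((reg & ~mask) | ((val << shift) & mask));
}
```

This is exactly what memory-mapped register updates on a microcontroller look like, minus the volatile access.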
In microcontroller applications, you can use bitwise operations to cycle through I/O pins. If we would like to turn on a single PORTB pin while turning off the rest, the following code can be used.
void main()
{
    unsigned char ON = 1;
    TRISB = 0;
    PORTB = 0;
    while (1) {
        PORTB = ON;
        delay_ms(200);
        ON = ON << 1;
        if (ON == 0) ON = 1;
    }
}

optimized byte array shifter

I'm sure this has been asked before, but I need to implement a shift operator on a byte array of variable length. I've looked around a bit but have not found any standard way of doing it. I came up with an implementation that works, but I'm not sure how efficient it is. Does anyone know of a standard way to shift an array, or at least have any recommendation on how to boost the performance of my implementation?
char* baLeftShift(const char* array, size_t size, signed int displacement, char* result)
{
    memcpy(result, array, size);
    short shiftBuffer = 0;
    char carryFlag = 0;
    char* byte;
    if (displacement > 0)
    {
        for (; displacement--;)
        {
            for (byte = &result[size - 1]; byte >= result; byte--)
            {
                shiftBuffer = *byte;
                shiftBuffer <<= 1;
                *byte = carryFlag | (char)shiftBuffer;
                carryFlag = ((char*)&shiftBuffer)[1];  /* assumes little-endian */
            }
        }
    }
    else
    {
        char* end = result + size;
        displacement = -displacement;
        for (; displacement--;)
        {
            for (byte = result; byte < end; byte++)
            {
                shiftBuffer = *byte;
                shiftBuffer <<= 7;
                *byte = carryFlag | ((char*)&shiftBuffer)[1];
                carryFlag = (char)shiftBuffer;
            }
        }
    }
    return result;
}
If I can just add to what #dwelch is saying, you could try this.
Just move the bytes to their final locations. Then you are left with a shift count such as 3, for example, if each byte still needs to be left-shifted 3 bits into the next higher byte. (This assumes in your mind's eye the bytes are laid out in ascending order from right to left.)
Then rotate each byte to the left by 3. A lookup table might be faster than individually doing an actual rotate. Then, in each byte, the 3 bits to be shifted are now in the right-hand end of the byte.
Now make a mask M, which is (1<<3)-1, which is simply the low order 3 bits turned on.
Now, in order, from high order byte to low order byte, do this:
c[i] ^= M & (c[i] ^ c[i-1])
That will copy bits to c[i] from c[i-1] under the mask M.
For the last byte, just use a 0 in place of c[i-1].
For right shifts, same idea.
My first suggestion would be to eliminate the for loops around the displacement. You should be able to do the necessary shifts without the for(;displacement--;) loops. For displacements of magnitude greater than 7, things get a little trickier because your inner loop bounds will change and your source offset is no longer 1. i.e. your input buffer offset becomes magnitude / 8 and your shift becomes magnitude % 8.
It does look inefficient and perhaps this is what Nathan was referring to.
Assuming a char is 8 bits where this code is running, there are two things to do. First move the whole bytes: for example, if your input array is 0x00,0x00,0x12,0x34 and you shift left 8 bits, you get 0x00,0x12,0x34,0x00. There is no reason to do that in a loop 8 times, one bit at a time, so start by shifting whole chars in the array by (displacement>>3) positions and pad the holes created with zeros. Then deal with the remaining (displacement&7) bits by combining each byte, shifted left by (displacement&7), with its neighbour shifted right by (8-(displacement&7)). A good compiler will precompute (displacement>>3), displacement&7 and 8-(displacement&7), and a good processor will have enough registers to keep all of those values. You might help the compiler by making separate variables for each of those items, but depending on the compiler and how you are using it, that could make it worse too.
The bottom line, though, is to time the code. Perform a thousand 1-bit shifts, then a thousand 2-bit shifts, etc., timing the whole thing; then try a different algorithm, time it the same way, and see whether the optimizations make it better or worse. If you know ahead of time that this code will only ever be used for single-bit or less-than-8-bit shifts, adjust the timing test accordingly.
Your use of the carry flag implies that you are aware that many processors have instructions specifically for chaining arbitrarily long shifts using the standard register length: basically rotate-through-carry, one bit at a time, which the C language does not support directly. For chaining single-bit shifts you could consider assembler and likely outperform the C code; at least the single-bit shifts are faster there than C code can manage. A hybrid: move the whole bytes, then if the number of bits to shift (displacement&7) is less than, say, 4, use the assembler, else use a C loop. Again, the timing tests will tell you where the optimizations are.
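Putting dwelch's two steps together, here is a sketch of the whole-byte move plus residual bit shift for a left shift. It assumes array[0] is the most significant byte, matching the 0x00,0x00,0x12,0x34 example above, and does one pass per byte rather than one pass per bit of displacement:

```c
#include <stddef.h>

/* Left-shift an in-memory big-endian byte array by 'displacement' bits:
 * whole-byte moves first (displacement>>3), then the residual 0..7 bits
 * by combining each source byte with its lower-order neighbour. */
void shift_left(unsigned char *a, size_t size, unsigned displacement)
{
    size_t byte_shift = displacement >> 3;   /* whole bytes to move */
    unsigned bit_shift = displacement & 7;   /* remaining 0..7 bits */
    for (size_t i = 0; i < size; i++) {
        size_t j = i + byte_shift;           /* source byte index   */
        unsigned char hi = (j < size) ? a[j] : 0;
        unsigned char lo = (j + 1 < size) ? a[j + 1] : 0;
        a[i] = (unsigned char)((hi << bit_shift) |
                               (bit_shift ? lo >> (8 - bit_shift) : 0));
    }
}
```

Because the destination index i never runs ahead of the source index j, the shift is safe to do in place; a right shift needs the mirror-image loop running from the high end down.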

Resources