How to transpose a matrix in ARM assembly - arrays

I'm trying to perform a matrix transposition of 8 arrays, each n bits long (n is around 70,000), into a byte array of n elements.
Context information: the 8 n-bit arrays are RGB data for 8 channels. I need one byte whose bits are the nth bits of the 8 arrays. This will run on an ARM Cortex-M3 processor and needs to be as fast as possible, since I'm generating 8 simultaneous signals from the resulting array.
I've come up with a pseudo-algorithm (linked below) to do this, but I'm afraid it might be too costly for the processor.
Pseudo Algorithm
I'm looking for the fastest executing code. Size is of secondary importance.
I would appreciate any suggestions.
This is what I implemented but the results are not that good.
do {
    for (b = 0; b < 24; b++) {  // could optimize to for(b=24;b!=0;b--)
        m = 1 << b;
        *dataBytes += __ROR(*s0 & m, 32 + b - 0);  // strip 0 data
        *dataBytes += __ROR(*s1 & m, 32 + b - 1);  // strip 1 data
        *dataBytes += __ROR(*s2 & m, 32 + b - 2);  // strip 2 data
        *dataBytes += __ROR(*s3 & m, 32 + b - 3);  // strip 3 data
        *dataBytes += __ROR(*s4 & m, 32 + b - 4);  // strip 4 data
        *dataBytes += __ROR(*s5 & m, 32 + b - 5);  // strip 5 data
        *dataBytes += __ROR(*s6 & m, 32 + b - 6);  // strip 6 data
        *dataBytes += __ROR(*s7 & m, 32 + b - 7);  // strip 7 data
        dataBytes++;
    }
    s0 += 3;
    s1 += 3;
    s2 += 3;
    s3 += 3;
    s4 += 3;
    s5 += 3;
    s6 += 3;
    s7 += 3;
} while (n--);
s0 to s7 are the 8 individual vectors from which the bits are taken in groups of 24.
n is the number of groups, m is the mask, and b is the bit position of the mask.
dataBytes is the resulting array.

There are two things that are always present when optimizing:
Memory bandwidth
CPU clocks
Bandwidth
Your current algorithm loads a byte at a time. You can do this more efficiently by loading at least 32 bits at a time, which makes better use of the ARM bus. The final algorithm will almost certainly not be bus bound, and if it is, you have optimized for it.
For the different ARM CPUs, there are instructions like pld which try to optimize bus use by pre-fetching the next data elements in advance. This may or may not apply to your Cortex-M. Another technique is to relocate the data to faster memory, such as TCM, if possible.
CPU speed
Pixel processing is almost always sped up by SIMD-type instructions. Some Cortex-M cores have instructions labelled SIMD. Don't get hung up on the label SIMD; use the concept. If you have loaded multiple bytes into a word, then you can use a table.
const unsigned long bits[16] = {
    0x00000000, 0x00000001, 0x00000100, 0x00000101,
    0x00010000, 0x00010001, 0x00010100, 0x00010101,
    0x01000000, 0x01000001, 0x01000100, 0x01000101,
    0x01010000, 0x01010001, 0x01010100, 0x01010101
};
A similar concept is used in many CRC algorithms on the Internet: process each nibble (4 bits) and form the next four bytes of output, one source bit going to each output byte. There is probably a multiplication value that can replace the table, but that depends on the speed of your multiply, which depends on the type of Cortex-M and/or ARM.
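For instance, an 8×8 bit transpose built on that table could look like the following. This is a rough sketch, not tested on target; the transpose8 helper and the little-endian stores are my assumptions, not part of the answer above.

#include <stdint.h>
#include <string.h>

/* bits[n] places bit k of the nibble n into byte lane k (0x01 per lane). */
static const uint32_t bits[16] = {
    0x00000000, 0x00000001, 0x00000100, 0x00000101,
    0x00010000, 0x00010001, 0x00010100, 0x00010101,
    0x01000000, 0x01000001, 0x01000100, 0x01000101,
    0x01010000, 0x01010001, 0x01010100, 0x01010101
};

/* src[ch] is the current byte of channel ch; out[k] receives a byte whose
   bit ch is bit k of channel ch, i.e. the transposed data. Assumes a
   little-endian target, which is typical for Cortex-M parts. */
static void transpose8(const uint8_t src[8], uint8_t out[8])
{
    uint32_t lo = 0, hi = 0;
    for (int ch = 0; ch < 8; ch++) {
        lo |= bits[src[ch] & 0x0F] << ch;  /* source bits 0..3 -> out[0..3] */
        hi |= bits[src[ch] >> 4]  << ch;   /* source bits 4..7 -> out[4..7] */
    }
    memcpy(out,     &lo, 4);
    memcpy(out + 4, &hi, 4);
}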
Definitely prototype in 'C' and then convert to assembler or use inline assembler if possible. If you have many mov statements in your algorithm, it is a sign that a compiler could probably allocate registers better than you. Many sophisticated algorithms use a code generator (scripted in Python, Perl, etc.) which can unroll whatever optimal loop you end up with and also track registers algorithmically.
Note: double-check my table; it is just a first crack and I have not actually coded this particular algorithm. There may be more slick ways to process multiple bits at a time, but the idea is probably fruitful.

Related

Improving speed of bit copying in a lossless audio encoding algorithm (written in C)

I'm trying to implement a lossless audio codec that will be able to process data coming in at roughly 190 kHz to then be stored to an SD card using SPI DMA. I've found that the algorithm basically works, but has certain bottlenecks that I can't seem to overcome. I was hoping to get some advice on how to best optimize a certain portion of the code that I found to be the "slowest". I'm writing in C on a TI DSP and am using -O3 optimization.
for (j = packet_to_write.bfp_bits; j > 0; j--)
{
    encoded_data[filled / 16] |= ((buf_filt[i] >> (j - 1)) & 1) << (filled++ % 16);
}
In this section of code, I am taking X number of bits from the original data and fitting it into a buffer of encoded data. I've found that the loop is fairly costly and when I am working with a set of data represented by 8+ bits, then this code is too slow for my application. Loop unrolling doesn't really work here since each block of data can be represented by a different number of bits. The "filled" variable represents a bit counter filling up Uint16 indices in the encoded_data buffer.
I'd like some help understanding where bottlenecks may come from in this snippet of code (and hopefully I can take those findings and apply that to other areas of the algo). The authors of the paper that I'm reading (whose algorithm I'm trying to replicate) noted that they used a mixture of C and assembly code, but I'm not sure how assembly would be useful in this case.
Finally, the code itself is functional and I have done some extensive testing on actual audio samples. It's just not fast enough for real-time!
Thanks!
You really need to change the representation that you use for the output data. Instead of just a target buffer and the number of bits written, expand this to:
// complete words that have been written
uint16_t *encoded_data;
// number of complete words that have been written
unsigned filled_words;
// bits waiting to be written to encoded_data, LSB first
uint32_t encoded_bits;
// number of bits in encoded_bits
unsigned filled_bits;
This uses a single 32-bit word to buffer bits until we have enough to write out a complete uint16_t. This greatly simplifies the shifting and masking, because you always have at least 16 free bits to write into.
Then you can write out n bits of any source word like this:
void write_bits(uint16_t bits, unsigned n) {
    uint32_t mask = ((uint32_t)0xFFFF) >> (16 - n);
    encoded_bits |= (bits & mask) << filled_bits;
    filled_bits += n;
    if (filled_bits >= 16) {
        encoded_data[filled_words++] = (uint16_t)encoded_bits;
        encoded_bits >>= 16;
        filled_bits -= 16;
    }
}
and instead of your loop, you just write
write_bits(buf_filt[i], packet_to_write.bfp_bits);
No one-bit-at-a-time operations are required.
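One loose end (my addition, not part of the original answer): after the final write_bits call, up to 15 bits can still be sitting in encoded_bits, so the stream needs a flush at the end, something like:

/* Write out any bits still buffered at end of stream, zero-padded to 16. */
void flush_bits(void) {
    if (filled_bits > 0) {
        encoded_data[filled_words++] = (uint16_t)encoded_bits;
        encoded_bits = 0;
        filled_bits = 0;
    }
}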

Can I use SIMD to bucket sort / categorize?

I'm curious about SIMD and wondering if it can handle this use case.
Let's say I have an array of 2048 integers, like
[0x018A, 0x004B, 0x01C0, 0x0234, 0x0098, 0x0343, 0x0222, 0x0301, 0x0398, 0x0087, 0x0167, 0x0389, 0x03F2, 0x0034, 0x0345, ...]
Note how they all start with either 0x00, 0x01, 0x02, or 0x03. I want to split them into 4 arrays:
One for all the integers starting with 0x00
One for all the integers starting with 0x01
One for all the integers starting with 0x02
One for all the integers starting with 0x03
I imagine I would have code like this:
int main() {
    uint16_t in[2048] = ...;
    // 4 arrays, one for each category
    uint16_t out[4][2048];
    // Pointers to the next available slot in each of the arrays
    uint16_t *nextOut[4] = { out[0], out[1], out[2], out[3] };
    for (uint16_t *nextIn = in; nextIn < in + 2048; nextIn += 4) {
        (*** magic simd instructions here ***)
        // Equivalent non-simd code:
        uint16_t categories[4];
        for (int i = 0; i < 4; i++) {
            categories[i] = nextIn[i] & 0xFF00;
        }
        for (int i = 0; i < 4; i++) {
            uint16_t category = categories[i] >> 8;  // 0..3
            *nextOut[category] = nextIn[i];
            nextOut[category]++;
        }
    }
    // Now I have my categorized arrays!
}
I imagine that my first inner loop doesn't need SIMD; it can be just an (x & 0xFF00FF00FF00FF00) operation, but I wonder if we can make that second inner loop into a SIMD instruction.
Is there any sort of SIMD instruction for this "categorizing" action that I'm doing?
The "insert" instructions seem somewhat promising, but I'm a bit too green to understand the descriptions at https://software.intel.com/en-us/node/695331.
If not, does anything come close?
Thanks!
You can do it with SIMD, but how fast it is will depend on exactly what instruction sets you have available, and how clever you are in your implementation.
One approach is to take the array and "sift" it to separate out elements that belong in different buckets. For example, grab 32 bytes from your array, which will hold 16 16-bit elements. Use some cmpgt instructions to get a mask which determines whether each element falls into the 00 + 01 bucket or the 02 + 03 bucket. Then use some kind of "compress" or "filter" operation to move all masked elements contiguously into one end of a register, and then the same for the unmasked elements.
Then repeat this one more time to sort out 00 from 01 and 02 from 03.
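To make the data flow concrete, here is a scalar sketch of those two passes (my illustration, with hypothetical names; a SIMD version would replace each scalar pass with a vector compare plus a compress/left-pack):

#include <stdint.h>
#include <stddef.h>

/* Pass 1 splits on bit 9 (00/01 vs 02/03), pass 2 splits on bit 8. */
static void sift(const uint16_t *in, size_t n,
                 uint16_t out[4][2048], size_t count[4])
{
    uint16_t lo01[2048], hi23[2048];
    size_t n01 = 0, n23 = 0;

    for (size_t i = 0; i < n; i++) {          /* pass 1 */
        if (in[i] & 0x0200) hi23[n23++] = in[i];
        else                lo01[n01++] = in[i];
    }
    count[0] = count[1] = count[2] = count[3] = 0;
    for (size_t i = 0; i < n01; i++) {        /* pass 2: 0x00xx vs 0x01xx */
        if (lo01[i] & 0x0100) out[1][count[1]++] = lo01[i];
        else                  out[0][count[0]++] = lo01[i];
    }
    for (size_t i = 0; i < n23; i++) {        /* pass 2: 0x02xx vs 0x03xx */
        if (hi23[i] & 0x0100) out[3][count[3]++] = hi23[i];
        else                  out[2][count[2]++] = hi23[i];
    }
}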
With AVX2 you could start with this question for inspiration on the "compress" operation. With AVX512 you could use the vcompress instruction to help out: it does exactly this operation but only at 32-bit or 64-bit granularity so you'd need to do a couple at least per vector.
You could also try a vertical approach, where you load N vectors and then swap elements between them so that the 0th vector has the smallest elements, etc. At this point, you can use a more optimized algorithm for the compress stage (e.g., if you vertically sort enough vectors, the vectors at the ends may be entirely elements starting with 0x00, etc.).
Finally, you might also consider organizing your data differently, either at the source or as a pre-processing step: separating out the "category" byte which is always 0-3 from the payload byte. Many of the processing steps only need to happen on one or the other, so you can potentially increase efficiency by splitting them out that way. For example, you could do the comparison operation on 32 bytes that are all categories, and then do the compress operation on the 32 payload bytes (at least in the final step where each category is unique).
This would lead to arrays of byte elements, not 16-bit elements, where the "category" byte is implicit. You've cut your data size in half, which might speed up everything else you want to do with the data in the future.
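A sketch of that pre-processing step (split_tags is a hypothetical helper name, assuming the 16-bit layout from the question):

#include <stdint.h>
#include <stddef.h>

/* Split each 16-bit element into its category byte (always 0x00..0x03)
   and its payload byte, so later passes can work on plain byte vectors. */
static void split_tags(const uint16_t *in, size_t n,
                       uint8_t *tags, uint8_t *payload)
{
    for (size_t i = 0; i < n; i++) {
        tags[i]    = (uint8_t)(in[i] >> 8);
        payload[i] = (uint8_t)(in[i] & 0xFF);
    }
}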
If you can't produce the source data in this format, you could use the bucketing as an opportunity to remove the tag byte as you put the payload into the right bucket, so the output is uint8_t out[4][2048];. If you're doing a SIMD left-pack with a pshufb byte-shuffle as discussed in comments, you could choose a shuffle control vector that packs only the payload bytes into the low half.
(Until AVX512BW, x86 SIMD doesn't have any variable-control word shuffles, only byte or dword, so you already needed a byte shuffle which can just as easily separate payloads from tags at the same time as packing payload bytes to the bottom.)

Applications of bitwise operators in C and their efficiency? [duplicate]

I am new to bitwise operators.
I understand how the logic functions work to get the final result. For example, when you bitwise AND two numbers, the final result is going to be the AND of those two numbers (1 & 0 = 0; 1 & 1 = 1; 0 & 0 = 0). Same with OR, XOR, and NOT.
What I don't understand is their application. I tried looking everywhere, and most sources just explain how bitwise operations work. Of all the bitwise operators, I only understand the application of the shift operators (multiplication and division). I also came across masking. I understand that masking is done using bitwise AND, but what exactly is its purpose, and where and how can I use it?
Can you elaborate on how I can use masking? Are there similar uses for OR and XOR?
The low-level use case for the bitwise operators is to perform base-2 math. There is the well-known trick to test whether a number is a power of 2:
if ((x & (x - 1)) == 0) {
    printf("%d is a power of 2\n", x);
}
But it can also serve a higher-level function: set manipulation. You can think of a collection of bits as a set. To explain, let each bit in a byte represent one of 8 distinct items, say the planets in our solar system (Pluto is no longer considered a planet, so 8 bits are enough!):
#define Mercury (1 << 0)
#define Venus (1 << 1)
#define Earth (1 << 2)
#define Mars (1 << 3)
#define Jupiter (1 << 4)
#define Saturn (1 << 5)
#define Uranus (1 << 6)
#define Neptune (1 << 7)
Then, we can form a collection of planets (a subset) using |:
unsigned char Giants = (Jupiter|Saturn|Uranus|Neptune);
unsigned char Visited = (Venus|Earth|Mars);
unsigned char BeyondTheBelt = (Jupiter|Saturn|Uranus|Neptune);
unsigned char All = (Mercury|Venus|Earth|Mars|Jupiter|Saturn|Uranus|Neptune);
Now, you can use & to test whether two sets have an intersection:
if (Visited & Giants) {
    puts("we might be giants");
}
The ^ operation is often used to see what is different between two sets (the union of the sets minus their intersection):
if (Giants ^ BeyondTheBelt) {
    puts("there are non-giants out there");
}
So, think of | as union, & as intersection, and ^ as union minus the intersection.
Once you buy into the idea of bits representing a set, then the bitwise operations are naturally there to help manipulate those sets.
One application of bitwise ANDs is checking if a single bit is set in a byte. This is useful in networked communication, where protocol headers attempt to pack as much information into the smallest area as is possible in an effort to reduce overhead.
For example, the IPv4 header uses the first 3 bits of byte 6 of the header (counting from zero) to tell whether the given IP packet can be fragmented, and if so, whether to expect more fragments of the given packet to follow. If these fields each occupied a full byte instead, each IP packet would be 21 bits larger than necessary. That translates to a huge amount of unnecessary data through the internet every day.
To retrieve these 3 bits, a bitwise AND can be used alongside a bit mask to determine whether they are set.
unsigned char mymask = 0x80;
if ((ipheader[6] & mymask) == mymask) {
    // the first of the three flag bits is set
}
Small sets, as has been mentioned. You can do a surprisingly large number of operations quickly; intersection, union, and (symmetric) difference are obviously trivial, but you can also efficiently do the following (a couple of these are sketched in code right after the list):
get the lowest item in the set with x & -x
remove the lowest item from the set with x & (x - 1)
add all items smaller than the smallest present item
add all items higher than the smallest present item
calculate their cardinality (though the algorithm is nontrivial)
permute the set in some ways, that is, change the indexes of the items (not all permutations are equally efficient)
calculate the lexicographically next set that contains as many items (Gosper's Hack)
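As a small illustration of items 1, 2, and the last one (my own sketch, not from the original answer):

#include <stdio.h>

/* Gosper's hack: the lexicographically next set of the same cardinality. */
static unsigned next_same_size(unsigned x) {
    unsigned low = x & -x;      /* item 1: isolate the lowest element */
    unsigned ripple = x + low;  /* the carry clears the low run of 1s */
    return (((x ^ ripple) >> 2) / low) | ripple;
}

int main(void) {
    unsigned s = 0x07;                                     /* the set {0,1,2} */
    printf("lowest element:  0x%x\n", s & -s);             /* item 1: 0x1 */
    printf("without lowest:  0x%x\n", s & (s - 1));        /* item 2: 0x6 */
    printf("next 3-item set: 0x%x\n", next_same_size(s));  /* 0xB = {0,1,3} */
    return 0;
}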
1 and 2 and their variations can be used to build efficient graph algorithms on small graphs, for example see algorithm R in The Art of Computer Programming 4A.
Other applications of bitwise operations include, but are not limited to,
Bitboards, important in many board games. Chess without bitboards is like Christmas without Santa. Not only is it a space-efficient representation, you can do non-trivial computations directly with the bitboard (see Hyperbola Quintessence)
sideways heaps, and their application in finding the Nearest Common Ancestor and computing Range Minimum Queries.
efficient cycle-detection (Gosper's Loop Detection, found in HAKMEM)
adding offsets to Z-curve addresses without deconstructing and reconstructing them (see Tesseral Arithmetic)
These uses are more powerful, but also advanced, rare, and very specific. They show, however, that bitwise operations are not just a cute toy left over from the old low-level days.
Example 1
If you have 10 booleans that "work together", you can simplify your code a lot.
int B1  = 0x001;  // bit 0
int B2  = 0x002;  // bit 1
int B10 = 0x200;  // bit 9
int someValue = get_a_value_from_somewhere();
if ((someValue & (B1 | B10)) == (B1 | B10)) {
    // B1 and B10 are both set
}
Example 2
Interfacing with hardware. An address on the hardware may need bit-level access to control the interface, e.g. an overflow bit on a buffer, or a status byte that can tell you the status of 8 different things. Using bit masking you can get down to the actual bit of info you need.
if (status_reg & 0x80) {
    // top bit in the byte is set, which may have special meaning
}
This is really just a specialized case of example 1.
Bitwise operators are particularly useful in systems with limited resources as each bit can encode a boolean. Using many chars for flags is wasteful as each takes one byte of space (when they could be storing 8 flags each).
Commonly microcontrollers have C interfaces for their IO ports in which each bit controls 1 of 8 ports. Without bitwise operators these would be quite difficult to control.
Regarding masking, it is common to use both & and |:
x & 0x0F //ensures the 4 high bits are 0
x | 0x0F //ensures the 4 low bits are 1
In microcontroller applications, you can use bitwise operations to switch between ports. For example, to turn on a single PORTB pin while turning off the rest, the following code can be used:
void main()
{
    unsigned char ON = 1;
    TRISB = 0;   // configure all PORTB pins as outputs
    PORTB = 0;
    while (1) {
        PORTB = ON;            // exactly one pin high
        delay_ms(200);
        ON = ON << 1;          // move to the next pin
        if (ON == 0) ON = 1;   // wrapped past bit 7; start over
    }
}

Need help understanding bitmaps, bitwise operations, and C

Disclaimer: I am asking these questions in relation to an assignment. The assignment itself calls for implementing a bitmap and doing some operations with that, but that is not what I am asking about. I just want to understand the concepts so I can try the implementation for myself.
I need help understanding bitmaps/bit arrays and bitwise operations. I understand the basics of binary and how left/right shift work, but I don't know exactly how that use is beneficial.
Basically, I need to implement a bitmap to store the results of a prime sieve (of Eratosthenes.) This is a small part of a larger assignment focused on different IPC methods, but to get to that part I need to get the sieve completed first. I've never had to use bitwise operations nor have I ever learned about bitmaps, so I'm kind of on my own to learn this.
From what I can tell, bitmaps are arrays of bits of a certain size, right? By that I mean you could have an 8-bit array or a 32-bit array (in my case, I need to find the primes for a 32-bit unsigned int, so I'd need the 32-bit array). So if this is an array of bits, 32 of them to be specific, then we're basically talking about a string of 32 1s and 0s. How does this translate into a list of primes? I figure one method would evaluate the binary number and save each prime to a new array as a decimal, so all the decimal primes exist in one array, but that seems like it uses too much data.
Do I have the gist of bitmaps? Or is there something I'm missing? I've tried reading about this around the internet but I can't find a source that makes it clear enough for me...
Suppose you have a list of primes: {3, 5, 7}. You can store these numbers as a character array: char c[] = {3, 5, 7} and this requires 3 bytes.
Instead, let's use a single byte such that each set bit indicates that the number is in the set, for example 01010100 (bits 2, 4, and 6 set, counting from the right). If we can set the bits we want and later test them, we can store the same information in a single byte. To set it:
char b = 0;
// want to set `3` so shift 1 twice to the left
b = b | (1 << 2);
// also set `5`
b = b | (1 << 4);
// and 7
b = b | (1 << 6);
And to test these numbers:
// is 3 in the map:
if (b & (1 << 2)) {
    // it is in...
}
You are going to need a lot more than 32 bits.
You want a sieve for up to 2^32 numbers, so you will need a bit for each one of them. Each bit will represent one number, and will be 0 if the number is prime and 1 if it is composite. (You could save a bit or two by noting that the sieve really starts at 2, since 0 and 1 are neither prime nor composite, but it is easier to waste those bits.)
2^32 = 4,294,967,296 bits; divide by 8 to get 536,870,912 bytes, or 1/2 GB.
So you will want an array of 2^29 bytes, or 2^27 4-byte words, or whatever you decide is best, and also a method for manipulating the individual bits stored in the chars (ints) in the array.
It sounds like eventually you are going to have several threads or processes operating on this shared memory. You may need to store it all in a file if you can't allocate that much memory to yourself.
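A rough sketch of the allocation (heap rather than stack, given the size; the names are mine):

#include <stdint.h>
#include <stdlib.h>

int main(void)
{
    /* 2^32 bits = 2^29 bytes (512 MB), zeroed so every number starts
       out marked "prime". */
    uint8_t *arr = calloc((size_t)1 << 29, 1);
    if (arr == NULL)
        return 1;  /* fall back to a smaller range or a file-backed map */
    /* ... run the sieve, share the memory, etc. ... */
    free(arr);
    return 0;
}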
Say you want to find the bit for x. Then let a = x / 8 and b = x - 8 * a. Then the bit is at arr[a] & (1 << b). (Avoid the modulus operator % wherever possible.)
// mark x composite
a = x / 8;
b = x - 8 * a;
arr[a] |= 1 << b;
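And the matching test under the same layout (a sketch; arr and the 0-means-prime convention are as above):

/* Nonzero if x is still marked prime (its bit is clear). */
int is_prime(const uint8_t *arr, uint32_t x)
{
    uint32_t a = x / 8;
    uint32_t b = x - 8 * a;  /* same as x % 8 */
    return !(arr[a] & (1u << b));
}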
This sounds like a fun assignment!
A bitmap allows you to construct a large predicate function over the range of numbers you're interested in. If you just have a single 8-bit char, you can store Boolean values for each of the eight values. If you have 2 chars, it doubles your range.
So, say you have a bitmap that already has this information stored, your test function could look something like this:
bool num_in_bitmap(int num, char *bitmap, size_t sz) {
    if (num / 8 >= sz) return 0;
    return (bitmap[num / 8] >> (num % 8)) & 1;
}

optimized byte array shifter

I'm sure this has been asked before, but I need to implement a shift operator on a byte array of variable length. I've looked around a bit but have not found any standard way of doing it. I came up with an implementation which works, but I'm not sure how efficient it is. Does anyone know of a standard way to shift an array, or at least have any recommendations on how to boost the performance of my implementation?
char* baLeftShift(const char* array, size_t size, signed int displacement, char* result)
{
    memcpy(result, array, size);
    short shiftBuffer = 0;
    char carryFlag = 0;
    char* byte;
    if (displacement > 0)
    {
        for (; displacement--;)
        {
            for (byte = &result[size - 1]; byte >= result; byte--)
            {
                shiftBuffer = *byte;
                shiftBuffer <<= 1;
                *byte = carryFlag | (char)shiftBuffer;
                carryFlag = ((char*)&shiftBuffer)[1]; /* high byte; assumes little-endian */
            }
        }
    }
    else
    {
        char* end = result + size;
        displacement = -displacement;
        for (; displacement--;)
        {
            for (byte = result; byte < end; byte++)
            {
                shiftBuffer = *byte;
                shiftBuffer <<= 7;
                *byte = carryFlag | ((char*)&shiftBuffer)[1]; /* assumes little-endian */
                carryFlag = (char)shiftBuffer;
            }
        }
    }
    return result;
}
If I can just add to what dwelch is saying, you could try this.
Just move the bytes to their final locations. Then you are left with a shift count such as 3, for example, if each byte still needs to be left-shifted 3 bits into the next higher byte. (This assumes in your mind's eye the bytes are laid out in ascending order from right to left.)
Then rotate each byte to the left by 3. A lookup table might be faster than individually doing an actual rotate. Then, in each byte, the 3 bits to be shifted are now in the right-hand end of the byte.
Now make a mask M, which is (1<<3)-1, which is simply the low order 3 bits turned on.
Now, in order, from high order byte to low order byte, do this:
c[i] ^= M & (c[i] ^ c[i-1])
That will copy bits to c[i] from c[i-1] under the mask M.
For the last byte, just use a 0 in place of c[i-1].
For right shifts, same idea.
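Putting the left-shift version together in C (my sketch of the steps above, assuming c[0] is the least significant byte and a shift count k with 0 < k < 8):

#include <stdint.h>
#include <stddef.h>

static void left_shift_small(uint8_t *c, size_t n, unsigned k)
{
    uint8_t M = (uint8_t)((1u << k) - 1);   /* low-order k bits turned on */
    if (n == 0) return;
    for (size_t i = 0; i < n; i++)          /* rotate each byte left by k */
        c[i] = (uint8_t)((c[i] << k) | (c[i] >> (8 - k)));
    for (size_t i = n - 1; i > 0; i--)      /* high to low: merge under M */
        c[i] ^= M & (c[i] ^ c[i - 1]);      /* copy carried bits from below */
    c[0] &= (uint8_t)~M;                    /* last byte: 0 in place of c[i-1] */
}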
My first suggestion would be to eliminate the for loops around the displacement. You should be able to do the necessary shifts without the for(;displacement--;) loops. For displacements of magnitude greater than 7, things get a little trickier because your inner loop bounds will change and your source offset is no longer 1. i.e. your input buffer offset becomes magnitude / 8 and your shift becomes magnitude % 8.
It does look inefficient and perhaps this is what Nathan was referring to.
Assuming a char is 8 bits where this code is running, there are two things to do. First, move the whole bytes: for example, if your input array is 0x00,0x00,0x12,0x34 and you shift left 8 bits, you get 0x00,0x12,0x34,0x00. There is no reason to do that in a loop 8 times, one bit at a time; start by shifting whole chars in the array by (displacement>>3) locations, something like for(ra=0;ra<size-(displacement>>3);ra++) array[ra] = array[ra+(displacement>>3)];, and pad the holes created with zeros. Then deal with the remaining (displacement&7) bits by combining each byte with its neighbour, on the order of array[ra] = (array[ra]<<(displacement&7)) | ((array[ra+1]>>(7-(displacement&7)))>>1);. A good compiler will precompute (displacement>>3), displacement&7, 7-(displacement&7), and a good processor will have enough registers to keep all of those values. You might help the compiler by making separate variables for each of those items, but depending on the compiler and how you are using it, that could make it worse too.
The bottom line, though, is to time the code. Perform a thousand 1-bit shifts, then a thousand 2-bit shifts, etc.; time the whole thing, then try a different algorithm, time it the same way, and see whether the optimizations actually make it better or worse. If you know ahead of time this code will only ever be used for single-bit or sub-8-bit shifts, adjust the timing test accordingly.
Your use of the carry flag implies that you are aware that many processors have instructions specifically for chaining arbitrarily long shifts using the standard register length: basically a rotate-through-carry, one bit at a time, which the C language does not support directly. For chaining single-bit shifts you could consider assembler and likely outperform the C code; at the least, single-bit shifts are faster there than C can manage. A hybrid would be to move the bytes, then if the number of bits to shift (displacement&7) is less than, say, 4, use the assembler, else use a C loop. Again, the timing tests will tell you where the optimizations are.
