Is there any way to implement the logic below in NEON?
I could not find any multiply-accumulate instruction that takes 64-bit inputs and produces a 64-bit result.
int64x2_t result;
int64x2_t num1;
int64x2_t num2;
result += num1 * num2;
Technically, multiplying two 64-bit values can produce a 128-bit result. That's why the following int64 += int32 × int32 multiply-accumulate intrinsics exist, but no variant that takes two 64-bit inputs.
int64x2_t vmlal_s32 (int64x2_t, int32x2_t, int32x2_t);
int64x2_t vqdmlal_s32 (int64x2_t, int32x2_t, int32x2_t);
If those don't work for you, then you'll need to do the 64×64 multiplications as scalar operations and accumulate the results with vaddq_s64.
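A minimal sketch of that approach, assuming only the low 64 bits of each product are wanted (the helper name mla_s64 is mine, not a NEON intrinsic):
#include <arm_neon.h>
// Sketch: scalar 64x64 multiply per lane, then a single vector accumulate.
static inline int64x2_t mla_s64(int64x2_t acc, int64x2_t a, int64x2_t b)
{
    int64_t p0 = vgetq_lane_s64(a, 0) * vgetq_lane_s64(b, 0);
    int64_t p1 = vgetq_lane_s64(a, 1) * vgetq_lane_s64(b, 1);
    int64x2_t prod = vcombine_s64(vcreate_s64((uint64_t)p0),
                                  vcreate_s64((uint64_t)p1));
    return vaddq_s64(acc, prod);
}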
Note: Visual Studio implements _mul128, _umul128, __mulh, and __umulh on all architectures, including ARM, for handling the full 64×64 = 128-bit scenario.
Take the macro:
GPIOxMODE(gpio,mode,port) ( GPIO##gpio->MODER = ((GPIO##gpio->MODER & ~((uint32_t)GPIO2BITMASK << (port*2))) | (mode << (port * 2))) )
Assuming that the reset value of the register is 0xFFFFFFFF, I want to set a 2-bit-wide field to an arbitrary value. This was written for an STM32 MCU whose ports have 16 pins each (numbered 0 to 15). GPIO2BITMASK is defined as 0x3. Is there a better way of clearing and setting an arbitrary 2 bits anywhere in the 32-bit-wide register?
Valid range for port 0 - 15
Valid range for mode 0 - 3
The method I came up with is to bit-shift the mask, invert it, AND it with the existing register value, and OR the result with the bit-shifted new value.
I am looking to combine the mask and the new value to reduce the number of logical and bit-shift operations. The goal is also to keep the process generic enough that I can use it for fields of 1, 2, 3 or 4 bits.
Is there a better way?
The long and short of it is that "is there a better way" is really an open-ended question. I am looking specifically for a method that reduces the number of logical and bit-shift operations while remaining a simple one-line statement.
The answer is NO.
You MUST do a reset/set to ensure that the bit field you are writing has the desired value.
The answers received may be better (as a matter of opinion/preference/philosophy/practice) in that they aren't necessarily macros and they include parameter checking. Pitfalls of this style have also been pointed out in both the comments and the responses.
This kind of macro should be avoided like the plague, for many reasons:
They are not debuggable.
They are error-prone, and the errors are hard to find.
And many other reasons.
You can achieve the same result using inline functions; the resulting code will be just as efficient:
static inline __attribute__((always_inline)) void GPIOMODE(GPIO_TypeDef *gpio, unsigned mode, unsigned pin)
{
    gpio->MODER &= ~(GPIO_MODER_MODE0_Msk << (pin * 2));
    gpio->MODER |= mode << (pin * 2);
}
But if you love macros:
#define GPIOxMODE(gpio,mode,port) {volatile uint32_t *mdr = &GPIO##gpio->MODER; *mdr &= ~(GPIO_MODER_MODE0_Msk << (port*2)); *mdr |= mode << (port * 2);}
I am looking to combine the mask and the new value to reduce the number of
logical and bit-shift operations.
You can't. You need to reset the bits and then set them.
The method I came up with is to bit-shift the mask, invert it, AND it with
the existing register value, and OR the result with the bit-shifted new
value.
That, or an equivalent, is the way to do it.
I am looking to combine the mask and the new value to reduce the number of
logical and bit-shift operations. The goal is also to keep the process
generic enough that I can use it for fields of 1, 2, 3 or 4 bits.
Is there a better way?
You must accomplish two basic objectives:
ensure that the bits that should be off in the affected range are in fact off, and
ensure that the bits that should be on in the affected range are in fact on.
In the general case, those require two separate operations: a bitwise AND to force bits off, and a bitwise OR (or XOR, if the bits are first cleared) to turn the wanted bits on. There may be ways to shortcut for specific cases of original and target values, but if you want something general-purpose, as you say, then your options are limited.
Personally, though, I think I would be inclined to build it from multiple pieces, separating the GPIO selection from the actual computation. At minimum, you can separate out a generic macro for setting a range of bits:
#define SETBITS32(x,bits,offset,mask) ((((uint32_t)(x)) & ~(((uint32_t)(mask)) << (offset))) | (((uint32_t)(bits)) << (offset)))
#define GPIOxMODE(gpio,mode,port) (GPIO##gpio->MODER = SETBITS32(GPIO##gpio->MODER, mode, (port) * 2, GPIO2BITMASK))
But do note that there appears to be no good way to avoid such a macro evaluating some of its arguments more than once. It might therefore be safer to write SETBITS32 as a function instead. The compiler will probably inline such a function in any case, but you can maximize the likelihood of that by declaring it static and inline:
static inline uint32_t SETBITS32(uint32_t x, uint32_t bits, unsigned offset, uint32_t mask) {
    return (x & ~(mask << offset)) | (bits << offset);
}
That's easier to read, too, though it, like the macro, does assume that bits has no set bits outside the mask region.
Of course there are other, similar formulations. For instance, if you do not need to support discontinuous bit ranges, you might specify a bit count instead of a bit mask. This alternative does that, protects against the user providing bits outside the specified range, and also has some parameter validation:
static inline uint32_t set_bitrange_32(uint32_t x, uint32_t bits, unsigned width,
unsigned offset) {
if (width + offset > 32) {
// error: invalid parameters
return x;
} else if (width == 0) {
return x;
}
uint32_t mask = ~(uint32_t)0 >> (32 - width);
    return (x & ~(mask << offset)) | ((bits & mask) << offset);
}
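For example (hypothetical register and values), writing mode 2 into the field for pin 5 of port A might look like:
// Hypothetical usage: set the 2-bit MODER field for pin 5 to mode 2.
GPIOA->MODER = set_bitrange_32(GPIOA->MODER, 2u, 2, 5 * 2);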
I have an embedded project with a USART HAL. This USART can only transmit or receive 8 or 16 bits at a time (depending on the USART register I chose, i.e. single/double in/out). Since it's a 32-bit MCU, I figured I might as well pass around 32-bit fields, as (from what I have been led to understand) this is a more efficient use of bits for the MPU. The same would apply for a 64-bit MPU, i.e. pass around 64-bit integers. Perhaps that is misguided advice, or advice taken out of context.
With that in mind, I have packed the 8 bits into a 32-bit field via bit-shifting. I do this for both tx and rx on the usart.
The code for the 8-bit only register is as follows (the 16-bit register just has half the amount of rounds for bit-shifting):
int zg_usartTxdataWrite(USART_data* MPI_buffer,
USART_frameconf* MPI_config,
USART_error* MPI_error)
{
MPI_error = NULL;
if(MPI_config != NULL){
zg_usartFrameConfWrite(MPI_config);
}
HPI_usart_data.txdata = MPI_buffer->txdata;
for (int i = 0; i < USART_TXDATA_LOOP; i++){
if((USART_STATUS_TXC & usart->STATUS) > 0){
usart->TXDATAX = (i == 0 ? (HPI_usart_data.txdata & USART_TXDATA_DATABITS) : (HPI_usart_data.txdata >> SINGLE_BYTE_SHIFT) & USART_TXDATA_DATABITS);
}
usart->IFC |= USART_STATUS_TXC;
}
return 0;
}
EDIT: RE-ENTERING THE LOGIC OF THE ABOVE CODE, WITH ADDED DEFINES, FOR CLARITY OF THE TERNARY-OPERATOR IMPLICIT PROMOTION PROBLEM DISCUSSED IN THE COMMENTS SECTION
(The HPI_usart and USART_data structs are the same, just at different levels; I have since removed the HPI_usart layer, but for the sake of this example I will leave it in.)
#define USART_TXDATA_LOOP 4
#define SINGLE_BYTE_SHIFT 8
typedef struct HPI_USART_DATA{
...
uint32_t txdata;
...
} HPI_usart;
HPI_usart HPI_usart_data = {'\0'};
const uint8_t USART_TXDATA_DATABITS = 0xFF;
int zg_usartTxdataWrite(USART_data* MPI_buffer,
USART_frameconf* MPI_config,
USART_error* MPI_error)
{
MPI_error = NULL;
if(MPI_config != NULL){
zg_usartFrameConfWrite(MPI_config);
}
HPI_usart_data.txdata = MPI_buffer->txdata;
for (int i = 0; i < USART_TXDATA_LOOP; i++){
if((USART_STATUS_TXC & usart->STATUS) > 0){
usart->TXDATAX = (i == 0 ? (HPI_usart_data.txdata & USART_TXDATA_DATABITS) : (HPI_usart_data.txdata >> SINGLE_BYTE_SHIFT) & USART_TXDATA_DATABITS);
}
usart->IFC |= USART_STATUS_TXC;
}
return 0;
}
However, I now realize that this is potentially causing more issues than it solves, because I am essentially encoding these bits internally, only for them to be decoded almost immediately when they are passed to/from different data layers. I feel like it's a clever and sexy solution, but I'm now trying to solve a problem that I shouldn't have created in the first place: for example, how to extract variable-width bit fields when there is an offset, as in GPS NMEA sentences where the first 8 bits might be one relevant field and the rest are 32-bit fields. It ends up looking like this:
32-bit array member 0:
| bits 24-31  | bits 16-23                 | bits 8-15                  | bits 0-7                  |
| 8-bit value | 32-bit Value A, bits 24-31 | 32-bit Value A, bits 16-23 | 32-bit Value A, bits 8-15 |
32-bit array member 1:
| bits 24-31               | bits 16-23                 | bits 8-15                  | bits 0-7                  |
| 32-bit Value A, bits 0-7 | 32-bit Value B, bits 24-31 | 32-bit Value B, bits 16-23 | 32-bit Value B, bits 8-15 |
32-bit array member 2:
| bits 24-31               | bits 16-23 | bits 8-15 | bits 0-7 |
| 32-bit Value B, bits 0-7 | etc.       | ...       | ...      |
The above example requires manual decoding, which is fine, I guess, but it's different for every NMEA sentence and just feels more manual than programmatic.
My question is this: bitshifting vs array indexing, which is more appropriate?
Should I just have assigned each incoming/outgoing value to a 32-bit array member and then indexed that way? I feel like that is the solution, since it would not only make it easier to traverse the data on other layers, but also let me eliminate all this bit-shifting logic, and then the only difference between an rx and a tx function would be the direction the data is going.
It does mean a small rewrite of the interface and the resulting GPS module layer, but that feels like less work and also a cheap lesson early on in my project.
Also any thoughts and general experience on this would be great.
Since it's a 32-bit MCU, I figured I might as well pass around 32-bit fields
That's not really the programmer's call to make. Put the 8 or 16 bit variable in a struct. Let the compiler add padding if needed. Alternatively you can use uint_fast8_t and uint_fast16_t.
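A minimal sketch of that advice (the struct and field names are illustrative, not from the original code):
#include <stdint.h>
// Store data at its natural width; the compiler handles alignment/padding.
typedef struct {
    uint8_t  txdata;    /* exactly the width the USART data register accepts */
    uint16_t status;    /* the compiler may insert padding before this member */
} usart_frame;
// Or let the compiler pick the fastest type holding at least 8/16 bits:
uint_fast8_t  tx_byte;
uint_fast16_t tx_halfword;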
My question is this: bitshifting vs array indexing, which is more appropriate?
Array indexing is for accessing arrays. If you have an array, use it. If not, then don't.
While it is possible to chew through larger chunks of data byte by byte, such code must be written much more carefully, to prevent running into various subtle type conversion and pointer aliasing bugs.
In general, bit shifting is preferred when accessing data up to the CPU's word size, 32 bits in this case. It is fast and also portable, so you don't have to take endianness into account. It is the preferred method of serialization/de-serialization of integers.
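A minimal sketch of that kind of shift-based (de)serialization, assuming a big-endian wire format (the function names are mine):
#include <stdint.h>
// Pack/unpack a 32-bit value using only shifts; the result does not
// depend on the endianness of the host CPU.
static void put_u32_be(uint8_t *buf, uint32_t v)
{
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)(v >> 16);
    buf[2] = (uint8_t)(v >> 8);
    buf[3] = (uint8_t)v;
}
static uint32_t get_u32_be(const uint8_t *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}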
I have a question about using 128-bit registers to gain speed in my code. Consider the following C/C++ code: I define two unsigned long long ints a and b, and give them some values.
unsigned long long int a = 4369, b = 56481;
Then, I want to compute
a & b;
Here a is represented in the computer as a 64-bit number, 4369 = 1000100010001 in binary, and likewise b = 56481 = 1101110010100001, and I compute a & b, which is still a 64-bit number given by the bit-by-bit logical AND of a and b:
a & b = 1000000000001
My question is the following: do computers have a 128-bit register where I could do the operation above, but with 128-bit integers rather than 64-bit integers, in the same amount of time? To be clearer: I would like to gain a factor of two in speed by using 128-bit numbers rather than 64-bit numbers, i.e. compute 128 ANDs rather than 64 ANDs (one AND for every bit) in the same time. If this is possible, do you have a code example? I have heard that the SSE registers might do this, but I am not sure.
Yes, SSE2 has a 128 bit bitwise AND - you can use it via intrinsics in C or C++, e.g.
#include "emmintrin.h" // SSE2 intrinsics
__m128i v0, v1, v2; // 128 bit variables
v2 = _mm_and_si128(v0, v1); // bitwise AND
or you can use it directly in assembler - the instruction is PAND.
You can even do a 256 bit AND on Haswell and later CPUs which have AVX2:
#include "immintrin.h" // AVX2 intrinsics
__m256i v0, v1, v2; // 256 bit variables
v2 = _mm256_and_si256(v0, v1); // bitwise AND
The corresponding instruction in this case is VPAND.
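Whether this yields a full 2x speedup depends on the surrounding code (loads, stores and loop overhead are not free), but the AND itself does process 128 bits at once. A minimal sketch of using it over arrays (the function name and the assumption that n is even are mine):
#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>
#include <stdint.h>
// AND two arrays of 64-bit words, 128 bits per iteration.
// Assumes n is even; unaligned loads/stores keep the sketch simple.
void and_arrays(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128i va = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i vb = _mm_loadu_si128((const __m128i *)&b[i]);
        _mm_storeu_si128((__m128i *)&dst[i], _mm_and_si128(va, vb));
    }
}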
I am new to bitwise operators.
I understand how the logic functions work to get the final result. For example, when you bitwise AND two numbers, the final result is going to be the AND of those two numbers (1 & 0 = 0; 1 & 1 = 1; 0 & 0 = 0). Same with OR, XOR, and NOT.
What I don't understand is their application. I tried looking everywhere, and most sources just explain how the bitwise operations work. Of all the bitwise operators, I only understand the application of the shift operators (multiplication and division). I also came across masking. I understand that masking is done using bitwise AND, but what exactly is its purpose, and where and how can I use it?
Can you elaborate on how I can use masking? Are there similar uses for OR and XOR?
The low-level use case for the bitwise operators is to perform base-2 math. There is the well-known trick to test whether a (nonzero) number is a power of 2:
if ((x & (x - 1)) == 0) {
printf("%d is a power of 2\n", x);
}
But they can also serve a higher-level function: set manipulation. You can think of a collection of bits as a set. To explain, let each bit in a byte represent one of 8 distinct items, say the planets in our solar system (Pluto is no longer considered a planet, so 8 bits are enough!):
#define Mercury (1 << 0)
#define Venus (1 << 1)
#define Earth (1 << 2)
#define Mars (1 << 3)
#define Jupiter (1 << 4)
#define Saturn (1 << 5)
#define Uranus (1 << 6)
#define Neptune (1 << 7)
Then, we can form a collection of planets (a subset) like so, using |:
unsigned char Giants = (Jupiter|Saturn|Uranus|Neptune);
unsigned char Visited = (Venus|Earth|Mars);
unsigned char BeyondTheBelt = (Jupiter|Saturn|Uranus|Neptune);
unsigned char All = (Mercury|Venus|Earth|Mars|Jupiter|Saturn|Uranus|Neptune);
Now, you can use a & to test if two sets have an intersection:
if (Visited & Giants) {
puts("we might be giants");
}
The ^ operation is often used to see what is different between two sets (the union of the sets minus their intersection):
if (Giants ^ BeyondTheBelt) {
puts("there are non-giants out there");
}
So, think of | as union, & as intersection, and ^ as union minus the intersection.
Once you buy into the idea of bits representing a set, then the bitwise operations are naturally there to help manipulate those sets.
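For instance, continuing the planet example above, adding, removing and toggling individual members looks like this:
unsigned char Visited = (Venus|Earth|Mars);
Visited |= Jupiter;   // add Jupiter to the set
Visited &= ~Venus;    // remove Venus from the set
Visited ^= Mars;      // toggle Mars: remove it if present, add it if absent
if (Visited & Jupiter) {
    puts("we made it to Jupiter");
}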
One application of bitwise ANDs is checking if a single bit is set in a byte. This is useful in networked communication, where protocol headers attempt to pack as much information into the smallest area as is possible in an effort to reduce overhead.
For example, the IPv4 header uses the first 3 bits of byte 6 to tell whether the given IP packet can be fragmented, and if so, whether more fragments of the packet should be expected to follow. If these fields occupied a full byte each instead, every IP packet would be 21 bits larger than necessary. That translates to a huge amount of unnecessary data through the internet every day.
To retrieve these 3 bits, a bitwise AND can be used together with a bit mask to determine whether they are set.
unsigned char mymask = 0x80;
if ((ipheader[6] & mymask) == mymask) {
    // the most significant bit of byte 6 of the IP header is set
}
Small sets, as has been mentioned. You can do a surprisingly large number of operations quickly: intersection, union and (symmetric) difference are obviously trivial, but, for example, you can also efficiently:
1. get the lowest item in the set with x & -x
2. remove the lowest item from the set with x & (x - 1)
3. add all items smaller than the smallest present item
4. add all items higher than the smallest present item
5. calculate their cardinality (though the algorithm is nontrivial)
6. permute the set in some ways, that is, change the indexes of the items (not all permutations are equally efficient)
7. calculate the lexicographically next set that contains as many items (Gosper's Hack)
1 and 2 and their variations can be used to build efficient graph algorithms on small graphs; see, for example, algorithm R in The Art of Computer Programming, Volume 4A.
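As a small illustration, tricks 1 and 2 combine into the usual loop for visiting every member of such a set (__builtin_ctz is a GCC/Clang builtin; other compilers have equivalents):
#include <stdint.h>
#include <stdio.h>
// Visit every element of a small set stored as the bits of a 32-bit word.
void visit_members(uint32_t set)
{
    while (set != 0) {
        uint32_t lowest = set & -set;   // isolate the lowest set bit
        printf("member %u\n", (unsigned)__builtin_ctz(lowest));
        set &= set - 1;                 // remove the lowest set bit
    }
}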
Other applications of bitwise operations include, but are not limited to,
Bitboards, important in many board games. Chess without bitboards is like Christmas without Santa. Not only is it a space-efficient representation, you can do non-trivial computations directly with the bitboard (see Hyperbola Quintessence)
sideways heaps, and their application in finding the Nearest Common Ancestor and computing Range Minimum Queries.
efficient cycle-detection (Gosper's Loop Detection, found in HAKMEM)
adding offsets to Z-curve addresses without deconstructing and reconstructing them (see Tesseral Arithmetic)
These uses are more powerful, but also advanced, rare, and very specific. They show, however, that bitwise operations are not just a cute toy left over from the old low-level days.
Example 1
If you have 10 booleans that "work together", you can simplify your code a lot.
int B1  = 0x001;
int B2  = 0x002;
/* ... */
int B10 = 0x200;
int someValue = get_a_value_from_somewhere();
if (someValue & (B1 | B10)) {
    // B1 or B10 (or both) are set
}
Example 2
Interfacing with hardware. An address on the hardware may need bit-level access to control the interface, e.g. an overflow bit on a buffer, or a status byte that can tell you the status of 8 different things. Using bit masking you can get down to the actual bit of info you need.
if (register & 0x80) {
// top bit in the byte is set which may have special meaning.
}
This is really just a specialized case of example 1.
Bitwise operators are particularly useful in systems with limited resources as each bit can encode a boolean. Using many chars for flags is wasteful as each takes one byte of space (when they could be storing 8 flags each).
Commonly microcontrollers have C interfaces for their IO ports in which each bit controls 1 of 8 ports. Without bitwise operators these would be quite difficult to control.
Regarding masking, it is common to use both & and |:
x & 0x0F //forces the 4 high bits of the byte to 0, keeping the low 4 bits
x | 0x0F //forces the 4 low bits to 1, keeping the high bits
In microcontroller applications, you can use bitwise operations to switch between port pins. For example, if we would like to turn on a single pin of a port while turning off the rest, the following code can be used (it walks a single set bit across PORTB):
void main()
{
    unsigned char ON = 1;
    TRISB = 0;                 // configure all PORTB pins as outputs
    PORTB = 0;
    while (1) {
        PORTB = ON;            // only the pin matching the set bit is driven high
        delay_ms(200);
        ON = ON << 1;          // move the set bit to the next pin
        if (ON == 0) ON = 1;   // wrap around after the last pin
    }
}
gcc 4.4 seems to be the first version in which __int128_t was added. I need to use bit shifting and I have run out of room in some bit fields.
Edit: It might be because I'm on a 32-bit computer; there's no way to have it on a 32-bit target (Intel Atom), is there? I wouldn't care if it generated tricky, slow machine code, as long as it worked as expected with bit shifting.
I'm pretty sure that __int128_t is available in earlier versions of gcc. I just checked gcc 4.2.1 on FreeBSD, and sizeof(__int128_t) gives 16.
You could also use a library. That would have the advantage of being portable (with regard to platform and compiler), and you could easily switch to an even bigger data type. One I can recommend is GMP (even though it is aimed not at a fixed bit width but at numbers as big as you want).
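A small sketch of a left shift with GMP's mpz API, assuming GMP is installed and the program is linked with -lgmp (stdio.h is included before gmp.h so the formatted I/O prototypes are declared):
#include <stdio.h>
#include <gmp.h>
int main(void)
{
    mpz_t x;
    mpz_init_set_ui(x, 1);
    mpz_mul_2exp(x, x, 100);            // x = x << 100
    gmp_printf("1 << 100 = %Zx\n", x);  // prints 2^100 in hex
    mpz_clear(x);
    return 0;
}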
Bit shifting across any arbitrary number of bits is very easy. Just remember to carry the bits that shift out of one limb into the next limb. That's all:
typedef struct {
int64_t high;
uint64_t low;
} int128_t;
// Valid for shiftcount < 64. The shiftcount == 0 case is handled separately,
// because v.low >> (64 - 0) would be undefined behaviour.
int128_t shift_left(int128_t v, unsigned shiftcount)
{
    int128_t result;
    if (shiftcount == 0)
        return v;
    result.high = (v.high << shiftcount) | (int64_t)(v.low >> (64 - shiftcount));
    result.low = v.low << shiftcount;
    return result;
}
Shifting right is similar:
int128_t shift_right(int128_t v, unsigned shiftcount)
{
    int128_t result;
    if (shiftcount == 0)   // avoid the undefined v.high << 64
        return v;
    result.low = (v.low >> shiftcount) | ((uint64_t)v.high << (64 - shiftcount));
    result.high = v.high >> shiftcount;   // arithmetic shift preserves the sign
    return result;
}
You could use two 64-bit ints, but then you need to keep track of the bits moving between them.