How do I reorder vector data using ARM Neon intrinsics?

This is specifically related to ARM Neon SIMD coding. I am using ARM Neon intrinsics for a certain module in a video decoder. I have vectorized data as follows:
There are four 32 bit elements in a Neon register - say, Q0 - which is of size 128 bit.
3B 3A 1B 1A
There are another four 32 bit elements in another Neon register, say Q1, which is of size 128 bit.
3D 3C 1D 1C
I want the final data to be in order as shown below:
1D 1C 1B 1A
3D 3C 3B 3A
What Neon intrinsics can achieve the desired data order?

How about something like this:
int32x4_t q0, q1;
/* split into 64 bit vectors */
int32x2_t q0_hi = vget_high_s32 (q0);
int32x2_t q1_hi = vget_high_s32 (q1);
int32x2_t q0_lo = vget_low_s32 (q0);
int32x2_t q1_lo = vget_low_s32 (q1);
/* recombine into 128 bit vectors */
q0 = vcombine_s32 (q0_lo, q1_lo);
q1 = vcombine_s32 (q0_hi, q1_hi);
In theory this should compile to just two move instructions, because vget_high and vget_low just reinterpret the 128 bit Q registers as two 64 bit D registers. vcombine, on the other hand, compiles to one or two moves (depending on register allocation).
Oh - and the order of the integers in the output could be exactly the wrong way around. If so just swap the arguments to vcombine_s32.
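For completeness, here is a minimal self-contained sketch of that approach (untested; the function and pointer names are just for illustration, and it assumes the element order from the question with element 0 listed first):
#include <arm_neon.h>

void reorder(const int32_t *in0, const int32_t *in1,
             int32_t *out0, int32_t *out1)
{
    int32x4_t q0 = vld1q_s32(in0);  /* elements: 1A 1B 3A 3B */
    int32x4_t q1 = vld1q_s32(in1);  /* elements: 1C 1D 3C 3D */
    int32x4_t r0 = vcombine_s32(vget_low_s32(q0),  vget_low_s32(q1));  /* 1A 1B 1C 1D */
    int32x4_t r1 = vcombine_s32(vget_high_s32(q0), vget_high_s32(q1)); /* 3A 3B 3C 3D */
    vst1q_s32(out0, r0);
    vst1q_s32(out1, r1);
}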

Remember each q register is made up of two d registers; for instance, the low part of q0 is d0 and the high part is d1. So in fact this operation is just swapping d0 and d3 (or d1 and d2; it is not entirely clear from your data presentation). There is even a swap instruction to do it in one instruction!
Disclaimer: I don't know Neon intrinsics (I directly code in assembly), though I'd be surprised if this couldn't be done using intrinsics.

It looks like you should be able to use the VTRN instruction (e.g. vtrnq_u32) for this.

Pierre is right.
vswp d0, d3
that will do.
@Pierre:
I read the post about NEON on your blog several months ago. I was pleasantly surprised that there was someone like me - writing hand-optimized assembly code, both ARM and NEON.
Nice to see you.

Related

Extract 10bits words from bitstream

I need to extract all 10-bit words from a raw bitstream which is built as ABACABACABAC...
It already works with a naive C implementation like
for (uint8_t *ptr = in_packet; ptr < max; ptr += 5) {
    const uint64_t val =
        (((uint64_t)(*(ptr + 4))) << 32) |
        (((uint64_t)(*(ptr + 3))) << 24) |
        (((uint64_t)(*(ptr + 2))) << 16) |
        (((uint64_t)(*(ptr + 1))) << 8) |
        (((uint64_t)(*(ptr + 0))) << 0);
    *a_ptr++ = (val >> 0);
    *b_ptr++ = (val >> 10);
    *a_ptr++ = (val >> 20);
    *c_ptr++ = (val >> 30);
}
But performance is inadequate for my application so I would like to improve this using some AVX2 optimisations.
I visited https://software.intel.com/sites/landingpage/IntrinsicsGuide/# to find functions that could help, but it seems there is nothing that works with 10-bit words, only 8 or 16 bit. That seems logical since 10 bit is not native for a processor, but it makes things hard for me.
Is there any way to use AVX2 to solve this problem?
Your scalar loop does not compile efficiently. Compilers do it as 5 separate byte loads. You can express an unaligned 8-byte load in C++ with memcpy:
#include <stdint.h>
#include <string.h>
// do an 8-byte load that spans the 5 bytes we want
// clang auto-vectorizes using an AVX2 gather for 4 qwords. Looks pretty clunky but not terrible
void extract_10bit_fields_v2calar(const uint8_t *__restrict src,
                                  uint16_t *__restrict a_ptr, uint16_t *__restrict b_ptr, uint16_t *__restrict c_ptr,
                                  const uint8_t *max)
{
    for (const uint8_t *ptr = src; ptr < max; ptr += 5) {
        uint64_t val;
        memcpy(&val, ptr, sizeof(val));
        const unsigned mask = (1U << 10) - 1; // unused in original source!?!
        *a_ptr++ = (val >> 0) & mask;
        *b_ptr++ = (val >> 10) & mask;
        *a_ptr++ = (val >> 20) & mask;
        *c_ptr++ = (val >> 30) & mask;
    }
}
ICC and clang auto-vectorize your 1-byte version, but do a very bad job (lots of insert/extract of single bytes). Here's your original and this function on Godbolt (with gcc and clang -O3 -march=skylake)
None of those 3 compilers are really close to what we can do manually.
Manual vectorization
The current AVX2 version of this answer forgot a detail: there are only 3 kinds of fields, ABAC, not ABCD like 10-bit RGBA pixels. So I have a version of this which unpacks to 4 separate output streams (which I'll leave in because of the packed-RGBA use case, in case I ever add a dedicated version for the ABAC interleave).
The existing version can use vpunpcklwd to interleave the two A parts instead of storing them with separate vmovq, which should work for your case. There might be something more efficient, IDK.
BTW, I find it easier to remember and type instruction mnemonics, not intrinsic names. Intel's online intrinsics guide is searchable by instruction mnemonic.
Observations about your layout:
Each field spans one byte boundary, never two, so it's possible to assemble any 4 pairs of bytes in a qword that hold 4 complete fields.
Or, with a byte shuffle, to create 2-byte words that each have a whole field at some offset. (e.g. for AVX512BW vpsrlvw, or for AVX2 2x vpsrld + a word-blend.) A word shuffle like AVX512 vpermw would not be sufficient: some individual bytes need to be duplicated with the start of one field and the end of another. I.e. the source positions aren't all aligned words, especially when you have 2x 5 bytes inside the same 16-byte "lane" of a vector.
00-07|08-15|16-23|24-31|32-39 byte boundaries (8-bit)
00...09|10..19|20...29|30..39 field boundaries (10-bit)
Luckily 8 and 10 have a GCD of 2 which is >= 10-8=2. 8*5 = 4*10 so we don't get all possible start positions, e.g. never a field starting at the last bit of 1 byte, spanning another byte, and including the first bit of a 3rd byte.
Possible AVX2 strategy: an unaligned 32-byte load that leaves 2x 5 bytes at the top of the low lane, and 2x 5 bytes at the bottom of the high lane. Then a vpshufb in-lane shuffle to set up for 2x vpsrlvd variable-count shifts, and a blend.
Quick summary of a new idea I haven't expanded yet.
Given an input of xxx a0B0A0C0 a1B1A1C1 | a2B2A2C2 a3B3A3C3 from our unaligned load, we can get a result of
a0 A0 a1 A1 B0 B1 C0 C1 | a2 A2 a3 A3 B2 B3 C2 C3 with the right choice of vpshufb control.
Then a vpermd can put all of those 32-bit groups into the right order, with all the A elements in the high half (ready for a vextracti128 to memory), and the B and C in the low half (ready for vmovq / vmovhps stores).
Use different vpermd shuffles for adjacent pairs so we can vpblendd to merge them for 128-bit B and C stores.
Old version, probably worse than unaligned load + vpshufb.
With AVX2, one option is to broadcast the containing 64-bit element to all positions in a vector and then use variable-count right shifts to get the bits to the bottom of a dword element.
You probably want to do a separate 64-bit broadcast-load for each group (thus partially overlapping with the previous), instead of trying to pick apart a __m256i of contiguous bits. (Broadcast-loads are cheap, shuffling is expensive.)
After _mm256_srlv_epi64, then AND to isolate the low 10 bits in each qword.
Repeat that 4 times for 4 vectors of input, then use _mm256_packus_epi32 to do in-lane packing down to 32-bit then 16-bit elements.
That's the simple version. Optimizations of the interleaving are possible, e.g. by using left or right shifts to set up for vpblendd instead of a 2-input shuffle like vpackusdw or vshufps. _mm256_blend_epi32 is very efficient on existing CPUs, running on any port.
This also allows delaying the AND until after the first packing step because we don't need to avoid saturation from high garbage.
Design notes:
shown as 32-bit chunks after variable-count shifts
[0 d0 0 c0 | 0 b0 0 a0] # after an AND mask
[0 d1 0 c1 | 0 b1 0 a1]
[0 d1 0 c1 0 d0 0 c0 | 0 b1 0 a1 0 b0 0 a0] # vpackusdw
shown as 16-bit elements but actually the same as what vshufps can do
---------
[X d0 X c0 | X b0 X a0] even the top element is only garbage right shifted by 30, not quite zero
[X d1 X c1 | X b1 X a1]
[d1 c1 d0 c0 | b1 a1 b0 a0 ] vshufps (can't do d1 d0 c1 c0 unfortunately)
---------
[X d0 X c0 | X b0 X a0] variable-count >> qword
[d1 X c1 X | b1 X a1 0] variable-count << qword
[d1 d0 c1 c0 | b1 b0 a1 a0] vpblendd
This last trick extends to vpblendw, allowing us to do everything with interleaving blends, no shuffle instructions at all, resulting in the outputs we want contiguous and in the right order in qwords of a __m256i.
x86 SIMD variable-count shifts can only be left or right for all elements, so we need to make sure that all the data is either left or right of the desired position, not some of each within the same vector. We could use an immediate-count shift to set up for this, but even better is to just adjust the byte-address we load from. For loads after the first, we know it's safe to load some of the bytes before the first bitfield we want (without touching an unmapped page).
# as 16-bit elements
[X X X d0 X X X c0 | ...] variable-count >> qword
[X X d1 X X X c1 X | ...] variable-count >> qword from an offset load that started with the 5 bytes we want all to the left of these positions
[X d2 X X X c2 X X | ...] variable-count << qword
[d3 X X X c3 X X X | ...] variable-count << qword
[X d2 X d0 X c2 X c0 | ...] vpblendd
[d3 X d1 X c3 X c1 X | ...] vpblendd
[d3 d2 d1 d0 c3 c2 c1 c0 | ...] vpblendw (Same behaviour in both high and low lane)
Then mask off the high garbage inside each 16-bit word
Note: this does 4 separate outputs, like ABCD or RGBA->planar, not ABAC.
// potentially unaligned 64-bit broadcast-load, hopefully vpbroadcastq. (clang: yes, gcc: no)
// defeats gcc/clang folding it into an AVX512 broadcast memory source
// but vpsllvq's ymm/mem operand is the shift count, not data
static inline
__m256i bcast_load64(const uint8_t *p) {
    // hopefully safe with strict-aliasing since the deref is inside an intrinsic?
    __m256i bcast = _mm256_castpd_si256( _mm256_broadcast_sd( (const double*)p ) );
    return bcast;
}
// UNTESTED
// unpack 10-bit fields from 4x 40-bit chunks into 16-bit dst arrays
// overreads past the end of the last chunk by 1 byte
// for ABCD repeating, not ABAC, e.g. packed 10-bit RGBA
void extract_10bit_fields_4output(const uint8_t *__restrict src,
                                  uint16_t *__restrict da, uint16_t *__restrict db, uint16_t *__restrict dc, uint16_t *__restrict dd,
                                  const uint8_t *max)
{
    // FIXME: cleanup loop for non-whole-vectors at the end
    while (src < max) {
        __m256i bcast = bcast_load64(src);      // data we want is from bits [0 to 39], last starting at 30
        __m256i ext0 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0));  // place at bottom of each qword
        bcast = bcast_load64(src+5-2);          // data we want is from bits [16 to 55], last starting at 30+16 = 46
        __m256i ext1 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0));  // place it at bit 16 in each qword element
        bcast = bcast_load64(src+10);           // data we want is from bits [0 to 39]
        __m256i ext2 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32));  // place it at bit 32 in each qword element
        bcast = bcast_load64(src+15-2);         // data we want is from bits [16 to 55], last field starting at 46
        __m256i ext3 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32));  // place it at bit 48 in each qword element

        __m256i blend20   = _mm256_blend_epi32(ext0, ext2, 0b10101010);       // X d2 X d0 X c2 X c0 | X b2 ...
        __m256i blend31   = _mm256_blend_epi32(ext1, ext3, 0b10101010);       // d3 X d1 X c3 X c1 X | b3 X ...
        __m256i blend3210 = _mm256_blend_epi16(blend20, blend31, 0b10101010); // d3 d2 d1 d0 c3 c2 c1 c0
        __m256i res = _mm256_and_si256(blend3210, _mm256_set1_epi16((1U<<10) - 1));

        __m128i lo = _mm256_castsi256_si128(res);
        __m128i hi = _mm256_extracti128_si256(res, 1);
        _mm_storel_epi64((__m128i*)da, lo);                // movq store of the lowest 64 bits
        _mm_storeh_pi((__m64*)db, _mm_castsi128_ps(lo));   // movhps store of the high half of the low 128. Efficient: no shuffle uop needed on Intel CPUs
        _mm_storel_epi64((__m128i*)dc, hi);
        _mm_storeh_pi((__m64*)dd, _mm_castsi128_ps(hi));   // clang pessimizes this to vpextrq :(
        da += 4;
        db += 4;
        dc += 4;
        dd += 4;
        src += 4*5;
    }
}
This compiles (Godbolt) to about 21 front-end uops (on Skylake) in the loop per 4 groups of 4 fields. (Including a useless register copy for _mm256_castsi256_si128 instead of just using the low half of ymm0 = xmm0.) This will be very good on Skylake. There's a good balance of uops for different ports, and variable-count shift is 1 uop for either p0 or p1 on SKL (vs. more expensive previously). The bottleneck might be just the front-end limit of 4 fused-domain uops per clock.
Replays of cache-line-split loads will happen because the unaligned loads will sometimes cross a 64-byte cache-line boundary. But that's just in the back-end, and we have a few spare cycles on ports 2 and 3 because of the front-end bottleneck (4 loads and 4 stores per set of results, with indexed stores which thus can't use port 7). If dependent ALU uops have to get replayed as well, we might start seeing back-end bottlenecks.
Despite the indexed addressing modes, there won't be unlamination because Haswell and later can keep indexed stores micro-fused, and the broadcast loads are a single pure uop anyway, not micro-fused ALU+load.
On Skylake, it can maybe come close to 4x 40-bit groups per 5 clock cycles, if memory bandwidth isn't a bottleneck. (e.g. with good cache blocking.) Once you factor in overhead and cost of cache-line-split loads causing occasional stalls, maybe 1.5 cycles per 40 bits of input, i.e. 6 cycles per 20 bytes of input on Skylake.
On other CPUs (Haswell and Ryzen), the variable-count shifts will be a bottleneck, but you can't really do anything about that. I don't think there's anything better. On HSW it's 3 uops: p5 + 2p0. On Ryzen it's only 1 uop, but it only has 1 per 2 clock throughput (for the 128-bit version), or per 4 clocks for the 256-bit version which costs 2 uops.
Beware that clang pessimizes the _mm_storeh_pi store to vpextrq [mem], xmm, 1: 2 uops, shuffle + store (instead of vmovhps: a pure store on Intel, no ALU). GCC compiles it as written.
I used _mm256_broadcast_sd even though I really want vpbroadcastq, just because there's an intrinsic that takes a pointer operand instead of __m256i (because with AVX1, only the memory-source version existed; with AVX2, register-source versions of all the broadcast instructions exist). To use _mm256_set1_epi64x, I'd have to write pure C that didn't violate strict aliasing (e.g. with memcpy) to do an unaligned uint64_t load. I don't think it will hurt performance to use an FP broadcast load on current CPUs, though.
I'm hoping _mm256_broadcast_sd allows its source operand to alias anything without C++ strict-aliasing undefined behaviour, the same way _mm256_loadu_ps does. Either way it will work in practice if it doesn't inline into a function that stores into *src, and maybe even then. So maybe a memcpy unaligned load would have made more sense!
I've had bad results in the past with getting compilers to emit pmovzxwd xmm0, [mem] from code like _mm_cvtepu16_epi32( _mm_loadu_si64(ptr) ); you often get an actual movq load + reg-reg pmovzx. That's why I didn't try the _mm256_broadcastq_epi64(__m128i) route.
Old idea; if we already need a byte shuffle we might as well use plain word shifts instead of vpmultishift.
With AVX512VBMI (IceLake, CannonLake), you might want vpmultishiftqb. Instead of broadcasting / shifting one group at a time, we can do all the work for a whole vector of groups after putting the right bytes in the right places first.
You'd still need/want a version for CPUs with some AVX512 but not AVX512VBMI (e.g. Skylake-avx512). Probably vpermd + vpshufb can get the bytes we need into the 128-bit lanes we want.
I don't think we can get away with using only dword-granularity shifts to allow merge-masking instead of dword blend after qword shift. We might be able to merge-mask a vpblendw though, saving a vpblendd
IceLake has 1/clock vpermw and vpermb, single-uop. (It has a 2nd shuffle unit on another port that handles some shuffle uops). So we can load a full vector that contains 4 or 8 groups of 4 elements and shuffle every byte into place efficiently. I think every CPU that has vpermb has it single-uop. (But that's only Ice Lake and the limited-release Cannon Lake).
vpermt2w (to combine 16-bit element from 2 vectors into any order) is one per 2 clock throughput. (InstLatx64 for IceLake-Y), so unfortunately it's not as efficient as the one-vector shuffles.
Anyway, you might use it like this:
64-byte / 512-bit load (includes some over-read at the end from 8x 8-byte groups instead of 8x 5-byte groups. Optionally use a zero-masked load to make this safe near the end of an array thanks to fault suppression)
vpermb to put the 2 bytes containing each field into desired final destination position.
vpsrlvw + vpandq to extract each 10-bit field into a 16-bit word
That's about 4 uops, not including the stores.
You probably want the high half containing the A elements for a contiguous vextracti64x4 and the low half containing the B and C elements for vmovdqu and vextracti128 stores.
Or use 2x vpblendd to set up for 256-bit stores. (Use 2 different vpermb vectors to create 2 different layouts.)
You shouldn't need vpermt2w or vpermt2d to combine adjacent vectors for wider stores.
Without AVX512VBMI, probably a vpermd + vpshufb can get all the necessary bytes into each 128-bit chunk instead of vpermb. The rest of it only requires AVX512BW which Skylake-X has.

Addition in neon register

Suppose I have a 64-bit d register in NEON. Let's say it stores the value ABCDEFGH.
Now I want to add A&E, B&F, C&G, D&H and so on. Is there any intrinsic that makes such an operation possible?
I looked at the documentation but didn't find anything suitable.
If you want the addition to be carried out in 16 bits, i.e. produce a uint16x4 result, you can use vmovl to promote the input vector from uint8x8 to uint16x8, then use vadd to add the lower and upper halves. Expressed in NEON intrinsics, this is achieved by
const uint16x8_t t = vmovl_u8(input);
const uint16x4_t r = vadd_u16(vget_low_u16(t), vget_high_u16(t));
This should compile to the following assembly (d0 is the 64-bit input register, d1 is the 64-bit output register). Note the vget_low and vget_high don't produce any instructions - these intrinsics are implemented by suitable register allocation, by exploiting that Q registers are just a convenient way to name two consecutive D register. Q{n} refers to the pair (D{2n}, D{2n+1}).
VMOVL.U8 q1, d0
VADD.I16 d1, d2
If you want the operation to be carried out in 8 bits, and saturate in case of an overflow, do
const uint8x8_t t = vreinterpret_u8_u64(vshr_n_u64(vreinterpret_u64_u8(input), 32));
const uint8x8_t r = vqadd_u8(input, t);
This compiles to (d0 is the input again, output in d1)
VSHR.U64 d1, d0, #32
VQADD.I8 d1, d0
By replacing VQADD with just VADD, the results will wrap around on overflow instead of saturating at 0xff.

How to reorder a quadword vector data using Neon Intrinsics?

The question is related to ARM NEON intrinsics.
I am using ARM NEON intrinsics for a FIR implementation.
I want to reorder a quadword vector data.
For example,
There are four 32 bit elements in a Neon register - say, Q0 - which is of size 128 bit.
A3 A2 A1 A0
I want to reorder Q0 as A0 A1 A2 A3.
Is there any option to do this?
Reading http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html together with the ARM infocenter, I think the following would do what you ask:
uint32x2_t dvec_h = vget_high_u32(qvec);
uint32x2_t dvec_l = vget_low_u32(qvec);
dvec_h = vrev64_u32(dvec_h);
dvec_l = vrev64_u32(dvec_l);
qvec = vcombine_u32(dvec_h, dvec_l);
In assembly, this could be written simply as:
VSWP d0, d1
VREV64.32 q0, q0

Where can I find soft-multiply and divide algorithms?

I'm working on a micro-controller without hardware multiply and divide. I need to cook up software algorithms for these basic operations that are a nice balance of compact size and efficiency. My C compiler port will employ these algorithms, not the C developers themselves.
My google-fu is so far turning up mostly noise on this topic.
Can anyone point me to something informative? I can use add/sub and shift instructions. Table lookup based algos might also work for me, but I'm a bit worried about cramming so much into the compiler's back-end...um, so to speak.
Here's a simple multiplication algorithm (a rough C sketch follows the steps):
Start with rightmost bit of multiplier.
If bit in multiplier is 1, add multiplicand
Shift multiplicand by 1
Move to next bit in multiplier and go back to step 2.
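As a C sketch of those steps (illustrative only; the names and widths are my own, and it assumes unsigned operands):
uint16_t soft_mul(uint8_t multiplicand, uint8_t multiplier)
{
    uint16_t product = 0;
    uint16_t m = multiplicand;      /* shifted multiplicand */
    while (multiplier) {
        if (multiplier & 1)         /* step 2: bit is 1, so add */
            product += m;
        m <<= 1;                    /* step 3: shift multiplicand */
        multiplier >>= 1;           /* step 4: move to next bit */
    }
    return product;
}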
And here's a division algorithm (sketched in C after the notes below):
If divisor is larger than dividend, stop.
While divisor register is less than dividend register, shift left.
Shift divisor register right by 1.
Subtract divisor register from dividend register and change the bit to 1 in the result register at the bit that corresponds with the total number of shifts done to the divisor register.
Start over at step 1 with divisor register in original state.
Of course you'll need to put in a check for dividing by 0, but it should work.
These algorithms, of course, are only for integers.
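As a C sketch of that shift-and-subtract division (again illustrative only, with my own names and widths; the divide-by-zero check mentioned above is assumed to have been done already):
uint16_t soft_div(uint16_t dividend, uint16_t divisor, uint16_t *remainder)
{
    uint16_t quotient = 0;
    uint16_t bit = 1;

    /* Shift the divisor left until it covers the dividend
       (stopping before its top bit would be shifted out). */
    while (divisor < dividend && !(divisor & 0x8000u)) {
        divisor <<= 1;
        bit <<= 1;
    }

    /* Walk back down: subtract where the divisor fits and set the
       result bit corresponding to the current shift amount. */
    while (bit) {
        if (dividend >= divisor) {
            dividend -= divisor;
            quotient |= bit;
        }
        divisor >>= 1;
        bit >>= 1;
    }

    if (remainder)
        *remainder = dividend;      /* what's left over is the remainder */
    return quotient;
}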
My favorite reference for things like this, available in book form:
http://www.hackersdelight.org/
Also you can't go wrong with TAoCP: http://www-cs-faculty.stanford.edu/~uno/taocp.html
Here's a division algorithm: http://www.prasannatech.net/2009/01/division-without-division-operator_24.html
I assume we're talking about ints?
If there's no hardware support, you'll have to implement your own divide-by-zero exception.
(I didn't have much luck quickly finding a multiplication algorithm, but I'll keep looking if someone else doesn't find one).
One simple and fairly performant multiplication algorithm for integers is Russian Peasant Multiplication.
For rationals, you could try a binary quote notation, for which division is easier than usual.
It turns out that I still have some old 68000 assembler code for long multiplication and long division. 68000 code is pretty clean and simple, so should be easy to translate for your chip.
The 68000 had multiply and divide instructions IIRC - I think these were written as a learning exercise.
Decided to just put the code here. Added comments and, in the process, fixed a problem.
;
; Purpose : division of longword by longword to give longword
; : all values signed.
; Requires : d0.L == Value to divide
; : d1.L == Value to divide by
; Changes : d0.L == Remainder
; : d2.L == Result
; : corrupts d1, d3, d4
;
section text
ldiv: move #0,d3 ; Convert d0 -ve to +ve - d3 records original sign
tst.l d0
bpl.s lib5a
neg.l d0
not d3
lib5a: tst.l d1 ; Convert d1 -ve to +ve - d3 records result sign
bpl.s lib5b
neg.l d1
not d3
lib5b: tst.l d1 ; Detect division by zero (not really handled well)
bne.s lib3a
rts
lib3a: moveq.l #0,d2 ; Init working result d2
moveq.l #1,d4 ; Init d4
lib3b: cmp.l d0,d1 ; while d0 < d1 {
bhi.s lib3c
asl.l #1,d1 ; double d1 and d4
asl.l #1,d4
bra.s lib3b ; }
lib3c: asr.l #1,d1 ; halve d1 and d4
asr.l #1,d4
bcs.s lib3d ; stop when d4 reaches zero
cmp.l d0,d1 ; do subtraction if appropriate
bhi.s lib3c
or.l d4,d2 ; update result
sub.l d1,d0
bne.s lib3c
lib3d: ; fix the result and remainder signs
; and.l #$7fffffff,d2 ; don't know why this is here
tst d3
beq.s lib3e
neg.l d2
neg.l d0
lib3e: rts
;
; Purpose : Multiply long by long to give long
; Requires : D0.L == Input 1
; : D1.L == Input 2
; Changes : D2.L == Result
; : D3.L is corrupted
;
lmul: move #0,d3 ; d0 -ve to +ve, original sign in d3
tst.l d0
bpl.s lib4c
neg.l d0
not d3
lib4c: tst.l d1 ; d1 -ve to +ve, result sign in d3
bpl.s lib4d
neg.l d1
not d3
lib4d: moveq.l #0,d2 ; init d2 as working result
lib4a: asr.l #1,d0 ; shift d0 right
bcs.s lib4b ; if a bit fell off, update result
asl.l #1,d1 ; either way, shift left d1
tst.l d0
bne.s lib4a ; if d0 non-zero, continue
tst.l d3 ; basically done - apply sign?
beq.s lib4e ; was broken! now fixed
neg.l d2
lib4e: rts
lib4b: add.l d1,d2 ; main loop body - update result
asl.l #1,d1
bra.s lib4a
By the way - I never did figure out whether it was necessary to convert everything to positive at the start. If you're careful with the shift operations, that may be avoidable overhead.
To multiply, add partial products from the shifted multiplicand to an accumulator iff the corresponding bit in the multiplier is set. Shift multiplicand and multiplier at end of loop, testing multiplier & 1 to see if addition should be done.
The Microchip PICmicro 16Fxxx series chips do not have a multiply or divide instruction.
Perhaps some of the soft multiply and soft divide routines for it can be ported to your MCU.
PIC Microcontroller Basic Math Multiplication Methods
PIC Microcontroller Basic Math Division Methods
Also check out "Newton's method" for division.
I think that method gives the smallest executable size of any division algorithm I've ever seen, although the explanation makes it sound more complicated than it really is.
I hear that some early Cray supercomputers used Newton's method for division.

C language: #DEFINEd value messes up 8-bit multiplication. Why?

I have the following C code:
#define PRR_SCALE 255
...
uint8_t a = 3;
uint8_t b = 4;
uint8_t prr;
prr = (PRR_SCALE * a) / b;
printf("prr: %u\n", prr);
If I compile this (using an msp430 platform compiler, for a small embedded OS called Contiki), the result is 0, while I expected 191.
(uint8_t is typedef'ed as an unsigned char)
If I change it to:
uint8_t a = 3;
uint8_t b = 4;
uint8_t c = 255;
uint8_t prr;
prr = (c * a) / b;
printf("prr: %u\n", prr);
it works out correctly and prints 191.
Compiling a simple version of this 'normally' using gcc on an Ubuntu box prints the correct value in both cases.
I am not exactly sure why this is. I could circumvent it by assigning the DEFINEd value to a variable beforehand, but I'd rather not do that.
Does anybody know why this is? Perhaps with a link to some more information about this?
The short answer: your compiler is buggy. (There is no problem with overflow, as others suggested.)
In both cases, the arithmetic is done in int, which is guaranteed to be at least 16 bits long. In the former snippet it's because 255 is an int, in the latter it's because of integral promotion.
As you noted, gcc handles this correctly.
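For reference, this is what a correct compiler effectively does in both snippets (a worked example with the promotions written out explicitly):
int tmp = 255 * 3;          /* both operands promoted to int: 765, no 8-bit overflow */
int q   = tmp / 4;          /* still int arithmetic: 191 */
uint8_t prr = (uint8_t)q;   /* 191 fits in a uint8_t unchanged */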
255 is being processed as an integer literal and causes the entire expression to be int based rather than unsigned char based. The second case forces the type to be correct. Try changing your #define as follows:
#define PRR_SCALE ((uint8_t) 255)
If the compiler in question is the mspgcc, it should put out an assembler listing of the compiled program together with the binary/hex file. Other compilers may require additional compiler flags to do so. Or maybe even a separate disassembler run on the binary.
This is the place where to look for an explanation.
Due to compiler optimizations, the actual code presented to the processor might have not much similarity to the original C code (but normally does the same job).
Stepping through the few assembler instructions representing the faulty code should reveal the cause of the problem.
My guess is that the compiler somehow optimizes the whole calculation, since the defined constant is known at compile time.
255*x could be optimized to (x<<8)-x (which is faster and smaller).
Maybe something is going wrong with the optimized assembler code.
I took the time to compile both versions on my system. With active optimization, the mspgcc produces the following code:
#define PRR_SCALE 255
uint8_t a = 3;
uint8_t b = 4;
uint8_t prr;
prr = (PRR_SCALE * a) / b;
40ce: 3c 40 fd ff mov #-3, r12 ;#0xfffd
40d2: 2a 42 mov #4, r10 ;r2 As==10
40d4: b0 12 fa 6f call __divmodhi4 ;#0x6ffa
40d8: 0f 4c mov r12, r15 ;
printf("prr: %u\n", prr);
40da: 7f f3 and.b #-1, r15 ;r3 As==11
40dc: 0f 12 push r15 ;
40de: 30 12 c0 40 push #16576 ;#0x40c0
40e2: b0 12 9c 67 call printf ;#0x679c
40e6: 21 52 add #4, r1 ;r2 As==10
As we can see, the compiler directly calculates the result of 255*3 as -3 (0xfffd). And here is the problem: somehow the 255 gets interpreted as -1 signed 8-bit instead of 255 unsigned 16-bit. Or it is parsed to 8 bits first and then sign-extended to 16 bits, or whatever.
A discussion on this topic has been started at the mspgcc mailing list already.
I'm not sure why the define doesn't work, but you might be running into rollovers with the uint8_t variables. 255 is the max value for uint8_t (2^8 - 1), so if you multiply that by 3, you're bound to run into some subtle rollover problems.
The compiler might be optimizing your code, and pre-calculating the result of your math expression and shoving the result in prr (since it fits, even though the intermediate value doesn't fit).
Check what happens if you break up your expression like this (this will not behave like what you want):
prr = c * a; // rollover!
prr = prr / b;
You may need to just use a larger datatype.
One difference I can think of in case 1 is:
The PRR_SCALE literal value may go into ROM or the code area, and there may be some difference in the MUL opcode, for example:
case 1: [register], [rom]
case 2: [register], [register]
It may not make sense at all.
