Unsigned 64x64->128 bit integer multiply on 32-bit platforms - c

In the context of exploratory activity I have started to take a look at integer & fixed-point arithmetic building blocks for 32-bit platforms. My primary target would be ARM32 (specifically armv7), with a side glance to RISC-V32 which I expect to grow in importance in the embedded space. The first sample building block I chose to examine is unsigned 64x64->128 bit integer multiplication. Other questions on this site about this building block do not provide detailed coverage of 32-bit platforms.
Over the past thirty years, I have implemented this and other arithmetic building blocks multiple times, but always in assembly language, for various architectures. However, at this point in time my hope and desire is that these could be programmed in straight ISO-C, without the use of intrinsics. Ideally a single version of the C code would generate good machine code across architectures. I know that the approach of manipulating HLL code to control machine code is generally brittle, but hope that processor architectures and toolchains have matured enough to make this feasible.
Some approaches used in assembly language implementations are not well suited for porting to C. In the exemplary code below I have selected six variants that seemed amenable to an HLL implementation. Besides the generation of partial products, which is common to all variants, the two basic approaches are: (1) Sum the partial products using 64-bit arithmetic, letting the compiler take care of the carry propagation between 32-bit halves. In this case there are multiple choices as to the order in which to sum the partial products. (2) Use 32-bit arithmetic for the summing, simulating the carry flag directly. In this case we have a choice of generating the carry after an addition (a = a + b; carry = a < b;) or before the addition (carry = ~a < b; a = a + b;). Variants 1 through 4 below fall into the former category, variants 5 and 6 fall into the latter.
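As a standalone illustration of those two carry-simulation idioms (my own sketch, not one of the variants in the code below), adding a 32-bit value into an accumulator with an explicit carry out can be written either way:

#include <stdint.h>

/* carry derived after the addition, from the wrapped result */
static void add_carry_after (uint32_t *a, uint32_t b, uint32_t *carry)
{
    *a = *a + b;
    *carry = *a < b;
}

/* carry derived before the addition: a + b wraps if and only if b > ~a */
static void add_carry_before (uint32_t *a, uint32_t b, uint32_t *carry)
{
    *carry = ~*a < b;
    *a = *a + b;
}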
At Compiler Explorer, I focused on the toolchains gcc 12.2 and clang 15.0 for the platforms of interest. I compiled with -O3. The general finding is that on average clang generates more efficient code than gcc, and that the differences between the variants (number of instructions and registers used) are more pronounced with clang. While this may be understandable in the case of RISC-V as the newer architecture, it surprised me in the case of armv7 which has been around for well over a dozen years.
Three cases in particular struck me as noteworthy. While I have worked with compiler engineers before and have a reasonable understanding of basic code transformations, phase-ordering issues, etc., the only technique I am aware of that might apply to this code is idiom recognition, and I do not see how this could explain the observations by itself. The first case is variant 3, where clang 15.0 produces extremely tight code comprising just 10 instructions that I don't think can be improved upon:
umul64wide:
push {r4, lr}
umull r12, r4, r2, r0
umull lr, r0, r3, r0
umaal lr, r4, r2, r1
umaal r0, r4, r3, r1
ldr r1, [sp, #8]
strd r0, r4, [r1]
mov r0, r12
mov r1, lr
pop {r4, pc}
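For readers who do not know umaal, which appears twice in the listing above: it multiplies two 32-bit operands and adds both 32-bit accumulator registers into the 64-bit result, which can never overflow because (2^32-1)*(2^32-1) + 2*(2^32-1) equals 2^64-1. A C model of its effect (my own illustration, not generated code):

#include <stdint.h>

static void umaal_model (uint32_t *rdlo, uint32_t *rdhi, uint32_t rn, uint32_t rm)
{
    uint64_t t = (uint64_t)rn * rm + *rdlo + *rdhi; /* fits in 64 bits */
    *rdlo = (uint32_t)t;
    *rdhi = (uint32_t)(t >> 32);
}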
By contrast, gcc generates twice the number of instructions and requires twice the number of registers. I hypothesize that it does not recognize how to use the multiply-accumulate instruction umaal here, but is that the full story? The reverse situation, but not quite as dramatic, occurs in variant 6, where gcc 12.2 produces this sequence of 18 instructions, with low register usage:
umul64wide:
mov ip, r0
push {r4, r5, lr}
mov lr, r1
umull r0, r1, r0, r2
ldr r4, [sp, #12]
umull r5, ip, r3, ip
adds r1, r1, r5
umull r2, r5, lr, r2
adc ip, ip, #0
umull lr, r3, lr, r3
adds r1, r1, r2
adc r2, ip, #0
adds r2, r2, r5
adc r3, r3, #0
adds r2, r2, lr
adc r3, r3, #0
strd r2, r3, [r4]
pop {r4, r5, pc}
The generated code nicely turns the simulated carry propagation into real carry propagation. clang 15.0 uses nine more instructions and five more registers, and I cannot really make out what it is trying to do without spending much more time on analysis. The third observation is with regard to the differences seen in the machine code produced for variant 5 vs. variant 6, in particular with clang. These use the same basic algorithm, with one variant computing the simulated carry before the additions, the other after it. I did find in the end that one variant, namely variant 4, seems to be efficient across both toolchains and both architectures. However, before I proceed to other building blocks and face a similar struggle, I would like to inquire:
(1) Are there coding idioms or algorithms I have not considered in the code below that might lead to superior results? (2) Are there specific optimization switches, e.g. a hypothetical -ffrobnicate (see here), that are not included in -O3 that would help the compilers generate efficient code more consistently for these kinds of bit-manipulation scenarios? Explanations as to what compiler mechanisms are likely responsible for the cases of significant differences in code generation observed, and how one might influence or work around them, could also be helpful.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define VARIANT (3)
#define USE_X64_ASM_REF (0)
/* Multiply two unsigned 64-bit operands a and b. Returns least significant 64
bits of product as return value, most significant 64 bits of product via h.
*/
uint64_t umul64wide (uint64_t a, uint64_t b, uint64_t *h)
{
uint32_t a_lo = (uint32_t)a;
uint32_t a_hi = a >> 32;
uint32_t b_lo = (uint32_t)b;
uint32_t b_hi = b >> 32;
uint64_t p0 = (uint64_t)a_lo * b_lo;
uint64_t p1 = (uint64_t)a_lo * b_hi;
uint64_t p2 = (uint64_t)a_hi * b_lo;
uint64_t p3 = (uint64_t)a_hi * b_hi;
#if VARIANT == 1
uint32_t c = (uint32_t)(((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32);
*h = p3 + (p1 >> 32) + (p2 >> 32) + c;
return p0 + ((p1 + p2) << 32);
#elif VARIANT == 2
uint64_t s = (p0 >> 32) + p1;
uint64_t t = (uint32_t)s + p2;
*h = (s >> 32) + (t >> 32) + p3;
return (uint32_t)p0 + (t << 32);
#elif VARIANT == 3
*h = (p1 >> 32) + (((p0 >> 32) + (uint32_t)p1 + p2) >> 32) + p3;
return p0 + ((p1 + p2) << 32);
#elif VARIANT == 4
uint64_t t = (p0 >> 32) + p1 + (uint32_t)p2;
*h = (p2 >> 32) + (t >> 32) + p3;
return (uint32_t)p0 + (t << 32);
#elif VARIANT == 5
uint32_t r0, r1, r2, r3, r4, r5, r6;
r0 = (uint32_t)p0;
r1 = p0 >> 32;
r5 = (uint32_t)p1;
r2 = p1 >> 32;
r1 = r1 + r5;
r6 = r1 < r5;
r2 = r2 + r6;
r6 = (uint32_t)p2;
r5 = p2 >> 32;
r1 = r1 + r6;
r6 = r1 < r6;
r2 = r2 + r6;
r4 = (uint32_t)p3;
r3 = p3 >> 32;
r2 = r2 + r5;
r6 = r2 < r5;
r3 = r3 + r6;
r2 = r2 + r4;
r6 = r2 < r4;
r3 = r3 + r6;
*h = ((uint64_t)r3 << 32) | r2;
return ((uint64_t)r1 << 32) | r0;
#elif VARIANT == 6
uint32_t r0, r1, r2, r3, r4, r5, r6;
r0 = (uint32_t)p0;
r1 = p0 >> 32;
r5 = (uint32_t)p1;
r2 = p1 >> 32;
r4 = ~r1;
r4 = r4 < r5;
r1 = r1 + r5;
r2 = r2 + r4;
r6 = (uint32_t)p2;
r5 = p2 >> 32;
r4 = ~r1;
r4 = r4 < r6;
r1 = r1 + r6;
r2 = r2 + r4;
r4 = (uint32_t)p3;
r3 = p3 >> 32;
r6 = ~r2;
r6 = r6 < r5;
r2 = r2 + r5;
r3 = r3 + r6;
r6 = ~r2;
r6 = r6 < r4;
r2 = r2 + r4;
r3 = r3 + r6;
*h = ((uint64_t)r3 << 32) | r2;
return ((uint64_t)r1 << 32) | r0;
#else
#error unsupported VARIANT
#endif
}
#if defined(__SIZEOF_INT128__)
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
unsigned __int128 prod = ((unsigned __int128)a) * b;
*h = (uint64_t)(prod >> 64);
return (uint64_t)prod;
}
#elif defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
*h = __umulh (a, b);
return a * b;
}
#elif USE_X64_ASM_REF
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
uint64_t res_l, res_h;
__asm__ (
"movq %2, %%rax;\n\t" // rax = a
"mulq %3;\n\t" // rdx:rax = a * b
"movq %%rdx, %0;\n\t" // res_h = rdx
"movq %%rax, %1;\n\t" // res_l = rax
: "=rm" (res_h), "=rm"(res_l)
: "rm"(a), "rm"(b)
: "%rax", "%rdx");
*h = res_h;
return res_l;
}
#else // generic (and slow) reference implementation
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
uint32_t cy, t0, t1;
uint32_t a_lo = (uint32_t)a;
uint32_t a_hi = a >> 32;
uint32_t b_lo = (uint32_t)b;
uint32_t b_hi = b >> 32;
uint64_t p0 = (uint64_t)a_lo * b_lo;
uint64_t p1 = (uint64_t)a_lo * b_hi;
uint64_t p2 = (uint64_t)a_hi * b_lo;
uint64_t p3 = (uint64_t)a_hi * b_hi;
uint32_t p0_lo = (uint32_t)p0;
uint32_t p0_hi = p0 >> 32;
uint32_t p1_lo = (uint32_t)p1;
uint32_t p1_hi = p1 >> 32;
uint32_t p2_lo = (uint32_t)p2;
uint32_t p2_hi = p2 >> 32;
uint32_t p3_lo = (uint32_t)p3;
uint32_t p3_hi = p3 >> 32;
uint32_t r0 = p0_lo;
uint32_t r1 = ADDcc (p0_hi, p1_lo, cy, t0, t1);
uint32_t r2 = ADDCcc (p1_hi, p2_hi, cy, t0, t1);
uint32_t r3 = ADDC (p3_hi, 0, cy, t0, t1);
r1 = ADDcc (r1, p2_lo, cy, t0, t1);
r2 = ADDCcc (r2, p3_lo, cy, t0, t1);
r3 = ADDC (r3, 0, cy, t0, t1);
*h = ((uint64_t)r3 << 32) + r2;
return ((uint64_t)r1 << 32) + r0;
}
#endif
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
uint64_t a, b, res_hi, res_lo, ref_hi, ref_lo, count = 0;
printf ("Smoke test of umul64wide variant %d\n", VARIANT);
do {
a = KISS64;
b = KISS64;
ref_lo = umul64wide_ref (a, b, &ref_hi);
res_lo = umul64wide (a, b, &res_hi);
if ((res_lo ^ ref_lo) | (res_hi ^ ref_hi)) {
printf ("!!!! error: a=%016llx b=%016llx res=%016llx_%016llx ref=%016llx_%016llx\n",
a, b, res_hi, res_lo, ref_hi, ref_lo);
return EXIT_FAILURE;
}
if (!(count & 0xfffffff)) printf ("\r%llu", count);
count++;
} while (count);
return EXIT_SUCCESS;
}

I avoided the use of the ((x += y) < y) overflow test, since not every ISA handles conditional flags efficiently, and using the results of flag register(s) may inhibit re-ordering; x86[-64] is the obvious example, though later BMI(2) instructions may help mitigate this. I also added a 32 x 32 -> 64 bit C implementation for comparison - but I would expect any modern ISA to at least supply a 'high word' multiply like ARM's umulh.
/******************************************************************************/
/* stackoverflow.com/questions/74713642 */
#include <inttypes.h>
#include <stdio.h>
/* umul_32_32 : 32 x 32 => 64 */
/* force inline (non-portable), or implement it as macro, e.g.,
* #define umul_32_32(rh, rl, x, y) do { ... } while (0) */
#if (1)
static inline __attribute__((always_inline))
uint64_t umul_32_32 (uint32_t x, uint32_t y)
{
return (((uint64_t) x) * y);
}
#else
/* if no widening multiply is available, the compiler probably
* generates something at least as efficient as the following -
* or (worst case) it calls a builtin function. */
static inline __attribute__((always_inline))
uint64_t umul_32_32 (uint32_t x, uint32_t y)
{
uint32_t m0, m1, m2, m3; /* (partial products) */
uint32_t x0, x1, y0, y1;
x0 = x & UINT16_MAX, x1 = x >> (16);
y0 = y & UINT16_MAX, y1 = y >> (16);
m0 = x0 * y0, m1 = x1 * y0;
m2 = x0 * y1, m3 = x1 * y1;
m1 += m0 >> (16);
m3 += m2 >> (16);
m1 += m2 & UINT16_MAX;
uint32_t rh = m3 + (m1 >> (16));
uint32_t rl = m1 << (16) | (m0 & UINT16_MAX);
return (((uint64_t) rh) << 32 | rl);
/* 32 x 32 => 64 : no branching or carry overflow tests. */
}
#endif
/* ensure the function is called to inspect code gen / assembly,
* otherwise gcc and clang evaluate this at compile time. */
__attribute__((noinline)) void umul_64_64 (
uint64_t *rh, uint64_t *rl, uint64_t x, uint64_t y)
{
uint64_t m0, m1, m2, m3; /* (partial products) */
uint32_t x0, x1, y0, y1;
x0 = (uint32_t) (x), x1 = (uint32_t) (x >> (32));
y0 = (uint32_t) (y), y1 = (uint32_t) (y >> (32));
m0 = umul_32_32(x0, y0), m1 = umul_32_32(x1, y0);
m2 = umul_32_32(x0, y1), m3 = umul_32_32(x1, y1);
m1 += m0 >> (32);
m3 += m2 >> (32);
m1 += m2 & UINT32_MAX;
*rh = m3 + (m1 >> (32));
*rl = m1 << (32) | (m0 & UINT32_MAX);
/* 64 x 64 => 128 : no branching or carry overflow tests. */
}
#if (0)
int main (void)
{
uint64_t x = UINT64_MAX, y = UINT64_MAX, rh, rl;
umul_64_64(& rh, & rl, x, y);
fprintf(stdout, "0x%016" PRIX64 ":0x%016" PRIX64 "\n", rh, rl);
return (0);
}
#endif
/******************************************************************************/
For ARMv7, I'm getting more or less the same results as your 'variant 3' code, which isn't surprising, since it's the same essential idea. I tried different flags on gcc-12 and gcc-trunk, but couldn't improve it.
I'd hazard a guess that with Apple's investment in AArch64 silicon, there's simply been more aggressive optimization and funding directed toward clang that benefits 32-bit ARMv7 as well. But that's pure speculation. It's a pretty glaring disparity for such a major platform though.

Related

cuda SIMD instruction for per-byte multiplication with unsigned saturation

CUDA has a nice set of SIMD instructions for integers that allow efficient SIMD computations. Among those, there are some that compute addition and subtraction per byte or per half-word (like __vadd2 and __vadd4), however, I couldn't find a similar function that computes per-byte multiplication for a 32bit register. I would appreciate it if someone can help me find a proper solution.
however, I couldn't find a similar function that computes per-byte multiplication for a 32bit register.
There isn't one that returns the 4 individual products.
The closest is the __dp4a() intrinsic which returns the sum of the 4 products, in a 32-bit integer.
You could write an 8-bit packed unsigned multiply with saturation like this:
$ cat t2048.cu
#include <cstdio>
#include <cstdint>
__host__ __device__ uchar4 u8mulsat(const uchar4 &a, const uchar4 &b){
const unsigned sv = 255;
uchar4 result;
unsigned t;
t = a.x*b.x;
if (t > sv) t = sv;
result.x = t;
t = a.y*b.y;
if (t > sv) t = sv;
result.y = t;
t = a.z*b.z;
if (t > sv) t = sv;
result.z = t;
t = a.w*b.w;
if (t > sv) t = sv;
result.w = t;
return result;
}
__global__ void k(uchar4 a, uchar4 b, uchar4 *c){
*c = u8mulsat(a, b);
}
int main(){
uchar4 a,b,c, *d_c;
cudaMalloc(&d_c, sizeof(uchar4));
a.x = 1;
a.y = 2;
a.z = 4;
a.w = 8;
b.x = 64;
b.y = 64;
b.z = 64;
b.w = 1;
k<<<1,1>>>(a, b, d_c);
cudaMemcpy(&c, d_c, sizeof(uchar4), cudaMemcpyDeviceToHost);
printf("c.x = %u\n", (unsigned)c.x);
printf("c.y = %u\n", (unsigned)c.y);
printf("c.z = %u\n", (unsigned)c.z);
printf("c.w = %u\n", (unsigned)c.w);
}
$ nvcc -o t2048 t2048.cu
$ compute-sanitizer ./t2048
========= COMPUTE-SANITIZER
c.x = 64
c.y = 128
c.z = 255
c.w = 8
========= ERROR SUMMARY: 0 errors
$ cuobjdump -sass ./t2048
Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_52
Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_52
Function : _Z1k6uchar4S_PS_
.headerflags #"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
/* 0x001c4400e22007f6 */
/*0008*/ MOV R1, c[0x0][0x20] ; /* 0x4c98078000870001 */
/*0010*/ LDC.U8 R0, c[0x0][0x140] ; /* 0xef9000001407ff00 */
/*0018*/ LDC.U8 R2, c[0x0][0x144] ; /* 0xef9000001447ff02 */
/* 0x001d4400e6200731 */
/*0028*/ LDC.U8 R3, c[0x0][0x141] ; /* 0xef9000001417ff03 */
/*0030*/ LDC.U8 R4, c[0x0][0x145] ; /* 0xef9000001457ff04 */
/*0038*/ LDC.U8 R5, c[0x0][0x142] ; /* 0xef9000001427ff05 */
/* 0x001dfc00ee200751 */
/*0048*/ LDC.U8 R6, c[0x0][0x146] ; /* 0xef9000001467ff06 */
/*0050*/ LDC.U8 R7, c[0x0][0x143] ; /* 0xef9000001437ff07 */
/*0058*/ LDC.U8 R8, c[0x0][0x147] ; /* 0xef9000001477ff08 */
/* 0x009fd002fe200fe1 */
/*0068*/ XMAD R0, R2, R0, RZ ; /* 0x5b007f8000070200 */
/*0070*/ XMAD R2, R4, R3, RZ ; /* 0x5b007f8000370402 */
/*0078*/ XMAD R3, R6, R5, RZ ; /* 0x5b007f8000570603 */
/* 0x001fc408fe2007f1 */
/*0088*/ IMNMX.U32 R0, R0, 0xff, PT ; /* 0x382003800ff70000 */
/*0090*/ XMAD R4, R8, R7, RZ ; /* 0x5b007f8000770804 */
/*0098*/ IMNMX.U32 R2, R2, 0xff, PT ; /* 0x382003800ff70202 */
/* 0x001fc400fe2007e4 */
/*00a8*/ IMNMX.U32 R3, R3, 0xff, PT ; /* 0x382003800ff70303 */
/*00b0*/ IMNMX.U32 R4, R4, 0xff, PT ; /* 0x382003800ff70404 */
/*00b8*/ BFI R0, R2, 0x808, R0 ; /* 0x36f0000080870200 */
/* 0x001fd400fe2007f5 */
/*00c8*/ MOV R2, c[0x0][0x148] ; /* 0x4c98078005270002 */
/*00d0*/ BFI R5, R3, 0x810, R0 ; /* 0x36f0000081070305 */
/*00d8*/ MOV R3, c[0x0][0x14c] ; /* 0x4c98078005370003 */
/* 0x001ffc00fe2007e2 */
/*00e8*/ BFI R4, R4, 0x818, R5 ; /* 0x36f0028081870404 */
/*00f0*/ STG.E [R2], R4 ; /* 0xeedc200000070204 */
/*00f8*/ EXIT ; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*0108*/ BRA 0x100 ; /* 0xe2400fffff07000f */
/*0110*/ NOP; /* 0x50b0000000070f00 */
/*0118*/ NOP; /* 0x50b0000000070f00 */
/* 0x001f8000fc0007e0 */
/*0128*/ NOP; /* 0x50b0000000070f00 */
/*0130*/ NOP; /* 0x50b0000000070f00 */
/*0138*/ NOP; /* 0x50b0000000070f00 */
..........
Fatbin ptx code:
================
arch = sm_52
code version = [7,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$
The SASS code appears to be about as I would expect, roughly the same length as the C++ code, ignoring the LDC and STG instructions.
FWIW, on Tesla V100, CUDA 11.4, the implementation by njuffa and mine are pretty close in terms of register usage (njuffa: 16, mine: 17) and performance (njuffa about 1% faster):
$ cat t2048.cu
#include <iostream>
#include <cstdint>
__device__ unsigned int vmulus4 (unsigned int a, unsigned int b)
{
unsigned int plo, phi, res;
// compute products
plo = ((a & 0x000000ff) * (b & 0x000000ff) +
(a & 0x0000ff00) * (b & 0x0000ff00));
phi = (__umulhi (a & 0x00ff0000, b & 0x00ff0000) +
__umulhi (a & 0xff000000, b & 0xff000000));
// clamp products to 255
plo |= __vcmpne2 (plo & 0xff00ff00, 0x00000000);
phi |= __vcmpne2 (phi & 0xff00ff00, 0x00000000);
// extract least significant eight bits of each product
res = __byte_perm (plo, phi, 0x6420);
return res;
}
__host__ __device__ uchar4 u8mulsat(const uchar4 &a, const uchar4 &b){
const unsigned sv = 255;
uchar4 result;
unsigned t;
t = a.x*b.x;
if (t > sv) t = sv;
result.x = t;
t = a.y*b.y;
if (t > sv) t = sv;
result.y = t;
t = a.z*b.z;
if (t > sv) t = sv;
result.z = t;
t = a.w*b.w;
if (t > sv) t = sv;
result.w = t;
return result;
}
__global__ void k(const uchar4 * __restrict__ a, const uchar4 * __restrict__ b, uchar4 * __restrict__ c, unsigned N){
unsigned idx = blockIdx.x*blockDim.x+threadIdx.x;
if (idx < N)
c[idx] = u8mulsat(a[idx], b[idx]);
}
__global__ void k1(const unsigned * __restrict__ a, const unsigned * __restrict__ b, unsigned * __restrict__ c, unsigned N){
unsigned idx = blockIdx.x*blockDim.x+threadIdx.x;
if (idx < N)
c[idx] = vmulus4(a[idx], b[idx]);
}
int main(){
unsigned N = 256U*80U*8U*400U;
uchar4 *d_a,*d_b, *d_c;
cudaMalloc(&d_c, sizeof(uchar4)*N);
cudaMalloc(&d_a, sizeof(uchar4)*N);
cudaMalloc(&d_b, sizeof(uchar4)*N);
for (int i = 0; i < 100; i++) {
k<<<N/256,256>>>(d_a, d_b, d_c, N);
k1<<<N/256,256>>>((unsigned *)d_a, (unsigned *)d_b, (unsigned *)d_c, N);}
cudaDeviceSynchronize();
}
$ nvcc -o t2048 t2048.cu -arch=sm_70 -Xptxas -v
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z2k1PKjS0_Pjj' for 'sm_70'
ptxas info : Function properties for _Z2k1PKjS0_Pjj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 16 registers, 380 bytes cmem[0]
ptxas info : Compiling entry function '_Z1kPK6uchar4S1_PS_j' for 'sm_70'
ptxas info : Function properties for _Z1kPK6uchar4S1_PS_j
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 17 registers, 380 bytes cmem[0]
$ nvprof ./t2048
==2696== NVPROF is profiling process 2696, command: ./t2048
==2696== Profiling application: ./t2048
==2696== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 50.21% 100.24ms 100 1.0024ms 998.26us 1.0084ms k(uchar4 const *, uchar4 const *, uchar4*, unsigned int)
49.79% 99.412ms 100 994.12us 990.33us 1.0015ms k1(unsigned int const *, unsigned int const *, unsigned int*, unsigned int)
API calls: 57.39% 279.76ms 3 93.254ms 557.75us 278.64ms cudaMalloc
40.69% 198.31ms 1 198.31ms 198.31ms 198.31ms cudaDeviceSynchronize
1.03% 5.0147ms 4 1.2537ms 589.80us 3.2328ms cuDeviceTotalMem
0.51% 2.4799ms 404 6.1380us 333ns 272.34us cuDeviceGetAttribute
0.30% 1.4715ms 200 7.3570us 6.5220us 68.684us cudaLaunchKernel
0.07% 354.69us 4 88.672us 61.927us 166.60us cuDeviceGetName
0.00% 20.956us 4 5.2390us 3.1200us 7.8000us cuDeviceGetPCIBusId
0.00% 10.445us 8 1.3050us 522ns 4.9100us cuDeviceGet
0.00% 3.7970us 4 949ns 780ns 1.2230us cuDeviceGetUuid
0.00% 3.2030us 3 1.0670us 751ns 1.5050us cuDeviceGetCount
$
Later:
Here is a slightly faster routine (a few percent, on sm_70) compared to my previous:
__device__ uchar4 u8mulsat(const uchar4 &a, const uchar4 &b){
uchar4 result;
const half sv = 255;
const short svi = 255;
__half2 ah2, bh2, rh2;
ah2 = __floats2half2_rn(a.x, a.y);
bh2 = __floats2half2_rn(b.x, b.y);
rh2 = __hmul2(ah2, bh2);
result.x = (rh2.x > sv) ? (svi):((short)rh2.x);
result.y = (rh2.y > sv) ? (svi):((short)rh2.y);
ah2 = __floats2half2_rn(a.z, a.w);
bh2 = __floats2half2_rn(b.z, b.w);
rh2 = __hmul2(ah2, bh2);
result.z = (rh2.x > sv) ? (svi):((short)rh2.x);
result.w = (rh2.y > sv) ? (svi):((short)rh2.y);
return result;
}
It has the disadvantage that it uses CUDA half-precision intrinsics, so it is "less portable" than the previous, and likewise cannot be decorated with __host__.
There is no existing intrinsic __vmulus8() in CUDA. However, it can be emulated using existing intrinsics. Basically, we can pack the four 16-bit products of four 8-bit quantities using two 32-bit variables to hold them. Then clamp each product to 255 and extract the least-significant byte of each product into the final result with the help of the permute operation. The code generated by CUDA 11 for compute capabilities >= 7.0 looks reasonable. Whether the performance is sufficient will depend on the use case. If this operation occurs in the middle of a processing pipeline computing with packed bytes, that should be the case.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
/* byte-wise multiply with unsigned saturation */
__device__ unsigned int vmulus4 (unsigned int a, unsigned int b)
{
unsigned int plo, phi, res;
// compute products
plo = ((a & 0x000000ff) * (b & 0x000000ff) +
(a & 0x0000ff00) * (b & 0x0000ff00));
phi = (__umulhi (a & 0x00ff0000, b & 0x00ff0000) +
__umulhi (a & 0xff000000, b & 0xff000000));
// clamp products to 255
plo |= __vcmpne2 (plo & 0xff00ff00, 0x00000000);
phi |= __vcmpne2 (phi & 0xff00ff00, 0x00000000);
// extract least significant eight bits of each product
res = __byte_perm (plo, phi, 0x6420);
return res;
}
__global__ void kernel (unsigned int a, unsigned int b, unsigned int *res)
{
*res = vmulus4 (a, b);
}
unsigned int vmulus4_ref (unsigned int a, unsigned int b)
{
unsigned char a0, a1, a2, a3, b0, b1, b2, b3;
unsigned int p0, p1, p2, p3;
a0 = (a >> 0) & 0xff;
a1 = (a >> 8) & 0xff;
a2 = (a >> 16) & 0xff;
a3 = (a >> 24) & 0xff;
b0 = (b >> 0) & 0xff;
b1 = (b >> 8) & 0xff;
b2 = (b >> 16) & 0xff;
b3 = (b >> 24) & 0xff;
p0 = (unsigned int)a0 * (unsigned int)b0;
p1 = (unsigned int)a1 * (unsigned int)b1;
p2 = (unsigned int)a2 * (unsigned int)b2;
p3 = (unsigned int)a3 * (unsigned int)b3;
if (p0 > 255) p0 = 255;
if (p1 > 255) p1 = 255;
if (p2 > 255) p2 = 255;
if (p3 > 255) p3 = 255;
return (p0 << 0) + (p1 << 8) + (p2 << 16) + (p3 << 24);
}
// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)
int main (void)
{
unsigned int *resD = 0;
unsigned int a, b, res, ref;
cudaMalloc ((void**)&resD, sizeof resD[0]);
for (int i = 0; i < 1000000; i++) {
a = KISS;
b = KISS;
kernel<<<1,1>>>(a, b, resD);
cudaMemcpy (&res, resD, sizeof res, cudaMemcpyDeviceToHost);
ref = vmulus4_ref (a, b);
if (res != ref) {
printf ("error: a=%08x b=%08x res=%08x ref=%08x\n", a, b, res, ref);
return EXIT_FAILURE;
}
}
cudaFree (resD);
return EXIT_SUCCESS;
}

Transposing 8x8 float matrix using NEON intrinsics

I have a program that needs to run a transpose operation on 8x8 float32 matrices many times. I want to transpose these using NEON SIMD intrinsics. I know that the array will always contain 8x8 float elements. I have a baseline non-intrinsic solution below:
void transpose(float *matrix, float *matrixT) {
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
matrixT[i*8+j] = matrix[j*8+i];
}
}
}
I also created an intrinsic solution that transposes each 4x4 quadrant of the 8x8 matrix, and swaps the positions of the second and third quadrants. This solution looks like this:
void transpose_4x4(float *matrix, float *matrixT, int store_index) {
float32x4_t r0, r1, r2, r3, c0, c1, c2, c3;
r0 = vld1q_f32(matrix);
r1 = vld1q_f32(matrix + 8);
r2 = vld1q_f32(matrix + 16);
r3 = vld1q_f32(matrix + 24);
c0 = vzip1q_f32(r0, r1);
c1 = vzip2q_f32(r0, r1);
c2 = vzip1q_f32(r2, r3);
c3 = vzip2q_f32(r2, r3);
r0 = vcombine_f32(vget_low_f32(c0), vget_low_f32(c2));
r1 = vcombine_f32(vget_high_f32(c0), vget_high_f32(c2));
r2 = vcombine_f32(vget_low_f32(c1), vget_low_f32(c3));
r3 = vcombine_f32(vget_high_f32(c1), vget_high_f32(c3));
vst1q_f32(matrixT + store_index, r0);
vst1q_f32(matrixT + store_index + 8, r1);
vst1q_f32(matrixT + store_index + 16, r2);
vst1q_f32(matrixT + store_index + 24, r3);
}
void transpose(float *matrix, float *matrixT) {
// Transpose top-left 4x4 quadrant and store the result in the top-left 4x4 quadrant
transpose_4x4(matrix, matrixT, 0);
// Transpose top-right 4x4 quadrant and store the result in the bottom-left 4x4 quadrant
transpose_4x4(matrix + 4, matrixT, 32);
// Transpose bottom-left 4x4 quadrant and store the result in the top-right 4x4 quadrant
transpose_4x4(matrix + 32, matrixT, 4);
// Transpose bottom-right 4x4 quadrant and store the result in the bottom-right 4x4 quadrant
transpose_4x4(matrix + 36, matrixT, 36);
}
This solution however, results in a slower performance than the baseline non-intrinsic solution. I am struggling to see, if there is one, a faster solution that can transpose my 8x8 matrix. Any help would be greatly appreciated!
Edit: both solutions are compiled using the -O1 flag.
First off, you shouldn't expect a huge performance boost to start with:
there is actually no computation
you are dealing with 32bit data, and thus, not much of a bandwidth constraint.
to sum it up, there is just a little bit of bandwidth saved by vectorizing - that's all
As for the 4x4 transpose, you don't even need a separate function, but just a macro:
#define TRANSPOSE4x4(pSrc,pDst) vst1q_f32_x4(pDst,vld4q_f32(pSrc))
will do the job since NEON does the 4x4 transpose on the fly when you load the data with vld4.
But you should ask yourself at this point if your approach - transposing the whole matrix prior to the actual computation - is the right one if a 4x4 transpose costs virtually nothing. This step could end up being a pure waste of computation and bandwidth. Optimization shouldn't be limited to the final step, but should be considered from the designing phase.
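For completeness, a usage sketch of that load-based 4x4 transpose (my own wrapper around the macro above, assuming a contiguous row-major 4x4 block and a toolchain that provides vst1q_f32_x4):

#include <arm_neon.h>

static void transpose4x4 (const float *pSrc, float *pDst)
{
    /* vld4q_f32 de-interleaves by four on load, which for a 4x4 block is
       exactly the transpose; vst1q_f32_x4 stores the four vectors back
       contiguously. */
    vst1q_f32_x4(pDst, vld4q_f32(pSrc));
}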
8x8 transpose is a different animal though:
void transpose8x8(float *pDst, float *pSrc)
{
float32x4_t row0a, row0b, row1a, row1b, row2a, row2b, row3a, row3b, row4a, row4b, row5a, row5b, row6a, row6b, row7a, row7b;
float32x4_t r0a, r0b, r1a, r1b, r2a, r2b, r3a, r3b, r4a, r4b, r5a, r5b, r6a, r6b, r7a, r7b;
row0a = vld1q_f32(pSrc);
pSrc += 4;
row0b = vld1q_f32(pSrc);
pSrc += 4;
row1a = vld1q_f32(pSrc);
pSrc += 4;
row1b = vld1q_f32(pSrc);
pSrc += 4;
row2a = vld1q_f32(pSrc);
pSrc += 4;
row2b = vld1q_f32(pSrc);
pSrc += 4;
row3a = vld1q_f32(pSrc);
pSrc += 4;
row3b = vld1q_f32(pSrc);
pSrc += 4;
row4a = vld1q_f32(pSrc);
pSrc += 4;
row4b = vld1q_f32(pSrc);
pSrc += 4;
row5a = vld1q_f32(pSrc);
pSrc += 4;
row5b = vld1q_f32(pSrc);
pSrc += 4;
row6a = vld1q_f32(pSrc);
pSrc += 4;
row6b = vld1q_f32(pSrc);
pSrc += 4;
row7a = vld1q_f32(pSrc);
pSrc += 4;
row7b = vld1q_f32(pSrc);
r0a = vtrn1q_f32(row0a, row1a);
r0b = vtrn1q_f32(row0b, row1b);
r1a = vtrn2q_f32(row0a, row1a);
r1b = vtrn2q_f32(row0b, row1b);
r2a = vtrn1q_f32(row2a, row3a);
r2b = vtrn1q_f32(row2b, row3b);
r3a = vtrn2q_f32(row2a, row3a);
r3b = vtrn2q_f32(row2b, row3b);
r4a = vtrn1q_f32(row4a, row5a);
r4b = vtrn1q_f32(row4b, row5b);
r5a = vtrn2q_f32(row4a, row5a);
r5b = vtrn2q_f32(row4b, row5b);
r6a = vtrn1q_f32(row6a, row7a);
r6b = vtrn1q_f32(row6b, row7b);
r7a = vtrn2q_f32(row6a, row7a);
r7b = vtrn2q_f32(row6b, row7b);
/* second pass: 64-bit trn of the first-pass results (strictly, these calls
   need vreinterpretq casts between float32x4_t and float64x2_t, elided here) */
row0a = vtrn1q_f64(r0a, r2a);
row0b = vtrn1q_f64(r0b, r2b);
row1a = vtrn1q_f64(r1a, r3a);
row1b = vtrn1q_f64(r1b, r3b);
row2a = vtrn2q_f64(r0a, r2a);
row2b = vtrn2q_f64(r0b, r2b);
row3a = vtrn2q_f64(r1a, r3a);
row3b = vtrn2q_f64(r1b, r3b);
row4a = vtrn1q_f64(r4a, r6a);
row4b = vtrn1q_f64(r4b, r6b);
row5a = vtrn1q_f64(r5a, r7a);
row5b = vtrn1q_f64(r5b, r7b);
row6a = vtrn2q_f64(r4a, r6a);
row6b = vtrn2q_f64(r4b, r6b);
row7a = vtrn2q_f64(r5a, r7a);
row7b = vtrn2q_f64(r5b, r7b);
vst1q_f32(pDst, row0a);
pDst += 4;
vst1q_f32(pDst, row4a);
pDst += 4;
vst1q_f32(pDst, row1a);
pDst += 4;
vst1q_f32(pDst, row5a);
pDst += 4;
vst1q_f32(pDst, row2a);
pDst += 4;
vst1q_f32(pDst, row6a);
pDst += 4;
vst1q_f32(pDst, row3a);
pDst += 4;
vst1q_f32(pDst, row7a);
pDst += 4;
vst1q_f32(pDst, row0b);
pDst += 4;
vst1q_f32(pDst, row4b);
pDst += 4;
vst1q_f32(pDst, row1b);
pDst += 4;
vst1q_f32(pDst, row5b);
pDst += 4;
vst1q_f32(pDst, row2b);
pDst += 4;
vst1q_f32(pDst, row6b);
pDst += 4;
vst1q_f32(pDst, row3b);
pDst += 4;
vst1q_f32(pDst, row7b);
}
It boils down to: 16 loads + 32 trn + 16 stores vs 64 loads + 64 stores
Now we can clearly see it really isn't worth it. The neon routine above might be a little faster, but I doubt it will make a difference in the end.
No, you can't optimize it any further. Nobody can. Just make sure the pointers are 64-byte aligned, test it, and decide for yourself.
ld1 {v0.4s-v3.4s}, [x1], #64
ld1 {v4.4s-v7.4s}, [x1], #64
ld1 {v16.4s-v19.4s}, [x1], #64
ld1 {v20.4s-v23.4s}, [x1]
trn1 v24.4s, v0.4s, v2.4s // row0
trn1 v25.4s, v1.4s, v3.4s
trn2 v26.4s, v0.4s, v2.4s // row1
trn2 v27.4s, v1.4s, v3.4s
trn1 v28.4s, v4.4s, v6.4s // row2
trn1 v29.4s, v5.4s, v7.4s
trn2 v30.4s, v4.4s, v6.4s // row3
trn2 v31.4s, v5.4s, v7.4s
trn1 v0.4s, v16.4s, v18.4s // row4
trn1 v1.4s, v17.4s, v19.4s
trn2 v2.4s, v16.4s, v18.4s // row5
trn2 v3.4s, v17.4s, v19.4s
trn1 v4.4s, v20.4s, v22.4s // row6
trn1 v5.4s, v21.4s, v23.4s
trn2 v6.4s, v20.4s, v22.4s // row7
trn2 v7.4s, v21.4s, v23.4s
trn1 v16.2d, v24.2d, v28.2d // row0a
trn1 v17.2d, v0.2d, v4.2d // row0b
trn1 v18.2d, v26.2d, v30.2d // row1a
trn1 v19.2d, v2.2d, v6.2d // row1b
trn2 v20.2d, v24.2d, v28.2d // row2a
trn2 v21.2d, v0.2d, v4.2d // row2b
trn2 v22.2d, v26.2d, v30.2d // row3a
trn2 v23.2d, v2.2d, v6.2d // row3b
st1 {v16.4s-v19.4s}, [x0], #64
st1 {v20.4s-v23.4s}, [x0], #64
trn1 v16.2d, v25.2d, v29.2d // row4a
trn1 v17.2d, v1.2d, v5.2d // row4b
trn1 v18.2d, v27.2d, v31.2d // row5a
trn1 v19.2d, v3.2d, v7.2d // row5b
trn2 v20.2d, v25.2d, v29.2d // row4a
trn2 v21.2d, v1.2d, v5.2d // row4b
trn2 v22.2d, v27.2d, v31.2d // row5a
trn2 v23.2d, v3.2d, v7.2d // row5b
st1 {v16.4s-v19.4s}, [x0], #64
st1 {v20.4s-v23.4s}, [x0]
ret
Above is the hand-optimized assembly version, which is most probably shorter (as short as it can get), but not meaningfully faster than the pure C version below, which is what I'd settle with:
void transpose8x8(float *pDst, float *pSrc)
{
uint32_t i = 8;
do {
pDst[0] = *pSrc++;
pDst[8] = *pSrc++;
pDst[16] = *pSrc++;
pDst[24] = *pSrc++;
pDst[32] = *pSrc++;
pDst[40] = *pSrc++;
pDst[48] = *pSrc++;
pDst[56] = *pSrc++;
pDst++;
} while (--i);
}
or
void transpose8x8(float *pDst, float *pSrc)
{
uint32_t i = 8;
do {
*pDst++ = pSrc[0];
*pDst++ = pSrc[8];
*pDst++ = pSrc[16];
*pDst++ = pSrc[24];
*pDst++ = pSrc[32];
*pDst++ = pSrc[40];
*pDst++ = pSrc[48];
*pDst++ = pSrc[56];
pSrc++;
} while (--i);
}
PS: It could bring some gain in performance/power consumption if you declared pDst and pSrc uint32_t *, because the compiler would definitely generate pure integer machine code, which has the most versatile addressing modes, and only use w registers instead of s ones. Just typecast float * to uint32_t *.
PS2: Clang already utilizes w registers instead of s ones while GCC is being GCC.... When will GNU-shills finally admit the fact that GCC is an extremely bad choice for ARM?
godbolt
PS3: Below is the non-neon version in assembly (zero latency) since I was very disappointed (even shocked) in both Clang and GCC above:
.arch armv8-a
.global transpose8x8
.text
.balign 64
.func
transpose8x8:
mov w10, #8
sub x0, x0, #8
.balign 16
1:
ldr w2, [x1, #0]
ldr w3, [x1, #32]
ldr w4, [x1, #64]
ldr w5, [x1, #96]
ldr w6, [x1, #128]
ldr w7, [x1, #160]
ldr w8, [x1, #192]
ldr w9, [x1, #224]
subs w10, w10, #1
stp w2, w3, [x0, #8]
add x1, x1, #4
stp w4, w5, [x0, #16]
stp w6, w7, [x0, #24]
stp w8, w9, [x0, #32]!
b.ne 1b
.balign 16
ret
.endfunc
.end
It's arguably the best version you will ever get if you still insist on doing a pure 8x8 transpose. It might be a little slower than the neon assembly version, but it consumes considerably less power.
It's possible to optimise the 8x8 neon code presented in the other answer; an 8x8 transpose can not only be thought of as a recursive version of [A B;C D]' == [A' C'; B' D'] but also as a repeated application of zip or unzip.
a b c d
e f g h
i j k l
m n o p == a b c d e f g h i j k l m n o p
zip(first_half, last_half) ==
zip(...) == a i b j c k d l e m f n g o h p
zip(...) == a e i m b f j n c g k o d h l p == transpose
For an 8x8 matrix we need to apply this algorithm 3 times, and by reading the data with vld4 two of those passes have already been done.
float32x4x4_t d0 = vld4q_f32(input);
float32x4x4_t d1 = vld4q_f32(input + 16);
float32x4x4_t d2 = vld4q_f32(input + 32);
float32x4x4_t d3 = vld4q_f32(input + 48);
float32x4x4_t e0 = {
vzipq_f32(d0.val[0], d2.val[0]).val[0],
vzipq_f32(d0.val[1], d2.val[1]).val[0],
vzipq_f32(d0.val[2], d2.val[2]).val[0],
vzipq_f32(d0.val[3], d2.val[3]).val[0]
};
float32x4x4_t e1 = {
vzipq_f32(d1.val[0], d3.val[0]).val[0],
vzipq_f32(d1.val[1], d3.val[1]).val[0],
vzipq_f32(d1.val[2], d3.val[2]).val[0],
vzipq_f32(d1.val[3], d3.val[3]).val[0]
};
float32x4x4_t e2 = {
vzipq_f32(d0.val[0], d2.val[0]).val[1],
vzipq_f32(d0.val[1], d2.val[1]).val[1],
vzipq_f32(d0.val[2], d2.val[2]).val[1],
vzipq_f32(d0.val[3], d2.val[3]).val[1]
};
float32x4x4_t e3 = {
vzipq_f32(d1.val[0], d3.val[0]).val[1],
vzipq_f32(d1.val[1], d3.val[1]).val[1],
vzipq_f32(d1.val[2], d3.val[2]).val[1],
vzipq_f32(d1.val[3], d3.val[3]).val[1]
};
vst1q_f32_x4(output, e0);
vst1q_f32_x4(output + 16, e1);
vst1q_f32_x4(output + 32, e2);
vst1q_f32_x4(output + 48, e3);
One should be able to perform the transpose also by starting with vld1q_f32_x4, then uzpq and finish with vst4q_f32.
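As a small illustration of that load/store duality at the 4x4 level (my own sketch, not the 8x8 routine alluded to above): plain loads followed by an interleaving store transpose a block just as a de-interleaving load followed by plain stores does.

#include <arm_neon.h>

static void transpose4x4_st4 (const float *in, float *out)
{
    float32x4x4_t rows = vld1q_f32_x4(in); /* four plain row loads */
    vst4q_f32(out, rows);                  /* interleaving store writes columns, i.e. the transpose */
}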

Problems with integer promotion in C

I'm developing a C code for an embedded application for ARM processor (LPC54628) using Keil software. There's a strange behavior that I am unable to resolve. I tried running this on the software simulator as well as on the microcontroller and the behavior is the same. The problem is with the execution of the second 'else if' condition.
Working code:
uint8_t a; uint8_t b ; uint8_t temp1; uint8_t temp2; uint8_t c;
a = 0x1; b = 0x80; temp1 = 0; temp2 = 0; c = 10U;
temp1 = (b << 1); // after execution, temp1 is 0x00
temp2 = (b >> 7); // after execution, temp2 is 0x01
__NOP();
temp1 = ((b << 1) | (b >> 7)); // after execution, temp1 is 0x00 | 0x01 = 0x01
if (a == b) { }
else if ( a == ((b >> 1) | (b << 7)) ) {c += 1; }
else if ( a == temp1 ) {c -= 1; } // this 'else if' executes since a= 0x01 and temp1 = 0x01
else if ( a == ((b >> 2) | (b << 6)) ) {c += 2; }
else if ( a == ((b << 2) | (b >> 6)) ) {c -= 2; }
else if ( a == ((b >> 3) | (b << 5)) ) {c += 3; }
else if ( a == ((b << 3) | (b >> 5)) ) {c -= 3; }
However, the 'else if' that worked in the code above fails to execute in the following code. Note that the only change I have made is to replace temp1 with the actual expression inside the 'else if' condition. No other change.
Non-working code:
a = 0x1; b = 0x80; temp1 = 0; temp2 = 0; c = 10U;
temp1 = (b << 1); // after execution, temp1 is 0x00
temp2 = (b >> 7); // after execution, temp2 is 0x01
__NOP();
temp1 = ((b << 1) | (b >> 7)); // after execution, temp1 is 0x00 | 0x01 = 0x01
if (a == b) { }
else if ( a == ((b >> 1) | (b << 7)) ) {c += 1; }
else if ( a == ((b << 1) | (b >> 7)) ) {c -= 1; } // this 'else if' DOES NOT execute.
else if ( a == ((b >> 2) | (b << 6)) ) {c += 2; }
else if ( a == ((b << 2) | (b >> 6)) ) {c -= 2; }
else if ( a == ((b >> 3) | (b << 5)) ) {c += 3; }
else if ( a == ((b << 3) | (b >> 5)) ) {c -= 3; }
Can you point out what I am doing wrong?
Integer promotion is annoying. You're fundamentally doing:
else if ( (int) a == (((int)(b << 1)) | ((int)(b >> 7))) ) {
c -= 1;
}
Which means that you're testing if 0x01 == 0x101, which it doesn't.
When you do something like:
uint8_t x = 3;
uint8_t y = x + 4;
You're really doing something like:
uint8_t x = 3;
uint8_t y = (uint8_t)(((int) x) + 4);
In the expression ((b << 1) | (b >> 7)), the value b is first promoted to type int because its type is smaller than int. So this expression ends up being:
((0x80 << 1) | (0x80 >> 7)) == (0x100 | 0x1) == 0x101
When you assign this value to temp1, it is converted to a value that fits and you're left with 0x1. When you instead compare the result of this expression directly against a, you're comparing the value 0x1 with 0x101.
If you want the result of this expression to be 8 bit, you need to cast it to uint8_t to truncate the higher bits.
if (a == b) { }
else if ( a == (uint8_t)((b >> 1) | (b << 7)) ) {c += 1; }
else if ( a == (uint8_t)((b << 1) | (b >> 7)) ) {c -= 1; }
else if ( a == (uint8_t)((b >> 2) | (b << 6)) ) {c += 2; }
else if ( a == (uint8_t)((b << 2) | (b >> 6)) ) {c -= 2; }
else if ( a == (uint8_t)((b >> 3) | (b << 5)) ) {c += 3; }
else if ( a == (uint8_t)((b << 3) | (b >> 5)) ) {c -= 3; }
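An alternative to sprinkling casts is to wrap the rotation in a small helper that makes the truncation back to 8 bits explicit; a minimal sketch (my own, not part of the answer):

#include <stdint.h>

static inline uint8_t rotl8 (uint8_t x, unsigned n)
{
    n &= 7;                                             /* keep the shift count in range */
    return (uint8_t)((x << n) | (x >> ((8 - n) & 7)));  /* truncate after the promotion */
}

With such a helper, the comparison that failed reads a == rotl8(b, 1), and the promotion issue cannot resurface.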
C compilers did NOT always do this; I do not know exactly when it changed.
unsigned int fun0 ( unsigned char a, unsigned char b )
{
return((a<<1)|(b>>1));
}
unsigned int fun1 ( unsigned char a, unsigned char b )
{
return(unsigned char)((a<<1)|(b>>1));
}
00000000 <fun0>:
0: e1a010a1 lsr r1, r1, #1
4: e1810080 orr r0, r1, r0, lsl #1
8: e12fff1e bx lr
0000000c <fun1>:
c: e1a010a1 lsr r1, r1, #1
10: e1810080 orr r0, r1, r0, lsl #1
14: e20000ff and r0, r0, #255 ; 0xff
18: e12fff1e bx lr
In the first function the result of the promoted operations is returned as-is; in the second it is clipped back to 8 bits (note the and with 255).
I specifically had a day-of-year problem many, many years ago; the bug would appear late in the year (it just so happened to be day 256) and fixed itself January first... day = (high_byte<<8)|(low_byte); (fixed with ...((unsigned int)high_byte)<<8...)
unsigned int fun ( unsigned char a, unsigned char b )
{
return((a<<8)|b);
}
00000000 <fun>:
0: e1810400 orr r0, r1, r0, lsl #8
4: e12fff1e bx lr
Would not have broken today...at least with gcc 10.x.x...I also want to say at some point it was implementation defined, but it seems from many of the various quotes on the net that it has been this way since C99...
Note disassembly is your friend...But then always understand that sometimes it is implementation defined (does not seem so in this case) and that just because your compiler did it one way does not mean that is the standard and is true for all compilers. (you are using Keil, I am using gnu, for example).
Folks run into this a lot with floating point
float fun0 ( float a, float b )
{
return(a*(b+2.0));
}
float fun1 ( float a, float b )
{
return(a*(b+2.0F));
}
00000000 <fun0>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a06000 mov r6, r0
8: e1a00001 mov r0, r1
c: ebfffffe bl 0 <__aeabi_f2d>
10: e3a02000 mov r2, #0
14: e3a03101 mov r3, #1073741824 ; 0x40000000
18: ebfffffe bl 0 <__aeabi_dadd>
1c: e1a04000 mov r4, r0
20: e1a00006 mov r0, r6
24: e1a05001 mov r5, r1
28: ebfffffe bl 0 <__aeabi_f2d>
2c: e1a02000 mov r2, r0
30: e1a03001 mov r3, r1
34: e1a00004 mov r0, r4
38: e1a01005 mov r1, r5
3c: ebfffffe bl 0 <__aeabi_dmul>
40: ebfffffe bl 0 <__aeabi_d2f>
44: e8bd4070 pop {r4, r5, r6, lr}
48: e12fff1e bx lr
0000004c <fun1>:
4c: e92d4010 push {r4, lr}
50: e1a04000 mov r4, r0
54: e1a00001 mov r0, r1
58: e3a01101 mov r1, #1073741824 ; 0x40000000
5c: ebfffffe bl 0 <__aeabi_fadd>
60: e1a01004 mov r1, r4
64: ebfffffe bl 0 <__aeabi_fmul>
68: e8bd4010 pop {r4, lr}
6c: e12fff1e bx lr
2.0 is a double in the eyes of the compiler but 2.0F is single. And a double plus a single gets promoted to a double operation. Not an integer promotion but constants have an implied type (integer or floating point) and that plays into promotion.

Bare metal audio output on Raspberry Pi3 working in AARCH64 asm but not the C version

I have been trying to write a bare metal kernel for over a year now and I am up to the point where I am ready to start working on audio output. I have written the code in asm; however, since I'm not great at it, I'm not sure how I can pass audio samples as arguments to an asm function. I tried to rewrite it in C, however it isn't working. This problem is really a spot-the-difference exercise. I know my asm version works, but there the audio sample is hard-coded into the play_audio function. My goal is to have an init function for the audio with no arguments and a play_audio function that takes a pointer to the start of the audio file and a pointer to the end of the audio file. The audio file to be played is a 16-bit unsigned int PCM file. The same file that I'm trying to use for the C audio part is used successfully in the asm version. Since I set the hardware PWM to expect 13-bit audio at 44100Hz, there is a shift to convert the sample from 16 bit to 13 bit, so this isn't a mistake.
Not_working_audio.c
void init_audio_jack_c()//ERROR IN HERE
{
//Set phone jack to pwm output
uint32_t *gpio_addr = (uint32_t *)(PERIPHERAL_BASE + GPIO_BASE);
uint32_t *gpio_gpfsel4_addr = gpio_addr + GPIO_GPFSEL4;
*gpio_gpfsel4_addr = GPIO_FSEL0_ALT0 | GPIO_FSEL5_ALT0;
//Set clock
uint32_t *clock_manager_addr = (uint32_t *)(((PERIPHERAL_BASE + CM_BASE) & 0x0000FFFF) | ((PERIPHERAL_BASE + CM_BASE) & 0xFFFF0000));
*(clock_manager_addr + CM_PWMDIV) = (CM_PASSWORD | 0x2000);
*(clock_manager_addr + CM_PWMCTL) = ((CM_PASSWORD | CM_ENAB) | (CM_SRC_OSCILLATOR + CM_SRC_PLLCPER));
//Set PWM
uint32_t *pwm_manager_addr = (uint32_t *)(((PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000));
*(pwm_manager_addr + PWM_RNG1) = 0x1624;
*(pwm_manager_addr + PWM_RNG2) = 0x1624;
*(pwm_manager_addr + PWM_CTL) = PWM_USEF2 + PWM_PWEN2 + PWM_USEF1 + PWM_PWEN1 + PWM_CLRF1;
printf("[INFO] Audio Init Finished");
}
int32_t play_16bit_unsigned_audio(uint16_t *start, uint16_t *end)
{
if(end < start)
{
printf("[ERROR] End is less than start.");
return 1;
}
if((start - end) % 2 == 0)
{
printf("[ERROR] Isn't a multiple of two so it isn't 16bit");
return 2;
}
uint16_t *end_of_file = (uint16_t *)(uint64_t)(((uint32_t)(uintptr_t)end & 0x0000FFFF) | ((uint32_t)(uintptr_t)end & 0xFFFF0000));
//FIFO write
while(start != end_of_file)
{
uint16_t sample = start[0];
sample >>= 3;
*(uint32_t *)((((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000)) + PWM_FIF1) = sample;
start++;
sample = start[0];
sample >>= 3;
*(uint32_t *)((((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000)) + PWM_FIF1) = sample;
//FIFO wait
while(*(uint32_t *)((((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000)) + PWM_STA) != PWM_FULL1);
start++;
}
printf("[INFO] Completed Audio");
return 0;
}
Working_audio.s
.section .text.init_audio_jack, "ax", %progbits
.balign 4
.globl init_audio_jack;
.type init_audio_jack, %function
init_audio_jack:
mov w0,PERIPHERAL_BASE + GPIO_BASE
mov w1,GPIO_FSEL0_ALT0
orr w1,w1,GPIO_FSEL5_ALT0
str w1,[x0,GPIO_GPFSEL4]
// Set Clock
mov w0, PERIPHERAL_BASE
add w0, w0, CM_BASE
and w0, w0, 0x0000FFFF
mov w1, PERIPHERAL_BASE
add w1, w1, CM_BASE
and w1, w1, 0xFFFF0000
orr w0,w0,w1
mov w1,CM_PASSWORD
orr w1,w1,0x2000 // Bits 0..11 Fractional Part Of Divisor = 0, Bits 12..23 Integer Part Of Divisor = 2
brk #0
str w1,[x0,CM_PWMDIV]
mov w1,CM_PASSWORD
orr w1,w1,CM_ENAB
orr w1,w1,CM_SRC_OSCILLATOR + CM_SRC_PLLCPER // Use 650MHz PLLC Clock
str w1,[x0,CM_PWMCTL]
// Set PWM
mov w0, PERIPHERAL_BASE
add w0, w0, PWM_BASE
and w0, w0, 0x0000FFFF
mov w1,PERIPHERAL_BASE
add w1, w1, PWM_BASE
and w1, w1, 0xFFFF0000
orr w0,w0,w1
mov w1,0x1624 // Range = 13bit 44100Hz Mono
str w1,[x0,PWM_RNG1]
str w1,[x0,PWM_RNG2]
mov w1,PWM_USEF2 + PWM_PWEN2 + PWM_USEF1 + PWM_PWEN1 + PWM_CLRF1
str w1,[x0,PWM_CTL]
.section .text.play_audio, "ax", %progbits
.balign 4
.globl play_audio;
.type play_audio, %function
play_audio:
Loop:
adr x1, _binary_src_audio_Interlude_bin_start // X1 = Sound Sample
ldr w2, =_binary_src_audio_Interlude_bin_end
and w2, w2, 0x0000FFFF // W2 = End Of Sound Sample
ldr w3, =_binary_src_audio_Interlude_bin_end
and w3, w3, 0xFFFF0000
orr w2,w2,w3
FIFO_Write:
ldrh w3,[x1],2 // Write 2 Bytes To FIFO
lsr w3,w3,3 // Convert 16bit To 13bit
str w3,[x0,PWM_FIF1] // FIFO Address
ldrh w3, [x1], 2
lsr w3, w3, 3
str w3, [x0, PWM_FIF1]
FIFO_Wait:
ldr w3,[x0,PWM_STA]
tst w3,PWM_FULL1 // Test Bit 1 FIFO Full
b.ne FIFO_Wait
cmp w1,w2 // Check End Of Sound Sample
b.ne FIFO_Write
b Loop // Play Sample Again
Thanks in advance to anyone that can help!

How many 64-bit multiplications are needed to calculate the low 128-bits of a 64-bit by 128-bit product?

Consider that you want to calculate the low 128-bits of the result of multiplying a 64-bit and 128-bit unsigned number, and that the largest multiplication you have available is the C-like 64-bit multiplication which takes two 64-bit unsigned inputs and returns the low 64-bits of the result.
How many multiplications are needed?
Certainly you can do it with eight: break all the inputs up into 32-bit chunks and use your 64-bit multiplication to do the 4 * 2 = 8 required full-width 32*32->64 multiplications, but can one do better?
Of course the algorithm should do only a "reasonable" number of additions or other basic arithmetic on top of the multiplications (I'm not interested in solutions that re-invent multiplication as an addition loop and hence claim "zero" multiplications).
Four, but it starts to get a little tricky.
Let a and b be the numbers to be multiplied, with a0 and a1 being the low and high 32 bits of a, respectively, and b0, b1, b2, b3 being 32-bit groups of b, from low to high respectively.
The desired result is the remainder of (a0 + a1•2^32) • (b0 + b1•2^32 + b2•2^64 + b3•2^96) modulo 2^128.
We can rewrite that as (a0 + a1•2^32) • (b0 + b1•2^32) + (a0 + a1•2^32) • (b2•2^64 + b3•2^96) modulo 2^128.
The remainder of the latter term modulo 2^128 can be computed as a single 64-bit by 64-bit multiplication (whose result is implicitly multiplied by 2^64).
Then the former term can be computed with three multiplications using a
carefully implemented Karatsuba step. The simple version would involve a 33-bit by 33-bit to 66-bit product which is not available, but there is a trickier version that avoids it:
z0 = a0 * b0
z2 = a1 * b1
z1 = abs(a0 - a1) * abs(b0 - b1) * sgn(a0 - a1) * sgn(b1 - b0) + z0 + z2
The last line contains only one multiplication; the other two pseudo-multiplications are just conditional negations. Absolute-difference and conditional-negate are annoying to implement in pure C, but it could be done.
Of course, without Karatsuba, 5 multiplies.
Karatsuba is wonderful, but these days a 64 x 64 multiply can be over in 3 clocks and a new one can be scheduled every clock. So the overhead of dealing with the signs and what not can be significantly greater than the saving of one multiply.
For straightforward 64 x 64 multiply need:
r0 = a0*b0
r1 = a0*b1
r2 = a1*b0
r3 = a1*b1
where need to add r0 = r0 + (r1 << 32) + (r2 << 32)
and add r3 = r3 + (r1 >> 32) + (r2 >> 32) + carry
where the carry is the carry from the additions to r0, and result is r3:r0.
typedef struct { uint64_t w0, w1 ; } uint64x2_t ;
uint64x2_t
mulu64x2(uint64_t x, uint64_t m)
{
uint64x2_t r ;
uint64_t r1, r2, rx, ry ;
uint32_t x1, x0 ;
uint32_t m1, m0 ;
x1 = (uint32_t)(x >> 32) ;
x0 = (uint32_t)x ;
m1 = (uint32_t)(m >> 32) ;
m0 = (uint32_t)m ;
r1 = (uint64_t)x1 * m0 ;
r2 = (uint64_t)x0 * m1 ;
r.w0 = (uint64_t)x0 * m0 ;
r.w1 = (uint64_t)x1 * m1 ;
rx = (uint32_t)r1 ;
rx = rx + (uint32_t)r2 ; // add the ls halves, collecting carry
ry = r.w0 >> 32 ; // pick up ms of r0
r.w0 += (rx << 32) ; // complete r0
rx += ry ; // complete addition, rx >> 32 == carry !
r.w1 += (r1 >> 32) + (r2 >> 32) + (rx >> 32) ;
return r ;
}
For Karatsuba, the suggested:
z1 = abs(a0 - a1) * abs(b0 - b1) * sgn(a0 - a1) * sgn(b1 - b0) + z0 + z2
is trickier than it looks... for a start, if z1 is 64 bits, then need to somehow collect the carry which this addition can generate... and that is complicated by the signed-ness issues.
z0 = a0*b0
z1 = ax*bx -- ax = (a1 - a0), bx = (b0 - b1)
z2 = a1*b1
where need to add r0 = z0 + (z1 << 32) + (z0 << 32) + (z2 << 32)
and add r1 = z2 + (z1 >> 32) + (z0 >> 32) + (z2 >> 32) + carry
where the carry is the carry from the additions to create r0, and result is r1:r0.
where must take into account the signed-ness of ax, bx and z1.
uint64x2_t
mulu64x2_karatsuba(uint64_t a, uint64_t b)
{
uint64_t a0, a1, b0, b1 ;
uint64_t ax, bx, zx, zy ;
uint as, bs, xs ;
uint64_t z0, z2 ;
uint64x2_t r ;
a0 = (uint32_t)a ; a1 = a >> 32 ;
b0 = (uint32_t)b ; b1 = b >> 32 ;
z0 = a0 * b0 ;
z2 = a1 * b1 ;
ax = (uint64_t)(a1 - a0) ;
bx = (uint64_t)(b0 - b1) ;
as = (uint)(ax > a1) ; // sign of magic middle, a
bs = (uint)(bx > b0) ; // sign of magic middle, b
xs = (uint)(as ^ bs) ; // sign of magic middle, x = a * b
ax = (uint64_t)((ax ^ -(uint64_t)as) + as) ; // abs magic middle a
bx = (uint64_t)((bx ^ -(uint64_t)bs) + bs) ; // abs magic middle b
zx = (uint64_t)(((ax * bx) ^ -(uint64_t)xs) + xs) ;
xs = xs & (uint)(zx != 0) ; // discard sign if z1 == 0 !
zy = (uint32_t)zx ; // start ls half of z1
zy = zy + (uint32_t)z0 + (uint32_t)z2 ;
r.w0 = z0 + (zy << 32) ; // complete ls word of result.
zy = zy + (z0 >> 32) ; // complete carry
zx = (zx >> 32) - ((uint64_t)xs << 32) ; // start ms half of z1
r.w1 = z2 + zx + (z0 >> 32) + (z2 >> 32) + (zy >> 32) ;
return r ;
}
I did some very simple timings (using times(), running on Ryzen 7 1800X):
using gcc __int128................... ~780 'units'
using mulu64x2()..................... ~895
using mulu64x2_karatsuba()... ~1,095
...so, yes, you can save a multiply by using Karatsuba, but whether it's worth doing rather depends.
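Tying this back to the original question, here is a minimal, untested sketch of the low 128 bits of a 64 x 128-bit product, reusing the uint64x2_t type and the mulu64x2_karatsuba() routine above, with the 128-bit operand passed as two 64-bit halves:

uint64x2_t
mul_64x128_lo(uint64_t a, uint64_t b_lo, uint64_t b_hi)
{
    uint64x2_t r = mulu64x2_karatsuba(a, b_lo) ; /* full 128-bit a * b_lo: 3 multiplies */
    r.w1 += a * b_hi ;   /* low 64 bits of a * b_hi contribute at bit 64: 1 multiply */
    return r ;
}

Swapping in the plain mulu64x2() instead gives the five-multiply version mentioned at the start of this answer.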
