Related
In the context of exploratory activity I have started to take a look at integer & fixed-point arithmetic building blocks for 32-bit platforms. My primary target would be ARM32 (specifically armv7), with a side glance to RISC-V32 which I expect to grow in importance in the embedded space. The first sample building block I chose to examine is unsigned 64x64->128 bit integer multiplication. Other questions on this site about this building block do not provide detailed coverage of 32-bit platforms.
Over the past thirty years, I have implemented this and other arithmetic building blocks multiple times, but always in assembly language, for various architectures. However, at this point in time my hope and desire is that these could be programmed in straight ISO-C, without the use of intrinsics. Ideally a single version of the C code would generate good machine code across architectures. I know that the approach of manipulating HLL code to control machine code is generally brittle, but hope that processor architectures and toolchains have matured enough to make this feasible.
Some approaches used in assembly language implementations are not well suited for porting to C. In the exemplary code below I have selected six variants that seemed amenable to an HLL implementation. Besides the generation of partial products, which is common to all variants, the two basic approaches are: (1) Sum the partial products using 64-bit arithmetic, letting the compiler take care of the carry propagation between 32-bit halves. In this case there are multiple choices in which order to sum the partial products. (2) Use 32-bit arithmetic for the summing, simulating the carry flag directly. In this case we have a choice of generating the carry after an addition (a = a + b; carry = a < b;) or before the addition (carry = ~a < b; a = a + b;). Variants 1 through 3 below fall into the former category, variants 5 and 6 fall into the latter.
At Compiler Explorer, I focused on the toolchains gcc 12.2 and clang 15.0 for the platforms of interest. I compiled with -O3. The general finding is that on average clang generates more efficient code than gcc, and that the differences between the variants (number of instructions and registers used) are more pronounced with clang. While this may be understandable in the case of RISC-V as the newer architecture, it surprised me in the case of armv7 which has been around for well over a dozen years.
Three cases in particular struck me as noteworthy. While I have worked with compiler engineers before and have a reasonable understanding of basic code transformation, phase ordering issues, etc, the only technique I aware of that might apply to this code is idiom recognition, and I do not see how this could explain the observations by itself. The first case is variant 3, where clang 15.0 produces extremely tight code comprising just 10 instructions that I don't think can be improved upon:
umul64wide:
push {r4, lr}
umull r12, r4, r2, r0
umull lr, r0, r3, r0
umaal lr, r4, r2, r1
umaal r0, r4, r3, r1
ldr r1, [sp, #8]
strd r0, r4, [r1]
mov r0, r12
mov r1, lr
pop {r4, pc}
By contrast, gcc generates twice the number of instructions and requires twice the number of registers. I hypothesize that it does not recognize how to use the multiply-accumulate instruction umaal here, but is that the full story? The reverse situation, but not quite as dramatic, occurs in variant 6, where gcc 12.2 produces this sequence of 18 instructions, with low register usage:
umul64wide:
mov ip, r0
push {r4, r5, lr}
mov lr, r1
umull r0, r1, r0, r2
ldr r4, [sp, #12]
umull r5, ip, r3, ip
adds r1, r1, r5
umull r2, r5, lr, r2
adc ip, ip, #0
umull lr, r3, lr, r3
adds r1, r1, r2
adc r2, ip, #0
adds r2, r2, r5
adc r3, r3, #0
adds r2, r2, lr
adc r3, r3, #0
strd r2, r3, [r4]
pop {r4, r5, pc}
The generated code nicely turns the simulated carry propagation into real carry propagation. clang 15.0 uses nine instructions and five registers more, and I cannot really make out what it is trying to do without spending much more time on analysis. The third observation is with regard to the differences seen in the machine code produced for variant 5 vs. variant 6, in particular with clang. These use the same basic algorithm, with one variant computing the simulated carry before the additions, the other after it. I did find in the end that one variant, namely variant 4, seems to be efficient across both tool chains and both architectures. However, before I proceed to other building blocks and face a similar struggle, I would like to inquire:
(1) Are there coding idioms or algorithms I have not considered in the code below that might lead to superior results? (2) Are there specific optimization switches, e.g. a hypothetical -ffrobnicate (see here), that are not included in -O3 that would help the compilers generate efficient code more consistently for these kind of bit-manipulation scenarios? Explanations as to what compiler mechanisms are likely responsible for the cases of significant differences in code generation observed, and how one might influence or work round them, could also be helpful.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define VARIANT (3)
#define USE_X64_ASM_REF (0)
/* Multiply two unsigned 64-bit operands a and b. Returns least significant 64
bits of product as return value, most significant 64 bits of product via h.
*/
uint64_t umul64wide (uint64_t a, uint64_t b, uint64_t *h)
{
uint32_t a_lo = (uint32_t)a;
uint32_t a_hi = a >> 32;
uint32_t b_lo = (uint32_t)b;
uint32_t b_hi = b >> 32;
uint64_t p0 = (uint64_t)a_lo * b_lo;
uint64_t p1 = (uint64_t)a_lo * b_hi;
uint64_t p2 = (uint64_t)a_hi * b_lo;
uint64_t p3 = (uint64_t)a_hi * b_hi;
#if VARIANT == 1
uint32_t c = (uint32_t)(((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32);
*h = p3 + (p1 >> 32) + (p2 >> 32) + c;
return p0 + ((p1 + p2) << 32);
#elif VARIANT == 2
uint64_t s = (p0 >> 32) + p1;
uint64_t t = (uint32_t)s + p2;
*h = (s >> 32) + (t >> 32) + p3;
return (uint32_t)p0 + (t << 32);
#elif VARIANT == 3
*h = (p1 >> 32) + (((p0 >> 32) + (uint32_t)p1 + p2) >> 32) + p3;
return p0 + ((p1 + p2) << 32);
#elif VARIANT == 4
uint64_t t = (p0 >> 32) + p1 + (uint32_t)p2;
*h = (p2 >> 32) + (t >> 32) + p3;
return (uint32_t)p0 + (t << 32);
#elif VARIANT == 5
uint32_t r0, r1, r2, r3, r4, r5, r6;
r0 = (uint32_t)p0;
r1 = p0 >> 32;
r5 = (uint32_t)p1;
r2 = p1 >> 32;
r1 = r1 + r5;
r6 = r1 < r5;
r2 = r2 + r6;
r6 = (uint32_t)p2;
r5 = p2 >> 32;
r1 = r1 + r6;
r6 = r1 < r6;
r2 = r2 + r6;
r4 = (uint32_t)p3;
r3 = p3 >> 32;
r2 = r2 + r5;
r6 = r2 < r5;
r3 = r3 + r6;
r2 = r2 + r4;
r6 = r2 < r4;
r3 = r3 + r6;
*h = ((uint64_t)r3 << 32) | r2;
return ((uint64_t)r1 << 32) | r0;
#elif VARIANT == 6
uint32_t r0, r1, r2, r3, r4, r5, r6;
r0 = (uint32_t)p0;
r1 = p0 >> 32;
r5 = (uint32_t)p1;
r2 = p1 >> 32;
r4 = ~r1;
r4 = r4 < r5;
r1 = r1 + r5;
r2 = r2 + r4;
r6 = (uint32_t)p2;
r5 = p2 >> 32;
r4 = ~r1;
r4 = r4 < r6;
r1 = r1 + r6;
r2 = r2 + r4;
r4 = (uint32_t)p3;
r3 = p3 >> 32;
r6 = ~r2;
r6 = r6 < r5;
r2 = r2 + r5;
r3 = r3 + r6;
r6 = ~r2;
r6 = r6 < r4;
r2 = r2 + r4;
r3 = r3 + r6;
*h = ((uint64_t)r3 << 32) | r2;
return ((uint64_t)r1 << 32) | r0;
#else
#error unsupported VARIANT
#endif
}
#if defined(__SIZEOF_INT128__)
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
unsigned __int128 prod = ((unsigned __int128)a) * b;
*h = (uint64_t)(prod >> 32);
return (uint64_t)prod;
}
#elif defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
*h = __umulh (a, b);
return a * b;
}
#elif USE_X64_ASM_REF
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
uint64_t res_l, res_h;
__asm__ (
"movq %2, %%rax;\n\t" // rax = a
"mulq %3;\n\t" // rdx:rax = a * b
"movq %%rdx, %0;\n\t" // res_h = rdx
"movq %%rax, %1;\n\t" // res_l = rax
: "=rm" (res_h), "=rm"(res_l)
: "rm"(a), "rm"(b)
: "%rax", "%rdx");
*h = res_h;
return res_l;
}
#else // generic (and slow) reference implementation
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
uint32_t cy, t0, t1;
uint32_t a_lo = (uint32_t)a;
uint32_t a_hi = a >> 32;
uint32_t b_lo = (uint32_t)b;
uint32_t b_hi = b >> 32;
uint64_t p0 = (uint64_t)a_lo * b_lo;
uint64_t p1 = (uint64_t)a_lo * b_hi;
uint64_t p2 = (uint64_t)a_hi * b_lo;
uint64_t p3 = (uint64_t)a_hi * b_hi;
uint32_t p0_lo = (uint32_t)p0;
uint32_t p0_hi = p0 >> 32;
uint32_t p1_lo = (uint32_t)p1;
uint32_t p1_hi = p1 >> 32;
uint32_t p2_lo = (uint32_t)p2;
uint32_t p2_hi = p2 >> 32;
uint32_t p3_lo = (uint32_t)p3;
uint32_t p3_hi = p3 >> 32;
uint32_t r0 = p0_lo;
uint32_t r1 = ADDcc (p0_hi, p1_lo, cy, t0, t1);
uint32_t r2 = ADDCcc (p1_hi, p2_hi, cy, t0, t1);
uint32_t r3 = ADDC (p3_hi, 0, cy, t0, t1);
r1 = ADDcc (r1, p2_lo, cy, t0, t1);
r2 = ADDCcc (r2, p3_lo, cy, t0, t1);
r3 = ADDC (r3, 0, cy, t0, t1);
*h = ((uint64_t)r3 << 32) + r2;
return ((uint64_t)r1 << 32) + r0;
}
#endif
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
uint64_t a, b, res_hi, res_lo, ref_hi, ref_lo, count = 0;
printf ("Smoke test of umul64wide variant %d\n", VARIANT);
do {
a = KISS64;
b = KISS64;
ref_lo = umul64wide_ref (a, b, &ref_hi);
res_lo = umul64wide (a, b, &res_hi);
if ((res_lo ^ ref_lo) | (res_hi ^ ref_hi)) {
printf ("!!!! error: a=%016llx b=%016llx res=%016llx_%016llx ref=%016llx_%016llx\n",
a, b, res_hi, res_lo, ref_hi, ref_lo);
return EXIT_FAILURE;
}
if (!(count & 0xfffffff)) printf ("\r%llu", count);
count++;
} while (count);
return EXIT_SUCCESS;
}
I avoided the use of the ((x += y) < y) overflow test, since not every ISA handles conditional flags efficiently, and may inhibit re-ordering when using the results of flag register(s); x86[-64] is the obvious example, though later BMI(2) instructions may help mitigate this. I also added a 32 x 32 -> 64 bit C implementation for comparison - but I would expect any modern ISA to at least supply a 'high word' multiply like ARM's umulh.
/******************************************************************************/
/* stackoverflow.com/questions/74713642 */
#include <inttypes.h>
#include <stdio.h>
/* umul_32_32 : 32 x 32 => 64 */
/* force inline (non-portable), or implement it as macro, e.g.,
* #define umul_32_32(rh, rl, x, y) do { ... } while (0) */
#if (1)
static inline __attribute__((always_inline))
uint64_t umul_32_32 (uint32_t x, uint32_t y)
{
return (((uint64_t) x) * y);
}
#else
/* if no widening multiply is available, the compiler probably
* generates something at least as efficient as the following -
* or (worst case) it calls a builtin function. */
static inline __attribute__((always_inline))
uint64_t umul_32_32 (uint32_t x, uint32_t y)
{
uint32_t m0, m1, m2, m3; /* (partial products) */
uint32_t x0, x1, y0, y1;
x0 = x & UINT16_MAX, x1 = x >> (16);
y0 = y & UINT16_MAX, y1 = y >> (16);
m0 = x0 * y0, m1 = x1 * y0;
m2 = x0 * y1, m3 = x1 * y1;
m1 += m0 >> (16);
m3 += m2 >> (16);
m1 += m2 & UINT16_MAX;
uint32_t rh = m3 + (m1 >> (16));
uint32_t rl = m1 << (16) | (m0 & UINT16_MAX);
return (((uint64_t) rh) << 32 | rl);
/* 32 x 32 => 64 : no branching or carry overflow tests. */
}
#endif
/* ensure the function is called to inspect code gen / assembly,
* otherwise gcc and clang evaluate this at compile time. */
__attribute__((noinline)) void umul_64_64 (
uint64_t *rh, uint64_t *rl, uint64_t x, uint64_t y)
{
uint64_t m0, m1, m2, m3; /* (partial products) */
uint32_t x0, x1, y0, y1;
x0 = (uint32_t) (x), x1 = (uint32_t) (x >> (32));
y0 = (uint32_t) (y), y1 = (uint32_t) (y >> (32));
m0 = umul_32_32(x0, y0), m1 = umul_32_32(x1, y0);
m2 = umul_32_32(x0, y1), m3 = umul_32_32(x1, y1);
m1 += m0 >> (32);
m3 += m2 >> (32);
m1 += m2 & UINT32_MAX;
*rh = m3 + (m1 >> (32));
*rl = m1 << (32) | (m0 & UINT32_MAX);
/* 64 x 64 => 128 : no branching or carry overflow tests. */
}
#if (0)
int main (void)
{
uint64_t x = UINT64_MAX, y = UINT64_MAX, rh, rl;
umul_64_64(& rh, & rl, x, y);
fprintf(stdout, "0x%016" PRIX64 ":0x%016" PRIX64 "\n", rh, rl);
return (0);
}
#endif
/******************************************************************************/
For ARM-7, I'm getting more or less the same results as your 'variant 3' code, which isn't surprising, since it's the same essential idea. I tried different flags on gcc-12 and gcc-trunk, but couldn't improve it.
I'd hazard a guess that with Apple's investment in AArch64 silicon, there's simply been more aggressive optimization and funding directed toward clang that benefits 32-bit ARM-7 as well. But that's pure speculation. It's a pretty glaring disparity for such a major platform though.
I have a program that needs to run a transpose operation on 8x8 float32 matrices many times. I want to transpose these using NEON SIMD intrinsics. I know that the array will always contain 8x8 float elements. I have a baseline non-intrinsic solution below:
void transpose(float *matrix, float *matrixT) {
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
matrixT[i*8+j] = matrix[j*8+i];
}
}
}
I also created an intrinsic solution that transposes each 4x4 quadrant of the 8x8 matrix, and swaps the positions of the second and third quadrants. This solution looks like this:
void transpose_4x4(float *matrix, float *matrixT, int store_index) {
float32x4_t r0, r1, r2, r3, c0, c1, c2, c3;
r0 = vld1q_f32(matrix);
r1 = vld1q_f32(matrix + 8);
r2 = vld1q_f32(matrix + 16);
r3 = vld1q_f32(matrix + 24);
c0 = vzip1q_f32(r0, r1);
c1 = vzip2q_f32(r0, r1);
c2 = vzip1q_f32(r2, r3);
c3 = vzip2q_f32(r2, r3);
r0 = vcombine_f32(vget_low_f32(c0), vget_low_f32(c2));
r1 = vcombine_f32(vget_high_f32(c0), vget_high_f32(c2));
r2 = vcombine_f32(vget_low_f32(c1), vget_low_f32(c3));
r3 = vcombine_f32(vget_high_f32(c1), vget_high_f32(c3));
vst1q_f32(matrixT + store_index, r0);
vst1q_f32(matrixT + store_index + 8, r1);
vst1q_f32(matrixT + store_index + 16, r2);
vst1q_f32(matrixT + store_index + 24, r3);
}
void transpose(float *matrix, float *matrixT) {
// Transpose top-left 4x4 quadrant and store the result in the top-left 4x4 quadrant
transpose_4x4(matrix, matrixT, 0);
// Transpose top-right 4x4 quadrant and store the result in the bottom-left 4x4 quadrant
transpose_4x4(matrix + 4, matrixT, 32);
// Transpose bottom-left 4x4 quadrant and store the result in the top-right 4x4 quadrant
transpose_4x4(matrix + 32, matrixT, 4);
// Transpose bottom-right 4x4 quadrant and store the result in the bottom-right 4x4 quadrant
transpose_4x4(matrix + 36, matrixT, 36);
}
This solution however, results in a slower performance than the baseline non-intrinsic solution. I am struggling to see, if there is one, a faster solution that can transpose my 8x8 matrix. Any help would be greatly appreciated!
Edit: both solutions are compiled using the -O1 flag.
First off, you shouldn't expect a huge performance boost to start with:
there is actually no computation
you are dealing with 32bit data, and thus, not much of bandwidth constraint.
to sum it up, just a little bit saving in bandwidth by vectorizing - that's all
As for the 4x4 transpose, you don't even need a separate function, but just a macro:
#define TRANSPOSE4x4(pSrc,pDst) vst1q_f32_x4(pDst,vld4q_f32(pSrc))
will do the job since NEON does the 4x4 transpose on the fly when you load the data with vld4.
But you should ask yourself at this point if your approach - transposing all the matrice prior to actual computation - is the right one if 4x4 transpose costs virtually nothing. This step could end up being a pure waste of computation and bandwidth. Optimization shouldn't be limited to the final step, but should be considered from the designing phase.
8x8 transpose is a different animal though:
void transpose8x8(float *pDst, float *pSrc)
{
float32x4_t row0a, row0b, row1a, row1b, row2a, row2b, row3a, row3b, row4a, row4b, row5a, row5b, row6a, row6b, row7a, row7b;
float32x4_t r0a, r0b, r1a, r1b, r2a, r2b, r3a, r3b, r4a, r4b, r5a, r5b, r6a, r6b, r7a, r7b;
row0a = vld1q_f32(pSrc);
pSrc += 4;
row0b = vld1q_f32(pSrc);
pSrc += 4;
row1a = vld1q_f32(pSrc);
pSrc += 4;
row1b = vld1q_f32(pSrc);
pSrc += 4;
row2a = vld1q_f32(pSrc);
pSrc += 4;
row2b = vld1q_f32(pSrc);
pSrc += 4;
row3a = vld1q_f32(pSrc);
pSrc += 4;
row3b = vld1q_f32(pSrc);
pSrc += 4;
row4a = vld1q_f32(pSrc);
pSrc += 4;
row4b = vld1q_f32(pSrc);
pSrc += 4;
row5a = vld1q_f32(pSrc);
pSrc += 4;
row5b = vld1q_f32(pSrc);
pSrc += 4;
row6a = vld1q_f32(pSrc);
pSrc += 4;
row6b = vld1q_f32(pSrc);
pSrc += 4;
row7a = vld1q_f32(pSrc);
pSrc += 4;
row7b = vld1q_f32(pSrc);
r0a = vtrn1q_f32(row0a, row1a);
r0b = vtrn1q_f32(row0b, row1b);
r1a = vtrn2q_f32(row0a, row1a);
r1b = vtrn2q_f32(row0b, row1b);
r2a = vtrn1q_f32(row2a, row3a);
r2b = vtrn1q_f32(row2b, row3b);
r3a = vtrn2q_f32(row2a, row3a);
r3b = vtrn2q_f32(row2b, row3b);
r4a = vtrn1q_f32(row4a, row5a);
r4b = vtrn1q_f32(row4b, row5b);
r5a = vtrn2q_f32(row4a, row5a);
r5b = vtrn2q_f32(row4b, row5b);
r6a = vtrn1q_f32(row6a, row7a);
r6b = vtrn1q_f32(row6b, row7b);
r7a = vtrn2q_f32(row6a, row7a);
r7b = vtrn2q_f32(row6b, row7b);
row0a = vtrn1q_f64(row0a, row2a);
row0b = vtrn1q_f64(row0b, row2b);
row1a = vtrn1q_f64(row1a, row3a);
row1b = vtrn1q_f64(row1b, row3b);
row2a = vtrn2q_f64(row0a, row2a);
row2b = vtrn2q_f64(row0b, row2b);
row3a = vtrn2q_f64(row1a, row3a);
row3b = vtrn2q_f64(row1b, row3b);
row4a = vtrn1q_f64(row4a, row6a);
row4b = vtrn1q_f64(row4b, row6b);
row5a = vtrn1q_f64(row5a, row7a);
row5b = vtrn1q_f64(row5b, row7b);
row6a = vtrn2q_f64(row4a, row6a);
row6b = vtrn2q_f64(row4b, row6b);
row7a = vtrn2q_f64(row5a, row7a);
row7b = vtrn2q_f64(row5b, row7b);
vst1q_f32(pDst, row0a);
pDst += 4;
vst1q_f32(pDst, row4a);
pDst += 4;
vst1q_f32(pDst, row1a);
pDst += 4;
vst1q_f32(pDst, row5a);
pDst += 4;
vst1q_f32(pDst, row2a);
pDst += 4;
vst1q_f32(pDst, row6a);
pDst += 4;
vst1q_f32(pDst, row3a);
pDst += 4;
vst1q_f32(pDst, row7a);
pDst += 4;
vst1q_f32(pDst, row0b);
pDst += 4;
vst1q_f32(pDst, row4b);
pDst += 4;
vst1q_f32(pDst, row1b);
pDst += 4;
vst1q_f32(pDst, row5b);
pDst += 4;
vst1q_f32(pDst, row2b);
pDst += 4;
vst1q_f32(pDst, row6b);
pDst += 4;
vst1q_f32(pDst, row3b);
pDst += 4;
vst1q_f32(pDst, row7b);
}
It boils down to : 16 load + 32 trn + 16 store vs 64 load + 64 store
Now we can clearly see it really isn't worth it. The neon routine above might be a little faster, but I doubt it will make a difference in the end.
No, you can't optimize it any further. Nobody can. Just make sure the pointers are 64byte aligned, test it, and decide for yourself.
ld1 {v0.4s-v3.4s}, [x1], #64
ld1 {v4.4s-v7.4s}, [x1], #64
ld1 {v16.4s-v19.4s}, [x1], #64
ld1 {v20.4s-v23.4s}, [x1]
trn1 v24.4s, v0.4s, v2.4s // row0
trn1 v25.4s, v1.4s, v3.4s
trn2 v26.4s, v0.4s, v2.4s // row1
trn2 v27.4s, v1.4s, v3.4s
trn1 v28.4s, v4.4s, v6.4s // row2
trn1 v29.4s, v5.4s, v7.4s
trn2 v30.4s, v4.4s, v6.4s // row3
trn2 v31.4s, v5.4s, v7.4s
trn1 v0.4s, v16.4s, v18.4s // row4
trn1 v1.4s, v17.4s, v19.4s
trn2 v2.4s, v16.4s, v18.4s // row5
trn2 v3.4s, v17.4s, v19.4s
trn1 v4.4s, v20.4s, v22.4s // row6
trn1 v5.4s, v21.4s, v23.4s
trn2 v6.4s, v20.4s, v22.4s // row7
trn2 v7.4s, v21.4s, v23.4s
trn1 v16.2d, v24.2d, v28.2d // row0a
trn1 v17.2d, v0.2d, v4.2d // row0b
trn1 v18.2d, v26.2d, v30.2d // row1a
trn1 v19.2d, v2.2d, v6.2d // row1b
trn2 v20.2d, v24.2d, v28.2d // row2a
trn2 v21.2d, v0.2d, v4.2d // row2b
trn2 v22.2d, v26.2d, v30.2d // row3a
trn2 v23.2d, v2.2d, v6.2d // row3b
st1 {v16.4s-v19.4s}, [x0], #64
st1 {v20.4s-v23.4s}, [x0], #64
trn1 v16.2d, v25.2d, v29.2d // row4a
trn1 v17.2d, v1.2d, v5.2d // row4b
trn1 v18.2d, v27.2d, v31.2d // row5a
trn1 v19.2d, v3.2d, v7.2d // row5b
trn2 v20.2d, v25.2d, v29.2d // row4a
trn2 v21.2d, v1.2d, v5.2d // row4b
trn2 v22.2d, v27.2d, v31.2d // row5a
trn2 v23.2d, v3.2d, v7.2d // row5b
st1 {v16.4s-v19.4s}, [x0], #64
st1 {v20.4s-v23.4s}, [x0]
ret
above is the hand optimized assembly version that's most probably shorter (as short as it can get), but not exactly meaningfully faster than:
Below is the pure C version that I'd settle with:
void transpose8x8(float *pDst, float *pSrc)
{
uint32_t i = 8;
do {
pDst[0] = *pSrc++;
pDst[8] = *pSrc++;
pDst[16] = *pSrc++;
pDst[24] = *pSrc++;
pDst[32] = *pSrc++;
pDst[40] = *pSrc++;
pDst[48] = *pSrc++;
pDst[56] = *pSrc++;
pDst++;
} while (--i);
}
or
void transpose8x8(float *pDst, float *pSrc)
{
uint32_t i = 8;
do {
*pDst++ = pSrc[0];
*pDst++ = pSrc[8];
*pDst++ = pSrc[16];
*pDst++ = pSrc[24];
*pDst++ = pSrc[32];
*pDst++ = pSrc[40];
*pDst++ = pSrc[48];
*pDst++ = pSrc[56];
pSrc++;
} while (--i);
}
PS: It could bring some gain in performance/power consumption if you declared pDst and pSrc uint32_t *, because the compiler would definitely generate pure integer machine code which has most various addressing modes, and only use w registers instead of s ones. Just typecase float * to uint32_t *
PS2: Clang already utilizes w registers instead of s ones while GCC is being GCC.... When will GNU-shills finally admit the fact that GCC is an extremely bad choice for ARM?
godbolt
PS3: Below is the non-neon version in assembly (zero latency) since I was very disappointed (even shocked) in both Clang and GCC above:
.arch armv8-a
.global transpose8x8
.text
.balign 64
.func
transpose8x8:
mov w10, #8
sub x0, x0, #8
.balign 16
1:
ldr w2, [x1, #0]
ldr w3, [x1, #32]
ldr w4, [x1, #64]
ldr w5, [x1, #96]
ldr w6, [x1, #128]
ldr w7, [x1, #160]
ldr w8, [x1, #192]
ldr w9, [x1, #224]
subs w10, w10, #1
stp w2, w3, [x0, #8]
add x1, x1, #4
stp w4, w5, [x0, #16]
stp w6, w7, [x0, #24]
stp w8, w9, [x0, #32]!
b.ne 1b
.balign 16
ret
.endfunc
.end
It's arguably the best version you will ever get if you still insist on doing pure 8x8 transpose. It might be a little slower than the neon assembly version, but consume considerably less power.
It's possible to optimise the 8x8 neon code presented in the other answer; 8x8 transpose can be not only thought of as recursive version of [A B;C D]' == [A' C'; B' D'] but also as repeated application of zip or unzip.
a b c d
e f g h
i j k l
m n o p == a b c d e f g h i j k l m n o p
zip(first_half, last_half) ==
zip(...) == a i b j c k d l e m f n g o h p
zip(...) == a e i m b f j n c g k o d h l p == transpose
For 8x8 matrix we need to apply this algorithm 3 times and reading the data by vld4 two of those passes have been already done.
float32x4x4_t d0 = vld4q_f32(input);
float32x4x4_t d1 = vld4q_f32(input + 16);
float32x4x4_t d2 = vld4q_f32(input + 32);
float32x4x4_t d3 = vld4q_f32(input + 48);
float32x4x4_t e0 = {
vzipq_f32(d0.val[0], d2.val[0]).val[0],
vzipq_f32(d0.val[1], d2.val[1]).val[0],
vzipq_f32(d0.val[2], d2.val[2]).val[0],
vzipq_f32(d0.val[3], d2.val[3]).val[0]
};
float32x4x4_t e1 = {
vzipq_f32(d1.val[0], d3.val[0]).val[0],
vzipq_f32(d1.val[1], d3.val[1]).val[0],
vzipq_f32(d1.val[2], d3.val[2]).val[0],
vzipq_f32(d1.val[3], d3.val[3]).val[0]
};
float32x4x4_t e2 = {
vzipq_f32(d0.val[0], d2.val[0]).val[1],
vzipq_f32(d0.val[1], d2.val[1]).val[1],
vzipq_f32(d0.val[2], d2.val[2]).val[1],
vzipq_f32(d0.val[3], d2.val[3]).val[1]
};
float32x4x4_t e3 = {
vzipq_f32(d1.val[0], d3.val[0]).val[1],
vzipq_f32(d1.val[1], d3.val[1]).val[1],
vzipq_f32(d1.val[2], d3.val[2]).val[1],
vzipq_f32(d1.val[3], d3.val[3]).val[1]
};
vst1q_f32_x4(output, e0);
vst1q_f32_x4(output + 16, e1);
vst1q_f32_x4(output + 32, e2);
vst1q_f32_x4(output + 48, e3);
One should be able to perform the transpose also by starting with vld1q_f32_x4, then uzpq and finish with vst4q_f32.
I have been trying to write a bare metal kernel for over a year now and I am up to the point where I am ready to start working on audio output. I have written the code in asm however since I'm not great at it I'm not sure how I can pass audio samples as arguments to a asm function. I tried to rewrite it in C however it isn't working. This problem is really a spot the difference. I know my asm version works but the audio sample is written into the play_audio function. My goal is to have a init function for the audio with no arguments and a play_audio function that takes the pointer to the start of the audio function and a pointer to the end of the audio file. The audio file to be played is a 16 bit unsigned int pcm file. The same file that I'm trying to use for the C audio part is used successfully in the asm version. Since I set the hardware pwm to expect 13bit audio at 41400Hz there is a shift to convert the sample from 16bit to 13 bit so this isn't a mistake.
Not_working_audio.c
void init_audio_jack_c()//ERROR IN HERE
{
//Set phone jack to pwm output
uint32_t *gpio_addr = (uint32_t *)(PERIPHERAL_BASE + GPIO_BASE);
uint32_t *gpio_gpfsel4_addr = gpio_addr + GPIO_GPFSEL4;
*gpio_gpfsel4_addr = GPIO_FSEL0_ALT0 | GPIO_FSEL5_ALT0;
//Set clock
uint32_t *clock_manager_addr = (uint32_t *)(((PERIPHERAL_BASE + CM_BASE) & 0x0000FFFF) | ((PERIPHERAL_BASE + CM_BASE) & 0xFFFF0000));
*(clock_manager_addr + CM_PWMDIV) = (CM_PASSWORD | 0x2000);
*(clock_manager_addr + CM_PWMCTL) = ((CM_PASSWORD | CM_ENAB) | (CM_SRC_OSCILLATOR + CM_SRC_PLLCPER));
//Set PWM
uint32_t *pwm_manager_addr = (uint32_t *)(((PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000));
*(pwm_manager_addr + PWM_RNG1) = 0x1624;
*(pwm_manager_addr + PWM_RNG2) = 0x1624;
*(pwm_manager_addr + PWM_CTL) = PWM_USEF2 + PWM_PWEN2 + PWM_USEF1 + PWM_PWEN1 + PWM_CLRF1;
printf("[INFO] Audio Init Finished");
}
int32_t play_16bit_unsigned_audio(uint16_t *start, uint16_t *end)
{
if(end < start)
{
printf("[ERROR] End is less than start.");
return 1;
}
if((start - end) % 2 == 0)
{
printf("[ERROR] Isn't a multiple of two so it isn't 16bit");
return 2;
}
uint16_t *end_of_file = (uint16_t *)(uint64_t)(((uint32_t)(uintptr_t)end & 0x0000FFFF) | ((uint32_t)(uintptr_t)end & 0xFFFF0000));
//FIFO write
while(start != end_of_file)
{
uint16_t sample = start[0];
sample >>= 3;
*(uint32_t *)((((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000)) + PWM_FIF1) = sample;
start++;
sample = start[0];
sample >>= 3;
*(uint32_t *)((((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000)) + PWM_FIF1) = sample;
//FIFO wait
while(*(uint32_t *)((((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000)) + PWM_STA) != PWM_FULL1);
start++;
}
printf("[INFO] Completed Audio");
return 0;
}
Working_audio.s
.section .text.init_audio_jack, "ax", %progbits
.balign 4
.globl init_audio_jack;
.type init_audio_jack, %function
init_audio_jack:
mov w0,PERIPHERAL_BASE + GPIO_BASE
mov w1,GPIO_FSEL0_ALT0
orr w1,w1,GPIO_FSEL5_ALT0
str w1,[x0,GPIO_GPFSEL4]
// Set Clock
mov w0, PERIPHERAL_BASE
add w0, w0, CM_BASE
and w0, w0, 0x0000FFFF
mov w1, PERIPHERAL_BASE
add w1, w1, CM_BASE
and w1, w1, 0xFFFF0000
orr w0,w0,w1
mov w1,CM_PASSWORD
orr w1,w1,0x2000 // Bits 0..11 Fractional Part Of Divisor = 0, Bits 12..23 Integer Part Of Divisor = 2
brk #0
str w1,[x0,CM_PWMDIV]
mov w1,CM_PASSWORD
orr w1,w1,CM_ENAB
orr w1,w1,CM_SRC_OSCILLATOR + CM_SRC_PLLCPER // Use 650MHz PLLC Clock
str w1,[x0,CM_PWMCTL]
// Set PWM
mov w0, PERIPHERAL_BASE
add w0, w0, PWM_BASE
and w0, w0, 0x0000FFFF
mov w1,PERIPHERAL_BASE
add w1, w1, PWM_BASE
and w1, w1, 0xFFFF0000
orr w0,w0,w1
mov w1,0x1624 // Range = 13bit 44100Hz Mono
str w1,[x0,PWM_RNG1]
str w1,[x0,PWM_RNG2]
mov w1,PWM_USEF2 + PWM_PWEN2 + PWM_USEF1 + PWM_PWEN1 + PWM_CLRF1
str w1,[x0,PWM_CTL]
.section .text.play_audio, "ax", %progbits
.balign 4
.globl play_audio;
.type play_audio, %function
play_audio:
Loop:
adr x1, _binary_src_audio_Interlude_bin_start // X1 = Sound Sample
ldr w2, =_binary_src_audio_Interlude_bin_end
and w2, w2, 0x0000FFFF // W2 = End Of Sound Sample
ldr w3, =_binary_src_audio_Interlude_bin_end
and w3, w3, 0xFFFF0000
orr w2,w2,w3
FIFO_Write:
ldrh w3,[x1],2 // Write 2 Bytes To FIFO
lsr w3,w3,3 // Convert 16bit To 13bit
str w3,[x0,PWM_FIF1] // FIFO Address
ldrh w3, [x1], 2
lsr w3, w3, 3
str w3, [x0, PWM_FIF1]
FIFO_Wait:
ldr w3,[x0,PWM_STA]
tst w3,PWM_FULL1 // Test Bit 1 FIFO Full
b.ne FIFO_Wait
cmp w1,w2 // Check End Of Sound Sample
b.ne FIFO_Write
b Loop // Play Sample Again
Thanks in advance to anyone that can help!
A CUDA kernel with some local, fixed-size array may get compiled so that the array resides in the thread's "local memory", or - if NVCC can determine the position of each array access at compile time, and there are enough registers available - the array might be broken up with its elements residing in registers.
Is it possible to check or to ensure, either via the code or as part of the build process, that a specific array, or all local arrays in a kernel, have been fit into registers? Is doing so supported by any tool?
At runtime
You may use the CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES as a hint on whether your array has been registrified; using CUDA driver API function cuFuncGetAttribute. But for some use cases, runtime may be too late.
At compile time
You want to have a look at the generated ptx file (using --keep option in nvcc).
The local data declation is identified as .local in the ptx. Here is a small example, with a kernel.
#define ww 65
__global__ void kernel(int W, int H, const int *a, int *b)
{
int buffer[ww];
for (int i = threadIdx.x; i < H; i += blockDim.x)
{
#pragma unroll
for (int w = 0; w < ww; ++w)
buffer[w] = a[i + w * W];
for (int j = 5; j < H - 5; ++j)
{
buffer[j % ww] = a[i + (j + 6) * W];
int s = 0;
#pragma unroll
for (int w = 0; w < ww; ++w)
s += buffer[w];
b[i + (j + 6) * W] = s;
}
}
}
When compiled there is a declaration of a local variable:
.visible .entry _Z6kerneliiPKiPi(
.param .u32 _Z6kerneliiPKiPi_param_0,
.param .u32 _Z6kerneliiPKiPi_param_1,
.param .u64 _Z6kerneliiPKiPi_param_2,
.param .u64 _Z6kerneliiPKiPi_param_3
)
{
.local .align 4 .b8 __local_depot0[260];
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .pred %p<5>;
.reg .b32 %r<219>;
.reg .b64 %rd<81>;
However, when rolling the buffer, buffer is always accessed with known indices and registers may be obtained - no local storage:
#define ww 65
__global__ void kernel(int W, int H, const int *a, int *b)
{
int buffer[ww];
for (int i = threadIdx.x; i < H; i += blockDim.x)
{
#pragma unroll
for (int w = 0; w < ww; ++w)
buffer[w] = a[i + w * W];
for (int j = 5; j < H - 5; ++j)
{
#pragma unroll
for (int w = 0; w < ww-1; ++w)
buffer[w] = buffer[w + 1];
buffer[ww - 1] = a[i + (j + 6) * W];
int s = 0;
#pragma unroll
for (int w = 0; w < ww; ++w)
s += buffer[w];
b[i + (j + 6) * W] = s;
}
}
}
Yields the following ptx:
.visible .entry _Z6kerneliiPKiPi(
.param .u32 _Z6kerneliiPKiPi_param_0,
.param .u32 _Z6kerneliiPKiPi_param_1,
.param .u64 _Z6kerneliiPKiPi_param_2,
.param .u64 _Z6kerneliiPKiPi_param_3
)
{
.reg .pred %p<5>;
.reg .b32 %r<393>;
.reg .b64 %rd<240>;
Note that depending on the number of registers available, the number of required registers may not fit. These are virtual registers (which has somehow changed in recent versions of CUDA). Meaning that the absence of .local .align 4 .b8 __local_depot is a prerequisite, but not sufficient.
You need to look at the SASS then. Using nvdisasm on your generated .cubin, you want to search for STL instruction which stands for STore Local, as described briefly here. Here are parts of the two disassembled cubins compiled with two different values of --maxrregcount compiler switch - first for 32 (see the many occurrences of STL):
//--------------------- .text._Z6kerneliiPKiPi --------------------------
.section .text._Z6kerneliiPKiPi,"ax",#progbits
.sectioninfo #"SHI_REGISTERS=32"
.align 32
.global _Z6kerneliiPKiPi
.type _Z6kerneliiPKiPi,#function
.size _Z6kerneliiPKiPi,(.L_25 - _Z6kerneliiPKiPi)
.other _Z6kerneliiPKiPi,#"STO_CUDA_ENTRY STV_DEFAULT"
_Z6kerneliiPKiPi:
.text._Z6kerneliiPKiPi:
/*0008*/ MOV R1, c[0x0][0x20];
/*0010*/ { IADD32I R1, R1, -0x180;
/*0018*/ S2R R0, SR_TID.X; }
/*0028*/ ISETP.GE.AND P0, PT, R0, c[0x0][0x144], PT;
/*0030*/ NOP;
/*0038*/ NOP;
/*0048*/ #P0 EXIT;
.L_3:
/*0050*/ IADD R2, R0, c[0x0][0x140];
/*0058*/ MOV R30, c[0x0][0x140];
/*0068*/ ISCADD R5.CC, R2.reuse, c[0x0][0x148], 0x2;
/*0070*/ { SHR R3, R2, 0x1e;
/*0078*/ STL [R1+0x14], R5; }
/*0088*/ ISCADD R2, R30.reuse, R0.reuse, 0x1;
/*0090*/ ISCADD R4, R30.reuse, R0.reuse, 0x2;
/*0098*/ ISCADD R20, R30, R0, 0x3;
/*00a8*/ IADD.X R5, R3, c[0x0][0x14c];
/*00b0*/ { SHR R3, R2.reuse, 0x1e;
/*00b8*/ STL [R1+0x10], R5; }
/*00c8*/ ISCADD R2.CC, R2, c[0x0][0x148], 0x2;
/*00d0*/ STL [R1+0x8], R2;
/*00d8*/ SHR R5, R4, 0x1e;
/*00e8*/ IADD.X R2, R3, c[0x0][0x14c];
/*00f0*/ { ISCADD R4.CC, R4, c[0x0][0x148], 0x2;
/*00f8*/ STL [R1+0x4], R2; }
Then for 255 - no occurence of STL:
//--------------------- .text._Z6kerneliiPKiPi --------------------------
.section .text._Z6kerneliiPKiPi,"ax",#progbits
.sectioninfo #"SHI_REGISTERS=124"
.align 32
.global _Z6kerneliiPKiPi
.type _Z6kerneliiPKiPi,#function
.size _Z6kerneliiPKiPi,(.L_25 - _Z6kerneliiPKiPi)
.other _Z6kerneliiPKiPi,#"STO_CUDA_ENTRY STV_DEFAULT"
_Z6kerneliiPKiPi:
.text._Z6kerneliiPKiPi:
/*0008*/ MOV R1, c[0x0][0x20];
/*0010*/ S2R R0, SR_TID.X;
/*0018*/ ISETP.GE.AND P0, PT, R0, c[0x0][0x144], PT;
/*0028*/ NOP;
/*0030*/ NOP;
/*0038*/ #P0 EXIT;
/*0048*/ MOV R46, c[0x0][0x144];
/*0050*/ IADD R47, RZ, -c[0x0][0x140];
/*0058*/ IADD32I R46, R46, -0x5;
/*0068*/ SHL R47, R47, 0x2;
.L_3:
/*0070*/ ISETP.LT.AND P0, PT, R46, 0x6, PT;
/*0078*/ #P0 BRA `(.L_1);
/*0088*/ MOV R2, c[0x0][0x140];
/*0090*/ ISCADD R2, R2, R0, 0x6;
/*0098*/ SHR R27, R2.reuse, 0x1e;
/*00a8*/ ISCADD R26.CC, R2, c[0x0][0x148], 0x2;
/*00b0*/ SHR R48, R47, 0x1f;
/*00b8*/ IADD.X R27, R27, c[0x0][0x14c];
/*00c8*/ { IADD R44.CC, R47.reuse, R26;
/*00d0*/ LDG.E R49, [R26]; }
/*00d8*/ IADD.X R45, R48.reuse, R27;
/*00e8*/ { IADD R42.CC, R47.reuse, R44 SLOT 0;
/*00f0*/ LDG.E R44, [R44] SLOT 1; }
/*00f8*/ IADD.X R43, R48.reuse, R45;
/*0108*/ { IADD R38.CC, R47, R42 SLOT 0;
/*0110*/ LDG.E R42, [R42] SLOT 1; }
Very much like you I assume, I wish that all of this was better documented.
I'm trying to understand the integrate_functor in particles_kernel.cu from CUDA examples:
struct integrate_functor
{
float deltaTime;
//constructor for functor
//...
template <typename Tuple>
__device__
void operator()(Tuple t)
{
volatile float4 posData = thrust::get<2>(t);
volatile float4 velData = thrust::get<3>(t);
float3 pos = make_float3(posData.x, posData.y, posData.z);
float3 vel = make_float3(velData.x, velData.y, velData.z);
// update position and velocity
// ...
// store new position and velocity
thrust::get<0>(t) = make_float4(pos, posData.w);
thrust::get<1>(t) = make_float4(vel, velData.w);
}
};
We call make_float4(pos, age) but make_float4 is defined in vector_functions.h as
static __inline__ __host__ __device__ float4 make_float4(float x, float y, float z, float w)
{
float4 t; t.x = x; t.y = y; t.z = z; t.w = w; return t;
}
Are CUDA vector types (float3 and float4) more efficient for the GPU and how does the compiler know how to overload the function make_float4?
I'm expanding njuffa's comment into a worked example. In that example, I'm simply adding two arrays in three different ways: loading the data as float, float2 or float4.
These are the timings on a GT540M and on a Kepler K20c card:
GT540M
float - Elapsed time: 74.1 ms
float2 - Elapsed time: 61.0 ms
float4 - Elapsed time: 56.1 ms
Kepler K20c
float - Elapsed time: 4.4 ms
float2 - Elapsed time: 3.3 ms
float4 - Elapsed time: 3.2 ms
As it can be seen, loading the data as float4 is the fastest approach.
Below are the disassembled codes for the three kernels (compilation for compute capability 2.1).
add_float
Function : _Z9add_floatPfS_S_j
.headerflags #"EF_CUDA_SM21 EF_CUDA_PTX_SM(EF_CUDA_SM21)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R2, SR_TID.X; /* 0x2c00000084009c04 */
/*0010*/ SHL R2, R2, 0x2; /* 0x6000c00008209c03 */
/*0018*/ S2R R0, SR_CTAID.X; /* 0x2c00000094001c04 */
/*0020*/ SHL R0, R0, 0x2; /* 0x6000c00008001c03 */
/*0028*/ IMAD R0, R0, c[0x0][0x8], R2; /* 0x2004400020001ca3 */
/*0030*/ ISETP.GE.U32.AND P0, PT, R0, c[0x0][0x2c], PT; /* 0x1b0e4000b001dc03 */
/*0038*/ #P0 BRA.U 0xd8; /* 0x40000002600081e7 */
/*0040*/ #!P0 ISCADD R2, R0, c[0x0][0x24], 0x2; /* 0x400040009000a043 */
/*0048*/ #!P0 ISCADD R10, R0, c[0x0][0x20], 0x2; /* 0x400040008002a043 */
/*0050*/ #!P0 ISCADD R0, R0, c[0x0][0x28], 0x2; /* 0x40004000a0002043 */
/*0058*/ #!P0 LD R8, [R2]; /* 0x8000000000222085 */
/*0060*/ #!P0 LD R6, [R2+0x4]; /* 0x800000001021a085 */
/*0068*/ #!P0 LD R4, [R2+0x8]; /* 0x8000000020212085 */
/*0070*/ #!P0 LD R9, [R10]; /* 0x8000000000a26085 */
/*0078*/ #!P0 LD R7, [R10+0x4]; /* 0x8000000010a1e085 */
/*0080*/ #!P0 LD R5, [R10+0x8]; /* 0x8000000020a16085 */
/*0088*/ #!P0 LD R3, [R10+0xc]; /* 0x8000000030a0e085 */
/*0090*/ #!P0 LD R2, [R2+0xc]; /* 0x800000003020a085 */
/*0098*/ #!P0 FADD R8, R9, R8; /* 0x5000000020922000 */
/*00a0*/ #!P0 FADD R6, R7, R6; /* 0x500000001871a000 */
/*00a8*/ #!P0 FADD R4, R5, R4; /* 0x5000000010512000 */
/*00b0*/ #!P0 ST [R0], R8; /* 0x9000000000022085 */
/*00b8*/ #!P0 FADD R2, R3, R2; /* 0x500000000830a000 */
/*00c0*/ #!P0 ST [R0+0x4], R6; /* 0x900000001001a085 */
/*00c8*/ #!P0 ST [R0+0x8], R4; /* 0x9000000020012085 */
/*00d0*/ #!P0 ST [R0+0xc], R2; /* 0x900000003000a085 */
/*00d8*/ EXIT; /* 0x8000000000001de7 */
add_float2
Function : _Z10add_float2P6float2S0_S0_j
.headerflags #"EF_CUDA_SM21 EF_CUDA_PTX_SM(EF_CUDA_SM21)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R2, SR_TID.X; /* 0x2c00000084009c04 */
/*0010*/ SHL R2, R2, 0x1; /* 0x6000c00004209c03 */
/*0018*/ S2R R0, SR_CTAID.X; /* 0x2c00000094001c04 */
/*0020*/ SHL R0, R0, 0x1; /* 0x6000c00004001c03 */
/*0028*/ IMAD R0, R0, c[0x0][0x8], R2; /* 0x2004400020001ca3 */
/*0030*/ ISETP.GE.U32.AND P0, PT, R0, c[0x0][0x2c], PT; /* 0x1b0e4000b001dc03 */
/*0038*/ #P0 BRA.U 0xa8; /* 0x40000001a00081e7 */
/*0040*/ #!P0 ISCADD R10, R0, c[0x0][0x20], 0x3; /* 0x400040008002a063 */
/*0048*/ #!P0 ISCADD R11, R0, c[0x0][0x24], 0x3; /* 0x400040009002e063 */
/*0050*/ #!P0 ISCADD R0, R0, c[0x0][0x28], 0x3; /* 0x40004000a0002063 */
/*0058*/ #!P0 LD.64 R4, [R10]; /* 0x8000000000a120a5 */
/*0060*/ #!P0 LD.64 R8, [R11]; /* 0x8000000000b220a5 */
/*0068*/ #!P0 LD.64 R2, [R10+0x8]; /* 0x8000000020a0a0a5 */
/*0070*/ #!P0 LD.64 R6, [R11+0x8]; /* 0x8000000020b1a0a5 */
/*0078*/ #!P0 FADD R9, R5, R9; /* 0x5000000024526000 */
/*0080*/ #!P0 FADD R8, R4, R8; /* 0x5000000020422000 */
/*0088*/ #!P0 FADD R3, R3, R7; /* 0x500000001c30e000 */
/*0090*/ #!P0 FADD R2, R2, R6; /* 0x500000001820a000 */
/*0098*/ #!P0 ST.64 [R0], R8; /* 0x90000000000220a5 */
/*00a0*/ #!P0 ST.64 [R0+0x8], R2; /* 0x900000002000a0a5 */
/*00a8*/ EXIT; /* 0x8000000000001de7 */
add_float4
Function : _Z10add_float4P6float4S0_S0_j
.headerflags #"EF_CUDA_SM21 EF_CUDA_PTX_SM(EF_CUDA_SM21)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ NOP; /* 0x4000000000001de4 */
/*0010*/ MOV R3, c[0x0][0x2c]; /* 0x28004000b000dde4 */
/*0018*/ S2R R0, SR_CTAID.X; /* 0x2c00000094001c04 */
/*0020*/ SHR.U32 R3, R3, 0x2; /* 0x5800c0000830dc03 */
/*0028*/ S2R R2, SR_TID.X; /* 0x2c00000084009c04 */
/*0030*/ IMAD R0, R0, c[0x0][0x8], R2; /* 0x2004400020001ca3 */
/*0038*/ ISETP.GE.U32.AND P0, PT, R0, R3, PT; /* 0x1b0e00000c01dc03 */
/*0040*/ #P0 BRA.U 0x98; /* 0x40000001400081e7 */
/*0048*/ #!P0 ISCADD R2, R0, c[0x0][0x20], 0x4; /* 0x400040008000a083 */
/*0050*/ #!P0 ISCADD R3, R0, c[0x0][0x24], 0x4; /* 0x400040009000e083 */
/*0058*/ #!P0 ISCADD R0, R0, c[0x0][0x28], 0x4; /* 0x40004000a0002083 */
/*0060*/ #!P0 LD.128 R8, [R2]; /* 0x80000000002220c5 */
/*0068*/ #!P0 LD.128 R4, [R3]; /* 0x80000000003120c5 */
/*0070*/ #!P0 FADD R7, R11, R7; /* 0x500000001cb1e000 */
/*0078*/ #!P0 FADD R6, R10, R6; /* 0x5000000018a1a000 */
/*0080*/ #!P0 FADD R5, R9, R5; /* 0x5000000014916000 */
/*0088*/ #!P0 FADD R4, R8, R4; /* 0x5000000010812000 */
/*0090*/ #!P0 ST.128 [R0], R4; /* 0x90000000000120c5 */
/*0098*/ EXIT; /* 0x8000000000001de7 */
As it can be seen and as mentioned by njuffa, different load instructions are used for the three cases: LD, LD.64 and LD.128, respectively.
Finally, the code:
#include <thrust/device_vector.h>
#define BLOCKSIZE 256
/*******************/
/* iDivUp FUNCTION */
/*******************/
int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
/********************/
/* ADD_FLOAT KERNEL */
/********************/
__global__ void add_float(float *d_a, float *d_b, float *d_c, unsigned int N) {
const int tid = 4 * threadIdx.x + blockIdx.x * (4 * blockDim.x);
if (tid < N) {
float a1 = d_a[tid];
float b1 = d_b[tid];
float a2 = d_a[tid+1];
float b2 = d_b[tid+1];
float a3 = d_a[tid+2];
float b3 = d_b[tid+2];
float a4 = d_a[tid+3];
float b4 = d_b[tid+3];
float c1 = a1 + b1;
float c2 = a2 + b2;
float c3 = a3 + b3;
float c4 = a4 + b4;
d_c[tid] = c1;
d_c[tid+1] = c2;
d_c[tid+2] = c3;
d_c[tid+3] = c4;
//if ((tid < 1800) && (tid > 1790)) {
//printf("%i %i %i %f %f %f\n", tid, threadIdx.x, blockIdx.x, a1, b1, c1);
//printf("%i %i %i %f %f %f\n", tid+1, threadIdx.x, blockIdx.x, a2, b2, c2);
//printf("%i %i %i %f %f %f\n", tid+2, threadIdx.x, blockIdx.x, a3, b3, c3);
//printf("%i %i %i %f %f %f\n", tid+3, threadIdx.x, blockIdx.x, a4, b4, c4);
//}
}
}
/*********************/
/* ADD_FLOAT2 KERNEL */
/*********************/
__global__ void add_float2(float2 *d_a, float2 *d_b, float2 *d_c, unsigned int N) {
const int tid = 2 * threadIdx.x + blockIdx.x * (2 * blockDim.x);
if (tid < N) {
float2 a1 = d_a[tid];
float2 b1 = d_b[tid];
float2 a2 = d_a[tid+1];
float2 b2 = d_b[tid+1];
float2 c1;
c1.x = a1.x + b1.x;
c1.y = a1.y + b1.y;
float2 c2;
c2.x = a2.x + b2.x;
c2.y = a2.y + b2.y;
d_c[tid] = c1;
d_c[tid+1] = c2;
}
}
/*********************/
/* ADD_FLOAT4 KERNEL */
/*********************/
__global__ void add_float4(float4 *d_a, float4 *d_b, float4 *d_c, unsigned int N) {
const int tid = 1 * threadIdx.x + blockIdx.x * (1 * blockDim.x);
if (tid < N/4) {
float4 a1 = d_a[tid];
float4 b1 = d_b[tid];
float4 c1;
c1.x = a1.x + b1.x;
c1.y = a1.y + b1.y;
c1.z = a1.z + b1.z;
c1.w = a1.w + b1.w;
d_c[tid] = c1;
}
}
/********/
/* MAIN */
/********/
int main() {
const int N = 4*10000000;
const float a = 3.f;
const float b = 5.f;
// --- float
thrust::device_vector<float> d_A(N, a);
thrust::device_vector<float> d_B(N, b);
thrust::device_vector<float> d_C(N);
float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
add_float<<<iDivUp(N/4, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_A.data()), thrust::raw_pointer_cast(d_B.data()), thrust::raw_pointer_cast(d_C.data()), N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
printf("Elapsed time: %3.1f ms \n", time); gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
thrust::host_vector<float> h_float = d_C;
for (int i=0; i<N; i++) {
if (h_float[i] != (a+b)) {
printf("Error for add_float at %i: result is %f\n",i, h_float[i]);
return -1;
}
}
// --- float2
thrust::device_vector<float> d_A2(N, a);
thrust::device_vector<float> d_B2(N, b);
thrust::device_vector<float> d_C2(N);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
add_float2<<<iDivUp(N/4, BLOCKSIZE), BLOCKSIZE>>>((float2*)thrust::raw_pointer_cast(d_A2.data()), (float2*)thrust::raw_pointer_cast(d_B2.data()), (float2*)thrust::raw_pointer_cast(d_C2.data()), N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
printf("Elapsed time: %3.1f ms \n", time); gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
thrust::host_vector<float> h_float2 = d_C2;
for (int i=0; i<N; i++) {
if (h_float2[i] != (a+b)) {
printf("Error for add_float2 at %i: result is %f\n",i, h_float2[i]);
return -1;
}
}
// --- float4
thrust::device_vector<float> d_A4(N, a);
thrust::device_vector<float> d_B4(N, b);
thrust::device_vector<float> d_C4(N);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
add_float4<<<iDivUp(N/4, BLOCKSIZE), BLOCKSIZE>>>((float4*)thrust::raw_pointer_cast(d_A4.data()), (float4*)thrust::raw_pointer_cast(d_B4.data()), (float4*)thrust::raw_pointer_cast(d_C4.data()), N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
printf("Elapsed time: %3.1f ms \n", time); gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
thrust::host_vector<float> h_float4 = d_C4;
for (int i=0; i<N; i++) {
if (h_float4[i] != (a+b)) {
printf("Error for add_float4 at %i: result is %f\n",i, h_float4[i]);
return -1;
}
}
return 0;
}