Problems with integer promotion in C

I'm developing C code for an embedded application on an ARM processor (LPC54628) using Keil software. There's a strange behavior that I am unable to resolve: I tried running this in the software simulator as well as on the microcontroller, and the behavior is the same. The problem is with the execution of the second 'else if' condition.
Working code:
uint8_t a; uint8_t b ; uint8_t temp1; uint8_t temp2; uint8_t c;
a = 0x1; b = 0x80; temp1 = 0; temp2 = 0; c = 10U;
temp1 = (b << 1); // after execution, temp1 is 0x00
temp2 = (b >> 7); // after execution, temp2 is 0x01
__NOP();
temp1 = ((b << 1) | (b >> 7)); // after execution, temp1 is 0x00 | 0x01 = 0x01
if (a == b) { }
else if ( a == ((b >> 1) | (b << 7)) ) {c += 1; }
else if ( a == temp1 ) {c -= 1; } // this 'else if' executes since a= 0x01 and temp1 = 0x01
else if ( a == ((b >> 2) | (b << 6)) ) {c += 2; }
else if ( a == ((b << 2) | (b >> 6)) ) {c -= 2; }
else if ( a == ((b >> 3) | (b << 5)) ) {c += 3; }
else if ( a == ((b << 3) | (b >> 5)) ) {c -= 3; }
However, the 'else if' that worked in the code above fails to execute in the following code. Note that the only change I have made is to replace temp1 with the actual expression inside the 'else if' condition; nothing else.
Non-working code:
a = 0x1; b = 0x80; temp1 = 0; temp2 = 0; c = 10U;
temp1 = (b << 1); // after execution, temp1 is 0x00
temp2 = (b >> 7); // after execution, temp2 is 0x01
__NOP();
temp1 = ((b << 1) | (b >> 7)); // after execution, temp1 is 0x00 | 0x01 = 0x01
if (a == b) { }
else if ( a == ((b >> 1) | (b << 7)) ) {c += 1; }
else if ( a == ((b << 1) | (b >> 7)) ) {c -= 1; } // this 'else if' DOES NOT execute.
else if ( a == ((b >> 2) | (b << 6)) ) {c += 2; }
else if ( a == ((b << 2) | (b >> 6)) ) {c -= 2; }
else if ( a == ((b >> 3) | (b << 5)) ) {c += 3; }
else if ( a == ((b << 3) | (b >> 5)) ) {c -= 3; }
Can you point out what I am doing wrong?

Integer promotion is annoying. You're fundamentally doing:
else if ( (int) a == (((int)(b << 1)) | ((int)(b >> 7))) ) {
c -= 1;
}
which means you're testing whether 0x01 == 0x101, which is false.
When you do something like:
uint8_t x = 3;
uint8_t y = x + 4;
You're really doing something like:
uint8_t x = 3;
uint8_t y = (uint8_t)(((int)x) + 4);

In the expression ((b << 1) | (b >> 7)), the value b is first promoted to type int because its type is smaller than int. So this expression ends up being:
((0x80 << 1) | (0x80 >> 7)) == (0x100 | 0x1) == 0x101
When you assign this value to temp1, it is converted to a value that fits and you're left with 0x1. When you instead compare the result of this expression directly against a, you're comparing the value 0x1 with 0x101.
If you want the result of this expression to be 8 bit, you need to cast it to uint8_t to truncate the higher bits.
if (a == b) { }
else if ( a == (uint8_t)((b >> 1) | (b << 7)) ) {c += 1; }
else if ( a == (uint8_t)((b << 1) | (b >> 7)) ) {c -= 1; }
else if ( a == (uint8_t)((b >> 2) | (b << 6)) ) {c += 2; }
else if ( a == (uint8_t)((b << 2) | (b >> 6)) ) {c -= 2; }
else if ( a == (uint8_t)((b >> 3) | (b << 5)) ) {c += 3; }
else if ( a == (uint8_t)((b << 3) | (b >> 5)) ) {c -= 3; }
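For anyone who wants to see the effect in isolation, here is a minimal hosted-C demonstration of the difference (nothing Keil-specific is assumed):
#include <stdio.h>
#include <stdint.h>
int main (void)
{
    uint8_t a = 0x1, b = 0x80;
    /* Both operands of << and | are promoted to int, so the right-hand
       side evaluates to 0x101 and the comparison fails. */
    printf ("promoted : %d\n", a == ((b << 1) | (b >> 7)));          /* prints 0 */
    /* The cast truncates back to 8 bits, so the comparison succeeds. */
    printf ("truncated: %d\n", a == (uint8_t)((b << 1) | (b >> 7))); /* prints 1 */
    return 0;
}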

C compilers did NOT always do this; I do not know exactly when it changed.
unsigned int fun0 ( unsigned char a, unsigned char b )
{
return((a<<1)|(b>>1));
}
unsigned int fun1 ( unsigned char a, unsigned char b )
{
return (unsigned char)((a<<1)|(b>>1));
}
00000000 <fun0>:
0: e1a010a1 lsr r1, r1, #1
4: e1810080 orr r0, r1, r0, lsl #1
8: e12fff1e bx lr
0000000c <fun1>:
c: e1a010a1 lsr r1, r1, #1
10: e1810080 orr r0, r1, r0, lsl #1
14: e20000ff and r0, r0, #255 ; 0xff
18: e12fff1e bx lr
In the first function the promoted result is returned as-is; in the second, the cast clips it back to 8 bits (the extra and with 0xff).
Many years ago I hit exactly this as a day-of-year bug: it would appear late in the year (it happened to surface on day 256) and fix itself on January first. The code was day = (high_byte<<8)|(low_byte); (fixed with ...((unsigned int)high_byte)<<8...).
unsigned int fun ( unsigned char a, unsigned char b )
{
return((a<<8)|b);
}
00000000 <fun>:
0: e1810400 orr r0, r1, r0, lsl #8
4: e12fff1e bx lr
That would not have broken today, at least with gcc 10.x.x. I also want to say that at some point this was implementation defined, but from the various quotes on the net it seems it has been this way since at least C99...
Note that disassembly is your friend. But always understand that sometimes behavior is implementation defined (it does not seem so in this case), and that just because your compiler did it one way does not mean that is what the standard says or that it is true for all compilers (you are using Keil, I am using GNU, for example).
Folks run into this a lot with floating point
float fun0 ( float a, float b )
{
return(a*(b+2.0));
}
float fun1 ( float a, float b )
{
return(a*(b+2.0F));
}
00000000 <fun0>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a06000 mov r6, r0
8: e1a00001 mov r0, r1
c: ebfffffe bl 0 <__aeabi_f2d>
10: e3a02000 mov r2, #0
14: e3a03101 mov r3, #1073741824 ; 0x40000000
18: ebfffffe bl 0 <__aeabi_dadd>
1c: e1a04000 mov r4, r0
20: e1a00006 mov r0, r6
24: e1a05001 mov r5, r1
28: ebfffffe bl 0 <__aeabi_f2d>
2c: e1a02000 mov r2, r0
30: e1a03001 mov r3, r1
34: e1a00004 mov r0, r4
38: e1a01005 mov r1, r5
3c: ebfffffe bl 0 <__aeabi_dmul>
40: ebfffffe bl 0 <__aeabi_d2f>
44: e8bd4070 pop {r4, r5, r6, lr}
48: e12fff1e bx lr
0000004c <fun1>:
4c: e92d4010 push {r4, lr}
50: e1a04000 mov r4, r0
54: e1a00001 mov r0, r1
58: e3a01101 mov r1, #1073741824 ; 0x40000000
5c: ebfffffe bl 0 <__aeabi_fadd>
60: e1a01004 mov r1, r4
64: ebfffffe bl 0 <__aeabi_fmul>
68: e8bd4010 pop {r4, lr}
6c: e12fff1e bx lr
2.0 is a double in the eyes of the compiler, but 2.0F is single precision, and a double plus a single gets promoted to a double operation. That is not an integer promotion, but constants have an implied type (integer or floating point) and that plays into promotion just the same.
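If in doubt, a C11 _Generic selection can be used to confirm the implied type of a constant. A tiny hosted check (my addition, not part of the answer above):
#include <stdio.h>
#define TYPE_NAME(x) _Generic((x), float: "float", double: "double", default: "other")
int main (void)
{
    printf ("2.0  is a %s\n", TYPE_NAME(2.0));  /* double */
    printf ("2.0F is a %s\n", TYPE_NAME(2.0F)); /* float */
    return 0;
}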

Related

Unsigned 64x64->128 bit integer multiply on 32-bit platforms

In the context of exploratory activity I have started to take a look at integer & fixed-point arithmetic building blocks for 32-bit platforms. My primary target would be ARM32 (specifically armv7), with a side glance to RISC-V32 which I expect to grow in importance in the embedded space. The first sample building block I chose to examine is unsigned 64x64->128 bit integer multiplication. Other questions on this site about this building block do not provide detailed coverage of 32-bit platforms.
Over the past thirty years, I have implemented this and other arithmetic building blocks multiple times, but always in assembly language, for various architectures. However, at this point in time my hope and desire is that these could be programmed in straight ISO-C, without the use of intrinsics. Ideally a single version of the C code would generate good machine code across architectures. I know that the approach of manipulating HLL code to control machine code is generally brittle, but hope that processor architectures and toolchains have matured enough to make this feasible.
Some approaches used in assembly language implementations are not well suited for porting to C. In the exemplary code below I have selected six variants that seemed amenable to an HLL implementation. Besides the generation of partial products, which is common to all variants, the two basic approaches are: (1) Sum the partial products using 64-bit arithmetic, letting the compiler take care of the carry propagation between 32-bit halves. In this case there are multiple choices in which order to sum the partial products. (2) Use 32-bit arithmetic for the summing, simulating the carry flag directly. In this case we have a choice of generating the carry after an addition (a = a + b; carry = a < b;) or before the addition (carry = ~a < b; a = a + b;). Variants 1 through 3 below fall into the former category, variants 5 and 6 fall into the latter.
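For concreteness, here is what the two simulated-carry idioms look like for a single 32-bit limb addition (a sketch of the patterns that variants 5 and 6 below use inline):
#include <stdint.h>
/* Carry computed after the addition (the variant-5 pattern). */
static uint32_t add_carry_after (uint32_t *a, uint32_t b)
{
    *a = *a + b;
    return *a < b;            /* the sum wrapped iff it is smaller than an addend */
}
/* Carry computed before the addition (the variant-6 pattern). */
static uint32_t add_carry_before (uint32_t *a, uint32_t b)
{
    uint32_t carry = ~*a < b; /* a + b wraps iff b > ~a == 2^32 - 1 - a */
    *a = *a + b;
    return carry;
}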
At Compiler Explorer, I focused on the toolchains gcc 12.2 and clang 15.0 for the platforms of interest. I compiled with -O3. The general finding is that on average clang generates more efficient code than gcc, and that the differences between the variants (number of instructions and registers used) are more pronounced with clang. While this may be understandable in the case of RISC-V as the newer architecture, it surprised me in the case of armv7 which has been around for well over a dozen years.
Three cases in particular struck me as noteworthy. While I have worked with compiler engineers before and have a reasonable understanding of basic code transformation, phase ordering issues, etc., the only technique I am aware of that might apply to this code is idiom recognition, and I do not see how this could explain the observations by itself. The first case is variant 3, where clang 15.0 produces extremely tight code comprising just 10 instructions that I don't think can be improved upon:
umul64wide:
push {r4, lr}
umull r12, r4, r2, r0
umull lr, r0, r3, r0
umaal lr, r4, r2, r1
umaal r0, r4, r3, r1
ldr r1, [sp, #8]
strd r0, r4, [r1]
mov r0, r12
mov r1, lr
pop {r4, pc}
By contrast, gcc generates twice the number of instructions and requires twice the number of registers. I hypothesize that it does not recognize how to use the multiply-accumulate instruction umaal here, but is that the full story? The reverse situation, but not quite as dramatic, occurs in variant 6, where gcc 12.2 produces this sequence of 18 instructions, with low register usage:
umul64wide:
mov ip, r0
push {r4, r5, lr}
mov lr, r1
umull r0, r1, r0, r2
ldr r4, [sp, #12]
umull r5, ip, r3, ip
adds r1, r1, r5
umull r2, r5, lr, r2
adc ip, ip, #0
umull lr, r3, lr, r3
adds r1, r1, r2
adc r2, ip, #0
adds r2, r2, r5
adc r3, r3, #0
adds r2, r2, lr
adc r3, r3, #0
strd r2, r3, [r4]
pop {r4, r5, pc}
The generated code nicely turns the simulated carry propagation into real carry propagation. clang 15.0 uses nine instructions and five registers more, and I cannot really make out what it is trying to do without spending much more time on analysis. The third observation is with regard to the differences seen in the machine code produced for variant 5 vs. variant 6, in particular with clang. These use the same basic algorithm, with one variant computing the simulated carry before the additions, the other after it. I did find in the end that one variant, namely variant 4, seems to be efficient across both tool chains and both architectures. However, before I proceed to other building blocks and face a similar struggle, I would like to inquire:
(1) Are there coding idioms or algorithms I have not considered in the code below that might lead to superior results? (2) Are there specific optimization switches, e.g. a hypothetical -ffrobnicate (see here), that are not included in -O3 that would help the compilers generate efficient code more consistently for these kinds of bit-manipulation scenarios? Explanations as to what compiler mechanisms are likely responsible for the cases of significant differences in code generation observed, and how one might influence or work around them, could also be helpful.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define VARIANT (3)
#define USE_X64_ASM_REF (0)
/* Multiply two unsigned 64-bit operands a and b. Returns least significant 64
bits of product as return value, most significant 64 bits of product via h.
*/
uint64_t umul64wide (uint64_t a, uint64_t b, uint64_t *h)
{
uint32_t a_lo = (uint32_t)a;
uint32_t a_hi = a >> 32;
uint32_t b_lo = (uint32_t)b;
uint32_t b_hi = b >> 32;
uint64_t p0 = (uint64_t)a_lo * b_lo;
uint64_t p1 = (uint64_t)a_lo * b_hi;
uint64_t p2 = (uint64_t)a_hi * b_lo;
uint64_t p3 = (uint64_t)a_hi * b_hi;
#if VARIANT == 1
uint32_t c = (uint32_t)(((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32);
*h = p3 + (p1 >> 32) + (p2 >> 32) + c;
return p0 + ((p1 + p2) << 32);
#elif VARIANT == 2
uint64_t s = (p0 >> 32) + p1;
uint64_t t = (uint32_t)s + p2;
*h = (s >> 32) + (t >> 32) + p3;
return (uint32_t)p0 + (t << 32);
#elif VARIANT == 3
*h = (p1 >> 32) + (((p0 >> 32) + (uint32_t)p1 + p2) >> 32) + p3;
return p0 + ((p1 + p2) << 32);
#elif VARIANT == 4
uint64_t t = (p0 >> 32) + p1 + (uint32_t)p2;
*h = (p2 >> 32) + (t >> 32) + p3;
return (uint32_t)p0 + (t << 32);
#elif VARIANT == 5
uint32_t r0, r1, r2, r3, r4, r5, r6;
r0 = (uint32_t)p0;
r1 = p0 >> 32;
r5 = (uint32_t)p1;
r2 = p1 >> 32;
r1 = r1 + r5;
r6 = r1 < r5;
r2 = r2 + r6;
r6 = (uint32_t)p2;
r5 = p2 >> 32;
r1 = r1 + r6;
r6 = r1 < r6;
r2 = r2 + r6;
r4 = (uint32_t)p3;
r3 = p3 >> 32;
r2 = r2 + r5;
r6 = r2 < r5;
r3 = r3 + r6;
r2 = r2 + r4;
r6 = r2 < r4;
r3 = r3 + r6;
*h = ((uint64_t)r3 << 32) | r2;
return ((uint64_t)r1 << 32) | r0;
#elif VARIANT == 6
uint32_t r0, r1, r2, r3, r4, r5, r6;
r0 = (uint32_t)p0;
r1 = p0 >> 32;
r5 = (uint32_t)p1;
r2 = p1 >> 32;
r4 = ~r1;
r4 = r4 < r5;
r1 = r1 + r5;
r2 = r2 + r4;
r6 = (uint32_t)p2;
r5 = p2 >> 32;
r4 = ~r1;
r4 = r4 < r6;
r1 = r1 + r6;
r2 = r2 + r4;
r4 = (uint32_t)p3;
r3 = p3 >> 32;
r6 = ~r2;
r6 = r6 < r5;
r2 = r2 + r5;
r3 = r3 + r6;
r6 = ~r2;
r6 = r6 < r4;
r2 = r2 + r4;
r3 = r3 + r6;
*h = ((uint64_t)r3 << 32) | r2;
return ((uint64_t)r1 << 32) | r0;
#else
#error unsupported VARIANT
#endif
}
#if defined(__SIZEOF_INT128__)
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
unsigned __int128 prod = ((unsigned __int128)a) * b;
*h = (uint64_t)(prod >> 64);
return (uint64_t)prod;
}
#elif defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
*h = __umulh (a, b);
return a * b;
}
#elif USE_X64_ASM_REF
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
uint64_t res_l, res_h;
__asm__ (
"movq %2, %%rax;\n\t" // rax = a
"mulq %3;\n\t" // rdx:rax = a * b
"movq %%rdx, %0;\n\t" // res_h = rdx
"movq %%rax, %1;\n\t" // res_l = rax
: "=rm" (res_h), "=rm"(res_l)
: "rm"(a), "rm"(b)
: "%rax", "%rdx");
*h = res_h;
return res_l;
}
#else // generic (and slow) reference implementation
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
uint32_t cy, t0, t1;
uint32_t a_lo = (uint32_t)a;
uint32_t a_hi = a >> 32;
uint32_t b_lo = (uint32_t)b;
uint32_t b_hi = b >> 32;
uint64_t p0 = (uint64_t)a_lo * b_lo;
uint64_t p1 = (uint64_t)a_lo * b_hi;
uint64_t p2 = (uint64_t)a_hi * b_lo;
uint64_t p3 = (uint64_t)a_hi * b_hi;
uint32_t p0_lo = (uint32_t)p0;
uint32_t p0_hi = p0 >> 32;
uint32_t p1_lo = (uint32_t)p1;
uint32_t p1_hi = p1 >> 32;
uint32_t p2_lo = (uint32_t)p2;
uint32_t p2_hi = p2 >> 32;
uint32_t p3_lo = (uint32_t)p3;
uint32_t p3_hi = p3 >> 32;
uint32_t r0 = p0_lo;
uint32_t r1 = ADDcc (p0_hi, p1_lo, cy, t0, t1);
uint32_t r2 = ADDCcc (p1_hi, p2_hi, cy, t0, t1);
uint32_t r3 = ADDC (p3_hi, 0, cy, t0, t1);
r1 = ADDcc (r1, p2_lo, cy, t0, t1);
r2 = ADDCcc (r2, p3_lo, cy, t0, t1);
r3 = ADDC (r3, 0, cy, t0, t1);
*h = ((uint64_t)r3 << 32) + r2;
return ((uint64_t)r1 << 32) + r0;
}
#endif
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
uint64_t a, b, res_hi, res_lo, ref_hi, ref_lo, count = 0;
printf ("Smoke test of umul64wide variant %d\n", VARIANT);
do {
a = KISS64;
b = KISS64;
ref_lo = umul64wide_ref (a, b, &ref_hi);
res_lo = umul64wide (a, b, &res_hi);
if ((res_lo ^ ref_lo) | (res_hi ^ ref_hi)) {
printf ("!!!! error: a=%016llx b=%016llx res=%016llx_%016llx ref=%016llx_%016llx\n",
a, b, res_hi, res_lo, ref_hi, ref_lo);
return EXIT_FAILURE;
}
if (!(count & 0xfffffff)) printf ("\r%llu", count);
count++;
} while (count);
return EXIT_SUCCESS;
}
I avoided the use of the ((x += y) < y) overflow test, since not every ISA handles condition flags efficiently, and they may inhibit re-ordering when the results of flag register(s) are used; x86[-64] is the obvious example, though the later BMI(2) instructions may help mitigate this. I also added a 32 x 32 -> 64 bit C implementation for comparison - but I would expect any modern ISA to at least supply a 'high word' multiply like ARM's umulh.
/******************************************************************************/
/* stackoverflow.com/questions/74713642 */
#include <inttypes.h>
#include <stdio.h>
/* umul_32_32 : 32 x 32 => 64 */
/* force inline (non-portable), or implement it as macro, e.g.,
* #define umul_32_32(rh, rl, x, y) do { ... } while (0) */
#if (1)
static inline __attribute__((always_inline))
uint64_t umul_32_32 (uint32_t x, uint32_t y)
{
return (((uint64_t) x) * y);
}
#else
/* if no widening multiply is available, the compiler probably
* generates something at least as efficient as the following -
* or (worst case) it calls a builtin function. */
static inline __attribute__((always_inline))
uint64_t umul_32_32 (uint32_t x, uint32_t y)
{
uint32_t m0, m1, m2, m3; /* (partial products) */
uint32_t x0, x1, y0, y1;
x0 = x & UINT16_MAX, x1 = x >> (16);
y0 = y & UINT16_MAX, y1 = y >> (16);
m0 = x0 * y0, m1 = x1 * y0;
m2 = x0 * y1, m3 = x1 * y1;
m1 += m0 >> (16);
m3 += m2 >> (16);
m1 += m2 & UINT16_MAX;
uint32_t rh = m3 + (m1 >> (16));
uint32_t rl = m1 << (16) | (m0 & UINT16_MAX);
return (((uint64_t) rh) << 32 | rl);
/* 32 x 32 => 64 : no branching or carry overflow tests. */
}
#endif
/* ensure the function is called to inspect code gen / assembly,
* otherwise gcc and clang evaluate this at compile time. */
__attribute__((noinline)) void umul_64_64 (
uint64_t *rh, uint64_t *rl, uint64_t x, uint64_t y)
{
uint64_t m0, m1, m2, m3; /* (partial products) */
uint32_t x0, x1, y0, y1;
x0 = (uint32_t) (x), x1 = (uint32_t) (x >> (32));
y0 = (uint32_t) (y), y1 = (uint32_t) (y >> (32));
m0 = umul_32_32(x0, y0), m1 = umul_32_32(x1, y0);
m2 = umul_32_32(x0, y1), m3 = umul_32_32(x1, y1);
m1 += m0 >> (32);
m3 += m2 >> (32);
m1 += m2 & UINT32_MAX;
*rh = m3 + (m1 >> (32));
*rl = m1 << (32) | (m0 & UINT32_MAX);
/* 64 x 64 => 128 : no branching or carry overflow tests. */
}
#if (0)
int main (void)
{
uint64_t x = UINT64_MAX, y = UINT64_MAX, rh, rl;
umul_64_64(& rh, & rl, x, y);
fprintf(stdout, "0x%016" PRIX64 ":0x%016" PRIX64 "\n", rh, rl);
return (0);
}
#endif
/******************************************************************************/
For ARM-7, I'm getting more or less the same results as your 'variant 3' code, which isn't surprising, since it's the same essential idea. I tried different flags on gcc-12 and gcc-trunk, but couldn't improve it.
I'd hazard a guess that with Apple's investment in AArch64 silicon, there's simply been more aggressive optimization and funding directed toward clang that benefits 32-bit ARM-7 as well. But that's pure speculation. It's a pretty glaring disparity for such a major platform though.

How to reverse endianness?

Can someone help me understand this code?
int reverse_endianess(int value) {
int resultat = 0;
char *source, *destination;
int i;
source = (char *) &value;
destination = ((char *) &resultat) + sizeof(int);
for (i = 0; i < sizeof(int); i++)
*(--destination) = *(source++);
return resultat;
}
I can't understand this part of code:
destination = ((char *) &resultat) + sizeof(int);
for (i = 0; i < sizeof(int); i++)
*(--destination) = *(source++);
The following causes destination to point to the byte that follows resultat (as long as resultat is an int):
destination = ((char *) &resultat) + sizeof(int);
It could also have been written as follows:
destination = (char *)(&resultat + 1);
The following is just a simple memory copy loop:
for (i = 0; i < sizeof(int); i++)
*(--destination) = *(source++);
It's equivalent to the following:
for (i = 0; i < sizeof(int); i++) {
--destination; // Point to the one byte earlier.
*destination = *source; // Copy one byte.
source++; // Point to one byte later.
}
Program flow (assuming 32-bit int and 8-bit char)
After setup:
source value
+----------+ +---+---+---+---+
| -------+ | a | b | c | d |
+----------+ | +---+---+---+---+
| ^
+------+
destination resultat
+----------+ +---+---+---+---+
| -------+ | 0 | 0 | 0 | 0 |
+----------+ | +---+---+---+---+
| ^
+----------------------+
After one pass of the loop:
source value
+----------+ +---+---+---+---+
| -------+ | a | b | c | d |
+----------+ | +---+---+---+---+
| ^
+----------+
destination resultat
+----------+ +---+---+---+---+
| -------+ | 0 | 0 | 0 | a |
+----------+ | +---+---+---+---+
| ^
+------------------+
When it's done:
source value
+----------+ +---+---+---+---+
| -------+ | a | b | c | d |
+----------+ | +---+---+---+---+
| ^
+----------------------+
destination resultat
+----------+ +---+---+---+---+
| -------+ | d | c | b | a |
+----------+ | +---+---+---+---+
| ^
+------+
Let's say sizeof(int) == 4, so a 32-bit int.
A char has sizeof 1: one byte.
A normal int* looks like this:
aabbccdd // the int in hexadecimal
^ pointer points to start
If we cast it to a char*, it points at aa. That's what's done with source.
If we now add sizeof 4, we jump 4 bytes to the right:
aabbccdd??
^
We now point one byte past the value; accessing it could segfault the program or just read garbage. That never happens here, because the code uses --destination instead of destination--: the pointer is decremented before it is dereferenced.
Now we just read the integer passed in from the front, while writing it from the back:
a1b2c3d4 // original int
->
d4c3b2a1 // destination
<-
Note that two hexadecimal digits make one byte, which is why we don't get
4d3c2b1a. The bytes themselves are kept intact, but the first bytes are put last.
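To see it in action (assuming the reverse_endianess function above is in scope and int is 32 bits):
#include <stdio.h>
int main (void)
{
    /* bytes aa bb cc dd come back as dd cc bb aa */
    printf ("%x\n", reverse_endianess(0xaabbccdd)); /* prints ddccbbaa */
    return 0;
}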
Three different approaches. The first one is most efficient on the systems having byte reversing instructions.
#include <stdio.h>
#include <string.h>
#define SWAPUC(a,b) do{unsigned char temp = (a); (a) = (b); (b) = temp;}while(0)
int reverse(int i)
{
unsigned int val = i;
if(sizeof(val) == 4)
val = ((val & 0xff) << 24) | ((val & 0xff00) << 8) | ((val & 0xff0000) >> 8) | ((val & 0xff000000) >> 24);
if(sizeof(val) == 8)
val = ((val & 0x00000000000000ffULL) << 56) | ((val & 0xff00000000000000ULL) >> 56) |
((val & 0x000000000000ff00ULL) << 40) | ((val & 0x00ff000000000000ULL) >> 40) |
((val & 0x0000000000ff0000ULL) << 24) | ((val & 0x0000ff0000000000ULL) >> 24) |
((val & 0x00000000ff000000ULL) << 8) | ((val & 0x000000ff00000000ULL) >> 8);
return val;
}
int reverse1(int val)
{
union
{
unsigned i;
unsigned char uc[sizeof(val)];
}uni = {.i = val};
if(sizeof(val) == 8)
{
SWAPUC(uni.uc[7], uni.uc[0]);
SWAPUC(uni.uc[6], uni.uc[1]);
SWAPUC(uni.uc[5], uni.uc[2]);
SWAPUC(uni.uc[4], uni.uc[3]);
}
if(sizeof(val) == 4)
{
SWAPUC(uni.uc[3], uni.uc[0]);
SWAPUC(uni.uc[2], uni.uc[1]);
}
return uni.i;
}
int reverse2(int val)
{
unsigned char uc[sizeof(val)];
memcpy(uc, &val, sizeof(uc));
if(sizeof(val) == 8)
{
SWAPUC(uc[7], uc[0]);
SWAPUC(uc[6], uc[1]);
SWAPUC(uc[5], uc[2]);
SWAPUC(uc[4], uc[3]);
}
if(sizeof(val) == 4)
{
SWAPUC(uc[3], uc[0]);
SWAPUC(uc[2], uc[1]);
}
memcpy(&val, uc, sizeof(uc));
return val;
}
int main(void)
{
printf("%x\n", reverse2(0xaabbccdd));
}
The generated code (x86):
reverse:
mov eax, edi
bswap eax
ret
reverse1:
mov eax, edi
xor edx, edx
mov ecx, edi
shr eax, 24
movzx esi, ch
sal ecx, 24
mov dl, al
mov eax, edi
sal esi, 16
shr eax, 16
mov dh, al
movzx eax, dx
or eax, esi
or eax, ecx
ret
reverse2:
mov eax, edi
xor edx, edx
mov ecx, edi
shr eax, 24
movzx esi, ch
sal ecx, 24
mov dl, al
mov eax, edi
sal esi, 16
shr eax, 16
mov dh, al
movzx eax, dx
or eax, esi
or eax, ecx
ret
.LC0:
.string "%x\n"
Or cortex M4 (this one has byte swapping instruction)
reverse:
rev r0, r0
bx lr
reverse1:
mov r3, r0
lsrs r2, r3, #24
movs r0, #0
bfi r0, r2, #0, #8
ubfx r2, r3, #16, #8
bfi r0, r2, #8, #8
ubfx r2, r3, #8, #8
bfi r0, r2, #16, #8
bfi r0, r3, #24, #8
bx lr
reverse2:
mov r3, r0
lsrs r2, r3, #24
movs r0, #0
bfi r0, r2, #0, #8
ubfx r2, r3, #16, #8
bfi r0, r2, #8, #8
ubfx r2, r3, #8, #8
bfi r0, r2, #16, #8
bfi r0, r3, #24, #8
bx lr
So the winner is the first function, using only bitwise arithmetic.
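As an aside (my addition, not part of the answer above): GCC and Clang also expose byte reversal directly as a builtin, which compiles to the same single instruction without relying on the optimizer recognizing the shift-and-mask idiom:
#include <stdint.h>
uint32_t reverse_builtin (uint32_t v)
{
    return __builtin_bswap32 (v); /* bswap on x86, rev on Cortex-M4 */
}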
I have been using this for a very long time.
data is a pointer to the value to be reversed.
n is the number of chars to be reversed, usually 2, 4, or 8 for short, int, and long long; this can differ across architectures/OSes.
void SwapEndianN(char *data, unsigned short n) {
unsigned short k; char c;
for ( k=0 ; k < (n/2) ;k++ ) {
c = *(data+((n-1)-k));
*(data+((n-1)-k)) = *(data+k);
*(data+k) = c;
}
}
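Example use, assuming a 4-byte unsigned int (a hypothetical harness, not part of the original):
#include <stdio.h>
int main (void)
{
    unsigned int v = 0xaabbccdd;
    SwapEndianN ((char *)&v, sizeof v);
    printf ("%x\n", v); /* prints ddccbbaa */
    return 0;
}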

I want to translate this pseudocode to assembly in Keil uVision 5

(a and b can be any numbers between -20 and 20. Find the final values of
i, j, and k for three pairs of a and b having the relationships a > b, a < b, and a = b.)
i = 1;
j = 0;
k = -1;
while (i > j) {
    i = i + a - 2 * j;
    if (j >= k) {
        i = i + 2;
        k = k - b + 2 * j;
    }
    j++;
}
Keil (this is my version, but why does it end up in an infinite loop?)
MOV r0, #1
MOV r1, #0
MOV r2, #0
SUB r2,r2,#1; k = -1
MOV r4, #4 ;a =4
MOV r5, #6
MOV r8, #2
B whileLoop
whileLoop
CMP r0,r1
BLE stop
MUL r3,r1,r8 ; r3 = 2*j
ADD r0, r0,r4
SUB r0, r0, r3 ; i = i + a - 2*j
B ifloop
ifloop
CMP r1,r2 ;j>=k?
BLT A
ADD r0,r0,#2
MUL r3,r1,r8 ;r3 = 2*j
SUB r2,r2,r5 ;k = k -b
ADD r2,r2,r3 ; k = k-b+2j
B A
A
ADD r1,r1,#1 ;j++
B whileLoop
stop B stop
ENDP
END
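As a quick sanity check (my addition): a direct C transcription of the pseudocode, using the a = 4, b = 6 values hard-coded in the assembly above, terminates after seven iterations, which suggests the infinite loop comes from the assembly translation rather than from the algorithm itself:
#include <stdio.h>
int main (void)
{
    int a = 4, b = 6;
    int i = 1, j = 0, k = -1;
    while (i > j) {
        i = i + a - 2 * j;
        if (j >= k) {
            i = i + 2;
            k = k - b + 2 * j;
        }
        j++;
    }
    printf ("i=%d j=%d k=%d\n", i, j, k); /* prints i=1 j=7 k=-1 */
    return 0;
}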

Bare metal audio output on Raspberry Pi3 working in AARCH64 asm but not the C version

I have been trying to write a bare metal kernel for over a year now and I am at the point where I am ready to start working on audio output. I have written the code in asm, but since I'm not great at it I'm not sure how to pass audio samples as arguments to an asm function. I tried to rewrite it in C, but that version isn't working. This problem is really a spot-the-difference. I know my asm version works, but there the audio sample is hard-coded into the play_audio function. My goal is to have an init function for the audio with no arguments, and a play_audio function that takes a pointer to the start of the audio file and a pointer to the end of the audio file. The audio file to be played is a 16-bit unsigned int PCM file; the same file I'm trying to use from the C audio code is used successfully in the asm version. Since I set the hardware PWM to expect 13-bit audio at 44100Hz, there is a shift to convert each sample from 16 bits to 13 bits, so that isn't a mistake.
Not_working_audio.c
void init_audio_jack_c()//ERROR IN HERE
{
//Set phone jack to pwm output
uint32_t *gpio_addr = (uint32_t *)(PERIPHERAL_BASE + GPIO_BASE);
uint32_t *gpio_gpfsel4_addr = gpio_addr + GPIO_GPFSEL4;
*gpio_gpfsel4_addr = GPIO_FSEL0_ALT0 | GPIO_FSEL5_ALT0;
//Set clock
uint32_t *clock_manager_addr = (uint32_t *)(((PERIPHERAL_BASE + CM_BASE) & 0x0000FFFF) | ((PERIPHERAL_BASE + CM_BASE) & 0xFFFF0000));
*(clock_manager_addr + CM_PWMDIV) = (CM_PASSWORD | 0x2000);
*(clock_manager_addr + CM_PWMCTL) = ((CM_PASSWORD | CM_ENAB) | (CM_SRC_OSCILLATOR + CM_SRC_PLLCPER));
//Set PWM
uint32_t *pwm_manager_addr = (uint32_t *)(((PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000));
*(pwm_manager_addr + PWM_RNG1) = 0x1624;
*(pwm_manager_addr + PWM_RNG2) = 0x1624;
*(pwm_manager_addr + PWM_CTL) = PWM_USEF2 + PWM_PWEN2 + PWM_USEF1 + PWM_PWEN1 + PWM_CLRF1;
printf("[INFO] Audio Init Finished");
}
int32_t play_16bit_unsigned_audio(uint16_t *start, uint16_t *end)
{
if(end < start)
{
printf("[ERROR] End is less than start.");
return 1;
}
if((start - end) % 2 == 0)
{
printf("[ERROR] Isn't a multiple of two so it isn't 16bit");
return 2;
}
uint16_t *end_of_file = (uint16_t *)(uint64_t)(((uint32_t)(uintptr_t)end & 0x0000FFFF) | ((uint32_t)(uintptr_t)end & 0xFFFF0000));
//FIFO write
while(start != end_of_file)
{
uint16_t sample = start[0];
sample >>= 3;
*(uint32_t *)((((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000)) + PWM_FIF1) = sample;
start++;
sample = start[0];
sample >>= 3;
*(uint32_t *)((((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000)) + PWM_FIF1) = sample;
//FIFO wait
while(*(uint32_t *)((((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0x0000FFFF) | ((uint32_t)(PERIPHERAL_BASE + PWM_BASE) & 0xFFFF0000)) + PWM_STA) != PWM_FULL1);
start++;
}
printf("[INFO] Completed Audio");
return 0;
}
Working_audio.s
.section .text.init_audio_jack, "ax", %progbits
.balign 4
.globl init_audio_jack;
.type init_audio_jack, %function
init_audio_jack:
mov w0,PERIPHERAL_BASE + GPIO_BASE
mov w1,GPIO_FSEL0_ALT0
orr w1,w1,GPIO_FSEL5_ALT0
str w1,[x0,GPIO_GPFSEL4]
// Set Clock
mov w0, PERIPHERAL_BASE
add w0, w0, CM_BASE
and w0, w0, 0x0000FFFF
mov w1, PERIPHERAL_BASE
add w1, w1, CM_BASE
and w1, w1, 0xFFFF0000
orr w0,w0,w1
mov w1,CM_PASSWORD
orr w1,w1,0x2000 // Bits 0..11 Fractional Part Of Divisor = 0, Bits 12..23 Integer Part Of Divisor = 2
brk #0
str w1,[x0,CM_PWMDIV]
mov w1,CM_PASSWORD
orr w1,w1,CM_ENAB
orr w1,w1,CM_SRC_OSCILLATOR + CM_SRC_PLLCPER // Use 650MHz PLLC Clock
str w1,[x0,CM_PWMCTL]
// Set PWM
mov w0, PERIPHERAL_BASE
add w0, w0, PWM_BASE
and w0, w0, 0x0000FFFF
mov w1,PERIPHERAL_BASE
add w1, w1, PWM_BASE
and w1, w1, 0xFFFF0000
orr w0,w0,w1
mov w1,0x1624 // Range = 13bit 44100Hz Mono
str w1,[x0,PWM_RNG1]
str w1,[x0,PWM_RNG2]
mov w1,PWM_USEF2 + PWM_PWEN2 + PWM_USEF1 + PWM_PWEN1 + PWM_CLRF1
str w1,[x0,PWM_CTL]
.section .text.play_audio, "ax", %progbits
.balign 4
.globl play_audio;
.type play_audio, %function
play_audio:
Loop:
adr x1, _binary_src_audio_Interlude_bin_start // X1 = Sound Sample
ldr w2, =_binary_src_audio_Interlude_bin_end
and w2, w2, 0x0000FFFF // W2 = End Of Sound Sample
ldr w3, =_binary_src_audio_Interlude_bin_end
and w3, w3, 0xFFFF0000
orr w2,w2,w3
FIFO_Write:
ldrh w3,[x1],2 // Write 2 Bytes To FIFO
lsr w3,w3,3 // Convert 16bit To 13bit
str w3,[x0,PWM_FIF1] // FIFO Address
ldrh w3, [x1], 2
lsr w3, w3, 3
str w3, [x0, PWM_FIF1]
FIFO_Wait:
ldr w3,[x0,PWM_STA]
tst w3,PWM_FULL1 // Test Bit 1 FIFO Full
b.ne FIFO_Wait
cmp w1,w2 // Check End Of Sound Sample
b.ne FIFO_Write
b Loop // Play Sample Again
Thanks in advance to anyone that can help!

How many 64-bit multiplications are needed to calculate the low 128-bits of a 64-bit by 128-bit product?

Consider that you want to calculate the low 128-bits of the result of multiplying a 64-bit and 128-bit unsigned number, and that the largest multiplication you have available is the C-like 64-bit multiplication which takes two 64-bit unsigned inputs and returns the low 64-bits of the result.
How many multiplications are needed?
Certainly you can do it with eight: break all the inputs up into 32-bit chunks and use your 64-bit multiplication to do the 4 * 2 = 8 required full-width 32*32->64 multiplications, but can one do better?
Of course the algorithm should do only a "reasonable" number of additions or other basic arithmetic on top of the multiplications (I'm not interested in solutions that re-invent multiplication as an addition loop and hence claim "zero" multiplications).
Four, but it starts to get a little tricky.
Let a and b be the numbers to be multiplied, with a0 and a1 being the low and high 32 bits of a, respectively, and b0, b1, b2, b3 being 32-bit groups of b, from low to high respectively.
The desired result is the remainder of (a0 + a1•2^32) • (b0 + b1•2^32 + b2•2^64 + b3•2^96) modulo 2^128.
We can rewrite that as (a0 + a1•2^32) • (b0 + b1•2^32) +
(a0 + a1•2^32) • (b2•2^64 + b3•2^96) modulo 2^128.
The remainder of the latter term modulo 2^128 can be computed as a single 64-bit by 64-bit multiplication (whose result is implicitly multiplied by 2^64).
Then the former term can be computed with three multiplications using a
carefully implemented Karatsuba step. The simple version would involve a 33-bit by 33-bit to 66-bit product which is not available, but there is a trickier version that avoids it:
z0 = a0 * b0
z2 = a1 * b1
z1 = abs(a0 - a1) * abs(b0 - b1) * sgn(a0 - a1) * sgn(b1 - b0) + z0 + z2
The last line contains only one multiplication; the other two pseudo-multiplications are just conditional negations. Absolute-difference and conditional-negate are annoying to implement in pure C, but it could be done.
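For example, one branch-free way to get an absolute difference and its sign in pure C (a sketch; the mulu64x2_karatsuba code in the answer below does the same thing inline):
#include <stdint.h>
/* abs(x - y) and its sign, branch-free, for x, y < 2^32 held in uint64_t. */
static uint64_t absdiff_u32 (uint64_t x, uint64_t y, unsigned *sign)
{
    uint64_t d = x - y;             /* wraps around if x < y */
    unsigned s = (unsigned)(d > x); /* 1 iff the subtraction borrowed */
    *sign = s;
    return (d ^ -(uint64_t)s) + s;  /* conditional two's-complement negation */
}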
Of course, without Karatsuba, 5 multiplies.
Karatsuba is wonderful, but these days a 64 x 64 multiply can be over in 3 clocks and a new one can be scheduled every clock. So the overhead of dealing with the signs and what not can be significantly greater than the saving of one multiply.
For a straightforward 64 x 64 multiply we need:
r0 = a0*b0
r1 = a0*b1
r2 = a1*b0
r3 = a1*b1
where we need to add r0 = r0 + (r1 << 32) + (r2 << 32)
and add r3 = r3 + (r1 >> 32) + (r2 >> 32) + carry
where the carry is the carry out of the additions to r0, and the result is r3:r0.
typedef struct { uint64_t w0, w1 ; } uint64x2_t ;
uint64x2_t
mulu64x2(uint64_t x, uint64_t m)
{
uint64x2_t r ;
uint64_t r1, r2, rx, ry ;
uint32_t x1, x0 ;
uint32_t m1, m0 ;
x1 = (uint32_t)(x >> 32) ;
x0 = (uint32_t)x ;
m1 = (uint32_t)(m >> 32) ;
m0 = (uint32_t)m ;
r1 = (uint64_t)x1 * m0 ;
r2 = (uint64_t)x0 * m1 ;
r.w0 = (uint64_t)x0 * m0 ;
r.w1 = (uint64_t)x1 * m1 ;
rx = (uint32_t)r1 ;
rx = rx + (uint32_t)r2 ; // add the ls halves, collecting carry
ry = r.w0 >> 32 ; // pick up ms of r0
r.w0 += (rx << 32) ; // complete r0
rx += ry ; // complete addition, rx >> 32 == carry !
r.w1 += (r1 >> 32) + (r2 >> 32) + (rx >> 32) ;
return r ;
}
For Karatsuba, the suggested:
z1 = abs(a0 - a1) * abs(b0 - b1) * sgn(a0 - a1) * sgn(b1 - b0) + z0 + z2
is trickier than it looks: for a start, if z1 is 64 bits, then one needs to somehow collect the carry which this addition can generate, and that is complicated by the signedness issues.
z0 = a0*b0
z1 = ax*bx -- ax = (a1 - a0), bx = (b0 - b1)
z2 = a1*b1
where we need to add r0 = z0 + (z1 << 32) + (z0 << 32) + (z2 << 32)
and add r1 = z2 + (z1 >> 32) + (z0 >> 32) + (z2 >> 32) + carry
where the carry is the carry out of the additions that create r0, the result is r1:r0,
and where we must take the signedness of ax, bx, and z1 into account.
typedef unsigned int uint ; /* the code below assumes this alias */
uint64x2_t
mulu64x2_karatsuba(uint64_t a, uint64_t b)
{
uint64_t a0, a1, b0, b1 ;
uint64_t ax, bx, zx, zy ;
uint as, bs, xs ;
uint64_t z0, z2 ;
uint64x2_t r ;
a0 = (uint32_t)a ; a1 = a >> 32 ;
b0 = (uint32_t)b ; b1 = b >> 32 ;
z0 = a0 * b0 ;
z2 = a1 * b1 ;
ax = (uint64_t)(a1 - a0) ;
bx = (uint64_t)(b0 - b1) ;
as = (uint)(ax > a1) ; // sign of magic middle, a
bs = (uint)(bx > b0) ; // sign of magic middle, b
xs = (uint)(as ^ bs) ; // sign of magic middle, x = a * b
ax = (uint64_t)((ax ^ -(uint64_t)as) + as) ; // abs magic middle a
bx = (uint64_t)((bx ^ -(uint64_t)bs) + bs) ; // abs magic middle b
zx = (uint64_t)(((ax * bx) ^ -(uint64_t)xs) + xs) ;
xs = xs & (uint)(zx != 0) ; // discard sign if z1 == 0 !
zy = (uint32_t)zx ; // start ls half of z1
zy = zy + (uint32_t)z0 + (uint32_t)z2 ;
r.w0 = z0 + (zy << 32) ; // complete ls word of result.
zy = zy + (z0 >> 32) ; // complete carry
zx = (zx >> 32) - ((uint64_t)xs << 32) ; // start ms half of z1
r.w1 = z2 + zx + (z0 >> 32) + (z2 >> 32) + (zy >> 32) ;
return r ;
}
I did some very simple timings (using times(), running on a Ryzen 7 1800X):
using gcc __int128 ........... ~780 'units'
using mulu64x2() ............. ~895
using mulu64x2_karatsuba() ... ~1,095
...so, yes, you can save a multiply by using Karatsuba, but whether it's worth doing rather depends.
