What can I do to optimize the following code? [c]

int FindingMoves(int input1) {
    int result = 0;
    int input2 = input1 + 1;
    result = (2 * input2 + 1) * (2 * input2 + 1);
    return result;
}

What should I do to optimize the above C program so that it can be considered efficient?
Should I use another programming language, maybe Java 8 or C++, to get a better result for the above program?

One way to optimize code is to let the compiler do the job for you.
Consider different versions of the same function:
int Finding_Moves(int input)
{
    ++input;
    input *= 2;
    ++input;
    return input * input;
}

int Finding__Moves(int input1)
{
    int input2 = 2*(input1 + 1) + 1;
    return input2*input2;
}

int FindingMoves(int input1)
{
    int result = 0;
    int input2 = input1 + 1;
    result = (2*input2 + 1)*(2*input2 + 1);
    return result;
}
In all cases the generated assembly is the same:
lea eax, [rdi+3+rdi]
imul eax, eax
ret

There is very little need to optimize code this simple, but if you must, I would write:
int FindingMoves(int input1)
{
    int input2 = 2*(input1 + 1) + 1;
    return input2*input2;
}

If you are interested in micro-optimization, you can play with Godbolt's fantastic Compiler Explorer.
For example, both gcc -O2 and clang -O2 compile your code into just two instructions plus the return:
FindingMoves(int): # @FindingMoves(int)
lea eax, [rdi + rdi + 3]
imul eax, eax
ret
You could rewrite the source to make it more readable, but modern compilers already squeeze every bit of performance from it.
I personally would write:
int FindingMoves(int input) {
    int x = 2 * (input + 1) + 1;
    return x * x;
}
Be aware that optimizing a tiny piece of code like this is not worth it. First get the full program to perform correctly, and use an extensive test suite to verify that. Then improve your code to make it more readable, more secure, more reliable, and still fully correct as verified by the test suite.
Then, if measured performance is not satisfactory, write performance tests with benchmark data and use a profiler to identify areas of improvement.
Focussing on optimization too soon is called premature optimization, a condition that causes frustration at many levels.

Without resorting to assembler, the simple optimisation is
int FindingMoves(int input1)
{
    int term = 2*input1 + 3;
    return term*term;
}
Two multiplications, one addition, and then the result is returned. And it's simple enough that any decent quality compiler can produce effective output quite easily.
I'd want to see significant evidence from test cases and profiling before I would try to optimise further.

Related

Why can't my program reach the integer-addition instruction throughput bound?

I have read chapter 5 of CSAPP 3e. I want to test if the optimization techniques described in the book can work on my computer. I write the following program:
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h> /* for __rdtsc() */

#define SIZE (1024)

int main(int argc, char* argv[]) {
    int sum = 0;
    int* array = malloc(sizeof(int) * SIZE); /* note: contents are uninitialized */
    unsigned long long before = __rdtsc();
    for (int i = 0; i < SIZE; ++i) {
        sum += array[i];
    }
    unsigned long long after = __rdtsc();
    double cpe = (double)(after - before) / SIZE;
    printf("CPE is %f\n", cpe);
    printf("sum is %d\n", sum);
    return 0;
}
and it reports the CPE is around 1.00.
I transform the program using the 4x4 loop unrolling technique and it leads to the following program:
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h> /* for __rdtsc() */

#define SIZE (1024)

int main(int argc, char* argv[]) {
    int sum = 0;
    int* array = malloc(sizeof(int) * SIZE); /* note: contents are uninitialized */
    int sum0 = 0;
    int sum1 = 0;
    int sum2 = 0;
    int sum3 = 0;
    /* 4x4 unrolling */
    unsigned long long before = __rdtsc();
    for (int i = 0; i < SIZE; i += 4) {
        sum0 += array[i];
        sum1 += array[i + 1];
        sum2 += array[i + 2];
        sum3 += array[i + 3];
    }
    unsigned long long after = __rdtsc();
    sum = sum0 + sum1 + sum2 + sum3;
    double cpe = (double)(after - before) / SIZE;
    printf("CPE is %f\n", cpe);
    printf("sum is %d\n", sum);
    return 0;
}
Note that I omit the code to handle the situation when SIZE is not a multiple of 4. This program reports the CPE is around 0.80.
My program runs on an AMD 5950X, and according to AMD's software optimization manual (https://developer.amd.com/resources/developer-guides-manuals/), the integer addition instruction has a latency of 1 cycle and a throughput of 4 instructions per cycle. It also has a load-store unit that can execute three independent load operations at the same time. My expected CPE is 0.33, and I do not know why the result is so much higher.
My compiler is gcc 12.2.0. All programs are compiled with the -Og flag.
I checked the assembly code of the optimized program, but found nothing helpful:
.L4:
movslq %r9d, %rcx
addl (%r8,%rcx,4), %r11d
addl 4(%r8,%rcx,4), %r10d
addl 8(%r8,%rcx,4), %ebx
addl 12(%r8,%rcx,4), %esi
addl $4, %r9d
.L3:
cmpl $127, %r9d
jle .L4
I assume at least 3 of the 4 addl instructions should execute in parallel. However, the result of the program does not meet my expectation.
cmpl $127, %r9d shows this is not a large iteration count compared to rdtsc overhead, the branch mispredict when you exit the loop, and the time for the CPU to ramp up to max frequency.
Also, you want to measure core clock cycles, not TSC reference cycles. Put the loop in a static executable (for minimal startup overhead) and run it with perf stat to get core clocks for the whole process. (As in Can x86's MOV really be "free"? Why can't I reproduce this at all? or some perf experiments I've posted in other answers.)
See Idiomatic way of performance evaluation?
10M to 1000M total iterations is appropriate, since that's still under a second and we only want to measure steady-state behaviour, not cold-cache or cold-branch-predictor effects, or page faults. Interrupt overhead tends to be under 1% on an idle system. Use perf stat --all-user to count only user-space cycles and instructions.
If you want to do it over an array (instead of just removing the pointer increment from the asm), do many passes over a small (16K) array so they all hit in L1d cache. Use a nested loop, or use an AND to wrap an index, as in the sketch below.
Doing that, yes you should be able to measure the 3/clock throughput of add mem, reg on Zen3 and later, even if you leave in the movslq overhead and crap like that from compiler -Og output.
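As a rough illustration (my sketch, not code from the original posts; the 16 KiB size, iteration count, and names are all placeholders), a measurement loop along those lines might look like:

#include <stdint.h>

#define ARRAY_BYTES (16 * 1024)               /* small enough to stay in L1d */
#define ELEMS (ARRAY_BYTES / sizeof(int))     /* 4096 ints, a power of two */
#define TOTAL_ITERS 100000000ULL              /* long enough for steady state */

int sum_many_passes(const int *array) {
    int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (uint64_t i = 0; i < TOTAL_ITERS; i += 4) {
        uint64_t j = i & (ELEMS - 1); /* AND wraps the index; no nested loop needed */
        sum0 += array[j];
        sum1 += array[j + 1];
        sum2 += array[j + 2];
        sum3 += array[j + 3];
    }
    return sum0 + sum1 + sum2 + sum3;
}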
When you're truly micro-benchmarking to find out stuff about throughput of one form of one instruction, it's usually easier to write asm by hand than to coax a compiler into emitting the loop you want. (As long as you know enough asm to avoid pitfalls, e.g. .balign 64 before the loop just for good measure, to hopefully avoid front-end bottlenecks.)
See also https://uops.info/ for how they measure; for any given test, you can click on the link to see the asm loop body for the experiments they ran, and the raw perf counter outputs for each variation on the test. (Although I have to admit I forget what MPERF and APERF mean for AMD CPUs; the results for Intel CPUs are more obvious.) e.g. https://uops.info/html-tp/ZEN3/ADD_R32_M32-Measurements.html is the Zen3 results, which includes a test of 4 or 8 independent add reg, [r14+const] instructions as the inner loop body.
They also tested with an indexed addressing mode. With "unroll_count=200 and no inner loop" they got identical results for MPERF / APERF / UOPS for 4 independent adds, with indexed vs. non-indexed addressing modes. (Their loops don't have a pointer increment.)

Why is using a third variable faster than an addition trick?

When computing fibonacci numbers, a common method is mapping the pair of numbers (a, b) to (b, a + b) multiple times. This can usually be done by defining a third variable c and doing a swap. However, I realised you could do the following, avoiding the use of a third integer variable:
b = a + b; // b2 = a1 + b1
a = b - a; // a2 = b2 - a1 = b1, Ta-da!
I expected this to be faster than using a third variable, since in my mind this new method should only have to consider two memory locations.
So I wrote the following C programs comparing the processes. These mimic the calculation of fibonacci numbers, but rest assured I am aware that they will not calculate the correct values due to size limitations.
(Note: I realise now that it was unnecessary to make n a long int, but I will keep it as it is because that is how I first compiled it)
File: PlusMinus.c
// Using the 'b=a+b;a=b-a;' method.
#include <stdio.h>

int main() {
    long int n = 1000000; // Number of iterations.
    long int a, b;
    a = 0; b = 1;
    while (n--) {
        b = a + b;
        a = b - a;
    }
    printf("%lu\n", a);
}
File: ThirdVar.c
// Using the third-variable method.
#include <stdio.h>

int main() {
    long int n = 1000000; // Number of iterations.
    long int a, b, c;
    a = 0; b = 1;
    while (n--) {
        c = a;
        a = b;
        b = b + c;
    }
    printf("%lu\n", a);
}
When I run the two with GCC (no optimisations enabled) I notice a consistent difference in speed:
$ time ./PlusMinus
14197223477820724411
real 0m0.014s
user 0m0.009s
sys 0m0.002s
$ time ./ThirdVar
14197223477820724411
real 0m0.012s
user 0m0.008s
sys 0m0.002s
When I run the two with GCC with -O3, the assembly outputs are equal. (I suspect I had confirmation bias when stating that one just outperformed the other in previous edits.)
Inspecting the assembly for each, I see that PlusMinus.s actually has one less instruction than ThirdVar.s, but runs consistently slower.
Question
Why does this time difference occur at all? And why is my addition/subtraction method slower, contrary to my expectations?
Why does this time difference occur?
There is no time difference when compiled with optimizations (under recent versions of gcc and clang). For instance, gcc 8.1 for x86_64 compiles both to:
Live at Godbolt
.LC0:
.string "%lu\n"
main:
sub rsp, 8
mov eax, 1000000
mov esi, 1
mov edx, 0
jmp .L2
.L3:
mov rsi, rcx
.L2:
lea rcx, [rdx+rsi]
mov rdx, rsi
sub rax, 1
jne .L3
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
add rsp, 8
ret
And why is my addition/subtraction method slower, contrary to my expectations?
Adding and subtracting could be slower than just moving. However, on most architectures (e.g. an x86 CPU), it is basically the same (1 cycle plus the memory latency), so this does not explain it.
The real problem is, most likely, the dependencies between the data. See:
b = a + b;
a = b - a;
To compute the second line, you have to have finished computing the value of the first. If the compiler uses the expressions as they are (which is the case under -O0), that is what the CPU will see.
In your second example, however:
c = a;
a = b;
b = b + c;
You can compute both the new a and b at the same time, since they do not depend on each other. And, in a modern processor, those operations can actually be computed in parallel. Or, putting it another way, you are not "stopping" the processor by making it wait on a previous result. This is called Instruction-level parallelism.
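To make the dependency chains concrete, here are the same statements annotated (my annotation, not part of the original answer):

/* Serial version: each line needs the result of the previous one. */
b = a + b;   /* must complete before the next line can start */
a = b - a;   /* reads the b computed just above: a chain of two per step */

/* Third-variable version: the new a and the new b are independent. */
c = a;       /* independent of the add below */
a = b;       /* independent move */
b = b + c;   /* only this add sits on the loop-carried chain */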

How 'smart' is GCC's Tail-Call-Optimisation?

I just had a discussion where the following two pieces of C code were being discussed:
For-Loop:
#include <stdio.h>
#define n (196607)

int main() {
    long loop;
    int count = 0;
    for (loop = 0; loop < n; loop++) {
        count++;
    }
    printf("Result = %d\n", count);
    return 0;
}
Recursive:
#include <stdio.h>
#define n (196607)

long recursive(long loop) {
    return (loop > 0) ? recursive(loop - 1) + 1 : 0;
}

int main() {
    long result;
    result = recursive(n);
    printf("Result = %ld\n", result); // %ld: result is a long
    return 0;
}
On seeing this code, I saw recursive(loop-1)+1 and thought "ah, that's not tail call recursive" because it has work to do after the call to recursive is complete; it needs to increment the return value.
Sure enough, with no optimisation, the recursive code triggers a stack overflow, as you would expect.
With the -O2 flag, however, the stack overflow is not encountered, which I take to mean that the stack frame is reused rather than pushing more and more onto the stack, which is TCO.
GCC can obviously detect this simple case (+1 to return value) and optimise it out, but how far does it go?
What are the limits to what gcc can optimise with tco, when the recursive call isn't the last operation to be performed?
Addendum:
I've written a fully tail-recursive (return function();) version of the code, sketched below.
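The post doesn't show that version's exact code, but an accumulator-passing sketch of it (names are mine) would look like:

long tail_recursive(long loop, long acc) {
    /* the recursive call is the last operation, so GCC can turn it into a jump */
    return (loop > 0) ? tail_recursive(loop - 1, acc + 1) : acc;
}
/* called as tail_recursive(n, 0) */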
Wrapping the above in a loop with 9999999 iterations, I came up with the following timings:
$ for f in *.exe; do time ./$f > results; done
+ for f in '*.exe'
+ ./forLoop.c.exe
real 0m3.650s
user 0m3.588s
sys 0m0.061s
+ for f in '*.exe'
+ ./recursive.c.exe
real 0m3.682s
user 0m3.588s
sys 0m0.093s
+ for f in '*.exe'
+ ./tail_recursive.c.exe
real 0m3.697s
user 0m3.588s
sys 0m0.077s
So an (admittedly simple and not very rigorous) benchmark shows that all three do indeed seem to take the same order of time.
Just disassemble the code and see what happened. Without optimizations, I get this:
0x0040150B cmpl $0x0,0x10(%rbp)
0x0040150F jle 0x401523 <recursive+35>
0x00401511 mov 0x10(%rbp),%eax
0x00401514 sub $0x1,%eax
0x00401517 mov %eax,%ecx
0x00401519 callq 0x401500 <recursive>
But with -O1, -O2 or -O3 I get this:
0x00402D09 mov $0x2ffff,%edx
This doesn't have anything to do with tail-call optimization; it is a much more radical optimization. The compiler simply inlined the whole function and pre-calculated the result (0x2ffff is 196607).
This is likely why you end up with the same result in all your different cases of benchmarking.
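If you want the benchmark to measure the recursion itself rather than a pre-computed constant, one common trick (my suggestion, not from the original answer) is to make the argument unknowable at compile time, for example by reading it at run time:

#include <stdio.h>
#include <stdlib.h>

long recursive(long loop) {
    return (loop > 0) ? recursive(loop - 1) + 1 : 0;
}

int main(int argc, char *argv[]) {
    /* n is no longer a compile-time constant, so the result cannot be folded */
    long n = (argc > 1) ? atol(argv[1]) : 196607;
    printf("Result = %ld\n", recursive(n));
    return 0;
}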

My attempt to optimize memset on a 64-bit machine takes more time than the standard implementation. Can someone please explain why?

(machine is x86 64 bit running SL6)
I was trying to see if I can optimize memset on my 64-bit machine. As per my understanding, memset goes byte by byte and sets the value. I assumed that if I do it in units of 64 bits, it would be faster. But somehow it takes more time. Can someone take a look at my code and suggest why?
/* Code */
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>

void memset8(unsigned char *dest, unsigned char val, uint32_t count)
{
    while (count--)
        *dest++ = val;
}

void memset32(uint32_t *dest, uint32_t val, uint32_t count)
{
    while (count--)
        *dest++ = val;
}

void memset64(uint64_t *dest, uint64_t val, uint32_t count)
{
    while (count--)
        *dest++ = val;
}

#define CYCLES 1000000000

int main()
{
    clock_t start, end;
    double total;
    uint64_t loop;
    uint64_t val;

    /* memset 32 */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        val = 0xDEADBEEFDEADBEEF;
        memset32((uint32_t*)&val, 0, 2);
    }
    end = clock();
    total = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Timetaken memset32 %g\n", total);

    /* memset 64 */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        val = 0xDEADBEEFDEADBEEF;
        memset64(&val, 0, 1);
    }
    end = clock();
    total = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Timetaken memset64 %g\n", total);

    /* memset 8 */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        val = 0xDEADBEEFDEADBEEF;
        memset8((unsigned char*)&val, 0, 8);
    }
    end = clock();
    total = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Timetaken memset8 %g\n", total);

    /* memset */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        val = 0xDEADBEEFDEADBEEF;
        memset(&val, 0, 8);
    }
    end = clock();
    total = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Timetaken memset %g\n", total);

    printf("-----------------------------------------\n");
    return 0;
}
/*Result*/
Timetaken memset32 12.46
Timetaken memset64 7.57
Timetaken memset8 37.12
Timetaken memset 6.03
-----------------------------------------
Looks like the standard memset is more optimized than my implementation.
I tried looking into the code, and everywhere I see that the implementation of memset is the same as what I did for memset8. When I use memset8, the results are more like what I expect and very different from memset.
Can someone suggest what I am doing wrong?
Actual memset implementations are typically hand-optimized in assembly, and use the widest aligned writes available on the targeted hardware. On x86_64 that will be at least 16B stores (movaps, for example). It may also take advantage of prefetching (this is less common recently, as most architectures have good automatic streaming prefetchers for regular access patterns), streaming stores or dedicated instructions (historically rep stos was unusably slow on x86, but it is quite fast on recent microarchitectures). Your implementation does none of these things. It should not be terribly surprising that the system implementation is faster.
As an example, consider the implementation used in OS X 10.8 (which has been superseded in 10.9). Here’s the core loop for modest-sized buffers:
.align 4,0x90
1: movdqa %xmm0, (%rdi,%rcx)
movdqa %xmm0, 16(%rdi,%rcx)
movdqa %xmm0, 32(%rdi,%rcx)
movdqa %xmm0, 48(%rdi,%rcx)
addq $64, %rcx
jne 1b
This loop will saturate the LSU when hitting cache on pre-Haswell microarchitectures at 16B/cycle. An implementation based on 64-bit stores like your memset64 cannot exceed 8B/cycle (and may not even achieve that, depending on the microarchitecture in question and whether or not the compiler unrolls your loop). On Haswell, an implementation that uses AVX stores or rep stos can go even faster and achieve 32B/cycle.
As per my understanding memset goes byte by byte and sets the value.
The details of what the memset facility does are implementation dependent. Relying on this is usually a good thing, because I'm sure the implementors have extensive knowledge of the system and know all kinds of techniques to make things as fast as possible.
To elaborate a little more, let's look at:
memset(&val, 0, 8);
When the compiler sees this it can notice a few things like:
The fill value is 0
The number of bytes to fill is 8
and then choose the right instructions to use depending on where val or &val is (in a register, in memory, ...). But if memset must be an actual function call (like your implementations), none of those optimizations are possible. And even when it can't make compile-time decisions like:
memset(&val, x, y); // no way to tell at compile time what x and y will be...
you can be assured that there's a function call written in assembler that will be as fast as possible for your platform.
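For instance (my illustration, not from the original answer), when both the fill value and the size are compile-time constants, gcc and clang at -O2 typically drop the call entirely and zero the object with a single instruction:

#include <string.h>
#include <stdint.h>

uint64_t zeroed(void) {
    uint64_t val;
    memset(&val, 0, sizeof val); /* folds to xor eax,eax; no call to memset is emitted */
    return val;
}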
I think it's worth exploring how to write a faster memset, particularly with GCC (which I assume you are using with Scientific Linux 6), in C/C++. Many people assume the standard implementation is optimized. This is not necessarily true. If you look at table 2.1 of Agner Fog's Optimizing Software in C++ manual, he compares memcpy for several different compilers and platforms to his own assembly-optimized version of memcpy. memcpy in GCC at the time really underperformed (but the Mac version was good). He claims the built-in functions are even worse and recommends using -fno-builtin. GCC in my experience is very good at optimizing code, but its library functions (and built-in functions) are not very optimized (with ICC it's the other way around).
It would be interesting to see how well you could do using intrinsics. If you look at his asmlib you can see how he implements memset with SSE and AVX (it would be interesting to compare this to Apple's optimized version that Stephen Canon posted).
With AVX you can see he writes 32 bytes at a time.
K100: ; Loop through 32-bytes blocks. Register use is swapped
; Rcount = end of 32-bytes blocks part
; Rdest = negative index from the end, counting up to zero
vmovaps [Rcount+Rdest], ymm0
add Rdest, 20H
jnz K100
vmovaps in this case is the same as the intrinsic _mm256_store_ps. Maybe GCC has improved since then, but you might be able to beat GCC's implementation of memset using intrinsics. If you don't have AVX, you certainly have SSE (all x86-64 CPUs do), so you could look at the SSE version of his code to see what you could do; a minimal SSE sketch follows the AVX example below.
Here is a start for your memset32 function, assuming the array fits in the L1 cache. If the array does not fit in the cache, you want to do a non-temporal store with _mm256_stream_ps. For a general function you also need to handle several more cases, including when the memory is not 32-byte aligned.
#include <immintrin.h>

int main() {
    int count = (1 << 14) / sizeof(int);
    int* dest = (int*)_mm_malloc(sizeof(int) * count, 32); // 32-byte aligned
    int val = 0xDEADBEEF; // a 64-bit literal here would just be truncated to int
    __m256 val8 = _mm256_castsi256_ps(_mm256_set1_epi32(val));
    for (int i = 0; i < count; i += 8) {
        _mm256_store_ps((float*)(dest + i), val8);
    }
    _mm_free(dest); // pair with _mm_malloc
}
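For comparison, here is a minimal SSE2 version of the same idea (my sketch, not Agner Fog's code); SSE2 is available on every x86-64 CPU:

#include <emmintrin.h>

/* dest must be 16-byte aligned and count a multiple of 4 */
void memset32_sse(int *dest, int val, int count) {
    __m128i v = _mm_set1_epi32(val);
    for (int i = 0; i < count; i += 4) {
        _mm_store_si128((__m128i*)(dest + i), v); /* 16-byte aligned store */
    }
}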

Computing high 64 bits of a 64x64 int product in C

I would like my C function to efficiently compute the high 64 bits of the product of two 64 bit signed ints. I know how to do this in x86-64 assembly, with imulq and pulling the result out of %rdx. But I'm at a loss for how to write this in C at all, let alone coax the compiler to do it efficiently.
Does anyone have any suggestions for writing this in C? This is performance sensitive, so "manual methods" (like Russian Peasant, or bignum libraries) are out.
This dorky inline assembly function I wrote works and is roughly the codegen I'm after:
static long mull_hi(long inp1, long inp2) {
    long output = -1;
    __asm__("movq %[inp1], %%rax;"
            "imulq %[inp2];"
            "movq %%rdx, %[output];"
            : [output] "=r" (output)
            : [inp1] "r" (inp1), [inp2] "r" (inp2)
            : "%rax", "%rdx");
    return output;
}
If you're using a relatively recent GCC on x86_64:
int64_t mulHi(int64_t x, int64_t y) {
    return (int64_t)(((__int128_t)x * y) >> 64);
}
At -O1 and higher, this compiles to what you want:
_mulHi:
0000000000000000 movq %rsi,%rax
0000000000000003 imulq %rdi
0000000000000006 movq %rdx,%rax
0000000000000009 ret
I believe that clang and VC++ also have support for the __int128_t type, so this should also work on those platforms, with the usual caveats about trying it yourself.
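For completeness (my addition, following the same pattern), the unsigned counterpart just swaps the types:

#include <stdint.h>

uint64_t mulHiU(uint64_t x, uint64_t y) {
    return (uint64_t)(((__uint128_t)x * y) >> 64);
}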
The general answer is that x * y can be broken down into (a + b) * (c + d), where a and c are the high order parts.
First, expand to ac + ad + bc + bd
Now, you multiply the terms as 32 bit numbers stored as long long (or better yet, uint64_t), and you just remember that when you multiplied a higher order number, you need to scale by 32 bits. Then you do the adds, remembering to detect carry. Keep track of the sign. Naturally, you need to do the adds in pieces.
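Concretely, writing x = a*2^32 + b and y = c*2^32 + d gives x*y = a*c*2^64 + (a*d + b*c)*2^32 + b*d, so the high 64 bits are a*c plus the high halves of the two mixed terms, plus any carries out of the low half.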
For code implementing the above, see my other answer.
With regard to your assembly solution, don't hard-code the mov instructions! Let the compiler do it for you. Here's a modified version of your code:
static long mull_hi(long inp1, long inp2) {
    long output;
    __asm__("imulq %2"
            : "=d" (output)
            : "a" (inp1), "r" (inp2));
    return output;
}
Helpful reference: Machine Constraints
Since you did a pretty good job solving your own problem with the machine code, I figured you deserved some help with the portable version. I would leave an #ifdef in so that you just use the assembly when building with GNU compilers on x86.
Anyway, here is an implementation based on my general answer. I'm pretty sure this is correct, but no guarantees, I just banged this out last night. You probably should get rid of the statics positive_result[] and result_negative - those are just artefacts of my unit test.
#include <stdlib.h>
#include <stdio.h>
// stdarg.h doesn't help much here because we need to call llabs()

typedef unsigned long long uint64_t;
typedef signed long long int64_t;

#define B32 0xffffffffUL

static uint64_t positive_result[2]; // used for testing
static int result_negative;         // used for testing

static void mixed(uint64_t *result, uint64_t innerTerm)
{
    // the high part of innerTerm is actually the easy part
    result[1] += innerTerm >> 32;
    // the low order a*d might carry out of the low order result
    uint64_t was = result[0];
    result[0] += (innerTerm & B32) << 32;
    if (result[0] < was) // carry!
        ++result[1];
}

static uint64_t negate(uint64_t *result)
{
    uint64_t t = result[0] = ~result[0];
    result[1] = ~result[1];
    if (++result[0] < t)
        ++result[1];
    return result[1];
}

uint64_t higherMul(int64_t sx, int64_t sy)
{
    uint64_t x, y, result[2] = { 0 }, a, b, c, d;
    x = (uint64_t)llabs(sx);
    y = (uint64_t)llabs(sy);
    a = x >> 32;
    b = x & B32;
    c = y >> 32;
    d = y & B32;
    // the highest and lowest order terms are easy
    result[1] = a * c;
    result[0] = b * d;
    // now have the mixed terms ad + bc to worry about
    mixed(result, a * d);
    mixed(result, b * c);
    // now deal with the sign
    positive_result[0] = result[0];
    positive_result[1] = result[1];
    result_negative = sx < 0 ^ sy < 0;
    return result_negative ? negate(result) : result[1];
}
Wait, you have a perfectly good, optimized assembly solution already working for this, and you want to back it out and try to write it in an environment that doesn't support 128-bit math? I'm not following.
As you're obviously aware, this operation is a single instruction on x86-64. Obviously nothing you do is going to make it work any better. If you really want portable C, you'll need to do something like DigitalRoss's code above and hope that your optimizer figures out what you're doing.
If you need architecture portability but are willing to limit yourself to gcc platforms, there are __int128_t (and __uint128_t) types in the compiler intrinsics that will do what you want.
