Example of very CPU-intensive C function under 5 lines of code - c

I'm testing a progress indicator in my program, so I'd like to test it against a function that takes a noticeable amount of time (i.e. >5 seconds) to finish.
I tried this:
void doLotsOfWork () {
for (int i = 0; i < 100; i++) {
arc4random()%(int)HUGE;
}
}
But this runs in under 1 second on my i7. Can you suggest even more CPU-intensive operations? The function body should really be no more than 5 lines of code.

a simple loop should do the trick:
/* iterate.c */
#include <stdio.h>
int
main (void)
{
printf("start\n");
volatile unsigned long long i;
for (i = 0; i < 1000000000ULL; ++i);
printf("stop\n");
return 0;
}
this takes ~2.5 seconds on my system. feel free to increase the amount of iterations to match your timing expectations.
$> gcc -o iterate iterate.c
$> time ./iterate
start
stop
real 0m2.763s
user 0m2.758s
sys 0m0.000s
short form:
void
wait (void)
{
volatile unsigned long long i;
for (i = 0; i < 1000000000ULL; ++i);
}

Just pause the thread/process? No need to do actual work.
In POSIX, call e.g. usleep(5000000); to pause for five seconds (see usleep() manual page).

Related

pthread is slower than the "default" version

SITUATION
I want to see the advantage of using pthread. If I'm not wrong: threads allow me to execute given parts of program in parallel.
so here is what I try to accomplish: I want to make a program that takes a number(let's say n) and outputs the sum of [0..n].
code
#define MAX 1000000000
int
main() {
long long n = 0;
for (long long i = 1; i < MAX; ++i)
n += i;
printf("\nn: %lld\n", n);
return 0;
}
time: 0m2.723s
to my understanding I could simply take that number MAX and divide by 2 and let 2 threads
do the job.
code
#define MAX 1000000000
#define MAX_THREADS 2
#define STRIDE MAX / MAX_THREADS
typedef struct {
long long off;
long long res;
} arg_t;
void*
callback(void *args) {
arg_t *arg = (arg_t*)args;
for (long long i = arg->off; i < arg->off + STRIDE; ++i)
arg->res += i;
pthread_exit(0);
}
int
main() {
pthread_t threads[MAX_THREADS];
arg_t results[MAX_THREADS];
for (int i = 0; i < MAX_THREADS; ++i) {
results[i].off = i * STRIDE;
results[i].res = 0;
pthread_create(&threads[i], NULL, callback, (void*)&results[i]);
}
for (int i = 0; i < MAX_THREADS; ++i)
pthread_join(threads[i], NULL);
long long result;
result = results[0].res;
for (int i = 1; i < MAX_THREADS; ++i)
result += results[i].res;
printf("\nn: %lld\n", result);
return 0;
}
time: 0m8.530s
PROBLEM
The version with pthread runs slower. Logically this version should run faster, but maybe creation of threads is more expensive.
Can someone suggest a solution or show what I'm doing/understanding wrong here?
Your problem is cache thrashing combined with a lack of optimization (I bet you're compiling without it on).
The naive (-O0) code for
for (long long i = arg->off; i < arg->off + STRIDE; ++i)
arg->res += i;
will access the memory of *arg. With your results array being defined the way it is, that memory is very close to the memory of the next arg and the two threads will fight for the same cache-line, making RAM caching very ineffective.
If you compile with -O1, the loop should use a register instead and only write to memory at the end. Then, you should get better performance with threads (higher optimization levels on gcc seem to optimize the loop out completely)
Another (better) option is to align arg_t on a cache line:
typedef struct {
_Alignas(64) /*typical cache line size*/ long long off;
long long res;
} arg_t;
Then you should get better performance with threads regardless of whether or not you turn optimization on.
Good cache utilization is generally very important in multithreaded programming (and Ulrich Drepper has much to say on that topic in his infamous What Every Programmer Should Know About Memory).
Creating a whole bunch of threads is very unlikely to be quicker than simply adding numbers. The CPU can add an awfully large number of integers in the time it takes the kernel to set up and tear down a thread. To see the benefit of multithreading, you really need each thread to be doing a significant task -- significant compared to the overhead in creating the thread, anyway. Alternatively, you need to keep a pool of threads running, and assign them work according to some allocation strategy.
Multi-threading works best when an application consists of tasks that are somewhat independent, that would otherwise be waiting on one another to complete. It isn't a magic way to get more throughput.

False sharing in multi threads

The following code runs slower as I increase the NTHREADS. Why use more threads make the program run slower? Is there any way to fix it? Someone said it is about false sharing but I do not really understand that concept.
The program basicly calculate the sum from 1 to 100000000. The idea to use multithread is to seperate the number list into several chuncks, and calculate the sum of each chunck parallelly to make the calculation faster.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define LENGTH 100000000
#define NTHREADS 2
#define NREPEATS 10
#define CHUNCK (LENGTH / NTHREADS)
typedef struct {
size_t id;
long *array;
long result;
} worker_args;
void *worker(void *args) {
worker_args *wargs = (worker_args*) args;
const size_t start = wargs->id * CHUNCK;
const size_t end = wargs->id == NTHREADS - 1 ? LENGTH : (wargs->id+1) * CHUNCK;
for (size_t i = start; i < end; ++i) {
wargs->result += wargs->array[i];
}
return NULL;
}
int main(void) {
long* numbers = malloc(sizeof(long) * LENGTH);
for (size_t i = 0; i < LENGTH; ++i) {
numbers[i] = i + 1;
}
worker_args *args = malloc(sizeof(worker_args) * NTHREADS);
for (size_t i = 0; i < NTHREADS; ++i) {
args[i] = (worker_args) {
.id = i,
.array = numbers,
.result = 0
};
}
pthread_t thread_ids[NTHREADS];
for (size_t i = 0; i < NTHREADS; ++i) {
pthread_create(thread_ids+i, NULL, worker, args+i);
}
for (size_t i = 0; i < NTHREADS; ++i) {
pthread_join(thread_ids[i], NULL);
}
long sum = 0;
for (size_t i = 0; i < NTHREADS; ++i) {
sum += args[i].result;
}
printf("Run %2zu: total sum is %ld\n", n, sum);
free(args);
free(numbers);
}
Why use more threads make the program run slower?
There is an overhead creating and joining threads. If the threads hasn't much to do then this overhead may be more expensive than the actual work.
Your threads are only doing a simple sum which isn't that expensive. Also consider that going from e.g. 10 to 11 threads doesn't change the work load per thread a lot.
10 threads --> 10000000 sums per thread
11 threads --> 9090909 sums per thread
The overhead of creating an extra thread may exceed the "work load saved" per thread.
On my PC the program runs in less than 100 milliseconds. Multi-threading isn't worth the trouble.
You need a more processing intensive task before multi-threading is worth doing.
Also notice that it seldom make sense to create more threads than the number of cores (incl hyper thread) your computer has.
false sharing
yes, "false sharing" can impact the performance of a multi-threaded program but I doubt it's the real problem in your case.
"false sharing" is something that happens in (some) cache systems when two threads (or rather two cores) writes to two different variables that belongs to the same cache line. In such cases the two threads/cores competes to own the cache line (for writing) and consequently, they'll have to refresh the memory and the cache again and again. That's bad for performance.
As I said - I doubt that is your problem. A clever compiler will do your loop solely be using CPU registers and only write to memory at the end. You can check the disassemble of your code to see if that is the case.
You can avoid "false sharing" by increasing the sizeof of your struct so that each struct fits the size of a cache line on your system.

Why does this code go into infinite loop

This function below checks to see if an integer is prime or not.
I'm running a for loop from 3 to 2147483647 (+ve limit of long int).
But this code hangs, can't find out why?
#include<time.h>
#include<stdio.h>
int isPrime1(long t)
{
long i;
if(t==1) return 0;
if(t%2==0) return 0;
for(i=3;i<t/2;i+=2)
{
if(t%i==0) return 0;
}
return 1;
}
int main()
{
long i=0;
time_t s,e;
s = time(NULL);
for(i=3; i<2147483647; i++)
{
isPrime1(i);
}
e = time(NULL);
printf("\n\t Time : %ld secs", e - s );
return 0;
}
It will eventually terminate, but will take a while, if you look at your loops when you inline your isPrime1 function, you have something like:
for(i=3; i<2147483647; i++)
for(j=3;j<i/2;j+=2)
which is roughly n*n/4 = O(n^2). Your loop trip count is way too high.
It depends upon the system and the compiler. On Linux, with GCC 4.7.2 and compiling with gcc -O2 vishaid.c -o vishaid the program returns immediately, and the compiler is optimizing all the call to isPrime1 by removing them (I checked the generated assembler code with gcc -O2 -S -fverbose-asm, then main does not even call isPrime1). And GCC is right: since isPrime1 has no side-effect and its result is not used, its call can be removed. Then the for loop has an empty body, so can also be optimized.
The lesson to learn is that when benchmarking optimized binaries, you better have some real side-effect in your code.
Also, arithmetic tells us that some i is prime if it has no divisors less than its square root. So better code:
int isPrime1(long t) {
long i;
double r = sqrt((double)t);
long m = (long)r;
if(t==1) return 0;
if(t%2==0) return 0;
for(i=3;i <= m;i +=2)
if(t%i==0) return 0;
return 1;
}
On my system (x86-64/Debian/Sid with i7 3770K Intel processor, the core running that program is at 3.5GHz) long-s are 64 bits. So I coded
int main ()
{
long i = 0;
long cnt = 0;
time_t s, e;
s = time (NULL);
for (i = 3; i < 2147483647; i++)
{
if (isPrime1 (i) && (++cnt % 4096) == 0) {
printf ("#%ld: %ld\n", cnt, i);
fflush (NULL);
}
}
e = time (NULL);
printf ("\n\t Time : %ld secs\n", e - s);
return 0;
}
and after about 4 minutes it was still printing a lot of lines, including
#6819840: 119566439
#6823936: 119642749
#6828032: 119719177
#6832128: 119795597
I'm guessing it would need several hours to complete. After 30 minutes it is still spitting (slowly)
#25698304: 486778811
#25702400: 486862511
#25706496: 486944147
#25710592: 487026971
Actually, the program needed 4 hours and 16 minutes to complete. Last outputs are
#105086976: 2147139749
#105091072: 2147227463
#105095168: 2147315671
#105099264: 2147402489
Time : 15387 secs
BTW, this program is still really inefficient: The primes program /usr/games/primes from bsdgames package is answering much quicker
% time /usr/games/primes 1 2147483647 | tail
2147483423
2147483477
2147483489
2147483497
2147483543
2147483549
2147483563
2147483579
2147483587
2147483629
/usr/games/primes 1 2147483647
10.96s user 0.26s system 99% cpu 11.257 total
and it has still printed 105097564 lines (most being skipped by tail)
If you are interested in prime number generation, read several math books (it is still a research subject if you are interested in efficiency; you still can get your PhD on that subject.). Start with the sieve of erasthothenes and primality test pages on Wikipedia.
Most importantly, compile first your program with debugging information and all warnings (i.e. gcc -Wall -g on Linux) and learn to use your debugger (i.e. gdb on Linux). You could then interrupt your debugged program (with Ctrl-C under gdb, then let it continue with the cont command to gdb) after about a minute and two, then observe that the i counter in main is increasing slowly. Perhaps also ask for profiling information (with -pg option to gcc then use gprof). And when coding complex arithmetic things it is well worth to read good math books about them (and primality test is a very complex subject, central to most cryptographic algorithms).
This is a very inefficient approach to test for primes, and that's why it seems to hang.
Search the web for more efficient algorithms, such as the Sieve of Eratosthenes
Here try this, see if it's really an infinite loop
int main()
{
long i=0;
time_t s,e;
s = time(NULL);
for(i=3; i<2147483647; i++)
{
isPrime1(i);
//calculate the time execution for each loop
e = time(NULL);
printf("\n\t Time for loop %d: %ld secs", i, e - s );
}
return 0;
}

Output text one letter at a time in C

How would I output text one letter at a time like it's typing without using Sleep() for every character?
Sleep is the best option, since it doesn't waste CPU cycles.
The other option is busy waiting, meaning you spin constantly executing NoOps. You can do that with any loop structure that does absolutely nothing. I'm not sure what this is for, but it seems like you might also want to randomize the time you wait between characters to give it a natural feel.
I would have a Tick() method that would loop through the letters and only progress if a random number was smaller than a threshold I set.
some psuedocode may look like
int escapeIndex = 0;
int escapeMax = 1000000;
boolean exportCharacter = false;
int letterIndex = 0;
float someThresh = 0.000001;
String typedText = "somethingOrOther...";
int letterMax = typedText.length();
while (letterIndex < letterMax){
escapeIndex++;
if(random(1.0) < someThresh){
exportCharacter = true;
}
if(escapeIndex > escapeMax) {
exportCharacter = true;
}
if(exportCharacter) {
cout << typedText.charAt(letterIndex);
escapeIndex = 0;
exportCharacter = false;
letterIndex++;
}
}
If I were doing this in a video game lets say to simulate a player typing text into a terminal, this is how I would do it. It's going to be different every time, and it's escape mechanism provides a maximum time limit for the operation.
Sleeping is the best way to do what you're describing, as the alternative, busy waiting, is just going to waste CPU cycles. From the comments, it sounds like you've been trying to manually hard-code every single character you want printed with a sleep call, instead of using loops...
Since there's been no indication that this is homework after ~20 minutes, I thought I'd post this code. It uses usleep from <unistd.h>, which sleeps for X amount of microseconds, if you're using Windows try Sleep().
#include <stdio.h>
#include <unistd.h>
void type_text(char *s, unsigned ms_delay)
{
unsigned usecs = ms_delay * 1000; /* 1000 microseconds per ms */
for (; *s; s++) {
putchar(*s);
fflush(stdout); /* alternatively, do once: setbuf(stdout, NULL); */
usleep(usecs);
}
}
int main(void)
{
type_text("hello world\n", 100);
return 0;
}
Since stdout is buffered, you're going to have to either flush it after printing each character (fflush(stdout)), or set it to not buffer the output at all by running setbuf(stdout, NULL) once.
The above code will print "hello world\n" with a delay of 100ms between each character; extremely basic.

Speed up C program without using conditional compilation

we are working on a model checking tool which executes certain search routines several billion times. We have different search routines which are currently selected using preprocessor directives. This is not only very unhandy as we need to recompile every time we make a different choice, but also makes the code hard to read. It's now time to start a new version and we are evaluating whether we can avoid conditional compilation.
Here is a very artificial example that shows the effect:
/* program_define */
#include <stdio.h>
#include <stdlib.h>
#define skip 10
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
Here, the variable skip is an example for a value that influences the behavior of the program. Unfortunately, we need to recompile every time we want a new value of skip.
Let's look at another version of the program:
/* program_variable */
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
Here, the value for skip is passed as a command line parameter. This adds great flexibility. However, this program is much slower:
$ time ./program_define 1000 10
50004989999950500
real 0m25.973s
user 0m25.937s
sys 0m0.019s
vs.
$ time ./program_variable 1000 10
50004989999950500
real 0m50.829s
user 0m50.738s
sys 0m0.042s
What we are looking for is an efficient way to pass values into a program (by means of a command line parameter or a file input) that will never change afterward. Is there a way to optimize the code (or tell the compiler to) such that it runs more efficiently?
Any help is greatly appreciated!
Comments:
As Dirk wrote in his comment, it is not about the concrete example. What I meant was a way to replace an if that evaluates a variable that is set once and then never changed (say, a command line option) inside a function that is called literally billions of times by a more efficient construct. We currently use the preprocessor to tailor the desired version of the function. It would be nice if there is a nicer way that does not require recompilation.
You can take a look at libdivide which works to do fast division when the divisor isn't known until runtime: (libdivide is an open source library
for optimizing integer division).
If you calculate a % b using a - b * (a / b) (but with libdivide) you might find that it's faster.
I ran your program_variable code on my system to get a baseline of performance:
$ gcc -Wall test1.c
$ time ./a.out 1000 10
50004989999950500
real 0m55.531s
user 0m55.484s
sys 0m0.033s
If I compile test1.c with -O3, then I get:
$ time ./a.out 1000 10
50004989999950500
real 0m54.305s
user 0m54.246s
sys 0m0.030s
In a third test, I manually set the values of limit and skip:
int limit = 1000, skip = 10;
I then re-run the test:
$ gcc -Wall test2.c
$ time ./a.out
50004989999950500
real 0m54.312s
user 0m54.282s
sys 0m0.019s
Taking out the atoi() calls doesn't make much of a difference. But if I compile with -O3 optimizations turned on, then I get a speed bump:
$ gcc -Wall -O3 test2.c
$ time ./a.out
50004989999950500
real 0m26.756s
user 0m26.724s
sys 0m0.020s
Adding a #define macro for an ersatz atoi() function helped a little, but didn't do much:
#define QSaToi(iLen, zString, iOut) {int j = 1; iOut = 0; \
for (int i = iLen - 1; i >= 0; --i) \
{ iOut += ((zString[i] - 48) * j); \
j = j*10;}}
...
int limit, skip;
QSaToi(4, argv[1], limit);
QSaToi(2, argv[2], skip);
And testing:
$ gcc -Wall -O3 -std=gnu99 test3.c
$ time ./a.out 1000 10
50004989999950500
real 0m53.514s
user 0m53.473s
sys 0m0.025s
The expensive part seems to be those atoi() calls, if that's the only difference between -O3 compilation.
Perhaps you could write one binary, which loops through tests of various values of limit and skip, something like:
#define NUM_LIMITS 3
#define NUM_SKIPS 2
...
int limits[NUM_LIMITS] = {100, 1000, 1000};
int skips[NUM_SKIPS] = {1, 10};
int limit, skip;
...
for (int limitIdx = 0; limitIdx < NUM_LIMITS; limitIdx++)
for (int skipIdx = 0; skipIdx < NUM_SKIPS; skipIdx++)
/* per-limit, per-skip test */
If you know your parameters ahead of compilation time, perhaps you can do it this way. You could use fprintf() to write your output to a per-limit, per-skip file output, if you want results in separate files.
You could try using the GCC likely/unlikely builtins (e.g. here) or profile guided optimization (e.g. here). Also, do you intend (i + j) % 10 or i + (j % 10)? The % operator has higher precedence, so your code as written is testing the latter.
I'm a bit familiar with the program Niels is asking about.
There are a bunch of interesting answers around (thanks), but the answers slightly miss the spirit of the question. The given example programs are really just example programs. The logic that is subject to pre-processor statements is much much more involved. In the end, it is not just about executing a modulo operation or a simple division. it is about keeping or skipping certain procedure calls, executing an operation between two other operations etc, defining the size of an array, etc.
All these things could be guarded by variables that are set by command-line parameters. But that would be too costly as many of these routines, statements, memory allocations are executed a billion times. Perhaps that shapes the problem a bit better. Still very interested in your ideas.
Dirk
If you would use C++ instead of C you could use templates so that things can be calculated at compile time, even recursions are possible.
Please have a look at C++ template meta programming.
A stupid answer, but you could pass the define on the gcc command line and run the whole thing with a shell script that recompiles and runs the program based on a command-line parameter
#!/bin/sh
skip=$1
out=program_skip$skip
if [ ! -x $out ]; then
gcc -O3 -Dskip=$skip -o $out test.c
fi
time $out 1000
I got also an about 2× slowdown between program_define and program_variable, 26.2s vs. 49.0s. I then tried
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j, r;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
for (i = 0; i < 10000000; ++i) {
for (j = 0, r = 0; j < limit; ++j, ++r) {
if (r == skip) r = 0;
if (i + r == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
using an extra variable to avoid the costly division, and the resulting time was 18.9s, so significantly better than the modulo with a statically known constant. However, this auxiliary-variable technique is only promising if the change is easily predictable.
Another possibility would be to eliminate using the modulus operator:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
int current = 0;
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (++current == skip) {
current = 0;
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
If that is the actual code, you have a few ways to optimize it:
(i + j % 10==0) is only true when i==0, so you can skip that entire mod operation when i>0. Also, since i + j only increases by 1 on each loop, you can hoist the mod out and simply have a variable you increment and reset when it hits skip (as has been pointed out in other answers).
You can also have all possible function implementations already in the program, and at runtime you change the function pointer to select the function which you are actually are using.
You can use macros to avoid that you have to write duplicate code:
#define MYFUNCMACRO(name, myvar) void name##doit(){/* time consuming code using myvar */}
MYFUNCMACRO(TEN,10)
MYFUNCMACRO(TWENTY,20)
MYFUNCMACRO(FOURTY,40)
MYFUNCMACRO(FIFTY,50)
If you need to have too many of these macros (hundreds?) you can write a codegenerator which writes the cpp file automatically for a range of values.
I didn't compile nor test the code, but maybe you see the principle.
You might be compiling without optimisation, which will lead your program to load skip each time it's checked, instead of the literal of 10. Try adding -O2 to your compiler's command line, and/or use
register int skip;

Resources