c - Avoid if in loop - c

Context
Debian 64.
Core 2 duo.
Fiddling with a loop. I came up with different variations of the same loop, but I would like to avoid conditional branching if possible.
Even so, I think the plain version will be difficult to beat.
I thought about SSE or bit shifting, but that would still require a jump (look at the computed goto below). Spoiler: a computed jump doesn't seem to be the way to go.
The code is compiled without PGO, because on this piece of code it makes the code slower.
flags :
gcc -march=native -O3 -std=c11 test_comp.c
Unrolling the loop didn't help here.
63 in ascii is '?'.
The printf is here to force the code to execute. Nothing more.
My need :
A logic to avoid the condition. I take this as a challenge for my holidays :)
The code :
Test with the sentence. The character '?' is guaranteed to be there but at a random position.
hjkjhqsjhdjshnbcvvyzayuazeioufdhkjbvcxmlkdqijebdvyxjgqddsyduge?iorfe
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char **argv){
/* This is quite slow. Average actually.
Executes in 369,041 cycles here (cachegrind) */
for (int x = 0; x < 100; ++x){
if (argv[1][x] == 63){
printf("%d\n",x);
break;
}
}
/* This is the slowest.
Executes in 370,385 cycles here (cachegrind) */
register unsigned int i = 0;
static void * restrict table[] = {&&keep,&&end};
keep:
++i;
goto *table[(argv[1][i-1] == 63)];
end:
printf("i = %d",i-1);
/* This is slower. Because of the calculation..
Executes in 369,109 cycles here (cachegrind) */
for (int x = 100; ; --x){
if (argv[1][100 - x ] == 63){printf("%d\n",100-x);break;}
}
return 0;
}
Question
Is there a way to make it faster, maybe by avoiding the branch?
The branch miss rate is huge at 11.3% (cachegrind with --branch-sim=yes).
I cannot believe this is the best one can achieve.
If some of you are skilled in assembly, please chime in.

Assuming you have a buffer of well-known size that can hold the maximum number of chars to test against, like
char buffer[100];
make it one byte larger
char buffer[100 + 1];
then fill it with the sequence to test against
read(fileno(stdin), buffer, 100);
and put your test-char '?' at the very end
buffer[100] = '?';
This allows a loop with only one test condition:
size_t i = 0;
while ('?' != buffer[i])
{
    ++i;
}
if (100 == i)
{
    /* test failed */
}
else
{
    /* test passed for i */
}
Leave all other optimisation to the compiler.
However, I couldn't resist, so here's a possible approach to micro-optimisation:
char buffer[100 + 1];
read(fileno(stdin), buffer, 100);
buffer[100] = '?';
char * p = buffer;
while ('?' != *p)
{
    ++p;
}
if ((p - buffer) == 100)
{
    /* test failed */
}
else
{
    /* test passed for (p - buffer) */
}
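For comparison, the sentinel loop can be put next to the standard library's memchr, which searches a bounded region and so does not need the extra byte. This is only a minimal sketch: the names find_with_sentinel and find_with_memchr are made up for illustration, and which variant is faster on a given machine can only be settled by measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: sentinel search as described above.
 * buf must have room for len + 1 bytes; buf[len] is overwritten.
 * Returns the index of the first target, or len if none was found. */
size_t find_with_sentinel(char *buf, size_t len, char target)
{
    assert(buf != NULL);
    buf[len] = target;          /* the sentinel guarantees termination */
    size_t i = 0;
    while (buf[i] != target)
        ++i;
    return i;
}

/* Hypothetical helper: same return convention, using memchr. */
size_t find_with_memchr(const char *buf, size_t len, char target)
{
    assert(buf != NULL);
    const char *p = memchr(buf, target, len);
    return p ? (size_t)(p - buf) : len;
}
```

Both functions return len when the character is absent, so a caller can test the result the same way as in the loops above.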

Related

Changed value of variable

Do you have any idea what may change that value? The code runs on an STM32. There are interrupts, but it's almost impossible for one to occur between the value's initialization and the line after the if statement.
My first idea is that this value is written to an illegal part of memory which is used by some register.
I'm compiling the code with optimization O1, and only this function has optimization O0 to make analysis easier. The software also crashes in run mode, so it's not a problem with debugging.
The change of value leads to an overflow a few lines later and crashes the system. The situation repeats every time.
I don't have any idea. I've only checked that the code is correct: the declaration of the function and the place where it is used.
#pragma GCC push_options
#pragma GCC optimize ("O0")
MonitoringParseMessStatus monitoring_ack_message(char *msg, uint16_t length)
{
MonitoringParseMessStatus res = M_BAD_MESS;
uint16_t single_ack_message_length = 18; // minimum len
if(length < single_ack_message_length)
return M_NAK;
uint8_t len_mess = strlen(msg);
char single_message[len_mess + 1];
while(length >= len_mess && length >= single_ack_message_length)
{
memset(&single_message[0], 0, sizeof(single_message));
memcpy(&single_message[0], msg, single_ack_message_length);
msg += len_mess + 1;
length -= len_mess;
len_mess = strlen(msg);
char * ack_ptr = strstr(single_message, "\"ACK\"");
if(ack_ptr == NULL)
{
ack_ptr = strstr(single_message, "\"NAK\"");
if(ack_ptr != NULL)
{
return M_NAK;
}
res = M_BAD_MESS;
continue;
}
else
res = M_ACK;
char *seq_ptr = ack_ptr + strlen("\"ACK\"");
int seq = atoi(seq_ptr);
for(int i = 0; i < QUEUE_SIZE; i++)
{
if(monitoring_queue[i].set == false)
continue;
if(monitoring_queue[i].sequence != seq)
continue;
monitoring_queue[i].set = false;
monitoring_connected_set(monitoring_queue[i].monitoring_num, true);
monitoring_send_earliest_event();
break;
}
}
return res;
}
#pragma GCC pop_options
Optimization might trip up the debugger. So for errors like this you first of all need to debug at the assembler level, to ensure that the assignment of the variable has indeed happened at the line where you placed the breakpoint. Decent debuggers have an option to single-step the machine code interleaved with the C code.
Note: some expression simplification might occur even when optimization is disabled (-O0). Variables may still be placed in registers etc.
Other than that, local variables mysteriously changing value is often caused by stack overflow. Check the SP in your debugger when you are on this line.
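To illustrate the stack point: a copy into a local buffer that is longer than the buffer overwrites whatever the compiler placed next to it, which can look like a local variable changing on its own. Whether that is what happens in the posted function is an assumption and is not confirmed in this thread; it may be worth checking, though, since the function copies single_ack_message_length bytes into a buffer sized from strlen(msg). A minimal sketch of a guarded copy, under the hypothetical name bounded_copy:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most dst_size - 1 bytes and always
 * NUL-terminate, so neighbouring stack variables cannot be overwritten.
 * Returns the number of bytes actually copied (excluding the NUL). */
size_t bounded_copy(char *dst, size_t dst_size, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    if (n > dst_size - 1)
        n = dst_size - 1;       /* clamp to the destination capacity */
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

If the returned count is smaller than the requested n, the source did not fit, and the caller can treat the message as malformed instead of corrupting the stack.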

understand membarrier function in linux

Example of using membarrier function from linux manual: https://man7.org/linux/man-pages/man2/membarrier.2.html
#include <stdlib.h>
static volatile int a, b;
static void
fast_path(int *read_b)
{
a = 1;
asm volatile ("mfence" : : : "memory");
*read_b = b;
}
static void
slow_path(int *read_a)
{
b = 1;
asm volatile ("mfence" : : : "memory");
*read_a = a;
}
int
main(int argc, char **argv)
{
int read_a, read_b;
/*
* Real applications would call fast_path() and slow_path()
* from different threads. Call those from main() to keep
* this example short.
*/
slow_path(&read_a);
fast_path(&read_b);
/*
* read_b == 0 implies read_a == 1 and
* read_a == 0 implies read_b == 1.
*/
if (read_b == 0 && read_a == 0)
abort();
exit(EXIT_SUCCESS);
}
The code above transformed to use membarrier() becomes:
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>
static volatile int a, b;
static int
membarrier(int cmd, unsigned int flags, int cpu_id)
{
return syscall(__NR_membarrier, cmd, flags, cpu_id);
}
static int
init_membarrier(void)
{
int ret;
/* Check that membarrier() is supported. */
ret = membarrier(MEMBARRIER_CMD_QUERY, 0, 0);
if (ret < 0) {
perror("membarrier");
return -1;
}
if (!(ret & MEMBARRIER_CMD_GLOBAL)) {
fprintf(stderr,
"membarrier does not support MEMBARRIER_CMD_GLOBAL\n");
return -1;
}
return 0;
}
static void
fast_path(int *read_b)
{
a = 1;
asm volatile ("" : : : "memory");
*read_b = b;
}
static void
slow_path(int *read_a)
{
b = 1;
membarrier(MEMBARRIER_CMD_GLOBAL, 0, 0);
*read_a = a;
}
int
main(int argc, char **argv)
{
int read_a, read_b;
if (init_membarrier())
exit(EXIT_FAILURE);
/*
* Real applications would call fast_path() and slow_path()
* from different threads. Call those from main() to keep
* this example short.
*/
slow_path(&read_a);
fast_path(&read_b);
/*
* read_b == 0 implies read_a == 1 and
* read_a == 0 implies read_b == 1.
*/
if (read_b == 0 && read_a == 0)
abort();
exit(EXIT_SUCCESS);
}
This "membarrier" description is taken from the Linux manual. I am still confused about how the membarrier function adds overhead to the slow side and removes overhead from the fast side, thus resulting in an overall performance increase as long as the slow side is infrequent enough that the overhead of the membarrier() calls does not outweigh the performance gain on the fast side.
Could you please help me by describing it in more detail?
Thanks!
This pair of write-then-read-the-other-variable sequences is the pattern from "Memory Reordering Caught in the Act" (https://preshing.com/20120515/memory-reordering-caught-in-the-act/), a demo of StoreLoad reordering (the only kind x86 allows, given its program-order + store buffer with store forwarding memory model).
With only one local MFENCE you could still get reordering:
FAST                          SLOW using just mfence, not membarrier
a = 1 exec
read_b = b; // 0
                              b = 1;
                              mfence (force b=1 to be visible before reading a)
                              read_a = a; // 0
a = 1 visible (global vis. delayed by store buffer)
But consider what would happen if an mfence-on-every-core had to be part of every possible order, between the slow-path's store and its reload.
This ordering would no longer be possible. If read_b=b has already read a 0, then a=1 is already pending (see footnote 1), if it isn't visible already. It's impossible for it to stay private until after read_a = a because membarrier() makes sure a full barrier runs on every core, and SLOW waits for that to happen (membarrier to return) before reading a.
And there's no way to get 0,0 from having SLOW execute first; it runs membarrier itself so its store is definitely visible to other threads before it reads a.
footnote 1: Waiting to execute, or waiting in the store buffer to commit to L1d cache. The asm("":::"memory") ensures that, but is actually redundant because volatile itself guarantees that the accesses happen in asm in program order. And we basically need volatile for other reasons when hand-rolling atomics instead of using C11 _Atomic. (But generally don't do that unless you're actually writing kernel code. Use atomic_store_explicit(&a, 1, memory_order_release);).
Note it's actually the store buffer that creates StoreLoad reordering (the only kind x86 allows), not so much OoO exec. In fact, a store buffer also lets x86 execute stores out-of-order and then make them globally visible in program order (if it turns out they weren't the result of mis-speculation or something!).
Also note that in-order CPUs can do their memory accesses out of order. They start instructions (including loads) in order, but can let them complete out of order, e.g. by scoreboarding loads to allow hit-under-miss. See also How is load->store reordering possible with in-order commit?
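For readers who want the portable C11 spelling mentioned in the footnote, the following is a minimal sketch of the first version of the example (a full barrier on both sides, not the membarrier() variant). It assumes that atomic_thread_fence(memory_order_seq_cst) is an acceptable stand-in for the inline mfence; the function names fast_path and slow_path are taken from the man page example, everything else is illustrative.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int a, b;         /* zero-initialised, shared between paths */

/* Store a, full fence, then load b. The seq_cst fence between the
 * store and the load is what forbids StoreLoad reordering here. */
void fast_path(int *read_b)
{
    assert(read_b != 0);
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    *read_b = atomic_load_explicit(&b, memory_order_relaxed);
}

/* Mirror image: store b, full fence, then load a. */
void slow_path(int *read_a)
{
    assert(read_a != 0);
    atomic_store_explicit(&b, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    *read_a = atomic_load_explicit(&a, memory_order_relaxed);
}
```

With a fence in both paths, the outcome read_a == 0 && read_b == 0 is not allowed when the two functions run on different threads. The membarrier() version keeps that guarantee while moving the whole cost of the barrier to the slow path.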

Is it possible to effectively parallelise a brute-force attack on 4 different password patterns?

In the context of my homework task I need to smart brute-force a set of passwords. Every password in the set has one of four possible masks:
%%##
##%%
#%%#
%##%
( # - a numeric character, % - a lowercase alpha character ).
At this point I am doing something like this to run over only one pattern ( the 1st one ) in multithreading:
// Compile: $ gcc test.c -o test -fopenmp -O3 -std=c99
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <omp.h>
int main() {
const char alp[26] = "abcdefghijklmnopqrstuvwxyz";
const char num[10] = "0123456789";
register int i;
char pass[4];
#pragma omp parallel for private(pass)
for (i = 0; i < 67600; i++) {
pass[3] = num[i % 10];
pass[2] = num[i / 10 % 10];
pass[1] = alp[i / 100 % 26];
pass[0] = alp[i / 2600 % 26];
/* Slow password processing here */
}
return 0;
}
But, unfortunately, that technique has nothing to do with searching passwords with different patterns.
So my question is:
Is there a way to construct an effective set of parallel for instructions in order to run the attack simultaneously on each password pattern?
Help is much appreciated.
The trick here is to note that all four password options are simply rotations/shifts of each other.
That is, for the example password qr34 and the patterns you mention, you are looking at:
qr34 %%## #Original potential password
4qr3 #%%# #Rotate 1 place right
34qr ##%% #Rotate 2 places right
r34q %##% #Rotate 3 places right
Given this, you can use the same generation technique as in your first question.
For each potential password generated, check the potential password as well as the next three shifts of that password.
Note that the following code relies on short-circuit evaluation in C/C++: if the truth value of a statement can be deduced early, no further evaluation takes place. That is, given the statement if(A || B || C), if A is false, then B must be evaluated; however, if B is true, then C is never evaluated.
This means that we can have A=PassCheck(pass), B=PassCheck(RotateString(pass,4)), C=PassCheck(RotateString(pass,4)) and so on, with the guarantee that the password will only be rotated as many times as necessary.
Note that this scheme requires that each thread have its own, private copy of the potential password.
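The short-circuit behaviour described above can be observed directly by counting how many operands of an || chain actually get evaluated. This is a small sketch; counted and operands_evaluated are hypothetical names used only for this demonstration.

```c
#include <assert.h>

static int calls;   /* counts how many operands were evaluated */

/* Hypothetical stand-in for a password check: records the call
 * and returns the truth value it was given. */
int counted(int v)
{
    ++calls;
    return v;
}

/* Returns how many operands of (x || y || z) were evaluated. */
int operands_evaluated(int x, int y, int z)
{
    calls = 0;
    if (counted(x) || counted(y) || counted(z)) {
        /* nothing to do; only the evaluation count matters here */
    }
    assert(calls >= 1 && calls <= 3);
    return calls;
}
```

Calling operands_evaluated(0, 1, 1) evaluates two operands and skips the third, which is why the rotation in the code that follows happens only as often as needed.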
//Compile with, e.g.: gcc -O3 temp.c -std=c99 -fopenmp
#include <stdio.h>
#include <unistd.h>
#include <string.h>
int PassCheck(char *pass){
return strncmp(pass, "4qr3", 4)==0;
}
//Rotate string one character to the right
char* RotateString(char *str, int len){
char lastchr = str[len-1];
for(int i=len-1;i>0;i--)
str[i]=str[i-1];
str[0] = lastchr;
return str;
}
int main(){
const char alph[27] = "abcdefghijklmnopqrstuvwxyz";
const char num[11] = "0123456789";
char goodpass[4] = "----"; //Provide a default password to indicate an error state
#pragma omp parallel for collapse(4)
for(int i = 0; i < 26; i++)
for(int j = 0; j < 26; j++)
for(int m = 0; m < 10; m++)
for(int n = 0; n < 10; n++){
char pass[4] = {alph[i],alph[j],num[m],num[n]};
if(
PassCheck(pass) ||
PassCheck(RotateString(pass,4)) ||
PassCheck(RotateString(pass,4)) ||
PassCheck(RotateString(pass,4))
){
//It is good practice to use `critical` here in case two
//passwords are somehow both valid. This won't arise in
//your code, but is worth thinking about.
#pragma omp critical
{
memcpy(goodpass, pass, 4);
//#pragma omp cancel for //Escape for loops!
}
}
}
printf("Password was '%.4s'.\n",goodpass);
return 0;
}
I notice that you are generating your password using
pass[3] = num[i % 10];
pass[2] = num[i / 10 % 10];
pass[1] = alp[i / 100 % 26];
pass[0] = alp[i / 2600 % 26];
This sort of technique is occasionally useful, especially in scientific programming, but usually only for addressing convenience and memory locality.
For instance, an array of arrays where an element is accessed as a[y][x] can be written as a flat-array with elements accessed as a[y*width+x]. This gives a speed gain, but only because the memory is contiguous.
In your case, this indexing does not produce any speed gains, but does make it more difficult to reason about how your program works. I would avoid it for this reason.
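To make the indexing easier to reason about, it helps to read the loop counter as a mixed-radix number whose digits have bases 26, 26, 10 and 10. The sketch below wraps the question's four assignments in a function so the mapping can be checked; decode_index is a hypothetical name, not something from the original code.

```c
#include <assert.h>

/* Hypothetical helper: decode i in [0, 67600) into the %%## pattern.
 * The index is treated as a mixed-radix number with digit bases
 * 26, 26, 10, 10 (most significant first). out is not NUL-terminated. */
void decode_index(int i, char out[4])
{
    static const char alp[] = "abcdefghijklmnopqrstuvwxyz";
    static const char num[] = "0123456789";
    assert(i >= 0 && i < 26 * 26 * 10 * 10);
    out[3] = num[i % 10];           /* least significant digit, base 10 */
    out[2] = num[i / 10 % 10];      /* next digit, base 10 */
    out[1] = alp[i / 100 % 26];     /* next digit, base 26 */
    out[0] = alp[i / 2600 % 26];    /* most significant digit, base 26 */
}
```

Index 0 maps to aa00 and the last index, 67599, maps to zz99, so the single loop covers the same 67600 candidates as four nested loops.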
It's been said that "premature optimization is the root of all evil". This is especially true of micro-optimizations such as the one you're trying here. The biggest speed gains come from high-level algorithmic decisions, not from fiddly stuff. The -O3 compilation flag does most of everything you'll ever need done in terms of making your code fast at this level.
Micro-optimizations assume that doing something convoluted in your high-level code will somehow enable you to out-smart the compiler. This is not a good assumption since the compiler is often quite smart and will be even smarter tomorrow. Your time is very valuable: don't use it on this stuff unless you have a clear justification. (Further discussion of "premature optimization" is here.)

Creating a basic stack overflow using IDA

This program is running with root privileges on my machine and I need to perform a Stack overflow attack on the following code and get root privileges:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <openssl/sha.h>
void sha256(char *string, char outputBuffer[65])
{
unsigned char hash[SHA256_DIGEST_LENGTH];
int i = 0;
SHA256_CTX sha256;
SHA256_Init(&sha256);
SHA256_Update(&sha256, string, strlen(string));
SHA256_Final(hash, &sha256);
for(i = 0; i < SHA256_DIGEST_LENGTH; i++)
{
sprintf(outputBuffer + (i * 2), "%02x", hash[i]);
}
outputBuffer[64] = 0;
}
int password_check(char *userpass)
{
char text[20] = "thisisasalt";
unsigned int password_match = 0;
char output[65] = { 0, };
// >>> hashlib.sha256("Hello, world!").hexdigest()
char pass[] = "315f5bdb76d078c43b8ac0064e4a0164612b1fce77c869345bfc94c75894edd3";
text[0] = 'a';
text[1] = 't';
text[2] = 'j';
text[3] = '5';
text[4] = '3';
text[5] = 'k';
text[6] = '$';
text[7] = 'g';
text[8] = 'f';
text[9] = '[';
text[10] = ']';
text[11] = '\0';
strcat(text, userpass);
sha256(text, output);
if (strcmp(output, pass) == 0)
{
password_match = 1;
}
return (password_match == 1);
}
int main(int argc, char **argv)
{
if (argc < 3)
{
printf("Usage: %s <pass> <command>\n", argv[0]);
exit(1);
}
if (strlen((const char *) argv[1]) > 10)
{
printf("Error: pasword too long\n");
exit(1);
}
if (password_check(argv[1]))
{
printf("Running command as root: %s\n", argv[2]);
setuid(0);
setgid(0);
system(argv[2]);
}
else
{
printf("Authentication failed! This activity will be logged!\n");
}
return 0;
}
So I try to analyse the program with IDA and I see the text segment going from the lower addresses to the higher addresses, higher than that I see the data and then the bss and finally external commands.
Now as far as I know the stack should be just above that, but I'm not certain how to view it, how exactly am I supposed to view the stack in order to know what I'm writing on? (Do I even need it or am I completely clueless?)
The second question concerns the length of the input: how do I get around this check in the code:
if (strlen((const char *) argv[1]) > 10)
{
printf("Error: pasword too long\n");
exit(1);
}
Can I somehow give the string to the program by reference? If so how do I do it? (Again, hoping I'm not completely clueless)
Now as far as I know the stack should be just above that, but I'm not certain how to view it, how exactly am I supposed to view the stack in order to know what I'm writing on? (Do I even need it or am I completely clueless?)
The stack location varies all the time - you need to look at the value of the ESP/RSP register, its value is the current address of the top of the stack. Typically, variable addressing will be based on EBP rather than ESP, but they both will point to the same general area of memory.
During analysis, IDA sets up a stack frame for each function, which acts much like a struct - you can define variables with types and names in it. This frame is summarized at the top of the function:
Double-clicking it or any local variable in the function body will open a more detailed window. That's as good as you can get without actually running your program in a debugger.
You can see that text is right next to password_match, and judging from the addresses, there are 0x14 bytes allocated for text, as one would expect. However, this is not guaranteed and the compiler can freely shuffle the variables around, pad them or optimize them into registers.
The second question concerns the length of the input: how do I get around this check in the code:
if (strlen((const char *) argv[1]) > 10)
{
printf("Error: pasword too long\n");
exit(1);
}
You don't need to get around this check, it's already broken enough. There's an off-by-one error.
Stop reading here if you want to figure out the overflow yourself.
The valid range of indices for text spans from text[0] through text[19]. In the code, user input is written to the memory area starting at text[11]. The maximum input length allowed by the strlen check is 10 symbols + the NUL terminator. Unfortunately, that means text[19] contains the 9th user-entered symbol, and the 10th symbol + the terminator overflow into adjacent memory space. Under certain circumstances, that allows you to overwrite the least significant byte of password_match with an arbitrary value, and the second least significant byte with a 0. Your function accepts the password if password_match equals 1, which means the 10th character in your password needs to be '\x01' (note that this is not the same character as '1').
Here are two screenshots from IDA running as a debugger. text is highlighted in yellow, password_match is in green.
The password I entered was 123456789\x01.
Stack before user entered password is strcat'd into text.
Stack after strcat. Notice that password_match changed.
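The arithmetic behind the off-by-one can be checked with a few lines of C. This sketch only computes sizes; the helper names concat_fits and max_input_len are hypothetical, and the 11-character prefix is the one the posted password_check writes into text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: does prefix + input + NUL fit in cap bytes? */
int concat_fits(const char *prefix, const char *input, size_t cap)
{
    assert(prefix != NULL && input != NULL);
    return strlen(prefix) + strlen(input) + 1 <= cap;
}

/* Hypothetical helper: longest input that still fits after prefix,
 * leaving one byte for the NUL terminator. */
size_t max_input_len(const char *prefix, size_t cap)
{
    assert(prefix != NULL && strlen(prefix) + 1 <= cap);
    return cap - strlen(prefix) - 1;
}
```

For the 20-byte text buffer and the 11-character prefix, max_input_len gives 8, while the strlen check in main accepts up to 10 characters. A check against this computed limit, or a bounded concatenation, would close the gap.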

Speed up C program without using conditional compilation

We are working on a model checking tool which executes certain search routines several billion times. We have different search routines which are currently selected using preprocessor directives. This is not only very inconvenient, as we need to recompile every time we make a different choice, but also makes the code hard to read. It's now time to start a new version and we are evaluating whether we can avoid conditional compilation.
Here is a very artificial example that shows the effect:
/* program_define */
#include <stdio.h>
#include <stdlib.h>
#define skip 10
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%ld\n", result);
return 0;
}
Here, the variable skip is an example for a value that influences the behavior of the program. Unfortunately, we need to recompile every time we want a new value of skip.
Let's look at another version of the program:
/* program_variable */
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%ld\n", result);
return 0;
}
Here, the value for skip is passed as a command line parameter. This adds great flexibility. However, this program is much slower:
$ time ./program_define 1000 10
50004989999950500
real 0m25.973s
user 0m25.937s
sys 0m0.019s
vs.
$ time ./program_variable 1000 10
50004989999950500
real 0m50.829s
user 0m50.738s
sys 0m0.042s
What we are looking for is an efficient way to pass values into a program (by means of a command line parameter or a file input) that will never change afterward. Is there a way to optimize the code (or tell the compiler to) such that it runs more efficiently?
Any help is greatly appreciated!
Comments:
As Dirk wrote in his comment, it is not about the concrete example. What I meant was a way to replace an if that evaluates a variable that is set once and then never changed (say, a command line option) inside a function that is called literally billions of times by a more efficient construct. We currently use the preprocessor to tailor the desired version of the function. It would be nice if there is a nicer way that does not require recompilation.
You can take a look at libdivide, an open source library for optimizing integer division, which does fast division when the divisor isn't known until runtime.
If you calculate a % b using a - b * (a / b) (but with libdivide) you might find that it's faster.
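The identity behind that suggestion can be written out in plain C. The sketch below uses the ordinary / operator where a fast-division library would supply its own division routine; no libdivide calls are shown, and mod_via_div is a hypothetical name.

```c
#include <assert.h>

/* Computes a % b as a - b * (a / b). Since C99, integer division
 * truncates toward zero, so the identity also holds for negative
 * operands. Assumes b != 0 and that a / b does not overflow. */
int mod_via_div(int a, int b)
{
    assert(b != 0);
    return a - b * (a / b);
}
```

The point of the rewrite is that the only expensive step left is the division itself, which is exactly the part such a library replaces.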
I ran your program_variable code on my system to get a baseline of performance:
$ gcc -Wall test1.c
$ time ./a.out 1000 10
50004989999950500
real 0m55.531s
user 0m55.484s
sys 0m0.033s
If I compile test1.c with -O3, then I get:
$ time ./a.out 1000 10
50004989999950500
real 0m54.305s
user 0m54.246s
sys 0m0.030s
In a third test, I manually set the values of limit and skip:
int limit = 1000, skip = 10;
I then re-run the test:
$ gcc -Wall test2.c
$ time ./a.out
50004989999950500
real 0m54.312s
user 0m54.282s
sys 0m0.019s
Taking out the atoi() calls doesn't make much of a difference. But if I compile with -O3 optimizations turned on, then I get a speed bump:
$ gcc -Wall -O3 test2.c
$ time ./a.out
50004989999950500
real 0m26.756s
user 0m26.724s
sys 0m0.020s
Adding a #define macro for an ersatz atoi() function helped a little, but didn't do much:
#define QSaToi(iLen, zString, iOut) {int j = 1; iOut = 0; \
for (int i = iLen - 1; i >= 0; --i) \
{ iOut += ((zString[i] - 48) * j); \
j = j*10;}}
...
int limit, skip;
QSaToi(4, argv[1], limit);
QSaToi(2, argv[2], skip);
And testing:
$ gcc -Wall -O3 -std=gnu99 test3.c
$ time ./a.out 1000 10
50004989999950500
real 0m53.514s
user 0m53.473s
sys 0m0.025s
The expensive part seems to be those atoi() calls, if that's the only difference between -O3 compilation.
Perhaps you could write one binary, which loops through tests of various values of limit and skip, something like:
#define NUM_LIMITS 3
#define NUM_SKIPS 2
...
int limits[NUM_LIMITS] = {100, 1000, 1000};
int skips[NUM_SKIPS] = {1, 10};
int limit, skip;
...
for (int limitIdx = 0; limitIdx < NUM_LIMITS; limitIdx++)
for (int skipIdx = 0; skipIdx < NUM_SKIPS; skipIdx++)
/* per-limit, per-skip test */
If you know your parameters ahead of compilation time, perhaps you can do it this way. You could use fprintf() to write your output to a per-limit, per-skip file output, if you want results in separate files.
You could try using the GCC likely/unlikely builtins (e.g. here) or profile guided optimization (e.g. here). Also, do you intend (i + j) % 10 or i + (j % 10)? The % operator has higher precedence, so your code as written is testing the latter.
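The precedence point is easy to check with two tiny functions, one for the expression as posted and one with explicit parentheses. The names as_written and parenthesised are made up for this sketch.

```c
#include <assert.h>

/* What the posted condition computes: % binds tighter than +,
 * so this is i + (j % skip). */
int as_written(int i, int j, int skip)
{
    assert(skip != 0);
    return i + j % skip;
}

/* What may have been intended instead: (i + j) % skip. */
int parenthesised(int i, int j, int skip)
{
    assert(skip != 0);
    return (i + j) % skip;
}
```

With i = 3, j = 7 and skip = 10 the first form yields 10 and the second yields 0, so the two conditions skip different iterations.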
I'm a bit familiar with the program Niels is asking about.
There are a bunch of interesting answers around (thanks), but the answers slightly miss the spirit of the question. The given example programs are really just example programs. The logic that is subject to pre-processor statements is much, much more involved. In the end, it is not just about executing a modulo operation or a simple division. It is about keeping or skipping certain procedure calls, executing an operation between two other operations, defining the size of an array, etc.
All these things could be guarded by variables that are set by command-line parameters. But that would be too costly as many of these routines, statements, memory allocations are executed a billion times. Perhaps that shapes the problem a bit better. Still very interested in your ideas.
Dirk
If you used C++ instead of C, you could use templates so that things can be calculated at compile time; even recursion is possible.
Please have a look at C++ template meta programming.
A stupid answer, but you could pass the define on the gcc command line and run the whole thing with a shell script that recompiles and runs the program based on a command-line parameter
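#!/bin/sh
skip=$1
out=program_skip$skip
# Build the specialised binary only if it does not exist yet.
if [ ! -x "$out" ]; then
    gcc -O3 -Dskip="$skip" -o "$out" test.c
fi
# Run from the current directory; a bare name would be looked up in PATH.
time "./$out" 1000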
I also got about a 2× slowdown between program_define and program_variable, 26.2s vs. 49.0s. I then tried
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j, r;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
for (i = 0; i < 10000000; ++i) {
for (j = 0, r = 0; j < limit; ++j, ++r) {
if (r == skip) r = 0;
if (i + r == 0) {
continue;
}
result += i + j;
}
}
printf("%ld\n", result);
return 0;
}
using an extra variable to avoid the costly division, and the resulting time was 18.9s, so significantly better than the modulo with a statically known constant. However, this auxiliary-variable technique is only promising if the change is easily predictable.
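The auxiliary variable works because r is reset to 0 exactly when it reaches skip, so it always equals j % skip without a division. A minimal sketch that checks this equivalence, using the hypothetical name count_mismatches:

```c
#include <assert.h>

/* Counts the iterations where the running counter r differs from
 * j % skip. For any skip > 0 the expected result is 0. */
int count_mismatches(int limit, int skip)
{
    assert(skip > 0);
    int mismatches = 0;
    for (int j = 0, r = 0; j < limit; ++j, ++r) {
        if (r == skip)
            r = 0;              /* wrap instead of dividing */
        if (r != j % skip)
            ++mismatches;
    }
    return mismatches;
}
```

The comparison against j % skip is there only for verification; in the real loop that division is the cost being avoided.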
Another possibility would be to eliminate using the modulus operator:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
int current = 0;
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (++current == skip) {
current = 0;
continue;
}
result += i + j;
}
}
printf("%ld\n", result);
return 0;
}
If that is the actual code, you have a few ways to optimize it:
(i + j % 10==0) is only true when i==0, so you can skip that entire mod operation when i>0. Also, since i + j only increases by 1 on each loop, you can hoist the mod out and simply have a variable you increment and reset when it hits skip (as has been pointed out in other answers).
You can also have all possible function implementations already in the program, and at runtime change a function pointer to select the function you are actually using.
You can use macros to avoid having to write duplicate code:
#define MYFUNCMACRO(name, myvar) void name##doit(){/* time consuming code using myvar */}
MYFUNCMACRO(TEN,10)
MYFUNCMACRO(TWENTY,20)
MYFUNCMACRO(FOURTY,40)
MYFUNCMACRO(FIFTY,50)
If you need too many of these macros (hundreds?), you can write a code generator which writes the source file automatically for a range of values.
I didn't compile nor test the code, but maybe you see the principle.
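A compilable sketch of that principle might look as follows. It assumes a simplified loop body (skip every j divisible by SKIP) in place of the real search routine. All names here (DEFINE_KERNEL, kernel_fn, select_kernel, runtime_skip) are hypothetical. Each macro expansion bakes one constant into its own function, and a generic fallback covers values that were not pre-generated.

```c
#include <assert.h>

typedef long (*kernel_fn)(int limit);

#define DEFINE_KERNEL(name, SKIP)                     \
    static long name(int limit)                       \
    {                                                 \
        long result = 0;                              \
        for (int j = 0; j < limit; ++j) {             \
            if (j % (SKIP) == 0)                      \
                continue;                             \
            result += j;                              \
        }                                             \
        return result;                                \
    }

static int runtime_skip = 1;    /* read only by the generic fallback */

DEFINE_KERNEL(kernel_skip2, 2)                /* constant divisor */
DEFINE_KERNEL(kernel_skip10, 10)              /* constant divisor */
DEFINE_KERNEL(kernel_generic, runtime_skip)   /* divisor known at runtime */

/* Pick the kernel once, outside the hot loop. */
kernel_fn select_kernel(int skip)
{
    assert(skip > 0);
    switch (skip) {
    case 2:  return kernel_skip2;
    case 10: return kernel_skip10;
    default:
        runtime_skip = skip;
        return kernel_generic;
    }
}
```

A caller would do something like kernel_fn k = select_kernel(atoi(argv[2])); once at startup and then call k(limit) in the hot path, paying one indirect call instead of a runtime check inside the loop.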
You might be compiling without optimisation, which will lead your program to load skip each time it's checked, instead of the literal of 10. Try adding -O2 to your compiler's command line, and/or use
register int skip;

Resources