Why would < be slower than <=? [C]

Naturally, I've assumed the < and <= operators run at the same speed (per Jonathon Reinhart's logic, here). Recently, I decided to test that assumption, and I was a little surprised by my results.
I know that, for most modern hardware, this question is purely academic, so I had to write test programs that looped about 1 billion times (to get any minuscule difference to add up to measurable levels). The programs were as basic as possible (to cut out all possible sources of interference).
lt.c:
int main() {
    for (int i = 0; i < 1000000001; i++);
    return 0;
}
le.c:
int main() {
    for (int i = 0; i <= 1000000000; i++);
    return 0;
}
They were compiled and run on a Linux VirtualBox 3.19.0-18-generic #18-Ubuntu x86_64 installation, using GCC with the -std=c11 flag set.
The average time for lt.c's binary was:
real 0m2.404s
user 0m2.389s
sys 0m0.000s
The average time for le.c was:
real 0m2.397s
user 0m2.384s
sys 0m0.000s
The difference is small, but I couldn't get it to go away or reverse no matter how many times I ran the binaries.
I made the comparison value in the for-loop of lt.c one larger than le.c (so they'd both loop the same number of times). Was this somehow a mistake?
According to the answer in Is < faster than <=?, < compiles to jge and <= compiles to jg. That answer dealt with if statements rather than a for-loop, but could this still be the reason? Could the execution of jge take slightly longer than jg? (I think this would be ironic, since it would mean that moving from C to ASM inverts which comparison is the more complicated one, with lt in C translating to gte in ASM and lte to gt.)
Or, is this just so hardware specific that different x86 lines or individual chips may consistently show the reverse trend, the same trend, or no difference?

There were a few requests in the comments to my question to include the assembly being generated for me by GCC. After getting the compiler to emit the assembly version of each file, I checked it.
Result:
It turns out the default optimization setting turned both for-loops into the same assembly. Both files were identical in assembly-form, actually. (diff confirmed this.)
Possible reason for the previously observed time difference:
It seems the order in which I ran the binaries was the cause for the run time difference.
On a given runthrough, the programs generally were executed quicker with each successive execution, before plateauing after about 3 executions.
I alternated back and forth between time ./lt and time ./le, so the one run first would have a bias towards extra time in its average.
I usually ran lt first.
I did several separate runthroughs (increasing the averaged bias).
Code excerpt:
movl    $0, -4(%rbp)
jmp     .L2
.L3:
addl    $1, -4(%rbp)
.L2:
cmpl    $1000000000, -4(%rbp)
jle     .L3
movl    $0, %eax
popq    %rbp
... * covers face * ...carry on....
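One way to cancel that ordering bias is to time both loop forms back-to-back inside a single process, alternating between them and averaging. The following is only a minimal sketch of the idea (it is not the test I originally ran); the volatile keeps the otherwise-empty loops from being deleted and, like the default -O0 build, keeps the counter in memory:
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Time the "<" form: the volatile counter stops the compiler from
 * deleting the empty loop. */
static double time_lt(void) {
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (volatile int i = 0; i < 1000000001; i++);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Time the "<=" form, looping the same number of times. */
static double time_le(void) {
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (volatile int i = 0; i <= 1000000000; i++);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    double lt = 0.0, le = 0.0;
    for (int run = 0; run < 10; run++) {   /* alternate so neither form always runs first/cold */
        lt += time_lt();
        le += time_le();
    }
    printf("lt avg: %.3f s\nle avg: %.3f s\n", lt / 10, le / 10);
    return 0;
}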

Let's speak in assembly. (It depends on the architecture, of course.)
When comparing, you'll use a cmp or test instruction, and then
- when you use <, the equivalent jump instruction is jl, which checks whether SF and OF differ (the sign and overflow flags)
- when you use <=, the equivalent jump instruction is jle, which jumps if ZF == 1 (the zero flag) or SF != OF
and so on; more here.
But honestly it's not even the whole cycle, so... I think the difference is unmeasurable under normal circumstances.

Related

Benchmarking C struct comparison: XOR vs ==

Say we have a simple struct in C that has 4 fields:
typedef struct {
    int a;
    int b;
    int c;
    int d;
} value_st;
Let's take a look at these two short versions of C struct equal check.
The first one is straight-forward and does the following:
int compare1(const value_st *x1, const value_st *x2) {
    return ( (x1->a == x2->a) && (x1->b == x2->b) &&
             (x1->c == x2->c) && (x1->d == x2->d) );
}
The second one uses XOR:
int compare2(const value_st *x1, const value_st *x2) {
    return ( (x1->a ^ x2->a) | (x1->b ^ x2->b) |
             (x1->c ^ x2->c) | (x1->d ^ x2->d) );
}
The first version returns nonzero if the two structs are equal,
and the second version returns zero iff the two structs are equal.
Compiler Output
Compiling with GCC -O2 and examining the assembly gives what we expect.
The first version is four CMP instructions plus jumps:
xor %eax,%eax
mov (%rsi),%edx
cmp %edx,(%rdi)
je 0x9c0 <compare1+16>
repz retq
nopw 0x0(%rax,%rax,1)
mov 0x4(%rsi),%ecx
cmp %ecx,0x4(%rdi)
jne 0x9b8 <compare1+8>
mov 0x8(%rsi),%ecx
cmp %ecx,0x8(%rdi)
jne 0x9b8 <compare1+8>
mov 0xc(%rsi),%eax
cmp %eax,0xc(%rdi)
sete %al
movzbl %al,%eax
retq
The second version looks like this:
mov (%rdi),%eax
mov 0x4(%rdi),%edx
xor (%rsi),%eax
xor 0x4(%rsi),%edx
or %edx,%eax
mov 0x8(%rdi),%edx
xor 0x8(%rsi),%edx
or %edx,%eax
mov 0xc(%rdi),%edx
xor 0xc(%rsi),%edx
or %edx,%eax
retq
So the second version has:
- no branches
- fewer instructions
Benchmarking
static uint64_t
now_msec() {
    struct timespec spec;
    clock_gettime(CLOCK_MONOTONIC, &spec);
    return ((uint64_t)spec.tv_sec * 1000) + (spec.tv_nsec / 1000000);
}

void benchmark() {
    uint64_t start = now_msec();
    uint64_t sum = 0;
    for (uint64_t i = 0; i < 1e10; i++) {
        if (compare1(&x1, &x2)) {
            sum++;
        }
    }
    uint64_t delta_ms = now_msec() - start;
    // use sum and delta here
}
Enough iterations to filter out the time it takes to call clock_gettime()
But here is the thing I don't get...
When I benchmark equal structs where all the instructions need to be executed,
the first version is faster...
time took for compare == is 3114 [ms] [matches: 10000000000]
time took for compare XOR is 3177 [ms] [matches: 10000000000]
How is this possible?
Even with branch prediction, XOR is a super fast instruction and shouldn't lose to CMP/JMP.
Update
Couple of important notes:
This question is mainly about understanding the outcome, not about trying to beat the compiler or create obscure code - it is always better to write clean code and let the compiler optimize.
We assume the structs are in the cache; otherwise the dominating factor will obviously be the memory lookup.
Branch prediction will obviously play a part... but can it really beat branchless code (given that most of the time we execute all the code)?
memcmp will require zero padding in the struct and might also need a loop / if in most standard implementations, since it supports variable-size comparison (a sketch of this variant follows below).
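For reference, the memcmp variant would look roughly like this (illustrative only, not part of the benchmark; it is safe here because value_st is four ints with no padding bytes):
#include <string.h>

/* Hypothetical third variant: compare the raw bytes of the structs.
 * Works for value_st because it has no padding; with padding, the
 * padding bytes would have to be zeroed first. */
int compare3(const value_st *x1, const value_st *x2) {
    return memcmp(x1, x2, sizeof(value_st)) == 0;   /* nonzero iff equal */
}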
Update 2
Many have stated that the difference is tiny per call... this is true, but it is consistent, which means the difference favors the first version across many consecutive runs.
Update 3
I've copied my test code to a lab server with an Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz.
The XOR version runs almost two times faster on the server for GCC 8.
Tried with both clang and GCC 8:
For GCC 8:
time took for compare == is 7432 [ms] [matches: 3000000000]
time took for compare XOR is 4214 [ms] [matches: 3000000000]
for Clang:
time took for compare == is 4265 [ms] [matches: 3000000000]
time took for compare XOR is 5508 [ms] [matches: 3000000000]
So it seems like this is very compiler and CPU dependent.
Well, in the first case there are 4 movs and 4 cmps. In the second case there are 4 movs, 4 xors and 4 ors. Since jumps that are not taken effectively take no time, the first version is faster. (cmp and xor do basically the same thing and should execute in the same amount of time.)
The moral of the story here is that you should never try to outsmart your compiler; it really knows better (at least in 99.99% of cases).
And never obscure the intent of your program in an effort to make it faster, unless you have hard evidence it is (1) needed and (2) effective.
time took for compare == is 3114 [ms] [matches: 10000000000]
time took for compare XOR is 3177 [ms] [matches: 10000000000]
How is this possible ?
Because actual execution time is affected by many factors out of your control, which is why you should never rely on a single run of a benchmarking program to make any decisions. Run it many times, under different load conditions, and average the results.
Secondly, this run shows a difference of 63 milliseconds out of a little over 3 seconds, or 2%, for ten billion comparisons between the two methods. As far as a person sitting in front of the screen is concerned, that's barely noticeable. If your results consistently showed a difference of a full second or more, that would be worth investigating, but this is down in the noise.
And finally, what is going to be the more common operation in the real code - comparing identical structs or non-identical structs? If the second case is going to be more common, even if just by a bare majority of 51%, then the == method will be significantly faster on average due to short-circuiting.
When optimizing code, look at the big picture - don't hyperfocus on a single operation. You'll wind up writing code that's hard to read, harder to maintain, and probably not as optimized as you think it is.

Does multiplying a 1-100 int by -1 or setting said int to zero take more time?

This is for C, if the language matters. If it goes down to assembly language, negation is done using two's complement. With the assignment, you're storing the value 0 in the int variable, and I'm not entirely sure what happens underneath.
I got 1.90s user 0.01s system 99% cpu 1.928 total for the code below, and I'm guessing most of the runtime was spent incrementing the counter variables.
int i;
int n;
i = 0;
while (i < 999999999)
{
    n = 0;
    i++;
    n++;
}
I got 4.56s user 0.02s system 99% cpu 4.613 total for the code below.
int i;
int n;
i = 0;
n = 5;
while (i < 999999999)
{
    n *= -1;
    i++;
    n++;
}
return (0);
I don't understand much about assembly, but it doesn't seem intuitive that the two's complement operation takes more time than setting one value to another. What's the underlying implementation that makes one faster than the other, and what's happening beneath the surface? Or is my test simply a bad one that doesn't accurately portray how quick it will actually be in practice?
If it seems pointless, the reason for it is that I can easily implement a "checklist" by simply multiplying an integer on a map by -1, meaning it's already been checked (but I need to keep the value, so when I do the check, I can just negate whatever I'm comparing it to). But I was wondering if that's too slow; I could make a separate boolean 2D array to track whether a value was checked, or change my data structure into an array of structures so it could hold an int 1/0. I'm wondering what the best implementation will be - doing the -1 operation itself a billion times will already total around 5 seconds, not counting the rest of my program. But making a separate billion-cell int array or creating a billion-cell array of structs doesn't seem to be the best way either.
Assigning zero is very cheap.
But your microbenchmark tells you very little about what you should do for your large array. Memory bandwidth / cache-miss / cache footprint considerations will dominate there, and your microbench doesn't test that at all.
Using one bit of your integer values to represent checked / not-checked seems reasonable compared to having a separate bitmap. (Having a separate array of 0/1 32-bit integers would be totally silly, but a bitmap is worth considering, especially if you want to search quickly for the next unchecked or the next checked entry. It's not clear what you're doing with this, so I'll mostly just stick to explaining the observed performance in your microbenchmark.)
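If the bitmap route is taken, the idea is one bit per map cell. Here is a minimal sketch (the names and layout are purely illustrative, not part of the original question):
#include <stdint.h>
#include <stdlib.h>

/* One bit per entry: entry i lives in word i/64, bit i%64. */
typedef struct {
    uint64_t *words;
    size_t nbits;
} checked_bitmap;

checked_bitmap bitmap_create(size_t nbits) {
    checked_bitmap b;
    b.words = calloc((nbits + 63) / 64, sizeof(uint64_t));  /* all bits start as "unchecked" */
    b.nbits = nbits;
    return b;
}

void bitmap_set(checked_bitmap *b, size_t i) {
    b->words[i / 64] |= (uint64_t)1 << (i % 64);             /* mark entry i as checked */
}

int bitmap_test(const checked_bitmap *b, size_t i) {
    return (b->words[i / 64] >> (i % 64)) & 1;               /* 1 if entry i was checked */
}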
And BTW, questions like this are a perfect example of why SO comments like "why don't you benchmark it yourself" are misguided: because you have to understand what you're testing in quite a lot of detail to write a useful microbenchmark.
You obviously compiled this in debug mode, e.g. gcc with the default -O0, which spills everything to memory after every C statement (so your program still works even if you modify variables with a debugger). Otherwise the loops would optimize away, because you didn't use volatile or an asm statement to limit optimization, and your loops are trivial to optimize.
Benchmarking with -O0 does not reflect reality (of compiling normally), and is a total waste of time (unless you're actually worried about the performance of debug builds of something like a game).
That said, your results are easy to explain, because -O0 compiles each C statement separately and predictably:
n = 0; is write-only, and breaks the dependency on the old value.
n *= -1; compiles the same as n = -n; with gcc (even with -O0). It has to read the old value from memory before writing the new value.
The store/reload between a write and a read of a C variable across statements costs about 5 cycles of store-forwarding latency on Intel Haswell for example (see http://agner.org/optimize and other links on the x86 tag wiki). (You didn't say what CPU microarchitecture you tested on, but I'm assuming some kind of x86 because that's usually "the default"). But dependency analysis still works the same way in this case.
So the n*=-1 version has a loop-carried dependency chain involving n, with an n++ and a negate.
The n=0 version breaks that dependency every iteration by doing a store without reading the old value. The loop only bottlenecks on the 6-cycle loop-carried dependency of the i++ loop counter. The latency of the n=0; n++ chain doesn't matter, because each loop iteration starts a fresh chain, so multiple can be in flight at once. (Store forwarding provides a sort of memory renaming, like register renaming but for a memory location).
This is all unrealistic nonsense: With optimization enabled, the cost of a unary - totally depends on the surrounding code. You can't just add up the costs of separate operations to get a total, that's not how pipelined out-of-order CPUs work, and compiler optimization itself also makes that model bogus.
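If you did want to time the negate-versus-store difference with optimization enabled, one crude option is to make n volatile so the compiler has to keep the loads and stores. This is a minimal sketch of that idea, not the original poster's code:
#include <stdio.h>

int main(void) {
    volatile int n = 5;                  /* volatile: the loads/stores must stay, even at -O2 */
    for (int i = 0; i < 999999999; i++) {
        n *= -1;                         /* swap this line for n = 0; to compare the two cases */
        n++;
    }
    printf("%d\n", n);                   /* use the result so the loop isn't discarded */
    return 0;
}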
About the code itself
I compiled your pieces of code into x86_64 assembly outputs using GCC 7.2 without any optimization. I also shortened each piece of code without changing the assembly output. Here are the results.
Code 1:
// C
int main() {
    int n;
    for (int i = 0; i < 999999999; i++) {
        n = 0;
        n++;
    }
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-8], 1
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 999999998
jle .L3
mov eax, 0
pop rbp
ret
Code 2:
// C
int main() {
    int n = 5;
    for (int i = 0; i < 999999999; i++) {
        n *= -1;
        n++;
    }
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 5
mov DWORD PTR [rbp-8], 0
jmp .L2
.L3:
neg DWORD PTR [rbp-4]
add DWORD PTR [rbp-4], 1
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 999999998
jle .L3
mov eax, 0
pop rbp
ret
The C instructions inside the loop are, in the assembly, located between the two labels (.L3: and .L2:). In both cases, that's three instructions, among which only the first one is different. In the first code, it is a mov, corresponding to n = 0;. In the second code however, it is a neg, corresponding to n *= -1;.
According to this manual, these two instructions have different execution speed depending on the CPU. One can be faster than the other on one chip while being slower on another.
Thanks to aschepler in the comments for the input.
This means, all the other instructions being identical, that you cannot tell which code will be faster in general. Therefore, trying to compare their performance is pointless.
About your intent
Your reason for asking about the performance of these short pieces of code is faulty. What you want is to implement a checklist structure, and you have two conflicting ideas on how to build it. One uses a special value, -1, to add special meaning onto variables in a map. The other uses additional data, either an external boolean array or a boolean for each variable, to add the same meaning without changing the purpose of the existing variables.
The choice you have to make should be a design decision rather than be motivated by unclear performance issues. Personally, whenever I am facing this kind of choice between a special value or additional data with precise meaning, I tend to prefer the latter option. That's mainly because I don't like dealing with special values, but it's only my opinion.
My advice would be to go for the solution you can maintain better, namely the one you are most comfortable with and won't harm future code, and ask about performance when it matters, or rather if it even matters.

number of clock cycles in if statement in c program?

Sorry, I was not specific with the problem. I am trying to use the intrinsic bit-parallelism of a system. A small part of the code is as follows:
int d;
char ch1;
char ch2;
cin >> ch1 >> ch2;
if ((d & 1) == 0) {
    //heavy computation
}
if (ch1 == ch2) {
    //heavy computation
}
The first if condition executes if the LSB of d is clear ((d & 1) == 0 is true when the low bit is 0).
How many clock cycles do the two if conditions require to execute?
Include the clock cycles required to convert the variable values into binary form.
On an i386 architecture and with gcc, the assembly code produced for the above conditions would be:
for condition 1:
subl $16, %esp
movb $97, -2(%ebp)
movb $98, -1(%ebp)
movl -12(%ebp), %eax
andl $1, %eax
testl %eax, %eax
jne .L2
for condition 2:
movzbl -2(%ebp), %eax
cmpb -1(%ebp), %al
jne .L4
So for simplicity, treating the i386 as a MIPS-like RISC core and applying a typical per-instruction cycle table (not reproduced here), the number of clock cycles for the above statements would be 18.
Actually, when you compile with "gcc -S file.c", the assembly for the two conditions is not produced, because the compiler may optimize away the empty conditions (ineffective conditions, i.e. dead code). So include some useful statements inside the conditions and compile the code; then you will get the instructions stated above.
With any good compiler, the if statements shown in this question would not consume any processor cycles in an executing program. This is because the compiler would recognize that neither of the if statements does anything, regardless of whether the condition is true or false, so they would be removed during optimization.
In general, optimization can dramatically transform a program. Even if the if statements had statements in their then-clauses, the compiler could determine at compile-time that ch1 does not equal ch2, so there is no need to perform the comparison during program execution.
Beyond that, if a condition is tested during program execution, there is often not a clear correlation between evaluating the test and how many processor cycles it takes. Modern processors are quite complicated, and a test and branch might be executed speculatively in advance while other instructions are also executing, so that the if statement does not cost the program any time at all. On the other hand, executing a branch might cause the processor to discard many instructions it had been preparing to execute and to reload new instructions from the new branch destination, thus costing the program many cycles.
In fact, both of these effects might occur for the same if statement in the same program. When the if statement is used in a loop with many executions, the processor may cache information about the branch decision and use that to speed up execution. At another time, when the if statement happens to be executed just once (because the loop conditions are different), the cached information may mislead the processor and cost cycles.
Probably you can compile your complete code and disassemble it using GDB. Once it is disassembled, find out the number and type of instructions your statements took (e.g. load: 5 cycles, store: 4 cycles, branch: 3 cycles, jump: 3 cycles). The sum of those cycles is the number of clock cycles consumed. However, this depends on what processor you are on.
By looking at your question, I think you need to count the number of instructions executed for your statement and then calculate the cycles for every instruction in your if/else.
Code:
if (x == 0)
{
    x = 1;
}
x++;
This will consume following number of instructions
mov eax, $x
cmp eax, 0
jne end
mov eax, 1
end:
inc eax
mov $x, eax
so the first if statement (the cmp and jne) will consume 2 CPU cycles.
Turning to your particular code:
cin>>ch1>>ch2;
if((d&1) == 0) {
//heavy computation
}
if(ch1 == ch2){
//heavy computation
}
you need to get the instructions required for those two if operations, from which you can calculate the cycles.
Also, you need to add something inside the body of the if statements (if () { body }); otherwise modern compilers are intelligent enough to remove your code as dead code.
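As a minimal sketch of that advice (adapted here: cin is replaced with scanf for a pure-C version, and a counter gives each branch an observable effect):
#include <stdio.h>

int main(void) {
    int d = 0, hits = 0;
    char ch1, ch2;
    if (scanf(" %c %c", &ch1, &ch2) != 2)   /* read the two characters, like cin >> ch1 >> ch2 */
        return 1;
    if ((d & 1) == 0)
        hits++;                             /* body now has a visible side effect */
    if (ch1 == ch2)
        hits++;
    printf("%d\n", hits);                   /* the result is used, so the ifs survive optimization */
    return 0;
}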
It depends on your "IF".
Take the simplest case, where you want to compare two bytes: you probably only need 2 clock cycles for the instruction itself, i.e. 1111 0001, which means (1st) activating ALU-CMP and moving the data from R0 to TMP; (2nd) putting R1 onto the bus and setting the output to ACC.
Otherwise, you will need at least another 3 clocks for fetching, 1 clock for checking the I/O interrupt, and 1 final clock to reset the instruction register.
Therefore, at the circuit level, you only need 7 clock cycles to execute an "IF" on 2 bytes. However, you would never write an "IF" just to compare two numbers (represented by two bytes), would you? 😅

Profile C Execution

So, just for fun, and out of curiosity, I wanted to see what executes faster when doing an even-odd check, modulus or bitwise comparisons.
So, I whipped up the following, but I'm not sure that it's behaving correctly, as the difference is so small. I read somewhere online that bitwise should be an order of magnitude faster than modulus checking.
Is it possible that it's getting optimized away? I've just started tinkering with assembly, otherwise I'd attempt to dissect the executable a bit.
EDIT 3: Here is a working test, thanks in large part to @phonetagger:
#include <stdio.h>
#include <time.h>
#include <stdint.h>

// to reset the global
static const int SEED = 0x2A;

// 5B iterations, each
static const int64_t LOOPS = 5000000000;

int64_t globalVar;

// gotta call something
int64_t doSomething( int64_t input )
{
    return 1 + input;
}

int main(int argc, char *argv[])
{
    globalVar = SEED;

    // mod
    clock_t startMod = clock();
    for( int64_t i=0; i<LOOPS; ++i )
    {
        if( ( i % globalVar ) == 0 )
        {
            globalVar = doSomething(globalVar);
        }
    }
    clock_t endMod = clock();
    double modTime = (double)(endMod - startMod) / CLOCKS_PER_SEC;

    globalVar = SEED;

    // bit
    clock_t startBit = clock();
    for( int64_t j=0; j<LOOPS; ++j )
    {
        if( ( j & globalVar ) == 0 )
        {
            globalVar = doSomething(globalVar);
        }
    }
    clock_t endBit = clock();
    double bitTime = (double)(endBit - startBit) / CLOCKS_PER_SEC;

    printf("Mod: %lf\n", modTime);
    printf("Bit: %lf\n", bitTime);
    printf("Dif: %lf\n", ( modTime > bitTime ? modTime-bitTime : bitTime-modTime ));
}
5 billion iterations of each loop, with a global variable preventing compiler optimization, yields the following:
Mod: 93.099101
Bit: 16.701401
Dif: 76.397700
gcc foo.c -std=c99 -S -O0 (note, I specifically did -O0) for x86 gave me the same assembly for both loops. Operator strength reduction meant that both ifs used an andl to get the job done (which is faster than a modulo on Intel machines):
First Loop:
.L6:
movl 72(%esp), %eax
andl $1, %eax
testl %eax, %eax
jne .L5
call doNothing
.L5:
addl $1, 72(%esp)
.L4:
movl LOOPS, %eax
cmpl %eax, 72(%esp)
jl .L6
Second Loop:
.L9:
movl 76(%esp), %eax
andl $1, %eax
testl %eax, %eax
jne .L8
call doNothing
.L8:
addl $1, 76(%esp)
.L7:
movl LOOPS, %eax
cmpl %eax, 76(%esp)
jl .L9
The minuscule difference you see is probably because of the resolution/inaccuracy of clock.
Most compilers will compile both of the following to EXACTLY the same machine instruction(s):
if( ( i % 2 ) == 0 )
if( ( i & 1 ) == 0 )
...even without ANY "optimization" turned on. The reason is that you are MOD-ing and AND-ing with constant values, and a %2 operation is, as any compiler writer should know, functionally equivalent to an &1 operation. In fact, a MOD by any power of 2 has an equivalent AND operation. If you really want to test the difference, you'll need to make the right-hand side of both operations a variable, and to be absolutely sure the compiler's cleverness isn't thwarting your efforts, you'll need to bury the variables' initializations somewhere the compiler can't tell what their runtime values will be; i.e. you'll need to pass the values into a GLOBALLY-DECLARED (i.e. not 'static') test function as parameters, in which case the compiler can't trace back to their definition and substitute the variables with constants, because theoretically any external caller could pass any values in for those parameters. Alternatively, you could leave the code in main() and define the variables globally, in which case the compiler can't substitute them with constants because it can't know for sure that another function may have altered the value of the global variables.
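A minimal sketch of that suggestion (the function name and signature are illustrative, not from the original answer): because the divisor arrives as a parameter of an externally visible function, the compiler cannot assume it is a power of two and must emit a real remainder operation:
#include <stdint.h>

/* Externally visible (not static), so any caller could pass any divisor;
 * the compiler therefore can't replace % with a cheap & here. */
int is_multiple(int64_t value, int64_t divisor) {
    return (value % divisor) == 0;
}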
Incidentally, this same issue exists for divide operations: divisions by constant powers of two can be substituted with an equivalent right-shift (>>) operation. The same trick works for multiplication (<<), but the benefits are less (or nonexistent) for multiplications. True division operations just take a long time in hardware; though significant improvements have been made in most modern processors versus even 15 years ago, division operations still take maybe 80 clock cycles, while a >> operation takes only a single cycle. You're not going to see an "order of magnitude" improvement using bitwise tricks on modern processors, but most compilers will still use those tricks because there is still some noticeable improvement.
EDIT: On some embedded processors (and, unbelievable though it seems, the original SPARC desktop/workstation processors before v8), there isn't even a divide instruction at all. All true divide & mod operations on such processors must be performed entirely in software, which can be a monstrously expensive operation. In that sort of environment, you surely would see an order of magnitude difference.
Bitwise checking takes only a single machine instruction ("and ...,0x01"); that's pretty hard to beat.
Modulo check will absolutely be slower if you have a dumb compiler that actually computes modulo by taking remainders (sometimes including a subroutine call to modulo routine!). Smart compilers know about the modulo function and generate code for it directly; if they have any decent optimization they know that "modulo(x,2)" can be implemented with the same AND trick above.
Our PARLANSE compiler does this as a matter of course. I'd be surprised if widely available C and C++ compilers don't do this too.
With such "good" compilers, it won't matter which way you write odd/even (or even "is power of two") checks; it will be pretty damn fast.

*str and *str++

I have this code (my strlen function)
size_t slen(const char *str)
{
    size_t len = 0;
    while (*str)
    {
        len++;
        str++;
    }
    return len;
}
Doing while (*str++), as shown below, makes the program execution time much longer:
while (*str++)
{
    len++;
}
I'm doing this to probe the code
int main()
{
    double i = 11002110;
    const char str[] = "long string here blablablablablablablabla";
    while (i--)
        slen(str);
    return 0;
}
In first case the execution time is around 6.7 seconds, while in the second (using *str++), the time is around 10 seconds!
Why so much difference?
Probably because the post-increment operator (used in the condition of the while statement) involves keeping a temporary copy of the variable with its old value.
What while (*str++) really means is:
while (tmp = *str, ++str, tmp)
...
By contrast, when you write str++; as a single statement in the body of the while loop, it is in a void context, hence the old value isn't fetched because it's not needed.
To summarise, in the *str++ case you have an assignment, 2 increments, and a jump in each iteration of the loop. In the other case you only have 2 increments and a jump.
Trying this out on ideone.com, I get about 0.5s execution with *str++ here. Without it, it takes just over a second (here). Using *str++ was faster. Perhaps with optimisation on, *str++ can be done more efficiently.
This depends on your compiler, compiler flags, and your architecture. With Apple's LLVM gcc 4.2.1, I don't get a noticeable change in performance between the two versions, and there really shouldn't be. A good compiler would turn the *str version into something like
IA-32 (AT&T Syntax):
slen:
    pushl   %ebp            # Save old frame pointer
    movl    %esp, %ebp      # Initialize new frame pointer
    movl    8(%ebp), %ecx   # Load str into %ecx
    xorl    %eax, %eax      # Zero out %eax to hold len
loop:
    cmpb    $0, (%ecx)      # Compare *str to 0
    je      done            # If *str is NUL, finish
    incl    %eax            # len++
    incl    %ecx            # str++
    jmp     loop            # Goto next iteration
done:
    popl    %ebp            # Restore old frame pointer
    ret                     # Return
The *str++ version could be compiled exactly the same (since changes to str aren't visible outside slen, when the increment actually occurs isn't important), or the body of the loop could be:
loop:
    incl    %ecx            # str++
    cmpb    $0, -1(%ecx)    # Compare old *str to 0
    je      done            # If *str is NUL, finish
    incl    %eax            # len++
    jmp     loop            # Goto next iteration
Others have already provided some excellent commentary, including analysis of the generated assembly code. I strongly recommend that you read them carefully. As they have pointed out, this sort of question can't really be answered without some quantification, so let's play with it a bit.
First, we're going to need a program. Our plan is this: we will generate strings whose lengths are powers of two, and try all the functions in turn. We run through once to prime the cache and then separately time 4096 iterations using the highest-resolution clock available to us. Once we are done, we will calculate some basic statistics: min, max and the simple moving average, and dump them. We can then do some rudimentary analysis.
In addition to the two algorithms you've already shown, I will show a third option which doesn't involve the use of a counter at all, relying instead on a subtraction, and I'll mix things up by throwing in std::strlen, just to see what happens. It'll be an interesting throwdown.
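That third, counter-free option looks roughly like this (a sketch of the idea; the exact code used for the measurements below is not reproduced here):
#include <stddef.h>

/* Walk a second pointer to the terminating NUL and subtract the start:
 * the length falls out of the pointer difference, no counter needed. */
size_t slen3(const char *str)
{
    const char *s = str;
    while (*s)
        s++;
    return (size_t)(s - str);
}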
Through the magic of television our little program is already written, so we compile it with gcc -std=c++11 -O3 speed.c and we get cranking producing some data. I've done two separate graphs, one for strings whose size is from 32 to 8192 bytes and another for strings whose size is from 16384 all the way to 1048576 bytes long. In the following graphs, the Y axis is the time consumed in nanoseconds and the X axis shows the length of the string in bytes.
Without further ado, let's look at performance for "small" strings from 32 to 8192 bytes:
Now this is interesting. Not only is the std::strlen function outperforming everything across the board, it's doing it with gusto, too, since its performance is a lot more stable.
Will the situation change if we look at larger strings, from 16384 all the way to 1048576 bytes long?
Sort of. The difference becomes even more pronounced. As our custom-written functions huff and puff, std::strlen continues to perform admirably.
An interesting observation to make is that you can't necessarily translate number of C++ instructions (or even, number of assembly instructions) to performance, since functions whose bodies consist of fewer instructions sometimes take longer to execute.
An even more interesting -- and important -- observation is just how well the std::strlen function performs.
So what does all this get us?
First conclusion: don't reinvent the wheel. Use the standard functions available to you. Not only are they already written, but they are very very heavily optimized and will almost certainly outperform anything you can write unless you're Agner Fog.
Second conclusion: unless you have hard data from a profiler that a particular section of code or function is a hot-spot in your application, don't bother optimizing code. Programmers are notoriously bad at detecting hot-spots by looking at high-level code.
Third conclusion: prefer algorithmic optimizations in order to improve your code's performance. Put your mind to work and let the compiler shuffle bits around.
Your original question was: "why is function slen2 slower than slen1?" I could say that it isn't easy to answer without a lot more information, and even then it might be a lot longer and more involved than you care for. Instead what I'll say is this:
Who cares why? Why are you even bothering with this? Use std::strlen - which is better than anything that you can rig up - and move on to solving more important problems - because I'm sure that this isn't the biggest problem in your application.

Resources