Is (really) '<' faster than '!=' in C?

This is the typical kind of question that gets downvoted, so I hesitated (a lot) before posting it...
I know this question was marked as a duplicate, but my tests (if they are sound: are they sound? that is part of the question) tend to show this is not the case.
At the beginning, I ran some tests comparing a for loop to a while loop.
Those showed that the for loop was faster.
But going further, for versus while was not the point: the difference is related to:
for (int l = 0; l < loops;l++) {
or
for (int l = 0; l != loops;l++) {
And if you run that (under Windows 10, Visual Studio 2017, Release), you see that the first one is more than twice as fast as the second one.
It is hard (for a novice like me) to tell whether the compiler is, for some reason, able to optimize one form better than the other. But...
Short question
Why?
Longer question
The complete code is the following:
For the '<' loop:
int forloop_inf(int loops, int iterations)
{
    int n = 0;
    int x = n;
    for (int l = 0; l < loops; l++) {
        for (int i = 0; i < iterations; i++) {
            n++;
            x += n;
        }
    }
    return x;
}
For the '!=' loop:
int forloop_diff(int loops, int iterations)
{
    int n = 0;
    int x = n;
    for (int l = 0; l != loops; l++) {
        for (int i = 0; i != iterations; i++) {
            n++;
            x += n;
        }
    }
    return x;
}
In both cases, the inner calculation is only there to prevent the compiler from optimizing the loops away entirely.
These are called respectively by:
printf("for loop inf %f\n", monitor_int(loops, iterations, forloop_inf, &result));
printf("%d\n", result);
and
printf("for loop diff %f\n", monitor_int(loops, iterations, forloop_diff, &result));
printf("%d\n", result);
where loops = 10 * 1000 and iterations = 1000 * 1000.
And where monitor_int is:
double monitor_int(int loops, int iterations, int(*func)(int, int), int *result)
{
    clock_t start = clock();
    *result = func(loops, iterations);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
The result in seconds is:
for loop inf 2.227 seconds
for loop diff 4.558 seconds
So, even though the practical relevance of all this depends on the weight of the loop body compared to the loop overhead itself, why such a difference?
Edit:
You can find here the full source code, revised so that the functions are called in a random order several times.
The corresponding disassembly is here (obtained with dumpbin /DISASM CPerf2.exe).
Running it, I now obtain:
'!=' 0.045231 (average on 493 runs)
'<' 0.031010 (average on 507 runs)
I do not know how to set O3 in Visual Studio; the compiler command line is the following:
/permissive- /Yu"stdafx.h" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Ot /Fp"x64\Release\CPerf2.pch" /diagnostics:classic
The code for the loops is above; here is the random way of running them:
typedef int(loop_signature)(int, int);
void loops_compare()
{
    int loops = 1 * 100;
    int iterations = 1000 * 1000;
    int result;
    loop_signature *functions[2] = {
        forloop_diff,
        forloop_inf
    };
    int n_rand = 1000;
    int n[2] = { 0, 0 };
    double cum[2] = { 0.0, 0.0 };
    for (int i = 0; i < n_rand; i++) {
        int pick = rand() % 2;
        loop_signature *fun = functions[pick];
        double time = monitor(loops, iterations, fun, &result);
        n[pick]++;
        cum[pick] += time;
    }
    printf("'!=' %f (%d) / '<' %f (%d)\n", cum[0] / (double)n[0], n[0], cum[1] / (double)n[1], n[1]);
}
and the disassembly (the loop functions only; I'm not sure it is the right extract from the link above):
?forloop_inf@@YAHHH@Z:
0000000140001000: 48 83 EC 08 sub rsp,8
0000000140001004: 45 33 C0 xor r8d,r8d
0000000140001007: 45 33 D2 xor r10d,r10d
000000014000100A: 44 8B DA mov r11d,edx
000000014000100D: 85 C9 test ecx,ecx
000000014000100F: 7E 6F jle 0000000140001080
0000000140001011: 48 89 1C 24 mov qword ptr [rsp],rbx
0000000140001015: 8B D9 mov ebx,ecx
0000000140001017: 66 0F 1F 84 00 00 nop word ptr [rax+rax]
00 00 00
0000000140001020: 45 33 C9 xor r9d,r9d
0000000140001023: 33 D2 xor edx,edx
0000000140001025: 33 C0 xor eax,eax
0000000140001027: 41 83 FB 02 cmp r11d,2
000000014000102B: 7C 29 jl 0000000140001056
000000014000102D: 41 8D 43 FE lea eax,[r11-2]
0000000140001031: D1 E8 shr eax,1
0000000140001033: FF C0 inc eax
0000000140001035: 8B C8 mov ecx,eax
0000000140001037: 03 C0 add eax,eax
0000000140001039: 0F 1F 80 00 00 00 nop dword ptr [rax]
00
0000000140001040: 41 FF C1 inc r9d
0000000140001043: 83 C2 02 add edx,2
0000000140001046: 45 03 C8 add r9d,r8d
0000000140001049: 41 03 D0 add edx,r8d
000000014000104C: 41 83 C0 02 add r8d,2
0000000140001050: 48 83 E9 01 sub rcx,1
0000000140001054: 75 EA jne 0000000140001040
0000000140001056: 41 3B C3 cmp eax,r11d
0000000140001059: 7D 06 jge 0000000140001061
000000014000105B: 41 FF C2 inc r10d
000000014000105E: 45 03 D0 add r10d,r8d
0000000140001061: 42 8D 0C 0A lea ecx,[rdx+r9]
0000000140001065: 44 03 D1 add r10d,ecx
0000000140001068: 41 8D 48 01 lea ecx,[r8+1]
000000014000106C: 41 3B C3 cmp eax,r11d
000000014000106F: 41 0F 4D C8 cmovge ecx,r8d
0000000140001073: 44 8B C1 mov r8d,ecx
0000000140001076: 48 83 EB 01 sub rbx,1
000000014000107A: 75 A4 jne 0000000140001020
000000014000107C: 48 8B 1C 24 mov rbx,qword ptr [rsp]
0000000140001080: 41 8B C2 mov eax,r10d
0000000140001083: 48 83 C4 08 add rsp,8
0000000140001087: C3 ret
0000000140001088: CC CC CC CC CC CC CC CC ÌÌÌÌÌÌÌÌ
?forloop_diff@@YAHHH@Z:
0000000140001090: 45 33 C0 xor r8d,r8d
0000000140001093: 41 8B C0 mov eax,r8d
0000000140001096: 85 C9 test ecx,ecx
0000000140001098: 74 28 je 00000001400010C2
000000014000109A: 44 8B C9 mov r9d,ecx
000000014000109D: 0F 1F 00 nop dword ptr [rax]
00000001400010A0: 85 D2 test edx,edx
00000001400010A2: 74 18 je 00000001400010BC
00000001400010A4: 8B CA mov ecx,edx
00000001400010A6: 66 66 0F 1F 84 00 nop word ptr [rax+rax]
00 00 00 00
00000001400010B0: 41 FF C0 inc r8d
00000001400010B3: 41 03 C0 add eax,r8d
00000001400010B6: 48 83 E9 01 sub rcx,1
00000001400010BA: 75 F4 jne 00000001400010B0
00000001400010BC: 49 83 E9 01 sub r9,1
00000001400010C0: 75 DE jne 00000001400010A0
00000001400010C2: C3 ret
00000001400010C3: CC CC CC CC CC CC CC CC CC CC CC CC CC ÌÌÌÌÌÌÌÌÌÌÌÌÌ
Edit again:
What I also find surprising is the following:
In debug mode the performance is the same (and so is the assembly code).
So how can you be confident about what you're coding if such differences appear afterwards? (assuming I haven't made a mistake somewhere)

For proper benchmarking, it's important to run the functions in random order and many times.
typedef int(signature)(int, int);
...
int main() {
    int loops, iterations, runs;
    fprintf(stderr, "Loops: ");
    scanf("%d", &loops);
    fprintf(stderr, "Iterations: ");
    scanf("%d", &iterations);
    fprintf(stderr, "Runs: ");
    scanf("%d", &runs);
    fprintf(stderr, "Running for %d loops and %d iterations %d times.\n", loops, iterations, runs);
    signature *functions[2] = {
        forloop_inf,
        forloop_diff
    };
    int result = functions[0](loops, iterations);
    for( int i = 0; i < runs; i++ ) {
        int pick = rand() % 2;
        signature *function = functions[pick];
        int new_result;
        printf("%d %f\n", pick, monitor_int(loops, iterations, function, &new_result));
        if( result != new_result ) {
            fprintf(stderr, "got %d expected %d\n", new_result, result);
        }
    }
}
Armed with this we can do 1000 runs in random order and find the average times.
It's also important to benchmark with optimizations turned on. Not much point in asking how fast unoptimized code will run. I'll try at -O2 and -O3.
My findings are that with Apple LLVM version 8.0.0 (clang-800.0.42.1), doing 10000 loops and 1000000 iterations at -O2, forloop_inf is indeed about 50% faster than forloop_diff.
forloop_inf: 0.000009
forloop_diff: 0.000014
Looking at the generated assembly code for -O2 with clang -O2 -S -mllvm --x86-asm-syntax=intel test.c I can see many differences between the two implementations. Maybe somebody who knows assembly can tell us why.
But at -O3 the performance difference is no longer discernible.
forloop_inf: 0.000002
forloop_diff: 0.000002
This is because at -O3 they are almost exactly the same. One is using je and one is using jle. That's it.
In conclusion, when benchmarking...
Do many runs.
Randomize the order.
Compile and run as close as you can to how you would in production.
In this case that means turning on compiler optimizations.
Look at the assembly code.
And most of all.
Pick the safest code, not the fastest.
i < max is safer than i != max because it will still terminate if i somehow jumps over max.
As demonstrated, with optimizations turned on, they're both so fast that even not fully optimized they can whip through 10,000,000,000 iterations in 0.000009 seconds. i < max or i != max is unlikely to be a performance bottleneck, rather whatever you're doing 10 billion times is.
But i != max could lead to a bug.
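Here is a contrived sketch (my addition, not part of the original answer) of the termination hazard being described, assuming the counter is stepped by more than one:
#include <stdio.h>

int main(void)
{
    int max = 10;
    /* With '<', overshooting the bound still terminates: i goes 0, 3, 6, 9 and stops. */
    for (int i = 0; i < max; i += 3)
        printf("%d ", i);
    putchar('\n');
    /* With 'i != max' and the same step, i would go ... 9, 12, 15, ... and never
       equal 10, so that loop would never end (and the signed counter would
       eventually overflow, which is undefined behaviour). */
    return 0;
}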

"<" is not faster than '!='. What's happening is something entirely different.
A loop "for (i = 0; i < n; ++i) is a pattern that the compiler recognises. If the loop body has no instructions modifying i or n, then the compiler knows this is a loop executing exactly max (n - i, 0) times, and can produce optimal code for this.
A loop "for (i = 0; i != n; ++i) is used in practice much less often, so compiler writers are not too bothered with it. And the number of iterations is much harder to determine. If i > n then we have undefined behaviour for signed integers unless there are statements that exit the loop. For unsigned numbers the number of iterations is tricky because it depends on the type of i. You will just get less optimised code.

Always look at the generated code.
That used to be true many years ago, when some microprocessors lacked certain conditional branch instructions or had very few flags, so some conditions had to be compiled into a sequence of comparisons and jumps.
But it is not true anymore: modern processors have a very rich set of conditional branch instructions (some of them, ARM for example, have many predicated "regular" instructions as well) and plenty of flags.
You can play yourself with different conditions here: https://godbolt.org/g/9DsqJm
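For instance, here is a pair of minimal functions (my addition, not from the answer) you can paste into the Compiler Explorer link above; with optimizations off you can see that both conditions become a single compare-and-branch, differing only in the condition code used (jl/jge versus jne/je on x86, b.lt versus b.ne on ARM):
int count_lt(int n)
{
    int c = 0;
    for (int i = 0; i < n; i++)   /* compare + jge/jl */
        c++;
    return c;
}

int count_ne(int n)
{
    int c = 0;
    for (int i = 0; i != n; i++)  /* compare + je/jne */
        c++;
    return c;
}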

Related

Why are bitops in the Linux kernel slower than mine?

I'm looking for the best bit-operations library or function written in C, and I figured the Linux kernel would be the best choice in this case.
So I copied the Linux kernel's set_bit function from arch/x86/include/asm/bitops.h, compared it with mine, and saw strange results!
kernel_bitops.c
#define ADDR BITOP_ADDR(addr)
#define __ASM_FORM(x) #x
#define BITOP_ADDR(x) "m" (*(volatile long *) (x))
#define __ASM_SEL(a,b) __ASM_FORM(b)
#define __ASM_SIZE(inst, ...) __ASM_SEL(inst##l##__VA_ARGS__, inst##q##__VA_ARGS__)
__always_inline void linux_set_bit(long nr, volatile unsigned long *addr)
{
    asm volatile(__ASM_SIZE(bts) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
}
my_bitops.c
#define SETBIT(_value, _bitIndex) _value |= (1ul<<(_bitIndex))
__always_inline void mine_set_bit(long nr, volatile unsigned long *addr)
{
    SETBIT(*addr,nr)
}
main.c
#define ARRAY_SIZE 10000000
static unsigned long num_array[ARRAY_SIZE];
unsigned long int num = 0x0F00000F00000000;
for (int i = 0; i < ARRAY_SIZE; i++)
    num_array[i] = num;
clock_t start = clock();
for (unsigned long int i = 0 ; i < ARRAY_SIZE; i++)
    for (unsigned long int j = 0; j < sizeof(unsigned long int) * 8; j++)
        // linux_set_bit(j, &num_array[i]);
        // mine_set_bit(j, &num_array[i]);
clock_t end = clock();
Time took for Linux: 1375991 us
Time took for mine: 912256 us
CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
The assembly code generated with -O2 is:
26 [1] linux_set_bit(j, &num_array[i]);
0x4005c0 <+ 90> 48 8b 45 d0 mov -0x30(%rbp),%rax
0x4005c4 <+ 94> 48 c1 e0 03 shl $0x3,%rax
0x4005c8 <+ 98> 48 8d 90 60 10 60 00 lea 0x601060(%rax),%rdx
0x4005cf <+ 105> 48 8b 45 d8 mov -0x28(%rbp),%rax
0x4005d3 <+ 109> 48 89 d6 mov %rdx,%rsi
0x4005d6 <+ 112> 48 89 c7 mov %rax,%rdi
0x4005d9 <+ 115> e8 69 00 00 00 callq 0x400647 <linux_set_bit>
71 [1] asm volatile(__ASM_SIZE(bts) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
0x400653 <+ 12> 48 8b 45 f0 mov -0x10(%rbp),%rax
0x400657 <+ 16> 48 8b 55 f8 mov -0x8(%rbp),%rdx
0x40065b <+ 20> 48 0f ab 10 bts %rdx,(%rax)
19 [1] SETBIT(*addr,nr);
0x400653 <+ 12> 48 8b 45 f0 mov -0x10(%rbp),%rax
0x400657 <+ 16> 48 8b 00 mov (%rax),%rax
0x40065a <+ 19> 48 8b 55 f8 mov -0x8(%rbp),%rdx
0x40065e <+ 23> be 01 00 00 00 mov $0x1,%esi
0x400663 <+ 28> 89 d1 mov %edx,%ecx
0x400665 <+ 30> d3 e6 shl %cl,%esi
0x400667 <+ 32> 89 f2 mov %esi,%edx
0x400669 <+ 34> 89 d2 mov %edx,%edx
0x40066b <+ 36> 48 09 c2 or %rax,%rdx
0x40066e <+ 39> 48 8b 45 f0 mov -0x10(%rbp),%rax
0x400672 <+ 43> 48 89 10 mov %rdx,(%rax)
Where am I wrong? Or does Linux really have a slow operation?
The main difference is that your code can't handle a bit number larger than the number of bits in an unsigned long, while Linux's version can. Because of this difference, you've written a loop that works within your version's limitations, a loop that isn't ideal when those limitations aren't there, and isn't ideal for Linux's version.
Specifically; for Linux's version, you could/should do this:
for (unsigned long int i = 0 ; i < ARRAY_SIZE * sizeof(unsigned long int) * 8; i++) {
    linux_set_bit(i, num_array);
}
By removing the entire inner loop overhead, plus the calculation needed to find a pointer to an element of the array (the &num_array[i] part), it'll be significantly faster (and probably be faster than yours).
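For comparison, here is a hedged C sketch (my own, with a made-up name) of the key thing the kernel helper does that the plain SETBIT macro does not: it picks the right array element first, so the bit index can exceed the width of one unsigned long:
#include <limits.h>

static inline void set_bit_wide(long nr, volatile unsigned long *addr)
{
    const unsigned long bits = sizeof(unsigned long) * CHAR_BIT;
    /* Select the word that holds bit 'nr', then set the bit within it. */
    addr[nr / bits] |= 1UL << (nr % bits);
}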
Yes, bts %reg, (mem) is slow (https://uops.info); IDK why Linux forces that form without using a lock prefix. Possibly the operation needs to be atomic wrt. interrupts on the same core, which doing it with a single instruction accomplishes.
If not, it's faster to emulate it with multiple instructions to calculate the address of the byte or dword containing the bit you want: How can memory destination BTS be significantly slower than load / BTS reg,reg / store?
(bts imm, (mem) is not bad, though, so you could use __builtin_constant_p(bitpos) and use memory-destination bts.)
As @Brendan points out, your version only works for bitpos < sizeof(unsigned long) * CHAR_BIT, i.e. within the first qword.
I don't know why exactly Linux forces a memory destination bts with a volatile pointer. Presumably there's some kind of reason other than performance. Otherwise, yes it's a missed optimization.

Why do a 32-bit compiler and a 64-bit compiler make such a difference with my code? [duplicate]

This question already has answers here:
How dangerous is it to access an array out of bounds?
(12 answers)
Closed 3 years ago.
Excuse my bad English.
I have written some code to return the max, min, and sum of all values, and to arrange all values in ascending order, when five integers are input.
While writing it, I mistakenly declared the int array as num[4] when I needed room for 5 integers.
But when I compiled with TDM-GCC 4.9.2 64-bit release, it worked without any problem. As soon as I realized the mistake and switched to TDM-GCC 4.9.2 32-bit release, it did not.
This is my whole code:
#include<stdio.h>
int main()
{
    int num[4],i,j,k,a,b,c,m,number,sum=0;
    printf("This program returns max, min, sum of all values, and arranges all values in ascending order when five integers are input.\n");
    printf("Please enter five integers.\n");
    for(i=0;i<5;i++)
    {
        printf("Enter #%d\n",i+1);
        scanf("%d",&num[i]);
    }
    //arrange all values
    for(j=0;j<5;j++)
    {
        for(k=j+1;k<5;k++)
        {
            if(num[j]>num[k])
            {
                number=num[j];
                num[j]=num[k];
                num[k]=number;
            }
        }
    }
    //find maximum value
    int max=num[0];
    for(a=1;a<5;a++)
    {
        if(max<num[a])
        {
            max=num[a];
        }
    }
    //find minimum value
    int min=num[0];
    for(b=1;b<5;b++)
    {
        if(min>num[b])
        {
            min=num[b];
        }
    }
    //find sum of all values
    for(c=0;c<5;c++)
    {
        sum=sum+num[c];
    }
    printf("Max Value : %d\n",max);//print max
    printf("Min Value : %d\n",min);//print min
    printf("Sum : %d\n",sum); //print sum
    printf("In ascending order : "); //print all values in ascending order
    for(m=0;m<5;m++)
    {
        printf("%d ",num[m]);
    }
}
I am new to C and to all kinds of programming, and I don't know how to search for this kind of problem. I know that asking this way here may be inappropriate, and I sincerely apologize to people who are irritated by these types of posts. But this is my best attempt, so please don't blame me; I'm willing to accept any kind of advice or tips.
Thank you.
When allocating on the stack, GCC targeting 64-bit (and probably Clang) will align stack allocations to 8 bytes.
For 32-bit targets, it only aligns them to 4 bytes.
So when you compiled your program for 64-bit, an extra four bytes of padding were used on the stack. That's why accessing that last integer didn't segfault.
To see this in action, we'll create a test file.
void test_func() {
    int n[4];
    int b = 11;
    for (int i = 0; i < 4; i++) {
        n[i] = b;
    }
}
And we'll compile it for 32-bit and 64-bit.
gcc -g -c -m64 test.c -o test_64.o
gcc -g -c -m32 test.c -o test_32.o
And now we'll print the disassembly for each.
objdump -S test_64.o >test_64_dis.txt
objdump -S test_32.o >test_32_dis.txt
Here's the contents of the 64-bit version.
test_64.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <func>:
void func() {
0: f3 0f 1e fa endbr64
4: 55 push %rbp
5: 48 89 e5 mov %rsp,%rbp
8: 48 83 ec 30 sub $0x30,%rsp
c: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
13: 00 00
15: 48 89 45 f8 mov %rax,-0x8(%rbp)
19: 31 c0 xor %eax,%eax
int n[4];
int b = 11;
1b: c7 45 dc 0b 00 00 00 movl $0xb,-0x24(%rbp)
for (int i = 0; i < 4; i++) {
22: c7 45 d8 00 00 00 00 movl $0x0,-0x28(%rbp)
29: eb 10 jmp 3b <func+0x3b>
n[i] = b;
2b: 8b 45 d8 mov -0x28(%rbp),%eax
2e: 48 98 cltq
30: 8b 55 dc mov -0x24(%rbp),%edx
33: 89 54 85 e0 mov %edx,-0x20(%rbp,%rax,4)
for (int i = 0; i < 4; i++) {
37: 83 45 d8 01 addl $0x1,-0x28(%rbp)
3b: 83 7d d8 03 cmpl $0x3,-0x28(%rbp)
3f: 7e ea jle 2b <func+0x2b>
}
}
41: 90 nop
42: 48 8b 45 f8 mov -0x8(%rbp),%rax
46: 64 48 33 04 25 28 00 xor %fs:0x28,%rax
4d: 00 00
4f: 74 05 je 56 <func+0x56>
51: e8 00 00 00 00 callq 56 <func+0x56>
56: c9 leaveq
57: c3 retq
Here's the 32-bit version.
test_32.o: file format elf32-i386
Disassembly of section .text:
00000000 <func>:
void func() {
0: f3 0f 1e fb endbr32
4: 55 push %ebp
5: 89 e5 mov %esp,%ebp
7: 83 ec 28 sub $0x28,%esp
a: e8 fc ff ff ff call b <func+0xb>
f: 05 01 00 00 00 add $0x1,%eax
14: 65 a1 14 00 00 00 mov %gs:0x14,%eax
1a: 89 45 f4 mov %eax,-0xc(%ebp)
1d: 31 c0 xor %eax,%eax
int n[4];
int b = 11;
1f: c7 45 e0 0b 00 00 00 movl $0xb,-0x20(%ebp)
for (int i = 0; i < 4; i++) {
26: c7 45 dc 00 00 00 00 movl $0x0,-0x24(%ebp)
2d: eb 0e jmp 3d <func+0x3d>
n[i] = b;
2f: 8b 45 dc mov -0x24(%ebp),%eax
32: 8b 55 e0 mov -0x20(%ebp),%edx
35: 89 54 85 e4 mov %edx,-0x1c(%ebp,%eax,4)
for (int i = 0; i < 4; i++) {
39: 83 45 dc 01 addl $0x1,-0x24(%ebp)
3d: 83 7d dc 03 cmpl $0x3,-0x24(%ebp)
41: 7e ec jle 2f <func+0x2f>
}
}
43: 90 nop
44: 8b 45 f4 mov -0xc(%ebp),%eax
47: 65 33 05 14 00 00 00 xor %gs:0x14,%eax
4e: 74 05 je 55 <func+0x55>
50: e8 fc ff ff ff call 51 <func+0x51>
55: c9 leave
56: c3 ret
Disassembly of section .text.__x86.get_pc_thunk.ax:
00000000 <__x86.get_pc_thunk.ax>:
0: 8b 04 24 mov (%esp),%eax
3: c3 ret
You can see the compiler is generating 24 bytes and then 20 bytes respectively, if you look right after the variable declarations.
Regarding the advice/tips you asked for, a good starting point would be to enable all compiler warnings and treat them as errors. In GCC and Clang, you'd use the -Wall -Wextra -Werror -Wfatal-errors flags.
I wouldn't recommend this if you're using the MSVC compiler, though, which often issues warnings about declarations from the header files it's distributed with.
Other answers cover what might actually be happening, by analyzing the generated assembly, but the really relevant explanation is: indexing out of array bounds is undefined behavior in C. And that's kind of the end of the story.
UB means the code is "allowed" to do anything by the C standard. It could do a different thing every time it is run. It could do what you want with no ill effects. It might do what you want, but then something completely unrelated behaves in a funny way. The compiler, the operating system, or even the phase of the moon could make a difference. Or not.
It is generally not useful to think about what actually happens with undefined behavior at the C level. You can of course produce the assembly output of a particular compilation and inspect what it does, but that is the result of that one compilation. A new compilation might change things (even if you just do a new build at a different time, because the value of the __TIME__ macro depends on the time...).
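A contrived sketch of that unpredictability (mine, not the answerer's); none of these outcomes is guaranteed, which is exactly the point:
#include <stdio.h>

int main(void)
{
    int canary = 42;
    int num[4];
    num[4] = 7;   /* one past the end: undefined behaviour */
    /* Depending on how the compiler laid out the stack, this may print 42,
       may print 7, may crash, or the compiler may have removed the store. */
    printf("canary = %d\n", canary);
    return 0;
}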

Why is writing to memory much slower than reading it?

Here's a simple memset bandwidth benchmark:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
int main()
{
    unsigned long n, r, i;
    unsigned char *p;
    clock_t c0, c1;
    double elapsed;
    n = 1000 * 1000 * 1000; /* GB */
    r = 100; /* repeat */
    p = calloc(n, 1);
    c0 = clock();
    for(i = 0; i < r; ++i) {
        memset(p, (int)i, n);
        printf("%4d/%4ld\r", p[0], r); /* "use" the result */
        fflush(stdout);
    }
    c1 = clock();
    elapsed = (c1 - c0) / (double)CLOCKS_PER_SEC;
    printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n", (double)n * r / elapsed / 1e9);
    free(p);
}
On my system (details below) with a single DDR3-1600 memory module, it outputs:
Bandwidth = 4.751 GB/s (Giga = 10^9)
This is 37% of the theoretical RAM speed: 1.6 GHz * 8 bytes = 12.8 GB/s
On the other hand, here's a similar "read" test:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
unsigned long do_xor(const unsigned long* p, unsigned long n)
{
    unsigned long i, x = 0;
    for(i = 0; i < n; ++i)
        x ^= p[i];
    return x;
}
int main()
{
    unsigned long n, r, i;
    unsigned long *p;
    clock_t c0, c1;
    double elapsed;
    n = 1000 * 1000 * 1000; /* GB */
    r = 100; /* repeat */
    p = calloc(n/sizeof(unsigned long), sizeof(unsigned long));
    c0 = clock();
    for(i = 0; i < r; ++i) {
        p[0] = do_xor(p, n / sizeof(unsigned long)); /* "use" the result */
        printf("%4ld/%4ld\r", i, r);
        fflush(stdout);
    }
    c1 = clock();
    elapsed = (c1 - c0) / (double)CLOCKS_PER_SEC;
    printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n", (double)n * r / elapsed / 1e9);
    free(p);
}
It outputs:
Bandwidth = 11.516 GB/s (Giga = 10^9)
I can get close to the theoretical limit for read performance, such as XORing a large array, but writing appears to be much slower. Why?
OS Ubuntu 14.04 AMD64 (I compile with gcc -O3. Using -O3 -march=native makes the read performance slightly worse, but does not affect memset)
CPU Xeon E5-2630 v2
RAM A single "16GB PC3-12800 Parity REG CL11 240-Pin DIMM" (What it says on the box) I think that having a single DIMM makes performance more predictable. I'm assuming that with 4 DIMMs, memset will be up to 4 times faster.
Motherboard Supermicro X9DRG-QF (Supports 4-channel memory)
Additional system: A laptop with 2x 4GB of DDR3-1067 RAM: read and write are both about 5.5 GB/s, but note that it uses 2 DIMMs.
P.S. replacing memset with this version results in exactly the same performance
void *my_memset(void *s, int c, size_t n)
{
    unsigned long i = 0;
    for(i = 0; i < n; ++i)
        ((char*)s)[i] = (char)c;
    return s;
}
With your programs, I get
(write) Bandwidth = 6.076 GB/s
(read) Bandwidth = 10.916 GB/s
on a desktop (Core i7, x86-64, GCC 4.9, GNU libc 2.19) machine with six 2GB DIMMs. (I don't have any more detail than that to hand, sorry.)
However, this program reports write bandwidth of 12.209 GB/s:
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <emmintrin.h>

static void
nt_memset(char *buf, unsigned char val, size_t n)
{
    /* this will only work with aligned address and size */
    assert((uintptr_t)buf % sizeof(__m128i) == 0);
    assert(n % sizeof(__m128i) == 0);
    __m128i xval = _mm_set_epi8(val, val, val, val,
                                val, val, val, val,
                                val, val, val, val,
                                val, val, val, val);
    for (__m128i *p = (__m128i*)buf; p < (__m128i*)(buf + n); p++)
        _mm_stream_si128(p, xval);
    _mm_sfence();
}

/* same main() as your write test, except calling nt_memset instead of memset */
The magic is all in _mm_stream_si128, aka the machine instruction movntdq, which writes a 16-byte quantity to system RAM, bypassing the cache (the official jargon for this is "non-temporal store"). I think this pretty conclusively demonstrates that the performance difference is all about the cache behavior.
N.B. glibc 2.19 does have an elaborately hand-optimized memset that makes use of vector instructions. However, it does not use non-temporal stores. That's probably the Right Thing for memset; in general, you clear memory shortly before using it, so you want it to be hot in the cache. (I suppose an even cleverer memset might switch to non-temporal stores for really huge block clear, on the theory that you could not possibly want all of that in the cache, because the cache simply isn't that big.)
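A hedged sketch of that "even cleverer memset" idea, reusing the nt_memset helper above; the 8 MiB cutoff and the function name are my own assumptions, not glibc's actual policy:
#include <stdint.h>
#include <string.h>

void *memset_smart(void *s, int c, size_t n)
{
    const size_t big = 8u * 1024 * 1024;  /* assumed "much larger than cache" cutoff */
    if (n < big || ((uintptr_t)s % 16) != 0 || (n % 16) != 0)
        return memset(s, c, n);           /* small or unaligned: keep it hot in cache */
    nt_memset(s, (unsigned char)c, n);    /* huge: bypass the cache with NT stores */
    return s;
}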
Dump of assembler code for function memset:
=> 0x00007ffff7ab9420 <+0>: movd %esi,%xmm8
0x00007ffff7ab9425 <+5>: mov %rdi,%rax
0x00007ffff7ab9428 <+8>: punpcklbw %xmm8,%xmm8
0x00007ffff7ab942d <+13>: punpcklwd %xmm8,%xmm8
0x00007ffff7ab9432 <+18>: pshufd $0x0,%xmm8,%xmm8
0x00007ffff7ab9438 <+24>: cmp $0x40,%rdx
0x00007ffff7ab943c <+28>: ja 0x7ffff7ab9470 <memset+80>
0x00007ffff7ab943e <+30>: cmp $0x10,%rdx
0x00007ffff7ab9442 <+34>: jbe 0x7ffff7ab94e2 <memset+194>
0x00007ffff7ab9448 <+40>: cmp $0x20,%rdx
0x00007ffff7ab944c <+44>: movdqu %xmm8,(%rdi)
0x00007ffff7ab9451 <+49>: movdqu %xmm8,-0x10(%rdi,%rdx,1)
0x00007ffff7ab9458 <+56>: ja 0x7ffff7ab9460 <memset+64>
0x00007ffff7ab945a <+58>: repz retq
0x00007ffff7ab945c <+60>: nopl 0x0(%rax)
0x00007ffff7ab9460 <+64>: movdqu %xmm8,0x10(%rdi)
0x00007ffff7ab9466 <+70>: movdqu %xmm8,-0x20(%rdi,%rdx,1)
0x00007ffff7ab946d <+77>: retq
0x00007ffff7ab946e <+78>: xchg %ax,%ax
0x00007ffff7ab9470 <+80>: lea 0x40(%rdi),%rcx
0x00007ffff7ab9474 <+84>: movdqu %xmm8,(%rdi)
0x00007ffff7ab9479 <+89>: and $0xffffffffffffffc0,%rcx
0x00007ffff7ab947d <+93>: movdqu %xmm8,-0x10(%rdi,%rdx,1)
0x00007ffff7ab9484 <+100>: movdqu %xmm8,0x10(%rdi)
0x00007ffff7ab948a <+106>: movdqu %xmm8,-0x20(%rdi,%rdx,1)
0x00007ffff7ab9491 <+113>: movdqu %xmm8,0x20(%rdi)
0x00007ffff7ab9497 <+119>: movdqu %xmm8,-0x30(%rdi,%rdx,1)
0x00007ffff7ab949e <+126>: movdqu %xmm8,0x30(%rdi)
0x00007ffff7ab94a4 <+132>: movdqu %xmm8,-0x40(%rdi,%rdx,1)
0x00007ffff7ab94ab <+139>: add %rdi,%rdx
0x00007ffff7ab94ae <+142>: and $0xffffffffffffffc0,%rdx
0x00007ffff7ab94b2 <+146>: cmp %rdx,%rcx
0x00007ffff7ab94b5 <+149>: je 0x7ffff7ab945a <memset+58>
0x00007ffff7ab94b7 <+151>: nopw 0x0(%rax,%rax,1)
0x00007ffff7ab94c0 <+160>: movdqa %xmm8,(%rcx)
0x00007ffff7ab94c5 <+165>: movdqa %xmm8,0x10(%rcx)
0x00007ffff7ab94cb <+171>: movdqa %xmm8,0x20(%rcx)
0x00007ffff7ab94d1 <+177>: movdqa %xmm8,0x30(%rcx)
0x00007ffff7ab94d7 <+183>: add $0x40,%rcx
0x00007ffff7ab94db <+187>: cmp %rcx,%rdx
0x00007ffff7ab94de <+190>: jne 0x7ffff7ab94c0 <memset+160>
0x00007ffff7ab94e0 <+192>: repz retq
0x00007ffff7ab94e2 <+194>: movq %xmm8,%rcx
0x00007ffff7ab94e7 <+199>: test $0x18,%dl
0x00007ffff7ab94ea <+202>: jne 0x7ffff7ab950e <memset+238>
0x00007ffff7ab94ec <+204>: test $0x4,%dl
0x00007ffff7ab94ef <+207>: jne 0x7ffff7ab9507 <memset+231>
0x00007ffff7ab94f1 <+209>: test $0x1,%dl
0x00007ffff7ab94f4 <+212>: je 0x7ffff7ab94f8 <memset+216>
0x00007ffff7ab94f6 <+214>: mov %cl,(%rdi)
0x00007ffff7ab94f8 <+216>: test $0x2,%dl
0x00007ffff7ab94fb <+219>: je 0x7ffff7ab945a <memset+58>
0x00007ffff7ab9501 <+225>: mov %cx,-0x2(%rax,%rdx,1)
0x00007ffff7ab9506 <+230>: retq
0x00007ffff7ab9507 <+231>: mov %ecx,(%rdi)
0x00007ffff7ab9509 <+233>: mov %ecx,-0x4(%rdi,%rdx,1)
0x00007ffff7ab950d <+237>: retq
0x00007ffff7ab950e <+238>: mov %rcx,(%rdi)
0x00007ffff7ab9511 <+241>: mov %rcx,-0x8(%rdi,%rdx,1)
0x00007ffff7ab9516 <+246>: retq
(This is in libc.so.6, not the program itself -- the other person who tried to dump the assembly for memset seems only to have found its PLT entry. The easiest way to get the assembly dump for the real memset on a Unixy system is
$ gdb ./a.out
(gdb) set env LD_BIND_NOW t
(gdb) b main
Breakpoint 1 at [address]
(gdb) r
Breakpoint 1, [address] in main ()
(gdb) disas memset
...
.)
The main difference in the performance comes from the caching policy of your PC/memory region. When you read from a memory and the data is not in the cache, the memory must be first fetched to the cache through memory bus before you can perform any computation with the data. However, when you write to memory there are different write policies. Most likely your system is using write-back cache (or more precisely "write allocate"), which means that when you write to a memory location that's not in the cache, the data is first fetched from the memory to the cache and eventually written back to memory when the data is evicted from cache, which means round-trip for the data and 2x bus bandwidth usage upon writes. There is also write-through caching policy (or "no-write allocate") which generally means that upon cache-miss at writes the data isn't fetched to the cache, and which should give closer to the same performance for both reads and writes.
The difference -- at least on my machine, with an AMD processor -- is that the read program is using vectorized operations. Disassembling the two yields this for the writing program:
0000000000400610 <main>:
...
400628: e8 73 ff ff ff callq 4005a0 <clock@plt>
40062d: 49 89 c4 mov %rax,%r12
400630: 89 de mov %ebx,%esi
400632: ba 00 ca 9a 3b mov $0x3b9aca00,%edx
400637: 48 89 ef mov %rbp,%rdi
40063a: e8 71 ff ff ff callq 4005b0 <memset@plt>
40063f: 0f b6 55 00 movzbl 0x0(%rbp),%edx
400643: b9 64 00 00 00 mov $0x64,%ecx
400648: be 34 08 40 00 mov $0x400834,%esi
40064d: bf 01 00 00 00 mov $0x1,%edi
400652: 31 c0 xor %eax,%eax
400654: 48 83 c3 01 add $0x1,%rbx
400658: e8 a3 ff ff ff callq 400600 <__printf_chk@plt>
But this for the reading program:
00000000004005d0 <main>:
....
400609: e8 62 ff ff ff callq 400570 <clock@plt>
40060e: 49 d1 ee shr %r14
400611: 48 89 44 24 18 mov %rax,0x18(%rsp)
400616: 4b 8d 04 e7 lea (%r15,%r12,8),%rax
40061a: 4b 8d 1c 36 lea (%r14,%r14,1),%rbx
40061e: 48 89 44 24 10 mov %rax,0x10(%rsp)
400623: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
400628: 4d 85 e4 test %r12,%r12
40062b: 0f 84 df 00 00 00 je 400710 <main+0x140>
400631: 49 8b 17 mov (%r15),%rdx
400634: bf 01 00 00 00 mov $0x1,%edi
400639: 48 8b 74 24 10 mov 0x10(%rsp),%rsi
40063e: 66 0f ef c0 pxor %xmm0,%xmm0
400642: 31 c9 xor %ecx,%ecx
400644: 0f 1f 40 00 nopl 0x0(%rax)
400648: 48 83 c1 01 add $0x1,%rcx
40064c: 66 0f ef 06 pxor (%rsi),%xmm0
400650: 48 83 c6 10 add $0x10,%rsi
400654: 49 39 ce cmp %rcx,%r14
400657: 77 ef ja 400648 <main+0x78>
400659: 66 0f 6f d0 movdqa %xmm0,%xmm2 ;!!!! vectorized magic
40065d: 48 01 df add %rbx,%rdi
400660: 66 0f 73 da 08 psrldq $0x8,%xmm2
400665: 66 0f ef c2 pxor %xmm2,%xmm0
400669: 66 0f 7f 04 24 movdqa %xmm0,(%rsp)
40066e: 48 8b 04 24 mov (%rsp),%rax
400672: 48 31 d0 xor %rdx,%rax
400675: 48 39 dd cmp %rbx,%rbp
400678: 74 04 je 40067e <main+0xae>
40067a: 49 33 04 ff xor (%r15,%rdi,8),%rax
40067e: 4c 89 ea mov %r13,%rdx
400681: 49 89 07 mov %rax,(%r15)
400684: b9 64 00 00 00 mov $0x64,%ecx
400689: be 04 0a 40 00 mov $0x400a04,%esi
400695: e8 26 ff ff ff callq 4005c0 <__printf_chk@plt>
40068e: bf 01 00 00 00 mov $0x1,%edi
400693: 31 c0 xor %eax,%eax
Also, note that your "homegrown" memset is actually optimized down to a call to memset:
00000000004007b0 <my_memset>:
4007b0: 48 85 d2 test %rdx,%rdx
4007b3: 74 1b je 4007d0 <my_memset+0x20>
4007b5: 48 83 ec 08 sub $0x8,%rsp
4007b9: 40 0f be f6 movsbl %sil,%esi
4007bd: e8 ee fd ff ff callq 4005b0 <memset@plt>
4007c2: 48 83 c4 08 add $0x8,%rsp
4007c6: c3 retq
4007c7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
4007ce: 00 00
4007d0: 48 89 f8 mov %rdi,%rax
4007d3: c3 retq
4007d4: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4007db: 00 00 00
4007de: 66 90 xchg %ax,%ax
I can't find any references regarding whether or not memset uses vectorized operations; the disassembly of memset@plt is unhelpful here:
00000000004005b0 <memset@plt>:
4005b0: ff 25 72 0a 20 00 jmpq *0x200a72(%rip) # 601028 <_GLOBAL_OFFSET_TABLE_+0x28>
4005b6: 68 02 00 00 00 pushq $0x2
4005bb: e9 c0 ff ff ff jmpq 400580 <_init+0x20>
This question suggests that since memset is designed to handle every case, it might be missing some optimizations.
This guy definitely seems convinced that you need to roll your own assembler memset to take advantage of SIMD instructions. This question does, too.
I'm going to take a shot in the dark and guess that it's not using SIMD operations because it can't tell whether or not it's going to be operating on something that's a multiple of the size of one vectorized operation, or there's some alignment-related issue.
However, we can confirm that it's not an issue of cache efficiency by checking with cachegrind. The write program produces the following:
==19593== D refs: 6,312,618,768 (80,386 rd + 6,312,538,382 wr)
==19593== D1 misses: 1,578,132,439 ( 5,350 rd + 1,578,127,089 wr)
==19593== LLd misses: 1,578,131,849 ( 4,806 rd + 1,578,127,043 wr)
==19593== D1 miss rate: 24.9% ( 6.6% + 24.9% )
==19593== LLd miss rate: 24.9% ( 5.9% + 24.9% )
==19593==
==19593== LL refs: 1,578,133,467 ( 6,378 rd + 1,578,127,089 wr)
==19593== LL misses: 1,578,132,871 ( 5,828 rd + 1,578,127,043 wr) <<
==19593== LL miss rate: 9.0% ( 0.0% + 24.9% )
and the read program produces:
==19682== D refs: 6,312,618,618 (6,250,080,336 rd + 62,538,282 wr)
==19682== D1 misses: 1,578,132,331 (1,562,505,046 rd + 15,627,285 wr)
==19682== LLd misses: 1,578,131,740 (1,562,504,500 rd + 15,627,240 wr)
==19682== D1 miss rate: 24.9% ( 24.9% + 24.9% )
==19682== LLd miss rate: 24.9% ( 24.9% + 24.9% )
==19682==
==19682== LL refs: 1,578,133,357 (1,562,506,072 rd + 15,627,285 wr)
==19682== LL misses: 1,578,132,760 (1,562,505,520 rd + 15,627,240 wr) <<
==19682== LL miss rate: 4.1% ( 4.1% + 24.9% )
While the read program has a lower LL miss rate because it performs many more reads (an extra read per XOR operation), the total number of misses is the same. So whatever the issue is, it's not there.
Caching and locality almost certainly explain most of the effects you are seeing.
There isn't any caching or locality on writes, unless you want a non-deterministic system. Most write times are measured as the time it takes for the data to get all the way to the storage medium (whether this is a hard drive or a memory chip), whereas reads can come from any number of cache layers that are faster than the storage medium.
It might be just how it (the system as a whole) performs. Reads being faster appears to be a common trend across a wide range of relative throughput performance. From a quick analysis of the DDR3 Intel and DDR2 charts listed, here are a few selected cases of (write/read)%:
Some top-performing DDR3 chips write at about ~60-70% of the read throughput. However, there are some memory modules (i.e. Golden Empire CL11-13-13 D3-2666) down to only ~30%.
Top-performing DDR2 chips appear to have only about ~50% of the write throughput compared to the read. But there are also some notably bad contenders (i.e. OCZ OCZ21066NEW_BT1G) down to ~20%.
While this may not explain the cause for the ~40% write/read reported, as benchmark code and setup used is likely different (the notes are vague), this is definitely a factor. (I would run some existing benchmark programs and see if the numbers fall in-line with those of the code posted in the question.)
Update:
I downloaded the memory look-up table from the linked site and processed it in Excel. While it still shows a wide range of values, it is much less severe than the original reply above, which only looked at the top-read memory chips and a few selected "interesting" entries from the charts. I'm not sure why the discrepancies, especially in the terrible contenders singled out above, are not present in the secondary list.
However, even under the new numbers the difference still ranges widely from 50%-100% (median 65, mean 65) of the read performance. Do note that just because a chip was "100%" efficient in the write/read ratio doesn't mean it was better overall, just that it was more evenly balanced between the two operations.
Here's my working hypothesis. If correct, it explains why writes are about twice as slow as reads:
Even though memset only writes to virtual memory, ignoring its previous contents, at the hardware level, the computer cannot do a pure write to DRAM: it reads the contents of DRAM into cache, modifies them there and then writes them back to DRAM. Therefore, at the hardware level, memset does both reading and writing (even though the former seems useless)! Hence the roughly two-fold speed difference.
Because to read you simply pulse the address lines and read out the core states on the sense lines. The write-back cycle occurs after the data is delivered to the CPU and hence doesn't slow things down. On the other hand, to write you must first perform a fake read to reset the cores, then perform the write cycle.
(Just in case it's not obvious, this answer is tongue-in-cheek -- describing why write is slower than read on an old core memory box.)

Simplest way to toggle an integer variable between two values

I have a variable a which can have only two values, x1 or x2. How do I toggle a between these values? I came up with this. Is there any other, more efficient way?
a = (a == x1 ? x2: x1);
It's (highly) unlikely to be your bottleneck, but you could use the XOR method:
togglex1x2 = (x1 ^ x2); // This is the combined toggle value
a = x1; // Initialise to either x1 or x2
a ^= togglex1x2; // toggles
a ^= togglex1x2; // toggles
...
[You should write code that is understandable first, and optimise only when you have measured a bottleneck (and then double checked it is where you think it is!), and if you optimise make sure you comment with reasoning. ]
Try something like this; a will toggle between x1 and x2:
a = (x1 + x2) - a;
Another way is to toggle an index between 0 and 1 and index an array with that:
int main() {
    int const values[] = {0x55, 0xaa};
    int selector = 0;
    selector ^= 1; // toggle index
    int value = values[selector]; // select value
}
It's very hard to predict which method is better without context. The biggest unknown is in which way this operation is critical: is it latency bound (e.g. you're doing a long calculation with data dependencies that go through this code), or perhaps bandwidth critical (you're swapping many unrelated elements and start to run your resources thin)?
I tried to benchmark the solutions proposed here.
See this code, for example:
#include <stdio.h>

int main()
{
    int x1 = 123, x2 = 456;
    int x1_xor_x2 = x1 ^ x2;
    int a = x1;
    int i;
    for (i = 0; i < 10000; ++i)
        a = (a == x1 ? x2: x1);
    for (i = 0; i < 10000; ++i)
        a ^= x1_xor_x2;
    printf ("a=%d\n", a); // prevent all this from being optimized out
}
becomes (gcc, with -O3):
0000000000400440 <main>:
400440: b8 10 27 00 00 mov $0x2710,%eax // loop counter
400445: ba c8 01 00 00 mov $0x1c8,%edx
40044a: be 7b 00 00 00 mov $0x7b,%esi // 123 in esi
40044f: b9 c8 01 00 00 mov $0x1c8,%ecx // 456 in ecx
400454: eb 12 jmp 400468 <main+0x28>
400456: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40045d: 00 00 00
400460: 83 fa 7b cmp $0x7b,%edx
400463: 89 ca mov %ecx,%edx
400465: 0f 45 d6 cmovne %esi,%edx // conditional move
400468: 83 e8 01 sub $0x1,%eax
40046b: 75 f3 jne 400460 <main+0x20>
40046d: b8 10 27 00 00 mov $0x2710,%eax
400472: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400478: 81 f2 b3 01 00 00 xor $0x1b3,%edx
40047e: 83 e8 01 sub $0x1,%eax // xoring
400481: 75 f5 jne 400478 <main+0x38>
400483: be 6c 06 40 00 mov $0x40066c,%esi
400488: bf 01 00 00 00 mov $0x1,%edi
40048d: 31 c0 xor %eax,%eax
40048f: e9 9c ff ff ff jmpq 400430 <__printf_chk@plt>
After adding time checks (and bumping the loop count to 100M), I get on my server (AMD Opteron 6272):
first: 0.089000s
second: 0.067000s
a=123
This is not very interesting though as there's no consumer of a that requires low latency data (so the calculations may buffer up and we're checking ALU BW, not latency)
Trying to add sum += a on every iteration resulted in increased delta in favor of the first -
first: 0.106000s
second: 0.066000s
But! Since a simple add isn't very time consuming in itself, I tried using the reciprocal (a float sum and += 1/a); this one would really need the data fast:
first: 0.014000s
second: 0.087000s
Finally, the ranking inverts :)
This goes to show that you can have different performance outcomes according to how a given operation is used in your program. There's not much sense in benchmarking a single method without the rest of the code (not that we don't do it, it's just that you need to take any result with a grain of salt).
Of course, this is all for the sake of discussion; most likely this bit of code is not even remotely a bottleneck...
So you benchmarked it and this is the bottleneck, right?
Oh, well, nope... then just forget about efficiency. This is already a very small expression that is evaluated quickly.
By the way, there are other methods, but I'm not sure that (1) they are really faster, (2) if they are faster, it really matters, and (3) if they are slower, the readability penalty is a worthwhile tradeoff.
For example:
#define FIRST 42
#define SECOND 1337
/* initialize */
int x = FIRST;
/* toggle */
x = FIRST + SECOND - x;

divdi3 division used for long long by gcc on x86

When gcc sees multiplication or division of integer types that isn't supported in hardware, it generates a call to a special library function.
http://gcc.gnu.org/onlinedocs/gccint/Integer-library-routines.html#Integer-library-routines
According to the link above, long __divdi3 (long a, long b) is used for division of longs. However, here http://gcc.gnu.org/onlinedocs/gcc-3.3/gccint/Library-Calls.html divdi is explained as a "call for division of one signed double-word". While the first source gives a clear mapping of the di suffix to long arguments, the second states divdi is for a double-word and udivdi for a full word (a single word, right?).
When I compile this simple example
int main(int argc, char *argv[]) {
    long long t1, t2, tr;
    t1 = 1;
    t2 = 1;
    tr = t1 / t2;
    return tr;
}
with gcc -Wall -O0 -m32 -march=i386 (gcc version 4.7.2),
the disassembly shows me
080483cc <main>:
80483cc: 55 push %ebp
80483cd: 89 e5 mov %esp,%ebp
80483cf: 83 e4 f0 and $0xfffffff0,%esp
80483d2: 83 ec 30 sub $0x30,%esp
80483d5: c7 44 24 28 01 00 00 movl $0x1,0x28(%esp)
80483dc: 00
80483dd: c7 44 24 2c 00 00 00 movl $0x0,0x2c(%esp)
80483e4: 00
80483e5: c7 44 24 20 01 00 00 movl $0x1,0x20(%esp)
80483ec: 00
80483ed: c7 44 24 24 00 00 00 movl $0x0,0x24(%esp)
80483f4: 00
80483f5: 8b 44 24 20 mov 0x20(%esp),%eax
80483f9: 8b 54 24 24 mov 0x24(%esp),%edx
80483fd: 89 44 24 08 mov %eax,0x8(%esp)
8048401: 89 54 24 0c mov %edx,0xc(%esp)
8048405: 8b 44 24 28 mov 0x28(%esp),%eax
8048409: 8b 54 24 2c mov 0x2c(%esp),%edx
804840d: 89 04 24 mov %eax,(%esp)
8048410: 89 54 24 04 mov %edx,0x4(%esp)
8048414: e8 17 00 00 00 call 8048430 <__divdi3>
8048419: 89 44 24 18 mov %eax,0x18(%esp)
804841d: 89 54 24 1c mov %edx,0x1c(%esp)
8048421: 8b 44 24 18 mov 0x18(%esp),%eax
8048425: c9 leave
8048426: c3 ret
Note the call 8048430 <__divdi3> at 8048414.
I can't use the gcc library for my project, and the project is multiplatform. I hoped I would not have to write all the __* functions for all platforms (speed doesn't matter), but now I'm a bit confused.
Can somebody explain why a __divdi3 (not __divti3) call is generated for long long int (64-bit) division?
On x86 machines, the term "word" usually implies a 16-bit value. More generally in the computer-science world, a word can denote values of virtually arbitrary length, with words of 10 or 12 bits not being uncommon in embedded systems.
I believe that the terminology you have hit upon is used on Linux/Unix systems just for the sake of unification at the level of the operating system and has nothing to do with the target platform of your build. An example of the same notation can be found in gdb, which uses w for the 32-bit word and hw for the 16-bit "half-word" (in the x86 sense).
Furthermore, this convention also extends easily to the standard IEEE-754 floating-point numbers, and is summarised in the few bullet points below:
s - single (precision, word) is used for four byte integers (int) / floats (float)
d - double (precision) for eight byte integers (long or long long) / floats (double)
t - ten bytes for integers (long long) / floats (long double)
This naming convention is used for all arithmetic built-ins, like __divsi3, __divdi3, __divti3 or __mulsi3, __muldi3, __multi3... (and all u - unsigned - variants). A complete list can be found here.
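As a hedged cross-check of the suffix convention (my addition, not part of the answer): the next size up does show up if you use a wider type. On an x86-64 target, dividing two __int128 values typically makes gcc emit a call to __divti3, just as the 32-bit build above emits __divdi3 for long long:
/* Compile with: gcc -O0 -S demo.c   (on an x86-64 target)
   and look for "call __divti3" in the generated assembly. */
__int128 div128(__int128 a, __int128 b)
{
    return a / b;   /* no 128-bit hardware divide, so a libgcc call is emitted */
}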
Division of 64-bit numbers on 32-bit machines uses an advanced (and somewhat difficult) algorithm. However, you can still use the algorithm principle you learned in school. Here's simple pseudo-code for it (have a look at this answer about big integers):
result = 0;
count = 0;
remainder = numerator;
while(highest_bit_of_divisor_not_set) {
    divisor = divisor << 1;
    count++;
}
while(remainder != 0) {
    if(remainder >= divisor) {
        remainder = remainder - divisor;
        result = result | (1 << count);
    }
    if(count == 0) {
        break;
    }
    divisor = divisor >> 1;
    count--;
}
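And here is a minimal runnable C version of the same shift-and-subtract idea (my sketch, not libgcc's actual __udivdi3, and with only token handling of division by zero):
#include <stdint.h>

uint64_t soft_udiv64(uint64_t num, uint64_t den)
{
    uint64_t quot = 0;
    int shift = 0;
    if (den == 0)
        return ~(uint64_t)0;               /* caller's problem, really */
    /* Shift the divisor up as far as it still fits under the numerator. */
    while (!(den & ((uint64_t)1 << 63)) && (den << 1) <= num) {
        den <<= 1;
        shift++;
    }
    /* Subtract it back down, setting one quotient bit per position. */
    for (;;) {
        if (num >= den) {
            num -= den;
            quot |= (uint64_t)1 << shift;
        }
        if (shift == 0)
            break;
        den >>= 1;
        shift--;
    }
    return quot;                            /* num now holds the remainder */
}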
