For the sake of a presentation at work, I wanted to compare the performance of NodeJS to C. Here is what I wrote:
Node.js (for.js):
var d = 0.0,
    start = new Date().getTime();
for (var i = 0; i < 100000000; i++)
{
    d += i >> 1;
}
var end = new Date().getTime();
console.log(d);
console.log(end - start);
C (for.c):
#include <stdio.h>
#include <time.h>

int main () {
    clock_t start = clock();
    long d = 0.0;
    for (long i = 0; i < 100000000; i++)
    {
        d += i >> 1;
    }
    clock_t end = clock();
    clock_t elapsed = (end - start) / (CLOCKS_PER_SEC / 1000);
    printf("%ld\n", d);
    printf("%lu\n", elapsed);
}
Using GCC I compiled my for.c and ran it:
gcc for.c
./a.out
Results:
2499999950000000
198
Then I tried it in NodeJS:
node for.js
Results:
2499999950000000
116
After running both numerous times, I discovered this held true no matter what. If I switched for.c to use a double instead of a long in the loop, C took even longer!
Not trying to start a flame war, but why is Node.js (116 ms) so much faster than native C (198 ms) at performing this same operation? Is Node.js applying an optimization that GCC does not do out of the box?
EDIT:
Per a suggestion in the comments, I ran gcc -Wall -O2 for.c. The result improved to C taking 29 ms. This begs the question, how is it that the native C settings are not optimized as much as a Javascript compiler? Also, what are -Wall and -O2 doing? I'm really curious about the details of what is going on here.
This begs the question, how is it that the native C settings are not optimized as much as a Javascript compiler?
C is statically compiled and linked, which can require a lengthy build step over your entire codebase (I once worked in one that took almost an hour for a full optimized build, but only 10 minutes otherwise). It is also a dangerous, hardware-level language that risks a lot of undefined behavior if you don't treat it with care. Because of that, the default settings of compilers usually don't optimize to smithereens: the default is a developer/debug build intended to help with debugging and productivity through faster turnaround.
So in C you get a distinct separation between an unoptimized but faster-to-build, easier-to-debug developer/debug build and a very optimized, slower-to-build, harder-to-debug production/release build that runs really fast, and the default settings of compilers often favor the former.
With something like V8/Node.js, you're dealing with a just-in-time compiler (dynamic compilation) that builds and optimizes only the necessary code on the fly at run-time. On top of that, JS is a much safer language, often designed with security in mind, and it doesn't let you work with the raw bits and bytes of the hardware.
As a result, it doesn't need that kind of strong release/debug build distinction of a native, statically-compiled language like C/C++. But it also doesn't let you put the pedal to the metal as you can in C if you really want.
A lot of people trying to benchmark C/C++ coming from other languages often fail to understand this build distinction and the importance of compiler/linker optimization settings and get confused. As you can see, with the proper settings, it's hard to beat the performance of these native compilers and languages that allow you to write really low-level code.
Adding the register keyword helps, as expected:
#include <stdio.h>
#include <time.h>

int main () {
    register long i, d;
    clock_t start = clock();
    i = d = 0L;
    for (i = 0; i < 100000000L; i++) {
        d += i >> 1;
    }
    clock_t end = clock();
    clock_t elapsed = (end - start) / (CLOCKS_PER_SEC / 1000);
    printf("%ld\n", d);
    printf("%lu\n", elapsed);
}
and compile with the C compiler
cc for.c -o for
./for ; node for.js
returns
2499999950000000
97
2499999950000000
222
I've also done some testing calculating prime numbers, and I've found that Node.js is about twice as fast as C; see here.
When you have a very simple counting type of loop, the -O2 optimization can sometimes convert the output to a simple formula without even iterating the loop (see Karl's Blog for an explanation). If you add something more complicated to the routine, it is likely Node.js will be faster again. For example, I added a divisor term to your sample program, and the C -O2 optimization was no longer able to convert it to a simple formula, so Node.js became faster again.
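To make the "simple formula" point concrete, here is a rough sketch (my own illustration, not GCC's actual output) of the closed form the benchmark loop can be reduced to: for an even iteration count N, summing i >> 1 for i from 0 to N-1 is just (N/2) * (N/2 - 1), so the loop never has to run at all.

#include <stdio.h>

int main(void)
{
    /* Illustration only: the loop "for (i = 0; i < N; i++) d += i >> 1;"
       sums floor(i/2), which for even N has the closed form (N/2)*(N/2 - 1).
       An optimizer that recognizes this can skip the loop entirely. */
    long N = 100000000L;
    long d = (N / 2) * (N / 2 - 1);
    printf("%ld\n", d);    /* prints 2499999950000000, same as the loop */
    return 0;
}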
I am still baffled as to how Node.js can be faster than C at simple integer calculations, but in every test I've done so far it is. I've also performed some tests with bitwise calculations that I haven't posted yet, and Node.js was still faster.
Node.js and C are distinct in that Node.js interprets (and JIT-compiles) JavaScript, whereas C is compiled ahead of time to machine language, so the two are handled differently. For Node.js, you simply run the .js file. C is much different: when you compile code to machine language with GCC, you choose the compiler optimization settings, aka "flags", yourself. Your Node.js program is actually slower than the C program if you specify the -O3 flag to GCC; in fact, the Node.js program took twice as long for me as the C program did. You stated that you would like to know more about what the C compiler does to optimize code. This is a complex topic/field, and I highly recommend reading this Wikipedia article on compiler optimization to learn more.
In short, you made an unfair comparison because you did not optimize your C code.
Say I have the OpenCL kernel,
/* Header to make Clang compatible with OpenCL */

/* Test kernel */
__kernel void test(long K, const global float *A, global float *b)
{
    for (long i=0; i<K; i++)
        for (long j=0; j<K; j++)
            b[i] = 1.5f * A[K * i + j];
}
I'm trying to figure out how to compile this to a binary which can be loaded into OpenCL using the clCreateProgramWithBinary command.
I'm on a Mac (Intel GPU), and thus I'm limited to OpenCL 1.2. I've tried a number of different variations on the command,
clang -cc1 -triple spir test.cl -O3 -emit-llvm-bc -o test.bc -cl-std=cl1.2
but the binary always fails when I try to build the program. I'm at my wits' end with this; it's all so confusing and poorly documented.
The performance of the above test function can, in regular C, be significantly improved by applying the standard LLVM compiler optimization flag -O3. My understanding is that this optimization flag somehow takes advantage of the contiguous memory access pattern of the inner loop to improve performance. I'd be more than happy to listen to anyone who wants to fill in the details on this.
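For what it's worth, here is one plausible reduction sketched in plain C (my own guess, not a verified account of what -O3 does to this kernel): the inner loop overwrites b[i] on every iteration, so only the final store survives, and if the compiler can prove A and b don't alias (e.g. via restrict) it can drop the inner loop entirely.

/* Hypothetical plain-C equivalent of the kernel after dead-store elimination;
   assumes K > 0 and that A and b do not alias. */
void test_reduced(long K, const float *restrict A, float *restrict b)
{
    for (long i = 0; i < K; i++)
        b[i] = 1.5f * A[K * i + (K - 1)];   /* only the last j iteration matters */
}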
I'm also wondering how I can first convert to SPIR code, and then convert that to a buildable binary. Eventually I would like to find a way to apply the -O3 compiler optimizations to my kernel, even if I have to manually modify the SPIR (as difficult as that will be).
I've also gotten the SPIRV-LLVM-Translator tool working (as far as I can tell), and ran,
./llvm-spirv test.bc -o test.spv
and this binary fails to load at the clCreateProgramWithBinary step, I can't even get to the build step.
Possibly SPIRV doesn't work with OpenCL 1.2, and I have to use clCreateProgramWithIL, which unfortunately doesn't exist for OpenCL 1.2. It's difficult to say for sure why it doesn't work.
Please see my previous question here for some more context on this problem.
I don't believe there's any standardised bitcode file format that's available across implementations, at least at the OpenCL 1.x level.
As you're talking specifically about macOS, have you investigated Apple's openclc compiler? This is also what Xcode invokes when you compile a .cl file as part of a target. The compiler is located in /System/Library/Frameworks/OpenCL.framework/Libraries/openclc; it does have comprehensive --help output but that's not a great source for examples on how to use it.
Instead, I recommend you try the OpenCL-in-Xcode tutorial, and inspect the build commands it ends up running:
https://developer.apple.com/library/archive/documentation/Performance/Conceptual/OpenCL_MacProgGuide/XCodeHelloWorld/XCodeHelloWorld.html
You'll find it produces bitcode files (.bc) for 4 "architectures": i386, x86_64, "gpu_64", and "gpu_32". It also auto-generates some C code which loads this code by calling gclBuildProgramBinaryAPPLE().
I don't know if you can untangle it further than that but you certainly can ship bitcode which is GPU-independent using this compiler.
I should point out that OpenCL is deprecated on macOS, so if that's the only platform you're targeting, you really should go for Metal Compute instead. It has much better tooling and will be actively supported for longer. For cross-platform projects it might still make sense to use OpenCL even on macOS, although for shipping kernel binaries instead of source, it's likely you'll have to use platform-specific code for loading those anyway.
Can link-time optimization achieve the same results as compile-time optimization, or are there some optimizations that can only be done at compile time (and therefore only work within a compilation unit)? I ask because in C the compilation unit is a source file, and I'm trying to understand whether there is any reason not to split source code into separate files in some circumstances (e.g. an optimization that could have been done if all the source were in one file is not done).
A typical (simplified) compile might look like
1) Pre-process
2) Parse code to internal representation
3) Optimize code
4) Emit assembly language
5) Assemble to .o file
6) Link .o file to a.out
LTO is typically achieved by dumping the internal compiler representation to disk between steps 2 and 3, then going back during the final link (step 6) and performing steps 3-5. This can vary with the compiler and version, however. If it follows this pattern, then you would see LTO as equivalent to compile-time optimization.
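As a concrete (hypothetical) example of an optimization that needs either a shared compilation unit or LTO, consider a tiny helper defined in its own file; with GCC's -flto the call can still be inlined at the final link:

/* square.c -- hypothetical helper in its own compilation unit */
int square(int x)
{
    return x * x;
}

/* main.c -- calls square() from another file */
#include <stdio.h>

int square(int x);               /* normally declared in a header */

int main(void)
{
    long sum = 0;
    for (int i = 0; i < 1000; i++)
        sum += square(i);        /* candidate for cross-file inlining */
    printf("%ld\n", sum);
    return 0;
}

/* Without LTO (gcc -O2 -c square.c main.c && gcc square.o main.o),
 * square() cannot be inlined into main(), because the optimizer never
 * sees both function bodies at once.  With LTO (add -flto to both the
 * compile and link commands), the intermediate representation is kept
 * in the .o files and the inlining happens at link time, as described
 * above. */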
However ...
Having very large source files can be annoying -- Emacs starts to choke on source files >10MB.
If you are in a multi-user development environment, depending on your SCM you may have a lot of trouble if multiple engineers are working on the same file.
If you use a distributed build system, you perform compiles in parallel. So if it takes 1 second each to compile and optimize a file, and you have 1000 files and 1000 build agents, your total compile time is 1 second. If you do all your optimization for all 1000 files during the final link, you will have 999 agents sitting idle and 1 agent spending an eternity doing all your optimization.
academic example:
#include <stdio.h>

#define MAX 100

void doSomething(void) { printf("doing something\n"); }

void fun(int i);

int main(void)
{
    int i;
    for (i = 0; i < MAX; i++) {
        fun(i);
    }
    return 0;
}

void fun(int i)
{
    if (i == 0) {
        doSomething();
    }
}
If fun is in the same compilation unit and data-flow analysis is enabled, the for-loop could be optimized down to a single function call.
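Roughly speaking (my own sketch, not actual compiler output), once fun is inlined the optimizer can see that only the i == 0 iteration has any effect, so main can collapse to something like:

int main(void)
{
    doSomething();   /* the only iteration whose body does anything */
    return 0;
}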
BUT: I would stay with MooseBoys' comment.
I've been rewriting a program that is a mix of Fortran and C, which is around 10k lines of random particle simulation. However, I recently realised that Release mode was running a lot slower than Debug.
Debug (-O0): 23 seconds. Release (-O1 or -O2): 43 seconds, for a small test run.
This only changes when the C optimization settings are changed within Project->Properties->C/C++ Build->GCC C Compiler->Optimization Level, and is unaffected by the GNU Fortran compiler settings.
Looking into this, it seems that -O1 and -O2 run slowly, while -O0 runs a lot faster. Even with all optimization flags set manually (from GCC Docs), it still runs faster than -O1.
It may be that running in Release causes different results, which cause extra computations to be made (values being outside of their expected ranges, etc.). Would this be likely? And if so, would it be possible to change the behaviour back to match the original Debug settings?
Any help would be appreciated, let me know if you need more information to help.
Chris.
Edit: System Information:
Windows 8.1 Pro
GCC version 4.8.1
Eclipse for Parallel Application Developers, Juno SR2
Okay, it turns out that this occurs for version 4.8.1 of gcc, but not later versions (it is fixed in 4.8.3). For both the equivalent C and Fortran codes, using -O1 or higher caused the compiler to incorrectly optimise the code.
To simplify the code, the incorrect result occurs from the lines:
a = b - c;
c = c + a;
Which, if computed exactly, would be c = b + (c - c), reducing to c = b. But it seems instead to be calculated as c = c + (b - c), evaluated as written with rounding at each step.
This creates a very slight inaccuracy (-0.0999999978 compared to -0.1000000015).
In this case, a loop that should have run 2 times instead ran 1500 times, and this happened over 500 times across the run.
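A tiny standalone illustration of the kind of rounding difference involved (made-up values, not the actual simulation code): in exact arithmetic c + (b - c) equals b, but in float arithmetic the two can differ slightly.

#include <stdio.h>

int main(void)
{
    /* Made-up values chosen so the subtraction loses low-order bits. */
    float b = -0.1f;
    float c = 3.0f;

    float a = b - c;            /* rounded to float precision */
    float r = c + a;            /* what the code computes: c + (b - c) */

    printf("b = %.10f\n", b);   /* -0.1000000015 */
    printf("r = %.10f\n", r);   /* -0.0999999046: close to b, but not equal */
    return 0;
}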
The way it was (accidentally) fixed was to force the compiler to calculate a = b - c before running c = c + a. This can be done with printf("%f\n",a); or c = c + 0 between calculating a and c (both seemed to work). Note, this line does not ever have to run. You can put it into an if statement that is never true (e.g. if (a == INT_MAX)), but not somewhere the compiler knows it will never run (e.g. if (false)).
At least these are my and my supervisor's current thoughts on the matter. I'm still not exactly sure what part of the optimisation causes this, or what fixed it in the most recent versions, but I hope this can be helpful for someone else and save them a week of confusion.
Thanks for the help anyway. - Chris.
I am trying a basic microbenchmark comparison of C with OCaml. I have heard that for the Fibonacci program, C and OCaml are about the same, but I can't replicate those results. I compile the C code with gcc -O3 fib.c -o c-code, and compile the OCaml code with ocamlopt -o ocaml-code fibo.ml. I am timing by using time ./c-code and time ./ocaml-code. Every time I do this, OCaml takes 0.10 seconds whereas the C code takes about 0.03 seconds. Besides the fact that this is a naive benchmark, is there a way to make the OCaml faster? Can anyone see what the times on their computers are?
C
#include <stdio.h>

int fibonacci(int n)
{
    return n<3 ? 1 : fibonacci(n-1) + fibonacci(n-2);
}

int main(void)
{
    printf("%d", fibonacci(34));
    return 0;
}
OCaml
let rec fibonacci n = if n < 3 then 1 else fibonacci(n-1) + fibonacci(n-2);;
print_int(fibonacci 34);;
The ML version already beats the C version when compiled with gcc -O2, which I think is a pretty decent job. Looking at the assembly generated by gcc -O3, it looks like gcc is doing some aggressive inlining and loop unrolling. To make the code faster, I think you would have to rewrite the code, but you should focus on higher level abstraction instead.
I think this is just overhead in OCaml; it would be more relevant to compare with a larger program.
You can use the -S option to produce assembly output, along with -verbose to see how ocamlopt calls external programs (gcc). Additionally, compiling with the -p option and running your application through gprof will help determine whether this is overhead from OCaml or something you can actually improve.
Cheers.
For my computer I get the following,
ocaml - 0.035 (std-dev=0.02; 10 trials)
c - 0.027 (std-dev=0.03; 10 trials)
Background: I'm trying to create a pure D language implementation of functionality that's roughly equivalent to C's memchr but uses arrays and indices instead of pointers. The reason is so that std.string will work with compile time function evaluation. For those of you unfamiliar w/ D, functions can be evaluated at compile time if certain restrictions are met. One restriction is that they can't use pointers. Another is that they can't call C functions or use inline assembly language. Having the string library work at compile time is useful for some compile time code gen hacks.
Question: How does memchr work under the hood to perform as fast as it does? On Win32, anything that I've been able to create in pure D using simple loops is at least 2x slower even w/ obvious optimization techniques such as disabling bounds checking, loop unrolling, etc. What kinds of non-obvious tricks are available for something as simple as finding a character in a string?
I would suggest taking a look at GNU libc's source. As with most functions, it contains both a generic optimized C version of the function and optimized assembly-language versions for as many supported architectures as possible, taking advantage of machine-specific tricks.
The x86-64 SSE2 version combines the results from pcmpeqb on a whole cache-line of data at once (four 16B vectors), to amortize the overhead of the early-exit pmovmskb/test/jcc.
gcc and clang are currently incapable of auto-vectorizing loops with if() break early-exit conditions, so they make naive byte-at-a-time asm from the obvious C implementation.
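A rough sketch in C with SSE2 intrinsics (my own simplification: one 16-byte vector per iteration instead of a whole cache line, and it assumes an aligned pointer and a length that is a multiple of 16) of the pcmpeqb/pmovmskb pattern described above:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

const void *memchr_sse2_sketch(const void *s, int c, size_t n)
{
    const __m128i needle = _mm_set1_epi8((char)c);       /* broadcast the byte */
    const unsigned char *p = s;

    for (size_t i = 0; i < n; i += 16) {
        __m128i chunk = _mm_load_si128((const __m128i *)(p + i));
        __m128i eq    = _mm_cmpeq_epi8(chunk, needle);    /* pcmpeqb           */
        int mask      = _mm_movemask_epi8(eq);            /* pmovmskb          */
        if (mask != 0)                                    /* early exit on hit */
            return p + i + __builtin_ctz(mask);           /* lowest set bit    */
    }
    return NULL;
}

(__builtin_ctz is a GCC/Clang builtin; the real routine also has to handle unaligned starts and tail bytes, and, as described above, tests several vectors per iteration before branching.)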
This implementation of memchr from newlib is one example of an optimized memchr: it reads and tests 4 bytes at a time (other string functions from the newlib library are there as well).
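The heart of that 4-bytes-at-a-time approach is a bit trick for detecting whether any byte in a word is zero; here is a simplified sketch of the idea (not newlib's exact code), where the word read from the string is first XORed with the target byte repeated four times so that a match becomes a zero byte:

#include <stdint.h>

/* Nonzero iff some byte of the 32-bit word v is zero (the classic "haszero" trick). */
#define HAS_ZERO_BYTE(v) ((((v) - 0x01010101u) & ~(v) & 0x80808080u) != 0)

/* Does the 32-bit word w contain the byte c anywhere? */
static int word_contains_byte(uint32_t w, unsigned char c)
{
    uint32_t pattern = 0x01010101u * c;     /* broadcast c into all four bytes */
    return HAS_ZERO_BYTE(w ^ pattern);      /* a matching byte XORs to zero    */
}

The search loop then reads the string one aligned word at a time and only drops to a byte-by-byte check once this test fires.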
Incidentally, most of the source code for the MSVC run-time library is available as an optional part of the MSVC installation, so you could look at that too.
Here is FreeBSD's (BSD-licensed) memchr() from memchr.c. FreeBSD's online source code browser is a good reference for time-tested, BSD-licensed code examples.
void *
memchr(s, c, n)
    const void *s;
    unsigned char c;
    size_t n;
{
    if (n != 0) {
        const unsigned char *p = s;
        do {
            if (*p++ == c)
                return ((void *)(p - 1));
        } while (--n != 0);
    }
    return (NULL);
}
memchr, like memset and memcpy, generally reduces to a fairly small amount of machine code. You are unlikely to be able to reproduce that kind of speed without inlining similar assembly code. One major issue to consider in an implementation is data alignment.
One generic technique you may be able to use is to insert a sentinel at the end of the string being searched, which guarantees that you will find it. It allows you to move the test for the end of the string from inside the loop to after the loop.
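A minimal sketch of the sentinel idea, using a hypothetical helper that (unlike memchr, which takes a const pointer) is allowed to write to the buffer it searches:

#include <stddef.h>

/* Returns the index of the first occurrence of c in buf[0..n-1],
   or (size_t)-1 if it is not present.  Temporarily overwrites the
   last byte, so the buffer must be writable. */
size_t find_byte_with_sentinel(unsigned char *buf, size_t n, unsigned char c)
{
    if (n == 0)
        return (size_t)-1;

    unsigned char last = buf[n - 1];   /* remember the real last byte          */
    buf[n - 1] = c;                    /* plant the sentinel: a hit is certain */

    size_t i = 0;
    while (buf[i] != c)                /* no "i < n" test inside the loop      */
        i++;

    buf[n - 1] = last;                 /* undo the sentinel                    */

    if (i < n - 1 || last == c)        /* a real match, not just the sentinel? */
        return i;
    return (size_t)-1;
}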
GNU libc definitely uses the assembly version of memchr() (on any common Linux distro). This is why it is so unbelievably fast. For example, if we count lines in an 11 GB file (like wc -l does), it takes around 2.5 seconds with the assembly version of memchr() from GNU libc. But if we replace the memchr() assembly call with, say, the memchr() C implementation from FreeBSD, the speed drops to around 30 seconds. That is roughly equivalent to replacing memchr() with a plain loop that compares one char after another.
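For reference, a line-counting workload like the one described boils down to a loop like this sketch, where essentially all the time is spent inside memchr():

#include <stddef.h>
#include <string.h>

/* Count newline characters in a buffer with repeated memchr() calls,
   which is essentially what a wc -l style tool spends its time doing. */
size_t count_lines(const char *buf, size_t len)
{
    size_t lines = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end && (p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        lines++;
        p++;                  /* resume the search just past this newline */
    }
    return lines;
}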