C code slower on Windows than on Linux

I'm working on a project that will have builds for Windows and Linux, 32 and 64 bit.
The project loads strings from a text file, processes them, and writes the results to an SQLite3 database.
On Linux it reaches almost 400k sequences per second, compiled by GCC without any optimization. On Windows, however, it gets stuck at 100k sequences per second, compiled by VS2010 without any optimization.
I tried enabling compiler optimizations, but nothing changed.
Is this right? Does C code run slower on Windows?
EDIT:
I think I need to be clearer on some points.
I ran tests with code optimization enabled AND disabled. Performance didn't change, probably because my program's bottleneck is the time spent reading data from the hard disk.
The program takes advantage of parallel computing: there is a queue where one thread enqueues processed data and another dequeues it to write to the SQLite database, so I don't think any performance is lost there.
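Roughly, the queue looks like this (a simplified sketch using pthreads, not my exact code):

    #include <pthread.h>

    #define QCAP 1024

    /* Bounded queue: the parser thread enqueues processed lines, the
       writer thread dequeues them and writes to the SQLite database. */
    typedef struct {
        char *items[QCAP];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_empty, not_full;
    } queue_t;

    void queue_init(queue_t *q) {
        q->head = q->tail = q->count = 0;
        pthread_mutex_init(&q->lock, NULL);
        pthread_cond_init(&q->not_empty, NULL);
        pthread_cond_init(&q->not_full, NULL);
    }

    void enqueue(queue_t *q, char *item) {
        pthread_mutex_lock(&q->lock);
        while (q->count == QCAP)                 /* block while full */
            pthread_cond_wait(&q->not_full, &q->lock);
        q->items[q->tail] = item;
        q->tail = (q->tail + 1) % QCAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    char *dequeue(queue_t *q) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)                    /* block while empty */
            pthread_cond_wait(&q->not_empty, &q->lock);
        char *item = q->items[q->head];
        q->head = (q->head + 1) % QCAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return item;
    }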

Is this right? Does C code run slower on Windows?
No. C doesn't have speed. It's the implementations of C that introduce speed. There are implementations that produce fast behaviour (generally "compilers that produce fast machine code") and implementations that produce slow behaviour for both Windows and Linux.
It isn't just Windows and Linux that are significant here, either. Some compilers optimise for specific processors, and will produce slow machine code for any other processors.
I tried enabling compiler optimizations, but nothing changed.
Testing speed without optimisations enabled makes no sense. However, this does tend to indicate that something else is slow. Perhaps the implementation that produced the SQLite3 client library on Windows is one that produces slow code. I'd start by rebuilding the lot (including the SQLite3 library) with full optimisations enabled. After that, you could use a profiler to determine where the difference is, and use the results to make targeted optimisations to your code.
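For example, to check whether the raw file-reading path alone explains the gap, you could time it in isolation on both platforms. A minimal sketch (the input file name is hypothetical):

    #include <stdio.h>
    #include <time.h>

    /* Measure raw line-reading throughput, independent of parsing and SQLite. */
    int main(void) {
        char buf[4096];
        long n = 0;
        FILE *f = fopen("input.txt", "r");
        if (!f) return 1;

        time_t t0 = time(NULL);
        while (fgets(buf, sizeof buf, f))
            n++;
        double secs = difftime(time(NULL), t0);
        fclose(f);

        printf("%ld lines in %.0f s (%.0f lines/s)\n",
               n, secs, secs > 0 ? n / secs : 0.0);
        return 0;
    }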

Related

What does gcc mean here?

I am studying a thesis. The paper describes a side-channel attack. It measures the cache miss rate when only the attacker's code is running, and again when other programs run on the same core as interference alongside the attack code. I found that the cache miss rate can be obtained through perf, but after thinking about it for a long time, I can't fully understand what this interference program is.
The following are the original words in the paper.
we also show the results when there is only the sender process running on the physical core (denoted by sender only) and the results with the sender sharing the physical core with a benign gcc workload (denoted by sender & gcc). When there is only the sender process, it has the smallest L1 miss rate. When it is sharing the core with a benign program, the benign program, e.g., the gcc, will cause contention in the cache.
What does the interference program mentioned here refer to? Is it some benign C code, or is it gcc itself compiling C code?
If the interference is running gcc to compile C code, gcc finishes the compile in an instant. How can we make it run for a long time? This may be a very basic question, but I haven't figured it out after thinking about it for a long time.
The URL of the paper is:
https://caslab.csl.yale.edu/publications/xiong2020leaking.pdf
Thank you to everyone who is willing to provide suggestions.
“gcc” refers to the primary command of the GNU Compiler Collection. There are several reasons the running time of gcc need not be a problem for the work in the paper:
- The short running time of gcc may suffice for the work in the paper.
- You may experience gcc as taking a short time for the programs you use it for, but it may take much longer for larger programs in more complicated languages.
- gcc can be asked to operate on many source files in a single command.
- gcc may be run repeatedly during the work in the paper (see the sketch after this list).
- You may be accustomed to running programs on a computer used exclusively by you. Commonly, in environments where side-channel attacks are a concern, programs are run on “server” systems concurrently by multiple users. Such systems may be continuously busy. (This is not to exclude such attacks being a concern on a system normally used by a single person, where, aside from the software desired by the rightful user, some software is executing at the instigation of a malicious person.)
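On the repeated-runs point: keeping the "benign gcc workload" alive for the whole measurement window can be as simple as re-running the compiler in a loop. A sketch of such a driver (the source file name workload.c is hypothetical):

    #include <stdlib.h>

    /* Re-run gcc indefinitely so the compile workload stays resident on the
       shared physical core for as long as the measurement needs. */
    int main(void) {
        for (;;) {
            if (system("gcc -O2 -c workload.c -o /dev/null") != 0)
                break;   /* stop if the compile starts failing */
        }
        return 0;
    }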

Will it be feasible to use gcc's function multi-versioning without code changes?

According to most benchmarks, Intel's Clear Linux is way faster than other distributions, mostly thanks to a GCC feature called Function Multi-Versioning (FMV). Right now the method they use is to compile the code, analyze which functions contain vectorized loops, then patch the code with FMV attributes and compile it again.
How feasible would it be for GCC to do this automatically? For example, by passing -mmultiarch=sandybridge,skylake (or a similar -m option listing CPU extensions like AVX and AVX2).
Right now I'm interested in two usage scenarios:
Use this option for our large math-heavy program when delivering releases to our customers. I don't want to pollute the code with non-standard attributes, and I don't want to modify the third-party libraries we use.
Other Linux distributions would be able to do this easily, without patching the code as Intel does. This could give all Linux users massive performance gains.
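For reference, the attribute patching Clear Linux applies looks roughly like this, using GCC's target_clones attribute (the function itself is illustrative; requires GCC 6+ on a platform with ifunc support):

    #include <stddef.h>

    /* GCC emits one clone per listed target plus a resolver that picks
       the best clone for the running CPU at load time. */
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    double dot(const double *a, const double *b, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }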
No, but it doesn't matter. There's very, very little code that will actually benefit from this; for the most part by doing it globally you'll just (without special effort to sort matching versions in pages together) make your system much more memory-constrained and slower due to the huge increase in code size. Most actual loads aren't even CPU-bound; they're syscall-overhead-bound, GPU-bound, IO-bound, etc. And many of the modern ones that are CPU-bound aren't running precompiled code but JIT'd code (i.e. everything running in a browser, whether that's your real browser or the outdated and unpatched fork of Chrome in every Electron app).

I turned on compiler optimization and my multithreaded C program imploded aggressively, any articles I can read on this?

I'm using MinGW, which is gcc for Windows. My program involves multiple windows, two different main threads, and several worker threads in a thread pool for overlapped network I/O.
It works perfectly fine without compiler optimization.
A) Is compiler optimization even necessary? My program's already very fast. Is it at all likely that it will provide a significant improvement?
B) Are there any articles on how to properly build a multithreaded program so compiler optimization can do its job?
“Imploded aggressively” is a bit weird (is your program a controller for a fission bomb?), but I understand that your program behaved as desired without compiler optimizations and misbehaved mysteriously with compiler optimizations.
The technical term for this is that your program is buggy.
Multithreaded programming is intrinsically hard. Multithreaded programming when the threads share memory is very hard; it's the masochist way of concurrent programming (message passing is a lot easier to get right). You don't just need to read an article or two, you need to read several books and get a few years' programming experience.
You were unlucky that your program seemed to work without optimizations. It probably wouldn't work on a different machine where the timings are a bit different, or with a different compiler, or on a different operating system, either. So you ended up wasting your time thinking your program worked. But it doesn't. A compiler transforms correct source code into correct executables, no matter what optimization level you choose.¹
¹ Barring compiler bugs, sure. But the odds are very strongly stacked against you.
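To make this concrete, here is the classic shape of such a bug (a sketch, not your code): a plain shared flag with no synchronization. At -O0 it often appears to work; at -O2 the compiler may legitimately read the flag once, hoist the load out of the loop, and spin forever. The fix is C11 atomics (or your platform's interlocked primitives), not disabling the optimizer.

    #include <pthread.h>
    #include <stdio.h>

    int done = 0;                 /* BUG: plain int shared across threads */

    void *worker(void *arg) {
        (void)arg;
        done = 1;                 /* unsynchronized write: a data race */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        while (!done)             /* optimizer may load `done` once and loop forever */
            ;
        pthread_join(t, NULL);
        puts("worker finished");
        return 0;
    }

Declaring the flag as _Atomic int (C11) makes the loop correct at every optimization level.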
99.9% of all failures that show up in one optimization mode and not another are due to serious bugs. Multithreading races etc. are very sensitive to code timing: an instruction reorder or loop shortcut can turn a passing test into a debugging nightmare.
I'm assuming that the server runs up OK and detonates under load in apparently different places, making conventional debugging useless?
You are going to have to rely on logging and on changing the test conditions to narrow down the point of ignition. My guess is this is going to be a Heisenbug that mutates with changes to the code, optimization options, load profile, buffer sizes, etc.
Not fixing the problem is not a good plan, since it will just show up in another form on next year's boxes with more cores etc. Even with optimization off, it's still there, lurking, waiting for the opportunity to strike.
I hope I'm providing some comfort.
Seriously - log everything you can with a good logger - one that queues up the log entries so as to keep disk latency out of the main app. Change things around to try to make the bug mutate and perhaps show up in the non-optimized build too. Write down (type in) absolutely everything that you do and what happens after any change, good or bad. Making the bug worse is actually better than making its symptoms go away (without knowing exactly why). Try the server on various hardware configs if you can.
Eventually, you will find the bug!
You have one thing going for you - it seems that you can reliably reproduce the problem. That, in itself, is a massive plus.
Forgot to ask - apart from the nuclear-explosive metaphor, what is the main symptom? Is it AV'ing/segfaulting all over the place, or is it locked up or livelocked?
To answer part "A" of your question, the unoptimized version of your code still has the concurrency bugs in it, but the timing of how the threads run is such that the bugs have not yet been exposed with your test workloads. The current version of the unoptimized program will eventually fail in use, so you will need to fix the concurrency bugs before using the program for real work.

Timing Kernel Executions on CUDA

I've used code from the CUDA C Best Practices Guide to implement an execution timer. However, there is something strange, and I don't know if it's an anomaly or normal: I get different readouts each time I run my CUDA app.
Could these readings be related to the design, or is this something I should expect?
I'm not running any graphic intensive applications on my machine, other than Windows 7.
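The timing pattern I took from the guide is roughly this (a sketch, with the actual kernel launch elided):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        cudaEvent_t start, stop;
        float ms = 0.0f;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        /* ... launch the kernel under test here ... */
        cudaEventRecord(stop, 0);

        cudaEventSynchronize(stop);             /* wait so the reading is valid */
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }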
Well, it depends on how big the differences are. One source of anomalies is the OS scheduler: it may just happen that the scheduler gives some extra timeslices to the kernel calls (because graphics API calls have error checking involved), which shows up as extra execution time. If the differences are very large I would say check your code, but if they are on the order of milliseconds I wouldn't worry about it; ±10 ms is the usual timeslicing quantum in most OSes (Windows probably included).
Also, Aero is fairly intensive, so it may be adding to the discrepancies you are seeing.
I've used code from the CUDA C Best Practices Guide to implement an execution timer.
Yeah, well, that's not a "best practice" in my experience.
I suggest using the nvprof profiler instead for your device-side code and CUDA Runtime API calls (it also works relatively well, I think, for your own host-side code). It'll take you a bit of hassle to set up and figure out which options you want to use, but it's worth it.

C compiler producing lightweight executables

I'm currently using MSVC for C++, but as I'm switching to C to write a very performance-intensive program (an interpreter), I have to find a fitting C compiler.
I've looked at some binaries produced by Turbo C, and even though it's old, they seem pretty straightforward and optimized.
Now I don't know what the best compiler for building an interpreter is, but maybe you can help me.
I've considered GCC, but as I don't know much about it, I can't be really sure.
99.9% of a program's performance depends on the code you write and the language you choose.
You can safely ignore the performance of the compiler.
Stick to MSVC... and don't waste time :)
If I were you, I would take the approach of worrying less about the compiler and worrying more about your own code. Write the code for the interpreter in a reasonable way. Then, profile it, and optimize spots based on how much time they take. That is more likely to produce a performance benefit than using a particular compiler.
If you want a lightweight program, it is not the compiler you need to worry about so much as the code you write and the libraries you use. Most compilers will produce similar results from the same source code.
For example, using C++ with MFC, a basic Windows application used to start off at about 900 kB and grow rapidly. Linking with the dynamic MFC DLLs would get you down to a few hundred kB. But by dropping MFC totally - using Win32 APIs directly - and using a minimal C runtime, it was relatively easy to implement the same thing in an .exe of about 25 kB or less (IIRC - it's been a long time since I did this).
So ditch the libraries and get back to proper low level C (or even C++ if you don't use too many "clever" features), and you can easily write very compact applications.
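For a sense of scale, the "Win32 APIs directly" version of a trivial program is just this (a sketch; compiles as plain C under MSVC or MinGW, no MFC anywhere):

    #include <windows.h>

    /* No MFC, no C++ runtime: the whole application is one API call. */
    int WINAPI WinMain(HINSTANCE hInst, HINSTANCE hPrev, LPSTR cmdLine, int nShow) {
        (void)hInst; (void)hPrev; (void)cmdLine; (void)nShow;
        MessageBoxA(NULL, "Tiny executable, raw Win32 only", "demo", MB_OK);
        return 0;
    }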
edit
I've just realised I was confused by the question title into talking about lightweight applications as opposed to concentrating on performance, which appears to be the real thrust of the question. If you want performance, there is no specific need to use C, or to move to a painful development environment - just write good, high-performance code. Fundamentally this is about using the right designs and algorithms, and then profiling and optimising the resulting code to eliminate bottlenecks and inefficiencies. Note that these days you may get a much bigger bang for your buck by switching to a multithreaded approach than by concentrating on raw code optimisation alone - make sure you utilise the hardware well.
You can use GCC through MinGW, Eclipse CDT, or one of the other Windows ports. You can optimize for executable size, speed of the resulting executable, or speed of compilation.
C++ was designed to be largely backward compatible with C, so any C++ compiler should be able to compile pure C. You might want to tell it that the source is C and not C++, so the compiler doesn't do name mangling, etc. If a compiler is very good with C++, it should be equally good, or better, with C, because C is much simpler.
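If you do mix the two, the standard way to tell the compiler "this interface is C" is an extern "C" guard in the header, so the same declarations work without name mangling from both languages (the function name is illustrative):

    /* interp.h - callable from C and C++ without name mangling */
    #ifdef __cplusplus
    extern "C" {
    #endif

    void interp_run(const char *source);

    #ifdef __cplusplus
    }
    #endif

MSVC also treats a source file as C automatically when it ends in .c, or everywhere when /TC is passed.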
I would suggest sticking with MSVC. It's a very decent system. Though if you are not convinced - compare your options. Build the same program with multiple compilers, look at the assembly they produce, measure the resulting executable performance, etc.
