MPI IO very slow. What could be the cause? - c

I have just converted a program to make use of MPI calls for use on multiple nodes but I am having a problem getting IO to work well with MPI calls.
I am using standard MPI-2 I/O routines such as MPI_File_open and MPI_File_write to write my final results to a file. On my laptop, I see a slight speedup (0.2s -> 0.1s), but on the University's supercomputer my file writing speed becomes abysmal (0.2s -> 90s!).
I can't understand why performance would be so bad on the supercomputer but improved on my laptop. Is there something I am overlooking which would heavily contribute to the slow speed?
Some Notes:
The file system on my laptop is ext4 and the one used by the University is NFS.
I am using Open MPI 1.4.4 on the supercomputer and Open MPI 1.4.5 on my laptop.
I have to change the process's file view multiple times using MPI_File_set_view due to a requirement set in the guidelines, which I don't think I can get around.
I have tried using the asynchronous version of write, MPI_File_iwrite, but this actually gives worse results.

Related

CLion uses system memory excessively

I recently started to use CLion, on Windows 7 64-bit, for editing C files.
One thing that bothers me a lot is that it uses too much system memory. It doesn't cause an out-of-memory error, as asked about in another question. Actually, CLion shows much lower memory consumption in the IDE (~500 MB out of ~2000 MB) than it takes from the system (~1000 MB). You can see a snapshot of the system memory usage and CLion's memory display below:
I use CLion not for C++ but for C projects. My project isn't that big (~5 .c files < 300 lines and ~10 .h files). I don't use it to compile the project, I just use it for editing. During the snapshot no user program was being run by it, and CLion wasn't showing any running processes (indexing etc.). It is a general behaviour.
I'm not sure if what I experience is expected/normal, or whether it is caused by my system setup, project settings or the way I use the IDE.
Are there any known causes of excessive memory usage? Can you suggest practices to decrease memory usage?
The post is 2 years old, but I am also having this issue with CLion 2018.1, and I imagine, others do, too. Some tips that worked for me:
Excluding directories from indexing.
Deleting source files I don't need.
Resolving a circular dependency between two classes. (Note: I can't vouch it was exactly that, because I tried several things at once, and it seems odd that such a powerful IDE would be affected by such an issue, but I can't rule it out.)
If it's really bad, the indexing can be paused. Guaranteed to reduce the memory usage. Of course, the intelligent completion won't work then.
Currently the RAM usage is stable at ~1 GB with RocksDB, RapidJson, and ~50 classes.
UPDATE: tweaking clion64.exe.vmoptions reduced the consumption radically.
Same issue here. I just leave CLion sitting there so that I do not have to open it again: 2 projects, a few files open, nothing major, and it is still eating up 3+ GB, which is not something I can accept. I am switching back to Sublime, which works fine. As others have mentioned, I use it only for editing/refactoring; compilation happens in the terminal.
(PyCharm has similar issues)
CLion needs to index and keep all information about the system headers to provide you with smart completion, auto-import and symbol resolution. Your project is the smallest part of the code base being analyzed.
I have heard about version 2020.3, which brings an option to switch off refreshing files.
https://intellij-support.jetbrains.com/hc/en-us/community/posts/360007093580-How-to-disable-refreshing-files-after-build
Unfortunately I cannot try it out in my professional development environment.

Pure C OpenCL vs Python OpenCL performance

I am looking for performance measurements comparing the Python wrapper for OpenCL with pure C OpenCL. The measurements can cover time, memory, etc.
- Are there any benchmarks available?
- What should be the expectation about the time performance differences?
- What kind of tasks (parallel of course...) should make a difference?
It is likely that PyOpenCL is your best choice. I would choose to use C only in very specific situations (a super-critical need for speed/low-latency on the host). For most casual parallel programs, it is fine for the host side to have plenty of slack, because all the real work gets done on the device.
You can consider PyOpenCL and OpenCL to have identical performance on the device.
Maybe use C if you are, like... designing a self-driving car, and every millisecond/amp matters. But even in that situation, it is likely that Python could be used effectively.
The best way to figure out if your specific program is slowed down is to time your code. For PyOpenCL that means:
import time
and
cl.command_queue_properties.PROFILING_ENABLE
Many smart companies and individuals choose to code first in Python, because they can build a flexible, working prototype quickly. If they end up needing more host performance later, it is relatively easy to port Python to C.
Hope that helps!
OpenCL uses precompiled programs that are later sent to the device for execution. These are the so-called "kernels", and they are deployed to be executed on the end device. This means the main cost that must be measured is the I/O of the OpenCL implementation's API. Therefore, you can't rely on memory/CPU measurements, as the real OpenCL part will use the same amount of both.
AFAIK, no benchmarks are available, but it is not hard to write one if you need it (matrix multiplication is the usual hello-world example).
OpenCL is not meant to do I/O on every CPU cycle. Its field of use is really big data processing: one big input, a lot of processing operations, and one output (no matter how small or big). No one says that OpenCL can't be used with a lot of I/O and minimal calculation, but the implementation API overhead is not worth it.
You should expect the I/O to be about equally fast in both, relative to overall application performance.
There is a benchmark here: https://github.com/bennylp/saxpy-benchmark, comparing PyOpenCL against OpenCL as well as other frameworks/methods such as CUDA, plain C++, Numpy, R, Octave, and even TensorFlow (disclaimer: I'm the author)
According to the benchmark results, the performance difference between OpenCL and PyOpenCL varies wildly. The PyOpenCL GPU target is almost 7x slower than OpenCL, but for the CPU target PyOpenCL is actually more than 2x faster than OpenCL!

Does gprof support multithreaded applications?

We're developing a multithreaded project. My colleague said that gprof works perfectly with multithreaded programs, with no workaround needed. I read otherwise some time ago.
http://sam.zoy.org/writings/programming/gprof.html
http://lists.gnu.org/archive/html/bug-binutils/2010-05/msg00029.html
I also read this:
How to profile multi-threaded C++ application on Linux?
So I'm guessing the workaround is no longer needed? If so, since when is it not needed?
Unless you change the processing, gprof will work fine.
Changing the processing means using co-processors or GPUs as computing units. In the worst case you have to manually call the setitimer function in every thread. But as of the latest versions (2013-14) this is not needed.
In certain cases gprof misbehaves, so I advise using Intel's VTune, which gives more accurate and more detailed information.
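As a rough illustration of the per-thread setitimer workaround mentioned above, here is a minimal sketch, assuming a POSIX system with pthreads. The names profiled_pthread_create, profiled_thread_args and square_worker are hypothetical and not part of gprof or any library. The idea is to copy the creating thread's ITIMER_PROF setting into each new thread before its real start routine runs. Whether this is needed at all depends on the platform and threading implementation.

```c
#define _XOPEN_SOURCE 700 /* expose setitimer/getitimer under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/time.h>

/* Hypothetical bundle: the real start routine, its argument, and the
   profiling timer setting captured in the creating thread. */
typedef struct {
    void *(*start_routine)(void *);
    void *arg;
    struct itimerval itimer;
} profiled_thread_args;

/* Runs in the new thread: re-arm ITIMER_PROF, then call the real routine. */
static void *profiled_thread_entry(void *p)
{
    profiled_thread_args args = *(profiled_thread_args *)p;
    free(p);
    setitimer(ITIMER_PROF, &args.itimer, NULL);
    return args.start_routine(args.arg);
}

/* Hypothetical drop-in replacement for pthread_create. */
int profiled_pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                            void *(*start_routine)(void *), void *arg)
{
    profiled_thread_args *args = malloc(sizeof *args);
    if (args == NULL)
        return EAGAIN;
    args->start_routine = start_routine;
    args->arg = arg;
    /* Capture the profiling timer of the calling thread. */
    if (getitimer(ITIMER_PROF, &args->itimer) != 0) {
        int err = errno;
        free(args);
        return err;
    }
    int rc = pthread_create(thread, attr, profiled_thread_entry, args);
    if (rc != 0)
        free(args); /* the new thread never ran, so we still own args */
    return rc;
}

/* Hypothetical example worker: squares the int it is given in place. */
void *square_worker(void *arg)
{
    int *value = arg;
    assert(value != NULL);
    *value *= *value;
    return arg;
}
```

Calling profiled_pthread_create(&thread, NULL, square_worker, &value) in place of pthread_create leaves the rest of the program unchanged. The program would still be built with -pg as usual.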

C Code slower on Windows than on Linux

I'm working on a project that will have builds for Windows and Linux, 32 and 64 bits.
This project is based on loading strings from a text file, processing them and writing the results to a SQLite3 database.
On Linux it reaches almost 400k sequences per second, compiled by GCC without any optimization. However, on Windows it is stuck at 100k sequences per second, compiled with VS2010 without any optimization.
I tried using optimizations in compilers but nothing changed.
Is this right? C code on Windows runs slower?
EDIT:
I think I need to be clearer on some points.
I ran tests with code optimization enabled AND disabled. Performance didn't change, probably because my program's bottleneck is the time spent reading data from the hard disk.
This program takes advantage of parallel computing. There is a queue where one thread enqueues processed data and another dequeues it to write to the SQLite database. So I don't think there is any performance loss from this.
Is this right? C code on Windows runs slower?
No. C doesn't have speed. It's the implementations of C that introduce speed. There are implementations that produce fast behaviour (generally "compilers that produce fast machine code") and implementations that produce slow behaviour for both Windows and Linux.
It isn't just Windows and Linux that are significant here, either. Some compilers optimise for specific processors, and will produce slow machine code for any other processors.
I tried using optimizations in compilers but nothing changed.
Testing speed without optimisations enabled makes no sense. However, this does tend to indicate that something else is slow. Perhaps the implementation that produced the library files for SQLite3 client in Windows is an implementation that produces slow code. I'd start by rebuilding the lot (including the SQLite3 library) with full optimisations enabled. Following that, you could try using a profiler to determine where the difference is and use the results to perform intelligent optimisations to your code.
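To see where the time goes before reaching for a full profiler, a coarse per-phase stopwatch can help. The following is a minimal sketch, assuming a POSIX system where clock_gettime with CLOCK_MONOTONIC is available (on Windows an equivalent such as QueryPerformanceCounter would be needed instead). The names phase_timer, phase_begin and phase_end are hypothetical and not part of SQLite or any library.

```c
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical accumulator for one phase of the program
   (e.g. reading, processing, or writing to the database). */
typedef struct {
    double total_seconds; /* wall-clock time accumulated so far */
    double started_at;    /* timestamp of the most recent phase_begin */
    unsigned long calls;  /* number of completed begin/end pairs */
} phase_timer;

/* Current monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Mark the start of one pass through the phase. */
void phase_begin(phase_timer *t)
{
    assert(t != NULL);
    t->started_at = now_seconds();
}

/* Mark the end of the pass and add its duration to the total. */
void phase_end(phase_timer *t)
{
    assert(t != NULL);
    t->total_seconds += now_seconds() - t->started_at;
    t->calls++;
}
```

Wrapping the file reading, the string processing and the SQLite writes in separate phase_begin/phase_end pairs, each with its own zero-initialized phase_timer, and printing total_seconds for each on both platforms should show which phase accounts for the gap.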

Timing Kernel Executions on CUDA

I've used code from CUDA C Best Practices to implement an execution timer. However, there is something strange and I don't know if it's an anomaly or if that's normal. I get different readouts each time I run my CUDA app.
Could these readings be related to the design, or is that something I should expect?
I'm not running any graphic intensive applications on my machine, other than Windows 7.
Well, it depends on how big the differences are. One possible cause of the anomalies is the kernel scheduler. It may just happen that the scheduler is giving some extra timeslices to kernel functions (because graphics API calls have error checking involved), which shows up as more execution time. If the differences are very large, I would say check your code, but if they are small, on the order of milliseconds, I wouldn't worry about it: +-10 ms is the usual timeslicing quantum in most OSes (Windows probably included).
Also, Aero is kind of intensive, so that may be adding to the discrepancies you are seeing.
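Since small run-to-run differences are to be expected, one way to judge them is to time the kernel several times and compare the minimum, median and maximum rather than a single reading. Below is a minimal host-side sketch in plain C, assuming the per-run times in milliseconds have already been collected (for example from the timer in the question). The names timing_summary and summarize_timings are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical summary of repeated timing measurements, in milliseconds. */
typedef struct {
    double min_ms;
    double median_ms;
    double max_ms;
} timing_summary;

/* qsort comparator for doubles in ascending order. */
static int compare_doubles(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts samples_ms in place and reports min, median and max.
   For an even count the median is the mean of the two middle values. */
timing_summary summarize_timings(double *samples_ms, size_t count)
{
    assert(samples_ms != NULL && count > 0);
    qsort(samples_ms, count, sizeof samples_ms[0], compare_doubles);

    timing_summary s;
    s.min_ms = samples_ms[0];
    s.max_ms = samples_ms[count - 1];
    if (count % 2 == 1)
        s.median_ms = samples_ms[count / 2];
    else
        s.median_ms = (samples_ms[count / 2 - 1] + samples_ms[count / 2]) / 2.0;
    return s;
}
```

A median close to the minimum with an occasional large maximum points to outside interference such as scheduling, while a wide spread across all runs is more likely to come from the code being measured.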
I've used code from CUDA C Best Practices to implement an execution timer.
Yeah, well, that's not a "best practice" in my experience.
I suggest using the nvprof profiler instead for your device-side code and CUDA Runtime API calls (it also works relatively well, I think, for your own host-side code). It'll take you a bit of hassle to set up and figure out which options you want to use, but it's worth it.
