Converting a pthreaded program to MPI?

I understand the differences between a multithreaded program and a program relying on inter-machine communication. My problem is that I have a nice multithreaded program written in 'C' that works and runs really well on an 8-core machine. There is now opportunity to port this program to a cluster to gain access to more cores. Is it worth the effort to rip out the pthread stuff and retrofit MPI (which I've never used) or are we better off recoding the whole thing (or most of it) from scratch? Assume we are "stuck" with C so a wholesale change of language isn't an option.

Depending on how your software is written, there may or may not be advantages to going to MPI over keeping your pthread implementation.
Unfortunately (or fortunately), message passing is a very different beast than pthreading - the basic assumption is quite different. I love this quote from Joshua Phillips of the Maestro team: "The difference between message-passing and shared-state communication is equivalent to the difference between sending a colleague an e-mail requesting her to complete a task and opening up her organizer to write down the task directly in her to-do list. More than just being rude, the latter is likely to confuse her – she might erase it, not notice it, or accidentally prioritize it incorrectly."
Unfortunately, the way you share data is very different. There is no direct access to data in other threads (since they may be running on other machines), so it can be a very daunting task to migrate from pthreads to MPI. On the other hand, if the code is written so each thread is isolated, it can be an easy task, and definitely worthwhile.
In order to determine how useful this will be, you'll need to understand the code, and what you hope to achieve by switching. It can be worthwhile as a learning experience (you learn a LOT about synchronization and threading by working in MPI), but may not be practical if the gains will be minor.

Re. your comment to Reed -- this sounds like an easy, low-overhead conversion to MPI. Just be careful: not all MPI implementations support dynamic creation of processes, i.e., you start your program with N processes (specified at startup) and you're stuck with N processes throughout the lifetime of the program.
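As a rough sketch of what that model looks like in C (assuming a standard MPI installation; the ranks and payload are arbitrary), the process count is fixed at launch, e.g. "mpirun -np 4 ./a.out", and data moves only through explicit messages:

    /* Minimal MPI sketch: fixed process count, explicit messages. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?     */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many were launched? */

        if (rank == 0) {
            int work = 42;
            /* Rank 0 "emails" the value to rank 1 instead of writing
               into rank 1's memory directly. */
            MPI_Send(&work, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int work;
            MPI_Recv(&work, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 of %d received %d\n", size, work);
        }

        MPI_Finalize();
        return 0;
    }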

Related

Writing a program to run on a server, requesting experienced advice

I'm developing a program that will need to run on Internet servers (a back-end component to be used by several cross-platform programs). I'm familiar with the security precautions to take (to prevent buffer overflows and SQL Injection attacks, for instance), but have never written a server program before, or any program that will be used on this scale.
The program needs to be able to serve hundreds or thousands of clients simultaneously. The protocols are designed for processing speed and to minimize the amount of data that must be exchanged, and the server side will be written in C. There will be both a Windows and a Linux version from the same code.
Questions:
How should the program handle communications -- multiple threads, a single thread handling all the sockets in turn, or spawn a new process for every so many incoming connections (or for each one)?
Do I need to worry about things like memory fragmentation, since this program will need to run for months at a time?
What other design issues, specific to this kind of programming, might an experienced developer of cross-platform programs for desktop and mobile systems not be aware of?
Please, no suggestions to use a different language. That decision has already been made, for reasons I'm not at liberty to go into.
For communication I'd use libevent or libev and non-blocking I/O. This way the operating system will take care of most of your scheduling problems. I'd also use a thread pool for processing tasks that are blocking by nature, so they don't block the main loop. And if you ever need to read or write large amounts of data to or from disk, use mmap, again to let the OS handle as much as possible.
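As a rough illustration of that approach (a sketch only, assuming libevent 2.x; the port number and echo behaviour are placeholders), a non-blocking server can be as small as this:

    /* Minimal libevent 2.x echo server: the event loop and the OS do the
       scheduling; no thread per connection. Error handling is trimmed. */
    #include <event2/event.h>
    #include <event2/listener.h>
    #include <event2/bufferevent.h>
    #include <event2/buffer.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>

    static void read_cb(struct bufferevent *bev, void *ctx) {
        /* Echo whatever arrived straight back to the client. */
        evbuffer_add_buffer(bufferevent_get_output(bev),
                            bufferevent_get_input(bev));
    }

    static void event_cb(struct bufferevent *bev, short events, void *ctx) {
        if (events & (BEV_EVENT_EOF | BEV_EVENT_ERROR))
            bufferevent_free(bev);
    }

    static void accept_cb(struct evconnlistener *lst, evutil_socket_t fd,
                          struct sockaddr *addr, int socklen, void *ctx) {
        struct event_base *base = evconnlistener_get_base(lst);
        struct bufferevent *bev =
            bufferevent_socket_new(base, fd, BEV_OPT_CLOSE_ON_FREE);
        bufferevent_setcb(bev, read_cb, NULL, event_cb, NULL);
        bufferevent_enable(bev, EV_READ | EV_WRITE);
    }

    int main(void) {
        struct event_base *base = event_base_new();
        struct sockaddr_in sin;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(9000);        /* arbitrary example port */

        struct evconnlistener *lst = evconnlistener_new_bind(
            base, accept_cb, NULL,
            LEV_OPT_CLOSE_ON_FREE | LEV_OPT_REUSEABLE, -1,
            (struct sockaddr *)&sin, sizeof(sin));
        if (!lst) return 1;

        event_base_dispatch(base);         /* run the event loop */
        return 0;
    }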
The basic advice is: use the OS as much as possible. If you want a good example of a program which does this, look at Varnish; it is very well written and performs fantastically.
With my experience running multiple servers for over 3 years of uptime, and programs with a little over a year of uptime, I can still recommend designing the setup so that the system gracefully recovers from a program error and from a server reboot.
Even though performance takes a hit when a program is restarted, you need to be able to handle that, as external circumstances can force the program into such a restart.
Don't try to reinvent the wheel when it's not needed, and have a look at zeromq or something like it to handle distribution of incoming communications. (If you are allowed to, prototype the backends in a more forgiving language than C, like Python, then reimplement in C while keeping the communications protocol.)

What are the advantages and disadvantages of using sockets for IPC?

I have been asked this question in some recent interviews: what are the advantages and disadvantages of using sockets for IPC when there are other ways to perform IPC? I have not found an exact answer.
Any help would be much appreciated.
Compared to pipes, IPC sockets differ by being bidirectional, that is, reads and writes can be done on the same descriptor. Pipes, unlike sockets, are unidirectional. You have to keep a pair of descriptors if you want to do both reads and writes.
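A small sketch of that point, assuming a POSIX system: one socketpair() gives a single two-way channel, where a pipe-based version would need two pipe() calls and four descriptors.

    /* Bidirectional IPC over a single socketpair(). */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
            perror("socketpair");
            return 1;
        }

        char buf[8] = {0};

        write(sv[0], "ping", 4);                 /* end 0 writes...        */
        read(sv[1], buf, sizeof(buf) - 1);       /* ...end 1 reads it      */
        printf("end 1 got: %s\n", buf);

        write(sv[1], "pong", 4);                 /* reply on the SAME pair */
        memset(buf, 0, sizeof(buf));
        read(sv[0], buf, sizeof(buf) - 1);
        printf("end 0 got: %s\n", buf);

        close(sv[0]);
        close(sv[1]);
        return 0;
    }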
Pipes, on the other hand, guarantee atomicity when reading or writing under a certain amount of bytes. Writing something less than PIPE_BUF bytes at once is guaranteed to be delivered in one chunk and never observed partial. Sockets do require more care from the programmer in that respect.
Shared memory, when used for IPC, requires explicit synchronisation from the programmer. It may be the most efficient and most flexible mechanism, but that comes at an increased complexity cost.
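For illustration, a minimal sketch of POSIX shared memory with a process-shared semaphore providing that explicit synchronisation (the name and sizes are arbitrary; error checking is trimmed; link with -lrt -lpthread on most Linux systems):

    /* Two processes increment a counter in shared memory under a
       process-shared semaphore. */
    #include <fcntl.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    typedef struct {
        sem_t lock;        /* protects counter */
        int   counter;
    } shared_block;

    int main(void) {
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, sizeof(shared_block));

        shared_block *blk = mmap(NULL, sizeof(shared_block),
                                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        sem_init(&blk->lock, 1, 1);   /* 1 = shared between processes */

        if (fork() == 0) {            /* child plays by the same rules */
            sem_wait(&blk->lock);
            blk->counter += 1;
            sem_post(&blk->lock);
            _exit(0);
        }

        sem_wait(&blk->lock);
        blk->counter += 1;
        sem_post(&blk->lock);

        wait(NULL);
        printf("counter = %d\n", blk->counter);   /* prints 2 */

        sem_destroy(&blk->lock);
        munmap(blk, sizeof(shared_block));
        shm_unlink("/demo_shm");
        return 0;
    }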
Another point in favour of sockets: an app using sockets can be easily distributed, i.e. it can be run on one host or spread across several hosts with little effort. This depends, of course, on the nature of the app.
Perhaps this is too simplified an answer, yet it is an important detail. Sockets are not supported on all OSes. Recently, I became aware of a project that used sockets for IPC all over the place, only to find that it was forced to move from Linux to a proprietary OS which was POSIX-compliant but did not support sockets the same way as Linux does.
Sockets allow you a few benefits...
You can connect a simple client to them for testing (manually enter data, see the response).
This is very useful for debugging, simulating and blackbox testing.
You can run the processes on different machines. This can be useful for scalability and is very helpful in debugging / testing if you work in embedded software.
It becomes very easy to expose your process as a service.
But there are drawbacks as well:
Overhead is greater than IPC optimized for a single machine. Shared memory in particular is better if you need the performance, and you know your processes are all on the same machine.
Security - if your client apps can connect so can anyone else, if you're not careful about authentication. Data can also be sniffed if you're not encrypting, and modified if you're not at least signing data sent over the wire.
Using a true message queue tends to leave you with fixed-size messages. If you have a large number of messages of wildly varying sizes, this can become a performance problem. Using a socket can be a way around this, though you're then left trying to wrap that functionality so it behaves like a queue, which is tricky to get right, particularly on aspects like blocking/non-blocking and atomicity.
Shared memory is quick but requires management (you end up writing a version of malloc to manage the SHM), plus you have to synchronise and lock it in some way. Though you can use libraries to help with this, their availability depends on your environment and language.
Queues are easy, but they carry the downsides listed as pros in my socket discussion above.
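To illustrate the fixed-size point, a sketch with a POSIX message queue (queue name and sizes are arbitrary; link with -lrt on Linux): every receive buffer must be at least mq_msgsize bytes, however small the actual message is.

    #include <fcntl.h>
    #include <mqueue.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    int main(void) {
        struct mq_attr attr = {0};
        attr.mq_maxmsg  = 10;
        attr.mq_msgsize = 4096;            /* every slot is this big */

        mqd_t q = mq_open("/demo_queue", O_CREAT | O_RDWR, 0600, &attr);
        if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

        const char *msg = "hi";            /* tiny payload...           */
        mq_send(q, msg, strlen(msg) + 1, 0);

        char buf[4096];                    /* ...but the receiver still
                                              needs a full-size buffer  */
        unsigned prio;
        ssize_t n = mq_receive(q, buf, sizeof(buf), &prio);
        printf("received %zd bytes: %s\n", n, buf);

        mq_close(q);
        mq_unlink("/demo_queue");
        return 0;
    }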
Pipes have been covered by Blagovest's answer to this question.
As is ever the case with this kind of stuff I would suggest reading the W. Richard Stevens books on IPC and sockets. There is no better explanation than his! :-)

OpenMP debug newbie questions

I am starting to learn OpenMP, running examples (with gcc 4.3) from https://computing.llnl.gov/tutorials/openMP/exercise.html on a cluster. All the examples work fine, but I have some questions:
How do I know on which nodes (or cores of each node) the different threads have run?
In the case of nodes, what is the average transfer time, in microseconds or nanoseconds, for sending the info and getting it back?
What are the best tools for debugging OpenMP programs?
Best advice for speeding up real programs?
Typically your OpenMP program does not know, nor does it care, on which cores it is running. If you have a job management system that may provide the information you want in its log files. Failing that, you could probably insert calls to the environment inside your threads and check the value of some environment variable. What that is called and how you do this is platform dependent, I'll leave figuring it out up to you.
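As one concrete possibility, here is a sketch (Linux/glibc-specific, since sched_getcpu() is a GNU extension; other platforms need a different call, which is exactly the platform dependence mentioned above):

    /* Compile with: gcc -fopenmp which_core.c -o which_core */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();   /* OpenMP thread id            */
            int cpu = sched_getcpu();         /* core the thread is on *now* */
            printf("thread %d is running on core %d\n", tid, cpu);
        }
        return 0;
    }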
How the heck should I (or any other SOer) know? For an educated guess you'd have to tell us a lot more about your hardware, OS, run-time system, and so on. The best answer to the question is the one you determine from your own measurements. I fear that you may also be mistaken in thinking that information is sent around the computer: in shared-memory programming, variables usually stay in one place (or at least you should think of them as staying in one place; the reality may be a lot messier, but also impossible to discern) and are not sent or received.
Parallel debuggers such as TotalView or DDT are probably the best tools. I haven't yet used Intel's debugger's parallel capabilities but they look promising. I'll leave it to less well-funded programmers than me to recommend FOSS options, but they are out there.
i) Select the fastest parallel algorithm for your problem. This is not necessarily the fastest serial algorithm made parallel.
ii) Test and measure. You can't optimise without data, so you have to profile the program and understand where the performance bottlenecks are. Don't believe any advice along the lines of 'X is faster than Y'. Such statements are usually based on very narrow, and often outdated, cases and have become, in the minds of their promoters, 'truths'. It's almost always possible to find counter-examples. It's YOUR code YOU want to make faster; there's no substitute for YOUR investigations.
iii) Know your compiler inside out. The rate of return (measured in code speed improvements) on the time you spend adjusting compilation options is far higher than the rate of return from modifying the code 'by hand'.
iv) One of the 'truths' that I cling to is that compilers are not terrifically good at optimising for use of the memory hierarchy on current processor architectures. This is one area where code modification may well be worthwhile, but you won't know this until you've profiled your code.
You cannot know; the placement of threads on different cores is handled entirely by the OS. You are speaking about nodes, but OpenMP is multi-thread (not multi-process) parallelization, which only allows parallelization within one machine containing several cores. If you need parallelization across different machines, you have to use a multi-process system like OpenMPI.
The orders of magnitude of communication speeds are:
huge bandwidth for communications between cores inside the same CPU; it can be considered instantaneous
~10 GB/s for communications between two CPUs across a motherboard
~100-1000 MB/s for network communications between nodes, depending on the hardware
All the theoretical speeds should be specified in your hardware specifications. You should also run small benchmarks to find out what you will really get.
For OpenMP, gdb does the job well, even with many threads.
I work on extreme physics simulations on supercomputers; here are our daily aims:
use as little communication as possible between the threads/processes; 99% of the time it is communication that kills performance in parallel jobs
split the tasks optimally; machine load should stay as close as possible to 100% all the time
test, tune, re-test, re-tune... Parallelization is not a generic "miracle solution"; it generally needs some practical work to be efficient.

Best way to ensure accurate timing with C

I am a beginning C programmer (though not a beginning programmer) looking to dive into a project to teach myself C. My project is music-based, and because of this I am curious whether there are any 'best practices', per se, when it comes to timing functions.
Just to clarify, my project is pretty much an attempt to build some barebones music notation/composition software (remember, emphasis on barebones). I was originally thinking about using OSX as my platform, but I want to do it in C, not Obj-C (though I know it would probably be easier...CoreAudio looked like a pretty powerful tool for this kind of stuff). So even though I don't have to build OSX apps in Obj-C, I will probably end up building this on a linux system (probably debian...).
Thanks everyone, for your great answers.
There are two accurate methods for timing functions:
Single process execution.
Timer event handler / callback
Single Process Execution
Most modern computers execute more than one program simultaneously. Actually, they execute pieces of many programs, swapping them out based on priorities and other metrics to make it look like more than one program is executing at the same time. This overhead affects timing in programs: either the program gets delayed in reading the time, or the OS gets delayed in setting its own time variables.
The solution in this case is to eliminate as many other tasks from running as possible. The ideal environment for best accuracy is to have your program be the sole program running. Some OSes provide APIs for superuser applications to block all other programs or kill them.
Timer event handling / callback
Since the OS can't be trusted to execute your program with high precision, most OSes provide timer APIs. Many of these APIs include the ability to call one of your functions when the timer expires; this is known as a callback function. Other OSes may send a message or generate an event when the timer expires; these fall under the class of timer handlers. The callback approach has less overhead than the handlers and is thus more accurate.
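As a sketch of the callback style (POSIX timers with SIGEV_THREAD; availability and the need to link with -lrt vary by platform, and the 100 ms interval is only an example):

    #include <stdio.h>
    #include <signal.h>
    #include <time.h>
    #include <unistd.h>

    static void tick(union sigval sv) {
        /* Runs each time the timer expires. */
        printf("timer fired\n");
    }

    int main(void) {
        struct sigevent sev = {0};
        sev.sigev_notify = SIGEV_THREAD;        /* call a function, no signal */
        sev.sigev_notify_function = tick;

        timer_t timerid;
        if (timer_create(CLOCK_MONOTONIC, &sev, &timerid) == -1) {
            perror("timer_create");
            return 1;
        }

        struct itimerspec its = {0};
        its.it_value.tv_nsec    = 100 * 1000 * 1000;   /* first fire: 100 ms */
        its.it_interval.tv_nsec = 100 * 1000 * 1000;   /* then every 100 ms  */
        timer_settime(timerid, 0, &its, NULL);

        sleep(1);                               /* let it fire a few times */
        timer_delete(timerid);
        return 0;
    }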
Music Hardware
Although you may have your program send music to the speakers, many computers now have separate processors that play music. This frees up the main processor and produces more continuous notes, rather than sounds separated by silent gaps caused by the platform overhead of your program sending the next sounds to the speaker.
A quality music processor has at least these two functions:
Start Playing
End Music Notification
Start Playing
This is the function where you tell the music processor where your data is and the size of the data. The processor will start playing the music.
End Music Notification
You provide the processor with a pointer to a function that it will call when the music data has been processed. Nice processors will call the function early so there will be no gaps in the sounds while reloading.
All of this is platform dependent and may not be standard across platforms.
Hope this helps.
This is quite a vast area, and, depending on exactly what you want to do, potentially very difficult.
You don't give much away by saying your project is "music based".
Is it a musical score typesetting program?
Is it processing audio?
Is it filtering MIDI data?
Is it sequencing MIDI data?
Is it generating audio from MIDI data?
Does it only perform playback?
Does it need to operate in a real time environment?
Your question though hints at real time operation, so in that case...
The general rule when working in a real time environment is don't do anything which may block the real time thread. This includes:
Calling free/malloc/calloc/etc (dynamic memory allocation/deallocation).
File I/O.
Use of spinlocks/semaphores/mutexes shared with other threads.
Calls to GUI code.
Calls to printf.
Bearing these considerations in mind for a real time music application, you're going to have to learn how to do multi-threading in C and how to pass data from the UI/GUI thread to the real time thread WITHOUT breaking ANY of the above restrictions.
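One common way to satisfy those restrictions is a lock-free single-producer/single-consumer ring buffer between the two threads; the sketch below uses C11 atomics, with the size and float payload chosen arbitrarily:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RB_SIZE 256                 /* must be a power of two */

    typedef struct {
        float         data[RB_SIZE];
        atomic_size_t head;             /* advanced only by the producer (UI)        */
        atomic_size_t tail;             /* advanced only by the consumer (real time) */
    } spsc_ring;

    /* Producer side (UI thread): returns false if the buffer is full. */
    bool rb_push(spsc_ring *rb, float value) {
        size_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
        if (head - tail == RB_SIZE)
            return false;                           /* full */
        rb->data[head & (RB_SIZE - 1)] = value;
        atomic_store_explicit(&rb->head, head + 1, memory_order_release);
        return true;
    }

    /* Consumer side (real-time thread): no locks, no allocation, no blocking. */
    bool rb_pop(spsc_ring *rb, float *out) {
        size_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&rb->head, memory_order_acquire);
        if (head == tail)
            return false;                           /* empty */
        *out = rb->data[tail & (RB_SIZE - 1)];
        atomic_store_explicit(&rb->tail, tail + 1, memory_order_release);
        return true;
    }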
For an open source real time audio (and MIDI) (routing) server take a look at http://jackaudio.org
gettimeofday() is the best for wall clock time. getrusage() is the best for CPU time, although it may not be portable. clock() is more portable for CPU timing, but it may have integer overflow.
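A small sketch contrasting the two (the busy loop is just a stand-in for the code being timed):

    #include <stdio.h>
    #include <sys/time.h>
    #include <time.h>

    int main(void) {
        struct timeval t0, t1;
        clock_t c0 = clock();
        gettimeofday(&t0, NULL);

        volatile double x = 0.0;                 /* work being timed */
        for (long i = 0; i < 10 * 1000 * 1000; i++)
            x += i * 0.5;

        gettimeofday(&t1, NULL);
        clock_t c1 = clock();

        double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
        printf("wall: %.6f s   cpu: %.6f s\n", wall, cpu);
        return 0;
    }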
This is pretty system-dependent. What OS are you using?
You can take a look at gettimeofday() for fairly high granularity. It should work ok if you just need to read time once in awhile.
SIGALRM/setitimer can be used to receive an interrupt periodically. Additionally, some systems have higher level libraries for dealing with time.

Where can I find benchmarks on different networking architectures?

I am playing with sockets / threads / forks and I'd like to know which is best. I was thinking there has got to be a place where someone has already spelled out all the pros and cons of different architectures for a socket service, and listed benchmarks with code that runs.
Ultimately I'd like to run these various configurations with my own code and see which runs best in different circumstances.
Many people I talk to say that I should just use single-threaded select. But I see an argument for threads when you're storing state information inside the thread to keep the code simple. Where is the trade-off point between writing my own state structure and using a proven thread architecture?
I've also been told forking is bad... but when you need 12000 connections on a machine that cannot raise the open-files-per-process limit, forking is an option! Forking is also a nice option for stability: when you've got one process that needs restarting, it doesn't disturb the others.
Sorry, this is one of my longer questions... so many variables are left empty.
Thanks,
Chenz
edit: here's the link I was looking for, which is a whole paper answering your question. http://www.kegel.com/c10k.html
There are web servers designed along all three models (fork, thread, select). People like to benchmark web servers.
http://www.lighttpd.net/benchmark
Libevent has some benchmarks and links to stuff about how to choose a select() vs. threaded model, generally in favour of using the libevent model.
http://monkey.org/~provos/libevent/
It's very difficult to answer this question as so much depends on what your service is actually doing. Does it have to query a database? read files from the filesystem? perform complicated calculations? go off and talk to some other service? Also, how long-lived are client connections? Might connections have some semantic interaction with other connections, or are they all treated as independent of each other? Might you want to think about load-balancing your service across multiple servers later? (If so, you might usefully think about that now so that any necessary help can be designed in from the start.)
As you hint, the serving machine might have limits which interact with the various techniques, steering you towards one answer or another. You have a per-process file descriptor limit, but remember that you may also have a fixed size process table! How many concurrent clients are you expecting, anyway?
If your service keeps crashing and you need to keep restarting it or you think you want a multi-process model so that connections are isolated from each other, you're probably doing it wrong. Stability is extremely important in this sort of context, and that means good practice and memory hygiene, both in general and in the face of network-based attacks.
Remember the history... fork() is cheap in the Unix world, but spawning new processes is relatively expensive on Windows. OTOH, Windows threads are lightweight, whereas threading has always been a bit alien to Unix and has only relatively recently become widespread.
