When a process writes to a file - file

Generally, when a process writes to a file, e.g a python script running open('file', 'w').write('text'), what are the exact events that occur? By that I mean something among the lines of 'process A loads file from hard disk to RAM, process B changes content then ...'. I've read about IPC and now I'm trying to dig deeper and understand more on the subject of processes. I couldn't find a thorough explanation on the subject, so if you could find one or explain I'd really appreciate it.

The example of "a python script running open('file', 'w').write('text')" is heavily OS-dependent. The only processes involved here are the process running the Python interpreter, which, e.g. on Linux, can sometimes execute in userspace and sometimes execute in kernel space, and possibly some kernel-only processes, with any IPC, if required, happening inside the kernel. There is no particular requirement that everything down to the disk read itself cannot be handled on the user's process when it is running in kernel mode, but in practice, there may be other processes involved. This is OS- and even driver-specific behavior.
In this particular example (which isn't great, because it relies on the automatic cPython close when the variable goes out of scope), the Python process makes a system call to open a file, one to write the file, and one to close the file. These are all blocking -- that is, they do not return until the results are ready. When the process blocks, it is put on a queue waiting for some event to occur to make it ready to run again.
The opposite of this is asynchronous I/O, which can be performed by polling, by callbacks, or by the select statement, which can block until any one of a number of events has occurred.
But when most people talk about IPC, they are not usually talking about communication between or with kernel processes. Rather, they are talking about communication between multiple user processes and/or threads, using semaphores, mutexes, named pipes, etc. A good introduction to these sorts of things would be any tutorial information you can find on using pthreads, or even the Python threads and multiprocessing modules. There are examples there for several simple cases.
The primary difference between processes and threads on Linux is that threads share an address space and processes each have their own address space. Python itself adds the wrinkle of the GIL, which limits the utility of threads in Python somewhat.

Related

What is best way to have several child processes in C

Working on a project that request to download about 300 pics from different locations by using wget every 20 minutes.
I wrote a C program that reads the database for all the Ids and locations into an array.
For each entry in the array, I call the external wget command to download it.
It works but is slow because it is doing one by one.
My thinking is to use either Multi-process, multi-thread or openMP to create several children.
Any suggestion for how to do this is appreciate.
Multiple Processes
An error in one process cannot crash another process. This is particularly useful when you will host third-party code (e.g. plugins), and this is the approach that (among others) Google Chrome takes. The disadvantage is that N processes use more system resources than N threads.
Multiple Threads
Uses fewer system resources than an equivalent number of processes. Thread programming is more error prone for many developers, and an error in one thread can affect other threads.
Best Option
For what you are doing, you are unlikely to see a significant difference in resource utilization. Use whichever model you can write fast in high quality.
Personally I would go for multi process. The wget's do not need to share any memory or communicate (other than an exit status which is only needed by the root) so a thread will not provide any additional benefit (in my opinion). As well as this creating them as processed allows the OS scheduler to best decide when to run each process.

Is it possible to interrupt a process and checkpoint it to resume it later on?

Lets say, you have an application, which is consuming up all the computational power. Now you want to do some other necessary work. Is there any way on Linux, to interrupt that application and checkpoint its state, so that later on it could be resumed from the state it was interrupted?
Especially I am interested in a way, where the application could be stopped and restarted on another machine. Is that possible too?
In general terms, checkpointing a process is not entirely possible (because a process is not only an address space, but also has other resources likes file descriptors, and TCP/IP sockets ...).
In practice, you can use some checkpointing libraries like BLCR etc. With certain limiting conditions, you might be able to migrate a checkpoint image from one system to another one (very similar to the source one: same kernel, same versions of libraries & compilers, etc.).
Migrating images is also possible at the virtual machine level. Some of them are quite good for that.
You could also design and implement your software with your own checkpointing machinery. Then, you should think of using garbage collection techniques and terminology. Look also into Emacs (or Xemacs) unexec.c file (which is heavily machine dependent).
Some languages implementation & runtime have checkpointing primitives. SBCL (a free Common Lisp implementation) is able to save a core image and restart it later. SML/NJ is able to export an image. Squeak (a Smalltalk implementation) also has such ability.
As an other example of checkpointing, the GCC compiler is actually able to compile a single *.h header (into a pre-compiled header file which is a persistent image of GCC heap) by using persistence techniques.
Read more about orthogonal persistence. It is also a research subject. serialization is also relevant (and you might want to use textual formats à la JSON, YAML, XML, ...). You might also use hibernation techniques (on the whole system level).
From the man pages man kill
Interrupting a process requires two steps:
To stop
kill -STOP <pid>
and
To continue
kill -CONT <pid>
Where <pid> is the process-id.
Type: Control + Z to suspend a process (it sends a SIGTSTP)
then bg / fg to resume it in background or in foreground
Checkingpointing an individual process is fundamentally impossible on POSIX. That's because processes are not independent; they can interact. If nothing else, a process has a unique process ID, which it might have stored somewhere internally, and if you resume it with a different process ID, all hell could break loose. This is especially true if the process uses any kind of locks/synchronization primitives. Of course you also can't resume the process with the same process ID it originally had, since that might be taken by a new process.
Perhaps you could solve the problem by making process (and thread) ids 128-bit or so, such that they're universally unique...
On linux it is achivable by sending this process STOP signal. Leter on you resume it by sending CONT signal. Please refer to kill manual.

Any possible solution to capture process entry/exit?

I Would like to capture the process entry, exit and maintain a log for the entire system (probably a daemon process).
One approach was to read /proc file system periodically and maintain the list, as I do not see the possibility to register inotify for /proc. Also, for desktop applications, I could get the help of dbus, and whenever client registers to desktop, I can capture.
But for non-desktop applications, I don't know how to go ahead apart from reading /proc periodically.
Kindly provide suggestions.
You mentioned /proc, so I'm going to assume you've got a linux system there.
Install the acct package. The lastcomm command shows all processes executed and their run duration, which is what you're asking for. Have your program "tail" /var/log/account/pacct (you'll find its structure described in acct(5)) and voila. It's just notification on termination, though. To detect start-ups, you'll need to dig through the system process table periodically, if that's what you really need.
Maybe the safer way to move is to create a SuperProcess that acts as a parent and forks children. Everytime a child process stops the father can find it. That is just a thought in case that architecture fits your needs.
Of course, if the parent process is not doable then you must go to the kernel.
If you want to log really all process entry and exits, you'll need to hook into kernel. Which means modifying the kernel or at least writing a kernel module. The "linux security modules" will certainly allow hooking into entry, but I am not sure whether it's possible to hook into exit.
If you can live with occasional exit slipping past (if the binary is linked statically or somehow avoids your environment setting), there is a simple option by preloading a library.
Linux dynamic linker has a feature, that if environment variable LD_PRELOAD (see this question) names a shared library, it will force-load that library into the starting process. So you can create a library, that will in it's static initialization tell the daemon that a process has started and do it so that the process will find out when the process exits.
Static initialization is easiest done by creating a global object with constructor in C++. The dynamic linker will ensure the static constructor will run when the library is loaded.
It will also try to make the corresponding destructor to run when the process exits, so you could simply log the process in the constructor and destructor. But it won't work if the process dies of signal 9 (KILL) and I am not sure what other signals will do.
So instead you should have a daemon and in the constructor tell the daemon about process start and make sure it will notice when the process exits on it's own. One option that comes to mind is opening a unix-domain socket to the daemon and leave it open. Kernel will close it when the process dies and the daemon will notice. You should take some precautions to use high descriptor number for the socket, since some processes may assume the low descriptor numbers (3, 4, 5) are free and dup2 to them. And don't forget to allow more filedescriptors for the daemon and for the system in general.
Note that just polling the /proc filesystem you would probably miss the great number of processes that only live for split second. There are really many of them on unix.
Here is an outline of the solution that we came up with.
We created a program that read a configuration file of all possible applications that the system is able to monitor. This program read the configuration file and through a command line interface you was able to start or stop programs. The program itself stored a table in shared memory that marked applications as running or not. A interface that anybody could access could get the status of these programs. This program also had an alarm system that could either email/page or set off an alarm.
This solution does not require any changes to the kernel and is therefore a less painful solution.
Hope this helps.

Inter-program communication for an arbitrary number of programs

I am attempting to have a bunch of independent programs intelligently allocate shared resources among themselves. However, I could have only one program running, or could have a whole bunch of them.
My thought was to mmap a virtual file in each program, but the concurrency is killing me. Mutexes are obviously ineffective because each program could have a lock on the file and be completely oblivious of the others. However, my attempts to write a semaphore have all failed, since the semaphore would be internal to the file, and I can't rely on only one thing writing to it at a time, etc.
I've seen quite a bit about named pipes but it doesn't seem to be to be a practical solution for what I'm doing since I don't know how many other programs there will be, if any, nor any way of identifying which program is participating in my resource-sharing operation.
You could use a UNIX-domain socket (AF_UNIX) - see man 7 unix.
When a process starts up, it tries to bind() a well-known path. If the bind() succeeds then it knows that it is the first to start up, and becomes the "resource allocator". If the bind() fails with EADDRINUSE then another process is already running, and it can connect() to it instead.
You could also use a dedicated resource allocator process that always listens on the path, and arbitrates resource requests.
Not entirely clear what you're trying to do, but personally my first thought would be to use dbus (more detail). Should be easy enough within that framework for your processes/programs to register/announce themselves and enumerate/signal other registered processes, and/or to create a central resource arbiter and communicate with it. Readily available on any system with gnome or KDE installed too.

are posix pipes lightweight?

In a linux application I'm using pipes to pass information between threads.
The idea behind using pipes is that I can wait for multiple pipes at once using poll(2). That works well in practice, and my threads are sleeping most of the time. They only wake up if there is something to do.
In user-space the pipes look just like two file-handles. Now I wonder how much resources such a pipes use on the OS side.
Btw: In my application I only send single bytes every now and then. Think about my pipes as simple message queues that allow me to wake-up receiving threads, tell them to send some status-data or to terminate.
No, I would not consider pipes "lightweight", but that doesn't necessarily mean they're the wrong answer for your application either.
Sending a byte over a pipe is going to require a minimum of 3 system calls (write,poll,read). Using an in-memory queue and pthread operations (mutex_lock, cond_signal) involves much less overhead. Open file descriptors definitely do consume kernel resources; that's why processes are typically limited to 256 open files by default (not that the limit can't be expanded where appropriate).
Still, the pipe/poll solution for inter-thread communication does have advantages too: particularly if you need to wait for input from a combination of sources (network + other threads).
As you are using Linux you can investigate and compare pipe performance with eventfd's. They are technically faster and lighter weight but you'll be very lucky to actually see the gains in practice.
http://www.kernel.org/doc/man-pages/online/pages/man2/eventfd.2.html
Measure and you'll know. Full processes with pipes are sufficiently lightweight for lots of applications. Other applications require something lighter weight, like OS threads (pthreads being the popular choice for many Unix apps), or superlightweight, like a user-level threads package that never goes into kernel mode except to handle I/O. While the only way to know for sure is to measure, pipes are probably good enough for up to a few tens of threads, whereas you probably want user-level threads once you get to a few tens of thousands of threads. Exactly where the boundaries should be drawn using today's codes, I don't know. If I wanted to know, I would measure :-)

Resources