Multiple processes read/write files. What API to use? - c

I need to spawn worker processes. On one side, the workers should each read an evenly sized part of a file and pass the data to a socket connection. On the other side, workers should read that data and write it out in parallel. I plan to split the source file into parts beforehand, so that each process gets exactly one part of the file to read from or write to.
I'm already using sockets with read/write, so I think it would be best to keep using this simple API. But I cannot find any way to set the file position when using file descriptors. I obviously need that when working on a file that is divided into separate parts for reading or writing.
I've heard that mmap could help me somehow. But as I understand it, mmap needs a lot of RAM, and my app will run several such transfers at once. The app is also quite limited in CPU usage.
The question is: what API should I use?
EDIT: I'm on Linux. The filesystem is ext4.
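One way to position reads and writes with plain file descriptors is pread() and pwrite(), which take an absolute offset and leave the descriptor's own file position untouched (lseek() is the alternative if you would rather move the position and then call read()/write()). Below is a minimal sketch, assuming each worker is handed the offset and length of its part. The helper names read_region and write_region are made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose pread()/pwrite() under strict -std modes */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes starting at absolute offset off.
 * Returns the number of bytes read (short only at end of file), or -1 on error. */
ssize_t read_region(int fd, void *buf, size_t len, off_t off)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = pread(fd, (char *)buf + done, len - done, off + (off_t)done);
        if (n < 0 && errno == EINTR)
            continue;               /* interrupted by a signal, retry */
        if (n < 0)
            return -1;
        if (n == 0)
            break;                  /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Hypothetical helper: write exactly len bytes at absolute offset off.
 * Returns len on success, or -1 on error. */
ssize_t write_region(int fd, const void *buf, size_t len, off_t off)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(fd, (const char *)buf + done, len - done, off + (off_t)done);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

A worker responsible for the bytes from off to off + len would call read_region() in a loop with a fixed-size buffer, advance its own offset by the returned count, and pass each chunk to the socket with write(). Because the offset is an argument, workers never disturb each other's position.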

Related

Message Passing between Processes without using dedicated functions

I'm writing a C program in which I need to pass messages between child processes and the main process. The catch is that I need to do it without using functions like msgget() and msgsnd().
How can I implement this? What kinds of techniques can I use?
There are multiple ways to communicate with child processes. Which one to pick depends on your application, and very much on its level of abstraction.
-- If the level of abstraction is low:
If you need very fast communication, you could use shared memory (e.g. shm_open()). But that would be complicated to synchronize correctly.
The most used method, and the method I'd use if I were in your shoes is: pipes.
They're simple and fast, and since pipe file descriptors are supported by epoll() and similar asynchronous I/O APIs, you can take advantage of those as well.
Another plus is that, if your application grows and you need to communicate with remote processes (processes that are not on your local machine), adapting pipes to sockets is very easy: it's basically still the same reading/writing from/to a file descriptor.
Also, Unix-domain sockets (roughly comparable to what Windows calls "named pipes") let you have a server process that creates a listening socket with a well-known name (e.g. an entry in the filesystem, such as /tmp/my_socket), and all clients on the local machine can connect to it.
Pipes, networking sockets, and Unix-domain sockets are largely interchangeable solutions, because, as said before, all of them involve reading/writing data from/to a file descriptor, so you can reuse the code.
The disadvantage of a file descriptor is that you're writing data to a stream of bytes, so you need to implement the "message streaming protocol" for your messages yourself in order to split the stream back into messages (marshalling/unmarshalling). In most cases that's not so complicated, and it also depends on the kind of messages you're sending.
I'd pass on other solutions such as memory mapped files and so on.
-- If the level of abstraction is higher:
You could use a 3rd party message passing system, such as RabbitMQ, ZMQ, and so on.
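To make the pipe option and its do-it-yourself framing more concrete, here is a minimal sketch, assuming a simple format where every message is preceded by a 4-byte length in network byte order. The function names (write_all, read_all, send_msg, recv_msg) and the format itself are illustrative choices, not a standard.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes. 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes. 0 on success, -1 on error or premature end of stream. */
int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send one message: 4-byte big-endian length, then the payload. */
int send_msg(int fd, const void *payload, uint32_t len)
{
    uint32_t hdr = htonl(len);
    if (write_all(fd, &hdr, sizeof hdr) < 0)
        return -1;
    return write_all(fd, payload, len);
}

/* Receive one message into buf (capacity cap).
 * Returns the payload length, or -1 on error, end of stream, or oversized message. */
ssize_t recv_msg(int fd, void *buf, size_t cap)
{
    uint32_t hdr;
    if (read_all(fd, &hdr, sizeof hdr) < 0)
        return -1;
    uint32_t len = ntohl(hdr);
    if (len > cap)
        return -1;
    if (read_all(fd, buf, len) < 0)
        return -1;
    return (ssize_t)len;
}
```

The parent would create the descriptors with pipe() before fork(), keep one end, and let the child use the other. Since these functions only need a file descriptor, the same code should work unchanged over a Unix-domain or TCP socket.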

When a process writes to a file

Generally, when a process writes to a file, e.g. a Python script running open('file', 'w').write('text'), what are the exact events that occur? By that I mean something along the lines of 'process A loads the file from hard disk to RAM, process B changes the content, then ...'. I've read about IPC and now I'm trying to dig deeper and understand more about processes. I couldn't find a thorough explanation of the subject, so if you could point me to one or explain it, I'd really appreciate it.
The example of "a python script running open('file', 'w').write('text')" is heavily OS-dependent. The only processes involved here are the process running the Python interpreter, which, e.g. on Linux, sometimes executes in user space and sometimes in kernel space, and possibly some kernel-only processes, with any IPC, if required, happening inside the kernel. Nothing requires other processes to be involved: everything down to the disk access itself could be handled by the user's process while it runs in kernel mode. In practice, though, other processes may take part. This is OS- and even driver-specific behavior.
In this particular example (which isn't great, because it relies on CPython automatically closing the file once the file object is no longer referenced), the Python process makes a system call to open the file, one to write to it, and one to close it. These are all blocking, that is, they do not return until the results are ready. When the process blocks, it is put on a queue waiting for some event to occur that makes it ready to run again.
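For reference, on Linux the three system calls mentioned above look roughly like this when issued directly from C. This is a sketch only: the helper name write_text is made up, and the flags and mode are my assumption of what opening with 'w' maps to, not something taken from the interpreter's source.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create or truncate path and write text into it.
 * Returns 0 on success, -1 on error. */
int write_text(const char *path, const char *text)
{
    /* System call 1: open. Blocks until the kernel has a descriptor for us. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;

    /* System call 2: write. Returns once the kernel has accepted the bytes;
     * a short write is possible, so loop until everything is handed over. */
    size_t len = strlen(text), done = 0;
    while (done < len) {
        ssize_t n = write(fd, text + done, len - done);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }

    /* System call 3: close. Releases the descriptor. */
    return close(fd);
}
```

Running such a program under strace is a quick way to see the actual calls a given interpreter or runtime makes.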
The alternative is asynchronous I/O, which can be performed by polling, by callbacks, or with the select() call, which can block until any one of a number of events has occurred.
But when most people talk about IPC, they are not usually talking about communication between or with kernel processes. Rather, they are talking about communication between multiple user processes and/or threads, using semaphores, mutexes, named pipes, etc. A good introduction to these sorts of things would be any tutorial information you can find on using pthreads, or even the Python threads and multiprocessing modules. There are examples there for several simple cases.
The primary difference between processes and threads on Linux is that threads share an address space and processes each have their own address space. Python itself adds the wrinkle of the GIL, which limits the utility of threads in Python somewhat.

Is there a Win32 API to copy a fragment of a file on another file?

I would like to programmatically copy a section of one file to another file. Is there any Win32 API I could use without moving the bytes through my program? Or should I just read from the source file and write to the target?
I know how to do this by reading and writing chunks of bytes, I just wanted to avoid doing it myself if the OS already offers that.
What you're asking for can be achieved, but not easily. Device drivers routinely transfer data without CPU involvement, but doing that requires kernel-mode code. Basically, you would have to write a device driver. The benefits would have to be huge to justify the difficulties associated with developing, testing, and distributing a kernel-mode driver. So unless you think there is a huge benefit at stake here, I'm afraid that ReadFile/WriteFile are the best you can do.

should I use processes or threads for my application?

I have an ARM device running a Linux 2.6 kernel, with 64 MB of RAM in total.
There is a data source: a meter that the Linux box queries over RS485, with Modbus as the application protocol.
There is another task that reads these values, builds a JSON object, and sends it via HTTP POST to a specific server.
Network operations might be slower than serial ones, especially with poor GPRS coverage.
I need concurrency; the program is written in C.
How would you implement the concurrency: with select() or with pthreads?
When analyzing this particular application there's really only one question relevant to choosing pthreads:
Do the sensor reader and network writer need to share an address space?
In this instance I think the answer is clearly "no". Of course that isn't the only possible question, but the only germane one. There are reasons to prefer separate processes:
the two halves of the application have no common code; RS485 is wildly different from HTTP/JSON
segregation of responsibility: if the RS485 side is waiting on a UART, do you really want to block the HTTP side?
letting the OS do its job so you don't have to: if using pthreads, you have to handle a lot of the synchronization and preemption that the kernel does for you for free, and code that you don't have to write has no new bugs.
Further analysis would require more detail than you've given, but here is one additional way to think about the choice: threads were invented to mitigate some limitations of the process model. Unless you know that you are going to hit those limitations, use separate processes.
added in response to comments:
I half agree with psusi's suggested design. There need only be two processes, one (let's say the sensor reader, that's a fine choice) which forks one and only one http sender. The two processes can communicate using traditional IPC like a pipe. The sensor process sends data down the pipe when it has some and the child (http) process packs it up in json and sends it on its way.
It only takes two long-lived processes, it probably uses about the same amount of memory as a pthread implementation would, and it is far, far easier to get right.
select() is more efficient, because it avoids the context switching that comes with multiple threads. And threads would be more efficient than separate processes, because you avoid having to copy the data (unless you set up shared memory, but at that point you might as well have gone with threads). However, writing non-blocking I/O, as with select(), is harder to get right, and doesn't enjoy the multitasking that comes with multiple threads. Multiple processes are likely to be the easiest implementation, especially because you can use curl rather than writing the HTTP POST half yourself.
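For comparison, the select() variant boils down to a single thread that sleeps until one of its descriptors has data. Here is a minimal sketch, assuming two descriptors (say, the serial port and the network socket). The function name wait_readable and the READY_* flags are made up for illustration.

```c
#include <stddef.h>
#include <sys/select.h>

enum { READY_A = 1, READY_B = 2 };

/* Block until fd_a and/or fd_b is readable, or until timeout_ms elapses.
 * Returns a bitmask of READY_A / READY_B, 0 on timeout, -1 on error. */
int wait_readable(int fd_a, int fd_b, int timeout_ms)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd_a, &rfds);
    FD_SET(fd_b, &rfds);

    struct timeval tv = {
        .tv_sec = timeout_ms / 1000,
        .tv_usec = (timeout_ms % 1000) * 1000,
    };

    /* The first argument is the highest descriptor number plus one. */
    int maxfd = fd_a > fd_b ? fd_a : fd_b;
    int n = select(maxfd + 1, &rfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;

    int mask = 0;
    if (FD_ISSET(fd_a, &rfds))
        mask |= READY_A;
    if (FD_ISSET(fd_b, &rfds))
        mask |= READY_B;
    return mask;
}
```

The main loop would call this repeatedly and read only from the descriptors reported ready, so a slow network never stalls the serial side. Note that select() tells you when reading will not block; sending without blocking additionally needs the write set or non-blocking descriptors, which is part of why this approach is harder to get right.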
Why do you need concurrency? Does the meter have to be polled at a strict time interval?
If the answer is YES: just use two processes. One polls the meter and writes the data to a ring buffer in NAND storage; the other reads from the ring buffer and sends the data over HTTP.
If the answer is NO: you don't need concurrency or non-blocking I/O at all. One big loop in main() is enough.

How do mainstream web servers implement this feature?

This means, for example, a module can start compressing the response from a backend server and stream it to the client before the module has received the entire response from the backend.
Nice!
I know it's some kind of asynchronous IO, but an explanation that simple isn't enough.
Does anyone know how it works?
Without looking at the source code of an actual implementation, I'm speculating here:
It's most likely some kind of stream (abstract buffered IO) that is passed from one module to the other ("chaining"). One module (maybe a servlet container) writes to a stream that is read by another module (the compression module in your example), which then writes its output to another stream. The contents of that stream may then be processed further or transmitted to the client.
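In the same speculative spirit, the chaining idea can be sketched as a loop that pushes each chunk through a filter as soon as it arrives, instead of waiting for the whole response. Everything here is hypothetical: the chunk_filter type and the pump function are invented for illustration, and upcase_filter is only a placeholder standing in for a real compressor. None of this is taken from an actual web server.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096

/* Hypothetical filter type: transforms n input bytes into out, which has
 * room for at least n bytes, and returns the number of bytes produced. */
typedef size_t (*chunk_filter)(const char *in, size_t n, char *out);

/* Placeholder for a real compressor: merely upper-cases ASCII letters. */
size_t upcase_filter(const char *in, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (char)toupper((unsigned char)in[i]);
    return n;
}

/* Forward in_fd to out_fd through the filter, one chunk at a time, until
 * end of input. Returns 0 on success, -1 on error. */
int pump(int in_fd, int out_fd, chunk_filter filter)
{
    char in[CHUNK], out[CHUNK];
    for (;;) {
        ssize_t n = read(in_fd, in, sizeof in);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0)
            return -1;
        if (n == 0)
            return 0;               /* upstream module closed its end */

        /* Process this chunk immediately; later chunks may not exist yet. */
        size_t produced = filter(in, (size_t)n, out);
        size_t done = 0;
        while (done < produced) {
            ssize_t w = write(out_fd, out + done, produced - done);
            if (w < 0 && errno == EINTR)
                continue;
            if (w < 0)
                return -1;
            done += (size_t)w;
        }
    }
}
```

With in_fd connected to the backend and out_fd to the client (or to the next module), the first bytes reach the client while the backend is still producing the rest. A real compressor would also keep state between chunks and flush at the end, which this placeholder skips.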
The backend may need to wait on IO before it can fully produce the page. Modules can begin compressing the start of the message before the backend is entirely done writing it.
To understand why this is useful, you need to understand how nginx is structured. nginx is a server that relies on non-blocking input and output. Normally, a server will use blocking input and output: it will listen for connections, and when one arrives, it will process the page. In order to increase throughput, multiple threads, called 'workers', are spawned.
Contrast this with nginx: it continually asks the kernel, "Are any of my IO requests ready?" This allows it to handle the same number of pages with 1) less overhead from all the different processes, and 2) lower memory usage. It has some downsides, however. First, for extremely low-volume applications, nginx may use more CPU than a blocking server. Second, it's much less portable: Windows uses an entirely different model for non-blocking IO.
Getting back to your original question, compressing the beginning of a page is useful because that part is already done by the time the backend finishes accessing a database, reading from disk, or what-have-you, so only the rest of the page remains to be compressed.
