What are the differences between poll and select? - c

I am referring to the POSIX standard select and poll system C API calls.

The select() call has you create three bitmasks to mark which sockets and file descriptors you want to watch for reading, writing, and errors, and then the operating system marks which ones in fact have had some kind of activity; poll() has you create a list of descriptor IDs, and the operating system marks each of them with the kind of event that occurred.
The select() method is rather clunky and inefficient.
There are typically more than a thousand potential file descriptors available to a process. If a long-running process has only a few descriptors open, but at least one of them has been assigned a high number, then the bitmask passed to select() has to be large enough to accommodate that highest descriptor, so whole ranges of hundreds of bits will be unset that the operating system has to loop across on every select() call just to discover that they are unset.
Once select() returns, the caller has to loop over all three bitmasks to determine what events took place. In very many typical applications only one or two file descriptors will get new traffic at any given moment, yet all three bitmasks must be read all the way to the end to discover which descriptors those are.
Because the operating system signals you about activity by rewriting the bitmasks, they are ruined and are no longer marked with the list of file descriptors you want to listen to. You either have to rebuild the whole bitmask from some other list that you keep in memory, or you have to keep a duplicate copy of each bitmask and memcpy() the block of data over on top of the ruined bitmasks after each select() call.
So the poll() approach works much better because you can keep re-using the same data structure.
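A minimal sketch of that difference, assuming a single descriptor watched for readability. The helper names wait_readable_select and wait_readable_poll are hypothetical and exist only for illustration; the select() version has to rebuild its fd_set on every call, while the poll() version reuses a caller-owned struct pollfd.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd is readable within timeout_ms,
 * 0 on timeout, -1 on error. The fd_set is rebuilt on every call
 * because select() overwrites it with the result. */
int wait_readable_select(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int n;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;              /* select() cannot represent this fd */

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return (n > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}

/* Hypothetical helper with the same contract. The caller's pollfd keeps
 * its fd and events fields across calls; only revents is written. */
int wait_readable_poll(struct pollfd *pfd, int timeout_ms)
{
    int n = poll(pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    return (n > 0 && (pfd->revents & POLLIN)) ? 1 : 0;
}
```

A caller would fill in one struct pollfd (fd plus events = POLLIN) once and pass the same structure to wait_readable_poll() on every iteration of its loop.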
In fact, poll() has inspired yet another mechanism in modern Linux kernels: epoll() which improves even more upon the mechanism to allow yet another leap in scalability, as today's servers often want to handle tens of thousands of connections at once. This is a good introduction to the effort:
http://scotdoyle.com/python-epoll-howto.html
While this link has some nice graphs showing the benefits of epoll() (you will note that select() is by this point considered so inefficient and old-fashioned that it does not even get a line on these graphs!):
http://lse.sourceforge.net/epoll/index.html
Update: Here is another Stack Overflow question, whose answer gives even more detail about the differences:
Caveats of select/poll vs. epoll reactors in Twisted

I think that this answers your question:
From Richard Stevens (rstevens#noao.edu):
The basic difference is that select()'s fd_set is a bit mask and
therefore has some fixed size. It would be possible for the kernel to
not limit this size when the kernel is compiled, allowing the
application to define FD_SETSIZE to whatever it wants (as the comments
in the system header imply today) but it takes more work. 4.4BSD's
kernel and the Solaris library function both have this limit. But I
see that BSD/OS 2.1 has now been coded to avoid this limit, so it's
doable, just a small matter of programming. :-) Someone should file a
Solaris bug report on this, and see if it ever gets fixed.
With poll(), however, the user must allocate an array of pollfd
structures, and pass the number of entries in this array, so there's
no fundamental limit. As Casper notes, fewer systems have poll() than
select, so the latter is more portable. Also, with original
implementations (SVR3) you could not set the descriptor to -1 to tell
the kernel to ignore an entry in the pollfd structure, which made it
hard to remove entries from the array; SVR4 gets around this.
Personally, I always use select() and rarely poll(), because I port my
code to BSD environments too. Someone could write an implementation
of poll() that uses select(), for these environments, but I've never
seen one. Both select() and poll() are being standardized by POSIX
1003.1g.
October 2017 Update:
The email referenced above is at least as old as 2001; the poll() call is now (2017) supported across all modern operating systems, including BSD. In fact, some people believe that select() should be deprecated. Opinions aside, portability issues around poll() are no longer a concern on modern systems. Furthermore, epoll() has since been developed (you can read the man page), and continues to rise in popularity.
For modern development you probably don't want to use select(), although there's nothing explicitly wrong with it. poll(), and its more modern evolution epoll(), provide the same features (and more) as select() without suffering from the limitations therein.
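For comparison, here is a small Linux-only sketch of the epoll interface. What the text calls epoll() is really a family of calls: epoll_create1(), epoll_ctl() and epoll_wait(). The helper names epoll_watch_readable and epoll_next_ready are hypothetical. The descriptor is registered once and the kernel remembers it between waits.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: creates an epoll instance that watches fd for
 * readability. Returns the epoll descriptor, or -1 on error. */
int epoll_watch_readable(int fd)
{
    struct epoll_event ev;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;        /* level-triggered readability */
    ev.data.fd = fd;            /* handed back to us by epoll_wait() */

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Hypothetical helper: returns one ready descriptor, or -1 on timeout
 * or error. Nothing has to be re-registered between calls. */
int epoll_next_ready(int epfd, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);

    return (n == 1) ? ev.data.fd : -1;
}
```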

Both of them are slow and mostly the same, but they differ in capacity and in some features.
When you call select() in a loop, you need to copy the descriptor set every time, whereas poll() fixes this problem and lets you write cleaner code. Another difference is that poll() can handle more than 1024 file descriptors (FDs) by default. poll() also reports distinct event types per descriptor, which makes the program more readable than keeping many separate variables for that job. Operations in both poll() and select() are linear in the number of descriptors, and therefore slow, because every descriptor has to be checked.

Related

How to choose for multithreading - c

I have to write a client-server program in C in which the server can use n threads that work simultaneously to handle client requests.
To do this, I use a socket with a listener that puts the new FD (of each new connection request) in a list, and the threads then take an FD from the list when they are able to.
I know that I can also use a pipe for communication between threads.
Is the socket the best way? And why or why not?
Sorry for my bad English
To communicate between threads you can use socket as well as shared memory.
Many multithreading libraries are available on GitHub; one that I have used is the following.
https://github.com/snikulov/prog_posix_threads/blob/master/workq.c
I tried and tested it in the same way you describe, and it works perfectly!
There's one very nice resource related to socket multiplexing which I think you should stop and read after reading this answer. That resource is entitled The C10K problem, and it details numerous solutions to the problem people faced in the year 2000, of handling 10000 clients.
Of those solutions, multithreading is not the primary one. Indeed, multithreading as an optimisation should be one of your last resorts, as that optimisation will interfere with the instruments you use to diagnose other optimisations.
In general, here is how you should perform optimisations, in order to provide guaranteed justifications:
Use a profiler to determine the most significant bottlenecks (in your single-threaded program).
Perform your optimisation upon one of the more significant bottlenecks.
Use the profiler again, with the same set of data, to verify that your optimisation worked correctly.
You can repeat these steps ad infinitum until you decide the improvements are no longer tangible (meaning, good luck observing the differences between before and after). Following these steps will provide you with data you can show your employer, if he/she asks you what you've been doing for the last hour, so make sure you save the output of your profiler at each iteration.
Optimisations are per-machine; what this means is that an optimisation for your machine might actually be slower on another machine. For example, you may use a buffer of 4096 bytes for your machine, while the cache lines for another machine might indicate that 512 bytes is a better idea.
Hence, ideally, we should design programs and modules in such a way that their resources are minimal and can easily be scaled up, substituted and/or otherwise adjusted for other machines. This can be difficult, as it means in the buffer example above you might start off with a buffer of one byte; you'd most likely need to study finite state machines to achieve that, and using buffers of one byte might not always be technically feasible (i.e. when dealing with fields that are guaranteed to be a certain width; you should use that width as your minimum limit, and scale up from there). The reward is code that is ultra-portable and ultra-optimisable in all situations.
Keep in mind that extra threads use extra resources; we tend to assume that the stack space reserved for a thread can grow to 1MB, so 10000 sockets occupying 10000 threads (in a thread-per-socket model) would occupy about 10GB of memory! Yikes! The minimal resources method suggests that we should start off with one thread, and scale up from there, using a multithreading profiler to measure performance like in the three steps above.
I think you'll find, though, that for anything purely socket-driven, you likely won't need more than one thread, even for 10000 clients, if you study the C10K problem or use some library which has been engineered based on those findings (see your comments for one such suggestion). We're not talking about masses of number crunching, here; we're talking about socket operations, which the kernel likely processes using a single core, and so you can likely match that single core with a single thread, and avoid any context switching or thread synchronisation troubles/overheads incurred by multithreading.

how important is setting max fd on select?

Within an infinite loop, I am listening on 100+ file descriptors using select. If an fd has some packets ready to be read, I notify the packet processor thread assigned to this file descriptor, and I don't set the bit for this file descriptor for the next round until I receive a notification from the data processor thread saying it is done. I wonder how inefficient my code would be if I don't recalculate the max fd for select every time I clear/set a file descriptor in the set. I am expecting file descriptors to be nearly contiguous, and the data arrival rate to be a few thousand bytes every second for each fd.
You should really use poll instead of select. Both are standard, but poll is easier to use, does not place a limit on the number of file descriptors you can check (whereas select limits you to the compile-time constant FD_SETSIZE), and more efficient. If you do use select, you can always pass FD_SETSIZE for the first argument, but this will of course give worst-case performance since the kernel has to scan the whole fd_set; passing the actual max+1 allows a shorter search, but still not as efficient as the array passed to poll.
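A short sketch of computing the "actual max+1" value mentioned above, assuming the watched descriptors are kept in a plain int array. The helper name build_read_set is hypothetical.

```c
#include <sys/select.h>

/* Hypothetical helper: fills *set from fds[0..count-1] and returns the
 * value to pass as select()'s first argument (highest fd + 1), or -1 if
 * a descriptor does not fit in an fd_set. */
int build_read_set(const int *fds, int count, fd_set *set)
{
    int max_fd = -1;
    int i;

    FD_ZERO(set);
    for (i = 0; i < count; i++) {
        if (fds[i] < 0 || fds[i] >= FD_SETSIZE)
            return -1;          /* outside what select() can watch */
        FD_SET(fds[i], set);
        if (fds[i] > max_fd)
            max_fd = fds[i];
    }
    return max_fd + 1;          /* 0 when the array is empty */
}
```

Because the set has to be rebuilt before each select() call anyway, recomputing the maximum in the same pass costs almost nothing extra.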
For what it's worth, these days it seems stylish to use the nonstandard Linux epoll or whatever the BSD equivalent is. These interfaces may have some advantages if you have a huge number (on the order of tens of thousands) of long-lived (at least several round trips) connections, but otherwise performance will not be noticeably better (and, at the lower end, may be worse), and these interfaces are of course non-portable, and in my opinion, harder to use correctly than the plain, portable poll.
It is in principle important to give a good max fd to select (but with only a few hundreds of file descriptors in your process that does not matter much).
But select is becoming obsolete (precisely because of the max fd, so the kernel will take O(m) time where m is the max.fd; so select could be costly if using it on a small set of file descriptors whose max m is large). Use poll(2) instead (which, when given a set of n file descriptors takes O(n) time, independently of the maximal file descriptor m).
Current Linux systems and processes might have many dozens of thousands of file descriptors. Read about the C10K problem.
And you might have some event loop, e.g. use libraries like libevent or libev (which might use poll internally, and may use more operating-system-specific things like epoll, abstracting them behind a convenient interface).

Atomicity of UNIX read()/write() when sending data to device

When writing directly to a device in /dev, I open a file descriptor and perform a UNIX write() followed by a read(). Can I have multiple threads doing this write()/read() sequence on the same file descriptor, and not get jumbled data if two threads enter the write() function at the same time?
References to std documentation would be immensely helpful. I've not been able to find anything though. Someone has mentioned that such operations are atomic in the kernel, but I am sceptical.
Also, to clarify this is a file in /dev, so any insight as to how far the "file pointer" concept applies here is helpful as well.
File pointers (FILE *fp, for example) are a layer in the user-side code sitting above the function calls (such as write()). Access to fp is controlled by locks in a threaded environment (you can't have two threads modifying the same structure at the same time).
Inside the kernel, I'd expect there to be a lock on the file descriptor (and/or 'open file description') to prevent it being used from two threads at once.
You can look up the POSIX specification for read() and
getchar_unlocked()
to find out more about locking etc — at least for a POSIX compliant implementation.
Note that POSIX still uses C99. Therefore, it is not cognizant of the C11 thread facilities. The C11 standard does not have read() et al (file I/O using file descriptors), so it says nothing about such system calls. Neither does it provide a getchar_unlocked() or any of its relatives.
Caveat: I have not been in kernels for awhile, but this is the way it used to work.
For disk files:
Can you open the file in append mode, writing block sizes <= BLKSIZE ?
Small enough block sizes guarantee atomic writes in POSIX environments (actually, the limit may be greater than BLKSIZE... I'm too lazy to hunt around for the alternate symbol).
Append guarantees seeks to the end of the file... for devices supporting seeks. Combine with atomic writes you should be golden.
Each buffer must stand by itself under the assumption some "foreign" information may follow it.
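A minimal sketch of the append-mode approach for disk files, assuming each record is small and self-contained. The helper name append_record is hypothetical; it issues exactly one write() per record on a descriptor opened with O_APPEND.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: appends one self-contained record using a single
 * write() call. Returns 0 if the whole record was written, else -1. */
int append_record(const char *path, const char *record)
{
    size_t len = strlen(record);
    ssize_t n;
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd < 0)
        return -1;

    /* O_APPEND makes the kernel position each write at end of file. */
    n = write(fd, record, len);
    close(fd);

    return (n == (ssize_t)len) ? 0 : -1;
}
```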
For ttys:
Append mode makes no sense here. The same advice applies as before, but paying attention to line endings becomes even more important. None of this applies to reads. Codes that ttys treat as control sequences can also trip you up if a sequence, or the mode it enables, is split across blocks.
For other devices:
Can get tricky here. Depends on the device.
I'm going to assume that you are referring to a generic character device (e.g. a tty) since you were not specific. As far as I know, each fd-type operation (e.g. read()/write()) maps directly into a call into the driver.
Therefore, the driver will receive each write()'s data chunk as a whole and not see the next one's data until it is done (e.g. data is queued to be transmitted).
However, if the driver does not consume the entire chunk of data at once (i.e. write() returns less than the specified number of bytes), then there is no guarantee that the thread will be able to write the remainder before another thread does a different write().
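One way to guard against the partial-write case within a single process is to retry the remainder while holding a user-space lock. The sketch below is an assumption-laden illustration, not something the driver provides: write_all_locked and write_lock are hypothetical names, and it only helps if every thread writing to the device goes through the same helper.

```c
#include <errno.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical process-wide lock shared by all writers to the device. */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper: writes all len bytes, retrying after short
 * writes, with the lock held so no other thread using this helper can
 * slip a write in between the pieces. Returns 0 on success, -1 on error. */
int write_all_locked(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    int rc = 0;

    pthread_mutex_lock(&write_lock);
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0 && errno == EINTR)
            continue;           /* interrupted before writing; retry */
        if (n <= 0) {
            rc = -1;            /* real error, or no progress */
            break;
        }
        p += n;
        len -= (size_t)n;
    }
    pthread_mutex_unlock(&write_lock);
    return rc;
}
```

A write()/read() request-response pair could be protected the same way by keeping the lock held across both calls.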
Also, as Johnathan Leffler noted, if you use standard I/O with process-level buffering, all bets are off.
Bottom line, if you are using direct fd writes, each write will map directly to one driver function call. From there, it's up to the driver as to if write is atomic.
Edit: wlformyd brings up the question of locking between multiple threads on multiple processors. To my knowledge, there is no locking on a FD and, in fact, that would be ineffective as multiple FDs could be used to access the same device.
I believe it is up to the driver itself to do locking to prevent contention to internal queues and/or hardware. In that sense, on a multi-processor system, the kernel doesn't prevent multiple simultaneous access to a driver's write routine. However, a properly written driver should do the locking to prevent mixing of output between two write calls.

Send data to multiple sockets using pipes, tee() and splice()

I'm duplicating a "master" pipe with tee() to write to multiple sockets using splice(). Naturally these pipes will get emptied at different rates depending on how much I can splice() to the destination sockets. So when I next go to add data to the "master" pipe and then tee() it again, I may have a situation where I can write 64KB to the pipe but only tee 4KB to one of the "slave" pipes. I'm guessing then that if I splice() all of the "master" pipe to the socket, I will never be able to tee() the remaining 60KB to that slave pipe. Is that true? I guess I can keep track of a tee_offset (starting at 0) which I set to the start of the "unteed" data and then don't splice() past it. So in this case I would set tee_offset to 4096 and not splice more than that until I'm able to tee it to all to the other pipes. Am I on the right track here? Any tips/warnings for me?
If I understand correctly, you've got some realtime source of data that you want to multiplex to multiple sockets. You've got a single "source" pipe hooked up to whatever's producing your data, and you've got a "destination" pipe for each socket over which you wish to send the data. What you're doing is using tee() to copy data from the source pipe to each of the destination pipes and splice() to copy it from the destination pipes to the sockets themselves.
The fundamental issue you're going to hit here is if one of the sockets simply can't keep up - if you're producing data faster than you can send it, then you're going to have a problem. This isn't related to your use of pipes, it's just a fundamental issue. So, you'll want to pick a strategy to cope in this case - I suggest handling this even if you don't expect it to be common as these things often come up to bite you later. Your basic choices are to either close the offending socket, or to skip data until it's cleared its output buffer - the latter choice might be more suitable for audio/video streaming, for example.
The issue which is related to your use of pipes, however, is that on Linux the size of a pipe's buffer is somewhat inflexible. It defaults to 64K since Linux 2.6.11 (the tee() call was added in 2.6.17) - see the pipe manpage. Since 2.6.35 this value can be changed via the F_SETPIPE_SZ option to fcntl() (see the fcntl manpage) up to the limit specified by /proc/sys/fs/pipe-size-max, but the buffering is still more awkward to change on-demand than a dynamically allocated scheme in user-space would be. This means that your ability to cope with slow sockets will be somewhat limited - whether this is acceptable depends on the rate at which you expect to receive and be able to send data.
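A short Linux-only sketch of the F_SETPIPE_SZ option mentioned above. The helper name resize_pipe is hypothetical; it also uses F_GETPIPE_SZ to read back the capacity, because the kernel may round the request up.

```c
#define _GNU_SOURCE             /* exposes F_SETPIPE_SZ / F_GETPIPE_SZ */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: asks the kernel to resize the pipe's buffer and
 * returns the capacity actually in effect, or -1 on error (for example
 * when the request exceeds /proc/sys/fs/pipe-max-size for an
 * unprivileged process). */
int resize_pipe(int pipe_fd, int requested_bytes)
{
    if (fcntl(pipe_fd, F_SETPIPE_SZ, requested_bytes) < 0)
        return -1;
    return fcntl(pipe_fd, F_GETPIPE_SZ);
}
```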
Assuming this buffering strategy is acceptable, you're correct in your assumption that you'll need to track how much data each destination pipe has consumed from the source, and it's only safe to discard data which all destination pipes have consumed. This is somewhat complicated by the fact that tee() doesn't have the concept of an offset - you can only copy from the start of the pipe. The consequence of this is that you can only copy at the speed of the slowest socket, since you can't use tee() to copy to a destination pipe until some of the data has been consumed from the source, and you can't do this until all the sockets have the data you're about to consume.
How you handle this depends on the importance of your data. If you really need the speed of tee() and splice(), and you're confident that a slow socket will be an extremely rare event, you could do something like this (I've assumed you're using non-blocking IO and a single thread, but something similar would also work with multiple threads):
1. Make sure all pipes are non-blocking (use fcntl(d, F_SETFL, O_NONBLOCK) to make each file descriptor non-blocking).
2. Initialise a read_counter variable for each destination pipe to zero.
3. Use something like epoll() to wait until there's something in the source pipe.
4. Loop over all destination pipes where read_counter is zero, calling tee() to transfer data to each one. Make sure you pass SPLICE_F_NONBLOCK in the flags.
5. Increment read_counter for each destination pipe by the amount transferred by tee(). Keep track of the lowest resultant value.
6. Find the lowest resultant value of read_counter - if this is non-zero, then discard that amount of data from the source pipe (using a splice() call with a destination opened on /dev/null, for example). After discarding data, subtract the amount discarded from read_counter on all the pipes (since this was the lowest value then this cannot result in any of them becoming negative).
7. Repeat from step 3.
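Steps 4 to 6 of the list above can be sketched roughly as follows. This is a Linux-only illustration under several assumptions: fan_out_once is a hypothetical name, the caller has already created the pipes and opened /dev/null for writing, and the waiting in step 3 and the handling of a stalled destination are left to the caller.

```c
#define _GNU_SOURCE             /* exposes tee(), splice(), SPLICE_F_* */
#include <fcntl.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: one pass of steps 4-6. Returns the number of
 * bytes discarded from the source pipe (0 if some destination has not
 * caught up yet), or -1 if the discarding splice() fails. */
ssize_t fan_out_once(int src_rd, const int *dst_wr, size_t *read_counter,
                     int ndst, int devnull_fd)
{
    size_t lowest = SIZE_MAX;
    ssize_t gone;
    int i;

    /* Steps 4 and 5: tee into every destination that has caught up. */
    for (i = 0; i < ndst; i++) {
        ssize_t n;

        if (read_counter[i] != 0)
            continue;
        n = tee(src_rd, dst_wr[i], INT_MAX, SPLICE_F_NONBLOCK);
        if (n > 0)
            read_counter[i] += (size_t)n;
        /* n < 0 with errno == EAGAIN means this destination pipe is
         * full (or the source is empty); the caller decides what to do. */
    }

    /* Step 6: only data that every destination has seen may be dropped. */
    for (i = 0; i < ndst; i++)
        if (read_counter[i] < lowest)
            lowest = read_counter[i];
    if (ndst == 0 || lowest == 0)
        return 0;

    gone = splice(src_rd, NULL, devnull_fd, NULL, lowest, SPLICE_F_NONBLOCK);
    if (gone < 0)
        return -1;
    for (i = 0; i < ndst; i++)
        read_counter[i] -= (size_t)gone;
    return gone;
}
```

The caller would invoke this each time its event loop reports the source pipe readable, and separately splice() each destination pipe to its socket.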
Note: one thing that's tripped me up in the past is that SPLICE_F_NONBLOCK affects whether the tee() and splice() operations on the pipes are non-blocking, and the O_NONBLOCK you set with fcntl() affects whether the interactions with other calls (e.g. read() and write()) are non-blocking. If you want everything to be non-blocking, set both. Also remember to make your sockets non-blocking or the splice() calls to transfer data to them might block (unless that's what you want, if you're using a threaded approach).
As you can see, this strategy has a major problem - as soon as one socket blocks up, everything halts - the destination pipe for that socket will fill up, and then the source pipe will become stagnant. So, if you reach the stage where tee() returns EAGAIN in step 4 then you'll want to either close that socket, or at least "disconnect" it (i.e. take it out of your loop) such that you don't write anything else to it until its output buffer is empty. Which you choose depends on whether your data stream can recover from having bits of it skipped.
If you want to cope with network latency more gracefully then you're going to need to do more buffering, and this is going to involve either user-space buffers (which rather negates the advantages of tee() and splice()) or perhaps a disk-based buffer. The disk-based buffering will almost certainly be significantly slower than user-space buffering, and hence not appropriate given that presumably you want a lot of speed since you've chosen tee() and splice() in the first place, but I mention it for completeness.
One thing that's worth noting if you end up inserting data from user-space at any point is the vmsplice() call which can perform "gather output" from user-space into a pipe, in a similar way to the writev() call. This might be useful if you're doing enough buffering that you've split your data among multiple different allocated buffers (for example if you're using a pool allocator approach).
Finally, you could imagine swapping sockets between the "fast" scheme of using tee() and splice() and, if they fail to keep up, moving them on to a slower user-space buffering. This is going to complicate your implementation, but if you're handling large numbers of connections and only a very small proportion of them are slow then you're still reducing the amount of copying to user-space that's involved somewhat. However, this would only ever be a short-term measure to cope with transient network issues - as I said originally, you've got a fundamental problem if your sockets are slower than your source. You'd eventually hit some buffering limit and need to skip data or close connections.
Overall, I would carefully consider why you need the speed of tee() and splice() and whether, for your use-case, simply user-space buffering in memory or on disk would be more appropriate. If you're confident that the speeds will always be high, however, and limited buffering is acceptable then the approach I outlined above should work.
Also, one thing I should mention is that this will make your code extremely Linux-specific - I'm not aware of these calls being supported in other Unix variants. The sendfile() call is more restricted than splice(), but might be rather more portable. If you really want things to be portable, stick to user-space buffering.
Let me know if there's anything I've covered which you'd like more detail on.

Using fseek/fwrite from multiple processes to write to different areas of a file?

I recently came across a bit of not-well-tested legacy code for writing data that's distributed across multiple processes (these are part of an MPI-based parallel computation) into the same file. Is this actually guaranteed to work?
It goes like this:
All processes open the same file for writing.
Each process calls fseek to seek to a different location within the file. This position may be past the end of the file.
Each process then writes a block of data into the file with fwrite. The seek locations and block sizes are such that these writes completely tile a section of the file: no gaps, no overlaps.
Is this guaranteed to work, or will it sometimes fail horribly? There is no locking to serialize the writes, and in fact they are likely to be starting from a synchronization point. On the other hand, we can guarantee that they are writing to different file positions, unlike other questions which have had issues with trying to write to the "end of the file" from multiple processes.
It occurs to me that the processes may be on different machines that mount the file via NFS, which I suspect probably answers my question -- but, would it work if the file is local?
I believe this will typically work but there is no guarantee that I can find. The Posix specifications for fwrite(3) defer to ISO C and neither standard mentions concurrency.
So I suspect it will typically work, but fseek(3) and fwrite(3) are buffered I/O functions, so success will depend on internal details of the library implementation. So, absolutely no guarantee but various reasons to expect that it will work.
Now, should the program use lseek(2) and write(2) then I believe you could consider the results guaranteed, but now it's restricted to Posix operating systems.
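A minimal sketch of the unbuffered variant, assuming each process knows its own offset and block. The helper name write_block_at is hypothetical; it uses lseek(2) and write(2) directly, so no stdio buffer sits between the process and the file.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: writes len bytes of block at byte offset off,
 * which may lie past the current end of the file. Returns 0 on
 * success, -1 on error or short write. */
int write_block_at(const char *path, off_t off, const void *block, size_t len)
{
    int rc = -1;
    int fd = open(path, O_WRONLY | O_CREAT, 0644);

    if (fd < 0)
        return -1;

    if (lseek(fd, off, SEEK_SET) == off) {
        ssize_t n = write(fd, block, len);

        if (n == (ssize_t)len)
            rc = 0;
    }
    close(fd);
    return rc;
}
```

pwrite(2) combines the seek and the write into a single call and could be used in place of the lseek()/write() pair.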
One thing seems ... odd ... why would an MPI program decide to share its data via NFS and not the message API? It would seem slower, less portable, more prone to trouble, and generally just a waste of the MPI feature set. It certainly is no more distributed given the reliance on a single NFS server.
