Can anyone tell me the use and application of the select() function in socket programming in C?
The select() function allows you to implement an event-driven design pattern when you have to deal with multiple event sources.
Let's say you want to write a program that responds to events coming from several event sources, e.g. the network (via sockets), user input (via stdin), other programs (via pipes), or any other event source that can be represented by a file descriptor (fd). You could start separate threads to handle each event source, but you would have to manage the threads and deal with concurrency issues.
The other option would be to use a mechanism where you can aggregate all the fds into a single entity (an fd set), and then just call a function to wait on that set. This function would return whenever an event occurs on any of the fds. You could check which fd the event occurred on, read that fd, process the event, and respond to it. After you have done that, you would go back and sit in that wait function until another event on some fd arrives.
The select facility is such a mechanism, and the select() function is the wait function. You can find the details of how to use it in any number of books and online resources.
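For illustration, a minimal C sketch of such a wait loop might look like this (sock_fd and pipe_fd are placeholder names for whatever descriptors you happen to have, and error handling is elided):

/* Sketch of a select()-based wait loop over several event sources. */
#include <sys/select.h>
#include <unistd.h>

void event_loop(int sock_fd, int pipe_fd)
{
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(sock_fd, &readfds);            /* watch the socket   */
        FD_SET(pipe_fd, &readfds);            /* watch the pipe     */
        FD_SET(STDIN_FILENO, &readfds);       /* watch user input   */

        int maxfd = sock_fd > pipe_fd ? sock_fd : pipe_fd;
        if (STDIN_FILENO > maxfd) maxfd = STDIN_FILENO;

        /* Block here until one of the descriptors has data to read. */
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
            break;                            /* error handling elided */

        if (FD_ISSET(sock_fd, &readfds))      { /* read and handle socket data */ }
        if (FD_ISSET(pipe_fd, &readfds))      { /* read and handle pipe data   */ }
        if (FD_ISSET(STDIN_FILENO, &readfds)) { /* handle user input           */ }
    }
}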
The select function allows you to check on several different sockets or pipes (or any file descriptors at all if you are not on Windows), and do something based on whichever one is ready first. More specifically, the arguments for the select function are split up into three groups:
Reading: When any of the file descriptors in this category are ready for reading, select will return them to you.
Writing: When any of the file descriptors in this category are ready for writing, select will return them to you.
Exceptional: When any of the file descriptors in this category have an exceptional case -- that is, they close uncleanly, a connection breaks or they have some other error -- select will return them to you.
The power of select is that individual file/socket/pipe functions are often blocking. Select allows you to monitor the activity of several different file descriptors without having to dedicate a thread of your program to each function call.
In order for you to get a more specific answer, you will probably have to mention what language you are programming in. I have tried to give as general an answer as possible on the conceptual level.
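Since the original question mentions C, here is a rough POSIX C sketch of how the three groups are passed (conn_fd is a placeholder for whatever socket you care about):

/* Sketch: asking select() about readability, writability and exceptional
 * conditions on a single descriptor, with a 5-second timeout. */
#include <sys/select.h>

int wait_for_activity(int conn_fd)
{
    fd_set readfds, writefds, exceptfds;
    FD_ZERO(&readfds);
    FD_ZERO(&writefds);
    FD_ZERO(&exceptfds);

    FD_SET(conn_fd, &readfds);    /* tell me when I can read without blocking  */
    FD_SET(conn_fd, &writefds);   /* tell me when I can write without blocking */
    FD_SET(conn_fd, &exceptfds);  /* tell me about exceptional conditions      */

    struct timeval timeout = { .tv_sec = 5, .tv_usec = 0 };  /* give up after 5 s */
    return select(conn_fd + 1, &readfds, &writefds, &exceptfds, &timeout);
}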
select() is the low-tech way of polling sockets for new data to read or for an open TCP window to write. Unless there's some compelling reason not to, you're probably better off using poll(), or epoll_wait() if your platform has it, for better performance.
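For comparison, a minimal poll() sketch (sock_fd is again a placeholder); unlike select(), it is not limited by FD_SETSIZE:

/* Sketch: waiting on one descriptor with poll() instead of select(). */
#include <poll.h>

int wait_with_poll(int sock_fd)
{
    struct pollfd pfd;
    pfd.fd = sock_fd;
    pfd.events = POLLIN | POLLOUT;   /* interested in readability and writability */

    /* Block for up to 5000 ms; returns the number of ready descriptors,
     * 0 on timeout, or -1 on error. */
    int ready = poll(&pfd, 1, 5000);
    if (ready > 0 && (pfd.revents & POLLIN)) {
        /* data can be read without blocking */
    }
    return ready;
}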
I like the description at gnu.org:
Sometimes a program needs to accept input on multiple input channels whenever input arrives. For example, some workstations may have devices such as a digitizing tablet, function button box, or dial box that are connected via normal asynchronous serial interfaces; good user interface style requires responding immediately to input on any device. [...]
You cannot normally use read for this purpose, because this blocks the program until input is available on one particular file descriptor; input on other channels won’t wake it up. You could set nonblocking mode and poll each file descriptor in turn, but this is very inefficient.
A better solution is to use the select function. This blocks the program until input or output is ready on a specified set of file descriptors, or until a timer expires, whichever comes first.
Per the Linux man pages and the MSDN documentation for Windows:
select() and pselect() allow a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some class of I/O operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform the corresponding I/O operation (e.g., read(2)) without blocking.
For simple explanation: often it is required for an application to do multiple things at once. For example you may access multiple sites in a web browser, a web server may want to serve multiple clients simultaneously. One needs a mechanism to monitor each socket so that the application is not busy waiting for one communication to complete.
An example: imagine downloading a large Facebook page on your smartphone while travelling on a train. Your connection is intermittent and slow; the web server should still be able to process other clients while waiting for your communication to finish.
select(2) - Linux man page
select Function - Winsock Functions
Related
I'm coding for a Linux platform using C. Let's say I have 2 threads, A and B.
A is an infinite loop that constantly tries to find out whether there is data on the socket localhost:8080, whereas B is a thread that spends most of its time in a blocked state until A calls the unlock function on a mutex that B uses to block itself. A will unlock B when it receives appropriate data on the socket.
So you see here is a problem: B is largely "event driven" whereas A is in a constant running state. My target platform isn't resource-rich, so I wish A could be "activated" and enter a running state only when it receives data on the socket, instead of constantly looping.
So how can I do that? If it matters, I wish to do this for both UDP and TCP sockets.
There are multiple ways of doing what you want in a clean way. One approach, which you are kind of using already, is an event system. A real event system would be overkill for the kind of problem you are dealing with, but one can be found here. This is a (random) better implementation, capable of listening for multiple file descriptors and time-based events, all in a single thread.
If you want to build one yourself, you should take a look at the select or poll function.
But I agree with @Jeremy Friesner: you should definitely use the functions made for socket programming, they are perfect for your kind of problem. Only use the event system approach if you really need it (with multiple sockets/timed events).
You simply call recv (or recvfrom, recvmsg, etc) and it doesn't return until some data has been received. There's no need to "constantly try to find out if there is data" - that's silly.
If you set the socket to non-blocking mode then recv will return even if there's no data. If that's what you're doing, then the solution is simple: don't set the socket to non-blocking mode.
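As a rough illustration of that advice (sock_fd, b_mutex and handle_data are placeholders, not names from the question), thread A could simply be:

/* Sketch: a blocking recv() loop that consumes no CPU while waiting. */
#include <sys/types.h>
#include <sys/socket.h>
#include <pthread.h>

void *thread_a(void *arg)
{
    int sock_fd = *(int *)arg;
    char buf[4096];

    for (;;) {
        /* Blocks here, consuming no CPU, until data arrives. */
        ssize_t n = recv(sock_fd, buf, sizeof buf, 0);
        if (n <= 0)
            break;                             /* error or connection closed */

        /* handle_data(buf, n); */             /* inspect the data ...       */
        /* pthread_mutex_unlock(&b_mutex); */  /* ... then wake up B         */
    }
    return NULL;
}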
Some Unix code I am working on depends on being able to poll over a small number of pipes. poll is a POSIX system call that (much like the older select) allows the process to wait until one or more file descriptors is "ready" for reading or writing, which means one can proceed to do so without blocking. This is useful to implement event loops where waiting is clearly separated from the rest of the communication.
Is it possible to do the same for Windows pipe handles - wait for one or more of them to become "ready" for reading/writing?
Existing SO advice on the matter, such as answers to this question, recommend the use of completion ports. However as far as I can tell, completion ports require initiating reading/writing beforehand, and then waiting for (or being notified of) the completion of those operations. This approach does not fit the architecture of the code, which strongly separates the polling code from the reading/writing code, the latter calling into a library that uses the regular ReadFile and WriteFile on the underlying handle.
If there is no direct equivalent to poll, could one abuse completion ports to provide something similar? In other words, is it possible to create IO completion events that announce "you can now call ReadFile (WriteFile) on this handle without it blocking" and wait for them using WaitForMultipleObjects or GetQueuedCompletionStatus?
I have a listening socket on a TCP port. The process itself is using setrlimit(RLIMIT_NOFILE,&...); to configure how many sockets are allowed for the process.
For tests, RLIMIT_NOFILE is set to 20; for production it will of course be set to a sanely bigger number. 20 makes it easy to reach the limit in a test environment.
The server itself has no issues like a descriptor leak or similar, but trying to solve the problem by increasing RLIMIT_NOFILE obviously won't do, because in real life there is no guarantee that the limit will not be reached, no matter how high it is set.
The problem is that after reaching the limit, accept returns EMFILE ("Too many open files"), and unless a file or socket is closed, the event loop starts spinning without delay, eating 100% of one core. Even if the client closes the connection (e.g. because of a timeout), the server will loop until a file descriptor is available to process and close the already dead incoming connection. EDIT: On the other hand, the client stalls and there is no good way for it to know that the server is overloaded.
My question: is there some standard way to handle this situation by closing the incoming connection after accept returns "Too many open files"?
Several dirty approaches come to mind:
To close and reopen the listening socket in the hope that all pending connections will be closed (this is quite dirty because in a threaded server some other thread may get the freed file descriptor)
To track the open file descriptor count (this cannot be done properly with external libraries that will have some untracked file descriptors)
To check whether the file descriptor number is near the limit and start closing incoming connections before the situation happens (this is rather implementation-specific, and while it will work on Linux, there is no guarantee that other OSes will handle file descriptors in the same way)
EDIT: One more dirty and ugly approach:
To keep one spare fd (e.g. dup(STDIN_FILENO) or open("/dev/null",...)) that will be used in case accept fails. The sequence will be:
... accept failed with "Too many open files"
// stop threads
close(sparefd);                    // give back the reserved descriptor
newconnection = accept(...);       // accept can now grab the freed descriptor
close(newconnection);              // immediately drop the connection we cannot serve
sparefd = open("/dev/null",...);   // re-reserve the spare descriptor
// release threads
The main drawback with this approach is the thread synchronization needed to prevent other threads from grabbing the just-freed spare fd.
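For what it's worth, a hedged sketch of that sequence with a mutex around the vulnerable window might look like this (listen_fd, spare_fd and spare_lock are made-up names, error handling is elided, and the lock only helps for code paths that also take it):

/* Sketch: reserve a spare fd so one surplus connection can be accepted and dropped. */
#include <fcntl.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

static int spare_fd = -1;   /* assumed to be reserved at startup, e.g. open("/dev/null", O_RDONLY) */
static pthread_mutex_t spare_lock = PTHREAD_MUTEX_INITIALIZER;

void reject_one_connection(int listen_fd)
{
    pthread_mutex_lock(&spare_lock);          /* keep cooperating threads away from the freed fd */
    close(spare_fd);                          /* make one descriptor available */
    int c = accept(listen_fd, NULL, NULL);
    if (c >= 0)
        close(c);                             /* drop the connection we cannot serve */
    spare_fd = open("/dev/null", O_RDONLY);   /* re-reserve the spare descriptor */
    pthread_mutex_unlock(&spare_lock);
}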
You shouldn't use setrlimit to control how many simultaneous connections your process can handle. Your tiny little bit of socket code is saying to the whole rest of the application, "I only want to have N connections open at a time, and this is the only way I know how to do it, so... nothing else in the process can have any files!". What would happen if everybody did that?
The proper way to do what you want is easy -- keep track of how many connections you have open, and just don't call accept until you can handle another one.
I understand that your code is in a library. The library encounters a resource limit event. I would distinguish, generally, between events which are catastrophic (memory exhaustion, can't open listening socket) and those which are probably temporary. Catastrophic events are hard to handle: without memory, even logging or an orderly shutdown may be impossible.
Too many open files, by contrast, is a condition which is probably temporary, not least because we are the resource hog. Temporary error conditions are luckily trivial to handle: By waiting. This is what you don't do: You should wait for a spell after accept returns "Too many open files", before you call accept again. That will solve the 100% CPU load problem. (I assume that our server performs some work on each connection which is at some point finished, so that the file descriptors of the client connections which our library holds are eventually closed.)
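A minimal sketch of that "wait, then retry" idea (listen_fd is a placeholder and the 100 ms pause is an arbitrary choice, not a recommendation):

/* Sketch: back off briefly when accept() fails due to descriptor exhaustion. */
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

int accept_with_backoff(int listen_fd)
{
    for (;;) {
        int c = accept(listen_fd, NULL, NULL);
        if (c >= 0)
            return c;                    /* got a connection */
        if (errno == EMFILE || errno == ENFILE)
            poll(NULL, 0, 100);          /* out of descriptors: sleep 100 ms, then retry */
        else
            return -1;                   /* some other error: report it to the caller */
    }
}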
There remains the problem that the library cannot know the requirements of the user code. (How long should the pause between accepts be?¹ Is it at all acceptable (sic) to let connection requests wait at all? Do we give up at some point?) It is imperative to report errors back to the user code, so that the user code has a chance to see and fix the error.
If the user code gets the file descriptor back, that's easy: Return accept's error code (and make sure to document that possibility). So I assume that the user code never sees gritty details like file descriptors but instead gets some data, for example. It may even be that the library performs just side effects, possibly concurrently, so that user code never sees any return value which would be usable to communicate errors. Then the library must provide some other way to signal the error condition to the user code. This may impose restrictions on how the user code can use the library: Perhaps before or after certain function calls, or simply periodically, an error status must be actively checked.
¹ By the way, it is not clear to me, even after reading the accept man page, whether the client's connect fails (because the connection request has been de-queued on the server side but cannot be handled), or whether the request simply stays in the queue so that the client is oblivious of the server's problems, apart from a delay.
Notice that multiplexing syscalls such as poll(2) can wait (without busy spin-looping) on accepting sockets (and also on connected sockets, or any other kind of stream file descriptor).
So just have your event loop handle them (probably with other readable & writable file descriptors). And don't call accept(2) when you don't want to.
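A rough sketch of that idea: keep the listening socket out of the poll set while you are at your own connection limit, so pending connection requests simply queue in the kernel backlog and the loop never spins. listen_fd, conns, nconns and MAX_CONNS are made-up names:

/* Sketch: only watch the listening socket when we can actually serve another client. */
#include <poll.h>
#include <sys/socket.h>

#define MAX_CONNS 16

void event_loop(int listen_fd, int *conns, int *nconns)
{
    for (;;) {
        struct pollfd pfds[MAX_CONNS + 1];
        int n = 0;

        if (*nconns < MAX_CONNS) {             /* room left: watch for new clients */
            pfds[n].fd = listen_fd;
            pfds[n].events = POLLIN;
            n++;
        }
        for (int i = 0; i < *nconns; i++) {    /* always watch existing clients */
            pfds[n].fd = conns[i];
            pfds[n].events = POLLIN;
            n++;
        }

        if (poll(pfds, n, -1) < 0)
            break;                             /* error handling elided */

        /* ... accept on listen_fd if it is ready, recv on ready clients,
         *     and decrement *nconns when a client connection is closed ... */
    }
}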
I have just come across the select() function for Linux (or is it Unix?) OSes. And it looks like it can achieve what I need to do.
I have a Linux process (on Debian) that has IPC (Inter-Process Communication) with 3 other processes. Two of them are serial port streams and the other is a named pipe.
My process needs to read data from each of these streams and react accordingly (it's a proxy between these 3 processes). There's no order to the data coming in from each process (one may talk, then another lie silent for a while).
So I am thinking of having a main loop that simply uses select() to listen on all streams (with a timeout of never). That way select can notify me when/if a stream writes to my process, which stream is talking and then I can react accordingly.
Is this how select works? Is this design OK, and is it how you would handle 3 streams whose behaviour is dynamic and not predictable (in terms of when they will write data to a stream)?
Yes, that's exactly what select is designed to do: multiplex multiple input streams and detect which have data ready to be read from.
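As a rough sketch of what that main loop could look like in C (serial1_fd, serial2_fd and pipe_fd are placeholders for your already-opened descriptors):

/* Sketch: proxy loop waiting forever on two serial ports and a named pipe. */
#include <sys/select.h>
#include <unistd.h>

void proxy_loop(int serial1_fd, int serial2_fd, int pipe_fd)
{
    int fds[3] = { serial1_fd, serial2_fd, pipe_fd };

    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        int maxfd = -1;
        for (int i = 0; i < 3; i++) {
            FD_SET(fds[i], &readfds);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }

        /* NULL timeout: block forever until one of the streams talks. */
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
            break;                              /* error handling elided */

        for (int i = 0; i < 3; i++) {
            if (FD_ISSET(fds[i], &readfds)) {
                char buf[512];
                ssize_t n = read(fds[i], buf, sizeof buf);
                if (n > 0) {
                    /* forward buf to whichever stream(s) should receive it */
                }
            }
        }
    }
}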
I do not understand what the difference is between calling recv() on a non-blocking socket vs. a blocking socket after select returns that it is ready for reading. It would seem to me that a blocking socket will never block in this situation anyway.
Also, I have heard that one model for using non-blocking sockets is to try to make calls (recv/send/etc.) on them after some amount of time has passed instead of using something like select. This technique seems slow and wasteful compared to using something like select (but then I don't get the purpose of non-blocking at all as described above). Is this common in network programming today?
There's a great overview of all of the different options for doing high-volume I/O called The C10K Problem. It has a fairly complete survey of a lot of the different options, at least as of 2006.
Quoting from it, on the topic of using select on non-blocking sockets:
Note: it's particularly important to remember that readiness notification from the kernel is only a hint; the file descriptor might not be ready anymore when you try to read from it. That's why it's important to use nonblocking mode when using readiness notification.
And yes, you could use non-blocking sockets and then have a loop that waits if nothing is ready, but that is fairly wasteful compared to using something like select or one of the more modern replacements (epoll, kqueue, etc). I can't think of a reason why anyone would actually want to do this; all of the select like options have the ability to set a timeout, so you can be woken up after a certain amount of time to perform some regular action. I suppose if you were doing something fairly CPU intensive, like running a video game, you may want to never sleep but instead keep computing, while periodically checking for I/O using non-blocking sockets.
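To make that concrete, here is a small hedged sketch of reading after readiness notification with the socket in non-blocking mode, so a stale hint cannot block the loop (sock_fd is a placeholder):

/* Sketch: non-blocking recv() after select() readiness; a stale hint just
 * yields EAGAIN/EWOULDBLOCK instead of blocking. */
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>

ssize_t read_when_ready(int sock_fd, char *buf, size_t len)
{
    fcntl(sock_fd, F_SETFL, fcntl(sock_fd, F_GETFL, 0) | O_NONBLOCK);

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(sock_fd, &readfds);
    if (select(sock_fd + 1, &readfds, NULL, NULL, NULL) < 0)
        return -1;

    ssize_t n = recv(sock_fd, buf, len, 0);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* The readiness hint was stale; with a blocking socket we would be
         * stuck here instead of simply returning to the event loop.
         * (In this sketch 0 also means "peer closed"; a real API would
         * distinguish the two cases.) */
        return 0;
    }
    return n;
}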
The select, poll, epoll, kqueue, etc. facilities target multiple socket/file descriptor handling scenarios. Imagine a heavily loaded web server with hundreds of simultaneously connected sockets. How would you know when to read and from which socket without blocking everything?
If you call read on a non-blocking socket, it will return immediately if no data has been received since the last call to read. If you only had read, and you wanted to wait until there was data available, you would have to busy wait. This wastes CPU.
poll and select (and friends) allow you to sleep until there's data to read (or write, or a signal has been received, etc.).
If the only thing you're doing is sending and receiving on that socket, you might as well just use a non-blocking socket. Being asynchronous is important when you have other things to do in the meantime, such as update a GUI or handle other sockets.
For your first question, there's no difference in that scenario. The only difference is what they do when there is nothing to be read. Since you're checking that before calling recv(), you'll see no difference.
For the second question, the way I see it done in all the libraries is to use select, poll, epoll, or kqueue to test whether data is available. The select method is the oldest, and the least desirable from a performance standpoint (particularly for managing large numbers of connections).