read on many real file descriptors - c

Working on a Linux (Ubuntu) application, I need to read many files in a non-blocking fashion. Unfortunately epoll doesn't support real file descriptors (descriptors obtained by opening regular files); it only supports descriptors such as network sockets. select does work on real file descriptors, but it has two drawbacks: 1) it's slow, linearly going through all the file descriptors that are set, and 2) it's limited, typically to no more than 1024 file descriptors.
I can make each file descriptor non-blocking and poll it with non-blocking "read", but that's very expensive, especially when there is a large number of file descriptors.
What are the options here?
Thanks.
Update 1
The use case here is to create some sort of file server, with many clients requesting files, and to serve them in a non-blocking fashion. Due to the network-side implementation (not a standard TCP/IP stack), I can't use sendfile().

You could use multiple select calls combined with either threading or forking. This would reduce the number of FD_ISSET calls per select set.
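For what it's worth, a minimal sketch of that idea, assuming a hypothetical handle_fd() callback and that each worker thread is handed its own slice of the descriptor array (descriptor values must still stay below FD_SETSIZE):
#include <pthread.h>
#include <sys/select.h>

void handle_fd(int fd);          /* hypothetical per-fd handler, not shown */

struct worker_args {
    int *fds;                    /* slice of descriptors owned by this worker */
    int  count;                  /* keep this well under FD_SETSIZE (1024) */
};

static void *select_worker(void *arg)
{
    struct worker_args *w = arg;

    for (;;) {
        fd_set rfds;
        int maxfd = -1;

        FD_ZERO(&rfds);
        for (int i = 0; i < w->count; i++) {
            FD_SET(w->fds[i], &rfds);
            if (w->fds[i] > maxfd)
                maxfd = w->fds[i];
        }

        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) <= 0)
            continue;            /* interrupted or error; retry */

        for (int i = 0; i < w->count; i++)
            if (FD_ISSET(w->fds[i], &rfds))
                handle_fd(w->fds[i]);
    }
    return NULL;
}
Each worker would be started with pthread_create(), passing its own struct worker_args.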
Perhaps you can provide more details about your use case. It sounds like you are using select to monitor file changes, which doesn't work as you would expect with regular files. Perhaps you are simply looking for flock.

You could use Asynchronous IO on Linux. The relevant AIO manpages (all in section 3) appear to have quite a bit of information. I think that aio_read() would probably be the most useful for you.
Here's some code that I believe you should be able to adapt for your usage:
...
#define _GNU_SOURCE
#include <aio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
typedef struct {
    struct aiocb *aio;
    connection_data *conn;   // application-specific connection state
} cb_data;

void callback (union sigval u) {
    // recover file related data prior to freeing
    cb_data *data = u.sival_ptr;
    int fd = data->aio->aio_fildes;
    uint8_t *buffer = (uint8_t *) data->aio->aio_buf;
    size_t len = data->aio->aio_nbytes;
    free (data->aio);
    // recover connection data pointer then free
    connection_data *conn = data->conn;
    free (data);
    ...
    // finish handling request
    ...
    return;
}
...
int main (int argc, char **argv) {
    // initial setup
    ...
    // setup aio for optimal performance
    struct aioinit ainit = { 0 };
    // online background threads
    ainit.aio_threads = sysconf (_SC_NPROCESSORS_ONLN) * 4;
    // use defaults on a few-core system
    ainit.aio_threads = (ainit.aio_threads > 20 ? ainit.aio_threads : 20);
    // set num to the maximum number of likely simultaneous requests
    ainit.aio_num = 4096;
    ainit.aio_idle_time = 5;
    aio_init (&ainit);
    ...
    // handle incoming requests
    int err = 0;
    int exit = 0;
    while (!exit) {
        ...
        // the [asynchronous] fun begins
        struct aiocb *cb = calloc (1, sizeof (struct aiocb));
        if (!cb) {
            // handle OOM error
        }
        cb->aio_fildes = file_fd;
        cb->aio_offset = 0; // assuming you want to send the entire file
        cb->aio_buf = malloc (file_len);
        if (!cb->aio_buf) {
            // handle OOM error
        }
        cb->aio_nbytes = file_len;
        // execute the callback in a separate thread
        cb->aio_sigevent.sigev_notify = SIGEV_THREAD;
        cb_data *data = malloc (sizeof (cb_data));
        if (!data) {
            // handle OOM error
        }
        data->aio = cb; // so we can free() later
        // whatever you need to finish handling the request
        data->conn = connection_data;
        cb->aio_sigevent.sigev_value.sival_ptr = data; // passed to callback
        cb->aio_sigevent.sigev_notify_function = callback;
        if ((err = aio_read (cb))) { // and you're done!
            // handle aio error
        }
        // move on to next connection
    }
    ...
    return 0;
}
With this, you no longer have to wait on files being read in your main thread. Of course, you can create more performant systems using AIO, but those are naturally likely to be more complex, and this should work for a basic use case.

Related

Synchronize with sigev_notify_function()

I would like to read (asynchronously) BLOCK_SIZE bytes of one file and BLOCK_SIZE bytes of a second file, printing what has been read as soon as the respective buffer has been filled. Let me illustrate what I mean:
// in main()
int infile_fd = open(infile_name, O_RDONLY); // add error checking
int maskfile_fd = open(maskfile_name, O_RDONLY); // add error checking
char* buffer_infile = malloc(BLOCK_SIZE); // add error checking
char* buffer_maskfile = malloc(BLOCK_SIZE); // add error checking
struct aiocb cb_infile;
struct aiocb cb_maskfile;
// set AIO control blocks
memset(&cb_infile, 0, sizeof(struct aiocb));
cb_infile.aio_fildes = infile_fd;
cb_infile.aio_buf = buffer_infile;
cb_infile.aio_nbytes = BLOCK_SIZE;
cb_infile.aio_sigevent.sigev_notify = SIGEV_THREAD;
cb_infile.aio_sigevent.sigev_notify_function = print_buffer;
cb_infile.aio_sigevent.sigev_value.sival_ptr = buffer_infile;
memset(&cb_maskfile, 0, sizeof(struct aiocb));
cb_maskfile.aio_fildes = maskfile_fd;
cb_maskfile.aio_buf = buffer_maskfile;
cb_maskfile.aio_nbytes = BLOCK_SIZE;
cb_maskfile.aio_sigevent.sigev_notify = SIGEV_THREAD;
cb_maskfile.aio_sigevent.sigev_notify_function = print_buffer;
cb_maskfile.aio_sigevent.sigev_value.sival_ptr = buffer_maskfile;
and the print_buffer() function is defined as follows:
void print_buffer(union sigval sv)
{
    printf("%s\n", __func__);
    printf("buffer address: %p\n", sv.sival_ptr);
    printf("buffer: %.128s\n", (char*)sv.sival_ptr);
}
By the end of the program I do the usual clean up, i.e.
// clean up
close(infile_fd); // add error checking
close(maskfile_fd); // add error checking
free(buffer_infile);
printf("buffer_inline freed\n");
free(buffer_maskfile);
printf("buffer_maskfile freed\n");
The problem is, every once in a while buffer_infile gets freed before print_buffer manages to print its contents to the console. In a usual case I would employ some kind of pthread_join(), but as far as I know that is impossible, since POSIX does not specify that sigev_notify_function must be implemented using threads; and besides, how would I get the TID of such a thread to call pthread_join() on?
Don't do it this way, if you can avoid it. If you can, just let process termination take care of it all.
Otherwise, the answer indicated in Andrew Henle's comment above is right on. You need to be sure that no more sigev_notify_functions will improperly reference the buffers.
The easiest way to do this is simply to count down the number of expected notifications before freeing the buffers.
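A minimal sketch of that countdown, assuming exactly the two reads from the question and using a POSIX semaphore posted from the notify function:
#include <semaphore.h>

static sem_t done_sem;              /* counts completed notifications */

/* in main(), before queueing the two aio_read()s: */
sem_init(&done_sem, 0, 0);

/* at the end of print_buffer(): */
sem_post(&done_sem);

/* in main(), before the cleanup shown above: wait for both notifications */
sem_wait(&done_sem);
sem_wait(&done_sem);
/* now it is safe to free(buffer_infile) and free(buffer_maskfile) */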
Note: your SIGEV_THREAD function is executed in a separate thread, though not necessarily a new thread each time. (POSIX.1-2017 System Interfaces §2.4.2) Importantly, you are not meant to manage this thread's lifecycle: it is detached by default, with PTHREAD_CREATE_JOINABLE explicitly noted as undefined behavior.
As an aside, I'd suggest never using SIGEV_THREAD in robust code. Per spec, the signal mask of the sigev_notify_function thread is implementation-defined. Yikes. For me, that makes it per se unreliable. In my view, SIGEV_SIGNAL and a dedicated signal-handling thread are much safer.
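A sketch of that safer pattern, with illustrative names (AIO_DONE_SIG, setup_aio_signal_handling()) that are not from the question: the completion signal is blocked everywhere and received synchronously by one dedicated thread.
#include <aio.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>

#define AIO_DONE_SIG (SIGRTMIN + 0)    /* any free real-time signal */

/* Dedicated thread: synchronously receives AIO completion signals. */
static void *aio_sig_thread(void *arg)
{
    sigset_t set;
    siginfo_t info;

    sigemptyset(&set);
    sigaddset(&set, AIO_DONE_SIG);
    for (;;) {
        if (sigwaitinfo(&set, &info) == -1)
            continue;
        /* si_value carries whatever was placed in sigev_value */
        struct aiocb *cb = info.si_value.sival_ptr;
        if (aio_error(cb) == 0)
            printf("completed read of %zd bytes\n", aio_return(cb));
    }
    return NULL;
}

/* Block the signal process-wide before creating any other threads, start
 * the handler thread, then issue reads with SIGEV_SIGNAL. */
static void setup_aio_signal_handling(struct aiocb *cb)
{
    sigset_t set;
    pthread_t tid;

    sigemptyset(&set);
    sigaddset(&set, AIO_DONE_SIG);
    pthread_sigmask(SIG_BLOCK, &set, NULL);

    pthread_create(&tid, NULL, aio_sig_thread, NULL);

    cb->aio_sigevent.sigev_notify          = SIGEV_SIGNAL;
    cb->aio_sigevent.sigev_signo           = AIO_DONE_SIG;
    cb->aio_sigevent.sigev_value.sival_ptr = cb;   /* delivered in si_value */
}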

Read chardevice with libevent

I wrote a chardevice that passes some messages received from the network to a user-space application. The user-space application has to both read the chardevice and send/receive messages via TCP sockets to other user-space applications. Both reading and receiving should be blocking.
Since Libevent is able to handle multiple events at the same time, I thought registering an event for the file created by the chardevice and an event for a socket would just work, but I was wrong.
But a chardevice creates a "character special file", and libevent does not seem to be able to block on it. If I implement a blocking mechanism inside the chardevice, e.g. a mutex or semaphore, then the socket event blocks too, and the application cannot receive messages.
The user space application has to accept outside connections at any time.
Do you know how to make it work? Maybe also using another library, I just want a blocking behaviour for both socket and file reader.
Thank you in advance.
Update: Thanks to @Ahmed Masud for the help. This is what I've done:
Kernel module chardevice:
Implement a poll function that waits until new data is available
struct file_operations fops = {
    ...
    .read = kdev_read,
    .poll = kdev_poll,
};
I have a global variable to signal whether user space has to stop, and a wait queue:
static int working = 1;
static wait_queue_head_t access_wait;
This is the read function. I return -1 if there is an error in copy_to_user, > 0 if everything went well, and 0 if the module has to stop. used_buf is atomic since it tracks the fill level of a buffer that is read by the user application and written by the kernel module.
ssize_t
kdev_read(struct file* filep, char* buffer, size_t len, loff_t* offset)
{
    int error_count;
    if (signal_pending(current) || !working) { // user called sigint
        return 0;
    }
    atomic_dec(&used_buf);
    size_t llen = sizeof(struct user_msg) + msg_buf[first_buf]->size;
    error_count = copy_to_user(buffer, (char*)msg_buf[first_buf], llen);
    if (error_count != 0) {
        atomic_inc(&used_buf);
        paxerr("send fewer characters to the user");
        return error_count;
    } else
        first_buf = (first_buf + 1) % BUFFER_SIZE;
    return llen;
}
When there is data to read, I simply increment used_buf and call wake_up_interruptible(&access_wait).
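For completeness, a sketch of what that producer side might look like (kdev_push() and last_buf are illustrative names, not from the original code):
/* Hypothetical producer, called when a message arrives from the network;
 * last_buf is the write index of the same ring that kdev_read() consumes. */
static void kdev_push(struct user_msg *msg)
{
    msg_buf[last_buf] = msg;                 /* store in the shared ring */
    last_buf = (last_buf + 1) % BUFFER_SIZE;
    atomic_inc(&used_buf);                   /* one more message available */
    wake_up_interruptible(&access_wait);     /* unblock poll()/read() */
}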
This is the poll function; I just wait until used_buf is > 0:
unsigned int
kdev_poll(struct file* file, poll_table* wait)
{
    poll_wait(file, &access_wait, wait);
    if (atomic_read(&used_buf) > 0)
        return POLLIN | POLLRDNORM;
    return 0;
}
Now, the problem here is that if I unload the module while the user-space application is waiting, the latter will go into a blocked state and it won't be possible to stop it. That's why I wake up the application when the module is unloaded:
void
kdevchar_exit(void)
{
    working = 0;
    atomic_inc(&used_buf); // bump the count so the application is unblocked
    wake_up_interruptible(&access_wait); // wake up the application; this time read will return 0 since working = 0
    ... // unregister everything
}
User space application
Libevent by default uses polling, so simply create an event_base and a reader event.
base = event_base_new();
filep = open(fname, O_RDWR | O_NONBLOCK, 0);
evread = event_new(base, filep, EV_READ | EV_PERSIST,
on_read_file, base);
where on_read_file simply reads the file, no poll call is made (libevent handles that):
static void
on_read_file(evutil_socket_t fd, short event, void* arg)
{
    struct event_base* base = arg;
    int len = read(...);
    if (len < 0)
        return;
    if (len == 0) {
        printf("Stopped by kernel module\n");
        event_base_loopbreak(base);
        return;
    }
    ... // handle message
}

Write atomically to a file using Write() with snprintf()

I want to be able to write atomically to a file. I am trying to use the write() function, since it seems to guarantee atomic writes on most Linux/Unix systems.
Since I have variable string lengths and multiple printf's, I was told to use snprintf() and pass the result to write() in order to do this properly. Upon reading the documentation of this function, I did a test implementation as below:
int file = open("file.txt", O_CREAT | O_WRONLY, 0644); // O_CREAT requires a mode argument
if (file < 0)
    perror("Error");
char buf[200] = "";
int numbytes = snprintf(buf, sizeof(buf), "Example string %s", stringvariable);
write(file, buf, numbytes);
From my tests it seems to work, but my question is whether this is the correct way to implement it, since I am creating a rather large buffer (something I am 100% sure will fit all my printfs) to store the output before passing it to write().
No, write() is not atomic, not even when it writes all of the data supplied in a single call.
Use advisory record locking (fcntl(fd, F_SETLKW, &lock)) in all readers and writers to achieve atomic file updates.
fcntl()-based record locks work over NFS on both Linux and BSDs; flock()-based file locks may not, depending on system and kernel version. (If NFS locking is disabled like it is on some web hosting services, no locking will be reliable.) Just initialize the struct flock with .l_whence = SEEK_SET, .l_start = 0, .l_len = 0 to refer to the entire file.
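For example, a minimal sketch of taking and releasing such a whole-file record lock (the helper names are just illustrative):
#include <fcntl.h>
#include <unistd.h>

/* Lock the whole file; blocks until the lock is granted.
 * Pass F_WRLCK for writers, F_RDLCK for readers. */
static int lock_whole_file(int fd, short type)
{
    struct flock lock = {
        .l_type   = type,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,        /* 0 means "to end of file", i.e. the whole file */
    };
    return fcntl(fd, F_SETLKW, &lock);
}

static int unlock_whole_file(int fd)
{
    struct flock lock = {
        .l_type   = F_UNLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,
    };
    return fcntl(fd, F_SETLK, &lock);
}
The lock is also released automatically when the descriptor is closed or the process exits.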
Use asprintf() to print to a dynamically allocated buffer:
char *buffer = NULL;
int length;
length = asprintf(&buffer, ...);
if (length == -1) {
/* Out of memory */
}
/* ... Have buffer and length ... */
free(buffer);
After adding the locking, do wrap your write() in a loop:
{
    const char *p = (const char *)buffer;
    const char *const q = (const char *)buffer + length;
    ssize_t n;

    while (p < q) {
        n = write(fd, p, (size_t)(q - p));
        if (n > 0)
            p += n;
        else if (n != -1) {
            /* Write error / kernel bug! */
        } else if (errno != EINTR) {
            /* Error! Details in errno */
        }
    }
}
Although there are some local filesystems that guarantee write() does not return a short count unless you run out of storage space, not all do; especially not the networked ones. Using a loop like above lets your program work even on such filesystems. It's not too much code to add for reliable and robust operation, in my opinion.
In Linux, you can take a write lease on a file to exclude any other process opening that file for a while.
Essentially, you cannot block a file open, but you can delay it for up to /proc/sys/fs/lease-break-time seconds, typically 45 seconds. The lease is granted only when no other process has the file open, and if any other process tries to open the file, the lease owner gets a signal. (If the lease owner does not release the lease, for example by closing the file, the kernel will automagically break the lease after the lease-break-time is up.)
Unfortunately, these only work in Linux, and only on local files, so they are of limited use.
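A minimal, Linux-only sketch of taking and releasing such a lease (helper names are illustrative; F_SETLEASE needs _GNU_SOURCE):
#define _GNU_SOURCE
#include <fcntl.h>

/* Try to take a write lease: fails with EAGAIN if any other process
 * currently has the file open. The kernel sends us SIGIO (or the signal
 * chosen with F_SETSIG) when someone else tries to open or truncate it. */
static int take_write_lease(int fd)
{
    return fcntl(fd, F_SETLEASE, F_WRLCK);
}

/* Release the lease when done (also happens implicitly on close()). */
static int release_lease(int fd)
{
    return fcntl(fd, F_SETLEASE, F_UNLCK);
}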
If readers do not keep the file open, but open, read, and close it every time they read it, you can write a full replacement file (it must be on the same filesystem; I recommend using a lock subdirectory for this) and rename() it over the old file.
All readers will see either the old file or the new file, but those that keep their file open will never see any changes.
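A sketch of that replacement pattern (the helper name and the temporary-name scheme are illustrative):
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace `path` with freshly written contents. The temporary
 * name scheme is illustrative; real code might use mkstemp() in a lock
 * subdirectory on the same filesystem. */
static int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    int fd;

    snprintf(tmp, sizeof tmp, "%s.new", path);

    fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd == -1)
        return -1;

    /* (short-write/EINTR loop omitted for brevity; see the loop above) */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) == -1) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    return rename(tmp, path);   /* atomic: readers see either old or new file */
}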

how is select() alerted to an fd becoming "ready"?

I don't know why I'm having a hard time finding this, but I'm looking at some Linux code where we're using select(), waiting on a file descriptor to report it's ready. From the man page of select:
select() and pselect() allow a program to monitor multiple file descriptors,
waiting until one or more of the file descriptors become "ready" for some
class of I/O operation
So, that's great... I call select on some descriptor, give it some time out value and start to wait for the indication to go. How does the file descriptor (or owner of the descriptor) report that it's "ready" such that the select() statement returns?
It reports that it's ready by returning.
select waits for events that are typically outside your program's control. In essence, by calling select, your program says "I have nothing to do until ..., please suspend my process".
The condition you specify is a set of events, any of which will wake you up.
For example, if you are downloading something, your loop would have to wait on new data to arrive, a timeout to occur if the transfer is stuck, or the user to interrupt, which is precisely what select does.
When you have multiple downloads, data arriving on any of the connections triggers activity in your program (you need to write the data to disk), so you'd give a list of all download connections to select in the list of file descriptors to watch for "read".
When you upload data to somewhere at the same time, you again use select to see whether the connection currently accepts data. If the other side is on dialup, it will acknowledge data only slowly, so your local send buffer is always full, and any attempt to write more data would block until buffer space is available, or fail. By passing the file descriptor we are sending to to select as a "write" descriptor, we get notified as soon as buffer space is available for sending.
The general idea is that your program becomes event-driven, i.e. it reacts to external events from a common message loop rather than performing sequential operations. You tell the kernel "this is the set of events for which I want to do something", and the kernel gives you a set of events that have occurred. It is fairly common for two events to occur simultaneously; for example, a TCP acknowledgement was included in a data packet, which can make the same fd both readable (data is available) and writeable (acknowledged data has been removed from the send buffer), so you should be prepared to handle all of the events before calling select again.
One of the finer points is that select basically gives you a promise that one invocation of read or write will not block, without making any guarantee about the call itself. For example, if one byte of buffer space is available, you can attempt to write 10 bytes, and the kernel will come back and say "I have written 1 byte", so you should be prepared to handle this case as well. A typical approach is to have a buffer "data to be written to this fd", and as long as it is non-empty, the fd is added to the write set, and the "writeable" event is handled by attempting to write all the data currently in the buffer. If the buffer is empty afterwards, fine, if not, just wait on "writeable" again.
The "exceptional" set is seldom used -- it is used for protocols that have out-of-band data where it is possible for the data transfer to block, while other data needs to go through. If your program cannot currently accept data from a "readable" file descriptor (for example, you are downloading, and the disk is full), you do not want to include the descriptor in the "readable" set, because you cannot handle the event and select would immediately return if invoked again. If the receiver includes the fd in the "exceptional" set, and the sender asks its IP stack to send a packet with "urgent" data, the receiver is then woken up, and can decide to discard the unhandled data and resynchronize with the sender. The telnet protocol uses this, for example, for Ctrl-C handling. Unless you are designing a protocol that requires such a feature, you can easily leave this out with no harm.
Obligatory code example:
#include <sys/types.h>
#include <sys/select.h>
#include <unistd.h>
#include <stdbool.h>
static inline int max(int lhs, int rhs) {
    if(lhs > rhs)
        return lhs;
    else
        return rhs;
}
void copy(int from, int to) {
    char buffer[10];
    int readp = 0;
    int writep = 0;
    bool eof = false;

    for(;;) {
        fd_set readfds, writefds;
        FD_ZERO(&readfds);
        FD_ZERO(&writefds);

        int ravail, wavail;
        if(readp < writep) {
            ravail = writep - readp - 1;
            wavail = sizeof buffer - writep;
        }
        else {
            ravail = sizeof buffer - readp;
            wavail = readp - writep;
        }

        if(!eof && ravail)
            FD_SET(from, &readfds);
        if(wavail)
            FD_SET(to, &writefds);
        else if(eof)
            break;

        int rc = select(max(from,to)+1, &readfds, &writefds, NULL, NULL);
        if(rc == -1)
            break;

        if(FD_ISSET(from, &readfds))
        {
            ssize_t nread = read(from, &buffer[readp], ravail);
            if(nread < 1)
                eof = true;
            else
                readp = readp + nread;
        }

        if(FD_ISSET(to, &writefds))
        {
            ssize_t nwritten = write(to, &buffer[writep], wavail);
            if(nwritten < 1)
                break;
            writep = writep + nwritten;
        }

        if(readp == sizeof buffer && writep != 0)
            readp = 0;
        if(writep == sizeof buffer)
            writep = 0;
    }
}
We attempt to read if we have buffer space available and there was no end-of-file or error on the read side, and we attempt to write if we have data in the buffer; if end-of-file is reached and the buffer is empty, then we are done.
This code is clearly suboptimal (it's example code), but you should be able to see that it is acceptable for the kernel to do less than we asked for, both on reads and writes, in which case we just go back and say "whenever you're ready", and that we never read or write without asking whether it will block.
From the same man page:
On exit, the sets are modified in place to indicate which file descriptors actually changed status.
So use FD_ISSET() on the sets passed to select to determine which FDs have become ready.

C - How to use both aio_read() and aio_write()

I'm implementing a game server where I need to both read and write. So I accept an incoming connection and start reading from it using aio_read(), but when I need to send something, I stop reading using aio_cancel() and then use aio_write(). Within the write's callback I resume reading. So I read all the time, but when I need to send something I pause reading.
It works about 20% of the time; in the other cases the call to aio_cancel() fails with "Operation now in progress", and I cannot cancel it (even when retrying in an endless while loop). So my queued write operation never happens.
How do I use these functions properly? What did I miss?
EDIT:
Used under Linux 2.6.35. Ubuntu 10 - 32 bit.
Example code:
void handle_read(union sigval sigev_value) { /* handle data or disconnection */ }
void handle_write(union sigval sigev_value) { /* free writing buffer memory */ }
void start()
{
    const int acceptorSocket = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(struct sockaddr_in));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(port);
    bind(acceptorSocket, (struct sockaddr*)&addr, sizeof(struct sockaddr_in));
    listen(acceptorSocket, SOMAXCONN);

    struct sockaddr_in address;
    socklen_t addressLen = sizeof(struct sockaddr_in);
    for(;;)
    {
        const int incomingSocket = accept(acceptorSocket, (struct sockaddr*)&address, &addressLen);
        if(incomingSocket == -1)
        { /* handle error ... */ }
        else
        {
            //say socket to append outcoming messages at writing:
            const int currentFlags = fcntl(incomingSocket, F_GETFL, 0);
            if(currentFlags < 0) { /* handle error ... */ }
            if(fcntl(incomingSocket, F_SETFL, currentFlags | O_APPEND) == -1) { /* handle another error ... */ }

            //start reading:
            struct aiocb* readingAiocb = new struct aiocb;
            memset(readingAiocb, 0, sizeof(struct aiocb));
            readingAiocb->aio_nbytes = MY_SOME_BUFFER_SIZE;
            readingAiocb->aio_fildes = socketDesc;
            readingAiocb->aio_buf = mySomeReadBuffer;
            readingAiocb->aio_sigevent.sigev_notify = SIGEV_THREAD;
            readingAiocb->aio_sigevent.sigev_value.sival_ptr = (void*)mySomeData;
            readingAiocb->aio_sigevent.sigev_notify_function = handle_read;
            if(aio_read(readingAiocb) != 0) { /* handle error ... */ }
        }
    }
}
//called at any time from server side:
send(void* data, const size_t dataLength)
{
    //... some thread-safety precautions not needed here ...

    const int cancellingResult = aio_cancel(socketDesc, readingAiocb);
    if(cancellingResult != AIO_CANCELED)
    {
        //this one happens ~80% of the time - wrapping the previous call in an endless while loop does not help:
        if(cancellingResult == AIO_NOTCANCELED)
        {
            puts(strerror(aio_return(readingAiocb))); // "Operation now in progress"
            /* don't know what to do... */
        }
    }
    //otherwise it's okay to send:
    else
    {
        aio_write(...);
    }
}
If you wish to have separate AIO queues for reads and writes, so that a write issued later can execute before a read issued earlier, then you can use dup() to create a duplicate of the socket, and use one to issue reads and the other to issue writes.
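A minimal sketch of that, reusing names from the question (writingAiocb is a hypothetical control block for the write):
#include <unistd.h>

/* one descriptor for reads, an independent duplicate for writes, so a later
 * aio_write() is not queued behind the outstanding aio_read() */
const int readSocket  = incomingSocket;
const int writeSocket = dup(incomingSocket);
if (writeSocket == -1) { /* handle error ... */ }

readingAiocb->aio_fildes = readSocket;
writingAiocb->aio_fildes = writeSocket;   /* writingAiocb: hypothetical write control block */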
However, I second the recommendations to avoid AIO entirely and simply use an epoll()-driven event loop with non-blocking sockets. This technique has been shown to scale to high numbers of clients - if you are getting high CPU usage, profile it and find out where that's happening, because the chances are that it's not your event loop that's the culprit.
First of all, consider dumping aio. There are lots of other ways to do asynchronous I/O that are not as braindead (yes, aio is braindead). Lots of alternatives; if you're on Linux you can use libaio (io_submit and friends). aio(7) mentions this.
Back to your question.
I haven't used aio in a long time but here's what I remember. aio_read and aio_write both put requests (aiocb) on some queue. They return immediately even if the requests will complete some time later. It's entirely possible to queue multiple requests without caring what happened to the earlier ones. So, in a nutshell: stop cancelling read requests and keep adding them.
/* populate read_aiocb */
rc = aio_read(&read_aiocb);
/* time passes ... */
/* populate write_aiocb */
rc = aio_write(&write_aiocb);
Later you're free to wait using aio_suspend, poll using aio_error, wait for signals etc.
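Continuing the snippet above, a sketch of waiting with aio_suspend() and checking completion with aio_error()/aio_return():
/* wait until at least one of the requests completes ... */
const struct aiocb *const list[2] = { &read_aiocb, &write_aiocb };

if (aio_suspend(list, 2, NULL) == 0) {          /* NULL timeout: wait indefinitely */
    if (aio_error(&read_aiocb) == 0) {
        ssize_t got = aio_return(&read_aiocb);  /* bytes read */
        /* handle received data ... */
    }
    if (aio_error(&write_aiocb) == 0) {
        ssize_t sent = aio_return(&write_aiocb);
        /* write finished, its buffer can be reused ... */
    }
}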
I see you mention epoll in your comment. You should definitely go for libaio.
Unless I'm mistaken, POSIX AIO (that is, aio_read(), aio_write() and so on) is guaranteed to work only on seekable file descriptors. From the aio_read() manpage:
The data is read starting at the absolute file offset aiocbp->aio_offset, regardless of the current file position. After this request, the value of the current file position is unspecified.
For devices which do not have an associated file position such as network sockets, AFAICS, POSIX AIO is undefined. Perhaps it happens to work on your current setup, but that seems more by accident than by design.
Also, on Linux, POSIX AIO is implemented in glibc with the help of userspace threads.
So, where possible, use non-blocking IO and epoll(). However, epoll() does not work for seekable file descriptors such as regular files (the same goes for the classical select()/poll() as well); in that case POSIX AIO is an alternative to rolling your own thread pool.
There should be no reason to stop or cancel an aio read or write request just because you need to make another read or write. If that were the case, it would defeat the whole point of asynchronous reading and writing, since its main purpose is to allow you to set up a reading or writing operation and then move on. Since multiple requests can be queued, it would be much better to set up a couple of asynchronous reader/writer pools, where you can grab a set of pre-initialized aiocb structures from an "available" pool that have been set up for asynchronous operations whenever you need them, and then return them to a "finished" pool when they're done and you can access the buffers they point to. While they're in the middle of an asynchronous read or write, they would be in a "busy" pool and wouldn't be touched. That way you won't have to keep creating aiocb structures on the heap dynamically every time you need to make a read or write operation, although that's okay to do ... it's just not very efficient if you never plan on going over a certain limit, or plan to have only a certain number of "in-flight" requests.
BTW, keep in mind that with a couple of different in-flight asynchronous requests, your asynchronous read/write handler can actually be interrupted by another read/write event. So you really don't want to be doing a whole lot in your handler. In the above scenario, your handler would basically move the aiocb struct that triggered it from one pool to the next in the listed "available" -> "busy" -> "finished" stages. Your main code, after reading from the buffer pointed to by the aiocb structures in the "finished" pool, would then move the structure back to the "available" pool.
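A sketch of the simplest version of such a pool (collapsing "busy" into "checked out"; all names are illustrative):
#include <aio.h>
#include <pthread.h>

#define POOL_SIZE 64    /* maximum number of in-flight requests (illustrative) */

/* A tiny "available" pool; a control block is "busy" while checked out and
 * is returned here (or to a separate "finished" list) by the completion
 * handler once its buffer has been consumed. */
struct aiocb_pool {
    struct aiocb   *slots[POOL_SIZE];   /* pre-initialized at startup */
    int             count;
    pthread_mutex_t lock;
};

static struct aiocb *pool_get(struct aiocb_pool *p)
{
    struct aiocb *cb = NULL;

    pthread_mutex_lock(&p->lock);
    if (p->count > 0)
        cb = p->slots[--p->count];
    pthread_mutex_unlock(&p->lock);
    return cb;                          /* NULL means the pool is exhausted */
}

static void pool_put(struct aiocb_pool *p, struct aiocb *cb)
{
    pthread_mutex_lock(&p->lock);
    if (p->count < POOL_SIZE)
        p->slots[p->count++] = cb;      /* keep its buffer attached for reuse */
    pthread_mutex_unlock(&p->lock);
}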
