I was looking at how the read/write syscalls are implemented in Linux, and I found this:
....
loff_t pos = file_pos_read(f.file);
ret = vfs_read(f.file, buf, count, &pos);
file_pos_write(f.file, pos);
fdput(f);
...
My questions are :
Where did the locking go? I would have imagined something like:
....
lock(f.file); // <-- lock file struct
loff_t pos = file_pos_read(f.file);
ret = vfs_read(f.file, buf, count, &pos);
file_pos_write(f.file, pos);
fdput(f);
unlock(f.file); // <-- unlock file struct
...
If multiple threads try to read/write at the same time, could they read/write at the same offset?
If my understanding is correct, Linux doesn't use any locking mechanism to protect the offset. Is this POSIX compliant?
I looked at the POSIX specification and found nothing about this case.
Linux doesn't use any locking mechanism to protect multithreaded writes to a file.
You have to use your own mutex to protect the file.
It's your responsibility in a multithreaded application to serialize access to file descriptors. Across processes you can use the flock(2) syscall to synchronize access to the same file.
The kernel won't crash if you access the same file from two different processes/threads, but it may overwrite or corrupt the file position and file data in an undefined way.
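Besides a mutex, one way to keep threads from fighting over the shared file position is to not use it at all: pread(2) and pwrite(2) take an explicit offset and leave the descriptor's position untouched. Here is a minimal sketch, assuming each thread knows which offset it wants; read_full_at is a hypothetical helper name, not something from the kernel or libc.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes starting at an explicit offset.
 * pread() never moves the descriptor's file position, so concurrent callers
 * on the same fd do not disturb each other's offsets.
 * Returns the number of bytes read (short only at end of file), or -1. */
ssize_t read_full_at(int fd, void *buf, size_t len, off_t offset)
{
    char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = pread(fd, p + done, len - done, offset + (off_t)done);
        if (n == 0)
            break;                  /* end of file */
        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal; retry */
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

This only removes the race on the file position. If two threads write overlapping regions, you still need your own synchronization for the data itself.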
I've implemented a char device for my kernel module and implemented a read function for it. The read function calls copy_to_user to return data to the caller. I originally implemented the read function in a blocking manner (with wait_event_interruptible), but the problem reproduces even when I implement read in a non-blocking manner. My code is running on a MIPS processor.
The user space program opens the char device and reads into a buffer allocated on the stack.
What I've found is that occasionally copy_to_user will fail to copy any bytes. Moreover, even if I replace copy_to_user with a call to memcpy (only for the purposes of checking... I know this isn't the right thing to do), and print out the destination buffer immediately afterwards, I see that memcpy has failed to copy any bytes.
I'm not really sure how to further debug this - how can I determine why memory is not being copied? Is it possible that the process context is wrong?
EDIT: Here's some pseudo-code outlining what the code currently looks like:
User mode (runs repeatedly):
char buf[BUF_LEN];
FILE *f = fopen(char_device_file, "rb");
fread(buf, 1, BUF_LEN, f);
fclose(f);
Kernel mode:
char_device =
create_char_device(char_device_name,
NULL,
read_func,
NULL,
NULL);
int read_func(char *output_buffer, int output_buffer_length, loff_t *offset)
{
int rc;
if (*offset == 0)
{
spin_lock_irqsave(&lock, flags);
while (get_available_bytes_to_read() == 0)
{
spin_unlock_irqrestore(&lock, flags);
if (wait_event_interruptible(self->wait_queue, get_available_bytes_to_read() != 0))
{
// Got a signal; retry the read
return -ERESTARTSYS;
}
spin_lock_irqsave(&lock, flags);
}
rc = copy_to_user(output_buffer, internal_buffer, bytes_to_copy);
spin_unlock_irqrestore(&lock, flags);
}
else rc = 0;
return rc;
}
It took quite a bit of debugging, but in the end Tsyvarev's hint (the comment about not calling copy_to_user with a spinlock taken) seems to have been the cause.
Our process had a background thread which occasionally launched a new process (fork + exec). When we disabled this thread, everything worked well. The best theory we have is that the fork made all of our memory pages copy-on-write, so when we tried to copy to them, the kernel had to do some work which could not be done with the spinlock taken. Hopefully it at least makes some sense (although I'd have guessed that this would apply only to the child process, and the parent's process pages would simply remain writable, but who knows...).
We rewrote our code to be lockless and the problem disappeared.
Now we just need to verify that our lockless code is indeed safe on different architectures. Easy as pie.
I'm interested in the basic principles of Web-servers, like Apache or Nginx, so now I'm developing my own server.
When my server gets a request, it's searching for a file (e.g index.html), if it exists - read all the content to the buffer (content) and write it to the socket after. Here's a simplified code:
int return_file(char* content, char* fullPath) {
int file = open(fullPath, O_RDONLY);
ssize_t nread;
if (file >= 0) { // File was found, OK (0 is a valid descriptor)
while ((nread = read(file, content, 2048)) > 0) {}
close(file);
return 200;
}
return 404; // File could not be opened
}
The question is pretty simple: is it possible to avoid using buffer and write file content directly to the socket?
Thanks for any tips :)
There is no standardized system call which can write directly from a file to a socket.
However, some operating systems do provide such a call. For example, both FreeBSD and Linux implement a system call called sendfile, but the precise details differ between the two systems. (In both cases, you need the underlying file descriptor for the file, not the FILE* pointer, although on both these platforms you can use fileno() to extract the fd from the FILE*.)
For more information:
FreeBSD sendfile()
Linux sendfile()
What you can do is write the "chunk" you read immediately to the client.
In order to write the content, you MUST read it, so you can't avoid that, but you can use a smaller buffer, and write the contents as you read them eliminating the need to read the whole file into memory.
For instance, you could
unsigned char byte;
// FIXME: store the return value to allow
// choosing the right action on error.
//
// Note that `0' is not really an error.
while (read(file, &byte, 1) > 0) {
if (write(client, &byte, 1) <= 0) {
// Handle error.
}
}
but then, unsigned char byte; could be unsigned char byte[A_REASONABLE_BUFFER_SIZE]; which would be better, and you don't need to store ALL the content in memory.
No, it is not. There must be an intermediate storage that you use for reading/writing the data.
There is one edge case: when you use memory-mapped files, the mapped region of the file can be passed directly to write() on the socket. Internally, though, the system still has to read the file into a memory buffer.
I want to be able to write atomically to a file. I am trying to use the write() function, since it seems to guarantee atomic writes on most Linux/Unix systems.
Since I have variable string lengths and multiple printf calls, I was told to use snprintf() and pass the result to write() in order to do this properly. After reading the documentation of this function, I did a test implementation as below:
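int file = open("file.txt", O_CREAT | O_WRONLY, 0644); /* O_CREAT needs a mode argument */
if(file < 0)
perror("Error:");
char buf[200] = "";
int numbytes = snprintf(buf, sizeof(buf), "Example string %s", stringvariable);
if (numbytes >= (int)sizeof(buf))
numbytes = sizeof(buf) - 1; /* output was truncated to fit buf */
write(file, buf, numbytes);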
From my tests it seems to have worked but my question is if this is the most correct way to implement it since I am creating a rather large buffer (something I am 100% sure will fit all my printfs) to store it before passing to write.
No, write() is not atomic, not even when it writes all of the data supplied in a single call.
Use advisory record locking (fcntl(fd, F_SETLKW, &lock)) in all readers and writers to achieve atomic file updates.
fcntl()-based record locks work over NFS on both Linux and BSDs; flock()-based file locks may not, depending on system and kernel version. (If NFS locking is disabled like it is on some web hosting services, no locking will be reliable.) Just initialize the struct flock with .l_whence = SEEK_SET, .l_start = 0, .l_len = 0 to refer to the entire file.
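As a rough sketch of that struct flock initialization, something like the following should do. lock_whole_file is a hypothetical helper name; the EINTR retry is my own addition, since F_SETLKW can be interrupted by a signal while it waits.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: place or release an advisory lock covering the whole
 * file. type is F_RDLCK (shared), F_WRLCK (exclusive) or F_UNLCK (release).
 * Blocks until the lock is granted. Returns 0 on success, -1 on error. */
int lock_whole_file(int fd, short type)
{
    struct flock lock = {
        .l_type   = type,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,      /* 0 means "to the end of the file" */
    };
    int rc;

    do {
        rc = fcntl(fd, F_SETLKW, &lock);
    } while (rc == -1 && errno == EINTR);

    return rc;
}
```

A writer would call lock_whole_file(fd, F_WRLCK) before updating and lock_whole_file(fd, F_UNLCK) afterwards; readers use F_RDLCK. The descriptor must be open for writing to take F_WRLCK and for reading to take F_RDLCK. Keep in mind the locks are advisory, so they only help if every reader and writer takes them.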
Use asprintf() to print to a dynamically allocated buffer:
char *buffer = NULL;
int length;
length = asprintf(&buffer, ...);
if (length == -1) {
/* Out of memory */
}
/* ... Have buffer and length ... */
free(buffer);
After adding the locking, do wrap your write() in a loop:
{
const char *p = (const char *)buffer;
const char *const q = (const char *)buffer + length;
ssize_t n;
while (p < q) {
n = write(fd, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != -1) {
/* Write error / kernel bug! */
} else
if (errno != EINTR) {
/* Error! Details in errno */
}
}
}
Although there are some local filesystems that guarantee write() does not return a short count unless you run out of storage space, not all do; especially not the networked ones. Using a loop like above lets your program work even on such filesystems. It's not too much code to add for reliable and robust operation, in my opinion.
In Linux, you can take a write lease on a file to exclude any other process opening that file for a while.
Essentially, you cannot block a file open, but you can delay it for up to /proc/sys/fs/lease-break-time seconds, typically 45 seconds. The lease is granted only when no other process has the file open, and if any other process tries to open the file, the lease owner gets a signal. (If the lease owner does not release the lease, for example by closing the file, the kernel will automagically break the lease after the lease-break-time is up.)
Unfortunately, these only work in Linux, and only on local files, so they are of limited use.
If readers do not keep the file open, but open, read, and close it every time they read it, you can write a full replacement file (must be on the same filesystem; I recommend using a lock-subdirectory for this), and rename() it over the old file. (A plain link() will not do, because it fails if the target name already exists.)
All readers will see either the old file or the new file, but those that keep their file open, will never see any changes.
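A minimal sketch of the replacement-file approach follows. It uses rename() for the swap step, which POSIX specifies as an atomic replacement of the target name. replace_file and the tmp_path parameter are names I made up for the example, and the caller is assumed to pick a temporary path on the same filesystem as the target.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write data to tmp_path, then atomically move it over
 * path. Readers that open path see either the old or the new contents.
 * Returns 0 on success, -1 on error (the temporary file is removed). */
int replace_file(const char *path, const char *tmp_path,
                 const void *data, size_t len)
{
    const char *p = data;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd == -1)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Flush to storage before the new name becomes visible. */
    if (fsync(fd) == -1) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) == -1 || rename(tmp_path, path) == -1) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```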
As stated in http://www.kernel.org/doc/htmldocs/kernel-hacking.html#routines-copy, these functions "can" sleep.
So, do I always have to take a lock (e.g. a mutex) when using these functions, or are there exceptions?
I'm currently working on a module and saw some kernel Oopses on my system, but cannot reproduce them. I have a feeling they are triggered because I currently do no locking around copy_[to/from]_user(). Maybe I'm wrong, but it smells like it has something to do with it.
I have something like:
static unsigned char user_buffer[BUFFER_SIZE];
static ssize_t mcom_write (struct file *file, const char *buf, size_t length, loff_t *offset) {
ssize_t retval;
size_t writeCount = (length < BUFFER_SIZE) ? length : BUFFER_SIZE;
memset((void*)&user_buffer, 0x00, sizeof user_buffer);
if (copy_from_user((void*)&user_buffer, buf, writeCount)) {
retval = -EFAULT;
return retval;
}
*offset += writeCount;
retval = writeCount;
cleanupNewline(user_buffer);
dispatch(user_buffer);
return retval;
}
Is this safe, or do I need to lock out other accesses while copy_from_user is running?
It's a char device I read and write from, and if a special packet in the network is received, there can be concurrent access to this buffer.
You need to do locking iff the kernel side data structure that you are copying to or from might go away otherwise - but it is that data structure you should be taking a lock on.
I am guessing your function mcom_write is a procfs write function (or similar) right? In that case, you most likely are writing to the procfs file, your program being blocked until mcom_write returns, so even if copy_[to/from]_user sleeps, your program wouldn't change the buffer.
You haven't stated how your program works so it is hard to say anything. If your program is multithreaded and one thread writes while another can change its data, then yes, you need locking, but between the threads of the user-space program not your kernel module.
If you have one thread writing, then your write to the procfs file would be blocked until mcom_write finishes so no locking is needed and your problem is somewhere else (unless there is something else that is wrong with this function, but it's not with copy_from_user)
I have created the following program, in which I want to poll the file descriptor of a file that the program opens.
#define FILE "help"
int main()
{
int ret1;
struct pollfd fds[1];
ret1 = open(FILE, O_CREAT);
fds[0].fd = ret1;
fds[0].events = POLLIN;
while(1)
{
poll(fds,1,-1);
if (fds[0].revents & POLLIN)
printf("POLLING");
}
return 0;
}
It goes into an infinite loop. I expect the loop body to run only when some operation happens on the file. (It's an ASCII file.)
Please help.
poll() actually isn't useful on regular files. Since a read() on a regular file never blocks in the sense poll() cares about, poll() will always report that you can read from the file without blocking.
This would (almost) work on character devices*, named pipes** or sockets, though, since those block when you read() from them when there is no data available. (you also need to actually read that data then, or else poll will tell again and again that data is available)
To "poll" a growing/shrinking file, see man inotify or implement your own polling using fstat() in a loop.
* block devices are a story apart; while technically a read from a hard disk can block for 10 ms or longer, this is not treated as blocking I/O in Linux.
** see also how to flush a named pipe using bash
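To give an idea of the inotify route (Linux only): an inotify descriptor is something poll() does work on, so the loop from the question can be kept almost as it is. This is a minimal sketch, with watch_file and wait_for_change as made-up helper names; it only watches for IN_MODIFY and throws the event details away.

```c
#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical helper: create an inotify descriptor that reports
 * modifications of the file at path. Returns the descriptor, or -1. */
int watch_file(const char *path)
{
    int ifd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);

    if (ifd == -1)
        return -1;
    if (inotify_add_watch(ifd, path, IN_MODIFY) == -1) {
        close(ifd);
        return -1;
    }
    return ifd;
}

/* Hypothetical helper: wait up to timeout_ms (-1 = forever) for a change.
 * Returns 1 if the file was modified, 0 on timeout, -1 on error. */
int wait_for_change(int ifd, int timeout_ms)
{
    struct pollfd pfd = { .fd = ifd, .events = POLLIN };
    char events[4096];
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc <= 0)
        return rc;
    /* Drain the queued events; otherwise poll() keeps reporting POLLIN. */
    if (read(ifd, events, sizeof events) <= 0)
        return -1;
    return 1;
}
```

In main() you would call watch_file("help") once and then loop on wait_for_change(ifd, -1), printing whenever it returns 1.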
No idea if this is the cause of your problems (probably not), but it is a particularly bad idea to redefine the standard macro FILE.
Didn't your compiler complain about this?