Multiple threads writing to the same file - C

I would like to know if we can use multiple threads to write binary data to the same file.
FILE *fp = fopen("test", "wb");
size_t SIZE = 1000000000;
int *table = malloc(sizeof(int) * SIZE);
// ... fill the table
fwrite(table, sizeof(*table), SIZE, fp);
So I wonder: can I use threads, with each thread calling fseek to seek to a different location and then writing to the same file?
Any ideas?

fwrite should be thread-safe, but you'll need a mutex anyway, because the seek and the write have to happen atomically as a pair. Depending on your platform, you might have a write function that takes an offset, or you might be able to open the file separately in each thread. A better option, since you have everything in memory anyway as your code suggests, would be for each thread to fill its part of a single large array and then write that out once everything is done.
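For example, a minimal sketch of the mutex approach (write_chunk and its parameters are my own illustration, not code from the question):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t file_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper: write `count` ints starting at element offset `pos`.
   The mutex makes the fseek+fwrite pair atomic across threads. */
void write_chunk(FILE *fp, const int *data, size_t count, long pos)
{
    pthread_mutex_lock(&file_lock);
    fseek(fp, pos * (long)sizeof(int), SEEK_SET);
    fwrite(data, sizeof(int), count, fp);
    pthread_mutex_unlock(&file_lock);
}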

While fread() and fwrite() are thread safe, the stream buffer represented by the FILE* is not. So you can have multiple threads accessing the same file, but not via the same FILE* - each thread must have its own, and the file to which they refer must be shareable - which is OS dependent.
An alternative and possibly simpler approach is to use a memory mapped file, so that each thread treats the file as shared memory, and you let the OS deal with the file I/O. This has a significant advantage over normal file I/O as it is truly random access, so you don't need to worry about fseek() and sequential read/writes etc.
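A minimal sketch of that idea (POSIX mmap; map_table, the file name "test" and the error handling are illustrative, and the file must be given its final size up front):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map "test" so that threads can write disjoint regions directly. */
int *map_table(size_t nints)
{
    int fd = open("test", O_RDWR | O_CREAT, 0644);
    if (fd == -1)
        return NULL;
    if (ftruncate(fd, nints * sizeof(int)) == -1) { /* give the file its final size */
        close(fd);
        return NULL;
    }
    int *table = mmap(NULL, nints * sizeof(int),
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after close */
    return table == MAP_FAILED ? NULL : table;
}

Each thread then writes to its own disjoint slice of table[]; munmap() (or msync()) flushes the result to disk.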

fseek and fwrite are thread-safe, so you can use them without additional synchronization.

Let each thread open the file and make sure they write to different positions; finally, let each thread close the file, and you're done.
Update:
This works on UNIX-ish systems, at least.
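A rough sketch of this per-thread-FILE* approach (POSIX threads; CHUNK and the file name are made up, and the file is assumed to already exist):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 4096 /* ints per thread; illustrative */

/* Each thread opens its own FILE* and writes only its own region. */
static void *writer(void *arg)
{
    int idx = (int)(intptr_t)arg;
    FILE *fp = fopen("test", "rb+");
    if (!fp)
        return NULL;
    int *buf = malloc(CHUNK * sizeof(int));
    if (buf) {
        /* ... fill buf with this thread's data ... */
        fseek(fp, (long)idx * CHUNK * (long)sizeof(int), SEEK_SET);
        fwrite(buf, sizeof(int), CHUNK, fp);
        free(buf);
    }
    fclose(fp);
    return NULL;
}

Each writer would be started with pthread_create(&tid, NULL, writer, (void *)(intptr_t)i);.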

Related

Can I adapt a function that writes to disk to write to memory

I have a third-party library with a function that does some computation on the specified data and writes the results to a file specified by file name:
int manipulateAndWrite(const char *filename,
                       const FOO_DATA *data);
I cannot change this function, or reimplement the computation in my own function, because I do not have the source.
To get the results, I currently need to read them from the file. I would prefer to avoid the write to and read from the file, and obtain the results into a memory buffer instead.
Can I pass a filepath that indicates writing to memory instead of a
filesystem?
Yes, you have several options, although only the first suggestion below is supported by POSIX. The rest of them are OS-specific, and may not be portable across all POSIX systems, although I do believe they work on all POSIXy systems.
You can use a named pipe (FIFO), and have a helper thread read from it concurrently to the writer function.
Because there is no file per se, the overhead is just the syscalls (write and read); basically just the overhead of interprocess communication, nothing to worry about. To conserve resources, do create the helper thread with a small stack (using pthread_attr_setstacksize() etc.), as the default stack size tends to be huge (on the order of several megabytes; 2*PTHREAD_STACK_MIN should be plenty for helper threads).
You should ensure the named pipe is in a safe directory, accessible only to the user running the process, for example.
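A sketch of this FIFO approach (error handling trimmed; drain_fifo and the /tmp path are illustrative, and the path should live in a safe directory as just noted):

#include <fcntl.h>
#include <pthread.h>
#include <sys/stat.h>
#include <unistd.h>

/* Helper thread: drain the FIFO while the library function writes to it. */
static void *drain_fifo(void *arg)
{
    const char *path = arg;
    int fd = open(path, O_RDONLY); /* blocks until the writer opens the FIFO */
    if (fd == -1)
        return NULL;
    char buf[65536];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* ... append buf[0..n) to the in-memory result ... */
    }
    close(fd);
    return NULL;
}

/* Caller, using manipulateAndWrite() from the question:
       mkfifo("/tmp/myapp.fifo", 0600);
       pthread_create(&tid, NULL, drain_fifo, "/tmp/myapp.fifo");
       manipulateAndWrite("/tmp/myapp.fifo", data);
       pthread_join(tid, NULL);
       unlink("/tmp/myapp.fifo");
*/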
In many POSIXy systems, you can create a pipe or a socket pair, and access it via /dev/fd/N, where N is the descriptor number in decimal. (In Linux, /proc/self/fd/N also works.) This is not mandated by POSIX, so may not be available on all systems, but most do support it.
This way, there is no actual file per se, and the function writes to the pipe or socket. If the data written by the function is at most PIPE_BUF bytes, you can simply read the data from the pipe afterwards; otherwise, you do need to create a helper thread to read from the pipe or socket concurrently to the function, or the write will block.
In this case, too, the overhead is minimal.
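A sketch of the pipe-plus-/dev/fd trick (make_pipe_path is my own illustrative helper):

#include <stdio.h>
#include <unistd.h>

/* Create a pipe and a /dev/fd path naming its write end.
   Pass the path to the writer; read the results from *rfd. */
int make_pipe_path(char *path, size_t len, int *rfd)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;
    *rfd = fds[0];
    snprintf(path, len, "/dev/fd/%d", fds[1]);
    return fds[1]; /* close this after the writer finishes */
}

If the function can write more than PIPE_BUF bytes, read from *rfd in a helper thread while it runs, as described above.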
On ELF-based POSIXy systems (basically all), you can interpose the open(), write(), and close() syscalls or C library functions.
(In Linux, there are two basic approaches, one using the linker --wrap, and one using dlsym(). Both work fine for this particular case. This ability to interpose functions is based on how ELF binaries are linked at run time, and is not directly related to POSIX.)
You first set up the interposing functions, so that open() detects if the file name matches your special "in-memory" file, and returns a dedicated descriptor number for it. (You may also need to interpose other functions, like ftruncate() or lseek(), depending on what the function actually does; in Linux, you can run the binary under strace, which uses ptrace, to examine which syscalls it actually makes.)
When write() is called with the dedicated descriptor number, you simply memcpy() it to a memory buffer. You'll need to use global variables to describe the allocated size, size used, and the pointer to the memory buffer, and probably be prepared to resize/grow the buffer if necessary.
When close() is called with the dedicated descriptor number, you know the memory buffer is complete, and the contents ready for processing.
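A heavily simplified sketch of the dlsym() variant (Linux/glibc, link with -ldl; MAGIC_PATH, MAGIC_FD and the fixed-size buffer are made-up illustrations, and whether the library's I/O actually routes through these symbols depends on how it was built):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <unistd.h>

#define MAGIC_PATH "/in-memory-result" /* made-up special file name */
#define MAGIC_FD   1000                /* made-up dedicated descriptor */

static char   result[1 << 20];         /* fixed 1 MiB buffer for brevity */
static size_t result_len;

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
    if (strcmp(path, MAGIC_PATH) == 0) {
        result_len = 0;
        return MAGIC_FD;               /* never reaches the kernel */
    }
    int mode = 0;
    if (flags & O_CREAT) {             /* mode is only passed with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, int);
        va_end(ap);
    }
    return real_open(path, flags, mode);
}

ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t);
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))dlsym(RTLD_NEXT, "write");
    if (fd == MAGIC_FD) {
        if (count > sizeof result - result_len)
            count = sizeof result - result_len; /* a real version would grow the buffer */
        memcpy(result + result_len, buf, count);
        result_len += count;
        return (ssize_t)count;
    }
    return real_write(fd, buf, count);
}

int close(int fd)
{
    static int (*real_close)(int);
    if (!real_close)
        real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");
    if (fd == MAGIC_FD)
        return 0; /* result[0..result_len) is now complete */
    return real_close(fd);
}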
You can use a temporary file on a RAM filesystem. While the data is technically written to a file and read back from it, the operations involve RAM only.
You should arrange for a default path to one to be set at compile time, and for individual users to be able to override that for their personal needs, for example via an environment variable (YOURAPP_TMPDIR?).
There is no need for the application to try and look for a RAM-based filesystem: choices like this are, and should be, up to the user. The application should not even care what kind of filesystem the file is on, and should just use the specified directory.
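Selecting that directory could look like this (YOURAPP_TMPDIR is the hypothetical variable from above; DEFAULT_TMPDIR is likewise illustrative):

#include <stdlib.h>

#ifndef DEFAULT_TMPDIR
#define DEFAULT_TMPDIR "/tmp" /* compile-time default; override with -DDEFAULT_TMPDIR=... */
#endif

/* Directory for the temporary file: the per-user override wins. */
const char *tmpdir(void)
{
    const char *dir = getenv("YOURAPP_TMPDIR");
    return (dir && *dir) ? dir : DEFAULT_TMPDIR;
}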
Alternatively, you could avoid using that library function altogether. Take a look at this on how to write to in-memory files:
Is it possible to create a C FILE object to read/write in memory
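For reference, the usual answers to that question are fmemopen() and open_memstream() (POSIX.1-2008); a minimal fmemopen() sketch:

#include <stdio.h>

int main(void)
{
    char buf[4096];
    FILE *fp = fmemopen(buf, sizeof buf, "w"); /* FILE* backed by buf */
    if (!fp)
        return 1;
    fprintf(fp, "hello, in-memory file\n");
    fclose(fp);                                /* buf now holds the output */
    printf("%s", buf);
    return 0;
}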

How to use pthread in C to prevent simultaneous read and write to a file on disk?

I am writing a program which has one write thread and a few read threads that write to / read from a file on disk. I wish that no write and read will happen at the same time. I found many examples that use a pthread mutex lock to protect a memory array during writes and reads, such as declaring the protected memory array volatile:
volatile int array[NUMBER];
pthread_mutex_t locks[NUMBER];
pthread_mutex_lock(&locks[index]);
array[index] -= SOME_NUMBER;
pthread_mutex_unlock(&locks[index]);
But I cannot find examples of using pthreads to protect files on disk.
volatile FILE* array[NUMBER]; ??
Can someone point me in the right direction? I want the write/read threads not to access the files on disk simultaneously.
Edit: I read more, and according to this post, it seems that multithreading does not work well with disk I/O.
According to the description, your problem is about protecting the files on the disk, not the stream descriptors (FILE*) that represent them. You can try to use pthread's rwlock to synchronize concurrent access between multiple threads:
FILE *fp = fopen(...);
pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

// read thread:
char buf[BUFLEN];
pthread_rwlock_rdlock(&rwlock);
fread(buf, sizeof buf, 1, fp);
pthread_rwlock_unlock(&rwlock);

// write thread:
pthread_rwlock_wrlock(&rwlock);
fwrite(buf, sizeof buf, 1, fp);
pthread_rwlock_unlock(&rwlock);
Note that this protects the file from being accessed by multiple threads in the same process; it does not protect it from being accessed by multiple processes on the system.
It depends on what you mean by "will not access the files on disk simultaneously". Since you talk about pthreads, it means you're on a POSIX system. POSIX already has certain guarantees about file access by multiple processes and threads.
If you use the raw system calls read and write, you are for example guaranteed that writes are atomic: if a read and a write happen simultaneously (meaning you don't know which starts first), the read will see either the entire change the write made to the file or none of it (there may be some exceptions on errors).
Of course there are problems with reading and writing the same file descriptor from multiple threads, since read/write update the offset at which the next read/write will happen. So doing an lseek and then a write is unsafe if some other thread can touch the file descriptor between the seek and the write. For that you have the system calls pread and pwrite, which guarantee that the seek+read/write pair is atomic.
So by just going on this part of your problem description:
one write thread, and a few read threads, that write/read to a file on disk. I wish that no write / read will happen at the same time.
this is already guaranteed by the operating system. The problem I think is that "at the same time" is a very vague requirement because almost nothing that accesses a shared resource (like a file or even memory) happens at the same time. When thinking about threads or just about any concurrency you need to frame the problem in terms of what needs to happen before/after some other thing.
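For example, where a separate lseek()+write() pair is racy, pwrite() does the same thing in one atomic step (trivial sketch; write_at is illustrative):

#include <unistd.h>

/* Atomic seek+write: no window in which another thread can move
   the shared file offset, unlike lseek() followed by write(). */
ssize_t write_at(int fd, const void *buf, size_t len, off_t offset)
{
    return pwrite(fd, buf, len, offset);
}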

Multi-Threading with files

So let's say I have the following code, where I open a file, read its contents line by line, use each line in a function somewhere else, and then rewind the file when I'm done.
FILE *file = Open_File();
char line[max];
while (!EndofFile())
{
    int length = GetLength(line);
    if (length > 0)
    {
        DoStuffToLine(line);
    }
}
rewind(file);
I'm wondering if there is a way to use threads here to add concurrency. Since I'm just reading the file and not writing to it, I feel like I don't have to worry about race conditions. However, I'm not sure how to handle the code in the while loop: if one thread is looping over the file while another thread is looping over it at the same time, would they cause each other to skip lines or make other errors? What's a good way to approach this?
If you're trying to do this to improve read performance, you're likely going to be disappointed, since this will almost surely be disk I/O bound. Adding more threads won't help the OS and disk controller fetch data any faster.
However, if you're trying to just process the data in parallel, that's another matter. In that case, I would read the entire file into a memory buffer somewhere, then have your threads process it in parallel. That way you don't have to worry about thread safety with rewinding the file pointer or any other annoying issues like it.
You'll likely still need to use other locking mechanisms for the multithreaded parts of course, depending on exactly what you're doing, but you shouldn't have to worry about what the standard library is going to do when you start accessing a file with multiple threads.
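A sketch of the read-once-then-process approach (slurp and the slicing scheme are my own illustration):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct slice { char *start; size_t len; };

/* Worker: process the lines inside one slice of the buffer. */
static void *worker(void *arg)
{
    struct slice *s = arg;
    /* ... run DoStuffToLine() over the lines in s->start[0..s->len) ... */
    (void)s;
    return NULL;
}

/* Read the whole file into one buffer; threads then each get a slice
   (splitting on line boundaries is up to the caller). */
char *slurp(FILE *fp, size_t *len)
{
    if (fseek(fp, 0, SEEK_END) != 0)
        return NULL;
    long size = ftell(fp);
    if (size < 0)
        return NULL;
    rewind(fp);
    char *buf = malloc((size_t)size);
    if (buf && fread(buf, 1, (size_t)size, fp) != (size_t)size) {
        free(buf);
        return NULL;
    }
    *len = (size_t)size;
    return buf;
}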
The concurrency adds some race condition problems:
1. The EndofFile() check happens at the top of the loop, so both threads can pass it, after which one thread consumes the last line and the other attempts to read past the end of the file. You never know when a thread gets scheduled;
2. The same holds for the GetLength() function: by the time a thread has the length information, the length may have changed because another thread has read another line;
3. You are reading the file sequentially, and even if you rewind it, the current position of the I/O pointer may be altered by some other thread at any moment.
Furthermore, as Telgin pointed out, reading a file is not CPU bound but I/O bound, so more threads do not make the system read the file any faster. You can't improve the performance, because you need some locks, and locking to guarantee thread safety just introduces overhead.
I'm not sure this is the best approach, but you could read the file once, store its contents in two separate objects, and read the objects instead of the file. Just make sure to do cleanup afterwards.

thread safety of read/pread system calls

I have a few queries related to the read()/pread() system calls in a multithreaded environment.
I am using Mac OS X, which is FreeBSD-based, if that helps in any way.
I am only using this file in read mode, not read/write.
The language is C/C++.
Suppose we have a file on disk:
AAAABBBBCCCCDDDDEEEE...
and four characters fit on one page of the file.
So Page 1: AAAA
Page 2: BBBB
... and so on.
Now I initiate a read system call from two different threads with the same file descriptor.
My intention is to read the first page from thread 1, the second page from thread 2, and so on:
read(fd, buff, sizeof(page));
From the man page I understand that read will also increment the file pointer, so I am definitely going to get garbled responses like
ABCC ABBB ... (in no particular sequence).
To remedy this I can use pread():
"pread() performs the same function, but reads from the specified position in the file without modifying the file pointer" // from the man pages
But I am not sure whether using pread will actually help me with my objective, because even though it does not increment the internal file pointer, there are no guarantees that the responses are not jumbled.
All of my data is page aligned, and I want to read one page from each thread, like
Thread 1 reads: AAAA
Thread 2 reads: BBBB
Thread 3 reads: CCCC ... without actually garbling the content.
I also found the post Is it safe to read() from a file as soon as write() returns? but it wasn't quite useful.
I am also not sure whether read() will actually have the problem that I am thinking of. The file I am reading is a binary file, and hence it is a little difficult to just quickly read it manually and verify.
Any help will be appreciated.
read and write change the position of the underlying open file. They are "thread safe" in the sense that your program will not have undefined behavior (crash or worse) if multiple threads perform IO on the same open file at once using them, but the order and atomicity of the operations could vary depending on the type of file and the implementation.
On the other hand, pread and pwrite do not change the position in the open file. They were added to POSIX for exactly the purpose you want: performing IO operations on the same open file from multiple threads or processes without the operations interfering with one another's position. You could still run into some trouble with ordering if you're mixing pread and pwrite (or multiple calls to pwrite) with overlapping parts of the file, but as long as you avoid that, they're perfectly safe for what you want to do.
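Applied to the example in the question, each thread can fetch its own page with pread() and never touch the shared offset (PAGE is the assumed page size):

#include <unistd.h>

#define PAGE 4096 /* assumed page size from the example */

/* Thread i reads page i; no shared file offset is involved,
   so the threads cannot garble each other's reads. */
ssize_t read_page(int fd, char *buf, int i)
{
    return pread(fd, buf, PAGE, (off_t)i * PAGE);
}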
fcntl advisory locks are locks on a range of the file. You may find this useful to serialize reads and writes to the same region while allowing concurrency on separate regions.
int rc;
struct flock f;
f.l_type = F_RDLCK; /* or F_WRLCK */
f.l_whence = SEEK_SET;
f.l_start = n;
f.l_len = 1;
while ((rc = fcntl(fd, F_SETLKW, &f)) == -1 && errno == EINTR)
    ;
if (rc == -1)
    perror("fcntl(F_SETLKW)");
else {
    /* do stuff */
    f.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &f);
}
Multiple reader locks are permitted at a time, while a single writer lock blocks all others.
Be warned that all file locking mechanisms are subtly broken on some configurations on all platforms.
Share a mutex between the two threads, lock it in a thread before it reads, and unlock it when the read is complete. See pthread_mutex_init, pthread_mutex_lock, and pthread_mutex_unlock.

Probing for filesystem block size

I'm going to first admit that this is for a class project, since it will be pretty obvious. We are supposed to do reads to probe for the block size of the filesystem. My problem is that the time taken to do this appears to be linearly increasing, with no steps like I would expect.
I am timing the read like this:
double startTime = getticks();
read = fread(x, 1, toRead, fp);
double endTime = getticks();
where getticks uses the rdtsc instruction. I am afraid that caching/prefetching is causing the reads to take no time during the fread. I tried creating a random file between each execution of my program, but that is not alleviating my problem.
What is the best way to accurately measure the time taken for a read from disk? I am pretty sure my block size is 4096, but how can I get data to support that?
The usual way of determining filesystem block size is to ask the filesystem what its blocksize is.
#include <sys/statvfs.h>
#include <stdio.h>

int main(void) {
    struct statvfs fs_stat;
    statvfs(".", &fs_stat);
    printf("%lu\n", fs_stat.f_bsize);
    return 0;
}
But if you really want, open(…,…|O_DIRECT) or posix_fadvise(…,…,…,POSIX_FADV_DONTNEED) will try to let you bypass the kernel's buffer cache (not guaranteed).
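A sketch of the O_DIRECT route (Linux-specific; the file name and the 4096-byte alignment are illustrative, and the open can fail on filesystems that don't support O_DIRECT):

#define _GNU_SOURCE /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* O_DIRECT requires the buffer, length and offset to be aligned;
       4096 is a common requirement but not guaranteed everywhere. */
    int fd = open("testfile", O_RDONLY | O_DIRECT);
    if (fd == -1)
        return 1;
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;
    read(fd, buf, 4096); /* comes from the device, not the page cache */
    free(buf);
    close(fd);
    return 0;
}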
You may want to use the system calls (open(), read(), write(), ...) directly to reduce the impact of the buffering done by the FILE* layer. Also, you may want to use synchronous I/O somehow. One way is opening the file with the O_SYNC flag set (or O_DIRECT as per ephemient's reply).
Quoting the Linux open(2) manual page:
O_SYNC The file is opened for synchronous I/O. Any write(2)s on the
       resulting file descriptor will block the calling process until
       the data has been physically written to the underlying hardware.
       But see NOTES below.
Another option would be mounting the filesystem with -o sync (see mount(8)) or setting the S attribute on the file using the chattr(1) command.
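For completeness, the O_SYNC route is just an extra flag at open time (file name illustrative):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* Every write() through fd blocks until the data reaches the device. */
    int fd = open("testfile", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd == -1)
        return 1;
    write(fd, "x", 1);
    close(fd);
    return 0;
}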
