Multi-threading with files in C

So let's say I have the following code, where I open a file, read the contents line by line, pass each line to a function defined somewhere else, and then rewind the file when I'm done.
/* Open_File(), EndofFile(), GetLength() and DoStuffToLine() are
   helpers defined elsewhere in my program. */
FILE *file = Open_File();
char line[max];

while (!EndofFile())
{
    int length = GetLength(line);
    if (length > 0)
    {
        DoStuffToLine(line);
    }
}
rewind(file);
I'm wondering if there is a way to use threads here to add concurrency. Since I'm just reading the file and not writing to it, I feel like I don't have to worry about race conditions. However, I'm not sure how to handle the code in the while loop: if one thread is looping over the file at the same time as another, would they cause each other to skip lines, make other errors, etc.? What's a good way to approach this?

If you're trying to do this to improve read performance, you're likely to be disappointed, since this will almost surely be disk I/O bound. Adding more threads won't help the OS and the disk controller fetch data any faster.
However, if you're trying to just process the data in parallel, that's another matter. In that case, I would read the entire file into a memory buffer somewhere, then have your threads process it in parallel. That way you don't have to worry about thread safety with rewinding the file pointer or any other annoying issues like it.
You'll likely still need to use other locking mechanisms for the multithreaded parts of course, depending on exactly what you're doing, but you shouldn't have to worry about what the standard library is going to do when you start accessing a file with multiple threads.
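For the parallel-processing case, here is a minimal sketch of that idea, assuming pthreads; "input.txt", NUM_THREADS and process_line() are placeholders for whatever your program actually uses, and error handling is abbreviated:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_THREADS 4

/* Placeholder for whatever DoStuffToLine() actually does. */
static void process_line(char *line) { (void)line; }

struct chunk { char **lines; size_t count; };

static void *worker(void *arg)
{
    struct chunk *c = arg;
    for (size_t i = 0; i < c->count; i++)
        process_line(c->lines[i]);
    return NULL;
}

int main(void)
{
    FILE *file = fopen("input.txt", "r");
    if (!file) return 1;

    /* Read the entire file into one buffer. */
    fseek(file, 0, SEEK_END);
    long size = ftell(file);
    rewind(file);
    char *buf = malloc(size + 1);
    size_t got = fread(buf, 1, size, file);
    buf[got] = '\0';
    fclose(file);

    /* Split the buffer into lines (destructive: '\n' becomes '\0'). */
    size_t nlines = 0, cap = 1024;
    char **lines = malloc(cap * sizeof *lines);
    for (char *p = strtok(buf, "\n"); p; p = strtok(NULL, "\n")) {
        if (nlines == cap)
            lines = realloc(lines, (cap *= 2) * sizeof *lines);
        lines[nlines++] = p;
    }

    /* Hand each thread its own contiguous slice of the line array. */
    pthread_t tid[NUM_THREADS];
    struct chunk chunks[NUM_THREADS];
    size_t per = (nlines + NUM_THREADS - 1) / NUM_THREADS;
    for (int t = 0; t < NUM_THREADS; t++) {
        size_t start = t * per;
        size_t left = start < nlines ? nlines - start : 0;
        chunks[t].lines = lines + (start < nlines ? start : nlines);
        chunks[t].count = left < per ? left : per;
        pthread_create(&tid[t], NULL, worker, &chunks[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);

    free(lines);
    free(buf);
    return 0;
}

Because each thread only touches its own slice of read-only data, no locking is needed for the lines themselves; you only need synchronization if the per-line work writes to shared state.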

The concurrency adds some race condition problems:
1. The EndofFile() function is evaluated at the start of the loop. Both threads can pass the end-of-file check at the same time; one then reads the last line and reaches the end of the file, while the other attempts to read past it. You never know when a thread will be scheduled to run;
2. The same applies to the GetLength() function: by the time a thread has the length, it may already be stale because another thread has read the next line;
3. You are reading the file sequentially, and even if you rewind it, the current position of the I/O pointer can be changed at any moment by another thread.
Furthermore, as Telgin pointed out, reading a file is not CPU bound but I/O bound, and so is the system's work to read it. You can't improve performance this way because you need locks, and locking to guarantee thread safety just adds overhead.

I'm not sure that this is the best approach, but you could read the file once, store the contents in two separate objects, and have the threads read from those objects instead of the file. Just make sure to clean up afterward.

Related

Deadlock when reading from a stream in FUSE

My understanding of FUSE's multithreaded read cycle is something like this:
         .-> read --.
        /           \
open --+--> read ----+--> release
        \           /
         `-> read --'
i.e., once a file is open'd, multiple read threads are spawned to read different chunks of the file. Then, when everything that was wanted has been read, there is a final, single release. All of this is per one's definition of open, read and release as FUSE operations.
I'm creating an overlay filesystem which converts one file type to another. Clearly, random access without any kind of indexing is a problem; so for the time being, I'm resorting to streaming. That is, in the above model, each read thread would begin the conversion process from the start, until it arrives at the correct bit of converted data to push out into the read buffer.
This is obviously very inefficient! To resolve this, a single file-conversion process can start at the open stage and maintain a mutex and a read cursor (i.e., "I've consumed this much so far") that the reader threads use to force sequential access. That is, the mutex is locked by the thread that requests data at the current cursor position, and all other reader threads have to wait until it's their turn.
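A minimal sketch of that cursor scheme, assuming pthreads; convert_next_chunk() is a hypothetical helper that produces the next piece of converted data:

#include <errno.h>
#include <pthread.h>
#include <sys/types.h>

/* Conversion state shared by all reader threads for one open file. */
struct stream_state {
    pthread_mutex_t lock;
    pthread_cond_t  turn;   /* signalled whenever the cursor advances */
    off_t           cursor; /* how much converted data has been handed out */
};

/* Hypothetical helper: produce the next piece of converted data. */
ssize_t convert_next_chunk(struct stream_state *s, char *buf, size_t len);

ssize_t stream_read_at(struct stream_state *s, off_t offset,
                       char *buf, size_t len)
{
    pthread_mutex_lock(&s->lock);
    while (offset > s->cursor)              /* not our turn yet: wait */
        pthread_cond_wait(&s->turn, &s->lock);
    if (offset < s->cursor) {               /* the "can't rewind" case */
        pthread_mutex_unlock(&s->lock);
        return -EDEADLK;
    }
    ssize_t n = convert_next_chunk(s, buf, len);
    if (n > 0) {
        s->cursor += n;
        pthread_cond_broadcast(&s->turn);   /* let the next reader proceed */
    }
    pthread_mutex_unlock(&s->lock);
    return n;
}

Note that a "fast-forward" request simply waits in the loop above, which is exactly the unsatisfiable wait described next.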
I don't see why this wouldn't work for streaming a file out. However, if any random/non-sequential access occurs we'll have a deadlock: if the requested offset is beyond or before the current cursor position, the mutex will never be unlocked for the appropriate reader to reach that point. (Presumably we have no access to FUSE's threads, so we cannot act as a supervisor. Also, I can't get around the problem by forcing the file to be a FIFO, as FUSE doesn't support writing to them.)
Indeed, I would only be able to detect when this happens if the mutex is locked and the cursor is beyond the requested offset (i.e., the "can't rewind" situation). In that case, I can return EDEADLK, but there's no way to detect "fast-forward" requests that can't be satisfied.
At the risk of the slightly philosophical question... What do I do?

Caching file pointers in C

I need to cache file pointers in my program, but the problem is that multiple threads may access that file pointer cache. For example, if thread 1 asks for a file pointer and a cache miss occurs, fopen is called and the pointer is cached. When thread 2 arrives and a cache hit occurs, both threads share the same read/write position, which leads to errors. Some things I thought of:
- I could keep track of when the file is in use, but currently I don't know when it will be released, and adding this feature disturbs my design.
- I could hand out a duplicate of the file pointer on a cache hit, but I don't know of any way to do this so that the two copies do not share a read/write position.
How should I proceed?
Are you concerned about optimizing out the file open operation? I think you are making this far more complex and error-prone than it needs to be. File pointers (FILE*) are not thread-safe structures, so you cannot share them across threads.
What you probably need to do (if you really want to cache the file-open operations) is keep a dictionary mapping filenames to file descriptors (ints), with a thread-safe function that returns a descriptor by name, opening the file if it's not in the dictionary.
And of course doing I/O to the same file descriptor from multiple threads needs to be regulated as well.
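A minimal sketch of such a cache, assuming POSIX; a fixed-size array stands in for a real dictionary, and get_cached_fd() is a name I made up:

#include <fcntl.h>
#include <pthread.h>
#include <string.h>

#define CACHE_SLOTS 64

static struct { char name[256]; int fd; } cache[CACHE_SLOTS];
static int cache_used = 0;
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return a cached descriptor for 'name', opening the file on a miss. */
int get_cached_fd(const char *name)
{
    int fd = -1;
    pthread_mutex_lock(&cache_lock);
    for (int i = 0; i < cache_used; i++) {
        if (strcmp(cache[i].name, name) == 0) {
            fd = cache[i].fd;
            goto out;
        }
    }
    fd = open(name, O_RDWR);
    if (fd >= 0 && cache_used < CACHE_SLOTS) {
        strncpy(cache[cache_used].name, name, sizeof cache[0].name - 1);
        cache[cache_used].fd = fd;
        cache_used++;
    }
out:
    pthread_mutex_unlock(&cache_lock);
    return fd;
}

Callers that then read or write concurrently should use pread()/pwrite() with explicit offsets, so they never fight over the descriptor's shared file offset.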

Atomically write 64kB

I need to write something like 64 kB of data atomically in the middle of an existing file. That is, all or nothing should be written. How do I achieve that in Linux/C?
I don't think it's possible, or at least there's not any interface that guarantees as part of its contract that the write would be atomic. In other words, if there is a way that's atomic right now, that's an implementation detail, and it's not safe to rely on it remaining that way. You probably need to find another solution to your problem.
If however you only have one writing process, and your goal is that other processes either see the full write or no write at all, you can just make the changes in a temporary copy of the file and then use rename to atomically replace it. Any reader that already had a file descriptor open to the old file will see the old contents; any reader opening it newly by name will see the new contents. Partial updates will never be seen by any reader.
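A minimal sketch of that approach, assuming you can build the complete new contents in memory (for example by reading the original file and patching the 64 kB region); replace_file() is a hypothetical helper and error handling is abbreviated:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Replace 'path' with 'data' of 'size' bytes, atomically from the point
 * of view of anyone opening the file by name. */
int replace_file(const char *path, const void *data, size_t size)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp.%ld", path, (long)getpid());

    FILE *f = fopen(tmp, "w");
    if (!f) return -1;
    if (fwrite(data, 1, size, f) != size) { fclose(f); unlink(tmp); return -1; }
    if (fflush(f) != 0 || fsync(fileno(f)) != 0) { fclose(f); unlink(tmp); return -1; }
    fclose(f);

    /* rename() is atomic with respect to other processes opening 'path'. */
    if (rename(tmp, path) != 0) { unlink(tmp); return -1; }
    return 0;
}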
There are a few approaches to modify file contents "atomically". While technically the modification itself is never truly atomic, there are ways to make it seem atomic to all other processes.
My favourite method in Linux is to take a write lease using fcntl(fd, F_SETLEASE, F_WRLCK). It will only succeed if fd is the only open descriptor to the file; that is, nobody else (not even this process) has the file open. Also, the file must be owned by the user running the process, or the process must run as root, or the process must have the CAP_LEASE capability, for the kernel to grant the lease.
When successful, the lease owner process gets a signal (SIGIO by default) whenever another process is opening or truncating the file. The opener will be blocked by the kernel for up to /proc/sys/fs/lease-break-time seconds (45 by default), or until the lease owner releases or downgrades the lease or closes the file, whichever is shorter. Thus, the lease owner has dozens of seconds to complete the "atomic" operation, without any other process being able to see the file contents.
There are a couple of wrinkles one needs to be aware of. One is the privileges or ownership required for the kernel to allow the lease. Another is the fact that the other party opening or truncating the file will only be delayed; the lease owner cannot replace (hardlink or rename) the file. (Well, it can, but the opener will always open the original file.) Also, renaming, hardlinking, and unlinking/deleting the file does not affect the file contents, and therefore are not affected at all by file leases.
Remember also that you need to handle the signal generated. You can use fcntl(fd, F_SETSIG, signum) to change the signal. I personally use a trivial signal handler -- one with an empty body -- to catch the signal, but there are other ways too.
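A minimal sketch of taking and releasing a write lease, assuming Linux/glibc (F_SETLEASE needs _GNU_SOURCE) and that the process satisfies the ownership or CAP_LEASE requirement described above; modify_under_lease() is a hypothetical wrapper:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void on_lease_break(int sig) { (void)sig; /* intentionally empty */ }

int modify_under_lease(const char *path)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_lease_break;
    sigaction(SIGIO, &sa, NULL);      /* SIGIO is the default lease-break signal */

    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;

    /* Fails if anyone else currently has the file open. */
    if (fcntl(fd, F_SETLEASE, F_WRLCK) < 0) { close(fd); return -1; }

    /* ... perform the "atomic" modification here, e.g. with pwrite() ... */

    fcntl(fd, F_SETLEASE, F_UNLCK);   /* release the lease */
    close(fd);
    return 0;
}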
A portable method to achieve semi-atomicity is to use a memory map using mmap(). The idea is to use memmove() or similar to replace the contents as quickly as possible, then use msync() to flush the changes to the actual storage medium.
If the memory map offset in the file is a multiple of the page size, the mapped pages reflect the page cache. That is, any other process reading the file, in any way -- mmap() or read() or their derivatives -- will immediately see the changes made by the memmove(). The msync() is only needed to make sure the changes are also stored on disk, in case of a system crash -- it is basically equivalent to fsync().
To avoid preemption (the kernel interrupting the action because the current timeslice is up) and page faults, I'd first read the mapped data to make sure the pages are in memory, and then call sched_yield(), before the memmove(). Reading the mapped data should fault the pages into the page cache, and sched_yield() releases the rest of the timeslice, making it extremely likely that the memmove() is not interrupted by the kernel in any way. (If you do not make sure the pages are already faulted in, the kernel will likely interrupt the memmove() for each page separately. You won't see that in the process, but other processes will see the modifications occur in page-sized chunks.)
This is not exactly atomic, but it is practical: it does not give you any guarantees, only makes the race window very very short; therefore I call this semi-atomic.
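A minimal sketch of that memory-map approach, assuming the offset is a multiple of the page size and offset + len stays within the existing file; overwrite_mapped() is a hypothetical helper and error handling is abbreviated:

#include <fcntl.h>
#include <sched.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int overwrite_mapped(const char *path, off_t offset, const void *data, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    if (map == MAP_FAILED) { close(fd); return -1; }

    /* Touch the pages first so the copy itself does not page-fault
     * (4096 is assumed to be the page size here)... */
    volatile char sink = 0;
    for (size_t i = 0; i < len; i += 4096) sink += map[i];
    (void)sink;
    sched_yield();                /* ...and start with a fresh timeslice. */

    memmove(map, data, len);      /* visible to all readers immediately */
    msync(map, len, MS_SYNC);     /* push the change to the storage medium */

    munmap(map, len);
    close(fd);
    return 0;
}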
Note that this method is compatible with file leases. One could try to take a write lease on the file, but fall back to leaseless memory mapping if the lease is not granted within some acceptable time period, say a second or two. I'd use timer_create() and timer_settime() to create the timeout timer, and the same empty-body signal handler to catch the SIGALRM signal; that way the fcntl() is interrupted (returns -1 with errno == EINTR) when the timeout occurs -- with the timer interval set to some small value (say 25000000 nanoseconds, or 0.025 seconds) so it repeats very often after that, interrupting syscalls if the initial interrupt is missed for any reason.
Most userspace applications create a copy of the original file, modify the contents of the copy, then replace the original file with the copy.
Each process that opens the file will only see complete changes, never a mix of old and new contents. However, anyone keeping the file open will only see the original contents and not be aware of any changes (unless they check themselves). Most text editors do check, but daemons and other processes usually do not bother.
Remember that in Linux, the file name and its contents are two separate things. You can open a file, unlink/remove it, and still keep reading and modifying the contents for as long as you have the file open.
There are other approaches, too. I do not want to suggest any specific approach, because the optimal one depends heavily on the circumstances: Do the other processes keep the file open, or do they always (re)open it before reading the contents? Is atomicity preferred or absolutely required? Is the data plain text, structured like XML, or binary?
EDITED TO ADD:
Please note that there are no ways to guarantee beforehand that the file will be successfully modified atomically. Not in theory, and not in practice.
You might encounter a write error with the disk full, for example. Or the drive might hiccup at just the wrong moment. I'm only listing three practical ways to make it seem atomic in typical use cases.
The reason write leases are my favourite is that I can always use fcntl(fd, F_GETLEASE) to check whether the lease is still held. If it is not, then the write was not atomic.
High system load is unlikely to cause the lease to be broken for a 64k write, if the same data has been read just prior (so that it will likely be in page cache). If the process has superuser privileges, you can use setpriority(PRIO_PROCESS,getpid(),-20) to temporarily raise the process priority to maximum while taking the file lease and modifying the file. If the data to be overwritten has just been read, it is extremely unlikely to be moved to swap; thus swapping should not occur, either.
In other words, while it is quite possible for the lease method to fail, in practice it is almost always successful -- even without the extra tricks mentioned in this addendum.
Personally, I simply check if the modification was not atomic, using the fcntl() call after the modification, prior to msync()/fsync() (making sure the data hits the disk in case a power outage occurs); that gives me an absolutely reliable, trivial method to check whether the modification was atomic or not.
For configuration files and other sensitive data, I too recommend the rename method. (Actually, I prefer the hardlink approach used for NFS-safe file locking, which amounts to the same thing but uses a temporary name to detect naming races.) However, it has the problem that any process keeping the file open will have to check and reopen the file, voluntarily, to see the changed contents.
Disk writes cannot be atomic without a layer of abstraction. You should keep a journal and revert if a write is interrupted.
As far as I know, a write below the size of PIPE_BUF is atomic, but I never rely on this. If the programs that access the file are all written by you, you can use flock() to achieve exclusive access. This system call sets an advisory lock on the file, so other processes that check for the lock know whether they may access it.
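A minimal sketch of the flock() approach, assuming every program that touches the file cooperates by taking the same lock; write_exclusive() is a name I made up:

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int write_exclusive(const char *path, off_t offset, const void *buf, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;

    flock(fd, LOCK_EX);                 /* blocks until we own the lock */
    ssize_t n = pwrite(fd, buf, len, offset);
    fsync(fd);                          /* make sure the data hits the disk */
    flock(fd, LOCK_UN);

    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}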

Read and write from a file in a circular buffer fashion

I need to make a file behave as a circular buffer. From one thread I have to write data to the file; from another thread I have to read from it. The size of the file is fixed.
Any ideas?
Since you have not mentioned the language you will be using, I am only able to provide you with a general answer: Write an abstraction that, when reading past the end of the file, seeks to the beginning of the file and resumes reading there.
Be advised that writing to and reading from the file from multiple threads needs proper synchronization.
I assume that each thread knows the position of the other. In that case the writer can append to the file, increasing its position until it reaches MAXSIZE. Then it should wrap around by seeking to position 0 and continue overwriting the old contents as long as its position is smaller than the reader's position; after that it has to block. At the same time the reader can read, wrapping around if necessary, until it reaches the position of the writer.
In other words, it's not much different from the standard circular in-memory buffer. Are you sure that using a file is necessary in your case? You might also consider doing some research on the producer-consumer problem.
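A minimal sketch of the wrap-around part, assuming POSIX pread()/pwrite() on a pre-sized file of MAXSIZE bytes; the synchronization of the reader's and writer's logical positions (mutex, condition variable, or similar) is deliberately left out:

#include <sys/types.h>
#include <unistd.h>

#define MAXSIZE (1024 * 1024)

/* Write 'len' bytes at logical position 'pos', wrapping at MAXSIZE;
 * returns the new logical position. */
off_t ring_write(int fd, off_t pos, const char *buf, size_t len)
{
    while (len > 0) {
        size_t room = MAXSIZE - (pos % MAXSIZE);
        size_t n = len < room ? len : room;
        pwrite(fd, buf, n, pos % MAXSIZE);
        buf += n; len -= n; pos += n;
    }
    return pos;
}

/* Read 'len' bytes starting at logical position 'pos', wrapping at MAXSIZE. */
off_t ring_read(int fd, off_t pos, char *buf, size_t len)
{
    while (len > 0) {
        size_t room = MAXSIZE - (pos % MAXSIZE);
        size_t n = len < room ? len : room;
        pread(fd, buf, n, pos % MAXSIZE);
        buf += n; len -= n; pos += n;
    }
    return pos;
}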
You could also consider using a named pipe.

Read a line of input faster than fgets?

I'm writing a program where performance is quite important, but not critical. Currently I am reading text from a FILE* line by line, using fgets to obtain each line. After using some performance tools, I've found that 20% to 30% of my application's running time is spent inside fgets.
Are there faster ways to get a line of text? My application is single-threaded with no intentions to use multiple threads. Input could be from stdin or from a file. Thanks in advance.
You don't say which platform you are on, but if it is UNIX-like, then you may want to try the read() system call, which does not perform the extra layer of buffering that fgets() et al do. This may speed things up slightly, on the other hand it may well slow things down - the only way to find out is to try it and see.
Use fgets_unlocked(), but read carefully what it does first.
Get the data with fgetc() or fgetc_unlocked() instead of fgets(). With fgets(), your data is copied into memory twice: first by the C runtime library from the file into an internal buffer (stream I/O is buffered), then from that internal buffer into an array in your program.
Read the whole file in one go into a buffer.
Process the lines from that buffer.
That's the fastest possible solution.
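A minimal sketch of that approach, assuming the whole file fits in memory; the commented-out handle_line() is a placeholder for the per-line work:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void process_all_lines(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return;

    /* Slurp the whole file into one buffer. */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    char *buf = malloc(size);
    size_t got = fread(buf, 1, size, f);
    fclose(f);

    /* Walk the buffer line by line without any further copying. */
    const char *p = buf, *end = buf + got;
    while (p < end) {
        const char *nl = memchr(p, '\n', end - p);
        size_t len = nl ? (size_t)(nl - p) : (size_t)(end - p);
        /* handle_line(p, len); */
        p += len + 1;
    }
    free(buf);
}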
You might try minimizing the amount of time you spend reading from the disk by reading large amounts of data into RAM then working on that. Reading from disk is slow, so minimize the amount of time you spend doing that by reading (ideally) the entire file once, then working on it.
Sorta like the way CPU cache minimizes the time the CPU actually goes back to RAM, you could use RAM to minimize the number of times you actually go to disk.
Depending on your environment, using setvbuf() to increase the size of the internal buffer used by file streams may or may not improve performance.
This is the syntax:
setvbuf (InputFile, NULL, _IOFBF, BUFFER_SIZE);
Where InputFile is a FILE* to a file just opened using fopen() and BUFFER_SIZE is the size of the buffer (which is allocated by this call for you).
You can try various buffer sizes to see if any have positive influence. Note that this is entirely optional, and your runtime may do absolutely nothing with this call.
If the data is coming from disk, you could be IO bound.
If that is the case, get a faster disk (but first check that you're getting the most out of your existing one...some Linux distributions don't optimize disk access out of the box (hdparm)), stage the data into memory (say by copying it to a RAM disk) ahead of time, or be prepared to wait.
If you are not IO bound, you could be wasting a lot of time copying. You could benefit from so-called zero-copy methods: something like memory-mapping the file and accessing it only through pointers.
That is a bit beyond my expertise, so you should do some reading or wait for more knowledgeable help.
BTW-- You might be getting into more work than the problem is worth; maybe a faster machine would solve all your problems...
NB-- It is not clear that you can memory map the standard input either...
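A minimal sketch of that memory-map idea for a regular file (it won't help for a pipe or, usually, for standard input); use_line() is a hypothetical callback:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void scan_mapped_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;

    struct stat st;
    fstat(fd, &st);

    const char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { close(fd); return; }

    /* Walk the mapping directly; no per-line copy into a separate buffer. */
    const char *p = map, *end = map + st.st_size;
    while (p < end) {
        const char *nl = memchr(p, '\n', end - p);
        /* use_line(p, nl ? (size_t)(nl - p) : (size_t)(end - p)); */
        p = nl ? nl + 1 : end;
    }

    munmap((void *)map, st.st_size);
    close(fd);
}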
If the OS supports it, you can try asynchronous file reading, that is, the file is read into memory whilst the CPU is busy doing something else. So, the code goes something like:

start asynchronous read
loop:
    wait for asynchronous read to complete
    if end of file goto exit
    start asynchronous read
    do stuff with data read from file
    goto loop
exit:
If you have more than one CPU then one CPU reads the file and parses the data into lines, the other CPU takes each line and processes it.
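One way to sketch that loop on POSIX systems is the AIO interface (aio_read() and friends; link with -lrt on older glibc). "input.txt" and CHUNK are placeholders, and error handling is abbreviated:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 65536

int main(void)
{
    int fd = open("input.txt", O_RDONLY);
    if (fd < 0) return 1;

    static char buf[2][CHUNK];
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf[0];
    cb.aio_nbytes = CHUNK;
    cb.aio_offset = 0;
    aio_read(&cb);                            /* start the first read */

    int cur = 0;
    for (;;) {
        const struct aiocb *list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS) /* wait for the read to finish */
            aio_suspend(list, 1, NULL);
        ssize_t n = aio_return(&cb);
        if (n <= 0) break;                    /* end of file (or error) */

        int done = cur;                       /* buffer that just filled up */
        cur = 1 - cur;
        off_t next_off = cb.aio_offset + n;

        memset(&cb, 0, sizeof cb);            /* start the next read ... */
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[cur];
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = next_off;
        aio_read(&cb);

        /* ... and do stuff with buf[done][0..n) while it is in flight. */
        (void)done;
    }
    close(fd);
    return 0;
}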
