Thread safe multi-file writing

I have a daemon that accepts socket connections and reads or writes a dynamic set of files, depending on the nature of the connection. Because my daemon is multithreaded, the possibility exists that the same file may be written to by more than one thread. Because my list of files is dynamic and not fixed, I'm not sure how to keep one thread from bumping into the other. For performance reasons, I want threads to be writing to different files at the same time, just not the same file at the same time.
Other questions have suggested using mutexes, but I'm not entirely clear how a mutex would help in this scenario - the list of files being dynamic and only known to the thread.
Would it be appropriate to use file locking in this case? If so, how would one implement file locking in a thread-safe way?

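flock will work OK, provided each thread opens the file itself. The lock is attached to the open file description rather than to the descriptor number, so descriptors obtained from separate open() calls are locked independently, while a descriptor shared between threads (or duplicated with dup) shares a single lock.
A file that has been exclusively flock'ed through one open file description can't be exclusively locked through another, whether by another process or by another thread. That would defeat the entire purpose of locks.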
One note is that these locks are advisory. A process that doesn't use flock can happily overwrite the file, even if another process has exclusive-flock'ed it.
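A minimal sketch of that behaviour, assuming Linux or BSD flock(2) and assuming every thread or process obtains its descriptor from its own open() call. The helper names try_lock_exclusive and unlock_file are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Try to take an exclusive advisory lock without blocking.
 * Returns 0 on success, -1 on failure. On failure errno is EWOULDBLOCK
 * when the file is already locked through another open(). */
int try_lock_exclusive(int fd)
{
    assert(fd >= 0);
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
        return 0;
    return -1;
}

/* Release the lock held through fd. Returns 0 on success. */
int unlock_file(int fd)
{
    return flock(fd, LOCK_UN);
}
```

A writer that would rather wait than fail can drop LOCK_NB, in which case flock blocks until the lock becomes free.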

I would use an event broker pattern. Each socket-handling thread fires an event (with the file or files as arguments), and the event is handled by a central file broker that keeps a shared collection of the files currently being written.
If the file cannot be written to, decide what you want to do (wait, retry or report failure); otherwise report success.
Multiple listeners, one central file-lock collection, multiple writers.

I can't say this would be the "optimum" solution, but I'd propose something like this:
Maintain a linked list of a struct that contains two things:
The filename
A condition wait variable associated with the file.
Flow A. When the daemon receives a request, mutex lock the list and check to see whether the filename is in the list or not. If it is not, add a new entry to the linked list with a new condition wait variable for other threads to use. Release the mutex lock. Perform the file operation. Once complete, lock the linked list and remove the struct entry for that file, then signal the other threads via the wait object.
Flow B. If a request comes in for the same file, it'll lock the list and look for the filename contained in the list. If it is in the list, grab the wait variable and wait on it. When the thread is signaled, grab a lock on the list and see if the file is in the list (It's possible another thread picked up the lock on the filename before you). If not, follow Flow A. If so, grab the wait variable in the new struct and wait again until signaled, then follow the above steps again.
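A rough sketch of Flows A and B in C with POSIX threads. It makes one simplification that is not in the description above: instead of one condition variable per list entry, all waiters share a single condition variable and re-check the list when woken. The names file_claim, file_release and file_is_claimed are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* One node per file that some thread is currently working on. */
struct busy_file {
    char *name;
    struct busy_file *next;
};

static struct busy_file *busy_list = NULL;
static pthread_mutex_t busy_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t busy_changed = PTHREAD_COND_INITIALIZER;

/* Caller must hold busy_mutex. Returns 1 if name is in the list. */
static int is_busy_locked(const char *name)
{
    for (struct busy_file *p = busy_list; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return 1;
    return 0;
}

/* Block until no other thread owns name, then claim it.
 * Returns 0 on success, -1 if memory allocation fails. */
int file_claim(const char *name)
{
    assert(name != NULL);
    struct busy_file *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->name = malloc(strlen(name) + 1);
    if (node->name == NULL) {
        free(node);
        return -1;
    }
    strcpy(node->name, name);

    pthread_mutex_lock(&busy_mutex);
    while (is_busy_locked(name))            /* Flow B: wait, then re-check */
        pthread_cond_wait(&busy_changed, &busy_mutex);
    node->next = busy_list;                 /* Flow A: add our entry */
    busy_list = node;
    pthread_mutex_unlock(&busy_mutex);
    return 0;
}

/* Remove name from the list and wake every waiting thread. */
void file_release(const char *name)
{
    pthread_mutex_lock(&busy_mutex);
    for (struct busy_file **pp = &busy_list; *pp != NULL; pp = &(*pp)->next) {
        if (strcmp((*pp)->name, name) == 0) {
            struct busy_file *dead = *pp;
            *pp = dead->next;
            free(dead->name);
            free(dead);
            break;
        }
    }
    pthread_cond_broadcast(&busy_changed);
    pthread_mutex_unlock(&busy_mutex);
}

/* Snapshot query, mainly useful for tests and debugging. */
int file_is_claimed(const char *name)
{
    pthread_mutex_lock(&busy_mutex);
    int busy = is_busy_locked(name);
    pthread_mutex_unlock(&busy_mutex);
    return busy;
}
```

A connection handler would call file_claim(path), do its reads or writes, and then call file_release(path). Threads working on different files never wait for each other beyond the brief list lookup.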

Related

Is it possible to have a shared global variable for inter-process communication?

I need to solve a concurrency assignment for my operating systems class. I don't want the solution here, but I am lacking one part.
We should write a process that writes to a file, reads from it and then deletes it. We should run this process twice, in two different shells. No fork here, for simplicity. Process A should write, Process B should then read, and then the file should be deleted. Afterwards they switch roles.
I understand that you can achieve atomicity easily by locking. With while loops around the read and write sections, you can also get further control. But when I run process A and then process B, process B spins before the write section until it acquires the lock, and does not go on to reading when process A releases the lock. So my best guess is to have a read lock and a write lock. This information must be shared somehow between the processes. The only way I can think of is some global variable, but since each process holds its own copy of the variables, I think this is not possible. Another way would be to have a read lock file and a write lock file, but that seems overly complicated to me.
Is there a better way?

You can use semaphores to ensure the writer and the deleter wait for the previous process to finish its job. (See man sem_init for details.)
When semaphores are shared between multiple processes, they should be created in shared memory (see man shm_open for more details).
You will need as many semaphores as the number of pipelines in this process.

You can use file as a lock. Two processes try to create a file with a previously agreed upon name using the O_EXCL flag. Only one will succeed. The one that succeeds gets the access to the resource. So in this case process A should try to create a file with name say, foo, with O_EXCL flag and, if successful, it should go ahead and write to file the information. After its work is complete, Process A should unlink foo. Process B should try to create file foo with O_EXCL flag, and if successful, try to read the file created by Process A. After its attempt is over, Process B should unlink the file foo. That way only one process will be accessing the file at any time.
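A minimal sketch of the lock-file idea, assuming a POSIX system. The function names lockfile_acquire and lockfile_release are illustrative, and the lock file path is whatever name the two processes agree on.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to become the owner by creating the lock file atomically.
 * Returns 0 if this call created the file, -1 otherwise
 * (errno == EEXIST means someone else already owns the lock). */
int lockfile_acquire(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}

/* Give up ownership by removing the lock file. Returns 0 on success. */
int lockfile_release(const char *path)
{
    return unlink(path);
}
```

A process that fails to acquire the lock would typically sleep briefly and try again.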

Your problem (with files and alternating roles in the creation and deletion of files) seems to be a candidate for the O_EXCL flag when opening or creating the file. Combined with O_CREAT, this flag makes the open(2) system call succeed in creating the file only if the file doesn't already exist, so the file itself acts as a semaphore. Either process (A or B) can release the lock, but whichever one does simply releases it and makes the owning role available again.
You will see that both processes try to use one of the roles, but if they both try to use the owner role, one of them will succeed, and the other will fail.
Just enable a SIGINT signal handler on the owning process, to allow it to delete the file in case it gets signalled, or you will leave the file and after that no process will be able to assume the owning role (at least you will need to delete it manually).
This was the first form of locking in Unix, long before semaphores, shared memory or other ways to block processes existed. It relies on open(2) with O_CREAT and O_EXCL being atomic: the existence check and the creation happen as a single step, so two processes cannot both succeed.

Alternative to sleep, semaphores

I have a simple C program (on Linux). The steps in the program are as follows:
1. Within a while loop, it calls a query that returns exactly one record. It is essentially a view that looks for a column called "processed" with value of "0" and uses "limit 1".
2. I read the records in the result set, perform some calculations and upload the results back to the database. I also set the processed column to "1".
3. If this query does not return any records, I exit the while loop.
4. Once the while loop is exited, the program exits.
Once it completes running, I do not want the program to exit. The reason is the database might get more qualifying records in the next 30 minutes. I want this program to be long running program that would check for any new records and start the while loop again to process the records.
I am not doing any multithreading or fancy stuff. I did some googling and found posts talking about semaphores.
Is this the right way to go about it? Are there any simple examples of semaphores with explanation?

First, I hope you're using a transaction. Otherwise there can be a race condition between steps 1 and 2.
I think your question is "How does your program know when there is more information to be processed in a SQL table?" There are several ways to do this.
The simplest is polling. Your program just checks every so often if there's any work. If there isn't, it sleeps for a while. If checking is cheap, or you don't have to check very often, polling is fine. It's pretty robust, there's no coordination necessary between the worker and the supplier. The worker just checks for work.
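A small sketch of such a polling loop. The database access is hidden behind a hypothetical callback (process_one_fn) that handles one record and reports whether it found any; fake_process_one is only a stand-in so the loop can be exercised without a database. The max_idle_polls parameter is also an addition for illustration, so the loop can be stopped; a long-running daemon would pass 0.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical callback: process one pending record and return 1,
 * or return 0 when the query found nothing to do. */
typedef int (*process_one_fn)(void *ctx);

/* Drain all available work, sleep, and repeat.
 * max_idle_polls == 0 means "run forever"; otherwise the loop gives up
 * after that many consecutive empty polls. Returns records processed. */
long poll_for_work(process_one_fn process_one, void *ctx,
                   unsigned idle_seconds, unsigned max_idle_polls)
{
    assert(process_one != NULL);
    long processed = 0;
    unsigned idle_polls = 0;

    for (;;) {
        if (process_one(ctx)) {
            processed++;
            idle_polls = 0;
            continue;               /* keep draining without sleeping */
        }
        idle_polls++;
        if (max_idle_polls != 0 && idle_polls >= max_idle_polls)
            return processed;
        sleep(idle_seconds);        /* nothing to do: back off */
    }
}

/* Stand-in for the database query: counts down a number of fake records. */
int fake_process_one(void *ctx)
{
    int *remaining = ctx;
    if (*remaining == 0)
        return 0;
    (*remaining)--;
    return 1;
}
```

For the program in the question, the callback would run the "limit 1" query and the update inside one transaction, and idle_seconds would be chosen according to how quickly new records need to be picked up.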
Another is to make the program block on some sort of I/O like waiting for a lock on a file. That's what semaphores are about. It goes like this.
The queue is empty.
The producer gets an exclusive lock on the semaphore file.
Your worker tries to get a lock on the semaphore file, it blocks.
The producer adds to the queue and releases its lock.
The worker immediately unblocks.
Checks the queue
Does its work.
...but resetting the system is a problem. The producer doesn't know when the queue is empty again without polling. And this requires everything adding to the SQL table knows about this procedure and is located on the same machine. Even if you get it working, it's very vulnerable to deadlocks and race conditions.
Another way is via signals. The producer process sends a signal to the worker process to say "I added some work". As above, this requires coordination between the things adding to the SQL table and the workers.
A better solution is to not use SQL for a work queue. It's inherently something you have to poll. Instead use a named or network pipe. Pipes automatically act as a queue. Producers write to the pipe when they add work. The worker connects to the pipe and reads from it to get more work. If there's no work, it quietly blocks waiting for work. The pipe can contain all the information necessary to do the work, or it can just contain an indication that there is work elsewhere (like an ID for a row).
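As a sketch of the worker side, assuming the producers write one decimal row ID per line into a named pipe (that wire format, and the names open_job_pipe and next_job_id, are assumptions made for this example):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Create the FIFO if needed and open it for reading.
 * fopen blocks until some producer opens the FIFO for writing. */
FILE *open_job_pipe(const char *path)
{
    if (mkfifo(path, 0600) != 0 && errno != EEXIST)
        return NULL;
    return fopen(path, "r");
}

/* Read the next row ID from the pipe, one decimal number per line.
 * Blocks while the pipe is empty. Returns -1 once every writer has
 * closed its end (EOF) or when a line cannot be parsed. */
long next_job_id(FILE *pipe_in)
{
    assert(pipe_in != NULL);
    char line[64];
    if (fgets(line, sizeof line, pipe_in) == NULL)
        return -1;
    char *end;
    long id = strtol(line, &end, 10);
    if (end == line)
        return -1;
    return id;
}
```

The worker's main loop would call next_job_id repeatedly and process the row for each ID it gets back. When it returns -1 because all producers have closed the pipe, the worker can reopen the FIFO and block again.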
Finally, depending on how much processing needs to be done, you could try doing all that processing in a stored procedure triggered by a table update.

Making all the children sleep from another child thread

I am trying to develop a program with POSIX threads in which I have a child thread that updates the content of a file and the database at certain intervals, and other child threads that read data from the file and database all the time. I don't want any thread to read the file or database while they are being written by the single updater thread. So my idea is to have the updater thread make all the other child threads sleep. But sleep() only makes the calling thread sleep. Is there any way the above scenario can be implemented?
EDIT:
I have two different functions for reading and writing the file. Most of the threads access the read method so they aren't vulnerable but they might be if they try to read in between while the periodic thread which accesses the write method is updating the file's contents.

You do not want to use sleep for this at all. Instead, use a reader/writer lock. The updater thread must acquire the lock (in write mode) before it modifies the data. And the other threads must acquire the lock (in read mode) before reading the data.
Note that if your reader threads are reading continuously, the writer may get starved and never acquire the lock, depending on how the implementation prioritizes readers and writers. In that case you will need some separate mechanism, such as a flag the updater can set that tells the readers to please stop reading and release their locks. If the readers only read occasionally this shouldn't be such an issue (unless there are tons of readers, in which case you may have an architectural problem).

N threads writing into M files, running at different priorities, while keeping track of all files requested by and currently allocated to each thread

I want to create a program, using POSIX threads, having n threads running at different priorities.
There are files (say m files) which are shared among these n threads. If one thread is using a file (assume that it is writing to the file), no other thread is allowed to use it. The code should maintain a table that tells, for each thread, which files it has acquired and for which files its requests are pending.
Also, we need a monitor thread to check for deadlocks. Any implementation hints or ideas?

You don't need to check for deadlocks. You have to write code that makes it impossible to run into a deadlock scenario. For that reason, I'd recommend a try-lock approach: lock down the chain of files one by one, and unlock them all again should any of the lock acquisitions fail.
Also, if you are using C buffered I/O, I'd recommend you stick with ftrylockfile and funlockfile APIs. Otherwise use a synchronization mechanism that is most appropriate for your case, be that futex API or locks implemented using atomic instructions.
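A sketch of the try-lock-and-roll-back idea using one pthread mutex per file. The function names lock_all_or_none and unlock_all are illustrative, and using plain mutexes rather than ftrylockfile is an assumption of this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Try to lock every mutex in locks[0..n-1]. If any one is busy, release
 * the ones already taken and return -1 so the caller can retry later.
 * Returns 0 when all n locks are held. */
int lock_all_or_none(pthread_mutex_t *locks[], size_t n)
{
    assert(locks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (pthread_mutex_trylock(locks[i]) != 0) {
            while (i > 0)                   /* roll back in reverse order */
                pthread_mutex_unlock(locks[--i]);
            return -1;
        }
    }
    return 0;
}

/* Release all n locks, last acquired first. */
void unlock_all(pthread_mutex_t *locks[], size_t n)
{
    while (n > 0)
        pthread_mutex_unlock(locks[--n]);
}
```

Because a thread never blocks while holding a partial set of locks, no cycle of waiting threads can form. The cost is that a thread may have to retry, ideally after a short pause.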

The standard Unix way to accomplish this is: spool directories.
file operations, such as rename / link / unlink are atomic
have one central input spool-dir, where input files can be placed
a process / thread that wants to process a file, starts by moving it to another name, or better: to another (work) directory (using the thread_id or process number as directory name is obvious.)
(since this move is atomic there is no possible race condition!)
after processing, the finished files can be moved to an output directory
the scoreboard function is simply a readdir(+stat), maybe even inotify, on the work directories
process starvation will always be a problem. Incompletely processed files will live forever in the work directories. Having a stamp / pid file in the work directories could help cleanup / restart.
if designed well, this structure could work even after machine failure. The workers would have to maintain their own backup / log /stamp-file mechanism.
if you haven't noticed yet: no locking will be needed.
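A sketch of the claim step, assuming the spool and work directories live on the same filesystem (rename is only atomic in that case). The helper names ensure_dir and claim_file and the fixed buffer size are choices made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* Create a directory if it does not exist yet. Returns 0 on success. */
int ensure_dir(const char *path)
{
    if (mkdir(path, 0700) != 0 && errno != EEXIST)
        return -1;
    return 0;
}

/* Move spool_dir/name to work_dir/name. Only one worker can win the
 * rename, so the winner owns the file without any further locking.
 * Returns 0 if this worker claimed the file, -1 otherwise
 * (errno == ENOENT means another worker got there first). */
int claim_file(const char *spool_dir, const char *work_dir, const char *name)
{
    char from[4096], to[4096];
    assert(spool_dir != NULL && work_dir != NULL && name != NULL);
    if (snprintf(from, sizeof from, "%s/%s", spool_dir, name) >= (int)sizeof from ||
        snprintf(to, sizeof to, "%s/%s", work_dir, name) >= (int)sizeof to) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return rename(from, to);
}
```

Each worker would call ensure_dir once for its private work directory (named after its thread or process ID, as suggested above) and then call claim_file for every candidate it finds in the input spool directory.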

I hate C. I have to try and think of a way to do this without classes:(
OK, a 'Sfile' struct to represent each file. Has name, path, file fd/handle, everything to do with one file, plus an 'inUse' boolean.
A 'waitingThreads' array for those threads waiting for a set of files.
A 'Sfiles' struct with an array of *Sfile to hold all the files, a waitingThreads array and a lock, (mutex/futex/criticalSection).
Each thread should have an event/semaphore/something that it can wait on until its files all become available and some way to access to the set of files that it needs and somewhere to store the fds/handles/whatever for the files.
OK, off we go:
Any thread that wants files locks up the Sfiles and iterates the *Sfile array, checking if every file it needs is free to use. If they all are, it sets the 'inUse' boolean, loads itself up with the fd/handles, unlocks and runs on - it has all its files. If any file it needs is in use, it pushes itself onto the waitingThreads array and waits on its event/sema.
When a thread is done with its files, it locks the Sfiles and clears the 'inUse' boolean for the files it was using. It then iterates the waitingThreads array - if the array is empty, it just unlocks and exits. If the array is not empty, it tries to find threads that can now run with the files that are now free. If it finds none, it just unlocks and returns. If it does find one, it loads that thread up with the fd/handles, sets the inUse boolean and signals its event/sema - that thread will then run with its desired set of files. The thread continues to iterate the waitingThreads array to the end, looking for more threads that it can load up and signal with the remaining free files. When it reaches the end of the array, it unlocks and returns.
That, or something like it, will ensure that the threads always run with their complete set of files, prevent any deadlocks due to threads locking partial sets of files and does not require any polling.
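A simplified sketch of the all-or-nothing acquisition. It deviates from the description above in two labeled ways: files are identified by an index into a fixed-size table (MAX_FILES is an arbitrary choice), and instead of a waitingThreads array with one event per thread, every waiter sleeps on one shared condition variable and re-checks its own set when woken. All names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FILES 16

/* The 'Sfiles' table: one inUse flag per file, all under a single lock. */
struct sfiles {
    pthread_mutex_t lock;
    pthread_cond_t released;   /* broadcast whenever files are freed */
    bool in_use[MAX_FILES];
};

void sfiles_init(struct sfiles *t)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->released, NULL);
    for (size_t i = 0; i < MAX_FILES; i++)
        t->in_use[i] = false;
}

/* Caller holds t->lock. True if every wanted file is free. */
static bool all_free(const struct sfiles *t, const size_t *want, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(want[i] < MAX_FILES);
        if (t->in_use[want[i]])
            return false;
    }
    return true;
}

/* Block until the whole set is free, then take all of it at once. */
void sfiles_acquire(struct sfiles *t, const size_t *want, size_t n)
{
    pthread_mutex_lock(&t->lock);
    while (!all_free(t, want, n))
        pthread_cond_wait(&t->released, &t->lock);
    for (size_t i = 0; i < n; i++)
        t->in_use[want[i]] = true;
    pthread_mutex_unlock(&t->lock);
}

/* Non-blocking variant: returns false instead of waiting. */
bool sfiles_try_acquire(struct sfiles *t, const size_t *want, size_t n)
{
    pthread_mutex_lock(&t->lock);
    bool ok = all_free(t, want, n);
    for (size_t i = 0; ok && i < n; i++)
        t->in_use[want[i]] = true;
    pthread_mutex_unlock(&t->lock);
    return ok;
}

/* Free the files and wake all waiters so they can re-check their sets. */
void sfiles_release(struct sfiles *t, const size_t *held, size_t n)
{
    pthread_mutex_lock(&t->lock);
    for (size_t i = 0; i < n; i++)
        t->in_use[held[i]] = false;
    pthread_cond_broadcast(&t->released);
    pthread_mutex_unlock(&t->lock);
}
```

The broadcast wakes more threads than strictly necessary, which is the price of dropping the per-thread events. It also gives no control over which waiter goes first, so priority ordering would need the sorted waiting array from the original description.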
If you really, really need that table thingy, you can build it inside the lock every time a thread enters or leaves the lock. I would suggest mallocing a suitable struct, loading it up with all the details of the free files and waiting threads, and queueing it off to another thread. You could just have some 'monitoring' thread that periodically locks up the Sfiles, dumps all the info and unlocks, but that keeps the Sfiles locked for the entire 'dump' time - you may not want that overhead - it's up to you.
Edit:
OH - forgot the priority thingy. The OS thread priority is probably useless for your purpose. Have each thread expose a priority enum/int and keep the 'waitingThreads' array sorted by that priority, so giving the higher priority threads the first bite at whatever files are returned.
Is that good enough for your homework assignment?

Asynchronous File I/O using threads in C

I'm trying to understand how asynchronous file operations are emulated using threads. I've found next to no material to read on the subject.
Is it possible that:
a process uses a thread to open a regular file (HDD).
the parent gets the file descriptor from the thread, now it may close the thread.
the parent uses the file descriptor with a new thread, reading X bytes from the file.
the parent gets the file descriptor with the seek-position of the current file state.
the parent may repeat these operations, without the need to open, or seek, every time it wishes to "continue" reading a new chunk of the file?
This is just a wild guess of mine, would appreciate if anybody mind to shed more light to clarify how it's being emulated efficiently.
UPDATE:
By efficient I actually mean that I don't want the thread to "wait" from the moment the file has been opened. Think of a non-blocking HTTP daemon that serves a client a huge file: you want to use the thread to read chunks of the file without blocking the daemon, but you don't want to keep the thread busy while "waiting" for the actual transfer to take place. You want to use the thread for other blocking operations for other clients.

To understand asynchronous I/O better, it may be helpful to think in terms of overlapping operations. That is, the number of pending operations (operations that have been started but not yet completed) can simultaneously go above one.
A diagram that explains asynchronous I/O might look like this: http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx
If you are using the asynchronous I/O capabilities provided by the underlying operating system, then it is possible to asynchronously read from multiple files without spawning an equal number of threads.
If your underlying operating system does not provide asynchronous I/O, or if you decide not to use it (in other words, you wish to emulate asynchronous operation using only blocking I/O, the regular read/write provided by the operating system), then it is necessary to spawn as many threads as the number of simultaneous I/O operations. This is because when a thread makes a call to blocking I/O, the thread cannot continue its execution until the operation finishes. In order to start another blocking I/O operation, that operation has to be issued from another thread that is not already occupied.
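A sketch of that emulation with POSIX threads: each pending read gets its own thread, which performs one blocking call while the caller carries on. The struct and function names are invented for this example. It uses pread, which takes an explicit offset, so the file only needs to be opened once and no seek is needed between chunks.

```c
#include <assert.h>
#include <fcntl.h>      /* open(), for callers that create the descriptor */
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* One pending read. The caller fills in fd, buf, len and offset. */
struct read_request {
    int fd;
    void *buf;
    size_t len;
    off_t offset;
    ssize_t result;     /* bytes read, or -1; valid after read_finish() */
    pthread_t thread;
};

static void *read_thread(void *arg)
{
    struct read_request *req = arg;
    /* pread does not touch the shared file position, so several requests
     * on the same descriptor do not disturb each other. */
    req->result = pread(req->fd, req->buf, req->len, req->offset);
    return NULL;
}

/* Start the blocking read on its own thread and return immediately.
 * Returns 0 on success or an error number from pthread_create. */
int read_start(struct read_request *req)
{
    assert(req != NULL);
    return pthread_create(&req->thread, NULL, read_thread, req);
}

/* Wait for the read to complete and return its result. */
ssize_t read_finish(struct read_request *req)
{
    pthread_join(req->thread, NULL);
    return req->result;
}
```

Creating a thread per read is the simplest form of the idea. A server handling many clients would more likely keep a fixed pool of such threads and hand requests to them through a queue.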

When you open/create a file fire up a thread. Now store that thread id/ptr as your file handle.
Basically the thread will do nothing except sit in a loop waiting for an "event". A semaphore would be good here. When you want to do a read, you add the read command to a queue (remember to protect the queue add with a critical section), return a unique id, and then increment the semaphore. If the thread is asleep it will now wake up, grab the first message off the queue and process it. When it has completed, the command is removed from the queue.
To poll whether a file read has completed you can simply check to see if it's in the command queue. If it's not there then the command has completed.
Furthermore, if you want to allow synchronous reads as well, then after sending the message you can wait for an "event" that gets triggered by each completion. You then check to see if the unique id is still in the queue, and if it isn't you return control. If it still is, then you go back to a wait state until the relevant unique id has been processed.
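A compact sketch of such a worker, assuming POSIX threads and unnamed POSIX semaphores (sem_init, as available on Linux). It differs from the description in one labeled detail: a done flag on each command stands in for "no longer in the queue", and the caller's pointer to the command plays the role of the unique id. All names, including the double_int example command, are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>
#include <stddef.h>

/* A queued command. run is a hypothetical callback doing the blocking I/O. */
struct command {
    void (*run)(void *arg);
    void *arg;
    bool done;                 /* set by the worker, read under the mutex */
    struct command *next;
};

struct worker {
    pthread_t thread;
    pthread_mutex_t mutex;     /* protects the list and every done flag */
    pthread_cond_t finished;   /* broadcast after each command completes */
    sem_t pending;             /* counts queued commands */
    struct command *head, *tail;
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    for (;;) {
        while (sem_wait(&w->pending) != 0) { }  /* retry if interrupted */
        pthread_mutex_lock(&w->mutex);
        struct command *c = w->head;
        w->head = c->next;
        if (w->head == NULL)
            w->tail = NULL;
        pthread_mutex_unlock(&w->mutex);
        if (c->run == NULL)                     /* NULL callback means stop */
            return NULL;
        c->run(c->arg);
        pthread_mutex_lock(&w->mutex);
        c->done = true;
        pthread_cond_broadcast(&w->finished);
        pthread_mutex_unlock(&w->mutex);
    }
}

int worker_start(struct worker *w)
{
    assert(w != NULL);
    w->head = w->tail = NULL;
    pthread_mutex_init(&w->mutex, NULL);
    pthread_cond_init(&w->finished, NULL);
    if (sem_init(&w->pending, 0, 0) != 0)
        return -1;
    return pthread_create(&w->thread, NULL, worker_main, w) == 0 ? 0 : -1;
}

/* Queue a command; the caller keeps ownership of *c until it is done. */
void worker_submit(struct worker *w, struct command *c)
{
    c->done = false;
    c->next = NULL;
    pthread_mutex_lock(&w->mutex);
    if (w->tail != NULL) w->tail->next = c;
    else w->head = c;
    w->tail = c;
    pthread_mutex_unlock(&w->mutex);
    sem_post(&w->pending);                      /* wake the worker */
}

/* Poll: has the command completed? */
bool worker_poll(struct worker *w, struct command *c)
{
    pthread_mutex_lock(&w->mutex);
    bool done = c->done;
    pthread_mutex_unlock(&w->mutex);
    return done;
}

/* Synchronous variant: block until the command has completed. */
void worker_wait(struct worker *w, struct command *c)
{
    pthread_mutex_lock(&w->mutex);
    while (!c->done)
        pthread_cond_wait(&w->finished, &w->mutex);
    pthread_mutex_unlock(&w->mutex);
}

/* Ask the worker to exit after the commands already queued, then join it. */
void worker_stop(struct worker *w)
{
    struct command stop = { .run = NULL };
    worker_submit(w, &stop);
    pthread_join(w->thread, NULL);
}

/* Example command body standing in for a file read: doubles an int. */
void double_int(void *arg) { *(int *)arg *= 2; }
```

In the file I/O case the run callback would perform the read for one request, and one such worker would be started per open file as the answer suggests.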
