I have a simple C program (on Linux). The steps in the program are as follows:
Within a while loop, it calls a query that returns exactly one record. It is essentially a view that looks for a column called "processed" with a value of "0" and uses "limit 1".
I read the record in the result set, perform some calculations, and upload the results back to the database. I also set the processed column to "1".
If this query does not return any records, I exit the while loop.
Once the while loop is exited, the program exits.
Once it completes running, I do not want the program to exit, because the database might get more qualifying records in the next 30 minutes. I want this to be a long-running program that checks for any new records and starts the while loop again to process them.
I am not doing any multithreading or fancy stuff. I did some googling and found posts talking about semaphores.
Is this the right way to go about it? Are there any simple examples of semaphores with explanations?
First, I hope you're using a transaction. Otherwise there can be a race condition between reading the record and marking it processed.
I think your question is "How does your program know when there is more information to be processed in a SQL table?" There are several ways to do this.
The simplest is polling. Your program just checks every so often whether there's any work. If there isn't, it sleeps for a while. If checking is cheap, or you don't have to check very often, polling is fine. It's pretty robust, too: no coordination is necessary between the worker and the supplier. The worker just checks for work.
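For example, a minimal polling loop might look like this; fetch_and_process_one() is a hypothetical wrapper around the query-calculate-update cycle described in the question:

```c
/* A minimal polling sketch. fetch_and_process_one() is a hypothetical
 * stand-in for "select one unprocessed row, calculate, mark processed";
 * it should return 1 if it found a row and 0 otherwise. */
#include <unistd.h>

static int fetch_and_process_one(void) { return 0; }  /* replace with real work */

int main(void)
{
    for (;;) {
        while (fetch_and_process_one())
            ;                  /* drain everything that qualifies */
        sleep(60);             /* no work left: doze, then poll again */
    }
}
```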
Another is to make the program block on some sort of I/O, like waiting for a lock on a file. That's what semaphores are about. It goes like this:
The queue is empty.
The producer gets an exclusive lock on the semaphore file.
Your worker tries to get a lock on the semaphore file, it blocks.
The producer adds to the queue and releases its lock.
The worker immediately unblocks, checks the queue, and does its work.
...but resetting the system is a problem. The producer doesn't know when the queue is empty again without polling. And this requires that everything adding to the SQL table knows about this procedure and is located on the same machine. Even if you get it working, it's very vulnerable to deadlocks and race conditions.
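For illustration, the worker's side of that handshake might look like this with flock(2); the lock-file path and the division of labor between producer and worker are assumptions:

```c
/* Worker side of the lock-file handshake; /tmp/queue.lock is illustrative.
 * The producer is assumed to hold LOCK_EX while the queue is empty and to
 * release it after adding work. */
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>

void wait_for_work(void)
{
    int fd = open("/tmp/queue.lock", O_RDWR | O_CREAT, 0644);
    flock(fd, LOCK_EX);    /* blocks here while the producer holds the lock */
    flock(fd, LOCK_UN);
    close(fd);
    /* now check the queue and do the work */
}
```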
Another way is via signals. The producer process sends a signal to the worker process to say "I added some work". As above, this requires coordination between the things adding to the SQL table and the workers.
A better solution is to not use SQL for a work queue. It's inherently something you have to poll. Instead use a named or network pipe. Pipes automatically act as a queue. Producers write to the pipe when they add work. The worker connects to the pipe and reads from it to get more work. If there's no work, it quietly blocks waiting for work. The pipe can contain all the information necessary to do the work, or it can just contain an indication that there is work elsewhere (like an ID for a row).
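A sketch of the worker side with a named pipe; the FIFO path and the one-row-ID-per-line convention are assumptions for illustration:

```c
/* Worker reading row IDs from a FIFO. A producer can be as simple as
 * `echo 42 > /tmp/workqueue`, or a write() to the same path from code. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int main(void)
{
    mkfifo("/tmp/workqueue", 0644);              /* harmless if it already exists */
    for (;;) {
        FILE *q = fopen("/tmp/workqueue", "r");  /* blocks until a producer opens it */
        char line[64];
        while (fgets(line, sizeof line, q)) {    /* blocks quietly when idle */
            long id = strtol(line, NULL, 10);
            /* fetch row `id` and process it */
            (void)id;
        }
        fclose(q);                               /* all producers left: reopen and wait */
    }
}
```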
Finally, depending on how much processing needs to be done, you could try doing all that processing in a stored procedure triggered by a table update.
I want to create a program, using POSIX threads, having n threads running at different priorities.
There are files (say m files) which are shared among these n threads. If one thread is using a file (say, writing to it), no other thread is allowed to use it. The code should maintain a table that records which files each thread has acquired and which of its file requests are pending.
Also, we need a monitor thread to check for deadlocks; any implementation hints/ideas?
You don't need to check for deadlocks. You have to write code that makes it impossible to run into a deadlock scenario in the first place. For that reason, I'd recommend you use a try-lock approach to lock down a chain of files, unlocking them all should any of the lock acquisitions fail.
Also, if you are using C buffered I/O, I'd recommend you stick with the ftrylockfile and funlockfile APIs. Otherwise use whatever synchronization mechanism is most appropriate for your case, be that the futex API or locks implemented using atomic instructions.
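The try-lock-and-back-off idea, sketched here with pthread mutexes (one per file) rather than ftrylockfile; the array layout is an assumption:

```c
#include <pthread.h>

/* Try to take every lock in the chain; if any acquisition fails, release
 * what we already hold, so no thread ever waits while holding a partial
 * set - that is what makes deadlock impossible. */
int lock_chain(pthread_mutex_t *locks[], int n)
{
    for (int i = 0; i < n; i++) {
        if (pthread_mutex_trylock(locks[i]) != 0) {
            while (--i >= 0)
                pthread_mutex_unlock(locks[i]);
            return 0;   /* caller backs off, maybe sleeps, then retries */
        }
    }
    return 1;           /* whole chain held */
}
```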
The standard Unix way to accomplish this is spool directories.
file operations such as rename/link/unlink are atomic
have one central input spool-dir, where input files can be placed
a process/thread that wants to process a file starts by moving it to another name, or better: to another (work) directory (using the thread id or process number as the directory name is the obvious choice)
(since this move is atomic there is no possible race condition!)
after processing, the finished files can be moved to an output directory
the scoreboard function is simply a readdir(+stat), maybe even inotify, on the work directories
process starvation will always be a problem: incompletely processed files will live forever in the work dirs. Having a stamp/pid file in the work directories could help cleanup/restart.
if designed well, this structure could work even after machine failure. The workers would have to maintain their own backup/log/stamp-file mechanism.
if you haven't noticed yet: no locking will be needed.
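The atomic-claim step might look like this; the directory layout is illustrative, and the per-worker directory is assumed to exist:

```c
/* Claim an input file by renaming it into this worker's own directory.
 * spool/in and spool/work-<pid> are illustrative paths. */
#include <stdio.h>
#include <unistd.h>

int claim(const char *name)   /* 0 = we got it, -1 = another worker won */
{
    char src[512], dst[512];
    snprintf(src, sizeof src, "spool/in/%s", name);
    snprintf(dst, sizeof dst, "spool/work-%ld/%s", (long)getpid(), name);
    return rename(src, dst);  /* atomic: exactly one claimant succeeds */
}
```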
I hate C. I have to try and think of a way to do this without classes:(
OK, an 'Sfile' struct to represent each file. It has name, path, file fd/handle - everything to do with one file - plus an 'inUse' boolean.
A 'waitingThreads' array for those threads waiting for a set of files.
A 'Sfiles' struct with an array of *Sfile to hold all the files, a waitingThreads array and a lock, (mutex/futex/criticalSection).
Each thread should have an event/semaphore/something that it can wait on until its files all become available, some way to describe the set of files that it needs, and somewhere to store the fds/handles/whatever for the files.
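A rough sketch of those structs in C; every name here just mirrors the prose and is illustrative:

```c
#include <pthread.h>

typedef struct {
    char *name;              /* name/path */
    int   fd;                /* fd/handle */
    int   inUse;             /* the 'inUse' boolean */
} Sfile;

typedef struct {
    pthread_cond_t wake;     /* the per-thread event/sema to wait on */
    Sfile        **needs;    /* the set of files this thread wants */
    int            nNeeds;
} WaitingThread;

typedef struct {
    Sfile          **files;    /* array of *Sfile holding all the files */
    int              nFiles;
    WaitingThread  **waiting;  /* the 'waitingThreads' array */
    int              nWaiting;
    pthread_mutex_t  lock;     /* the mutex/futex/criticalSection */
} Sfiles;
```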
OK, off we go:
Any thread that wants files locks up the Sfiles and iterates the *Sfile array, checking if every file it needs is free to use. If they all are, it sets their 'inUse' booleans, loads itself up with the fd/handles, unlocks and runs on - it has all its files. If any file it needs is in use, it pushes itself onto the waitingThreads array and waits on its event/sema.
When a thread is done with its files, it locks the Sfiles and clears the 'inUse' boolean for the files it was using. It then iterates the waitingThreads array - if the array is empty, it just unlocks and exits. If the array is not empty, it tries to find threads that can now run with the files that are now free. If it finds none, it just unlocks and returns. If it does find one, it loads that thread up with the fd/handles, sets the inUse booleans and signals its event/sema - that thread will then run with its desired set of files. The thread continues to iterate the waitingThreads array to the end, looking for more threads that it can load up and signal with the remaining free files. When it reaches the end of the array, it returns.
That, or something like it, will ensure that the threads always run with their complete set of files, prevent any deadlocks due to threads locking partial sets of files and does not require any polling.
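Using those structs, the acquire/release pair might look roughly like this. It's a simplified variant of the above: each woken thread re-checks its own file set rather than being loaded up by the releaser, which is easier to get right with condition variables:

```c
static int all_free(const WaitingThread *t)
{
    for (int i = 0; i < t->nNeeds; i++)
        if (t->needs[i]->inUse) return 0;
    return 1;
}

void acquire_files(Sfiles *s, WaitingThread *me)
{
    pthread_mutex_lock(&s->lock);
    if (!all_free(me)) {
        s->waiting[s->nWaiting++] = me;               /* park on the array */
        while (!all_free(me))
            pthread_cond_wait(&me->wake, &s->lock);   /* wait to be signalled */
        for (int i = 0; i < s->nWaiting; i++)         /* un-park ourselves */
            if (s->waiting[i] == me) {
                s->waiting[i] = s->waiting[--s->nWaiting];
                break;
            }
    }
    for (int i = 0; i < me->nNeeds; i++)
        me->needs[i]->inUse = 1;                      /* claim the whole set */
    pthread_mutex_unlock(&s->lock);
}

void release_files(Sfiles *s, WaitingThread *me)
{
    pthread_mutex_lock(&s->lock);
    for (int i = 0; i < me->nNeeds; i++)
        me->needs[i]->inUse = 0;                      /* free our set */
    for (int i = 0; i < s->nWaiting; i++)
        if (all_free(s->waiting[i]))                  /* who can run now? */
            pthread_cond_signal(&s->waiting[i]->wake);
    pthread_mutex_unlock(&s->lock);
}
```

If two signalled waiters want overlapping files, whichever re-acquires the mutex first wins; the loser finds its set no longer free and simply parks again.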
If you really, really need that table thingy, you can build it inside the lock every time a thread enters or leaves the lock. I would suggest mallocing a suitable struct, loading it up with all the details of the free files and waiting threads, and queueing it off to another thread. You could just have some 'monitoring' thread that periodically locks up the Sfiles, dumps all the info and unlocks, but that keeps the Sfiles locked for the entire 'dump' time - you may not want that overhead - it's up to you.
Edit:
OH - forgot the priority thingy. The OS thread priority is probably useless for your purpose. Have each thread expose a priority enum/int and keep the 'waitingThreads' array sorted by that priority, giving the higher-priority threads the first bite at whatever files are released.
Is that good enough for your homework assignment?
I do understand what an APC is, how it works, and how Windows uses it, but I don't understand when I (as a programmer) should use QueueUserAPC instead of, say, a fiber, or thread pool thread.
When should I choose to use QueueUserAPC, and why?
QueueUserAPC is a neat tool that can often be a shortcut for some tasks that are otherwise handled with synchronization objects. It allows you to tell a particular thread to do something whenever it is convenient for that thread (i.e. when it finishes its current work and starts waiting on something).
Let's say you have a main thread and a worker thread. The worker thread opens a socket to a file server and starts downloading a 10GB file by calling recv() in a loop. The main thread wants to have the worker thread do something else in its downtime while it is waiting for net packets; it can queue a function to be run on the worker while it would otherwise be waiting and doing nothing.
You have to be careful with APCs: in the scenario I mentioned, the queued function should not make another blocking WinSock call (which would result in undefined behavior). You really have to watch for good uses of this functionality, because you can do the same thing in other ways - for example, by having the other thread check an event every time it is about to go to sleep, rather than giving it a function to run while it is waiting. Obviously the APC would be simpler in this scenario.
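A minimal sketch of the mechanics; the key point is that the target thread must be in an alertable wait for the queued function to run:

```c
#include <windows.h>
#include <stdio.h>

VOID CALLBACK SideTask(ULONG_PTR param)     /* the "little task" */
{
    printf("APC ran on the worker, param=%lu\n", (unsigned long)param);
}

DWORD WINAPI Worker(LPVOID arg)
{
    (void)arg;
    /* Alertable wait: queued APCs run here, then SleepEx returns
     * WAIT_IO_COMPLETION and the thread can go back to waiting. */
    SleepEx(INFINITE, TRUE);
    return 0;
}

int main(void)
{
    HANDLE worker = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    QueueUserAPC(SideTask, worker, 42);     /* "solve this while you're waiting" */
    WaitForSingleObject(worker, INFINITE);
    CloseHandle(worker);
    return 0;
}
```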
It is like when you have a call desk employee sitting and waiting for phone calls, and you give that person little tasks to do during their downtime. "Here, solve this Rubik's cube while you're waiting." Although, when a phone call comes in, the person would not put down the Rubik's cube to answer the phone (the APC has to return before the thread can go back to waiting).
QueueUserAPC is also useful if there is a single thread (Thread A) that is in charge of some data structure, and you want to perform some operation on the data structure from another thread (Thread B), but you don't want to have the synchronization overhead / complexity of trying to share that data between two threads. By having Thread B queue the operation to run on Thread A, which solely maintains that structure, you are executing any arbitrary function you want on that data without having to worry about synchronization.
It is just another tool, like a thread pool. However, with a thread pool you cannot send a task to a particular thread; you have no control over where the work is done. Queuing up a task may end up creating a whole new thread, and two queued tasks may get done simultaneously on two different threads. With QueueUserAPC, you can be guaranteed that the tasks will get done in order and on the thread you designate.
I have a daemon that accepts socket connections and reads or writes a dynamic set of files, depending on the nature of the connection. Because my daemon is multithreaded, the possibility exists that the same file may be written to by more than one thread. Because my list of files is dynamic and not fixed, I'm not sure how to keep one thread from bumping into the other. For performance reasons, I want threads to be writing to different files at the same time, just not the same file at the same time.
Other questions have suggested using mutexes, but I'm not entirely clear how a mutex would help in this scenario - the list of files being dynamic and only known to the thread.
Would it be appropriate to use file locking in this case? If so, how would one implement file locking in a thread-safe way?
flock will work OK. It doesn't lock file descriptors, it locks the actual file.
A file that has been exclusively flock'ed can't be exclusively locked again by another process or thread. That would defeat the entire purpose of locks.
One note is that these locks are advisory. A process that doesn't use flock can happily overwrite the file, even if another process has exclusive-flock'ed it.
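A small sketch of the pattern; the path is illustrative. Note that for flock to arbitrate between threads of the same process, each thread has to open() the file itself, so that each has its own open file description:

```c
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

void append_safely(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    flock(fd, LOCK_EX);          /* blocks until no one else holds it */
    write(fd, msg, strlen(msg));
    close(fd);                   /* closing the fd releases the lock */
}
```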
I would use an event broker pattern. Each socketing thread fires an event (with the file(s) as arguments), and the event is handled by a central file broker with a shared collection of files currently being written.
If the file cannot be written to, decide what you want to do... otherwise report a success.
Multiple listeners, one central file-lock collection, multiple writers.
I can't say this would be the "optimum" solution, but I'd propose something like this:
Maintain a linked list of a struct that contains two things:
The filename
A condition wait variable associated with the file.
Flow A. When the daemon receives a request, mutex-lock the list and check whether the filename is in the list or not. If it is not, add a new entry to the linked list with a new condition wait variable for other threads to use. Release the mutex lock. Perform the file operation. Once complete, lock the linked list, remove the struct entry for that file, and then signal the other threads via the wait object.
Flow B. If a request comes in for the same file, it'll lock the list and look for the filename in the list. If it is there, grab the wait variable and wait on it. When the thread is signaled, lock the list again and check whether the file is in the list (it's possible another thread picked up the lock on the filename before you). If not, follow Flow A. If so, grab the wait variable in the new struct and wait again until signaled, then follow the above steps again.
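A condensed sketch of the two flows with pthreads. This variant simplifies the per-file wait variables in the prose down to a single condition variable for the whole list, which sidesteps the lifetime problem of destroying a wait variable that late wakers may still be touching:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct Busy { char *name; struct Busy *next; } Busy;

static Busy            *busy_list;    /* filenames currently being written */
static pthread_mutex_t  list_lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   list_changed = PTHREAD_COND_INITIALIZER;

static Busy *find(const char *name)
{
    for (Busy *b = busy_list; b; b = b->next)
        if (strcmp(b->name, name) == 0) return b;
    return NULL;
}

void file_begin(const char *name)
{
    pthread_mutex_lock(&list_lock);
    while (find(name))                               /* Flow B: wait, re-check */
        pthread_cond_wait(&list_changed, &list_lock);
    Busy *b = malloc(sizeof *b);                     /* Flow A: claim the name */
    b->name = strdup(name);
    b->next = busy_list;
    busy_list = b;
    pthread_mutex_unlock(&list_lock);
    /* ...perform the file operation, then call file_end()... */
}

void file_end(const char *name)
{
    pthread_mutex_lock(&list_lock);
    Busy **pp = &busy_list;
    while (strcmp((*pp)->name, name) != 0)
        pp = &(*pp)->next;
    Busy *b = *pp;
    *pp = b->next;                                   /* unlink the entry */
    free(b->name);
    free(b);
    pthread_cond_broadcast(&list_changed);           /* wake waiters to re-check */
    pthread_mutex_unlock(&list_lock);
}
```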
I have an application that I'm working on that requires a couple of secondary threads, and each will be responsible for a number of file handles (at least 1, upwards of 10). The file handles are not shared amongst the threads, so I don't have to worry about one secondary thread blocking the other when selecting to see what is ready to read/write. What I want to be sure of is that neither of the secondary threads will cause the main thread to stop executing while the select/pselect call is executing.
I would imagine that this is not a problem - one would imagine that such things would be done in, say, a web server - but I couldn't find anything that specifically said "yes, you can do this" when I Googled. Am I correct in my assumption that this will not cause any problems?
For clarification, what I have looks something like:
Main thread of execution ( select() loop handling incoming command messages and outgoing responses )
Secondary thread #1 ( select() loop providing a service )
Secondary thread #2 ( select() loop providing another service )
As I previously mentioned, none of the file handles are shared amongst the threads - they are created, used, and destroyed within an individual thread, with the other threads ignorant of their existence.
No, you don't have to worry about them blocking the main thread. I have used select() in multiple threads in various projects. As long as they have distinct fd_sets, you're fine, and each one can be used as an independent event loop.
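Something like this per thread, each with its own private descriptors (the single service fd here is illustrative):

```c
#include <sys/select.h>

void *service_loop(void *arg)
{
    int fd = *(int *)arg;        /* this thread's private descriptor */
    for (;;) {
        fd_set rd;
        FD_ZERO(&rd);
        FD_SET(fd, &rd);
        /* only this thread sleeps here; the other select() loops
         * in the process are completely unaffected */
        if (select(fd + 1, &rd, NULL, NULL, NULL) > 0 && FD_ISSET(fd, &rd)) {
            /* accept/read and handle the request */
        }
    }
    return NULL;
}
```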
Isn't select supposed to block the whole process?
Have you tried to set the nonblocking mode on the socket?
Also, see select_tut manpage for some help.
Here's a relevant section from the select_tut manpage:
So what is the point of select()? Can't I just read and write to my descriptors whenever I want? The point of select() is that it watches multiple descriptors at the same time and properly puts the process to sleep if there is no activity.
I'm trying to understand how asynchronous file operations are emulated using threads. I've found next to nothing to read about the subject.
Is it possible that:
a process uses a thread to open a regular file (HDD).
the parent gets the file descriptor from the thread; now it may terminate the thread.
the parent uses the file descriptor with a new thread, reading X bytes from the file.
the parent gets the file descriptor with the seek-position of the current file state.
the parent may repeat these operations, without the need to open, or seek, every time it wishes to "continue" reading a new chunk of the file?
This is just a wild guess of mine; I'd appreciate it if anybody could shed more light on how it's emulated efficiently.
UPDATE:
By efficient I actually mean that I don't want the thread to "wait" from the moment the file is opened. Think of a non-blocking HTTP daemon which serves a client a huge file: you want to use the thread to read chunks of the file without blocking the daemon, but you don't want to keep the thread busy "waiting" for the actual transfer to take place - you want to use the thread for other blocking operations of other clients.
To understand asynchronous I/O better, it may be helpful to think in terms of overlapping operation. That is, the number of pending operations (operations that have been started but not yet completed) can simultaneously go above one.
A diagram that explains asynchronous I/O might look like this: http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx
If you are using the asynchronous I/O capabilities provided by the underlying Operating System, then it is possible to asynchronously read from multiple files without spawning an equal number of threads.
If your underlying Operating System does not provide asynchronous I/O, or if you decide not to use it - in other words, you wish to emulate asynchronous operation using only blocking I/O (the regular Read/Write provided by the Operating System) - then it is necessary to spawn as many threads as there are simultaneous I/O operations. This is because when a thread makes a blocking I/O call, it cannot continue execution until the operation finishes. In order to start another blocking I/O operation, that operation has to be issued from another thread that is not already occupied.
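In other words, two simultaneous blocking reads need two threads. A tiny sketch (the file names are illustrative):

```c
#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>

static void *blocking_read(void *path)
{
    char buf[4096];
    int fd = open((const char *)path, O_RDONLY);
    while (read(fd, buf, sizeof buf) > 0)
        ;                        /* this thread is occupied until EOF */
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, blocking_read, "a.dat");  /* two pending ops... */
    pthread_create(&t2, NULL, blocking_read, "b.dat");  /* ...two threads */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```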
When you open/create a file, fire up a thread. Now store that thread id/ptr as your file handle.
Basically the thread will do nothing except sit in a loop waiting for an "event". A semaphore would be good here. When you want to do a read, you add the read command to a queue (remember to put a critical section around the queue add), return a unique id, and then increment the semaphore. If the thread is asleep it will now wake up, grab the first message off the queue, and process it. When it has completed, you remove the command from the queue.
To poll whether a file read has completed you can simply check whether it's in the command queue. If it's not there, the command has completed.
Furthermore, if you want to allow synchronous reads as well, you can wait, after sending the message through, for an "event" to be triggered on completion. You then check whether the unique id is in the queue; if it isn't, you return control. If it still is, you go back to a wait state until the relevant unique id has been processed.
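A bare-bones sketch of that loop; the fixed-size ring, the names, and the elided completion bookkeeping are all simplifications:

```c
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>

typedef struct { long id; off_t offset; size_t len; } Cmd;

static Cmd             queue[64];  /* ring buffer of pending commands */
static int             head, tail;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static sem_t           pending;    /* sem_init(&pending, 0, 0) at startup */

long submit_read(off_t offset, size_t len)   /* called by the "parent" */
{
    static long next_id;
    pthread_mutex_lock(&qlock);              /* critical section around the add */
    long id = ++next_id;
    queue[tail++ % 64] = (Cmd){ .id = id, .offset = offset, .len = len };
    pthread_mutex_unlock(&qlock);
    sem_post(&pending);                      /* wake the file thread */
    return id;                               /* caller can poll on this id */
}

void *file_thread(void *arg)                 /* one of these per open file */
{
    (void)arg;
    for (;;) {
        sem_wait(&pending);                  /* sleep until a command arrives */
        pthread_mutex_lock(&qlock);
        Cmd c = queue[head++ % 64];
        pthread_mutex_unlock(&qlock);
        /* do the blocking pread()/pwrite() for c here, then record the
         * completion and signal any synchronous waiter on c.id */
        (void)c;
    }
    return NULL;
}
```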