Reading different parts of the same txt file through multiple processes - C

I am trying to find a way to read from a txt file while in different processes. For example, in process A I read the first 10 records of the file (let's say there are 100 records). In process B I want to read the next ten records. The problem is that only in process A do I get the right records; in process B I only get 0. Can someone help? Thanks in advance!

If you dig deeper, especially in a Linux environment, you will find that threads are lighter than processes when it comes to achieving something that needs multiple strands of execution.
I would do this in the following manner:
1. Create a process that has 10 threads. Each thread will read 10 records from the TXT file (use pthread_create() instead of fork()). Also create a mutex that each thread will lock while it reads the file.
2. Create Th1 (thread_1) using the above call, lock the mutex, open the file, read it using the read() call (the buffer will hold all 100 records), filter out the 10 records you need at that point in time, unlock the mutex when done, and reap the thread using pthread_join().
3. Repeat step 2 ten times in total so that you have all the records.

This can be done using inter-process communication (IPC), which mainly involves shared memory, semaphores, or message queues. For basic background on this you can read my blog.
Another way of doing it is by passing file descriptors between the processes. One process opens the file, reads 10 records, then passes the descriptor to the second process. The second process does the same thing and sends it back to the first one, and the whole cycle repeats until the end of the file is reached.
Passing file descriptors is mainly done using UNIX domain sockets, and you can find the code for this in this answer.
Hope this helps.

Related

Synchronization between two processes using semaphores in c

I have a task in which I have to write a program in C language that manages access and reading/writing to a file.
When the program starts it should create two processes (using fork()).
- The first process will be responsible for the initial write to the file (the file is a text file with 2000 random characters from a to z).
- The second process will be responsible for reading from the file, after the first process has finished writing.
My question is :
How can I synchronize the execution order using semaphores (the sem*() system calls) to ensure that the first process always starts first and the second process starts only after the first has finished writing?
I can recommend using binary semaphores:
https://www.freertos.org/xSemaphoreCreateBinary.html
https://controllerstech.com/how-to-use-binary-semaphore-in-stm32/
If you are working in an embedded context I would recommend using task notifications, since they are less RAM-hungry and therefore may be a better fit on a less powerful system.
https://www.freertos.org/RTOS-task-notifications.html

One process writing while the other one reads the shared memory

I have 2 programs (processes). One process writes to shared memory while the other reads it. So, my approach was like this:
Initially, the value of the shared memory is 0, so process 1 writes only when the value is 0. After process 1 has written some other value to the shm, it "waits" until the value becomes 0 again. Process 2 reads the shm and writes 0 back to it. By "wait", I mean a while(1) loop.
My question is: is this approach fine, or can I do better with some other approach in terms of CPU and memory usage?
The problem you mention is known as the process synchronization problem, and the given logic is the busy-waiting approach, which is a very primitive solution.
Read about the producer-consumer problem, which is similar to the given problem.
There are better solutions to this than busy waiting, such as:
spinlocks, semaphores, etc.
You can get basic knowledge of all of this from here
Hope it will help!!
I think this is fine, but a problem occurs when both processes write to the shared memory block.
At that time you could use a semaphore to synchronize the two processes, allowing one at a time to write to the shared resource/memory block.
You can read about semaphores [here](https://en.wikipedia.org/wiki/Semaphore_(programming)).

Alternative to sleep, semaphores

I have a simple c program ( on linux). The steps in the program are as follows:
within a while loop, it calls a query that returns exactly one record. It is essentially a view that looks for a column called "processed" with a value of "0" and uses "limit 1".
I read the records in the result set and perform some calculations and upload the results back to the database. I also set the processed column to "1".
If this query does not return any records, I exit the while loop.
Once the while loop is exited, program exits.
Once it completes running, I do not want the program to exit, because the database might get more qualifying records in the next 30 minutes. I want this to be a long-running program that checks for any new records and starts the while loop again to process them.
I am not doing any multithreading or fancy stuff. I did some googling and found posts talking about semaphores.
Is this the right way to go about? Are there any simple examples of semaphores with explanation?
First, I hope you're using a transaction; otherwise there can be a race condition between steps 1 and 2.
I think your question is "How does your program know when there is more information to be processed in a SQL table?" There's several ways to do this.
The simplest is polling. Your program just checks every so often if there's any work. If there isn't, it sleeps for a while. If checking is cheap, or you don't have to check very often, polling is fine. It's pretty robust, there's no coordination necessary between the worker and the supplier. The worker just checks for work.
Another is to make the program block on some sort of I/O like waiting for a lock on a file. That's what semaphores are about. It goes like this.
1. The queue is empty.
2. The producer gets an exclusive lock on the semaphore file.
3. Your worker tries to get a lock on the semaphore file; it blocks.
4. The producer adds to the queue and releases its lock.
5. The worker immediately unblocks,
6. checks the queue,
7. and does its work.
...but resetting the system is a problem. The producer doesn't know when the queue is empty again without polling. And this requires everything adding to the SQL table knows about this procedure and is located on the same machine. Even if you get it working, it's very vulnerable to deadlocks and race conditions.
Another way is via signals. The producer process sends a signal to the worker process to say "I added some work". As above, this requires coordination between the things adding to the SQL table and the workers.
A better solution is to not use SQL for a work queue; it's inherently something you have to poll. Instead use a named or network pipe. Pipes automatically act as a queue: producers write to the pipe when they add work, and the worker connects to the pipe and reads from it to get more work. If there's no work, it quietly blocks waiting for work. The pipe can contain all the information necessary to do the work, or it can just contain an indication that there is work elsewhere (like an ID for a row).
Finally, depending on how much processing needs to be done, you could try doing all that processing in a stored procedure triggered by a table update.

Concurrent programming - Is it necessary to manually lock files that multiple processes will be accessing?

I know for pthreads, if they're modifying the same variables or files, you can use pthread_mutex_lock to prevent simultaneous writes.
If I'm using fork() to have multiple processes, which are editing the same file, how can I make sure they're not writing simultaneously to that file?
Ideally I'd like to lock the file for one writer at a time, and each process would only need to write once (no loops necessary). Do I need to do this manually or will UNIX do it for me?
Short answer: you have to do it manually. There are certain guarantees on the atomicity of each write, but you'll still need to synchronize the processes to avoid interleaving writes. There are a lot of techniques for synchronizing processes. Since all of your writers are descendants of a common process, probably the easiest thing to do is to pass a token on a common pipe.

Before you fork, create a pipe and write a single byte into it. Any time a process wants to write to the file, it will do a blocking read on the pipe. If it gets a byte, then it proceeds to write to the file. When it is done, it writes a byte back into the pipe. If any other process wants to access the file, it will block on the pipe read until the other process is done writing. This is often simpler than using a semaphore, which is another excellent technique.

Execute 2 different functions concurrently, is pthread my answer?

I have a fixed size array (example: struct bucket[DATASIZE]) where at the very beginning I load information from a file. Since I am concerned about scalability and execution time, no dynamic array was used.
Each time I process half of the array I am free to replace those spots with more data from the file. I don't have a clear idea on how I would do that but I thought about pthreads to start 2 parallel tasks: one would be the actual data processing and the other one would make sure to fill out the array.
However, all the examples that I've seen on pthreads show that they are all working on the same task but concurrently. Is there a way to have them do separate things? Any ideas, thoughts?
You can definitely have threads doing different tasks. The pattern you're after is very common - it's called a Producer-Consumer arrangement.
What you are trying to do is very similar to the standard concurrent-programming problem called producer-consumer (look it up; you will surely find a pthreads example). That program has one fixed-size buffer which is processed by the consumer and filled by the producer.
Yes, that's an excellent use for pthreads: it's one of the very things that pthreads was made for.
You might think about fork()ing twice: once to create the process that does the data manipulation, and a second fork() to create the process that fills in the blanks. Use a mutex to let each process protect the array from the other process and it will work fine.
Why would your array need a mutex? How would you set it up? When would each process need to acquire the mutex and when would it need to release the mutex?
-- pete
