The problem: I have a few text files (10 of them), each with a number on every line. I need to split them across some threads I create using the pthread library. These worker threads need to find the largest prime number sent to them (and, across all of them, the largest prime from all of the text files).
My current thoughts on a solution: I am thinking of having two arrays: one holding all of the text files, and one holding a binary file from which I can read, say, 1000 lines at a time. I would send each worker a struct containing an id, a file pointer, and a file position, and let it crank through its chunk.
A little bit of what I am talking about:
pthread_create(&threads[index], NULL, workerThread, (void *)threadFields[index]); // pass the struct to each worker
Struct:
typedef struct threadFields{
    int *id, *position;   // worker id and starting position
    FILE *Fin;            // input file stream
}tField;
If anyone has any insight or a better solution, it would be greatly appreciated.
EDIT:
Okay, so I found a solution to my problem, and I believe it is similar to what SaveTheRbtz suggested. Here is what I implemented:
I took the files and merged them into one binary file, keeping track of my position in the loop (I had to account for how many bytes each entry was; this was hard-coded):
struct threadFields *info = threadStruct;
int index;
int id = info->id;
unsigned int currentNum = 0;
int Seek = info->StartPos;
unsigned int localLargestPrime = 0;
char *buffer = malloc(50);
int isPrime = 0;
while(Seek < info->EndPos){
    for(index = 0; index < 1000 && Seek < info->EndPos; index++){ //Loop 1000 times, but stop at this thread's end position
        fseek(fileOut, Seek*sizeof(char)*20, SEEK_SET); //each entry is a fixed 20 bytes
        fgets(buffer, 20, fileOut);
        Seek++;
        currentNum = atoi(buffer);
        if(currentNum > localLargestPrime && currentNum > 0){
            isPrime = ChkPrim(currentNum);
            if(isPrime == 1)
                localLargestPrime = currentNum;
        }
    }
}
free(buffer);
You can do ten threads, each of which processes a file specified as an argument. Each thread reads its own file, checking whether each value is larger than the largest prime it has recorded so far, and if so, checking whether the new number is prime. Then, when it's finished, it can return its prime to the coordinator thread. The coordinator thread sits back and waits for the threads to finish, collecting the largest prime from each thread and keeping only the largest. You can probably use 0 as a sentinel value to indicate 'no primes found (yet)'.
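Sketched minimally (my illustration, not tested against your data; is_prime is a naive stand-in for your ChkPrim, and error checking is elided):

#include <pthread.h>
#include <stdio.h>

static int is_prime(unsigned long n) {      // naive trial division
    if (n < 2) return 0;
    for (unsigned long d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

static void *worker(void *arg) {
    const char *path = arg;                 // this thread's file
    unsigned long largest = 0, n;           // 0 = sentinel: no prime found yet
    FILE *f = fopen(path, "r");
    if (f) {
        while (fscanf(f, "%lu", &n) == 1)
            if (n > largest && is_prime(n)) // compare first, test primality second
                largest = n;
        fclose(f);
    }
    return (void *)largest;                 // handed back through pthread_join
}

int main(int argc, char *argv[]) {
    pthread_t tid[argc];                    // one thread per file argument
    for (int i = 1; i < argc; i++)
        pthread_create(&tid[i], NULL, worker, argv[i]);
    unsigned long overall = 0;
    for (int i = 1; i < argc; i++) {        // coordinator: join and keep the max
        void *res;
        pthread_join(tid[i], &res);
        if ((unsigned long)res > overall)
            overall = (unsigned long)res;
    }
    printf("largest prime: %lu\n", overall);
    return 0;
}

Compile with -pthread. Returning the result through the thread's exit value keeps the workers from sharing any state.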
Let's say I wanted 11 threads instead of 10; how would I split the workload then?
I'd have the 11th thread do pthread_exit() immediately. If you want to make coordination problems for yourself, you can, but why make life harder than you have to?
If you absolutely must have 11 threads process 10 files and divvy up the work, then I suppose I would have a set of 10 file streams initially in a queue. The threads would wait on a condition 'queue not empty' to get a file stream (mutexes and condition variables and all that). When a thread acquires a file stream, it would read one number from the file, push the stream back onto the queue (signalling 'queue not empty'), and then process the number. On EOF, a thread would close the file and not push it back onto the queue (so the threads have to detect 'no file streams left with unread data'). This means that each thread would read about one eleventh of the data, depending on how long the prime calculation takes for the numbers it actually reads. That's much, much trickier to code than a simple one-thread-per-file solution, but it scales (more or less) to an arbitrary number of threads and files. In particular, it could be used to have 7 threads process 10 files, as well as 17 threads process 10 files.
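A rough sketch of that stream-queue discipline (again my illustration: is_prime is a naive placeholder, and error handling is elided):

#include <pthread.h>
#include <stdio.h>

#define MAXFILES 10

static FILE *queue[MAXFILES];
static int qhead, qtail, qcount;   // circular queue of idle streams
static int open_streams;           // streams not yet closed at EOF
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static int is_prime(unsigned long n) {
    if (n < 2) return 0;
    for (unsigned long d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

static void push(FILE *f) {        // caller holds 'lock'
    queue[qtail] = f;
    qtail = (qtail + 1) % MAXFILES;
    qcount++;
    pthread_cond_signal(&not_empty);
}

static FILE *pop(void) {           // caller holds 'lock'; NULL = no data left anywhere
    while (qcount == 0 && open_streams > 0)
        pthread_cond_wait(&not_empty, &lock);
    if (qcount == 0)
        return NULL;
    FILE *f = queue[qhead];
    qhead = (qhead + 1) % MAXFILES;
    qcount--;
    return f;
}

static void *worker(void *arg) {
    unsigned long n, largest = 0;
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        FILE *f = pop();
        pthread_mutex_unlock(&lock);
        if (!f)
            break;
        int got = fscanf(f, "%lu", &n);  // we own this stream for one read
        pthread_mutex_lock(&lock);
        if (got == 1) {
            push(f);                     // hand the stream back for other threads
        } else {
            fclose(f);                   // EOF: retire the stream
            open_streams--;
            pthread_cond_broadcast(&not_empty);  // wake waiters so they recheck termination
        }
        pthread_mutex_unlock(&lock);
        if (got == 1 && n > largest && is_prime(n))
            largest = n;
    }
    return (void *)largest;
}

int main(int argc, char *argv[]) {
    for (int i = 1; i < argc && i <= MAXFILES; i++) {
        FILE *f = fopen(argv[i], "r");
        if (f) { push(f); open_streams++; }
    }
    enum { NTHREADS = 11 };            // more threads than files is fine here
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    unsigned long best = 0;
    for (int i = 0; i < NTHREADS; i++) {
        void *res;
        pthread_join(tid[i], &res);
        if ((unsigned long)res > best)
            best = (unsigned long)res;
    }
    printf("largest prime: %lu\n", best);
    return 0;
}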
Looks like a job for a message queue:
- A set of "supplier" threads which split the data into chunks and put them on the queue. In your case a chunk can be represented by a file name or an (fd, offset, size) tuple. For simplicity there can be one such supplier.
- A number of "worker" threads that pull data from the input queue, process it, and put the results on another queue. For performance reasons there are usually many workers; for example, if your task is CPU-intensive then sysconf(_SC_NPROCESSORS_ONLN) is a good choice (see the snippet after this list).
- One "aggregator" thread that "reduces" the result queue to a single value. In your case it's a simple max() function.
This highly scalable solution lets you combine many different kinds of processing stages into an easily understood pipeline.
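For instance, sizing the worker pool that way (the only assumption is a POSIX system):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long nworkers = sysconf(_SC_NPROCESSORS_ONLN);
    if (nworkers < 1)
        nworkers = 1;   // fall back if the value is unavailable
    printf("spawning %ld worker threads\n", nworkers);
    return 0;
}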
Related
I am trying to portion out 1 million lines of float numbers to 16 different processes. For example,
process 0 needs to read between lines 1-62500 and
process 1 needs to read between lines 62501-125000 etc.
I have tried the following code, but every process reads the lines between 1-62500. How can I change the line interval for each process?
MPI_Init(NULL, NULL);
n=1000000/numberOfProcesses;
FILE *myFile;
myFile = fopen("input.txt","r");
i=0;
k = n+1;
while(k--){
fscanf(myFile,"%f",&input[i]);
i++;
}
fclose(myFile);
MPI_Finalize();
Assuming numberOfProcesses = 4 and numberOfLines = 16:
//n=1000000/numberOfProcesses;
n = numberOfLines/numberOfProcesses;   // so the new n will be 4
FILE *myFile;
myFile = fopen("input.txt","r");
i = 0;
k = n+1;   // (5)
From your program, all processes read the file from the same location or offset. What you need to do is make each process read from its own specific line or offset. For example, rank 0 should start at line 0, rank 1 at line n, rank 2 at line 2*n, and so on. Pass this as a parameter to fseek.
n = numberOfLines/numberOfProcesses;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
file_start = n*rank;
fseek(myFile, file_start, SEEK_SET);
fseek will go to the given offset (file_start) in the file; file_start will be 0 for rank 0, 4 for rank 1, 8 for rank 2, and so on. Keep in mind that fseek counts bytes, not lines, so this only works directly if every line occupies a known, fixed number of bytes.
The while loop should also be modified accordingly, so that each process reads only its own n lines.
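Putting it together as a sketch; LINE_WIDTH and numberOfLines are made-up values here, and the fixed-width-line assumption from above is what makes the fseek arithmetic valid:

#include <mpi.h>
#include <stdio.h>

#define LINE_WIDTH 16   // hypothetical fixed bytes per line, newline included

int main(int argc, char *argv[])
{
    int rank, nprocs, numberOfLines = 16;
    static float input[1000000];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int n = numberOfLines / nprocs;                       // lines per process
    FILE *myFile = fopen("input.txt", "r");
    fseek(myFile, (long)rank * n * LINE_WIDTH, SEEK_SET); // jump to this rank's slice
    for (int i = 0; i < n; i++)                           // read n lines, not n+1
        fscanf(myFile, "%f", &input[i]);
    fclose(myFile);
    MPI_Finalize();
    return 0;
}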
As @Gilles pointed out in the comments, we are explicitly assuming the number of lines in the file here. This can lead to many issues.
To get scalability and parallel performance benefits, it is better to use MPI IO, which offers great features for parallel file operations; MPI IO was developed for exactly this kind of use case.
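For example, a minimal MPI-IO sketch under the same fixed-width-line assumption, where each rank reads its own byte range in parallel with MPI_File_read_at:

#include <mpi.h>

#define LINE_WIDTH 16   // hypothetical fixed bytes per line

int main(int argc, char *argv[])
{
    int rank, nprocs, numberOfLines = 16;
    char buf[1024];                        // big enough for this toy slice
    MPI_File fh;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int n = numberOfLines / nprocs;        // lines per rank
    MPI_File_open(MPI_COMM_WORLD, "input.txt", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * n * LINE_WIDTH;
    MPI_File_read_at(fh, off, buf, n * LINE_WIDTH, MPI_CHAR,
                     MPI_STATUS_IGNORE);
    buf[n * LINE_WIDTH] = '\0';            // then parse the floats out of buf
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}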
In a coding competition specified at this link there is a task where you need to read a lot of data from stdin, do some calculations, and present a whole lot of data on stdout.
In my benchmarking it is almost only I/O that takes time, although I have tried optimizing it as much as possible.
What you have as input is a string (1 <= len <= 100'000) and q rows of pairs of ints, where q is also 1 <= q <= 100'000.
I benchmarked my code on a 100 times larger dataset (len = 10M, q = 10M) and this is the result:
Activity          time    accumulated
Read text:        0.004   0.004
Read numbers:     0.146   0.150
Parse numbers:    0.200   0.350
Calc answers:     0.001   0.351
Format output:    0.037   0.388
Print output:     0.143   0.531
By implementing my own number formatting and parsing inline, I managed to get the time down to a third of the time taken when using printf and scanf.
However, when I uploaded my solution to the competition's webpage, it took 1.88 seconds (I think that is the total time over 22 datasets). When I look at the high score there are several implementations (in C++) that finished in 0.05 seconds, nearly 40 times faster than mine! How is that possible?
I guess I could speed it up a bit by using two threads; then I could start calculating and writing to stdout while still reading from stdin. In a theoretical best case, though, that saves only min(0.150, 0.143) seconds on my large dataset. I'm still nowhere close to the high score.
In the image below you can see the statistics of the consumed time.
The program gets compiled by the website with these options:
gcc -g -O2 -std=gnu99 -static my_file.c -lm
and timed like this:
time ./a.out < sample.in > sample.out
My code looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN (100000 + 1)
#define ROW_LEN (6 + 1)
#define DOUBLE_ROW_LEN (2*ROW_LEN)

int main(int argc, char *argv[])
{
    int ret = 1;

    // Set custom buffers for stdin and out
    char stdout_buf[16384];
    setvbuf(stdout, stdout_buf, _IOFBF, 16384);
    char stdin_buf[16384];
    setvbuf(stdin, stdin_buf, _IOFBF, 16384);

    // Read stdin to buffer
    char *buf = malloc(MAX_LEN);
    if (!buf) {
        printf("Failed to allocate buffer");
        return 1;
    }
    if (!fgets(buf, MAX_LEN, stdin))
        goto EXIT_A;

    // Get the num tests
    int m;
    scanf("%d\n", &m);

    char *num_buf = malloc(DOUBLE_ROW_LEN);
    if (!num_buf) {
        printf("Failed to allocate num_buffer");
        goto EXIT_A;
    }

    int *nn;
    int *start = calloc(m, sizeof(int));
    int *stop = calloc(m, sizeof(int));
    int *staptr = start;
    int *stpptr = stop;
    char *cptr;
    for (int i = 0; i < m; i++) {
        fgets(num_buf, DOUBLE_ROW_LEN, stdin);
        nn = staptr++;
        cptr = num_buf - 1;
        while (*(++cptr) > '\n') {          // digits and ' ' all compare greater than '\n'
            if (*cptr == ' ')
                nn = stpptr++;              // switch from the start value to the stop value
            else
                *nn = *nn*10 + *cptr - '0'; // accumulate decimal digits
        }
    }

    // Count for each test
    char *buf_end = strchr(buf, '\0');
    int len, shift;
    char outbuf[ROW_LEN];
    char *ptr_l, *ptr_r, *out;
    for (int i = 0; i < m; i++) {
        ptr_l = buf + start[i];
        ptr_r = buf + stop[i];
        while (ptr_r < buf_end && *ptr_l == *ptr_r) {
            ++ptr_l;
            ++ptr_r;
        }

        // Print length of same sequence
        shift = len = (int)(ptr_l - (buf + start[i]));
        out = outbuf;
        do {
            out++;
            shift /= 10;
        } while (shift);
        *out = '\0';
        do {
            *(--out) = "0123456789"[len%10];
            len /= 10;
        } while (len);
        puts(outbuf);
    }

    ret = 0;
    free(num_buf);  // was leaked before
    free(start);
    free(stop);
EXIT_A:
    free(buf);
    return ret;
}
Thanks to your question, I went and solved the problem myself. Your time is better than mine, but I'm still using some stdio functions.
I simply do not think the high score of 0.05 seconds is bona fide. I suspect it's the product of a highly automated system that returned that result in error, and that no one ever verified it.
How to defend that assertion? There's no real algorithmic complexity: the problem is O(n). The "trick" is to write specialized parsers for each aspect of the input (and to avoid work done only in debug mode). A total time of 50 milliseconds for 22 trials means each trial averages about 2.3 ms. We're down near the threshold of measurability.
Competitions like the problem you addressed yourself to are unfortunate, in a way. They reinforce the naive idea that performance is the ultimate measure of a program (there's no score for clarity). Worse, they encourage going around things like scanf "for performance" while, in real life, getting a program to run correctly and fast basically never entails avoiding or even tuning stdio. In a complex system, performance comes from things like avoiding I/O, passing over the data only once, and minimizing copies. Using the DBMS effectively is often key (as it were), but such things never show up in programming challenges.
Parsing and formatting numbers as text does take time, and in rare circumstances can be a bottleneck. But the answer is hardly ever to rewrite the parser. Rather, the answer is to parse the text into a convenient binary form, and use that. In short: compilation.
That said, a few observations may help.
You don't need dynamic memory for this problem, and it's not helping. The problem statement says the input array may be up to 100,000 elements, and the number of trials may be as many as 100,000. Each trial is two integer strings of up to 6 digits each, separated by a space and terminated by a newline: 6 + 1 + 6 + 1 = 14 bytes. The maximum total input is therefore 100,000 + 1 + 6 + 1 + 100,000 * 14 bytes: about 1.5 MB. You are allowed 1 GB of memory.
I just allocated a single buffer big enough for the whole input, and read it in all at once with read(2). Then I made a single pass over that input.
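The shape of it, as a sketch (the 2 MB constant is my rounding-up of the ~1.5 MB maximum):

#include <unistd.h>

#define INBUF_SIZE (2 * 1024 * 1024)
static char inbuf[INBUF_SIZE];          // statically allocated, no malloc

int main(void)
{
    ssize_t total = 0, got;
    // read(2) may return short counts, so loop until EOF
    while ((got = read(0, inbuf + total, INBUF_SIZE - total)) > 0)
        total += got;
    // ... single pass over inbuf[0 .. total) goes here ...
    return 0;
}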
You got suggestions to use asynchronous I/O and threads. The problem statement says you're measured on CPU time, so neither of those helps. The shortest distance between two points is a straight line; a single read into statically allocated memory wastes no motion.
One ridiculous aspect of the way they measure performance is that they compile with gcc -g -O2 and no -DNDEBUG. That means assert(3) is invoked in code that is measured for performance! I couldn't get under 4 seconds on test 22 until I removed my asserts.
In sum, you did pretty well, and I suspect the winner you're baffled by is a phantom. Your code does faff about a bit, and you can dispense with the dynamic memory and the stdio tuning. I bet your time can be trimmed by simplifying it. To the extent that performance matters, that's where I'd direct your attention.
You should allocate all your buffers contiguously.
Allocate one buffer the size of all your buffers (num_buf, start, stop) combined, then point each of them at the corresponding offset within it.
This can reduce your cache misses / page faults.
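A sketch of that idea against your buffer names (alloc_all is a made-up helper):

#include <stdlib.h>

void *alloc_all(int m, int **start, int **stop, char **num_buf, size_t num_len)
{
    // ints first, so both int arrays stay naturally aligned;
    // calloc keeps the zero-initialization the original code relies on
    char *block = calloc(1, 2 * (size_t)m * sizeof(int) + num_len);
    if (!block)
        return NULL;
    *start   = (int *)block;
    *stop    = (int *)block + m;
    *num_buf = block + 2 * (size_t)m * sizeof(int);
    return block;   // free() this single pointer when done
}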
Since the read and write operations seem to consume a lot of time, you should consider adding threads. One thread should deal with I/O and another with the computation. (It is worth checking whether a third thread for printing speeds things up as well.) Make sure you don't use any locks while doing this.
Answering this question is tricky because optimization heavily depends on the problem you have.
One idea is to look at the content of the file you are trying to read and see if there are patterns or properties you can use in your favor.
The code you wrote is a "general" solution for reading from a file, executing something, and then writing to a file. But if the file is not randomly generated each time and the content is always the same, why not write a solution tailored to that file?
On the other hand, you could try to use low-level system functions. One that comes to mind is mmap, which allows you to map a file directly into memory and access that memory instead of going through scanf and fgets.
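A sketch of that route; it assumes stdin is a redirected regular file (as in ./a.out < sample.in), since mmap cannot map a pipe:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    if (fstat(0, &st) != 0)      // fd 0 is stdin
        return 1;
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, 0, 0);
    if (data == MAP_FAILED)      // e.g. stdin is a pipe, not a regular file
        return 1;
    // ... parse data[0 .. st.st_size) in place instead of fgets/scanf ...
    munmap(data, st.st_size);
    return 0;
}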
Another thing I noticed is that your solution uses two loops, one to read everything and one to compute; why not try to use only one? Another option would be asynchronous I/O: instead of reading the whole file in one loop and then doing the calculation in another, read a portion at the beginning, start processing it asynchronously, and continue reading.
This link might help for the async part.
I would like a variable (belonging to a process) to get a new random value each time a new process starts.
I need this random generation to make every process created sleep a random number of seconds. At the beginning of the program I used
srand(time(NULL)), and in the function that the process would run I used
int sleeptime = rand() % 16 + 5; //that's because I need values from 5 to 20.
I've tried to implement such a thing, but I saw that the value of the variable is the same for every process.
I think that if I seeded srand(...) with the current time in milliseconds (the time at which the respective process begins), I would get different random values. The problem is I didn't find any information on how to do this. The only thing suggested on various pages is the well-known srand(time(NULL)); (where time(NULL) returns the current time in seconds).
Can you please suggest some way to implement this? Thank you in advance.
If you're on Linux, you can also seed the PRNG by reading from /dev/random. Something like this:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void seedprng(void) {
    unsigned i;
    FILE *f = fopen("/dev/random", "rb");
    if (f && fread(&i, sizeof(i), 1, f) == 1) {
        fclose(f);
    }
    else {
        printf("falling back to time seeding\n");
        i = (unsigned)time(NULL);
        if (f)
            fclose(f);
    }
    srand(i);   // srand takes the value, not a pointer
}
I'm doing some research work on Ubuntu 15.10 x64. I want to study whether there's a way to make two or more processes that are reading a text file simultaneously slow down each other's reading.
For example, two processes P1 and P2. A text file /etc/example.txt. It has 1KB data.
P1's pseudo code:
for (int i = 0; i < 1000000; i++) {
    str = read_file('/etc/example.txt', 'r');
    print(str);
}
P2's pseudo code:
for (int i = 0; i < 100; i++) {
    str = read_file('/etc/example.txt', 'r');
    print(str);
}
time = get_the_whole_run_time();
print(time / 100);
Condition 1:
P1 is running. P2 is used to "race" with P1 and it calculates the average reading time TIME_1.
Condition 2:
P1 is NOT running. Only run P2 and it calculates the average reading time TIME_2.
My goal is to make TIME_1 significantly higher than TIME_2 (this is for research purposes). But my experiments don't work out that way: TIME_1 is nearly the same as TIME_2.
I know there may be things like the file system cache that affect the result. I used the command echo 3 > /proc/sys/vm/drop_caches to clear the caches, but it doesn't work.
Any ideas? Thanks!
Use huge files to experiment.
Beware that if P1 and P2 run simultaneously, the average time may even be less than with a single process, because one of the processes may benefit from the fresh cache the other has just set up, and thus has no need to wait for physical I/O. Your experiment is very difficult to set up, as there are many, many variables in it and many system-internal mechanisms with non-orthogonal effects; surprising results may appear.
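If you want to evict a single file's cached pages between runs, instead of dropping every cache globally with drop_caches, one option is posix_fadvise(POSIX_FADV_DONTNEED); a rough sketch (drop_file_cache is a made-up name):

#include <fcntl.h>
#include <unistd.h>

int drop_file_cache(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    fsync(fd);   // flush dirty pages first so they can actually be evicted
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  // len 0 = whole file
    close(fd);
    return rc;
}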
I have a simple program that uses only one process (each time it's executed), creates a semaphore with a key derived from the file's name (via the ftok() function), and then writes a line to a file. The thing is, the semaphores (in this case, 2) have to do two things: one has to guarantee that no two programs write at the same time, and the other has to verify that at most 10 lines have been written to the file. So if I execute the program and the file already has 10 lines of text, it won't write anything to it.
This is my code:
#include "semaphores.h"
int main() {
int semaphoreLines = create_semaphore(ftok("Ex5.c", 0), 10);
int semaphoreWrite = create_semaphore(ftok("Ex5.c", 1), 1);
FILE *file;
int ret_val = down(semaphoreLines, 1);
if(ret_val != 0) {
printf("No more lines can be written to the file!\n");
exit(-1);
}
down(semaphoreWrite, 1);
file = fopen("Ex5.txt", "a");
fprintf(file, "This is process %d\n", getpid());
fclose(file);
up(semaphoreWrite, 1);
return 0;
}
When I execute it the first time, semaphoreLines goes to 9 (as intended), it locks semaphoreWrite at 0 (so no other process can write to the file), then writes and frees the latter back up to 1. The process terminates. I manually run it again in the Terminal. However, semaphoreLines should still be 9, so when I down() it, it should go to 8 and so forth. The issue is, it gets back up at 10 again. I don't want this.
Maybe it's because I'm fairly new to semaphore programming, but I thought semaphores were public as long as they aren't created with a key of 0. With ftok(), I wanted the semaphore to be public, so that if I run the program again it decrements it if possible and writes; if not, it displays the error and terminates. I mean, the semaphore doesn't get removed, so the second time the program gets executed it should see that the semaphore value is 9, right...?
I don't really want to fork 10 processes and have them write one by one to the file in the same program...or is that the only way to do it?
P.S. The create_semaphore() function is part of my semaphores.h header file, which contains 4 simple functions I wrote so that using semaphores is easier than spelling out all that semget, semop, and semctl stuff every time I want to work with them.
The issue is, it gets back up at 10 again. I don't want this.
If you don't want this, then don't do it. You yourself are setting the semaphore value to 10 in create_semaphore(). Instead, pass IPC_EXCL in addition to IPC_CREAT to semget(), and if that yields errno EEXIST, just return the existing semaphore from create_semaphore() and skip the semctl(SETVAL).
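A sketch of what create_semaphore() could look like with that change; since your semaphores.h isn't shown, this is only a guess at its shape, using the raw System V calls:

#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; struct semid_ds *buf; unsigned short *array; };

int create_semaphore(key_t key, int initial_value)
{
    int semid = semget(key, 1, 0666 | IPC_CREAT | IPC_EXCL);
    if (semid >= 0) {
        // we created it: initialize the value exactly once
        union semun arg = { .val = initial_value };
        if (semctl(semid, 0, SETVAL, arg) == -1)
            return -1;
        return semid;
    }
    if (errno == EEXIST)
        // it already exists: attach without touching its current value
        return semget(key, 1, 0666);
    return -1;
}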