I am using multiple threads to access various random files using threads. However, I get an error [Thread 0x7ffff7007700 (LWP 16256) exited]. Also, info threads shows that only 2 threads are created. However, I am trying to create 100 of them. Also, do I have to use the pthread_join() function in my case? The code:
#define NTHREADS 100
void *encrypt(void *args)
{
int count = *((int*) args);
AES_KEY enc_key;
AES_set_encrypt_key(key, 128, &enc_key);
int i;
for(i=1;i<=count;i++){
char *ifile;
char *ofile;
long length;
size_t result;
char * buffer;
sprintf(ifile,"random_files/random_%d.txt",i);
FILE *ifp = fopen(ifile,"rb");
if (ifp==NULL) {fputs ("File error",stderr); exit (1);}
fseek(ifp,0, SEEK_END);
length = ftell(ifp);
fseek (ifp,0, SEEK_SET);
buffer = (char*) malloc (sizeof(char)*length);
if (buffer == NULL) {fputs ("Memory error",stderr); exit (2);}
result = fread (buffer, 1, length, ifp);
if (result != length) {fputs ("Reading error",stderr); exit (3);}
printf("%s",buffer);
fclose (ifp);
free(buffer);
}
}
int main(){
int i,j,count =0;
pthread_t threads[NTHREADS];
for (i=0; i<NTHREADS; i++){
count = count +20;
int *count_ptr = &count;
if(pthread_create(&threads[i], NULL, encrypt, count_ptr)!=0){
fprintf(stderr, "error: Cannot create thread # %d\n", i);
break;
}
}
printf("After Thread\n");
exit(0);
}
Yes, you should join your threads, as none were created detached (and you're probably not deep enough in learning pthreads to deal with that anyway).
That said, you have a significant logic problem in your thread parameters. They're all getting the address of the same count variable in main. Probably the fastest way to change that with no modifications to your thread-proc is simply stand up a side-by-side array of int counts[NTHREADS] matching your threads, using each element as the corresponding thread's data param:
int main()
{
pthread_t threads[NTHREADS];
int counts[NTHREADS]; // HERE
int i,j,count =0;
for (i=0; i<NTHREADS; i++)
{
counts[i] = (count += 20);
if(pthread_create(&threads[i], NULL, encrypt, counts+i)!=0) // LAST PARAM
{
fprintf(stderr, "error: Cannot create thread # %d\n", i);
break;
}
}
for (i=0; i<NTHREADS; ++i)
pthread_join(threads[i], NULL));
return EXIT_SUCCESS;
}
Alternatively, you could do some dynamic allocation hoops, or send the value via intptr_t, cast to void*, but the method shown above has the advantage of requiring no changes on your thread-proc, a target I was aiming for.
I've left any issues in your thread proc to you to solve, but that should get you up and running on your thread stack, at least.
There are missing things in the code, first, in your encrypt function you need to close the threads at the end and also in the main function BUT, before you do so, after your for loop you need to join them all, then they are going to be processed and closed correctly.
The flow will go like this:
(void) pthread_join(th1, NULL);
(void) pthread_join(th2, NULL);
You can put them all in a for to join them by tid.
Another update: There is not really anything "random" going on, I would just start open the files according to its tid, and then add other big implementations, your as is will try to open files like crazy, it should work anyways but, just saying.
I Work with couple of threads. all running as long as an exit_flag is set to false.
I Have specific thread that doesn't recognize the change in the flag, and therefor not ending and freeing up its resources, and i'm trying to understand why.
UPDATE: After debugging a bit with gdb, i can see that given 'enough time' the problematic thread does detects the flag change.
My conclusion from this is that not enough time passes for the thread to detect the change in normal run.
How can i 'delay' my main thread, long enough for all threads to detect the flag change, without having to JOIN them? (the use of exit_flag was in an intention NOT to join the threads, as i don't want to manage all threads id's for that - i'm just detaching each one of them, except the thread that handles input).
I've tried using sleep(5) in close_server() method, after the flag changing, with no luck
Notes:
Other threads that loop on the same flag does terminate succesfully
exit_flag declaration is: static volatile bool exit_flag
All threads are reading the flag, flag value is changed only in close_server() method i have (which does only that)
Data race that may occur when a thread reads the flag just before its changed, doesn't matter to me, as long as in the next iteration of the while loop it will read the correct value.
No error occurs in the thread itself (according to strerr & stdout which are 'clean' from error messages (for the errors i handle in the thread)
Ths situation also occurs even when commenting out the entire while((!exit_flag) && (remain_data > 0)) code block - so this is not a sendfile hanging issure
station_info_t struct:
typedef struct station_info {
int socket_fd;
int station_num;
} station_info_t;
Problematic thread code:
void * station_handler(void * arg_p)
{
status_type_t rs = SUCCESS;
station_info_t * info = (station_info_t *)arg_p;
int remain_data = 0;
int sent_bytes = 0;
int song_fd = 0;
off_t offset = 0;
FILE * fp = NULL;
struct stat file_stat;
/* validate station number for this handler */
if(info->station_num < 0) {
fprintf(stderr, "station_handler() station_num = %d, something's very wrong! exiting\n", info->station_num);
exit(EXIT_FAILURE);
}
/* Open the file to send, and get his stats */
fp = fopen(srv_params.songs_names[info->station_num], "r");
if(NULL == fp) {
close(info->socket_fd);
free(info);
error_and_exit("fopen() failed! errno = ", errno);
}
song_fd = fileno(fp);
if( fstat(song_fd, &file_stat) ) {
close(info->socket_fd);
fclose(fp);
free(info);
error_and_exit("fstat() failed! errno = ", errno);
}
/** Run as long as no exit procedure was initiated */
while( !exit_flag ) {
offset = 0;
remain_data = file_stat.st_size;
while( (!exit_flag) && (remain_data > 0) ) {
sent_bytes = sendfile(info->socket_fd, song_fd, &offset, SEND_BUF);
if(sent_bytes < 0 ) {
error_and_exit("sendfile() failed! errno = ", errno);
}
remain_data = remain_data - sent_bytes;
usleep(USLEEP_TIME);
}
}
printf("Station %d handle exited\n", info->station_num);
/* Free \ close all resources */
close(info->socket_fd);
fclose(fp);
free(info);
return NULL;
}
I'll be glad to get some help.
Thanks guys
Well, as stated by user362924 the main issue is that i don't join the threads in my main thread, therefore not allowing them enough time to exit.
A workaround to the matter, if for some reason one wouldn't want to join all threads and dynamically manage thread id's, is to use sleep command in the end of the main thread, for a couple of seconds.
of course this workaround is not good practice and not recommended (to anyone who gets here by google)
I'm new to using the pthread library in C and I have an assignment for my class to write a simple program using them. The basic description of the program is it takes 1 or more input files containing website names and 1 output file name. I then need to create 1 thread per input file to read in the website names and push them onto a queue. Then I need to create a couple of threads to pull those names off of the queue, find their IP Address, and then write that information out to the output file. The command line arguments are expected as follows:
./multi-lookup [one or more input files] [single output file name]
My issue is this. Whenever I run the program with only 1 thread to push information to the output file then everything works properly. When I make it two threads then the program hangs and none of my testing "printf" statements are even printed. My best guess is that deadlock is occurring somehow and that I'm not using my mutexes properly but I can't figure out how to fix it. Please help!
If you need any information that I'm not providing then just let me know. Sorry for the lack of comments in the code.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>
#include "util.h"
#include "queue.h"
#define STRING_SIZE 1025
#define INPUTFS "%1024s"
#define USAGE "<inputFilePath> <outputFilePath>"
#define NUM_RESOLVERS 2
queue q;
pthread_mutex_t locks[2];
int requestors_finished;
void* requestors(void* input_file);
void* resolvers(void* output_file);
int main(int argc, char* argv[])
{
FILE* inputfp = NULL;
FILE* outputfp = NULL;
char errorstr[STRING_SIZE];
pthread_t requestor_threads[argc - 2];
pthread_t resolver_threads[NUM_RESOLVERS];
int return_code;
requestors_finished = 0;
if(queue_init(&q, 10) == QUEUE_FAILURE)
fprintf(stderr, "Error: queue_init failed!\n");
if(argc < 3)
{
fprintf(stderr, "Not enough arguments: %d\n", (argc - 1));
fprintf(stderr, "Usage:\n %s %s\n", argv[0], USAGE);
return 1;
}
pthread_mutex_init(&locks[0], NULL);
pthread_mutex_init(&locks[1], NULL);
int i;
for(i = 0; i < (argc - 2); i++)
{
inputfp = fopen(argv[i+1], "r");
if(!inputfp)
{
sprintf(errorstr, "Error Opening Input File: %s", argv[i]);
perror(errorstr);
break;
}
return_code = pthread_create(&(requestor_threads[i]), NULL, requestors, inputfp);
if(return_code)
{
printf("ERROR: return code from pthread_create() is %d\n", return_code);
exit(1);
}
}
outputfp = fopen(argv[i+1], "w");
if(!outputfp)
{
sprintf(errorstr, "Errord opening Output File: %s", argv[i+1]);
perror(errorstr);
exit(1);
}
for(i = 0; i < NUM_RESOLVERS; i++)
{
return_code = pthread_create(&(resolver_threads[i]), NULL, resolvers, outputfp);
if(return_code)
{
printf("ERROR: return code from pthread_create() is %d\n", return_code);
exit(1);
}
}
for(i = 0; i < (argc - 2); i++)
pthread_join(requestor_threads[i], NULL);
requestors_finished = 1;
for(i = 0; i < NUM_RESOLVERS; i++)
pthread_join(resolver_threads[i], NULL);
pthread_mutex_destroy(&locks[0]);
pthread_mutex_destroy(&locks[1]);
return 0;
}
void* requestors(void* input_file)
{
char* hostname = (char*) malloc(STRING_SIZE);
FILE* input = input_file;
while(fscanf(input, INPUTFS, hostname) > 0)
{
while(queue_is_full(&q))
usleep((rand()%100));
if(!queue_is_full(&q))
{
pthread_mutex_lock(&locks[0]);
if(queue_push(&q, (void*)hostname) == QUEUE_FAILURE)
fprintf(stderr, "Error: queue_push failed on %s\n", hostname);
pthread_mutex_unlock(&locks[0]);
}
hostname = (char*) malloc(STRING_SIZE);
}
printf("%d\n", queue_is_full(&q));
free(hostname);
fclose(input);
pthread_exit(NULL);
}
void* resolvers(void* output_file)
{
char* hostname;
char ipstr[INET6_ADDRSTRLEN];
FILE* output = output_file;
int is_empty = queue_is_empty(&q);
//while(!queue_is_empty(&q) && !requestors_finished)
while((!requestors_finished) || (!is_empty))
{
while(is_empty)
usleep((rand()%100));
pthread_mutex_lock(&locks[0]);
hostname = (char*) queue_pop(&q);
pthread_mutex_unlock(&locks[0]);
if(dnslookup(hostname, ipstr, sizeof(ipstr)) == UTIL_FAILURE)
{
fprintf(stderr, "DNSlookup error: %s\n", hostname);
strncpy(ipstr, "", sizeof(ipstr));
}
pthread_mutex_lock(&locks[1]);
fprintf(output, "%s,%s\n", hostname, ipstr);
pthread_mutex_unlock(&locks[1]);
free(hostname);
is_empty = queue_is_empty(&q);
}
pthread_exit(NULL);
}
Although I'm not familiar with your "queue.h" library, you need to pay attention to the following:
When you check whether your queue is empty you are not acquiring the mutex, meaning that the following scenario might happen:
Some requestors thread checks for emptiness (let's call it thread1) and just before it executes pthread_mutex_lock(&locks[0]); (and after if(!queue_is_full(&q)) ) thread1 gets contex switched
Other requestors threads fill the queue up and when out thread1 finally gets hold of the mutex if will try to insert to the full queue. Now if your queue implementation crashes when one tries to insert more elements into an already full queue thread1 will never unlock the mutex and you'll have a deadlock.
Another scenario:
Some resolver thread runs first requestors_finished is initially 0 so (!requestors_finished) || (!is_empty) is initially true.
But because the queue is still empty is_empty is true.
This thread will reach while(is_empty) usleep((rand()%100)); and sleep forever, because you pthread_join this thread your program will never terminate because this value is never updated in the loop.
The general idea to remember is that when you access some resource that is not atomic and might be accessed by other threads you need to make sure you're the only one performing actions on this resource.
Using a mutex is OK but you should consider that you cannot anticipate when will a context switch occur, so if you want to chech e.g whether the queue is empty you should do this while having the mutex locked and not unlock it until you're finished with it otherwise there's no guarantee that it'll stay empty when the next line executes.
You might also want to consider reading more about the consumer producer problem.
To help you know (and control) when the consumers (resolver) threads should run and when the producer threads produce you should consider using conditional variables.
Some misc. stuff:
pthread_t requestor_threads[argc - 2]; is using VLA and not in a good way - think what will happen if I give no parameters to your program. Either decide on some maximum and define it or create it dynamically after having checked the validity of the input.
IMHO the requestors threads should open the file themselves
There might be some more problems but start by fixing those.
I wanted to write a program that test if two files are duplicates (have exactly the same content). First I test if the files have the same sizes, and if they have i start to compare their contents.
My first idea, was to "split" the files into fixed size blocks, then start a thread for every block, fseek to startup character of every block and continue the comparisons in parallel. When a comparison from a thread fails, the other working threads are canceled, and the program exits out of the thread spawning loop.
The code looks like this:
dupf.h
#ifndef __NM__DUPF__H__
#define __NM__DUPF__H__
#define NUM_THREADS 15
#define BLOCK_SIZE 8192
/* Thread argument structure */
struct thread_arg_s {
const char *name_f1; /* First file name */
const char *name_f2; /* Second file name */
int cursor; /* Where to seek in the file */
};
typedef struct thread_arg_s thread_arg;
/**
* 'arg' is of type thread_arg.
* Checks if the specified file blocks are
* duplicates.
*/
void *check_block_dup(void *arg);
/**
* Checks if two files are duplicates
*/
int check_dup(const char *name_f1, const char *name_f2);
/**
* Returns a valid pointer to a file.
* If the file (given by the path/name 'fname') cannot be opened
* in 'mode', the program is interrupted an error message is shown.
**/
FILE *safe_fopen(const char *name, const char *mode);
#endif
dupf.c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include "dupf.h"
FILE *safe_fopen(const char *fname, const char *mode)
{
FILE *f = NULL;
f = fopen(fname, mode);
if (f == NULL) {
char emsg[255];
sprintf(emsg, "FOPEN() %s\t", fname);
perror(emsg);
exit(-1);
}
return (f);
}
void *check_block_dup(void *arg)
{
const char *name_f1 = NULL, *name_f2 = NULL; /* File names */
FILE *f1 = NULL, *f2 = NULL; /* Streams */
int cursor = 0; /* Reading cursor */
char buff_f1[BLOCK_SIZE], buff_f2[BLOCK_SIZE]; /* Character buffers */
int rchars_1, rchars_2; /* Readed characters */
/* Initializing variables from 'arg' */
name_f1 = ((thread_arg*)arg)->name_f1;
name_f2 = ((thread_arg*)arg)->name_f2;
cursor = ((thread_arg*)arg)->cursor;
/* Opening files */
f1 = safe_fopen(name_f1, "r");
f2 = safe_fopen(name_f2, "r");
/* Setup cursor in files */
fseek(f1, cursor, SEEK_SET);
fseek(f2, cursor, SEEK_SET);
/* Initialize buffers */
rchars_1 = fread(buff_f1, 1, BLOCK_SIZE, f1);
rchars_2 = fread(buff_f2, 1, BLOCK_SIZE, f2);
if (rchars_1 != rchars_2) {
/* fread failed to read the same portion.
* program cannot continue */
perror("ERROR WHEN READING BLOCK");
exit(-1);
}
while (rchars_1-->0) {
if (buff_f1[rchars_1] != buff_f2[rchars_1]) {
/* Different characters */
fclose(f1);
fclose(f2);
pthread_exit("notdup");
}
}
/* Close streams */
fclose(f1);
fclose(f2);
pthread_exit("dup");
}
int check_dup(const char *name_f1, const char *name_f2)
{
int num_blocks = 0; /* Number of 'blocks' to check */
int num_tsp = 0; /* Number of threads spawns */
int tsp_iter = 0; /* Iterator for threads spawns */
pthread_t *tsp_threads = NULL;
thread_arg *tsp_threads_args = NULL;
int tsp_threads_iter = 0;
int thread_c_res = 0; /* Thread creation result */
int thread_j_res = 0; /* Thread join res */
int loop_res = 0; /* Function result */
int cursor;
struct stat buf_f1;
struct stat buf_f2;
if (name_f1 == NULL || name_f2 == NULL) {
/* Invalid input parameters */
perror("INVALID FNAMES\t");
return (-1);
}
if (stat(name_f1, &buf_f1) != 0 || stat(name_f2, &buf_f2) != 0) {
/* Stat fails */
char emsg[255];
sprintf(emsg, "STAT() ERROR: %s %s\t", name_f1, name_f2);
perror(emsg);
return (-1);
}
if (buf_f1.st_size != buf_f2.st_size) {
/* File have different sizes */
return (1);
}
/* Files have the same size, function exec. is continued */
num_blocks = (buf_f1.st_size / BLOCK_SIZE) + 1;
num_tsp = (num_blocks / NUM_THREADS) + 1;
cursor = 0;
for (tsp_iter = 0; tsp_iter < num_tsp; tsp_iter++) {
loop_res = 0;
/* Create threads array for this spawn */
tsp_threads = malloc(NUM_THREADS * sizeof(*tsp_threads));
if (tsp_threads == NULL) {
perror("TSP_THREADS ALLOC FAILURE\t");
return (-1);
}
/* Create arguments for every thread in the current spawn */
tsp_threads_args = malloc(NUM_THREADS * sizeof(*tsp_threads_args));
if (tsp_threads_args == NULL) {
perror("TSP THREADS ARGS ALLOCA FAILURE\t");
return (-1);
}
/* Initialize arguments and create threads */
for (tsp_threads_iter = 0; tsp_threads_iter < NUM_THREADS;
tsp_threads_iter++) {
if (cursor >= buf_f1.st_size) {
break;
}
tsp_threads_args[tsp_threads_iter].name_f1 = name_f1;
tsp_threads_args[tsp_threads_iter].name_f2 = name_f2;
tsp_threads_args[tsp_threads_iter].cursor = cursor;
thread_c_res = pthread_create(
&tsp_threads[tsp_threads_iter],
NULL,
check_block_dup,
(void*)&tsp_threads_args[tsp_threads_iter]);
if (thread_c_res != 0) {
perror("THREAD CREATION FAILURE");
return (-1);
}
cursor+=BLOCK_SIZE;
}
/* Join last threads and get their status */
while (tsp_threads_iter-->0) {
void *thread_res = NULL;
thread_j_res = pthread_join(tsp_threads[tsp_threads_iter],
&thread_res);
if (thread_j_res != 0) {
perror("THREAD JOIN FAILURE");
return (-1);
}
if (strcmp((char*)thread_res, "notdup")==0) {
loop_res++;
/* Closing other threads and exiting by condition
* from loop. */
while (tsp_threads_iter-->0) {
pthread_cancel(tsp_threads[tsp_threads_iter]);
}
}
}
free(tsp_threads);
free(tsp_threads_args);
if (loop_res > 0) {
break;
}
}
return (loop_res > 0) ? 1 : 0;
}
The function works fine (at least for what I've tested). Still, some guys from #C (freenode) suggested that the solution is overly complicated, and it may perform poorly because of parallel reading on hddisk.
What I want to know:
Is the threaded approach flawed by default ?
Is fseek() so slow ?
Is there a way to somehow map the files to memory and then compare them ?
LATED EDIT:
Today I had some time, and I've followed your advices. You were right, this threaded version actually performs worse than a single threaded version, and all because of the parallel readings on hard disk.
Another thing is that I've written a function that uses mmap(), and until now is the optimal one. Still the biggest drawback of that function is that it fails, when the files are getting really big.
Here is the new implementation (a very brute and direct code):
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include "dupf.h"
/**
* Safely assures that a file is opened.
* If cannot open file, the flow of the program is interrupted.
* The error code returned is -1.
**/
FILE *safe_fopen(const char *fname, const char *mode)
{
FILE *f = NULL;
f = fopen(fname, mode);
if (f == NULL) {
char emsg[1024];
sprintf(emsg, "Cannot open file: %s\t", fname);
perror(emsg);
exit(-1);
}
return (f);
}
/**
* Check if two files have the same size.
* Returns:
* -1 Error.
* 0 If they have the same size.
* 1 If the don't have the same size.
**/
int check_same_size(const char *f1_name, const char *f2_name, off_t *f1_size, off_t *f2_size)
{
struct stat f1_stat, f2_stat;
if((f1_name == NULL) || (f2_name == NULL)){
fprintf(stderr, "Invalid filename passed to function [check_same_size].\n");
return (-1);
}
if((stat(f1_name, &f1_stat) != 0) || (stat(f2_name, &f2_stat) !=0)){
fprintf(stderr, "Cannot apply stat. [check_same_size].\n");
return (-1);
}
if(f1_size != NULL){
*f1_size = f1_stat.st_size;
}
if(f2_size != NULL){
*f2_size = f2_stat.st_size;
}
return (f1_stat.st_size == f2_stat.st_size) ? 0 : 1;
}
/**
* Test if two files are duplicates.
* Returns:
* -1 Error.
* 0 If they are duplicates.
* 1 If they are not duplicates.
**/
int check_dup_plain(char *f1_name, char *f2_name, int block_size)
{
if ((f1_name == NULL) || (f2_name == NULL)){
fprintf(stderr, "Invalid filename passed to function [check_dup_plain].\n");
return (-1);
}
FILE *f1 = NULL, *f2 = NULL;
char f1_buff[block_size], f2_buff[block_size];
size_t rch1, rch2;
if(check_same_size(f1_name, f2_name, NULL, NULL) == 1){
return (1);
}
f1 = safe_fopen(f1_name, "r");
f2 = safe_fopen(f2_name, "r");
while(!feof(f1) && !feof(f2)){
rch1 = fread(f1_buff, 1, block_size, f1);
rch2 = fread(f2_buff, 1, block_size, f2);
if(rch1 != rch2){
fprintf(stderr, "Invalid reading from file. Cannot continue. [check_dup_plain].\n");
return (-1);
}
while(rch1-->0){
if(f1_buff[rch1] != f2_buff[rch1]){
return (1);
}
}
}
fclose(f1);
fclose(f2);
return (0);
}
/**
* Test if two files are duplicates.
* Returns:
* -1 Error.
* 0 If they are duplicates.
* 1 If they are not duplicates.
**/
int check_dup_memmap(char *f1_name, char *f2_name)
{
struct stat f1_stat, f2_stat;
char *f1_array = NULL, *f2_array = NULL;
off_t f1_size, f2_size;
int f1_des, f2_des, cont, res;
if((f1_name == NULL) || (f2_name == NULL)){
fprintf(stderr, "Invalid filename passed to function [check_dup_memmap].\n");
return (-1);
}
if(check_same_size(f1_name, f2_name, &f1_size, &f2_size) == 1){
return (1);
}
f1_des = open(f1_name, O_RDONLY);
f2_des = open(f2_name, O_RDONLY);
if((f1_des == -1) || (f2_des == -1)){
perror("Cannot open file");
exit(-1);
}
f1_array = mmap(0, f1_size * sizeof(*f1_array), PROT_READ, MAP_SHARED, f1_des, 0);
if(f1_array == NULL){
fprintf(stderr, "Cannot map file to memory [check_dup_memmap].\n");
return (-1);
}
f2_array = mmap(0, f2_size * sizeof(*f2_array), PROT_READ, MAP_SHARED, f2_des, 0);
if(f2_array == NULL){
fprintf(stderr, "Cannot map file to memory [check_dup_memmap].\n");
return (-1);
}
cont = f1_size;
res = 0;
while(cont-->0){
if(f1_array[cont]!=f2_array[cont]){
res = 1;
break;
}
}
munmap((void*) f1_array, f1_size * sizeof(*f1_array));
munmap((void*) f2_array, f2_size * sizeof(*f2_array));
return res;
}
int main(int argc, char *argv[])
{
printf("result: %d\n",check_dup_memmap("f2","f1"));
return (0);
}
I am planning now to extend this code, by re-adding the threaded functionality, but this time the reading will be on memory.
Thanks for your answers.
The limiting factor will be disk reads, which (assuming that both files are on the same disk) will be serialized anyway, so I don't think threading will help much at all.
You could probably simplify your code greatly by using hashes, instead of doing a byte-by-byte comparison. Assuming you're not doing anything important, like deleting, an md5 or similar hash function should be plenty. Boost provides quite a few, and they're usually pretty fast.
if fileA.size == fileB.size
if fileA.hash() == fileB.hash()
flag(fileA, fileB, same);
I wouldn't delete files after that comparison, but it's plenty safe to move them to a temporary directory for further review or just build a list of possible duplicates.
It's hard to guess about performance without a real system to test against (for example if you're using a solid state drive, there's no head seek time and the cost of reading different sectors from different threads is almost zero).
If this is running against a reasonably standard computer with regular (spinning platter) hard drives, having multiple threads contend for the part of the disk they want to read from will possibly slow things down (depending, again, on the hardware and also the size of the chunks).
If the time it takes to compute the "sameness" of a chunk is fast compared to the time it takes to read that chunk from disk, having a separate thread will not help much since the second (or third...) thread would spend most of it's time waiting for IO to complete anyway.
Another factor is the cache size of the CPU. If all of the memory you're processing at one time fits in the CPU cache, things will be much faster than if different threads cause different chunks of memory to be loaded into cache as they execute instructions.
If you have more threads than you have CPU cores, you will just slow things down by making unnecessary context switches (since a thread needs a core to run on).
After reading all of that, if you still think multithreading is going to help for your target system, consider one thread that does IO only, places the data in a queue, and has two or more worker threads taking data off of the queue to process. That way, you optimize disk IO and can take advantage of multiple cores to crunch the numbers.
Steve suggested you can memory map you files on Unix. That will speed up access to the underlying data a bit by leveraging low level OS functionality (the same kind used to manage swap files). That will give you some performance improvement as the OS will handle loading the parts of the file you are working on into memory efficiently, as long as the file fits into available address space. FYI you can do the same thing on Windows.
Before even considering the performance effects of parallel disk reads and thread overhead and such...
Is there any reason to believe that scanning the files in chunks will find the differences any quicker than straight through? Is the data contained in the files predominantly in a certain format, and if so, is the splitting scheme tailored to it? If not, I don't see how scanning the files by skipping over every n bytes (which is all the multithreaded splitting is effectively doing) could offer any improvement over reading the bytes in the order they are on disk.
Think of the two limiting cases -- "splitting" the file into one block, and splitting the file into as many one-byte "blocks" as there are bytes in the file. Will either of those cases be more efficient than the other, or some in-between value? If there is no in-between value that you know you should optimize to, then you know nothing about how the data is stored in the files, so it should make no difference how you scan them.
Even if you set the split to optimize to the disk's performance like block size, you're still going to have to go back to read the next byte, which will likely be at an extremely non-optimal position. And in the end you're going to have to read every single byte in the file, no matter how you split it.
Because you're using pthreads, I assume you're working in a Unix environment -- in which case you could mmap(2) both files into memory and compare the memory arrays directly.
Well, there is the standard memory mapping mmap() function that maps a file to memory. You should be able to do something like
int fd1;
int fd2;
int size1;
int size2;
fd1 = open(name1, O_RDONLY);
size1 = lseek(fd1, 0, SEEK_END);
fd2 = open(name2, O_RDONLY);
size2 = lseek(fd2, 0, SEEK_END);
if ( size1 == size2 )
{
char * data1 = mmap(0, size1, PROT_READ, MAP_SHARED, fd1, 0);
char * data2 = mmap(0, size1, PROT_READ, MAP_SHARED, fd2, 0);
int i;
/* ...and this is, obviously, where you'd do something more clever */
for ( i = 0; i < size1 && *data1 == *data2; i++, data1++, data2++ );
if ( i == size1 )
printf("Equal\n");
}
close(fd1);
close(fd2);
Other than that, yes, your solution looks overly complicated ;-) The threaded approach is not necessarily flawed, but you might not see that parallel access improves performance. For SAN drives or ramdisks it might improve performance, for normal spinning platter drives it might impede it. But simpler is usually better, unless you really have a performance issue.
Regarding fseek() vs other methods, it depends on the operating system you use. Google is you friend here, you can easily find articles at least for Solaris and Linux.
Even if disk access was not the limiting factor (it will be), unless you have a multi-core processor that could hand off different threads to different cores, you would not see a speed-up from going multi-threaded. Basically, you have to compare all N bytes of the file one way or another, and even if you use threads, if they execute in the same core, it will take the same amount of time as without using threads.
There are some environments that could spread the workload across cores, but even so, the CPU will be able to process so much faster than the data can be pulled in from disk that the disk I/O system will be the limiting factor.