aio_write takes more time than plain write on ext4?

aio_write takes more time than plain write on ext4? - c

I have a C program that writes 32768 blocks, each block is 16K size (total size of 512MB), to an ext4 filesystem on a system running 3.18.1 kernel. The regular write system call version of this program takes 5.35 seconds to finish the writes (as measured by gettimeofday before and after the for loop). The async io version of this program however takes the following times:
to queue all the aio_writes (32768 aio_writes): 7.43 seconds
poll to finish each IO request: additional 4.93 seconds
The output files are opened with these flags:O_WRONLY, O_CREAT, O_NONBLOCK
Why does async io take more than double the write() time? Even the Time-to-queue-async-io-request/time-to-write-sync-io is 1.4.
Since some people marked it off-topic, I looked at the definition and decided to paste the code - that seems to be the only reason why it should be marked off-topic. I am not asking why the code is not working, only why aio is much slower than regular writes, especially since all parallel writes are to different blocks. Here's the aio code, followed by the non-aio code:
AIO program
#define MAX_AIO (16384*2)
#define BUFSIZE 16384
struct mys {
int status;
struct aiocb aio;
};
void set_aiocb(struct mys *aio, int num, int fd)
{
int i;
for (i = 0; i < num; i++) {
aio[i].aio.aio_fildes = fd;
aio[i].aio.aio_offset = BUFSIZE * i;
aio[i].aio.aio_buf = malloc(BUFSIZE);
set_buf(aio[i].aio.aio_buf, BUFSIZE, i);
aio[i].aio.aio_nbytes = BUFSIZE;
aio[i].aio.aio_reqprio = fd;
aio[i].aio.aio_sigevent.sigev_notify = SIGEV_NONE;
aio[i].aio.aio_sigevent.sigev_signo = SIGUSR1;
aio[i].aio.aio_sigevent.sigev_value.sival_ptr = &aio[i];
aio[i].aio.aio_lio_opcode = 0;
aio[i].status = EINPROGRESS;
}
}
void main(void)
{
int fd = open("/tmp/AIO", O_WRONLY | O_CREAT, 0666);
int i, open_reqs = MAX_AIO;
struct mys aio[MAX_AIO];
struct timeval start, end, diff;
set_aiocb(aio, MAX_AIO, fd);
gettimeofday(&start, NULL);
for (i = 0; i < MAX_AIO; i++)
aio_write(&aio[i].aio);
while (open_reqs > 0) {
for (i = 0; i < MAX_AIO; i++) {
if (aio[i].status == EINPROGRESS) {
aio[i].status = aio_error(&(aio[i].aio));
if (aio[i].status != EINPROGRESS)
open_reqs--;
}
}
}
gettimeofday(&end, NULL);
timersub(&end, &start, &diff);
printf("%d.%d\n", (int)diff.tv_sec, (int)diff.tv_usec);
}
Regular IO program
#define MAX_AIO (16384*2)
#define BUFSIZE 16384
char buf[MAX_AIO][BUFSIZE];
void main(void)
{
int i, fd = open("/tmp/NON_AIO", O_WRONLY | O_CREAT, 0666);
struct timeval start, end, diff;
gettimeofday(&start, NULL);
for (i = 0; i < MAX_AIO; i++)
write(fd, buf[i], BUFSIZE);
gettimeofday(&end, NULL);
timersub(&end, &start, &diff);
printf("%d.%d\n", (int)diff.tv_sec, (int)diff.tv_usec);
}

You aren't really comparing apples with apples.
In the AIO code, you have a separately allocated buffer for each of the write operations, so the program has 512 MiB (16 * 16 * 2 KiB) of memory allocated, plus the 32 K copies of the AIO control structure. That memory has to be allocated, initialized (each buffer gets a different value if the set_buf() function — which is not shown — sets each byte of the buffer to the value of the third parameter), then copied by the kernel to the driver, possibly via the kernel buffer pool.
In the regular IO code, you have one big, contiguous buffer which is initialized to all zeroes which you write to the disk.
To make the comparison equitable, you should use the same infrastructure in both programs, creating the AIO structures, but the regular IO code will then simply step through the structures, writing the data portion of each (while the AIO code will behave more or less as shown). And I expect you will find that the performance is a lot more nearly similar when you do that.

Related

DIRECT I/O performance

I am trying to measure DIRECT IO performance. By my understanding DIRECT I/O ignores the page cache and goes to the underlying device to fetch the data. Therefore, if we are reading the same file over and over again, DIRECT I/O would be slower compared to accesses involving the page cache as the file would be cached.
#define _GNU_SOURCE
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>
char *DIRECT_FILE_PATH = "direct.dat";
char *NON_DIRECT_FILE_PATH = "no_direct.dat";
int FILE_SIZE_MB = 100;
int NUM_ITER = 100;
void lay_file(int direct_flag) {
int flag = O_RDWR | O_CREAT | O_APPEND | O_DIRECT;
mode_t mode = 0644;
int fd;
if (direct_flag) {
fd = open(DIRECT_FILE_PATH, flag, mode);
} else {
fd = open(NON_DIRECT_FILE_PATH, flag, mode);
}
if (fd == -1) {
printf("Failed to open file. Error: \t%s\n", strerror(errno));
}
ftruncate(fd, FILE_SIZE_MB*1024*1024);
close(fd);
}
void read_file(int direct_flag) {
mode_t mode = 0644;
void *buf = malloc(FILE_SIZE_MB*1024*1024);
int fd, flag;
if (direct_flag) {
flag = O_RDONLY | O_DIRECT;
fd = open(DIRECT_FILE_PATH, flag, mode);
} else {
flag = O_RDONLY;
fd = open(NON_DIRECT_FILE_PATH, flag, mode);
}
for (int i=0; i<NUM_ITER; i++) {
read(fd, buf, FILE_SIZE_MB*1024*1024);
lseek(fd,0,SEEK_SET);
}
close(fd);
}
int main() {
lay_file(0);
lay_file(1);
clock_t t;
t = clock();
read_file(1);
t = clock() - t;
double time_taken = ((double)t)/CLOCKS_PER_SEC; // in seconds
printf("DIRECT I/O read took %f seconds to execute \n", time_taken);
t = clock();
read_file(0);
t = clock() - t;
time_taken = ((double)t)/CLOCKS_PER_SEC; // in seconds
printf("NON DIRECT I/O read took %f seconds to execute \n", time_taken);
return 0;
}
Using the above code to measure DIRECT I/O performance tells me that DIRECT I/O is faster than regular IO involving the page cache. This is the output
**
DIRECT I/O read took 0.824861 seconds to execute
NON DIRECT I/O read took 1.643310 seconds to execute
**
Please let me know if I am missing something. I have an NVMe SSD as the storage device. I wonder if it is too fast to really show the difference in performance of when the page cache is used and not used.
UPDATE:
Changing the buffer size to 4KB shows that DIRECT I/O is slower. The large buffer size was probably making large sequential writes to the underlying device which is more helpful but would still like some insights.
** DIRECT I/O read took 0.000209 seconds to execute NON DIRECT I/O read took 0.000151 seconds to execute **

epoll_wait() succeeds but returns -1 bytes

Below is code I created for epoll_wait on UNIX domain datagram sockets (note this is UNIX domain, not internet domain). Each of these C programs is by a NASM program -- the C object files are linked into the NASM executable.
I have no trouble with epoll_instance_create or add_to_epoll_fd_list (which adds each of the file descriptors). However, epoll_wait_next calls perror and returns "epoll_wait: Success", then loops through "for(i = 0; i < event_count; i++) { " as it should, but bytes_read is -1 on the line "bytes_read = read(events[i].data.fd, read_buffer, BUF_SIZE);"
The struct epoll_event epoll_events is defined as global so it can be passed by NASM when these programs are called.
The problem may be that I don't understand how to call read() in my context as the code comes from other code samples (this is the first time I've used epoll).
#define BUF_SIZE 750
#define SV_SOCK_PATH "/tmp/ud_ucase"
#define epoll_maxevents 100
struct epoll_event epoll_events;
int64_t epoll_instance_create() {
int epoll_fd = epoll_create1(0);
return (int64_t) epoll_fd;
}
int64_t add_to_epoll_fd_list(int epoll_fd, int this_fd) {
epoll_events.data.fd = epoll_fd;
epoll_events.events = EPOLLIN;
int res = epoll_ctl(epoll_fd, EPOLL_CTL_ADD, this_fd, &epoll_events);
perror("epoll_ctl");
if (res > 0)
return 1;
return 0;
}
int64_t epoll_wait_next(int epoll_fd, struct epoll_event * events){
int event_count, i;
ssize_t count;
ssize_t bytes_read;
int64_t read_buffer[750];
int64_t test_data;
event_count = epoll_wait(epoll_fd, events, epoll_maxevents, - 1);
perror("epoll_wait");
for(i = 0; i < event_count; i++) {
printf("Reading file descriptor '%d' -- ", events[i].data.fd);
bytes_read = read(events[i].data.fd, read_buffer, BUF_SIZE);
if (bytes_read > 0)
test_data = read_buffer[0];
printf("%zd bytes read.\n", bytes_read);
read_buffer[bytes_read] = '\0';
}
return 0;
}
The overall code (including the NASM code) is very large, and even a minimal version would be over 300 lines, so I have posted the relevant C programs above, which I think should be enough to spot the problem(s).
Thanks in advance for any help on understanding why bytes_read comes as -1; it should be 720 bytes.

Effect of Buffer Size in File I/O in Unix

I am trying to understand the inner workings of Unix based OSs. I was reading on buffered I/O, and how the buffer size affects the number of system calls made, which in turn affects the total time taken by say, a copy program. To being with, here is my program:
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <time.h>
long currentTimeMillis();
int main(int argc, char *argv[]) {
int bufsize = atoi(argv[3]);
printf("copying with buffer size %d\n", bufsize);
char buf[bufsize];
//open the file
int fd_from = open(argv[1], O_RDWR);
if(-1 == fd_from) {
printf("Error opening source file\n");
return -1;
}
//file to be copied to
int fd_to = open(argv[2], O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR);
if(-1 == fd_to) {
printf("Error opening destination file\n");
return -1;
}
//copy
long startTime = currentTimeMillis();
int bytes_read = 0;
long totalTimeForRead = 0;
long totalTimeForWrite = 0;
while(1) {
long readStartTime = currentTimeMillis();
int bytes_read = read(fd_from,buf,bufsize);
long readEndTime = currentTimeMillis();
if(0 == bytes_read) {
break;
}
if(-1 == bytes_read) {
printf("Error occurred while reading source file\n");
return -1;
}
totalTimeForRead += readEndTime - readStartTime;
long writeStartTime = currentTimeMillis();
int bytes_written = write(fd_to,buf,bufsize);
long writeEndTime = currentTimeMillis();
totalTimeForWrite += (writeEndTime - writeStartTime);
if(-1 == bytes_written) {
printf("Some error occurred while writing file\n");
return -1;
}
}
long endTime = currentTimeMillis();
printf("Total time to copy%ld\n", endTime - startTime);
printf("Total time to write%ld\n", totalTimeForWrite);
printf("Total time to read%ld\n", totalTimeForRead);
}
long currentTimeMillis() {
struct timeval time;
gettimeofday(&time, NULL);
return time.tv_sec * 1000 + time.tv_usec / 1000;
}
I am using a 16G MacBook Pro with 2.9GHz Intel i7 (if this information would be useful). The size of source file is 2.8G. I was a bit surprised to see that total time take by read() is much smaller than write(). Here're my findings with a buffer size of 16K:
./a.out largefile dest 16382
copying with buffer size 16382
Total time to copy5987
Total time to write5330
Total time to read638
From what I have read, write() returns immediately after transferring the data to the kernel buffer from the user buffer. So the time taken by it is this time + the time taken for the system call initiation. read() also reads from kernel buffer to user buffer, so the total time taken should be same (in both cases, there is no disk I/O).
Why is it then that there is such drastic difference in the results? Or am I benchmarking it wrong? Final question: Is it alright to do such benchmarking on a SSD, which has limited write cycles?

Filesystem VS Raw disk benchmarking in C

I am doing some benchmarking (on OS X) to see how the use of file system influences the bandwidth etc. I am using concurrency with the hope to create fragmentation in the FS.
However, it looks like using the FS is more efficient than raw disk accesses. Why?
Here is my code:
#include <pthread.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#define NO_THREADS (2)
#define PACKET_SIZE (1024 * 4)
#define SIZE_TO_WRITE (1024 * 1024 * 1024)
void write_buffer(void *arg) {
int *p_start = arg;
int start = *p_start;
char buffer[PACKET_SIZE];
char path[50];
sprintf(path, "file%d", start);
int fd = open(path, O_CREAT | O_WRONLY | O_APPEND);
//int fd = open("/dev/rdisk0s4", O_WRONLY);
if (fd < 0) {
fprintf(stderr, "Cound not open.\n", stderr);
goto end;
}
//lseek(fd, start * SIZE_TO_WRITE, SEEK_SET);
int current;
for (current = start; current < start + SIZE_TO_WRITE; current += PACKET_SIZE) {
int i;
for (i = 0; i < PACKET_SIZE; ++i) {
buffer[i] = i + current;
}
if (PACKET_SIZE != write(fd, buffer, PACKET_SIZE)) {
fprintf(stderr, "Could not write packet %d properly.", current);
goto close;
}
}
fsync(fd);
close:
close(fd);
end:
pthread_exit(0);
}
void flush(void) {
fflush(stdout);
fflush(stderr);
}
int main(void) {
pthread_t threads[NO_THREADS];
int starts[NO_THREADS];
int i;
atexit(flush);
for (i = 0; i < NO_THREADS; ++i) {
starts[i] = i;
if(pthread_create(threads + i, NULL, (void *) &write_buffer, (void *)(starts + i))) {
fprintf(stderr, "Error creating thread no %d\n", i);
return EXIT_FAILURE;
}
}
for (i = 0; i < NO_THREADS; ++i) {
if(pthread_join(threads[i], NULL)) {
fprintf(stderr, "Error joining thread\n");
return EXIT_FAILURE;
}
}
puts("Done");
return EXIT_SUCCESS;
}
With the help of the FS, the 2 threads write the file in 31.33 seconds. Without, it is achieved after minutes...

When you use /dev/rdisk0s4 instead of /path/to/normal/file%d, for every write you perform the OS will issue a disk I/O. Even if that disk is an SSD, that means that the round-trip time is probably at least a few hundred microseconds on average. When you write to the file instead, the filesystem isn't actually issuing your writes to disk until later. The Linux man page describes this well:
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
So, the data you wrote is being buffered by the filesystem, which only requires that a copy be made in memory -- this probably takes on the order of a few microseconds at most. If you want to do an apples-to-apples comparison, you should make sure you're doing synchronous I/O for both test cases. Even running fsync after the whole test is done will probably allow the filesystem to be much faster, since it will batch up the I/O into one continuous streaming write, which could be faster than what your test directly on the disk can achieve.
In general, writing good systems benchmarks is incredibly difficult, especially when you don't know a lot about the system you're trying to test. I'd recommend using an off-the-shelf Unix filesystem benchmarking toolkit if you want high quality results -- otherwise, you could spend literally a lifetime learning about performance pathologies of the OS and FS you're testing... not that that's a bad thing, if you're interested in it like I am :-)

Suggestions for duplicate file finder algorithm (using C)

I wanted to write a program that test if two files are duplicates (have exactly the same content). First I test if the files have the same sizes, and if they have i start to compare their contents.
My first idea, was to "split" the files into fixed size blocks, then start a thread for every block, fseek to startup character of every block and continue the comparisons in parallel. When a comparison from a thread fails, the other working threads are canceled, and the program exits out of the thread spawning loop.
The code looks like this:
dupf.h
#ifndef __NM__DUPF__H__
#define __NM__DUPF__H__
#define NUM_THREADS 15
#define BLOCK_SIZE 8192
/* Thread argument structure */
struct thread_arg_s {
const char *name_f1; /* First file name */
const char *name_f2; /* Second file name */
int cursor; /* Where to seek in the file */
};
typedef struct thread_arg_s thread_arg;
/**
* 'arg' is of type thread_arg.
* Checks if the specified file blocks are
* duplicates.
*/
void *check_block_dup(void *arg);
/**
* Checks if two files are duplicates
*/
int check_dup(const char *name_f1, const char *name_f2);
/**
* Returns a valid pointer to a file.
* If the file (given by the path/name 'fname') cannot be opened
* in 'mode', the program is interrupted an error message is shown.
**/
FILE *safe_fopen(const char *name, const char *mode);
#endif
dupf.c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include "dupf.h"
FILE *safe_fopen(const char *fname, const char *mode)
{
FILE *f = NULL;
f = fopen(fname, mode);
if (f == NULL) {
char emsg[255];
sprintf(emsg, "FOPEN() %s\t", fname);
perror(emsg);
exit(-1);
}
return (f);
}
void *check_block_dup(void *arg)
{
const char *name_f1 = NULL, *name_f2 = NULL; /* File names */
FILE *f1 = NULL, *f2 = NULL; /* Streams */
int cursor = 0; /* Reading cursor */
char buff_f1[BLOCK_SIZE], buff_f2[BLOCK_SIZE]; /* Character buffers */
int rchars_1, rchars_2; /* Readed characters */
/* Initializing variables from 'arg' */
name_f1 = ((thread_arg*)arg)->name_f1;
name_f2 = ((thread_arg*)arg)->name_f2;
cursor = ((thread_arg*)arg)->cursor;
/* Opening files */
f1 = safe_fopen(name_f1, "r");
f2 = safe_fopen(name_f2, "r");
/* Setup cursor in files */
fseek(f1, cursor, SEEK_SET);
fseek(f2, cursor, SEEK_SET);
/* Initialize buffers */
rchars_1 = fread(buff_f1, 1, BLOCK_SIZE, f1);
rchars_2 = fread(buff_f2, 1, BLOCK_SIZE, f2);
if (rchars_1 != rchars_2) {
/* fread failed to read the same portion.
* program cannot continue */
perror("ERROR WHEN READING BLOCK");
exit(-1);
}
while (rchars_1-->0) {
if (buff_f1[rchars_1] != buff_f2[rchars_1]) {
/* Different characters */
fclose(f1);
fclose(f2);
pthread_exit("notdup");
}
}
/* Close streams */
fclose(f1);
fclose(f2);
pthread_exit("dup");
}
int check_dup(const char *name_f1, const char *name_f2)
{
int num_blocks = 0; /* Number of 'blocks' to check */
int num_tsp = 0; /* Number of threads spawns */
int tsp_iter = 0; /* Iterator for threads spawns */
pthread_t *tsp_threads = NULL;
thread_arg *tsp_threads_args = NULL;
int tsp_threads_iter = 0;
int thread_c_res = 0; /* Thread creation result */
int thread_j_res = 0; /* Thread join res */
int loop_res = 0; /* Function result */
int cursor;
struct stat buf_f1;
struct stat buf_f2;
if (name_f1 == NULL || name_f2 == NULL) {
/* Invalid input parameters */
perror("INVALID FNAMES\t");
return (-1);
}
if (stat(name_f1, &buf_f1) != 0 || stat(name_f2, &buf_f2) != 0) {
/* Stat fails */
char emsg[255];
sprintf(emsg, "STAT() ERROR: %s %s\t", name_f1, name_f2);
perror(emsg);
return (-1);
}
if (buf_f1.st_size != buf_f2.st_size) {
/* File have different sizes */
return (1);
}
/* Files have the same size, function exec. is continued */
num_blocks = (buf_f1.st_size / BLOCK_SIZE) + 1;
num_tsp = (num_blocks / NUM_THREADS) + 1;
cursor = 0;
for (tsp_iter = 0; tsp_iter < num_tsp; tsp_iter++) {
loop_res = 0;
/* Create threads array for this spawn */
tsp_threads = malloc(NUM_THREADS * sizeof(*tsp_threads));
if (tsp_threads == NULL) {
perror("TSP_THREADS ALLOC FAILURE\t");
return (-1);
}
/* Create arguments for every thread in the current spawn */
tsp_threads_args = malloc(NUM_THREADS * sizeof(*tsp_threads_args));
if (tsp_threads_args == NULL) {
perror("TSP THREADS ARGS ALLOCA FAILURE\t");
return (-1);
}
/* Initialize arguments and create threads */
for (tsp_threads_iter = 0; tsp_threads_iter < NUM_THREADS;
tsp_threads_iter++) {
if (cursor >= buf_f1.st_size) {
break;
}
tsp_threads_args[tsp_threads_iter].name_f1 = name_f1;
tsp_threads_args[tsp_threads_iter].name_f2 = name_f2;
tsp_threads_args[tsp_threads_iter].cursor = cursor;
thread_c_res = pthread_create(
&tsp_threads[tsp_threads_iter],
NULL,
check_block_dup,
(void*)&tsp_threads_args[tsp_threads_iter]);
if (thread_c_res != 0) {
perror("THREAD CREATION FAILURE");
return (-1);
}
cursor+=BLOCK_SIZE;
}
/* Join last threads and get their status */
while (tsp_threads_iter-->0) {
void *thread_res = NULL;
thread_j_res = pthread_join(tsp_threads[tsp_threads_iter],
&thread_res);
if (thread_j_res != 0) {
perror("THREAD JOIN FAILURE");
return (-1);
}
if (strcmp((char*)thread_res, "notdup")==0) {
loop_res++;
/* Closing other threads and exiting by condition
* from loop. */
while (tsp_threads_iter-->0) {
pthread_cancel(tsp_threads[tsp_threads_iter]);
}
}
}
free(tsp_threads);
free(tsp_threads_args);
if (loop_res > 0) {
break;
}
}
return (loop_res > 0) ? 1 : 0;
}
The function works fine (at least for what I've tested). Still, some guys from #C (freenode) suggested that the solution is overly complicated, and it may perform poorly because of parallel reading on hddisk.
What I want to know:
Is the threaded approach flawed by default ?
Is fseek() so slow ?
Is there a way to somehow map the files to memory and then compare them ?
LATED EDIT:
Today I had some time, and I've followed your advices. You were right, this threaded version actually performs worse than a single threaded version, and all because of the parallel readings on hard disk.
Another thing is that I've written a function that uses mmap(), and until now is the optimal one. Still the biggest drawback of that function is that it fails, when the files are getting really big.
Here is the new implementation (a very brute and direct code):
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include "dupf.h"
/**
* Safely assures that a file is opened.
* If cannot open file, the flow of the program is interrupted.
* The error code returned is -1.
**/
FILE *safe_fopen(const char *fname, const char *mode)
{
FILE *f = NULL;
f = fopen(fname, mode);
if (f == NULL) {
char emsg[1024];
sprintf(emsg, "Cannot open file: %s\t", fname);
perror(emsg);
exit(-1);
}
return (f);
}
/**
* Check if two files have the same size.
* Returns:
* -1 Error.
* 0 If they have the same size.
* 1 If the don't have the same size.
**/
int check_same_size(const char *f1_name, const char *f2_name, off_t *f1_size, off_t *f2_size)
{
struct stat f1_stat, f2_stat;
if((f1_name == NULL) || (f2_name == NULL)){
fprintf(stderr, "Invalid filename passed to function [check_same_size].\n");
return (-1);
}
if((stat(f1_name, &f1_stat) != 0) || (stat(f2_name, &f2_stat) !=0)){
fprintf(stderr, "Cannot apply stat. [check_same_size].\n");
return (-1);
}
if(f1_size != NULL){
*f1_size = f1_stat.st_size;
}
if(f2_size != NULL){
*f2_size = f2_stat.st_size;
}
return (f1_stat.st_size == f2_stat.st_size) ? 0 : 1;
}
/**
* Test if two files are duplicates.
* Returns:
* -1 Error.
* 0 If they are duplicates.
* 1 If they are not duplicates.
**/
int check_dup_plain(char *f1_name, char *f2_name, int block_size)
{
if ((f1_name == NULL) || (f2_name == NULL)){
fprintf(stderr, "Invalid filename passed to function [check_dup_plain].\n");
return (-1);
}
FILE *f1 = NULL, *f2 = NULL;
char f1_buff[block_size], f2_buff[block_size];
size_t rch1, rch2;
if(check_same_size(f1_name, f2_name, NULL, NULL) == 1){
return (1);
}
f1 = safe_fopen(f1_name, "r");
f2 = safe_fopen(f2_name, "r");
while(!feof(f1) && !feof(f2)){
rch1 = fread(f1_buff, 1, block_size, f1);
rch2 = fread(f2_buff, 1, block_size, f2);
if(rch1 != rch2){
fprintf(stderr, "Invalid reading from file. Cannot continue. [check_dup_plain].\n");
return (-1);
}
while(rch1-->0){
if(f1_buff[rch1] != f2_buff[rch1]){
return (1);
}
}
}
fclose(f1);
fclose(f2);
return (0);
}
/**
* Test if two files are duplicates.
* Returns:
* -1 Error.
* 0 If they are duplicates.
* 1 If they are not duplicates.
**/
int check_dup_memmap(char *f1_name, char *f2_name)
{
struct stat f1_stat, f2_stat;
char *f1_array = NULL, *f2_array = NULL;
off_t f1_size, f2_size;
int f1_des, f2_des, cont, res;
if((f1_name == NULL) || (f2_name == NULL)){
fprintf(stderr, "Invalid filename passed to function [check_dup_memmap].\n");
return (-1);
}
if(check_same_size(f1_name, f2_name, &f1_size, &f2_size) == 1){
return (1);
}
f1_des = open(f1_name, O_RDONLY);
f2_des = open(f2_name, O_RDONLY);
if((f1_des == -1) || (f2_des == -1)){
perror("Cannot open file");
exit(-1);
}
f1_array = mmap(0, f1_size * sizeof(*f1_array), PROT_READ, MAP_SHARED, f1_des, 0);
if(f1_array == NULL){
fprintf(stderr, "Cannot map file to memory [check_dup_memmap].\n");
return (-1);
}
f2_array = mmap(0, f2_size * sizeof(*f2_array), PROT_READ, MAP_SHARED, f2_des, 0);
if(f2_array == NULL){
fprintf(stderr, "Cannot map file to memory [check_dup_memmap].\n");
return (-1);
}
cont = f1_size;
res = 0;
while(cont-->0){
if(f1_array[cont]!=f2_array[cont]){
res = 1;
break;
}
}
munmap((void*) f1_array, f1_size * sizeof(*f1_array));
munmap((void*) f2_array, f2_size * sizeof(*f2_array));
return res;
}
int main(int argc, char *argv[])
{
printf("result: %d\n",check_dup_memmap("f2","f1"));
return (0);
}
I am planning now to extend this code, by re-adding the threaded functionality, but this time the reading will be on memory.
Thanks for your answers.

The limiting factor will be disk reads, which (assuming that both files are on the same disk) will be serialized anyway, so I don't think threading will help much at all.

You could probably simplify your code greatly by using hashes, instead of doing a byte-by-byte comparison. Assuming you're not doing anything important, like deleting, an md5 or similar hash function should be plenty. Boost provides quite a few, and they're usually pretty fast.
if fileA.size == fileB.size
if fileA.hash() == fileB.hash()
flag(fileA, fileB, same);
I wouldn't delete files after that comparison, but it's plenty safe to move them to a temporary directory for further review or just build a list of possible duplicates.

It's hard to guess about performance without a real system to test against (for example if you're using a solid state drive, there's no head seek time and the cost of reading different sectors from different threads is almost zero).
If this is running against a reasonably standard computer with regular (spinning platter) hard drives, having multiple threads contend for the part of the disk they want to read from will possibly slow things down (depending, again, on the hardware and also the size of the chunks).
If the time it takes to compute the "sameness" of a chunk is fast compared to the time it takes to read that chunk from disk, having a separate thread will not help much since the second (or third...) thread would spend most of it's time waiting for IO to complete anyway.
Another factor is the cache size of the CPU. If all of the memory you're processing at one time fits in the CPU cache, things will be much faster than if different threads cause different chunks of memory to be loaded into cache as they execute instructions.
If you have more threads than you have CPU cores, you will just slow things down by making unnecessary context switches (since a thread needs a core to run on).
After reading all of that, if you still think multithreading is going to help for your target system, consider one thread that does IO only, places the data in a queue, and has two or more worker threads taking data off of the queue to process. That way, you optimize disk IO and can take advantage of multiple cores to crunch the numbers.
Steve suggested you can memory map you files on Unix. That will speed up access to the underlying data a bit by leveraging low level OS functionality (the same kind used to manage swap files). That will give you some performance improvement as the OS will handle loading the parts of the file you are working on into memory efficiently, as long as the file fits into available address space. FYI you can do the same thing on Windows.

Before even considering the performance effects of parallel disk reads and thread overhead and such...
Is there any reason to believe that scanning the files in chunks will find the differences any quicker than straight through? Is the data contained in the files predominantly in a certain format, and if so, is the splitting scheme tailored to it? If not, I don't see how scanning the files by skipping over every n bytes (which is all the multithreaded splitting is effectively doing) could offer any improvement over reading the bytes in the order they are on disk.
Think of the two limiting cases -- "splitting" the file into one block, and splitting the file into as many one-byte "blocks" as there are bytes in the file. Will either of those cases be more efficient than the other, or some in-between value? If there is no in-between value that you know you should optimize to, then you know nothing about how the data is stored in the files, so it should make no difference how you scan them.
Even if you set the split to optimize to the disk's performance like block size, you're still going to have to go back to read the next byte, which will likely be at an extremely non-optimal position. And in the end you're going to have to read every single byte in the file, no matter how you split it.

Because you're using pthreads, I assume you're working in a Unix environment -- in which case you could mmap(2) both files into memory and compare the memory arrays directly.

Well, there is the standard memory mapping mmap() function that maps a file to memory. You should be able to do something like
int fd1;
int fd2;
int size1;
int size2;
fd1 = open(name1, O_RDONLY);
size1 = lseek(fd1, 0, SEEK_END);
fd2 = open(name2, O_RDONLY);
size2 = lseek(fd2, 0, SEEK_END);
if ( size1 == size2 )
{
char * data1 = mmap(0, size1, PROT_READ, MAP_SHARED, fd1, 0);
char * data2 = mmap(0, size1, PROT_READ, MAP_SHARED, fd2, 0);
int i;
/* ...and this is, obviously, where you'd do something more clever */
for ( i = 0; i < size1 && *data1 == *data2; i++, data1++, data2++ );
if ( i == size1 )
printf("Equal\n");
}
close(fd1);
close(fd2);
Other than that, yes, your solution looks overly complicated ;-) The threaded approach is not necessarily flawed, but you might not see that parallel access improves performance. For SAN drives or ramdisks it might improve performance, for normal spinning platter drives it might impede it. But simpler is usually better, unless you really have a performance issue.
Regarding fseek() vs other methods, it depends on the operating system you use. Google is you friend here, you can easily find articles at least for Solaris and Linux.

Even if disk access was not the limiting factor (it will be), unless you have a multi-core processor that could hand off different threads to different cores, you would not see a speed-up from going multi-threaded. Basically, you have to compare all N bytes of the file one way or another, and even if you use threads, if they execute in the same core, it will take the same amount of time as without using threads.
There are some environments that could spread the workload across cores, but even so, the CPU will be able to process so much faster than the data can be pulled in from disk that the disk I/O system will be the limiting factor.