Why can I not mmap /proc/self/maps? - c

To be specific: why can I do this:
FILE *fp = fopen("/proc/self/maps", "r");
char buf[513]; buf[512] = '\0';
while(fgets(buf, 512, fp) != NULL) printf("%s", buf);
but not this:
int fd = open("/proc/self/maps", O_RDONLY);
struct stat s;
fstat(fd, &s); // st_size = 0 -> why?
char *file = mmap(0, s.st_size /*or any fixed size*/, PROT_READ, MAP_PRIVATE, fd, 0); // gives EINVAL for st_size (because 0) and ENODEV for any fixed block
write(1, file, s.st_size);
I know that /proc files are not really files, but the file seems to have some defined size and content for the FILE* version. Is the content secretly generated on the fly for read() or something? What am I missing here?
EDIT:
As I can clearly read() from them, is there any way to get the number of available bytes in advance, or am I stuck reading until EOF?

They are created on the fly as you read them. Maybe this will help: it is a tutorial showing how a proc file can be implemented:
https://devarea.com/linux-kernel-development-creating-a-proc-file-and-interfacing-with-user-space/
tl;dr: you give it a name and read and write handlers, that's it. Proc files are meant to be very simple to implement from the kernel dev's point of view. They do not behave like full-featured files though.
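To make that concrete, here is a hedged sketch of about the smallest possible proc file, written as a kernel module. It assumes a kernel recent enough (5.6+) that proc_create() takes a struct proc_ops, and all names like demo_show are made up for illustration:
    #include <linux/module.h>
    #include <linux/proc_fs.h>
    #include <linux/seq_file.h>

    /* The "read handler": generates the file content on every read. */
    static int demo_show(struct seq_file *m, void *v)
    {
        seq_printf(m, "generated on the fly at read time\n");
        return 0;
    }

    static int demo_open(struct inode *inode, struct file *file)
    {
        return single_open(file, demo_show, NULL);
    }

    static const struct proc_ops demo_ops = {
        .proc_open    = demo_open,
        .proc_read    = seq_read,
        .proc_lseek   = seq_lseek,
        .proc_release = single_release,
    };

    static int __init demo_init(void)
    {
        proc_create("demo", 0444, NULL, &demo_ops);  /* appears as /proc/demo */
        return 0;
    }

    static void __exit demo_exit(void)
    {
        remove_proc_entry("demo", NULL);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");
Note that there is no size anywhere in that sketch: the content exists only while a reader is consuming it, which is exactly why fstat() reports st_size == 0.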
As for the bonus question, there doesn't seem to be a way to indicate the size of the file, only EOF on reading.

proc "files" are not really files, they are just streams that can be read/written from, but they contain no pyhsical data in memory you can map to.
https://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html

As already explained by others, /proc and /sys are pseudo-filesystems, consisting of data provided by the kernel, that does not really exist until it is read – the kernel generates the data then and there. Since the size varies, and really is unknown until the file is opened for reading, it is not provided to userspace at all.
It is not "unfortunate", however. The same situation occurs very often, for example with character devices (under /dev), pipes, FIFOs (named pipes), and sockets.
We can trivially write a helper function to read pseudofiles completely, using dynamic memory management. For example:
// SPDX-License-Identifier: CC0-1.0
//
#define _POSIX_C_SOURCE 200809L
#define _ATFILE_SOURCE
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
/* For example main() */
#include <stdio.h>
/* Return a directory handle for a specific relative directory.
   For absolute paths and paths relative to current directory, use dirfd==AT_FDCWD.
*/
int at_dir(const int dirfd, const char *dirpath)
{
    if (dirfd == -1 || !dirpath || !*dirpath) {
        errno = EINVAL;
        return -1;
    }
    return openat(dirfd, dirpath, O_DIRECTORY | O_PATH | O_CLOEXEC);
}
/* Read the (pseudofile) contents to a dynamically allocated buffer.
   For absolute paths and paths relative to the current directory, use dirfd==AT_FDCWD.
   You can safely initialize *dataptr=NULL, *sizeptr=0 for dynamic allocation,
   or reuse the buffer from a previous call or e.g. getline().
   Returns 0 with errno set if an error occurs. If the file is empty, errno==0.
   In all cases, remember to free (*dataptr) after it is no longer needed.
*/
size_t read_pseudofile_at(const int dirfd, const char *path, char **dataptr, size_t *sizeptr)
{
    char *data;
    size_t size, have = 0;
    ssize_t n;
    int desc;

    if (!path || !*path || !dataptr || !sizeptr) {
        errno = EINVAL;
        return 0;
    }

    /* Existing dynamic buffer, or a new buffer? */
    size = *sizeptr;
    if (!size)
        *dataptr = NULL;
    data = *dataptr;

    /* Open pseudofile. */
    desc = openat(dirfd, path, O_RDONLY | O_CLOEXEC | O_NOCTTY);
    if (desc == -1) {
        /* errno set by openat(). */
        return 0;
    }

    while (1) {

        /* Need to resize buffer? */
        if (have >= size) {
            /* For pseudofiles, linear size growth makes most sense. */
            size = (have | 4095) + 4097 - 32;
            data = realloc(data, size);
            if (!data) {
                close(desc);
                errno = ENOMEM;
                return 0;
            }
            *dataptr = data;
            *sizeptr = size;
        }

        n = read(desc, data + have, size - have);
        if (n > 0) {
            have += n;
        } else
        if (n == 0) {
            break;
        } else
        if (n == -1) {
            const int saved_errno = errno;
            close(desc);
            errno = saved_errno;
            return 0;
        } else {
            close(desc);
            errno = EIO;
            return 0;
        }
    }

    if (close(desc) == -1) {
        /* errno set by close(). */
        return 0;
    }
    /* Append zeroes - we know size > have at this point.
       Zero only as much as actually fits in the buffer. */
    if (have + 32 > size)
        memset(data + have, 0, size - have);
    else
        memset(data + have, 0, 32);

    errno = 0;
    return have;
}
int main(void)
{
    char *data = NULL;
    size_t size = 0;
    size_t len;
    int selfdir;

    selfdir = at_dir(AT_FDCWD, "/proc/self/");
    if (selfdir == -1) {
        fprintf(stderr, "/proc/self/ is not available: %s.\n", strerror(errno));
        exit(EXIT_FAILURE);
    }

    len = read_pseudofile_at(selfdir, "status", &data, &size);
    if (errno) {
        fprintf(stderr, "/proc/self/status: %s.\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
    printf("/proc/self/status: %zu bytes\n%s\n", len, data);

    len = read_pseudofile_at(selfdir, "maps", &data, &size);
    if (errno) {
        fprintf(stderr, "/proc/self/maps: %s.\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
    printf("/proc/self/maps: %zu bytes\n%s\n", len, data);

    close(selfdir);
    free(data); data = NULL; size = 0;
    return EXIT_SUCCESS;
}
The above example program opens a directory descriptor ("atfile handle") to /proc/self. (This way you do not need to concatenate strings to construct paths.)
It then reads the contents of /proc/self/status. If successful, it displays its size (in bytes) and its contents.
Next, it reads the contents of /proc/self/maps, reusing the previous buffer. If successful, it displays its size and contents as well.
Finally, the directory descriptor is closed as it is no longer needed, and the dynamically allocated buffer released.
Note that it is perfectly safe to do free(NULL), and also to discard the dynamic buffer (free(data); data=NULL; size=0;) between the read_pseudofile_at() calls.
Because pseudofiles are typically small, read_pseudofile_at() uses a linear buffer-growth policy: if there is no previous buffer, it starts with 8160 bytes, and grows by about 4096 bytes at a time until sufficiently large. Feel free to replace it with whatever growth policy you prefer; this one is just an example, but it works quite well in practice without wasting much memory.

Related

How to use write() and read() in unistd.h

I'm trying to use the functions read() and write() from unistd.h, but whenever I try to input anything, it does not work. I am only allowed to use functions from fcntl.h and unistd.h, not those from stdio.h.
Here is my code:
#include <fcntl.h>
#include <unistd.h>
int main() {
    int fd_in = open("/dev/pts/5", O_RDONLY);
    int fd_write = open("/dev/pts/log.txt", O_RDWR);
    char buf[20];
    ssize_t bytes_read;

    if (fd_in == -1) {
        char out[] = "Error in opening file";
        write(fd_write, out, sizeof(out));
    }

    //using a while loop to read from input
    while ((bytes_read = read(fd_in, buf, sizeof(buf))) > 0) {
        char msg[] = "Block read: \n<%s>\n";
        read(fd_write, msg, sizeof(msg));
        //continue with other parts
    }
}
The problem is that I don't get the desired output for the inputs I provide. For example:
//input
Hello
//output
Block read:
<Hello>
I wrote example code showing how to use read(2) and write(2). I don't know whether you need to use /dev/pts/ or not; I have never used it, so I don't use it here either. Maybe my example will be helpful anyway.
The header string.h is included only for strlen(3).
#include <unistd.h>
#include <string.h>
int main (void) {
    size_t input_size = 50;

    // "+ 1" is for storing '\0'
    char buffer[input_size + 1];

    // We don't use the return value of memset(3), but it's good
    // to know anyway that there is one. See also
    // https://stackoverflow.com/q/13720428/20276305
    memset(buffer, '\0', input_size + 1);

    ssize_t bytes_read_count = -1;
    ssize_t bytes_written_count = -1;

    // Reading
    bytes_read_count = read(STDIN_FILENO, buffer, input_size);
    if (bytes_read_count == -1) {
        // No return value checking (and also below). It would make
        // little sense here since we exit the function directly after
        // write(2), no matter if write(2) succeeded or not
        write(STDERR_FILENO, "Error1\n", strlen("Error1\n"));
        return 1;
    }

    // We want to be sure to have a proper string, just in case we
    // would like to perform more operations on it one day. So, we
    // need to explicitly end the array with '\0'. We need to do it
    // regardless of the earlier memset(3) call because the user might
    // input more than input_size, so all the '\0' would be overwritten
    buffer[input_size] = '\0';

    // Writing
    bytes_written_count = write(STDOUT_FILENO, buffer, bytes_read_count);
    if (bytes_written_count == -1) {
        write(STDERR_FILENO, "Error2\n", strlen("Error2\n"));
        return 1;
    }
    return 0;
}
Edit: I added a comment about the memset(3) return value, and also removed the check of that value since it seemed unnecessary.
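As a side note on the question itself: the loop calls read() on fd_write where a write() was presumably intended, and msg contains a printf-style "%s" that write() will never substitute. A minimal sketch of a corrected loop (assuming fd_in and fd_write are open and valid) might look like:
    while ((bytes_read = read(fd_in, buf, sizeof(buf))) > 0) {
        char prefix[] = "Block read:\n<";
        write(fd_write, prefix, sizeof(prefix) - 1);  /* -1: don't write the '\0' */
        write(fd_write, buf, bytes_read);             /* echo exactly what was read */
        write(fd_write, ">\n", 2);
    }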

Why am I getting free(): invalid pointer?

I am currently reading Linux System Programming by Robert Love and am stuck on the read() example that takes care of all five error cases. I am getting a free(): invalid pointer error. I am assuming that it has something to do with advancing the buffer in case the read is not finished.
It works if I store the offset and return the pointer to its original position. This is not mentioned in the book. Is there a better approach?
#include <stdio.h>
#include <malloc.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
int main()
{
    int fd;
    if( (fd = open("someFile.txt", O_CREAT | O_WRONLY | O_TRUNC, 0664)) < 0)
        perror("open for write");

    char * text = "This is an example text";
    write(fd, text, strlen(text));
    close(fd);

    if( (fd = open("someFile.txt", O_RDONLY)) < 0)
        perror("open");

    char *buf;
    if( (buf = (char *) calloc(50, sizeof(char))) == NULL )
        perror("calloc");

    int len = 50;
    ssize_t ret;

    /* If I store the offset in a variable it works */
    off_t offset = 0;

    while (len != 0 && (ret = read (fd, buf, len)) != 0) {
        if (ret == -1) {
            if (errno == EINTR)
                continue;
            perror ("read");
            break;
        }
        len -= ret;
        buf += ret;
        offset += ret; // Offset stored here
    }

    if( close(fd) == -1 )
        perror("close");

    buf -= offset; // Here I return the pointer to its original position
    free(buf);
    return 0;
}
There are multiple bugs in this code.
First, perror is not being used correctly, as it only prints an error -- there should also be code here to abort on errors, so subsequent code doesn't try to use results from operations that failed.
Secondly, only the pointer returned by calloc can be given to free. The result is saved in buf, but then later code changes the value of buf and tries to free the changed value. Storing the changes in offset should fix this, but it is an error-prone solution at best: if you have multiple code paths that modify buf, you have to make sure every one of them also modifies offset in the same way.
A better approach would be to not modify buf, and instead use a second pointer variable in the read that is initialized to the value of buf and then gets modified after each read.
As pointed out, in the original version of this code the number given to calloc differed from the number len is initialized to (20 vs. 50). This is a perfect example of the misuse of magic numbers: both should be replaced with the same symbol (a variable, constant, or #define) so that you don't get a buffer-overrun error.
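A minimal sketch of the second-pointer approach suggested above, replacing the question's read loop (the surrounding setup stays the same):
    char *p = buf;                  /* walking pointer; buf keeps the calloc() value */
    while (len != 0 && (ret = read(fd, p, len)) != 0) {
        if (ret == -1) {
            if (errno == EINTR)
                continue;
            perror("read");
            break;
        }
        len -= ret;
        p += ret;                   /* advance the cursor, not buf */
    }
    if (close(fd) == -1)
        perror("close");
    free(buf);                      /* buf is still the pointer calloc() returned */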

Read a file a number of bytes at a time in C

I am trying to write a program that reads a file 10 bytes at a time using read(); however, I do not know how to go about it. How should I modify this code to read 10 bytes at a time? Thanks!!!!
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>

int main (int argc, char *argv[])
{
    printf("I am here1\n");
    int fd, readd = 0;
    char* buf[1024];
    printf("I am here2\n");

    fd = open("text.txt", O_RDWR);
    if (fd == -1)
    {
        perror("open failed");
        exit(1);
    }
    else
    {
        printf("I am here3\n");
        if(read("text.txt", buf, 1024) < 0)
            printf("read error\n");
        else
        {
            printf("I am here3\n");
            /*******************************
             * I suspect this should be the place I make the modification
             *******************************/
            if(read("text.txt", buf, 1024) < 0)
                printf("read error\n");
            else
            {
                printf("I am here4\n");
                printf("\nN: %c", buf);
                if(write(fd, buf, readd) != readd)
                    printf("write error\n");
            }
        }
        return 0;
    }
}
The final parameter of read() is the maximum size of the data you wish to read so, to try and read ten bytes at a time, you would need:
read (fd, buf, 10)
You'll notice I've also changed the first parameter to the file descriptor rather than the file name string.
Now, you'll probably want that in a loop since you'll want to do something with the data, and you also need to check the return value since it can give you less than what you asked for.
A good example for doing this would be:
int copyTenAtATime (char *infile, char *outfile) {
    // Buffer details (size and data).
    int sz;
    char buff[10];

    // Try open input and output.
    int ifd = open (infile, O_RDWR);
    int ofd = open (outfile, O_WRONLY|O_CREAT, 0644);  // O_CREAT requires a mode

    // Do nothing unless both opened okay.
    if ((ifd >= 0) && (ofd >= 0)) {
        // Read chunk, stopping on error or end of file.
        while ((sz = read (ifd, buff, sizeof (buff))) > 0) {
            // Write chunk, flagging error if not all written.
            if (write (ofd, buff, sz) != sz) {
                sz = -1;
                break;
            }
        }
    }

    // Finished or errored here, close files that were opened.
    if (ifd >= 0) close (ifd);
    if (ofd >= 0) close (ofd);

    // Return zero if all okay, otherwise error indicator.
    return (sz == 0) ? 0 : -1;
}
Change the arguments in your read call:
    read(fd, buf, 10);
From the man page of read(2):
    ssize_t read(int fd, void *buf, size_t count);
read() attempts to read up to count bytes from file descriptor fd into the buffer starting at buf.
    if(read("text.txt", buf, 1024) < 0) // this will give you the error.
The first argument must be a file descriptor, not a file name.

Using system calls to implement the unix cat command

For my OS class I have the assignment of implementing Unix's cat command with system calls (no scanf or printf). Here's what I got so far:
(Edited thanks to responses)
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
int main(void)
{
    int fd1;
    int fd2;
    char *buffer1;
    buffer1 = (char *) calloc(100, sizeof(char));
    char *buffer2;
    buffer2 = (char *) calloc(100, sizeof(char));

    fd1 = open("input.in", O_RDONLY);
    fd2 = open("input2.in", O_RDONLY);

    while(eof1){ //<-lseek condition to add here
        read (fd1, buffer1, /*how much to read here?*/ );
        write(1, buffer1, sizeof(buffer1)-1);
    }
    while (eof2){
        read (fd2, buffer2, /*how much to read here?*/);
        write(1, buffer2, sizeof(buffer2)-1);
    }
}
The examples I have seen only show read with a known number of bytes. I don't know how many bytes each of the files will have, so how do I specify read's last parameter?
Before you can read into a buffer, you have to allocate one. Either on the stack (easiest) or with mmap.
perror is a complicated library function, not a system call.
exit is not a system call on Linux. But _exit is.
Don't write more bytes than you have read before.
Or, in general: Read the documentation on all these system calls.
Edit: Here is my code, using only system calls. The error handling is somewhat limited, since I didn't want to re-implement perror.
#include <fcntl.h>
#include <unistd.h>

static int
cat_fd(int fd) {
    char buf[4096];
    ssize_t nread;

    while ((nread = read(fd, buf, sizeof buf)) > 0) {
        ssize_t ntotalwritten = 0;
        while (ntotalwritten < nread) {
            ssize_t nwritten = write(STDOUT_FILENO, buf + ntotalwritten, nread - ntotalwritten);
            if (nwritten < 1)
                return -1;
            ntotalwritten += nwritten;
        }
    }
    return nread == 0 ? 0 : -1;
}

static int
cat(const char *fname) {
    int fd, success;

    if ((fd = open(fname, O_RDONLY)) == -1)
        return -1;
    success = cat_fd(fd);
    if (close(fd) != 0)
        return -1;
    return success;
}

int
main(int argc, char **argv) {
    int i;

    if (argc == 1) {
        if (cat_fd(STDIN_FILENO) != 0)
            goto error;
    } else {
        for (i = 1; i < argc; i++) {
            if (cat(argv[i]) != 0)
                goto error;
        }
    }
    return 0;

error:
    write(STDOUT_FILENO, "error\n", 6);
    return 1;
}
You need to read as many bytes as will fit in the buffer. Right now, you don't have a buffer yet, all you got is a pointer to a buffer. That isn't initialized to anything. Chicken-and-egg, you therefore don't know how many bytes to read either.
Create a buffer.
There is usually no need to read the entire file in one gulp. Choosing a buffer size that is the same as, or a multiple of, the host operating system's memory page size is a good way to go. One or two times the page size is probably good enough.
Using buffers that are too big can actually cause your program to run worse because they put pressure on the virtual memory system and can cause paging.
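For instance, a small helper along these lines picks the buffer size from the page size at run time (alloc_read_buffer is a hypothetical name; sysconf() can return -1, hence the fallback):
    #include <stdlib.h>
    #include <unistd.h>

    static char *alloc_read_buffer(size_t *out_size)
    {
        long page = sysconf(_SC_PAGESIZE);        /* typically 4096 on Linux */
        *out_size = (page > 0) ? 2 * (size_t)page : 8192;
        return malloc(*out_size);                 /* caller checks for NULL and frees */
    }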
You could use open, fstat, mmap, madvise and write to make a very efficient cat command.
If Linux-specific, you could use open, fstat, fadvise and splice to make an even more efficient cat command.
The advise calls specify the SEQUENTIAL flags, which tell the kernel to do aggressive read-ahead on the file.
If you like to be polite to the rest of the system and minimize buffer cache use, you can do your copy in chunks of 32 megabytes or so and use the advise DONTNEED flags on the parts already read.
Note:
The above will only work if the source is a file. If fstat fails to provide a size, then you must fall back to using an allocated buffer and read/write. You can use splice too.
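A hedged sketch of the mmap variant described above (cat_mmap is a made-up name; real code would fall back to the read()/write() loop whenever fstat() reports no usable size, e.g. for pipes or /proc files):
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int cat_mmap(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd == -1)
            return -1;

        struct stat st;
        if (fstat(fd, &st) == -1 || st.st_size == 0) {
            close(fd);
            return -1;                       /* fall back to read()/write() here */
        }

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        madvise(p, st.st_size, MADV_SEQUENTIAL);  /* hint: aggressive read-ahead */

        off_t off = 0;
        while (off < st.st_size) {
            ssize_t n = write(STDOUT_FILENO, p + off, st.st_size - off);
            if (n < 1)
                break;
            off += n;
        }

        munmap(p, st.st_size);
        close(fd);
        return off == st.st_size ? 0 : -1;
    }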
Use the stat function to find the size of your files before you read them. Alternatively, you can read chunks until you get an EOF.

Suggestions for duplicate file finder algorithm (using C)

I wanted to write a program that tests if two files are duplicates (have exactly the same content). First I test if the files have the same size, and if they do I start to compare their contents.
My first idea was to "split" the files into fixed-size blocks, then start a thread for every block, fseek to the start of every block, and run the comparisons in parallel. When a comparison in a thread fails, the other working threads are canceled, and the program exits the thread-spawning loop.
The code looks like this:
dupf.h
#ifndef __NM__DUPF__H__
#define __NM__DUPF__H__

#define NUM_THREADS 15
#define BLOCK_SIZE 8192

/* Thread argument structure */
struct thread_arg_s {
    const char *name_f1;    /* First file name */
    const char *name_f2;    /* Second file name */
    int cursor;             /* Where to seek in the file */
};
typedef struct thread_arg_s thread_arg;

/**
 * 'arg' is of type thread_arg.
 * Checks if the specified file blocks are duplicates.
 */
void *check_block_dup(void *arg);

/**
 * Checks if two files are duplicates
 */
int check_dup(const char *name_f1, const char *name_f2);

/**
 * Returns a valid pointer to a file.
 * If the file (given by the path/name 'fname') cannot be opened
 * in 'mode', the program is interrupted and an error message is shown.
 **/
FILE *safe_fopen(const char *name, const char *mode);

#endif
dupf.c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include "dupf.h"
FILE *safe_fopen(const char *fname, const char *mode)
{
    FILE *f = NULL;
    f = fopen(fname, mode);
    if (f == NULL) {
        char emsg[255];
        sprintf(emsg, "FOPEN() %s\t", fname);
        perror(emsg);
        exit(-1);
    }
    return (f);
}
void *check_block_dup(void *arg)
{
    const char *name_f1 = NULL, *name_f2 = NULL;    /* File names */
    FILE *f1 = NULL, *f2 = NULL;                    /* Streams */
    int cursor = 0;                                 /* Reading cursor */
    char buff_f1[BLOCK_SIZE], buff_f2[BLOCK_SIZE];  /* Character buffers */
    int rchars_1, rchars_2;                         /* Characters read */

    /* Initializing variables from 'arg' */
    name_f1 = ((thread_arg*)arg)->name_f1;
    name_f2 = ((thread_arg*)arg)->name_f2;
    cursor = ((thread_arg*)arg)->cursor;

    /* Opening files */
    f1 = safe_fopen(name_f1, "r");
    f2 = safe_fopen(name_f2, "r");

    /* Setup cursor in files */
    fseek(f1, cursor, SEEK_SET);
    fseek(f2, cursor, SEEK_SET);

    /* Initialize buffers */
    rchars_1 = fread(buff_f1, 1, BLOCK_SIZE, f1);
    rchars_2 = fread(buff_f2, 1, BLOCK_SIZE, f2);
    if (rchars_1 != rchars_2) {
        /* fread failed to read the same portion.
         * program cannot continue */
        perror("ERROR WHEN READING BLOCK");
        exit(-1);
    }

    while (rchars_1-- > 0) {
        if (buff_f1[rchars_1] != buff_f2[rchars_1]) {
            /* Different characters */
            fclose(f1);
            fclose(f2);
            pthread_exit("notdup");
        }
    }

    /* Close streams */
    fclose(f1);
    fclose(f2);
    pthread_exit("dup");
}
int check_dup(const char *name_f1, const char *name_f2)
{
    int num_blocks = 0;          /* Number of 'blocks' to check */
    int num_tsp = 0;             /* Number of thread spawns */
    int tsp_iter = 0;            /* Iterator for thread spawns */
    pthread_t *tsp_threads = NULL;
    thread_arg *tsp_threads_args = NULL;
    int tsp_threads_iter = 0;
    int thread_c_res = 0;        /* Thread creation result */
    int thread_j_res = 0;        /* Thread join result */
    int loop_res = 0;            /* Function result */
    int cursor;
    struct stat buf_f1;
    struct stat buf_f2;

    if (name_f1 == NULL || name_f2 == NULL) {
        /* Invalid input parameters */
        perror("INVALID FNAMES\t");
        return (-1);
    }

    if (stat(name_f1, &buf_f1) != 0 || stat(name_f2, &buf_f2) != 0) {
        /* Stat fails */
        char emsg[255];
        sprintf(emsg, "STAT() ERROR: %s %s\t", name_f1, name_f2);
        perror(emsg);
        return (-1);
    }

    if (buf_f1.st_size != buf_f2.st_size) {
        /* Files have different sizes */
        return (1);
    }

    /* Files have the same size, function exec. is continued */
    num_blocks = (buf_f1.st_size / BLOCK_SIZE) + 1;
    num_tsp = (num_blocks / NUM_THREADS) + 1;
    cursor = 0;

    for (tsp_iter = 0; tsp_iter < num_tsp; tsp_iter++) {
        loop_res = 0;

        /* Create threads array for this spawn */
        tsp_threads = malloc(NUM_THREADS * sizeof(*tsp_threads));
        if (tsp_threads == NULL) {
            perror("TSP_THREADS ALLOC FAILURE\t");
            return (-1);
        }

        /* Create arguments for every thread in the current spawn */
        tsp_threads_args = malloc(NUM_THREADS * sizeof(*tsp_threads_args));
        if (tsp_threads_args == NULL) {
            perror("TSP THREADS ARGS ALLOC FAILURE\t");
            return (-1);
        }

        /* Initialize arguments and create threads */
        for (tsp_threads_iter = 0; tsp_threads_iter < NUM_THREADS;
             tsp_threads_iter++) {
            if (cursor >= buf_f1.st_size) {
                break;
            }
            tsp_threads_args[tsp_threads_iter].name_f1 = name_f1;
            tsp_threads_args[tsp_threads_iter].name_f2 = name_f2;
            tsp_threads_args[tsp_threads_iter].cursor = cursor;
            thread_c_res = pthread_create(
                &tsp_threads[tsp_threads_iter],
                NULL,
                check_block_dup,
                (void*)&tsp_threads_args[tsp_threads_iter]);
            if (thread_c_res != 0) {
                perror("THREAD CREATION FAILURE");
                return (-1);
            }
            cursor += BLOCK_SIZE;
        }

        /* Join last threads and get their status */
        while (tsp_threads_iter-- > 0) {
            void *thread_res = NULL;
            thread_j_res = pthread_join(tsp_threads[tsp_threads_iter],
                                        &thread_res);
            if (thread_j_res != 0) {
                perror("THREAD JOIN FAILURE");
                return (-1);
            }
            if (strcmp((char*)thread_res, "notdup") == 0) {
                loop_res++;
                /* Closing other threads and exiting by condition
                 * from loop. */
                while (tsp_threads_iter-- > 0) {
                    pthread_cancel(tsp_threads[tsp_threads_iter]);
                }
            }
        }

        free(tsp_threads);
        free(tsp_threads_args);
        if (loop_res > 0) {
            break;
        }
    }

    return (loop_res > 0) ? 1 : 0;
}
The function works fine (at least for what I've tested). Still, some guys from #C (freenode) suggested that the solution is overly complicated, and that it may perform poorly because of the parallel reads from the hard disk.
What I want to know:
Is the threaded approach flawed by default?
Is fseek() that slow?
Is there a way to map the files to memory and then compare them?
LATE EDIT:
Today I had some time, and I've followed your advice. You were right: this threaded version actually performs worse than a single-threaded version, and all because of the parallel reads from the hard disk.
Another thing is that I've written a function that uses mmap(), and so far it is the optimal one. Still, the biggest drawback of that function is that it fails when the files get really big.
Here is the new implementation (very brute-force, direct code):
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include "dupf.h"
/**
 * Safely ensures that a file is opened.
 * If the file cannot be opened, the flow of the program is interrupted.
 * The error code returned is -1.
 **/
FILE *safe_fopen(const char *fname, const char *mode)
{
    FILE *f = NULL;
    f = fopen(fname, mode);
    if (f == NULL) {
        char emsg[1024];
        sprintf(emsg, "Cannot open file: %s\t", fname);
        perror(emsg);
        exit(-1);
    }
    return (f);
}
/**
 * Check if two files have the same size.
 * Returns:
 *   -1  Error.
 *    0  If they have the same size.
 *    1  If they don't have the same size.
 **/
int check_same_size(const char *f1_name, const char *f2_name, off_t *f1_size, off_t *f2_size)
{
    struct stat f1_stat, f2_stat;

    if ((f1_name == NULL) || (f2_name == NULL)) {
        fprintf(stderr, "Invalid filename passed to function [check_same_size].\n");
        return (-1);
    }
    if ((stat(f1_name, &f1_stat) != 0) || (stat(f2_name, &f2_stat) != 0)) {
        fprintf(stderr, "Cannot apply stat. [check_same_size].\n");
        return (-1);
    }
    if (f1_size != NULL) {
        *f1_size = f1_stat.st_size;
    }
    if (f2_size != NULL) {
        *f2_size = f2_stat.st_size;
    }
    return (f1_stat.st_size == f2_stat.st_size) ? 0 : 1;
}
/**
 * Test if two files are duplicates.
 * Returns:
 *   -1  Error.
 *    0  If they are duplicates.
 *    1  If they are not duplicates.
 **/
int check_dup_plain(char *f1_name, char *f2_name, int block_size)
{
    if ((f1_name == NULL) || (f2_name == NULL)) {
        fprintf(stderr, "Invalid filename passed to function [check_dup_plain].\n");
        return (-1);
    }

    FILE *f1 = NULL, *f2 = NULL;
    char f1_buff[block_size], f2_buff[block_size];
    size_t rch1, rch2;

    if (check_same_size(f1_name, f2_name, NULL, NULL) == 1) {
        return (1);
    }

    f1 = safe_fopen(f1_name, "r");
    f2 = safe_fopen(f2_name, "r");

    while (!feof(f1) && !feof(f2)) {
        rch1 = fread(f1_buff, 1, block_size, f1);
        rch2 = fread(f2_buff, 1, block_size, f2);
        if (rch1 != rch2) {
            fclose(f1);
            fclose(f2);
            fprintf(stderr, "Invalid reading from file. Cannot continue. [check_dup_plain].\n");
            return (-1);
        }
        while (rch1-- > 0) {
            if (f1_buff[rch1] != f2_buff[rch1]) {
                /* Close streams before returning, so they don't leak. */
                fclose(f1);
                fclose(f2);
                return (1);
            }
        }
    }

    fclose(f1);
    fclose(f2);
    return (0);
}
/**
 * Test if two files are duplicates.
 * Returns:
 *   -1  Error.
 *    0  If they are duplicates.
 *    1  If they are not duplicates.
 **/
int check_dup_memmap(char *f1_name, char *f2_name)
{
    struct stat f1_stat, f2_stat;
    char *f1_array = NULL, *f2_array = NULL;
    off_t f1_size, f2_size;
    int f1_des, f2_des, cont, res;

    if ((f1_name == NULL) || (f2_name == NULL)) {
        fprintf(stderr, "Invalid filename passed to function [check_dup_memmap].\n");
        return (-1);
    }
    if (check_same_size(f1_name, f2_name, &f1_size, &f2_size) == 1) {
        return (1);
    }

    f1_des = open(f1_name, O_RDONLY);
    f2_des = open(f2_name, O_RDONLY);
    if ((f1_des == -1) || (f2_des == -1)) {
        perror("Cannot open file");
        exit(-1);
    }

    /* mmap() returns MAP_FAILED, not NULL, on error. */
    f1_array = mmap(0, f1_size * sizeof(*f1_array), PROT_READ, MAP_SHARED, f1_des, 0);
    if (f1_array == MAP_FAILED) {
        fprintf(stderr, "Cannot map file to memory [check_dup_memmap].\n");
        return (-1);
    }
    f2_array = mmap(0, f2_size * sizeof(*f2_array), PROT_READ, MAP_SHARED, f2_des, 0);
    if (f2_array == MAP_FAILED) {
        fprintf(stderr, "Cannot map file to memory [check_dup_memmap].\n");
        return (-1);
    }

    cont = f1_size;
    res = 0;
    while (cont-- > 0) {
        if (f1_array[cont] != f2_array[cont]) {
            res = 1;
            break;
        }
    }

    munmap((void*) f1_array, f1_size * sizeof(*f1_array));
    munmap((void*) f2_array, f2_size * sizeof(*f2_array));
    close(f1_des);
    close(f2_des);
    return res;
}
int main(int argc, char *argv[])
{
    printf("result: %d\n", check_dup_memmap("f2", "f1"));
    return (0);
}
I am now planning to extend this code by re-adding the threaded functionality, but this time the reading will be from memory.
Thanks for your answers.
The limiting factor will be disk reads, which (assuming that both files are on the same disk) will be serialized anyway, so I don't think threading will help much at all.
You could probably simplify your code greatly by using hashes, instead of doing a byte-by-byte comparison. Assuming you're not doing anything important, like deleting, an md5 or similar hash function should be plenty. Boost provides quite a few, and they're usually pretty fast.
if fileA.size == fileB.size
    if fileA.hash() == fileB.hash()
        flag(fileA, fileB, same);
I wouldn't delete files after that comparison, but it's plenty safe to move them to a temporary directory for further review or just build a list of possible duplicates.
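A hedged sketch of that size-then-hash idea in C, assuming OpenSSL is available (link with -lcrypto; hash_file is a made-up helper name):
    #include <openssl/evp.h>
    #include <stdio.h>

    /* Digest a whole file with MD5; returns 0 on success, -1 on error. */
    static int hash_file(const char *path, unsigned char out[EVP_MAX_MD_SIZE], unsigned int *outlen)
    {
        FILE *f = fopen(path, "rb");
        if (!f)
            return -1;

        EVP_MD_CTX *ctx = EVP_MD_CTX_new();      /* OpenSSL 1.1+ API */
        EVP_DigestInit_ex(ctx, EVP_md5(), NULL);

        unsigned char buf[8192];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            EVP_DigestUpdate(ctx, buf, n);

        int err = ferror(f);
        fclose(f);
        EVP_DigestFinal_ex(ctx, out, outlen);
        EVP_MD_CTX_free(ctx);
        return err ? -1 : 0;
    }
    /* Usage: if the sizes match, hash both files and memcmp() the digests. */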
It's hard to guess about performance without a real system to test against (for example if you're using a solid state drive, there's no head seek time and the cost of reading different sectors from different threads is almost zero).
If this is running against a reasonably standard computer with regular (spinning platter) hard drives, having multiple threads contend for the part of the disk they want to read from will possibly slow things down (depending, again, on the hardware and also the size of the chunks).
If the time it takes to compute the "sameness" of a chunk is fast compared to the time it takes to read that chunk from disk, having a separate thread will not help much, since the second (or third...) thread would spend most of its time waiting for IO to complete anyway.
Another factor is the cache size of the CPU. If all of the memory you're processing at one time fits in the CPU cache, things will be much faster than if different threads cause different chunks of memory to be loaded into cache as they execute instructions.
If you have more threads than you have CPU cores, you will just slow things down by making unnecessary context switches (since a thread needs a core to run on).
After reading all of that, if you still think multithreading is going to help for your target system, consider one thread that does IO only, places the data in a queue, and has two or more worker threads taking data off of the queue to process. That way, you optimize disk IO and can take advantage of multiple cores to crunch the numbers.
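A hedged sketch of that layout with POSIX threads: a bounded queue of chunk pairs guarded by a mutex and two condition variables. Names and sizes are made up, and real code would add error handling and a way to publish the comparison result:
    #include <pthread.h>
    #include <string.h>

    #define QCAP  8
    #define CHUNK 65536

    struct chunk { char a[CHUNK], b[CHUNK]; size_t len; };

    static struct chunk queue[QCAP];
    static int qhead, qtail, qcount, done;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

    /* IO thread: called after reading the same range from both files. */
    void enqueue(const char *a, const char *b, size_t len)
    {
        pthread_mutex_lock(&qlock);
        while (qcount == QCAP)
            pthread_cond_wait(&not_full, &qlock);
        memcpy(queue[qtail].a, a, len);
        memcpy(queue[qtail].b, b, len);
        queue[qtail].len = len;
        qtail = (qtail + 1) % QCAP;
        qcount++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&qlock);
    }

    /* IO thread: called once when both files have been fully read. */
    void finish(void)
    {
        pthread_mutex_lock(&qlock);
        done = 1;
        pthread_cond_broadcast(&not_empty);
        pthread_mutex_unlock(&qlock);
    }

    /* Worker threads: pop chunk pairs and compare them with memcmp(). */
    void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qlock);
            while (qcount == 0 && !done)
                pthread_cond_wait(&not_empty, &qlock);
            if (qcount == 0 && done) {
                pthread_mutex_unlock(&qlock);
                return NULL;
            }
            struct chunk c = queue[qhead];   /* copy out while holding the lock */
            qhead = (qhead + 1) % QCAP;
            qcount--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&qlock);
            if (memcmp(c.a, c.b, c.len) != 0) {
                /* record "not duplicates" and tell the IO thread to stop */
            }
        }
    }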
Steve suggested you can memory-map your files on Unix. That will speed up access to the underlying data by leveraging low-level OS functionality (the same kind used to manage swap files): the OS will handle loading the parts of the file you are working on into memory efficiently, as long as the file fits into the available address space. FYI, you can do the same thing on Windows.
Before even considering the performance effects of parallel disk reads and thread overhead and such...
Is there any reason to believe that scanning the files in chunks will find the differences any quicker than straight through? Is the data contained in the files predominantly in a certain format, and if so, is the splitting scheme tailored to it? If not, I don't see how scanning the files by skipping over every n bytes (which is all the multithreaded splitting is effectively doing) could offer any improvement over reading the bytes in the order they are on disk.
Think of the two limiting cases -- "splitting" the file into one block, and splitting the file into as many one-byte "blocks" as there are bytes in the file. Will either of those cases be more efficient than the other, or some in-between value? If there is no in-between value that you know you should optimize to, then you know nothing about how the data is stored in the files, so it should make no difference how you scan them.
Even if you set the split to optimize to the disk's performance like block size, you're still going to have to go back to read the next byte, which will likely be at an extremely non-optimal position. And in the end you're going to have to read every single byte in the file, no matter how you split it.
Because you're using pthreads, I assume you're working in a Unix environment -- in which case you could mmap(2) both files into memory and compare the memory arrays directly.
Well, there is the standard memory mapping mmap() function that maps a file to memory. You should be able to do something like
int fd1;
int fd2;
off_t size1;   /* off_t, not int, so large files don't truncate */
off_t size2;

fd1 = open(name1, O_RDONLY);
size1 = lseek(fd1, 0, SEEK_END);

fd2 = open(name2, O_RDONLY);
size2 = lseek(fd2, 0, SEEK_END);

if ( size1 == size2 )
{
    char * data1 = mmap(0, size1, PROT_READ, MAP_SHARED, fd1, 0);
    char * data2 = mmap(0, size1, PROT_READ, MAP_SHARED, fd2, 0);
    off_t i;

    /* ...and this is, obviously, where you'd do something more clever */
    for ( i = 0; i < size1 && *data1 == *data2; i++, data1++, data2++ );
    if ( i == size1 )
        printf("Equal\n");
}
close(fd1);
close(fd2);
Other than that, yes, your solution looks overly complicated ;-) The threaded approach is not necessarily flawed, but you might not see that parallel access improves performance. For SAN drives or ramdisks it might improve performance, for normal spinning platter drives it might impede it. But simpler is usually better, unless you really have a performance issue.
Regarding fseek() versus other methods, it depends on the operating system you use. Google is your friend here; you can easily find articles, at least for Solaris and Linux.
Even if disk access was not the limiting factor (it will be), unless you have a multi-core processor that could hand off different threads to different cores, you would not see a speed-up from going multi-threaded. Basically, you have to compare all N bytes of the file one way or another, and even if you use threads, if they execute in the same core, it will take the same amount of time as without using threads.
There are some environments that could spread the workload across cores, but even so, the CPU will be able to process so much faster than the data can be pulled in from disk that the disk I/O system will be the limiting factor.
