How to optimize reading binary files (over 1MB) in C?

I need to read two 1MB+ binary files byte by byte and compare them. If they are not equal, I need to print the 16 bytes starting at the first unequal byte. The requirement is that it all runs in just 5 ms. Currently, my program takes 19 ms if the unequal byte is at the end of the two files. Are there any suggestions as to how I can optimize it?
#include <stdio.h> //printf
#include <unistd.h> //file open
#include <fcntl.h> //file read
#include <stdlib.h> //exit()
#include <time.h> //clock
#define SIZE 4096
void compare_binary(int fd1, int fd2)
{
int cmpflag = 0;
int errorbytes = 1;
char c1[SIZE], c2[SIZE];
int numberofbytesread = 1;
while(read(fd1, &c1, SIZE) == SIZE && read(fd2, &c2, SIZE) == SIZE && errorbytes < 17){
for (int i=0 ; i < SIZE ; i++) {
if (c1[i] != c2[i] && cmpflag == 0){
printf("Bytes not matching at offset %d\n",numberofbytesread);
cmpflag = 1;
}
if (cmpflag == 1){
printf("Byte Output %d: 0x%02x 0x%02x\n", errorbytes, c1[i], c2[i]);
errorbytes++;
}
if (errorbytes > 16){
break;
}
numberofbytesread++;
}
}
}
int main(int argc, char *argv[])
{
int fd[2];
if (argc < 3){
printf("Check the number of arguments passed.\n");
printf("Usage: ./compare_binary <binaryfile1> <binaryfile2>\n");
exit(0);
}
if (!((access(argv[1], F_OK) == 0) && (access(argv[2], F_OK) == 0))){
printf("Please check if the files passed in the argument exist.\n");
exit(0);
}
fd[0] = open(argv[1], O_RDONLY);
fd[1] = open(argv[2], O_RDONLY);
if (fd[0] < 0 || fd[1] < 0){
printf("Can't open file.\n");
exit(0);
}
clock_t t;
t = clock();
compare_binary(fd[0], fd[1]);
t = clock() - t;
double time_taken = ((double)t)/(CLOCKS_PER_SEC/1000);
printf("compare_binary took %f milliseconds to execute \n", time_taken);
}
Basically, I need an optimized way to read binary files over 1MB so that the comparison finishes in under 5 ms.

First, try reading larger blocks. There's no point in performing so many read calls when you can read everything at once. Using 2 MB of memory is not a big deal nowadays. Disk I/O calls are inherently expensive, and their per-call overhead is significant, but it can be reduced by making fewer calls.
Second, try comparing integers (or even 64-bit longs) instead of bytes in each iteration; that significantly reduces the number of loop iterations. Once you find a mismatch, you can still switch to the byte-by-byte implementation. (Of course, some extra trickery is required if the file length is not a multiple of 4 or 8.)
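As a rough sketch of the second suggestion, assuming both files have already been read into memory buffers of equal length: the function name first_mismatch is made up for this illustration. It compares 8 bytes per iteration and falls back to a byte-wise scan for the differing word and for a tail that is not a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the index of the first differing byte,
 * or n if the two n-byte buffers are equal. */
size_t first_mismatch(const unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i = 0;

    /* Compare 8 bytes per iteration; memcpy sidesteps alignment issues. */
    while (n - i >= sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb)
            break;                 /* the mismatch is inside this word */
        i += sizeof(uint64_t);
    }

    /* Byte-wise scan: pins down the exact byte and covers the tail. */
    while (i < n && a[i] == b[i])
        i++;
    return i;
}
```

Whether this alone reaches the 5 ms target depends on the machine and on how the files are read, so it is worth measuring.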

The first thing that caught my eye is this:
if (cmpflag == 1){
printf("Byte Output %d: 0x%02x 0x%02x\n", errorbytes, c1[i], c2[i]);
errorbytes++;
}
if (errorbytes > 16){
break;
}
Your cmpflag checking is useless; maybe something like this gives a small optimization:
if (c1[i] != c2[i] && cmpflag == 0){
printf("Bytes not matching at offset %d\n",numberofbytesread);
printf("Byte Output %d: 0x%02x 0x%02x\n", errorbytes, c1[i], c2[i]);
errorbytes++;
if (errorbytes > 16){
break;
}
}
You can also use a built-in array comparison function (such as memcmp), or increase your buffer size.
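One possible reading of these two suggestions, sketched under the assumption that both files are already in memory: memcmp skips identical blocks, and the 16 bytes are printed in a separate loop after the first mismatch is located, so no flag is needed. The name report_mismatch and the constants BLOCK and DUMP are invented for this example.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK 4096   /* illustrative block size */
#define DUMP  16     /* number of byte pairs to print from the mismatch */

/* Hypothetical helper: compares two n-byte buffers and, on a mismatch,
 * prints up to DUMP byte pairs starting at the first differing byte.
 * Returns the number of pairs printed (0 if the buffers are equal). */
size_t report_mismatch(const unsigned char *a, const unsigned char *b,
                       size_t n, FILE *out)
{
    size_t off = 0;

    /* Skip whole blocks that are identical. */
    while (off < n) {
        size_t len = (n - off < BLOCK) ? n - off : BLOCK;
        if (memcmp(a + off, b + off, len) != 0)
            break;
        off += len;
    }
    /* Locate the exact byte inside the differing block. */
    while (off < n && a[off] == b[off])
        off++;
    if (off == n)
        return 0;

    fprintf(out, "Bytes not matching at offset %zu\n", off);
    size_t count = (n - off < DUMP) ? n - off : DUMP;
    for (size_t i = 0; i < count; i++)
        fprintf(out, "Byte Output %zu: 0x%02x 0x%02x\n",
                i + 1, a[off + i], b[off + i]);
    return count;
}
```

Using unsigned char for the buffers also avoids sign extension when the bytes are printed with %02x.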

Related

In C how to read the output of a process and write it in the input of another?

Hi, I have a C program that is basically supposed to simulate the pipe function in Linux and write the number of bytes read to a .txt file, so
./a.out cat test : grep -v le : wc -l
The problem that I'm trying to figure out is:
Why is the same number of bytes written to the file when I know each process outputs a different amount?
This piece of code is executed in the parent. It tries to count the bytes of each process's output with a read syscall and passes that output with a write syscall to the next process, so that the next process can use it as its input.
So let's say I have these pipes a | b | c
This code will read the output of a and write it to b so that b can use it as its input, and so on.
for (int i = 1; i < processes-1; i++) {
close(apipe[i][1]);
char str[4096];
int count=0;
int nbChar=0;
while(1){
count=read(apipe[i][0],str,sizeof(str));
nbChar+=count;
if(count==-1){
if (errno == EINTR) {
continue;
} else {
perror("read");
exit(1);
}
}else if(count==0)break;
}
char *leInput=(char*)malloc(nbChar*sizeof(char));
strncpy(leInput,str,nbChar);
if(i>0){
fprintf(fp, "%d : %d \n ", i,nbChar);
}
close(apipe[i][0]);
write(apipe[i+1][1], leInput, nbChar);
}
Each time through the while(1) loop you read into the beginning of str, not where you left off in the previous iteration. So you're overwriting the previous read with the next read.
You should copy incrementally to leInput each time through the loop. You can use realloc() to grow it to accommodate the new input, and use leInput + nbChar to copy after the place where the previous iteration finished.
for (int i = 1; i < processes-1; i++) {
close(apipe[i][1]);
int nbChar=0;
char *leInput = NULL;
while(1){
int count=0;
char str[4096];
count=read(apipe[i][0],str,sizeof(str));
if(count==-1){
if (errno == EINTR) {
continue;
} else {
perror("read");
exit(1);
}
} else if(count==0) {
break;
}
leInput = realloc(leInput, (nbChar + count)*sizeof(char));
memcpy(leInput + nbChar, str, count);
nbChar += count;
}
if(i>0){
fprintf(fp, "%d : %d \n ", i,nbChar);
}
close(apipe[i][0]);
write(apipe[i+1][1], leInput, nbChar);
}
Alternatively, you could just write to the next pipe in the inner loop, without collecting everything into leInput:
for (int i = 1; i < processes-1; i++) {
int nbChar = 0;
close(apipe[i][1]);
while(1){
int count=0;
char str[4096];
count=read(apipe[i][0],str,sizeof(str));
if(count==-1){
if (errno == EINTR) {
continue;
} else {
perror("read");
exit(1);
}
} else if(count==0) {
break;
}
write(apipe[i+1][1], str, count);
nbChar += count;
}
if(i>0){
fprintf(fp, "%d : %d \n ", i,nbChar);
}
close(apipe[i][0]);
close(apipe[i+1][1]);
}

Reading from Pipe in C

I have a program that reads from a Random Access File and is to return the smallest and largest number in the file. One requirement is that this is done with 4 processes using fork() and piping the results. I divide the file up into 4 chunks and have each process evaluate a chunk of the file. I find the max and min of each chunk and write them to a pipe. At the end I will compare the piped values and find the largest and smallest of the values.
I am having trouble reading from the pipes as they are returning -1. Any insight on what I am doing wrong? Thanks!
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int findMin(int start, int end, const char * filename);
int findMax(int start, int end, const char * filename);
//Calculates minimum and maximum of a number
int main(int argc, char * argv[])
{
const char * filename; // name of file to read
FILE * ft; // file handle for the file
int pid, // process id of this process
num, // the number of integer values in the file
i, // loop control variable for reading values
temp=0; // used to store each value read from the file
long size; // size in bytes of the input file
/*********************************************************************/
filename = argv[1]; // read the file named on the command line
ft= fopen(filename, "rb");
if (ft)
{
pid = getpid();
fseek (ft,0,SEEK_END); //go to end of file
size = ftell(ft); //what byte in file am I at?
fseek (ft,0,SEEK_SET); //go to beginning of file
num = (int)size / (int)sizeof(int); // number of integer values
printf("file size: %li bytes\n", size);
printf("sizeof(int) = %i bytes\n",(int) sizeof(int));
printf("how many integers = %i\n\n", num);
fclose(ft);
}
//Split file size into quarters to make 4 processes
int increment = num/4;
int num1 = increment;
int num2 = num1 + increment;
int num3 = num2 + increment;
int num4 = num;
int status;
int pid1 = -1;
int pid2 = -1;
//Pipes
int fdmin1[2];
int fdmax1[2];
int fdmin2[2];
int fdmax2[2];
int fdmin3[2];
int fdmax3[2];
int fdmin4[2];
int fdmax4[2];
//initializing pipes
if(pipe(fdmin1) == -1)
{
perror("Piping fd1 failed");
return 0;
}
if(pipe(fdmax1) == -1)
{
perror("Piping fd2 failed");
return 0;
}
if(pipe(fdmin2) == -1)
{
perror("Piping fd3 failed");
return 0;
}
if(pipe(fdmax2) == -1)
{
perror("Piping fd4 failed");
return 0;
}
if(pipe(fdmin3) == -1)
{
perror("Piping fd3 failed");
return 0;
}
if(pipe(fdmax3) == -1)
{
perror("Piping fd4 failed");
return 0;
}
if(pipe(fdmin4) == -1)
{
perror("Piping fd3 failed");
return 0;
}
if(pipe(fdmax4) == -1)
{
perror("Piping fd4 failed");
return 0;
}
//temp variables for pipes
int temp1;
int temp2;
int temp3;
int temp4;
int temp5;
int temp6;
int temp7;
int temp8;
pid1 = fork();
printf("pid1: %d \n", pid1);
if(pid1 > 0)
{
//Process 1
temp1 = findMin(0, num1, filename);
temp2 = findMax(0, num1, filename);
close(fdmin1[0]);
if(write(fdmin1[1], &temp1, sizeof(int)) == -1)
{
printf("Error writting to pipe");
}
close(fdmin1[1]);
close(fdmax1[0]);
if(write(fdmax1[1], &temp2, sizeof(int)) == -1)
{
printf("Error writting to pipe");
}
close(fdmax1[1]);
}
else if(pid1 == 0)
{
//Process 2
temp3 = findMin(num1, num2, filename);
temp4 = findMax(num1, num2, filename);
close(fdmin2[0]);
if(write(fdmin2[1], &temp3, sizeof(int)) == -1)
{
printf("Error writting to pipe");
}
close(fdmin2[1]);
close(fdmax2[0]);
if(write(fdmax2[1], &temp4, sizeof(int)) == -1)
{
printf("Error writting to pipe");
}
close(fdmax2[1]);
pid2 = fork();
printf("pid2: %d \n", pid2);
if(pid2 > 0)
{
//Process 3
temp5 = findMin(num2, num3, filename);
temp6 = findMax(num2, num3, filename);
close(fdmin3[0]);
if(write(fdmin3[1], &temp5, sizeof(int)) == -1)
{
printf("Error writting to pipe");
}
close(fdmin3[1]);
close(fdmax3[0]);
if(write(fdmax3[1], &temp6, sizeof(int)) == -1)
{
printf("Error writting to pipe");
}
close(fdmax3[1]);
}
else if(pid2 == 0)
{
//Process 4
temp7 = findMin(num3, num4, filename);
temp8 = findMax(num3, num4, filename);
close(fdmin4[0]);
if(write(fdmin4[1], &temp7, sizeof(int)) == -1)
{
printf("Error writting to pipe");
}
close(fdmin4[1]);
close(fdmax4[0]);
if(write(fdmax4[1], &temp8, sizeof(int)) == -1)
{
printf("Error writting to pipe");
}
close(fdmax4[1]);
}
}
//Close all pipe ends in all processes
close(fdmin1[0]);
close(fdmin1[1]);
close(fdmin2[0]);
close(fdmin2[1]);
close(fdmin3[0]);
close(fdmin3[1]);
close(fdmin4[0]);
close(fdmin4[1]);
close(fdmax1[0]);
close(fdmax1[1]);
close(fdmax2[0]);
close(fdmax2[1]);
close(fdmax3[0]);
close(fdmax3[1]);
close(fdmax4[0]);
close(fdmax4[1]);
//Wait for all processes to finish
int returnStatus;
waitpid(pid1, &returnStatus, 0);
int returnStatus2;
waitpid(pid2, &returnStatus2, 0);
//Make sure we are in parant process
if(pid1 > 0)
{
//Variables to compare min and max returned from processses
int min1;
int max1;
int min2;
int max2;
int min3;
int max3;
int min4;
int max4;
//read from pipe (error is occuring here)
close(fdmin1[1]);
if(read(fdmin1[0], &min1, sizeof(int)) == -1)
{
printf("Error reading");
}
close(fdmin1[0]);
printf("min1: %d \n", min1);
}
return 0;
}
//function to find the minimum in the file
int findMin(int start, int end, const char * filename)
{
int temp;
int smallestNum;
int i;
int length = end - start;
FILE * ft2;
ft2= fopen(filename, "rb");
fseek (ft2,start,SEEK_SET);
fread(&smallestNum,sizeof(int),1,ft2);
for(i = 0; i < length; i++)
{
fread(&temp,sizeof(int),1,ft2);
//printf("%d \n", temp);
if(temp < smallestNum)
{
smallestNum = temp;
}
/*
printf("%5i: %7i ",pid,temp);
if ((i+1)%5 == 0)
printf("\n");
*/
}
fclose(ft2);
printf("SmallestNum: %d \n", smallestNum);
return smallestNum;
}
//function to find maximum in file
int findMax(int start, int end, const char * filename)
{
int temp;
int largestNum;
int i;
int length = end - start;
FILE * ft3;
ft3= fopen(filename, "rb");
fseek (ft3,start,SEEK_SET);
fread(&largestNum,sizeof(int),1,ft3);
for(i = 0; i < length; i++)
{
fread(&temp,sizeof(int),1,ft3);
//printf("%d \n", temp);
if(temp > largestNum)
{
largestNum = temp;
}
/*
printf("%5i: %7i ",pid,temp);
if ((i+1)%5 == 0)
printf("\n");
*/
}
fclose(ft3);
printf("Largest Num: %d \n", largestNum);
return largestNum;
}
Here is the code for generating the Random Access File
/*
* This file generates a binary output file containing integers. It
* requires the output filename as a parameter and will take an
* argument indicating the number of values to generate as input.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#define BIAS 0 // a bias value added to the numbers to "bias" the file
// contents to provide an offset to the min and max
int main(int argc, char * argv[]) {
const char * filename; // name of the output file
FILE * ft; // file handle for output file
int numtogen = 1000000; // default is to generate 1,000,000 numbers
int randomnum, i; // variables used in the loop generating numbers
if (argc<2) { // not enough arguments, need output file name
printf("Usage: gendata <filename> [number of numbers]\n");
return 1;
}
if (argc == 3) // optional third argument for number of numbers
numtogen = atoi(argv[2]);
filename=argv[1]; // use the filename entered to store numbers
srand(time(NULL)); // seed the random number generator
ft= fopen(filename, "wb") ;
if (ft) {
for (i = 0; i < numtogen; i++){
randomnum = rand() % numtogen + BIAS;
fwrite(&randomnum,sizeof(int),1,ft);
}
fclose(ft);
}
return 0;
}
I am having trouble reading from the pipes as they are returning -1. Any insight on what I am doing wrong? Thanks!
This is because the main process closes the pipe twice, doing
printf("pid1: %d \n", pid1);
if(pid1 > 0)
{
...
close(fdmin1[0]); <<< HERE
and
//Close all pipe ends in all processes
close(fdmin1[0]); <<< HERE
so it is already closed when you do:
if(read(fdmin1[0], &min1, sizeof(int)) == -1)
Do not close fdmin1[0] before reading from it; close it afterwards.
Note that you also close fdmin1[1], fdmax1[0], and fdmax1[1] twice.
The usage of the pipes is very strange and probably not what you want:
fdmin1 is a pipe between the main process and itself: the main process does if(write(fdmin1[1], &temp1, sizeof(int)) == -1) and later if(read(fdmin1[0], &min1, sizeof(int)) == -1), so that pipe is useless and min1 is just temp1.
The main process does if(write(fdmax1[1], &temp2, sizeof(int)) == -1) but nobody reads that value, so that pipe is useless and temp2 = findMax(0, num1, filename); is computed for nothing.
The child of the main process does if(write(fdmin2[1], &temp3, sizeof(int)) == -1) and if(write(fdmax2[1], &temp4, sizeof(int)) == -1) and if(write(fdmin3[1], &temp5, sizeof(int)) == -1) and if(write(fdmax3[1], &temp6, sizeof(int)) == -1) but nobody reads them, so these four pipes are useless and all the min/max computations are done for nothing.
It is the same for the third created process, which does if(write(fdmin4[1], &temp7, sizeof(int)) == -1) and if(write(fdmax4[1], &temp8, sizeof(int)) == -1) but nobody reads them; these two pipes are useless and the min/max computations are done for nothing.
That means that in the end you do not get the right min/max value in the main process, only the min value of the first quarter computed by the main process; all the other computations are lost.
The code
//Wait for all processes to finish
int returnStatus;
waitpid(pid1, &returnStatus, 0);
int returnStatus2;
waitpid(pid2, &returnStatus2, 0);
is executed by all the child processes too, because you do not exit or return where you should.
You also have unpredictable behavior because of a race condition between your processes: the execution is not the same depending on where I add a usleep in your code. A parent process must wait for its child process to finish when needed, and you do not do so at the right moment. Note that your process numbering is wrong: there are only the main process and two children, so 3 processes rather than 4. //Process 4 does not exist, and that comment is in process 2.
Except in the main process, you do not read from the right position in the file, because in findMin and findMax the parameter start is the index of an int rather than a byte position in the file. You must replace
fseek (ft2,start,SEEK_SET);
fseek (ft3,start,SEEK_SET);
by
fseek (ft2,start*sizeof(int),SEEK_SET);
fseek (ft3,start*sizeof(int),SEEK_SET);
You also (try to) read one int too many by doing
int length = end - start;
...
fread(&smallestNum,sizeof(int),1,ft2);
for(i = 0; i < length; i++)
{
fread(&temp,sizeof(int),1,ft2);
For instance, change the loop to
for(i = 1; i < length; i++)
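Putting both corrections together, a minimal sketch could look like the following. The function find_min_in_range and its error convention are assumptions made for this illustration, not the original findMin; start and end are element indexes, and the range is [start, end).

```c
#include <stdio.h>

/* Hypothetical helper: stores in *out the smallest int among elements
 * [start, end) of a binary file of ints.
 * Returns 0 on success, -1 on error or empty range. */
int find_min_in_range(const char *filename, long start, long end, int *out)
{
    if (end <= start)
        return -1;
    FILE *fp = fopen(filename, "rb");
    if (fp == NULL)
        return -1;
    /* start is an element index, so scale it to a byte offset. */
    if (fseek(fp, start * (long)sizeof(int), SEEK_SET) != 0) {
        fclose(fp);
        return -1;
    }
    int smallest, v;
    if (fread(&smallest, sizeof smallest, 1, fp) != 1) {
        fclose(fp);
        return -1;
    }
    /* One element is already consumed, so start counting at 1. */
    for (long i = 1; i < end - start; i++) {
        if (fread(&v, sizeof v, 1, fp) != 1) {
            fclose(fp);
            return -1;
        }
        if (v < smallest)
            smallest = v;
    }
    fclose(fp);
    *out = smallest;
    return 0;
}
```

Checking every fread result also makes a range that runs past the end of the file show up as an error instead of silently reusing the last value.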
There are also a lot of unused variables in your program, as shown when compiling with the option -Wall:
bruno#bruno-XPS-8300:/tmp$ gcc -Wall -g p.c -o p
p.c: In function ‘main’:
p.c:250:16: warning: unused variable ‘max4’ [-Wunused-variable]
int max4;
^
p.c:249:16: warning: unused variable ‘min4’ [-Wunused-variable]
int min4;
^
p.c:248:16: warning: unused variable ‘max3’ [-Wunused-variable]
int max3;
^
p.c:247:16: warning: unused variable ‘min3’ [-Wunused-variable]
int min3;
^
p.c:246:16: warning: unused variable ‘max2’ [-Wunused-variable]
int max2;
^
p.c:245:16: warning: unused variable ‘min2’ [-Wunused-variable]
int min2;
^
p.c:244:16: warning: unused variable ‘max1’ [-Wunused-variable]
int max1;
^
p.c:48:12: warning: unused variable ‘status’ [-Wunused-variable]
int status;
^
p.c:20:8: warning: unused variable ‘temp’ [-Wunused-variable]
temp=0; // used to store each value read from the file
^
p.c:19:8: warning: unused variable ‘i’ [-Wunused-variable]
i, // loop control variable for reading values
^
p.c:17:8: warning: variable ‘pid’ set but not used [-Wunused-but-set-variable]
int pid, // process id of this process
^
bruno#bruno-XPS-8300:/tmp$
Apart from that:
You must check the value of argc before doing filename = argv[1];.
If fopen(filename, "rb"); fails, you must stop the execution; currently you continue with undefined behavior.
Note also that your program can be simplified by using an array of pipes rather than separate variables for them, which lets you use a loop rather than the sequence of if(pipe(fdmin1) == -1) ... if(pipe(fdmax4) == -1) .... The same goes for starting the child processes: rather than duplicating the code, use a function so that it is written only once. Doing that, you can have a definition that allows any number of child processes rather than exactly 4.
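To illustrate the array-of-pipes idea, here is a small sketch. The helper names open_pipes and close_pipes are invented for this example; they replace the eight separate pipe() checks with one loop.

```c
#include <unistd.h>

/* Hypothetical helper: creates n pipes in fds[0..n-1].
 * On failure, closes the pipes opened so far and returns -1. */
int open_pipes(int fds[][2], int n)
{
    for (int i = 0; i < n; i++) {
        if (pipe(fds[i]) == -1) {
            while (i-- > 0) {
                close(fds[i][0]);
                close(fds[i][1]);
            }
            return -1;
        }
    }
    return 0;
}

/* Hypothetical helper: closes both ends of all n pipes. */
void close_pipes(int fds[][2], int n)
{
    for (int i = 0; i < n; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
}
```

A caller would declare something like int fds[8][2]; and call open_pipes(fds, 8) once, instead of declaring fdmin1 through fdmax4 separately.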
Going back to the statement
I divide the file up into 4 chunks and have each process evaluate a chunk of the file
This is an extreme case, but you have to handle the case where the file is too small to be divided into 4 chunks, which your code does not do.
this is done with 4 processes
Assuming the main process counts among the 4, 3 children must be created. But rather than having each child create another one if needed, it is simpler to have the 3 children created by the main process, and the parallelism is a little better.
A program should be simple. I already said you have a lot of unused variables and a lot of duplicated code. Also:
It is useless to have so many pipes. One is enough to allow each child to send the min/max it computed, because writes to a pipe are guaranteed to be atomic up to PIPE_BUF bytes (larger than the size of 2 ints).
It is useless to read the file so many times; you can search for the min and the max in the same pass.
And finally, a proposal:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#define N 4 /* including the main process */
/* to send/receive the result atomically through the pipe */
typedef struct {
int min, max;
} MinMax;
void findMinMax(long offset, long n, FILE * fp, MinMax * minmax);
//Calculates minimum and maximum of a number
int main(int argc, char * argv[])
{
const char * filename; // name of file to read
FILE * fp; // file handle for the file
long num; // the number of integer values in the file
long size; // size in bytes of the input file
long offset; // offset in file
int pp[2]; // the unique pipe
int pids[N-1];
MinMax minmax;
int i;
if (argc != 2) {
fprintf(stderr, "Usage: %s <filename>\n", *argv);
exit(-1);
}
filename = argv[1];
fp = fopen(filename, "rb");
if (fp == NULL) {
perror("cannot open file");
exit(-1);
}
/* get file size */
if (fseek(fp, 0, SEEK_END) == -1) { //go to end of file
perror("cannot fseek");
fclose(fp); /* also done automaticaly when exiting program */
exit(-1);
}
size = ftell(fp); //what byte in file am I at?
num = size / sizeof(int); // number of integer values
printf("file size: %li bytes\n", size);
printf("how many integers = %li\n\n", num);
if (num < N) {
fprintf(stderr, "the input file is too small, it must contain at least %i ints\n", N);
fclose(fp); /* also done automaticaly when exiting program */
exit(-1);
}
//initializing pipe
if(pipe(pp) == -1) {
perror("Piping failed");
exit(-1);
}
offset = 0;
for (i = 0; i != N-1; ++i) {
pids[i] = fork();
switch (pids[i]) {
case 0:
/* child */
{
FILE * fp2 = fopen(filename, "rb");
if (fp2 == NULL) {
perror("child cannot open file");
exit(-1);
}
findMinMax(offset, num/N, fp2, &minmax);
printf("min max child %d : %d %d\n", i, minmax.min, minmax.max);
if (write(pp[1], &minmax, sizeof(minmax)) != sizeof(minmax)) {
perror("Error writting to pipe");
exit(-1);
}
}
exit(0);
case -1:
/* parent */
perror("Cannot fork");
exit(-1);
default:
/* parent, no error */
offset += (num/N)*sizeof(int);
}
}
findMinMax(offset, (size - offset)/sizeof(int), fp, &minmax);
printf("min max main : %d %d\n", minmax.min, minmax.max);
for (i = 0; i != N-1; ++i) {
int status;
MinMax mm;
if ((waitpid(pids[i], &status, 0) != -1) &&
(status == 0) &&
(read(pp[0], &mm, sizeof(mm)) == sizeof(mm))) {
if (mm.min < minmax.min)
minmax.min = mm.min;
if (mm.max > minmax.max)
minmax.max = mm.max;
}
else
fprintf(stderr, "cannot get result for child %d\n", i);
}
printf("global min max : %d %d\n", minmax.min, minmax.max);
return 0;
}
// function to find the minimum and maximum in the file
// n > 1
void findMinMax(long offset, long n, FILE * fp, MinMax * minmax)
{
int v;
if (fseek(fp, offset, SEEK_SET) == -1) {
perror("cannot fseek");
exit(-1);
}
if (fread(&minmax->min, sizeof(minmax->min), 1, fp) != 1) {
fclose(fp); /* also done automaticaly when exiting program */
perror("cannot read int");
exit(-1);
}
minmax->max = minmax->min;
while (--n) {
if (fread(&v, sizeof(v), 1, fp) != 1) {
fclose(fp); /* also done automaticaly when exiting program */
perror("cannot read int");
exit(-1);
}
if (v < minmax->min)
minmax->min = v;
if (v > minmax->max)
minmax->max = v;
}
fclose(fp); /* also done automaticaly when exiting program */
}
As you can see, the code is much simpler, and I only have to change #define N 4 to another value to change the number of processes working in parallel.
Using your second program to generate 1000000 ints in the file aze, here are the compilation and execution of my proposal:
bruno#bruno-XPS-8300:/tmp$ gcc -g -Wall p.c
bruno#bruno-XPS-8300:/tmp$ ./a.out aze
file size: 4000000 bytes
how many integers = 1000000
min max main : 2 999995
min max child 0 : 10 999994
min max child 2 : 0 999998
min max child 1 : 3 999999
global min max : 0 999999
bruno#bruno-XPS-8300:/tmp$

copying contents of file to another file n bytes at a time in c

Trying to copy the contents of a file to another file by copying n bytes at a time in c. I believe the code below works for copying one byte at a time but am not sure how to make it work for n number of bytes, have tried making a character array of size n and changing the read/write functions to read(sourceFile , &c, n) and write(destFile , &c, n), but the buffer doesn't appear to work that way.
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <time.h>
void File_Copy(int sourceFile, int destFile, int n){
char c;
while(read(sourceFile , &c, 1) != 0){
write(destFile , &c, 1);
}
}
int main(){
int fd, fd_destination;
fd = open("source_file.txt", O_RDONLY); //opening files to be read/created and written to
fd_destination = open("destination_file.txt", O_RDWR | O_CREAT);
clock_t begin = clock(); //starting clock to time the copying function
File_Copy(fd, fd_destination, 100); //copy function
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC; //timing display
return 0;
}
how to make it work for n number of bytes
Just read N number of bytes and copy that many bytes that you successfully read.
#define N 4096
void File_Copy(int sourceFile, int destFile, int n){
char c[N];
const size_t csize = sizeof(c)/sizeof(*c);
while (1) {
const ssize_t readed = read(sourceFile, c, csize);
if (readed <= 0) {
// nothing more to read
break;
}
// copy to destination that many bytes we read
const ssize_t written = write(destFile, c, readed);
if (written != readed) {
// we didn't transfer everything and destFile should be blocking
// handle error
abort();
}
}
}
You want to copy a buffer of size n at once:
void File_Copy(int sourceFile, int destFile, int n){
char c[n];
ssize_t st;
while((st = read(sourceFile , c, n)) > 0){
write(destFile , c, st);
}
}
Note that n bytes are not necessarily copied at once; it might be fewer. You also have to check the return value of write() and handle the situation where fewer bytes were written, as fits your needs.
One example is a loop:
char *p = c; // next byte still to be written
while (st > 0) {
ssize_t w = write(destFile, p, st);
if (w < 0) {
perror("write");
return;
}
p += w; // skip the bytes already written
st -= w;
}
Another issue: When you create the destination file here
fd_destination = open("destination_file.txt", O_RDWR | O_CREAT);
you do not specify the third parameter, mode. The file is then created with an arbitrary mode, which might cause this open() to fail the next time. So it is better to pass a valid mode, for example like this:
fd_destination = open("destination_file.txt", O_RDWR | O_CREAT, 0644);
This might have distorted your test results.
This is my version using lseek (no loop required):
It relies on read and write always processing the complete buffer and never only a part of it, which is not guaranteed: both calls may transfer fewer bytes than requested.
void File_Copy(int sourceFile, int destFile)
{
off_t s = lseek(sourceFile, 0, SEEK_END);
lseek(sourceFile, 0, SEEK_SET);
char* c = malloc(s);
if (read(sourceFile, c, s) == s)
write(destFile, c, s);
free(c);
}
The following code does not rely on this assumption and can also be used for file descriptors not supporting lseek.
void File_Copy(int sourceFile, int destFile, int n)
{
char* c = malloc(n);
while (1)
{
ssize_t readStatus = read(sourceFile, c, n);
if (readStatus == -1)
{
printf("error, read returned -1, errno: %d\n", errno);
return;
}
if (readStatus == 0)
break; // EOF
ssize_t bytesWritten = 0;
while (bytesWritten != readStatus)
{
ssize_t writeStatus = write(destFile, c + bytesWritten, readStatus - bytesWritten);
if (writeStatus == -1)
{
printf("error, write returned -1, errno is %d\n", errno);
return;
}
bytesWritten += writeStatus;
if (bytesWritten > readStatus) // should not be possible
{
printf("how did 'bytesWritten > readStatus' happen?");
return;
}
}
}
free(c);
}
On my system (PCIe SSD) I get best performance with a buffer between 1MB and 4MB (you can also use dd to find this size). Bigger buffers don't make sense. And you need big files (try 50GB) to see the effect.

Find a substring within a string using processes

I have a large file (around 1,000,000 characters) in the format "AATACGTAGCTA" and a subsequent file, such as "CGTATC" (10,240 characters). I want to find the largest match of the subsequence within the main sequence. A full, 100% subsequence match may not exist, this is not guaranteed. For the sake of a smaller example, the above would output: Longest match is 4/6 starting at position 5.
I'm working on my C basics, and would like to implement it like so:
The user chooses how many processes they would like to split the work
into.
Each process does 1/nth of the work and updates the shared memory
values located in the struct.
The longest match (it may not be all characters) is reflected in the
struct, as well as its starting position, and how many
characters were matched. See output below.
Code
#define _GNU_SOURCE
#include <limits.h>
#include <stdio.h>
#include <errno.h>
#include <semaphore.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/shm.h>
typedef struct memoryNeeded {
int start_pos, total_correct;
char sequence[1038336];
char subsequence[10240];
sem_t *sem;
} memoryNeeded;
// Used to check all arguments for validity
int checkArguments(char* p, int argc) {
char *prcs;
errno = 0;
int num;
long conv = strtol(p, &prcs, 10);
if (errno!= 0 || *prcs != '\0' || conv > INT_MAX || conv > 50) {
puts("Please input a valid integer for number of processes. (1-50)");
exit(1);
} else {
num = conv;
if (argc != 4) {
puts("\nPlease input the correct amount of command line arguments (4) in"
"the format: \n./DNA (processes) (sequence) (subsequence)\n");
exit(1);
} else
printf("Looking for string using %d processes...\n", num);
return(num);
}
}
int main (int argc, char* argv[]) {
int processes = checkArguments(argv[1], argc);
key_t shmkey;
int procNumber, shmid, pid;
FILE *sequence;
FILE *subsequence;
char *buf1, *buf2;
// Create shared memory
size_t region_size = sizeof(memoryNeeded);
shmkey = ftok("ckozeny", 5);
shmid = shmget(shmkey, region_size, 0644 | IPC_CREAT);
if (shmid < 0) {
perror("shmget\n");
exit(1);
}
// Create structure in shared memory, attach memory and open semaphore
memoryNeeded *mn;
mn = (memoryNeeded *)shmat(shmid, NULL, 0);
mn->sem = sem_open("sem", O_CREAT | O_EXCL, 0644, 1);
sequence = fopen(argv[2], "r");
subsequence = fopen(argv[3], "r");
// Get file sizes
fseek(sequence, 0L, SEEK_END);
int sz1 = ftell(sequence);
rewind(sequence);
fseek(subsequence, 0L, SEEK_END);
int sz2 = ftell(subsequence);
rewind(subsequence);
// Read files into 2 buffers, which are put into struct mn
buf1 = malloc(sz1);
buf2 = malloc(sz2);
if (sz1 != fread(buf1, 1, sz1, sequence)) {
free(buf1);
}
if (sz2 != fread(buf2, 1, sz2, subsequence)) {
free(buf2);
}
// Initialize struct with necessary values
mn->start_pos = 0;
mn->total_correct = 0;
strncpy(mn->sequence, buf1, sz1);
strncpy(mn->subsequence, buf2, sz2);
fclose(sequence);
fclose(subsequence);
// Begin n forks
for (procNumber = 0; procNumber < processes; procNumber++) {
pid = fork();
if (pid < 0) {
sem_unlink("sem");
sem_close(mn->sem);
printf ("Fork error.\n");
} else if (pid == 0)
break;
}
if (pid != 0) {
while ((pid = waitpid (-1, NULL, 0))){
if (errno == ECHILD)
break;
}
printf("Best match is at position %d with %d/10240 correct.", mn->start_pos, mn->total_correct);
printf ("\nParent: All children have exited.\n");
sem_unlink("sem");
sem_close(mn->sem);
shmdt(mn);
shmctl(shmid, IPC_RMID, 0);
exit(0);
} else {
// this child process will do its 1/nth of the work
sem_wait(mn->sem);
printf ("Child(%d) is in critical section.\n", procNumber);
sleep(1);
int i = 0;
int longest, count = 0;
for (i = 0; i < sz1; i += processes) {
for (int j = 0; j < sz2; j += processes) {
count = 0;
while (mn->sequence[i+j] == mn->subsequence[j]) {
count++;
j++;
}
if (count > longest) {
longest = count;
}
}
}
// If local match is longer than that of the struct, update and unlock
if (longest > mn->total_correct) {
mn->total_correct = count;
mn->start_pos = (i - count);
sem_post(mn->sem);
} else
// If not - unlock and let next process go
sem_post(mn->sem);
exit(0);
}
return 1;
}
The current child code is more or less "pseudocode". I've put it together how it makes sense in my head. (I'm aware this may not be correct or function as intended.) My question is in regard to the child code algorithm near the bottom.
How do I implement this so each child does 1/nth of the work, and finds the longest match, even though it may not match 100%?
Final output would be:
./DNA 6 sequence1 subsequence1
Looking for string using 6 processes...
Best match is at position 123456 with 9876/10240 correct.
Thanks.
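One way the per-child work could be divided is sketched below. It rests on assumptions that the question does not settle: that a "match" means the number of leading characters of the subsequence that agree with the sequence at a given start position (which is consistent with the 4/6 example), and that positions are reported 0-based. The type Match and the function best_match_in_slice are invented for this illustration. Each worker scans only the start positions in its own contiguous slice, so the shared struct only needs updating once per child, under the semaphore.

```c
#include <stddef.h>

typedef struct {
    size_t start;   /* 0-based start index in seq */
    size_t length;  /* leading characters of sub that matched there */
} Match;

/* Hypothetical helper: worker k (0 <= k < nworkers) examines the start
 * positions in its own slice of seq and returns the best match found. */
Match best_match_in_slice(const char *seq, size_t seq_len,
                          const char *sub, size_t sub_len,
                          size_t k, size_t nworkers)
{
    size_t begin = seq_len * k / nworkers;
    size_t end = seq_len * (k + 1) / nworkers;
    Match best = { begin, 0 };

    for (size_t i = begin; i < end; i++) {
        size_t count = 0;
        /* Bounds checks keep the scan inside both buffers. */
        while (count < sub_len && i + count < seq_len &&
               seq[i + count] == sub[count])
            count++;
        if (count > best.length) {
            best.start = i;
            best.length = count;
        }
    }
    return best;
}
```

A child would call this with its own procNumber as k, then lock the semaphore and update start_pos and total_correct only if its result beats the current values. For the small example in the question, the best match has length 4 at 0-based index 4, which is position 5 when counting from 1.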

Difference between standard I/O and lower-level I/O

I have two segments of code.
#define BUFFSIZE 30
#define STDIN_FILENO 0
#define STDOUT_FILENO 1
#define STDERR_FILENO 2
#include <strings.h>
int main(void)
{
int n;
char buf[BUFFSIZE];
while ( (n = read(STDIN_FILENO, buf, BUFFSIZE)) > 0)
if (write(STDOUT_FILENO, buf, n) != n)
err_sys("write error");
if (n < 0)
err_sys("read error");
exit(0);
}
And another
#define STDIN_FILENO 0
#define STDOUT_FILENO 1
#define STDERR_FILENO 2
#include <stdio.h>
#include <strings.h>
int main(void) {
int c;
while ( (c = getc(stdin)) != EOF)
if (putc(c, stdout) == EOF)
err_sys("output error");
if (ferror(stdin))
err_sys("input error");
exit(0); }
For the first program, I thought that if I input a string longer than BUFFSIZE, the characters at indexes beyond BUFFSIZE would be discarded. But it turned out not to be so. Why does this happen? And what is the major difference between these two I/O mechanisms? Many thanks.
For me, the basic difference between the I/O levels is that the lower level is not buffered by the standard library.
In your case, the first example reads and writes using your own buffer of size BUFFSIZE. In the second example, you read and write a single character at a time, relying on the fact that the buffering is done by the library. Otherwise, both examples do the same thing.
Lower-level functions offer a few more options than higher-level functions, such as non-blocking I/O. Also, programs using higher-level functions may be a bit slower. In your second example the data is copied (byte after byte) from an input buffer to an output buffer, which does not happen in the first example.
BTW, your first example can miss some characters, because write may write fewer bytes than requested; the loop should be something like:
while ( (n = read(STDIN_FILENO, buf, BUFFSIZE)) > 0) {
int i, k = 0;
do {
i = write(STDOUT_FILENO, buf+k, n-k);
if (i < 0) {err_sys("write error"); break;}
k += i;
} while (k < n);
}

Resources