Compare 2 files using POSIX system calls


C newbie here.
Banging my head against the wall with this one...
I'm trying to compare two files using only system calls. The files are not being used by any other process, so they are static. I have no problem doing this with fopen(), but it feels much more complicated when using just open(), read() and write().
Here's what I've got so far:
...//Some code here to get file descriptors and other file manipulation checks
int one = read(srcfd1,buf1,src1_size);
int two = read(srcfd2,buf2,src2_size);
printf("%s\n",buf1); //works fine till it gets here...
int samefile = strcmp(buf1,buf2); //Crashes somewhere around here..
if (samefile != 0)
{
printf("not equle\n");
return(1);
}
else
{
printf("equle\n");
return(2);
}
So basically, what I think I need to do is compare the 2 buffers, but this does not seem to be working.
I found something which I believe should give me some idea here, but I can't make sense of it (the last answer in the link).
The return values are irrelevant.
I'd appreciate any help I can get.

Your buffers are not NUL terminated, so it doesn't make sense to use strcmp - this will almost certainly fail unless your buffers happen to contain a 0 somewhere. Also you don't say whether these files are text files or binary files, but to make this work (for either text or binary) you should change:
int samefile = strcmp(buf1,buf2); //Crashes somewhere around here..
to:
int samefile = memcmp(buf1,buf2,src1_size); // use memcmp to compare buffers
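Note that you should also check that src1_size == src2_size prior to calling memcmp, and that each read() actually returned that many bytes (the values stored in one and two), since read() may return fewer bytes than requested.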
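As a rough sketch of how the whole comparison could look with only open(), read() and close(), the code below compares the two files block by block with memcmp. The names read_full and files_equal, the 4096-byte block size and the 1/0/-1 return convention are assumptions made for this illustration, not part of the question's code.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Fill buf with up to len bytes from fd, retrying after short reads.
 * Returns the number of bytes obtained (less than len only at end of
 * file) or -1 on a read error. */
static ssize_t read_full(int fd, char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n < 0)
            return -1;
        if (n == 0)
            break;                      /* end of file */
        got += (size_t)n;
    }
    return (ssize_t)got;
}

/* Hypothetical helper: returns 1 if the files have identical contents,
 * 0 if they differ, -1 if either file cannot be opened or read. */
int files_equal(const char *path1, const char *path2)
{
    char buf1[4096], buf2[4096];
    int result = 1;
    int fd1 = open(path1, O_RDONLY);
    int fd2 = open(path2, O_RDONLY);

    if (fd1 < 0 || fd2 < 0) {
        result = -1;
    } else {
        for (;;) {
            ssize_t n1 = read_full(fd1, buf1, sizeof buf1);
            ssize_t n2 = read_full(fd2, buf2, sizeof buf2);
            if (n1 < 0 || n2 < 0) { result = -1; break; }
            /* Different block lengths or different bytes: not equal. */
            if (n1 != n2 || memcmp(buf1, buf2, (size_t)n1) != 0) { result = 0; break; }
            if (n1 == 0)
                break;                  /* both files ended together */
        }
    }
    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    return result;
}
```

Because memcmp is given an explicit length, this works for text and binary files alike, and comparing block by block means neither file has to fit in memory.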

This crashes because the buffers are possibly not null terminated. You are printing them as strings with "%s" in printf and passing them to strcmp too.
You can try null terminating the buffers after your read calls, and then print them as strings. Each buffer needs room for one extra byte, and read must not have returned -1.
buf1[one] = '\0';
buf2[two] ='\0';
This will most likely fix your code. But a few other points:
1) Are your buffers at least as large as the file?
2) It is better to read the data in parts than to try to grab everything in one go
(that is, use a loop to read data until read returns 0).
For example, assume the array "buf" is large enough to hold all of the file's data.
The number "512" means that read will try to read at most 512 bytes per call. The
iteration continues until read returns 0 (when there is no more data) or
a negative number (in case of an error). The array index is
incremented by the number of bytes read so far, so that the data does not
get overwritten.
An example: if a file has, say, 515 bytes, read will typically be called three times.
The first call will return 512, the second call will return 3
and the third call will return 0 (read returns the number of bytes
actually read).
index = 0;
while( (no_of_bytes_read = read(fd, buf + index, 512)) > 0)
{
index = index + no_of_bytes_read;
}
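The loop above trusts that buf is big enough. A slightly more defensive sketch, under the assumption that the caller passes the buffer capacity, could look like the following. The name read_all and the parameter cap are hypothetical, and retrying on EINTR is an extra that the loop above does not have.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: read from fd until end of file or until cap bytes
 * have been stored. Returns the byte count, or -1 on a read error. */
ssize_t read_all(int fd, char *buf, size_t cap)
{
    size_t index = 0;
    while (index < cap) {
        size_t want = cap - index;
        if (want > 512)
            want = 512;                 /* same chunk size as above */
        ssize_t n = read(fd, buf + index, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            break;                      /* end of file */
        index += (size_t)n;             /* advance past what was stored */
    }
    return (ssize_t)index;
}
```

If the return value equals cap, the file may have more data than the buffer could hold, so the caller should treat that case as "possibly truncated".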

Related

C UNIX - read() reads none existing letters

I've got a little problem while experimenting with some C code. I've tried to use read() to read text from a file and store the result in a char array. But when I print the result, it is always different from the file.
Here is the code:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
void main() {
int fd = open("file", 2);
char buf[2];
printf("Read elements: %ld\n", read(fd, buf, 2));
printf("%s\n", buf);
close(fd);
}
The file "file" was created in the same directory using the following UNIX commands:
cat > file
Hi
So it contains just the word "Hi". When I run it, I expect it to read 2 bytes from the file (which are 'H' and 'i') and store them at buf[0] and buf[1]. But when I print the result, several weird characters are printed besides the word "Hi" (indicating a memory reading/writing problem, I guess, due to a bad buffer size). I've tried increasing the size of the buf array, and the weird characters change when I change the size. The problem goes away when the size reaches 32 bytes.
Can someone explain to me in detail why this is happening?
I've understood so far that read() does not store a '\0' after the bytes it reads, and that the third parameter of read() indicates the maximum number of bytes to read.
Another thing I've noticed while experimenting with the above code is the following: let's assume one changes the third parameter (maximum bytes to read) of read() to 3, and the size of the buf array to 512 (overkill, I know, but I really wanted to see what would happen). Now read will actually read a third character (in my case 'e') and store it in the buffer, even though this third character does not exist.
I've searched Stack Overflow for a while now and found many similar cases, but none of them made me understand my problem. If there is any other thread I missed, I'd be grateful if you could link me to it.
Finally: sorry for my bad English, it's not my native language.

Clearly you need to make buf 3 bytes long and use the last byte as the null byte (0 or '\0'). That way, when you print the string, printf stops at that byte instead of carrying on through memory until it finds another 0!
The way strings (char arrays, really) are handled in C is quite straightforward. Most if not all string functions assume that string parameters are null terminated (puts) or return null-terminated strings (strdup).
The point is that the computer can't tell where a string ends unless it is given the string's size each time it processes it. The easiest way around this was to append a 0 (the null byte) after each string. That way, the computer just needs to iterate over the string's characters and stop when it finds the termination character (another name for the null byte).
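A small sketch of that advice, assuming a hypothetical helper named read_string: it asks read() for one byte fewer than the buffer holds and then writes the null byte itself, so the result is always safe to print with %s.

```c
#include <unistd.h>

/* Hypothetical helper: read at most size - 1 bytes from fd into buf and
 * append the terminating null byte. Returns what read() returned (the
 * byte count, or -1 on error, in which case buf is an empty string).
 * size must be at least 1. */
ssize_t read_string(int fd, char *buf, size_t size)
{
    ssize_t n = read(fd, buf, size - 1);
    buf[n > 0 ? n : 0] = '\0';          /* terminate right after the data */
    return n;
}
```

With char buf[3] and a file containing "Hi", a call such as read_string(fd, buf, sizeof buf) stores 'H', 'i' and '\0', which is the 3-byte layout described above.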

Pthreads, fread(), and printf(): Getting random D4's in my string

The Scoop:
I am creating a method that runs through a lengthy file in chunks: using pthreads. I am calling fread() to read the file in this sort of fashion:
fread( thread_data[i].buffer, 1, 50, f )
/*
thread_data is a data structure for each thread (hence i)
buffer is in thread_data as an array of length 50
*/
I am then directly calling a print statement to see what each thread is doing, as a weird pattern was showing up in some of the parts that I was printing. Namely, my print statement would look something like this:
this is suppose to be 50 characters, but it is only a fewgD4
That D4 directly above is what I have my question on. Every thread that I make, at the end of the string, we are printing D4, and in this case, followed by a g. Other times, it is followed by a d, and most commonly a �. Now, I did read the wikipedia page on this character, which states:
replacement character used to replace an unknown or unrepresentable character
My question:
What kind of an error am I running into? Why is the end of each read statement containing unknown characters, especially the weird gD4 guy?
Aside:
I am trying to make a function in C that uses pthreads to find the frequency of each word in a file, in case anyone was wondering. These weird characters were showing up in my list, which I find slightly unpleasant. Finally, don't bother linking me to the obligatory Unicode article; I am already aware of it, and the characters are not outside of what I am working with.

The strings you are printing out are not null-terminated — fread() does not null-terminate its output, it simply reads in as many raw bytes as you asked for (or fewer). So when you print out your buffer, your print function is walking past the end of the data and printing out whatever garbage memory comes after the buffer, which in your case just happens to be gD4.
You need to either explicitly null-terminate your buffer; or, if your print function supports it, tell it exactly how many characters to print. Either way, you need to save the return value from fread to know how many characters you read. For example:
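size_t n = fread(thread_data[i].buffer, 1, sizeof thread_data[i].buffer - 1, f);
if (n == 0 && ferror(f)) /* Handle error (fread never returns a negative value) */ ;
// Explicitly add a null terminator -- reading one byte fewer than the buffer holds leaves room for it
thread_data[i].buffer[n] = 0;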
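The other option mentioned above, telling the print function how many characters to print, can be sketched with the precision field of %s. The helper name format_chunk and the surrounding brackets are assumptions for illustration only; snprintf is used here so the result can be inspected, but the same "%.*s" format works with printf.

```c
#include <stdio.h>

/* Hypothetical helper: format the first n bytes of an unterminated
 * buffer into out as "[...]" without reading past those n bytes.
 * Returns what snprintf returns. */
int format_chunk(char *out, size_t outsize, const char *data, size_t n)
{
    /* The precision argument of %.*s must be an int; it caps how many
     * bytes of data are read and printed. */
    return snprintf(out, outsize, "[%.*s]", (int)n, data);
}
```

Note that %.*s still stops early at a null byte inside the first n bytes, so this suits text data; for raw binary data fwrite with an explicit length is the safer choice.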

C, format file for data of HTTP response

I have no experience with fscanf() and very little with functions for FILE. I have code that correctly determines if a client requested an existing file (using stat() and it also ensures it is not a directory). I will omit this part because it is working fine.
My goal is to send a string back to the client with a HTTP header (a string) and the correctly read data, which I would imagine has to become a string at some point to be concatenated with the header for sending back. I know that + is not valid C, but for simplicity I would like to send this: headerString+dataString.
The code below does seem to work for text files but not images. I was hoping that reading each character individually would solve the problem but it does not. When I point a browser (Firefox) at my server looking for an image it tells me "The image (the name of the image) cannot be displayed because it contains errors.".
This is the code that is supposed to read a file into httpData:
int i = 0;
FILE* file;
file = fopen(fullPath, "r");
if (file == NULL) errorMessageExit("Failed to open file");
while(!feof(file)) {
fscanf(file, "%c", &httpData[i]);
i++;
}
fclose(file);
printf("httpData = %s\n", httpData);
Edit: This is what I send:
char* httpResponse = malloc((strlen(httpHeader)+strlen(httpData)+1)*sizeof(char));
strcpy(httpResponse, httpHeader);
strcat(httpResponse, httpData);
printf("HTTP response = %s\n", httpResponse);
The data part produces ???? for the image but correct html for an html file.

Images contain binary data. Any of the 256 distinct 8-bit patterns may appear in the image including, in particular, the null byte, 0x00 or '\0'. On some systems (notably Windows), you need to distinguish between text files and binary files, using the letter b in the standard I/O fopen() call (works fine on Unix as well as Windows). Given that binary data can contain null bytes, you can't use strcpy() et al to copy chunks of data around since the str*() functions stop copying at the first null byte. Therefore, you have to use the mem*() functions which take a start position and a length, or an equivalent.
Applied to your code, printing the binary httpData with %s won't work properly; the %s will stop at the first null byte. Since you have used stat() to verify the existence of the file, you also have a size for the file. Assuming you don't have to deal with dynamically changing files, that means you can allocate httpData to be the correct size. You can also pass the size to the reading code. This also means that the reading code can use fread() and the writing code can use fwrite(), saving on character-by-character I/O.
Thus, we might have a function:
int readHTTPData(const char *filename, size_t size, char *httpData)
{
FILE *fp = fopen(filename, "rb");
size_t n;
if (fp == 0)
return E_FILEOPEN;
n = fread(httpData, size, 1, fp);
fclose(fp);
if (n != 1)
return E_SHORTREAD;
fputs("httpData = ", stdout);
fwrite(httpData, size, 1, stdout);
putchar('\n');
return 0;
}
The function returns 0 on success, and some predefined (negative?) error numbers on failure. Since memory allocation is done before the routine is called, it is pretty simple:
Open the file; report error if that fails.
Read the file in a single operation.
Close the file.
Report error if the read did not get all the data that was expected.
Report on the data that was read (debugging only — and printing binary data to standard output raw is not the best idea in the world, but it parallels what the code in the question does).
Report on success.
In the original code, there is a loop:
int i = 0;
...
while(!feof(file)) {
fscanf(file, "%c", &httpData[i]);
i++;
}
This loop has a lot of problems:
You should not use feof() to test whether there is more data to read. It reports whether an EOF indication has been given, not whether it will be given.
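Consequently, when the last character has been read, the feof() reports 'false', but the fscanf() tries to read the next (non-existent) character and fails. Because its return value is not checked, i is still incremented, so the buffer appears to contain one extra byte of whatever happened to be in memory (with a getc()-based loop the extra byte is typically ÿ, y-umlaut, 0xFF, U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS).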
The code makes no check on how many characters have been read, so it has no protection against buffer overflow.
Using fscanf() to read a single character is a lot of overhead compared to getc().
Here's a more nearly correct version of the code, assuming that size is the number of bytes allocated to httpData.
size_t i = 0;
int c;
/* Test the bound first so that no character is read and then discarded. */
while (i < size && (c = getc(file)) != EOF)
httpData[i++] = c;
You could check that you get EOF when you expect it. Note that the fread() code does the size checking inside the fread() function. Also, the way I wrote the arguments, it is an all-or-nothing proposition — either all size bytes are read or everything is treated as missing. If you want byte counts and are willing to tolerate or handle short reads, you can reverse the order of the size arguments. You could also check the return from fwrite() if you wanted to be sure it was all written, but people tend to be less careful about checking that output succeeded. (It is almost always crucial to check that you got the input you expected, though — don't skimp on input checking.)
At some point, for plain text data, you need to think about CRLF vs NL line endings. Text files handle that automatically; binary files do not. If the data to be transferred is image/png or something similar, you probably don't need to worry about this. If you're on Unix and dealing with text/plain, you may have to worry about CRLF line endings (but I'm not an expert on this — I've not done low-level HTTP stuff recently (not in this millennium), so the rules may have changed).
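To make the point about the mem*() functions concrete, here is a sketch of how the header and the binary body could be joined without strcpy()/strcat(). The function name buildResponse and the outLen parameter are hypothetical; the key assumption is that the body length is already known (for example from stat() or from the fread() count).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: join a text header and a binary body into one
 * malloc'd buffer. Stores the total length in *outLen and returns the
 * buffer, or NULL if allocation fails. The caller frees the result. */
char *buildResponse(const char *httpHeader, const char *httpData,
                    size_t dataLen, size_t *outLen)
{
    size_t headerLen = strlen(httpHeader);  /* the header is real text */
    char *httpResponse = malloc(headerLen + dataLen);
    if (httpResponse == NULL)
        return NULL;
    memcpy(httpResponse, httpHeader, headerLen);
    /* memcpy copies exactly dataLen bytes, so null bytes in the image survive. */
    memcpy(httpResponse + headerLen, httpData, dataLen);
    *outLen = headerLen + dataLen;
    return httpResponse;
}
```

The result is not a C string, so it has to be sent with a call that takes an explicit length (passing *outLen), and it cannot be printed with %s.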

Why is this C code giving me a bus error?

I have, as usual, been reading quite a few posts on here. I found a particularly useful post on bus errors in general, see here. My problem is that I cannot understand why my particular code is giving me an error.
My code is an attempt to teach myself C. It's a modification of a game I made when I learned Java. The goal in my game is to take a huge 5049 x 1 text file of words. Randomly pick a word, jumble it and try to guess it. I know how to do all of that. So anyway, each line of the text file contains a word like:
5049
must
lean
better
program
now
...
So, I created a string array in C and tried to read the file into it. I didn't do anything else. Once I get the file into C, the rest should be easy. Weirder yet is that it compiles. My problem comes when I run it with the ./blah command.
The error I get is simple. It says:
zsh: bus error ./blah
My code is below. I suspect it might have to do with memory or overflowing the buffer, but that's completely unscientific and a gut feeling. So my question is simple, why is this C code giving me this bus error msg?
#include<stdio.h>
#include<stdlib.h>
//Preprocessed Functions
void jumblegame();
void readFile(char* [], int);
int main(int argc, char* argv[])
{
jumblegame();
}
void jumblegame()
{
//Load File
int x = 5049; //Rows
int y = 256; //Colums
char* words[x];
readFile(words,x);
//Define score variables
int totalScore = 0;
int currentScore = 0;
//Repeatedly pick a random work, randomly jumble it, and let the user guess what it is
}
void readFile(char* array[5049], int x)
{
char line[256]; //This is to to grab each string in the file and put it in a line.
FILE *file;
file = fopen("words.txt","r");
//Check to make sure file can open
if(file == NULL)
{
printf("Error: File does not open.");
exit(1);
}
//Otherwise, read file into array
else
{
while(!feof(file))//The file will loop until end of file
{
if((fgets(line,256,file))!= NULL)//If the line isn't empty
{
array[x] = fgets(line,256,file);//store string in line x of array
x++; //Increment to the next line
}
}
}
}

This line has a few problems:
array[x] = fgets(line,256,file);//store string in line x of array
You've already read the line in the condition of the immediately preceding if statement: the current line that you want to operate on is already in the buffer and now you use fgets to get the next line.
You're using x, which holds the array size, as the array index: instead you'll want to keep a separate variable for the array index that starts at 0 and increments each time through the loop.
Finally, you're trying to copy the strings using =. This will only copy references, it won't make a new copy of the string. So each element of the array will point to the same buffer: line, which will go out of scope and become invalid when your function exits. To populate your array with the strings, you need to make a copy of each one for the array: allocate space for each new string using malloc, then use strncpy to copy each line into your new string (and make sure the copy is null terminated). Alternatively, if you can use strdup, it will take care of allocating the space for you.
But I suspect that this is the cause of your bus error: you're passing in the array size as x, and in your loop, you're assigning to array[x]. The problem with this is that array[x] doesn't belong to the array, the array only has useable indices of 0 to (x - 1).
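Putting those points together, a corrected reader might look roughly like this sketch. The name readLines and its signature are assumptions; it copies with malloc plus memcpy (the length is already known, so strncpy is not needed) and it also strips the trailing newline, which the original code did not attempt.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: read up to max lines from file into array, giving each line
 * its own heap copy. Returns the number of lines stored; the caller
 * frees each stored string. */
int readLines(FILE *file, char *array[], int max)
{
    char line[256];
    int count = 0;                          /* index, separate from the capacity */
    /* Call fgets once per line and use its result as the loop condition. */
    while (count < max && fgets(line, sizeof line, file) != NULL) {
        size_t len = strcspn(line, "\n");   /* length without the newline */
        char *copy = malloc(len + 1);
        if (copy == NULL)
            break;                          /* out of memory: stop early */
        memcpy(copy, line, len);
        copy[len] = '\0';
        array[count++] = copy;              /* never writes past max - 1 */
    }
    return count;
}
```

Called as readLines(file, words, x) with x equal to 5049, this fills at most indices 0 to 5048 and reports how many words were actually read.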

You are passing the value 5049 for x. The first time that the line
array[x] = ...
executes, it's accessing an array location that does not exist.
It looks like you are learning C. Great! A skill you need to master early is basic debugger use. In this case, if you compile your program with
gcc -g myprogram.c -o myprogram
and then run it with
gdb ./myprogram
(I am assuming Linux), then type run at the gdb prompt. When the program crashes, gdb shows the line where the bus error occurred, and the bt command prints a stack trace. This should be enough to help you figure out the error yourself, which in the long run is much better than asking others.
There are many other ways a debugger is useful, but this is high on the list. It gives you a window into your running program.

You are storing the lines in the line buffer, which is defined inside the readFile function, and storing pointers to it in the array. There are two problems with that: you are overwriting the value every time a new string is read, and the buffer is on the stack, so it is invalid once the function returns.

You have at least a few problems:
array[x] = fgets(line,256,file)
This stores the address of line into each array element. line is no longer valid when readFile() returns, so you'll have an array of useless pointers. Even if line had a longer lifetime, it wouldn't be useful to have all your array elements holding the same pointer (they'd each just point to whatever happened to be written in the buffer last).
while(!feof(file))
This is an antipattern for reading a file. See http://c-faq.com/stdio/feof.html and "Using feof() incorrectly". This antipattern is likely responsible for your program looping more than you might expect when reading the file.
You allocate the array to hold 5049 pointers, but you simply read however much is in the file - there's no check on whether you read the expected number of lines, and nothing prevents reading too many. You should think about allocating the array dynamically as you read the file, or have a mechanism to ensure you read the right amount of data (not too little and not too much) and handle the error when it's not right.

I suspect the problem is with (fgets(line,256,file))!=NULL). Another way to read a file is with fread() (see http://www.cplusplus.com/reference/clibrary/cstdio/fread/). Specify the buffer, the size of each element, the number of elements, and the FILE* (a file stream in C), in that order. The routine returns the number of elements read. If the return value is zero, then EOF has been reached or an error has occurred.
char buff[256];
size_t n = fread(buff, sizeof(char), sizeof buff, file);

Faster I/O in C

I have a problem which takes 1000000 lines of input like the ones below from the console.
0 1 23 4 5
1 3 5 2 56
12 2 3 33 5
...
...
I have used scanf, but it is very, very slow. Is there any way to get the input from the console faster? I could use read(), but I am not sure about the number of bytes in each line, so I cannot ask read() to read 'n' bytes.
Thanks,
Very obliged

Use fgets(...) to pull in a line at a time. Note that you should check for the '\n' at the end of the line, and if there is not one, you are either at EOF, or you need to read another buffer's worth, and concatenate the two together. Lather, rinse, repeat. Don't get caught with a buffer overflow.
THEN, you can parse each logical line in memory yourself. I like to use strspn(...) and strcspn(...) for this sort of thing, but your mileage may vary.
Parsing:
Define a delimiters string. Use strspn() to count "non data" chars that match the delimiters, and skip over them. Use strcspn() to count the "data" chars that DO NOT match the delimiters. If this count is 0, you are done (no more data in the line). Otherwise, copy out those N chars to hand to a parsing function such as atoi(...) or sscanf(...). Then, reset your pointer base to the end of this chunk and repeat the skip-delims, copy-data, convert-to-numeric process.
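The skip-delims, measure-token, convert loop described above could be sketched as follows. The name parseLine is hypothetical, and as a small shortcut the sketch hands the token pointer straight to strtol() instead of copying the characters out first, since strtol() stops at the first character that is not part of the number.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse up to max integers from line, separated by
 * spaces, tabs or line endings. Returns how many values were stored. */
int parseLine(const char *line, int out[], int max)
{
    const char *delims = " \t\r\n";
    const char *p = line;
    int count = 0;
    while (count < max) {
        p += strspn(p, delims);             /* skip "non data" chars */
        size_t len = strcspn(p, delims);    /* count the "data" chars */
        if (len == 0)
            break;                          /* no more data in the line */
        out[count++] = (int)strtol(p, NULL, 10);
        p += len;                           /* move past this chunk */
    }
    return count;
}
```

This sketch does not validate the tokens: a chunk such as "abc" is stored as 0, so input that may be malformed needs the end-pointer check that strtol() offers.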

If your example is representative, that you indeed have a fixed format of five decimal numbers per line, I'd probably use a combination of fgets() to read the lines, then a loop calling strtol() to convert from string to integer.
That should be faster than scanf(), while still clearer and more high-level than doing the string to integer conversion on your own.
Something like this:
typedef struct {
int number[5];
} LineOfNumbers;
int getNumbers(FILE *in, LineOfNumbers *line)
{
char buf[128]; /* Should be large enough. */
if(fgets(buf, sizeof buf, in) != NULL)
{
int i;
char *ptr, *eptr;
ptr = buf;
for(i = 0; i < sizeof line->number / sizeof *line->number; i++)
{
line->number[i] = (int) strtol(ptr, &eptr, 10);
if(eptr == ptr)
return 0;
ptr = eptr;
}
return 1;
}
return 0;
}
Note: this is untested (even uncompiled!) browser-written code. But perhaps useful as a concrete example.

You use multiple reads with a fixed size buffer till you hit end of file.

Out of curiosity, what generates that many lines that fast in a console ?

Use binary I/O if you can. Text conversion can slow down the reading by several times. If you're using text I/O because it's easy to debug, consider again binary format, and use the od program (assuming you're on unix) to make it human-readable when needed.
Oh, another thing: there's AT&T's SFIO library, which stands for safer/faster file IO. You might also have some luck with that, but I doubt that you'll get the same kind of speedup as you will with binary format.

Read a line at a time (if buffer not big enough for a line, expand and continue with larger buffer).
Then use dedicated functions (e.g. atoi) rather than general for conversion.
But, most of all, set up a repeatable test harness with profiling to ensure changes really do speed things up.

fread still returns (with a short count) if you ask for more bytes than there are.
I have found that one of the fastest ways to read a file is like this:
/*seek end of file */
fseek(file,0,SEEK_END);
/*get size of file */
size = ftell(file);
/*seek start of file */
fseek(file,0,SEEK_SET);
/* make a buffer for the file */
buffer = malloc(1048576);
/*fread in 1MB at a time until you reach size bytes etc */
On modern computers, put your RAM to use and load the whole thing into RAM; then you can easily work your way through the memory.
At the very least you should be using fread with block sizes as big as you can, and at least as big as the cache blocks or HDD sector size (4096 bytes minimum, I would use 1048576 as a minimum personally). You will find that with much bigger read requests fread is able to sequentially get a big stream in one operation. The suggestion here of some people to use 128 bytes is ridiculous, as you will end up with the drive having to seek all the time: the tiny delay between calls will cause the head to already be past the next sector, which almost certainly has sequential data that you want.
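A fuller sketch of the load-everything approach, with the error checks filled in, might look like this. The function name loadFile is an assumption, and unlike the fragment above it reads the file with a single fread call rather than 1MB at a time; it also appends a null byte so text data can be parsed as a string.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: load a whole regular file into a malloc'd buffer.
 * Stores the byte count in *size; returns NULL on any failure.
 * The caller frees the buffer. */
char *loadFile(const char *path, long *size)
{
    FILE *file = fopen(path, "rb");
    char *buffer = NULL;
    if (file == NULL)
        return NULL;
    /* Seek to the end, ask for the position, then seek back to the start. */
    if (fseek(file, 0, SEEK_END) == 0 && (*size = ftell(file)) >= 0
        && fseek(file, 0, SEEK_SET) == 0) {
        buffer = malloc((size_t)*size + 1);
        if (buffer != NULL) {
            if (fread(buffer, 1, (size_t)*size, file) != (size_t)*size) {
                free(buffer);               /* short read: treat as failure */
                buffer = NULL;
            } else {
                buffer[*size] = '\0';       /* convenient for text parsing */
            }
        }
    }
    fclose(file);
    return buffer;
}
```

This only works for seekable files. Input typed at a console or arriving through a pipe cannot be measured with fseek()/ftell(), so in that case a loop of fixed-size fread calls is still needed.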

You can greatly reduce the time of execution by taking input using fread() or fread_unlocked() (if your program is single-threaded). Locking/Unlocking the input stream just once takes negligible time, so ignore that.
Here is the code:
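#define _GNU_SOURCE /* fread_unlocked() is a glibc extension */
#include <stdio.h>
#include <ctype.h>
#define MAXIO 1000000
static char buf[MAXIO], *s = buf, *end = buf;
/* Return the next input byte, refilling the buffer when it runs out; EOF at end of input. */
static inline int getc1(void)
{
if(s >= end)
{
size_t n = fread_unlocked(buf,sizeof(char),MAXIO,stdin);
if(n == 0) return EOF;
s = buf; end = buf + n;
}
return (unsigned char)*(s++);
}
/* Skip ahead to the next integer and return it; returns 0 if the input ends first. */
static inline int input(void)
{
int t = getc1();
int n=1,res=0;
while(t!=EOF && t!='-' && !isdigit(t)) t=getc1();
if(t=='-')
{
n=-1; t=getc1();
}
while(isdigit(t))
{
res = 10*res + (t&15);
t=getc1();
}
return res*n;
}
This version is plain C. isdigit() is declared in <ctype.h>, and fread_unlocked() is a glibc extension, so on other systems use fread() instead.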
You can take input as a stream of chars by calling getc1() and take integer input by calling input().
The whole idea behind using fread() is to take all the input at once. Calling scanf()/printf(), repeatedly takes up valuable time in locking and unlocking streams which is completely redundant in a single-threaded program.
Also make sure that the value of maxio is such that all input can be taken in a few "roundtrips" only (ideally one, in this case). Tweak it as necessary.
Hope this helps!
