Faster I/O in C

I have a program that must read 1,000,000 lines of input like the following from the console.
0 1 23 4 5
1 3 5 2 56
12 2 3 33 5
...
...
I have used scanf, but it is very, very slow. Is there any way to get the input from the console faster? I could use read(), but I am not sure about the number of bytes in each line, so I cannot ask read() to read n bytes.
Thanks,
Very obliged

Use fgets(...) to pull in a line at a time. Note that you should check for the '\n' at the end of the line, and if there is not one, you are either at EOF, or you need to read another buffer's worth, and concatenate the two together. Lather, rinse, repeat. Don't get caught with a buffer overflow.
THEN, you can parse each logical line in memory yourself. I like to use strspn(...) and strcspn(...) for this sort of thing, but your mileage may vary.
Parsing:
Define a delimiters string. Use strspn() to count "non data" chars that match the delimiters, and skip over them. Use strcspn() to count the "data" chars that DO NOT match the delimiters. If this count is 0, you are done (no more data in the line). Otherwise, copy out those N chars to hand to a parsing function such as atoi(...) or sscanf(...). Then, reset your pointer base to the end of this chunk and repeat the skip-delims, copy-data, convert-to-numeric process.
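As a concrete sketch of that parsing loop (the helper name parse_ints is mine, not from the answer), note that atoi() stops converting at the first non-digit, so you can hand it the tail of the line directly instead of copying each chunk out:

```c
#include <stdlib.h>
#include <string.h>

/* Parse up to max integers from line using strspn()/strcspn().
   Returns the number of values stored in out. */
int parse_ints(const char *line, int *out, int max)
{
    const char *delims = " \t\n";
    int count = 0;

    while (count < max) {
        line += strspn(line, delims);       /* skip "non data" chars */
        size_t len = strcspn(line, delims); /* count "data" chars */
        if (len == 0)
            break;                          /* no more data in the line */
        out[count++] = atoi(line);          /* atoi() stops at the delimiter */
        line += len;                        /* reset base past this chunk */
    }
    return count;
}
```

For the sample input above, parse_ints("0 1 23 4 5", v, 8) stores five values in v.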

If your example is representative, and you indeed have a fixed format of five decimal numbers per line, I'd probably use a combination of fgets() to read the lines, then a loop calling strtol() to convert from string to integer.
That should be faster than scanf(), while still clearer and more high-level than doing the string to integer conversion on your own.
Something like this:
#include <stdio.h>  /* FILE, fgets */
#include <stdlib.h> /* strtol */

typedef struct {
    int number[5];
} LineOfNumbers;

int getNumbers(FILE *in, LineOfNumbers *line)
{
    char buf[128]; /* Should be large enough. */

    if (fgets(buf, sizeof buf, in) != NULL)
    {
        size_t i;
        char *ptr, *eptr;

        ptr = buf;
        for (i = 0; i < sizeof line->number / sizeof *line->number; i++)
        {
            line->number[i] = (int) strtol(ptr, &eptr, 10);
            if (eptr == ptr)
                return 0;
            ptr = eptr;
        }
        return 1;
    }
    return 0;
}
Note: this is untested (even uncompiled!) browser-written code. But perhaps useful as a concrete example.

Use multiple reads with a fixed-size buffer till you hit end of file.

Out of curiosity, what generates that many lines that fast in a console ?

Use binary I/O if you can. Text conversion can slow down the reading by several times. If you're using text I/O because it's easy to debug, consider a binary format anyway, and use the od program (assuming you're on Unix) to make it human-readable when needed.
Oh, another thing: there's AT&T's SFIO library, which stands for safer/faster file IO. You might also have some luck with that, but I doubt that you'll get the same kind of speedup as you will with binary format.
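As a minimal sketch of the binary-format idea (the helper name and the use of tmpfile() are my own, just for illustration): fwrite() and fread() move the raw bytes of an int array with no text conversion, which is where the speedup comes from:

```c
#include <stdio.h>

/* Round-trip n ints through a temporary file in raw binary form.
   Returns 1 on success, 0 on failure. */
int binary_roundtrip(const int *in, int *out, size_t n)
{
    FILE *fp = tmpfile();
    if (fp == NULL)
        return 0;

    int ok = fwrite(in, sizeof *in, n, fp) == n;    /* raw bytes out */
    rewind(fp);                                     /* back to the start */
    ok = ok && fread(out, sizeof *out, n, fp) == n; /* raw bytes in */
    fclose(fp);
    return ok;
}
```

Note that such a file is only directly readable on machines with the same int size and byte order, which is part of why od is handy for inspecting it.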

Read a line at a time (if buffer not big enough for a line, expand and continue with larger buffer).
Then use dedicated functions (e.g. atoi) rather than general-purpose ones for the conversion.
But, most of all, set up a repeatable test harness with profiling to ensure changes really do speed things up.

fread will still return even if you try to read more bytes than there are.
I have found one of the fastest ways to read a file is like this:
/* seek to end of file */
fseek(file, 0, SEEK_END);
/* get size of file */
size = ftell(file);
/* seek back to start of file */
fseek(file, 0, SEEK_SET);
/* make a buffer for the file */
buffer = malloc(1048576);
/* fread in 1 MB at a time until you reach size bytes, etc. */
On modern computers, put your RAM to use and load the whole thing into RAM; then you can easily work your way through the memory.
At the very least you should be using fread with block sizes as big as you can, and at least as big as the cache blocks or HDD sector size (4096 bytes minimum; I would use 1048576 as a minimum personally). You will find that with much bigger read requests, fread is able to sequentially get a big stream in one operation. The suggestion here of some people to use 128 bytes is ridiculous... as you will end up with the drive having to seek all the time, because the tiny delay between calls will cause the head to already be past the next sector, which almost certainly has sequential data that you want.
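Assembled into one function (my own hedged sketch, with the error checks the fragments above omit; it assumes a seekable stream opened in binary mode, since ftell() on a text stream is not a reliable byte count):

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into one malloc'd buffer with a single big fread().
   On success returns the NUL-terminated buffer (caller frees) and stores
   the byte count in *out_size; on failure returns NULL. */
char *slurp(FILE *file, long *out_size)
{
    if (fseek(file, 0, SEEK_END) != 0)   /* seek end of file */
        return NULL;
    long size = ftell(file);             /* get size of file */
    if (size < 0 || fseek(file, 0, SEEK_SET) != 0)
        return NULL;                     /* back to start of file */

    char *buffer = malloc((size_t)size + 1);
    if (buffer == NULL)
        return NULL;
    if (fread(buffer, 1, (size_t)size, file) != (size_t)size) {
        free(buffer);
        return NULL;
    }
    buffer[size] = '\0';
    *out_size = size;
    return buffer;
}
```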

You can greatly reduce the time of execution by taking input using fread() or fread_unlocked() (if your program is single-threaded). Locking/Unlocking the input stream just once takes negligible time, so ignore that.
Here is the code:
#include <cstdio>   /* stdin; fread_unlocked() is a POSIX/glibc extension, not ISO C */
#include <cctype>   /* isdigit */

const int maxio = 1000000;
char buf[maxio], *s = buf + maxio;

inline char getc1(void)
{
    if (s >= buf + maxio)
    {
        fread_unlocked(buf, sizeof(char), maxio, stdin);
        s = buf;
    }
    return *(s++);
}

inline int input()
{
    char t = getc1();
    int n = 1, res = 0;

    while (t != '-' && !isdigit(t))
        t = getc1();
    if (t == '-')
    {
        n = -1;
        t = getc1();
    }
    while (isdigit(t))
    {
        res = 10 * res + (t & 15);
        t = getc1();
    }
    return res * n;
}
This is implemented in C++. In C, include <stdio.h> and <ctype.h> instead; isdigit() comes from <ctype.h>, it is not implicitly available.
You can take input as a stream of chars by calling getc1() and take integer input by calling input().
The whole idea behind using fread() is to take all the input at once. Calling scanf()/printf() repeatedly takes up valuable time locking and unlocking streams, which is completely redundant in a single-threaded program.
Also make sure that the value of maxio is such that all input can be taken in a few "roundtrips" only (ideally one, in this case). Tweak it as necessary.
Hope this helps!


Using fread() to read a text based file - best practices

Consider this code to read a text based file. This sort of fread() usage was briefly touched upon in the excellent book C Programming: A Modern Approach by K.N. King.
There are other methods of reading text based files, but here I am concerned with fread() only.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // Declare file stream pointer.
    FILE *fp = fopen("Note.txt", "r");

    // fopen() call successful.
    if (fp != NULL)
    {
        // Navigate through to end of the file.
        fseek(fp, 0, SEEK_END);

        // Calculate the total bytes navigated.
        long filesize = ftell(fp);

        // Navigate to the beginning of the file so it can be read.
        rewind(fp);

        // Declare array of char with appropriate size.
        char content[filesize + 1];

        // Set last char of array to contain NULL char.
        content[filesize] = '\0';

        // Read the file content.
        fread(content, filesize, 1, fp);

        // Close file stream pointer.
        fclose(fp);

        // Print file content.
        printf("%s\n", content);
    }
    // fopen() call unsuccessful.
    else
    {
        printf("File could not be read.\n");
    }
    return 0;
}
There are some problems I have with this method. My opinion is that this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?
To circumvent this issue, we may use a fixed buffer size and keep reading into a char array of that size. If filesize is less than the buffer size, we simply perform fread() once as described in the code above. Otherwise, we divide the total file size by the buffer size; the integer part of the result is the number of iterations of a loop in which we invoke fread() each time, appending the read buffer to a larger string (which we would have malloc-ed with filesize + 1 beforehand). For the final fread() after the loop, we must read exactly (filesize % buffersize) bytes into an array of that size and append it to the larger string. I find that if we perform fread() for the last chunk using buffersize as its second parameter, extra garbage data of size (buffersize - chunksize) is read in and the data might become corrupted. Are my assumptions here correct? Please explain if/how I have overlooked something.
Also, there is the issue that non-ASCII characters might not have a size of 1 byte. In that case I would assume the proper amount is being read, but each byte is read as a single char, so the text gets distorted somehow? How does fread() handle reading of multi-byte chars?
this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?
fread() does not care about strings (null character terminated arrays). It reads data as if it was in multiples of unsigned char*1 with no special concern to the data content if the stream opened in binary mode and perhaps some data processing (e.g. end-of-line, byte-order-mark) in text mode.
Are my assumptions here correct?
Failed assumptions:
Assuming ftell() return value equals the sum of fread() bytes.
The assumption can be false in text mode (as OP opened the file), and fseek() to the end is technically undefined behavior in binary mode.
Assuming not checking the return value of fread() is OK. Use the return value of fread() to know if an error occurred, end-of-file and how many multiples of bytes were read.
Assuming error checking is not required. ftell(), fread(), and fseek() (used instead of rewind(), so the result can be checked) all deserve error checks. In particular, ftell() readily fails on streams that have no certain end.
Assuming no null characters are read. A text file is not certainly made into one string by reading all and appending a null character. Robust code detects and/or copes with embedded null characters.
Multi-byte: assuming input meets the encoding requirements. Example: robust code detects (and rejects) invalid UTF8 sequences - perhaps after reading the entire file.
Extreme: Assuming a file length <= LONG_MAX, the max value returned from ftell(). Files may be larger.
but each byte is being read as a single char, so the text is distorted somehow? How is fread() handling reading of multi-byte chars?
fread() does not function on multi-byte boundaries, only multiples of unsigned char. A given fread() may end with a portion of a multi-byte and the next fread() will continue from mid-multi-byte.
Instead of a 2-pass approach, consider a single pass:
// Pseudo code
total_read = 0
allocate buffer, say 4096 bytes
loop forever
    if buffer full
        double buffer size (`realloc()`)
    u = unused portion of buffer
    fread u bytes into the unused portion of the buffer
    total_read += number_just_read
    if number_just_read < u
        quit loop
resize buffer to total_read (+ 1 if appending a '\0')
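A sketch of that pseudocode in C (the function name read_all is mine; unlike the ftell() approach, this also works on non-seekable streams such as pipes):

```c
#include <stdio.h>
#include <stdlib.h>

/* Single-pass read of an entire stream into a growing buffer.
   Returns a NUL-terminated buffer (caller frees); byte count in *out_size. */
char *read_all(FILE *fp, size_t *out_size)
{
    size_t cap = 4096, total_read = 0;
    char *buffer = malloc(cap);
    if (buffer == NULL)
        return NULL;

    for (;;) {
        if (total_read == cap) {                   /* buffer full */
            char *tmp = realloc(buffer, cap *= 2); /* double buffer size */
            if (tmp == NULL) { free(buffer); return NULL; }
            buffer = tmp;
        }
        size_t u = cap - total_read;               /* unused portion of buffer */
        size_t got = fread(buffer + total_read, 1, u, fp);
        total_read += got;
        if (got < u)                               /* EOF or error: quit loop */
            break;
    }

    char *tmp = realloc(buffer, total_read + 1);   /* resize, +1 for '\0' */
    if (tmp != NULL)
        buffer = tmp;
    buffer[total_read] = '\0';
    *out_size = total_read;
    return buffer;
}
```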
Alternatively consider the need to read the entire file in before processing the data. I do not know the higher level goal, but often processing data as it arrives makes for less resource impact and faster throughput.
Advanced
Text files may be simple ASCII only, 8-bit code-page defined, or one of various UTF encodings (byte-order-mark, etc.). The last line may or may not end with a '\n'. Robust text processing beyond simple ASCII is non-trivial.
ASCII and UTF-8 are the most common. IMO, handle 1 or both of those and error out on anything that does not meet their requirements.
*1 fread() reads in multiples of bytes as per its 3rd argument, which is 1 in OP's case.
// v --- multiple of 1 byte
fread(content, filesize, 1, fp);

To implement read file line by line such as readline()

Like the readline() function. To implement this, I thought:
first I have to read the file, as in read(fd, buf, 4096);, and then compare buf[i] byte by byte, like if (buf[i] == '\n').
So, if I find the corresponding i, I use lseek() to go back to the first file offset, and then read(fd, buf, i) again. After the first operation like this, the second readline() call will do this mechanism again.
I thought of this solution at first, but comparing buf[i], which means comparing byte by byte, seems too slow to read all of the characters in the fd. Must I compare like this, or is there a better solution?
I'm supposing that the reason you cannot use fgets() is that this is an exercise in which you are supposed to learn something about POSIX low-level I/O functions, and maybe a bit about buffering. If you really only care about getting the data, then I urge you to wrap a stream around your file descriptor via fdopen(), and then to use fgets() to read it.
I thought of this solution at first, but comparing buf[i], which means comparing byte by byte, seems too slow to read all of the characters in the fd. Must I compare like this, or is there a better solution?
You want to read up to the first appearance of a given byte. How do you suppose you could do that without examining each byte you read? It's not possible except maybe with hardware support, and you're unlikely to have that.
I think your concern is misplaced, anyway. It is far more costly to move data from disk to memory than it is to examine the data in memory afterward. If you're going to work at the low level you propose and you want good performance, then you must read the data from disk in suitably large chunks, as it appears you do in your read()-based approach.
On the other hand, it follows that you also want to avoid re-reading any data, so if you're after good performance then the lseek() is unsuitable. Moreover, if you need to handle non-seekable files, such as pipes, then lseek() is completely out of the question. In either of those cases, you must maintain the buffer somehow, and be prepared to serve multiple requests from its contents. Additionally, you must be prepared for the likelihood that line boundaries will not correspond with the buffer boundary, that you may sometimes need more than one read to find a newline, and that it is conceivable that lines will be longer than your buffer, however long that is.
Thus, if fgets() and other stream-based I/O alternatives are not an option for you then you have a buffer management problem to solve. I suggest you start there. Once you've got that worked out, it should be straightforward to write an analog of fgets() in terms of that buffering.
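As a rough illustration of that buffer management (a hedged sketch, not a well-tested implementation: the hypothetical fd_readline() keeps its state in static variables, so it can serve only one stream at a time, and lines longer than the caller's buffer are split rather than grown):

```c
#include <stdio.h>   /* only for the FILE-based demo via fileno() */
#include <unistd.h>

/* One-stream buffered readline over a file descriptor. Leftover bytes
   stay in the buffer between calls, so nothing is re-read and no lseek()
   is needed. Returns the number of bytes stored (including '\n' if seen),
   or 0 at EOF/error. */
static char rl_buf[4096];
static size_t rl_pos = 0, rl_len = 0;

size_t fd_readline(int fd, char *line, size_t cap)
{
    size_t out = 0;

    while (out + 1 < cap) {
        if (rl_pos == rl_len) {                    /* buffer empty: refill */
            ssize_t n = read(fd, rl_buf, sizeof rl_buf);
            if (n <= 0)
                break;                             /* EOF or error */
            rl_pos = 0;
            rl_len = (size_t)n;
        }
        char c = rl_buf[rl_pos++];
        line[out++] = c;
        if (c == '\n')
            break;                                 /* found the line end */
    }
    line[out] = '\0';
    return out;
}
```

Note how line boundaries and buffer boundaries are independent: a line may be served entirely from the buffer, or may straddle several read() calls.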
Implement fgetc() using read() for 1 character, then use your own getc to implement readline:
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>

char my_getc(void)
{
    unsigned char ch[1];

    if (read(0, ch, 1) <= 0)  /* fd 0 is stdin, not 1; treat EOF/error as end of line */
        return '\n';
    return ch[0];
}

char *my_readline(void)
{
    char line[4096];
    char *ret;
    char c;
    int position = 0;

    while ((c = my_getc()) != '\n')  /* parentheses needed: != binds tighter than = */
        line[position++] = c;
    line[position] = '\0';

    ret = malloc(strlen(line) + 1);  /* +1 for the terminating '\0' */
    strcpy(ret, line);               /* the original never copied the line */
    return ret;
}

int main(int argc, char *argv[])
{
    printf("%s\n", my_readline());
    return 0;
}
If you need a well-tested solution, you should maybe read the source of an existing implementation...

Compare 2 files using POSIX system calls

C newbie here.
Banging my head against the wall with this one...:/
I'm trying to compare two files which are not being used by any other process (which means they are static), using only system calls. I have no problem doing so using fopen(), but it feels much more complicated when using just open(), read() and write()...
here's what I got so far:
...// Some code here to get file descriptors and other file manipulation checks
int one = read(srcfd1, buf1, src1_size);
int two = read(srcfd2, buf2, src2_size);

printf("%s\n", buf1); // works fine till it gets here...

int samefile = strcmp(buf1, buf2); // Crashes somewhere around here...
if (samefile != 0)
{
    printf("not equal\n");
    return 1;
}
else
{
    printf("equal\n");
    return 2;
}
So basically, what I think I need to do is compare the two buffers, but this does not seem to be working...
I found something which I believe should give me some idea here but I can't make sense of it (the last answer in the link...).
The return values are irrelevant.
Appreciate any help I can get...:/
Your buffers are not NUL terminated, so it doesn't make sense to use strcmp - this will almost certainly fail unless your buffers happen to contain a 0 somewhere. Also you don't say whether these files are text files or binary files, but to make this work (for either text or binary) you should change:
int samefile = strcmp(buf1,buf2); //Crashes somewhere around here..
to:
int samefile = memcmp(buf1,buf2,src1_size); // use memcmp to compare buffers
Note that you should also check that src1_size == src2_size prior to calling memcmp.
This crashes since the buffers are possibly not null-terminated. You are trying to print them as strings ("%s" in printf) and doing a strcmp too.
You can try null-terminating the buffers after your read calls, and then print them as strings.
buf1[one] = '\0';
buf2[two] ='\0';
This will most likely fix your code. But a few other points,
1) Are your buffers sufficiently large for the file?
2) It is better to read data partially than to try to grab everything in one go.
(That means: use a loop to read data till read returns 0.)
like,
Assume the array "buf" is sufficiently large to hold all the file's data. The number "512" means read will try to read at most 512 bytes per call, and the iteration will continue till read returns 0 (when there is no more data), or possibly a negative number in case of an error. The array's index is incremented by the number of bytes read so far, so that the data does not get overwritten.
An example: if a file has, say, 515 bytes, read will be called three times. The first call will return 512, the second call will return 3, and the third call will return 0. (The read call returns the number of bytes actually read.)
index = 0;
while ((no_of_bytes_read = read(fd, buf + index, 512)) > 0)
{
    index = index + no_of_bytes_read;
}
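Putting the pieces together, here is a hedged sketch of the whole comparison done chunk by chunk with read() and memcmp() (the helper name is my own; the length check assumes regular files, where a short read only happens at end of file):

```c
#include <stdio.h>   /* fileno() in the usage example */
#include <string.h>
#include <unistd.h>

/* Compare two file descriptors chunk by chunk.
   Returns 1 if the contents are identical, 0 if they differ or on error.
   Assumes regular files, where a short read() only happens at EOF. */
int same_content(int fd1, int fd2)
{
    char buf1[4096], buf2[4096];

    for (;;) {
        ssize_t one = read(fd1, buf1, sizeof buf1);
        ssize_t two = read(fd2, buf2, sizeof buf2);

        if (one < 0 || two < 0 || one != two)
            return 0;                 /* read error or different lengths */
        if (one == 0)
            return 1;                 /* both hit EOF: files are equal */
        if (memcmp(buf1, buf2, (size_t)one) != 0)
            return 0;                 /* this chunk differs */
    }
}
```

This avoids both the null-termination problem (memcmp compares raw bytes) and the need to hold either file entirely in memory.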

Why use 4096 elements for a char array buffer?

I found a program that takes in standard input
int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <PATTERN>\n", argv[0]);
        return 2;
    }

    /* we're not going to worry about long lines */
    char buf[4096]; // 4 kibi
    while (!feof(stdin) && !ferror(stdin)) { // when given a file through input redirection, the file becomes stdin
        if (!fgets(buf, sizeof(buf), stdin)) { // fgets() reads at most sizeof(buf) - 1 characters from stdin into buf; it stops reading when a newline is read
            break;
        }
        if (rgrep_matches(buf, argv[1])) {
            fputs(buf, stdout); // writes the string to stdout
            fflush(stdout);
        }
    }
    if (ferror(stdin)) {
        perror(argv[0]); // reports the error
        return 1;
    }
    return 0;
}
Why is the buf set to 4096 elements? Is it because the maximum number of characters on each line can only be 4096?
The answer is in the code you pasted:
/* we're not going to worry about long lines */
char buf[4096]; // 4kibi
Lines longer than 4096 characters can occur, but the author didn't deem them worth caring about.
Note also the definition of fgets:
fgets() reads in at most one less than size characters from stream and stores them into the buffer pointed to by s. Reading stops after an EOF or a newline. If a newline is read, it is stored into the buffer. A terminating null byte (\0) is stored after the last character in the buffer.
So if there is a line longer than 4095 characters (since the 4096th is reserved for the null byte), it will be split across multiple iterations of the while loop.
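A small demonstration of that splitting (the helper count_chunks is mine): with an 8-byte buffer, fgets() consumes at most 7 characters per call, so a 21-character line takes three calls:

```c
#include <stdio.h>

/* Count how many fgets() calls it takes to drain a stream when the
   buffer holds only 8 bytes (7 characters + the '\0'). */
int count_chunks(FILE *fp)
{
    char buf[8];
    int calls = 0;

    while (fgets(buf, sizeof buf, fp) != NULL)
        calls++;    /* a long line simply spans several calls */
    return calls;
}
```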
The program just reads up to 4096 characters per iteration.
There's no limit on the size of a line, but there may be a limit on the size of the stack (8 MB on modern Linux systems).
Most programmers choose what fits best for the program being implemented; in this case the programmer commented that there's no need to worry about longer lines.
The author seems to just have a very large memory block for his expected input, to avoid dealing with chunks.
The seemingly awkward number 4096 is most likely explained by the fact that it is a) a power of two number and b) is a memory page size. So when the system chooses to swap out a page to disc, it can do it in one go without any overhead involved.
Whether this really helps is another question, because if you allocate a page with malloc, it may not be aligned on a page boundary.
I myself also use such a number often, because it doesn't hurt and in the best case it might help. However, it is only really relevant if you are worried about speed and you have real control over the allocation process in detail. If you allocate a page directly from the OS, then such a size might really have some benefits.
There is no such thing as a maximum number of characters in a line. 4096 is chosen assuming that under normal conditions no line will be longer than 4096 bytes.
It is more like preparing for the worst case.
If you take an array size smaller than the size of a line, the operation breaks into more than one step until EOF is encountered.
I think it is simply that the author chose the char buffer size to be 4 *kibi* (4096 = 1024 * 4) by design, as commented in the code.

Go to a certain point of a binary file in C (using fseek) and then reading from that location (using fread)

I am wondering if this is the best way to go about solving my problem.
I know the values for particular offsets of a binary file where the information I want is held...What I want to do is jump to the offsets and then read a certain amount of bytes, starting from that location.
After using google, I have come to the conclusion that my best bet is to use fseek() to move to the position of the offset, and then to use fread() to read an amount of bytes from that position.
Am I correct in thinking this? And if so, how is best to go about doing so? i.e. how to incorporate the two together.
If I am not correct, what would you suggest I do instead?
Many thanks in advance for your help.
Matt
Edit:
I followed a tutorial on fread() and adjusted it to the following:
#include <stdio.h>

int main()
{
    FILE *f;
    char buffer[11];

    if (f = fopen("comm_array2.img", "rt"))
    {
        fread(buffer, 1, 10, f);
        buffer[10] = 0;
        fclose(f);
        printf("first 10 characters of the file:\n%s\n", buffer);
    }
    return 0;
}
So I used the file 'comm_array2.img' and read the first 10 characters from the file.
But from what I understand of it, this reads from the start of the file; I want to read from some place in the file (an offset).
Is this making more sense?
Edit Number 2:
It appears that I was being a bit dim, and all that is needed (it would seem from my attempt) is to put the fseek() before the fread() that I have in the code above, and it seeks to that location and then reads from there.
If you are using file streams instead of file descriptors, then you can write yourself a (simple) function analogous to the POSIX pread() system call.
You can easily emulate it using streams instead of file descriptors1. Perhaps you should write yourself a function such as this (which has a slightly different interface from the one I suggested in a comment):
size_t fpread(void *buffer, size_t size, size_t nitems, size_t offset, FILE *fp)
{
    if (fseek(fp, (long)offset, SEEK_SET) != 0)
        return 0;
    return fread(buffer, size, nitems, fp);
}
This is a reasonable compromise between the conventions of pread() and fread().
What would the syntax of the function call look like? For example, reading from the offset 732 and then again from offset 432 (both being from start of the file) and filestream called f.
Since you didn't say how many bytes to read, I'm going to assume 100 each time. I'm assuming that the target variables (buffers) are buffer1 and buffer2, and that they are both big enough.
if (fpread(buffer1, 100, 1, 732, f) != 1)
    ...error reading at offset 732...
if (fpread(buffer2, 100, 1, 432, f) != 1)
    ...error reading at offset 432...
The return count is the number of complete units of 100 bytes each; either 1 (got everything) or 0 (something went awry).
There are other ways of writing that code:
if (fpread(buffer1, sizeof(char), 100, 732, f) != 100)
    ...error reading at offset 732...
if (fpread(buffer2, sizeof(char), 100, 432, f) != 100)
    ...error reading at offset 432...
This reads 100 single bytes each time; the test ensures you got all 100 of them, as expected. If you capture the return value in this second example, you can know how much data you did get. It would be very surprising if the first read succeeded and the second failed; some other program (or thread) would have had to truncate the file between the two calls to fpread(), but funnier things have been known to happen.
1 The emulation won't be perfect; the pread() call provides guaranteed atomicity that the combination of fseek() and fread() will not provide. But that will seldom be a problem in practice, unless you have multiple processes or threads concurrently updating the file while you are trying to position and read from it.
It frequently depends on the distance between the parts you care about. If you're only skipping over/ignoring a few bytes between the parts you care about, it's often easier to just read that data and ignore what you read, rather than using fseek to skip past it. A typical way to do this is define a struct holding both the data you care about, and place-holders for the ones you don't care about, read in the struct, and then just use the parts you care about:
struct whatever {
    long a;
    long ignore;
    short b;
} w;

fread(&w, 1, sizeof(w), some_file);
// use 'w.a' and 'w.b' here.
If there's any great distance between the parts you care about, though, chances are that your original idea of using fseek to get to the parts that matter will be simpler.
Your theory sounds correct. Open, seek, read, close.
Create a struct for the data you want to read, and pass a pointer to the struct's allocated memory to read(). You'll likely need #pragma pack(1) or similar on the struct to prevent misalignment problems.
