I am trying to implement something like a readline() function. To do this, I thought I would first read from the file with read(fd, buf, 4096);, and then compare buf byte by byte, like if (buf[i] == '\n').
If I find the corresponding i, I then use lseek() to go back to the original file offset, and read(fd, buf, i) again. After this first operation, the second readline() call repeats the same mechanism.
This was my first idea, but comparing buf[i], that is, comparing byte by byte, seems too slow to examine every character in the fd. Do I have to compare like this, or is there a better solution?
I'm supposing that the reason you cannot use fgets() is that this is an exercise in which you are supposed to learn something about POSIX low-level I/O functions, and maybe a bit about buffering. If you really only care about getting the data, then I urge you to wrap a stream around your file descriptor via fdopen(), and then to use fgets() to read it.
I thought of this solution first, but comparing buf[i], that is, comparing byte by byte, seems too slow to examine every character in the fd. Do I have to compare like this, or is there a better solution?
You want to read up to the first appearance of a given byte. How do you suppose you could do that without examining each byte you read? It's not possible except maybe with hardware support, and you're unlikely to have that.
I think your concern is misplaced, anyway. It is far more costly to move data from disk to memory than it is to examine the data in memory afterward. If you're going to work at the low level you propose and you want good performance, then you must read the data from disk in suitably large chunks, as it appears you do in your read()-based approach.
On the other hand, it follows that you also want to avoid re-reading any data, so if you're after good performance then the lseek() is unsuitable. Moreover, if you need to handle non-seekable files, such as pipes, then lseek() is completely out of the question. In either of those cases, you must maintain the buffer somehow, and be prepared to serve multiple requests from its contents. Additionally, you must be prepared for the likelihood that line boundaries will not correspond with the buffer boundary, that you may sometimes need more than one read to find a newline, and that it is conceivable that lines will be longer than your buffer, however long that is.
Thus, if fgets() and other stream-based I/O alternatives are not an option for you then you have a buffer management problem to solve. I suggest you start there. Once you've got that worked out, it should be straightforward to write an analog of fgets() in terms of that buffering.
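As a concrete starting point, the buffer management described above might be sketched like this. This is only a sketch: the fixed buffer size, the single shared buffer (rather than per-descriptor state), the truncation of overlong lines, and the name my_readline are all simplifying assumptions, not a definitive implementation.

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define RL_BUFSZ 4096

/* Persistent buffer state; real code would keep this per file descriptor. */
static char rl_buf[RL_BUFSZ];
static ssize_t rl_len = 0;   /* bytes currently in rl_buf */
static ssize_t rl_pos = 0;   /* index of the next unread byte */

/* Copy the next line (without its '\n') into out; return its length,
   or -1 at end of file. Lines longer than outsz - 1 are truncated. */
ssize_t my_readline(int fd, char *out, size_t outsz)
{
    size_t n = 0;
    for (;;) {
        if (rl_pos >= rl_len) {              /* buffer exhausted: refill it */
            rl_len = read(fd, rl_buf, RL_BUFSZ);
            rl_pos = 0;
            if (rl_len <= 0) {               /* EOF or read error */
                out[n] = '\0';
                return n > 0 ? (ssize_t)n : -1;
            }
        }
        char c = rl_buf[rl_pos++];
        if (c == '\n') {
            out[n] = '\0';
            return (ssize_t)n;
        }
        if (n + 1 < outsz)
            out[n++] = c;
    }
}
```

Note that data is read from the descriptor in large chunks but never re-read, and that leftover bytes after a newline stay in the buffer to serve the next call, which is exactly the property the lseek() approach lacks.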
Implement fgetc using a read() of 1 character, then use your own getc to implement readline?
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>
int my_getc(void)
{
    unsigned char ch;
    if (read(0, &ch, 1) != 1)   /* fd 0 is stdin; fd 1 would be stdout */
        return EOF;
    return ch;
}
char *my_readline(void)
{
    char line[4096];
    char *ret;
    int c;
    int position = 0;
    while ((c = my_getc()) != '\n' && c != EOF)
        line[position++] = c;
    line[position] = '\0';
    ret = malloc(strlen(line) + 1);
    if (ret != NULL)
        strcpy(ret, line);
    return ret;
}
int main(int argc, char *argv[])
{
    char *line = my_readline();
    printf("%s\n", line);
    free(line);
    return 0;
}
If you need a well-tested solution, you should maybe read the source of an existing implementation...
I am writing an academic project in C and I can use only the <fcntl.h> and <unistd.h> headers for file operations.
I have the function to read file line by line. The algorithm is:
Set pointer at the beginning of the file and get current position.
Read data to the buffer (char buf[100]) with constant size, iterate character by character and detect end of line '\n'.
Increment current position: curr_pos = curr_pos + length_of_read_line;
Set pointer to current position using lseek(fd, current_position, SEEK_SET);
SEEK_SET - set pointer to given offset from the beginning of the file. In my pseudo code current_position is the offset.
And actually it works fine, but I always move the pointer starting from the beginning of the file - I use SEEK_SET - so it isn't optimized.
lseek also accepts the argument SEEK_CUR - the current position. How can I move the pointer back from its current position (SEEK_CUR)? I tried to set a negative offset, but it didn't work.
The most efficient way to read lines of data from a file is typically to read a large chunk of data that may span multiple lines, process lines of data from the chunk until one reaches the end, move any partial line from the end of the buffer to the start, and then read another chunk of data. Depending upon the target system and task to be performed, it may be better to read enough to fill whatever space remains after the partial line, or it may be better to always read a power-of-two number of bytes and make the buffer large enough to accommodate a chunk that size plus a maximum-length partial line (left over from the previous read). The one difficulty with this approach is that all data must be read from the stream through the same buffer. In cases where that is practical, however, it will often allow better performance than using many separate calls to fread, and may be nicer than using fgets.
While it should be possible for a standard-library function to facilitate line input, the design of fgets is rather needlessly hostile since it provides no convenient indication of how much data it has read. After reading each line, code that wants a string containing the printable portion will have to use strlen to try to ascertain how much data was read (hopefully the input won't contain any zero bytes) and then check the byte before the trailing zero to see if it's a newline. Not impossible, but awkward at the very least. If the fread-and-buffer approach will satisfy an application's needs, it's likely to be at least as efficient as using fgets, if not more so, and since the effort required to use fgets() robustly will be comparable to that required to use a buffering approach, one may as well use the latter.
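The chunk-and-carry approach described above might be sketched as follows. This is a sketch only: the chunk size, the callback interface, and the function names are assumptions, and lines longer than one chunk are deliberately not handled.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK 4096

/* Invoke process_line once per line; a partial line at the end of each
   chunk is moved to the front of the buffer before the next fread.
   Lines longer than CHUNK are not handled by this sketch. */
void for_each_line(FILE *fp, void (*process_line)(const char *line, size_t len))
{
    char buf[CHUNK];
    size_t have = 0;                         /* bytes carried over */
    size_t got;
    while ((got = fread(buf + have, 1, CHUNK - have, fp)) > 0) {
        have += got;
        char *start = buf;
        char *nl;
        while ((nl = memchr(start, '\n', have - (size_t)(start - buf))) != NULL) {
            process_line(start, (size_t)(nl - start));
            start = nl + 1;
        }
        have -= (size_t)(start - buf);       /* keep only the partial line */
        memmove(buf, start, have);
    }
    if (have > 0)                            /* final line with no newline */
        process_line(buf, have);
}

/* Example callback: count the lines seen. */
static size_t line_count = 0;
static void count_line(const char *line, size_t len)
{
    (void)line;
    (void)len;
    line_count++;
}
```

The memmove of the leftover bytes is the "move any partial line to the start" step; everything before it has already been handed to the callback, so no byte is ever read from the file twice.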
Since your question is tagged as posix, I would go with getline(), without having to manually take care of moving the file pointer.
Example:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
FILE* fp;
char* line = NULL;
size_t len = 0;
ssize_t read;
fp = fopen("input.txt", "r");
if(fp == NULL)
return -1;
while((read = getline(&line, &len, fp)) != -1)
{
printf("Read line of length %zd:\n", read);
printf("%s", line);
}
fclose(fp);
if(line)
free(line);
return 0;
}
Output with custom input:
Read line of length 11:
first line
Read line of length 12:
second line
Read line of length 11:
third line
I've been trying to understand K&R's version of putc for some time now, and I'm out of resources (google, stack overflow, clcwiki don't quite have what I'm looking for and I have no friends or colleagues to turn to). I'll explain the context first and then ask for clarification.
This chapter of the text introduced an example of a data structure that describes a file. The structure includes a character buffer for reading and writing large chunks at a time. They then asked the reader to write a version of the standard library putc.
As a clue for the reader, K&R wrote a version of getc that supports both buffered and unbuffered reading. They also wrote the skeleton of the putc macro, leaving the user to write the function _flushbuf() for themselves. The putc macro looks like this (p is a pointer to the file structure):
int _flushbuf(int, FILE *);
#define putc(x,p) (--(p)->cnt >= 0 \
? *(p)->ptr++ = (x) : _flushbuf((x),p)
typedef struct {
int cnt; /*characters left*/
char *ptr; /*next character position*/
char *base; /*location of buffer*/
int flag; /*mode of file access*/
int fd; /*file descriptor*/
} FILE;
Confusingly, the conditional in the macro is actually testing if the structure's buffer is full (this is stated in the text) - as a side note, the conditional in getc is exactly the same but means the buffer is empty. Weird?
Here's where I need clarification: I think there's a pretty big problem with buffered writing in putc; since writing to p is only performed in _flushbuf(), but _flushbuf() is only called when the file structure's buffer is full, then writing is only done if the buffer is entirely filled. And the size for buffered reading is always the system's BUFSIZ. Writing anything other than exactly 'BUFSIZ' characters just doesn't happen, because _flushbuf() will never be called in putc.
putc works just fine for unbuffered writing. But the design of the macro makes buffered writing almost entirely pointless. Is this correct, or am I missing something here? Why is it like this? I truly appreciate any and all help here.
I think you may be misreading what takes place inside the putc() macro; there are a lot of operators and symbols in there, and they all matter (and their order-of-execution matters!) for this to work. To help understand it better, let's substitute it into a real usage, and then expand it out until you can see what's going on.
Let's start with a simple invocation of putc('a', file), as in the example below:
FILE *file = /* ... get a file pointer from somewhere ... */;
putc('a', file);
Now substitute the macro in place of the call to putc() (this is the easy part, and is performed by the C preprocessor; also, I think you're missing a parenthesis at the end of the version you provided, so I'm going to insert it at the end where it belongs):
FILE *file = /* ... get a file pointer from somewhere ... */;
(--(file)->cnt >= 0 ? *(file)->ptr++ = ('a') : _flushbuf(('a'),file));
Well, isn't that a mess of symbols. Let's strip off the unneeded parentheses, and then convert the ?...: into the if-statement that it actually is under the hood:
FILE *file = /* ... get a file pointer from somewhere ... */;
if (--file->cnt >= 0)
*file->ptr++ = 'a';
else
_flushbuf('a', file);
This is closer, but it's still not quite obvious what's going on. Let's move the increments and decrements into separate statements so it's easier to see the order of execution:
FILE *file = /* ... get a file pointer from somewhere ... */;
--file->cnt;
if (file->cnt >= 0) {
*file->ptr = 'a';
file->ptr++;
}
else {
_flushbuf('a', file);
}
Now, with the content reordered, it should be a little easier to see what's going on. First, we decrement cnt, the count of remaining characters. If that indicates there's room left, then it's safe to write a into the file's buffer, at the file's current write pointer, and then we move the write pointer forward.
If there isn't room left, then we call _flushbuf(), passing it both the file (whose buffer is full) and the character we wanted to write but couldn't. Presumably, _flushbuf() will first write the whole buffer out to the actual underlying I/O system, and then it will write that character, and then likely reset ptr to the beginning of the buffer and cnt to a big number to indicate that the buffer is able to store lots of data again.
So why does this result in buffered writing? The answer is that the _flushbuf() call only gets performed "every once in a while," when the buffer is full. Writing a byte to a buffer is cheap, while performing the actual I/O is expensive, so this results in _flushbuf() being invoked relatively rarely (only once for every BUFSIZ characters).
If you write enough, the buffer will eventually get full. If you don't, you will eventually close the file (or the runtime will do that for you when main() returns) and fclose() calls _flushbuf() or its equivalent. Or you will manually fflush() the stream, which also does the equivalent of _flushbuf().
If you were to write a few characters and then call sleep(1000), you would find that nothing gets printed for quite a while. That's indeed the way it works.
The tests in getc and putc are the same because in one case the counter records how many characters are available and in the other case it records how much space is available.
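To make the description concrete, here is a guess at what _flushbuf might look like. This is not K&R's actual solution: it assumes buffered mode and that base already points at a BUFSIZE-byte buffer, whereas the full exercise also handles unbuffered files and allocates the buffer on first use. The struct is renamed MYFILE here only so it does not clash with the standard library's FILE.

```c
#include <fcntl.h>
#include <unistd.h>

/* K&R's FILE structure from the question, renamed to avoid clashing
   with <stdio.h>'s FILE. */
typedef struct {
    int cnt;      /* characters left */
    char *ptr;    /* next character position */
    char *base;   /* location of buffer */
    int flag;     /* mode of file access */
    int fd;       /* file descriptor */
} MYFILE;

#define BUFSIZE 4096

/* Sketch of _flushbuf: write out whatever is in the buffer, reset it,
   and store the pending character x as the first byte of the next batch. */
int _flushbuf(int x, MYFILE *p)
{
    int filled = (int)(p->ptr - p->base);    /* bytes waiting in the buffer */
    if (filled > 0 && write(p->fd, p->base, filled) != filled)
        return -1;                           /* flush failed: report EOF */
    p->ptr = p->base;                        /* buffer is empty again */
    p->cnt = BUFSIZE - 1;                    /* room left after storing x */
    *p->ptr++ = (char)x;                     /* store the pending character */
    return x;
}
```

Resetting cnt to BUFSIZE - 1 (not BUFSIZE) accounts for the character that _flushbuf itself just stored, so the macro's --cnt test stays consistent.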
Consider the following piece of code for reading the contents of the file into a buffer
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#define BLOCK_SIZE 4096
int main()
{
int fd=-1;
ssize_t bytes_read=-1;
int i=0;
char buff[50];
//Arbitrary size for the buffer?? How to optimise?
//Dynamic allocation is a choice, but what is the
//right way to relate the file size to the buffer size?
fd=open("./file-to-buff.txt",O_RDONLY);
if(-1 == fd)
{
perror("Open Failed");
return 1;
}
while((bytes_read=read(fd,buff,BLOCK_SIZE))>0)
{
printf("bytes_read=%zd\n",bytes_read);
}
//Test the characters read from the file into the buffer. The file contains "Hello"
while(buff[i]!='\0')
{
printf("buff[%d]=%c\n",i,buff[i]);
i++;
//buff[5]='\n'-How?
}
//buff[6]='\0'-How?
close(fd);
return 0;
}
Code Description:
The input file contains the string "Hello".
This content needs to be copied into the buffer.
The objective is achieved with the open and read POSIX APIs.
The read API is given a pointer to a buffer of an *arbitrary size* to copy the data into.
Questions:
Dynamic allocation is the method that must be used to optimize the size of the buffer. What is the right procedure to relate/derive the buffer size from the input file size?
I see that at the end of the read operation, read has copied a newline character and a NULL character in addition to the characters "Hello". Please elaborate on this behavior of read.
Sample Output
bytes_read=6
buff[0]=H
buff[1]=e
buff[2]=l
buff[3]=l
buff[4]=o
buff[5]=
PS: Input file is user created file not created by a program (using write API). Just to mention here, in case if it makes any difference.
Since you want to read the whole file, the best way is to make the buffer as big as the file size. There's no point in resizing the buffer as you go. That just hurts performance without good reason.
You can get the file size in several ways. The quick-and-dirty way is to lseek() to the end of the file:
// Get size.
off_t size = lseek(fd, 0, SEEK_END); // You should check for an error return in real code
// Seek back to the beginning.
lseek(fd, 0, SEEK_SET);
// Allocate enough to hold the whole contents plus a '\0' char.
char *buff = malloc(size + 1);
The other way is to get the information using fstat():
struct stat fileStat;
fstat(fd, &fileStat); // Don't forget to check for an error return in real code
// Allocate enough to hold the whole contents plus a '\0' char.
char *buff = malloc(fileStat.st_size + 1);
To get all the needed types and function prototypes, make sure you include the needed header:
#include <sys/stat.h> // For fstat()
#include <unistd.h> // For lseek()
Note that read() does not automatically terminate the data with \0. You need to do that manually, which is why we allocate an extra character (size+1) for the buffer. The reason why there's already a \0 character there in your case is pure random chance.
Of course, since buf is now a dynamically allocated array, don't forget to free it again when you don't need it anymore:
free(buff);
Be aware though, that allocating a buffer that's as large as the file you want to read into it can be dangerous. Imagine if (by mistake or on purpose, doesn't matter) the file is several GB big. For cases like this, it's good to have a maximum allowable size in place. If you don't want any such limitations, however, then you should switch to another method of reading from files: mmap(). With mmap(), you can map parts of a file to memory. That way, it doesn't matter how big the file is, since you can work only on parts of it at a time, keeping memory usage under control.
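The mmap() alternative mentioned above might look like the following sketch. The function name and the copy-out interface are just for illustration; real code would typically work on the mapped bytes in place rather than copying them.

```c
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Map path read-only and copy up to outsz - 1 bytes into out,
   '\0'-terminated. Returns the number of bytes copied, or -1 on error. */
ssize_t slurp_via_mmap(const char *path, char *out, size_t outsz)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                               /* the mapping outlives the fd */
    if (data == MAP_FAILED)
        return -1;
    size_t n = (size_t)st.st_size < outsz - 1 ? (size_t)st.st_size : outsz - 1;
    memcpy(out, data, n);                    /* in real code, work in place */
    out[n] = '\0';
    munmap(data, (size_t)st.st_size);
    return (ssize_t)n;
}
```

Since the kernel pages the file in on demand, a multi-gigabyte file costs no more up-front memory than a small one, which is exactly why mmap() sidesteps the buffer-sizing question.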
1, you can get the file size with stat(filename, &st), but defining the buffer as the page size is just fine
2, first, there is no NULL character after "Hello"; it must be an accident that the stack area you allocated was 0 before your code executed, please refer to APUE chapter 7.6. In fact you must initialize a local variable before using it.
I tried to generate the text file with vim, emacs and echo -n Hello > file-to-buff.txt, only vim adds a line break automatically
You could consider allocating the buffer dynamically by first creating a buffer of a fixed size using malloc and doubling (with realloc) the size when you fill it up. This would have a good time complexity and space trade off.
At the moment you repeatedly read into the same buffer. You should advance the position in the buffer after each read, otherwise you will overwrite the buffer contents with the next section of the file.
The code you supply allocates 50 bytes for the buffer yet you pass 4096 as the size to the read. This could result in a buffer overflow for any files over the size of 50 bytes.
As for the `\n' and '\0'. The newline is probably in the file and the '\0' was just already in the buffer. The buffer is allocated on the stack in your code and if that section of the stack had not been used yet it would probably contain zeros, placed there by the operating system when your program was loaded.
The operating system makes no attempt to terminate the data read from the file, it might be binary data or in a character set that it doesn't understand. Terminating the string, if needed, is up to you.
A few other points that are more a matter of style:
You could consider using a for (i = 0; buff[i]; ++i) loop instead of a while for the printing out at the end. This way if anyone messes with the index variable i you will be unaffected.
You could close the file earlier, after you finish reading from it, to avoid having the file open for an extended period of time (and maybe forgetting to close it if some kind of error happens).
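The grow-by-doubling idea from the first point above might be sketched like this. The initial capacity of 64 bytes and the function name are arbitrary assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read the whole stream into a dynamically grown buffer, doubling its
   capacity whenever it fills up. Returns a '\0'-terminated buffer the
   caller must free, or NULL on failure; *out_len gets the byte count. */
char *read_all(FILE *fp, size_t *out_len)
{
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    size_t got;
    while ((got = fread(buf + len, 1, cap - len, fp)) > 0) {
        len += got;
        if (len == cap) {                    /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    buf[len] = '\0';                         /* len < cap always holds here */
    if (out_len != NULL)
        *out_len = len;
    return buf;
}
```

Doubling keeps the total copying done by realloc proportional to the final size, so the amortized cost per byte stays constant regardless of how big the file turns out to be.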
For your second question: read doesn't automatically add a '\0' character.
If your file is a text file, you must add a '\0' after calling read, to indicate the end of the string.
In C, the end of a string is represented by this character. If read stores 4 characters, printf will print those 4 characters and then test the 5th: if it's not '\0', it will continue to print until the next '\0'.
That is also a source of buffer overflows.
As for the '\n', it is probably in the input file.
I am wondering if this is the best way to go about solving my problem.
I know the values for particular offsets of a binary file where the information I want is held...What I want to do is jump to the offsets and then read a certain amount of bytes, starting from that location.
After using google, I have come to the conclusion that my best bet is to use fseek() to move to the position of the offset, and then to use fread() to read an amount of bytes from that position.
Am I correct in thinking this? And if so, how is best to go about doing so? i.e. how to incorporate the two together.
If I am not correct, what would you suggest I do instead?
Many thanks in advance for your help.
Matt
Edit:
I followed a tutorial on fread() and adjusted it to the following:
#include <stdio.h>
int main()
{
FILE *f;
char buffer[11];
if (f = fopen("comm_array2.img", "rt"))
{
fread(buffer, 1, 10, f);
buffer[10] = 0;
fclose(f);
printf("first 10 characters of the file:\n%s\n", buffer);
}
return 0;
}
So I used the file 'comm_array2.img' and read the first 10 characters from the file.
But from what I understand of it, this goes from start-of-file, I want to go from some-place-in-file (offset)
Is this making more sense?
Edit Number 2:
It appears that I was being a bit dim, and all that is needed (it would seem from my attempt) is to put the fseek() before the fread() that I have in the code above, and it seeks to that location and then reads from there.
If you are using file streams instead of file descriptors, then you can write yourself a (simple) function analogous to the POSIX pread() system call.
You can easily emulate it using streams instead of file descriptors1. Perhaps you should write yourself a function such as this (which has a slightly different interface from the one I suggested in a comment):
size_t fpread(void *buffer, size_t size, size_t nitems, size_t offset, FILE *fp)
{
if (fseek(fp, offset, SEEK_SET) != 0)
return 0;
return fread(buffer, size, nitems, fp);
}
This is a reasonable compromise between the conventions of pread() and fread().
What would the syntax of the function call look like? For example, reading from offset 732 and then again from offset 432 (both from the start of the file), with a file stream called f.
Since you didn't say how many bytes to read, I'm going to assume 100 each time. I'm assuming that the target variables (buffers) are buffer1 and buffer2, and that they are both big enough.
if (fpread(buffer1, 100, 1, 732, f) != 1)
...error reading at offset 732...
if (fpread(buffer2, 100, 1, 432, f) != 1)
...error reading at offset 432...
The return count is the number of complete units of 100 bytes each; either 1 (got everything) or 0 (something went awry).
There are other ways of writing that code:
if (fpread(buffer1, sizeof(char), 100, 732, f) != 100)
...error reading at offset 732...
if (fpread(buffer2, sizeof(char), 100, 432, f) != 100)
...error reading at offset 432...
This reads 100 single bytes each time; the test ensures you got all 100 of them, as expected. If you capture the return value in this second example, you can know how much data you did get. It would be very surprising if the first read succeeded and the second failed; some other program (or thread) would have had to truncate the file between the two calls to fpread(), but funnier things have been known to happen.
1 The emulation won't be perfect; the pread() call provides guaranteed atomicity that the combination of fseek() and fread() will not provide. But that will seldom be a problem in practice, unless you have multiple processes or threads concurrently updating the file while you are trying to position and read from it.
It frequently depends on the distance between the parts you care about. If you're only skipping over/ignoring a few bytes between the parts you care about, it's often easier to just read that data and ignore what you read, rather than using fseek to skip past it. A typical way to do this is define a struct holding both the data you care about, and place-holders for the ones you don't care about, read in the struct, and then just use the parts you care about:
struct whatever {
long a;
long ignore;
short b;
} w;
fread(&w, 1, sizeof(w), some_file);
// use 'w.a' and 'w.b' here.
If there's any great distance between the parts you care about, though, chances are that your original idea of using fseek to get to the parts that matter will be simpler.
Your theory sounds correct. Open, seek, read, close.
Create a struct for the data you want to read and pass a pointer to read() of the struct's allocated memory. You'll likely need #pragma pack(1) or similar on the struct to prevent misalignment problems.
I have a problem which will take 1000000 lines of input like the ones below from the console.
0 1 23 4 5
1 3 5 2 56
12 2 3 33 5
...
...
I have used scanf, but it is very, very slow. Is there any way to get the input from the console faster? I could use read(), but I am not sure about the number of bytes in each line, so I cannot ask read() to read 'n' bytes.
Thanks,
Very obliged
Use fgets(...) to pull in a line at a time. Note that you should check for the '\n' at the end of the line, and if there is not one, you are either at EOF, or you need to read another buffer's worth, and concatenate the two together. Lather, rinse, repeat. Don't get caught with a buffer overflow.
THEN, you can parse each logical line in memory yourself. I like to use strspn(...) and strcspn(...) for this sort of thing, but your mileage may vary.
Parsing:
Define a delimiters string. Use strspn() to count "non data" chars that match the delimiters, and skip over them. Use strcspn() to count the "data" chars that DO NOT match the delimiters. If this count is 0, you are done (no more data in the line). Otherwise, copy out those N chars to hand to a parsing function such as atoi(...) or sscanf(...). Then, reset your pointer base to the end of this chunk and repeat the skip-delims, copy-data, convert-to-numeric process.
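The skip-delims / copy-data loop described above might look like the following sketch, extracting integers from one line. The delimiter set, the 32-byte field limit, and the function name are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Extract up to max integers from line; returns how many were stored. */
int parse_ints(const char *line, int *out, int max)
{
    const char *delims = " \t\n";
    int count = 0;
    while (count < max) {
        line += strspn(line, delims);        /* skip "non data" chars */
        size_t len = strcspn(line, delims);  /* measure the data chunk */
        if (len == 0)
            break;                           /* no more data in the line */
        char field[32];
        size_t n = len < sizeof field - 1 ? len : sizeof field - 1;
        memcpy(field, line, n);              /* copy out those N chars */
        field[n] = '\0';
        out[count++] = atoi(field);          /* hand the copy to a parser */
        line += len;                         /* reset base past this chunk */
    }
    return count;
}
```

Each iteration is one round of the skip-delims, copy-data, convert-to-numeric cycle the answer describes; sscanf or strtol could stand in for atoi if error detection matters.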
If your example is representative, that you indeed have a fixed format of five decimal numbers per line, I'd probably use a combination of fgets() to read the lines, then a loop calling strtol() to convert from string to integer.
That should be faster than scanf(), while still clearer and more high-level than doing the string to integer conversion on your own.
Something like this:
typedef struct {
int number[5];
} LineOfNumbers;
int getNumbers(FILE *in, LineOfNumbers *line)
{
char buf[128]; /* Should be large enough. */
if(fgets(buf, sizeof buf, in) != NULL)
{
int i;
char *ptr, *eptr;
ptr = buf;
for(i = 0; i < sizeof line->number / sizeof *line->number; i++)
{
line->number[i] = (int) strtol(ptr, &eptr, 10);
if(eptr == ptr)
return 0;
ptr = eptr;
}
return 1;
}
return 0;
}
Note: this is untested (even uncompiled!) browser-written code. But perhaps useful as a concrete example.
You use multiple reads with a fixed size buffer till you hit end of file.
Out of curiosity, what generates that many lines that fast in a console ?
Use binary I/O if you can. Text conversion can slow down the reading by several times. If you're using text I/O because it's easy to debug, consider again binary format, and use the od program (assuming you're on unix) to make it human-readable when needed.
Oh, another thing: there's AT&T's SFIO library ("safe/fast I/O"). You might also have some luck with that, but I doubt that you'll get the same kind of speedup as you will with binary format.
Read a line at a time (if buffer not big enough for a line, expand and continue with larger buffer).
Then use dedicated functions (e.g. atoi) rather than general for conversion.
But, most of all, set up a repeatable test harness with profiling to ensure changes really do speed things up.
fread will still return even if you try to read more bytes than there are; it simply reports a short count.
I have found one of the fastest ways to read a file is like this:
/* seek to the end of the file */
fseek(file, 0, SEEK_END);
/* get the size of the file */
long size = ftell(file);
/* seek back to the start of the file */
fseek(file, 0, SEEK_SET);
/* make a buffer for the whole file */
char *buffer = malloc(size);
/* fread in 1MB at a time until you reach size bytes */
long done = 0;
while (done < size) {
    size_t got = fread(buffer + done, 1, 1048576, file);
    if (got == 0)
        break;
    done += got;
}
On modern computers, put your RAM to use and load the whole thing into RAM; then you can easily work your way through the memory.
At the very least you should be using fread with block sizes as big as you can, and at least as big as the cache blocks or HDD sector size (4096 bytes minimum; I would use 1048576 as a minimum personally). You will find that with much bigger read requests fread is able to sequentially get a big stream in one operation. The suggestion here by some people to use 128 bytes is ridiculous... you will end up with the drive having to seek all the time, as the tiny delay between calls will cause the head to already be past the next sector, which almost certainly has sequential data that you want.
You can greatly reduce the time of execution by taking input using fread() or fread_unlocked() (if your program is single-threaded). Locking/Unlocking the input stream just once takes negligible time, so ignore that.
Here is the code:
#include <cstdio>
#include <cctype>
const int maxio = 1000000;
char buf[maxio], *s = buf + maxio;
inline char getc1(void)
{
if(s >= buf + maxio) { fread_unlocked(buf,sizeof(char),maxio,stdin); s = buf; }
return *(s++);
}
inline int input()
{
char t = getc1();
int n=1,res=0;
while(t != '-' && !isdigit(t)) t = getc1();
if(t == '-')
{
n=-1; t=getc1();
}
while(isdigit(t))
{
res = 10*res + (t&15);
t=getc1();
}
return res*n;
}
This is implemented in C++ (note that fread_unlocked is a glibc extension, not standard C or C++). In C, include <stdio.h> and <ctype.h> instead for fread_unlocked() and isdigit().
You can take input as a stream of chars by calling getc1() and take integer input by calling input().
The whole idea behind using fread() is to take all the input at once. Calling scanf()/printf(), repeatedly takes up valuable time in locking and unlocking streams which is completely redundant in a single-threaded program.
Also make sure that the value of maxio is such that all input can be taken in a few "roundtrips" only (ideally one, in this case). Tweak it as necessary.
Hope this helps!