C - moving the pointer back in a file using lseek

I am writing an academic project in C and I may use only the <fcntl.h> and <unistd.h> headers for file operations.
I have a function that reads a file line by line. The algorithm is:
Set the pointer at the beginning of the file and get the current position.
Read data into a fixed-size buffer (char buf[100]), iterate character by character, and detect the end of line '\n'.
Increment the current position: curr_pos = curr_pos + length_of_read_line;
Set the pointer to the current position using lseek(fd, curr_pos, SEEK_SET);
SEEK_SET sets the pointer to the given offset from the beginning of the file. In my pseudocode, curr_pos is the offset.
It actually works fine, but I always move the pointer relative to the beginning of the file (I use SEEK_SET), which isn't optimal.
lseek also accepts the argument SEEK_CUR, the current position. How can I move the pointer back from its current position (SEEK_CUR)? I tried setting a negative offset, but it didn't work.
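A negative offset with SEEK_CUR is valid; lseek fails only if the resulting file offset would become negative, so the usual culprit is the computed distance. A minimal sketch, assuming line_len holds the length of the line just consumed, including its '\n':
#include <unistd.h>

char buf[100];
ssize_t n = read(fd, buf, sizeof buf);       /* fd: the open descriptor */
if (n > 0) {
    /* step back over the bytes read past the end of the line */
    if (lseek(fd, (off_t)line_len - n, SEEK_CUR) == (off_t)-1) {
        /* error: the resulting offset must not be negative */
    }
}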

The most efficient way to read lines of data from a file is typically to read a large chunk that may span multiple lines, process lines from that chunk until one reaches the end, move any partial line from the end of the buffer to the start, and then read another chunk. Depending upon the target system and the task to be performed, it may be better to read enough to fill whatever space remains after the partial line, or it may be better to always read a power-of-two number of bytes and make the buffer large enough to accommodate a chunk that size plus a maximum-length partial line (left over from the previous read). The one difficulty with this approach is that all data must be read from the stream through the same buffer. In cases where that is practical, however, it will often allow better performance than many separate calls to fread, and may be nicer than using fgets.
While it should be possible for a standard-library function to facilitate line input, the design of fgets is rather needlessly hostile, since it provides no convenient indication of how much data it has read. After reading each line, code that wants a string containing the printable portion will have to use strlen to try to ascertain how much data was read (hopefully the input won't contain any zero bytes) and then check the byte before the trailing zero to see if it's a newline. Not impossible, but awkward at the very least. If the fread-and-buffer approach will satisfy an application's needs, it's likely to be at least as efficient as using fgets, if not more so, and since the effort required to use fgets() robustly is comparable to that required for a buffering approach, one may as well use the latter.
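A minimal sketch of that chunk-and-carry approach, assuming a hypothetical process_line() callback and a buffer no shorter than the longest expected line:
#include <stdio.h>
#include <string.h>

void process_line(const char *line, size_t len);   /* hypothetical */

void read_lines(FILE *f)
{
    char buf[8192];
    size_t have = 0;                       /* bytes currently held in buf */
    size_t got;

    while ((got = fread(buf + have, 1, sizeof buf - have, f)) > 0) {
        have += got;
        char *start = buf, *nl;
        while ((nl = memchr(start, '\n', have - (size_t)(start - buf))) != NULL) {
            process_line(start, (size_t)(nl - start));
            start = nl + 1;
        }
        have -= (size_t)(start - buf);     /* carry the partial line */
        memmove(buf, start, have);         /* back to the buffer's start */
    }
    if (have > 0)
        process_line(buf, have);           /* final line without '\n' */
}
A line longer than the whole buffer needs extra handling (the inner loop finds no newline and the buffer fills up); that is the "all data through the same buffer" constraint mentioned above.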

Since your question is tagged posix, I would go with getline(), which saves you from manually moving the file pointer.
Example:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp;
    char *line = NULL;
    size_t len = 0;
    ssize_t read;

    fp = fopen("input.txt", "r");
    if (fp == NULL)
        return -1;

    while ((read = getline(&line, &len, fp)) != -1)
    {
        printf("Read line of length %zd:\n", read);
        printf("%s", line);
    }

    fclose(fp);
    free(line);   /* free(NULL) is a no-op, so no need to test */
    return 0;
}
Output with custom input:
Read line of length 11:
first line
Read line of length 12:
second line
Read line of length 11:
third line

Related

Using fread() to read a text-based file - best practices

Consider this code for reading a text-based file. This sort of fread() usage was briefly touched upon in the excellent book C Programming: A Modern Approach by K. N. King.
There are other methods of reading text-based files, but here I am concerned with fread() only.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // Declare file stream pointer.
    FILE *fp = fopen("Note.txt", "r");

    // fopen() call successful.
    if (fp != NULL)
    {
        // Navigate through to end of the file.
        fseek(fp, 0, SEEK_END);

        // Calculate the total bytes navigated.
        long filesize = ftell(fp);

        // Navigate to the beginning of the file so
        // it can be read.
        rewind(fp);

        // Declare array of char with appropriate size.
        char content[filesize + 1];

        // Set last char of array to contain NULL char.
        content[filesize] = '\0';

        // Read the file content.
        fread(content, filesize, 1, fp);

        // Close file stream pointer.
        fclose(fp);

        // Print file content.
        printf("%s\n", content);
    }
    // fopen() call unsuccessful.
    else
    {
        printf("File could not be read.\n");
    }

    return 0;
}
There are some problems I have with this method. My opinion is that this is not a safe way to use fread(), since there might be an overflow if we try to read an extremely large string. Is this opinion valid?
To circumvent this issue, we may choose a buffer size and keep reading into a char array of that size. If filesize is less than the buffer size, we simply perform fread() once, as in the code above. Otherwise, we divide the total file size by the buffer size; the integer part of the result is the number of times to iterate a loop, invoking fread() each time and appending the read buffer to a larger string (which we would have malloc-ed with filesize + 1 beforehand). For the final fread(), performed after the loop, we must read exactly (filesize % buffersize) bytes into an array of that size and append that array to the larger string. I find that if we perform the last fread() using buffersize as its second parameter, extra garbage data of size (buffersize - chunksize) is read in and the data might become corrupted. Are my assumptions here correct? Please explain if/how I have overlooked something.
Also, there is the issue that non-ASCII characters may not be 1 byte in size. In that case I would assume the proper amount is being read, but each byte is read as a single char, so is the text distorted somehow? How does fread() handle reading multi-byte chars?
this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?
fread() does not care about strings (null-character-terminated arrays). It reads data as multiples of unsigned char *1, with no special concern for the data content if the stream is opened in binary mode, and perhaps some data processing (e.g. end-of-line, byte-order mark) in text mode.
Are my assumptions here correct?
Failed assumptions:
Assuming the ftell() return value equals the sum of fread() bytes.
The assumption can be false in text mode (as OP opened the file), and fseek() to the end is technically undefined behavior in binary mode.
Assuming not checking the return value of fread() is OK. Use the return value of fread() to know whether an error occurred, whether end-of-file was reached, and how many multiples of bytes were read.
Assuming error checking is not required. ftell(), fread(), and fseek() (instead of rewind()) all deserve error checks. In particular, ftell() readily fails on streams that have no certain end.
Assuming no null characters are read. A text file is not necessarily made into one string by reading it all and appending a null character. Robust code detects and/or copes with embedded null characters.
Multi-byte: assuming input meets the encoding requirements. Example: robust code detects (and rejects) invalid UTF-8 sequences - perhaps after reading the entire file.
Extreme: assuming file length <= LONG_MAX, the maximum value returned by ftell(). Files may be larger.
but each byte is being read as a single char, so the text is distorted somehow? How is fread() handling reading of multi-byte chars?
fread() does not operate on multi-byte boundaries, only on multiples of unsigned char. A given fread() may end with a portion of a multi-byte character, and the next fread() will continue from mid-character.
Instead of a 2-pass approach, consider a single pass:
// Single pass, the pseudo code fleshed out into C.
size_t cap = 4096;                       // initial buffer, say 4096
size_t total_read = 0;
char *buf = malloc(cap);
if (buf == NULL)
    exit(1);                             // handle allocation failure
for (;;) {
    if (total_read == cap) {             // buffer full
        char *p = realloc(buf, cap * 2); // double buffer_size
        if (p == NULL)
            exit(1);
        buf = p;
        cap *= 2;
    }
    size_t u = cap - total_read;         // unused portion of buffer
    size_t n = fread(buf + total_read, 1, u, fp);
    total_read += n;
    if (n < u)                           // short read: EOF or error
        break;
}
// Resize buffer to total_read (+ 1 if appending a '\0').
buf = realloc(buf, total_read + 1);
Alternatively, consider whether the entire file really needs to be read in before processing the data. I do not know the higher-level goal, but processing data as it arrives often makes for less resource impact and faster throughput.
Advanced
Text files may be simple ASCII, defined by an 8-bit code page, or one of various UTF encodings (byte-order mark, etc.). The last line may or may not end with a '\n'. Robust text processing beyond simple ASCII is non-trivial.
ASCII and UTF-8 are the most common. IMO, handle one or both of those and error out on anything that does not meet their requirements.
*1 fread() reads in multiples of bytes as per the 3rd argument, which is 1 in OP's case.
//                       v --- multiple of 1 byte
fread(content, filesize, 1, fp);

C: Best way to go to a known line of a file

I have a file that I'd like to iterate through without processing the current lines in any way. What I am looking for is the best way to go to a determined line of a text file. For example, storing the current line in a variable seems useless until I get to the predetermined line.
Example :
file.txt
foo
fooo
fo
here
Normally, in order to get here, I would have done something like:
FILE *file = fopen("file.txt", "r");
if (file == NULL)
    perror("Error when opening file ");

char currentLine[100];
while (fgets(currentLine, 100, file))
{
    if (strstr(currentLine, "here") != NULL)
        return currentLine;
}
But fgets will have to fully read three lines uselessly, and currentLine will have to store foo, fooo and fo along the way.
Is there a better way to do this, knowing that here is line 4? Something like a goto, but for files?
Since you do not know the length of every line, no, you will have to go through the previous lines.
If you knew the length of every line, you could compute how many bytes to move the file pointer and do that with fseek().
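For instance, a minimal sketch assuming fixed-length records of REC_LEN bytes per line ('\n' included), so that line numbers map directly to byte offsets; the name and the constant are illustrative:
#include <stdio.h>

#define REC_LEN 16                       /* hypothetical fixed line length */

/* Seek directly to the start of 1-based line `lineno`. */
int seek_line(FILE *fp, long lineno)
{
    return fseek(fp, (lineno - 1) * REC_LEN, SEEK_SET);
}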
You cannot directly access a given line of a textual file (unless all lines have the same size in bytes; with UTF-8 everywhere, a Unicode character can take a variable number of bytes, 1 to 4, and in most cases lines have various lengths, different from one line to the next). So you cannot use fseek, because you don't know the file offset in advance.
However (at least on Linux systems), lines end with \n (the newline character). So you could read byte by byte and count them:
int c = EOF;
int linecount = 1;
while ((c = fgetc(file)) != EOF) {
    if (c == '\n')
        linecount++;
}
You then don't need to store the entire line.
So you could reach line #45 this way (using while (((c = fgetc(file)) != EOF) && linecount < 45) ...) and only then read entire lines with fgets, or better yet getline(3) on POSIX systems (see this example). Notice that the implementation of fgets or of getline is likely to be built above fgetc, or at least to share some code with it. Remember that <stdio.h> does buffered I/O; see setvbuf(3) and related functions.
Another way would be to read the file in two passes. A first pass stores the offset (obtained with ftell(3)...) of every line start in some efficient data structure (a vector, a hashtable, a tree...). A second pass uses that data structure to retrieve the offset of the desired line start, then calls fseek(3) with that offset.
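A minimal sketch of that two-pass indexing; the MAX_LINES bound and the names are illustrative:
#include <stdio.h>

#define MAX_LINES 100000                 /* hypothetical upper bound */

long offsets[MAX_LINES];                 /* offsets[i] = start of line i (0-based) */

/* Pass 1: record the byte offset of every line start; fp must be at
 * the beginning. Returns the number of recorded starts. */
long index_lines(FILE *fp)
{
    long n = 0;
    int c;
    offsets[n++] = ftell(fp);
    while ((c = fgetc(fp)) != EOF)
        if (c == '\n' && n < MAX_LINES)
            offsets[n++] = ftell(fp);    /* position just after the '\n' */
    return n;
}

/* Pass 2: jump straight to a given 0-based line. */
int goto_line(FILE *fp, long lineno)
{
    return fseek(fp, offsets[lineno], SEEK_SET);
}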
A third way, POSIX-specific, would be to memory-map the file using mmap(2) into your virtual address space (this works well for files that are not too huge, e.g. less than a few gigabytes). With care (you might need to mmap an extra ending page to ensure the data is zero-byte terminated) you would then be able to use strchr(3) with '\n'.
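A minimal sketch of the mmap approach, locating line 4 with memchr (which, unlike strchr, needs no zero terminator); error checks abbreviated:
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open("file.txt", O_RDONLY);
struct stat st;
fstat(fd, &st);                           /* check open() and fstat() in real code */
const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);                                /* the mapping survives the close */
if (base == MAP_FAILED) { /* handle error */ }

/* Walk forward to the start of line 4. */
const char *p = base, *end = base + st.st_size;
for (int line = 1; line < 4 && p != NULL; line++) {
    p = memchr(p, '\n', (size_t)(end - p));
    if (p != NULL)
        p++;                              /* first byte after the newline */
}
/* ... use p ... then munmap((void *)base, st.st_size); */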
In some cases, you might consider parsing your textual file line by line (using fgets appropriately, or, on Linux, getline, or generating your parser with flex and bison) and storing each line in a relational database (such as PostgreSQL or SQLite).
PS. BTW, the notion of lines (and the end-of-line mark) varies from one OS to the next. On Linux the end-of-line is a \n character. On Windows lines are rumored to end with \r\n, etc.
A FILE * in C is a stream of chars. In a seekable file, you can address these chars using the file pointer with fseek(). But apart from that, there are no "special characters" in files, a newline is just another normal character.
So in short, no, you can't jump directly to a line of a text file unless you know the lengths of the lines in advance.
This model in C corresponds to the files provided by typical operating systems. If you think about it, to know the starting points of individual lines, your file system would have to store this information somewhere. This would mean treating text files specially.
What you can do however is just count the lines instead of pattern matching, something like this:
#include <stdio.h>

int main(void)
{
    char linebuf[1024];
    FILE *input = fopen("seekline.c", "r");
    int lineno = 0;
    char *line;

    if (input == NULL)
        return 1;

    while ((line = fgets(linebuf, 1024, input)) != NULL)
    {
        ++lineno;
        if (lineno == 4)
        {
            fputs("4: ", stdout);
            fputs(line, stdout);
            break;
        }
    }
    fclose(input);
    return 0;
}
If you don't know the length of each line, you have to go through all of them. But if you know the number of the line you want to stop at, you can do this:
/* assumes: bool found = false; int count = 1; declared beforehand */
while (!found && fgets(line, sizeof line, file) != NULL) /* read a line */
{
    if (count == lineNumber)
    {
        /* you arrived at the line */
        /* in case of a return, first close the file with fclose(file); */
        found = true;
    }
    else
    {
        count++;
    }
}
This way you at least avoid the many calls to strstr.

How to wrap fscanf() using only fread() and vsscanf()

I'm porting some code to an embedded platform that uses a C-like API. The original code uses fscanf() to read and parse data from files. Unfortunately my API has no fscanf() equivalent, so prior to the actual porting I'm trying to obtain the same behavior as fscanf() using fread() and vsscanf() (which I do have). I also have the equivalents of fseek() and ftell().
EDIT: please keep in mind that access to the embedded filesystem is very limited (fread - fseek - ftell - fgetc - fgets), so I need a solution that works on strings in memory rather than accessing the file in some other way.
The code looks something like this:
int main()
{
    [...] /* variable declarations and definitions */

    do
    {
        read = wrapped_fscanf(pFile, "%d %s", &val, str);
    } while (read == 2);

    fclose(pFile);
    return 0;
}

int wrapped_fscanf(FILE *f, const char *template, ...)
{
    va_list args;
    va_start(args, template);
    char tmpstr[50];
    fread(tmpstr, sizeof(char), sizeof(tmpstr), f);
    int ret = vsscanf(tmpstr, template, args);
    long offset = /* ??? */
    fseek(f, offset, SEEK_CUR);
    va_end(args);
    return ret;
}
The problem is that fscanf() moves the pointer to the position in the file stream at the end of the match, whereas with fread() I'm reading a fixed amount of data (in this case 50 bytes) and I should find a way to move the pointer back to the end of the matched string.
Let's assume that the 50-char string I read from the file is the following:
12 bar 13 foo 56789012345678901234567890123456789
fscanf() would match the int 12 and the string bar, and the pointer would point right after the "r" in "bar", so I can call it again and read 13 foo.
On the other hand, fread() puts the pointer after the last char in the 50-element sequence, which is wrong: I still have to read 13 foo, but if I call wrapped_fscanf() again the pointer is at the 51st position.
I have to use fseek() to roll back to the end of the first match, but how do I do that? How do I calculate the value of offset?
vsscanf() returns the number of matches, not the length of the string, and I have no way of knowing how many whitespace characters separate the elements of the match (or do I?).
I.e. I get the same outputs ({var,str,read} == {9,"xyz",2}) with
9 xyz
and
9    xyz
Is there some trick that I'm not aware of, or do I have to find a solution other than wrapping fscanf() with fread(), vsscanf(), ftell() and fseek()?
Thank you
Supposing that your vsscanf() implementation supports it, your substitute for fscanf() can append a %n field descriptor to the end of the provided format. As long as there is no failure prior to vsscanf() reaching that field, it will store the number of characters consumed up to that point in the corresponding argument. You could then use that result to reposition the stream appropriately. That would require a bit of varargs wrangling and probably some macro assistance, but I think it could be made to work.
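A sketch of that idea, with the extra int * for the appended %n passed explicitly by the caller rather than conjured out of the va_list; the name fscanf_n, the 50-byte window, and the minimal error handling are all illustrative assumptions:
#include <stdarg.h>
#include <stdio.h>

/* Caller writes "%n" at the end of the format and passes `consumed`
 * again as the matching final vararg, e.g.:
 *     int used;
 *     int r = fscanf_n(pFile, &used, "%d %s%n", &val, str, &used);
 */
int fscanf_n(FILE *f, int *consumed, const char *fmt_with_n, ...)
{
    char buf[51];
    long start = ftell(f);
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    buf[n] = '\0';                    /* vsscanf needs a string */

    va_list args;
    va_start(args, fmt_with_n);
    *consumed = -1;                   /* sentinel: %n never reached */
    int ret = vsscanf(buf, fmt_with_n, args);
    va_end(args);

    if (*consumed >= 0)
        fseek(f, start + *consumed, SEEK_SET);   /* just past the match */
    else
        fseek(f, start, SEEK_SET);               /* restore on failure */
    return ret;
}
One caveat: a token that straddles the end of the window is truncated, so the window must be at least as long as the longest text matched by one call.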
You will need some intermediary buffering code that grabs chunks of data (using fread) and scans your buffer for the pattern. If the pattern is found, truncate the buffer; if the pattern is not found, append some more data. This is effectively what fscanf does.

Effective methods for reading and writing large files in C

I'm writing an application that deals with very large user-generated input files. The program will copy about 95 percent of the file, effectively duplicating it and switching a few words and values in the copy, and then appending the copy (in chunks) to the original file, such that each block (consisting of between 10 and 50 lines) in the original is followed by the copied and modified block, and then the next original block, and so on. The user-generated input conforms to a certain format, and it is highly unlikely that any line in the original file is longer than 100 characters.
Which would be the better approach?
To use one file pointer, with variables that track how much has been read and where to write, seeking the file pointer back and forth between reads and writes; or
To use multiple file pointers, one for reading and one for writing.
I am mostly concerned with the efficiency of the program, as the input files will reach up to 25,000 lines, each about 50 characters long.
If you have memory constraints, or you want a generic approach, read bytes into a buffer from one file pointer, make changes, and write out the buffer to a second file pointer when the buffer is full. If you reach EOF on the first pointer, make your changes and just flush whatever is in the buffer to the output pointer. If you intend to replace the original file, copy the output file to the input file and remove the output file. This "atomic" approach lets you check that the copy operation took place correctly before deleting anything.
For example, to deal with generically copying over any number of bytes, say, 1 MiB at a time:
#define COPY_BUFFER_MAXSIZE 1048576

/* ... */

unsigned char *buffer = malloc(COPY_BUFFER_MAXSIZE);
if (!buffer)
    exit(-1);

FILE *inFp = fopen(inFilename, "r");
fseek(inFp, 0, SEEK_END);
uint64_t fileSize = ftell(inFp);
rewind(inFp);

FILE *outFp = stdout; /* change this if you don't want to write to standard output */
uint64_t outFileSizeCounter = fileSize;

/* we fread() bytes from inFp in COPY_BUFFER_MAXSIZE increments, until there is nothing left to fread() */
do {
    if (outFileSizeCounter > COPY_BUFFER_MAXSIZE) {
        fread(buffer, 1, (size_t) COPY_BUFFER_MAXSIZE, inFp);
        /* -- make changes to buffer contents at this stage
           -- if you resize the buffer, then copy the buffer and
              change the following statement to fwrite() the number of
              bytes in the copy of the buffer */
        fwrite(buffer, 1, (size_t) COPY_BUFFER_MAXSIZE, outFp);
        outFileSizeCounter -= COPY_BUFFER_MAXSIZE;
    }
    else {
        fread(buffer, 1, (size_t) outFileSizeCounter, inFp);
        /* -- make changes to buffer contents at this stage
           -- again, make a copy of buffer if it needs resizing,
              and adjust the fwrite() statement to change the number
              of bytes that need writing */
        fwrite(buffer, 1, (size_t) outFileSizeCounter, outFp);
        outFileSizeCounter = 0ULL;
    }
} while (outFileSizeCounter > 0);

free(buffer);
An efficient way to deal with a resized buffer is to keep a second pointer, say, unsigned char *copyBuffer, which is realloc()-ed to twice the size, if necessary, to deal with accumulated edits. That way, you keep expensive realloc() calls to a minimum.
Not sure why this got downvoted, but it's a pretty solid approach for doing things with a generic amount of data. Hope this helps someone who comes across this question, in any case.
25,000 lines * 100 characters = 2.5 MB; that's not really a huge file. The fastest approach will probably be to read the whole file into memory, write your results to a new file, and replace the original with that.
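A minimal sketch of that whole-file strategy, assuming a hypothetical transform() edit step and using rename() for an atomic replace on POSIX systems (error paths and frees abbreviated):
#include <stdio.h>
#include <stdlib.h>

/* hypothetical: rewrites `text`, returning a new buffer and its length */
char *transform(const char *text, size_t len, size_t *outlen);

int rewrite_file(const char *path)
{
    FILE *in = fopen(path, "rb");
    if (in == NULL)
        return -1;
    fseek(in, 0, SEEK_END);
    long size = ftell(in);
    rewind(in);

    char *text = malloc(size);
    if (text == NULL || fread(text, 1, size, in) != (size_t)size)
        return -1;
    fclose(in);

    size_t outlen;
    char *out = transform(text, (size_t)size, &outlen);

    FILE *tmp = fopen("out.tmp", "wb");        /* name is illustrative */
    if (tmp == NULL || fwrite(out, 1, outlen, tmp) != outlen)
        return -1;
    fclose(tmp);

    return rename("out.tmp", path);            /* atomic replace on POSIX */
}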

File read using POSIX APIs

Consider the following piece of code for reading the contents of a file into a buffer:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int main()
{
    int fd = -1;
    ssize_t bytes_read = -1;
    int i = 0;
    char buff[50];
    // Arbitrary size for the buffer?? How to optimize?
    // Dynamic allocation is a choice, but what is the
    // right way to relate the file size to the buffer size?

    fd = open("./file-to-buff.txt", O_RDONLY);
    if (-1 == fd)
    {
        perror("Open Failed");
        return 1;
    }

    while ((bytes_read = read(fd, buff, BLOCK_SIZE)) > 0)
    {
        printf("bytes_read=%zd\n", bytes_read);
    }

    // Test the characters read from the file into the buffer. The file contains "Hello".
    while (buff[i] != '\0')
    {
        printf("buff[%d]=%c\n", i, buff[i]);
        i++;
        // buff[5]=\n - How?
    }
    // buff[6]=`\0` - How?

    close(fd);
    return 0;
}
Code Description:
The input file contains the string "Hello".
This content needs to be copied into the buffer.
The objective is achieved with the open and read POSIX APIs.
The read API uses a pointer to a buffer of an *arbitrary size* to copy the data into.
Questions:
Dynamic allocation is the method that must be used to optimize the size of the buffer. What is the right procedure to relate/derive the buffer size from the input file size?
I see that at the end of the read operation, read has copied a newline character and a NULL character in addition to the characters "Hello". Please elaborate on this behavior of read.
Sample Output
bytes_read=6
buff[0]=H
buff[1]=e
buff[2]=l
buff[3]=l
buff[4]=o
buff[5]=
PS: The input file is a user-created file, not one created by a program (using the write API). Just mentioning this here, in case it makes any difference.
Since you want to read the whole file, the best way is to make the buffer as big as the file size. There's no point in resizing the buffer as you go. That just hurts performance without good reason.
You can get the file size in several ways. The quick-and-dirty way is to lseek() to the end of the file:
// Get size.
off_t size = lseek(fd, 0, SEEK_END); // You should check for an error return in real code
// Seek back to the beginning.
lseek(fd, 0, SEEK_SET);
// Allocate enough to hold the whole contents plus a '\0' char.
char *buff = malloc(size + 1);
The other way is to get the information using fstat():
struct stat fileStat;
fstat(fd, &fileStat); // Don't forget to check for an error return in real code
// Allocate enough to hold the whole contents plus a '\0' char.
char *buff = malloc(fileStat.st_size + 1);
To get all the needed types and function prototypes, make sure you include the needed header:
#include <sys/stat.h> // For fstat()
#include <unistd.h> // For lseek()
Note that read() does not automatically terminate the data with \0. You need to do that manually, which is why we allocate an extra character (size+1) for the buffer. The reason why there's already a \0 character there in your case is pure random chance.
Of course, since buf is now a dynamically allocated array, don't forget to free it again when you don't need it anymore:
free(buff);
Be aware though, that allocating a buffer that's as large as the file you want to read into it can be dangerous. Imagine if (by mistake or on purpose, doesn't matter) the file is several GB big. For cases like this, it's good to have a maximum allowable size in place. If you don't want any such limitations, however, then you should switch to another method of reading from files: mmap(). With mmap(), you can map parts of a file to memory. That way, it doesn't matter how big the file is, since you can work only on parts of it at a time, keeping memory usage under control.
1. You can get the file size with stat(filename, &st), but defining the buffer to be the page size is just fine.
2. First, there is no NULL character after "Hello"; it must be an accident that the stack area you allocated was 0 before your code executed (please refer to APUE chapter 7.6). In fact, you must initialize a local variable before using it.
I tried to generate the text file with vim, emacs and echo -n Hello > file-to-buff.txt; only vim adds a line break automatically.
You could consider allocating the buffer dynamically: first create a buffer of a fixed size using malloc, and double its size (with realloc) whenever you fill it up, as sketched below. This has a good time/space trade-off.
At the moment you repeatedly read into the same buffer. You should advance your position in the buffer after each read, otherwise you will overwrite the buffer contents with the next section of the file.
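A sketch of that combination, reading with POSIX read() into an ever-growing buffer; the starting capacity of 4096 is an arbitrary choice and error handling is trimmed:
#include <stdlib.h>
#include <unistd.h>

size_t cap = 4096, used = 0;
char *buff = malloc(cap);
ssize_t n;
if (buff == NULL)
    exit(1);

/* Read at buff + used so each chunk lands after the previous one. */
while ((n = read(fd, buff + used, cap - used)) > 0) {
    used += (size_t)n;
    if (used == cap) {                  /* full: double the buffer */
        char *p = realloc(buff, cap * 2);
        if (p == NULL)
            exit(1);                    /* handle failure properly in real code */
        buff = p;
        cap *= 2;
    }
}
buff[used] = '\0';   /* safe: used < cap after the loop */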
The code you supply allocates 50 bytes for the buffer, yet you pass 4096 as the size to read. This could result in a buffer overflow for any file larger than 50 bytes.
As for the '\n' and '\0': the newline is probably in the file, and the '\0' was just already in the buffer. The buffer is allocated on the stack in your code, and if that section of the stack had not been used yet it would probably contain zeros, placed there by the operating system when your program was loaded.
The operating system makes no attempt to terminate the data read from the file; it might be binary data or in a character set that it doesn't understand. Terminating the string, if needed, is up to you.
A few other points that are more a matter of style:
You could consider using a for (i = 0; buff[i]; ++i) loop instead of a while for the printing out at the end. This way if anyone messes with the index variable i you will be unaffected.
You could close the file earlier, after you finish reading from it, to avoid having the file open for an extended period of time (and maybe forgetting to close it if some kind of error happens).
For your second question: read does not automatically add a '\0' character.
If your file is a textual file, you must add a '\0' after calling read to indicate the end of the string.
In C, the end of a string is represented by this character. If read stores 4 characters, printf will print those 4 characters and test the 5th: if it's not '\0', it will continue to print until the next '\0'.
This is also a source of buffer overflows.
As for the '\n', it is probably in the input file.
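In other words, reserve one byte for the terminator and add it yourself; a minimal sketch:
char buff[50];
ssize_t n = read(fd, buff, sizeof buff - 1);   /* leave room for '\0' */
if (n >= 0)
    buff[n] = '\0';                            /* now printf("%s", buff) is safe */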
