Reading up to newline - C

Hi
My program reads a CSV file.
So I used fgets to read one line at a time.
But now the interface specification says that it is possible to find NULL characters in a few of the columns.
So I need to replace fgets with another function to read from the file.
Any suggestions?

If your text stream has a NUL (ASCII 0) character, you will need to handle your file as a binary file and use fread to read the file. There are three approaches to this.
Read the entire file into memory. The length of the file can be obtained by calling fseek(fp, 0, SEEK_END) and then ftell. You can then allocate enough memory for the whole file. Once in memory, parsing the file should be relatively easy. This approach is only really suitable for smallish files (probably less than 50 MB at most). For bonus marks, look at the mmap function.
Read the file byte by byte and add the characters to a buffer until a newline is found.
Read and parse bit by bit. Create a buffer that is bigger than your largest line and fill it with content from your file. You then parse and extract as many lines as you can. Add the remainder to the beginning of a new buffer and read the next bit. Using a bigger buffer will help minimize copying.
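The first approach (load everything, then parse) could be sketched roughly as follows. This is only a sketch: slurp and count_lines are hypothetical names, not library functions, and the file is assumed to have been opened in binary mode ("rb") and to be small enough to fit in memory.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: loads a whole file that was opened in binary mode.
   Returns a malloc'd buffer and stores its length in *len, or NULL on error. */
static char *slurp(FILE *fp, size_t *len)
{
    if (fseek(fp, 0, SEEK_END) != 0)
        return NULL;
    long size = ftell(fp);
    if (size < 0)
        return NULL;
    rewind(fp);
    char *data = malloc((size_t)size + 1);  /* +1 so an empty file still allocates */
    if (data == NULL)
        return NULL;
    *len = fread(data, 1, (size_t)size, fp);
    return data;
}

/* Counts lines, treating NUL bytes as ordinary data.
   A final line without a trailing '\n' is counted too. */
static size_t count_lines(const char *data, size_t len)
{
    size_t lines = 0;
    const char *p = data;
    const char *end = data + len;
    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        lines++;
        if (nl == NULL)
            break;
        p = nl + 1;
    }
    return lines;
}
```

Because the data is carried as a pointer plus a length, memchr can look for '\n' without being stopped by a NUL inside a column.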

fgets works perfectly well with embedded null bytes. Pre-fill your buffer with \n (using memset) and then use memchr(buf, '\n', sizeof buf). If memchr returns NULL, your buffer was too small and you need to enlarge it to read the rest of the line. Otherwise, you can determine whether the newline you found is the end of the line or the padding you pre-filled the buffer with by inspecting the next byte. If the newline you found is at the end of the buffer or has another newline just after it, it's from padding, and the previous byte is the null terminator inserted by fgets (not a null from the file). Otherwise, the newline you found has a null byte after it (the terminator inserted by fgets), and it's the end-of-line newline.
Other approaches will be slow (repeated fgetc) or waste (and risk running out of) resources (loading the whole file into memory).
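A minimal sketch of this check for a fixed-size buffer might look like the following. fgets_len is a hypothetical helper name, and the sketch assumes (as common implementations do) that fgets leaves the bytes after its terminator untouched.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper. Reads with fgets into buf (size must be >= 2) and
   stores the number of bytes actually read in *len, embedded NULs included.
   Returns 1 if a full line ending in '\n' was read, 0 if the data stopped
   without a newline (buffer full or end of file), -1 if nothing was read. */
static int fgets_len(char *buf, size_t size, FILE *fp, size_t *len)
{
    memset(buf, '\n', size);                /* padding that fgets overwrites */
    if (fgets(buf, (int)size, fp) == NULL)
        return -1;
    char *nl = memchr(buf, '\n', size);
    if (nl == NULL) {                       /* size-1 data bytes + terminator */
        *len = size - 1;
        return 0;
    }
    if (nl == buf + size - 1 || nl[1] == '\n') {
        /* Padding: the byte before it is the terminator fgets wrote. */
        *len = (size_t)(nl - buf) - 1;
        return 0;
    }
    *len = (size_t)(nl - buf) + 1;          /* real newline, then the terminator */
    return 1;
}
```

A return value of 0 with *len equal to size - 1 is the "buffer too small" case; the caller would read again to get the rest of the line.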

use fread and then scan the block for the separator
Check the function int T_fread(FILE *input) at http://www.mrx.net/c/source.html

Related

fgets() does not stop when it encounters a NUL. Under what circumstance will this be a problem?

I understand that when using fgets, the program will not stop when it encounters NUL, namely '\0'. However, when will this be a problem that needs to be manually addressed?
My main use case for fgets is to get it from user input (like a better version of scanf to allow reading white spaces.) I cannot think of a situation where a user will want to terminate his input by typing '\0'.
Recall that text file input is usually lines: characters followed by a '\n' (except maybe the last line). On reading text input, a null character is not special. It is not an alternate end-of-line. It is just another non-'\n' character.
It is functions like fgets() and fscanf() that append a null character to the read buffer to denote the end of the string. So when code reads that string, is a given null character one that was read or the one that was appended?
Whether code uses fgets(), fscanf(), getchar(), etc. is not really the issue. The issue is how code should detect null characters and how it should handle them.
Reading a null character from a text stream is uncommon, but not impossible. Null characters tend to reflect a problem more often than valid text data.
Reasons null characters exist in a text file
The text file is a wide character text file, perhaps UTF-16, where null bytes are common. Code needs to read this file with fgetws() and related functions.
The text file is a binary data one. Better to use fread().
File is a text file, yet through error or nefarious intent, code has null characters. Usually best to detect, if possible, and exit working this file with an error message or status.
Legitimate text file uncommonly using null characters. fgets() is not the best tool. Likely need crafted input functions or other extensions like getline().
How to detect?
fgets(): prefill buffer with non-zero input. See if the characters after the first null character are all the pre-fill value.
fscanf(): Read a line with some size like char buf[200]; fscanf(f, "%199[^\n]%n", buf, &length); and use length for input length. Additional code needed to handle end-of-line, extra-long lines, 0 length lines, etc.
fgetc(): Build user code to read/handle as needed - tends to be slow.
How to handle?
In general, error out with a message or status.
If null characters are legitimate to this code's handling of text files, code needs to handle input, not as C strings, but as a buffer and length.
Good luck.
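The fgets() detection idea from the list above (pre-fill with a non-zero value, then see what lies after the first null character) could be sketched like this. fgets_scan is a hypothetical name, the fill byte 0xFF is an arbitrary choice, and the sketch assumes fgets does not modify bytes after its terminator.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper. Returns the number of bytes fgets stored before its
   terminator (0 if nothing was read) and sets *has_nul to 1 when the data
   itself contained a null character. */
static size_t fgets_scan(char *buf, size_t size, FILE *fp, int *has_nul)
{
    memset(buf, 0xFF, size);                 /* any non-zero fill works */
    *has_nul = 0;
    if (fgets(buf, (int)size, fp) == NULL)
        return 0;
    size_t end = size;
    while (end > 0 && buf[end - 1] != '\0')  /* skip untouched fill bytes */
        end--;
    if (end == 0)
        return 0;                            /* no terminator found at all */
    size_t len = end - 1;                    /* index of the appended terminator */
    /* A zero byte before the terminator must have come from the input. */
    *has_nul = memchr(buf, '\0', len) != NULL;
    return len;
}
```

The last zero byte in the buffer is the one fgets appended, so any earlier zero byte was read from the stream. A caller can then error out, or keep going with the buffer and the returned length instead of treating the data as a C string.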

Is it possible to count the frequency of a word in a file precisely using two buffers in C?

I have a file of size 1GB. I want to find out how many times the word "sosowhat" is found in the file. I've written code using fgetc() which reads one character at a time from the file, which is far too slow for a file of size 1GB. So I made a buffer of size 1000 (using malloc) to hold 1000 words at a time from the file and I used the strstr() function to count the occurrences of the word "sosowhat". The logic is fine. But the problem is that if the part "so" of "sosowhat" is located at the end of the buffer and the "sowhat" part in the new buffer, the word will not be counted. So I used two buffers, old_buffer and current_buffer. At the beginning of each buffer I want to check from the last few characters of the old buffer. Is this possible? How can I go back to the old buffer? Is it possible without memmove()? As a beginner, I will be more than happy for your help.
Yes, it can be done. There are more possible approaches to this.
The first one, which is the cleanest, is to keep a second buffer, as suggested, of the length of the searched word, where you keep the last chunk of the old buffer. (It needs to be exactly the length of the searched word because you store wordLen - 1 characters + null terminator.) Then the quickest way is to append the first wordLen - 1 characters from the new buffer to this stored chunk and search for your word there. Then continue with your search normally. Of course, you can also create a buffer which can hold both chunks (the last bytes from the old buffer and the first bytes from the new one).
Another approach (which I don't recommend, but can turn out to be a bit easier in terms of code) would be to fseek wordLen - 1 bytes backwards in the read file. This will "move" the chunk stored in previous approach to the next buffer. This is a bit dirtier as you will read some of the contents of the file twice. Although that's not something noticeable in terms of performance, I again recommend against it and use something like the first described approach.
Use the same algorithm as with fgetc, only read from the buffers you created. It will be just as efficient, since strstr iterates through the string char by char as well.
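The first approach (carrying the last wordLen - 1 bytes over to the next chunk) could be sketched as below, using the variant with one buffer that holds both pieces. count_word and CHUNK are hypothetical names chosen for illustration. The sketch counts every occurrence of the byte sequence, including inside longer words, and it does use a small memmove, but only on wordLen - 1 bytes per chunk.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK 4096   /* illustrative chunk size */

/* Hypothetical helper: counts occurrences of word in fp, reading the file
   in chunks. Returns -1 if the buffer cannot be allocated. */
static long count_word(FILE *fp, const char *word)
{
    size_t wlen = strlen(word);
    if (wlen == 0)
        return 0;
    char *buf = malloc(wlen - 1 + CHUNK);   /* carried bytes + one chunk */
    if (buf == NULL)
        return -1;
    size_t carry = 0;                       /* bytes kept from the previous chunk */
    long count = 0;
    size_t got;
    while ((got = fread(buf + carry, 1, CHUNK, fp)) > 0) {
        size_t total = carry + got;
        for (size_t i = 0; i + wlen <= total; i++)
            if (memcmp(buf + i, word, wlen) == 0)
                count++;
        /* Keep the last wlen-1 bytes: a match may straddle the boundary.
           Fewer than wlen bytes are kept, so no match is counted twice. */
        carry = total < wlen - 1 ? total : wlen - 1;
        memmove(buf, buf + total - carry, carry);
    }
    free(buf);
    return count;
}
```

Using memcmp on a buffer with a known length also avoids having to null-terminate each chunk for strstr.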

How is \0 incorporated into normal text files in reference to fgets

I was just wondering: when you input text using a normal application such as textedit (on OSX), would it still harbour the same '\0' character at the end of each string, so that when read through fgets() it would pick said character up and stop reading?
Because I've created a normal text file, but fgets() keeps stopping at the end of the designated length instead of when it finds that character, so I suspect it may not actually exist when I write to a normal text file.
For Example:
How Are You
There
fgets(str, 15, stdin);
This would end up producing: TherAre You
No, in general, text files do not contain \0 characters. fgets reads the number of characters requested, or to the end of the line, whichever comes first. It's fgets itself that appends the \0. From the man page:
fgets() reads in at most one less than size characters from stream and stores them into the buffer pointed to by s. Reading stops after an EOF or a newline. If a newline is read, it is stored into the buffer. A terminating null byte ('\0') is stored after the last character in the buffer.
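To see the man page's rule in action, here is a minimal sketch. read_once is a hypothetical helper, not a library function; it returns the strlen of whatever fgets stored, which is reliable only when the input has no null bytes of its own.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: one fgets call. Returns the length of the string
   that fgets stored in str, or -1 at end of file or on error. */
static int read_once(FILE *fp, char *str, int size)
{
    if (fgets(str, size, fp) == NULL)
        return -1;
    /* The '\0' that strlen finds here was appended by fgets,
       it was not read from the file. */
    return (int)strlen(str);
}
```

With a file containing the two lines "How Are You" and "There", a call with size 15 stores the whole first line including its newline, while a call with size 5 stops after 4 characters and leaves the rest of the line for the next call.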
No, text files don't generally contain any control characters. The termination is a C "feature", i.e. a property of how the C language and environment works with strings. Text files are independent of C. The termination is added (to the in-memory buffer into which the data has been read) by the fgets() function.
If your input file does contain a null byte and you're reading with fgets() or equivalent, you have difficulty knowing whether the null in the middle of the string was simply a null in the 'text' file or indicates that the last line of the file did not end with a newline, or that the line was truncated. Clearly, if you try another read and get more data, it was not a premature EOF. If the character immediately before the null byte is a newline, then you can assume that the null byte is the end of string marker added by fgets().
Generally speaking, therefore, if the file contains null bytes, it is not a good idea to use fgets() to read the file.

Reading lines in c with windows.h

I need to use system calls of windows.h to read a file which I get from the command line. I can read the whole file into a buffer using ReadFile() and then cut the buffer at the first \0, but how can I read only one line? I also need to read the last line of the file. Is this possible without reading the whole file into a buffer? The file may be 4 GB or more, so I won't be able to read it all. Does anyone know how to read it by lines?
If you have an idea of how long lines are then you are in business, make a buffer that is a bit larger than max line.
Use ReadFile to read a number of bytes and cut the buffer at the first end of line (\n).
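Use SetFilePointerEx (LZSeek only applies to files opened through the LZ API) to position at the end of the file, then move back about a line's worth of bytes, look for the end of line, and start there to read the rest of the line.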
Don't "cut the buffer at the first \0", ReadFile doesn't return a zero-terminated string. It reads raw bytes. You have to pay attention to the value returned through the lpNumberOfBytesRead argument. It will be equal to the nNumberOfBytesToRead value you pass unless you've reached the end of the file.
Now you know how many valid bytes are in the buffer. Search them for the first '\r' or '\n' byte to find the line terminator. Copy the range of bytes to a string buffer supplied by the caller and return. The next time you read a line, start where you left off previously, past the line terminator. When you don't find the line terminator then you have to copy the bytes in the buffer and call ReadFile() again to read more bytes. That makes the code a bit tricky to get right, excellent exercise otherwise.
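The bookkeeping described above could be sketched as follows. To keep the sketch portable, standard fread stands in for ReadFile (its return value plays the role of lpNumberOfBytesRead); struct reader, next_line and BLOCK are hypothetical names, and only '\n' is treated as the line terminator, so a preceding '\r' would stay in the line.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK 4096   /* illustrative block size */

struct reader {
    FILE *fp;
    char block[BLOCK];
    size_t pos;      /* next unread byte in block */
    size_t len;      /* number of valid bytes in block */
};

/* Copies the next line (without its '\n') into out, up to cap bytes; bytes
   beyond cap are dropped. Returns the stored length, or -1 at end of input. */
static long next_line(struct reader *r, char *out, size_t cap)
{
    size_t n = 0;
    int any = 0;
    for (;;) {
        if (r->pos == r->len) {              /* block used up: read more */
            r->len = fread(r->block, 1, BLOCK, r->fp);
            r->pos = 0;
            if (r->len == 0)
                return any ? (long)n : -1;
        }
        any = 1;
        char *start = r->block + r->pos;
        size_t avail = r->len - r->pos;
        char *nl = memchr(start, '\n', avail);
        size_t take = nl ? (size_t)(nl - start) : avail;
        size_t copy = take < cap - n ? take : cap - n;
        memcpy(out + n, start, copy);
        n += copy;
        r->pos += take + (nl != NULL);       /* step past the terminator */
        if (nl)
            return (long)n;
    }
}
```

A reader is set up by storing the open stream in fp and zeroing pos and len. Since only the byte counts are trusted, a \0 inside a line is copied like any other byte.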
ReadFile is a particularly poor choice for what you want to do. Are you allowed to use fgets? That would be much easier to use in your case.

C: Reading a text file (with variable-length lines) line-by-line using fread()/fgets() instead of fgetc() (block I/O vs. character I/O)

Is there a getline function that uses fread (block I/O) instead of fgetc (character I/O)?
There's a performance penalty to reading a file character by character via fgetc. We think that to improve performance, we can use block reads via fread in the inner loop of getline. However, this introduces the potentially undesirable effect of reading past the end of a line. At the least, this would require the implementation of getline to keep track of the "unread" part of the file, which requires an abstraction beyond the ANSI C FILE semantics. This isn't something we want to implement ourselves!
We've profiled our application, and the slow performance is isolated to the fact that we are consuming large files character by character via fgetc. The rest of the overhead actually has a trivial cost by comparison. We're always sequentially reading every line of the file, from start to finish, and we can lock the entire file for the duration of the read. This probably makes an fread-based getline easier to implement.
So, does a getline function that uses fread (block I/O) instead of fgetc (character I/O) exist? We're pretty sure it does, but if not, how should we implement it?
Update: Found a useful article, Handling User Input in C, by Paul Hsieh. It's an fgetc-based approach, but it has an interesting discussion of the alternatives (starting with how bad gets is, then discussing fgets):
On the other hand the common retort from C programmers (even those considered experienced) is to say that fgets() should be used as an alternative. Of course, by itself, fgets() doesn't really handle user input per se. Besides having a bizarre string termination condition (upon encountering \n or EOF, but not \0) the mechanism chosen for termination when the buffer has reached capacity is to simply abruptly halt the fgets() operation and \0 terminate it. So if user input exceeds the length of the preallocated buffer, fgets() returns a partial result. To deal with this programmers have a couple choices; 1) simply deal with truncated user input (there is no way to feed back to the user that the input has been truncated, while they are providing input) 2) Simulate a growable character array and fill it in with successive calls to fgets(). The first solution, is almost always a very poor solution for variable length user input because the buffer will inevitably be too large most of the time because its trying to capture too many ordinary cases, and too small for unusual cases. The second solution is fine except that it can be complicated to implement correctly. Neither deals with fgets' odd behavior with respect to '\0'.
Exercise left to the reader: In order to determine how many bytes was really read by a call to fgets(), one might try by scanning, just as it does, for a '\n' and skip over any '\0' while not exceeding the size passed to fgets(). Explain why this is insufficient for the very last line of a stream. What weakness of ftell() prevents it from addressing this problem completely?
Exercise left to the reader: Solve the problem determining the length of the data consumed by fgets() by overwriting the entire buffer with a non-zero value between each call to fgets().
So with fgets() we are left with the choice of writing a lot of code and living with a line termination condition which is inconsistent with the rest of the C library, or having an arbitrary cut-off. If this is not good enough, then what are we left with? scanf() mixes parsing with reading in a way that cannot be separated, and fread() will read past the end of the string. In short, the C library leaves us with nothing. We are forced to roll our own based on top of fgetc() directly. So lets give it a shot.
So, does a getline function that's based on fgets (and doesn't truncate the input) exist?
Don't use fread. Use fgets. I take it this is a homework/class project problem so I'm not providing a complete answer, but if you say it's not, I'll give more advice. It is definitely possible to provide 100% of the semantics of GNU-style getline, including embedded null bytes, using purely fgets, but it requires some clever thinking.
OK, update since this isn't homework:
memset your buffer to '\n'.
Use fgets.
Use memchr to find the first '\n'.
If no '\n' is found, the line is longer than your buffer. Enlarge the buffer, fill the new portion with '\n', and fgets into the new portion, repeating as necessary.
If the character following '\n' is '\0', then fgets terminated due to reaching end of a line.
Otherwise, fgets terminated due to reaching EOF, the '\n' is left over from your memset, the previous character is the terminating null that fgets wrote, and the character before that is the last character of actual data read.
You can eliminate the memset and use strlen in place of memchr if you don't care about supporting lines with embedded nulls (either way, the null will not terminate reading; it will just be part of your read-in line).
There's also a way to do the same thing with fscanf and the "%123[^\n]" specifier (where 123 is your buffer limit), which gives you the flexibility to stop at non-newline characters (à la GNU getdelim). However, it's probably slow unless your system has a very fancy scanf implementation.
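The memset/fgets/memchr steps above, including the buffer growth, could be put together roughly like this. fgets_line is a hypothetical name, the starting capacity of 128 and the doubling policy are arbitrary choices, and the sketch assumes fgets leaves bytes after its terminator untouched. It is not a drop-in replacement for GNU getline.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper. Reads one line of any length into *line, which is
   realloc'd as needed (capacity kept in *cap). Returns the number of bytes
   read, embedded NULs and the trailing '\n' included; returns 0 at end of
   file or if memory runs out. */
static size_t fgets_line(char **line, size_t *cap, FILE *fp)
{
    size_t used = 0;                        /* bytes of real data so far */
    if (*line == NULL || *cap < 2) {
        char *p = realloc(*line, 128);
        if (p == NULL)
            return 0;
        *line = p;
        *cap = 128;
    }
    for (;;) {
        char *part = *line + used;
        size_t room = *cap - used;
        memset(part, '\n', room);
        if (fgets(part, (int)room, fp) == NULL)
            return used;                    /* EOF: keep what was gathered */
        char *nl = memchr(part, '\n', room);
        if (nl == NULL) {                   /* portion is full: grow, continue */
            used += room - 1;               /* next read overwrites the terminator */
            char *p = realloc(*line, *cap * 2);
            if (p == NULL)
                return 0;
            *line = p;
            *cap *= 2;
            continue;
        }
        if (nl == part + room - 1 || nl[1] == '\n')
            return used + (size_t)(nl - part) - 1;  /* padding: EOF, no newline */
        return used + (size_t)(nl - part) + 1;      /* real end-of-line newline */
    }
}
```

The caller starts with a NULL pointer and a capacity of 0, calls the function once per line, and frees the buffer at the end. The result is a buffer plus a length rather than a C string.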
There isn't a big performance difference between fgets and fgetc/setvbuf.
Try:
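int c;
FILE *f = fopen("blah.txt", "r");
if (f == NULL) { perror("blah.txt"); return 1; }
/* Full buffering (_IOFBF) with a 4096-byte buffer.
   !!! check other values for the last parameter in your OS */
setvbuf(f, NULL, _IOFBF, 4096);
while ((c = fgetc(f)) != EOF)
{
    if (c == '\n')
        ...
    else
        ...
}
fclose(f);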

Resources