Reading in and storing using fgets() - c

I'm using fgets to read text from simple files such as .txt files, but I need the ability to jump back to previous lines. Is there any way to do this using fgets? Or should I just store the text in some sort of array/structure?

fseek or a combination of fgetpos and fsetpos would be appropriate. AFAIK, there is no "go to line X" function; you'll have to save some information about each line (e.g. its starting position) instead, using fseek or fsetpos to move around.

You may be able to solve your problems with fseek() and friends ( http://linux.die.net/man/3/fseek ).
However, mixing the "fseek" functions with text files (especially if you're reading and writing to the same stream) may cause problems due to the library translation of line breaks.
If you're not too tight on memory, I'd go with saving information from previous lines.
Better yet, if possible review your algorithm/data structure so that you don't need to go back.


What is generally the best approach reading a file for a compiler?

I know this is a general question.
I'm going to write a compiler and I was wondering whether it's better to extract the tokens of the language while reading the file (i.e., open the file, extract tokens while reading, and finally close the file) or to read the file first, close it, and then work with the data in a variable. The pseudo-code for the second option would be something like:
file = open(filename);
textVariable = read(file);
close(file);
getTokens(textVariable);
The first option would be something like:
file = open(filename);
readWhileGeneratingTokens(file);
close(file);
I guess the first option looks better, since there isn't an additional cost in terms of main memory. However, I think there might be some benefit to the second option, since it minimizes the time the file is kept open.
I can't provide any hard data, but generally the amount of time a compiler spends tokenizing source code is rather small compared to the amount of time spent optimizing/generating target code. Because of this, wanting to minimize the amount of time the source file is open seems premature. Additionally, reading the entire source file into memory before tokenizing would prevent any sort of line-by-line execution (think interpreted language) or reading input from a non-file location (think of a stream like stdin). I think it is safe to say that the overhead in reading the entire source file into memory is not worth the computer's resources and will ultimately be detrimental to your project.
Compilers are carefully designed to be able to proceed on as little as one character at a time from the input. They don't read entire files prior to processing, or rather they have no need to do so: that would just add pointless latency. They don't even need to read entire lines before processing.

Editing text files in C

I have a program which takes data (ints, floats, and strings) given by the user and writes it to a text file. Now I have to update part of that written data.
For example:
At line 4 in the file I want to change the first 2 words (there's an int and a float). How can I do that?
From what I've found, fseek() and fputs() can be used, but I don't know exactly how to get to a specific line.
(Explained code will be appreciated, as I'm a beginner in C.)
You can't "insert" characters into a file. You will have to write a program which reads the whole file, copies the part before the edit to a new file, writes your changed data, and then copies the rest of the file.
You really need to read the whole file and ignore what is not needed.
fseek is not really useful here: it positions the file at some byte offset (relative to the start or the end of the file) and doesn't know about line boundaries.
Actually, lines inside a file are an ill defined concept. Often a line is a sequence of bytes (different from the newline character) ended by a newline ('\n'). Some operating systems (Windows, MacOSX) read in a special manner text files (e.g. the real file contains \r\n to end each line, but the C library gives you the illusion that you have read \n).
In practice, you probably want to use line-input routines, notably getline (or perhaps fgets).
If you use getline, remember to free the line buffer.
If your textual file has a very regular structure, you might fscanf the data (ignoring what you need to skip) without caring about line boundaries.
If you wanted to absolutely use fseek (which is a mistake), you would have to read the file twice: a first time to remember where each line starts (or ends) and a second time to fseek to the line start. Still, that does not work for updates, because you cannot insert bytes in the middle of a file.
And in practice, the most costly operation is the actual disk read. Buffering (partly done by the kernel and <stdio.h> functions, and partly by you when you deal with lines) is negligible.
Of course you cannot change in place some line in a file. If you need to do that, process the file for input, produce some output file (containing the modified input) and rename that when finished.
BTW, you might perhaps be interested in indexed files like GDBM, or even in databases like SQLite, MariaDB, MongoDB, and you might be interested in standard textual serialization formats like JSON or YAML (both have many libraries, even for C, to deal with them).
fseek() is used for random-access files where each record of data has the same size. Typically the data is binary, not text.
To solve your particular issue, you will need to read one line at a time to find the line you want to change. A simple solution is to copy lines to a temporary file as you read them, write the changed data to that temporary file, skip the part of the original file you want to change, and copy the rest to the temporary file. Finally, close the original file, copy the temporary file over it, and delete the temporary file.
With that said, I suggest that you learn more about random-access files. These are very useful when storing records all of the same size. If you have control over creating the original file, these might be better for your current purpose.

What are some best practices for file I/O in C?

I'm writing a fairly basic program for personal use but I really want to make sure I use good practices, especially if I decide to make it more robust later on or something.
For all intents and purposes, the program accepts some input files as arguments, opens them using fopen(), reads from the files, does stuff with that information, and then saves the output as a few different files in a subfolder. E.g., if the program is in ~/program then the output files are saved in ~/program/csv/
I just output directly to the files, for example output = fopen("./csv/output.csv", "w");, print to it with fprintf(output,"%f,%f", data1, data2); in a loop, and then close with fclose(output); and I just feel like that is bad practice.
Should I be saving it in a temp directory while it's being written to and then moving it when it's finished? Should I be using more advanced file I/O libraries? Am I just completely overthinking this?
Best practices in my eyes:
Check every call to fopen, printf, puts, fprintf, fclose etc. for errors
use getchar if you must, fread if you can
use putchar if you must, fwrite if you can
avoid arbitrary limits on input line length (might require malloc/realloc)
know when you need to open output files in binary mode
use Standard C, forget conio.h :-)
newlines belong at the end of a line, not at the beginning of some text, i.e. it is printf("hello, world\n"); and not "\nHello, world" like those misled by the Mighty William H. often write to cope with the silliness of their command shell. Outputting newlines first breaks line-buffered I/O.
if you need more than 7-bit ASCII, choose Unicode (the most common encoding is UTF-8, which is ASCII-compatible). It's the last encoding you'll ever need to learn. Stay away from codepages and ISO-8859-*.
Am I just completely overthinking this?
You are. If the task's simple, don't make a complicated solution on purpose just because it feels "more professional". While you're a beginner, focus on code readability, it will facilitate your and others' lives.
It's fine. I/O is fully buffered by default with stdio file functions, so you won't be writing to the file with every single call of fprintf. In fact, in many cases, nothing will be written to it until you call fclose.
It's good practice to check the return of fopen, to close your files when finished, etc. Let the OS and the compiler do their job in making the rest efficient, for simple programs like this.
If no other program is checking for the presence of ~/program/csv/output.csv for further processing, then what you're doing is just fine.
Otherwise you can consider writing to a FILE * obtained by a call to tmpfile in stdio.h or some similar library call, and when finished copy the file to the final destination. You could also put down a lock file output.csv.lck and remove that when you're done, but that depends on you being able to modify the other program's behaviour.
You can make your own cat, cp, mv programs for practice.

good way to read text file in C

I need to read a text file which may contain long lines of text. I am thinking of the best way to do this. Considering efficiency, even though I am doing this in C++, I would still choose C library functions to do the IO.
Because I don't know how long a line is, potentially really, really long, I don't want to allocate a large array and then use fgets to read a line. On the other hand, I do need to know where each line ends. One use case is to count the words/chars in each line. I could allocate a small array, use fgets to read into it, and then check whether \r, \n, or \r\n appears in the buffer to tell whether a full line has been read. But this involves a lot of strstr calls (for \r\n; or are there better ways, for example using the return value of fgets?). I could also use fgetc to read each individual char one at a time. But does this function do buffering?
Please compare these and other ways of doing this task.
The correct way to do I/O depends on what you're going to do with the data. If you're counting words, line-based input doesn't make much sense. A more natural approach is to use fgetc and deal with a character at a time and let stdio worry about the buffering. Only if you need the whole line in memory at the same time to process it should you actually allocate a buffer big enough to contain it all.

How do I check if a file is text-based?

I am working on a small text replacement application that basically lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs for files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?
Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes there's a non-zero probability you'll guess wrong given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc). Oh, and what if the first page happens to look like ASCII but the next page is a btree node that points to the first page...
Hints can be:
extension (not likely that foo.exe is editable)
something in the stream itself (like BOM [byte-order-marker])
user direction (just edit the file, goshdarnit)
Windows used to provide an API IsTextUnicode that would do a probabilistic examination, but there were well-known false-positives.
My take is that trying to be smarter than the user has some issues...
Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, instead checking with the user for the go-ahead if the file does not match the internal whitelist. The risk of changing a binary file would be mitigated if your search string is long - that is, assuming you're not performing Y2K conversion (a la sed 's/y/k/g').
It's pretty costly to determine whether a file is text-based or binary. You would have to examine each byte in the file to determine if it is a valid character, which in turn depends on the file encoding.
Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utils do this, but just check the first 1K or 2K of the file as an "optimistic optimization".
Well, a text file contains text, right? So a really easy way to check whether a file contains only text is to read it and check whether it contains only alphanumeric characters.
So basically the first thing you have to do is check the file encoding. If it's pure ASCII, you have an easy task: just read the whole file into a char array (I'm assuming you are doing it in C/C++ or similar) and check every char in that array with functions like isalpha and isdigit... of course, you have to take care of special exceptions like tabs '\t', spaces ' ', or newlines ('\n' on Linux, "\r\n" on Windows).
In the case of a different encoding the process is the same, except that you have to use different functions to check whether the current character is alphanumeric... also note that in the case of UTF-16 or wider encodings a simple char array is simply too small... but if you are doing it in C#, for example, you don't have to worry about the size :)
You can write a function that will try to determine if a file is text based. While this will not be 100% accurate, it may be just enough for you. Such a function does not need to go through the whole file, about a kilobyte should be enough (or even less). One thing to do is to count how many whitespaces and newlines are there. Another thing would be to consider individual bytes and check if they are alphanumeric or not. With some experiments you should be able to come up with a decent function. Note that this is just a basic approach and text encodings might complicate things.
