Using fseek to backtrack - c

Is using fseek to backtrack character fscanf operations reliable?
Like for example if I have just fscanf-ed 10 characters but I would like to backtrack the 10 chars can I just fseek(infile, -10, SEEK_CUR) ?
For most situations it works but I seem to have problems with the character ^M. Apparently fseek registers it as a char but fscanf doesn't register it, thus in my previous example a 10 char block containing a ^M would require fseek(infile, -11, SEEK_CUR) instead. fseek(infile, -10, SEEK_CUR) would leave it short by 1 character.
Why is this so?
Edit: I was using fopen in text mode

You're seeing the difference between a "text" and a "binary" file. When a file is opened in text mode (no 'b' in the fopen second argument), the stdio library may (indeed, must) interpret the contents of the file according to the operating system's conventions for text files. For example, in Windows, a line ends with \r\n, and this gets translated to a single \n by stdio, since that is the C convention. When writing to a text file, a single \n gets output as \r\n.
This makes it easier to write portable C programs that handle text files. Some details become complicated, however, and fseeking is one of them. Because of this, the C standard only defines fseek in text files in a few cases: to the very beginning, to the very end, to the current position, and to a previous position that has been retrieved with ftell. In other words, you can't compute a location to seek to for text files. Or you can, but you have to take care of all the platform-specific details yourself.
Alternatively, you can use binary files and do the line-ending transformations yourself. Again, portability suffers.
In your case, if you just want to go back to where you last did fscanf, the easiest would be to use ftell just before you fscanf, and later fseek to that saved position with SEEK_SET.
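As a minimal sketch of that suggestion: save the position with ftell, read, then seek back to the saved value. The helper name read_then_backtrack is hypothetical, and fread stands in for the fscanf call from the question.

```c
#include <stdio.h>

/* Hypothetical helper: reads up to n characters into buf, then returns the
   stream to where the read started. buf must hold at least n + 1 bytes.
   Returns 0 on success, -1 on failure. */
int read_then_backtrack(FILE *fp, char *buf, size_t n)
{
    long start = ftell(fp);          /* remember the position before reading */
    if (start == -1L)
        return -1;

    size_t got = fread(buf, 1, n, fp);
    buf[got] = '\0';

    /* Seeking to a value previously returned by ftell is one of the few
       seeks the standard defines for text streams. */
    return fseek(fp, start, SEEK_SET) == 0 ? 0 : -1;
}
```

Because the offset comes from ftell rather than from arithmetic, it does not matter how many bytes a line ending occupies on disk.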

This is because fseek works with bytes, whereas fscanf intelligently handles that the carriage return and line feed are two bytes, and swallows them as one char.

fseek has no understanding of the file's contents and just moves the file pointer 10 characters back.
fscanf, depending on the OS, may interpret newlines differently; it may even be that fscanf will insert the ^M if you're on DOS and the ^M does not appear in the file. Check the manual that came with your C compiler.

Just tried this with VS2008 and found that fscanf and fseek treated the CR and LF characters in the same way (as a single character).
So with two files:
0000000: 3132 3334 3558 3738 3930 3132 3334 3536 12345X7890123456
and
0000000: 3132 3334 350d 0a37 3839 3031 3233 3435 12345..789012345
If I read 15 characters I get to the second '5', then seek back 10 characters, my next character read is the 'X' in the first case and the CRLF in the second.
This seems like a very OS/compiler specific problem.

Did you test the return value of fscanf? Post some code.
Take a look at ungetc. You may have to run a loop over it.

Related

Use ftell to find the file size

fseek(f, 0, SEEK_END);
size = ftell(f);
If ftell(f) tells us the current file position, the size here should be the offset from the end of the file to the beginning. Why is the size not ftell(f)+1? Should not ftell(f) only give us the position of the end of the file?
File positions are like the cursor in a text entry widget: they are in between the bytes of the file. This is maybe easiest to understand if I draw a picture:
+---+---+---+---+---+
| a | b | c | d |XXX|
+---+---+---+---+---+
0   1   2   3   4
This is a hypothetical file. It contains four characters: a, b, c, and d. Each character gets a little box to itself, which we call a "byte". (This file is ASCII.) The fifth box has been crossed out because it's not part of the file yet, but if you appended a fifth character to the file it would spring into existence.
The valid file positions in this file are 0, 1, 2, 3, and 4. There are five of them, not four; they correspond to the vertical lines before, after, and in between the boxes. When you open the file (assuming you don't use "a"), you start out on position 0, the line before the first byte in the file. When you seek to the end, you arrive at position 4, the line after the last byte in the file. Because we start counting from zero, this is also the number of bytes in the file. (This is one of the several reasons why we start counting from zero, rather than one.)
I am obliged to warn you that there are several reasons why
fseek(fp, 0, SEEK_END);
long int nbytes = ftell(fp);
might not give you the number you actually want, depending on what you mean by "file size" and on the contents of the file. In no particular order:
On Windows, if you open a file in text mode, the numbers you get from ftell on that file are not byte offsets from the beginning of the file; they are more like fgetpos cookies, that can only be used in a subsequent call to fseek. If you need to seek around in a text file on Windows you may be better off opening the file in binary mode and dealing with both DOS and Unix line endings yourself — this is actually my recommendation for production code in general, because it's perfectly possible to have a file with DOS line endings on a Unix system, or vice versa.
On systems where long int is 32 bits, files can easily be bigger than that, in which case ftell will fail, return −1 and set errno to EOVERFLOW. POSIX.1-2001-compliant systems provide a function called ftello that returns an off_t quantity that can represent larger file sizes, provided you put #define _FILE_OFFSET_BITS 64 at the very top of all your source files (before any #includes). I don't know what the Windows equivalent is.
If your file contains characters that are beyond ASCII, then the number of bytes in the file is very likely to be different from the number of characters in the file. (For instance, if the file is encoded in UTF-8, the character 啡 will take up three bytes, Ä will take up either two or three bytes depending on whether it's "composed", and జ్ఞా will take up twelve bytes because, despite being a single grapheme, it's a string of four Unicode code points.) ftell(o) will still tell you the correct number to pass to malloc, if your goal is to read the entire file into memory, but iterating over "characters" will not be so simple as for (i = 0; i < len; i++).
If you are using C's "wide streams" and "wide characters", then, just like text streams on Windows, the numbers you get from ftell on that file are not byte offsets and may not be useful for anything other than subsequent calls to fseek. But wide streams and characters are a bad design anyway; you're actually more likely to be able to handle all the world's languages correctly if you stick to processing UTF-8 by hand in narrow streams and characters.
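To illustrate the point about bytes versus characters in UTF-8, here is a small sketch that counts code points by skipping continuation bytes. The function name utf8_code_points is hypothetical, and the code assumes its input is valid UTF-8. It counts code points, not graphemes.

```c
#include <stddef.h>

/* Hypothetical helper: counts Unicode code points in a NUL-terminated
   UTF-8 string. Continuation bytes have the bit pattern 10xxxxxx and are
   skipped, so only the first byte of each encoded code point is counted.
   Assumes the input is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the composed form of Ä (bytes C3 84) this gives one code point for two bytes; for the decomposed form (A followed by bytes CC 88) it gives two code points for three bytes.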
I'm not sure why fseek()/ftell() is taught as a generic way to get the size of a file. It only works because an implementation defines it to work. POSIX does, for one. Windows does, also, for binary streams - but not for text streams.
It's wrong to not add a caveat or warning to, "This is how you get the number of bytes in a file." Because when a programmer first gets on a system that doesn't define fseek()/ftell() as byte offsets, they're going to have problems. I've seen it.
"But I was told this is how you can always do it."
"Well, no. Whoever taught you was wrong."
Because it is impossible to use fseek()/ftell() to get the size of a file in strictly-conforming C code.
For a binary stream, 7.21.9.2 The fseek function, paragraph 3 of the C standard:
For a binary stream, the new position, measured in characters from the
beginning of the file, is obtained by adding offset to the
position specified by whence. The specified position is the
beginning of the file if whence is SEEK_SET, the current value of
the file position indicator if SEEK_CUR , or end-of-file if
SEEK_END. A binary stream need not meaningfully support fseek
calls with a whence value of SEEK_END.
Footnote 268 specifically states:
Setting the file position indicator to end-of-file, as with
fseek(file, 0, SEEK_END), has undefined behavior for a binary
stream (because of possible trailing null characters) or for any
stream with state-dependent encoding that does not assuredly end in
the initial shift state.
So you can't seek to the end of a binary stream to get a file's size in bytes.
And for a text stream, 7.21.9.4 The ftell function, paragraph 2 states:
The ftell function obtains the current value of the file position
indicator for the stream pointed to by stream. For a binary
stream, the value is the number of characters from the
beginning of the file. For a text stream, its file position
indicator contains unspecified information, usable by the fseek
function for returning the file position indicator for the stream to
its position at the time of the ftell call; the difference
between two such return values is not necessarily a meaningful
measure of the number of characters written or read.
So you can't use ftell() on a text stream to get a byte count.
The only strictly-conformant approach that I'm aware of to get the number of bytes in a file is to read them one-by-one with fgetc() and count them.
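A minimal sketch of that counting approach might look like the following. The function name count_bytes and its error convention (returning -1) are assumptions made for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: counts the bytes in a file by reading it in binary
   mode one character at a time. Returns -1 if the file cannot be opened
   or a read error occurs. */
long count_bytes(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    long n = 0;
    while (fgetc(fp) != EOF)
        n++;

    int failed = ferror(fp);   /* distinguish a read error from end-of-file */
    fclose(fp);
    return failed ? -1 : n;
}
```

This is slow for large files, and a long can still overflow on systems where it is 32 bits, so treat it as a demonstration rather than a replacement for platform-specific calls.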

How can I tell the end of a line in C

I don't know whether the line is ended by '\n' or '\r' or '\r\n',
and I don't know how the text is encoded; besides, if the encoding is UTF-8, there may be no BOM.
Is there a function or a library that can do this, or can someone just tell me how to detect the termination of a line?
Are you by chance using fgets, fread, fputs, fwrite, etc, on a file that is open for reading text? If so, the implementation will automatically transform OS-specific line terminators (eg. "\r\n") into '\n' when reading, and transform '\n' into OS-specific line terminators when writing.
There are two other scenarios, one of which it turns out was OP:
OP was struggling with "\r\n" being carried over from other OS software, and so opening files for reading in his (presumably Unix-like) OS would no longer convert that. My suggestion is to use dos2unix for these one-off conversions, rather than bloating your code with something which will likely never run again.
You're not using one of those functions. This could be because you're using a stream such as a socket, and perhaps the protocol requires "\r\n". In this case, you should use strstr to find the exact sequence "\r\n".
UTF-8 was designed with a degree of compatibility to ASCII in mind, hence you can assume that any system that uses UTF-8 will also use ASCII or some similar character set. Any characters that use sequences larger than one byte will only use values 0x80 or greater to represent. Since '\n' lies within the 0x00-0x7F range, you're guaranteed that it'll be a single byte and it won't exist as part of a multi-byte character.
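If the data has been read in binary mode, or arrives from a socket, and any of the three terminators may appear, a byte-wise scan is enough, since neither '\r' nor '\n' can occur inside a multi-byte UTF-8 sequence. The sketch below assumes the whole line is already in a buffer; the name find_line_end is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: finds the first line terminator in buf[0..len).
   Returns the offset where the terminator starts and stores its length in
   *term_len: 1 for "\n" or a lone "\r", 2 for "\r\n". If no terminator is
   found, returns len and sets *term_len to 0. */
size_t find_line_end(const char *buf, size_t len, size_t *term_len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n') {
            *term_len = 1;
            return i;
        }
        if (buf[i] == '\r') {
            *term_len = (i + 1 < len && buf[i + 1] == '\n') ? 2 : 1;
            return i;
        }
    }
    *term_len = 0;
    return len;
}
```

One caveat: a '\r' in the very last byte of the buffer is reported as a one-byte terminator, so a caller reading in chunks would need to check whether the next chunk begins with '\n'.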
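Note that wcslen returns the number of wide characters in a wchar_t string, not the size in bytes of a UTF-8 string; for a UTF-8 string held in a char array, strlen gives the length in bytes.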
http://linux.die.net/man/3/wcslen

C file reading incorrect number of chars

I have stumbled across a problem where I am attempting to read in a file, which is, according to windows, '87.1 kb' in size, and using the ftell method in program, returns '89282', effectively confirming what windows is saying.
So why is every method to read chars from the file only returning 173 or 174 characters?
The file is a .GIF file renamed to .txt (and I am trying to build a program that can load the data fully as I am working on a program to download online images and need to run comparisons on them).
So far I have tried:
fgetc - This returns 173/174 chars.
fread - Same as above, this is with a string with 1024 or more spaces available.
fgets - Doesn't work (as it doesn't return how many characters it has read - characters which include nulls).
setvbuf - Disabling this with _IONBF, or even supplying a buffer of 1024 or more only means 173/174 is still returned.
fflush - This produced a 'result', although a negative one - it returned '2' chars instead of '173'.
I am utterly stumped as to why it isn't reading anything more than 173/174 chars. Is there something I need to compensate for or expect at the lower level? Some buffer I need to expand or some weird character I need to look out for?
Here's one thing to look at. Have a look at the file in a hex viewer and see if there's a CTRL-Z somewhere around that 173/174 offset.
Then check to see if you're opening it with the "r" mode.
If so, it may be that the Windows translation between text and binary is stopping your reading there because CTRL-Z is an EOF marker in text mode. If so, you can probably fix this with "rb" mode on the fopen.
Failing that, you need to post the smallest code segment that exhibits the problem behaviour. It may be obvious to some of us here but only usually if we can see the code :-)
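For reference, a minimal sketch of reading such a file in binary mode follows. The function name read_binary and the fixed-capacity buffer are assumptions for illustration; a real program would size the buffer to the file.

```c
#include <stdio.h>

/* Hypothetical helper: reads up to cap bytes from path into out using
   binary mode, so that neither CTRL-Z (0x1A) nor "\r\n" is treated
   specially. Returns the number of bytes read, or 0 if the file cannot
   be opened. */
size_t read_binary(const char *path, unsigned char *out, size_t cap)
{
    FILE *fp = fopen(path, "rb");   /* "rb" rather than "r" */
    if (fp == NULL)
        return 0;

    size_t total = 0, got;
    while (total < cap && (got = fread(out + total, 1, cap - total, fp)) > 0)
        total += got;

    fclose(fp);
    return total;
}
```

The count comes from the return value of fread, not from strlen, because image data routinely contains zero bytes.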

Is \n multi-character in C?

I read that \n consists of CR & LF. Each has their own ASCII codes.
So is the \n in C represented by a single character or is it multi-character?
Edit: Kindly specify your answer, rather than simply saying "yes, it is" or "no, it isn't"
In a C program, it's a single character, '\n', representing end of line. However, some operating systems (most notably Microsoft Windows) use two characters to represent end of line in text files, and this is likely where the confusion comes from.
It's the responsibility of the C I/O functions to do the conversions between the C representation of '\n' and whatever the OS uses.
In C programs, simply use '\n'. It is guaranteed to be correct. When looking at text files with some sort of editor, you might see two characters. When a text file is transferred from Windows to some Unix-based system, you might get "^M" showing up at the end of each line, which is annoying, but has nothing to do with C.
Generally: '\n' is a single character, which represents a newline. '\r' is a single character, which represents a carriage-return. They are their own independent ASCII characters.
Issues arise because in the actual file representation, UNIX-based systems tend to use '\n' alone to represent what you think of when you hit "enter" or "return" on the keyboard, whereas Windows uses a '\r' followed directly by a '\n'.
In a file:
"This is my UNIX file\nwhich spans two lines"
"This is my Windows file\r\nwhich spans two lines"
Of course, like all binary data, these characters are all about interpretation, and that interpretation depends on the application using the data. Stick to '\n' when you are making C-strings, unless you want a literal carriage-return, because as people have pointed out in the comments, the OS representation doesn't concern you. IO libraries, including C's, are supposed to handle this themselves and abstract it away from you.
For your curiosity, in decimal, '\n' in ASCII is 10, '\r' is 13, but note that this is the ASCII standard, not a C standard.
It depends:
'\n' is a single character (ASCII LF)
"\n" is a '\n' character followed by a 0 terminator
some I/O operations transform a '\n' into "\r\n" on some systems (CR-LF).
When you print the \n to a file, using the windows C stdio libraries, the library interprets that as a logical new-line, not the literal character 0x0A. The output to the file will be the windows version of a new-line: 0x0D0A (\r\n).
Writing
Sample code:
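#include <stdio.h>
int main(void) {
    FILE *f = fopen("foo.txt", "w");   /* text mode: '\n' is translated on output */
    if (f == NULL)
        return 1;
    fprintf(f, "foo\nbar");
    fclose(f);                         /* flush and close before exiting */
    return 0;
}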
A quick cl /EHsc foo.c later and you get
0x666F6F 0x0D0A 0x626172 (separated for convenience)
in foo.txt under a hex editor.
It's important to note that this translation DOES NOT occur if you are writing to a file in 'binary mode'.
Reading
If you are reading the file back in using the same tools, also on windows, the "windows EOL" will be interpreted properly if you try to match up against \n.
When reading it back
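#include <stdio.h>
int main(void) {
    FILE *f = fopen("foo.txt", "r");   /* text mode: "\r\n" is read back as '\n' */
    if (f == NULL)
        return 1;
    char c;
    while (fscanf(f, "%c", &c) == 1)   /* stop at end-of-file or on a read error */
        printf("%x-", (unsigned char)c);
    fclose(f);
    return 0;
}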
You get
66-6f-6f-a-62-61-72-
Therefore, the only time this should be relevant to you is if you are
Moving files back and forth between mac/unix and windows. Unix needs no real explanation here, since \n directly translates to 0x0A on those platforms. (pre-OSX \n was 0x0D on mac iirc)
Putting text in binary files, only do this carefully please
Trying to figure out why your binary data is being messed up when you opened the file "w", instead of "wb"
Estimating something important based on the size of the file, on windows you'll have an extra byte per newline.
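The size difference in the last point can be checked directly. The sketch below writes the same seven characters using a caller-chosen mode and then counts the bytes on disk in binary mode; the function name written_size is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: writes "foo\nbar" to path using the given fopen
   mode, then reopens the file in binary mode and counts the bytes that
   actually ended up on disk. Returns -1 on any failure. */
long written_size(const char *path, const char *mode)
{
    FILE *f = fopen(path, mode);
    if (f == NULL)
        return -1;
    fputs("foo\nbar", f);
    if (fclose(f) != 0)
        return -1;

    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    long n = 0;
    while (fgetc(f) != EOF)
        n++;
    fclose(f);
    return n;
}
```

With mode "wb" the result should be 7 on any platform. With mode "w" it should be 8 on Windows, because of the inserted carriage return, and 7 on Unix-like systems.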
\n is a new-line -- it's a logical representation of whatever separates one line from another in a text file.
A given platform will have some physical representation of that logical separation between lines. On Unix and most similar systems, the new-line is represented by a line-feed (LF) character (and since Unix was/is so closely associated with C, on Unix the LF is often just called a new-line). On classic Mac OS (before OS X), it's typically represented by a carriage-return (CR). On a fair number of other systems, most prominently Windows, it's represented by a carriage return/line feed pair -- normally in that order, though once in a while you see something use LF followed by CR (as I recall, Clarion used to do that).
In theory, a new-line doesn't need to correspond to any characters in the stream at all though. For example, a system could have text files that were stored as a length followed by the appropriate number of characters. In such a case, the run-time library would need to carry out a slightly more extensive translation between internal and external representations of text files than is now common, but such is life.
According to the C99 Standard (section 5.2.2),
\n "moves the active position [where the next character from fputc would appear] to the initial position on the next line".
Also
[\n] shall produce a unique implementation-defined value
which can be stored in a single char object. The external representations in a text file
need not be identical to the internal representations and are outside the scope of [the C99 Standard]
Most C implementations choose to define \n as ASCII line feed (0x0A) for historical reasons. However, on many computer operating systems, the sequence for moving the active position to the beginning of the next line requires two characters usually 0x0D, 0x0A. So, when writing to a text file, the C implementation must convert the internal sequence of 0x0A to the external one of 0x0D, 0x0A. How this is done is outside of the scope of the C standard, but usually, the file IO library will perform the conversion on any file opened in text mode.
Your question is about text files.
A text file is a sequence of lines.
A line is a sequence of characters ending in (and including) a line break.
A line break is represented differently by different operating systems.
On Unix/Linux/Mac they are usually represented by a single LINEFEED
On Windows they are usually represented by the pair CARRIAGE RETURN + LINEFEED
On old Macs they were usually represented by a single CARRIAGE RETURN
On other systems (AS/400 ??) there may even not be a specific character that represents a line break ...
Anyway, the library code in C is responsible for translating the system's line break to '\n' when reading text files and doing the reverse operation when writing text files.
So, no matter what the representation is on any given system, when you read a text file in C, lines will be ended by a '\n'.
Note: The '\n' is not necessarily 0x0a in all systems.
Yes it is.
\n is a newline. Hex code is 0x0A.
\r is a carriage return. Hex code is 0x0D
It is a single character. It represents Newline (but is not the only representation - Wikipedia).
EDIT: The question was changed while I was typing the answer.

overwriting a specific line on a text file?

How do I go about overwriting a specific line in a text file in C? I have values in multiple variables that need to be written to the file.
This only works when the new line has the same size as the old one:
Open the file in the mode r+ (in append mode, a+, every write goes to the end of the file regardless of fseek())
fseek() to the start of the file
Before reading the next line, use ftell() to note the start of the line
Read the line
If it's the line you want, fseek() again with the result from ftell() and use fwrite() to overwrite it.
If the length of the line changes, you must copy the file.
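The steps above can be sketched roughly as follows. The function name overwrite_line is hypothetical, the file is opened with "r+" so that writes land at the seek position, and the code assumes every line is shorter than the 256-byte buffer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: replaces the first line equal to old_line with
   new_line. Both strings exclude the newline and must have the same
   length. Assumes lines are shorter than 255 characters.
   Returns 0 on success, -1 otherwise. */
int overwrite_line(const char *path, const char *old_line, const char *new_line)
{
    if (strlen(old_line) != strlen(new_line))
        return -1;

    FILE *fp = fopen(path, "r+");
    if (fp == NULL)
        return -1;

    char buf[256];
    int result = -1;
    for (;;) {
        long start = ftell(fp);               /* note where this line begins */
        if (start == -1L || fgets(buf, sizeof buf, fp) == NULL)
            break;
        buf[strcspn(buf, "\n")] = '\0';       /* drop the trailing newline */
        if (strcmp(buf, old_line) == 0) {
            /* Seek back to the saved position, then overwrite in place. */
            if (fseek(fp, start, SEEK_SET) == 0 && fputs(new_line, fp) != EOF)
                result = 0;
            break;
        }
    }
    if (fclose(fp) != 0)
        result = -1;
    return result;
}
```

The fseek between the read and the write is required in any case: on a stream opened for update, the standard requires a positioning call when switching from reading to writing.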
Since files (from the point of view of C's standard library) are not line-oriented, but are just a sequence of characters (or bytes in binary mode), you can't expect to edit them at the line-level easily.
As Aaron described, you can of course replace the characters that make up the line if your replacement is the exact same character count.
You can also (perhaps) insert a shorter replacement by padding with whitespace at the end (before the line terminator). That's of course a bit crude.

Resources