How EOF is defined for binary and ascii files - c

I'm programming in C on Windows (the system language is Japanese), and I have a problem with EOF in binary and ASCII files.
I asked this question last week, a kind guy helped me, but I still can't really understand how the program works when reading a binary or an ascii file.
I did the following test:
Test1:
int oneChar;
iFile = fopen("myFile.tar.gz", "rb");
while ((oneChar = fgetc(iFile)) != EOF) {
    printf("%d ", oneChar);
}
Test2:
int oneChar;
iFile = fopen("myFile.tar.gz", "r");
while ((oneChar = fgetc(iFile)) != EOF) {
    printf("%d ", oneChar);
}
In the test1 case, things worked perfectly for both binary and ascii files. But in test2, program stopped reading when it encountered 0x1A in a binary file. (Does this mean that 1A == EOF?) ASCII table tells me that 1A is a control character called substitute (whatever that means...) And when I printf("%d", EOF), however, it gave me -1...
I also found this question which tells me that the OS knows exactly where a file ends, so I don't really need to find EOF in the file, because EOF is out of the range of a byte (what about 1A?)
Can someone clear things up a little for me? Thanks in advance.

This is a Windows-specific behavior for files opened in text mode: the SUB character (0x1A, typed as Ctrl+Z) is treated as an end-of-file marker, so fgetc returns EOF when it reaches one. You do not need a 0x1A in your text file to get EOF back from fgetc, though: once you reach the actual end of the file, EOF is returned.
The standard does not define 1A as the char value to represent an EOF. The constant for EOF is of type int, with a negative value outside the range of unsigned char. In fact, the reason why fgetc returns an int, not char, is to let it return a special value for EOF.
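To make the distinction concrete, here is a minimal sketch of a byte counter. The function name count_bytes_binary is made up for illustration. It opens the file in binary mode, so no text-mode translation is applied, and it keeps the result of fgetc in an int so that EOF stays distinct from every byte value.

```c
#include <stdio.h>

/* Hypothetical helper: counts every byte in a file opened in binary mode.
 * Returns -1 if the file cannot be opened or a read error occurs. */
long count_bytes_binary(const char *path)
{
    FILE *fp = fopen(path, "rb");   /* "rb": no Ctrl-Z or line-break translation */
    if (fp == NULL)
        return -1;

    long count = 0;
    int ch;                         /* int, not char, so EOF differs from byte 0xFF */
    while ((ch = fgetc(fp)) != EOF)
        count++;

    int failed = ferror(fp);        /* EOF can also mean a read error */
    fclose(fp);
    return failed ? -1 : count;
}
```

A file that contains the bytes 0x1A or 0xFF is counted in full this way, because neither byte compares equal to EOF.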

The convention of ending a file with Ctrl-Z originated with CP/M, a very old operating system for 8080/Z80 microcomputers. Its file system did not keep track of file sizes down to the byte level, only to the 128-byte sector level, so there needed to be another way to mark the end-of-file.
Microsoft's DOS was made to be as compatible with CP/M as possible, so it kept the convention when reading text files. By this time the file size was kept by the file system so it wasn't strictly necessary, just retained for backward compatibility.
This convention has persisted to the present day in the C and C++ libraries for Windows; when you open a file in text mode, every character is checked for Ctrl-Z and the end-of-file flag is set if it's detected. You're seeing the effects of backwards compatibility taken to an extreme, back to systems that are almost 40 years old.

Found a terrific article that answers all the questions! https://latedev.wordpress.com/2012/12/04/all-about-eof/

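On Windows, the end-of-file marker recognized in text-mode files is the character 0x1A (ASCII 26, Ctrl-Z). This is not the same thing as the EOF value that fgetc returns.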

Related

How to know if the file end with a new line character or not

I'm trying to append a line to the end of a file whose lines have the shape "1 :1 :1 :1". The file may or may not already end with a newline character, and I have to handle both cases, so I came up with the following solution:
Go to the end of the file and move back by 1 character (the length of the newline character on Linux, I guess), then read that character. If it isn't a newline, write one and then write the new line; otherwise just write the new line. Here is that solution in C:
int insert_element(char filename[]){
    elements *elem;
    FILE *p,*test;
    size_t size = 0;
    char *buff=NULL;
    char c='\n';
    if((p = fopen(filename,"a"))!=NULL){
        if(test = fopen(filename,"a")){
            fseek(test,-1,SEEK_END );
            c= getc(test);
            if(c!='\n'){
                fprintf(test,"\n");
            }
        }
        fclose(test);
        p = fopen(filename,"a");
        fseek(p,0,SEEK_END);
        elem=(elements *)malloc(sizeof(elements));
        fflush(stdin);
        printf("\ninput the ID\n");
        scanf("%d",&elem->id);
        printf("input the address \n");
        scanf("%s",elem->adr);
        printf("input the type \n");
        scanf("%s",elem->type);
        printf("input the mark \n");
        scanf("%s",elem->mark);
        fprintf(p,"%d :%s :%s :%s",elem->id,elem->adr,elem->type,elem->mark);
        free(elem);
        fflush(stdin);
        fclose(p);
        return 1;
    }else{
        printf("\nError while opening the file !\n");
        return 0;
    }
}
As you may notice, the whole program depends on the length of the newline character (1 character, "\n"), so I wonder if there is a better way, in other words one that works on all OSes.
It seems you already understand the basics of appending to a file, so we just have to figure out whether the file already ends with a newline.
In a perfect world, you'd jump to the end of the file, back up one character, read that character, and see if it matches '\n'. Something like this:
FILE *f = fopen(filename, "r");
fseek(f, -1, SEEK_END); /* this is a problem */
int c = fgetc(f);
fclose(f);
if (c != '\n') {
    /* we need to append a newline before the new content */
}
Though this will likely work on Posix systems, it won't work on many others. The problem is rooted in the many different ways systems separate and/or terminate lines in text files. In C and C++, '\n' is a special value that tells the text mode output routines to do whatever needs to be done to insert a line break. Likewise, the text mode input routines will translate each line break to '\n' as it returns the data read.
On Posix systems (e.g., Linux), a line break is indicated by a line feed character (LF) which occupies a single byte in UTF-8 encoded text. So the compiler just defines '\n' to be a line feed character, and then the input and output routines don't have to do anything special in text mode.
On some older systems (like old MacOS and Amiga) a line break might be represented by a carriage return character (CR). Many IBM mainframes used different character encodings called EBCDIC that don't have direct mappings for LF or CR, but they do have a special control character called next line (NL). There were even systems (like VMS, IIRC) that didn't use a stream model for text files but instead used variable length records to represent each line, so the line breaks themselves were implicit rather than marked by a specific control character.
Most of those are challenges you won't face on modern systems. Unicode added more line break conventions, but very little software supports them in a general way.
The remaining major line break convention is the combination CR+LF. What makes CR+LF challenging is that it's two control characters, but the C i/o functions have to make them appear to the programmer as though they are the single character '\n'. That's not a big deal with streaming text in or out. But it makes seeking within a file hard to define. And that brings us back to the problematic line:
fseek(f, -1, SEEK_END);
What does it mean to back up "one character" from the end on a system where line breaks are indicated by a two character sequence like CR+LF? Do we really want the i/o system to have to possibly scan the entire file in order for fseek (and ftell) to figure out how to make sense of the offset?
The C standards people punted. In text mode, the offset argument for fseek can only be 0 or a value returned by a previous call to ftell. So the problematic call, with a negative offset, isn't valid. (On Posix systems, the invalid call to fseek will likely work, but the standard doesn't require it to.)
Also note that Posix defines LF as a line terminator rather than a separator, so a non-empty text file that doesn't end with a '\n' should be uncommon (though it does happen).
For a more portable solution, we have two choices:
Read the entire file in text mode, remembering whether the most recent character you read was '\n'.
This option is hugely inefficient, so unless you're going to do this only occasionally or only with short files, we can rule that out.
Open the file in binary mode, seek backwards a few bytes from the end, and then read to the end, remembering whether the last thing you read was a valid line break sequence.
This might be a problem if our fseek doesn't support the SEEK_END origin when the file is opened in binary mode. Yep, the C standard says supporting that is optional. However, most implementations do support it, so we'll keep this option open.
Since the file will be read in binary mode, the input routines aren't going to convert the platform's line break sequence to '\n'. We'll need a state machine to detect line break sequences that are more than one byte long.
Let's make the simplifying assumption that a line break is either LF or CR+LF. In the latter case, we don't care about the CR, so we can simply back up one byte from the end and test whether it's LF.
Oh, and we have to figure out what to do with an empty file.
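#include <stdbool.h>
#include <stdio.h>

/* Returns true if a line break must be written before appending new content. */
bool NeedsLineBreak(const char *filename) {
    const int LINE_FEED = '\x0A';
    FILE *f = fopen(filename, "rb"); /* binary mode */
    if (f == NULL) return false;
    const bool empty_file = fseek(f, 0, SEEK_END) == 0 && ftell(f) == 0;
    /* An empty file needs no line break; otherwise test whether the last byte is LF. */
    const bool result = !empty_file &&
        (fseek(f, -1, SEEK_END) == 0 && fgetc(f) != LINE_FEED);
    fclose(f);
    return result;
}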
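Putting the pieces together, appending a line could be sketched roughly as follows. The names missing_final_newline and append_line are hypothetical, and the sketch keeps the same simplifying assumption as above: a line break is LF or CR+LF, so only the last byte is inspected. It also assumes fseek supports SEEK_END on a binary stream, which the C standard does not guarantee.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the file is non-empty and its last byte
 * is not LF, 0 if it is empty or already ends with LF, and -1 on error. */
int missing_final_newline(const char *path)
{
    FILE *f = fopen(path, "rb");            /* binary mode: raw bytes */
    if (f == NULL)
        return -1;

    int result = -1;
    if (fseek(f, 0L, SEEK_END) == 0) {
        long size = ftell(f);
        if (size == 0) {
            result = 0;                     /* empty file: nothing to terminate */
        } else if (size > 0 && fseek(f, -1L, SEEK_END) == 0) {
            int c = fgetc(f);
            if (c != EOF)
                result = (c != 0x0A);       /* 0x0A is LF */
        }
    }
    fclose(f);
    return result;
}

/* Hypothetical helper: appends one line of text, writing a line break first
 * if the existing content does not end with one. Returns 1 on success. */
int append_line(const char *path, const char *line)
{
    int missing = missing_final_newline(path);
    FILE *f = fopen(path, "a");             /* text mode: '\n' becomes the platform line break */
    if (f == NULL)
        return 0;
    if (missing == 1)
        fputc('\n', f);
    fprintf(f, "%s\n", line);
    return fclose(f) == 0;
}
```

The check and the append use two separate streams on purpose: the check needs raw bytes, while the append should go through text mode so that '\n' is written in the platform's own convention.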

What is the use of `putw` and `getw` function in c?

I want to know what the putw() and getw() functions are for. As far as I know, they are used to write to and read from a file, much like putc and getc, but they deal only with integers. But when I use them to write integers, a different symbol ends up in the file (for example, if I write 65 to the file using putw(), it writes A in the file). Why does it take the ASCII value? I am using Code::Blocks 13.12. Code:
#include <stdio.h>
int main() {
    FILE *fp;
    int num;
    fp = fopen("file.txt", "w");
    printf("Enter any number:\n");
    scanf("%d", &num);
    putw(num, fp);
    fclose(fp);
    printf("%d\n", num);
    return 0;
}
Let's read the point to point explanation of getw() and putw() functions.
getw() and putw() are related to FILE handling.
putw() is used to write integer data to a file.
getw() is used to read integer data from a file.
getw() and putw() are similar to getc() and putc(). The difference is that getw() and putw() read and write a whole int rather than a single character.
int putw(integer, FILE*);
The return type of the function is int.
It takes two arguments: the first is the integer you want to write, and the second is the FILE* stream the data is written to.
Now let's see an example.
#include <stdio.h>
int main()
{
    FILE *fp;
    fp=fopen("file1.txt","w");
    putw(65,fp);
    fclose(fp);
    return 0;
}
Here putw() takes the integer (65 in this case) and writes it to file1.txt, but if we open the file in a text editor we find 'A' in it. This does not mean that putw() converts the number to a character.
putw() writes the int in its binary representation, sizeof(int) bytes in total. On a typical little-endian machine the first of those bytes is 0x41, which is the ASCII code of 'A', and the remaining zero bytes are usually not displayed by the editor.
int getw(FILE*);
The return type is int.
It takes one argument, the FILE* stream from which the integer is read.
In this below example we will read the data that we have written on the file named file1.txt in the example above.
#include <stdio.h>
int main()
{
    FILE *fp;
    int ch;
    fp=fopen("file1.txt","r");
    ch=getw(fp);
    printf("%d",ch);
    fclose(fp);
    return 0;
}
output
65
Explanation: Here we read back the data we wrote to file1.txt in the program above, and we get 65 as the output.
getw() reads the sizeof(int) bytes that putw() wrote and returns them as the int 65. It does not read the single character 'A' and look up its ASCII code.
We can also write the above program as:
#include <stdio.h>
int main()
{
    FILE *fp;
    fp=fopen("file1.txt","r");
    printf("%d",getw(fp));
    fclose(fp);
    return 0;
}
output
65
If num is an int, then putw(num, fp) is equivalent to fwrite(&num, sizeof(int), 1, fp), except for having a different return value. It writes an int to the file in binary format. getw is similar but with fread instead. You can see how glibc implements them: putw,getw.
This means that:
They are not appropriate for writing text. If you want to write a number to a file in human-readable decimal or hexadecimal format, use fprintf instead.
They typically read/write more than one byte (one character) to the file. For instance, on a machine with 32-bit ints, they will read/write four bytes. Attempting to do putw('c', fp) will not simply write the single character 'c'.
They should only be used with files opened in binary mode (if that makes a difference on your system).
You should not expect the contents of the file to be human-readable at all. If you attempt to view the file in an editor, you'll see the representation of whatever bytes are in the file, in your current character set (e.g. ASCII).
You should not expect the file to be successfully read back on another computer that uses a different internal representation for int (e.g. different width, different endianness).
On a typical system with 32-bit little-endian int, putw(65, fp) will result in the four bytes 0x41 0x00 0x00 0x00. The 0x41 (decimal 65) is the ASCII code for the character A, so you'll see that if you view it. The 0x00 bytes may or may not be displayed at all, depending on what you are using to view.
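One way to see those bytes for yourself is to write the int with fwrite and then dump the file byte by byte. This is only a sketch: write_int_binary and dump_bytes are names invented here, and fwrite stands in for putw on the assumption, stated above, that the two write the same bytes.

```c
#include <stdio.h>

/* Hypothetical helper: writes an int in the machine's native binary
 * representation, using the standard fwrite. Returns 1 on success. */
int write_int_binary(const char *path, int value)
{
    FILE *fp = fopen(path, "wb");           /* binary mode */
    if (fp == NULL)
        return 0;
    size_t n = fwrite(&value, sizeof value, 1, fp);
    return (fclose(fp) == 0) && n == 1;
}

/* Hypothetical helper: prints each byte of the file in hex.
 * Returns the number of bytes read, or -1 if the file cannot be opened. */
long dump_bytes(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    long count = 0;
    int ch;
    while ((ch = fgetc(fp)) != EOF) {
        printf("0x%02X ", (unsigned)ch);
        count++;
    }
    putchar('\n');
    fclose(fp);
    return count;
}
```

Calling write_int_binary("file1.bin", 65) and then dump_bytes("file1.bin") prints sizeof(int) bytes. The order of the bytes depends on the machine's endianness, which is exactly why such files are not portable.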
These functions are not a good idea to use in new code. Even if you do need to store binary data in files, which has various disadvantages as noted and should usually only be done if there is a very good reason for it, you should simply use fwrite and fread. getw/putw are a worse option because:
They will make your code less portable. fwrite/fread are part of the ISO C standard, which is the most widely supported cross-platform modern standard for the C language. getw/putw were present in the Single Unix Specification version 2 (SUSv2), which dates to 1997 and is now obsolete. They were not included in the POSIX/SUSv3 specs which superseded SUSv2, and it would be unwise to count on them being available on new systems.
They will make your code less readable. Since fread/fwrite are far more widely used, another programmer reading your code will recognize immediately what they do. Since getw/putw are more obscure, people are likely to have to go and look them up, and the names don't make it easy to remember that they operate specifically on the type int. Readers may also confuse them with the similarly-named ISO-standard functions getwc/putwc.
They may introduce subtle bugs. getw returns EOF on end-of-file or error, but EOF is a valid integer value (often -1). Therefore, if it returns this value, you cannot easily tell whether the file actually contained the integer -1, or whether the read failed. And since this only happens for one particular value, it may be missed in testing. You can check ferror() and feof() to distinguish the two cases, but this is awkward, easy to forget to do, and negates most of the apparent convenience of the "simpler" interface of getw compared to fread.
I speculate that the only reason these functions existed in the first place is that, like putc (respectively getc), putw could be implemented as a macro that directly wrote the buffer of fp and would thus be a little more efficient than calling fwrite. Such an implementation is no longer feasible on modern systems, since it wouldn't be thread-safe, so putw needs a function call anyway. In fact, with glibc in particular, putw just calls fwrite after all, with the overhead of an additional function call, so it's now less efficient. So there is no longer any reason at all to use these functions.
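For comparison, here is a sketch of the fread/fwrite alternative. The wrapper names read_int and write_int are made up for this example. Because success is reported through the return value and the int comes back through a pointer, a stored -1 cannot be confused with end-of-file.

```c
#include <stdio.h>

/* Hypothetical wrapper: writes one int in native binary format.
 * Returns 1 on success, 0 on failure. */
int write_int(FILE *fp, int value)
{
    return fwrite(&value, sizeof value, 1, fp) == 1;
}

/* Hypothetical wrapper: reads one int in native binary format into *out.
 * Returns 1 on success, 0 on end-of-file or error; the caller can use
 * feof() and ferror() to tell those two apart. */
int read_int(FILE *fp, int *out)
{
    return fread(out, sizeof *out, 1, fp) == 1;
}
```

A loop such as while (read_int(fp, &value)) then processes every stored int, including -1, and stops only when the data really runs out.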
From the man page of putw() and getw()
getw() reads a word (that is, an int) from stream. It's provided for compatibility with SVr4.
putw() writes the word w (that is, an int) to stream. It is provided for compatibility with SVr4.
In new code, it is better to use the fread and fwrite functions instead.
getw() reads an integer from the given FILE stream.
putw() writes the integer given in the first argument to the given FILE stream.
getw:
It reads an int from the file, much as getc() reads a char, but it consumes sizeof(int) bytes at a time. Consider a file with the content "hello": on a machine with 4-byte ints, getw() reads the four bytes "hell" and returns them reinterpreted as a single int, not the ASCII value of 'h'.
putw:
It writes the bytes of the given int to the file, much as putc() writes a char. If one of those bytes happens to be an ASCII code (such as 65), a text viewer shows the corresponding character ('A').

Why does the file size decrease after encryption using an offset cipher?

I encrypted a text file using an offset cipher in C. For this, I simply added 128 to each character and got the file size decreased by 3 bytes. I tried the same on some other files too just to get the same result, i.e. decrease in file size by 3 bytes. I got the original size after decryption.
Could you please tell me why does it so happen?
Code for the main logic is given below:
while((ch=fgetc(fs))!=EOF){
    fputc(ch+128, ft);
}
Could you please tell me why does it so happen?
Your ch probably has the wrong declaration. The fgetc() function returns an int, not a char, and if you cast to char you will lose the distinction between (char) 0xff and EOF.
// WRONG WRONG WRONG
// char ch = fgetc(fs);
The right declaration:
int ch = fgetc(fs);
Otherwise, it shouldn't happen. Is your process exiting cleanly? If you abort(), then there might be data still in FILE * buffers. Show more code. Run with Valgrind. Check the exit status of your process.
I think the file size should have doubled as two bytes were taken for one character after encryption as something greater than 127 can not be stored in 1 byte.
No, fputc() does not work that way. The fputc() man page (run man fputc in a terminal, unless on Windows):
fputc() writes the character c, cast to an unsigned char, to stream.
Conversion to unsigned char is done by taking the value modulo 256*. So fputc() always writes exactly one byte of data (unless it fails).
* This is true on all but exceedingly rare systems.
If you are on Windows, I suspect that you opened the files in text mode, not in binary mode.
That leads to the following:
Writing \n leads to a \r\n written to the file.
Reading \r\n from the file gives only \n to the user.
Reading stops at the first \x1A, which is treated as an end-of-file marker.
If you add 128 to each byte, the data-to-be-written rolls over at 256. fputc() converts its argument to unsigned char, so a value above 255 is written wrapped modulo 256 on typical systems (you can make that explicit by writing (ch+128)%256 or (ch+128) & 0xFF). As a result you may get \n or \x1A by accident.
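A minimal sketch of the cipher that avoids both problems might look like this. The function name offset_cipher is invented for the example. It opens both files in binary mode, keeps the result of fgetc in an int, and wraps the sum explicitly.

```c
#include <stdio.h>

/* Hypothetical helper: copies src to dst, adding 128 (mod 256) to every byte.
 * Returns the number of bytes written, or -1 on any failure. */
long offset_cipher(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");            /* binary: no \r\n or 0x1A handling */
    if (in == NULL)
        return -1;
    FILE *out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    long written = 0;
    int ch;                                 /* int, so EOF is distinct from 0xFF */
    while ((ch = fgetc(in)) != EOF) {
        if (fputc((ch + 128) & 0xFF, out) == EOF) {
            written = -1;
            break;
        }
        written++;
    }
    if (ferror(in))
        written = -1;
    fclose(in);
    if (fclose(out) != 0)
        written = -1;
    return written;
}
```

Since 128 + 128 is 256, running the same function on the encrypted file restores the original bytes, and in binary mode the output always has the same size as the input.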

How to write äõüö is C? [duplicate]

I am trying to do my own version of wc (unix filter), but I have a problem with non-ASCII characters. I did a HEX dump of a text file and found out that these characters occupy more than one byte. So they won't fit to char. Is there any way I can read these characters from file and handle them like single characters (in order to count characters in a file) in C?
I've been googling a little bit and found some wchar_t type, but there were not any simple examples how to use it with files.
I've been googling a little bit and found some wchar_t type, but there was not any simple example how to use it with files.
Well met. There weren't any simple examples because, unfortunately, proper character set support isn't simple.
Aside: In an ideal world, everybody would use UTF-8 (a Unicode encoding that is memory-efficient, robust, and backward-compatible with ASCII), the standard C library would include UTF-8 encoding-decoding support, and the answer to this question (and dealing with text in general) would be simple and straightforward.
The answer to the question "What is the best unicode library for C?" is to use the ICU library. You may want to look at ustdio.h, as it has a u_fgetc function, and adding Unicode support to your program will probably take little more than typing u_ a few times.
Also, if you can spare a few minutes for some light reading, you may want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!) from Joel On Software.
I, personally, have never used ICU, but I probably will from now on :-)
If you want to write a standard C version of the wc utility that respects the current language setting when it is run, then you can indeed use the wchar_t versions of the stdio functions. At program startup, you should call setlocale():
setlocale(LC_CTYPE, "");
This will cause the wide character functions to use the appropriate character set defined by the environment, e.g. on Unix-like systems, the LANG environment variable. For example, this means that if your LANG variable is set to a UTF-8 locale, the wide character functions will handle input and output in UTF-8. (This is how the POSIX wc utility is specified to work).
You can then use the wide-character versions of all the standard functions. For example, if you have code like this:
long words = 0;
int in_word = 0;
int c;
while ((c = getchar()) != EOF)
{
    if (isspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}
...you would convert it to the wide character version by changing c to a wint_t, getchar() to getwchar(), EOF to WEOF and isspace() to iswspace():
long words = 0;
int in_word = 0;
wint_t c;
while ((c = getwchar()) != WEOF)
{
    if (iswspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}
Go have a look at ICU. That library is what you need to deal with all the issues.
Most of the answers so far have merit, but which you use depends on the semantics you want:
If you want to process text in the configured locale's encoding, and don't care about complete failure in the case of encountering invalid sequences, using getwchar() is fine.
If you want to process text in the configured locale's encoding, but need to detect and recover from invalid sequences, you need to read bytes and use mbrtowc manually.
If you always want to process text as UTF-8, you need to read bytes and feed them to your own decoder. If you know in advance the file will be valid UTF-8, you can just count bytes in the ranges 00-7F and C2-F4 and skip counting all other bytes, but this could give wrong results in the presence of invalid sequences. A more robust approach would be decoding the bytestream to Unicode codepoints and counting the number of successful decodes.
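The second option, reading bytes and calling mbrtowc yourself, could be sketched as below. The function name count_chars_mb and its invalid out-parameter are assumptions made for this example. The caller is expected to have called setlocale(LC_CTYPE, "") first so that mbrtowc decodes the locale's encoding.

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts characters in the current locale's encoding by
 * feeding one byte at a time to mbrtowc. Bytes that cannot start or continue
 * a valid sequence are tallied in *invalid and skipped. A sequence left
 * incomplete at end-of-file is neither counted nor reported here. */
long count_chars_mb(FILE *fp, long *invalid)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);        /* initial conversion state */

    long count = 0;
    int ch;
    *invalid = 0;
    while ((ch = fgetc(fp)) != EOF) {
        char byte = (char)ch;
        wchar_t wc;
        size_t r = mbrtowc(&wc, &byte, 1, &state);
        if (r == (size_t)-2)
            continue;                       /* incomplete so far: need more bytes */
        if (r == (size_t)-1) {
            (*invalid)++;                   /* invalid sequence: recover and go on */
            memset(&state, 0, sizeof state);
            continue;
        }
        count++;                            /* one complete character decoded */
    }
    return count;
}
```

Unlike the getwchar() loop, this version keeps going after an invalid sequence, which is the recovery behavior the second option asks for.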
Hope this helps.
Are you sure you really need the number of characters? By default, wc reports the number of bytes, not characters (use wc -m to count characters).
~$ echo 'דניאל' > hebrew.txt
~$ wc hebrew.txt
1 1 11 hebrew.txt
(11 = 5 two-byte characters + 1 byte for '\n')
However, if you really do want to count characters rather than bytes, and can assume that your text files are encoded in UTF-8, then the easiest approach is to count all bytes that are not trail bytes (trail bytes are those in the range 0x80 to 0xBF).
If you can't assume UTF-8 but can assume that any non-UTF-8 files are in a single-byte encoding, then perform a UTF-8 validation check on the data. If it passes, return the number of UTF-8 lead bytes. If it fails, return the number of total bytes.
(Note that the above approach is specific to wc. If you're actually doing something with the characters rather than just counting them, you'll need to know the encoding.)
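As a sketch of the trail-byte approach, assuming the input is valid UTF-8 (the function name count_utf8_chars is made up for this example):

```c
#include <stdio.h>

/* Hypothetical helper: counts code points in a UTF-8 stream by counting every
 * byte that is not a trail byte. Trail bytes have the bit pattern 10xxxxxx,
 * i.e. they lie in the range 0x80 to 0xBF. Assumes the input is valid UTF-8. */
long count_utf8_chars(FILE *fp)
{
    long count = 0;
    int ch;
    while ((ch = fgetc(fp)) != EOF) {
        if ((ch & 0xC0) != 0x80)            /* not a trail byte */
            count++;
    }
    return count;
}
```

For the hebrew.txt example above, which holds five two-byte characters and a newline in 11 bytes, this counts 6. Note that it counts code points, so a base letter followed by a combining mark counts as two.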

Resources