How many bits are read by fgetc in a stream? - c

How many bits are read by the function fgetc in a stream?
The man page of fgetc says that this function reads a "character", but "character" is not a clear definition to me. How many bits does a "character" contain? Is reading a character with fgetc equivalent to reading a byte?
Does it depend on the architecture of the machine and on the size of "char" or "byte"?
My objective is to read binary data from a stream portably (byte = 8 bits or byte = 16 bits). Is it a better idea to use fread/fwrite with types like uintN_t instead of fgetc in order to control how many bits are read from the stream? Is there a better solution?

How many bits does a "character" contain?
A character contains precisely CHAR_BIT bits, an implementation-specific value defined in limits.h.
/* Number of bits in a `char'. */
# define CHAR_BIT 8
Is reading a character with fgetc equivalent to reading a byte?
Yup, fgetc reads exactly one byte.
This portability problem isn't easily solvable. The best way around it is to not make assumptions on the binary representation.

fgetc reads exactly one byte. A character type (signed char, char, unsigned char, and qualified versions) contains CHAR_BIT bits (<limits.h>), which is a constant greater than or equal to 8.

Your platform has a smallest unit of data, which corresponds to the C data type char. All I/O happens in units of chars. You are guaranteed that a char can hold the values 0–127, and either 0–255 or −127–127. Everything else is platform-specific. (The actual number of bits inside a char is contained in the macro CHAR_BIT.)
That said, as long as you only write and read values within the advertised range into each char, you are guaranteed that your program will work on any conforming platform. The only thing you are not guaranteed is that the resulting data stream will be binarily identical.
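For the original portability goal (reading fixed-width binary data regardless of the platform's char size), one common approach is to read the value one octet at a time and assemble it explicitly. Here is a minimal sketch, assuming the file stores a 16-bit value in big-endian order (the function name read_u16_be is made up for illustration):
#include <stdint.h>
#include <stdio.h>

/* Read a 16-bit big-endian value one byte at a time.
   Returns 0 on success, -1 on end of file or read error. */
int read_u16_be(FILE *fp, uint16_t *out)
{
    int hi = fgetc(fp);
    int lo = fgetc(fp);

    if (hi == EOF || lo == EOF)
        return -1;
    /* Mask to 8 bits so the result is the same even if CHAR_BIT > 8
       and the stream delivers wider "bytes". */
    *out = (uint16_t)((((unsigned)hi & 0xFFu) << 8) | ((unsigned)lo & 0xFFu));
    return 0;
}
Because the assembly is done arithmetically rather than by overlaying memory, the result does not depend on the host's byte order or on the exact width of char.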

Related

read function in C: uint8_t and char array buffer differences

In some code seen online, I saw that someone used a uint8_t array instead of a char array as the buffer for the read function in C.
What are the differences?
Thanks
The C standard allows char to be signed or unsigned. It also allows it to be more than eight bits.
uint8_t, if it is defined, is unsigned and eight bits. This allows programmers to be completely sure of the type that will be used. In particular, signed char types sometimes cause problems with bitwise and shift operations, due to how these operations are defined (or are not defined) when negative values are involved.
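A small illustration of the kind of surprise the answer mentions, assuming a platform where plain char is signed and 8 bits wide (my own sketch, not from the original answer):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    char c = (char)0x80;   /* likely -128 if char is signed (implementation-defined) */
    uint8_t u = 0x80;      /* always 128 */

    /* Right-shifting a negative value is implementation-defined;
       right-shifting the uint8_t value simply gives 64 (0x40). */
    printf("c >> 1 = %d\n", c >> 1);
    printf("u >> 1 = %d\n", u >> 1);
    return 0;
}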
So every char corresponds to a number (see an ASCII table). I think people use this to avoid some problems (sorry, I don't use C, I come from C++).

C Language: Why can an int variable store a char?

I am recently reading The C Programming Language by Kernighan.
There is an example which defines a variable as int type but uses getchar() to store a value in it.
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing that I can think about is ASCII and UNICODE.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (like when char is an unsigned type).
Also, in many places where one uses a char variable, it will silently be promoted to int anyway. And that includes character constants like 'A'.
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
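For example, the usual read loop keeps the value in an int until after the EOF test (a minimal sketch of the standard idiom):
#include <stdio.h>

int main(void)
{
    int c;   /* int, not char, so EOF stays distinguishable from any byte value */

    while ((c = getchar()) != EOF) {
        putchar(c);   /* at this point c is known to be in the range of unsigned char */
    }
    return 0;
}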
Storing an int to a char has implementation defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem, which should have mandated the char type to be unsigned, but the C Standard allowed for many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
getchar is an old C standard function and the philosophy back then was closer to how the language gets translated to assembly than type correctness and readability. Keep in mind that compilers were not optimizing code as much as they are today. In C, int is the default return type (i.e. if you don't have a declaration of a function in C, compilers will assume that it returns int), and returning a value is done using a register - therefore returning a char instead of an int actually generates additional implicit code to mask out the extra bytes of your value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your <limits.h>.
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (that can be worked with by hardware); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number '1'. (For historical reasons, lots of control characters came first - so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining UC letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
In C you can do arithmetic directly on a char, because it is promoted to int in arithmetic expressions. So, to convert all LC letters to UC, you can do something like:
char letter;
/* ... */
if (letter >= 'a' && letter <= 'z') {   /* is it a lowercase letter? */
    letter = letter - 32;               /* 'a' - 'A' == 32 in ASCII */
}
No cast to int is needed before adding or subtracting; the char value is promoted to int automatically.
But, in the end, the type 'char' is really just a small integer type, since ASCII assigns a unique integer to each letter.

Why char is of 1 byte in C language

Why is a char 1 byte long in C? Why is it not 2 bytes or 4 bytes long?
What is the basic logic behind it to keep it as 1 byte? I know in Java a char is 2 bytes long. Same question for it.
char is 1 byte in C because it is specified so in standards.
The most probable logic is this: the (binary) representation of a char (in the standard character set) can fit into 1 byte. At the time of the primary development of C, the most commonly available standards were ASCII and EBCDIC, which needed 7- and 8-bit encodings, respectively. So, 1 byte was sufficient to represent the whole character set.
OTOH, by the time Java came into the picture, the concepts of extended character sets and Unicode were present. So, to be future-proof and support extensibility, char was given 2 bytes, which is capable of handling extended character set values.
Why would a char hold more than 1 byte? A char normally represents an ASCII character. Just have a look at an ASCII table: there are only 256 characters in the (extended) ASCII code. So you only need to represent numbers from 0 to 255, which comes down to 8 bits = 1 byte.
Have a look at an ASCII Table, e.g. here: http://www.asciitable.com/
That's for C. When Java was designed, they anticipated that in the future it would be enough for any character (also Unicode) to be held in 16 bits = 2 bytes.
It is because the C language is 37 years old and there was no need to have more bytes for 1 char, as only 128 ASCII characters were used (http://en.wikipedia.org/wiki/ASCII).
When C was developed in the early 1970s (its developers published the first book on it in 1978), the two primary character encoding standards were ASCII and EBCDIC, which were 7- and 8-bit encodings for characters, respectively. And memory and disk space were both of greater concern at the time; C was popularized on machines with a 16-bit address space, and using more than a byte per character in strings would have been considered wasteful.
By the time Java came along (mid 1990s), some with vision were able to perceive that a language could make use of an international standard for character encoding, and so Unicode was chosen for its definition. Memory and disk space were less of a problem by then.
The C language standard defines a virtual machine where all objects occupy an integral number of abstract storage units made up of some fixed number of bits (specified by the CHAR_BIT macro in limits.h). Each storage unit must be uniquely addressable. A storage unit is defined as the amount of storage occupied by a single character from the basic character set [1]. Thus, by definition, the size of the char type is 1.
Eventually, these abstract storage units have to be mapped onto physical hardware. Most common architectures use individually addressable 8-bit bytes, so char objects usually map to a single 8-bit byte.
Usually.
Historically, native byte sizes have been anywhere from 6 to 9 bits wide. In C, the char type must be at least 8 bits wide in order to represent all the characters in the basic character set, so to support a machine with 6-bit bytes, a compiler may have to map a char object onto two native machine bytes, with CHAR_BIT being 12. sizeof (char) is still 1, so types with size N will map to 2 * N native bytes.
[1] The basic character set consists of all 26 English letters in both upper- and lowercase, 10 digits, punctuation and other graphic characters, and control characters such as newlines, tabs, form feeds, etc., all of which fit comfortably into 8 bits.
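A quick way to see these values on one's own implementation (a minimal sketch using only standard headers, not part of the original answer):
#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* sizeof(char) is 1 by definition; CHAR_BIT gives the bits per byte. */
    printf("CHAR_BIT     = %d\n", CHAR_BIT);
    printf("sizeof(char) = %zu\n", sizeof(char));
    printf("sizeof(int)  = %zu (%zu bits)\n", sizeof(int), sizeof(int) * CHAR_BIT);
    return 0;
}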
You don't need more than a byte to represent the whole ASCII table (128 characters).
But there are other C types which have more room to contain data, like int (typically 4 bytes) or long double (often 12 or 16 bytes, depending on the platform).
All of these contain numerical values (even chars! even if they're represented as "letters", they're "numbers": you can compare them, add them...).
These are just different standard sizes, like cm and m for length.

Question regarding C argument promotions [closed]

Alright, I've been studying how to use looping to make my code more efficient, so that I can reuse a particular block of code without typing it over and over again. After attempting to use what I've learned so far to program something, I feel it's time to proceed to the next chapter and learn how to use control statements to instruct the program to make decisions.
But before I advance to it, I still have a few questions about the previous material that need an expert's help. They are about data types.
A. Character Type
I extracted the following from the book C Primer Plus, 5th ed.:
Somewhat oddly, C treats character constants as type int rather than char. For example, on an ASCII system with a 32-bit int and an 8-bit char, the code:
char grade = 'B';
represents 'B' as the numerical value 66 stored in a 32-bit unit, but grade winds up with 66 stored in an 8-bit unit. This characteristic of character constants makes it possible to define a character constant such as 'FATE', with four separate 8-bit ASCII codes stored in a 32-bit unit. However, attempting to assign such a character constant to a char variable results in only the last 8 bits being used, so the variable gets the value 'E'.
So the next thing I did after reading this was, of course, to follow what it mentions: I tried storing 'FATE' in a char variable, compiled it, and printed what was stored using printf(). But instead of getting the character 'E' printed out, what I get is 'F'.
Does this mean there's some mistake in the book? Or is there something I misunderstood?
From the above sentences, there's a line that says C treats character constants as type int. So to try it out, I assigned a number bigger than 255 (e.g. 356) to a char variable.
Since 356 is within the range of a 32-bit int (I'm running Windows 7), I expected it would print out 356 when I use the %d specifier.
But instead of printing 356, it gives me 100, which is the value of the last 8 bits.
Why does this happen? I thought char == int == 32 bits? (Although it does mention earlier that char is only a byte.)
B. Int and Floating Type
I understand that when a number stored in a variable of short type is passed to a variadic function or to a function with no prototype, it is automatically promoted to int.
This also happens to floating-point types: when a floating-point number of float type is passed, it is converted to double. That is why there's no specifier for float but only %f for double and %Lf for long double.
But why is there a specifier for short, which is also promoted, but not for float? Why don't they just give float a modifier like %hf or something? Is there anything logical or technical behind this?
A lot of questions in one question... Here are answers to a couple:
This characteristic of character constants makes it possible to define a character constant such as 'FATE', with four separate 8-bit ASCII codes stored in a 32-bit unit. However, attempting to assign such a character constant to a char variable results in only the last 8 bits being used, so the variable gets the value 'E'.
This is actually implementation defined behavior. So yes, there's a mistake in the book. Many books on C are written with the assumption that the only C compiler in the world is the one the author used when testing the examples.
The compiler the author used treated the characters in 'FATE' as the bytes of an integer, with 'F' being the most significant byte and 'E' the least significant. Your compiler treats the characters in the literal as the bytes of an integer with 'F' being the least significant byte and 'E' the most significant. For example, the first method is how MSVC treats the value, while MinGW (a GCC compiler targeting Windows) treats the literal in the second way.
As for there being no format specifier for printf() that expects a float, only specifiers that expect a double: this is because the values passed to printf() for formatting are part of the variable argument list (the ... in printf()'s prototype). There is no type information about these arguments, so, as you mentioned, the compiler must always promote them (from C99 6.5.2.2/6 "Function calls"):
If the expression that denotes the called function has a type that does not include a prototype, the integer promotions are performed on each argument, and arguments that have type float are promoted to double. These are called the default argument promotions.
And C99 6.5.2.2/7 "Function calls"
The ellipsis notation in a function prototype declarator causes argument type conversion to stop after the last declared parameter. The default argument promotions are performed on trailing arguments.
So in effect, it's impossible to pass a float to printf() - it will always be promoted to a double. That's why the format specifiers for floating point values expect a double.
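A short illustration of the effect (my own example, not part of the original answer): both arguments below reach printf() as double, so %f works for each.
#include <stdio.h>

int main(void)
{
    float f = 3.14f;
    double d = 2.718;

    /* f is promoted to double by the default argument promotions
       before printf() ever sees it. */
    printf("%f %f\n", f, d);
    return 0;
}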
Also due to the automatic promotion that would be applied to short, I'm honestly not sure if the h specifier for formatting a short is strictly necessary (though it is necessary for use with the n specifier if you want to get the count of characters written to the stream placed in a short). It might be in C because it needs to be there to support the n specifier, historical reasons, or something that I'm just not thinking of.
First, a char is by definition exactly 1 byte wide. Then the standard more or less says that the sizes should be:
sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long)
The exact sizes vary (except for char) by system and compiler, but on 32-bit Windows the sizes with GCC and VC are (AFAIK):
sizeof(short) == 2 (bytes)
sizeof(int) == sizeof(long) == 4 (bytes)
Your observation of 'F' versus 'E' in this case is a typical endianness issue (little vs. big endian, how a "word" is stored in memory).
Now what happens to your value? You have a variable that is 8 bits wide. You assign a bigger value ('FATE' or 356), but the compiler knows it is only allowed to store 8 bits, so it cuts off all the other bits.
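A rough way to see the byte ordering mentioned above on a given machine (my own sketch, assuming an 8-bit char and a 32-bit unsigned int) is to view an int through an unsigned char pointer:
#include <stdio.h>

int main(void)
{
    unsigned int v = 0x01020304;
    unsigned char *p = (unsigned char *)&v;   /* inspect the int byte by byte */
    unsigned i;

    /* A little-endian machine prints 04 03 02 01 (least significant byte first);
       a big-endian machine prints 01 02 03 04. */
    for (i = 0; i < sizeof v; i++)
        printf("%02x ", (unsigned)p[i]);
    printf("\n");
    return 0;
}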
To A:
3.) This is due to the different byte orderings of big- and little-endian CPU architectures. You get the first byte on a little-endian CPU (e.g. x86) and the last byte on a big-endian CPU (e.g. PPC). Actually, you always get the lowest 8 bits when the conversion from int to char is done, but the characters in the int are stored in reversed order.
7.) A char can only hold 8 bits, so everything else gets truncated the moment you assign the int to a char variable, and it can never be restored from the char variable later.
To B:
3.) You might sometimes want to print only the lowest 16 bits of an int variable regardless of what is in the upper half. It is not uncommon to pack multiple integer values in a single variable for certain optimizations. This works well for integer types but does not make much sense for floating-point types, which don't support bitwise operations directly; that might be the reason why there is no separate specifier for float in printf.
char is 1 byte long. The bit length of a byte can be 8, 16, or 32 bits. On general-purpose computers the bit length of a character is generally 8 bits. So the maximum number which a character can represent depends on its bit length. To check the bit length of a character, look at the limits.h header file; it is defined there as CHAR_BIT.
char x = 'FATE' will probably depend on the byte ordering in which the machine/compiler interprets 'FATE', so this depends on the system/compiler. Someone please confirm/correct this.
If your system has an 8-bit byte, then when you do c = 356 only the lower 8 bits of the binary representation of 356 will be stored in the variable, because char data is always allocated 1 byte of storage. So %d will print 100, because the upper bits were lost when you assigned the value to the variable, and what is left is only the lower 8 bits.
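For illustration, a minimal sketch (assigning an out-of-range value to a signed char is implementation-defined, but on a typical two's-complement system with an 8-bit char it behaves as described above):
#include <stdio.h>

int main(void)
{
    char c = 356;        /* 356 is 0x164; only the low 8 bits (0x64) are kept */
    printf("%d\n", c);   /* prints 100 on such a system */
    return 0;
}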

Can sizeof(int) ever be 1 on a hosted implementation?

My view is that a C implementation cannot satisfy the specification of certain stdio functions (particularly fputc/fgetc) if sizeof(int)==1, since the int needs to be able to hold any possible value of unsigned char or EOF (-1). Is this reasoning correct?
(Obviously sizeof(int) cannot be 1 if CHAR_BIT is 8, due to the minimum required range for int, so we're implicitly only talking about implementations with CHAR_BIT>=16, for instance DSPs, where typical implementations would be a freestanding implementation rather than a hosted implementation, and thus not required to provide stdio.)
Edit: After reading the answers and some linked references, here are some thoughts on ways it might be valid for a hosted implementation to have sizeof(int)==1:
First, some citations:
7.19.7.1(2-3):
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined).
If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, the end-of-file indicator for the stream is set and the fgetc function returns EOF. Otherwise, the fgetc function returns the next character from the input stream pointed to by stream. If a read error occurs, the error indicator for the stream is set and the fgetc function returns EOF.
7.19.8.1(2):
The fread function reads, into the array pointed to by ptr, up to nmemb elements whose size is specified by size, from the stream pointed to by stream. For each object, size calls are made to the fgetc function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object. The file position indicator for the stream (if defined) is advanced by the number of characters successfully read.
Thoughts:
Reading back unsigned char values outside the range of int could simply have implementation-defined behavior in the implementation. This is particularly unsettling, as it means that using fwrite and fread to store binary structures (which, while it results in nonportable files, is supposed to be an operation you can perform portably on any single implementation) could appear to work but silently fail. I can accept that an implementation might not have a usable filesystem, but it's a lot harder to accept that an implementation could have a filesystem that automatically invokes nasal demons as soon as you try to use it, with no way to determine that it's unusable. Now that I realize the behavior is implementation-defined and not undefined, it's not quite so unsettling, and I think this might be a valid (although undesirable) implementation.
An implementation with sizeof(int)==1 could simply define the filesystem to be empty and read-only. Then there would be no way an application could read any data written by itself, only from an input device on stdin, which could be implemented so as to only give positive char values that fit in int.
Edit (again): From the C99 Rationale, 7.4:
EOF is traditionally -1, but may be any negative integer, and hence distinguishable from any valid character code.
This seems to indicate that sizeof(int) may not be 1, or at least that such was the intention of the committee.
It is possible for an implementation to meet the interface requirements for fgetc and fputc even if sizeof(int) == 1.
The interface for fgetc says that it returns the character read as an unsigned char converted to an int. Nowhere does it say that this value cannot be EOF even though the expectation is clearly that valid reads "usually" return positive values. Of course, fgetc returns EOF on a read failure or end of stream but in these cases the file's error indicator or end-of-file indicator (respectively) is also set.
Similarly, nowhere does it say that you can't pass EOF to fputc so long as that happens to coincide with the value of an unsigned char converted to an int.
Obviously the programmer has to be very careful on such platforms. This might not do a full copy:
void Copy(FILE *out, FILE *in)
{
    int c;

    while ((c = fgetc(in)) != EOF)
        fputc(c, out);
}
Instead, you would have to do something like (not tested!):
void Copy(FILE *out, FILE *in)
{
    int c;

    while ((c = fgetc(in)) != EOF || (!feof(in) && !ferror(in)))
        fputc(c, out);
}
Of course, platforms where you will have real problems are those where sizeof(int) == 1 and the conversion from unsigned char to int is not an injection. I believe that this would necessarily be the case on platforms using sign-and-magnitude or ones' complement representation of signed integers.
I remember this exact same question on comp.lang.c some 10 or 15 years ago. Searching for it, I've found a more current discussion here:
http://groups.google.de/group/comp.lang.c/browse_thread/thread/9047fe9cc86e1c6a/cb362cbc90e017ac
I think there are two resulting facts:
(a) There can be implementations where strict conformance is not possible. E.g. sizeof(int)==1 with ones' complement or sign-magnitude negative values, or padding bits in the int type, i.e. not all unsigned char values can be converted to a valid int value.
(b) The typical idiom ((c=fgetc(in))!=EOF) is not portable (except for CHAR_BIT==8), as EOF is not required to be a separate value.
I don't believe the C standard directly requires that EOF be distinct from any value that could be read from a stream. At the same time, it does seem to take for granted that it will be. Some parts of the standard have conflicting requirements that I doubt can be met if EOF is a value that could be read from a stream.
For example, consider ungetc. On one hand, the specification says (§7.19.7.11):
The ungetc function pushes the character specified by c (converted to an unsigned char) back onto the input stream pointed to by stream. Pushed-back characters will be returned by subsequent reads on that stream in the reverse order of their pushing.
[ ... ]
One character of pushback is guaranteed.
On the other hand, it also says:
If the value of c equals that of the macro EOF, the operation fails and the input stream is unchanged.
So, if EOF is a value that could be read from the stream, and (for example) we do read from the stream, and immediately use ungetc to put EOF back into the stream, we get a conundrum: the call is "guaranteed" to succeed, but also explicitly required to fail.
Unless somebody can see a way to reconcile these requirements, I'm left with considerable doubt as to whether such an implementation can conform.
In case anybody cares, N1548 (the current draft of the new C standard) retains the same requirements.
Would it not be sufficient if a nominal char which shared a bit pattern with EOF were defined as nonsensical? If, for instance, CHAR_BIT were 16 but all the allowed values occupied only the 15 least significant bits (assume a two's complement or sign-magnitude int representation). Or must everything representable in a char have meaning as such? I confess I don't know.
Sure, that would be a weird beast, but we're letting our imaginations go here, right?
R.. has convinced me that this won't hold together. Because a hosted implementation must implement stdio.h and if fwrite is to be able to stick integers on the disk, then fgetc could return any bit pattern that would fit in a char, and that must not interfere with returning EOF. QED.
I think you are right. Such an implementation cannot distinguish a legitimate unsigned char value from EOF when using fgetc/fputc on binary streams.
If there are such implementations (this thread seems to suggest there are), they are not strictly conforming. It is possible to have a freestanding implementation with sizeof (int) == 1.
A freestanding implementation (C99 4) only needs to support the features from the standard library as specified in these headers: <float.h>, <iso646.h>, <limits.h>, <stdarg.h>, <stdbool.h>, <stddef.h>, and <stdint.h>. (Note: no <stdio.h>.) Freestanding might make more sense for a DSP or other embedded device anyway.
I'm not so familiar with C99, but I don't see anything that says fgetc must produce the full range of values of char. The obvious way to implement stdio on such a system would be to put 8 bits in each char, regardless of its capacity. The requirement of EOF is
EOF
which expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream
The situation is analogous to wchar_t and wint_t. In 7.24.1/2-3 defining wint_t and WEOF, footnote 278 says
wchar_t and wint_t can be the same integer type.
which would seem to guarantee that "soft" range checking is sufficient to guarantee that WEOF is not in the character set.
Edit:
This wouldn't allow binary streams, since in such a case fputc and fgetc are required to perform no transformation. (7.19.2/3) Binary streams are not optional; only their distinctness from text streams is optional. So it would appear that this renders such an implementation noncompliant. It would still be perfectly usable, though, as long as you don't attempt to write binary data outside the 8-bit range.
You are assuming that EOF cannot be an actual character in the character set.
If you allow this, then sizeof(int) == 1 is OK.
The TI C55x compiler I am using has a 16-bit char and a 16-bit int, and it does include a standard library. The library merely assumes an eight-bit character set, so a char with a value > 255 is not defined when interpreted as a character; and when writing to an 8-bit stream device, the most significant 8 bits are discarded: for example, when written to the UART, only the lower 8 bits are transferred to the shift register and output.
