C compiler flag to ignore sign - c

I am currently dealing with code purchased from a third party contractor. One struct has an unsigned char field while the function that they are passing that field to requires a signed char. The compiler does not like this, as it considers them to be mismatched types. However, it apparently compiles for that contractor. Some Googling has told me that "[i]t is implementation-defined whether a char object can hold negative values". Could the contractor's compiler basically ignore the signed/unsigned type and treat them the same? Or is there a compiler flag that will treat them the same?
C is not my strongest language--just look at my tags on my user page--so any help would be much appreciated.

Actually char, signed char and unsigned char are three different types. From the standard (ISO/IEC 9899:1990):
6.1.2.5 Types
...
The three types char, signed char and unsigned char are collectively called the character types.
(In C++, for instance, the three are distinct for overloading purposes, so you have to, or at least should, provide all three overloads if you take a char argument.)
Plain char might be treated as signed or unsigned by the compiler, but the standard says (also in 6.1.2.5):
An object declared as type char is large enough to store any member of the basic execution character set. If a member of the required source character set in 5.2.1 is stored in a char object, its value is guaranteed to be positive. If other quantities are stored in a char object, the behavior is implementation-defined: the values are treated as either signed or nonnegative integers.
and
An object declared as type signed char occupies the same amount of storage as a ''plain'' char object.
The characters referred to in 5.2.1 are A-Z, a-z, 0-9, space, tab, newline and the following 29 graphic characters:
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
All of that I interpret to mean, basically, that ASCII characters with a value less than 128 are guaranteed to be positive. So if the values stored are always less than 128, it should be safe (from a value-preserving perspective), although it is not good practice.

This is compiler-dependent. For example, Visual C++ has a compiler option (/J) that makes plain char unsigned by default, and it defines the corresponding _CHAR_UNSIGNED macro when that option is used.
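A portable way to check which choice an implementation made is to look at CHAR_MIN from <limits.h>; a minimal sketch:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* CHAR_MIN is 0 when plain char is unsigned, negative when it is signed. */
#if CHAR_MIN == 0
    puts("plain char is unsigned on this implementation");
#else
    puts("plain char is signed on this implementation");
#endif
    return 0;
}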

I take it that you're talking about fields of type signed char and unsigned char, so they're explicitly mismatched. If one of them were simply char, it might match in whatever compiler the contractor is using (IIRC, it's implementation-defined whether char is signed or unsigned), but not in yours. In that case, you might be able to get by with a command-line option or something to change yours.
Alternatively, the contractor might be using a compiler, or compiler options, that allow him to compile while ignoring errors or warnings. Do you know what sort of compilation environment he has?
In any case, this is not good C. If one of the types is just char, it relies on implementation-defined behavior, and therefore isn't portable. If not, it's flat wrong. I'd take this up with the contractor.
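If the types really are explicitly mismatched, an explicit cast at the call site at least documents the conversion instead of relying on the contractor's compiler being permissive. A minimal sketch with hypothetical names (struct packet, process) standing in for the third-party code:

struct packet {
    unsigned char flags;            /* the third-party struct field */
};

void process(signed char value);    /* the third-party function prototype */

void use(struct packet *p)
{
    /* The cast documents the signedness change; values above 127 change
       meaning, so this is only reasonable if the field stays below 128. */
    process((signed char)p->flags);
}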

Related

Why is char different from *both* signed char and unsigned char?

cppreference.com states that char is
Equivalent to either signed char or unsigned char [...], but char is a distinct type, different from both signed char and unsigned char
I assume this means that a char can hold exactly the same values as either unsigned char or signed char, but is not compatible with either. Why was it decided to work this way? Why does unqualified char not denote a char of the platform-appropriate signedness, like with the other integer types, where int denotes exactly the same type as signed int?
The three C character types char, signed char, and unsigned char exist as codification of legacy C implementations and usage.
The X3J11 committee that codified C into the first C standard (now known as C89) stated their purpose in the Rationale (italics original):
1.1 Purpose
The Committee's overall goal was to develop a clear, consistent, and unambiguous Standard for the C programming language which codifies the common, existing definition of C and which promotes the portability of user programs across C language environments.
The X3J11 charter clearly mandates the Committee to codify common existing practice. ...
N.B.: the X3J11 committee went out of their way to emphasize they were codifying existing implementations of C and common usage/practices in order to promote portability.
In other words, "standard" C was never created - existing C code, usages, and practices were codified.
Per 3.1.2.5 Types of that same Rationale (bolding mine):
Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice. The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned. ...
The words of the committee are clear: three types of char exist because plain char had to be either signed or unsigned in order to match "prior practice". Plain char therefore had to be separate - portable code could not rely on plain char being signed or unsigned, but both signed char and unsigned char had to be available.
The three character types cannot be compatible in any way because of portability concerns, and portability of standard-conforming C code was one of the X3J11 committee's main goals.
If extern char buffer[10] were compatible with unsigned char buffer[10] on a system where plain char is unsigned, the code would behave differently if the code were compiled* on a system where plain char is signed and therefore incompatible with unsigned char buffer[10]. For example, bit shifting elements of buffer would change behavior depending on whether or not buffer were accessed through the extern char buffer[10] declaration or the unsigned char buffer[10]; definition, breaking portability.
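A deliberately ill-formed sketch of the mismatch described above, collapsed into a single translation unit so a compiler will flag it directly:

/* These two declarations name incompatible types even on an implementation
   where plain char happens to be unsigned, so the compiler rejects (or at
   least diagnoses) the redeclaration. */
unsigned char buffer[10];
extern char buffer[10];   /* error: conflicting types for 'buffer' */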
The fact that char could already be signed or unsigned, with different behavior in such a situation, already existed, and the committee could not change that without violating their goal to "codif[y] the common, existing definition of C".
But with a goal of promoting portability, there was no reason whatsoever to create a crazed, portability-nightmare-inducing situation where "sometimes char is compatible with this and not that, and sometimes char is compatible with that and not this".
* - If the code compiled at all - but this is a hypothetical meant to demonstrate why the three char types must be incompatible.
TL;DR
Backwards compatibility. Probably. Or possibly that they had to choose and didn't care. But I have no certain answer.
Long version
Intro
Just like OP, I'd prefer a certain answer from a reliable source. In the absence of that, qualified guesses and speculations are better than nothing.
Very many things in C come from backwards compatibility. When it was decided that whether char is the same as signed char or unsigned char is implementation-defined, there was already a lot of C code out there, some of which used signed chars and some of which used unsigned chars. Forcing it to be one or the other would certainly have broken some code.
Why it (probably) does not matter
Why does unqualified char not denote a char of the platform-appropriate signedness
It does not matter much. An implementation that uses signed chars guarantees that CHAR_MIN is equal to SCHAR_MIN and that CHAR_MAX is equal to SCHAR_MAX. The same goes for unsigned. So an unqualified char will always have exactly the same range as its qualified counterpart.
From the standard 5.2.4.2.1p2:
If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX.
This points us in the direction that they just didn't really care, or that it "feels safer".
Another interesting mention in the C standard is this:
All enumerations have an underlying type. The underlying type can be explicitly specified using an enum type specifier and is its fixed underlying type. If it is not explicitly specified, the underlying type is the enumeration’s compatible type, which is either a signed or unsigned integer type (excluding the bit-precise integer types), or char.
Possible problems with breaking this (speculation)
I'm trying to come up with a scenario where this would actually matter. One that could possibly cause issues is if you compile a source file to a shared library with one compiler using signed char and then use that library in a source file compiled with another compiler using unsigned char.
And even if that would not cause problems, imagine that the shared library is compiled with a pre-ansi compiler. Well, I cannot say for certain that this would cause problems either. But I can imagine that it could.
And another speculation from Steve Summit in comment section:
I'm speculating, but: if the Standard had required, in Eric's phrasing, "char is the same type as an implementation-defined choice of signed char or unsigned char", then if I'm on a platform on which char is the same as signed char, I can intermix the two with no warnings, and create code that's not portable to a machine where char is unsigned by default. So the definition "char is a distinct type from signed char and unsigned char" helps force people to write portable code.
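To illustrate that last point: because char and signed char are distinct types, passing a char array to a function that expects signed char * draws a diagnostic even where plain char happens to be signed. A minimal sketch (takes_signed is a made-up name):

#include <stdio.h>

static void takes_signed(signed char *p)
{
    printf("%d\n", (int)p[0]);
}

int main(void)
{
    char buf[4] = "abc";
    /* char * and signed char * are incompatible pointer types, so most
       compilers diagnose this call regardless of the platform's choice of
       signedness for plain char. */
    takes_signed(buf);
    return 0;
}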
Backwards compatibility is a sacred feature
But remember that the persons behind the C standard were and are VERY concerned about not breaking backwards compatibility. Even to the point that they don't want to change the signature of some library functions to return const values because it would yield warnings. Not errors. Warnings! Warnings that you can easily disable. Instead, they just wrote in the standard that it's undefined behavior to modify the values. You can read more about that here: https://thephd.dev/your-c-compiler-and-standard-library-will-not-help-you
So whenever you encounter very strange design choices in the C standard, it's a very good bet that backwards compatibility is the reason. That's the reason why you can initialize a pointer to NULL with just 0, even on a machine where the null pointer is not address zero. And why bool is a macro for the keyword _Bool.
It's also the reason why bitwise | and & have lower precedence than ==: there was a lot of existing source code (several hundred kilobytes, installed on three (3) machines :) ) containing things like if (a==b & c==d), which relies on that precedence. Dennis Ritchie admitted that he should have changed it. https://www.lysator.liu.se/c/dmr-on-or.html
So we can at least say for certain that there are design choices made with backwards compatibility in mind that have later been admitted, by those who made them, to be mistakes, and that we have reliable sources for that.
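To make the precedence point concrete, a small sketch (the variable names are made up):

#include <stdio.h>

int main(void)
{
    int a = 1, b = 1, c = 2, d = 2;

    /* & binds more loosely than ==, so this parses as (a==b) & (c==d),
       which is what the pre-&& code relied on. */
    if (a == b & c == d)
        puts("both pairs equal");

    /* The same rule backfires here: this parses as c & (2 == 2), i.e.
       c & 1, which is 0 for c == 2, even though (c & 2) == 2 holds. */
    if (c & 2 == 2)
        puts("never printed for c == 2");

    return 0;
}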
C++
And also remember that your source points to a C++ reference. In that language, there are reasons that don't apply to C, like overloading.
One part of the reasoning for not mandating either signed or unsigned for plain char is the EBCDIC code set used on IBM mainframes in particular.
In §6.2.5 Types ¶3, the C standard says:
An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.
Emphasis added.
Now, in EBCDIC, the lower-case letters have the code points 0x81-0x89, 0x91-0x99, 0xA2-0xA9; the upper-case letters have the code points 0xC1-0xC9, 0xD1-0xD9, 0xE2-0xE9; and the digits have the code points 0xF0-0xF9. So:
The alphabets are not contiguous.
Lower-case letters sort before upper-case letters.
Digits sort higher than letters.
And because of §6.2.5¶3, plain char has to be an unsigned type on systems that use EBCDIC as the execution character set.
Each of the first three points is in contradistinction to ASCII (and ISO 8859, and ISO 10646 aka Unicode).
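A small sketch of why that follows (illustrative only; the 129 in the comment assumes an 8-bit EBCDIC machine, the 97 assumes ASCII):

#include <stdio.h>

int main(void)
{
    /* On an EBCDIC system 'a' is code point 0x81 (129). If plain char
       were a signed 8-bit type there, the stored value would be negative,
       contradicting the "nonnegative" guarantee quoted above, so plain
       char must be unsigned on such systems. */
    char c = 'a';
    printf("%d\n", c);   /* 129 on EBCDIC, 97 on ASCII */
    return 0;
}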
The reason is backwards compatibility. Here is some research regarding the history behind it. It only uses authoritative first sources like the publications by Dennis M. Ritchie (the creator of C) or ISO.
In the beginning, there was only int and char. The early draft of C called "NB" for "new B" included these new types not present in the predecessors B and BCPL [Ritchie, 93]:
...it seemed that a typing scheme was necessary to cope with characters and byte addressing, and to prepare for the coming floating-point hardware
Embryonic C
NB existed so briefly that no full description of it was written. It supplied the types int and char, arrays of them, and pointers to them, declared in a style typified by
int i, j;
char c, d;
unsigned was added later [Ritchie, 93]:
During 1973-1980, the language grew a bit: the type structure gained unsigned, long...
Note that this refers to the stand-alone "type qualifier" unsigned at this point, equivalent to unsigned int.
Around this time in 1978, The C Programming Language 1st edition was published [Kernighan, 78] and in chapter 2.7 mentions type conversion problems related to char:
There is one subtle point about the conversion of characters to integers. The language does not specify whether variables of type char are signed or unsigned quantities. When a char is converted to an int, can it ever produce a negative integer? Unfortunately, this varies from machine to machine, reflecting differences in architecture. On some machines (PDP-11, for instance), a char whose leftmost bit is 1 will be converted to a negative integer (“sign extension”). On others, a char is promoted to an int by adding zeros at the left end, and thus is always positive.
At this point, the type promotion to int was what was described as problematic, not the signedness of char, which wasn't even specified. The above text remains mostly unchanged in the 2nd edition [Kernighan, 88].
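The machine dependence K&R describe is easy to demonstrate; a minimal sketch, with the two possible outcomes noted in the comments:

#include <stdio.h>

int main(void)
{
    char c = (char)0xFF;   /* a byte with the leftmost bit set */
    int  i = c;            /* sign-extends to -1 on machines where plain char
                              is signed, zero-extends to 255 where it is unsigned */
    printf("%d\n", i);
    return 0;
}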
However, the types themselves are described differently between editions. In the 1st edition [Kernighan, 78, 2.2], unsigned could only be applied to int and was regarded as a qualifier:
In addition, there are a number of qualifiers which can be applied to int’s: short, long, and unsigned.
Whereas the 2nd edition is in line with standard C [Kernighan, 88, 2.2]:
The qualifier signed or unsigned may be applied to char or any integer. /--/ Whether plain chars are signed or unsigned is machine-dependent, but printable characters are always positive.
So in between the 1st and 2nd edition, they had discovered a backwards compatibility problem with applying the new unsigned/signed (now called type specifiers and not qualifiers [ANSI/ISO, 90]) to the char type, with the same concerns as were already identified regarding type conversions back in the 1st edition.
This compatibility concern remained during standardization in the late 80s. We can read this from the various rationales such as [ISO, 98, 6.1.2.5 §30]
Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice. The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned. For reasons of symmetry, the keyword signed is allowed as part of the type name of other integral types. Two varieties of the integral types are specified: signed and unsigned. If neither specifier is used, signed is assumed. In the Base Document the only unsigned type is unsigned int.
This actually suggests that signed int was allowed to make int more symmetric with char, rather than the other way around.
Sources:
[ANSI/ISO, 90] ANSI/ISO 9899:1990 - Programming Languages - C
[ISO, 98] Rationale for International Standard - Programming Language - C, WG14/N802 J11/98-001
[Kernighan, 78] Kernighan, Brian W., Ritchie, Dennis M. - The C Programming Language, 1st edition (1978)
[Kernighan, 88] Kernighan, Brian W., Ritchie, Dennis M. - The C Programming Language, 2nd edition (1988)
[Ritchie, 93] Ritchie, Dennis M. - The Development of the C Language (1993)
The line you quote does not actually come from the C standard at all, but rather from the C++ standard. The website you link to (cppreference.com) is primarily about C++, and the C material there is something of an afterthought.
The reason this is important for C++ (and not really for C) is that C++ allows overloading based on types, but you can only overload on distinct types. The fact that char must be distinct from both signed char and unsigned char means you can safely overload all three:
// 3 overloads for fn
void fn(char);
void fn(signed char);
void fn(unsigned char);
and you will not get an error about ambiguous overloading or such.

C Language: Why int variable can store char?

I am currently reading The C Programming Language by Kernighan.
There is an example that defines a variable as int but uses getchar() to store into it:
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing that I can think of is ASCII and Unicode.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (like when char is an unsigned type).
Also, in many places where one uses a char variable, it will silently be promoted to int anyway. And that includes constant character literals like 'A'.
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
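The usual pattern therefore keeps the result in an int until EOF has been ruled out; a minimal sketch:

#include <stdio.h>

int main(void)
{
    int c;                          /* int, not char, so EOF stays distinguishable */

    while ((c = getchar()) != EOF)  /* read until end of file or error */
        putchar(c);                 /* here c holds an unsigned char value */

    return 0;
}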
Storing an int to a char has implementation defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem, which should have mandated the char type to be unsigned, but the C Standard allowed for many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
getchar is an old C standard function, and the philosophy back then was closer to how the language translates to assembly than to type correctness and readability. Keep in mind that compilers did not optimize code as much as they do today. In C, int is the default return type (i.e. if you don't have a declaration of a function, older compilers assume that it returns int), and a return value is passed in a register, so returning a char instead of an int would actually generate additional implicit code to mask out the extra bytes of the value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see the <limits.h> header of your implementation.
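For example, printing a few of those macros shows the ranges on the implementation at hand (the macro names are standard; the numbers are implementation-defined):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    printf("CHAR_BIT = %d\n", CHAR_BIT);
    printf("CHAR_MIN = %d\n", CHAR_MIN);
    printf("CHAR_MAX = %d\n", CHAR_MAX);
    printf("INT_MIN  = %d\n", INT_MIN);
    printf("INT_MAX  = %d\n", INT_MAX);
    return 0;
}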
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (which hardware can work with); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number 1. (For historical reasons, lots of control characters came first, so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining upper-case letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
You can do arithmetic on a char; the value is simply promoted to int first, so a letter can be treated as a number directly. So, to convert all lower-case letters to upper-case, you can do something like:
char letter;
/* ... */
if (letter >= 'a' && letter <= 'z') {   /* lower-case letter, assuming ASCII */
    letter = letter - 32;               /* 'a' - 'A' == 32 in ASCII */
}
No cast is actually required for the arithmetic; the char operand is promoted to int automatically, though compilers with conversion warnings enabled may complain when the int result is assigned back to a char.
But, in the end, the type char is really just a small integer type, since ASCII assigns a unique integer to each letter.

char to int in C with struct values

I am trying to compare two chars with each other but treating them as integers. These are struct values in a linked list. I have printed out temp->next->variable and temp->variable and confirmed that the if statement should hold, ex: 3 > 2. But I'm thinking it might not work because they are char.
Will the fact that they are char values have an impact on the comparison?
if(temp->next->variable > temp->variable)
{
....
}
Chars are integers, usually 8 bits wide, so if you're always treating the chars as ints, this will have exactly the behavior you desire. Be wary: some compilers make plain char unsigned by default and others make it signed, which may create issues seemingly at random. (If both values have the same high-order bit it won't make a difference; otherwise the result of the comparison can be the opposite of what you expect, if the signedness is also the opposite of what you expect.)
If you're treating them as characters, then this will give you their lexicographical comparison, which is based on the internal integer representation of each character. It may be good to check what your locale is, i.e. whether your program is working with ASCII, an 8-bit Unicode encoding, or anything else for its character lookup tables.
If you still get hidden issues, a common mistake is having many layered pointers, and even though the arrow [->] is used throughout, you may still need to apply a de-reference [*], or else you'll secretly be testing their relative locations in memory.
According to the C Standard (6.5.8 Relational operators)
3 If both of the operands have arithmetic type, the usual arithmetic conversions are performed.
The usual arithmetic conversions include the integer promotions, which in particular mean that operands of type char are converted to int.
Take into account that type char can behave like either unsigned char or signed char, so you can get different results from the comparison when the sign bit of a char object is set.
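A small sketch of that last point; the same byte pattern orders differently depending on the implementation's choice for plain char:

#include <stdio.h>

int main(void)
{
    char a = (char)0x80;   /* -128 if plain char is signed, 128 if unsigned */
    char b = 0x10;         /* 16 either way */

    if (a > b)             /* both operands are promoted to int first */
        puts("a > b: plain char is unsigned here");
    else
        puts("a <= b: plain char is signed here");
    return 0;
}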

Why are 4 characters allowed in a char variable? [duplicate]

This question already has answers here:
How to determine the result of assigning multi-character char constant to a char variable?
(5 answers)
I have the following code in my program:
char ch='abcd';
printf("%c",ch);
The output is d.
I fail to understand why a char variable is allowed to take in 4 characters in its declaration without giving a compile-time error.
Note: More than 4 characters is giving an error.
'abcd' is called a multicharacter constant and has an implementation-defined value; here your compiler gives you 'd'.
If you use gcc and compile your code with -Wmultichar or -Wall, gcc will warn you about this.
I fail to understand why a char variable is allowed to take in 4 characters in its declaration without giving a compile-time error.
It's not packing 4 characters into one char. The multi-character constant 'abcd' has type int, and the compiler then converts that int constant to char (which overflows in this case).
Assuming you know that you are using a multi-character constant, and what it is:
I don't use VS these days, but my take on it is that the 4-character constant is packed into an int and then converted down to a char. That is why it is allowed. Since the packing order of a multi-character constant into an integer type is implementation-defined, it can behave the way you observe.
Because multi-character constants are meant to fill integer types, you could try an 8-character constant. I am not sure whether the VS compiler supports it, but there is a good chance it does, because that would fit into a 64-bit long type.
It probably should give a warning about trying to fit a literal value too big for the type. It's kind of like unsigned char leet = 1337;. I am not sure, however, how this works in VS (whether it gives a warning or an error).
4 characters are not being put into a char variable, but into an int character constant which is then assigned to a char.
3 parts of the C standard (C11dr §6.4.4.4) may help:
"An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'."
"An integer character constant has type int."
"The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined."
OP's code of char ch='abcd'; initializes a char from an int, as 'abcd' is an int. Just like char ch='Z';, where ch gets the int value of 'Z'. In that case there is no surprise, as the value of 'Z' fits nicely in a char. In the 'abcd' case, the value does not fit in a char, so some information is lost. Various outcomes are possible. Typically, on a platform of one endianness ch will end up with the value of 'a', and on another with the value of 'd'.
The 'abcd' is an int value, much like 12345 in int x = 12345;.
When sizeof(int) == 4, an int may be assigned a character constant such as 'abcd'.
When sizeof(int) != 4, the limit changes. So with an 8-char int, int x = 'abcdefgh'; is possible, etc.
Given that an int is only guaranteed to have a minimum range -32767 to 32767, anything beyond 2 is non-portable.
The endianness of int means that even int x = 'ab'; presents portability concerns.
Character constants like 'abcd' are typically used incorrectly, and thus many compilers have a warning for this uncommon C construct that is good to enable.
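A small sketch that makes the int-then-convert sequence visible (the 0x61626364 packing mentioned in the comment is what gcc typically produces, not a guarantee):

#include <stdio.h>

int main(void)
{
    int  v  = 'abcd';   /* type int, implementation-defined value
                           (gcc typically packs it as 0x61626364) */
    char ch = 'abcd';   /* that int value is then converted to char,
                           keeping only part of it */
    printf("v = %#x, ch = %c\n", (unsigned)v, ch);
    return 0;
}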

How is {int i=999; char c=i;} different from {char c=999;}?

My friend says he read on some page on SO that they are different, but how could the two possibly be different?
Case 1
int i=999;
char c=i;
Case 2
char c=999;
In the first case, we are initializing the integer i to 999, then initializing c with i, which is in fact 999. In the second case, we initialize c directly with 999. The truncation and loss of information aside, how on earth are these two cases different?
EDIT
Here's the link that I was talking of
why no overflow warning when converting int to char
One member commenting there says: "It's not the same thing. The first is an assignment, the second is an initialization."
So isn't it a lot more than only a question of optimization by the compiler?
They have the same semantics.
The constant 999 is of type int.
int i=999;
char c=i;
i is created as an object of type int and initialized with the int value 999, with the obvious semantics.
c is created as an object of type char, and initialized with the value of i, which happens to be 999. That value is implicitly converted from int to char.
The signedness of plain char is implementation-defined.
If plain char is an unsigned type, the result of the conversion is well defined. The value is reduced modulo CHAR_MAX+1. For a typical implementation with 8-bit bytes (CHAR_BIT==8), CHAR_MAX+1 will be 256, and the value stored will be 999 % 256, or 231.
If plain char is a signed type, and 999 exceeds CHAR_MAX, the conversion yields an implementation-defined result (or, starting with C99, raises an implementation-defined signal, but I know of no implementations that do that). Typically, for a 2's-complement system with CHAR_BIT==8, the result will be -25.
char c=999;
c is created as an object of type char. Its initial value is the int value 999 converted to char -- by exactly the same rules I described above.
If CHAR_MAX >= 999 (which can happen only if CHAR_BIT, the number of bits in a byte, is at least 10), then the conversion is trivial. There are C implementations for DSPs (digital signal processors) with CHAR_BIT set to, for example, 32. It's not something you're likely to run across on most systems.
You may be more likely to get a warning in the second case, since it's converting a constant expression; in the first case, the compiler might not keep track of the expected value of i. But a sufficiently clever compiler could warn about both, and a sufficiently naive (but still fully conforming) compiler could warn about neither.
As I said above, the result of converting a value to a signed type, when the source value doesn't fit in the target type, is implementation-defined. I suppose it's conceivable that an implementation could define different rules for constant and non-constant expressions. That would be a perverse choice, though; I'm not sure even the DS9K does that.
As for the referenced comment "The first is an assignment, the second is an initialization", that's incorrect. Both are initializations; there is no assignment in either code snippet. There is a difference in that one is an initialization with a constant value, and the other is not. Which implies, incidentally, that the second snippet could appear at file scope, outside any function, while the first could not.
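A minimal side-by-side sketch of the two snippets; the stored value is the int 999 converted to char in both cases (231 if plain char is unsigned, typically -25 on a signed-char, two's-complement system):

#include <stdio.h>

int main(void)
{
    int  i  = 999;
    char c1 = i;      /* case 1: the conversion happens from a variable */
    char c2 = 999;    /* case 2: the same conversion, from a constant, so a
                         compiler is more likely to warn about the overflow */
    printf("c1 = %d, c2 = %d\n", c1, c2);
    return 0;
}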
Any optimizing compiler will just make the int i = 999 local variable disappear and assign the truncated value directly to c in both cases. (Assuming that you are not using i anywhere else)
It depends on your compiler and optimization settings. Take a look at the actual assembly listing to see how different they are. For GCC and reasonable optimizations, the two blocks of code are probably equivalent.
Aside from the fact that the first also defines an object i of type int, the semantics are identical.
i, which is in fact 999
No, i is a variable. Semantically, it doesn't have a value at the point of the initialization of c; the value won't be known until runtime (even though we can clearly see what it will be, and so can an optimizing compiler). But in case 2 you're assigning 999 to a char, which doesn't fit, so the compiler issues a warning.

Resources