Is plain char usually/always unsigned on non-twos-complement systems? - c

Obviously the standard says nothing about this, but I'm interested more from a practical/historical standpoint: did systems with non-twos-complement arithmetic use a plain char type that's unsigned? Otherwise you have potentially all sorts of weirdness, like two representations for the null terminator, and the inability to represent all "byte" values in char. Do/did systems this weird really exist?

The null character used to terminate strings could never have two representations. It's defined like so (even in C90):
A byte with all bits set to 0, called the null character, shall exist in the basic execution character set
So a 'negative zero' on a ones'-complement machine wouldn't do.
That said, I really don't know much of anything about non-two's complement C implementations. I used a one's-complement machine way back when in university, but don't remember much about it (and even if I cared about the standard back then, it was before it existed).
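For anyone who wants to see what their own implementation does, here is a minimal sketch. The function names are made up for illustration. It asks <limits.h> whether plain char is signed, and checks that the stored byte of the null character has every bit clear.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Returns 1 if plain char is a signed type on this implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Returns 1 if the stored byte of '\0' has all bits set to 0. */
int null_char_is_all_bits_zero(void)
{
    char c = '\0';
    unsigned char raw;
    assert(sizeof c == sizeof raw);   /* both are exactly one byte */
    memcpy(&raw, &c, sizeof raw);     /* inspect the representation, not the value */
    return raw == 0;
}
```

On a mainstream two's-complement platform the second function is expected to return 1; the first varies by ABI.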

It's true, for the first 10 or 20 years of commercially produced computers (the 1950's and 60's) there were, apparently, some disagreements on how to represent negative numbers in binary. There were actually three contenders:
Two's complement, which not only won the war but also drove the others to extinction
One's complement, -x == ~x
Sign-magnitude, -x == x ^ 0x80000000 (for a 32-bit word)
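To make the three encodings concrete, here is a small sketch that builds the bit pattern for -x in each of them. It uses 8-bit patterns rather than the 32-bit word above purely to keep the numbers short, and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Each function returns the 8-bit pattern that encodes -x,
   given a magnitude x in the range 0..127. */

uint8_t neg_twos_complement(uint8_t x)
{
    assert(x <= 127);
    return (uint8_t)(~x + 1u);     /* invert all bits, then add one */
}

uint8_t neg_ones_complement(uint8_t x)
{
    assert(x <= 127);
    return (uint8_t)~x;            /* invert all bits */
}

uint8_t neg_sign_magnitude(uint8_t x)
{
    assert(x <= 127);
    return (uint8_t)(x ^ 0x80u);   /* flip only the sign bit */
}
```

For x = 5 these give 0xFB, 0xFA and 0x85 respectively. For x = 0 only two's complement gives back 0x00; the other two produce a distinct "negative zero" pattern (0xFF and 0x80).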
I think the last important ones-complement machine was probably the CDC-6600, at the time, the fastest machine on earth and the immediate predecessor of the first supercomputer. [1]
Unfortunately, your question cannot really be answered, not because no one here knows the answer :-) but because the choice never had to be made. And this was for actually two reasons:
1. Two's complement took over simultaneously with byte machines. Byte addressing hit the world with the twos-complement IBM System/360. Previous machines had no bytes, only complete words had addresses. Sometimes programmers would pack characters inside these words and sometimes they would just use the whole word. (Word length varied from 12 bits to 60.)
2. C was not invented until a decade after the byte machines and two's complement transition. Item #1 happened in the 1960's, C first appeared on small machines in the 1970's and did not take over the world until the 1980's.
So there simply never was a time when a machine had signed bytes, a C compiler, and something other than a twos-complement data format. The idea of null-terminated strings was probably a repeatedly-invented design pattern thought up by one assembly language programmer after another, but I don't know that it was specified by a compiler until the C era.
In any case, the first actually standardized C ("C89") simply specifies "a byte or code of value zero is appended" and it is clear from the context that they were trying to be number-format independent. So, "+0" is a theoretical answer, but it may never really have existed in practice.
1. The 6600 was one of the most important machines historically, and not just because it was fast. Designed by Seymour Cray himself, it introduced out-of-order execution and various other elements later collectively called "RISC". Although others tried to claim credit, Seymour Cray is the real inventor of the RISC architecture. There is no dispute that he invented the supercomputer. It's actually hard to name a past "supercomputer" that he didn't design.

I believe it would be almost but not quite possible for a system to have a one's-complement 'char' type, but there are four problems which cannot all be resolved:
1. Every data type must be representable as a sequence of char, such that if all the char values comprising two objects compare identical, the data objects in question will be identical.
2. Every data type must likewise be representable as a sequence of 'unsigned char'.
3. The unsigned char values into which any data type can be decomposed must form a group whose order is a power of two.
4. I don't believe the standard permits a one's-complement machine to special-case the value that would be negative zero and make it behave as something else.
It might be possible to have a standards-compliant machine with a one's-complement or sign-magnitude "char" type if the only way to get a negative zero would be by overlaying some other data type, and if negative zero compared unequal to positive zero. I'm not sure if that could be standards-compliant or not.
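The "sequence of unsigned char" requirement is the one most code quietly relies on, so here is a rough sketch of it in action. The helper names are hypothetical, and the byte comparison assumes int has no padding bits, which holds on common platforms but is not promised by the standard.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if two ints have byte-for-byte identical representations. */
int same_bytes(int a, int b)
{
    unsigned char ra[sizeof a], rb[sizeof b];
    memcpy(ra, &a, sizeof a);
    memcpy(rb, &b, sizeof b);
    return memcmp(ra, rb, sizeof ra) == 0;
}

/* Copies an int out to unsigned char and back; the value must survive. */
int round_trip(int v)
{
    unsigned char raw[sizeof v];
    int back;
    assert(sizeof raw == sizeof back);
    memcpy(raw, &v, sizeof v);
    memcpy(&back, raw, sizeof back);
    return back;
}
```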
EDIT
BTW, if requirement #2 were relaxed, I wonder what the exact requirements would be when overlaying other data types onto 'char'? Among other things, while the standard makes it abundantly clear that one must be able to perform assignments and comparisons on any 'char' values that may result from overlaying another variable onto a 'char', I don't know that it imposes any requirement that all such values must behave as an arithmetic group. For example, I wonder what the legality would be of a machine in which every memory location was physically stored as 66 bits, with the top two bits indicating whether the value was a 64-bit integer, a 32-bit memory handle plus a 32-bit offset, or a 64-bit double-precision floating-point number? Since the standard allows implementations to do anything they like when an arithmetic computation exceeds the range of a signed type, that would suggest that signed types do not necessarily have to behave as a group.
For most signed types, there's no requirement that the type be unable to represent any numbers outside the range specified in limits.h; if limits.h specifies that the minimum "int" is -32767, then it would be perfectly legitimate for an implementation to in fact allow a value of -32768 since any program that tried to do so would invoke Undefined Behavior. The key question would probably be whether it would be legitimate for a 'char' value resulting from the overlay of some other type to yield a value outside the range specified in limits.h. I wonder what the standard says?

Related

What does the following mean in context of programming, specifically C programming language?

representations of values on a computer can vary “culturally” from architecture to architecture or are determined by the type the programmer gave to the value. Therefore, we should try to reason primarily about values and not about representations if we want to write portable code.
Specifying values. We have already seen several ways in which numerical constants (literals) can be specified:
123 Decimal integer constant.
077 Octal integer constant.
0xFFFF Hexadecimal integer constant.
et cetera
Question: Are decimal integer constants and hexadecimal integer constants, different ways to 'represent' values or are they values themselves? If the latter what are different ways to represent them on different architectures?
The source of the aforementioned is the book "Modern C" by Jens Gustedt which is freely available online, specifically from page no. 38 to page no. 46.
The word "representation" can be used here in two different contexts.
One is when we (the programmers) specify e.g. integer constants. For example, the value 37 may be represented in the C code as 37 or 0x25 or 045. Regardless of which representation we have chosen, the C compiler will interpret this into the same value when generating the binary code. Hence, these statements all generate the same code:
int a = 37;
int a = 0x25;
int a = 045;
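A compact way to convince yourself of this, as a sketch: give the three spellings different names (the names DEC, HEX and OCT are mine, not from the book) and let the compiler compare them.

```c
#include <assert.h>

/* Three spellings of one value; after translation they are indistinguishable. */
enum { DEC = 37, HEX = 0x25, OCT = 045 };

/* Checked at compile time (C11): the build fails if the spellings differ. */
_Static_assert(DEC == HEX && HEX == OCT, "same value, three spellings");

int all_spellings_equal(void)
{
    return DEC == HEX && HEX == OCT;
}
```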
Another context is how the compiler chooses to store the value 37 internally. The C standard states a few requirements (e.g. that the representation of int must at least be able to represent values in the range -32767 to +32767). Within the rules of the C standard the compiler will use a bit representation which can be operated on efficiently by the native language of the target system's CPU. The most common representation for signed integers is Two's complement and usually a signed integer with type int will occupy 2 or 4 bytes of 8 bits each.
However, the C standard is sufficiently flexible to allow for other internal representations (e.g. bytes with more than 8 bits or Ones' complement representation of signed integers). A common difference between representations of multibyte integers on different systems is the use of different byte order.
The C standard is primarily concerned with the result of standard operations. E.g. 5+6 must give the same result no matter on which platform the expression is executed, but how 5, 6 and 11 are represented on the given platform is largely up to the compiler to decide.
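If you are curious how your own platform stores a value, you can copy the object into an array of unsigned char and look. This is only a sketch with made-up function names; it distinguishes the two common byte orders and lumps everything else together.

```c
#include <assert.h>
#include <string.h>

/* Copies the object representation of v into out[0 .. sizeof v - 1]. */
void dump_bytes(unsigned int v, unsigned char *out)
{
    assert(out != 0);
    memcpy(out, &v, sizeof v);
}

/* Returns 1 for a little-endian layout, 0 for big-endian, -1 for anything else. */
int byte_order(void)
{
    unsigned char raw[sizeof(unsigned int)];
    dump_bytes(1u, raw);
    if (raw[0] == 1)
        return 1;                      /* least significant byte stored first */
    if (raw[sizeof raw - 1] == 1)
        return 0;                      /* least significant byte stored last */
    return -1;
}
```

Whatever byte_order() reports, 5 + 6 is still 11: the layout is the compiler's business, the value is yours.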
It is of utmost importance to every C programmer to understand that C is an abstraction layer that shields you from the underlying hardware. This service is the raison d'être for the language, the reason it was developed. Among other things, the language shields you from the different internal byte patterns used to hold the same values on different platforms: You write a value and operations on it, and the compiler will see to producing the proper code. This would be different in assembler where you are intimately concerned with memory layout, register sizes etc.
In case it wasn't obvious: I'm emphasizing this because I struggled with these concepts myself when I learned C.
The first thing to hammer down is that C program code is text. What we deal with here are text representations of values, a succession of (most likely) ASCII codes much as if you wrote a letter to your grandma.
Integer literals like 0443 (the less usual octal format), 0x0123 or 291 are simply different string representations for the same value. Here and in the standard, "value" is a value in the mathematical sense. As much as we think "oh, C!" when we see "0x0123", it is nothing else than a way to write down the mathematical value of 291. That is what is meant by "value", for example when the standard specifies that "the type of an integer constant is the first of the corresponding list in which its value can be represented." The compiler has to create a binary representation of that value in the program's memory. This means it has to find out what value it is (291 in all cases) and then produce the proper byte pattern for it. The integer literal in the C code is not a binary form of anything, no matter whether you choose to write its string representation down base 10, base 16 or base 8. In particular, 0x0123 does not mean that the two bytes 01 and 23 will be anywhere in the compiled program, or in which order. [1]
To demonstrate the abstraction consider the expression (0x0123 << 4) == 0x1230, which should be true on all machines. Both hex literals are of type int here. The beauty of hex code is that it makes bit manipulations in multiples of 4 really easy to compute.
On a typical contemporary Intel architecture an int has 4 bytes and is organized "little end first", or "little endian" for short: The lowest-value byte comes first if we inspect the memory in ascending order. 0x123 is represented as 00100011-00000001-00000000-00000000 (because the two highest-value bytes are zero for such a small number). 0x1230 is, consequently, 00110000-00010010-00000000-00000000. Read as a byte sequence in memory, nothing was simply shifted left (but also not right!). The bit-shift operators' semantics are an abstraction: "Imagine a regular binary number, following the old Arab fashion of starting with the highest-value digit, and shift that imagined binary number." It is an abstraction that need not resemble the byte layout in memory at all, and the compiler simply translates this abstract operation into the right thing for that particular hardware.
1. Now admittedly, they probably are there, but on your prevalent x86 platform their order will be reversed, as assumed below.
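Here is a sketch of the same point in code. Because shifts and masks operate on values, pulling a number apart with them gives the same answer on every byte order. The helper name byte_of is hypothetical.

```c
#include <assert.h>

/* Returns byte i of v, counted from the least significant end.
   Works on the value, so the answer does not depend on memory layout. */
unsigned byte_of(unsigned long v, unsigned i)
{
    assert(i < sizeof v);
    return (unsigned)((v >> (8u * i)) & 0xFFu);
}

/* A left shift by 4 is a multiplication by 16, on any machine. */
int shift_matches_arithmetic(void)
{
    return (0x0123 << 4) == 0x1230 && (0x0123 << 4) == 0x0123 * 16;
}
```

byte_of(0x0123, 0) is 0x23 and byte_of(0x0123, 1) is 0x01 regardless of which of them sits at the lower address.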
Are decimal integer constants and hexadecimal integer constants, different ways to 'represent' values or are they values themselves?
This is philosophy! They are different ways to represent values, like:
0x2 means 2 (for a C compiler)
two means 2 (english language)
a couple means 2 (for an english speaker)
zwei means 2 (...)
A C compiler translates from "some form of human understandable language" to "a very precise form understandable by the machine": the only thing which is retained from the various forms, is the intimate meaning (the value!).
It happens that C, in order to be more friendly, lets you specify integers in two different ways, decimal and hexadecimal (ok, even octal and recently also binary notation). What the C compiler is interested in is the value and, as already noted in a comment, after the C compiler has "understood" the value, there is no more difference between a "0xC" or a "12". From that point, the compiler must make the machine understand the value 12, using the representation the target machine uses and, again, what is important is the value.
Most probably, the phrase
we should try to reason primarily about values and not about representations
is an invitation to programmers to choose correct data types and values, but not only that: also to give useful names to types and variables and so on. A not very good example is: even if we know that a line feed is (often) represented by decimal 10, we should use LF or "\n" or similar, which is the value we want, not its representation.
About data types, especially integers, C is not particularly brilliant, compared to other languages which let you define types based on their possible values (for example with the "-3 .. 5" notation, which states that the possible values go from -3 to 5, and lets the compiler choose the number of bits needed for the representation of the range -3 to 5).

What was with the historical typedef soup for integers in C programs?

This is a possibly inane question whose answer I should probably know.
Fifteen years ago or so, a lot of C code I'd look at had tons of integer typedefs in platform-specific #ifdefs. It seemed every program or library I looked at had their own, mutually incompatible typedef soup. I didn't know a whole lot about programming at the time and it seemed like a bizarre bunch of hoops to jump through just to tell the compiler what kind of integer you wanted to use.
I've put together a story in my mind to explain what those typedefs were about, but I don't actually know whether it's true. My guess is basically that when C was first developed and standardized, it wasn't realized how important it was to be able to platform-independently get an integer type of a certain size, and thus all the original C integer types may be of different sizes on different platforms. Thus everyone trying to write portable C code had to do it themselves.
Is this correct? If so, how were programmers expected to use the C integer types? I mean, in a low level language with a lot of bit twiddling, isn't it important to be able to say "this is a 32 bit integer"? And since the language was standardized in 1989, surely there was some thought that people would be trying to write portable code?
When C began, computers were less homogeneous and a lot less connected than today. It was seen as more important for portability that the int types be the natural size(s) for the computer. Asking for an exactly 32-bit integer type on a 36-bit system is probably going to result in inefficient code.
And then along came pervasive networking where you are working with specific on-the-wire size fields. Now interoperability looks a whole lot different. And the 'octet' becomes the de facto quantum of data types.
Now you need ints of exact multiples of 8-bits, so now you get typedef soup and then eventually the standard catches up and we have standard names for them and the soup is not as needed.
C's earlier success was due to its flexibility to adapt to nearly all existing variant architectures (as @John Hascall notes) with:
1) native integer sizes of 8, 16, 18, 24, 32, 36, etc. bits,
2) variant signed integer models: 2's complement, 1's complement, sign-magnitude and
3) various endian, big, little and others.
As coding developed, algorithms and interchange of data pushed for greater uniformity and so the need for types that met 1 & 2 above across platforms. Coders rolled their own like typedef int int32 inside a #if .... The many variations of that created the soup as noted by OP.
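For readers who never saw the soup first-hand, a typical bowl looked roughly like the sketch below. The names my_int32 and my_uint32 are hypothetical (every project picked its own, which was the problem), and the final check uses C11's _Static_assert, which the code of that era did not have.

```c
#include <assert.h>
#include <limits.h>

/* Pick a 32-bit type by testing the ranges that <limits.h> reports. */
#if INT_MAX == 2147483647
typedef int my_int32;
typedef unsigned int my_uint32;
#elif LONG_MAX == 2147483647
typedef long my_int32;
typedef unsigned long my_uint32;
#else
#error "no 32-bit integer type found on this platform"
#endif

/* Modern sanity check; assumes the chosen type has no padding bits. */
_Static_assert(sizeof(my_int32) * CHAR_BIT == 32, "my_int32 should be 32 bits wide");
```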
C99 introduced (u)int_leastN_t, (u)int_fastN_t and (u)intmax_t to provide portable types with a guaranteed minimum bit width. The least and fast types are required for N = 8, 16, 32, 64.
Also introduced are semi-optional types (see below **) like (u)intN_t, which additionally must be 2's complement and have no padding. It is these popular types that are so widely desired and used to thin out the integer soup.
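A short sketch of how the two families differ in practice. The function names are illustrative. The least/fast types always exist, while code that wants to be careful can test for an exact-width type through its limit macro.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Storage width in bits of the required types on this implementation. */
unsigned bits_least16(void) { return (unsigned)(sizeof(int_least16_t) * CHAR_BIT); }
unsigned bits_fast16(void)  { return (unsigned)(sizeof(int_fast16_t)  * CHAR_BIT); }

/* Exact-width types are optional, so test for the limit macro first. */
int has_exact_int32(void)
{
#ifdef INT32_MAX
    return 1;
#else
    return 0;
#endif
}
```

Both widths are at least 16; int_fast16_t is often wider than int_least16_t because the implementation judged a larger type to be quicker.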
how were programmers expected to use the C integer types?
By writing flexible code that did not strongly rely on bit width. It is fairly easy to code strtol() using only LONG_MIN, LONG_MAX without regard to bit-width/endian/integer encoding.
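As an illustration of that style, here is a much-reduced sketch of the digit-accumulation part of such a parser. It is not strtol() (no sign, no base, no errno), and parse_decimal is a made-up name. The point is that the overflow test mentions LONG_MAX and nothing about the width of long.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Parses a leading run of decimal digits into *out.
   Returns 1 on success, 0 if there are no digits or the value exceeds LONG_MAX. */
int parse_decimal(const char *s, long *out)
{
    long value = 0;
    assert(s != NULL && out != NULL);
    if (*s < '0' || *s > '9')
        return 0;
    for (; *s >= '0' && *s <= '9'; ++s) {
        int digit = *s - '0';
        if (value > (LONG_MAX - digit) / 10)
            return 0;               /* value * 10 + digit would overflow */
        value = value * 10 + digit;
    }
    *out = value;
    return 1;
}
```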
Yet many coding tasks oblige precise width types and 2's complement for easy high performance coding. It is better in that case to forego portability to 36-bit machines and 32-bit sign-magnitude ones and stick with 2^N-bit-wide (2's complement for signed) integers. Various CRC & crypto algorithms and file formats come to mind. Thus the need for fixed-width types and a specified (C99) way to do it.
Today there are still gotchas that need to be managed. Example: the usual promotions to int/unsigned lose some control, as those types may be 16, 32 or 64 bits wide.
**
These types are optional. However, if an implementation provides integer types with widths of 8, 16, 32, or 64 bits, no padding bits, and (for the signed types) that have a two’s complement representation, it shall define the corresponding typedef names. C11 7.20.1.1 Exact-width integer types 3
I remember that period and I'm guilty of doing the same!
One issue was the size of int: it could be the same as short, or long, or in between. For example, if you were working with binary file formats, it was imperative that everything align. Byte ordering complicated things as well. Many developers went the lazy route and just did fwrite of whatever, instead of picking numbers apart byte-by-byte. When the machines upgraded to longer word lengths, all hell broke loose. So typedef was an easy hack to fix that.
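For contrast, the non-lazy route looks something like this sketch. The function names and the choice of big-endian order are my assumptions (a real format dictates its own order), and uint32_t postdates the period being described.

```c
#include <assert.h>
#include <stdint.h>

/* Stores v as four bytes, most significant first, whatever the host byte order. */
void put_u32_be(unsigned char out[4], uint32_t v)
{
    assert(out != 0);
    out[0] = (unsigned char)((v >> 24) & 0xFFu);
    out[1] = (unsigned char)((v >> 16) & 0xFFu);
    out[2] = (unsigned char)((v >> 8) & 0xFFu);
    out[3] = (unsigned char)(v & 0xFFu);
}

/* Rebuilds the value from the same four-byte layout. */
uint32_t get_u32_be(const unsigned char in[4])
{
    assert(in != 0);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

A file written this way reads back identically on a machine with a different word length or byte order, which a raw fwrite of an int does not guarantee.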
If performance was an issue, as it often was back then, int was guaranteed to be the machine's fastest natural size, but if you needed 32 bits, and int was shorter than that, you were in danger of rollover.
In the C language, sizeof() is not supposed to be resolved at the preprocessor stage, which made things complicated because you couldn't do #if sizeof(int) == 4 for example.
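The usual workarounds, sketched below, were to test the <limits.h> macros in #if instead, or to force a compile error with an impossible array size. The macro and typedef names are hypothetical, and this particular sketch deliberately refuses to compile where int is not 4 bytes.

```c
#include <assert.h>
#include <limits.h>

/* sizeof is not available to #if, but the <limits.h> macros are. */
#if UINT_MAX == 0xFFFFu
#define INT_VALUE_BITS 16
#elif UINT_MAX == 0xFFFFFFFFu
#define INT_VALUE_BITS 32
#else
#define INT_VALUE_BITS 0   /* some other width; this sketch does not classify it */
#endif

/* Old trick: an array of size -1 is a compile error,
   so this typedef only compiles when int is exactly 4 bytes. */
typedef char int_is_four_bytes[(sizeof(int) == 4) ? 1 : -1];

/* C11 offers the same check directly. */
_Static_assert(sizeof(int) == 4, "this sketch assumes a 4-byte int");
```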
Personally, some of the rationale was also just working from an assembler language mindset and not being willing to abstract out the notion of what short, int and long are for. Back then, assembler was used in C quite frequently.
Nowadays, there are plenty of non-binary file formats, JSON, XML, etc. where it doesn't matter what the binary representation is. As well, many popular platforms have settled on a 32-bit int or longer, which is usually enough for most purposes, so there's less of an issue with rollover.
C is a product of the early 1970s, when the computing ecosystem was very different. Instead of millions of computers all talking to each other over an extended network, you had maybe a hundred thousand systems worldwide, each running a few monolithic apps, with almost no communication between systems. You couldn't assume that any two architectures had the same word sizes, or represented signed integers in the same way. The market was still small enough that there wasn't any perceived need to standardize, computers didn't talk to each other (much), and nobody thought much about portability.
If so, how were programmers expected to use the C integer types?
If you wanted to write maximally portable code, then you didn't assume anything beyond what the Standard guaranteed. In the case of int, that meant you didn't assume that it could represent anything outside of the range [-32767,32767], nor did you assume that it would be represented in 2's complement, nor did you assume that it was a specific width (it could be wider than 16 bits, yet still only represent a 16 bit range if it contained any padding bits).
If you didn't care about portability, or you were doing things that were inherently non-portable (which bit twiddling usually is), then you used whatever type(s) met your requirements.
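In code, "assume only what the Standard guarantees" comes out roughly like the sketch below. add_is_safe is a hypothetical helper; it decides whether a sum fits by comparing against INT_MAX and INT_MIN, without ever needing to know how wide int is.

```c
#include <assert.h>
#include <limits.h>

/* The standard only promises these bounds; real limits may be wider. */
_Static_assert(INT_MAX >= 32767, "int covers at least +32767");
_Static_assert(INT_MIN <= -32767, "int covers at least -32767");

/* Returns 1 if a + b can be computed without signed overflow. */
int add_is_safe(int a, int b)
{
    if (b > 0)
        return a <= INT_MAX - b;   /* room left above a */
    return a >= INT_MIN - b;       /* room left below a (b is zero or negative) */
}
```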
I did mostly high-level applications programming, so I was less worried about representation than I was about range. Even so, I occasionally needed to dip down into binary representations, and it always bit me in the ass. I remember writing some code in the early '90s that had to run on classic MacOS, Windows 3.1, and Solaris. I created a bunch of enumeration constants for 32-bit masks, which worked fine on the Mac and Unix boxes, but failed to compile on the Windows box because on Windows an int was only 16 bits wide.
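The underlying rule is that enumeration constants have type int, so a mask like 0x80000000 cannot be an enumerator where int is 16 bits. A sketch of one way around it follows, with made-up flag names. Back then the constants would have been written with a UL suffix; UINT32_C from C99's <stdint.h> is the modern spelling.

```c
#include <assert.h>
#include <stdint.h>

/* Macros with an explicitly wide unsigned type instead of enumerators. */
#define FLAG_READ   UINT32_C(0x00000001)
#define FLAG_WRITE  UINT32_C(0x00010000)
#define FLAG_TOP    UINT32_C(0x80000000)

/* Returns 1 if any bit of mask is set in flags. */
int has_flag(uint_least32_t flags, uint_least32_t mask)
{
    assert(mask != 0);
    return (flags & mask) != 0;
}
```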
C was designed as a language that could be ported to as wide a range of machines as possible, rather than as a language that would allow most kinds of programs to be run without modification on such a range of machines. For most practical purposes, C's types were:
An 8-bit type if one is available, or else the smallest type that's at least 8 bits.
A 16-bit type, if one is available, or else the smallest type that's at least 16 bits.
A 32-bit type, if one is available, or else some type that's at least 32 bits.
A type which will be 32 bits if systems can handle such things as efficiently as 16-bit types, or 16 bits otherwise.
If code needed 8, 16, or 32-bit types and would be unlikely to be usable on machines which did not support them, there wasn't any particular problem with such code regarding char, short, and long as 8, 16, and 32 bits, respectively. The only systems that didn't map those names to those types would be those which couldn't support those types and wouldn't be able to usefully handle code that required them. Such systems would be limited to writing code which had been written to be compatible with the types that they use.
I think C could perhaps best be viewed as a recipe for converting system specifications into language dialects. A system which uses 36-bit memory won't really be able to efficiently process the same language dialect as a system that uses octet-based memory, but a programmer who learns one dialect would be able to learn another merely by learning what integer representations the latter one uses. It's much more useful to tell a programmer who needs to write code for a 36-bit system, "This machine is just like the other machines except char is 9 bits, short is 18 bits, and long is 36 bits", than to say "You have to use assembly language because other languages would all require integer types this system can't process efficiently".
Not all machines have the same native word size. While you might be tempted to think a smaller variable size will be more efficient, it just ain't so. In fact, using a variable that is the same size as the native word size of the CPU is much, much faster for arithmetic, logical and bit manipulation operations.
But what, exactly, is the "native word size"? Almost always, this means the register size of the CPU, which is the same as the Arithmetic Logic Unit (ALU) can work with.
In embedded environments, there are still such things as 8 and 16 bit CPUs (are there still 4-bit PIC controllers?). There are mountains of 32-bit processors out there still. So the concept of "native word size" is alive and well for C developers.
With 64-bit processors, there is often good support for 32-bit operands. In practice, using 32-bit integers and floating point values can often be faster than the full word size.
Also, there are trade-offs between native word alignment and overall memory consumption when laying out C structures.
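A small sketch of that trade-off: the same three members in two orders. The struct names are hypothetical and padding is up to the implementation, so the function reports whatever difference your compiler produces instead of assuming one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same three members, two orderings. */
struct scattered { uint8_t a; uint32_t b; uint8_t c; };
struct grouped   { uint32_t b; uint8_t a; uint8_t c; };

/* The first member of a struct always sits at offset 0. */
_Static_assert(offsetof(struct grouped, b) == 0, "first member starts the struct");

/* Bytes saved by grouping the small members together (0 if none). */
size_t padding_saved(void)
{
    if (sizeof(struct scattered) < sizeof(struct grouped))
        return 0;
    return sizeof(struct scattered) - sizeof(struct grouped);
}
```

On many ABIs where uint32_t is 4-byte aligned, the scattered layout is larger because padding follows each lone uint8_t.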
But the two common usage patterns remain: size agnostic code for improved speed (int, short, long), or fixed size (int32_t, int16_t, int64_t) for correctness or interoperability where needed.

Why aren't the C-supplied integer types good enough for basically any project?

I'm much more of a sysadmin than a programmer. But I do spend an inordinate amount of time grovelling through programmers' code trying to figure out what went wrong. And a disturbing amount of that time is spent dealing with problems when the programmer expected one definition of __u_ll_int32_t or whatever (yes, I know that's not real), but either expected the file defining that type to be somewhere other than it is, or (and this is far worse but thankfully rare) expected the semantics of that definition to be something other than it is.
As I understand C, it deliberately doesn't make width definitions for integer types (and that this is a Good Thing), but instead gives the programmer char, short, int, long, and long long, in all their signed and unsigned glory, with defined minima which the implementation (hopefully) meets. Furthermore, it gives the programmer various macros that the implementation must provide to tell you things like the width of a char, the largest unsigned long, etc. And yet the first thing any non-trivial C project seems to do is either import or invent another set of types that give them explicitly 8, 16, 32, and 64 bit integers. This means that as the sysadmin, I have to have those definition files in a place the programmer expects (that is, after all, my job), but then not all of the semantics of all those definitions are the same (this wheel has been re-invented many times) and there's no non-ad-hoc way that I know of to satisfy all of my users' needs here. (I've resorted at times to making a <bits/types_for_ralph.h>, which I know makes puppies cry every time I do it.)
What does trying to define the bit-width of numbers explicitly (in a language that specifically doesn't want to do that) gain the programmer that makes it worth all this configuration management headache? Why isn't knowing the defined minima and the platform-provided MAX/MIN macros enough to do what C programmers want to do? Why would you want to take a language whose main virtue is that it's portable across arbitrarily-bitted platforms and then typedef yourself into specific bit widths?
When a C or C++ programmer (hereinafter addressed in second-person) is choosing the size of an integer variable, it's usually in one of the following circumstances:
You know (at least roughly) the valid range for the variable, based on the real-world value it represents. For example,
numPassengersOnPlane in an airline reservation system should accommodate the largest supported airplane, so needs at least 10 bits. (Round up to 16.)
numPeopleInState in a US Census tabulating program needs to accommodate the most populous state (currently about 38 million), so needs at least 26 bits. (Round up to 32.)
In this case, you want the semantics of int_leastN_t from <stdint.h>. It's common for programmers to use the exact-width intN_t here, when technically they shouldn't; however, 8/16/32/64-bit machines are so overwhelmingly dominant today that the distinction is merely academic.
You could use the standard types and rely on constraints like “int must be at least 16 bits”, but a drawback of this is that there's no standard maximum size for the integer types. If int happens to be 32 bits when you only really needed 16, then you've unnecessarily doubled the size of your data. In many cases (see below), this isn't a problem, but if you have an array of millions of numbers, then you'll get lots of page faults.
Your numbers don't need to be that big, but for efficiency reasons, you want a fast, “native” data type instead of a small one that may require time wasted on bitmasking or zero/sign-extension.
This is the int_fastN_t types in <stdint.h>. However, it's common to just use the built-in int here, which in the 16/32-bit days had the semantics of int_fast16_t. It's not the native type on 64-bit systems, but it's usually good enough.
The variable is an amount of memory, array index, or casted pointer, and thus needs a size that depends on the amount of addressable memory.
This corresponds to the typedefs size_t, ptrdiff_t, intptr_t, etc. You have to use typedefs here because there is no built-in type that's guaranteed to be memory-sized.
The variable is part of a structure that's serialized to a file using fread/fwrite, or called from a non-C language (Java, COBOL, etc.) that has its own fixed-width data types.
In these cases, you truly do need an exact-width type.
You just haven't thought about the appropriate type, and use int out of habit.
Often, this works well enough.
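Put side by side, the first four circumstances map onto declarations roughly as in this sketch. The struct and the last three field names are invented for illustration; the first two come from the examples above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct examples {
    int_least16_t numPassengersOnPlane; /* known range: needs at least 10 bits */
    int_least32_t numPeopleInState;     /* known range: needs at least 26 bits */
    int_fast16_t  loopCounter;          /* small range, speed matters */
    size_t        bufferLength;         /* scales with addressable memory */
    uint32_t      onDiskRecordId;       /* exact width for a serialized format */
};

/* The least types are guaranteed wide enough for the ranges named above. */
_Static_assert(INT_LEAST16_MAX >= 1023, "10 bits fit in int_least16_t");
_Static_assert(INT_LEAST32_MAX >= 38000000, "38 million fits in int_least32_t");
```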
So, in summary, all of the typedefs from <stdint.h> have their use cases. However, the usefulness of the built-in types is limited due to:
Lack of maximum sizes for these types.
Lack of a native memsize type.
The arbitrary choice between LP64 (on Unix-like systems) and LLP64 (on Windows) data models on 64-bit systems.
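To find out which model a given toolchain uses, something like the following sketch does the job. data_model is a made-up name, and only three common models are recognized.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Classifies the data model from the widths of int, long and pointers. */
const char *data_model(void)
{
    size_t i = sizeof(int) * CHAR_BIT;
    size_t l = sizeof(long) * CHAR_BIT;
    size_t p = sizeof(void *) * CHAR_BIT;

    if (i == 32 && l == 64 && p == 64) return "LP64";   /* long and pointers are 64-bit */
    if (i == 32 && l == 32 && p == 64) return "LLP64";  /* only long long and pointers are */
    if (i == 32 && l == 32 && p == 32) return "ILP32";  /* everything is 32-bit */
    return "other";
}
```

The practical consequence is that long is 64 bits wide under LP64 and 32 bits under LLP64, so code that stores a pointer or a file offset in a long behaves differently between the two.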
As for why there are so many redundant typedefs of fixed-width (WORD, DWORD, __int64, gint64, FINT64, etc.) and memsize (INT_PTR, LPARAM, VPTRDIFF, etc.) integer types, it's mainly because <stdint.h> came late in C's development, and people are still using older compilers that don't support it, so libraries need to define their own. Same reason why C++ has so many string classes.
Sometimes it is important. For example, most image file formats require an exact number of bits/bytes be used (or at least specified).
If you only wanted to share a file created by the same compiler on the same computer architecture, you would be correct (or at least things would work). But, in real life things like file specifications and network packets are created by a variety of computer architectures and compilers, so we have to care about the details in these case (at least).
The main reason the fundamental types can't be fixed is that a few machines don't use 8-bit bytes. Enough programmers don't care, or actively want not to be bothered with support for such beasts, that the majority of well-written code demands a specific number of bits wherever overflow would be a concern.
It's better to specify a required range than to use int or long directly, because asking for "relatively big" or "relatively small" is fairly meaningless. The point is to know what inputs the program can work with.
By the way, usually there's a compiler flag that will adjust the built-in types. See INT_TYPE_SIZE for GCC. It might be cleaner to stick that into the makefile, than to specialize the whole system environment with new headers.
If you want portable code, you want the code you write to function identically on all platforms. If you have
int i = 32767;
you can't say for certain what i+1 will give you on all platforms.
This is not portable. Some compilers (on the same CPU architecture!) will give you -32768 and some will give you 32768. Some perverted ones will give you 0. That's a pretty big difference. Granted if it overflows, this is Undefined Behavior, but you don't know it is UB unless you know exactly what the size of int is.
If you use the standard integer definitions from <stdint.h> (part of the ISO/IEC 9899:1999 standard), then you know exactly what adding 1 will give. In the lines below, "i+1" means the sum stored back into a variable of the same type.
int16_t i = 32767;
i+1 does not fit in an int16_t (on most compilers, the stored result will appear to be -32768)
uint16_t j = 32767;
j+1 gives 32768;
int8_t i = 32767; // should be a warning but maybe not. most compilers will set i to -1
i+1 gives 0; // in this case, the addition didn't overflow
uint8_t j = 32767; // should be a warning but maybe not. most compilers will set j to 255
j+1 gives 0; // wraps around modulo 256
int32_t i = 32767;
i+1 gives 32768;
uint32_t j = 32767;
j+1 gives 32768;
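One subtlety worth spelling out with a sketch: for the narrow types the addition itself is carried out in int (or unsigned int) after integer promotion, and any narrowing only happens when the result is converted back. The function names are illustrative. Converting an out-of-range value back to a signed type such as int16_t is implementation-defined, so only the unsigned cases are shown.

```c
#include <assert.h>
#include <stdint.h>

/* The sum is computed in int after promotion, so 255 + 1 is simply 256. */
int add_one_as_int(uint8_t j)
{
    return j + 1;
}

/* Converting back to uint8_t reduces the result modulo 256. */
uint8_t add_one_as_u8(uint8_t j)
{
    return (uint8_t)(j + 1);
}

/* Same idea for 16 bits: reduced modulo 65536 on conversion. */
uint16_t add_one_as_u16(uint16_t j)
{
    return (uint16_t)(j + 1u);
}
```

So a uint8_t holding 255 gives 256 if you look at the expression j + 1 directly, and 0 only once the result is stored back into a uint8_t.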
There are two opposing forces at play here:
The need for C to adapt to any CPU architecture in a natural way.
The need for data transferred to/from a program (network, disk, file, etc.) to have a fixed layout, so that a program running on any architecture can correctly interpret it.
The "CPU matching" need has to do with inherent efficiency. There is a CPU quantity which is most easily handled as a single unit, on which all arithmetic operations are performed easily and efficiently, and which results in the need for the fewest bits of instruction encoding. That type is int. It could be 16 bits, 18 bits*, 32 bits, 36 bits*, 64 bits, or even 128 bits on some machines. (* These were some not-well-known machines from the 1960s and 1970s which may have never had a C compiler.)
Data transfer needs when transferring binary data require that record fields are the same size and alignment. For this it is quite important to have control of data sizes. There is also endianness and maybe binary data representations, like floating point representations.
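As a minimal sketch of controlling size and byte order for transfer, a 32-bit field can be written byte by byte, most significant byte first, so the result does not depend on the host's int size or endianness. The function names are made up for this example:

```c
#include <stdint.h>

/* Hypothetical helper: writes v into out[0..3], most significant byte first. */
void put_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)((v >> 24) & 0xFFu);
    out[1] = (unsigned char)((v >> 16) & 0xFFu);
    out[2] = (unsigned char)((v >> 8) & 0xFFu);
    out[3] = (unsigned char)(v & 0xFFu);
}

/* Hypothetical helper: reads back a value written by put_u32_be. */
uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Inside the program the value can then live in whatever type is convenient; only the transfer format is pinned down.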
A program which forces all integer operations to be 32 bit in the interests of size compatibility will work well on some CPU architectures, but not others (especially 16 bit, but also perhaps some 64-bit).
Using the CPU's native register size is preferable if all data interchange is done in a non-binary format, like XML or SQL (or any other ASCII encoding).

Why is infinity = 0x3f3f3f3f?

In some situations, one generally uses a large enough integer value to represent infinity. I usually use the largest representable positive/negative integer. That usually yields more code, since you need to check if one of the operands is infinity before virtually all arithmetic operations in order to avoid overflows. Sometimes it would be desirable to have saturated integer arithmetic. For that reason, some people use smaller values for infinity, that can be added or multiplied several times without overflow. What intrigues me is the fact that it's extremely common to see (especially in programming competitions):
const int INF = 0x3f3f3f3f;
Why is that number special? Its binary representation is:
00111111001111110011111100111111
I don't see any especially interesting property here. I see it's easy to type, but if that were the reason, almost anything would do (0x3e3e3e3e, 0x2f2f2f2f, etc). It can be added once without overflow, which allows for:
a = min(INF, b + c);
But all the other constants would do, then. Googling only shows me a lot of code snippets that use that constant, but no explanations or comments.
Can anyone spot it?
I found some evidence about this here (original content in Chinese); the basic idea is that 0x7fffffff is problematic since it's already "the top" of the range of 4-byte signed ints; so, adding anything to it results in negative numbers; 0x3f3f3f3f, instead:
is still quite big (same order of magnitude of 0x7fffffff);
has a lot of headroom; if you say that the valid range of integers is limited to numbers below it, you can add any "valid positive number" to it and still get an infinite (i.e. something >=INF). Even INF+INF doesn't overflow. This lets you keep it always "under control":
a+=b;
if(a>INF)
a=INF;
is a repetition of equal bytes, which means you can easily memset stuff to INF;
also, as @Jörg W Mittag noticed above, it has a nice ASCII representation, that allows both to spot it on the fly looking at memory dumps, and to write it directly in memory.
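The "under control" pattern can be wrapped in a small function. This is just a sketch; add_clamped is a made-up name, and it assumes both inputs are already in the range 0 to INF:

```c
#include <stdint.h>

#define INF 0x3f3f3f3f  /* 1061109567 */

/* Hypothetical helper: adds two values from [0, INF] and clamps the
   result to INF. INF + INF is 2122219134, which still fits in int32_t,
   so the addition itself cannot overflow for inputs in that range. */
int32_t add_clamped(int32_t a, int32_t b)
{
    int32_t sum = a + b;
    return sum > INF ? INF : sum;
}
```

With 0x7fffffff as infinity the same function would overflow as soon as either argument was "infinite" and the other was positive.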
I may or may not be one of the earliest discoverers of 0x3f3f3f3f. I published a Romanian article about it in 2004 (http://www.infoarena.ro/12-ponturi-pentru-programatorii-cc #9), but I've been using this value since 2002 at least for programming competitions.
There are two reasons for it:
0x3f3f3f3f + 0x3f3f3f3f doesn't overflow int32. For this some use 1000000000 (one billion).
one can set an array of ints to infinity by doing memset(array, 0x3f, sizeof(array))
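A small sketch of the memset trick, with a made-up helper name. It relies on every byte of the constant being 0x3f, so it is shown with int32_t to make the 4-byte assumption explicit:

```c
#include <stdint.h>
#include <string.h>

#define INF 0x3f3f3f3f

/* Hypothetical helper: sets dist[0..n-1] to INF. memset stores the byte
   0x3f in every byte of the array, and four such bytes read back as
   0x3f3f3f3f regardless of byte order. */
void fill_with_inf(int32_t *dist, size_t n)
{
    memset(dist, 0x3f, n * sizeof dist[0]);
}
```

The same call with 0x7f would give 0x7f7f7f7f rather than 0x7fffffff, which is one reason the usual INT_MAX style infinity cannot be set this way.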
0x3f3f3f3f is the ASCII representation of the string ????.
Krugle finds 48 instances of that constant in its entire database. 46 of those instances are in a Java project, where it is used as a bitmask for some graphics manipulation.
1 project is an operating system, where it is used to represent an unknown ACPI device.
1 project is again a bitmask for Java graphics.
So, in all of the projects indexed by Krugle, it is used 47 times because of its bitpattern, once because of its ASCII interpretation, and not a single time as a representation of infinity.

About the use of signed integers in C family of languages

When using integer values in my own code, I always try to consider the signedness, asking myself if the integer should be signed or unsigned.
When I'm sure the value will never need to be negative, I then use an unsigned integer.
And I have to say this happens most of the time.
When reading other peoples' code, I rarely see unsigned integers, even if the represented value can't be negative.
So I asked myself: «is there a good reason for this, or do people just use signed integers because they don't care»?
I've searched on the subject, here and in other places, and I have to say I can't find a good reason not to use unsigned integers, when it applies.
I came across those questions: «Default int type: Signed or Unsigned?», and «Should you always use 'int' for numbers in C, even if they are non-negative?» which both present the following example:
for( unsigned int i = foo.Length() - 1; i >= 0; --i ) {}
To me, this is just bad design. Of course, it may result in an infinite loop, with unsigned integers.
But is it so hard to check if foo.Length() is 0, before the loop?
So I personally don't think this is a good reason for using signed integers all the way.
Some people may also say that signed integers may be useful, even for non-negative values, to provide an error flag, usually -1.
Ok, that's good to have a specific value that means «error».
But then, what's wrong with something like UINT_MAX, for that specific value?
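As a rough illustration of the UINT_MAX idea, a lookup can return an unsigned index and reserve the largest value for "not found". The names NOT_FOUND and find_index are made up for this sketch:

```c
#include <limits.h>

#define NOT_FOUND UINT_MAX  /* hypothetical sentinel value */

/* Hypothetical helper: returns the index of the first element equal to
   wanted, or NOT_FOUND if there is none. */
unsigned int find_index(const int *items, unsigned int count, int wanted)
{
    for (unsigned int i = 0; i < count; i++) {
        if (items[i] == wanted) {
            return i;
        }
    }
    return NOT_FOUND;
}
```

The price is that an array can never have a valid element at index UINT_MAX, which is rarely a practical limitation.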
I'm actually asking this question because it may lead to some huge problems, usually when using third-party libraries.
In such a case, you often have to deal with signed and unsigned values.
Most of the time, people just don't care about the signedness, and just assign, for instance, an unsigned int to a signed int, without checking the range.
I have to say I'm a bit paranoid with the compiler warning flags, so with my setup, such an implicit cast will result in a compiler error.
For that kind of stuff, I usually use a function or macro to check the range, and then assign using an explicit cast, raising an error if needed.
This just seems logical to me.
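Such a range-checking function might look roughly like this. It is only a sketch; uint_to_int is a made-up name, and reporting the error through a bool return is one choice among several:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: converts an unsigned int to int only when the
   value fits. Returns false and leaves *out untouched otherwise. */
bool uint_to_int(unsigned int in, int *out)
{
    if (in > (unsigned int)INT_MAX) {
        return false;
    }
    *out = (int)in;   /* explicit cast, known to be in range */
    return true;
}
```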
As a last example, as I'm also an Objective-C developer (note that this question is not related to Objective-C only):
- ( NSInteger )tableView: ( UITableView * )tableView numberOfRowsInSection: ( NSInteger )section;
For those not fluent with Objective-C, NSInteger is a signed integer.
This method actually retrieves the number of rows in a table view, for a specific section.
The result will never be a negative value (as the section number, by the way).
So why use a signed integer for this?
I really don't understand.
This is just an example, but I just always see that kind of stuff, with C, C++ or Objective-C.
So again, I'm just wondering if people just don't care about that kind of problems, or if there is finally a good and valid reason not to use unsigned integers for such cases.
Looking forward to hearing your answers : )
a signed return value might yield more information (think error-numbers, 0 is sometimes a valid answer, -1 indicates error, see man read) ... which might be relevant especially for developers of libraries.
if you are worrying about the one extra bit you gain when using unsigned instead of signed then you are probably using the wrong type anyway. (also kind of "premature optimization" argument)
languages like python, ruby, jscript etc are doing just fine without signed vs unsigned. that might be an indicator ...
When using integer values in my own code, I always try to consider the signedness, asking myself if the integer should be signed or unsigned.
When I'm sure the value will never need to be negative, I then use an unsigned integer.
And I have to say this happens most of the time.
Carefully considering which type is most suitable each time you declare a variable is very good practice! This means you are careful and professional. You should not only consider signedness, but also the potential max value that you expect this type to have.
The reason why you shouldn't use signed types when they aren't needed has nothing to do with performance, but with type safety. There are lots of potential, subtle bugs that can be caused by signed types:
The various forms of implicit promotions that exist in C can cause your type to change signedness in unexpected and possibly dangerous ways. The integer promotion rule that is part of the usual arithmetic conversions, the lvalue conversion upon assignment, the default argument promotions used by for example VA lists, and so on.
When using any form of bitwise operators or similar hardware-related programming, signed types are dangerous and can easily cause various forms of undefined behavior.
By declaring your integers unsigned, you automatically skip past a whole lot of the above dangers. Similarly, by declaring them as large as unsigned int or larger, you get rid of lots of dangers caused by the integer promotions.
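To make the integer promotion point concrete, here is a small sketch with made-up function names. Applying ~ to a uint8_t does not produce a uint8_t: the operand is first promoted to int, so the complement has bits set above bit 7 unless it is cast back:

```c
#include <stdint.h>

/* Returns ~x as the compiler actually computes it: x is promoted to
   int first, so for x == 0xFF the result is a negative int, not 0. */
int complement_as_int(uint8_t x)
{
    return ~x;
}

/* Casting back to uint8_t keeps only the low 8 bits, which is what
   most people expect from "the complement of a byte". */
uint8_t complement_as_u8(uint8_t x)
{
    return (uint8_t)~x;
}
```

A test such as ~x == 0 on a uint8_t is therefore never true, while (uint8_t)~x == 0 is true for x == 0xFF.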
Both size and signedness are important when it comes to writing rugged, portable and safe code. This is the reason why you should always use the types from stdint.h and not the native, so-called "primitive data types" of C.
So I asked myself: «is there a good reason for this, or do people just use signed integers because they don't care»?
I don't really think it is because they don't care, nor because they are lazy, even though declaring everything int is sometimes referred to as "sloppy typing" - which means sloppily picked type more than it means too lazy to type.
I rather believe it is because they lack deeper knowledge of the various things I mentioned above. There's a frightening number of seasoned C programmers who don't know how implicit type promotions work in C, nor how signed types can cause poorly-defined behavior when used together with certain operators.
This is actually a very frequent source of subtle bugs. Many programmers find themselves staring at a compiler warning or a peculiar bug, which they can make go away by adding a cast. But they don't understand why, they simply add the cast and move on.
for( unsigned int i = foo.Length() - 1; i >= 0; --i ) {}
To me, this is just bad design
Indeed it is.
Once upon a time, down-counting loops would yield more efficient code, because the compiler could pick a "branch if zero" instruction instead of a "branch if larger/smaller/equal" instruction, and the former is faster. But this was at a time when compilers were really dumb and I don't believe such micro-optimizations are relevant any longer.
So there is rarely ever a reason to have a down-counting loop. Whoever made the argument probably just couldn't think outside the box. The example could have been rewritten as:
for(unsigned int i=0; i<foo.Length(); i++)
{
unsigned int index = foo.Length() - i - 1;
thing[index] = something;
}
This code should not have any impact on performance, but the loop itself became a whole lot easier to read, while at the same time fixing the bug that your example had.
As far as performance is concerned nowadays, one should probably spend the time pondering which form of data access is best in terms of data cache use, rather than anything else.
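For cases where a down-counting loop with an unsigned index is still wanted, one common idiom (not taken from the post being discussed) is to test and decrement in the loop condition. A sketch, with a made-up function name:

```c
#include <stddef.h>

/* Hypothetical helper: copies in[0..n-1] into out[] in reverse order,
   walking in[] from index n-1 down to 0. The condition i-- > 0 tests
   the old value of i, so the body sees n-1, n-2, ..., 0 and the loop
   is skipped entirely when n is 0. */
void copy_reversed(const int *in, int *out, size_t n)
{
    for (size_t i = n; i-- > 0; ) {
        out[n - 1 - i] = in[i];
    }
}
```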
Some people may also say that signed integers may be useful, even for non-negative values, to provide an error flag, usually -1.
That's a poor argument. Good API design uses a dedicated error type for error reporting, such as an enum.
Instead of having some hobbyist-level API like
int do_stuff (int a, int b); // returns -1 if a or b were invalid, otherwise the result
you should have something like:
err_t do_stuff (int32_t a, int32_t b, int32_t* result);
// returns ERR_A if a is invalid, ERR_B if b is invalid, ERR_XXX if... and so on
// the result is stored in [result], which is allocated by the caller
// upon errors the contents of [result] remain untouched
The API would then consistently reserve the return of every function for this error type.
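A filled-in version of that prototype could look like the sketch below. Everything beyond the signature is an assumption made for the illustration: the enum values, the validity rules for a and b, and the division used as the "stuff":

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error codes; the original only names ERR_A and ERR_B. */
typedef enum { ERR_OK, ERR_A, ERR_B, ERR_NULL } err_t;

/* Hypothetical example: divides a by b. Assumed rules for this sketch:
   a must be non-negative and b must be positive. On any error the
   contents of *result are left untouched. */
err_t do_stuff(int32_t a, int32_t b, int32_t *result)
{
    if (result == NULL) {
        return ERR_NULL;
    }
    if (a < 0) {
        return ERR_A;
    }
    if (b <= 0) {
        return ERR_B;
    }
    *result = a / b;
    return ERR_OK;
}
```

The caller checks the return value first and only then looks at the result, so no value of the result type has to be sacrificed as an error marker.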
(And yes, many of the standard library functions abuse return types for error handling. This is because it contains lots of ancient functions from a time before good programming practice was invented, and they have been preserved the way they are for backwards-compatibility reasons. So just because you find a poorly-written function in the standard library, you shouldn't run off to write an equally poor function yourself.)
Overall, it sounds like you know what you are doing and giving signedness some thought. That probably means that knowledge-wise, you are actually already ahead of the people who wrote those posts and guides you are referring to.
The Google style guide for example, is questionable. Similar could be said about lots of other such coding standards that use "proof by authority". Just because it says Google, NASA or Linux kernel, people blindly swallow them no matter the quality of the actual contents. There are good things in those standards, but they also contain subjective opinions, speculations or blatant errors.
I would recommend referring to real professional coding standards instead, such as MISRA-C. It enforces lots of thought and care for things like signedness, type promotion and type size, where less detailed/less serious documents just skip past it.
There is also CERT C, which isn't as detailed and careful as MISRA, but at least a sound, professional document (and more focused towards desktop/hosted development).
There is one heavy-weight argument against widely unsigned integers:
Premature optimization is the root of all evil.
We all have at least on one occasion been bitten by unsigned integers. Sometimes like in your loop, sometimes in other contexts. Unsigned integers add a hazard, even though a small one, to your program. And you are introducing this hazard to change the meaning of one bit. One little, tiny, insignificant-but-for-its-sign-meaning bit. On the other hand, the integers we work with in bread and butter applications are often far below the range of integers, more in the order of 10^1 than 10^7. Thus, the different range of unsigned integers is in the vast majority of cases not needed. And when it's needed, it is quite likely that this extra bit won't cut it (when 31 is too little, 32 is rarely enough) and you'll need a wider or an arbitrary-wide integer anyway. The pragmatic approach in these cases is to just use the signed integer and spare yourself the occasional underflow bug. Your time as a programmer can be put to much better use.
From the C FAQ:
The first question in the C FAQ is which integer type should we decide to use?
If you might need large values (above 32,767 or below -32,767), use long. Otherwise, if space is very important (i.e. if there are large arrays or many structures), use short. Otherwise, use int. If well-defined overflow characteristics are important and negative values are not, or if you want to steer clear of sign-extension problems when manipulating bits or bytes, use one of the corresponding unsigned types.
Another question concerns types conversions:
If an operation involves both signed and unsigned integers, the situation is a bit more complicated. If the unsigned operand is smaller (perhaps we're operating on unsigned int and long int), such that the larger, signed type could represent all values of the smaller, unsigned type, then the unsigned value is converted to the larger, signed type, and the result has the larger, signed type. Otherwise (that is, if the signed type can not represent all values of the unsigned type), both values are converted to a common unsigned type, and the result has that unsigned type.
You can find it here. So basically using unsigned integers, mostly for arithmetic conversions can complicate the situation since you'll have to either make all your integers unsigned, or be at the risk of confusing the compiler and yourself, but as long as you know what you are doing, this is not really a risk per se. However, it could introduce simple bugs.
And when is it good to use unsigned integers? One situation is when using bitwise operations:
The << operator shifts its first operand left by a number of bits
given by its second operand, filling in new 0 bits at the right.
Similarly, the >> operator shifts its first operand right. If the
first operand is unsigned, >> fills in 0 bits from the left, but if
the first operand is signed, >> might fill in 1 bits if the high-order
bit was already 1. (Uncertainty like this is one reason why it's
usually a good idea to use all unsigned operands when working with the
bitwise operators.)
taken from here
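A tiny sketch of the unsigned case described in that quote, using a made-up function name. With an unsigned operand the right shift always fills with 0 bits, so the result is fully predictable:

```c
#include <stdint.h>

/* Hypothetical helper: returns the most significant bit of v as 0 or 1.
   Because v is unsigned, >> shifts in zeros from the left. With a
   negative signed operand the fill bits would be implementation-defined. */
uint32_t top_bit(uint32_t v)
{
    return v >> 31;
}
```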
And I've seen this somewhere:
If it was best to use unsigned integers for values that are never negative, we would have started by using unsigned int in the main function int main(int argc, char* argv[]). One thing is sure, argc is never negative.
EDIT:
As mentioned in the comments, the signature of main is due to historical reasons and apparently it predates the existence of the unsigned keyword.
Unsigned integers are an artifact from the past. They date from a time when processors could do unsigned arithmetic a little bit faster.
This is a case of premature optimization which is considered evil.
Actually, in 2003, when AMD introduced x86_64 (or AMD64, as it was then called), the 64-bit architecture for x86, they brought the ghosts of the past back: if a signed integer is used as an index and the compiler cannot prove that it is never negative, it has to insert a 32-to-64-bit sign extension instruction, because the default 32-to-64-bit extension is unsigned (the upper half of a 64-bit register gets cleared if you move a 32-bit value into it).
But I would recommend against using unsigned in any arithmetic at all, being it pointer arithmetic or just simple numbers.
for( unsigned int i = foo.Length() - 1; i >= 0; --i ) {}
Any recent compiler will warn about such a construct, with "condition is always true" or similar. By using a signed variable you avoid such pitfalls altogether. Use ptrdiff_t instead.
A problem might be the c++ library, it often uses an unsigned type for size_t, which is required because of some rare corner cases with very large sizes (between 2^31 and 2^32) on 32 bit systems with certain boot switches ( /3GB windows).
There are many more: comparisons between signed and unsigned come to mind, where the signed value automagically gets converted to an unsigned one and thus becomes a huge positive number, when it was a small negative one before.
One exception for using unsigned exists: For bit fields, flags, masks it is quite common. Usually it doesn't make sense at all to interpret the value of these variables as a magnitude, and the reader may deduce from the type that this variable is to be interpreted in bits.
The result will never be a negative value (as the section number, by the way). So why use a signed integer for this?
Because you might want to compare the return value to a signed value which is actually negative. The comparison should return true in that case, but the C standard specifies that the signed value gets converted to unsigned in that case, and you will get false instead. I don't know about Objective-C though.
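The comparison problem can be shown with a short sketch. The function names are made up. The first function spells out with a cast what the compiler does implicitly when an int is compared with an unsigned int; the second checks the sign first:

```c
#include <stdbool.h>

/* Equivalent to writing "s < u" directly: the int is converted to
   unsigned int before the comparison, so -1 becomes UINT_MAX. */
bool less_than_converted(int s, unsigned int u)
{
    return (unsigned int)s < u;
}

/* Hypothetical fix: a negative s is smaller than any unsigned value,
   so handle that case before converting. */
bool less_than_checked(int s, unsigned int u)
{
    return s < 0 || (unsigned int)s < u;
}
```

For s = -1 and u = 1 the first function returns false, which is the surprising result described above, while the second returns true.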

Resources