Why do we use explicit data types? (from a low level point of view) - c

When we take a look at some fundamental data types, such as char and int, we know that a char is simply an unsigned byte (depending on the language), an int is just a signed dword, a bool is just a char that can only be 1 or 0, and so on. My question is: why do we use these types in compiled languages instead of just declaring a variable of type byte, dword, etc., since the operations applied to the types mentioned above are pretty much all the same, once you differentiate signed, unsigned and floating-point data?
To extend the context of the question: in the C language, if and while statements can take a boolean value as input, which is usually stored as a char, which removes the need for an explicit boolean type.
In practice, the two pieces of code below should be equivalent at the binary level:
#include <stdio.h>

int main()
{
    int x = 5;
    char y = 'c';
    printf("%d %c\n", x - 8, y + 1);
    return 0;
}
//outputs: -3 d
-
signed dword main()
{
    signed dword x = 5;
    byte y = 'c';
    printf("%d %c\n", x - 8, y + 1);
    return 0;
}
//outputs: -3 d

My question is, why do we use these types in compiled languages
To make the code target-agnostic. Some platforms only have efficient 16-bit integers, and forcing your variables to always be 32-bit would make your code slower for no reason when compiled for such platforms. Or maybe you have a target with 36-bit integers, and a strict 32-bit type would require extra instructions to implement.
Your question sounds very x86-centric. x86 is not the only architecture, and for most languages it is not the one the language designers had in mind.
Even more recent languages that were designed in the era of x86 being widespread on desktops and servers were designed to be portable to other ISAs, like 8-bit AVR where a 32-bit int would take 4 registers vs. 2 for a 16-bit int.
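As a minimal sketch (assuming a C99 <stdint.h>), the "fast" typedefs let each target pick its own efficient width, while an exact-width type pins it down; on an AVR the fast type would typically come out 16 bits wide, on a desktop often 32 or 64:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int_fast16_t counter = 0;   /* "at least 16 bits, whatever is fastest here"    */
    int32_t wire_field = 0;     /* exactly 32 bits, e.g. for a file or wire format */

    printf("int_fast16_t is %zu bytes on this target\n", sizeof counter);
    printf("int32_t      is %zu bytes on this target\n", sizeof wire_field);
    return 0;
}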

A programming language defines an "abstract" data model that a computer designer is free to implement however they like. For instance, nothing mandates storing a Boolean in a byte; it could be "packed" as a single bit along with others. And if you read the C standard carefully, you will notice that a char has no fixed width in bits, only a minimum of 8.
[Anecdotally, I recall an old time when FORTRAN variables, including integers, floats but also booleans, were stored in 72 bits on IBM machines.]
Language designers should put few constraints on machine architecture, to leave opportunities for nice designs. In fact, languages have no "low level"; they implicitly describe a virtual machine not tied to any particular hardware (it could be implemented with cogwheels and ropes).
As far as I know, only the Ada language went so far as to specify all the characteristics of the arithmetic in detail, though not to the point of enforcing a number of bits per word.
Ignoring the boolean type was one of the saddest design decisions in the C language. It took until C99 to integrate it :-(
Another sad decision is to have stopped treating the int type as the one that naturally fits in a machine word (it should have become 64 bits on current PCs).
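For reference, a minimal sketch of the boolean C99 finally added (via <stdbool.h>):

#include <stdio.h>
#include <stdbool.h>   /* C99: bool, true, false on top of the _Bool type */

int main(void)
{
    bool done = false;
    done = 42;              /* conversion to _Bool collapses any nonzero value to 1 */
    printf("%d\n", done);   /* prints 1, unlike a plain char used as a "boolean"    */
    return 0;
}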

The point of a high-level language is to provide some isolation from machine details. So, we speak of "integers", not some particular number of bytes of memory. The implementation then maps the higher-level types on whatever seems best suited to the target hardware.
And there are different semantics associated with different 4-byte types: for integers, signed versus unsigned is important to some classes of programs.
I understand this is a C question and it's arguable about how high-level C is or is not; but it is at least intended to be portable across machine architectures.
And, in your example, you assume 'int' is 32 bits. Nothing in the language says that has to be true. It has not always been true, and certainly was not true in the original PDP-11 implementation. And nowadays, for example, it is possibly appropriate to have 'int' be 64 bits on a 64-bit machine.
Note that it's not a given that languages have types like "integer" at all: BLISS, a language at the same conceptual level as C, has the machine word as its only built-in datatype.
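A small probe makes the point concrete; nothing in this sketch assumes a particular width, and the numbers it prints differ from one implementation to the next:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* The language only promises INT_MAX >= 32767 and CHAR_BIT >= 8;
       the actual values are the implementation's choice. */
    printf("CHAR_BIT = %d\n", CHAR_BIT);
    printf("sizeof(int)  = %zu bytes, INT_MAX  = %d\n", sizeof(int), INT_MAX);
    printf("sizeof(long) = %zu bytes, LONG_MAX = %ld\n", sizeof(long), LONG_MAX);
    return 0;
}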

Related

What was with the historical typedef soup for integers in C programs?

This is a possibly inane question whose answer I should probably know.
Fifteen years ago or so, a lot of C code I'd look at had tons of integer typedefs in platform-specific #ifdefs. It seemed every program or library I looked at had their own, mutually incompatible typedef soup. I didn't know a whole lot about programming at the time and it seemed like a bizarre bunch of hoops to jump through just to tell the compiler what kind of integer you wanted to use.
I've put together a story in my mind to explain what those typedefs were about, but I don't actually know whether it's true. My guess is basically that when C was first developed and standardized, it wasn't realized how important it was to be able to platform-independently get an integer type of a certain size, and thus all the original C integer types may be of different sizes on different platforms. Thus everyone trying to write portable C code had to do it themselves.
Is this correct? If so, how were programmers expected to use the C integer types? I mean, in a low level language with a lot of bit twiddling, isn't it important to be able to say "this is a 32 bit integer"? And since the language was standardized in 1989, surely there was some thought that people would be trying to write portable code?
When C began computers were less homogenous and a lot less connected than today. It was seen as more important for portability that the int types be the natural size(s) for the computer. Asking for an exactly 32-bit integer type on a 36-bit system is probably going to result in inefficient code.
And then along came pervasive networking where you are working with specific on-the-wire size fields. Now interoperability looks a whole lot different. And the 'octet' becomes the de facto quanta of data types.
Now you need ints of exact multiples of 8-bits, so now you get typedef soup and then eventually the standard catches up and we have standard names for them and the soup is not as needed.
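A hedged reconstruction of what that soup commonly looked like (the type names and #if tests are illustrative, not taken from any particular project), followed by the standard header that eventually replaced it:

#include <limits.h>

#if UINT_MAX == 0xFFFFFFFFu
typedef int           int32;    /* int happens to be 32 bits here          */
typedef unsigned int  uint32;
#elif ULONG_MAX == 0xFFFFFFFFu
typedef long          int32;    /* e.g. a 16-bit-int, 32-bit-long platform */
typedef unsigned long uint32;
#else
#error "no 32-bit integer type found"
#endif

/* Since C99 the soup collapses into one standard header: */
#include <stdint.h>             /* int32_t, uint32_t, ... */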
C's early success was due to its flexibility in adapting to nearly all existing architecture variants, as @John Hascall noted, with:
1) native integer sizes of 8, 16, 18, 24, 32, 36, etc. bits,
2) various signed integer models: 2's complement, 1's complement and sign-magnitude, and
3) various endiannesses: big, little and others.
As coding developed, algorithms and interchange of data pushed for greater uniformity and so the need for types that met 1 & 2 above across platforms. Coders rolled their own like typedef int int32 inside a #if .... The many variations of that created the soup as noted by OP.
C99 introduced (u)int_leastN_t, (u)int_fastN_t and (u)intmax_t: portable types with at least a given bit width. These types are required for N = 8, 16, 32, 64.
Also introduced were the semi-optional (u)intN_t types (see below **), which have the additional requirements of being 2's complement with no padding bits. It is these popular types that are so widely desired and used to thin out the integer soup.
how were programmers expected to use the C integer types?
By writing flexible code that did not strongly rely on bit width. It is fairly easy to code strtol() using only LONG_MIN and LONG_MAX, without regard to bit width, endianness or integer encoding.
Yet many coding tasks require precise-width types and 2's complement for easy, high-performance coding. In that case it is better to forgo portability to 36-bit machines and 32-bit sign-magnitude ones and stick with power-of-two-width (2's complement for signed) integers. Various CRC and crypto algorithms and file formats come to mind. Hence the need for fixed-width types and a specified (C99) way to get them.
Today there are still gotchas to manage. Example: the usual promotions to int/unsigned lose some control, as those types may be 16, 32 or 64 bits wide.
**
These types are optional. However, if an implementation provides integer types with widths of 8, 16, 32, or 64 bits, no padding bits, and (for the signed types) that have a two’s complement representation, it shall define the corresponding typedef names. C11 7.20.1.1 Exact-width integer types 3
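A sketch of the three families side by side (assuming a C99 <stdint.h>):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint_least32_t counter = 0;  /* smallest type with at least 32 bits         */
    uint_fast16_t  index   = 0;  /* at least 16 bits, whatever is fastest here  */
    uint32_t       crc     = 0;  /* exactly 32 bits, 2's complement, no padding */
                                 /* (semi-optional, but present on ordinary     */
                                 /*  8/16/32/64-bit two's complement machines)  */
    printf("%zu %zu %zu\n", sizeof counter, sizeof index, sizeof crc);
    return 0;
}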
I remember that period and I'm guilty of doing the same!
One issue was the size of int: it could be the same as short, or long, or in between. For example, if you were working with binary file formats, it was imperative that everything align. Byte ordering complicated things as well. Many developers went the lazy route and just did fwrite of whatever, instead of picking numbers apart byte by byte. When the machines upgraded to longer word lengths, all hell broke loose. So typedef was an easy hack to fix that.
If performance was an issue, as it often was back then, int was guaranteed to be the machine's fastest natural size, but if you needed 32 bits, and int was shorter than that, you were in danger of rollover.
In the C language, sizeof() is not supposed to be resolved at the preprocessor stage, which made things complicated because you couldn't do #if sizeof(int) == 4 for example.
Personally, some of the rationale was also just working from an assembler language mindset and not being willing to abstract out the notion of what short, int and long are for. Back then, assembler was used in C quite frequently.
Nowadays, there are plenty of non-binary file formats, JSON, XML, etc. where it doesn't matter what the binary representation is. As well, many popular platforms have settled on a 32-bit int or longer, which is usually enough for most purposes, so there's less of an issue with rollover.
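As an illustration of "picking numbers apart byte-by-byte", here is a hedged sketch (the function name and file name are invented) that writes a 32-bit value in a fixed little-endian layout, so the on-disk bytes don't depend on the host's word size or byte order:

#include <stdio.h>
#include <stdint.h>

/* Write a 32-bit value as four bytes, least significant first, instead of
   fwrite(&x, sizeof x, 1, f) on the raw in-memory representation. */
static int write_u32_le(FILE *f, uint32_t x)
{
    unsigned char b[4];
    b[0] = (unsigned char)(x & 0xFF);
    b[1] = (unsigned char)((x >> 8) & 0xFF);
    b[2] = (unsigned char)((x >> 16) & 0xFF);
    b[3] = (unsigned char)((x >> 24) & 0xFF);
    return fwrite(b, 1, 4, f) == 4 ? 0 : -1;
}

int main(void)
{
    FILE *f = fopen("out.bin", "wb");   /* "out.bin" is just an example name */
    if (!f) return 1;
    write_u32_le(f, 0xDEADBEEFu);
    fclose(f);
    return 0;
}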
C is a product of the early 1970s, when the computing ecosystem was very different. Instead of millions of computers all talking to each other over an extended network, you had maybe a hundred thousand systems worldwide, each running a few monolithic apps, with almost no communication between systems. You couldn't assume that any two architectures had the same word sizes, or represented signed integers in the same way. The market was still small enough that there wasn't any perceived need to standardize, computers didn't talk to each other (much), and nobody thought much about portability.
If so, how were programmers expected to use the C integer types?
If you wanted to write maximally portable code, then you didn't assume anything beyond what the Standard guaranteed. In the case of int, that meant you didn't assume that it could represent anything outside of the range [-32767,32767], nor did you assume that it would be represented in 2's complement, nor did you assume that it was a specific width (it could be wider than 16 bits, yet still only represent a 16 bit range if it contained any padding bits).
If you didn't care about portability, or you were doing things that were inherently non-portable (which bit twiddling usually is), then you used whatever type(s) met your requirements.
I did mostly high-level applications programming, so I was less worried about representation than I was about range. Even so, I occasionally needed to dip down into binary representations, and it always bit me in the ass. I remember writing some code in the early '90s that had to run on classic MacOS, Windows 3.1, and Solaris. I created a bunch of enumeration constants for 32-bit masks, which worked fine on the Mac and Unix boxes, but failed to compile on the Windows box because on Windows an int was only 16 bits wide.
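A reconstruction of the kind of code described above (the names are invented): C89 requires every enumeration constant to fit in an int, so masks that need more than 16 bits compiled fine where int was 32 bits and were rejected where it was 16:

enum flags {
    FLAG_READ  = 0x0001,
    FLAG_WRITE = 0x0002,
    FLAG_BIG   = 0x00010000   /* needs 17 bits: fine with a 32-bit int,
                                 a constraint violation with a 16-bit int */
};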
C was designed as a language that could be ported to as wide a range of machines as possible, rather than as a language that would allow most kinds of programs to be run without modification on such a range of machines. For most practical purposes, C's types were:
An 8-bit type if one is available, or else the smallest type that's at least 8 bits.
A 16-bit type, if one is available, or else the smallest type that's at least 16 bits.
A 32-bit type, if one is available, or else some type that's at least 32 bits.
A type which will be 32 bits if systems can handle such things as efficiently as 16-bit types, or 16 bits otherwise.
If code needed 8, 16, or 32-bit types and would be unlikely to be usable on machines which did not support them, there wasn't any particular problem with such code regarding char, short, and long as 8, 16, and 32 bits, respectively. The only systems that didn't map those names to those types would be those which couldn't support those types and wouldn't be able to usefully handle code that required them. Such systems would be limited to writing code which had been written to be compatible with the types that they use.
I think C could perhaps best be viewed as a recipe for converting system specifications into language dialects. A system which uses 36-bit memory won't really be able to efficiently process the same language dialect as a system that use octet-based memory, but a programmer who learns one dialect would be able to learn another merely by learning what integer representations the latter one uses. It's much more useful to tell a programmer who needs to write code for a 36-bit system, "This machine is just like the other machines except char is 9 bits, short is 18 bits, and long is 36 bits", than to say "You have to use assembly language because other languages would all require integer types this system can't process efficiently".
Not all machines have the same native word size. While you might be tempted to think a smaller variable size will be more efficient, it just ain't so. In fact, using a variable that is the same size as the native word size of the CPU is much, much faster for arithmetic, logical and bit manipulation operations.
But what, exactly, is the "native word size"? Almost always, this means the register size of the CPU, which is the same size as what the Arithmetic Logic Unit (ALU) can work with.
In embedded environments, there are still such things as 8 and 16 bit CPUs (are there still 4-bit PIC controllers?). There are mountains of 32-bit processors out there still. So the concept of "native word size" is alive and well for C developers.
With 64-bit processors, there is often good support for 32-bit operands. In practice, using 32-bit integers and floating point values can often be faster than the full word size.
Also, there are trade-offs between native word alignment and overall memory consumption when laying out C structures.
But the two common usage patterns remain: size agnostic code for improved speed (int, short, long), or fixed size (int32_t, int16_t, int64_t) for correctness or interoperability where needed.

Char vs int calling conventions

Are there any 32- or 64-bit platforms on which int8_t, int16_t, and int32_t are passed differently?
I am asking because OCaml (4.03+, which has not yet been released) can pass 32- or 64-bit integers and doubles to C directly, but not 8- or 16-bit integers or single-precision floats.
Edit: Basically I am asking if
char x(char y)
can be called as
int x(int y)
The keyword in calling convention is convention, and conventions differ for many reasons, e.g. architecture. I might be using an int8_t, but behind the scenes it may be taking up 32 bits of space, or 64 bits on 64-bit systems. As long as it behaves like an int8_t, I don't care. A language might expose a char but represent it as an int; C does this. When you dig into the C interface for languages at a much higher level of abstraction than C, you start to see some design decisions the compiler writers made to simplify their lives.
Even if there were 32- or 64-bit platforms that pass int8_t, int16_t and int32_t differently (which I doubt there are, for various reasons), that would not be an excuse for OCaml not to be able to pass such arguments. All it would have to do is follow the platform's standard ABI and the world would be alright. In all likelihood, the FFI simply doesn't implement it because it wasn't sufficiently useful to the implementers.
That being said, one main reason why you would be very unlikely to find such a platform is because most C compilers try to preserve some, at least token, compatibility with K&R C, where it was not necessary to declare function prototypes; undeclared functions will be called as if they had int arguments and return values. For such reasons, virtually all ABIs always promote all integer arguments to the native word size, so that they will be assignment-compatible with whatever concrete integer type is in fact used, even if it would be possible to pass the argument more efficiently using smaller units.
EDIT: To answer your edited question: on all somewhat standard ABIs, yes, you can call char x(char) as int x(int). One caveat, however, is that a compiler may assume the higher-order bits of the passed word are zero, so you may potentially get undefined behavior if you, for example, pass 512 as an argument that is used as a char in the C function.
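The same promotion can be watched from portable C through the default argument promotions: in a variadic (or unprototyped) call, a char argument travels as an int, which is the behavior the ABIs described above apply to every integer argument. A minimal sketch:

#include <stdio.h>
#include <stdarg.h>

static int first_as_int(int count, ...)
{
    va_list ap;
    va_start(ap, count);
    int v = va_arg(ap, int);   /* retrieved as int even if the caller passed a char */
    va_end(ap);
    return v;
}

int main(void)
{
    char c = 'A';
    printf("%d\n", first_as_int(1, c));   /* c is promoted to int: prints 65 on ASCII systems */
    return 0;
}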
Are there any 32- or 64- bit platforms on which a int8_t, int16_t, and int32_t are passed differently?
Presumably there exists one, maybe experimental or historical. But if we limit our domain of discourse to the set of architectures supported by the OCaml system (the main implementation), then the answer is no.
I am asking because OCaml (4.03+, which has not yet been released) can pass 32- or 64-bit integers and doubles to C directly, but not 8- or 16-bit integers or single-precision floats.
The reason is that OCaml has a different representation for an integer. Compared with the C representation, it is shifted to the left by one bit and incremented, so 0 is represented as C's 1, 1 as 3, and so on. The least significant bit is used to distinguish immediate values from pointers. An OCaml integer therefore effectively has one bit less, i.e., 63 bits on a 64-bit system and 31 bits on a 32-bit system. That all means that before passing an integer to a C function you need to shift it to the right, a simple and low-cost operation. The char type is also represented as an int, so it has the same issue. There are no int8_t and int16_t in the standard library, so I can't speak about them; a third-party library may introduce them in any representation, which may or may not support direct passing to a C function.
OCaml's int32, int64, nativeint and float are all represented as boxed values, i.e., they are allocated on OCaml's heap and passed around OCaml functions as pointers. The representation of the allocated block is the same as in C, which allows it to be passed as-is. Of course, this is safe only if the called function doesn't store the value somewhere and doesn't call anything that may provoke a garbage collection. Before 4.03 a programmer was required to unbox the value, either by copying the contents into a C variable (the safe way) or by dereferencing and passing it directly (a dangerous way; you should know what you're doing). The 4.03 release provides a new annotation (the unboxed attribute), a generalization of the previously existing "%float" annotation, that allows values of these four specific types to be passed directly. It works for a restricted set of function types and, of course, is safe only for specific functions.
So, to summarize: small integers are passed to C in a very fast manner. Boxed values usually require some extra work, which also brings in allocation and garbage collection. To alleviate this, 4.03 added optimizations that allow passing them directly when it is safe.
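For illustration only (the real macros live in OCaml's <caml/mlvalues.h>), the representation described above can be sketched in C: an OCaml int n is stored as the machine word 2*n + 1, with the low bit marking an immediate value rather than a pointer:

#include <stdio.h>
#include <stdint.h>

#define ENCODE_OCAML_INT(n)  ((intptr_t)(n) * 2 + 1)   /* low bit set: immediate  */
#define DECODE_OCAML_INT(v)  (((intptr_t)(v) - 1) / 2) /* strip the tag back out  */

int main(void)
{
    intptr_t v = ENCODE_OCAML_INT(5);
    printf("OCaml 5 is stored as %ld and decodes back to %ld\n",
           (long)v, (long)DECODE_OCAML_INT(v));        /* prints 11 and 5 */
    return 0;
}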
Edit: Basically I am asking if
char x(char y)
can be called as
int x(int y)
From the OCaml foreign function interface point of view there is no big difference. The only problem is that OCaml's int is smaller than C's int, so if x may return a value that doesn't fit into the OCaml representation, a bigger (and boxed) type should be used, like nativeint. In many cases this issue can be ignored, e.g., when the int is an error code, or is known to be small, or is no greater than the input parameter.

Why aren't the C-supplied integer types good enough for basically any project?

I'm much more of a sysadmin than a programmer. But I do spend an inordinate amount of time grovelling through programmers' code trying to figure out what went wrong. And a disturbing amount of that time is spent dealing with problems when the programmer expected one definition of __u_ll_int32_t or whatever (yes, I know that's not real), but either expected the file defining that type to be somewhere other than it is, or (and this is far worse but thankfully rare) expected the semantics of that definition to be something other than it is.
As I understand C, it deliberately doesn't make width definitions for integer types (and that this is a Good Thing), but instead gives the programmer char, short, int, long, and long long, in all their signed and unsigned glory, with defined minima which the implementation (hopefully) meets. Furthermore, it gives the programmer various macros that the implementation must provide to tell you things like the width of a char, the largest unsigned long, etc. And yet the first thing any non-trivial C project seems to do is either import or invent another set of types that give them explicitly 8, 16, 32, and 64 bit integers. This means that as the sysadmin, I have to have those definition files in a place the programmer expects (that is, after all, my job), but then not all of the semantics of all those definitions are the same (this wheel has been re-invented many times) and there's no non-ad-hoc way that I know of to satisfy all of my users' needs here. (I've resorted at times to making a <bits/types_for_ralph.h>, which I know makes puppies cry every time I do it.)
What does trying to define the bit-width of numbers explicitly (in a language that specifically doesn't want to do that) gain the programmer that makes it worth all this configuration management headache? Why isn't knowing the defined minima and the platform-provided MAX/MIN macros enough to do what C programmers want to do? Why would you want to take a language whose main virtue is that it's portable across arbitrarily-bitted platforms and then typedef yourself into specific bit widths?
When a C or C++ programmer (hereinafter addressed in second-person) is choosing the size of an integer variable, it's usually in one of the following circumstances:
You know (at least roughly) the valid range for the variable, based on the real-world value it represents. For example,
numPassengersOnPlane in an airline reservation system should accommodate the largest supported airplane, so needs at least 10 bits. (Round up to 16.)
numPeopleInState in a US Census tabulating program needs to accommodate the most populous state (currently about 38 million), so needs at least 26 bits. (Round up to 32.)
In this case, you want the semantics of int_leastN_t from <stdint.h>. It's common for programmers to use the exact-width intN_t here, when technically they shouldn't; however, 8/16/32/64-bit machines are so overwhelmingly dominant today that the distinction is merely academic.
You could use the standard types and rely on constraints like “int must be at least 16 bits”, but a drawback of this is that there's no standard maximum size for the integer types. If int happens to be 32 bits when you only really needed 16, then you've unnecessarily doubled the size of your data. In many cases (see below), this isn't a problem, but if you have an array of millions of numbers, then you'll get lots of page faults.
Your numbers don't need to be that big, but for efficiency reasons, you want a fast, “native” data type instead of a small one that may require time wasted on bitmasking or zero/sign-extension.
This is the int_fastN_t types in <stdint.h>. However, it's common to just use the built-in int here, which in the 16/32-bit days had the semantics of int_fast16_t. It's not the native type on 64-bit systems, but it's usually good enough.
The variable is an amount of memory, array index, or casted pointer, and thus needs a size that depends on the amount of addressable memory.
This corresponds to the typedefs size_t, ptrdiff_t, intptr_t, etc. You have to use typedefs here because there is no built-in type that's guaranteed to be memory-sized.
The variable is part of a structure that's serialized to a file using fread/fwrite, or called from a non-C language (Java, COBOL, etc.) that has its own fixed-width data types.
In these cases, you truly do need an exact-width type.
You just haven't thought about the appropriate type, and use int out of habit.
Often, this works well enough.
So, in summary, all of the typedefs from <stdint.h> have their use cases. However, the usefulness of the built-in types is limited due to:
Lack of maximum sizes for these types.
Lack of a native memsize type.
The arbitrary choice between LP64 (on Unix-like systems) and LLP64 (on Windows) data models on 64-bit systems.
As for why there are so many redundant typedefs of fixed-width (WORD, DWORD, __int64, gint64, FINT64, etc.) and memsize (INT_PTR, LPARAM, VPTRDIFF, etc.) integer types, it's mainly because <stdint.h> came late in C's development, and people are still using older compilers that don't support it, so libraries need to define their own. Same reason why C++ has so many string classes.
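A compact sketch putting those cases side by side (the struct and function names are invented):

#include <stddef.h>
#include <stdint.h>

struct census_row {
    int_least32_t people_in_state;   /* case 1: known real-world range          */
    uint32_t      on_disk_id;        /* exact width for a serialized file field */
};

int_least32_t total_people(const struct census_row *rows, size_t n)
{
    int_fast32_t sum = 0;            /* a fast "native-ish" working type        */
    for (size_t i = 0; i < n; i++)   /* size_t: the memory-sized index type     */
        sum += rows[i].people_in_state;
    return (int_least32_t)sum;
}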
Sometimes it is important. For example, most image file formats require an exact number of bits/bytes be used (or at least specified).
If you only wanted to share a file created by the same compiler on the same computer architecture, you would be correct (or at least things would work). But, in real life things like file specifications and network packets are created by a variety of computer architectures and compilers, so we have to care about the details in these case (at least).
The main reason the fundamental types can't be fixed is that a few machines don't use 8-bit bytes. Enough programmers don't care, or actively want not to be bothered with support for such beasts, that the majority of well-written code demands a specific number of bits wherever overflow would be a concern.
It's better to specify a required range than to use int or long directly, because asking for "relatively big" or "relatively small" is fairly meaningless. The point is to know what inputs the program can work with.
By the way, usually there's a compiler flag that will adjust the built-in types. See INT_TYPE_SIZE for GCC. It might be cleaner to stick that into the makefile, than to specialize the whole system environment with new headers.
If you want portable code, you want the code your write to function identically on all platforms. If you have
int i = 32767;
you can't say for certain what i+1 will give you on all platforms.
This is not portable. Some compilers (on the same CPU architecture!) will give you -32768 and some will give you 32768. Some perverted ones will give you 0. That's a pretty big difference. Granted, if it overflows, this is undefined behavior, but you don't know whether it is UB unless you know exactly what the size of int is.
If you use the standard integer definitions (from <stdint.h>, standardized in ISO/IEC 9899:1999), then you know exactly what i+1 will give you.
int16_t i = 32767;
i+1 gives 32768 (after promotion to int); storing it back into i wraps to -32768 on most compilers
uint16_t j = 32767;
j+1 gives 32768;
int8_t i = 32767; // should be a warning but maybe not; most compilers will set i to -1
i+1 gives 0; // in this case the addition didn't overflow
uint8_t j = 32767; // should be a warning but maybe not; most compilers will set j to 255
j+1 gives 256; // storing it back into a uint8_t would wrap to 0
int32_t i = 32767;
i+1 gives 32768;
uint32_t j = 32767;
j+1 gives 32768;
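Those snippets collapse into one runnable check. The printed values assume a typical implementation with 32-bit int and two's complement storage; note the narrow operands are promoted to int before the addition, so any wrapping happens when the result is stored back:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int16_t  a = 32767;
    uint16_t b = 32767;
    int8_t   c = -1;     /* what most compilers would store for an out-of-range 32767 */
    uint8_t  d = 255;

    printf("%d %d %d %d\n", a + 1, b + 1, c + 1, d + 1);   /* 32768 32768 0 256 */

    a = a + 1;           /* storing 32768 back into an int16_t wraps to -32768 here */
    printf("%d\n", a);
    return 0;
}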
There are two opposing forces at play here:
The need for C to adapt to any CPU architecture in a natural way.
The need for data transferred to/from a program (network, disk, file, etc.) so that a program running on any architecture can correctly interpret it.
The "CPU matching" need has to do with inherent efficiency. There is CPU quantity which is most easily handled as a single unit which all arithmetic operations easily and efficiently are performed on, and which results in the need for the fewest bits of instruction encoding. That type is int. It could be 16 bits, 18 bits*, 32 bits, 36 bits*, 64 bits, or even 128 bits on some machines. (* These were some not-well-known machines from the 1960s and 1970s which may have never had a C compiler.)
Data transfer needs when transferring binary data require that record fields are the same size and alignment. For this it is quite important to have control of data sizes. There is also endianness and maybe binary data representations, like floating point representations.
A program which forces all integer operations to be 32 bit in the interests of size compatibility will work well on some CPU architectures, but not others (especially 16 bit, but also perhaps some 64-bit).
Using the CPU's native register size is preferable if all data interchange is done in a non-binary format, like XML or SQL (or any other ASCII encoding).

Does the size of integers or other data types in C depend on the underlying architecture?

#include <stdio.h>

int main()
{
    int c;
    return 0;
} // on an Intel architecture

#include <stdio.h>

int main()
{
    int c;
    return 0;
} // on an AMD architecture

/*
Here I have the same code on two different machines, and I want to know:
is the size of the data types dependent on the machine?
*/
see here:
size guarantee for integral/arithmetic types in C and C++
Fundamental C type sizes depend on the implementation (compiler) and architecture; however, they have some guaranteed lower bounds. One should therefore never hardcode type sizes and instead use sizeof(TYPENAME) to get their length in bytes.
Quick answer: Yes, mostly, but ...
The sizes of types in C are dependent on the decisions of compiler writers, subject to the requirements of the standard.
The decisions of compiler writers tend to be strongly influenced by the CPU architecture. For example, the C standard says:
A "plain" int object has the natural size suggested by the
architecture of the execution environment.
though that leaves a lot of room for judgement.
Such decisions can also be influenced by other considerations, such as compatibility with compilers from the same vendor for other architectures and the convenience of having types for each supported size. For example, on a 64-bit system, the obvious "natural size" for int is 64 bits, but many compilers still have 32-bit int. (With 8-bit char and 64-bit int, short would probably be either 16 or 32 bits, and you couldn't have fundamental integer types covering both sizes.)
(C99 introduces "extended integer types", which could solve the issue of covering all the supported sizes, but I don't know of any compiler that implements them.)
Yes. The size of the basic datatypes depends on the underlying CPU architecture. ISO C (and C++) guarantees only minimum sizes for datatypes.
But it's not consistent across compiler vendors for the same CPU. Consider that there are compilers with 32-bit long ints for Intel 386 CPUs, and other compilers that give you 64-bit longs.
And don't forget about the decade or so of pain that MS programmers had to deal with during the era of the Intel 286 machines, what with all of the different "memory models" that compilers forced on us. 16-bit pointers versus 32-bit segmented pointers. I for one am glad that those days are gone.
It usually does, for performance reasons. The C standard defines the minimum value ranges for all types like char, short, int, long, long long and their unsigned counterparts.
However, x86 CPUs from Intel and AMD are essentially the same hardware as far as most x86 compilers are concerned. At least, they expose the same registers and instructions to the programmer, and most of them operate identically (if we consider what's officially defined and documented).
At any rate, it's up to the compiler or its developer(s) to use any other size, not necessarily matching the natural operand size on the target hardware as long as that size agrees with the C standard.

Would making plain int 64-bit break a lot of reasonable code?

Until recently, I'd considered the decision by most systems implementors/vendors to keep plain int 32-bit even on 64-bit machines a sort of expedient wart. With modern C99 fixed-size types (int32_t and uint32_t, etc.) the need for there to be a standard integer type of each size 8, 16, 32, and 64 mostly disappears, and it seems like int could just as well be made 64-bit.
However, the biggest real consequence of the size of plain int in C comes from the fact that C essentially does not have arithmetic on smaller-than-int types. In particular, if int is larger than 32-bit, the result of any arithmetic on uint32_t values has type signed int, which is rather unsettling.
Is this a good reason to keep int permanently fixed at 32-bit on real-world implementations? I'm leaning towards saying yes. It seems to me like there could be a huge class of uses of uint32_t which break when int is larger than 32 bits. Even applying the unary minus or bitwise complement operator becomes dangerous unless you cast back to uint32_t.
Of course the same issues apply to uint16_t and uint8_t on current implementations, but everyone seems to be aware of and used to treating them as "smaller-than-int" types.
As you say, I think that the promotion rules really are the killer. uint32_t would then promote to int and all of a sudden you'd have signed arithmetic where almost everybody expects unsigned.
This would be mostly hidden in places where you do just arithmetic and assign back to a uint32_t. But it could be deadly in places where you compare against constants. Whether code that relies on such comparisons without an explicit cast is reasonable, I don't know. Casting constants like (uint32_t)1 can become quite tedious. I personally at least always use the suffix U for constants that I want to be unsigned, but this is already not as readable as I would like.
Also have in mind that uint32_t etc are not guaranteed to exist. Not even uint8_t. The enforcement of that is an extension from POSIX. So in that sense C as a language is far from being able to make that move.
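The trap can be demonstrated today with uint16_t on an ordinary 32-bit-int implementation, which is exactly the position uint32_t would be in if int were 64 bits:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t x = 0;
    if (x - 1 < 0)        /* x promotes to signed int, so this is true; it would be
                             false for uint32_t while int is 32 bits */
        printf("signed arithmetic crept in\n");
    printf("%u\n", (unsigned)(uint16_t)(x - 1));   /* casting back recovers 65535 */
    return 0;
}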
"Reasonable Code"...
Well... the thing about development is, you write and fix it and then it works... and then you stop!
And maybe you've been burned a lot so you stay well within the safe ranges of certain features, and maybe you haven't been burned in that particular way so you don't realize that you're relying on something that could kind-of change.
Or even that you're relying on a bug.
On olden Mac 68000 compilers, int was 16 bit and long was 32. But even then most extant C code assumed an int was 32, so typical code you found on a newsgroup wouldn't work. (Oh, and Mac didn't have printf, but I digress.)
So, what I'm getting at is, yes, if you change anything, then some things will break.
With modern C99 fixed-size types (int32_t and uint32_t, etc.) the need for there to be a standard integer type of each size 8, 16, 32, and 64 mostly disappears,
C99 has fixed-sized typeDEFs, not fixed-size types. The native C integer types are still char, short, int, long, and long long. They are still relevant.
The problem with ILP64 is that it has a great mismatch between C types and C99 typedefs.
int8_t = char
int16_t = short
int32_t = nonstandard type
int64_t = int, long, or long long
From 64-Bit Programming Models: Why LP64?:
Unfortunately, the ILP64 model does not provide a natural way to describe 32-bit data types, and must resort to non-portable constructs such as __int32 to describe such types. This is likely to cause practical problems in producing code which can run on both 32 and 64 bit platforms without #ifdef constructions. It has been possible to port large quantities of code to LP64 models without the need to make such changes, while maintaining the investment made in data sets, even in cases where the typing information was not made externally visible by the application.
DEC Alpha and OSF/1 Unix was one of the first 64-bit versions of Unix, and it used 64-bit integers - an ILP64 architecture (meaning int, long and pointers were all 64-bit quantities). It caused lots of problems.
One issue I've not seen mentioned - which is why I'm answering at all after so long - is that if you have a 64-bit int, what size do you use for short? Both 16 bits (the classical, change nothing approach) and 32 bits (the radical 'well, a short should be half as long as an int' approach) will present some problems.
With the C99 <stdint.h> and <inttypes.h> headers, you can code to fixed size integers - if you choose to ignore machines with 36-bit or 60-bit integers (which is at least quasi-legitimate). However, most code is not written using those types, and there are typically deep-seated and largely hidden (but fundamentally flawed) assumptions in the code that will be upset if the model departs from the existing variations.
Notice Microsoft's ultra-conservative LLP64 model for 64-bit Windows. That was chosen because too much old code would break if the 32-bit model was changed. However, code that had been ported to ILP64 or LP64 architectures was not immediately portable to LLP64 because of the differences. Conspiracy theorists would probably say it was deliberately chosen to make it more difficult for code written for 64-bit Unix to be ported to 64-bit Windows. In practice, I doubt whether that was more than a happy (for Microsoft) side-effect; the 32-bit Windows code had to be revised a lot to make use of the LP64 model too.
There's one code idiom that would break if ints were 64-bits, and I see it often enough that I think it could be called reasonable:
checking if a value is negative by testing if ((val & 0x80000000) != 0)
This is commonly found when checking error codes. Many error code standards (like Windows' HRESULT) use bit 31 to represent an error. And code will sometimes check for that error either by testing bit 31 or by checking if the error is a negative number.
Microsoft's macros for testing HRESULT use both methods - and I'm sure there's a ton of code out there that does similar without using the SDK macros. If MS had moved to ILP64, this would be one area that caused porting headaches that are completely avoided with the LLP64 model (or the LP64 model).
Note: if you're not familiar with terms like "ILP64", please see the mini-glossary at the end of the answer.
I'm pretty sure there's a lot of code (not necessarily Windows-oriented) out there that uses plain-old-int to hold error codes, assuming that those ints are 32 bits in size. And I bet there's a lot of code with that error status scheme that also uses both kinds of checks (< 0 and bit 31 being set) and which would break if moved to an ILP64 platform. These checks could be made to continue to work correctly either way if the error codes were carefully constructed so that sign extension took place, but again, many such systems I've seen construct the error values by OR-ing together a bunch of bit fields.
Anyway, I don't think this is an unsolvable problem by any means, but I do think it's a fairly common coding practice that would cause a lot of code to require fixing up if moved to an ILP64 platform.
Note that I also don't think this was one of the foremost reasons for Microsoft to choose the LLP64 model (I think that decision was largely driven by binary data compatibility between 32-bit and 64-bit processes, as mentioned in MSDN and on Raymond Chen's blog).
Mini-Glossary for the 64-bit Platform Programming Model terminology:
ILP64: int, long, pointers are 64-bits
LP64: long and pointers are 64-bits, int is 32-bits (used by many (most?) Unix platforms)
LLP64: long long and pointers are 64-bits, int and long remain 32-bits (used on Win64)
For more information on 64-bit programming models, see "64-bit Programming Models: Why LP64?"
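The divergence between the two checks can be simulated today by letting int64_t stand in for an ILP64 int (the error value below is just an example HRESULT-style code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* A 32-bit failure code with bit 31 set, held in a 64-bit integer
       without sign extension -- what an ILP64 int could easily end up with. */
    int64_t err = 0x80004005;

    if (err & 0x80000000)          /* the bit-31 test still reports an error */
        printf("bit-31 check: error\n");
    if (err < 0)                   /* the sign test no longer fires          */
        printf("sign check: error\n");
    else
        printf("sign check: success (!)\n");
    return 0;
}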
While I don't personally write code like this, I'll bet that it's out there in more than one place... and of course it'll break if you change the size of int.
int i, x = getInput();
for (i = 0; i < 32; i++)
{
    if (x & (1 << i))
    {
        // Do something
    }
}
Well, it's not like this story is all new. With "most computers" I assume you mean desktop computers. There already has been a transition from 16-bit to 32-bit int. Is there anything at all that says the same progression won't happen this time?
Not particularly. int is 64 bit on some 64 bit architectures (not x64).
The standard does not actually guarantee you get 32 bit integers, just that (u)int32_t can hold one.
Now if you are depending on int being the same size as ptrdiff_t, you may be broken.
Remember, C does not guarantee that the machine even is a binary machine.
