Simple 32 to 64-bit transition? - c

Is there a simple way to compile 32-bit C code into a 64-bit application, with minimal modification? The code was not set up to use fixed type sizes.
I am not interested in taking advantage of 64-bit memory addressing. I just need to compile into a 64-bit binary while maintaining 4-byte longs and pointers.
Something like:
#define long int32_t
But of course that breaks a number of long use cases and doesn't deal with pointers. I thought there might be some standard procedure here.

There seem to be two orthogonal notions of "portability":
My code compiles everywhere out of the box. Its general behaviour is the same on all platforms, but details of available features vary depending on the platform's characteristics.
My code contains a folder for architecture-dependent stuff. I guarantee that MYINT32 is always 32 bit no matter what. I successfully ported the notion of 32 bits to the nine-fingered furry lummoxes of Mars.
In the first approach, we write unsigned int n; and printf("%u", n) and we know that the code always works, but details like the numeric range of unsigned int are up to the platform and not of our concern. (wchar_t comes in here, too.) This is what I would call the genuinely portable style.
In the second approach, we typedef everything and use types like uint32_t. Formatted output with printf triggers tons of warnings, and we must resort to monsters like PRIu32. In this approach we derive a strange sense of power and control from knowing that our integer is always 32 bits wide, but I hesitate to call this "portable" -- it's just stubborn.
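To make the contrast concrete (a minimal sketch of my own, not tied to any particular codebase):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    unsigned int n = 42;   /* portable style: the exact range is the platform's business */
    printf("%u\n", n);

    uint32_t m = 42;       /* fixed-width style: needs the <inttypes.h> format macros */
    printf("%" PRIu32 "\n", m);
    return 0;
}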
The fundamental concept that requires a specific representation is serialization: The document you write on one platform should be readable on all other platforms. Serialization is naturally where we forgo the type system, must worry about endianness and need to decide on a fixed representation (including things like text encoding).
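A serialization boundary might look roughly like this (a sketch; the big-endian choice and the helper name are mine):

#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit value in a fixed (big-endian) byte order,
   independent of the host's integer sizes and endianness. */
static void put_u32_be(uint32_t v, FILE *out)
{
    unsigned char buf[4];
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)(v);
    fwrite(buf, 1, sizeof buf, out);
}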
The upshot is this:
Write your main program core in portable style using standard language primitives.
Write well-defined, clean I/O interfaces for serialization.
If you stick to that, you should never even have to think about whether your platform is 32 or 64 bit, big or little endian, Mac or PC, Windows or Linux. Stick to the standard, and the standard will stick with you.

No, this is not, in general, possible. Consider, for example, malloc(). What is supposed to happen when it returns a pointer value that cannot be represented in 32 bits? How could that pointer value possibly be passed to your code as a 32-bit value that will work when dereferenced?
This is just one example - there are numerous other similar ones.
Well-written C code isn't inherently "32-bit" or "64-bit" anyway - it should work fine when recompiled as a 64 bit binary with no modifications necessary.
Your actual problem is wanting to load a 32 bit library into a 64 bit application. One way to do this is to write a 32 bit helper application that loads your 32 bit library, and a 64 bit shim library that is loaded into the 64 bit application. Your 64 bit shim library communicates with your 32 bit helper using some IPC mechanism, requesting the helper application to perform operations on its behalf, and returning the results.
The specific case - a Matlab MEX file - might be a bit complicated (you'll need two-way function calling, so that the 64 bit shim library can perform calls like mexGetVariable() on behalf of the 32 bit helper), but it should still be doable.

The one area that will probably bite you is if any of your 32-bit integers are manipulated bit-wise. If you assume that some status flags are stored in a 32-bit register (for example), or if you are doing bit shifting, then you'll need to focus on those.
Another place to look would be any networking code that assumes the size (and endianness) of integers passed on the wire. Once those get moved into 64-bit ints you'll need to make sure that you don't lose sign bits or precision.
Structures that contain integers will no longer be the same size. Any assumptions about size and alignment need to be cleaned out.
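As a sketch of the structure-size issue (assuming a typical ILP32-to-LP64 transition, where long and pointers grow to 8 bytes):

#include <stdio.h>

struct record {
    int   id;     /* 4 bytes on both ILP32 and LP64     */
    long  count;  /* 4 bytes on ILP32, 8 bytes on LP64  */
    char *name;   /* 4 bytes on ILP32, 8 bytes on LP64  */
};

int main(void)
{
    /* Typically 12 on a 32-bit build and 24 on an LP64 build,
       because padding is inserted after id to realign count. */
    printf("sizeof(struct record) = %zu\n", sizeof(struct record));
    return 0;
}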

What was with the historical typedef soup for integers in C programs?

This is a possibly inane question whose answer I should probably know.
Fifteen years ago or so, a lot of C code I'd look at had tons of integer typedefs in platform-specific #ifdefs. It seemed every program or library I looked at had their own, mutually incompatible typedef soup. I didn't know a whole lot about programming at the time and it seemed like a bizarre bunch of hoops to jump through just to tell the compiler what kind of integer you wanted to use.
I've put together a story in my mind to explain what those typedefs were about, but I don't actually know whether it's true. My guess is basically that when C was first developed and standardized, it wasn't realized how important it was to be able to platform-independently get an integer type of a certain size, and thus all the original C integer types may be of different sizes on different platforms. Thus everyone trying to write portable C code had to do it themselves.
Is this correct? If so, how were programmers expected to use the C integer types? I mean, in a low level language with a lot of bit twiddling, isn't it important to be able to say "this is a 32 bit integer"? And since the language was standardized in 1989, surely there was some thought that people would be trying to write portable code?
When C began, computers were less homogeneous and a lot less connected than today. It was seen as more important for portability that the int types be the natural size(s) for the computer. Asking for an exactly 32-bit integer type on a 36-bit system is probably going to result in inefficient code.
And then along came pervasive networking, where you work with specific on-the-wire size fields. Now interoperability looks a whole lot different, and the 'octet' becomes the de facto quantum of data types.
Now you need ints that are exact multiples of 8 bits, so you get typedef soup; eventually the standard catches up, we get standard names for those types, and the soup is not as needed.
C's early success was due to its flexibility in adapting to nearly all existing variant architectures (as @John Hascall notes), with:
1) native integer sizes of 8, 16, 18, 24, 32, 36, etc. bits,
2) variant signed-integer models: 2's complement, 1's complement, sign-magnitude, and
3) various endiannesses: big, little and others.
As coding developed, algorithms and interchange of data pushed for greater uniformity, and so a need arose for types that met 1 & 2 above across platforms. Coders rolled their own, like typedef int int32 inside a #if .... The many variations of that created the soup the OP noted.
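The rolled-your-own soup typically looked something like this (a reconstruction from memory; the platform macros are only illustrative):

/* Pre-C99 "typedef soup": every project picked its own names and
   guessed the sizes per platform. */
#if defined(__alpha) || defined(__LP64__)
typedef int           int32;
typedef unsigned int  uint32;
#elif defined(MSDOS) || defined(_WIN16)
typedef long          int32;
typedef unsigned long uint32;
#else
typedef int           int32;
typedef unsigned int  uint32;
#endif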
C99 introduced (u)int_leastN_t, (u)int_fastN_t and (u)intmax_t to provide portable types with at least a given bit width. These types are required for N = 8, 16, 32, 64.
Also introduced are semi-optional types (see below **) like (u)intN_t, which carry the additional requirements that they use 2's complement and have no padding bits. It is these popular types that are so widely desired and used to thin out the integer soup.
how were programmers expected to use the C integer types?
By writing flexible code that did not strongly rely on bit width. It is fairly easy to code strtol() using only LONG_MIN and LONG_MAX, without regard to bit width/endianness/integer encoding.
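For instance, the overflow check in a strtol()-style digit loop needs only LONG_MAX (a simplified sketch that ignores sign handling and errno):

#include <limits.h>

/* Accumulate decimal digits into a long, stopping before overflow.
   Nothing here assumes how many bits a long has. */
long accumulate(const char *s)
{
    long result = 0;
    while (*s >= '0' && *s <= '9') {
        int digit = *s - '0';
        if (result > (LONG_MAX - digit) / 10)
            break;                  /* the next step would overflow */
        result = result * 10 + digit;
        s++;
    }
    return result;
}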
Yet many coding tasks oblige precise-width types and 2's complement for easy, high-performance coding. In that case it is better to forgo portability to 36-bit machines and 32-bit sign-magnitude ones and stick with power-of-two-wide (2's complement for signed) integers. Various CRC and crypto algorithms and file formats come to mind. Thus the need for fixed-width types and a specified (C99) way to get them.
Today there are still gotchas that need to be managed. Example: the usual promotions to int/unsigned lose some control, as those types may be 16, 32 or 64 bits wide.
**
These types are optional. However, if an implementation provides integer types with widths of 8, 16, 32, or 64 bits, no padding bits, and (for the signed types) that have a two's complement representation, it shall define the corresponding typedef names. (C11 §7.20.1.1 Exact-width integer types, paragraph 3)
I remember that period and I'm guilty of doing the same!
One issue was the size of int: it could be the same as short, or long, or in between. For example, if you were working with binary file formats, it was imperative that everything align. Byte ordering complicated things as well. Many developers went the lazy route and just did fwrite of whatever, instead of picking numbers apart byte by byte. When the machines upgraded to longer word lengths, all hell broke loose. So typedef was an easy hack to fix that.
If performance was an issue, as it often was back then, int was guaranteed to be the machine's fastest natural size, but if you needed 32 bits, and int was shorter than that, you were in danger of rollover.
In the C language, sizeof() is not supposed to be resolved at the preprocessor stage, which made things complicated because you couldn't do #if sizeof(int) == 4 for example.
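The usual workaround was to test the <limits.h> macros instead, since those are visible to the preprocessor (a sketch):

#include <limits.h>

/* sizeof cannot appear in #if, but the limits macros can. */
#if UINT_MAX == 0xFFFFFFFFu
typedef int           int32;
typedef unsigned int  uint32;
#elif ULONG_MAX == 0xFFFFFFFFu
typedef long          int32;
typedef unsigned long uint32;
#else
#error "no 32-bit integer type found"
#endif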
Personally, some of the rationale was also just working from an assembler language mindset and not being willing to abstract out the notion of what short, int and long are for. Back then, assembler was used in C quite frequently.
Nowadays, there are plenty of non-binary file formats, JSON, XML, etc. where it doesn't matter what the binary representation is. As well, many popular platforms have settled on a 32-bit int or longer, which is usually enough for most purposes, so there's less of an issue with rollover.
C is a product of the early 1970s, when the computing ecosystem was very different. Instead of millions of computers all talking to each other over an extended network, you had maybe a hundred thousand systems worldwide, each running a few monolithic apps, with almost no communication between systems. You couldn't assume that any two architectures had the same word sizes, or represented signed integers in the same way. The market was still small enough that there wasn't any perceived need to standardize, computers didn't talk to each other (much), and nobody thought much about portability.
If so, how were programmers expected to use the C integer types?
If you wanted to write maximally portable code, then you didn't assume anything beyond what the Standard guaranteed. In the case of int, that meant you didn't assume that it could represent anything outside of the range [-32767,32767], nor did you assume that it would be represented in 2's complement, nor did you assume that it was a specific width (it could be wider than 16 bits, yet still only represent a 16 bit range if it contained any padding bits).
If you didn't care about portability, or you were doing things that were inherently non-portable (which bit twiddling usually is), then you used whatever type(s) met your requirements.
I did mostly high-level applications programming, so I was less worried about representation than I was about range. Even so, I occasionally needed to dip down into binary representations, and it always bit me in the ass. I remember writing some code in the early '90s that had to run on classic MacOS, Windows 3.1, and Solaris. I created a bunch of enumeration constants for 32-bit masks, which worked fine on the Mac and Unix boxes, but failed to compile on the Windows box because on Windows an int was only 16 bits wide.
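The failing construct was along these lines (reconstructed; the names are made up):

/* In C89 every enumeration constant must fit in an int, so this
   compiles where int is 32 bits but fails where int is 16 bits. */
enum window_flags {
    WF_VISIBLE  = 0x00000001,
    WF_MODIFIED = 0x00010000,   /* already needs more than 16 bits */
    WF_LOCKED   = 0x01000000
};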
C was designed as a language that could be ported to as wide a range of machines as possible, rather than as a language that would allow most kinds of programs to be run without modification on such a range of machines. For most practical purposes, C's types were:
An 8-bit type if one is available, or else the smallest type that's at least 8 bits.
A 16-bit type, if one is available, or else the smallest type that's at least 16 bits.
A 32-bit type, if one is available, or else some type that's at least 32 bits.
A type which will be 32 bits if systems can handle such things as efficiently as 16-bit types, or 16 bits otherwise.
If code needed 8, 16, or 32-bit types and would be unlikely to be usable on machines which did not support them, there wasn't any particular problem with such code regarding char, short, and long as 8, 16, and 32 bits, respectively. The only systems that didn't map those names to those types would be those which couldn't support those types and wouldn't be able to usefully handle code that required them. Such systems would be limited to writing code which had been written to be compatible with the types that they use.
I think C could perhaps best be viewed as a recipe for converting system specifications into language dialects. A system which uses 36-bit memory won't really be able to efficiently process the same language dialect as a system that uses octet-based memory, but a programmer who learns one dialect would be able to learn another merely by learning what integer representations the latter one uses. It's much more useful to tell a programmer who needs to write code for a 36-bit system, "This machine is just like the other machines, except char is 9 bits, short is 18 bits, and long is 36 bits", than to say "You have to use assembly language because other languages would all require integer types this system can't process efficiently".
Not all machines have the same native word size. While you might be tempted to think a smaller variable size will be more efficient, it just ain't so. In fact, using a variable that is the same size as the native word size of the CPU is much, much faster for arithmetic, logical and bit manipulation operations.
But what, exactly, is the "native word size"? Almost always, this means the register size of the CPU, which is the same size the Arithmetic Logic Unit (ALU) can work with.
In embedded environments, there are still such things as 8 and 16 bit CPUs (are there still 4-bit PIC controllers?). There are mountains of 32-bit processors out there still. So the concept of "native word size" is alive and well for C developers.
With 64-bit processors, there is often good support for 32-bit operands. In practice, using 32-bit integers and floating point values can often be faster than the full word size.
Also, there are trade-offs between native word alignment and overall memory consumption when laying out C structures.
But the two common usage patterns remain: size agnostic code for improved speed (int, short, long), or fixed size (int32_t, int16_t, int64_t) for correctness or interoperability where needed.

uint32_t vs int as a convention for everyday programming

When should one use the datatypes from stdint.h?
Is it right to always use them as a convention?
What was the purpose of the design of nonspecific size types like int and short?
When should one use the datatypes from stdint.h?
When the programming task specifies the integer width, especially to accommodate some file or communication protocol format.
When a high degree of portability between platforms is required over performance.
Is it right to always use them as a convention, then?
Things are leaning that way. The fixed-width types are a more recent addition to C. Original C had char, short, int, long, and that was progressive, as it tried, without being too specific, to accommodate the various integer sizes available across a wide variety of processors and environments. As C is 40-ish years old, that speaks to the success of the strategy. Much C code has been written that successfully copes with the soft integer size specification. With increasing needs for consistency, char, short, int, long and long long are not enough (or at least not so easy), and so int8_t, int16_t, int32_t, int64_t were born. New languages tend to require very specific fixed-size integer types and 2's complement. As they succeed, that Darwinian pressure will press on C. My crystal ball says we will see a slow migration to increasing use of fixed-width types in C.
What was the purpose of the design of nonspecific size types like int and short?
It was a good first step to accommodate the wide variety of integer widths (8, 9, 12, 18, 36, etc.) and encodings (2's, 1's, sign/mag). So much coding today uses power-of-2-sized integers with 2's complement that one may not realize that many other arrangements existed beforehand. See this answer also.
My work demands that I use them and I actually love using them.
I find it useful when I have to implement a protocol and use them inside a structure which can be a message that needs to be sent out or a holder of certain information.
If I have to use a sequence number that needs to be incremented, I wouldn't use int because sequence numbers aren't supposed to be negative. I use uint32_t instead. I will hence know the sequence number space and can plan/code accordingly.
The code we write will be running on 32- as well as 64-bit machines, so using "int" on machines of different widths results in subtle bugs which can be a pain to identify. Using uint16_t will allocate 16 bits on either a 32- or a 64-bit architecture.
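As an illustration (a hypothetical message layout, not from any real protocol):

#include <stdint.h>

/* Every field has an exact width, so the logical layout is the same
   on 32- and 64-bit builds. */
struct msg_header {
    uint16_t type;
    uint16_t length;
    uint32_t sequence;   /* wraps around instead of going negative */
};

Byte order and structure padding still have to be dealt with explicitly before such a structure goes on the wire.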
No, I would say it's never a good idea to use those for general-purpose programming.
If you really care about number of bits, then go ahead and use them but for most general use you don't care so then use the general types. The general types might be faster, and they are certainly easier to read and write.
Fixed-width datatypes should be used only when really required (e.g. when implementing transfer protocols, accessing hardware, or requiring a certain range of values (you should use the ..._least_... variant there)). Otherwise your program won't adapt to changed environments (e.g. using uint32_t for file sizes might have been OK 10 years ago, but off_t will adapt to recent needs). As others have pointed out, there might also be a performance impact, as int might be faster than uint32_t on 16-bit platforms.
int itself is very problematic due to its signedness; it is better to use e.g. size_t when a variable holds the result of strlen() or sizeof().
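A small sketch of that habit:

#include <stddef.h>
#include <string.h>

/* size_t matches what strlen() and sizeof actually return, on any platform,
   and avoids mixing signed and unsigned arithmetic. */
size_t count_char(const char *s, char c)
{
    size_t n = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == c)
            n++;
    return n;
}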

Would making plain int 64-bit break a lot of reasonable code?

Until recently, I'd considered the decision by most systems implementors/vendors to keep plain int 32-bit even on 64-bit machines a sort of expedient wart. With modern C99 fixed-size types (int32_t and uint32_t, etc.) the need for there to be a standard integer type of each size 8, 16, 32, and 64 mostly disappears, and it seems like int could just as well be made 64-bit.
However, the biggest real consequence of the size of plain int in C comes from the fact that C essentially does not have arithmetic on smaller-than-int types. In particular, if int is larger than 32-bit, the result of any arithmetic on uint32_t values has type signed int, which is rather unsettling.
Is this a good reason to keep int permanently fixed at 32-bit on real-world implementations? I'm leaning towards saying yes. It seems to me like there could be a huge class of uses of uint32_t which break when int is larger than 32 bits. Even applying the unary minus or bitwise complement operator becomes dangerous unless you cast back to uint32_t.
Of course the same issues apply to uint16_t and uint8_t on current implementations, but everyone seems to be aware of and used to treating them as "smaller-than-int" types.
As you say, I think that the promotion rules really are the killer. uint32_t would then promote to int and all of a sudden you'd have signed arithmetic where almost everybody expects unsigned.
This would be mostly hidden in places where you only do arithmetic and assign back to a uint32_t. But it could be deadly in places where you do comparisons against constants. Whether code that relies on such comparisons without an explicit cast is reasonable, I don't know. Casting constants like (uint32_t)1 can become quite tedious. I personally at least always use the suffix U for constants that I want to be unsigned, but this already is not as readable as I would like.
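A sketch of the surprise on a hypothetical ILP64 platform:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t a = 0, b = 1;

    /* With a 32-bit int, uint32_t does not promote: a - b wraps to
       0xFFFFFFFF and the test is false.  If int were 64 bits, both
       operands would promote to signed int, a - b would be -1, and
       the test would unexpectedly be true. */
    if ((a - b) < 0)
        puts("negative");
    else
        puts("not negative");
    return 0;
}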
Also bear in mind that uint32_t etc. are not guaranteed to exist -- not even uint8_t. Requiring them is a POSIX extension. So in that sense, C as a language is far from being able to make that move.
"Reasonable Code"...
Well... the thing about development is, you write and fix it and then it works... and then you stop!
And maybe you've been burned a lot so you stay well within the safe ranges of certain features, and maybe you haven't been burned in that particular way so you don't realize that you're relying on something that could kind-of change.
Or even that you're relying on a bug.
On olden Mac 68000 compilers, int was 16 bits and long was 32. But even then most extant C code assumed an int was 32, so typical code you found on a newsgroup wouldn't work. (Oh, and Mac didn't have printf, but I digress.)
So, what I'm getting at is, yes, if you change anything, then some things will break.
With modern C99 fixed-size types (int32_t and uint32_t, etc.) the need for there to be a standard integer type of each size 8, 16, 32, and 64 mostly disappears,
C99 has fixed-sized typeDEFs, not fixed-size types. The native C integer types are still char, short, int, long, and long long. They are still relevant.
The problem with ILP64 is that it has a great mismatch between C types and C99 typedefs.
int8_t = char
int16_t = short
int32_t = nonstandard type
int64_t = int, long, or long long
From 64-Bit Programming Models: Why LP64?:
Unfortunately, the ILP64 model does not provide a natural way to describe 32-bit data types, and must resort to non-portable constructs such as __int32 to describe such types. This is likely to cause practical problems in producing code which can run on both 32 and 64 bit platforms without #ifdef constructions. It has been possible to port large quantities of code to LP64 models without the need to make such changes, while maintaining the investment made in data sets, even in cases where the typing information was not made externally visible by the application.
OSF/1 Unix on the DEC Alpha was one of the first 64-bit versions of Unix, and it used 64-bit integers - an ILP64 architecture (meaning int, long and pointers were all 64-bit quantities). It caused lots of problems.
One issue I've not seen mentioned - which is why I'm answering at all after so long - is that if you have a 64-bit int, what size do you use for short? Both 16 bits (the classical, change nothing approach) and 32 bits (the radical 'well, a short should be half as long as an int' approach) will present some problems.
With the C99 <stdint.h> and <inttypes.h> headers, you can code to fixed size integers - if you choose to ignore machines with 36-bit or 60-bit integers (which is at least quasi-legitimate). However, most code is not written using those types, and there are typically deep-seated and largely hidden (but fundamentally flawed) assumptions in the code that will be upset if the model departs from the existing variations.
Notice Microsoft's ultra-conservative LLP64 model for 64-bit Windows. That was chosen because too much old code would break if the 32-bit model was changed. However, code that had been ported to ILP64 or LP64 architectures was not immediately portable to LLP64 because of the differences. Conspiracy theorists would probably say it was deliberately chosen to make it more difficult for code written for 64-bit Unix to be ported to 64-bit Windows. In practice, I doubt whether that was more than a happy (for Microsoft) side-effect; the 32-bit Windows code had to be revised a lot to make use of the LLP64 model too.
There's one code idiom that would break if ints were 64-bits, and I see it often enough that I think it could be called reasonable:
checking if a value is negative by testing if ((val & 0x80000000) != 0)
This is commonly found in checking error codes. Many error code standards (like Windows' HRESULT) use bit 31 to represent an error. And code will sometimes check for that error either by testing bit 31 or sometimes by checking whether the error is a negative number.
Microsoft's macros for testing HRESULT use both methods - and I'm sure there's a ton of code out there that does similar without using the SDK macros. If MS had moved to ILP64, this would be one area that caused porting headaches that are completely avoided with the LLP64 model (or the LP64 model).
Note: if you're not familiar with terms like "ILP64", please see the mini-glossary at the end of the answer.
I'm pretty sure there's a lot of code (not necessarily Windows-oriented) out there that uses plain-old-int to hold error codes, assuming that those ints are 32 bits in size. And I bet there's a lot of code with that error status scheme that also uses both kinds of checks (< 0 and bit 31 being set) and which would break if moved to an ILP64 platform. These checks could be made to continue to work correctly either way if the error codes were carefully constructed so that sign extension took place, but again, many such systems I've seen construct the error values by or-ing together a bunch of bit fields.
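A sketch of how the two checks diverge (assuming 2's complement; the error value is invented):

#include <stdio.h>

int main(void)
{
    unsigned int raw = 0x80004005u;   /* an HRESULT-style failure code, bit 31 set */
    int err = (int)raw;               /* negative on a 32-bit, 2's-complement int  */

    /* With a 32-bit int both tests agree.  Under ILP64, err would hold the
       positive value 0x80004005: the sign test would stop firing while the
       bit-31 test still would. */
    if (err < 0)
        puts("sign test: error");
    if ((err & 0x80000000) != 0)
        puts("bit-31 test: error");
    return 0;
}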
Anyway, I don't think this is an unsolvable problem by any means, but I do think it's a fairly common coding practice that would cause a lot of code to require fixing up if moved to an ILP64 platform.
Note that I also don't think this was one of the foremost reasons for Microsoft to choose the LLP64 model (I think that decision was largely driven by binary data compatibility between 32-bit and 64-bit processes, as mentioned in MSDN and on Raymond Chen's blog).
Mini-Glossary for the 64-bit Platform Programming Model terminology:
ILP64: int, long, pointers are 64-bits
LP64: long and pointers are 64-bits, int is 32-bits (used by many (most?) Unix platforms)
LLP64: long long and pointers are 64-bits, int and long remain 32-bits (used on Win64)
For more information on 64-bit programming models, see "64-bit Programming Models: Why LP64?"
While I don't personally write code like this, I'll bet that it's out there in more than one place... and of course it'll break if you change the size of int.
int i, x = getInput();
for (i = 0; i < 32; i++)
{
    if (x & (1 << i))
    {
        //Do something
    }
}
Well, it's not like this story is all new. With "most computers" I assume you mean desktop computers. There already has been a transition from 16-bit to 32-bit int. Is there anything at all that says the same progression won't happen this time?
Not particularly. int is 64 bit on some 64 bit architectures (not x64).
The standard does not actually guarantee that you get exact 32-bit integers, just that there is a type that can hold at least 32 bits.
Now, if you are depending on int being the same size as ptrdiff_t, you may be broken.
Remember, C does not guarantee that the machine even is a binary machine.

Adding 64 bit support to existing 32 bit code, is it difficult?

There is a library which I build against different 32-bit platforms. Now, 64-bit architectures must be supported. What are the most general strategies to extend existing 32-bit code to support 64-bit architectures? Should I use #ifdef's or anything else?
The amount of effort involved will depend entirely on how well written the original code is. In the best possible case there will be no effort involved other than re-compiling. In the worst case you will have to spend a lot of time making your code "64 bit clean".
Typical problems are:
assumptions about sizes of int/long/pointer/etc
assigning pointers <=> ints (see the sketch after this list)
relying on default argument or function result conversions (i.e. no function prototypes)
inappropriate printf/scanf format specifiers
assumptions about size/alignment/padding of structs (particularly in regard to file or network I/O, or interfacing with other APIs, etc)
inappropriate casts when doing pointer arithmetic with byte offsets
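As one concrete illustration of the pointer/int item above (a sketch):

#include <stdint.h>

void stash_pointer(void *p)
{
    /* Broken on LP64/LLP64: a pointer no longer fits in an int.
       int handle = (int)p;                                        */

    /* intptr_t/uintptr_t are defined to round-trip a pointer value. */
    intptr_t handle = (intptr_t)p;
    void *q = (void *)handle;
    (void)q;
}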
Simply don't rely on assumptions about the machine word size; always use sizeof, stdint.h, etc. Unless you rely on different library calls for different architectures, there should be no need for #ifdefs.
The easiest strategy is to build what you have with 64-bit settings and test the heck out of it. Some code doesn't need to change at all. Other code, usually with wrong assumptions about the size of ints/pointers, will be much more brittle and will need to be modified to be non-dependent on the architecture.
Very often binary files containing binary records cause the most problems. This is especially true in environments where ints grow from 32-bit to 64-bit in the transition to a 64-bit build. Primarily this is due to the fact that integers get written natively to files in their current (32-bit) length and read in using an incorrect length in a 64-bit build where ints are 64-bit.

Word and Double Word integers in C

I am trying to implement a simple, moderately efficient bignum library in C. I would like to store digits using the full register size of the system it's compiled on (presumably 32 or 64-bit ints). My understanding is that I can accomplish this using intptr_t. Is this correct? Is there a more semantically appropriate type, i.e. something like intword_t?
I also know that with GCC I can easily do overflow detection on a 32-bit machine by upcasting both arguments to 64-bit ints, which will occupy two registers and take advantage of instructions like the IA-32 ADC (add with carry). Can I do something similar on a 64-bit machine? Is there a 128-bit type I can upcast to which will compile to use these instructions if they're available? Better yet, is there a standard type that represents twice the register size (like intdoubleptr_t) so this could be done in a machine-independent fashion?
Thanks!
Any reason not to use size_t? size_t is 4 bytes on a 32-bit system and 8 bytes on a 64-bit system, and is probably more portable than using WORD_SIZE (I think WORD_SIZE is gcc-specific, no?)
I am not aware of any 128-bit value on 64-bit systems, could be wrong here but haven't come across that type in the kernel or regular user apps.
I'd strongly recommend using the C99 <stdint.h> header. It declares int32_t, int64_t, uint32_t, and uint64_t, which look like what you really want to use.
EDIT: As Alok points out, int_fast32_t, int_fast64_t, etc. are probably what you want to use. The number of bits you specify should be the minimum you need for the math to work, i.e. for the calculation to not "roll over".
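On the 64-bit side, the double-width trick from the question can be done with unsigned __int128, which GCC and Clang provide on 64-bit targets (a sketch; __int128 is a compiler extension, not part of C99):

#include <stdint.h>

/* Add two 64-bit digits plus an incoming carry, returning the low word
   and updating the carry, via a 128-bit intermediate. */
uint64_t add_with_carry(uint64_t a, uint64_t b, uint64_t *carry)
{
    unsigned __int128 sum = (unsigned __int128)a + b + *carry;
    *carry = (uint64_t)(sum >> 64);
    return (uint64_t)sum;
}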
The optimization comes from the fact that the CPU doesn't have to waste cycles realigning data, padding the leading bits on a read, and doing a read-modify-write on a write. Truth is, a lot of processors (such as recent x86s) have hardware in the CPU that optimizes these accesses pretty well (at least the padding and read-modify-write parts), since they're so common and usually only involve transfers between the processor and cache.
So the only thing left for you to do is make sure the accesses are aligned: take sizeof(int_fast32_t) or whatever and use it to make sure your buffer pointers are aligned to that.
Truth is, this may not amount to that much improvement (due to the hardware optimizing transfers at runtime anyway), so writing something and timing it may be the only way to be sure. Also, if you're really crazy about performance, you may need to look at SSE or AltiVec or whatever vectorization tech your processor has, since that will outperform anything you can write that is portable when doing vectored math.
