What is the difference between far pointers and near pointers? - c

Can anybody tell me the difference between far pointers and near pointers in C?

On a 16-bit x86 segmented memory architecture, four registers are used to refer to the respective segments:
DS → data segment
CS → code segment
SS → stack segment
ES → extra segment
A logical address on this architecture is written segment:offset. Now to answer the question:
Near pointers refer (as an offset) to the current segment.
Far pointers use segment info and an offset to point across segments. So, to use them, DS or CS must be changed to the specified value, the memory will be dereferenced and then the original value of DS/CS restored. Note that pointer arithmetic on them doesn't modify the segment portion of the pointer, so overflowing the offset will just wrap it around.
And then there are huge pointers, which are normalized to have the highest possible segment for a given address (contrary to far pointers).
On 32-bit and 64-bit architectures, memory models are using segments differently, or not at all.

Since nobody mentioned DOS, lets forget about old DOS PC computers and look at this from a generic point-of-view. Then, very simplified, it goes like this:
Any CPU has a data bus, which is the maximum amount of data the CPU can process in one single instruction, i.e equal to the size of its registers. The data bus width is expressed in bits: 8 bits, or 16 bits, or 64 bits etc. This is where the term "64 bit CPU" comes from - it refers to the data bus.
Any CPU has an address bus, also with a certain bus width expressed in bits. Any memory cell in your computer that the CPU can access directly has an unique address. The address bus is large enough to cover all the addressable memory you have.
For example, if a computer has 65536 bytes of addressable memory, you can cover these with a 16 bit address bus, 2^16 = 65536.
Most often, but not always, the data bus width is as wide as the address bus width. It is nice if they are of the same size, as it keeps both the CPU instruction set and the programs written for it clearer. If the CPU needs to calculate an address, it is convenient if that address is small enough to fit inside the CPU registers (often called index registers when it comes to addresses).
The non-standard keywords far and near are used to describe pointers on systems where you need to address memory beyond the normal CPU address bus width.
For example, it might be convenient for a CPU with 16 bit data bus to also have a 16 bit address bus. But the same computer may also need more than 2^16 = 65536 bytes = 64kB of addressable memory.
The CPU will then typically have special instructions (that are slightly slower) which allows it to address memory beyond those 64kb. For example, the CPU can divide its large memory into n pages (also sometimes called banks, segments and other such terms, that could mean a different thing from one CPU to another), where every page is 64kB. It will then have a "page" register which has to be set first, before addressing that extended memory. Similarly, it will have special instructions when calling/returning from sub routines in extended memory.
In order for a C compiler to generate the correct CPU instructions when dealing with such extended memory, the non-standard near and far keywords were invented. Non-standard as in they aren't specified by the C standard, but they are de facto industry standard and almost every compiler supports them in some manner.
far refers to memory located in extended memory, beyond the width of the address bus. Since it refers to addresses, most often you use it when declaring pointers. For example: int * far x; means "give me a pointer that points to extended memory". And the compiler will then know that it should generate the special instructions needed to access such memory. Similarly, function pointers that use far will generate special instructions to jump to/return from extended memory. If you didn't use far then you would get a pointer to the normal, addressable memory, and you'd end up pointing at something entirely different.
near is mainly included for consistency with far; it refers to anything in the addressable memory as is equivalent to a regular pointer. So it is mainly a useless keyword, save for some rare cases where you want to ensure that code is placed inside the standard addressable memory. You could then explicitly label something as near. The most typical case is low-level hardware programming where you write interrupt service routines. They are called by hardware from an interrupt vector with a fixed width, which is the same as the address bus width. Meaning that the interrupt service routine must be in the standard addressable memory.
The most famous use of far and near is perhaps the mentioned old MS DOS PC, which is nowadays regarded as quite ancient and therefore of mild interest.
But these keywords exist on more modern CPUs too! Most notably in embedded systems where they exist for pretty much every 8 and 16 bit microcontroller family on the market, as those microcontrollers typically have an address bus width of 16 bits, but sometimes more than 64kB memory.
Whenever you have a CPU where you need to address memory beyond the address bus width, you will have the need of far and near. Generally, such solutions are frowned upon though, since it is quite a pain to program on them and always take the extended memory in account.
One of the main reasons why there was a push to develop the 64 bit PC, was actually that the 32 bit PCs had come to the point where their memory usage was starting to hit the address bus limit: they could only address 4GB of RAM. 2^32 = 4,29 billion bytes = 4GB. In order to enable the use of more RAM, the options were then either to resort to some burdensome extended memory solution like in the DOS days, or to expand the computers, including their address bus, to 64 bits.

Far and near pointers were used in old platforms like DOS.
I don't think they're relevant in modern platforms. But you can learn about them here and here (as pointed by other answers). Basically, a far pointer is a way to extend the addressable memory in a computer. I.E., address more than 64k of memory in a 16bit platform.

A pointer basically holds addresses. As we all know, Intel memory management is divided into 4 segments.
So when an address pointed to by a pointer is within the same segment, then it is a near pointer and therefore it requires only 2 bytes for offset.
On the other hand, when a pointer points to an address which is out of the segment (that means in another segment), then that pointer is a far pointer. It consist of 4 bytes: two for segment and two for offset.

Four registers are used to refer to four segments on the 16-bit x86 segmented memory architecture. DS (data segment), CS (code segment), SS (stack segment), and ES (extra segment). A logical address on this platform is written segment:offset, in hexadecimal.
Near pointers refer (as an offset) to the current segment.
Far pointers use segment info and an offset to point across segments. So, to use them, DS or CS must be changed to the specified value, the memory will be dereferenced and then the original value of DS/CS restored. Note that pointer arithmetic on them doesn't modify the segment portion of the pointer, so overflowing the offset will just wrap it around.
And then there are huge pointers, which are normalized to have the highest possible segment for a given address (contrary to far pointers).
On 32-bit and 64-bit architectures, memory models are using segments differently, or not at all.

Well in DOS it was kind of funny dealing with registers. And Segments. All about maximum counting capacities of RAM.
Today it is pretty much irrelevant. All you need to read is difference about virtual/user space and kernel.
Since win nt4 (when they stole ideas from *nix) microsoft programmers started to use what was called user/kernel memory spaces.
And avoided direct access to physical controllers since then. Since then dissapered a problem dealing with direct access to memory segments as well. - Everything became R/W through OS.
However if you insist on understanding and manipulating far/near pointers look at linux kernel source and how it works - you will newer come back I guess.
And if you still need to use CS (Code Segment)/DS (Data Segment) in DOS. Look at these:
https://en.wikipedia.org/wiki/Intel_Memory_Model
http://www.digitalmars.com/ctg/ctgMemoryModel.html
I would like to point out to perfect answer below.. from Lundin. I was too lazy to answer properly. Lundin gave very detailed and sensible explanation "thumbs up"!

Related

Why must an int have a memory address that is divisible by four on most current architectures? [duplicate]

Admittedly I don't get it. Say you have a memory with a memory word of length of 1 byte. Why can't you access a 4 byte long variable in a single memory access on an unaligned address(i.e. not divisible by 4), as it's the case with aligned addresses?
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
struct mystruct {
char c; // one byte
int i; // four bytes
short s; // two bytes
}
On a 32-bit processor it would most likely be aligned like shown here:
The processor can read each of these members in one transaction.
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this:
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data, in, out, and between more and faster execution units–in a highly reliable way.
Bonus: Caches
Another alignment-for-performance that I alluded to previously is alignment on cache lines which are (for example, on some CPUs) 64B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects; from this question on cache-line sizes
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice slower.
It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.
There is much more information in this link that the OP discovered.
you can with some processors (the nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line, because the bus is 64 bits wide, you had to fetch 64 bit at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2 byte variable, but that was split across 2 chunks, then that required double the required memory accesses.
So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.
Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of memory
128 bits - size of bus
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.
#joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects which were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments look like.
In addition here's a link to a Github gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which #joshperry referenced. The tests were run on a Macbook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16GB of RAM.
If you have a 32bit data bus, the address bus address lines connected to the memory will start from A2, so only 32bit aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary - i.e. A0 for 16/32 bit data or A1 for 32 bit data are not zero, two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler generated unaligned access code requires not just additional bus cycles, but additional instructions, making it even less efficient.
If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
On PowerPC you can load an integer from an odd address with no problems.
Sparc and I86 and (I think) Itatnium raise hardware exceptions when you try this.
One 32 bit load vs four 8 bit loads isnt going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.

Does using FreeDOS allow my program to access more than 64 K of memory?

I am interested in programming in C on FreeDOS while learning some basic ASM in the process, will using FreeDOS allow my program to access more than the standard 640K of memory?
And secondly, about the ASM, I know on modern processors it is hard to program on assembly due to the complexity of the CPU architecture, but does using FreeDOS limit me to the presumably simpler 16-bit instruction set?
MS-DOS and FreeDOS use the "HIMEM" areas: These are:
Some memory areas above 0xA000:0x0000 reserved for extension cards that contain RAM instead of extension cards
The memory starting from 0xFFFF:0x0010 to 0xFFFF:0xFFFF which is located above 1MB but can be accessed using 16-bit real mode code (if the so-called A20-line is active).
The maximum memory size that can be archieved this way is about 800K.
Using XMS and EMS you can use up to 64M:
XMS will allocate memory blocks above the area that can be accessed via 16-bit real mode code. There are special functions that can copy data from that memory to the low 640K of memory and vice versa
EMS is similar; however using EMS it is possible to "map" the high memory to a low address (a feature of 32-bit CPUs) which means that you can access some memory above the 1MB area as if it was located at an address below 1MB.
Without any extender a program can use maximum 640KB of low memory in DOS. But each structure will be limited to the size of a segment, or 64KB. That means you can have 10 large arrays of size 64KB. Of course you can have multiple arrays in a segment but their total size must not exceed the segment size. Some compilers also handle addresses spanning across multiple segments automatically so you can use objects larger than 64KB seamlessly, or you can also do the same if you're writing in assembly
To access more memory you need an extender like EMS or XMS. But note that the address space is still 20-bit wide. The extenders just map the high memory regions into some segments in the addressable space so you can only see a small window of your data at a time
Regarding assembly, you can use 32-bit registers in 16-bit mode. There are 66h and 67h prefixes to change the operand size. However that doesn't mean that writing 16-bit code is easier. In fact it has lots of idiosyncrasies to remember like the limited register usage in memory addressing. The 32-bit x86 instruction set is a lot cleaner with saner addressing modes as well as a flat address space which is a lot easier to use.

CPU and memory communication

I'm programmer-beginner, but I want to understand the things a bit more deeply. I did some research and read quite a lot of text, but I'm still yet to understand some things..
When coding a basic thing (in C):
int myNumber;
myNumber = 3;
printf("Here's my number: %d", myNumber);
I found out that (mainly on 32-bit CPU) integer takes place of 32 bits = 4 bytes. So at first line of my code CPU goes into the memory. The memory is byte-addressable, so CPU chooses 4 continuous bytes for my variable and stores the address to first (or last) byte.
On the second line of my code CPU uses his stored address of the MyNumber variable, goes to that address in the memory and finds there 32 bits of reserved space. His task now is to store there the number "3", so he fills those four bytes with the sequence 00000000-00000000-00000000-00000011.
On the third line it does the same - CPU goes to that address in memory and loads the number stored in that address.
(First question - Do I understand it right?)
What I don't understand is this:
The size of that address (pointer to that variable) is 4bytes in 32-bit CPU. (Thats why 32-bit CPU can use max 4GB of memory - because there are only 2^32 different addresses of binary length 32)
Now, where the CPU stores these addresses? Does he have some sort of its own memory or cache to store that? And why it stores the 32 bit long address to 32 bit long integer? Wouldn't it be better to simply store in its cache that actual number than the pointer to that when the sizes are the same?
And last one - if it stores somewhere in its own cache the addresses to all those integers and the lenghts are the same (4 bytes), it will need exactly the same space for storing the addresses as for the actual variables. But variables can take up to 4GBs of space so CPU must have 4GB of its own space to store the addresses to those variables. And that sounds strange..
Thank you for help!
I'm trying to understand that but it's so tough.. :-[
(First question - Do I understand it right?)
The first thing to recognise is that the value might not be stored in main memory at all. The compiler might decide to store it in a register instead, as this is more optimal.1
The memory is byte-addressable, so CPU chooses 4 continuous bytes for my variable and stores the address to first (or last) byte.
Assuming that the compiler does decide to store it in main memory, then yes, on a 32-bit machine, an int is typically 4 bytes, so 4 bytes will be allocated for storage.
The size of that address (pointer to that variable) is 4bytes in 32-bit CPU. (Thats why 32-bit CPU can use max 4GB of memory - because there are only 2^32 different addresses of binary length 32)
Note that the width of an int and the width of a pointer don't have to be the same, so there's not necessarily a connection with the size of the address space.
Now, where the CPU stores these addresses?
In the case of local variables, the address is effectively hardcoded into the executable itself, typically as an offset from the stack pointer.
In the case of dynamically-allocated objects (i.e. stuff that's been malloc-ed), the programmer typically maintains a corresponding pointer variable (otherwise there would be a memory leak!). That pointer might also be dynamically-allocated (in the case of a complex data structure), but if you go back far enough, you'll eventually reach something that's a local variable. In which case, the above rule applies.
But variables can take up to 4GBs of space so CPU must have 4GB of its own space to store the addresses to those variables.
If your program consists of independently malloc-ing millions of ints, then yes, you'd end up with just as much storage required for the pointers. But most programs don't look like that. You typically allocate much bigger objects (like an array, or a big struct).
cache
The specifics of where stuff is stored is architecture-specific. On a modern x86, there's typically 2 or 3 layers of cache sitting between the CPU and main memory. But the cache is not independently addressable; the CPU cannot decide to store the int in cache instead of main memory. Rather, the cache is effectively a redundant copy of a subset of main memory.
Another thing to consider is that the compiler will typically deal with virtual addresses when allocating storage for objects. On a modern x86, these are mapped to physical addresses (i.e. addresses that correspond to physical storage bytes in main memory) by dedicated hardware, along with OS support.
1. Alternatively, the compiler may be able to optimise it away entirely.
On the second line of my code CPU uses his stored address of the MyNumber variable, goes to that address in the memory and finds there 32 bits of reserved space.
Nearly correct. Memory is basically unstructured. The CPU can't see that there are 32 bits of "reserved space". But the CPU was instructed to read 32 bits of data, so it reads 32 bits of data starting from the specified address. Then it just has to hope/assume that those 32 bits actually contain something meaningful.
Now, where the CPU stores these addresses? Does he have some sort of its own memory or cache to store that? And why it stores the 32 bit long address to 32 bit long integer? Wouldn't it be better to simply store in its cache that actual number than the pointer to that when the sizes are the same?
The CPU has a small number of registers, which it can use to store data (common CPUs have 8, 16 or 32 registers, so they can only hold the particular variables that you're working with here and now). So to answer the last part first, yes, the compiler certainly might (and probably will) generate code to just store your int into a register, instead of storing it in a memory, and telling the CPU to load it from a specified address.
As for the other part of the question: ultimately, every part of the program is stored in memory. Part of it is a stream of instructions, and part of it is in chunks of data scattered around memory.
There are a few tricks that help with locating the data the CPU needs: part of the program's memory contains the stack, which typically stores local variables while they're in scope. The CPU always maintains a pointer to the top of the stack in one of its registers, so it can easily locate data on the stack, simply by modifying the stack pointer with a fixed offset. Instructions can directly contain such offsets, so in order to read your int, the compiler could for example generate code which writes the int to the top of the stack when you enter the function, and then when you need to refer to that function, have code which reads the data found at the address the stack pointer points to, plus the small offset needed to locate your variable.
And also keep in mind that the adresses your program sees may not be (or rather rarely are) physical adresses starting from the 'beginning of the memory' or 0. Mostly they are offsets into a specifc memory block where the memory manager knows the real address and the accesses via base+offest as the real data storage.
And we do need the memory since caches are limited ;-)
Mario
Internal to the CPU there is one register that contains the address of the next instruction to be executed. The instructions themselves keep information where the variable is. If the variable is optimized, the instruction may point to a register, but in general, the instruction will have an address of the variable being accessed. Your code, after compiled and loaded in memory, have all that embedded! I recommend looking into assembly language to get a better understanding of all that. Good luck!

Can a 32-bit processor really address 2^32 memory locations?

I feel this might be a weird/stupid question, but here goes...
In the question Is NULL in C required/defined to be zero?, it has been established that the NULL pointer points to an unaddressable memory location, and also that NULL is 0.
Now, supposedly a 32-bit processor can address 2^32 memory locations.
2^32 is only the number of distinct numbers that can be represented using 32 bits. Among those numbers is 0. But since 0, that is, NULL, is supposed to point to nothing, shouldn't we say that a 32-bit processor can only address 2^32 - 1 memory locations (because the 0 is not supposed to be a valid address)?
If a 32-bit processor can address 2^32 memory locations, that simply means that a C pointer on that architecture can refer to 2^32 - 1 locations plus NULL.
the NULL pointer points to an unaddressable memory location
This is not true. From the accepted answer in the question you linked:
Notice that, because of how the rules for null pointers are formulated, the value you use to assign/compare null pointers is guaranteed to be zero, but the bit pattern actually stored inside the pointer can be any other thing
Most platforms of which I am aware do in fact handle this by marking the first few pages of address space as invalid. That doesn't mean the processor can't address such things; it's just a convenient way of making low values a non valid pointer. For instance, several Windows APIs use this to distinguish between a resource ID and a pointer to actual data; everything below a certain value (65k if I recall correctly) is not a valid pointer, but is a valid resource ID.
Finally, just because C says something doesn't mean that the CPU needs to be restricted that way. Sure, C says accessing the null pattern is undefined -- but there's no reason someone writing in assembly need be subject to such limitations. Real machines typically can do much more than the C standard says they have to. Virtual memory, SIMD instructions, and hardware IO are some simple examples.
First, let's note the difference between the linear address (AKA the value of the pointer) and the physical address. While the linear address space is, indeed, 32 bits (AKA 2^32 different bytes), the physical address that goes to the memory chip is not the same. Parts ("pages") of the linear address space might be mapped to physical memory, or to a page file, or to an arbitrary file, or marked as inaccessible and not backed by anything. The zeroth page happens to be the latter. The mapping mechanism is implemented on the CPU level and maintained by the OS.
That said, the zero address being unaddressable memory is just a C convention that's enforced by every protected-mode OS since the first Unices. In MS-DOS-era real-mode operaring systems, null far pointer (0000:0000) was perfectly addressable; however, writing there would ruin system data structures and bring nothing but trouble. Null near pointer (DS:0000) was also perfectly accessible, but the run-time library would typically reserve some space around zero to protect from accidental null pointer dereferencing. Also, in real mode (like in DOS) the address space was not a flat 32-bit one, it was effectively 20-bit.
It depends upon the operating system. It is related to virtual memory and address spaces
In practice (at least on Linux x86 32 bits), addresses are byte "numbers"s, but most are for 4-bytes words so are often multiple of 4.
And more importantly, as seen from a Linux application, only at most 3Gbytes out of 4Gbytes is visible. a whole gigabyte of address space (including the first and last pages, near the null pointer) is unmapped. In practice the process see much less of that. See its /proc/self/maps pseudo-file (e.g. run cat /proc/self/maps to see the address map of the cat command on Linux).

How did 16-bit C compilers work?

C's memory model, with its use of pointer arithmetic and all, seems to model flat address space. 16-bit computers used segmented memory access. How did 16-bit C compilers deal with this issue and simulate a flat address space from the perspective of the C programmer? For example, roughly what assembly language instructions would the following code compile to on an 8086?
long arr[65536]; // Assume 32 bit longs.
long i;
for(i = 0; i < 65536; i++) {
arr[i] = i;
}
How did 16-bit C compilers deal with
this issue and simulate a flat address
space from the perspective of the C
programmer?
They didn't. Instead, they made segmentation visible to the C programmer, extending the language by having multiple types of pointers: near, far, and huge. A near pointer was an offset only, while far and huge pointers were a combined segment and offset. There was a compiler option to set the memory model, which determined whether the default pointer type was near or far.
In Windows code, even today, you'll often see typedefs like LPCSTR (for const char*). The "LP" is a holdover from the 16-bit days; it stands for "Long (far) Pointer".
C memory model does not in any way imply flat address space. It never did. In fact, C language specification is specifically designed to allow non-flat address spaces.
In the most trivial implementation with segmented address space, the size of the largest continuous object would be limited by the size of the segment (65536 bytes on a 16 bit platform). This means that size_t in such implementation would be 16 bit, and that your code simply would not compile, since you are attempting to declare an object that has larger size than the allowed maximum.
A more complex implementation would support so called huge memory model. You see, there's really no problem addressing continuous memory blocks of any size on a segmented memory model, it just requires some extra efforts in pointer arithmetics. So, within the huge memory model, the implementation would make those extra efforts, which would make the code a bit slower, but at the same time would allow addressing objects of virtually any size. So, your code would compile perfectly fine.
The true 16-bit environments use 16 bit pointers which reach any address. Examples include the PDP-11, 6800 family (6802, 6809, 68HC11), and the 8085. This is a clean and efficient environment, just like a simple 32-bit architecture.
The 80x86 family forced upon us a hybrid 16-bit/20-bit address space in so-called "real mode"—the native 8086 addressing space. The usual mechanism to deal with this was enhancing the types of pointers into two basic types, near (16-bit pointer) and far (32-bit pointer). The default for code and data pointers can be set in bulk by a "memory model": tiny, small, compact, medium, far, and huge (some compilers do not support all models).
The tiny memory model is useful for small programs in which the entire space (code + data + stack) is less than 64K. All pointers are (by default) 16 bits or near; a pointer is implicitly associated with a segment value for the whole program.
The small model assumes that data + stack is less than 64K and in the same segment; the code segment contains only code, so can have up to 64K as well, for a maximum memory footprint of 128K. Code pointers are near and implicitly associated with CS (the code segment). Data pointers are also near and associated with DS (the data segment).
The medium model has up to 64K of data + stack (like small), but can have any amount of code. Data pointers are 16 bits and are implicitly tied to the data segment. Code pointers are 32 bit far pointers and have a segment value depending on how the linker has set up the code groups (a yucky bookkeeping hassle).
The compact model is the complement of medium: less than 64K of code, but any amount of data. Data pointers are far and code pointers are near.
In large or huge model, the default subtype of pointers are 32 bit or far. The main difference is that huge pointers are always automatically normalized so that incrementing them avoids problems with 64K wrap arounds. See this.
In DOS 16 bit, I dont remember being able to do that. You could have multiple things that were each 64K (bytes)(because the segment could be adjusted and the offset zeroed) but dont remember if you could cross the boundary with a single array. The flat memory space where you could willy nilly allocate whatever you wanted and reach as deep as you liked into an array didnt happen until we could compile 32 bit DOS programs (on 386 or 486 processors). Perhaps other operating systems and compilers other than microsoft and borland could generate flat arrays greater than 64kbytes. Win16 I dont remember that freedom until win32 hit, perhaps my memory is getting rusty...You were lucky or rich to have a megabyte of memory anyway, a 256kbyte or 512kbyte machine was not unheard of. Your floppy drive had a fraction of a meg to 1.44 meg eventually, and your hard disk if any had a dozen or few meg, so you just didnt compute thing that large that often.
I remember the particular challenge I had learning about DNS when you could download the entire DNS database of all registered domain names on the planet, in fact you had to to put up your own dns server which was almost required at the time to have a web site. That file was 35megabytes, and my hard disk was 100megabytes, plus dos and windows chewing up some of that. Probably had 1 or 2 meg of memory, might have been able to do 32 bit dos programs at the time. Part if it was me wanting to parse the ascii file which I did in multiple passes, but each pass the output had to go to another file, and I had to delete the prior file to have room on the disk for the next file. Two disk controllers on a standard motherboard, one for the hard disk and one for the cdrom drive, here again this stuff wasnt cheap, there were not a lot of spare isa slots if you could afford another hard disk and disk controller card.
There was even the problem of reading 64kbytes with C you passed fread the number of bytes you wanted to read in a 16 bit int, which meant 0 to 65535 not 65536 bytes, and performance dropped dramatically if you didnt read in even sized sectors so you just read 32kbytes at a time to maximize performance, 64k didnt come until well into the dos32 days when you were finally convinced that the value passed to fread was now a 32 bit number and the compiler wasnt going to chop off the upper 16 bits and only use the lower 16 bits (which happened often if you used enough compilers/versions). We are currently suffering similar problems in the 32 bit to 64 transition as we did with the 16 to 32 bit transition. What is most interesting is the code from the folks like me that learned that going from 16 to 32 bit int changed size, but unsigned char and unsigned long did not, so you adapted and rarely used int so that your programs would compile and work for both 16 and 32 bit. (The code from folks from that generation kind of stands out to other folks that also lived through it and used the same trick). But for the 32 to 64 transition it is the other way around and code not refactored to use uint32 type declarations are suffering.
Reading wallyk's answer that just came in, the huge pointer thing that wrapped around does ring a bell, also not always being able to compile for huge. small was the flat memory model we are comfortable with today, and as with today was easy because you didnt have to worry about segments. So it was a desireable to compile for small when you could. You still didnt have a lot of memory or disk or floppy space so you just didnt normally deal with data that large.
And agreeing with another answer, the segment offset thing was 8088/8086 intel. The whole world was not yet dominated by intel, so there were other platforms that just had a flat memory space, or used other tricks perhaps in hardware (outside the processor) to solve the problem. Because of the segment/offset intel was able to ride the 16 bit thing longer than it probably should have. Segment/offset had some cool and interesting things you could do with it, but it was as much a pain as anything else. You either simplified your life and lived in a flat memory space or you constantly worried about segment boundaries.
Really pinning down the address size on old x86's is sort of tricky. You could say that its 16 bit, because the arithmetic you can perform on an address must fit in a 16 bit register. You could also say that it's 32 bit, because actual addresses are computed against a 16 bit general purpose register and 16 bit segment register (all 32 bits are significant). You could also just say it's 20 bit, because the segment registers are shifted 4 bits left and added to the gp registers for hardware addressing.
It actually doesn't matter that much which one of these you chose, because they are all roughly equal approximations of the c abstract machine. Some compilers let you pick a memory model you were using per compilation, while others just assume 32 bit addresses and then carefully check that operations that could overflow 16 bits emit instructions that handle that case correctly.
Check out this wikipedia entry. About Far pointers. Basically, its possible to indicate a segment and an offset, making it possible to jump to another segment.

Resources