What is the memory usage overhead for a 64-bit application? - c

From what I have found so far it's clear that programs compiled for a 64-bit architecture use twice as much RAM for pointers as their 32-bit alternatives - https://superuser.com/questions/56540/32-bit-vs-64-bit-systems.
Does that mean that code compiled for 64-bit uses on average two times more RAM than the 32-bit version?
I somehow doubt it, but I am wondering what the real overhead is. I suppose that small types, like short, byte and char are same sized in a 64-bit architecture? I am not really sure about byte though. Given that many applications work with large strings (like web browsers, etc.), that consist mostly of char arrays in most implementations, the overhead may not be so large.
So even if numeric types like int and long are larger on 64 bit, would it have a significant effect on usage of RAM or not?

It depends on the programming style (and on the language, but you are referring to C).
If you work a lot with pointers (or you have a lot of references in some languages), RAM consumption goes up.
If you use a lot of data with fixed size, such as double or int32_t, RAM consumption does not go up.
For types like int or long, it depends on the architecture; there may be differences between Linux and Windows. Here you see the alternatives you have. In short, Windows uses LLP64, meaning that long long and pointers are 64 bit, while Linux uses LP64, where longis 64 bit as well. Other architectures might make int or even short 64 bit as well, but these are quite uncommon.
float and double should remain the same in size in all cases.
So you see it strongly depends on the usage of the data types.

There are a few reasons for the memory consumption to go up. However the overhead of 64b vs 32b depends from an app to another.
Main reason is using a lot of pointers in your code. However, an
array allocated dynamically in a code compiled for 64bit and running
on a 64bit OS would be the same size as the array allocated on a 32
bit system. Only the address to the array will be larger, the content
size will be the same (except when the type size changed - however
that should not happen and should be well documented).
Another footprint increase would be due to memory alignment. In
64 bit mode the alignment needs to consider a 64bit address so that
should add a small overhead.
Probably the size of the code will increase. On some
architectures the 64bit ISA could be slightly larger. Also, you would
now have to make calls to 64bit addresses.
When running in 64bit registers are larger (64bit) so if you use
many numerical types the compiler might as well place them in
registers so that shouldn't necessarily mean that your RAM footprint
would go up. Using double variables is likely to produce a memory
footprint increase if they are not stored into 64b registers.
When using JIT compiled languages like Java, .NET it is likely that the footprint increase of 64b code would be larger as the runtime environment will generate additional overhead through pointer usage, hidden control structures, etc.
However there is no magic number describing the 64bit memory footprint overhead. That needs to be measured from an application to another. From what I've seen, I never got more than 20% increase in footprint for an application running on 64bit, compared to 32bit. However that's purely based on the applications I encountered and I'm using mostly C and C++.

I think there may be another reason which goes way back in that variables need to be stored in memory on a 64bit boundary at an address that's ...xxxxx000 to be read in one bite, if it's not it needs to read it in a byte at a time.

Related

What is the motivation to explicitly set the "falign-functions" compiler flag to a certain value?

I am working on the SW for an embedded system and trying to understand some low-level details that was setup by an earlier developer. The target platform is a custom made OpenRISC 1200 processor, synthesized in a FPGA. The software is built using a GCC based cross-compiler.
Among the compiler flags I find this one: -falign-functions=16. There is a comment in the build configuration saying:
On Open RISC 1200, function alignment needs to be on a cache boundary (16 bytes). If not, performance suffer severely.
I realize my understanding of cache memories are a bit shallow and I should probably read something like: What Every Programmer Should Know About Memory. I haven't yet, but I will. With that said, I have some questions:
I understand that this is about minimizing cache misses in the instruction cache, but why is that achieved by setting the function alignment to the instruction cache line size (i.e. 16 bytes)?
If this is the most memory efficient way, wouldn't you expect this to be the default setting for function alignment in the cross-compiler? I mean, for a more common platform like x86, amd64 or ARM you don't need to care about function alignments (or am I wrong?).
Most architectures have aspects of memory access and instructions that can depend on alignment.
but why is that achieved by setting the function alignment to the instruction cache line size
The CPU will fetch complete cache lines from memory (as if the memory is divided into these larger blocks rather than bytes). So if all the data you need fits in one cache line, there is just one fetch, but if you have even just 2 bytes of data, but one byte is the end of a cache line and the other byte the start of the next, well now it has to load in two complete cache lines. This wastes space in the small CPU cache, and more memory transfers.
A quick search indicates that the OpenRISC 1200 uses a 16 byte cache line, so when targeting that specifically, aligning the start of any data you have on those 16 byte multiples helps avoid straddling two lines within one function / piece of data.
If this is the most memory efficient way, wouldn't you expect this to be the default setting for function alignment in the cross-compiler?
There can be more to it than that. Firstly, this alignment is achieved by wasting "padding" memory. If you would have used 1 byte of a cache line calling a function, then another 15 bytes are wasted to reach the 16 byte boundary.
Also in the case of a function call, there is a reasonable chance that memory will be in cache anyway, and jumping forward might leave the cached memory, causing a load that would otherwise not be needed.
So this leaves a trade off, functions that use little stack space and return quickly, might not benefit much from the extra alignment, but a function that runs for longer and uses more stack space might benefit by not "wasting" cache space on the "previous function".
Another reason alignment is often desired is when dealing with instructions that either require it outright (fail on an unaligned address), or are much slower (with loads/stores getting split up into parts), or maybe some other effects (like a load/store not being atomic if not properly aligned).
With a quick search I believe the general alignment requirement on OR1200 appears to be 4 bytes, even for 8 byte types. So in this respect an alignment of at least 4 would seem desirable, and 8 or 16 might only provide a benefit in certain cases mentioned before.
I am not familiar with Open RISC specifically, but on some platforms instructions added at a later date (e.g. 16byte / 128bit SSE instructions) require or benefit from an alignment greater than what was the default (I believe AMD64 upped the default alignment to 16, but then later AVX came wanting 32 byte alignment).

What value of alignment should I with mkl_malloc?

The function mkl_malloc is similar to malloc but has an extra alignment argument. Here's the prototype:
void* mkl_malloc (size_t alloc_size, int alignment);
I've noticed different performances with different values of alignment. Apart from trial and error, is there a canonical or documented methodical way to decide on the best value of alignment? i.e. processor being used, function being called, operation being performed etc.
This question widely applicable to anyone who uses MKL so I'm very surprised it is not in the reference manual.
update: I have tried with mkl_sparse_spmm and have not noticed a significant difference in performance for setting the alignment to powers of 2 up to 1024 bytes, after that the performance tends to drop. I'm using an Intel Xeon E5-2683.
Alignment only affects performance when SSE/AVX instructions can be used - this is commonly true when operating with arrays as you wish to apply the same operation to a range of elements.
In general, you want to choose alignment based on the CPU, if it supports AVX2 which has 256bit registers, then you want 32 byte alignment, if it supports AVX512, then 64 bytes would be optimal.
To that end, mkl_malloc will guarantee alignment to the value you specify, however, obviously if the data are 32-byte aligned, then they are also aligned to a (16, 8, 4...)-byte boundary. The purpose of the call is to ensure this is always the case and thus avoid any potential complications.
On my machine (Linux kernel 4.17.11 running on i7 6700K), the default alignment of mkl_malloc seems to be 128-bytes (for large enough arrays, if they are too small the value seems to be 32KB), in other words, any value smaller than that has no effect on alignment, I can however input 256 and the data will be aligned to the 256-byte boundary.
In contrast, using malloc gives me 16byte alignment for 1GB of data and 32-byte alignment for 1KB, whatever the OS gives me with absolutely no preference regarding alignment.
So using mkl_malloc makes sense as it ensures you get the alignment you desire. However, that doesn't mean you should set the value to be too large, that will simply cause you to waste memory and potentially expose you to an increased number of cache misses.
In short, you want your data to be aligned to the size of the vector registers in your CPU so that you can make use of the relevant extensions. Using mkl_malloc with some parameter for alignment guarantees alignment to at least that value, it can however be more. It should be used to make sure the data are aligned the way you want, but there is absolutely no good reason to align to 1MB.
The only reason, why regardless of your input, you have no penalties / gains from specifying the alignment is that you get machine aligned memory no matter what you type in. So on your processor, which supports AVX, you are always getting 32 byte aligned memory regardless of your input.
You will also see, that whatever alignment value you go for, the memory address, which mkl_malloc, returns is divisible 32-aligned. Alternatively you may test that low level intrisics like _mm256_load_pd, which would seg fault, when a not 32 byte aligned address is used never seg fault.
Some minor details: OSX always gives you 32 byte address, independant of heap / stack when you allocate a chunk of memory, while Linux will always give you aligned memory, when allocating on heap. Stack is a matter of luck on Linux, but you exceed with small matrix size already the limit for stack allocations. I have no understanding of memory allocation on Windows.
I noticed the latter, when I was writing tests for my numerics library where I use std::vector<typename T, alignment A> for memory allocation and smaller matrix tests sometimes seg faulted on Linux.
TLDR: your alignment input is effectively discarded and you are getting machine alignment regardless.
I think there can be no "best" value for alignment. Depending on your architecture, alignment is generally a property enforced by the hardware, for optimization reasons mostly.
Coming to your specific question, it's important to state what exactly are you allocating memory for? What piece of hw accesses the memory? For e.g., I have worked with DMA engines which required the source address to be aligned to per transaction transfer size(where xfer size = 4, 8, 16, 32, 128). I also worked with vector registers where it was wise to have a 128 bit aligned load.
To summarize: It depends.

When do we need to use posix_memalign instead of malloc?

Seems posix_memalign let you choose a customized alignment,but when is that necessary?
malloc has already done the alignment work internally.
UPDATE
The exact reason I ask this is because I see nginx does this,ngx_memalign(NGX_POOL_ALIGNMENT, size, log);,here NGX_POOL_ALIGNMENT is defined as 16, nginxs.googlecode.com/svn-history/trunk/src/core/ngx_palloc.c
Basically, if you need tougher alignment than malloc will give you. Malloc generally returns a pointer aligned such, that it may be used with any of the primitive types (often, 8 bytes on common desktop machines).
However, sometimes you need memory aligned on other boundaries, for example 4K-aligned, etc. In this case, you would need memalign.
You would need this, for example,
when writing a memory manager (such as a garbage collector). In this case, it is sometimes handy to work with memory aligned on larger block sizes. This way, you can store meta-data common to all objects in a given block at the bottom of the allocated area, and access this simply by masking the least significant bits of the object pointer.
when interfacing with hardware (never done this myself, but IIRC, certain kinds of block-devices require aligned memory). See n.m.'s answer for details.
The only benefits of posix_memalign, as far as I can tell, are:
Allocating page-aligned (typically 4096 or larger alignment) memory for hardware-specific purposes.
Evil hacks where you keep the low N bits of a pointer zero so you can store an N-bit integer in the low bits. :-)
Various hardware may have alignment requirements which malloc cannot satisfy. The Linux man page gives one such example, I quote:
On many systems there are alignment
restrictions, e.g. on buffers used for
direct block device I/O. POSIX
specifies the
pathconf(path,_PC_REC_XFER_ALIGN) call
that tells what alignment is needed.
A couple of uses:
Some processors have instructions that will only work on data that is aligned on a power of two greater than or equal to the buffer size - for example bit reverse addressing instructions used in ffts (fast fourier transforms).
To align data to cache boundaries to optimize access in multiprocessing applications so that data in the same cache line isn't being accessed by two processors simultaneously.
Basically, if you don't need to do absurd levels of optimizations and/or your hardware doesn't demand that an array be on a particular boundary then you can forget about posix_memalign.

Can a C compiler generate an executable 64-bits where pointers are 32-bits?

Most programs fits well on <4GB address space but needs to use new features just available on x64 architecture.
Are there compilers/platforms where I can use x64 registers and specific instructions but preserving 32-bits pointers to save memory?
Is it possible do that transparently on legacy code? What switch to do that?
OR
What changes on code is it necessary to get 64-bits features while keep 32-bits pointers?
A simple way to circumvent this is if you'd have only few types for your structures that you are pointing to. Then you could just allocate big arrays for your data and do the indexing with uint32_t.
So a "pointer" in such a model would be just an index in a global array. Usually addressing with that should be efficient enough with a decent compiler, and it would save you some space. You'd loose other things that you might be interested in, dynamic allocation for instance.
Another way to achieve something similar is to encode a pointer with the difference to its actual location. If you can ensure that that difference always fits into 32 bit, you could gain too.
It's worth noting that there an ABI in development for linux, X32, that lets you build a x86_64 binary that uses 32 bit indices and addresses.
Only relatively new, but interesting nonetheless.
http://en.wikipedia.org/wiki/X32_ABI
Technically, it is possible for a compiler to do so. AFAIK, in practice it isn't done. It has been proposed for gcc (even with a patch here: http://gcc.gnu.org/ml/gcc/2007-10/msg00156.html) but never integrated (at least, it was not documented the last time I checked). My understanding is that it needs also support from the kernel and standard library to work (i.e. the kernel would need to set up things in a way not currently possible and using the existing 32 or 64 bit ABI to communicate with the kernel would not be possible).
What exactly are the "64-bit features" you need, isn't that a little vague?
Found this while searching myself for an answer:
http://www.codeproject.com/KB/cpp/smallptr.aspx
Also pick up the discussion at the bottom...
Never had any need to think about this, but it is interesting to realize that one can be concerned with how much space pointers need...
It depends on the platform. On Mac OS X, the first 4 GB of a 64-bit process' address space is reserved and unmapped, presumably as a safety feature so no 32-bit value is ever mistaken for a pointer. If you try, there may be a way to defeat this. I worked around it once by writing a C++ "pointer" class which adds 0x100000000 to the stored value. (This was significantly faster than indexing into an array, which also requires finding the array-base address and multiplying before the addition.)
On the ISA level, you can certainly choose to load and zero-extend a 32-bit value and then use it as a 64-bit pointer. It's a good feature for a platform to have.
No change should be necessary to a program unless you wish to use 64-bit and 32-bit pointers simultaneously. In that case you are back to the bad old days of having near and far pointers.
Also, you will certainly break ABI compatibility with APIs that take pointers to pointers.
I think this would be similar to the MIPS n32 ABI: 64-bit registers with 32-bit pointers.
In the n32 ABI, all registers are 64-bit (so requires a MIPS64 processor). But addresses and pointers are only 32-bit (when stored in memory), decreasing the memory footprint. When loading a 32-bit value (such as a pointer) into a register, it is sign-extended into 64-bits. When the processor uses the pointer/address for a load or store, all 64-bits are used (the processor is not aware of the n32-ess of the SW). If your OS supports n32 programs (maybe the OS also follows the n32 model or it may be a proper 64-bit OS with added n32 support), it can locate all memory used by the n32 application in suitable memory (e.g. the lower 2GB and the higher 2GB, virtual addresses). The only glitch with this model is that when registers are saved on the stack (function calls etc), all 64-bits are used, there is no 32-bit data model in the n32 ABI.
Probably such an ABI could be implemented for x86-64 as well.
On x86, no. On other processors, such as PowerPC it is quite common - 64 bit registers and instructions are available in 32 bit mode, whereas with x86 it tends to be "all or nothing".
I'm afraid that if you are concerned about the size of pointers you might have bigger problems to deal with. If the number of pointers is going to be in the millions or billions, you will probably run into limitations within the Windows OS before you actually run out of physical or virtual memory.
Mark Russinovich has written a great article relating to this, named Pushing the Limits of Windows: Virtual Memory.
Linux now has fairly comprehensive support for the X32 ABI which does exactly what the asker is asking, in fact it is partially supported as a configuration under the Gentoo operating system. I think this question needs to be reviewed in light of resent development.
The second part of your question is easily answered. It is very possible, in fact many C implementations have support, for 64-bit operations using 32-bit code. The C type often used for this is long long (but check with your compiler and architecture).
As far as I know it is not possible to have 32-bit pointers in 64-bit native code.

How did 16-bit C compilers work?

C's memory model, with its use of pointer arithmetic and all, seems to model flat address space. 16-bit computers used segmented memory access. How did 16-bit C compilers deal with this issue and simulate a flat address space from the perspective of the C programmer? For example, roughly what assembly language instructions would the following code compile to on an 8086?
long arr[65536]; // Assume 32 bit longs.
long i;
for(i = 0; i < 65536; i++) {
arr[i] = i;
}
How did 16-bit C compilers deal with
this issue and simulate a flat address
space from the perspective of the C
programmer?
They didn't. Instead, they made segmentation visible to the C programmer, extending the language by having multiple types of pointers: near, far, and huge. A near pointer was an offset only, while far and huge pointers were a combined segment and offset. There was a compiler option to set the memory model, which determined whether the default pointer type was near or far.
In Windows code, even today, you'll often see typedefs like LPCSTR (for const char*). The "LP" is a holdover from the 16-bit days; it stands for "Long (far) Pointer".
C memory model does not in any way imply flat address space. It never did. In fact, C language specification is specifically designed to allow non-flat address spaces.
In the most trivial implementation with segmented address space, the size of the largest continuous object would be limited by the size of the segment (65536 bytes on a 16 bit platform). This means that size_t in such implementation would be 16 bit, and that your code simply would not compile, since you are attempting to declare an object that has larger size than the allowed maximum.
A more complex implementation would support so called huge memory model. You see, there's really no problem addressing continuous memory blocks of any size on a segmented memory model, it just requires some extra efforts in pointer arithmetics. So, within the huge memory model, the implementation would make those extra efforts, which would make the code a bit slower, but at the same time would allow addressing objects of virtually any size. So, your code would compile perfectly fine.
The true 16-bit environments use 16 bit pointers which reach any address. Examples include the PDP-11, 6800 family (6802, 6809, 68HC11), and the 8085. This is a clean and efficient environment, just like a simple 32-bit architecture.
The 80x86 family forced upon us a hybrid 16-bit/20-bit address space in so-called "real mode"—the native 8086 addressing space. The usual mechanism to deal with this was enhancing the types of pointers into two basic types, near (16-bit pointer) and far (32-bit pointer). The default for code and data pointers can be set in bulk by a "memory model": tiny, small, compact, medium, far, and huge (some compilers do not support all models).
The tiny memory model is useful for small programs in which the entire space (code + data + stack) is less than 64K. All pointers are (by default) 16 bits or near; a pointer is implicitly associated with a segment value for the whole program.
The small model assumes that data + stack is less than 64K and in the same segment; the code segment contains only code, so can have up to 64K as well, for a maximum memory footprint of 128K. Code pointers are near and implicitly associated with CS (the code segment). Data pointers are also near and associated with DS (the data segment).
The medium model has up to 64K of data + stack (like small), but can have any amount of code. Data pointers are 16 bits and are implicitly tied to the data segment. Code pointers are 32 bit far pointers and have a segment value depending on how the linker has set up the code groups (a yucky bookkeeping hassle).
The compact model is the complement of medium: less than 64K of code, but any amount of data. Data pointers are far and code pointers are near.
In large or huge model, the default subtype of pointers are 32 bit or far. The main difference is that huge pointers are always automatically normalized so that incrementing them avoids problems with 64K wrap arounds. See this.
In DOS 16 bit, I dont remember being able to do that. You could have multiple things that were each 64K (bytes)(because the segment could be adjusted and the offset zeroed) but dont remember if you could cross the boundary with a single array. The flat memory space where you could willy nilly allocate whatever you wanted and reach as deep as you liked into an array didnt happen until we could compile 32 bit DOS programs (on 386 or 486 processors). Perhaps other operating systems and compilers other than microsoft and borland could generate flat arrays greater than 64kbytes. Win16 I dont remember that freedom until win32 hit, perhaps my memory is getting rusty...You were lucky or rich to have a megabyte of memory anyway, a 256kbyte or 512kbyte machine was not unheard of. Your floppy drive had a fraction of a meg to 1.44 meg eventually, and your hard disk if any had a dozen or few meg, so you just didnt compute thing that large that often.
I remember the particular challenge I had learning about DNS when you could download the entire DNS database of all registered domain names on the planet, in fact you had to to put up your own dns server which was almost required at the time to have a web site. That file was 35megabytes, and my hard disk was 100megabytes, plus dos and windows chewing up some of that. Probably had 1 or 2 meg of memory, might have been able to do 32 bit dos programs at the time. Part if it was me wanting to parse the ascii file which I did in multiple passes, but each pass the output had to go to another file, and I had to delete the prior file to have room on the disk for the next file. Two disk controllers on a standard motherboard, one for the hard disk and one for the cdrom drive, here again this stuff wasnt cheap, there were not a lot of spare isa slots if you could afford another hard disk and disk controller card.
There was even the problem of reading 64kbytes with C you passed fread the number of bytes you wanted to read in a 16 bit int, which meant 0 to 65535 not 65536 bytes, and performance dropped dramatically if you didnt read in even sized sectors so you just read 32kbytes at a time to maximize performance, 64k didnt come until well into the dos32 days when you were finally convinced that the value passed to fread was now a 32 bit number and the compiler wasnt going to chop off the upper 16 bits and only use the lower 16 bits (which happened often if you used enough compilers/versions). We are currently suffering similar problems in the 32 bit to 64 transition as we did with the 16 to 32 bit transition. What is most interesting is the code from the folks like me that learned that going from 16 to 32 bit int changed size, but unsigned char and unsigned long did not, so you adapted and rarely used int so that your programs would compile and work for both 16 and 32 bit. (The code from folks from that generation kind of stands out to other folks that also lived through it and used the same trick). But for the 32 to 64 transition it is the other way around and code not refactored to use uint32 type declarations are suffering.
Reading wallyk's answer that just came in, the huge pointer thing that wrapped around does ring a bell, also not always being able to compile for huge. small was the flat memory model we are comfortable with today, and as with today was easy because you didnt have to worry about segments. So it was a desireable to compile for small when you could. You still didnt have a lot of memory or disk or floppy space so you just didnt normally deal with data that large.
And agreeing with another answer, the segment offset thing was 8088/8086 intel. The whole world was not yet dominated by intel, so there were other platforms that just had a flat memory space, or used other tricks perhaps in hardware (outside the processor) to solve the problem. Because of the segment/offset intel was able to ride the 16 bit thing longer than it probably should have. Segment/offset had some cool and interesting things you could do with it, but it was as much a pain as anything else. You either simplified your life and lived in a flat memory space or you constantly worried about segment boundaries.
Really pinning down the address size on old x86's is sort of tricky. You could say that its 16 bit, because the arithmetic you can perform on an address must fit in a 16 bit register. You could also say that it's 32 bit, because actual addresses are computed against a 16 bit general purpose register and 16 bit segment register (all 32 bits are significant). You could also just say it's 20 bit, because the segment registers are shifted 4 bits left and added to the gp registers for hardware addressing.
It actually doesn't matter that much which one of these you chose, because they are all roughly equal approximations of the c abstract machine. Some compilers let you pick a memory model you were using per compilation, while others just assume 32 bit addresses and then carefully check that operations that could overflow 16 bits emit instructions that handle that case correctly.
Check out this wikipedia entry. About Far pointers. Basically, its possible to indicate a segment and an offset, making it possible to jump to another segment.

Resources