Why is memory fragmentation an issue on a 64-bit machine?

On a 32-bit machine each process gets a 4 GB virtual address space, so it is easy to see why fragmentation could cause trouble. But a 64-bit machine theoretically has a huge addressable virtual memory, so why is memory fragmentation still an issue (if it is) on a 64-bit machine?

Each virtual address that you try to access is mapped by the operating system to physical memory. Physical memory is allocated in pages (e.g. 4K in size). If you manage to allocate a byte at offset 1000000*n and do it for n from 1 to 1000000 (I think you could do that with mmap), then the OS will have to back that with a million pages of physical memory, which is something like 4G. That physical memory will not be available for anything else. If you had allocated the bytes contiguously, you'd only need about 1M of physical memory (256 pages) for your million bytes.
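To make that concrete, here is a minimal userspace sketch of the scattered-touch scenario (scaled down to a 1 GiB reservation so it can actually run; it assumes a Linux-like mmap() and a 4 KiB page size). Each byte written lands on a different page, so the kernel commits roughly a thousand physical pages to hold roughly a thousand useful bytes:

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t span = 1024UL * 1024 * 1024;   /* reserve 1 GiB of virtual address space */
        unsigned char *p = mmap(NULL, span, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touch one byte every 1000000 bytes: ~1 KB of useful data, but each
         * touch dirties a separate 4 KiB page, so ~4 MB of RAM gets committed. */
        size_t touched = 0;
        for (size_t off = 0; off < span; off += 1000000) {
            p[off] = 1;
            touched++;
        }

        printf("touched %zu bytes, each on its own physical page\n", touched);
        munmap(p, span);
        return 0;
    }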
You can get in a similar bad situation if you allocate 4G for legitimate reasons, and then deallocate parts of it, keeping a bit of every page allocated. The OS will not be able to actually reuse the freed memory for anything else because there are no physical pages that are fully free. So that's a fragmentation problem.
In theory, you could imagine that virtual addresses 1000000 and 2000000 would map to the same page of physical memory, avoiding the fragmentation. But in practice, and for good reasons, the virtual memory mapping is done on a page by page basis. You can read more about it here: http://en.wikipedia.org/wiki/Page_table.

Because all that memory is "wasted", consider an application with a lot of internal fragmentation. That process requires more pages in memory because its working set is scattered, which means its memory footprint is much higher. If such an application is contending for physical slots in RAM (typical home machines still have only about 4 - 8 GB of RAM), it causes more page swapping. Generally you want to reduce your application's memory footprint to avoid memory pressure and contention with other applications.
There are cases, though, where it doesn't really matter: an extra megabyte here or there won't kill you, but it all adds up in larger applications. Whether it is important to keep fragmentation as low as possible depends on what you're coding and what the aim of your project is.

Related

Memory Mapping in Linux Kernel - use of vmalloc() and kmalloc()

Consider a 32-bit x86 Linux system with 4 GB of RAM. As described in books as well as on many forums, the memory mapping would be as follows:
Kernel logical addresses - up to 896 MB - one-to-one mapped, allocated using kmalloc().
Kernel virtual addresses - 128 MB (above the 896 MB of kernel logical addresses) - allocated using vmalloc(), which returns virtually contiguous but physically non-contiguous (scattered within RAM) memory pages.
A few points that I am not able to fully understand and need clarity on:
My understanding is that when kmalloc() is used to allocate memory, it always comes from the 0 to 896 MB range of RAM and not beyond.
When we use vmalloc() to allocate memory, is that memory allocated anywhere in the 896 MB to 4 GB range of RAM, or only in the 896 MB to 1 GB range?
When we say that the kernel has only 1 GB of virtual address space, does that mean the kernel cannot access RAM beyond 1 GB? If it can, how is it done? Is the 128 MB of kernel virtual address space used for this purpose?
Please help.
In theory there are 3 different "memory managers". One manages physical RAM (mostly keeping track of pages of free physical RAM), one manages virtual space (what is mapped into each virtual address space where, working with fixed size pieces - the page size), and the third manages "heap" (allowing a larger area of the virtual address space to be split up into arbitrary sized pieces).
Originally, the Linux kernel tried to use its kernel "heap" to manage all 3 of these very different things. By linearly mapping "all RAM" into kernel space it bypassed the need to manage the kernel's virtual memory and ended up with a simple relationship between virtual addresses in kernel space and physical addresses (e.g. "physical = virtual - base"), and by allocating "heap" you also allocated physical memory.
This was fine originally because at the time computers rarely had more than 128 MiB of RAM (and Linus didn't expect the kernel to exist for very long, as GNU was planning to switch to Hurd "soon"), and kernel space was significantly larger than "all RAM". As the amount of RAM increased it became a problem: "all RAM" became larger than kernel space, so "use heap to manage 3 very different things" couldn't work.
Of course, by the time it became a problem, a lot of the kernel's code depended on "kmalloc to allocate physical memory", making it too hard to fix properly. Instead, they split physical memory into 2 zones - one zone managed by "kmalloc" and another managed by "vmalloc" - and then changed pieces of the kernel to use "vmalloc" instead of "kmalloc" wherever it was easy to make those changes.
My understanding is that when kmalloc() is used to allocate memory, it always comes from the 0 to 896 MB range of RAM and not beyond.
Yes; this is the first zone of physical memory, which fits into the kernel space mapping that "kmalloc" uses.
When we use vmalloc() to allocate memory, is that memory allocated anywhere in the 896 MB to 4 GB range of RAM, or only in the 896 MB to 1 GB range?
It would be allocated from any RAM that is not in the first zone (anywhere in "896MB or higher" range).
When we say that the kernel has only 1 GB of virtual address space, does that mean the kernel cannot access RAM beyond 1 GB? If it can, how is it done? Is the 128 MB of kernel virtual address space used for this purpose?
Of the kernel's 1 GiB of virtual space, some (896 MB) will be the linear mapping of the physical address space, some will be memory-mapped (PCI) devices, and some will be set aside as an area where dynamic mappings can be made. For "vmalloc" the kernel allocates physical pages of RAM and then maps them into the "dynamic mapping area" (and returns a pointer to where they were mapped, which has nothing to do with their physical address and breaks the "physical = virtual - base" relationship).
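To illustrate that layout, here is a hedged kernel-module sketch (the module name and messages are illustrative, not from the answer): the kmalloc() buffer comes from the linearly mapped zone, so virt_to_phys() is valid for it, while the vmalloc() buffer lives in the dynamic mapping area, where that simple offset relationship does not hold.

    #include <linux/module.h>
    #include <linux/slab.h>      /* kmalloc()/kfree()   */
    #include <linux/vmalloc.h>   /* vmalloc()/vfree()   */
    #include <linux/io.h>        /* virt_to_phys()      */

    static void *kbuf, *vbuf;

    static int __init mapdemo_init(void)
    {
        phys_addr_t kphys;

        kbuf = kmalloc(PAGE_SIZE, GFP_KERNEL);  /* from the linearly mapped zone  */
        vbuf = vmalloc(PAGE_SIZE);              /* from the dynamic mapping area  */
        if (!kbuf || !vbuf) {
            kfree(kbuf);
            vfree(vbuf);
            return -ENOMEM;
        }

        /* Valid only for linearly mapped addresses: physical = virtual - offset. */
        kphys = virt_to_phys(kbuf);
        pr_info("kmalloc buffer: virt=%px phys=%pa\n", kbuf, &kphys);

        /* A vmalloc address must not be passed to virt_to_phys(). */
        pr_info("vmalloc buffer: virt=%px (no linear physical mapping)\n", vbuf);
        return 0;
    }

    static void __exit mapdemo_exit(void)
    {
        kfree(kbuf);
        vfree(vbuf);
    }

    module_init(mapdemo_init);
    module_exit(mapdemo_exit);
    MODULE_LICENSE("GPL");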
Note 1: Exact sizes/limits are variable - e.g. kernel can be compiled for "2 GiB / 2 GiB split" where kernel space is 2 GiB (instead of "3 GiB / 1 GiB split"); and the size of the "kmalloc zone" probably depends on various factors (how much space PCI devices need, how much RAM there is, etc) and may be something other than 896MB.
Note 2: Since introducing "vmalloc" to work around the original problem; computers switched to 64 bit (where "all memory" can/does fit in kernel space again), and "vmalloc" became unnecessary (and probably just falls through to "kmalloc"). However a lot of other changes have occurred (introduction of NUMA, encrypted RAM, non-volatile RAM, ..; plus more security vulnerabilities than any single person can keep track of) so the original design flaw has reached a temporary "bad idea, but still technically not broken if we keep adding work-arounds for security vulnerabilities" stage (until RAM and non-volatile RAM sizes inevitably increase and "vmalloc" becomes needed again at some point in the future - probably in about 30 years).

What is the difference between mapped region and unmapped region in memory space?

I came across the following paragraph in an article about malloc.
The heap is a contiguous (in terms of virtual addresses) space of memory with three bounds: a starting point, a maximum limit (managed through sys/resource.h's functions getrlimit(2) and setrlimit(2)) and an end point called the break. The break marks the end of the mapped memory space, that is, the part of the virtual address space that has correspondence into real memory.
I would like to better understand the concept of mapped region and unmapped region.
If memory addresses are 64 bits long, as in many modern computers, you have 18446744073709551616 possible memory addresses. (It depends on the processor architecture how many bits can actually be used, but addresses are stored using 64 bits.) That is more than 17 billion gigabytes, which is probably more memory than your computer actually has. So only some of those 17 billion gigabytes correspond to actual memory. For the rest of the addresses, the memory simply doesn't exist. There is no correspondence between the memory address and a memory location. Those addresses are, therefore, unmapped.
That is the simple explanation. In reality, it's a bit more complicated. The memory addresses of your program are not the actual memory addresses of the memory chips, the physical memory, in your computer. Instead, it is virtual memory. Each process has its own memory space, that is, its own 18446744073709551616 addresses, and the memory addresses that a process uses are translated to physical memory addresses by the computer hardware. So one process may have stored some data at memory address 4711, which is actually stored in a real physical memory chip over here, and another process may have also stored some data at memory address 4711, but that is a completely different place, stored in a real physical memory chip over there. The process-internal virtual memory addresses are translated, or mapped, to actual physical memory, but not all of them. The rest, again, are unmapped.
That is, of course, also a simplified explanation. You can use more virtual memory than the amount of physical memory in your computer. This is done by paging, that is, taking some chunks (called pages) of memory not being used right now, and storing them on disk until they are needed again. (This is also called "swapping", even if that term originally meant writing all the memory of a process to disk, not just parts of it.)
And to complicate it even further, some modern operating systems such as Linux and MacOS X (but, I am told, not Windows) overcommit when they allocate memory. This means that they allocate more memory addresses than can be stored on the computer, even using the disk. For example, my computer here, with 32 gigabytes of physical memory and just 4 gigabytes available for paging out data to disk, can't possibly allow for more than 36 gigabytes of actual, usable, virtual memory. But malloc happily allocates more than one hundred thousand gigabytes. It is not until I actually try to store things in all that memory that it is connected to physical memory or disk. But it was part of my virtual memory space, so I would call that mapped memory too, even though it wasn't mapped to anything.
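A small sketch of that overcommit behaviour (64-bit Linux assumed; whether the giant request is accepted depends on /proc/sys/vm/overcommit_memory, so the failure path is handled too). Nothing is backed by RAM or swap until the pages are actually written:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        size_t huge = 100000ULL * 1024 * 1024 * 1024;   /* ~100,000 GiB of virtual space */
        char *p = malloc(huge);

        if (p == NULL) {
            puts("malloc refused the request (strict overcommit accounting)");
            return 1;
        }
        puts("malloc succeeded - the region is mapped but not backed yet");

        /* Writing is what commits physical pages; touch only the first page here,
         * since writing all of it would eventually trigger the OOM killer. */
        memset(p, 0, 4096);
        free(p);
        return 0;
    }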
The mapped region of the heap is the virtual memory area that is backed by physical memory. The unmapped region is the unused virtual address space that does not point to any physical memory.
The boundary between the heap's mapped region and the unmapped region is the program break. When malloc() is used to request memory, the break may be moved to enlarge the mapped region. Linux offers the brk() and sbrk() calls to raise and lower the virtual address of the program break.
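A minimal sketch of watching the break move (Linux/glibc assumed; sbrk(0) returns the current break, and note that modern malloc serves large requests via mmap instead, so only smallish heap allocations move the break):

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        void *before = sbrk(0);          /* current program break */
        void *p = malloc(64 * 1024);     /* small enough to come from the heap, not mmap */
        void *after = sbrk(0);

        printf("break before malloc: %p\n", before);
        printf("break after  malloc: %p\n", after);  /* typically higher: the mapped region grew */

        free(p);
        return 0;
    }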

Means to allocate contiguous physical memory

I am aware that with C malloc and posix_memalign one can allocate contiguous memory from the virtual address space of a process. However, I was wondering whether one can somehow allocate a buffer of physically contiguous memory? I am investigating side channel attacks that exploit the L2 cache, so I want to be sure that I can access the right cache lines.
Your best and easiest shot at contiguous memory is to request a single "huge" page from the system. The availability of those depends on your CPU and kernel options (on x86_64 the 2MB huge pages are usually available and some CPUs can also do 1GB pages; other architectures can be more flexible than this). Check the Hugepagesize field in /proc/meminfo for the size of huge pages on your setup.
Those can be accessed in two ways:
By means of a MAP_HUGETLB flag passed to mmap() (see the sketch after this list). This way you can be sure that the "huge" virtual page corresponds to a contiguous physical memory range. Unfortunately, whether the kernel can supply you with a "huge" page depends on many factors (current layout of memory utilization, kernel options, etc - also see the hugepages kernel boot parameter).
By means of mapping a file from a dedicated HugeTLB filesystem (see here: http://lwn.net/Articles/375096/). With HugeTLB file system you can configure the number of huge pages available in advance for some assurance that the necessary amount of huge pages will be available.
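A hedged sketch of the first option (Linux-specific; it fails cleanly unless huge pages have been reserved, e.g. by writing to /proc/sys/vm/nr_hugepages, and it assumes the common 2 MiB huge page size reported in /proc/meminfo):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 2UL * 1024 * 1024;   /* one 2 MiB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");  /* usually means no huge pages are reserved */
            return 1;
        }

        /* The whole region is backed by a single, physically contiguous huge page. */
        ((char *)p)[0] = 42;
        munmap(p, len);
        return 0;
    }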
The other approach is to write a kernel module which will allocate contiguous physical memory on the kernel side and then map it into your process' address space on request. This approach is sometimes employed on special purpose hardware in embedded systems. Of course, there's still no guarantee that the kernel-side memory allocator will be able to come up with an appropriately sized contiguous physical address range, so on some occasions such address ranges are pre-reserved at boot (one dumb approach is to pass the max_addr parameter to the kernel on boot to leave some of the RAM out of the kernel's reach).
On (almost [Note 1]) all virtual memory architectures, virtual memory is mapped to physical memory in units of a "page". The size of a page is (almost) always a power of 2, and pages are aligned by that size, because the mapping is done by only using the high-order bits of the address. It's common to see a page size of 4K (12 bits of address), although modern CPUs have an option to map much larger pages in order to reduce the size of mapping tables.
Since the L2 cache-line size will generally also be a power of 2 and will be smaller than the page size, any single allocation of that size, aligned to that size, will necessarily lie within a single page, so its bytes will be physically contiguous as well.
So in this particular case, you can be assured that your allocated memory will lie within a single cache line (at least, on standard machine architectures).
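For completeness, a minimal sketch of such an aligned allocation (assuming the common 64-byte cache line; on glibc you can query it with sysconf(_SC_LEVEL2_CACHE_LINESIZE)):

    #include <stdio.h>
    #include <stdlib.h>

    #define CACHE_LINE_SIZE 64   /* assumed cache-line size */

    int main(void) {
        void *buf = NULL;

        /* Alignment equal to the size keeps the whole block inside one cache line,
         * and (being far smaller than a page) inside one physical page. */
        if (posix_memalign(&buf, CACHE_LINE_SIZE, CACHE_LINE_SIZE) != 0) {
            fputs("posix_memalign failed\n", stderr);
            return 1;
        }

        printf("cache-line-aligned buffer at %p\n", buf);
        free(buf);
        return 0;
    }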
Note 1: Undoubtedly there are machines -- possibly imaginary -- which do not function this way. But the one you are playing with is not one of them.

How much data can be malloced at a time? What is the limit in a modern OS such as Linux?

How much data can be malloced, and how is the limit determined? I am writing an algorithm in C that repeatedly uses some data stored in arrays. My idea is to keep this in dynamically allocated arrays, but I am not sure whether it's possible to malloc such amounts.
I use 200 arrays of size 2046, holding complex data of 8 bytes per element. I use these throughout the process, so I do not wish to recalculate them over and over.
What are your thoughts about feasibility of such an approach?
Thanks
Mir
How much memory malloc() can allocate depends on:
How much memory your program can address directly
How much physical memory is available
How much swap space is available
On a modern, flat-memory-model 32-bit system, your program can address 4 gigabytes, but some of the address space (usually 2 gigabytes, sometimes 1 gigabyte) is reserved for the kernel. So, as a rule of thumb, you should be able to allocate almost two gigabytes at once, assuming you have the physical memory and swap space to back it up.
On a 64-bit system, running a 64-bit OS and a 64-bit program, your addressable memory is essentially unlimited.
200 arrays of 2048 bytes each is only 400k, which should fit in cache (even on a smartphone).
A 32-bit OS has a limit of 4 GB; typically some of it (up to half on Win32) is reserved for the operating system - mapping the address space of graphics card memory, etc.
Linux supports 64 GB of physical memory (using Intel's 36-bit PAE) on 32-bit versions.
EDIT: although each process is still limited to 4 GB
The main problem with allocating large amounts of memory is if you need it to be locked in RAM - then you obviously need a lot of RAM. Or if you need it all to be contiguous - it's much easier to get 4 × 1 GB chunks of memory than a single 4 GB chunk with nothing else in the way.
A common approach is to allocate all the memory you need at the start of the program so you can be sure that if the app isn't going to be possible it will fail instantly rather than when it's done 90% of the work.
Don't run other memory intensive apps at the same time.
There are also a bunch of flags you can use to suggest to the kernel that this app should get priority in memory or keep memory locked in RAM - sorry, it's too long since I did HPC on Linux and I'm probably out of date with modern kernels.
I think that on most modern (64-bit) systems you can allocate 4 GB at a time with a single malloc(size_t) call if that much memory is available. How big is each of those 'complex data' entries? If they are 256 bytes each, then you'll only need to allocate 100 MB.
256 bytes × 200 arrays × 2048 entries = 104,857,600 bytes
104,857,600 bytes / 1024 / 1024 = 100 MB
Even at 4096 bytes each that's still only 1600 MB, or ≈ 1.6 GB, so it is feasible on most systems today; my four-year-old laptop has 3 GB of internal memory. Sometimes I do image manipulation with GIMP and it takes up over 2 GB of memory.
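Plugging in the numbers from the question instead (200 arrays × 2046 elements × 8 bytes per complex value) gives only about 3 MB in total. A minimal sketch, assuming C99 float complex for the 8-byte elements:

    #include <complex.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_ARRAYS 200
    #define ARRAY_LEN  2046

    int main(void) {
        float complex *arrays[NUM_ARRAYS];   /* float complex is 8 bytes, as in the question */
        size_t total = 0;

        for (int i = 0; i < NUM_ARRAYS; i++) {
            arrays[i] = malloc(ARRAY_LEN * sizeof *arrays[i]);
            if (arrays[i] == NULL) {
                perror("malloc");
                return 1;
            }
            total += ARRAY_LEN * sizeof *arrays[i];
        }

        printf("allocated %zu bytes (about %.1f MB) in %d arrays\n",
               total, total / (1024.0 * 1024.0), NUM_ARRAYS);

        for (int i = 0; i < NUM_ARRAYS; i++)
            free(arrays[i]);
        return 0;
    }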
With some implementations of malloc(), the regions are not actually backed by memory until they really get used so you can in theory carry on forever (though in practice of course the list of allocated regions assigned to your process in the kernel takes up space, so you might find you can only call malloc() a few million times even if it never actually gives you any memory). It's called "optimistic allocation" and is the strategy used by Linux (which is why it then has the OOM killer, for when it was over-optimistic).

What is the difference between vmalloc and kmalloc?

I've googled around and found most people advocating the use of kmalloc, as you're guaranteed to get contiguous physical blocks of memory. However, it also seems as though kmalloc can fail if a contiguous physical block that you want can't be found.
What are the advantages of having a contiguous block of memory? Specifically, why would I need to have a contiguous physical block of memory in a system call? Is there any reason I couldn't just use vmalloc?
Finally, if I were to allocate memory during the handling of a system call, should I specify GFP_ATOMIC? Is a system call executed in an atomic context?
GFP_ATOMIC
The allocation is high-priority and does not sleep. This is the flag to use in interrupt handlers, bottom halves and other situations where you cannot sleep.
GFP_KERNEL
This is a normal allocation and might block. This is the flag to use in process context code when it is safe to sleep.
You only need to worry about using physically contiguous memory if the buffer will be accessed by a DMA device on a physically addressed bus (like PCI). The trouble is that many system calls have no way to know whether their buffer will eventually be passed to a DMA device: once you pass the buffer to another kernel subsystem, you really cannot know where it is going to go. Even if the kernel does not use the buffer for DMA today, a future development might do so.
vmalloc is often slower than kmalloc, because it may have to remap the buffer space into a virtually contiguous range. kmalloc never remaps, though if not called with GFP_ATOMIC kmalloc can block.
kmalloc is limited in the size of buffer it can provide: 128 KBytes*). If you need a really big buffer, you have to use vmalloc or some other mechanism like reserving high memory at boot.
*) This was true of earlier kernels. On recent kernels (I tested this on 2.6.33.2), max size of a single kmalloc is up to 4 MB! (I wrote a fairly detailed post on this.) — kaiwan
For a system call you don't need to pass GFP_ATOMIC to kmalloc(), you can use GFP_KERNEL. You're not an interrupt handler: the application code enters the kernel context by means of a trap, it is not an interrupt.
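A hedged sketch of that distinction (the function names here are illustrative only): in process context, e.g. while servicing a system call, GFP_KERNEL is fine and may sleep; in an interrupt handler you must fall back to GFP_ATOMIC, which never sleeps but fails more readily under memory pressure.

    #include <linux/slab.h>
    #include <linux/gfp.h>
    #include <linux/interrupt.h>

    /* Process context (e.g. reached from a system call): GFP_KERNEL may
     * sleep while the allocator reclaims memory, which is fine here. */
    static void *alloc_in_process_context(size_t size)
    {
        return kmalloc(size, GFP_KERNEL);
    }

    /* Interrupt context: sleeping is forbidden, so GFP_ATOMIC is required;
     * be prepared for it to return NULL under memory pressure. */
    static irqreturn_t demo_irq_handler(int irq, void *dev_id)
    {
        void *scratch = kmalloc(64, GFP_ATOMIC);

        if (scratch)
            kfree(scratch);
        return IRQ_HANDLED;
    }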
Short answer: download Linux Device Drivers and read the chapter on memory management.
Seriously, there are a lot of subtle issues related to kernel memory management that you need to understand - I spend a lot of my time debugging problems with it.
vmalloc() is very rarely used, because the kernel rarely uses virtual memory. kmalloc() is what is typically used, but you have to know what the consequences of the different flags are and you need a strategy for dealing with what happens when it fails - particularly if you're in an interrupt handler, like you suggested.
Linux Kernel Development by Robert Love (Chapter 12, page 244 in 3rd edition) answers this very clearly.
Yes, physically contiguous memory is not required in many of the cases. The main reason kmalloc is used more than vmalloc in the kernel is performance. The book explains that when big memory chunks are allocated using vmalloc, the kernel has to map the physically non-contiguous chunks (pages) into a single contiguous virtual memory region. Since the memory is virtually contiguous and physically non-contiguous, several virtual-to-physical address mappings have to be added to the page table; in the worst case, (size of buffer / page size) mappings are added.
This also adds pressure on the TLB (the cache entries storing recent virtual-to-physical address mappings) when accessing this buffer, which can lead to thrashing.
The kmalloc() & vmalloc() functions are a simple interface for obtaining kernel memory in byte-sized chunks.
The kmalloc() function guarantees that the pages are physically contiguous (and virtually contiguous).
The vmalloc() function works in a similar fashion to kmalloc(), except it allocates memory that is only virtually contiguous and not necessarily physically contiguous.
What are the advantages of having a contiguous block of memory? Specifically, why would I need to have a contiguous physical block of memory in a system call? Is there any reason I couldn't just use vmalloc?
From Google's "I'm Feeling Lucky" on vmalloc:
kmalloc is the preferred way, as long as you don't need very big areas. The trouble is, if you want to do DMA from/to some hardware device, you'll need to use kmalloc, and you'll probably need a bigger chunk. The solution is to allocate memory as soon as possible, before memory gets fragmented.
On a 32-bit system, kmalloc() returns a kernel logical address (it is a virtual address, though) which has a direct mapping (actually with a constant offset) to the physical address.
This direct mapping ensures that we get a contiguous physical chunk of RAM, which is suited for DMA, where we supply only the initial pointer and expect a contiguous physical mapping thereafter for our operation.
vmalloc() returns a kernel virtual address which might not have a contiguous mapping in physical RAM.
It is useful for large memory allocations and for cases where we don't care whether the memory allocated to our process is also contiguous in physical RAM.
Another difference is that kmalloc returns a logical address (unless you specify GFP_HIGHMEM). Logical addresses are placed in "low memory" (the first gigabyte of physical memory) and are mapped directly to physical addresses (use the __pa macro to convert them). This property implies that kmalloc'ed memory is contiguous memory.
On the other hand, vmalloc is able to return virtual addresses from "high memory". These addresses cannot be converted to physical addresses in a direct fashion (you have to use the virt_to_page function).
In short, vmalloc and kmalloc both address fragmentation: vmalloc uses memory mappings to work around external fragmentation, while kmalloc uses the slab allocator to reduce internal fragmentation. For what it's worth, kmalloc also has many other advantages.
