Considering a 32bit x86 Linux system with 4 GB of RAM memory, So as described in books as well as on many forums that the Memory mapping would be as follows:
Kernel logical address - upto 896 MB - Which is one to one mapped and can be allocated using kmalloc().
Kernel virtual address - 128MB (above 896MB - kernel logical address) - allocated using vmalloc() and allocates Virtually contiguous but physically(scattered within RAM) non-contiguous memory pages.
Few points that I am not able to fully understand and need clarity on.
My understanding is that When kmalloc() is used to allocate memory, It always comes from the 0 to 896MB within the RAM and not beyond that.
When we use vmalloc() to allocate memory, Does that memory allocated anywhere from 896MB to 4GB range within the RAM ? or it is allocated only from 896MB upto 1GB range within RAM?
When we say that kernel has only 1GB of virtual address space, Does that meant that the kernel can not access the RAM beyond 1GB ? If it can then how it is done ? Does the 128MB of kernel virtual address space are used for this purpose ?
Please help.
In theory there are 3 different "memory managers". One manages physical RAM (mostly keeping track of pages of free physical RAM), one manages virtual space (what is mapped into each virtual address space where, working with fixed size pieces - the page size), and the third manages "heap" (allowing a larger area of the virtual address space to be split up into arbitrary sized pieces).
Originally; the Linux kernel tried to use its kernel "heap" to manage all 3 of these very different things. By linearly mapping "all RAM" into kernel space they bypass the need for managing the kernel's virtual memory and end up with a simple relationship between virtual addresses in kernel space and physical addresses (e.g. "physical = virtual - base"), and by allocating "heap" you also allocate physical memory.
This was fine originally because at the time computers rarely had more than 128 MiB of RAM (and Linus didn't expect the kernel to exist for very long, as GNU were planning to switch to Hurd "soon"), and kernel space was significantly larger than "all RAM". As the amount of RAM increased it became a problem -"all RAM" became larger than kernel space, so "use heap to manage 3 very different things" couldn't work.
Of course once it became a problem a lot of the kernel's code dependent on "kmalloc to allocate physical memory", making it too hard to fix the problem. Instead, they split physical memory into 2 zones - one zone that would be managed by "kmalloc" and another zone that is managed by "vmalloc"; then changed pieces of the kernel to use "vmalloc" instead of "kmalloc" where it's easy make those changes.
My understanding is that When kmalloc() is used to allocate memory, It always comes from the 0 to 896MB within the RAM and not beyond that.
Yes; this is the first zone of physical memory, which fits into the kernel space mapping that "kmalloc" uses.
When we use vmalloc() to allocate memory, Does that memory allocated anywhere from 896MB to 4GB range within the RAM ? or it is allocated only from 896MB upto 1GB range within RAM?
It would be allocated from any RAM that is not in the first zone (anywhere in "896MB or higher" range).
When we say that kernel has only 1GB of virtual address space, Does that meant that the kernel can not access the RAM beyond 1GB ? If it can then how it is done ? Does the 128MB of kernel virtual address space are used for this purpose ?
Of the kernel's 1 GiB of virtual space; some (896MB) will be the linear mapping of the physical address space, some will be memory mapped (PCI) devices, and some will be set aside as an area where dynamic mappings can be made. For "vmalloc" the kernel would allocate physical pages of RAM and then map them into the "dynamic mapping area" (and return a pointer to where it was mapped that has nothing to do with its physical address and breaks the "physical = virtual - base" relationship).
Note 1: Exact sizes/limits are variable - e.g. kernel can be compiled for "2 GiB / 2 GiB split" where kernel space is 2 GiB (instead of "3 GiB / 1 GiB split"); and the size of the "kmalloc zone" probably depends on various factors (how much space PCI devices need, how much RAM there is, etc) and may be something other than 896MB.
Note 2: Since introducing "vmalloc" to work around the original problem; computers switched to 64 bit (where "all memory" can/does fit in kernel space again), and "vmalloc" became unnecessary (and probably just falls through to "kmalloc"). However a lot of other changes have occurred (introduction of NUMA, encrypted RAM, non-volatile RAM, ..; plus more security vulnerabilities than any single person can keep track of) so the original design flaw has reached a temporary "bad idea, but still technically not broken if we keep adding work-arounds for security vulnerabilities" stage (until RAM and non-volatile RAM sizes inevitably increase and "vmalloc" becomes needed again at some point in the future - probably in about 30 years).
Related
Architecture: X86, Linux
I have gone through several threads related to kmalloc() and vmalloc() and I know the basic differences between them but still have some doubts. I need your help to understand the fundamentals.
So As we know that in 32-bit systems the virtual address space is divided between user and kernel. The high 1GB memory addresses are allocated for kernel and the lower 3GB to user space.
Please correct me i am wrong here:
So out of 1GB 896MB is 1:1 mapped (physically contiguous) to the kernel logical address and 128MB is used to access the high memory addresses means accessing the physical(RAM) beyond 896 MB by mapping it into the 128MB using vmalloc()? (lets consider we have 2GB of RAM)
And what would happen if we have only 1GB of physical RAM in point 1. The 896MB is 1:1 mapping between kernel logical address and physical addresses achieved through kmalloc(). Then what would be the use of 128MB, why do we need vmalloc() anymore? So lets say if we have 1GB of RAM which entirely can be accessed through kernel logical address so does that mean that it is possible only when the amount of memory requested by a thread/process can be allocated in physically contiguous addresses. And if there is no physical contiguous memory available, in that case memory allocated will not be physically contiguous, hence we need to use vmalloc() to map it in 128MB of space and access it through there?
What happens when we use 64-bit system? Is it true that both kmalloc() and vmalloc() can have enormous amount of address space and can access entire RAM?
I came across the following paragraph in an article about malloc.
The heap is a continuous (in term of virtual addresses) space of
memory with three bounds:a starting point, a maximum limit (managed
through sys/ressource.h’s functions getrlimit(2) and setrlimit(2)) and
an end point called the break. The break marks the end of the mapped
memory space, that is, the part of the virtual address space that has
correspondence into real memory.
I would like to better understand the concept of mapped region and unmapped region.
If memory addresses are 64 bits long, as in many modern computers, you have 18446744073709551616 possible memory addresses. (It depends on the processor architecture how many bits can actually be used, but addresses are stored using 64 bits.) That is more than 17 billion gigabytes, which is probably more memory than your computer actually has. So only some of those 17 billion gigabytes correspond to actual memory. For the rest of the addresses, the memory simply doesn't exist. There is no correspondence between the memory address and a memory location. Those addresses are, therefore, unmapped.
That is the simple explanation. In reality, it's a bit more complicated. The memory addresses of your program are not the actual memory addresses of the memory chips, the physical memory, in your computer. Instead, it is virtual memory. Each process has its own memory space, that is, its own 18446744073709551616 addresses, and the memory addresses that a process uses are translated to physical memory addresses by the computer hardware. So one process may have stored some data at memory address 4711, which is actually stored in a real physical memory chip over here, and another process may have also stored some data at memory address 4711, but that is a completely different place, stored in a real physical memory chip over there. The process-internal virtual memory addresses are translated, or mapped, to actual physical memory, but not all of them. The rest, again, are unmapped.
That is, of course, also a simplified explanation. You can use more virtual memory than the amount of physical memory in your computer. This is done by paging, that is, taking some chunks (called pages) of memory not being used right now, and storing them on disk until they are needed again. (This is also called "swapping", even if that term originally meant writing all the memory of a process to disk, not just parts of it.)
And to complicate it even further, some modern operating systems such as Linux and MacOS X (but, I am told, not Windows) overcommit when they allocate memory. This means that they allocate more memory addresses than can be stored on the computer, even using the disk. For example, my computer here with 32 gigabytes of physical memory, and just 4 gigabytes available for paging out data to disk, can't possibly allow for more than 36 gigabyes of actual, usable, virtual memory. But malloc happily allocates more than one hundred thousand gigabytes. It is not until I actually try to store things in all that memory that it is connected to physical memory or disk. But it was part of my virtual memory space, so I would call that too mapped memory, even though it wasn't mapped to anything.
The mapped region in the heap means the virtual memory area that can be mapped together with physical memory. And the unmapped region indicates the unused virtual memory space which does not point to any physical memory.
The boundary between mapped region for the heap and unmapped region is the system break point. As the malloc() is used to request some memory, the system break point would be moved to enlarge the mapped region. Linux system offeres brk() and sbrk() methods to increase and decrease the virtual memory address of the system break point.
I am aware that with C malloc and posix_memaligh one can allocate contiguous memory from the virtual address space of a process. However, I was wondering whether somehow one can allocate a buffer of physically contiguous memory? I am investigating side channel attacks that exploit L2 cache so I want to be sure that I can access the right cache lines..
Your best and easiest take at continuous memory is to request a single "huge" page from the system. The availability of those depends on your CPU and kernel options (on x86_64 the 2MB huge pages are usually available and some CPUs can also do 1GB pages; other architectures can be more flexible than this). Check out Hugepagesize field in /proc/meminfo for the size of huge pages on your setup.
Those can be accessed in two ways:
By means of a MAP_HUGETLB flag passed to mmap(). This way you can be sure that the "huge" virtual page corresponds to a continuous physical memory range. Unfortunately, whether the kernel can supply you with a "huge" page depends on many factors (current layout of memory utilization, kernel options, etc - also see the hugepages kernel boot parameter).
By means of mapping a file from a dedicated HugeTLB filesystem (see here: http://lwn.net/Articles/375096/). With HugeTLB file system you can configure the number of huge pages available in advance for some assurance that the necessary amount of huge pages will be available.
The other approach is to write a kernel module which will allocate continuous physical memory on the kernel side and then map it into your process' address space on request. This approach is sometimes employed on special purpose hardware in embedded systems. Of course, there's still no guarantee that the kernel side memory allocator will be able to come with an appropriately sized continuous physical address range, so on some occasions such address ranges are pre-reserved on boot (one dumb approach is to pass max_addr parameter to kernel on boot to leave some of the RAM out of kernel's reach).
On (almost [Note 1]) all virtual memory architectures, virtual memory is mapped to physical memory in units of a "page". The size of a page is (almost) always a power of 2, and pages are aligned by that size, because the mapping is done by only using the high-order bits of the address. It's common to see a page size of 4K (12 bits of address), although modern CPUs have an option to map much larger pages in order to reduce the size of mapping tables.
Since L2_CACHE_SIZE will generally also be a power of 2 and will be smaller than the page size, any single aligned allocation of size L2_CACHE_SIZE will necessarily be in a single page, so the bytes in the alignment will be physically contiguous as well.
So in this particular case, you can be assured that your allocated memory will be a single cache-line (at least, on standard machine architectures).
Note 1: Undoubtedly there are machines -- possibly imaginary -- which do not function this way. But the one you are playing with is not one of them.
In a 32-bit machine each process gets a 4GB virtual space. In this case one can worry that we might face trouble due to fragmentation. But in the case of a 64-bit machine we theoretically have a huge addressable virtual memory, so why is memory fragmentation still an issue (if it is) in a 64-bit machine?
Each virtual address that you try to access is mapped by the operating system to physical memory. Physical memory is allocated in pages (e.g. 4K in size). If you manage to allocate a byte at offset 1000000*n and do it for n from 1 to 1000000 (I think you could do that with mmap), then the OS will have to back that with a million pages of physical memory, which is something like 4G. That physical memory will not be available for anything else. If you had allocated the bytes contiguously, you'd only need about 1M of physical memory (256 pages) for your million bytes.
You can get in a similar bad situation if you allocate 4G for legitimate reasons, and then deallocate parts of it, keeping a bit of every page allocated. The OS will not be able to actually reuse the freed memory for anything else because there are no physical pages that are fully free. So that's a fragmentation problem.
In theory, you could imagine that virtual addresses 1000000 and 2000000 would map to the same page of physical memory, avoiding the fragmentation. But in practice, and for good reasons, the virtual memory mapping is done on a page by page basis. You can read more about it here: http://en.wikipedia.org/wiki/Page_table.
Because all that memory is "wasted" consider an application where you have a lot of internal fragmentation. That process requires more pages in memory because the working set is now scattered in memory and that means its memory footprint is much higher. If this application is contending for physical slots in RAM (machines still really only have about 4 - 8 GB of RAM for a typical home setup) then it causes more page swapping. Generally you want to reduce your applications memory footprint to avoid memory pressure and contention with other applications.
There are cases though where it doesn't really matter, it won't kill you to use an extra megabyte here or there but it all adds up in larger applications. It depends on the situation as to whether or not it is important to have as little fragmentation as possible depending on what you're coding or what the aim of your project is.
How much data can be malloced and how is the limit determined? I am writing an algorithm in C that basically utilizes repeatedly some data stored in arrays. My idea is to have this saved in dynamically allocated arrays but I am not sure if it's possible to have such amounts malloced.
I use 200 arrays of size 2046 holding complex data of size 8 byte each. I use these throughout the process so I do not wish to calculate it over and over.
What are your thoughts about feasibility of such an approach?
Thanks
Mir
How much memory malloc() can allocate depends on:
How much memory your program can address directly
How much physical memory is available
How much swap space is available
On a modern, flat-memory-model 32-bit system, your program can address 4 gigabytes, but some of the address space (usually 2 gigabytes, sometimes 1 gigabyte) is reserved for the kernel. So, as a rule of thumb, you should be able to allocate almost two gigabytes at once, assuming you have the physical memory and swap space to back it up.
On a 64-bit system, running a 64-bit OS and a 64-bit program, your addressable memory is essentially unlimited.
200 arrays of 2048 bytes each is only 400k, which should fit in cache (even on a smartphone).
A 32bit OS has a limit of 4Gb, typically some (upto half on win32) are reserved for the operating system - mapping the address space of graphcis card memory etc.
Linux supports 64Gb of address space (using Intel's 36bit PAE) on 32bit versions.
EDIT: although each process is limited to 4Gb
The main problem with allocating large amounts of memory is if you need it to be locked in RAM - then you obviously need a lot of RAM. Or if you need it all to be contiguous - it's much easier to get 4 * 1Gb chunks of memory than a single 4Gb chunk with nothing else in the way.
A common approach is to allocate all the memory you need at the start of the program so you can be sure that if the app isn't going to be possible it will fail instantly rather than when it's done 90% of the work.
Don't run other memory intensive apps at the same time.
There are also a bunch of flags you can use to suggest to the kernel that this app should get priority in memory or keep memory locked in ram - sorry it's too long since i did HPC on linux and i'm probably out of date with modern kernels.
I think that on most mordern (64bit) systems you can allocate 4GB at a time with a malloc( size_t ) call if that much memory is available. How big is each of those 'complex data' entries? if they are of the size 256 bytes, then you'll only need to allocate 100MB.
256bytes × 200 arrays × 2048 entries = 104857600bytes
104857600 bytes / 1024 / 1024 = 100MB.
So for 4096bytes each that's still only 1600MB or ≃ 1.6GB so it is feasible on most systems today, my four year old laptop got 3GB internal memory. Sometimes I does image manipulation with GIMP and it takes up over 2GB of memory.
With some implementations of malloc(), the regions are not actually backed by memory until they really get used so you can in theory carry on forever (though in practice of course the list of allocated regions assigned to your process in the kernel takes up space, so you might find you can only call malloc() a few million times even if it never actually gives you any memory). It's called "optimistic allocation" and is the strategy used by Linux (which is why it then has the OOM killer, for when it was over-optimistic).