I learnt that when we manage a data structure such as a tree or other graph, its nodes are stored in units called blocks. A block can hold several nodes of the graph, and it is the block that is transferred when the data structure moves between secondary and primary memory. So I think it's pretty clear what a block is: its size depends on the architecture but is often 4K. Now I want to know how a block relates to memory pages. Do pages consist of blocks, or what is the relation of blocks to pages? Can we define what a page is in memory in terms of a block?
You typically try to define a block so it's either the same size as a memory page, or its size evenly divides the size of a memory page, so an integral number of blocks will fit in a page.
As you mentioned, 4K tends to work well -- typical memory page sizes are 4K and 8K. Most also support at least one larger page size (e.g., 1 megabyte) but you can typically more or less ignore them; they're used primarily for mapping single, large chunks of contiguous memory (e.g., the part of graphics memory that's directly visible to the CPU).
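As a minimal sketch of that sizing rule (the function name blocks_per_page is made up for illustration), the check below reports how many blocks fit in one page, or 0 when the block size does not divide the page size evenly:

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many blocks fit in one page, or 0 if the block size
   is larger than the page or does not divide the page size evenly. */
size_t blocks_per_page(size_t page_size, size_t block_size)
{
    assert(page_size > 0 && block_size > 0);
    if (block_size > page_size || page_size % block_size != 0)
        return 0;
    return page_size / block_size;
}
```

On POSIX systems the actual page size can be obtained at run time with sysconf(_SC_PAGESIZE) and passed in as page_size.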
My background knowledge:
To my understanding, to be allocated/used properly, memory must be contiguous in the virtual address space, but doesn't have to be actually contiguous in the physical memory or even the physical memory address space.
This would kind of suggest that the translation from virtual to physical memory addresses works as a series of mappings, where any free memory blocks in the physical memory address space get assigned to a corresponding area in the virtual memory address space.
Setup to the question:
This answer, in response to a question about freeing memory in C, refers to memory fragmentation, a scenario in which (in this specific case) repeatedly allocating and freeing memory could result in there existing enough OS-allocated memory for future process usage, but it can't be used because it isn't contiguous in the free store linked-list.
If we could keep plucking memory blocks out of the OS-allocated memory that are not in use even if they are dispersed (not contiguous), wouldn't that fix the problem of memory fragmentation? To me, that seems exactly the same way that the physical to virtual memory address translations work, where non-contiguous blocks are utilized as if they were contiguous.
So, to repeat my question, why does memory have to be contiguous?
Two issues here:
It is necessary for each object to occupy a contiguous region in virtual memory, so that indexing and pointer arithmetic can be done efficiently. If you have an array int arr[5000];, then a statement like arr[i] = 0; boils down to simple arithmetic: the value of i is multiplied by 4 (or whatever sizeof(int) may be) and then added to the base address of arr. Addition is very fast for a CPU. If the elements of arr weren't located at consecutive virtual addresses, then arr[i] would require some more elaborate computation, and your program would be orders of magnitude slower. Likewise, with contiguous arrays, pointer arithmetic like ptr++ really is just addition.
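To make the arithmetic concrete, here is a small sketch (element_address is a hypothetical helper, not a library function) that computes the address of arr[i] by hand and lands on the same address the compiler produces:

```c
#include <assert.h>
#include <stddef.h>

/* Computes the address of base[i] manually: base + i * sizeof(int).
   The caller must make sure i is within the bounds of the array. */
int *element_address(int *base, size_t i)
{
    assert(base != NULL);
    char *bytes = (char *)base;              /* byte-granular view of the array */
    return (int *)(bytes + i * sizeof(int)); /* one multiply, one add */
}
```

For an array int arr[5000], element_address(arr, i) compares equal to &arr[i] for every valid i, which only works because the elements sit at consecutive virtual addresses.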
Virtual memory has granularity. Every mapping of a virtual to a physical address requires some metadata to be kept somewhere in memory (say 8 bytes per mapping), and when this mapping is used, it is cached by the CPU in a translation lookaside buffer (TLB), which requires some silicon on the chip. If every byte of memory could be mapped independently, your mapping tables would require 8 times more memory than the program itself, and you'd need an immense number of TLB entries to keep them cached.
So virtual memory is instead done in units of pages, typically 4KB or 16KB or so. A page is a contiguous 4K region of virtual memory that is mapped to a contiguous 4K region of physical memory. Thus you only need one entry in your mapping tables (page tables) and TLB for the entire page, and addresses within a page are mapped one-to-one (the low bits of the virtual address are used directly as the low bits of the physical address).
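The split between the translated high bits and the untouched low bits can be sketched as follows. This assumes a 4096-byte page; ASSUMED_PAGE_SIZE and the function names are illustrative only, and a real MMU does this in hardware rather than with division:

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumption: 4 KB pages */

/* Virtual page number: the part that is looked up in the page table. */
uint64_t page_number(uint64_t vaddr)
{
    return vaddr / ASSUMED_PAGE_SIZE;
}

/* Offset within the page: the low bits, which are not translated. */
uint64_t page_offset(uint64_t vaddr)
{
    return vaddr % ASSUMED_PAGE_SIZE;
}

/* Physical address: the frame chosen by the page table, plus the same offset. */
uint64_t physical_address(uint64_t frame, uint64_t vaddr)
{
    return frame * ASSUMED_PAGE_SIZE + page_offset(vaddr);
}
```

For example, virtual address 0x12345 is page 0x12 with offset 0x345; if that page is mapped to frame 0x7, the physical address is 0x7345.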
But this means that fragmentation by sub-page amounts can't be fixed with virtual memory. As in Steve Summit's example, suppose you allocated 1000 objects of 1KB each, which were stored consecutively in virtual memory. Now you free all the odd-numbered objects. Nominally there is now 500 KB of memory available. But if you now want to allocate a new object of size 2KB, none of that 500 KB is usable, since there is no contiguous block of size 2KB in your original 1000 KB region. The available 1KB blocks can't be remapped to coalesce them into a larger block, because they can't be separated from the even-numbered objects with which they share pages. And the even-numbered objects can't be moved around in virtual memory, because there may be pointers to those objects elsewhere in the program, and there is no good way to update them all. (Implementations that do garbage collection might be able to do so, but most C/C++ implementations do not, because that comes with substantial costs of its own.)
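The 1000-object example can be modelled with a toy occupancy map. This is only a simulation of the bookkeeping (one bool per 1 KB slot, names invented here), not how a real allocator stores its free list:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOT_COUNT 1000  /* 1000 slots of 1 KB each, as in the example */

/* Marks every slot as used, then frees the odd-numbered ones. */
void free_odd_slots(bool used[SLOT_COUNT])
{
    for (size_t i = 0; i < SLOT_COUNT; i++)
        used[i] = (i % 2 == 0);
}

/* Total number of free 1 KB slots. */
size_t total_free(const bool used[SLOT_COUNT])
{
    size_t n = 0;
    for (size_t i = 0; i < SLOT_COUNT; i++)
        if (!used[i])
            n++;
    return n;
}

/* Length, in slots, of the longest run of adjacent free slots. */
size_t largest_free_run(const bool used[SLOT_COUNT])
{
    size_t best = 0, run = 0;
    for (size_t i = 0; i < SLOT_COUNT; i++) {
        run = used[i] ? 0 : run + 1;
        if (run > best)
            best = run;
    }
    return best;
}
```

After free_odd_slots, total_free reports 500 slots (500 KB) but largest_free_run reports 1, so a 2 KB request cannot be satisfied from this region.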
So, to repeat my question, why does memory have to be contiguous?
It doesn't have to be contiguous.
To avoid fragmentation within a block of OS-allocated memory (a page), you need to ensure that each piece of memory allocated from the "heap" (e.g. using "malloc()") is at least as large as a block of OS-allocated memory. This gives 2 possible options:
a) change the hardware (and OS/software) so that a block of OS-allocated memory is much smaller (e.g. maybe the same size as a cache line, or maybe 64 bytes instead of 4 KiB). This would significantly increase the overhead of managing virtual memory.
b) change the minimum allocation size of the heap so that it's much larger. Typically (for modern systems) if you "malloc(1);" it rounds the size up to 8 bytes or 16 bytes for alignment and calls it "padding". In the same way, it could round the size up to the size of a block of OS-allocated memory instead and call that "padding" (e.g. "malloc(1);" might have 4095 bytes of padding and cost 4 KiB of memory). This is worse than fragmentation because padding can't be allocated (e.g. if you did "malloc(1); malloc(1);" those allocations couldn't use different parts of the same block of OS-allocated memory).
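The cost of option b) is easy to quantify. The sketch below (round_up and padding are illustrative helpers, not part of any allocator API) rounds a request up to a granule and reports the wasted bytes:

```c
#include <assert.h>
#include <stddef.h>

/* Rounds size up to the next multiple of granule. */
size_t round_up(size_t size, size_t granule)
{
    assert(granule > 0);
    return (size + granule - 1) / granule * granule;
}

/* Bytes of padding added when size is rounded up to granule. */
size_t padding(size_t size, size_t granule)
{
    return round_up(size, granule) - size;
}
```

With a 16-byte granule, a 1-byte request wastes 15 bytes; with a 4096-byte granule it wastes 4095 bytes, which matches the figure above.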
However; this only really applies to small allocations. If you use "malloc();" to allocate a large amount of memory (e.g. maybe 1234 KiB for an array) most modern memory managers will just use blocks of OS-allocated memory, and won't have a reason to care about fragmentation for those large blocks.
In other words; for smaller allocations you can solve fragmentation in the way you've suggested but it would be worse than allowing some fragmentation; and for larger allocations you can solve fragmentation in the way you've suggested and most modern memory managers already do that.
I read that arrays are contiguous in Virtual Memory but probably not in Physical memory, and I don't get that.
Let's suppose I have an array of size 4KB (one page = one frame size). In virtual memory that array is one page.
In virtual memory every page is translated into one frame, so our array is still contiguous...
(In Page Table we translate pages into frames not every byte into its own frame...)
Side Question: (When Answering this please mention clearly it's for the side note):
When allocating array in virtual memory of size one page does it have to be one page or could be split into two contiguous pages in virtual memory (for example bottom half of first one and top half of the second)? In this case at worst the answer above is 2, am I wrong?
Unless the start of the array happens to be aligned to the beginning of a memory page, it can still occupy two pages; it can start near the end of one page and end on the next page. Arrays allocated on the stack will probably not be forced to occupy a single page, because stack frames are simply allocated sequentially in the stack memory, and the array will usually be at the same offset within each stack frame.
The heap memory allocator (malloc()) could try to ensure that arrays that are smaller than a page will be allocated entirely on the same page, but I'm not sure if this is actually how most allocators are implemented. Doing this might increase memory fragmentation.
I read that arrays are contiguous in Virtual Memory but probably not in Physical memory, and I don't get that.
This statement is missing something very important: the array size.
For small arrays the statement is wrong. For "large/huge" arrays the statement is correct.
In other words: The probability of an array being split over multiple non-contiguous physical pages is a function of the array size.
For small arrays the probability is close to zero, but it increases as the array size increases. When the array size grows beyond the system's page size, the probability gets closer and closer to 1. But an array requiring multiple pages may still be contiguous in physical memory.
For your side question:
With an array size equal to your system's page size, the array can span at most two physical pages.
Anything (array, structure, ...) that is larger than the page size must be split across multiple pages; and therefore may be "virtually contiguous, physical non-contiguous".
Without further knowledge or restriction; anything (array, structure, ...) that is between its minimum alignment (e.g. 4 bytes for an array of uint32_t) and the page size has a probability of being split across multiple pages; where the probability depends on its size and alignment. For example, if page size is 4096 bytes and an array has a minimum alignment of 4 bytes and a size of 4092 bytes, then there's 2 chances in 1024 that it will end up on a single page (and a 99.8% chance that it will be split across multiple pages).
Anything (variable, tiny array, tiny structure, ...) that has a size equal to its minimum alignment won't (shouldn't - see note 3) be split across multiple pages.
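The "2 chances in 1024" figure can be reproduced by enumerating every aligned start offset within a page. This is a sketch that assumes each aligned offset is equally likely, which real allocators do not guarantee; the function names are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if an object of size bytes that starts at byte offset within a page
   extends past the end of that page. */
bool crosses_page(size_t offset, size_t size, size_t page_size)
{
    assert(size > 0 && offset < page_size);
    return offset + size > page_size;
}

/* Counts the aligned start offsets within a page for which the object
   stays entirely on that page. */
size_t single_page_placements(size_t size, size_t align, size_t page_size)
{
    assert(align > 0);
    size_t count = 0;
    for (size_t offset = 0; offset < page_size; offset += align)
        if (!crosses_page(offset, size, page_size))
            count++;
    return count;
}
```

For a 4092-byte array with 4-byte alignment and 4096-byte pages there are 1024 possible start offsets, and only offsets 0 and 4 keep the array on one page.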
Note 1: For anything using memory allocated from the heap, the minimum alignment can be assumed to be the (implementation defined) minimum alignment provided by the heap and not the minimum alignment of the object itself. E.g. for an array of uint16_t the minimum alignment would be 2 bytes; but malloc() will return memory with much larger alignment (maybe 16 bytes)
Note 2: When things are nested (e.g. array inside a structure inside another structure) all of the above applies to the outer structure only. E.g. if you have an array of uint16_t inside a structure where the array happens to begin at offset 4094 within the structure; then it will be significantly more likely that the array will be split across pages.
Note 3: It's possible to explicitly break minimum alignment using pointers (e.g. use malloc() to allocate 1024 bytes, then create a pointer to an array that begins at any offset you want within the allocated area).
Note 4: If something (array, structure, ...) is split across multiple pages; then there's a chance that it will still be physically contiguous. For worst case this depends on the amount of physical memory (e.g. if the computer has 1 GiB of usable physical memory and 4096 byte pages, then there's approximately 1 chance in 262000 that 2 virtually contiguous pages will be "physically contiguous by accident"). If the OS implements page/cache coloring (see https://en.wikipedia.org/wiki/Cache_coloring ) it improves the probability of "physically contiguous by accident" by the number of page/cache "colors" (e.g. if the computer has 1 GiB of usable physical memory and 4096 byte pages, and the OS uses 256 page/cache colors, then there's approximately 1 chance in 1024 that 2 virtually contiguous pages will be "physically contiguous by accident").
Note 5: Most modern operating systems use multiple page sizes (e.g. 4 KiB pages and 2 MiB pages, and maybe also 1 GiB pages). This can either make it hard to guess what the page size actually is, or improve the probability of "physically contiguous by accident" if you assume the smallest page size is used.
Note 6: For some CPUs (e.g. recent AMD/Zen) the TLBs behave as if pages are larger (e.g. as if you're using 16 KiB pages and not 4 KiB pages) if and only if page table entries are compatible (e.g. if 4 page table entries describe four physically contiguous 4 KiB pages with the same permissions/attributes). If an OS is optimized for these CPUs the result is similar to having an extra page size (4 KiB, "16 KiB", 2 MiB and maybe 1 GiB).
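The rough odds quoted in Note 4 follow from simple division. The sketch below reproduces that arithmetic under the same simplifying assumption the note makes (the next frame is picked uniformly from all frames of the right colour); the function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Number of physical page frames for a given amount of memory. */
uint64_t frame_count(uint64_t mem_bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return mem_bytes / page_size;
}

/* Returns N such that two virtually adjacent pages are physically adjacent
   with a chance of roughly 1 in N. Pass colors = 1 for no page colouring. */
uint64_t contiguity_odds(uint64_t mem_bytes, uint64_t page_size, uint64_t colors)
{
    assert(colors > 0);
    return frame_count(mem_bytes, page_size) / colors;
}
```

With 1 GiB of memory and 4096-byte pages this gives 262144 (about 1 in 262000), and 1024 when 256 colours are used.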
When allocating array in virtual memory of size one page does it have to be one page or could be split into two contiguous pages in virtual memory (for example bottom half of first one and top half of the second)?
When allocating an array in heap memory of size one page; the minimum alignment would be the implementation defined minimum alignment provided by the heap manager/malloc() (e.g. maybe 16 bytes). However; most modern heap managers switch to using an alternative (e.g. mmap() or VirtualAlloc() or similar) when the amount of memory being allocated is "large enough"; so (depending on the implementation and their definition of "large enough") it might be page aligned.
When allocating an array in raw virtual memory (e.g. using mmap() or VirtualAlloc() or similar yourself, and NOT using the heap and not using something like malloc()); page alignment is guaranteed (mostly because the virtual memory manager doesn't deal with anything smaller).
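On a POSIX system the alignment claim can be checked directly. This is a sketch assuming Linux or another platform that provides MAP_ANONYMOUS; map_pages and is_page_aligned are hypothetical helper names:

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS on glibc in strict -std= modes */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* True if p starts exactly on a page boundary. */
bool is_page_aligned(const void *p)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    return (uintptr_t)p % (uintptr_t)page == 0;
}

/* Maps len bytes of zero-filled anonymous memory; returns NULL on failure.
   Release the mapping with munmap(p, len). */
void *map_pages(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A pointer returned by map_pages always passes is_page_aligned, whereas a pointer returned by malloc() for a small request usually does not.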
Arrays are not necessarily contiguous in physical memory, though they are contiguous in virtual address space. But can it be said that the "tidiness" of arrays in physical memory is significantly higher compared to linked lists? So, which is a better option for a cache-friendly program?
There are two reasons why contiguous memory is more cache-friendly than non-contiguous memory:
If the data is stored contiguously, then the data will likely be stored in fewer cache lines (which are 64-byte blocks on most platforms). In that case, there is a higher chance that all the data will fit in the cache, and new cache lines will have to be loaded less often. If the data is not stored contiguously and is scattered in many random memory locations, then it is possible that only a small fraction of every cache line will contain important data and that the rest of the cache line will contain unimportant data. In that case, more cache lines would be required to cache all important data, and if the cache is not large enough to store all these cache lines, then the cache efficiency will decrease.
The hardware cache prefetcher will do a better job at predicting the next cache line to prefetch, because it is easy to predict a sequential access pattern. Depending on whether the elements of the linked list are scattered or not, the access pattern to a linked list may be random and unpredictable, whereas the access pattern to an array is often sequential.
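The first point can be put into numbers with a back-of-the-envelope sketch. It assumes the contiguous data starts on a cache-line boundary and that, in the scattered worst case, every element lands in a cache line of its own; both function names are made up for this illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Cache lines touched by count elements stored back to back,
   assuming the first element starts on a cache-line boundary. */
size_t lines_contiguous(size_t count, size_t elem_size, size_t line_size)
{
    assert(line_size > 0);
    return (count * elem_size + line_size - 1) / line_size;
}

/* Worst case for scattered elements: no two elements share a cache line.
   Assumes no element straddles more lines than its size requires. */
size_t lines_scattered_worst(size_t count, size_t elem_size, size_t line_size)
{
    assert(line_size > 0);
    return count * ((elem_size + line_size - 1) / line_size);
}
```

For 1000 elements of 16 bytes and 64-byte cache lines, the contiguous layout touches 250 lines while the scattered worst case touches 1000.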
You are right that even if an array is stored contiguously in the virtual address space, this does not necessarily mean that the array is also contiguous in the physical address space.
However, this is irrelevant with regard to the statements made in #1 of my answer, because a cache line cannot overlap the boundary of a memory page. The content of a single memory page is always contiguous, both in the virtual address space and in the physical address space.
But you are right that it can be relevant with regard to the statements made in #2 of my answer. Assuming a memory page size of 4096 bytes (which is standard on the x64 platform) and a cache line size of 64 bytes, there are 64 cache lines per memory page. This means that every 64th cache line could be at the edge of a "jump" in the physical address space. As a result, every 64th cache line could be mispredicted by the hardware cache prefetcher. Also, the cache prefetcher may not be able to adapt itself immediately to this new situation, so it may fail to prefetch several cache lines before it is able to reliably predict the next cache lines again and preload them in time. However, as an application programmer, you should not have to worry about this. It is the responsibility of the operating system to arrange the mapping of the virtual memory space to the physical memory space in such a way that there are not too many "jumps" which could have a negative performance impact. If you want to read more on this topic, you might want to read this research paper: Analysis of hardware prefetching across virtual page boundaries
Generally, arrays are better than linked lists in terms of cache efficiency, because they are always contiguous (in the virtual address space).
I'm doing some linux kernel development.
And I'm going to allocate some memory space with something like:
ptr = flex_array_alloc(size=136B, num=1<<16, GFP_KERNEL)
And ptr turns out to be NULL every time I try.
What's more, when I change the size to 20B or num to 256, there's nothing wrong and the memory can be obtained.
So I want to know if there are some limitations for requesting memory in linux kernel modules. And how to debug it or to allocate a big memory space.
Thanks.
And kzalloc has a similar behavior in my environment. That is, requesting a 136B * (1<<16) space failed, while 20B or 1<<8 succeed.
There are two limits to the size of an array allocated with flex_array_alloc. First, the object size itself must not exceed a single page, as indicated in https://www.kernel.org/doc/Documentation/flexible-arrays.txt:
The down sides are that the arrays cannot be indexed directly, individual object size cannot exceed the system page size, and putting data into a flexible array requires a copy operation.
Second, there is a maximum number of elements in the array.
Both limitations are the result of the implementation technique:
…the need for memory from vmalloc() can be eliminated by piecing together an array from smaller parts…
A flexible array holds an arbitrary (within limits) number of fixed-sized objects, accessed via an integer index.… Only single-page allocations are made…
The array is "pieced" together by using an array of pointers to individual parts, where each part is one system page. Since this array of pointers is itself allocated, and only single-page allocations are made (as noted above), the maximum number of parts is slightly less than the number of pointers which can fit in a page (slightly less because there is also some bookkeeping data). In effect, this limits the total size of a flexible array to about 2 MB on systems with 8-byte pointers and 4 KB pages. (The precise limitation will vary depending on the amount of wasted space in a page if the object size is not a power of two.)
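The limit can be estimated with a few lines of arithmetic. This sketch gives an upper bound only: it ignores the bookkeeping data mentioned above, so the real limit is slightly lower, and max_elements_upper_bound is an illustrative name, not a kernel function:

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on the number of elements a flexible array can hold:
   (pointers per page) * (whole elements per page). */
size_t max_elements_upper_bound(size_t elem_size, size_t page_size,
                                size_t pointer_size)
{
    assert(elem_size > 0 && elem_size <= page_size && pointer_size > 0);
    size_t parts = page_size / pointer_size;  /* at most this many part pointers */
    size_t per_part = page_size / elem_size;  /* elements never straddle parts */
    return parts * per_part;
}
```

With 4096-byte pages and 8-byte pointers, 136-byte elements give at most 512 * 30 = 15360 elements, well below the requested 1 << 16 = 65536, while 20-byte elements give at most 512 * 204 = 104448, which is consistent with the smaller element size succeeding.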
I have 8 terabytes of data composed of ~5000 arrays of small sized elements (under a hundred bytes per element). I need to load sections of these arrays (a few dozen megs at a time) into memory to use in an algorithm as quickly as possible. Are memory mapped files right for this use, and if not what else should I use?
Given your requirements I would definitely go with memory mapped files. It's almost exactly what they were made for. And since memory mapped files consume few physical resources, your extremely large files will have little impact on the system as compared to other methods, especially since smaller views can be mapped into the address space just before performing I/O (eg, those arrays of elements). The other big benefit is they give you the simplest working environment possible. You can (mostly) just view your data as a large memory address space and let Windows worry about the I/O. Obviously, you'll need to build in locking mechanisms to handle multiple threads, but I'm sure you know that.
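One practical detail when mapping small views of a huge file is that the view's starting offset must be aligned: to the page size for POSIX mmap(), or to the allocation granularity for Windows MapViewOfFile(). The sketch below only plans the aligned window for a requested byte range; view_plan and plan_view are hypothetical names, and the actual mapping call is left out:

```c
#include <assert.h>
#include <stdint.h>

/* Describes the aligned window to map in order to reach a byte range. */
typedef struct {
    uint64_t map_offset;  /* aligned file offset to pass to the mapping call */
    uint64_t map_length;  /* bytes to map, starting at map_offset */
    uint64_t delta;       /* where the wanted data starts inside the view */
} view_plan;

/* Plans a view covering [want_offset, want_offset + want_length). */
view_plan plan_view(uint64_t want_offset, uint64_t want_length,
                    uint64_t granularity)
{
    assert(granularity > 0);
    view_plan plan;
    plan.delta = want_offset % granularity;
    plan.map_offset = want_offset - plan.delta;
    plan.map_length = want_length + plan.delta;
    return plan;
}
```

For example, to read 5000 bytes at file offset 10000 with a 4096-byte granularity, the view starts at offset 8192, is 6808 bytes long, and the wanted data begins 1808 bytes into it.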