Means to allocate contiguous physical memory - c

I am aware that with C malloc and posix_memalign one can allocate contiguous memory from the virtual address space of a process. However, I was wondering whether one can somehow allocate a buffer of physically contiguous memory? I am investigating side-channel attacks that exploit the L2 cache, so I want to be sure that I can access the right cache lines.

Your best and easiest bet for contiguous memory is to request a single "huge" page from the system. The availability of those depends on your CPU and kernel options (on x86_64, 2 MB huge pages are usually available, and some CPUs can also do 1 GB pages; other architectures can be more flexible). Check the Hugepagesize field in /proc/meminfo for the huge page size on your setup.
Those can be accessed in two ways:
By means of the MAP_HUGETLB flag passed to mmap(). This way you can be sure that the "huge" virtual page corresponds to a contiguous physical memory range. Unfortunately, whether the kernel can supply you with a "huge" page depends on many factors (the current layout of memory utilization, kernel options, etc.; also see the hugepages kernel boot parameter); a minimal sketch of this route appears after this list.
By means of mapping a file from a dedicated HugeTLB filesystem (see here: http://lwn.net/Articles/375096/). With the HugeTLB filesystem you can configure the number of huge pages in advance, for some assurance that the necessary number of huge pages will be available.
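As a rough illustration of the first route, here is a minimal sketch, assuming 2 MB huge pages have been made available (e.g. via /proc/sys/vm/nr_hugepages):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)  /* assumes 2 MB huge pages */

int main(void)
{
    /* MAP_HUGETLB asks the kernel to back this mapping with one huge page,
     * which is physically contiguous by construction. */
    void *buf = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* fails if no huge pages are free */
        return 1;
    }

    ((char *)buf)[0] = 42;             /* touching the page faults it in */

    munmap(buf, HUGE_PAGE_SIZE);
    return 0;
}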
The other approach is to write a kernel module which allocates contiguous physical memory on the kernel side and then maps it into your process' address space on request. This approach is sometimes employed for special-purpose hardware in embedded systems. Of course, there's still no guarantee that the kernel-side memory allocator will be able to come up with an appropriately sized contiguous physical address range, so on some occasions such ranges are pre-reserved at boot (one crude approach is to pass the max_addr parameter to the kernel at boot to leave some of the RAM out of the kernel's reach).
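For flavour, a heavily simplified sketch of such a module (the device name /dev/contig and the 64 KB size are hypothetical; error paths and size checks are omitted):

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/gfp.h>

#define CONTIG_ORDER 4   /* 2^4 pages = 64 KB of physically contiguous memory */

static struct page *contig_pages;

/* Map the physically contiguous block into the calling process. */
static int contig_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;   /* caller must not ask for more than we allocated */

    return remap_pfn_range(vma, vma->vm_start,
                           page_to_pfn(contig_pages), size,
                           vma->vm_page_prot);
}

static const struct file_operations contig_fops = {
    .owner = THIS_MODULE,
    .mmap  = contig_mmap,
};

static struct miscdevice contig_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "contig",               /* appears as /dev/contig */
    .fops  = &contig_fops,
};

static int __init contig_init(void)
{
    /* alloc_pages() hands back 2^CONTIG_ORDER physically contiguous pages. */
    contig_pages = alloc_pages(GFP_KERNEL, CONTIG_ORDER);
    if (!contig_pages)
        return -ENOMEM;
    return misc_register(&contig_dev);
}

static void __exit contig_exit(void)
{
    misc_deregister(&contig_dev);
    __free_pages(contig_pages, CONTIG_ORDER);
}

module_init(contig_init);
module_exit(contig_exit);
MODULE_LICENSE("GPL");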

On (almost [Note 1]) all virtual memory architectures, virtual memory is mapped to physical memory in units of a "page". The size of a page is (almost) always a power of 2, and pages are aligned by that size, because the mapping is done by only using the high-order bits of the address. It's common to see a page size of 4K (12 bits of address), although modern CPUs have an option to map much larger pages in order to reduce the size of mapping tables.
Since the cache-line size (L2_CACHE_SIZE here) will generally also be a power of 2 and will be smaller than the page size, any single aligned allocation of that size will necessarily fit within a single page, so its bytes will be physically contiguous as well.
So in this particular case, you can be assured that your allocated memory will be a single cache-line (at least, on standard machine architectures).
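A minimal sketch of such an allocation, assuming a 64-byte cache line (query sysconf() or /sys/devices/system/cpu if your hardware differs):

#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64   /* assumed; check your CPU's actual line size */

int main(void)
{
    void *line;

    /* An aligned allocation of one cache-line size cannot straddle a page
     * boundary, so its bytes are physically contiguous as well. */
    if (posix_memalign(&line, CACHE_LINE_SIZE, CACHE_LINE_SIZE) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    printf("cache-line-sized buffer at %p\n", line);
    free(line);
    return 0;
}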
Note 1: Undoubtedly there are machines -- possibly imaginary -- which do not function this way. But the one you are playing with is not one of them.

Related

How to allocate block of physical memory? [duplicate]

Is there a way to allocate contiguous physical memory from userspace in Linux? At least a few guaranteed contiguous memory pages. One huge page isn't the answer.
No. There is not. You need to do this from kernel space.
If you say "we need to do this from User Space" - without anything going on in kernel-space it makes little sense - because a user space program has no way of controlling or even knowing if the underlying memory is contiguous or not.
The only reason you would need to do this is if you were working in conjunction with a piece of hardware, or some other low-level (i.e. kernel) service, that has this requirement. So again, you would have to deal with it at that level.
So the answer isn't just "you can't" - but "you should never need to".
I have written memory managers that do allow this - but it was always because of some underlying issue at the kernel level, which had to be addressed at the kernel level. Generally because some other agent on the bus (a PCI card, the BIOS, or even another computer over an RDMA interface) had the physically contiguous memory requirement. Again, all of this had to be addressed in kernel space.
When you talk about "cache lines" - you don't need to worry. You can be assured that each page of your user-space memory is contiguous, and each page is much larger than a cache-line (no matter what architecture you're talking about).
Yes, if all you need is a few pages, this may indeed be possible.
The file /proc/[pid]/pagemap now allows programs to inspect the mapping of their virtual memory to physical memory.
While you cannot explicitly modify the mapping, you can allocate a virtual page, lock it into memory via a call to mlock, record its physical address via a lookup in /proc/self/pagemap, and repeat until you happen to get enough blocks touching each other to form a large enough contiguous block. Then unlock and free the excess blocks.
It's hackish, clunky and potentially slow, but it's worth a try. On the other hand, there's a decently large chance that this isn't actually what you really need.
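A hedged sketch of the pagemap lookup itself (the entry format - 64-bit entries, PFN in bits 0-54, "present" in bit 63 - is documented in Documentation/admin-guide/mm/pagemap.rst; on recent kernels reading the PFN requires root or CAP_SYS_ADMIN):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return the physical address backing a virtual address, or 0 on failure. */
static uint64_t virt_to_phys(void *vaddr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    uint64_t entry = 0, paddr = 0;
    off_t offset = ((uintptr_t)vaddr / page_size) * sizeof(entry);

    if (fd < 0)
        return 0;
    if (pread(fd, &entry, sizeof(entry), offset) == sizeof(entry) &&
        (entry & (1ULL << 63)))                       /* page present in RAM? */
        paddr = (entry & ((1ULL << 55) - 1)) * page_size
                + ((uintptr_t)vaddr % page_size);     /* PFN * page size + offset */
    close(fd);
    return paddr;
}

int main(void)
{
    /* e.g. lock one page and see where it landed physically */
    static char page[4096] __attribute__((aligned(4096)));
    mlock(page, sizeof(page));
    printf("virtual %p -> physical 0x%llx\n",
           (void *)page, (unsigned long long)virt_to_phys(page));
    return 0;
}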
The DPDK library's memory allocator uses the approach Wallacoloo described; see eal_memory.c. The code is BSD-licensed.
If a specific device driver exports a DMA buffer which is physically contiguous, user space can access it through the dma-buf APIs.
So a user task can access such memory, but cannot allocate it directly.
That is because the physically-contiguous constraint comes not from user applications but only from devices,
so only device drivers should have to care.
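For completeness, a rough sketch of the user-space side, assuming the exporting driver has already handed you a dma-buf file descriptor dmabuf_fd of size buf_size (both hypothetical here):

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/dma-buf.h>

/* dmabuf_fd and buf_size come from the exporting driver (hypothetical). */
void *map_dmabuf(int dmabuf_fd, size_t buf_size)
{
    void *p = mmap(NULL, buf_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, dmabuf_fd, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* CPU accesses should be bracketed with DMA_BUF_IOCTL_SYNC
     * (START here, END when done) to keep caches coherent. */
    struct dma_buf_sync sync = { .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_RW };
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
    return p;
}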

Is it okay to use mmap() for 4Kb blocks all over the place or is it better to mmap() my whole file in one go?

I want to work on a file which is composed of 4Kb blocks.
As things happen, I will write more data and map new parts, unmap parts that I do not need anymore.
Is an mmap() of just 4 KB too small when the total amount of file data to map is around 4 GB (i.e. some 1,048,576 individually mapped blocks)?
I'm worried that making so many small mmap() calls will not be efficient in the end, even if they are very well targeted at the exact blocks I want to use. At the same time, it may still be better than reading and writing those blocks with read()/write() each time I change one byte.
As far as I understand it, even a single mmap() that covers several contiguous 4 KB pages will require the kernel (and the TLB, MMU...) to deal with as many virtual/physical associations as there are pages (this is the purpose of memory pages; contiguous virtual pages can be mapped to non-contiguous physical pages).
So, considering the usage of these mapped pages once they have been set up, whether by a single mmap() call or by many, there should not be any difference in performance.
But each single call to mmap() probably requires some overhead in order to choose the part of the virtual address space to use; a single mmap() call will just have to choose a big enough virtual region once (it should not be too difficult on a 64-bit system, as stated in other answers), but repeated calls will incur this overhead many times.
So, if I had to deal with this situation on a 64-bit system, I would mmap() the entire file at once, using huge-pages in order to reduce the pressure on TLB.
Note that mapping the entire file at once does not imply using the same amount of physical memory right away; the virtual/physical association is only established for each page when it is accessed for the first time.
There is no shortage of address space on 64-bit architectures. Unless your code has to work in 32-bit architectures too (rare these days), map the whole file once and avoid the overhead of multiple mmap calls and thousands of extra kernel objects. With reading and writing changes, it depends on your desired semantics. See this answer.
On 64-bit systems you should pretty much map the entire file, or at least the entire range, in one go and let the operating system handle the paging in and out for you. The mmap calls do have some overhead themselves. In practice the user address space on x86-64 is something like 128 TiB, so you should be able to map, say, 1 TiB files/ranges without any problems.
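As a sketch, mapping the whole file once looks like this (the file name is hypothetical; read-write and shared so changes are written back):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);          /* hypothetical file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror("open/fstat");
        return 1;
    }

    /* One mmap() for the whole file: one kernel VMA, and pages are faulted
     * in lazily only when the corresponding 4 KB blocks are actually touched. */
    char *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    base[0] ^= 1;                               /* touch block 0 */

    munmap(base, st.st_size);
    close(fd);
    return 0;
}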

What are benefits of allocating a page-aligned memory chunk?

I realize that most CPUs are better at reading data at an aligned memory address, that is, at a memory address that is a multiple of the CPU word size. However, in many places I read about allocating page-aligned memory. Why might someone want a page-aligned memory address? Is it only for even better performance?
The "traditional" way to allocate memory is to have it in a contiguous address space (the "heap", growing upwards by calls to sbrk()). Each time you hit a page boundary, there will be a page fault and you get mapped a new page. There are two consequences of this strategy:
pages can only be freed when all allocations inside that page are freed AND when all other allocations are mapped to lower addresses. (the typical effect of heap fragmentation).
larger allocations might occupy one page more than strictly needed (if they start somewhere in the middle of a page).
So this strategy is only suitable for smaller blocks of memory where you don't want to "waste" a whole page for each allocation.
For bigger chunks, it's better to use mmap(), which maps new pages for you directly, so you get "page-aligned memory". With this, your allocation doesn't share pages with other allocations. As soon as you don't need the memory any more, you can give it back to the OS. Note that many malloc() implementations choose automatically whether to allocate using sbrk() or mmap(), depending on the size of the requested allocation.
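For a large allocation you can call mmap() directly yourself; a minimal sketch (anonymous private mapping, so it is page-aligned and fully returned to the OS on munmap()):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 16 * 1024 * 1024;   /* 16 MB, example size */

    /* Anonymous private mapping: starts on a page boundary and shares no
     * page with other allocations; munmap() gives it back to the OS. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    munmap(p, size);
    return 0;
}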
Alignment restrictions are usually associated with direct IO - which bypasses the page cache, copying data to/from disk directly into or from the address space of a process. This can provide significant performance improvements in cases where the page cache is not needed - such as streaming multiple gigabytes of data, especially when doing IO to/from extremely fast disk systems.
Note that only some file systems support direct IO.
On Linux, RedHat's documentation says, in part:
Direct I/O best practices
Users must always take care to use properly aligned and sized IO.
This is especially important for Direct I/O
access. Direct I/O should be aligned on a 'logical_block_size'
boundary and in multiples of the 'logical_block_size'. With native 4K
devices (logical_block_size is 4K) it is now critical that
applications perform Direct I/O that is a multiple of the device's
'logical_block_size'. This means that applications that do not
perform 4K aligned I/O, but 512-byte aligned I/O, will break with
native 4K devices. Applications may consult a device's "I/O Limits"
to ensure they are using properly aligned and sized I/O. The "I/O
Limits" are exposed through both sysfs and block device ioctl
interfaces (also see: libblkid).
sysfs interface
/sys/block/<disk>/alignment_offset
/sys/block/<disk>/<partition>/alignment_offset
/sys/block/<disk>/queue/physical_block_size
/sys/block/<disk>/queue/logical_block_size
/sys/block/<disk>/queue/minimum_io_size
/sys/block/<disk>/queue/optimal_io_size
Note that the use of direct IO can be limited by actual hardware, as well as software. As noted in the RedHat documentation, physical device limitations matter.
To use direct IO, on Linux the file needs to be opened with the O_DIRECT flag:
int fd = open(filename, O_RDONLY | O_DIRECT);
In my experience, direct IO can result in 20-30% gains in IO performance under certain circumstances. Those circumstances usually involve streaming large amounts of data to/from a file on a very fast file system with the application performing no or very few seek() calls.
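A hedged sketch of a direct-I/O read, assuming a 4096-byte logical_block_size (the file name is hypothetical; both the buffer address and the transfer size/offset must be multiples of the block size):

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK 4096             /* assumed logical_block_size; check sysfs */

int main(void)
{
    void *buf;
    int fd = open("data.bin", O_RDONLY | O_DIRECT);   /* hypothetical file */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The buffer must be aligned to the logical block size for O_DIRECT. */
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    /* Offset and length must also be block multiples. */
    ssize_t n = pread(fd, buf, BLOCK, 0);
    printf("read %zd bytes bypassing the page cache\n", n);

    free(buf);
    close(fd);
    return 0;
}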
Alignment is something that always raises performance issues. When you write(2) or read(2) a file, it's best if you can adjust the boundaries of your reads to block alignments, because otherwise you make the kernel do two block reads instead of one. The worst case is reading just two bytes that straddle a block boundary. Suppose you have a block size of 1024 bytes; this code:
char var[2];
int fd;

fd = open("/etc/passwd", O_RDONLY);
lseek(fd, 1023UL, SEEK_SET);    /* position 1 byte before a 1024-byte block boundary */
read(fd, var, sizeof var);      /* this 2-byte read straddles two blocks */
will force the kernel to do two block reads (at most - the blocks could already be cached) to satisfy a read(2) of only two bytes.
In the case of memory, all this is normally managed by malloc(3), and since page faults are handled transparently, you don't normally pay any alignment-related penalty: as you consume memory, the kernel allocates it for you in whole pages, and the processor's virtual memory system makes page alignment almost transparent (which is why page-aligned allocation is rarely needed in demand-paged virtual memory systems). The only exception is an unaligned access that straddles two pages: suppose a misaligned 32-bit integer spans two pages, and both have been swapped out by the kernel; you'll then have to wait for the kernel to swap in two pages instead of one. But that is quite improbable - compilers and ABIs normally keep data aligned precisely to avoid this, and the instruction cache helps cope with code.
That said, there are some places where you do get performance improvements if you align memory in certain ways. I'll try to illustrate with a scenario:
Suppose you need to dynamically manage a lot of small structures (say 16 bytes long) and you plan to manage them with malloc(). malloc(3) adds a header to each chunk of memory allocated (say this header is 8 bytes long), making the memory overhead 50% more than the ideal. If you arrange to get memory in chunks of (say) 64 structures, you pay just one of those headers (8 bytes) for each 64*16 = 1024 bytes (well under 1% overhead).
To manage this, you have to be able to determine which chunk each of these structures belongs to (so you can free(3) the chunk when it is no longer in use), and you can do this in two ways: 1. store a pointer in each structure pointing to its chunk (this adds several bytes to each structure, wasting a quarter or more of the memory again), or 2. force each chunk to be aligned to the chunk size, so the chunk address can easily be computed from the struct address (you only need to clear the remainder of the address modulo the chunk size). This last method imposes no overhead to locate the chunk, but requires all chunks to be chunk-aligned (not page-aligned).
In this way, you improve performance considerably, as you greatly reduce the number of malloc(3) calls and the memory wasted by allocating many small amounts of memory.
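A sketch of that second method, with hypothetical 16-byte structures packed into 1024-byte, 1024-aligned chunks (the chunk is found by clearing the low bits of a structure's address):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE 1024                 /* must be a power of two */

struct item { char payload[16]; };      /* hypothetical small structure */

struct chunk {                          /* one header per ~64 items */
    int used;
    struct item items[60];
};

/* Because chunks are CHUNK_SIZE-aligned, the owning chunk of any item is
 * found by clearing the low bits of the item's address - no back pointer. */
static struct chunk *chunk_of(struct item *it)
{
    return (struct chunk *)((uintptr_t)it & ~(uintptr_t)(CHUNK_SIZE - 1));
}

int main(void)
{
    struct chunk *c;
    if (posix_memalign((void **)&c, CHUNK_SIZE, CHUNK_SIZE) != 0)
        return 1;

    struct item *it = &c->items[7];
    printf("item %p belongs to chunk %p\n", (void *)it, (void *)chunk_of(it));
    free(c);
    return 0;
}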
By the way, malloc doesn't ask the operating system for memory on each call. It allocates memory in chunks, in a manner similar to what has been explained here, and normal implementations don't even return freed memory to the system (they reuse freed memory before allocating new memory). malloc controls the calls to the sbrk(2) system call, which means that you will interfere with malloc if you use this system call yourself.
Linux/Unix will give you page-aligned memory when you use the shmat(2) system call. Try reading this and related documents.

How can we specify physical address for variable?

Any suggestions/discussions are welcome!
The question is actually brief as title, but I'll explain why I need physical address.
Background:
These days I'm fascinated by cache and multi-core architecture, and now I'm quite curious how cache influence our programs, under the parallel environment.
In some CPU models (for example, my Intel Core Duo T5800), the L2 cache is shared among cores. So, if program A is accessing memory at physical address like
0x00000000, 0x20000000, 0x40000000...
and program B accessing data at
0x10000000, 0x30000000, 0x50000000...
Since these addresses share the same low-order bits, they map to the same set, so the corresponding set in the L2 cache will be evicted frequently. And we would expect to see the two programs fighting with each other, reading data slowly from memory instead of the cache, even though they run on different cores.
Then I want to verify the result in practice. In this experiment, I have to know the physical address instead of virtual address. But how can I cope with this?
The first attempt:
Eat a large space from heap, mask, and get the certain address.
My CPU has an L2 cache with size = 2048 KB and associativity = 8, so physical addresses like 0x12340000, 0x12380000, 0x123c0000 will map to the first set in the L2 cache.
#include <stdint.h>

int HEAP[200000000] = {0};
int *v[2];

int main(int argc, char **argv)
{
    /* round up to the next 0x40000 (256 KB) boundary: 0x40000 is the set
     * stride of a 2048 KB, 8-way L2 cache */
    v[0] = (int *)(((uintptr_t)HEAP + 0x3fffc) & ~(uintptr_t)0x3ffff);
    v[1] = (int *)((uintptr_t)v[0] + 0x40000);
    /* one program pollutes v[0], another pollutes v[1] */
    return 0;
}
Sadly, with the "help" of virtual memory, variable HEAP is not always continuous inside physical memory. v[0] and v[1] might be related to different cache sets.
The second attempt
access /proc/self/mem, and try to get memory information.
Hmm... seems that the results are still about virtual memory.
Your understanding of memory and these addresses is incomplete/incorrect. Essentially, what you're trying to test is futile.
In the context of user-mode processes, pretty much every single address you see is a virtual address. That is, an address that makes sense only in the context of that process. The OS manages the mapping of where this virtual memory space (unique to a process) maps to memory pages. These memory pages at any given time may map to pages that are paged-in (i.e. reside in physical RAM) - or they may be paged-out, and exist only in the swap file on disk.
So to address the Background example, those addresses are from two different processes - it means absolutely nothing to try and compare them. Whether or not their code is present in any of the caches depends on a number of things, including the cache-replacement strategy of the processor, the caching policies enabled by the OS, the number of other processes (including kernel-mode threads), etc.
In your first attempt, again you aren't going to get anywhere as far as actually testing the CPU cache directly. First of all, your large buffer is not going to be on the heap. It is going to be part of the data section (specifically the .bss) of the executable. The heap is used for the malloc() family of memory allocations. Secondly, it doesn't really matter if you allocate some huge 1 GB region, because although it is contiguous in the virtual address space of your process, it is up to the OS to back those virtual pages with physical pages wherever it deems fit - which may not actually be contiguous in physical memory. Again, you have pretty much no control over memory allocation from userspace. "Is there a way to allocate contiguous physical memory from userspace in linux?" The short answer is No.
/proc/$pid/maps isn't going to get you anywhere either. Yes there are plenty of addresses listed in there, but again, they are all in the virtual address space of process $pid. Some more information on these: How do I read from /proc/$pid/mem under Linux?

What is the difference between vmalloc and kmalloc?

I've googled around and found most people advocating the use of kmalloc, as you're guaranteed to get contiguous physical blocks of memory. However, it also seems as though kmalloc can fail if a contiguous physical block that you want can't be found.
What are the advantages of having a contiguous block of memory? Specifically, why would I need to have a contiguous physical block of memory in a system call? Is there any reason I couldn't just use vmalloc?
Finally, if I were to allocate memory during the handling of a system call, should I specify GFP_ATOMIC? Is a system call executed in an atomic context?
GFP_ATOMIC - The allocation is high-priority and does not sleep. This is the flag to use in interrupt handlers, bottom halves and other situations where you cannot sleep.
GFP_KERNEL - This is a normal allocation and might block. This is the flag to use in process-context code when it is safe to sleep.
You only need to worry about using physically contiguous memory if the buffer will be accessed by a DMA device on a physically addressed bus (like PCI). The trouble is that many system calls have no way to know whether their buffer will eventually be passed to a DMA device: once you pass the buffer to another kernel subsystem, you really cannot know where it is going to go. Even if the kernel does not use the buffer for DMA today, a future development might do so.
vmalloc is often slower than kmalloc, because it may have to remap the buffer space into a virtually contiguous range. kmalloc never remaps, though if not called with GFP_ATOMIC kmalloc can block.
kmalloc is limited in the size of buffer it can provide: 128 KBytes*). If you need a really big buffer, you have to use vmalloc or some other mechanism like reserving high memory at boot.
*) This was true of earlier kernels. On recent kernels (I tested this on 2.6.33.2), max size of a single kmalloc is up to 4 MB! (I wrote a fairly detailed post on this.) — kaiwan
For a system call you don't need to pass GFP_ATOMIC to kmalloc(); you can use GFP_KERNEL. You're not in an interrupt handler: the application code enters the kernel context by means of a trap; it is not an interrupt.
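So inside a system call (process context), a plain GFP_KERNEL allocation is the normal pattern - a minimal sketch with a hypothetical handler:

#include <linux/slab.h>
#include <linux/errno.h>

/* Called from a system call, i.e. process context: sleeping is allowed,
 * so GFP_KERNEL is fine. GFP_ATOMIC is only needed where you cannot sleep
 * (interrupt handlers, sections holding a spinlock, ...). */
static int handle_request(size_t len)
{
    char *buf = kmalloc(len, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

    /* ... use buf ... */

    kfree(buf);
    return 0;
}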
Short answer: download Linux Device Drivers and read the chapter on memory management.
Seriously, there are a lot of subtle issues related to kernel memory management that you need to understand - I spend a lot of my time debugging problems with it.
vmalloc() is very rarely used, because the kernel rarely uses virtual memory. kmalloc() is what is typically used, but you have to know what the consequences of the different flags are and you need a strategy for dealing with what happens when it fails - particularly if you're in an interrupt handler, like you suggested.
Linux Kernel Development by Robert Love (Chapter 12, page 244 in 3rd edition) answers this very clearly.
Yes, physically contiguous memory is not required in many cases. The main reason kmalloc is used more than vmalloc in the kernel is performance. The book explains that when big memory chunks are allocated using vmalloc, the kernel has to map the physically non-contiguous chunks (pages) into a single contiguous virtual memory region. Since the memory is virtually contiguous but physically non-contiguous, several virtual-to-physical address mappings have to be added to the page table. In the worst case, (size of buffer / page size) mappings will be added to the page table.
This also adds pressure on TLB (the cache entries storing recent virtual to physical address mappings) when accessing this buffer. This can lead to thrashing.
The kmalloc() & vmalloc() functions are a simple interface for obtaining kernel memory in byte-sized chunks.
The kmalloc() function guarantees that the pages are physically contiguous (and virtually contiguous).
The vmalloc() function works in a similar fashion to kmalloc(), except it allocates memory that is only virtually contiguous and not necessarily physically contiguous.
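In code the two look almost identical; the difference is only in what you may assume about the result. A sketch (kernel context assumed; the sizes are arbitrary examples):

#include <linux/slab.h>
#include <linux/vmalloc.h>

static void allocation_example(void)
{
    /* Physically (and virtually) contiguous - suitable for DMA buffers,
     * but large requests may fail if physical memory is fragmented. */
    void *k = kmalloc(64 * 1024, GFP_KERNEL);

    /* Only virtually contiguous - fine for big in-kernel buffers that are
     * accessed through the CPU and never handed to a DMA engine. */
    void *v = vmalloc(16 * 1024 * 1024);

    kfree(k);   /* kfree() pairs with kmalloc() */
    vfree(v);   /* vfree() pairs with vmalloc() */
}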
What are the advantages of having a contiguous block of memory? Specifically, why would I need to have a contiguous physical block of memory in a system call? Is there any reason I couldn't just use vmalloc?
From Google's "I'm Feeling Lucky" on vmalloc:
kmalloc is the preferred way, as long as you don't need very big areas. The trouble is, if you want to do DMA from/to some hardware device, you'll need to use kmalloc, and you'll probably need a bigger chunk. The solution is to allocate memory as soon as possible, before memory gets fragmented.
On a 32-bit system, kmalloc() returns a kernel logical address (it's a virtual address, though) which has a direct mapping (actually with a constant offset) to the physical address.
This direct mapping ensures that we get a contiguous physical chunk of RAM, suited for DMA, where we give only the initial pointer and expect a contiguous physical mapping thereafter for our operation.
vmalloc() returns a kernel virtual address which might not have a contiguous mapping onto physical RAM.
It is useful for large memory allocations and in cases where we don't care that the memory allocated to our process is also contiguous in physical RAM.
One other difference is that kmalloc returns a logical address (unless you specify GFP_HIGHMEM). Logical addresses are placed in "low memory" (the first gigabyte of physical memory) and are mapped directly to physical addresses (use the __pa macro to convert them). This property implies that kmalloced memory is contiguous memory.
On the other hand, vmalloc is able to return virtual addresses from "high memory". These addresses cannot be converted to physical addresses in a direct fashion (you have to use the virt_to_page function).
In short, vmalloc and kmalloc both address fragmentation: vmalloc uses memory mappings to hide external fragmentation, while kmalloc uses slabs to minimize internal fragmentation. For what it's worth, kmalloc also has many other advantages.
