What are benefits of allocating a page-aligned memory chunk? - c

I realize that most CPUs are better at reading data at an aligned memory address, that is, at a memory address that is a multiple of the CPU word size. However, in many places I read about allocating page-aligned memory. Why might someone want a page-aligned memory address? Is it only for even bigger performance?

The "traditional" way to allocate memory is to have it in a contiguous address space (the "heap", growing upwards by calls to sbrk()). Each time you hit a page boundary, there will be a page fault and you get mapped a new page. There are two consequences of this strategy:
pages can only be freed when all allocations inside that page are freed AND when all other allocations are mapped to lower addresses. (the typical effect of heap fragmentation).
larger allocations might occupy one page more than strictly needed (if they start somewhere in the middle of a page).
So this strategy is only suitable for smaller blocks of memory where you don't want to "waste" a whole page for each allocation.
For bigger chunks, it's better to use mmap(), which maps new pages for you directly, so you get "page aligned memory". Using this, your allocation doesn't share pages with other allocations. As soon as you don't need the memory any more, you can give it back to the OS. Note that many malloc() implementations choose automatically whether to allocate using sbrk() or mmap(), depending on the size of the desired allocation.
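For illustration, here is a minimal sketch of two common ways to obtain page-aligned memory on a POSIX system (the sizes are arbitrary; this is not part of the original answer):
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>     /* sysconf */
#include <sys/mman.h>   /* mmap, munmap */

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);          /* typically 4096 */
    void *buf1;
    void *buf2;

    /* Option 1: posix_memalign returns a page-aligned block from malloc's arena. */
    if (posix_memalign(&buf1, (size_t)pagesize, 4 * (size_t)pagesize) != 0)
        return 1;

    /* Option 2: anonymous mmap always hands out whole, page-aligned pages,
       and munmap() gives them back to the OS immediately. */
    buf2 = mmap(NULL, 4 * (size_t)pagesize, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf2 == MAP_FAILED)
        return 1;

    printf("posix_memalign: %p\nmmap:           %p\n", buf1, buf2);
    free(buf1);
    munmap(buf2, 4 * (size_t)pagesize);
    return 0;
}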

Alignment restrictions are usually associated with direct IO - which bypasses the page cache, copying data to/from disk directly into or from the address space of a process. This can provide significant performance improvements in cases where the page cache is not needed - such as streaming multiple gigabytes of data, especially when doing IO to/from extremely fast disk systems.
Note that only some file systems support direct IO.
On Linux, Red Hat's documentation says, in part:
Direct I/O best practices
Users must always take care to use properly aligned and sized IO.
This is especially important for Direct I/O
access. Direct I/O should be aligned on a 'logical_block_size'
boundary and in multiples of the 'logical_block_size'. With native 4K
devices (logical_block_size is 4K) it is now critical that
applications perform Direct I/O that is a multiple of the device's
'logical_block_size'. This means that applications that do not
perform 4K aligned I/O, but 512-byte aligned I/O, will break with
native 4K devices. Applications may consult a device's "I/O Limits"
to ensure they are using properly aligned and sized I/O. The "I/O
Limits" are exposed through both sysfs and block device ioctl
interfaces (also see: libblkid).
sysfs interface
/sys/block/<device>/alignment_offset
/sys/block/<device>/<partition>/alignment_offset
/sys/block/<device>/queue/physical_block_size
/sys/block/<device>/queue/logical_block_size
/sys/block/<device>/queue/minimum_io_size
/sys/block/<device>/queue/optimal_io_size
Note that the use of direct IO can be limited by actual hardware, as well as software. As noted in the RedHat documentation, physical device limitations matter.
To use direct IO, on Linux the file needs to be opened with the O_DIRECT flag:
int fd = open( filename, O_RDONLY | O_DIRECT );
In my experience, direct IO can result in 20-30% gains in IO performance under certain circumstances. Those circumstances usually involve streaming large amounts of data to/from a file on a very fast file system with the application performing no or very few seek() calls.
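As a hedged illustration, a minimal O_DIRECT read might look like the sketch below. The 4096-byte alignment and transfer size are assumptions; real code should query the device's logical_block_size through the sysfs files or ioctls mentioned above:
#define _GNU_SOURCE            /* O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t align = 4096;       /* assumed logical_block_size; query sysfs/ioctl on real devices */
    size_t len   = 64 * align;
    void *buf;
    int fd;
    ssize_t n;

    if (argc < 2 || posix_memalign(&buf, align, len) != 0)
        return 1;

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    /* Both the buffer address and the transfer size are multiples of the
       assumed block size, which is what O_DIRECT requires. */
    n = read(fd, buf, len);

    close(fd);
    free(buf);
    return n < 0;
}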

Alignment always has performance implications. When you write(2) or read(2) a file, it's best to align the boundaries of your transfers to block boundaries; otherwise you force the kernel to do two block reads instead of one. The worst case is reading just two bytes that straddle a block boundary. Suppose you have a block size of 1024 bytes; this code:
#include <fcntl.h>      /* open, O_RDONLY */
#include <unistd.h>     /* lseek, read */
char var[2];
int fd;
fd = open("/etc/passwd", O_RDONLY);
lseek(fd, 1023UL, SEEK_SET);    /* position one byte before a block boundary */
read(fd, var, sizeof var);      /* these two bytes straddle two blocks */
will force the kernel to do two block reads (at most; the blocks could already be in the cache) to satisfy a single two-byte read(2) call.
In the case of memory, all of this is normally managed by malloc(3), and since page faults are handled transparently you don't usually pay a visible penalty (that's why, historically, there was no standard library function just to get page-aligned memory, even on demand-paged virtual memory systems): as you consume memory, the kernel allocates it for you in pages, and the processor's virtual memory system makes page alignment almost transparent. The only problematic case is an unaligned access that straddles two pages (imagine a misaligned 32-bit integer whose bytes land on two different pages, both of which have been swapped out by the kernel; you then have to wait for the kernel to swap in two pages instead of one). That is very improbable: compilers normally lay out code and data so that they don't straddle page boundaries, and the instruction cache helps cope with the cases that remain.
That said, there are some places where you do get performance improvements if you align memory in a particular way. I'll try to show you one such scenario:
Suppose you need to dynamically manage a lot of small structures (say 16 bytes long) and you plan to manage them with malloc(). malloc(3) adds a header to each allocated block (say this header is 8 bytes long), making the overhead 50% more than the ideal. If you instead arrange to get memory in chunks of, say, 64 structures, you pay just one of those headers (8 bytes) for every 64*16 = 1024 bytes (well under 1% overhead).
To manage this, you have to know which chunk each of these structures belongs to (so you can free(3) the chunk when none of its structures are in use), and you can do this in two ways: 1) store a pointer in each structure pointing to the chunk (adding 4 bytes to each structure, which is pointless, as you lose 25% of the memory again), or 2) force the chunk to be aligned to the chunk size, so the chunk address can be computed from the structure's address simply by rounding it down to a multiple of the chunk size (masking off the low-order bits). This last method imposes no overhead to locate the chunk, but requires all chunks to be chunk-aligned (not page-aligned).
This improves performance considerably, as it greatly reduces the number of malloc(3) calls and the memory wasted by allocating many small blocks.
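A minimal sketch of this chunk-masking idea (the names CHUNK_SIZE, struct item, alloc_chunk and chunk_of, and the use of posix_memalign, are illustrative assumptions):
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SIZE 1024                   /* 64 structures * 16 bytes; must be a power of two */

struct item { char payload[16]; };        /* the small structures being pooled */

/* Allocate one chunk, aligned to its own size, holding 64 items. */
static struct item *alloc_chunk(void)
{
    void *chunk;
    if (posix_memalign(&chunk, CHUNK_SIZE, CHUNK_SIZE) != 0)
        return NULL;
    return chunk;
}

/* Because the chunk is aligned to its own size, the chunk base can be
   recovered from any item pointer just by masking off the low-order bits. */
static void *chunk_of(struct item *p)
{
    return (void *)((uintptr_t)p & ~(uintptr_t)(CHUNK_SIZE - 1));
}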
By the way, malloc doesn't ask the operating system for memory on every call. It allocates memory in larger chunks, in a manner similar to what has been explained here, and typical implementations don't even return freed memory to the system (they reuse it before allocating new memory). malloc controls the calls to the sbrk(2) system call, which means you will interfere with malloc if you use that system call yourself.
Linux/Unix will also give you page-aligned memory when you use the shmat(2) system call; try reading its documentation and related man pages.

Related

why mmap is faster than traditional file io [duplicate]

Possible Duplicate:
mmap() vs. reading blocks
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If so, why is it faster?
1) mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself, same as read() does.
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If so, why is it faster?
It can be - there are pros and cons, listed below. When you really have reason to care, always benchmark both.
Quite apart from the actual IO efficiency, there are implications for the way the application code tracks when it needs to do the I/O, and does data processing/generation, that can sometimes impact performance quite dramatically.
1) mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself same as read() does
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
1) is wrong... mmap() assigns a region of virtual address space corresponding to the file content. Whenever a page in that address space is accessed, physical RAM is found to back the virtual addresses and the corresponding disk content is faulted into that RAM. So, the order in which reads are done from the disk matches the order of access; it's a "lazy" I/O mechanism. If, for example, you needed to index into a huge hash table that was to be read from disk, then mmaping the file and starting to do accesses means the disk I/O is not done sequentially and may therefore result in a longer elapsed time until the entire file is read into memory; but while that's happening, lookups are already succeeding and dependent work can be undertaken, and if parts of the file are never actually needed they're never read (allow for the granularity of disk and memory pages; also, even when using memory mapping, many OSes let you give performance-enhancing / memory-efficiency hints about your planned access patterns so they can proactively read ahead or release memory more aggressively, knowing you're unlikely to return to it).
2) absolutely true.
3) "The mapped area is not sequential" is vague. Memory-mapped regions are contiguous ("sequential") in virtual address space. We've discussed disk I/O being sequential above. Or are you thinking of something else? Anyway, while pages are being faulted in, they may indeed be transferred using DMA.
Further, there are other reasons why memory mapping may outperform usual I/O:
there's less copying:
often OS & library level routines pass data through one or more buffers before it reaches an application-specified buffer, the application then dynamically allocates storage, then copies from the I/O buffer to that storage so the data's usable after the file reading completes
memory mapping allows (but doesn't force) in-place usage (you can just record a pointer and possibly length)
continuing to access data in-place risks increased cache misses and/or swapping later: the file/memory-map could be more verbose than data structures into which it could be parsed, so access patterns on data therein could have more delays to fault in more memory pages
memory mapping can simplify the application's parsing job by letting the application treat the entire file content as accessible, rather than worrying about when to read another buffer full
the application defers more to the OS's wisdom re number of pages that are in physical RAM at any single point in time, effectively sharing a direct-access disk cache with the application
as well-wisher comments below, "using memory mapping you typically use less system calls"
if multiple processes are accessing the same file, they should be able to share the physical backing pages
There are also reasons why mmap may be slower; do read Linus Torvalds' post here, which says of mmap:
...page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner...
And from another of his posts:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Linux does have "hugepages" (so one TLB entry per 2 MB, instead of per 4 KB) and even Transparent Huge Pages, where the OS attempts to use them even if the application code wasn't written to explicitly utilise them.
FWIW, the last time this arose for me at work, memory mapped input was 80% faster than fread et al for reading binary database records into a proprietary database, on 64 bit Linux with ~170GB files.
mmap() mappings can be shared between processes.
DMA will be used whenever possible. DMA does not require contiguous memory -- many high end cards support scatter-gather DMA.
The memory area may be shared with the kernel block cache if possible, so there is less copying.
Memory for mmap is allocated by the kernel; it is always page-aligned.
"Faster" in absolute terms doesn't exist. You'd have to specify constraints and circumstances.
mmap() is not reading sequentially.
What makes you think that? If you really access the mapped memory sequentially, the system will usually fetch the pages in that order.
mmap() has to fetch from the disk itself same as read() does
sure, but the OS determines the time and buffer size
The mapped area is not sequential - so no DMA (?).
see above
What mmap helps with is that there is no extra user-space buffer involved; the "read" takes place where the OS kernel sees fit, in chunks that can be optimized. This may be an advantage in speed, but first of all it is simply an interface that is easier to use.
If you want to know about speed for a particular setup (hardware, OS, use pattern) you'd have to measure.

Is it okay to use mmap() for 4Kb blocks all over the place or is it better to mmap() my whole file in one go?

I want to work on a file which is composed of 4Kb blocks.
As things happen, I will write more data and map new parts, unmap parts that I do not need anymore.
Is an mmap() of just 4 KB too small when the total amount of file data to map is around 4 GB (i.e. some 1,048,576 individually mapped blocks)?
I'm worried that making so many small mmap() calls is not going to be efficient in the end, even if they are very well directed to the exact blocks I want to use. At the same time, it may still be better than reading and writing these blocks with read()/write() each time I change one byte.
As far as I understand it, even a single mmap() that covers several contiguous 4kb pages will require the kernel (and the TLB, MMU...) to deal with as many virtual/physical associations as the number of these pages (this is the purpose of memory pages; contiguous virtual pages can be mapped to non-contiguous physical pages).
So, considering the usage of these mapped pages, once set up by a single mmap() call or by many, there should not be any difference in performance.
But each single call to mmap() probably requires some overhead in order to choose the part of virtual address space to use; a single mmap() call will just have to choose once a big enough virtual location (it should not be too difficult on a 64-bit system, as stated in other answers) but repeated calls will imply this overhead many times.
So, if I had to deal with this situation on a 64-bit system, I would mmap() the entire file at once, using huge-pages in order to reduce the pressure on TLB.
Note that mapping the entire file at once does not imply using the same amount of physical memory right at this moment; virtual/physical memory association will only occur for each single page when it is accessed for the first time.
There is no shortage of address space on 64-bit architectures. Unless your code has to work in 32-bit architectures too (rare these days), map the whole file once and avoid the overhead of multiple mmap calls and thousands of extra kernel objects. With reading and writing changes, it depends on your desired semantics. See this answer.
On 64-bit systems you should pretty much map the entire file or at least the entire range in one go and let the operating system handle the paging in and out for you. The mmap calls do have some overhead themselves. In practice the user address space on x86-64 is something like 128 TiB so you should be able to map say 1 TiB files/ranges without any problems.
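A minimal sketch of mapping an entire file in one go on a 64-bit system (error handling kept minimal; the 1 MiB probe offset is arbitrary):
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;
    struct stat st;
    const unsigned char *base;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) != 0)
        return 1;

    /* One mapping for the whole file; pages are only faulted in when touched. */
    base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED)
        return 1;

    /* Touch one block somewhere in the middle: only that page gets paged in. */
    if (st.st_size > (1 << 20))
        printf("byte at offset 1 MiB: %d\n", base[1 << 20]);

    munmap((void *)base, st.st_size);
    close(fd);
    return 0;
}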

Means to allocate contiguous physical memory

I am aware that with C malloc and posix_memalign one can allocate contiguous memory from the virtual address space of a process. However, I was wondering whether one can somehow allocate a buffer of physically contiguous memory. I am investigating side channel attacks that exploit the L2 cache, so I want to be sure that I can access the right cache lines.
Your best and easiest take at contiguous memory is to request a single "huge" page from the system. The availability of those depends on your CPU and kernel options (on x86_64 the 2MB huge pages are usually available and some CPUs can also do 1GB pages; other architectures can be more flexible than this). Check the Hugepagesize field in /proc/meminfo for the size of huge pages on your setup.
Those can be accessed in two ways:
By means of a MAP_HUGETLB flag passed to mmap(). This way you can be sure that the "huge" virtual page corresponds to a contiguous physical memory range. Unfortunately, whether the kernel can supply you with a "huge" page depends on many factors (current layout of memory utilization, kernel options, etc - also see the hugepages kernel boot parameter). A sketch of this first approach appears after this list.
By means of mapping a file from a dedicated HugeTLB filesystem (see here: http://lwn.net/Articles/375096/). With HugeTLB file system you can configure the number of huge pages available in advance for some assurance that the necessary amount of huge pages will be available.
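A minimal sketch of the MAP_HUGETLB approach (Linux; the 2 MB size is an assumption, check Hugepagesize in /proc/meminfo, and the call fails if no huge pages are available):
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)    /* assumed 2 MB huge page */

int main(void)
{
    /* One anonymous huge page: physically contiguous if the kernel can supply it. */
    void *p = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");          /* e.g. no huge pages reserved */
        return 1;
    }
    printf("huge page mapped at %p\n", p);
    munmap(p, HUGE_PAGE_SIZE);
    return 0;
}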
The other approach is to write a kernel module which allocates contiguous physical memory on the kernel side and then maps it into your process' address space on request. This approach is sometimes employed on special purpose hardware in embedded systems. Of course, there's still no guarantee that the kernel-side memory allocator will be able to come up with an appropriately sized contiguous physical address range, so on some occasions such address ranges are pre-reserved on boot (one dumb approach is to pass the max_addr parameter to the kernel on boot to leave some of the RAM out of the kernel's reach).
On (almost [Note 1]) all virtual memory architectures, virtual memory is mapped to physical memory in units of a "page". The size of a page is (almost) always a power of 2, and pages are aligned by that size, because the mapping is done by only using the high-order bits of the address. It's common to see a page size of 4K (12 bits of address), although modern CPUs have an option to map much larger pages in order to reduce the size of mapping tables.
Since L2_CACHE_SIZE will generally also be a power of 2 and will be smaller than the page size, any single aligned allocation of size L2_CACHE_SIZE will necessarily fit in a single page, so the bytes in the allocation will be physically contiguous as well.
So in this particular case, you can be assured that your allocated memory will be a single cache-line (at least, on standard machine architectures).
Note 1: Undoubtedly there are machines -- possibly imaginary -- which do not function this way. But the one you are playing with is not one of them.
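As a hedged illustration of that point, an aligned, cache-line-sized allocation might look like this (the 64-byte line size is an assumption; query it on the target machine, e.g. via sysconf() on Linux):
#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE 64    /* assumed line size */

int main(void)
{
    void *line;

    /* A 64-byte block aligned to 64 bytes cannot straddle a page boundary,
       so its bytes are physically contiguous and fall in a single cache line. */
    if (posix_memalign(&line, CACHE_LINE, CACHE_LINE) != 0)
        return 1;
    printf("cache-line-sized block at %p\n", line);
    free(line);
    return 0;
}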

When should I use mmap for file access?

POSIX environments provide at least two ways of accessing files. There's the standard system calls open(), read(), write(), and friends, but there's also the option of using mmap() to map the file into virtual memory.
When is it preferable to use one over the other? What're their individual advantages that merit including two interfaces?
mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.
mmap also allows the operating system to optimize paging operations. For example, consider two programs; program A which reads a 1MB file into a buffer created with malloc, and program B which mmaps the 1MB file into memory. If the operating system has to swap part of A's memory out, it must write the contents of the buffer to swap before it can reuse the memory. In B's case any unmodified mmap'd pages can be reused immediately because the OS knows how to restore them from the existing file they were mmap'd from. (The OS can detect which pages are unmodified by initially marking writable mmap'd pages as read only and catching seg faults, similar to a Copy on Write strategy).
mmap is also useful for inter process communication. You can mmap a file as read / write in the processes that need to communicate and then use synchronization primitives in the mmap'd region (this is what the MAP_HASSEMAPHORE flag is for).
One place mmap can be awkward is if you need to work with very large files on a 32 bit machine. This is because mmap has to find a contiguous block of addresses in your process's address space that is large enough to fit the entire range of the file being mapped. This can become a problem if your address space becomes fragmented, where you might have 2 GB of address space free, but no individual range of it can fit a 1 GB file mapping. In this case you may have to map the file in smaller chunks than you would like to make it fit.
Another potential awkwardness with mmap as a replacement for read / write is that you have to start your mapping on offsets of the page size. If you just want to get some data at offset X you will need to fixup that offset so it's compatible with mmap.
And finally, read / write are the only way you can work with some types of files. mmap can't be used on things like pipes and ttys.
One area where I found mmap() to not be an advantage was when reading small files (under 16K). The overhead of page faulting to read the whole file was very high compared with just doing a single read() system call. This is because the kernel can sometimes satisfy a read entirely in your time slice, meaning your code doesn't switch away. With a page fault, it seemed more likely that another program would be scheduled, making the file operation have a higher latency.
mmap has the advantage when you have random access on big files. Another advantage is that you access it with memory operations (memcpy, pointer arithmetic), without bothering with the buffering. Normal I/O can sometimes be quite difficult when using buffers when you have structures bigger than your buffer. The code to handle that is often difficult to get right, mmap is generally easier. This said, there are certain traps when working with mmap.
As people have already mentioned, mmap is quite costly to set up, so it is worth using only for a given size (varying from machine to machine).
For pure sequential accesses to the file, it is also not always the better solution, though an appropriate call to madvise can mitigate the problem.
You have to be careful with the alignment restrictions of your architecture (SPARC, Itanium); with read/write IO the buffers are often properly aligned and do not trap when dereferencing a cast pointer.
You also have to be careful that you do not access outside of the map. It can easily happen if you use string functions on your map, and your file does not contain a \0 at the end. It will work most of the time when your file size is not a multiple of the page size as the last page is filled with 0 (the mapped area is always in the size of a multiple of your page size).
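A small sketch tying the last two points together: map a file read-only, give the kernel a sequential-access hint with madvise, and remember to stay within the file's real size (the function name map_for_scan is made up for illustration):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only for one sequential pass; the caller must stay within *len. */
static const char *map_for_scan(const char *path, size_t *len)
{
    struct stat st;
    const char *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return NULL;
    }

    p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                        /* the mapping keeps the file referenced */
    if (p == MAP_FAILED)
        return NULL;

    madvise((void *)p, st.st_size, MADV_SEQUENTIAL);  /* hint: read ahead aggressively */
    *len = st.st_size;   /* never scan past this, even though the last page is zero-padded */
    return p;
}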
In addition to the other nice answers, a quote from Linux System Programming, written by Google's expert Robert Love:
Advantages of mmap( )
Manipulating files via mmap( ) has a handful of advantages over the
standard read( ) and write( ) system calls. Among them are:
Reading from and writing to a memory-mapped file avoids the
extraneous copy that occurs when using the read( ) or write( ) system
calls, where the data must be copied to and from a user-space buffer.
Aside from any potential page faults, reading from and writing to a memory-mapped file does not incur any system call or context switch
overhead. It is as simple as accessing memory.
When multiple processes map the same object into memory, the data is shared among all the processes. Read-only and shared writable
mappings are shared in their entirety; private writable mappings have
their not-yet-COW (copy-on-write) pages shared.
Seeking around the mapping involves trivial pointer manipulations. There is no need for the lseek( ) system call.
For these reasons, mmap( ) is a smart choice for many applications.
Disadvantages of mmap( )
There are a few points to keep in mind when using mmap( ):
Memory mappings are always an integer number of pages in size. Thus, the difference between the size of the backing file and an
integer number of pages is "wasted" as slack space. For small files, a
significant percentage of the mapping may be wasted. For example, with
4 KB pages, a 7 byte mapping wastes 4,089 bytes.
The memory mappings must fit into the process' address space. With a 32-bit address space, a very large number of various-sized mappings
can result in fragmentation of the address space, making it hard to
find large free contiguous regions. This problem, of course, is much
less apparent with a 64-bit address space.
There is overhead in creating and maintaining the memory mappings and associated data structures inside the kernel. This overhead is
generally obviated by the elimination of the double copy mentioned in
the previous section, particularly for larger and frequently accessed
files.
For these reasons, the benefits of mmap( ) are most greatly realized
when the mapped file is large (and thus any wasted space is a small
percentage of the total mapping), or when the total size of the mapped
file is evenly divisible by the page size (and thus there is no wasted
space).
Memory mapping has a potential for a huge speed advantage compared to traditional IO. It lets the operating system read the data from the source file as the pages in the memory mapped file are touched. This works through demand paging: touching an unmapped page faults, the OS detects the fault, and loads the corresponding data from the file automatically.
This works the same way as the paging mechanism and is usually optimized for high speed I/O by reading data on system page boundaries and sizes (usually 4K), a size for which most file system caches are optimized.
An advantage that isn't listed yet is the ability of mmap() to keep a read-only mapping as clean pages. If one allocates a buffer in the process's address space, then uses read() to fill the buffer from a file, the memory pages corresponding to that buffer are now dirty since they have been written to.
Dirty pages cannot be dropped from RAM by the kernel. If there is swap space, then they can be paged out to swap. But this is costly, and on some systems, such as small embedded devices with only flash memory, there is no swap at all. In that case, the buffer will be stuck in RAM until the process exits, or perhaps gives it back with madvise().
mmap() pages that have not been written to are clean. If the kernel needs RAM, it can simply drop them and reuse the RAM the pages were in. If the process that had the mapping accesses it again, it causes a page fault and the kernel reloads the pages from the file they originally came from, the same way they were populated in the first place.
This doesn't require more than one process using the mapped file to be an advantage.

What is the difference between vmalloc and kmalloc?

I've googled around and found most people advocating the use of kmalloc, as you're guaranteed to get contiguous physical blocks of memory. However, it also seems as though kmalloc can fail if a contiguous physical block that you want can't be found.
What are the advantages of having a contiguous block of memory? Specifically, why would I need to have a contiguous physical block of memory in a system call? Is there any reason I couldn't just use vmalloc?
Finally, if I were to allocate memory during the handling of a system call, should I specify GFP_ATOMIC? Is a system call executed in an atomic context?
GFP_ATOMIC
The allocation is high-priority and
does not sleep. This is the flag to
use in interrupt handlers, bottom
halves and other situations where you
cannot sleep.
GFP_KERNEL
This is a normal allocation and might block. This is the flag to use
in process context code when it is safe to sleep.
You only need to worry about using physically contiguous memory if the buffer will be accessed by a DMA device on a physically addressed bus (like PCI). The trouble is that many system calls have no way to know whether their buffer will eventually be passed to a DMA device: once you pass the buffer to another kernel subsystem, you really cannot know where it is going to go. Even if the kernel does not use the buffer for DMA today, a future development might do so.
vmalloc is often slower than kmalloc, because it may have to remap the buffer space into a virtually contiguous range. kmalloc never remaps, though if not called with GFP_ATOMIC kmalloc can block.
kmalloc is limited in the size of buffer it can provide: 128 KBytes*). If you need a really big buffer, you have to use vmalloc or some other mechanism like reserving high memory at boot.
*) This was true of earlier kernels. On recent kernels (I tested this on 2.6.33.2), max size of a single kmalloc is up to 4 MB! (I wrote a fairly detailed post on this.) — kaiwan
For a system call you don't need to pass GFP_ATOMIC to kmalloc(), you can use GFP_KERNEL. You're not an interrupt handler: the application code enters the kernel context by means of a trap, it is not an interrupt.
Short answer: download Linux Device Drivers and read the chapter on memory management.
Seriously, there are a lot of subtle issues related to kernel memory management that you need to understand - I spend a lot of my time debugging problems with it.
vmalloc() is very rarely used, because the kernel rarely uses virtual memory. kmalloc() is what is typically used, but you have to know what the consequences of the different flags are and you need a strategy for dealing with what happens when it fails - particularly if you're in an interrupt handler, like you suggested.
Linux Kernel Development by Robert Love (Chapter 12, page 244 in 3rd edition) answers this very clearly.
Yes, physically contiguous memory is not required in many cases. The main reason kmalloc is used more than vmalloc in the kernel is performance. The book explains that when big memory chunks are allocated using vmalloc, the kernel has to map the physically non-contiguous chunks (pages) into a single contiguous virtual memory region. Since the memory is virtually contiguous but physically non-contiguous, several virtual-to-physical address mappings have to be added to the page table. In the worst case, there will be (size of buffer / page size) mappings added to the page table.
This also adds pressure on TLB (the cache entries storing recent virtual to physical address mappings) when accessing this buffer. This can lead to thrashing.
The kmalloc() & vmalloc() functions are a simple interface for obtaining kernel memory in byte-sized chunks.
The kmalloc() function guarantees that the pages are physically contiguous (and virtually contiguous).
The vmalloc() function works in a similar fashion to kmalloc(), except it allocates memory that is only virtually contiguous and not necessarily physically contiguous.
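As a rough illustration (kernel code, compiled against kernel headers; the module and variable names are made up for this sketch), the typical usage pattern looks like:
#include <linux/module.h>
#include <linux/slab.h>      /* kmalloc, kfree */
#include <linux/vmalloc.h>   /* vmalloc, vfree */

static void *small_buf;      /* physically contiguous, suitable for DMA */
static void *big_buf;        /* only virtually contiguous */

static int __init alloc_demo_init(void)
{
    small_buf = kmalloc(4096, GFP_KERNEL);     /* may sleep: process context only */
    big_buf   = vmalloc(16 * 1024 * 1024);     /* large, physically scattered */
    if (!small_buf || !big_buf) {
        kfree(small_buf);                      /* both calls are safe on NULL */
        vfree(big_buf);
        return -ENOMEM;
    }
    return 0;
}

static void __exit alloc_demo_exit(void)
{
    kfree(small_buf);
    vfree(big_buf);
}

module_init(alloc_demo_init);
module_exit(alloc_demo_exit);
MODULE_LICENSE("GPL");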
What are the advantages of having a contiguous block of memory? Specifically, why would I need to have a contiguous physical block of memory in a system call? Is there any reason I couldn't just use vmalloc?
From Google's "I'm Feeling Lucky" on vmalloc:
kmalloc is the preferred way, as long as you don't need very big areas. The trouble is, if you want to do DMA from/to some hardware device, you'll need to use kmalloc, and you'll probably need bigger chunk. The solution is to allocate memory as soon as possible, before
memory gets fragmented.
On a 32-bit system, kmalloc() returns a kernel logical address (it's a virtual address, though) which has a direct mapping (actually with a constant offset) to the physical address.
This direct mapping ensures that we get a contiguous physical chunk of RAM. Suited for DMA where we give only the initial pointer and expect a contiguous physical mapping thereafter for our operation.
vmalloc() returns a kernel virtual address which in turn might not have a contiguous mapping in physical RAM.
It is useful for large memory allocations and in cases where we don't care whether the memory allocated to our process is also contiguous in physical RAM.
One other difference is that kmalloc will return a logical address (unless you specify GFP_HIGHMEM). Logical addresses are placed in "low memory" (in the first gigabyte of physical memory) and are mapped directly to physical addresses (use the __pa macro to convert them). This property implies that kmalloced memory is contiguous memory.
On the other hand, vmalloc is able to return virtual addresses from "high memory". These addresses cannot be converted into physical addresses in a direct fashion (you have to use the virt_to_page function).
In short, vmalloc and kmalloc both address fragmentation: vmalloc uses memory mappings to work around external fragmentation, while kmalloc uses the slab allocator to reduce internal fragmentation. For what it's worth, kmalloc also has many other advantages.
