We have a Linux project where we are pushing struct information over buffers. Recently, we found that the kernel parameter optmem_max was too small. I was asked by a supervisor to increase it. While I understand how to do this, I don't really understand how to decide how big to make it.
Further, I don't really get what optmem_max is.
Here's what the kernel documentation says:
"Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence of struct cmsghdr structures with appended data."
(I don't really understand what this means in English).
I see many examples on the Internet suggesting that this should be increased for better performance.
In:
/etc/sysctl.conf
I added this line to fix the problem:
net.core.optmem_max=1020000
Once this was done, we got better performance.
So to summarize my question:
In English, what is optmem_max?
Why is it so low by default in most Linux distros if making it bigger improves performance?
How does one determine a good size for this number?
What are the ramifications of making this really large?
Aside from /etc/sysctl.conf, where is this set in the kernel by default? I grepped the kernel source, but I could find no trace of the default value of optmem_max being set to 20480, which is the default on our system.
In English, what is optmem_max?
optmem_max is a kernel option that affects the memory allocated to the cmsg list maintained by the kernel that contains "extra" packet information like SCM_RIGHTS or IP_TTL.
Increasing this option allows the kernel to allocate more memory as needed for more control messages that need to be sent for each socket connected (including IPC sockets/pipes).
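To make "ancillary data" a little more concrete: it is the control-message buffer you hand to sendmsg()/recvmsg(), for example to pass a file descriptor over a Unix socket. Here is a minimal userspace sketch (my own illustration, not from the kernel docs; send_fd and the variable names are made up). The kernel-side copy of this cmsghdr buffer is what optmem_max bounds:

/* Sketch: pass a file descriptor over a UNIX-domain socket with
 * SCM_RIGHTS ancillary data.  The cmsghdr block built below is the
 * "ancillary data" whose kernel-side allocation optmem_max limits. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static ssize_t send_fd(int sock, int fd_to_pass)
{
    char dummy = '*';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union {                                  /* space for one int-sized cmsg */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));

    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = ctrl.buf,
        .msg_controllen = sizeof(ctrl.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;           /* "pass these fds along" */
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(sock, &msg, 0);           /* -1 on error */
}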
Why is it so low by default in most Linux distros if making it bigger improves performance?
Most distributions have normal users in mind, and most normal users, even those using Linux/Unix as a server, do not have a farm of servers linked by Fibre Channel, or server processes that need gigabytes of IPC transfer.
A 20 KB buffer is large enough for most uses, it keeps the default kernel memory footprint small, and it is easy enough to raise for those who need more.
How does one determine a good size for this number?
Depends on your usage, but the Arch Wiki suggests 64 KB for optmem_max and 16 MB for rmem_max and wmem_max (the maximum receive and send buffer sizes, respectively).
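For reference, those suggested values would look like this in /etc/sysctl.conf (a starting point only; benchmark with your own workload before settling on numbers):

net.core.optmem_max=65536
net.core.rmem_max=16777216
net.core.wmem_max=16777216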
What are the ramifications of making this really large?
More kernel memory can be allocated to each connected socket, possibly unnecessarily.
Aside from /etc/sysctl.conf, where is this set in the kernel by default? I grepped the kernel source, but I could find no trace of the default value of optmem_max being set to 20480, which is the default on our system.
I'm not a Linux kernel source aficionado, but it looks like it could be in net/core/sock.c:318.
Hope that helps.
I am re-implementing mmap in a device driver for DMA.
I saw this question: Linux Driver: mmap() kernel buffer to userspace without using nopage, which has an answer that uses vm_insert_page() to map one page at a time; hence, for multiple pages, it has to be called in a loop. Is there another API that handles this?
Previously I used dma_alloc_coherent to allocate a chunk of memory for DMA and used remap_pfn_range to build a page table that associates the process's virtual memory with the physical memory.
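For context, that earlier mmap handler looked roughly like the following sketch (names such as mydrv_dev, dma_vaddr and buf_size are illustrative, not my exact driver code; dma_mmap_coherent() is generally the more portable helper for coherent buffers, but remap_pfn_range() is what I used):

/* Sketch: map a dma_alloc_coherent() buffer into userspace.
 * dev->dma_vaddr and dev->buf_size are assumed to have been set up at
 * probe time by dma_alloc_coherent(). */
static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct mydrv_dev *dev = filp->private_data;
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > dev->buf_size)
        return -EINVAL;

    /* Build user page-table entries for the physical pages behind the
     * buffer (this assumes the coherent buffer is ordinary lowmem, where
     * virt_to_phys() is valid). */
    return remap_pfn_range(vma, vma->vm_start,
                           virt_to_phys(dev->dma_vaddr) >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}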
Now I would like to allocate a much larger chunk of memory using __get_free_pages with an order greater than 1. I am not sure how to build the page table in that case. The reason is as follows:
I checked the book Linux Device Drivers and noticed the following:
Background:
When a user-space process calls mmap to map device memory into its address space, the system responds by creating a new VMA to represent that mapping. A driver that supports mmap (and, thus, that implements the mmap method) needs to help that process by completing the initialization of that VMA.
Problem with remap_pfn_range:
remap_pfn_range won’t allow you to remap conventional addresses, which include the ones you obtain by calling get_free_page. Instead, it maps in the zero page. Everything appears to work, with the exception that the process sees private, zero-filled pages rather than the remapped RAM that it was hoping for.
And the corresponding note on the scullp device driver, whose implementation uses get_free_pages and only supports mmap for order-0 allocations (i.e. a single page):
The mmap method is disabled for a scullp device if the allocation order is greater than zero, because nopage deals with single pages rather than clusters of pages. scullp simply does not know how to properly manage reference counts for pages that are part of higher-order allocations.
May I know if there is a way to create a VMA for pages obtained using __get_free_pages with an order greater than 1?
I checked the Linux source code and noticed that some drivers re-implement struct dma_map_ops->alloc() and struct dma_map_ops->map_page(). May I know if this is the correct way to do it?
I think I got the answer to my question. Feel free to correct me if I am wrong.
I happened to see this patch: mm: Introduce new vm_map_pages() and vm_map_pages_zero() API while I was googling for vm_insert_page.
Previously, drivers had their own way of mapping a range of kernel pages/memory into a user vma, and this was done by invoking vm_insert_page() within a loop.
As this pattern is common across different drivers, it can be generalized by creating new functions and using them across the drivers.
vm_map_pages() is the API that can be used to map kernel memory/pages in drivers that have considered vm_pgoff.
After reading it, I knew I had found what I wanted.
That function can also be found in the Linux Kernel Core API documentation.
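To make this concrete, a minimal sketch of an mmap handler built on vm_map_pages() could look like this (mydrv_dev, dev->pages and dev->nr_pages are illustrative names; they are assumed to describe, one struct page pointer per page, the buffer the driver allocated):

/* Sketch: map a whole array of pages into the user VMA with a single
 * vm_map_pages() call, instead of looping over vm_insert_page().
 * dev->pages[] is assumed to hold one refcounted struct page * for each
 * page of the driver's buffer. */
static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct mydrv_dev *dev = filp->private_data;

    /* vm_map_pages() takes vma->vm_pgoff into account and checks that the
     * requested range fits inside nr_pages. */
    return vm_map_pages(vma, dev->pages, dev->nr_pages);
}

The pages[] array can be filled once at allocation time, e.g. pages[i] = virt_to_page(vaddr + i * PAGE_SIZE) for a __get_free_pages() block; for a higher-order allocation you may additionally need something like split_page() so each constituent page carries its own reference count, but I have not verified that part myself.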
As for the difference between remap_pfn_range() and vm_insert_page(), which requires a loop for a list of contiguous pages, I found this answer extremely helpful; it includes a link to an explanation by Linus.
As a side note, the patch mm: Introduce new vm_insert_range and vm_insert_range_buggy API shows that the earlier version of vm_map_pages() was called vm_insert_range(); we should stick to vm_map_pages(), since vm_insert_range() was subsequently renamed to vm_map_pages().
I am working in an Embedded Linux system which has the 2.6.35.3 kernel.
Within the device we require a 4MB+192kB contiguous DMA capable buffer for one of our data capture drivers. The driver uses SPI transfers to copy data into this buffer.
The user space application issues a mmap system call to map the buffer into user space and after that, it directly reads the available data.
The buffer is allocated using the alloc_bootmem_low_pages call, because it is not possible to allocate a buffer of more than 4 MB using other methods, such as kmalloc.
However, due to a recent upgrade, we need to increase the buffer space to 22MB+192kB. As I've read, the Linux kernel has only 16MB of DMA capable memory. Therefore, theoretically this is not possible unless there is a way to tweak this setting.
If anyone knows how to do this, please let me know.
Is this a good idea or will this make the system unstable?
The ZONE_DMA 16MB limit is imposed by a hardware limitation of certain devices. Specifically, on the PC architecture in the olden days, ISA cards performing DMA needed buffers allocated in the first 16MB of the physical address space because the ISA interface had 24 physical address lines which were only capable of addressing the first 2^24=16MB of physical memory. Therefore, device drivers for these cards would allocate DMA buffers in the ZONE_DMA area to accommodate this hardware limitation.
Depending on your embedded system and device hardware, your device either is or isn't subject to this limitation. If it is subject to this limitation, there is no software fix you can apply to allow your device to address a 22MB block of memory, and if you modify the kernel to extend the DMA address space beyond 16MB, then of course the system will become unstable.
On the other hand, if your device is not subject to this limitation (which is the only way it could possibly write to a 22MB buffer), then there is no reason to allocate memory in ZONE_DMA. In this case, I think if you simply replace your alloc_bootmem_low_pages call with an alloc_bootmem_pages call, it should work fine to allocate your 22MB buffer. If the system becomes unstable, then it's probably because your device is subject to a hardware limitation, and you cannot use a 22MB buffer.
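A rough sketch of that substitution (the buffer size is taken from your question; the function and variable names are only illustrative):

/* Boot-time allocation of the 22MB + 192kB capture buffer.
 * alloc_bootmem_pages() returns page-aligned boot memory without the
 * ZONE_DMA (low 16MB) restriction that alloc_bootmem_low_pages() imposes. */
#include <linux/bootmem.h>
#include <linux/init.h>

#define CAPTURE_BUF_SIZE ((22UL * 1024 * 1024) + (192UL * 1024))

static void *capture_buf;

static void __init capture_reserve_buffer(void)
{
    capture_buf = alloc_bootmem_pages(CAPTURE_BUF_SIZE);
    /* In 2.6-era kernels the alloc_bootmem_* helpers panic on failure,
     * so there is no NULL check here. */
}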
It looks like my first attempt at an answer was a little too generic. I think that for the specific i.MX287 architecture you mention in the comments, the DMA zone size is configurable through the CONFIG_DMA_ZONE_SIZE parameter, which can be made as large as 32 MB. The relevant configuration option should be under "System Type -> Freescale i.MXS implementations -> DMA memory zone size".
On this architecture, it seems safe to modify it, as it looks like it is not addressing a hardware limitation (the way it was on x86 architectures) but just determining how to lay out memory.
If you try setting it to 32 MB and test both alloc_bootmem_pages and alloc_bootmem_low_pages in your own driver, perhaps one of them will work.
Otherwise, I think I'm out of ideas.
I want to add functions to the Linux kernel to write and read data, but I don't know how or where to store the data so other programs can read/overwrite/delete it.
Program A calls uf_obj_add(param, param, param) it stores information in memory.
Program B does the same.
Program C calls uf_obj_get(param); the kernel checks whether the operation is allowed and, if it is, returns the data.
Do I just need to malloc() memory, or is it more difficult?
And how can uf_obj_get() access the memory that uf_obj_add() writes to?
Where should I store the memory location information so that both functions can access the same data?
As pointed out by commentators to your question, achieving this in userspace would probably be much safer. However, if you insist on achieving this by modifying kernel code, one way you can go is implementing a new device driver, which has functions such as read and write that you may implement according to your needs, in order to have your processes access some memory space. Your processes can then work, as you described, by reading from and writing onto the same space more or less as if they are reading from/writing to a regular file.
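As a very rough illustration of that idea, here is a heavily simplified character-device sketch (a misc device with one shared buffer, no locking, no permission checks, every name invented for the example) that lets one process write data and another process read it back through the same device node:

/* Minimal sketch of a character device sharing one buffer between
 * processes.  Real code would need locking, access control and proper
 * per-object bookkeeping; everything here is illustrative only. */
#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/uaccess.h>

#define SHARED_BUF_SIZE 4096

static char shared_buf[SHARED_BUF_SIZE];
static size_t shared_len;

static ssize_t uf_read(struct file *f, char __user *ubuf,
                       size_t count, loff_t *ppos)
{
    return simple_read_from_buffer(ubuf, count, ppos, shared_buf, shared_len);
}

static ssize_t uf_write(struct file *f, const char __user *ubuf,
                        size_t count, loff_t *ppos)
{
    ssize_t ret = simple_write_to_buffer(shared_buf, SHARED_BUF_SIZE,
                                         ppos, ubuf, count);
    if (ret > 0)
        shared_len = *ppos;
    return ret;
}

static const struct file_operations uf_fops = {
    .owner = THIS_MODULE,
    .read  = uf_read,
    .write = uf_write,
};

static struct miscdevice uf_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "uf_obj",            /* appears as /dev/uf_obj */
    .fops  = &uf_fops,
};

static int __init uf_init(void)
{
    return misc_register(&uf_dev);
}

static void __exit uf_exit(void)
{
    misc_deregister(&uf_dev);
}

module_init(uf_init);
module_exit(uf_exit);
MODULE_LICENSE("GPL");

Your uf_obj_add()/uf_obj_get() semantics would then map onto write()/read() (or an ioctl) on /dev/uf_obj, rather than onto brand-new system calls.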
I would recommend reading quite a bit of materials before diving into kernel code, though. A good resource on device drivers is Linux Device Drivers. Even though a significant portion of its information may not be up-to-date, you may find here a version of the source code used in the book ported to linux 3.x. You may find what you are looking for under the directory scull.
Again, as pointed out by commentators to your question, I do not think you should jump right into modifying kernel code. However, for educational purposes scull may serve as a good starting point for reading kernel code and seeing how to achieve results similar to what you described.
I am running some experiments with I/O intensive applications and am trying to understand the effects of varying the kernel i/o buffer size, different elevator algorithms, and so on.
How can I know the current size of the i/o buffer in the kernel? Does the kernel use more than one buffer as need arises? How can I change the size of this buffer? Is there a config file somewhere that stores this info?
(To be clear, I am not talking about processor or disk caches, I am talking about the buffer used by the kernel internally that buffers reads/writes before flushing them out to disk from time to time).
Thanks in advance.
The kernel does not buffer reads and writes the way you think... It maintains a "page cache" that holds pages from the disk. You do not get to manipulate its size (well, not directly, anyway); the kernel will always use all available free memory for the page cache.
You need to explain what you are really trying to do. If you want some control over how much data the kernel pre-fetches from disk, try a search for "linux readahead". (Hint: blockdev --setra XXX)
If you want some control over how long the kernel will hold dirty pages before flushing them to disk, try a search for "linux dirty_ratio".
A specific application can also bypass the page cache completely by using O_DIRECT, and it can exercise some control over it using fsync, sync_file_range, posix_fadvise, and posix_madvise. (O_DIRECT and sync_file_range are Linux-specific; the rest are POSIX.)
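For example, a streaming writer can keep its data out of the page cache roughly like this (a sketch only; the file name and sizes are made up, and O_DIRECT is the stricter alternative but requires aligned buffers and offsets):

/* Sketch: write data through the page cache, force writeback of the
 * range with sync_file_range(), then drop the now-clean pages with
 * posix_fadvise(POSIX_FADV_DONTNEED). */
#define _GNU_SOURCE                 /* sync_file_range() is Linux-specific */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    char buf[4096];
    memset(buf, 'x', sizeof(buf));

    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        return 1;

    /* Start writeback of this range and wait for it to complete... */
    sync_file_range(fd, 0, sizeof(buf),
                    SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
                    SYNC_FILE_RANGE_WAIT_AFTER);

    /* ...then tell the kernel these pages will not be reused, so the
     * page cache can evict them immediately. */
    posix_fadvise(fd, 0, sizeof(buf), POSIX_FADV_DONTNEED);

    close(fd);
    return 0;
}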
You will be able to ask a better question if you first educate yourself about the Linux VM subsystem, especially the page cache.
I think you mean the disk IO queues. For example:
$ cat /sys/block/sda/queue/nr_requests
128
How this queue is used depends on the IO scheduler that is in use.
$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
cfq is the most common choice, although on systems with advanced disk controllers and in virtual guest systems noop is also a very good choice.
There is no config file for this information that I am aware of. On systems that I need to change the queue settings on I put the changes into /etc/rc.local although you could use a full-up init script instead and place it into an RPM or DEB for mass distribution to a lot of systems.
I'd like to obtain memory usage information for both per process and system wide. In Windows, it's pretty easy. GetProcessMemoryInfo and GlobalMemoryStatusEx do these jobs greatly and very easily. For example, GetProcessMemoryInfo gives "PeakWorkingSetSize" of the given process. GlobalMemoryStatusEx returns system wide available memory.
However, I need to do it on Linux. I'm trying to find Linux system APIs that are equivalent to GetProcessMemoryInfo and GlobalMemoryStatusEx.
I found getrusage. However, 'ru_maxrss' (maximum resident set size) in struct rusage is just zero; it is not implemented. Also, I have no idea how to get system-wide free memory.
As a workaround, I'm currently calling system("ps -p %my_pid -o vsz,rsz") and manually logging the output to a file, but it's dirty and the data is not convenient to process.
I'd like to know some fancy Linux APIs for this purpose.
You can see how it is done in libstatgrab.
And you can also use it directly (it is GPL-licensed).
Linux has a (modular) filesystem interface for fetching such data from the kernel, so it is usable from nearly any language or scripting tool.
Memory can be complex. There's the program executable itself, presumably mmap()'ed in. Shared libraries. Stack utilization. Heap utilization. Portions of the software resident in RAM. Portions swapped out. Etc.
What exactly is "PeakWorkingSetSize"? It sounds like the maximum resident set size (the maximum non-swapped physical-memory RAM used by the process).
Though it could also be the total virtual memory footprint of the entire process (sum of the in-RAM and SWAPPED-out parts).
Regardless, under Linux, you can strace a process to see its kernel-level interactions. "ps" gets its data from /proc/${PID}/* files.
I suggest you cat /proc/${PID}/status. The Vm* lines are quite useful.
Specifically: VmData refers to process heap utilization. VmStk refers to process stack utilization.
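A small sketch that pulls the closest Linux equivalents (peak and current RSS from /proc/self/status for the process side, and system-wide totals from sysinfo(2)) might look like this; the values are printed as reported:

/* Sketch: per-process memory from /proc/self/status (VmHWM ~ peak RSS,
 * VmRSS ~ current RSS, VmData ~ heap, VmStk ~ stack, values in kB) and
 * system-wide memory from sysinfo(2). */
#include <stdio.h>
#include <string.h>
#include <sys/sysinfo.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    if (f) {
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "VmHWM:", 6) == 0 ||
                strncmp(line, "VmRSS:", 6) == 0 ||
                strncmp(line, "VmData:", 7) == 0 ||
                strncmp(line, "VmStk:", 6) == 0)
                fputs(line, stdout);
        }
        fclose(f);
    }

    struct sysinfo si;
    if (sysinfo(&si) == 0) {
        /* totalram/freeram are counted in units of si.mem_unit bytes */
        printf("total RAM: %llu MB, free RAM: %llu MB\n",
               (unsigned long long)si.totalram * si.mem_unit / (1024 * 1024),
               (unsigned long long)si.freeram  * si.mem_unit / (1024 * 1024));
    }
    return 0;
}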
If you continue using "ps", you might consider popen().
I have no idea how to get system-wide free memory.
There's always /usr/bin/free
Note that Linux will make use of unused memory for buffering files and caching... Thus the +/-buffers/cache line.