How does procfs output /proc/iomem?

I have looked into similar questions on this site (listed at the end) but still feel like I am missing a couple of points; hopefully someone can help here:
Is there a hook into the proc file system that connects the /proc/iomem inode to a function that dumps the information? I wasn't able to find where in procfs this function lives. I grepped for iomem under fs/proc in the Linux source tree and got nothing, so maybe this is more of a procfs question. The answer to it might help me dig up the answer to the next question.
/proc/iomem has more entries than the BIOS E820 information I extracted from either dmesg or /sys/firmware/memmap (these two are actually consistent with each other). For example, /sys/firmware/memmap does not seem to contain the PCI memory-mapped regions. Drivers' init code calls request_mem_region() and adds more entries to the map, so somewhere there should be a global variable (the root of all resources?) that remembers this graph?
The questions on stackoverflow I have looked into:
How is /proc/io* populated?
Expose information to /proc/iomem
Content of /proc/iomem

struct resource iomem_resource is what you're looking for; it is defined and initialized in kernel/resource.c, and the /proc/iomem entry is created there via proc_create_seq_data(). In the same file, the struct seq_operations instance resource_op defines what happens when you, for example, cat the file from userland.
iomem_resource is a globally exported symbol and is used throughout the kernel, drivers included, to request resources. You can find instances of devm_*/request_resource() scattered across the kernel, which take either iomem_resource or its sibling ioport_resource, based on either fixed settings or on configuration. Examples of configuration sources are a) device trees, which are prevalent in embedded settings, and b) E820 or UEFI, which are found more often on x86.
Starting with b), which was asked about in the question, the file arch/x86/kernel/e820.c shows examples of how reserved memory gets inserted into /proc/iomem via insert_resource().
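As a minimal sketch (not taken from e820.c; the address range and resource name are made up for illustration), a firmware-reserved region could be added to the tree backing /proc/iomem like this:

#include <linux/init.h>
#include <linux/ioport.h>

/* Hypothetical reserved range, purely for illustration. */
static struct resource demo_reserved = {
        .name  = "Firmware reserved (demo)",
        .start = 0x80000000,
        .end   = 0x80ffffff,
        .flags = IORESOURCE_MEM | IORESOURCE_BUSY,
};

static int __init demo_reserve_init(void)
{
        /* Insert the range into the global iomem_resource tree,
         * which is exactly what /proc/iomem walks and prints. */
        return insert_resource(&iomem_resource, &demo_reserved);
}
device_initcall(demo_reserve_init);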
This excellent link has more details on the dynamics of requesting memory map details from the BIOS.
Another alternative sequence (which relies on CONFIG_OF) for how a device driver requests the needed resources is:
The Open Firmware API traverses the device tree and finds a matching driver, for example via a struct of_device_id match table.
The driver defines a struct platform_driver which contains both the struct of_device_id table and a probe function. This probe function is then called.
Inside the probe function, a call to platform_get_resource() is made, which reads the reg property from the device tree. This property defines the physical memory map for the specific device.
A call to devm_request_mem_region() is made (a managed wrapper around request_mem_region()) to actually allocate the resource and add it to /proc/iomem; a sketch of this sequence follows below.
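A rough sketch of that sequence (the driver name, device name and compatible string are hypothetical, and error handling is kept to a minimum):

#include <linux/io.h>
#include <linux/ioport.h>
#include <linux/module.h>
#include <linux/of.h>
#include <linux/platform_device.h>

static int demo_probe(struct platform_device *pdev)
{
        struct resource *res;
        void __iomem *regs;

        /* Read the first "reg" range from the matching device tree node. */
        res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
        if (!res)
                return -ENODEV;

        /* Claim the range (it now shows up in /proc/iomem) ... */
        if (!devm_request_mem_region(&pdev->dev, res->start,
                                     resource_size(res), pdev->name))
                return -EBUSY;

        /* ... and map it so the driver can access the registers. */
        regs = devm_ioremap(&pdev->dev, res->start, resource_size(res));
        if (!regs)
                return -ENOMEM;

        return 0;
}

static const struct of_device_id demo_of_match[] = {
        { .compatible = "acme,demo-device" },  /* hypothetical compatible string */
        { }
};
MODULE_DEVICE_TABLE(of, demo_of_match);

static struct platform_driver demo_driver = {
        .probe  = demo_probe,
        .driver = {
                .name           = "demo-device",
                .of_match_table = demo_of_match,
        },
};
module_platform_driver(demo_driver);
MODULE_LICENSE("GPL");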

How to create vm_area mapping if using __get_free_pages() with order greater than 1?

I am re-implementing mmap in a device driver for DMA.
I saw this question: Linux Driver: mmap() kernel buffer to userspace without using nopage. It has an answer using vm_insert_page() to map one page at a time; hence, for multiple pages, it needs to be executed in a loop. Is there another API that handles this?
Previously I used dma_alloc_coherent to allocate a chunk of memory for DMA and used remap_pfn_range to build a page table that associates the process's virtual memory with that physical memory.
Now I would like to allocate a much larger chunk of memory using __get_free_pages with order greater than 1. I am not sure how to build the page table in that case. The reason is as follows:
I checked the book Linux Device Drivers and noticed the following:
Background:
When a user-space process calls mmap to map device memory into its address space, the system responds by creating a new VMA to represent that mapping. A driver that supports mmap (and, thus, that implements the mmap method) needs to help that process by completing the initialization of that VMA.
Problem with remap_pfn_range:
remap_pfn_range won’t allow you to remap conventional addresses, which include the ones you obtain by calling get_free_page. Instead, it maps in the zero page. Everything appears to work, with the exception that the process sees private, zero-filled pages rather than the remapped RAM that it was hoping for.
The corresponding implementation in the scullp device driver only supports get_free_pages with order 0, i.e. a single page:
The mmap method is disabled for a scullp device if the allocation order is greater than zero, because nopage deals with single pages rather than clusters of pages. scullp simply does not know how to properly manage reference counts for pages that are part of higher-order allocations.
May I know if there is a way to create a VMA for pages obtained using __get_free_pages with order greater than 1?
I checked the Linux source code and noticed that some drivers re-implement struct dma_map_ops->alloc() and struct dma_map_ops->map_page(). May I know if this is the correct way to do it?
I think I got the answer to my question. Feel free to correct me if I am wrong.
I happened to see this patch: mm: Introduce new vm_map_pages() and vm_map_pages_zero() API while I was googling for vm_insert_page.
Previously, drivers had their own way of mapping a range of kernel pages/memory into a user vma, and this was done by invoking vm_insert_page() within a loop.
As this pattern is common across different drivers, it can be generalized by creating new functions and using them across the drivers.
vm_map_pages() is the API that can be used to map kernel memory/pages in drivers that take vm_pgoff into account.
After reading it, I knew I had found what I wanted.
That function can also be found in the Linux Kernel Core API documentation.
As for the difference between remap_pfn_range() and vm_insert_page() (which requires a loop for a list of contiguous pages), I found the answer to this question extremely helpful; it includes a link to an explanation by Linus.
As a side note, the patch mm: Introduce new vm_insert_range and vm_insert_range_buggy API indicates that the earlier version of vm_map_pages() was vm_insert_range(), but we should stick to vm_map_pages(), since under the hood vm_map_pages() calls vm_insert_range().
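To make that concrete, here is a hedged sketch of an mmap handler built on vm_map_pages(); the handler name, the allocation order and the buffer variable are made up, and the buffer is assumed to have been allocated elsewhere with __get_free_pages(GFP_KERNEL, BUF_ORDER):

#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/mm.h>

#define BUF_ORDER 4                      /* 2^4 = 16 physically contiguous pages */
#define BUF_PAGES (1UL << BUF_ORDER)

static unsigned long buf_kaddr;          /* kernel virtual address of the buffer */

static int demo_mmap(struct file *filp, struct vm_area_struct *vma)
{
        struct page *pages[BUF_PAGES];
        unsigned long i;

        /* Build a page array describing the contiguous allocation. */
        for (i = 0; i < BUF_PAGES; i++)
                pages[i] = virt_to_page(buf_kaddr + i * PAGE_SIZE);

        /* vm_map_pages() honours vma->vm_pgoff and checks the vma size. */
        return vm_map_pages(vma, pages, BUF_PAGES);
}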

What is the detailed process of BPF API map helpers like "bpf_map_update_elem"?

In my understanding, when userspace uses bpf_map_update_elem(int fd, void *key, void *value, __u64 flags):
first, userspace finds the map through the fd;
second, userspace sets up some memory in user space;
and so on.
I know a little bit, but the specific process is still unclear to me.
So I want to know what happens in detail when userspace runs these API map helpers.
Because you mention “user space”, I am unsure what you are talking about exactly. So let's start with some clarification.
BPF maps (or at least, most of the existing types, including hash maps and arrays) can be accessed in two ways:
From user space, by any application running on the system and having sufficient permission
From kernel space, from BPF programs
From user space, there is no “helper” function. Interaction with maps is entirely (*) done through the bpf() syscall (with the BPF_MAP_LOOKUP_ELEM, BPF_MAP_UPDATE_ELEM, BPF_MAP_DELETE_ELEM commands passed to the syscall as its first argument). See the bpf(2) manual page for more details. This is what you use in a user space application that would load and manage BPF programs and maps, say bpftool for example.
From kernel space, i.e. from a BPF program, things work differently and access is done with one of the BPF “helpers” such as bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags). See the bpf-helpers(7) man page for details on existing helpers. You can find details on those helper calls in the Cilium guide, or obviously by reading kernel code (e.g. for array maps). They look like low-level C function calls, with BPF registers used to pass the necessary arguments; execution then jumps from the BPF program instructions into the helper, which is compiled as part of the kernel binary.
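To illustrate the kernel-space side, here is a hedged sketch of a small BPF program in the libbpf/clang style (the map, section and program names are made up) that calls the bpf_map_update_elem() helper:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} counters SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_write")
int count_writes(void *ctx)
{
        __u32 key = 0;
        __u64 one = 1;
        __u64 *val;

        /* Helpers run inside the kernel on behalf of the BPF program. */
        val = bpf_map_lookup_elem(&counters, &key);
        if (val)
                __sync_fetch_and_add(val, 1);
        else
                bpf_map_update_elem(&counters, &key, &one, BPF_ANY);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";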
So you mentioned bpf_map_update_elem() and user space. Although this is the name for the helper on the kernel side, I suspect you might be talking about the function with the same name which is offered by the libbpf library, to provide a wrapper around the bpf() system call. So what happens with this function is rather simple.
There is no need to find the map from the file descriptor in user space: actually the opposite happens, the file descriptor is obtained for the map in user space (from its map id, or from its pinned path under the /sys/fs/bpf virtual file system, for example). The fd is then passed to the bpf() system call and used by the kernel as a reference to the map.
I'm not sure what you mean by “userspace make a memory in user-space”. There is no need to allocate any memory here: the key and value should already have been filled in at this point, and they are passed to the kernel through the bpf() syscall to tell it what entry to update, and with what value. Same thing for the flags.
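To show what the libbpf wrapper essentially boils down to, here is a simplified sketch (no error handling; the function name is made up) of updating a map entry from user space directly through the bpf() syscall:

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int demo_map_update(int map_fd, const void *key, const void *value,
                           __u64 flags)
{
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;                     /* the fd identifies the map   */
        attr.key    = (__u64)(unsigned long)key;  /* pointers are passed as u64  */
        attr.value  = (__u64)(unsigned long)value;
        attr.flags  = flags;                      /* BPF_ANY, BPF_NOEXIST, ...   */

        /* glibc has no dedicated wrapper for bpf(), so use the raw syscall. */
        return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}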
Once bpf() has been called, what happens on the kernel side is rather straightforward. Mostly, the kernel checks permissions, validates the arguments (to make sure they are safe and consistent with the map), then updates the actual data. For array maps, array_map_update_elem() (also used by the BPF helper on the kernel side, see the link above) is eventually called.
(*) Some interactions might actually be done without the bpf() system call; I believe that with “global data” stored in BPF maps, user applications mmap() the kernel memory. But this goes beyond the scope of basic usage of arrays and maps.

Ensure that UID/GID check in system call is executed in RCU-critical section

Task
I have a small kernel module I wrote for my Raspberry Pi 2 which implements an additional system call for generating power consumption metrics. I would like to modify the system call so that it only gets invoked if a special user (such as "root" or user "pi") issues it. Otherwise, the call just skips the bulk of its body and returns success.
Background Work
I've read into the issue at length, and I've found a similar question on SO, but there are numerous problems with it, from my perspective (noted below).
Question
The linked question notes that struct task_struct contains a pointer element to struct cred, as defined in linux/sched.h and linux/cred.h. The latter of the two headers doesn't exist on my system(s), and the former doesn't show any declaration of a pointer to a struct cred element. Does this make sense?
Silly mistake. This is present in its entirety in the kernel headers (i.e. /usr/src/linux-headers-$(uname -r)/include/linux/cred.h); I was searching the gcc-build headers in /usr/include/linux.
Even if the above worked, it doesn't mention whether I would be getting the real, effective, or saved UID for the process. Is it even possible to get each of these three values from within the system call?
cred.h already contains all of these.
Is there a safe way in the kernel module to quickly determine which groups the user belongs to without parsing /etc/group?
cred.h already contains all of these.
Update
So, the only valid question remaining is the following:
Note, that iterating through processes and reading process's
credentials should be done under RCU-critical section.
... how do I ensure my check is run in this critical section? Are there any working examples of how to accomplish this? I've found some existing kernel documentation that instructs readers to wrap the relevant code with rcu_read_lock() and rcu_read_unlock(). Do I just need to wrap any read operations against the struct cred and/or struct task_struct data structures?
First, adding a new system call is rarely the right way to do things. It's best to do things via the existing mechanisms because you'll benefit from already-existing tools on both sides: existing utility functions in the kernel, existing libc and high-level language support in userland. Files are a central concept in Linux (like other Unix systems) and most data is exchanged via files, either device files or special filesystems such as proc and sysfs.
I would like to modify the system call so that it only gets invoked if a special user (such as "root" or user "pi") issues it.
You can't do this in the kernel. Not only is it wrong from a design point of view, but it isn't even possible. The kernel knows nothing about user names. The only knowledge about users in the kernel is that some privileged actions are reserved to user 0 in the root namespace (don't forget that last part! And if that's new to you, it's a sign that you shouldn't be doing advanced things like adding system calls). (Many actions actually look for a capability rather than being root.)
What you want to use is sysfs. Read the kernel documentation and look for non-ancient online tutorials or existing kernel code (code that uses sysfs is typically pretty clean nowadays). With sysfs, you expose information through files under /sys. Access control is up to userland — have a sane default in the kernel and do things like calling chgrp, chmod or setfacl in the boot scripts. That's one of the many wheels that you don't need to reinvent on the user side when using the existing mechanisms.
The sysfs show method automatically takes a lock around the file, so only one kernel thread can be executing it at a time. That's one of the many wheels that you don't need to reinvent on the kernel side when using the existing mechanisms.
The linked question concerns a fundamentally different issue. To quote:
Please note that the uid that I want to get is NOT of the current process.
Clearly, a thread which is not the currently executing thread can in principle exit at any point or change credentials. Measures need to be taken to ensure the stability of whatever we are fiddling with. RCU is often the right answer. The answer provided there is somewhat wrong in the sense that there are other ways as well.
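For that non-current case, a minimal sketch (the helper name is made up) of reading another task's effective UID inside an RCU read-side critical section:

#include <linux/cred.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Hypothetical helper: read the effective uid of an arbitrary task. */
static kuid_t demo_task_euid(struct task_struct *task)
{
        kuid_t euid;

        rcu_read_lock();
        euid = __task_cred(task)->euid;  /* __task_cred() must be used under RCU */
        rcu_read_unlock();

        return euid;
}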
Meanwhile, if you want to operate on the thread executing the very code, you know it won't exit (because it is executing your code as opposed to an exit path). A question arises about the stability of credentials: good news, they are also guaranteed to be there and can be accessed with no preparation whatsoever. This can be easily verified by checking the code that does credential switching.
We are left with the question of which primitives can be used to do the access. To that end, one can use make_kuid, uid_eq and similar primitives.
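For the case in this question (the task executing the syscall), a hedged sketch of the check, assuming the "special user" simply means UID 0 in the initial user namespace:

#include <linux/cred.h>
#include <linux/uidgid.h>
#include <linux/user_namespace.h>

static bool caller_is_privileged(void)
{
        const struct cred *cred = current_cred();  /* no RCU needed for current */

        /* Compare the caller's effective uid against uid 0. */
        return uid_eq(cred->euid, make_kuid(&init_user_ns, 0));
}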
The real question is why this is a syscall as opposed to just a /proc file.
See this blog post for a somewhat elaborated description of credential handling: http://codingtragedy.blogspot.com/2015/04/weird-stuff-thread-credentials-in-linux.html

Access block level storage via kernel

How to access block level storage via the kernel (w/o using scsi libraries)?
My intent is to implement a block-level storage protocol over the network for learning purposes, working almost the same way SCSI does. Requests will be generated by the initiator and sent to the target (both userspace programs), which makes calls into a kernel module and returns the data to the initiator over TCP.
So far, I have managed to build a simple "Hello" module and run it (I am new to kernel programming), but I am unable to proceed with block access.
After searching a lot, I found struct buffer_head *bread(int dev, int block) in linux/fs.h, but the compiler throws an error:
error: implicit declaration of function ‘bread’
Please help, and also feel free to advise on getting started with kernel programming.
Thank you!
bread was used in old kernels. I am now looking into struct request *blk_get_request(struct request_queue *, int, gfp_t); in linux/blkdev.h.
Accessing the block device has to be accomplished via the kernel.
Not a kernel developer, but a few comments:
The implicit declaration error means that the definition you've found somehow isn't in scope when you call the function. Maybe it's hidden in an #ifdef or maybe you forgot to include linux/fs.h somehow.
As far as advice on Linux kernel programming goes, you might want to check out kernelnewbies.org.
There have been various books written on kernel programming, but be aware that the details in the kernel change very rapidly. Most of the concepts in the older books will still be valid, but at least some of the details in some areas will have changed.
Finally, you might have to brave the linux kernel mailing list. It's rather intimidating, I'm sorry to say, so try to have your questions well thought out before you post them.
A block-level storage protocol is itself a fair bit of work. Perhaps you want to get the protocol in place in user space first, with the target doing direct access to, e.g., /dev/sdc, before diving into the kernel.
As I read your question more closely, it appears your main interest is in the storage protocol aspect of this project. If so, why do you need to modify the kernel? If you have a locally attached disk, say /dev/sdX on the target, then you can do something like this from user space:
int fd = open("/dev/sdX", O_RDWR);   /* the target's locally attached disk */
pwrite(fd, buf, len, offset);        /* write len bytes at byte offset */
pread(fd, buf, len, offset);         /* read len bytes back from byte offset */
So, unless you're specifically interested in playing around inside the kernel, I don't think you need to do any kernel module to do a basic storage protocol between user processes.

How to access and change a kernel-space variable from user space

Hi,
I have posted this query previously and I am repeating it: I want to modify IGMPv3 (Linux),
which is built into kernel 2.6.--, such that it reads a value from a file and appropriately decides the reserved (res 1) value inside the IGMPv3 packet which is sent by a host.
I want to add to the above question by saying that this is really a more generic question
of changing a variable of kernel space from user space.
Thanks in advance for your help.
Regards,
Bhavin
From the perspective of a user land program, you should think of the driver as a "black box" with well defined interfaces instead of code with variables you can change. Using this mental model, there are four ways (i.e. interfaces) to communicate control information to the driver that you should consider:
Command line options. You can pass parameters to a kernel module which are then available to it during initialization.
IOCTLs. This is the traditional way of passing control information to a driver, but this mechanism is a little more cumbersome to use than sysfs.
proc, the process information pseudo-filesystem. proc creates files in the /proc directory which userland programs can read and sometimes write. In the past, this interface was also appropriated to communicate with drivers. Although proc looks similar to sysfs, newer drivers (Linux 2.6) should use sysfs instead, as the intent of proc is to report on the status of processes.
sysfs is a pseudo-file system used to export information about drivers and devices. See the documentation in the kernel (Documentation/filesystems/sysfs.txt) for more details and code samples. For your particular case, pay attention to the "store" method.
Depending on when you need to communicate with the driver (i.e. initialization or run time), you should add either a new command line option or a new sysfs entry to change how the driver treats the value of reserved fields in the packet.
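As a hedged illustration of the sysfs option (every name below is made up, and hooking the value into the actual IGMPv3 code is left out), a small module exposing one writable value under /sys/kernel/ could look like this:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/module.h>
#include <linux/sysfs.h>

static int reserved_value;               /* the value the driver would consult */
static struct kobject *demo_kobj;

static ssize_t reserved_value_show(struct kobject *kobj,
                                   struct kobj_attribute *attr, char *buf)
{
        return sprintf(buf, "%d\n", reserved_value);
}

static ssize_t reserved_value_store(struct kobject *kobj,
                                    struct kobj_attribute *attr,
                                    const char *buf, size_t count)
{
        if (kstrtoint(buf, 10, &reserved_value))
                return -EINVAL;
        return count;
}

static struct kobj_attribute reserved_attr =
        __ATTR(reserved_value, 0664, reserved_value_show, reserved_value_store);

static int __init demo_init(void)
{
        int ret;

        /* Creates /sys/kernel/igmp_demo/reserved_value */
        demo_kobj = kobject_create_and_add("igmp_demo", kernel_kobj);
        if (!demo_kobj)
                return -ENOMEM;

        ret = sysfs_create_file(demo_kobj, &reserved_attr.attr);
        if (ret)
                kobject_put(demo_kobj);
        return ret;
}

static void __exit demo_exit(void)
{
        kobject_put(demo_kobj);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

Writing to the file from user space (e.g. echo 1 > /sys/kernel/igmp_demo/reserved_value) then updates the kernel-side variable.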
With regard to filp_open, the function's comment is
/**
* This is the helper to open a file from kernelspace if you really
* have to. But in generally you should not do this, so please move
* along, nothing to see here..
*/
meaning there are better ways than this to do what you want. Also see this SO question for more information on why drivers generally should not open files.
You normally can't. Only structures exposed in /proc and /sys or via a module parameter can be modified from userspace.
