Call a userspace function from within a Linux kernel module

I'm programming a simple Linux character device driver to output data to a piece of hardware via I/O ports. I have a function which performs floating point operations to calculate the correct output for the hardware; unfortunately this means I need to keep this function in userspace since the Linux kernel doesn't handle floating point operations very nicely.
Here's a pseudo representation of the setup (note that this code doesn't do anything specific, it just shows the relative layout of my code):
Userspace function:
char calculate_output(char x)
{
    double y = 2.5 * x;
    double z = sqrt(y);
    char output = 0xA3;

    if (z > 35.67) {
        output = 0xC0;
    }
    return output;
}
Kernelspace code:
unsigned i;

for (i = 0; i < 300; i++) {
    if (inb(INPUT_PORT) & NEED_DATA) {
        char seed = inb(SEED_PORT);
        char output = calculate_output(seed);
        outb(output, OUTPUT_PORT);
    }
    /* do some random stuff here */
}
I thought about using ioctl to pass in the data from the userspace function, but I'm not sure how to handle the fact that the function call is in a loop and more code executes before the next call to calculate_output occurs.
The way I envision this working is:
1. The main userspace program starts the kernelspace code (perhaps via ioctl).
2. The userspace program blocks and waits for the kernelspace code.
3. The kernelspace program asks the userspace program for output data, and blocks to wait.
4. The userspace program unblocks, calculates and sends the data (ioctl?), then blocks again.
5. The kernelspace program unblocks and continues.
6. The kernelspace program finishes and notifies the userspace program.
7. The userspace program unblocks and continues to its next task.
So how do I have the communication between kernelspace and userspace, and also have blocking so that I don't have the userspace continually polling a device file to see if it needs to send data?
A caveat: while fixed point arithmetic would work quite well in my example code, it is not an option in the real code; I require the large range that floating point provides and -- even if not -- I'm afraid rewriting the code to use fixed point arithmetic would obfuscate the algorithm for future maintainers.

I think the simplest solution would be to create a character device in your kernel driver, with your own file operations for a virtual file. Then userspace can open this device O_RDWR. You have to implement two main file operations:
read -- this is how the kernel passes data back up to userspace. This function is run in the context of the userspace thread calling the read() system call, and in your case it should block until the kernel has another seed value that it needs to know the output for.
write -- this is how userspace passes data into the kernel. In your case, the kernel would just take the response to the previous read and pass it on to the hardware.
Then you end up with a simple loop in userspace:
while (1) {
    read(fd, buf, sizeof buf);        /* blocks until the kernel has a seed */
    calculate_output(buf, output);    /* the floating-point work stays in userspace */
    write(fd, output, sizeof output); /* hand the result back to the driver */
}
and no loop at all in the kernel -- everything runs in the context of the userspace process that is driving things, and the kernel driver is just responsible for moving the data to/from the hardware.
Depending on what your "do some random stuff here" on the kernel side is, it might not be possible to do it quite so simply. If you really need the kernel loop, then you need to create a kernel thread to run that loop, and then have some variables along the lines of input_data, input_ready, output_data and output_ready, along with a couple of waitqueues and whatever locking you need.
When the kernel thread reads a seed from the hardware, it puts the value in input_data, sets the input_ready flag and signals the input waitqueue, then does wait_event(<output_ready is set>). The read file operation would do a wait_event(<input_ready is set>) and return the data to userspace when it becomes ready. Similarly, the write file operation would put the data it gets from userspace into output_data, set output_ready and signal the output waitqueue.
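A minimal sketch of the file-operation side of that handshake, using the hypothetical variables above (a one-byte protocol, with locking and most error handling omitted):

#include <linux/fs.h>
#include <linux/wait.h>
#include <linux/uaccess.h>

static DECLARE_WAIT_QUEUE_HEAD(input_wq);
static DECLARE_WAIT_QUEUE_HEAD(output_wq);
static char input_data, output_data;
static int input_ready, output_ready;

/* read(): block until the kernel thread has posted a seed */
static ssize_t calc_read(struct file *f, char __user *buf,
                         size_t len, loff_t *off)
{
    if (wait_event_interruptible(input_wq, input_ready))
        return -ERESTARTSYS;
    input_ready = 0;
    if (copy_to_user(buf, &input_data, 1))
        return -EFAULT;
    return 1;
}

/* write(): take userspace's answer and wake the kernel thread */
static ssize_t calc_write(struct file *f, const char __user *buf,
                          size_t len, loff_t *off)
{
    if (copy_from_user(&output_data, buf, 1))
        return -EFAULT;
    output_ready = 1;
    wake_up(&output_wq);
    return 1;
}

The kernel thread mirrors this: after reading a seed it stores it in input_data, sets input_ready and calls wake_up(&input_wq), then sleeps in wait_event(output_wq, output_ready) until the write handler wakes it to send output_data to the hardware.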
Another (uglier, less portable) way is to use something like ioperm, iopl or /dev/port to do everything completely in userspace, including the low-level hardware access.

I would suggest that you move the code that does all the "heavy lifting" to user mode - that is, calculate all 300 values in one go, and pass those to the kernel.
I'm not even sure you can let an arbitrary piece of code call back into user mode from the kernel. I'm sure something like it is possible, because that is essentially what, for example, "signal" does, but I'm far from convinced you can do it "any way you like" (and almost certainly there are restrictions on what you can do in such a function). It certainly doesn't seem like a great idea, and it would DEFINITELY be quite slow to call back to usermode many times.

Related

Buffering expectations using `printf`

Say there exists a C program that executes in some Linux process. Upon start, the C program calls setvbuf to disable buffering on stdout. The program then alternates between two "logical" calls ("logical" in this sense to avoid consideration of the compiler possibly reordering instructions) - the first to printf() and the second incrementing a variable.
int main(int argc, char **argv)
{
    setvbuf(stdout, NULL, _IONBF, 0);
    unsigned int a = 0;

    for (;;) {
        printf("hello world!");
        a++;
    }
}
At some point assume the program receives a signal, e.g. via kill, that causes the program to terminate. Will the contents of stdout always be complete after the signal is received, in the sense that they include the result of all previous invocations to printf(), or is this dependent on other levels of buffering/other behavior not controllable via setvbuf (e.g. kernel buffering)?
The broader context of this question is, if using a synchronous logging mechanism in a C application (e.g. all threads log with printf()), can the log be trusted to be "complete" for all calls that have returned from printf() upon receiving some application-terminating signal?
Edit: I've edited the code snippet and question to remove undefined behavior for clarity.
Any sane interpretation of the expression "unbuffered stream" means that the data has left the stream object when printf returns. In the case of file-descriptor backed streams, that means the data has entered kernel-space, and the kernel should continue sending the data to its final destination (assuming no kernel panic, power loss etc).
But a problem with segfaults is that they may not happen when you think they do. Take for instance the following code:
int *p = NULL;
printf("hello world\n");
*p = 1;
A dumb non-optimizing compiler may create code that segfaults at *p = 1;. But that is not the only possibility according to the C standard. A compiler may, for instance, if it can prove that printf doesn't depend on the contents of *p, reorganize the code like this:
int *p = NULL;
*p = 1;
printf("hello world\n");
In that case printf would never be called.
Another possibility is that, since p == NULL makes *p = 1 invalid, the compiler may scrap that expression altogether.
EDIT: The poster has changed the question from segfaulting to being killed. In that case, it should all depend on whether the kernel closes open file descriptors on exit the same way close does, or not.
Given a construct like:
fprintf(file1, "whatever");
fflush(file1);
file2 = fopen(someExistingFile, "w");
there are some circumstances where it may be essential that fopen not overwrite the existing file unless or until the write to file1 is guaranteed successful, and others where waiting for the fflush to be confirmed successful before starting the fopen would needlessly degrade performance. The Standard leaves such trade-offs to the designers of C implementations, both so they can weigh these considerations however they see fit and to avoid requiring semantic guarantees beyond those offered by the underlying OS. For example, if an OS reports that an fflush() is complete before the data has been written to disk, and offers no way of finding out when all pending writes have completed, there is no way the Standard could usefully require that an implementation targeting that OS forbid fflush from returning while the write could still fail.
So, it appears there's a basic misunderstanding in your question, and I think it's important to go through the basics of what printf is. If your stdout buffer size is 0, then the answer to "will all data be sent out of the buffer" is always yes, since in theory there is no buffer left to hold the data. That said, somewhere in your computer's hardware there is something like a UART chip with a small buffer of its own for transferring data; most programs I've seen do not use this hardware buffer, so it's not surprising that your program doesn't either.
However, printf has an upper-layer buffer (about 150 characters in my application), and I'm assuming this is the buffer you're asking about. Note that this is not the same thing as the stdout buffer; it's just an allocated piece of memory that stores messages before they're sent to wherever you want them to go. Think about it: if there were no printf-specific buffer, you would only be able to send one character per function call.
Now it really depends on whether the implementation of printf on your system is blocking or non-blocking. If it's non-blocking, the data could be transferred by an interrupt or a DMA, probably a combination of both, in which case it depends on whether your system stops these transfer mechanisms in the middle of a transfer or allows them to complete. It's impossible for me to say based on the information you've given.
However, in my experience, printf is usually a blocking function; that is, it holds up the rest of your code while it transfers data out of the buffer and moves on to the next statement only once the transfer has completed. In that case, if you have stopped the code from running (again, I'm not certain of the specifics of "kill" on your system), then you have also stopped the transfer.
Your system most likely has blocking printf calls, and it's safe to assume that the signal you mention does not internally stop your printf from completing, so the full message will probably be sent before exiting, even if the signal arrives mid-printf. That's the best answer I can give from a C standpoint; for a more definitive answer you would have to share the implementation of printf on your operating system and/or more specifics on how this kill signal works.

How to implement a system call that can check whether it executed successfully, without going to the kernel log?

I have just stepped into the kernel world and would like to add some system calls. My goal is to add a system call that lets me check if it executed (without looking at the kernel log). However, I have been thinking for a long time, but have not yet figured out how to implement it. Could anyone please give me some advice? Or some pseudocodes? Thanks in advance.
My thinking is that we could implement a new system call that writes something into a buffer. Then another system call reads the content of the buffer to check whether the previous system call wrote to it (somewhat like pthread_create and pthread_join). Hence, my implementation consists of two system calls in total.
Here is a sketch of my thinking written in pseudocode:
syscall_2(...) {
    if (syscall_1 executed)
        return 0;
    if (syscall_1 did NOT execute)
        return -1;
}

syscall_1() {
    do something;
    create a buffer;
    write something into the buffer;
    return syscall_2(buffer); // checks what is in the buffer
}
My suggestion is that you have the system call itself accept a pointer to a userspace buffer that it overwrites with a specific piece of information.
You will have to learn how to access userspace memory, and more importantly how to verify that you were given a pointer to memory the process has mapped, and has write access to.
Then, once the system call completes, the program that called it can not only check the system call's return code but also examine the memory to see whether the system call wrote the correct thing to it.
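A sketch of what that might look like; the syscall name (hello), its number (SYS_hello) and the marker string are all hypothetical:

#include <linux/syscalls.h>   /* kernel side */
#include <linux/uaccess.h>

/* kernel side: write a known marker into the caller's buffer */
SYSCALL_DEFINE2(hello, char __user *, buf, size_t, len)
{
    static const char marker[] = "hello from the kernel";

    if (len < sizeof(marker))
        return -EINVAL;
    /* copy_to_user validates the userspace pointer and fails safely */
    if (copy_to_user(buf, marker, sizeof(marker)))
        return -EFAULT;
    return 0;
}

/* userspace side (separate program, needs <unistd.h>, <string.h>, <stdio.h>):
   check both the return code and the buffer contents */
char buf[64] = { 0 };
if (syscall(SYS_hello, buf, sizeof(buf)) == 0 &&
    strcmp(buf, "hello from the kernel") == 0)
    puts("the system call ran and wrote the buffer");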
Normally, system calls inform the caller of how their execution went, so I guess you are interested in knowing which system calls have been executed, and how many times.
From this perspective, I think the best approach is to implement a device that can be queried (by means of some ioctl call) to report statistics about the individual system calls you are interested in.
For example, you can count how many times a system call of type n was used over some interval: keep a per-call counter, read it at the start and at the end of the interval, and subtract the two values. In the same way you can calculate the average time a system call takes, by accumulating the elapsed time at the end of each call, or account for the amount of I/O each system call does, by counting the bytes transferred to and from user mode. You can implement all of this as ioctls on a device, and then you don't need to add a new system call for it.
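A sketch of that idea (all names hypothetical; a real version would hook the counter into whatever syscall paths you want to measure):

#include <linux/fs.h>
#include <linux/ioctl.h>
#include <linux/atomic.h>
#include <linux/uaccess.h>

static atomic64_t open_count;   /* incremented wherever you instrument the call */

#define STATS_GET_OPEN_COUNT _IOR('s', 1, unsigned long long)

static long stats_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
    unsigned long long v;

    switch (cmd) {
    case STATS_GET_OPEN_COUNT:
        v = atomic64_read(&open_count);
        if (copy_to_user((void __user *)arg, &v, sizeof(v)))
            return -EFAULT;
        return 0;
    }
    return -ENOTTY;
}

Userspace then issues the ioctl at the start and end of an interval and subtracts the two readings.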

Semantics of ALSA PCM calls

Hi, I am writing a program that has to capture from three input devices at the same time (in this case, three identical USB webcams).
First of all, ALSA is not based on the familiar UNIX paradigm "everything is a file" so I cannot use the regular poll(3) call; knowing that the data stream should be steady among all devices, for now I do something like:
while (!stop)
    for (i = 0; i < input_device_count; i++)
    {
        snd_pcm_readi(handle[i], buffer, frames);
        write(fd_out[i], buffer, size);
    }
This code iterates over each device and reads from it, writing the result to a previously opened file. It works, but I suspect there is a better way to do this, perhaps using mmap so I do not have to copy data from kernel space to user space and back to the kernel again.
For a moment, let's assume the three inputs stay in sync; the above code still does not guarantee that I will begin to record from the three devices at the same time. Is there a way to guarantee that? In fact, what are the semantics of calls like snd_pcm_prepare() and snd_pcm_start()? Right now I am not using those, I just go straight to snd_pcm_readi().
I have tried to search for code examples but I haven't found anything that has to do with multiple captures at the same time. Any hint would be appreciated!
ALSA is based on the familiar UNIX paradigm "everything is a file"; so to handle multiple devices, you should use poll(3).
There are ALSA plugins that are implemented on top of multiple files, so there might be more than one handle per PCM device.
Call snd_pcm_poll_descriptors_count for each device to know how many pollfd structures you need, then call snd_pcm_poll_descriptors to get the file handles and the respective event bits.
In your loop, after calling poll, you must not read the values in the pollfd structures directly but call snd_pcm_poll_descriptors_revents to translate them back.
To ensure that multiple devices start at the same time, call snd_pcm_link.
However, this does not guarantee that the devices will run at the same speed.
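A sketch of the descriptor handling for a single handle (assuming handle, buffer and frames as in the question; error checking omitted):

#include <alsa/asoundlib.h>
#include <poll.h>
#include <stdlib.h>

int nfds = snd_pcm_poll_descriptors_count(handle);
struct pollfd *pfds = calloc(nfds, sizeof(*pfds));

snd_pcm_poll_descriptors(handle, pfds, nfds);
for (;;) {
    unsigned short revents;

    poll(pfds, nfds, -1);
    /* translate the raw poll results back into ALSA events */
    snd_pcm_poll_descriptors_revents(handle, pfds, nfds, &revents);
    if (revents & POLLIN)
        snd_pcm_readi(handle, buffer, frames);
}

For three devices, concatenate each handle's descriptors into one pollfd array, and link the handles beforehand (snd_pcm_link(handle[0], handle[1]), and so on) so that starting one starts them all.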

How does list I/O writev internally work?

The writev function takes an array of struct iovec as input argument
writev(int fd, const struct iovec *iov, int iovcnt);
The input is a list of memory buffers that need to be written to a file (say). What I want to know is:
Does writev internally do this:
for (each element in iov)
    write(element)
such that every element of iov is written to file in a separate I/O call? Or does writev write everything to file in a single I/O call?
Per the standards, the for loop you mentioned is not a valid implementation of writev, for several reasons:
The loop could fail to finish writing one iov before proceeding to the next, in the event of a short write - but this could be worked around by making the loop more elaborate.
The loop could have incorrect behavior with respect to atomicity for pipes: if the total write length is smaller than PIPE_BUF, the pipe write is required to be atomic, but the loop would break the atomicity requirement. This issue cannot be worked around except by moving all the iov entries into a single buffer before writing when the total length is at most PIPE_BUF.
The loop might have cases where it could result in blocking, where the single writev call would be required to perform a partial write without blocking. As far as I know, this issue would be impossible to work around in the general case.
Possibly other reasons I haven't thought of.
I'm not sure about point #3, but it definitely exists in the opposite direction, when reading. Calling read in a loop could block if a terminal has some data (shorter than the total iov length) available followed by an EOF indicator; calling readv should return immediately with a partial read in this case. However, due to a bug in Linux, readv on terminals is actually implemented as a read loop in kernelspace, and it does exhibit this blocking bug. I had to work around this bug in implementing musl's stdio:
http://git.etalabs.net/cgi-bin/gitweb.cgi?p=musl;a=commit;h=2cff36a84f268c09f4c9dc5a1340652c8e298dc0
To answer the last part of your question:
Or does writev write everything to file in a single I/O call?
In all cases, a conformant writev implementation will be a single syscall. Getting down to how it's implemented on Linux: for ordinary files and for most devices, the underlying file driver has methods that implement iov-style I/O directly, without any sort of internal loop. But the terminal driver on Linux is highly outdated and lacks the modern I/O methods, causing the kernel to fall back to a write/read loop for writev/readv when operating on a terminal.
The direct way to know how code works is to read the source code.
See http://www.oschina.net/code/explore/glibc-2.9/sysdeps/posix/writev.c
It simply alloca()s or malloc()s a buffer, copies all the vectors into it, and calls write() once.
That's how it works. Nothing mysterious.
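In outline, the emulation looks something like this (a simplified sketch of the approach, not the actual glibc source):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>

/* gather all the iovecs into one buffer so a single write() suffices */
ssize_t writev_emulated(int fd, const struct iovec *iov, int iovcnt)
{
    size_t total = 0, off = 0;

    for (int i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;

    char *buf = malloc(total);
    if (!buf)
        return -1;

    for (int i = 0; i < iovcnt; i++) {
        memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }

    ssize_t ret = write(fd, buf, total); /* one call, so pipe atomicity holds up to PIPE_BUF */
    free(buf);
    return ret;
}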
Or does writev write everything to file in a single I/O call?
I'm afraid not everything, though sys_writev tries its best to write everything in a single call. It depends on the VFS implementation: if the filesystem doesn't provide its own writev method, the kernel will call the VFS write() in a loop. It's better to check the return value of writev/readv to see how many bytes were written, just as you do with write().
You can find the code of writev in the kernel, in fs/read_write.c:do_readv_writev.

How does a system call translate to CPU instructions?

Let's say there is a simple program like:
#include <stdio.h>
#include <fcntl.h>

int main(void)
{
    int x;
    printf("Cool");
    int fd = open("/tmp/cool.txt", O_RDONLY);
}
The open is a system call here. I suppose when the shell runs it, it makes some hundred other system calls to implement it? How about a declaration like int x - at some point does it need some additional system calls in the background to get the memory from the computer?
I am not sure where the boundary between a system call and normal stuff lies... everything, in the end, needs the operating system's help, right?!
Or is it that the C compiler generates an executable (code) which can be run on the processor with no OS assistance needed until a system call is reached - at which point it has to do something to load the OS instructions etc ...
A bit vague :) Please clarify.
I'm not answering the questions in order, so I'm prefixing my answers with the questions. I've taken the liberty of editing them a bit. You didn't specify the processor architecture, but I'm assuming you want to know about x86, so the processor-level details will pertain to x86. Other architectures can behave differently (memory management, how system calls are made, etc.). I'm also using Linux for examples.
Does the c compiler generate executable code that can be run straight on the processor without need for OS assistance until a system call is reached, at which point it has to do something to load the OS instructions?
Yes, that is correct. The compiler generates native machine code that can be run straight on the processor. The executable files that you get from the compiler, however, contain both the code and other needed data, for example, instructions on where to load the code in the memory. On Linux the ELF format is typically used for executables.
If the process is completely loaded into memory and has sufficient stack space, it will not need further OS assistance before it wants to make a system call. When you make a system call, it is just an instruction in the machine code that calls the OS. The program itself does not need to "load the OS instructions" in any way. The processor handles transferring execution to the OS code.
With Linux on the x86 architecture, one way for the machine code to make a system call is to use the software interrupt vector 128 to transfer execution to the operating system. In x86 assembly (Intel syntax), that is expressed as int 0x80. Linux will then perform tasks based on the values that the calling program placed into processor registers before making the system call: the system call number is found in the eax processor register and the system call parameters are found in other processor registers. After the OS is done, it will return a result in the eax register, and has possibly modified buffers pointed to by the system call parameters etc. Note however, that this is not the only way to make a system call.
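As a concrete illustration, here is a minimal 32-bit x86 Linux sketch (compile with gcc -m32) that invokes the write system call, number 4 on that architecture, through int 0x80 using GCC inline assembly:

static const char msg[] = "hi\n";

int main(void)
{
    long ret;

    __asm__ volatile ("int $0x80"
                      : "=a" (ret)            /* return value comes back in eax */
                      : "a" (4),              /* eax = 4: __NR_write on 32-bit x86 */
                        "b" (1),              /* ebx = file descriptor 1 (stdout) */
                        "c" (msg),            /* ecx = buffer address */
                        "d" (sizeof(msg) - 1) /* edx = byte count */
                      : "memory");
    return 0;
}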
However, if the process is not entirely in memory, and execution moves to a part of the code that is not in memory at the moment, the processor causes a page fault, which moves execution to the operating system, which then loads the required part of the process into memory and transfers execution back to the process, which can then continue execution normally, without even noticing that anything happened.
I'm not entirely sure on the next point, so take it with a grain of salt. The Wikipedia article on stack overflow (the computer error, not this site :) seems to indicate that stacks are usually of fixed size, so int x; should not cause the OS to run, unless that part of the stack is not in the memory (see previous paragraph). If you had a system with dynamic stack size (if it is even possible, but as far as I can see, it is), int x; could also cause a page fault when the stack space is used up, prompting the operating system to allocate more stack space for the process.
Page faults cause the execution to move to the operating system, but are not system calls in the usual sense of the word. System calls are explicit calls to the OS when you want it to perform some work for you. Page faults and other such events are implicit. Hardware interrupts continuously transfer the execution from your process to the OS so that it can react to them. After that it transfers the execution back to your process, or some other process.
On a multitasking OS, you can run many programs at once even if you have only one processor/core. This is accomplished by running only one program at a time, but switching between programs quickly. The hardware timer interrupt makes sure that control is transferred back to the OS in a timely fashion, so that one process can't hog the CPU all for itself. When control is passed to the OS and it has done what it needs to, it may always start a different process from the one that was interrupted. The OS handles all this totally transparently, so you don't have to think about it, and your process won't notice it. From the viewpoint of your process, it is executing continuously.
In short: Your program executes system calls only when you explicitly ask it to. The operating system may also swap parts of your process in and out of memory when it wants to, and generally does things related and unrelated to your process in the background, but you don't normally need to think about that at all. (You can reduce the number of page faults, though, by keeping your program as small as possible, and things like that.)
In this case open() is an explicit system call, but I suppose when the shell runs it, it makes some hundred other system calls to implement it.
No, the shell has got nothing to do with an open() call in your c program. Your program makes that one system call, and shell doesn't come into the picture at all.
The shell will only affect your program when it starts it. When you start your program with the shell, the shell does a fork system call to fork off a second process, which then does an execve system call to replace itself with your program. After that, your program is in control. Before control gets to your main() function, though, it executes some initialization code that was put there by the compiler. If you want to see what system calls a process makes, on Linux you can use strace to view them. Just say strace ls, for example, to see what system calls ls makes during its execution. If you compile a C program with just a main() function that returns immediately, you can see with strace what system calls the initialization code makes.
How does the process get its memory from the computer etc.? It has to involve some system calls again right? I am not sure what is the boundary between a system call and normal stuff. Everything in the end needs the OS help, right?
Yep, system calls. When your program is loaded into memory with the execve system call, it takes care of getting enough memory for your process. When you need more memory and call malloc(), it will make a brk system call to grow the data segment of your process if it has run out of internally cached memory to give you.
Not everything needs explicit help from the OS. If you have enough memory, have all your input in memory, and you write your output data to memory, you won't need the OS at all. That is, as long as you only do calculations on data you already have in memory, don't need more memory, and don't need to communicate with the outside world, you don't need the OS. On the other hand, a program that does not communicate with the outside world at all is a pretty useless one, because it can't get any input, and cannot give any output. Even if you calculate the millionth decimal of pi, it doesn't matter at all if you don't output it to the user.
This answer got quite big, so in case I missed something or didn't explain something clearly enough, please leave me a comment and I'll try to elaborate. If anyone spots any mistakes, be sure to point them out also.
