How is TCP (kernel) bypass implemented?

Assuming I would like to avoid the overhead of the Linux kernel in handling incoming packets and instead would like to grab the packet directly from user space: I have googled around a bit, and it seems that all that needs to happen is to use raw sockets with some socket options. Is this the case, or is it more involved than that? If so, what can I google for or reference in order to implement something like this?

There are many techniques for networking with kernel bypass.
First, if you are sending messages to another process on the same machine, you can do so through a shared memory region with no jumps into the kernel.
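As a rough sketch of that shared-memory approach (the segment name "/demo_ring" and the size are made up for the example; on older glibc you need to link with -lrt):

/* Minimal sketch: two processes sharing a buffer via POSIX shared
 * memory. After the initial setup syscalls, reads and writes to the
 * mapped region involve no kernel transitions at all. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = shm_open("/demo_ring", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* From here on, plain memory stores are visible to any other
     * process that mapped "/demo_ring", with no syscall per message.
     * A real design would layer a lock-free ring buffer on top. */
    strcpy(buf, "hello via shared memory");

    munmap(buf, 4096);
    close(fd);
    return 0;
}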
Passing packets over a network without involving the kernel gets more interesting, and involves specialized hardware that gets direct access to user memory. This idea is called RDMA (Remote Direct Memory Access).
Here's one way it can work (this is what InfiniBand hardware does). The application registers a memory buffer with the RDMA hardware. This buffer is pinned in physical memory, since letting it be swapped out would obviously be bad (the hardware would keep writing to the old physical memory region). A control region is also mapped into userspace memory. When an application is ready to use the buffer to send or receive a message, it writes a command to the control region. The hardware takes the data from a registered buffer on one end and places it into another registered buffer at the other end.
Clearly, this is too low level, so there are abstractions that make programming RDMA hardware easier. OFED verbs are one such abstraction.
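For a flavor of what the registration step described above looks like with verbs, here is a minimal sketch (assuming libibverbs is installed and an RDMA device is present; the choice of first device and the buffer size are arbitrary, and error handling is abbreviated):

/* Register (and thereby pin) a buffer that the RDMA hardware may
 * read and write. Link with -libverbs. A real program also needs a
 * completion queue, a queue pair, and connection setup before any
 * data actually moves. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    /* The local/remote keys are what peers use to address this buffer. */
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}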
The InfiniBand software stack has one extra interesting bit: the Sockets Direct Protocol (SDP) that is used for compatibility with existing applications. It works by inserting an LD_PRELOAD shim that translates standard socket API calls into IB verbs.
InfiniBand is just what I'm most familiar with. RoCE/iWARP hardware is very similar from the programmer's perspective, but uses a different transport than InfiniBand (TCP via an offload engine in iWARP, Ethernet in RoCE). There are/were also other approaches to RDMA (Quadrics, for example).

Related

What does dev_net_set do in Linux?

I am writing a simple net device driver based on the loopback driver and want to register my net_device structure. This and that page on writing a net device say to just call register_netdev. But they're writing fancy drivers with PCI express and other complicated things.
So, if I just want something like the loopback driver, I should presumably base my code on loopback.c. My question is, what does the first line of this code in loopback_net_init do:
dev_net_set(dev, net);
err = register_netdev(dev);
Apparently net is determined by this code in net_namespace.c:
register_pernet_device(ops) ...
__register_pernet_operations(list, ops)
for_each_net(net) ...
What is this looping for? What might go wrong if I skip the dev_net_set call? Why are others not using it?
AFAIK, net is a structure that will allow the kernel to interact with the device. You need it to register the device and remove it in the module cleanup function. Please review the code under linux/net/8021q/ for examples.
AFAIK, looping happens at the level of sockets (layers 5-7), whereas net_dev is used as the kernel component that immediately interacts with the driver when you actually want to use, say, an Ethernet card, or SLIP/PLIP for transmitting frames (layers 2-0). Loopback happens at the level of the network subsystem of the kernel and lies well above the drivers which interact with the hardware. So I don't see why you would need a driver to use the loopback feature. However, there is also a provision for registering a dummy device with net_dev, though I don't know if that is what you are looking for.
That said, if your intention is simply to use some driver that simulates an actual physical device without one and, say, reflects the packets that it receives, that is possible too. Basically, up to the net_dev layer, the kernel does all the protocol work (TCP/IP) and finally passes the packet off to some handler that the device driver registers with net_dev or something similar. Similarly, on receiving data, the device triggers an interrupt, the driver does a DMA operation, and the kernel takes over from there. Hence, instead of the code for doing the DMA operation, you can write a module that simply passes along a static packet that is compatible with Ethernet/TCP/IP. In the vast majority of cases, all these aspects (the network and other subsystems) are agnostic to the underlying bus details, i.e. it shouldn't matter whether the Ethernet card is connected to PCI or ISA, but there can be exceptions. Thus, IMHO, you are trying to do something that should only be attempted after gaining a thorough understanding of the network subsystem and a good enough understanding of the kernel as a whole. Until then you will be shooting in the dark: sometimes you may hit, but often you will miss.
http://man7.org/linux/man-pages/man8/ip-netns.8.html
A network namespace is logically another copy of the network stack,
with its own routes, firewall rules, and network devices.
So for_each_net is looping over these namespaces and creating a copy of all "per net" network devices in each one.
Use ip netns list to determine whether you are using network namespaces. Often they are not used, so drivers do not necessarily need to use dev_net_set.
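For reference, here is a minimal skeleton of what registering such a device might look like on a reasonably recent kernel (a hedged sketch: the alloc_netdev() signature has changed across kernel versions, and the device name "demo%d" and the drop-everything transmit handler are placeholders):

/* Hypothetical skeleton: register a do-nothing Ethernet-style device
 * in the initial network namespace, which is exactly what the
 * dev_net_set(dev, &init_net) call pins down explicitly. */
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

static struct net_device *demo_dev;

static netdev_tx_t demo_xmit(struct sk_buff *skb, struct net_device *dev)
{
    dev_kfree_skb(skb);          /* just drop: this is only a skeleton */
    return NETDEV_TX_OK;
}

static const struct net_device_ops demo_ops = {
    .ndo_start_xmit = demo_xmit,
};

static void demo_setup(struct net_device *dev)
{
    ether_setup(dev);            /* generic Ethernet defaults */
    dev->netdev_ops = &demo_ops;
}

static int __init demo_init(void)
{
    int err;

    demo_dev = alloc_netdev(0, "demo%d", NET_NAME_UNKNOWN, demo_setup);
    if (!demo_dev)
        return -ENOMEM;
    dev_net_set(demo_dev, &init_net);   /* the line in question */
    err = register_netdev(demo_dev);
    if (err)
        free_netdev(demo_dev);
    return err;
}

static void __exit demo_exit(void)
{
    unregister_netdev(demo_dev);
    free_netdev(demo_dev);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");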

Using sock_create, accept, bind etc in kernel

I'm trying to implement an echo TCP server as a loadable kernel module.
Should I use sock_create, or sock_create_kern?
Should I use accept, or kernel_accept?
I mean it does make sense that I should use kernel_accept for example; but I don't know why. Can't I use normal sockets in the kernel?
The problem is, you are trying to shoehorn a user-space application into the kernel.
Sockets (and files and so on) are things the kernel provides to userspace applications via the kernel-userspace API/ABI. Some, but not all, also have in-kernel callable counterparts, for cases when some other kernel thingy wishes to use something normally provided to userspace.
Let's look at the Linux kernel implementation of the socket() or accept() syscalls, in net/socket.c in the kernel sources; look for SYSCALL_DEFINE3(socket, ..., SYSCALL_DEFINE3(accept, ..., SYSCALL_DEFINE4(recv, ..., and so on.
(I recommend you use e.g. Elixir Cross Referencer to find specific identifiers in the Linux kernel sources, then look up the actual code in one of the official kernel Git trees online; that's what I do, anyway.)
Note how pointer arguments have a __user qualifier: this means the data pointed to must reside in user space, and that the functions will eventually use copy_from_user()/copy_to_user() to retrieve or set the data. Furthermore, the operations access the file descriptor table, which is part of the process context: something that normally only exists for userspace processes.
Essentially, this means your kernel module must create a userspace "process" (enough of one to satisfy the requirements of crossing the userspace-kernel boundary when using kernel interfaces) to "hold" the memory and file descriptors, at minimum. It is a lot of work, and in the end it won't be any more performant than a userspace application would be. (Linux kernel developers have worked on this for literally decades. There are some proprietary operating systems where doing stuff in "kernel space" may be faster, but that is not so in Linux. The cost of doing things in userspace is some context switches, and possibly some memory copies for the transferred data.)
In particular, the TCP/IP and UDP/IP interfaces (see e.g. net/ipv4/udp.c for UDP/IPv4) do not seem to have any interface for kernel-side buffers (other than directly accessing the rx/tx socket buffers, which are in kernel memory).
You have probably heard of TUX web server, a subsystem patch to the Linux kernel by Ingo Molnár. Even that is not a "kernel module server", but more like a subsystem that an userspace process can use to implement a server that runs mostly in kernel space.
The idea of a kernel module that provides a TCP/IP and/or UDP/IP server is simply like trying to use a hammer to drive in screws. It will work, after a fashion, but the results won't be pretty.
However, for the particular case of an echo server, it just might be possible to bolt it on top of IPv4 (see net/ipv4/) and/or IPv6 (see net/ipv6/) similar to ICMP packets (net/ipv4/icmp.c, net/ipv6/icmp.c). I would consider this route if and only if you intend to specialize in kernel-side networking stuff, as otherwise everything you'd learn doing this is very specialized and not that useful in practice.
If you need to implement something kernel-side for an exercise or something, I'd recommend steering away from "application"-type ideas (services or similar).
Instead, I would warmly recommend developing a character device driver, possibly implementing some kind of inter-process communications layer, preferably bus-style (i.e., one sender, any number of recipients). Something like that has a number of actual real-world use cases (both hardware drivers, as well as stranger things like kdbus-type stuff), so anything you'd learn doing that would be real-world applicable.
(In fact, an echo character device -- which simply outputs whatever is written to it -- is an excellent first target. Although LDD3 is for Linux kernel 2.6.10, it should be an excellent read for anyone diving into Linux kernel development. If you use a more recent kernel, just remember that the example code might not compile as-is, and you might have to do some research wrt. Linux kernel Git repos and/or a kernel source cross referencer like Elixir above.)
In short, sockets are just a mechanism that enables two processes to talk, locally or remotely.
If you want to send some data from kernel to userspace, you have to use kernel sockets: sock_create_kern() with its family of functions.
What would be the benefit of a TCP echo server as a kernel module?
It makes sense only if your TCP server provides data which is otherwise not accessible from userspace, e.g. reading some post-mortem NVRAM that you can't read normally and sending it to rsyslog via a socket.
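If you do go the kernel-socket route, the listener setup might look roughly like this (a sketch assuming kernel ~4.2 or newer, where sock_create_kern() takes a struct net argument; the port number is arbitrary, and a real module would run the accept/echo loop in its own kthread):

/* In-kernel TCP listener setup, using the kernel_* socket helpers. */
#include <linux/net.h>
#include <linux/in.h>
#include <net/net_namespace.h>

static struct socket *listener;

static int echo_listen(void)
{
    struct sockaddr_in addr = {
        .sin_family      = AF_INET,
        .sin_addr.s_addr = htonl(INADDR_ANY),
        .sin_port        = htons(7777),      /* arbitrary port */
    };
    int err;

    err = sock_create_kern(&init_net, AF_INET, SOCK_STREAM,
                           IPPROTO_TCP, &listener);
    if (err)
        return err;

    err = kernel_bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    if (!err)
        err = kernel_listen(listener, 5);
    if (err)
        sock_release(listener);
    /* Then, per connection: kernel_accept(listener, &peer, 0), and
     * kernel_recvmsg()/kernel_sendmsg() to echo the data back. */
    return err;
}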

Linux device driver for a RS232 device in embedded system

I have recently started learning to write Linux device drivers for a specific project that I am working on. Previously, most of the work I have done has been with devices running no OS, so Linux drivers and development are somewhat new to me.
For the project I am working on, I have an embedded system running a Linux-based operating system. I have an external device which is controlled via RS232 that I need to write a driver for.
Questions:
1) Is there a way to access serial ports from within kernel space (and possibly use serial.h, serial_core.h, etc.)? How is this usually done? Any good examples?
2) From what I found, it seems much easier to access the serial ports in user space by just opening /dev/ttyS* and writing to it. When writing a driver for a device like this (an RS232 device), is it preferred to do it in user space, or is there a way to write a kernel module? How does one decide between writing a driver as a kernel module and writing it in user space, or vice versa?
Are drivers only for generic devices such as UART/serial, with everything above that in userspace, or should this driver be written as a kernel module? I appreciate the help; I have been unable to find much information to answer my questions.
There are a few cases where a module that communicates over a serial port belongs in the kernel. The pppd (Point-to-Point Protocol daemon) is one example: Linux has kernel code devoted to PPP, since it is a high-traffic use of serial and the resulting IP packets need to end up in kernel space anyway.
Most other uses would work better from user space since you have a good API that already takes care of a lot of the errors that can happen. This also lessens the chance that your errors will result in massive system failure.
Doing things like this from user space does add some latency. Reads and writes are buffered, it's often difficult to tell how far into the written data the hardware has actually gotten, and canceling a write call that has already succeeded isn't really doable from user space, even if the hardware hasn't yet received the bytes.
I would suggest attempting to do it from user space first and then moving to an OS-level driver if necessary. Even if it is necessary to move this into an OS-level driver, you'll likely be able to make some progress from user space.
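To illustrate the user-space starting point, here is a sketch of opening and configuring a serial port with termios (the device path /dev/ttyS0, the 9600 8N1 settings, and the "AT\r" probe are just example choices):

/* Open a serial port in raw mode and exchange a few bytes. */
#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY);
    if (fd < 0) { perror("open"); return 1; }

    struct termios tio;
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);              /* raw bytes: no line editing, no echo */
    cfsetispeed(&tio, B9600);
    cfsetospeed(&tio, B9600);
    tio.c_cflag |= CLOCAL | CREAD;
    tcsetattr(fd, TCSANOW, &tio);

    write(fd, "AT\r", 3);         /* example command to the device */
    char buf[64];
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0)
        printf("got %zd bytes\n", n);

    close(fd);
    return 0;
}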

Kernel bypass for UDP and TCP on Linux: what does it involve?

Per http://www.solacesystems.com/blog/kernel-bypass-revving-up-linux-networking:
[...] a network driver called OpenOnload that uses “kernel bypass” techniques to run the application and network driver together in user space and, well, bypass the kernel. This allows the application side of the connection to process many more messages per second with lower and more consistent latency.
[...]
If you’re a developer or architect who has fought with context switching for years kernel bypass may feel like cheating, but fortunately it’s completely within the rules.
What are the functions needed to do such kernel bypassing?
A TCP offload engine will "just work", no special application programming needed. It doesn't bypass the whole kernel, it just moves some of the TCP/IP stack from the kernel to the network card, so the driver is slightly higher level. The kernel API is the same.
TCP offload engines are supported by most modern gigabit interfaces.
Alternatively, if you mean "running code on a SolarFlare network adapter's embedded processor/FPGA 'Application Onload Engine'", then... that's card-specific. You're basically writing code for an embedded system, so you need to say which kind of card you're using.
Okay, so the question is not straightforward to answer without knowing how the kernel handles the network stack.
In general, the network stack is made up of many layers, with the lowest being the actual hardware. That hardware is supported by drivers (one per network interface), and the NICs typically expose very simple interfaces: think receive and send raw data.
On top of this physical connection, with its ability to receive and send data, sits a stack of protocols, layered as well. Near the bottom is the IP protocol, which basically lets you specify the receiver of your information, while at the top you'll find TCP, which supports reliable connections.
So in order to answer your question, you must first figure out which part of the network stack you need to replace, and what you need it to do. From my understanding of your question, it seems like you want to keep the original network stack and only sometimes use your own; in that case, you could implement the strategy pattern and make it possible to state which packets should be handled by which top level of the network stack.
Depending on how the network stack is implemented in Linux, you may or may not be able to achieve this without kernel changes. In a microkernel architecture, where each part of the network stack is implemented as its own service, this would be trivial: you would simply pipe the lower parts of the network stack through your strategy pattern and have it feed the input to the required top-level network layers.
Do you perhaps want to send and receive raw IP packets?
Basically you will need to fill in the headers and data of an IP packet yourself.
There are some examples here on how to send raw Ethernet packets:
http://austinmarton.wordpress.com/2011/09/14/sending-raw-ethernet-packets-from-a-specific-interface-in-c-on-linux/
To handle TCP/IP on your own, I think you might need to disable the TCP driver in a custom kernel, and then write your own user-space server that reads raw IP.
It's probably not that efficient, though...
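For a taste of the raw-socket end of this, here is a minimal sketch that captures raw Ethernet frames with an AF_PACKET socket (Linux-specific, needs root or CAP_NET_RAW; note that the kernel still copies each frame to you, so this alone is not kernel bypass, which needs PACKET_MMAP rings or vendor frameworks like OpenOnload):

/* Receive a few raw Ethernet frames from any interface. */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    unsigned char frame[2048];
    for (int i = 0; i < 5; i++) {
        ssize_t n = recvfrom(fd, frame, sizeof frame, 0, NULL, NULL);
        if (n < 0) { perror("recvfrom"); break; }
        /* Bytes 12-13 of an Ethernet frame hold the EtherType. */
        printf("frame of %zd bytes, ethertype 0x%02x%02x\n",
               n, frame[12], frame[13]);
    }
    close(fd);
    return 0;
}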

What does the machine code for networking look like?

At the end of the day every piece of code we write eventually gets turned into assembler and then machine language.
If you were writing assembler and wanting to perform a simple connection between two computers, how would you know which memory addresses to use (let alone offsets) within the assembler? Would you need to know specific addresses relating to the operating system?
I'm just wondering how somebody would write a really "clean" and "efficient" message passing library/compiler- the thing which is getting me is what on earth would network communications/IPC look like in assembler?
I think part of this answer could lie in querying known addresses relating to the OS? For example, 0x4545456 to 0x60000000 contains the Linux kernel data for communications X, etc.
The addresses are not specific to your OS. They are specific to your hardware/system. Accessing those has nothing to do with assembler vs. another programming language (e.g. C), in fact most device driver code (the code that actually interacts with the networking hardware) is typically written in C.
Here's just one random sample of a network (ethernet) controller:
Intel® 82580EB/82580DB GbE Controller: Datasheet
There are a bunch of registers that your software, either in assembler, or in another language, has to program to get this thing to actually communicate over ethernet. It's probably easier to start with a simpler example, something like a serial port. Let's build a hypothetical, fixed baud rate, serial port controller, mapped to memory:
Address  Meaning
0        RX status (reads 0 when there is no data to read, 1 when a byte is available)
1        RX buffer
2        TX status (reads 0 when ready to send, 1 when busy)
3        TX buffer
Now your software, either in assembler or any other language, can transmit data to another computer by monitoring (polling) address 2 until the transmitter is ready, then writing the next byte to address 3. We can also receive data from another computer by monitoring (polling) address 0 to see when data is ready, and reading the byte from address 1 when it is there.
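Concretely, the polling loops for this hypothetical device might look like this in C (assuming the four registers have been mapped somewhere and base points at them; the offsets are the ones from the table above):

/* Polled I/O against the hypothetical memory-mapped serial port.
 * In a real driver, base would come from the bus and the mapping
 * from ioremap() (kernel) or mmap() of /dev/mem (user space). */
#include <stdint.h>

#define RX_STATUS 0   /* 1 when a received byte is available */
#define RX_BUFFER 1
#define TX_STATUS 2   /* 1 while the transmitter is busy */
#define TX_BUFFER 3

static void uart_putc(volatile uint8_t *base, uint8_t c)
{
    while (base[TX_STATUS] != 0)
        ;                      /* spin until the transmitter is ready */
    base[TX_BUFFER] = c;       /* hardware sends the byte */
}

static uint8_t uart_getc(volatile uint8_t *base)
{
    while (base[RX_STATUS] == 0)
        ;                      /* spin until a byte arrives */
    return base[RX_BUFFER];
}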
In a modern operating system/OS those are all physical addresses which need to be somehow mapped into virtual addresses.
Real world hardware, such as the one I linked to, will typically use interrupts, so you don't need to poll. It will usually have DMA, so the hardware can access your data directly rather than you feeding it byte by byte. It will handle various protocols and will have registers for checking and setting various aspects of this protocol.
In a modern OS the actual interaction with the hardware is implemented in a device driver and user software can exchange data with the device driver through some API. Again, this user code may be written in assembler or any other language. The API will vary depending on the OS. Communication/networking is generally built as a "stack" with higher level protocols implemented over the lower level ones. Which part of this stack is in a user library or part of the OS will vary between different operating systems.
For the hypothetical device I described above the API may consist of two single byte blocking calls, read() and write(). You then use some sort of system call mechanism from either assembler or a higher level language to call these and pass parameters/retrieve the output. In some operating systems device I/O may look like file I/O so you would use the generic file read/write to perform operations on the device and the OS will dispatch those to the right device driver. Furthermore, in a typical OS the actual system call will be available through some sort of library, which again you may call from various programming languages.
There are two pieces of code involved in doing networking in assembly: the kernel code used by the operating system to actually do the networking, and client code that wants to tell the OS what data to send over the network.
Typically, the hardware in a machine has certain memory addresses dedicated to communicating with the network hardware. The machine code for the OS can then write the appropriate values into this memory to control the hardware that ends up sending and receiving bytes. These memory addresses would be hardcoded into the machine code.
In the case of user code that does networking (say, Mozilla Firefox), the process is different. There is typically a machine instruction or set of instructions used by user code to tell the operating system to perform some task (in MIPS, for example, this is syscall; 32-bit x86 Linux traditionally used the int instruction, and x86-64 has a dedicated syscall instruction). Client code works by setting up some buffers with the appropriate data to send to the network, then using one of the instructions above to tell the OS that it should send the data. That instruction traps into the OS, which reads the user data and then uses its own machine code (described above) to actually control the network device appropriately. In this way, the OS can guard direct access to the network device by blocking access to the physical addresses controlling the device and moderating access through system calls. It also means that you don't need to know any memory addresses when writing user code to do networking. The OS handles these details, and all you need to know is what instruction to execute to trigger the system call.
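To make that concrete one level above raw assembly, here is a sketch using Linux's generic syscall(2) entry point instead of the libc wrappers (the syscall numbers shown exist on x86-64; 32-bit x86 historically multiplexed socket calls through socketcall):

/* "Telling the OS" boils down to a syscall number plus arguments in
 * registers; no kernel memory addresses ever appear in user code. */
#include <netinet/in.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* Equivalent to socket(AF_INET, SOCK_STREAM, 0). */
    long fd = syscall(SYS_socket, AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 1;

    /* connect() and write() would follow the same pattern:
     * fill a buffer/struct, then trap into the kernel. */
    syscall(SYS_close, fd);
    return 0;
}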
Hope this helps!
