Kernel bypass for UDP and TCP on Linux: what does it involve?

Per http://www.solacesystems.com/blog/kernel-bypass-revving-up-linux-networking:
[...]a network driver called OpenOnload that use “kernel bypass” techniques to run the application and network driver together in user space and, well, bypass the kernel. This allows the application side of the connection to process many more messages per second with lower and more consistent latency.
[...]
If you’re a developer or architect who has fought with context switching for years kernel bypass may feel like cheating, but fortunately it’s completely within the rules.
What are the functions needed to do such kernel bypassing?

A TCP offload engine will "just work", no special application programming needed. It doesn't bypass the whole kernel, it just moves some of the TCP/IP stack from the kernel to the network card, so the driver is slightly higher level. The kernel API is the same.
TCP offload engines are supported by most modern gigabit interfaces.
Alternatively, if you mean "running code on a SolarFlare network adapter's embedded processor/FPGA 'Application Onload Engine'", then... that's card-specific. You're basically writing code for an embedded system, so you need to say which kind of card you're using.

Okay, so the question is not straightforward to answer without knowing how the kernel handles the network stack.
In general, the network stack is made up of many layers. The lowest one is the actual hardware, which is typically supported by means of drivers (one for each network interface). NICs typically provide very simple interfaces: think receive and send raw data.
On top of this physical connection, with its ability to receive and send data, sit a lot of protocols, which are layered as well. Near the bottom is the IP protocol, which basically allows you to specify the receiver of your information, while at the top you'll find TCP, which supports reliable connections.
So in order to answer your question, you must first figure out which part of the network stack you need to replace, and what you need it to do. From my understanding of your question, it seems you want to keep the original network stack and just sometimes use your own. In that case you should really just implement the strategy pattern, and make it possible to state which packets should be handled by which top-level part of the network stack.
Depending on how the network stack is implemented in Linux, you may or may not be able to achieve this without kernel changes. In a microkernel architecture, where each part of the network stack is implemented in its own service, this would be trivial: you would simply pipe the lower parts of the network stack into your strategy pattern, and have it pipe the input to the required top-level network layers.
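To make the strategy-pattern idea concrete, here is a minimal user-space sketch in C. It only shows the dispatch step: a rule table maps (IP protocol, destination port) to a handler. The names dispatch_packet, custom_stack, default_stack and the UDP port 9000 rule are all made up for illustration, and how raw packets would reach this function on a real Linux system is not shown.

```c
#include <stddef.h>
#include <stdint.h>

enum { USE_DEFAULT_STACK = 0, USE_CUSTOM_STACK = 1 };

/* A "strategy": one top-level stack implementation handling a packet. */
typedef int (*stack_handler)(const uint8_t *pkt, size_t len);

/* Hypothetical handlers; real ones would parse and process the packet. */
static int default_stack(const uint8_t *pkt, size_t len)
{
    (void)pkt; (void)len;
    return USE_DEFAULT_STACK;
}

static int custom_stack(const uint8_t *pkt, size_t len)
{
    (void)pkt; (void)len;
    return USE_CUSTOM_STACK;
}

struct stack_rule {
    uint8_t       ip_proto;  /* IPv4 protocol number, e.g. 17 for UDP */
    uint16_t      dst_port;  /* destination port in host byte order   */
    stack_handler handle;
};

/* Illustrative rule: UDP traffic to port 9000 goes to the custom stack. */
static const struct stack_rule rules[] = {
    { 17, 9000, custom_stack },
};

/* Pick a handler for a raw IPv4 packet; fall back to the default stack. */
int dispatch_packet(const uint8_t *pkt, size_t len)
{
    size_t ihl, i;
    uint16_t dport;

    if (len < 20 || (pkt[0] >> 4) != 4)
        return default_stack(pkt, len);
    ihl = (size_t)(pkt[0] & 0x0f) * 4;       /* IPv4 header length in bytes */
    if (ihl < 20 || len < ihl + 4)
        return default_stack(pkt, len);
    /* TCP and UDP both keep the destination port at offset 2. */
    dport = (uint16_t)((pkt[ihl + 2] << 8) | pkt[ihl + 3]);
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (rules[i].ip_proto == pkt[9] && rules[i].dst_port == dport)
            return rules[i].handle(pkt, len);
    return default_stack(pkt, len);
}
```

Keeping the rules in a table means a new stack can be plugged in by adding an entry, without touching the dispatch logic.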

Do you perhaps want to send and receive raw IP packets?
Basically, you will need to fill in the headers and data of an IP packet.
There are some examples here on how to send raw Ethernet packets:
http://austinmarton.wordpress.com/2011/09/14/sending-raw-ethernet-packets-from-a-specific-interface-in-c-on-linux/
To handle TCP/IP on your own, I think you might need to disable TCP support in a custom kernel, and then write your own user-space server that reads raw IP.
It's probably not that efficient though...
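As a rough illustration of what "filling in the headers" means, the sketch below builds a 20-byte IPv4 header (no options) by hand and computes its checksum as described in RFC 1071. The function names build_ipv4_header and inet_checksum, and the fixed choices of TTL 64, ID 0 and the "don't fragment" flag, are assumptions made for this example only. Actually sending the result needs a raw socket and the appropriate privileges, which is not shown.

```c
#include <arpa/inet.h>   /* inet_pton */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IPV4_HDR_LEN 20

/* RFC 1071 Internet checksum: one's-complement sum of 16-bit words. */
uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)                       /* odd trailing byte */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)                   /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Write an IPv4 header into out. Returns 0 on success, -1 on bad input. */
int build_ipv4_header(uint8_t out[IPV4_HDR_LEN], const char *src,
                      const char *dst, uint8_t protocol, uint16_t payload_len)
{
    struct in_addr s, d;
    uint16_t total, csum;

    if (payload_len > 65535 - IPV4_HDR_LEN)
        return -1;
    if (inet_pton(AF_INET, src, &s) != 1 || inet_pton(AF_INET, dst, &d) != 1)
        return -1;

    total = (uint16_t)(IPV4_HDR_LEN + payload_len);
    memset(out, 0, IPV4_HDR_LEN);
    out[0] = 0x45;                      /* version 4, header length 5 words */
    out[2] = (uint8_t)(total >> 8);     /* total length, big-endian */
    out[3] = (uint8_t)(total & 0xff);
    out[6] = 0x40;                      /* flags: don't fragment */
    out[8] = 64;                        /* TTL */
    out[9] = protocol;                  /* e.g. 6 = TCP, 17 = UDP */
    memcpy(out + 12, &s, 4);            /* addresses are already big-endian */
    memcpy(out + 16, &d, 4);

    csum = inet_checksum(out, IPV4_HDR_LEN);  /* checksum field is 0 here */
    out[10] = (uint8_t)(csum >> 8);
    out[11] = (uint8_t)(csum & 0xff);
    return 0;
}
```

A quick sanity check on any header built this way: running inet_checksum over the finished 20 bytes should give 0.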

Related

Is there an option or command that I can use to disable, unload, or stop the TCP/IP stack in Linux? I need it to implement user-space TCP in a server app

I am working on a C program that uses sockets to implement TCP networking in a server application. I was wondering whether it is possible to disable the kernel's TCP/IP stack so that my system does not interfere with incoming connection SYN requests and IP packets.
Or must I compile a kernel to disable it? Please tell me if this is the case.
On this question How to create a custom packet in c?
it says
Also note that if you are trying to send raw tcp/udp packets, one problem you will have is disabling the network stack automatically processing the reply (either by treating it as addressed to an existing IP address or attempting to forward it).
If that's the case, then how is this possible?
Or is there any tool or program in Linux that can be used to achieve this, as in this comment on Disable TCP/IP Stack from user-space:
There is of course the counterintuitive approach of using additional networking functionality to disable normal networking functionality: netfilter. There are a few iptables matches/targets which might prove beneficial to you (e.g., the “owner” match that may deny or accept based on PID or UID). This still means the functionality is in the kernel, it just limits it.
If someone knows how the approach right above can be done, are there any commands for it?
Well, you could compile yourself a kernel without networking :)
A couple of options:
Check out the DPDK project (https://www.linuxjournal.com/content/userspace-networking-dpdk). DPDK hands the physical NIC to user space by binding it to a UIO or VFIO driver (igb_uio, uio_pci_generic, or vfio-pci), which takes the kernel stack out of the path.
Use an XDP-capable NIC in either zero-copy or driver mode. With an eBPF program attached, received packets can be pushed directly to user space, bypassing the kernel stack.
Unless this is a homework project, remember: don't invent, reuse.
[EDIT, based on comment] User-space TCP/IP stacks have their own custom socket API for reading from and writing to a socket. So with either LD_PRELOAD or a source file change, one can keep using the same application.

What does dev_net_set do in Linux?

I am writing a simple net device driver based on the loopback driver and want to register my net_device structure. This and that page on writing a net device say to just call register_netdev. But they're writing fancy drivers with PCI express and other complicated things.
So, if I just want something like the loopback driver, I should presumably base my code on loopback.c. My question is, what does the first line of this code in loopback_net_init do:
dev_net_set(dev, net);
err = register_netdev(dev);
Apparently net is determined by this code in net_namespace.c:
register_pernet_device(ops) ...
__register_pernet_operations(list, ops)
for_each_net(net) ...
What is this looping for? What might go wrong if I skip the dev_net_set call? Why are others not using it?
AFAIK, net is a structure that will allow the kernel to interact with the device. You need it to register the device and remove it in the module cleanup function. Please review the code under linux/net/8021q/ for examples.
AFAIK, looping happens at the level of sockets (layers 5-7), whereas net_dev is the kernel component that interacts directly with the driver when you actually want to use, say, an Ethernet card, or SLIP or PLIP, for transmitting frames (layers 2-0). Loopback happens at the level of the kernel's network subsystem, and lies way above the drivers that interact with the hardware. So I don't see why you would need a driver to use the loopback feature. However, there is also a provision for registering a dummy device with net_dev, though I don't know if that is what you are looking for.
That said, if your intention is simply to use a driver that simulates a physical device without actually having one and, say, reflects the packets that it receives, that is possible too. Basically, up to the net_dev layer the kernel does all the protocol work (TCP/IP), and finally passes the packet off to some handler that the device driver registers with net_dev or something similar. Similarly, on receiving, the device triggers an interrupt, the driver does a DMA operation, and the kernel takes over from there. Hence, instead of the code for doing the DMA operation, you can make a module that simply passes over a static packet that is compatible with Ethernet/TCP/IP. In the vast majority of cases, all these aspects (the network and other subsystems) are agnostic to the underlying bus details, i.e. it shouldn't matter whether the Ethernet card is connected to PCI or ISA, but there can be exceptions. Thus, IMHO, you are trying to do something that should only be attempted after gaining a thorough understanding of the network subsystem, and a good enough understanding of the kernel as a whole. Until then you will be shooting in the dark. Sometimes you may hit, but oftentimes you will miss.
http://man7.org/linux/man-pages/man8/ip-netns.8.html
A network namespace is logically another copy of the network stack,
with its own routes, firewall rules, and network devices.
So for_each_net is looping over these namespaces and creating a copy of all "per net" network devices in each one.
Use ip netns list to determine whether you are using network namespaces. Often they are not used, so drivers do not necessarily need to use dev_net_set.

How to create a custom packet in c?

I'm trying to make a custom TCP/IP packet in C. When I say custom, I mean being able to change any value in the packet, e.g. MAC, IP address and so on.
I tried searching around but I can't find anything that actually guides me or gives me example source code.
How can I create a custom packet or where should I look for guidance?
A relatively easy tool to do this that is portable is libpcap. It's better known for receiving raw packets (and indeed it's better you play with that first as you can compare received packets with your hand crafted ones) but the little known pcap_sendpacket will actually send a raw packet.
If you want to do it from scratch yourself, open a socket with AF_PACKET and SOCK_RAW (that's for Linux, other OS's may vary) - for example see http://austinmarton.wordpress.com/2011/09/14/sending-raw-ethernet-packets-from-a-specific-interface-in-c-on-linux/ and the full code at https://gist.github.com/austinmarton/1922600 . Note you need to be root (or more accurately have the appropriate capability) to do this.
Also note that if you are trying to send raw tcp/udp packets, one problem you will have is disabling the network stack automatically processing the reply (either by treating it as addressed to an existing IP address or attempting to forward it).
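A minimal sketch of the AF_PACKET approach described above, assuming Linux. build_eth_frame lays out an Ethernet header and payload in a buffer, and send_frame pushes it out through a given interface. Both function names are invented for this example, error handling is reduced to returning -1, and send_frame will fail unless the process has CAP_NET_RAW (for example when run as root).

```c
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct sockaddr_ll */
#include <net/if.h>           /* if_nametoindex */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAC_LEN     6
#define ETH_HDR_LEN 14

/* Lay out dst MAC, src MAC, EtherType and payload. Returns the frame
 * length, or 0 if the buffer is too small. */
size_t build_eth_frame(uint8_t *out, size_t cap,
                       const uint8_t dst[MAC_LEN], const uint8_t src[MAC_LEN],
                       uint16_t ethertype, const void *payload, size_t len)
{
    if (len > cap || cap - len < ETH_HDR_LEN)
        return 0;
    memcpy(out, dst, MAC_LEN);
    memcpy(out + MAC_LEN, src, MAC_LEN);
    out[12] = (uint8_t)(ethertype >> 8);     /* EtherType is big-endian */
    out[13] = (uint8_t)(ethertype & 0xff);
    memcpy(out + ETH_HDR_LEN, payload, len);
    return ETH_HDR_LEN + len;
}

/* Send a prebuilt frame on the named interface. Returns 0 or -1. */
int send_frame(const char *ifname, const uint8_t *frame, size_t len)
{
    struct sockaddr_ll addr;
    unsigned int ifindex = if_nametoindex(ifname);
    int fd, rc;

    if (ifindex == 0 || len < ETH_HDR_LEN)
        return -1;
    fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;                 /* typically EPERM without CAP_NET_RAW */

    memset(&addr, 0, sizeof addr);
    addr.sll_family  = AF_PACKET;
    addr.sll_ifindex = (int)ifindex;
    addr.sll_halen   = MAC_LEN;
    memcpy(addr.sll_addr, frame, MAC_LEN);   /* destination MAC */

    rc = sendto(fd, frame, len, 0,
                (struct sockaddr *)&addr, sizeof addr) < 0 ? -1 : 0;
    close(fd);
    return rc;
}
```

For experiments, EtherType 0x88b5 (an IEEE "local experimental" value) is a reasonable choice, since the receiving kernel will not try to interpret such frames as IP.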
Doing this sort of thing is not as simple as you think. Controlling the data above the IP layer is relatively easy using normal socket APIs, but controlling data below it is a bit more involved. Most operating systems make changing lower-level protocol information difficult, since the kernel itself manages network connections and doesn't want you messing things up. Beyond that, there are other platform differences, network controls, etc. that can play havoc with you.
You should look into some of the libraries that are out there to do this. Some examples:
libnet - http://libnet.sourceforge.net/
libdnet - http://libdnet.sourceforge.net/
If your goal is to spoof packets, you should read up on network-based spoofing mitigation techniques too (for example egress filtering to prevent spoofed packets from exiting a network).

User-mode TCP stack for retransmits over lossy serial link

I believe that my question is:
Is there a simple user-mode TCP stack on PC operating systems that could be used to exchange data over a lossy serial link with a Linux-based device?
Here is more context:
I have a Linux-based device connected via a serial link to a PC. The serial link is lossy so data being sent between the two devices sometimes needs to be retransmitted. Currently the system uses a custom protocol that includes framing, addressing (for routing to different processes within the Linux device), and a not-so-robust retransmission algorithm.
On the Linux device side, it would be convenient to replace the custom protocol, implement SLIP over the serial link and use TCP for all communications. The problem is that on the PC-side, we're not sure how to use the host's TCP stack without pulling in general IP routing that we don't need. If there were a user-mode TCP stack available, it seems like I could integrate that in the PC app. The only TCP stacks that I've found so far are for microcontrollers. They could be ported, but it would be nice if there were something more ready-to-go. Or is there some special way to use the OS's built in TCP stack without needing administrative privileges or risking IP address conflicts with the real Ethernet interfaces.
Lastly, just to keep the solution focused on TCP, yes, there are other solutions to this problem such as using HDLC or just fixing our custom protocol. However, we wanted to explore the TCP route further in case it was an option.
It appears that the comments have already answered your question, but perhaps to clarify: no, you cannot use TCP without using IP. TCP is built on top of IP, and it isn't going to work any other way.
PPP is a good way of establishing an IP connection over a serial link, but if you do not have administrative access on both ends of the link this could be difficult. 172.16.x, 10.x, and 192.168.x are reserved for private local networks, so you should be able to find a set of IP addresses that does not interfere with the network operation of the local computer.
From the point of view of no configuration and no dependencies, coming up with your own framing/retransmit protocol should not be too hard, and is probably your best choice if you don't need interoperability. That being said, Kermit or {x,y,z}modem would provide both better performance and a standard to code against.
Lastly, you may be able to use something like socat to do protocol translation. I.e. connect a serial stream to a TCP port. That wouldn't address data reliability / re-transmission, but it may be the interface you are looking to program against.
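To give a feel for how small the framing half of a home-grown protocol can be, here is a sketch of SLIP-style byte stuffing (RFC 1055), the same framing the question mentions. The function names slip_encode and slip_decode are made up for this example. Retransmission is deliberately left out: it would need sequence numbers, a checksum and acknowledgements layered on top of frames like these.

```c
#include <stddef.h>
#include <stdint.h>

/* Special bytes from RFC 1055. */
#define SLIP_END     0xC0
#define SLIP_ESC     0xDB
#define SLIP_ESC_END 0xDC
#define SLIP_ESC_ESC 0xDD

/* Wrap n payload bytes into one frame: END, stuffed data, END.
 * Returns the encoded length, or 0 if out is too small. */
size_t slip_encode(const uint8_t *in, size_t n, uint8_t *out, size_t cap)
{
    size_t i, o = 0;

    if (cap < 2)
        return 0;
    out[o++] = SLIP_END;                 /* flushes line noise at receiver */
    for (i = 0; i < n; i++) {
        uint8_t b = in[i];
        size_t need = (b == SLIP_END || b == SLIP_ESC) ? 2 : 1;

        if (o + need + 1 > cap)          /* keep room for the final END */
            return 0;
        if (b == SLIP_END) {
            out[o++] = SLIP_ESC;
            out[o++] = SLIP_ESC_END;
        } else if (b == SLIP_ESC) {
            out[o++] = SLIP_ESC;
            out[o++] = SLIP_ESC_ESC;
        } else {
            out[o++] = b;
        }
    }
    out[o++] = SLIP_END;
    return o;
}

/* Decode the first non-empty frame in in[0..n). Returns the payload
 * length, or -1 if the frame is malformed, unterminated, or too big. */
long slip_decode(const uint8_t *in, size_t n, uint8_t *out, size_t cap)
{
    size_t i = 0, o = 0;

    while (i < n && in[i] == SLIP_END)   /* skip leading END bytes */
        i++;
    for (; i < n; i++) {
        uint8_t b = in[i];

        if (b == SLIP_END)
            return (long)o;
        if (b == SLIP_ESC) {
            if (++i >= n)
                return -1;
            if (in[i] == SLIP_ESC_END)
                b = SLIP_END;
            else if (in[i] == SLIP_ESC_ESC)
                b = SLIP_ESC;
            else
                return -1;
        }
        if (o >= cap)
            return -1;
        out[o++] = b;
    }
    return -1;                           /* ran out of input before END */
}
```

Because END never appears inside an encoded payload, a receiver that loses bytes can always resynchronise on the next END, which is what makes this kind of framing suitable for a lossy serial line.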

how is tcp(kernel) bypass implemented?

Assuming I would like to avoid the overhead of the linux kernel in handling incoming packets and instead would like to grab the packet directly from user space. I have googled around a bit and it seems that all that needs to happen is one would use raw sockets with some socket options. Is this the case? Or is it more involved than this? And if so, what can I google for or reference in order to implement something like this?
There are many techniques for networking with kernel bypass.
First, if you are sending messages to another process on the same machine, you can do so through a shared memory region with no jumps into the kernel.
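As a sketch of the shared-memory case, the code below sets up a single-producer, single-consumer message ring in an anonymous shared mapping, using C11 atomics for the head and tail indices. All names here (struct ring, ring_create, ring_push, ring_pop) and the sizes are assumptions for illustration. A mapping created this way is shared with child processes after fork(); unrelated processes would need a named object such as one from shm_open instead.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS */
#endif
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define RING_SLOTS 64u           /* must be a power of two */
#define SLOT_BYTES 256u

struct ring_slot {
    uint32_t len;
    uint8_t  data[SLOT_BYTES];
};

struct ring {
    _Atomic uint32_t head;       /* written only by the producer */
    _Atomic uint32_t tail;       /* written only by the consumer */
    struct ring_slot slot[RING_SLOTS];
};

/* Map a zeroed ring that stays shared across fork(). NULL on failure. */
struct ring *ring_create(void)
{
    struct ring *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (r == MAP_FAILED)
        return NULL;
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
    return r;
}

/* Producer side: copy one message in. No system call is involved. */
bool ring_push(struct ring *r, const void *msg, uint32_t len)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    struct ring_slot *s;

    if (len > SLOT_BYTES || head - tail == RING_SLOTS)
        return false;            /* message too large, or ring full */
    s = &r->slot[head % RING_SLOTS];
    s->len = len;
    memcpy(s->data, msg, len);
    /* Release: the slot contents become visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns the message length, or -1 if the ring is
 * empty or the caller's buffer is too small. */
int ring_pop(struct ring *r, void *buf, uint32_t cap)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    struct ring_slot *s;
    uint32_t len;

    if (head == tail)
        return -1;
    s = &r->slot[tail % RING_SLOTS];
    len = s->len;
    if (len > cap)
        return -1;
    memcpy(buf, s->data, len);
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return (int)len;
}
```

In this sketch the consumer would poll ring_pop in a loop, which trades CPU time for latency; that trade-off is typical of bypass designs.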
Passing packets over a network without involving the kernel gets more interesting, and involves specialized hardware that gets direct access to user memory. This idea is called RDMA.
Here's one way it can work (this is what InfiniBand hardware does). The application registers a memory buffer with the RDMA hardware. This buffer is pinned in physical memory, since swapping it out would obviously be bad (since the hardware will keep writing to the physical memory region). A control region is also mapped into userspace memory. When an application is ready to use the buffer to send or receive a message, it writes a command to the control region. The hardware takes the data from a registered buffer on one end, and places it into another registered buffer at the other end.
Clearly, this is too low level, so there are abstractions that make programming RDMA hardware easier. OFED verbs are one such abstraction.
The InfiniBand software stack has one extra interesting bit: the Sockets Direct Protocol (SDP) that is used for compatibility with existing applications. It works by inserting an LD_PRELOAD shim that translates standard socket API calls into IB verbs.
InfiniBand is just what I'm most familiar with. RoCE/iWARP hardware is very similar from the programmer's perspective, but uses a different transport than InfiniBand (TCP using an offload engine in iWARP, Ethernet in RoCE). There are/were also other approaches to RDMA (Quadrics, for example).
