Kernel-level memory handling in C

My requirement is to store data in the kernel. The data are incoming network packets, which may differ in size and have to be stored for, say, a 250 ms duration, and there should be 5 such candidates for which kernel-level memory management is required. Since packets arrive very fast, my approach is to pre-allocate a large block of memory, say 2 MB, for each such candidate, because kmalloc and kfree have timing overhead. Any help regarding that?

sk_buffs are the generic network-related answer, or, as Mike points out, a kernel memory cache is an even more generic answer to your question. However, I believe you may have put a solution before the question.
The bottleneck with LTE/HSDPA/GSM is the driver and how you get data from the device to the CPU, which depends on how the hardware is connected. Are you using SPI, UART, SDHC, USB, or PCI?
Also, at least with HSDPA, you need a PPP connection; isn't LTE the same? Ethernet is not the model to use in this case. Typically you need to emulate a high-speed tty. n_gsm also supplies a network interface; I am not entirely familiar with this interface, but I suspect it exists to support LTE. It is not well documented. There is also the Option USB serial driver, if that is the hardware you are using. An example patch uses n_gsm to handle LTE; I believe that patch was reworked into the current n_gsm network support.
You need to tell us more about your hardware.

As already noted within the comments:
use struct sk_buff; it was created for exactly this purpose.
See, for example, http://www.linuxfoundation.org/collaborate/workgroups/networking/skbuff

Related

How can I disable producing paged (non-linear) skb's in Linux kernel?

Suppose I write an LKM with networking (netfilter) activity and I need to do some tampering with an skb, including skb_pull(). So I have to check whether the skb is linear before pulling.
If I generally don't want to face non-linear (paged) skbs in my system at all, how can I arrange that?
In other words: how can I disable the production of paged skbs in the Linux kernel?
Is there some option on network interfaces that can be set with ethtool, or something else?
I think you can refer to the implementation of bpf_skb_pull_data in the kernel; this BPF helper function can pull non-linear data into the linear area.
https://man7.org/linux/man-pages/man7/bpf-helpers.7.html

Why is libpcap better than sniffing with raw sockets?

If I want to sniff packets in Linux without setting any filters, I see two options:
Use libpcap
Use a raw socket myself, like https://www.binarytides.com/packet-sniffer-code-in-c-using-linux-sockets-bsd-part-2/
Why is libpcap better than using raw sockets myself?
Three reasons:
1) It's way easier to correctly set up.
2) It's portable, even to Windows, which uses a quite similar yet different API for sockets.
3) It's MUCH faster.
1 and 2, IMO, don't need much explanation. I'll dive into 3.
To understand why libpcap is (generally) faster, we need to understand the bottlenecks in the socket API.
The two biggest bottlenecks that libpcap tends to avoid are syscalls and copies.
How it does so is platform-specific.
I'll tell the story for Linux.
Linux, since 2.0 IIRC, implements what it calls the AF_PACKET socket family, and later added PACKET_MMAP on top of it. I don't exactly recall the benefits of the former on its own, but the latter is critical for avoiding both copies from kernel to userspace (there are still a few copies kernel-side) and syscalls.
In PACKET_MMAP you allocate a big ring buffer in userspace, and then associate it to an AF_PACKET socket. This ring buffer will contain a bit of metadata (most importantly, a marker that says if a region is ready for user processing) and the packet contents.
When a packet arrives on a relevant interface (generally one you bind your socket to), the kernel copies it into the ring buffer and marks the location as ready for userspace*.
If the application was waiting on the socket, it gets notified*.
So, why is this better than raw sockets? Because you can get by with few or no syscalls after setting up the socket, depending on whether you busy-poll the buffer itself or wait with poll until a few packets are ready, and because you don't need the copy from the socket's internal RX buffer to your user buffers, since the ring is shared with you.
libpcap does all of that for you. And does it on Mac, *BSD, and pretty much any platform that provides you with faster capture methods.
*It's a bit more complex with TPACKET_V3, where the granularity is in "blocks" instead of packets.

Raw sockets vs. libpcap in sending performance

I'm currently trying to get the best sending performance for an 802.11 frame. I am using libpcap, but I wondered whether I could speed it up using raw sockets (or any other possible method).
Consider this simple example code for libpcap with a device handle already created previously:
u_char ourPacket[60][50] = { {0x01, 0x02, ... , 0x50}, ... , {0x01, 0x02, ... , 0x50} };
for ( ; ; )
{
    for (int i = 0; i < 60; ++i)
    {
        pcap_sendpacket(deviceHandle, ourPacket[i], 50);
    }
}
This code segment runs on a thread for each separate CPU core. Is there any faster way to do this for raw 802.11 frames/packets containing radiotap headers that are stored in an array?
Looking at pcap's source code for pcap_inject (the same function but with a different return value), it doesn't seem to use raw sockets to send packets? No clue.
I don't care about capture performance, as a lot of other questions have answered that. Are raw sockets even meant for sending layer-2 packets/frames?
As Gill Hamilton mentioned, the answer will depend on a lot of things. If you see huge gains on one system, you may not see them on another, even if both are "running Linux". You're better off testing the code yourself. That said, here is some of what my team has found:
Note 1: all the gains were for code that did not just write frames/packets to sockets, but analyzed them and processed them, so it is likely that much or most of our gains were there rather than the read/write.
We are writing a direct raw-socket implementation to send/receive Ethernet frames and IP packets. We're seeing about a 250%-450% performance gain on our measliest R&D system, a MIPS 24K 5V system-on-a-chip with a MT7530 integrated Ethernet NIC/switch which can barely handle a sustained 80 Mbit. On a very modest but much beefier test system with an Intel Celeron J1900 and I211 gigabit controllers it drops to about 50%-100% vs. C libpcap. In fact, we only saw about 80%-150% vs. a Python dpkt/scapy implementation. We only saw maybe a 20% gain on a generic i5 Linux dual-gigabit system vs. a C libpcap implementation. So based on our non-rigorous testing, we saw up to a 20x difference in performance gains depending on the system.
Note 2: All of these gains were with maximum optimizations and the strictest compile parameters for the custom C code, but not necessarily for the C libpcap code (making all warnings errors on some of the above systems made the libpcap code not compile, and who wants to debug that?), so the differences may be less dramatic. We need to squeeze out every last ounce of performance to enable some sophisticated packet processing using no more than 5.0V and 1.5A, so we'll ultimately be going with a custom ASIC, which may be an FPGA. That said, it's A LOT of work to get this working without bugs, and we're likely going to end up implementing significant portions of the Ethernet/IP/TCP/UDP stack, so I don't recommend it.
Last note: CPU usage on the MIPS 24K system was about 1/10 for the custom code, but again, I would say that the vast majority of that gain was from the processing.

How to add code into the Linux kernel?

I am studying how to analyse and evaluate the TCP/IP protocol stack of Linux. My goal is to study the performance of the TCP/IP stack as a whole: the time cost of each layer, the interaction between the layers, and the queuing at the IP layer.
To do this, I am using a probing-node-based scheme to study the internal behaviour of the Linux TCP/IP stack. A probing node is a piece of code added into the kernel to record information such as timestamps, queue lengths, and packet sizes.
My question: how do I add the probing node into the kernel?
You can use (for example) SystemTap; the main idea behind this tool is to place probing nodes somewhere in the kernel or in a userspace program.
If you do not have time to learn SystemTap, you can just put some printk calls in the kernel and read them from dmesg.
In both cases you introduce a big delay in the network stack due to the prints. To reduce the delay introduced by probing, I suggest using SystemTap, storing all your time samples somewhere, and printing only at the end of the acquisition.

Recommended chunk size for sending in C?

I have a server which needs to transfer a large amount of data (~1-2 GB) per request. What is the recommended amount of data for each call to send? Does it matter?
The TCP/IP stack takes care of not sending packets larger than the path MTU, and you would have to work rather hard to make it send packets smaller than the MTU, in ways that wouldn't help throughput; and there is no way of discovering what the path MTU actually is for any given connection. So leave all consideration of the path MTU out of it.
You can and should make every send() as big as possible.
Your packet size is constrained by the Maximum Transmission Unit (MTU) of the protocol. If you send a packet bigger than the MTU, it gets fragmented.
You can do some math: http://www.speedguide.net/articles/mtu-what-difference-does-it-make--111
More References: http://en.wikipedia.org/wiki/Maximum_transmission_unit
But ultimately my suggestion is to aim for good enough and just let the OS network layer do its work: unless you are programming raw sockets or passing some exotic options, the OS will have a network buffer and will try to do its best.
In that case, socket send() is nothing more than a memory transfer from your user process's memory to the kernel's private memory. A rule of thumb is not to exceed the virtual-memory page size (http://en.wikipedia.org/wiki/Page_(computer_memory)), which is usually 4 KB.
