I'm currently trying to get the best transmit performance for 802.11 frames. I'm using libpcap, but I wonder whether I could speed things up using raw sockets (or any other method).
Consider this simple example code for libpcap with a device handle already created previously:
char ourPacket[60][50] = { {0x01, 0x02, ... , 0x50}, ... , {0x01, 0x02, ... , 0x50} };

for ( ; ; )
{
    for (int i = 0; i < 60; ++i)
    {
        pcap_sendpacket(deviceHandle, ourPacket[i], 50);
    }
}
This code segment runs on a separate thread for each CPU core. Is there any faster way to do this for raw 802.11 frames/packets containing Radiotap headers that are stored in an array?
Looking at pcap's source code for pcap_inject (the same function but with a different return value), it doesn't seem to use raw sockets to send packets, but I'm not sure.
I don't care about capture performance, as a lot of other questions have covered that. Are raw sockets even meant for sending layer-2 packets/frames?
As Gill Hamilton mentioned, the answer will depend on a lot of things. If you see huge gains on one system, you may not see them on another, even if both are "running Linux". You're better off testing the code yourself. That being said, here is some of what my team has found:
Note 1: all the gains below were for code that did not just write frames/packets to sockets, but also analyzed and processed them, so it is likely that much or most of our gain was there rather than in the read/write itself.
We are writing a direct raw-socket implementation to send and receive Ethernet frames and IP packets. We're seeing about a 250%-450% performance gain on our measliest R&D system, a 5V MIPS 24K system-on-chip with an MT7530 integrated Ethernet NIC/switch that can barely handle a sustained 80 Mbit. On a very modest but much beefier test system with an Intel Celeron J1900 and I211 gigabit controllers, it drops to about 50%-100% vs. a C libpcap implementation; in fact, we only saw about 80%-150% vs. a Python dpkt/scapy implementation. We only saw maybe a 20% gain on a generic i5 dual-gigabit Linux system vs. a C libpcap implementation. So based on our non-rigorous testing, we saw up to a 20x difference in performance gains depending on the system.
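For reference, here is a minimal sketch of what a direct layer-2 send over a Linux AF_PACKET socket looks like. The interface name, MAC addresses, and frame layout below are illustrative placeholders, not our actual code, and sending requires CAP_NET_RAW:

```c
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Build a minimal Ethernet frame: dst MAC, src MAC, EtherType, payload.
   Returns the total frame length. */
static size_t build_frame(unsigned char *buf, const unsigned char dst[6],
                          const unsigned char src[6], unsigned short ethertype,
                          const unsigned char *payload, size_t len)
{
    memcpy(buf, dst, 6);
    memcpy(buf + 6, src, 6);
    buf[12] = (unsigned char)(ethertype >> 8);
    buf[13] = (unsigned char)(ethertype & 0xff);
    memcpy(buf + 14, payload, len);
    return 14 + len;
}

/* Send one raw frame on the given interface; needs CAP_NET_RAW.
   Returns bytes sent, or -1 on error (e.g. insufficient privileges). */
static ssize_t send_raw(const char *ifname, const unsigned char *frame, size_t len)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    struct sockaddr_ll addr = {0};
    addr.sll_family = AF_PACKET;
    addr.sll_ifindex = if_nametoindex(ifname);
    addr.sll_halen = 6;
    memcpy(addr.sll_addr, frame, 6); /* destination MAC from the frame */

    ssize_t n = sendto(fd, frame, len, 0,
                       (struct sockaddr *)&addr, sizeof(addr));
    close(fd);
    return n;
}
```

For 802.11 with Radiotap, the same socket type is used on a monitor-mode interface, with the Radiotap header at the front of the buffer instead of an Ethernet header.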
Note 2: All of these gains were measured with maximum optimizations and the strictest compile parameters for the custom C code, but not necessarily for the C libpcap code (making all warnings errors on some of the above systems makes the libpcap code fail to compile, and who wants to debug that?), so the differences may be less dramatic. We need to squeeze out every last ounce of performance to enable some sophisticated packet processing using no more than 5.0 V and 1.5 A, so we'll ultimately be going with a custom ASIC, possibly an FPGA. That being said, it's A LOT of work to get it working without bugs, and we're likely going to end up implementing significant portions of the Ethernet/IP/TCP/UDP stack, so I don't recommend it.
Last note: CPU usage on the MIPS 24K system was about 1/10 for the custom code, but again, I would say the vast majority of that gain came from the processing.
I am new to embedded development, and a while ago I read some code for a PIC24xxxx.
void i2c_Write(char data) {
    while (I2C2STATbits.TBF) {};
    IFS3bits.MI2C2IF = 0;
    I2C2TRN = data;
    while (I2C2STATbits.TRSTAT) {};
    Nop();
    Nop();
}
What do you think about the while conditions? Doesn't the microcontroller burn a lot of CPU on them?
I asked myself this question and, surprisingly, saw a lot of similar code on the internet.
Is there not a better way to do it?
And what about the Nop() calls: why two of them?
Generally, there are two ways to interact with hardware:
Busy wait
Interrupt-based
In your case, to interact with the I2C device, your software first waits for the TBF bit to clear, which means the I2C device is ready to accept a byte to send.
Then your software actually writes the byte into the device and waits for the TRSTAT bit to clear, meaning that the data has been correctly processed by the I2C device.
The code you are showing is written with busy-wait loops, meaning that the CPU actively waits on the hardware. This is indeed a waste of resources, but in some cases (e.g. when the I2C interrupt line is not connected or not available) it is the only way to do it.
If you used interrupts, you would ask the hardware to tell you whenever a given event happens, for instance when the TBF bit clears.
The advantage is that while the hardware is doing its work, you can continue doing other things, or just sleep to save battery.
I'm not an expert in I2C, so the interrupt events I have described are most likely not accurate, but that gives you an idea of why you get two while loops.
Now, regarding the pros and cons: an interrupt-based implementation is more efficient but more difficult to write, since you have to handle asynchronous events coming from the hardware. A busy-wait implementation is easy to write but slower; still, it might be fast enough for you.
Finally, I have no idea why the two Nop()s are needed there. Most likely a tweak that is needed because the CPU would otherwise still go too fast.
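To illustrate, here is a generic C sketch of a busy-wait loop with a bounded spin count, a common hardening of the pattern above. The volatile busy_flag is a stand-in for a hardware status bit such as I2C2STATbits.TBF; on a real PIC you would read the peripheral register instead:

```c
#include <stdbool.h>

/* Stand-in for a hardware status register; in the PIC code this would be
   I2C2STATbits.TBF or I2C2STATbits.TRSTAT (modeled here for illustration). */
static volatile bool busy_flag = true;

/* Busy-wait until the flag clears, or give up after max_spins iterations.
   Returns true on success, false on timeout, so a stuck peripheral
   cannot hang the whole firmware. */
static bool wait_not_busy(volatile bool *flag, unsigned long max_spins)
{
    while (*flag) {
        if (max_spins-- == 0)
            return false; /* peripheral never became ready */
    }
    return true;
}
```

The naked `while (I2C2STATbits.TBF) {};` form is the same loop with an unbounded spin count.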
When doing these kinds of transactions (I2C/SPI) you find yourself in one of two situations: bit-banging, or some form of hardware assist. Bit-banging is easier to implement, read, and debug, and is often quite portable from one chip/family to the next, but it burns a lot of CPU. Then again, microcontrollers are mostly there to be custom hardware that is easier to program than a CPLD or FPGA; they exist to burn CPU cycles pretending to be hardware designs. With I2C or SPI you are trying to create a specific waveform on some number of I/O pins on the device, and at times latch the inputs. The bus has a spec and is sometimes slower than your CPU; sometimes not, and sometimes, once you add software and compiler overhead, you end up just slow enough not to need a timer for delays. But ideally you look at the waveform and simply create it: raise pin X, delay n ms, raise pin Y, delay n ms, drop pin Y, delay 2*n ms, and so on. Those delays can come from tuned loops (count from 0 to 1341) or from polling a timer until it reaches Z ticks of some clock. A massive waste of CPU, but the point is that you really are just programmable hardware, and real hardware would be burning time waiting as well.
When your MCU has a peripheral that assists, it may do much or most of the timing for you, but maybe not all of it; perhaps you have to assert/deassert chip select yourself, and then the SPI logic does the clock and data timing in and out for you. These peripherals are generally very specific to one family from one chip vendor, perhaps common across that vendor's chips but never portable from vendor to vendor, and there is a learning curve. And perhaps in your case, if the CPU is fast enough, it might be possible to start the next operation so quickly that you violate the bus timing, so you have to kill more time (maybe that's why you have those Nop()s).
Think of an MCU as a software-programmable CPLD or FPGA and this waste makes a lot more sense. Unfortunately, unlike a CPLD or FPGA you are single-threaded, so you can't do several trivial things in parallel with clock-accurate timing (after exactly this many clocks, task A switches state and changes an output). Interrupts help, but it's not quite the same: change one line of code and your timing changes.
In this case, especially with the Nop()s, you should probably put a scope on the I2C bus anyway, and once you have it on the scope you can try with and without those calls to see how they affect the waveform. It could also be a bug or quirk in the peripheral: maybe you can't hit some register too fast or the peripheral breaks. Or it could be a workaround for a bug in a chip from five years ago; the bug is long gone, but they just kept reusing the code. You will see that a lot in vendor libraries.
What do you think about the while conditions? Doesn't the microcontroller burn a lot of CPU on them?
No, since the transmit buffer won't stay full for very long.
I asked myself this question and, surprisingly, saw a lot of similar code on the internet.
What would you suggest instead?
Is there not a better way to do it? (I hate crazy loops :D)
Not that I, you, or apparently anyone else knows of. In what way do you think it could be better? The transmit buffer won't stay full long enough to make it useful to retask the CPU.
What about the Nop() calls: why two of them?
The Nop()s ensure that the signal remains stable for long enough, which makes this code safe to call under all conditions. Without them, it would only be safe to call this code if you didn't touch the I2C bus immediately afterwards. But in most cases this code is called in a loop anyway, so it makes much more sense to make it inherently safe.
I am looking to implement some kind of transmission protocol in C, to use on custom hardware. I have the ability to send and receive through RF, but I need some protocol that validates the integrity of the packets sent and received, so I thought it would be a good idea to implement some kind of UDP library.
Of course, if there is any way I can modify an existing UDP or TCP implementation so it works over my RF device, that would be of great help. The only thing I think needs to change is the way a single bit is sent; if I could change that in the UDP library (sys/socket.h), it would save me a lot of time.
UDP does not exist in standard C99 or C11.
It is generally part of some Internet Protocol layer, and these are very complex pieces of software (as soon as you want some performance).
I would suggest using some existing operating-system kernel (e.g. Linux) and writing a network driver (e.g. for the Linux kernel) for your device. Life is too short to write a competitive UDP-like layer (that could take you dozens of years).
Addendum
Apparently, the mention of UDP in the question is confusing. Per your comments (which should go into the question), you just want some serial protocol on a small 8-bit PIC18F4550 microcontroller (32 KB ROM + 2 KB RAM). Without knowing additional constraints, I would suggest a tiny "textual" protocol (e.g. ASCII lines, no more than 128 bytes per line, \n-terminated ....) with some simple hex checksum inside it. In the 1980s, Hayes modems had such things.
What you should then do is define and document the protocol first (e.g. as a BNF grammar for the message lines), then implement it (probably with buffering and finite-state-machine techniques). You might invent some message format like e.g. DOFOO?123,456%BE53 followed by a newline, meaning: run the command DOFOO with arguments 123 and 456, with hex checksum BE53.
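As a hedged illustration of such a protocol, here is a sketch in C that frames a command line with a 16-bit additive hex checksum. The DOFOO?123,456 format comes from the example above, but the choice of a plain byte sum is an assumption, which is why the checksum it produces differs from the BE53 in the example:

```c
#include <stdio.h>
#include <string.h>

/* Simple 16-bit additive checksum over the message body, later rendered as
   four hex digits. A plain byte sum is an assumption for illustration; a
   real protocol might pick CRC-16 instead. */
static unsigned checksum16(const char *body)
{
    unsigned sum = 0;
    while (*body)
        sum = (sum + (unsigned char)*body++) & 0xffff;
    return sum;
}

/* Frame a message as BODY%XXXX, following the hypothetical format in the
   answer. out must be large enough for body + '%' + 4 hex digits + NUL. */
static void frame_message(const char *body, char *out, size_t outsz)
{
    snprintf(out, outsz, "%s%%%04X", body, checksum16(body));
}
```

The receiver would split at the `%`, recompute the checksum over the body, and discard the line on mismatch.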
I'm working on a C program that transmits samples over USB3 for a set period of time (1-10 us), and then receives samples for 100-1000 us. I have a rudimentary pthread implementation where the TX and RX routines are each handled as a thread. The reason for this is that in order to test the actual TX routine, the RX needs to run and sample before the transmitter is activated.
Note that I have very little C experience outside of embedded applications and this is my first time dabbling with pthread.
My question is: since I know exactly how many samples I need to transmit and receive, how can I, for example, start the RX thread once the TX thread is done executing, and vice versa? How can I ensure that the timing stays consistent? Sampling at 10 MHz imposes some harsh timing requirements.
Thanks!
EDIT:
To provide a little more detail: my device is a bladeRF x40 SDR, and communication with it is handled by an FX3 microcontroller over a USB3 connection. I'm running Xubuntu 14.04. Processing, scheduling, and configuration, however, are handled by a C program running on the PC.
You don't say anything about your platform, except that it supports pthreads.
So, assuming Linux, you're going to have to realize that in general Linux is not a real-time operating system, and what you're doing sure sounds as if it has real-time timing requirements.
There are real-time variants of Linux; I'm not sure how well they'd suit your needs. You might also be able to achieve better performance by doing the work in a kernel driver, but then you won't have access to pthreads, so you're going to have to be a bit more low-level.
Thought I'd post my solution.
While the next build of the bladeRF firmware and FPGA image will include the option to attach metadata (timestamps) to the synchronous interface, until then there is no real way for me to know at which instants certain events occurred.
What I do know is my sampling rate, and exactly how many samples I need to transmit and receive at which times relative to each other. Therefore, by using condition variables (with pthreads), I can signal my receiver to start receiving samples at the desired instant. Since the TX and RX operations happen in a very specific sequence, I can calculate delays by counting samples and dividing by the sampling rate, which has proven to be within 95-98% accurate.
This obviously means that, since my TX and RX threads run simultaneously, there are chunks of useless data within the received samples, and I have another routine in place to discard them.
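A minimal sketch of the condition-variable handoff described above might look like this. The bladeRF calls are stubbed out as counters; tx_thread and rx_thread simply alternate bursts under a shared turn flag:

```c
#include <pthread.h>
#include <stdbool.h>

/* Shared handoff state: tx_turn says whose move it is. This only sketches
   the TX-then-RX sequencing; real code would push/pull samples through the
   bladeRF synchronous interface instead of incrementing counters. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool tx_turn = true;
static int tx_done = 0, rx_done = 0;
enum { BURSTS = 4 };

static void *tx_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < BURSTS; i++) {
        pthread_mutex_lock(&lock);
        while (!tx_turn)
            pthread_cond_wait(&cond, &lock);
        tx_done++;            /* transmit one burst of samples here */
        tx_turn = false;      /* hand off to the receiver */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *rx_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < BURSTS; i++) {
        pthread_mutex_lock(&lock);
        while (tx_turn)
            pthread_cond_wait(&cond, &lock);
        rx_done++;            /* receive one burst of samples here */
        tx_turn = true;       /* hand back to the transmitter */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void run_sequence(void)
{
    pthread_t tx, rx;
    pthread_create(&tx, NULL, tx_thread, NULL);
    pthread_create(&rx, NULL, rx_thread, NULL);
    pthread_join(tx, NULL);
    pthread_join(rx, NULL);
}
```

The while-loop around each pthread_cond_wait guards against spurious wakeups, which POSIX explicitly permits.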
My requirement is to store data in the kernel. The data are incoming packets from the network, which may differ in size and have to be stored for, say, a 250 ms window, and there should be 5 such candidates for which kernel-level memory management is required. Since packets arrive very fast, and kmalloc and kfree have timing overhead, my approach is to allocate a large block, say 2 MB, for each candidate up front. Any help with that?
sk_buffs are the generic answer for anything network-related, or, as Mike points out, a kernel memory cache is the even more generic answer to your question. However, I believe you may have put a solution before the question.
The bottleneck with LTE/HSDPA/GSM is the driver and how you get data from the device to the CPU, which depends on how the hardware is connected. Are you using SPI, UART, SDHC, USB, PCI?
Also, at least with HSDPA, you need a PPP connection. Isn't LTE the same? Ethernet is not the model to use in this case; typically you need to emulate a high-speed tty. Also, n_gsm supplies a network interface; I am not entirely familiar with this interface, but I suspect it is there to support LTE. It is not well documented. There is also the Option USB serial driver, if that is the hardware you are using. An example patch uses n_gsm to handle LTE; I believe this patch was reworked into the current n_gsm network support.
You need to tell us more about your hardware.
As already noted within the comments:
struct sk_buff was created for that exact purpose;
see for example http://www.linuxfoundation.org/collaborate/workgroups/networking/skbuff
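The idea behind a kernel memory cache (kmem_cache_create/kmem_cache_alloc) is to pay for allocation once, up front, so the per-packet hot path is just a free-list pop and push. Here is a hedged userspace sketch of that pattern; real kernel code would use the kmem_cache API or sk_buff recycling instead:

```c
#include <stddef.h>

/* Fixed-size slot pool: all memory is allocated up front, so the hot path
   is a free-list pop/push instead of kmalloc/kfree. Slot size and count
   are arbitrary illustrative values. */
enum { SLOT_SIZE = 2048, NSLOTS = 1024 };

static unsigned char pool[NSLOTS][SLOT_SIZE];
static int free_list[NSLOTS];
static int free_top;

static void pool_init(void)
{
    for (int i = 0; i < NSLOTS; i++)
        free_list[i] = i;
    free_top = NSLOTS;
}

/* O(1) allocation: pop a slot index off the free list. */
static unsigned char *pool_get(void)
{
    if (free_top == 0)
        return NULL;          /* pool exhausted: drop the packet or block */
    return pool[free_list[--free_top]];
}

/* O(1) release: push the slot index back on the free list. */
static void pool_put(unsigned char *slot)
{
    free_list[free_top++] = (int)((slot - &pool[0][0]) / SLOT_SIZE);
}
```

With 2 MB per candidate, 1024 slots of 2 KB each matches the budget in the question; variable-size packets would just occupy part of a slot.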
I'm writing a network application that reads packets from UDP sockets and then decrypts them using OpenSSL.
The main function looks like this:
receive() {
    while (1) {
        read(udp_sock);
        decrypt_packet();
    }
}
The program worked fine until I added encryption. Now a lot of packets are lost between the kernel buffer and my application (netstat -su shows RcvbufErrors: 77123 and growing). The packets are rather big (60 KB) and I use it on 1 Gbps Ethernet (thus the problem begins after exceeding 100 Mbps).
This sounds normal: decryption takes too much time and packets come too fast. The problem is, CPU usage never exceeds 30% on either the sender or the receiver.
Problem disappears after commenting out this statement in decrypt_packet():
AES_ctr128_encrypt();
My question is: is it possible that OpenSSL uses some set of instructions that is not counted toward CPU usage (I use htop and the GNOME system monitor)? If not, what else could cause such packet loss when CPU power is still available for processing?
How many CPU cores does your system have? Is your code single threaded? It could be maxing out a single core and thus using only 25% of the available CPU.
Using a profiler, I was able to solve the problem. OpenSSL uses a special set of instructions, which execute in a special part of the CPU. The CPU usage shown was low, but the core was in fact busy doing decryption, so my application couldn't read from the system buffer fast enough.
I moved decryption to another thread, which solved the problem. And now the thread handling all the decryption shows 0% CPU usage all the time.
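A sketch of that split: one thread drains the socket into a bounded queue, another pulls packets off and decrypts them. The queue below is illustrative (fixed 1500-byte slots, pthread condition variables); the real code would call into OpenSSL from the worker:

```c
#include <pthread.h>
#include <string.h>

/* Bounded ring buffer handing packets from the socket-reader thread to a
   decrypt worker, so slow decryption no longer blocks the read loop.
   Slot size and queue depth are illustrative values. */
enum { QSIZE = 64, PKT = 1500 };

static unsigned char queue[QSIZE][PKT];
static int qhead, qtail, qcount;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t nonfull = PTHREAD_COND_INITIALIZER;

/* Called by the reader thread right after read() returns a packet. */
static void enqueue(const unsigned char *pkt)
{
    pthread_mutex_lock(&qlock);
    while (qcount == QSIZE)
        pthread_cond_wait(&nonfull, &qlock);
    memcpy(queue[qtail], pkt, PKT);
    qtail = (qtail + 1) % QSIZE;
    qcount++;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&qlock);
}

/* Called by the decrypt worker; it would then run the OpenSSL decryption. */
static void dequeue(unsigned char *pkt)
{
    pthread_mutex_lock(&qlock);
    while (qcount == 0)
        pthread_cond_wait(&nonempty, &qlock);
    memcpy(pkt, queue[qhead], PKT);
    qhead = (qhead + 1) % QSIZE;
    qcount--;
    pthread_cond_signal(&nonfull);
    pthread_mutex_unlock(&qlock);
}
```

If the worker falls behind for long, the queue fills and the reader blocks, so the queue depth bounds memory while smoothing out bursts.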