protocol handler using dev_add_pack consumes cpu - c

I wrote a kernel module and used dev_add_pack to get all the incoming packets.
According to given filter rules, if packet matches, I am forwarding it to user space.
When I load this kernel module and send UDP traffic using sipp,
the ksoftirqd process appears and starts consuming CPU. (I am checking this with the top command.)
Is there any way to save CPU?

I guess you registered your packet_type structure with the ETH_P_ALL type. In that case your packet_type->func is probably the bottleneck: either it consumes a lot of CPU itself, or it breaks the existing protocol stack model and triggers other registered packet_type functions, which then consume CPU. The way to save CPU is therefore to optimize your packet_type->func. If the function is too complicated, consider splitting it into several parts: keep only the simple part in packet_type->func, which runs in softirq (ksoftirqd) context, and move the complicated parts to another kernel thread context (you can create a new thread in your kernel module if needed).

Related

Linux UART imx8 how to quickly detect frame end?

I have an imx8 module running Linux on my PCB and I would like some tips or pointers on how to modify the UART driver so that I can detect the end of a frame very quickly (less than 2 ms) from my user space C application. The UART frame does not have any specific ending character or frame length. The standard VTIME of 100 ms is much too long.
I am reading from a SIM card. I have no control over the data, and no control over its size or content. I just need to detect the end of frame very quickly. The frame could be 3 bytes or 500. The SIM card reacts to data that it receives: typically I send it a couple of bytes and it responds a couple of ms later with an uninterrupted string of bytes of unknown length. I am using an iMX8MP.
I thought about using the IDLE interrupt to detect the frame end. Turn it on when any byte is received and off once the idle interrupt fires. How can I propagate this signal back to user space? Or is there an existing method to do this?
Waiting for an "idle" is a poor way to do this.
Use termios to set raw mode with VTIME of 0 and VMIN of 1. This will allow the userspace app to get control as soon as a single byte arrives. See:
How to read serial with interrupt serial?
How do I use termios.h to configure a serial port to pass raw bytes?
How to open a tty device in noncanonical mode on Linux using .NET Core
But, you need a "protocol" of sorts, so you can know how much to read to get a complete packet. You prefix all data with a struct that has (e.g.) a type and a payload length. Then, you send "payload length" bytes. The receiver reads that fixed length struct and then reads the payload, which is "payload length" bytes long. This struct is always sent (in both directions).
See my answer: thread function doesn't terminate until Enter is pressed for a working example.
What you have/need is similar to doing socket programming using a stream socket except that the lower level is the UART rather than an actual socket.
My example code uses sockets, but if you change the low level to open your uart in raw mode (as above), it will be very similar.
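To make the "type plus payload length" prefix concrete, here is a small sketch. It is not taken from the linked answer: the names frame_hdr, frame_hdr_pack and frame_hdr_unpack, the 16-bit field widths, and the big-endian wire order are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Size of the header on the wire, in bytes (assumed layout). */
enum { FRAME_HDR_SIZE = 4 };

/* Hypothetical fixed-size prefix sent before every payload. */
struct frame_hdr {
    uint16_t type;    /* what kind of message follows */
    uint16_t length;  /* number of payload bytes that follow */
};

/* Serialize the header byte by byte (big-endian), so the wire format
 * does not depend on host endianness or struct padding. */
void frame_hdr_pack(const struct frame_hdr *h, uint8_t out[FRAME_HDR_SIZE])
{
    out[0] = (uint8_t)(h->type >> 8);
    out[1] = (uint8_t)(h->type & 0xFF);
    out[2] = (uint8_t)(h->length >> 8);
    out[3] = (uint8_t)(h->length & 0xFF);
}

/* Parse a header previously written by frame_hdr_pack(). */
void frame_hdr_unpack(const uint8_t in[FRAME_HDR_SIZE], struct frame_hdr *h)
{
    h->type = (uint16_t)((in[0] << 8) | in[1]);
    h->length = (uint16_t)((in[2] << 8) | in[3]);
}
```

The receiver first reads exactly FRAME_HDR_SIZE bytes, unpacks them, and then reads exactly length more bytes, so it never has to guess where a frame ends.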
UPDATE:
How quickly after the frame finished would I have the data at the application level? When I try to read my random length frames, currently reading in 512 byte chunks, it will sometimes read all the frame in one go, other times it reads the frame broken up into chunks. – Engo
In my link, in the last code block, there is an xrecv function. It shows how to read partial data that comes in chunks.
That is what you'll need to do.
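The linked xrecv is not reproduced here. The following is only a sketch in the same spirit, with the hypothetical name read_full: it keeps calling read() until the requested number of bytes has arrived, so it does not matter whether the data shows up in one chunk or several.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly `want` bytes from fd unless EOF or an error occurs.
 * Returns the number of bytes read (less than `want` only on EOF),
 * or -1 on error with errno set by read(). */
ssize_t read_full(int fd, void *buf, size_t want)
{
    unsigned char *p = buf;
    size_t got = 0;

    while (got < want) {
        ssize_t n = read(fd, p + got, want - got);

        if (n > 0) {
            got += (size_t)n;   /* partial chunk: keep going */
        } else if (n == 0) {
            break;              /* EOF: return what we have */
        } else if (errno == EINTR) {
            continue;           /* interrupted by a signal: retry */
        } else {
            return -1;          /* real error */
        }
    }
    return (ssize_t)got;
}
```

This only works when the caller knows how many bytes to expect, which is why the length prefix (or a length implied by the command that was sent) matters.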
Things missing from your post:
You didn't post which imx8 board/configuration you have. And, which SIM card you have (the protocols are card specific).
And, you didn't post your other code [or any code] that drives the device and illustrates the problem.
How much time must pass without receiving a byte before the [uart] device is "idle"? That is, (e.g.) the device sends 100 bytes and is then finished. How many byte times does one wait before considering the device to be "idle"?
What speed is the UART running at?
A thorough description of the device, its capabilities, and how you intend to use it.
A uart device doesn't have an "idle" interrupt. From some imx8 docs, the DMA device may have an "idle" interrupt and the uart can be driven by the DMA controller.
But, I looked at some of the linux kernel imx8 device drivers, and, AFAICT, the idle interrupt isn't supported.
I need to read everything in one go and get this data within a few hundred microseconds.
Based on the scheduling granularity, it may not be possible to guarantee that a process runs in a given amount of time.
It is possible to help this a bit. You can change the process to use the R/T scheduler (e.g. SCHED_FIFO). Also, you can use sched_setaffinity to lock the process to a given CPU core. There is a corresponding call to lock IRQ interrupts to a given CPU core.
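As a rough sketch of the scheduler part only (CPU affinity is not shown), the calling process can ask for SCHED_FIFO as follows. The helper names clamp_fifo_priority and request_fifo are hypothetical, and the call normally needs root or CAP_SYS_NICE, so the sketch reports the errno value instead of assuming success.

```c
#include <errno.h>
#include <sched.h>

/* Clamp a requested priority into the range the system allows
 * for SCHED_FIFO. */
int clamp_fifo_priority(int wanted)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    if (wanted < lo)
        return lo;
    if (wanted > hi)
        return hi;
    return wanted;
}

/* Try to move the calling process to SCHED_FIFO.
 * Returns 0 on success, or the errno value on failure
 * (typically EPERM when the process lacks the privilege). */
int request_fifo(int wanted)
{
    struct sched_param sp = { .sched_priority = clamp_fifo_priority(wanted) };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return errno;
    return 0;
}
```

Even with a real-time policy, this narrows the wakeup latency rather than guaranteeing it.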
I assume that the SIM card acts like a [passive] device (like a disk). That is, you send it a command, and it sends back a response or does a transfer.
Based on what command you give it, you should know how many bytes it will send back. Or, it should tell you how many optional bytes it will send (similar to the struct in my link).
The method you've described (e.g.) wait for idle, then "race" to get/process the data [for which you don't know the length] is fraught with problems.
Even if you could get it to work, it will be unreliable. At some point, system activity will be just high enough to delay wakeup of your process and you'll miss the window.
If you're reading data, why must you process the data within a fixed period of time (e.g. 100 us)? What happens if you don't? Does the device catch fire?
Without more specific information, there are probably other ways to do this.
I've programmed such systems before that relied on data races. They were unreliable. Either missing data. Or, for some motor control applications, device lockup. The remedy was to redesign things so that there was some positive/definitive way to communicate that was tolerant of delays.
Otherwise, I think you've "fallen in love" with "idle interrupt" idea, making this an XY problem: https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem

Call traffic control (tc) from within Linux kernel

There is a userspace util called tc(8) for traffic shaping, i.e.
tc qdisc add dev eth0 root tbf rate 10mbit latency 100ms burst 5000.
The internal implementation of the tc command uses netlink to send specific messages to the kernel which in turn will change things accordingly.
However, there is no public interface for the kernel code for this specific procedure - as in, there is no public API like tc_qdisc_add(x,y,z) - as everything is depending on the data from the netlink message itself.
So, is there a trick to simplify the process and simulate a message from the kernel? Is there a way to bypass the userspace call to tc and get the same outcome just from a kernel context?
is there a trick to simplify the process and simulate a message from the kernel?
I don't see any way to make it simple.
Leaving aside the implementation details of specific tc commands and looking only at the existing API inside the kernel, all the code related to netlink communication and adding qdiscs is located in the net/sched subdirectory. The main function for registering a qdisc is register_qdisc(), located in net/sched/sch_api.c. The basic qdiscs and the netlink ops are registered in pktsched_init().
Qdisc operations are described via struct Qdisc_ops and include init, enqueue, change, and so on.
Knowing this, we can look at how this is implemented in the tbf module (net/sched/sch_tbf.c). Its operations are described by tbf_qdisc_ops. One of them is the change operation, which is normally invoked along the path tc_modify_qdisc() -> qdisc_change() -> tbf_change().
So, depending on what exactly you want to tune, you can obtain the particular qdisc, build an appropriate netlink message (struct nlmsghdr, as is done in user mode) and invoke e.g. the ->change(...) method of the qdisc.
The answer does not claim to be correct. I just tried to clarify the situation a little bit.

Where, in the e1000 linux code, can I zeroize rx/tx network packets?

I'd need to know where can I make zeroization for the received/transmitted network packets in the e1000 linux driver. I need to know this to pass one compliance requirement, but I'm not able to find in the code of the e1000 where to do zeroization of the network packet buffer (or if it already does the zeroization somewhere, that would be great)
I saw that it does ring zeroization when the interface goes up or down in the kernel in the file Intel_LAN_15.0.0_Linux_Source_A00/Source/base_driver/e1000e-2.4.14/src/netdev.c, in the e1000_clean_rx_ring() and e1000_clean_tx_ring() functions:
/* Zero out the descriptor ring */
memset(rx_ring->desc, 0, rx_ring->size);
But I'm not able to find where it should be done for each packet that the system receives/send.
So, does anybody know where is the place in the code where the buffer zeroization for the tx/rx packets should happen? I bet that it will introduce some overhead, but I have to do it anyway.
We're using the intel EF multi port network card: https://www-ssl.intel.com/content/www/us/en/network-adapters/gigabit-network-adapters/gigabit-et-et2-ef-multi-port-server-adapters-brief.html?
and the kernel 3.4.107
We're using the linux-image-3.4.107-0304107-generic_3.4.107-0304107.201504210712_amd64.deb kernel
EDIT: @skgrrwasme correctly pointed out that the e1000_clean_tx_ring and e1000_clean_rx_ring functions seem to do the zeroize work, but as it is done only when the hw is down it is not valid for our compliance need.
So, it seems that the functions that do the work for each packet are e1000_clean_rx_irq and e1000_clean_tx_irq, but those functions don't zeroize data: they only free memory and don't do a memset() with 0 to overwrite it (and that's what is required). Since it is enough to zeroize data on rx or tx, one candidate place is e1000_unmap_and_free_tx_resource(), which is called from e1000_clean_tx_irq(), but in fact it only frees the buffer and does not zeroize it:
    if (buffer_info->skb) {
        dev_kfree_skb_any(buffer_info->skb);
        buffer_info->skb = NULL;
    }
So what I think is that we can put the memset inside dev_kfree_skb_any(). That function calls one of two functions:
    void dev_kfree_skb_any(struct sk_buff *skb)
    {
        if (in_irq() || irqs_disabled())
            dev_kfree_skb_irq(skb);
        else
            dev_kfree_skb(skb);
    }
So, something easy would be a call to skb_recycle_check(skb); that will do a:
memset(skb, 0, offsetof(struct sk_buff, tail));
Does this make sense? I think that with this, the memory will be overwritten with zeroes, and the work will be done, but I'm not sure...
TL;DR
As far as I can tell, both the transmit and receive buffers are already cleaned by the driver. I don't think you need to do anything.
Longer Answer
I don't think you have to worry about it. The transmit and receive buffer cleaning functions, e1000_clean_tx_irq and e1000_clean_rx_irq, seem to be called in every interrupt configuration. Interrupts can be triggered with any of the following signaling methods: legacy, MSI, or MSI-X. Ring buffer cleaning appears to happen in every interrupt mode, but the cleaning functions are called from different locations.
Since you have two types of transfers (transmit and receive) and three different types of interrupt invocations (Legacy, MSI, and MSI-X), you have a total of six scenarios where you need to make sure things are cleaned. Fortunately, five of the six situations handle the packets by scheduling a job for NAPI. These scenarios are transmit and receive for Legacy and MSI interrupts, and receive for MSI-X. Part of NAPI handling those packets is calling the e1000_clean function as a callback. If you look at the code, you'll see that it calls the buffer cleaning functions for both TX and RX.
The outlier is the MSI-X TX handler. However, it seems to directly call the TX buffer cleaning function, rather than having NAPI handle it.
Here are the relevant interrupt handlers that weren't specifically listed above:
Legacy (both RX and TX)
MSI (both RX and TX)
MSI-X RX
Notes
All of my function references will open a file in the e1000e driver called netdev.c. They will open a window in the Linux Cross Reference database.
This post discusses the e1000e driver, but some of the function names are "e1000...". I think a lot of the e1000 code was reused in the newer e1000e driver, so some of the names carried over. Just know that it isn't a typo.
The e1000_clean_tx_ring and e1000_clean_rx_ring functions that you referred to appear to be called only when the driver is trying to free resources or the hardware is down, not during actual packet handling. The two I referenced above are called during packet handling, though. I'm not sure exactly what the difference between them is, but they appear to get the job done.

State machine event generation in multi-processor architecture

I'm having a small architecture argument with a coworker at the moment. I was hoping some of you could help settle it by strongly suggesting one approach over another.
We have a DSP and Cortex-M3 coupled together with shared memory. The DSP receives requests from the external world and some of these requests are to execute certain wireless test functionality which can only be done on the CM3. The DSP writes to shared memory, then signals the CM3 via an interrupt. The shared memory indicates what the request is along with any necessary data required to perform the request (channel to tune to, register of RF chip to read, etc).
My preference is to generate a unique event ID for each request that can occur in the interrupt. Then before leaving the interrupt pass the event on to the state machine's event queue, which would get handled in the thread devoted to RF activity.
My coworker would instead like to pass a single event ID (generic RF command) to the state machine and have the parsing of the shared memory area occur after receiving this event ID in the state machine. After parsing, then you would know the specific command that you need to act on.
I dislike this approach because you will be doing the parsing of shared memory in whatever state you happen to be in. You can make this a function, but it's still processing that should be state-independent. She doesn't like the idea of parsing shared memory in the interrupt.
Any comments on the better approach? If it helps, we're using the QP framework from Miro Samek for state machine implementation.
EDIT: moved statechart to ftp://hiddenoaks.asuscomm.com/Statechart.bmp
Here's a compromise:
pass a single event ID (generic RF command) to the state machine from the interrupt
create an action_function that "parses" the shared memory and returns a specific command
guard RF_EVENT transitions in the statechart with [parser_action_func() == RF_CMD_1] etc.
The statechart code generator should be smart enough to execute parser_action_func() only once per RF_EVENT. (Dunno if QP framework is that smart).
This has the same statechart semantics of your "unique event ID for each request," and avoids parsing the shared memory in the interrupt handler.
ADDENDUM
The difference in the statechart is N transitions labeled
----RF_EVT_CMD_1---->
----RF_EVT_CMD_2---->
...
----RF_EVT_CMD_N---->
versus
----RF_EVT[cmd()==CMD_1]---->
----RF_EVT[cmd()==CMD_2]---->
...
----RF_EVT[cmd()==CMD_N]---->
where cmd() is the parsing action function.
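The guarded-transition variant can be sketched in plain C as follows. This does not use the QP framework API. Every name here (shared_mem, rf_sm, parse_cmd, rf_dispatch, the states and command codes) is hypothetical and only illustrates the idea: the interrupt posts one generic RF_EVT, and the state machine evaluates the parsing function once per event and uses its result as the guard.

```c
#include <stdint.h>

/* Hypothetical command codes written by the DSP into shared memory. */
enum rf_cmd { RF_CMD_NONE = 0, RF_CMD_TUNE = 1, RF_CMD_READ_REG = 2 };

/* The single generic signal posted from the interrupt handler. */
enum rf_signal { RF_EVT = 1 };

enum rf_state { ST_IDLE, ST_TUNING, ST_READING };

/* Assumed layout of the shared memory mailbox. */
struct shared_mem {
    volatile uint32_t cmd;
    volatile uint32_t arg;
};

struct rf_sm {
    enum rf_state state;
    uint32_t last_arg;
};

/* The "parsing action function": state-independent, runs in thread
 * context rather than in the interrupt. */
static enum rf_cmd parse_cmd(const struct shared_mem *shm)
{
    switch (shm->cmd) {
    case RF_CMD_TUNE:     return RF_CMD_TUNE;
    case RF_CMD_READ_REG: return RF_CMD_READ_REG;
    default:              return RF_CMD_NONE;
    }
}

/* Dispatch one event. The guard value is computed once per RF_EVT. */
void rf_dispatch(struct rf_sm *sm, enum rf_signal sig,
                 const struct shared_mem *shm)
{
    if (sig != RF_EVT)
        return;

    enum rf_cmd cmd = parse_cmd(shm);

    switch (sm->state) {
    case ST_IDLE:
        if (cmd == RF_CMD_TUNE) {            /* RF_EVT[cmd()==TUNE] */
            sm->last_arg = shm->arg;
            sm->state = ST_TUNING;
        } else if (cmd == RF_CMD_READ_REG) { /* RF_EVT[cmd()==READ_REG] */
            sm->last_arg = shm->arg;
            sm->state = ST_READING;
        }
        break;
    case ST_TUNING:
    case ST_READING:
        /* In this sketch, busy states ignore new commands. */
        break;
    }
}
```

Which states accept which commands is still decided per state, exactly as with unique event IDs. Only the place where the shared memory is decoded moves out of the interrupt.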

Overlapping communications with computations in MPI (mvapich2) for large messages

I have a very simple code, a data decomposition problem in which, at each cycle of a loop, each process sends two large messages to the ranks before and after itself. I run this code on a cluster of SMP nodes (AMD Magny-Cours, 32 cores per node, 8 cores per socket). I have been optimizing this code for a while. I have used pgprof and tau for profiling, and it looks to me that the bottleneck is the communication. I have tried to overlap the communication with the computations in my code, but it looks like the actual communication starts only when the computations finish :(
I use persistent communication in ready mode (MPI_Rsend_init), and the bulk of the computation is done between MPI_Startall and MPI_Waitall. The code looks like this:
int main(int argc, char *argv[])
{
    some definitions;
    some initializations;
    MPI_Init(&argc, &argv);
    MPI_Rsend_init( channel to the rank before );
    MPI_Rsend_init( channel to the rank after );
    MPI_Recv_init( channel to the rank before );
    MPI_Recv_init( channel to the rank after );
    for (timestep=0; timestep<Time; timestep++)
    {
        prepare data for send;
        MPI_Startall();
        do computations;
        MPI_Waitall();
        do work on the received data;
    }
    MPI_Finalize();
    return 0;
}
Unfortunately the actual data transfer does not start until the computations are done, and I don't understand why. The network uses a QDR InfiniBand interconnect and mvapich2. Each message is 23 MB (46 MB is sent in total). I tried to change the message passing to eager mode, since the memory in the system is large enough. I use the following flags in my job script:
MV2_SMP_EAGERSIZE=46M
MV2_CPU_BINDING_LEVEL=socket
MV2_CPU_BINDING_POLICY=bunch
This gives me an improvement of about 8%, probably because of better placement of the ranks inside the SMP nodes, but the communication problem remains. My question is: why can't I effectively overlap the communications with the computations? Is there a flag that I should use and am missing? I know something is wrong, but whatever I have done has not been enough.
Because of the order of ranks inside the SMP nodes, the actual message size between the nodes is also 46 MB (2 x 23 MB), and the ranks form a loop. Can you please help me? To see the flags that other users use I checked /etc/mvapich2.conf, but it is empty.
Is there any other method that I should use? Do you think one-sided communication gives better performance? I feel there is a flag or something that I'm not aware of.
Thanks a lot.
There is something called progression of operations in MPI. The standard allows for non-blocking operations to only be progressed to completion once the proper testing/waiting call was made:
A nonblocking send start call initiates the send operation, but does not complete it. The send start call can return before the message was copied out of the send buffer. A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed. Similarly, a nonblocking receive start call initiates the receive operation, but does not complete it. The call can return before a message is stored into the receive buffer. A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed.
(words in bold are also bolded in the standard text; emphasis added by me)
Although this text comes from the section about non-blocking communication (§3.7 of MPI-3.0; the text is exactly the same in MPI-2.2), it also applies to persistent communication requests.
I haven't used MVAPICH2, but I am able to speak about how things are implemented in Open MPI. Whenever a non-blocking operation is initiated or a persistent communication request is started, the operation is added to a queue of pending operations and is then progressed in one of the two possible ways:
if Open MPI was compiled without an asynchronous progression thread, outstanding operations are progressed on each call to a send/receive or to some of the wait/test operations;
if Open MPI was compiled with an asynchronous progression thread, operations are progressed in the background even if no further communication calls are made.
The default behaviour is not to enable the asynchronous progression thread, as doing so increases the latency of the operations somewhat.
The MVAPICH site is unreachable at the moment from here, but earlier I saw a mention of asynchronous progress in the features list. Probably that's where you should start from - search for ways to enable it.
Also note that MV2_SMP_EAGERSIZE controls the shared memory protocol eager message size and does not affect the InfiniBand protocol, i.e. it can only improve the communication between processes that reside on the same cluster node.
By the way, there is no guarantee that the receive operations would be started before the ready send operations in the neighbouring ranks, so they might not function as expected as the ordering in time is very important there.
For MPICH, you can set the MPICH_ASYNC_PROGRESS=1 environment variable when running mpiexec/mpirun. This spawns a background thread that performs the asynchronous progress.
MPICH_ASYNC_PROGRESS - Initiates a spare thread to provide
asynchronous progress. This improves progress semantics for
all MPI operations including point-to-point, collective,
one-sided operations and I/O. Setting this variable would
increase the thread-safety level to
MPI_THREAD_MULTIPLE. While this improves the progress
semantics, it might cause a small amount of performance
overhead for regular MPI operations.
from MPICH Environment Variables
I have tested on my cluster with MPICH-3.1.4, it worked! I believe MVAPICH will also work.
