Resubmitting DMA Engine transactions

I'm writing a custom high-speed Linux SPI driver for an embedded SoC. To send data to the SPI peripheral (DMA_MEM_TO_DEV), I'm using the Linux DMA Engine API:
https://www.kernel.org/doc/Documentation/dmaengine/client.txt
Based on the documentation, the steps for setting up and executing a DMA transaction are:
1. Allocate a DMA slave channel: dma_request_channel
2. Set slave- and controller-specific parameters: dmaengine_slave_config
3. Get a descriptor for the transaction: dmaengine_prep_slave_single
4. Submit the transaction: dmaengine_submit
5. Issue pending requests and wait for callback notification: dma_async_issue_pending
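For illustration, steps 1 and 2 might look roughly like this (a sketch only: the capability-mask request is one of several ways to get a channel, my_spi_fifo_phys is a made-up FIFO address, and error handling is trimmed):

/* Steps 1-2, done before any transfers (inside probe(), say). */
struct dma_chan *chan;
struct dma_slave_config cfg = { 0 };
dma_cap_mask_t mask;

dma_cap_zero(mask);
dma_cap_set(DMA_SLAVE, mask);
chan = dma_request_channel(mask, NULL, NULL);       /* step 1 */

cfg.direction = DMA_MEM_TO_DEV;
cfg.dst_addr = my_spi_fifo_phys;                    /* made-up FIFO address */
cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
cfg.dst_maxburst = 1;
dmaengine_slave_config(chan, &cfg);                 /* step 2 */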
I have this working for single DMA transactions. But I need to send multiple DMA transactions from the same memory location (dma_addr_t buf) of the same size (size_t len) based on some hardware flow control (GPIOs).
For starters, I tried redoing steps 1-5 for every DMA transaction: every time the flow-control GPIO IRQ triggers, I reallocate a DMA slave channel, re-set the slave- and controller-specific parameters, and so on.
This seems to work too, but I am not sure if it's the most efficient way.
I am wondering if I can instead just re-submit the transaction (dmaengine_submit) and issue it again (dma_async_issue_pending). Would this be more efficient?
I can't seem to find any information on how to resubmit the exact same DMA request anywhere in the kernel documentation.

Steps 1 and 2 need not be repeated; steps 3 to 5 can be repeated for multiple transactions. For example, see the patch here: the dspi_dma_xfer function can call dspi_next_xfer_dma_submit multiple times, and steps 3 to 5 are redone each time. Since you want dma_addr_t buf to stay the same, as far as I understand this is what you want.
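Steps 3 to 5 can then live in a small helper that the flow-control IRQ calls each time. A sketch, assuming chan, buf and len come from the one-time setup (resubmit_tx and my_tx_done are made-up names):

static void my_tx_done(void *arg)               /* hypothetical completion hook */
{
        /* e.g. signal completion, re-arm flow control */
}

static void resubmit_tx(struct dma_chan *chan, dma_addr_t buf, size_t len)
{
        struct dma_async_tx_descriptor *desc;
        dma_cookie_t cookie;

        desc = dmaengine_prep_slave_single(chan, buf, len, DMA_MEM_TO_DEV,
                                           DMA_PREP_INTERRUPT);  /* step 3 */
        if (!desc)
                return;

        desc->callback = my_tx_done;
        desc->callback_param = NULL;

        cookie = dmaengine_submit(desc);                         /* step 4 */
        if (dma_submit_error(cookie))
                return;

        dma_async_issue_pending(chan);                           /* step 5 */
}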

You need to at least re-initialize the memory source address, destination address, and length registers, as they will change over the course of a DMA transaction. I think it would be just about as fast to redo everything each time: it's not very much to set up, so the time spent is minuscule.

Related

Linux UART imx8 how to quickly detect frame end?

I have an imx8 module running Linux on my PCB, and I would like some tips or pointers on how to modify the UART driver so that I can detect the end of a frame very quickly (in less than 2 ms) from my user-space C application. The UART frame does not have any specific ending character or fixed length. The standard VTIME of 100 ms is much too long.
I am reading from a SIM card; I have no control over the data, its size, or its content. I just need to detect the end of the frame very quickly. The frame could be 3 bytes or 500. The SIM card reacts to data it receives: typically I send it a couple of bytes and it responds a couple of milliseconds later with an uninterrupted string of bytes of unknown length. I am using an iMX8MP.
I thought about using the IDLE interrupt to detect the frame end: turn it on when any byte is received, and off once the idle interrupt fires. How can I propagate this signal back to user space? Or is there an existing method to do this?
Waiting for an "idle" is a poor way to do this.
Use termios to set raw mode with VTIME of 0 and VMIN of 1. This will allow the userspace app to get control as soon as a single byte arrives. See:
How to read serial with interrupt serial?
How do I use termios.h to configure a serial port to pass raw bytes?
How to open a tty device in noncanonical mode on Linux using .NET Core
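A minimal sketch of that setup (the device path is an assumption, and error checks are trimmed):

#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

int open_uart_raw(const char *path)      /* e.g. "/dev/ttymxc1" (assumption) */
{
    int fd = open(path, O_RDWR | O_NOCTTY);
    if (fd < 0)
        return -1;

    struct termios tio;
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);                     /* raw mode, no canonical processing */
    tio.c_cc[VMIN] = 1;                  /* wake as soon as 1 byte arrives */
    tio.c_cc[VTIME] = 0;                 /* no inter-byte timer */
    tcsetattr(fd, TCSANOW, &tio);
    return fd;
}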
But you need a "protocol" of sorts so you can know how much to read to get a complete packet. You prefix all data with a struct that has (e.g.) a type and a payload length. Then you send "payload length" bytes. The receiver reads that fixed-length struct and then reads the payload, which is "payload length" bytes long. This struct is always sent (in both directions).
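For example (a sketch; the struct and field names are made up), the header could be:

#include <stdint.h>

/* Hypothetical framing header, sent before every payload in both directions. */
struct frame_hdr {
    uint32_t type;      /* message type */
    uint32_t length;    /* number of payload bytes that follow */
} __attribute__((packed));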
See my answer: thread function doesn't terminate until Enter is pressed for a working example.
What you have/need is similar to doing socket programming using a stream socket except that the lower level is the UART rather than an actual socket.
My example code uses sockets, but if you change the low level to open your uart in raw mode (as above), it will be very similar.
UPDATE:
How quickly after the frame finishes would I have the data at the application level? When I try to read my random-length frames now, reading in 512-byte chunks, it will sometimes read the whole frame in one go; other times it reads the frame broken up into chunks. – Engo
In my link, in the last code block, there is an xrecv function. It shows how to read partial data that comes in chunks.
That is what you'll need to do.
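The shape of it is a loop like this sketch (read_full is a made-up name for an xrecv-style helper):

#include <unistd.h>

/* Keep reading until 'len' bytes have arrived, handling partial reads. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, (char *)buf + got, len - got);
        if (n <= 0)
            return n;                    /* error or EOF; caller decides */
        got += n;
    }
    return (ssize_t)got;
}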
Things missing from your post:
You didn't post which imx8 board/configuration you have, or which SIM card you have (the protocols are card-specific). And you didn't post your other code [or any code] that drives the device and illustrates the problem.
How much time must pass without receiving a byte before the [uart] device is "idle"? That is, (e.g.) the device sends 100 bytes and is then finished. How many byte times does one wait before considering the device to be "idle"?
What speed is the UART running at?
A thorough description of the device, its capabilities, and how you intend to use it.
A UART device doesn't have an "idle" interrupt. From some imx8 docs, the DMA device may have an "idle" interrupt, and the UART can be driven by the DMA controller. But I looked at some of the Linux kernel imx8 device drivers and, AFAICT, the idle interrupt isn't supported.
I need to read everything in one go and get this data within a few hundred microseconds.
Based on the scheduling granularity, it may not be possible to guarantee that a process runs in a given amount of time.
It is possible to help this a bit. You can change the process to use the R/T scheduler (e.g. SCHED_FIFO). Also, you can use sched_setaffinity to lock the process to a given CPU core. There is a corresponding call to lock IRQ interrupts to a given CPU core.
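A sketch of both calls (the core number and priority are arbitrary choices here):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process to one core and switch it to SCHED_FIFO. */
static int go_realtime(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                    /* arbitrary core */
    if (sched_setaffinity(0, sizeof(set), &set) < 0)
        return -1;

    struct sched_param sp = { .sched_priority = 50 };  /* arbitrary priority */
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}

The IRQ side is done from outside the process, e.g. by writing a CPU mask to /proc/irq/<N>/smp_affinity.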
I assume that the SIM card acts like a [passive] device (like a disk). That is, you send it a command, and it sends back a response or does a transfer.
Based on what command you give it, you should know how many bytes it will send back. Or, it should tell you how many optional bytes it will send (similar to the struct in my link).
The method you've described (e.g.) wait for idle, then "race" to get/process the data [for which you don't know the length] is fraught with problems.
Even if you could get it to work, it will be unreliable. At some point, system activity will be just high enough to delay wakeup of your process and you'll miss the window.
If you're reading data, why must you process the data within a fixed period of time (e.g. 100 us)? What happens if you don't? Does the device catch fire?
Without more specific information, there are probably other ways to do this.
I've programmed such systems before that relied on data races. They were unreliable. Either missing data. Or, for some motor control applications, device lockup. The remedy was to redesign things so that there was some positive/definitive way to communicate that was tolerant of delays.
Otherwise, I think you've "fallen in love" with the "idle interrupt" idea, making this an XY problem: https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem

Using lwIP to send data bigger than 64 KB

I'm trying to send real-time data (4 bytes per sample, 10 channels, sampled at 100 kHz) over lwIP.
I've understood that lwIP has a timer which loops every 250 ms and cannot be changed.
So I'm buffering the samples in RAM at 100 kHz and sending the accumulated data over TCP every 250 ms.
The problem is that I cannot go over 65535 bytes every 250 ms, because I get MEM_ERR.
I already increased the buffer up to 65535, but when I try to increase it further I get several errors during compilation.
So my doubt is: can lwIP manage buffers larger than a 16-bit length (65535 bytes)?
Thanks,
Marco
Better to focus on throughput.
You neglected to show any code, describe which Xilinx board/system you're using, or say which OS you're using (e.g. FreeRTOS, Linux, etc.).
Your RT data is: 4 bytes * 10 channels * 100kHz --> 400,000 bytes / second.
From your lwip description, you have 65536-byte packets * 4 packets/sec --> 262,144 bytes / second.
This is too slow. And it is much slower than what a typical TCP / Ethernet link can handle, so I think your understanding of the maximum rate is off a bit.
You probably can't increase the packet size.
But, you probably can increase the number of packets sent.
You say that the 250ms interval for lwip can't be changed. I believe that it can.
From: https://www.xilinx.com/support/documentation/application_notes/xapp1026.pdf we have the section: Creating an lwIP Application Using the RAW API:
Set up a timer to interrupt at a constant interval. Usually, the interval is around 250 ms. In the timer interrupt, update necessary flags to invoke the lwIP TCP APIs tcp_fasttmr and tcp_slowtmr from the main application loop explained previously
The "usually" seems to me to imply that it's a default and not a maximum.
But, you may not need to increase the timer rate as I don't think it dictates the packet rate, just the servicing/completion rate [in software].
A few guesses ...
Once a packet is queued to the NIC, normally, other packets may be queued asynchronously. Modern NIC hardware often has a hardware queue. That is, the NIC H/W supports multiple pending requests. It can service those at line speed without CPU intervention.
The 250ms may just be a timer interrupt that retires packet descriptors of packets completed by the NIC hardware.
That is, more than one packet can be processed/completed per interrupt. If that were not the case, then only 4 packets / second could be sent and that would be ridiculously low.
Generating an interrupt from the NIC for each packet incurs an overhead. My guess is that interrupts from the NIC are disabled. And, the NIC is run in a "polled" mode. The polling occurs in the timer ISR.
The timer interrupt will occur 4 times per second, but it will process any completed packets it sees, so the ISR overhead is only 4 interrupts / second.
This increases throughput because the ISR overhead is reduced.
UPDATE:
Thanks for the reply; indeed it is 4 bytes * 10 channels * 100kHz --> 4,000,000 bytes / second, but I agree that we are quite far from the 100 Mbit/s.
Caveat: I don't know lwip, so most of what I'm saying is based on my experience with other network stacks (e.g. linux), but it appears that lwip should be similar.
Of course, lwip will provide a way to achieve full line speed.
Regarding changing the 250 ms timer period: to achieve what I want, it would have to be lowered more than tenfold, which seems too much and could compromise the stability of the protocol.
When you say that, did you actually try that?
And, again, you didn't post your code or describe your target system, so it's difficult to provide help.
Issues could be because of the capabilities [or lack thereof] of your target system and its NIC.
Or, because of the way your code is structured, you're not making use of the features that can make it fast.
So basically your suggestion is to enable the interrupt on each message? In this case I can send the remaining data in the ACK callback, if I understood correctly. – Marco Martines
No.
The mode for interrupt on every packet is useful for low data rates where the use of the link is sparse/sporadic.
If you have an interrupt for every packet, the overhead of entering/exiting the ISR (the ISR prolog/epilog) code will become significant [and possibly untenable] at higher data rates.
That's why the timer based callback is there. To accumulate finished request blocks and [quickly] loop on the queue/chain and complete/reclaim them periodically. If you wish to understand the concept, look at NAPI: https://en.wikipedia.org/wiki/New_API
Loosely, on most systems, when you do a send, a request block is created with all info related to the given buffer. This block is then queued to the transmit queue. If the transmitter is idle, it is started. For subsequent blocks, the block is appended to the queue. The driver [or NIC card] will, after completing a previous request, immediately start a new/fresh one from the front of the queue.
This allows you to queue multiple/many blocks quickly [and return immediately]. They will be sent in order at line speed.
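To make that concrete, here is a sketch against the lwIP raw API (send_all, pcb and data are made-up names): queue data in chunks bounded by the free send buffer, then push the queued segments out.

#include "lwip/tcp.h"

/* Queue 'len' bytes in chunks no larger than the free send buffer. */
static err_t send_all(struct tcp_pcb *pcb, const uint8_t *data, uint32_t len)
{
    while (len > 0) {
        u16_t chunk = (u16_t)LWIP_MIN(tcp_sndbuf(pcb), len);
        if (chunk == 0)
            break;                       /* buffer full: resume from tcp_sent() */
        if (tcp_write(pcb, data, chunk, TCP_WRITE_FLAG_COPY) != ERR_OK)
            break;
        data += chunk;
        len  -= chunk;
    }
    return tcp_output(pcb);              /* hand queued segments to the stack */
}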
What actually happens depends on system H/W, NIC controller, and OS and what lwip modes you use.

STM32 DMA from timer count to memory

I'm using an STM32H743. I have an external clock signal coming in on a GPIO pin, and I want to very accurately measure elapsed time between each rising (or falling) edge in the external clock signal. So I set things up so that TIM4 is triggered by the external clock, and TIM5 is triggered by the internal oscillator.
I wrote an IRQ handler so that whenever TIM4 triggers, an interrupt runs that captures TIM5's value. It seems to work OK, but I'm wondering if I can do it through DMA to avoid all the context switching and free up the CPU. Basically I want to set up the DMA so that each TIM4 event initiates a transfer that copies the TIM5 counter value to a circular buffer somewhere.
I've searched through forums and the DMA documentation but I'm hazy on whether a timer register can be a valid DMA source. I was thinking maybe I could do something like this:
hDma->PAR = (uint32_t) &htim5.Instance->CNT;
hDma->M0AR = (uint32_t) myBufferPtr;
hDma->NDTR = myBufferSize;
hDma->CR |= (uint32_t)DMA_SxCR_EN;
But I'm not sure if this can work.
Short version: Can I use the timer's CNT register as a DMA transfer source? Would it be a peripheral-to-memory transfer? Or a memory-to-memory transfer? Are there other flags I need to make this work? Or is it not possible? Or is there another STM32 feature that would make it easier to count time between pulses?
Disclaimer
I must confess that my long practical experience with STM32 has so far been with mainstream controller families like the STM32F0, STM32F3, STM32F4 and STM32L4, so I'm answering based on what those controllers would offer you in your situation.
The STM32H7 series is much more powerful, and it offers several additional DMA technologies like DMA2D, MDMA and lots of other stuff that I'm not sure about.
But I think a simplified answer might also help you for now, so I'm daring to write it.
Can I use the timer's CNT register as a DMA transfer source? Would it be a peripheral-to-memory transfer? Or a memory-to-memory transfer? Are there other flags I need to make this work? Or is it not possible?
I would expect this to work.
I don't see a reason not to read the TIMx_CNT register in a DMA transfer.
The CNT register is definitely a peripheral address so you have to configure it as a peripheral-to-memory transfer.
I believe that the peripheral/memory separation refers to the bus from which the DMA controller fetches the data (or to which bus one it delivers them) in the bus matrix implemented in every STM32.
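As an untested sketch of the idea with HAL (the DMA handle name and its linkage to TIM4's update request are assumptions, and the H7's request routing may differ from what I know from the mainstream families):

/* Untested sketch: TIM4's update event raises a DMA request that copies
 * TIM5->CNT into a buffer; circular mode would be set in the DMA init. */
#define CAP_LEN 256
static uint32_t capture_buf[CAP_LEN];

HAL_DMA_Start(&hdma_tim4_up,                    /* assumed handle name */
              (uint32_t)&TIM5->CNT,             /* peripheral source */
              (uint32_t)capture_buf,            /* memory destination */
              CAP_LEN);
__HAL_TIM_ENABLE_DMA(&htim4, TIM_DMA_UPDATE);   /* raise DMA request on update */
HAL_TIM_Base_Start(&htim4);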
Or is there another STM32 feature that would make it easier to count time between pulses?
Yes, there is:
Many of the TIM peripherals (not all are the same) offer you a feature called "Input Capture" that connects the channel (sub-)peripheral of the TIM instance to the input and has the main part of the (same!) TIM peripheral do the internal clocking.
A prerequisite is that the pin you'd like to measure has a TIMx_CHy alternate function, not "only" a TIMx_ETR one.
The TIM peripherals offer a wide range of configuration options - and a complicated mess as long as you haven't got used to them.
As an introduction and a good overview, I recommend two application notes from ST:
AN4013 Application note. "STM32 cross-series timer overview", Rev.8
Which timers you have on your µC, and which features are offered by which one.
AN4776 Application note. "General-purpose timer cookbook for STM32 microcontrollers", Rev.3
How to use the timers you have. Check out section 2.6, input capture is on page 27.
Looking up those two, I found a third one you might want to check out for better precision, related to HRTIM timers:
AN4539 Application note. "HRTIM cookbook", Rev.4
It is easily done using the STM32CubeIDE configurator:
1. Configure the timer, enable an input capture channel, and enable DMA (circular mode, peripheral to memory, data width word/word). Enable interrupts.
2. Prepare a buffer for storing the captured counter values.
3. Start input capture in DMA mode before the main loop.
For high-speed operation you may copy data from timerCaptureBuffer to timerCaptureBufferSafe inside these callbacks, for example with a DMA memory-to-memory transfer, to minimize the time spent in the HAL_TIM_IC_CaptureHalfCpltCallback and HAL_TIM_IC_CaptureCallback interrupts. Process the adjacent captured values stored in timerCaptureBufferSafe after the memory-to-memory DMA callback signals that the data is ready. You may use signaling flags so timerCaptureBufferSafe will not be overwritten.
Here is an example:
#define TIM_BUFFER_SIZE 128
uint32_t timerCaptureBuffer[TIM_BUFFER_SIZE];
uint32_t timerCaptureBufferSafe[TIM_BUFFER_SIZE];
// ...
/* Register a callback for completion of the memory-to-memory copy. */
HAL_DMA_RegisterCallback(&hdma_memtomem_dma2_stream2,
                         HAL_DMA_XFER_CPLT_CB_ID,
                         myDMA_Callback22);
// ...
/* Start input capture with circular DMA into timerCaptureBuffer. */
HAL_TIM_IC_Start_DMA(&htim2, TIM_CHANNEL_1,
                     (uint32_t *)timerCaptureBuffer, TIM_BUFFER_SIZE);
// ...
/* First half of the capture buffer is full: copy it out via mem-to-mem DMA. */
void HAL_TIM_IC_CaptureHalfCpltCallback(TIM_HandleTypeDef *htim)
{
    HAL_DMA_Start_IT(&hdma_memtomem_dma2_stream2,
                     (uint32_t)&timerCaptureBuffer[0],
                     (uint32_t)&timerCaptureBufferSafe[0],
                     sizeof(timerCaptureBuffer) / 2 / 4); /* half, in words */
    // ...
}
/* Second half of the capture buffer is full: copy it out the same way. */
void HAL_TIM_IC_CaptureCallback(TIM_HandleTypeDef *htim)
{
    HAL_DMA_Start_IT(&hdma_memtomem_dma2_stream2,
                     (uint32_t)&timerCaptureBuffer[TIM_BUFFER_SIZE / 2],
                     (uint32_t)&timerCaptureBufferSafe[TIM_BUFFER_SIZE / 2],
                     sizeof(timerCaptureBuffer) / 2 / 4); /* half, in words */
    // ...
}
/* Mem-to-mem copy finished: the safe buffer is ready for processing. */
void myDMA_Callback22(DMA_HandleTypeDef *_hdma)
{
    //...
}

Issues setting up DMA double-buffer mode to work with SPI RX only on STM32

I'm attempting to create a data logger that takes in data from a sensor node with three modalities and transmits that data over three different SPI buses to a Nucleo-F746ZG. I would then like to use DMA in peripheral-to-memory mode to take the incoming data from each bus and put it into one of three buffers that will eventually be written to a USB flash drive using FATFS. I'm having issues initializing the DMA in double-buffer mode: when I look at each register it seems to be configured correctly, but none of the interrupts fire and the data transfer doesn't occur.
I know that the SPI data is coming in. I initially tried this using interrupts, but the data rate was too fast and I was losing data, so I pivoted to DMA. The data, once it's in the register, writes correctly to the USB drive as well. I'm using the "LL" drivers for both SPI and DMA.
This is my current DMA initialization code:
uint16_t TEST_BUFFER_0[10000];
uint16_t TEST_BUFFER_1[10000];

LL_DMA_SetChannelSelection(DMA1, LL_DMA_STREAM_3, LL_DMA_CHANNEL_0);
LL_DMA_SetDataTransferDirection(DMA1, LL_DMA_STREAM_3, LL_DMA_DIRECTION_PERIPH_TO_MEMORY);
LL_DMA_EnableDoubleBufferMode(DMA1, LL_DMA_STREAM_3);
LL_DMA_SetPeriphIncMode(DMA1, LL_DMA_STREAM_3, LL_DMA_PERIPH_NOINCREMENT);
LL_DMA_SetMemoryIncMode(DMA1, LL_DMA_STREAM_3, LL_DMA_MEMORY_INCREMENT);
LL_DMA_SetPeriphSize(DMA1, LL_DMA_STREAM_3, LL_DMA_PDATAALIGN_HALFWORD);
LL_DMA_SetMemorySize(DMA1, LL_DMA_STREAM_3, LL_DMA_MDATAALIGN_HALFWORD);
LL_DMA_DisableFifoMode(DMA1, LL_DMA_STREAM_3);
LL_DMA_SetMemoryAddress(DMA1, LL_DMA_STREAM_3, (uint32_t)TEST_BUFFER_1);
LL_DMA_SetMemory1Address(DMA1, LL_DMA_STREAM_3, (uint32_t)TEST_BUFFER_0);
LL_DMA_SetDataLength(DMA1, LL_DMA_STREAM_3, 10000);
LL_DMA_SetPeriphAddress(DMA1, LL_DMA_STREAM_3, LL_SPI_DMA_GetRegAddr(SPI2));
LL_DMA_EnableIT_HT(DMA1, LL_DMA_STREAM_3);
LL_DMA_EnableIT_TC(DMA1, LL_DMA_STREAM_3);
LL_DMA_EnableStream(DMA1, LL_DMA_STREAM_3);
LL_SPI_EnableDMAReq_RX(SPI2);
LL_SPI_Enable(SPI2);
I'm not seeing either of the interrupts fire, nor am I seeing my buffers fill. I'm also not one hundred percent sure what actually initiates the DMA transfer, other than the documentation alluding to it occurring automatically when data comes in. I'm kind of stuck at this point and not sure what else to try.

STM32F2x: Is it possible to request multiple DMA streams with a single request?

I want to set up an application where a single trigger (a compare match of a timer) requests multiple DMA streams (i.e., sets a new timer value and sends data to SPI).
Is this possible with the STM32F2x µC, or do you have an idea for a µC with the following properties:
compare-match, free-running uint16 timer
~8 DMA channels
SPI HW unit (best would be 9 chip selects, but these could be simulated via further DMA channels)
clock frequency >= 80 MHz
<= 64 pins (I want to do my own layout, and more than 64 is expected to be too complicated)
~128 kB RAM
~256 kB ROM
cheap (developer board ~20-100 EUR, "easy to use" toolchain, complete with JTAG or another programmer; best would be a connection via USB)
I'm not sure if I understand what you're asking here, but a single timer channel can be used to trigger two different DMA streams. See Table 22 in Section 9.3.3 of the reference manual: for instance, TIM2_CH4 can be used to trigger both stream 6 and stream 7. Also, you can configure a timer as a slave to a second timer, which I believe could be used to make them fire simultaneously, and then you could use the two different timers as triggers to the DMA peripheral.
Sorry if I'm unable to elaborate further, as I'm more familiar with the STM32F1xx series which doesn't have the concept of streams, only channels.
