STM32 USB OTG: how does one properly flush the Tx FIFO?

The STM32 reference manual for the F4xx series chips says that the application writes the TXFFLSH bit in the OTG_FS_GRSTCTL register to flush the TxFIFO. As a note, it mentions that "...application must write this bit only after checking that the core is neither writing to the TxFIFO nor reading from the TxFIFO." As a means to do that, it suggests making sure that the NAK Effective Interrupt is set (I presume this means the GINAKEFF bit in the OTG_FS_GINTSTS register) to ensure the core is not reading the FIFO, and checking that the AHBIDL (AHB idle) bit in OTG_FS_GRSTCTL is set to guarantee that nothing is being written to the FIFO. The (awful) USB OTG library supplied by ST itself ignores both of these checks, while the free libopencm3 library only checks the AHB idle bit. My questions are the following:
The manual does not suggest disabling the USB OTG core before performing the above checks and writing the TXFFLSH bit. Does this not leave open the possibility that the core might start using the FIFO between the time the checks are performed and the writing of the TXFFLSH?
The 'NAK Effective' bit only guarantees that no data is read from the TxFIFO for non-periodic endpoints. Would this not still make it possible for the core to utilize the FIFO for isochronous endpoints unless the core is disabled?
I know these are 'nitpicking' type questions but the project I am working on is supposed to result in a very reliable piece of hardware, where the customer cannot afford checking the device for years so these subtleties matter (yes, we have a watchdog enabled, etc. but we need the core to work without resets most of the time).
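For reference, the sequence the manual describes can be sketched roughly as follows. This is only an illustration under stated assumptions: the bit positions (AHBIDL at bit 31, TXFNUM at bits 10:6, TXFFLSH at bit 5) are taken from the OTG_FS_GRSTCTL description and should be checked against the manual for the exact part, and the function name, the register-pointer parameter, and the spin limit are hypothetical. It leaves the GINAKEFF check to the caller and does nothing about the race the questions above ask about.

```c
#include <stdint.h>

/* Bit positions assumed from the OTG_FS_GRSTCTL description; verify for your part. */
#define GRSTCTL_AHBIDL      (1u << 31)
#define GRSTCTL_TXFFLSH     (1u << 5)
#define GRSTCTL_TXFNUM_POS  6u
#define GRSTCTL_TXFNUM_ALL  0x10u   /* TXFNUM value meaning "flush all Tx FIFOs" */

/* Returns 0 on success, -1 if the AHB master never went idle,
 * -2 if the core never cleared TXFFLSH. `grstctl` is a hypothetical
 * pointer to the register so the logic can be exercised off-target. */
int otg_flush_tx_fifo(volatile uint32_t *grstctl, uint32_t fifo_num,
                      uint32_t max_spins)
{
    uint32_t spins;

    /* 1. Wait until the core is not writing to the FIFO (AHB idle). */
    for (spins = 0; !(*grstctl & GRSTCTL_AHBIDL); spins++) {
        if (spins >= max_spins)
            return -1;
    }

    /* 2. Request the flush for the selected FIFO. */
    *grstctl = GRSTCTL_TXFFLSH | (fifo_num << GRSTCTL_TXFNUM_POS);

    /* 3. The core clears TXFFLSH when the flush has completed. */
    for (spins = 0; *grstctl & GRSTCTL_TXFFLSH; spins++) {
        if (spins >= max_spins)
            return -2;
    }
    return 0;
}
```

On target, the pointer would be the address of OTG_FS_GRSTCTL; the bounded loops are there so that a stuck core turns into an error code rather than a hang.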

Related

Linux UART imx8 how to quickly detect frame end?

I have an imx8 module running Linux on my PCB and I would like some tips or pointers on how to modify the UART driver so that my user-space C application can detect the end of a frame very quickly (in less than 2 ms). The UART frame does not have any specific ending character or frame length. The smallest VTIME timeout of 100 ms is much too long.
I am reading from a SIM card. I have no control over the data, and no control over its size or content. I just need to detect the end of a frame very quickly. The frame could be 3 bytes or 500. The SIM card reacts to data that it receives: typically I send it a couple of bytes and then it responds a couple of ms later with an uninterrupted string of bytes of unknown length. I am using an iMX8MP.
I thought about using the IDLE interrupt to detect the frame end. Turn it on when any byte is received and off once the idle interrupt fires. How can I propagate this signal back to user space? Or is there an existing method to do this?

Waiting for an "idle" is a poor way to do this.
Use termios to set raw mode with VTIME of 0 and VMIN of 1. This will allow the userspace app to get control as soon as a single byte arrives. See:
How to read serial with interrupt serial?
How do I use termios.h to configure a serial port to pass raw bytes?
How to open a tty device in noncanonical mode on Linux using .NET Core
But, you need a "protocol" of sorts, so you can know how much to read to get a complete packet. You prefix all data with a struct that has (e.g.) a type and a payload length. Then, you send "payload length" bytes. The receiver reads that fixed-length struct and then reads the payload, which is "payload length" bytes long. This struct is always sent (in both directions).
See my answer: thread function doesn't terminate until Enter is pressed for a working example.
What you have/need is similar to doing socket programming using a stream socket except that the lower level is the UART rather than an actual socket.
My example code uses sockets, but if you change the low level to open your uart in raw mode (as above), it will be very similar.
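A minimal sketch of the raw-mode setup described above, assuming the port has already been opened. The function name is hypothetical, and it only edits a termios structure in place; character size, parity, and baud rate are deliberately left to the caller, since those depend on the device.

```c
#include <termios.h>

/* Edit a termios structure in place for raw, byte-at-a-time reads.
 * Character size, parity, and baud rate are not touched here. */
void uart_make_raw_blocking(struct termios *t)
{
    /* No input translation and no software flow control. */
    t->c_iflag &= ~(tcflag_t)(IGNBRK | BRKINT | PARMRK | ISTRIP |
                              INLCR | IGNCR | ICRNL | IXON);
    /* No output post-processing. */
    t->c_oflag &= ~(tcflag_t)OPOST;
    /* No echo, no line editing (canonical mode), no signal characters. */
    t->c_lflag &= ~(tcflag_t)(ECHO | ECHONL | ICANON | ISIG | IEXTEN);
    /* read() returns as soon as at least one byte is available. */
    t->c_cc[VMIN] = 1;
    t->c_cc[VTIME] = 0;
}
```

Typical use would be tcgetattr(fd, &t), then uart_make_raw_blocking(&t), then tcsetattr(fd, TCSANOW, &t), checking the return value of both calls.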
UPDATE:
How quickly after the frame finished would I have the data at the application level? When I try to read my random-length frames, currently reading in 512-byte chunks, it will sometimes read the whole frame in one go; other times it reads the frame broken up into chunks. – Engo
In my link, in the last code block, there is an xrecv function. It shows how to read partial data that comes in chunks.
That is what you'll need to do.
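This is not the xrecv function from the link, just a rough sketch of the same idea: loop over short reads until a fixed-size header has arrived, then do the same for the payload it announces. The header layout (one type byte, one length byte) and the function names are hypothetical, and the whole approach assumes a protocol that actually carries a length.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical fixed-size header that precedes every payload. */
struct frame_hdr {
    uint8_t type;
    uint8_t len;     /* payload length in bytes */
};

/* Read exactly `len` bytes, looping over short reads.
 * Returns 0 on success, -1 on error or end of file. */
int read_full(int fd, void *buf, size_t len)
{
    uint8_t *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            return -1;           /* end of file before the frame was complete */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read one header plus its payload; returns the payload length or -1. */
int read_frame(int fd, struct frame_hdr *hdr, uint8_t *payload, size_t cap)
{
    if (read_full(fd, hdr, sizeof *hdr) != 0)
        return -1;
    if (hdr->len > cap)
        return -1;               /* payload would not fit in the caller's buffer */
    if (read_full(fd, payload, hdr->len) != 0)
        return -1;
    return hdr->len;
}
```

Because read_full keeps calling read() until the count is satisfied, it does not matter whether the kernel delivers the frame in one piece or in several chunks.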
Things missing from your post:
You didn't post which imx8 board/configuration you have. And, which SIM card you have (the protocols are card specific).
And, you didn't post your other code [or any code] that drives the device and illustrates the problem.
How much time must pass without receiving a byte before the [uart] device is "idle"? That is, (e.g.) the device sends 100 bytes and is then finished. How many byte times does one wait before considering the device to be "idle"?
What speed is the UART running at?
A thorough description of the device, its capabilities, and how you intend to use it.
A uart device doesn't have an "idle" interrupt. From some imx8 docs, the DMA device may have an "idle" interrupt and the uart can be driven by the DMA controller.
But, I looked at some of the linux kernel imx8 device drivers, and, AFAICT, the idle interrupt isn't supported.
I need to read everything in one go and get this data within a few hundred microseconds.
Based on the scheduling granularity, it may not be possible to guarantee that a process runs in a given amount of time.
It is possible to help this a bit. You can change the process to use the R/T scheduler (e.g. SCHED_FIFO). Also, you can use sched_setaffinity to lock the process to a given CPU core. There is a corresponding call to lock IRQ interrupts to a given CPU core.
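As a rough sketch of the first step, assuming a Linux system and a process with the privilege to use real-time policies, switching to SCHED_FIFO could look like this. The function name is hypothetical; pinning to a core with sched_setaffinity would be a separate step that is not shown.

```c
#include <errno.h>
#include <sched.h>

/* Ask for the SCHED_FIFO real-time policy for the calling process.
 * Returns 0 on success or the errno value on failure (typically EPERM
 * when the process lacks the privilege). The priority is clamped to
 * the range the system reports for SCHED_FIFO. */
int request_fifo_scheduling(int priority)
{
    struct sched_param sp = { 0 };
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    if (priority < lo)
        priority = lo;
    if (priority > hi)
        priority = hi;
    sp.sched_priority = priority;

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return errno;
    return 0;
}
```

Even with this in place, the wakeup latency is only reduced, not bounded, on a stock kernel.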
I assume that the SIM card acts like a [passive] device (like a disk). That is, you send it a command, and it sends back a response or does a transfer.
Based on what command you give it, you should know how many bytes it will send back. Or, it should tell you how many optional bytes it will send (similar to the struct in my link).
The method you've described (e.g.) wait for idle, then "race" to get/process the data [for which you don't know the length] is fraught with problems.
Even if you could get it to work, it will be unreliable. At some point, system activity will be just high enough to delay wakeup of your process and you'll miss the window.
If you're reading data, why must you process the data within a fixed period of time (e.g. 100 us)? What happens if you don't? Does the device catch fire?
Without more specific information, there are probably other ways to do this.
I've programmed such systems before that relied on data races. They were unreliable. Either missing data. Or, for some motor control applications, device lockup. The remedy was to redesign things so that there was some positive/definitive way to communicate that was tolerant of delays.
Otherwise, I think you've "fallen in love" with "idle interrupt" idea, making this an XY problem: https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem

When is a Cortex write to a device realised

When writing to device registers on a Cortex M0 (in my case, on an STM32L073), a question arises as to how careful one should be in a) ordering accesses to device memory and b) deciding that a change to a peripheral configuration has actually completed to the point that any dependencies become valid.
Taking a specific example to change the internal voltage regulator to a different voltage. You write the change to PWR->CR and read the status from PWR->CSR. I see code that does something like this:
Write to PWR->CR to set the voltage range
Spin until (PWR->CSR & voltage flag) becomes zero
In my mind there are three issues here:
Access ordering. This is Device Memory so transaction order is preserved relative to other Device access transactions. I would assume this means a DSB is not required between the write to CR and the read from CSR. A linked question and the answer to this is: [ARM CortexA]Difference between Strongly-ordered and Device Memory Type
Device memory can be buffered. Is there a possibility that a write to CR could still be in progress when the read from CSR occurs? This would mean that the voltage flag would still be clear and the code would proceed, when in fact the flag hasn't gone high yet!
Hardware response time. Is there a latency between the write and the effects becoming final? In actuality this should always be documented - for the STM32 the docs definitively say that the flag is set when the CR register changes.
Are there any race condition possibilities here? It's really the buffering that worries me - that a peripheral write is still in progress when a peripheral read takes place.

Access ordering.
Accesses are strongly ordered and you do not need barrier instructions to read back the same register.
Device memory can be buffered. Is there a possibility that a write to CR
Yes, it is possible. But it is not because of buffering but because of the bus propagation time. It may take several clocks before a particular operation will go through all bridges.
Hardware response time. Is there a latency between the write and the effects becoming final
Even if there is a latency, it is not important from your point of view. If you set a bit in the CR register and expect the result in the status register, simply wait for the status bit to have the expected value.
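A minimal sketch of that write-then-wait pattern, with a spin limit so a flag that never changes becomes an error instead of a hang. The function name and parameters are hypothetical; in the question's case the control register would be PWR->CR, the status register PWR->CSR, and the masks would come from the device header.

```c
#include <stdint.h>

/* Update a field in a control register, then wait for a busy flag in a
 * status register to clear. Returns 0 once (*status & busy_flag) reads
 * as zero, -1 if it is still set after max_spins polls. */
int write_then_wait(volatile uint32_t *ctrl, uint32_t field_mask,
                    uint32_t field_value,
                    const volatile uint32_t *status, uint32_t busy_flag,
                    uint32_t max_spins)
{
    /* Read-modify-write so the other control bits keep their values. */
    *ctrl = (*ctrl & ~field_mask) | (field_value & field_mask);

    /* Both accesses target Device memory, so they stay in program order. */
    for (uint32_t spins = 0; *status & busy_flag; spins++) {
        if (spins >= max_spins)
            return -1;
    }
    return 0;
}
```

Whether the flag is guaranteed to read as set on the very first poll is a property of the peripheral, so that part still has to come from the reference manual.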

How to exploit interrupts for data transfer over SPI peripheral

I have been implementing a device driver for the SPI peripheral of an MCU in C. I would like to exploit the interrupt mechanism for both reception and transmission.
For reception, I think I can implement this by exposing a function SpiRegisterCallback in the SPI driver interface. This function lets the client register a function that is invoked as soon as a data byte is received (that is, when the reception-buffer-full interrupt fires).
For transmission, I would like to use a SpiTransmit function that receives a pointer to the data bytes to be transmitted and the number of bytes to transmit. For the implementation, I am going to define an internal callback function in the SPI driver and register it for the transmission-buffer-empty interrupt. In this callback, the passed data bytes will be placed into the transmission buffer one at a time. I am not sure whether this approach is appropriate. Can anybody give me advice on how to implement an SPI peripheral driver which exploits interrupts for data transmission? Thanks in advance for any suggestions.

SPI is often very real-time critical, introducing a callback with function pointers means needless overhead code. The actual copying of data from SPI to RAM must be done internally by your driver. That's all the ISR should be doing. Some general guidance can be found here.
So your ISR should be filling up a buffer, then swapping pointers to buffers (no slow memcpy!) in a protected way, so that the caller always has one buffer with valid data, and the ISR always has one working buffer to fill up. Let the caller poll a flag rather than invoking a callback from inside an ISR. I like to use triple buffering if I can spare the RAM. That is: one buffer for the ISR, one buffer for the caller, and one spare that the ISR can swap with without disrupting the caller.
This is all rather intricate to code and most programmers get it wrong. DMA is superior to interrupts here, so you should really be considering DMA instead. This is something you should be considering when picking MCU.
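To illustrate the buffer swapping described above, here is a minimal triple-buffer sketch. It is one possible arrangement, not a reference implementation: all names and the fixed FRAME_LEN are hypothetical, and it assumes C11 atomics are usable on the target. On a core without an atomic exchange, the two atomic_exchange calls would be replaced by a short interrupts-disabled section.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_LEN 4u      /* hypothetical fixed frame size in bytes */
#define FRESH     0x4u    /* flag bit stored next to the buffer index */

static uint8_t bufs[3][FRAME_LEN];
static unsigned back = 0;          /* buffer the ISR is filling */
static unsigned front = 1;         /* buffer the caller is reading */
static atomic_uint middle = 2;     /* spare buffer, exchanged by both sides */
static size_t fill = 0;

/* Call from the SPI receive interrupt with the byte just read. */
void spi_rx_isr_byte(uint8_t b)
{
    bufs[back][fill++] = b;
    if (fill == FRAME_LEN) {
        /* Publish the full buffer and take the spare one; no memcpy. */
        back = atomic_exchange(&middle, back | FRESH) & 0x3u;
        fill = 0;
    }
}

/* Poll from the main loop. Returns the newest complete frame, or NULL
 * if nothing new has arrived since the last call. The returned buffer
 * stays valid until the next successful poll. */
const uint8_t *spi_rx_poll(void)
{
    if (!(atomic_load(&middle) & FRESH))
        return NULL;
    front = atomic_exchange(&middle, front) & 0x3u;
    return bufs[front];
}
```

The ISR never touches the buffer the caller holds, and if the caller is slow it simply sees the most recent frame; older unread frames are overwritten.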

A request for "any suggestions" does not really make this a great question because multiple answers may be acceptable, and few will be comprehensive. It invites comments rather than answers. However I will indulge:
First, this is not by any definition an exploit. To "exploit" implies making use of something for a purpose it was not intended - that is not the correct term in this case, you are not "exploiting" the interrupt mechanism, you are simply using it.
At high clock rates, in some cases the interrupt latency and context switch time involved in processing the interrupts may be less efficient than a simple busy-wait. If the transfers are more than two or three bytes at a time, you should in any case consider using DMA if available, so the interrupt will be the DMA interrupt for a complete transfer rather than for a single character. For applications such as SD card interfacing or EEPROM, DMA will have a significant performance impact and free up the CPU to do other useful work concurrently. A driver that uses a busy-wait for single byte/word transfers and DMA for block transfers may be optimal. This is particularly true perhaps if you are using an RTOS and the ISR triggers a task context to process the data: the context switch overhead may be nearly as much as or more than a busy-wait for a single byte. If your SPI clock is 1 MHz or faster, for example, a byte transfer takes 8 us or less; your ISR and callbacks could easily take longer than that, in which case it is not worthwhile.
So my advice here is to only consider interrupts for SPI if you are using a slow clock and can get other useful work done whilst waiting for the interrupt.
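The arithmetic behind that comparison is simple enough to put in a helper. This is only a sketch with a hypothetical function name; it ignores any gaps the peripheral inserts between bytes, and sck_hz must be nonzero.

```c
#include <stdint.h>

/* Time to shift one 8-bit word at the given SPI clock, in nanoseconds. */
uint32_t spi_byte_time_ns(uint32_t sck_hz)
{
    return (uint32_t)(8ull * 1000000000ull / sck_hz);
}
```

At 1 MHz this gives 8000 ns per byte and at 8 MHz it gives 1000 ns, which is the budget an interrupt entry, handler, and exit would have to beat.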
A problem with allowing callbacks in interrupts is that it allows the callback provider to do things that are ill-advised or illegal in an interrupt context, and you lose the ability to control the processing time of the interrupt. It is fine perhaps if the callback is intended for use by someone writing a device driver, since they should be aware of what they are doing, but this is the device driver.

STM32 STM32CubeF4 USB CDC operation

I built the code from the STM32CubeF4 for the USB CDC example. I added the missing receive code for CDC_Receive_FS() in usbd_cdc_if.c.
I loaded this into my STM32F4 Discovery and it works. A character typed on Tera Term returns and is displayed on Tera Term.
I am hoping that someone here could give me some knowledge about how this USB CDC firmware works. Specifically, is it driven by an interrupt that is generated when there is a level shift in voltage on the USB D- and D+ pins, or is there an infinite while loop that was launched somewhere, just polling and waiting for some data to appear?
What prompted my question is that I see that one can blink the LEDs on this board by toggling the state of the GPIO pins within an infinite while loop in main.c. However, there is nothing at all for USB within this while loop in main.c. So how does this USB CDC firmware get and send a character from/to Tera Term?

I will take the 2 minutes to answer you instead of lecturing you. Receive is done through interrupts. Very, very simply, the hardware sees the voltage change on D+/D- and flags an interrupt that was enabled by the initialization functions. The interrupt calls HAL_PCD_IRQHandler, which for received data goes through the callback in the usbd_conf.c file to USBD_LL_DataOutStage. That ends up calling the function USBD_CDC_DataOut in the usbd_cdc.c file, which in turn calls CDC_Receive_FS (the transmit-complete path is the mirror image, through USBD_LL_DataInStage and USBD_CDC_DataIn). There is your starting point, but it is not simple. To do what you want you might have to stop the output to UART and just handle it in the main loop.

This question is too broad for this forum and not an actual question about a specific problem. However, as some hints, you might
Read the USB specs, at least some basic overview (just start at Wikipedia). USB does not work by toggling a GPIO in software (see next point)
Read the STM32F4xx reference manual. This is quite comprehensive.
Read the source code of the demo. This should answer all questions.
To track execution paths, you should remember that C always starts with the main() function, so this is a good start to see what's going on. (disclaimer: I know pretty well, it starts with startup, but this might confuse a beginner even more).
If you want to work with USB, you will have to do this all anyway, so you might start with it as well right now. Yes, this will take some time; no surprise, engineers have learned all this for years before they start with larger projects.
All information is available legally and for free on the web.
And, yes, USB is most likely interrupt-driven and might also use DMA to transfer data.

Scheduling routines in C and timing requirements

I'm working on a C program that transmits samples over USB3 for a set period of time (1-10 us), and then receives samples for 100-1000 us. I have a rudimentary pthread implementation where the TX and RX routines are each handled as a thread. The reason for this is that in order to test the actual TX routine, the RX needs to run and sample before the transmitter is activated.
Note that I have very little C experience outside of embedded applications and this is my first time dabbling with pthread.
My question is, since I know exactly how many samples I need to transmit and receive, how can I e.g. start the RX thread once the TX thread is done executing and vice versa? How can I ensure that the timing stays consistent? Sampling at 10 MHz causes some harsh timing requirements.
Thanks!
EDIT:
To provide a little more detail, my device is a bladeRF x40 SDR, and communication to the device is handled by a FX3 microcontroller, which occurs over a USB3 connection. I'm running Xubuntu 14.04. Processing, scheduling and configuration however is handled by a C program which runs on the PC.

You don't say anything about your platform, except that it supports pthreads.
So, assuming Linux, you're going to have to realize that in general Linux is not a real-time operating system, and what you're doing sure sounds as if it has real-time timing requirements.
There are real-time variants of Linux, I'm not sure how they'd suit your needs. You might also be able to achieve better performance by doing the work in a kernel driver, but then you won't have access to pthreads so you're going to have to be a bit more low-level.

Thought I'd post my solution.
While the next build of the bladeRF firmware and FPGA image will include the option to add metadata (timestamps) to the synchronous interface, until then there's no real way in which I can know at which time instants certain events occurred.
What I do know is my sampling rate, and exactly how many samples I need to transmit and receive at which times relative to each other. Therefore, by using condition variables (with pthread), I can signal my receiver to start receiving samples at the desired instant. Since TX and RX operations happen in a very specific sequence, I can calculate delays by counting the number of samples and dividing by the sampling rate, which has proven to be 95-98% accurate.
This obviously means that since my TX and RX threads are running simultaneously, there are chunks of data within the received set of samples that will be useless, and I have another routine in place to discard those samples.
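A stripped-down sketch of the two ideas above: converting a sample count into a time, and using a condition variable so that the receiver proceeds only after the transmitter has signalled. This is not the actual bladeRF code; every name here is hypothetical and the TX/RX work is reduced to placeholder comments.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Duration of n_samples at rate_hz, in nanoseconds (rate_hz must be nonzero). */
uint64_t samples_to_ns(uint64_t n_samples, uint64_t rate_hz)
{
    return n_samples * 1000000000ull / rate_hz;
}

struct handoff {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            tx_done;
    bool            rx_saw_tx_done;
};

static void signal_tx_done(struct handoff *h)
{
    pthread_mutex_lock(&h->lock);
    h->tx_done = true;
    pthread_cond_signal(&h->cond);
    pthread_mutex_unlock(&h->lock);
}

static void *tx_thread(void *arg)
{
    /* ... transmit the burst here (placeholder) ... */
    signal_tx_done(arg);
    return NULL;
}

static void *rx_thread(void *arg)
{
    struct handoff *h = arg;

    pthread_mutex_lock(&h->lock);
    while (!h->tx_done)               /* loop guards against spurious wakeups */
        pthread_cond_wait(&h->cond, &h->lock);
    h->rx_saw_tx_done = h->tx_done;
    pthread_mutex_unlock(&h->lock);
    /* ... start receiving samples here (placeholder) ... */
    return NULL;
}

/* Runs one TX-then-RX cycle; returns 0 if RX proceeded only after TX signalled. */
int run_tx_then_rx(void)
{
    struct handoff h = { .tx_done = false, .rx_saw_tx_done = false };
    pthread_t tx, rx;
    int rc = -1;

    pthread_mutex_init(&h.lock, NULL);
    pthread_cond_init(&h.cond, NULL);
    if (pthread_create(&rx, NULL, rx_thread, &h) == 0) {
        if (pthread_create(&tx, NULL, tx_thread, &h) == 0) {
            pthread_join(tx, NULL);
            rc = 0;
        } else {
            signal_tx_done(&h);       /* release the waiting receiver */
        }
        pthread_join(rx, NULL);
        if (rc == 0 && !h.rx_saw_tx_done)
            rc = -1;
    }
    pthread_cond_destroy(&h.cond);
    pthread_mutex_destroy(&h.lock);
    return rc;
}
```

For example, 100 samples at 10 MHz come out as 10000 ns. The program needs to be built with -pthread, and the hand-off only fixes the order of the two threads; it does not bound how long the receiver takes to wake up.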