Output user data fast on Cortex R5 - c

I'm trying to output some user data, character series, on a Cortex R5 to a PC.
Problem is that the uart is too slow for the amount of data and I'm looking for something faster. I hoped ITM could be used but sadly that's only available for Cortex M-series. The data contains status info about processes which I'd like to visualise for a better insight.
The Uart was running at the maximum of 921600 baud thus I'm looking for something faster than that. I'm looking for 2-5 Mbit.
I found information of DCC (debug communication channel) and ETM but I can't really figure out their speeds and how I could use them with user data instead of tracing data.
I have acces to tracers and debuggers (Green Hills SuperTrace and Realview ICE) so requiring those is no problem. I just can't figure out how to read the data. Perhaps I missed the obvious?
Edit: For now it looks like the easiest way is to bypass the CP2105 which limits my uart to 921600. I'll connect the RX/TX pins from the SoC to a RPi which should be able to get much higher bauds. Ofcourse, I'll also need a logic level shifter since the SoC is only 2.5V tolerant (74LVC245). If this setup works I'll answer my question. Thanks for the input!

The DCC is probably going to be slow, and maybe intrusive to use. You're limited to using JTAG to access this.
The ETM ought to be able to trace this information, and you should be able to configure the filtering to trace just the accesses to a specific memory address. Its a very long time since I looked in detail at the ETMv3 data trace, so I'm not sure if you need to trace the associated instruction or not. The debug tools also tend to be more focused on tracing instructions with data being an additional decoration rather than presenting a raw datastream, so processing the data might be non-trivial.
The ETM should provide several bits of data throughput per cycle, so as long as the data is in small bursts there should be enough bandwidth. Obviously this is package dependant, but small numbers of Gbps are achievable (with a substantial protocol cost depending on what information you're trying to push through the trace stream).
In some chips, an ETM can be shared between several processors (of the same type). ETCSCR[14:14] will be non-zero if this is the case, and then you're limited to selecting one core and tracing that (until the ETM is disabled/re-programmed).

Related

Writing device library C/C++ for STM32 or ARM

I need to develop device libraries like uBlox, IMUs, BLE, ecc.. from scratch (almost). Is there any doc or tutorial that can help me?
Question is, how to write a device library using C/C++ (Arduino style if you want) given a datasheet and a platform like STM32 or other ARMs?
Thanks so much
I've tried to read device libraries from Arduino library and various Github, but I would like to have a guide/template to follow (general rules) to write proper device libraries from a given datasheet.
I'm not asking a full definitive guide, just where to start, docs, methods approach.
I've found this one below, but is very basic and quite lite for my targets.
http://blog.atollic.com/device-driver-development-the-ultimate-guide-for-embedded-system-developers
I don't think that you can actually write libraries for STM32 in Arduino style. Most Arduino libraries you can find in the wild promote ease of usage rather than performance. For example, a simple library designed for a specific sensor works well if reading the sensor and reporting the results via serial port is the only thing that firmware must do. When you work on more complex projects where uC has lots to do and satisfy some real time constraints, the general Arduino approach doesn't solve your problems.
The problem with STM32 library development is the complex connection between peripherals, DMA and interrupts. I code them in register level without using the Cube framework and I often find myself digging the reference manual for tables that shows the connections between DMA channels or things like timer master-slave relations. Some peripherals (timers mostly) work similar but each one of them has small differences. It makes development of a hardware library that fits all scenarios practically impossible.
The tasks you need to accomplish are also more complex in STM32 projects. For example, in one of my projects, I fool SPI with a dummy/fake DMA transfer triggered by a timer, so that it can generate periodic 8-pulse trains from its clock pin (data pins are unused). No library can provide you this kind of flexibility.
Still, I believe not all is lost. I think it may be possible to build an hardware abstraction layer (HAL, but not The HAL by ST). So, it's possible to create useful libraries if you can abstract them from the hardware. A USB library can be a good example for this approach, as the STM32 devices have ~3 different USB peripheral hardware variations and it makes sense to write a separate HAL for each one of them. The upper application layer however can be the same.
Maybe that was the reason why ST created Cube framework. But as you know, Cube relies on external code generation tools which are aware of the hardware of each device. So, some of the work can be avoided in runtime. You can't achieve the same result when you write your own libraries unless you also design a similar external code generation tool. And also, the code Cube generates is bloated in most cases. You trade development time for runtime performance and code space.
I assume you will be using a cross toolchain on some platform like Linux, and that the cross toolchain is compatible with some method to load object code on the target CPU. I also assume that you already have a working STM32 board that is documented well enough to figure out how the sensors will connect to the board or to the CPU.
First, you should define what your library is supposed to provide. This part is usually surprisingly difficult. It’s a bit hard to know what it can provide, without knowing a bit about what the hardware sensors are capable of providing. Some iteration on the requirements is expected.
You will need to have access to the documentation for the sensors, usually in the form of the manufacturer’s data sheets. Using the datasheet, and knowing how the device is connected to the target CPU/board, you will need to access the STM32 peripherals that comprise the interface to the sensors. Back to the datasheets, this time for the STM32, to see how to access its peripheral interfaces. That might be simple GPIO bits and bytes, or might be how to use built-in peripherals such as SPI or I2C.
The datasheets for the sensors will detail a bunch of registers, describing the meaning of each, including the meanings of each bit, or group of bits, in certain registers. You will write code in C that accesses the STM32 peripherals, and those peripherals will access the sensors across the electrical interface that is part of the STM32 board.
The workflow usually starts out by writing to a register or three to see if there is some identifiable effect. For example, if you are exercising a digital IO port, you might wire up an LED to see if you can turn it on or off, or a switch to see if you can correctly read its state. This establishes that your code can poke or peek at IO using register level access. There may be existing helper functions to do this work as part of the cross toolchain. Or you might have to develop your own, using pointer indirection to access memory mapped IO. Or there might be specially instructions needed that can only be accessed from inline assembler code. This answer is generic as I don’t know the specifics of the STM32 processor or its typical ecosystem.
Then you move on to more complex operations that might involve sequences of operations, like cycling a bit or two to effect some communication with the device. Or it might be as simple as finding the proper sequence of registers to access for operation of a SPI interface. Often, you will find small chunks of code are complete enough to be re-used by your driver; like how to read or write an individual byte. You can then make that a reusable function to simplify the rest of the work, like accessing certain registers in sequence and printing the contents of register that you read to see if they make sense. Ultimately, you will have two important pieces of information: and understanding of the low-level register accesses needed to create a formal driver, and an understanding of what components and capabilities make up the hardware (ie, you know how the device(s) work).
Now, throw away most of what you’ve done, and develop a formal spec. Use what you now know to include everything that can be useful. Use what you now know to develop a spec that includes an appropriate interface API that your application code can use. Rewrite the driver, armed with the knowledge of how are the pieces work, and taking advantage of the blank canvas afforded you by the fresh rewrite of the spec. Only reuse code that you are completely confident is optimal and appropriate to the format dictated by the spec. Write test code for all of the modules, and use the test code to actually test that the code works and that it conforms to the spec. Re-use the test code every time you modify anything it tests.

Is it wrong to log inner working of an IoT device in case of failure?

I'm currently working on an IoT project and I want to log the execution of my software and hardware.
I want to log them then send them to some DB in case I need to have a look at my device remotely.
The wip IoT device will have to be as minimal as possible so the act of having to write very often inside a flash memory module seems weird to me.
I know that it will run the RTOS OS Nucleus on an Cortex-M4 with some modules connected through SPI.
Can someone with more expertise enlighten me ?
Thanks.
You will have to estimate your hourly/daily/whatever data volume that needs to go into the log and extrapolate to the expected lifetime of your product. Microcontroller flash usually isn't made for logging and thus it features neither enduring flash cells (some 10K-100K write cycles usually compared to 1M or more for dedicated data chips - look it up in the uC spec sheet) nor wear leveling. Wear leveling is any method which prevents software from writing to the same physical cell too frequently (which would e.g. be the directory for a simple file system).
For your log you will have to create a quite clever or complex method to circumvent any flash lifetime problems.
But the problems don't stop there: usually the MCU isn't able to read from Flash memory when writing to it where "writing" means a prolonged (several microseconds up to milliseconds depending on the chip) sequence of instructions controlling the internal Flash statemachine (programming voltage, saturation times, etc.) until the new values have reliably settled in the memory. And, maybe you guessed it, "reading" in this context also means reading instructions, that is you have to make sure that whichever code and interrupts that may occur during the Flash write are only executing code in RAM, cache or other memories and not in the normal instruction memory. It is doable but the more complex the SW system that you are running above the HW layer, the less likely it will work reliably.

How do I increase the speed of my USB cdc device?

I am upgrading the processor in an embedded system for work. This is all in C, with no OS. Part of that upgrade includes migrating the processor-PC communications interface from IEEE-488 to USB. I finally got the USB firmware written, and have been testing it. It was going great until I tried to push through lots of data only to discover my USB connection is slower than the old IEEE-488 connection. I have the USB device enumerating as a CDC device with a baud rate of 115200 bps, but it is clear that I am not even reaching that throughput, and I thought that number was a dummy value that is a holdover from RS232 days, but I might be wrong. I control every aspect of this from the front end on the PC to the firmware on the embedded system.
I am assuming my issue is how I write to the USB on the embedded system side. Right now my USB_Write function is run in free time, and is just a while loop that writes one char to the USB port until the write buffer is empty. Is there a more efficient way to do this?
One of my concerns that I have, is that in the old system we had a board in the system dedicated to communications. The CPU would just write data across a bus to this board, and it would handle communications, which means that the CPU didn't have to waste free time handling the actual communications, but could offload the communications to a "co processor" (not a CPU but functionally the same here). Even with this concern though I figured I should be getting faster speeds given that full speed USB is on the order of MB/s while IEEE-488 is on the order of kB/s.
In short is this more likely a fundamental system constraint or a software optimization issue?
I thought that number was a dummy value that is a holdover from RS232 days, but I might be wrong.
You are correct, the baud number is a dummy value. If you create a CDC/RS232 adapter you would use this to configure your RS232 hardware, in this case it means nothing.
Is there a more efficient way to do this?
Absolutely! You should be writing chunks of data the same size as your USB endpoint for maximum transfer speed. Depending on the device you are using your stream of single byte writes may be gathered into a single packet before sending but from my experience (and your results) this is unlikely.
Depending on your latency requirements you can stick in a circular buffer and only issue data from it to the USB_Write function when you have ENDPOINT_SZ number of byes. If this results in excessive latency or your interface is not always communicating you may want to implement Nagles algorithm.
One of my concerns that I have, is that in the old system we had a board in the system dedicated to communications.
The NXP part you mentioned in the comments is without a doubt fast enough to saturate a USB full speed connection.
In short is this more likely a fundamental system constraint or a software optimization issue?
I would consider this a software design issue rather than an optimisation one, but no, it is unlikely you are fundamentally stuck.
Do take care to figure out exactly what sort of USB connection you are using though, if you are using USB 1.1 you will be limited to 64KB/s, USB 2.0 full speed you will be limited to 512KB/s. If you require higher throughput you should migrate to using a separate bulk endpoint for the data transfer.
I would recommend reading through the USB made simple site to get a good overview of the various USB speeds and their capabilities.
One final issue, vendor CDC libraries are not always the best and implementations of the CDC standard can vary. You can theoretically get more data through a CDC endpoint by using larger endpoints, I have seen this bring host side drivers to their knees though - if you go this route create a custom driver using bulk endpoints.
Try testing your device on multiple systems, you may find you get quite different results between windows and linux. This will help to point the finger at the host end.
And finally, make sure you are doing big buffered reads on the host side, USB will stop transferring data once the host side buffers are full.

Reading a 4 µs long +5V TTL from a parallel port -- when to use kernel interrupts

I've got an experimental box of tricks running that, every 100 ms or so, will spit out a 4 microsecond long +5V pulse of electricity on a TTL line. The exact time that this happens is not known ahead of time, but it's important -- so I'd like to use the Red Hat 5.3 computer that essentially runs the experiment to service this TTL, and create a glorified timestamp.
At the moment, what I've done is wired the TTL into pin 13 of the parallel port (STATUS_SELECT, one of the input lines on a parallel port) on the linux box, spawn a process when the experiment starts, use chrt to change its scheduled priority to 99 -- i.e. high -- and then just poll the parallel port repeatedly in a while loop until the pin goes high. I then create an accurate timestamp, and, in a non-blocking way write it to disk.
Obviously, this is inefficient -- sometimes the process is suspended, and a TTL will be missed. As the computer is, itself, busy doing other things (namely acquiring data from my experimental bit of kit -- an MRI scanner!) this happens quite often. Polling is easy, but probably bad.
My question is this: doing something quickly when a TTL occurs seems like the bread-and-butter of computing, but, as far as I can tell, it's only possible to deal with interrupts on linux if you're a kernel module. The parallel port can generate interrupts, and libraries like paraport let you build kernel modules relatively quickly, where you have to supply your own handler.
Is the best way to deal with this problem and create accurate (±25 ms) timestamps for an experiment whenever that TTL comes in -- to write a kernel module that provides a list of recent interrupts to somewhere in /proc, and then read them out with a normal process later? Is that approach not going to work, and be very CPU inefficient -- or open a bag of worms to do with interrupt priority I'm not aware of?
Most importantly, this seems like it should be a solved problem -- is it, and if so do any wise people wish to point me in the right direction? Writing a kernel module seems like, frankly, a lot of hard, risky work for something that feels as if it should perhaps be simple.
The premise that "it's only possible to deal with interrupts on linux if you're a kernel module" dismisses some fairly common and effective strategies.
The simple course of action for responding to interrupts in userspace (especially infrequent ones) is to have a driver which created a kernel device (or in some cases sysfs node) where either a read() or perhaps a custom ioctl() from userspace will block until the interrupt occurs. You'd have to check if the default parallel port driver supports this, but it's extremely common with the GPIO drivers on embedded-type boards, and the basic scheme could be borrowed into the parallel port - provided that the hardware supports true interrupts.
If very precise timing is the goal, you might do better to customize the kernel module to record the timestamp there, and implement a mechanism where a read() from userspace blocks until the interrupt occurs, and then obtains the kernel's already recorded timestamp as the read data - thus avoiding the variable latency of waking userspace and calling back into the kernel to get the time.
You might also look at true local-bus serial ports (if present) as an alternate-interrupt capable interface in cases where the available parallel port is some partial or indirect implementation which doesn't support them.
In situations where your only available interface is something indirect and high latency such as USB, or where you want a lot of host- and operation-system- independence, then it may indeed make sense to use an external microcontroller. In that case, you would probably try to set the micro's clock from the host system, and then have it give you timestamp messages every time it sees an event. If your experiment only needs the timestamps to be relative to each other within a given experimental session, this should work well. But if you need to establish an absolute time synchronization across the USB latency, you may have to do some careful roundtrip measurement and then estimation of the latency in order to compensate it (see NTP for an extreme example).

How to get WIFI parameters (bandwidth, delay) on Ubuntu using C

I am student and I am writting simple application in C99 standard. Program should working on Ubuntu.
I have one problem - I don't know how can I get some Wifi parameters like bandwidth or delay. I haven't any idea how to do this. It is possible to do this using standard functions or any linux API (ech I am windows user)?.
In general, you don't know the bandwidth or delay of a wifi device.
Bandwidth and delay is the type of information from a link.
As far as I know, there is no such information holding in WiFi drivers.
The most link-related information is SINR.
For measuring bandwidth or delay, you should write your own code.
Maybe you should tell us more about your concrete problem. For now, I assume that you are interested in the throughput and latency of a specific wireless link, i.e. a link between two 802.11 stations. This could be a link between an access point and a client or between two ad-hoc stations.
The short answer is that there is no such API. In fact, it is not trivial even to estimate these two link parameters. They depend on the signal quality, on the data rate used by the sending station, on the interference, on the channel utilization, on the load of the computer systems at both ends, and probably a lot of other factors.
Depending on the wireless driver you are using it may be possible to obtain information about the currently used data rate and some packet loss statistics for the station you are communicating with. Have a look at net/mac80211/sta_info.h in your Linux kernel source tree. If you are using MadWifi, you may find useful information in the files below /proc/net/madwifi/ath0/ and in the output of wlanconfig ath0 list sta.
However, all you can do is to make a prediction. If the link quality changes suddenly, your prediction may be entirely wrong.

Resources