Using multiple hcsr04 sensors on Beaglebone Black - c

I am trying to use hcsr04 sensors on the Beaglebone black (adapted from this code - https://github.com/luigif/hcsr04)
I got it working for 4 different sets of sensors individually, and were now unsure of how to combine them into one program.
Is there a way to give the trigger and receive the echos simultaneously, such that interrupts can be generated as different events to the C program.
Running them one after the other is the last option we have in mind.

Russ is correct - since there's 2x PRU cores in the BeagleBone's AM335x processor, there's no way to run 4 instances of that PRU program simultaneously. I suppose you could load one compiled for one set of pins, take a measurement, stop it, then load a different binary compiled for a sensor on different pins, but that would be a pretty inefficient (and ugly, IMHO) way to do it.
If you know any assembly it should be pretty straight-forward to update that code to drive all 4 sensors (PRU assembly instructions). Alternatively you could start from scratch in C and use the clpru PRU C compiler as Russ suggested, though AFAIK that's still in somewhat of a beta state and there's not much info out there on it. Either way, I'd recommend reading from the 4 sensors in parallel or one after the other, loading the measurements into the PRU memory at different offsets, then sending a single signal to the ARM.
In that code you linked, the line:
SBCO roundtrip, c24, 0, 4
Takes 4 bytes from register roundtrip (which is register r4, per the #define roundtrip r4 at the top of the file), and loads it into the PRU data RAM (constant c24 is set to the beginning of data RAM in lines 39-41) at offset 0. So if you had 4 different measurements in 4 registers, you could offset the data in RAM, e.g.:
SBCO roundtrip1, c24, 0, 4
SBCO roundtrip2, c24, 4, 4
SBCO roundtrip3, c24, 8, 4
SBCO roundtrip4, c24, 12, 4
Then read those 4 consecutive 32-bit integers in your C program.

Related

How to determine MCU Clock speed requirements

Overview:
I spent a while trying to think of how to formulate this question. To narrow the scope, I wanted to provide my initial HW requirements in the form of a ‘real life’ example application.
I understand that clock speed is probably relative, in the sense that it is a case by case basis. For example, your requirement for a certain speed may be impacted on by the on-chip peripherals offered by the MCU. As an example, you may spend (n) cycles servicing an ISR for an encoder, or, you could pick an MCU that has a QEI input to do it for you (to some degree), which in turn, may loosen your requirement?
I am not an expert, and am very much still learning, so please call me out if I use an incorrect term, or completely misinterpret something. I assure you; the feedback is welcome!
Example Application:
This application is relatively simple. It can be thought of as a non-blocking state machine, where each ‘iteration’ of the machine must complete within 20ms. A single iteration of this machine has 4 main tasks:
Decode a serial payload, consisting of 32 bytes. The length is fixed at 32 bytes, payload is dynamic, baud is 115200bps (See Task #2 below)
Read 4 incremental shaft encoder signals, which are coupled with 4 DC Motors, 1 encoder for each motor (See Task #1 Below)
Determine the position of 4 limit switches. ISR driven, trigger on rising edge for each switch.
Based on the 3 categories of inputs above, the MCU will output 4 separate PWM signals # 50Hz (20ms) to a motor controller for its next set of movements. (See Task #3 below)
From an IO perspective, I know that the MCU is on the hook for reading 8 digital signals (4 quadrature encoders, 4 limit switches), and decoding a serial frame of 32 bytes over UART.
Based on that data, the MCU will output 4 independent PWM signals, with a pulse width of [1000usec -3200usec], per motor, to the motor controller.
The Question:
After all is said and done, I am trying to think through how I can map my requirements into MCU selection, solely from a speed point of view.
It’s easy for me to look through the datasheet and say, this chip meets my requirements because it has (n) UARTS, (n) ISR input pins, (n) PWM outputs etc. But my projects are so small that I always assume the processor is ‘fast enough’. Aside from my immediate peripheral needs, I never really look into the actual MCU speed, which is an issue on my end.
To resolve that, I am trying to understand what goes into selecting a particular clock speed, based on the needs of a given application. Or, another way to say it, which is probably wrong, but how to you quantify the theoretical load on the processor for that specific application?
Additional Information
Task #1: Encoder:
Each of the 4 motors have different tasks within the system, but regardless, they are the same brand/model motor, and have a maximum RPM of 230. My assumption is, if at its worst case, one of the motors is spinning at 230 RPM, that would mean, at full quadrature resolution (count rising/falling for channel A/B) the 1000PPR encoder would generate 4K interrupts per revolution. As such, the MCU would have to service those interrupts, potentially creating a bottleneck for the system. For example, if (n) number of clock cycles are required to service the ISR, and for 1 revolution of 1 motor, we expect 4K interrupts, that would be … 230(RPM) * 4K (ISR per rev) == 920,000 interrupts per minute? Yikes! And then I guess you could just extrapolate and say, again, at it’s worst case, where each of the 4 motors are spinning at 230 RPM, there’s a potential that, if the encoders are full resolution, the system would have to endure 920K interrupts per minute for each encoder. So 920K * 4 motors == 3,680,000 interrupts per minute? I am 100% sure I am doing something wrong, so please, feel free to set me straight.
Task #2: Serial Decoding
The MCU will require a dedicated HW serial port to decode a packet of 32 bytes, which repeats, with different values, every 7ms. Baud rate will be set to 115200bps.
Task #3: PWM Output
Based on the information from tasks 1 and 2, the MCU will write to 4 separate PWM outputs. The pulse for each output will be between 1000-3200usec with a frequency of 50Hz.
You need to separate real-time critical parts from the rest of the application. For example, the actual reception of an UART frame is somewhat time-critical if you do so interrupt-based. But the protocol decoding is not critical at all unless you are expected to respond within a certain time.
Decode a serial payload, consisting of 32 bytes.
You can either do this the old school way with interrupts filling up a buffer, or you could look for a part with DMA, which is fairly common nowadays. DMA means that you won't have to consider some annoying, relatively low frequency UART interrupt disrupting other tasks.
Read 4 incremental shaft encoder signals
I haven't worked with such encoders so I can't tell how time-critical they are. If you have to catch every single interrupt and your calculations are correct, then 3,680,000 interrupts per minute is still not that bad. 60*60/3680000 = 978us. So roughly one interrupt every millisecond, that's not a "hard real-time" requirement. If that's the only time-critical thing you need to do, then any shabby 8-bitter running at 8MHz could keep up.
Determine the position of 4 limit switches
You don't mention timing here but I assume this is something that could be polled cyclically by a low priority cyclic timer.
the MCU will output 4 separate PWM signals
Not a problem, just pick one with a decent PWM hardware peripheral. You should just need to update some PWM duty cycle registers now and then.
Overall, this doesn't sound all that real-time critical. I've done much worse real-time projects with icky 8 and 16 bitters. However, each time I did, I always regret not picking a faster MCU, because you always come up with stuff to add as the project/product goes on.
It sounds like your average mainstream Cortex M0+ would be a good candidate for this project. Clock it at ~48MHz and you'll have plenty of CPU power. Cortex M4 or larger if you actually expect floating point math (I don't quite see why you'd need that though).
Given the current component crisis, be careful with which brand you pick though! In particular stay clear of STM32, since ST can't produce them right now and you might end up waiting over a year until you get parts.
The answer to the question is "experience". But intuitively your example is not particularly taxing - although there are plenty of ways you could mess it up. I once worked on a project that ran on a 200MHz C5502 DSP at near 100% CPU load. The application now runs on a 72MHz Cortex-M3 at only 60% with additional functionality and I/O not present in the original implementation..
Your application is I/O bound; depending on data rates (and critically interrupt rates), I/O seldom constitutes the highest CPU load, and DMA, hardware FIFOs, input capture timer/counters, and hardware PWM etc. can be used to minimise the I/O impact. I shan't go into it in detail; #Lundin has already done that.
Note also that raw processor speed is important for data or signal processing and number crunching - but what I/O generally requires is deterministic real-time response, and that is seldom simply a matter of MHz or MIPS - you will get more deterministic and possibly faster response from an 8bit AVR running at a few MHz than you can guarantee from a 500MHz application processor running Linux - and it won't take 30 seconds to boot!

Unexpected timing in Arm assembly

I need a very precise timing, so I wrote some assembly code (for ARM M0+).
However, the timing is not what I expected when measuring on an oscilliscope.
#define LOOP_INSTRS_CNT 4 // subs: 1, cmp: 1, bne: 2 (when branching)
#define FREQ_MHZ (BOARD_BOOTCLOCKRUN_CORE_CLOCK / 1000000)
#define DELAY_US_TO_CYCLES(t_us) ((t_us * FREQ_MHZ + LOOP_INSTRS_CNT / 2) / LOOP_INSTRS_CNT)
static inline __attribute__((always_inline)) void timing_delayCycles(uint32_t loopCnt)
{
// note: not all instructions take one cycle, so in total we have 4 cycles in the loop, except for the last iteration.
__asm volatile(
".syntax unified \t\n" /* we need unified to use subs (not for sub, though) */
"0: \t\n"
"subs %[cyc], #1 \t\n" /* assume cycles > 0 */
"cmp %[cyc], #0 \t\n"
"bne.n 0b\t\n" /* this instruction costs 2 cycles when branching! */
: [cyc]"+r" (loopCnt) /* actually input, but we need a temporary register, so we use a dummy output so we can also write to the input register */
: /* input specified in output */
: /* no clobbers */
);
}
// delay test
#define WAIT_TEST_US 100
gpio_clear(PIN1);
timing_delayCycles(DELAY_US_TO_CYCLES(WAIT_TEST_US));
gpio_set(PIN1);
So pretty basic stuff. However, the delay (measured by setting a GPIO pin low, looping, then setting high again) timing is consistently 50% higher than expected. I tried for low values (1 us giving 1.56 us), up to 500 ms giving 750 ms.
I tried to single step, and the loop really does only the 3 steps: subs (1), cmp (1), branch (2). Paranthesis is number of expected clock cycles.
Can anybody shed light on what is going on here?
After some good suggestions I found the issue can be resolved in two ways:
Run core clock and flash clock at the same frequency (if code is running from flash)
Place the code in the SRAM to avoid flash access wait-states.
Note: If anybody copies the above code, note that you can delete the cmp, since the subs has the s flag set. If doing so, remember to set instruction count to 3 instead of 4. This will give you a better time resolution.
You can't use these processors like you would a PIC, the timing doesn't work like that. I have demonstrated this here many times you can look around, maybe will do it again here, but not right now.
First off these are pipelined so you average performance is one thing, but and once in a loop and things like caching and branch prediction learning and other factors have settled then you can get consistent performance, for that implementation. Ignore any documentation related to clocks per instruction for a pipelined processor now matter how shallow, that is the first problem in understanding why the timing doesn't work as expected.
Alignment plays a role and folks are tired of me beating this drum but I have demonstrated it so many times. You can search for fetch in the cortex-m0 TRM and you should immediately see that this will affect performance based on alignment. If the chip vendor has compiled the core for 16 bit only then that would be predictable or more predictable (ignoring other factors). But if they have compiled in the other features and if prefetching is happening as described, then the placement of the loop in the address space can affect the loop by plus or minus a fetch affecting the total time to complete the loop, which is measurable with or without a scope.
Branch prediction, which didn't show up in the arm docs as arm doing it but the chip vendors are fully free to do this.
Caching. While a cortex-m0+ if this is an STM32 or perhaps other brands as well, there is or may be a cache you can't turn off. Not uncommon for the flash to be half the speed of the processor thus flash wait state settings, but often the zero wait state means zero additional and it takes two clocks to get one fetch done or at least is measurable that execution in flash is half the speed of execution in ram with all other settings the same (system clock speed, etc). ST has a pretty good prefetch/caching solution with some trademarked name and perhaps a patent who knows. And rarely can you turn this off or defeat it so the first time through or the time entering the loop can see a delay and technically a pre-fetcher can slow down the loop (see alignment).
Flash, as mentioned depending on the chip vendor and age of the part it is quite common for the flash to be half speed of the core. And then depending on your clock rates, when you read about the flash settings in the chip doc where it shows what the required wait states were relative to system clock speed that is a key performance indicator both for the flash technology and whether or not you should really be raising the system clock up too high, the flash doesn't get any faster it has a speed limit, sram from my experience can keep up and so far I don't see them with wait states, but flashes used to be two or three settings across the range of clock speeds the part supports, the newer released parts the flashes are tending to cover the whole range for the slower cores like the m0+ but the m7 and such keep getting higher clock rates so you would still expect the vendors to need wait states.
Interrupts/exceptions. Are you running this on an rtos, are there interrupts going on are you increasing and/or guaranteeing that this gets interrupted with a longer delay?
Peripheral timing, the peripherals are not expected to respond to a load or store in a single clock they can take as long as they want and depending on the clocking system and chip vendors IP, in house or purchased, the peripheral might not run at the processor clock rate and be running at a divided rate making things slower. Your code no doubt is calling this function for a delay, and then outside this timing loop you are wiggling a gpio pin to see something on a scope which leads to how you conducted your benchmark and additional problems with that based on factors above and this one.
And other factors I have to remember.
Like high end processors like the x86, full sized ARMs, etc the processor no longer determines performance. The chip and motherboard can/do. You basically cannot feed the pipe constantly there are stalls going on all over the place. Dram is slow thus layers of caching trying to deal with it but caching helps sometimes and hurts others, branch predictors hurt as much as they help. And so on but it is heavily driven by the system outside the processor core as to how well you can feed the core, and then you get into the core's properties with respect to the pipeline and its own fetching strategy. Ideally using the width of the bus rather than the size of the instruction, transaction overhead so multiple widths of the bus is even more ideal that one width, etc.
Causing tight loops like this on any core to have a jerky motion and or be inconsistent in timing when the same machine code is used at different alignments. Now granted for size/power/etc the m0+ has a tiny pipe, but it still should show the affects of this. These are not pics or avrs or msp430s no reason to expect a timing loop to be consistent. At best you can use a timing loop for things like spi and i2c bit banging where you need to be greater than or equal to some time value, but if you need to be accurate or within a range, it is technically possible per implementation if you control many of the factors, but it is often not worth the effort and you have this maintenance issue now or readability or understandability of the code.
So bottom line there is no reason to expect consistent timing. If you happened to get consistent/linear timing, then great. The first thing you want to do is check that when you changed and re-built the code to use a different value for the loop that it didn't affect alignment of this loop.
You show a loop like this
loop:
subs r0,#1
cmp r0,#0
bne loop
on a tangent why the cmp, why not just
loop:
subs r0,#1
bne loop
But second you then claim to be measuring this on a scope, which is good because how you measure things plays into the quality of the benchmark often the benchmark varies because of how it is measured the yardstick is the problem not the thing being measured, or you have problems with both then the measurement is much more inconsistent. Had you used systick or other to measure this depending on how you did that the measurement itself can cause the variation, and even if you used gpio to toggle a pin that can and probably is affecting this as well. All other things held constant simply changing the loop count depending on the immediate and the value used could push you between a thumb and thumb2 instruction changing the alignment of some loop.
What you have shown implies you have this timing loop which can be affected by a number of system issues, then you have wrapped that with some other loop itself being affected, plus possibly a call to a gpio library function which can be affected by these factors as well from a performance perspective. Using inline assembly and the style in which you wrote this function that you posted implies you have exposed yourself and can easily see a wide range of performance differences in running what appears to be the same code, or even actually the code under test being the same machine code.
Unless this is a microchip PIC, not PIC32, or a very very short list of other specific brand and family of chips. Ignore the cycle counts per instruction, assume they are wrong, and don't try for accurate timing unless you control the factors.
Use the hardware, if for example you are trying to use the ws8212/neopixel leds and you have a tight window for timing you are not going to be successful or will have limited success using instruction timing. In that specific case you can sometimes get away with using a spi controller or timers in the part to generate accurately timed (far more than you can ever do with software timers managing the bit banging or otherwise). With a PIC I was able to generate tv infrared signals with the carrier frequency and ons and off using timed loops and nops to make a highly accurate signal. I repeated that for one of these programmable led things for a short number of them on a cortex-m using a long linear list of instructions and relying on execution performance it worked but was extremely limited as it was compile time and quick and dirty. SPI controllers are a pain compared to bit banging but another evening with the SPI controller and could send any length of highly accurately timed signals out.
You need to change your focus to using timers and/or on chip peripherals like uart, spi, i2c in non-normal ways to generate whatever signal this is you are trying to generate. Leave timed loops or even timer based loops wrapped by other loops for the greater than or equal cases and not for the within a range of time cases. If unable to do it with one chip, then look around at others, very often when making a product you have to shop for the components, across vendors, etc. Push comes to shove use a CPLD or a PAL or GAL or something like that to get highly accurate but custom timing. Depending on what you are doing and what your larger system picture looks like the ftdi usb chips with mpsse have a generic state machine that you can program to generate an array of signals, they do i2c, spi, jtag, swd etc with this generic programmable system. But if you don't have a usb host then that won't work.
You didn't specify a chip and I have a lot of different chips/boards handy but only a small fraction of what is out there so if I wanted to do a demo it might not be worth it, if mine has the core compiled one way I might not be able to get it to demonstrate a variation where the same exact core from arm compiled another way on another chip might be easy. I suspect first off a lot of your variation is because you are making calls within a bigger loop, call to delay call to state change the gpio, and you are recompiling that for the experiments. Or worse as shown in your question if you are doing a single pass and not a loop around the calls, then that can maximize the inconsistency.

Which clock source is used for clocking instructions in cortex-M4

In my Cortex-M4, I have am using a 8Mhz oscillator as HSE, which then gets multiplied to 72Mhz using PLL which then drives SYSCLK. This got me thinking, which clock is the one being used to execute instructions? In other words, if our CPI is 1 (an ideal value, of course), does that mean we would execute 8 million instructions per second or 72 million instructions per second?
I also found this DWT which can be used to measure clock cycles, and hence CPI. So I am guessing which ever clock that is used to execute instructions would be the same one used by DWT?
It is driven from HCLK (not SYSCLK which clocks system timer and it does not have to be equal to HCLK). Thew source of HCLK is settable by the programmer.
if our CPI is 1 (an ideal value, of course), does that mean we would
execute 8 million instructions per second or 72 million instructions
per second?
You can see how many cycles every instruction takes: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDDIGAC.html
The real speed depends on many factors but mainly depends on the place where your code and data reside and the advanced uC features.
If you execute your code fro the internal TCM SRAM and place data in the SRAM (or even better on some uC in TCI and TCD SRAM)you can archive the theoretical execution efficiency as those memories work at the core clock frequency with no wait states or bus waitstates. Ideally if the uC has TC memory and both instructions and data are fetched using separate buses.
If your code resides in the FLASH memory - this memory may introduce some wait states. STM uC (ART accelerator) read the flash in larger a chunks and fetch the instructions ahead. It allows those uCs to perform almost at the max speed. The problem are branch instructions which require pipeline to be flushed and instructions fetched again.

Using PMU to count ticks on ARM

I'm trying to use the PMU (specifically use PMCCNTR to determine ticks per usec) from userspace in ARM. I have an arm64 kernel running an arm32 bit userspace app.
I created an LKM to force the PMUSERENR.EN bit to be on, which works. I then I ran a test program from userspace:
asm volatile ("mrc p15, 0, %[en], c9, c14, 0"
: [en] "=r" (pmuserenr));
printf(" -- %08x\n", pmuserenr);
The first few times I ran this, the bit was correctly marked as turned on. But every third or fourth time I run it, the bit shows up as off. After puzzling over this, I came across this link, where the user is enabling this bit for each cpu. Does each core have it's own instance of a PMU, or is there only one per SoC? If there are multiple per SoC, are the PMUCCNTR registers synced up? (That is, if I read the PMUCCNTR, then do a context switch to another core, and read PMUCCNTR again, can I safely compare the two to see how many ticks have elapsed?).
Even if I modify my LKM to initialize on every online-cpu (and register a notifier to enable it on all new cpu's coming up), I'm still seeing the same behavior, where every nth read from userspace shows the bit not set.
My other issue is that the PMCCNTR is not incrementing. I set the PMCR.E bit but it's still not incrementing. I found one website that says I need to set the PMCR.C bit -- but this has a side effect of clearing the counter, which I don't want to do in fear of creating a race condition (where someone else is using the counter as I clear it). Any help or insights would be very welcome.
Each cpu has its own PMU instance, they are entirely independant (except for a small number of cluster wide events, where each PMU keeps its own count).
See, for example, the first hit today for "debug memory map pmu", remembering that v7 and v8 debug memory map standards are different. The actual answer is too 'obvious' to expect it to be explicitly documented.
0x010000 - 0x010FFF CPU 0 Debug
0x030000 - 0x030FFF CPU 0 PMU
0x110000 - 0x110FFF CPU 1 Debug
0x130000 - 0x130FFF CPU 1 PMU
0x210000 - 0x210FFF CPU 2 Debug
0x230000 - 0x230FFF CPU 2 PMU
0x310000 - 0x310FFF CPU 3 Debug
0x330000 - 0x330FFF CPU 3 PMU

Usage of EN4B command

Can anybody explains the usage of EN4B command of micron SPI chips.
I want to know the difference between 3 byte and 4 byte address mode in SPI.
I was going through the SPI drivers where I found this commands.
Thanks in Advance !!
From a legacy point of view, SPI commands have always used 3 bytes for the address interested by their operation.
This was fine as with 24 bits it is possible to address up to 128MiB.
When the Flashes grew larger it was needed to switch from 3 bytes to 4 bytes addressing.
Whenever you have any doubts regarding the hardware you can find the answers in the proper datasheet, I don't know what specific chip you are referring to however.
I found the Micron N25Q512A NOR Flash, which is 512MiB so it needs a form of 4 bytes addressing; from it you can learn that
There are 3 bytes legacy commands and new 4 bytes commands.
For example 03h and 13h for the single read.
You can supply a default fourth address byte with a specific register.
The Extended Address Register let you choose the region of the flash for the legacy commands.
You can enable 4 bytes addressing for legacy command.
Either write the appropriate bit in the Nonvolatile Configuration Register or use the ENTER / EXIT 4-BYTE ADDRESS MODE (opcodes B7h and E9h respectively)
This Linux patch also have some insights, basically telling that some chips only support one of the three points above.
Macronix seems to have first opted for the number 3 only and Spansion for the number 1.
Checking some datasheet of theirs seems to suggests that now both support all three methods.

Resources