Operating on low part of ARM NEON vector efficiently with intrinsics

ARM provides intrinsics that operate on the high half of a vector type, but I can't find equivalents that operate on the low half.
Consider a simple example that calculates the residue of two buffers of 16 8-bit elements each (8 bit - 8 bit = 16-bit result).
uint8x16_t src0, src1;
int16x8_t dst0, dst1;
src0 = vld1q_u8((uint8_t const *)src0_);
src1 = vld1q_u8((uint8_t const *)src1_);
dst0 = vreinterpretq_s16_u16(vsubl_u8(vget_low_u8(src0), vget_low_u8(src1)));
dst1 = vreinterpretq_s16_u16(vsubl_high_u8(src0, src1));
To operate on the high half of the registers we have vsubl_high_u8. But to operate on the low half, we first have to extract the low half of each input register with vget_low_u8 (which may carry some latency) and then use the subtract instruction.
Assembly provides access to the low and high halves through separate register names (on 32-bit ARM NEON), so this is purely a question about intrinsics.
(Editor's note: AArch64 does have separate low/high asm instructions because every s (32-bit) and d (64-bit) register is the low part of a separate q (128-bit) vector register.)
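For reference, a scalar model of what the two widening subtracts compute may make the low/high split more concrete. This is only a plain-C sketch that runs on any host; the helper name widening_sub_u8 is hypothetical and it is not a replacement for the intrinsics. It also shows why reinterpreting the unsigned 16-bit result as signed gives the expected negative differences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (hypothetical helper, not a NEON API): subtract two
 * 16-byte buffers, widening each 8-bit difference to a signed 16-bit lane.
 * dst0 lanes correspond to vsubl_u8(vget_low_u8(..), vget_low_u8(..)),
 * dst1 lanes correspond to vsubl_high_u8(..). */
static void widening_sub_u8(const uint8_t src0[16], const uint8_t src1[16],
                            int16_t dst0[8], int16_t dst1[8])
{
    assert(src0 && src1 && dst0 && dst1);
    for (size_t i = 0; i < 8; ++i) {
        /* Operands promote to int, so each difference is exact in -255..255,
         * which matches the u16 result reinterpreted as s16. */
        dst0[i] = (int16_t)(src0[i] - src1[i]);
        dst1[i] = (int16_t)(src0[i + 8] - src1[i + 8]);
    }
}
```

For example, lane values 3 and 5 give 0xFFFE as an unsigned 16-bit result, which reads as -2 once reinterpreted as signed.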

Related

ARM: Using bit-banded memory from C or C++

ARM Cortex supports bit-banded memory, where individual bits are mapped to "bytes" in certain regions. I believe that only certain parts of RAM are bit-banded. I'd like to use bit-banding from C and C++.
How do I do this? It seems I'd need to:
Tell the compiler to place certain variables in bit-banded regions. How? What if the variables are elements of a struct?
Tell the compiler, when I want to access a bit, to turn if (flags & 0x4) into if (flags_bb_04). Ideally, I'd like this to be automatic, and to fall back to the former if bit banding isn't available.

The simplest solution is to use regular variables and access them through their bit-band address. For that you do not need to "tell the compiler" anything. For example, given:
extern "C" volatile uint32_t* getBitBandAddress( volatile const void* address, int bit )
{
volatile uint32_t* bit_address = 0;
uint32_t addr = reinterpret_cast<uint32_t>(address);
// This bit manipulation makes the function valid for RAM
// and Peripheral bitband regions
uint32_t word_band_base = addr & 0xf0000000;
uint32_t bit_band_base = word_band_base | 0x02000000;
uint32_t offset = addr - word_band_base;
// Calculate bit band address
bit_address = reinterpret_cast<volatile uint32_t*>(bit_band_base + (offset * 32u) + (static_cast<uint32_t>(bit) * 4u));
return bit_address ;
}
you could create a 32-bit "array" thus:
uint32_t word = 0 ;
volatile uint32_t* bits = getBitBandAddress( &word, 0 ) ;
bits[5] = 1 ; // word now == 32 (bit 5 set).
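The address arithmetic above can be checked on a host without touching real memory. The following is a minimal sketch, assuming the usual Cortex-M3/M4 layout in which the alias region starts 0x02000000 above the base of each bit-band region; the helper name bitband_alias is hypothetical, and it only computes the alias address, it never dereferences it.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of the bit-band address calculation only.
 * Assumption: alias region = region base | 0x02000000, one 32-bit
 * alias word per bit (so 32 alias bytes per byte of bit-band memory). */
static uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    assert(bit < 32u);
    uint32_t region_base = addr & 0xF0000000u;   /* 0x20000000 (SRAM) or 0x40000000 (peripherals) */
    uint32_t alias_base  = region_base | 0x02000000u;
    uint32_t byte_offset = addr - region_base;
    return alias_base + byte_offset * 32u + bit * 4u;
}
```

With these assumptions, bit 5 of the word at 0x20000000 maps to 0x22000014, and bit 0 of the word at 0x20000004 maps to 0x22000080.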
Now if you have a part with, for example, external or CCM memory that is not bit-bandable, you do need to ensure that the linker (not the compiler) places the normal memory object in bit-bandable memory. How that is done is toolchain specific, but with the GNU toolchain, for example, you might have:
uint32_t word __attribute__ ((section ("ISRAM1"))) = 0 ;
Bit-banding is perhaps most useful for fast, atomic, thread-safe access to individual bits in peripheral registers.
Some compilers are bit-band aware and may automatically optimise single-bit bitfield accesses using bit-banding. For example:
struct
{
unsigned bit1 : 1 ;
unsigned bit2 : 1 ;
} bits __attribute__ ((section ("BITBANDABLE")));
The compiler (at least armcc v5) may optimise accesses to bits.bit1 and bits.bit2 to use bit-band accesses. YMMV.

TSS entries for stack switching

I have read that the TSS contains information about registers, etc. Right now, I am trying to implement the switch from kernel to user mode and back. I have read the Intel 80386 manual, and was looking at this resource: http://www.brokenthorn.com/Resources/OSDev23.html for a general workflow. They do this:
void install_tss (uint32_t idx, uint16_t kernelSS, uint16_t kernelESP) {
//! install TSS descriptor
uint32_t base = (uint32_t) &TSS;
gdt_set_descriptor (idx, base, base + sizeof (tss_entry),
I86_GDT_DESC_ACCESS|I86_GDT_DESC_EXEC_CODE|I86_GDT_DESC_DPL|I86_GDT_DESC_MEMORY,
0);
//! initialize TSS
memset ((void*) &TSS, 0, sizeof (tss_entry));
TSS.ss0 = kernelSS;
TSS.esp0 = kernelESP;
TSS.cs=0x0b;
TSS.ss = 0x13;
TSS.es = 0x13;
TSS.ds = 0x13;
TSS.fs = 0x13;
TSS.gs = 0x13;
//! flush tss
flush_tss (idx * sizeof (gdt_descriptor));
}
I am confused as to why these selectors have RPL = 3.
In my case, when I am in user mode and I want to use a trap gate to get to kernel mode, the cs segment in the trap gate would have RPL 0 (the last 2 bits of the 16 bit segment) and the GDT entry corresponding to the cs segment would also have DPL 0. And I've read that an inter-level privilege switch switches stacks (only??) looking at the TSS. I'm guessing that the above piece of code must have a TSS.ss = 0x10.
Note: We're assuming the classic 0x08 = Kernel code, 0x10 = Kernel data, .... GDT structure here

The TSS structure has a lot of fields that are used for hardware task switching (e.g. TSS.ss, which is where the ss register's contents would be saved/loaded if a hardware task switch happened), plus a few fields that are used for switching the task to a higher privilege level (e.g. TSS.ss0 for switching to CPL=0).
You're looking at fields that are used for hardware task switching (which typically aren't worth bothering with because it's faster to do software task switching instead); and I'd guess someone shoved some "hardware task switch compatible" values in there (even though they're not used) to avoid uninitialized values.
Instead, you want to look at the TSS.esp0 and TSS.ss0 fields of the TSS, which are the only 2 fields of the TSS that matter for switching to CPL=0 (and might be the only 2 fields of the TSS you ever use).
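To make that concrete, here is a minimal sketch of a 32-bit TSS layout in which only ss0 and esp0 are filled in. The struct name tss32 and the helpers are hypothetical (they are not from the tutorial), and the 16-bit segment fields are padded to 32 bits, which is a common way to declare them. The selector helpers also show how to read the values in the question: 0x13 is GDT index 2 (the 0x10 data descriptor) with RPL 3 in the low two bits, and 0x0b is index 1 with RPL 3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit TSS layout (104 bytes). Segment fields are 16 bits in hardware
 * and are padded to 32 bits here. */
struct tss32 {
    uint32_t prev_tss;
    uint32_t esp0, ss0;          /* stack used when entering CPL=0 */
    uint32_t esp1, ss1;
    uint32_t esp2, ss2;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt;
    uint16_t trap;
    uint16_t iomap_base;
};

/* Low two bits of a selector are the RPL; bits 3 and up are the table index. */
static unsigned selector_rpl(uint16_t sel)   { return sel & 0x3u; }
static unsigned selector_index(uint16_t sel) { return sel >> 3; }

/* Hypothetical helper: set only the fields needed for a switch to CPL=0. */
static void tss_set_kernel_stack(struct tss32 *tss, uint16_t kernel_ss,
                                 uint32_t kernel_esp)
{
    assert(tss != NULL);
    assert(selector_rpl(kernel_ss) == 0);   /* kernel stack selector has RPL 0 */
    memset(tss, 0, sizeof *tss);
    tss->ss0  = kernel_ss;
    tss->esp0 = kernel_esp;
}
```

A call such as tss_set_kernel_stack(&tss, 0x10, stack_top) would then leave every hardware-task-switch field zeroed, assuming the classic 0x10 kernel data selector from the question.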

UART communication on Atmega32A with PC

I'm a beginner in programming AVR microcontrollers and I sometimes get a lot of headaches from reading the datasheets.
I'm trying to set up communication between my AVR and a PC, just to send some characters and receive them on my computer.
There are two lines I don't understand from the whole program and that is:
void USART_init(void)
{
UBRRH = (uint8_t)(BAUD_PRESCALLER>>8); // <---- this one!
UBRRL = (uint8_t)(BAUD_PRESCALLER); // <--- and this one
UCSRB = (1<<RXEN)|(1<<TXEN);
UCSRC = (1<<UCSZ0)|(1<<UCSZ1)|(1<<URSEL);
}
Datasheet
Why do I have to shift BAUD_PRESCALLER by 8? If BAUD_PRESCALLER is a number, doesn't shifting it by 8 make the result zero (because we are shifting it too many times)?
From the datasheet I understand that UBRRH contains the four most significant bits and UBRRL contains the eight least significant bits of the USART baud rate. (Note: UBRR is a 12-bit register.)
So how do we actually put all the required bits into the UBRR register?

You have to shift it right 8 bits because the result of BAUD_PRESCALLER is larger than 8 bits. Shifting it right 8 bits gives you the most significant byte of a 16-bit value.
For example, if the value of BAUD_PRESCALLER is 0x123, then 0x1 would be assigned to UBRRH and 0x23 would be assigned to UBRRL.
If the library was smart it could also perform sanity checking on BAUD_PRESCALLER to make sure it fits in 16 bits. If it can't, that means you cannot achieve the baud rate you want given the clock you are using. If your UBRRx is truly 12 bits, the sanity check would look something like this:
#if BAUD_PRESCALLER > 0xFFF
#error Invalid prescaler
#endif
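The split can be tried on a host with a few lines of C. This is a sketch under assumptions: the helper names are hypothetical, and ubrr_for uses the datasheet formula for asynchronous normal mode (U2X = 0), UBRR = F_CPU / (16 * baud) - 1, with truncating integer division. The question does not show how BAUD_PRESCALLER is defined, so treat the formula as an illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: baud-rate register value for asynchronous
 * normal mode (U2X = 0), truncated to an integer. */
static uint16_t ubrr_for(uint32_t f_cpu_hz, uint32_t baud)
{
    assert(baud != 0u);
    uint32_t ubrr = f_cpu_hz / (16u * baud) - 1u;
    assert(ubrr <= 0x0FFFu);          /* UBRR is 12 bits wide */
    return (uint16_t)ubrr;
}

/* On the AVR these two bytes would be written to UBRRH and UBRRL. */
static uint8_t high_byte(uint16_t v) { return (uint8_t)(v >> 8); }
static uint8_t low_byte(uint16_t v)  { return (uint8_t)v; }
```

With a 16 MHz clock and 9600 baud the value is 103, so the high byte is 0 and only UBRRL matters; at 300 baud the value is 3332 (0x0D04), and the shift by 8 is what delivers the 0x0D to UBRRH.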

Timer rollover handling

I have a 32 bit hardware timer that I'd like to extend to 64 bit effective length in software.
In my embedded system, I have available a 32-bit hardware "core timer" (CT) that ticks at ~ 40 MHz, so it rolls over in about 107 seconds.
That's great for precise timing of periods up to 107 seconds. But I'd like to do equally precise timing of longer periods.
It also has a 32-bit "period" register - when the CT value matches the period register, an interrupt is generated.
My ISR looks like this (simplified for clarity):
const UINT32 ONE_MILLISECOND = TICK_RATE/1000;
UINT64 SwRTC;
void CT_ISR(void) {
PeriodRegister += ONE_MILLISECOND;
SwRTC += ONE_MILLISECOND;
ClearCTInterrupt();
}
So, now I have a 64 bit "SwRTC" that can be used to measure longer periods, but only to a precision of 1 millisecond, plus the 32-bit hardware timer that is precise to 1/40 MHz (25 nanoseconds). Both use the same units (TICK_RATE).
How can I combine both to get a 64 bit timer that's equally precise, while still getting interrupts at 1000 Hz?
My first try looked like this:
UINT64 RTC(void){
UINT64 result;
DisableInterrupts(); // to allow atomic operations
result = (SwRTC & 0xFFFFFFFF00000000ull) + ReadCoreTimer();
EnableInterrupts();
return result;
}
But that's no good, because if the CT rolls over while interrupts are disabled then I'll get a result with a small number in the low-order 32 bits, but without the high-order bits having been incremented by the ISR.
Maybe something like this would work - read it twice and return the higher value:
UINT64 RTC(void){
UINT64 result1, result2;
DisableInterrupts(); // to allow atomic operations
result1 = (SwRTC & 0xFFFFFFFF00000000ull) + ReadCoreTimer();
EnableInterrupts();
DisableInterrupts(); // again
result2 = (SwRTC & 0xFFFFFFFF00000000ull) + ReadCoreTimer();
EnableInterrupts();
if (result1 > result2)
return result1;
else
return result2;
}
I'm not sure if that'll work or if there is a hidden problem there I've missed.
What is the best way to do this?
(Some may ask why I need to time such long periods so precisely in the first place. It's mainly for simplicity - I don't want to use 2 different timing methods depending on the period; I'd prefer to use the same method all the time.)

I think I've almost solved this myself:
UINT64 Rtc(void){
UINT64 softwareTimer = SwRTC;
UINT32 lowOrderBits = softwareTimer; // just take low-order 32 bits
UINT64 coreTimer = ReadCoreTimer();
if (lowOrderBits > coreTimer) // if CT has rolled over since SwRTC was updated
softwareTimer += 0x100000000; // then increment high-order 32 bits of software count
return (softwareTimer & 0xFFFFFFFF00000000ull) + coreTimer;
}
This first reads the 64-bit software timer, then the 32-bit hardware timer.
The hardware timer (updated every 25 ns) should always be >= the low-order 32 bits of the software timer (updated only every 1 ms).
If it's not, that indicates the hardware timer rolled over since the software timer was read.
So, in that case I increment the high-order word of the software timer.
Then just combine the high-order 32 bits from the software time with the low-order 32 bits from the hardware timer.
One nice side effect is there's no need to disable interrupts.
The only problem I can see is, what if compiler optimization re-orders the code so that the hardware timer gets read first? Then I could get an interrupt that increments the software timer before I have a chance to read it.
At first I thought I could fix that by disabling interrupts while reading both timers, but what if the compiler re-orders the code so the DisableInterrupts() comes too late?
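One way to address the reordering worry, sketched here under assumptions, is to declare SwRTC volatile and re-read it after sampling the hardware timer, retrying if it changed. A compiler may not reorder volatile accesses relative to each other, so this relies on ReadCoreTimer() itself performing a volatile access on the target, which is an assumption. The re-read should also catch the case where a 64-bit read of SwRTC is split into two 32-bit loads with an interrupt in between. In this host-runnable sketch, fake_core_timer stands in for the hardware and combine is a hypothetical name for the combining step from the code above.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the real system (assumptions for this sketch): on target,
 * SwRTC is updated by the ISR and ReadCoreTimer() reads the hardware. */
static volatile uint64_t SwRTC;
static volatile uint32_t fake_core_timer;
static uint32_t ReadCoreTimer(void) { return fake_core_timer; }

/* Combining step: if the core timer is below the low word of the software
 * count, the core timer has wrapped since the last ISR update. */
static uint64_t combine(uint64_t sw, uint32_t ct)
{
    if ((uint32_t)sw > ct)
        sw += 0x100000000ull;
    return (sw & 0xFFFFFFFF00000000ull) + ct;
}

uint64_t Rtc(void)
{
    uint64_t before, after;
    uint32_t ct;
    do {
        before = SwRTC;          /* volatile read 1 */
        ct = ReadCoreTimer();    /* hardware sample between the two reads */
        after = SwRTC;           /* volatile read 2 */
    } while (before != after);   /* ISR ran in between: sample again */
    return combine(before, ct);
}
```

The loop normally runs once; it repeats only when the 1 ms interrupt lands inside the three reads, so interrupts still do not need to be disabled.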

If it is an up-counter, for example, I will typically compute:
elapsed = ((nowtime-starttime)&MASK)+(rollovers<<SIZE);
So long as you sample often enough (many times per rollover), then for an up-counter, if nowtime (the time I just sampled) is less than lasttime (the prior nowtime), it has rolled over.
This assumes it counts through every value and does not skip from, say, 0xFF..FFF to 0x00...01.
For a down-counter, just do everything the opposite way: use starttime-nowtime, and a rollover is indicated by nowtime > lasttime.
Some timers have a rollover interrupt flag, which you can sometimes just poll instead of taking the interrupt; again, so long as you can guarantee you sample it at least once per rollover, you can simply use that to maintain a rollover count.
If at any time you miss a rollover, then naturally you will be off by the size of the timer (4 giga-counts or whatever).
Some hardware lets you use one timer to generate an output clock that you can then feed back as the input to another timer, cascading them that way (sometimes this is done on chip).
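The polling approach can be sketched as follows. This is a variant rather than the exact formula above: it builds a wider timestamp from the rollover count and the raw sample, and elapsed time is then the difference of two such timestamps. The names ext_timer and ext_timer_sample are hypothetical, and a 16-bit up-counter is assumed purely to keep the numbers small.

```c
#include <assert.h>
#include <stdint.h>

#define TIMER_BITS 16u
#define TIMER_MASK ((1ul << TIMER_BITS) - 1ul)

struct ext_timer {
    uint32_t last;       /* previous raw sample */
    uint64_t rollovers;  /* number of wraps observed so far */
};

/* Feed one raw sample of an up-counter and get the extended count back.
 * Must be called at least once per rollover period, or a wrap is missed. */
static uint64_t ext_timer_sample(struct ext_timer *t, uint32_t raw)
{
    assert(t != 0);
    raw &= TIMER_MASK;
    if (raw < t->last)           /* now < last: the counter wrapped */
        t->rollovers++;
    t->last = raw;
    return (t->rollovers << TIMER_BITS) + raw;
}
```

For example, sampling 0xFFF0 and then 0x0010 yields extended counts 0xFFF0 and 0x10010, so the elapsed time is 0x20 ticks.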

Problems in AVR C combining ADC readings to generate PWM output

I'm writing a program for an ATMega328P that will take readings from several ADC channels, combine them into a single signal and output this signal through PWM.
I've successfully backed off my ADC polling to 50Hz per channel using Single Conversion mode. I'm using Timer/Counter2 for PWM generation, and Timer/Counter1 for doing the calculations I need to do to set compare values for Timer/Counter2. This is the ISR for Timer/Counter1:
// Interrupt service routine called to generate PWM compare values
ISR(TIMER1_COMPA_vect)
{
// Grab most recent ADC reading for ADC0
uint32_t sensor_value_0 = adc_readings[0];
// Get current value for base waveform from wavetable stored in sinewave_data
uint32_t sample_value_0 = pgm_read_byte(&sinewave_data[sample_0]);
// Multiply these two values together
// In other words, use the ADC reading to modulate the amplitude of base wave
uint32_t sine_0 = (sample_value_0 * sensor_value_0) >> 10;
// Do the same thing for ADC2
uint32_t sensor_value_1 = adc_readings[1];
uint32_t sample_value_1 = pgm_read_byte(&sinewave_data[sample_1]);
uint32_t sine_1 = (sample_value_1 * sensor_value_1) >> 10;
// Add channels together, divide by two, set compare register for PWM
OCR2A = (sine_0 + sine_1) >> 1;
// Move successive ADC base waves through wavetable at integral increments
// i.e., ADC0 is carried by a 200Hz sine wave, ADC1 at 300Hz, etc.
sample_0 += 2;
sample_1 += 3;
// Wrap back to front of wavetable, if necessary
if (sample_0 >= sinewave_length) {
sample_0 = 0;
}
if (sample_1 >= sinewave_length) {
sample_1 = 0;
}
} // END - Interrupt service routine called to generate PWM compare values
My problem is that I get no PWM output. If I set either sensor_value_0 or sensor_value_1 to 1024 and leave the other sensor_value_ reading from the ADC, I do get one full-amplitude component wave and one amplitude-modulated component wave. If, however, I choose a different value for the hardcoded mock amplitude (1023, for instance), I am not so lucky: any other value gives me no PWM output. If I set both sensor_value_s to read the same ADC channel, I would expect two component waves whose amplitudes are modulated identically. Instead, I get no PWM output. What is most confusing of all to me is that if I choose a value for the hardcoded amplitude that is an exact power of two, all is well.
The whole power-of-two part makes this seem to me to be a bit-twiddling issue that I'm not seeing. Can you see what I must have clearly missed? I'd appreciate any tips at all!
(I've posted my entire source here to keep things as neat as possible on SO.)

Your issue may be caused by the architecture of the AVR that you're developing on. The ATMega328p has 8-bit registers, like most other AVR chips. This means that each 32-bit value you're working with has to be split across four separate registers (or stored in memory) by the compiler every time you perform arithmetic on it. The arithmetic itself then has to be built up from sequences of 8-bit instructions, so I'm really not sure what the compiler is doing!
I'd be interested to know what the disassembly of your code is, but my guess is that gcc is using the MUL instruction to execute the sample_value_0 * sensor_value_0 code. This instruction operates on two 8-bit values and produces a 16-bit value, so I wouldn't be surprised if that is the reason you're seeing an odd dependence on powers of two.
I'd say try reworking this block of code by changing the data types of the variables. Use uint8_t for sensor_value_* and sample_value_*, and uint16_t for sine_*. Then, to make sure everything fits in the 8-bit OCR2A register, change the assignment to something like:
OCR2A = (sine_0 + sine_1) & 0xFF;
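When picking narrower types it may help to check the ranges involved. The sketch below runs on a host and assumes a 10-bit ADC reading (0..1023) and an 8-bit wavetable sample, as in the question; the helper names scale_sample and mix2 are hypothetical. Note that a 10-bit reading does not fit in uint8_t, and the largest product, 255 * 1023 = 260865, does not fit in 16 bits, so this sketch keeps a 32-bit intermediate. On the AVR, int is 16 bits, so the cast before the multiply matters there.

```c
#include <assert.h>
#include <stdint.h>

/* Scale an 8-bit wavetable sample by a 10-bit ADC reading (0..1023).
 * The product needs up to 18 bits, so it is computed in 32 bits. */
static uint8_t scale_sample(uint8_t sample, uint16_t adc)
{
    assert(adc <= 1023u);
    uint32_t product = (uint32_t)sample * adc;   /* at most 255 * 1023 = 260865 */
    return (uint8_t)(product >> 10);             /* at most 254, fits in 8 bits */
}

/* Average two 8-bit channels; the sum needs 9 bits before halving. */
static uint8_t mix2(uint8_t a, uint8_t b)
{
    return (uint8_t)(((uint16_t)a + b) >> 1);
}
```

With these ranges, the value written to OCR2A would be mix2(scale_sample(s0, adc0), scale_sample(s1, adc1)), which always stays within 0..254.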

@Devrin, I appreciate the response, but just manipulating types didn't do it for me. Here's what I ended up doing:
uint8_t sine_0 = (pgm_read_byte(&sinewave_data[sample_0]) >> 5) * (adc_readings[1] >> 5);
uint8_t sine_1 = (pgm_read_byte(&sinewave_data[sample_1]) >> 5) * (adc_readings[2] >> 5);
OCR2A = (sine_0 >> 1) + (sine_1 >> 1);
Essentially, I've done all my shifting immediately, instead of waiting until the last minute. Unfortunately, I lose a lot of precision, but at least the code works as expected. Now, I will begin cranking things back up to find the initial cause of my issues.
