Our task is intended to demonstrate the benefit of using DMA to copy a large amount of data versus relying on the processor to directly handle the copying.
The processor is an STM32F407 on the ST discovery board.
In order to measure the copying time, a GPIO pin must be turned ON during copying and OFF once it has been copied.
The code appears to be functional but it is currently showing the CPU taking about 2.15ms to complete and DMA about 4.5ms, which is the opposite of what is intended. I'm not sure if there simply isn't enough data for the faster speed of DMA to offset the overhead in setting it up perhaps?
I have tried both copying elements of an array using the CPU and also using the memcpy function which seemed to yield very similar times.
The function code is shown below:
DMASpeed(void)
{
#define elementNum 32000
int *ptr = NULL;
ptr = (int*)malloc(elementNum * sizeof(int));
int *ptr2 = NULL;
ptr2 = (int*)malloc(elementNum * sizeof(int));
for (int i = 0; i < elementNum; i++)
{
ptr[i] = 4;
}
LD5_GPIO_Port->BSRR = (uint32_t)LD5_Pin << 16U;
LD6_GPIO_Port->BSRR = (uint32_t)LD6_Pin << 16U;
// Initial value
// printf("BEFORE: dst = '%s'\n", dst);
// Transfer
printf("Initiate DMA Transfer...\n");
HAL_DMA_Start(&hdma_memtomem_dma2_stream0, (int)ptr, (int)ptr2, (elementNum * sizeof(int)));
LD5_GPIO_Port->BSRR = LD5_Pin;
printf("DMA Transfer initiated.\n");
// Poll for DMA completion
printf("Poll for DMA completion.\n");
HAL_DMA_PollForTransfer(&hdma_memtomem_dma2_stream0,
HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
LD5_GPIO_Port->BSRR = (uint32_t)LD5_Pin << 16U;
printf("DMA complete.\n");
// Print result
// printf("AFTER: dst = '%s'\n", dst);
free(ptr);
free(ptr2);
ptr = (int*)malloc(elementNum * sizeof(int));
ptr2 = (int*)malloc(elementNum * sizeof(int));
for (int i = 0; i < elementNum; i++)
{
ptr[i] = i;
}
printf("Initiate CPU Transfer...\n");
LD6_GPIO_Port->BSRR = LD6_Pin;
// for (int i = 0; i<512; i++)
// {
// ptr2[i] = ptr[i];
// }
memcpy(ptr2, ptr, (elementNum * sizeof(int)));
printf("CPU Transfer Complete.\n");
LD6_GPIO_Port->BSRR = (uint32_t)LD6_Pin << 16U;
free(ptr);
free(ptr2);
}
Thanks in advance for any assistance
you try to proof something what is not the true. DMA memory to memory transfer will be always slower than direct CPU one. DMA was not intended to be faster than the CPU. it's there is to provide the transfer w
without the CPU activity in the background. the core has always priority over the DMA.
MEM to MEM DMA transfer will be always slower than the CPU one
There is another problem as well. Many STM devices have memory areas which are not accessible by the DMA (for example CCMRAM).
Remove printf in below code segment:
LD5_GPIO_Port->BSRR = LD5_Pin;
printf("DMA Transfer initiated.\n"); // <--Remove this
// Poll for DMA completion
printf("Poll for DMA completion.\n"); // <--Remove this
You are turning ON the pin and then printing large text , it is adding up in your total time calculation.
Remove all printf OR atleast do not print anything in between pin toggling.
EDIT:
To be precise you are printing 50 characters in case of DMA transfer and 23 characters in case of CPU transfer.
For those, who google for "How to fasten DMA memory-to-memory transfer?" here is the piece of advice: force your compiler to allocate all HAL code, related to your DMA transfer to the RAM, the best is to the RAM exclusively coupled with the Core. Your compiler will generate function code, which will be copied to the specific RAM at startup, and then all that functions will be called from the RAM and sped up because of it. However, that is also true for copying "by hand".
In this case, it is recommended to allocate to the RAM the following files/functions:
stm32[whatever]_hal_dma.c
DMA[N]_Stream[M]_IRQHandler(), where N and M are the numbers of your DMA and stream used for the transfer respectively.
Related
I have a program which accesses single bytes in a large array at random. Since this array exceeds the L2 cache, it requires many queries to RAM. I created a benchmark which emulates this program by generating random numbers and querying a large array. It seems like the benchmark is under-performing relative to my RAM's advertised speed.
I have DDR4-2933 RAM which is supposed to handle 2933 MT/s. The maximum transfer rate I have been able to achieve is 56.8 MT/s. What would be the bottleneck preventing faster execution? If I had to speculate, I might say the CPU could only reorder instructions within some fixed window and this would limit the parallelization of the fetches. Although, I have no evidence beyond the benchmarks.
Benchmark & Methodology
I created a short program which populates a large array with random numbers. Then, it loads values at random offsets from the array and XORs them together. Memory accesses should be reorderable.
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <sys/mman.h>
/* XOROSHIRO PRNG */
static inline uint64_t rotl(const uint64_t x, int k) {
return (x << k) | (x >> (64 - k));
}
static uint64_t s[4];
uint64_t next(void) {
const uint64_t result = rotl(s[1] * 5, 7) * 9;
const uint64_t t = s[1] << 17;
s[2] ^= s[0];
s[3] ^= s[1];
s[1] ^= s[2];
s[0] ^= s[3];
s[2] ^= t;
s[3] = rotl(s[3], 45);
return result;
}
/* BENCHMARK */
int main(int argc, char ** argv) {
char const * prog = argc > 0 ? argv[0] : "[program]";
if(argc != 3) {
fprintf(stderr, "Usage: %s [size (power of 2)] [n_runs]", prog);
exit(1);
}
unsigned int sshift = strtol(argv[1], NULL, 10);
uint64_t calls = strtol(argv[2], NULL, 10);
size_t size = 1ull << sshift;
uint64_t smask = (1ull << sshift) - 1;
uint8_t * buf = malloc(size);
if(!buf) {
fprintf(stderr, "no mem");
exit(1);
}
// Seed PRNG with magic values
s[0] = 0x0BE38E2AC;
s[1] = 0x23D933C53;
s[2] = 0xE72482E32;
s[3] = 0x35C339D23;
for(size_t i = 0; i < size; i++) {
buf[i] = next();
}
clock_t start = clock();
double cpu_time_used;
uint8_t val = 0;
for(size_t i = 0; i < calls; i++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
cpu_time_used = ((double) (clock() - start)) / CLOCKS_PER_SEC;
printf("time: %.3f\n", cpu_time_used);
printf("calls: %ld\n", calls);
printf("time/call: %.3e\n", cpu_time_used / calls);
printf("MT / sec: %.3e\n", calls / cpu_time_used / 1e6);
printf("val: %d\n", val); // ensure val is not optimized out
}
All benchmarks were executed on a Intel(R) Core(TM) i9-10885H CPU running at 2.40 GHz with CPU scaling disabled. The program was compiled with gcc -O3 main.c -o main -g -O3 -flto and called with nice -n -2 ./main 31 1000000000 for an array of size 2^31 and 1e9 accesses. The number of transactions per second issued to RAM can be approximated by the number of bytes fetched since virtually all of the lookups result in cache missed. I tested this using perf and it seemed like around 95% of cache lookups resulted in misses.
The random number generator occupies about 7.1% of program overhead. This was measured by removing the line val ^= buf[idx]; which executes the fetch. perf reported around 99% of memory overhead was due to last-level cache misses.
Follow-up Benchmarks
Loop unrolling:
By default, GCC 12.2.0 did not unroll the loop. It took some cajoling. I replaced:
for(size_t i = 0; i < calls; i++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
with:
for(size_t i = 0; i < batches; i++) {
#pragma GCC unroll 3
for(size_t j = 0; j < 8; j++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
}
I look five timings at 2.0 GHz. The version without loop unrolling seemed to perform better but not significantly. I'm not very surprised since the branch would seem highly predictable.
no unrolling unrolling
0 23.249 24.287
1 24.044 24.520
2 23.455 24.358
3 23.233 24.167
4 22.969 23.833
Multi-processing
I implemented a threaded version of the benchmark. It evenly divides the accesses among the threads. I also switch to measuring wall clock time. Here are the measurements (taken at 2.4 GHz):
threads time MT/s
0 1 17.345 57.653502
1 2 8.949 111.744329
2 4 5.022 199.123855
3 8 3.533 283.045570
4 16 3.203 312.207306
5 32 3.200 312.500000
6 64 3.173 315.159155
7 124 3.165 315.955766
Modified program:
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <sys/mman.h>
#include <threads.h>
#include <pthread.h>
/* XOROSHIRO PRNG */
static inline uint64_t rotl(const uint64_t x, int k) {
return (x << k) | (x >> (64 - k));
}
static thread_local uint64_t s[4];
static uint64_t next(void) {
const uint64_t result = rotl(s[1] * 5, 7) * 9;
const uint64_t t = s[1] << 17;
s[2] ^= s[0];
s[3] ^= s[1];
s[1] ^= s[2];
s[0] ^= s[3];
s[2] ^= t;
s[3] = rotl(s[3], 45);
return result;
}
struct xor_args {
size_t iters;
uint8_t * buf;
uint8_t res;
uint64_t smask;
pthread_t id;
};
#include <sys/random.h>
void * xor_worker(struct xor_args * a) {
if(-1 == getrandom(&s, sizeof(s), GRND_RANDOM)) {
fprintf(stderr, "getrandom() failed\n");
exit(1);
}
uint8_t val = 0;
for(size_t i = 0; i < a->iters; i++) {
uint64_t idx = next() & a->smask;
val ^= a->buf[idx];
}
a->res = val;
return NULL;
}
/* BENCHMARK */
int main(int argc, char ** argv) {
char const * prog = argc > 0 ? argv[0] : "[program]";
if(argc != 4) {
fprintf(stderr, "Usage: %s [size (power of 2)] [n_runs] [threads]", prog);
exit(1);
}
unsigned int sshift = strtol(argv[1], NULL, 10);
uint64_t calls = strtol(argv[2], NULL, 10);
int threads = strtol(argv[3], NULL, 10);
size_t size = 1ull << sshift;
uint64_t smask = (1ull << sshift) - 1;
uint8_t * buf = malloc(size);
if(!buf) {
fprintf(stderr, "no mem");
exit(1);
}
// Seed PRNG with magic values
s[0] = 0x0BE38E2AC;
s[1] = 0x23D933C53;
s[2] = 0xE72482E32;
s[3] = 0x35C339D23;
for(size_t i = 0; i < size; i++) {
buf[i] = next();
}
struct timespec start, finish;
double elapsed;
clock_gettime(CLOCK_MONOTONIC, &start);
struct xor_args * args = malloc(sizeof(struct xor_args) * threads);
int s;
for(int i = 0; i < threads; i++) {
struct xor_args * arg = &args[i];
arg->iters = calls / (uint64_t) threads;
if(i == 0) {
args->iters += calls % threads;
}
arg->smask = smask;
arg->buf = buf;
s = pthread_create(&arg->id, NULL, (void * (*)(void *)) xor_worker, arg);
if(s != 0) {
fprintf(stderr, "pthread_create() failed");
exit(1);
}
}
uint8_t val = 0;
for(int i = 0; i < threads; i++) {
struct xor_args * arg = &args[i];
s = pthread_join(arg->id, NULL);
if(s != 0) {
fprintf(stderr, "pthread_join() failed");
exit(1);
}
val ^= arg->res;
}
clock_gettime(CLOCK_MONOTONIC, &finish);
elapsed = (finish.tv_sec - start.tv_sec);
elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
printf("time: %.3f\n", elapsed);
printf("calls: %ld\n", calls);
printf("time/call: %.3e\n", elapsed / calls);
printf("M T / sec: %.3e\n", calls / elapsed / 1e6);
printf("val: %d\n", val); // ensure val is not optimized out
}
The program is bound by the memory hierarchy. Indeed, on modern processor, accessing 1 byte from the RAM causes the whole cache line to be fetched from the RAM. This means 64 bytes on your processor. One channel of DDR4-2933 RAM can reach a theoretical bandwidth of 8*2933e6/1024**3 = 21.85 GiB/s. However, regarding the fact that the cache is useless here and that 64 bytes are fetched, the maximum throughput is 21.85/64*1024 = 350 MiB/s. Based on this, the program should at least take 1000000000/(350*1024**2) = 2.72 s.
In practice, no modern processor can fully saturate a DDR4 memory (especially using only 1 core). Modern RAM are designed to be fast mainly for contiguous access, not random one, despite the name. This is because speeding up random accesses is very hard and reading contiguous data is frequent (and far simpler to optimize at the hardware level).
The main problem with DDR RAM is its latency which is huge since several decades (50-120 ns per cache line fetch). It has not changed much since the last decade while processors have become significantly faster. As a result, it is critical to be able to mitigate this latency by sending a lot of simultaneous requests, that is increasing the concurrency. However, the execution flow is sequential and the number of simultaneous in-flight requests that 1 core can achieve is limited. Modern processors can execution few instructions in parallel from a sequential program as long as the instructions to be executed are independent. The thing is the program is pretty sequential due to the random number generator though multiple instructions can still be executed in parallel.
Multiple read requests can be done concurrently, but the size of the buffers to receive data from the RAM and the caches is limited. Intel processor units dedicated for that is the line-fill buffer (LFB) between the L1 and L2 cache, the super-queue between the L2 and the L3, and the integrated memory controller (iMC) between the L3 and the RAM. You can use perf to check which one is the bottleneck. AFAIK, the LFB has about 12-14 slots on your target processor so it should not be a bottleneck here. The super-queue can read 32-bytes/cycle, that is, 1 cache-line every 2 cycle, or 1 byte of buf every 2 cycles. That being said the L3 latency is pretty big: at least about a dozen of nanoseconds. In general, prefetching units are used to mitigate the latency but random accesses are unpredictable so hardware prefetchers are completely useless in this case.
Virtual memory make random accesses slower. Indeed, when a cache-line is fetched, the processor needs to translate a virtual address to a physical one. This is done by the Translation Lookaside Buffer (TLB) unit of a processor core. It is basically a cache able to translate the address of a virtual memory page to a physical one. There are multiple TLB units for different caches and the size of the cache is dependent of the size of the pages. Pages are typically 4K ones on your processor unless your OS decides to use huge pages in this specific case. Assuming its does not, the biggest TLB (STLB) has 1536 entries for 4K pages. This means it cannot be efficient for random accesses done on a 1536*4K = 6 MiB buffer. Beyond this limit, the number of TLB miss will be frequent. When a TLB miss happen, the processor needs to call special kernel functions so to know how to translate the virtual addresses based on kernel data structures. This operation is very expensive (like any kernel calls in general) so it introduces a significant additional latency. If huge pages are used, then the threshold is 3 GiB (for pages of 2 MiB) or even 16 GiB (for 1 GiB pages). Using huge pages can thus significantly speed things up compared to basic 4K pages.
Assuming the processor has large enough buffers and TLB caches so not to limit the concurrency, the DDR4-SDRAM are typically the main bottleneck. Indeed, it does not only has a huge latency, but it is also optimized for contiguous accesses. Indeed, such RAM devices are split in banks so to reduce the device latency. Contiguous memory requests can be efficiently managed by multiple banks, but random requests are often significantly less efficient due to bank-conflict: when 2 requests are assigned to the same banks, they are processed serially rather than concurrently. While random requests tends to to be spread amongst banks, this load-balancing is not as good as contiguous request resulting in a bit lower throughout.
In the end, the speed of the program is certainly latency-bound and the time taken is calls * latency / concurrency. It looks like latency / concurrency is 10-15 ns in your case.
One solution is to speed up a bit this program is to use prefetching instructions. It often help to hide the latency by telling the processor to prefetch data in the caches (or temporary buffers) early so subsequent accesses are faster since data are supposed to be closer from computing units. This optimization is brittle : data should be prefetched early enough (otherwise the processor will stall waiting due to cache misses) and data should be kept in the target caches (not be evicted by newer prefetching instructions). In your case, this is not so easy since the index is random. Prefetching relatively large chunks so to then read the values faster should help a bit. It should be slower for small arrays and faster for big ones. Note that software prefetching is not a silver-bullet : it is generally not as good as hardware prefetching (due to the limited hardware concurrency).
An alternative solution is generally to use multiple cores (thanks to multiple threads). That being said, this is not easy too here due to the (inherently sequential) random number generator (RNG) unless it is Ok to use multiple RNG. In this case, fetching full cache lines is generally the main bottleneck due to the high concurrency, the small RAM bandwidth and mainly due to the poor efficiency (1 byte used over 64). Unfortunately, there is currently no way to efficiently extract single bytes from a DDR4-SDRAM using any mainstream x86-64 processor (including yours).
For more information, please read:
What Every Programmer Should Know About Memory
How much of ‘What Every Programmer Should Know About Memory’ is still valid?
How does the CPU cache affect the performance of a C program
I have a some problem with for loop copies' time I don't why for loop is taking much time for the copy small data size.
I am using PIC24FJ256GL406 MCU and everything is fine but when operate the UART micro-controller is running slow because some delay is occurring while copy the buffer data.
Let me explain you with code and debug log.
Here I am posting the function for the UART transmission. this function is generally transfer only FIRST character and Rest of the byte will be transfer in the Interrupt routine service.
My clock Frequency is 32Mhz So peripheral will be 16 Mhz.
I did not understand why for loop is taking Almost 20 MS for the copy the 16 byte data only. So This time will be increase if data is more.
unsigned int UART1_WriteBuffer(const uint8_t *buffer, const unsigned int bufLen)
{
//transmit first char
U1TXREG = buffer[0];
while (!U1STAbits.TRMT);
numBytesWritten = bufLen - 1;
totalByte = 0;
//get the current time stamp
WSTimestamp currentTimeStamp = WSGetCurrTimestamp();
WMLogInfo(GEN_LOG, "current time stamp %ld", currentTimeStamp);
uint16_t i = 0;
//memset
memset(&uart1_txByteQ, 0x00, sizeof(uart1_txByteQ));
for (i = 0; i < numBytesWritten; i++)
{
uart1_txByteQ = buffer[i + 1]; //copy the data
}
_U1TXIE = 1;
//get last time stamp
WSTimestamp lastTimeStamp = WSGetCurrTimestamp();
WMLogInfo(GEN_LOG, "last time stamp %ld ", lastTimeStamp);
///print the debug
WMLogInfo(GEN_LOG, "total = %ld MS time taken fo copy the = %d byte", lastTimeStamp - currentTimeStamp, numBytesWritten);
return bufLen;
}
My Interrupt Routine service.
void __attribute__((interrupt, no_auto_psv)) _U1TXInterrupt(void)
{
if (totalByte < numBytesWritten)
{
U1TXREG = uart1_txByteQ[totalByte++];
while (!U1STAbits.TRMT);
}
else
{
_U1TXIE = 0;
_U1TXIE = 0;
}
}
Console Log.
This is function log. please note.
GEN:main loop current time stamp 6503<\r><\n>
GEN:current time stamp 6506<\r><\n>
GEN:last time stamp 6526 <\r><\n>
GEN:total = 20 MS time taken fo copy the = 16 byte<\r><\n>
GEN:command "AT+QREFUSECS=1,1<\r>" send with len [17]
The long delays are caused by busy-waiting for flags in combination with some use-case bug. Why are you using Tx interrupt in the first place in case you intend to busy-wait poll for a flag anyhow? The ISR doesn't make sense - it would seem that you should simply drop Tx interrupts entirely.
In addition, you have a naive implementation of buffer copies. It's a very common embedded systems beginner bug to hard copy RAM buffers needlessly. There's very few cases where you actually need to do hard copies and this isn't one. Instead you should use a system with double buffers and simply swap a pointer between them:
static uint8_t buf1 [n];
static uint8_t buf2 [n];
static uint8_t* rx_buf = buf1; // buffer used by rx ISR
static uint8_t* app_buf = buf2; // buffer used by the application
...
if(rx_done)
{ // swap buffers
uint8_t* tmp = rx_buf;
rx_buf = app_buf;
app_buf = tmp;
}
This also solves the double-buffering problem where an UART rx interrupt needs to store it's incoming data somewhere at the same time as the main program uses data. You'll need to protect against race conditions somehow - with a semaphore etc. And you'll need to declare variables shared with ISRs volatile to protect against bad compiler optimizers.
I have a function reading from some volatile memory which is updated by a DMA. The DMA is never operating on the same memory-location as the function. My application is performance critical. Hence, I realized the execution time is improved by approx. 20% if I not declare the memory as volatile. In the scope of my function the memory is non-volatile. Hovever, I have to be sure that next time the function is called, the compiler know that the memory may have changed.
The memory is two two-dimensional arrays:
volatile uint16_t memoryBuffer[2][10][20] = {0};
The DMA operates on the opposite "matrix" than the program function:
void myTask(uint8_t indexOppositeOfDMA)
{
for(uint8_t n=0; n<10; n++)
{
for(uint8_t m=0; m<20; m++)
{
//Do some stuff with memory (readings only):
foo(memoryBuffer[indexOppositeOfDMA][n][m]);
}
}
}
Is there a proper way to tell my compiler that the memoryBuffer is non-volatile inside the scope of myTask() but may be changed next time i call myTask(), so I could optain the performance improvement of 20%?
Platform Cortex-M4
The problem without volatile
Let's assume that volatile is omitted from the data array. Then the C compiler
and the CPU do not know that its elements change outside the program-flow. Some
things that could happen then:
The whole array might be loaded into the cache when myTask() is called for
the first time. The array might stay in the cache forever and is never
updated from the "main" memory again. This issue is more pressing on multi-core
CPUs if myTask() is bound to a single core, for example.
If myTask() is inlined into the parent function, the compiler might decide
to hoist loads outside of the loop even to a point where the DMA transfer
has not been completed.
The compiler might even be able to determine that no write happens to
memoryBuffer and assume that the array elements stay at 0 all the time
(which would again trigger a lot of optimizations). This could happen if
the program was rather small and all the code is visible to the compiler
at once (or LTO is used).
Remember: After all the compiler does not know anything about the DMA
peripheral and that it is writing "unexpectedly and wildly into memory"
(from a compiler perspective).
If the compiler is dumb/conservative and the CPU not very sophisticated (single core, no out-of-order execution), the code might even work without the volatile declaration. But it also might not...
The problem with volatile
Making
the whole array volatile is often a pessimisation. For speed reasons you
probably want to unroll the loop. So instead of loading from the
array and incrementing the index alternatingly such as
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
it can be faster to load multiple elements at once and increment the index
in larger steps such as
load memoryBuffer[m]
load memoryBuffer[m + 1]
load memoryBuffer[m + 2]
load memoryBuffer[m + 3]
m += 4;
This is especially true, if the loads can be fused together (e.g. to perform
one 32-bit load instead of two 16-bit loads). Further you want the
compiler to use SIMD instruction to process multiple array elements with
a single instruction.
These optimizations are often prevented if the load happens from
volatile memory because compilers are usually very conservative with
load/store reordering around volatile memory accesses.
Again the behavior differs between compiler vendors (e.g. MSVC vs GCC).
Possible solution 1: fences
So you would like to make the array non-volatile but add a hint for the compiler/CPU saying "when you see this line (execute this statement), flush the cache and reload the array from memory". In C11 you could insert an atomic_thread_fence at the beginning of myTask(). Such fences prevent the re-ordering of loads/stores across them.
Since we do not have a C11 compiler, we use intrinsics for this task. The ARMCC compiler has a __dmb() intrinsic (data memory barrier). For GCC you may want to look at __sync_synchronize() (doc).
Possible solution 2: atomic variable holding the buffer state
We use the following pattern a lot in our codebase (e.g. when reading data from
SPI via DMA and calling a function to analyze it): The buffer is declared as
plain array (no volatile) and an atomic flag is added to each buffer, which
is set when the DMA transfer has finished. The code looks something
like this:
typedef struct Buffer
{
uint16_t data[10][20];
// Flag indicating if the buffer has been filled. Only use atomic instructions on it!
int filled;
// C11: atomic_int filled;
// C++: std::atomic_bool filled{false};
} Buffer_t;
Buffer_t buffers[2];
Buffer_t* volatile currentDmaBuffer; // using volatile here because I'm lazy
void setupDMA(void)
{
for (int i = 0; i < 2; ++i)
{
int bufferFilled;
// Atomically load the flag.
bufferFilled = __sync_fetch_and_or(&buffers[i].filled, 0);
// C11: bufferFilled = atomic_load(&buffers[i].filled);
// C++: bufferFilled = buffers[i].filled;
if (!bufferFilled)
{
currentDmaBuffer = &buffers[i];
... configure DMA to write to buffers[i].data and start it
}
}
// If you end up here, there is no free buffer available because the
// data processing takes too long.
}
void DMA_done_IRQHandler(void)
{
// ... stop DMA if needed
// Atomically set the flag indicating that the buffer has been filled.
__sync_fetch_and_or(¤tDmaBuffer->filled, 1);
// C11: atomic_store(¤tDmaBuffer->filled, 1);
// C++: currentDmaBuffer->filled = true;
currentDmaBuffer = 0;
// ... possibly start another DMA transfer ...
}
void myTask(Buffer_t* buffer)
{
for (uint8_t n=0; n<10; n++)
for (uint8_t m=0; m<20; m++)
foo(buffer->data[n][m]);
// Reset the flag atomically.
__sync_fetch_and_and(&buffer->filled, 0);
// C11: atomic_store(&buffer->filled, 0);
// C++: buffer->filled = false;
}
void waitForData(void)
{
// ... see setupDma(void) ...
}
The advantage of pairing the buffers with an atomic is that you are able to detect when the processing is too slow meaning that you have to buffer more,
make the incoming data slower or the processing code faster or whatever is
sufficient in your case.
Possible solution 3: OS support
If you have an (embedded) OS, you might resort to other patterns instead of using volatile arrays. The OS we use features memory pools and queues. The latter can be filled from a thread or an interrupt and a thread can block on
the queue until it is non-empty. The pattern looks a bit like this:
MemoryPool pool; // A pool to acquire DMA buffers.
Queue bufferQueue; // A queue for pointers to buffers filled by the DMA.
void* volatile currentBuffer; // The buffer currently filled by the DMA.
void setupDMA(void)
{
currentBuffer = MemoryPool_Allocate(&pool, 20 * 10 * sizeof(uint16_t));
// ... make the DMA write to currentBuffer
}
void DMA_done_IRQHandler(void)
{
// ... stop DMA if needed
Queue_Post(&bufferQueue, currentBuffer);
currentBuffer = 0;
}
void myTask(void)
{
void* buffer = Queue_Wait(&bufferQueue);
[... work with buffer ...]
MemoryPool_Deallocate(&pool, buffer);
}
This is probably the easiest approach to implement but only if you have an OS
and if portability is not an issue.
Here you say that the buffer is non-volatile:
"memoryBuffer is non-volatile inside the scope of myTask"
But here you say that it must be volatile:
"but may be changed next time i call myTask"
These two sentences are contradicting. Clearly the memory area must be volatile or the compiler can't know that it may be updated by DMA.
However, I rather suspect that the actual performance loss comes from accessing this memory region repeatedly through your algorithm, forcing the compiler to read it back over and over again.
What you should do is to take a local, non-volatile copy of the part of the memory you are interested in:
void myTask(uint8_t indexOppositeOfDMA)
{
for(uint8_t n=0; n<10; n++)
{
for(uint8_t m=0; m<20; m++)
{
volatile uint16_t* data = &memoryBuffer[indexOppositeOfDMA][n][m];
uint16_t local_copy = *data; // this access is volatile and wont get optimized away
foo(&local_copy); // optimizations possible here
// if needed, write back again:
*data = local_copy; // optional
}
}
}
You'll have to benchmark it, but I'm pretty sure this should improve performance.
Alternatively, you could first copy the whole part of the array you are interested in, then work on that, before writing it back. That should help performance even more.
You're not allowed to cast away the volatile qualifier1.
If the array must be defined holding volatile elements then the only two options, "that let the compiler know that the memory has changed", are to keep the volatile qualifier, or use a temporary array which is defined without volatile and is copied to the proper array after the function call. Pick whichever is faster.
1 (Quoted from: ISO/IEC 9899:201x 6.7.3 Type qualifiers 6)
If an attempt is
made to refer to an object defined with a volatile-qualified type through use of an lvalue
with non-volatile-qualified type, the behavior is undefined.
It seems to me that you a passing half of the buffer to myTask and each half does not need to be volatile. So I wonder if you could solve your issue by defining the buffer as such, and then passing a pointer to one of the half-buffers to myTask. I'm not sure whether this will work but maybe something like this...
typedef struct memory_buffer {
uint16_t buffer[10][20];
} memory_buffer ;
volatile memory_buffer double_buffer[2];
void myTask(memory_buffer *mem_buf)
{
for(uint8_t n=0; n<10; n++)
{
for(uint8_t m=0; m<20; m++)
{
//Do some stuff with memory:
foo(mem_buf->buffer[n][m]);
}
}
}
I don't know you platform/mCU/SoC, but usually DMAs have interrupt that trigger on programmable threshold.
What I can imagine is to remove volatile keyword and use interrupt as semaphore for task.
In other words:
DMA is programmed to interrupt when last byte of buffer is written
Task is block on a semaphore/flag waiting that the flag is released
When DMA calls the interrupt routine cange the buffer pointed by DMA for the next reading time and change the flag that unlock the task that can elaborate data.
Something like:
uint16_t memoryBuffer[2][10][20];
volatile uint8_t PingPong = 0;
void interrupt ( void )
{
// Change current DMA pointed buffer
PingPong ^= 1;
}
void myTask(void)
{
static uint8_t lastPingPong = 0;
if (lastPingPong != PingPong)
{
for (uint8_t n = 0; n < 10; n++)
{
for (uint8_t m = 0; m < 20; m++)
{
//Do some stuff with memory:
foo(memoryBuffer[PingPong][n][m]);
}
}
lastPingPong = PingPong;
}
}
I am having problem understanding difference between below two code cases. Case 1 is working as per expectation and Case 2 is not.
Problem Statement: I need to write some set of DWORDS on my device file and trigger a DMA. DMA capacity is 128*4 bytes (128 DWORDS). So I want to trigger DMA(using ioctl) after writing 128 bytes to device file descriptor to have full utilisation of capacity. I can do this for individual Dwords as well.
Basic difference in two cases:
In first case intention is to write 128 DWORDS to the file at once and in second case write DWORDS individually and trigger DMA after 128 DWORDS are written.
Is the data written to the file same in both cases? Second is not working so something wrong there. please help. By not working I mean the expected result of DMA commands are not happening so in second case data in file descriptor is is not same as first case just before the dma command.
Case 1 (WORKING)
int input_dwords[128] = {0xAABBCCDD, 0XBBCCDDAA, ....} //128 DWORDS (actually this data is in
//text file just putting in array for
//illustration)
int cmd_buf = (int*)malloc(sizeof(int) * 128); //space for 128 DWORDS
int* cur = cmd_buf;
for(int i = 0; i< 128; i++)
{
*cur = input_dword[i]
cur++;
}
//Write to file in one shot
write(fd, cmd_buf, sizeof(int)*128);
//trigger DMA (ioctl)
trigger_dma(fd);
Case 2 (NOT WORKING)
int input_dwords[128] = {0xAABBCCDD, 0xBBCCDDAA, ....}
int* cur = (int*)input_dwords;
for(int i = 0 ; i< 128 i++)
{
//writing to file one DWORD at a time.
write(fd, cur, sizeof(int));
cur++;
}
//trigger DMA (ioctl)
trigger_dma(fd);
I'm developing on an AD Blackfin BF537 DSP running uClinux. I have a total of 32MB SD-RAM available. I have an ADC attached, which I can access using a simple, blocking call to read().
The most interesting part of my code is below. Running the program seems to work just fine, I get a nice data package that I can fetch from the SD-card and plot. However, if I comment out the float calculation part (as noted in the code), I get only zeroes in the ft_all.raw file. The same occurs if I change optimization level from -O3 to -O0.
I've tried countless combinations of all sorts of things, and sometimes it works, sometimes it does not - earlier (with minor modifications to below), the code would only work when optimization was disabled. It may also break if I add something else further down in the file.
My suspicion is that the data transferred by the read()-function may not have been transferred fully (is that possible, even though it returns the correct number of bytes?). This is also the first time I initialize pointers using direct memory adresses, and I have no idea how the compiler reacts to this - perhaps I missed something, here?
I've spent days on this issue now, and I'm getting desperate - I would really appreciate some help on this one! Thanks in advance.
// Clear the top 16M memory for data processing
memset((int *)0x01000000,0x0000,(size_t)SIZE_16M);
/* Prep some pointers for data processing */
int16_t *buffer;
int16_t *buf16I, *buf16Q;
buffer = (int16_t *)(0x1000000);
buf16I = (int16_t *)(0x1600000);
buf16Q = (int16_t *)(0x1680000);
/* Read data from ADC */
int rbytes = read(Sportfd, (int16_t*)buffer, 0x200000);
if (rbytes != 0x200000) {
printf("could not sample data! %X\n",rbytes);
goto end;
} else {
printf("Read %X bytes\n",rbytes);
}
FILE *outfd;
int wbytes;
/* Commenting this region results in all zeroes in ft_all.raw */
float a,b;
int c;
b = 0;
for (c = 0; c < 1000; c++) {
a = c;
b = b+pow(a,3);
}
printf("b is %.2f\n",b);
/* Only 12 LSBs of each 32-bit word is actual data.
* First 20 bits of nothing, then 12 bits I, then 20 bits
* nothing, then 12 bits Q, etc...
* Below, the I and Q parts are scaled with a factor of 16
* and extracted to buf16I and buf16Q.
* */
int32_t *buf32;
buf32 = (int32_t *)buffer;
uint32_t i = 0;
uint32_t n = 0;
while (n < 0x80000) {
buf16I[i] = buf32[n] << 4;
n++;
buf16Q[i] = buf32[n] << 4;
i++;
n++;
}
printf("Saving to /mnt/sd/d/ft_all.raw...");
outfd = fopen("/mnt/sd/d/ft_all.raw", "w+");
if (outfd == NULL) {
printf("Could not open file.\n");
}
wbytes = fwrite((int*)0x1600000, 1, 0x100000, outfd);
fclose(outfd);
if (wbytes < 0x100000) {
printf("wbytes not correct (= %d) \n", (int)wbytes);
}
printf(" done.\n");
Edit: The code seems to work perfectly well if I use read() to read data from a simple file rather than the ADC. This leads me to believe that the rather hacky-looking code when extracting the I and Q parts of the input is working as intended. Inspecting the assembly generated by the compiler confirms this.
I'm trying to get in touch with the developer of the ADC driver to see if he has an explanation of this behaviour.
The ADC is connected through a SPORT, and is opened as such:
sportfd = open("/dev/sport1", O_RDWR);
ioctl(sportfd, SPORT_IOC_CONFIG, spconf);
And here are the options used when configuring the SPORT:
spconf->int_clk = 1;
spconf->word_len = 32;
spconf->serial_clk = SPORT_CLK;
spconf->fsync_clk = SPORT_CLK/34;
spconf->fsync = 1;
spconf->late_fsync = 1;
spconf->act_low = 1;
spconf->dma_enabled = 1;
spconf->tckfe = 0;
spconf->rckfe = 1;
spconf->txse = 0;
spconf->rxse = 1;
A bfin_sport.h file from Analog Devices is also included: https://gist.github.com/tausen/5516954
Update
After a long night of debugging with the previous developer on the project, it turned out the issue was not related to the code shown above at all. As Chris suggested, it was indeed an issue with the SPORT driver and the ADC configuration.
While debugging, this error messaged appeared whenever the data was "broken": bfin_sport: sport ffc00900 status error: TUVF. While this doesn't make much sense in the application, it was clear from printing the data, that something was out of sync: the data in buffer was on the form 0x12000000,0x34000000,... rather than 0x00000012,0x00000034,... whenever the status error was shown. It seems clear then, why buf16I and buf16Q only contained zeroes (since I am extracting the 12 LSBs).
Putting in a few calls to usleep() between stages of ADC initialization and configuration seems to have fixed the issue - I'm hoping it stays that way!