Offload into MIC (Xeon Phi) error iterating over loaded array - c

I have problems when offloading some data structures to my MIC.
I am offloading to the MIC with the following directives:
#pragma offload target(mic:mic_no)\
inout(is_selected : length(query_sequences_count) ALLOC)\
in(a : length(a_size) ALLOC)\
in(a_disp : length(offload_db_count) ALLOC)
However, if I try to execute this inside the offloaded region:
//loads next 64 characters of a into datadb
__m512i datadb __attribute__ ((aligned(64)));
datadb = _mm512_load_epi32(a+iter_db+a_disp[j]);
This causes the following error:
Offload error: process on the device 0 was terminated by signal 11 (SIGSEGV)
But if I instead copy the content of a into another array like this:
char db[64];
for(window_db_iter = 0; window_db_iter < 64; window_db_iter++)
db[window_db_iter] = *(a+iter_db+a_disp[j]+window_db_iter);
//Now this works fine
datadb = _mm512_load_epi32(db);
I have checked that a offloads with the correct length, that a_size is the size of a, and that a_disp is correct as well. Also, a+iter_db+a_disp[j] always stays within the bounds of the array. My guess is that it has to do with the process of copying the memory onto the MIC. Any ideas?
Thanks!

After some time, I have found the answer to my question.
First, the data structure needs to be aligned: _mm512_load_epi32 expects a 64-byte-aligned address, and an unaligned load faults. Note that the offload error does not mean the failure happened while copying memory from the host CPU to the coprocessor; it can be raised by a crash anywhere in the offloaded code.
Second, if you have unaligned memory and cannot or don't want to align it, you can use the align modifier during the offload like this:
#pragma offload target(mic:mic_no)\
inout(is_selected : length(query_sequences_count) ALLOC)\
in(a[0:a_size] : aligned(64) ALLOC)\
in(a_disp : length(offload_db_count) ALLOC)
Now the data will be copied to a 64-byte-aligned address on the coprocessor.
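Alternatively, the root cause can be avoided by allocating the host buffer aligned in the first place, so that both the host pointer and its coprocessor copy sit on a 64-byte boundary. A minimal sketch, assuming the Intel compiler's _mm_malloc/_mm_free and that a_size is the host-side buffer size:
#include <immintrin.h> // _mm_malloc / _mm_free
//Allocate a on a 64-byte boundary so that _mm512_load_epi32 on the MIC
//never sees an unaligned base address
char *a = (char *)_mm_malloc(a_size, 64);
if (a == NULL) { /* handle allocation failure */ }
// ... fill a, offload, compute ...
_mm_free(a);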

Related

Value is wrong first time pointer is dereferenced but correct after that

I have a ZYNQ Ultrascale+ MPSoC Genesys ZU dev board that I'm running my application on. I have an accelerator in the PL that is connected to the PS through a simple AXI DMA. The DMA reads the DDR memory through a normal, non-coherent, FPD slave port on the PS. The application is running on one of the A53 cores in the PS.
I've verified with an ILA that the data being written to the AXI slave port is correct. However, some of the data I'm reading back in software was incorrect. At least part of the issue before was the cache in the A53. As a temporary solution I've disabled the D-cache at the start of the program so there should be no issues there anymore. Now though, the first time I try to print/read from the array of data I receive, I get an incorrect value. Subsequent reads return the correct value. What gives? How is this happening?
Using the Vitis debugger/memory viewer, I've verified that the correct data is present at the memory location I allocated and told the DMA to write to.
Below is a watered-down version of the program, with much of the code that has no issues removed.
#define CACHE_LINE_SIZE 64
int main(void)
{
Xil_DCacheDisable();
//A bunch of DMA initialization
...
//Send data to accelerator through DMA, no issues here
...
float* outputCorrelation;
const size_t outputCorrelationSizeBytes = sizeof(*outputCorrelation) * 80;
outputCorrelation = aligned_alloc(CACHE_LINE_SIZE, outputCorrelationSizeBytes);
if(outputCorrelation == NULL) {
printf("Aligned Malloc failed\n");
return XST_FAILURE;
}
//Initiate data receive transfer first
int result = XAxiDma_SimpleTransfer(&axiDma,(UINTPTR) outputCorrelation, outputCorrelationSizeBytes, XAXIDMA_DEVICE_TO_DMA);
if(result != XST_SUCCESS) {
return result;
}
//Send data - assembledData allocation isn't shown as no problems here
result = XAxiDma_SimpleTransfer(&axiDma,(UINTPTR) assembledData, sizeof(*assembledData) * inLen, XAXIDMA_DMA_TO_DEVICE);
if(result != XST_SUCCESS) {
return result;
}
//Wait for completion interrupts from DMA
...
for(size_t x = 0; x < 80; x++) {
printf("[%zu]\t%f\n", x, outputCorrelation[x]);
}
}
The expected output is the value 4 for every element of the array.
Output:
[0] -nan
[1] 4.000000
[2] 4.000000
[3] 4.000000
[4] 4.000000
...
[79] 4.000000
If I add a print of any value of the array prior to the for loop, the first value becomes correct and all values in the for loop are perfect. What's going on here and how can I solve it?
Edit:
I had a thought that the compiler might be optimizing away the read or something, since none of the functions directly write to the allocated array, so I tried marking the output buffer as volatile. This did not change the behavior.
I did some more testing with my PL accelerator and tried connecting it to the LPD ports of the PS so I could use the RPU instead of the APU. Using the exact same code on the RPU instead of the APU yielded my expected result. I suspect there are still some cache-coherency issues even though I disabled the D-cache when running on the APU.
Something I also didn't mention earlier is that when I single-step through my code, the issue does not exist. When still using the debugger but running through the critical sections, the issue does exist.
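(For reference: when the D-cache is left enabled, the usual pattern on the A53 is to invalidate the cache lines covering the DMA buffer after the transfer completes and before reading it. A sketch using Xil_DCacheInvalidateRange from the Xilinx standalone BSP's xil_cache.h; whether this resolves the issue on this particular board is an open question.)
#include "xil_cache.h"
//After the DMA-done interrupt and before the print loop: discard any
//stale cache lines so the CPU fetches what the DMA wrote to DDR
Xil_DCacheInvalidateRange((INTPTR)outputCorrelation, outputCorrelationSizeBytes);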

Understanding the writing to flash process in the STM32 reference manual

I am programming the STM32L412KB, and at one point I will be writing data to flash (received over UART). From the STM32L41xx reference manual I understand the steps for erasing the memory before writing to it, but on page 84 there is one step that I do not know how to perform when writing the actual data. That step is:
Perform the data write operation at the desired memory address
What data write operation is it mentioning? I can't see any register that the memory address goes into, so I assume it's done with pointers? How would I go about doing this?
Your help is much appreciated,
Many thanks,
Harry
Apart from a couple of things (e.g. only write after erase, timings, alignment, lock/unlock), there isn't much difference between writing to RAM and writing to FLASH memory. So if you have followed the steps from the reference manual and the FLASH memory is ready (i.e. erased and unlocked), then you can simply take an aligned memory address and write to it.
STMs very own HAL library contains a function which does all the cumbersome boilerplate for you and allows you to "just write":
HAL_StatusTypeDef HAL_FLASH_Program(uint32_t TypeProgram, uint32_t Address, uint64_t Data)
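A usage sketch (the address and payload below are made up; on the L4 the address must be 8-byte aligned and point into erased flash, since programming happens in double words):
uint32_t addr = 0x0801F800UL;          //example: erased, 8-byte-aligned flash address
uint64_t data = 0x1122334455667788ULL; //example payload

HAL_FLASH_Unlock();
HAL_StatusTypeDef status = HAL_FLASH_Program(FLASH_TYPEPROGRAM_DOUBLEWORD, addr, data);
HAL_FLASH_Lock();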
Internally this function uses a subroutine which performs the actual write and it looks like this:
static void FLASH_Program_DoubleWord(uint32_t Address, uint64_t Data)
{
/* Check the parameters */
assert_param(IS_FLASH_PROGRAM_ADDRESS(Address));
/* Set PG bit */
SET_BIT(FLASH->CR, FLASH_CR_PG);
/* Program first word */
*(__IO uint32_t*)Address = (uint32_t)Data;
/* Barrier to ensure programming is performed in 2 steps, in right order
(independently of compiler optimization behavior) */
__ISB();
/* Program second word */
*(__IO uint32_t*)(Address+4U) = (uint32_t)(Data >> 32);
}
As you can see there is no magic involved. It's just a dereferenced pointer and an assignment.
What data write operation is it mentioning?
The "data write" is just a normal write to a address in memory that is the flash memory. It is usually the STR assembly instruction. Screening at your datasheet, I guess the flash memory addresses are between 0x08080000 and 0x00080000.
For example, the following C code would write the value 42 to the first flash memory address:
*(volatile uint32_t*)0x08000000 = 42;
For a reference implementation, see the STM32 HAL drivers:
/* Set PG bit */
SET_BIT(FLASH->CR, FLASH_CR_PG);
/* Program the double word */
*(__IO uint32_t*)Address = (uint32_t)Data;
*(__IO uint32_t*)(Address+4) = (uint32_t)(Data >> 32);
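Putting the surrounding steps together, here is a minimal sketch of the whole double-word write sequence on the L4, using CMSIS register names and with error handling stripped down to the essentials (a sketch, not a drop-in replacement for the HAL):
#include "stm32l4xx.h" //CMSIS device header

static void flash_write_dword(uint32_t address, uint64_t data)
{
    while (FLASH->SR & FLASH_SR_BSY) { }  //wait until flash is idle
    if (FLASH->CR & FLASH_CR_LOCK) {      //unlock with the key sequence
        FLASH->KEYR = 0x45670123U;
        FLASH->KEYR = 0xCDEF89ABU;
    }
    SET_BIT(FLASH->CR, FLASH_CR_PG);      //enter programming mode
    *(__IO uint32_t*)address = (uint32_t)data;                //first word
    __ISB();                                                  //keep the two stores ordered
    *(__IO uint32_t*)(address + 4U) = (uint32_t)(data >> 32); //second word
    while (FLASH->SR & FLASH_SR_BSY) { }  //wait for completion
    FLASH->SR = FLASH_SR_EOP;             //clear end-of-operation flag (write 1 to clear)
    CLEAR_BIT(FLASH->CR, FLASH_CR_PG);    //leave programming mode
    SET_BIT(FLASH->CR, FLASH_CR_LOCK);    //re-lock the flash
}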

Improve performance of reading volatile memory

I have a function reading from some volatile memory which is updated by a DMA. The DMA never operates on the same memory location as the function. My application is performance-critical, and I realized that the execution time improves by approx. 20% if I do not declare the memory as volatile. Within the scope of my function the memory is effectively non-volatile. However, I have to be sure that the next time the function is called, the compiler knows that the memory may have changed.
The memory is two two-dimensional arrays:
volatile uint16_t memoryBuffer[2][10][20] = {0};
The DMA operates on the opposite "matrix" than the program function:
void myTask(uint8_t indexOppositeOfDMA)
{
for(uint8_t n=0; n<10; n++)
{
for(uint8_t m=0; m<20; m++)
{
//Do some stuff with memory (readings only):
foo(memoryBuffer[indexOppositeOfDMA][n][m]);
}
}
}
Is there a proper way to tell my compiler that memoryBuffer is non-volatile inside the scope of myTask(), but may have changed by the next time I call myTask(), so I could obtain the performance improvement of 20%?
Platform: Cortex-M4
The problem without volatile
Let's assume that volatile is omitted from the data array. Then the C compiler
and the CPU do not know that its elements change outside the program-flow. Some
things that could happen then:
The whole array might be loaded into the cache when myTask() is called for
the first time. The array might then stay in the cache forever and never be
updated from "main" memory again. This issue is more pressing on multi-core
CPUs if myTask() is bound to a single core, for example.
If myTask() is inlined into the parent function, the compiler might decide
to hoist loads outside of the loop even to a point where the DMA transfer
has not been completed.
The compiler might even be able to determine that no write happens to
memoryBuffer and assume that the array elements stay at 0 all the time
(which would again trigger a lot of optimizations). This could happen if
the program was rather small and all the code is visible to the compiler
at once (or LTO is used).
Remember: after all, the compiler does not know anything about the DMA
peripheral or that it is writing "unexpectedly and wildly into memory"
(from a compiler perspective).
If the compiler is dumb/conservative and the CPU not very sophisticated (single core, no out-of-order execution), the code might even work without the volatile declaration. But it also might not...
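To make the constant-propagation point concrete, the optimizer would be entitled to rewrite myTask() into something like this (a sketch of the legal transformation, not actual compiler output):
void myTask(uint8_t indexOppositeOfDMA)
{
    (void)indexOppositeOfDMA;
    //the elements are "known" to still hold their initializer value 0
    for (uint16_t i = 0; i < 10 * 20; i++)
        foo(0);
}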
The problem with volatile
Making
the whole array volatile is often a pessimisation. For speed reasons you
probably want to unroll the loop. So instead of loading from the
array and incrementing the index alternatingly such as
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
it can be faster to load multiple elements at once and increment the index
in larger steps such as
load memoryBuffer[m]
load memoryBuffer[m + 1]
load memoryBuffer[m + 2]
load memoryBuffer[m + 3]
m += 4;
This is especially true if the loads can be fused together (e.g. to perform
one 32-bit load instead of two 16-bit loads). Further, you want the
compiler to use SIMD instructions to process multiple array elements with
a single instruction.
These optimizations are often prevented if the load happens from
volatile memory because compilers are usually very conservative with
load/store reordering around volatile memory accesses.
Again the behavior differs between compiler vendors (e.g. MSVC vs GCC).
Possible solution 1: fences
So you would like to make the array non-volatile but add a hint for the compiler/CPU saying "when you see this line (execute this statement), flush the cache and reload the array from memory". In C11 you could insert an atomic_thread_fence at the beginning of myTask(). Such fences prevent the re-ordering of loads/stores across them.
Since we do not have a C11 compiler, we use intrinsics for this task. The ARMCC compiler has a __dmb() intrinsic (data memory barrier). For GCC you may want to look at __sync_synchronize() (doc).
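A sketch of what this could look like (GCC intrinsic shown; under ARMCC, __dmb() would take its place):
uint16_t memoryBuffer[2][10][20]; //note: no volatile anymore

void myTask(uint8_t indexOppositeOfDMA)
{
    //Full barrier: the compiler must forget any cached values of
    //memoryBuffer and re-load them after this point.
    __sync_synchronize();
    for (uint8_t n = 0; n < 10; n++)
        for (uint8_t m = 0; m < 20; m++)
            foo(memoryBuffer[indexOppositeOfDMA][n][m]);
}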
Possible solution 2: atomic variable holding the buffer state
We use the following pattern a lot in our codebase (e.g. when reading data from
SPI via DMA and calling a function to analyze it): The buffer is declared as
plain array (no volatile) and an atomic flag is added to each buffer, which
is set when the DMA transfer has finished. The code looks something
like this:
typedef struct Buffer
{
uint16_t data[10][20];
// Flag indicating if the buffer has been filled. Only use atomic instructions on it!
int filled;
// C11: atomic_int filled;
// C++: std::atomic_bool filled{false};
} Buffer_t;
Buffer_t buffers[2];
Buffer_t* volatile currentDmaBuffer; // using volatile here because I'm lazy
void setupDMA(void)
{
for (int i = 0; i < 2; ++i)
{
int bufferFilled;
// Atomically load the flag.
bufferFilled = __sync_fetch_and_or(&buffers[i].filled, 0);
// C11: bufferFilled = atomic_load(&buffers[i].filled);
// C++: bufferFilled = buffers[i].filled;
if (!bufferFilled)
{
currentDmaBuffer = &buffers[i];
... configure DMA to write to buffers[i].data and start it
}
}
// If you end up here, there is no free buffer available because the
// data processing takes too long.
}
void DMA_done_IRQHandler(void)
{
// ... stop DMA if needed
// Atomically set the flag indicating that the buffer has been filled.
__sync_fetch_and_or(&currentDmaBuffer->filled, 1);
// C11: atomic_store(&currentDmaBuffer->filled, 1);
// C++: currentDmaBuffer->filled = true;
currentDmaBuffer = 0;
// ... possibly start another DMA transfer ...
}
void myTask(Buffer_t* buffer)
{
for (uint8_t n=0; n<10; n++)
for (uint8_t m=0; m<20; m++)
foo(buffer->data[n][m]);
// Reset the flag atomically.
__sync_fetch_and_and(&buffer->filled, 0);
// C11: atomic_store(&buffer->filled, 0);
// C++: buffer->filled = false;
}
void waitForData(void)
{
// ... see setupDma(void) ...
}
The advantage of pairing the buffers with an atomic flag is that you are able to detect when the processing is too slow, meaning that you have to add more buffers, slow down the incoming data, or speed up the processing code, whichever is sufficient in your case.
Possible solution 3: OS support
If you have an (embedded) OS, you might resort to other patterns instead of using volatile arrays. The OS we use features memory pools and queues. The latter can be filled from a thread or an interrupt and a thread can block on
the queue until it is non-empty. The pattern looks a bit like this:
MemoryPool pool; // A pool to acquire DMA buffers.
Queue bufferQueue; // A queue for pointers to buffers filled by the DMA.
void* volatile currentBuffer; // The buffer currently filled by the DMA.
void setupDMA(void)
{
currentBuffer = MemoryPool_Allocate(&pool, 20 * 10 * sizeof(uint16_t));
// ... make the DMA write to currentBuffer
}
void DMA_done_IRQHandler(void)
{
// ... stop DMA if needed
Queue_Post(&bufferQueue, currentBuffer);
currentBuffer = 0;
}
void myTask(void)
{
void* buffer = Queue_Wait(&bufferQueue);
[... work with buffer ...]
MemoryPool_Deallocate(&pool, buffer);
}
This is probably the easiest approach to implement but only if you have an OS
and if portability is not an issue.
Here you say that the buffer is non-volatile:
"memoryBuffer is non-volatile inside the scope of myTask"
But here you say that it must be volatile:
"but may be changed next time i call myTask"
These two statements are contradictory. Clearly the memory area must be volatile, or the compiler can't know that it may be updated by the DMA.
However, I rather suspect that the actual performance loss comes from accessing this memory region repeatedly through your algorithm, forcing the compiler to read it back over and over again.
What you should do is to take a local, non-volatile copy of the part of the memory you are interested in:
void myTask(uint8_t indexOppositeOfDMA)
{
for(uint8_t n=0; n<10; n++)
{
for(uint8_t m=0; m<20; m++)
{
volatile uint16_t* data = &memoryBuffer[indexOppositeOfDMA][n][m];
uint16_t local_copy = *data; // this access is volatile and won't get optimized away
foo(&local_copy); // optimizations possible here
// if needed, write back again:
*data = local_copy; // optional
}
}
}
You'll have to benchmark it, but I'm pretty sure this should improve performance.
Alternatively, you could first copy the whole part of the array you are interested in, then work on that, before writing it back. That should help performance even more.
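A sketch of that variant: the copy goes element-wise through volatile-qualified lvalues (memcpy would discard the qualifier), and foo() then operates on plain memory that the optimizer can unroll and vectorize freely:
void myTask(uint8_t indexOppositeOfDMA)
{
    uint16_t local[10][20];
    //volatile reads: one forced load per element
    for (uint8_t n = 0; n < 10; n++)
        for (uint8_t m = 0; m < 20; m++)
            local[n][m] = memoryBuffer[indexOppositeOfDMA][n][m];
    //non-volatile from here on: unrolling/SIMD are possible again
    for (uint8_t n = 0; n < 10; n++)
        for (uint8_t m = 0; m < 20; m++)
            foo(local[n][m]);
}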
You're not allowed to cast away the volatile qualifier [1].
If the array must be defined with volatile elements, then the only two options "that let the compiler know that the memory has changed" are to keep the volatile qualifier, or to use a temporary array which is defined without volatile and is copied to the proper array after the function call. Pick whichever is faster.
[1] Quoted from ISO/IEC 9899:201x, 6.7.3 Type qualifiers, paragraph 6:
If an attempt is
made to refer to an object defined with a volatile-qualified type through use of an lvalue
with non-volatile-qualified type, the behavior is undefined.
It seems to me that you are passing half of the buffer to myTask, and each half does not need to be volatile while it is being processed. So I wonder if you could solve your issue by defining the buffer as such, and then passing a pointer to one of the half-buffers to myTask. I'm not sure whether this will work, but maybe something like this...
typedef struct memory_buffer {
uint16_t buffer[10][20];
} memory_buffer ;
volatile memory_buffer double_buffer[2];
void myTask(memory_buffer *mem_buf)
{
for(uint8_t n=0; n<10; n++)
{
for(uint8_t m=0; m<20; m++)
{
//Do some stuff with memory:
foo(mem_buf->buffer[n][m]);
}
}
}
I don't know your platform/MCU/SoC, but usually DMAs have an interrupt that triggers on a programmable threshold.
What I can imagine is to remove the volatile keyword and use the interrupt as a semaphore for the task.
In other words:
The DMA is programmed to interrupt when the last byte of the buffer has been written.
The task blocks on a semaphore/flag, waiting for the flag to be released.
When the DMA fires the interrupt routine, change the buffer the DMA points to for the next transfer and toggle the flag, which unblocks the task so it can process the data.
Something like:
uint16_t memoryBuffer[2][10][20];
volatile uint8_t PingPong = 0;
void interrupt ( void )
{
// Change current DMA pointed buffer
PingPong ^= 1;
}
void myTask(void)
{
static uint8_t lastPingPong = 0;
if (lastPingPong != PingPong)
{
for (uint8_t n = 0; n < 10; n++)
{
for (uint8_t m = 0; m < 20; m++)
{
//Do some stuff with memory:
foo(memoryBuffer[PingPong][n][m]);
}
}
lastPingPong = PingPong;
}
}

Initialize array starting from specific address in memory - C programming

Do you have an idea how to initialize an array of structs starting from a specific address in memory (not virtual, but physical DDR memory)? I am working on an implementation of a TxRx on a SoC (ARM-FPGA). Basically, the ARM (PS) and FPGA (PL) communicate with each other using shared RAM. Currently I am working on the transmitter side, so I need to constantly load packets that I get from the MAC layer into memory; then my Tx reads the data and sends it over the air. To achieve this I want to implement a circular FIFO buffer on the (ARM) side, so that I can store up to 6 packets in the buffer and send them one by one, while loading other packets into the places of already-sent packets. Because I need to use specific memory addresses, I am interested in whether it is possible to initialize an array of structures that will be stored at specific addresses in memory. For example, I want my array to start at address 0x400000 and end at address 0x400000 + MaximumNumberOfPackets x SizeOfPackets. I know how to do it for one instance of a structure, for example like this:
buffer_t *tmp = (struct buffer_t *)234881024;
But how can I do it for an array of structures?
A pointer to a single struct (or int, float, or anything else) is inherently a pointer to an array of them. The pointer type provides the sizeof() value for an array entry, and thus allows pointer arithmetic to work.
Thus, given a struct buffer you can simply do
static struct buffer * const myFIFO = (struct buffer *) 0x400000;
and then simply access myFIFO as an array
for (size_t i = 0; i < maxPackets; ++i)
{
myFIFO[i].someField = initialValue1;
myFIFO[i].someOtherField = 42;
}
This works just the way you expect.
What you can't do (using pure standard C) is declare an array at a particular address like this:
struct buffer myFIFO[23] # 0x400000;
However, your compiler may have extensions to allow it. Many embedded compilers do (after all, that's often how they declare memory-mapped device registers), but it will be different for every compiler vendor, and possibly for every chip because it is a vendor extension.
GCC does allow it for AVR processors via an attribute, for example
volatile int porta __attribute__((address (0x600)));
But it doesn't seem to support it for an ARM.
Generally @kdopen is right, but for ARM you should create an entry in the MEMORY section of your linker script that tells the linker where your memory is:
MEMORY
{
...
ExternalDDR (w) : ORIGIN = 0x400000, LENGTH = 4M
}
And then, when declaring the variable, just use
__attribute__((section("ExternalDDR")))
After some time, I found the way to do it. I set this in the linker script:
MEMORY {
ps7_ddr_0_S_AXI_BASEADDR : ORIGIN = 0x00100000, LENGTH = 0x1FF00000
ps7_ram_0_S_AXI_BASEADDR : ORIGIN = 0x00000000, LENGTH = 0x00030000
ps7_ram_1_S_AXI_BASEADDR : ORIGIN = 0xFFFF0000, LENGTH = 0x0000FE00
DAC_DMA (w) : ORIGIN = 0xE000000, LENGTH = 64K
}
.dacdma : {
__dacdma_start = .;
*(.dacdma)
__dacdma_end = .;
} > DAC_DMA
And then I set this in the code (the section name must match the input section the script collects):
static buffer_t __attribute__((section(".dacdma"))) buf_pool[6];

Why does using a structure in C program cause Link error

I am writing a C program for an 8051-architecture chip with the SDCC compiler.
I have a structure called FilterStructure.
My code looks like this...
#define NAME_SIZE 8
typedef struct {
char Name[NAME_SIZE];
} FilterStructure;
void ReadFilterName(U8 WheelID, U8 Filter, FilterStructure* NameStructure);
int main (void)
{
FilterStructure testStruct;
ReadFilterName('A', 3, &testStruct);
...
...
return 0;
}
void ReadFilterName(U8 WheelID, U8 Filter, FilterStructure* NameStructure)
{
int StartOfName = 0;
int i = 0;
///... do some stuff...
for(i = 0; i < 8; i++)
{
NameStructure->Name[i] = FLASH_ByteRead(StartOfName + i);
}
return;
}
For some reason I get a link error "?ASlink-Error-Could not get 29 consecutive bytes in internal RAM for area DSEG"
If I comment out the line that says FilterStructure testStruct; the error goes away.
What does this error mean? Do I need to discard the structure when I am done with it?
The message means that your local variable testStruct couldn't be allocated in RAM (DSEG should be the DATA segment of your binary), since the linker couldn't find 29 consecutive bytes of internal RAM for it.
This is strange, since your struct should only be 8 bytes long... but anyway, it has nothing to do with discarding the structure; this is a memory allocation problem. I don't know the 8051 specs that well, but its memory should be quite limited, right?
EDIT: looking at the 8051 specs, it seems it has just 128 bytes of internal RAM. That can cause the problem: the variable, declared as a local, is allocated in internal RAM, whereas you could try to allocate it in external RAM instead (using the address/data bus of the chip) if your board allows it.
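If XRAM is available on your part, SDCC can place individual objects there; a sketch using SDCC's __xdata storage class, moving the variable to file scope at the same time (building with --model-large would instead move all non-annotated variables to XDATA):
//SDCC: place the structure in XDATA instead of the tiny internal DSEG
static __xdata FilterStructure testStruct;
A pointer to an __xdata object still converts to the generic FilterStructure* parameter, so the ReadFilterName() call does not need to change.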
You've run out of memory, by the looks of it.
Try moving it out as a global variable and see if that makes it better.
Just a guess: the 8051 has only 128 or 256 bytes of "internal RAM". Not much... It uses part of it as stack and part for registers. Maybe your "large" (8 bytes!) structure on the stack forces the compiler to reserve too much stack space inside the internal memory. I suggest having a look at the linker map file; maybe you can "rearrange" the memory partition. The message says "consecutive bytes", so perhaps there is still enough space available, but it's fragmented.
Bye
