I'm a beginner in ARM microcontroller programming and have the following problem to solve.
There are two ISRs in the program: ISR_Timer and ISR_Buffer. ISR_Timer is executed every 5 minutes. ISR_Buffer is executed each time the external device's buffer needs to be filled (several times per second). The external device's buffer is small.
ISR_Buffer takes the data to fill that buffer from external SRAM. There are two big buffers in SRAM. The first is currently in use; the second is used for recalculation. Then they are swapped.
ISR_Timer sets a flag that tells main() to recalculate the second buffer in external SRAM. After that, ISR_Buffer uses that buffer and the first one is used for the next recalculation. Recalculation takes about 1 minute.
The problem is that both main() and ISR_Buffer access external SRAM and those accesses are not atomic. main() writes data to SRAM during buffer recalculation. ISR_Buffer reads data to fill the small device buffer. How can I solve this issue?
IDE: IAR. Chip: AT91SAM7.
If I understand correctly, you can use a cyclic (ring) buffer. Implemented correctly, with one writer and one reader that each update only their own index, it lets the write and the read proceed safely without either side seeing half-updated data.
Alternatively, you can mask interrupts in main() during buffer manipulation so that the ISR cannot access the data. Those manipulations must be fast, though; otherwise your external device's buffer will underflow.
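A minimal sketch of the interrupt-masking idea applied to the two-buffer scheme from the question, assuming main() only ever writes the spare buffer and ISR_Buffer only ever reads the active one, so interrupts have to be masked just for the swap itself. The names big_buf, active, read_pos, ENTER_CRITICAL and EXIT_CRITICAL are hypothetical; the two macros are empty placeholders for whatever interrupt mask/unmask mechanism your toolchain provides, and on the real target the buffers would be placed in external SRAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIG_BUF_SIZE 1024u   /* hypothetical size of each big buffer */

/* Hypothetical placeholders: on the target these would mask and unmask
 * interrupts; they are empty here so the sketch builds on a host PC. */
#define ENTER_CRITICAL() ((void)0)
#define EXIT_CRITICAL()  ((void)0)

/* The two big buffers (in external SRAM on the real target). */
static uint8_t big_buf[2][BIG_BUF_SIZE];

/* Index (0 or 1) of the buffer the ISR reads from, and its read position. */
static volatile uint32_t active = 0;
static volatile uint32_t read_pos = 0;

/* Body of ISR_Buffer: copy the next n bytes from the ACTIVE buffer only. */
void isr_buffer_fill(uint8_t *dev, size_t n)
{
    const uint8_t *src = big_buf[active];
    size_t i;

    for (i = 0; i < n; i++) {
        dev[i] = src[read_pos];
        read_pos = (read_pos + 1u) % BIG_BUF_SIZE;
    }
}

/* Called from main() after ISR_Timer has set its flag.
 * fill_value stands in for the real (slow) recalculation. */
void recalculate_and_swap(uint8_t fill_value)
{
    uint32_t spare = active ^ 1u;   /* the buffer the ISR is NOT using */
    size_t i;

    assert(spare < 2u);

    /* Slow part: interrupts stay enabled, the ISR keeps reading 'active'. */
    for (i = 0; i < BIG_BUF_SIZE; i++)
        big_buf[spare][i] = fill_value;

    /* Fast part: only these two stores need to be protected. */
    ENTER_CRITICAL();
    active = spare;
    read_pos = 0;
    EXIT_CRITICAL();
}
```

With this split the ISR is blocked for only a couple of stores, so the one-minute recalculation cannot starve the external device.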
So I’m emulating a small microprocessor in C that has an internal flash storage represented as an array of chars. The emulation is entirely single-threaded and operates on this flash storage as well as some register variables.
What I want to do is have a second “read” thread that periodically (roughly every 10 ms, or about the monitor refresh rate) polls the data from that array and displays it in some sort of window. (The flash storage is only 32 KiB in size, so it can be displayed as a 512x512 black and white image.)
The thing is that the main emulation thread should have an absolutely minimal performance overhead to do this (or optimally not even care about the second thread at all). A read-write mutex is out of the question since it would tank the performance of my emulator. Best case, as I said, is to let the emulator be completely oblivious to the existence of the read thread.
I don’t care about memory order either, since it doesn’t matter if some changes become visible to the read thread earlier than others, as long as they become visible at some point.
Is this possible at all in C / C++, or at least through some sort of memory_barrier() function that I can call in my emulation thread about every 1000th clock cycle and that would then ensure visibility to my read thread?
Is it enough to just use volatile on the flash memory, and would this affect performance in any significant way?
I don’t want to stall the complete emulation thread just to copy over the flash array to some different place.
Pseudo code:
int main() {
    char flash[32 * 1024];
    flash_binary(flash, "program.bin");
    // periodically displays the content of flash
    pthread_t *read_thread = create_read_thread(flash);
    // does something with flash, highly performance critical
    emulate_cpu(flash);
    kill_read_thread(read_thread);
}
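One possible direction, sketched here under assumptions rather than as a definitive answer: declare the flash bytes as C11 atomics and access them with memory_order_relaxed. Relaxed accesses impose no ordering, which matches the "I don't care about memory order" requirement, and on common targets such as x86-64 and ARM a relaxed byte store typically compiles to an ordinary store, although it can prevent some compiler optimisations. The helper names flash_write, flash_read and flash_snapshot are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define FLASH_SIZE (32 * 1024)

/* Shared flash image. Each byte is an atomic object, so concurrent access
 * from the emulator and the display thread is not a data race in C11. */
static _Atomic unsigned char flash[FLASH_SIZE];

/* Emulator side: relaxed store, no fences and no ordering guarantees. */
static inline void flash_write(size_t addr, unsigned char value)
{
    assert(addr < FLASH_SIZE);
    atomic_store_explicit(&flash[addr], value, memory_order_relaxed);
}

/* Emulator side: relaxed load. */
static inline unsigned char flash_read(size_t addr)
{
    assert(addr < FLASH_SIZE);
    return atomic_load_explicit(&flash[addr], memory_order_relaxed);
}

/* Display side: copy the whole image into a private buffer of FLASH_SIZE
 * bytes. The copy may mix older and newer bytes, which the question
 * explicitly accepts. */
void flash_snapshot(unsigned char *out)
{
    size_t i;

    assert(out != NULL);
    for (i = 0; i < FLASH_SIZE; i++)
        out[i] = atomic_load_explicit(&flash[i], memory_order_relaxed);
}
```

The display thread would call flash_snapshot() once per refresh and draw from its private copy; the emulator thread never waits for it.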
I'm using an STM32F746. I know it has a hardware 2D graphics accelerator.
I know how to do animation using double buffering.
But according to this:
https://www.touchgfx.com/news/high-quality-graphics-using-only-internal-memory/
they claim to use only one framebuffer for animation.
How is that possible, and what techniques are used to do it on the STM32F746?
It is still double buffering. One buffer is stored in MCU memory, where the next frame is prepared and composed. The other buffer is in the LCD driver's memory; data is transferred to it from the MCU when the frame is ready, and it is displayed on the LCD at the required refresh rate.
That's why that library requires so much MCU memory.
Although the answer above was accepted, it is wrong.
In fact, those controllers have their own LCD-driving circuit and thus do not require an external driver. They use part of internal memory as the screen buffer and constantly refresh the image on the LCD from it.
The library uses only that part of memory. Write operations are synchronized with the LCD refresh, which avoids flickering.
So only one buffer is used: the same buffer holds the output image and is used to compose the next frame.
Let's suppose that there are two threads, A and B. There is also a shared array: float X[100].
Thread A writes to the array one element at a time, in order. Every 10 steps it updates a shared variable index (in a safe way) that indicates the current position, and it also sends a signal to thread B.
As soon as thread B receives the signal, it reads index in a safe way and then proceeds to read the elements of X up to position index.
Is it safe to do this? Does thread A really update the array, or just a copy in cache?
Every sane way of one thread sending a signal to another provides the assurance that anything written by a thread before sending a signal is guaranteed to be visible to a thread after it receives that signal. So as long as you sent the signal through some means that provided this guarantee, which they pretty much all do, you are safe.
Note that attempting to use a condition variable without a predicate protected by a mutex is not a sane way of one thread sending a signal to another! Among other things, it doesn't guarantee that the thread that you think received the signal actually received the signal. You do need to make sure the thread that does the reads in fact received the very signal sent by the thread that does the writes.
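A minimal sketch of what "a predicate protected by a mutex" can look like with pthreads, assuming the shared index itself serves as the predicate. The names lock, ready_cv, shared_index, publish_index and wait_for_index are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_cv = PTHREAD_COND_INITIALIZER;
static int shared_index = 0;   /* predicate data, protected by 'lock' */

/* Thread A: call after writing X[0 .. new_index-1]. */
void publish_index(int new_index)
{
    pthread_mutex_lock(&lock);
    assert(new_index >= shared_index);   /* index only moves forward */
    shared_index = new_index;
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&ready_cv);
}

/* Thread B: block until the index has moved past what B already consumed,
 * then return the new index. The while loop re-checks the predicate, so
 * spurious wakeups and signals sent before B started waiting are both
 * handled correctly. */
int wait_for_index(int already_seen)
{
    int idx;

    pthread_mutex_lock(&lock);
    while (shared_index <= already_seen)
        pthread_cond_wait(&ready_cv, &lock);
    idx = shared_index;
    pthread_mutex_unlock(&lock);
    return idx;
}
```

Because thread B decides on the value of shared_index rather than on the wakeup itself, it cannot mistake some other wakeup for the signal thread A sent.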
Is it safe to do this?
Provided your data modification is made safe and protected by critical sections, locks or whatever, this kind of access is perfectly safe as far as the hardware is concerned.
Does thread A really update the array, or just a copy in cache?
Just a copy in cache. Most caches are presently write-back and just write data back to memory when a line is ejected from the cache if it has been modified. This largely improves memory bandwidth, especially in a multicore context.
BUT all happens as if the memory had been updated.
For shared memory processors, there are generally cache coherency protocols (except in some processors for real time applications). The basic idea of these protocols is that a state is associated with every cache line.
The state describes the status of the line across the caches of the different processors.
These states indicate, for instance, if the line is only present in the current cache, or is shared by several caches, in sync with memory, invalid... See for instance this description of the popular MESI cache coherence protocol.
So what happens, when a cache line is written and is also present in another processor?
Thanks to the state, the cache knows that one or more other processors also have a copy of the line, and it will send an invalidate signal. The line will be invalidated in the other caches, and when they want to read or write it, they have to reload its content. In practice, this reload is often served by the cache that holds the valid copy, to limit memory accesses.
This way, whilst data is only written in the cache, the behavior is similar to a situation where data would have been written to memory.
BUT, even though the hardware functionally ensures the correctness of the transfer, one must take the existence of the cache into account to avoid performance degradation.
Assume cache A is updating a line and cache B is reading it. Whenever cache A writes, the line in cache B is invalidated. And whenever cache B wants to read it, if the line has been invalidated, it must fetch it from cache A. This can lead to many transfers of the line between the caches and make the memory system inefficient.
So, concerning your example, signalling every 10 elements is probably not a good idea, and you should use information about the caches to improve the exchanges between sender and receiver.
For instance, if you are on a Pentium with 64-byte cache lines, you should declare X as
_Alignas(64) float X[100];
This way the starting address of X will be a multiple of 64 and will fit cache line boundaries. The _Alignas specifier has existed since C11, and by including stdalign.h you can also write it as alignas(64). Before C11, most compilers offered their own extensions for aligned placement.
And of course, you should tell thread B to read data only when a full 64-byte line (16 floats) has been written.
This way, when thread B accesses the data, the cache line will no longer be modified by thread A, and only one initial transfer between caches A and B will take place. This reduction in the number of transfers between the caches may have a significant impact on performance, depending on your program.
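A small sketch of the "publish only full cache lines" rule, assuming a 64-byte line and a 4-byte float. CACHE_LINE, FLOATS_PER_LINE and publishable_count are hypothetical names; the function only computes how many elements may be handed to thread B, and the actual signalling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u                                  /* assumed line size */
#define FLOATS_PER_LINE (CACHE_LINE / sizeof(float))    /* 16 for 4-byte float */

/* Start of X is a multiple of 64, so X[0..15] occupy exactly one line. */
_Alignas(64) float X[100];

/* Given how many elements thread A has written so far, return how many may
 * be published to thread B so that B only ever reads lines A has finished.
 * The last, partial line is released once the whole array is written. */
size_t publishable_count(size_t written)
{
    const size_t total = sizeof X / sizeof X[0];

    assert(written <= total);
    if (written == total)
        return total;
    return written - (written % FLOATS_PER_LINE);
}
```

Thread A would update the shared index only when publishable_count() returns a larger value than the index it published last.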
If you're using a variable that tracks readiness to read the index, the variable is protected by a mutex, and the signalling is done via a pthread condition variable that thread B waits on under the mutex, then yes.
If you're using POSIX signals, then I believe you need a synchronization mechanism on top of that. Writing to an atomic variable with memory_order_release in thread A, and reading it with memory_order_acquire in thread B should guarantee in the most lightweight fashion that writes in A preceding the write to the atomic should be visible in B after it has read the atomic.
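A minimal sketch of the release/acquire pairing with C11 atomics, assuming a single writer and a single reader. The names published, writer_publish and reader_sum are hypothetical, and the summation is just a stand-in for whatever thread B does with the data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define N 100

static float X[N];
static atomic_size_t published = 0;   /* number of valid elements in X */

/* Thread A: append 'count' elements, then publish the new element count.
 * The release store makes the preceding plain writes to X visible to any
 * thread that later reads 'published' with acquire and sees this value. */
void writer_publish(const float *src, size_t count)
{
    size_t start = atomic_load_explicit(&published, memory_order_relaxed);
    size_t i;

    assert(start + count <= N);
    for (i = 0; i < count; i++)
        X[start + i] = src[i];
    atomic_store_explicit(&published, start + count, memory_order_release);
}

/* Thread B: read the count with acquire, then touch only X[0 .. n-1]. */
float reader_sum(void)
{
    size_t n = atomic_load_explicit(&published, memory_order_acquire);
    float sum = 0.0f;
    size_t i;

    for (i = 0; i < n; i++)
        sum += X[i];
    return sum;
}
```

The reader never touches elements at or beyond the count it loaded, so it never reads a slot the writer may still be filling.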
For best performance, the array sharing should also be done in such a way that the shared parts of the array do not cross cache-line boundaries (or else your performance might degrade due to false sharing).
I am creating Ethernet packets in an embedded system. I have my Data / IP and UDP packet headers defined in pre-allocated buffers and I have a large buffer that is used to grab data from the FPGA's fabric using DMA.
I also have some user data headers and footers where the data comes from the fabric in other ways, mostly SPI transfer of temperature, PCB address etc. Or even grabs of some of the configuration registers (single transaction, on-boot).
Now, at the moment I concatenate these using memcpy into a new larger buffer (also pre-allocated), and then send to the Transmit buffer of the on-FPGA MAC.
My issues:
1) All these buffers are on the FPGA and hence consume memory. I could copy them one at a time into the MAC Tx buffer, but this would prevent my second idea.
2) Since these are all buffers, there is the possibility of forming a pipeline, where new data (DN+1) can be put into the first buffers while subsequent buffers are storing and concatenating the data of (DN+0).
If I have nicely modularised code, how do I create a pipeline from buffer to buffer? In hardware I'd use flags, only passing data from Buffer A to B when Buffer B has finished passing its data to C. In C, memcpy and memmove return only the destination pointer, not a completion status, so I'd need to make my own boolean flag that is set after memcpy finishes, and I'd need to make these flags globals so that I can easily pass their status into other functions.
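One way the flag idea could be sketched without globals is to keep each flag next to its buffer in a struct and pass pointers to the stages around. This is only an illustration under assumptions: stage_t, STAGE_CAPACITY, stage_load, stage_advance and stage_release are hypothetical names, and it assumes memcpy is available in the reduced C library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STAGE_CAPACITY 64u   /* hypothetical per-stage buffer size */

/* One pipeline stage: a buffer plus its own "full" flag. */
typedef struct {
    unsigned char data[STAGE_CAPACITY];
    size_t len;
    bool full;
} stage_t;

/* Put new data into a stage; refused if the stage is still occupied. */
bool stage_load(stage_t *s, const void *src, size_t len)
{
    assert(s != NULL && src != NULL);
    if (s->full || len > STAGE_CAPACITY)
        return false;
    memcpy(s->data, src, len);
    s->len = len;
    s->full = true;          /* set only after the copy has finished */
    return true;
}

/* Move data one stage down the pipeline, only when 'from' holds data and
 * 'to' has already handed its own data on. */
bool stage_advance(stage_t *from, stage_t *to)
{
    assert(from != NULL && to != NULL);
    if (!from->full || to->full)
        return false;
    memcpy(to->data, from->data, from->len);
    to->len = from->len;
    to->full = true;
    from->full = false;      /* 'from' may now accept DN+1 */
    return true;
}

/* Called by the final consumer (e.g. after the MAC has taken the data). */
void stage_release(stage_t *s)
{
    assert(s != NULL);
    s->len = 0;
    s->full = false;
}
```

Each function returns a status, so the caller can poll the stages in a loop and simply retry a move that was refused.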
Finally, as this is embedded, I don't have access to the full C libraries and both time and memory are at a premium.
Thanks
Ed
I am writing a C program in which I need to flush my memory. I would like to know if there is any UNIX system command to flush the CPU cache.
This is a requirement for my project which involves calculating the time taken for my logic.
I have read about the cacheflush(char *s, int a, int b) function but I am not sure as to whether it will be suitable and what to pass in the parameters.
I take it you mean "CPU cache", not memory cache
The link above is good: the suggestion "write a lot of data via CPU" is not Windows specific
Here's another variation on the same theme:
How to clear CPU L1 and L2 cache
Here's an article about Linux and CPU cache:
http://lwn.net/Articles/252125/
NOTE:
At this (very, very low) level, "Linux" != "Unix"
This is how Intel suggests flushing the cache:
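#include <stddef.h>  /* size_t, NULL */

void mem_flush(const void *p, unsigned int allocation_size)
{
    const size_t cache_line = 64;
    const char *cp = (const char *)p;
    size_t i = 0;

    if (p == NULL || allocation_size == 0)
        return;

    /* Flush one address in every 64-byte step of the allocation. */
    for (i = 0; i < allocation_size; i += cache_line) {
        asm volatile("clflush (%0)\n\t"
                     :
                     : "r"(&cp[i])
                     : "memory");
    }

    /* Fence after the flush loop, as in the original listing. */
    asm volatile("sfence\n\t"
                 :
                 :
                 : "memory");
}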
If you're writing a user-mode (not kernel-mode) program, and if it's single-threaded, then there's really no reason for you to ever bother flushing your cache in the first place. Your user-mode program can just forget that it even exists; it's just there to speed up your program's execution, and the hardware manages it transparently.
There are only a couple reasons I can think of that you might actually want to flush the cache from your user-mode application:
Your app is intended to run on a symmetric multiprocessor system, or has data transactions with external hardware.
You're simply testing your cache for some sort of performance test (in which case you should probably be writing your test to operate in kernel mode, perhaps as a driver).
In any case, assuming you're using Linux on an architecture that provides the cacheflush() system call (such as MIPS; it is not portable)...
#include <asm/cachectl.h>
int cacheflush(char *addr, int nbytes, int cache);
This assumes you have a block of memory you just wrote to and you want to make sure it's flushed out of the cache back to main memory. The block begins at addr, and it's nbytes long, and it's in one of the two caches (or both):
ICACHE Flush the instruction cache.
DCACHE Write back to memory and invalidate the affected valid cache lines.
BCACHE Same as (ICACHE|DCACHE).
Normally you'd only need to flush the DCACHE, since when you write data to "memory" (i.e. to the cache), it's normally data, not instructions.
If you want to flush "all of the cache" for some strange testing reason, you could malloc() a big block that you know is larger than your CPU's cache (shoot, make it 8 times as big!), write any old garbage into it, and just flush that entire block.
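A rough sketch of the "big block of garbage" idea, assuming you know (or guess) your largest cache size. It covers only the allocate-and-write part, which by itself tends to push older data out of the cache; it does not call cacheflush(). The names ASSUMED_CACHE_BYTES and evict_cache_by_overwrite are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: size of the largest cache level; adjust for your CPU. */
#define ASSUMED_CACHE_BYTES (8u * 1024u * 1024u)

/* Allocate a block 'factor' times larger than the cache and write to every
 * byte of it. Returns the number of bytes written, or 0 if the allocation
 * failed. 'volatile' keeps the compiler from removing the writes. */
size_t evict_cache_by_overwrite(size_t cache_bytes, unsigned factor)
{
    size_t n, i;
    volatile unsigned char *junk;

    assert(cache_bytes > 0 && factor > 0);
    n = cache_bytes * factor;
    junk = malloc(n);
    if (junk == NULL)
        return 0;
    for (i = 0; i < n; i++)
        junk[i] = (unsigned char)i;   /* any old garbage */
    free((void *)junk);
    return n;
}
```

A call such as evict_cache_by_overwrite(ASSUMED_CACHE_BYTES, 8) right before the timed section would follow the "8 times as big" suggestion.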
See also: How to perform cache operations in C++?
OK, sorry about my first answer. I later read your follow-up comments below your question, so I realize now that you want to flush the INSTRUCTION CACHE to boot your program (or parts of it) out of the cache, so that when you test its performance, you also test its initial load time out of main memory into the instruction cache. Do you also need to flush any data your code will use out to main memory, so that both data and code are fresh loads?
Before anything else, I'd like to mention that main memory itself is also a form of cache, with your hard disk (either the program on disk, or swap space on disk) being the lowest, slowest place your program's instructions could be coming from. That said, when you run through a routine for the first time, if it hasn't already been loaded into main memory from disk by virtue of being near other code that has already executed, then its CPU instructions will first have to be loaded from disk. That takes an order of magnitude or more longer than loading it from main memory into the cache. Then, once it's loaded into main memory, it takes somewhere along the lines of an order of magnitude longer to load from main memory into the cache than it takes to load from the cache into the CPU's instruction fetcher. So if you want to test your code's cold-start performance, you have to decide what cold-start means: pulling it out of disk, or pulling it out of main memory. I don't know of any command to "flush" instructions/data out of main memory to swap space, so flushing it out to main memory is about as much as you can do (that I know of), but keep in mind that your test results may still differ from the first run (when it may be pulling it off disk) to subsequent runs, even if you do flush the instruction cache.
Now, how would one go about flushing the instruction cache to ensure that their own code is flushed out to main memory?
If I needed to do this (a very odd thing to do, in my opinion), I'd probably start by finding the length and approximate placement of my functions in memory. Since I'm using Linux, I'd issue the command "objdump -d {myprogram} > myprogram.dump.txt", then I'd open myprogram.dump.txt in an editor, search for the functions I want to flush out, and figure out how long they are by subtracting their start address from their end address using a hex calculator. I'd write down the size of each. Later I'd add cacheflush() calls in my code, giving it the address of each function I want to flush out as 'addr', the length I found as 'nbytes', and ICACHE. Just for safety I'd probably fudge a little and add about 10% to the size, in case I make a few tweaks to the code and forget to adjust nbytes. I'd make a call to cacheflush() like this for each function I want to flush out. Then, if I need to flush out the data also and it's global/static data, I can flush that as well (DCACHE), but if it's stack or heap data, there's really nothing realistic that I can (or should) do to flush that out of cache. Trying to do so would be an exercise in silliness, because it would create a condition that would never or very rarely exist in normal execution.
Assuming you're using Linux...
#include <asm/cachectl.h>
int cacheflush(char *addr, int nbytes, int cache);
...where cache is one of:
ICACHE Flush the instruction cache.
DCACHE Write back to memory and invalidate the affected valid cache lines.
BCACHE Same as (ICACHE|DCACHE).
BTW, is this homework for a class?