how instructions are fetched in cortex M processors - arm

As far as I understand, Cortex M0/M3 processors have only one memory space that holds instructions and data and that the access is only through the memory bus interface. Thus, if I understand correctly, every clock cycle the processor must read a new instruction to enter the pipeline, but that means that the bus will always be busy reading instructions so how data can be read simultaneously (for load word/store word instructions for example)?
Additionally, what is the latency of reading an instruction from the memory? since if it is not a single cycle, then the processor must constantly halt itself until the next instruction is fetched, so how it is handled?
Thanks

Yes this is how it happens, the processor stalls a lot, this goes on with big processors as well as small, difficult at best to feed a pipelined processor (although some of these pipes are shallow on cortex-ms, but pipelined nevertheless).
Many of the parts I have used and I have touched most of the vendors, the flash is at half clock speed of the core, so even at zero wait states you can only get an instruction every other clock (on average naturally with overhead rolled in) if fetching a halfword at a time, if fetching a word at a time which many of the cores offer then that is ideally two instructions per two clocks or one per. thumb2 of course you take the hit. ST definitely has a prefetcher/cacher thing with a fancy marketing name that does a pretty good job. Others may offer that as well or just rely on what arm offers which varies.
The different cortex-ms have different mixtures of busses. I hate the von-Neumann/Harvard references as there is little practical use for an actual Harvard architecture, thus the "modified" adjective which means they can do anything and try to attract folks taught in school that Harvard means performance. The busses can have multiple transactions in flight and there are a different number of busses as is somewhat obvious when you go in and release clocks for a peripheral, apb1 clock control ahb2 clock control, etc. Peripherals, flash, etc. But we can run code from sram, so it's not Harvard. Forget Harvard and von-Neumann terms and just focus on the actual implementation.
The bus documentation is as readily available as the core documentation. If you buy the right fpga board you can request a free eval of a core which you can then get an up close and personal view as to how it really works.
End of the day there is some parallelism, but on many chips the flash is half speed so if you are not fetching two per or have some other solution you are barely making it and stalling often if you have other same bus accesses. Likewise on many of these chips the peripherals can't run as fast as the core, so that alone incurs a stall, but even if the peripheral runs on the same clock doesn't mean it turns around a csr or data access as fast as sram, so you incur a stall there too.
There is no reason to assume you will get one instruction per clock performance out of these parts any more than a full sized arm or x86 or other.
While there are some important details that are not documented and only seen when you get the core there is documentation on each core and bus to get a rough idea if how to tune your code to perform better or tune your expectations of how it will really perform. I know I have demonstrated this here and elsewhere, it is pretty easy even with an ST to see a performance difference between flash and sram and see that it takes more clocks than instructions to perform a benchmark.
Your question is too broad in a few ways. The cortex-m0 and m3 are quite different, one was the first one out and dripping with features, the other was tuned for size and has just less stuff in general not meant to necessarily compete in this way. Then how long is the latency, etc, that is strictly chip company and family within the chip company so that questions extremely broad all the cortex-m products out there, dozens of different answers to that question. ARM makes cores not chips, the chip vendors make chips and buy IP from various places and make some of their own, some small part of that chip might be some IP they buy from a processor vendor.

What you've described is known as the "von Neumann bottleneck", and in machines with a pure von Neumann architecture with shared data and program memory, accesses are usually interleaved. However you might like to check out the "modified Harvard architecture", because that's basically what is in use here. The backing store is shared like in a von Neumann machine, but the instruction and data fetch paths are separate like in a Harvard machine and crucially they have separate caches. So if an instruction or data fetch results in a cache hit, a memory fetch doesn't takes place and there is no bottleneck.
The second part of your question doesn't make a great deal of sense I'm afraid, because it is meaningless to talk about instruction fetch times in terms of instruction cycles. By definition, if an instruction fetch is delayed for some reason, the execution of that instruction (and subsequent instructions) must be delayed. Typically this is done by inserting NOPs into the pipeline until the next instruction is ready (known as "bubbling" the pipeline).

re: part 2: Instruction fetch can be pipelined to hide some / all of the fetch latency. Cortex-M3 has a prefetch unit with a 3-word FIFO.
https://developer.arm.com/documentation/ddi0337/e/introduction/prefetch-unit (This can hold up to six 16-bit Thumb instructions.)
This buffer can also supply instructions while data load/store is happening, in a config where those compete with each other (not Harvard split bus, and without a data or instruction cache).
This prefetch is of course speculative; discarded on branches. (It's simple and small enough not to be worth doing branch prediction to try to fetch from the right place before decode even knows the upcoming instruction stream contains a branch.)

Related

How to manage devices that cannot access d-cache in ARM

I'm using an SPI device with DMA enabled in an STM32H7 SoC. The DMA periph. cannot access d-cache, so in order to make it work I have disabled d-cache entirely (for more info. about this, see this explanation). However, I would like to avoid disabling d-cache globally for a problem that only affects to a small region of memory.
I have read this post about the meaning of clean and invalidate cache operations, in the ARM domain. My understanding is that, by cleaning a cache area, you force it to be written in the actual memory. On the other hand, by invalidating a cache area, you force the actual memory to be cached. Is this correct?
My intention with this is to follow these steps to transmit something over SPI (with DMA):
Write the value you want on the buffer that DMA will read from.
Clean d-cache for that area to force it to go to actual memory, so DMA can see it.
Launch the operation: DMA will read the value from the area above and write it to the SPI's Tx buffer.
SPI reads data at the same time it writes, so there will be data in the SPI's Rx buffer, which will be read by DMA and then it will write it to the recv. buffer provided by the user. It could happen that an observer of such buffer can indeed access d-cache. The latter could not be updated with the new value received by SPI yet, so invalidate the recv. buffer area to force d-cache to get updated.
Does the above make sense?
EDIT
Adding some more sources/examples of the problem I'm facing:
Example from the ST github: https://github.com/STMicroelectronics/STM32CubeH7/issues/153
Post in ST forums answring and explaining the d-cache problem: https://community.st.com/s/question/0D53W00000m2fjHSAQ/confused-about-dma-and-cache-on-stm32-h7-devices
Here the interconnection between memory and DMA:
As you can see, DMA1 can access sram1, 2 and 3. I'm using sram2.
Here the cache attributes of sram2:
As you can see, it is write back,write allocate, but not write through. I'm not familiar with these attributes, so I read the definition from here. However, that article seems to talk about the CPU physical cache (L1, L2 etc.) I'm not sure if ARM i-cache and d-cache refer to this physical cache. In any case, I'm assuming the definition for write through and the other terms are valid for d-cache as well.
I forget off hand how the data cache works on the cortex-m7/armv7-m. I want to remember it does not have an MMU and caching is based on address. ARM and ST would be smart enough to know to put cached and non-cached access to sram from the processor core.
If you are wanting to send or receive data using DMA you do not go through the cache.
You linked a question from before which I had provided an answer.
Caches contain some amount of sram as we tend to see a spec for this many KBytes or this many MBytes, whatever. But there are also tag rams and other infrastructure. How does the cache know if there is a hit or a miss. Not from the data, but from other bits of information. Taken from the address of the transaction. Some number of bits of that address are taken and compared to however many "ways" you have so there may be 8 ways for example so there are 8 small memories think of them as arrays of structures in C. In that structure is some information is this cache line valid? If valid what is the tag or bit of address that it is tied to, is it clean/dirty...
Clean or dirty meaning the overall caching infrastructure will be designed (kinda the whole point) to hold information in a faster sram (sram in mcus is very fast already so why a cache in the first place???), which means that write transactions, if they go through the cache (they should in some form) will get written to the cache, and then based on design/policy will get written out into system memory or at least get written on the memory side of the cache. While the cache contains information that has been written that is not also in system memory (due to a write) that is dirty. And when you clean the cache using ARM's term clean, or flush is another term, etc. You go through all of the cache and look for items that are valid and dirty and you initiate writes to system memory to clean them. This is how you force things out the cache into system memory for coherency reasons, if you have a need to do that.
Invalidate a cache simply means you go through the tag rams and you change the valid bit to indicate invalid for that cache line. Basically that "loses" all information about that cache line it is now available to use. It will not result in any hits and it will not do a write to the system for a clean/flush. The actual cache line in the cache memory does not have to be zeroed or put in any other state. Technically just the valid/invalid bit or bits.
How things generally get into a cache are certainly from reads. Depending on the design and settings if a read is cacheable then the cache will first look to see if it has a tag for that item and if it is valid, if so then it simply takes the information in the cache and returns it. If there is a miss, that data does not have a copy in the cache, then it initiates one or more cache line reads from the system side. So a single byte read can/will cause a larger, sometimes much larger, read to happen on the system side, the transaction is held until that (much larger) data (read) returns and then it is put in the cache and the item requested is returned to the processor.
Depending on the architecture and settings, writes may or may not create an entry in the cache, if a (cacheable) write happens and there are no hits in the cache then it may just go straight to the system side as a write of that size and shape. As if the cache was not there. If there is a cache hit then it will go into the cache, and the that/those cache lines are marked as dirty and then depending on the design, etc it may be written to system memory as a side effect of the write from the processor side, the processor will be freed to continue execution but the cache and other logic (write buffer) may continue to process this transaction moving this new data to the system side essentially cleaning/flushing automatically. One normally does not expect this as it takes away performance that the cache was there to provide in the first place.
In any case if it is determined that a transaction has a miss and it is to be cached, then based on that tag, the ways have already been examined to determine if there was a hit. One of the ways will be chosen to hold this new cache line. How that is determined is based on design and in some cases programmable settings. Hopefully if there are any that are invalid then it would go to one of those. But round robin, randomizer, oldest first, etc are solutions you may see. And if there is dirty data in that space then it has to get written out first, making room for the new information. So, absolutely a single byte or single word read (since they have the same performance in a system like this) can require a cache flush of a cache line, then a read from the system and then the result is returned, more clock cycles than if the cache was not there. Nature of the beast. Caches are not perfect, with the right information and experience you can easily write code that makes the cache degrade the performance of the application.
Clean means if a cache line is valid and dirty then write it out to system memory and mark it as clean.
Invalidate means if the cache line is valid then mark it as valid. If it was valid and dirty that information is lost.
In your case you do not want to deal with cache at all for these transactions, the cache in question is in the arm core so nobody but the arm core has access to that cache, nobody else is behind the cache, they are all on the system end.
Taking a quick look at the ARM ARM for armv7-m they do use address space to determine write through and cached or not. One then needs to look at the cortex-m7 TRM for further information and then, particularly in this case, since it is a chip thing not an arm thing anyway, the whole system. The arm processor is just some bit of ip that st bought to glue into a chip with a bunch of other ip and ip of their own that is glued together. Like the engine in the car, the engine manufacturer cant answer questions about the rear differential nor the transmission, that is the car company not the engine company.
arm knows what they are doing
st knows what they are doing
if a chip company makes a chip with dma but the only path between the processor and the memory shared with the dma engine is through the processors cache when the cache is enabled, and clean/flush and invalidate of address ranges are constantly required to use that dma engine...Then you need to immediately discard that chip, blacklist that company's products (if this product is that poorly designed then assume all of their products are), and find a better company to buy products from.
I cant imagine that is the case here, so
Initialize the peripheral, choosing to use DMA and configure the peripheral or dma engine or both (for each direction).
Start the peripheral (this might be part of 4)
write the tx data to the configured address space for dma
tell the peripheral to start the transfer
monitor for completion of transfer
read the received data from the configured address space for dma
That is generic but that is what you are looking for, caches are not involved. For a part/family like this there should be countless examples including the (choose your name for the quality) one or more library solutions and examples that come from the chip vendor. Look at how they others are using the part, compare that to the documentation, determine your risk level for their solution and use it or modify it or learn from it if nothing else.
I know that st products do not have an instruction cache they do their own thing, or at least that is what I remember (some trademarked name for a flash cache, on most of them you cannot turn it off). Does that mean they have not implemented a data cache on the products either? Possible. Just because the architecture for an ip product has a feature (fpu, caches, ...) does not automatically mean that the chip vendor has enabled/implemented those. Depending on the ip there are various ways to do that as some ip does not have a compile time option for the chip vendor to not compile in a feature. if nothing else the chip vendor could simply stub out the cache memory interfaces and write a few lines of text in the docs that there is no cache, and you can write control registers and see things appear to enable that feature but it simply does not work. One expects that arm provides compile time features, that are not in the public documentation we can see, but are available to the chip vendor in some form. Sometimes when you buy the ip you are given a menu if you will like ordering a custom burger at a fancy burger shop, a list of checkboxes, mayo, mustard, pickle. ... fpu, cache, 16 bit fetch, 32 bit fetch, one cycle multiply, x cycle multiply, divide, etc. And the chip vendor then produces your custom burger. Or some vendors you get the whole burger then you have to pick off the pickles and onions yourself.
So again, not our job to read the docs for you, so first off does this part even have a dcache? Look between the arm arm, the arm trm and the documentation for the chip address spaces (as well as the countless examples) and determine what address space or whet settings, etc are needed to access portions of sram in a non-cached way. If it has a data cache feature at all.
I have investigated a bit more:
With regards to clean and invalidate memory question, the answer is yes: clean will force cache to be written in memory and invalidate will force memory to be cached.
With regards to the steps I proposed, again yes, it makes sense.
Here is a sequence of 4 videos that explain this exact situation (DMA and memory coherency). As can be seen, the 'software' solution (doesn't involve MPU) proposed by the videos (and other resources provided above) is exactly the sequence of steps I posted.
https://youtu.be/5xVKIGCPy2s
https://youtu.be/2q8IvCxSjaY
https://youtu.be/6IEtoG7m0jI
https://youtu.be/0DhYTqPCRiA
The other proposed solution is to configure the cortex-m7 MPU to change the attributes of a particular memory region to keep memory coherency.
This all apart from the easiest solution which is to globally disable d-cache, although, naturally, this is not desirable.

Is it possible the to lock the ISR instructions to L1 cache?

I am running a bare metal application on one of the cores of ARM cortex A9 processor. My ISR is quite small an I am wondering whether it would be possible to lock my ISR instructions in the L1 cache? Is it possible? Is there any one who would explain some drawbacks of doing it?
Regards,
N
The Cortex-A9 does not support L1 cache lockdown (neither instructions nor data).
The drawback is that taking large chunks of the cache away (lockdown is usually done on a granularity of entire cache ways) decreases performance for everything else in the system.
Not to mention the fact that if your ISR is indeed small, and it is called frequently, it is somewhat likely to be in the cache anyway.
What is the benefit you were expecting to gain from doing this?
Your condition is the perfect fit for fast interrupt. (FIQ)
You only have to assign the last interrupt number for that particular ISR.
While other interrupt numbers are just vectors, the last number branches directly to the code area, thus saving one memory load plus interlock. You save about three cycles or so.
Besides, i-cache lockdown isn't as efficient as d-cache lockdown.
CA9 doesn't support L1 cache lockdown anyway (for some good reasons), so don't bother.
Just make sure the ISR is cache line aligned for maximum efficiency. (typically 32 or 64byte)

Multiple threads and CPU cache

I am implementing an image filtering operation in C using multiple threads and making it as optimized as possible. I have one question though: If a memory is accessed by thread-0, and concurrently if the same memory is accessed by thread-1, will it get it from the cache ? This question stems from the possibility that these two threads could be running into two different cores of the CPU. So another way of putting this is: do all the cores share the same common cache memory ?
Suppose i have a memory layout like the following
int output[100];
Assume there are 2 CPU cores and hence I spawn two threads to work concurrently. One scheme could be to divide the memory into two chunks, 0-49 and 50-99 and let each thread work on each chunk. Another way could be to let thread-0 work on even indices, like 0 2 4 and so on.. while the other thread work on odd indices like 1 3 5 .... This later technique is easier to implement (specially for 3D data) but I am not sure if I could use the cache efficiently this way.
The answer to this question strongly depends upon the architecture and the cache level, along with where the threads are actually running.
For example, recent Intel multi core CPUs have a L1 caches that are per-core, and an L2 cache that is shared among cores that are in the same CPU package; however different CPU packages will have their own L2 caches.
Even in the case when your threads are running on two cores within the one package though, if both threads access data within the same cacheline you will have that cacheline bouncing between the two L1 caches. This is very inefficient, and you should design your algorithm to avoid this situation.
A few comments have asked about how to go about avoiding this problem.
At heart, it's really not particularly complicated - you just want to avoid two threads from simultaneously trying to access data that is located on the same cache line, where at least one thread is writing to the data. (As long as all the threads are only reading the data, there's no problem - on most architectures, read-only data can be present in multiple caches).
To do this, you need to know the cache line size - this varies by architecture, but currently most x86 and x86-64 family chips use a 64 byte cache line (consult your architecture manual for other architectures). You will also need to know the size of your data structures.
If you ask your compiler to align the shared data structure of interest to a 64 byte boundary (for example, your array output), then you know that it will start at the start of a cache line, and you can also calculate where the subsequent cache line boundaries are. If your int is 4 bytes, then each cacheline will contain exactly 8 int values. As long as the array starts on a cacheline boundary, then output[0] through output[7] will be on one cache line, and output[8] through output[15] on the next. In this case, you would design your algorithm such that each thread works on a block of adjacent int values that is a multiple of 8.
If you are storing complicated struct types rather than plain int, the pahole utility will be of use. It will analyse the struct types in your compiled binary, and show you the layout (including padding) and total size. You can then adjust your structs using this output - for example, you may want to manually add some padding so that your struct is a multiple of the cache line size.
On POSIX systems, the posix_memalign() function is useful for allocating a block of memory with a specified alignment.
In general it is a bad idea to share overlapping memory regions like if one thread processes 0,2,4... and the other processes 1,3,5... Although some architectures may support this, most architectures will not, and you probably can not specify on which machines your code will run on. Also the OS is free to assign your code to any core it likes (a single one, two on the same physical processor, or two cores on separate processors). Also each CPU usually has a separate first level cache, even if its on the same processor.
In most situations 0,2,4.../1,3,5... will slow down performance extremely up to possibly being slower than a single CPU.
Herb Sutters "Eliminate False Sharing" demonstrates this very well.
Using the scheme [...n/2-1] and [n/2...n] will scale much better on most systems. It even may lead to super linear performance as the cache size of all CPUs in sum can be possibly used. The number of threads used should be always configurable and should default to the number of processor cores found.
I might be mistaking, but whether the core's cache is shared or not depends on the implementation of the CPU. You'd have to look up the technical sheets on the manufacturer's page to check whether each core in your CPU has their own cache or whether the cache was shared.
I was working on image manipulation as well for a security company and sometimes we got corrupted images after running batch operations on threads. After long investigations we came to the conclusion that the cache was shared between CPU Core's and that in rare cases the data was beeing overwritten or replaced with incorrect data.
Whether this is something to keep into account or is rather a rare event I cannot anwser.
Intel documentation
Intel publishes per-generation datasheets which may contain this kind of information.
For example, for the processor i5-3210M which I had on my older computer, I look up the 3rd generation - Datasheet Volume 1 3.3 "Intel Hyper-Threading Technology (Intel HT Technology)" says:
The processor supports Intel Hyper-Threading Technology (Intel HT Technology)
that allows an execution core to function as two logical processors. While some
execution resources such as caches, execution units, and buses are shared, each
logical processor has its own architectural state with its own set of general-purpose registers and control registers.
which confirms that caches are shared in a given hyperthread for that generation of CPUs.
See also:
similar question for cache sharing across cores: How are cache memories shared in multicore Intel CPUs?
further analysis of threads vs cores: https://superuser.com/questions/133082/what-is-the-difference-between-hyper-threading-and-multiple-cores/995858#995858
the architecture spec itself also has a section about the sharing of certain resources that must be valid across all implementations, although it does not mention caches: What does multicore assembly language look like?

Can we optimize code to reduce power consumption?

Are there any techniques to optimize code in order to ensure lesser power consumption.Architecture is ARM.language is C
From the ARM technical reference site:
The features of the ARM11 MPCore
processor that improve energy
efficiency include:
accurate branch and sub-routine return prediction, reducing the number
of incorrect instruction fetch and
decode operations
use of physically addressed caches, which reduces the number of cache
flushes and refills, saving energy in
the system
the use of MicroTLBs reduces the power consumed in translation and
protection lookups each cycle
the caches use sequential access information to reduce the number of
accesses to the tag RAMs and to
unwanted data RAMs.
In the ARM11 MPCore processor
extensive use is also made of gated
clocks and gates to disable inputs to
unused functional blocks. Only the
logic actively in use to perform a
calculation consumes any dynamic
power.
Based on this information, I'd say that the processor does a lot of work for you to save power. Any power wastage would come from poorly written code that does more processing than necessary, which you wouldn't want anyway. If you're looking to save power, the overall design of your application will have more effect. Network access, screen rendering, and other power-hungry operations will be of more concern for power consumption.
Optimizing code to use less power is, effectively, just optimizing code. Regardless of whether your motives are monetary, social, politital or the like, fewer CPU cycles = less energy used. What I'm trying to say is I think you can probably replace "power consumption" with "execution time", as they would, essentially, be directly proportional - and you therefore may have more success when not "scaring" people off with a power-related question. I may, however, stand corrected :)
Yes. Use a profiler and see what routines are using most of the CPU. On ARM you can use some JTAG connectors, if available (I used Lauterbach both for debugging and for profiling). The main problem is generally to put your processor, when in idle, in a low-consumption state (deep sleep). If you cannot reduce the CPU percentage used by much (for example from 80% to 50%) it won't make a big difference. Depending on what operating systems you are running the options may vary.
The July 2010 edition of the Communications of the ACM has an article on energy-efficient algorithms which might interest you. I haven't read it yet so cannot impart any of its wisdom.
Try to stay in on chip memory (cache) for idle loops, keep I/O to a minimum, keep bit flipping to a minimum on busses. NV memory like proms and flash consume more power to store zeros than ones (which is why they erase to ones, it is actually a zero but the transitor(s) invert the bit before you see it, zeros stored as ones, ones stored as zeros, this is also why they degrade to ones when they fail), I dont know about volatile memories, dram uses half as many transistors as sram, but has to be refreshed.
For all of this to matter though you need to start with a lower power system as the above may not be noticeable. dont use anything from intel for example.
If you are not running Windows XP+ or a newer version of Linux, you could run a background thread which does nothing but HLT.
This is how programs like CPUIdle reduce power consumption/heat.
If the processor is tuned to use less power when it needs less cycles, then simply making your code run more efficiently is the solution. Else, there's not much you can do unless the operating system exposes some sort of power management functionality.
Keep IO to a minimum.
On some ARM processors it's possible to reduce power consumption by putting the voltage regulator in standby mode.

Measuring CPU clocks consumed by a process

I have written a program in C. Its a program created as result of a research. I want to compute exact CPU cycles which program consumes. Exact number of cycles.
Any idea how can I find that?
The valgrind tool cachegrind (valgrind --tool=cachegrind) will give you a detailed output including the number of instructions executed, cache misses and branch prediction misses. These can be accounted down to individual lines of assembler, so in principle (with knowledge of your exact architecture) you could derive precise cycle counts from this output.
Know that it will change from execution to execution, due to cache effects.
The documentation for the cachegrind tool is here.
No you can't. The concept of a 'CPU cycle' is not well defined. Modern chips can run at multiple clock rates, and different parts of them can be doing different things at different times.
The question of 'how many total pipeline steps' might in some cases be meaningful, but there is not likely to be a way to get it.
Try OProfile. It use various hardware counters on the CPU to measure the number of instructions executed and how many cycles have passed. You can see an example of it's use in the article, Memory part 7: Memory performance tools.
I am not entirely sure that I know exactly what you're trying to do, but what can be done on modern x86 processors is to read the time stamp counter (TSC) before and after the block of code you're interested in. On the assembly level, this is done using the RDTSC instruction, which gives you the value of the TSC in the edx:eax register pair.
Note however that there are certain caveats to this approach, e.g. if your process starts out on CPU0 and ends up on CPU1, the result you get from RDTSC will refer to the specific processor core that executed the instruction and hence may not be comparable. (There's also the lack of instruction serialisation with RDTSC, but in this context here, I don't think that's so much of an issue.)
Sorry, but no, at least not for most practical purposes -- it's simply not possible with most normal OSes. Just for example, quite a few OSes don't do a full context switch to handle an interrupt, so the time spent servicing a interrupt can and often will appear to be time spent in whatever process was executing when the interrupt occurred.
The "not for practical purposes" would indicate the possibility of running your program under a cycle accurate simulator. These are available, but mostly for CPUs used primarily in real-time embedded systems, NOT for anything like a full-blown PC. Worse, they (generally) aren't for running anything like a full-blown OS, but for code that runs on the "bare metal."
In theory, you might be able to do something with a virtual machine running something like Windows or Linux -- but I don't know of any existing virtual machine that attempts to, and it would be decidedly non-trivial and probably have pretty serious consequences in performance as well (to put it mildly).

Resources