I am working on an application that has quite a few internal data structures, but also processes huge amounts of user data. During this processing, I need to have the CPU look at the data just once (the rest of the processing is done via zero copies and DMA, so the CPU need not touch the data at all).
I am searching for a way to process the user data (even if it means copying it to a temporary buffer) without having it evict the internal structures from the CPU's data cache. In other words, I'm looking for a way to tell the CPU "give me this data, but I'm never going to need it again".
I seem to recall that gcc had an intrinsic to do it, but going over the list, I seem to have misremembered (or otherwise couldn't find it). Either way, assembly solution (Intel) would work fine for my purposes.
Logic states that there must be a way to do this, as it is necessary to do this before sending data to (or receiving from) DMA buffers.
Related
I'm using an SPI device with DMA enabled in an STM32H7 SoC. The DMA periph. cannot access d-cache, so in order to make it work I have disabled d-cache entirely (for more info. about this, see this explanation). However, I would like to avoid disabling d-cache globally for a problem that only affects to a small region of memory.
I have read this post about the meaning of clean and invalidate cache operations, in the ARM domain. My understanding is that, by cleaning a cache area, you force it to be written in the actual memory. On the other hand, by invalidating a cache area, you force the actual memory to be cached. Is this correct?
My intention with this is to follow these steps to transmit something over SPI (with DMA):
Write the value you want on the buffer that DMA will read from.
Clean d-cache for that area to force it to go to actual memory, so DMA can see it.
Launch the operation: DMA will read the value from the area above and write it to the SPI's Tx buffer.
SPI reads data at the same time it writes, so there will be data in the SPI's Rx buffer, which will be read by DMA and then it will write it to the recv. buffer provided by the user. It could happen that an observer of such buffer can indeed access d-cache. The latter could not be updated with the new value received by SPI yet, so invalidate the recv. buffer area to force d-cache to get updated.
Does the above make sense?
EDIT
Adding some more sources/examples of the problem I'm facing:
Example from the ST github: https://github.com/STMicroelectronics/STM32CubeH7/issues/153
Post in ST forums answring and explaining the d-cache problem: https://community.st.com/s/question/0D53W00000m2fjHSAQ/confused-about-dma-and-cache-on-stm32-h7-devices
Here the interconnection between memory and DMA:
As you can see, DMA1 can access sram1, 2 and 3. I'm using sram2.
Here the cache attributes of sram2:
As you can see, it is write back,write allocate, but not write through. I'm not familiar with these attributes, so I read the definition from here. However, that article seems to talk about the CPU physical cache (L1, L2 etc.) I'm not sure if ARM i-cache and d-cache refer to this physical cache. In any case, I'm assuming the definition for write through and the other terms are valid for d-cache as well.
I forget off hand how the data cache works on the cortex-m7/armv7-m. I want to remember it does not have an MMU and caching is based on address. ARM and ST would be smart enough to know to put cached and non-cached access to sram from the processor core.
If you are wanting to send or receive data using DMA you do not go through the cache.
You linked a question from before which I had provided an answer.
Caches contain some amount of sram as we tend to see a spec for this many KBytes or this many MBytes, whatever. But there are also tag rams and other infrastructure. How does the cache know if there is a hit or a miss. Not from the data, but from other bits of information. Taken from the address of the transaction. Some number of bits of that address are taken and compared to however many "ways" you have so there may be 8 ways for example so there are 8 small memories think of them as arrays of structures in C. In that structure is some information is this cache line valid? If valid what is the tag or bit of address that it is tied to, is it clean/dirty...
Clean or dirty meaning the overall caching infrastructure will be designed (kinda the whole point) to hold information in a faster sram (sram in mcus is very fast already so why a cache in the first place???), which means that write transactions, if they go through the cache (they should in some form) will get written to the cache, and then based on design/policy will get written out into system memory or at least get written on the memory side of the cache. While the cache contains information that has been written that is not also in system memory (due to a write) that is dirty. And when you clean the cache using ARM's term clean, or flush is another term, etc. You go through all of the cache and look for items that are valid and dirty and you initiate writes to system memory to clean them. This is how you force things out the cache into system memory for coherency reasons, if you have a need to do that.
Invalidate a cache simply means you go through the tag rams and you change the valid bit to indicate invalid for that cache line. Basically that "loses" all information about that cache line it is now available to use. It will not result in any hits and it will not do a write to the system for a clean/flush. The actual cache line in the cache memory does not have to be zeroed or put in any other state. Technically just the valid/invalid bit or bits.
How things generally get into a cache are certainly from reads. Depending on the design and settings if a read is cacheable then the cache will first look to see if it has a tag for that item and if it is valid, if so then it simply takes the information in the cache and returns it. If there is a miss, that data does not have a copy in the cache, then it initiates one or more cache line reads from the system side. So a single byte read can/will cause a larger, sometimes much larger, read to happen on the system side, the transaction is held until that (much larger) data (read) returns and then it is put in the cache and the item requested is returned to the processor.
Depending on the architecture and settings, writes may or may not create an entry in the cache, if a (cacheable) write happens and there are no hits in the cache then it may just go straight to the system side as a write of that size and shape. As if the cache was not there. If there is a cache hit then it will go into the cache, and the that/those cache lines are marked as dirty and then depending on the design, etc it may be written to system memory as a side effect of the write from the processor side, the processor will be freed to continue execution but the cache and other logic (write buffer) may continue to process this transaction moving this new data to the system side essentially cleaning/flushing automatically. One normally does not expect this as it takes away performance that the cache was there to provide in the first place.
In any case if it is determined that a transaction has a miss and it is to be cached, then based on that tag, the ways have already been examined to determine if there was a hit. One of the ways will be chosen to hold this new cache line. How that is determined is based on design and in some cases programmable settings. Hopefully if there are any that are invalid then it would go to one of those. But round robin, randomizer, oldest first, etc are solutions you may see. And if there is dirty data in that space then it has to get written out first, making room for the new information. So, absolutely a single byte or single word read (since they have the same performance in a system like this) can require a cache flush of a cache line, then a read from the system and then the result is returned, more clock cycles than if the cache was not there. Nature of the beast. Caches are not perfect, with the right information and experience you can easily write code that makes the cache degrade the performance of the application.
Clean means if a cache line is valid and dirty then write it out to system memory and mark it as clean.
Invalidate means if the cache line is valid then mark it as valid. If it was valid and dirty that information is lost.
In your case you do not want to deal with cache at all for these transactions, the cache in question is in the arm core so nobody but the arm core has access to that cache, nobody else is behind the cache, they are all on the system end.
Taking a quick look at the ARM ARM for armv7-m they do use address space to determine write through and cached or not. One then needs to look at the cortex-m7 TRM for further information and then, particularly in this case, since it is a chip thing not an arm thing anyway, the whole system. The arm processor is just some bit of ip that st bought to glue into a chip with a bunch of other ip and ip of their own that is glued together. Like the engine in the car, the engine manufacturer cant answer questions about the rear differential nor the transmission, that is the car company not the engine company.
arm knows what they are doing
st knows what they are doing
if a chip company makes a chip with dma but the only path between the processor and the memory shared with the dma engine is through the processors cache when the cache is enabled, and clean/flush and invalidate of address ranges are constantly required to use that dma engine...Then you need to immediately discard that chip, blacklist that company's products (if this product is that poorly designed then assume all of their products are), and find a better company to buy products from.
I cant imagine that is the case here, so
Initialize the peripheral, choosing to use DMA and configure the peripheral or dma engine or both (for each direction).
Start the peripheral (this might be part of 4)
write the tx data to the configured address space for dma
tell the peripheral to start the transfer
monitor for completion of transfer
read the received data from the configured address space for dma
That is generic but that is what you are looking for, caches are not involved. For a part/family like this there should be countless examples including the (choose your name for the quality) one or more library solutions and examples that come from the chip vendor. Look at how they others are using the part, compare that to the documentation, determine your risk level for their solution and use it or modify it or learn from it if nothing else.
I know that st products do not have an instruction cache they do their own thing, or at least that is what I remember (some trademarked name for a flash cache, on most of them you cannot turn it off). Does that mean they have not implemented a data cache on the products either? Possible. Just because the architecture for an ip product has a feature (fpu, caches, ...) does not automatically mean that the chip vendor has enabled/implemented those. Depending on the ip there are various ways to do that as some ip does not have a compile time option for the chip vendor to not compile in a feature. if nothing else the chip vendor could simply stub out the cache memory interfaces and write a few lines of text in the docs that there is no cache, and you can write control registers and see things appear to enable that feature but it simply does not work. One expects that arm provides compile time features, that are not in the public documentation we can see, but are available to the chip vendor in some form. Sometimes when you buy the ip you are given a menu if you will like ordering a custom burger at a fancy burger shop, a list of checkboxes, mayo, mustard, pickle. ... fpu, cache, 16 bit fetch, 32 bit fetch, one cycle multiply, x cycle multiply, divide, etc. And the chip vendor then produces your custom burger. Or some vendors you get the whole burger then you have to pick off the pickles and onions yourself.
So again, not our job to read the docs for you, so first off does this part even have a dcache? Look between the arm arm, the arm trm and the documentation for the chip address spaces (as well as the countless examples) and determine what address space or whet settings, etc are needed to access portions of sram in a non-cached way. If it has a data cache feature at all.
I have investigated a bit more:
With regards to clean and invalidate memory question, the answer is yes: clean will force cache to be written in memory and invalidate will force memory to be cached.
With regards to the steps I proposed, again yes, it makes sense.
Here is a sequence of 4 videos that explain this exact situation (DMA and memory coherency). As can be seen, the 'software' solution (doesn't involve MPU) proposed by the videos (and other resources provided above) is exactly the sequence of steps I posted.
https://youtu.be/5xVKIGCPy2s
https://youtu.be/2q8IvCxSjaY
https://youtu.be/6IEtoG7m0jI
https://youtu.be/0DhYTqPCRiA
The other proposed solution is to configure the cortex-m7 MPU to change the attributes of a particular memory region to keep memory coherency.
This all apart from the easiest solution which is to globally disable d-cache, although, naturally, this is not desirable.
I've seen a few questions here on Stack overflow dealing with the same issues, but no definite answer. I thought I'll ask again, with a bunch of questions of my own. All relate to the subject matter at hand.
So, do we know when the data transfer from host to the openCL device occurs? Can you tell me the exact memory transfer operation of the functions below (that is, what data is transferred or created, if any, when these functions are invoked?):
clCreateBuffer()
clSetKernelArg()
clEnqueueNDRangeKernel()
The first two don't even produce events, so we can't time them, but surely some data transferring is happening here.
Is there a way to transfer data to a device without first setting it as a kernel arg?
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device. Why would that not be desirable, since, that way, we could avoid further data transfer commands (and surely the driver implements this in the most efficient way)?
Does transferred data (say, as par of a kernel arg) stay at the device for further manipulation, after a kernel returns? If not is there a way to do just that?
Buffer copies are related to command queues. Command queues are synced with host using finish() as easiest way.
clCreateBuffer()
clEnqueueWriteBuffer() <-------- you can get event data from this
(set blocking parameter to false to queue everything quickly)
(set blockinig to true if you sync write here)
clSetKernelArg()
clEnqueueWriteBuffer() <----- it could be here too
clEnqueueNDRangeKernel()
clEnqueueWriteBuffer() <----- or here (too quickly re-set an array?)
clFinish() <--------- this ensures all queued commands are executed before this
now you can query data of that event to check when it started and when ended
to let a buffer stay in device, you should create it in device first then don't migrate it to another device. Using only CL_MEM_READ_WRITE flag in createBuffer() is enough to make it a real buffer on device-side until you release that buffer.
CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR uses host memory as device maps it to its cores. This is faster for streaming data in and out because of not-needing of extra data movements in host side. If you need to use device memory such as fast gddr5 or hbm always, then you should not use these flags.
Copy to device once, use as much as you want. If device has its own memory of course. For example, Intel HD Graphics 400 doesn't have its own memory and shares RAM so it is much faster to use CL_MEM_..._HOST_PTR flags and especially USE_HOST_PTR.
To check if device shares RAM with CPU, you query CL_DEVICE_HOST_UNIFIED_MEMORY property of device.
It appears (from preliminary testing of my own) that a mem object
created with CL_MEM_USE_HOST_PTR gets directly manipulated by the
device
Even without map/unmap commands pror to kernel execution, my computer is behaving same, but I'm using map/unmap just to be safe and it doesn't tax too many cycles.
Edit: if you want to make sure a command doesn't start before you want, you can add a user event in event list input parameter of bufferwrite command. Then you can trigger the user event to let writing start because commands wait for all events in the list to be fired+completed before continuing (if there are any specified in event list input parameter)
I have a quite powerful embedded linux device that is to be used for collecting data from various sockets/fd:s using C. This data is to be parsed, buffered and passed on to a TCP/IP or a UDP socket to be transferred somewhere else for long term storage. This last step happens either when a sufficient amount of data has been acquired, or when some other event triggers.
My question is: is there any reason not to buffer everything on the heap (as opposed to writing/reading to some linux file descriptor) given that
the sole purpose of my device is this type of data acquisition
the device is never used for long term storage
Using only the heap sounds counter-intuitive, but I can't really see why we shouldn't store as much as we can in the heap, at least until RAM becomes scarce.
I don't quite get why you say "using the heap is counter-intuitive" - Millions of embedded routers and switches use the heap for store-and-forward queues (I understand what you do is similar).
It very much depends on the data that you acquire. Anything that can be re-acquired in case of a power failure or other reset events of your device doesn't really need to go into permanent storage.
Data that is hard or impossible to re-acquire and this valuable (like sensor data , for example), you might possibly want to push into a safe place where it is protected from resets and power-down, however.
On the other hand, if your data is not segmented but rather stream-oriented, storing it to a file might be a lot easier - Also beware that out-of-memory conditions and heap memory leaks can be a real nuisance to debug in embedded systems.
Data stored in main memory usually is not retained on power loss. If your collected data must survive power loss, it must be stored in non-volatile memory.
Unfortunately, just writing data to a file does not guarantee reliable storage, sine most linux file systems suffer from risk of data loss on power loss.
A second scenario, where storage in a file might be useful is, that the data collected in a file can survive a crash of your application. We all do our best to never let our applications crash, but despite all efforts, it still happens too often. :-(
I have a program that is used to exercise several disk units in a raid configuration. 1 process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a 2nd process is waiting for the queue to have entries to read the data back into memory using read().
The problem that I can't seem to overcome is that when the 2nd process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the code that is written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It could have to do with setting up the correct aligned buffers. I have also tried the posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
My though is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file works, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
The other option is of course to nobble the caching system itself in OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a slow crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER I guess none of them is a slight exaggeration - I know which disks belong to SYSBASE and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar w/IBM i. In the picture below, you will see that the write reqs are driving the % busyness up, but the read reqs are not even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow and use a busy machine so that other tasks are demanding disk I/O as well.
Originally I thought to make code faster it would be better to try and reduce the transition between Kernel and user space- by pushing more of the code to run in the kernel. However, I have read in a few forums like SO that the opposite is actually done- more of the code is pushed into the user space. Why is this? It seems counter intuitive? Putting more of the code into the user space still requires kernel-user transitions, whereas putting the code in the kernel doesnt requite kernel-user transitions?
In case anyone asks- I am thinking about an application processing packet data.
EDIT
So more details, I am thinking about when packet data arrives- I want to re-write the network stack and cut out code which isn't applicable for my packet processing and have zero copy- putting the packet data somewhere where the user program can access it as quick as possible.
The kernel is a time sensitive area, it’s where your ISRs, time tick routines, and hardware critical sections reside. Because of this, the objective is to keep kernel code small and tight, get in, get your work done, and get out.
In your case you're getting packets from the network, that's a hardware dependent task (you need to get data from the lower network layers), so get your data, clear the buffers, and send it via a DMA transfer to user space; then do your processing in user space.
From my experiences: The preformance gained by executing your code in ther kernel will not outweigh the preformance lost overall by executing more code in the kernel.
If you expect your code to go into the official kernel release, "shuffling user mode parts of it into the kernel" is probably a bad idea as a rule.
Of course, if you can prove that by doing so is the BEST (subjective, I know) way to achieve better performance, and the cost is acceptable (in terms of extra code in kernel -> more burden of maintenance on the kernel, bigger kernel -> more complaints about kernel being "too big" etc), then by all means follow that route.
But in general, it's probably better to approach this by doing more work in user-mode, and make the kernel mode task smaller, if that is at all an alternative. Without knowing exactly what you are doing in the kernel and what you are doing in usermode, it's hard to say for sure what you should/shouldn't do. But for example batching up a dozen "items" into a block that is ONE request for the kernel to do something is a better option than calling the kernel a dozen times.
In response to your edit describing what you are doing:
Would it not be better to pass a user-mode memory region to receive the data, and then just copy into that when the packet arrives. Assuming "all memory is equal" [if it isn't, you have problems with "in place use" anyway], this should work just as well, with less time spent in the kernel.
Transitions from user-mode to kernel-mode take some time and resources, so keeping the code in only one of the modes may increase performance.
As mentioned: in your case probably the best option you have is to fetch the data as fast as possible and make it available in user-land right away and do the processing in user-land... moving all the processing to kernel-level seems to me unnecessary... Unless you have a good reason to do so... with no further information it seems to me you have no reason to believe you'll do it faster in kernel-mode than user-mode, all you could spare is a mode transition now and then, which shouldn't be relevant.