Memory to cache mapping - C

We've just begun the topic of caches, memory mapping, and address structure, and I am confused about how to find the cache size for this question.
If a cache size is given to you as 128 blocks, and memory has 16K blocks of 8 words each, where a word is 4 bytes, does that mean each cache block is also 8 words? Or do we have to divide 128 by 8 to get the actual number of cache blocks? In that case, what would the size of the cache be? Would it equal 128*8*4 bytes?
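Assuming the usual convention for this kind of exercise, that the cache line size equals the memory block size (8 words of 4 bytes each), the cache's data capacity would indeed be 128 * 8 * 4 = 4096 bytes (tag, valid and other control bits come on top of that). A minimal sketch of the arithmetic:

    #include <stdio.h>

    int main(void) {
        /* Assumption (not stated in the question): the cache block size
           equals the memory block size of 8 words, with 4-byte words. */
        unsigned cache_blocks    = 128;
        unsigned words_per_block = 8;
        unsigned bytes_per_word  = 4;

        unsigned data_bytes = cache_blocks * words_per_block * bytes_per_word;
        printf("cache data capacity: %u bytes (%u KiB)\n",
               data_bytes, data_bytes / 1024);   /* 4096 bytes = 4 KiB */
        return 0;
    }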

Related

Direct mapped cache - Small cache block size and high offset

I have to deal with different cache block sizes ranging from 4 to 128 16-bit words, depending on the cache controller mode. The addresses are 20 bits long, with 8 bits for the tag, 8 bits for the cache block ID and 4 bits for the offset. The addresses are written in hex, so they have a form like 0x0C835, but since the first 4 bits of the tag are always zero the microcontroller receives them as C835. Can someone explain to me how to deal with an offset that is higher than the block size? For example, a cache block size of 4 and an offset equal to 8?
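For reference, here is how a 20-bit address is conventionally decoded into the fields the question lists (8-bit tag, 8-bit block ID, 4-bit offset); nothing here is specific to that particular controller. If the current mode uses 4-word blocks, one common interpretation is that only the low 2 bits of the 4-bit offset field select a word inside the block and the upper offset bits effectively extend the block ID, but that is an assumption about the controller, not something the question states.

    #include <stdio.h>

    int main(void) {
        /* Field split taken from the question: 8-bit tag, 8-bit block ID,
           4-bit word offset, 20 bits in total. */
        unsigned addr   = 0x0C835;
        unsigned offset =  addr        & 0xF;    /* low 4 bits  */
        unsigned index  = (addr >> 4)  & 0xFF;   /* next 8 bits */
        unsigned tag    = (addr >> 12) & 0xFF;   /* top 8 bits  */

        printf("tag=0x%02X index=0x%02X offset=0x%X\n", tag, index, offset);
        /* prints: tag=0x0C index=0x83 offset=0x5 */
        return 0;
    }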

Are Arrays Contiguous? (Virtual vs Physical)

I read that arrays are contiguous in Virtual Memory but probably not in Physical memory, and I don't get that.
Let's suppose I have an array of size 4KB (one page = one frame size). In virtual memory that array is one page.
In virtual memory every page is translated into one frame, so our array is still contiguous...
(In the page table we translate pages into frames, not every byte into its own frame...)
Side question (when answering this, please state clearly that it's for the side question):
When allocating an array of size one page in virtual memory, does it have to be one page, or could it be split across two contiguous pages in virtual memory (for example, the bottom half of the first one and the top half of the second)? In that case the answer above is at worst 2, am I wrong?
Unless the start of the array happens to be aligned to the beginning of a memory page, it can still occupy two pages; it can start near the end of one page and end on the next page. Arrays allocated on the stack will probably not be forced to occupy a single page, because stack frames are simply allocated sequentially in the stack memory, and the array will usually be at the same offset within each stack frame.
The heap memory allocator (malloc()) could try to ensure that arrays that are smaller than a page will be allocated entirely on the same page, but I'm not sure if this is actually how most allocators are implemented. Doing this might increase memory fragmentation.
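As a quick way to see what a particular allocator actually does, here is a minimal sketch (assuming a 4096-byte page size; on POSIX you could query it with sysconf(_SC_PAGESIZE)) that checks whether a heap-allocated, page-sized array straddles a page boundary in virtual address space:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define PAGE_SIZE 4096u   /* assumed; on POSIX query it with sysconf(_SC_PAGESIZE) */

    /* Returns 1 if the object [p, p+size) crosses a page boundary in virtual memory. */
    static int spans_pages(const void *p, size_t size) {
        uintptr_t first = (uintptr_t)p / PAGE_SIZE;
        uintptr_t last  = ((uintptr_t)p + size - 1) / PAGE_SIZE;
        return first != last;
    }

    int main(void) {
        char *a = malloc(PAGE_SIZE);   /* a page-sized array on the heap */
        if (a) {
            printf("array %s a page boundary\n",
                   spans_pages(a, PAGE_SIZE) ? "crosses" : "fits within");
            free(a);
        }
        return 0;
    }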
I read that arrays are contiguous in Virtual Memory but probably not in Physical memory, and I don't get that.
This statement is missing something very important: the array size.
For small arrays the statement is wrong. For "large/huge" arrays the statement is correct.
In other words: The probability of an array being split over multiple non-contiguous physical pages is a function of the array size.
For small arrays the probability is close to zero, but it increases as the array size increases. When the array size grows above the system's page size, the probability gets closer and closer to 1. But an array requiring multiple pages may still be contiguous in physical memory.
For your side question:
With an array size equal to your system's page size, the array can span at most two physical pages.
Anything (array, structure, ...) that is larger than the page size must be split across multiple pages, and therefore may be "virtually contiguous, physically non-contiguous".
Without further knowledge or restriction; anything (array, structure, ...) that is between its minimum alignment (e.g. 4 bytes for an array of uint32_t) and the page size has a probability of being split across multiple pages; where the probability depends on its size and alignment. For example, if page size is 4096 bytes and an array has a minimum alignment of 4 bytes and a size of 4092 bytes, then there's 2 chances in 1024 that it will end up on a single page (and a 99.8% chance that it will be split across multiple pages).
Anything (variable, tiny array, tiny structure, ...) that has a size equal to its minimum alignment won't (shouldn't - see note 3) be split across multiple pages.
Note 1: For anything using memory allocated from the heap, the minimum alignment can be assumed to be the (implementation defined) minimum alignment provided by the heap and not the minimum alignment of the object itself. E.g. for an array of uint16_t the minimum alignment would be 2 bytes; but malloc() will return memory with much larger alignment (maybe 16 bytes).
Note 2: When things are nested (e.g. array inside a structure inside another structure) all of the above applies to the outer structure only. E.g. if you have an array of uint16_t inside a structure where the array happens to begin at offset 4094 within the structure; then it will be significantly more likely that the array will be split across pages.
Note 3: It's possible to explicitly break minimum alignment using pointers (e.g. use malloc() to allocate 1024 bytes, then create a pointer to an array that begins at any offset you want within the allocated area).
Note 4: If something (array, structure, ...) is split across multiple pages; then there's a chance that it will still be physically contiguous. For worst case this depends on the amount of physical memory (e.g. if the computer has 1 GiB of usable physical memory and 4096 byte pages, then there's approximately 1 chance in 262000 that 2 virtually contiguous pages will be "physically contiguous by accident"). If the OS implements page/cache coloring (see https://en.wikipedia.org/wiki/Cache_coloring ) it improves the probability of "physically contiguous by accident" by the number of page/cache "colors" (e.g. if the computer has 1 GiB of usable physical memory and 4096 byte pages, and the OS uses 256 page/cache colors, then there's approximately 1 chance in 1024 that 2 virtually contiguous pages will be "physically contiguous by accident").
Note 5: Most modern operating systems use multiple page sizes (e.g. 4 KiB pages and 2 MiB pages, and maybe also 1 GiB pages). This can either make it hard to guess what the page size actually is, or improve the probability of "physically contiguous by accident" if you assume the smallest page size is used.
Note 6: For some CPUs (e.g. recent AMD/Zen) the TLBs behave as if pages are larger (e.g. as if you're using 16 KiB pages and not 4 KiB pages) if and only if page table entries are compatible (e.g. if 4 page table entries describe four physically contiguous 4 KiB pages with the same permissions/attributes). If an OS is optimized for these CPUs the result is similar to having an extra page size (4 KiB, "16 KiB", 2 MiB and maybe 1 GiB).
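To make the arithmetic above concrete (the 4092-byte array with 4-byte alignment and 4096-byte pages), here is a small sketch that counts the start offsets that keep an object on a single page, assuming every allowed offset within a page is equally likely (the same assumption the notes use):

    #include <stdio.h>

    /* Probability that an object of `size` bytes with `align`-byte alignment
       is split across pages of `page` bytes, assuming every allowed start
       offset within a page is equally likely. */
    static double split_probability(unsigned size, unsigned align, unsigned page) {
        unsigned starts = page / align;                         /* possible start offsets   */
        unsigned fits   = size <= page ? (page - size) / align + 1 : 0;  /* offsets that fit */
        return 1.0 - (double)fits / (double)starts;
    }

    int main(void) {
        /* The example from the notes: 4092-byte array, 4-byte alignment, 4 KiB pages. */
        printf("%.1f%%\n", 100.0 * split_probability(4092, 4, 4096));   /* ~99.8% */
        return 0;
    }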
When allocating an array of size one page in virtual memory, does it have to be one page, or could it be split across two contiguous pages in virtual memory (for example, the bottom half of the first one and the top half of the second)?
When allocating an array in heap memory of size one page; the minimum alignment would be the implementation defined minimum alignment provided by the heap manager/malloc() (e.g. maybe 16 bytes). However; most modern heap managers switch to using an alternative (e.g. mmap() or VirtualAlloc() or similar) when the amount of memory being allocated is "large enough"; so (depending on the implementation and their definition of "large enough") it might be page aligned.
When allocating an array in raw virtual memory (e.g. using mmap() or VirtualAlloc() or similar yourself, and NOT using the heap and not using something like malloc()); page alignment is guaranteed (mostly because the virtual memory manager doesn't deal with anything smaller).
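Here is a minimal sketch of that difference, assuming a POSIX system (on Windows the equivalent would use VirtualAlloc()): the malloc() block gets whatever alignment the heap manager provides, while the mmap() block is always page aligned.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>   /* POSIX; MAP_ANONYMOUS may be spelled MAP_ANON on some systems */

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);

        /* Heap allocation: alignment is whatever malloc() guarantees (often 16 bytes),
           unless the allocator decides to satisfy a "large" request with mmap() itself. */
        void *heap_buf = malloc((size_t)page);

        /* Raw virtual memory: the kernel only deals in whole pages,
           so the returned address is always page aligned. */
        void *vm_buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        printf("page size : %ld bytes\n", page);
        if (heap_buf)
            printf("malloc()  : %p (page aligned: %s)\n", heap_buf,
                   (uintptr_t)heap_buf % (uintptr_t)page == 0 ? "yes" : "no");
        if (vm_buf != MAP_FAILED)
            printf("mmap()    : %p (page aligned: %s)\n", vm_buf,
                   (uintptr_t)vm_buf % (uintptr_t)page == 0 ? "yes" : "no");

        if (vm_buf != MAP_FAILED) munmap(vm_buf, (size_t)page);
        free(heap_buf);
        return 0;
    }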

Use of memory pools in C (Embedded systems)

I'm writing a piece of code for an embedded system (a Cortex-M4 MCU with very limited RAM, < 32KB) where I need big chunks of memory for a very short and well-defined time, and I thought of using some kind of memory pool so the memory won't get wasted on functions and actions that run once or twice in a lifetime.
I made some efforts, but I think I'd like to learn more about memory pools and rewrite my code better.
I saw some examples where a pointer is used to point to the next available free chunk, but I don't understand how I can process that kind of pool.
For example, I can't use strstr on a memory pool where the total string is spread across more than one chunk. I would need to read chunk by chunk and store the total string into one larger array to carry on with further processing. Please correct me if I'm wrong.
So, if I get it right: if I have a memory pool of 1024 bytes with 32-byte chunks, that gives us 32 chunks in total. And if I want to store a string of total length, let's say, 256 chars (bytes), I'd need 8 chunks, but if I want to read the string I'd need to copy those 8 chunks into a 256-char array.
Am I missing something?
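Your arithmetic is right (1024 / 32 = 32 chunks, and 256 / 32 = 8 chunks). Here is a minimal sketch of the "pointer to the next free chunk" scheme those examples describe, using the sizes from the question; the pool itself is a made-up example, not any particular library:

    #include <stddef.h>
    #include <stdint.h>

    #define POOL_SIZE  1024u   /* sizes taken from the question above */
    #define CHUNK_SIZE 32u     /* 1024 / 32 = 32 chunks */

    static uint8_t pool_storage[POOL_SIZE];
    static void   *free_list;   /* head of the list of free chunks */

    /* Thread all chunks onto the free list; each free chunk's first bytes
       hold a pointer to the next free chunk. */
    void pool_init(void) {
        free_list = NULL;
        for (size_t i = 0; i < POOL_SIZE / CHUNK_SIZE; i++) {
            void *chunk = &pool_storage[i * CHUNK_SIZE];
            *(void **)chunk = free_list;
            free_list = chunk;
        }
    }

    /* Hand out one 32-byte chunk, or NULL if the pool is exhausted. */
    void *pool_alloc(void) {
        void *chunk = free_list;
        if (chunk)
            free_list = *(void **)chunk;   /* unlink the head chunk */
        return chunk;
    }

    /* Return a chunk to the pool by pushing it back onto the free list. */
    void pool_free(void *chunk) {
        *(void **)chunk = free_list;
        free_list = chunk;
    }

Note that pool_alloc() simply hands back whichever chunk is at the head of the free list, so the 8 chunks holding a 256-byte string are not contiguous in general; with a pool like this you would indeed need to either copy the pieces into one buffer before calling strstr, or use an allocator that can hand out several adjacent chunks as one block.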

Why is a double in C 8-byte aligned?

I was reading an article about data type alignment in memory (here) and I am unable to understand one point, i.e.:
Note that a double variable will be allocated on 8 byte boundary on 32 bit machine and requires two memory read cycles. On a 64 bit machine, based on number of banks, double variable will be allocated on 8 byte boundary and requires only one memory read cycle.
My doubt is: why do double variables need to be allocated on an 8-byte boundary and not on a 4-byte boundary? If it is allocated on a 4-byte boundary we still need only 2 memory read cycles (on a 32-bit machine). Correct me if I am wrong.
Also, if someone has a good tutorial on member/memory alignment, kindly share.
The reason to align a data value of size 2^N on a boundary of 2^N is to avoid the possibility that the value will be split across a cache line boundary.
The x86-32 processor can fetch a double from any word boundary (8 byte aligned or not) in at most two 32-bit memory reads. But if the value is split across a cache line boundary, then the time to fetch the 2nd word may be quite long because of the need to fetch a 2nd cache line from memory. This produces poor processor performance unnecessarily. (As a practical matter, current processors don't fetch 32 bits from memory at a time; they tend to fetch much bigger values on much wider busses to enable really high data bandwidths; the actual time to fetch both words, if they are in the same cache line and already cached, may be just 1 clock.)
A free consequence of this alignment scheme is that such values also do not cross page boundaries. This avoids the possibility of a page fault in the middle of a data fetch.
So, you should align doubles on 8 byte boundaries for performance reasons. And the compilers know this and just do it for you.
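As a small illustration of the cache-line argument (C11), the sketch below checks whether an 8-byte value starting at a given byte offset would straddle a line boundary; the 64-byte cache line size is an assumption, since the real value is CPU-specific.

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64u   /* assumed line size; the real value is CPU-specific */

    /* Does a value of `size` bytes starting at address `a` cross a line boundary? */
    static int splits_cache_line(uintptr_t a, size_t size) {
        return (a / CACHE_LINE) != ((a + size - 1) / CACHE_LINE);
    }

    int main(void) {
        printf("_Alignof(double) on this target: %zu\n", (size_t)_Alignof(double));

        /* 8-byte aligned offset 56: bytes 56..63 stay in one line. */
        printf("double at offset 56: %s\n", splits_cache_line(56, 8) ? "split" : "one line");
        /* 4-byte aligned offset 60: bytes 60..67 span two lines. */
        printf("double at offset 60: %s\n", splits_cache_line(60, 8) ? "split" : "one line");
        return 0;
    }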
Aligning a value on a lower boundary than its size makes it prone to being split across two cache lines. Splitting the value across two cache lines means extra work when evicting the cache lines to the backing store (two cache lines will be evicted instead of one), which puts a useless load on the memory buses.
8-byte alignment for double on a 32-bit architecture doesn't reduce memory reads, but it still improves performance of the system in terms of reduced cache accesses. Please read the following:
https://stackoverflow.com/a/21220331/5038027
Also refer to this wiki article about the double-precision floating-point format.
The number of memory cycles depends on your hardware architecture, which determines how many RAM banks you have. If you have a 32-bit architecture and 4 RAM banks, you need only 2 memory cycles to read a double (each RAM bank contributing 1 byte per cycle).

What is meant by "memory is 8 bytes aligned"?

While going through one project, I have seen that the memory data is "8 bytes aligned". Can anyone please explain what this means?
An object that is "8 bytes aligned" is stored at a memory address that is a multiple of 8.
Many CPUs will only load some data types from aligned locations; on other CPUs such access is just faster. There's also several other possible reasons for using memory alignment - without seeing the code it's hard to say why.
Aligned access is faster because the external bus to memory is not a single byte wide - it is typically 4 or 8 bytes wide (or even wider). This means that the CPU doesn't fetch a single byte at a time - it fetches 4 or 8 bytes starting at the requested address. As a consequence of this, the 2 or 3 least significant bits of the memory address are not actually sent by the CPU - the external memory can only be read or written at addresses that are a multiple of the bus width. If you requested a byte at address "9", the CPU would actually ask the memory for the block of bytes beginning at address 8, and load the second one into your register (discarding the others).
This implies that a misaligned access can require two reads from memory: If you ask for 8 bytes beginning at address 9, the CPU must fetch the 8 bytes beginning at address 8 as well as the 8 bytes beginning at address 16, then mask out the bytes you wanted. On the other hand, if you ask for the 8 bytes beginning at address 8, then only a single fetch is needed. Some CPUs will not even perform such a misaligned load - they will simply raise an exception (or even silently load the wrong data!).
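Here is a small sketch of that arithmetic, assuming the 8-byte-wide bus from the answer above; it just computes which aligned bus words a given access touches.

    #include <stdio.h>
    #include <stdint.h>

    #define BUS_WIDTH 8u   /* assumed 8-byte-wide memory bus, as in the answer above */

    /* Print which aligned bus words a read of `size` bytes at `addr` touches. */
    static void show_fetches(uintptr_t addr, unsigned size) {
        uintptr_t first = addr & ~(uintptr_t)(BUS_WIDTH - 1);
        uintptr_t last  = (addr + size - 1) & ~(uintptr_t)(BUS_WIDTH - 1);

        printf("%u-byte read at address %lu needs word(s) at:",
               size, (unsigned long)addr);
        for (uintptr_t w = first; w <= last; w += BUS_WIDTH)
            printf(" %lu", (unsigned long)w);
        printf("\n");
    }

    int main(void) {
        show_fetches(9, 1);   /* one fetch: the word at 8            */
        show_fetches(9, 8);   /* two fetches: the words at 8 and 16  */
        show_fetches(8, 8);   /* aligned access: a single fetch at 8 */
        return 0;
    }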
Memory alignment is important for performance in different ways, and it has a hardware-related reason. Since the 80s there has been a gap in access time between the CPU and the memory: the speed of processors has grown faster than the speed of memory, and the difference keeps getting bigger over time (to give an example: on the Apple II the CPU ran at 1.023 MHz and the memory ran at twice that frequency, 1 cycle for the CPU, 1 cycle for the video; a modern PC works at about 3 GHz on the CPU, with memory at barely 400 MHz).
One solution to the problem of relatively ever-slower memory is to access it over ever wider buses: instead of accessing 1 byte at a time, the CPU reads a 64-bit wide word from memory. This means that even if you read 1 byte from memory, the bus delivers a whole 64 bits (an 8-byte word). The memory holds these 8-byte units at addresses 0, 8, 16, 24, 32, 40, etc., always a multiple of 8. If you access, for example, an 8-byte word at address 4, the hardware has to read the word at address 0, keep only the high 4 bytes of that word, then read the word at address 8, keep only the low half of that word, combine it with the first half and hand the result to the register. As you can see, this is a quite complicated (and thus slow) operation. That is the first reason one prefers aligned memory access.
"X bytes aligned" means that the base address of your data must be a multiple of X. It can be used for using some special hardware like a DMA in some special hardware, for a faster access by the cpu, etc...
It is the case of the Cell Processor where data must be 16 bytes aligned in order to be copied to/from the co-processor.
If memory data is 8 bytes aligned, its address is a multiple of 8. In C, a structure that is supposed to be 8 bytes aligned also has its size padded to a multiple of 8 (manually or by the compiler), so that every element of an array of that structure stays aligned. Some compilers provide directives to make a structure aligned to n bytes: for VC it is #pragma pack(8), and for gcc it is __attribute__((aligned(8))).
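As a small illustration of the gcc/clang syntax mentioned above (MSVC has __declspec(align(8)) and the pack pragma), the sketch below shows both the alignment and the padding the compiler inserts so that array elements stay aligned; the struct is a made-up example.

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* gcc/clang attribute syntax from the answer above. */
    struct __attribute__((aligned(8))) Record {
        uint32_t id;
        uint8_t  flag;
        /* the compiler pads this struct to 8 bytes so arr[1] below stays aligned */
    };

    int main(void) {
        struct Record arr[2];

        printf("_Alignof = %zu, sizeof = %zu\n",
               (size_t)_Alignof(struct Record), sizeof(struct Record));  /* 8 and 8 */
        printf("arr[0] at %p, arr[1] at %p\n",
               (void *)&arr[0], (void *)&arr[1]);   /* both addresses multiples of 8 */
        return 0;
    }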
