Direct mapped cache - Small cache block size and high offset - c

I have to deal with different cache block sizes ranging from 4 to 128 sixteen-bit words, depending on the cache controller mode. The addresses are 20 bits long, with 8 bits for the tag, 8 bits for the cache block index, and 4 bits for the offset. The addresses are in hex, so they have the form 0x0C835, but since the first 4 bits of the tag are always zero, the microcontroller receives them as C835. Can someone explain how to deal with an offset that is higher than the block size? For example, a cache block size of 4 and an offset equal to 8?

Related

What is meant by an aligned/unaligned AXI transfer

Can anyone explain the difference between an aligned and an unaligned data transfer?
It is not limited to AXI busses; it is a general term that affects bus transfers and can leave undesirable results (performance hits). But that depends heavily on the overall architecture.
If addresses are in units of bytes, byte addressable, then a byte is always aligned. Assuming a byte is 8 bits, a 16 bit transfer is aligned if it is on a 16 bit boundary, meaning the lowest address bit is zero. A 16 bit value which covers the addresses 0x1000 and 0x1001 is aligned, and is considered to be at address 0x1000 (big or little endian). But a 16 bit value that covers the addresses 0x1001 and 0x1002 is not aligned; it is considered to be at address 0x1001. One covering 0x1002 and 0x1003 would be aligned. For a 32 bit value, the two lowest address bits need to be zero for it to be aligned: a 32 bit value at 0x1000 is aligned, but at 0x1001, 0x1002, or 0x1003 it would be unaligned.
Memories are generally not 8 bits wide, from an interface perspective as well as geometry; it depends on the kind of memory and where it sits. The cache in a processor that stages transfers to slow DRAM is likely to be 32 or 64 bits wide or wider: some power of 2, or a power of 2 plus a parity bit or ECC (32, 33, or 40 bits). All of this is hidden from you, other than the performance hits you may run into. Say you have a memory that is 32 bits wide, and call a 32 bit value a word; that memory is word addressable, so the address 0x123 is a word address, and its equivalent byte address is 0x123 * 4, or 0x48C. If you were to write a 32 bit value to byte address 0x48C, that becomes a single word write at that memory's address 0x123. But if you were to do a word write to byte address 0x48E, you would need to read word address 0x123 in that SRAM/memory, replace two of the bytes with bytes from the word you are writing, and write the modified word back; then you would have to read word address 0x124, modify two bytes, and write that modified word back.
Various busses work in various ways. Some will put the single word on a word sized bus and allow unaligned addresses. A 32 bit wide AXI bus would need to turn that 0x48E word write into two AXI transfers: one with two byte lanes enabled in the byte mask, and a second transfer with the other two byte lanes enabled. A 64 bit wide AXI bus (0x48E is 100 1000 1110 in binary) would also need to do two transfers, one with 16 bits of the data and a second with the other 16 bits, because of where that 32 bit value lands. But a word transfer at address 0x1001 would/should be a single transfer on a 64 bit AXI bus, with the middle four byte lanes enabled.
Other bus schemes work in other ways. Some will let a 32 bit item sit unaligned on the 32 or 64 bit bus, but then the memory controller on the other end has to do the multiple transactions to cache, or create multiple transactions on the next bus.
Although it is technically possible to byte address DRAM, as far as the standard parts and busses work, another thing a cache buys you is that the smaller and unaligned transactions can hit the faster SRAM, while the cache line reads and evictions can be optimized for the next bus or the external memory. So DRAM, for most of the systems we use, can always be accessed aligned, in multiples of the bus width (64 or 64 + ECC for desktops and servers, 32 or 16 bits for embedded systems, laptops, phones). The two busses and their solutions can each be optimized for their own side, with the cache acting as the translator.

memory to cache mapping

We've just begun the topic of caches, memory mapping, and address structure, and I am confused about how to find the cache size for this question.
If the cache size is given to you as 128 blocks, and memory has 16K blocks of 8 words each, where a word is 4 bytes, does that mean each cache block is also 8 words? Or does it mean we have to divide 128 by 8 to get the actual number of cache blocks? In that case, what would the size of the cache be? Would it equal 128 * 8 * 4 bytes?

Naturally aligned memory address

I need to extract a memory address from within an existing 64-bit value, and this address points to a 4K array, the starting value is:
0x000000030c486000
The address I need is stored within bits 51:12, so I extract those bits using:
address = start >> 12 & 0x0000007FFFFFFFFF
This leaves me with the address of:
0x000000000030c486
However, the documentation I'm reading states that the array stored at the address is 4KB in size, and naturally aligned.
I'm a little bit confused over what naturally aligned actually means. I know with page aligned stuff the address normally ends with '000' (although I could be wrong on that).
I'm assuming that as the address taken from the starting value is only 40 bits long, I need to perform an additional bitshifting operation to arrange the bits so that they can be correctly interpreted any further.
If anyone could offer some advice on doing this, I'd appreciate it.
Thanks
Normally, "naturally aligned" means that any item is aligned to at least a multiple of its own size. For example, a 4-byte object is aligned to an address that's a multiple of 4, an 8-byte object is aligned to an address that's a multiple of 8, etc.
For an array, you don't normally look at the size of the whole array, but at the size of an element of the array.
Likewise, for a struct or union, you normally look at the size of the largest element.
Natural alignment requires that every N byte access be aligned on a memory address boundary of N. We can express this in terms of the modulus operator: addr % N must be zero. For example:
Accessing 4 bytes of memory from address 0x10004 is aligned (0x10004 % 4 = 0).
Accessing 4 bytes of memory from address 0x10005 is unaligned (0x10005 % 4 = 1).
From a hardware perspective, memory is typically divided into chunks of some size, such that any or all of the data within a chunk can be read or written in a single operation, but any single operation can only affect data within a single chunk.
A typical 80386-era system would have memory grouped into four-byte chunks. Accessing a two-byte or four-byte value which fit entirely within a single chunk would require one operation. If the value was stored partially in one chunk and partially in another, two operations would be required.
Over the years, chunk sizes have gotten larger than data sizes, to the point that most randomly-placed 32-bit values would fit entirely within a chunk, but a second issue may arise with some processors: if a chunk is e.g. 512 bits (64 bytes) and a 32-bit word is known to be aligned at a multiple of four bytes (32 bits), fetching each bit of the word can come from any of 16 places. If the word weren't known to be aligned, each bit could come from any of 61 places for the cases where the word fits entirely within the chunk. The circuitry to quickly select from among 61 choices is more complex than circuitry to select among 16, and most code will use aligned data, so even in cases where an unaligned word would fit within a single accessible chunk, hardware might still need a little extra time to extract it.
A “naturally aligned” address is one that is a multiple of some value that is preferred for the data type on the processor. For most elementary data types on most common processors, the preferred alignment is the same as the size of the data: four-byte integers should be aligned on multiples of four bytes, eight-byte floating-point should be aligned on multiples of eight bytes, and so on. Some platforms require alignment; some merely prefer it. Some types have alignment requirements different from their sizes. For example, a 12-byte long float may require four-byte alignment. Specific values depend on your target platform. “Naturally aligned” is not a formal term, so some people might define it only as preferred alignment that is a multiple of the data size, while others might allow it to be used for other alignments that are preferred on the processor.
Taking bits out of a 64-bit value suggests the address has been transformed in some way. For example, key bits from the address have been stored in a page table entry. Reconstructing the original address might or might not be as simple as extracting the bits and shifting them all the way to the “right” (low end). However, it is also common for bits such as this to be shifted to a different position (with zeroes left in the low bits). You should check the documentation carefully.
Note that a 4 KiB array, 4096 bytes, corresponds to 2^12 bytes. The coincidence of that 12 with the 51:12 field in the 64-bit value suggests that the address might be obtained simply by extracting those 40 bits without shifting them at all.

Why double in C is 8 bytes aligned?

I was reading an article about data type alignment in memory (here) and I am unable to understand one point, i.e.:
Note that a double variable will be allocated on an 8 byte boundary on a 32 bit machine and requires two memory read cycles. On a 64 bit machine, based on the number of banks, a double variable will be allocated on an 8 byte boundary and requires only one memory read cycle.
My doubt is: why do double variables need to be allocated on an 8 byte boundary and not on a 4 byte one? If a double is allocated on a 4 byte boundary, we still need only 2 memory read cycles (on a 32 bit machine). Correct me if I am wrong.
Also, if someone has a good tutorial on member/memory alignment, kindly share.
The reason to align a data value of size 2^N on a boundary of 2^N is to avoid the possibility that the value will be split across a cache line boundary.
The x86-32 processor can fetch a double from any word boundary (8 byte aligned or not) in at most two 32-bit memory reads. But if the value is split across a cache line boundary, then the time to fetch the second word may be quite long, because of the need to fetch a second cache line from memory. This produces poor processor performance unnecessarily. (As a practical matter, current processors don't fetch 32 bits from memory at a time; they tend to fetch much bigger values on much wider busses to enable really high data bandwidths. The actual time to fetch both words, if they are in the same cache line and already cached, may be just 1 clock.)
A free consequence of this alignment scheme is that such values also never cross page boundaries. This avoids the possibility of a page fault in the middle of a data fetch.
So, you should align doubles on 8 byte boundaries for performance reasons. And the compilers know this and just do it for you.
Aligning a value on a lower boundary than its size makes it prone to be split across two cache lines. Splitting the value across two cache lines means extra work when evicting them to the backing store (two cache lines will be evicted instead of one), which uselessly loads the memory buses.
8 byte alignment for double on a 32 bit architecture doesn't reduce memory reads, but it still improves system performance in terms of reduced cache accesses. Please read the following:
https://stackoverflow.com/a/21220331/5038027
Refer to this wiki article about the double-precision floating-point format.
The number of memory cycles depends on your hardware architecture, which determines how many RAM banks you have. If you have a 32-bit architecture and 4 RAM banks, you need only 2 memory cycles to read a double (each RAM bank contributing 1 byte per cycle).

What is meant by "memory is 8 bytes aligned"?

While going through one project, I have seen that the memory data is "8 bytes aligned". Can anyone please explain what this means?
An object that is "8 bytes aligned" is stored at a memory address that is a multiple of 8.
Many CPUs will only load some data types from aligned locations; on other CPUs such access is just faster. There's also several other possible reasons for using memory alignment - without seeing the code it's hard to say why.
Aligned access is faster because the external bus to memory is not a single byte wide - it is typically 4 or 8 bytes wide (or even wider). This means that the CPU doesn't fetch a single byte at a time - it fetches 4 or 8 bytes starting at the requested address. As a consequence of this, the 2 or 3 least significant bits of the memory address are not actually sent by the CPU - the external memory can only be read or written at addresses that are a multiple of the bus width. If you requested a byte at address "9", the CPU would actually ask the memory for the block of bytes beginning at address 8, and load the second one into your register (discarding the others).
This implies that a misaligned access can require two reads from memory: If you ask for 8 bytes beginning at address 9, the CPU must fetch the 8 bytes beginning at address 8 as well as the 8 bytes beginning at address 16, then mask out the bytes you wanted. On the other hand, if you ask for the 8 bytes beginning at address 8, then only a single fetch is needed. Some CPUs will not even perform such a misaligned load - they will simply raise an exception (or even silently load the wrong data!).
Memory alignment is important for performance in different ways, and it has a hardware related reason. Since the 80s there has been a difference in access time between the CPU and the memory. The speed of the processor grows faster than the speed of the memory, and this difference gets bigger and bigger over time (to give an example: on the Apple II the CPU ran at 1.023 MHz and the memory at twice that frequency, 1 cycle for the CPU, 1 cycle for the video; a modern PC works at about 3 GHz on the CPU, with a memory at barely 400 MHz). One solution to the problem of ever relatively slower memory is to access it over ever wider busses: instead of accessing 1 byte at a time, the CPU reads a 64 bit wide word from the memory. This means that even if you read 1 byte from memory, the bus delivers a whole 64 bit (8 byte) word. Memory holds these 8 byte units at addresses 0, 8, 16, 24, 32, 40, etc., multiples of 8. If you access, for example, an 8 byte word at address 4, the hardware has to read the word at address 0, keep the high 4 bytes of that word, then read the word at address 8, keep the low part of that word, combine it with the first half, and hand the result to the register. As you can see, a quite complicated (and thus slow) operation. This is the first reason one likes aligned memory access.
"X bytes aligned" means that the base address of your data must be a multiple of X. It can be required for using special hardware such as a DMA engine, for faster access by the CPU, etc.
It is the case of the Cell Processor where data must be 16 bytes aligned in order to be copied to/from the co-processor.
If memory data is 8 bytes aligned, it means its address is a multiple of 8: (uintptr_t)&the_data % 8 == 0. In C, if a structure is meant to be 8 bytes aligned, the compiler will also pad its size to a multiple of 8 (so that elements of an array of that structure stay aligned); where that padding is not automatic, it must be added manually. Some compilers provide directives to control structure alignment: for VC it is __declspec(align(8)) (while #pragma pack(8) controls member packing), and for gcc it is __attribute__((aligned(8))).