how c manage data with different size from CPU word size? - c

while i was taking a course in hardware/software interface, our teacher said the cpu get the data using the word machine, for example, the CPU can get the value at address 0x00 but cannot get the value at address 0x01 but the value at address 0x00 + (word-size:normally it 4 byte), so i want to understand how we can use short int in c that contain only 2 byte?

Different CPU's have different memory access restrictions. One common form is to allow access at any byte offset, but suffer performance penalties if this occurs because two word-aligned transfers might be needed in order to retrieve all bytes necessary for the operation. The second form simply disallows non-aligned memory access.
In the first case, the compile has nothing to worry about, memory can be "packed" or word-aligned and both will work, but one will be faster and one will use less memory.
In the second case, the compiler must generate code to get an un-aligned data into the correct register location(s) to allow arithmetic and logical operations to occur. These operations shift and/or mask bits in byte-sized units. For example a 4-byte load may be needed to load a 2-byte variable. In this instance 2 bytes of the data will likely need to be "zeroed" and the 2 "good" bytes may need to be shifted into the low-order bits of the register. All of this is the responsibility of the compiler, but it adds overhead to access the data. On the other hand, it means the data occupies less memory when not stored in a register. This can be important if you have a large amount of data, such as an array of a few million integers; storing it in half the space may actually make the program run faster (even given the above overhead), since twice as many array values fit in the cache.

A 32 bit processor uses 4 bytes for integer, So first let me explain about memory banking
memory banking= If the processor had used the single byte addressing scheme,it would take 4 memory cycles to perform a read/write operation. So from processors point of view memory is banked as shown.
So if a integer is stored as in the address range 0x0000-0x0003 ie;4 byes ,it just requires a single memory read/write cycle so the processor is relieved from the extra memory cycles and hence improved performance.
Also it is also possible that a variable allocation can start from a number which is not a multiple of four hence it might require 2 or more memory cycles(float,double etc..) in accordance with how they are stored. The CPU fetches the data in minimum number of read cycles.
In short data type only two bytes of the memory are allocated and if it starts from 0x0001 then it ends in 0x0002 still 1 read cycle is required for it. But the advantage is that in integer it allocates 4 bytes where as the short allocates two byes so the two bytes is saved from the integer data type. So storage is improved
Reference:
Geeksforgeeks
IBM

Related

Why must an int have a memory address that is divisible by four on most current architectures? [duplicate]

Admittedly I don't get it. Say you have a memory with a memory word of length of 1 byte. Why can't you access a 4 byte long variable in a single memory access on an unaligned address(i.e. not divisible by 4), as it's the case with aligned addresses?
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
struct mystruct {
char c; // one byte
int i; // four bytes
short s; // two bytes
}
On a 32-bit processor it would most likely be aligned like shown here:
The processor can read each of these members in one transaction.
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this:
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data, in, out, and between more and faster execution units–in a highly reliable way.
Bonus: Caches
Another alignment-for-performance that I alluded to previously is alignment on cache lines which are (for example, on some CPUs) 64B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects; from this question on cache-line sizes
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice slower.
It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.
There is much more information in this link that the OP discovered.
you can with some processors (the nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line, because the bus is 64 bits wide, you had to fetch 64 bit at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2 byte variable, but that was split across 2 chunks, then that required double the required memory accesses.
So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.
Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of memory
128 bits - size of bus
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.
#joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects which were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments look like.
In addition here's a link to a Github gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which #joshperry referenced. The tests were run on a Macbook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16GB of RAM.
If you have a 32bit data bus, the address bus address lines connected to the memory will start from A2, so only 32bit aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary - i.e. A0 for 16/32 bit data or A1 for 32 bit data are not zero, two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler generated unaligned access code requires not just additional bus cycles, but additional instructions, making it even less efficient.
If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
On PowerPC you can load an integer from an odd address with no problems.
Sparc and I86 and (I think) Itatnium raise hardware exceptions when you try this.
One 32 bit load vs four 8 bit loads isnt going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.

Why "any primitive object of K bytes must have an address that is a multiple of K"?

Computer Systems: a Programmer's Perspective says
The x86-64 hardware will work correctly regardless of the alignment of
data. However, Intel recommends that data be aligned to improve memory
system performance. Their alignment rule is based on the principle
that any primitive object of K bytes must have an address that is a
multiple of K. We can see that this rule leads to the following
alignments:
K Types
1 char
2 short
4 int, float
8 long, double, char *
Why is it that "any primitive object of K bytes must have an address that is a multiple of K"?
How is "aligned" defined or what does it mean?
On a x86-64 machine,
if an object has K bytes (such as K=2 (e.g. short) or K=4 (e.g. int, or float)), "any primitive object of K bytes must have an address that is a multiple of K" means that such an object must have an address that is a multiple of K. But isn't the object aligned, as long as its storage space falls completely between two addresses which are two consecutive multiples of 8, which is a less strict requirement than that the object must have an address that is a multiple of K?
If the K of an object is smaller than 8 but not equal to 1, 2 or 4, does "any primitive object of K bytes must have an address that is a multiple of K" still apply? For example if K=3,5,6, or 7?
On a X86 machine, which has 32-bit addresses,
what is the alignment rule, and Does "any primitive object of K bytes must have an address that is a multiple of K" still apply?
Thanks.
Since this was tagged in C as well; do note that not only does the architecture make these decisions, but so do compilers. The C compiler often has its own alignment rules that mostly follow either the required or the preferred alignment of the architecture - especially when optimizing for speed. And the compiler's requirements are what you you need to worry about the most time, not the architecture requirement.
Even if the processor supports unaligned accesses, it might have a preferred alignment for multibyte objects that the C compiler can exploit. For example a compiler is allowed to know that a any int will reside at, and therefore any int * pointer will always point to - an address divisible by 4.
Now there are people who say that since x86-64 supports unaligned acccess, they can make an int * pointer that points to an address not divisible by 4 and things will work fine.
They're wrong.
There are some instructions in the x86-64 instruction set that require alignment. I.e. the "will work correctly regardless of alignment" means that these instructions too work "correctly, according to the specification, when given an unaligned access" - they raise an exception that would kill your process. The reason for having these is that they can be so much faster and require less silicon to implement than the versions that can deal with unaligned data.
And the compiler knows exactly when it is allowed to use these instructions! Whenever it sees an int * being dereferenced it knows that it can use an instruction that requires the operand be aligned at 4 bytes, should it be more effective.
See this question for a case where OP run into problem with C code that "should have been fine on x86-64 anyway": C undefined behavior. Strict aliasing rule, or incorrect alignment?
As for x86-32, the alignment requirement for doubles is generally 4 in C compilers because doubles need to be passed on stack and stack grows in 4 not 8 byte increments.
And finally:
If the K of an object is smaller than 8 but not equal to 1, 2 or 4, does "any primitive object of K bytes must have an address that is a multiple of K" still apply? For example if K=3,5,6, or 7?
There are no primitive objects with K<-{3,5,6,7} in x86.
The C standard's stance is that an alignment can only be a power of 2, and there are no gaps in arrays. Therefore an object with such a size would need to be padded upwards to its alignment requirement, or its alignment requirement must be 1.
The rules are different on each processor model. I will discuss one hypothetical example. We may have a processor with an eight-byte interface to the bus. Given some address X, the processor can load eight bytes from that address by requesting the memory to deliver eight bytes from its unit of storage numbered X/8. That is, the memory does not have any way to address individual bytes. The processor can only request data at a certain address that is a multiple of eight, and the memory will send the entire eight bytes at that address. (Keep in mind this is a hypothetical example to illustrate basic principles. Also, I am ignoring cache. Cache helps mask some of the effects of alignment issues, because the misalignments can be largely managed in level-one cache inside the processor. But handling this still requires extra hardware, as discussed below.)
Suppose we want the four-byte object that is in bytes 7, 8, 9, and 10. To get this, the processor has to request unit 0 from memory, which supplies bytes 0 through 7, and it has to request unit 1, which supplies bytes 8 through 15. So, already, there is a performance problem: We had to use two bus transfers to get this word that is only half the size of one transfer. That is inefficient, and the bus can only do half as many of these double transfers as it can if we loaded only aligned data requiring single transfers.
Continuing, the processor has all the bytes it needs, 0 through 15, so it extracts bytes 7 through 10, which make up the object we want. To do this, though, it has to shift the bytes to put them into a register. Ideally, if nobody did any “unaligned” loads, four-byte objects would come in from the bus only at offsets 0 and 4 in the eight-byte transfers, and the processor only needs to have wires gong from those offsets to the register destinations.
However, our processor supports unaligned loads, so it has additional switches and wires so the data can be shunted down a different path, where it will be shifted by three bytes. Keep in mind, the data from both transfers has to be shifted by three bytes and then spliced together. So a lot of extra wires and switches are needed. Two eight-byte transfers is 128 bits, so there are hundreds of extra connections involved in this.
Well, fine, the processor has these wires and switches, why not use them? To make this processor fast, it supports multiple loads and stores in progress simultaneously. As soon as the bus transfers one piece of data, we want to be getting another from the bus, while the data from the first is still on its way to a register. So there are actually multiple parts of the processor moving data around for several loads. Since we expect unaligned loads to be rare, maybe only one of the parts for handling loads has the extra components to handle unaligned loads. The others all handle aligned loads. So, if you have just one unaligned load occasionally, the processor sends it to that part, and the performance effect is unnoticeable. However, if you do many unaligned loads in a row, they all have to go through the one part, so they end up waiting in a queue instead of running in parallel, and performance decreases.
That is just for loads. When you store that four-byte object, there is no way to write just bytes 7 through 10. Since the bus and the memory only work in eight-byte units, we need to write units 0 and 1, which also contains bytes 0 through 6 and bytes 11 to 15. To implement the store, the processor must:
Load memory unit 0, providing bytes 0 through 7.
Load memory unit 1, providing bytes 8 through 15.
Move the first byte of the four-byte object into byte 7.
Move the last three bytes of the object into bytes 8 through 10.
Store the changed memory unit 0.
Store the changed memory unit 1.
Again, that is twice as much work as it would be with an aligned object (load one memory unit, move the bytes in, store the unit). And, besides the time of the operations, you are occupying more resources inside the processor—it has to use two internal registers to hold the data from memory temporarily while it is merging the changes.
Actually, it is more than twice the work and resources, because it also requires extra wires and switches to shift the bytes by non-standard amounts.
The processor bus, which is the media used to access memory is normally the processor size in bits. This means a 32bit processor normally access memory in 32bit chunks, meaning that only one memory read access is necessary to read the data from memory.
Addresses by the contrary, are byte oriented, so a double (8 bytes) normally occupies eight different contiguous memory. So to make an access to a single eight bytes data (with only one bus request) The data must begin at a single eight byte word and finish before we get to the next. For old processors this was imperative, in case you requested a memory access that is not data aligned, an exception was fired. Actual processors don't have this restriction, but beware you that in case you have for example a double in a non multiple of eight address, the processor will need to make two bus accesses (with the overhead that this implies) to get the data from memory.
For this reason (you can double or even more, the time required to execute some piece of code if all the data is unaligned, against the time required to if the data is properly aligned) the processor vendor warns you about the alignment of data.
Modern processors have several levels of caches, that are read from main memory in chunks of one cache line (64 or even more bytes) so this is not an issue. Anyway, it is good idea to have data aligned anyway, for the case you need to run your code in a non-such-advanced processor.

Is 8-byte alignment for "double" type necessary?

I understand word-alignment, which makes the cpu only need to read once when reading an integer into a register.
But is 8-byte alignment (let's assume 32bit system) for "double" necessary? What is the benefit? What will happen if the space for storing a "double" is just 4-byte alignment?
There are multiple hardware components that may be adversely affected by unaligned loads or stores.
The interface to memory might be eight bytes wide and only able to access memory at multiples of eight bytes. Loading an unaligned eight-byte double then requires two reads on the bus. Stores are worse, because an aligned eight-byte store can simply write eight bytes to memory, but an unaligned eight-byte store must read two eight-byte pieces, merge the new data with the old data, and write two eight-byte pieces.
Cache lines are typically 32 or 64 bytes. If eight-byte objects are aligned to multiples of eight bytes, then each object is in just one cache line. If they are unaligned, then some of the objects are partly in one cache line and partly in another. Loading or storing these objects then requires using two cache lines instead of one. This effect occurs at all levels of cache (three levels is not uncommon in modern processors).
Memory system pages are typically 512 bytes or more. Again, each aligned object is in just one page, but some unaligned objects are in multiple pages. Each page that is accessed requires hardware resources: The virtual address must be translated to a physical address, this may require accessing translation tables, and address collisions must be detected. (Processors may have multiple load and store operations in operation simultaneously. Even though your program may appear to be single-threaded, the processor reads instructions in advance and tries to execute those that it can. So a processor may start a load instruction before preceding instructions have completed. However, to be sure this does not cause an error, the processor checks each load instruction to be sure it is not loading from an address that a prior store instruction is changing. If an access crosses a page boundary, the two parts of the loaded data have to be checked separately.)
The response of the system to unaligned operations varies from system to system. Some systems are designed to support only aligned accesses. In these cases, unaligned accesses either cause exceptions that lead to program termination or exceptions that cause execution of special handlers that emulate unaligned operations in software (by performing aligned operations and merging the data as necessary). Software handlers such as these are much slower than hardware operations.
Some systems support unaligned accesses, but this usually consumes more hardware resources than aligned accesses. In the best case, the hardware performs two operations instead of one. But some hardware is designed to start operations as if they were aligned and then, upon discovering the operation is not aligned, to abort it and start over using different paths in the hardware to handle the unaligned operation. In such systems, unaligned accesses have a significant performance penalty, although it is not as great as in systems where software handles unaligned accesses.
In some systems, the hardware may have multiple load-store execution units that can perform the two operations required of unaligned accesses just as quickly as one unit can perform the operation of aligned accesses. So there is no direct performance degradation of unaligned accesses. However, because multiple execution units are kept busy by unaligned accesses, they are unavailable to perform other operations. Thus, programs that perform many load-store operations, normally in parallel, will execute more slowly with unaligned accesses than with aligned accesses.
On many architectures, unaligned access of any load/store unit (short, int, long) is simply an exception. Compilers are responsible for ensuring it doesn't happen on potentially mis-aligned data, by emitting smaller access instructions and re-assembling in registers if they can't prove a given pointer is OK.
Performance-wise, 8-byte alignment of doubles on 32-bit systems can be valuable for a few reasons. The most apparent is that 4-byte alignment of an 8-byte double means that one element could cross the boundary of two cache lines. Memory access occurs in units of whole cache lines, and so misalignment doubles the cost of access.
I seem to remember that the recommendation for 486 was to align double on 32 bits boundaries, so requiring 64 bits alignment is not mandatory.
You seem to think that there is a relationship between the data bus width and the processor bitness. While it is often the case, you can find variation in both direction. For instance the Pentium was a 32-bit processor, but its data bus size was 64 bits.
Caches offer something else which may explain the usefulness of having 64-bit alignment for 64-bit types. Here the external bus is not a factor, it is the cache line size which is important. Data crossing the line cache is costlier to access than data not crossing it (even if it is unaligned in both cases). Aligning types on their size makes it sure that they won't cross cache lines as long as cache line size is a multiple of the type size.
I just found the answer:
"6. When memory reading is efficient in reading 4 bytes at a time on 32 bit machine, why should a double type be aligned on 8 byte boundary?
It is important to note that most of the processors will have math co-processor, called Floating Point Unit (FPU). Any floating point operation in the code will be translated into FPU instructions. The main processor is nothing to do with floating point execution. All this will be done behind the scenes.
As per standard, double type will occupy 8 bytes. And, every floating point operation performed in FPU will be of 64 bit length. Even float types will be promoted to 64 bit prior to execution.
The 64 bit length of FPU registers forces double type to be allocated on 8 byte boundary. I am assuming (I don’t have concrete information) in case of FPU operations, data fetch might be different, I mean the data bus, since it goes to FPU. Hence, the address decoding will be different for double types (which is expected to be on 8 byte boundary). It means, the address decoding circuits of floating point unit will not have last 3 pins."
Edited:
The advantage of byte alignment is to reduce the number of memory cycles to retrieve the data. For example, an 8 byte which might take a single cycle if it is aligned might now take 2 cycles since a part of it is obtained the first time and the second part in the next memory cycle.
I came across this:
"Aligned access is faster because the external bus to memory is not a single byte wide - it is typically 4 or 8 bytes wide (or even wider). So the CPU doesn't fetch a single byte at a time - it fetches 4 or 8 bytes starting at the requested address. Therefore, the 2 or 3 least significant bits of the memory address are not actually sent by the CPU - the external memory can only be read or written at addresses that are a multiple of the bus width. If you requested a byte at address "9", the CPU would actually ask the memory for the block of bytes beginning at address 8, and load the second one into your register (discarding the others).
This implies that a misaligned access can require two reads from memory: If you ask for 8 bytes beginning at address 9, the CPU must fetch the 8 bytes beginning at address 8 as well as the 8 bytes beginning at address 16, then mask out the bytes you wanted. On the other hand, if you ask for the 8 bytes beginning at address 8, then only a single fetch is needed. Some CPUs will not even perform such a misaligned load - they will simply raise an exception (or even silently load the wrong data!)."
You might see this link for more details.
http://www.ibm.com/developerworks/library/pa-dalign/

Why double in C is 8 bytes aligned?

I was reading a article about data types alignment in memory(here) and I am unable to understand one point i.e.
Note that a double variable will be allocated on 8 byte boundary on 32
bit machine and requires two memory read cycles. On a 64 bit machine,
based on number of banks, double variable will be allocated on 8 byte
boundary and requires only one memory read cycle.
My doubt is: Why double variables need to be allocated on 8 byte boundary and not on 4 byte? If it is allocated on 4 byte boundary still we need only 2 memory read cycles(on a 32 bit machine). Correct me if I am wrong.
Also if some one has a good tutorial on member/memory alignment, kindly share.
The reason to align a data value of size 2^N on a boundary of 2^N is to avoid the possibility that the value will be split across a cache line boundary.
The x86-32 processor can fetch a double from any word boundary (8 byte aligned or not) in at most two, 32-bit memory reads. But if the value is split across a cache line boundary, then the time to fetch the 2nd word may be quite long because of the need to fetch a 2nd cache line from memory. This produces poor processor performance unnecessarily. (As a practical matter, the current processors don't fetch 32-bits from the memory at a time; they tend to fetch much bigger values on much wider busses to enable really high data bandwidths; the actual time to fetch both words if they are in the same cache line, and already cached, may be just 1 clock).
A free consequence of this alignment scheme is that such values also do not cross page boundaries. This avoids the possibility of a page fault in the middle of an data fetch.
So, you should align doubles on 8 byte boundaries for performance reasons. And the compilers know this and just do it for you.
Aligning a value on a lower boundary than its size makes it prone to be split across two cachelines. Splitting the value in two cachlines means extra work when evicting the cachelines to the backing store (two cachelines will be evicted; instead of one), which is a useless load of memory buses.
8 byte alignment for double on 32 bit architecture doesn't reduce memory reads but it still improve performance of the system in terms of reduced cache access. Please read the following :
https://stackoverflow.com/a/21220331/5038027
Refer this wiki article about double precision floating point format
The number of memory cycles depends on your hardware architecture which determines how many RAM banks you have. If you have a 32-bit architecture and 4 RAM banks, you need only 2 memory cycle to read.(Each RAM bank contributing 1 byte)

CPU and memory communication

I'm programmer-beginner, but I want to understand the things a bit more deeply. I did some research and read quite a lot of text, but I'm still yet to understand some things..
When coding a basic thing (in C):
int myNumber;
myNumber = 3;
printf("Here's my number: %d", myNumber);
I found out that (mainly on 32-bit CPU) integer takes place of 32 bits = 4 bytes. So at first line of my code CPU goes into the memory. The memory is byte-addressable, so CPU chooses 4 continuous bytes for my variable and stores the address to first (or last) byte.
On the second line of my code CPU uses his stored address of the MyNumber variable, goes to that address in the memory and finds there 32 bits of reserved space. His task now is to store there the number "3", so he fills those four bytes with the sequence 00000000-00000000-00000000-00000011.
On the third line it does the same - CPU goes to that address in memory and loads the number stored in that address.
(First question - Do I understand it right?)
What I don't understand is this:
The size of that address (pointer to that variable) is 4bytes in 32-bit CPU. (Thats why 32-bit CPU can use max 4GB of memory - because there are only 2^32 different addresses of binary length 32)
Now, where the CPU stores these addresses? Does he have some sort of its own memory or cache to store that? And why it stores the 32 bit long address to 32 bit long integer? Wouldn't it be better to simply store in its cache that actual number than the pointer to that when the sizes are the same?
And last one - if it stores somewhere in its own cache the addresses to all those integers and the lenghts are the same (4 bytes), it will need exactly the same space for storing the addresses as for the actual variables. But variables can take up to 4GBs of space so CPU must have 4GB of its own space to store the addresses to those variables. And that sounds strange..
Thank you for help!
I'm trying to understand that but it's so tough.. :-[
(First question - Do I understand it right?)
The first thing to recognise is that the value might not be stored in main memory at all. The compiler might decide to store it in a register instead, as this is more optimal.1
The memory is byte-addressable, so CPU chooses 4 continuous bytes for my variable and stores the address to first (or last) byte.
Assuming that the compiler does decide to store it in main memory, then yes, on a 32-bit machine, an int is typically 4 bytes, so 4 bytes will be allocated for storage.
The size of that address (pointer to that variable) is 4bytes in 32-bit CPU. (Thats why 32-bit CPU can use max 4GB of memory - because there are only 2^32 different addresses of binary length 32)
Note that the width of an int and the width of a pointer don't have to be the same, so there's not necessarily a connection with the size of the address space.
Now, where the CPU stores these addresses?
In the case of local variables, the address is effectively hardcoded into the executable itself, typically as an offset from the stack pointer.
In the case of dynamically-allocated objects (i.e. stuff that's been malloc-ed), the programmer typically maintains a corresponding pointer variable (otherwise there would be a memory leak!). That pointer might also be dynamically-allocated (in the case of a complex data structure), but if you go back far enough, you'll eventually reach something that's a local variable. In which case, the above rule applies.
But variables can take up to 4GBs of space so CPU must have 4GB of its own space to store the addresses to those variables.
If your program consists of independently malloc-ing millions of ints, then yes, you'd end up with just as much storage required for the pointers. But most programs don't look like that. You typically allocate much bigger objects (like an array, or a big struct).
cache
The specifics of where stuff is stored is architecture-specific. On a modern x86, there's typically 2 or 3 layers of cache sitting between the CPU and main memory. But the cache is not independently addressable; the CPU cannot decide to store the int in cache instead of main memory. Rather, the cache is effectively a redundant copy of a subset of main memory.
Another thing to consider is that the compiler will typically deal with virtual addresses when allocating storage for objects. On a modern x86, these are mapped to physical addresses (i.e. addresses that correspond to physical storage bytes in main memory) by dedicated hardware, along with OS support.
1. Alternatively, the compiler may be able to optimise it away entirely.
On the second line of my code CPU uses his stored address of the MyNumber variable, goes to that address in the memory and finds there 32 bits of reserved space.
Nearly correct. Memory is basically unstructured. The CPU can't see that there are 32 bits of "reserved space". But the CPU was instructed to read 32 bits of data, so it reads 32 bits of data starting from the specified address. Then it just has to hope/assume that those 32 bits actually contain something meaningful.
Now, where the CPU stores these addresses? Does he have some sort of its own memory or cache to store that? And why it stores the 32 bit long address to 32 bit long integer? Wouldn't it be better to simply store in its cache that actual number than the pointer to that when the sizes are the same?
The CPU has a small number of registers, which it can use to store data (common CPUs have 8, 16 or 32 registers, so they can only hold the particular variables that you're working with here and now). So to answer the last part first, yes, the compiler certainly might (and probably will) generate code to just store your int into a register, instead of storing it in a memory, and telling the CPU to load it from a specified address.
As for the other part of the question: ultimately, every part of the program is stored in memory. Part of it is a stream of instructions, and part of it is in chunks of data scattered around memory.
There are a few tricks that help with locating the data the CPU needs: part of the program's memory contains the stack, which typically stores local variables while they're in scope. The CPU always maintains a pointer to the top of the stack in one of its registers, so it can easily locate data on the stack, simply by modifying the stack pointer with a fixed offset. Instructions can directly contain such offsets, so in order to read your int, the compiler could for example generate code which writes the int to the top of the stack when you enter the function, and then when you need to refer to that function, have code which reads the data found at the address the stack pointer points to, plus the small offset needed to locate your variable.
And also keep in mind that the adresses your program sees may not be (or rather rarely are) physical adresses starting from the 'beginning of the memory' or 0. Mostly they are offsets into a specifc memory block where the memory manager knows the real address and the accesses via base+offest as the real data storage.
And we do need the memory since caches are limited ;-)
Mario
Internal to the CPU there is one register that contains the address of the next instruction to be executed. The instructions themselves keep information where the variable is. If the variable is optimized, the instruction may point to a register, but in general, the instruction will have an address of the variable being accessed. Your code, after compiled and loaded in memory, have all that embedded! I recommend looking into assembly language to get a better understanding of all that. Good luck!

Resources