what is the size of every memory cell in the stack and is it possible to split one cell

what is the size of every memory cell in the stack and is it possible to split one cell - c

In a 64 bit system every memory cell is 64 bit, so how does it save an int variable that contains less space? Wouldn't it spend one 64 bit address any way? If so why bother to use difference types of variables if they going to catch one cell any way.

Your use of terminology is all over the place.
A memory cell typically corresponds to a logic gate on the hardware level and is very likely to be 1 bit large assuming binary computers.
What I think you are asking about is the smallest addressable unit in a computer, also known as a byte, which is very likely 8 bits large.
This has nothing to do with the data register width of the CPU, which is what one usually refers to when talking about "64 bit computers". The data register width is the largest chunk of data that the CPU can process in a single instruction, but not necessarily the smallest. And this has no relation with the address bus width of the computer, though they are often the same nowadays.
When you declare a variable in C, the size allocated depends on the system. An int is for example very likely 32 bit large on all 32 bit and 64 bit computers. Notably, all mainstream 64 bit computers also support 32 bit or smaller instructions. So it doesn't necessarily make sense for the compiler to allocate more memory than 32 bit - you might get larger memory use for no speed gained.
I believe the term you are fishing for is alignment. It is only inefficient for the computer to read smaller chunks in case they are allocated on misaligned addresses. That is, an address which is not evenly divisible by the data register width (expressed in bytes). Such accesses are typically slower, or in some cases not supported at all. So a 64 bit compiler might therefore decide to allocate a small variable inside a 8 byte chunk, and leave the remaining bytes that aren't used as padding bytes. However, in case the compiler optimizes for size, it may chose to store data in a more memory-effective way, at the cost of access time.

Related

Why 2 raised to 32 power results in a number in bytes instead of bits?

I just restart the C programming study. Now, I'm studying the memory storage capacity and the difference between bit and byte. I came across to this definition.
There is a calculation to a 32 bits system. I'm very confused, because in this calculation 2^32 = 4294967296 bytes and it means about 4 Gigabyte. My question is: Why 2 raised to 32 power results in a number in bytes instead of bits ?
Thanks for helping me.

Because the memory is byte-addressable (that is, each byte has its own address).

There are two ways to look at this:
A 32-bit integer can hold one of 2^32 different values. Thus, a uint32_t can represent the values from 0 to 4294967295.
A 32-bit address can represent 2^32 different addresses. And as Scott said, on a byte-addressable system, that means 2^32 different bytes can be addressed. Thus, a process with 32-bit pointers can address up to 4 GiB of virtual memory. Or, a microprocessor with a 32-bit address bus can address up to 4 GiB of RAM.

That description is really superficial and misses a lot of important considerations, especially as to how memory is defined and accessed.
Fundamentally an N-bit value has 2N possible states, so a 16-bit value has 65,536 possible states. Additionally, memory is accessed as bytes, or 8-bit values. This was not always the case, older machines had different "word" sizes, anywhere from 4 to 36 bits per word, occasionally more, but over time the 8-bit word, or "byte", became the dominant form.
In every case a memory "address" contains one "word" or, on more modern machines, "byte". Memory is measured in these units, like "kilowords" or "gigabytes", for reasons of simplicity even though the individual memory chips themselves are specified in terms of bits. For example, a 1 gigabyte memory module often has 8 gigabit chips on it. These chips are read at the same time, the resulting data combined to produce a single byte of memory.
By that article's wobbly definition this means a 16-bit CPU can only address 64KB of memory, which is wrong. DOS systems from the 1980s used two pointers to represent memory, a segment and an offset, and could address 16MB using an effective 24-bit pointer. This isn't the only way in which the raw pointer size and total addressable memory can differ.
Some 32-bit systems also had an alternate 36-bit memory model that allowed addressing up to 64GB of memory, though an individual process was limited to a 4GB slice of the available memory.
In other words, for systems with a singular pointer to a memory address and where the smallest memory unit is a byte then the maximum addressable memory is 2N bytes.
Thankfully, since 64-bit systems are now commonplace and a computer with > 64GB of memory is not even exotic or unusual, addressing systems are a lot simpler now then when having to work around pointer-size limitations.

We say that memory is byte-addressable, you can think like byte is the smallest unit of memory so you are not reading by bits but bytes. The reason might be that the smallest data type is 1 byte, even boolean type in c/c++ is 1 byte.

how does the processor read memory?

I'm trying to re-implement malloc and I need to understand the purpose of the alignment. As I understand it, if the memory is aligned, the code will be executed faster because the processor won't have to take an extra step to recover the bits of memory that are cut. I think I understand that a 64-bit processor reads 64-bit by 64-bit memory. Now, let's imagine that I have a structure with in order (without padding): a char, a short, a char, and an int. Why will the short be misaligned? We have all the data in the block! Why does it have to be on an address which is a multiple of 2. Same question for the integers and other types?
I also have a second question: With the structure I mentioned before, how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?

The effects can even include correctness, not just performance: C Undefined Behaviour (UB) leading to possible segfaults or other misbehaviour for example if you have a short object that doesn't satisfy alignof(short). (Faulting is expected on ISAs where load/store instructions require alignment by default, like SPARC, and MIPS before MIPS64r6. And possible even on x86 after compiler optimization of loops, even though x86 asm allows unaligned loads/stores except for some SIMD with 16-byte or wider.)
Or tearing of atomic operations if an _Atomic int doesn't have alignof(_Atomic int).
(Typically alignof(T) = sizeof(T) up to some size, often register width or wider, in any given ABI).
malloc should return memory with alignof(max_align_t) because you don't have any type info about how the allocation will be used.
For allocations smaller than sizeof(max_align_t), you can return memory that's merely naturally aligned (e.g. a 4-byte allocation aligned by 4 bytes) if you want, because you know that storage can't be used for anything with a higher alignment requirement.
Over-aligned stuff like the dynamically-allocated equivalent of alignas (16) int32_t foo needs to use a special allocator like C11 aligned_alloc. If you're implementing your own allocator library, you probably want to support aligned_realloc and aligned_calloc, filling those gaps that ISO C leave for no apparent reason.
And make sure you don't implement the braindead ISO C++17 requirement for aligned_alloc to fail if the allocation size isn't a multiple of the alignment. Nobody wants an allocator that rejects an allocation of 101 floats starting on a 16-byte boundary, or much larger for better transparent hugepages. aligned_alloc function requirements and How to solve the 32-byte-alignment issue for AVX load/store operations?
I think I understand that a 64-bit processor reads 64-bit by 64-bit memory
Nope. Data bus width and burst size, and load/store execution unit max width or actually-used width, don't have to be the same as width of integer registers, or however the CPU defines its bitness. (And in modern high performance CPUs typically aren't. e.g. 32-bit P5 Pentium had a 64-bit bus; modern 32-bit ARM has load/store-pair instructions that do atomic 64-bit accesses.)
Processors read whole cache lines from DRAM / L3 / L2 cache into L1d cache; 64 bytes on modern x86; 32 bytes on some other systems.
And when reading individual objects or array elements, they read from L1d cache with the element width. e.g. a uint16_t array may only benefit from alignment to a 2-byte boundary for 2-byte loads/stores.
Or if a compiler vectorizes a loop with SIMD, a uint16_t array can be read 16 or 32 bytes at a time, i.e. SIMD vectors of 8 or 16 elements. (Or even 64 with AVX512). Aligning arrays to the expected vector width can be helpful; unaligned SIMD load/store run fast on modern x86 when they don't cross a cache-line boundary.
Cache-line splits and especially page-splits are where modern x86 slows down from misalignment; unaligned within a cache line generally not because they spend the transistors for fast unaligned load/store. Some other ISAs slow down, and some even fault, on any misalignment, even within a cache line. The solution is the same: give types natural alignment: alignof(T) = sizeof(T).
In your struct example, modern x86 CPUs will have no penalty even though the short is misaligned. alignof(int) = 4 in any normal ABI, so the whole struct has alignof(struct) = 4, so the char;short;char block starts at a 4-byte boundary. Thus the short is contained within a single 4-byte dword, not crossing any wider boundary. AMD and Intel both handle this with full efficiency. (And the x86 ISA guarantees that accesses to it are atomic, even uncached, on CPUs compatible with P5 Pentium or later: Why is integer assignment on a naturally aligned variable atomic on x86?)
Some non-x86 CPUs would have penalties for the misaligned short, or have to use other instructions. (Since you know the alignment relative to an aligned 32-bit chunk, for loads you'd probably do a 32-bit load and shift.)
So yes there's no problem accessing one single word containing the short, but the problem is for load-port hardware to extract and zero-extend (or sign-extend) that short into a full register. This is where x86 spends the transistors to make this fast. (#Eric's answer on a previous version of this question goes into more detail about the shifting required.)
Committing an unaligned store back into cache is also non-trivial. For example, L1d cache might have ECC (error-correction against bit flips) in 32-bit or 64-bit chunks (which I'll call "cache words"). Writing only part of a cache word is thus a problem for that reason, as well as for shifting it to an arbitrary byte boundary within the cache word you want to access. (Coalescing of adjacent narrow stores in the store buffer can produce a full-width commit that avoids an RMW cycle to update part of a word, in caches that handle narrow stores that way). Note that I'm saying "word" now because I'm talking about hardware that's more word-oriented instead of being designed around unaligned loads/stores the way modern x86 is. See Are there any modern CPUs where a cached byte store is actually slower than a word store? (storing a single byte is only slightly simpler than an unaligned short)
(If the short spans two cache words, it would of course needs to separate RMW cycles, one for each byte.)
And of course the short is misaligned for the simple reason that alignof(short) = 2 and it violates this ABI rule (assuming an ABI that does have that). So if you pass a pointer to it to some other function, you could get into trouble. Especially on CPUs that have fault-on-misaligned loads, instead of hardware handling that case when it turns out to be misaligned at runtime. Then you can get cases like Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? where GCC auto-vectorization expected to reach a 16-byte boundary by doing some multiple of 2-byte elements scalar, so violating the ABI leads to a segfault on x86 (which is normally tolerant of misalignment.)
For the full details on memory access, from DRAM RAS / CAS latency up to cache bandwidth and alignment, see What Every Programmer Should Know About Memory? It's pretty much still relevant / applicable
Also Purpose of memory alignment has a nice answer. There are plenty of other good answers in SO's memory-alignment tag.
For a more detailed look at (somewhat) modern Intel load/store execution units, see: https://electronics.stackexchange.com/questions/329789/how-can-cache-be-that-fast/329955#329955
how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?
It doesn't, other than the fact it's running instructions which treat the data that way.
In asm / machine-code, everything is just bytes. Every instruction specifies exactly what to do with which data. It's up to the compiler (or human programmer) to implement variables with types, and the logic of a C program, on top of a raw array of bytes (main memory).
What I mean by that is that in asm, you can run any load or store instruction you want to, and it's up to you to use the right ones on the right addresses. You could load 4 bytes that overlap two adjacent int variable into a floating-point register, then and run addss (single-precision FP add) on it, and the CPU won't complain. But you probably don't want to because making the CPU interpret those 4 bytes as an IEEE754 binary32 float is unlikely to be meaningful.

modern processors and memory are built to optimize memory access as much as possible. One the current way of accessing memory is to address it not byte by byte but by an address of a bigger block, e.g. by an 8 byte blocks. You do not need 3 lower bits of the address this way. To access a certain byte within the block the processs needs to get the block at the aligned address, then shift and mask the byte. So, it gets slower.
When fields in the struct are not aligned, there is a risk of slowing down the access to them. Therefore, it is better to align them.
But the alignment requirements are based on the underlying platform. For systems which support word access (32 bit), 4-byte alignment is ok, otherwise 8-byte can be used or some other. The compiler (and libc) knows the requirements.
So, in your example char, short, char, the short will start with an odd byte position if not padded. To access it, the system might need to read the 64 bit word for the struct, then shift it 1 byte right and then mask 2 bytes in order to provide you with this byte.

As I understand it, if the memory is aligned, the code will be executed faster because the processor won't have to take an extra step to recover the bits of memory that are cut.
It's not necessarily an execution thing, an x86 has variable length instructions starting with single 8 bit instructions on up to a handful to several bytes, its all about being unaligned. but they have taken measures to smooth that out for the most part.
If I have a 64 bit bus on the edge of my processor that doesn't mean edge of chip that means edge of the core. The other side of this is a memory controller that knows the bus protocol and is the first place the addresses start to be decoded and the transactions start to split up down other buses toward their destination.
It is very much architecture and bus design specific and you can have architectures with different buses over time or different versions you can get an arm with a 64 bus or a 32 bit bus for example. But let's say we have a not atypical situation where the bus is 64 bits wide and all transactions on that bus are aligned on a 64 bit boundary.
If I were to do a 64 bit write to 0x1000 that would be a single bus transaction, which these days is some sort of write address bus with some id x and a length of 0 (n-1) then the other side acks that I see you want to do a write with id x, I am ready to take your data. Then the processor uses the data bus with id x to send the data, one clock per 64 bits this is a single 64 bit so one clock on that bus. and maybe an ack comes back or maybe not.
But if I wanted to do a 64 bit write to 0x1004, what would happen is that turns into two transactions one complete 64 bit address/data transaction at address 0x1000 with only four byte lanes enabled lanes 4-7 (representing bytes at address 0x1004-0x1007). Then a complete transaction at 0x1008 with 4 byte lanes enabled, lanes 0-3. So the actual data movement across the bus goes from one clock to two, but there is also twice the overhead of the handshakes to get to those data cycles. On that bus it is very noticeable, how the overall system design is though you may feel it or not, or may have to do many of them to feel it or not. But the inefficiency is there, buried in the noise or not.
I think I understand that a 64-bit processor reads 64-bit by 64-bit memory.
Not a good assumption at all. 32 bit ARMs have 64 bit buses these days the ARMv6 and ARMv7s for example come with them or can.
Now, let's imagine that I have a structure with in order (without padding): a char, a short, a char, and an int. Why will the short be misaligned? We have all the data in the block! Why does it have to be on an address which is a multiple of 2. Same question for the integers and other types?
unsigned char a 0x1000
unsigned short b 0x1001
unsigned char c 0x1003
unsigned int d 0x1004
You would normally use the structure items in the code something.a something.b something.c something.d. When you access something.b that is a 16 bit transaction against the bus. In a 64 bit system you are correct that if aligned as I have addressed it, then the whole structure is being read when you do x = something.b but the processor is going to discard all but byte lanes 1 and 2 (discarding 0 and 3-7), then if you access something.c it will do another bus transaction at 0x1000 and discard all but lane 3.
When you do a write to something.b with a 64 bit bus only byte lanes 1 and 2 are enabled. Now where more pain comes in is if there is a cache it is likely also constructed of a 64 bit ram to mate up with this bus, doesn't have to, but let's assume it does. You want to write through the cache to something.b, a write transaction at 0x1000 with byte lanes 1 and 2 enabled 0, 3-7 disabled. The cache ultimately gets this transaction, it internally has to do a read-modify write because it is not a full 64 bit wide transaction (all lanes enabled) so you are taking hit with that read-modify write from a performance perspective as well (same was true for the unaligned 64 bit write above).
The short is unaligned because when packed its address lsbit is set, to be aligned a 16 bit item in an 8 bit is a byte world needs to be zero, for a 32 bit item to be aligned the lower two bits of its address are zero, 64 bit, three zeros and so on.
Depending on the system you may end up on a 32 or 16 bit bus (not for memory so much these days) so you can end up with the multiple transfers thing.
Your highly efficient processors like MIPS and ARM took the approach of aligned instructions, and forced aligned transactions even in the something.b case that specifically doesn't have a penalty on a 32 nor 64 bit bus. The approach is performance over memory consumption, so the instructions are to some extent wasteful in their consumption to be more efficient in their fetching and execution. The data bus is likewise much simpler. When high level concepts like a struct in C are constructed there is memory waste in padding to align each item in the struct to gain performance.
unsigned char a 0x1000
unsigned short b 0x1002
unsigned char c 0x1004
unsigned int d 0x1008
as an example
I also have a second question: With the structure I mentioned before, how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?
unsigned char c 0x1003
the compiler generates a single byte sized read at address 0x1003, this turns in to that specific instruction with that address and the processor generates the bus transaction to do that, the other side of the processor bus then does its job and so on down the line.
The compiler in general does not turn a packed version of that struct into a single 64 bit transaction that gives you all of the items, you burn a 64 bit bus transaction for each item.
it is possible that depending on the instruction set, prefetcher, caches and so on that instead of using a struct at a high level you create a single 64 bit integer and you do the work in the code, then you might or might not gain performance. This is not expected to perform better on most architectures running with caches and such, but when you get into embedded systems where you may have some number of wait states on the ram or some number of wait states on the flash or whatever code storage there is you can find times where instead of fewer instructions and more data transactions you want more instructions and fewer data transactions. code is linear a code section like this read, mask and shift, mask and shift, etc. the instruction storage may have a burst mode for linear transactions but data transactions take as many clocks as they take.
A middle ground is to just make everything a 32 bit variable or a 64 bit, then it is all aligned and performs relatively well at the cost of more memory used.
Because folks don't understand alignment, have been spoiled by x86 programming, choose to use structs across compile domains (such a bad idea), the ARMs and others are tolerating unaligned accesses, you can very much feel the performance hit on those platforms as they are so efficient if everything is aligned, but when you do something unaligned it just generates more bus transactions making everything take longer. So the older arms would fault by default, the arm7 could have the fault disabled but would rotate the data around the word (nice trick for swapping 16 bit values in a word) rather than spill over into the next word, later architectures default to not fault on aligned or most folks set them to not fault on aligned and they read/write the unaligned transfers as one would hope/expect.
For every x86 chip you have in your computer you have several if not handfuls of non-x86 processors in that same computer or peripherals hanging off that computer (mouse, keyboard, monitor, etc). A lot of those are 8-bit 8051s and z80s, but also a lot of them are arm based. So there is lots of non-x86 development going on not just all the phones and tablets main processors. Those others desire to be low cost and low power so more efficiency in the coding both in its bus performance so the clock can be slower but also a balance of code/data usage overall to reduce the cost of the flash/ram.
It is quite difficult to force these alignment issues on an x86 platform there is a lot of overhead to overcome its architectural issues. But you can see this on more efficient platforms. Its like a train vs a sports car, something falls off a train a person jumps off or on there is so much momentum its not noticed one bit, but step change the mass on the sports car and you will feel it. So trying to do this on an x86 you are going to have to work a lot harder if you can even figure out how to do it. But on other platforms its easier to see the effects. Unless you find an 8086 chip and I suspect you can feel the differences there, would have to pull out my manual to confirm.
If you are lucky enough to have access to chip sources/simulations then you can see this kind of thing happening all over the place and can really start to hand tune your program (for that platform). Likewise you can see what caching, write buffering, instruction prefetching in its various forms and so on do for overall performance and at times create parallel periods of time where other not-so-efficient transactions can hide, and or intentional spare cycles are created so that transactions that take extra time can have a time slice.

Why must an int have a memory address that is divisible by four on most current architectures? [duplicate]

Admittedly I don't get it. Say you have a memory with a memory word of length of 1 byte. Why can't you access a 4 byte long variable in a single memory access on an unaligned address(i.e. not divisible by 4), as it's the case with aligned addresses?

The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
struct mystruct {
char c; // one byte
int i; // four bytes
short s; // two bytes
}
On a 32-bit processor it would most likely be aligned like shown here:
The processor can read each of these members in one transaction.
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this:
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data, in, out, and between more and faster execution units–in a highly reliable way.
Bonus: Caches
Another alignment-for-performance that I alluded to previously is alignment on cache lines which are (for example, on some CPUs) 64B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects; from this question on cache-line sizes
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice slower.

It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.
There is much more information in this link that the OP discovered.

you can with some processors (the nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line, because the bus is 64 bits wide, you had to fetch 64 bit at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2 byte variable, but that was split across 2 chunks, then that required double the required memory accesses.
So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.

Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of memory
128 bits - size of bus
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.

#joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects which were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments look like.
In addition here's a link to a Github gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which #joshperry referenced. The tests were run on a Macbook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16GB of RAM.

If you have a 32bit data bus, the address bus address lines connected to the memory will start from A2, so only 32bit aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary - i.e. A0 for 16/32 bit data or A1 for 32 bit data are not zero, two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler generated unaligned access code requires not just additional bus cycles, but additional instructions, making it even less efficient.

If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.

On PowerPC you can load an integer from an odd address with no problems.
Sparc and I86 and (I think) Itatnium raise hardware exceptions when you try this.
One 32 bit load vs four 8 bit loads isnt going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.

Why double in C is 8 bytes aligned?

I was reading a article about data types alignment in memory(here) and I am unable to understand one point i.e.
Note that a double variable will be allocated on 8 byte boundary on 32
bit machine and requires two memory read cycles. On a 64 bit machine,
based on number of banks, double variable will be allocated on 8 byte
boundary and requires only one memory read cycle.
My doubt is: Why double variables need to be allocated on 8 byte boundary and not on 4 byte? If it is allocated on 4 byte boundary still we need only 2 memory read cycles(on a 32 bit machine). Correct me if I am wrong.
Also if some one has a good tutorial on member/memory alignment, kindly share.

The reason to align a data value of size 2^N on a boundary of 2^N is to avoid the possibility that the value will be split across a cache line boundary.
The x86-32 processor can fetch a double from any word boundary (8 byte aligned or not) in at most two, 32-bit memory reads. But if the value is split across a cache line boundary, then the time to fetch the 2nd word may be quite long because of the need to fetch a 2nd cache line from memory. This produces poor processor performance unnecessarily. (As a practical matter, the current processors don't fetch 32-bits from the memory at a time; they tend to fetch much bigger values on much wider busses to enable really high data bandwidths; the actual time to fetch both words if they are in the same cache line, and already cached, may be just 1 clock).
A free consequence of this alignment scheme is that such values also do not cross page boundaries. This avoids the possibility of a page fault in the middle of an data fetch.
So, you should align doubles on 8 byte boundaries for performance reasons. And the compilers know this and just do it for you.

Aligning a value on a lower boundary than its size makes it prone to be split across two cachelines. Splitting the value in two cachlines means extra work when evicting the cachelines to the backing store (two cachelines will be evicted; instead of one), which is a useless load of memory buses.

8 byte alignment for double on 32 bit architecture doesn't reduce memory reads but it still improve performance of the system in terms of reduced cache access. Please read the following :
https://stackoverflow.com/a/21220331/5038027

Refer this wiki article about double precision floating point format
The number of memory cycles depends on your hardware architecture which determines how many RAM banks you have. If you have a 32-bit architecture and 4 RAM banks, you need only 2 memory cycle to read.(Each RAM bank contributing 1 byte)

How did 16-bit C compilers work?

C's memory model, with its use of pointer arithmetic and all, seems to model flat address space. 16-bit computers used segmented memory access. How did 16-bit C compilers deal with this issue and simulate a flat address space from the perspective of the C programmer? For example, roughly what assembly language instructions would the following code compile to on an 8086?
long arr[65536]; // Assume 32 bit longs.
long i;
for(i = 0; i < 65536; i++) {
arr[i] = i;
}

How did 16-bit C compilers deal with
this issue and simulate a flat address
space from the perspective of the C
programmer?
They didn't. Instead, they made segmentation visible to the C programmer, extending the language by having multiple types of pointers: near, far, and huge. A near pointer was an offset only, while far and huge pointers were a combined segment and offset. There was a compiler option to set the memory model, which determined whether the default pointer type was near or far.
In Windows code, even today, you'll often see typedefs like LPCSTR (for const char*). The "LP" is a holdover from the 16-bit days; it stands for "Long (far) Pointer".

C memory model does not in any way imply flat address space. It never did. In fact, C language specification is specifically designed to allow non-flat address spaces.
In the most trivial implementation with segmented address space, the size of the largest continuous object would be limited by the size of the segment (65536 bytes on a 16 bit platform). This means that size_t in such implementation would be 16 bit, and that your code simply would not compile, since you are attempting to declare an object that has larger size than the allowed maximum.
A more complex implementation would support so called huge memory model. You see, there's really no problem addressing continuous memory blocks of any size on a segmented memory model, it just requires some extra efforts in pointer arithmetics. So, within the huge memory model, the implementation would make those extra efforts, which would make the code a bit slower, but at the same time would allow addressing objects of virtually any size. So, your code would compile perfectly fine.

The true 16-bit environments use 16 bit pointers which reach any address. Examples include the PDP-11, 6800 family (6802, 6809, 68HC11), and the 8085. This is a clean and efficient environment, just like a simple 32-bit architecture.
The 80x86 family forced upon us a hybrid 16-bit/20-bit address space in so-called "real mode"—the native 8086 addressing space. The usual mechanism to deal with this was enhancing the types of pointers into two basic types, near (16-bit pointer) and far (32-bit pointer). The default for code and data pointers can be set in bulk by a "memory model": tiny, small, compact, medium, far, and huge (some compilers do not support all models).
The tiny memory model is useful for small programs in which the entire space (code + data + stack) is less than 64K. All pointers are (by default) 16 bits or near; a pointer is implicitly associated with a segment value for the whole program.
The small model assumes that data + stack is less than 64K and in the same segment; the code segment contains only code, so can have up to 64K as well, for a maximum memory footprint of 128K. Code pointers are near and implicitly associated with CS (the code segment). Data pointers are also near and associated with DS (the data segment).
The medium model has up to 64K of data + stack (like small), but can have any amount of code. Data pointers are 16 bits and are implicitly tied to the data segment. Code pointers are 32 bit far pointers and have a segment value depending on how the linker has set up the code groups (a yucky bookkeeping hassle).
The compact model is the complement of medium: less than 64K of code, but any amount of data. Data pointers are far and code pointers are near.
In large or huge model, the default subtype of pointers are 32 bit or far. The main difference is that huge pointers are always automatically normalized so that incrementing them avoids problems with 64K wrap arounds. See this.

In DOS 16 bit, I dont remember being able to do that. You could have multiple things that were each 64K (bytes)(because the segment could be adjusted and the offset zeroed) but dont remember if you could cross the boundary with a single array. The flat memory space where you could willy nilly allocate whatever you wanted and reach as deep as you liked into an array didnt happen until we could compile 32 bit DOS programs (on 386 or 486 processors). Perhaps other operating systems and compilers other than microsoft and borland could generate flat arrays greater than 64kbytes. Win16 I dont remember that freedom until win32 hit, perhaps my memory is getting rusty...You were lucky or rich to have a megabyte of memory anyway, a 256kbyte or 512kbyte machine was not unheard of. Your floppy drive had a fraction of a meg to 1.44 meg eventually, and your hard disk if any had a dozen or few meg, so you just didnt compute thing that large that often.
I remember the particular challenge I had learning about DNS when you could download the entire DNS database of all registered domain names on the planet, in fact you had to to put up your own dns server which was almost required at the time to have a web site. That file was 35megabytes, and my hard disk was 100megabytes, plus dos and windows chewing up some of that. Probably had 1 or 2 meg of memory, might have been able to do 32 bit dos programs at the time. Part if it was me wanting to parse the ascii file which I did in multiple passes, but each pass the output had to go to another file, and I had to delete the prior file to have room on the disk for the next file. Two disk controllers on a standard motherboard, one for the hard disk and one for the cdrom drive, here again this stuff wasnt cheap, there were not a lot of spare isa slots if you could afford another hard disk and disk controller card.
There was even the problem of reading 64kbytes with C you passed fread the number of bytes you wanted to read in a 16 bit int, which meant 0 to 65535 not 65536 bytes, and performance dropped dramatically if you didnt read in even sized sectors so you just read 32kbytes at a time to maximize performance, 64k didnt come until well into the dos32 days when you were finally convinced that the value passed to fread was now a 32 bit number and the compiler wasnt going to chop off the upper 16 bits and only use the lower 16 bits (which happened often if you used enough compilers/versions). We are currently suffering similar problems in the 32 bit to 64 transition as we did with the 16 to 32 bit transition. What is most interesting is the code from the folks like me that learned that going from 16 to 32 bit int changed size, but unsigned char and unsigned long did not, so you adapted and rarely used int so that your programs would compile and work for both 16 and 32 bit. (The code from folks from that generation kind of stands out to other folks that also lived through it and used the same trick). But for the 32 to 64 transition it is the other way around and code not refactored to use uint32 type declarations are suffering.
Reading wallyk's answer that just came in, the huge pointer thing that wrapped around does ring a bell, also not always being able to compile for huge. small was the flat memory model we are comfortable with today, and as with today was easy because you didnt have to worry about segments. So it was a desireable to compile for small when you could. You still didnt have a lot of memory or disk or floppy space so you just didnt normally deal with data that large.
And agreeing with another answer, the segment offset thing was 8088/8086 intel. The whole world was not yet dominated by intel, so there were other platforms that just had a flat memory space, or used other tricks perhaps in hardware (outside the processor) to solve the problem. Because of the segment/offset intel was able to ride the 16 bit thing longer than it probably should have. Segment/offset had some cool and interesting things you could do with it, but it was as much a pain as anything else. You either simplified your life and lived in a flat memory space or you constantly worried about segment boundaries.

Really pinning down the address size on old x86's is sort of tricky. You could say that its 16 bit, because the arithmetic you can perform on an address must fit in a 16 bit register. You could also say that it's 32 bit, because actual addresses are computed against a 16 bit general purpose register and 16 bit segment register (all 32 bits are significant). You could also just say it's 20 bit, because the segment registers are shifted 4 bits left and added to the gp registers for hardware addressing.
It actually doesn't matter that much which one of these you chose, because they are all roughly equal approximations of the c abstract machine. Some compilers let you pick a memory model you were using per compilation, while others just assume 32 bit addresses and then carefully check that operations that could overflow 16 bits emit instructions that handle that case correctly.

Check out this wikipedia entry. About Far pointers. Basically, its possible to indicate a segment and an offset, making it possible to jump to another segment.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight