In my quest to understand how structure padding works in C on Linux x86, I read that aligned access is faster than misaligned access. And while I think I understand the reasons given for that, they all seem to rest on the underlying presupposition that the CPU can't directly access addresses that are not a multiple of the bus width. So, for instance, if a CPU with a 32-bit bus were instructed to read 4 bytes of memory starting from address 2, it would first read 4 bytes starting from address 0 and mask off the first two bytes, then read another 4 bytes starting from address 4 and mask off the last two bytes, and finally combine the two results, as opposed to just reading the 4 bytes at once had they been 4-byte aligned.
So, my question is this: why is that presupposition true? Why can't the CPU directly access addresses that are not a multiple of the bus width?
Technically nothing prevents you from making a machine that can address any address on the memory bus. It's just that everything is much simpler if you make the address a multiple of the bus size.
This means that, for example, to make a 32-bit memory, you can just take four 8-bit memory chips, connect each one to a quarter of the data bus, and share the same address bus among them. When you send a single address, all 4 chips read/write their corresponding 8 bits to form a 32-bit word. Notice that to do that, you ignore the lowest 2 address bits, since you get 32 bits in a single access, which essentially forces addresses to be 32-bit aligned.
Addr | bus addr | CHIP0 CHIP1 CHIP2 CHIP3 | Value read
0x00 | 000000b  | 0x59  0x32  0xaa  0xba  | 0x5932aaba
0x04 | 000001b  | 0x84  0xff  0x51  0x32  | 0x84ff5132
To access non-aligned words in such a configuration, you would need to send two different bus addresses to the 4 chips, since perhaps two chips hold your value at address 0 and the other two at address 1. This means you either need several address buses, or you make multiple accesses anyway.
Note that modern DRAM is obviously more complex, and accesses come in units of cache lines, which are much bigger than the bus width.
Overall, in most memory use cases, rounding accesses to the bus width makes things simpler.
Look at this scheme and then answer the following questions:
As you see, there's a simple C program roughly translated into assembly instructions. For the sake of simplicity, let's suppose each instruction is 3 bytes long, and that the page and frame size is also 3 bytes.
Is this flow of procedures correct?
Is the program really scattered into pages and then put into RAM frames like that?
If so, how does the system associate specific pages with the segments they belong to?
I read in an OS book that segmentation and paging can coexist. Is that related to this scenario?
Please answer all these questions, and also try to clarify the confusion regarding the coexistence of segmentation and paging. Good reference material on this subject is also much appreciated, especially a concrete real example with a C program rather than abstract stuff.
Is this flow of procedures correct?
Is the program really scattered into pages and then put into RAM frames like that?
In principle: Yes.
Please note that the sections ("Code", "Data", "Stack" and "Heap") are only names of ranges in the memory. For the CPU, there is no "border" between "Data" and "Stack".
If so, how does the system associate specific pages ...
I'm simplifying a lot in the next lines:
Reading memory works the following way: The CPU sends some address to the memory (e.g. RAM) and it sends a signal that data shall be read. The memory then sends the data stored at this address back to the CPU.
Until the early 1990s, there were computers whose CPU and MMU were two separate chips. The MMU was placed between the CPU and the memory. When the CPU sends an address to the memory, the address (cpu_address) is actually received by the MMU. The MMU then performs an operation similar to the following C code:
int page_table[NUM_PAGES];

new_address = page_table[cpu_address / PAGE_SIZE] * PAGE_SIZE
            + (cpu_address % PAGE_SIZE);
... the MMU sends the address new_address to the memory.
The MMU is built in a way that the operating system can write to the page_table.
Let's say the instruction LOAD R1,[XXX1] is stored at the "virtual address" 1234. This means: The CPU thinks that the instruction is stored at address 1234. The CPU has executed the instruction STORE [XXX2],5 and wants to execute the next instruction. For this reason it must read the next instruction from the memory, so it sends the address 1234 to the memory.
However, the address is not received by the memory but by the MMU. The MMU now calculates: 1234/3 = 411, 1234%3 = 1. Now the MMU looks up the page table; let's say that entry #411 is 56. Then it calculates: 56 * 3 + 1 = 169. It sends the address 169 to the memory so the memory will not return the data stored in address 1234, but the data stored in address 169.
Modern desktop CPUs (but not small microcontrollers) have the CPU and the MMU in one chip.
You may ask: Why is that done?
Let's say a program is compiled in a way that it must be loaded at address 1230, and you have another program that must also be loaded at address 1230. Without an MMU, it would not be possible to load both programs at the same time.
With an MMU, the operating system may load one program at address 100 and the other one at address 200. The OS will then modify the data in the page_table in such a way that the CPU accesses the currently running program when it accesses address 1230.
... to the specific segments they belong to?
When compiling the program, the linker decides at which address some item (variable or function) is found. In the simplest case, it simply places the first function to a fixed address (for example always 1230) and appends all other functions and variables.
So if the code of your program is 100 bytes long, the first variable is found at address 1230 + 100 = 1330.
The executable file (created by the linker) contains some information that says that the code must be loaded to address 1230.
When the operating system loads the program, it checks its length. Let's say your program is 120 bytes long. It calculates 120/3=40, so 40 pages are needed.
The OS searches for 40 pages of RAM which are not used. Then it modifies the page_table in a way that address 1230 actually accesses the first free page, 1233 accesses the second free page, 1236 accesses the third free page and so on.
Now the OS loads the program to virtual address 1230-1349 (addresses sent from the CPU to the MMU); because of the MMU, the data will be written to the free pages in RAM.
I read in an OS book that segmentation and paging can coexist. Is that related to this scenario?
In this case, the word "segmentation" describes a special feature of 16- and 32-bit x86 CPUs; at least, I don't know of other CPUs supporting that feature. In modern x86 operating systems, this feature is no longer used.
Your example seems not to use segmentation (your drawing may be interpreted differently).
If you run a 16-bit Windows program on an old 32-bit Windows version (recent versions no longer support 16-bit programs), paging and segmentation are used at the same time.
specifically some concrete real example with C program and not abstract stuff
Unfortunately, a concrete real example would be even more difficult to understand because the page sizes are much larger than 3 bytes (4 kilobytes on x86 CPUs) ...
I will use the TM4C123 Arm Microcontroller/Board as an example.
Most of its I/O registers are memory-mapped, so you can get/set their values using regular memory load/store instructions.
My question is: is there some type of register outside the CPU, somewhere on the bus, which is mapped to memory so that you read/write it through the memory region (essentially having duplicate values, one in the register and one in memory), or is the memory itself the register?
There are many buses, even in an MCU: bus after bus after bus, splitting off like branches in a tree (and sometimes even merging, unlike a tree).
It may predate the Intel/Motorola battle, but certainly in that time frame you had segmented vs. flat addressing, and you had I/O-mapped I/O vs. memory-mapped I/O, since Motorola (and others) did not have a separate I/O bus (well, one extra address signal).
Look at the ARM architecture documents and the chip documentation (ARM makes IP, not chips). You have load and store instructions that operate on addresses. The documentation for the chip (and, to some extent, the rules ARM provides for the Cortex-M address space) provides a long list of addresses for things. As a programmer, you simply use the right instructions with the right addresses for your loads and stores.
Someone's marketing may still care about terms like memory-mapped I/O, and because Intel x86 still exists, some folks will continue to carry those terms around. As a programmer, they are just bits that go into registers, and for single instructions here and there those bits are addresses. If you want to add adjectives to that, go for it.
If the address you are using, based on the chip and core documentation, points at an SRAM, then that is a read or write of memory. If it is a flash address, then that is the flash. A UART address, the UART; a timer 5 address, timer 5's control and status registers. And so on.
There are addresses in these MCUs that point at two or three things, but not at the same time (address 0x00000000 and some number of kilobytes after it). And this overlap, at least in many of these Cortex-M MCUs, does not mix "memory" and peripherals (I/O); instead, these are places from which you can boot the chip and run some code. With these Cortex-Ms I do not think you can even use the sort-of MMU (really an MPU) to mix those spaces. In full-sized ARMs and other architectures, though, you can definitely use a full-blown MMU to mix up the address spaces and have a virtual address space that lands on a physical address space of a different type.
Or in other words, why does accessing an arbitrary element in an array take constant time (instead of O(n) or some other time)?
I googled my heart out looking for an answer to this and did not find a very good one so I'm hoping one of you can share your low level knowledge with me.
Just to give you an idea of how low-level an answer I'm hoping for, I'll tell you why I THINK it takes constant time.
When I say array[4] = 12 in a program, I'm really just storing the bit representation of the memory address into a register. This physical register in the hardware will turn on the corresponding electrical signals according to the bit representation I fed it. Those electrical signals will then somehow magically (hopefully someone can explain the magic) access the right memory address in physical/main memory.
I know that was rough, but it was just to give you an idea of what kind of answer I'm looking for.
(editor's note: From the OP's later comments, he understands that address calculations take constant time, and just wonders about what happens after that.)
Because software likes O(1) "working" memory and thus the hardware is designed to behave that way
The basic point is that the address space of a program is thought of as abstractly having O(1) access performance, i.e. whatever memory location you want to read, it should take some constant time (which, in any case, isn't related to the distance between it and the last memory access). So, since arrays are nothing more than contiguous chunks of address space, they should inherit this property (accessing an element of an array is just a matter of adding the index to the start address of the array, and then dereferencing the obtained pointer).
This property comes from the fact that, in general, the address space of a program has some correspondence with the physical RAM of the PC, which, as the name (random-access memory) partially implies, should by itself have the property that, whatever location in the RAM you want to access, you get to it in constant time (as opposed, for example, to a tape drive, where the seek time depends on the actual length of tape you have to move to get there).
Now, for "regular" RAM this property is (at least AFAIK) true - when the processor/motherboard/memory controller asks a RAM chip for some data, it does so in constant time; the details aren't really relevant for software development, and the internals of memory chips have changed many times in the past and will change again in the future. If you are interested in an overview of the details, you can read up on how current DRAMs work.
The general concept is that RAM chips don't contain a tape that must be moved, or a disk arm that must be positioned; when you ask them for a byte at some location, the work (mostly changing the settings of some hardware muxes that connect the output to the cells where the byte's state is stored) is the same for any location you could be asking for; thus, you get O(1) performance.
There is some overhead behind this (the logical address has to be mapped to a physical address by the MMU, the various motherboard pieces have to talk to each other to tell the RAM to fetch the data and bring it back to the processor, ...), but the hardware is designed to do so in more or less constant time.
So:
arrays map over the address space, which is mapped over RAM, which has O(1) random access; since all the maps are (more or less) O(1), arrays keep the O(1) random-access performance of RAM.
The point that does matter to software developers, instead, is that, although we see a flat address space and it normally maps over RAM, on modern machines it is false that accessing any element has the same cost. In fact, accessing elements that are in the same zone can be way cheaper than jumping around the address space, because the processor has several on-board caches (= smaller but faster on-chip memories) that keep recently used data and memory from the same neighborhood; thus, if you have good data locality, continuous operations in memory won't keep hitting the RAM (which has much longer latency than the caches), and in the end your code will run way faster.
Also, under memory pressure, operating systems that provide virtual memory can decide to move rarely used pages of your address space to disk, and fetch them on demand when they are accessed (in response to a page fault); such an operation is very costly and, again, strongly deviates from the idea that accessing any virtual memory address is the same.
The calculation to get from the start of the array to any given element takes only two operations: a multiplication (by sizeof(element)) and an addition. Both of those operations are constant time. With today's processors it can often be done in essentially no time at all, as the processor is optimized for this kind of access.
When I say array[4] = 12 in a program, I'm really just storing the bit
representation of the memory address into a register. This physical
register in the hardware will turn on the corresponding electrical
signals according to the bit representation I fed it. Those electrical
signals will then somehow magically ( hopefully someone can explain
the magic ) access the right memory address in physical/main memory.
I am not quite sure what you are asking, but I don't see any answers related to what is really going on in the magic of the hardware. Hopefully I understood enough to go through this long-winded explanation (which is still very high level).
array[4] = 12;
So from the comments it sounds like it is understood that you have to get the base address of array, and then multiply by the size of an array element (or shift, if that optimization is possible) to get the address (from your program's perspective) of the memory location. Right off the bat we have a problem: are these items already in registers, or do we have to go get them? The base address of array may or may not be in a register depending on the code that surrounds this line of code, in particular the code that precedes it. That address might be on the stack or in some other location, depending on where you declared it and how, and that may or may not matter as to how long it takes. An optimizing compiler may (often) go so far as to pre-compute the address of array[4] and place it somewhere so it can go into a register and the multiply never happens at runtime; so it is absolutely not true that the computation of array[4] for a random access takes a fixed amount of time compared to other random accesses. Depending on the processor, some immediate patterns are one instruction while others take more, and whether this address is read from .text or the stack also has an effect, etc. To not chicken-and-egg that problem to death, assume we have the address of array[4] computed.
This is a write operation, from the programmer's perspective. Start with a simple processor: no cache, no write buffer, no MMU, etc. Eventually the simple processor will put the address on the edge of the processor core, with a write strobe and data. Each processor family's bus differs from the others, but they are roughly the same: the address and data can come out in the same cycle or in separate cycles, and the command type (read, write) can come out at the same time or a different one, but it comes out. The edge of the processor core is connected to a memory controller that decodes the address. The result is a destination: is this a peripheral, and if so which one and on what bus? Is this memory, and if so on what memory bus? And so on. Assume RAM, and assume this simple processor has SRAM, not DRAM. SRAM is more expensive and faster in an apples-to-apples comparison. The SRAM has an address, write/read strobes and other controls. Eventually you will have the transaction type (read/write), the address and the data. The SRAM, whatever its geometry, will route and store the individual bits in their individual pairs/groups of transistors.
A write cycle can be fire and forget. All the information needed to complete the transaction - this is a write, this is the address, this is the data - is known right then and there. The memory controller can, if it chooses, tell the processor that the write transaction is complete, even if the data is nowhere near the memory. That address/data pair will take its time getting to the memory, and the processor can keep operating. In some systems, though, the design is such that the processor's write transaction waits until a signal comes back indicating that the write has made it all the way to the RAM. In a fire-and-forget setup, that address/data will be queued up somewhere and work its way to the RAM. The queue can't be infinitely deep, otherwise it would be the RAM itself, so it is finite, and it is possible and likely that many writes in a row can fill that queue faster than the other end can drain it to RAM. At that point the current and/or next write has to wait for the queue to indicate there is room for one more. So in situations like this, how fast your write happens - whether your simple processor is I/O bound or not - depends on prior transactions, which may or may not be writes, from the instructions that preceded the one in question.
Now add some complexity: ECC, or whatever name you want to call it (EDAC is another). The way an ECC memory works is that the writes are all a fixed size; even if your implementation is four 8-bit-wide memory parts giving you 32 bits of data per write, you have to have a fixed width that the ECC covers, and you have to write the data bits plus the ECC bits all at the same time (the ECC has to be computed over the full width). So if this was, say, an 8-bit write into a 32-bit ECC-protected memory, that write cycle requires a read cycle: read the 32 bits (checking the ECC on that read), modify the new 8 bits within that 32-bit pattern, compute the new ECC pattern, and write the 32 bits plus the ECC bits. Naturally, the read portion of the write cycle can end up with an ECC error, which just makes life even more fun. Single-bit errors can usually be corrected (what good is an ECC/EDAC if it can't?), multi-bit errors cannot. How the hardware is designed to handle these faults affects what happens next: the read fault may just trickle back to the processor, faulting the write transaction, or it may come back as an interrupt, etc. But here is another place where one random access is not the same as another: depending on the memory being accessed and the size of the access, a read-modify-write definitely takes longer than a simple write.
DRAM can also fall into this fixed-width category, even without ECC. Actually, all memory falls into this category at some point. The memory array is optimized on the silicon for a certain height and width in units of bits. You cannot violate that: the memory can only be read and written in units of that width at that level. The silicon libraries will include many geometries of RAM, the designers will choose those geometries for their parts, and the parts will have fixed limits. Often you can use multiple parts to get some integer multiple of that width, and sometimes the design will allow you to write to only one of those parts if only some of the bits are changing, while other designs force all parts to light up. Notice how, for the next DDR family of modules that you plug into your home computer or laptop, the first wave is many parts on both sides of the board; then, as that technology gets older and more boring, it may change to fewer parts on both sides of the board, eventually becoming fewer parts on one side of the board before the technology is obsolete and we are already into the next.
This fixed-width category also carries alignment penalties with it. Unfortunately, most folks learn on x86 machines, which don't restrict you to aligned accesses like many other platforms do. There is a definite performance penalty on x86 (and others) for unaligned accesses, where they are allowed. It is usually when folks move to a MIPS, or more often an ARM on some battery-powered device, that they first learn as programmers about aligned accesses - and sadly find them to be painful rather than a blessing (due to the simplicity, both in programming and in the hardware benefits that come from it). In a nutshell, if your memory is, say, 32 bits wide and can only be accessed, read or write, 32 bits at a time, it is limited to aligned accesses only. A memory bus for a 32-bit-wide memory usually does not have the lower address bits a[1:0], because there is no use for them; those lower bits, from a programmer's perspective, are zeros. If our write was 32 bits against one of these 32-bit memories and the address was 0x1002, then somebody along the line has to read the memory at address 0x1000, take two of our bytes, modify that 32-bit value, and write it back; then take the 32 bits at address 0x1004, modify two bytes, and write it back: four bus cycles for a single write. If we were writing 32 bits to address 0x1008, though, it would be a simple 32-bit write: no reads.
SRAM vs. DRAM: DRAM is painfully slow but super cheap - a half to a quarter of the number of transistors per bit (for example, 4 for SRAM, 1 for DRAM). SRAM remembers the bit as long as the power is on. DRAM has to be refreshed, like a rechargeable battery: even if the power stays on, a single bit will only be remembered for a very short period of time. So some hardware along the way (the DDR controller, etc.) has to regularly perform bus cycles telling that RAM to remember a certain chunk of the memory. Those cycles steal time from your processor wanting to access that memory. DRAM is very slow; it may say 2133 MHz (2.133 GHz) on the box, but it is really more like 133 MHz RAM - right, 0.133 GHz. The first cheat is DDR. Normally things in the digital world happen once per clock cycle: the clock goes to an asserted state, then to a deasserted state (ones and zeros); one cycle is one clock. DDR means it can do something on both the high half-cycle and the low half-cycle, so that 2133 MHz memory really uses a 1066 MHz clock. Then pipeline-like parallelism happens: you can shove in commands, in bursts, at that high rate, but eventually the RAM actually has to be accessed. Overall, DRAM is non-deterministic and very slow. SRAM, on the other hand, needs no refreshes: it remembers as long as the power is on, can be several times faster (133 MHz * N), and so on. It can be deterministic.
The next hurdle: cache. Cache is good and bad. Cache is generally made from SRAM. Hopefully you have an understanding of what a cache is. If the processor or someone upstream has marked the transaction as non-cacheable, it goes through uncached to the memory bus on the other side. If cacheable, then a portion of the address is looked up in a table, resulting in a hit or a miss. This being a write, depending on the cache and/or transaction settings, if it is a miss it may pass through to the other side. If there is a hit, the data will be written into the cache memory; depending on the cache type it may also pass through to the other side, or the data may sit in the cache waiting for some other chunk of data to evict it, at which point it gets written to the other side. Caches definitely make reads, and sometimes writes, non-deterministic. Sequential accesses benefit the most, since your eviction rate is lower: the first access to a cache line is slow relative to the others, then the rest are fast. This is where we get the term "random access" anyway: random accesses go against the schemes that are designed to make sequential accesses faster.
Sometimes the far side of your cache has a write buffer: a relatively small queue/pipe/buffer/FIFO that holds some number of write transactions. Another fire-and-forget deal, with those benefits.
Multiple layers of caches: L1, L2, L3... L1 is usually the fastest, either by its technology or its proximity, and usually the smallest; it goes up from there in speed and size, and some of that has to do with the cost of the memory. We are doing a write, but when you do a cache-enabled read, understand that if L1 has a miss it goes to L2, which, if it misses, goes to L3, which, if it misses, goes to main memory; then L3, L2 and L1 all store a copy. So a miss on all 3 is of course the most painful and is slower than if you had no cache at all, but subsequent sequential reads will hit the cached items, which are now in L1 and super fast. For the cache to be useful, sequential reads over the cache line should take less time overall than reading that much memory directly from the slow DRAM. A system doesn't have to have 3 layers of caches; it can vary. Likewise, some systems separate instruction fetches from data reads and have separate caches which don't evict each other, while in others the caches are not separate and instruction fetches can evict data brought in by data reads.
Caches help with alignment issues, but of course there is an even more severe penalty for an unaligned access that crosses cache lines. Caches tend to operate on chunks of memory called cache lines. These are often some integer multiple of the width of the memory on the other side: for a 32-bit memory, the cache line might be 128 bits or 256 bits, for example. So if and when the cache line is in the cache, the read-modify-write caused by an unaligned write goes against faster memory - still more painful than aligned, but not as painful. If it were an unaligned read, and the address was such that part of the data is on one side of a cache-line boundary and the rest on the other, then two cache lines have to be read. A 16-bit read, for example, can cost you many bytes read from the slowest memory, obviously several times slower than if you had no caches at all. Depending on how the caches and the memory system in general are designed, a write across a cache-line boundary may be similarly painful - or perhaps not quite: it might write one fraction to the cache and send the other fraction out on the far side as a smaller write.
The next layer of complexity is the MMU, which gives the processor and programmer the illusion of flat memory spaces, and/or control over what is cached or not, and/or memory protection, and/or the illusion that all programs run in the same address space (so your toolchain can always compile/link for address 0x8000, for example). The MMU takes a portion of the virtual address on the processor-core side and looks it up in a table, or series of tables; those lookups are often in system address space, so each one of those lookups may involve one or more of everything stated above, since each is a memory cycle on the system memory. Those lookups can result in ECC faults even though you are trying to do a write. Eventually, after one or two or three or more reads, the MMU has determined what the address on the other side of the MMU is, along with its properties (cacheable or not, etc.); that is passed on to the next thing (L1, etc.) and all of the above applies. Some MMUs keep a bit of a cache of some number of prior translations; remember, because programs are sequential, the tricks used to boost the illusion of memory performance are based on sequential accesses, not random accesses. So some number of lookups might be stored in the MMU so it doesn't have to go out to main memory right away...
So in a modern computer with MMUs, caches and DRAM, sequential reads in particular, but also writes, are likely to be faster than random access. The difference can be dramatic. The first transaction in a sequential read or write is, at that moment, a random access, as it has not been seen ever or for a while. Once the sequence continues, though, the optimizations fall into place and the next several accesses are noticeably faster. The size and alignment of your transaction play an important role in performance as well. While there are so many non-deterministic things going on, as a programmer with this knowledge you can modify your programs to run much faster - or, if unlucky or on purpose, modify them to run much slower. Sequential is going to be, in general, faster on one of these systems; random access is going to be very non-deterministic. array[4]=12; followed by array[37]=12; - those two high-level operations could take dramatically different amounts of time, both in the computation of the write address and in the actual writes themselves. But, for example, discarded_variable=array[3]; array[3]=11; array[4]=12; can quite often execute significantly faster than array[3]=11; array[4]=12;
Arrays in C and C++ have random access because they are stored in RAM - random-access memory - in a finite, predictable order. As a result, a simple linear operation is required to determine the location of a given record (&a[i] == (char*)a + sizeof(a[0]) * i). This calculation takes constant time. From the CPU's perspective, no "seek" or "rewind" operation is required; it simply tells memory "load the value at address X".
However: on a modern CPU the idea that it takes constant time to fetch data is no longer true. It takes constant amortized time, depending on whether a given piece of data is in cache or not.
Still - the general principle is that the time to fetch a given set of 4 or 8 bytes from RAM is the same regardless of the address. E.g. if, from a clean slate, you access RAM[0] and RAM[4294967292] the CPU will get the response within the same number of cycles.
#include <iostream>
#include <cstring>
#include <chrono>
// 8KB of space.
char smallSpace[8 * 1024];
// 64MB of space (larger than cache)
char bigSpace[64 * 1024 * 1024];
void populateSpaces()
{
memset(smallSpace, 1, sizeof(smallSpace));
memset(bigSpace, 1, sizeof(bigSpace));
std::cout << "Populated spaces" << std::endl;
}
unsigned int doWork(char* ptr, size_t size)
{
unsigned int total = 0;
const char* end = ptr + size;
while (ptr < end) {
total += *(ptr++);
}
return total;
}
using namespace std;
using namespace chrono;
void doTiming(const char* label, char* ptr, size_t size)
{
cout << label << ": ";
const high_resolution_clock::time_point start = high_resolution_clock::now();
auto result = doWork(ptr, size);
const high_resolution_clock::time_point stop = high_resolution_clock::now();
auto delta = duration_cast<nanoseconds>(stop - start).count();
cout << "took " << delta << "ns (result is " << result << ")" << endl;
}
int main()
{
cout << "Timer resultion is " <<
duration_cast<nanoseconds>(high_resolution_clock::duration(1)).count()
<< "ns" << endl;
populateSpaces();
doTiming("first small", smallSpace, sizeof(smallSpace));
doTiming("second small", smallSpace, sizeof(smallSpace));
doTiming("third small", smallSpace, sizeof(smallSpace));
doTiming("bigSpace", bigSpace, sizeof(bigSpace));
doTiming("bigSpace redo", bigSpace, sizeof(bigSpace));
doTiming("smallSpace again", smallSpace, sizeof(smallSpace));
doTiming("smallSpace once more", smallSpace, sizeof(smallSpace));
doTiming("smallSpace last", smallSpace, sizeof(smallSpace));
}
Live demo: http://ideone.com/9zOW5q
Output (from ideone, which may not be ideal)
Timer resolution is 1ns
Populated spaces
doWork/small: took 8384ns (result is 8192)
doWork/small: took 7702ns (result is 8192)
doWork/small: took 7686ns (result is 8192)
doWork/big: took 64921206ns (result is 67108864)
doWork/big: took 65120677ns (result is 67108864)
doWork/small: took 8237ns (result is 8192)
doWork/small: took 7678ns (result is 8192)
doWork/small: took 7677ns (result is 8192)
Populated spaces
strideWork/small: took 10112ns (result is 16384)
strideWork/small: took 9570ns (result is 16384)
strideWork/small: took 9559ns (result is 16384)
strideWork/big: took 65512138ns (result is 134217728)
strideWork/big: took 65005505ns (result is 134217728)
What we are seeing here is the effect of the cache on memory access performance. The first time we hit smallSpace it takes ~8100ns to access all 8 KB of it. But when we call it again immediately after, twice, it takes ~600ns less, at ~7400ns.
Now we go away and do bigSpace, which is bigger than the current CPU's cache, so we know we've blown away the L1 and L2 caches.
Coming back to smallSpace, which we're sure is not cached now, we again see ~8100ns for the first pass and ~7400ns for the second two.
We blow the cache out and now we introduce a different behavior: a strided loop version. This amplifies the "cache miss" effect and significantly bumps the timing, although smallSpace fits into L2 cache, so we still see a reduction between pass 1 and the following two passes.
I am trying to understand in detail how a computer boots up.
I came across two things which made me more curious:
1. RAM is placed at the bottom of ROM to avoid memory holes, as in the Z80 processor.
2. A reset vector is used, which takes the processor to a memory location in ROM whose contents point to the actual location (again in ROM) from where the processor actually starts executing instructions (the POST instruction). Why so?
If you still can't understand me, this link explains it briefly:
http://lateblt.tripod.com/bit68.txt
The processor's logic is generally rigid and fixed, thus the term hardware. Software is something that can be changed, molded, etc., thus the term software.
The hardware needs to start somehow. There are two basic methods:
1) an address, hardcoded in the logic, in the processor's memory space is read, and the value found there is the address at which to start executing code
2) an address, hardcoded in the logic, is where the processor starts executing code
When the processor itself is integrated with other hardware, anything can be mapped into any address space. You can put RAM at address 0x1000 or 0x40000000 or both. You can map a peripheral to 0x1000 or 0x4000 or 0xF0000000 or all of the above. It is the choice of the system designers, or of the teams of engineers involved, where things will go. One important factor is how the system will boot once reset is released. How the processor boots is well defined by its architecture. The designers usually choose one of two paths:
1) Put a ROM in the memory space that contains the reset vector or the entry point, depending on the boot method of the processor (no matter the architecture, there is a first address, or first block of addresses, that is read, and its contents drive the booting of the processor). The software places code, a vector table, or both in this ROM so that the processor will boot and run.
2) Put RAM in the memory space, in such a way that some host can download a program into that RAM, then release reset on the processor. The processor then follows its hardcoded boot procedure and the software is executed.
The first is the most common; the second is found in some peripherals, mice and network cards and things like that (some of the firmware in /usr/lib/firmware/ is used for this, for example).
The bottom line, though, is that the processor is usually designed with one fixed boot method, so that all software written for that processor can conform to that one method and not have to keep changing. Also, when the processor is designed it doesn't know its target application, so it needs a generic solution. The target application often defines the memory map (what is where in the processor's memory space), and one of the tasks in that assignment is deciding how that product will boot. From there the software is compiled and placed such that it conforms to the processor's rules and the product's hardware rules.
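On a concrete product, that "what goes where" decision usually ends up encoded in the linker script, which places the boot code where the processor's fixed method expects it. A hypothetical fragment in GNU ld syntax (the addresses, sizes, and section names are invented for illustration):

```ld
/* Hypothetical memory map: flash where the core fetches its reset vector,
   RAM elsewhere in the address space. */
MEMORY
{
    FLASH (rx)  : ORIGIN = 0x08000000, LENGTH = 512K
    RAM   (rwx) : ORIGIN = 0x20000000, LENGTH = 128K
}

SECTIONS
{
    /* The vector table must land where the core looks for it on reset. */
    .vectors : { KEEP(*(.vectors)) } > FLASH
    .text    : { *(.text*) }         > FLASH
    .data    : { *(.data*) }         > RAM AT > FLASH
    .bss     : { *(.bss*) }          > RAM
}
```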
It completely varies by architecture. There are a few reasons why cores might want to do this, though. Embedded cores (think along the lines of ARM and MicroBlaze) tend to be used within system-on-chip designs with a single address space. Such architectures can have multiple memories all over the place and tend only to dictate that the bottom area of memory (i.e. 0x00) contains the interrupt vectors. This then allows the programmer to easily specify where to boot from. On MicroBlaze, you can attach memory wherever the hell you like in XPS.
In addition, it can be used to easily support bootloaders. These are typically small programs that do a bit of initialization and then fetch a larger program from a medium that can't be accessed simply (e.g. USB or Ethernet). In these cases, the bootloader typically copies itself to high memory, fetches the larger program into the memory below it, and then jumps there. The reset vector simply allows the programmer to bypass the first step.