How does array offset access actually work - arrays

We all are aware of how easy it is to access elements of an array in the blink of an eye:
#include<stdio.h>
int main()
{
int array[10];
array[5]=6; //setat operation at index 5
printf("%d",array[5]); //getat operation
}
Yea, question may sound a bit stupid but how does a compiler just get you the index that you want to access for inserting data or for displaying it, so fast. Does it traverse to that index on its own for completing setat(),getat() operations.
Cause general means are: if you are asked to pick 502th element from a row of 1000 units, you would start counting until u get the count 502 (in case of computer 501) so is this the same happening in computer.

The array is stored in random-access memory (RAM). RAM is is divided into equal-sized, individually addressable units, such as bytes. Addressable means enumerable by an address, which is a number. Random access means that the processor doesn't have to traverse addresses 0 through 499 in order to access location 500. It directly proceeds to 500. How it works is that the computer places a binary representation of the adress 500 onto a collection of signal lines called the "address bus". All of the devices connected to the address bus simultaneously examine the address and their circuitry answers the question "is this address in my range?". The device for which the answer is yes, then springs into action.
In the case of RAM it circuitry further decodes the address to determine which row and column of which bank to activate. The values read out are placed onto the data bus for the processor to collect. The actual implementation is considerably more complicated due to caching, but that's the basic idea.
The main idea is that the machine accesses memory and memory-like resources (such as I/O ports) using an address, and the address is distributed, as a set of electrical signals, in parallel to all of the devices, which can look at it at once; and those devices themselves have parallel circuitry to further analyze the address to identify a specific resource within their innards. So addressing happens very fast, without having to search through resources that are not being addressed.
C language arrrays are a very low-level concept. A C array sits at some address in memory and holds equal sized objects. These objects are accessed by performing arithmetic. For instance if the array elements are 8 bytes wide, then accessing the 17th element means that the machine has to multiply 17 x 8 to produce the offset 136, which is then added to the address of the array to produce the address of the element.
In youor program you have the expression array[5]. The value 5 is known to the C compiler at compile time (before the program is translated, linked and executed). The size of the array elements, which are of type int is also known at compile time. The address of array isn't known at compile time. Therefore the offset calculation likely takes place at compile time; the 5 is converted to a sizeof (int) * 5 offset calculated at compile time to a value like 20, which is then added to the address of array at run-time to calculate the address of array[5] and fetch its value from that address.

Related

Why "any primitive object of K bytes must have an address that is a multiple of K"?

Computer Systems: a Programmer's Perspective says
The x86-64 hardware will work correctly regardless of the alignment of
data. However, Intel recommends that data be aligned to improve memory
system performance. Their alignment rule is based on the principle
that any primitive object of K bytes must have an address that is a
multiple of K. We can see that this rule leads to the following
alignments:
K Types
1 char
2 short
4 int, float
8 long, double, char *
Why is it that "any primitive object of K bytes must have an address that is a multiple of K"?
How is "aligned" defined or what does it mean?
On a x86-64 machine,
if an object has K bytes (such as K=2 (e.g. short) or K=4 (e.g. int, or float)), "any primitive object of K bytes must have an address that is a multiple of K" means that such an object must have an address that is a multiple of K. But isn't the object aligned, as long as its storage space falls completely between two addresses which are two consecutive multiples of 8, which is a less strict requirement than that the object must have an address that is a multiple of K?
If the K of an object is smaller than 8 but not equal to 1, 2 or 4, does "any primitive object of K bytes must have an address that is a multiple of K" still apply? For example if K=3,5,6, or 7?
On a X86 machine, which has 32-bit addresses,
what is the alignment rule, and Does "any primitive object of K bytes must have an address that is a multiple of K" still apply?
Thanks.
Since this was tagged in C as well; do note that not only does the architecture make these decisions, but so do compilers. The C compiler often has its own alignment rules that mostly follow either the required or the preferred alignment of the architecture - especially when optimizing for speed. And the compiler's requirements are what you you need to worry about the most time, not the architecture requirement.
Even if the processor supports unaligned accesses, it might have a preferred alignment for multibyte objects that the C compiler can exploit. For example a compiler is allowed to know that a any int will reside at, and therefore any int * pointer will always point to - an address divisible by 4.
Now there are people who say that since x86-64 supports unaligned acccess, they can make an int * pointer that points to an address not divisible by 4 and things will work fine.
They're wrong.
There are some instructions in the x86-64 instruction set that require alignment. I.e. the "will work correctly regardless of alignment" means that these instructions too work "correctly, according to the specification, when given an unaligned access" - they raise an exception that would kill your process. The reason for having these is that they can be so much faster and require less silicon to implement than the versions that can deal with unaligned data.
And the compiler knows exactly when it is allowed to use these instructions! Whenever it sees an int * being dereferenced it knows that it can use an instruction that requires the operand be aligned at 4 bytes, should it be more effective.
See this question for a case where OP run into problem with C code that "should have been fine on x86-64 anyway": C undefined behavior. Strict aliasing rule, or incorrect alignment?
As for x86-32, the alignment requirement for doubles is generally 4 in C compilers because doubles need to be passed on stack and stack grows in 4 not 8 byte increments.
And finally:
If the K of an object is smaller than 8 but not equal to 1, 2 or 4, does "any primitive object of K bytes must have an address that is a multiple of K" still apply? For example if K=3,5,6, or 7?
There are no primitive objects with K<-{3,5,6,7} in x86.
The C standard's stance is that an alignment can only be a power of 2, and there are no gaps in arrays. Therefore an object with such a size would need to be padded upwards to its alignment requirement, or its alignment requirement must be 1.
The rules are different on each processor model. I will discuss one hypothetical example. We may have a processor with an eight-byte interface to the bus. Given some address X, the processor can load eight bytes from that address by requesting the memory to deliver eight bytes from its unit of storage numbered X/8. That is, the memory does not have any way to address individual bytes. The processor can only request data at a certain address that is a multiple of eight, and the memory will send the entire eight bytes at that address. (Keep in mind this is a hypothetical example to illustrate basic principles. Also, I am ignoring cache. Cache helps mask some of the effects of alignment issues, because the misalignments can be largely managed in level-one cache inside the processor. But handling this still requires extra hardware, as discussed below.)
Suppose we want the four-byte object that is in bytes 7, 8, 9, and 10. To get this, the processor has to request unit 0 from memory, which supplies bytes 0 through 7, and it has to request unit 1, which supplies bytes 8 through 15. So, already, there is a performance problem: We had to use two bus transfers to get this word that is only half the size of one transfer. That is inefficient, and the bus can only do half as many of these double transfers as it can if we loaded only aligned data requiring single transfers.
Continuing, the processor has all the bytes it needs, 0 through 15, so it extracts bytes 7 through 10, which make up the object we want. To do this, though, it has to shift the bytes to put them into a register. Ideally, if nobody did any “unaligned” loads, four-byte objects would come in from the bus only at offsets 0 and 4 in the eight-byte transfers, and the processor only needs to have wires gong from those offsets to the register destinations.
However, our processor supports unaligned loads, so it has additional switches and wires so the data can be shunted down a different path, where it will be shifted by three bytes. Keep in mind, the data from both transfers has to be shifted by three bytes and then spliced together. So a lot of extra wires and switches are needed. Two eight-byte transfers is 128 bits, so there are hundreds of extra connections involved in this.
Well, fine, the processor has these wires and switches, why not use them? To make this processor fast, it supports multiple loads and stores in progress simultaneously. As soon as the bus transfers one piece of data, we want to be getting another from the bus, while the data from the first is still on its way to a register. So there are actually multiple parts of the processor moving data around for several loads. Since we expect unaligned loads to be rare, maybe only one of the parts for handling loads has the extra components to handle unaligned loads. The others all handle aligned loads. So, if you have just one unaligned load occasionally, the processor sends it to that part, and the performance effect is unnoticeable. However, if you do many unaligned loads in a row, they all have to go through the one part, so they end up waiting in a queue instead of running in parallel, and performance decreases.
That is just for loads. When you store that four-byte object, there is no way to write just bytes 7 through 10. Since the bus and the memory only work in eight-byte units, we need to write units 0 and 1, which also contains bytes 0 through 6 and bytes 11 to 15. To implement the store, the processor must:
Load memory unit 0, providing bytes 0 through 7.
Load memory unit 1, providing bytes 8 through 15.
Move the first byte of the four-byte object into byte 7.
Move the last three bytes of the object into bytes 8 through 10.
Store the changed memory unit 0.
Store the changed memory unit 1.
Again, that is twice as much work as it would be with an aligned object (load one memory unit, move the bytes in, store the unit). And, besides the time of the operations, you are occupying more resources inside the processor—it has to use two internal registers to hold the data from memory temporarily while it is merging the changes.
Actually, it is more than twice the work and resources, because it also requires extra wires and switches to shift the bytes by non-standard amounts.
The processor bus, which is the media used to access memory is normally the processor size in bits. This means a 32bit processor normally access memory in 32bit chunks, meaning that only one memory read access is necessary to read the data from memory.
Addresses by the contrary, are byte oriented, so a double (8 bytes) normally occupies eight different contiguous memory. So to make an access to a single eight bytes data (with only one bus request) The data must begin at a single eight byte word and finish before we get to the next. For old processors this was imperative, in case you requested a memory access that is not data aligned, an exception was fired. Actual processors don't have this restriction, but beware you that in case you have for example a double in a non multiple of eight address, the processor will need to make two bus accesses (with the overhead that this implies) to get the data from memory.
For this reason (you can double or even more, the time required to execute some piece of code if all the data is unaligned, against the time required to if the data is properly aligned) the processor vendor warns you about the alignment of data.
Modern processors have several levels of caches, that are read from main memory in chunks of one cache line (64 or even more bytes) so this is not an issue. Anyway, it is good idea to have data aligned anyway, for the case you need to run your code in a non-such-advanced processor.

How does CPU reads data from memory?How cache Plays important role [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I have written a piece of code where I am allocating memory to variable Bextradata (which is member of structure which is also allocated using malloc) as
Bextradata = (U8_WMC *) malloc(Size);
memset(Bextradata, 0,Size);
memcpy(BextraData,pdata + 18,Size);
and later trying to read this variable in some other file that too just once .So how does this variable will read from memory.will it place this variable in a cache or it will read it from main memory.
Thanks in Advance
Before you understand the working of a CPU, you need to understand few terms. The CPU consists of the ALU (for arithmetic and logic operations), the Control Unit and a bunch of registers. The number of registers in a CPU depends on the architecture and varies. The types of registers present are general purpose registers, special purpose registers instruction pointer and a few others. You can read about them. Now when we generally say 32-bit processor or 64-bit processor we're referring to the size of the registers of the CPU.
Now lets look at the following code:
int a = 10;
int b = 20;
a = a + b;
When the above program is loaded, it's instructions are stored in the main memory. Every instruction in a program is stored in a location in the main memory. Each location has a specific size to it ( Depends on the architecture again, but let's assume it's one byte). Every location has an address to it. The size of the address of particular location in RAM is equal to the size of the instruction pointer. In 64-bit Systems the size of instruction pointer will be 64 bits. That means it can address upto 2^64-1 locations. And since 1 location is generally 1 byte, therefore the total RAM, in theory for 64 bit systems, could be 16 exabytes. ( for 32 bit systems it is 2^32-1 ~ 4 GB)
Now lets look at the first instruction a = 10. This is a store operation. A computer can do following basic operations - add, multiply, subtract, divide, store, jump, etc. You can read the instruction set of any processor, for more on this. Again the instruction set differs from system to system. Coming back, When the program is loaded to memory the instruction pointer points to the first address or base address. In this case it is a = 10. The contents of this location are brought to one of the general purpose registers of the CPU. From this it is taken to the ALU which understands that this is a store operation (cus additional bits are added which represent it as a store operation). The ALU then stores it into one of the locations in RAM and also in the cache. The decision to store it in the cache depends on the compiler and a concept called hardware prefetching. When the compiler parses through a program it sees the frequently used variables and enables them to be stored in cache. In this case, we can see that variable 'a' will be used again so the compiler adds additional intermediate instructions to the program, to store it in the cache as well. Why? For faster access. (In terms of speed always remember Registers > Cache > RAM > Disc )
After the first instruction is executed, the instruction pointer is incremented and it now points to the second instruction, that is, b = 20. The same happens with this as well.
The third is a = a + b. For this there are actually four operations (if u look at the assembly level), that are, 1) Fetch a , 2) Fetch b , 3) Add a and b, 4) store result in a. Now since the variables a and b are present in cache, they are brought from those locations. They are then added and the result is stored back to a.
I hope you understood how it works.
Also you need to know that when a program is loaded in the main memory, it occupies a certain space. This space is called a segment. It has a base address and a final address. You can assume the base address as the first instruction and final address as the last instruction. If from your program you try to dereference a pointer that points from outside this segment, you get the famous error - Segmentation fault. For example :
int *ptr = NULL;
printf(*ptr);
This will give me a segmentation fault as I am trying to dereference a pointer that stores an address whose value is NULL and since NULL is not in the segment, it will give a seg fault.

how c manage data with different size from CPU word size?

while i was taking a course in hardware/software interface, our teacher said the cpu get the data using the word machine, for example, the CPU can get the value at address 0x00 but cannot get the value at address 0x01 but the value at address 0x00 + (word-size:normally it 4 byte), so i want to understand how we can use short int in c that contain only 2 byte?
Different CPU's have different memory access restrictions. One common form is to allow access at any byte offset, but suffer performance penalties if this occurs because two word-aligned transfers might be needed in order to retrieve all bytes necessary for the operation. The second form simply disallows non-aligned memory access.
In the first case, the compile has nothing to worry about, memory can be "packed" or word-aligned and both will work, but one will be faster and one will use less memory.
In the second case, the compiler must generate code to get an un-aligned data into the correct register location(s) to allow arithmetic and logical operations to occur. These operations shift and/or mask bits in byte-sized units. For example a 4-byte load may be needed to load a 2-byte variable. In this instance 2 bytes of the data will likely need to be "zeroed" and the 2 "good" bytes may need to be shifted into the low-order bits of the register. All of this is the responsibility of the compiler, but it adds overhead to access the data. On the other hand, it means the data occupies less memory when not stored in a register. This can be important if you have a large amount of data, such as an array of a few million integers; storing it in half the space may actually make the program run faster (even given the above overhead), since twice as many array values fit in the cache.
A 32 bit processor uses 4 bytes for integer, So first let me explain about memory banking
memory banking= If the processor had used the single byte addressing scheme,it would take 4 memory cycles to perform a read/write operation. So from processors point of view memory is banked as shown.
So if a integer is stored as in the address range 0x0000-0x0003 ie;4 byes ,it just requires a single memory read/write cycle so the processor is relieved from the extra memory cycles and hence improved performance.
Also it is also possible that a variable allocation can start from a number which is not a multiple of four hence it might require 2 or more memory cycles(float,double etc..) in accordance with how they are stored. The CPU fetches the data in minimum number of read cycles.
In short data type only two bytes of the memory are allocated and if it starts from 0x0001 then it ends in 0x0002 still 1 read cycle is required for it. But the advantage is that in integer it allocates 4 bytes where as the short allocates two byes so the two bytes is saved from the integer data type. So storage is improved
Reference:
Geeksforgeeks
IBM

CPU and memory communication

I'm programmer-beginner, but I want to understand the things a bit more deeply. I did some research and read quite a lot of text, but I'm still yet to understand some things..
When coding a basic thing (in C):
int myNumber;
myNumber = 3;
printf("Here's my number: %d", myNumber);
I found out that (mainly on 32-bit CPU) integer takes place of 32 bits = 4 bytes. So at first line of my code CPU goes into the memory. The memory is byte-addressable, so CPU chooses 4 continuous bytes for my variable and stores the address to first (or last) byte.
On the second line of my code CPU uses his stored address of the MyNumber variable, goes to that address in the memory and finds there 32 bits of reserved space. His task now is to store there the number "3", so he fills those four bytes with the sequence 00000000-00000000-00000000-00000011.
On the third line it does the same - CPU goes to that address in memory and loads the number stored in that address.
(First question - Do I understand it right?)
What I don't understand is this:
The size of that address (pointer to that variable) is 4bytes in 32-bit CPU. (Thats why 32-bit CPU can use max 4GB of memory - because there are only 2^32 different addresses of binary length 32)
Now, where the CPU stores these addresses? Does he have some sort of its own memory or cache to store that? And why it stores the 32 bit long address to 32 bit long integer? Wouldn't it be better to simply store in its cache that actual number than the pointer to that when the sizes are the same?
And last one - if it stores somewhere in its own cache the addresses to all those integers and the lenghts are the same (4 bytes), it will need exactly the same space for storing the addresses as for the actual variables. But variables can take up to 4GBs of space so CPU must have 4GB of its own space to store the addresses to those variables. And that sounds strange..
Thank you for help!
I'm trying to understand that but it's so tough.. :-[
(First question - Do I understand it right?)
The first thing to recognise is that the value might not be stored in main memory at all. The compiler might decide to store it in a register instead, as this is more optimal.1
The memory is byte-addressable, so CPU chooses 4 continuous bytes for my variable and stores the address to first (or last) byte.
Assuming that the compiler does decide to store it in main memory, then yes, on a 32-bit machine, an int is typically 4 bytes, so 4 bytes will be allocated for storage.
The size of that address (pointer to that variable) is 4bytes in 32-bit CPU. (Thats why 32-bit CPU can use max 4GB of memory - because there are only 2^32 different addresses of binary length 32)
Note that the width of an int and the width of a pointer don't have to be the same, so there's not necessarily a connection with the size of the address space.
Now, where the CPU stores these addresses?
In the case of local variables, the address is effectively hardcoded into the executable itself, typically as an offset from the stack pointer.
In the case of dynamically-allocated objects (i.e. stuff that's been malloc-ed), the programmer typically maintains a corresponding pointer variable (otherwise there would be a memory leak!). That pointer might also be dynamically-allocated (in the case of a complex data structure), but if you go back far enough, you'll eventually reach something that's a local variable. In which case, the above rule applies.
But variables can take up to 4GBs of space so CPU must have 4GB of its own space to store the addresses to those variables.
If your program consists of independently malloc-ing millions of ints, then yes, you'd end up with just as much storage required for the pointers. But most programs don't look like that. You typically allocate much bigger objects (like an array, or a big struct).
cache
The specifics of where stuff is stored is architecture-specific. On a modern x86, there's typically 2 or 3 layers of cache sitting between the CPU and main memory. But the cache is not independently addressable; the CPU cannot decide to store the int in cache instead of main memory. Rather, the cache is effectively a redundant copy of a subset of main memory.
Another thing to consider is that the compiler will typically deal with virtual addresses when allocating storage for objects. On a modern x86, these are mapped to physical addresses (i.e. addresses that correspond to physical storage bytes in main memory) by dedicated hardware, along with OS support.
1. Alternatively, the compiler may be able to optimise it away entirely.
On the second line of my code CPU uses his stored address of the MyNumber variable, goes to that address in the memory and finds there 32 bits of reserved space.
Nearly correct. Memory is basically unstructured. The CPU can't see that there are 32 bits of "reserved space". But the CPU was instructed to read 32 bits of data, so it reads 32 bits of data starting from the specified address. Then it just has to hope/assume that those 32 bits actually contain something meaningful.
Now, where the CPU stores these addresses? Does he have some sort of its own memory or cache to store that? And why it stores the 32 bit long address to 32 bit long integer? Wouldn't it be better to simply store in its cache that actual number than the pointer to that when the sizes are the same?
The CPU has a small number of registers, which it can use to store data (common CPUs have 8, 16 or 32 registers, so they can only hold the particular variables that you're working with here and now). So to answer the last part first, yes, the compiler certainly might (and probably will) generate code to just store your int into a register, instead of storing it in a memory, and telling the CPU to load it from a specified address.
As for the other part of the question: ultimately, every part of the program is stored in memory. Part of it is a stream of instructions, and part of it is in chunks of data scattered around memory.
There are a few tricks that help with locating the data the CPU needs: part of the program's memory contains the stack, which typically stores local variables while they're in scope. The CPU always maintains a pointer to the top of the stack in one of its registers, so it can easily locate data on the stack, simply by modifying the stack pointer with a fixed offset. Instructions can directly contain such offsets, so in order to read your int, the compiler could for example generate code which writes the int to the top of the stack when you enter the function, and then when you need to refer to that function, have code which reads the data found at the address the stack pointer points to, plus the small offset needed to locate your variable.
And also keep in mind that the adresses your program sees may not be (or rather rarely are) physical adresses starting from the 'beginning of the memory' or 0. Mostly they are offsets into a specifc memory block where the memory manager knows the real address and the accesses via base+offest as the real data storage.
And we do need the memory since caches are limited ;-)
Mario
Internal to the CPU there is one register that contains the address of the next instruction to be executed. The instructions themselves keep information where the variable is. If the variable is optimized, the instruction may point to a register, but in general, the instruction will have an address of the variable being accessed. Your code, after compiled and loaded in memory, have all that embedded! I recommend looking into assembly language to get a better understanding of all that. Good luck!

How are arrays and hash maps constant time in their access?

Specifically: given a hash (or an array index), how does the machine get to the data in constant time?
It seems to me that even passing by all the other memory locations (or whatever) would take an amount of time equal to the number of locations passed (so linear time). A coworker has tried valiantly to explain this to me but had to give up when we got down to circuits.
Example:
my_array = new array(:size => 20)
my_array[20] = "foo"
my_array[20] # "foo"
Access of "foo" in position 20 is constant because we know which bucket "foo" is in. How did we magically get to that bucket without passing all the others on the way? To get to house #20 on a block you would still have to pass by the other 19...
How did we magically get to that
bucket without passing all the others
on the way?
"We" don't "go" to the bucket at all. The way RAM physically works is more like broadcasting the bucket's number on a channel on which all buckets listen, and the one whose number was called will send you its contents.
Calculations happen in the CPU. In theory, the CPU is the same "distance" from all memory locations (in practice it's not, because of caching, which can have a huge impact on performance).
If you want the gritty details, read "What every programmer should know about memory".
Then to understand you have to look at how memory is organized and accessed. You may have to look at the way an address decoder works. The thing is, you do NOT have to pass by all the other addresses to get to the one you want in the memory. You can actually jump to the one you want. Otherwise our computers would be really really slow.
Unlike a turing machine, which would have to access memory sequentially, computers use random access memory, or RAM, which means if they know where the array starts and they know they want to access the 20th element of the array, they know what part of memory to look at.
It is less like driving down a street and more like picking the correct mail slot for your apartment in a shared mailbox.
2 things are important:
my_array has information about where in memory computer must jump to get this array.
index * sizeof type gets offset from beginning of array.
1 + 2 = O(1) where data can be found
Big O doesn't work like that. It's supposed to be a measure of how much computational resources are used by a particular algorithm and function. It's not meant to measure the amount of memory used and if you are talking about traversing that memory, it's still a constant time. If I need to find the second slot of an array it's a matter of adding an offset to a pointer. Now, if I have a tree structure and I want to find a particular node you are now talking about O(log n) because it doesn't find it on the first pass. On average it takes O(log n) to find that node.
Let's discuss this in C/C++ terms; there's some additional stuff to know about C# arrays but it's not really relevant to the point.
Given an array of 16-bit integer values:
short[5] myArray = {1,2,3,4,5};
What's really happened is that the computer has allocated a block of space in memory. This memory block is reserved for that array, is exactly the size needed to hold the full array (in our case 16*5 == 80 bits == 10 bytes), and is contiguous. These facts are givens; if any or none of them are true in your environment at any given time, you're generally at risk for your program crashing due to an access vialoation.
So, given this structure, what the variable myArray really is, behind the scenes, is the memory address of the start of the block of memory. This is also, conveniently, the start of the first element. Each additional element is lined up in memory right after the first, in order. The memory block allocated for myArray might look like this:
00000000000000010000000000000010000000000000001100000000000001000000000000000101
^ ^ ^ ^ ^
myArray([0]) myArray[1] myArray[2] myArray[3] myArray[4]
It is considered a constant-time operation to access a memory address and read a constant number of bytes. As in the above figure, you can get the memory address for each one if you know three things; the start of the memory block, the memory size of each element, and the index of the element you want. So, when you ask for myArray[3] in your code, that request is turned into a memory address by the following equation:
myArray[3] == &myArray+sizeof(short)*3;
Thus, with a constant-time calculation, you have found the memory address of the fourth element (index 3), and with another constant-time operation (or at least considered so; actual access complexity is a hardware detail and fast enough that you shouldn't care) you can read that memory. This is, if you've ever wondered, why indexes of collections in most C-style languages start at zero; the first element of the array starts at the location of the array itself, no offset (sizeof(anything)*0 == 0)
In C#, there are two notable differences. C# arrays have some header information that is of use to the CLR. The header comes first in the memory block, and the size of this header is constant and known, so the addressing equation has just one key difference:
myArray[3] == &myArray+headerSize+sizeof(short)*3;
C# doesn't allow you to directly reference memory in its managed environment, but the runtime itself will use something like this to perform memory access off the heap.
The second thing, which is common to most flavors of C/C++ as well, is that certain types are always dealt with "by reference". Anything you have to use the new keyword to create is a reference type (and there are some objects, like strings, that are also reference types although they look like value types in code). A reference type, when instantiated, is placed in memory, doesn't move, and is usually not copied. Any variable that represents that object is thus, behind the scenes, just the memory address of the object in memory. Arrays are reference types (remember myArray was just a memory address). Arrays of reference types are arrays of these memory addresses, so accessing an object that is an element in an array is a two-step process; first you calculate the memory address of the element in the array, and get that. That is another memory address, which is the location of the actual object (or at least its mutable data; how compound types are structured in memory is a whole other can o' worms). This is still a constant-time operation; just two steps instead of one.

Resources