How to identify live objects while traversing the heap? - c

(Context: The system I am working on already maintains a form of garbage collection. I'm working on compaction.)
Most compaction algorithms follow a basic structure:
Find first object
Move object to beginning of heap
Find second object
Move second object to address right after first object
Rinse and repeat
This algorithm is followed in section 2.2 of this paper except using two pointers, denoted "from" and "to". Essentially the FROM pointer traverses the heap until it finds live objects. Then it moves said object to the TO pointer. Then TO is incremented accordingly.
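The FROM/TO scheme can be sketched as below. This is only an illustration under assumptions that are not in the paper or the question: the heap is a plain array of words, every object starts with a two-word header (total size in words, then a mark flag that some earlier mark phase has already set), and the names `compact` and `put_object` are hypothetical. It does not fix up pointers to moved objects, which is a separate step.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical object layout: word 0 = total size in words (header included),
   word 1 = mark flag set by an earlier mark phase, then the payload. */
enum { HDR_SIZE = 0, HDR_MARK = 1, HDR_WORDS = 2 };

/* Slide every marked object down to the lowest free address.
   Returns the new top of the heap, in words.
   Assumes every size field is at least HDR_WORDS. */
size_t compact(size_t *heap, size_t top)
{
    size_t from = 0;   /* visits every object, live or dead */
    size_t to = 0;     /* next free slot in the compacted area */

    while (from < top) {
        size_t size = heap[from + HDR_SIZE];   /* read before moving */
        if (heap[from + HDR_MARK]) {
            if (to != from)
                memmove(&heap[to], &heap[from], size * sizeof heap[0]);
            heap[to + HDR_MARK] = 0;           /* clear mark for the next cycle */
            to += size;
        }
        from += size;
    }
    return to;
}

/* Bump-allocate an object with one payload word; returns its offset. */
size_t put_object(size_t *heap, size_t *top, size_t payload, int live)
{
    size_t at = *top;
    heap[at + HDR_SIZE] = HDR_WORDS + 1;
    heap[at + HDR_MARK] = (size_t)live;
    heap[at + HDR_WORDS] = payload;
    *top += HDR_WORDS + 1;
    return at;
}
```

Filling the heap with a live, a dead, and a live object and then calling `compact` leaves the two live objects adjacent at the bottom of the array. The hard part the sketch sidesteps is where the mark flag comes from.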
The algorithm is simple, but I have yet to find much information on how these pointers determine what is a "live object". This article discusses the creation of a basic mark-and-sweep garbage collector that runs through the stack, recursively going to each reference and marking them as live. The article however requires a linked list of ALL objects ever allocated. However, this is because the author is more or less creating their own VM.
My question is, is there a way of traversing a heap in C and identifying whether the current object is a live object? Is there a similar linked list of all allocated objects already in C that I could use? Or will I require more overhead?

My question is, is there a way of traversing a heap in C and identifying whether the current object is a live object?
At a high level, the process is looking at all active pointers and determining whether or not each piece of allocated memory is accessible. (Please note that this is very complicated in C, in part because a pointer could be stored in an int or another data type.) If the memory is accessible via a pointer, then it is "live" in your terms. If not, then garbage collectors would consider it safe to free that memory.
If you're asking whether or not C has a native function for determining whether or not some allocated memory can be reached, then the answer is no.
Is there a similar linked list of all allocated objects already in C that I could use? Or will I require more overhead?
Again, if you're looking for a linked list that C natively provides and you can access, then the answer is no. You'd need to implement these things.
Forgive me if you've already seen this, but there are garbage collectors that you can download if you want to see how others have done it.

TL;DR: It's impossible.
To make that work, you need to solve some non-trivial problems:
1. Be able to name the live objects of the heap. That means to find and follow recursively all pointers in global variables and on the stack.
2. Move the live objects downwards to create a compact heap.
3. Adjust pointers in your program to reflect the new locations of the moved objects.
Regarding 1.: At runtime, the C language doesn't help you to identify where you have pointer-type global variables. And on the stack, you find a mixture of e.g. integers, function-call return addresses or data pointers. For both memory areas, you have to find a way to enumerate all potential pointer values.
To make things worse, a pointer can not only point to the beginning of your data structure, but also to some inside element. And this pointer also makes the whole object "live".
Regarding 2.: That's the easy part, using the algorithm you mentioned.
Regarding 3.: Now your objects live at new addresses, so your old pointer values are no longer correct (pointing to the old locations), and you have to adjust them. So once again, you have to follow all root references (like in 1.) and adjust all pointers that are affected by your moves. But as you can't tell for sure if e.g. 0x12345678 was meant as a numeric integer or as an (old-location) address, changing that to the new-location address might break some computation.

Related

Can a heap object whose initial address is not stored be protected from garbage collection?

Suppose that
I am modifying someone else's C program;
a garbage collector is active;
there exists an object on the heap I do not want the garbage collector to reap; and
the object lives until the program exits, so it is unnecessary to free() it.
Must I store the object's initial address? Suppose that I don't care about the initial address. Suppose instead that I only care about some pointers into the object's interior, and that these pointers are all I store. Suppose that I throw the initial address away.
Will the garbage collector reap my object?
ADDITIONAL INFORMATION
The program does not now collect garbage as far as I know. However, if a future revision of the program began to collect garbage, then the code I am adding today might suddenly turn into a hard-to-find bug. I don't want to make a hard-to-find bug; but the program is an old, stable program thousands of users have used for many years. The program is thus known to function acceptably under a wide variety of real conditions. Redesign is not an option.
The program employs global data structures it never bothers to free(). This is the design within which I must work.
If you want to know: the pointers I wish to store—the pointers that point into their objects' interiors—happen to point to words within an ASCII string. I care only about the words, not about the whole string. Especially, I don't care about whitespace at the start of the string, which is why I don't care about the string's initial address; but a garbage collector might inadvertently care, mightn't it?
It seems silly to store a linked list of pointers neither I nor anyone else will ever use, just to fend off a hypothetical garbage collector that does not exist; but I'd store the list if truly necessary.
Or is my concern groundless? Does no one ever add garbage collection to old C programs, anyway?
Let us set aside the fact that a program that uses allocate() but not free() can hardly be called "stable";
Let us set aside the fact that adding garbage collection to an existing large C program that works is one of those "if there you go, only pain will you find" situations.
The answer to your question is:
It depends on how your garbage collector works.
If it is exotic, it may be sweeping memory and looking for pointers that point anywhere within the heap, not just to beginnings of memory blocks. In this case, you are covered, since a pointer pointing in the middle of a string will be enough to keep the string anchored in memory. (Prevent it from being garbage-collected.)
If it is not that exotic, then it will only be looking for pointers that point to beginnings of memory blocks. (Which is a rather sensible thing to do.) In which case, no, your object is not anchored unless you maintain a pointer to the object itself.
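The two policies differ only in how a candidate pointer is matched against the collector's record of allocations. A minimal sketch, assuming a hypothetical `struct block` table that records each allocation's start address and size (no real collector's API is implied):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record a collector might keep per allocation. */
struct block {
    uintptr_t start;
    size_t size;
};

/* Strict policy: only an exact match on the start address anchors a block. */
int find_by_base(const struct block *blocks, size_t n, uintptr_t p)
{
    for (size_t i = 0; i < n; i++)
        if (blocks[i].start == p)
            return (int)i;
    return -1;
}

/* Interior-pointer policy: any address inside the block anchors it. */
int find_by_interior(const struct block *blocks, size_t n, uintptr_t p)
{
    for (size_t i = 0; i < n; i++)
        if (p >= blocks[i].start && p - blocks[i].start < blocks[i].size)
            return (int)i;
    return -1;
}
```

With a string such as "   hello world" registered as one block, a pointer to the word "hello" is found by `find_by_interior` but not by `find_by_base`, which is exactly the situation the question worries about.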
Personally, I wouldn't even try an exotic garbage collector, but that's just me.

Garbage collection issue in an interpreter implemented in C

I'm working on a hobby compiler/interpreter for a toy procedural language and I've implemented most of the features I set out to explore except for a good garbage collection algorithm (similar to this guy). I've read quite a bit about various algorithms and I have a general idea of how to implement them. Earlier iterations of my language runtime used reference counting but I dropped it to learn something more advanced so I'm now considering a mark and copy compacting algorithm.
My first issue in getting started is preventing the algorithm from collecting 'objects' in native extension functions (i.e. functions written in C). The root set consists of 'objects' on the interpreter's stack and 'objects' in symbol tables, and I shouldn't have too much trouble with these, however, if a container 'object' is created in a C function, then populated with child 'objects', how can I prevent the GC from collecting them since it's not actually on the interpreter stack or bound to a symbol?
Things that make implementing GC easier:
All 'objects' in my language are of a builtin type (e.g. not object oriented)
The interpreter stack is just a stack of pointers to structs
Symbol tables are just arrays of pointers to structs
User code:
f = open('words.txt', 'r');
lines = readlines(f);
close(f);
Interpreter (after parsing, compiling to bytecode...):
push filename, open_mode
call builtin_fopen which returns a struct wrapping a FILE*
store result in symbol f
push symbol f
call builtin_flines which creates a list type l, then uses C fread to read each line
of the file as a string type, appending it to the list l
store result in symbol lines, and so on....
Now if the GC ran while one of the strings containing a line in the file was being allocated, the root set does not yet have any reference to l, so l would get collected.
Any ideas on how to handle this better?
Dedicate a separate contiguous allocation arena for the interpreter's heap. Never collect anything outside of the arena.
You always have the arena's current top (assuming it grows from lower to higher addresses). Everything above the top is not collectible but considered in the root set. A builtin function that has to allocate several linked objects allocates them above the top, then moves the top up so that all the allocated objects end up in the collectible heap at once. If the collection happens in the midst of the function execution, objects above the top are moved to the new heap all at once.
Since I'm the original "this" guy you mentioned, I thought I could give you some insight on your first issue based on what I've designed so far in my project (I promise I will blog about it eventually). So first, all memory allocation goes through a mutator function. The input arguments are the type of object you are creating, and a reference to a pointer-type object that will then point to it. That pointer object is updated at the time the new object is created. If an object is being allocated for the exclusive use of a C function in the interpreter runtime, then it is a root object. In this case, NULL is passed as the second argument, and the object is added to the list of root objects. Later on, if that internal function no longer needs the object, it has to remove that object from the root-objects list. (It doesn't de-allocate the object itself, as that will be handled by the garbage collection routine eventually.) Oh, and the interpreter stack itself is also an object within the interpreter (a list-type or array-type object), so a pointer to it is also in the root-objects list (again, another list-type object that is also known to the interpreter). A pointer to the root-objects list is the only pointer that the garbage collector needs to know about.
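A rough sketch of that allocation convention follows. It is not the answerer's actual code: the names `gc_alloc`, `gc_unroot`, and `gc_root_count` are hypothetical, the "pointer object" is simplified to a plain C pointer slot, and the root list is an intrusive linked list rather than an interpreter-level list object.

```c
#include <stdlib.h>

enum obj_type { OBJ_STRING, OBJ_LIST };

struct object {
    enum obj_type type;
    struct object *next_root;   /* link in the root list, if rooted */
};

static struct object *root_list = NULL;

/* Allocate an object. If slot is non-NULL, the owning pointer is updated;
   if slot is NULL, the object is registered as a root instead. */
struct object *gc_alloc(enum obj_type type, struct object **slot)
{
    struct object *obj = calloc(1, sizeof *obj);
    if (obj == NULL)
        return NULL;
    obj->type = type;
    if (slot != NULL) {
        *slot = obj;
    } else {
        obj->next_root = root_list;
        root_list = obj;
    }
    return obj;
}

/* Drop an object from the root list without freeing it;
   reclaiming it is left to the collector. */
void gc_unroot(struct object *obj)
{
    for (struct object **link = &root_list; *link != NULL;
         link = &(*link)->next_root) {
        if (*link == obj) {
            *link = obj->next_root;
            obj->next_root = NULL;
            return;
        }
    }
}

/* Count the registered roots (used here only for illustration). */
size_t gc_root_count(void)
{
    size_t n = 0;
    for (struct object *o = root_list; o != NULL; o = o->next_root)
        n++;
    return n;
}
```

A builtin such as builtin_flines would call `gc_alloc(OBJ_LIST, NULL)` for its working list, which keeps the list rooted while the lines are read, and call `gc_unroot` once the list has been handed back to the interpreter.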
Also, as for when to start a garbage collection run -- since memory is effectively unlimited on modern architectures, I've decided to kick off the garbage collector when X number of objects has been allocated. After running you have Y objects left. If Y is still greater than Z percent of X, then X gets bumped up enough that it no longer is. Then I just hope that malloc() never fails (if it does, I just dump an error out and exit the interpreter).
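One reading of that heuristic, as a sketch: after a collection, if the survivors Y exceed Z percent of the threshold X, raise X to the smallest value for which they no longer do. The function name `next_threshold` and the exact rounding are assumptions, and overflow of the multiplications is ignored.

```c
#include <stddef.h>

/* x = allocation count that triggers a collection,
   y = objects that survived the last collection,
   z_percent = largest acceptable survivor share of x, in percent.
   Returns the threshold to use for the next cycle. */
size_t next_threshold(size_t x, size_t y, unsigned z_percent)
{
    /* y > z% of x  is the same as  100 * y > z * x */
    if (z_percent > 0 && y * 100 > x * z_percent) {
        /* smallest x with y <= z% of x, i.e. ceil(100 * y / z) */
        return (y * 100 + z_percent - 1) / z_percent;
    }
    return x;
}
```

For example, with X = 1000 and Z = 50, a collection that leaves 800 survivors raises the threshold to 1600, while one that leaves 100 survivors keeps it at 1000.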
Hope this helps, and hopefully someone else will add more clarifications since I am more of an amateur when it comes to language / interpreter design.
You need to provide your native functions with an interface via which they can tell the garbage collector what objects they have references to, and then have them use that interface.
The easiest way is probably to not let the native code have direct pointers to interpreter/garbage collected data at all. Instead, you give the native code a handle to the object and have it call back into the runtime to get values from an object. In your example, builtin_flines would call the runtime to allocate a list and get back a handle to it. It would then read lines, and call the runtime to append each one to the list, finally returning the complete list. The runtime would manage all the handles for a given native call, freeing them up after the native call returns.
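The handle idea might look roughly like this. All names (`handle_scope`, `handle_new`, `handle_get`, `handle_relocate`, `handle_scope_end`) and the fixed table size are assumptions for illustration. The relocation function is included because the question mentions a copying collector: native code holds only the index, so the runtime is free to rewrite the slot.

```c
#include <stddef.h>

#define MAX_HANDLES 64

typedef int handle_t;   /* index into the scope's table, -1 on failure */

/* One table per native call; the collector treats every used slot as a root. */
struct handle_scope {
    void *slots[MAX_HANDLES];
    size_t used;
};

/* Give native code a handle instead of a raw pointer. */
handle_t handle_new(struct handle_scope *s, void *obj)
{
    if (s->used == MAX_HANDLES)
        return -1;
    s->slots[s->used] = obj;
    return (handle_t)s->used++;
}

/* Resolve a handle; returns NULL for an invalid or expired handle. */
void *handle_get(const struct handle_scope *s, handle_t h)
{
    if (h < 0 || (size_t)h >= s->used)
        return NULL;
    return s->slots[h];
}

/* Called by the collector after it moves an object. */
void handle_relocate(struct handle_scope *s, handle_t h, void *new_addr)
{
    if (h >= 0 && (size_t)h < s->used)
        s->slots[h] = new_addr;
}

/* Called by the runtime when the native function returns. */
void handle_scope_end(struct handle_scope *s)
{
    s->used = 0;
}
```

The cost is an extra indirection on every access from native code, in exchange for the runtime knowing every reference a native function holds.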
Some complications:
When you input a line to be interpreted like
100 if X then gosub 5000
But 5000 does not exist yet, you are spaghetti coding...
Maybe x does not have any assigned value or data type yet.
If we don't index now, are we going to wait till someone
types "run" or executes a line directly from the prompt?
if we do index now to speed things up later, how will we
know the last instance of "100" or "X" or "5000" gets
removed?
What entry do we make in the master index of "things"?
Assuming these things may include lines of basic code,
strings, and other variables we want to handle by name
or line number.
We want to find quickly, and use to strategically identify
garbage collection potential when the need for collection
arises.
How much static space do we burn on the index of things
that may change in size? Which details besides label,
location, and length are useful enough to justify indexing?
Should we attempt to index empty space when a variable
shrinks? Or just index the variable's largest historical
size along with its current size? How do we identify those
variables that change in size most frequently, and should
we avoid cleaning them, or even deliberately pad them?
When do we clean up the entire mess? or is it better to
defrag only enough free space to squeeze in something that
cannot be otherwise jammed sideways into an existing hole?
Purposeful delays and waiting for "input" seem good targets
that we might exploit to proactively clean up some of the
mess. There is no assurance any basic program will have such
deadtime.
Sorry this is not an answer, but the original question seems
to invite some brainstorming towards a better scheme. We need
a clear strategy that requires defining the entire problem.

Finding roots for garbage collection in C

I'm trying to implement a simple mark and sweep garbage collector in C. The first step of the algorithm is finding the roots. So my question is how can I find the roots in a C program?
In the programs using malloc, I'll be using the custom allocator. This custom allocator is all that will be called from the C program, plus maybe a custom init().
How does a garbage collector know what all the pointers (roots) are in the program? Also, given a pointer of a custom type, how does it get all the pointers inside that?
For example, if there's a pointer p pointing to a class list, which has another pointer inside it, say q, how does the garbage collector know about it, so that it can mark it?
Update: How about if I send all the pointer names and types to GC when I init it? Similarly, the structure of different types can also be sent so that GC can traverse the tree. Is this even a sane idea or am I just going crazy?
First off, garbage collectors in C, without extensive compiler and OS support, have to be conservative, because you cannot distinguish between a legitimate pointer and an integer that happens to have a value that looks like a pointer. And even conservative garbage collectors are hard to implement. Like, really hard. And often, you will need to constrain the language in order to get something acceptable: for instance, it might be impossible to correctly collect memory if pointers are hidden or obfuscated. If you allocate 100 bytes and only keep a pointer to the tenth byte of the allocation, your GC is unlikely to figure out that you still need the block since it will see no reference to the beginning. Another very important constraint to control is the memory alignment: if pointers can be on unaligned memory, your collector can be slowed down by a factor of 10x or worse.
To find roots, you need to know where your stacks start, and where your stacks end. Notice the plural form: each thread has its own stack, and you might need to account for that, depending on your objectives. To know where a stack starts, without entering into platform-specific details (that I probably wouldn't be able to provide anyways), you can use assembly code inside the main function of the current thread (just main in a non-threaded executable) to query the stack register (esp on x86, rsp on x86_64 to name those two only). Gcc and clang support a language extension that lets you assign a variable permanently to a register, which should make it easy for you:
register void* stack asm("esp"); // replace esp with the name of your stack reg
(register is a standard language keyword that is most of the time ignored by today's compilers, but coupled with asm("register_name"), it lets you do some nasty stuff.)
To ensure you don't forget important roots, you should defer the actual work of the main function to another one. (On x86 platforms, you can also query ebp/rbp, the stack frame base pointers, instead, and still do your actual work in the main function.)
int main(int argc, const char** argv, const char** envp)
{
    register void* stack asm("esp");
    // put stack somewhere
    return do_main(argc, argv, envp);
}
Once you enter your GC to do collection, you need to query the current stack pointer for the thread you've interrupted. You will need design-specific and/or platform-specific calls for that (though if you get something to execute on the same thread, the technique above will still work).
The actual hunt for roots starts now. Good news: most ABIs will require stack frames to be aligned on a boundary greater than the size of a pointer, which means that if you trust every pointer to be on aligned memory, you can treat your whole stack as an intptr_t* and check if any pattern inside looks like any of your managed pointers.
Obviously, there are other roots. Global variables can (theoretically) be roots, and fields inside structures can be roots too. Registers can also have pointers to objects. You need to separately account for global variables that can be roots (or forbid that altogether, which isn't a bad idea in my opinion) because automatic discovery of those would be hard (at least, I wouldn't know how to do it on any platform).
These roots can lead to references on the heap, where things can go awry if you don't take care.
Since not all platforms provide malloc introspection (as far as I know), you need to implement the concept of scanned memory--that is, memory that your GC knows about. It needs to know at least the address and the size of each of such allocation. When you get a reference to one of these, you simply scan them for pointers, just like you did for the stack. (This means that you should take care that your pointers are aligned. This is normally the case if you let your compiler do its job, but you still need to be careful when you use third-party APIs).
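A sketch of that scan, under the assumption that the collector keeps a table of `struct tracked` entries (address, size, mark flag) for its own allocations; the names are hypothetical. The same function serves for a stack range and for the contents of a tracked block, since both are treated as aligned pointer-sized words.

```c
#include <stddef.h>
#include <stdint.h>

/* What the collector knows about each allocation it manages. */
struct tracked {
    uintptr_t start;
    size_t size;
    int marked;
};

/* Treat [begin, end) as aligned pointer-sized words and mark every tracked
   block that some word appears to point into.
   Returns the number of blocks newly marked by this call. */
size_t scan_range(const uintptr_t *begin, const uintptr_t *end,
                  struct tracked *blocks, size_t n)
{
    size_t newly_marked = 0;
    for (const uintptr_t *w = begin; w < end; w++) {
        for (size_t i = 0; i < n; i++) {
            if (!blocks[i].marked &&
                *w >= blocks[i].start &&
                *w - blocks[i].start < blocks[i].size) {
                blocks[i].marked = 1;
                newly_marked++;
            }
        }
    }
    return newly_marked;
}
```

A full mark phase would call `scan_range` on each stack first and then on every newly marked block, repeating until a pass marks nothing new. The linear search over the table is only for brevity; a sorted table or address map would be the obvious replacement.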
This also means that you cannot put references to collectable memory to places where the GC can't reach it. And this is where it hurts the most and where you need to be extra-careful. Otherwise, if your platform supports malloc introspection, you can easily tell the size of each allocation you get a pointer to and make sure you don't overrun them.
This just scratches the surface of the topic. Garbage collectors are extremely complex, even when single-threaded. When you add threads to the mix, you enter a whole new world of hurt.
Apple has implemented such a conservative GC for the Objective-C language and dubbed it libauto. They have open-sourced it, along with a good part of the low-level technologies of Mac OS X, and you can find the source here.
I can only quote Hot Licks here: good luck!
Okay, before I go even further, I forgot something very important: compiler optimizations can break the GC. If your compiler is not aware of your GC, it can very well never put certain roots on the stack (only dealing with them in registers), and you're going to miss them. This is not too problematic for single-threaded programs if you can inspect registers, but again, a huge mess for multithreaded programs.
Also be very careful about the interruptibility of allocations: you must make sure that your GC cannot kick in while you're returning a new pointer because it could collect it right before it is assigned to a root, and when your program resumes it would assign that new dangling pointer to your program.
And here's an update to address the edit:
Update: How about if I send all the pointer names and types to GC when
I init it? Similarly, the structure of different types can also be
sent so that GC can traverse the tree. Is this even a sane idea or am
I just going crazy?
I guess you could allocate your memory then register it with the GC to tell it that it should be a managed resource. That would solve the interruptibility problem. But then, be careful about what you send to third-party libraries, because if they keep a reference to it, your GC might not be able to detect it since they won't register their data structures with your GC.
And you likely won't be able to do that with roots on the stack.
The roots are basically all static and automatic object pointers. Static pointers would be linked inside the load modules. Automatic pointers must be found by scanning stack frames. Of course, you have no idea where in the stack frames the automatic pointers are.
Once you have the roots you need to scan objects and find all the pointers inside them. (This would include pointer arrays.) For that you need to identify the class object and somehow extract from it information about pointer locations. Of course, in C many objects are not virtual and do not have a class pointer within them.
Good luck!!
Added: One technique that could vaguely make your quest possible is "conservative" garbage collection. Since you intend to have your own allocator, you can (somehow) keep track of allocation sizes and locations, so you can pick any pointer-sized chunk out of storage and ask "Might this possibly be a pointer to one of my objects?" You can, of course, never know for sure, since random data might "look like" a pointer to one of your objects, but still you can, through this mechanism, scan a chunk of storage (like a frame in the call stack, or an individual object) and identify all the possible objects it might address.
With a conservative collector you cannot safely do object relocation/compaction (where you modify pointers to objects as you move them) since you might accidentally modify "random" data that looks like an object pointer but is in fact meaningful data to some application. But you can identify unused objects and free up the space they occupy for reuse. With proper design it's possible to have a very effective non-compacting GC.
(However, if your version of C allows unaligned pointers scanning could be very slow, since you'd have to try every variation on byte alignment.)
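The cost difference can be seen in a small sketch. The hypothetical `count_matches` below looks for one pointer value in a buffer, reading at a configurable stride: a stride of `sizeof(uintptr_t)` models the aligned scan, a stride of 1 models trying every byte offset, which does `sizeof(uintptr_t)` times as many reads.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the positions in buf whose pointer-sized contents equal target.
   step = sizeof(uintptr_t) scans aligned words only; step = 1 tries
   every byte offset. */
size_t count_matches(const unsigned char *buf, size_t len,
                     uintptr_t target, size_t step)
{
    size_t hits = 0;
    if (step == 0)
        return 0;
    for (size_t off = 0; off + sizeof target <= len; off += step) {
        uintptr_t word;
        memcpy(&word, buf + off, sizeof word);   /* safe unaligned read */
        if (word == target)
            hits++;
    }
    return hits;
}
```

A pointer stored at byte offset 1 of a buffer is invisible to the aligned scan and found only by the byte-by-byte one.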

How to implement Reference counting in C?

read about it here.
I need to implement a variation of such an interface. Say we are given a large memory space to manage: there should be getmem(size) and free(pointer to block) functions, and free(pointer to block) must actually free the memory if and only if all processes using that block are done using it.
What I was thinking about doing is to define a Collectable struct holding a pointer to the block, its size, and a count of the processes using it. Then whenever a process uses a Collectable struct instance for the first time, it has to explicitly increment the count, and whenever the process free()'s it, the count is decremented.
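A minimal single-threaded sketch of such a struct is shown below. The function names are hypothetical, the allocating caller is assumed to count as the first user, and real cross-process sharing would additionally need shared memory and atomic updates of the count, which this ignores.

```c
#include <stdlib.h>

struct collectable {
    void *block;      /* the managed memory */
    size_t size;      /* its size in bytes */
    unsigned users;   /* how many users currently hold it */
};

/* Allocate a block; the caller is counted as its first user. */
struct collectable *collectable_new(size_t size)
{
    struct collectable *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->block = malloc(size);
    if (c->block == NULL && size != 0) {
        free(c);
        return NULL;
    }
    c->size = size;
    c->users = 1;
    return c;
}

/* A new user announces itself. */
void collectable_retain(struct collectable *c)
{
    c->users++;
}

/* A user is done. Returns 1 if this call actually freed the memory. */
int collectable_release(struct collectable *c)
{
    if (--c->users > 0)
        return 0;
    free(c->block);
    free(c);
    return 1;
}
```

The memory is returned only by the release call that brings the count to zero, which is the "if and only if all users are done" condition from the question.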
The problem with this approach is that all processes must respond to that interface and make it explicitly work : whenever assigning collectable pointer to an instance the process must explicitly inc that counter, which does not satisfy me, I was thinking maybe there is a way to create a macro for this to happen implicitly in every assignment?
I'm seeking of ways to approach this problem for a while, so other approaches and ideas would be great...
EDIT: the above approach doesn't satisfy me, not only because it doesn't look nice but mostly because I can't assume a running process's code would care about updating my count. I need a way to make sure it's done without changing the process's code...
An early problem with reference counting is that it is relatively easy to count the initial reference by putting code in a custom malloc / free implementation, but it is quite a bit harder to determine if the initial recipient passes that address around to others.
Since C lacks the ability to override the assignment operator (to count the new reference), basically you are left with a limited number of options. The only one that can possibly override the assignment is macrodef, as it has the ability to rewrite the assignment into something that inlines the increment of the reference count value.
So you need to "expand" a macro that looks like
a = b;
into
if (b is a pointer) { // this might be optional, if lookupReference does this work
    struct ref_record* ref_r = lookupReference(b);
    if (ref_r) {
        ref_r->count++;
    } else {
        // error
    }
}
a = b;
The real trick will be in writing a macro that can identify the assignment, and insert the code cleanly without introducing other unwanted side-effects. Since macrodef is not a complete language, you might run into issues where the matching becomes impossible.
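Since the preprocessor cannot recognize a plain `a = b;`, the closest practical form is a macro the programmer must call explicitly. The sketch below assumes a hypothetical fixed-size table behind `lookupReference` and a hypothetical `trackReference` that a custom malloc would call. Unlike the expansion above, it silently skips untracked addresses instead of reporting an error, and it does not decrement the count for the value previously held in `a`.

```c
#include <stddef.h>

struct ref_record {
    const void *addr;
    int count;
};

#define MAX_REFS 16
static struct ref_record ref_table[MAX_REFS];
static size_t ref_used = 0;

/* Start tracking an address, as a custom malloc wrapper might do. */
struct ref_record *trackReference(const void *addr)
{
    if (ref_used == MAX_REFS)
        return NULL;
    ref_table[ref_used].addr = addr;
    ref_table[ref_used].count = 0;
    return &ref_table[ref_used++];
}

/* Find the record for an address, or NULL if it is not tracked. */
struct ref_record *lookupReference(const void *addr)
{
    for (size_t i = 0; i < ref_used; i++)
        if (ref_table[i].addr == addr)
            return &ref_table[i];
    return NULL;
}

/* Counted assignment: b is evaluated exactly once, so an argument with
   side effects is not repeated. */
#define ASSIGN_REF(a, b)                                                \
    do {                                                                \
        void *assign_tmp_ = (void *)(b);                                \
        struct ref_record *assign_rec_ = lookupReference(assign_tmp_);  \
        if (assign_rec_ != NULL)                                        \
            assign_rec_->count++;                                       \
        (a) = assign_tmp_;                                              \
    } while (0)
```

Any assignment written without the macro goes uncounted, which is the limitation the question's EDIT already points out.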
(jokes about seeing nails where you learn how to use a hammer have an interesting parallel here, except that when you only have a hammer, you had better learn how to make everything a nail).
Another option (perhaps more sane, perhaps not) is to keep track of all address values assigned by malloc, and then scan the program's stack and heap for matching addresses. If you match, you might have found a valid pointer, or you might have found a string with a lucky encoding; however, if you don't match, you certainly can free the address, provided they aren't storing an address + offset calculated from the original address. (Perhaps you can macrodef to detect such offsets, and add the offset as multiple addresses in the scan for the same block.)
In the end, there isn't going to be a foolproof solution without building a referencing system, where you pass back references (pretend addresses); hiding the real addresses. The down side to such a solution is that you must use the library interface every time you want to deal with an address. This includes the "next" element in the array, etc. Not very C-like, but a pretty good approximation of what Java does with its references.
Semi-serious answer
#include "Python.h"
Python has a great reference counting memory manager. If I had to do this for real in production code, not homework, I'd consider embedding the python object system in my C program which would then make my C program scriptable in python too. See the Python C API documentation if you are interested!
Such a system in C requires some discipline on the part of the programmer but ...
You need to think in terms of ownership. All things that hold references are owners and must keep track of the objects to which they hold references, e.g. through lists. When a reference-holding thing is destroyed, it must loop over its list of referred objects, decrement their reference counters, and destroy in turn any whose counter reaches zero.
Functions are also owners and should keep track of referenced objects, e.g. by setting up a list at the start of the function and looping through it when returning.
So you need to determine in which situations objects should be transferred or shared with new owners and wrap the corresponding situations in macros/functions that add or remove owned objects to owning objects' lists of referenced objects (and adjust the reference counter accordingly).
Finally you need to deal with circular references somehow by checking for objects that are no longer reachable from objects/pointers on the stack. That could be done with some mark and sweep garbage collection mechanism.
I don't think you can do it automatically without overridable destructors/constructors.
You can look at HDF5 ref counting but those require explicit calls in C:
http://www.hdfgroup.org/HDF5/doc/RM/RM_H5I.html

What is the real difference between Pointers and References?

AKA - What's this obsession with pointers?
Having only really used modern, object oriented languages like ActionScript, Java and C#, I don't really understand the importance of pointers and what you use them for. What am I missing out on here?
It's all just indirection: The ability to not deal with data, but say "I'll direct you to some data, over there". You have the same concept in Java and C#, but only in reference format.
The key differences are that references are effectively immutable signposts - they always point to something. This is useful, and easy to understand, but less flexible than the C pointer model. C pointers are signposts that you can happily rewrite. You know that the string you're looking for is next door to the string being pointed at? Well, just slightly alter the signpost.
This couples well with C's "close to the bone, low level knowledge required" approach. We know that a char* foo consists of a set of characters beginning at the location pointed to by the foo signpost. If we also know that the string is at least 10 characters long, we can change the signpost to (foo + 5) to point at the same string, but start half the length in.
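In code, moving the signpost is plain pointer arithmetic. The helper name `skip_chars` is made up for this sketch, and the clamp to the string's length is an added safety check, not something C does for you.

```c
#include <string.h>

/* Return a pointer n characters into s, clamped to the terminating '\0'
   so the result never points past the end of the string. */
const char *skip_chars(const char *s, size_t n)
{
    size_t len = strlen(s);
    return s + (n < len ? n : len);
}
```

With foo pointing at "helloworld", `skip_chars(foo, 5)` points at "world": the same characters in the same memory, just entered halfway through. Writing `foo + 5` directly does the same thing without the length check, which is where the cliff edge comes in.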
This flexibility is useful when you know what you're doing, and death if you don't (where "know" is more than just "know the language", it's "know the exact state of the program"). Get it wrong, and your signpost is directing you off the edge of a cliff. References don't let you fiddle, so you're much more confident that you can follow them without risk (especially when coupled with rules like "A referenced object will never disappear", as in most Garbage collected languages).
You're missing out on a lot! Understanding how the computer works on lower levels is very useful in several situations. C and assembler will do that for you.
Basically a pointer lets you write stuff to any point in the computer's memory. On more primitive hardware/OS or in embedded systems this actually might do something useful. Say turn the blinkenlichts on and off again.
Of course this doesn't work on modern systems. The operating system is the Lord and Master of main memory. If you try to access a wrong memory location, your process will pay for its hubris with its life.
In C, pointers are the way of passing references to data. When you call a function, you don't want to copy a million bits to a stack. Instead you just tell where the data resides in the main memory. In other words, you give a pointer to the data.
To some extent that is what happens even with Java. You pass references to objects, not the objects themselves. Remember, ultimately every object is a set of bits in the computer main memory.
Pointers are for directly manipulating the contents of memory.
It's up to you whether you think this is a good thing to do, but it's the basis of how anything gets done in C or assembler.
High-level languages hide pointers behind the scenes: for example a reference in Java is implemented as a pointer in almost any JVM you'll come across, which is why it's called NullPointerException rather than NullReferenceException. But it doesn't let the programmer directly access the memory address it points to, and it can't be modified to take a value other than the address of an object of the correct type. So it doesn't offer the same power (and responsibility) that pointers in low-level languages do.
[Edit: this is an answer to the question 'what's this obsession with pointers?'. All I've compared is assembler/C-style pointers with Java references. The question title has since changed: had I set out to answer the new question I might have mentioned references in languages other than Java]
This is like asking, “what's this obsession with CPU instructions? Do I miss out on something by not sprinkling x86 MOV instructions all over the place?”
You just need pointers when programming on a low level. In most higher-level programming language implementations, pointers are used just as extensively as in C, but hidden from the user by the compiler.
So... Don't worry. You're using pointers already -- and without the dangers of doing so incorrectly, too. :)
I see pointers as a manual transmission in a car. If you learn to drive with a car that has an automatic transmission, that won't make for a bad driver. And you can still do most everything that the drivers that learned on a manual transmission can do. There will just be a hole in your knowledge of driving. If you had to drive a manual you'd probably be in trouble. Sure, it is easy to understand the basic concept of it, but once you have to do a hill start, you're screwed. But, there is still a place for manual transmissions. For instance, race car drivers need to be able to shift to get the car to respond in the most optimal way to the current racing conditions. Having a manual transmission is very important to their success.
This is very similar to programming right now. There is a need for C/C++ development on some software. Some examples are high-end 3D games, low level embedded software, things where speed is a critical part of the software's purpose, and a lower level language that allows you closer access to the actual data that needs to be processed is key to that performance. However, for most programmers this is not the case and not knowing pointers is not crippling. However, I do believe everybody can benefit from learning about C and pointers, and manual transmissions too.
Since you have been programming in object-oriented languages, let me put it this way.
Say Object A instantiates Object B and passes it as a method parameter to Object C. Object C modifies some values in Object B. When you are back in Object A's code, you can see the changed values in Object B. Why is this so?
Because you passed a reference to Object B to Object C, rather than making another copy of Object B. So Object A and Object C both hold references to the same Object B in memory. Changes made in one place can be seen in the other. This is called By Reference.
Now, if you use primitive types instead, like int or float, and pass them as method parameters, changes in Object C cannot be seen by Object A, because Object A merely passed a copy instead of a reference of its own copy of the variable. This is called By Value.
You probably already knew that.
Coming back to the C language: Function A passes some variables to Function B. These function parameters are always copies, passed By Value. In order for Function B to manipulate the variable belonging to Function A, Function A must pass a pointer to the variable, so that it effectively becomes a pass By Reference.
"Hey, here's the memory address of my integer variable. Put the new value at that address and I will pick it up later."
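To make the By Value / By Reference contrast concrete, here is a minimal sketch in C. The function names set_copy and set_through_pointer are made up for illustration; only the calling pattern matters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: receives a copy, so the caller's variable is untouched. */
void set_copy(int value)
{
    value = 42;      /* changes only the local copy */
    (void)value;     /* silence "set but unused" warnings */
}

/* Hypothetical example: receives an address, so it can write to the caller's variable. */
void set_through_pointer(int *out)
{
    assert(out != NULL);   /* precondition: the caller passed a real address */
    *out = 42;             /* store the new value at that address */
}
```

Calling set_copy(x) leaves x as it was, while set_through_pointer(&x) changes it, because the second call hands over the address of x rather than its value.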
Note the concept is similar but not 100% analogous. Pointers can do a lot more than just passing "by reference". Pointers allow functions to set arbitrary locations of memory to whatever value is required. Pointers can also point to executable code (function pointers), so a program can choose dynamically which logic to run, not just which data to touch. Pointers may even point to other pointers (double pointers). That is powerful, but it also makes it pretty easy to introduce hard-to-detect bugs and security vulnerabilities.
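A small sketch of the two less obvious uses mentioned here, pointers to code and pointers to pointers. All names (binary_op, apply, point_at) are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* A pointer to a function taking two ints and returning an int. */
typedef int (*binary_op)(int, int);

int add(int a, int b) { return a + b; }
int mul(int a, int b) { return a * b; }

/* The caller decides at run time which function gets executed. */
int apply(binary_op op, int a, int b)
{
    assert(op != NULL);
    return op(a, b);
}

/* Double pointer: lets the function redirect the caller's own pointer. */
void point_at(int **slot, int *target)
{
    assert(slot != NULL);
    *slot = target;
}
```

apply(add, 2, 3) and apply(mul, 2, 3) run different code through the same call site, and point_at(&p, &v) makes the caller's pointer p refer to v.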
If you haven't seen pointers before, you're surely missing out on this mini-gem:
void my_strcpy(char *dest, const char *src)
{
    /* Copy each character, including the terminating '\0', which ends the loop. */
    while ((*dest++ = *src++))
        ;
}
Historically, what made programming possible was the realization that memory locations could hold computer instructions, not just data.
Pointers arose from the realization that memory locations could also hold the address of other memory locations, thus giving us indirection. Without pointers (at a low level) most complicated data structures would be impossible. No linked-lists, binary-trees or hash-tables. No pass by reference, only by value. Since pointers can point to code, without them we would also have no virtual functions or function look up tables.
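As an illustration of a data structure that exists only because a memory location can hold the address of another one, here is a minimal singly linked list sketch. The struct and function names are assumptions chosen for this example.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;   /* address of the next node, or NULL at the end */
};

/* Prepend a value; returns 0 on success, -1 if allocation fails. */
int push_front(struct node **head, int value)
{
    assert(head != NULL);
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->value = value;
    n->next = *head;
    *head = n;
    return 0;
}

/* Follow the chain of addresses and add up the values. */
int list_sum(const struct node *head)
{
    int sum = 0;
    for (; head != NULL; head = head->next)
        sum += head->value;
    return sum;
}

/* Release every node; read the next address before freeing the current one. */
void list_free(struct node *head)
{
    while (head != NULL) {
        struct node *next = head->next;
        free(head);
        head = next;
    }
}
```

Each node stores the address of the next one, so the list can grow one allocation at a time without any contiguous array behind it.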
I use pointers and references heavily in my day-to-day work, in managed code (C#, Java) and unmanaged (C++, C). I learned how to deal with pointers and what they are from the master himself: Binky!! Nothing else needs to be said ;)
The difference between a pointer and reference is this. A pointer is an address to some block of memory. It can be rewritten or in other words, reassigned to some other block of memory. A reference is simply a renaming of some object. It can only be assigned once! Once it is assigned to an object, it cannot be assigned to another. A reference is not an address, it is another name for the variable. Check out C++ FAQ for more on this.
I'm currently waist-deep in designing some high-level enterprise software in which chunks of data (stored in an SQL database, in this case) are referenced by one or more other entities. If a chunk of data remains when no more entities reference it, we're wasting storage. If a reference points to data that's not present, that's a big problem too.
There's a strong analogy to be made between our issues, and those of memory management in a language that uses pointers. It's tremendously useful to be able to talk to my colleagues in terms of that analogy. Not deleting unreferenced data is a "memory leak". A reference that goes nowhere is a "dangling pointer". We can choose explicit "frees", or we can implement "garbage collection" using "reference counting".
So here, understanding low-level memory management is helping design high-level applications.
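For readers who haven't met the low-level side of that analogy, this is roughly what "garbage collection using reference counting" looks like in C. It is only a sketch: the chunk type and the retain/release names are hypothetical, and it ignores threads and reference cycles.

```c
#include <assert.h>
#include <stdlib.h>

struct chunk {
    int refcount;   /* how many entities currently reference this chunk */
    int payload;
};

/* Create a chunk owned by one referrer; returns NULL if allocation fails. */
struct chunk *chunk_new(int payload)
{
    struct chunk *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->refcount = 1;
    c->payload = payload;
    return c;
}

/* A new entity starts referencing the chunk. */
void chunk_retain(struct chunk *c)
{
    assert(c != NULL && c->refcount > 0);
    c->refcount++;
}

/* An entity drops its reference; returns 1 if the chunk was freed. */
int chunk_release(struct chunk *c)
{
    assert(c != NULL && c->refcount > 0);
    if (--c->refcount == 0) {
        free(c);    /* nobody references it any more: no leak */
        return 1;
    }
    return 0;
}
```

Forgetting a release is the "memory leak" case, and using the chunk after the last release is the "dangling pointer" case.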
In Java you're using pointers all the time. Most variables are pointers to objects - which is why:
StringBuffer x = new StringBuffer("Hello");
StringBuffer y = x;
x.append(" boys");
System.out.println(y);
... prints "Hello boys" and not "Hello".
The only difference in C is that it's common to add and subtract from pointers - and if you get the logic wrong you can end up messing with data you shouldn't be touching.
Strings are fundamental to C (and other related languages). When programming in C, you must manage your memory. You don't just say "okay, I'll need a bunch of strings"; you need to think about the data structure. How much memory do you need? When will you allocate it? When will you free it? Let's say you want 10 strings, each with no more than 80 characters.
Okay, each string is an array of characters (81 characters - you mustn't forget the null or you'll be sorry!) and then each string is itself in an array. The final result will be a multidimensional array something like
char dict[10][81];
Note, incidentally, that when you use dict in most expressions it isn't treated as a "string", an "array", or a "char". It decays to a pointer. When you try to print one of those strings, all you're doing is passing the address of a single character; C assumes that if it just starts printing characters it will eventually hit a null. And it assumes that if you are at the start of one string, and you jump forward 81 bytes, you'll be at the start of the next string. And, in fact, taking your pointer and stepping it forward by 81 bytes is exactly what indexing does to reach the next string.
So, why are pointers important? Because you can't do anything without them. You can't even do something simple like print out a bunch of strings; you certainly can't do anything interesting like implement linked lists, or hashes, or queues, or trees, or a file system, or some memory management code, or a kernel or...whatever. You NEED to understand them because C just hands you a block of memory and lets you do the rest, and doing anything with a block of raw memory requires pointers.
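The 81-byte jump can be checked directly. The sketch below uses the same 10 x 81 layout; WORDS, WORD_LEN, set_word and row_stride are names I made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WORDS    10
#define WORD_LEN 81   /* 80 characters plus the terminating null */

/* Hypothetical helper: store text in one row, always null-terminated. */
void set_word(char dict[WORDS][WORD_LEN], size_t row, const char *text)
{
    assert(row < WORDS);
    strncpy(dict[row], text, WORD_LEN - 1);
    dict[row][WORD_LEN - 1] = '\0';
}

/* Byte distance between the start of one row and the start of the next. */
ptrdiff_t row_stride(char dict[WORDS][WORD_LEN])
{
    return (char *)(dict + 1) - (char *)dict;
}
```

With this layout row_stride reports 81, and the string stored in row 1 starts exactly 81 bytes after the start of row 0.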
Also many people suggest that the ability to understand pointers correlates highly with programming skill. Joel has made this argument, among others. For example
Now, I freely admit that programming with pointers is not needed in 90% of the code written today, and in fact, it's downright dangerous in production code. OK. That's fine. And functional programming is just not used much in practice. Agreed.
But it's still important for some of the most exciting programming jobs. Without pointers, for example, you'd never be able to work on the Linux kernel. You can't understand a line of code in Linux, or, indeed, any operating system, without really understanding pointers.
From here. Excellent article.
To be honest, most seasoned developers will have a laugh (hopefully friendly) if you don't know pointers.
At my previous job we had two new hires last year (just graduated) who didn't know about pointers, and that alone was the topic of conversation with them for about a week. No one could believe that someone could graduate without knowing pointers...
References in C++ are fundamentally different from references in Java or .NET languages; .NET languages have special types called "byrefs" which behave much like C++ "references".
A C++ reference or .NET byref (I'll use the latter term, to distinguish from .NET references) is a special type which doesn't hold a variable, but rather holds information sufficient to identify a variable (or something that can behave as one, such as an array slot) held elsewhere. Byrefs are generally only used as function parameters/arguments, and are intended to be ephemeral. Code which passes a byref to a function guarantees that the variable which is identified thereby will exist at least until that function returns, and functions generally guarantee not to keep any copy of a byref after they return (note that in C++ the latter restriction is not enforced). Thus, byrefs cannot outlive the variables identified thereby.
In Java and .NET languages, a reference is a type that identifies a heap object; each heap object has an associated class, and code in the heap object's class can access data stored in the object. Heap objects may grant outside code limited or full access to the data stored therein, and/or allow outside code to call certain methods within their class. Using a reference to call a method of its class will cause that reference to be made available to that method, which may then use it to access data (even private data) within the heap object.
What makes references special in Java and .NET languages is that they maintain, as an absolute invariant, that every non-null reference will continue to identify the same heap object as long as that reference exists. Once no reference to a heap object exists anywhere in the universe, the heap object will simply cease to exist, but there is no way a heap object can cease to exist while any reference to it exists, nor is there any way for a "normal" reference to a heap object to spontaneously become anything other than a reference to that object. Both Java and .NET do have special "weak reference" types, but even they uphold the invariant. If no non-weak references to an object exist anywhere in the universe, then any existing weak references will be invalidated; once that occurs, there won't be any references to the object and it can thus be invalidated.
Pointers, like both C++ references and Java/.NET references, identify objects, but unlike the aforementioned types of references they can outlive the objects they identify. If the object identified by a pointer ceases to exist but the pointer itself does not, any attempt to use the pointer will result in Undefined Behavior. If a pointer isn't known either to be null or to identify an object that presently exists, there's no standard-defined way to do anything with that pointer other than overwrite it with something else. It's perfectly legitimate for a pointer to continue to exist after the object identified thereby has ceased to do so, provided that nothing ever uses the pointer, but it's necessary that something outside the pointer indicate whether or not it's safe to use because there's no way to ask the pointer itself.
The key difference between pointers and references (of either type) is that references can always be asked if they are valid (they'll either be valid or identifiable as null), and if observed to be valid they will remain so as long as they exist. Pointers cannot be asked if they are valid, and the system will do nothing to ensure that pointers don't become invalid, nor allow pointers that become invalid to be recognized as such.
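Since a C pointer can't be asked whether it is still valid, a common convention is to make "null" the outside signal: reset the pointer at the moment its object is destroyed. A minimal sketch, with free_and_clear as a hypothetical helper name:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: frees the object and nulls the caller's pointer to it. */
void free_and_clear(int **slot)
{
    assert(slot != NULL);
    free(*slot);    /* free(NULL) is a no-op, so a second call is harmless */
    *slot = NULL;   /* from now on the pointer is recognizably "not valid" */
}
```

This only protects the one pointer passed in. Any other copies of the old address stay dangling, and nothing in the language will flag them.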
For a long time I didn't understand pointers, but I understood array addressing. So I'd usually put together some storage area for objects in an array, and then use an index to that array as the 'pointer' concept.
SomeObject store[100];
int a_ptr = 20;
SomeObject A = store[a_ptr];
One problem with this approach is that after I modified 'A', I'd have to reassign it to the 'store' array in order for the changes to be permanent:
store[a_ptr] = A;
Behind the scenes, the programming language was doing several copy-operations. Most of the time this didn't affect performance. It mostly made the code error-prone and repetitive.
After I learned to understand pointers, I moved away from implementing the array addressing approach. The analogy is still pretty valid. Just consider that the 'store' array is managed by the programming language's run-time.
SomeObject A;
SomeObject* a_ptr = &A;
// Any changes to a_ptr's contents hereafter will affect
// the one-true-object that it addresses. No need to reassign.
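The difference between the two approaches can be sketched in C as follows. The hits field and both function names are assumptions added for the example; SomeObject otherwise stands in for whatever was stored.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int hits;   /* hypothetical field, just something to modify */
} SomeObject;

/* Index-as-pointer approach without the write-back: the change is lost. */
void bump_copy(SomeObject store[], size_t index)
{
    SomeObject a = store[index];   /* copies the object out of the array */
    a.hits++;                      /* modifies only the copy */
    (void)a;
}

/* Real pointer: the change lands in the stored object, no reassignment needed. */
void bump_in_place(SomeObject store[], size_t index)
{
    SomeObject *a_ptr = &store[index];
    a_ptr->hits++;
}
```

After bump_copy the array element is unchanged, which is exactly the bug the write-back line was guarding against; after bump_in_place it has been updated.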
Nowadays, I only use pointers when I can't legitimately copy an object. There are a bunch of reasons why this might be the case:
To avoid an expensive object-copy operation for the sake of performance.
Some other factor doesn't permit an object-copy operation.
You want a function call to have side-effects on an object (don't pass the object, pass the pointer thereto).
In some languages, if you want to return more than one value from a function (though generally avoided).
Pointers are the most pragmatic way of representing indirection in lower-level programming languages.
Pointers are important! They "point" to a memory address, and many internal structures are represented as pointers; e.g., an array of strings is actually an array of pointers, and is itself passed around as a pointer to pointers! Pointers can also be used for updating variables passed to functions.
You need them if you want to create "objects" at runtime without preallocating memory on the stack.
Parameter efficiency: passing a pointer (typically 4 or 8 bytes) as opposed to copying a whole (arbitrarily large) object.
Java objects are passed via reference (basically a pointer) too, by the way; it's just that in Java that's hidden from the programmer.
Programming in languages like C and C++ you are much closer to the "metal". Pointers hold a memory location where your variables, data, functions etc. live. You can pass a pointer around instead of passing by value (copying your variables and data).
There are two things that are difficult with pointers:
Pointers on pointers, addressing, etc. can get very cryptic. It leads to errors, and it is hard to read.
Memory that pointers point to is often allocated from the heap, which means you are responsible for releasing that memory. The bigger your application gets, the harder it is to keep up with this requirement, and you end up with memory leaks that are hard to track down.
You could compare pointer behavior to how Java objects are passed around, with the exception that in Java you do not have to worry about freeing the memory as this is handled by garbage collection. This way you get the good things about pointers but do not have to deal with the negatives. You can still get memory leaks in Java, of course, if you keep holding references to objects you no longer need, but that is a different matter.
Also just something to note, you can use pointers in C# (as opposed to normal references) by marking a block of code as unsafe. Then you can run around changing memory addresses directly and do pointer arithmetic and all that fun stuff. It's great for very fast image manipulation (the only place I personally have used it).
As far as I know Java and ActionScript don't support unsafe code and pointers.
I am always distressed by the focus on such things as pointers or references in high-level languages. It's really useful to think at a higher level of abstraction in terms of the behavior of objects (or even just functions) as opposed to thinking in terms of "let me see, if I send the address of this thing to there, then that thing will return me a pointer to something else"
Consider even a simple swap function. If you have
void swap(int & a, int & b)
or
procedure Swap(var a, b : integer)
then interpret these to mean that the values can be changed. The fact that this is being implemented by passing the addresses of the variables is just a distraction from the purpose.
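For comparison, plain C has no reference or var parameters, so there the addresses have to be spelled out. A minimal sketch (swap_ints is just an illustrative name):

```c
#include <assert.h>
#include <stddef.h>

/* Exchange the two ints that a and b point to. */
void swap_ints(int *a, int *b)
{
    assert(a != NULL && b != NULL);
    int tmp = *a;
    *a = *b;
    *b = tmp;
}
```

The caller writes swap_ints(&x, &y); the intent is the same as in the two declarations above, only the mechanism is visible.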
Same with objects --- don't think of object identifiers as pointers or references to "stuff". Instead, just think of them as, well, OBJECTS, to which you can send messages. Even in primitive languages like C++, you can go a lot further a lot faster by thinking (and writing) at as high a level as possible.
Write more than 2 lines of c or c++ and you'll find out.
They are "pointers" to the memory location of a variable. It is like passing a variable by reference kinda.

Resources