I'm writing a simple language that compiles to C, and I want to implement smart pointers. I need a bit of help with that, though, as I can't think of how I would go about it, or whether it's even possible. My current idea is to free the pointer when it goes out of scope, with the compiler inserting the frees. This leads to my questions:
How would I tell when a pointer has gone out of scope?
Is this even possible?
The compiler is written in C, and compiles to C. I thought that I could check when the pointer goes out of scope at compile-time, and insert a free into the generated code for the pointer, i.e:
// generated C code.
int main() {
int *x = malloc(sizeof(*x));
*x = 5;
free(x); // inserted by the compiler
}
The scoping rules (in my language) are exactly the same as C.
My current setup is your standard compiler: first it lexes the file contents, then it parses the token stream, semantically analyzes it, and finally generates C code. The parser is a recursive descent parser. I would like to avoid anything that happens during execution, i.e. I want it to be a compile-time check that has little to no overhead and isn't full-blown garbage collection.
For functions, each { starts a new scope, and each } closes the corresponding scope. When a } is reached, the variables inside that block go out of scope. Members of structs go out of scope when the struct instance goes out of scope. There are a couple of exceptions: temporary objects go out of scope at the next ;, and a for loop silently gets its own block scope, so a variable declared in its header lives until the loop ends.
struct thing {
int member;
};
int foo;
int main() {
struct thing a;
{
int b = 3;
for(int c=0; c<b; ++c) {
int d = rand(); //the return value of rand goes out of scope after assignment
} //d and c go out of scope here
} //b goes out of scope here
}//a and its members go out of scope here
//globals like foo go out-of-scope after main ends
C++ tries really hard to destroy objects in the opposite order from the one they were constructed in; you should probably do that in your language too.
(This is all from my knowledge of C++, so it might be slightly different from C, but I don't think it is)
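To make the scope rules above concrete, here is a minimal sketch of the bookkeeping a compiler could do while parsing. It assumes every tracked pointer is the sole owner of its block (no copies). The names ScopeTracker, scope_enter, scope_declare and scope_exit, and the fixed limits, are made up for illustration and are not bounds-checked.

```c
#include <stdio.h>

#define MAX_OWNED 64   /* hypothetical fixed limits, to keep the sketch short */
#define MAX_DEPTH 32

/* Compile-time bookkeeping: which owning pointers were declared in which scope. */
typedef struct {
    const char *owned[MAX_OWNED]; /* variable names, in declaration order */
    int count;                    /* number of names currently tracked */
    int scope_start[MAX_DEPTH];   /* value of count when each open scope began */
    int depth;                    /* number of open scopes */
} ScopeTracker;

/* Call when the parser sees '{'. */
void scope_enter(ScopeTracker *t) {
    t->scope_start[t->depth++] = t->count;
}

/* Call when the parser sees the declaration of an owning pointer. */
void scope_declare(ScopeTracker *t, const char *name) {
    t->owned[t->count++] = name;
}

/* Call when the parser sees '}': write the frees for this scope into out
   (cap must be at least 1), newest declaration first, then forget the names. */
void scope_exit(ScopeTracker *t, char *out, size_t cap) {
    int start = t->scope_start[--t->depth];
    size_t used = 0;
    out[0] = '\0';
    while (t->count > start) {
        const char *name = t->owned[--t->count];
        int n = snprintf(out + used, cap - used, "free(%s);\n", name);
        if (n < 0 || (size_t)n >= cap - used) {
            out[used] = '\0'; /* buffer full: drop the partial line */
            break;
        }
        used += (size_t)n;
    }
    t->count = start;
}
```

The code generator would paste the text from scope_exit just before the closing brace. A return, break or continue leaves scopes early, so it would need the same frees emitted for every scope it jumps out of.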
As for memory, you'll probably want to do a little magic behind the scenes. Whenever the user mallocs memory, you replace it with something that allocates more memory, and "hide" a reference count in the extra space. It's easiest to do that at the beginning of the allocation, and to keep alignment guarantees, you use something akin to this:
typedef union {
long double f;
void* v;
char* c;
unsigned long long l;
} bad_alignment;
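void* ref_count_malloc(size_t bytes)
{
    bad_alignment* header = malloc(bytes + sizeof(bad_alignment)); //room for the hidden header plus the user's data
    if (header == NULL)
        return NULL; //allocation failed, nothing to count
    header->l = 1; //now is 1 pointer pointing at this block
    return header + 1; //the user's data starts right after the header
}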
When they copy a pointer, you silently add something akin to this before the copy:
void copy_pointer(void* from, void* to) {
    //from is the block the pointer used to point at, to is the block it will point at
    if (to != NULL) {
        bad_alignment* ref_count = (bad_alignment*)to - 1;
        ++ref_count->l; //one additional pointing at this block
    }
    ref_count_free(from); //no longer points at previous block (NULL is ignored)
}
And when they free or a pointer goes out of scope, you add/replace the call with something like this:
void ref_count_free(void* ptr) {
    if(ptr) {
        bad_alignment* ref_count = (bad_alignment*)ptr - 1; //step back to the hidden header
        if (--ref_count->l == 0) //if no more pointing at this block
            free(ref_count); //free the whole allocation, which starts at the header
    }
}
If you have threads, you'll have to add locks to all that. My C is rusty and the code is untested, so do a lot of research on these concepts.
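For reference, the pieces above can be put together into one small self-contained sketch. It is single-threaded and assumes C11 for max_align_t. The names rc_header, rc_alloc, rc_retain and rc_release are invented here, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Header stored in front of every counted block. The union with max_align_t
   (C11) keeps the user's data aligned for any type. */
typedef union {
    size_t count;
    max_align_t align;
} rc_header;

/* Allocate bytes of user data with a hidden count of 1; NULL on failure. */
void *rc_alloc(size_t bytes) {
    rc_header *h;
    if (bytes > SIZE_MAX - sizeof *h) return NULL; /* total size would overflow */
    h = malloc(sizeof *h + bytes);
    if (h == NULL) return NULL;
    h->count = 1;
    return h + 1; /* user data starts right after the header */
}

/* One more pointer now refers to the block. Returns p for convenience. */
void *rc_retain(void *p) {
    if (p != NULL) ((rc_header *)p - 1)->count++;
    return p;
}

/* One pointer stopped referring to the block. Returns 1 if the block was freed. */
int rc_release(void *p) {
    rc_header *h;
    if (p == NULL) return 0;
    h = (rc_header *)p - 1;
    if (--h->count > 0) return 0;
    free(h); /* free the original allocation, not the user pointer */
    return 1;
}
```

Under this scheme the generated C for "int *y = x;" would become "int *y = rc_retain(x);", and each pointer leaving scope would get its own rc_release call.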
The problem is slightly more difficult than that. Your example is straightforward, but what if another pointer is made to point to the same place as x?
// generated C code.
int main() {
int *x = malloc(sizeof(*x));
int *y = x;
*x = 5;
free(x); // inserted by the compiler, now wrong
}
You will doubtless have a heap structure in which each block has a header that tells a) whether the block is in use, and b) the size of the block. This can be achieved with a small structure, or by using the highest bit of the integer value for b) to store a) [is this a 64-bit or a 32-bit compiler?]. For simplicity, let's consider:
typedef struct {
bool allocated: 1;
size_t size;
} BlockHeader;
You would have to add another field to that small structure: a reference count. Each time a pointer is made to point to that block in the heap, you increment the block's reference count. When a pointer stops pointing to the block, the count is decremented. If it reaches 0, the block can be freed, compacted or whatever. The allocated field is no longer needed, since a count of 0 already says that the block is not in use.
typedef struct {
size_t size;
size_t referenceCount;
} BlockHeader;
Reference counting is quite simple to implement, but it comes with a downside: there is overhead each time the value of a pointer changes. Still, it is the simplest scheme to get working, and that's why some programming languages still use it, such as Python.
I am very new to C so sorry in advance if this is really basic. This is related to homework.
I have several helper functions, and each changes the value of a given variable (binary operations mostly), i.e.:
void helper1(unsigned short *x, arg1, arg2) --> x = &some_new_x
The main function calls other arguments arg3, arg4, arg5. The x is supposed to start at 0 (16-bit 0) at first, then be modified by helper functions, and after all the modifications, should be eventually returned by mainFunction.
Where do I declare the initial x and how/where do I allocate/free memory? If I declare it within mainFunc, it will reset to 0 every time helpers are called. If I free and reallocate memory inside helper functions, I get the "pointer being freed was not allocated" error even though I freed and allocated everything, or so I thought. A global variable doesn't do, either.
I would say that I don't really fully understand memory allocation, so I assume that my problem is with this, but it's entirely possible I just don't understand how to change variable values in C on a more basic level...
The variable x will exist while the block in which it was declared is executed, even during helper execution, and giving a pointer to the helpers allows them to change its value. If I understand your problem right, you shouldn't need dynamic memory allocation. The following code returns 4 from mainFunction:
void plus_one(unsigned short* x)
{
*x = *x + 1;
}
unsigned short mainFunction(void)
{
unsigned short x = 0;
plus_one(&x);
plus_one(&x);
plus_one(&x);
plus_one(&x);
return x;
}
By your description I'd suggest declaring x in your main function as a local variable (allocated from the stack) which you then pass by reference to your helper functions and return it from your main function by value.
int main()
{
int x; //local variable
helper(&x); //passed by reference
return x; //returned by value
}
Inside your helper you can modify the variable by dereferencing it and assigning whatever value needed:
void helper(int * x)
{
*x = ...; //change value of x
}
The alternative is declaring a pointer to x (which gets allocated from the heap), passing it to your helper functions, and freeing it when you have no use for it anymore. This route requires more careful consideration and is error-prone.
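If you do go the heap route, a minimal sketch could look like the following. The names add_one and heap_main_function are made up for illustration, and returning 0 on allocation failure is an arbitrary choice for the sketch. The point is that the same function allocates and frees, exactly once, while the helpers only use the pointer.

```c
#include <stdlib.h>

/* Same kind of helper as with a local variable: it only needs a valid pointer
   and never allocates or frees anything itself. */
void add_one(unsigned short *x) {
    *x = (unsigned short)(*x + 1);
}

/* Returns the final value, or 0 if the allocation failed (real code would
   report the error separately). */
unsigned short heap_main_function(void) {
    unsigned short result;
    unsigned short *x = malloc(sizeof *x); /* allocated once, by the owner */
    if (x == NULL) return 0;
    *x = 0;
    add_one(x);
    add_one(x);
    result = *x; /* copy the value out before releasing the block */
    free(x);     /* freed once, by the same function that allocated it */
    return result;
}
```

Freeing or reallocating inside the helpers is what typically leads to errors like "pointer being freed was not allocated", because the caller's pointer does not follow along.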
Functions receive a value-wise copy of their inputs to locally scoped variables. Thus a helper function cannot possibly change the value it was called with, only its local copy.
void f(int n)
{
n = 2;
}
int main()
{
int n = 1;
f(n);
return 0;
}
Despite having the same name, n in f is local to the invocation of f. So the n in main never changes.
The way to work around this is to pass by pointer:
void f(int *n)
{
*n = 2;
}
int main()
{
int n = 1;
f(&n);
// now we also see n == 2.
return 0;
}
Note that, again, n in f is local, so if we changed the pointer n in f, it would have no effect on main's perspective. If we wanted to change the address n in main, we'd have to pass the address of the pointer.
void f1(int* nPtr)
{
nPtr = malloc(sizeof(int)); // only changes f1's local copy; the caller never sees it
*nPtr = 2;
}
void f2(int** nPtr)
{
// since nPtr is a pointer-to-a-pointer,
// we have to dereference it once to
// reach the "pointer-to-int"
// typeof nPtr = (int*)*
// typeof *nPtr = int*
*nPtr = malloc(sizeof(int));
// deref once to get to int*, deref that for int
**nPtr = 2;
}
int main()
{
int *nPtr = NULL;
f1(nPtr); // passes 'NULL' to param 1 of f1.
// after the call, our 'nPtr' is still NULL
f2(&nPtr); // passes the *address* of our nPtr variable
// nPtr here should no-longer be null.
return 0;
}
---- EDIT: Regarding ownership of allocations ----
The ownership of pointers is a messy can of worms. The C library has a function strdup (from POSIX, and part of standard C since C23) which returns a pointer to a copy of a string. It is left to the programmer to understand that the memory was allocated with malloc and is expected to be released to the memory manager by a call to free.
This approach becomes more onerous as the thing being pointed to becomes more complex. For example, if you get a directory structure, you might be expected to understand that each entry is an allocated pointer that you are responsible for releasing.
dir = getDirectory(dirName);
for (i = 0; i < numEntries; i++) {
printf("%d: %s\n", i, dir[i]->de_name);
free(dir[i]);
}
free(dir);
If this was a file operation you'd be a little surprised if the library didn't provide a close function and made you tear down the file descriptor on your own.
A lot of modern libraries tend to assume responsibility for their resources and provide matching acquire and release functions, e.g. to open and close a MySQL connection:
// allocate a MySQL descriptor and initialize it.
MYSQL* conn = mysql_init(NULL);
DoStuffWithDBConnection(conn);
// release everything.
mysql_close(conn);
LibEvent has, e.g.
bufferevent_new();
to allocate an event buffer and
bufferevent_free();
to release it, even though what it actually does is little more than malloc() and free(), but by having you call these functions, they provide a well-defined and clear API which assumes responsibility for knowing such things.
This is the basis for the concept known as "RAII" in C++.
This is perhaps one of the oddest things I've ever encountered. I don't program much in C, but from what I know to be true, plus checking with different sources online, the variables macroName and macroBody are only defined in the scope of the while loop. So every time the loop runs, I'm expecting macroName and macroBody to get new addresses and be completely new variables. However, that is not true.
What I'm finding is that even though the loop is running again, both variables share the same address and this is causing me serious headache for a linked list where I need to check for uniqueness of elements. I don't know why this is. Shouldn't macroName and macroBody get completely new addresses each time the while loop runs?
I know this is the problem because I'm printing the addresses and they are the same.
while(fgets(line, sizeof(line), fp) != NULL) // Get new line
{
char macroName[MAXLINE];
char macroBody[MAXLINE];
// ... more code
switch (command_type)
{
case hake_macro_definition:
// ... more code
printf("**********%p | %p\n", &macroName, &macroBody);
break;
// .... more cases
}
}
Code that is part of my linked-list code.
struct macro {
struct macro *next;
struct macro *previous;
char *name;
char *body;
};
Function that checks if element already exists inside linked-list. But since *name has the same address, I always end up inside the if condition.
static struct macro *macro_lookup(char *name)
{
struct macro *temp = macro_list_head;
while (temp != NULL)
{
if (are_strings_equal(name, temp->name))
{
break;
}
temp = temp->next;
}
return temp;
}
These arrays are allocated on the stack:
char macroName[MAXLINE];
char macroBody[MAXLINE];
The compiler typically reserves space for them once, when your function is entered. In other words, from the computer's viewpoint, the location of these arrays would be the same as if you had defined them outside the loop body, at the top of your function body.
The scope in C merely indicates where an identifier is visible. So the compiler (but not the computer) enforces the semantics that macroName and macroBody cannot be referenced before or after the loop body. But from the computer's viewpoint, the actual data for these arrays exists once the function starts and only goes away when the function ends.
If you were to look at the assembly dump of your code, you'd likely see that your machine's stack pointer is decremented by a big enough amount for your function's stack frame to have space for all of your local variables, including these arrays.
What I need to mention in addition to chrisaycock's answer: you should never use pointers to local variables outside the function these variables were defined in. Consider this example:
int * f()
{
int local_var = 0;
return &local_var;
}
int g(int x)
{
return (x > 0) ? x : 0;
}
int main()
{
int * from_f = f(); //
*from_f = 100; //Undefined behavior
g(15); //some function call to change stack
printf("%d", *from_f); //Will print some random value
return 0;
}
The same actually applies to a block. Technically, block-local variables can be cleaned out after the block ends, so on each iteration of a loop the old addresses can be invalid. In practice you will not see this, since C compilers usually put these variables at the same address on every iteration for performance reasons, but you cannot rely on it.
What you need to understand is how memory is allocated. If you want to implement a list, it is a structure that grows. Where does the memory come from? You can not allocate much memory from the stack, plus the memory is invalidated once you return from a function. So, you will need to allocate it from the heap (using malloc).
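For the linked list in the question, that means each node should own heap copies of the name and body instead of pointing into the loop's arrays. Here is a minimal sketch, reusing the struct macro from the question. The helpers copy_string and macro_add are hypothetical names, not the asker's code.

```c
#include <stdlib.h>
#include <string.h>

struct macro {
    struct macro *next;
    struct macro *previous;
    char *name;
    char *body;
};

/* Copy a string into its own heap block; NULL on allocation failure. */
static char *copy_string(const char *s) {
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL) memcpy(copy, s, n);
    return copy;
}

/* Hypothetical helper: push a new node holding private copies of name and
   body onto the front of the list. Returns the node, or NULL on failure. */
struct macro *macro_add(struct macro **head, const char *name, const char *body) {
    struct macro *m = malloc(sizeof *m);
    if (m == NULL) return NULL;
    m->name = copy_string(name);
    m->body = copy_string(body);
    if (m->name == NULL || m->body == NULL) {
        free(m->name); /* free(NULL) is harmless */
        free(m->body);
        free(m);
        return NULL;
    }
    m->previous = NULL;
    m->next = *head;
    if (*head != NULL) (*head)->previous = m;
    *head = m;
    return m;
}
```

With this, the loop can keep reusing macroName and macroBody on every iteration, because the list no longer stores their addresses. Each node's name and body must be freed when the node is removed.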
Update : Sorry, just a big mistake. It is meaningless to write int *a = 3; But please just think the analogy to the case like TCHAR *a = TEXT("text"); (I edited my question, so some answers and comments are strange, since they are for my original question which is not suitable)
In the main function, suppose I have a pointer TCHAR *a = TEXT("text"); Then it executes the following code:
int i;
for (i = 0; i < 1000; i++) {
a = test(i);
}
with the function TCHAR* test(int par) defined by:
TCHAR* test(int par)
{
TCHAR *b = TEXT("aaa");
return b;
}
My question is, after executing the above code, but before the program ends, in the memory:
1. the pointer `a` remains?
2. The 1000 pointers `b` are deleted each time the function test(...) exits ?
3. But there are still 1000 memory blocks there?
In fact, my question is motivated from the following code, which shows a tooltip when mouse is over a tab item in a tab control with the style TCS_TOOLTIPS:
case WM_NOTIFY:
if (lpnmhdr->code == TTN_GETDISPINFO) {
LPNMTTDISPINFO lpnmtdi;
lpnmtdi = (LPNMTTDISPINFO)lParam;
int tabIndex = (int) wParam; // wParam is the index of the tab item.
lpnmtdi->lpszText = SetTabToolTipText(panel->gWin.At(tabIndex));
break;
}
I am wondering whether the memory usage increases each time it calls SetTabToolTipText(panel->gWin.At(tabIndex)), which manipulates TCHAR and TCHAR* values and returns a value of type LPTSTR.
1. Yes, the pointer a remains until we return from the main function.
2. The variable b (a pointer, 4 bytes on a 32-bit build) is automatic. It is created each time we call the test function. Once we return from it, the variable (the pointer) disappears. Please note that the value to which b points isn't affected.
3. No. In most cases, I think, there will be only one block, allocated at compile time (most likely in read-only memory), and the function will return the same pointer on every invocation.
If SetTabToolTipText allocates a string internally using some memory management facility (new, malloc, or something OS-specific), you have to do additional cleanup. Otherwise there will be a memory leak.
If nothing like this happens inside (it's not mentioned in the documentation or comments etc), it's most likely returning the pointer to some internal buffer which you typically use as readonly. In this case, there should be no concerns about a memory consumption increase.
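A quick way to convince yourself that returning a string literal allocates nothing per call is a sketch like this one. It uses plain char instead of TCHAR so it compiles anywhere, and the function names get_text and all_calls_return_same_pointer are made up for the example.

```c
/* The literal "aaa" is a single array with static storage duration, so every
   call returns the same address and nothing is allocated per call. */
const char *get_text(int par) {
    (void)par; /* unused, kept to mirror the question's signature */
    return "aaa";
}

/* Returns 1 if n calls all produced the same pointer, 0 otherwise. */
int all_calls_return_same_pointer(int n) {
    const char *first = get_text(0);
    int i;
    for (i = 1; i < n; i++) {
        if (get_text(i) != first) return 0;
    }
    return 1;
}
```

Only the local pointer variable comes and goes with each call. The characters it points at live for the whole run of the program and must not be freed or modified.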
You don't allocate any memory, so you don't have to worry about memory being freed. When your variables go out of scope, they will be freed automatically. In this function
int test(int par)
{
int *b = par;
}
you don't have a return value even though the function says that it will return an int. You should probably add one, because in this loop
for (i = 0; i < 1000; i++) {
a = test(i);
}
you assign to a the value that is returned by test(). Also
int* a = 3;
int* b = par;
are asking for trouble. You are assigning integer values to a pointer variable. You should probably rethink your above code.
A pointer should contain an address, so int* a = 3 is meaningless. And in the function you don't allocate memory for an int; memory is allocated only for the par variable (which is destroyed when the function ends) and for storing the address in int* b, and that memory is also freed when the function ends.
I have a dynamic array of structures, so I thought I could store the information about the array in the first structure.
So one attribute will represent the amount of memory allocated for the array and another one representing number of the structures actually stored in the array.
The trouble is, that when I put it inside a function that fills it with these structures and tries to allocate more memory if needed, the original array gets somehow distorted.
Can someone explain why is this and how to get past it?
Here is my code
#define INIT 3
typedef struct point{
int x;
int y;
int c;
int d;
}Point;
Point empty(){
Point p;
p.x=1;
p.y=10;
p.c=100;
p.d=1000; //if you put different values it will act differently - weird
return p;
}
void printArray(Point * r){
int i;
int total = r[0].y+1;
for(i=0;i<total;i++){
printf("%2d | P [%2d,%2d][%4d,%4d]\n",i,r[i].x,r[i].y,r[i].c,r[i].d);
}
}
void reallocFunction(Point * r){
r=(Point *) realloc(r,r[0].x*2*sizeof(Point));
r[0].x*=2;
}
void enter(Point* r,int c){
int i;
for(i=1;i<c;i++){
r[r[0].y+1]=empty();
r[0].y++;
if( (r[0].y+2) >= r[0].x ){ /*when the amount of Points is near
*the end of allocated memory.
reallocate the array*/
reallocFunction(r);
}
}
}
int main(int argc, char** argv) {
Point * r=(Point *) malloc ( sizeof ( Point ) * INIT );
r[0]=empty();
r[0].x=INIT; /*so here I store for how many "Points" is there memory
//in r[0].y theres how many Points there are.*/
enter(r,5);
printArray(r);
return (0);
}
Your code does not look clean to me for other reasons, but...
void reallocFunction(Point * r){
r=(Point *) realloc(r,r[0].x*2*sizeof(Point));
r[0].x*=2;
r[0].y++;
}
The problem here is that r in this function is the parameter, hence any modifications to it are lost when the function returns. You need some way to change the caller's version of r. I suggest:
Point * // Note new return type...
reallocFunction(Point * r){
r=(Point *) realloc(r,r[0].x*2*sizeof(Point));
r[0].x*=2;
r[0].y++;
return r; // Note: now we return r back to the caller..
}
Then later:
r = reallocFunction(r);
Now... Another thing to consider is that realloc can fail. A common pattern for realloc that accounts for this is:
Point *reallocFunction(Point * r){
void *new_buffer = realloc(r, r[0].x*2*sizeof(Point));
if (!new_buffer)
{
// realloc failed, pass the error up to the caller..
return NULL;
}
r = new_buffer;
r[0].x*=2;
r[0].y++;
return r;
}
This ensures that you don't leak r when the memory allocation fails, and the caller then has to decide what happens when your function returns NULL...
But, some other things I'd point out about this code (I don't mean to sound like I'm nitpicking about things and trying to tear them apart; this is meant as constructive design feedback):
The names of variables and members don't make it very clear what you're doing.
You've got a lot of magic constants. There's no explanation for what they mean or why they exist.
reallocFunction doesn't seem to really make sense. Perhaps the name and interface can be clearer. When do you need to realloc? Why do you double the X member? Why do you increment Y? Can the caller make these decisions instead? I would make that clearer.
Similarly it's not clear what enter() is supposed to be doing. Maybe the names could be clearer.
It's a good thing to do your allocations and manipulation of member variables in a consistent place, so it's easy to spot (and later, potentially change) how you're supposed to create, destroy and manipulate one of these objects. Here it seems in particular like main() has a lot of knowledge of your structure's internals. That seems bad.
Use of the multiplication operator in parameters to realloc in the way that you do is sometimes a red flag... It's a corner case, but the multiplication can overflow and you can end up shrinking the buffer instead of growing it. This would make you crash and in writing production code it would be important to avoid this for security reasons.
You also do not seem to initialize r[0].y. As far as I understood, you should have a r[0].y=0 somewhere.
Anyway, using the first element of the array for something different is definitely a bad idea. It makes your code horribly hard to understand. Just create a new structure holding the array size, the capacity, and the pointer.
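As a rough sketch of that suggestion, something along these lines would do. The names PointArray, point_array_push and point_array_free are invented for the example, as are the starting capacity of 4 and the doubling policy. It also guards the size multiplication against overflow and keeps the old block when realloc fails.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { int x, y, c, d; } Point;

typedef struct {
    Point *items;    /* heap block, or NULL while empty */
    size_t size;     /* number of Points stored */
    size_t capacity; /* number of Points the block can hold */
} PointArray;

/* Append p, growing the block when it is full. Returns 0 on success and -1
   on failure; on failure the array is left unchanged. */
int point_array_push(PointArray *a, Point p) {
    if (a->size == a->capacity) {
        size_t new_cap;
        Point *grown;
        if (a->capacity > SIZE_MAX / sizeof(Point) / 2)
            return -1; /* doubling would overflow the byte count */
        new_cap = a->capacity ? a->capacity * 2 : 4;
        grown = realloc(a->items, new_cap * sizeof(Point));
        if (grown == NULL) return -1; /* the old block is still valid */
        a->items = grown;
        a->capacity = new_cap;
    }
    a->items[a->size++] = p;
    return 0;
}

/* Release the block and reset the bookkeeping. */
void point_array_free(PointArray *a) {
    free(a->items);
    a->items = NULL;
    a->size = 0;
    a->capacity = 0;
}
```

A caller starts with "PointArray a = {0};" and never touches the fields directly, so main() no longer needs to know how the bookkeeping is stored.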
Which is considered better style?
int set_int (int *source) {
*source = 5;
return 0;
}
int main(){
int x;
set_int (&x);
}
OR
int *set_int (void) {
int *temp = NULL;
temp = malloc(sizeof (int));
*temp = 5;
return temp;
}
int main (void) {
int *x = set_int ();
}
Coming from a higher-level programming background, I gotta say I like the second version more. Any tips would be very helpful. Still learning C.
Neither.
// "best" style for a function which sets an integer taken by pointer
void set_int(int *p) { *p = 5; }
int i;
set_int(&i);
Or:
// then again, minimise indirection
int an_interesting_int() { return 5; /* well, in real life more work */ }
int i = an_interesting_int();
Just because higher-level programming languages do a lot of allocation under the covers, does not mean that your C code will become easier to write/read/debug if you keep adding more unnecessary allocation :-)
If you do actually need an int allocated with malloc, and to use a pointer to that int, then I'd go with the first one (but bugfixed):
void set_int(int *p) { *p = 5; }
int *x = malloc(sizeof(*x));
if (x == 0) { /* do something about the error */ }
set_int(x);
Note that the function set_int is the same either way. It doesn't care where the integer it's setting came from, whether it's on the stack or the heap, who owns it, whether it has existed for a long time or whether it's brand new. So it's flexible. If you then want to also write a function which does two things (allocates something and sets the value) then of course you can, using set_int as a building block, perhaps like this:
int *allocate_and_set_int() {
int *x = malloc(sizeof(*x));
if (x != 0) set_int(x);
return x;
}
In the context of a real app, you can probably think of a better name than allocate_and_set_int...
Some errors:
int main(){
int x*; //should be int* x; or int *x;
set_int(x);
}
Also, you are not allocating any memory in the first code example.
int *x = malloc(sizeof(int));
About the style:
I prefer the first one, because you have fewer chances of not freeing the memory held by the pointer.
The first one is incorrect (apart from the syntax error) - you're passing an uninitialised pointer to set_int(). The correct call would be:
int main()
{
int x;
set_int(&x);
}
If they're just ints, and it can't fail, then the usual answer would be "neither" - you would usually write that like:
int get_int(void)
{
return 5;
}
int main()
{
int x;
x = get_int();
}
If, however, it's a more complicated aggregate type, then the second version is quite common:
struct somestruct *new_somestruct(int p1, const char *p2)
{
struct somestruct *s = malloc(sizeof *s);
if (s)
{
s->x = 0;
s->j = p1;
s->abc = p2;
}
return s;
}
int main()
{
struct somestruct *foo = new_somestruct(10, "Phil Collins");
free(foo);
return 0;
}
This allows struct somestruct * to be an "opaque pointer", where the complete definition of type struct somestruct isn't known to the calling code. The standard library uses this convention - for example, FILE *.
Definitely go with the first version. Notice that this allowed you to omit a dynamic memory allocation, which is SLOW, and may be a source of bugs, if you forget to later free that memory.
Also, if you decide for some reason to use the second style, notice that you don't need to initialize the pointer to NULL. This value will either way be overwritten by whatever malloc() returns. And if you're out of memory, malloc() will return NULL by itself, without your help :-).
So int *temp = malloc(sizeof(int)); is sufficient.
Memory management rules usually state that the allocator of a memory block should also deallocate it. This is impossible when you return allocated memory. Therefore, the first should be better.
For a more complex type like a struct, you'll usually end up with a function to initialize it and maybe a function to dispose of it. Allocation and deallocation should be done separately, by you.
C gives you the freedom to allocate memory dynamically or statically, and having a function work only with one of the two modes (which would be the case if you had a function that returned dynamically allocated memory) limits you.
typedef struct
{
int x;
float y;
} foo;
void foo_init(foo* object, int x, float y)
{
object->x = x;
object->y = y;
}
int main()
{
foo myFoo;
foo_init(&myFoo, 1, 3.1416f);
}
In the second one you would need a pointer to a pointer for it to work, and in the first you are not using the return value, though you should.
I tend to prefer the first one, in C, but that depends on what you are actually doing, as I doubt you are doing something this simple.
Keep your code as simple as you need to get it done, the KISS principle is still valid.
It is best not to return a piece of allocated memory from a function: somebody who does not know how it works might not deallocate the memory.
The memory deallocation should be the responsibility of the code allocating the memory.
The first is preferred (assuming the simple syntax bugs are fixed) because it is how you simulate an Out Parameter. However, it's only usable where the caller can arrange for all the space to be allocated to write the value into before the call; when the caller lacks that information, you've got to return a pointer to memory (maybe malloced, maybe from a pool, etc.)
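To illustrate the two situations, here is a small sketch with made-up functions (format_point and format_point_alloc are not from any library). The first writes into space the caller arranged; the second has to allocate because only the function knows how much space is needed.

```c
#include <stdio.h>
#include <stdlib.h>

/* Out-parameter style: the caller owns the buffer and says how big it is.
   Returns 0 on success, -1 if the text does not fit. */
int format_point(char *out, size_t cap, int x, int y) {
    int n = snprintf(out, cap, "(%d,%d)", x, y);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}

/* Returning style: the function works out the size itself, so it has to
   allocate. The caller must free() the result. Returns NULL on failure. */
char *format_point_alloc(int x, int y) {
    int n = snprintf(NULL, 0, "(%d,%d)", x, y); /* C99: computes the length only */
    char *s;
    if (n < 0) return NULL;
    s = malloc((size_t)n + 1);
    if (s != NULL) snprintf(s, (size_t)n + 1, "(%d,%d)", x, y);
    return s;
}
```

The first form works with stack, static or heap buffers alike. The second is more convenient to call but moves the responsibility for free() onto the caller, which needs to be documented.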
What you are asking more generally is how to return values from a function. It's a great question because it's so hard to get right. What you can learn are some rules of thumb that will stop you making horrid code. Then, read good code until you internalize the different patterns.
Here is my advice:
In general any function that returns a new value should do so via its return statement. This applies for structures, obviously, but also arrays, strings, and integers. Since integers are simple types (they fit into one machine word) you can pass them around directly, not with pointers.
Never pass pointers to integers; it's an anti-pattern. Always pass integers by value.
Learn to group functions by type so that you don't have to learn (or explain) every case separately. A good model is a simple OO one: a _new function that creates an opaque struct and returns a pointer to it; a set of functions that take the pointer to that struct and do stuff with it (set properties, do work); a set of functions that return properties of that struct; a destructor that takes a pointer to the struct and frees it. Hey presto, C becomes much nicer like this.
When you do modify arguments (only structs or arrays), stick to conventions, e.g. stdc libraries always copy from right to left; the OO model I explained would always put the structure pointer first.
Avoid modifying more than one argument in one function. Otherwise you get complex interfaces you can't remember and you eventually get wrong.
Return 0 for success, -1 for errors, when the function does something which might go wrong. In some cases you may have to return -1 for errors, 0 or greater for success.
The standard POSIX APIs are a good template but don't use any kind of class pattern.
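As a sketch of the simple OO model from the list above, using an invented counter type (none of these names come from a real library): a _new function, an operation that can fail and reports 0 or -1, a property getter that returns by value, and a destructor.

```c
#include <stdlib.h>

/* In a real project only this forward declaration would go in the header,
   so that callers cannot touch the fields. */
typedef struct counter counter;

struct counter {
    int value;
    int limit;
};

/* Constructor: hands back the new object via the return value, NULL on failure. */
counter *counter_new(int limit) {
    counter *c = malloc(sizeof *c);
    if (c != NULL) {
        c->value = 0;
        c->limit = limit;
    }
    return c;
}

/* Operation that can go wrong: struct pointer first, 0 on success, -1 on error. */
int counter_add(counter *c, int amount) {
    if (amount < 0 || amount > c->limit - c->value) return -1;
    c->value += amount;
    return 0;
}

/* Property getter: a plain integer, returned by value. */
int counter_value(const counter *c) {
    return c->value;
}

/* Destructor: accepts NULL, like free(). */
void counter_free(counter *c) {
    free(c);
}
```

Every function takes the struct pointer as its first argument and modifies at most that one argument, so the whole interface follows one pattern that is easy to remember.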