How to distinguish a malloced string from a string literal? - c

Is there a way (in pure C) to distinguish a malloced string from a string literal, without knowing which is which? Strictly speaking, I'm trying to find a way to check a variable whether it's a malloced string or not and if it is, I'm gonna free it; if not, I'll let it go.
Of course, I can dig the code backwards and make sure if the variable is malloced or not, but just in case if an easy way exists...
edit: lines added to make the question more specific.
char *s1 = "1234567890"; // string literal
char *s2 = strdup("1234567890"); // malloced string
char *s3;
...
if (someVar > someVal) {
    s3 = s1;
} else {
    s3 = s2;
}
// ...
// after many, many lines of code and algorithmic branches...
// now I lost track of s3: is it assigned to s1 or s2?
// if it was assigned to s2, it needs to be freed;
// if not, freeing a string literal will pop an error

Is there a way (in pure C) to distinguish a malloced string from a string literal,
Not in any portable way, no. No need to worry though; there are better alternatives.
When you write code in C you do so while making strong guarantees about "who" owns memory. Does the caller own it? Then it's their responsibility to deallocate it. Does the callee own it? Similar thing.
You write code which is very clear about the chain of custody and ownership and you don't run into problems like "who deallocates this?" You shouldn't feel the need to say:
// after many, many lines of code and algorithmic branches...
// now I forgot about s3: was it assigned to s1 or s2?
The solution is: don't forget! You're in control of your code, just look up the page a bit. Design it to be bulletproof against leaking memory out to other functions without a clear understanding that "hey, you can read this thing, but there are no guarantees that it will be valid after X or Y. It's not your memory, treat it as such."
Or maybe it is your memory. Case in point: your call to strdup. strdup lets you know (via documentation) that it is your responsibility to deallocate the string that it returns to you. How you do that is up to you, but your best bet is to limit its scope to be as narrow as possible and to only keep it around for as short a time as necessary.
It takes time and practice for this to become second nature. You will create some projects which handle memory poorly before you get good at it. That's ok; your mistakes in the beginning will teach you exactly what not to do, and you'll avoid repeating them in the future (hopefully!)
Also, as @Lasse alluded to in the comments, you don't have to worry about s3, which is a copy of a pointer, not of the entire chunk of memory. If s3 was assigned from s2 and you call free on both s2 and s3, you free the same block twice and end up with undefined behavior.
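One way to make the ownership described above explicit is to carry it alongside the pointer. The following is only a minimal sketch: the type tracked_str and the helpers ts_borrow, ts_copy and ts_release are hypothetical names invented for illustration, not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type: a string pointer plus a flag recording
   whether this handle owns heap memory that must be freed. */
typedef struct {
    const char *text;
    bool owned;
} tracked_str;

/* Wrap a string literal (or any other memory we must not free). */
static tracked_str ts_borrow(const char *s)
{
    tracked_str t = { s, false };
    return t;
}

/* Make a heap copy that this handle owns; text is NULL on failure. */
static tracked_str ts_copy(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s) + 1;          /* +1 for the terminating NUL */
    char *p = malloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    tracked_str t = { p, p != NULL };
    return t;
}

/* Free only what we own, then reset so a second call is harmless. */
static void ts_release(tracked_str *t)
{
    assert(t != NULL);
    if (t->owned)
        free((void *)t->text);
    t->text = NULL;
    t->owned = false;
}
```

With this shape, s3 in the question would be a tracked_str, and the cleanup code calls ts_release without needing to know which branch was taken.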

Here is a practical way:
Although the C standard does not require it, compilers commonly emit a single copy of each distinct string literal in the read-only data section of the executable image, at least within one translation unit.
In other words, every occurrence of the string "1234567890" in that code is then translated into the same memory address.
So at the point where you want to deallocate that string, you can simply compare its address with the address of the literal string "1234567890":
if (s3 != "1234567890") // address comparison
    free(s3);
To emphasize this again, it is not imposed by the C standard, which leaves it unspecified whether identical literals share storage, so the comparison can fail with a compiler or build setting that does not pool strings. Comparing against s1 itself does not depend on pooling.
UPDATE:
The above is merely a practical trick referring directly to the question at hand (rather than to the motivation behind this question).
Strictly speaking, if you've reached a point in your implementation where you have to distinguish between a statically allocated string and a dynamically allocated string, then I would tend to guess that your initial design is flawed somewhere along the line.
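If the pointer to the literal is still in scope (s1 in the question), comparing against that pointer is well defined regardless of how the compiler pools literals. A minimal sketch, where free_unless is a hypothetical helper name chosen for this example:

```c
#include <assert.h>
#include <stdlib.h>

/* Frees `s` unless it is the very pointer we know refers to storage
   that must not be freed (for example a string literal).
   Returns 1 if free() was called, 0 otherwise. */
static int free_unless(char *s, const char *known_literal)
{
    assert(known_literal != NULL);
    if (s == known_literal)
        return 0;        /* same address: leave the literal alone */
    free(s);             /* free(NULL) is a harmless no-op */
    return 1;
}
```

This only works for pointers you can still name, so it does not replace a clear ownership rule; it merely removes the dependence on compiler behavior.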

Related

In C, how do I make a string with non-static duration time?

I have a string, that is only used once when my application launches. Ordinary string literals, eg. "Hello" are static, meaning they're only deallocated when the program ends. I don't want that. They can be deallocated earlier. How do I say, Hey, like, this string literal shouldn't be static. It should be deallocated when the scope ends. How do I do that? For example,
memcpy(GameDir+HomeDirLen, "/.Data", 7);
The "/.Data" is still stored in RAM as the literal even long after this line of code runs. That's a waste, because it's only used once.
With typical implementations, if your program contains the string "/.Data" anywhere, either as a literal or as an initializer for an array of any duration, then the program is going to contain those bytes somewhere in the executable. They'll be loaded (or mapped) into memory when the program loads, and I don't know of any implementation that can free such memory before the program exits. So the other answers so far don't really accomplish what you want.
(If your array was of auto duration, then initializing is typically done under the hood by copying from an anonymous static string. Or it could be done by a sequence of immediate store instructions, which probably uses even more memory.)
So if you really want to ensure that those bytes don't occupy memory for the life of the process, you'll have to get them from somewhere other than the program itself. For instance, you could store the string in a file, open it, and read the string into an auto or malloced array. Then you really will recover the memory when the array goes out of scope or is freed (assuming, of course, that free actually does recover memory in a way that's useful to you). You could also use mmap if your system provides it.
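The file-based approach from the paragraph above could look roughly like this. It is a sketch under assumptions: read_string_file is a hypothetical helper, and the file it reads is something you would have to ship alongside the program.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read up to max_len bytes of the file at `path`
   into a freshly malloc'ed, NUL-terminated buffer.
   Returns NULL on failure; the caller owns and frees the result. */
static char *read_string_file(const char *path, size_t max_len)
{
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    char *buf = malloc(max_len + 1);       /* +1 for the terminator */
    if (buf != NULL) {
        size_t n = fread(buf, 1, max_len, f);
        buf[n] = '\0';                     /* n is at most max_len */
    }
    fclose(f);
    return buf;
}
```

Once the caller has used the string and called free on it, the bytes no longer belong to the program, although whether the allocator hands the memory back to the operating system is up to the implementation.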
On the other hand, modern operating systems usually have virtual memory. So if your string literal is in the read-only data section of the program, then if physical memory becomes tight, the system can simply drop that page of physical memory and use it for something else. If your program should attempt to access that data again, the system will allocate a new page and transparently populate it from the executable file on disk - but if you never access it, that will never happen.
Of course this doesn't help much if your string is really only 7 bytes, because there will be lots of other stuff in that page of memory (a page is commonly 4KB or somewhere around there). But if your string is really big, or you have a lot of such strings, then this effect may work just as well as actually freeing the memory. You may even be able to use various compiler-specific options to ensure that all your only-needed-once strings are placed contiguously in the executable, so that they will all be in the same pages of memory.
I have a string, that is only used once when my application launches.
Ordinary string literals, eg. "Hello" are static, meaning they're only
deallocated when the program ends. I don't want that. They can be
deallocated earlier. How do I say, Hey, like, this string literal
shouldn't be static. It should be deallocated when the scope ends. How
do I do that?
You cannot. All string literals have static storage duration, and that's really the only way they could work. If you have a string literal in your program source that is used in any way, then the program image has to contain the bytes of the literal's representation somewhere among the program data. If the literal appears inside a function, as must be the case in your example, then the representation needs to be retained for use each time the function is called. Similar applies to uses at file scope: string literals used there typically are accessible for the entire run of the program.
The exception is string literals used as initializers for (other) character arrays with static storage duration. Such an initialization results in, initially, two identical copies of the same data, at most one of which is actually accessible at run time. There's no use for retaining the data for the literal separately. C does not specify a way for you to express that the literal should not be retained, but your compiler is at liberty to omit the unneeded duplicate at its own discretion, and at least some do.
Compilers may also fold identical string literals, and perhaps even fold literals that just have identical tails, and / or perform other space-saving optimizations. And your compiler is likely to be better than you are at recognizing when and how such optimizations can safely be performed.
This answer does not really match your specific need. I'll leave it for the comments.
You can use a compound literal ... and a pointer to it
char *p = (char[]){"Hello!"}; // vs char *p = "Hello!"
*p = 'C'; // legal here; with char *p = "Hello!" the same write would be illegal
see https://ideone.com/62UNKO
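To spell out the difference, here is a small sketch; compound_literal_demo is a hypothetical function name. Note that the bytes used to initialize the compound literal still have to live somewhere in the program image, so this changes the lifetime of the array but does not by itself save memory.

```c
#include <assert.h>
#include <string.h>

/* Modifies a compound literal in place and returns its first char. */
static char compound_literal_demo(void)
{
    /* Inside a function, the compound literal is an unnamed char array
       with automatic storage duration: it is writable and its lifetime
       ends with the enclosing block. */
    char *p = (char[]){"Hello!"};
    p[0] = 'C';                        /* fine: modifies the array */
    assert(strcmp(p, "Cello!") == 0);

    /* A string literal has static storage duration and must not be
       modified, so a const pointer is the safer way to refer to it. */
    const char *q = "Hello!";
    assert(q[0] == 'H');

    return p[0];
}
```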

Why do so many standard C functions tamper with parameters instead of returning values?

Many functions like strcat, strcpy and alike don't return the actual value but change one of the parameters (usually a buffer). This of course creates a boatload of side effects.
Wouldn't it be far more elegant to just return a new string? Why isn't this done?
Example:
char *copy_string(char *text, size_t length) {
    char *result = malloc(sizeof(char) * length);
    for (int i = 0; i < length; ++i) {
        result[i] = text[i];
    }
    return result;
}
int main() {
    char *copy = copy_string("Hello World", 12);
    // *result now lingers in memory and can not be freed?
}
I can only guess it has something to do with memory leaking since there is dynamic memory being allocated inside of the function which you can not free internally (since you need to return a pointer to it).
Edit: From the answers it seems that it is good practice in C to work with parameters rather than creating new variables. So I should aim for building my functions like that?
Edit 2: Will my example code lead to a memory leak? Or can *result be free'd?
To answer your original question: C, at the time it was designed, was tailored to be a language of maximum efficiency. It was, basically, just a nicer way of writing assembly code (the guy who designed it, wrote his own compiler for it).
What you say (that parameters are often used rather than return codes) is mainly true for string handling. Most other functions (those that deal with numbers for example) work through return codes as expected. Or they only modify values for parameters if they have to return more than one value.
String handling in C today is considered one of the major (if not THE major) weaknesses of C. But those functions were written with performance in mind, and with the machines available in those days (and the intent of performance), working on the caller's buffers was the way of choice.
Re your edit 1: Today other intents may apply. Performance usually isn't the limiting factor. Equally or more important are readability, robustness, and resistance to error. And, as said, string handling in C is today generally considered a horrible relic of the past. So it's basically your choice, depending on your intent.
Re your edit 2: Yes, the memory will leak unless you call free(copy). This ties into edit 1, proneness to error: it's easy to forget the free and create leaks that way (or to attempt to free it twice, or to access it after it was freed). Returning a new string may be more readable, but it is more prone to error too (even more than the clunky original C approach of modifying the caller's buffer).
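For reference, this is roughly how the question's example looks with the allocation checked and the copy released. It is a sketch: copy_string_demo is a hypothetical name standing in for the body of main.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of the first `length` bytes of `text`,
   or NULL if allocation fails. The caller must free() the result. */
static char *copy_string(const char *text, size_t length)
{
    assert(text != NULL);
    char *result = malloc(length);
    if (result != NULL)
        memcpy(result, text, length);
    return result;
}

/* The pointer returned by the function is the one to free. */
static int copy_string_demo(void)
{
    char *copy = copy_string("Hello World", 12);   /* 11 chars + NUL */
    if (copy == NULL)
        return 0;
    int same = strcmp(copy, "Hello World") == 0;
    free(copy);          /* without this call the 12 bytes would leak */
    return same;
}
```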
Generally, I'd suggest, whenever you have the choice, working with a newer language that supports std::string or something similar.
Why do so many standard C functions tamper with parameters instead of returning values?
Because that's often what the users of the C library want.
Many functions like strcat, strcpy and alike don't return the actual value but change one of the parameters (usually a buffer). This of course creates a boatload of side effects. Wouldn't it be far more elegant to just return a new string? Why isn't this done?
It's not very efficient to allocate memory, and it would require the user to free() it later, which is an unnecessary burden on the user. Efficiency and letting users do what they want (even if they want to shoot themselves in the foot) are part of C's philosophy.
Besides, there are syntax/implementation issues. For example, how can the following be done if the strcpy() function actually returns a newly allocated string?
char arr[256] = "Hello";
strcpy(arr, "world");
It can't, because C doesn't allow you to assign to an array (arr).
Basically, you are questioning why C is the way it is. For that question, the common answer is "historical reasons".
Two reasons:
Properly designed functions should only concern themselves with their designated purpose, and not unrelated things such as memory allocation.
Making a hard copy of the string would make the function far slower.
So for your example, if there is a need for a hard copy, the caller should malloc the buffer and afterwards call strcpy. That separates memory allocation from the algorithm.
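A sketch of that division of labor, with the caller doing both the allocation and the clean-up; caller_allocates_demo is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* The caller decides where the copy lives (here: the heap), sizes the
   buffer, and releases it. strcpy only performs the copy.
   Returns 1 if the copy matches the source, 0 if allocation failed. */
static int caller_allocates_demo(const char *src)
{
    assert(src != NULL);
    char *buf = malloc(strlen(src) + 1);   /* +1 for the terminating NUL */
    if (buf == NULL)
        return 0;

    strcpy(buf, src);                      /* buffer is known to be big enough */
    int ok = strcmp(buf, src) == 0;

    free(buf);                             /* the code that allocated also frees */
    return ok;
}
```

The same strcpy call would work unchanged with a stack array or a static buffer, which is the flexibility the caller-provided-buffer design buys.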
On top of that, good design practice dictates that the module that allocated memory should also be responsible for freeing it. Otherwise the caller might not even realize that the function is allocating memory, and there would be a memory leak. If the caller instead is responsible for the allocation, then it is obvious that the caller is also responsible for clean-up.
Overall, C standard library functions are designed to be as fast as possible, meaning they will strive to meet the case where the caller has minimal requirements. A typical example of such a function is malloc, which doesn't even set the allocated data to zero, because that would take extra time. Instead they added an additional function calloc for that purpose.
Other languages have different philosophies, where they would for example force a hard copy for all string handling functions ("immutable objects"). This makes the function easier to work with and perhaps also the code easier to read, but it comes at the expense of a slower program, which needs more memory.
This is one of the main reasons why C is still widely used for development. It tends to be much faster and more efficient than most other languages (except raw assembler).

Is the function strcpy always dangerous?

Are functions like strcpy, gets, etc. always dangerous? What if I write a code like this:
int main(void)
{
    char *str1 = "abcdefghijklmnop";
    char *str2 = malloc(100);
    strcpy(str2, str1);
}
This way the function doesn't accept arguments(parameters...) and the str variable will always be the same length...which is here 16 or slightly more depending on the compiler version...but yeah 100 will suffice as of march, 2011 :).
Is there a way for a hacker to take advantage of the code above?
10x!
Absolutely not. Contrary to Microsoft's marketing campaign for their non-standard functions, strcpy is safe when used properly.
The above is redundant, but mostly safe. The only potential issue is that you're not checking the malloc return value, so you may be dereferencing null (as pointed out by kotlinski). In practice, this is likely to cause an immediate SIGSEGV and program termination.
An improper and dangerous use would be:
char array[100];
// ... Read line into uncheckedInput
// Extract substring without checking length
strcpy(array, uncheckedInput + 10);
This is unsafe because the strcpy may overflow, causing undefined behavior. In practice, this is likely to overwrite other local variables (itself a major security breach). One of these may be the return address. Through a return-to-libc attack, the attacker may be able to use C functions like system to execute arbitrary programs. There are other possible consequences to overflows.
However, gets is indeed inherently unsafe, and was removed from the language in C11 (known as C1X while in draft). There is simply no way to ensure the input won't overflow (causing the same consequences given above). Some people would argue it's safe when used with a known input file, but there's really no reason to ever use it. POSIX's getline is a far better alternative.
Also, the length of str1 doesn't vary by compiler. It should always be 17, including the terminating NUL.
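Putting the two checks mentioned above together (the malloc result and the source length), the question's code might be written as follows. This is a sketch; checked_copy_100 is a hypothetical name and the 100-byte size is taken from the question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copies `src` into a new 100-byte heap buffer, as in the question,
   but refuses to proceed if the source would not fit or malloc fails.
   Returns NULL on failure; the caller frees the result. */
static char *checked_copy_100(const char *src)
{
    assert(src != NULL);
    enum { BUF_SIZE = 100 };

    if (strlen(src) >= BUF_SIZE)   /* need room for the NUL as well */
        return NULL;

    char *dst = malloc(BUF_SIZE);
    if (dst == NULL)
        return NULL;

    strcpy(dst, src);              /* safe: length was checked above */
    return dst;
}
```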
You are forcefully stuffing completely different things into one category.
The function gets is indeed always dangerous. There's no way to make a safe call to gets regardless of what steps you are willing to take and how defensive you are willing to get.
Function strcpy is perfectly safe if you are willing to take the [simple] necessary steps to make sure that your calls to strcpy are safe.
That already puts gets and strcpy in vastly different categories, which have nothing in common with regard to safety.
The popular criticisms directed at safety aspects of strcpy are based entirely on anecdotal social observations as opposed to formal facts, e.g. "programmers are lazy and incompetent, so don't let them use strcpy". Taken in the context of C programming, this is, of course, utter nonsense. Following this logic we should also declare the division operator exactly as unsafe for exactly the same reasons.
In reality, there are no problems with strcpy whatsoever. gets, on the other hand, is a completely different story, as I said above.
yes, it is dangerous. After 5 years of maintenance, your code will look like this:
int main(void)
{
    char *str1 = "abcdefghijklmnop";
    {enough lines have been inserted here so as to not have str1 and str2 nice and close to each other on the screen}
    char *str2 = malloc(100);
    strcpy(str2, str1);
}
at that point, someone will go and change str1 to
str1 = "THIS IS A REALLY LONG STRING WHICH WILL NOW OVERRUN ANY BUFFER BEING USED TO COPY IT INTO UNLESS PRECAUTIONS ARE TAKEN TO RANGE CHECK THE LIMITS OF THE STRING. AND FEW PEOPLE REMEMBER TO DO THAT WHEN BUGFIXING A PROBLEM IN A 5 YEAR OLD BUGGY PROGRAM"
and forget to look where str1 is used and then random errors will start happening...
Your code is not safe. The return value of malloc is unchecked, if it fails and returns 0 the strcpy will give undefined behavior.
Besides that, I see no problem other than that the example basically does not do anything.
strcpy isn't dangerous as long as you know that the destination buffer is large enough to hold the characters of the source string; otherwise strcpy will happily copy more characters than your target buffer can hold, which can lead to several unfortunate consequences (stack/other variables overwriting, which can result in crashes, stack smashing attacks & co.).
But: if you have a generic char * in input which hasn't been already checked, the only way to be sure is to apply strlen to such string and check if it's too large for your buffer; however, now you have to walk the entire source string twice, once for checking its length, once to perform the copy.
This is suboptimal, since, if strcpy were a little bit more advanced, it could receive as a parameter the size of the buffer and stop copying if the source string were too long; in a perfect world, this is how strncpy would perform (following the pattern of other strn*** functions). However, this is not a perfect world, and strncpy is not designed to do this. Instead, the nonstandard (but popular) alternative is strlcpy, which, instead of going out of the bounds of the target buffer, truncates.
Several C library implementations do not provide this function (glibc, notably, did not until version 2.38), but you can still get one of the BSD implementations and put it in your application. A standard (but slower) alternative can be to use snprintf with "%s" as format string.
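The snprintf alternative can be wrapped so that truncation is reported to the caller. A minimal sketch, where bounded_copy is a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

/* Copies `src` into `dst` (capacity `cap` bytes), always NUL-terminating
   when cap > 0. Returns 1 if the whole string fit, 0 if it was
   truncated or snprintf reported an error. */
static int bounded_copy(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL);
    /* snprintf returns the length the full output would have had,
       not counting the terminator, so n >= cap means truncation. */
    int n = snprintf(dst, cap, "%s", src);
    return n >= 0 && (size_t)n < cap;
}
```

Unlike the strlen-then-strcpy approach, the caller gets both the bounded copy and the "did it fit" answer from one call.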
That said, since you're programming in C++ (edit I see now that the C++ tag has been removed), why don't you just avoid all the C-string nonsense (when you can, obviously) and go with std::string? All these potential security problems vanish and string operations become much easier.
The only way malloc may fail is when an out-of-memory error occurs, which is a disaster by itself. You cannot reliably recover from it because virtually anything may trigger it again, and the OS is likely to kill your process anyway.
As you point out, under constrained circumstances strcpy isn't dangerous. It is more typical to take in a string parameter and copy it to a local buffer, which is when things can get dangerous and lead to a buffer overrun. Just remember to check your copy lengths before calling strcpy and null terminate the string afterward.
Aside from potentially dereferencing NULL (as you do not check the result of malloc), which is UB and likely not a security threat, there is no potential security problem with this.
gets() is always unsafe; the other functions can be used safely.
gets() is unsafe even when you have full control on the input -- someday, the program may be run by someone else.
The only safe way to use gets() is to use it for a single run thing: create the source; compile; run; delete the binary and the source; interpret results.

Better practice to strcpy() or point to another data structure?

Because it's always easier to see code...
My parser fills this object:
typedef struct pair {
    char* elementName;
    char* elementValue;
} pair;
My interpreter wants to read that object and fill this one:
typedef struct thing {
    char* label;
} thing;
Should I do this:
thing.label = pair.elementName;
or this:
thing.label = (char*)malloc(strlen(pair.elementName)+1);
strcpy(thing.label, pair.elementName);
EDIT: Yes, I guess I should have specified what the rest of the program will do with the objects. I will eventually need to save "pair" to a file. So when thing.label is modified, then (at some point) pair.elementName needs to be modified to match. So I guess the former is the best way to do it?
No good answer to that question as there is too little context. It all depends on how the rest of the program manages the lifetimes of the objects it creates.
I would personally do the former, but it's a tradeoff. The former avoids the need to allocate new memory and copy data to it, but the latter avoids the confusion of aliasing by keeping thing.label and pair.elementName pointing to separate memory addresses, which means you need to free both of them (with the former you need to be sure to free exactly one, to avoid either a memory leak or a double free)
Here are some of the things that need to be known to answer the question:
Which object will 'own' the string? Or will both own their string (in which case a 'deep' copy is necessary)?
Are the lifetimes of the pair and thing objects related in any way? Will one object always 'outlive' the other? Does one of these objects own the other one?
If the pair and thing objects are independent, then copying the string data is probably the correct thing to do. If one is owned by the other, then that might indicate that a simple sharing of the pointer is appropriate.
Not that these are the only possible answers - just a couple of the easier ones.
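The two ownership choices can be sketched with the structs from the question. The function names thing_borrow_label and thing_copy_label are hypothetical, introduced only to label the two options.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct pair  { char *elementName; char *elementValue; } pair;
typedef struct thing { char *label; } thing;

/* Option 1: share the pointer. `t` only borrows the string; the pair
   remains the owner, and t->label dangles once the pair frees it. */
static void thing_borrow_label(thing *t, const pair *p)
{
    assert(t != NULL && p != NULL);
    t->label = p->elementName;
}

/* Option 2: deep copy. `t` owns its own string and must free it.
   Returns 1 on success, 0 on allocation failure. */
static int thing_copy_label(thing *t, const pair *p)
{
    assert(t != NULL && p != NULL && p->elementName != NULL);
    size_t n = strlen(p->elementName) + 1;
    t->label = malloc(n);
    if (t->label == NULL)
        return 0;
    memcpy(t->label, p->elementName, n);
    return 1;
}
```

With the borrowed pointer, a change made through thing.label is visible through pair.elementName, which matches the requirement stated in the question's edit; with the deep copy, the two strings evolve independently.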
From an "object" independence standpoint, it is probably better to make a copy of the data to avoid problems with dangling pointers.
It would be more efficient and faster to just assign the pointer, but unless that extra performance is highly critical, you will probably be better off (from a debugging standpoint) by making the copy.
The answer is, as always, "It Depends." If all you are doing with the "copied" value is reading it, it is probably okay to just copy the pointer address (i.e., the former), as long as you cleanup properly. If the "copied" value is going to be modified in any way, you are going to want to create a new string entirely (i.e., the latter) to avoid any unintended side effects caused by the "original" value changing (unless, of course, that is exactly the desired effect).
If you want to do a copy, and do all the cleanup afterwards... in C you should do this:
thing.label = strdup(pair.elementName);
I don't want to be a c-police, but please use safer strncpy() instead of strcpy().
char* strncpy(char *s1, const char *s2, size_t n);
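The strncpy function copies at most n characters from s2 into s1. Note that it does not write a terminating NUL if s2 contains n or more characters, so the result is not necessarily a valid string.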
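Because strncpy leaves the destination unterminated when the source fills the whole count, it is usually paired with an explicit terminator. A minimal sketch, with copy_truncating as a hypothetical wrapper name:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical wrapper: strncpy plus the explicit terminator that
   strncpy itself omits when the source is too long.
   `cap` is the full size of `dst` and must be at least 1. */
static void copy_truncating(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL && cap > 0);
    strncpy(dst, src, cap - 1);   /* may leave dst without a NUL... */
    dst[cap - 1] = '\0';          /* ...so terminate it ourselves */
}
```

Silent truncation can be a bug in its own right, so this is a trade-off rather than a blanket improvement over a length-checked strcpy.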

zeroing out memory

gcc 4.4.4 C89
I am just wondering what most C programmers do when they want to zero out memory.
For example, I have a buffer of 1024 bytes. Sometimes I do this:
char buffer[1024] = {0};
Which will zero all bytes.
However, should I declare it like this and use memset?
char buffer[1024];
.
.
memset(buffer, 0, sizeof(buffer));
Is there any real reason you have to zero the memory? What is the worst that can happen by not doing it?
The worst that can happen? You end up (unwittingly) with a string that is not NUL-terminated, or an integer that inherits whatever happened to be to the right of it after you printed to part of the buffer. Yet, unterminated strings can happen other ways, too, even if you initialized the buffer.
Edit (from comments) The end of the world is also a remote possibility, depending on what you are doing.
Either is undesirable. However, unless completely eschewing dynamically allocated memory, most statically allocated buffers are typically rather small, which makes memset() relatively cheap. In fact, much cheaper than most calls to calloc() for dynamic blocks, which tend to be bigger than ~2k.
C99 contains language regarding default initialization values; however, I can't seem to make gcc -std=c99 agree with that, using any kind of storage.
Still, with a lot of older compilers (and compilers that aren't quite c99) still in use, I prefer to just use memset()
I vastly prefer
char buffer[1024] = { 0 };
It's shorter, easier to read, and less error-prone. Only use memset on dynamically-allocated buffers, and then prefer calloc.
When you define char buffer[1024] inside a function without initializing it, its contents are indeterminate. For instance, Visual C++ in debug mode will fill it with 0xCC (0xCD is the pattern it uses for fresh heap memory). In Release mode, it will simply allocate the memory and not care what happens to be in that block from previous use.
Also, your examples demonstrate runtime vs. load-time initialization. If your char buffer[1024] = { 0 } is a global or static declaration, it is initialized before the program starts running; with typical toolchains an all-zero object is placed in .bss and takes no space in the binary, whereas a non-zero initializer is stored in the data segment and increases your binary size by about 1024 bytes (in this case). If the definition is in a function, it's stored on the stack and is allocated at runtime and not stored in the binary. If you provide an initializer in this case, the compiler emits code to fill buffer at runtime, typically the equivalent of a memset() for zeros or of a memcpy() from a stored copy of the initializer.
Hopefully, this helps you decide which method works best for you.
In this particular case, there's not much difference. I prefer = { 0 } over memset because memset is more error-prone:
It provides an opportunity to get the bounds wrong.
It provides an opportunity to mix up the arguments to memset (e.g. memset(buf, sizeof buf, 0) instead of memset(buf, 0, sizeof buf)).
In general, = { 0 } is better for initializing structs too. It effectively initializes all members as if you had written = 0 to initialize each. This means that pointer members are guaranteed to be initialized to the null pointer (which might not be all-bits-zero, and all-bits-zero is what you'd get if you had used memset).
On the other hand, = { 0 } can leave padding bits in a struct as garbage, so it might not be appropriate if you plan to use memcmp to compare them later.
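The difference between the two forms can be sketched on a small struct. The struct record and both helper names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <string.h>

struct record {
    char  tag;       /* padding bytes usually follow this member */
    int   count;
    char *name;
};

/* Member-wise zeroing: every member gets its zero value, and the
   pointer becomes a null pointer whatever its bit pattern is.
   Padding bytes are not guaranteed to be cleared. */
static void record_init_members(struct record *r)
{
    assert(r != NULL);
    *r = (struct record){ 0 };
}

/* Byte-wise zeroing: clears padding too, which matters if objects
   are later compared with memcmp or written out raw. On platforms
   where a null pointer is not all-bits-zero, `name` would not be
   a null pointer afterwards. */
static void record_clear_bytes(struct record *r)
{
    assert(r != NULL);
    memset(r, 0, sizeof *r);
}
```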
The worst that can happen by not doing it is that you write some data in character by character and later interpret it as a string (and you didn't write a null terminator). Or you end up failing to realise a section of it was uninitialised and read it as though it were valid data. Basically: all sorts of nastiness.
Memset should be fine (provided you correct the sizeof typo :-)). I prefer that to your first example because I think it's clearer.
For dynamically allocated memory, I use calloc rather than malloc and memset.
One of the things that can happen if you don't initialize is that you run the risk of leaking sensitive information.
Uninitialized memory may have something sensitive in it from a previous use of that memory. Maybe a password or crypto key or part of a private email. Your code may later transmit that buffer or struct somewhere, or write it to disk, and if you only partially filled it the rest of it still contains those previous contents. Certain secure systems require zeroizing buffers when an address space can contain sensitive information.
I prefer using memset to clear a chunk of memory, especially when working with strings. I want to know without a doubt that there will be a null delimiter after my string. Yes, I know you can append a \0 on the end of each string and some functions do this for you, but I want no doubt that this has taken place.
A function could fail when using your buffer, and the buffer remains unchanged. Would you rather have a buffer of unknown garbage, or nothing?
This post has been heavily edited to make it correct. Many thanks to Tyler McHenery for pointing out what I missed.
char buffer[1024] = {0};
Will set the first char in the buffer to null, and the compiler will then expand all non-initialized chars to 0 too. In such a case it seems that the differences between the two techniques boil down to whether the compiler generates more optimized code for array initialization or whether memset is optimized faster than the generated compiled code.
Previously I stated:
char buffer[1024] = {0};
Will set the first char in the buffer
to null. That technique is commonly
used for null terminated strings, as
all data past the first null is
ignored by subsequent (non-buggy)
functions that handle null terminated
strings.
Which is not quite true. Sorry for the miscommunication, and thanks again for the corrections.
Depends how you're filling it: if you're planning on writing to it before even potentially reading anything, then why bother? It also depends what you're going to use the buffer for: if it's going to be treated as a string, then you just need to set the first byte to \0:
char buffer[1024];
buffer[0] = '\0';
However, if you're using it as a byte stream, then the contents of the entire array are probably going to be relevant, so memseting the entire thing or setting it to { 0 } as in your example is a smart move.
I also use memset(buffer, 0, sizeof(buffer));
The risk of not using it is that there is no guarantee that the buffer you are using is completely empty, there might be garbage which may lead to unpredictable behavior.
Always memset-ing to 0 after malloc is a very good practice.
yup, calloc() method defined in stdlib.h allocates memory initialized with zeros.
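The two approaches side by side, as a sketch; zeroed_with_calloc and zeroed_with_memset are hypothetical names, and int is just an example element type.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Both functions return `count` zeroed ints on the heap, or NULL on
   failure. The caller frees the result. */
static int *zeroed_with_calloc(size_t count)
{
    assert(count > 0);
    return calloc(count, sizeof(int));    /* allocation and zeroing in one call */
}

static int *zeroed_with_memset(size_t count)
{
    assert(count > 0);
    int *p = malloc(count * sizeof *p);   /* the product can overflow for huge counts */
    if (p != NULL)
        memset(p, 0, count * sizeof *p);
    return p;
}
```

The calloc version is shorter and leaves no window in which the block holds garbage, which is why several answers above prefer it for dynamic buffers.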
I'm not familiar with the:
char buffer[1024] = {0};
technique. But assuming it does what I think it does, there's a (potential) difference to the two techniques.
The first one is done at COMPILE time, and the buffer will be part of the static image of the executable, and thus be 0's when you load.
The latter will be done at RUN TIME.
The first may incur some load time behaviour. If you just have:
char buffer[1024];
the modern loaders may well "virtually" load that...that is, it won't take any real space in the file, it'll simply be an instruction to the loader to carve out a block when the program is loaded. I'm not comfortable enough with modern loaders to say if that's true or not.
But if you pre-initialize it, then that will certainly need to be loaded from the executable.
Mind, neither of these have "real" performance impacts in the small. They may not have any in the "large". Just saying there's potential here, and the two techniques are in fact doing something quite different.
