zeroing out memory - c

gcc 4.4.4 C89
I am just wondering what most C programmers do when they want to zero out memory.
For example, I have a buffer of 1024 bytes. Sometimes I do this:
char buffer[1024] = {0};
Which will zero all bytes.
However, should I declare it like this and use memset?
char buffer[1024];
.
.
memset(buffer, 0, sizeof(buffer));
Is there any real reason you have to zero the memory? What is the worst that can happen by not doing it?

The worst that can happen? You end up (unwittingly) with a string that is not NULL terminated, or an integer that inherits whatever happened to be to the right of it after you printed to part of the buffer. Yet, unterminated strings can happen other ways, too, even if you initialized the buffer.
Edit (from comments) The end of the world is also a remote possibility, depending on what you are doing.
Either is undesirable. However, unless completely eschewing dynamically allocated memory, most statically allocated buffers are typically rather small, which makes memset() relatively cheap. In fact, much cheaper than most calls to calloc() for dynamic blocks, which tend to be bigger than ~2k.
c99 contains language regarding default initialization values, I can't, however, seem to make gcc -std=c99 agree with that, using any kind of storage.
Still, with a lot of older compilers (and compilers that aren't quite c99) still in use, I prefer to just use memset()

I vastly prefer
char buffer[1024] = { 0 };
It's shorter, easier to read, and less error-prone. Only use memset on dynamically-allocated buffers, and then prefer calloc.

When you define char buffer[1024] without initializing, you're going to get undefined data in it. For instance, Visual C++ in debug mode will initialize with 0xcd. In Release mode, it will simply allocate the memory and not care what happens to be in that block from previous use.
Also, your examples demonstrate runtime vs. compile time initialization. If your char buffer[1024] = { 0 } is a global or static declaration, it will be stored in the binary's data segment with its initialized data, thus increasing your binary size by about 1024 bytes (in this case). If the definition is in a function, it's stored on the stack and is allocated at runtime and not stored in the binary. If you provide an initializer in this case, the initializer is stored in the binary and an equivalent of a memcpy() is done to initialize buffer at runtime.
Hopefully, this helps you decide which method works best for you.

In this particular case, there's not much difference. I prefer = { 0 } over memset because memset is more error-prone:
It provides an opportunity to get the bounds wrong.
It provides an opportunity to mix up the arguments to memset (e.g. memset(buf, sizeof buf, 0) instead of memset(buf, 0, sizeof buf).
In general, = { 0 } is better for initializing structs too. It effectively initializes all members as if you had written = 0 to initialize each. This means that pointer members are guaranteed to be initialized to the null pointer (which might not be all-bits-zero, and all-bits-zero is what you'd get if you had used memset).
On the other hand, = { 0 } can leave padding bits in a struct as garbage, so it might not be appropriate if you plan to use memcmp to compare them later.

The worst that can happen by not doing it is that you write some data in character by character and later interpret it as a string (and you didn't write a null terminator). Or you end up failing to realise a section of it was uninitialised and read it as though it were valid data. Basically: all sorts of nastiness.
Memset should be fine (provided you correct the sizeof typo :-)). I prefer that to your first example because I think it's clearer.
For dynamically allocated memory, I use calloc rather than malloc and memset.

One of the things that can happen if you don't initialize is that you run the risk of leaking sensitive information.
Uninitialized memory may have something sensitive in it from a previous use of that memory. Maybe a password or crypto key or part of a private email. Your code may later transmit that buffer or struct somewhere, or write it to disk, and if you only partially filled it the rest of it still contains those previous contents. Certain secure systems require zeroizing buffers when an address space can contain sensitive information.

I prefer using memset to clear a chunk of memory, especially when working with strings. I want to know without a doubt that there will be a null delimiter after my string. Yes, I know you can append a \0 on the end of each string and some functions do this for you, but I want no doubt that this has taken place.
A function could fail when using your buffer, and the buffer remains unchanged. Would you rather have a buffer of unknown garbage, or nothing?

This post has been heavily edited to make it correct. Many thanks to Tyler McHenery for pointing out what I missed.
char buffer[1024] = {0};
Will set the first char in the buffer to null, and the compiler will then expand all non-initialized chars to 0 too. In such a case it seems that the differences between the two techniques boil down to whether the compiler generates more optimized code for array initialization or whether memset is optimized faster than the generated compiled code.
Previously I stated:
char buffer[1024] = {0};
Will set the first char in the buffer
to null. That technique is commonly
used for null terminated strings, as
all data past the first null is
ignored by subsequent (non-buggy)
functions that handle null terminated
strings.
Which is not quite true. Sorry for the miscommunication, and thanks again for the corrections.

Depends how you're filling it: if you're planning on writing to it before even potentially reading anything, then why bother? It also depends what you're going to use the buffer for: if it's going to be treated as a string, then you just need to set the first byte to \0:
char buffer[1024];
buffer[0] = '\0';
However, if you're using it as a byte stream, then the contents of the entire array are probably going to be relevant, so memseting the entire thing or setting it to { 0 } as in your example is a smart move.

I also use memset(buffer, 0, sizeof(buffer));
The risk of not using it is that there is no guarantee that the buffer you are using is completely empty, there might be garbage which may lead to unpredictable behavior.
Always memset-ing to 0 after malloc, is a very good practice.

yup, calloc() method defined in stdlib.h allocates memory initialized with zeros.

I'm not familiar with the:
char buffer[1024] = {0};
technique. But assuming it does what I think it does, there's a (potential) difference to the two techniques.
The first one is done at COMPILE time, and the buffer will be part of the static image of the executable, and thus be 0's when you load.
The latter will be done at RUN TIME.
The first may incur some load time behaviour. If you just have:
char buffer[1024];
the modern loaders may well "virtually" load that...that is, it won't take any real space in the file, it'll simply be an instruction to the loader to carve out a block when the program is loaded. I'm not comfortable enough with modern loaders say if that's true or not.
But if you pre-initialize it, then that will certainly need to be loaded from the executable.
Mind, neither of these have "real" performance impacts in the small. They may not have any in the "large". Just saying there's potential here, and the two techniques are in fact doing something quite different.

Related

Is it bad programming practice to declare an array type in C?

I have a bunch of arrays of various lengths and I'd like to keep the length close to the array, is it bad programming practice to define something like this?
typedef struct Array {
long len;
int* buf;
} Array;
Are there any obvious downsides or pitfalls to this?
No, not necessarily. You will need to pass the length of a buffer around with the pointer to make sure that you don't overrun it. It would be a slight performance hit due to the extra dereference, but that can be a worthwhile tradeoff if always having the length available avoids causing bugs. You would need to make sure that the length gets updated if you resize the buffer, and make sure that the buffer gets freed if you free your struct. Other than that, it makes sense to me, and it's what a lot of other higher-level languages do in their array implementations.
This is fine, a similar structure is commonly used for counted strings, aka vectors, which are not null terminated, hence the need to store their length. I wouldn't bother with this approach if the arrays are only used locally, though.
As it's been noted, it would be better to use an unsigned type such as size_t or uint32_t for the length.
When dealing with pointers one needs to take the usual precautions to ensure that it always has a valid reference before using or passing around, that allocated memory is freed, etc. Using valgrind is highly recommended.

Assign to a null pointer an area inside a function and preserve the value outside

I have a function that reads from a socket, it returns a char** where packets are stored and my intention is to use a NULL unsigned int pointer where I store the length of single packet.
char** readPackets(int numToRead,unsigned int **lens,int socket){
char** packets=(char**)malloc(numToRead);
int *len=(int*)malloc(sizeof(int)*numToRead);
*(lens)=len;
for(int i=0;i<numToRead;i++){
//read
packets[i]=(char*)malloc(MAX_ETH_LEN);
register int pack_len=read(socket,packets[i],MAX_ETH_LEN);
//TODO handler error in case of freezing
if(pack_len<=0){
i--;
continue;
}
len[i]=pack_len;
}
return packets;
}
I use it in this way:
unsigned int *lens_out=NULL;
char **packets=readPackets(N_PACK,&lens,sniff_sock[handler]);
where N_PACK is a constant defined previously.
Now the problem is that when I am inside the function everything works, in fact *(lens) points to the same memory area of len and outside the function lens_out points to the same area too. Inside the function len[i] equals to *(lens[i]) (I checked it with gdb).
The problem is that outside the function even if lens_out points to the same area of len elements with same index are different for example
len[0]=46
lens_out[0]=4026546640
Can anyone explain where I made the mistake?
Your statement char** packets=(char**)malloc(numToRead) for sure does not reserve enough memory. Note that an element of packets-array is of type char*, and that sizeof(char*) is probably 8 (eventually 4), but very very unlikely 1. So you should write
char** packets = malloc(sizeof(char*) * numToRead)
Otherwise, you write out of the bounds of reserved memory, thereby yielding undefined behaviour (probably the one you explained).
Note further that with i--; continue;, you get memory leaks since you assign a new memory block to the ith element, but you lose reference to the memory reserved right before. Write free(packets[i]);i--;continue; instead.
Further, len[0] is an integral type, whereas lens[0] refers to a pointer to int. Comparing these two does not make sense.
Firstly, I want to put it out there that you should write clear code for the sake of future maintenance, and not for what you think is optimal (yet). This entire function should merely be replaced with read. That's the crux of my post.
Now the problem is that when I am inside the function everything works
I disagree. On a slightly broader topic, the biggest problem here is that you've posted a question containing code which doesn't compile when copied and pasted unmodified, and the question isn't about the error messages, so we can't answer the question without guessing.
My guess is that you haven't noticed these error messages; you're running a stale binary which we don't have the source code for, we can't reproduce the issue and we can't see the old source code, so we can't help you. It is as valid as any other guess. For example, there's another answer which speculates:
Your statement char** packets=(char**)malloc(numToRead) for sure does not reserve enough memory.
The malloc manual doesn't guarantee that precisely numToRead bytes will be allocated; in fact, allocations to processes tend to be performed in pages and just as the sleep manual doesn't guarantee a precise number of milliseconds/microseconds, it may allocate more or it may allocate less; in the latter case, malloc must return NULL which your code needs to check.
It's extremely common for implementations to seem to behave correctly when a buffer is overflowed anyway. Nonetheless, it'd be best if you fixed that buffer overflow. malloc doesn't know about the type you're allocating; you need to tell it everything about the size, not just the number of elements.
P.S. You probably want select or sleep within your loop, you know, to "handle error in case of freezing" or whatever. Generally, OSes will switch context to another program when you call one of those, and only switch back when there's data ready to process. By calling sleep after sending or receiving, you give the OS a heads up that it needs to perform some I/O. The ability to choose that timing can be beneficial, when you're optimising. Not at the moment, though.
Inside the function len[i] equals to *(lens[i]) (I checked it with gdb).
I'm fairly sure you've misunderstood that. Perhaps gdb is implicitly dereferencing your pointers, for you; that's really irrelevant to C (so don't confuse anything you learn from gdb with anything C-related).
In fact, I strongly recommend learning a little bit less about gdb and a lot more about assert, because the former won't help you document your code for future maintenance from other people, including us, those who you ask questions to, where-as the latter will. If you include assert in your code, you're almost certainly strengthening your question (and code) much more than including gdb into your question would.
The types of len[i] and *(len[i]) are different, and their values are affected by the way types are interpreted. These values can only be considered equal When they're converted to the same type. We can see this through C11/3.19p1 (the definition of "value", where the standard establishes it is dependant upon type). len[i] is an int * value, where-as *(len[i]) is an int value. The two categories of values might have different alignment, representation and... well, they have different semantics entirely. One is for integral data, and the other is a reference to an object or array. You shouldn't be comparing them, no matter how equal they may seem; the information you obtain from such a comparison is virtually useless.
You can't use len[i] in a multiplication expression, for example. They're certainly not equal in that respect. They might compare equal (as a side-effect of comparison introducing implicit conversions), which is useless information for you to have, and that is a different story.
memcmp((int[]){0}, (unsigned char[]){ [sizeof int] = 42 }, sizeof int) may return 0 indicating that they're equal, but you know that array of characters contains an extra byte, right? Yeh... they're equal...
You must check the return value of malloc (and don't cast the return value), if you're using it, though I really think you should reconsider your options with that regard.
The fact that you use malloc means everyone who uses your function must then use free; it's locking down-stream programmers into an anti-pattern that can tear the architecture of software apart. You should separate categories of allocation logic and user interface logic from processing logic.
For example, you use read which gives you the opportunity to choose whatever storage duration you like. This means you have an immense number of optimisation opportunities. It gives you, the downstream programmer, the opportunity to write flexible code which assigns whatever storage duration you like to the memory used. Imagine if, on the other hand, you had to free every return value from every function... That's the mess you're encouraging.
This is especially a poor, inefficient design when constants are involved (i.e. your usecase), because you could just use an automatic array and get rid of the calls to malloc and free altogether... Your downstream programmers code could be:
char packet[size][count];
int n_read = read(fd, packet, size * count);
Perhaps you think using malloc to allocate (and later read) n spaces for packets is faster than using something else to allocate n spaces. You should test that theory, because from my experience computers tend to be optimised for simpler, shorter, more concise logic.
In anticipation:
But I can't return packet; like that!
True. You can't return packet; to your downstream programmer, so you modify an object pointed at by an argument. That doesn't mean you should use malloc, though.
Unfortunately, too many programs are adopting this "use malloc everywhere" mentality. It's reminiscent of the "don't use goto" crap that we've been fed. Rather than listening to cargo cult propaganda, I recommend thinking critically about what you hear, because your peers are in the same position as you; they don't necessarily know what they're talking about.

Specify a single char as read-only in C?

Can a single char be made read-only in C ? (I would like to make '\0' read-only to avoid buffer overflows.)
char var[5 + 1] = "Hello";
var[5] = '\0'; // specify this char as read-only ?
You can't make a mixed const/non-const array or string. The \0 not being overwritten should be guaranteed by the invariants in your program.
Can't do it, you have to ensure this through proper practice.
strictly speaking, no.
On some systems you could obtain something similar by playng with the linker, e.g. by specifying the address of "var" and forcing it to be 5 bytes before a "read only" section. But it works only on very few cases, and anyway it's not part of the C language.
No, such a concept of a read only memory location does not exist at all in C. C Programmers are left on their own when allocating/accessing/manipulating memory and they are assumed to know what they are doing :D. Such is the responsability that comes out from raw power :D.
If you can switch to C++ then you may consider using std::vector. While buffer overflows are still possible with std::vectors they are less likely when you use methods contained in the class interface. This methods abstracts element access and insertion and so you won't need to explicitly manage memory. Notice though you can still access directly an element that is outside of the vector size if you don't use iterators.
Since overflows are recurring problems in c/c++ there are several tools to help the programmer with these kind of mistakes. Tools range from static language analyzer to runtime detection of unprotected memory access in debug mode.
Make use of a const char * ptr which keeps track of your \0.

Declaring array before or inside loop

What is more effective?
Declaring array e.g. char buffer[4096] before loop and in each iteration clear its content by strlcpy or declare array inside the loop with initial "" value?
char buffer[4096] = "";
while(sth)
{
strlcpy(buffer,"",sizeof(buffer));
//some action;
}
vs
while(sth)
{
char buffer[4096] = "";
//some action;
}
There's probably no significant difference between the two. The only way to be sure would be to benchmark or look at the assembly generated by the compiler.
The second example is clearer and would be preferred.
Neither, there is no reason to clear the entire buffer if it's merely to be used as a string buffer. All you need to do is ensure that a NUL byte \0 is appended to the end of the relevant string.
If you want to guarantee that the str* functions start at the beginning of buffer, a simple *buffer = '\0' is all that is needed for the reason mentioned above.
The only possible difference in performance is the assignment to the buffer, and this will probably translated by your compiler to a single statement which will be identical to this:
*buffer = 0;
Compilers will not perform any stack management when you enter the loop. Whether you define buffer inside or outside the loop doesn't make any difference for the stack. Place for buffer will be reserved on the stack in the beginning of the routine, regardless of where you define your buffer variable (which doesn't mean that the compiler will allow you to access buffer outside the scope where it is defined).
Even the single assignment shown above will probably be optimized by your compiler if it sees that it doesn't make sense to reinitialize buffer every time over and over again.
To quote Knuth, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
Focus on readability. Optimize when you need to.
I suggest the first version. In the first version the array is allocated only once. In the second version, the array is allocated every iteration of the loop.
Although compilers may optimize the allocation (such as performing only once), I believe factoring it out indicates to the reader that the memory is allocated only once.
This may be nothing to worry about since many platforms allocate on the stack and allocation consists of changing the contents of the stack pointer.
When in doubt profile.
When considering optimizations, don't. Spend the time making the program correct and robust.

Is it a common practice to re-use the same buffer name for various things in C?

For example, suppose I have a buffer called char journal_name[25] which I use to store the journal name. Now suppose a few lines later in the code I want to store someone's name into a buffer. Should I go char person_name[25] or just reuse the journal_name[25]?
The trouble is that everyone who reads the code (and me too after a few weeks) has to understand journal_name is now actually person_name.
But the counter argument is that having two buffers increases space usage. So better to use one.
What do you think about this problem?
Thanks, Boda Cydo.
The way to solve this in the C way, if you really don't want to waste memory, is to use blocks to scope the buffers:
int main()
{
{
char journal_name[26];
// use journal name
}
{
char person_name[26];
// use person name
}
}
The compiler will reuse the same memory location for both, while giving you a perfectly legible name.
As an alternative, call it name and use it for both <.<
Some code is really in order here. However a couple of points to note:
Keep the identifier decoupled from your objects. Call it scratchpad or anything. Also, by the looks of it, this character array is not dynamically allocated. Which means you have to allocate a large enough scratch-pad to be able to reuse them.
An even better approach is to probably make your functions shorter: One function should ideally do one thing at a time. See if you can break up and still face the issue.
As an alternative to the previous (good) answers, what about
char buffer[25];
char* journal_name = buffer;
then later
char* person_name = buffer;
Would it be OK ?
If both buffers are automatic why not use this?
Most of compilers will handle this correctly with memory reused.
But you keep readability.
{
char journal_name[25];
/*
Your code which uses journal_name..
*/
}
{
char person_name[25];
/*
Your code which uses person_name...
*/
}
By the way even if your compiler is stupid and you are very low of memory you can use union but keep different names for readability. Usage of the same variable is worst way.
Please use person_name[25]. No body likes hard to read code. Its not going to do much if anything at all for your program in terms of memory. Please, just do it the readable way.
You should always (well unless you're very tight on memory) go for readability and maintainability when writing code for the very reason you mentioned in your question.
25 characters (unless this is just an example) isn't going to "break the bank", but if memory is at a premium you could dynamically allocate the storage for journal_name and then free it when you've finished with it before dynamically allocating the storage for person_name. Though there is the "overhead" of the pointer to the array.
Another way would be to use local scoping on the arrays:
void myMethod()
{
... some code
{
char journal_name[25];
... some more code
}
... even more code
{
char person_name[25];
... yet more code
}
}
Though even with this pseudo code the method is getting quite long and would benefit from refactoring into subroutines which wouldn't have this problem.
If you are worried about memory, and I doubt 25 bytes will be an issue, but then you can just use malloc and free and then you just have an extra 4-8 bytes being used for the pointer.
But, as others mentioned, readability is important, and you may want to decompose your function so that the two buffers are used in functions that actually give more indication as to their uses.
UPDATE:
Now, I have had a buffer called buffer that I would use for reading from a file, for example, and then I would use a function pointer that was passed, to parse the results so that the function reads the file, and handles it appropriately, so that the buffer isn't filled in and then I have to remember that it shouldn't be overwritten yet.
So, yet, reusing a buffer can be useful, when reading from sockets or files, but you want to localize the usage of this buffer otherwise you may have race conditions.

Resources