Difference between methods to create a character array - c

I am curious about the different methods to create a character array in C. Let's say we want to create a character array holding the string "John Smith". We could either initialize the array by supplying the number of elements explicitly, i.e.
char entireName[11] = "John Smith";
where there are four slots for the characters J-o-h-n, one for the space, five for S-m-i-t-h, and one for the string terminator \0.
You could also do the above by simply typing
char entireName[] = "John Smith";
Will there be a large difference in how these two character arrays are compiled? Is the same amount of memory allocated for the two expressions, and do they execute at the same speed?
What really is the difference?

Both are the same, but the second is advisable.
If you leave out the size of the array in the definition and initialization, the compiler will allocate exactly the size required. This is less error prone than a definition with a fixed size, because sometimes
we may forget to reserve space for the null terminator \0.
we may supply an initializer string longer than the specified size.
Granted, with proper warnings enabled you'll get a warning if you do the above, but with the second approach these scenarios cannot arise, so there is less to worry about.
EDIT:
FWIW, in the second scenario, the array length is decided by the length of the supplied initializer string. As we know, arrays cannot be resized at runtime, so that's the only real limitation of the second approach: if, at a later point, you want the array to hold something bigger than the supplied initializer string, the second approach is not suitable.
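To make that concrete, here is a minimal sketch (the variable names are invented) showing that both forms allocate the same 11 bytes:
#include <stdio.h>
int main(void) {
    char explicitName[11] = "John Smith"; /* size given explicitly */
    char inferredName[] = "John Smith";   /* compiler counts 10 chars + '\0' */
    printf("%zu %zu\n", sizeof explicitName, sizeof inferredName); /* 11 11 */
    return 0;
}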

The two versions are basically identical as given in the question.
However, as the array is not const, you apparently intend to change it, so the string literal is just there to initialize it. In that case, giving the array its maximum required size should be strongly considered.
The size allocated is the same for both cases of this example (the compiler calculates the size from the string literal and appends '\0').
However, if you intend to store a longer string in the array later, the version char entireName[] = "John Smith"; will result in undefined behaviour (UB: anything can happen). This is because the compiler only allocates the size required by the string literal (plus '\0'), but does not know you will need more during execution. In these cases, always use the explicit form with a size.
Warning: if the size of the string literal exactly matches the given size of the array, you might not be warned (tested with gcc 4.8.2 -Wall -Wextra: no warning) that the implicit '\0' cannot be stored. So use that with caution! I suspect some legacy reasons for this being legal (it was in pre-ANSI K&R C, actually), possibly to conserve RAM or for packing. However, if the string literal as given does not fit, gcc does warn, provided you enable most warnings (for gcc, see above).
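A hedged sketch of that exact-fit case (the variable name is invented; the declaration compiles quietly in C even though the terminator is dropped):
int main(void) {
    /* legal in C: the literal's 10 characters exactly fill the array,
       and the implicit '\0' is silently dropped */
    char tight[10] = "John Smith";
    /* tight is NOT a string now: strlen(tight) or printf("%s", tight)
       would read past the end of the array (undefined behaviour) */
    (void)tight;
    return 0;
}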
For a const array, always use the second version, as it is easier and states even more explicitly that you want the size of the given string literal. Without being able to change the value later on, nothing is gained by giving an explicit size, but (see above) some safety is lost.

There's no difference between the two as you specify the same size that the compiler would allocate otherwise.
However, if you explicitly specify the size and it is less than the size of the string literal that you intend to copy, for example,
char entireName[8] = "John Smith";
then only 8 chars will be copied and the rest will be discarded, and there won't be a 0 terminator either. (Strictly speaking, ISO C disallows an initializer string longer than the array; gcc accepts it with a warning.) This is not what you would want to do in most cases. For this reason, it's always better to let the compiler do it.
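If you genuinely need a shortened copy in a smaller buffer, a safer pattern (a sketch; the buffer size is arbitrary) is to truncate explicitly and terminate it yourself:
#include <stdio.h>
#include <string.h>
int main(void) {
    const char *src = "John Smith";
    char shortName[8];
    strncpy(shortName, src, sizeof shortName - 1); /* copy at most 7 chars */
    shortName[sizeof shortName - 1] = '\0';        /* terminate explicitly */
    printf("%s\n", shortName);                     /* prints "John Sm" */
    return 0;
}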

Related

Size doesn't increase, yet it stores larger data. How is this possible?

I was trying this code with gcc 6.3:
char a[2];
char b[]="Aditya";
strcpy(a,b);
printf("%s %lu\n",a,sizeof(a));
the output was:
aditya#aditya-Gateway-series:~/luc$ ./a
Aditya 2
How can variable a still be only 2 bytes big and yet store 7 bytes of information?
In your code:
strcpy(a,b);
invokes undefined behaviour, as you're trying to access memory which is not valid. Don't rely on the outcome.
To elaborate, a only has storage to hold two chars; if you try to write (here, to copy) more than a single-char string (with the null terminator), you'll overrun the allocated memory, thereby venturing into an invalid memory location.
The source buffer of strcpy(), b, has more content than can fit into the destination buffer a, so the operation overruns the boundary. It's the programmer's job to ensure that the destination buffer has sufficient memory.
That said, regarding the size calculation, let me add: an array's size, once defined, cannot be changed. You can choose to fill in the contents or leave them uninitialized / unused, but arrays, once defined, cannot be resized.
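A minimal sketch of a corrected version of the snippet: size the destination from the source so the copy stays in bounds:
#include <stdio.h>
#include <string.h>
int main(void) {
    char b[] = "Aditya";
    char a[sizeof b];                /* destination sized for source + '\0' */
    strcpy(a, b);                    /* now within bounds */
    printf("%s %zu\n", a, sizeof a); /* prints "Aditya 7" */
    return 0;
}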
As Sourav Ghosh said, your usage of strcpy is incorrect and induces undefined behavior. What I think happens is that a is of size 2, b is of size 7, and they happen to be placed next to each other in memory, resulting in 9 bytes of contiguous allocated memory. So after the copy, a is still of size 2 and holds "Ad"; however, printing it displays the whole string, as the print continues until the first end-of-string character. If you print b, I think you'll get "itya", as its address is located 2 bytes after a.
I hope this is clear enough and that it helps!
a only contains {'A', 'd'} - the remaining characters are written to the memory after a. In your case, you didn’t write over anything important, so the code appears to function as expected.
Appears to. The behavior of writing past the end of an array is undefined, meaning there’s no requirement on the compiler or runtime environment to handle the situation in any particular way. C does not require a bounds check on array accesses - it won’t throw an exception if you write past the end of an array. Your code may crash immediately, you may wind up with corrupted data, it may leave your program in a bad state such that it crashes later (those situations are fun to debug, let me tell you), or it may work with no apparent problems.
It is up to you, the programmer, to make sure the target array is large enough to hold the new value. The language doesn’t protect you at all here.

Is it possible to have a string ignore null terminator chars

I have a function that gets passed an array of chars, or a string. I use this array to hold data, and thus it has a lot of arbitrary characters in it, including null chars. My problem comes in when I am trying to retrieve this data: the compiler sees the null char and thinks the string ends there, effectively throwing out all the data after it. Is there an option where I can somehow make an array that is not ended by a null char?
A C string and an array of char are not the same thing. The first is implemented by means of the second, with the additional convention that the string ends where the array has its first 0 element.
So what you need is just an unsigned char[something] and you'd have to keep track of the length that you want to have separately. Then also you shouldn't use strcpy or similar functions but memcpy etc.
The null ('\0') character is treated as the string terminator in C. So you need to tell the code exactly how much data to read. Why don't you maintain a separate count of the size of the data, and then use functions which take that size to operate on the data?
Firstly, a string in C language is not some sort of "black box" object that can somehow choose to ignore or not ignore something out of its own will. It is based on a mere raw array of chars, of which you have full unrestricted control. This means that it is really you who chooses how to process the data stored in that array of chars. Not the compiler, not the string itself, but you and only you.
Secondly, a string in C language is defined as a sequence of characters ending with zero character. This immediately means that if you attempt using string-specific functions with your array, they will always stop at zeros. If you want your data to contain embedded zeros, then you should not call it "strings" and you should not use any string-specific functions with it. So, forget about strcmp, strcpy and such. Again, it is something you are responsible for, not the compiler.
Thirdly, the functions you would use with such data would typically be functions like memcpy for copying, memcmp for comparison and so on. Anything that's missing you'll have to implement yourself. And since you no longer have any terminating characters in your data, it is your responsibility to know where the data begins and where it ends.
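Here's a minimal sketch of that approach (the struct and its names are invented for illustration): keep the length next to the bytes and use the mem* functions:
#include <stdio.h>
#include <string.h>
/* hypothetical type: raw bytes plus an explicit length */
struct buffer {
    unsigned char data[64];
    size_t len;
};
int main(void) {
    struct buffer buf;
    unsigned char raw[] = { 'a', 0, 'b', 0, 'c' }; /* embedded zeros are fine */
    memcpy(buf.data, raw, sizeof raw);
    buf.len = sizeof raw;
    fwrite(buf.data, 1, buf.len, stdout); /* writes all 5 bytes, zeros included */
    putchar('\n');
    return 0;
}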

Is it more secure to add a length specifier to a printf call

My question is from a security perspective. I'm in the process of cleaning up some code and I've found the current code has an uncontrolled format string vulnerability; it's passing a string to printf(), something like:
void print_it(char * str)
{
printf(str);
}
which is obviously considered insecure programming, even gcc will typically ding you with at least some sort of warning:
warning: format not a string literal and no format arguments
Now to "fix" the issue we can make sure what you're getting is treated as a string
printf("%s", str);
But I was wondering if there's any... additional security in using a length specifier as well. Something like:
printf("%.*s", (int)sizeof(str), str);
I can't think of any reason why that would be more secure, but it wouldn't surprise me if I was missing something obvious here.
Sort of, but not to the extent that modern C shops need to guard their printf statements this carefully. That technique is used when you are handling non-null-terminated strings, which is very common when interacting with Fortran code.
Strictly speaking, it would be a security gain in that it guards against runaway reads; perhaps you were about to printf sensitive data following an overwritten null terminator.
But
printf("%.*s", (int)sizeof(str), str);
is far worse; you just said "ignore the null character and print out the full contents anyway." Or rather, it's worse unless you're dealing with space-padded strings all the way to their memory's end, which is likely the case if the string came from Fortran.
This however is extremely important:
printf("%s", str);
as printf(str) is a major security flaw. Read about printf attacks using the %n specifier, which writes to memory.
There's some additional security in
printf("%.*s", (int)sizeof(str), str);
since it will print at most sizeof(char*) bytes - usually four or eight - so it won't go and read much of the memory if str points to a char array that is not 0-terminated.
But more typically, it will cut the output short without a good reason to do so.
If you meant
printf("%.*s", (int)strlen(str), str);
that is entirely pointless, since in the cases where a precision for the printf would be necessary, the strlen call will do the same invalid memory accesses.
This idea won't work when the array is passed to a function, since arrays decay into a pointer, nor for any malloc'ed pointer, for that matter: sizeof(var) is going to give the size of the pointer, not the array. So it can't be used in the printf() to specify the length.
This is only applicable to automatic (stack allocated) arrays. So when can this be useful then? I can think of two cases:
1. when you write into the array more than the size of the array.
In this case, you have already caused undefined behaviour by writing somewhere that doesn't belong to you. End of story.
2. when you don't have a null-byte at the end of the string (but not crossed the boundary of the array).
In this case, the array has some valid content but not the null byte. Two possibilities here:
2.a. Using the length specifier is going to print the whole content of the array. So if you access uninitialized bytes in the array (even within its size), it's still going to cause undefined behaviour. Otherwise, you have to track the length of the valid content in order to use %s in the printf() along with the length specifier to avoid UB. But in that case, you already know the length of the valid content, so you can simply null-terminate it yourself rather than telling printf() to print only the valid part.
2.b. Let's say, you have initialized the whole array at the beginning with zeros. In the case, the array is going to be a valid string and hence no need of the length modifier.
So I'd say it's not of much use.
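For completeness, a sketch (buffer and length invented for illustration) of the one case where the precision does pay off: an unterminated buffer whose valid length you track yourself:
#include <stdio.h>
int main(void) {
    char buf[4] = { 'J', 'o', 'h', 'n' }; /* full: no room for '\0' */
    size_t len = sizeof buf;              /* valid length tracked separately */
    /* printf("%s", buf) would read past the array (undefined behaviour);
       the precision makes printf stop after len bytes instead */
    printf("%.*s\n", (int)len, buf);      /* prints "John" */
    return 0;
}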

Fastest way to copy an array - Does it have something questionable?

When working with arrays of the same length, consider creating a structure which contains only an array, so that it is easier to copy the array by simply copying one structure into another.
The definition and declaration of the structure would be like this:
typedef struct {
char array[X]; /* X is an arbitrary constant */
} Array;
Array array1;
Then, perform the copy by simply doing:
Array array2;
array2 = array1;
I have found that this is the fastest way of copying an array. Does it have any disadvantages?
Edit: X is an arbitrary constant, let's say 10. The array is not a variable length array.
X may be arbitrary (up to the limits of your compile environment), but it's (naturally) the same number for all objects of type Array. If you need many arrays of the same length and type to be copied to each other, there's nothing inherently wrong with this, although accessing these arrays might be more cumbersome than usual.
Array array1;
array1.array[0] // to access the first element
This works fine, but since X must be defined at compile-time it is rather inflexible. You can get the same performance over an array of any length by using memcpy instead. In fact, compilers will usually translate array2 = array1 into a call to memcpy.
As mentioned in the comments, whether direct assignment gets translated into a memcpy call depends on the size of the array amongst other compiler heuristics.
It's entirely dependent on the compiler and the level of optimization you use. The compiler "should" reduce it to memcpy with a constant size, which should in turn be reduced to whatever machine-specific operations exist for copying blocks of memory of various sizes. Copies of small blocks should be highly machine-specific these days. Actually calling the memcpy library function to copy 4 bytes would be so "10 years ago."
With optimizations off all performance bets are off.
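A runnable sketch of the two copies side by side, with X fixed at 10 as in the edit (the strings are arbitrary):
#include <stdio.h>
#include <string.h>
#define X 10
typedef struct {
    char array[X];
} Array;
int main(void) {
    Array array1 = { "John Smit" };        /* 9 chars + '\0' fit in X = 10 */
    Array array2;
    array2 = array1;                       /* struct assignment copies the whole array */
    char plain1[X] = "John Smit";
    char plain2[X];
    memcpy(plain2, plain1, sizeof plain1); /* the equivalent for a bare array */
    printf("%s %s\n", array2.array, plain2);
    return 0;
}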

zeroing out memory

gcc 4.4.4 C89
I am just wondering what most C programmers do when they want to zero out memory.
For example, I have a buffer of 1024 bytes. Sometimes I do this:
char buffer[1024] = {0};
Which will zero all bytes.
However, should I declare it like this and use memset?
char buffer[1024];
.
.
memset(buffer, 0, sizeof(buffer));
Is there any real reason you have to zero the memory? What is the worst that can happen by not doing it?
The worst that can happen? You end up (unwittingly) with a string that is not null-terminated, or an integer that inherits whatever happened to sit to the right of it after you printed into part of the buffer. Yet unterminated strings can happen in other ways, too, even if you initialized the buffer.
Edit (from comments) The end of the world is also a remote possibility, depending on what you are doing.
Either is undesirable. However, unless you completely eschew dynamically allocated memory, most statically allocated buffers are rather small, which makes memset() relatively cheap; in fact, much cheaper than most calls to calloc() for dynamic blocks, which tend to be bigger than ~2k.
C99 contains language regarding default initialization values; I can't, however, seem to make gcc -std=c99 agree with that, using any kind of storage.
Still, with a lot of older compilers (and compilers that aren't quite C99) still in use, I prefer to just use memset()
I vastly prefer
char buffer[1024] = { 0 };
It's shorter, easier to read, and less error-prone. Only use memset on dynamically-allocated buffers, and then prefer calloc.
When you define char buffer[1024] without initializing, you're going to get undefined data in it. For instance, Visual C++ in debug mode will initialize with 0xcd. In Release mode, it will simply allocate the memory and not care what happens to be in that block from previous use.
Also, your examples demonstrate runtime vs. compile-time initialization. If your char buffer[1024] = { 0 } is a global or static declaration, it will be placed in the binary's data segment with its initialized data, increasing your binary size by about 1024 bytes in this case (though many toolchains put all-zero objects in .bss instead, which costs no file space). If the definition is in a function, it's stored on the stack, allocated at runtime, and not stored in the binary. If you provide an initializer in this case, the initializer is stored in the binary and the equivalent of a memcpy() is done to initialize buffer at runtime.
Hopefully, this helps you decide which method works best for you.
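A small sketch of that distinction (the names are invented): where the buffer lives determines when it gets its zeros:
#include <stdio.h>
char g_buffer[1024] = { 0 }; /* static storage: zeroed before main() starts,
                                typically via .bss at no file-size cost */
void use_stack(void) {
    char s_buffer[1024] = { 0 }; /* automatic storage: initialized at runtime,
                                    on every call to this function */
    s_buffer[0] = 'x';
}
int main(void) {
    printf("%d\n", g_buffer[0]); /* prints 0 */
    use_stack();
    return 0;
}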
In this particular case, there's not much difference. I prefer = { 0 } over memset because memset is more error-prone:
It provides an opportunity to get the bounds wrong.
It provides an opportunity to mix up the arguments to memset (e.g. memset(buf, sizeof buf, 0) instead of memset(buf, 0, sizeof buf)).
In general, = { 0 } is better for initializing structs too. It effectively initializes all members as if you had written = 0 to initialize each. This means that pointer members are guaranteed to be initialized to the null pointer (which might not be all-bits-zero, and all-bits-zero is what you'd get if you had used memset).
On the other hand, = { 0 } can leave padding bits in a struct as garbage, so it might not be appropriate if you plan to use memcmp to compare them later.
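For instance (a minimal sketch), with a struct containing a pointer the two idioms differ in principle:
#include <string.h>
struct node {
    int count;
    char *name;
};
int main(void) {
    struct node a = { 0 };   /* a.name is a null pointer, guaranteed */
    struct node b;
    memset(&b, 0, sizeof b); /* b.name is all-bits-zero: a null pointer on
                                common platforms, but not guaranteed by the
                                standard */
    (void)a;
    (void)b;
    return 0;
}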
The worst that can happen by not doing it is that you write some data in character by character and later interpret it as a string (and you didn't write a null terminator). Or you end up failing to realise a section of it was uninitialised and read it as though it were valid data. Basically: all sorts of nastiness.
Memset should be fine (provided you correct the sizeof typo :-)). I prefer that to your first example because I think it's clearer.
For dynamically allocated memory, I use calloc rather than malloc and memset.
One of the things that can happen if you don't initialize is that you run the risk of leaking sensitive information.
Uninitialized memory may have something sensitive in it from a previous use of that memory. Maybe a password or crypto key or part of a private email. Your code may later transmit that buffer or struct somewhere, or write it to disk, and if you only partially filled it the rest of it still contains those previous contents. Certain secure systems require zeroizing buffers when an address space can contain sensitive information.
I prefer using memset to clear a chunk of memory, especially when working with strings. I want to know without a doubt that there will be a null terminator after my string. Yes, I know you can append a \0 to the end of each string, and some functions do this for you, but I want no doubt that this has taken place.
A function could fail when using your buffer, and the buffer remains unchanged. Would you rather have a buffer of unknown garbage, or nothing?
This post has been heavily edited to make it correct. Many thanks to Tyler McHenery for pointing out what I missed.
char buffer[1024] = {0};
Will set the first char in the buffer to null, and the compiler will then expand all non-initialized chars to 0 too. In such a case, it seems that the difference between the two techniques boils down to whether the compiler generates more optimized code for array initialization, or whether memset is optimized faster than the generated compiled code.
Previously I stated:
char buffer[1024] = {0};
Will set the first char in the buffer to null. That technique is commonly used for null terminated strings, as all data past the first null is ignored by subsequent (non-buggy) functions that handle null terminated strings.
Which is not quite true. Sorry for the miscommunication, and thanks again for the corrections.
Depends how you're filling it: if you're planning on writing to it before even potentially reading anything, then why bother? It also depends what you're going to use the buffer for: if it's going to be treated as a string, then you just need to set the first byte to \0:
char buffer[1024];
buffer[0] = '\0';
However, if you're using it as a byte stream, then the contents of the entire array are probably going to be relevant, so memset-ing the entire thing or setting it to { 0 } as in your example is a smart move.
I also use memset(buffer, 0, sizeof(buffer));
The risk of not using it is that there is no guarantee that the buffer you are using is completely empty; there might be garbage, which may lead to unpredictable behavior.
Always memset-ing to 0 after malloc is a very good practice.
Yup, the calloc() function declared in stdlib.h allocates memory initialized with zeros.
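The two heap idioms side by side (a minimal sketch):
#include <stdlib.h>
#include <string.h>
int main(void) {
    char *p = malloc(1024);
    if (p != NULL)
        memset(p, 0, 1024);    /* zero explicitly after malloc */
    char *q = calloc(1024, 1); /* arrives already zeroed */
    free(p);
    free(q);
    return 0;
}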
I'm not familiar with the:
char buffer[1024] = {0};
technique. But assuming it does what I think it does, there's a (potential) difference between the two techniques.
The first one is done at COMPILE time, and the buffer will be part of the static image of the executable, and thus be 0's when you load.
The latter will be done at RUN TIME.
The first may incur some load time behaviour. If you just have:
char buffer[1024];
the modern loaders may well "virtually" load that... that is, it won't take any real space in the file; it'll simply be an instruction to the loader to carve out a block when the program is loaded. I'm not comfortable enough with modern loaders to say whether that's true or not.
But if you pre-initialize it, then that will certainly need to be loaded from the executable.
Mind, neither of these have "real" performance impacts in the small. They may not have any in the "large". Just saying there's potential here, and the two techniques are in fact doing something quite different.
