Differentiating between embedded NUL and NUL-terminator - c

I have a const char* pointing to binary data written with hex escapes, and I need to find the length of the data. To do that I scan for the NUL terminator, but when a \x00 byte comes up it is detected as the terminator and the returned length is wrong.
How can I get around that?
const char* orig = "\x09\x00\x04\x00\x02\x00\x10\x00\x42\x00\x02\x00\x01\x80\x0f\x00";
uint64_t get_char_ptr_len(const char *c)
{
uint64_t len = 0;
if (*c)
{
while (c[len] != '\0') {
len++;
}
}
return len;
}

\x00 is the NUL terminator; in fact, \x00 is just another way to write \0.
If you have byte data that contains embedded NULs, you cannot use NUL as a terminator, period; you have to keep both a pointer to the data and the data size, exactly as functions that operate on "raw bytes" (such as memcpy or fwrite) do.
As for literals, make sure you initialize an array (and not just take a pointer to it) to be able to retrieve its size using sizeof:
const char orig[] = "\x09\x00\x04\x00\x02\x00\x10\x00\x42\x00\x02\x00\x01\x80\x0f\x00";
Now you can use sizeof(orig) to get its size (which will be one larger than the number of explicitly written characters, as there's the implicit NUL terminator at the end). Be careful though: arrays decay to pointers at pretty much every available occasion, in particular when being passed to functions, and sizeof on the resulting pointer gives the pointer's size, not the array's.
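To make the difference concrete, here is a minimal sketch using the same bytes as in the question. The names orig_len and scan_len are made up for illustration; scan_len just stands in for any loop that stops at the first NUL.

```c
#include <stddef.h>
#include <string.h>

/* Same bytes as in the question, stored in an array so sizeof sees them all. */
static const char orig[] =
    "\x09\x00\x04\x00\x02\x00\x10\x00\x42\x00\x02\x00\x01\x80\x0f\x00";

/* Number of data bytes: sizeof counts the implicit terminator, so drop it. */
static const size_t orig_len = sizeof(orig) - 1;

/* What a NUL-scanning loop reports: it stops at the first \x00. */
size_t scan_len(const char *p)
{
    return strlen(p);
}
```

With this literal, sizeof(orig) covers all 16 written bytes plus the terminator, while scan_len(orig) stops after the very first byte.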

\x indicates hexadecimal notation.
Have a look at an ASCII table to see what \x00 represents.
\x00 = NUL (the null character), in hexadecimal notation. Do not confuse it with NULL, the null pointer.
\x00 is just another way to write \0.
Try
const char orig[] = "\x09\x00\x04\x00\x02\x00\x10\x00\x42\x00\x02\x00\x01\x80\x0f\x00";
and
len=sizeof(orig)/sizeof(char);

Related

How does an array terminate?

As we know a string terminates with '\0'.
This lets code know where the string ends and protects against reading garbage values.
But how does an array terminate?
If '\0' were used, it would be taken as 0, a valid integer.
So how does the compiler know the array has ended?
C does not perform bounds checking on arrays. That's part of what makes it fast. However that also means it's up to you to ensure you don't read or write past the end of an array. So the language will allow you to do something like this:
int arr[5];
arr[10] = 4;
But if you do, you invoke undefined behavior. So you need to keep track of how large an array is yourself and ensure you don't go past the end.
Note that this also applies to character arrays, which can be treated as strings if they contain a sequence of characters terminated by a null byte. So this is a string:
char str[10] = "hello";
And so is this:
char str[5] = { 'h', 'i', 0, 0, 0 };
But this is not:
char str[5] = "hello"; // no space for the null terminator.
C doesn't provide any protections or guarantees to you about 'knowing the array is ended.' That's on you as the programmer to keep in mind in order to avoid accessing memory outside your array.
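As a rough sketch of what "keeping track yourself" usually looks like: compute the element count where the array is still an array, then pass it along. ARRAY_LEN and sum_ints are hypothetical names used only for this example.

```c
#include <stddef.h>

/* Element count of a true array; this does not work on a pointer or on an
   array parameter, because those have already decayed. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical helper: the caller passes the length, since the callee
   cannot recover it from the pointer alone. */
long sum_ints(const int *arr, size_t len)
{
    long total = 0;
    for (size_t i = 0; i < len; i++) {   /* never reads past arr[len - 1] */
        total += arr[i];
    }
    return total;
}
```

A caller holding int a[5] would write sum_ints(a, ARRAY_LEN(a)); nothing in the language checks that the two arguments actually agree.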
The C language does not have a native string type. In C, strings are actually one-dimensional arrays of characters terminated by a null character '\0'.
From C Standard#7.1.1p1 [emphasis mine]
A string is a contiguous sequence of characters terminated by and including the first null character. The term multibyte string is sometimes used instead to emphasize special processing given to multibyte characters contained in the string or to avoid confusion with a wide string. A pointer to a string is a pointer to its initial (lowest addressed) character. The length of a string is the number of bytes preceding the null character and the value of a string is the sequence of the values of the contained characters, in order.
A string is a special case of a character array: one that is terminated by a null character '\0'. All the standard library string functions read their input based on this rule, i.e. they read until the first null character.
The null character '\0' has no special significance in C arrays of any type other than character arrays.
So, apart from strings, for all other types of array the programmer is supposed to explicitly keep track of the number of elements in the array.
Also note that the first null character ('\0') indicates where the string ends, but it does not stop you from reading beyond it.
Consider this example:
#include <stdio.h>
int main(void) {
char str[5] = {'H', 'i', '\0', 'z'};
printf ("%s\n", str);
printf ("%c\n", str[3]);
return 0;
}
When you print the string
printf ("%s\n", str);
the output you will get is - Hi
because with the %s format specifier, printf() writes every byte up to, but not including, the first null terminator [note the use of the null character in strings]. You can still print the 4th character of the array, because it is within the bounds of the char array str even though it lies beyond the first '\0' character:
printf ("%c\n", str[3]);
the output you will get is - z
Additional:
Trying to access an array beyond its size leads to undefined behavior: the program may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended.
It’s just a matter of convention. If you wanted to, you could totally write code that handled array termination (for arrays of any type) via some sentinel value. Here’s an example that does just that, arbitrarily using -1 as the sentinel:
int length(int arr[]) {
int i;
for (i = 0; arr[i] != -1; i++) {}
return i;
}
However, this is obviously utterly impractical: You couldn’t use -1 in the array any longer.
By contrast, for C strings the sentinel value '\0' is less problematic because it’s expected that normal text won’t contain this character. This assumption is kind of valid. But even so there are obviously many strings which do contain '\0' as a valid character, and null-termination is therefore by no means universal.
One very common alternative is to store strings in a struct that looks something like this:
struct string {
unsigned int length;
char *buffer;
};
That is, we explicitly store a length alongside a buffer. This buffer isn’t null-terminated (although in practice it often has an additional terminal '\0' byte for compatibility with C functions).
Anyway, the answer boils down to: For C strings, null termination is a convenient convention. But it is only a convention, enforced by the C string functions (and by the C string literal syntax). You could use a similar convention for other array types but it would be prohibitively impractical. This is why other conventions developed for arrays. Notably, most functions that deal with arrays expect both an array and a length parameter. This length parameter determines where the array terminates.
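Here is a small sketch of the pointer-plus-length convention applied to bytes that may contain zeros. The byte_view type and count_byte function are invented for this example and are not from any library.

```c
#include <stddef.h>

/* Hypothetical view type: the length travels with the pointer. */
struct byte_view {
    const unsigned char *data;
    size_t length;
};

/* Count occurrences of a byte value; zero bytes are ordinary data here,
   because the loop is bounded by length and not by a sentinel. */
size_t count_byte(struct byte_view v, unsigned char value)
{
    size_t n = 0;
    for (size_t i = 0; i < v.length; i++) {
        if (v.data[i] == value) {
            n++;
        }
    }
    return n;
}
```

Because the loop is driven by length, asking it to count 0x00 bytes works just like counting any other value, which a sentinel-based loop cannot do.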

Converting non printable ASCII character to binary

I am trying to convert a string of non-printable ASCII character to binary. Here is the code:
int main(int argc, char *argv[])
{
char str[32];
sprintf(str,"\x01\x00\x02");
printf("\n[%x][%x][%x]",str[0],str[1],str[2]);
return 1;
}
I expect the output should be [1][0][2], but it prints [1][0][4].
What am I doing wrong here?
The sprintf operation ended at the first instance of \x00 in your string literal, because NUL (U+0000) terminates strings in C. (That the compiler does not complain when you write \x00 inside a string literal is arguably a misfeature of the language.) Thus str[2] accesses uninitialized memory and the program is entitled to print complete nonsense or even crash.
To do what you wanted to do, simply eliminate the sprintf:
#include <stdio.h>

int main(void)
{
static const unsigned char str[32] =
{ 0x01, 0x00, 0x02 }; // will be zero-filled to declared size
printf("[%02x][%02x][%02x]\n", str[0], str[1], str[2]);
return 0;
}
(Binary data should always be stored in arrays of unsigned char, not plain char; or uint8_t if you have it. Because U+0000 terminates strings, I think it's better style to write embedded binary data using an array literal rather than a string literal; but it is more typing. The static const is just because the data is never modified and known at compile time; the program would work without it. Don't declare argc and argv if you're not going to use them. Return zero, not one, from main to indicate successful completion.)
(Using sprintf the way you were using it is a bad idea for other reasons: for instance, if your binary block contained \x25 (also known as % in ASCII), it would try to read additional arguments-to-be-formatted, and again print complete nonsense or crash. If you have a good reason to not just use static initialized data, the right way to copy blocks of binary data around is memcpy.)
C strings end with a null byte, so sprintf only reads until \x00. Instead, you can use memcpy or simply initialize with
char str[32] = "\x01\x00\x02";
"\x00" prematurely terminates the format string, which is the 2nd argument of sprintf(). Obviously that was unintentional, but there is no way sprintf() can figure out that the first NUL is not the last NUL. So the format string it works on is actually shorter than what you intended to pass.
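For completeness, a minimal sketch of the memcpy route mentioned in the answers. copy_bytes is a hypothetical wrapper, not a standard function; it only adds a capacity check in front of memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n raw bytes into a buffer of capacity cap.
   Returns the number of bytes copied, or 0 if they would not fit. */
size_t copy_bytes(unsigned char *dst, size_t cap,
                  const unsigned char *src, size_t n)
{
    if (n > cap) {
        return 0;
    }
    memcpy(dst, src, n);   /* copies exactly n bytes, zeros included */
    return n;
}
```

Unlike sprintf, memcpy never inspects the bytes it copies, so neither 0x00 nor 0x25 ('%') gets special treatment.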

Printf Variable String Length Specifier

I have a struct that contains a string and a length:
typedef struct string {
char* data;
size_t len;
} string_t;
Which is all fine and dandy. But, I want to be able to output the contents of this struct using a printf-like function. data may not have a nul terminator (or have it in the wrong place), so I can't just use %s. But the %.*s specifier requires an int, while I have a size_t.
So the question now is, how can I output the string using printf?
Assuming that your string doesn't have any embedded NUL characters in it, you can use the %.*s specifier after casting the size_t to an int:
string_t *s = ...;
printf("The string is: %.*s\n", (int)s->len, s->data);
That's also assuming that your string length is less than INT_MAX. If you have a string longer than INT_MAX, then you have other problems (it will take quite a while to print out 2 billion characters, for one thing).
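A sketch of the cast with the INT_MAX case handled explicitly. format_string is a made-up helper; it uses snprintf rather than printf only so the result is easy to inspect, and the same format string works with printf. It still assumes there is no embedded NUL before len, since %.*s stops at one.

```c
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

typedef struct string {
    char *data;
    size_t len;
} string_t;

/* Hypothetical helper: format s into out (capacity cap) with %.*s.
   The precision argument must be an int, so clamp the length first. */
int format_string(char *out, size_t cap, const string_t *s)
{
    int precision = s->len > (size_t)INT_MAX ? INT_MAX : (int)s->len;
    /* At most `precision` bytes are read, so data need not be terminated. */
    return snprintf(out, cap, "%.*s", precision, s->data);
}
```

The clamp silently truncates absurdly long strings; whether that or an error return is the better choice depends on the caller.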
A simple solution would just be to use unformatted output:
fwrite(x.data, 1, x.len, stdout);
fwrite writes the entire requested range on success, so no loop is needed. It returns fewer than x.len items only if a write error occurs, so compare the return value against x.len to detect failure.
Be sure that x.len is no larger than SIZE_MAX.
how can I output the string using printf?
In a single call? You can't in any meaningful way, since you say you might have null terminators in strange places. In general, if your buffer might contain unprintable characters, you'll need to figure out how you want to print (or not) those characters when outputting your string. Write a loop, test each character, and print it (or not) as your logic dictates.
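One possible shape for such a loop, as a sketch only: printable characters go out as-is, and everything else (including NUL) is shown as a \xNN escape. print_escaped and the escape format are my own choices here, not something the question asked for.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write len bytes to fp, printing printable characters
   as-is and everything else (including NUL) as a \xNN escape.
   Returns the number of characters written, or -1 on output error. */
long print_escaped(FILE *fp, const char *data, size_t len)
{
    long written = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)data[i];  /* isprint needs this range */
        int n = isprint(c) ? fprintf(fp, "%c", c)
                           : fprintf(fp, "\\x%02x", c);
        if (n < 0) {
            return -1;
        }
        written += n;
    }
    return written;
}
```

With the struct from the question this would be called as print_escaped(stdout, s->data, s->len). Note that a literal backslash in the data is printed unescaped, so the output is for humans and is not a reversible encoding.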

Malloc adding one to initializing a char for string concat C

I found this function on stackoverflow which concates two strings together. Here is the function:
char* concatstring(char *s1,char *s2)
{
char *result = malloc(strlen(s1)+strlen(s2)+1);
strcpy(result,s1);
strcat(result,s2);
return result;
}
My question is, why do we add 1 to the malloc call?
It's because in C "strings" are stored as arrays of chars followed by a null byte. This is by convention. Consequently, null bytes may not appear inside any C string.
However, the actual string itself does not contain the null byte (which is just part of the representation of the string), and so strlen reports the number of non-null bytes in the string. To create a C string that is the result of concatenating two strings, you thus need to leave room for the null terminator.
In fact, every string operation one way or another needs to deal with the null terminator. Unfortunately, the details vary from function to function (e.g. snprintf does it right, but strncpy is dangerously different), and you should read each function's manual very carefully to understand who takes care of the null terminator and how.
You need to allocate space for the '\0' (the null character, not the NULL pointer), which is used to terminate strings in C.
i.e. the string "cat" is actually "cat\0".
If the string is "cat":
char * mystring = "cat";
Then strlen(mystring), would return 3.
But in reality it takes 4 bytes to store mystring, with one byte for the null character.
So if you have two strings, "dog" and "cat", their lengths will be 3 and 3, although the number of bytes required to store them would be 4 each. The memory required to store their concatenation would be 3 + 3 + 1 = 7.
So the 1 in malloc is there to allocate an extra byte to store the null character.
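The same arithmetic spelled out in code, as a sketch. concat_checked is a hypothetical variant of the function from the question that checks the malloc result and copies with memcpy; it is not meant as the one correct way to write it.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the same concatenation with the sizes spelled out.
   Returns NULL if allocation fails; the caller frees the result. */
char *concat_checked(const char *s1, const char *s2)
{
    size_t len1 = strlen(s1);                 /* excludes the terminator */
    size_t len2 = strlen(s2);
    char *result = malloc(len1 + len2 + 1);   /* +1 for the terminator */
    if (result == NULL) {
        return NULL;
    }
    memcpy(result, s1, len1);
    memcpy(result + len1, s2, len2 + 1);      /* copies s2's '\0' too */
    return result;
}
```

For "dog" and "cat" this allocates 3 + 3 + 1 = 7 bytes, and strlen on the result reports 6.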

Questions on C strings

I am new to C and I am very much confused with the C strings. Following are my questions.
Finding last character from a string
How can I find out the last character from a string? I came with something like,
char *str = "hello";
printf("%c", str[strlen(str) - 1]);
return 0;
Is this the way to go? I somehow think that, this is not the correct way because strlen has to iterate over the characters to get the length. So this operation will have a O(n) complexity.
Converting char to char*
I have a string and need to append a char to it. How can i do that? strcat accepts only char*. I tried the following,
char delimiter = ',';
char text[6];
strcpy(text, "hello");
strcat(text, delimiter);
Using strcat with variables that has local scope
Please consider the following code,
void foo(char *output)
{
char *delimiter = ',';
strcpy(output, "hello");
strcat(output, delimiter);
}
In the above code,delimiter is a local variable which gets destroyed after foo returned. Is it OK to append it to variable output?
How strcat handles null terminating character?
If I am concatenating two null terminated strings, will strcat append two null terminating characters to the resultant string?
Is there a good beginner level article which explains how strings work in C and how can I perform the usual string manipulations?
Any help would be great!
Last character: your approach is correct. If you will need to do this a lot on large strings, your data structure containing strings should store lengths with them. If not, it doesn't matter that it's O(n).
Appending a character: you have several bugs. For one thing, your buffer is too small to hold another character. As for how to call strcat, you can either put the character in a string (an array with 2 entries, the second being 0), or you can just manually use the length to write the character to the end.
Your worry about 2 nul terminators is unfounded. While it occupies memory contiguous with the string and is necessary, the nul byte at the end is NOT "part of the string" in the sense of length, etc. It's purely a marker of the end. strcat will overwrite the old nul and put a new one at the very end, after the concatenated string. Again, you need to make sure your buffer is large enough before you call strcat!
O(n) is the best you can do, because of the way C strings work.
Use char delimiter[] = ",";. This makes delimiter a character array holding a comma and a NUL. Also, text needs to have length 7: hello is 5, then you have the comma, and a NUL.
If you define delimiter correctly, that's fine (as is, you're assigning a character to a pointer, which is wrong). The contents of output won't depend on delimiter later on.
It will overwrite the first NUL.
You're on the right track. I highly recommend you read K&R C 2nd Edition. It will help you with strings, pointers, and more. And don't forget man pages and documentation. They will answer questions like the one on strcat quite clearly. Two good sites are The Open Group and cplusplus.com.
A "C string" is in reality a simple array of chars, with str[0] containing the first character, str[1] the second and so on. After the last character, the array contains one more element, which holds a zero. This zero by convention signifies the end of the string. For example, those two lines are equivalent:
char str[] = "foo"; //str is 4 bytes
char str[] = {'f', 'o', 'o', 0};
And now for your questions:
Finding last character from a string
Your way is the right one. There is no faster way to know where the string ends than scanning through it to find the final zero.
Converting char to char*
As said before, a "string" is simply an array of chars, with a zero terminator added to the end. So if you want a string of one character, you declare an array of two chars - your character and the final zero, like this:
char str[2];
str[0] = ',';
str[1] = 0;
Or simply:
char str[2] = {',', 0};
Using strcat with variables that has local scope
strcat() simply copies the contents of the source array to the destination array, at the offset of the null character in the destination array. So it is irrelevant what happens to the source after the operation. But you DO need to worry if the destination array is big enough to hold the data - otherwise strcat() will overwrite whatever data sits in memory right after the array! The needed size is strlen(str1) + strlen(str2) + 1.
How strcat handles null terminating character?
The final zero is expected to terminate both input strings, and is appended to the output string.
Finding last character from a string
I propose a thought experiment: if it were generally possible to find the last character
of a string in better than O(n) time, then could you not also implement strlen
in better than O(n) time?
Converting char to char*
You temporarily can store the char in an array-of-char, and that will decay into
a pointer-to-char:
char delimiterBuf[2] = "";
delimiterBuf[0] = delimiter;
...
strcat(text, delimiterBuf);
If you're just using character literals, though, you can simply use string literals instead.
Using strcat with variables that has local scope
The variable itself isn't referenced outside the scope. When the function returns,
that local variable has already been evaluated and its contents have already been
copied.
How strcat handles null terminating character?
"Strings" in C are NUL-terminated sequences of characters. Both inputs to
strcat must be NUL-terminated, and the result will be NUL-terminated. It
wouldn't be useful for strcat to write an extra NUL-byte to the result if it
doesn't need to.
(And if you're wondering what if the input strings have multiple trailing
NUL bytes already, I propose another thought experiment: how would strcat know
how many trailing NUL-bytes there are in a string?)
BTW, since you tagged this with "best-practices", I'll also recommend that you take care not to write past the end of your destination buffers. Typically this means avoiding strcat and strcpy (unless you've already checked that the input strings won't overflow the destination) and using safer versions (e.g. strncat. Note that strncpy has its own pitfalls, so that's a poor substitute. There also are safer versions that are non-standard, such as strlcpy/strlcat and strcpy_s/strcat_s.)
Similarly, functions like your foo function always should take an additional argument specifying what the size of the destination buffer is (and documentation should make it explicitly clear whether that size accounts for a NUL terminator or not).
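A sketch of what that could look like for foo, assuming the caller passes the full capacity of output including room for the NUL terminator. Using snprintf here is my choice for the example; the return-value convention is also an assumption.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical rewrite of foo: size is the full capacity of output,
   including room for the NUL terminator.
   Returns 0 on success, -1 if the result would not fit. */
int foo(char *output, size_t size)
{
    const char *delimiter = ",";
    /* snprintf never writes more than size bytes and reports the length
       the full result would have had. */
    int n = snprintf(output, size, "%s%s", "hello", delimiter);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

With a 7-byte buffer the call succeeds; with 6 bytes the output is truncated to "hello" and the function reports failure instead of overflowing.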
How can I find out the last character
from a string?
Your technique with str[strlen(str) - 1] is fine. As pointed out, you should avoid repeated, unnecessary calls to strlen and store the results.
I somehow think that, this is not the
correct way because strlen has to
iterate over the characters to get the
length. So this operation will have a
O(n) complexity.
Repeated calls to strlen can be a bane of C programs. However, you should avoid premature optimization. If a profiler actually demonstrates a hotspot where strlen is expensive, then you can do something like this for your literal string case:
const char test[] = "foo";
sizeof test // 4
Of course if you create 'test' on the stack, it incurs a little overhead (incrementing/decrementing stack pointer), but no linear time operation involved.
Literal strings are generally not going to be so gigantic. For other cases like reading a large string from a file, you can store the length of the string in advance as but one example to avoid recomputing the length of the string. This can also be helpful as it'll tell you in advance how much memory to allocate for your character buffer.
I have a string and need to append a
char to it. How can i do that? strcat
accepts only char*.
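If you have a char and cannot make a string out of it (char* c = "a"), then you can use strncat. It appends at most n characters from the source and then writes the terminating null itself, so the source does not need to be null-terminated: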
char ch = 'a';
strncat(str, &ch, 1);
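A slightly fuller sketch of the same idea with a capacity check in front. append_char is a hypothetical helper, and cap is assumed to be the total size of the destination array.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: append one character to the string in buf, whose
   total capacity is cap. Returns 0 on success, -1 if there is no room. */
int append_char(char *buf, size_t cap, char ch)
{
    size_t len = strlen(buf);
    if (len + 2 > cap) {          /* need room for ch and the terminator */
        return -1;
    }
    strncat(buf, &ch, 1);         /* reads 1 char, then writes '\0' itself */
    return 0;
}
```

With char text[7] = "hello", appending ',' succeeds and a second append is rejected because the buffer is full.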
In the above code,delimiter is a local
variable which gets destroyed after
foo returned. Is it OK to append it to
variable output?
Yes: functions like strcat and strcpy make deep copies of the source string. They don't leave shallow pointers behind, so it's fine for the local data to be destroyed after these operations are performed.
If I am concatenating two null
terminated strings, will strcat
append two null terminating characters
to the resultant string?
No, strcat will basically overwrite the null terminator on the dest string and write past it, then append a new null terminator when it's finished.
How can I find out the last character from a string?
Your approach is almost correct. The only way to find the end of a C string is to iterate through the characters, looking for the nul.
There is a bug in your answer though (in the general case). If strlen(str) is zero, you access the character before the start of the string.
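A minimal sketch of the guarded version. last_char is a made-up name, and returning '\0' for an empty string is just one possible convention.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the last character of str,
   or '\0' if the string is empty. */
char last_char(const char *str)
{
    size_t len = strlen(str);     /* O(n): scans for the terminator */
    return len > 0 ? str[len - 1] : '\0';
}
```

The check matters because strlen returns an unsigned size_t, so strlen(str) - 1 on an empty string wraps around to a huge index instead of becoming -1.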
I have a string and need to append a char to it. How can i do that?
Your approach is wrong. A C string is just an array of C characters with the last one being '\0'. So in theory, you can append a character like this:
char delimiter = ',';
char text[7];
strcpy(text, "hello");
int textSize = strlen(text);
text[textSize] = delimiter;
text[textSize + 1] = '\0';
However, if I leave it like that I'll get zillions of down votes because there are three places where I have a potential buffer overflow (if I didn't know that my initial string was "hello"). Before doing the copy, you need to put in a check that text is big enough to contain all the characters from the string plus one for the delimiter plus one for the terminating nul.
... delimiter is a local variable which gets destroyed after foo returned. Is it OK to append it to variable output?
Yes that's fine. strcat copies characters. But your code sample does no checks that output is big enough for all the stuff you are putting into it.
If I am concatenating two null terminated strings, will strcat append two null terminating characters to the resultant string?
No.
I somehow think that, this is not the correct way because strlen has to iterate over the characters to get the length. So this operation will have a O(n) complexity.
You are right; read Joel Spolsky on why C-strings suck. There are a few ways around it. You can either not use C strings (for example, use Pascal strings and create your own library to handle them), or not use C (use, say, C++, which has a string class that is slow for different reasons, but where you could also write your own class to handle Pascal strings more easily than in C).
Regarding adding a char to a C string; a C string is simply a char array with a nul terminator, so long as you preserve the terminator it is a string, there's no magic.
char* straddch( char* str, char ch )
{
char* end = &str[strlen(str)] ;
*end = ch ;
end++ ;
*end = 0 ;
return str ;
}
Just like strcat(), you have to know that the array that str is created in is long enough to accommodate the longer string, the compiler will not help you. It is both inelegant and unsafe.
If I am concatenating two null
terminated strings, will strcat append
two null terminating characters to the
resultant string?
No, just one, but whatever follows that may just happen to be nul, or whatever happened to be in memory. Consider the following equivalent:
char* my_strcat( char* s1, const char* s2 )
{
strcpy( &s1[strlen(s1)], s2 ) ;
return s1 ;
}
the first character of s2 overwrites the terminator in s1.
In the above code,delimiter is a local
variable which gets destroyed after
foo returned. Is it OK to append it to
variable output?
In your example delimiter is not a string, and initialising a pointer with a char makes no sense. However if it were a string, the code would be fine, strcat() copies the data from the second string, so the lifetime of the second argument is irrelevant. Of course you could in your example use a char (not a char*) and the straddch() function suggested above.

Resources