Converting non printable ASCII character to binary - c

I am trying to convert a string of non-printable ASCII characters to binary. Here is the code:
int main(int argc, char *argv[])
{
char str[32];
sprintf(str,"\x01\x00\x02");
printf("\n[%x][%x][%x]",str[0],str[1],str[2]);
return 1;
}
I expect the output should be [1][0][2], but it prints [1][0][4].
What am I doing wrong here?

The sprintf operation ended at the first instance of \x00 in your string literal, because NUL (U+0000) terminates strings in C. (That the compiler does not complain when you write \x00 inside a string literal is arguably a misfeature of the language.) Thus str[2] accesses uninitialized memory and the program is entitled to print complete nonsense or even crash.
To do what you wanted to do, simply eliminate the sprintf:
#include <stdio.h>
int main(void)
{
static const unsigned char str[32] =
{ 0x01, 0x00, 0x02 }; // will be zero-filled to declared size
printf("[%02x][%02x][%02x]\n", str[0], str[1], str[2]);
return 0;
}
(Binary data should always be stored in arrays of unsigned char, not plain char; or uint8_t if you have it. Because U+0000 terminates strings, I think it's better style to write embedded binary data using an array literal rather than a string literal; but it is more typing. The static const is just because the data is never modified and known at compile time; the program would work without it. Don't declare argc and argv if you're not going to use them. Return zero, not one, from main to indicate successful completion.)
(Using sprintf the way you were using it is a bad idea for other reasons: for instance, if your binary block contained \x25 (also known as % in ASCII), it would try to read additional arguments-to-be-formatted, and again print complete nonsense or crash. If you have a good reason to not just use static initialized data, the right way to copy blocks of binary data around is memcpy.)

C strings end with a null byte, so sprintf only reads until \x00. Instead, you can use memcpy or simply initialize with
char str[32] = "\x01\x00\x02";

"\x00" prematurely terminates the format string, which is the 2nd argument of sprintf(). Obviously that was unintentional, but there is no way sprintf() can figure out that the first NUL is not the last NUL. So the format string it works on is actually shorter than what you intended to pass.

Related

How does an array terminate?

As we know, a string terminates with '\0'.
That is so the compiler knows where the string ends, or to guard against garbage values.
But how does an array terminate?
If '\0' were used, it would be taken as 0, a valid integer.
So how does the compiler know that the array has ended?
C does not perform bounds checking on arrays. That's part of what makes it fast. However that also means it's up to you to ensure you don't read or write past the end of an array. So the language will allow you to do something like this:
int arr[5];
arr[10] = 4;
But if you do, you invoke undefined behavior. So you need to keep track of how large an array is yourself and ensure you don't go past the end.
Note that this also applies to character arrays. A character array can be treated as a string if it contains a sequence of characters terminated by a null byte. So this is a string:
char str[10] = "hello";
And so is this:
char str[5] = { 'h', 'i', 0, 0, 0 };
But this is not:
char str[5] = "hello"; // no space for the null terminator.
C doesn't provide any protections or guarantees to you about 'knowing the array is ended.' That's on you as the programmer to keep in mind in order to avoid accessing memory outside your array.
The C language does not have a native string type. In C, strings are actually one-dimensional arrays of characters terminated by a null character '\0'.
From C Standard#7.1.1p1 [emphasis mine]
A string is a contiguous sequence of characters terminated by and including the first null character. The term multibyte string is sometimes used instead to emphasize special processing given to multibyte characters contained in the string or to avoid confusion with a wide string. A pointer to a string is a pointer to its initial (lowest addressed) character. The length of a string is the number of bytes preceding the null character and the value of a string is the sequence of the values of the contained characters, in order.
A string is a special case of a character array: one that is terminated by a null character '\0'. All the standard library string functions read their input string based on this rule, i.e. they read until the first null character.
The null character '\0' has no special significance in arrays of any type other than character arrays in C.
So, apart from strings, for all other types of array the programmer is supposed to explicitly keep track of the number of elements in the array.
Also note that the first null character ('\0') marks the end of the string, but it does not stop you from reading beyond it.
Consider this example:
#include <stdio.h>
int main(void) {
char str[5] = {'H', 'i', '\0', 'z'};
printf ("%s\n", str);
printf ("%c\n", str[3]);
return 0;
}
When you print the string
printf ("%s\n", str);
the output you will get is - Hi
because with the %s format specifier, printf() writes every byte up to, but not including, the first null terminator. You can still print the 4th character of the array, since it lies within the bounds of the char array str even though it comes after the first '\0' character:
printf ("%c\n", str[3]);
the output you will get is - z
Additional:
Trying to access an array beyond its size leads to undefined behavior: the program may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended.
It’s just a matter of convention. If you wanted to, you could totally write code that handled array termination (for arrays of any type) via some sentinel value. Here’s an example that does just that, arbitrarily using -1 as the sentinel:
int length(int arr[]) {
int i;
for (i = 0; arr[i] != -1; i++) {}
return i;
}
However, this is obviously utterly impractical: you couldn’t use -1 in the array any longer.
By contrast, for C strings the sentinel value '\0' is less problematic because it’s expected that normal text won’t contain this character. This assumption is kind of valid. But even so, there are obviously many strings which do contain '\0' as a valid character, and null-termination is therefore by no means universal.
One very common alternative is to store strings in a struct that looks something like this:
struct string {
unsigned int length;
char *buffer;
};
That is, we explicitly store a length alongside a buffer. This buffer isn’t null-terminated (although in practice it often has an additional terminal '\0' byte for compatibility with C functions).
Anyway, the answer boils down to: For C strings, null termination is a convenient convention. But it is only a convention, enforced by the C string functions (and by the C string literal syntax). You could use a similar convention for other array types but it would be prohibitively impractical. This is why other conventions developed for arrays. Notably, most functions that deal with arrays expect both an array and a length parameter. This length parameter determines where the array terminates.

Why is my output wrong? C newbie

#include <stdio.h>
int main(void)
{
char username;
username = '10A';
printf("%c\n", username);
return 0;
}
I just started learning C, and here is my first problem. Why is this program giving me 2 warnings (multi-character constant, overflow in implicit constant conversion)?
And instead of giving 10A as output, it is giving just A.
You are trying to stuff multiple characters into a single set of '', and into a single char variable. You need "" for string literals, and you'll need an array of characters to hold a string. And to print a string, use %s.
Putting all of this together, you get:
#include <stdio.h>
int main(void)
{
char username[] = "10A";
printf("%s\n", username);
return 0;
}
Footnote
From Jonathan Leffler in the comments below regarding multi-character constants:
Note that multi-character constants are a part of C (hence the warning, not an error), but the value of a multi-character constant is implementation defined and hence not portable. It is an integer value; it is larger than fits in a char, so you get that warning. You could have gotten almost anything as the output — 1, A and a null byte could all be plausible.
'10A' is an allowed but obscure way to define a value.
In the case of an int variable,
int username = '10A';
printf("%x\n", username);
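will output the following on common compilers such as GCC (the exact value is implementation-defined):
313041
These are pairs of hexadecimal digits, one pair per character: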
0x31 is the '1' of your input.
0x30 is the '0' of your input.
0x41 is the 'A' of your input.
But a char type can't hold this.
In C there are no String objects. Instead, strings are arrays of characters (followed by a null character). Other answers have pointed out statically allocating this memory. However, I recommend dynamically allocating strings. Just remember that C lacks a garbage collector (unlike Java), so remember to free your pointers. Have fun!!
You could use char *username to point to the start of the string and loop through the memory after it. For instance, use strlen(username) to get the length (sizeof(username) would only give the size of the pointer itself) and then loop printf until you have printed that many characters. However, you may end up with major problems if you aren't careful...

Differentiating between embedded NUL and NUL-terminator

I have a const char* pointing to data in hex format, and I need to find the length of the data. To do that I check for the NUL terminator, but when \x00 comes up it is detected as the NUL terminator and an incorrect length is returned.
How can I get around that?
const char* orig = "\x09\x00\x04\x00\x02\x00\x10\x00\x42\x00\x02\x00\x01\x80\x0f\x00";
uint64_t get_char_ptr_len(const char *c)
{
uint64_t len = 0;
if (*c)
{
while (c[len] != '\0') {
len++;
}
}
return len;
}
\x00 is the NUL terminator; in fact, \x00 is just another way to write \0.
If you have byte data that contains embedded NULs, you cannot use NUL as a terminator, period; you have to keep both a pointer to the data and the data size, exactly as functions that operate on "raw bytes" (such as memcpy or fwrite) do.
As for literals, make sure you initialize an array (and not just take a pointer to it) to be able to retrieve its size using sizeof:
const char orig[] = "\x09\x00\x04\x00\x02\x00\x10\x00\x42\x00\x02\x00\x01\x80\x0f\x00";
Now you can use sizeof(orig) to get its size (which will be one larger than the number of explicitly written characters, as there's the implicit NUL terminator at the end). Be careful though, as arrays decay to pointers at pretty much every available occasion, in particular when being passed to functions.
\x indicates hexadecimal notation.
Have a look at an ASCII table to see what \x00 represents.
\x00 = NUL // in hexadecimal notation
\x00 is just another way to write \0.
Try
const char orig[] = "\x09\x00\x04\x00\x02\x00\x10\x00\x42\x00\x02\x00\x01\x80\x0f\x00";
and
len=sizeof(orig)/sizeof(char);

C Why is an unrelated/undeclared variable influencing the output of another?

When the character array substring[#] is set as [64], the file outputs an additional character. The additional character varies with each compile. Sometimes es?, sometimes esx among others.
If I change the [64] to any other number (at least the ones I've tried: 65, 256,1..) it outputs correctly as es.
Even more strange, if I leave the unused/undeclared character array char newString[64] in this file, it outputs the correct substring es even with the 64.
How does the seemingly arbitrary size of 64 affect the output?
How does a completely unrelated character array (newString) influence how another character array is output?
int main () {
char string[64];
char newString[64];
char substring[64];
fgets(string,64,stdin);
strncpy(substring, string+1, 1);
printf("%s\n", substring);
return 0;
}
The problem is, strncpy() will not copy the null terminator because you've asked it not to.
Using strncpy() is safe and dangerous at the same time, because it will not always copy the null terminator. Using it for a single byte is also pointless; instead, do this:
substring[0] = string[1];
substring[1] = '\0';
and it will work.
You should read the manual page strncpy(3) to understand exactly what I mean. Reading the manual carefully every time will make you a better programmer in a shorter time.

Why storing Unicode Characters in char works?

I have a program I made to test I/O from a terminal:
#include <stdio.h>
int main()
{
char *input[100];
scanf("%s", input);
printf("%s", input);
return 0;
}
It works as it should with ASCII characters, but it also works with Unicode characters and emoji.
Why is this?
Your code works because the input and output streams have the same encoding, and you do not do anything with the buffer (called c below).
Basically, you type something, which is converted into a sequence of bytes that are stored in c. You then send that sequence of bytes back to stdout, which converts them back to readable characters.
As long as the encoding and decoding process are compatible, you will get the "expected" result.
Now, what happens if you try to use standard "string" C functions? Let's assume you typed "♠Hello" in your terminal, you will get the expected output but:
strlen(c) -> 8
c[0] -> Some strange character
c[3] -> H
You see? You may be able to store whatever you want in a char array, but that does not mean you should. If you want to deal with extended character sets, use wchar_t instead.
You're probably running on Linux, with your terminal set to UTF-8 so scanf produces UTF-8, and printf can output it. UTF-8 is designed such that char[] can store it. I explicitly use char[] and not char because non-ASCII characters need more than one byte.
Your program has undefined behavior.
scanf("%s", input);
expects a pointer to string, but
char *input[100];
input is an array of 100 pointers to char, which decays to char ** rather than char *.
Your program may work because the buffer you pass to scanf is large enough to store the Unicode characters, and the characters you pass don't have a null byte between them. But it may just as well not work, because the C implementation on your (or any other) machine is allowed to do anything in cases of UB.

Resources