I hope this is very simple. These are 20 hex values separated by backslashes \, and the C compiler indeed makes them a string of 33 characters (counting the terminating NUL), because \NUMBER is a single byte, while \NUMBER+ALPHA is 2 bytes, as is \ALPHA+NUMBER.
char str[] =
"\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6";
//when saved is 33 bytes
My question is: after it has been saved as those 33 bytes on disk, can we (after reading the 33 bytes back) rebuild the same representation that we have in the C source? That is, the program should print "\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6". Any problem solvers here?
"\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6";
//when read back program should output this ^
The string literal you have:
"\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6"
will produce undefined behavior according to C89 (I am not sure the source I have for C89 can be trusted, but my point below still holds) and implementation-defined behavior according to the C11 standard. In particular, \d, \e, \9 and \c are escape sequences not defined in the standard. gcc will not complain about \e, since it is a GNU extension that represents ESC (0x1B).
Since implementation-defined behavior is involved, we need to know which compiler you are using, as the result may vary.
Another thing: you didn't show clearly that you are aware of the content of the string after compilation. (A clearer way to show it would be to include a hex dump of what the string looks like in memory, together with how you think the escape sequences were interpreted.)
This is how the looks-like-hex string is recognized by the compiler:
String: \b 3 \b c \77 \7 \d e \e d \44 \9 3 \75 \c e \c 0 \9 \1 9 \5 9 \c 8 \f \b e \c 6 \30 \6
Char: \b 3 \b c \77 \7 d e \e d \44 9 3 \75 c e c 0 9 \1 9 \5 9 c 8 \f \b e c 6 \30 \6
Hex: 08 33 08 63 3f 07 64 65 1b 64 24 39 33 3d 63 65 63 30 39 01 39 05 39 63 38 0c 08 65 63 36 18 06 00
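If you want to verify this on your compiler, a minimal sketch like the following simply dumps the bytes the compiler actually stored (gcc will still warn about the unknown escapes, which is expected):

#include <stdio.h>

int main(void)
{
    char str[] =
        "\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6";

    /* sizeof str includes the terminating NUL, so this prints 33 bytes */
    for (size_t i = 0; i < sizeof str; i++)
        printf("%02x ", (unsigned char)str[i]);
    printf("\n");
    return 0;
}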
Enough beating around the bush. I will assume that you are compiling the code with gcc (warnings ignored), that when the program runs the whole char[] is written to a file using fwrite, and that only lowercase characters are used in the source code.
You should map every possible escape sequence \xy that looks like a 2-digit hex number to the sequence of 1 or 2 bytes it produces. There are not that many of them, and you can write a program to simulate the behavior of the compiler (see the sketch after this list):
If x is any of a, b, f (the other standard escapes such as \n are not hex digits) or e (due to the GNU extension), it is mapped to the corresponding special character.
(If you use uppercase characters in the source code, do note that \E maps to ESC.)
If xy forms a valid octal sequence, it is mapped to the character with the corresponding value.
Otherwise, if x alone forms a valid octal sequence, it is mapped to the character with the corresponding value.
Otherwise, x stays the same.
If y was not consumed by an octal sequence, y stays the same.
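To make the simulation idea concrete, here is a rough sketch of such a program. It assumes gcc-like handling (unknown escapes such as \c, \d, \9 collapse to the character itself, \e is the GNU ESC extension) and lowercase source text; the function name simulate is just illustrative:

#include <stdio.h>

/* Turn the *source text* of a string literal (without the surrounding
 * quotes) into the bytes a gcc-like compiler would store. */
static size_t simulate(const char *src, unsigned char *out)
{
    size_t n = 0;
    while (*src) {
        if (*src != '\\') {               /* plain character */
            out[n++] = (unsigned char)*src++;
            continue;
        }
        src++;                            /* skip the backslash */
        if (!*src)
            break;
        if (*src >= '0' && *src <= '7') { /* octal escape, up to 3 digits */
            int val = 0, digits = 0;
            while (digits < 3 && *src >= '0' && *src <= '7') {
                val = val * 8 + (*src++ - '0');
                digits++;
            }
            out[n++] = (unsigned char)val;
        } else {
            switch (*src) {               /* named escapes that look like hex digits */
            case 'a': out[n++] = 0x07; break;
            case 'b': out[n++] = 0x08; break;
            case 'e': out[n++] = 0x1B; break; /* GNU extension */
            case 'f': out[n++] = 0x0C; break;
            default:  out[n++] = (unsigned char)*src; break; /* \c, \d, \9, ... */
            }
            src++;
        }
    }
    return n;
}

int main(void)
{
    /* The doubled backslashes give us the literal source text of the string. */
    const char *src =
        "\\b3\\bc\\77\\7\\de\\ed\\44\\93\\75\\ce\\c0\\9\\19\\59\\c8\\f\\be\\c6\\30\\6";
    unsigned char out[64];
    size_t n = simulate(src, out);

    for (size_t i = 0; i < n; i++)        /* should match the hex dump above, */
        printf("%02x ", out[i]);          /* minus the trailing 00 terminator */
    printf("\n");
    return 0;
}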
Note that the same actual char can be produced in 2 different ways. For example, \f and \14 map to the same char. In such cases it might not be possible to get back the exact string that was in the source; the most you can do is guess what the string in the source might have been.
Using your string as an example: at the beginning, the bytes 08 33 can come from \b3, but they can also come from \10\63.
Using the map produced, there are cases where the mapping is unambiguous: a hex value larger than 3f cannot come from an octal escape sequence, and must come from direct interpretation of a character in the original string. From this you know that if an e byte (0x65) is encountered, it must have been the 2nd character of a looks-like-hex pair.
You can use the map as a guide, and the simulation as a way to check whether a candidate string maps back to the same bytes. Without knowing anything else about the string declared in the source code, the most you can derive is a list of candidates for the original (broken) source string. You can reduce the size of that list if you at least know the length of the string in the source code.
I tried the following code but couldn't get the desired output.
The result should be AB, and it should come from the single variable C.
#include <stdio.h>

int main()
{
    int a = 'A';
    int b = 'B';
    unsigned int C = a << 8 | b;
    printf(" %c\n", C);
    return 0;
}
%c will print a single character. If you want to print a string, you have to use %s and provide a pointer to this string. Strings in C have to be null-terminated, meaning they require one additional character after the text and this character carries the value \0 (zero).
You could do this in an int, but you'd have to understand some concepts first.
If you are using a computer with an Intel (x86) architecture, integer variables larger than one byte store their bytes in memory in reverse order, i.e. least significant byte first. This is called little-endianness.
So a number like 0x11223344 (hexadecimal) will be stored in memory as the sequence of bytes 44 33 22 11.
'A' is equivalent to the number 65, or 0x00000041, and if put in a 32-bit integer will be stored as 41 00 00 00.
When you do 'A' << 8 | 'B' you create the number 0x00004142, but in memory it is actually stored as 42 41 00 00 (equivalent to the string "BA\0\0"). That is the opposite order of what you're trying to get, but since it is technically null-terminated, it's a valid string.
You can print it using printf("%s", (char *)&C);
If you're on a big-endian architecture the bytes are stored the other way around, and you will have to work out where the null terminator ends up, but I think I have already given you enough information to figure out what is going on for yourself.
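For completeness, a minimal sketch of the working version of this idea (it relies on the machine being little-endian, so it is deliberately non-portable):

#include <stdio.h>

int main(void)
{
    /* Pack 'A' into the low byte and 'B' into the next byte.  On a
     * little-endian machine the bytes of C are then 41 42 00 00, which
     * reads as the NUL-terminated string "AB". */
    unsigned int C = 'A' | ('B' << 8);

    printf("%s\n", (char *)&C);   /* prints "AB" on little-endian systems */
    return 0;
}

Note that this packs the characters in the opposite order from the question's a << 8 | b, which is exactly why the original code would appear as "BA" when viewed through &C.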
%c prints only a single character: printf converts the int argument to unsigned char, so you get the low-order byte of the value (here 'B') rather than the two characters you packed. Endianness only comes into play when you reinterpret the bytes of C in memory, for example by printing (char *)&C with %s.
A string is represented as an array of char. For example, if I have a string "abcdef" at address 0x80000000, is the following correct?
0x80000008
0x80000004: 00 00 46 45
0x80000000: 44 43 42 41
(The stack grows downward, so I have listed the addresses in decreasing order.)
The bytes at lower addresses always come first, even on the stack. So your example should be:
80000000: 41 42 43 44
80000004: 45 46 00 00
Your example is actually the string: "ABCDEF". The string "abcdef" should be:
80000000: 61 62 63 64
80000004: 65 66 00 00
Also, in memory dumps the default radix is 16 (hexadecimal), so the "0x" prefix is redundant. Notice that the character codes are also in hexadecimal. For example, the string "JKLMNOP" would be:
80000000: 4A 4B 4C 4D
80000004: 4E 4F 50 00
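If you want to see this layout for yourself, a small sketch like the following dumps the address and value of every byte of a string (the addresses will of course differ on your machine):

#include <stdio.h>

int main(void)
{
    const char s[] = "abcdef";

    /* Expect 61 62 63 64 65 66 00 at consecutive, increasing addresses. */
    for (size_t i = 0; i < sizeof s; i++)
        printf("%p: %02X\n", (void *)&s[i], (unsigned char)s[i]);
    return 0;
}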
Strings are not usually placed on the stack, only in data memory. What is sometimes placed on the stack is a pointer to a string, i.e. the string's start address.
Your examples (and mine) concern the so-called ASCII encoding, but many other character encoding schemes are possible. For example, EBCDIC also uses 8-bit codes, but different ones than ASCII.
8-bit codes are not mandatory, either. UTF-32, for example, uses 32-bit codes. A fixed code size is not mandatory either: UTF-8 uses a variable code size of 1 to 4 bytes (up to 6 in its original design), depending on the character encoded.
That isn't actually assembly; you can get an example of real assembly output by running gcc -S. Traditionally in x86 assembly, you would declare a label followed by the string data, declared with db (data bytes). If it were a C-style string, it would be followed by db 0. Modern assemblers have an asciiz directive that adds the zero byte automatically. If it were a Pascal-style string, it would be preceded by an integer containing its size. These would be laid out contiguously in memory, and you would get the address of the string by using the label, similarly to how you get the address of a branch target from its label.
Which option you use depends on what you're going to do with it. If you're passing it to a C standard library function, you probably want a C-style string. If you're going to write it out with write() or send() and copy it to buffers with bounds checking, you might want to store its length explicitly, even though no system or library call uses that format any more. Good, secure code shouldn't use strcpy() either. However, you can both store the length and null-terminate the string.
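As a rough C illustration of that last point (the struct layout here is just one possible choice, not any standard format):

#include <stdio.h>
#include <string.h>

/* A simple counted string: explicit length plus NUL-terminated data. */
struct counted_str {
    size_t len;
    char   data[32];
};

int main(void)
{
    struct counted_str s;
    s.len = strlen("hello");
    memcpy(s.data, "hello", s.len + 1);   /* +1 also copies the terminating NUL */

    /* Length-aware code can use s.len; C library calls can use s.data. */
    printf("%zu bytes: %s\n", s.len, s.data);
    return 0;
}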
Some old code for MS-DOS used strings terminated with $, a convention copied from CP/M for compatibility with 8-bit code on the Z80. There were a bunch of these legacies in OSes up to Windows ME.
I am an amateur C programmer; I can only use the C programming language.
I have the following code containing a loop, written in the Turbo C++ IDE.
It is simple code for printing consecutive numbers up to a given value,
and it contains a line like this:
i = 00100
In the above line, when I enter 00100, the colour changes from that of a normal integer value (it turns dark blue/navy blue).
And when I use this in my loop, instead of repeating 100 times it repeats only 64 times.
The same happens with any value written like 023 instead of 23.
Please explain what kind of identifier/value 00100 (or anything similar to it) is,
and also explain why this happens (64 instead of 100).
Regards, and thank you in advance!
This happens because a numeric literal starting with a zero is interpreted as a number written in octal.
A numeric literal beginning with 0 is interpreted as an octal number in C, and since 100 in octal is 64 in decimal, this explains what you observe.
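A quick check you can compile yourself:

#include <stdio.h>

int main(void)
{
    int i = 00100;   /* leading zero: octal literal, value 64 */
    int j = 100;     /* decimal literal, value 100 */

    printf("%d %d\n", i, j);   /* prints: 64 100 */
    return 0;
}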
I have the string "Artîsté" in a latin1 table. I use the C MySQL connector to get the string out of the table, and I have character_set_connection set to utf8.
In the debugger it looks like:
"Art\xeest\xe9"
If I print the hex values with printf ("%02X", (unsigned char) a[i]); for each char I get
41 72 74 EE 73 74 E9
How do I know if it is utf8 or latin1?
\x74\xee\x73 isn't a valid UTF-8 sequence, since in UTF-8 a byte with the top bit set never appears on its own; it is always part of a run of two or more such bytes. So of the two, it must be Latin-1.
However, if you see bytes that are valid UTF-8 data, then it's not always possible to rule out that it might be Latin-1 that just so happens to also be valid UTF-8.
Latin-1 does have some invalid bytes (the ASCII control characters 0x00-0x1F and the unused range 0x7f-0x9F), so there are some UTF-8 strings that you can be sure are not Latin-1. But in my experience it's common enough to see Windows CP1252 mislabelled as Latin-1, that rejecting all those code points is fairly futile except in the case where you're converting from another charset to Latin-1, and want to be strict about what you output. CP1252 has a few unused bytes too, but not as many.
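If it helps, here is a rough sketch of the kind of check described above. It only tests whether a byte string is structurally valid UTF-8 (it does not reject overlong forms or surrogates), so treat it as a heuristic rather than a full validator:

#include <stdio.h>

/* Return 1 if the bytes look like structurally valid UTF-8, 0 otherwise. */
static int looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;
        if      (c < 0x80)           extra = 0;  /* plain ASCII byte */
        else if ((c & 0xE0) == 0xC0) extra = 1;  /* 110xxxxx lead byte */
        else if ((c & 0xF0) == 0xE0) extra = 2;  /* 1110xxxx lead byte */
        else if ((c & 0xF8) == 0xF0) extra = 3;  /* 11110xxx lead byte */
        else                         return 0;   /* stray continuation byte */

        if (i + 1 + extra > len)
            return 0;                            /* truncated sequence */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                        /* missing continuation byte */
        i += 1 + extra;
    }
    return 1;
}

int main(void)
{
    /* The bytes from the question: "Art\xeest\xe9" */
    const unsigned char a[] = { 0x41, 0x72, 0x74, 0xEE, 0x73, 0x74, 0xE9 };

    printf("%s\n", looks_like_utf8(a, sizeof a)
                       ? "could be UTF-8"
                       : "not UTF-8 (so likely Latin-1/CP1252)");
    return 0;
}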
As you can see from the layout of a UTF-8 sequence, there are two main possibilities:
1st bit = 0: same as ASCII, one byte per character, with a value <= 0x7F
1st bit = 1: the start of a UTF-8 multi-byte sequence, which is >= 2 bytes long, every byte having a value >= 0x80
Your dump is ISO-8859 encoded:
41 72 74 *EE* 73 74 *E9*
It has only 2 stand-alone bytes with values >= 0x80, so it cannot be UTF-8.
Beware: even if you find a well-formed UTF-8 sequence, you cannot always differentiate it from a bunch of ISO-8859 characters!
I don't know what this is; I found it in the OpenSSL source code.
Is this some sort of byte sequence? Basically, I just need to convert my char * to that kind of style to pass it as a parameter.
It's a byte sequence written in hexadecimal escapes. \x6d\xe3\x85 is the byte with hex value 6d, followed by the byte e3, followed by the byte 85. The syntax is \xnn, where nn is the hex value of the byte.
If what you read was
char foo[] = "\x6d\xe3\x85";
then that is the same as
char foo[] = { 0x6d, 0xE3, 0x85, 0x00 };
Further, I can tell you that 0x6D is the ASCII code point for 'm', 0xE3 is the ISO 8859-1 code point for 'ã', and 0x85 is the Windows-1252 code point for '…'.
But without knowing more about the context, I can't tell you how to "convert [your] char * to that kind of style as a parameter", except to say that you might not need to do any conversion at all! The \x notation allows you to write string constants containing arbitrary byte sequences into your source code. If you already have an arbitrary byte sequence in a buffer in your program, I can't imagine your needing to back-convert it to \x notation before feeding it to OpenSSL.
Try the following code snippet to understand more about hex byte sequences:
#include <stdio.h>
int main(void)
{
    char word[] = "\x48\x65\x6c\x6c\x6f";
    printf("%s\n", word);
    return 0;
}
/*
Output:
$
$ ./a.out
Hello
$
*/
The \x sequence is used to escape byte values in hexadecimal notation. So the sequence you've cited escapes the bytes 6D, E3 and 85, which translate to 109, 227 and 133. While 6D can also be represented as the character m in ASCII, you cannot represent the latter two in ASCII, as it only covers the range 0..127. So for values beyond 127 you need a special way to write them, and \x is such a way.
The other way is to escape as octal numbers using \<number>, for example 109 would be \155.
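A quick sketch showing that the hex and octal spellings give the same byte (and that, on an ASCII-based system, it is also the code for 'm', subject to the caveat below):

#include <stdio.h>

int main(void)
{
    /* Three views of the same byte value: 0x6D == octal 155 == decimal 109. */
    char hex   = '\x6d';
    char octal = '\155';

    printf("%d %d %c\n", hex, octal, hex);   /* prints: 109 109 m */
    return 0;
}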
If you need explicit byte values, it's better to use these escape sequences, since (AFAIK) the C standard doesn't guarantee that your string will be encoded using ASCII. So when you compile, for example, on an EBCDIC system, your m would be represented by the byte value 148 instead of 109.