Convert this kind of hex string to a NSData/NSString - c

I have this hex string:
\x5c30\x3032\x5f5c\x3337\x345c\x3334\x366f\x5c32\x3633\x5c30\x3136\x5c32\x3132\x5c32\x3234\x4e5c\x3236\x335c\x3231\x335c\x3337\x355c\x3335\x315c\x3232\x365c\x3337
How could I convert it to an NSString or NSData? I thought of using C functions, but I'm not experienced in C :(

Looks like Unicode characters (specifically, CJK ideographs) to me.
Use an NSScanner to scan the string. Scan up to a backslash, and add whatever you scanned to a mutable string. Then, scan the backslash and throw it away, and then scan the x and throw that away.
Then, scan four single characters, which will be the digits (NSScanner doesn't have a method to scan a single character, so you will need to get them yourself using characterAtIndex: and then adjust the scanner's scan location accordingly). Perform the appropriate conversion of the hexadecimal digit characters to numbers and the math to assemble a single number from them, and you will have the code point (character value) represented by the escape sequence. Add that single character to your string.
Repeat that until you run out of input string, and you will have converted the input string with all its escape sequences into a string with the unescaped characters.
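The same loop can be sketched in plain C instead of with NSScanner. This is only a minimal sketch under assumptions: the helper names hex_value and unescape_hex4 are hypothetical, every escape is taken to be exactly \x followed by four hex digits, the result is stored as 16-bit units, and the character set is assumed to be ASCII-compatible.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit character, or -1 if c is not a hex digit.
   Assumes an ASCII-compatible character set. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decodes text containing \xHHHH escapes into
   16-bit units. Characters outside an escape are copied unchanged.
   Returns the number of units written to out (at most cap). */
size_t unescape_hex4(const char *in, unsigned short *out, size_t cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);

    while (*in != '\0' && n < cap) {
        if (in[0] == '\\' && in[1] == 'x') {
            unsigned value = 0;
            int digits = 0;
            /* Assemble up to four hex digits into one number. */
            while (digits < 4 && hex_value(in[2 + digits]) >= 0) {
                value = value * 16 + (unsigned)hex_value(in[2 + digits]);
                digits++;
            }
            if (digits == 4) {
                out[n++] = (unsigned short)value;
                in += 6;    /* skip the backslash, the x and four digits */
                continue;
            }
        }
        out[n++] = (unsigned char)*in++;    /* ordinary character */
    }
    return n;
}
```

The resulting units could then be handed to something like NSString's stringWithCharacters:length: if they really are UTF-16 code units, which is an assumption about the data rather than something the question confirms.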

Related

How to escape a character in bytearray

I am creating a bytearray from a list.
mybytes_array = bytes([255, 110, 41, 128, 9])
I then use a regex to find all occurrences of it in ba:
[(m.start(0), m.end(0)) for m in re.finditer(mybytes_array, ba)]
I can have any value instead of 41 that creates a metacharacter for regex. I want to escape that character so that I can match it against ba that is also a bytearray
How can I do that?
Obviously I cannot convert it to a string, add a backslash, and then match against ba, so I am not sure how to change mybytes_array so that it searches for the correct bytes.
The re package can work on both str and bytes inputs as long as the arguments are of the same type.
You may use re.escape to escape the whole bytes.
Your code will be something like
[(m.start(0), m.end(0)) for m in re.finditer(re.escape(mybytes_array), ba)]

Encoding an array of strings into a single string

You're given an array of strings where each character in the string is lowercase. Each character and the length of each string is randomly generated. Encode the string such that:
1. The encoded output is a single string with minimum possible length
2. You should be able to decode the string later
I am thinking the mention of each character being lowercase is key here. Since there are only 26 lowercase characters, maybe we can encode them using 5 bits instead of 8 bits and then pack them. But I am not sure how to implement this bit packing while looping over the array of strings
For 26 characters and a separator you could use base32. Basically, concatenate the strings with a delimiter and then do a base32 decode, which treats each character as one 5-bit digit and packs the digits into bytes; it should be easy to find code for that. Just do not use those symbols that result in 4-5 zeros in binary, so that you do not accidentally get a null terminator in the middle of your string.
For decoding, you do a base32 encode and then split the string at the delimiters.
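The bit packing itself can be sketched in C roughly as follows. This is a minimal sketch, not a definitive implementation: the names pack5, unpack5 and SEP are hypothetical, the code assignment (1 to 26 for the letters, 27 for the separator, 0 reserved for padding) is an assumption, and ASCII is assumed. The output is a byte buffer with an explicit length rather than a NUL-terminated string, because packed bytes can still be zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SEP = 27 };  /* assumed codes: 1..26 = 'a'..'z', 27 = separator, 0 = padding */

/* Packs n lowercase strings into out, 5 bits per symbol, writing a separator
   after each string. Returns the number of bytes written, or SIZE_MAX if
   out is too small or a character is not in 'a'..'z'. */
size_t pack5(const char *const *strs, size_t n, uint8_t *out, size_t cap)
{
    uint32_t acc = 0;   /* bit accumulator; only the low nbits are pending */
    int nbits = 0;
    size_t len = 0;
    assert(strs != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        for (const char *p = strs[i]; ; p++) {
            unsigned code;
            if (*p == '\0') code = SEP;
            else if (*p >= 'a' && *p <= 'z') code = (unsigned)(*p - 'a') + 1;
            else return SIZE_MAX;
            acc = (acc << 5) | code;
            nbits += 5;
            while (nbits >= 8) {            /* flush every complete byte */
                if (len == cap) return SIZE_MAX;
                out[len++] = (uint8_t)(acc >> (nbits - 8));
                nbits -= 8;
            }
            if (*p == '\0') break;          /* separator written, next string */
        }
    }
    if (nbits > 0) {                        /* pad the last byte with zero bits */
        if (len == cap) return SIZE_MAX;
        out[len++] = (uint8_t)(acc << (8 - nbits));
    }
    return len;
}

/* Unpacks into dst as NUL-terminated strings stored back to back.
   Returns the number of complete strings recovered. */
size_t unpack5(const uint8_t *in, size_t len, char *dst, size_t cap)
{
    uint32_t acc = 0;
    int nbits = 0;
    size_t count = 0, w = 0;
    assert(in != NULL && dst != NULL);

    for (size_t i = 0; i < len; i++) {
        acc = (acc << 8) | in[i];
        nbits += 8;
        while (nbits >= 5) {
            unsigned code = (acc >> (nbits - 5)) & 31u;
            nbits -= 5;
            if (code == 0 || code > SEP || w == cap)
                return count;               /* padding, bad code or full buffer */
            if (code == SEP) { dst[w++] = '\0'; count++; }
            else dst[w++] = (char)('a' + code - 1);
        }
    }
    return count;
}
```

With these assumed codes, the three strings "abc", "" and "zz" take 8 symbols of 5 bits each, so they pack into 5 bytes instead of the 8 bytes a delimiter-joined string would need.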

sizeof(string) not including a "\" sign

I've been going through strlen and sizeof for strings (char arrays) and I don't quite get one thing.
I have the following code:
#include <stdio.h>
#include <string.h>

int main(void) {
    char str[] = "gdb\0eahr";
    printf("sizeof=%zu\n", sizeof(str));   /* %zu is the size_t format */
    printf("strlen=%zu\n", strlen(str));
    return 0;
}
the output of the code is:
sizeof=9
strlen=3
At first I was pretty sure that 2 separate characters \ followed by 0 wouldn't actually act as a NUL (\0) but I managed to figure that it does.
The thing is that I have no idea why sizeof shows 9 and not 10.
Since sizeof counts the amount of used bytes by the data type why doesn't it count the byte for the \?
In a following example:
char str[]="abc";
printf("sizeof=%zu\n",sizeof(str));
that would print out "4" because of the NUL value terminating the array, so why is the \ not being counted?
In a character or string constant, the \ character marks the beginning of an escape sequence, used to represent character values for which there isn't a symbol in the source character set. For example, the escape sequence \n represents the newline character, \b represents the backspace character, \0 represents the zero-valued character (which is also the string terminator), etc.
In the string literal "gdb\0eahr", the escape sequence \0 maps to a single 0-valued character; the actual contents of str are {'g', 'd', 'b', 0, 'e', 'a', 'h', 'r', 0}.
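To make the byte count concrete, here is a small sketch. The helper names count_zero_bytes and hidden_after_terminator are hypothetical and exist only for illustration; the array is the literal from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char str[] = "gdb\0eahr";   /* the literal from the question */

/* Counts the zero-valued bytes among the first size bytes of buf. */
size_t count_zero_bytes(const char *buf, size_t size)
{
    size_t zeros = 0;
    assert(buf != NULL);
    for (size_t i = 0; i < size; i++)
        if (buf[i] == '\0')
            zeros++;
    return zeros;
}

/* Number of bytes that follow the first terminator and that strlen
   therefore never sees. Assumes buf holds a NUL within size bytes. */
size_t hidden_after_terminator(const char *buf, size_t size)
{
    assert(buf != NULL);
    return size - strlen(buf) - 1;
}
```

For this array, sizeof str is 9, strlen(str) is 3, count_zero_bytes(str, sizeof str) finds the two zero bytes, and hidden_after_terminator(str, sizeof str) reports the five bytes ('e', 'a', 'h', 'r' and the final 0) that sit behind the embedded terminator.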
It seems you already have the answer:
At first I was pretty sure that 2 separate characters "\" followed by "0" wouldn't actually act as a NULL "\0" but I managed to figure that it does.
The sequence \0 is an octal escape sequence for the byte 0. So while there are two characters in the source code to denote it, it translates to a single byte in the string.
So you have 7 alphabetic characters, a null byte in the middle, and a null byte at the end. That's 9 bytes.
Why should char str[]="gdb\0eahr"; be 10 bytes with sizeof operator? It is 9 bytes because there are 8 string elements + trailing zero.
\0 is only 1 character, not 2. \'s purpose is to escape characters, therefore you might see some of these: \t, \n, \\ and others.
strlen returns 3 because the string terminator is at position str[3].
A single \ acts as the escape character and is not itself counted in the string's size. If you want a literal \ in your string, you have to write it twice, as \\, which then produces a single printable \ character.
The C compiler scans string literals as part of compiling the source code, and during the scan each escape sequence is turned into a single character. The backslash (\) indicates the start of an escape sequence.
There are several formats for escape sequences. The most basic is a backslash followed by one of several special letters. These two characters are then translated into a single character. Some of these are:
'\n' is turned into a line feed character (0x0A or decimal 10)
'\t' is turned into a tab character (0x09 or decimal 9)
'\r' is turned into a carriage return character (0x0D or decimal 13)
'\\' is turned into a backslash character (0x5C)
This escape sequence idea was used back in the old days so that when a line of text was printed to a teletype machine or printer or a CRT terminal, the programmer could use these and other special command code characters to set where the next character would be printed or to cause the device to do some physical action like ring a bell or feed the paper to the next line.
The escape character also allowed you to embed a double quote (") or a single quote (') into a text string so that you could print text that contain quote marks.
In addition to the above special sequences of a backslash followed by a letter, there was also a way to specify any character by using a backslash followed by one to three octal digits (0 through 7). So you could specify a line feed character by using either '\n' or '\12', where 12 is the octal representation of the hexadecimal value A or the decimal value 10.
Then the ability to use a hexadecimal escape sequence was introduced with the backslash followed by the letter x followed by one or more hexadecimal digits. So you can write a line feed character with '\n' or '\12' or '\xa'.
See also Escape sequences in C in Wikipedia.

Unknown escape sequence

I am trying to printf a string that shows a temperature table
printf("TABLE 24A (20\°C)");
The degree sign is a constant I have defined as 0xDF, so the string looks like this: "TABLE 24A (20\xDF C)"
This works but looks incorrect because of the space between the \xDF and the C.
If I remove the space the compiler issues a warning hex escape sequence out of range.
If I modify the string to "TABLE 24A (20\xDF\C)" I get the correct result but the compiler issues warning unknown escape sequence: '\C'
Is there a way to get rid of the warnings but lose the space between the two characters?
You can take advantage of the fact that consecutive string literals are automatically concatenated:
printf("TABLE 24A (20\xDF" "C)");
This prevents the parser from consuming more characters for the escape sequence than you want.
You could also pass in the character as a parameter and use the %c format specifier to print it:
printf("TABLE 24A (20%cC)", '\xDF');
\x escape sequences consume as many adjacent hex digits as possible. The C is being parsed as a hex digit.
With \x, you could combine two adjacent string literals.
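printf("TABLE 24A (20\xDF""C)");
Or use a \unnnn universal character name, which takes exactly four hex digits. Note that it names the Unicode code point U+00DF, so the bytes it produces depend on the execution character set (under UTF-8 it is not the single byte 0xDF).
printf("TABLE 24A (20\u00DFC)");
Or octal \nnn:
printf("TABLE 24A (20\337C)");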
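A small sketch can confirm that the split-literal form and the octal form give the same bytes. The names with_concat, with_octal and same_bytes are hypothetical, chosen just for this illustration.

```c
#include <assert.h>
#include <string.h>

/* The \x escape ends at the closing quote, so the C is not swallowed. */
static const char with_concat[] = "TABLE 24A (20\xDF" "C)";
/* An octal escape stops after at most three digits: \337 is 0xDF. */
static const char with_octal[]  = "TABLE 24A (20\337C)";

/* Returns 1 if both spellings produce identical arrays, else 0. */
int same_bytes(void)
{
    return sizeof with_concat == sizeof with_octal
        && memcmp(with_concat, with_octal, sizeof with_concat) == 0;
}
```

Both arrays are 17 bytes long (16 characters plus the terminator), with the byte 0xDF at index 13 immediately followed by 'C'.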

C scan unicode character from string

I have a wchar_t string which includes Unicode characters like "ş, ç, ü,.."
I need to take these characters one by one from the string, but I can't read them with sscanf, and I couldn't find an alternative function. So what should I do?

Resources