Encoding an array of strings into a single string

You're given an array of strings in which every character is lowercase. The characters and the length of each string are randomly generated. Encode the array such that:
1. The encoded output is a single string with minimum possible length
2. You should be able to decode the string later
I think the mention of each character being lowercase is key here. Since there are only 26 lowercase characters, maybe we can encode them using 5 bits instead of 8 bits and then pack them. But I am not sure how to implement this bit packing while looping over the array of strings.

For 26 characters and a separator you could use base32. Basically, concatenate the strings with a delimiter, treat the result as base32 text (5 bits per character), and do a base32 decode to pack it into bytes; it should be easy to find code for that. Just do not use those characters that result in 4-5 zeros in binary, so that you do not accidentally get a null terminator in the middle of your string.
For decoding you'll do a base32 encode and then split the string at the delimiters.
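The packing step can be sketched by hand as follows. This is a minimal sketch, not a standard base32 codec: the 5-bit code assignment (0 = padding, 1..26 = 'a'..'z', 27 = separator) and the names put5, get5, pack_strings and unpack_strings are assumptions made for illustration. Instead of trying to avoid zero bytes, it returns an explicit byte count, because two adjacent codes such as 10000 and 00001 can still produce a 0x00 byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 5-bit alphabet: 0 = padding, 1..26 = 'a'..'z', 27 = separator. */
enum { CODE_END = 0, CODE_SEP = 27 };

/* Append the low 5 bits of code to out at bit offset *bitpos (MSB first). */
static void put5(uint8_t *out, size_t *bitpos, unsigned code)
{
    for (int i = 4; i >= 0; i--) {
        if ((code >> i) & 1u)
            out[*bitpos / 8] |= (uint8_t)(0x80u >> (*bitpos % 8));
        (*bitpos)++;
    }
}

/* Read one 5-bit code from in at bit offset *bitpos (MSB first). */
static unsigned get5(const uint8_t *in, size_t *bitpos)
{
    unsigned code = 0;
    for (int i = 0; i < 5; i++) {
        code = (code << 1) | ((in[*bitpos / 8] >> (7 - *bitpos % 8)) & 1u);
        (*bitpos)++;
    }
    return code;
}

/* Pack n lowercase strings into out (capacity cap bytes), writing a
   separator code after each string. Returns the number of bytes used,
   or 0 if out is too small. */
size_t pack_strings(const char *const *strs, size_t n, uint8_t *out, size_t cap)
{
    size_t bitpos = 0;
    memset(out, 0, cap);
    for (size_t s = 0; s < n; s++) {
        for (const char *p = strs[s]; *p; p++) {
            if (bitpos + 5 > cap * 8) return 0;
            put5(out, &bitpos, (unsigned)(*p - 'a') + 1u);
        }
        if (bitpos + 5 > cap * 8) return 0;
        put5(out, &bitpos, CODE_SEP);
    }
    return (bitpos + 7) / 8;
}

/* Unpack len bytes into text, writing '\n' after each decoded string.
   Returns the number of strings found. cap must be at least 1. */
size_t unpack_strings(const uint8_t *in, size_t len, char *text, size_t cap)
{
    size_t bitpos = 0, count = 0, t = 0;
    while (bitpos + 5 <= len * 8 && t + 1 < cap) {
        unsigned code = get5(in, &bitpos);
        if (code == CODE_END || code > CODE_SEP) break; /* padding or invalid */
        if (code == CODE_SEP) { text[t++] = '\n'; count++; }
        else text[t++] = (char)('a' + (int)code - 1);
    }
    text[t] = '\0';
    return count;
}
```

For example, the strings "abc", "" and "zz" need 8 codes (5 letters plus 3 separators), which is 40 bits, so they pack into 5 bytes.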

Related

How to escape a character in bytearray

I am creating a bytearray from a list.
mybytes_array = bytes([255, 110, 41, 128, 9])
I then use a regex to find all occurrences of it in ba:
[(m.start(0), m.end(0)) for m in re.finditer(mybytes_array, ba)]
The value 41 could be any other byte, including one that is a regex metacharacter (41 itself is ")"). I want to escape such bytes so that I can match the pattern against ba, which is also a bytearray.
How can I do that?
Obviously I cannot convert to a string, prepend a backslash, and then match against ba, so I am not sure how to change mybytes_array so that it searches for the correct bytes.

The re package can work on both str and bytes inputs as long as the arguments are of the same type.
You may use re.escape to escape the whole bytes object.
Your code will be something like
[(m.start(0), m.end(0)) for m in re.finditer(re.escape(mybytes_array), ba)]

Can I store UTF8 in C-style char array

Follow-up to: Can UTF-8 contain zero byte?
Can I safely store UTF8 string in zero terminated char * ?
I understand strlen() will not return correct information, but "storing", printing and "transferring" the char array seems to be safe.

Yes.
Just like with ASCII and similar 8-bit encodings before Unicode, you can't store the NUL character in such a string (U+0000 is the Unicode code point NUL, very much like in ASCII).
As long as you know your strings don't need to contain that (and regular text doesn't), it's fine.
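To illustrate why strlen() still works as a byte count but not as a character count, here is a small sketch. The function name utf8_code_points is made up for this example, and it assumes the input is valid UTF-8. It relies on the fact that every byte of a multi-byte UTF-8 sequence has its high bit set, so no byte other than an encoded U+0000 is ever zero.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by skipping
   continuation bytes (those of the form 10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "aα" (bytes 0x61 0xCE 0xB1), strlen() reports 3 bytes while this function reports 2 code points.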

In C, a 0 byte is the string terminator. As long as the code point U+0000 is not in the string, there is no problem.
To be able to store U+0000, one may use modified UTF-8, which encodes not only code points >= 128 but also 0 as a multi-byte sequence (0xC0 0x80, every byte thereof having its high bit set). This is done in Java for some APIs, like DataOutputStream.writeUTF. It ensures you can transmit strings with an embedded 0.
Formally this is no longer UTF-8, as UTF-8 requires the shortest encoding. Also, once such a string is unpacked into a form that contains real 0 bytes, its length has to be tracked explicitly instead of with strlen.
So the most feasible solution is not to accept U+0000 in strings.
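If embedded zeros really are needed, the modified UTF-8 trick could look roughly like this. It is a sketch under assumptions: the function name encode_embedded_nul is hypothetical, the input is assumed to be valid UTF-8 with an explicit length, and only the U+0000 handling is shown (Java's modified UTF-8 also treats supplementary characters differently, which is not covered here).

```c
#include <stddef.h>

/* Copy len bytes of valid UTF-8 from in to out, writing every 0x00 byte
   as the two-byte sequence 0xC0 0x80, then add a terminating NUL.
   Returns the number of bytes written (excluding the terminator),
   or (size_t)-1 if out is too small. */
size_t encode_embedded_nul(const unsigned char *in, size_t len,
                           unsigned char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        size_t need = (in[i] == 0x00) ? 2 : 1;
        if (o + need + 1 > cap)          /* keep room for the terminator */
            return (size_t)-1;
        if (in[i] == 0x00) {
            out[o++] = 0xC0;
            out[o++] = 0x80;
        } else {
            out[o++] = in[i];
        }
    }
    if (o + 1 > cap)
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

The result contains no zero byte before the terminator, so strlen() and the other C string functions see the whole string.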

Trouble comparing UTF-8 characters using wchar.h

I am in the process of making a small program that reads a file, that contains UTF-8 elements, char by char. After reading a char it compares it with a few other characters and if there is a match it replaces the character in the file with an underscore '_'.
(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)
I'm not sure where exactly I'm messing up here but it's most likely everywhere.
Here is my code:
FILE *fpi;
FILE *fpo;
char ifilename[FILENAME_MAX];
char ofilename[FILENAME_MAX];
wint_t sample;
fpi = fopen(ifilename, "rb");
fpo = fopen(ofilename, "wb");
while (!feof(fpi)) {
fread(&sample, sizeof(wchar_t*), 1, fpi);
if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0) ) {
fwrite(L"_", sizeof(wchar_t*), 1, fpo);
} else {
fwrite(&sample, sizeof(wchar_t*), 1, fpo);
}
}
I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.
If I feed this program a file containing the words γειά σου κόσμε. I would want it to return this:
γ_ι_ σου κόσμ_.
Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.
Anything pointing me the right way is most welcome.
I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.
Thank you in advance for your help.

C has two different kinds of characters: multibyte characters and wide characters.
Multibyte characters can take a varying number of bytes. For instance, in UTF-8 (which is a variable-length encoding of Unicode), a takes 1 byte, while α takes 2 bytes.
Wide characters always take the same number of bytes. Additionally, a wchar_t must be able to hold any single character from the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, some platforms made wchar_t 16 bits wide: such platforms cannot correctly support characters beyond the BMP using wchar_t. If __STDC_ISO_10646__ is defined, wchar_t holds Unicode code-points, so must be (at least) 4 bytes long (technically, it must be at least 21-bits long).
So, when using UTF-8, you should use multibyte characters, which are stored in normal char variables (but beware of strlen(), which counts bytes, not multibyte characters).
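Working on the multibyte form directly might look like the sketch below. The function name replace_utf8 and its parameters are made up for illustration, and it assumes valid UTF-8 input and precomposed target characters (it does no normalization). Matching at the byte level is safe here because a UTF-8 lead byte can never be mistaken for a continuation byte.

```c
#include <stddef.h>
#include <string.h>

/* Copy src to dst, replacing every occurrence of any UTF-8 sequence in
   targets with a single '_'. dst must have room for strlen(src) + 1
   bytes. Assumes src and targets are valid UTF-8. */
void replace_utf8(const char *src, char *dst,
                  const char *const *targets, size_t ntargets)
{
    while (*src) {
        size_t matched = 0;
        for (size_t i = 0; i < ntargets; i++) {
            size_t n = strlen(targets[i]);
            if (n > 0 && strncmp(src, targets[i], n) == 0) {
                matched = n;            /* whole multibyte character matched */
                break;
            }
        }
        if (matched) {
            *dst++ = '_';
            src += matched;
        } else {
            *dst++ = *src++;            /* copy the byte unchanged */
        }
    }
    *dst = '\0';
}
```

With the targets "\xCE\xAC" (ά, U+03AC) and "\xCE\xB5" (ε, U+03B5), the word κόσμε becomes κόσμ_.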
Unfortunately, there is more to Unicode than this.
ά can be represented as a single Unicode codepoint, or as two separate codepoints:
U+03AC GREEK SMALL LETTER ALPHA WITH TONOS ← 1 codepoint ← 1 multibyte character ← 2 bytes (0xCE 0xAC) = 2 char's.
U+03B1 GREEK SMALL LETTER ALPHA U+0301 COMBINING ACUTE ACCENT ← 2 codepoints ← 2 multibyte characters ← 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 char's.
U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA ← 1 codepoint ← 1 multibyte character ← 3 bytes (0xE1 0xBD 0xB1) = 3 char's.
All of the above are canonical equivalents, which means that they should be treated as equal for all purposes. So, you should normalize your strings on input/output, using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).

First of all, please do take the time to read this great article, which explains UTF8 vs Unicode and lots of other important things about strings and encodings: http://www.joelonsoftware.com/articles/Unicode.html
What you are trying to do in your code is read in Unicode character by character, and do comparisons with those. That won't work if the input stream is UTF8, and it's not really possible to do with quite this structure.
In short: Fully unicode strings can be encoded in several ways. One of them is using a series of equally-sized "wide" chars, one for each character. That is what the wchar_t type (sometimes WCHAR) is for. Another way is UTF8, which uses a variable number of raw bytes to encode each character, depending on the value of the character.
UTF8 is just a stream of bytes, which can encode a unicode string, and is commonly used in files. It is not the same as a string of WCHARs, which are the more common in-memory representation. You can't poke through a UTF8 stream reliably, and do character replacements within it directly. You'll need to read the whole thing in and decode it, and then loop through the WCHARs that result to do your comparisons and replacement, and then map that result back to UTF8 to write to the output file.
On Win32, use MultiByteToWideChar to do the decoding, and you can use the corresponding WideCharToMultiByte to go back.
When you use a "string literal" with regular quotes, you're creating a nul-terminated narrow string (char*) made of single bytes, in which one array element is not necessarily one Unicode character. The L"string literal" with the L prefix will create a nul-terminated string of WCHARs (wchar_t *), which you can use in string or character comparisons. The L prefix also works with single-quote character literals, like so: L'ε'
As a commenter noted, when you use fread/fwrite, you should be using sizeof(wchar_t) and not its pointer type, since the amount you are trying to read/write is an actual wchar, not the size of a pointer to one. This advice is just code feedback independent of the above-- you don't want to be reading the input character by character anyways.
Note too that when you do string comparisons (wcscmp), you should use actual wide strings (which are terminated with a nul wide char)-- not use single characters in memory as input. If (when) you want to do character-to-character comparisons, you don't even need to use the string functions. Since a WCHAR is just a value, you can compare directly: if (sample == L'ά') {}.
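The decode-then-compare idea can be sketched without any platform API. Note that this swaps in a hand-written decoder producing uint32_t code points in place of wchar_t and MultiByteToWideChar; the names utf8_decode and should_replace are hypothetical, and the decoder assumes valid UTF-8 and does no validation.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from valid UTF-8 at s, store it in *cp, and
   return the number of bytes consumed. No validation is performed. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {                       /* 0xxxxxxx: 1 byte */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {             /* 110xxxxx: 2 bytes */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {             /* 1110xxxx: 3 bytes */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) |
              (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) |  /* 11110xxx: 4 bytes */
          ((uint32_t)(s[1] & 0x3F) << 12) |
          ((uint32_t)(s[2] & 0x3F) << 6) |
          (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* Once decoded, a character is just a value and can be compared directly. */
int should_replace(uint32_t cp)
{
    return cp == 0x03AC /* ά */ || cp == 0x03B5 /* ε */;
}
```

A caller would loop over the input buffer, advance by the returned byte count, and write either '_' or the original bytes to the output.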

Convert this kind of hex string to a NSData/NSString

I have this hex string:
\x5c30\x3032\x5f5c\x3337\x345c\x3334\x366f\x5c32\x3633\x5c30\x3136\x5c32\x3132\x5c32\x3234\x4e5c\x3236\x335c\x3231\x335c\x3337\x355c\x3335\x315c\x3232\x365c\x3337
How could I convert it to an NSString or NSData? I thought of using C methods, but I'm not experienced in C :(

Looks like Unicode characters (specifically, CJK ideographs) to me.
Use an NSScanner to scan the string. Scan up to a backslash, and add whatever you scanned to a mutable string. Then, scan the backslash and throw it away, and then scan the x and throw that away.
Then, scan four single characters, which will be the digits (NSScanner doesn't have a method to scan a single character, so you will need to get them yourself using characterAtIndex: and then adjust the scanner's scan location accordingly). Perform the appropriate conversion of the hexadecimal digit characters to numbers and the math to assemble a single number from them, and you will have the code point (character value) represented by the escape sequence. Add that single character to your string.
Repeat that until you run out of input string, and you will have converted the input string with all its escape sequences into a string with the unescaped characters.
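Since the question mentions C, the same scanning loop can be sketched in plain C instead of with NSScanner. The function names unescape_x4, hex_digit_value and is_hex are made up for this example, and it assumes every escape has exactly four hex digits, as in the question's input. It only produces the 16-bit values; turning them into an NSString or NSData is left to the caller.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Value of a single hexadecimal digit character (caller checks validity). */
static unsigned hex_digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return (unsigned)(c - '0');
    return (unsigned)(tolower((unsigned char)c) - 'a') + 10u;
}

static int is_hex(char c)
{
    return isxdigit((unsigned char)c) != 0;
}

/* Convert a string containing \xHHHH escapes into 16-bit values.
   Characters outside an escape are copied as their own byte value.
   Writes at most cap values to out and returns how many were written. */
size_t unescape_x4(const char *s, uint16_t *out, size_t cap)
{
    size_t n = 0;
    while (*s && n < cap) {
        if (s[0] == '\\' && s[1] == 'x' &&
            is_hex(s[2]) && is_hex(s[3]) && is_hex(s[4]) && is_hex(s[5])) {
            unsigned v = 0;
            for (int i = 2; i < 6; i++)
                v = v * 16u + hex_digit_value(s[i]);  /* assemble the number */
            out[n++] = (uint16_t)v;
            s += 6;                                   /* skip \x and 4 digits */
        } else {
            out[n++] = (uint16_t)(unsigned char)*s++;
        }
    }
    return n;
}
```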

convert char array to integer value and add them

How can I extract numbers from a char array, separated with spaces, convert them to integers and sum them? For example:
"34 54 3 23"

I'd start at the beginning of the array, check each character in turn with isdigit() and keep a current value and a current total.
When reaching the terminating NUL char (or last element of the array), the current total is already calculated.
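A minimal sketch of that loop, assuming non-negative numbers separated by spaces and no overflow handling (the name sum_by_digits is made up for this example). One detail to watch: the last number has no trailing space, so it has to be added after the loop ends.

```c
#include <ctype.h>

/* Sum the non-negative integers in s, treating any non-digit character
   as a separator. No overflow checking. */
long sum_by_digits(const char *s)
{
    long total = 0;    /* sum of the numbers completed so far */
    long current = 0;  /* value of the number being read */
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            current = current * 10 + (*s - '0');
        } else {
            total += current;
            current = 0;
        }
    }
    return total + current;  /* include the final number */
}
```

For "34 54 3 23" this returns 114.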

You need to parse the string.
If you know how many integers are in there, you could use a single sscanf call.
Otherwise, find out where the blanks are (with something like strtok, for example) and then read the integers using atoi.
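The strtok plus atoi variant could be sketched like this. The name sum_by_tokens is hypothetical, and the sketch assumes the input is a writable buffer, since strtok modifies the string it scans; atoi also gives no way to detect malformed input.

```c
#include <stdlib.h>
#include <string.h>

/* Sum the space-separated integers in s.
   s must be writable: strtok overwrites the separators with '\0'. */
int sum_by_tokens(char *s)
{
    int total = 0;
    for (char *tok = strtok(s, " "); tok != NULL; tok = strtok(NULL, " "))
        total += atoi(tok);
    return total;
}
```

Pass it an array such as char numbers[] = "34 54 3 23"; rather than a string literal, because literals must not be modified.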