How do I decode an email subject in C?

There is a problem parsing the subject in the mail header.
For example, the subject takes the following form:
subject: =?iso-2022-KR?B?DjlMOC4PIA....gyDzogT?=
My guess was that the base64-decoded bytes should include the escape characters (SO, SI, ESC$)C), but they do not appear in the decoded output.
How can I get a normal string?
I hope for a result like the one below:
Subject: like this, 안녕하세요.
Please give me a hint on how to handle this at the code level, in C.
Update
Sorry, the SO and SI were actually there, but I had missed them. There was no ESC$)C, though. The problem was resolved immediately; I am sharing the result for others.
When ESC$)C is absent, libiconv has a problem with the input, but gconv (in glibc) does not. I had been using libiconv; switching to gconv solved the problem.
Thanks.

So in =?iso-2022-KR?B?DjlMOC4PIA....gyDzogT?= the B sandwiched by question marks means base64 encoded. The iso-2022-KR is the character set. The DjlMOC4PIA....gyDzogT is the base64-encoded title.
You first base64 decode the title. It's easy to find a solution for this in C.
This will leave you with an array of binary bytes which is the title encoded in the ISO-2022-KR character set. Presumably you want to convert that to UTF-8 or some other character set your computer can handle. Your best bet for this part is to use a character set conversion utility. If you are on Linux or macOS, you can use the iconv library. See iconv_open, iconv and iconv_close.
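For the conversion step, here is a minimal iconv sketch. This is my own illustration, not the questioner's code: the input buffer stands in for the base64-decoded header bytes, and its byte values are hand-derived from the EUC-KR table for 안녕하세요.

#include <iconv.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical input: what base64-decoding the header might yield.
       ESC$)C designates KS X 1001, SO (0x0E) shifts into it, SI (0x0F) out. */
    char in[] = "\x1b$)C\x0e\x3e\x48\x33\x67\x47\x4f\x3c\x3c\x3f\x64\x0f";
    char out[256];
    char *inp = in, *outp = out;
    size_t inleft = sizeof in - 1, outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-8", "ISO-2022-KR");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    /* iconv advances the pointers and decrements the counts as it converts */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");  /* e.g. EILSEQ when an escape sequence is missing */

    fwrite(out, 1, sizeof out - outleft, stdout);  /* prints 안녕하세요 */
    iconv_close(cd);
    return 0;
}

Whether the converter tolerates input that shifts with SO but lacks the leading ESC$)C designation is exactly the libiconv/gconv difference mentioned in the question's update.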

Related

How to handle an error while decoding a Base64 string

I am trying to write a small Base64 coder/decoder program, and I'm trying to figure out whether there are any rules, guidelines, or expected behavior for when I run into a character that is not valid.
I could fail fast (complain and exit), ignore non-valid characters (like I would do for newlines, etc.), or do a junk-in, junk-out approach (where the data will be partially decoded, and the rest depends on the severity or the exact number of errors).
On a similar point: I imagine I should ignore newlines (like in PEM files, where lines are broken at a 64-character length), but are there any other control characters that I could expect, and should ignore properly?
If it is of any interest, I'm coding in pure (vanilla) C, which doesn't already have a library for this. But that detail shouldn't really matter for the answer I'm looking for.
Thanks.
My apologies. The RFCs on MIME (1341, 1521, 2045) contain the following paragraph, which I could not find until now:
The output stream (encoded bytes) must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Table 1 must be ignored by decoding software. In base64 data, characters other than those in Table 1, line breaks, and other white space probably indicate a transmission error, about which a warning message or even a message rejection might be appropriate under some circumstances.
In any case, it is appropriate that this question and answer be available on Stack Overflow.
P.S. If there are other standards of Base64 with other guidelines, links and quotes are appropriate in other answers.
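For illustration, a minimal sketch of a decoder that follows that rule (my own code, not from any standard library): everything outside the base64 alphabet is silently skipped, and a warning could be emitted where the comment indicates.

#include <stdio.h>

/* Map one character to its 6-bit value; -1 means "not in Table 1". */
static int b64val(int c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Decode src into dst, ignoring characters outside the alphabet
   (line breaks, CR, stray whitespace). Stops at '=' padding.
   Returns the number of bytes written. */
static size_t b64decode(const char *src, unsigned char *dst) {
    unsigned acc = 0;
    int bits = 0;
    size_t n = 0;
    for (; *src && *src != '='; src++) {
        int v = b64val((unsigned char)*src);
        if (v < 0) continue;   /* per RFC 2045: ignore (or warn here) */
        acc = (acc << 6) | (unsigned)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            dst[n++] = (unsigned char)((acc >> bits) & 0xFF);
        }
    }
    return n;
}

int main(void) {
    unsigned char out[16];
    size_t n = b64decode("aGVs\nbG8=", out);  /* embedded newline is skipped */
    fwrite(out, 1, n, stdout);                /* prints "hello" */
    return 0;
}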

Will program crash if Unicode data is parsed as Multibyte?

Okay, basically what I'm asking is:
Let's say I use PathFindFileNameA on a Unicode path. I obtain this path via GetModuleFileNameA, but since this API doesn't support Unicode characters (Italian characters, for example) it will output junk characters in that part of the system path.
Let's assume x represents a junk character in the file path, such as:
C:\Users\xxxxxxx\Desktop\myfile.sys
I assume that PathFindFileNameA just parses the string with strtok until it encounters the last \\, and outputs the remainder into a preallocated buffer, given str_length - pos_of_last \\.
The question is: will PathFindFileNameA parse the string correctly even if it encounters the junk characters from a failed Unicode conversion (since the multi-byte counterpart of the API is called), or will it crash the program?
Don't answer something like "Well just use MultiByteToWideChar", or "Just use a wide-version of the API". I am asking a specific question, and a specific answer would be appreciated.
Thanks!
Why do you think the Windows API would just do strtok? As far as I know, all xxA APIs have redirected to the xxW APIs since long before Windows 10 was released.
And I think the answer to this question is quite simple: just write a small program, set the code page to whatever you want, run it, and the answer comes out.
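For instance, a quick sketch of such a test program (my own illustration, not from the answer; link with shlwapi.lib for PathFindFileNameA):

#include <windows.h>
#include <shlwapi.h>
#include <stdio.h>

int main(void) {
    char path[MAX_PATH];
    /* NULL module handle means the running executable */
    if (GetModuleFileNameA(NULL, path, MAX_PATH) == 0) {
        fprintf(stderr, "GetModuleFileNameA failed: %lu\n", GetLastError());
        return 1;
    }
    /* PathFindFileNameA returns a pointer into path just past the last
       backslash. Bytes that failed ANSI conversion typically come back
       as '?', but they contain no embedded NUL or '\\', so the scan is
       not disturbed. */
    printf("full: %s\nname: %s\n", path, PathFindFileNameA(path));
    return 0;
}

Run it under a user name containing non-ANSI characters (or seed the buffer with junk bytes yourself) and observe the output.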
P.S.: Personally I think GetModuleFileNameA will work correctly even if there are junk characters, because Windows stores the image name as a UNICODE_STRING internally. And even if you use MBCS, the junk does not contain zero bytes, so it will work as usual, since it's essentially just a strncpy.
Sorry for my last answer :)

String marshalling with marshal_as and encodings

Converting between String^ and std::string is very easy using marshal_as. However, I have nowhere found a description of how encodings in such a string are handled. String^ uses UTF-16, but text in a std::string can be interpreted in various ways, and it would be very useful if the marshalling converted to an encoding that is native to your application.
In my case all std::string instances contain UTF-8 encoded text. So how would I tell marshal_as to give me an UTF-8 encoded variant of the original String^ (and vice versa)?
I agree that the documentation is lacking, and without proper documentation we are programming by coincidence. marshal_as can be very useful, but when I have a question that isn't answered in the documentation, I just skip it and do the conversion in multiple steps. Someone may have an accurate answer about how marshal_as behaves in each case, but unless you add it to your code as a comment, the next programmer isn't going to think of the issue or understand it, even after checking the documentation.
The BCL is very capable of converting characters. I suggest using an Encoding member to GetBytes and then copying them into a C or C++ string data structure/class. Despite requiring more steps, it is then clear which character sets and encodings you are using, how mismatches are handled, how the string ownership can be transferred and how it should be destroyed. (Mismatches are, of course, not applicable when converting between UTF-16 and UTF-8.)
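As a sketch of that multi-step route in C++/CLI (my own illustration; to_utf8 is a hypothetical helper name, not part of any library):

#include <string>

std::string to_utf8(System::String^ s) {
    /* Encoding::UTF8->GetBytes makes the target encoding explicit */
    array<System::Byte>^ bytes = System::Text::Encoding::UTF8->GetBytes(s);
    if (bytes->Length == 0) return std::string();
    pin_ptr<System::Byte> p = &bytes[0];  /* pin so the GC can't move it */
    return std::string(reinterpret_cast<char*>(p), bytes->Length);
}

The reverse direction is the same idea: take the raw bytes out of the std::string and hand them to Encoding::UTF8->GetString.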

Portable literal strings in C source files

Ok, I have this:
AllocConsole();
SetConsoleOutputCP(CP_UTF8);
HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 10, NULL, NULL);
WriteConsoleW(consoleHandle, L"wΕλληνικά\n", 10, NULL, NULL);
printf("aΕλληνικά\n");
wprintf(L"wΕλληνικά\n");
Now, the issue is that depending on the encoding the file was saved as, only some of these work. wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters). Yet I have issues with the three others. If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version working I need to increase the character(?) count to 18:
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 18, NULL, NULL);
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops (I get ? instead of a character). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
Now, let's go to UTF-16 (Unicode - Codepage 1200). Only WriteConsoleW works. The character count in WriteConsoleA should be 10 to keep the formatting precise.
Saving in UTF-16 Big Endian mode (Unicode - Codepage 1201) does not change anything. Again, WTF? Shouldn't the byte order inside the strings be inverted when stored to file?
The conclusion is that the way strings are compiled into binary form depends on the encoding used. Therefore, what is a portable and compiler-independent way to store strings? Is there a preprocessor which would convert one string representation into another before compilation, so I could store the file in UTF-8 and only preprocess the strings which I need to have in UTF-16 by wrapping them in some macro?
I think you've got at least a few assumptions here which are either wrong or not 100% correct as far as I know:
Now, the issue is that depending on the encoding the file was saved as, only some of these work.
Of course, because the encoding determines how to interpret the string literals.
wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters).
I've never heard of that one, but I'm rather sure this depends on the locale set for your program. I've got a few work projects where a locale is set and the output is just fine, including German umlauts etc.
If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version working I need to increase the character(?) count to 18:
That's because the ANSI version wants an ANSI string, while you're passing a UTF-8 encoded string (based on the file's encoding). The output still works, because the console handles the UTF-8 conversion for you - you're essentially printing raw UTF-8 here.
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
I don't think so (although I'm not sure why it isn't working either). Have you tried setting some easy-to-find string and looking for it in the resulting binary? I'm rather sure it's indeed encoded using UTF-16. I assume that due to the missing BOM the compiler interprets the whole file as a narrow (codepage) source and therefore converts the UTF-8 content incorrectly.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops (I get ? instead of a character). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
This is exactly what I described above. Now the wide string is encoded properly, because the compiler knows the file is in UTF-8, not ANSI (or some codepage). The narrow string is properly converted to the locale being used as well.
Overall, there's no encoding-independent way to do it, unless you escape everything using the proper codepage and/or UTF codes in advance. I'd just stick to UTF-8 with BOM, because I think all current compilers will be able to properly read and interpret the file (besides Microsoft's Resource Compiler; although I haven't tried feeding the 2012 version with UTF-8).
Edit:
To use an analogy:
You're essentially saving a raw image to a file and expecting it to work properly no matter whether other programs read it as a grayscale, palettized, or full-color image. That won't work, even though the differences here are smaller.
The answer is here.
Quoting:
It is impossible for the compiler to intermix UTF-8 and UTF-16 strings into the compiled output! So you have to decide for one source code file:
either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use the L prefix),
or UTF-8 without BOM and generate UTF-8 strings only (i.e. never use the L prefix);
7-bit ASCII characters are not involved and can be used with or without the L prefix.
The only portable and compiler-independent way is to use the ASCII charset and escape sequences, because there is no guarantee that every compiler accepts a UTF-8 encoded file, and compilers may treat those multibyte sequences differently.
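To illustrate that portable route (my own sketch, not from the answer): the source stays pure 7-bit ASCII, the wide literal spells the Greek text with universal character names, and the narrow literal hand-encodes the same text as raw UTF-8 bytes.

/* Portable: only 7-bit ASCII appears in the source. The wide literal
   uses universal character names (C99 \u escapes); the narrow one
   hand-encodes the same Greek text ("Ellinika") as UTF-8 bytes. */
const wchar_t *wide   = L"w\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC\n";
const char    *narrow = "a\xCE\x95\xCE\xBB\xCE\xBB\xCE\xB7\xCE\xBD\xCE\xB9\xCE\xBA\xCE\xAC\n";

The drawback is obvious: the literals are unreadable, which is why the UTF-8-with-BOM compromise described above is usually the more practical choice.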

Char sent via socket

This question is purely theoretical, and not about the right way of doing it, but do we need to convert a char 'x' to network byte order? I'm interested in all cases: always / sometimes / never.
I personally think I should, but I need to be sure. Thank you.
No, char is a single-byte value, so endianness doesn't matter.
As you're thinking of it (endianness, ntohs, ntohl, etc.), no.
Less basically, I should raise a warning that is not network-bound: any string not accompanied by its encoding is unreadable.
Say you're sending or storing, over a network or not, the string "Français": the 'ç' will have to be encoded using some character encoding. Not specifying the character encoding means it can be read as "FranÃ§ais" if you encoded it using UTF-8 but your reader assumed Latin-1.
So you have two solutions:
Write down a spec for your application to always use the same character encoding.
Put metadata in a header somewhere specifying the encoding of the strings that follow (like HTTP does).
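To make the endianness point concrete, a minimal sketch (my own illustration; sockfd is assumed to be an already-connected socket):

#include <stdint.h>
#include <unistd.h>      /* write */
#include <arpa/inet.h>   /* htons */

void send_example(int sockfd) {
    char c = 'x';
    uint16_t port = 8080;

    /* A char is one byte: it has no byte order, so send it as-is */
    write(sockfd, &c, 1);

    /* A multi-byte integer does: convert to network byte order first */
    uint16_t net = htons(port);
    write(sockfd, &net, sizeof net);
}

The encoding caveat above is a separate concern: it determines which bytes represent the text at all, not the order they travel in.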
