JavaMail API getSubject(): the subject has multiple "=?utf-8?B?~?=" encoded words, how can I parse it? (jakarta-mail)

My mail Subject is
Subject: =?utf-8?B?7IOI66Gc7Jq0IOyEpOusuOyhsOyCrOqwgCDsi5zsnpHrkJjs?=
=?utf-8?B?l4jsirXri4jri6QhIOydtCDquLDtmowg64aT7LmY7KeAIOuniOyEuOya?=
=?utf-8?B?lCE=?=
mimeMessage.getSubject() returns the following:
The black diamonds are characters that failed to decode; the language is Korean.
This is the correct subject:
When I concatenate the raw data as below (deleting the \r\n line breaks and the inner "?=" and "=?utf-8?B?" markers), MimeUtility.decodeText() returns the correct result:
MimeUtility.decodeText(=?utf-8?B?7IOI66Gc7Jq0IOyEpOusuOyhsOyCrOqwgCDsi5zsnpHrkJjsl4jsirXri4jri6QhIOydtCDquLDtmowg64aT7LmY7KeAIOuniOyEuOyalCE=?=)
The result is:
How can I parse a subject that is split across multiple encoded-word lines like this?

The problem is that the mailer that encoded this text encoded it incorrectly. What mailer was used to create this message?
The 16-bit Korean Unicode characters are converted to a stream of 8-bit bytes in UTF-8 format. Those bytes are then encoded using base64.
The MIME spec (RFC 2047) requires that each encoded word contain complete characters:
Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's.
In your example above, the bytes representing one of the Korean characters are split across multiple encoded words. Combining them into one encoded word, as you have done, allows the text to be decoded correctly.
This is a bug in the mailer that created the message and should be reported to the owner of that mailer.
Unfortunately, there's no good workaround in JavaMail for such a broken mailer.

I created a function that decodes the text with MimeUtility.decodeText(), repeating up to 5 times on any encoded words that remain:
/*
 * Decodes text with MimeUtility.decodeText(), retrying up to 5 times
 * on any encoded words left over after the previous pass.
 */
private String decode(String encoded) throws UnsupportedEncodingException {
    String result = MimeUtility.decodeText(encoded);
    int counter = 0;
    while (result.contains("=?") && counter < 5) {
        counter++;
        String end = result.substring(result.indexOf("=?"));
        result = result.substring(0, result.indexOf("=?")) + MimeUtility.decodeText(end);
    }
    return result;
}

Related

How do I convert from Hex to a Unicode code point?

I have a pointer to a stream of bytes encoded in UTF8. I am trying to publish this byte stream as a JSON compatible string.
It worked fine until I hit the em dash. At this point the output of my program began to spit out garbage.
I was using snprintf to get the job done like so:
if (nUTF8CodePoints == 2)
{
    DebugLog(@"2 Unicode code points");
    snprintf(myEscapedUnicode, 8, "\\u%2x%2x", *cur, *(cur+1));
}
else if (nUTF8CodePoints == 3)
{
    DebugLog(@"3 Unicode code points");
    snprintf(myEscapedUnicode, 8, "\\u%2x%2x%2x", *cur, *(cur+1), *(cur+2));
}
else if (nUTF8CodePoints == 4)
{
    DebugLog(@"4 Unicode code points");
    snprintf(myEscapedUnicode, 8, "\\u%2x%2x%2x%2x", *cur, *(cur+1), *(cur+2), *(cur+3));
}
This code gives me \ue2809 where I expected U+2014. Now I am confused. I thought that in U+XXXX the XXXX was supposed to be hex, yet the hex representation is giving me a different 6 digits than the 4 that are expected. How am I supposed to encode this into the expected JSON-compatible UTF-8?
Something tells me I'm close, but no cigar. For example, the utf8-chartable.de em dash entry concurs with me that there is a difference; still, I don't quite understand what it means and am not sure how to get C to print it.
U+2014 — e2 80 94 EM DASH
So how do I print out these 3 bytes (e2 80 94) as U+2014? And what does the XXXX mean in this U+2014? I thought it was supposed to be hex.
As I understand it, JSON is allowed (with an exception) to contain UTF-8-encoded text as-is. So to start out with, I don't think you need to treat your Unicode characters specially, or try to turn them into \uXXXX escape sequences, at all.
If you do want to emit a \uXXXX sequence, you're going to have to convert from UTF-8 back to a "pure" Unicode character (or, formally, something more like UTF-16). One way to do this -- at least, if your C library is up to it, and you've got your locale set correctly -- is with the mbtowc function. I think you should be able to use it something like this:
setlocale(LC_CTYPE, "UTF-8");
wchar_t wc;
mbtowc(&wc, cur, nUTF8CodePoints);
snprintf(myEscapedUnicode, 8, "\\u%04x", wc);
The only wrinkle is characters that don't fit in 16 bits, or stated another way, characters outside the Basic Multilingual Plane (BMP). Although UTF-8 can handle these just fine, as I understand it, in JSON they must be encoded as surrogate pairs, using \u notation. (I learn this from Wikipedia; I don't claim any JSON expertise here.)
So far I've ducked this requirement in my own JSON work. I'll go out on a limb and guess it would look something like this (see this description of low and high surrogates in Wikipedia):
if (wc > 0xffff) {
    unsigned int lo = (wc - 0x10000) & 0x3ff;
    unsigned int hi = ((wc - 0x10000) >> 10) & 0x3ff;
    snprintf(myEscapedUnicode, 12, "\\u%04x\\u%04x", hi + 0xD800, lo + 0xDC00);
}
Note that this is going to take more than 8 bytes in myEscapedUnicode.
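Putting the two pieces together, here is a minimal sketch of the whole round trip (the function name escape_char and the hard-coded em dash are only illustrative; it assumes the process runs under a UTF-8 locale and that wchar_t can hold a full code point, as it does on Linux):
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: escape the multibyte character at 'cur' (nbytes long) as a JSON \u sequence. */
static void escape_char(const char *cur, int nbytes, char *out, size_t outlen)
{
    wchar_t wc;
    if (mbtowc(&wc, cur, nbytes) <= 0) {     /* invalid or empty sequence */
        out[0] = '\0';
        return;
    }
    if (wc > 0xFFFF) {
        /* outside the BMP: JSON wants a surrogate pair */
        unsigned int v = (unsigned int)wc - 0x10000;
        snprintf(out, outlen, "\\u%04x\\u%04x",
                 ((v >> 10) & 0x3FF) + 0xD800, (v & 0x3FF) + 0xDC00);
    } else {
        snprintf(out, outlen, "\\u%04x", (unsigned int)wc);
    }
}

int main(void)
{
    setlocale(LC_CTYPE, "");                 /* use the environment's (UTF-8) locale */
    const char emdash[] = "\xe2\x80\x94";    /* U+2014 EM DASH in UTF-8 */
    char buf[16];
    escape_char(emdash, 3, buf, sizeof buf);
    printf("%s\n", buf);                     /* expected output: \u2014 */
    return 0;
}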

Encoding an array of strings into a single string

You're given an array of strings where each character in the string is lowercase. Each character and the length of each string is randomly generated. Encode the string such that:
1. The encoded output is a single string with minimum possible length
2. You should be able to decode the string later
I am thinking the mention of each character being lowercase is key here. Since there are only 26 lowercase characters, maybe we can encode them using 5 bits instead of 8 bits and then pack them. But I am not sure how to implement this bit packing while looping over the array of strings.
For 26 characters and a separator you could use base32. Basically, concatenate the strings with a delimiter and then do a base32 decode; it should be easy to find code for that. Just do not use those characters that result in 4-5 zeros in binary, so that you do not accidentally have a null terminator in the middle of your string.
For decoding you'll do base32 encode and then split the string at delimiters.
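If you want to go with the original 5-bit packing idea instead, a rough sketch of the encoding side might look like this (the function name pack5 and the code assignment 0 = separator, 1..26 = 'a'..'z' are made up for illustration; the caller must size the output buffer, and decoding just walks the bit stream in the same order):
#include <stddef.h>

/*
 * Sketch: pack lowercase strings into a 5-bit-per-symbol stream.
 * Code 0 is the string separator, codes 1..26 are 'a'..'z'.
 * Returns the number of bytes written to 'out'.
 */
size_t pack5(const char **strings, size_t count, unsigned char *out)
{
    size_t bits = 0;                                  /* total bits written so far */
    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; ; j++) {
            char ch = strings[i][j];
            unsigned code = ch ? (unsigned)(ch - 'a' + 1) : 0;
            for (int b = 4; b >= 0; b--, bits++) {    /* emit the 5 bits, MSB first */
                if (bits % 8 == 0)
                    out[bits / 8] = 0;
                out[bits / 8] |= ((code >> b) & 1u) << (7 - bits % 8);
            }
            if (!ch)                                  /* separator written, next string */
                break;
        }
    }
    return (bits + 7) / 8;
}
With 5 bits per symbol, n letters spread over k strings take about 5 * (n + k) / 8 bytes instead of the n + k bytes a plain delimiter-separated concatenation would use.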

Get size of Blob / String in bytes in Apex?

I want to know the size of a String/Blob in Apex.
All I found is the size() method, which returns the number of characters in the String/Blob.
What is the size of a single character in Salesforce?
Or is there any way to know the size in bytes directly?
I think the only real answer here is "it depends". Why do you need to know this?
The methods on String like charAt and codePointAt suggest that UTF-16 might be used internally; in that case, each character would be represented by 2 or 4 bytes, but this is hardly "proof".
Apex seems to be translated to Java and run on some form of JVM, and Strings in Java are represented internally as UTF-16, so again that could indicate that characters take 2 or 4 bytes in Apex.
Any time Strings are sent over the wire (e.g. as responses from a @RestResource annotated class), UTF-8 seems to be used as the character encoding, which would mean 1 to 4 bytes per character, depending on which character it is. (See section 2.5 of the Unicode standard.)
But you should really ask yourself why you think your code needs to know this because it most likely doesn't matter.
You can estimate the string size by doing the following:
String testString = 'test string';
Blob testBlob = Blob.valueOf(testString);
// Convert the blob to its hexadecimal representation; every four bits of
// the blob become a single hex character.
String hexString = EncodingUtil.convertToHex(testBlob);
// One byte is eight bits, i.e. two hex characters, so the number of bytes is:
Integer numOfBytes = hexString.length() / 2;
Another option to estimate the size would be to get the heap size before and after assigning a value to a String variable:
String testString;
System.debug(Limits.getHeapSize());
testString = 'testString';
System.debug(Limits.getHeapSize());
The difference between the two printed numbers is the size the string takes on the heap.
Please note that the values obtained from these two methods will differ; we don't know what encoding is used for storing strings on the Salesforce heap or when converting a String to a Blob.

How to map Unicode code points from a UTF-16 file using C

I need to read a file in binary mode which is written in UTF-16 encoding and transform it to Unicode code points. I had no problem successfully mapping the code points in the U+0000..U+FFFF interval. The problem is that from U+10000 to U+10FFFF, UTF-16 uses two units to form the code point.
Example: this rocket " 🚀 " is encoded in UTF-16 as 0xD83D 0xDE80, forming the Unicode code point U+1F680.
Since UTF-16 code units are numerically identical to the Unicode code points in the interval U+0000 to U+FFFF, I wrote my code to simply translate each UTF-16 unit it reads into a Unicode code point. The problem is with U+10000 and above, since my program treats the first unit (D83D) as if it were a code point in the interval U+0000 to U+FFFF.
How can I avoid this error? What can I do so my code knows that the unit it is reading needs one more unit to form the code point?
Thanks in advance!
The search term you are missing is "surrogate pair". Note that the following code doesn't do any error checking or bounds checking.
int next_codepoint(uint16_t *text) {
    int c1 = text[0];
    if (c1 >= 0xd800 && c1 < 0xdc00) {
        int c2 = text[1];
        return ((c1 & 0x3ff) << 10) + (c2 & 0x3ff) + 0x10000;
    }
    return c1;
}
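For illustration, a caller could walk a buffer with it like this (a sketch assuming the buffer ends with a 0 unit; it re-checks the result to decide whether one or two 16-bit units were consumed):
uint16_t *p = text;
while (*p) {
    int cp = next_codepoint(p);
    printf("U+%04X\n", cp);
    p += (cp > 0xFFFF) ? 2 : 1;   /* anything above U+FFFF came from a surrogate pair */
}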
This is described in the Unicode specification which is freely available from the Unicode website, as well as Wikipedia articles on UTF-16. There are also many libraries available for codec conversion, like iconv. You are trying to convert UTF-16 to UTF-32, if that helps.
Either do the surrogate pair conversion, or use a library that does this for you, like iconv or libunistring. See:
https://www.gnu.org/software/libiconv/
https://www.gnu.org/software/libunistring/
Example:
https://github.com/drichardson/examples/blob/master/iconv/utf8-to-utf32.c
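As a rough sketch of the iconv route (converting a UTF-16LE buffer straight to UTF-32LE code points; the function name, the fixed endianness, and the minimal error handling are assumptions, and real code should inspect errno and handle partial conversions):
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: convert a UTF-16LE byte buffer to UTF-32LE code points.
 * Returns the number of code points written, or -1 on error. */
long utf16_to_codepoints(const char *in, size_t inlen, uint32_t *out, size_t outcount)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-16LE");
    if (cd == (iconv_t) -1)
        return -1;

    char *inp = (char *) in;
    char *outp = (char *) out;
    size_t outbytes = outcount * sizeof(uint32_t);

    size_t rc = iconv(cd, &inp, &inlen, &outp, &outbytes);
    iconv_close(cd);

    if (rc == (size_t) -1)
        return -1;
    return (long)(outcount - outbytes / sizeof(uint32_t));
}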

Special characters are not displayed correctly in the Linux Terminal

I have a file encoded in UTF-8, as shown by the following command:
file -i D.txt
D.txt: text/plain; charset=utf-8
I just want to display each character one after another, so I did this:
FILE * F_entree = fopen("D.txt", "r");
if (! F_entree) usage("impossible d'ouvrir le fichier d'entrée");
char ligne[TAILLE_MAX];
while (fgets(ligne, TAILLE_MAX, F_entree))
{
    string mot = strtok(strdup(ligne), "\t");
    while (*mot++) { printf("%c \n", *mot); }
}
But the special characters aren't displayed correctly (a <?> is displayed instead) in the terminal (on Ubuntu 12). I think the problem is that only ASCII can be stored in %c, but how can I display those special characters?
And what is a good way to keep those characters in memory (in order to implement a tree index)? (I'm aware that this last question is unclear; don't hesitate to ask for clarification.)
It does not work because your code splits the multi-byte characters into separate bytes. After seeing the first byte of a multi-byte sequence, your console expects the rest of that sequence; since it does not receive the correct bytes (you are stuffing a space and a newline in between), you get your <?> -- translated freely, "whuh?".
Your console can only correctly interpret UTF8 characters if you send the right codes and in the correct sequence. The algorithm is:
Is the next character the start code for a UTF-8 sequence? If not, print it and continue.
If it is, print it and print all "next" codes for this character. See Wikipedia on UTF8 for the actual encoding; I took a shortcut in my code below.
Only then print your space (..?) and newline.
The procedure to recognize the start and length of a UTF8 multibyte character is this:
"Regular" (ASCII) characters never have their 7th bit set. Testing against 0x80 is enough to differentiate them from UTF8.
Each UTF8 character sequence starts with one of the bit patterns 110xxxxx, 1110xxxx, 11110xxx, 111110xx, or 1111110x. Every unique bit pattern has an associated number of extra bytes. The first one, for example, expects one additional byte. The xxx bits are combined with bits from the next byte(s) to form a 16-bit or longer Unicode character. (After all, that is what UTF8 is all about.)
Each next byte -- no matter how many! -- has the bit pattern 10xxxxxx. Important: none of the previous patterns start with this code!
Therefore, as soon as you see any UTF8 character, you can immediately display it and all 'next' codes, as long as they start with the bit pattern 10....... This can be tested efficiently with a bit-mask: value & 0xc0, and the result should be 0x80. Any other value means it's not a 'next' byte anymore, so you're done then.
All of this only works if your source file is valid UTF8. If you get to see some strange output, it most likely is not. If you need to check the input file for validity, you do need to implement the entire table in the Wikipedia page, and check if each 110xxxxx byte is in fact followed by a single 10xxxxxx byte, and so on. The pattern 10xxxxxx appearing on itself would indicate an error.
A definitive must-read is Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). See also UTF-8 and Unicode FAQ for Unix/Linux for more background information.
My code below addresses a few other issues with yours. I've used English variable names (see Meta Stackoverflow "Foreign variable names etc. in code"). It appears to me that strdup is not necessary. Also, string is a C++ type, not a C one.
My code does not "fix" or handle anything beyond the UTF-8 printing. Because of your use of strtok, the code only prints the text before the first \t Tab character on each line in your input file. I assume you know what you are doing there ;-)
Addendum: ah, I forgot to address Q2, "what's a good way to keep those characters in memory". UTF-8 is designed to be maximally compatible with C-type char strings; you can safely store them as such. You don't need to do anything special to print them on a UTF-8-aware console -- well, except when you are doing what you do here, printing them as separate characters. printf ought to work just fine for whole words.
If you need UTF8-aware equivalents of strcmp, strchr, and strlen, you can roll your own code (see the Wikipedia link above, and the small sketch after the listing below) or find yourself a good pre-made library. (I left out strcpy intentionally!)
#include <stdio.h>
#include <string.h>

#define MAX_LINE_LENGTH 1024

int main (void)
{
    char line[MAX_LINE_LENGTH], *word;
    FILE *entry_file = fopen("D.txt", "r");

    if (!entry_file)
    {
        printf ("not possible to open entry_file\n");
        return -1;
    }

    while (fgets(line, MAX_LINE_LENGTH, entry_file))
    {
        word = strtok(line, "\t");
        while (*word)
        {
            /* print UTF8 encoded characters as a single entity */
            if (*word & 0x80)
            {
                do
                {
                    printf("%c", *word);
                    word++;
                } while ((*word & 0xc0) == 0x80);
                printf ("\n");
            } else
            {
                /* print low ASCII characters as-is */
                printf("%c \n", *word);
                word++;
            }
        }
    }
    return 0;
}
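Regarding the "roll your own" remark above: counting characters instead of bytes, for instance, needs nothing more than the same 10xxxxxx test. A minimal sketch (the name utf8_strlen is made up; it assumes the input is valid UTF-8):
#include <stddef.h>

/* Count UTF-8 characters (not bytes) by skipping 10xxxxxx continuation bytes. */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char) *s & 0xc0) != 0x80)
            n++;
    return n;
}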

Resources