Facebook content collation and non-western encoded characters - database

If a user writes a string of text in Arabic into a Facebook comment and saves it, what is the collation type of the data storage?
I don't believe that they use a MySQL table for comments, but I've just experimented with the topic using a localhost MySQL table, where I stored some Arabic in a binary character column.
It transformed the text into what was presumably an escaped sequence of characters, and once saved, it stayed that way.
Regarding i18n: even when I have Facebook set to English, typing in non-Western encoded characters still saves and displays correctly.
Any insight into how they've achieved this?

First; I don't know for sure but I don't believe MySQL comes into play anywhere for this.
The right thing to do is store it as UTF-8 in <some-system>, period. That might as well be MySQL, I guess. I don't know specifics, but I do believe MySQL (and PHP for that matter**) are not really up to par with UTF-8/Unicode support, so they might manifest some "glitches". For example, you need to execute "SET NAMES utf8" or similar as the first thing after opening the connection for UTF-8 to work at all (which might be why your test didn't work). Also, I remember something about MySQL not supporting 4-byte encoded UTF-8 characters, only up to 3. I don't know if that is true currently, but I vaguely remember something about it. [edit] Should be fixed in 5.5+, which added the utf8mb4 character set.
I don't know about Arabic, but its characters might be the 4-byte kind. [edit] They should need 2 or 3 bytes; letters in the basic Arabic block (U+0600 to U+06FF) need 2.
And while we're on glitches: about PHP I remember things like strlen() returning the number of bytes instead of the number of actual characters. If I'm not mistaken it has some mb_XXX functions (multibyte string) that should handle UTF-8 better. [edit] Turns out it does.
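To make the byte counts above concrete, here is a small sketch in Python (chosen only for illustration; the thread itself is about MySQL and PHP). It compares the number of characters in a short Arabic word with the number of bytes in its UTF-8 encoding, which is the same distinction as strlen() versus the mb_ functions. The sample word and the variable names are my own.

```python
# Five Arabic letters (meem, reh, hah, beh, alef), written as escapes
# so the example does not depend on the file's own encoding.
text = "\u0645\u0631\u062d\u0628\u0627"

char_count = len(text)                    # counts code points
byte_count = len(text.encode("utf-8"))    # counts UTF-8 bytes

# Bytes needed per character in UTF-8.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]

# A character outside the Basic Multilingual Plane needs 4 bytes,
# which is the case the old 3-byte limit could not store.
emoji_bytes = len("\U0001F600".encode("utf-8"))

print(char_count, byte_count, bytes_per_char, emoji_bytes)
```

A byte-oriented length function would report 10 for this word, while a character-aware one would report 5.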
I don't see how i18n and setting facebook to English (or Swahili for that matter) would affect this at all. It's just the language used in the interface (and maybe/probably affecting datetime formatting etc.) and has nothing to do with user-generated content.
Oh, almost forgot the obligatory The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)-link
** Just mentioning it because it usually goes hand-in-hand with MySQL.

Related

Allowed characters in AppEngine Datastore key name

If I create a named key for use in Google AppEngine, what kind of String is the key-name? Does it use Unicode characters or is it a binary string?
More specifically, if I want to have my key-name made up of 8-bit binary data, is there a way I can do it? If not, can I at least use 7-bit binary data? Or are there any reserved values? Does it use NULL as the End-Of-String marker, for example?
GAE docs do not specify any restrictions on the key-name String. So a String with any content should be valid.
If you want to use binary data as an identifier, then you should encode it into a String. You can use any of the binary-to-text encoding methods: the most used seem to be Base64 (3 bytes = 4 chars) and hexadecimal, also called Base16 (1 byte = 2 chars).
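As a rough illustration of the encoding step, here is a minimal Python sketch (not AppEngine API code) that turns arbitrary bytes into a printable key name and back. The function names are hypothetical, and the URL-safe Base64 alphabet is my own choice, not something the GAE docs require.

```python
import base64


def key_name_from_bytes(raw: bytes) -> str:
    """Encode arbitrary binary data as a printable ASCII key name."""
    return base64.urlsafe_b64encode(raw).decode("ascii")


def bytes_from_key_name(name: str) -> bytes:
    """Recover the original binary data from an encoded key name."""
    return base64.urlsafe_b64decode(name.encode("ascii"))


raw = bytes([0, 19, 128, 255])
b64_name = key_name_from_bytes(raw)   # 4 chars per 3 bytes, padded
hex_name = raw.hex()                  # 2 chars per byte

print(b64_name, hex_name)
```

The resulting string would then be passed as the key name wherever the datastore API expects one.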
I meanwhile had some time to actually test this out by generating a bunch of keys with binary names and then performing a kind-only query to get all the keys back. Here are the results:
Any binary character is fine. If you create an entity with key name "\x00\x13\x127\x255", a query will find this entity and its key name will return that same string.
The AppEngine Dashboard, Database Viewer, and other tools will simply omit characters that aren't displayable, so the key names "\x00test" and "\x00\x00test" will both show up as separate entities, but their keys are both shown as "test".
I have not tested all available AppEngine tools, just some of the basics in the Console, so there may be other tools that get confused by such keys...
Keys are UTF-8 encoded, so any character between 128 and 255 takes up 2 bytes of storage
From this, I would derive the following recommendations:
If you need to be able to work with individual entities from the AppEngine console and need to identify them by key, you are limited to printable characters and thus need to encode the binary key name into a String in either Base16 (hex; 100% overhead), Base64 (33% overhead), or Base85 (25% overhead).
If you don't care about key readability, but need to pack as much data as possible into the key name with minimal storage use, use Base128 encoding (i.e. 7 bits only; 14% overhead) to avoid the implicit UTF-8 encoding (50% overhead on average!) of 8-bit data.
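The overhead figures can be checked with a short Python sketch. It assumes, as the test results suggest, that each 8-bit value is treated as a code point from 0 to 255 and then stored as UTF-8; the helper name and the sample data (every byte value once) are mine.

```python
import base64


def utf8_size_of_8bit(data: bytes) -> int:
    """Size of the data when every byte is treated as a code point
    0-255 and then UTF-8 encoded (values 128-255 take 2 bytes)."""
    return len(data.decode("latin-1").encode("utf-8"))


data = bytes(range(256))  # every byte value exactly once

sizes = {
    "raw": len(data),
    "utf8": utf8_size_of_8bit(data),
    "base16": len(base64.b16encode(data)),
    "base64": len(base64.b64encode(data)),
    "base85": len(base64.b85encode(data)),
}

for name, size in sizes.items():
    overhead = 100 * (size - sizes["raw"]) / sizes["raw"]
    print(f"{name}: {size} bytes, {overhead:.0f}% overhead")
```

For uniformly distributed bytes, implicit UTF-8 costs about as much as nothing was gained over Base64, which is why a 7-bit encoding comes out ahead.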
Asides:
I will accept @PeterKnego's answer instead of this one since this one basically only confirms and expands on what he already assumed correctly.
From looking through the source code of the Java API, I think that the UTF-8 encoding of the key-name happens in the API (while building the protocol buffer) rather than in BigTable, so if you really want to go nuts on storage space maximization, it may be possible to build your own protocol buffers and store full 8-bit data without overhead. But this is probably asking for trouble...

Parsing a domain name

I am parsing the domain name out of a string by using strchr() to find the last . (dot) and counting back until the dot before that (if any); then I know I have my domain.
This is a rather nasty piece of code and I was wondering if anyone has a better way.
The possible strings I might get are:
domain.com
something.domain.com
some.some.domain.com
You get the idea. I need to extract the "domain.com" part.
Before you tell me to go search in google, I already did. No answer, hence I am asking here.
Thank you for your help
EDIT:
The string I have contains a full hostname. This usually is in the form of whatever.domain.com but can also take other forms and as someone mentioned it can also have whatever.domain.co.uk. Either way, I need to parse the domain part of the hostname: domain.com or domain.co.uk
Did you mean strrchr()?
I would probably approach this by doing:
1. strrchr to get the last dot in the string, save a pointer here, replace the dot with a NUL ('\0').
2. strrchr again to get the next to last dot in the string. The character after this is the start of the name you are looking for (domain.com).
3. Using the pointer you saved in #1, put the dot back where you wrote the NUL.
Beware that names can sometimes end with a dot, if this is a valid part of your input set, you'll need to account for it.
Edit: To handle the flexibility you need in terms of example.co.uk and others, the function described above would take an additional parameter telling it how many components to extract from the end of the name.
You're on your own for figuring out how to decide how many components to extract -- as Philip Potter mentions in a comment below, this is a Hard Problem.
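The logic of "take the last N components" can be sketched as follows. This is Python rather than C, purely to show the idea; the function name and the count parameter are hypothetical, and deciding the right count for a given suffix such as co.uk is left to the caller, as noted above.

```python
def last_labels(hostname: str, count: int = 2) -> str:
    """Return the last `count` dot-separated labels of a hostname.

    A trailing dot (fully qualified form) is ignored. If the hostname
    has fewer labels than requested, it is returned unchanged.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    labels = hostname.rstrip(".").split(".")
    return ".".join(labels[-count:])


print(last_labels("some.some.domain.com"))        # last two labels
print(last_labels("whatever.domain.co.uk", 3))    # caller supplies 3
```

In C the same result comes from walking back over `count` dots from the end of the string, which is what the repeated strrchr() calls do.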
This isn't a reply to the question itself, but an idea for an alternate approach:
In the context of already very nasty code, I'd argue that a good way to make it less nasty, and to get a good facility for parsing domain names and the like, is to use PCRE or a similar library for regular expressions. That will definitely help you out if you also want to validate that the TLD exists, for instance.
It may take some effort to learn initially, but if you need to make changes to existing matching/parsing code, or create more code for string matching - I'd argue that a regex-lib may simplify this a lot in the long term. Especially for more advanced matching.
Another library I recall which supports regex, is glib.
Not sure what flavor of C, but you probably want to tokenize the domain using "." as the separator.
Try this: http://www.metalshell.com/source_code/31/String_Tokenizer.html
As for the domain name, not sure what your end goal is, but domains can have lots and lots of nodes, you could have a domain name foo.baz.biz.boz.bar.co.uk.
If you just want the last 2 nodes, then use above and get the last two tokens.

Solr - character substitution

I have Solr with an indexed database. In my database all data is in Latvian. The problem is, I need to be able to search for the word Riga as if it were the word Rīga. Of course, I can define a synonym (Rīga = Riga), but can I just define that the letter ī is the letter i? I read something about solr.ISOLatin1AccentFilterFactory, but as far as I understood, this is not for UTF-8 encoding, right? Any advice?
Used PatternReplaceFilterFactory with index and query. Seems to be working right.
ISOLatin1AccentFilterFactory is exactly what you are looking for... as long as the accented character EXISTS in the Latin-1 character set (the first 256 Unicode code points are identical to Latin-1). The ī that you mentioned does not exist in ISO-8859-1, so ISOLatin1AccentFilterFactory won't work in this SPECIFIC case. I would still recommend that you use ISOLatin1AccentFilterFactory in addition to any exceptions that you take care of using PatternReplaceFilterFactory, as there are probably some Latvian characters that it will help with (an assumption on my part; I don't have experience with Latvian).
FYI, I did actually try this against my Solr setup with ISOLatin1AccentFilterFactory and it didn't help in this case.
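To show what folding ī to i amounts to, independent of any particular Solr filter, here is a minimal Python sketch. It decomposes each character (NFD) and drops the combining marks. This is only an illustration of the general technique, not how any Solr filter is implemented, and the function name is mine.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping
    the combining marks that result."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("R\u012bga"))     # U+012B is the letter i with macron
print(fold_accents("J\u016brmala"))  # U+016B is the letter u with macron
```

Applying the same folding at both index time and query time is what makes a search for Riga match Rīga.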
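Look at the ICU analysis components: ICUTokenizerFactory for Unicode-aware tokenization, together with ICUFoldingFilterFactory, which provides Unicode character normalization and folding. Extremely useful and very easy.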
http://lucene.apache.org/solr/api/org/apache/solr/analysis/ICUTokenizerFactory.html
http://site.icu-project.org/

Case fold UTF-8 without knowing the language

I'm trying to evaluate different strategies for case insensitive UTF-8 string comparison.
I've read some material from the Unicode consortium, experimented with ICU and tried to come up with various quality-of-implementation alternatives.
On multiple occasions I've seen texts differ between Simple Case Mapping and Full Case Mapping, and I wanted to make sure I understand the difference entirely.
As I read it, Simple Case Mapping is "context-free", i.e. doesn't need to know what language the payload is. This will give approximate results, due to the Turkic "I/ı/İ/i" debacle.
Full Case Mapping, on the other hand, needs to know the language of the payload to be able to perform the mapping. With that extra information, it can take special measures to cover cases where "Kim" as a Turkic string should become "KİM" in upper-case, but "Kim" as an English string, should become "KIM" in upper-case.
Have I got that right?
Are there other examples of "multi-faceted" code points that fold differently for different languages?
Thanks!
UPDATE: One of the sources mentioning simple case mapping as language independent is ICU's documentation. I interpreted that as Unicode truth, but maybe it's just a statement of the implementation?
No, a "full case mapping" is a case mapping where one code point may need to be replaced by more than one new code point. A simple case mapping is always a single code point substitution.
If you want to implement this yourself then the Unicode CaseFolding.txt file is crucial to get this right. Note the status field code "T", specifically there to handle the Turkish I problem.
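A quick way to see one-to-many mappings in action is Python's built-in string methods, which (to my knowledge) apply the full, language-independent Unicode mappings. This is only an illustrative sketch; the helper name is mine, and nothing here handles the Turkic "T" entries.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using full case folding, with no language tailoring."""
    return a.casefold() == b.casefold()


sharp_s_upper = "\u00df".upper()      # one code point in, two out
dotted_i_lower = "\u0130".lower()     # capital I with dot above, lowered

print(sharp_s_upper, len(dotted_i_lower))
print(caseless_equal("Stra\u00dfe", "STRASSE"))
print("KIM".lower())                  # no Turkic tailoring is applied
```

A simple (one-to-one) mapping cannot express either of the first two results, which is the distinction the answer describes.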
Well ... The consonant combination "SS" would down-case to "ss" for most Western languages, but in German it might become the special letter "ß". That's just "might", there are quite involved usage rules to consider.
I think this doesn't directly affect collation order (any Germans are of course welcome to correct me) though, so maybe it's a moot point.

Phone Number Columns in a Database

In the last 3 companies I've worked at, the phone number columns are of type varchar(n). The reason being that they might want to store extensions (ext. 333). But in every case, the "-" characters are stripped out when inserting and updating. I don't understand why the "ext." characters are okay to store but not the "-" character. Has anyone else seen this, and what explanation can you think of for doing it this way? If all you want to store is the numbers, then aren't you better off using an int field? Conversely, if you want to store the number as a string/varchar, then why not keep all the characters and not bother with formatting on display and cleaning on write?
I'm also interested in hearing about other ways in which phone number storage is implemented in other places.
Quick test: are you going to add/subtract/multiply/divide Phone Numbers? Nope. Similarly to SSNs, Phone Numbers are discrete pieces of data that can contain actual numbers, so a string type is probably most appropriate.
One point with storing phone numbers is a leading 0.
e.g.: 01202 8765432
In an int column, the 0 will be stripped off, which makes the phone number invalid.
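The effect is easy to reproduce outside a database. A minimal Python sketch, using the number from the example above (variable names are mine):

```python
raw = "01202 8765432"
digits = raw.replace(" ", "")   # digits as the user typed them

as_int = int(digits)            # what an integer column would hold
round_trip = str(as_int)        # what you get back when reading it

print(digits, round_trip)
```

The leading zero cannot be recovered from the integer, because the integer never contained it.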
I would hazard a guess that the - characters are swapped for spaces or removed because they don't actually mean anything.
e.g.: 123-456-789 = 123 456 789 = 123456789
Personally, I wouldn't strip out any characters, as depending on where the phone number is from, it could mean different things. Leave the phone number in the exact format it was entered, as obviously that's the way the person who typed it in is used to seeing it.
It doesn't really matter how you store it, as long as it's consistent. The norm is to strip out formatting characters, but you can also store country code, area code, exchange, and extension separately if you have a need to query on those values. Again, the requirement is that it's consistent - otherwise querying it is a PITA.
Another reason I can think of not to store phone numbers as 'numbers' but as strings of characters, is that often enough part of the software stack you'd use to access the database (PHP, I am looking at you) wouldn't support big enough integers (natively) to be able to store some of the longer and/or exotic phone numbers.
The largest number that 32 bits can carry, without sign, is 4294967295. That isn't enough for some Russian mobile phone numbers; take, for instance, the number 4959261234.
So you have yourself an extra inconvenience of finding a way to carry more than 32-bits worth of number data. Even though databases have long supported very large integers, you only need one bad link in the chain for a showstopper. Like PHP, again.
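The arithmetic can be checked directly. In this Python sketch, struct.pack with the "<I" format stands in for any layer of the stack that only has an unsigned 32-bit integer available; the helper name is hypothetical.

```python
import struct

UINT32_MAX = 2**32 - 1
number = 4959261234             # the example number from above

fits_unsigned_32 = number <= UINT32_MAX


def pack_uint32(value: int):
    """Try to store a value in an unsigned 32-bit field; None if it does not fit."""
    try:
        return struct.pack("<I", value)
    except struct.error:
        return None


packed = pack_uint32(number)
as_text = str(number)           # a string has no such limit

print(UINT32_MAX, fits_unsigned_32, packed, as_text)
```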
Stripping some characters and allowing others may have an impact if the database table is going to drive another system, e.g. IP telephony of some sort. Depending on the systems involved, it may be legitimate to have ext.333 as a suffix, whereas the developers may not have accounted for "-" in the string (and yes, I am guessing here...)
As for storing as a varchar rather than an int, this is just plain-ole common sense to me. As mentioned before, leading zeros may be stripped in an int field, and a query on an int field may perform implicit math (which could also explain stripping "-" from the text: you don't want to enter 555-1234 and have it stored as -679, do you?)
In short, I don't know the exact reasoning, but can deduce some possibilities.
I'd opt to store the digits as a string and add the various "()" and "-" in my display code. It does get more difficult with international numbers. We handle it by having various "internationalized" display formats depending on country.
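A minimal sketch of that split between storage and display, in Python. The function names are hypothetical, and the display format shown is just one North American convention; other countries would need their own formatter.

```python
import re


def normalize(raw: str) -> str:
    """Keep only the ASCII digits; this is the value that gets stored."""
    return re.sub(r"[^0-9]", "", raw)


def format_nanp(digits: str) -> str:
    """Format a stored 10-digit North American number for display."""
    if len(digits) != 10:
        raise ValueError("expected exactly 10 digits")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


stored = normalize("555-123-4567")
print(stored, format_nanp(stored))
```

Because every variant of the same number normalizes to the same stored string, lookups stay consistent regardless of how the user typed it.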
What I like to do, if I know the phone numbers are only going to be within a specific region such as North America, is to change the entry into 4 fields: 3 digits for area code, 3 for prefix, 4 for line, and maybe 5 for extension. I then insert these as 1 field with '-' separators and maybe an 'e' to designate the extension. Any searching of course also needs to follow the same process. This ensures I get more regular data and even allows the number to be used for actually making a phone call, once the - and the extension are removed. I can also get back to the original 4 fields easily.
Good stuff! It seems that the main point is that the formatting of the phone number is not actually part of the data but is instead an aspect of the source country. Still, by keeping the extension part of the number as is, one might be breaking the model of separating the formatting from the data. I doubt that all countries use the same syntax/format to describe an extension. Additionally, if integrating with a phone system is a (possible) requirement, then it might be better to store the extension separately and build the message as it is expected. But Mark also makes a good point that if you are consistent, then it probably won't matter how you store it since you can query and process it consistently as well.
Thank you Eric for the link to the other question.
When an automated telephone system uses a field to make a phone call, it may not be able to tell which characters it should use and which it should ignore in dialing. A human being may see a "(" or ")" or "-" character and know these are delimiters separating the area code (NPA), exchange (NXX), and line number of the phone number. Remember, though, that each character represents a binary pattern that, unless the dialer is pre-programmed to ignore it, would be entered by an automated dialer. To account for this, it is better to store the equivalent of only the characters a user would press on the phone handset, and even better to store the individual values in separate columns so the dialer can use individual fields without having to parse the string.
Even if you are not using dialing automation, it is good practice to store data in a form you won't need to clean up in the future. It is much easier to add characters between fields than to strip them out of strings.
Regarding the string vs. integer datatype question noted above: strings are the proper way to store phone numbers, given the variations between countries. There is an important caveat, though: when aggregating statistics for reporting (e.g. counting how many numbers or calls), character strings are MUCH slower to count than integers. To account for this, it's important to add an integer identity column that you can use for counting instead of the varchar or char field.
