Case fold UTF-8 without knowing the language - c

I'm trying to evaluate different strategies for case insensitive UTF-8 string comparison.
I've read some material from the Unicode consortium, experimented with ICU and tried to come up with various quality-of-implementation alternatives.
On multiple occasions I've seen texts differ between Simple Case Mapping and Full Case Mapping, and I wanted to make sure I understand the difference entirely.
As I read it, Simple Case Mapping is "context-free", i.e. it doesn't need to know what language the payload is in. This gives approximate results, due to the Turkic "I/ı/İ/i" debacle.
Full Case Mapping, on the other hand, needs to know the language of the payload to be able to perform the mapping. With that extra information, it can take special measures to cover cases where "Kim" as a Turkic string should become "KİM" in upper-case, but "Kim" as an English string should become "KIM" in upper-case.
Have I got that right?
Are there other examples of "multi-faceted" code points that fold differently for different languages?
Thanks!
UPDATE: One of the sources mentioning simple case mapping as language independent is ICU's documentation. I interpreted that as Unicode truth, but maybe it's just a statement of the implementation?

No, a "full case mapping" is one where a single code point needs to be replaced by more than one new code point. A simple case mapping is a one-to-one code point substitution.
If you want to implement this yourself, then the Unicode CaseFolding.txt file is crucial to get it right. Note the status field code "T", which is there specifically to handle the Turkish I problem.
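As a hedged illustration (not part of the original answer), here is a minimal Python sketch of how one might build a fold table from CaseFolding.txt. It assumes the file has been downloaded locally; the function names are made up, but the C/S/F/T status codes are the ones documented in the file itself.

def load_fold_table(path="CaseFolding.txt", full=True, turkic=False):
    """Map a code point to the tuple of code points it folds to."""
    table, turkic_overrides = {}, {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()          # drop comments
            if not line:
                continue
            code, status, mapping = [p.strip() for p in line.split(";")][:3]
            folded = tuple(int(cp, 16) for cp in mapping.split())
            if status == "C" or status == ("F" if full else "S"):
                table[int(code, 16)] = folded
            elif status == "T":                           # Turkic I/ı/İ/i entries
                turkic_overrides[int(code, 16)] = folded
    if turkic:
        table.update(turkic_overrides)                    # apply the "T" overrides
    return table

def fold(s, table):
    """Case-fold a string using the table built above."""
    return "".join(
        "".join(chr(cp) for cp in table.get(ord(ch), (ord(ch),))) for ch in s
    )

With the default table, fold("KIM", table) gives "kim"; with load_fold_table(turkic=True) it gives "kım" instead, which is exactly the difference the "T" entries encode.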

Well ... The consonant combination "SS" would down-case to "ss" for most Western languages, but in German it might become the special letter "ß". That's just "might"; there are quite involved usage rules to consider.
I don't think this directly affects collation order, though (any Germans are of course welcome to correct me), so maybe it's a moot point.
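Side note, not from the answer: Python's built-in str.casefold() implements the language-independent full case folding, so you can see the ß behaviour directly (it folds ß to "ss", not the other way around):

>>> "Maße".casefold()
'masse'
>>> "Maße".casefold() == "MASSE".casefold()
True

And the Turkic İ, under the default (non-Turkic) rules, folds to a plain i plus a combining dot above:

>>> [hex(ord(c)) for c in "İ".casefold()]
['0x69', '0x307']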


Should flags only be binary?

I may be erring towards pedantry here, but say I have a field in a database that currently has two values (but may contain more in future). I know I could name this as a flag (e.g. MY_FLAG) containing values 0 and 1, but should more values be required (e.g. 0,1,2,3,4), is it still correct to call the field a flag?
I seem to recall reading somewhere that a flag should always be binary, and that anything else should be labelled more appropriately, but I may be mistaken. Does anyone know if my thinking is correct? If so, can you point me to any information on this, please? My googling has turned up nothing!
Thanks very much :o)
Flags are usually binary because when we say "flag" we mean it is either up (1) or down (0).
Just as a military flag is raised or lowered to signal something; the concept of flagging is taken from there.
Regarding your case where more values may be required (e.g. 0, 1, 2, 3, 4):
In such a situation, use an enum. Enumerations are built for such cases. Alternatively, document the meaning of the numeric values in comments or in a separate file and store them in a compact column type (tinyint or a bit field) to save space. But never call such a field a flag.
A flag has a standard meaning: it is either up or down. Using the name for something else won't cause an error or anything, but it is not good practice. Hope you get it.
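A minimal sketch of the enum idea in Python (the OrderStatus name and its values are made up; the point is that each value gets a name instead of being crammed into a "flag"):

from enum import Enum

class OrderStatus(Enum):      # hypothetical example; not a flag
    PENDING = 0
    PAID = 1
    SHIPPED = 2
    DELIVERED = 3
    CANCELLED = 4

# Store OrderStatus.PAID.value (a small integer) in a tinyint column
# named something like order_status, and keep this mapping in code.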
It's all a matter of conventions and the ability to maintain your database/code effectively. Technically, you can have a column called my_flag defined as a varchar that holds values like "batman" and "barak obama".
By convention, flags are boolean. If you intend to have other values there, it's probably a better idea to call the column something else, like some_enum, or my_code.
Very occasionally, people talk about (for example) tri-state flags, but Wikipedia and most of the dictionary definitions that I read reserve "flag" for binary / two-state uses [1].
Of course, neither Wikipedia nor any dictionary has the authority to say some English usage is "incorrect". "Correct" usage is really "conventional" usage, i.e. what other people say / write.
I would argue that saying or writing "tri-state flag" is unconventional, but it is unambiguous and serves its purpose of communicating a concept adequately. (And the usage can be justified ...)
[1] Most, but not all; see http://www.oxforddictionaries.com/definition/english/flag.
Don't call anything "flag". Or "count" or "mark" or "int" or "code". Name it like everything else in code: after what it means.
workday {mon..fri}
tall {yes,no}
zip_code {00000..99999}
state {AL..WY}
Notice that (something like) yes/no plays the 'flag' role of indicating a permanent dichotomy (in lieu of boolean, which does that in the rest of the universe outside SQL), for when the specification/contract really is about whether something is so. If a design might add more values, you should use a different type.
Of course if you want to add more info to a name you can. Add distinctions that are meaningful if you can.
workday {monday..friday}
workday_abbrev {mon..fri}
is_tall {yes,no}
zip_plus_5 {00000-99..99999-99}
state_name {Alabama..Wyoming}
state_2 {AL..WY}

Facebook content collation and non-western encoded characters

If a user writes a string of text in arabic into a facebook comment and saves, what is the collation type of the data storage?
I don't believe they use a MySQL table for comments, but I've just experimented with the topic using a localhost MySQL table, where I stored some Arabic text in a column with a binary character set.
It transformed the text into some presumably escaped sequence of characters, but once saved, it stayed that way.
Regarding i18n: even when I have Facebook set to English, typing non-Western characters still saves and displays correctly.
Any insight into how they've achieved this?
First: I don't know for sure, but I don't believe MySQL comes into play anywhere for this.
The right thing to do is store it as UTF-8 in <some-system>, period. Which might as well be MySQL, I guess. I don't know the specifics, but I do believe MySQL (and PHP for that matter**) are not really up to par with UTF-8/Unicode support, so they might manifest some "glitches". For example, you need to execute "SET NAMES utf8" or some such as the first thing after opening the connection for UTF-8 to work at all (which might be why your test didn't work). Also, I remember something about MySQL not supporting 4-byte encoded UTF-8 characters, only up to 3 bytes. I don't know if that is still true, but I vaguely remember something about it. [edit] Should be fixed in 5.5+ (via the utf8mb4 charset).
I don't know about Arabic, but those might be the 4-byte kind. [edit] They should only need 2 or 3 bytes.
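As a hedged illustration (this is not how Facebook does it, just the utf8mb4 approach in practice), here is a minimal sketch using PyMySQL; the connection details and table name are made up:

import pymysql

# All names/credentials here are hypothetical; charset="utf8mb4" is the point,
# since MySQL's legacy "utf8" charset only stores up to 3-byte sequences.
conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="comments_db", charset="utf8mb4")
with conn.cursor() as cur:
    cur.execute("""CREATE TABLE IF NOT EXISTS comments (
                       id INT AUTO_INCREMENT PRIMARY KEY,
                       body TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
                   )""")
    cur.execute("INSERT INTO comments (body) VALUES (%s)", ("مرحبا بالعالم",))
conn.commit()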
And while we're on glitches: regarding PHP, I remember things like strlen() returning bytes instead of actual characters, etc. If I'm not mistaken, it has some mb_* (multibyte string) functions that should handle UTF-8 better. [edit] Turns out it does.
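The same byte-versus-character distinction, illustrated in Python rather than PHP (just to show what a byte-counting strlen() is actually counting):

>>> s = "العربية"
>>> len(s)                   # code points
7
>>> len(s.encode("utf-8"))   # bytes
14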
I don't see how i18n and setting Facebook to English (or Swahili, for that matter) would affect this at all. It's just the language used in the interface (and maybe/probably affects datetime formatting, etc.) and has nothing to do with user-generated content.
Oh, almost forgot the obligatory The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)-link
** Just mentioning it because it usually goes hand-in-hand with MySQL.

Identifying the components in an English sentence that do not make sense

I'm wondering: is there an algorithm or a library which helps me identify the components of an English sentence which have no meaning, e.g. a very serious grammar error? If so, could you explain how it works, because I would really like to implement it or use it for my own projects.
Here's a random example:
In the sentence: "I closed so etc page hello the door."
As humans, we can quickly identify that [so etc page hello] does not make any sense. Is it possible for a machine to point out that the string does not make any sense and also contains grammar errors?
If there's such a solution, how precise can that be? Is it possible, for example, given a clip of an English sentence, the algorithm returns a measure, indicating how meaningful, or correct that clip is? Thank you very much!
PS: I've looked at CMU's link grammar as well as the NLTK library. But I'm still not sure how to use, for example, the link grammar parser to do what I'd like to do: if the parser doesn't accept the sentence, I don't know how to tweak it to tell me which part is not right, and I'm not sure whether NLTK supports that.
Another thought I had towards solving the problem is to look at the frequencies of word combinations, since I'm currently interested in correcting very serious errors only. I could define a "serious error" as the case where the words in a clip of a sentence are rarely used together, i.e. the frequency of the combination is much lower than those of the other combinations in the sentence.
For instance, in the above example, the four words [so etc page hello] really seldom occur together. One intuition behind my idea comes from the fact that when I type such a combination into Google, no related results come up. So is there any library that provides such frequency information, like Google does? Such frequencies may give a good hint about the correctness of a word combination.
I think that what you are looking for is a language model. A language model assigns a probability to each sentence of k words appearing in your language. The simplest kind of language model is the n-gram model: given the first i words of your sentence, the probability of observing the (i+1)-th word depends only on the n-1 previous words.
For example, for a bigram model (n=2), the probability of the sentence w1 w2 ... wk is equal to
P(w1 ... wk) = P(w1) P(w2 | w1) ... P(wk | w(k-1)).
To compute the probabilities P(wi | w(i-1)), you just have to count the number of occurrences of the bigram w(i-1) wi and of the word w(i-1) in a large corpus.
Here is a good tutorial paper on the subject: A Bit of Progress in Language Modeling, by Joshua Goodman.
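A minimal sketch of such a bigram model in Python, with simple add-one smoothing so that unseen pairs like "page hello" get a small but nonzero probability; the tiny corpus below is only a placeholder:

import math
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams from an iterable of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        tokens = ["<s>"] + tokens
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Log P(sentence) under the bigram model, with add-one smoothing."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + tokens
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

# Toy usage: a clip whose (length-normalized) log-probability is much lower
# than the rest of the sentence is a candidate "serious error".
corpus = [["i", "closed", "the", "door"], ["i", "opened", "the", "door"]]
uni, bi = train_bigram_model(corpus)
print(log_prob(["i", "closed", "the", "door"], uni, bi))
print(log_prob(["so", "etc", "page", "hello"], uni, bi))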
Yes, such things exist.
You can read about it on Wikipedia.
You can also read about some of the precision issues here.
As far as determining which part is not right after determining the sentence has a grammar issue, that is largely impossible without knowing the author's intended meaning. Take, for example, "Over their, dead bodies" and "Over there dead bodies". Both are incorrect, and could be fixed either by adding/removing the comma or swapping their/there. However, these result in very different meanings (yes, the second one would not be a complete sentence, but it would be acceptable/understandable in context).
Spell checking works because there are a limited number of words against which you can check a word to determine if it is valid (spelled correctly). However, there are infinite sentences that can be constructed, with infinite meanings, so there is no way to correct a poorly written sentence without knowing what the meaning behind it is.
I think what you are looking for is a well-established library that can process natural language and extract the meanings.
Unfortunately, there's no such library. Natural language processing, as you can probably imagine, is not an easy task. It is still a very active research field. There are many algorithms and methods for understanding natural language, but to my knowledge, most of them only work well for specific applications or for words of specific types.
And those libraries, such as the CMU one, still seem quite rudimentary. They can't do what you want (like identifying errors in an English sentence) out of the box; you have to develop an algorithm to do that yourself, using the tools they provide (such as a sentence parser).
If you want to learn about it, check out ai-class.com. They have some sections that talk about processing language and words.

How to determine subject, object and other words?

I'm trying to implement an application that can determine the meaning of a sentence by dividing it into smaller pieces. So I need to know which words are the subject, the object, etc., so that my program knows how to handle the sentence.
This is an open research problem. You can get an overview on Wikipedia, http://en.wikipedia.org/wiki/Natural_language_processing. Consider phrases like "Time flies like an arrow, fruit flies like a banana" - unambiguously classifying words is not easy.
You should look at the Natural Language Toolkit, which is for exactly this sort of thing.
See this section of the manual: Categorizing and Tagging Words - here's an extract:
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
"Here we see that and is CC, a coordinating conjunction; now and completely are RB, or adverbs; for is IN, a preposition; something is NN, a noun; and different is JJ, an adjective."
I guess there is no "simple" way to do this. You have to build a linguistic analyzer (which is quite possible); however, a language has a lot of exceptional cases, and that is what makes implementing a linguistic analyzer so hard.
The specific problem you mention, the identification of the subject and objects of a clause, is accomplished by syntactic parsing. You can get a good idea of how parsing works by using this demo of parsing software developed by Stanford University.
However, syntactic parsing does not determine the meaning of a sentence, only its structure. Determining meaning (semantics) is a very hard problem in general and there is no technology that can really 'understand' a sentence in the same way that a human would. Although there is no general solution, you may be able to do something in a very restricted subject domain. For example, is the data you want to analyse about a narrow topic with a limited set of 'things' that people talk about?
StompChicken has given the right answer to this question, but I'd like to add that the concepts of subject and object are known as grammatical relations, and that Briscoe and Carroll's RASP is a parser that can go the extra step of deducing a list of relations from the parse.
Here's some example output from their demo page. It's an extract from the output for a sentence that begins "We describe a robust accurate domain-independent approach...":
(|ncsubj| |describe:2_VV0| |We:1_PPIS2| _)
(|dobj| |describe:2_VV0| |approach:7_NN1|)
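If you'd rather stay in Python, a dependency parser produces the same kind of grammatical relations. Here is a minimal sketch using spaCy, which is a different tool from RASP and is not mentioned in the answers; it assumes the en_core_web_sm model is installed:

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("We describe a robust accurate domain-independent approach.")
for token in doc:
    if token.dep_ in ("nsubj", "dobj"):   # subject / direct object relations
        print(token.dep_, token.text, "->", token.head.text)
# Prints something like:
# nsubj We -> describe
# dobj approach -> describe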

Display vs. Search vs. Sort strings in a database

Let's say I've got a database full of music artists. Consider the following artists:
The Beatles -
"The" is officially part of the name, but we don't want to sort it with the "T"s if we are alphabetizing. We can't easily store it as "Beatles, The" because then we can't search for it properly.
Beyoncé -
We need to allow the user to search for "Beyonce" (without the diacritic mark) and get the proper results back. No user is going to know how, or take the time, to type the special diacritical character on the last "e" when searching, yet we obviously want to display it correctly when we need to output it.
What is the best way around these problems? It seems wasteful to keep an "official name", a "search name", and a "sort name" in the database since a very large majority of entries will all be exactly the same, but I can't think of any other options.
The library science folks have a standard answer for this. The ALA Filing Rules cover all of these cases in a perfectly standard way.
You're talking about the grammatical sort order. This is a debatable topic. Some folks would take issue with your position.
Generally, you transform the title to a normalized form: "Beatles, The". Generally, you leave it that way. Then sort.
You can read about cataloging rules here: http://en.wikipedia.org/wiki/Library_catalog#Cataloging_rules
For "extended" characters, you have several choices. For some folks, é is a first-class letter and the diacritical is part of it. They aren't confused. For other folks, all of the diacritical characters map onto unadorned characters. This mapping is a feature of some Unicode processing tools.
You can read about Unicode diacritical stripping here: http://lexsrv3.nlm.nih.gov/SPECIALIST/Projects/lvg/current/docs/designDoc/UDF/unicode/NormOperations/stripDiacritics.html
http://www.siao2.com/2005/02/19/376617.aspx
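A minimal Python sketch of both ideas: deriving a sort key by moving a leading article, and deriving a search key by stripping diacritics via Unicode decomposition. The article list and function names are simplistic assumptions, not the ALA rules:

import unicodedata

ARTICLES = ("The ", "A ", "An ")        # simplistic; the ALA rules are richer

def sort_name(name):
    """'The Beatles' -> 'Beatles, The' for alphabetizing."""
    for article in ARTICLES:
        if name.startswith(article):
            return name[len(article):] + ", " + article.strip()
    return name

def search_name(name):
    """'Beyoncé' -> 'beyonce': decompose, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

print(sort_name("The Beatles"))    # Beatles, The
print(search_name("Beyoncé"))      # beyonce

One option, then, is to store only the display name and derive the sort and search forms on the fly (or as computed/indexed columns), so the majority of rows never need three hand-maintained values.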
