How to ignore case in an NDB/DB query - google-app-engine

This seems like a simple question, but I don't see anything in the class definition.
If I have the query
Video.query(Video.tags.IN(topics))
topics are coming in as lowercase unicode strings, but Video.tags are mostly capitalized. I can loop through topics and capitalize them before querying with them, but is there a way to ignore case altogether?

It's not possible to ignore case in a query.
Typically if you know you want to do a case insensitive search, you may store a "denormalized" duplicate of the data in lower case. Whenever you want to query, you would lowercase the text before querying.
To reduce write costs, you probably only want to index the lowercased version, and you probably wouldn't need to index the actual case-sensitive data.
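A minimal sketch of that denormalization idea, written in plain Python so it runs anywhere. The `tags_lower` field and both helper functions are hypothetical names; in NDB the lowercased copy would presumably be a second repeated, indexed property that you query with `IN` instead of `Video.tags`.

```python
# Plain-Python stand-in for a datastore: each "entity" keeps the original
# tags for display and a lowercased copy that is the only thing queried.

def make_video(title, tags):
    """Build an entity dict with a denormalized, lowercased tag list."""
    return {
        "title": title,
        "tags": list(tags),                       # original case, display only
        "tags_lower": [t.lower() for t in tags],  # the copy you would index
    }


def query_by_topics(videos, topics):
    """Return videos whose lowercased tags intersect the lowercased topics."""
    wanted = {t.lower() for t in topics}
    return [v for v in videos if wanted.intersection(v["tags_lower"])]


videos = [
    make_video("Intro to NDB", ["Python", "Datastore"]),
    make_video("Cooking pasta", ["Food", "Italian"]),
]

matches = query_by_topics(videos, [u"python", u"go"])
print([v["title"] for v in matches])
```

The lowercasing happens once at write time and once on the incoming topics, so the comparison itself stays an exact match.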

Related

Is there any lucene/solr spell checker which can handle space insertions/removal typos?

As far as I know, almost all of them do spell checking on a single query term at a time and cannot make changes across the whole input query to increase coverage in the corpora. I have one in LingPipe, but it is very expensive... http://alias-i.com/lingpipe/demos/tutorial/querySpellChecker/read-me.html
So my question is: what is the best Apache alternative to a LingPipe-like spell checker?
The spellcheckers in lucene treat whitespace like any other character. So in general you can feed them your query logs or whatever, and spellcheck/autocomplete full queries.
For lucene this should just work, for solr you need to ensure the QueryConverter doesn't split up your terms... see https://issues.apache.org/jira/browse/SOLR-3143
On the other hand, these suggesters currently work on the whole input, so if you want to suggest queries that have never been searched before, instead you want something that maybe only takes the last N words of context similar to http://googleblog.blogspot.com/2011/04/more-predictions-in-autocomplete.html.
I'm hoping we will provide that style of suggester soon also as an alternative, possibly under https://issues.apache.org/jira/browse/LUCENE-3842.
But keep in mind, that's not suitable for all purposes, so I think it's likely just going to be an option. For example, if you are doing e-commerce there is no sense in suggesting products you don't sell :)
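To illustrate the point about treating whitespace like any other character, here is a rough sketch that corrects a whole query against a log of known queries. It uses Python's standard `difflib` rather than Lucene, so it is only an analogy for the idea, not how the Lucene spellcheckers work internally; the `query_log` contents and the `correct_query` helper are made up for the example.

```python
import difflib

# Hypothetical query log: whole queries, spaces included, are the "dictionary".
query_log = [
    "database index",
    "data structures",
    "case insensitive search",
]


def correct_query(query, known_queries, cutoff=0.8):
    """Return the closest known whole query, or the input if none is close."""
    matches = difflib.get_close_matches(query, known_queries, n=1, cutoff=cutoff)
    return matches[0] if matches else query


# A spurious space and a missing space are both just one-character edits.
print(correct_query("data base index", query_log))
print(correct_query("caseinsensitive search", query_log))
```

Because the comparison runs over the full string, a query that was never logged is returned unchanged, which is the limitation mentioned above.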

solr schema for article->paragraph structure

I want to index some articles and show the paragraph number in the search results. So I guess the Solr schema should look like this:
article_id, paragraph_number, paragraph_content
Therefore, I need to parse each article first, extract its paragraphs, and index them one by one.
I'm worried about the performance since one article can contain 100 paragraphs.
Any suggestion?
It is better to do the heavy lifting at index time rather than search time. So parsing the paragraphs out of the document when you index is probably the right way to go.
How many articles do you have? It really shouldn't be a problem to strip paragraphs (we do much more complex pre-processing than that).
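As a rough sketch of that index-time preprocessing, the function below turns one article into one flat document per paragraph, using the three fields from the question. Splitting on blank lines and the composite `id` value are assumptions; sending the documents to Solr is left out.

```python
def paragraphs_to_docs(article_id, text):
    """Split an article on blank lines and build one flat document per paragraph."""
    docs = []
    number = 0
    for block in text.split("\n\n"):
        content = " ".join(block.split())  # collapse internal whitespace
        if not content:
            continue  # skip empty blocks caused by extra blank lines
        number += 1
        docs.append({
            "id": "%s-%d" % (article_id, number),  # hypothetical unique key
            "article_id": article_id,
            "paragraph_number": number,
            "paragraph_content": content,
        })
    return docs


article = "First paragraph.\n\nSecond paragraph,\nwrapped.\n\n\nThird."
docs = paragraphs_to_docs("a42", article)
for doc in docs:
    print(doc["paragraph_number"], doc["paragraph_content"])
```

Even at 100 paragraphs per article this is a single linear pass over the text, so the parsing itself is unlikely to be the bottleneck.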
If you only need to match individual paragraphs against the fulltext query (as opposed to filters etc.), you could also do this using highlighting -- split up the paragraphs, prefix each one with its paragraph number, and then index the paragraphs as multiple values in a single field in a single document. At search time, you'd do a highlight on the field with a full match (e.g. fragment size of -1) and no decoration of the highlight; so what you'd get back is the paragraph that matched the fulltext query, prefixed by its paragraph number (which you'd probably want to then pull back out).
Not sure if this fits your use case exactly but might be an interesting approach to try -- I do something similar to identify photos whose caption matches the fulltext query to display next to article search results.
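The prefixing step and the "pull it back out" step could look roughly like this. The `|` delimiter and both function names are my own choices for illustration, and the highlighting request itself is not shown.

```python
import re


def to_multivalued_field(paragraphs):
    """Prefix each paragraph with its 1-based number, e.g. '3|text'."""
    return ["%d|%s" % (i, p) for i, p in enumerate(paragraphs, start=1)]


# Leading digits, the delimiter, then everything else (newlines included).
_PREFIX = re.compile(r"^(\d+)\|(.*)$", re.DOTALL)


def split_prefixed(value):
    """Recover (paragraph_number, text) from a prefixed value."""
    m = _PREFIX.match(value)
    if m is None:
        raise ValueError("missing paragraph prefix: %r" % value)
    return int(m.group(1)), m.group(2)


values = to_multivalued_field(["Opening remarks.", "The main argument."])
print(values)
print(split_prefixed(values[1]))
```

Only the first delimiter is treated as the separator, so paragraph text that happens to contain `|` survives the round trip.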

Why would you want a case sensitive database?

What are some reasons for choosing a case sensitive collation over a case insensitive one? I can see perhaps a modest performance gain for the DB engine in doing string comparisons. Is that it? If your data is set to all lower or uppercase then case sensitive could be reasonable but it's a disaster if you store mixed case data and then try to query it. You have to then say apply a lower() function on the column so that it'll match the corresponding lower case string literal. This prevents index usage in every dbms that I've used. So wondering why anyone would use such an option.
There are many examples of data with keys that are naturally case sensitive:
Files in a case sensitive filesystem like Unix.
Base-64 encoded names (which I believe is what YouTube is using, as in Artelius's answer).
Symbols in most programming languages.
Storing case sensitive data in a case-insensitive system runs the risk of data inconsistency or even the loss of important information. Storing case insensitive data in a case sensitive system is, at worst, slightly inefficient. As you point out, if you only know the case-insensitive name of an object you're looking for, you need to adjust your query:
SELECT * FROM t WHERE LOWER(name) = 'something';
I note that in PostgreSQL (and presumably in other systems), it is a simple matter to create an index on the expression LOWER(name) which will be used in such queries.
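Here is a small runnable sketch of an index on the expression LOWER(name). It uses SQLite through Python's `sqlite3` module as a stand-in, purely because it needs no server; the index name `t_name_lower_idx` and the sample rows are made up. The same `CREATE INDEX ... (LOWER(name))` form is accepted by PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO t (name) VALUES (?)",
    [("Something",), ("SOMETHING",), ("other",)],
)

# Index the lowercased expression rather than the raw column.
conn.execute("CREATE INDEX t_name_lower_idx ON t (LOWER(name))")

# The query uses the same expression, so the planner is able to use the index.
rows = conn.execute(
    "SELECT name FROM t WHERE LOWER(name) = ? ORDER BY id",
    ("something",),
).fetchall()
names = [r[0] for r in rows]
print(names)
```

The stored values keep their original case; only the index holds the lowercased form.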
Depends on the data you want to store. Most UNIX filesystems are databases with case-sensitive keys. YouTube videos seem to be organised with case-sensitive keys.
Most of the time you want case-insensitive searches, but clearly there are certain exceptions.
Use a case insensitive index for your field. In most cases you don't want to manipulate the data in order to find it.
One reason is for Content Management. Typically you will need to identify changes in content so that those changes can be reviewed, recorded and published. Case matters for human readable content. "Dave Doe" is correct. "dave doe" is plain wrong.
Case-sensitivity also matters for software developers. If you don't know the desired case-sensitivity for all your customers' systems then you may want to test case-sensitivity as part of testing anyway.
I have worked on an application that involves a database with purely natural keys (i.e. 'codes') that should be case sensitive but are not necessarily so.
A lot of data would come out of the database in stored procs (with the database doing the joins), where case sensitivity is not an issue. However some data needed to come from the database in separate queries and then be 'stitched together' in loops - primarily due to a complex data type that SQL couldn't easily work with - and this is where the problem arose. When I'm iterating two result sets and trying to join on the 'code', the values Productcode and ProductCode don't naturally match.
Rather than fixing the data, I had to change my code (C#) to do case insensitive string matching. Not throughout the entire solution, mind, just when looking through these 'codes' for matching.
If I had a case sensitive database I would've had tidier code.
Now, rather than 'why case sensitive', I'd really like to know why you'd want a case insensitive database. Is it due to laziness? I don't see any good reason that databases are case insensitive.

Should I write table and column names ALWAYS lower case?

I wonder if it's a problem, if a table or column name contains upper case letters. Something lets me believe databases have less trouble when everything is kept lower case. Is that true? Which databases don't like any upper case symbol in table and column names?
I need to know, because my framework auto-generates the relational model from an ER-model.
(this question is not about whether it's good or bad style, but only about if it's a technical problem for any database)
As far as I know there is no problem using either uppercase or lowercase. One reason for the lowercase convention is that queries are more readable with lowercase table and column names and uppercase SQL keywords:
SELECT column_a, column_b FROM table_name WHERE column_a = 'test'
It is not a technical problem for the database to have uppercase letters in your table or column names, for any DB engine that I'm aware of. Keep in mind many DB implementations use case sensitive names, so always refer to tables and columns using the same case with which they were created (I am speaking very generally since you didn't specify a particular implementation).
For MySQL, here is some interesting information about how it handles identifier case. There are some options you can set to determine how they are stored internally. http://dev.mysql.com/doc/refman/5.0/en/identifier-case-sensitivity.html
The SQL-92 standard specifies that identifiers and keywords are case-insensitive (per A Guide to the SQL Standard 4th edition, Date / Darwen)
That's not to say that a particular DBMS isn't either (1) broken, or (2) configurable (and broken)
From a programming style perspective, I suggest using different cases for keywords and identifiers. Personally, I like uppercase identifiers and lowercase keywords, because it highlights the data that you're manipulating.
SQL standard requires names stored in uppercase
The SQL standard requires identifiers be stored in all-uppercase. See section 5.2.13 of the SQL-92 as quoted from a draft copy in this Answer on another Question. The standard allows you to use undelimited identifiers in lowercase or mixed case, as the SQL processor is required to convert them to the uppercase version as needed.
This requirement presumably dates back to the early days of SQL when mainframe systems were limited to uppercase English characters only.
Non-issue
Many databases ignore this requirement of the standard.
For example, Postgres does just the opposite, converting all unquoted (“undelimited”) identifiers to lowercase — this despite Postgres otherwise hewing closer to the standard than any other system I know of.
Some databases may store the identifier in the case you specified.
Generally this is a non-issue. Virtually all databases do a case-insensitive lookup from the case used by an identifier to the case stored by the database.
There are occasional oddball cases where you may need to specify an identifier in its stored case or you may need to specify all-uppercase. This may happen with certain utilities where you must pass an identifier as a string outside the usual SQL processor context. Rare, but tuck this away in the back of your head in case you encounter some mysterious "cannot find table" kind of error message someday when using some unusual tool/utility. Has happened to me once.
Snake case
Common practice nowadays seems to be to use all lowercase with underscore separating words. This style is known as Snake case.
The use of underscore rather than Camel case helps if your identifiers are ever presented as all uppercase (or all lowercase) and thereby lose readability without the word separation.
Bonus Tip: The SQL standard (SQL-92 section 5.2.11) explicitly promises to never use a trailing underscore in a keyword. So append a trailing underscore to all your identifiers to eliminate all worry of accidentally colliding.
As far as I know, for a common L.A.M.P. setup it won't really matter - but be aware that MySQL hosted on Linux treats table names as case sensitive by default!
To keep my code tidy I usually stick to lower case names for tables and columns, uppercase MySQL code and mixed upper-lower-case variables - like this:
SELECT * FROM my_table WHERE id = '$myNewID'
I use Pascal case for field names and lower case for table names (usually), as follows:
students
--------
ID
FirstName
LastName
Email
HomeAddress
courses
-------
ID
Name
Code
[etc]
Why is this cool? because it's readable, and because I can parse it as:
echo preg_replace('/([a-z])([A-Z])/','$1 $2',$field); //insert a space
NOW, here's the fun part for tables:
StudentsCourses
--------------
Students_ID
Courses_ID
AcademicYear
Semester
Notice I capitalized S and C? That way they point back to the primary table(s). You could even write a routine to logically parse the db structure this way and build queries automatically. So I use caps in table names when they are JOIN tables, as in this case.
Similarly, think of the _ as a -> in this table as: Students->ID and Courses->ID
Not student_id - instead Students_ID - the cognate of the field matches the exact name of the table.
Using these simple conventions produces a readable protocol which handles about 70% of your typical relational structure.
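For what it's worth, the two parsing ideas in this answer can be sketched in a few lines of Python. `humanize` mirrors the `preg_replace` call above; `parse_reference` is a hypothetical helper that reads `Students_ID` as "table Students, field ID" and only assumes the naming convention described here.

```python
import re


def humanize(field):
    """Insert a space at each lower-to-upper boundary: 'FirstName' -> 'First Name'."""
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", field)


def parse_reference(column):
    """Split 'Students_ID' into ('Students', 'ID'); return None for plain columns."""
    table, sep, field = column.partition("_")
    if not sep or not table[:1].isupper():
        return None  # no underscore, or prefix is not a capitalized table name
    return table, field


for column in ["Students_ID", "Courses_ID", "AcademicYear", "Semester"]:
    print(column, "->", parse_reference(column) or humanize(column))
```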
If you're using postgresql and PHP, for instance, you'd have to write your query like this:
$sql = "SELECT somecolumn FROM \"MyMixedCaseTable\" where somerow= '$somevar'";
"Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case. For example, the identifiers FOO, foo, and "foo" are considered the same by PostgreSQL, but "Foo" and "FOO" are different from these three and each other. (The folding of unquoted names to lower case in PostgreSQL is incompatible with the SQL standard, which says that unquoted names should be folded to upper case. Thus, foo should be equivalent to "FOO" not "foo" according to the standard. If you want to write portable applications you are advised to always quote a particular name or never quote it.)"
http://www.postgresql.org/docs/8.4/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS
So, sometimes, it depends on what you are doing...
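The folding rule in that quote is easy to model. The function below is only a toy illustration of the rule, not PostgreSQL's actual lexer; `fold_identifier` and its `unquoted_case` parameter are invented names, and escaped quotes inside quoted identifiers are not handled.

```python
def fold_identifier(identifier, unquoted_case="lower"):
    """Return the name a SQL processor would look up for an identifier.

    unquoted_case="lower" mimics PostgreSQL; "upper" mimics the SQL standard.
    """
    if len(identifier) >= 2 and identifier[0] == '"' and identifier[-1] == '"':
        return identifier[1:-1]  # quoted: case is preserved exactly
    if unquoted_case == "upper":
        return identifier.upper()
    return identifier.lower()


# PostgreSQL-style folding: FOO, foo and "foo" all resolve to the same name.
print([fold_identifier(name) for name in ['FOO', 'foo', '"foo"', '"Foo"', '"FOO"']])

# Standard-style folding: unquoted foo matches "FOO" instead.
print(fold_identifier('foo', 'upper') == fold_identifier('"FOO"', 'upper'))
```

This is also why a table created as "MyMixedCaseTable" has to be quoted in every later query: the unquoted form folds to a different name.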
Whatever you use, keep in mind that MySQL on Linux is case sensitive, while on Windows it is case insensitive.
The column names which are mixed case or uppercase have to be double quoted in PostgreSQL. If you don't want to worry about it in the future, name it in the lower case.
MySQL - the column names are entirely case insensitive, and that can lead to problems. Say someone has written "mynAme" instead of "myName". The system would work fine, but when a developer later searches the source code for it, they might overlook it, and you all get in trouble.
Every modern database can handle both upper and lower case text.
Think this is worth emphasizing: If a binary or case-sensitive collation is in effect, then (at least in Sql Server and other databases with rich collation features) identifiers and variable names WILL be case sensitive. You can even create tables whose names differ only in case. (—I am not sure the info above about the sql-92 standard is correct—if so, this part of the standard is not widely followed.)

What's the best way to store a title in a database to allow sorting without the leading "The", "A"

I run (and am presently completely overhauling) a website that deals with theater (njtheater.com if you're interested).
When I query a list of plays from the database, I'd like "The Merchant of Venice" to sort under the "M"s. Of course, when I display the name of the play, I need the "The" in front.
What's the best way of designing the database to handle this?
(I'm using MS-SQL 2000)
You are on the right track with two columns, but I would suggest storing the entire displayable title in one column, rather than concatenating columns. The other column is used purely for sorting. This gives you complete flexibility over sorting and display, rather than being stuck with a simple prefix.
This is a fairly common approach when searching (which is related to sorting). One column (with an index) is case-folded, de-punctuated, etc. In your case, you'd also apply the grammatical convention of removing leading articles to the values in this field. This column is then used as a comparison key for searching or sorting. The other column is not indexed, and preserves the original key for display.
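A minimal sketch of building such a comparison key, assuming an English-only article list and that plain punctuation stripping is good enough. `sort_key` and `ARTICLES` are illustrative names; the result is what you would write into the indexed sort column when a row is inserted or updated.

```python
import string

ARTICLES = ("the", "a", "an")  # assumed English-only list


def sort_key(title):
    """Case-fold, drop punctuation, and strip one leading article."""
    cleaned = title.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    if len(words) > 1 and words[0] in ARTICLES:
        words = words[1:]  # keep single-word titles such as "A" intact
    return " ".join(words)


titles = [
    "The Merchant of Venice",
    "Hamlet",
    "A Midsummer Night's Dream",
    "An Ideal Husband",
]
print(sorted(titles, key=sort_key))
```

The display title is never modified, so any later change to the folding rules only requires recomputing the key column.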
Store the title in two fields: TITLE-PREFIX and TITLE-TEXT (or some such). Then sort on the second, but display the concatenation of the two, with a space between.
My own solution to the problem was to create three columns in the database.
article varchar(4)
sorttitle varchar(255)
title computed (article + sorttitle)
"article" will only be "The ", "A ", "An " (note the trailing space on each) or an empty string (not null).
"sorttitle" will be the title with the leading article removed.
This way, I can sort on SORTTITLE and display TITLE. There's little actual processing going on the computed field (so it's fast), and there's only a little work to be done when inserting.
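The "little work to be done when inserting" might look something like this on the application side. `split_title` is a hypothetical helper; it returns the values for the article and sorttitle columns, and the computed title column is then just their concatenation.

```python
ARTICLE_PREFIXES = ("The ", "An ", "A ")  # each includes its trailing space


def split_title(title):
    """Return (article, sorttitle) so that article + sorttitle == title."""
    for prefix in ARTICLE_PREFIXES:
        if title.startswith(prefix) and len(title) > len(prefix):
            return prefix, title[len(prefix):]
    return "", title  # empty string, not None, when there is no article


for title in ["The Merchant of Venice", "An Ideal Husband", "Hamlet"]:
    article, sorttitle = split_title(title)
    print(repr(article), repr(sorttitle))
```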
I agree with doofledorfer, but I would recommend storing spaces entered as part of the prefix instead of assuming it's a single space. It gives your users more flexibility. You may also be able to do some concatenation in your query itself, so you don't have to merge the fields as part of your business logic.
I don't know if this can be done in SQL Server. If you can create function based indexes you could create one that does a regex on the field or that uses your own function. This would take less space than an additional field, would be kept up to date by the database itself, and allows the complete title to be stored together.
