I've finished my first semester in a college-level SQL course where we used "SQL Queries for Mere Mortals", 3rd edition.
Long term I want to work in data governance or as a data scientist, so I need to dig deeper, and I found the Stanford SQL course. Taking the first mini quiz today, I got the answers right, but on these two questions I don't understand WHY the answers are right.
My 'SQL for Mere Mortals' book doesn't even cover hash or tree-based indexes so I've been searching online for them.
I mostly guessed based on what the instructor said, but it feels more like luck than "I solidly understand why". So I've ordered "Introduction to Algorithms", 3rd edition, by Thomas Cormen; it arrived last week, but it will take me a while to read through all 1,229 pages.
Found that book in this other Stack Overflow link => https://stackoverflow.com/questions/66515417/why-is-hash-function-fast
Stanford Course => https://www.edx.org/course/databases-5-sql
I thought a hash index on College.enrollment would not provide a speedup because the predicate compares against a range (less than a number) rather than an exact value. I'm also guessing, per this link (Better to use "less than equal" or "in" in sql query), that the query would be faster if we used "<=" rather than "<"?
This one was just a process of elimination, as the answer mentions the first item after the WHERE clause, but then it was confusing because it also mentions the last part, Apply.cName = College.cName.
My questions:
I'm guessing that, similar to how algebra has technical terms like numerator, denominator, and quotient to describe specific parts of an equation, SQL has technical terms for parts of a query. How would you use those terms to describe why these answers are correct?
On the second question, why are the first part of the second line and the last part of the same line referenced as the answers? Why didn't they pick the first part of each line, or the last part of each?
For context, most of my SQL queries are written for PostgreSQL, within PyCharm using Python, but I do a lot of practice in the pgAdmin 4 and MySQL Workbench desktop tools.
I welcome any recommendations you have for paper books or PDFs with step-by-step tutorials, as many, many websites have holes or reference confusing technical details.
Thanks
1. A hash index is only useful for equality matches, whereas a tree index can also be used for inequalities (<, >=, etc.).
With this in mind, College.enrollment < 5000 cannot use a hash index, as it is an inequality; all the other options are exact equality matches.
This is why most RDBMSs only let you create tree-based indexes.
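To make that concrete, here's a minimal Python sketch (my own illustration, not from the course): a dict behaves like a hash index and can only answer equality probes, while a sorted list with binary search stands in for a tree index and can answer range predicates like enrollment < 5000.

import bisect

# Illustrative data: (enrollment, college name)
colleges = [(15000, 'Stanford'), (36000, 'Berkeley'), (1500, 'MIT')]

# "Hash index": O(1) equality probes, but no notion of order.
hash_index = {enr: name for enr, name in colleges}
print(hash_index.get(15000))              # equality match: works

# "Tree index": keys kept sorted, so a range scan is cheap.
tree_index = sorted(colleges)             # stands in for a B-tree
cut = bisect.bisect_left(tree_index, (5000,))
print(tree_index[:cut])                   # enrollment < 5000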
2. This one is pretty much up in the air.
"the first item after the WHERE clause" is not relevant. Most RDBMSs will reorder the joins and filters as they see fit in order to match indexes and table statistics.
I note that the query as given is poorly written. It should use proper JOIN syntax, which is much clearer, and has been in use for 30 years already.
SELECT * -- you should really specify exact columns
FROM Student AS s -- use aliases
JOIN [Apply] AS a ON a.sID = s.sID -- Apply is a reserved keyword in many RDBMS
JOIN College AS c ON c.cName = a.cName
WHERE s.GPA > 1.5 AND c.cName < 'Cornell';
Now it's hard to say what a compiler would do here. A lot depends on the cardinalities (size of tables) in absolute terms and relative to each other, as well as the data skew in s.GPA and c.cName.
It also depends on whether secondary key (or indeed INCLUDE) columns are added, which is clearly not being considered here.
Given the options for indexes you have above, and no other indexes (not realistic obviously), we could guesstimate:
Student.sID, College.cName
This may result in an efficient backwards scan on College starting from 'Cornell', but Apply would need to be joined with a hash or a naive nested loop (scanning the index each time).
The index on Student would mean an efficient nested loop with an index seek.
Student.sID, Student.GPA
Is this one index or two? If it's two separate indexes, the second will be used, and the first is obviously going to be useless. Apply and College will still need heavy joins.
Apply.cName, College.cName
This would probably get you a merge-join on those two columns, but Student would need a big join.
Apply.sID, Student.GPA
Student could be efficiently scanned from 1.5, and Apply could be seeked, but College requires a big join.
Of these options, the first or the last is probably better, but it's very hard to say without further info.
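For intuition, here is a rough Python sketch (my own illustration with made-up rows; real engines work with pages and statistics) of the index nested loop mentioned above: scan Student, and probe a hypothetical index on Apply.sID once per qualifying row.

# Made-up sample rows for illustration only.
student_rows = [{'sID': 1, 'GPA': 3.4}, {'sID': 2, 'GPA': 1.2}]
apply_rows = [{'sID': 1, 'cName': 'Cornell'}, {'sID': 1, 'cName': 'MIT'}]

# Hypothetical index on Apply.sID: sID -> matching Apply rows.
apply_index = {}
for app in apply_rows:
    apply_index.setdefault(app['sID'], []).append(app)

# Index nested loop: for each Student row passing the filter,
# seek into the index instead of scanning all of Apply.
for s in student_rows:
    if s['GPA'] > 1.5:
        for app in apply_index.get(s['sID'], []):
            print(s, app)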
In a real system, I would have indexes on all tables, and use INCLUDE columns wisely in order to avoid key-lookups. You would want to try to get a better feel for which tables are the ones that need to be filtered early etc.
First question
A hash-index is not linearly-searchable (see Slide 7), that is, you cannot perform range-comparisons with a hash-index. This is because (in general terms) hash functions are one-way: given the output of a hash function you cannot determine the input, and the output will be in apparently random order (having a random order is good for ensuring an even load over the set of hashtable bins).
Now, for a contrived and oversimplified example:
Supposing you have these rows:
PK | Enrollment
----------------
1 | 1
2 | 10
3 | 100
4 | 1000
5 | 10000
A perfect hash index of this table would look something like this:
Assuming that the hash of 1 is 0xF822AA896F34253E and the hash of 10 is 0xB383A8BBDAA41F98, and so on...
EnrollmentHash | PhysicalRowPointer
---------------------------------------
0xF822AA896F34253E | 1
0xB383A8BBDAA41F98 | 2
0xA60DCD4E78869C9C | 3
0x49B0AF769E6B1EB3 | 4
0x724FD1728666B90B | 5
So given this hashtable index, looking at the hashes you cannot determine which hash represents larger enrollment values vs. smaller values. But a hashtable index does give you O(1) lookup for single specific values, which is why it works best for discrete, non-continuous data values, especially columns used in JOIN criteria.
A tree index, by contrast, does preserve relative ordering information about values, but with O(log n) lookup time.
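A quick Python sketch of the same point (using hashlib for a stable hash; the digests are whatever SHA-256 produces, not the values in the table above): sorted inputs come out of the hash function in apparently random order, which is exactly why a hash index cannot answer range queries.

import hashlib

for enrollment in [1, 10, 100, 1000, 10000]:
    digest = hashlib.sha256(str(enrollment).encode()).hexdigest()[:16]
    print(enrollment, digest)

# The digests do not sort in the same order as the inputs, so
# "EnrollmentHash < X" tells you nothing about "enrollment < X".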
Second question
First, I need to rewrite the query to use modern JOIN syntax. The old style (using commas) has been obsolete since SQL-92 in 1992; that's almost 30 years ago.
SELECT
*
FROM
Apply
INNER JOIN Student ON Student.sID = Apply.sID
INNER JOIN College ON College.cName = Apply.cName
WHERE
Student.GPA > 1.5
AND
College.cName < 'Cornell'
Now, generally speaking the best way to answer this kind of question would be to know what the STATISTICS (cardinality, value distribution, etc) of the tables are. But without that I can still make some guesses.
I assume that College is the smallest table (~500 rows?), Student will have maybe 1-2m rows, and assuming every Student makes 4-5 applications then the Apply table will have ~5m rows.
...armed with that inference, we can deduce:
Student.sID = Apply.sID is an ID match - so a hash-index would be better in most cases (excepting if the PK clustering matters, but I won't digress).
Student.GPA > 1.5 - this is a range search so having a tree-based index here helps.
College.cName < 'Cornell' - again, this is a range comparison so a tree-based index here helps too.
So the best indexes would be Student.GPA and College.cName, but that isn't an option - so let's see what the benefits of each option are...
(As I was writing this, I saw that #charlieface posted their answer which already covers this, so I'll just link to theirs to save my time: https://stackoverflow.com/a/67829326/159145 )
Related
Firstly, sorry for the vague title and if this question has been asked before, but I was not entirely sure how to phrase it.
I am looking for general design principles for finding pairs of 'similar' objects from two different data sources.
Let's for simplicity say that we have two databases, A and B, both containing large volumes of objects, each with a time-stamp and geo-location, along with some other data that we don't care about here.
Now I want to perform a search along these lines:
Within a certain time-frame and location dictated by the search term, find pairs of objects from A and B respectively, ordered by some similarity score, for example a scalar 'time/space distance' function, distance(a, b), that calculates the distance in time and space between the objects.
I am expecting to get a (potentially ginormous) set of results where the first result is a pair of data points which has the minimum 'distance'.
I realize that the full search space is cardinality(A) x cardinality(B).
Are there any general guidelines on how to do this in a reasonably efficient way? I assume that I would need to replicate the two databases into a common repository like Hadoop? But then what? I am not sure how to perform such a query in Hadoop either.
What is this type of query called?
To me, this is some kind of "fuzzy inner join" that I struggle to wrap my head around how to construct, let alone construct efficiently at scale.
SQL joins don't have to be based on equality. You can use ">", "<", "BETWEEN".
You can even do something like this:
select a.val aval, b.val bval, a.val - b.val diff
from A join B on abs(a.val - b.val) < 100
What you need is a way to divide your objects into buckets in advance, without comparing them (or at least making a linear, rather than square, number of comparisons). That way, at query time, you will only be comparing a small number of items.
There is no "one-size-fits-all" way to bucket your items. In your case the bucketing can be based on time, geolocation, or both. Time-based bucketing is very natural, and also scales elastically (you can increase or decrease the bucket size). Geo-clustering buckets can be based on distance from a particular point in space (if the space is abstract), or on some finite division of the space (for example, if you divide the entire Earth's surface into tiles, which can also scale nicely if done right).
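As a toy sketch of time-based bucketing (my own illustration; the bucket size and object shapes are invented), each object lands in a bucket keyed by its truncated timestamp, so a query only compares objects from the buckets overlapping its time window:

from collections import defaultdict

BUCKET_SECONDS = 3600  # one-hour buckets; tune to your data

def bucket_key(ts):
    return int(ts // BUCKET_SECONDS)

# Objects are (timestamp, payload) tuples for illustration.
buckets_a, buckets_b = defaultdict(list), defaultdict(list)
for obj in [(10.0, 'a1'), (3700.0, 'a2')]:
    buckets_a[bucket_key(obj[0])].append(obj)
for obj in [(20.0, 'b1'), (7300.0, 'b2')]:
    buckets_b[bucket_key(obj[0])].append(obj)

def candidate_pairs(t_lo, t_hi):
    """Compare only objects whose buckets overlap [t_lo, t_hi];
    a real implementation would also peek at neighboring buckets
    so pairs straddling a bucket boundary aren't missed."""
    for k in range(bucket_key(t_lo), bucket_key(t_hi) + 1):
        for a in buckets_a.get(k, []):
            for b in buckets_b.get(k, []):
                yield a, b

print(list(candidate_pairs(0, 3600)))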
A good question to ask is "if my data starts growing rapidly, can I handle it by just adding servers?" If not, you might need to rethink the design.
I need a way of storing sets of arbitrary size for fast query later on.
I'll be needing to query the resulting data structure for subsets or sets that are already stored.
===
Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting for people to develop the algorithm themselves.
I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want, since from what I understand it 'clusters' the tuples into simpler, discrete/approximate forms and loses the original tuples.
Now, an even simpler example:
[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, niu, omega] [omega, beta]
Query:
[alpha, delta]
Result:
[alpha, beta, gamma, delta] [alpha, epsilon, delta]
So the set elements are just that, unique, unrelated elements. Forget about types and values. The elements can be tested among them for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) more than just creating one now, on the spot.
==
Original examples:
For example, say the database contains these sets
[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1]
If I use [A1, C1] as a query, these two sets should be returned as a result:
[A1, B1, C1, D1], [A1, D3, C1]
Example 2:
Database:
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]
Query:
[Distance to Berlin: 240km]
Result
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets having common states (such as Distance to Berlin: 240km) in such a way that the query is as efficient as possible.
What algorithms are there for such needs?
I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.
Clarifications:
If it helps answer the question, I'm intending to use them for storing states:
Simple example:
[Has milk, Doesn't have eggs, Has Sugar]
I'm thinking such a requirement might call for graphs or multidimensional arrays, but I'm not sure.
Conclusion
I've implemented the two algorithms proposed in the answers, that is, Set-Trie and Inverted Index, and did some rudimentary profiling on them. Illustrated below is the duration of a query for a given set for each algorithm. Both algorithms worked on the same randomly generated data set consisting of sets of integers. The algorithms seem equivalent (or almost equivalent) performance-wise.
I'm confident that I can now contribute to the solution. One possible, quite efficient way is a trie, invented by Franklin Mark Liang.
Such a special tree is used, for example, in spell checking or autocompletion, and it actually comes close to your desired behavior, especially since it allows searching for subsets quite conveniently.
The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.
What is a Set-Trie? A tree where each node except the root contains a single attribute value (number) and a marker (bool) indicating whether there is a data entry at this node. Each subtree contains only attributes whose values are larger than the attribute value of the parent node. The root of the Set-Trie is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all nodes containing a marker that you reach when you go down the tree and up the search key simultaneously (see below).
But first a drawing by me:
The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.
Please note that the right subtree from root does not contain attribute 1 at all. That's the clue.
Searching including subsets: Say you want to search for attributes 4 and 1. First you order them; the search key is {1,4}. Now starting from the root you go simultaneously up the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller than or equal to 1. There is only one, namely 1. Inside, you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than or equal to 4, which here is all of them. You continue until there is nothing left to do and collect all circles (data entries) that have the attribute value exactly 4 (the last attribute in the key). These are {1,2,4} and {1,4} but not {1,3} (no 4) or {2,4} (no 1).
Insertion: Very easy. Go down the tree and store a data entry at the appropriate position. For example, data entry {2,5} would be stored as a child of {2}.
Add attributes dynamically: Naturally supported; you could immediately insert {1,4,6}. It would come below {1,4}, of course.
I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They are probably very efficient.
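A minimal Python sketch of a Set-Trie with superset search, based on my reading of the description above (the names and structure are my own, not from the paper):

class Node:
    def __init__(self):
        self.children = {}      # attribute value -> Node
        self.is_entry = False   # marker: a stored set ends here

class SetTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, s):
        node = self.root
        for v in sorted(s):     # attributes stored in increasing order
            node = node.children.setdefault(v, Node())
        node.is_entry = True

    def supersets(self, query):
        """All stored sets containing `query` as a subset."""
        key, results = sorted(set(query)), []
        def walk(node, i, path):
            if i == len(key):   # all query attributes matched:
                if node.is_entry:
                    results.append(list(path))
                for v, child in node.children.items():  # collect all below
                    path.append(v); walk(child, i, path); path.pop()
                return
            for v in sorted(node.children):
                if v > key[i]:
                    break       # subtree attributes only get larger
                path.append(v)
                walk(node.children[v], i + (v == key[i]), path)
                path.pop()
        walk(self.root, 0, [])
        return results

trie = SetTrie()
for s in [{1, 2, 4}, {1, 3}, {1, 4}, {2, 3, 5}, {2, 4}]:
    trie.insert(s)
print(trie.supersets({1, 4}))   # [[1, 2, 4], [1, 4]]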
I don't know if you still want to store the data in a database. I think this would complicate things further and I don't know what is the best to do then.
How about having an inverse index built of hashes?
Suppose you have your values int A, char B, bool C of different types. With std::hash (or any other hash function) you can create numeric hash values size_t Ah, Bh, Ch.
Then you define a map that maps an index to a vector of pointers to the tuples
std::map<size_t,std::vector<TupleStruct*> > mymap;
or, if you can use global indices, just
std::map<size_t,std::vector<size_t> > mymap;
For retrieval by queries X and Y, you need to
get hash value of the queries Xh and Yh
get the corresponding "sets" out of mymap
intersect the sets mymap[Xh] and mymap[Yh]
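The same idea as a small Python sketch (Python's built-in hash standing in for std::hash, and global indices instead of pointers; the sample tuples are invented):

tuples = [(1, 'b', 3.5), (1, 'c', 3.5), (2, 'b', 0.0)]

mymap = {}                       # hash value -> set of tuple indices
for idx, tup in enumerate(tuples):
    for value in tup:
        mymap.setdefault(hash(value), set()).add(idx)

# Query: tuples containing both 1 and 'b' -> intersect the two sets.
result = mymap.get(hash(1), set()) & mymap.get(hash('b'), set())
print([tuples[i] for i in sorted(result)])   # [(1, 'b', 3.5)]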
If I understand your needs correctly, you need a multi-state storing data structure, with retrievals on combinations of these states.
If the states are binary (as in your examples: has milk/doesn't have milk, has sugar/doesn't have sugar) or could be converted to binary (by possibly adding more states), then you have a lightning-fast algorithm for your purpose: Bitmap Indices.
Bitmap indices can do such comparisons in memory, and literally nothing compares with their speed (ANDing bits is what computers really do fastest).
http://en.wikipedia.org/wiki/Bitmap_index
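As a tiny illustration in Python (my own sketch; the state names are invented), one integer per state serves as the bitmap, and a multi-state query is just bitwise AND:

records = [
    {"has milk", "has sugar"},
    {"has milk"},
    {"has sugar", "has eggs"},
]

# One bitmap (integer) per state: bit i is set iff record i has it.
bitmaps = {}
for i, states in enumerate(records):
    for s in states:
        bitmaps[s] = bitmaps.get(s, 0) | (1 << i)

def having_all(*states):
    """Records having ALL the given states: AND the bitmaps together."""
    bits = bitmaps.get(states[0], 0)
    for s in states[1:]:
        bits &= bitmaps.get(s, 0)
    return [i for i in range(len(records)) if bits >> i & 1]

print(having_all("has milk", "has sugar"))   # [0]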
Here's the link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086
Almost all SQL databases support Bitmap Indexing, and there are several possible optimizations for it as well (by compression etc.):
MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx
Oracle: http://www.orafaq.com/wiki/Bitmap_index
Edit:
Apparently the original research work on bitmap indices is no longer available for free public access.
Links to recent literature on this subject:
Bitmap Index Design Choices and Their Performance Implications
Bitmap Index Design and Evaluation
Compressing Bitmap Indexes for Faster Search Operations
This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (e.g.: find all words in a dictionary matching A??PL? where ? is a "don't care" character).
One of the earliest results in this area is a paper by Ron Rivest from 1976 [1]. [2] is a more recent paper from 2002. Hopefully, this will be enough of a starting point to do a more in-depth literature search.
[1] Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.
[2] Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.
This seems like a custom-made problem for a graph database. You make a node for each set or subset, and a node for each element of a set, and then you link the nodes with a relationship Contains. E.g.:
Now you put all the elements A, B, C, D, E in an index/hash table, so you can find a node in constant time in the graph. Typical performance for a query [A,B,C] will be on the order of the smallest node, multiplied by the size of a typical set. E.g. to find [A,B,C], I find that the order of A is one, so I look at all the sets A is in, S1, and then I check that it has all of B and C; since the order of S1 is 4, I have to do a total of 4 comparisons.
A prebuilt graph database like Neo4j comes with a query language and will give good performance. I would imagine, provided that the typical orders in your database are not large, that its performance would be far superior to the algorithms based on set representations.
Hashing is usually an efficient technique for storage and retrieval of multidimensional data. The problem here is that the number of attributes is variable and potentially very large, right? I googled it a bit and found Feature Hashing on Wikipedia. The idea is basically the following:
Construct a hash of fixed length from each data entry (aka feature vector)
The length of the hash must be much smaller than the number of available features. The length is important for the performance.
On the Wikipedia page there is an implementation in pseudocode (create a hash for each feature contained in the entry, then increase the feature-vector-hash at this index position (modulo length) by one) and links to other implementations.
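A short Python rendering of that pseudocode (a sketch; md5 here is just a convenient stable hash, and the vector length is arbitrary):

import hashlib

def feature_hash(features, length=16):
    """Fold a variable-length set of features into a fixed-length
    vector: hash each feature, bump the slot at (hash mod length)."""
    vec = [0] * length
    for f in features:
        h = int(hashlib.md5(f.encode()).hexdigest(), 16)
        vec[h % length] += 1
    return vec

print(feature_hash(["has milk", "has sugar"]))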
Also here on SO is a question about feature hashing and amongst others a reference to a scientific paper about Feature Hashing for Large Scale Multitask Learning.
I cannot give a complete solution but you didn't want one. I'm quite convinced this is a good approach. You'll have to play around with the length of the hash as well as with different hashing functions (bloom filter being another keyword) to optimize the speed for your special case. Also there might still be even more efficient approaches if for example retrieval speed is more important than storage (balanced trees maybe?).
In the boolean retrieval model, a query consists of terms which are combined using different operators. Conjunction is the most obvious choice at first glance, but as query length grows, bad things happen: recall drops significantly when using conjunction, and precision drops when using disjunction (for example, stanford OR university).
For now we use conjunction in our search system (and the boolean retrieval model). And we have a problem if a user enters some very rare word or a long sequence of words. For example, if a user enters toyota corolla 4wd automatic 1995, we probably don't have a matching document. But if we delete at least one word from the query, we do have such documents. As far as I understand, in the Vector Space Model this problem is solved automatically: we do not filter documents on the mere presence of terms, we rank documents using the presence of terms.
So I'm interested in more advanced ways of combining terms in the boolean retrieval model, and in methods of rare-term elimination in the boolean retrieval model.
It seems like the sky's the limit in terms of defining a ranking function here. You could define a vector where the w_i are: 0 if the ith search term doesn't appear in the file, 1 if it does; or the number of times search term i appears in the file; etc. Then rank pages based on e.g. Manhattan distance, Euclidean distance, etc., sort in ascending order of distance, and possibly cull results with distance above a specified match tolerance.
If you want to handle more complex queries, you can put the query into CNF - e.g. (term1 or term2 or ... termn) AND (item1 or item2 or ... itemk) AND ... and then redefine the weights wi accordingly. You could list with each result the terms that failed to match in the file... so that the users would at least know how good a match it is.
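A minimal sketch of that ranking idea in Python (my own illustration; documents are just sets of terms here, and the weights are the 0/1 presence variant):

def rank(documents, terms, tolerance=None):
    """Score each document by Manhattan distance between the all-ones
    query vector and its 0/1 term-presence vector; lowest distance first."""
    scored = []
    for doc in documents:
        w = [1 if t in doc else 0 for t in terms]
        dist = sum(1 - x for x in w)      # Manhattan distance to (1,...,1)
        if tolerance is None or dist <= tolerance:
            scored.append((dist, doc))
    return [doc for dist, doc in sorted(scored, key=lambda p: p[0])]

docs = [{"toyota", "corolla", "automatic"}, {"toyota", "1995"}]
print(rank(docs, ["toyota", "corolla", "4wd", "automatic", "1995"]))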
I guess what I'm really trying to say is that to really get an answer that works for you, you have to define exactly what you are willing to accept as a valid search result. Under the strict interpretation, a query that is looking for A1 and A2 and ... Am should fail if any of the terms is missing...
I have some data, between a million and a billion records, each represented by a bitfield of about 64 bits per key. The bits are independent; you can imagine them basically as random bits.
If I have a test key and I want to find all values in my data with the same key, a hash table will spit those out very easily, in O(1).
What algorithm/data structure would efficiently find all records most similar to the query key? Here similar means that most bits are identical, but a minimal number are allowed to be wrong. This is traditionally measured by Hamming distance, which just counts the number of mismatched bits.
There are two ways this query might be made: by specifying a mismatch rate, like "give me a list of all existing keys which have fewer than 6 bits that differ from my query", or by simply best matches, like "give me a list of the 10,000 keys which have the lowest number of differing bits from my query."
You might be tempted to run to k-nearest-neighbor algorithms, but here we're talking about independent bits, so it doesn't seem likely that structures like quadtrees are useful.
The problem can be solved by simple brute force: testing a hash table for low numbers of differing bits. If we want to find all keys that differ by one bit from our query, for example, we can enumerate all 64 possible single-bit variants and test them all. But this explodes quickly; if we wanted to allow two bits of difference, we'd have to probe 64×63/2 = 2016 times. It gets exponentially worse for higher numbers of bits.
So is there another data structure or strategy that makes this kind of query more efficient?
The database/structure can be preprocessed as much as you like, it's the query speed that should be optimized.
What you want is a BK-Tree. It's a tree that's ideally suited to indexing metric spaces (your problem is one), and supports both nearest-neighbour and distance queries. I wrote an article about it a while ago.
BK-Trees are generally described with reference to text and using levenshtein distance to build the tree, but it's straightforward to write one in terms of binary strings and hamming distance.
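A bare-bones Python sketch of a BK-tree over integer keys with Hamming distance (my own illustration, not from the article):

def hamming(a, b):
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self):
        self.root = None                 # (key, {distance: child})

    def add(self, key):
        if self.root is None:
            self.root = (key, {})
            return
        node = self.root
        while True:
            d = hamming(key, node[0])
            if d in node[1]:
                node = node[1][d]        # descend along existing edge
            else:
                node[1][d] = (key, {})   # new child hangs off edge d
                return

    def query(self, key, max_dist):
        """All stored keys within max_dist of key."""
        results, stack = [], ([self.root] if self.root else [])
        while stack:
            node_key, children = stack.pop()
            d = hamming(key, node_key)
            if d <= max_dist:
                results.append(node_key)
            # Triangle inequality: only edges labelled within
            # [d - max_dist, d + max_dist] can lead to matches.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree()
for k in [0b0000, 0b0011, 0b1111, 0b1000]:
    tree.add(k)
print(tree.query(0b0001, 1))             # [0, 3] -> 0b0000 and 0b0011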
This sounds like a good fit for an S-Tree, which is like a hierarchical inverted file. Good resources on this topic include the following papers:
Hierarchical Bitmap Index: An Efficient and Scalable Indexing Technique for Set-Valued Attributes.
Improved Methods for Signature-Tree Construction (2000)
Quote from the first one:
The hierarchical bitmap index efficiently supports different classes of queries, including subset, superset and similarity queries. Our experiments show that the hierarchical bitmap index outperforms other set indexing techniques significantly.
These papers include references to other research that you might find useful, such as M-Trees.
Create a binary tree (specifically a trie) representing each key in your start set in the following way: the root node is the empty word, moving down the tree to the left appends a 0, and moving down to the right appends a 1. The tree will only have as many leaves as your start set has elements, so the size should stay manageable.
Now you can do a recursive traversal of this tree, allowing at most n "deviations" from the query key in each recursive line of execution, until you have found all of the nodes in the start set which are within that number of deviations.
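A compact Python sketch of that traversal (my own illustration over short 4-bit keys; width would be 64 in the question's setting):

def build_trie(keys, width):
    """Binary trie as nested dicts: bit 0 -> left, bit 1 -> right."""
    root = {}
    for k in keys:
        node = root
        for i in range(width - 1, -1, -1):
            node = node.setdefault((k >> i) & 1, {})
        node["key"] = k                  # leaf: remember the full key
    return root

def search(node, key, i, deviations, out):
    """Collect keys within `deviations` bit flips of `key`."""
    if i < 0:
        out.append(node["key"])
        return
    bit = (key >> i) & 1
    if bit in node:                      # follow the matching branch
        search(node[bit], key, i - 1, deviations, out)
    if deviations > 0 and (1 - bit) in node:
        search(node[1 - bit], key, i - 1, deviations - 1, out)

trie = build_trie([0b0000, 0b0011, 0b1111], width=4)
hits = []
search(trie, 0b0001, 3, 1, hits)         # start at the top bit (i = 3)
print(hits)                              # [0, 3]: both within 1 bit of 0b0001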
I'd go with an inverted index, like a search engine. You've basically got a fixed vocabulary of 64 words. Similarity is then measured by Hamming distance, instead of the cosine similarity a search engine would want to use. Constructing the index will be slow, but you ought to be able to query it at normal search-engine speeds.
The book Introduction to Information Retrieval covers the efficient construction, storage, compression and querying of inverted indexes.
"Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions", from 2008, seems to be the best result as of then. I won't try to summarize since I read it over a year ago and it's hairy. That's from a page on locality-sensitive hashing, along with an implementation of an earlier version of the scheme. For more general pointers, read up on nearest neighbor search.
This kind of question has been asked before: Fastest way to find most similar string to an input?
"The database/structure can be preprocessed as much as you like."
Well... if that is true, then all you need is a similarity matrix of your Hamming distances. Make the matrix sparse by pruning out large distances. It doesn't get any faster, and it's not that much of a memory hog.
Well, you could insert all of the neighbor keys along with the original key. That would mean that you store (64 choose k) times as much data, for k differing bits, and it will require that you decide k beforehand. Though you could always extend k by brute force querying neighbors, and this will automatically query the neighbors of your neighbors that you inserted. This also gives you a time-space tradeoff: for example, if you accept a 64 x data blowup and 64 times slower you can get two bits of distance.
I haven't completely thought this through, but I have an idea of where I'd start.
You could divide the search space up into a number of buckets where each bucket has a bucket key and the keys in the bucket are the keys that are more similar to this bucket key than any other bucket key. To create the bucket keys, you could randomly generate 64 bit keys and discard any that are too close to any previously created bucket key, or you could work out some algorithm that generates keys that are all dissimilar enough. To find the closest key to a test key, first find the bucket key that is closest, and then test each key in the bucket. (Actually, it's possible, but not likely, for the closest key to be in another bucket - do you need to find the closest key, or would a very close key be good enough?)
If you're OK with doing it probabilistically, I think there's a good way to solve question 2. I assume you have 2^30 data points and a cutoff, and you want to find all points within the cutoff distance from test.
One_Try()
1. Generate randomly a 20-bit subset S of 64 bits
2. Ask for a list of elements that agree with test on S (about 2^10 elements)
3. Sort that list by Hamming distance from test
4. Discard the part of list after cutoff
You repeat One_Try as many times as you need while merging the lists. The more tries you have, the more points you find. For example, if x is within 5 bits, you'll find it in one try with about (2/3)^5 ≈ 13% probability. Therefore if you repeat 100 tries you find all but roughly 10^{-6} of such x. Total time: 100*(1000*log 1000).
The main advantage of this is that you're able to output answers to question 2 as you proceed, since after the first few tries you'll certainly find everything within distance not more than 3 bits, etc.
If you have many computers, you give each of them several tries, since they are perfectly parallelizable: each computer saves some hash tables in advance.
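A rough Python rendering of One_Try (my own sketch; in a real system each try's hash table, keyed by the masked bits, would be built in advance rather than scanned):

import random

WIDTH, SUBSET = 64, 20

def one_try(data, test, cutoff):
    """One randomized probe: pick a random 20-bit subset S, take the
    keys agreeing with `test` on S, keep those within `cutoff` bits."""
    mask = sum(1 << b for b in random.sample(range(WIDTH), SUBSET))
    candidates = (k for k in data if (k ^ test) & mask == 0)
    return {k for k in candidates if bin(k ^ test).count("1") <= cutoff}

def probe(data, test, cutoff, tries=100):
    found = set()
    for _ in range(tries):               # merge the lists across tries
        found |= one_try(data, test, cutoff)
    return found

data = [random.getrandbits(WIDTH) for _ in range(10000)]
test = data[0] ^ 0b10110                 # a key 3 bits away from data[0]
print(data[0] in probe(data, test, cutoff=5))   # True with high probability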
Data structures for large sets described here: Detecting Near-Duplicates for Web Crawling
or
in memory trie: Judy-arrays at sourceforge.net
Assuming you have to visit each row to test its value (or, if you index on the bitfield, each index entry), then you can write the actual test quite efficiently using
A xor B
to find the differing bits, then bit-counting the result using a technique like this.
This effectively gives you the Hamming distance.
Since this can compile down to tens of instructions per test, it can run pretty fast.
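In Python the same test is a one-liner (int.bit_count needs Python 3.10+; bin(x).count("1") works everywhere):

def hamming(a, b):
    # XOR exposes the differing bits; popcount counts them.
    return (a ^ b).bit_count()

print(hamming(0b1011, 0b1110))   # 2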
If you are okay with a randomized algorithm (Monte Carlo in this case), you can use MinHash.
If the data weren't so sparse, a graph with keys as the vertices and edges linking 'adjacent' (Hamming distance = 1) nodes would probably be very efficient time-wise. The space would be very large though, so in your case, I don't think it would be a worthwhile tradeoff.
Is there any way to select a subset from a large set based on a property or predicate in less than O(n) time?
For a simple example, say I have a large set of authors. Each author has a one-to-many relationship with a set of books, and a one-to-one relationship with a city of birth.
Is there a way to efficiently do a query like "get all books by authors who were born in Chicago"? The only way I can think of is to first select all authors from the city (fast with a good index), then iterate through them and accumulate all their books (O(n) where n is the number of authors from Chicago).
I know databases do something like this in certain joins, and Endeca claims to be able to do this "fast" using what they call "Record Relationship Navigation", but I haven't been able to find anything about the actual algorithms used or even their computational complexity.
I'm not particularly concerned with the exact data structure... I'd be jazzed to learn about how to do this in a RDBMS, or a key/value repository, or just about anything.
Also, what about third or fourth degree requests of this nature? (Get me all the books written by authors living in cities with immigrant populations greater than 10,000...) Is there a generalized n-degree algorithm, and what is its performance characteristics?
Edit:
I am probably just really dense, but I don't see how the inverted index suggestion helps. For example, say I had the following data:
DATA
1. Milton England
2. Shakespeare England
3. Twain USA
4. Milton Paradise Lost
5. Shakespeare Hamlet
6. Shakespeare Othello
7. Twain Tom Sawyer
8. Twain Huck Finn
INDEX
"Milton" (1, 4)
"Shakespeare" (2, 5, 6)
"Twain" (3, 7, 8)
"Paridise Lost" (4)
"Hamlet" (5)
"Othello" (6)
"Tom Sawyer" (7)
"Huck Finn" (8)
"England" (1, 2)
"USA" (3)
Say I did my query on "books by authors from England". Very quickly, in O(1) time via a hashtable, I could get my list of authors from England: (1, 2). But then, for the next step, in order to retrieve the books, I'd have to do ANOTHER O(1) lookup for EACH member of the set {1, 2}: 1 -> {4}, 2 -> {5, 6}, then take the union of the results: {4, 5, 6}.
Or am I missing something? Perhaps you meant I should explicitly store an index entry linking Book to Country. That works for very small data sets. But for a large data set, the number of indexes required to match any possible combination of queries would make the index grow exponentially.
For joins like this on large data sets, a modern RDBMS will often use an algorithm called a list merge. Using your example:
Prepare a list, A, of all authors who live in Chicago and sort them by author in O(N log N) time.*
Prepare a list, B, of all (author, book name) pairs and sort them by author in O(M log M) time.*
Place these two lists "side by side", and compare the authors from the "top" (lexicographically minimum) element in each pile.
Are they the same? If so:
Output the (author, book name) pair from top(B)
Remove the top element of the B pile
Goto 3.
Otherwise, is top(A).author < top(B).author? If so:
Remove the top element of the A pile
Goto 3.
Otherwise, it must be that top(A).author > top(B).author:
Remove the top element of the B pile
Goto 3.
* (Or O(0) time if the table is already sorted by author, or has an index which is.)
The loop continues removing one item at a time until both piles are empty, thus taking O(N + M) steps, where N and M are the sizes of piles A and B respectively. Because the two "piles" are sorted by author, this algorithm will discover every matching pair. It does not require an index (although the presence of indexes may remove the need for one or both sort operations at the start).
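Here is that list merge written out as a small Python sketch (my own illustration; authors are assumed unique in pile A, as in the description):

def list_merge(authors, books):
    """authors: sorted author names; books: (author, title) pairs
    sorted by author. Yields every (author, title) pair that matches."""
    i = j = 0
    while i < len(authors) and j < len(books):
        if authors[i] == books[j][0]:
            yield books[j]               # output and remove top(B)
            j += 1
        elif authors[i] < books[j][0]:
            i += 1                       # remove top(A)
        else:
            j += 1                       # remove top(B)

authors = ["Bellow", "Hemingway"]        # sorted pile A
books = [("Bellow", "Herzog"), ("Faulkner", "Light in August"),
         ("Hemingway", "The Sun Also Rises")]
print(list(list_merge(authors, books)))  # Bellow and Hemingway pairs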
Note that the RDBMS may well choose a different algorithm (e.g. the simple one you mentioned) if it estimates that it would be faster to do so. The RDBMS's query analyser generally estimates the costs in terms of disk accesses and CPU time for many thousands of different approaches, possibly taking into account such information as the statistical distributions of values in the relevant tables, and selects the best.
SELECT a.*, b.*
FROM Authors AS a, Books AS b
WHERE a.author_id = b.author_id
AND a.birth_city = "Chicago"
AND a.birth_state = "IL";
A good optimizer will process that in less time than it would take to read the whole list of authors and the whole list of books, which is therefore sub-linear time. (If you have another definition of what you mean by sub-linear, speak up.)
Note that the optimizer should be able to choose the order in which to process the tables that is most advantageous. And this applies to N-level sets of queries.
Generally speaking, RDBMSes handle these types of queries very well. Both commercial and open source database engines have evolved over decades using all the reasonable computing algorithms applicable, to do just this task as fast as possible.
I would venture a guess that the only way you would beat an RDBMS on speed is if your data is specifically organized and requires specific algorithms. Some RDBMSes let you specify which of the underlying algorithms to use for manipulating data, and with open-source ones you can always rewrite or implement a new algorithm, if needed.
However, unless your case is very special, I believe it might be serious overkill. For most cases, I would say putting the data in an RDBMS and manipulating it via SQL should work well enough that you don't have to worry about the underlying algorithms.
Inverted Index.
Since this has a loop, I'm sure it fails the O(n) test. However, when your result set has n rows, it's impossible to avoid iterating over the result set. The query itself, however, is nothing more than hash lookups.
from collections import defaultdict

country = ["England", "USA"]
author = [("Milton", "England"), ("Shakespeare", "England"), ("Twain", "USA")]
title = [
    ("Milton", "Paradise Lost"),
    ("Shakespeare", "Hamlet"),
    ("Shakespeare", "Othello"),
    ("Twain", "Tom Sawyer"),
    ("Twain", "Huck Finn"),
]

# Inverted indexes: map a value to the row ids where it occurs.
inv_country = {}
for id, c in enumerate(country):
    inv_country.setdefault(c, defaultdict(list))
    inv_country[c]['country'].append(id)

inv_author = {}
for id, row in enumerate(author):
    a, c = row
    inv_author.setdefault(a, defaultdict(list))
    inv_author[a]['author'].append(id)
    inv_country[c]['author'].append(id)

inv_title = {}
for id, row in enumerate(title):
    a, t = row
    inv_title.setdefault(t, defaultdict(list))
    inv_title[t]['title'].append(id)
    inv_author[a]['title'].append(id)  # index each title under its author

# Books by authors from England: one lookup for the author ids,
# then one lookup per author for that author's title ids.
for aid in inv_country['England']['author']:
    for tid in inv_author[author[aid][0]]['title']:
        print(title[tid])