Alternative 1 of index data entry - database

One of the three alternatives of what to store as a data entry in an index is a data entry k* which is an actual record with search key value k. My question is, if you're going to store the actual record in the index, then what's the point of creating the index?

This is an extreme case, because it does not really correspond to
having data entries separate from the data records (the hashed file is
an example of this case).
(M. Lenzerini, R. Rosati, Database Management Systems: Access file manager and query evaluation, "La Sapienza" University of Rome, 2016)
Alternative 1 is often used for direct indexing, for instance in B-trees and hash indexes (see also Oracle, Building Domain Indexes)
Let's do a concrete example.
We have a relation R(a,b,c) and we have a clustered B+-tree using alternative 2 on search key a. Since the tree is clustered, the relation R must be sorted by a.
Now, let's suppose that a common query for the relation is:
SELECT *
FROM R
WHERE b > 25
so we want to build another index to efficiently support this kind of query.
Case 1: clustered tree with alt. 2
We know that clustered B+-trees with alternative 2 are efficient for range queries, because they just need to search for the first match (say, the first record with b > 25), then do 1 page access to the relation page that this entry points to, and finally scan that page (and possibly some following pages) as long as the records fall within the given range.
To sum up:
Search for the first matching entry in the tree. Cost: log_f(ℓ)
Use the found pointer to go to a specific page. Cost: 1
Scan that page and possibly some following pages. Cost: number of relevant pages
The final cost (expressed in terms of page accesses) is
log_f(ℓ) + 1 + #relevant-pages
where f is the fan-out and ℓ the number of leaves.
Unfortunately, in our case a tree on search key b must be unclustered, because the relation is already sorted by a.
Case 2: unclustered tree with alt. 2 (or 3)
We also know that B+-trees are not as efficient for range queries when they are unclustered. In fact, with a tree using alternative 2 or 3, the tree stores only pointers to the records, so for each result that falls in the range we would have to do a page access to a potentially different page (because the relation is ordered differently from the index).
To sum up:
Search for the first matching entry in the tree. Cost: log_f(ℓ)
Continue scanning the leaf (and possibly other leaves) and do a separate page access for each tuple that falls in the range. Cost: number of other relevant leaves + number of relevant tuples
The final cost (expressed in terms of page accesses) is
log_f(ℓ) + #other-relevant-leaves + #relevant-tuples
Notice that the number of tuples is much larger than the number of pages!
Case 3: unclustered tree with alt. 1
Using alternative 1, we have all the data in the tree, so to execute the query we:
Search for the first matching entry in the tree. Cost: log_f(ℓ)
Continue scanning the leaf (and possibly other leaves). Cost: number of other relevant leaves
The final cost (expressed in terms of page accesses) is
log_f(ℓ) + #other-relevant-leaves
which is even smaller than (or at most equal to) the cost of case 1, but unlike case 1 this option is actually allowed.
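To make the three cases concrete, here is a small back-of-the-envelope comparison in Python. All the numbers (fan-out, leaf count, selectivity) are made up for illustration and are not part of the original example:

import math

# Hypothetical numbers, purely for illustration.
f = 100                   # fan-out of the B+-tree
leaves = 10_000           # number of leaf pages
relevant_tuples = 2_000   # tuples satisfying b > 25
tuples_per_page = 100     # tuples per relation page
relevant_pages = relevant_tuples // tuples_per_page   # 20 relation pages hold the matches
relevant_leaves = 20      # rough assumption: leaf pages covering the range

descent = round(math.log(leaves, f))   # log_f(l): levels to reach the first leaf, here 2

cost_case1 = descent + 1 + relevant_pages                  # clustered tree, alt. 2
cost_case2 = descent + relevant_leaves + relevant_tuples   # unclustered tree, alt. 2/3
cost_case3 = descent + relevant_leaves                     # alt. 1: the leaves hold the records

print(cost_case1, cost_case2, cost_case3)   # 23 2022 22 with these made-up numbers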
I hope I was clear enough.
N.B. The cost is expressed in terms of page accesses because I/O operations from/to secondary storage are the most expensive ones in terms of time (we ignore the cost of scanning a whole page in main memory and consider just the cost of accessing it).

Related

How searching a million keys organized as B-tree will need 114 comparisons?

Please explain how it will take 114 comparisons. The following is the screenshot taken from my book (Page 350, Data Structures Using C, 2nd Ed. Reema Thareja, Oxford Univ. Press). My reasoning is that in worst case each node will have just minimum number of children (i.e. 5), so I took log base 5 of a million, and it comes to 9. So assuming at each level of the tree we search minimum number of keys (i.e. 4), it comes to somewhere like 36 comparisons, nowhere near 114.
Consider a situation in which we have to search an un-indexed and
unsorted database that contains n key values. The worst case running
time to perform this operation would be O(n). In contrast, if the data
in the database is indexed with a B tree, the same search operation
will run in O(log n). For example, searching for a single key on a set
of one million keys will at most require 1,000,000 comparisons. But if
the same data is indexed with a B tree of order 10, then only 114
comparisons will be required in the worst case.
Page 350, Data Structures Using C, 2nd Ed. Reema Thareja, Oxford Univ. Press
The worst case tree has the minimum number of keys everywhere except on the path you're searching.
If the size of each internal node is in [5,10), then in the worst case, a tree with a million items will be about 10 levels deep, when most nodes have 5 keys.
The worst case path to a node, however, might have 10 keys in each node. The statement seems to assume that you'll do a linear search instead of a binary search inside each node (I would advise doing a binary search instead), so that can lead to around 10*10 = 100 comparisons.
If you carefully consider the details, the real number might very well come out to 114.
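As a rough sanity check (my own back-of-the-envelope estimate, not the book's exact derivation), the worst-case depth and comparison counts for a million keys can be estimated like this; the book's 114 evidently comes from a slightly different accounting of the details:

import math

n_keys = 1_000_000
min_children = 5        # order-10 B tree: each internal node has between 5 and 10 children
max_keys_per_node = 9   # a node with 10 children holds 9 keys

# Worst-case depth: every node minimally filled, so the fan-out is 5 everywhere.
depth = math.ceil(math.log(n_keys, min_children))   # about 9 levels

# Linear scan inside each node, assuming the nodes on the search path happen to be full.
comparisons_linear = depth * max_keys_per_node      # about 81

# Binary search inside each node instead.
comparisons_binary = depth * math.ceil(math.log2(max_keys_per_node + 1))   # about 36

print(depth, comparisons_linear, comparisons_binary)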
(This is not an Answer to the question asked, but a related discussion.)
Sounds like a textbook question, not a real-life question.
Counting comparisons is likely to be the best way to judge an in-memory tree, but not for a disk-based dataset.
Even so, the "average" number of comparisons (for in-memory) or disk hits (for disk-based) is likely to be the metric to compute.
(Sure, it is good to compute the maximum numbers as a useful exercise for understanding the structures.)
Perhaps the optimal "tree" for in-memory searching is a binary-search-style tree, but with 3-way fan-out, kept balanced with 2 or 3 elements in each node.
For disk-based searching -- think databases -- the optimal is likely to be a BTree with a block size based on what is efficient to read from disk. Counting comparisons is a poor second when it comes to the overall time taken to fetch a row.

How to calculate index pages scanned and relation pages scanned

I have some homework problems that require that I calculate the total cost of queries. Some of the assumptions made for the queries are:
All costs are in terms of disk pages, cost of scanning the index, and cost of reading matching
tuples from the relation if needed.
If reading tuples from a relation, use the worst case scenario: all tuples are in different disk
pages.
Assume all indices have 3 levels (root, internal node, and leaf level), and the root of all indices
are in memory so scanning the root incurs no cost. Most searches will have one node from
the internal level and a number of leaf nodes.
Postgresql returns the size of an index in terms of the number of pages using the query below.
For the B-tree index, assume the numbers provided are the number of leaf nodes.
My professor also gives us a sample query to get the cost for, but I don't know exactly how he got to these numbers. The database in question looks like:
create table series (
seriesid int primary key
, title varchar(400)
, yearreleased int
, contentrating varchar(40) -- age group the movie is intended for
, imdbrating float -- imdb rating
, rottentomatoes int -- rotten tomatoes rating
, description text
, seasons int -- how many seasons are available
, date_added date -- date series is added to Netflix
) ;
and the query to calculate the cost for looks like this:
qa: select director from seriesdirectors where seriesid <= 100;
The index referenced is seriesdirectors_pkey which has 2 relation pages, and the query returns 5 tuples in total.
Using these numbers my professor somehow got to the conclusion that the number of index pages scanned is 3 and the number of relation pages scanned is 0.
The reasoning being: "because there are 5 tuples matching this query (1 internal node, and 2 leaf nodes in the worst case), and the index contains all the necessary information (so zero tuples need to be read)." I have been trying to understand the concept of index pages and how exactly my professor got these numbers so I can work on the rest of the problems. If any more context is needed, let me know and I will provide it.
Edit: by "index page" in this case I mean a page of a standard database index, i.e. a structure that stores tuples from the database under a unique identifier (more here: https://en.wikipedia.org/wiki/Database_index).
The big question I have is how can I calculate the total number of index pages queried and the number of relation pages queried. The example query above (qa) was using the index seriesdirectors_pkey which has 2 relation pages and returns a result of 5 rows. I don't know how we get from that to knowing that 3 index pages were queried and 0 relation pages were queried.
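For what it's worth, here is my reading of the arithmetic under the stated assumptions, written out as a tiny Python sketch (these are my assumptions about the cost model, not an official solution):

# Cost model as I understand the assignment:
root_pages     = 0   # the root of every index is in memory, so it costs nothing
internal_pages = 1   # one internal node on the way down (3-level index)
leaf_pages     = 2   # worst case: the 5 matching entries span both leaf pages of the index
index_pages    = root_pages + internal_pages + leaf_pages   # = 3

# Index-only scan: seriesid and director are (presumably) both stored in the index
# entries, so no tuple has to be fetched from the relation.
relation_pages = 0

print(index_pages, relation_pages)   # 3 0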

Need algorithm for fast storage and retrieval (search) of sets and subsets

I need a way of storing sets of arbitrary size for fast query later on.
I'll be needing to query the resulting data structure for subsets or sets that are already stored.
===
Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting for people to develop the algorithm themselves.
I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want, since from what I understand it 'clusters' the tuples into simpler, discrete/approximate forms and loses the original tuples.
Now, an even simpler example:
[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, niu, omega] [omega, beta]
Query:
[alpha, delta]
Result:
[alpha, beta, gamma, delta] [alpha, epsilon, delta]
So the set elements are just that, unique, unrelated elements. Forget about types and values. The elements can be tested among them for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) more than just creating one now, on the spot.
==
Original examples:
For example, say the database contains these sets
[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1]
If I use [A1, C1] as a query, these two sets should be returned as a result:
[A1, B1, C1, D1], [A1, D3, C1]
Example 2:
Database:
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]
Query:
[Distance to berlin: 240km]
Result
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets having common states (such as Gasoline amount: 240) in such a way that the query is as efficient as possible.
What algorithms are there for such needs?
I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.
Clarifications:
If it helps answer the question, I'm intending on using them for storing states:
Simple example:
[Has milk, Doesn't have eggs, Has Sugar]
I'm thinking such a requirement might require graphs or multidimensional arrays, but I'm not sure
Conclusion
I've implemented the two algorithms proposed in the answers, Set-Trie and Inverted Index, and did some rudimentary profiling on them, measuring the duration of a query for a given set under each algorithm (the chart is not reproduced here). Both algorithms worked on the same randomly generated data set consisting of sets of integers, and they seem equivalent (or almost equivalent) performance-wise.
I'm confident that I can now contribute to the solution. One possible quite efficient way is a:
Trie invented by Franklin Mark Liang
Such a special tree is used for example in spell checking or autocompletion and that actually comes close to your desired behavior, especially allowing to search for subsets quite conveniently.
The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.
What is a Set-Trie? A tree where each node except the root contains a single attribute value (number) and a marker (bool) indicating whether there is a data entry at this node. Each subtree contains only attributes whose values are larger than the attribute value of the parent node. The root of the Set-Trie is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all marked nodes that you reach when you go down the tree and along the search key simultaneously (see below).
But first a drawing by me:
The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.
Please note that the right subtree from root does not contain attribute 1 at all. That's the clue.
Searching including subsets Say you want to search for attributes 4 and 1. First you order them, so the search key is {1,4}. Now, starting from the root, you go along the search key and down the tree simultaneously. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller than or equal to 1. There is only one, namely 1. Inside you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than or equal to 4, which here is all of them. You continue until there is nothing left to do and collect all circles (data entries) that have the attribute value exactly 4 (i.e. the last attribute in the key). These are {1,2,4} and {1,4}, but not {1,3} (no 4) or {2,4} (no 1).
Insertion Is very easy. Go down the tree and store a data entry at the appropriate position. For example, the data entry {2,5} would be stored as a child of {2}.
Add attributes dynamically Is naturally ready, you could immediately insert {1,4,6}. It would come below {1,4} of course.
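To make the procedure concrete, here is a minimal Python sketch of a Set-Trie written from the description above (my own illustration, not Savnik's reference implementation; it assumes all elements are mutually comparable, e.g. all strings or all numbers):

class SetTrieNode:
    def __init__(self, value=None):
        self.value = value        # attribute stored at this node (None for the root)
        self.children = {}        # attribute value -> child node
        self.is_set_end = False   # marker: a stored set ends at this node

class SetTrie:
    def __init__(self):
        self.root = SetTrieNode()

    def insert(self, items):
        # Store a set; sorting makes every root-to-node path strictly increasing.
        node = self.root
        for x in sorted(items):
            node = node.children.setdefault(x, SetTrieNode(x))
        node.is_set_end = True

    def supersets(self, query):
        # Return every stored set that contains all elements of `query`.
        results = []
        self._search(self.root, sorted(query), 0, [], results)
        return results

    def _search(self, node, query, i, path, results):
        if i == len(query):
            self._collect(node, path, results)   # all query elements matched
            return
        for value, child in node.children.items():
            if value < query[i]:
                self._search(child, query, i, path + [value], results)      # skip over it
            elif value == query[i]:
                self._search(child, query, i + 1, path + [value], results)  # matched, advance

    def _collect(self, node, path, results):
        if node.is_set_end:
            results.append(path)
        for value, child in node.children.items():
            self._collect(child, path + [value], results)

trie = SetTrie()
for s in [["alpha", "beta", "gamma", "delta"], ["alpha", "epsilon", "delta"],
          ["gamma", "niu", "omega"], ["omega", "beta"]]:
    trie.insert(s)
print(trie.supersets(["alpha", "delta"]))   # the two stored sets that contain both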
I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They probably are very efficient.
I don't know if you still want to store the data in a database. I think this would complicate things further and I don't know what is the best to do then.
How about having an inverse index built of hashes?
Suppose you have your values int A, char B, bool C of different types. With std::hash (or any other hash function) you can create numeric hash values size_t Ah, Bh, Ch.
Then you define a map that maps an index to a vector of pointers to the tuples
std::map<size_t,std::vector<TupleStruct*> > mymap;
or, if you can use global indices, just
std::map<size_t,std::vector<size_t> > mymap;
For retrieval by queries X and Y, you need to
get hash value of the queries Xh and Yh
get the corresponding "sets" out of mymap
intersect the sets mymap[Xh] and mymap[Yh]
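If it helps, here is the same retrieval idea as a minimal Python sketch, where a dict plays the role of the hash map and the stored objects are just sets (names and data are made up):

from collections import defaultdict

postings = defaultdict(set)   # element -> ids of the stored sets that contain it
stored = {}                   # id -> the stored set itself

def add(set_id, elements):
    stored[set_id] = frozenset(elements)
    for e in elements:
        postings[e].add(set_id)

def query(elements):
    # Intersect the posting sets of all query elements.
    candidates = [postings[e] for e in elements]
    ids = set.intersection(*candidates) if candidates else set()
    return [stored[i] for i in ids]

add(1, ["alpha", "beta", "gamma", "delta"])
add(2, ["alpha", "epsilon", "delta"])
print(query(["alpha", "delta"]))   # both stored sets contain alpha and delta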
If I understand your needs correctly, you need a multi-state storing data structure, with retrievals on combinations of these states.
If the states are binary (as in your examples: has milk / doesn't have milk, has sugar / doesn't have sugar) or can be converted to binary (possibly by adding more states), then you have a lightning-fast algorithm for your purpose: Bitmap Indices.
Bitmap indices can do such comparisons in memory, and literally nothing compares to them in speed (ANDing bits is what computers can really do the fastest).
http://en.wikipedia.org/wiki/Bitmap_index
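For intuition, here is a tiny Python sketch of the bitwise-AND query, using plain integers as bit vectors (purely illustrative, not how a real engine stores or compresses its bitmaps):

# One bitmap per binary state; bit i is set if stored set number i has that state.
bitmaps = {
    "has_milk":  0b101,   # sets 0 and 2 have milk
    "has_sugar": 0b100,   # only set 2 has sugar
}
n_sets = 3

def matching_sets(states):
    result = ~0                      # start with all bits set
    for s in states:
        result &= bitmaps.get(s, 0)  # AND the bitmap of each queried state
    return [i for i in range(n_sets) if result >> i & 1]

print(matching_sets(["has_milk", "has_sugar"]))   # [2]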
Here's the link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086
Almost all SQL databases support Bitmap Indexing and there are several possible optimizations for it as well (by compression etc.):
MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx
Oracle: http://www.orafaq.com/wiki/Bitmap_index
Edit:
Apparently the original research work on bitmap indices is no longer available for free public access.
Links to recent literature on this subject:
Bitmap Index Design Choices and Their Performance Implications
Bitmap Index Design and Evaluation
Compressing Bitmap Indexes for Faster Search Operations
This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (e.g.: find all words in a dictionary matching A??PL? where ? is a "don't care" character).
One of the earliest results in this area is a 1976 paper by Ron Rivest [1]. A more recent paper, from 2002, is [2]. Hopefully, this will be enough of a starting point to do a more in-depth literature search.
[1] Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.
[2] Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.
This seems like a custom made problem for a graph database. You make a node for each set or subset, and a node for each element of a set, and then you link the nodes with a relationship Contains. E.g.:
Now you put all the elements A, B, C, D, E in an index/hash table, so you can find a node in constant time in the graph. Typical performance for a query [A,B,C] will be the order of the smallest node, multiplied by the size of a typical set. E.g. to find [A,B,C], I find that the order of A is one, so I look at all the sets A is in, here only S1, and then I check that it contains both B and C; since the order of S1 is 4, I have to do a total of 4 comparisons.
A prebuilt graph database like Neo4j comes with a query language, and will give good performance. I would imagine, provided that the typical orders of your database is not large, that its performance is far superior to the algorithms based on set representations.
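A minimal Python sketch of that lookup strategy (hand-built toy index, illustrative only): pick the query element contained in the fewest sets, then verify the remaining elements only against those candidate sets.

# element -> stored sets that contain it (i.e. the Contains relationships)
index = {
    "A": [{"A", "B", "C", "D"}],
    "B": [{"A", "B", "C", "D"}, {"B", "E"}],
    "C": [{"A", "B", "C", "D"}],
    "D": [{"A", "B", "C", "D"}],
    "E": [{"B", "E"}],
}

def find_supersets(query):
    # Start from the query element with the smallest "order" (fewest containing sets) ...
    rarest = min(query, key=lambda e: len(index.get(e, [])))
    # ... and check the remaining query elements only against those candidates.
    return [s for s in index.get(rarest, []) if all(e in s for e in query)]

print(find_supersets(["A", "B", "C"]))   # [{'A', 'B', 'C', 'D'}]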
Hashing is usually an efficient technique for storage and retrieval of multidimensional data. The problem here is that the number of attributes is variable and potentially very large, right? I googled it a bit and found Feature Hashing on Wikipedia. The idea is basically the following:
Construct a hash of fixed length from each data entry (aka feature vector)
The length of the hash must be much smaller than the number of available features. The length is important for the performance.
On the wikipedia page there is an implementation in pseudocode (create hash for each feature contained in entry, then increase feature-vector-hash at this index position (modulo length) by one) and links to other implementations.
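A short Python sketch of that pseudocode (the vector length is chosen arbitrarily; note that Python's built-in hash() is salted per process, so a real implementation would use a stable hash function):

def feature_hash(features, length=8):
    # Hash a variable-size set of features into a fixed-length count vector.
    vec = [0] * length
    for f in features:
        vec[hash(f) % length] += 1   # bump the bucket this feature hashes to
    return vec

print(feature_hash(["has milk", "doesn't have eggs", "has sugar"]))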
Also here on SO is a question about feature hashing and amongst others a reference to a scientific paper about Feature Hashing for Large Scale Multitask Learning.
I cannot give a complete solution but you didn't want one. I'm quite convinced this is a good approach. You'll have to play around with the length of the hash as well as with different hashing functions (bloom filter being another keyword) to optimize the speed for your special case. Also there might still be even more efficient approaches if for example retrieval speed is more important than storage (balanced trees maybe?).

Inverted Index Evaluation Order

I read somewhere that when you have an inverted index (for instance, you have a sorted list of pages for brutus, a sorted list of pages for caesar, and a sorted list of pages for calpurnia), when you do caesar AND brutus AND calpurnia, if the number of pages for calpurnia and brutus is less than the number of pages for caesar, then you should do caesar AND (brutus AND calpurnia), meaning you should evaluate the latter AND first. In general, whenever you have a series of ANDs, you always evaluate the pair with the lowest number of pages first. What is the reasoning behind this? Why is this efficient?
It is not true for every case of inverted indexes. If you need to sequentially scan the whole inverted indexes, then it would not matter which postings-list intersection you do first.
But assume a scenario where the inverted lists are stored in an indexed relation. Then evaluating the pair with the smaller number of document occurrences is like joining the more selective relations first, thus increasing the efficiency of the evaluation.
Intuitively, when we intersect smaller lists, we create a stronger filter which is used as a feed to the index to find the matches.
Assume we are interested in evaluating the keyword query a b c, where a, b and c are words in documents. Also assume the number of documents matching are as follows:
a --> 20
b --> 100
c --> 1000
a+b --> 10
a+c --> 15
b+c --> 50
a+b+c --> 5
Note that (a JOIN b) has size 10 and (b JOIN c) has size 50. Thus the first will require 10 accesses to the index on c, while the second requires 50 accesses to the index on a. But using a hash-based or a tree-based index, such accesses to the index do not differ greatly in cost and are usually done in a single I/O.
An important thing to realize is that because of the sorting, which you mentioned already, the inverted lists can be searched for any given document id very efficiently (generally, in logarithmic time), for example using binary search.
To see the effect of that, assume a query caesar AND brutus, and assume that there are occ_caesar pages for caesar and occ_brutus pages for brutus (i.e. occ_X denotes the length of the pages list for a term X). Now assume, for the sake of the example, that occ_caesar > occ_brutus, i.e. caesar occurs more frequently in the content than brutus.
What you do then is to iterate through all pages for brutus first, and search for each of them in the pages list for caesar. If indeed the lists can be searched in logarithmic time, this means you need
occ_brutus * log(occ_caesar)
computational steps to identify all pages that contain both terms.
If you had done it reversely (i.e. iterating through the caesar list and searching for each of its pages in the brutus list), the smaller number would end up in the logarithm and the greater number would become a factor, so the total time the evaluation takes would be longer.
Having said this, it is also important to realize that in practice things are more complicated, because (a) the lists are not only sorted but also compressed, which makes search harder, and (b) parts of the lists may be stored on disk rather than in memory, which means the total number of disk accesses is overwhelmingly more important than the total number of computational steps. Hence, the algorithm described above might not apply in its purest form, but the principle is as described.
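Here is a small Python sketch of the strategy described above: keep the postings lists sorted, iterate over the shorter one, and binary-search the longer one (the page ids are made up):

from bisect import bisect_left

def contains(sorted_pages, page):
    # Binary search: is `page` in the sorted postings list?
    i = bisect_left(sorted_pages, page)
    return i < len(sorted_pages) and sorted_pages[i] == page

def intersect(postings_x, postings_y):
    # Iterate over the shorter list and binary-search the longer one:
    # roughly min(|x|, |y|) * log(max(|x|, |y|)) steps.
    shorter, longer = sorted([postings_x, postings_y], key=len)
    return [p for p in shorter if contains(longer, p)]

caesar = [1, 2, 3, 5, 8, 13, 21, 34]   # made-up page ids
brutus = [2, 8, 34]
print(intersect(caesar, brutus))        # [2, 8, 34]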

Data structure or algorithm for second degree lookups in sub-linear time?

Is there any way to select a subset from a large set based on a property or predicate in less than O(n) time?
For a simple example, say I have a large set of authors. Each author has a one-to-many relationship with a set of books, and a one-to-one relationship with a city of birth.
Is there a way to efficiently do a query like "get all books by authors who were born in Chicago"? The only way I can think of is to first select all authors from the city (fast with a good index), then iterate through them and accumulate all their books (O(n) where n is the number of authors from Chicago).
I know databases do something like this in certain joins, and Endeca claims to be able to do this "fast" using what they call "Record Relationship Navigation", but I haven't been able to find anything about the actual algorithms used or even their computational complexity.
I'm not particularly concerned with the exact data structure... I'd be jazzed to learn about how to do this in a RDBMS, or a key/value repository, or just about anything.
Also, what about third or fourth degree requests of this nature? (Get me all the books written by authors living in cities with immigrant populations greater than 10,000...) Is there a generalized n-degree algorithm, and what is its performance characteristics?
Edit:
I am probably just really dense, but I don't see how the inverted index suggestion helps. For example, say I had the following data:
DATA
1. Milton England
2. Shakespeare England
3. Twain USA
4. Milton Paridise Lost
5. Shakespeare Hamlet
6. Shakespeare Othello
7. Twain Tom Sawyer
8. Twain Huck Finn
INDEX
"Milton" (1, 4)
"Shakespeare" (2, 5, 6)
"Twain" (3, 7, 8)
"Paridise Lost" (4)
"Hamlet" (5)
"Othello" (6)
"Tom Sawyer" (7)
"Huck Finn" (8)
"England" (1, 2)
"USA" (3)
Say I did my query on "books by authors from England". Very quickly, in O(1) time via a hashtable, I could get my list of authors from England: (1, 2). But then, for the next step, in order to retrieve the books, I'd have to do ANOTHER O(1) lookup for EACH element of the set {1, 2}: 1 -> {4}, 2 -> {5, 6}, and then take the union of the results: {4, 5, 6}.
Or am I missing something? Perhaps you meant I should explicitly store an index entry linking Book to Country. That works for very small data sets. But for a large data set, the number of indexes required to match any possible combination of queries would make the index grow exponentially.
For joins like this on large data sets, a modern RDBMS will often use an algorithm called a list merge. Using your example:
Prepare a list, A, of all authors who live in Chicago and sort them by author in O(Nlog(N)) time.*
Prepare a list, B, of all (author, book name) pairs and sort them by author in O(Mlog(M)) time.*
Place these two lists "side by side", and compare the authors from the "top" (lexicographically minimum) element in each pile.
Are they the same? If so:
Output the (author, book name) pair from top(B)
Remove the top element of the B pile
Goto 3.
Otherwise, is top(A).author < top(B).author? If so:
Remove the top element of the A pile
Goto 3.
Otherwise, it must be that top(A).author > top(B).author:
Remove the top element of the B pile
Goto 3.
* (Or O(0) time if the table is already sorted by author, or has an index which is.)
The loop continues removing one item at a time until both piles are empty, thus taking O(N + M) steps, where N and M are the sizes of piles A and B respectively. Because the two "piles" are sorted by author, this algorithm will discover every matching pair. It does not require an index (although the presence of indexes may remove the need for one or both sort operations at the start).
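For illustration, here is a minimal Python rendering of that merge; the data is invented, and in a real system the two inputs would come from sorted runs or index scans:

def merge_join(chicago_authors, author_books):
    # Both inputs must already be sorted by author name.
    out, i, j = [], 0, 0
    while i < len(chicago_authors) and j < len(author_books):
        a, (b_author, book) = chicago_authors[i], author_books[j]
        if a == b_author:
            out.append((b_author, book))   # emit the pair from the top of pile B
            j += 1                         # pop B; keep A in case this author has more books
        elif a < b_author:
            i += 1                         # pop A
        else:
            j += 1                         # pop B
    return out

A = ["Adams", "Baker"]                                   # authors born in Chicago, sorted
B = [("Adams", "Book One"), ("Baker", "Book Two"),
     ("Clark", "Book Three")]                            # (author, book name), sorted by author
print(merge_join(A, B))   # [('Adams', 'Book One'), ('Baker', 'Book Two')]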
Note that the RDBMS may well choose a different algorithm (e.g. the simple one you mentioned) if it estimates that it would be faster to do so. The RDBMS's query analyser generally estimates the costs in terms of disk accesses and CPU time for many thousands of different approaches, possibly taking into account such information as the statistical distributions of values in the relevant tables, and selects the best.
SELECT a.*, b.*
FROM Authors AS a, Books AS b
WHERE a.author_id = b.author_id
AND a.birth_city = "Chicago"
AND a.birth_state = "IL";
A good optimizer will process that in less than the time it would take to read the whole list of authors and the whole list of books, which is sub-linear time, therefore. (If you have another definition of what you mean by sub-linear, speak out.)
Note that the optimizer should be able to choose the order in which to process the tables that is most advantageous. And this applies to N-level sets of queries.
Generally speaking, RDBMSes handle these types of queries very well. Both commercial and open source database engines have evolved over decades using all the reasonable computing algorithms applicable, to do just this task as fast as possible.
I would venture a guess that the only way you would beat an RDBMS in speed is if your data is specifically organized and requires specific algorithms. Some RDBMSes let you specify which of the underlying algorithms to use for manipulating data, and with open-source ones you can always rewrite or implement a new algorithm if needed.
However, unless your case is very special, I believe it might be serious overkill. For most cases, I would say putting the data in an RDBMS and manipulating it via SQL should work well enough that you don't have to worry about the underlying algorithms.
Inverted Index.
Since this has a loop, I'm sure it fails the O(n) test. However, when your result set has n rows, it's impossible to avoid iterating over the result set. The query, however, is two hash lookups.
from collections import defaultdict

country = ["England", "USA"]
author = [("Milton", "England"), ("Shakespeare", "England"), ("Twain", "USA")]
title = [("Milton", "Paradise Lost"),
         ("Shakespeare", "Hamlet"),
         ("Shakespeare", "Othello"),
         ("Twain", "Tom Sawyer"),
         ("Twain", "Huck Finn"),
         ]

# Inverted indexes: each key maps to lists of row ids in the base lists above.
inv_country = {}
for id, c in enumerate(country):
    inv_country.setdefault(c, defaultdict(list))
    inv_country[c]['country'].append(id)

inv_author = {}
for id, row in enumerate(author):
    a, c = row
    inv_author.setdefault(a, defaultdict(list))
    inv_author[a]['author'].append(id)
    inv_country[c]['author'].append(id)   # authors posted under their country

inv_title = {}
for id, row in enumerate(title):
    a, t = row
    inv_title.setdefault(t, defaultdict(list))
    inv_title[t]['title'].append(id)
    inv_author[a]['title'].append(id)     # titles posted under their author

# Books by authors from England: country -> author ids -> title ids.
for a_id in inv_country['England']['author']:
    for t_id in inv_author[author[a_id][0]]['title']:
        print(title[t_id])
