MongoDB Compound Indexes - Does the sort order matter? - database
I've recently dived into MongoDB for a project of mine.
I've been reading up on indexes. I know it wouldn't matter much for a small collection, but as it grows there are going to be performance issues without the right indexes and queries.
Let's say I have a collection like so:
{user_id:1,slug:'one-slug'}
{user_id:1,slug:'another-slug'}
{user_id:2,slug:'one-slug'}
{user_id:3,slug:'just-a-slug'}
And I have to search my collection where
user_id == 1 and slug == 'one-slug'
In this collection, slugs will be unique to user ids.
That is, user id 1 can have only one slug with the value 'one-slug'.
I understand that user_id should be given priority due to its high cardinality, but what about slug, since it's unique as well most of the time? I also can't wrap my head around ascending and descending indexes, how they will affect performance in this case, or the right order I should be using in this collection.
I've read a bit but can't quite get it, particularly for my scenario. It would be awesome to hear from others.
You can think of a MongoDB single-field index as an array with pointers to document locations. For example, suppose you have a collection like this (note that the sequence is deliberately out of order):
[collection]
1: {a:3, b:2}
2: {a:1, b:2}
3: {a:2, b:1}
4: {a:1, b:1}
5: {a:2, b:2}
Single-field index
Now if you do:
db.collection.createIndex({a:1})
The index approximately looks like:
[index a:1]
1: {a:1} --> 2, 4
2: {a:2} --> 3, 5
3: {a:3} --> 1
Note three important things:
It is sorted by a, ascending
Each entry points to the locations where the relevant documents reside
The index records only the values of the a field; the b field does not exist in the index at all
So if you do a query like:
db.collection.find().sort({a:1})
All it has to do is walk the index from top to bottom, fetching and outputting the documents pointed to by the entries. Notice that you can also walk the index from the bottom, e.g.:
db.collection.find().sort({a:-1})
and the only difference is you walk the index in reverse.
Because b is not in the index at all, you cannot use the index when querying anything about b.
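To make the walk concrete, here is a small Python sketch that simulates the index above as a sorted list of (value, locations) pairs. This is illustrative only: real MongoDB indexes are B-trees, not Python lists.

```python
# Toy model of a single-field index: sorted (key, [doc locations]) entries.
collection = {
    1: {"a": 3, "b": 2},
    2: {"a": 1, "b": 2},
    3: {"a": 2, "b": 1},
    4: {"a": 1, "b": 1},
    5: {"a": 2, "b": 2},
}

# Build the index on a: group document locations by the value of a.
index_a = {}
for loc, doc in collection.items():
    index_a.setdefault(doc["a"], []).append(loc)
index_entries = sorted(index_a.items())  # [(1, [2, 4]), (2, [3, 5]), (3, [1])]

# sort({a: 1}): walk the index from top to bottom.
ascending = [loc for _, locs in index_entries for loc in locs]
print(ascending)   # [2, 4, 3, 5, 1]

# sort({a: -1}): walk the very same index in reverse.
descending = [loc for _, locs in reversed(index_entries) for loc in locs]
print(descending)  # [1, 3, 5, 2, 4]
```

Note how the b values never appear anywhere in `index_entries`, which is exactly why this index can say nothing about queries on b.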
Compound index
In a compound index e.g.:
db.collection.createIndex({a:1, b:1})
It means that you want to sort by a first, then sort by b within each a. The index would look like:
[index a:1, b:1]
1: {a:1, b:1} --> 4
2: {a:1, b:2} --> 2
3: {a:2, b:1} --> 3
4: {a:2, b:2} --> 5
5: {a:3, b:2} --> 1
Note that:
The index is sorted by a first
Within each value of a, the entries are sorted by b
You have 5 index entries vs. only 3 in the previous single-field example
Using this index, you can do a query like:
db.collection.find({a:2}).sort({b:1})
It can easily find where the a:2 entries start, then walk the index forward. Given that index, you cannot do:
db.collection.find().sort({b:1})
db.collection.find({b:1})
In both queries you can't easily find b, since its values are spread all over the index (i.e. not in contiguous entries). However, you can do:
db.collection.find({a:2}).sort({b:-1})
since you can essentially find where the a:2 are, and walk the b entries backward.
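A quick Python simulation of the compound index above shows why the a:2 block can be walked in either direction (again a toy model, not how the server actually stores it):

```python
# Toy model of the compound index {a:1, b:1}: entries sorted by (a, b).
collection = {
    1: {"a": 3, "b": 2},
    2: {"a": 1, "b": 2},
    3: {"a": 2, "b": 1},
    4: {"a": 1, "b": 1},
    5: {"a": 2, "b": 2},
}

# Each entry is ((a, b), location), sorted by a first, then b within each a.
index_ab = sorted(((doc["a"], doc["b"]), loc) for loc, doc in collection.items())
# [((1, 1), 4), ((1, 2), 2), ((2, 1), 3), ((2, 2), 5), ((3, 2), 1)]

# find({a:2}).sort({b:1}): the a==2 entries are contiguous; walk them forward.
block = [loc for (a, b), loc in index_ab if a == 2]
print(block)        # [3, 5]  (b ascending)

# find({a:2}).sort({b:-1}): the same contiguous block, walked backward.
print(block[::-1])  # [5, 3]  (b descending)
```

By contrast, the entries for any single b value are not contiguous in `index_ab`, which is why `sort({b:1})` alone cannot use this index.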
Edit: to clarify @marcospgp's question in the comments:
The possibility of using the index {a:1, b:1} to satisfy find({a:2}).sort({b:-1}) actually makes sense if you see it from a sorted-table point of view. For example, the index {a:1, b:1} can be thought of as:
a | b
--|--
1 | 1
1 | 2
2 | 1
2 | 2
2 | 3
3 | 1
3 | 2
find({a:2}).sort({b:1})
The index {a:1, b:1} means: sort by a, then within each a, sort the b values. If you then do a find({a:2}).sort({b:1}), the index knows where all the a=2 entries are. Within this block of a=2, the b values are sorted in ascending order (according to the index spec), so the query find({a:2}).sort({b:1}) can be satisfied by:
a | b
--|--
1 | 1
1 | 2
2 | 1 <-- walk this block forward to satisfy
2 | 2 <-- find({a:2}).sort({b:1})
2 | 3 <--
3 | 1
3 | 2
find({a:2}).sort({b:-1})
Since the index can be walked forward or backward, a similar procedure is followed, with a small twist at the end:
a | b
--|--
1 | 1
1 | 2
2 | 1 <-- walk this block backward to satisfy
2 | 2 <-- find({a:2}).sort({b:-1})
2 | 3 <--
3 | 1
3 | 2
The fact that the index can be walked forward or backward is the key point that enables the query find({a:2}).sort({b:-1}) to be able to use the index {a:1, b:1}.
Query planner explain
You can see what the query planner plans by using db.collection.explain().find(....). Basically, if you see a COLLSCAN stage, no index was used or could be used for the query. See explain results for details on the command's output.
[Cannot comment due to a lack of reputation]
Index direction only matters when you're sorting.
Not completely exact: some queries can be faster with a particular index direction, even if no ordering is required in the query itself (sorting applies only to results). For example, queries with date criteria: searching for users who subscribed yesterday will be faster with a descending index than with an ascending one or no index at all.
Difference between {user_id:1,slug:1} and {slug:1,user_id:1}:
Mongo will filter on the first field, then on the second field within the entries matching the first (and so on) in the index. The most selective fields should come first to really improve the query.
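To illustrate, here is a toy Python model of the index {user_id:1, slug:1} from the question, showing why all entries for one user_id are contiguous (and hence binary-searchable), while entries for one slug are scattered:

```python
# Toy model of the index {user_id: 1, slug: 1}: sorted (user_id, slug) keys.
# Data follows the question's example collection; illustrative only.
from bisect import bisect_left

index_keys = sorted([(1, "one-slug"), (1, "another-slug"),
                     (2, "one-slug"), (3, "just-a-slug")])

def range_for_user(user_id):
    """All entries with this user_id are contiguous, so two binary
    searches bracket the whole block (assumes integer user ids)."""
    lo = bisect_left(index_keys, (user_id, ""))
    hi = bisect_left(index_keys, (user_id + 1, ""))
    return index_keys[lo:hi]

print(range_for_user(1))  # [(1, 'another-slug'), (1, 'one-slug')]

# Entries for a single slug, by contrast, are NOT contiguous, so a query
# on slug alone cannot use this index efficiently:
print([k for k in index_keys if k[1] == "one-slug"])
# [(1, 'one-slug'), (2, 'one-slug')]
```

For the asker's query (equality on both user_id and slug), either field order reaches a single contiguous spot in the index; {user_id:1, slug:1} has the extra benefit of also serving queries on user_id alone.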
Related
How to model arbitrarily ordering items in database?
I accepted a new feature to re-order some items via a drag-and-drop UI and save the preference for each user to the database. What's the best way to do so? After reading some questions on StackOverflow, I found this solution.

Solution 1: Use decimal numbers to indicate order

For example:

id | item | order
---|------|------
1  | a    | 1
2  | b    | 2
3  | c    | 3
4  | d    | 4

If I insert item 4 between items 1 and 2, the order becomes:

id | item | order
---|------|------
1  | a    | 1
4  | d    | 1.5
2  | b    | 2
3  | c    | 3

In this way, every new order = (order[i-1] + order[i+1]) / 2.

If I need to save the preference for every user, then I need another relationship table like this:

user_id | item_id | order
--------|---------|------
1       | 1       | 1
1       | 2       | 2
1       | 3       | 3
1       | 4       | 1.5

I need num_of_users * num_of_items records to save this preference. However, there's another solution I can think of.

Solution 2: Save the order preference in a column in the User table

This is straightforward: add a column to the User table to record the order. Each value would be parsed as an array of item_ids ranked by array index.

user_id | item_order
--------|-----------
1       | [1,4,2,3]
2       | [1,2,3,4]

Is there any limitation to this solution? Or is there any other way to solve this problem?
Usually, an explicit ordering deals with the presentation or some specific processing of data. Hence, it's a good idea to separate entities from their presentation/processing. For example:

users
-----
user_id (PK)
user_login
...

user_lists
----------
list_id, user_id (PK)
item_index

item_index can be a simple integer value:

ordered continuously (1, 2 ... N): DELETE/INSERT of the whole list is normally required to change the order
ordered discretely with some seed (10, 20 ... N): you can insert new items without reordering the whole list

Another reason to separate entity data and lists: reordering lists should be done in a transaction, which may lead to row/table locks. With separate tables, only data in the list table is impacted.
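For the fractional-key variant (Solution 1 in the question above), here is a minimal sketch of midpoint insertion; the function name is made up for illustration:

```python
# Sketch of "Solution 1": fractional order keys let you drop an item
# between two neighbours without renumbering everything else.
def order_between(before, after):
    """Hypothetical helper: new order key for an item inserted between
    two existing neighbours."""
    return (before + after) / 2

items = [("a", 1.0), ("b", 2.0), ("c", 3.0)]

# Drag item "d" between "a" and "b":
items.append(("d", order_between(1.0, 2.0)))
items.sort(key=lambda pair: pair[1])
print(items)  # [('a', 1.0), ('d', 1.5), ('b', 2.0), ('c', 3.0)]
```

One caveat worth knowing: repeated midpoint insertions in the same gap exhaust floating-point precision fairly quickly, which is one reason the answer's discrete seed (10, 20, ... N) or an occasional renumbering pass is often preferred.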
Index in SQL Server 2008 database [duplicate]
I've created composite indexes (indices for you mathematical folk) on tables before with an assumption of how they worked. I was just curious if my assumption is correct or not. I assume that when you list the order of columns for the index, you are also specifying how the indexes will be grouped. For instance, if you have columns a, b, and c, and you specify the index in that same order a ASC, b ASC, and c ASC then the resultant index will essentially be many indexes for each "group" in a. Is this correct? If not, what will the resultant index actually look like?
Composite indexes work just like regular indexes, except they have multi-value keys. If you define an index on the fields (a, b, c), the records are sorted first on a, then b, then c.

Example:

| A | B | C |
-------------
| 1 | 2 | 3 |
| 1 | 4 | 2 |
| 1 | 4 | 4 |
| 2 | 3 | 5 |
| 2 | 4 | 4 |
| 2 | 4 | 5 |
A composite index is like a plain alphabet index in a dictionary, but covering two or more letters, like this:

AA - page 1
AB - page 12
etc.

Table rows are ordered first by the first column in the index, then by the second one, etc. It's usable when you search by both columns OR by the first column. If your index is like this:

AA - page 1
AB - page 12
…
AZ - page 245
BA - page 246
…

you can use it for searching on 2 letters (= 2 columns in a table), or like a plain index on one letter:

A - page 1
B - page 246
…

Note that in the case of a dictionary, the pages themselves are alphabetically ordered. That's an example of a CLUSTERED index. In a plain, non-CLUSTERED index, the references to pages are ordered, like in a history book:

Gaul, Alesia: pages 12, 56, 78
Gaul, Augustodonum Aeduorum: page 145
…
Gaul, Vellaunodunum: page 24
Egypt, Alexandria: pages 56, 194, 213, 234, 267

Composite indexes may also be used when you ORDER BY two or more columns. In this case a DESC clause may come in handy. See this article in my blog about using a DESC clause in a composite index: Descending indexes
The most common implementation of indices uses B-trees to allow somewhat rapid lookups, and also reasonably rapid range scans. It's too much to explain here, but there's a Wikipedia article on B-trees. And you are right: the first column you declare in the CREATE INDEX will be the high-order column in the resulting B-tree.

The easiest way to see this is by analogy with the old card catalogs you find in libraries that have not yet converted to online catalogs. If you are looking for all the cards for authors whose last name is "Clemens", you just go to the author catalog and very quickly find a drawer that says "CLE - CLI" on the front. That's the right drawer. Now you do a kind of informal binary search in that drawer to quickly find all the cards that say "Clemens, Roger" or "Clemens, Samuel" on them.

But suppose you want to find all the cards for authors whose first name is "Samuel". Now you're up the creek, because those cards are not gathered together in one place in the author catalog. A similar phenomenon happens with composite indices in a database.

Different DBMSes differ in how clever their optimizer is at detecting index range scans and accurately estimating their cost. And not all indices are B-trees. You'll have to read the docs for your specific DBMS to get the real info.
No. The resultant index will be a single index, but with a compound key.

KeyX = A,B,C,D; KeyY = 1,2,3,4

The index on (KeyX, KeyY) will actually be:

A1, A2, A3, B1, B3, C3, C4, D2

So in case you need to find something by KeyX and KeyY, that will be fast and will use the single index. Something like:

SELECT ... WHERE KeyX = 'B' AND KeyY = 3

But it's important to understand: WHERE KeyX = ? queries will use that index, while WHERE KeyY = ? queries will NOT use it at all.
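A small Python sketch of that compound-key behaviour, using the answer's KeyX/KeyY values (a toy model of the index as one sorted list, not an actual SQL Server structure):

```python
# Toy model: a composite index is ONE sorted structure keyed on
# (KeyX, KeyY), so equality on both keys is a single binary search.
from bisect import bisect_left

composite = sorted([("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 3),
                    ("C", 3), ("C", 4), ("D", 2)])

def lookup(kx, ky):
    """WHERE KeyX = kx AND KeyY = ky: one binary search on the compound key."""
    i = bisect_left(composite, (kx, ky))
    return i < len(composite) and composite[i] == (kx, ky)

print(lookup("B", 3))  # True: found via the index

# WHERE KeyY = 3 alone would have to scan: matching entries are scattered
# throughout the sorted structure, not grouped together.
print([k for k in composite if k[1] == 3])  # [('A', 3), ('B', 3), ('C', 3)]
```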
Vector Space Model query - searching a set of documents
I'm trying to write code for VSM search in C. Using a collection of documents, I built a hashtable (inverted index) in which each slot holds a word along with its df and a pointer to a list; each node of that list holds the name of a document (in which the word appeared at least once) along with the tf (how many times it appeared in that document). The user will write a question (and also choose the weighting scheme qqq.ddd and the comparison method, but that doesn't matter for my question) and I have to print the documents that are relevant to it, from most relevant to least relevant.

The examples I've seen show the steps using only one document. For example: we have a collection of 1,000,000 documents (N = 1,000,000) and we want to compare one document:

car insurance auto insurance

with the question:

best car insurance

So the example creates an array like this:

Term      | Query tf | Document tf
----------|----------|------------
auto      | 0        | 1
best      | 1        | 0
car       | 1        | 1
insurance | 1        | 2

The example also gives the df for each term, so using these clues and the weighting and comparison methods it's easy to compare them, turning them into vectors by finding the 4 coordinates (one for each word in the array).

In what I'm trying to do there are about 8000 documents, each having from 3 to 50 words. So how am I supposed to compare how relevant a query is to each document? If I have the query:

ping pong

document 1: this is ping kong
document 2: i am ping tongue

To compare query-document1, do I use the words this is ping kong pong (5 coordinates), and to compare query-document2 the words i am ping tongue pong (5 coordinates), and then, since I use the same comparison method, the one with the highest score is the most relevant? OR do I have to use the same word set for both, i.e. this is ping kong i am tongue pong (8 coordinates)?

So my question is: which is the right way to compare the query against all these 8000 documents? I hope I've succeeded in making my question easy to understand. Thank you for your time!
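One standard way out of the "which word set?" dilemma (a sketch, not the question's C code): give every term absent from a vector an implicit weight of 0. Then the dot product only ever iterates over terms actually present, so the choice of word set makes no difference, and documents are ranked by TF-IDF cosine similarity against the query. Here d1 is the worked example from the question; d2 is an invented second document so the ranking is non-trivial, and the smoothed idf formula is one common choice among several:

```python
import math
from collections import Counter

docs = {
    "d1": "car insurance auto insurance".split(),  # from the question
    "d2": "best auto repair".split(),              # hypothetical second doc
}
query = "best car insurance".split()

N = len(docs)
df = Counter(t for words in docs.values() for t in set(words))

def tfidf(counts):
    # Smoothed idf so terms that occur in every document still get weight.
    return {t: tf * (math.log((1 + N) / (1 + df[t])) + 1)
            for t, tf in counts.items()}

def cosine(u, v):
    # Terms missing from v contribute 0, so only u's terms are iterated.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q_vec = tfidf(Counter(query))
scores = {name: cosine(q_vec, tfidf(Counter(words)))
          for name, words in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d1', 'd2'] -- d1 matches "car" and "insurance"
```

With the inverted index from the question, this becomes even cheaper: for each query term, walk its posting list and accumulate partial dot products per document, so documents sharing no terms with the query are never touched at all.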
Retrieving data from 2 tables that have a 1 to many relationship - more efficient with 1 query or 2?
I need to selectively retrieve data from two tables that have a 1 to many relationship. A simplified example follows.

Table A is a list of events:

Id | TimeStamp | EventTypeId
---|-----------|------------
1  | 10:26...  | 12
2  | 11:31...  | 13
3  | 14:56...  | 12

Table B is a list of properties for the events. Different event types have different numbers of properties. Some event types have no properties at all:

EventId | Property | Value
--------|----------|------
1       | 1        | dog
1       | 2        | cat
3       | 1        | mazda
3       | 2        | honda
3       | 3        | toyota

There are a number of conditions that I will apply when I retrieve the data, however they all revolve around table A. For instance, I may want only events on a certain day, or only events of a certain type.

I believe I have two options for retrieving the data:

Option 1

Perform two queries: first query table A (with a WHERE clause) and store the data somewhere, then query table B (joining on table A in order to use the same WHERE clause) and "fill in the blanks" in the data that I retrieved from table A. This option requires SQL Server to perform 2 searches through table A, however the resulting 2 data sets contain no duplicate data.

Option 2

Perform a single query, joining table A to table B with a LEFT JOIN. This option only requires one search of table A, but the resulting data set will contain many duplicated values.

Conclusion

Is there a "correct" way to do this, or do I need to try both ways and see which one is quicker?
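A runnable sketch of Option 2, using SQLite as a stand-in (the question is about SQL Server, so this only illustrates the shape of the result set, not its performance characteristics):

```python
# "Option 2" from the question: one LEFT JOIN query. Note how the event
# columns are duplicated once per property row -- the trade-off discussed.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (Id INTEGER PRIMARY KEY, TimeStamp TEXT, EventTypeId INTEGER);
CREATE TABLE properties (EventId INTEGER, Property INTEGER, Value TEXT);
INSERT INTO events VALUES (1, '10:26', 12), (2, '11:31', 13), (3, '14:56', 12);
INSERT INTO properties VALUES (1, 1, 'dog'), (1, 2, 'cat'),
                              (3, 1, 'mazda'), (3, 2, 'honda'), (3, 3, 'toyota');
""")

# Filter on table A only (here: events of type 12), one round trip total.
rows = con.execute("""
    SELECT e.Id, e.TimeStamp, p.Property, p.Value
    FROM events e LEFT JOIN properties p ON p.EventId = e.Id
    WHERE e.EventTypeId = 12
    ORDER BY e.Id, p.Property
""").fetchall()
print(rows)
# [(1, '10:26', 1, 'dog'), (1, '10:26', 2, 'cat'),
#  (3, '14:56', 1, 'mazda'), (3, '14:56', 2, 'honda'), (3, '14:56', 3, 'toyota')]
```

The LEFT JOIN also keeps events with no properties (their Property/Value come back NULL), which the two-query option would otherwise have to stitch together by hand. Whether the duplicated event columns cost more than a second round trip is exactly the kind of question the execution plan and measurement should settle.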
Example: a join,

Select E.Id, E.Name from Employee E join Dept D on E.DeptId = D.Id

and a subquery, something like this:

Select E.Id, E.Name from Employee Where DeptId in (Select Id from Dept)

When I consider performance, which of the two queries would be faster, and why?

I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR ...). As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot) among other things. The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
PostgreSQL multidimensional array search
I am a newbie to PostgreSQL and was experimenting with it. I have created a simple table:

CREATE TABLE items_tags (
    ut_id SERIAL PRIMARY KEY,
    item_id integer,
    item_tags_weights text[]
);

where:

item_id - the item these tags are associated with
item_tags_weights - tags associated with the item, including weights

Example entries:

ut_id   | item_id | item_tags_weights
--------+---------+------------------
3       | 2       | {{D,1},{B,9},{W,3},{R,18},{F,9},{L,15},{G,12},{T,17},{0,3},{I,7},{E,14},{S,2},{O,5},{M,4},{V,3},{H,2},{X,14},{Q,9},{U,6},{P,16},{N,11},{J,1},{A,12},{Y,15},{C,15},{K,4},{Z,17}}
1000003 | 3       | {{Q,4},{T,19},{P,15},{M,14},{O,20},{S,3},{0,6},{Z,6},{F,4},{U,13},{E,18},{B,14},{V,14},{X,10},{K,18},{N,17},{R,14},{J,12},{L,15},{Y,3},{D,20},{I,18},{H,20},{W,15},{G,7},{A,11},{C,14}}
4       | 4       | {{Q,2},{W,7},{A,6},{T,19},{P,8},{E,10},{Y,19},{N,11},{Z,13},{U,19},{J,3},{O,1},{C,2},{L,7},{V,2},{H,12},{G,19},{K,15},{D,7},{B,4},{M,9},{X,6},{R,14},{0,9},{I,10},{F,12},{S,11}}
5       | 5       | {{M,9},{B,3},{I,6},{L,12},{J,2},{Y,7},{K,17},{W,6},{R,7},{V,1},{0,12},{N,13},{Q,2},{G,14},{C,2},{S,6},{O,19},{P,19},{F,4},{U,11},{Z,17},{T,3},{E,10},{D,2},{X,18},{H,2},{A,2}}
(4 rows)

where {D,1} means: D = tag, 1 = tag weight.

Well, I just want to list the item_ids where tag = 'U', ordered by tag weight. One way is to select ALL the tags from the database and do the processing in a high-level language with a sort, then use the result set. For this, I can do the following:

1) SELECT * FROM items_tags WHERE 'X' = ANY (item_tags_weights)
2) Extract and sort the information, then display it.

But considering that multiple items can be associated with a single tag, and assuming 10 million entries, this method will surely be sluggish. Any idea how to list them as needed, with CREATE FUNCTION or so? Any pointers would be helpful. Many thanks.
Have you considered normalization, i.e. moving the array field into another table? Apart from being easy to query and extend, it's likely to have better performance on larger databases.
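A sketch of the normalization being suggested, using SQLite as a stand-in for PostgreSQL (the table and index names are made up for illustration): one row per (item_id, tag, weight) instead of a text[] of {tag,weight} pairs.

```python
# Normalized version of the question's schema: the tag query becomes a
# plain indexed SELECT instead of array unpacking in the application.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item_tags (
    item_id INTEGER,
    tag     TEXT,
    weight  INTEGER,
    PRIMARY KEY (item_id, tag)
);
CREATE INDEX idx_tag_weight ON item_tags (tag, weight);
-- The 'U' weights from the question's four example rows:
INSERT INTO item_tags VALUES (2, 'U', 6), (3, 'U', 13), (4, 'U', 19), (5, 'U', 11);
""")

# "List the item_ids where tag = 'U', ordered by tag weight":
rows = con.execute(
    "SELECT item_id FROM item_tags WHERE tag = 'U' ORDER BY weight DESC"
).fetchall()
print([r[0] for r in rows])  # [4, 3, 5, 2]
```

The (tag, weight) index means the database can answer this by one contiguous index range scan per tag, which stays fast as the table grows, instead of scanning and parsing every array.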