Index in SQL Server 2008 database [duplicate]

I've created composite indexes (indices for you mathematical folk) on tables before with an assumption of how they worked. I was just curious if my assumption is correct or not.
I assume that when you list the order of columns for the index, you are also specifying how the index entries will be grouped. For instance, if you have columns a, b, and c, and you specify the index in that same order (a ASC, b ASC, c ASC), then the resultant index will essentially be many indexes, one for each "group" in a.
Is this correct? If not, what will the resultant index actually look like?

Composite indexes work just like regular indexes, except they have multi-value keys.
If you define an index on the fields (a, b, c), the records are sorted first on a, then on b, then on c.
Example:
| A | B | C |
-------------
| 1 | 2 | 3 |
| 1 | 4 | 2 |
| 1 | 4 | 4 |
| 2 | 3 | 5 |
| 2 | 4 | 4 |
| 2 | 4 | 5 |
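For instance, the ordering in the table above corresponds to an index created like this (a minimal T-SQL sketch; table and index names are made up):

CREATE TABLE t (a int, b int, c int);
CREATE INDEX ix_t_a_b_c ON t (a ASC, b ASC, c ASC);

-- The index keys are sorted by a, then by b within each a,
-- then by c within each (a, b) - exactly the order shown above.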

A composite index is like a plain alphabetical index in a dictionary, but covering two or more letters, like this:
AA - page 1
AB - page 12
etc.
Table rows are ordered first by the first column in the index, then by the second one, and so on.
It's usable when you search by both columns OR by the first column alone. If your index is like this:
AA - page 1
AB - page 12
…
AZ - page 245
BA - page 246
…
you can use it for searching on 2 letters ( = 2 columns in a table), or like a plain index on one letter:
A - page 1
B - page 246
…
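In SQL terms, the dictionary analogy maps to queries like these (a hypothetical two-column index, for illustration only):

CREATE INDEX ix_authors_name ON authors (last_name, first_name);

-- Can seek the index: the leading column (or both columns) is constrained
SELECT * FROM authors WHERE last_name = 'Clemens';
SELECT * FROM authors WHERE last_name = 'Clemens' AND first_name = 'Samuel';

-- Cannot seek the index: the leading column is skipped
SELECT * FROM authors WHERE first_name = 'Samuel';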
Note that in the case of a dictionary, the pages themselves are alphabetically ordered. That's an example of a CLUSTERED index.
In a plain, non-CLUSTERED index, the references to pages are ordered, like in a history book:
Egypt, Alexandria: pages 56, 194, 213, 234, 267
Gaul, Alesia: pages 12, 56, 78
Gaul, Augustodunum Aeduorum: page 145
…
Gaul, Vellaunodunum: page 24
Composite indexes may also be used when you ORDER BY two or more columns. In this case a DESC clause may come in handy.
See this article in my blog about using DESC clause in a composite index:
Descending indexes
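As a hedged sketch of that idea (table and column names are made up): an index whose key order matches the query's ORDER BY lets the engine read rows in index order instead of sorting:

CREATE INDEX ix_posts_user_date ON posts (user_id ASC, created_at DESC);

-- The ORDER BY matches the index order exactly, so no sort step is needed:
SELECT TOP 10 *
FROM posts
WHERE user_id = 42
ORDER BY created_at DESC;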

The most common implementation of indices uses B-trees to allow somewhat rapid lookups, and also reasonably rapid range scans. It's too much to explain here, but see the Wikipedia article on B-trees. And you are right: the first column you declare in CREATE INDEX will be the high-order column in the resulting B-tree.
A search on the high-order column amounts to a range scan, and a B-tree index can be very useful for such a search. The easiest way to see this is by analogy with the old card catalogs in libraries that have not yet converted to online catalogs.
If you are looking for all the cards for Authors whose last name is "Clemens", you just go to the author catalog, and very quickly find a drawer that says "CLE- CLI" on the front. That's the right drawer. Now you do a kind of informal binary search in that drawer to quickly find all the cards that say "Clemens, Roger", or "Clemens, Samuel" on them.
But suppose you want to find all the cards for the authors whose first name is "Samuel". Now you're up the creek, because those cards are not gathered together in one place in the Author catalog. A similar phenomenon happens with composite indices in a database.
Different DBMSes differ in how clever their optimizer is at detecting index range scans, and accurately estimating their cost. And not all indices are B-trees. You'll have to read the docs for your specific DBMS to get the real info.

No. The resultant index will be a single index, but with a compound key.
Say KeyX takes the values A, B, C, D and KeyY takes the values 1, 2, 3, 4.
An index on (KeyX, KeyY) will then actually contain ordered compound keys such as: A1, A2, A3, B1, B3, C3, C4, D2.
So when you need to find something by both KeyX and KeyY, that will be fast and will use the single index. Something like SELECT ... WHERE KeyX = 'B' AND KeyY = 3.
But it's important to understand: WHERE KeyX = ? queries will use that index, while WHERE KeyY = ? queries will NOT be able to use it at all.

Related

MongoDB Compound Indexes - Does the sort order matter?

I've recently dived into MongoDB for a project of mine.
I've been reading up on indexes, and I know it wouldn't matter much for a small collection, but as it grows there are going to be performance issues without the right indexes and queries.
Let's say I have a collection like so:
{user_id:1,slug:'one-slug'}
{user_id:1,slug:'another-slug'}
{user_id:2,slug:'one-slug'}
{user_id:3,slug:'just-a-slug'}
And I have to search my collection for documents where
user_id == 1 and slug == 'one-slug'
In this collection, slugs will be unique to user ids.
That is, user id 1 can have only one slug of the value 'one-slug'.
I understand that user_id should be given priority due to its high cardinality, but what about slug, since it's unique as well most of the time? I also can't wrap my head around ascending and descending indexes, how they're going to affect performance in this case, or the right order I should be using in this collection.
I've read a bit but I can't wrap my head around it, particularly for my scenario. It would be awesome to hear from others.
You can think of a MongoDB single-field index as an array with pointers to document locations. For example, if you have a collection with (note that the sequence is deliberately out of order):
[collection]
1: {a:3, b:2}
2: {a:1, b:2}
3: {a:2, b:1}
4: {a:1, b:1}
5: {a:2, b:2}
Single-field index
Now if you do:
db.collection.createIndex({a:1})
The index approximately looks like:
[index a:1]
1: {a:1} --> 2, 4
2: {a:2} --> 3, 5
3: {a:3} --> 1
Note three important things:
It's sorted by a ascending
Each entry points to the location where the relevant documents reside
The index only records the values of the a field. The b field does not exist in the index at all
So if you do a query like:
db.collection.find().sort({a:1})
All it has to do is walk the index from top to bottom, fetching and outputting the documents pointed to by the entries. Notice that you can also walk the index from the bottom, e.g.:
db.collection.find().sort({a:-1})
and the only difference is you walk the index in reverse.
Because b is not in the index at all, you cannot use the index when querying anything about b.
Compound index
In a compound index e.g.:
db.collection.createIndex({a:1, b:1})
It means that you want to sort by a first, then sort by b. The index would look like:
[index a:1, b:1]
1: {a:1, b:1} --> 4
2: {a:1, b:2} --> 2
3: {a:2, b:1} --> 3
4: {a:2, b:2} --> 5
5: {a:3, b:2} --> 1
Note that:
The index is sorted by a
Within each a you have a sorted b
You have 5 index entries vs. only three in the previous single-field example
Using this index, you can do a query like:
db.collection.find({a:2}).sort({b:1})
It can easily find where the a:2 entries start, then walk the index forward. Given that index, however, you cannot do:
db.collection.find().sort({b:1})
db.collection.find({b:1})
In both queries you can't easily find b since it's spread all over the index (i.e. not in contiguous entries). However you can do:
db.collection.find({a:2}).sort({b:-1})
since you can essentially find where the a:2 are, and walk the b entries backward.
Edit: clarification of @marcospgp's question in the comments:
The possibility of using the index {a:1, b:1} to satisfy find({a:2}).sort({b:-1}) actually makes sense if you see it from a sorted-table point of view. For example, the index {a:1, b:1} can be thought of as:
a | b
--|--
1 | 1
1 | 2
2 | 1
2 | 2
2 | 3
3 | 1
3 | 2
find({a:2}).sort({b:1})
The index {a:1, b:1} means sort by a, then within each a, sort the b values. If you then do a find({a:2}).sort({b:1}), the index knows where all the a=2 are. Within this block of a=2, the b would be sorted in ascending order (according to the index spec), so that query find({a:2}).sort({b:1}) can be satisfied by:
a | b
--|--
1 | 1
1 | 2
2 | 1 <-- walk this block forward to satisfy
2 | 2 <-- find({a:2}).sort({b:1})
2 | 3 <--
3 | 1
3 | 2
find({a:2}).sort({b:-1})
Since the index can be walked forward or backward, a similar procedure is followed, with a small twist at the end:
a | b
--|--
1 | 1
1 | 2
2 | 1 <-- walk this block backward to satisfy
2 | 2 <-- find({a:2}).sort({b:-1})
2 | 3 <--
3 | 1
3 | 2
The fact that the index can be walked forward or backward is the key point that enables the query find({a:2}).sort({b:-1}) to use the index {a:1, b:1}.
Query planner explain
You can see what the query planner plans by using db.collection.explain().find(....). Basically if you see a stage of COLLSCAN, no index was used or can be used for the query. See explain results for details on the command's output.
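For example, against the collection above (a sketch; the exact explain output varies by server version):

db.collection.createIndex({a: 1, b: 1})

// Winning plan shows an IXSCAN over {a: 1, b: 1}:
db.collection.explain().find({a: 2}).sort({b: -1})

// Winning plan shows COLLSCAN - no index covers b alone:
db.collection.explain().find({b: 1})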
[Cannot comment due to a lack of reputation]
Index direction only matters when you're sorting.
Not completely exact: some queries can be faster with a particular index direction, even if no ordering is required in the query itself (sorting is just for results). For example, with date criteria: searching for users who subscribed yesterday will be faster with a descending index than with an ascending one or no index at all.
difference between {user_id:1,slug:1} and {slug:1,user_id:1}
Mongo will filter on the first field, then on the second field within matching values of the first (and so on) in the index. The most restrictive fields must come first to really improve the query, as in the sketch below.
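Applied to the question's collection, that reasoning suggests something like this (a sketch; the unique option reflects the stated one-slug-per-user rule):

// user_id first: both fields are equality-matched, and user_id groups the slugs
db.collection.createIndex({user_id: 1, slug: 1}, {unique: true})

// Served entirely by the index above:
db.collection.find({user_id: 1, slug: 'one-slug'})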

Cassandra's ORDER BY does not work as expected

So, I'm storing some statistics in Cassandra.
I want to get the top 10 best entries based on a specific column. The column in this case is kills.
As there is no ORDER BY clause like in MySQL, I have to create a PARTITION KEY.
I've created the following table:
CREATE TABLE IF NOT EXISTS stats (
    uuid uuid,
    kills int,
    deaths int,
    playedGames int,
    wins int,
    srt int,
    PRIMARY KEY (srt, kills)
) WITH CLUSTERING ORDER BY (kills DESC);
The problem I have is the following. As you see above, I'm using the column srt for ordering, because when I use the column uuid as the partition key, the result of my select query is totally random and not sorted as expected.
So I tried to add a column that always holds the same value as my PARTITION KEY. Sorting works now, but not really well. When I now try SELECT * FROM stats;, the result is the following:
srt | kills | deaths | playedgames | uuid | wins
-----+-------+--------+-------------+--------------------------------------+------
0 | 49 | 35 | 48 | 6f284e6f-bd9a-491f-9f52-690ea2375fef | 2
0 | 48 | 21 | 30 | 4842ad78-50e4-470c-8ee9-71c5a731c935 | 4
0 | 47 | 48 | 14 | 91f41144-ef5a-4071-8c79-228a7e192f34 | 42
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
0 | 2 | 32 | 20 | 387448a7-a08e-46d4-81a2-33d8a893fdb6 | 31
0 | 1 | 16 | 17 | fe4efbcd-34c3-419a-a52e-f9ae8866f2bf | 12
0 | 0 | 31 | 25 | 82b13d11-7eeb-411c-a521-c2c2f9b8a764 | 10
The problem with the result is that per kills amount/value there is only one row, but there should definitely be more.
So, any idea how to use sorting in Cassandra without getting data stripped out?
I also heard about DataStax Enterprise (DSE), which supports Solr in queries, but DSE is only free for non-production use (and also only for 6 months), and the paid version is, at least from what I've heard, pretty expensive (around $4000 per node). So, is there any alternative, like a DataStax Enterprise Community Edition? It may not make sense, but I'm just asking. I haven't found anything from googling, so: can I also use Solr with "normal" Cassandra?
Thank you for your help!
PS: Please don't mark this as a duplicate of "order by clause not working in Cassandra query", because it didn't help me. I've already googled for about an hour and a half for a solution.
EDIT:
Because my primary key is PRIMARY KEY (srt, kills), the combination of (srt, kills) must be unique, which basically means that rows with the same amount of kills overwrite each other. I could use PRIMARY KEY (uuid, kills), which would solve the problem with overwriting rows, but when I then do SELECT * FROM stats LIMIT 10, the results are totally random and not sorted by kills.
If you want to use a column for sorting, take it out of the partition key. Rows will be sorted by this column within every partition - Cassandra splits data between nodes using the partition key, and orders it within each partition using the clustering key:
PRIMARY KEY ((srt), kills)
EDIT:
You need to understand the concepts a little more; I suggest you take a free course on the DataStax site, it can help you with further development.
Anyway, about your question:
The primary key is a set of columns that makes each row unique.
There are 2 types of columns in the primary key - partition key columns and clustering columns.
You can't use the partition key for sorting or range queries - that is against Cassandra's model, since such a query would be split across several nodes, or even all nodes and SSTables. If you want to use both of the listed columns for sorting, you can use another column for partitioning (a random number from 1 to 100, for example) and then execute your query for each such "bucket", or simply use another column that has a high enough number of unique values (at least 100), where the data is evenly distributed between those values and is accessed through all of them - otherwise you will end up with hot nodes/partitions.
PRIMARY KEY ((another_column), kills, srt)
What you have to understand, you can order your data only within partitions, but not between partitions.
"that per kills amount/value" - can you elaborate? There is only one row for each key in Cassandra; if you insert several rows with the same key, they are overwritten with the last insert's values (read about upserts).
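Putting those pieces together, a hedged CQL sketch (bucket is a hypothetical partitioning column holding, say, a random number from 1 to 100):

CREATE TABLE IF NOT EXISTS stats (
    bucket int,
    kills int,
    uuid uuid,
    deaths int,
    playedGames int,
    wins int,
    PRIMARY KEY ((bucket), kills, uuid)
) WITH CLUSTERING ORDER BY (kills DESC, uuid ASC);

-- uuid as a final clustering column keeps rows with equal kills from
-- overwriting each other; a top-10 read must query every bucket
-- (SELECT ... WHERE bucket = ? LIMIT 10) and merge the results client-side.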

Can a 'skinny table' design be compensated with a view?

Context: simple webapp game for personal learning purposes, using postgres. I can design it however I want.
2 tables 1 view (there are additional tables view references that aren't important)
Table: Research
col: research_id (foreign key to an outside table)
col: category (integer foreign key to category table)
col: percent (integer)
constraint (unique combination of the three columns)
Table: Category
col: category_id (primary key auto inc)
col: name(varchar(255))
notes: this table exists to capture the 4 categories of research I want in business logic, which I assume is not best practice to hardcode as columns in the db
View: Research_view
col: research_id (from research table)
col: foo1 (one of the categories from category table)
col: foo2 (etc...)
col: other cols from other joins
notes: has insert/update/delete statements that use the above tables appropriately
I worry that the research table itself qualifies as a "skinny table" (I hadn't heard the term until I saw it in Manning's iBATIS book). For example, test data within it looks like:
| research_id | percent | category |
| 1 | 25 | 1 |
| 1 | 25 | 2 |
| 1 | 25 | 3 |
| 1 | 25 | 4 |
| 2 | 20 | 1 |
| 2 | 30 | 2 |
| 2 | 25 | 3 |
| 2 | 25 | 4 |
1) Does it make sense to have all columns in a table collectively define unique entries?
2) Does this 'smell' to you?
Couple of notes to start:
constraint (unique combination of the three columns)
It makes no sense to have a unique constraint that includes a single-column primary key. Including that column will cause every row to be unique.
notes: this table exists to capture the 4 categories of research I want in business logic and which I assume is not best practice to hardcode as columns in the db
If a research item/entity is required to have all four categories defined for it to be valid, they should absolutely be columns in the research table. I can't tell definitively from your statement whether this is the case or not, but your assumption is faulty if looked at in isolation. Let your model reflect reality as closely as possible.
Another factor is whether it's a requirement that additional categories may be added to the system post-deployment. Whether the categories are intended to be flexible vs. fixed should absolutely influence the design.
1) Does it make sense to have all columns in a table collectively
define unique entries?
I would say it's not common, but I can imagine there are situations where it might be appropriate.
2) Does this 'smell' to you?
Hard to say without more details.
All that said, if the intent is to view and add research items with all four categories, I would say (again) that you should consider whether the four categories are semantically attributes of the research entity.
As a random example, things like height and weight might be considered categories of a person, but they would likely be stored flat on the person table, and not in a separate table.
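To make the trade-off concrete, here is a hedged PostgreSQL sketch of both shapes (column names are hypothetical; the flat version only works if the four categories are truly fixed):

-- Skinny shape, with the narrower unique constraint it probably wants:
CREATE TABLE research (
    research_id integer NOT NULL,
    category    integer NOT NULL REFERENCES category (category_id),
    percent     integer NOT NULL,
    UNIQUE (research_id, category)
);

-- Flat alternative: one row per research item, categories as columns
CREATE TABLE research_flat (
    research_id  integer PRIMARY KEY,
    foo1_percent integer NOT NULL,
    foo2_percent integer NOT NULL,
    foo3_percent integer NOT NULL,
    foo4_percent integer NOT NULL
);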

Retrieving data from 2 tables that have a 1 to many relationship - more efficient with 1 query or 2?

I need to selectively retrieve data from two tables that have a 1 to many relationship. A simplified example follows.
Table A is a list of events:
Id | TimeStamp | EventTypeId
--------------------------------
1 | 10:26... | 12
2 | 11:31... | 13
3 | 14:56... | 12
Table B is a list of properties for the events. Different event types have different numbers of properties. Some event types have no properties at all:
EventId | Property | Value
------------------------------
1 | 1 | dog
1 | 2 | cat
3 | 1 | mazda
3 | 2 | honda
3 | 3 | toyota
There are a number of conditions that I will apply when I retrieve the data, however they all revolve around table A. For instance, I may want only events on a certain day, or only events of a certain type.
I believe I have two options for retrieving the data:
Option 1
Perform two queries: first query table A (with a WHERE clause) and store data somewhere, then query table B (joining on table A in order to use same WHERE clause) and "fill in the blanks" in the data that I retrieved from table A.
This option requires SQL Server to perform 2 searches through table A; however, the resulting 2 data sets contain no duplicate data.
Option 2
Perform a single query, joining table A to table B with a LEFT JOIN.
This option only requires one search of table A but the resulting data set will contain many duplicated values.
Conclusion
Is there a "correct" way to do this or do I need to try both ways and see which one is quicker?
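For concreteness, a sketch of the two options (TableA/TableB and the filter value are placeholders matching the example above):

-- Option 1: two queries; no duplicated event data in the results
SELECT Id, TimeStamp, EventTypeId
FROM TableA
WHERE EventTypeId = 12;

SELECT b.EventId, b.Property, b.Value
FROM TableB b
JOIN TableA a ON a.Id = b.EventId
WHERE a.EventTypeId = 12;

-- Option 2: one query; the event columns repeat on every property row
SELECT a.Id, a.TimeStamp, a.EventTypeId, b.Property, b.Value
FROM TableA a
LEFT JOIN TableB b ON b.EventId = a.Id
WHERE a.EventTypeId = 12;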
Example: a join query -
SELECT E.Id, E.Name FROM Employee E JOIN Dept D ON E.DeptId = D.Id
and a subquery something like this -
SELECT E.Id, E.Name FROM Employee E WHERE E.DeptId IN (SELECT Id FROM Dept)
When I consider performance, which of the two queries would be faster, and why?
I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR ...).
As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...) among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
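In SQL Server, for instance, that measurement might look like this (a sketch; run it on a dev box, since dropping clean buffers evicts the cache server-wide):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT E.Id, E.Name FROM Employee E JOIN Dept D ON E.DeptId = D.Id;

-- Clear the buffer cache between runs (development servers only!)
DBCC DROPCLEANBUFFERS;

SELECT E.Id, E.Name FROM Employee E WHERE E.DeptId IN (SELECT Id FROM Dept);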

PostgreSQL multidimensional array search

I am a newbie to PostgreSQL and have been experimenting with it.
I have created a simple table:
CREATE TABLE items_tags (
    ut_id SERIAL PRIMARY KEY,
    item_id integer,
    item_tags_weights text[]
);
where:
item_id - the item these tags are associated with
item_tags_weights - tags associated with the item, including weights
Example entry:
--------------------
ut_id | item_id | item_tags_weights
---------+---------+-------------------------------------------------------------------------------------------------------------------------------
3 | 2 | {{D,1},{B,9},{W,3},{R,18},{F,9},{L,15},{G,12},{T,17},{0,3},{I,7},{E,14},{S,2},{O,5},{M,4},{V,3},{H,2},{X,14},{Q,9},{U,6},{P,16},{N,11},{J,1},{A,12},{Y,15},{C,15},{K,4},{Z,17}}
1000003 | 3 | {{Q,4},{T,19},{P,15},{M,14},{O,20},{S,3},{0,6},{Z,6},{F,4},{U,13},{E,18},{B,14},{V,14},{X,10},{K,18},{N,17},{R,14},{J,12},{L,15},{Y,3},{D,20},{I,18},{H,20},{W,15},{G,7},{A,11},{C,14}}
4 | 4 | {{Q,2},{W,7},{A,6},{T,19},{P,8},{E,10},{Y,19},{N,11},{Z,13},{U,19},{J,3},{O,1},{C,2},{L,7},{V,2},{H,12},{G,19},{K,15},{D,7},{B,4},{M,9},{X,6},{R,14},{0,9},{I,10},{F,12},{S,11}}
5 | 5 | {{M,9},{B,3},{I,6},{L,12},{J,2},{Y,7},{K,17},{W,6},{R,7},{V,1},{0,12},{N,13},{Q,2},{G,14},{C,2},{S,6},{O,19},{P,19},{F,4},{U,11},{Z,17},{T,3},{E,10},{D,2},{X,18},{H,2},{A,2}}
(4 rows)
where:
{D,1} - D = tag, 1 = tag weight
Well, I just want to list the item_ids where the tag = 'U', ordered by tag weight.
One way is to select ALL the tags from the database and do the processing in a high-level language, sorting there and using the result set.
For this, I can do the following:
1) SELECT * FROM items_tags WHERE 'U' = ANY (item_tags_weights)
2) Extract and sort the information, and display it.
But considering that multiple items can be associated with a single tag, and assuming
10 million entries, this method will surely be sluggish.
Any idea how to list these as needed, with CREATE FUNCTION or so?
Any pointers will be helpful.
Many thanks.
Have you considered normalization, i.e. moving the array field into another table? Apart from being easy to query and extend, it's likely to have better performance on larger databases.
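A hedged sketch of that normalization (names are made up; one row per item/tag pair instead of a text[] of pairs):

CREATE TABLE item_tag_weights (
    item_id integer NOT NULL,
    tag     text    NOT NULL,
    weight  integer NOT NULL,
    PRIMARY KEY (item_id, tag)
);

CREATE INDEX idx_tag_weight ON item_tag_weights (tag, weight DESC);

-- "List item_ids where tag = 'U', ordered by weight" becomes a plain,
-- index-friendly query:
SELECT item_id, weight
FROM item_tag_weights
WHERE tag = 'U'
ORDER BY weight DESC;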
