I have "UserLog" table with 15 millions rows.
On this table I have a cluster index on User_ID field which is of type bigint identity, and a non clustered index on User_uid field which is of type varchar(35) (a fake uniqueidentifier).
On my application we can have 2 categories of users connection. 1 of them concerns only 0.007% of rows (about 1150 raws over 15 millions) and the 2nd concerns the remaining rows (99%).
The objective is to improve performance of the 0.007% users connection.
That's why I create a 'split' field "Userlog_ID" with type bit with default value of 0. So for each user connection we insert a new row in Userlog (with 0 as a value for User_log).
This field (User_Log) will then be update and it will take either 0 (for more then 99% of rows) or 1 (for 0.007% of rows) depending on the user category.
I create then a non clustered index on this field (User_log).
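Roughly, the split column and the index variants I tried look like this (a sketch; the exact names and options in my real DDL may differ):

-- Sketch of the split column with its default of 0
ALTER TABLE dbo.UserLog ADD User_Log bit NOT NULL CONSTRAINT DF_UserLog_User_Log DEFAULT (0);

-- Plain non-clustered index on the split column
CREATE NONCLUSTERED INDEX IX_UserLog_User_Log ON dbo.UserLog (User_Log);

-- One possible filtered shape, containing only the 0.007% of rows with User_Log = 1
CREATE NONCLUSTERED INDEX IX_UserLog_User_UID_filtered
    ON dbo.UserLog (User_UID)
    WHERE User_Log = 1;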
The SELECT statement I want to optimize is:
SELECT User_UID, User_LastAuthentificationDate,
       Language_ID, User_SecurityString
FROM dbo.UserLog
WHERE User_Active = 1
AND User_UID = '00F5AA38-C288-48C6-8ED1922601819486'
So the idea is now to add a filter on the User_Log field to optimize the performance (specifically the index seek operator), but only when the user belongs to category 1 (the 0.007%):
SELECT User_UID, User_LastAuthentificationDate, Language_ID, User_SecurityString
FROM dbo.UserLog
WHERE User_Active = 1
AND User_UID = '00F5AA38-C288-48C6-8ED1922601819486'
AND User_Log = 1
In my mind, the idea was that, since we add this filter, the index seek would perform better because we now have a smaller result set.
Unfortunately, when I compare the two queries with the estimated execution plan, I get 50% for each query. For both queries the optimizer uses an index seek on the User_UID non-clustered index and then a key lookup on the clustered index (User_ID).
So, in conclusion, by adding the split field and a non-clustered index on it (either normal or filtered), I don't improve performance.
Can anyone explain why? Maybe my reasoning and my interpretation are totally wrong.
Thank you
I have a table where I keep a record of who is following whom in a Twitter-like application:
\d follow
Table "public.follow" .
Column | Type | Modifiers
---------+--------------------------+-----------------------------------------------------
xid | text |
followee | integer |
follower | integer |
id | integer | not null default nextval('follow_id_seq'::regclass)
createdAt | timestamp with time zone |
updatedAt | timestamp with time zone |
source | text |
Indexes:
"follow_pkey" PRIMARY KEY, btree (id)
"follow_uniq_users" UNIQUE CONSTRAINT, btree (follower, followee)
"follow_createdat_idx" btree ("createdAt")
"follow_followee_idx" btree (followee)
"follow_follower_idx" btree (follower)
The number of entries in the table is more than a million, and when I run EXPLAIN ANALYZE on the query I get this:
explain analyze SELECT "follow"."follower"
FROM "public"."follow" AS "follow"
WHERE "follow"."followee" = 6
ORDER BY "follow"."createdAt" DESC
LIMIT 15 OFFSET 0;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.43..353.69 rows=15 width=12) (actual time=5.456..21.497 rows=15 loops=1)
-> Index Scan Backward using follow_createdat_idx on follow (cost=0.43..61585.45 rows=2615 width=12) (actual time=5.455..21.488 rows=15 loops=1)
Filter: (followee = 6)
Rows Removed by Filter: 62368
Planning time: 0.068 ms
Execution time: 21.516 ms
Why is it doing a backward index scan on follow_createdat_idx, when execution could have been much faster if it had used follow_followee_idx?
This query takes around 33 ms when run the first time, and subsequent calls take around 22 ms, which I feel is on the higher side.
I am using Postgres 9.5 provided by Amazon RDS. Any idea what could be going wrong here?
The multicolumn index on (follower, "createdAt") that user1937198 suggested is perfect for the query - as you found in your test already.
Since "createdAt" can be NULL (not defined NOT NULL), you may want to add NULLS LAST to query and index:
...
ORDER BY "follow"."createdAt" DESC NULLS LAST
And:
"follow_follower_createdat_idx" btree (follower, "createdAt" DESC NULLS LAST)
More:
PostgreSQL sort by datetime asc, null first?
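For reference, that index written out as a CREATE INDEX statement (a sketch matching the listing above):

CREATE INDEX follow_follower_createdat_idx
    ON public.follow (follower, "createdAt" DESC NULLS LAST);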
There are minor other performance implications:
The multicolumn index on (follower, "createdAt") is 8 bytes per row bigger than the simple index on (follower) - 44 bytes vs 36. More (btree indexes have mostly the same page layout as tables):
Making sense of Postgres row sizes
Columns involved in an index in any way cannot be changed with a HOT update, so adding more columns to an index might block this optimization - but an update to "createdAt" seems particularly unlikely given the column name, and since you already have another index on just ("createdAt"), it's not an issue anyway. More:
PostgreSQL Initial Database Size
There is no downside to having another index on just ("createdAt"), other than the maintenance cost for each index (which affects write performance, not read performance). Both indexes support different queries. You may or may not still need the index on just ("createdAt"). Detailed explanation:
Is a composite index also good for queries on the first field?
In SQL Server, if I try the following query:
select id from table
order by id
offset 1000000 ROWS
fetch next 1000000 ROWS ONLY;
How will SQL Server process this? What strategy does SQL Server use?
1. Sort the whole table first and then select the 1 million rows we need, or
2. Sort only part of the table and then return the 1 million rows we need.
I assume it is the 2nd option. If so, how does SQL Server decide which range of the table to sort?
Edit 1:
I am asking this question to understand what could make the query slow. I am testing with two queries:
--Query 1:
select id from table
order by id
offset 1 ROWS
fetch next 1 ROWS ONLY;
and
--Query 2:
select id from table
order by id
offset 1000000000 ROWS
fetch next 1 ROWS ONLY;
I found that the second query takes about 30 minutes to finish, while the first takes almost 0 seconds.
So I am curious what causes this difference. Do the two spend the same time on the ORDER BY (and does it even really sort the whole table? The id is the clustered index column of the table, and I cannot imagine that it takes 0 seconds to sort a terabyte table.)
If the sorting takes the same time, the only remaining difference would be the clustered index scan: the first query only needs to scan the first 1 or 10 (a small number of) rows, while the second query needs to scan a much bigger number of rows (> 1,000,000,000). But I am not quite sure whether this is correct.
Thank you for your help!
Let me take a simple example:
order by id
offset 50 rows fetch next 25 rows only
For the above query, the steps would be:
1. The table has to be sorted by id (if it is not, you pay the penalty of a sort; there is no partial sort, it is always a full sort).
2. Then scan 50 + 25 rows (paying the cost of 75 rows) and return only 25 rows.
Below is an example from an orders table I have (orderid is the PK, so it is already sorted); you can see that even though we are getting only 20 rows, we are paying the cost of 120 rows...
Coming to your question, there is no partial sort (which implies the first option regarding the sort), even if you try to return just one row, like below:
select top 1 * from table
order by orderid
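If you want to see that cost yourself, a sketch like the one below shows the reads with SET STATISTICS IO (using the question's placeholder name [table], so treat the names as assumptions):

-- Sketch only: the scan still reads OFFSET + FETCH rows even though
-- only FETCH rows are returned.
SET STATISTICS IO ON;

SELECT id
FROM [table]
ORDER BY id
OFFSET 50 ROWS FETCH NEXT 25 ROWS ONLY;
-- Reads the first 50 + 25 = 75 rows in index order, discards the first 50.

SET STATISTICS IO OFF;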
Let's say we have a table with 6 million records. There are 16 integer columns and a few text columns. It is a read-only table, so every integer column has an index.
Every record is around 50-60 bytes.
The table name is "Item".
The server is: 12 GB RAM, 1.5 TB SATA, 4 cores. The whole server is dedicated to Postgres.
There are many more tables in this database, so RAM does not cover the whole database.
I want to add to the "Item" table a column "a_elements" (an array of big integers).
Every record would have no more than 50-60 elements in this column.
After that I would create a GIN index on this column, and a typical query should look like this:
select * from item where ...... and '{5}' <@ a_elements;
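The GIN index itself would presumably be created along these lines (a sketch; the index name is an assumption):

CREATE INDEX item_a_elements_gin_idx ON item USING gin (a_elements);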
I also have a second, more classical, option: do not add the column a_elements to the item table, but create a table elements with two columns:
id_item
id_element
This table would have around 200 million records.
I am able to do partitioning on these tables, so the number of records would be reduced to 20 million in table elements and 500 K in table item.
The second option's query looks like this:
select item.*
from item
left join elements on (item.id_item=elements.id_item)
where ....
and 5 = elements.id_element
I wonder which option would be better from a performance point of view.
Is Postgres able to use many different indexes together with the GIN index (option 1) in a single query?
I need to make a good decision, because importing this data will take me 20 days.
I think you should use an elements table:
Postgres will be able to use statistics to predict how many rows will match before executing the query, so it will be able to use the best query plan (this is more important if your data is not evenly distributed);
you'll be able to localize query data using CLUSTER elements USING elements_id_element_idx;
when Postgres 9.2 is released you will be able to take advantage of index-only scans;
But I've made some tests with 10M elements:
create table elements (id_item bigint, id_element bigint);
insert into elements
select (random()*524288)::int, (random()*32768)::int
from generate_series(1,10000000);
\timing
create index elements_id_item on elements(id_item);
Time: 15470,685 ms
create index elements_id_element on elements(id_element);
Time: 15121,090 ms
select relation, pg_size_pretty(pg_relation_size(relation))
from (
select unnest(array['elements','elements_id_item', 'elements_id_element'])
as relation
) as _;
relation | pg_size_pretty
---------------------+----------------
elements | 422 MB
elements_id_item | 214 MB
elements_id_element | 214 MB
create table arrays (id_item bigint, a_elements bigint[]);
insert into arrays select id_item, array_agg(id_element) from elements group by id_item;
create index arrays_a_elements_idx on arrays using gin (a_elements);
Time: 22102,700 ms
select relation, pg_size_pretty(pg_relation_size(relation))
from (
select unnest(array['arrays','arrays_a_elements_idx']) as relation
) as _;
relation | pg_size_pretty
-----------------------+----------------
arrays | 108 MB
arrays_a_elements_idx | 73 MB
So, on the other hand, the array is smaller and has a smaller index. I'd do some tests with 200M elements before making a decision.
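For completeness, the array-side lookup would be a containment query along these lines (a sketch; this query was not part of the timings above):

-- Find items whose array contains element 5; can use the GIN index.
SELECT id_item FROM arrays WHERE a_elements @> ARRAY[5::bigint];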
Let's say I have a Product table in a shopping site's database that keeps the description, price, etc. of the store's products. What is the most efficient way to let my client re-order these products?
I created an Order column (integer) to use for sorting records, but that gives me some headaches regarding performance, due to the primitive method I use to change the order of every record after the one I actually need to change. An example:
Id Order
5 3
8 1
26 2
32 5
120 4
Now what can I do to change the order of the record with Id = 26 to 3?
What I did was create a procedure which checks whether there is a record in the target order (3) and updates the order of the row (Id = 26) if not. If there is a record in the target order, the procedure executes itself, passing that row's Id with target order + 1 as parameters.
That causes an update of every single record after the one I want to change, to make room:
Id Order
5 4
8 1
26 3
32 6
120 5
So what would a smarter person do?
I use SQL Server 2008 R2.
Edit:
I need the Order column of an item to be enough for sorting, with no secondary keys involved. The Order column alone must specify a unique place for its record.
In addition to all this, I wonder if I could implement something like a linked list: a 'Next' column instead of an 'Order' column that keeps the next item's Id. But I have no idea how to write the query that retrieves the records in the correct order. If anyone has an idea about this approach as well, please share.
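Something along these lines is what I imagine for walking such a chain (just a sketch, assuming a NextId column where NULL marks the last item), but I don't know whether it is the right way:

-- Sketch only: order the products by following the NextId pointers.
WITH OrderedProducts AS (
    -- Anchor: the head of the chain (no other row points to it)
    SELECT p.Id, p.NextId, 1 AS Position
    FROM Product AS p
    WHERE NOT EXISTS (SELECT 1 FROM Product AS x WHERE x.NextId = p.Id)

    UNION ALL

    -- Recursive step: follow the NextId pointer to the next product
    SELECT c.Id, c.NextId, o.Position + 1
    FROM Product AS c
    INNER JOIN OrderedProducts AS o ON c.Id = o.NextId
)
SELECT Id
FROM OrderedProducts
ORDER BY Position
OPTION (MAXRECURSION 0);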
UPDATE Product SET [Order] = [Order] + 1 WHERE [Order] >= @NewOrderValue
Though over time you'll get larger and larger "spaces" in your order, it will still "sort".
This will add 1 to the value being changed and every value after it in one statement, but the above statement is still true: larger and larger "spaces" will form in your order, possibly getting to the point of exceeding an INT value.
An alternate solution, given the desire for no spaces:
Imagine a procedure UpdateSortOrder with parameters @NewOrderVal, @IDToChange, @OriginalOrderVal.
It is a two-step process, depending on whether the new order is moving the row up or down the sort:
IF @NewOrderVal < @OriginalOrderVal -- moving down the chain
BEGIN
    -- Create space for the movement; no point in changing the original
    UPDATE Product SET [Order] = [Order] + 1
    WHERE [Order] BETWEEN @NewOrderVal AND @OriginalOrderVal - 1;
END

IF @NewOrderVal > @OriginalOrderVal -- moving up the chain
BEGIN
    -- Create space for the movement; no point in changing the original
    UPDATE Product SET [Order] = [Order] - 1
    WHERE [Order] BETWEEN @OriginalOrderVal + 1 AND @NewOrderVal;
END

-- Finally update the one we moved to the correct value
UPDATE Product SET [Order] = @NewOrderVal WHERE Id = @IDToChange;
Regarding best practice: most environments I've been in typically want things grouped by category and sorted alphabetically or based on "popularity on sale", which negates the need to provide a user-defined sort.
Use the old trick that BASIC programs (amongst other places) used: jump the numbers in the order column by 10 or some other convenient increment. You can then insert a single row (indeed, up to 9 rows, if you're lucky) between two existing numbers (that are 10 apart). Or you can move row 370 to 565 without having to change any of the rows from 570 upwards.
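A tiny sketch of the idea, reusing the Product table from the question and assuming the [Order] values were stored in steps of 10:

-- Existing [Order] values: 10, 20, 30, ... so there is room between neighbours.
-- Move the product with Id = 26 between the rows ordered 10 and 20
-- without touching any other row:
UPDATE Product SET [Order] = 15 WHERE Id = 26;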
Here is an alternative approach using a common table expression (CTE).
This approach respects a unique index on the SortOrder column, and will close any gaps in the sort order sequence that may have been left over from earlier DELETE operations.
/* For example, move Product with id = 26 into position 3 */
DECLARE @id int = 26
DECLARE @sortOrder int = 3
;WITH Sorted AS (
SELECT Id,
ROW_NUMBER() OVER (ORDER BY SortOrder) AS RowNumber
FROM Product
WHERE Id <> @id
)
UPDATE p
SET p.SortOrder =
(CASE
WHEN p.Id = @id THEN @sortOrder
WHEN s.RowNumber >= @sortOrder THEN s.RowNumber + 1
ELSE s.RowNumber
END)
FROM Product p
LEFT JOIN Sorted s ON p.Id = s.Id
It is very simple. You need a "cardinality hole".
Structure: you need 2 columns:
pk = 32-bit int
order = 64-bit bigint (BIGINT, NOT DOUBLE!!!)
Insert/Update:
When you insert the first record, set order = round(max_bigint / 2).
If you insert at the beginning of the table, set order = round("order of first record" / 2).
If you insert at the end of the table, set order = "order of last record" + round((max_bigint - "order of last record") / 2).
If you insert in the middle, set order = "order of record before" + round(("order of record after" - "order of record before") / 2), as sketched below.
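A minimal sketch of the middle-insert case in SQL (the Product/Id/[Order] names are borrowed from the question and are assumptions; the idea is simply "take the midpoint of the two neighbours"):

-- Sketch only: place a row halfway between its two neighbours.
-- No other rows need to be touched.
DECLARE @before bigint = (SELECT [Order] FROM Product WHERE Id = 8);   -- row just before the target position
DECLARE @after  bigint = (SELECT [Order] FROM Product WHERE Id = 26);  -- row just after the target position

UPDATE Product
SET [Order] = @before + (@after - @before) / 2   -- midpoint leaves room on both sides
WHERE Id = 120;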
This method has a very large cardinality. If you hit a constraint error, or if you think the remaining cardinality is too small, you can rebuild (normalize) the order column.
In the best case, after normalization (with this structure), you can have a "cardinality hole" of 32 bits between adjacent records.
It is very simple and fast!
Remember: NO DOUBLE!!! Only integers - order must be an exact value!
One solution I have used in the past, with some success, is to use a 'weight' instead of an 'order'. Weight works as the name suggests: the heavier an item (i.e. the lower the number), the further it sinks to the bottom, and the lighter (the higher the number), the further it rises to the top.
In the event that I have multiple items with the same weight, I assume they are of the same importance and order them alphabetically.
This means your SQL will look something like this:
ORDER BY weight, itemName
Hope that helps.
I am currently developing a database with a tree structure that needs to be ordered. I use a linked-list kind of method that is ordered on the client (not in the database). Ordering could also be done in the database via a recursive query, but that is not necessary for this project.
I made this document that describes how we are going to implement storage of the sort order, including an example in PostgreSQL. Please feel free to comment!
https://docs.google.com/document/d/14WuVyGk6ffYyrTzuypY38aIXZIs8H-HbA81st-syFFI/edit?usp=sharing
I am looking for a way to retrieve the "surrounding" rows in an NHibernate query, given a primary key and a sort order.
E.g. I have a table with log entries, and I want to display the entry with primary key 4242 along with the previous 5 entries and the following 5 entries, ordered by date (there is no direct relation between date and primary key). Such a query should return 11 rows in total (as long as we are not close to either end).
The log entry table can be huge, and retrieving everything to figure it out is not possible.
Is there such a concept as a row number that can be used from within NHibernate? The underlying database is either going to be SQLite or Microsoft SQL Server.
Edit: Added a sample.
Imagine data such as the following:
Id Time
4237 10:00
4238 10:00
1236 10:01
1237 10:01
1238 10:02
4239 10:03
4240 10:04
4241 10:04
4242 10:04 <-- requested "center" row
4243 10:04
4244 10:05
4245 10:06
4246 10:07
4247 10:08
When requesting the entry with primary key 4242, we should get the rows 1237, 1238 and 4239 to 4247. The order is by Time, Id.
Is it possible to retrieve the entries in a single query (which can obviously include subqueries)? Time is a non-unique column, so several entries have the same value, and in this example it is not possible to change the resolution in a way that makes it unique!
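To illustrate what I mean by a row number, in plain T-SQL I would picture something like the sketch below (the LogEntry table and column names are just placeholders, and this is SQL, not NHibernate):

-- Sketch only: number all rows in (Time, Id) order, find the centre row's
-- number, and take the 5 rows on either side of it.
WITH Numbered AS (
    SELECT Id, [Time],
           ROW_NUMBER() OVER (ORDER BY [Time], Id) AS rn
    FROM LogEntry
)
SELECT n.Id, n.[Time]
FROM Numbered AS n
CROSS JOIN (SELECT rn FROM Numbered WHERE Id = 4242) AS c
WHERE n.rn BETWEEN c.rn - 5 AND c.rn + 5
ORDER BY n.rn;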
"there is no direct relation between date and primary key" means, that the primary keys are not in a sequential order?
Then I would do it like this:
Item middleItem = Session.Get<Item>(id);

IList<Item> previousFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Le("Time", middleItem.Time))
    .AddOrder(Order.Desc("Time"))
    .SetMaxResults(5)
    .List<Item>();

IList<Item> nextFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Gt("Time", middleItem.Time))
    .AddOrder(Order.Asc("Time"))
    .SetMaxResults(5)
    .List<Item>();
There is the risk of having several items with the same time.
Edit
This should work now.
Item middleItem = Session.Get<Item>(id);

IList<Item> previousFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Le("Time", middleItem.Time))           // less than or equal
    .Add(Expression.Not(Expression.IdEq(middleItem.Id)))   // but not the middle item itself
    .AddOrder(Order.Desc("Time"))
    .SetMaxResults(5)
    .List<Item>();

IList<Item> nextFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Gt("Time", middleItem.Time))           // greater than
    .AddOrder(Order.Asc("Time"))
    .SetMaxResults(5)
    .List<Item>();
This should be relatively easy with NHibernate's Criteria API:
IList<LogEntry> logEntries = session.CreateCriteria(typeof(LogEntry))
.Add(Expression.InG<int>(Projections.Property("Id"), listOfIds))
.AddOrder(Order.Desc("EntryDate"))
.List<LogEntry>();
Here your listOfIds is just a strongly typed list of integers representing the ids of the entries you want to retrieve (integers 4242-5 through 4242+5).
Of course, you could also add Expressions that let you retrieve Ids greater than 4242-5 and smaller than 4242+5.
Stefan's solution definitely works, but a better way exists using a single select and nested subqueries:
ICriteria crit = NHibernateSession.CreateCriteria(typeof(Item));
DetachedCriteria dcMiddleTime =
DetachedCriteria.For(typeof(Item)).SetProjection(Property.ForName("Time"))
.Add(Restrictions.Eq("Id", id));
DetachedCriteria dcAfterTime =
DetachedCriteria.For(typeof(Item)).SetMaxResults(5).SetProjection(Property.ForName("Id"))
.Add(Subqueries.PropertyGt("Time", dcMiddleTime));
DetachedCriteria dcBeforeTime =
DetachedCriteria.For(typeof(Item)).SetMaxResults(5).SetProjection(Property.ForName("Id"))
.Add(Subqueries.PropertyLt("Time", dcMiddleTime));
crit.AddOrder(Order.Asc("Time"));
crit.Add(Restrictions.Eq("Id", id) || Subqueries.PropertyIn("Id", dcAfterTime) ||
Subqueries.PropertyIn("Id", dcBeforeTime));
return crit.List<Item>();
This is NHibernate 2.0 syntax, but the same holds true for earlier versions, where you use Expression instead of Restrictions.
I have tested this in a test application and it works as advertised.