PostgreSQL - Index not used

I have created a table (table1).
I've also written a procedure that updates table1 whenever I update another table: when table2 is updated, a trigger I created on table2 copies a few of the affected records into table1.
I could have created a view instead, but the main reason I didn't is that I can't create an index on a view.
Hence I did it like this. table2 has about 500k rows, and about 220k rows are copied into table1, plus an extra column computed from some criteria that gives each row either 0 or 1.
I've also created an index on table1.
If I execute a COUNT(*) query on table2 (which has about 500k rows and only one index, on the date column), the query executes in 200ms.
But if I execute the same query on table1, it takes double the time it takes on table2.
And if I remove the index on table1, it adds another 500-600ms to the execution time; creating the index on table1 has only shaved off that 500-600ms.
EXPLAIN ANALYZE of the query with 2 columns:

HashAggregate  (cost=80421.85..80422.94 rows=73 width=4) (actual time=6248.826..6248.829 rows=3 loops=1)
  ->  Seq Scan on table1  (cost=0.00..70306.88 rows=2022994 width=4) (actual time=0.048..4203.224 rows=2022994 loops=1)
        Filter: ((date >= '2014-02-01'::date) AND (date <= '2014-04-30'::date))
Total runtime: 6248.895 ms
Table Definition:
CREATE TABLE table1
(
label1 text NOT NULL,
label2 text NOT NULL,
label3 text NOT NULL,
date date NOT NULL,
"mobile no" bigint NOT NULL,
"start time" time without time zone NOT NULL,
"end time" time without time zone NOT NULL,
label4 text NOT NULL,
label5 text NOT NULL,
value1 integer NOT NULL,
count numeric NOT NULL
)
Index Definition :
CREATE INDEX ix_date
ON table1
USING btree
(date);
The COUNT(*) I've given is just an example.
Actually I sum the count column, grouping by label1, label2, label3 and the month extracted from date.
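To make that concrete, the real query looks roughly like this (a sketch based on the description above; the date range is just the one from the plan above):

SELECT label1, label2, label3,
       EXTRACT(MONTH FROM date) AS month,
       SUM(count) AS total_count
FROM table1
WHERE date BETWEEN '2014-02-01' AND '2014-04-30'
GROUP BY label1, label2, label3, EXTRACT(MONTH FROM date);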

Firstly,
I could have created a view instead, but the main reason I didn't is that I can't create an index on a view.
Views are "expanded" when the query is processed, so e.g. a SELECT x FROM my_view JOIN y ... effectively substitutes the view definition directly into your query, and the resulting expanded query can use any applicable indexes on the underlying tables.
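A minimal sketch of what that means, reusing the table and index from the question (the view name is made up):

CREATE VIEW table1_view AS
SELECT label1, label2, label3, date, count
FROM table1;

-- The planner expands table1_view inline, so this query is free to use
-- the ix_date index on table1, exactly as if table1 were queried directly:
SELECT label1, SUM(count) AS total_count
FROM table1_view
WHERE date BETWEEN '2014-02-01' AND '2014-04-30'
GROUP BY label1;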
Secondly,
If I execute a COUNT(*) query on table2 (which has about 500k rows and only one index, on the date column), the query executes in 200ms.
Unfortunately, COUNT(*) queries in PostgreSQL don't usually use indexes, even in recent (9.2+) versions with index-only scans. See https://wiki.postgresql.org/wiki/Index-only_scans#Is_.22count.28.2A.29.22_much_faster_now.3F for a description of why that is. A non-unique (or primary key) index will NEVER be used for COUNT(*).
Thirdly, updating records in an MVCC database such as PostgreSQL creates updated copies of those records instead of updating them in place. This almost always results in significant internal data fragmentation, which is sorely visible if you use drives with slow seek times, such as mechanical drives. If you want COUNT(*) times to scale linearly with table size, either make sure the data is not fragmented (VACUUM FULL ANALYZE plus REINDEX will mostly do the trick), or just use an SSD.
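As a sketch, that maintenance would be the following (note that VACUUM FULL takes an exclusive lock on the table while it runs):

VACUUM FULL ANALYZE table1;
REINDEX TABLE table1;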

Related

Slow two-table query in SQL Server

The application I work on generates an SQL query somewhat like this:
SELECT
    VISIT_VIEW.VISIT_ID, VISIT_VIEW.PATIENT_ID, VISIT_VIEW.MRN_ID,
    VISIT_VIEW.BILL_NO, INSURANCE.INS_PAYOR
FROM
    VISIT_VIEW
LEFT JOIN
    INSURANCE ON VISIT_VIEW.visit_id = INSURANCE._fk_visit
WHERE
    VISIT_VIEW.VISIT_ID IN (1002, 1003, 1005, 1006, 1007, 1008, 1010, 1011, <...>, 1193, 1194, 1195, 1196, 1197, 1198, 1199)
The <...> represents a long list of ids. The size of the list depends on the results of a previous query, and in turn on the parameters selected to generate that query.
The list of IDs can be anywhere from 100 items long to above 2000.
The INSURANCE table is large, over 9 million rows. The visit table is also large, but not quite as large.
As the number of IDs goes up there is a fairly sharp increase from a duration of less than a second to over 15 minutes. The increase starts somewhere around 175 ids.
If the parameters used to generate the query are changed so that the INS_PAYOR column is not selected, and thus there is no left join, the query runs in less than a second, even with over 2000 items in the list of IDs.
The execution plan shows that 97% of the query time is devoted to a clustered seek on the INSURANCE table.
How can I rework this query to get the same results with a less horrific delay?
Do remember that the SQL is being generated by code, not by hand. It is generated from a list of fields (with knowledge of which field belongs to which table) and a list of IDs in the primary table to check. I do have access to the code that does the query generation, and can change it provided that the ultimate results of the query are exactly the same.
Thank you
The <...> represents a long list of ids. The size of the list depends on the results of a previous query
Don't do that.
Do this:
SELECT <...>
FROM VISIT_VIEW
INNER JOIN (
    <previous query goes here>
) t ON VISIT_VIEW.VISIT_ID = t.<ID?>
LEFT JOIN INSURANCE ON VISIT_VIEW.visit_id = INSURANCE._fk_visit
See if you see any improvements using the following...
IF OBJECT_ID('tempdb..#VisitList', 'U') IS NOT NULL
    DROP TABLE #VisitList;

CREATE TABLE #VisitList (
    VISIT_ID INT NOT NULL PRIMARY KEY
);

INSERT #VisitList (VISIT_ID) VALUES
    (1002),(1003),(1005),(1006),(1007),(1008),(1010),(1011),(<...>),(1193),(1194),(1195),(1196),(1197),(1198),(1199);

SELECT
    vv.VISIT_ID,
    vv.PATIENT_ID,
    vv.MRN_ID,
    vv.BILL_NO,
    ix.INS_PAYOR
FROM
    VISIT_VIEW vv
    JOIN #VisitList vl
        ON vv.VISIT_ID = vl.VISIT_ID
    CROSS APPLY (
        SELECT TOP 1
            i.INS_PAYOR
        FROM
            INSURANCE i
        WHERE
            vv.visit_id = i._fk_visit
    ) ix;

SQL Query is slow when ORDER BY statement added

I have a table [Documents] with the following columns:
Name (string)
Status (string)
DateCreated [datetime]
This table has around 1 million records. All three of these columns have an index (a single index for each one).
When I run this query:
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New';
Execution is really fast (300 ms.)
If I run the same query but with the ORDER BY clause, it's really slow (3000 ms)
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New'
order by DateCreated;
I understand that it's searching using another index (DateCreated), but should it really be that much slower? If so, why? Is there anything I can do to speed this query up (a composite index)?
Thanks
BTW: All indexes, including DateCreated, have really low fragmentation; in fact I ran a reorganize and it didn't change a thing.
As far as why the query is slower: the query is required to return the rows "in order", so it either needs to do a sort, or it needs to use an index.
Using the index with a leading column of DateCreated, SQL Server can avoid a sort. But SQL Server would also have to visit the pages in the underlying table to evaluate whether each row is to be returned, looking at the values in the Status and Name columns.
If the optimizer chooses not to use the index with DateCreated as the leading column, then it needs to first locate all of the rows that satisfy the predicates, and then perform a sort operation to get those rows in order. Then it can return the first fifty rows from the sorted set. (SQL Server wouldn't necessarily need to sort the entire set, but it would need to go through that whole set and do enough sorting to guarantee that it has the "first fifty" that need to be returned.)
NOTE: I suspect you already know this, but to clarify: SQL Server honors the ORDER BY before the TOP 50. If you wanted any 50 rows that satisfy the predicates, but not necessarily the 50 rows with the lowest values of DateCreated, you could restructure/rewrite your query to get (at most) 50 rows first and then sort just those, as sketched below.
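A sketch of that restructuring, assuming any 50 matching rows will do (the derived-table alias t is arbitrary):

SELECT t.*
FROM (
    SELECT TOP (50) *
    FROM Documents
    WHERE (Name = 'None' OR Name IS NULL OR Name = '')
      AND Status = 'New'
) AS t
ORDER BY t.DateCreated;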
A couple of ideas to improve performance
Adding a composite index (as other answers have suggested) may offer some improvement, for example:
CREATE INDEX IX_Documents_Status_DateCreated_Name  -- index name is illustrative
    ON Documents (Status, DateCreated, Name);
SQL Server might be able to use that index to satisfy the equality predicate on Status and also return the rows in DateCreated order without a sort operation. SQL Server may also be able to satisfy the predicate on Name from the index, limiting the lookups into the underlying table (needed to get "all" of the columns) to just the rows actually being returned.
For SQL Server 2008 or later, I'd consider a filtered index, depending on the cardinality of Status='New' (that is, if the rows that satisfy the predicate Status='New' are a relatively small subset of the table):
CREATE NONCLUSTERED INDEX Documents_FIX
ON Documents (Status, DateCreated, Name)
WHERE Status = 'New'
I would also modify the query to specify ORDER BY Status, DateCreated, Name so that the ORDER BY clause matches the index; it doesn't really change the order in which the rows are returned.
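With the filtered index above, the rewritten query would presumably look like this (the extra ORDER BY keys don't change the result order, since Status is fixed by the WHERE clause; they just match the index definition):

SELECT TOP 50 *
FROM Documents
WHERE (Name = 'None' OR Name IS NULL OR Name = '')
  AND Status = 'New'
ORDER BY Status, DateCreated, Name;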
As a more complicated alternative, I would consider adding a persisted computed column and adding a filtered index on that
ALTER TABLE Documents
ADD new_none_date_created AS
CASE
WHEN Status = 'New' AND COALESCE(Name,'') IN ('','None') THEN DateCreated
ELSE NULL
END
PERSISTED
;
CREATE NONCLUSTERED INDEX Documents_FIXP
ON Documents (new_none_date_created)
WHERE new_none_date_created IS NOT NULL
;
Then the query could be re-written:
SELECT TOP 50 *
FROM Documents
WHERE new_none_date_created IS NOT NULL
ORDER BY new_none_date_created
;
If the DateCreated column represents insertion time into the table, you can add an integer identity column and order by that integer column instead; a sketch follows.
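A minimal sketch of that idea (the column and index names below are made up):

ALTER TABLE Documents
    ADD DocumentSeq INT IDENTITY(1,1) NOT NULL;  -- hypothetical column name

CREATE NONCLUSTERED INDEX IX_Documents_DocumentSeq  -- hypothetical index name
    ON Documents (DocumentSeq);

SELECT TOP 50 *
FROM Documents
WHERE (Name = 'None' OR Name IS NULL OR Name = '')
  AND Status = 'New'
ORDER BY DocumentSeq;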
You need an index on two columns: (Name, DateCreated). The order of the columns in the index is important, so replace your index on just Name with a new index on the two columns (Name, DateCreated).
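A sketch of that change (the index names are assumptions, since the existing index names aren't given):

CREATE NONCLUSTERED INDEX IX_Documents_Name_DateCreated
    ON Documents (Name, DateCreated);

-- then drop the old single-column index on Name, whatever it is called, e.g.:
-- DROP INDEX IX_Documents_Name ON Documents;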

Performance Strategy for growing data

I know that performance tuning is something that needs to be done specifically for each environment, but I have put in my best effort to make the question clear, in case I am missing something among the possible improvements.
I have a table [TestExecutions] in SQL Server 2005. It has around 0.2 million records as of today and is expected to grow to 5 million within a couple of months.
CREATE TABLE [dbo].[TestExecutions]
(
[TestExecutionID] [int] IDENTITY(1,1) NOT NULL,
[OrderID] [int] NOT NULL,
[LineItemID] [int] NOT NULL,
[Manifest] [char](7) NOT NULL,
[RowCompanyCD] [char](4) NOT NULL,
[RowReferenceID] [int] NOT NULL,
[RowReferenceValue] [char](3) NOT NULL,
[ExecutedTime] [datetime] NOT NULL
)
CREATE INDEX [IX_TestExecutions_OrderID]
ON [dbo].[TestExecutions] ([OrderID])
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])
I have the following two queries for the same purpose (Query 2 and Query 3). With 100 records in #OrdersForRC, Query 2 performs better (39% vs 47%), whereas with 10000 records in #OrdersForRC, Query 3 performs better (53% vs 33%), according to the execution plans.
In the initial few months of use, the #OrdersForRC table will have close to 100 records. It will gradually increase to 2500 records over a couple of months.
Of the following two approaches, which one is better for such an incrementally growing scenario? Or is there any strategy to make one approach work better than the other even as the data grows?
Note: In Plan 2, the first query uses a Hash Match.
References
query optimizer operator choice - nested loops vs hash match (or merge)
Execution Plan Basics — Hash Match Confusion
Test Query
CREATE TABLE #OrdersForRC
(
OrderID INT
)
INSERT INTO #OrdersForRC
--SELECT DISTINCT TOP 100 OrderID FROM [TestExecutions]
SELECT DISTINCT TOP 5000 OrderID FROM LWManifestReceiptExecutions
--QUERY 2:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] (NOLOCK) H
INNER JOIN #OrdersForRC R
ON R.OrderID = H.OrderID
--QUERY 3:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] (NOLOCK) H
WHERE OrderID IN (SELECT OrderID FROM #OrdersForRC)
DROP TABLE #OrdersForRC
Plan 1
Plan 2
As commented above, you have not specified the table definition of LWManifestReceiptExecutions or how many rows it has.
You are selecting TOP N rows without an ORDER BY. Do you want N random OrderIDs, or does a specific order matter?
If the order does matter, you can create an index on the column you need in the ORDER BY.
If OrderID is unique in the [dbo].[TestExecutions] table, then you should mark the index as UNIQUE: drop and recreate it as follows.
Drop Index [IX_TestExecutions_OrderID] ON [dbo].[TestExecutions]
CREATE UNIQUE INDEX [IX_TestExecutions_OrderID]
ON [dbo].[TestExecutions] ([OrderID])
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])
You said that the data keeps growing and will reach millions of rows in a couple of months.
No need to worry: SQL Server can easily handle these queries with a properly built schema and indexes.
When this data model starts hurting you can look at other options, but not now; I have seen people handle billions of rows in SQL Server.
I can see you are comparing the queries on the basis of query cost and concluding that the query with the higher percentage is more expensive.
That is not always the case. Query cost is based on the aggregate subtree cost of all iterators in the query plan, and the total estimated cost of an iterator is a simple sum of its I/O and CPU components.
The cost values represent expected execution times (in seconds) on a particular reference hardware configuration, but with modern hardware these costs might be irrelevant.
Now coming to your queries: you have expressed two queries to get the result, but they are not identical.
In Plan 1, Query 1 (expressed with a JOIN):
The query optimizer chooses a nested loop join, which is a good choice for this particular scenario: for every row in #OrdersForRC, the OrderID key is used to seek the matching value in dbo.[TestExecutions], until all rows are matched.
In Plan 1, Query 2 (expressed with IN):
The optimizer is doing the same thing as in the first query, but there is an extra Distinct Sort (sort plus stream aggregate). The reasoning behind it is that you have expressed this query with IN, and the table #OrdersForRC can contain duplicate rows; the extra step is needed just to eliminate those.
In Plan 2, Query 1 (expressed with a JOIN):
Now that there are 1000 rows in #OrdersForRC, the optimizer chooses a hash join over a loop join, because a loop join over 1000 rows costs more than a hash join, and the rows are unordered and can contain NULLs as well, so a hash join is the perfect strategy here.
In Plan 2, Query 2 (expressed with IN):
The optimizer has chosen a Distinct Sort for the same reason as in Plan 1, Query 2, and then a Merge Join, because the rows are now sorted on the ID column for both tables.
If you just mark the temp table column as NOT NULL and UNIQUE, it is more likely that you will get the same execution plan for both the IN and the JOIN versions.
CREATE TABLE #OrdersForRC
(OrderID INT not null Unique)
Execution plan

Why this query is running so slow?

This query runs very fast (<100 msec):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
If I add just a time filter, it takes too long (22 seconds!):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].Time > '2015-04-10'
AND [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
I tried adding an index on the [Time] column of the Sms table, but the optimizer does not seem to use it. I tried forcing it with WITH (INDEX(Ix_Sms_Time)), but to my surprise it takes even more time (29 seconds!).
Here is the actual execution plan:
The execution plan is the same for both queries. The tables mentioned here have 5M to 8M rows (indexes are < 1% fragmented and stats are updated). I am using MS SQL Server 2008 R2 on a 16-core, 32GB-memory Windows 2008 R2 machine.
Does it help when you force the time filter to kick in only after the client filter has run? For instance, like in this example:
;WITH ClientData AS (
SELECT
[E2].[CompanyId]
,[E2].[Time]
,[E1].[Id]
,[E1].[Status]
FROM [dbo].[SplittedSms] AS [E1]
INNER JOIN [dbo].[Sms] AS [E2]
ON [E1].[SmsId] = [E2].[Id]
WHERE [E2].[CompanyId] = 4563
AND ([E1].[NotifiedToClient] IS NULL)
)
SELECT TOP 10
[CompanyId]
,[Id]
,[Status]
FROM ClientData
WHERE [Time] > '2015-04-10'
Create an index on Sms with the following Index Key Columns (in this order):
CompanyID
Time
You may or may not need to add Id as an Included Column.
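As a sketch, that index might look like this (the name is made up; whether Id is needed as an included column depends on your clustered index, as noted):

CREATE NONCLUSTERED INDEX IX_Sms_CompanyId_Time  -- hypothetical name
    ON dbo.Sms (CompanyId, [Time])
    INCLUDE (Id);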
What datatype is your Time column?
If it's datetime, try converting your '2015-04-10' into the equivalent data type, so that it can use the index:
DECLARE @test datetime
SET @test = '2015-04-10'
Then modify your condition:
[Extent2].Time > @test
SQL Server implicitly casts to the matching data type if there is a data-type mismatch, and a function or cast operation applied to the column can prevent indexes from being used.
I'm on the same track as @JonTirjan: the index with just Time results in a lot of key lookups, so you should try at least the following:
create index xxx on Sms (Time, CompanyId) include (Id)
or
create index xxx on Sms (CompanyId, Time) include (Id)
If Id is your clustered index key, then it's not needed in the INCLUDE clause. If a significant part of your data belongs to CompanyId 4563, it might be OK to have it as an included column too.
The percentages you see in the actual plan are just estimates based on row count assumptions, so they are sometimes totally wrong. Looking at the actual number of rows / executions plus the STATISTICS IO output should give you an idea of what's actually happening.
Two things come to mind:
By adding an extra restriction it becomes "harder" for the database to find the first 10 rows that match your restrictions. Finding the first 10 rows out of, say, 10,000 matching rows (from a total of 1 million) is easier than finding the first 10 rows out of maybe 100 matching rows (from the same 1 million).
The index is probably not being used because it was created on a datetime column, which is not very efficient if you are also storing the time of day in it. You might want to create a clustered index on the [Time] column (but then you would have to remove the clustered index which is now on the [CompanyId] column), or you could create a computed column that stores the date part of the [Time] column, create an index on that computed column, and filter on it; a sketch follows.
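A sketch of that computed-column approach (the column and index names are made up; CAST to date is deterministic, so the column can be indexed):

ALTER TABLE dbo.Sms
    ADD TimeDate AS CAST([Time] AS date) PERSISTED;  -- hypothetical column name

CREATE NONCLUSTERED INDEX IX_Sms_TimeDate  -- hypothetical index name
    ON dbo.Sms (TimeDate);

-- the query would then filter on the computed column instead, e.g.:
-- WHERE [Extent2].TimeDate > '2015-04-10'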
I found out that there was no index on the foreign key column (SmsId) of the SplittedSms table. I created one, and now the second query is almost as fast as the first.
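For reference, that index presumably looked something like this (the name is an assumption):

CREATE NONCLUSTERED INDEX IX_SplittedSms_SmsId  -- hypothetical name
    ON dbo.SplittedSms (SmsId);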
The execution plan now:
Thanks everyone for the effort.

weird execution plan of a SQL Server query

Context: SQL Server 2008. There are 2 tables to inner join.
The fact table, which has 40 million rows, contains the patient key and the medications administered and other facts. There is a unique index (nonclustered) on medication key and patient key combined in that order.
The dimension table is the medication list (70 rows).
The join is to get the medication code (business code) based on medication key (surrogate key).
Query:
SELECT a.PKey, a.SomeFact, b.MCode
FROM tblFact a
JOIN tblDIM b ON a.MKey = b.MKey
All the columns returned are integer.
The above query runs in 7 minutes and its execution plan shows the index on (MKey,PKey) is used. The index was rebuilt right before the run.
When I disabled the index on the fact table (or copy data to a new table with same structure but without index), the same query takes only 1:40 minutes.
IO Statistics are also stunning.
With index: Table 'tblFACT'. Scan count 70, logical reads 190296338, physical reads 685138, read-ahead reads 98713
Without index: Table 'tblFACT_copy'. Scan count 17, logical reads 468891, physical reads 0, read-ahead reads 419768
Question: why does it try to use the index and head down the inefficient path?
You need to add SomeFact as an INCLUDE on the tblFact index to make it covering.
Currently, the table will be accessed twice: once for the index and then again for a lookup to get SomeFact, either as a RID or key lookup (depending on whether there is a clustered index).
This doesn't apply to tblDIM, because I assume that MKey is its clustered index, which makes it covering implicitly.
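As a sketch, the covering index would look something like this (the index name is made up; the key columns mirror the existing unique index on medication key and patient key, with SomeFact added as an included column):

CREATE UNIQUE NONCLUSTERED INDEX IX_tblFact_MKey_PKey  -- hypothetical name
    ON tblFact (MKey, PKey)
    INCLUDE (SomeFact);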
In rare cases, the database chooses an incorrect execution plan. In this case, the index is used for the join, but since all data is fetched from both tables, it would be faster to just scan the whole table.
The indexed version will be much faster if you add a WHERE clause to the query, because without indexes it will still need to scan the whole table, instead of grabbing just the handful of records it needs.
There may be directives to encourage the database not to use indexes or to use different indexes, but I don't know SQL Server that well.
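One example of such a directive in SQL Server is a query-level join hint; this is only a sketch, and whether it actually helps here would need to be tested. A hash join hint pushes the optimizer toward scanning both inputs rather than doing per-row index lookups:

SELECT a.PKey, a.SomeFact, b.MCode
FROM tblFact a
JOIN tblDIM b ON a.MKey = b.MKey
OPTION (HASH JOIN);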
Are your statistics up to date? Check with:
SELECT object_name    = OBJECT_NAME(ind.object_id)
     , IndexName      = ind.name
     , StatisticsDate = STATS_DATE(ind.object_id, ind.index_id)
FROM sys.indexes ind
ORDER BY STATS_DATE(ind.object_id, ind.index_id) DESC;
Update with:
exec sp_updatestats;
