I have created a view that returns data from more than one table using a join. When I select from that view without an ORDER BY clause, the query executes in about 1 second or less. But when I add ORDER BY to my SELECT, it takes about 27 seconds to return just the TOP(15) records from that view.
Here is the query I run to get data from the view:
SELECT TOP(15) *
FROM V_transaction
ORDER BY time_stamp DESC
Note: the view contains about 300,000 records in total.
What can I change in my view's design to get better performance?
The first thing that pops into mind is creating an index on time_stamp in the view. If you don't want to (or can't) create an indexed view, you could create an index on the column in the underlying table that you are getting that value from. This should increase your query's performance.
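A minimal sketch of both options, assuming the base table behind V_transaction is named dbo.transactions (a hypothetical name) and owns the time_stamp column:

-- Option 1: index time_stamp on the underlying (hypothetical) base table
CREATE NONCLUSTERED INDEX IX_transactions_time_stamp
    ON dbo.transactions (time_stamp DESC);

-- Option 2: an indexed view, only if V_transaction meets the indexed-view
-- requirements (SCHEMABINDING, no outer joins, etc.); shown commented out
-- as a sketch, with a hypothetical unique key column.
-- CREATE UNIQUE CLUSTERED INDEX IX_V_transaction_time_stamp
--     ON dbo.V_transaction (time_stamp, transaction_id);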
If you are still having issues, post the execution plan - it should show you exactly where and why your query is experiencing performance problems.
Why don't you create an extra column that stores the number of days plus the number of seconds for each date record, and then order by that column?
I have following table:
CREATE TABLE public.shop_prices
(
shop_name text COLLATE pg_catalog."default",
product text COLLATE pg_catalog."default",
product_category text COLLATE pg_catalog."default",
price text COLLATE pg_catalog."default"
)
and for this table I have a dataset covering 18 months. Each file has about 15M records. I have to do some analysis, such as finding the months in which a shop increased or decreased its prices. I imported two months into a table and ran the following query just to test:
select shop, product from shop_prices group by shop, product limit 10
I waited more than 5 minutes, but got no result or response; it was still running. What is the best way to store these datasets and run queries efficiently? Is it a good idea to create a separate table for each dataset?
Using explain analyze select shop_name, product from shop_prices group by shop_name, product limit 10 you can see how Postgres plans and executes the query and how long the execution takes. You'll see that it needs to read the whole table (with time-consuming disk reads) and then sort it in memory - which will probably spill to disk - before returning the results. On the next run you might find the same query is very snappy if the number of shop_name+product combinations is very limited and thus stored in pg_stats after that explain analyze. The point is that a simple query like this can be deceiving.
You will get faster execution by creating an index on the columns you are using (create index shop_prices_shop_prod_idx on public.shop_prices(shop_name, product)).
You should definitely change the price column type to numeric (or float/float8) if you plan to do any numerical calculations on it.
Having said all that, I suspect this table is not what you will be using as it does not have any timestamp to compare prices between months to begin with.
I suggest you complete the table design first and then work out which indexes improve performance. You might even want to consider table partitioning: https://www.postgresql.org/docs/current/ddl-partitioning.html
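As a rough illustration of the partitioning idea (a sketch only - it assumes you add a price_date column, which the current table does not have):

-- Hypothetical repartitioned design; price_date is an assumed new column.
CREATE TABLE public.shop_prices_partitioned
(
    shop_name        text,
    product          text,
    product_category text,
    price            numeric,
    price_date       date NOT NULL
) PARTITION BY RANGE (price_date);

-- One partition per month, e.g. for January 2020:
CREATE TABLE public.shop_prices_2020_01
    PARTITION OF public.shop_prices_partitioned
    FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');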
You will probably be running all sorts of queries on this data, so there is no single solution for them all.
By all means come back with more specific questions, including the complete table description and the explain analyze output for the queries you are trying out, and you will get good advice.
Best regards,
Bjarni
What is your PostgreSQL version?
First, there is a typo: the column shop should be shop_name.
Second, your query looks strange because it has only a LIMIT clause without any ORDER BY or WHERE clause: do you really want "random" rows for this query?
Can you try to post EXPLAIN output for the SQL statement:
explain select shop_name, product from shop_prices group by shop_name, product limit 10;
Can you also check if any statistics have been computed for this table with:
select * from pg_stats where tablename='shop_prices';
Query 1 - UserId is the main identifier, non-clustered index
update myTable set
CurrentHp=MaximumHp,
SelectedAttack1RemainingPP=SelectedAttack1MaximumPP,
SelectedAttack2RemainingPP=SelectedAttack2MaximumPP,
SelectedAttack3RemainingPP=SelectedAttack3MaximumPP,
SelectedAttack4RemainingPP=SelectedAttack4MaximumPP where UserId=1001695
Query 2
update myTable set
CurrentHp=MaximumHp,
SelectedAttack1RemainingPP=SelectedAttack1MaximumPP,
SelectedAttack2RemainingPP=SelectedAttack2MaximumPP,
SelectedAttack3RemainingPP=SelectedAttack3MaximumPP,
SelectedAttack4RemainingPP=SelectedAttack4MaximumPP
where UserId=1001695
and
(
SelectedAttack1RemainingPP!=SelectedAttack1MaximumPP
or
SelectedAttack2RemainingPP!=SelectedAttack2MaximumPP
or
SelectedAttack3RemainingPP!=SelectedAttack3MaximumPP
or
SelectedAttack4RemainingPP!=SelectedAttack4MaximumPP
or
CurrentHp!=MaximumHp
)
When I check via SQL Server Management Studio and compare the "Include Actual Execution Plan" output, their costs are the same.
However, when I check via "Include Client Statistics", I see that the first query reports 1900 rows updated while the second one reports 0 rows updated.
So here is my question: when column A and column B already hold equal values, does SQL Server still perform an update?
I also think, logically, that both queries should behave the same, but I would like to hear your opinion.
(Image: execution plans showing the same cost for both queries)
(Image: client statistics for query 1)
(Image: client statistics for query 2)
The two execution plans are the same because your first filter condition (UserId=1001695) selects just one row and the table has an index on this field.
If you change your queries to use a range condition such as UserID > 100, the costs in the execution plans change and are no longer the same; likewise, if you filter on another field that the table has no index on, the structure of the execution plans changes and they are no longer identical.
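To see how much work each statement really does (rather than relying on client statistics alone), one option is to compare rows affected and I/O for both versions. A minimal sketch, using the table and column names from the question with the column list abbreviated:

-- Compare the work done by both updates; run with the real column list.
SET STATISTICS IO ON;

UPDATE myTable SET
    CurrentHp = MaximumHp,
    SelectedAttack1RemainingPP = SelectedAttack1MaximumPP
WHERE UserId = 1001695;                                             -- unconditional version
SELECT @@ROWCOUNT AS RowsAffectedUnconditional;

UPDATE myTable SET
    CurrentHp = MaximumHp,
    SelectedAttack1RemainingPP = SelectedAttack1MaximumPP
WHERE UserId = 1001695
  AND (CurrentHp != MaximumHp
       OR SelectedAttack1RemainingPP != SelectedAttack1MaximumPP);  -- guarded version
SELECT @@ROWCOUNT AS RowsAffectedGuarded;

SET STATISTICS IO OFF;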
I've been tasked with improving the performance (this is my first real-world performance tuning task) of a reporting stored procedure which is called by an SSRS front-end; the stored procedure currently takes about 30 seconds to run on the largest amount of data (based on filters set from the report front-end).
This stored procedure consists of 19 queries, most of which transform the data from an initial (legacy) format inside the base tables into a meaningful dataset to be displayed to the business side.
I've created a query based on a few DMVs in order to find the most resource-consuming queries in the stored procedure (small snippet below), and I have found one query which takes about 10 seconds, on average, to complete.
select
object_name(st.objectid) [Procedure Name]
, dense_rank() over (partition by st.objectid order by qs.last_elapsed_time desc) [rank-execution time]
, dense_rank() over (partition by st.objectid order by qs.last_logical_reads desc) [rank-logical reads]
, dense_rank() over (partition by st.objectid order by qs.last_worker_time desc) [rank-worker (CPU) time]
, dense_rank() over (partition by st.objectid order by qs.last_logical_writes desc) [rank-logical write]
...
from sys.dm_exec_query_stats as qs
cross apply sys.dm_exec_sql_text (qs.sql_handle) as st
cross apply sys.dm_exec_text_query_plan (qs.plan_handle, qs.statement_start_offset, qs.statement_end_offset) as qp
where st.objectid in ( object_id('SuperDooperReportingProcedure') )
order by [rank-execution time]
    , [rank-logical reads]
    , [rank-worker (CPU) time]
    , [rank-logical write] desc
Now, this query is a bit strange in the sense that the execution plan shows that the bulk of the work (~80%) is done when inserting the data into the local temporary table, and not when interrogating the other tables from which the source data is taken and then manipulated. (The screenshot below is from SQL Sentry Plan Explorer.)
Also, the execution plan's row estimates are way off: only 4,218 rows are actually inserted into the local temporary table, as opposed to the ~248k rows that the execution plan thinks it is moving into the local temporary table. So, because of this, I'm thinking "statistics", but do those even matter if ~80% of the work is the actual insert into the table?
One of my first recommendations was to re-write the entire process and the stored procedure so as to not include the moving and transforming of the data into the reporting stored procedure and to do the data transformation nightly into some persisted tables (real-time data is not required, only relevant data until end of previous day). But the business side does not want to invest time and resources into redesigning this and instead "suggests" I do performance tuning in the sense of finding where and what indexes I can add to speed this up.
I don't believe that adding indexes to the base tables will improve the performance of the report, since most of the time needed for running the query is spent saving the data into a temporary table (which, to my knowledge, hits tempdb, meaning the rows are written to disk -> increased time due to I/O latency).
But even so, as I've mentioned, this is my first performance tuning task; I've tried to read as much as possible about this in the last couple of days, and these are my conclusions so far. I'd like to ask a broader audience for advice and hopefully get a few more insights into what I can do to improve this procedure.
A few specific questions I'd appreciate being answered:
Is there anything incorrect in what I have said above (in my understanding of the db or my assumptions) ?
Is it true that adding an index to a temporary table will actually increase the time of execution, since the table (and its associated index(es) is/are being rebuilt on each execution)?
Could anything else be done in this scenario without having to re-write the procedure/queries, using only indexes or other tuning methods? (I've read a few article headlines saying you can also "tune tempdb", but I haven't gotten into the details of those yet.)
Any help is very much appreciated and if you need more details I'll be happy to post.
Update (2 Aug 2016):
The query in question is (partially) below. What is missing are a few more aggregate columns and their corresponding lines in the GROUP BY section:
select
b.ProgramName
,b.Region
,case when b.AM IS null and b.ProgramName IS not null
then 'Unassigned'
else b.AM
end as AM
,rtrim(ltrim(b.Store)) Store
,trd.Store_ID
,b.appliesToPeriod
,isnull(trd.countLeadActual,0) as Actual
,isnull(sum(case when b.budgetType = 0 and b.budgetMonth between @start_date and @end_date then b.budgetValue else 0 end),0) as Budget
,isnull(sum(case when b.budgetType = 0 and b.budgetMonth between @start_date and @end_date and (trd.considerMe = -1 or b.StoreID < 0) then b.budgetValue else 0 end),0) as CleanBudget
...
into #SalvesVsBudgets
from #StoresBudgets b
left join #temp_report_data trd on trd.store_ID = b.StoreID and trd.newSourceID = b.ProgramID
where (b.StoreDivision is not null or (b.StoreDivision is null and b.ProgramName = 'NewProgram'))
group by
b.ProgramName
,b.Region
,case when b.AM IS null and b.ProgramName IS not null
then 'Unassigned'
else b.AM
end
,rtrim(ltrim(b.Store))
,trd.Store_ID
,b.appliesToPeriod
,isnull(trd.countLeadActual,0)
I'm not sure if this is actually helpful, but since @kcung requested it, I added the information.
Also, to answer some of his questions:
the temporary tables have no indexes on them
RAM size: 32 GB
Update (3 Aug 2016):
I have tried @kcung's suggestion to move the CASE statements out of the aggregate-generating query, but unfortunately the overall procedure time has not noticeably improved; it still fluctuates within ±0.25 to ±1.0 second of the original (yes, both lower and higher than the original version of the stored procedure - but I'm guessing this is due to variable workload on my machine).
The execution plan for the same query, but modified to remove the CASE conditions, leaving only the SUM aggregates, is now:
Adding indexes to the temporary table will definitely speed up reads from it, but it will slow down writes to it.
Also, as you mentioned, there are 19 queries executing in the procedure, so analyzing only one query's execution plan will only get you so far.
In addition, if possible, execute this query on its own and check how much time it takes (and how many rows are affected).
Another approach you may try (not sure if it is possible in your case) is using a table variable instead of a temporary table. Using a table variable over a temporary table has additional advantages: the procedure is pre-compiled, no transaction log is maintained, and, moreover, you don't need to write a drop table.
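A minimal sketch of the table-variable variant, using a hypothetical, abbreviated subset of the columns from the query in the question:

-- Hypothetical column subset and types; adjust to match the real data.
DECLARE @SalvesVsBudgets TABLE
(
    ProgramName varchar(100),
    Region      varchar(50),
    Store_ID    int,
    Budget      decimal(18, 2)
);

INSERT INTO @SalvesVsBudgets (ProgramName, Region, Store_ID, Budget)
SELECT b.ProgramName, b.Region, trd.Store_ID, 0   -- aggregates omitted in this sketch
FROM #StoresBudgets b
LEFT JOIN #temp_report_data trd
    ON trd.store_ID = b.StoreID AND trd.newSourceID = b.ProgramID;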
Any chance I can see the query? And the indexes on both tables?
How big is your RAM? How big is a row in each table (roughly)?
Can you update statistics for both tables and re-post the query plan?
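In case it helps, updating the statistics on the temp tables is a one-liner each (a sketch using the temp table names from the question):

UPDATE STATISTICS #StoresBudgets;
UPDATE STATISTICS #temp_report_data;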
To answer your questions:
You're mostly right, except for the part about adding indexes. Adding indexes will help the query do lookups. It will also give the query planner a chance to consider a nested loop join instead of the hash join. Unfortunately, I can't say more until my questions are answered.
You shouldn't need to add an index to the temp table. Adding an index to this temp table (or any insert-destination table) will increase write time, because the insert will need to update that index. Just imagine an index as a copy of your table with less information that sits on top of your table and needs to stay in sync with it. Every write (insert, update, delete) needs to update this index.
Looking at both tables' total row counts, this query should run far faster than 10 seconds, unless you have a lemon of a PC, in which case it's a different story.
EDIT:
Just to clarify point 2: I didn't realise your source table is a temp table as well. A temporary table is destroyed when the session/connection that created it ends. Adding an index to a temporary table means you add extra time to create that index every time you create the temporary table.
EDIT:
Sorry, I'm on my phone now, so I'll keep this short.
So, essentially, two things (a combined sketch follows after the sample below):
Add a primary key at temp table creation time so you do it in one go. Don't bother adding a nonclustered index or any covering index; you will end up spending more time creating those.
Look at your query: all of those CASE WHEN statements - instead of evaluating them in this query, why don't you add them as another column in the table? Essentially, you want to avoid calculation on the fly when doing the GROUP BY. You can leave the SUM() in the query, as it's an aggregate query, but try to reduce run-time calculation as much as possible.
Sample:
case when b.AM IS null and b.ProgramName IS not null
then 'Unassigned'
else b.AM
end as AM
You can create a column named AM when creating table b.
Also, those rtrim and ltrim calls: please remove them and do the trimming at table creation time. :)
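A combined sketch of both suggestions, with hypothetical column types and a hypothetical source table, showing the temp table created with a primary key and the CASE/trim work done once at load time:

-- Hypothetical column list and types; the point is the PK declared at
-- creation time and the CASE / trim work done once at load time.
CREATE TABLE #StoresBudgets
(
    StoreBudgetID int IDENTITY(1, 1) PRIMARY KEY,
    ProgramID     int,
    ProgramName   varchar(100),
    Region        varchar(50),
    StoreID       int,
    Store         varchar(100),   -- already trimmed on insert
    AM            varchar(100),   -- already resolved to 'Unassigned' where needed
    budgetType    int,
    budgetMonth   date,
    budgetValue   decimal(18, 2)
);

INSERT INTO #StoresBudgets
        (ProgramID, ProgramName, Region, StoreID, Store, AM, budgetType, budgetMonth, budgetValue)
SELECT  src.ProgramID,
        src.ProgramName,
        src.Region,
        src.StoreID,
        RTRIM(LTRIM(src.Store)),                                   -- trim once, here
        CASE WHEN src.AM IS NULL AND src.ProgramName IS NOT NULL
             THEN 'Unassigned' ELSE src.AM END,                    -- CASE evaluated once, here
        src.budgetType,
        src.budgetMonth,
        src.budgetValue
FROM    dbo.BudgetSource AS src;   -- hypothetical source table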
One suggestion is to increase the allowed execution time (command timeout) for the stored procedure call:
cmd.CommandTimeout = 200 // in seconds.
You can also generate a report link and email it to the user once the report has been generated.
Other than that, use CTEs; never use temp tables, as they are more expensive.
When querying a table using its primary key, like this:
SELECT * FROM foo WHERE myPrimaryKey = @bar;
would it make sense/be faster to use a TOP (1) specification?
SELECT TOP (1) * FROM foo WHERE myPrimaryKey = @bar;
Or is SQL Server smart enough to stop searching after it's found the primary key?
No. In your particular case, using TOP (1) is not useful at all.
The TOP clause is applied after the rest of the query is processed, so it is useful only to limit the overhead of a possibly large data flow between the server and the client, or when you want to cap, no matter what, the number of rows you will retrieve from the server.
The reason I say that TOP is applied after everything else is that it needs the data already ordered, so it has to work after the last evaluated clause: ORDER BY.
Also, TOP can let you retrieve the first x percent of rows using TOP(x) PERCENT, so again, it needs to know the number of rows and their order.
A simple example is the biggest enemy of a development DBMS: SELECT * FROM Table (I've specified development because that's the only environment where that kind of query should be seen).
Sometimes I use a SELECT * FROM kind of query when I have to understand what kind of data (not data type) to expect when I'm about to develop something that has to use that table.
Since I want to write a very short query and all I need is a bunch of records, I use the TOP clause: SELECT TOP 5 * FROM Table
SQL Server still processes the query as SELECT * FROM Table, but it will only send me back the first 5 rows.
You can try it out yourself: write a query that should retrieve more than one row, check its execution plan, then add the TOP clause and check the execution plan again. They will be the same in both cases.
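For the primary-key case from the question, the experiment would look something like this (a sketch; foo, myPrimaryKey and the literal key value are just stand-ins):

-- Run both with "Include Actual Execution Plan" enabled and compare.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT * FROM foo WHERE myPrimaryKey = 42;          -- plain PK lookup
SELECT TOP (1) * FROM foo WHERE myPrimaryKey = 42;  -- same lookup with TOP (1)

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;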
The image below shows how TOP impacts your query. The query without TOP returned around 40,700 rows. You can clearly see that the wait time on the server is only 2 ms, while all of the remaining time (267 ms) is spent downloading data.
I have a fairly complex query in SQL Server running against a view, in the form:
SELECT *
FROM myview, foo, bar
WHERE shared=1 AND [joins and other stuff]
ORDER BY sortcode;
The query plan (shown above) includes a Sort operation just before the final SELECT, which is what I would expect. There are only 35 matching records, and the query takes well under 2 seconds.
But if I add TOP 30, the query takes almost 3 minutes! Using SET ROWCOUNT is just as slow.
Looking at the query plan, it now appears to sort all 2+ million records in myview before the joins and filters.
This "sorting" is shown on the query plan as an Index Scan on the sortcode index, a Clustered Index Seek on the main table, and a Nested Loop between them, all before the joins and filters.
How can I force SQL Server to SORT just before TOP, like it does when TOP isn't specified?
I don't think the construction of myview is the issue, but just in case, it is something like this:
CREATE VIEW myview AS
SELECT columns..., sortcode, 0 as shared FROM mytable
UNION ALL
SELECT columns..., sortcode, 1 as shared FROM [anotherdb].dbo.mytable
The local mytable has a few thousand records, and mytable in the other database in the same MSSQL instance has a few million records. Both tables do have indexes on their respective sortcode column.
And so starts the unfortunate game of "trying to outsmart the optimizer (because it doesn't always know best)".
You can try putting the filtering portions into a subquery or CTE:
SELECT TOP 30 *
FROM
(SELECT *
FROM myview, foo, bar
WHERE shared=1 AND [joins and other stuff]) t
ORDER BY sortcode;
Which may be enough to force it to filter first (but the optimizer gets "smarter" with each release, and can sometimes see through such shenanigans). Or you might have to go as far as putting this code into a UDF. If you write the UDF as a multistatement table-valued function, with the filtering inside, and then query that UDF with your TOP x/ORDER BY, you've pretty well forced the querying order (because SQL Server is currently unable to optimize around multistatement UDFs).
Of course, thinking about it, introducing the UDF is just a way of hiding what we're really doing - create a temp table, use one query to populate it (based on WHERE filters), then another query to find the TOP x from the temp table.
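A minimal sketch of that last approach; the join condition is a placeholder standing in for "[joins and other stuff]" from the question:

-- Step 1: materialise the filtered rows once.
SELECT v.*
INTO   #filtered
FROM   myview AS v
JOIN   foo ON foo.id = v.foo_id    -- hypothetical join; replace with the real joins/filters
WHERE  v.shared = 1;

-- Step 2: TOP/ORDER BY runs against the small temp table only.
SELECT TOP (30) *
FROM   #filtered
ORDER BY sortcode;

DROP TABLE #filtered;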