I've got a timesheet table with an id, and total hours.
I now have a
select ts.totalhours, fn_NormalTime(ts.id), fn_Overtime(ts.id) from tsentries ts
I have done this improve readability in the original query as well as to centralize logic with regards to holidays, Sunday time, etc.
However, each function in turn now does a select from the table to get items like date, rules, etc.
Is there a way to get sql to not redo the retrieve for every function or will internal caching suffice? Looking to improve speed.
Using functions in the SELECT part of the query will lead to bad performances because the SELECT inside the function will need to be executed for each line you return from your query. There is no way to cache this result this way. The query optimizer will only do a good job if you write all the data retrieval logic inside the same query.
Related
I have a snowflake query with multiple ctes and inserting into a table using a Talend job. It takes more than 90 minutes to execute the query. It is multiple cascading ctes, one is calling other and other is calling the other.
I want to improve the performance of the query. It is like 1000 lines of code and I can't paste it here. As I checked the profile and it is showing all the window functions and aggregate functions which slows down the query.
For example, the top slower is,
ROW_NUMBER() OVER (PARTITION BY LOWER(S.SUBSCRIPTIONID)
ORDER BY S.ISROWCURRENT DESC NULLS FIRST,
TO_NUMBER(S.STARTDATE) DESC NULLS FIRST,
IFF(S.ENDDATE IS NULL, '29991231', S.ENDDATE) DESC NULLS FIRST)
takes 7.3% of the time. Can you suggest an alternative way to improve the performance of the query please?
The problem is that 1000 lines are very hard for any query analyzer to optimize. It also makes troubleshooting a lot harder for you and for a future team member who inherits the code.
I recommend breaking the query up and these optimizations:
Use CREATE TEMPORARY TABLE AS instead of CTEs. Add ORDER BY as you create the table on the column that you will join or filter on. The temporary tables are easier for the optimizer to build and later use. The ORDER BY helps Snowflake know what to optimize for with subsequent joins to other tables. They're also easier for troubleshooting.
In your example, see if you can persist this data as permanent columns so that Snowflake can skip the conversion portion and have better statistics on it: TO_NUMBER(S.STARTDATE) and IFF(S.ENDDATE IS NULL, '29991231', S.ENDDATE).
Alternatively to step 2, instead of sorting by startDate and endDate, see if you can add an IDENTITY, SEQUENCE, or populate an INTEGER column which you can use as the sortkey. You can also literally name this new column sortKey. Sorting an integer will be significantly faster than running a function on a DATETIME and then ordering by it.
If any of the CTEs can be changed into materialized views, they will be pre-built and significantly faster.
Finally stage all of the data in a temporary table - ordered by the same columns that your target table was created in - before you insert it. This will make the insert step itself quicker and Snowflake will have an easier time handling a concurrent change to that table.
Notes:
To create a temporary table:
create or replace temporary table table1 as select * from dual; After that you refer to table1 instead of your code instead of the CTE.
Materialized views are documented here. They are an Enterprise edition feature. They syntax is: create materialized view mymv as select col1, col2 from mytable;
I've recently hit a bottleneck situation in which if I keep a current version of a query inside a report (designed in Report Builder SSRS 2008) it will generate loading times of up to 15 minutes for a report with specific parameters. This JOIN represents a sub-query which I JOIN to the main query on a non-indexed column. Let's call this sub-query "Units".
If I delete the "Units" JOIN from the SQL Query and set it up as a separate Data Set inside the report, linking it using the SSRS Lookup function (same as the JOIN in SQL) to the Main Data Set (Query), the report runs smoothly, in under a minute (Approximately 3 to 5 miliseconds).
Keeping in mind that the "Units" sub-query, when ran separately runs in under 5 milliseconds for the same parameters that previously took 15 minutes, but when it is attached to the Main query causes severe performance issues.
Is there a clear benefit on doing this type of separation or should I just investigate further on how to improve the query? What are the performance benefits/downsides of using lookup versus improving the current query performance.
My concern is that this is a situational improvement and this will not represent a long term solution. I've used this alternative in the past to avoid tweaking the query and it did not backfire, but I do not fully understand the performance implications of using this workaround.
Thanks,
Radu.
There are a lot of things that could be causing the performance issues but here's a few simple things that might get the dataset back up to speed again with very little effort.
1. Parameter sniffing
You mention with specific parameters, if you mean that the query only performs badly with some parameters and performs well with other parameters, and assuming that the size of the data does not vary significantly based on these parameters then it's likely a parameter sniffing issue. This is caused by a query plan that was generated based on once set of parameters that is not suitable for other parameters. The easiest way to prove this is to simply add option (recompile) to the end of the query. This is not a permanent fix but it will force a new query plan to be generated. If you see an instant improvement then parameter sniffing is the most common cause.
2. Refactor dataset query
The other option is to redesign your query. I don't know what you query looks like but if we take a simple example based on the information you posted...
If you query looks something like..
SELECT * FROM tableA a
JOIN (SELECT * FROM tableB WHERE someValue=someOtherValue) b
ON a.FieldA = b.FieldB
then you could refactor it by putting the subquery into a temp table and joining to that, something like
SELECT *
INTO #t
FROM tableB WHERE someValue=someOtherValue
SELECT * FROM tableA a
JOIN #t b
ON a.FieldA = b.FieldB
This is an approach I often take and it can get round exactly these types of performance issues.
We have a table with 6 million records, and then we have a SQL which need around 7 minutes to query the result. I think the SQL cannot be optimized any more.
The query time causes our weblogic to throw the max stuck thread exception.
Is there any recommendation for me to handle this problem ?
Following is the query, but it's hard for me to change it,
SELECT * FROM table1
WHERE trim(StudentID) IN ('354354','0')
AND concat(concat(substr(table1.LogDate,7,10),'/'),substr(table1.LogDate,1,5))
BETWEEN '2009/02/02' AND '2009/03/02'
AND TerminalType='1'
AND RecStatus='0' ORDER BY StudentID, LogDate DESC, LogTime
However, I know it's time consuming for using strings to compare dates, but someone wrote before I can not change the table structure...
LogDate was defined as a string, and the format is mm/dd/yyyy, so we need to substring and concat it than we can use between ... and ... I think it's hard to optimize here.
The odds are that this query is doing a full-file scan, because you're WHERE conditions are unlikely to be able to take advantage of any indexes.
Is LogDate a date field or a text field? If it's a date field, then don't do the substr's and concat's. Just say "LogDate between '2009-02-02' and '2009-02-03' or whatever the date range is. If it's defined as a text field you should seriously consider redefining it to a date field. (If your date really is text and is written mm/dd/yyyy then your ORDER BY ... LOGDATE DESC is not going to give useful results if the dates span more than one year.)
Is it necessary to do the trim on StudentID? It is far better to clean up your data before putting it in the database then to try to clean it up every time you retrieve it.
If LogDate is defined as a date and you can trim studentid on input, then create indexes on one or both fields and the query time should fall dramatically.
Or if you want a quick and dirty solution, create an index on "trim(studentid)".
If that doesn't help, give us more info about your table layouts and indexes.
SELECT * ... WHERE trim(StudentID) IN ('354354','0')
If this is normal construct, then you need a function based index. Because without it you force the DB server to perform full table scan.
As a rule of thumb, you should avoid as much as possible use of functions in the WHERE clause. The trim(StundentID), substr(table1.LogDate,7,10) prevent DB servers from using any index or applying any optimization to the query. Try to use the native data types as much as possible e.g. DATE instead of VARCHAR for the LogDate. StudentID should be also managed properly in the client software by e.g. triming the data before INSERT/UPDATE.
If your database supports it, you might want to try a materialized view.
If not, it might be worth thinking about implementing something similar yourself, by having a scheduled job that runs a query that does the expensive trims and concats and refreshes a table with the results so that you can run a query against the better table and avoid the expensive stuff. Or use triggers to maintain such a table.
But the query time cause our weblogic to throw the max stuck thread exception.
If the query takes 7 minutes and cannot be made faster, you have to stop running this query real-time. Can you change your application to query a cached results table that you periodically refresh?
As an emergency stop-gap before that, you can implement a latch (in Java) that allows only one thread at a time to execute this query. A second thread would immediately fail with an error (instead of bringing the whole system down). That is probably not making users of this query happy, but at least it protects everyone else.
I updated the query, could you give me some advices ?
Those string manipulations make indexing pretty much impossible. Are you sure you cannot at least get rid of the "trim"? Is there really redundant whitespace in the actual data? If so, you could narrow down on just a single student_id, which should speed things up a lot.
You want a composite index on (student_id, log_date), and hopefully the complex log_date condition can still be resolved using a index range scan (for a given student id).
Without any further information about what kind of query you are executing and wheter you are using indexes or not it is hard to give any specific information.
But here are a few general tips.
Make sure you use indexes on the columns you often filter/order by.
If it is only a certain query that is way too slow, than perhaps you can prevent yourself from executing that query by automatically generating the results while the database changes. For example, instead of a count() you can usually keep a count stored somewhere.
Try to remove the trim() from the query by automatically calling trim() on your data before/while inserting it into the table. That way you can simply use an index to find the StudentID.
Also, the date filter should be possible natively in your database. Without knowing which database it might be more difficult, but something like this should probably work: LogDate BETWEEN '2009-02-02' AND '2009-02-02'
If you also add an index on all of these columns together (i.e. StudentID, LogDate, TerminalType, RecStatus and EmployeeID than it should be lightning fast.
Without knowing what database you are using and what is your table structure, its very difficult to suggest any improvement but queries can be improved by using indexes, hints, etc.
In your query the following part
concat(concat(substr(table1.LogDate,7,10),'/'), substr(table1.LogDate,1,5)) BETWEEN '2009/02/02' AND '2009/02/02'
is too funny. BETWEEN '2009/02/02' AND '2009/02/02' ?? Man, what are yuu trying to do?
Can you post your table structure here?
And 6 million records is not a big thing anyway.
It is told a lot your problem is in date field. You definitely need to change your date from a string field to a native date type. If it is a legacy field that is used in your app in this exact way - you may still create a to_date(logdate, 'DD/MM/YYYY') function-based index that transforms your "string" date into a "date" date, and allows a fast already mentioned between search without modifying your table data.
This should speed things up a lot.
With the little information you have provided, my hunch is that the following clause gives us a clue:
... WHERE trim(StudentID) IN ('354354','0')
If you have large numbers of records with unidentified student (i.e. studentID=0) an index on studentID would be very imbalanced.
Of the 6 million records, how many have studentId=0?
Your main problem is that your query is treating everything as a string.
If LogDate is a Date WITHOUT a time component, you want something like the following
SELECT * FROM table1
WHERE StudentID IN (:SearchStudentId,0)
AND table1.LogDate = :SearchDate
AND TerminalType='1'
AND RecStatus='0'
ORDER BY EmployeeID, LogDate DESC, LogTime
If LogDate has a time component, and SearchDate does NOT have a time component, then something like this. (The .99999 will set the time to 1 second before midnight)
SELECT * FROM table1
WHERE StudentID IN (:SearchStudentId,:StudentId0)
AND table1.LogDate BETWEEN :SearchDate AND :SearchDate+0.99999
AND TerminalType='1'
AND RecStatus='0'
ORDER BY EmployeeID, LogDate DESC, LogTime
Note the use of bind variables for the parameters that change between calls. It won't make the query much faster, but it is 'best practice'.
Depending on your calling language, you may need to add TO_DATE, etc, to cast the incoming bind variable into a Date type.
If StudentID is a char (usually the reason for using trim()) you may be able to get better performance by padding the variables instead of trimming the field, like this (assuming StudentID is a char(10)):
StudentID IN (lpad('354354',10),lpad('0',10))
This will allow the index on StudentID to be used, if one exists.
Is there an inherent cost to using inline-table-valued functions in SQL Server 2008 that is not incurred if the SQL is inlined directly? Our application makes very heavy use of inline-table-valued functions to reuse common queries, but recently, we've found that queries run much faster if we don't use them.
Consider this:
CREATE FUNCTION dbo.fn_InnerQuery (#asOfDate DATETIME)
RETURNS TABLE
AS
RETURN
(
SELECT ... -- common, complicated query here
)
Now, when I do this:
SELECT TOP 10 Amount FROM dbo.fn_InnerQuery(dbo.Date(2009,1,1)) ORDER BY Amount DESC
The query returns with results in about 15 seconds.
However, when I do this:
SELECT TOP 10 Amount FROM
(
SELECT ... -- inline the common, complicated query here
) inline
ORDER BY Amount DESC
The query returns in less than 1 second.
I'm a little baffled by the overhead of using the table valued function in this case. I did not expect that. We have a ton of table valued functions in our application, so I'm wondering if there is something I'm missing here.
In this case, the UDF should be unnested/expanded like a view and it should be transparent.
Obviously, it's not...
In this case, my guess is that the column is smalldatetime and is cast to datetime because of the udf parameter but the constant is correctly evaluated (to match colum datatype) when inline.
datetime has a higher precedence that smalldatetime, so the column would be cast
What do the query plans say? The UDF would show a scan, the inline a seek most likely (not 100%, just based on what I've seen before)
Edit: Blog post by Adam Machanic
One thing that can slow functions down is omitting dbo. from table references inside the function. That causes SQL Server to do a security check for every call, which can be expensive.
Try running the table valued function independently to see, how fast/slow it executes?
Also, I am not sure how to clear the execution cache(?) which SQL Server might retain from the execution of the UDF. I mean - if you run the UDF first, it could be the case where SQL Server has the actual query with it & it could cache the plan/result. So, if you run the complicated query separately - it could be running it from cache.
In your second example the Table Valued function has to return the entire data set before the query can apply the filter. Hopping across the TF boundary is not something that the optimiser can always do.
In the third example the query optimiser can work out that the user only wants the top few 'amounts'. If this isn't an aggregate value the optimiser can push that processing right to the start of the query and not bother with any other data. If it is an aggregate amount then the slowdown is for a different reason.
If you compare the query plans of the two queries you should see that they are different.
I have a query that has been running every day for a little over 2 years now and has typically taken less than 30 seconds to complete. All of a sudden, yesterday, the query started taking 3+ hours to complete and was using 100% CPU the entire time.
The SQL is:
SELECT
#id,
alpha.A, alpha.B, alpha.C,
beta.X, beta.Y, beta.Z,
alpha.P, alpha.Q
FROM
[DifferentDatabase].dbo.fnGetStuff(#id) beta
INNER JOIN vwSomeData alpha ON beta.id = alpha.id
alpha.id is a BIGINT type and beta.id is an INT type. dbo.fnGetStuff() is a simple SELECT statement with 2 INNER JOINs on tables in the same DB, using a WHERE id = #id. The function returns approximately 11000 results.
The view vwSomeData is a simple SELECT statement with two INNER JOINs that returns about 590000 results.
Both the view and the function will complete in less than 10 seconds when executed by themselves. Selecting the results of the function into a temporary table first and then joining on that makes the query finish in < 10 seconds.
How do I troubleshoot what's going on? I don't see any locks in the activity manager.
Look at the query plan. My guess is that there is a table scan or more in the execution plan. This will cause huge amounts of I/O for the few record you get in the result.
You could use the SQL Server Profiler tool to monitor what queries are running on SQL Server. It doesn't show the locks, but it can for instance also give you hints on how to improve your query by suggesting indexes.
If you've got a reasonably recent version of SQL Server Management Studio, it has a Database Tuning Adviser as well, under Tools. It takes a trace from the Profiler and makes some, sometimes highly useful, suggestions. Makes sure there's not too many queries - it takes a long time to build advice.
I'm not an expert on it, but have had some luck with it in the past.
Do you need to use a function? Can you re-write the entire thing into a stored procedure in which you pass in the #ID as a parameter.
Even if your table has indexes because you pass the #ID as a variable to the WHERE clause potentially greatly increasing the amount of time for the query to run.
The reason the indexes may not be used is because the Query Analyzer does not know the value of the variables when it selects an access method to perform the query. Because this is a batch file, only one pass is made of the Transact-SQL code, preventing the Query Optimizer from knowing what it needs to know in order to select an access method that uses the indexes.
You might want to consider an INDEX query hint if you cannot re-write the SQL.
it might also be possible, since this just started happening, that the INDEXes have become fragmented and might need to be rebuilt.
I've had similar problems with joining functions that return large datasets. I had to do what you've already suggested. Put the results in a temp table and join on that.
Look at the estimated plan, this will probably shed some light. Typically when query cost gets orders of magnitude more expensive it is because a loop or merge join is being used where a hash join is more appropriate. If you see a loop or merge join in the estimated plan, look at the number of rows it expects to process - is it far smaller than the number of rows you know will actually be in play? You can also specify a hint to use a hash join and see if it performs much better. If so, try updating statistics and see if it goes back to a hash join without a hint.
SELECT
#id,
alpha.A, alpha.B, alpha.C,
beta.X, beta.Y, beta.Z,
alpha.P, alpha.Q
FROM
[DifferentDatabase].dbo.fnGetStuff(#id) beta
INNER HASH JOIN vwSomeData alpha ON beta.id = alpha.id
-- having no idea what type of schema is in place and just trying to throw out ideas:
Like others have said... use Profiler and find the source of pain... but I'm thinking it is the function on the other database. Since that function might be a source of pain, have you thought about a little denormalization or anything on [DifferentDatabase]. I think you'll find a bit more scalability in joining to a more flattened table with indexes than a costly function.
Run this command:
SET SHOWPLAN_ALL ON
Then run your query. It will display the execution plan, look for a "SCAN" on an index or a table. That is most likely what is happening to your query now. If that is the case, try to figure out why it is not using indexes now (refresh statistics, etc)