I am doing some crude benchmarks with the xml data type in SQL Server 2008. I've seen many places where .exist is used in WHERE clauses, but I recently compared two queries and got odd results.
select count(testxmlrid) from testxml
where Attributes.exist('(form/fields/field)[@id="1"]')=1
This query takes about 1.5 seconds to run, with no indexes on anything but the primary key (testxmlrid).
select count(testxmlrid) from testxml
where Attributes.value('(/form/fields/field/@id)[1]','integer')=1
This query, on the other hand, takes about 0.75 seconds to run.
I'm using untyped XML and my benchmarking is taking place on a SQL Server 2008 Express instance. There are about 15,000 rows in the dataset and each XML string is about 25 lines long.
Are these results I'm getting correct? If so, why does everyone use .exist? Am I doing something wrong and .exist could be faster?
You are not counting the same things. Your .exist query (form/fields/field)[@id="1"] checks all occurrences of @id in the XML until it finds one with the value 1, while your .value query (/form/fields/field/@id)[1] only fetches the first occurrence of @id.
Test this:
declare @T table
(
testxmlrid int identity primary key,
Attributes xml
)
insert into @T values
('<form>
<fields>
<field id="2"/>
<field id="1"/>
</fields>
</form>')
select count(testxmlrid) from @T
where Attributes.exist('(form/fields/field)[@id="1"]')=1
select count(testxmlrid) from @T
where Attributes.value('(/form/fields/field/@id)[1]','integer')=1
The .exist query count is 1 because it finds @id=1 in the second field node, while the .value query count is 0 because it only checks the value of the first occurrence of @id.
An .exist query that, like your .value query, only checks the value of the first occurrence of @id would look like this:
select count(testxmlrid) from @T
where Attributes.exist('(/form/fields/field/@id)[1][.="1"]')=1
The difference could come from your indexes.
A PATH index will boost performance of the exist() predicate on the WHERE clause, whereas a PROPERTY index will boost performance of the value() function.
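For reference, a secondary XML index is built on top of a primary XML index (which in turn requires a clustered primary key on the table). A minimal sketch of creating PATH and PROPERTY indexes on the test table, with illustrative index names:
CREATE PRIMARY XML INDEX PXML_testxml_Attributes ON testxml (Attributes);

-- PATH secondary index: helps path-based exist() predicates
CREATE XML INDEX IXML_testxml_Attributes_PATH ON testxml (Attributes)
USING XML INDEX PXML_testxml_Attributes FOR PATH;

-- PROPERTY secondary index: helps per-row value() retrieval
CREATE XML INDEX IXML_testxml_Attributes_PROPERTY ON testxml (Attributes)
USING XML INDEX PXML_testxml_Attributes FOR PROPERTY;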
Read:
http://msdn.microsoft.com/en-us/library/bb522562.aspx
Related
This is my first post on Stack Overflow. :-)
I use a SQL Server database that has a huge table (up to 50'000'000 records) and a smaller table (up to 500'000 records).
Let's consider two queries:
Query A
SELECT *
FROM my_huge_table
WHERE column_name IN(list_of_values)
versus Query B
SELECT *
FROM my_huge_table
WHERE column_name IN (
SELECT column_name
FROM smaller_table
WHERE ...
)
The subquery in query B returns exactly the same list as list_of_values in query A.
The length of list_of_values in query A is limited (as described in many places on the web). If my list_of_values is long (e.g. 10'000 items), I have to split it into chunks, but query B works fine even though the subquery in query B also returns 10'000 records...
What's more, query B is faster than query A. I looked into the execution plans, and they show that query B uses some parallel calculations (I'm not familiar with execution plans).
Questions
In my script I have a list of values, so I cannot easily create a subquery. Is there any way to execute something similar to query B using a list of values? (See the sketch below.)
Why is query B faster than query A?
Are there any other optimizations to be done?
PS: I've already created an index on the queried column.
Thanks.
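Not from the original thread, but a sketch of one common way to get query B's behaviour from a client-side list (the temp table name and data type are assumptions): bulk-load the values into a temporary table and join against it instead of using a long IN() list.
-- Hypothetical holder for the client-side list of values
CREATE TABLE #list_of_values (column_name int NOT NULL PRIMARY KEY);

-- Populate it from the client (batched INSERTs or a bulk load), then join:
SELECT h.*
FROM my_huge_table AS h
JOIN #list_of_values AS v
    ON v.column_name = h.column_name;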
Is there any performance difference between query A and query B?
Query A
SELECT * FROM SomeTable
WHERE 1 = 1 AND (SomeField LIKE '[1,m][6,e][n]%')
Query B
SELECT * FROM SomeTable
WHERE 1 = 1 AND (SomeField IN ('16', 'Mens'))
The first could be much slower. An index can't be used with LIKE unless there is a constant prefix, for example LIKE 'foo%'. The first query will therefore require a table scan. The second query, however, could use an index on SomeField if one is available.
The first query will also give the wrong results, since its pattern matches values such as '1en' that are not in the IN list.
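To illustrate the prefix rule (a sketch; the index is assumed, not from the question):
-- Assumed index on the searched column
CREATE INDEX IX_SomeTable_SomeField ON SomeTable (SomeField);

-- Sargable: constant prefix, can seek on the index
SELECT * FROM SomeTable WHERE SomeField LIKE 'Men%';

-- Not sargable: a leading wildcard forces a scan
SELECT * FROM SomeTable WHERE SomeField LIKE '%ens';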
Well, I have a table with 40,000,000+ records, but when I try to execute a simple query, it takes ~3 min to finish. Since I am using the same query in my C# solution, where it needs to execute 100+ times, the overall performance of the solution is badly hit.
This is the query that I am using in a proc
DECLARE @Id bigint
SELECT @Id = MAX(ExecutionID) from ExecutionLog where TestID=50881
select @Id
Any help to improve the performance would be great. Thanks.
What indexes do you have on the table? It sounds like you don't have anything even close to useful for this particular query, so I'd suggest trying to do:
CREATE INDEX IX_ExecutionLog_TestID ON ExecutionLog (TestID, ExecutionID)
...at the very least. Your query is filtering by TestID, so this needs to be the primary column in the composite index: if you have no indexes on TestID, then SQL Server will resort to scanning the entire table in order to find rows where TestID = 50881.
It may help to think of indexes on SQL tables in the same way as those you'd find in the back of a big book: hierarchical and multi-level. If you were looking for something, you'd manually look under 'T' for TestID, and there'd be a sub-heading under TestID for ExecutionID. Without an index entry for TestID, you'd have to read through the entire book looking for TestID, then see if there's a mention of ExecutionID with it. This is effectively what SQL Server has to do.
If you don't have any indexes yet, you'll find it useful to review all the queries that hit the table and ensure that one of the indexes you create is a clustered index (rather than non-clustered).
Try to re-work everything into something that works in a set based manner.
So, for instance, you could write a select statement like this:
;WITH OrderedLogs AS (
    SELECT ExecutionID, TestID,
           ROW_NUMBER() OVER (PARTITION BY TestID ORDER BY ExecutionID DESC) AS rn
    FROM ExecutionLog
)
SELECT * FROM OrderedLogs WHERE rn = 1 AND TestID IN (50881, 50882, 50883)
This would then find the maximum ExecutionID for 3 different tests simultaneously.
You might need to store that result in a table variable/temp table, but hopefully, instead, you can continue building up a larger single query that processes all of the results in parallel.
This is the sort of processing that SQL is meant to be good at - don't cripple the system by iterating through the TestIDs in your code.
If you need to pass many test IDs into a stored procedure for this sort of query, look at Table Valued Parameters.
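A minimal sketch of that idea (the table type and procedure names are assumptions, not from the original answer):
-- Table type carrying the list of test IDs
CREATE TYPE dbo.TestIdList AS TABLE (TestID int PRIMARY KEY);
GO
CREATE PROCEDURE dbo.GetLatestExecutions
    @TestIDs dbo.TestIdList READONLY
AS
BEGIN
    ;WITH OrderedLogs AS (
        SELECT l.ExecutionID, l.TestID,
               ROW_NUMBER() OVER (PARTITION BY l.TestID ORDER BY l.ExecutionID DESC) AS rn
        FROM ExecutionLog AS l
        JOIN @TestIDs AS t ON t.TestID = l.TestID
    )
    SELECT TestID, ExecutionID
    FROM OrderedLogs
    WHERE rn = 1;
END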
We have a CallLog table in Microsoft SQL Server 2000. The table contains a CallEndTime field of type DATETIME, and it is an indexed column.
We usually delete free-of-charge calls and generate a monthly fee statistics report and a call detail record report; all of these queries use CallEndTime as a condition in the WHERE clause. Because there are a lot of records in the CallLog table, the queries are slow, so we want to optimize them, starting with indexing.
Question
Would it be more efficient to query an extra indexed VARCHAR column CallEndDate?
Such as
-- DATETIME based query
SELECT COUNT(*) FROM CallLog WHERE CallEndTime BETWEEN '2011-06-01 00:00:00' AND '2011-06-30 23:59:59'
-- VARCHAR based queries
SELECT COUNT(*) FROM CallLog WHERE CallEndDate BETWEEN '2011-06-01' AND '2011-06-30'
SELECT COUNT(*) FROM CallLog WHERE CallEndDate LIKE '2011-06%'
SELECT COUNT(*) FROM CallLog WHERE CallEndMonth = '2011-06'
It has to be the datetime. Dates are essentially stored as numbers in the database, so it is relatively quick to see whether a value is between two numbers.
If I were you, I'd consider splitting the data over multiple tables (by month, year or whatever) and creating a view to combine the data from all those tables. That way, any functionality which needs the entire data set can use the view, and anything which only needs a month's worth of data can access the specific table, which will be a lot quicker as it will contain much less data.
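A sketch of that idea on SQL Server 2000, i.e. a partitioned view (table, view and column names are illustrative): each monthly table carries a CHECK constraint on CallEndTime so the optimizer only touches the relevant month.
-- One table per month, constrained to that month's range
CREATE TABLE CallLog_2011_06 (
    CallLogID int NOT NULL PRIMARY KEY,
    CallEndTime datetime NOT NULL
        CHECK (CallEndTime >= '20110601' AND CallEndTime < '20110701')
    -- ... remaining CallLog columns
);
GO
-- View combining the monthly tables (assumes a similar CallLog_2011_05 exists)
CREATE VIEW CallLogAll AS
    SELECT CallLogID, CallEndTime FROM CallLog_2011_05
    UNION ALL
    SELECT CallLogID, CallEndTime FROM CallLog_2011_06;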
I think comparing DATETIME values is much faster than using the LIKE operator.
I agree with DoctorMick on splitting your DATETIME into persisted Year, Month, Day columns.
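A sketch of that approach (column and index names are assumptions; the PERSISTED keyword arrived in SQL Server 2005, but on 2000 you can still index a computed column):
-- Computed columns derived from CallEndTime
ALTER TABLE CallLog ADD CallEndYear AS DATEPART(year, CallEndTime);
ALTER TABLE CallLog ADD CallEndMonth AS DATEPART(month, CallEndTime);

-- Indexing the computed columns materializes their values in the index
CREATE INDEX IX_CallLog_YearMonth ON CallLog (CallEndYear, CallEndMonth);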
For your query which selects COUNT(*), check whether there is a Table Lookup node in the execution plan. If so, this might be because your CallEndTime column is nullable (you said you have a [nonclustered] index on the CallEndTime column). If you make the column NOT NULL and rebuild that index, the count becomes an index scan, which is not so slow, and I think you will get much faster results.
I have a sql query with 50 parameters, such as this one.
DECLARE
@p0 int, @p1 int, @p2 int, (text omitted), @p49 int
SELECT
@p0=111227, @p1=146599, @p2=98917, (text omitted), @p49=125319
--
SELECT
[t0].[CustomerID], [t0].[Amount],
[t0].[OrderID], [t0].[InvoiceNumber]
FROM [dbo].[Orders] AS [t0]
WHERE ([t0].[CustomerID]) IN
(@p0, @p1, @p2, (text omitted), @p49)
The estimated execution plan shows that the database will collect these parameters, order them, and then read the index Orders.CustomerID from the smallest parameter to the largest, then do a bookmark lookup for the rest of the record.
The problem is that the smallest and largest parameters could be quite far apart, and this can lead to reading possibly the entire index.
Since this is being done in a loop from the client side (50 params sent each time, for 1000 iterations), this is a bad situation. How can I formulate the query/client side code to get my data without repetitive index scanning while keeping the number of round trips down?
I thought about ordering the 50k parameters such that smaller readings of the index would occur. There is a weird mitigating circumstance that prevents this, so I can't use that solution. To model this circumstance, just assume that I only have 50 IDs available at any time and can't control their relative position in the global list.
Insert the parameters into a temporary table, then join it with your table:
DECLARE @params AS TABLE(param INT);
INSERT
INTO @params
VALUES (@p1)
...
INSERT
INTO @params
VALUES (@p49)
SELECT
[t0].[CustomerID], [t0].[Amount],
[t0].[OrderID], [t0].[InvoiceNumber]
-- alias the table variable so its column can be referenced in the WHERE clause
FROM @params AS [p], [dbo].[Orders] AS [t0]
WHERE [t0].[CustomerID] = [p].[param]
This will most probably use NESTED LOOPS with an INDEX SEEK over CustomerID on each loop.
An index range scan is pretty fast. There's usually a lot less data in the index than in the table and there's a much better chance that the index is already in memory.
I can't blame you for wanting to save round trips to the server by putting each of the IDs you're looking for in a bundle. If the index RANGE scan really worries you, you can create a parameterized server side cursor (e.g., in TSQL) that takes the CustomerID as a parameter. Stop as soon as you find a match. That query should definitely use an index unique scan instead of a range scan.
To build on Quassnoi's answer, if you were working with SQL 2008, you could save yourself some time by inserting all 50 items with one statement. SQL 2008 has a new feature for multiple valued inserts.
e.g.
INSERT INTO #Customers (CustID)
VALUES (@p0),
(@p1),
<snip>
(@p49)
Now the #Customers table is populated and ready to INNER JOIN on, or to use in your IN clause.