partition function in SQL Server 2005 - sql-server

I am reading the MSDN page on the partition function, $PARTITION (Transact-SQL).
I am confused about what the sample below does under the hood. My understanding is that this SQL statement iterates over all rows in the table Production.TransactionHistory. For all rows that map to the same partition, $PARTITION.TransactionRangePF1(TransactionDate) returns the same value, i.e. the partition number. So, for example, all rows in partition 1 produce a single row in the result, since they all share the same value of $PARTITION.TransactionRangePF1(TransactionDate). Is my understanding correct?
USE AdventureWorks ;
GO
SELECT $PARTITION.TransactionRangePF1(TransactionDate) AS Partition,
COUNT(*) AS [COUNT] FROM Production.TransactionHistory
GROUP BY $PARTITION.TransactionRangePF1(TransactionDate)
ORDER BY Partition ;
GO

If your partition function is defined like
CREATE PARTITION FUNCTION TransactionRangePF1(DATETIME)
AS RANGE RIGHT FOR VALUES ('2007-01-01', '2008-01-01', '2009-01-01')
, then this clause:
$PARTITION.TransactionRangePF1(TransactionDate)
is equivalent to:
CASE
WHEN TransactionDate < '2007-01-01' THEN 1
WHEN TransactionDate < '2008-01-01' THEN 2
WHEN TransactionDate < '2009-01-01' THEN 3
ELSE 4
END
If all your dates fall before '2007-01-01', then the first WHEN clause will always fire and it will always return 1.
The query you posted will return at most 1 row for each partition, as it will group all the rows from the partition (if any) into one group.
If a partition contains no rows, no row for it is returned in the result set.
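To make the RANGE RIGHT mapping concrete, here is a minimal Python model of the CASE expression and the GROUP BY above. It is only a sketch: the boundary values come from the example function, the sample dates are made up, and `partition_number` is a hypothetical helper, not a SQL Server API.

```python
from bisect import bisect_right
from collections import Counter
from datetime import date

# Boundary values from the example partition function (RANGE RIGHT).
BOUNDARIES = [date(2007, 1, 1), date(2008, 1, 1), date(2009, 1, 1)]


def partition_number(value):
    """Model of $PARTITION for RANGE RIGHT: a boundary value belongs to the partition on its right."""
    return bisect_right(BOUNDARIES, value) + 1


# Made-up sample rows standing in for TransactionDate values.
sample_dates = [
    date(2006, 5, 1),
    date(2006, 12, 31),
    date(2007, 1, 1),  # exactly on a boundary, so it lands in partition 2
    date(2009, 6, 15),
]

# Equivalent of GROUP BY partition number with COUNT(*).
counts = Counter(partition_number(d) for d in sample_dates)
for partition, row_count in sorted(counts.items()):
    print(partition, row_count)
```

Partition 3 has no sample rows, so it does not appear in the output, which matches the behaviour of the grouped query.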

It returns the number of records in each of the non-empty partitions in the partitioned table Production.TransactionHistory, so yes, your reasoning is correct.

Have you tried generating an execution plan for the statement? That might give you some insight into what it's actually doing under the covers.
In Management Studio, press Ctrl+L to display the estimated execution plan, and post it here if you'd like some interpretation.

Related

SQL Actual Execution Plan with Sort took high cost

I have a table named DocumentItem whose Id column is the clustered index (primary key).
Compare these two queries:
Query 1 (no ORDER BY):
select *
from DocumentItem
where (HistoryCreateDate >= '2019-09-04 05:00:00' AND HistoryCreateDate <= '2019-12-04 05:00:00') and ActNodeState>140100
It took 00:00:09 and returned 168,357 rows.
Query 2 (with ORDER BY):
select *
from DocumentItem
where (HistoryCreateDate >= '2019-09-04 05:00:00' AND HistoryCreateDate <= '2019-12-04 05:00:00') and ActNodeState>140100 order by Id
It took 00:02:41 and returned 168,357 rows.
Here is the actual execution plan:
Why did the second query take so long?
SQL Server has decided that your index IX_HistoryCreateDate (not sure of the full name) is sufficiently selective that it will use it to find the rows that it needs. However, that index isn't sorted on the ID column. It does include the ID column already (whether you specified it or not) because it's the clustering key.
I'd suggest recreating your IX_HistoryCreateDate index like this:
CREATE INDEX IX_HistoryCreateDate ON DocumentItem
( HistoryCreateDate, ID)
INCLUDE (ActNodeState);
And I think you'll be fine. It's still not going to be great and it will have to do a large number of lookups, because your query uses SELECT *. Do you really need all columns returned? If so, and you do this all the time, you might consider reclustering the table in the order that you need.

SQL Query is slow when ORDER BY statement added

I have a table [Documents] with the following columns:
Name (string)
Status (string)
DateCreated [datetime]
This table has around 1 million records. All three of these columns have an index (a single index for each one).
When I run this query:
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New';
Execution is really fast (300 ms.)
If I run the same query but with the ORDER BY clause, it's really slow (3000 ms)
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New'
order by DateCreated;
I understand that it's searching in another index (DateCreated), but should it really be that much slower? If so, why? Is there anything I can do to speed this query up (a composite index)?
Thanks
BTW: All Indexes including DateCreated have really low fragmentation, in fact I ran a reorganize and it didn't change a thing.
As far as why the query is slower, the query is required to return the rows "in order", so it either needs to do a sort, or it needs to use an index.
Using the index with a leading column of DateCreated, SQL Server can avoid a sort. But SQL Server would also have to visit the pages in the underlying table to evaluate whether the row is to be returned, looking at the values in the Status and Name columns.
If the optimizer chooses not to use the index with DateCreated as the leading column, then it needs to first locate all of the rows that satisfy the predicates, and then perform a sort operation to get those rows in order. Then it can return the first fifty rows from the sorted set. (SQL Server wouldn't necessarily need to sort the entire set, but it would need to go through that whole set, and do sufficient sorting to guarantee that it's got the "first fifty" that need to be returned.)
NOTE: I suspect you already know this, but to clarify: SQL Server honors the ORDER BY before the TOP 50. If you wanted any 50 rows that satisfied the predicates, but not necessarily the 50 rows with the lowest values of DateCreated, you could rewrite your query to get (at most) 50 rows and then sort just those.
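The two strategies described above can be sketched in Python. This is a toy model rather than what the engine actually runs: the rows are made up, DateCreated is an integer for brevity, and the function names are hypothetical.

```python
import heapq
from itertools import islice


def matches(row):
    """The WHERE clause: Status = 'New' and Name is 'None', NULL, or ''."""
    return row["Status"] == "New" and row["Name"] in ("None", None, "")


def top_n_by_sort(rows, n):
    """Strategy 1: find every matching row, then keep the n with the lowest DateCreated."""
    return heapq.nsmallest(n, filter(matches, rows), key=lambda r: r["DateCreated"])


def top_n_by_ordered_scan(rows_in_date_order, n):
    """Strategy 2: walk rows already ordered by DateCreated and stop after n matches."""
    return list(islice(filter(matches, rows_in_date_order), n))


# Made-up data standing in for the Documents table.
rows = [
    {"Name": "None", "Status": "New", "DateCreated": 5},
    {"Name": "abc", "Status": "New", "DateCreated": 1},
    {"Name": None, "Status": "New", "DateCreated": 3},
    {"Name": "", "Status": "Old", "DateCreated": 2},
    {"Name": "", "Status": "New", "DateCreated": 4},
]
# Stands in for an index whose leading column is DateCreated.
date_index = sorted(rows, key=lambda r: r["DateCreated"])

print([r["DateCreated"] for r in top_n_by_sort(rows, 2)])
print([r["DateCreated"] for r in top_n_by_ordered_scan(date_index, 2)])
```

Both strategies return the same rows. The first has to examine every matching row before it can answer, while the second can stop early but must test each row it passes along the way.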
A couple of ideas to improve performance
Adding a composite index (as other answers have suggested) may offer some improvement, for example:
CREATE NONCLUSTERED INDEX Documents_IX -- index name is a placeholder
ON Documents (Status, DateCreated, Name)
SQL Server might be able to use that index to satisfy the equality predicate on Status, and also return the rows in DateCreated order without a sort operation. SQL server may also be able to satisfy the predicate on Name from the index, limiting the number of lookups to pages in the underlying table, which it needs to do for rows to be returned, to get "all" of the columns for the row.
For SQL Server 2008 or later, I'd consider a filtered index, depending on the cardinality of Status='New' (that is, if the rows that satisfy the predicate Status='New' are a relatively small subset of the table).
CREATE NONCLUSTERED INDEX Documents_FIX
ON Documents (Status, DateCreated, Name)
WHERE Status = 'New'
I would also modify the query to specify ORDER BY Status, DateCreated, Name
so that the ORDER BY clause matches the index. Since Status is fixed to 'New' by the WHERE clause, this only affects how rows that tie on DateCreated are ordered.
As a more complicated alternative, I would consider adding a persisted computed column and indexing it. SQL Server does not allow the WHERE clause of a filtered index to reference a computed column, so this is a regular index; the NULL entries sort together at one end and a seek on IS NOT NULL skips them.
ALTER TABLE Documents
ADD new_none_date_created AS
CASE
WHEN Status = 'New' AND COALESCE(Name,'') IN ('','None') THEN DateCreated
ELSE NULL
END
PERSISTED
;
CREATE NONCLUSTERED INDEX Documents_FIXP
ON Documents (new_none_date_created)
;
Then the query could be re-written:
SELECT TOP 50 *
FROM Documents
WHERE new_none_date_created IS NOT NULL
ORDER BY new_none_date_created
;
If DateCreated field means insertion time to table, you can create an integer id field and order by that integer field.
You need an index on two columns: (Name, DateCreated). The order of the columns in the index is important. So, replace your index on just Name with a new index on the two columns (Name, DateCreated).

SQL Get Second Record

I am looking to retrieve only the second (duplicate) record from a data set. For example, in the following picture:
Inside the UnitID column there are two separate records for 105. I only want the returned data set to contain the second 105 record. Additionally, I want this query to return the second record for all duplicates, not just 105.
I have tried everything I can think of, although I am not that experienced, and I cannot figure it out. Any help would be greatly appreciated.
You need to use GROUP BY for this.
Here's an example (I can't read your first column name, so I'm calling it JobUnitK):
SELECT MAX(JobUnitK), Unit
FROM JobUnits
WHERE DispatchDate = 'oct 4, 2015'
GROUP BY Unit
HAVING COUNT(*) > 1
I'm assuming JobUnitK is your ordering/id field. If it's not, just replace MAX(JobUnitK) with MAX(FieldIOrderWith).
Use the RANK function. Rank the rows with OVER (PARTITION BY UnitID ORDER BY ...) on your ordering column and pick the rows with rank 2.
For reference -
https://msdn.microsoft.com/en-IN/library/ms176102.aspx
Assuming SQL Server 2005 and up, you can use the Row_Number windowing function:
WITH DupeCalc AS (
SELECT
DupID = Row_Number() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID),
*
FROM JobUnits
WHERE DispatchDate = '20151004'
)
SELECT *
FROM DupeCalc
WHERE DupID >= 2 -- use DupID = 2 to return only the second row per UnitID
ORDER BY UnitID DESC
;
This is better than a solution that uses Max(JobUnitKeyID) for multiple reasons:
There could be more than one duplicate per UnitID, in which case you would need Min(JobUnitKeyID) per UnitID and a join back on UnitID where JobUnitKeyID <> MinJobUnitKeyID.
Using Min or Max requires you to join back to the same data (which will be inherently slower).
If the ordering key you use turns out to be non-unique, you won't be able to pull the right number of rows with either one.
If the ordering key consists of multiple columns, the query using Min or Max explodes in complexity.
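For readers who want to see the numbering logic outside SQL, here is a small Python model of Row_Number() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID). The rows are invented and `rows_from_second_onward` is a hypothetical helper; it only illustrates the logic and says nothing about how SQL Server executes the query.

```python
from itertools import groupby
from operator import itemgetter

# Made-up rows: (JobUnitKeyID, UnitID).
job_units = [(1, 101), (2, 105), (3, 102), (4, 105), (5, 102), (6, 105)]


def rows_from_second_onward(rows):
    """Number rows within each UnitID by JobUnitKeyID and keep those numbered 2 or higher."""
    # Sorting by (UnitID, JobUnitKeyID) models PARTITION BY UnitID ORDER BY JobUnitKeyID.
    ordered = sorted(rows, key=itemgetter(1, 0))
    result = []
    for unit_id, group in groupby(ordered, key=itemgetter(1)):
        for row_number, (key_id, _) in enumerate(group, start=1):
            if row_number >= 2:  # use == 2 to keep only the second row
                result.append((row_number, key_id, unit_id))
    return result


print(rows_from_second_onward(job_units))
```

Unit 101 appears once and is dropped, unit 102 contributes its second row, and unit 105 contributes its second and third rows, all in a single pass with no join back to the source data.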

How can this expression reach the NULL expression?

I'm trying to randomly populate a column with values from another table using this statement:
UPDATE dbo.SERVICE_TICKET
SET Vehicle_Type = (SELECT TOP 1 [text]
FROM dbo.vehicle_typ
WHERE id = abs(checksum(NewID()))%21)
It seems to work fine, however the value NULL is inserted into the column. How can I get rid of the NULL and only insert the values from the table?
This can happen when you don't have an appropriate index on the ID column of your vehicle_typ table. Here's a smaller query that exhibits the same problem:
create table T (ID int null)
insert into T(ID) values (0),(1),(2),(3)
select top 1 * from T where ID = abs(checksum(NewID()))%3
Because there's no index on T, what happens is that SQL Server performs a table scan and then, for each row, attempts to satisfy the where clause. Which means that, for each row it evaluates abs(checksum(NewID()))%3 anew. You'll only get a result if, by chance, that expression produces, say, 1 when it's evaluated for the row with ID 1.
If possible (I don't know your table structure) I would first populate a column in SERVICE_TICKET with a random number between 0 and 20 and then perform this update using the already generated number. Otherwise, with the current query structure, you're always relying on SQL Server being clever enough to only evaluate abs(checksum(NewID()))%21 once for each outer row, which it may not always do (as you've already found out).
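The effect of re-evaluating the random expression for every scanned row can be modelled in a few lines of Python. This is a sketch under assumptions: it supposes the lookup table holds ids 0 through 20, and `first_match` and `null_rate` are hypothetical helpers, not anything SQL Server exposes.

```python
import random


def first_match(ids, draw):
    """Model a scan that calls draw() anew for every row, like NEWID() in the WHERE clause."""
    for row_id in ids:
        if row_id == draw():
            return row_id
    return None  # no row matched its own draw, so the subquery yields NULL


def null_rate(trials, row_count=21, seed=0):
    """Estimate how often the scan finds no row at all."""
    rng = random.Random(seed)
    ids = range(row_count)
    misses = sum(
        first_match(ids, lambda: rng.randrange(row_count)) is None
        for _ in range(trials)
    )
    return misses / trials


print(null_rate(10_000))
```

Under these assumptions each row matches its own draw with probability 1/21, so a scan comes up empty with probability (20/21)^21, a little over a third of the time. That is why NULL shows up so often in the updated column.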
@Damien_The_Unbeliever explained why your query fails.
My first variant was not correct, because I didn't understand the problem in full.
You want to set each row in SERVICE_TICKET to a different random value from vehicle_typ.
To fix it simply order by random number, rather than comparing a random number with ID. Like this (and you don't care how many rows are in vehicle_typ as long as there is at least one row there).
WITH
CTE
AS
(
SELECT
dbo.SERVICE_TICKET.Vehicle_Type,
CA.[text]
FROM
dbo.SERVICE_TICKET
CROSS APPLY
(
SELECT TOP 1 [text]
FROM dbo.vehicle_typ
ORDER BY NewID()
) AS CA
)
UPDATE CTE
SET Vehicle_Type = [text];
First we define a Common Table Expression, which you can think of as a temporary table. For each row in SERVICE_TICKET we pick one random row from vehicle_typ using CROSS APPLY. Then we UPDATE the original table with the chosen rows.

determine order of operations in query

Say I have a query like this:
SELECT *
FROM Foo
WHERE Name IN ('name1', 'name2')
AND (Date<'2013-01-01' AND Date>'2010-01-01')
AND Type = 1
Is there a way to force the SQL server to evaluate the expressions in the order I determine and not what the query optimizer says? For example I want the IN clause evaluated first, the output of that evaluated by Type = 1 and finally the dates, in EXACTLY that order.
Yes it is largely possible (though there are some caveats and counter examples discussed in the answers here)
SELECT *
FROM Foo
WHERE 1 = CASE
WHEN Name IN ( 'name1', 'name2' ) THEN
CASE
WHEN Type = 1 THEN
CASE
WHEN ( Date < '2013-01-01'
AND Date > '2010-01-01' ) THEN 1
END
END
END
But why bother? There are only very limited circumstances in which I can see this would be useful (e.g. preventing divide by zero if an earlier predicate evaluated to 0).
Wrapping the predicates up like this makes the query completely unsargable and prevents index usage for any of the three (otherwise sargable) predicates. It guarantees a full scan reading all rows.
To see an example of this:
CREATE TABLE Foo
(
Id INT IDENTITY PRIMARY KEY,
Name VARCHAR(10),
[Date] DATE,
[Type] TINYINT,
Filler CHAR(8000) NULL
)
CREATE NONCLUSTERED INDEX IX_Name
ON Foo(Name)
CREATE NONCLUSTERED INDEX IX_Date
ON Foo(Date)
CREATE NONCLUSTERED INDEX IX_Type
ON Foo(Type)
INSERT INTO Foo
(Name,
[Date],
[Type])
SELECT TOP (100000) 'name' + CAST(0 + CRYPT_GEN_RANDOM(1) AS VARCHAR),
DATEADD(DAY, 7 * CRYPT_GEN_RANDOM(1), '2012-01-01'),
0 + CRYPT_GEN_RANDOM(1)
FROM master..spt_values v1,
master..spt_values v2
Then running the original query in the question vs this query gives plans
Note the second query is costed as being 100% of the cost of the batch.
The Query optimizer left to its own devices first seeks into the 414 rows matching the type predicate and uses that as a build input for the hash table. It then seeks into the 728 rows matching the name, sees if it matches anything in the hash table and for the 4 that do it performs a key lookup for the other columns and evaluates the Date predicate against those. Finally it returns the single matching row.
The second query just ploughs through all the rows in the table and evaluates the predicates in the desired order. The difference in number of pages read is pretty significant.
Original Query
Table 'Foo'. Scan count 3, logical reads 23,
Table 'Worktable'. Scan count 0, logical reads 0
Nested case
Table 'Foo'. Scan count 1, logical reads 100373
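The nested CASE expression enforces its order by short-circuiting: an inner test is reached only when the outer one has passed. A minimal Python model of that control flow is below; the sample rows are invented and `row_qualifies` is a hypothetical name. As with the full scan described above, every row still has to be visited.

```python
from datetime import date


def row_qualifies(row, evaluated):
    """Nested conditionals mirror the nested CASE: each test runs only if the previous one passed."""
    evaluated.append("Name")
    if row["Name"] in ("name1", "name2"):
        evaluated.append("Type")
        if row["Type"] == 1:
            evaluated.append("Date")
            if date(2010, 1, 1) < row["Date"] < date(2013, 1, 1):
                return True
    return False


# Made-up rows standing in for table Foo.
foo = [
    {"Name": "name9", "Type": 1, "Date": date(2012, 6, 1)},  # fails on Name
    {"Name": "name1", "Type": 2, "Date": date(2012, 6, 1)},  # fails on Type
    {"Name": "name2", "Type": 1, "Date": date(2012, 6, 1)},  # passes all three
]

for row in foo:
    log = []
    print(row_qualifies(row, log), log)
```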
Short answer: NO!
You can try to use brackets, hints, study the query plan, etc.
But is it wise to mess with the engine/optimizer that way?
You will need a lot of study and experience to outsmart the optimizer. That said, please let the engine take care of those details for you.

Resources