Sequential Guid and fragmentation - sql-server

I'm trying to understand how a sequential GUID performs better than a regular GUID.
Is it because, with a regular GUID, the index uses the last bytes of the GUID to sort? Since those bytes are random, it causes a lot of fragmentation and page splits, because data often has to be moved to another page to make room for new rows?
And a sequential GUID, since it is sequential, causes far fewer page splits and less fragmentation?
Is my understanding correct?
If anyone can shed more light on the subject, I'd appreciate it very much.
Thank you
EDIT:
Sequential GUID = NEWSEQUENTIALID()
Regular GUID = NEWID()

You've pretty much said it all in your question.
With a sequential GUID primary key, new rows are all added at the end of the table, which makes things nice and easy for SQL Server. With a random primary key, by comparison, new records could be inserted anywhere in the table. The last page of the table is very likely to be in the cache (if that's where all of the activity is going), but the chance of a random page in the middle of the table being in the cache is fairly low, meaning additional IO is required.
On top of that, when inserting rows into the middle of the table there is the chance that there isn't enough room on the page for the extra row. If that happens, SQL Server needs to perform additional, expensive IO operations (a page split) to make room for the record. The only way to avoid this is to leave gaps scattered amongst the data for extra records to be inserted into (controlled by the fill factor), which itself costs performance because the data is spread over more pages and so more IO is required to read the entire table.
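For illustration, the fill factor can be set when an index is built or rebuilt; a minimal sketch (table and index names are just the ones used in the example further down):
-- Rebuild an index leaving ~20% free space per page, so out-of-order
-- inserts have room and cause fewer page splits.
ALTER INDEX PK_YourTable ON dbo.YourTable
REBUILD WITH (FILLFACTOR = 80);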

I defer to Kimberly L. Tripp's wisdom on this topic:
But, a GUID that is not sequential - like one that has its values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) can be a horribly bad choice - primarily because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide (it's 4 times wider than an int-based identity - which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion you can always go with a bigint (8-byte int) and get 2^63-1 rows.
Read more: http://www.sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-the-clustering-key.aspx#ixzz0wDK6cece
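To put the width difference in concrete terms, a small sketch (the table name here is made up):
-- Approximate storage per key value, ignoring row overhead:
--   INT IDENTITY        4 bytes   (~2 billion positive values, ~4 billion total)
--   BIGINT IDENTITY     8 bytes   (up to 2^63-1 positive values)
--   UNIQUEIDENTIFIER   16 bytes
CREATE TABLE dbo.IntKeyedExample
(
    id INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_IntKeyedExample PRIMARY KEY CLUSTERED
);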

To visualize the whole picture, a utility named ostress can be used.
E.g. you can create two tables: one with a normal GUID as PK, another with a sequential GUID:
-- normal one
CREATE TABLE dbo.YourTable(
[id] [uniqueidentifier] NOT NULL,
CONSTRAINT [PK_YourTable] PRIMARY KEY NONCLUSTERED (id)
);
-- sequential one
CREATE TABLE dbo.YourTableSeq(
[id] [uniqueidentifier] NOT NULL CONSTRAINT [df_yourtable_id] DEFAULT (newsequentialid()),
CONSTRAINT [PK_YourTableSeq] PRIMARY KEY NONCLUSTERED (id)
);
Then with that utility you run a number of inserts, each followed by a selection of statistics about index fragmentation:
ostress -Slocalhost -E -dYourDB -Q"INSERT INTO dbo.YourTable VALUES (NEWID()); SELECT count(*) AS Cnt FROM dbo.YourTable; SELECT AVG_FRAGMENTATION_IN_PERCENT AS AvgPageFragmentation, PAGE_COUNT AS PageCounts FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL , NULL, N'LIMITED') DPS INNER JOIN sysindexes SI ON DPS.OBJECT_ID = SI.ID AND DPS.INDEX_ID = SI.INDID WHERE SI.NAME = 'PK_YourTable';" -oE:\incoming\TMP\ -n1 -r10000
ostress -Slocalhost -E -dYourDB -Q"INSERT INTO dbo.YourTableSeq DEFAULT VALUES; SELECT count(*) AS Cnt FROM dbo.YourTableSeq; SELECT AVG_FRAGMENTATION_IN_PERCENT AS AvgPageFragmentation, PAGE_COUNT AS PageCounts FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL , NULL, N'LIMITED') DPS INNER JOIN sysindexes SI ON DPS.OBJECT_ID = SI.ID AND DPS.INDEX_ID = SI.INDID WHERE SI.NAME = 'PK_YourTableSeq';" -oE:\incoming\TMP\ -n1 -r10000
Then in file E:\incoming\TMP\query.out you will find your statistics.
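If you just want to spot-check fragmentation without ostress, the same statistics can be queried directly; a sketch using sys.indexes rather than the legacy sysindexes view:
SELECT dps.avg_fragmentation_in_percent AS AvgPageFragmentation,
       dps.page_count                   AS PageCounts
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.YourTable'), NULL, NULL, N'LIMITED') AS dps
JOIN sys.indexes AS si
    ON  dps.object_id = si.object_id
    AND dps.index_id  = si.index_id
WHERE si.name = 'PK_YourTable';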
My results are:
"Normal" GUID:
Records AvgPageFragmentation PageCounts
----------------------------------------------
1000 87.5 8
2000 93.75 16
3000 96.15384615384616 26
4000 96.875 32
5000 96.969696969696969 33
10000 98.571428571428584 70
Sequential GUID:
Records AvgPageFragmentation PageCounts
----------------------------------------------
1000 83.333333333333343 6
2000 63.636363636363633 11
3000 41.17647058823529 17
4000 31.818181818181817 22
5000 25.0 28
10000 12.727272727272727 55
As you can see, with sequentially generated GUIDs being inserted the index is much less fragmented, because the insert operation leads to new page allocations far less often.

Related

Large Table Optimization In SQL Server 2014

I have a reporting table that is populated from various fact tables in my Data Warehouse. The issue is that for one customer in that reporting table, it takes 46 seconds to pull their data. That one customer has 4,232,424 records. In total, the table has 5,336,393 records in it, and has 4 columns. I'll post the table structure and the query I'm running. I need to get the result time down as low as possible. I've tried in-memory tables, various indexes, and indexed views.
TABLE STRUCTURE
CREATE TABLE cache.Tree
(
    CustomerID INT NOT NULL PRIMARY KEY NONCLUSTERED,
    RelationA_ID INT NOT NULL,
    RelationB_ID INT NOT NULL,
    NestedLevel INT NOT NULL,
    lft INT NOT NULL,
    rgt INT NOT NULL,
    INDEX IX_LEGS CLUSTERED (lft, rgt),
    INDEX IX_LFT NONCLUSTERED (lft)
)
The Report Query
SELECT
tp.CustomerID AS DLine,
t.CustomerID,
t.RelationA_ID,
Level = t.NestedLevel - tp.NestedLevel,
IndentedSort = t.lft
FROM cache.UnilevelTreeWithLC2 tp
INNER JOIN cache.UniLevelTreeWithLC2 t
ON t.lft between tp.lft AND tp.rgt
WHERE tp.CustomerID = 7664
Any help or guidance would be greatly appreciated.
UPDATE 1: Query Execution Plan
UPDATE 2: Solved
I was able to get permission to filter out inactive people in the tree. This has cut the query execution time almost in half, as long as I keep the indexes I put on the table.
Try FORCESCAN - for a query that pulls 80% of a narrow table I would expect SQL to scan, but it might not because of bad stats or one of the various cardinality-estimation bugs (which are fixed but require trace flags to enable).
I would also ditch the Celko nested sets - a single parent_id column will make your table even narrower, which should speed up these throughput-bound cases, spare you the lft/rgt maintenance, and still be very fast with recursive queries. See the sketch below.
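A rough sketch of both ideas, reusing the table and column names from the question (the ParentID column in the second part is hypothetical):
-- 1) FORCESCAN hint on the existing query:
SELECT
    tp.CustomerID AS DLine,
    t.CustomerID,
    t.RelationA_ID,
    Level = t.NestedLevel - tp.NestedLevel,
    IndentedSort = t.lft
FROM cache.UnilevelTreeWithLC2 tp
INNER JOIN cache.UniLevelTreeWithLC2 t WITH (FORCESCAN)
    ON t.lft BETWEEN tp.lft AND tp.rgt
WHERE tp.CustomerID = 7664;

-- 2) Adjacency-list alternative: a hypothetical ParentID column plus a
--    recursive CTE that pulls a customer's whole downline.
WITH Downline AS
(
    SELECT CustomerID, ParentID, 0 AS Level
    FROM cache.Tree
    WHERE CustomerID = 7664

    UNION ALL

    SELECT c.CustomerID, c.ParentID, d.Level + 1
    FROM cache.Tree c
    INNER JOIN Downline d ON c.ParentID = d.CustomerID
)
SELECT CustomerID, Level
FROM Downline;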

Performance Strategy for growing data

I know that performance tuning is something which needs to be done specific to each environment, but I have put maximum effort into making my question clear, to see if I am missing something in the possible improvements.
I have a table [TestExecutions] in SQL Server 2005. It has around 0.2 million records as of today and is expected to grow to 5 million in a couple of months.
CREATE TABLE [dbo].[TestExecutions]
(
[TestExecutionID] [int] IDENTITY(1,1) NOT NULL,
[OrderID] [int] NOT NULL,
[LineItemID] [int] NOT NULL,
[Manifest] [char](7) NOT NULL,
[RowCompanyCD] [char](4) NOT NULL,
[RowReferenceID] [int] NOT NULL,
[RowReferenceValue] [char](3) NOT NULL,
[ExecutedTime] [datetime] NOT NULL
)
CREATE INDEX [IX_TestExecutions_OrderID]
ON [dbo].[TestExecutions] ([OrderID])
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])
I have the following two queries for the same purpose (Query 2 and Query 3). For 100 records in #OrdersForRC, Query 2 performs better (39% vs 47%), whereas with 10000 records in #OrdersForRC, Query 3 performs better (53% vs 33%), as per the execution plans.
In the initial few months of use, the #OrdersForRC table will have close to 100 records. It will gradually increase to 2500 records over a couple of months.
Of the following two approaches, which one is good for such an incrementally growing scenario? Or is there any strategy to make one approach work better than the other even as the data grows?
Note: In Plan 2, the first query uses a Hash Match.
References
query optimizer operator choice - nested loops vs hash match (or merge)
Execution Plan Basics — Hash Match Confusion
Test Query
CREATE TABLE #OrdersForRC
(
OrderID INT
)
INSERT INTO #OrdersForRC
--SELECT DISTINCT TOP 100 OrderID FROM [TestExecutions]
SELECT DISTINCT TOP 5000 OrderID FROM LWManifestReceiptExecutions
--QUERY 2:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] (NOLOCK) H
INNER JOIN #OrdersForRC R
ON R.OrderID = H.OrderID
--QUERY 3:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] (NOLOCK) H
WHERE OrderID IN (SELECT OrderID FROM #OrdersForRC)
DROP TABLE #OrdersForRC
Plan 1
Plan 2
As commented above, you have not specified the table definition of LWManifestReceiptExecutions or how many rows it has.
You are selecting the TOP N rows without an ORDER BY. Do you want TOP N random IDs, rows in a specific order, or does the order not matter to you? If the order does matter, you can create an index on the column you need in the ORDER BY.
If OrderID is unique in the [dbo].[TestExecutions] table then you should mark it as unique: drop the index and recreate it as UNIQUE.
DROP INDEX [IX_TestExecutions_OrderID] ON [dbo].[TestExecutions]
CREATE UNIQUE INDEX [IX_TestExecutions_OrderID]
ON [dbo].[TestExecutions] ([OrderID])
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])
You said the data keeps growing and will reach millions of rows in a couple of months.
No need to worry: SQL Server can easily handle these queries with a properly built schema and indexes.
When this data model starts hurting you can look at other options, but not now - I have seen people handle billions of rows in SQL Server.
I can see that you are comparing the queries on the basis of query cost and concluding that the query with the higher percentage is more expensive.
That is not always the case: the query cost is the aggregate subtree cost of all iterators in the query plan, and the total estimated cost of an iterator is a simple sum of its I/O and CPU components.
The cost values represent expected execution times (in seconds) on a particular hardware configuration, but with modern hardware these costs might be irrelevant.
Now coming to your queries: you have written two queries to get the same result, but they are not identical.
IN PLAN 1, Query 1
Expressed with a JOIN.
The QO is choosing a nested loops join, which is a good choice for this particular scenario: every row with key OrderID in table #OrdersForRC seeks its value in table dbo.[TestExecutions], until all rows are matched.
IN PLAN 1, Query 2
Expressed with IN.
The QO is doing the same thing as for query one, but there is an extra distinct sort (Sort and Stream Aggregate). The reasoning behind it is that you expressed this query with IN and table #OrdersForRC can contain duplicate rows, so the sort is necessary just to eliminate them.
IN PLAN 2, Query 1
Expressed with a JOIN.
Now that there are 1000 rows in table #OrdersForRC, the QO chooses a hash join over a loop join, because a loop join over 1000 rows costs more than a hash join, and the rows are unordered and can contain NULLs as well, so a HASH JOIN is the perfect strategy here.
IN PLAN 2, Query 2
Expressed with IN.
The QO has chosen a Distinct Sort for the same reason as in Plan 1, Query 2, and then a Merge Join, because the rows are now sorted on the ID column for both tables.
If you just mark the temp table column as NOT NULL and UNIQUE then it is more likely you will get the same execution plan for both the IN and the JOIN.
CREATE TABLE #OrdersForRC
(OrderID INT NOT NULL UNIQUE)
Execution plan

Why is this query running so slow?

This query runs very fast (<100 msec):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
If I add just a time filter, it takes too long (22 seconds!):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].Time > '2015-04-10'
AND [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
I tried adding an index on the [Time] column of the Sms table, but the optimizer does not seem to be using it. I tried using WITH (INDEX (Ix_Sms_Time)), but to my surprise it takes even more time (29 seconds!).
Here is the actual execution plan:
The execution plan is the same for both queries. The tables mentioned here have 5M to 8M rows (indexes are < 1% fragmented and stats are up to date). I am using MS SQL Server 2008 R2 on a 16-core, 32 GB memory Windows 2008 R2 machine.
Does it help when you force the time filter to kick in only after the client filter has run? For instance, like in this example:
;WITH ClientData AS (
SELECT
[E2].[CompanyId]
,[E2].[Time]
,[E1].[Id]
,[E1].[Status]
FROM [dbo].[SplittedSms] AS [E1]
INNER JOIN [dbo].[Sms] AS [E2]
ON [E1].[SmsId] = [E2].[Id]
WHERE [E2].[CompanyId] = 4563
AND ([E1].[NotifiedToClient] IS NULL)
)
SELECT TOP 10
[CompanyId]
,[Id]
,[Status]
FROM ClientData
WHERE [Time] > '2015-04-10'
Create an index on Sms with the following Index Key Columns (in this order):
CompanyID
Time
You may or may not need to add Id as an Included Column.
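A sketch of such an index (object names are assumed; the INCLUDE may be unnecessary if Id is the clustered key):
CREATE NONCLUSTERED INDEX IX_Sms_CompanyId_Time
ON dbo.Sms (CompanyId, Time)
INCLUDE (Id);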
What datatype is your Time column?
If it's datetime, try converting your '2015-04-10' into the equivalent data type, so that it can use the index.
Declare @test datetime
Set @test = '2015-04-10'
Then modify your condition:
[Extent2].Time > @test
SQL Server implicitly casts to a matching data type if there is a data-type mismatch, and any function or cast operation applied to the indexed column prevents it from using the index.
I'm on the same track as @JonTirjan: the index with just Time results in a lot of key lookups, so you should try at least the following:
create index xxx on Sms (Time, CompanyId) include (Id)
or
create index xxx on Sms (CompanyId, Time) include (Id)
If Id is your clustered index key, then it's not needed in the INCLUDE clause. If a significant part of your data belongs to CompanyID 4563, it might be OK to have it as an included column too.
The percentages you see in the actual plan are just estimates based on row count assumptions, so they are sometimes totally wrong. Looking at the actual number of rows/executions plus the STATISTICS IO output should give you an idea of what's actually happening.
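For instance, a minimal way to capture that output before running the query:
-- Show per-table logical/physical reads and timings in the Messages tab
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- ... run the query here, then inspect the output ...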
Two things come to mind:
By adding an extra restriction it becomes 'harder' for the database to find the first 10 rows that match your restrictions. Finding the first 10 rows out of, say, 10,000 matching rows (from a total of 1 million) is easier than finding the first 10 rows out of maybe 100 matching rows (from a total of 1 million).
The index is probably not being used because it is created on a datetime column, which is not very efficient if you are also storing the time of day in it. You might want to create a clustered index on the [time] column (but then you would have to remove the clustered index which is now on the [CompanyId] column), or you could create a computed column that stores the date part of the [time] column, create an index on that computed column, and filter on it; see the sketch below.
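A rough sketch of the computed-column idea, assuming [Time] is a datetime column (the column and index names here are made up):
-- Add a computed column holding only the date part of [Time], then index it
-- so the date filter can seek instead of scanning the datetime values.
ALTER TABLE dbo.Sms
    ADD TimeDate AS CAST([Time] AS date) PERSISTED;

CREATE NONCLUSTERED INDEX IX_Sms_TimeDate
    ON dbo.Sms (TimeDate);

-- ...and filter on the computed column: WHERE TimeDate > '2015-04-10'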
I found out that there was no index on the foreign key column (SmsId) of the SplittedSms table. I made one, and it seems the second query is almost as fast as the first one now.
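For reference, the fix amounts to something like this (the index name is assumed):
CREATE NONCLUSTERED INDEX IX_SplittedSms_SmsId
    ON dbo.SplittedSms (SmsId);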
The execution plan now:
Thanks everyone for the effort.

Query slow for certain criteria on clustered index

I have a table called readings that has > 76 million rows in it that I'm running this query on:
declare @tunnel_id int = 13
SELECT TOP 1 local_time, recorded_time
FROM readings
WHERE tunnel_id = @tunnel_id
ORDER BY id DESC
The id column is a bigint, set as the primary key, and has a clustered index, and there is also an index on the tunnel_id field.
This works great and returns in less than a second for about 16 out of the 20 different tunnel_ids I'm trying. However, on the last 4 or so the query takes 40 seconds and uses hundreds of thousands of reads.
I tried modifying the query into this:
SELECT TOP (1) local_time, recorded_time
FROM readings
where id = (
SELECT TOP 1 id
FROM readings
WHERE tunnel_id = 13
ORDER BY id DESC
)
Which, once again, is only slow for a few tunnel_ids. What perplexes me more is that the inner SELECT runs quickly for the slow IDs, and if I hardcode the maximum id instead of using the subquery it also runs quickly.
What am I missing here that's making this query perform poorly?
Edit for comments:
Tunnel_id is not unique, each tunnel has multiple millions of rows. This is running on Sql Server 2012.
I included the actual execution plans from both the fast and slow runs and they are identical.
Fast:
Slow:
But as you can see, the first executes in less than a second while the second takes 51 seconds.
The plan basically scans the entire clustered index from start to end and looks for the first row with tunnel_id = @tunnel_id.
My educated guess is that the 'slow' tunnels don't have any rows in the beginning of the clustered index and so it has to scan more of it.
This non-clustered index should speed things up:
CREATE NONCLUSTERED INDEX [IX_FOO] ON [readings]
(
tunnel_id,
ID
)
INCLUDE
(
local_time,
recorded_time
)
This could replace the existing index on tunnel_id.
The interesting part here is that SQL isn't using the index on tunnel_id at all and is just scanning the whole table, which is slow when it's as big as 76 million rows.
I think the real reason it isn't being used is the ORDER BY id: the engine would have to perform a lookup for every matching row and then an additional sort. I doubt that parameter sniffing is the main problem here.
I would try changing the index instead and making it covering. If possible, include in the index local_time, recorded_time and id (not 100% sure the last one is needed, as it's the clustering key anyway).
CREATE NONCLUSTERED INDEX IX_tunnel_id ON dbo.readings (tunnel_id) INCLUDE (id, local_time, recorded_time)
Note that, while this can improve this particular query, it will make inserts and updates a little slower, and require additional storage space.
Just found that you can hint the query to use the tunnel_id index:
declare @tunnel_id int = 13
SELECT TOP 1 local_time, recorded_time
FROM readings
WITH (INDEX(idx_tunnel_id))
WHERE tunnel_id = @tunnel_id
ORDER BY id DESC
which works as expected and returns in less than 1 second.

Getting rid of full index scan

The following query performs badly because of a full non-clustered index scan of 6.5 million records in P4FileReleases followed by a hash join. I'm looking for possible reasons the optimizer picks a scan over a seek.
SELECT p4f.FileReleaseID
FROM P4FileReleases p4f
INNER JOIN AnalyzedFileView af
ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))
WHERE (af.tracked_change_id = 1)
From what I can tell, I see no reason for the optimizer to pick a scan of P4FileReleases. The WHERE clause limits the size of the right-hand dataset to about 1K records, and the optimizer should know it (see the histogram below).
In fact, if I take the view data and throw it into a heap table (same structure as the indexed view), the query is performed with an index seek on the larger table and a nested loops inner join instead of a hash join (and the total cost drops from 145 to around 1).
Any ideas on what might be throwing the optimizer off?
Details. Sql Server 2008 (v. 10.0.2757.0).
P4FileReleases table
Holds 6.5 million records
CREATE TABLE [dbo].[P4FileReleases](
    [FileReleaseID] [int] IDENTITY(1,1) NOT NULL,
    [FileRelease] [varchar](254) NOT NULL,
    -- 5 more fields
    CONSTRAINT [CIX_P4FileReleases_FileReleaseID_PK] PRIMARY KEY CLUSTERED
    (
        [FileReleaseID] ASC
    ),
    CONSTRAINT [NCIX_P4FileReleases_FileRelease] UNIQUE NONCLUSTERED
    (
        [FileRelease] ASC
    )
)
AnalyzedFileView
is an indexed view with statistics enabled and up-to-date.
It has four columns:
key (int, PK) - clustered index
tracked_change_id (int, FK) - non-unique, non-clustered index (covering 'path', 'revision')
path (nvarchar(1024), null)
revision (smallint, null)
tracked_change_id histogram (columns as in DBCC SHOW_STATISTICS: RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS, DISTINCT_RANGE_ROWS, AVG_RANGE_ROWS):
1   0   1222   0   1
4   0   787    0   1
8   0   2754   0   1
12  0   254    0   1
13  0   34     0   1
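For reference, a histogram like this can be pulled directly; a sketch assuming the statistics object shares the index name shown in the plan below:
DBCC SHOW_STATISTICS ('dbo.analyzed_file_view', 'tracked_change_id') WITH HISTOGRAM;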
Query Plan
|--Parallelism(Gather Streams)
|--Hash Match(Inner Join, HASH:([Expr1011])=([Expr1010]), RESIDUAL:([Expr1010]=[Expr1011]))
|--Bitmap(HASH:([Expr1011]), DEFINE:([Bitmap1015]))
| |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1011]))
| |--Compute Scalar(DEFINE:([Expr1011]=([qpsitools].[dbo].[analyzed_file_view].[path]+N'#')+CONVERT_IMPLICIT(nvarchar(30),CONVERT(varchar(30),[qpsitools].[dbo].[analyzed_file_view].[revision],0),0)))
| |--Index Seek(OBJECT:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]), SEEK:([qpsitools].[dbo].[analyzed_file_view].[tracked_change_id]=(1)) ORDERED FORWARD)
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([Expr1010]), WHERE:(PROBE([Bitmap1015],[Expr1010])))
|--Compute Scalar(DEFINE:([Expr1010]=CONVERT_IMPLICIT(nvarchar(254),[Blueprint].[dbo].[P4FileReleases].[FileRelease] as [p4f].[FileRelease],0)))
|--Index Scan(OBJECT:([Blueprint].[dbo].[P4FileReleases].[NCIX_P4FileReleases_FileRelease] AS [p4f]))
You are joining the varchar column p4f.FileRelease to an nvarchar expression (af.path + ...). Since the data types don't match, SQL has to convert one to the other, and the conversion always goes up from varchar to nvarchar. Converting p4f.FileRelease to nvarchar (the CONVERT_IMPLICIT visible in the plan) means SQL loses the ability to use the index on that column to look up/filter those values, resulting in the need to scan and convert all rows.
The best solution is to store the data as matching data types (change column p4f.FileRelease to nvarchar, or af.path to varchar). Since no one ever gets to modify existing database structures, a work-around might be to explicitly cast af.path to varchar in the query. Test it and see... though of course you can't do this if the data truly requires double-byte characters.
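A sketch of that work-around, casting the concatenated expression down to varchar so a seek on FileRelease remains possible (only safe if path never actually needs nvarchar):
SELECT p4f.FileReleaseID
FROM P4FileReleases p4f
INNER JOIN AnalyzedFileView af
    ON p4f.FileRelease = CAST(af.path + '#' + CAST(af.revision AS varchar(10)) AS varchar(254))
WHERE af.tracked_change_id = 1;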
Your problem is not the WHERE but the JOIN: you are getting an implicit conversion and a scan on the JOIN condition, while on the WHERE condition you are getting a seek.
ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))
Parallelism could also be a problem; try adding OPTION (MAXDOP 1).
Are your statistics up to date? Is there excessive fragmentation?
Try moving "af.tracked_change_id = 1" into the join clause.
INNER JOIN AnalyzedFileView af
ON p4f.FileRelease = (af.path+'#'+cast(af.revision as varchar))
AND af.tracked_change_id = 1
Logically, the WHERE is applied after the INNER JOIN.
Philip Kelley spotted the problem. It was a datatype mismatch between varchar in P4FileReleases and nvarchar in AnalyzedFileView.
