XML Index Slows Down Queries (SQL Server)

I have a simple table with the following structure, with ~10 million rows:
CREATE TABLE [dbo].[DataPoints](
[ID] [bigint] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[ModuleID] [uniqueidentifier] NOT NULL,
[DateAndTime] [datetime] NOT NULL,
[Username] [nvarchar](100) NULL,
[Payload] [xml] NULL
)
Payload is similar to this for all rows:
<payload>
<total>1000000</total>
<free>300000</free>
</payload>
The following two queries take around 11 seconds each to execute on my dev machine before creating an index on the Payload column:
SELECT AVG(Payload.value('(/payload/total)[1]','bigint')) FROM DataPoints
SELECT COUNT(*) FROM DataPoints
WHERE Payload.value('(/payload/total)[1]','bigint') = 1000000
The problem is that when I create an XML index on the Payload column, both queries take much longer to complete! I want to know:
1) Why is this happening? Isn't an XML index supposed to speed up queries, or at least a query where a value from the XML column is used in the WHERE clause?
2) What would be the proper scenario for using XML indexes, if they are not suitable for my case?
This is on SQL Server 2014.

An ordinary XML index indexes everything in the XML payload.
Selective XML Indexes (SXI)
The main limitation with ordinary XML indexes is that they index the entire XML document. This leads to several significant drawbacks, such as decreased query performance and increased index maintenance cost, mostly related to the storage costs of the index.
You will want to create a Selective XML index for better performance.
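For example, here is a minimal sketch of a selective XML index that promotes only the /payload/total path your queries actually touch (the index name and the exact promotion are my assumptions, not from the question):
CREATE SELECTIVE XML INDEX SXI_DataPoints_Payload
ON dbo.DataPoints (Payload)
FOR
(
    -- Promote the one path the queries filter and aggregate on,
    -- typed to match the value() calls in the question:
    pathTotal = '/payload/total' AS SQL BIGINT SINGLETON
);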
The other option is to create secondary XML indexes:
XML Indexes (SQL Server)
To enhance search performance, you can create secondary XML indexes. A primary XML index must first exist before you can create secondary indexes.
So the purpose of the primary XML index is largely to let you create the secondary indexes.
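If you stay with ordinary XML indexes, a sketch of that route looks like this (index names are illustrative); the FOR VALUE secondary index targets predicates like the one in your second query:
CREATE PRIMARY XML INDEX PXML_DataPoints_Payload
ON dbo.DataPoints (Payload);
GO
-- Secondary index optimized for value-based predicates such as
-- Payload.value('(/payload/total)[1]','bigint') = 1000000
CREATE XML INDEX XVAL_DataPoints_Payload
ON dbo.DataPoints (Payload)
USING XML INDEX PXML_DataPoints_Payload
FOR VALUE;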

Related

Index and primary key in large table that doesn't have an Id column

I'm looking for guidance on the best practice for adding indexes / primary key for the following table in SQL Server.
My goal is to maximize performance mostly on selecting data, but also in inserts.
IndicatorValue
(
[IndicatorId] [uniqueidentifier] NOT NULL, -- this is a foreign key
[UnixTime] [bigint] NOT null,
[Value] [decimal](15,4) NOT NULL,
[Interval] [int] NOT NULL
)
The table will have over 10 million rows. Data is batch inserted, 5-10 thousand rows at a time.
I frequently query the data and retrieve the same 5-10 thousand rows at a time with SQL similar to
SELECT [UnixTime]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
or
SELECT [UnixTime], [Value]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
Based on my limited knowledge of SQL indexes, I think:
I should have a clustered index on IndicatorId and Interval. Because of the ORDER BY, should it also include UnixTime?
As I don't have an identity column (I didn't create one because I wouldn't use it), I could have a non-clustered primary key on IndicatorId, UnixTime and Interval, because I read that it's always good to have a PK on every table.
Also, the data is very rarely deleted, and there are not many updates, but when they happen it's only on 1 row.
Any insight on best practices would be much appreciated.
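For illustration, here is a minimal sketch of the first idea, assuming (IndicatorId, Interval, UnixTime) uniquely identifies a row (the question does not say whether it does):
CREATE TABLE dbo.IndicatorValue
(
    [IndicatorId] [uniqueidentifier] NOT NULL, -- foreign key
    [UnixTime] [bigint] NOT NULL,
    [Value] [decimal](15,4) NOT NULL,
    [Interval] [int] NOT NULL,
    -- Clustered key ordered to match the frequent queries:
    -- equality on IndicatorId and Interval, then ORDER BY UnixTime
    CONSTRAINT [PK_IndicatorValue]
        PRIMARY KEY CLUSTERED ([IndicatorId], [Interval], [UnixTime])
);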

Recommended SQL Server table design for file import and processing

I have a scenario where files will be uploaded into a database table (dbo.FileImport) with each line of the file in a new row. Each row will contain the line data and the name of the file it came from. The file names are unique, but a file may contain a few million lines. Data from multiple files may exist in the table at one time.
Each file is processed and the results are stored in a separate table. After processing the data related to the file, the data is deleted from the import table to keep the table from growing indefinitely.
The table structure is as follows:
CREATE TABLE [dbo].[FileImport] (
[Id] BIGINT IDENTITY (1, 1) NOT NULL,
[FileName] VARCHAR (100) NOT NULL,
[LineData] NVARCHAR (300) NOT NULL
);
During the processing the data for the relevant file is loaded with the following query:
SELECT [LineData] FROM [dbo].[FileImport] WHERE [FileName] = @FileName
And then deleted with the following statement:
DELETE FROM [dbo].[FileImport] WHERE [FileName] = @FileName
My question pertains to the table design with regard to performance and longevity...
Is it necessary to have the [Id] column if I never use it (I am concerned about running out of numbers in the Identity eventually too)?
Should I add a PRIMARY KEY Constraint to the [Id] column?
Should I have a CLUSTERED or NONCLUSTERED index for the [FileName] column?
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
Any advice or thoughts would be much appreciated. Opinionated designs are welcome ;-)
Update 2017-12-10
I failed to mention that the lines of a file may not be unique. So please take this into account if this affects the recommendation.
An example script in the answer would be an added bonus! ;-)
Is it necessary to have the [Id] column if I never use it (I am concerned about running out of numbers in the Identity eventually too)?
It is not necessary to have an unused column. This is not a relational table and will not be referenced by a foreign key, so one could make the argument that a primary key is unnecessary.
I would not be concerned about running out of 64-bit integer values. bigint can hold a positive value of up to 9,223,372,036,854,775,807. Even loading 1 billion rows a second, it would take nearly three centuries to run out of values.
Should I add a PRIMARY KEY Constraint to the [Id] column?
I would create a composite clustered primary key on FileName and ID. That would provide an incremental value to facilitate retrieving rows in the order of insertion, and FileName as the leftmost key column would benefit your queries greatly.
Should I have a CLUSTERED or NONCLUSTERED index for the [FileName] column?
See above.
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
No. Assuming you query by FileName, only the rows requested will be touched with the suggested primary key.
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
Incremental keys avoid fragmentation.
EDIT:
Here's the suggested DDL for the table:
CREATE TABLE dbo.FileImport (
FileName VARCHAR (100) NOT NULL
, RecordNumber BIGINT NOT NULL IDENTITY
, LineData NVARCHAR (300) NOT NULL
, CONSTRAINT PK_FileImport PRIMARY KEY CLUSTERED(FileName, RecordNumber)
);
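As a quick usage sketch (my example, following the DDL above), the processing query then becomes a single clustered index seek and returns lines in insertion order:
SELECT LineData
FROM dbo.FileImport
WHERE FileName = @FileName
ORDER BY RecordNumber; -- RecordNumber is incremental within each file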
Here is a rough sketch of how I would do it:
CREATE TABLE [FileImport].[FileName] (
[FileId] BIGINT IDENTITY (1, 1) NOT NULL,
[FileName] VARCHAR (100) NOT NULL
);
go
alter table [FileImport].[FileName]
add constraint pk_FileName primary key nonclustered (FileId)
go
create clustered index cix_FileName on [FileImport].[FileName]([FileName])
go
CREATE TABLE [FileImport].[LineData] (
[FileId] BIGINT NOT NULL,
[LineDataId] BIGINT IDENTITY (1, 1) NOT NULL,
[LineData] NVARCHAR (300) NOT NULL,
constraint fk_LineData_FileName foreign key (FileId) references [FileImport].[FileName](FileId)
);
alter table [FileImport].[LineData]
add constraint pk_LineData primary key clustered (FileId, LineDataId)
go
This is with some normalization, so you don't have to repeat the full file name on every row. You probably don't have to do it (if you prefer not to, just keep FileName in the second table instead of FileId and cluster your index on (FileName, LineDataId)), but since we are using a relational database...
No need for any additional indexes - tables are sorted by the right keys
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
If your data means anything to you, don't use it. As a matter of fact, if you feel you have to use it, something is really wrong with your DB architecture. Indexed this way, SQL Server will use a seek operation, which is very fast.
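For instance (my sketch against the two-table layout above), loading one file's lines resolves to seeks on both clustered keys:
SELECT ld.LineData
FROM [FileImport].[FileName] AS f
JOIN [FileImport].[LineData] AS ld
    ON ld.FileId = f.FileId      -- seek on pk_LineData (FileId, LineDataId)
WHERE f.FileName = @FileName      -- seek on cix_FileName
ORDER BY ld.LineDataId;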
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
You can set up a maintenance job that rebuilds your indexes and run it nightly with SQL Server Agent (or whatever).
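A minimal sketch of such a job step, using the index names from the DDL above (the schedule itself would live in Agent):
-- Rebuild both clustered indexes to remove fragmentation left by deletes
ALTER INDEX cix_FileName ON [FileImport].[FileName] REBUILD;
ALTER INDEX pk_LineData ON [FileImport].[LineData] REBUILD;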

Recreate index on column store indexed table with 35 billion rows

I have a big table on which I need to rebuild the index. The table is configured with a Clustered Columnstore Index (CCI), and we realized we need to sort the data according to a specific use case.
Users perform date-range and equality queries, but because the data is not sorted in the way they would like to get it back, the queries are not optimal. The SQL Advisory Team recommended that the data be organized into the right rowgroups so queries can benefit from rowgroup elimination.
Table Description:
Partitioned by Timestamp1, with a monthly partition function
Total Rows: 31 billion
Est row size: 60 bytes
Est table size: 600 GB
Table Definition:
CREATE TABLE [dbo].[Table1](
[PkId] [int] NOT NULL,
[FKId1] [smallint] NOT NULL,
[FKId2] [int] NOT NULL,
[FKId3] [int] NOT NULL,
[FKId4] [int] NOT NULL,
[Timestamp1] [datetime2](0) NOT NULL,
[Measurement1] [real] NULL,
[Measurement2] [real] NULL,
[Measurement3] [real] NULL,
[Measurement4] [real] NULL,
[Measurement5] [real] NULL,
[Timestamp2] [datetime2](3) NULL,
[TimeZoneOffset] [tinyint] NULL
)
CREATE CLUSTERED COLUMNSTORE INDEX [Table1_ColumnStoreIndex] ON [dbo].[Table1] WITH (DROP_EXISTING = OFF)
GO
Environment:
SQL Server 2014 Enterprise Ed.
8 Cores, 32 GB RAM
VMware High Performance Platform
My strategy is:
Drop the existing CCI
Create an ordinary Clustered Row Index (CRI) on the right columns; this will sort the data
Recreate the CCI with DROP_EXISTING = ON; this will convert the existing CRI into a CCI
My questions are:
Does it make sense to rebuild the index or just reload the data? Reloading may take a month to complete, and rebuilding the index may take just as much time, maybe...
If I drop the existing CCI, will the table expand, since it may not be compressed anymore?
31 billion rows is roughly 31,000 full rowgroups. A rowgroup is just another form of horizontal partitioning, so it really matters when and how you load your data. SQL Server 2014 supports only offline index builds.
There are a few pros and cons when considering create index vs. reload:
Create index is a single operation, so if it fails at any point you lose your progress. I would not recommend it at your data size.
An index build will create primary dictionaries, so it is beneficial for low-cardinality, dictionary-encoded columns.
Bulk load won't create primary dictionaries, but you can reload data if for some reason your batches fail.
Both index build and bulk load will run in parallel if you give them enough resources, which means your ordering from the base clustered index won't be perfectly preserved. This is just something to be aware of; at your scale of data it won't matter if you have a few overlapping rowgroups.
If your data will undergo updates/deletes and you reorganize (from SQL Server 2019 the Tuple Mover will also do this), your ordering might degrade over time.
I would create a clustered index ordered and partitioned on the date-range column so that you have anywhere between 50-200 rowgroups per partition (do some experiments). Then you can create a partition-aligned Clustered Columnstore Index and switch in one partition at a time. The partition switch will trigger an index build, so you'll get the benefit of primary dictionaries, and if you end up with updates/deletes on a partition you can fix the index quality by rebuilding that partition rather than the whole table. If you decide to use reorganize, you still maintain some level of ordering, because rowgroups will only be merged within the same partition.
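A hedged sketch of one such per-partition cycle follows; the staging table, the PF_Monthly partition function, and the boundary dates are all assumptions for illustration. The staging table must live on the target partition's filegroup, and the target partition must be empty before the switch:
-- Staging table with the same columns as dbo.Table1
CREATE TABLE dbo.Table1_Staging(
[PkId] [int] NOT NULL,
[FKId1] [smallint] NOT NULL,
[FKId2] [int] NOT NULL,
[FKId3] [int] NOT NULL,
[FKId4] [int] NOT NULL,
[Timestamp1] [datetime2](0) NOT NULL,
[Measurement1] [real] NULL,
[Measurement2] [real] NULL,
[Measurement3] [real] NULL,
[Measurement4] [real] NULL,
[Measurement5] [real] NULL,
[Timestamp2] [datetime2](3) NULL,
[TimeZoneOffset] [tinyint] NULL
);
GO
-- Rowstore clustered index first, so the data is sorted before compression
CREATE CLUSTERED INDEX CIX_Staging ON dbo.Table1_Staging ([Timestamp1]);
GO
-- ... bulk load one month of data into dbo.Table1_Staging here ...
-- Convert to columnstore; the index build reads the sorted rowstore
CREATE CLUSTERED COLUMNSTORE INDEX CIX_Staging ON dbo.Table1_Staging WITH (DROP_EXISTING = ON);
GO
-- Constraint proving the data fits the target partition's boundaries
ALTER TABLE dbo.Table1_Staging ADD CONSTRAINT CK_Staging_Month
CHECK ([Timestamp1] >= '20140101' AND [Timestamp1] < '20140201');
GO
-- Metadata-only switch of the staging table into the partitioned target
ALTER TABLE dbo.Table1_Staging SWITCH TO dbo.Table1
PARTITION $PARTITION.PF_Monthly('20140101');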

SQL Server 2014 Index Optimization: Any benefit with including primary key in indexes?

After running a query, the SQL Server 2014 Actual Query Plan shows a missing index like below:
CREATE NONCLUSTERED INDEX IX_1 ON Table1 (Column1) INCLUDE
(PK_Column,SomeOtherColumn)
The missing index suggests including the primary key column in the index. The table has a clustered index on PK_Column.
I am confused, and it seems that I don't get the concept of a clustered-index primary key right.
My assumption was: when a table has a clustered PK, all of the non-clustered indexes point to the PK value. Am I correct? If I am, why does the missing-index suggestion ask me to include the PK column in the index?
Summary:
The advised index is not valid, but it doesn't make any difference. See the tests section below for details.
After researching for some time, I found an answer here, and the statement below explains the missing-index feature convincingly:
they only look at a single query, or a single operation within a single query. They don't take into account what already exists or your other query patterns.
You still need a thinking human being to analyze the overall indexing strategy and make sure that your index structure is efficient and cohesive.
So, coming to your question: this advised index may be valid, but it should not be taken for granted. The advised index is useful to SQL Server for the particular query executed, to reduce its cost.
This is the index that was advised:
CREATE NONCLUSTERED INDEX IX_1 ON Table1 (Column1)
INCLUDE (PK_Column, SomeOtherColumn)
Assume you have a query like the one below:
select pk_column, someothercolumn
from table
where column1 = 'somevalue'
SQL Server also tries to scan a narrow index if one is available, so in this case an index like the one advised will be helpful.
Further, you didn't share the schema of the table; if you have an index like the one below
create index nci_test on table(column1)
then a query of the form below will again result in the same index being advised as stated in the question:
select pk_column, someothercolumn
from table
where column1 = 'somevalue'
Update:
I have an orders table with the schema below:
[orderid] [int] NOT NULL Primary key,
[custid] [char](11) NOT NULL,
[empid] [int] NOT NULL,
[shipperid] [varchar](5) NOT NULL,
[orderdate] [date] NOT NULL,
[filler] [char](160) NOT NULL
Now I created one more index with the structure below:
create index onlyempid on orderstest(empid)
Now when I run a query of the form below,
select empid,orderid,orderdate --6.3 units
from orderstest
where empid=5
the index advisor will suggest the missing index below:
CREATE NONCLUSTERED INDEX empidalongwithorderiddate
ON [dbo].[orderstest] ([empid])
INCLUDE ([orderid],[orderdate]) -- you can drop orderid too, it doesn't make any difference
As you can see, orderid is also included in the suggestion above.
Now let's create it and observe both structures.
[Screenshots: root-level and leaf-level index pages for onlyempid and for empidalongwithorderiddate]
As you can see, creating the index as per the suggestion makes no difference, even though the suggestion is invalid.
I assume the suggestion was made by the index advisor based on the query that ran; it is specific to that query and has no knowledge of the other indexes involved.
I don't know your schema, nor your queries. Just guessing.
Please correct me if this theory is incorrect.
You are right that non-clustered indexes point to the PK value. Imagine you have a large database (for example, gigabytes of files) stored on an ordinary platter hard drive. Let's suppose that the disk is fragmented and the PK index is saved physically far from your Table1 index.
Imagine that your query needs to evaluate Column1 and PK_column as well. The query execution reads a Column1 value, then a PK value, then a Column1 value, then a PK value...
The hard drive keeps jumping from one physical place to another, and this can take time.
Having all you need in one index is more effective, because it means reading one file sequentially.
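A minimal sketch of that point, reusing the (hypothetical) names from the question:
-- PK_Column is the clustering key, so every nonclustered index already
-- carries it at the leaf level; the covering index below answers the whole
-- query with one index seek and no lookups elsewhere:
CREATE NONCLUSTERED INDEX IX_1 ON Table1 (Column1) INCLUDE (SomeOtherColumn);

SELECT PK_Column, SomeOtherColumn
FROM Table1
WHERE Column1 = 'somevalue';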

Querying a varbinary column in SQL Server

I have some issues with querying varbinary columns using the CONTAINS predicate. It only works on nvarchar/varchar for me, but the MSDN documentation specifies that it works on image/varbinary as well.
I have this table
[dbo].[Documents]
(
[id] [int] IDENTITY(1,1) NOT NULL,
[title] [nvarchar](100) NOT NULL,
[doctype] [nchar](4) NOT NULL,
[docexcerpt] [nvarchar](1000) NOT NULL,
[doccontent] [varbinary](max) NOT NULL,
CONSTRAINT [PK_Documents]
PRIMARY KEY CLUSTERED ([id] ASC)
)
doctype - document type (format)
docexcerpt - small fragment of the document
doccontent - whole document stored in varbinary
Code:
INSERT INTO dbo.Documents (title, doctype, docexcerpt, doccontent)
SELECT
N'Columnstore Indices and Batch Processing',
N'docx',
N'You should use a columnstore index on your fact tables, putting all columns of a fact table in a columnstore index. In addition to fact tables, very large dimensions could benefit from columnstore indices as well. Do not use columnstore indices for small dimensions. ',
bulkcolumn
FROM
OPENROWSET(BULK 'myUrl', SINGLE_BLOB) AS doc;
Now this is how it looks (screenshot):
I have installed the Microsoft Office 2010 Filter Packs and registered them in SQL Server, and I checked that what I need (.docx) is installed using
SELECT document_type, path
FROM sys.fulltext_document_types;
Here's the output (screenshot).
My issue is that this query (shown in a screenshot) doesn't return anything.
As an observation, I have created a full-text catalog and index on my table using the following code, making both docexcerpt and doccontent indexable columns:
--fulltext index
CREATE FULLTEXT INDEX ON dbo.Documents
(
docexcerpt Language 1033,
doccontent TYPE COLUMN doctype Language 1033
STATISTICAL_SEMANTICS
)
KEY INDEX PK_Documents
ON DocumentsFtCatalog
WITH STOPLIST = SQLStopList,
SEARCH PROPERTY LIST = WordSearchPropertyList,
CHANGE_TRACKING AUTO;
I'm not sure what I am doing wrong or missing. I'd appreciate any help. Thanks.
I've managed to 'solve' the mystery, well... I forgot that I had to re-insert my documents into my tables (after editing them) in order for my queries to work properly. Can't believe I've been so numb.
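For reference, a hedged sketch of the kind of full-text query that should return hits from the varbinary column once the documents are re-inserted and the index has caught up (the search term is illustrative):
SELECT id, title
FROM dbo.Documents
WHERE CONTAINS(doccontent, N'columnstore'); -- word filter-extracted from the .docx blob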
