I have a simple table:
CREATE TABLE dbo.Table1 (
ID int IDENTITY(1,1) PRIMARY KEY,
TextField varchar(100)
)
I have a nonclustered index on the TextField column.
I am writing a simple query that selects both columns, and in the WHERE condition I have the following situation:
...
WHERE SUBSTRING(TextField, 1, 1) = 'x'
Is it better to convert the query to a LIKE condition with 'x%', or to create a partition function on the TextField column?
How can partitioning affect a search condition over a varchar column, and which solution will be better for a large number of rows?
By default, SUBSTRING(TextField, 1, 1) = 'x' is not SARGable.
First, I would test the query with the following solutions (SQL Profiler > {SQLStatement|Batch} Completed > CPU, Reads, Writes, Duration columns):
1) A non-clustered index on the TextField column:
CREATE INDEX IN_Table1_TextField
ON dbo.Table1(TextField)
INCLUDE(non-indexed columns); -- See `SELECT` columns
GO
And the query should use LIKE:
SELECT ... FROM dbo.Table1 WHERE TextField LIKE 'x%'; -- where 'x' represents one or more leading chars
Pros/cons: the B-Tree/index will have more levels because of the key length (maximum 100 chars + RowID if it isn't a UNIQUE index).
2) I would create a computed column for the first char:
-- TextField column needs to be mandatory
ALTER TABLE dbo.Table1
ADD FirstChar AS (CONVERT(CHAR(1),SUBSTRING(TextField,1,1))); -- This computed column could be non-persistent
GO
plus
CREATE INDEX IN_Table1_FirstChar
ON dbo.Table1(FirstChar)
INCLUDE (non-indexed columns);
GO
In this case, the predicate could be
WHERE SUBSTRING(TextField, 1, 1) = 'x'
or
WHERE FirstChar = 'x'
Pros/cons: the B-Tree/index will have far fewer levels because of the key length (1 char + RowID). I would use this if the predicate selectivity is high (a small number of rows matches) but without covered columns (see the INCLUDE clause).
3) A clustered index on the FirstChar column, thus:
CREATE TABLE dbo.Table1 (
ID int IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
TextField varchar(100) NOT NULL, -- This column needs to be mandatory
FirstChar AS (CONVERT(CHAR(1),SUBSTRING(TextField,1,1))),
UNIQUE CLUSTERED(FirstChar,ID)
);
In this case, the predicate could be
WHERE SUBSTRING(TextField, 1, 1) = 'x'
or
WHERE FirstChar = 'x'
Pros/cons: this should give you good performance if you have many rows. In this case, the number of B-Tree levels will be minimal (1 CHAR + 1 INT key) or close to it.
Your non-clustered index cannot be utilized if there is a function applied to the column (e.g. SUBSTRING). LIKE 'x%' would be preferable here.
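For the concrete table above, the rewrite looks like this (a minimal sketch against dbo.Table1 as defined in the question):
-- Non-SARGable: the function applied to the column hides it from the index,
-- so the optimizer cannot seek on the nonclustered index over TextField
SELECT ID, TextField
FROM dbo.Table1
WHERE SUBSTRING(TextField, 1, 1) = 'x';
-- SARGable: a LIKE with only a trailing wildcard can use an index seek
SELECT ID, TextField
FROM dbo.Table1
WHERE TextField LIKE 'x%';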
Question:
Can I somehow hint SQL Server on expected number of rows returned from index seek?
Background:
I have a unique clustered index:
ALTER TABLE [dbo].[T] ADD CONSTRAINT [X] PRIMARY KEY CLUSTERED
(
[Int1] ASC,
[Int2] ASC,
[Int3] ASC,
[Int4] ASC
)
Now I have a query that fetches a particular single value:
SELECT
...
FROM [dbo].[T]
WHERE
[Int1] = @Int1 AND
[Int2] = @Int2 AND
[Int3] = @Int3 AND
[Int4] = @Int4
This runs instantaneously, with any values for the arguments @Int1-@Int4.
Now I actually want a range of values.
If I iterate in a loop with an increasing value of @Int4 (yes, something that sounds completely wrong to do in SQL), I get my results instantaneously.
-- Looks completely wrong for SQL - but it seems to be the fastest way to fetch a range of values
DECLARE @I INT = 1
WHILE @I <= 50
BEGIN
SELECT
...
FROM [dbo].[T]
WHERE
[Int1] = @Int1 AND
[Int2] = @Int2 AND
[Int3] = @Int3 AND
[Int4] = @I
SET @I = @I + 1
END
GO
If I specify the last condition as a range:
SELECT
...
FROM [dbo].[T]
WHERE
[Int1] = @Int1 AND
[Int2] = @Int2 AND
[Int3] = @Int3 AND
[Int4] BETWEEN @Int4 AND (@Int4 + 2)
The query takes minutes.
The same happens if I omit the [Int4] condition altogether.
In all 3 cases the actual execution plan looks the same (a clustered index seek).
The difference is in estimated vs. actual rows returned. With the exact condition, both are 1. With the BETWEEN or omitted condition, there is a huge difference.
Why is the difference in estimate hurting the performance so badly?
Is there any way I can make the BETWEEN or omitted condition run faster? Any way to hint SQL Server that the number of rows will be very low?
Btw. the table contains 73 billion rows. Data size is ~ 1.7TB and index size 4.2TB.
It can probably be rebuilt, but that would require huge downtime. Plus, I can make the query lightning fast if I just switch to the dummy loop.
EDIT1:
As requested, here is the actual DDL for the table and indexes (the first 4 columns are the Int1-Int4 from my simplified example above):
CREATE TABLE [dbo].[RelationalResultValueVectorial](
[RelationalResultRowId] [bigint] NOT NULL,
[RelationalResultPropertyId] [int] NOT NULL,
[RelationalResultVectorialDimensionId] [int] NOT NULL,
[OrdinalRowIdWithinProperty] [int] NOT NULL,
[RelationalResultValueId] [bigint] IDENTITY(1,1) NOT NULL,
CONSTRAINT [Idx_RelationalResultValueVectorial] PRIMARY KEY CLUSTERED
(
[RelationalResultRowId] ASC,
[RelationalResultPropertyId] ASC,
[RelationalResultVectorialDimensionId] ASC,
[OrdinalRowIdWithinProperty] ASC
) ON [RelationalDataFileGroup]
) ON [RelationalDataFileGroup]
GO
CREATE UNIQUE NONCLUSTERED INDEX [IX_RelationalResultValueVectorial_ValueId] ON [dbo].[RelationalResultValueVectorial]
(
[RelationalResultValueId] ASC
) ON [RelationalDataFileGroup]
GO
-- + some FKs
EDIT2:
As to the answer about parameter sniffing: here is what I get if I use constants only (still a wrong estimate and still very slow execution).
This is because you use variables, and they are not "sniffed" in the query as it's written above.
In the case of single values for each of your 4 fields composing the primary key, knowing these values is not needed, as the server knows that every combination of these 4 fields is unique.
Things are different when you use the condition [Int4] BETWEEN @Int4 AND (@Int4 + 2). The result set cardinality may vary: you can specify a range of 1 value or of all possible values for @Int4. And, I repeat, if you don't ask your server to evaluate your expressions/variables by using OPTION (RECOMPILE), the server will estimate the cardinality of BETWEEN as 9% of the rows (16% starting with SQL Server 2014).
Try substituting your variables with constants and the cardinality estimation will be based on the statistics; right now it estimates them as "unknown values".
So, the solution for your case is the OPTION (RECOMPILE) hint.
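For instance, applied to the simplified query from the question (a sketch; the variable values are just placeholders):
-- With OPTION (RECOMPILE) the actual variable values are sniffed at
-- compile time, so the BETWEEN range is estimated from statistics
-- instead of the fixed guess used for unknown values.
DECLARE @Int1 int = 1, @Int2 int = 1, @Int3 int = 1, @Int4 int = 1;
SELECT [Int1], [Int2], [Int3], [Int4]
FROM [dbo].[T]
WHERE [Int1] = @Int1
  AND [Int2] = @Int2
  AND [Int3] = @Int3
  AND [Int4] BETWEEN @Int4 AND (@Int4 + 2)
OPTION (RECOMPILE);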
Is there any way to add a constraint on a column that is an array, to limit its length? I want these arrays to be no longer than 6 elements. And yes, I understand that often a new table is better than storing data in an array, but I am in a situation where an array makes more sense.
You can add a CHECK constraint to the table definition:
CREATE TABLE my_table (
id serial PRIMARY KEY,
arr int[] CHECK (array_length(arr, 1) < 7),
...
);
If the table already exists, you can add the constraint with ALTER TABLE:
ALTER TABLE my_table ADD CONSTRAINT arr_len CHECK (array_length(arr, 1) < 7);
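A quick way to verify the constraint (a sketch, assuming the my_table definition above and that the remaining columns are nullable or have defaults): this insert should fail with a check constraint violation, since the array has 7 elements.
INSERT INTO my_table (arr) VALUES ('{1,2,3,4,5,6,7}');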
I have a table like this:
create table ReceptionR1
(
numOrdre char(20) not null,
dateDepot datetime null,
...
)
I want to increment my id field (numOrdre) like '225/2015', '226/2015', ..., '1/2016', etc. What do I have to do for that?
2015 means the current year.
Please let me know any possible way.
You really, and I mean Really, don't want to do such a thing, especially as your primary key. You are better off using a simple int identity column for your primary key and adding a non-nullable create date column of type datetime2 with a default value of SYSDATETIME().
Create the incrementing number per year either as a computed column or by using an INSTEAD OF INSERT trigger (if you don't want it to be recalculated each time). This can be done fairly easily with the ROW_NUMBER function, as in the sketch below.
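A computed column cannot contain ROW_NUMBER directly, so the "recalculated each time" variant is typically exposed through a view. A minimal sketch, assuming an int identity key ReceptionR1ID and the dateDepot column suggested above (the view and column names are just placeholders):
-- Derives the per-year sequence on the fly instead of storing it
CREATE VIEW dbo.ReceptionR1Numbered
AS
SELECT  r.ReceptionR1ID,
        r.dateDepot,
        CAST(ROW_NUMBER() OVER (PARTITION BY YEAR(r.dateDepot)
                                ORDER BY r.ReceptionR1ID) AS VARCHAR(10))
            + '/' + CAST(YEAR(r.dateDepot) AS CHAR(4)) AS numOrdre
FROM dbo.ReceptionR1 AS r;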
As everyone else has said - don't use this as your primary key! But you could do the following, if you're on SQL Server 2012 or newer:
-- step 1 - create a sequence
CREATE SEQUENCE dbo.SeqOrderNo AS INT
START WITH 1001 -- start with whatever value you need
INCREMENT BY 1
NO CYCLE
NO CACHE;
-- create your table - use INT IDENTITY as your primary key
CREATE TABLE dbo.ReceptionR1
(
ID INT IDENTITY
CONSTRAINT PK_ReceptionR1 PRIMARY KEY CLUSTERED,
dateDepot DATE NOT NULL,
...
-- add a colum called "SeqNumber" that gets filled from the sequence
SeqNumber INT,
-- you can add a *computed* column here
OrderNo AS CAST(YEAR(dateDepot) AS VARCHAR(4)) + '/' + CAST(SeqNumber AS VARCHAR(4))
)
So now, when you insert a row, it has a proper and well defined primary key (ID), and when you fill the SeqNumber with
INSERT INTO dbo.ReceptionR1 (dateDepot, SeqNumber)
VALUES (SYSDATETIME(), NEXT VALUE FOR dbo.SeqOrderNo)
then the SeqNumber column gets the next value for the sequence, and the OrderNo computed column gets filled with 2015/1001, 2015/1002 and so forth.
Now when 2016 comes around, you just reset the sequence back to a starting value:
ALTER SEQUENCE dbo.SeqOrderNo RESTART WITH 1000;
and you're done - the rest of your solution works as before.
If you want to make sure you never accidentally insert a duplicate value, you can even put a unique index on your OrderNo column in your table.
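For example (a sketch; the index name is arbitrary):
-- OrderNo is a deterministic computed column, so it can be indexed.
-- Note: rows where SeqNumber is NULL all produce a NULL OrderNo, and a
-- unique index in SQL Server allows only one NULL.
CREATE UNIQUE INDEX UX_ReceptionR1_OrderNo
    ON dbo.ReceptionR1 (OrderNo);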
Once more, you cannot use the combo field as your primary key. This solution sort of works on earlier versions of SQL Server and calculates the new annual YearlySeq counter automatically, but you had better have an index on dateDepot, and you might still have issues if there are many, many (hundreds of thousands of) rows per year.
In short: fight the requirement.
Given
create table dbo.ReceptionR1
(
ReceptionR1ID INT IDENTITY PRIMARY KEY,
YearlySeq INT ,
dateDepot datetime DEFAULT (GETDATE()) ,
somethingElse varchar(99) null,
numOrdre as LTRIM(STR(YearlySeq)) + '/' + CONVERT(CHAR(4),dateDepot,111)
)
GO
CREATE TRIGGER R1Insert ON dbo.ReceptionR1 FOR INSERT
AS
-- Per-year counter: this row's identity minus the smallest identity in the
-- same year (excluding itself), plus 1; the first row of a year gets 1
UPDATE tt
SET YearlySeq = ISNULL(ii.ReceptionR1ID
                       - (SELECT MIN(ReceptionR1ID)
                          FROM dbo.ReceptionR1 xr
                          WHERE DATEPART(year, xr.dateDepot) = DATEPART(year, ii.dateDepot)
                            AND xr.ReceptionR1ID <> ii.ReceptionR1ID), 0) + 1
FROM dbo.ReceptionR1 tt
JOIN inserted ii ON ii.ReceptionR1ID = tt.ReceptionR1ID
GO
insert into ReceptionR1 (somethingElse) values ('dumb')
insert into ReceptionR1 (somethingElse) values ('requirements')
insert into ReceptionR1 (somethingElse) values ('lead')
insert into ReceptionR1 (somethingElse) values ('to')
insert into ReceptionR1 (somethingElse) values ('big')
insert into ReceptionR1 (somethingElse) values ('problems')
insert into ReceptionR1 (somethingElse) values ('later')
select * from ReceptionR1
Assume there is a table like:
create table #data (ID int identity(1, 1) not NULL, Value int)
Put some data into it:
insert into #data (Value)
select top (1000000) case when (row_number() over (order by @@SPID)) % 5 in (0, 1) then 1 else NULL end
from sys.all_columns c1, sys.all_columns c2
And two indexes:
create index #ix_data_n on #data (Value) include (ID) where Value is NULL
create index #ix_data_nn on #data (Value) include (ID) where Value is not NULL
Data is queried like:
select ID from #data where Value is NULL
or
select ID from #data where Value is not NULL
If I examine the query plan, I see that in the first case an index seek is performed and in the second case an index scan. Why is it a seek in the first case and a scan in the second?
Addition after comments:
If I create an ordinary covering index instead of the two filtered covering indexes:
create index #ix_data on #data (Value) include (ID)
The query plan shows an index seek for both the IS NULL and IS NOT NULL conditions, regardless of the percentage of NULL values in the column (0% of NULLs or 10% or 90% or 100%, it doesn't matter).
When there are two filtered indexes, the query plan always shows an index seek for IS NULL, while for IS NOT NULL it can be an index scan or a table scan (depending on the percentage of NULLs), but it is never an index seek. So it seems the difference is essentially in the way the IS NOT NULL condition is handled.
Does that mean that if an index is intended only for the IS NOT NULL check, then either a normal index or a filtered index should perform better and be preferred? Which one?
SQL Server 2008, 2008 R2 and 2012
The Seek vs. Scan you are seeing in the query plans is a red herring.
In both cases, the query is being answered by scanning the appropriate non-clustered index from beginning to end, returning every row.
By examining the XML query plan, you can see that the index Seek predicate is "#data.Value = Scalar Operator (Null)", which is meaningless, as every row in that filtered index meets the criterion.
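One quick way to confirm that both queries do essentially the same amount of work (a sketch using the temp table and filtered indexes defined above):
SET STATISTICS IO ON;
select ID from #data where Value is NULL;      -- shown as a seek in the plan
select ID from #data where Value is not NULL;  -- shown as a scan in the plan
SET STATISTICS IO OFF;
-- The logical reads reported for each statement reflect the size of the
-- corresponding filtered index, not whether the plan says seek or scan.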
2nd Edit: The source code for the involved function is as follows:
ALTER FUNCTION [Fileserver].[fn_CheckSingleFileSource] ( @fileId INT )
RETURNS INT
AS
BEGIN
-- Declare the return variable here
DECLARE @sourceCount INT ;
-- Add the T-SQL statements to compute the return value here
SELECT @sourceCount = COUNT(*)
FROM Fileserver.FileUri
WHERE FileId = @fileId
AND FileUriTypeId = Fileserver.fn_Const_SourceFileUriTypeId() ;
-- Return the result of the function
RETURN @sourceCount ;
END
Edit: The example table is a simplification. I need this to work as a Scalar Function / CHECK CONSTRAINT operation. The real-world arrangement is not so simple.
Original Question: Assume the following table named FileUri
FileUriId, FileId, FileTypeId
I need to write a check constraint such that FileId values are unique for a FileTypeId of 1. You can insert the same FileId as many times as you want, but only a single row where FileTypeId is 1.
The approach that DIDN'T work:
1) dbo.fn_CheckFileTypeId returns INT with the following logic: SELECT COUNT(FileId) FROM FileUri WHERE FileTypeId = 1
2) ALTER TABLE FileUri ADD CONSTRAINT CK_FileUri_FileTypeId CHECK (dbo.fn_CheckFileTypeId(FileId) <= 1)
When I insert FileId 1, FileTypeId 1 twice, the second insert is allowed.
Thanks SO!
You need to create a filtered unique index (SQL Server 2008)
CREATE UNIQUE NONCLUSTERED INDEX ix ON YourTable(FileId) WHERE FileTypeId=1
or simulate this with an indexed view (2000 and 2005)
CREATE VIEW dbo.UniqueConstraintView
WITH SCHEMABINDING
AS
SELECT FileId
FROM dbo.YourTable
WHERE FileTypeId = 1
GO
CREATE UNIQUE CLUSTERED INDEX ix ON dbo.UniqueConstraintView(FileId)
Why don't you make FileTypeId and FileId together the primary key of the table?
Or at least a Unique Index on the table. That should solve your problem.