Can I have a unique constraint on NULLable fields? - sql-server

I have a table with the following schema:
CREATE TABLE MyTable
(
ID INTEGER IDENTITY(1,1),
FirstIdentifier INTEGER NULL,
SecondIdentifier INTEGER NULL,
--.... some other fields .....
)
Now each of FirstIdentifier and SecondIdentifier is unique but NULLable. I want to put a unique constraint on each of these columns, but I cannot do it because they are NULLable: two rows with NULL values would fail the unique constraint. Any ideas of how I can address this at the schema level?

You can use a filtered index as a unique constraint.
create unique index ix_FirstIdentifier on MyTable(FirstIdentifier)
where FirstIdentifier is not null
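For example, with that index in place on the question's MyTable (a quick sketch of the behavior; it assumes the remaining columns are nullable or have defaults):
INSERT INTO MyTable (FirstIdentifier) VALUES (NULL); -- succeeds
INSERT INTO MyTable (FirstIdentifier) VALUES (NULL); -- succeeds: NULL rows are excluded from the filtered index
INSERT INTO MyTable (FirstIdentifier) VALUES (42);   -- succeeds
INSERT INTO MyTable (FirstIdentifier) VALUES (42);   -- fails: duplicate key in ix_FirstIdentifier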

As several have suggested, using a filtered index is probably the way to get what you want.
But the book answer to your direct question is that a nullable column can have a unique index; it will just only be able to have one row with a NULL value in that field. Any more than one NULL would violate the index.
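To illustrate (a sketch against the question's MyTable, starting from an empty table):
CREATE UNIQUE INDEX ix_FirstIdentifier_plain ON MyTable (FirstIdentifier);
INSERT INTO MyTable (FirstIdentifier) VALUES (NULL); -- succeeds: the one allowed NULL
INSERT INTO MyTable (FirstIdentifier) VALUES (NULL); -- fails: SQL Server treats a second NULL as a duplicate key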

Assuming you want the columns to allow null values while still keeping the non-null values unique, SQL Server lets you do this by adding a WHERE <column> IS NOT NULL filter to a unique index.
Something along the lines of:
CREATE UNIQUE NONCLUSTERED INDEX IDX_my_index
ON MyTable (firstIdentifier)
WHERE firstIdentifier IS NOT NULL

Create a filtered unique index on each of the fields:
CREATE UNIQUE INDEX ix_FirstIdentifier ON MyTable (FirstIdentifier)
WHERE FirstIdentifier IS NOT NULL;
CREATE UNIQUE INDEX ix_SecondIdentifier ON MyTable (SecondIdentifier)
WHERE SecondIdentifier IS NOT NULL;
They will allow NULLs but still enforce uniqueness among the non-NULL values in each column.

You can use a filter predicate on CREATE INDEX.
From CREATE INDEX:
CREATE [ UNIQUE ] [ CLUSTERED | NONCLUSTERED ] INDEX index_name
ON <object> ( column [ ASC | DESC ] [ ,...n ] )
[ INCLUDE ( column_name [ ,...n ] ) ]
[ WHERE <filter_predicate> ]
[ WITH ( <relational_index_option> [ ,...n ] ) ]
[ ON { partition_scheme_name ( column_name )
| filegroup_name
| default
}
]
[ FILESTREAM_ON { filestream_filegroup_name |
partition_scheme_name | "NULL" } ] [ ; ]
WHERE <filter_predicate> Creates a filtered index by specifying which
rows to include in the index. The filtered index must be a
nonclustered index on a table. Creates filtered statistics for the
data rows in the filtered index.
The filter predicate uses simple comparison logic and cannot reference
a computed column, a UDT column, a spatial data type column, or a
hierarchyID data type column. Comparisons using NULL literals are not
allowed with the comparison operators. Use the IS NULL and IS NOT NULL
operators instead.
Here are some examples of filter predicates for the
Production.BillOfMaterials table:
WHERE StartDate > '20040101' AND EndDate <= '20040630'
WHERE ComponentID IN (533, 324, 753)
WHERE StartDate IN ('20040404', '20040905') AND EndDate IS NOT NULL
Filtered indexes do not apply to XML indexes and full-text indexes.
For UNIQUE indexes, only the selected rows must have unique index
values. Filtered indexes do not allow the IGNORE_DUP_KEY option.

Related

index on persisted computed column slower than index on non-persisted computed column

We have a big table with a varchar(max) column containing elements that are 'inspected' with charindex. For instance:
select x from y where charindex('string',[varchar_max_field]) > 0
In order to speed that up, I created a computed column with the result of the charindex call. As a test, I created both a persisted and a non-persisted version of that column and created a nonclustered index for each, containing only the computed column:
CREATE TABLE [schema].[table] (
[other fields...]
[State] [NVARCHAR](MAX) NULL, /* Contains JSON information */
[NonPersistedColumn] AS (CHARINDEX('something',[State],(1))),
[PersistedColumn] AS (CHARINDEX('something',[State],(1))) PERSISTED )
CREATE NONCLUSTERED INDEX [ix_NonPersistedColumn] ON [schema].[table]
([NonPersistedColumn] ASC )
CREATE NONCLUSTERED INDEX [ix_PersistedColumn] ON [schema].[table]
([PersistedColumn] ASC )
Next,
SELECT TOP (50) [NonPersistedColumn] FROM [table] WHERE [NonPersistedColumn] > 0
uses an index seek on the index for the non-persisted column, as expected.
However,
SELECT TOP (50) [PersistedColumn] FROM [table] WHERE [PersistedColumn] > 0
uses the index of the non-persisted column (equal charindex logic, so ok) and performs an identical index seek.
If I force it to use the index on the persisted column, it reverts to a Key Lookup on the clustered index with the table's ID column as a seek predicate and the [State] column in the query plan's Output List. I am only asking for the column in the nonclustered index, so the index alone should cover the query.
Why is it using the PK index (containing an ID column)?
Is that related to the PK always being added to the nc index?
In this way, there is little advantage in persisting the computed column, or am I missing something?
Links to the plans (persisted and non-persisted, respectively):
https://www.brentozar.com/pastetheplan/?id=S1zoLmEEs
https://www.brentozar.com/pastetheplan/?id=S1CHwmE4j
Link to the plan for the query on the persisted column without the index hint (uses the index on the non-persisted column)
https://www.brentozar.com/pastetheplan/?id=HJB6j7EVs

INSERT data from spark dataframe to a table in SQL server

I am using a Scala notebook on Databricks. I need to perform an INSERT of data from a dataframe into a table in SQL Server. If the data already exists, there is no need to modify or insert it - only insert data that does not exist.
I tried the methods specified here https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#write-data-to-jdbc, but they don't address my use case. SaveMode.Append creates duplicate entries of the data, SaveMode.Overwrite replaces the existing data (table), and SaveMode.Ignore does not add any new data if the table already exists.
df.write.mode(SaveMode.Overwrite).jdbc(url = dbUrl, table = table_name, connectionProperties = dbConnectionProperties)
How can I do an INSERT of new data only to the database?
Thanks a lot in advance for your help!
Assume your current dataframe is df1.
You should read the existing data in the SQL table into another dataframe (df2).
Then use except (the Scala DataFrame equivalent of subtract / subtractByKey in the RDD and PySpark APIs): http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract
val dfFinal = df1.except(df2)
dfFinal will contain the remaining records to insert.
Note: it's a workaround, not a foolproof solution.
There can be a workaround for this issue.
You need to maintain an auto-increment key/primary key in the SQL Server table, and the source data should carry this key before the insert.
The following conditions can arise:
New primary key == old primary key -> the job will fail with a constraint-violation exception.
New primary key != old primary key -> the row inserts successfully.
The insert failure can be handled at the program level.
To avoid bringing in the entire set to do a comparison, you could put a unique index on the SQL table and use the IGNORE_DUP_KEY option to ignore duplicates.
MS Document: Create Unique Indexes
CREATE [ UNIQUE ] [ CLUSTERED | NONCLUSTERED ] INDEX index_name
ON <object> ( column [ ASC | DESC ] [ ,...n ] )
[ INCLUDE ( column_name [ ,...n ] ) ]
[ WHERE <filter_predicate> ]
[ WITH ( <relational_index_option> [ ,...n ] ) ]
[ ON { partition_scheme_name ( column_name )
| filegroup_name
| default
}
]
[ FILESTREAM_ON { filestream_filegroup_name | partition_scheme_name | "NULL" } ]
[ ; ]
<object> ::=
{ database_name.schema_name.table_or_view_name | schema_name.table_or_view_name | table_or_view_name }
<relational_index_option> ::=
{
| IGNORE_DUP_KEY = { ON | OFF }
}
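For example, a minimal sketch, assuming a BusinessKey column (hypothetical name) uniquely identifies a row in the target table. With IGNORE_DUP_KEY = ON, rows arriving from SaveMode.Append whose keys already exist are discarded with a warning instead of failing the whole batch:
CREATE UNIQUE NONCLUSTERED INDEX ix_MyTable_BusinessKey
ON dbo.MyTable (BusinessKey) -- BusinessKey is a hypothetical key column
WITH (IGNORE_DUP_KEY = ON);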

Question on clustered and non clustered index

I have only two columns in a table. On column1 I applied a clustered index and on column2 a nonclustered index. But when I enter values into the table, why are the column2 values sorted rather than column1's? If the clustered index is going to sort the data, shouldn't it sort column1, not column2? - SQL Server 2012
Think of it in this way: a clustered index IS the data. So in your example your data is sorted on column1 by default because you applied a clustered index on column1.
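A minimal sketch (hypothetical table and column names): the rows end up physically ordered by the clustered key, regardless of insert order:
CREATE TABLE dbo.Demo (col1 INT, col2 INT);
CREATE CLUSTERED INDEX ix_c ON dbo.Demo (col1);
CREATE NONCLUSTERED INDEX ix_nc ON dbo.Demo (col2);
INSERT INTO dbo.Demo VALUES (3, 10), (1, 30), (2, 20);
-- Without an ORDER BY the result order is never guaranteed, but a plain
-- scan typically returns rows in clustered-key order: (1,30), (2,20), (3,10)
SELECT col1, col2 FROM dbo.Demo;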

index on two columns that uniquely find a row

Hi, if I have tables like:
table_A:
time_id | transaction_id | other columns...
table_B:
time_id | transaction_id | other columns...
The combination of time_id and transaction_id uniquely defines a row (or almost uniquely).
The query I want to be fast is like:
SELECT ..
FROM [table_A] as a
join [table_B] as b
on a.time_id = b.time_id and a.transaction_id = b.transaction_id
WHERE a.time_id = '201601' and b.time_id = '201601'
What would be the suggested practice in indexing?
I was thinking of
create index time_trans on [product] (time_id, transaction_id)
but is it too granular? (since combination of time_id and transaction_id uniquely defines a row)
How the tables were created (by loading CSV into SQL Server; an updated CSV is provided monthly):
CREATE TABLE [dbo].[table_A] (
[time_id] ...,
[transaction_id] ...,
[other columns] ...
)
BULK INSERT [dbo].[table_A_2010]
FROM 'table_A_2010.CSV'
WITH ( FIRSTROW = 2, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' )
BULK INSERT [dbo].[table_A_2011]
FROM 'table_A_2011.CSV'
WITH ( FIRSTROW = 2, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' )
BULK INSERT [dbo].[table_A_2012]
FROM 'table_A_2012.CSV'
WITH ( FIRSTROW = 2, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' )
...
For any new table, determine what column(s) uniquely identify a row, then make that the PRIMARY KEY, which is automatically backed by an index (by default, a clustered index).
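For example, a sketch for table_A (the column types are assumptions based on the sample values):
CREATE TABLE [dbo].[table_A] (
[time_id] CHAR(6) NOT NULL, -- e.g. '201601'
[transaction_id] INT NOT NULL,
-- ... other columns ...
CONSTRAINT [PK_table_A] PRIMARY KEY CLUSTERED ([time_id], [transaction_id])
);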
Granular indexes are good; in fact, they're usually what you want.
Imagine you have a table that has a row for every human in the world. Which index do you think it would be better from these?
Index by sex (male / female).
Index by name, surname.
Index by bornCountry, documentNumber.
There is no universally "better" index here; what makes an index good varies from case to case, and even a seemingly lousy index can work best for certain scenarios.
For your case, creating an index on time_id, transaction_id seems a very sound choice, since you are filtering by time_id and joining against the other table on transaction_id. If you didn't filter by time_id, you might want to switch the order of the columns.
If you know that the combination of time_id, transaction_id is unique and has to be enforced, create a UNIQUE index. If you don't have a clustered index yet, you can create a CLUSTERED index instead, which reorganizes the actual stored data to match this order and makes SELECT queries faster (but it might hinder INSERT or UPDATE statements, depending on the inserted or updated values).
If this combination can be repeated, you can simply create a NONCLUSTERED INDEX. It will help if you create the same index on the other table.
CREATE NONCLUSTERED INDEX time_trans on [product] (time_id, transaction_id)
Also keep in mind that you can INCLUDE columns in nonclustered indexes. You don't show which columns you actually SELECT, but consider adding them to the index with INCLUDE so the engine doesn't have to read an additional page from disk when retrieving the data, since included columns store their values at the leaf level alongside the indexed columns.
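For example (amount here is a hypothetical column standing in for whatever your SELECT actually returns):
CREATE NONCLUSTERED INDEX time_trans
ON [dbo].[table_A] (time_id, transaction_id)
INCLUDE (amount); -- list the non-key columns your SELECT reads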

Specify Xml index storage location in Sql Server

Is there a way to specify the storage location of an XML index in MS SQL Server?
I have a relatively small table with XML data (up to 2 GB) and primary and secondary XML indexes on this table. Those indexes take a few times more space than the underlying table.
Is there any way to physically separate the table and the indexes into two separate partitions?
I do not see any such option in MSDN (https://learn.microsoft.com/en-us/sql/t-sql/statements/create-xml-index-transact-sql):
CREATE [ PRIMARY ] XML INDEX index_name
ON <object> ( xml_column_name )
[ USING XML INDEX xml_index_name
[ FOR { VALUE | PATH | PROPERTY } ] ]
[ WITH ( <xml_index_option> [ ,...n ] ) ]
[ ; ]
But I'm hoping there is some trick.
From Creating XML Indexes:
The filegroup or partitioning information of the user table is applied
to the XML index. Users cannot specify these separately on an XML
index.
However, if I understand it correctly, you should be able to move the XML column and all indices based on it into a separate filegroup by using the TEXTIMAGE_ON clause of the create table statement:
create table dbo.MyTable (
Id int not null,
XMLData xml not null
)
on [PRIMARY]
textimage_on [LARGEDAT];
The only drawback I see here is that all LOB-typed columns from the table will be placed on that filegroup, not just the XML one(s).
EDIT:
You can use the 'large value types out of row' option of sp_tableoption to force LOB data types out of row even if their values are small enough to fit on the table's page.
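A minimal sketch, using the dbo.MyTable definition above:
-- Force LOB values (xml, varchar(max), ...) to be stored off-row on the
-- TEXTIMAGE_ON filegroup even when they would fit on the in-row page:
EXEC sp_tableoption 'dbo.MyTable', 'large value types out of row', 1;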
