Specify XML index storage location in SQL Server

Is there a way to specify the storage location of an XML index in MS SQL Server?
I have a relatively small table of XML documents (up to 2 GB of data) with primary and secondary XML indexes on it. These indexes take several times more space than the underlying table.
Is there any way to physically separate the table and the indexes into two separate partitions?
I do not see such an option in MSDN (https://learn.microsoft.com/en-us/sql/t-sql/statements/create-xml-index-transact-sql):
CREATE [ PRIMARY ] XML INDEX index_name
ON <object> ( xml_column_name )
[ USING XML INDEX xml_index_name
[ FOR { VALUE | PATH | PROPERTY } ] ]
[ WITH ( <xml_index_option> [ ,...n ] ) ]
[ ; ]
But I'm hoping there is some trick.

From the Creating XML Indexes:
The filegroup or partitioning information of the user table is applied
to the XML index. Users cannot specify these separately on an XML
index.
However, if I understand it correctly, you should be able to move the XML column, and all indexes based on it, into a separate filegroup by using the TEXTIMAGE_ON clause of the CREATE TABLE statement:
create table dbo.MyTable (
Id int not null,
XMLData xml not null
)
on [PRIMARY]
textimage_on [LARGEDAT];
The only drawback I see here is that all LOB-typed columns in the table will be placed on that filegroup, not just the XML one(s).
EDIT:
You can use the large value types out of row option of sp_tableoption to force LOB data types out of row even when their values are small enough to fit on the table's data page.
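For example, with the table from the CREATE TABLE example above, enabling the option might look like this (a sketch; the table name is taken from the example above):

```sql
-- Force LOB values (xml, varchar(max), etc.) to be stored off-row,
-- on the TEXTIMAGE_ON filegroup, even when they would fit in-row.
EXEC sys.sp_tableoption
    @TableNamePattern = N'dbo.MyTable',
    @OptionName = 'large value types out of row',
    @OptionValue = 1;
```

Note that the option only affects rows inserted or updated afterward; existing in-row LOB values are not moved until they are updated.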

Related

INSERT data from spark dataframe to a table in SQL server

I am using a Scala notebook on Databricks. I need to INSERT data from a dataframe into a table in SQL Server. If a row already exists, there is no need to modify or insert it; only rows that do not yet exist should be inserted.
I tried the methods described here: https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#write-data-to-jdbc, but they don't address my use case. SaveMode.Append creates duplicate entries of the data, SaveMode.Overwrite replaces the existing data (table), and SaveMode.Ignore does not add any new data if the table already exists.
df.write.mode(SaveMode.Overwrite).jdbc(url=dbUrl, table=table_name, dbConnectionProperties)
How can I do an INSERT of new data only to the database?
Thanks a lot in advance for your help!
Assume your current dataframe is df1.
You should read the existing data in the SQL table into another dataframe (df2).
Then use subtract (or subtractByKey): http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract
val dfFinal = df1.subtract(df2)
dfFinal will contain the remaining records to insert.
NOTE: this is a workaround, not a foolproof solution.
There is another possible workaround for this issue.
Maintain an auto-increment key/primary key in the SQL Server table, and make sure the source data carries this key before the insert.
The following conditions can arise:
New primary key == old primary key -> the job will fail with a constraint violation exception.
New primary key != old primary key -> the row is inserted successfully.
The insert failure can then be handled at the program level.
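A variation of this can also be handled on the SQL Server side: write the dataframe into a staging table (e.g. with SaveMode.Overwrite), then insert only the missing rows into the target. A minimal sketch, assuming hypothetical tables dbo.Staging and dbo.Target with a key column Id (all names are illustrative, not from the question):

```sql
-- Insert only rows whose key is not already present in the target table.
INSERT INTO dbo.Target (Id, Payload)
SELECT s.Id, s.Payload
FROM dbo.Staging AS s
WHERE NOT EXISTS (
    SELECT 1 FROM dbo.Target AS t WHERE t.Id = s.Id
);
```

This avoids pulling the existing data back into Spark for the comparison.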
To avoid bringing in the entire set to do a comparison, you could put a unique index on the SQL table and use the option to ignore duplicates.
MS Document: Create Unique Indexes
CREATE [ UNIQUE ] [ CLUSTERED | NONCLUSTERED ] INDEX index_name
ON <object> ( column [ ASC | DESC ] [ ,...n ] )
[ INCLUDE ( column_name [ ,...n ] ) ]
[ WHERE <filter_predicate> ]
[ WITH ( <relational_index_option> [ ,...n ] ) ]
[ ON { partition_scheme_name ( column_name )
| filegroup_name
| default
}
]
[ FILESTREAM_ON { filestream_filegroup_name | partition_scheme_name | "NULL" } ]
[ ; ]
<object> ::=
{ database_name.schema_name.table_or_view_name | schema_name.table_or_view_name | table_or_view_name }
<relational_index_option> ::=
{
| IGNORE_DUP_KEY = { ON | OFF }
}
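With this option, rows that would violate the unique index are silently discarded (with a warning) instead of failing the whole insert. A minimal sketch, assuming a hypothetical target table dbo.Target with key column Id:

```sql
-- Rows with a duplicate Id are skipped with a warning;
-- the rest of the batch is inserted normally.
CREATE UNIQUE NONCLUSTERED INDEX UX_Target_Id
    ON dbo.Target (Id)
    WITH (IGNORE_DUP_KEY = ON);
```

After this, a plain SaveMode.Append from Spark would insert only the new keys.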

index on two columns that uniquely find a row

Hi if I have tables like:
table_A:
time_id | transaction_id | other columns...
table_B:
time_id | transaction_id | other columns...
Combination of time_id and transaction_id uniquely defines a row (or almost uniquely)
The query I want to be fast is like:
SELECT ..
FROM [table_A] as a
join [table_B] as b
on a.time_id = b.time_id and a.transaction_id = b.transaction_id
WHERE a.time_id = '201601' and b.time_id = '201601'
What would be the suggested practice in indexing?
I was thinking of
create index time_trans on [product] (time_id, transaction_id)
but is it too granular? (since combination of time_id and transaction_id uniquely defines a row)
How the tables were created (by loading CSV files into SQL Server; an updated CSV is provided monthly):
CREATE TABLE [dbo].[table_A] (
[time_id] ...,
[transaction_id] ...,
[other columns] ...
)
BULK INSERT [dbo].[table_A_2010]
FROM 'table_A_2010.CSV'
WITH ( FIRSTROW = 2, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' )
BULK INSERT [dbo].[table_A_2011]
FROM 'table_A_2011.CSV'
WITH ( FIRSTROW = 2, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' )
BULK INSERT [dbo].[table_A_2012]
FROM 'table_A_2012.CSV'
WITH ( FIRSTROW = 2, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' )
...
For any new table, determine what column(s) uniquely identify a row then make that the PRIMARY KEY, which is automatically backed by an index (by default, a clustered index).
Granular indexes are good; in fact, they're usually what you want.
Imagine you have a table with a row for every human in the world. Which of these indexes do you think would be better?
Index by sex (male / female).
Index by name, surname.
Index by bornCountry, documentNumber.
There is no "better" index here. Reasons to compare indexes vary from case to case and at some point even what seems to be a lousy index will work better for certain scenarios.
For you case, creating an index by time_id, transaction_id seems a very sound choice, since you are filtering by time_id and joining against other table with transaction_id. Another case would be if you don't filter by time_id, might want to switch the order of the columns.
If you know that the combination of time_id, transaction_id is unique and must be enforced, you can create a UNIQUE index. If you don't have a clustered index yet, you can create a CLUSTERED INDEX, which will reorganize the actual stored data to match this order and make SELECT queries faster (but might hinder INSERT or UPDATE statements, depending on the inserted or updated values!).
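A sketch of that variant, using the table name from the question (assuming no clustered index exists yet):

```sql
-- Enforces uniqueness of the pair and physically orders the rows by it,
-- so range scans on time_id and the join on both columns are fast.
CREATE UNIQUE CLUSTERED INDEX CX_tableA_time_trans
    ON dbo.table_A (time_id, transaction_id);
```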
If this combination can be repeated, you can simply create a NONCLUSTERED INDEX. It will help if you create the same index on the other table.
CREATE NONCLUSTERED INDEX time_trans on [product] (time_id, transaction_id)
Also keep in mind that you can INCLUDE columns in nonclustered indexes. You don't show which columns you actually SELECT, but consider adding them to the index with INCLUDE so the engine doesn't have to read an additional page from disk when retrieving the data, since included columns store their values alongside the indexed columns.
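For instance, if the SELECT list contained a column named amount (an assumed name, purely for illustration), a covering index could look like this:

```sql
-- amount is stored in the index leaf pages, so the query can be
-- answered from the index alone without touching the base table.
CREATE NONCLUSTERED INDEX time_trans_covering
    ON dbo.table_A (time_id, transaction_id)
    INCLUDE (amount);
```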

XML Index Slows Down Queries

I have a simple table with the following structure, with ~10 million rows:
CREATE TABLE [dbo].[DataPoints](
[ID] [bigint] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[ModuleID] [uniqueidentifier] NOT NULL,
[DateAndTime] [datetime] NOT NULL,
[Username] [nvarchar](100) NULL,
[Payload] [xml] NULL
)
Payload is similar to this for all rows:
<payload>
<total>1000000</total>
<free>300000</free>
</payload>
The following two queries take around 11 seconds each to execute on my dev machine before creating an index on Payload column:
SELECT AVG(Payload.value('(/payload/total)[1]','bigint')) FROM DataPoints
SELECT COUNT(*) FROM DataPoints
WHERE Payload.value('(/payload/total)[1]','bigint') = 1000000
The problem is when I create an XML index on Payload column, both queries take much longer to complete! I want to know:
1) Why is this happening? Isn't an XML index supposed to speed up queries, or at least a query where a value from the XML column is used in the WHERE clause?
2) What would be the proper scenario for using XML indexes, if they are not suitable for my case?
This is on SQL Server 2014.
An ordinary XML index indexes everything in the XML payload.
Selective XML Indexes (SXI)
The main limitation with ordinary XML indexes is that they index the entire XML document. This leads to several significant drawbacks, such as decreased query performance and increased index maintenance cost, mostly related to the storage costs of the index.
You will want to create a Selective XML index for better performance.
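For the payload shown in the question, a selective XML index that promotes only the /payload/total path might look like this (the index and path names are illustrative):

```sql
-- Index only the one path the queries actually use,
-- instead of shredding the entire document.
CREATE SELECTIVE XML INDEX SXI_DataPoints_Payload
    ON dbo.DataPoints (Payload)
    FOR (
        pathTotal = '/payload/total' AS SQL BIGINT
    );
```

Promoting the path as a SQL BIGINT matches the .value('(/payload/total)[1]','bigint') calls in the queries.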
The other option is to create Secondary Indexes
XML Indexes (SQL Server)
To enhance search performance, you can create secondary XML indexes. A primary XML index must first exist before you can create secondary indexes.
So the purpose of the primary XML index is largely to allow you to create the secondary indexes.
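A sketch of creating a primary XML index followed by a secondary VALUE index, using the table from the question (index names are illustrative):

```sql
-- The primary XML index is a prerequisite for any secondary XML index.
CREATE PRIMARY XML INDEX PXML_DataPoints_Payload
    ON dbo.DataPoints (Payload);

-- A VALUE secondary index helps value-based predicates such as
-- WHERE Payload.value('(/payload/total)[1]','bigint') = 1000000.
CREATE XML INDEX IXML_DataPoints_Payload_Value
    ON dbo.DataPoints (Payload)
    USING XML INDEX PXML_DataPoints_Payload
    FOR VALUE;
```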

SQL Server redesign to reduce disk space

We have a SQL database that is going to be redesigned. We are going to change some column types from INT to TINYINT. Will this transformation considerably reduce the on-disk size of the database, or not? The database version is SQL Server 2008 R2, 32-bit.
We are going to change some column types from INT to TINYINT.
Note 1 (my point of view): the real problem isn't storage on disk but the buffer pool.
Note 2: Choosing the right data type for a column (ex. MyColumn) influences the size of following database structures:
First the heap/clustered index size and then
{Non-clustered index | XML index | spatial index} size, if MyColumn is part of the clustered index key, because the clustered index key is duplicated in every non-clustered index row, is duplicated in every primary XML index (I'm not 100% sure: and in every secondary XML FOR PROPERTY index), and is duplicated in every spatial index.
Non-clustered index size if MyColumn is part of index (as a key or as a covered column (SQL2005+, see CREATE ... INDEX ... ON ... INCLUDE()).
If MyColumn has {a PRIMARY KEY constraint | a UNIQUE constraint + NOT NULL | a UNIQUE index + NOT NULL} and there are foreign keys that reference MyColumn, then those FKs must have the same data type, max length, precision, scale and collation as MyColumn (PK/UQ). Of course, these FKs can also be indexed (clustered/non-clustered index).
If MyColumn is included in an indexed view. Indexed views can have unique clustered and non-clustered indexes, so MyColumn can be duplicated again.
If MyColumn has {a PRIMARY KEY constraint | a UNIQUE constraint + NOT NULL | a UNIQUE index + NOT NULL}, then it is duplicated in every full-text index that references this column (see CREATE FULLTEXT INDEX ... KEY unique_index_name).
Topic non-covered: column-store indices (SQL2012+).
So, yes, changing the data type for some columns might have a big impact. You can use this script:
SELECT i.object_id, i.index_id, i.name, i.type_desc, ips.page_count, ips.*
FROM sys.indexes i
INNER JOIN sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED') AS ips ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE i.name IN (
N'PK_PurchaseOrderHeader_PurchaseOrderID', ...
)
ORDER BY i.object_id, i.index_id;
to find out how many pages each {heap structure | index} has before and after the data type change.
Note 3: If you have Enterprise Edition then you could compress data (link 1, link 2).
It will reduce the row size by 3 bytes for each column you change from INT to TINYINT. You may need to reclaim the space manually afterward, though; rebuilding the clustered index will do this.
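A sketch of the change and the reclaim step, with assumed table, column, and index names (all illustrative):

```sql
-- Shrink the column type; existing values must already fit in 0-255.
ALTER TABLE dbo.MyTable ALTER COLUMN MyColumn TINYINT NOT NULL;

-- Rebuilding the clustered index rewrites the rows and reclaims
-- the space freed by the narrower column.
ALTER INDEX PK_MyTable ON dbo.MyTable REBUILD;
```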

Extending SQL Server full-text index to search through foreign keys

I know that a SQL Server full-text index cannot index more than one table. But I have relationships between tables that I would like to implement full-text indexes on.
Take the 3 tables below...
Vehicle
Veh_ID - int (Primary Key)
FK_Atr_VehicleColor - int
Veh_Make - nvarchar(20)
Veh_Model - nvarchar(50)
Veh_LicensePlate - nvarchar(10)
Attributes
Atr_ID - int (Primary Key)
FK_Aty_ID - int
Atr_Name - nvarchar(50)
AttributeTypes
Aty_ID - int (Primary key)
Aty_Name - nvarchar(50)
The Attributes and AttributeTypes tables hold values that can be used in drop down lists throughout the application being built. For example, Attribute Type of "Vehicle Color" with Attributes of "Black", "Blue", "Red", etc...
Ok, so the problem comes when a user is trying to search for a "Blue Ford Mustang". So what is the best solution considering that tables like Vehicle will get rather large?
Do I create another field in the Vehicle table, "Veh_Color", that holds the text value of what is selected in the drop-down, in addition to "FK_Atr_VehicleColor"?
Or do I drop "FK_Atr_VehicleColor" altogether and add "Veh_Color"? I can use the text value of "Veh_Color" to match against "Atr_Name" when the drop-down is populated in an update form. With this approach I will have to handle the case where Attributes are dropped from the database.
I believe it's a common practice to have a separate denormalized table specifically for full-text indexing. This table is then updated by triggers or, as it was in our case, by a SQL Server scheduled task.
This was on SQL Server 2000. In newer versions of SQL Server you can have an indexed view with a full-text index: http://msdn.microsoft.com/en-us/library/ms187317.aspx. But note that there are many restrictions on indexed views; for instance, you can't index a view that uses an OUTER join.
You can create a view that pulls in whatever data you need, then apply the full-text index to the view. The view needs to be created with the 'WITH SCHEMABINDING' option, and needs to have a UNIQUE index.
CREATE VIEW VehicleSearch
WITH SCHEMABINDING
AS
SELECT
v.Veh_ID,
v.Veh_Make,
v.Veh_Model,
v.Veh_LicensePlate,
a.Atr_Name as Veh_Color
FROM
Vehicle v
INNER JOIN
Attributes a on a.Atr_ID = v.FK_Atr_VehicleColor
GO
CREATE UNIQUE CLUSTERED INDEX IX_VehicleSearch_Veh_ID ON VehicleSearch (
Veh_ID ASC
) ON [PRIMARY]
GO
CREATE FULLTEXT INDEX ON VehicleSearch (
Veh_Make LANGUAGE [English],
Veh_Model LANGUAGE [English],
Veh_Color LANGUAGE [English]
)
KEY INDEX IX_VehicleSearch_Veh_ID ON [YourFullTextCatalog]
WITH CHANGE_TRACKING AUTO
GO
As I understand it (I've used SQL Server a lot but never full-text indexing) SQL Server 2005 allows you to create full text indexes against a view. So you could create a view on
SELECT
Vehicle.Veh_ID, ..., Color.Atr_Name AS ColorName
FROM
Vehicle
LEFT OUTER JOIN Attributes AS Color ON (Vehicle.FK_Atr_VehicleColor = Color.Atr_ID)
and then create your full-text index across this view, including 'ColorName' in the index.
