Querying a varbinary column in SQL Server

I'm having trouble querying varbinary columns with the CONTAINS predicate: it only works on my nvarchar/varchar columns, even though the MSDN documentation states that it also works on image/varbinary columns.
I have this table
CREATE TABLE [dbo].[Documents]
(
[id] [int] IDENTITY(1,1) NOT NULL,
[title] [nvarchar](100) NOT NULL,
[doctype] [nchar](4) NOT NULL,
[docexcerpt] [nvarchar](1000) NOT NULL,
[doccontent] [varbinary](max) NOT NULL,
CONSTRAINT [PK_Documents]
PRIMARY KEY CLUSTERED ([id] ASC)
)
doctype - document type (format)
docexcerpt - small fragment of the document
doccontent - whole document stored in varbinary
Code:
INSERT INTO dbo.Documents (title, doctype, docexcerpt, doccontent)
SELECT
N'Columnstore Indices and Batch Processing',
N'docx',
N'You should use a columnstore index on your fact tables, putting all columns of a fact table in a columnstore index. In addition to fact tables, very large dimensions could benefit from columnstore indices as well. Do not use columnstore indices for small dimensions. ',
bulkcolumn
FROM
OPENROWSET(BULK 'myUrl', SINGLE_BLOB) AS doc;
The insert succeeds and the row shows up in the table.
I have installed the Microsoft Office 2010 Filter Packs and registered them in SQL Server, then checked that the format I need (.docx) is supported using:
SELECT document_type, path
FROM sys.fulltext_document_types;
The output confirmed that .docx appears in the list.
My issue is that my CONTAINS query returns nothing.
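The original query was only shown in a screenshot; a representative CONTAINS query against the varbinary column would look like this (the search term is illustrative):
SELECT id, title
FROM dbo.Documents
WHERE CONTAINS(doccontent, N'columnstore');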
As an observation, I have created a full-text catalog and an index on my table using the following code, making both docexcerpt and doccontent indexable columns:
--fulltext index
CREATE FULLTEXT INDEX ON dbo.Documents
(
docexcerpt LANGUAGE 1033,
doccontent TYPE COLUMN doctype LANGUAGE 1033
STATISTICAL_SEMANTICS
)
KEY INDEX PK_Documents
ON DocumentsFtCatalog
WITH STOPLIST = SQLStopList,
SEARCH PROPERTY LIST = WordSearchPropertyList,
CHANGE_TRACKING AUTO;
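For completeness: after fixing document contents, a full repopulation of the index can also be triggered manually (a minimal sketch):
-- forces the full-text index to re-crawl every row in the table
ALTER FULLTEXT INDEX ON dbo.Documents START FULL POPULATION;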
I'm not sure what I'm doing wrong or missing. I'd appreciate any help. Thanks!

I've managed to 'solve' the mystery, well... I forgot that I had to re-insert my documents into the table (after editing them) in order for my queries to work properly. Can't believe I've been so numb.

Related

Dynamic SQL to execute a large number of rows from a table

I have a table with a very large number of rows, each holding a SQL statement that I wish to execute via dynamic SQL. They are basically existence checks and INSERT statements; I want to migrate data from one production database to another as we merge transactional data. I am trying to find the optimal way to execute these rows.
I've found that the COALESCE method of appending all the rows to one another is not efficient for this, particularly when more than ~100 rows are executed at a time.
Assume the structure of the source table is something arbitrary like this:
CREATE TABLE [dbo].[MyTable]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[DataField1] [int] NOT NULL,
[FK_ID1] [int] NOT NULL,
[LotsMoreFields] [NVARCHAR] (MAX),
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED ([ID] ASC)
)
CREATE TABLE [dbo].[FK1]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Name] [int] NOT NULL, -- Unique constrained value
CONSTRAINT [PK_FK1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
The other requirement is that I am tracking the source table PK vs the target PK, and whether an insert occurred or whether I have already migrated that row to the target. To do this, I'm tracking migrated rows in another table like so:
CREATE TABLE [dbo].[ChangeTracking]
(
[ReferenceID] BIGINT IDENTITY(1,1),
[Src_ID] BIGINT,
[Dest_ID] BIGINT,
[TableName] NVARCHAR(255),
CONSTRAINT [PK_ChangeTracking] PRIMARY KEY CLUSTERED ([ReferenceID] ASC)
)
My existing method executes dynamic SQL generated by a stored procedure. The stored proc does PK lookups, since the source system has different PK values for table [dbo].[FK1].
E.g.
IF NOT EXISTS (<ignore this existence check for now>)
BEGIN
INSERT INTO [Dest].[dbo].[MyTable] ([DataField1],[FK_ID1],[LotsMoreFields]) VALUES (333,(SELECT [ID] FROM [Dest].[dbo].[FK1] WHERE [Name]=N'ValueFoundInSource'),N'LotsMoreValues');
INSERT INTO [Dest].[dbo].[ChangeTracking] ([Src_ID],[Dest_ID],[TableName]) VALUES (666,SCOPE_IDENTITY(),N'MyTable'); --666 is the PK in [Src].[dbo].[MyTable] for this inserted row
END
So when you have a million of these, it isn't quick.
Is there a recommended performant way of doing this?
As mentioned, the MERGE statement works well when you're looking at a complex JOIN condition (if any of these fields are different, update the record to match). You can also look into creating a HASHBYTES hash of the entire record to quickly find differences between source and target tables, though that can also be time-consuming on very large data sets.
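A minimal sketch of the HASHBYTES idea, assuming rows are matched through the ChangeTracking table defined above and hashing an illustrative subset of columns (the delimiter guards against ambiguous concatenations):
-- find migrated rows whose source and destination contents have drifted apart
SELECT ct.[Src_ID], ct.[Dest_ID]
FROM [Src].[dbo].[MyTable] s
JOIN [Dest].[dbo].[ChangeTracking] ct
    ON ct.[Src_ID] = s.[ID] AND ct.[TableName] = N'MyTable'
JOIN [Dest].[dbo].[MyTable] d
    ON d.[ID] = ct.[Dest_ID]
WHERE HASHBYTES('SHA2_256', CONCAT(s.[DataField1], N'|', s.[LotsMoreFields]))
   <> HASHBYTES('SHA2_256', CONCAT(d.[DataField1], N'|', d.[LotsMoreFields]));
-- note: before SQL Server 2016, HASHBYTES input is capped at 8000 bytes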
It sounds like you're making these updates like a front-end developer, by checking each row for a match and then doing the insert. It will be far more efficient to do the inserts with a single query. Below is an example that looks for names that are in the tblNewClient table, but not in the tblClient table:
INSERT INTO tblClient
( [Name] ,
TypeID ,
ParentID
)
SELECT nc.[Name] ,
nc.TypeID ,
nc.ParentID
FROM tblNewClient nc
LEFT JOIN tblClient cl
ON nc.[Name] = cl.[Name]
WHERE cl.ID IS NULL;
This will be way more efficient than doing it RBAR (row by agonizing row).
Taking the two answers from @RusselFox and putting them together, I reached this tentative solution (which is looking a LOT more efficient):
MERGE INTO [Dest].[dbo].[MyTable] [MT_D]
USING (SELECT [MT_S].[ID] AS [SrcID], [MT_S].[DataField1], [FK1_D].[ID] AS [FK_ID1], [MT_S].[LotsMoreFields]
       FROM [Src].[dbo].[MyTable] [MT_S]
       JOIN [Src].[dbo].[FK1] [FK1_S] ON [MT_S].[FK_ID1] = [FK1_S].[ID]
       JOIN [Dest].[dbo].[FK1] [FK1_D] ON [FK1_S].[Name] = [FK1_D].[Name]
      ) [SRC] ON 1 = 0 -- never matches, so every source row takes the NOT MATCHED branch
WHEN NOT MATCHED THEN
    INSERT ([DataField1], [FK_ID1], [LotsMoreFields])
    VALUES ([SRC].[DataField1], [SRC].[FK_ID1], [SRC].[LotsMoreFields])
-- MERGE is used instead of plain INSERT...SELECT because its OUTPUT clause can
-- reference source columns such as [SRC].[SrcID], which INSERT's OUTPUT cannot.
-- Note: [AlreadyExists] is an extra column beyond the ChangeTracking DDL shown above.
OUTPUT [SRC].[SrcID], INSERTED.[ID], 0, N'MyTable'
    INTO [Dest].[dbo].[ChangeTracking] ([Src_ID], [Dest_ID], [AlreadyExists], [TableName]);

XML Index Slows Down Queries

I have a simple table with the following structure, with ~10 million rows:
CREATE TABLE [dbo].[DataPoints](
[ID] [bigint] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[ModuleID] [uniqueidentifier] NOT NULL,
[DateAndTime] [datetime] NOT NULL,
[Username] [nvarchar](100) NULL,
[Payload] [xml] NULL
)
Payload is similar to this for all rows:
<payload>
<total>1000000</total>
<free>300000</free>
</payload>
The following two queries each take around 11 seconds to execute on my dev machine, before creating an index on the Payload column:
SELECT AVG(Payload.value('(/payload/total)[1]','bigint')) FROM DataPoints
SELECT COUNT(*) FROM DataPoints
WHERE Payload.value('(/payload/total)[1]','bigint') = 1000000
The problem is that when I create an XML index on the Payload column, both queries take much longer to complete! I want to know:
1) Why is this happening? Isn't an XML index supposed to speed up queries, or at least a query where a value from the XML column is used in the WHERE clause?
2) What would be the proper scenario for using XML indexes, if they are not suitable for my case?
This is on SQL Server 2014.
An ordinary XML index indexes everything in the XML payload.
Selective XML Indexes (SXI)
The main limitation with ordinary XML indexes is that they index the entire XML document. This leads to several significant drawbacks, such as decreased query performance and increased index maintenance cost, mostly related to the storage costs of the index.
You will want to create a Selective XML index for better performance.
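A minimal sketch of a selective XML index for this table, promoting only the /payload/total path that the queries actually use (index and path names are mine):
CREATE SELECTIVE XML INDEX sxi_DataPoints_Payload
ON dbo.DataPoints (Payload)
FOR
(
    -- promote the one node the queries filter and aggregate on
    pathTotal = '/payload/total' AS SQL BIGINT
);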
The other option is to create Secondary Indexes
XML Indexes (SQL Server)
To enhance search performance, you can create secondary XML indexes. A primary XML index must first exist before you can create secondary indexes.
So the main purpose of the primary XML index is to allow you to create the secondary indexes.
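For reference, the ordinary-index pattern looks like this (index names are mine; the VALUE secondary index targets queries that filter on node values, like the COUNT query above):
CREATE PRIMARY XML INDEX pxi_DataPoints_Payload
ON dbo.DataPoints (Payload);

CREATE XML INDEX ixi_DataPoints_Payload_Value
ON dbo.DataPoints (Payload)
USING XML INDEX pxi_DataPoints_Payload
FOR VALUE;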

Eastern Character Set Causes Problems For SQL Server 2012

I have a table with contents:
internalid foreignWord
1 បរិស្ថាន
2 ការអភិវឌ្ឍសហគមន៍
And its schema:
CREATE TABLE [dbo].[CE_testTable](
[internalid] [int] IDENTITY(1,1) NOT NULL,
[foreignWord] [nvarchar](50) NOT NULL
)
If I run:
SELECT TOP 1000 [internalid] ,[foreignWord] FROM CE_testTable where foreignWord = N'ការអភិវឌ្ឍសហគមន៍'
I get:
internalid foreignWord
1 បរិស្ថាន
2 ការអភិវឌ្ឍសហគមន៍
Which is both rows; it should have returned only the row with "ការអភិវឌ្ឍសហគមន៍", which is "community development" in Cambodian.
It is an NVARCHAR column and I'm using the N'...' prefix in the WHERE clause. Any ideas?
Change the collation to Latin1_General_100_CI_AS. The older collations define no sort weights for Khmer characters, so all such strings compare as equal; the 100-level collations added weights for these Unicode ranges.
You can specify the collation for each column when you create the table.
If you don't specify a collation, the columns get the database's default collation.
CREATE TABLE [dbo].[CE_testTable](
[internalid] [int] IDENTITY(1,1) NOT NULL,
[foreignWord] [nvarchar](50) collate Latin1_General_100_CI_AS NOT NULL
)
SQL Fiddle
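If changing the column collation isn't an option, you can also force the 100-level collation at query time (a minimal sketch):
SELECT [internalid], [foreignWord]
FROM CE_testTable
WHERE foreignWord COLLATE Latin1_General_100_CI_AS = N'ការអភិវឌ្ឍសហគមន៍';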
Try the query without the N
SELECT [internalid],
[foreignWord]
FROM CE_testTable
WHERE foreignWord = 'ការអភិវឌ្ឍសហគមន៍'
It seems like SQL Server can't do what I'm asking of it.
Looking at the comment from Erland Sommarskog, this explains my situation. The data stores OK, and I can see the rows in there, but comparisons may fail, and they did. So it's a design-level problem: I can't have different collations on the table, so I can't compare. For me this is not a problem; it was only a PK that was erroring, and I can work around it.

How can I add a timestamp column to my SQL Server table when I create it?

I am trying to use the following:
CREATE TABLE [dbo].[Application] (
[ApplicationId] INT IDENTITY (1, 1) NOT NULL,
[Name] NVARCHAR (MAX) NULL,
timestamp
CONSTRAINT [PK_dbo.Application] PRIMARY KEY CLUSTERED ([ApplicationId] ASC)
);
Can someone confirm whether this is the correct way? Also, can or should I give that column a name of its own?
* Note that I am using Entity Framework. Is it okay to add a column like this but not add it to the Application object?
I think that timestamp is a poor name for that data type (it does not store time), and somewhere along the way Microsoft apparently agreed: the timestamp syntax has been deprecated since SQL Server 2008 in favor of rowversion, a synonym introduced back in SQL Server 2000.
Your code relies on a quirk of timestamp: it gives the column a default name. rowversion does not do that, so you have to give the column a name yourself.
CREATE TABLE [dbo].[Application] (
[ApplicationId] INT IDENTITY (1, 1) NOT NULL,
[Name] NVARCHAR (MAX) NULL,
VerCol rowversion,
CONSTRAINT [PK_dbo.Application] PRIMARY KEY CLUSTERED ([ApplicationId] ASC)
);
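A quick way to see the behavior (a hypothetical demonstration against the table above; the literal values are mine):
-- VerCol is populated on INSERT and changes on every UPDATE, with no trigger or default needed
INSERT INTO [dbo].[Application] ([Name]) VALUES (N'Demo');
SELECT [ApplicationId], [VerCol] FROM [dbo].[Application];   -- note the initial value

UPDATE [dbo].[Application] SET [Name] = N'Demo (renamed)' WHERE [ApplicationId] = 1;
SELECT [ApplicationId], [VerCol] FROM [dbo].[Application];   -- VerCol has changed automatically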
Ref:
rowversion (Transact-SQL)
timestamp (Transact-SQL, SQL Server 2000)
* Note that I know nothing about using Entity Framework.
If what you actually want is the date and time the row was created, rather than a rowversion, a DATETIME column with a default works:
CREATE TABLE [dbo].[Application] (
[ApplicationId] INT IDENTITY (1, 1) NOT NULL,
[Name] NVARCHAR (MAX) NULL,
[timestamp] DATETIME NULL DEFAULT GETDATE(),
CONSTRAINT [PK_dbo.Application] PRIMARY KEY CLUSTERED ([ApplicationId] ASC)
);
To add the timestamp / rowversion to an existing table you can do this.
ALTER TABLE OrderAction ADD [RowVersion] rowversion NOT NULL
It will automatically assign the values; you don't need to do anything like UPDATE rowversion = getdate().
Please note that if your table is large it can take a while, since it needs to write a rowversion value for every row. If you have a huge table and you're using a scalable database like Azure SQL, you might want to increase capacity first and/or do it during off hours.
The timestamp data type is identical to the rowversion data type; it's just up to you what you call the column.
It also doesn't need to be in your data model to be updated by an UPDATE or INSERT. However, if it isn't in your data model then you won't actually benefit from the whole point of it, which is optimistic concurrency via a simplified UPDATE predicate like this:
WHERE ([OrderId] = @p0) AND ([RowVersion] = @p1)
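A hand-rolled version of that optimistic-concurrency check might look like this (table and RowVersion column from the ALTER TABLE example above; the SET column and parameter names are hypothetical):
UPDATE dbo.OrderAction
SET SomeColumn = @NewValue                      -- hypothetical column being changed
WHERE OrderId = @OrderId
  AND [RowVersion] = @OriginalRowVersion;       -- rowversion value read when the row was fetched

IF @@ROWCOUNT = 0
    RAISERROR(N'Row was modified by another user since it was read.', 16, 1);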

SQL Server query using a function and a view is slower

I have a table with an xml column named Data:
CREATE TABLE [dbo].[Users](
[UserId] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [nvarchar](max) NOT NULL,
[LastName] [nvarchar](max) NOT NULL,
[Email] [nvarchar](250) NOT NULL,
[Password] [nvarchar](max) NULL,
[UserName] [nvarchar](250) NOT NULL,
[LanguageId] [int] NOT NULL,
[Data] [xml] NULL,
[IsDeleted] [bit] NOT NULL,...
In the Data column there's this XML:
<data>
<RRN>...</RRN>
<DateOfBirth>...</DateOfBirth>
<Gender>...</Gender>
</data>
Now, executing this query:
SELECT UserId FROM Users
WHERE data.value('(/data/RRN)[1]', 'nvarchar(max)') = @RRN
after clearing the cache takes 910, 739, 630, 635, ... ms over successive executions.
Now, a DB specialist told me that adding a function and a view, and changing the query, would make it much faster to search for a user with a given RRN. But instead, these are the timings when I execute the query with the specialist's changes: 2584, 2342, 2322, 2383, ... ms.
This is the added function:
CREATE FUNCTION dbo.fn_Users_RRN(@data xml)
RETURNS nvarchar(100)
WITH SCHEMABINDING
AS
BEGIN
RETURN @data.value('(/data/RRN)[1]', 'varchar(max)');
END;
The added view:
CREATE VIEW vwi_Users
WITH SCHEMABINDING
AS
SELECT UserId, dbo.fn_Users_RRN(Data) AS RRN from dbo.Users
Indexes:
CREATE UNIQUE CLUSTERED INDEX cx_vwi_Users ON vwi_Users(UserId)
CREATE NONCLUSTERED INDEX cx_vwi_Users__RRN ON vwi_Users(RRN)
And then the changed query:
SELECT UserId FROM Users
WHERE dbo.fn_Users_RRN(Data) = @RRN
Why is the solution with a function and a view going slower?
the point of the view was to pre-compute the XML value into a regular column. To then use that precomputed value in the index on the view, shouldn't you actually query the view?
SELECT
UserId
FROM vwi_Users
WHERE RRN= '59021626919-61861855-S_FA1E11'
also, make the index this:
CREATE NONCLUSTERED INDEX cx_vwi_Users__RRN ON vwi_Users(RRN) INCLUDE (UserId)
it is called a covering index, since all columns needed in the query are in the index.
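One caveat worth noting: on editions other than Enterprise, the optimizer only uses an indexed view's indexes when the query references the view with the NOEXPAND hint, so the query above may need:
SELECT UserId
FROM vwi_Users WITH (NOEXPAND)
WHERE RRN = N'59021626919-61861855-S_FA1E11'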
Have you tried adding that function result to your table (not the view) as a persisted computed column?
ALTER TABLE dbo.Users
ADD RRN AS dbo.fn_Users_RRN(Data) PERSISTED  -- a computed column needs a name; RRN is used here
Doing so will extract that piece of information from the XML, store it in a computed, always up-to-date column, and the PERSISTED flag makes it physically stored alongside the other columns in your table.
If this works (the PERSISTED flag is a bit iffy in terms of all the limitations it has), then you should see nearly the same performance as querying any other string field on your table... and if the computed column is PERSISTED, you can even put an index on it if you feel the need for that.
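For instance, assuming the computed column was added under the name RRN as in the ALTER TABLE sketch above, indexing it is then a one-liner:
CREATE NONCLUSTERED INDEX IX_Users_RRN ON dbo.Users (RRN)  -- index name is illustrative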
Check the query execution plan and confirm whether or not the new query is even using the view. If the query doesn't use the view, that's the problem.
How does this query fare?
SELECT UserId FROM vwi_Users
WHERE RRN = '59021626919-61861855-S_FA1E11'
I see you're freely mixing nvarchar and varchar. Don't do that! It can cause full index conversions (eeeeevil).
Scalar functions tend to perform very poorly in SQL Server. I'm not sure why, if you make it a persisted computed column and index it, it doesn't have identical performance to a normal indexed column, but it may be because the UDF is still being called even though you'd think it no longer needs to be once the data is computed.
I think you know this from another answer, but your final query is wrongly calling the scalar UDF on every row (defeating the point of persisting the computation):
SELECT UserId FROM Users
WHERE dbo.fn_Users_RRN(Data) = @RRN
It should be
SELECT UserId FROM vwi_Users
WHERE RRN = @RRN
