Improving Merge Performance with Change Data Capture and Hash

Improving Merge Performance with Change Data Capture and Hash - sql-server

Today I'm trying to tune the performance of an audit database. I have a legal reason for tracking changes to rows, and I've implemented a set of tables using the System Versioned tables method in SQL Server 2016.
My overall process lands "RAW" data into an initial table from a source system. From here, I then have a MERGE process that takes data from the RAW table and compares every column in the RAW table to what exists in the audit-able system versioned staging table and decides what has changed. System row versioning then tells me what has changed and what hasn't.
The trouble with this approach is that my tables are very wide. Some of them have 400 columns or more. Even tables that have 450,000 records take SQL server about 17 minutes to perform a MERGE operation. It's really slowing down the performance of our solution and it seems it would help things greatly if we could speed it up. We presently have hundreds of tables we need to do this for.
At the moment both the RAW and STAGE tables are indexed on an ID column.
I've read in several places that we might consider using a CHECKSUM or HASHBYTES function to record a value in the RAW extract. (What would you call this? GUID? UUID? Hash?). We'd then compare the calculated value to what exists in the STAGE table. But here's the rub: There are often quite a few NULL values across many columns. It's been suggested that we cast all the column types to be the same (nvarchar(max))?, and NULL values seem to cause the entire computation of the checksum to fall flat. So I'm also coding lots of ISNULL(,'UNKNOWN') statements into my code too.
So - Are there better methods for improving the performance of the merge here? I thought that I could use a row updated timestamp column as a single value to compare instead of the checksum, but I am not certain that that would pass legal muster/scrutiny. Legal is concerned that rows may be edited outside of an interface and the column wouldn't always be updated. I've seen approaches with developers using a concatenate function (shown below) to combine many column values together. This seems code intensive and expensive to compute / cast columns too.
So my questions are:
Given the situational reality, can I improve MERGE performance in any way here?
Should I use a checksum, or hashbytes, and why?
Which hashbytes method makes the most sense here? (I'm only comparing one RAW row to another STAGE row based on an ID match right)?
Did I miss something with functions that might make this comparison faster or easier in the reading
I have done? It seems odd there aren't better functions besides CONCAT available to do this in SQL Server.
I wrote the below code to show some of the ideas I am considering. Is there something better than what I wrote below?
DROP TABLE IF EXISTS MyTable;
CREATE TABLE MyTable
(C1 VARCHAR(10),
C2 VARCHAR(10),
C3 VARCHAR(10)
);
INSERT INTO MyTable
(C1,C2,C3)
VALUES
(NULL,NULL,NULL),
(NULL,NULL,3),
(NULL,2,3),
(1,2,3);
SELECT
HASHBYTES('SHA2_256',
CONCAT(C1,'-',
C2,'-',
C3)) AS HashbytesValueCastWithNoNullCheck,
HASHBYTES('SHA2_256',
CONCAT(CAST(C1 as varchar(max)),'-',
CAST(C2 as varchar(max)),'-',
CAST(C3 as varchar(max)))) AS HashbytesValueCastWithNoNullCheck,
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C2 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C3 as varchar(max)),'UNKNOWN'))) AS HashbytesValueWithCastWithNullCheck,
CONCAT(ISNULL(CAST(C1 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C2 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C3 as varchar(max)),'UNKNOWN')) AS StringValue,
CONCAT(C1,'-',C2,'-',C3) AS ConcatString,
C1,
C2,
C3
FROM
MyTable;
'''

Given the situational reality, can I improve MERGE performance in any way here?
You should test, but storing a hash for every row, computing the hash for the new rows, and comparing based on the (key,hash) should be cheaper than comparing every column.
Should I use a checksum, or hashbytes, and why?
HASHBYTES has a much lower probability of missing a change. Roughly, with CHECKSUM you'll probably eventually miss a change or two, with HASHBYTES you probably won't ever miss a change. See remarks here: BINARY_CHECKSUM.
Did I miss something with functions that might make this comparison faster or easier in the reading I have done?
No. There's no special way to compare multiple columns.
Is there something better than what I wrote below?
You definitely should replace nulls, else a row (1,null,'A') and (1,'A',null) would get the same hash. And you should replace nulls, and delimit, with something that won't appear as a value in any column. And if you have Unicode text, converting to varchar may erase some changes, so it's safer to use nvarchar. eg:
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C2 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C3 as nvarchar(max)),N'~'))) AS HashbytesValueWithCastWithNullCheck
JSON in SQL Server is very fast. So you might try a pattern like:
select t.Id, z.RowJSON, hashbytes('SHA2_256', RowJSON) RowHash
from SomeTable t
cross apply (select t.* for json path) z(RowJSON)

Related

How to identify if row changed in SAP ASE 16.0

I have a SAP ASE 16.0 SP02 PL06 and would like to know if the data of the row changed.
My use-case is that I have a table with the data and the same table with all previous statuses of all rows (it is history of the "data evolution", if you like) plus some auditing columns (rowID, historizationDate). And I need to know if the last historized version and the current version of a given row differs.
First I was overjoyed by HASH() function until I found it takes only single piece of data, e.g. one cell, constant or combination of those. Then my idea moved to ugly and dirty hack: concatenate all columns of the given row and compare those directly (this leads to a lot of convert(varchar, column), but no hashing).
Are there better solution with the respect to the constraints given bellow?
Constraints: I cannot alter the original table, solution has to be as fast as reasonably possible (high throughput through table, high concurrency, literally the heart of whole database) and the source table has a 4-column primary key and high 10´s of columns overall. No sensitive data (like passwords) which needs hashing are present.
Drastically simplified structure of tables:
Original:
CREATE TABLE data (
dataID int,
column1 int,
column2 datetime,
...)
History:
CREATE TABLE dataHistory (
rowID int identity,
historizationDate datetime default getDate(),
dataID int,
column1 int,
column2 datetime,
...)
EDIT: As per #markp´s comment: The table is not accessed directly by user, but through a stored procedure. So when all checking/preparation is done, then data are saved into the table. The problem is that after all checking the new data can be rejected. Not because some invalidity/referential integrity/etc., but because the source of data is deemed less reliable then data already present. This checking is done through a few dozens procedures nested several layers deep and for each column separately, so alternating all those procedures to see if/what they did to the data is not a very viable solution... (Yes, the system is very well matured having some 20 years of age.)
The final use of all those historized data is to see what data was present in the database at any given moment to draw some business conclusions from them (e.g. how often or how much are those data changed, if operational procedures are adhered to and similar uses).

First I was overjoyed by HASH() function until I found it takes only single piece of data, e.g. one cell, constant or combination of those. Then my idea moved to ugly and dirty hack: concatenate all columns of the given row and compare those directly
SAP ASE also has a hashbytes() function that can do something like this:
select hashbytes('sha1',col1,col2,col3) from mytable
Note, in some versions of ASE 16, using 'md5' with hashbytes() causes major memory leakage. You can't use "*" for the column list.

Best SQL indexing plan for wide table

Hello SQL Gurus out there... I have this month long problem that I can't seem to solve. I have this really wide (Reporting) table approx. 100+ fields. Right now, it has around 2M records and with the rate the data is coming, it will probably double this count in the next 1-2months. It is running acceptable at the moment. But not so sure anymore in the next couple of months.
Basically, I just want to know what would be the best indexing plan for this type of table. This is not the actual table, but it is very close. I just want to illustrate the various data types used.
ID bigint IDENTITY (UNIQUE CLUSTERED INDEX)
Column1 varchar(4000) NULL
...
Column50 varchar(1000) NULL
Column60 int NULL
...
Column60 nvarchar(max) NULL
Now, this table is currently queried in this manner:
SELECT <ALL COLUMNS EXCEPT ID>
FROM ThisTable WITH(NOLOCK) --i know how this works dont warn me :)
WHERE
(#Column1 IS NULL OR CHARINDEX('|' + [Column1] + '|', #Column1) > 0))
...
AND (#Column99 IS NULL OR CHARINDEX('|' + [Column99] + '|', #Column99) > 0))
Possible values for each params/Fields are:
#Column1 = '|value1|value2|value3|'
Column1
value1
value2
I know basic execution plan reading and from what I can see... there's a lot of SCANS going on. But as i've said, it seems like i've already hit the wall that's why I'm here now seeking for help :). I know how to index small tables based on what they are intended for, tables but THIS wide?? I can just lol at myself :D
Any idea guys? I've read a little about Columnstore INDEX.. this is the most viable solution that i can think of.. but there's very little information available at this point for this approach.
If you're wondering how I came up with this table. Well its contents are coming from different tables (flatten). Processing are done on a nightly basis. The result are exported into CSV file which will then serves as the data source for another Reporting application.
Thanks in advance!

OK, if you already have a table-valued function, you can probably use it to speed things up. But beware that it makes no sense to use it directly in the query. First do a select from the function into a #temp table(s), and then join to it.
To deal with ORs, use dynamic sql with sp_executesql or option(recompile) (see my answer to similar question here for a sample).
If you can narrow down the select list, you might use covering indexes.
Columnstore indexes are great for aggregations on a single column (because with CSI all values for a column are grouped together, as opposed to row-oriented storage), but won't help in your case.

SQL: Persisted computed field in table sometimes not used during query execution [duplicate]

This question already has an answer here:
Closed 11 years ago.
Possible Duplicate:
Why does the Execution Plan include a user-defined function call for a computed column that is persisted?
In SQL Server 2008 I'm running the SQL profiler on a long running query and can see that a persisted computed column is being repeatedly recalculated. I've noticed this before and anecdotally I'd say that this seems to occur on more complex queries and/or tables with at least a few thousand rows.
This recalculation is definitely the cause of the long execution as it speeds up dramatically if I comment out that one column from the returned results (The field is computed by running an XPath against an Xml field).
EDIT: Offending SQL has the following structure:
DECLARE #OrderBy nvarchar(50);
SELECT
A.[Id],
CASE
WHEN #OrderBy = 'Col1' THEN A.[ComputedCol1]
WHEN #OrderBy = 'Col2' THEN C.[ComputedCol2]
ELSE C.[ComputedCol3]
END AS [Order]
FROM
[Stuff] AS A
INNER JOIN
[StuffCode] AS SC
ON
A.[Code] = SC.[Code]
All columns are nvarchar(50) except for ComputedCol3 which is nvarchar(250).

The query optimizer always tries to pick the cheapest plan, but it may not make the right choice. By persisting a column you are putting it in the main table (in the clustered index or the heap) but in order to pull out these values, normal data access paths are still required.
This means that the engine may choose other indexes instead of the main table to satisfy the query, and it could choose to recalculate the computed column if it thinks doing so combined with its chosen I/O access pattern will cost less. In general, a fair amount of CPU is cheaper than a little I/O, but no internal analysis of the cost of the expression is done, so if your column calls an expensive UDF it may make the wrong decision.
Putting an index on the column could make a difference. Note that you don't have to make the column persisted to put an index on it. If after making an index, the engine is still making mistakes, check to see if you have proper statistics being collected and frequently updated on all the indexes on the table.
It would help us help you if you posted the structure of your table (just the important columns) and the definitions of any indexes, along with some ideas of what the execution plan looks like when things go badly.
One thing to consider is that it may actually be better to recompute the column in some cases, so make sure that it's really correct to force the engine to go get it before doing so.

What is a maintainable way to store large text fields without sacrificing performance?

I have been dancing around this issue for awhile but it keeps coming up. We have a system and our may of our tables start with a description that is originally stored as an NVARCHAR(150) and I then we get a ticket asking to expand the field size to 250, then 1000 etc, etc...
This cycle is repeated on ever "note" field and/or "description" field we add to most tables. Of course the concern for me is performance and breaking the 8k limit of the page. However, my other concern is making the system less maintainable by breaking these fields out of EVERY table in the system into a lazy loaded reference.
So here I am faced with these same to 2 options that have been staring me in the face. (others are welcome) please lend me your opinions.
Change all may notes and/or descriptions to NVARCHAR(MAX) and make sure we do exclude these fields in all listings. Basically never do a: SELECT * FROM [TableName] unless is it only retrieving one record.
Remove all notes and/or description fields and replace them with a forign key reference to a [Notes] table.
CREATE TABLE [dbo].[Notes] (
[NoteId] [int] NOT NULL,
[NoteText] [NVARCHAR](MAX)NOT NULL )
Obviously I would prefer use option 1 because it will change so much in our system if we go with 2. However if option 2 is really the only good way to proceed, then at least I can say these changes are necessary and I have done the homework.
UPDATE:
I ran several test on a sample database with 100,000 records in it. What I find is that the because of cluster index scans the IO required for option 1 is "roughly" twice that of option 2. If I select a large number of records (1000 or more) option 1 is twice as slow even if I do not include the large text field in the select. As I request less rows the lines blur more. I a web app where page sizes of 50 or so are the norm, so option 1 will work, but I will be converting all instances to option 2 in the (very) near future for scalability.

Option 2 is better for several reasons:
When querying your tables, the large
text fields fill up pages quickly,
forcing the database to scan more
pages to retrieve data. This is
especially taxing when you don't
actually need to return the text
data.
As you mentioned, it gives you
a clean break to change the data
type in one swoop. Microsoft has
deprecated TEXT in SQL Server 2008,
so you should stick with
VARCHAR/VARBINARY.
Separate filegroups. Having
all your text data in a slower,
cheaper storage location might be
something you decide to pursue in
the future. If not, no harm, no
foul.
While Option 1 is easier for now, Option 2 will give you more flexibility in the long-term. My suggestion would be to implement a simple proof-of-concept with the "notes" information separated from the main table and perform some of your queries on both examples. Compare the execution plans, client statistics and logical I/O reads (SET STATISTICS IO ON) for some of your queries against these tables.
A quick note to those suggesting the use of a TEXT/NTEXT from MSDN:
This feature will be removed in a
future version of Microsoft SQL
Server. Avoid using this feature in
new development work, and plan to
modify applications that currently use
this feature. Use varchar(max),
nvarchar(max) and varbinary(max) data
types instead. For more information,
see Using Large-Value Data Types.

I'd go with Option 2.
You can create a view that joins the two tables to make the transition easier on everyone, and then go through a clean-up process that removes the view and uses the single table wherever possible.

You want to use a TEXT field. TEXT fields aren't stored directly in the row; instead, it stores a pointer to the text data. This is transparent to queries, though - if you ask for a TEXT field, it will return the actual text, not the pointer.
Essentially, using a TEXT field is somewhat between your two solutions. It keeps your table rows much smaller than using a varchar, but you'll still want to avoid asking for them in your queries if possible.

The TEXT/NTEXT data type has practically unlimited length while taking up next to nothing in your record.
It comes with a few strings attached, like special behavior with string functions, but for a secondary "notes/description" type of field these may be less of a problem.

Just to expand on Option 2
You could:
Rename existing MyTable to MyTable_V2
Move the Notes column into a joined Notes table (with 1:1 joining ID)
Create a VIEW called MyTable that joins MyTable_V2 and Notes tables
Create an INSTEAD OF trigger on MyTable view which saves the Notes column into the Notes table (IF NULL then delete any existing Notes row, if NOT NULL then Insert if not found, otherwise Update). Perform appropriate action on MyTable_V2 table
Note: We've had trouble doing this where there is a Computed column in MyTable_V2 (I think that was the problem, either way we've hit snags when doing this with "unusual" tables)
All new Insert/Update/Delete code should be written to operate directly on MyTable_V2 and Notes tables
Optionally: Have the INSERT OF trigger on MyTable log the fact that it was called (it can do this minimally, UPDATE a pre-existing log table row with GetDate() only if existing row's date is > 24 hours old - so will only do an update once a day).
When you are no longer getting any log records you can drop the INSTEAD OF trigger on MyTable view and you are now fully MyTable_V2 compliant!
Huge amount of hassle to implement, as you surmised.
Alternatively trawl the code for all references to MyTable and change them to MyTable_V2, put a VIEW in place of MyTable for SELECT only, and not create the INSTEAD OF trigger.
My plan would be to fix all Insert/Update/Delete statements referencing the now deprecated MyTable. For me this would be made somewhat easier because we use unique names for all tables and columns in the database, and we use the same names in all application code, so making sure I had found all instances by a simple FIND would be high.
P.S. Option 2 is also preferable if you have any SELECT * lying around. We have had clients whos application performance has gone downhill fast when they added large Text/Blob columns to existing tables - because of "lazy" SELECT * statements. Hopefully that isn;t the case in your shop though!

TSQL "LIKE" or Regular Expressions?

I have a bunch (750K) of records in one table that I have to see they're in another table. The second table has millions of records, and the data is something like this:
Source table
9999-A1B-1234X, with the middle part potentially being longer than three digits
Target table
DescriptionPhrase9999-A1B-1234X(9 pages) - yes, the parens and the words are in the field.
Currently I'm running a .net app that loads the source records, then runs through and searches on a like (using a tsql function) to determine if there are any records. If yes, the source table is updated with a positive. If not, the record is left alone.
the app processes about 1000 records an hour. When I did this as a cursor sproc on sql server, I pretty much got the same speed.
Any ideas if regular expressions or any other methodology would make it go faster?

What about doing it all in the DB, rather than pulling records into your .Net app:
UPDATE source_table s SET some_field = true WHERE EXISTS
(
SELECT target_join_field FROM target_table t
WHERE t.target_join_field LIKE '%' + s.source_join_field + '%'
)
This will reduce the total number of queries from 750k update queries down to 1 update.

First I would redesign if at all possible. Better to add a column that contains the correct value and be able to join on it. If you still need the long one. you can use a trigger to extract the data into the column at the time it is inserted.
If you have data you can match on you need neither like '%somestuff%' which can't use indexes or a cursor both of which are performance killers. This should bea set-based task if you have designed properly. If the design is bad and can't be changed to a good design, I see no good way to get good performance using t-SQl and I would attempt the regular expression route. Not knowing how many different prharses and the structure of each, I cannot say if the regular expression route would be easy or even possible. But short of a redesign (which I strongly suggest you do), I don't see another possibility.
BTW if you are working with tables that large, I would resolve to never write another cursor. They are extremely bad for performance especially when you start taking about that size of record. Learn to think in sets not record by record processing.

One thing to be aware of with using a single update (mbeckish's answer) is that the transaction log (enabling a rollback if the query becomes cancelled) will be huge. This will drastically slow down your query. As such it is probably better to proces them in blocks of 1,000 rows or such like.
Also, the condition (b.field like '%' + a.field + '%') will need to check every single record in b (millions) for every record in a (750,000). That equates to more than 750 billion string comparisons. Not great.
The gut feel "index stuff" won't help here either. An index keeps things in order, so the first character(s) dictate the position in the index, not the ones you're interested in.
First Idea
For this reason I would actually consider creating another table, and parsing the long/messy value into something nicer. An example would be just to strip off any text from the last '(' onwards. (This assumes all the values follow that pattern) This would simplify the query condition to (b.field like '%' + a.field)
Still, an index wouldn't help here either though as the important characters are at the end. So, bizarrely, it could well be worth while storing the characters of both tables in reverse order. The index on you temporary table would then come in to use.
It may seem very wastefull to spent that much time, but in this case a small benefit would yield a greate reward. (A few hours work to halve the comparisons from 750billion to 375billion, for example. And if you can get the index in to play you could reduce this a thousand fold thanks to index being tree searches, not just ordered tables...)
Second Idea
Assuming you do copy the target table into a temp table, you may benefit extra from processing them in blocks of 1000 by also deleting the matching records from the target table. (This would only be worthwhile where you delete a meaningful amount from the target table. Such that after all 750,000 records have been checked, the target table is now [for example] half the size that it started at.)
EDIT:
Modified Second Idea
Put the whole target table in to a temp table.
Pre-process the values as much as possible to make the string comparison faster, or even bring indexes in to play.
Loop through each record from the source table one at a time. Use the following logic in your loop...
DELETE target WHERE field LIKE '%' + #source_field + '%'
IF (##row_count = 0)
[no matches]
ELSE
[matches]
The continuous deleting makes the query faster on each loop, and you're only using one query on the data (instead of one to find matches, and a second to delete the matches)

Try this --
update SourceTable
set ContainsBit = 1
from SourceTable t1
join (select TargetField from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0

First thing is to make sure you have an index for that column on the searched table. Second is to do the LIKE without a % sign on the left side. Check the execution plan to see if you are not doing a table scan on every row.
As le dorfier correctly pointed out, there is little hope if you are using a UDF.

There are lots of ways to skin the cat - I would think that first it would be important to know if this is a one-time operation, or a regular task that needs to be completed regularly.
Not knowing all the details of you problem, if it was me, at this was a one-time (or infrequent operation, which it sounds like it is), I'd probably extract out just the pertinent fields from the two tables including the primary key from the source table and export them down to a local machine as text files. The files sizes will likely be significantly smaller than the full tables in your database.
I'd run it locally on a fast machine using a routine written in something like 'C'/C++ or another "lightweight" language that has raw processing power, and write out a table of primary keys that "match", which I would then load back into the sql server and use it as a basis of an update query (i.e. update source table where id in select id from temp table).
You might spend a few hours writing the routine, but it would run in a fraction of the time you are seeing in sql.
By the sounds of you sql, you may be trying to do 750,000 table scans against a multi-million records table.
Tell us more about the problem.

Holy smoke, what great responses!
system is on disconnected network, so I can't copy paste, but here's the retype
Current UDF:
Create function CountInTrim
(#caseno varchar255)
returns int
as
Begin
declare #reccount int
select #reccount = count(recId) from targettable where title like '%' + #caseNo +'%'
return #reccount
end
Basically, if there's a record count, then there's a match, and the .net app updates the record. The cursor based sproc had the same logic.
Also, this is a one time process, determining which entries in a legacy record/case management system migrated successfully into the new system, so I can't redesign anything. Of course, developers of either system are no longer available, and while I have some sql experience, I am by no means an expert.
I parsed the case numbers from the crazy way the old system had to make the source table, and that's the only thing in common with the new system, the case number format. I COULD attempt to parse out the case number in the new system, then run matches against the two sets, but with a possible set of data like:
DescriptionPhrase1999-A1C-12345(5 pages)
Phrase/Two2000-A1C2F-5432S(27 Pages)
DescPhraseThree2002-B2B-2345R(8 pages)
Parsing that became a bit more complex so I thought I'd keep it simpler.
I'm going to try the single update statement, then fall back to regex in the clr if needed.
I'll update the results. And, since I've already processed more than half the records, that should help.

Try either Dan R's update query from above:
update SourceTable
set ContainsBit = 1
from SourceTable t1
join (select TargetField
from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
Alternatively, if the timeliness of this is important and this is sql 2005 or later, then this would be a classic use for a calculated column using SQL CLR code with Regular Expressions - no need for a standalone app.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight