Many times I need to move the data of a large table (let's call it source) to a clone of it (let's call it target). Due to the large size, instead of just deleting/inserting all, I prefer to upsert.
For simplicity, let's assume an int PK column named "id".
Until now, in order to do this, I've used the datetime field dbupddate, which exists on both tables and holds the most recent time the row was inserted or updated. This is kept current by a trigger which, on any insert or update, sets dbupddate to getdate().
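For reference, such a trigger could look roughly like this (the trigger name is made up; the table and column names follow the description above):

-- Hypothetical maintenance trigger: stamps dbupddate on every insert/update of source.
create trigger trg_source_dbupddate
on source
after insert, update
as
begin
    set nocount on;
    update s
    set dbupddate = getdate()
    from source s
    inner join inserted i on i.id = s.id;
end;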
Thus, my run-of-the-mill upsert code until now looks something like:
update t
set col1 = s.col1,
    col2 = s.col2 -- ...and so on for the remaining columns
from source s
inner join target t on s.id = t.id and s.dbupddate > t.dbupddate;

insert into target
select *
from source s
where not exists (select 1 from target t where t.id = s.id);
Recently I stumbled on rowversion. I have read and understood up to an extent its function, but I'd like to know practically what benefits/drawbacks there are in case I change dbupddate to rowversion instead of datetime.
Although datetime carries information that may be useful in some cases (it tells you when the row was last changed), rowversion is more reliable for change detection: the system clock can be adjusted and has limited precision, while rowversion is an ever-increasing value that the engine maintains on every insert or update.
In your case, I would personally prefer rowversion for its reliability.
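One practical note on how the switch plays out: you cannot insert explicit values into a rowversion column, so you would not copy it into target and compare the two columns; a common pattern is instead to keep a high-water mark of the last source rowversion already processed. A minimal sketch, assuming a rowversion column named rv on source and a one-row watermark table (both are assumptions, not part of the question):

-- Assumed: source has a column "rv rowversion" and a one-row table
-- etl_watermark(last_rv binary(8)) holding the last value already copied.
declare @last_rv binary(8), @current_rv binary(8);
select @last_rv = last_rv from etl_watermark;
set @current_rv = @@DBTS;  -- last rowversion value used in this database
                           -- (min_active_rowversion() is the safer choice with long-running transactions)

update t
set col1 = s.col1,
    col2 = s.col2 -- ...remaining columns
from source s
inner join target t on s.id = t.id
where s.rv > @last_rv and s.rv <= @current_rv;

insert into target (id, col1, col2)
select s.id, s.col1, s.col2
from source s
where s.rv > @last_rv and s.rv <= @current_rv
  and not exists (select 1 from target t where t.id = s.id);

update etl_watermark set last_rv = @current_rv;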
Today I'm trying to tune the performance of an audit database. I have a legal reason for tracking changes to rows, and I've implemented a set of tables using the System Versioned tables method in SQL Server 2016.
My overall process lands "RAW" data into an initial table from a source system. From there, I have a MERGE process that takes data from the RAW table, compares every column in the RAW table to what exists in the auditable, system-versioned staging table, and decides what has changed. System row versioning then tells me what has changed and what hasn't.
The trouble with this approach is that my tables are very wide. Some of them have 400 columns or more. Even tables with 450,000 records take SQL Server about 17 minutes to perform a MERGE operation. It's really slowing down the performance of our solution, and it seems it would help things greatly if we could speed it up. We presently have hundreds of tables we need to do this for.
At the moment both the RAW and STAGE tables are indexed on an ID column.
I've read in several places that we might consider using a CHECKSUM or HASHBYTES function to record a value in the RAW extract. (What would you call this? GUID? UUID? Hash?) We'd then compare the calculated value to what exists in the STAGE table. But here's the rub: there are often quite a few NULL values across many columns. It's been suggested that we cast all the column types to be the same (nvarchar(max)?), since NULL values seem to cause the entire computation of the checksum to fall flat. So I'm also coding lots of ISNULL(,'UNKNOWN') statements into my code too.
So - Are there better methods for improving the performance of the merge here? I thought that I could use a row updated timestamp column as a single value to compare instead of the checksum, but I am not certain that that would pass legal muster/scrutiny. Legal is concerned that rows may be edited outside of an interface and the column wouldn't always be updated. I've seen approaches with developers using a concatenate function (shown below) to combine many column values together. This seems code intensive and expensive to compute / cast columns too.
So my questions are:
Given the situational reality, can I improve MERGE performance in any way here?
Should I use a checksum, or hashbytes, and why?
Which hashbytes method makes the most sense here? (I'm only comparing one RAW row to another STAGE row based on an ID match, right?)
Did I miss something with functions that might make this comparison faster or easier in the reading I have done? It seems odd there aren't better functions besides CONCAT available to do this in SQL Server.
I wrote the below code to show some of the ideas I am considering. Is there something better than what I wrote below?
DROP TABLE IF EXISTS MyTable;
CREATE TABLE MyTable
(C1 VARCHAR(10),
C2 VARCHAR(10),
C3 VARCHAR(10)
);
INSERT INTO MyTable
(C1,C2,C3)
VALUES
(NULL,NULL,NULL),
(NULL,NULL,3),
(NULL,2,3),
(1,2,3);
SELECT
HASHBYTES('SHA2_256',
CONCAT(C1,'-',
C2,'-',
C3)) AS HashbytesValueNoCastNoNullCheck,
HASHBYTES('SHA2_256',
CONCAT(CAST(C1 as varchar(max)),'-',
CAST(C2 as varchar(max)),'-',
CAST(C3 as varchar(max)))) AS HashbytesValueCastWithNoNullCheck,
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C2 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C3 as varchar(max)),'UNKNOWN'))) AS HashbytesValueWithCastWithNullCheck,
CONCAT(ISNULL(CAST(C1 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C2 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C3 as varchar(max)),'UNKNOWN')) AS StringValue,
CONCAT(C1,'-',C2,'-',C3) AS ConcatString,
C1,
C2,
C3
FROM
MyTable;
Given the situational reality, can I improve MERGE performance in any way here?
You should test, but storing a hash for every row, computing the hash for the new rows, and comparing based on the (key,hash) should be cheaper than comparing every column.
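A minimal sketch of that idea, with placeholder names (an Id key, a persisted RowHash column on STAGE, and a RowHash computed during the load of RAW; none of these are taken from your schema):

-- Placeholder table/column names; RowHash is assumed to be computed the same
-- way on both sides (e.g. HASHBYTES over the delimited, NULL-replaced columns).
MERGE StageTable AS t
USING RawTable AS s
    ON t.Id = s.Id
WHEN MATCHED AND t.RowHash <> s.RowHash THEN
    UPDATE SET C1 = s.C1,
               C2 = s.C2,        -- ...remaining columns
               RowHash = s.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, C1, C2, RowHash)
    VALUES (s.Id, s.C1, s.C2, s.RowHash);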
Should I use a checksum, or hashbytes, and why?
HASHBYTES has a much lower probability of missing a change. Roughly: with CHECKSUM you'll probably eventually miss a change or two; with HASHBYTES you probably won't ever miss a change. See the remarks here: BINARY_CHECKSUM.
Did I miss something with functions that might make this comparison faster or easier in the reading I have done?
No. There's no special way to compare multiple columns.
Is there something better than what I wrote below?
You definitely should replace NULLs, otherwise rows like (1, null, 'A') and (1, 'A', null) would get the same hash. Replace NULLs, and delimit the columns, with something that won't appear as an actual value in any column. And if you have Unicode text, converting to varchar may erase some changes, so it's safer to use nvarchar. E.g.:
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C2 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C3 as nvarchar(max)),N'~'))) AS HashbytesValueWithCastWithNullCheck
JSON in SQL Server is very fast. So you might try a pattern like:
select t.Id, z.RowJSON, hashbytes('SHA2_256', RowJSON) RowHash
from SomeTable t
cross apply (select t.* for json path) z(RowJSON)
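To use that for the comparison itself, you could compute the hash on both sides and join on the key (again, the table names here are placeholders):

-- Placeholder names: RawTable and StageTable both keyed on Id.
select r.Id
from RawTable r
cross apply (select r.* for json path) rj(RowJSON)
inner join StageTable s on s.Id = r.Id
cross apply (select s.* for json path) sj(RowJSON)
where hashbytes('SHA2_256', rj.RowJSON) <> hashbytes('SHA2_256', sj.RowJSON);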
I have SAP ASE 16.0 SP02 PL06 and would like to know if the data of a row has changed.
My use-case is that I have a table with the data and the same table with all previous statuses of all rows (it is a history of the "data evolution", if you like) plus some auditing columns (rowID, historizationDate). And I need to know if the last historized version and the current version of a given row differ.
First I was overjoyed by the HASH() function, until I found it takes only a single piece of data, e.g. one cell, a constant, or a combination of those. Then my idea moved to an ugly and dirty hack: concatenate all columns of the given row and compare those directly (this leads to a lot of convert(varchar, column), but no hashing).
Are there better solutions with respect to the constraints given below?
Constraints: I cannot alter the original table, the solution has to be as fast as reasonably possible (high throughput through the table, high concurrency, literally the heart of the whole database), and the source table has a 4-column primary key and high tens of columns overall. No sensitive data (like passwords) which would need hashing is present.
Drastically simplified structure of tables:
Original:
CREATE TABLE data (
dataID int,
column1 int,
column2 datetime,
...)
History:
CREATE TABLE dataHistory (
rowID int identity,
historizationDate datetime default getDate(),
dataID int,
column1 int,
column2 datetime,
...)
EDIT: As per @markp's comment: The table is not accessed directly by users, but through a stored procedure. So when all checking/preparation is done, the data is saved into the table. The problem is that after all the checking the new data can still be rejected. Not because of some invalidity/referential integrity/etc., but because the source of the data is deemed less reliable than the data already present. This checking is done through a few dozen procedures nested several layers deep, and for each column separately, so altering all those procedures to see if/what they did to the data is not a very viable solution... (Yes, the system is very well matured, having some 20 years of age.)
The final use of all those historized data is to see what data was present in the database at any given moment to draw some business conclusions from them (e.g. how often or how much are those data changed, if operational procedures are adhered to and similar uses).
First I was overjoyed by the HASH() function, until I found it takes only a single piece of data, e.g. one cell, a constant, or a combination of those. Then my idea moved to an ugly and dirty hack: concatenate all columns of the given row and compare those directly
SAP ASE also has a hashbytes() function that can do something like this:
select hashbytes('sha1',col1,col2,col3) from mytable
Note, in some versions of ASE 16, using 'md5' with hashbytes() causes major memory leakage. You can't use "*" for the column list.
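A rough sketch of the "did the row change since it was last historized" check with that function (only column1/column2 are shown, and the single dataID key stands in for your real 4-column key; NULL handling may still need the kind of care discussed above):

-- Rows in data whose current values differ from their latest historized version.
select d.dataID
from data d, dataHistory h
where h.dataID = d.dataID
  and h.historizationDate = (select max(h2.historizationDate)
                               from dataHistory h2
                              where h2.dataID = d.dataID)
  and hashbytes('sha1', d.column1, d.column2)
      <> hashbytes('sha1', h.column1, h.column2)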
How can I capture the time at which a record was added to the database, effortlessly? I am using this:
create table YourTable
(
Created datetime default getdate()
)
Any other alternatives?
I think that's the canonical approach; do you have a problem with it?
Other approaches would be using an insert trigger, which is probably slower and slightly more complex in that the code is in two places. Or you could channel all updates via an SP, which would also update the Created field - again that's slightly more complex and easy to circumvent unless your permissions are set carefully.
Still in the vein of using a default constraint... there are other values you can consider using -- different advantages to each (involving universal time, precision, etc.).
http://msdn.microsoft.com/en-us/library/ms188383.aspx
Also -- consider the size of your data type -- datetime is 8 bytes -- you could define the column as smalldatetime and bring that down to 4 bytes (or in 2008, just plain old date, which is 3 bytes -- though you might actually like knowing the time as well).
Triggers are also an option, but not preferable IMO -- for one thing, they can be rolled back if any constraints are violated (such as an external relationship to a table you just created while forgetting about the trigger -- oops!)
Options:

1. DEFAULT column (as you have)
2. INSERT TRIGGER that updates the column to the current_timestamp
Option #2 is more foolproof as it is always updated with GETDATE(). Using option #1 allows the user to manually override the Created date by specifying it in the INSERT clause.
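A minimal sketch of option #2 (the trigger name is made up, and an Id key column is assumed so the trigger can target the inserted rows):

-- Hypothetical AFTER INSERT trigger that stamps Created regardless of what the INSERT supplied.
CREATE TRIGGER trg_YourTable_SetCreated
ON YourTable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
    SET Created = GETDATE()
    FROM YourTable t
    INNER JOIN inserted i ON i.Id = t.Id;  -- assumes an Id key column
END;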
I have column in my table, say updateStamp. I'd like to get an approach to update that field with a new sequential number upon row update.
The database has lot of traffic, mostly read, but multiple concurrent updates could also happen in batches. Therefore the solution should cause minimal locks.
The reason for this requirement is that I need a solution that lets clients iterate over the table forward, and if a row is updated it should come up in the result set again.
So, query would then be like
SELECT *
FROM mytable
WHERE updateStamp > #lastReturnedUpdateStamp
ORDER BY updateStamp
Unfortunately timestamps do not work here because multiple updates could happen at the same time.
The timestamp (deprecated) or rowversion (current) data type is the only one I'm aware of that is updated on every write operation on the row.
It's not a time stamp per se - it doesn't store date, time in hours, seconds etc. - it's really more of a RowVersion (hence the name change) - a unique, ever-increasing number (binary) on the row.
It's typically used to check for any modifications between the time you have read the row, and the time you're going to update it.
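For example, the typical optimistic-concurrency check with it looks roughly like this (table, column, and variable names are assumptions):

-- @rv holds the rowversion value read together with the row earlier.
UPDATE mytable
SET col1 = @newValue
WHERE id = @id
  AND updateStamp = @rv;       -- 0 rows affected means someone else changed the row
IF @@ROWCOUNT = 0
    RAISERROR('Row was modified by another user.', 16, 1);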
Since it's not really date/time information, you will most likely have to have another column for that human-readable information. You can add a LastModified DATETIME column to your table with a DEFAULT GETDATE() constraint so it is populated on insert. To keep it up to date, you'll have to write an AFTER UPDATE trigger that sets the LastModified column whenever an update occurs.
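A minimal sketch of that combination (the table name mytable and its id key are assumptions):

-- Rowversion column for the "give me everything changed since X" query,
-- plus a human-readable LastModified maintained by a trigger.
ALTER TABLE mytable ADD updateStamp rowversion;
ALTER TABLE mytable ADD LastModified datetime NOT NULL
    CONSTRAINT DF_mytable_LastModified DEFAULT GETDATE();
GO
CREATE TRIGGER trg_mytable_LastModified
ON mytable
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
    SET LastModified = GETDATE()
    FROM mytable t
    INNER JOIN inserted i ON i.id = t.id;
END;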
SQL Server 2011 (a.k.a. "Denali") will bring us SEQUENCES, which would be the perfect fit in your case here - but alas, that's still at least a year from official release...
I have seen several patterns used to 'overcome' the lack of constants in SQL Server, but none of them seem to satisfy both performance and readability / maintainability concerns.
In the below example, assuming that we have an integral 'status' classification on our table, the options seem to be:
Just hard-code it, and possibly just 'comment' the status
-- StatusId 87 = Loaded
SELECT ... FROM [Table] WHERE StatusId = 87;
Using a lookup table for states, and then joining to this table so that the WHERE clause references the friendly name.
SubQuery:
SELECT ...
FROM [Table]
WHERE
StatusId = (SELECT StatusId FROM TableStatus WHERE StatusName = 'Loaded');
or joined
SELECT ...
FROM [Table] t INNER JOIN TableStatus ts On t.StatusId = ts.StatusId
WHERE ts.StatusName = 'Loaded';
A bunch of scalar UDF's defined which return constants, viz
CREATE Function LoadedStatus()
RETURNS INT
AS
BEGIN
RETURN 87
END;
and then
SELECT ... FROM [Table] WHERE StatusId = LoadedStatus();
(IMO this causes a lot of pollution in the database - this might be OK in an Oracle package wrapper)
And similar patterns with Table Valued Functions holding the constants with values as rows or columns, which are CROSS APPLIED back to [Table]
How have other SO users solved this common issue?
Edit: Bounty - Does anyone have a best practice method for maintaining $(variables) in DBProj DDL / schema scripts as per Remus' answer and comment?
Hard-coded. With SQL, performance trumps maintainability.
The consequences in the execution plan between using a constant that the optimizer can inspect at plan generation time vs. using any form of indirection (UDF, JOIN, sub-query) are often dramatic. SQL 'compilation' is an extraordinary process (in the sense that it is not 'ordinary' like, say, IL code generation) in that the result is determined not only by the language construct being compiled (i.e. the actual text of the query) but also by the data schema (existing indexes) and the actual data in those indexes (statistics). When a hard-coded value is used, the optimizer can give a better plan because it can actually check the value against the index statistics and get an estimate of the result.
Another consideration is that a SQL application is not code only, but by a large margin is code and data. 'Refactoring' a SQL program is ... different. Where in a C# program one can change a constant or enum, recompile and happily run the application, in SQL one cannot do so because the value is likely present in millions of records in the database and changing the constant value implies also changing GBs of data, often online while new operations occur.
Just because the value is hard-coded in the queries and procedures seen by the server does not necessarily mean the value has to be hard coded in the original project source code. There are various code generation tools that can take care of this. Consider something as trivial as leveraging the sqlcmd scripting variables:
defines.sql:
:setvar STATUS_LOADED 87
somesource.sql:
:r defines.sql
SELECT ... FROM [Table] WHERE StatusId = $(STATUS_LOADED);
someothersource.sql:
:r defines.sql
UPDATE [Table] SET StatusId = $(STATUS_LOADED) WHERE ...;
While I agree with Remus Rusanu, IMO maintainability of the code (and thus readability, least astonishment, etc.) trumps other concerns unless the performance difference is sufficiently significant to warrant doing otherwise. Thus, the following query loses on readability:
Select ..
From Table
Where StatusId = 87
In general, when I have system dependent values which will be referenced in code (perhaps mimicked in an enumeration by name), I use string primary keys for the tables in which they are kept. Contrast this to user-changeable data in which I generally use surrogate keys. The use of a primary key that requires entry helps (albeit not perfectly) to indicate to other developers that this value is not meant to be arbitrary.
Thus, my "Status" table would look like:
Create Table Status
(
Code varchar(6) Not Null Primary Key
, ...
)
Select ...
From Table
Where StatusCode = 'Loaded'
This makes the query more readable, it does not require a join to the Status table, and it does not require the use of a magic number (or guid). Using user-defined functions, IMO, is a bad practice. Beyond the performance implications, no developer would ever expect UDFs to be used in this manner, and thus it violates the principle of least astonishment. You would almost be compelled to have a UDF for each constant value; otherwise, what are you passing into the function: a name? a magic value? If a name, you might as well keep the name in a table and use it directly in the query. If a magic value, you are back to the original problem.
I have been using the scalar function option in our DB and it works fine; in my view it is the best solution.
If there are more values related to one item, then make a lookup table; for example, if you load a combobox or any other control with static values, then a lookup table is the best way to do this.
You can also add more fields to your status table that act as unique markers or groupers for status values.
For example, if you add an isLoaded field to your status table, record 87 could be the only one with the field's value set, and you can test for the value of the isLoaded field instead of the hard-coded 87 or status description.
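A minimal sketch of that idea (the isLoaded column is new; the value 87 and the TableStatus/StatusId names are carried over from the examples above):

-- Marker column on the status lookup table; only the 'Loaded' row has it set.
ALTER TABLE TableStatus ADD isLoaded bit NOT NULL DEFAULT 0;

UPDATE TableStatus SET isLoaded = 1 WHERE StatusId = 87;

-- Queries then test the marker instead of a hard-coded id or description.
SELECT t.*
FROM [Table] t
INNER JOIN TableStatus ts ON ts.StatusId = t.StatusId
WHERE ts.isLoaded = 1;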