I have SAP ASE 16.0 SP02 PL06 and would like to know whether the data of a row has changed.
My use case is that I have a table with the data and a copy of that table holding all previous states of all rows (a history of the "data evolution", if you like) plus some auditing columns (rowID, historizationDate). I need to know whether the last historized version and the current version of a given row differ.
At first I was overjoyed by the HASH() function, until I found it takes only a single piece of data, e.g. one cell, a constant or a combination of those. My idea then moved to an ugly and dirty hack: concatenate all columns of the given row and compare those directly (this leads to a lot of convert(varchar, column), but no hashing).
Is there a better solution with respect to the constraints given below?
Constraints: I cannot alter the original table, and the solution has to be as fast as reasonably possible (high throughput through the table, high concurrency, literally the heart of the whole database). The source table has a 4-column primary key and high tens of columns overall. No sensitive data (like passwords) that would need hashing is present.
Drastically simplified structure of tables:
Original:
CREATE TABLE data (
dataID int,
column1 int,
column2 datetime,
...)
History:
CREATE TABLE dataHistory (
rowID int identity,
historizationDate datetime default getDate(),
dataID int,
column1 int,
column2 datetime,
...)
EDIT: As per #markp's comment: The table is not accessed directly by users, but through a stored procedure, so when all checking/preparation is done the data are saved into the table. The problem is that after all the checking the new data can still be rejected, not because of some invalidity/referential-integrity issue, but because the source of the data is deemed less reliable than the data already present. This checking is done through a few dozen procedures nested several layers deep, and for each column separately, so altering all those procedures to see if/what they did to the data is not a very viable solution... (Yes, the system is very well matured, having some 20 years of age.)
The final use of all those historized data is to see what data were present in the database at any given moment, to draw business conclusions from them (e.g. how often or how much those data change, whether operational procedures are adhered to, and similar uses).
At first I was overjoyed by the HASH() function, until I found it takes only a single piece of data, e.g. one cell, a constant or a combination of those. My idea then moved to an ugly and dirty hack: concatenate all columns of the given row and compare those directly
SAP ASE also has a hashbytes() function that can do something like this:
select hashbytes('sha1',col1,col2,col3) from mytable
Note, in some versions of ASE 16, using 'md5' with hashbytes() causes major memory leakage. You can't use "*" for the column list.
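Applied to the history-comparison use case above, a minimal sketch might look like the following. It uses the drastically simplified structure from the question (a single-column key instead of the real 4-column one) and assumes the highest rowID identifies the most recent historized version; historizationDate could be used instead.
-- Sketch only: flag rows whose current values differ from their latest historized version.
-- The column list has to be spelled out, since "*" isn't accepted by hashbytes().
select d.dataID,
       case when hashbytes('sha1', d.column1, d.column2)
             <> hashbytes('sha1', h.column1, h.column2)
            then 1 else 0
       end as row_changed
from data d
join dataHistory h on h.dataID = d.dataID
where h.rowID = (select max(h2.rowID)
                 from dataHistory h2
                 where h2.dataID = d.dataID)
Whether this is fast enough under the stated concurrency constraints would need testing against the real table.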
Related
Our application shows near-real-time IoT data (up to 5 minute intervals) for our customers' remote equipment.
The original pilot project stores every device reading for all time, in a simple "Measurements" table on a SQL Server 2008 database.
The table looks something like this:
Measurements: (DeviceId, Property, Value, DateTime).
Within a year or two, there will be maybe 100,000 records in the table per device, with the queries typically falling into two categories:
"Device latest value" (95% of queries): looking at the latest value only
"Device daily snapshot" (5% of queries): looking at a single representative value for each day
We are now expanding to 5000 devices. The Measurements table is small now, but will quickly get to half a billion records or so, for just those 5000 devices.
The application is very read-intensive, with frequently-run queries looking at the "Device latest values" in particular.
[EDIT #1: To make it less opinion-based]
What database design techniques can we use to optimise for fast reads of the "latest" IoT values, given a big table with years worth of "historic" IoT values?
One suggestion from our team was to store MeasurementLatest and MeasurementHistory as two separate tables.
[EDIT #2: In response to feedback]
In our test database, seeded with 50 million records, and with the following index applied:
CREATE NONCLUSTERED INDEX [IX_Measurement_DeviceId_DateTime] ON Measurements (DeviceId ASC, DateTime DESC)
a typical "get device latest values" query (e.g. below) still takes more than 4,000 ms to execute, which is way too slow for our needs:
SELECT DeviceId, Property, Value, DateTime
FROM Measurements m
WHERE m.DateTime = (
    SELECT MAX(DateTime)
    FROM Measurements m2
    WHERE m2.DeviceId = m.DeviceId
)
This is a very broad question - and as such, it's unlikely you'll get a definitive answer.
However, I have been in a similar situation, and I'll run through my thinking and eventual approach. In summary, though: I did option B, but in a way that mirrors option A, by using a filtered index to 'mimic' the separate smaller table.
My original thinking (option A) was to have two tables: one with the 'latest data only' for most reporting, and one with all historical values. The alternative (option B) was also two tables: one with all records, and one with just the latest.
Either way, when inserting a new row you would therefore typically need to update at least two rows, if not more (depending on how it's stored).
Instead, I went for a slightly different route:
Put all the data into one table
On that one table, add a new column 'Latest_Flag' (bit, NOT NULL, DEFAULT 1). If it's 1 then it's the latest value; otherwise it's historical
Have a filtered index on the table that has all columns (with appropriate column order) and filter of Latest_Flag = 1
This filtered index is similar to a second copy of the table with just the latest rows only
The insert process therefore has two steps in a transaction:
'Unflag' the last Latest_Flag for that device, etc
Insert the new row
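As a rough illustration of those two steps (names follow the Measurements example from the question; @DeviceId, @Property, @Value and @DateTime are assumed parameters of whatever insert procedure wraps this):
-- Sketch of the two-step insert inside one transaction.
BEGIN TRANSACTION;

    -- Step 1: unflag the previous 'latest' row for this device/property.
    UPDATE Measurements
    SET    Latest_Flag = 0
    WHERE  DeviceId = @DeviceId
      AND  Property = @Property
      AND  Latest_Flag = 1;

    -- Step 2: insert the new reading; Latest_Flag defaults to 1.
    INSERT INTO Measurements (DeviceId, Property, Value, DateTime)
    VALUES (@DeviceId, @Property, @Value, @DateTime);

COMMIT TRANSACTION;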
It still makes the writes a bit slower (as it needs to do several row updates as well as index updates) but fundamentally it does the pre-calculation for later reads.
When reading from the table, however, you need to then specify WHERE Latest_Flag = 1. Alternatively, you may want to put it into a view or similar.
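Such a view could be as simple as the following (the view name is just illustrative):
-- Readers query this view instead of having to remember the Latest_Flag predicate.
CREATE VIEW MeasurementsLatest
AS
SELECT DeviceId, Property, Value, DateTime
FROM   Measurements
WHERE  Latest_Flag = 1;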
For the filtered index, it may be something like
CREATE INDEX ix_measurements_deviceproperty_latest
ON Measurements (DeviceId, Property)
INCLUDE (Value, DateTime, Latest_Flag)
WHERE (Latest_Flag = 1)
Note: another version of this can be done with a trigger, e.g. when a new row is inserted, the trigger invalidates (sets Latest_Flag = 0) any previous rows. That means you don't need the two-step inserts, but you do then rely on business/processing logic living inside triggers.
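A hedged sketch of that trigger variant, again using the example names from above (an AFTER INSERT trigger that unflags older rows for the same device/property):
-- Sketch only: after each insert, mark older rows for the same
-- DeviceId/Property combination as no longer being the latest.
CREATE TRIGGER trg_Measurements_SetLatest
ON Measurements
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE m
    SET    m.Latest_Flag = 0
    FROM   Measurements m
    JOIN   inserted i
           ON  i.DeviceId = m.DeviceId
           AND i.Property = m.Property
    WHERE  m.Latest_Flag = 1
      AND  m.DateTime < i.DateTime;
END;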
Today I'm trying to tune the performance of an audit database. I have a legal reason for tracking changes to rows, and I've implemented a set of tables using the System Versioned tables method in SQL Server 2016.
My overall process lands "RAW" data into an initial table from a source system. From here, I then have a MERGE process that takes data from the RAW table and compares every column in the RAW table to what exists in the audit-able system versioned staging table and decides what has changed. System row versioning then tells me what has changed and what hasn't.
The trouble with this approach is that my tables are very wide. Some of them have 400 columns or more. Even tables that have 450,000 records take SQL Server about 17 minutes to perform a MERGE operation. It's really slowing down the performance of our solution and it seems it would help things greatly if we could speed it up. We presently have hundreds of tables we need to do this for.
At the moment both the RAW and STAGE tables are indexed on an ID column.
I've read in several places that we might consider using a CHECKSUM or HASHBYTES function to record a value in the RAW extract. (What would you call this? A GUID? A UUID? A hash?) We'd then compare the calculated value to what exists in the STAGE table. But here's the rub: there are often quite a few NULL values across many columns. It's been suggested that we cast all the column types to be the same (nvarchar(max)?), because NULL values seem to cause the entire computation of the checksum to fall flat. So I'm also coding lots of ISNULL(,'UNKNOWN') statements into my code too.
So, are there better methods for improving the performance of the merge here? I thought that I could use a row-updated timestamp column as a single value to compare instead of the checksum, but I am not certain that would pass legal muster/scrutiny. Legal is concerned that rows may be edited outside of an interface and the column wouldn't always be updated. I've seen approaches where developers use a concatenate function (shown below) to combine many column values together. This seems code-intensive and expensive to compute and to cast columns for.
So my questions are:
Given the situational reality, can I improve MERGE performance in any way here?
Should I use a checksum, or hashbytes, and why?
Which hashbytes method makes the most sense here? (I'm only comparing one RAW row to another STAGE row based on an ID match, right?)
Did I miss something with functions that might make this comparison faster or easier in the reading I have done? It seems odd there aren't better functions besides CONCAT available to do this in SQL Server.
I wrote the below code to show some of the ideas I am considering. Is there something better than what I wrote below?
DROP TABLE IF EXISTS MyTable;
CREATE TABLE MyTable
(C1 VARCHAR(10),
C2 VARCHAR(10),
C3 VARCHAR(10)
);
INSERT INTO MyTable
(C1,C2,C3)
VALUES
(NULL,NULL,NULL),
(NULL,NULL,3),
(NULL,2,3),
(1,2,3);
SELECT
HASHBYTES('SHA2_256',
CONCAT(C1,'-',
C2,'-',
C3)) AS HashbytesValueNoCastNoNullCheck,
HASHBYTES('SHA2_256',
CONCAT(CAST(C1 as varchar(max)),'-',
CAST(C2 as varchar(max)),'-',
CAST(C3 as varchar(max)))) AS HashbytesValueCastWithNoNullCheck,
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C2 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C3 as varchar(max)),'UNKNOWN'))) AS HashbytesValueWithCastWithNullCheck,
CONCAT(ISNULL(CAST(C1 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C2 as varchar(max)),'UNKNOWN'),'-',
ISNULL(CAST(C3 as varchar(max)),'UNKNOWN')) AS StringValue,
CONCAT(C1,'-',C2,'-',C3) AS ConcatString,
C1,
C2,
C3
FROM
MyTable;
Given the situational reality, can I improve MERGE performance in any way here?
You should test, but storing a hash for every row, computing the hash for the new rows, and comparing based on the (key,hash) should be cheaper than comparing every column.
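A minimal sketch of that shape, with assumed table names (raw, stage), an ID key, and a RowHash column already populated on both sides (none of these are from your actual schema):
-- Sketch: drive the MERGE off (key, hash) instead of comparing every column.
MERGE stage AS t
USING raw AS s
    ON t.ID = s.ID
WHEN MATCHED AND t.RowHash <> s.RowHash THEN
    UPDATE SET t.Col1    = s.Col1,
               t.Col2    = s.Col2,   -- ...and so on for the remaining columns
               t.RowHash = s.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, Col1, Col2, RowHash)
    VALUES (s.ID, s.Col1, s.Col2, s.RowHash);
Computing RowHash once while loading the RAW table (or as a persisted computed column) would keep the hashing cost out of the MERGE itself.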
Should I use a checksum, or hashbytes, and why?
HASHBYTES has a much lower probability of missing a change. Roughly, with CHECKSUM you'll probably eventually miss a change or two, with HASHBYTES you probably won't ever miss a change. See remarks here: BINARY_CHECKSUM.
Did I miss something with functions that might make this comparison faster or easier in the reading I have done?
No. There's no special way to compare multiple columns.
Is there something better than what I wrote below?
You definitely should replace nulls, else the rows (1,null,'A') and (1,'A',null) would get the same hash. You should also replace nulls, and delimit, with something that won't appear as a value in any column. And if you have Unicode text, converting to varchar may erase some changes, so it's safer to use nvarchar, e.g.:
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C2 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C3 as nvarchar(max)),N'~'))) AS HashbytesValueWithCastWithNullCheck
JSON in SQL Server is very fast. So you might try a pattern like:
select t.Id, z.RowJSON, hashbytes('SHA2_256', RowJSON) RowHash
from SomeTable t
cross apply (select t.* for json path) z(RowJSON)
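If that pattern tests well for you, the RAW-versus-STAGE comparison could then be a join on the key plus a hash mismatch, along these lines (the raw/stage names and Id key are assumptions, not your schema):
-- Sketch: find RAW rows whose JSON-based hash differs from the STAGE version.
-- Note: on the system-versioned STAGE table, s.* also picks up the period columns
-- (unless they are HIDDEN), so an explicit column list is probably needed there.
SELECT r.Id
FROM   raw r
CROSS APPLY (SELECT r.* FOR JSON PATH) rj(RowJSON)
JOIN   stage s
       ON s.Id = r.Id
CROSS APPLY (SELECT s.* FOR JSON PATH) sj(RowJSON)
WHERE  HASHBYTES('SHA2_256', rj.RowJSON) <> HASHBYTES('SHA2_256', sj.RowJSON);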
Many times I need to move the data of a large table (let's call it source) to a clone of it (let's call it target). Due to the large size, instead of just deleting/inserting all, I prefer to upsert.
For easiness, let's assume an int PK col named "id".
Until now, in order to do this, I've used the datetime field dbupddate, which exists on both tables and holds the most recent time the row was inserted/updated. It is maintained by a trigger which, for any insert/update, sets dbupddate to getdate().
Thus, my run-of-the-mill upsert code until now looks something like:
update t set col1 = s.col1, col2 = s.col2 /* etc. */
from source s
inner join target t on s.id=t.id and s.dbupddate>t.dbupddate
insert target
select * from source s
where not exists (select 1 from target t where t.id=s.id)
Recently I stumbled on rowversion. I have read and understood up to an extent its function, but I'd like to know practically what benefits/drawbacks there are in case I change dbupddate to rowversion instead of datetime.
Although datetime carries information that may be useful in some cases, rowversion is more reliable, since the system datetime is always at risk of being changed and of losing accuracy.
In your case, I personally prefer rowversion for its reliability.
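For completeness, a minimal sketch of what adopting it could look like (the rv/src_rv names are just illustrative): the source gets a real rowversion column, while the target stores a copy of the source's value in a plain binary(8) column, because a rowversion column is auto-generated and can't be written to directly.
-- Source: the value is regenerated automatically on every insert/update.
ALTER TABLE source ADD rv rowversion;

-- Target: plain binary(8) to hold the last-seen rowversion of the source row.
-- (Existing rows would need this backfilled once from the current source values.)
ALTER TABLE target ADD src_rv binary(8);

-- The update half of the upsert then compares the stored copy instead of dbupddate:
update t set col1 = s.col1, col2 = s.col2, /* etc. */ src_rv = s.rv
from source s
inner join target t on s.id = t.id and s.rv > t.src_rv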
I have a lot (over a thousand places) of legacy T-SQL code that only makes INSERTs into a varchar(8000) column in a utility table. Our needs have changed and now that column needs to be able to handle larger values. As a result I need to make that column varchar(max). This is just a plain data column where there are no searches preformed on it, no index on it, only one procedure reads it, it is INSERT and forget for the application (almost like a log entry).
I plan on making changes in only a few places that will actually generate the larger data, and in the single stored procedure that processes this column.
Are there any hidden pitfalls changing a column from varchar(8000) to a varchar(max)?
Will all the T-SQL string functions work the same: LEN(), RTRIM(), SUBSTRING(), etc.?
Can anyone imagine any reason why I'd have to make any changes to the code that thinks the column is still varchar(8000)?
All MAX types have a small performance penalty, see Performance comparison of varchar(max) vs. varchar(N).
If your maintenance includes online operations (online index rebuilds), you will lose the ability to do them. Online operations are not supported for tables with LOB columns:
Clustered indexes must be created, rebuilt, or dropped offline when the underlying table contains large object (LOB) data types: image, ntext, text, varchar(max), nvarchar(max), varbinary(max), and xml.
Nonunique nonclustered indexes can be created online when the table contains LOB data types but none of these columns are used in the index definition as either key or nonkey (included) columns. Nonclustered indexes defined with LOB data type columns must be created or rebuilt offline.
The performance penalty is really small, so I wouldn't worry about it. The loss of ability to do online rebuilds may be problematic for really hot must-be-online operations tables. Unless online operations are a must, I'd vote to go for it and change it to MAX.
Crystal Reports 12 (and other versions, as far as I know) doesn't handle varchar(max) properly and interprets it as varchar(255) which leads to truncated data in reports.
So if you're using Crystal Reports, that's a disadvantage to varchar(max). Or a disadvantage to using Crystal, to be precise.
See:
http://www.crystalreportsbook.com/Forum/forum_posts.asp?TID=5843&PID=17503
http://michaeltbeeitprof.blogspot.com/2010/05/crystal-xi-and-varcharmax-aka-memo.html
If you genuinely don't need indexes on it and it is a large column, you should be fine. varchar(max) appears to be exactly what you need, and you will have fewer problems with existing code than you would if you used text.
Make sure to test any updates where text is added to the existing text. It should work using regular concatenation, but I'd want to be able to prove it.
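A quick throwaway test along those lines (the table and column names here are made up for the example) is to push a value past the old 8000-character limit via concatenation and check the length:
-- Throwaway test: confirm concatenation keeps working past 8000 characters.
CREATE TABLE dbo.UtilityLogTest (Msg varchar(max));

INSERT INTO dbo.UtilityLogTest (Msg) VALUES (REPLICATE('x', 8000));

-- Without the CAST, REPLICATE of a non-max varchar input is itself capped at 8000.
UPDATE dbo.UtilityLogTest
SET    Msg = Msg + REPLICATE(CAST('y' AS varchar(max)), 5000);

SELECT LEN(Msg) AS MsgLength FROM dbo.UtilityLogTest;  -- expect 13000

DROP TABLE dbo.UtilityLogTest;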
I have been dancing around this issue for a while, but it keeps coming up. We have a system where many of our tables start with a description that is originally stored as an NVARCHAR(150), and then we get a ticket asking to expand the field size to 250, then 1000, etc., etc...
This cycle is repeated on every "note" and/or "description" field we add to most tables. Of course the concern for me is performance and breaking the 8k limit of the page. However, my other concern is making the system less maintainable by breaking these fields out of EVERY table in the system into a lazy-loaded reference.
So here I am, faced with the same two options that have been staring me in the face (others are welcome); please lend me your opinions.
Change all my notes and/or descriptions to NVARCHAR(MAX) and make sure we exclude these fields from all listings. Basically, never do a SELECT * FROM [TableName] unless it is only retrieving one record.
Remove all notes and/or description fields and replace them with a foreign key reference to a [Notes] table.
CREATE TABLE [dbo].[Notes] (
[NoteId] [int] NOT NULL,
[NoteText] [NVARCHAR](MAX) NOT NULL )
Obviously I would prefer to use option 1, because going with option 2 will change so much in our system. However, if option 2 is really the only good way to proceed, then at least I can say these changes are necessary and that I have done the homework.
UPDATE:
I ran several tests on a sample database with 100,000 records in it. What I found is that, because of clustered index scans, the IO required for option 1 is "roughly" twice that of option 2. If I select a large number of records (1000 or more), option 1 is twice as slow even if I do not include the large text field in the select. As I request fewer rows, the lines blur more. This is a web app where page sizes of 50 or so are the norm, so option 1 will work, but I will be converting all instances to option 2 in the (very) near future for scalability.
Option 2 is better for several reasons:
When querying your tables, the large text fields fill up pages quickly, forcing the database to scan more pages to retrieve data. This is especially taxing when you don't actually need to return the text data.
As you mentioned, it gives you a clean break to change the data type in one swoop. Microsoft has deprecated TEXT in SQL Server 2008, so you should stick with VARCHAR/VARBINARY.
Separate filegroups. Having all your text data in a slower, cheaper storage location might be something you decide to pursue in the future. If not, no harm, no foul.
While Option 1 is easier for now, Option 2 will give you more flexibility in the long-term. My suggestion would be to implement a simple proof-of-concept with the "notes" information separated from the main table and perform some of your queries on both examples. Compare the execution plans, client statistics and logical I/O reads (SET STATISTICS IO ON) for some of your queries against these tables.
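A minimal way to run that comparison, with placeholder table names standing in for the two proof-of-concept shapes:
-- Illustrative only: *_InlineNotes holds the text inline (option 1),
-- *_SplitNotes keeps only a key to a separate Notes table (option 2).
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Option 1 shape: wide rows, text column excluded from the select list.
SELECT TOP (50) Id, Title
FROM   dbo.Widget_InlineNotes
ORDER  BY Id;

-- Option 2 shape: narrow rows, notes not touched at all.
SELECT TOP (50) Id, Title
FROM   dbo.Widget_SplitNotes
ORDER  BY Id;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
Compare the "logical reads" reported for each statement alongside the execution plans.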
A quick note to those suggesting the use of a TEXT/NTEXT from MSDN:
This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature. Use varchar(max), nvarchar(max) and varbinary(max) data types instead. For more information, see Using Large-Value Data Types.
I'd go with Option 2.
You can create a view that joins the two tables to make the transition easier on everyone, and then go through a clean-up process that removes the view and uses the single table wherever possible.
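A minimal sketch of such a transition view, assuming the main table keeps a NoteId column referencing [Notes] (the table and column names here are illustrative):
-- Legacy readers keep selecting the old shape from the view;
-- new code works against the base tables directly.
CREATE VIEW dbo.MyTableWithNotes
AS
SELECT t.*, n.NoteText
FROM dbo.MyTable t
LEFT JOIN dbo.Notes n ON n.NoteId = t.NoteId;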
You want to use a TEXT field. TEXT fields aren't stored directly in the row; instead, the row stores a pointer to the text data. This is transparent to queries, though - if you ask for a TEXT field, it will return the actual text, not the pointer.
Essentially, using a TEXT field is somewhat between your two solutions. It keeps your table rows much smaller than using a varchar, but you'll still want to avoid asking for them in your queries if possible.
The TEXT/NTEXT data type has practically unlimited length while taking up next to nothing in your record.
It comes with a few strings attached, like special behavior with string functions, but for a secondary "notes/description" type of field these may be less of a problem.
Just to expand on Option 2
You could:
Rename existing MyTable to MyTable_V2
Move the Notes column into a joined Notes table (with 1:1 joining ID)
Create a VIEW called MyTable that joins MyTable_V2 and Notes tables
Create an INSTEAD OF trigger on the MyTable view which saves the Notes column into the Notes table (if NULL then delete any existing Notes row; if NOT NULL then insert if not found, otherwise update) and performs the appropriate action on the MyTable_V2 table (see the sketch below)
Note: We've had trouble doing this where there is a Computed column in MyTable_V2 (I think that was the problem, either way we've hit snags when doing this with "unusual" tables)
All new Insert/Update/Delete code should be written to operate directly on MyTable_V2 and Notes tables
Optionally: Have the INSTEAD OF trigger on MyTable log the fact that it was called (it can do this minimally, e.g. UPDATE a pre-existing log table row with GetDate() only if the existing row's date is more than 24 hours old, so it will only do an update once a day).
When you are no longer getting any log records you can drop the INSTEAD OF trigger on MyTable view and you are now fully MyTable_V2 compliant!
Huge amount of hassle to implement, as you surmised.
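To make those steps a little more concrete, here is a heavily simplified sketch. It assumes a single Notes column, a 1:1 relationship where Notes.NoteId equals MyTable_V2.Id, hypothetical Id/SomeColumn columns, and shows only the UPDATE path; INSERT and DELETE would need their own INSTEAD OF triggers.
-- The view presents the old shape; the trigger routes writes to the two base tables.
CREATE VIEW dbo.MyTable
AS
SELECT v.Id, v.SomeColumn, n.NoteText AS Notes
FROM dbo.MyTable_V2 v
LEFT JOIN dbo.Notes n ON n.NoteId = v.Id;
GO

CREATE TRIGGER dbo.trg_MyTable_InsteadOfUpdate
ON dbo.MyTable
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Non-note columns go to the main table.
    UPDATE v
    SET    v.SomeColumn = i.SomeColumn
    FROM   dbo.MyTable_V2 v
    JOIN   inserted i ON i.Id = v.Id;

    -- Notes column: delete when NULL, otherwise update or insert into dbo.Notes.
    DELETE n
    FROM   dbo.Notes n
    JOIN   inserted i ON i.Id = n.NoteId
    WHERE  i.Notes IS NULL;

    UPDATE n
    SET    n.NoteText = i.Notes
    FROM   dbo.Notes n
    JOIN   inserted i ON i.Id = n.NoteId
    WHERE  i.Notes IS NOT NULL;

    INSERT INTO dbo.Notes (NoteId, NoteText)
    SELECT i.Id, i.Notes
    FROM   inserted i
    WHERE  i.Notes IS NOT NULL
      AND  NOT EXISTS (SELECT 1 FROM dbo.Notes n WHERE n.NoteId = i.Id);
END;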
Alternatively, trawl the code for all references to MyTable and change them to MyTable_V2, put a VIEW in place of MyTable for SELECT only, and don't create the INSTEAD OF trigger.
My plan would be to fix all Insert/Update/Delete statements referencing the now-deprecated MyTable. For me this would be made somewhat easier because we use unique names for all tables and columns in the database, and we use the same names in all application code, so confidence that a simple FIND had located every instance would be high.
P.S. Option 2 is also preferable if you have any SELECT * lying around. We have had clients whose application performance went downhill fast when they added large Text/Blob columns to existing tables, because of "lazy" SELECT * statements. Hopefully that isn't the case in your shop, though!