I have a column in my table, say updateStamp. I'd like an approach that updates that field with a new sequential number whenever the row is updated.
The database has a lot of traffic, mostly reads, but multiple concurrent updates can also happen in batches. The solution should therefore cause minimal locking.
The reason for this requirement is that clients need to iterate forwards over the table, and if a row is updated it should appear in the result set again.
So the query would then look like:
SELECT *
FROM mytable
WHERE updateStamp > @lastReturnedUpdateStamp
ORDER BY updateStamp
Unfortunately, timestamps do not work here because multiple updates could happen at the same time.
The timestamp (deprecated) or rowversion (current) data type is the only one I'm aware of that is updated on every write operation on the row.
It's not a time stamp per se - it doesn't store date, time in hours, seconds etc. - it's really more of a RowVersion (hence the name change) - a unique, ever-increasing number (binary) on the row.
It's typically used to check for any modifications between the time you have read the row, and the time you're going to update it.
Since it's not really date/time information, you will most likely need another column for that human-readable information. You can add a LastModified DATETIME column to your table and, with a DEFAULT GETDATE() constraint, have a value set on insertion. To keep it up to date, you'll have to write an AFTER UPDATE trigger that updates the LastModified column whenever an update occurs.
SQL Server 2011 (a.k.a. "Denali") will bring us SEQUENCES, which would be the perfect fit in your case here - but alas, that's still at least a year from official release.
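To make the polling pattern from the question concrete, here is a minimal sketch in Python using SQLite, purely as a stand-in. SQLite has no rowversion type, so a counter table plus triggers imitate the "new number on every write" behaviour. The stamp_counter table, the trigger names and the fetch_changes helper are all assumptions made for this illustration, and a single counter row like this would serialize writers, so it shows the semantics rather than a low-locking design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical counter table standing in for the database-wide rowversion.
CREATE TABLE stamp_counter (value INTEGER NOT NULL);
INSERT INTO stamp_counter (value) VALUES (0);

CREATE TABLE mytable (
    id          INTEGER PRIMARY KEY,
    payload     TEXT NOT NULL,
    updateStamp INTEGER NOT NULL DEFAULT 0
);

-- Give every new row the next stamp.
CREATE TRIGGER mytable_stamp_insert AFTER INSERT ON mytable
BEGIN
    UPDATE stamp_counter SET value = value + 1;
    UPDATE mytable SET updateStamp = (SELECT value FROM stamp_counter)
    WHERE id = NEW.id;
END;

-- Give a row a fresh stamp whenever its payload is written.
CREATE TRIGGER mytable_stamp_update AFTER UPDATE OF payload ON mytable
BEGIN
    UPDATE stamp_counter SET value = value + 1;
    UPDATE mytable SET updateStamp = (SELECT value FROM stamp_counter)
    WHERE id = NEW.id;
END;
""")


def fetch_changes(last_stamp):
    """Return rows stamped after last_stamp, oldest change first."""
    return conn.execute(
        "SELECT id, payload, updateStamp FROM mytable "
        "WHERE updateStamp > ? ORDER BY updateStamp",
        (last_stamp,),
    ).fetchall()


conn.executemany("INSERT INTO mytable (payload) VALUES (?)",
                 [("a",), ("b",), ("c",)])
first_pass = fetch_changes(0)            # all three rows
last_seen = first_pass[-1][2]            # highest stamp returned so far

conn.execute("UPDATE mytable SET payload = 'b2' WHERE id = 2")
second_pass = fetch_changes(last_seen)   # only the updated row comes back
```

On SQL Server the rowversion column does the stamping for you, so only the fetch_changes part of the sketch would carry over.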
Our application shows near-real-time IoT data (up to 5 minute intervals) for our customers' remote equipment.
The original pilot project stores every device reading for all time, in a simple "Measurements" table on a SQL Server 2008 database.
The table looks something like this:
Measurements: (DeviceId, Property, Value, DateTime).
Within a year or two, there will be maybe 100,000 records in the table per device, with the queries typically falling into two categories:
"Device latest value" (95% of queries): looking at the latest value only
"Device daily snapshot" (5% of queries): looking at a single representative value for each day
We are now expanding to 5000 devices. The Measurements table is small now, but will quickly get to half a billion records or so, for just those 5000 devices.
The application is very read-intensive, with frequently-run queries looking at the "Device latest values" in particular.
[EDIT #1: To make it less opinion-based]
What database design techniques can we use to optimise for fast reads of the "latest" IoT values, given a big table with years' worth of "historic" IoT values?
One suggestion from our team was to store MeasurementLatest and MeasurementHistory as two separate tables.
[EDIT #2: In response to feedback]
In our test database, seeded with 50 million records, and with the following index applied:
CREATE NONCLUSTERED INDEX [IX_Measurement_DeviceId_DateTime] ON Measurements (DeviceId ASC, DateTime DESC)
a typical "get device latest values" query (e.g. below) still takes more than 4,000 ms to execute, which is way too slow for our needs:
SELECT DeviceId, Property, Value, DateTime
FROM Measurements m
WHERE m.DateTime = (
SELECT MAX(DateTime)
FROM Measurements m2
WHERE m2.DeviceId = m.DeviceId)
This is a very broad question - and as such, it's unlikely you'll get a definitive answer.
However, I have been in a similar situation, and I'll run through my thinking and eventual approach. In summary though - I did option B but in a way to mirror option A: I used a filtered index to 'mimic' the separate smaller table.
My original thinking was to have two tables - one with the 'latest data only' for most reporting, and one with all historical values. An alternative was to have two tables - one with all records, and one with just the latest.
When inserting a new row, it would typically need to therefore update at least two rows, if not more (depending on how it's stored).
Instead, I went for a slightly different route:
Put all the data into one table
On that one table, add a new column 'Latest_Flag' (bit, NOT NULL, DEFAULT 1). If it's 1 then it's the latest value; otherwise it's historical
Have a filtered index on the table that has all columns (with appropriate column order) and filter of Latest_Flag = 1
This filtered index is similar to a second copy of the table with just the latest rows only
The insert process therefore has two steps in a transaction
'Unflag' the last Latest_Flag for that device, etc
Insert the new row
It still makes the writes a bit slower (as it needs to do several row updates as well as index updates) but fundamentally it does the pre-calculation for later reads.
When reading from the table, however, you need to then specify WHERE Latest_Flag = 1. Alternatively, you may want to put it into a view or similar.
For the filtered index, it may be something like
CREATE INDEX ix_measurements_deviceproperty_latest
ON Measurements (DeviceId, Property)
INCLUDE (Value, DateTime, Latest_Flag)
WHERE (Latest_Flag = 1)
Note - another version of this can be done in a trigger e.g., when inserting a new row, it invalidates (sets Latest_Flag = 0) any previous rows. It means you don't need to do the two-step inserts; but you do then rely on business/processing logic being within triggers.
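As a rough illustration of the two-step insert and the filtered index, here is a sketch in Python against SQLite, which supports partial indexes (its closest equivalent of a filtered index) but has no INCLUDE clause. The insert_measurement and latest_values helpers are names I made up for the example, and the sample readings are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Measurements (
    DeviceId    INTEGER NOT NULL,
    Property    TEXT    NOT NULL,
    Value       REAL    NOT NULL,
    DateTime    TEXT    NOT NULL,
    Latest_Flag INTEGER NOT NULL DEFAULT 1
);

-- Partial index: only rows flagged as latest are indexed.
CREATE INDEX ix_measurements_deviceproperty_latest
    ON Measurements (DeviceId, Property)
    WHERE Latest_Flag = 1;
""")


def insert_measurement(device_id, prop, value, taken_at):
    """Unflag the previous latest row, then insert the new one, in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE Measurements SET Latest_Flag = 0 "
            "WHERE DeviceId = ? AND Property = ? AND Latest_Flag = 1",
            (device_id, prop),
        )
        conn.execute(
            "INSERT INTO Measurements (DeviceId, Property, Value, DateTime) "
            "VALUES (?, ?, ?, ?)",
            (device_id, prop, value, taken_at),
        )


def latest_values(device_id):
    """Read only the rows flagged as latest for one device."""
    return conn.execute(
        "SELECT Property, Value, DateTime FROM Measurements "
        "WHERE DeviceId = ? AND Latest_Flag = 1 ORDER BY Property",
        (device_id,),
    ).fetchall()


insert_measurement(1, "temp", 20.5, "2024-01-01T00:00")
insert_measurement(1, "temp", 21.0, "2024-01-01T00:05")
insert_measurement(1, "humidity", 40.0, "2024-01-01T00:05")

latest = latest_values(1)
total_rows = conn.execute("SELECT COUNT(*) FROM Measurements").fetchone()[0]
```

All three readings stay in the table, but the read only touches the one flagged row per device and property.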
If an ETL process attempts to detect data changes on system-versioned tables in SQL Server by including rows as defined by a rowversion column to be within a rowversion "delta window", e.g.:
where row_version >= @previous_etl_cycle_rowversion
and row_version < @current_etl_cycle_rowversion
.. and the values for @previous_etl_cycle_rowversion and @current_etl_cycle_rowversion are selected from a logging table whose newest rowversion gets appended to said logging table at the start of each ETL cycle via:
insert into etl_cycle_logged_rowversion_marker (cycle_start_row_version)
select @@DBTS
... is it possible that a rowversion of a record falling within a given "delta window" (bounded by the 2 @@DBTS values) could be missed/skipped due to rowversion's behavior vis-à-vis transactional consistency? - i.e., is it possible that rowversion would be reflected on a basis of "eventual" consistency?
I'm thinking of a case where say, 1000 records are updated within a single transaction and somehow @@DBTS is "ahead" of the record's committed rowversion yet that specific version of the record is not yet readable...
(For the sake of scoping the question, please exclude any cases of deleted records or immediately consecutive updates on a given record within such a large batch transaction.)
If you make sure to avoid row versioning for the queries that read the change windows you shouldn't miss many rows. With READ COMMITTED SNAPSHOT or SNAPSHOT ISOLATION an updated but uncommitted row would not appear in your query.
But you can also miss rows that got updated after you query @@DBTS. That's not such a big deal usually as they'll be in the next window. But if you have a row that is constantly updated you may miss it for a long time.
But why use rowversion? If these are temporal tables you can query the history table directly. And Change Tracking is better and easier than using rowversion, as it tracks deletes and optionally column changes. The feature was literally built to replace the need to do this manually, which:
usually involved a lot of work and frequently involved using a
combination of triggers, timestamp columns, new tables to store
tracking information, and custom cleanup processes.
Under SNAPSHOT isolation, it turns out the proper function to inspect rowversion which will ensure contiguous delta windows while not skipping rowversion values attached to long-running transactions is MIN_ACTIVE_ROWVERSION() rather than @@DBTS.
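To see why the choice of marker matters, here is a toy model in plain Python. It is not SQL Server: the ToyDb class is a made-up simulation in which a rowversion is handed out when a row is written, readers only see committed rows (roughly what a snapshot-based reader sees), dbts() mimics @@DBTS as "last value used", and min_active_rowversion() mimics MIN_ACTIVE_ROWVERSION() as "lowest value held by an open transaction, else last used + 1".

```python
class ToyDb:
    """Toy stand-in for a database with a global rowversion counter."""

    def __init__(self):
        self.last_used = 0   # models @@DBTS
        self.committed = {}  # row key -> rowversion, visible to readers
        self.pending = {}    # transaction id -> {row key: rowversion}

    def write(self, txn, key):
        # The rowversion is assigned at write time, before commit.
        self.last_used += 1
        self.pending.setdefault(txn, {})[key] = self.last_used

    def commit(self, txn):
        self.committed.update(self.pending.pop(txn))

    def dbts(self):
        return self.last_used

    def min_active_rowversion(self):
        active = [rv for rows in self.pending.values() for rv in rows.values()]
        return min(active) if active else self.last_used + 1

    def read_window(self, lo, hi):
        # Committed rows with lo <= rowversion < hi, as in the ETL query.
        return sorted(k for k, rv in self.committed.items() if lo <= rv < hi)


def run_etl(marker):
    """Run two ETL cycles, taking window bounds from the given marker function."""
    db = ToyDb()
    seen = []
    prev = marker(db)
    db.write("A", "a")                  # rowversion 1, transaction stays open
    db.write("B", "b"); db.commit("B")  # rowversion 2
    db.write("C", "c"); db.commit("C")  # rowversion 3
    cur = marker(db)
    seen += db.read_window(prev, cur)   # cycle 1
    db.commit("A")                      # long-running transaction commits late
    db.write("D", "d"); db.commit("D")  # rowversion 4
    prev, cur = cur, marker(db)
    seen += db.read_window(prev, cur)   # cycle 2
    return seen


seen_with_dbts = run_etl(ToyDb.dbts)
seen_with_min_active = run_etl(ToyDb.min_active_rowversion)
```

In this model the @@DBTS-style marker moves past row "a" while its transaction is still open, so "a" never falls into a later window. The MIN_ACTIVE_ROWVERSION-style marker holds the window back until that transaction has committed.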
I have a database table with more than 1 million records, uniquely identified by a GUID column. I want to find out which of these rows were selected or retrieved in the last 5 years. The select query can come from multiple places. Sometimes a row is returned on its own; sometimes it is part of a set of rows. One select query fetches data over a JDBC connection from Java code, and a SQL procedure also fetches data from the table.
My intention is to clean up the table. I want to delete all rows that were never used (retrieved via a select query) in the last 5 years.
Does Oracle DB have any built-in metadata that can give me this information?
My alternative solution was to add a LAST_ACCESSED column and update it whenever I select a row from this table. But this is a costly operation for me in terms of the time taken for the whole process. At least 1,000 to 10,000 records will be selected from the table in a single operation. Is there an efficient way to do this other than updating the table after reading it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waits for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization that brings you Heat Maps to track table access (modifications as well as read operations). Be aware that the feature currently has to be licensed under the Advanced Compression Option or In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. It does not track select operations per individual row, nor at the individual block level, because the overhead would be too heavy (data is generally read often and concurrently, so keeping a counter for each row would quickly become very costly). However, if you have your data partitioned by date, e.g. a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Note that Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived or just go ahead and purge the rows. If you happen to have the data partitioned you can do easy DROP PARTITION operations to purge one or many partitions rather than having to do conventional DELETE statements.
I couldn't use any built-in solutions. I tried the following:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded. Audit uses a lot of space and has a performance hit. Similarly, the trigger also had a performance hit.
Finally I resolved the issue by maintaining a separate table where entries older than 5 years that are still used or selected in a query are inserted. When deleting, I cross-check this table and skip entries present in it.
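The side-table approach can be sketched roughly as follows, in Python with SQLite standing in for Oracle. The table names records and still_used, the created column, and the note_access and purge helpers are all hypothetical; the original post does not give its schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (guid TEXT PRIMARY KEY, created TEXT NOT NULL);
-- Hypothetical side table: old rows that are still being read.
CREATE TABLE still_used (guid TEXT PRIMARY KEY);
""")


def note_access(guids, cutoff):
    """Remember which of the fetched rows are older than the cutoff date."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO still_used (guid) "
            "SELECT guid FROM records WHERE guid = ? AND created < ?",
            [(g, cutoff) for g in guids],
        )


def purge(cutoff):
    """Delete old rows unless they appear in the side table; return the count."""
    with conn:
        cur = conn.execute(
            "DELETE FROM records WHERE created < ? "
            "AND guid NOT IN (SELECT guid FROM still_used)",
            (cutoff,),
        )
        return cur.rowcount


conn.executemany(
    "INSERT INTO records (guid, created) VALUES (?, ?)",
    [("g1", "2015-03-01"), ("g2", "2016-07-15"), ("g3", "2024-01-10")],
)
cutoff = "2019-01-01"
note_access(["g1", "g3"], cutoff)   # g1 is old but still read; g3 is recent
deleted = purge(cutoff)             # only g2 qualifies for deletion
remaining = [r[0] for r in conn.execute("SELECT guid FROM records ORDER BY guid")]
```

Only old rows are written to the side table, so the extra writes stay small compared with updating a LAST_ACCESSED column on every read.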
I am trying to write a Sybase query to get the last modified column and timestamp for all the tables in a Sybase database. Please find the SQL below, but it is not accurate: it works in one Sybase database and not in another environment. Please help me with the proper SQL.
select TableName=object_name(ss.id), RowCnt=st.rowcnt,
ColName=col_name(ss.id,convert(int,substring(ss.colidarray,1,2))),
UpdStatsDate=convert(varchar(20),moddate,100),
DaysAgo=datediff(dd,moddate,getdate())
from sysstatistics ss, systabstats st
where ss.id > 100 and st.id > 100
and ss.id=st.id
and ss.formatid=100
and st.indid in (0,1)
and ss.c4 is not null
order by TableName, ColName
Sybase ASE does not keep track of when a column was last modified. You can include a column of datatype 'timestamp' in a table, but this value is updated when any column in the row is updated. Moreover, such 'timestamp' columns do not reflect real-world clock time, but are an artificial internal counter based on the internal 'database timestamp' (which, again, is not related to what we call 'time' in the real world). For more info on what that timestamp means, see chapter 13 of my book 'Tips, Tricks & recipes for Sybase ASE' (sypron.nl/ttr).
But more importantly, the sysstatistics.moddate column does NOT reflect when a column was updated. Instead, it is the time when the statistics for a column were last updated (as a result of running UPDATE STATISTICS).
If you want to keep track of the last update time for a column, you could do this with an UPDATE trigger which detects which columns are updated as well as the time, and records this in a separate table. Note that such a trigger could quickly become a bottleneck in busy systems.
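The trigger idea might look roughly like this. The sketch uses Python with SQLite rather than Sybase ASE, so the trigger syntax differs from what you would write in ASE, and the accounts and column_audit tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, balance REAL);

-- Hypothetical audit table: one row per (table, column) with the last change time.
CREATE TABLE column_audit (
    table_name  TEXT NOT NULL,
    column_name TEXT NOT NULL,
    modified_at TEXT NOT NULL,
    PRIMARY KEY (table_name, column_name)
);

-- One trigger per tracked column; it fires only when the value really changes.
CREATE TRIGGER accounts_name_changed AFTER UPDATE OF name ON accounts
WHEN OLD.name IS NOT NEW.name
BEGIN
    INSERT OR REPLACE INTO column_audit
    VALUES ('accounts', 'name', datetime('now'));
END;

CREATE TRIGGER accounts_balance_changed AFTER UPDATE OF balance ON accounts
WHEN OLD.balance IS NOT NEW.balance
BEGIN
    INSERT OR REPLACE INTO column_audit
    VALUES ('accounts', 'balance', datetime('now'));
END;
""")

conn.execute("INSERT INTO accounts (id, name, balance) VALUES (1, 'Ann', 10.0)")
conn.execute("UPDATE accounts SET balance = 25.0 WHERE id = 1")  # logged
conn.execute("UPDATE accounts SET name = 'Ann' WHERE id = 1")    # same value, not logged

changed = conn.execute(
    "SELECT table_name, column_name FROM column_audit ORDER BY column_name"
).fetchall()
```

Every tracked update now costs an extra write to the audit table, which is where the bottleneck mentioned above comes from.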
Can I specify in a range that all rows having value in CreatedDate column earlier than one month from GETDATE() should be placed in one partition and the rest in other, so that I should query the 2nd partition for latest data and 1st one for archived data?
No, you can't. A partition function must be deterministic: deterministic functions always return the same result any time they are called with a specific set of input values.
Unfortunately, GetDate() is a nondeterministic function, so you can't use it.
See http://shannonlowder.com/2010/08/partitioning/ for more details
@Ismail
There are alternatives:
Create a bit column LastMonth and a partition function based on the LastMonth column. You need to update the field every day, before you start using your data. Alternatively, you don't have to do it daily: it may be better to update the column you choose to flag your fresh data (or change your partition function) once per period of your choosing (week/month/quarter).
I haven't tried this approach; you may need to run some maintenance on the table to get full performance after updating the column.
Another idea that might work is to make a partition for every month and change filegroups when a new month starts. For example, if you want your latest data on fast disk f: and history on s:, you will have PartitionJan on s: and PartitionFebruary on f:; when March starts, move PartitionFebruary to s: and start using PartitionMarch on f:.
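To show what "deterministic" buys you with monthly partitions, here is a small Python sketch of how fixed month boundaries map a CreatedDate to a partition number. It imitates the way a range partition function with right-inclusive boundaries assigns rows, but it is only a model: month_boundaries and partition_number are made-up helpers and the dates are arbitrary.

```python
from bisect import bisect_right
from datetime import date


def month_boundaries(first, count):
    """Return first-of-month dates for `count` consecutive months from `first`."""
    boundaries = []
    year, month = first.year, first.month
    for _ in range(count):
        boundaries.append(date(year, month, 1))
        month += 1
        if month == 13:
            year, month = year + 1, 1
    return boundaries


def partition_number(created, boundaries):
    """1-based partition for a date; a boundary date belongs to the partition on its right."""
    return bisect_right(boundaries, created) + 1


# Two fixed boundaries give three partitions: before Feb, Feb, Mar onwards.
boundaries = month_boundaries(date(2024, 2, 1), 2)

p_jan = partition_number(date(2024, 1, 15), boundaries)
p_feb = partition_number(date(2024, 2, 1), boundaries)
p_mar = partition_number(date(2024, 3, 10), boundaries)
```

The same date always lands in the same partition because the boundaries are constants. What moves over time is which partition you treat as "latest" and which filegroup it lives on, not the function itself.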