How can I store temporary, self-destroying data into a t-sql table? - sql-server

Suppose I have a table which contains relevant information. However, the data is only relevant for, let's say, 30 minutes.
After that it's just database junk, so I need to get rid of it asap.
If I wanted, I could clean this table periodically, setting an expiration datetime for each record individually and deleting expired records through a job or something. This is my #1 option, and it's what will be done unless someone convinces me otherwise.
But I think this solution may be problematic. What if someone stops the job from running and no one notices? I'm looking for something like a built-in way to insert temporary data into a table. Or a table that holds "volatile" data itself, in a way that it automagically removes data x amount of time after its insertion.
And last but not least, if there's no built-in way to do that, would I be able to implement this functionality in SQL Server 2008 (or 2012, we will be migrating soon) myself? If so, could someone give me directions as to what to look for to implement something like it?
(Sorry if the formatting ends up bad, first time using a smartphone to post on SO)

As another answer indicated, TRUNCATE TABLE is a fast way to remove the contents of a table, but it's aggressive; it will completely empty the table. Also, there are restrictions on its use; among others, it can't be used on tables which "are referenced by a FOREIGN KEY constraint".
Any more targeted removal of rows will require a DELETE statement with a WHERE clause. Having an index on relevant criteria fields (such as the insertion date) will improve performance of the deletion and might be a good idea (depending on its effect on INSERT and UPDATE statements).
You will need something to "trigger" the DELETE statement (or TRUNCATE statement). As you've suggested, a SQL Server Agent job is an obvious choice, but you are worried about the job being disabled or removed. Any solution will be vulnerable to someone removing your work, but there are more obscure ways to trigger an activity than a job. You could embed the deletion into the insertion process: either in whatever stored procedure or application code you have, or as an actual table trigger. Both of those methods increase the time required for an INSERT and, because they are not handled out of band by the SQL Server Agent, will require your users to wait slightly longer. If you have the right indexes and the table is reasonably sized, that might be an acceptable trade-off.
There isn't any other capability that I'm aware of for SQL Server to just start deleting data. There isn't automatic data retention policy enforcement.
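To make the two triggering options above concrete, here is a minimal sketch assuming a hypothetical dbo.SessionData table with an InsertedAt datetime column (both names are illustrative, not anything from your schema); the first statement would live in an Agent job step, the second folds the cleanup into the insert path:
-- Option 1: a SQL Server Agent job step that runs every few minutes,
-- assuming dbo.SessionData has an InsertedAt column (ideally indexed).
DELETE FROM dbo.SessionData
WHERE InsertedAt < DATEADD(MINUTE, -30, GETDATE());
go
-- Option 2: piggy-back the same cleanup on every insert via a trigger,
-- accepting the small extra cost each INSERT pays.
CREATE TRIGGER trg_SessionData_Cleanup
ON dbo.SessionData
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    DELETE FROM dbo.SessionData
    WHERE InsertedAt < DATEADD(MINUTE, -30, GETDATE());
END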

See @Yuriy's comment; it's relevant.
If you really need to implement it on the DB side...
TRUNCATE TABLE is a fast way to get rid of records.
If all you need is ONE table that you just fill with data, use, and dispose of ASAP, you can consider truncating a (permanent) "CACHE_TEMP" table.
The scenario becomes more complicated if you are running concurrent threads/jobs and each is handling its own data.
If that data only exists for a single "job"/context, you can consider using #TEMP tables. They are volatile by nature and may be what you are looking for.
You could also use table variables; they are even more volatile than temporary tables, but it depends on details you haven't posted, so I cannot say which is really better.
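For illustration, a quick sketch of both volatile options (the names are made up); the temporary table disappears when the session or creating procedure ends, and the table variable disappears when the batch or procedure that declared it ends:
-- Local temporary table: visible only to this session/procedure and dropped
-- automatically when it goes out of scope.
CREATE TABLE #WorkingSet (Id int, Payload varchar(100));
INSERT INTO #WorkingSet VALUES (1, 'lives only for this session');

-- Table variable: scoped to the current batch/procedure and cleaned up
-- automatically when that scope ends.
DECLARE @WorkingSet TABLE (Id int, Payload varchar(100));
INSERT INTO @WorkingSet VALUES (1, 'lives only for this batch');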

Related

When is it needed to manually reanalyze a table in PostgreSQL?

Recently I've migrated primary keys from integer to bigint and found an article where the author manually updates the table statistics after updating PK data type:
-- reanalyze the table in 3 steps
SET default_statistics_target TO 1;
ANALYZE [table];
SET default_statistics_target TO 10;
ANALYZE [table];
SET default_statistics_target TO DEFAULT;
ANALYZE [table];
As far as I understand, the analyzer automatically runs in the background to keep statistics up to date. In the Postgres docs (Notes section) I've found that the analyzer can also be run manually after some major changes.
So the questions are:
Does it make sense to manually run the analyzer if autovacuum is enabled? And if yes, in which situations?
What are the best practices around it (e.g. the default_statistics_target switching above)?
Autoanalyze is triggered by data modifications: it normally runs when 10% of your table has changed (with INSERT, UPDATE or DELETE). If you rewrite a table or create an expression index, that does not trigger autoanalyze. In these cases, it is a good idea to run a manual ANALYZE.
Don't bother with this dance around default_statistics_target. A simple, single ANALYZE tab; will do the trick.
Temporary tables cannot be processed by autovacuum. If you want them to have stats, you need to do it manually.
Creating an expression index creates the need for new stats, but doesn't do anything to schedule an auto-ANALYZE to happen. So if you make one of those, you should run ANALYZE manually. Table rewrites like the type change you linked to are similar: they destroy the old stats for that column, but don't schedule an auto-analyze to collect new ones. Eventually the table would probably be analyzed again just due to "natural" turnover, but that could be a very long time. (We really should do something about that, so they happen automatically.)
If some part of the table is both updated and queried much more heavily than the rest, the cold bulk of the table can dilute the activity counters driven by the hot part, meaning the stats for the rapidly changing part can grow very out of date before auto-analyze kicks in.
Even if the activity counters for a bulk operation do drive an auto-analyze, there is no telling how long it might take to finish. This table might get stuck behind lots of other tables already being vacuumed or waiting to be. If you just analyze it yourself after a major change, you know it will get started promptly, and you will know when it has finished (when your prompt comes back) without needing to launch an investigation into the matter.
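To make those cases concrete, a small sketch (table, column and index names are purely illustrative) of the two situations above where a manual ANALYZE pays off:
-- An expression index needs statistics on the expression, but creating it
-- does not schedule an autoanalyze, so follow up manually:
CREATE INDEX orders_lower_email_idx ON orders (lower(email));
ANALYZE orders;

-- A column type change rewrites the column and discards its old statistics;
-- again, a single plain ANALYZE is enough:
ALTER TABLE orders ALTER COLUMN id TYPE bigint;
ANALYZE orders;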

SQL Server 1B+ records table - best way to handle upserts?

The company I work for is building a data mart that wants 7 years of data maintained within it. Unfortunately, one table is well over 1 billion records.
My question is this: what would be the best way to keep this table current? (Daily update or quicker)
I know the MERGE statement is quite beneficial for this but I'm hoping to not have to parse through 1 billion records for each MERGE. Table partitioning is out as we do not have the Enterprise edition of SQL Server.
Any direction would be greatly appreciated :)
There are several options here; two are explained in the comments above.
The answer depends on which operation you want to do on the records.
If you just want to modify your recent records, the best way is to keep those records in an active table and move the rest to an archive table. This way you only need a scheduled job to move no-longer-needed records into the archive table (a rough sketch follows below).
If you also want a reporting module, you may need to provide an additional table which holds summaries/aggregates of the data, in such a way that you can extract the reports you want.
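A rough sketch of that scheduled archive move (table, column and cutoff are assumptions, and it presumes the archive table has a matching column layout with no triggers or constraints that block OUTPUT INTO); running it in batches keeps the transaction log manageable on a table this size:
-- Move rows older than the active window into the archive, 50,000 at a time.
WHILE 1 = 1
BEGIN
    DELETE TOP (50000) FROM dbo.FactSales
    OUTPUT deleted.* INTO dbo.FactSales_Archive
    WHERE SaleDate < DATEADD(YEAR, -1, GETDATE());

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to move
END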
You will want to seriously consider splitting the table. For example, see the operational versus archive paradigm.
The first step of splitting the data up from such a massive table is to identify the clustering index (if it has one) and all the other indexes as well, because you will want to avoid operations that cause big rebuilds and data shifting.
Otherwise, if you need to keep the status quo, with good indexes (you should bite the bullet and define one if the table has somehow survived this long without one), you can rely on the query optimizer to do a quick index seek (much better than a scan, especially a table scan, which is the concern you seem to have). So just write your MERGE statement, make sure the columns in the ON clause are indexed, and avoid using functions on indexed columns at all costs!
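As a hedged sketch of that MERGE shape (all object and column names are assumptions): the ON clause compares only indexed key columns with no functions wrapped around them, so the optimizer can seek into the billion-row table instead of scanning it:
MERGE dbo.BigFactTable AS target
USING staging.DailyLoad AS source
    ON target.BusinessKey = source.BusinessKey          -- indexed, no functions
WHEN MATCHED AND (target.Amount <> source.Amount
               OR target.Status <> source.Status) THEN
    UPDATE SET target.Amount    = source.Amount,
               target.Status    = source.Status,
               target.UpdatedAt = SYSDATETIME()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, Amount, Status, UpdatedAt)
    VALUES (source.BusinessKey, source.Amount, source.Status, SYSDATETIME());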

ALL_TAB_MODIFICATIONS, but for columns

I have a legacy database, and I was able to use ALL_TAB_MODIFICATIONS to easily find candidates for deletion: tables that are no longer used by the application using that database. I then continue with a more detailed review of these tables and possibly delete them
(I know it isn't guaranteed, hence I call them just "candidates for deletion").
My question is, is there anything similar I could use to find columns which aren't used?
That means all newly inserted rows have NULL in that column, and it is never UPDATEd to a non-NULL value.
If there isn't a similar view provided by Oracle, what other approach would you recommend to find them?
This isn't needed for storage reasons, but because the database is also open for reporting purposes, and we have had cases where reports were created based on old columns and thus produced wrong results.
Why not invest a little and get an exact result?
The idea is to store the content of the tables, say, at the beginning of the month and repeat this at the end of the month.
From the difference you can see which table columns were changed by updates or inserts. You'd probably tolerate changes caused by deletes.
You'll only need twice the space of your DB, plus a bit of effort for the reporting SQL.
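A rough sketch of that comparison (all names are illustrative): take a snapshot at the start of the month, then at the end of the month check whether a candidate column ever received a non-NULL value in rows that were inserted or changed since the snapshot:
-- 1) Start of the month: keep a copy of the table.
CREATE TABLE my_table_snapshot AS SELECT * FROM my_table;

-- 2) End of the month: if this count is 0, no new or changed row ever got a
--    non-NULL value in some_column, so it stays a candidate for deletion.
SELECT COUNT(t.some_column) AS new_non_null_values
FROM   my_table t
LEFT JOIN my_table_snapshot s ON s.id = t.id
WHERE  s.id IS NULL                                        -- inserted since the snapshot
   OR  DECODE(t.some_column, s.some_column, 0, 1) = 1;     -- or changed since the snapshot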
Please also note that dropping a column, even if it is not actively used, may invalidate your application if the column is referenced or SELECT * FROM is used.

Voting System: Should I use a SQL Trigger or more Code?

I'm building a voting system whereby each vote is captured in a votes table with the UserID and the DateTime of the vote along with an int of either 1 or -1.
I'm also keeping a running total of TotalVotes in the table that contains the item that the user actually voted on. This way I'm not constantly running a query to SUM the Vote table.
My question is kind of a pros/cons question when it comes to updating the TotalVotes field. With regard to code manageability, putting an additional update method in the application makes it easy to troubleshoot and find any potential problems. But if this application grows significantly in its user base, this could potentially cause a lot of additional SQL calls from the app to the DB. Using a trigger keeps it "all in the SQL family" so to speak, should add a little performance boost, and keeps a mundane activity out of the code base.
I understand that "premature optimization" could be called out on this specific question, but since I haven't built it yet, I might as well try to figure out the better approach right out of the gate.
Personally I'm leaning towards a trigger. Please give me your thoughts/reasoning.
Another option is to create a view on the votes table aggregating the votes as TotalVotes.
Then index the view.
The magic of the SQL Server optimizer (Enterprise edition only, I think) is that when it sees queries with SUM(voteColumn), it will pick that value up from the index on the view over the same data, which is amazing when you consider you're not referencing the view directly in your query!
If you don't have the enterprise edition you could query for the total votes on the view rather than the table, and then take advantage of the index.
Indexes are essentially denormalization of your data that the optimizer is aware of. You create or drop them as required and let the optimizer figure it out (no code changes required). Once you start down the path of your own hand-crafted denormalization, you'll have that baked into your code for years to come.
Check out Improving performance with indexed views
There are some specific criteria that must be met to get indexed views working. Here's a sample based on a guess of your data model:
create database indexdemo
go
create table votes(id int identity primary key, ItemToVoteOn int, vote int not null)
go
CREATE VIEW dbo.VoteCount WITH SCHEMABINDING AS
select ItemToVoteOn, SUM(vote) as TotalVotes, COUNT_BIG(*) as CountOfVotes from dbo.votes group by ItemToVoteOn
go
CREATE UNIQUE CLUSTERED INDEX VoteCount_IndexedView ON dbo.VoteCount(itemtovoteon)
go
insert into votes values(1,1)
insert into votes values(1,1)
insert into votes values(2,1)
insert into votes values(2,1)
insert into votes values(2,1)
go
select ItemToVoteOn, SUM(vote) as TotalVotes from dbo.votes group by ItemToVoteOn
This query (which doesn't reference the view or, by extension, its index) produces an execution plan that uses the view's index anyway. And of course you can drop the index later (and regain insert performance) if it doesn't pay off.
And one last word: until you're up and running, there is really no way to know whether any sort of denormalization will in fact help overall throughput. With indexes you can create them, measure whether they help or hurt, and then keep or drop them as required. It's the only kind of denormalization for performance that is safe to do.
I'd suggest you build a stored procedure that does both the vote insert and the update of the total votes. Then your application only has to know how to record a vote, but the logic for exactly what happens when you call it is still contained in one place (the stored procedure, rather than an ad-hoc update query and a separate trigger).
It also means that later on if you want to remove the update to total votes, all you have to change is the procedure by commenting out the update part.
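A minimal sketch of that stored procedure (table and column names are assumptions); the vote and the running total are maintained together, and the UPDATE is the only part you would comment out later if you drop the denormalized column:
CREATE PROCEDURE dbo.RecordVote
    @ItemId int,
    @UserId int,
    @Vote   int          -- expected to be 1 or -1
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    INSERT INTO dbo.Votes (ItemId, UserId, VoteDate, Vote)
    VALUES (@ItemId, @UserId, GETDATE(), @Vote);

    -- Maintain the denormalized total alongside the vote itself.
    UPDATE dbo.Items
    SET    TotalVotes = TotalVotes + @Vote
    WHERE  ItemId = @ItemId;

    COMMIT TRANSACTION;
END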
The premature optimization here is saving the total in a table instead of just summing the data as needed. Do you really need to denormalize the data for performance?
If you do not need to denormalize the data then you do not need to write a trigger.
I've done the trigger method for years and was always happier for it. So, as they say, "come on in, the water's fine." However, I usually do it when there are many many tables involved, not just one.
The pros/cons are well known. Materializing the value is a "pay me now" decision, you pay a little more on the insert to get faster reads. This is the way to go if and only if you want a read in 5 milliseconds instead of 500 milliseconds.
PRO: The TotalVotes will always be instantly available with one read.
PRO: You don't have to worry about code path, the code that makes the insert is much simpler. Multiplied over many tables on larger apps this is a big deal for maintainability.
CON: For each INSERT you also pay with an additional UPDATE. It takes a lot more inserts/second than most people think before you ever notice this.
CON: For many tables, manually coding triggers can get tricky. I recommend a code generator, but as I wrote the only one I know about, that would get me into self-promotion territory. If you have only one table, just code it manually.
CON: To ensure complete correctness, it should not be possible to issue an UPDATE from a console or code to modify TotalVotes. This means it is more complicated. The trigger should execute as a special superuser that is not normally used. A second trigger on the parent table fires on UPDATE and prevents changes to TotalVotes unless the user making the update is that special super-user.
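One possible shape for that guard (table, column and the dedicated user name are assumptions, and it presumes the vote-maintenance trigger is created WITH EXECUTE AS that dedicated user, here called VoteWriter):
CREATE TRIGGER trg_Items_ProtectTotalVotes
ON dbo.Items
AFTER UPDATE
AS
BEGIN
    -- Reject any update of TotalVotes not made under the dedicated context.
    IF UPDATE(TotalVotes) AND USER_NAME() <> 'VoteWriter'
    BEGIN
        RAISERROR('TotalVotes is maintained by the voting trigger only.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END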
Hope this gives you enough to decide.
My first gut instinct would be to write a UDF to perform the SUM operation and make TotalVotes a computed column based on that UDF.
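A sketch of that idea (names are assumptions); note that because the function reads another table, the computed column cannot be PERSISTED or indexed, and the function runs every time the column is read:
CREATE FUNCTION dbo.GetTotalVotes (@ItemId int)
RETURNS int
AS
BEGIN
    RETURN (SELECT SUM(Vote) FROM dbo.Votes WHERE ItemId = @ItemId);
END
go
ALTER TABLE dbo.Items
ADD TotalVotes AS dbo.GetTotalVotes(ItemId);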

Do triggers decrease performance? Inserted and deleted tables?

Suppose I have stored procedures which perform insert/update/delete operations on a table.
Depending upon some criteria, I want to perform some additional operations.
Should I create a trigger, or do the operation in the stored procedure itself?
Does using triggers decrease performance?
Do the two tables, viz. Inserted and Deleted, exist persistently or are they created dynamically?
If they are created dynamically, is there a performance issue?
If they are persistent tables, then where are they?
Also, if they exist, can I access the Inserted and Deleted tables in stored procedures?
Will it be less performant than doing the same thing in a stored proc? Probably not, but as with all performance questions the only way to really know is to test both approaches with a realistic data set (if you have a 2,000,000-record table, don't test with a table of 100 records!).
That said, the choice between a trigger and another method depends entirely on the need for the action in question to happen no matter how the data is updated, deleted, or inserted. If this is a business rule that must always happen no matter what, a trigger is the best place for it or you will eventually have data integrity problems. Data in databases is frequently changed from sources other than the GUI.
When writing a trigger, though, there are several things you should be aware of. First, the trigger fires once per statement, so whether you inserted one record or 100,000 records, the trigger only fires once. You can never assume that only one record will be affected, nor that it will always be a small record set. This is why it is critical to write all triggers as if you are going to insert, update or delete a million rows. That means set-based logic and no cursors or while loops if at all possible. Do not take a stored proc written to handle one record and call it in a cursor in a trigger.
Also, do not send emails from a trigger; you do not want to stop all inserts, updates, or deletes if the email server is down.
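For illustration, a set-based trigger sketch (table names are assumptions) that behaves the same whether the statement touched 1 row or 100,000 rows, because it works off the inserted pseudo-table in a single statement:
CREATE TRIGGER trg_Orders_Audit
ON dbo.Orders
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- One set-based statement against the inserted pseudo-table: no cursor,
    -- no per-row stored procedure calls.
    INSERT INTO dbo.OrdersAudit (OrderId, CustomerId, AuditedAt)
    SELECT i.OrderId, i.CustomerId, GETDATE()
    FROM   inserted AS i;
END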
Yes, a table with a trigger will not perform as well as it would without it. Logic dictates that doing something is more expensive than doing nothing.
I think your question would be more meaningful if you asked in terms of whether it is more performant than some other approach that you haven't specified.
Ultimately, I'd select the tool that is most appropriate for the job and only worry about performance if there is a problem, not before you have even implemented a solution.
The inserted and deleted tables are only available within the trigger, so accessing them from stored procedures is a no-go.
It decreases performance on the query by definition: the query is then doing something it otherwise wasn't going to do.
The other way to look at it is this: if you were going to manually be doing whatever the trigger is doing anyway then they increase performance by saving a round trip.
Take it a step further: that advantage disappears if you use a stored procedure and you're running within one server roundtrip anyway.
So it depends on how you look at it.
Performance on what? The trigger will perform an update on the DB after the event, so the user of your system won't even know it's going on. It happens in the background.
Your question is phrased in a manner quite difficult to understand.
If your operation is important and must never be missed, then you have 2 choices:
Execute your operation immediately after Update/Delete with durability
Delay the operation by making it loosely coupled with durability.
We also faced the same issue: our production MSSQL 2016 DB is > 1 TB with > 500 tables, and we need to send changes (insert, update, delete) of a few columns from 20 important tables to a 3rd party. The number of business processes that update those few columns in the 20 important tables is > 200, and it's a tedious task to modify them because it's a legacy application. Our existing processes must work without any dependency on the data sharing. The order of data sharing is important: FIFO must be maintained.
E.g. a user's mobile no. is 123-456-789, it changes to 123-456-123, and then changes again to 123-456-456;
the order of sending is 123-456-789 --> 123-456-123 --> 123-456-456. A subsequent request can only be sent if the response to the previous request was successful.
We created 20 new tables with only the limited columns that we want. We compare the main tables and the new tables (MainTable1 JOIN MainTale_LessCol1) using a checksum of all columns plus a TimeStamp column to identify changes.
Changes are logged in APIrequest tables and updated back into MainTale_LessCol1. This logic runs in a scheduled job every 15 minutes.
A separate process picks rows from APIrequest and sends the data to the 3rd party.
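Very roughly, the 15-minute job looks something like this sketch (all table and column names here are illustrative, not our real schema): detect rows whose column checksum differs from the limited-column copy, log them as API requests, then refresh the copy:
-- Detect new or changed rows by comparing checksums of the tracked columns.
INSERT INTO dbo.APIRequest (KeyId, Payload, Status, CreatedAt)
SELECT m.KeyId,
       CONCAT(m.Col1, '|', m.Col2, '|', m.Col3),
       'PENDING',
       SYSDATETIME()
FROM   dbo.MainTable1 AS m
LEFT JOIN dbo.MainTable1_LessCol AS c ON c.KeyId = m.KeyId
WHERE  c.KeyId IS NULL
   OR  CHECKSUM(m.Col1, m.Col2, m.Col3) <> CHECKSUM(c.Col1, c.Col2, c.Col3);

-- Refresh the limited-column copy so the next run only sees new changes.
MERGE dbo.MainTable1_LessCol AS c
USING dbo.MainTable1 AS m ON c.KeyId = m.KeyId
WHEN MATCHED THEN
    UPDATE SET c.Col1 = m.Col1, c.Col2 = m.Col2, c.Col3 = m.Col3
WHEN NOT MATCHED BY TARGET THEN
    INSERT (KeyId, Col1, Col2, Col3)
    VALUES (m.KeyId, m.Col1, m.Col2, m.Col3);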
We Explored
Triggers
CDC (Change Data Capture)
200+ Process Changes
Since our deadlines were strict, the cumulative changes on those 20 tables were > 1000/sec, and our system was already at peak capacity, our current design works for us.
You can try CDC and share your experience.
