When is it needed to manually reanalyze a table in PostgreSQL?

Recently I've migrated primary keys from integer to bigint and found an article where the author manually updates the table statistics after updating the PK data type:
-- reanalyze the table in 3 steps
SET default_statistics_target TO 1;
ANALYZE [table];
SET default_statistics_target TO 10;
ANALYZE [table];
SET default_statistics_target TO DEFAULT;
ANALYZE [table];
As far as I understand, the analyzer runs automatically in the background to keep statistics up to date. In the Postgres docs (Notes section) I found that it can also be run manually after some major changes.
So the questions are:
Does it make sense to manually run the analyzer if autovacuum is enabled? And if yes, in which situations?
What are the best practices for doing so (e.g. the default_statistics_target switching shown above)?

Autoanalyze is triggered by data modifications: it normally runs when 10% of your table has changed (with INSERT, UPDATE or DELETE). If you rewrite a table or create an expression index, that does not trigger autoanalyze. In these cases, it is a good idea to run a manual ANALYZE.
Don't bother with this dance around default_statistics_target. A simple, single ANALYZE tab; will do the trick.
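For illustration, a minimal sketch of that post-migration step (table and column names here are made up):
-- hypothetical names; a plain ANALYZE after the rewrite is sufficient
ALTER TABLE orders ALTER COLUMN id TYPE bigint;
ANALYZE orders;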

Temporary tables cannot be processed by autovacuum. If you want them to have stats, you need to do it manually.
Creating an expression index creates the need for new stats, but doesn't do anything to schedule an auto-ANALYZE to happen. So if you make one of those, you should run ANALYZE manually. Table rewrites like the type change you linked to are similar: they destroy the old stats for that column, but don't schedule an auto-analyze to collect new ones. Eventually the table would probably be analyzed again just due to "natural" turnover, but that could be a very long time. (We really should do something about that, so it happens automatically.)
If some part of the table is both updated and queried much more heavily than others, the cold bulk of the table can dilute the activity counters driven by the hot part, meaning the stats for the rapidly changing part can grow very out of date before auto-analyze kicks in.
Even if the activity counters for a bulk operation do drive an auto-analyze, there is no telling how long it might take to finish. This table might get stuck behind lots of other tables already being vacuumed or waiting to be. If you just analyze it yourself after a major change, you know it will get started promptly, and you will know when it has finished (when your prompt comes back) without needing to launch an investigation into the matter.
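For example (hypothetical table and expression), a manual ANALYZE right after creating an expression index makes statistics for the expression available immediately:
-- hypothetical example: the expression gets its own statistics entry
CREATE INDEX users_lower_email_idx ON users (lower(email));
ANALYZE users;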

Related

MonetDB refresh data in background best strategy with active connections making queries

I'm testing MonetDB and getting amazing performance while querying millions of rows on my laptop.
I expect to work with billions in production, and I need to update the data as often as possible, let's say every 1 minute, or every 5 minutes in the worst case. That means just updating existing records or adding new ones; deletion can be scheduled once a day.
I've seen good performance for the updates in my tests, but I'm a bit worried about the same operations over three or four times more data.
As for BULK inserts, I got 1 million rows in 5 seconds, so that's good enough performance right now as well. I have not tried deletion.
Everything works fine unless you run queries at the same time you update the data, in which case everything seems to be frozen for a long, long, long time.
So, what's the best strategy to get MonetDB updated in background?
Thanks
You could do each load in a new table with the same schema, then create a VIEW that unions them all together. Queries will run on the view, and dropping and recreating that view is very fast.
However, it would probably be best to merge some of these smaller tables together every now and then. For example, a nightly job could combine all load tables from the previous day(s) into a new table (runs independently, no problem) and then recreate the view again.
Alternatively, you could use the BINARY COPY INTO to speed up the loading process in the first place.
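A minimal sketch of the view-union approach, with made-up table names (one table per load, one view over all of them):
-- each load goes into its own table
CREATE TABLE readings_load1 (sensor INT, val DOUBLE, ts TIMESTAMP);
CREATE TABLE readings_load2 (sensor INT, val DOUBLE, ts TIMESTAMP);
-- queries go against the view; dropping and recreating it after each load is cheap
CREATE VIEW readings AS
    SELECT * FROM readings_load1
    UNION ALL
    SELECT * FROM readings_load2;
-- after the next load: DROP VIEW readings; then CREATE VIEW again, including the new table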
There is a new merge table functionality that could replace the view in Hannes Mühleisen's answer and would be more idiomatic.
You can attach / detach partitions using:
ALTER TABLE mergedTable ADD TABLE partitionTable;
ALTER TABLE mergedTable DROP TABLE partitionTable;
Updates are more problematic, as they must be made directly to the partition tables; that is easier if you have a partitioning key (date, ...).
But it was the same with the previous solution.
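A rough sketch of the merge-table variant, again with made-up names:
-- the merge table has the same schema as its partitions
CREATE MERGE TABLE readings (sensor INT, val DOUBLE, ts TIMESTAMP);
CREATE TABLE readings_load1 (sensor INT, val DOUBLE, ts TIMESTAMP);
ALTER TABLE readings ADD TABLE readings_load1;
-- retire a partition later
ALTER TABLE readings DROP TABLE readings_load1;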

How can I store temporary, self-destroying data into a t-sql table?

Suppose I have a table which contains relevant information. However, the data is only relevant for, let's say, 30 minutes.
After that it's just database junk, so I need to get rid of it asap.
If I wanted, I could clean this table periodically, setting an expiration date time for each record individually and deleting expired records through a job or something. This is my #1 option, and it's what will be done unless someone convinces me otherwise.
But I think this solution may be problematic. What if someone stops the job from running and no one notices? I'm looking for something like a built-in way to insert temporary data into a table. Or a table that has "volatile" data itself, in a way that it automagically removes data x amount of time after its insertion.
And last but not least, if there's no built-in way to do that, would I be able to implement this functionality myself in SQL Server 2008 (or 2012, we will be migrating soon)? If so, could someone give me directions as to what to look for to implement something like it?
(Sorry if the formatting ends up bad, first time using a smartphone to post on SO)
As another answer indicated, TRUNCATE TABLE is a fast way to remove the contents of a table, but it's aggressive; it will completely empty the table. Also, there are restrictions on its use; among others, it can't be used on tables which "are referenced by a FOREIGN KEY constraint".
Any more targeted removal of rows will require a DELETE statement with a WHERE clause. Having an index on relevant criteria fields (such as the insertion date) will improve performance of the deletion and might be a good idea (depending on its effect on INSERT and UPDATE statements).
You will need something to "trigger" the DELETE statement (or TRUNCATE statement). As you've suggested, a SQL Server Agent job is an obvious choice, but you are worried about the job being disabled or removed. Any solution will be vulnerable to someone removing your work, but there are more obscure ways to trigger an activity than a job. You could embed the deletion into the insertion process-- either in whatever stored procedure or application code you have, or as an actual table trigger. Both of those methods increase the time required for an INSERT and, because they are not handled out of band by the SQL Server Agent, will require your users to wait slightly longer. If you have the right indexes and the table is reasonably-sized, that might be an acceptable trade-off.
There isn't any other capability that I'm aware of for SQL Server to just start deleting data. There isn't automatic data retention policy enforcement.
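As a sketch of the job-driven cleanup (table, column and threshold are assumptions), the Agent job would just run something like:
-- delete everything older than 30 minutes; an index on InsertedAt helps this WHERE clause
DELETE FROM dbo.TempResults
WHERE InsertedAt < DATEADD(MINUTE, -30, SYSUTCDATETIME());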
See @Yuriy's comment, that's relevant.
If you really need to implement it DB side....
TRUNCATE TABLE is a fast way to get rid of records.
If all you need is ONE table that you just fill with data, use, and dispose of asap, you can consider truncating a (permanent) "CACHE_TEMP" table.
The scenario becomes more complicated if you are running concurrent threads/jobs and each is handling its own data.
If that data only exists for a single "job"/context, you can consider using #TEMP tables. They are a bit volatile and may be what you are looking for.
You might also use table variables; they are a bit more volatile than temporary tables, but it depends on things you didn't post, so I cannot say which is really better.
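A small sketch of both options (names are made up); the #temp table vanishes when the session or procedure ends, the table variable when the batch ends:
-- session-scoped temporary table
CREATE TABLE #work (id INT PRIMARY KEY, payload NVARCHAR(100));
INSERT INTO #work VALUES (1, N'only visible to this session');

-- batch/procedure-scoped table variable
DECLARE @work TABLE (id INT PRIMARY KEY, payload NVARCHAR(100));
INSERT INTO @work VALUES (1, N'gone when the batch ends');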

Voting System: Should I use a SQL Trigger or more Code?

I'm building a voting system whereby each vote is captured in a votes table with the UserID and the DateTime of the vote along with an int of either 1 or -1.
I'm also keeping a running total of TotalVotes in the table that contains the item that the user actually voted on. This way I'm not constantly running a query to SUM the Vote table.
My question is kind of a pros/cons question when it comes to updating the TotalVotes field. With regards to code manageability, putting an additional update method in the application makes it easy to troubleshoot and find any potential problems. But if this application grows significantly in its user base, this could potentially cause a lot of additional SQL calls from the app to the DB. Using a trigger keeps it "all in the SQL family" so to speak, and should add a little performance boost, as well as keep a mundane activity out of the code base.
I understand that this could be called premature optimization, but since I haven't built it yet, I might as well try to figure out the better approach right out of the gate.
Personally I'm leaning towards a trigger. Please give me your thoughts/reasoning.
Another option is to create a view on the votes table aggregating the votes as TotalVotes.
Then index the view.
The magic of the SQL Server optimizer (Enterprise Edition only, I think) is that when it sees queries for sum(voteColumn), it will pick that value up from the index on the view of the same data, which is amazing when you consider you're not referencing the view directly in your query!
If you don't have the enterprise edition you could query for the total votes on the view rather than the table, and then take advantage of the index.
Indexes are essentially denormalization of your data that the optimizer is aware of. You create or drop them as required, and let the optimizer figure it out (no code changes required). Once you start down the path of your own hand-crafted denormalization, you'll have that baked into your code for years to come.
Check out Improving performance with indexed views
There are some specific criteria that must be met to get indexed views working. Here's a sample based on a guess of your data model:
create database indexdemo
go
create table votes(id int identity primary key, ItemToVoteOn int, vote int not null)
go
CREATE VIEW dbo.VoteCount WITH SCHEMABINDING AS
select ItemToVoteOn, SUM(vote) as TotalVotes, COUNT_BIG(*) as CountOfVotes from dbo.votes group by ItemToVoteOn
go
CREATE UNIQUE CLUSTERED INDEX VoteCount_IndexedView ON dbo.VoteCount(itemtovoteon)
go
insert into votes values(1,1)
insert into votes values(1,1)
insert into votes values(2,1)
insert into votes values(2,1)
insert into votes values(2,1)
go
select ItemToVoteOn, SUM(vote) as TotalVotes from dbo.votes group by ItemToVoteOn
And this query (which doesn't reference the view or, by extension, its index) produces an execution plan that uses the view's index. Of course, you can drop the index later (and gain insert performance back).
And one last word: until you're up and running, there is really no way to know whether any sort of denormalization will in fact help overall throughput. With indexes you can create them, measure whether they help or hurt, and then keep or drop them as required. It's the only kind of denormalization for performance that is safe to do.
I'd suggest you build a stored procedure that does both the vote insert and the update on the total votes. Then your application only has to know how to record a vote, but the logic on exactly what is going on when you call that is still contained in one place (the stored procedure, rather than an ad-hoc update query and a separate trigger).
It also means that later on if you want to remove the update to total votes, all you have to change is the procedure by commenting out the update part.
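A minimal sketch of such a procedure, with hypothetical table and column names:
CREATE PROCEDURE dbo.RecordVote
    @ItemId INT,
    @UserId INT,
    @Vote   INT  -- expected to be 1 or -1
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
        INSERT INTO dbo.Votes (ItemId, UserId, VotedAt, Vote)
        VALUES (@ItemId, @UserId, SYSDATETIME(), @Vote);

        -- the denormalized total; comment this out later if you drop it
        UPDATE dbo.Items
        SET TotalVotes = TotalVotes + @Vote
        WHERE ItemId = @ItemId;
    COMMIT;
END
The application then only ever calls something like EXEC dbo.RecordVote @ItemId = 42, @UserId = 7, @Vote = 1;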
The premature optimization here is saving the total in a table instead of just summing the data as needed. Do you really need to denormalize the data for performance?
If you do not need to denormalize the data then you do not need to write a trigger.
I've done the trigger method for years and was always happier for it. So, as they say, "come on in, the water's fine." However, I usually do it when there are many many tables involved, not just one.
The pros/cons are well known. Materializing the value is a "pay me now" decision: you pay a little more on the insert to get faster reads. This is the way to go if and only if you want a read in 5 milliseconds instead of 500 milliseconds.
PRO: The TotalVotes will always be instantly available with one read.
PRO: You don't have to worry about code path, the code that makes the insert is much simpler. Multiplied over many tables on larger apps this is a big deal for maintainability.
CON: For each INSERT you also pay with an additional UPDATE. It takes a lot more inserts/second than most people think before you ever notice this.
CON: For many tables, manually coding triggers can get tricky. I recommend a code generator, but as I wrote the only one I know about, that would get me into self-promotion territory. If you have only one table, just code it manually.
CON: To ensure complete correctness, it should not be possible to issue an UPDATE from a console or code to modify TotalVotes. This means it is more complicated. The trigger should execute as a special superuser that is not normally used. A second trigger on the parent table fires on UPDATE and prevents changes to TotalVotes unless the user making the update is that special super-user.
Hope this gives you enough to decide.
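For reference, a sketch of what such an insert trigger could look like (table and column names are assumptions; it is written set-based so multi-row inserts are handled correctly):
CREATE TRIGGER trg_Votes_Insert ON dbo.Votes
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- aggregate the whole inserted batch per item, then apply it in one UPDATE
    UPDATE i
    SET TotalVotes = i.TotalVotes + d.VoteSum
    FROM dbo.Items AS i
    JOIN (SELECT ItemId, SUM(Vote) AS VoteSum
          FROM inserted
          GROUP BY ItemId) AS d
        ON d.ItemId = i.ItemId;
END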
My first gut instinct would be to write a UDF to perform the SUM operation and make TotalVotes a computed column based on that UDF.
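A sketch of that idea (hypothetical names; note the computed column cannot be persisted or indexed because the function reads data, and scalar UDFs can be slow on large scans):
CREATE FUNCTION dbo.fn_TotalVotes (@ItemId INT)
RETURNS INT
AS
BEGIN
    RETURN (SELECT ISNULL(SUM(Vote), 0) FROM dbo.Votes WHERE ItemId = @ItemId);
END
GO
-- non-persisted computed column evaluated on read
ALTER TABLE dbo.Items ADD TotalVotes AS dbo.fn_TotalVotes(ItemId);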

Do triggers decreases the performance? Inserted and deleted tables?

Suppose I have stored procedures which perform insert/update/delete operations on a table.
Depending upon some criteria, I want to perform some additional operations.
Should I create a trigger, or do the operation in the stored procedure itself?
Does using triggers decrease performance?
Do the two tables, viz. Inserted and Deleted, exist persistently, or are they created dynamically?
If they are created dynamically, is there a performance impact?
If they are persistent tables, then where are they?
Also, if they exist, can I access the Inserted and Deleted tables in stored procedures?
Will it be less performant than doing the same thing in a stored proc? Probably not, but as with all performance questions the only way to really know is to test both approaches with a realistic data set (if you have a 2,000,000-record table, don't test with a table of 100 records!).
That said, the choice between a trigger and another method depends entirely on the need for the action in question to happen no matter how the data is updated, deleted, or inserted. If this is a business rule that must always happen no matter what, a trigger is the best place for it or you will eventually have data integrity problems. Data in databases is frequently changed from sources other than the GUI.
When writing a trigger though there are several things you should be aware of. First, the trigger fires once for each batch, so whether you inserted one record or 100,000 records the trigger only fires once. You cannot assume ever that only one record will be affected. Nor can you assume that it will always only be a small record set. This is why it is critical to write all triggers as if you are going to insert, update or delete a million rows. That means set-based logic and no cursors or while loops if at all possible. Do not take a stored proc written to handle one record and call it in a cursor in a trigger.
Also, do not send emails from a cursor; you do not want to stop all inserts, updates, or deletes if the email server is down.
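To make the set-based point concrete, here is a sketch of an AFTER UPDATE trigger that handles any batch size with a single statement (all names are made up):
CREATE TRIGGER trg_Orders_Audit ON dbo.Orders
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- inserted holds the new rows, deleted the old ones; one INSERT covers the whole batch
    INSERT INTO dbo.OrdersAudit (OrderId, OldStatus, NewStatus, ChangedAt)
    SELECT d.OrderId, d.Status, i.Status, SYSDATETIME()
    FROM inserted AS i
    JOIN deleted  AS d ON d.OrderId = i.OrderId
    WHERE i.Status <> d.Status;
END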
Yes, a table with a trigger will not perform as well as it would without it. Logic dictates that doing something is more expensive than doing nothing.
I think your question would be more meaningful if you asked in terms of whether it is more performant than some other approach that you haven't specified.
Ultimately, I'd select the tool that is most appropriate for the job and only worry about performance if there is a problem, not before you have even implemented a solution.
Inserted and deleted tables are available within the trigger, so calling them from stored procedures is a no-go.
It decreases performance on the query by definition: the query is then doing something it otherwise wasn't going to do.
The other way to look at it is this: if you were going to manually be doing whatever the trigger is doing anyway then they increase performance by saving a round trip.
Take it a step further: that advantage disappears if you use a stored procedure and you're running within one server roundtrip anyway.
So it depends on how you look at it.
Performance on what? The trigger will perform an update on the DB after the event, so the user of your system won't even know it's going on. It happens in the background.
Your question is phrased in a manner quite difficult to understand.
If your operation is important and must never be missed, then you have 2 choices:
Execute your operation immediately after Update/Delete with durability
Delay the operation by making it loosely coupled with durability.
We faced the same issue: our production MSSQL 2016 DB is > 1 TB with > 500 tables, and we need to send changes (insert, update, delete) of a few columns from 20 important tables to a 3rd party. The number of business processes that update those few columns in the 20 important tables is > 200, and it's a tedious task to modify them because it's a legacy application. Our existing processes must keep working without any dependency on the data sharing. The order of data sharing is important: FIFO must be maintained.
E.g. a user's mobile no. is 123-456-789; it changes to 123-456-123 and then again to 123-456-456.
The order of sending is 123-456-789 --> 123-456-123 --> 123-456-456, and a subsequent request can only be sent if the response to the previous request was successful.
We created 20 new tables with only the columns we want. We compare each main table with its new table (MainTable1 JOIN MainTable_LessCol1) using a checksum of all columns and a timestamp column to identify changes.
Changes are logged in APIrequest tables and written back into MainTable_LessCol1. This logic runs in a scheduled job every 15 minutes.
A separate process picks rows from APIrequest and sends the data to the 3rd party.
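Roughly, the change-detection step looks like this sketch (all table and column names here are assumptions, not the real schema):
-- rows whose checksummed columns differ from the last shared copy get queued
INSERT INTO dbo.APIrequest (CustomerId, MobileNo, DetectedAt)
SELECT m.CustomerId, m.MobileNo, SYSDATETIME()
FROM dbo.MainTable1 AS m
JOIN dbo.MainTable_LessCol1 AS c ON c.CustomerId = m.CustomerId
WHERE CHECKSUM(m.MobileNo, m.Email) <> CHECKSUM(c.MobileNo, c.Email);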
We explored:
Triggers
CDC (Change Data Capture)
200+ Process Changes
Since our deadlines were strict, cumulative changes on those 20 tables were > 1000/sec, and our system was already at peak capacity, our current design works for us.
You can try CDC and share your experience.

Is bigint large enough for an event log table?

Now I know that bigint is 2^64; that is, more atoms than there are in the known universe. I shouldn't be worried, as my mere human brain simply can't get around the enormity of that number.
However, let's say that I record every change to every category, product and order in my system, from launch until the end of time. Should I be concerned about the performance of table writes before I worry about running out of primary key values? Should I record events of different priorities to different event tables? Will I run out of atoms on a hard drive before I run out of bigints? How big should I let an event log table get before I start archiving / clearing it out?
Even if every one of your entries were only 1 byte, 2^64 entries would occupy around 18,000,000 TB on your hard drive, so I guess you shouldn't worry about this.
If your application added a record to the table once every millionth of a second, it would run for over five hundred thousand years before it ran out of keys.
"How big should I let an event log table get before I start archiving / clearing it out?"
Never clear the event logs -- the information has significant value.
However, when some manager insists that an archive is necessary, you can show the cost of storage vs. the cost of your time to (a) think about it, (b) get second and third opinions, and then (c) write a procedure to archive log records.
The cost of storage is plummeting. Your time is better spent on ANYTHING other than purging log records.
Bottom line: you have permission to stop wringing your hands. It's all good. You're not making a fundamental mistake.
It is highly unlikely that you will ever run out of primary key values. However, you may need to give consideration to how you want to access the log table to retrieve data. Use this to inform when you should be archiving or cleaning the data. If the log data is read frequently, think about adding indexes to improve read performance, but keep in mind that indexes need to be maintained for every record added.
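For example (hypothetical table and column), indexing the timestamp you query and archive by keeps both reads and cleanup cheap, at the cost of maintaining the index on every insert:
CREATE INDEX IX_EventLog_EventDate ON dbo.EventLog (EventDate);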
The way we handle this is by providing log archiving functionality that separates out the log table into separate databases by year, allowing us to reset the identity seed on our LogEvent table.
We also have different log tables, though only two main ones.
