MonetDB: best strategy to refresh data in the background with active connections making queries

I'm testing MonetDB and getting amazing performance while querying millions of rows on my laptop.
I expect to work with billions in production and I need to update the data as often as possible, let's say every 1 minute, or every 5 minutes in the worst case. Just updating existing records or adding new ones; deletion can be scheduled once a day.
I've seen good performance for the updates in my tests, but I'm a bit worried about the same operations over three or four times more data.
As for BULK insert, I got 1 million rows in 5 seconds, so that's good enough performance right now as well. I have not tried deletion.
Everything works fine unless you run queries at the same time you update the data; in that case everything seems to be frozen for a long, long time.
So, what's the best strategy to keep MonetDB updated in the background?
Thanks

You could do each load in a new table with the same schema, then create a VIEW that unions them all together. Queries will run on the view, and dropping and recreating that view is very fast.
However, it would probably be best to merge some of these smaller tables together every now and then. For example, a nightly job could combine all load tables from the previous day(s) into a new table (runs independently, no problem) and then recreate the view again.
Alternatively, you could use the BINARY COPY INTO to speed up the loading process in the first place.
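To illustrate, here is a rough sketch of that load-and-recreate-view cycle in MonetDB SQL; the table names, columns and file path below are made up:
-- each load goes into its own small table
CREATE TABLE readings_load_1200 (id INT, ts TIMESTAMP, val DOUBLE);
COPY INTO readings_load_1200 FROM '/tmp/batch_1200.csv' USING DELIMITERS ',';
-- recreating the view is a cheap, metadata-only operation
DROP VIEW readings;
CREATE VIEW readings AS
SELECT * FROM readings_load_1155
UNION ALL
SELECT * FROM readings_load_1200;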

There is a newer merge table functionality that could replace the view in Hannes Mühleisen's answer and would be more idiomatic.
You can attach / detach partitions using:
alter table mergedTable ADD/DROP table partitionTable
It can be problematic for updates, as they must be made directly against the partition tables; that is easier if you have a partitioning key (date, ...).
But the same was true of the previous solution.
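As a rough illustration of the merge-table variant (MonetDB SQL, with made-up names):
CREATE MERGE TABLE readings_all (id INT, ts TIMESTAMP, val DOUBLE);
CREATE TABLE readings_2024_06_01 (id INT, ts TIMESTAMP, val DOUBLE);
COPY INTO readings_2024_06_01 FROM '/tmp/batch.csv' USING DELIMITERS ',';
-- attach the freshly loaded partition; queries on readings_all now see it
ALTER TABLE readings_all ADD TABLE readings_2024_06_01;
-- detach it again before updating or consolidating it
ALTER TABLE readings_all DROP TABLE readings_2024_06_01;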


When is it needed to manually reanalyze a table in PostgreSQL?

Recently I migrated primary keys from integer to bigint and found an article where the author manually updates the table statistics after changing the PK data type:
-- reanalyze the table in 3 steps
SET default_statistics_target TO 1;
ANALYZE [table];
SET default_statistics_target TO 10;
ANALYZE [table];
SET default_statistics_target TO DEFAULT;
ANALYZE [table];
As far as I understand, the analyzer automatically runs in the background to keep statistics up to date. In the Postgres docs (Notes section) I found that the analyzer can be run manually after some major changes.
So the questions are:
Does it make sense to manually run the analyzer if autovacuum is enabled? And if yes, in which situations?
What are the best practices for it (e.g. the default_statistics_target switching above)?
Autoanalyze is triggered by data modifications: it normally runs when 10% of your table has changed (with INSERT, UPDATE or DELETE). If you rewrite a table or create an expression index, that does not trigger autoanalyze. In these cases, it is a good idea to run a manual ANALYZE.
Don't bother with this dance around default_statistics_target. A simple, single ANALYZE tab; will do the trick.
Temporary tables cannot be processed by autovacuum. If you want them to have stats, you need to do it manually.
Creating an expression index creates the need for new stats, but doesn't do anything to schedule an auto-ANALYZE to happen. So if you create one of those, you should run an ANALYZE manually. Table rewrites like the type change you linked to are similar: they destroy the old stats for that column but don't schedule an auto-analyze to collect new ones. Eventually the table would probably be analyzed again just due to "natural" turnover, but that could take a very long time. (We really should do something about that, so it happens automatically.)
If some part of the table is both updated and queried much more heavily than the rest, the cold bulk of the table can dilute the activity counters driven by the hot part, meaning the stats for the rapidly changing part can grow very out of date before auto-analyze kicks in.
Even if the activity counters for a bulk operation do trigger an auto-analyze, there is no telling how long it might take to finish. The table might get stuck behind lots of other tables already being vacuumed or waiting to be. If you just analyze it yourself after a major change, you know it will get started promptly, and you will know when it has finished (when your prompt comes back) without needing to launch an investigation into the matter.
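For example, after creating an expression index or rewriting a column type, a plain manual ANALYZE is enough (the table and column here are hypothetical):
CREATE INDEX users_lower_email_idx ON users (lower(email));
ANALYZE users;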

data migration takes a long time

I have written a C# console app to migrate data.
The record count is not that large; each table has almost 100 hundred records, but the data structure and the business logic are complicated, with almost 200 tables.
My data migration performs all types of actions: delete, update, insert, and select.
The delete and update operations are only used for data correction in the source database.
Right now my data migration takes a very long time: almost three days or more!
Some actions I have taken to improve it:
1- First, set NOCHECK CONSTRAINT in the source database before doing the deletes, updates, and inserts (see the sketch after this list).
2- Then, to fetch data from the source database, add some indexes.
3- Disable all indexes and constraints in the destination database when inserting data.
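A rough T-SQL sketch of step 1, with a made-up table name; the constraints are re-enabled WITH CHECK afterwards so they are trusted again:
ALTER TABLE dbo.SourceTable NOCHECK CONSTRAINT ALL;
-- ... run the deletes, updates and inserts ...
ALTER TABLE dbo.SourceTable WITH CHECK CHECK CONSTRAINT ALL;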
Now, can anyone suggest a solution to reduce the run time?
It should be noted that in this phase of the project I can't switch to another solution, for example SSIS. I must improve this console app!
I'm using EF Core 2.2 with raw queries to transfer the data.
Thanks a lot
I had a very significant performance improvement by switching the recovery model from Full to Simple. Obviously, there are good reasons to use Full; but depending on the changes that your migration is doing, the performance improvement may be an order of magnitude!
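If you want to try that, the switch looks something like this (TargetDb is a placeholder; switch back to Full and take a full backup once the migration is done):
ALTER DATABASE TargetDb SET RECOVERY SIMPLE;
-- run the migration here
ALTER DATABASE TargetDb SET RECOVERY FULL;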
Sorry to bring it up, but it is very difficult to understand what you are trying to say (I suppose it is due to poor English). Maybe run a grammar checker to improve clarity! Is your question about EF Migrations? Or are you using a custom query to read data from one DB and write it into another? It seems to be the latter, so you probably need to look at SQL extended events to identify poorly written queries before you start tuning the database instance! Nothing improves performance like tuning the SQL!

improve database querying in ms sql

What's a fast way to query large amounts of data (between 10,000 and 100,000 rows; it will get bigger in the future, maybe 1,000,000+) spread across multiple tables (20+) that involves left joins and aggregate functions (SUM, MAX, COUNT, etc.)?
My solution would be to build one table that contains all the data I need, and have triggers that update this table whenever one of the other tables gets updated. I know triggers aren't really recommended, but this way I take the load off the querying. Or do one big update every night.
I've also tried views, but once they start involving left joins and calculations they're way too slow and time out.
Since your question is too general, here's a general answer...
The path you're taking right now is optimizing a single query / a single issue. Sure, it might solve the issue you have right now, but it's usually not very good in the long run (not to mention the cumulative cost of maintaining such a thing).
The common path is to create an 'analytics' database: a real-time copy of your production database that you query for all your reports. This analytics database can eventually become a full-blown DWH, but you're probably going to start with simple real-time replication (or replicate nightly or whatever) and work from there...
As I said, the question/problem is too broad to be answered in a couple of paragraphs; these are only some of the guidelines...
I'd need a bit more detail, but I can already suggest this:
Use "with(nolock)"; this will slightly improve the speed.
Reference: Effect of NOLOCK hint in SELECT statements
Use indexes on your table fields to fetch data fast.
Reference: sql query to select millions record very fast
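For example (the Orders table, its columns and the index name are invented; note that NOLOCK allows dirty reads, so only use it where approximate results are acceptable):
SELECT o.CustomerId, SUM(o.Amount) AS Total
FROM dbo.Orders AS o WITH (NOLOCK)
GROUP BY o.CustomerId;

CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
ON dbo.Orders (CustomerId) INCLUDE (Amount);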

Do triggers decrease performance? Inserted and deleted tables?

Suppose I have stored procedures which perform insert/update/delete operations on a table.
Depending on some criteria I want to perform some operations.
Should I create a trigger, or do the operation in the stored procedure itself?
Does using triggers decrease performance?
Do the two tables, Inserted and Deleted, exist persistently, or are they created dynamically?
If they are created dynamically, does that cause a performance issue?
If they are persistent tables, then where are they?
Also, if they exist, can I access the Inserted and Deleted tables in stored procedures?
Will it be less performant than doing the same thing in a stored proc? Probably not, but as with all performance questions the only way to really know is to test both approaches with a realistic data set (if you have a 2,000,000 record table, don't test with a table of 100 records!).
That said, the choice between a trigger and another method depends entirely on the need for the action in question to happen no matter how the data is updated, deleted, or inserted. If this is a business rule that must always happen no matter what, a trigger is the best place for it, or you will eventually have data integrity problems. Data in databases is frequently changed from sources other than the GUI.
When writing a trigger, though, there are several things you should be aware of. First, the trigger fires once for each batch, so whether you inserted one record or 100,000 records, the trigger only fires once. You can never assume that only one record will be affected, nor that it will always be a small record set. This is why it is critical to write all triggers as if you are going to insert, update, or delete a million rows. That means set-based logic and no cursors or while loops if at all possible. Do not take a stored proc written to handle one record and call it in a cursor in a trigger.
Also, do not send emails from a cursor; you do not want to stop all inserts, updates, or deletes if the email server is down.
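As a minimal sketch of that set-based style (table and column names are hypothetical):
CREATE TRIGGER trg_Orders_StatusAudit
ON dbo.Orders
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- handle the whole affected set at once; never assume a single row
    INSERT INTO dbo.OrdersAudit (OrderId, OldStatus, NewStatus, ChangedAt)
    SELECT d.OrderId, d.Status, i.Status, SYSUTCDATETIME()
    FROM inserted AS i
    JOIN deleted AS d ON d.OrderId = i.OrderId
    WHERE i.Status <> d.Status;
END;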
Yes, a table with a trigger will not perform as well as it would without it. Logic dictates that doing something is more expensive than doing nothing.
I think your question would be more meaningful if you asked in terms of whether it is more performant than some other approach that you haven't specified.
Ultimately, I'd select the tool that is most appropriate for the job and only worry about performance if there is a problem, not before you have even implemented a solution.
Inserted and deleted tables are available within the trigger, so calling them from stored procedures is a no-go.
It decreases performance on the query by definition: the query is then doing something it otherwise wasn't going to do.
The other way to look at it is this: if you were going to do whatever the trigger is doing manually anyway, then it increases performance by saving a round trip.
Take it a step further: that advantage disappears if you use a stored procedure and you're running within one server roundtrip anyway.
So it depends on how you look at it.
Performance on what? The trigger will perform an update on the DB after the event, so the user of your system won't even know it's going on. It happens in the background.
Your question is phrased in a manner quite difficult to understand.
If your operation is important and must never be missed, then you have 2 choices:
Execute your operation immediately after the update/delete, with durability.
Delay the operation by making it loosely coupled, with durability.
We also faced the same issue. Our production MSSQL 2016 DB is > 1 TB with > 500 tables, and we needed to send changes (insert, update, delete) of a few columns from 20 important tables to a 3rd party. The number of business processes that update those few columns in the 20 important tables was > 200, and modifying them is a tedious task because it's a legacy application. Our existing processes must keep working without any dependency on the data sharing. The order of data sharing is important: FIFO must be maintained.
E.g. a user's mobile no. 123-456-789 changes to 123-456-123 and then changes again to 123-456-456.
The order of sending is 123-456-789 --> 123-456-123 --> 123-456-456. A subsequent request can only be sent if the response to the previous request was successful.
We created 20 new tables with only the columns we want. We compare each main table and its new table (MainTable1 JOIN MainTale_LessCol1) using a checksum of all columns and a TimeStamp column to identify changes.
Changes are logged in APIrequest tables and updated back into MainTale_LessCol1. This logic runs in a scheduled job every 15 minutes.
A separate process picks records from APIrequest and sends the data to the 3rd party.
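Roughly, the change detection looks something like this (the column names are invented; MainTale_LessCol1 is the reduced-column copy):
SELECT m.Id
FROM dbo.MainTable1 AS m
JOIN dbo.MainTale_LessCol1 AS c ON c.Id = m.Id
WHERE CHECKSUM(m.ColA, m.ColB, m.ColC) <> CHECKSUM(c.ColA, c.ColB, c.ColC)
   OR m.LastUpdated > c.LastUpdated;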
We Explored
Triggers
CDC (Change Data Capture)
200+ Process Changes
Since our deadlines were strict, the cumulative changes on those 20 tables were > 1000/sec, and our system was already at peak capacity, our current design works for us.
You can try CDC and share your experience.

bulk insert with or without index

In a comment I read
Just as a side note, it's sometimes faster to drop the indices of your table and recreate them after the bulk insert operation.
Is this true? Under which circumstances?
As with Joel, I will echo the statement that yes, it can be true. I've found that the key to identifying the scenario he mentioned is all in the distribution of the data and the size of the index(es) you have on the specific table.
In an application that I used to support, which did a regular bulk import of 1.8 million rows, there were 4 indexes on the table, one of them with 11 columns, and a total of 90 columns in the table. The import with the indexes in place took over 20 hours to complete. Dropping the indexes, inserting, and re-creating the indexes took only 1 hour and 25 minutes.
So it can be a big help, but a lot of it comes down to your data, the indexes, and the distribution of data values.
Yes, it is true. When there are indexes on the table during an insert, the server will need to be constantly re-ordering/paging the table to keep the indexes up to date. If you drop the indexes, it can just add the rows without worrying about that, and then build the indexes all at once when you re-create them.
The exception, of course, is when the import data is already in index order. In fact, I should note that I'm working on a project right now where this opposite effect was observed. We wanted to reduce the run-time of a large import (nightly dump from a mainframe system). We tried removing the indexes, importing the data, and re-creating them. It actually significantly increased the time for the import to complete. But, this is not typical. It just goes to show that you should always test first for your particular system.
One thing you should consider when dropping and recreating indexes is that it should only be done in automated processes that run during low-volume periods of database use. While an index is dropped, it can't be used by other queries that other users might be running at the same time. If you do this during production hours, your users will probably start complaining of timeouts.
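A minimal T-SQL sketch of the drop / load / rebuild pattern, with made-up table, index, and file names:
DROP INDEX IX_Staging_KeyCol ON dbo.Staging;

BULK INSERT dbo.Staging
FROM 'C:\loads\batch.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

CREATE NONCLUSTERED INDEX IX_Staging_KeyCol ON dbo.Staging (KeyCol);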
